format enforcement, text generation
Ensure specific output format.
395 technical terms and definitions
Ensure specific output format.
Format specifications describe the desired output structure, ensuring parseable responses.
Format verification checks that outputs conform to required patterns.
Predict thermodynamic stability.
Increase speed by lowering threshold.
Forward body bias lowers the threshold voltage, increasing speed at the cost of higher leakage.
Bond to die first.
Forward planning reasons from the current state toward the goal, exploring possible paths.
Start from premises and work forward.
Forward scheduling starts from the earliest possible time and schedules operations forward.
Forward-backward algorithm computes marginal probabilities over sequences for training and inference in sequential models like HMMs.
Distributed systems and practices for training large models.
Contract manufacturing.
Foundry model separates design from manufacturing: foundries fabricate chips for fabless companies.
Company that manufactures chips for fabless companies.
Foundries (TSMC, Samsung, Intel) manufacture chips at different process nodes; capacity is constrained.
Sealed wafer carrier that holds 25 wafers.
Track wafer carriers.
Front Opening Unified Pod protects wafers during storage and transport.
Indication of shift.
Map sheet resistance at many points.
Four-point probe measures sheet resistance using separate current and voltage contacts, eliminating contact resistance from the measurement.
Measure sheet resistance of doped or metal films.
Encode inputs with Fourier basis.
Neural operator using Fourier transforms.
# Computational Lithography Mathematics

Modern semiconductor manufacturing faces a fundamental physical challenge: creating nanoscale features using light with wavelengths much larger than the target dimensions. Computational lithography bridges this gap through sophisticated mathematical techniques.

## 1. The Core Challenge

### 1.1 Resolution Limits

The Rayleigh criterion defines the minimum resolvable feature size:

$$ R = k_1 \cdot \frac{\lambda}{NA} $$

Where:

- $R$ = minimum resolution
- $k_1$ = process-dependent factor (theoretical limit: 0.25)
- $\lambda$ = wavelength of light (193 nm for ArF, 13.5 nm for EUV)
- $NA$ = numerical aperture of the lens system

### 1.2 Depth of Focus

$$ DOF = k_2 \cdot \frac{\lambda}{NA^2} $$

## 2. Wave Optics Fundamentals

### 2.1 Partially Coherent Imaging

The aerial image intensity on the wafer is described by Hopkins' equation:

$$ I(x, y) = \iint TCC(f_1, f_2) \cdot M(f_1) \cdot M^*(f_2) \, df_1 \, df_2 $$

Where:

- $I(x, y)$ = intensity at wafer position $(x, y)$
- $TCC(f_1, f_2)$ = Transmission Cross Coefficient
- $M(f)$ = Fourier transform of the mask pattern
- $M^*(f)$ = complex conjugate of $M(f)$

### 2.2 Transmission Cross Coefficient

The TCC captures the optical system behavior:

$$ TCC(f_1, f_2) = \iint S(\xi, \eta) \cdot H(f_1 + \xi, \eta) \cdot H^*(f_2 + \xi, \eta) \, d\xi \, d\eta $$

Where:

- $S(\xi, \eta)$ = source intensity distribution
- $H(f)$ = pupil function of the projection optics

## 3. Optical Proximity Correction (OPC)

### 3.1 The Inverse Problem

OPC solves the inverse imaging problem:

$$ \min_{M} \sum_{i} \left\| I(x_i, y_i; M) - I_{\text{target}}(x_i, y_i) \right\|^2 + \lambda R(M) $$

Where:

- $M$ = mask pattern (optimization variable)
- $I_{\text{target}}$ = desired wafer pattern
- $R(M)$ = regularization term for manufacturability
- $\lambda$ = regularization weight

### 3.2 Gradient-Based Optimization

The gradient with respect to mask pixels:

$$ \frac{\partial J}{\partial M_k} = \sum_{i} 2 \left( I_i - I_{\text{target},i} \right) \cdot \frac{\partial I_i}{\partial M_k} $$

### 3.3 Key Correction Features

- **Serifs**: Corner additions/subtractions to correct corner rounding
- **Hammerheads**: Line-end extensions to prevent line shortening
- **Assist features**: Sub-resolution features that improve main feature fidelity
- **Scattering bars**: Improve depth of focus for isolated features

## 4. Inverse Lithography Technology (ILT)

### 4.1 Full Pixel-Based Optimization

ILT treats each mask pixel as an independent variable:

$$ \min_{\mathbf{m}} \left\| \mathbf{I}(\mathbf{m}) - \mathbf{I}_{\text{target}} \right\|_2^2 + \alpha \|\nabla \mathbf{m}\|_1 + \beta \, \text{TV}(\mathbf{m}) $$

Where:

- $\mathbf{m} \in [0, 1]^N$ = continuous mask pixel values
- $\text{TV}(\mathbf{m})$ = Total Variation regularization
- $\|\nabla \mathbf{m}\|_1$ = sparsity-promoting term

### 4.2 Level-Set Formulation

Mask boundaries are represented implicitly:

$$ \frac{\partial \phi}{\partial t} = -V \cdot |\nabla \phi| $$

Where:

- $\phi(x, y)$ = level-set function
- Mask region: $\{(x,y) : \phi(x,y) > 0\}$
- $V$ = velocity field derived from the optimization gradient

## 5. Source Mask Optimization (SMO)

### 5.1 Joint Optimization Problem

$$ \min_{S, M} \sum_{i} \left[ I(x_i, y_i; S, M) - I_{\text{target}}(x_i, y_i) \right]^2 $$

Subject to:

- Source constraints: $\int S(\xi, \eta) \, d\xi \, d\eta = 1$, $S \geq 0$
- Mask manufacturability constraints

### 5.2 Alternating Optimization

1. Fix source $S$, optimize mask $M$
2. Fix mask $M$, optimize source $S$
3. Repeat until convergence

## 6. Rigorous Electromagnetic Simulation

### 6.1 Maxwell's Equations

For accurate 3D mask effects:

$$ \nabla \times \mathbf{E} = -\frac{\partial \mathbf{B}}{\partial t} $$

$$ \nabla \times \mathbf{H} = \mathbf{J} + \frac{\partial \mathbf{D}}{\partial t} $$

$$ \nabla \cdot \mathbf{D} = \rho $$

$$ \nabla \cdot \mathbf{B} = 0 $$

### 6.2 Numerical Methods

- **FDTD (Finite-Difference Time-Domain)**:

$$ \frac{\partial E_x}{\partial t} = \frac{1}{\epsilon} \left( \frac{\partial H_z}{\partial y} - \frac{\partial H_y}{\partial z} \right) $$

- **RCWA (Rigorous Coupled-Wave Analysis)**: Expansion in Fourier harmonics

$$ \mathbf{E}(x, y, z) = \sum_{m,n} \mathbf{E}_{mn}(z) \cdot e^{i(k_{xm}x + k_{yn}y)} $$

## 7. Photoresist Modeling

### 7.1 Dill Model for Absorption

$$ I(z) = I_0 \exp\left( -\int_0^z \alpha(z') \, dz' \right) $$

Where the absorption coefficient:

$$ \alpha = A \cdot M + B $$

- $A$ = bleachable absorption
- $B$ = non-bleachable absorption
- $M$ = photoactive compound concentration

### 7.2 Exposure Kinetics

$$ \frac{dM}{dt} = -C \cdot I \cdot M $$

- $C$ = exposure rate constant

### 7.3 Acid Diffusion (Post-Exposure Bake)

Reaction-diffusion equation:

$$ \frac{\partial [H^+]}{\partial t} = D \nabla^2 [H^+] - k_{\text{loss}} [H^+] $$

Where:

- $D$ = diffusion coefficient (temperature-dependent)
- $k_{\text{loss}}$ = acid loss rate

### 7.4 Development Rate

Mack model:

$$ r = r_{\max} \cdot \frac{(a+1)(1-m)^n}{a + (1-m)^n} + r_{\min} $$

Where $m$ = normalized remaining PAC concentration.

## 8. Stochastic Effects

### 8.1 Photon Shot Noise

Photon count follows a Poisson distribution:

$$ P(n) = \frac{\lambda^n e^{-\lambda}}{n!} $$

Standard deviation:

$$ \sigma_n = \sqrt{\bar{n}} $$

### 8.2 Line Edge Roughness (LER)

Power spectral density:

$$ PSD(f) = \frac{A}{1 + (2\pi f \xi)^{2\alpha}} $$

Where:

- $\xi$ = correlation length
- $\alpha$ = roughness exponent
- $A$ = amplitude

### 8.3 Stochastic Defect Probability

For extreme ultraviolet (EUV):

$$ P_{\text{defect}} = 1 - \exp\left( -\frac{A_{\text{pixel}}}{N_{\text{photons}} \cdot \eta} \right) $$

## 9. Multi-Patterning Mathematics

### 9.1 Graph Coloring Formulation

Given conflict graph $G = (V, E)$:

- $V$ = features
- $E$ = edges connecting features with spacing $< \text{min}_{\text{space}}$

Find a $k$-coloring $c: V \rightarrow \{1, 2, \ldots, k\}$ such that:

$$ \forall (u, v) \in E: c(u) \neq c(v) $$

### 9.2 Integer Linear Programming Formulation

$$ \min \sum_{(i,j) \in E} w_{ij} \cdot y_{ij} $$

Subject to:

$$ \sum_{k=1}^{K} x_{ik} = 1 \quad \forall i \in V $$

$$ x_{ik} + x_{jk} - y_{ij} \leq 1 \quad \forall (i,j) \in E, \forall k $$

$$ x_{ik}, y_{ij} \in \{0, 1\} $$

## 10. EUV Lithography Specific Mathematics

### 10.1 Multilayer Mirror Reflectivity

Bragg condition for Mo/Si multilayers:

$$ 2d \sin\theta = n\lambda $$

Reflectivity at each interface:

$$ r = \frac{n_1 - n_2}{n_1 + n_2} $$

Total reflectivity (matrix method):

$$ \mathbf{M}_{\text{total}} = \prod_{j=1}^{N} \mathbf{M}_j $$

### 10.2 Mask 3D Effects

Shadow effect for off-axis illumination:

$$ \Delta x = h_{\text{absorber}} \cdot \tan(\theta_{\text{chief ray}}) $$

## 11. Machine Learning in Computational Lithography

### 11.1 Neural Network as Fast Surrogate Model

$$ I_{\text{predicted}} = f_{\theta}(M) $$

Where $f_{\theta}$ is a trained CNN; training minimizes:

$$ \mathcal{L} = \sum_{i} \left\| f_{\theta}(M_i) - I_{\text{rigorous}}(M_i) \right\|^2 $$

### 11.2 Physics-Informed Neural Networks

Loss function incorporating physics:

$$ \mathcal{L} = \mathcal{L}_{\text{data}} + \lambda_{\text{physics}} \mathcal{L}_{\text{physics}} $$

Where:

$$ \mathcal{L}_{\text{physics}} = \left\| \nabla^2 E + k^2 \epsilon E \right\|^2 $$

## 12. Key Mathematical Techniques Summary

| Technique | Application |
|-----------|-------------|
| Fourier Analysis | Optical imaging, frequency domain calculations |
| Inverse Problems | OPC, ILT, metrology |
| Non-convex Optimization | Mask optimization, SMO |
| Partial Differential Equations | EM simulation, resist diffusion |
| Graph Theory | Multi-patterning decomposition |
| Stochastic Processes | Shot noise, LER modeling |
| Linear Algebra | Large sparse system solutions |
| Machine Learning | Fast surrogate models, pattern recognition |

## 13. Computational Complexity

### 13.1 Full-Chip OPC Scale

- **Features**: $\sim 10^{12}$ polygon edges
- **Variables**: $\sim 10^8$ optimization parameters
- **Compute time**: hours to days on $1000+$ CPU cores
- **Memory**: terabytes of working data

### 13.2 Complexity Classes

| Operation | Complexity |
|-----------|------------|
| FFT for imaging | $O(N \log N)$ |
| RCWA per wavelength | $O(M^3)$ where $M$ = harmonics |
| ILT optimization | $O(N \cdot k)$ where $k$ = iterations |
| Graph coloring | NP-complete (general case) |

Notation:

| Symbol | Meaning |
|--------|---------|
| $\lambda$ | Wavelength |
| $NA$ | Numerical Aperture |
| $TCC$ | Transmission Cross Coefficient |
| $M(f)$ | Mask Fourier transform |
| $I(x,y)$ | Intensity at wafer |
| $\phi$ | Level-set function |
| $D$ | Diffusion coefficient |
| $\sigma$ | Standard deviation |
| $PSD$ | Power Spectral Density |
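The graph-coloring decomposition of Section 9.1 can be prototyped in a few lines. A minimal Python sketch, with hypothetical feature centroids and a greedy heuristic rather than the production ILP of Section 9.2:

```python
from itertools import combinations

def conflict_graph(features, min_space):
    """Edge between every pair of features spaced closer than min_space."""
    edges = set()
    for (i, (x1, y1)), (j, (x2, y2)) in combinations(enumerate(features), 2):
        if ((x1 - x2) ** 2 + (y1 - y2) ** 2) ** 0.5 < min_space:
            edges.add((i, j))
    return edges

def greedy_color(n, edges):
    """Assign each feature the lowest mask index unused by its neighbors."""
    adj = {i: set() for i in range(n)}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    color = {}
    for v in range(n):
        used = {color[u] for u in adj[v] if u in color}
        c = 0
        while c in used:
            c += 1
        color[v] = c
    return color

# Hypothetical feature centroids (nm) and a 40 nm minimum same-mask spacing
features = [(0, 0), (30, 0), (60, 0), (15, 10)]
edges = conflict_graph(features, min_space=40)
coloring = greedy_color(len(features), edges)
```

With these spacings the conflict graph contains a triangle, so three masks are needed; each color would then map to its own exposure.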
Use Fourier features for position.
Frequency domain analysis of signals.
Identify chemical bonds.
Fourier transforms decompose time series into frequency components.
Frequency-domain mixing.
Oscillation pattern.
Field emission through triangular barrier.
Half-precision training.
FP16 is a 16-bit float giving 2x memory savings over FP32; its reduced range and precision can cause accuracy loss, so training typically requires loss scaling.
FP32 is 32-bit float, standard precision. Full accuracy but slow and memory-heavy for inference.
FP8 is 8-bit float, emerging standard. E4M3 and E5M2 formats. NVIDIA H100 supports FP8.
# FPGA Mathematical Modeling

A comprehensive guide to mathematical techniques, representations, and methodologies for implementing algorithms in Field-Programmable Gate Arrays.

## 1. Number Representation Models

### 1.1 Fixed-Point Representation

Fixed-point representation keeps the binary point at a fixed position, enabling straightforward integer-style arithmetic.

#### Notation Format

- **Q-format:** $Q(m, n)$ or $(x, y)$
  - $m$ or $x$ = number of integer bits (the sign bit counted among them for signed formats, as used below)
  - $n$ or $y$ = number of fractional bits
  - Total bits = $m + n$ (some references count the sign bit separately, giving $m + n + 1$)

#### Mathematical Formulations

**Integer bits calculation:**

$$ \text{Integer Bits} = \lceil \log_2(\text{max\_value}) \rceil $$

**Value representation:**

$$ \text{Value} = \sum_{i=-n}^{m-1} b_i \cdot 2^i $$

Where $b_i \in \{0, 1\}$ for unsigned, or using two's complement for signed.

**Resolution (LSB value):**

$$ \text{Resolution} = 2^{-n} $$

**Range for signed Q(m,n):**

$$ \text{Range} = \left[ -2^{m-1}, \, 2^{m-1} - 2^{-n} \right] $$

#### Fixed-Point Arithmetic Rules

- **Addition/Subtraction:**
  - Binary points must be aligned
  - Result: $\max(m_1, m_2) + 1$ integer bits, $\max(n_1, n_2)$ fractional bits

$$ Q(m_1, n_1) \pm Q(m_2, n_2) \rightarrow Q(\max(m_1, m_2) + 1, \max(n_1, n_2)) $$

- **Multiplication:**

$$ Q(m_1, n_1) \times Q(m_2, n_2) \rightarrow Q(m_1 + m_2, n_1 + n_2) $$

- **Scaling by powers of 2:**
  - Left shift by $k$: multiply by $2^k$
  - Right shift by $k$: divide by $2^k$

#### Example Calculation

For a 16-bit signed Q8.8 format:

- Integer range: $[-128, 127]$
- Fractional resolution: $2^{-8} = 0.00390625$
- Example: Binary `0000_0001.1000_0000` = $1.5_{10}$

### 1.2 Floating-Point Representation

#### IEEE 754 Standard Formats

| Format | Total Bits | Sign | Exponent | Mantissa |
|--------|-----------|------|----------|----------|
| Half | 16 | 1 | 5 | 10 |
| Single | 32 | 1 | 8 | 23 |
| Double | 64 | 1 | 11 | 52 |

#### Mathematical Representation

$$ \text{Value} = (-1)^{s} \times (1 + M) \times 2^{E - \text{bias}} $$

Where:

- $s$ = sign bit
- $M$ = mantissa (fractional part)
- $E$ = exponent
- $\text{bias} = 2^{k-1} - 1$ (where $k$ = exponent bits)

#### Bias Values

- Half precision: $\text{bias} = 15$
- Single precision: $\text{bias} = 127$
- Double precision: $\text{bias} = 1023$

#### Resource Comparison

| Operation | Fixed-Point LUTs | Floating-Point LUTs | Ratio |
|-----------|-----------------|---------------------|-------|
| Add (32-bit) | ~32 | ~400 | ~12.5× |
| Multiply (32-bit) | ~64 (DSP) | ~600 | ~9× |
| Divide (32-bit) | ~200 | ~1200 | ~6× |

## 2. Mathematical Function Implementation

### 2.1 CORDIC Algorithm

**CORDIC** (COordinate Rotation DIgital Computer) computes trigonometric, hyperbolic, and other functions using only shift-add operations.

#### Core Rotation Matrix

$$ \begin{bmatrix} x' \\ y' \end{bmatrix} = \begin{bmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{bmatrix} \begin{bmatrix} x \\ y \end{bmatrix} $$

#### CORDIC Simplification

By restricting $\tan(\theta_i) = 2^{-i}$, the rotation becomes:

$$ \begin{bmatrix} x' \\ y' \end{bmatrix} = \cos\theta_i \begin{bmatrix} 1 & -\sigma_i 2^{-i} \\ \sigma_i 2^{-i} & 1 \end{bmatrix} \begin{bmatrix} x \\ y \end{bmatrix} $$

Where $\sigma_i \in \{-1, +1\}$ is the rotation direction.
#### Iterative Equations

$$ \begin{aligned} x_{i+1} &= x_i - \sigma_i \cdot y_i \cdot 2^{-i} \\ y_{i+1} &= y_i + \sigma_i \cdot x_i \cdot 2^{-i} \\ z_{i+1} &= z_i - \sigma_i \cdot \arctan(2^{-i}) \end{aligned} $$

#### CORDIC Gain (Scale Factor)

$$ K_n = \prod_{i=0}^{n-1} \cos(\arctan(2^{-i})) = \prod_{i=0}^{n-1} \frac{1}{\sqrt{1 + 2^{-2i}}} $$

For $n \rightarrow \infty$:

$$ K_\infty \approx 0.6072529350088812561694 $$

Dropping the $\cos\theta_i$ factors from the iterations leaves a fixed gain of $1/K$, which is compensated by pre-scaling the initial vector.

#### CORDIC Modes

| Mode | Condition | Computes |
|------|-----------|----------|
| Rotation | $\sigma_i = \text{sign}(z_i)$ | $x_n = (x_0\cos z_0 - y_0\sin z_0)/K$ |
| Vectoring | $\sigma_i = -\text{sign}(y_i)$ | $z_n = z_0 + \arctan(y_0/x_0)$ |

#### CORDIC Functions Table

| Function | Mode | Initial Values | Result |
|----------|------|----------------|--------|
| $\sin(\theta)$ | Rotation | $x_0=K$, $y_0=0$, $z_0=\theta$ | $y_n$ |
| $\cos(\theta)$ | Rotation | $x_0=K$, $y_0=0$, $z_0=\theta$ | $x_n$ |
| $\arctan(a)$ | Vectoring | $x_0=1$, $y_0=a$, $z_0=0$ | $z_n$ |
| $\sqrt{x^2+y^2}$ | Vectoring | $x_0=x$, $y_0=y$, $z_0=0$ | $K \cdot x_n$ |

#### CORDIC VHDL Pseudo-Structure

```vhdl
-- CORDIC iteration stage
process(clk)
begin
    if rising_edge(clk) then
        if z(i) >= 0 then
            -- σ = +1
            x(i+1) <= x(i) - shift_right(y(i), i);
            y(i+1) <= y(i) + shift_right(x(i), i);
            z(i+1) <= z(i) - atan_table(i);
        else
            -- σ = -1
            x(i+1) <= x(i) + shift_right(y(i), i);
            y(i+1) <= y(i) - shift_right(x(i), i);
            z(i+1) <= z(i) + atan_table(i);
        end if;
    end if;
end process;
```

### 2.2 Polynomial Approximation

#### Taylor Series Expansion

$$ f(x) = \sum_{n=0}^{N} \frac{f^{(n)}(a)}{n!}(x-a)^n + R_N(x) $$

Where $R_N(x)$ is the remainder term.
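The CORDIC iterations are easy to validate in floating point before committing to a fixed-point design. A Python reference sketch of rotation mode (function name and iteration count are illustrative; the initial $x$ is pre-scaled so the iteration gain cancels):

```python
import math

def cordic_sincos(theta, n=32):
    """Rotation-mode CORDIC: returns (cos(theta), sin(theta)) for theta
    within the convergence range of roughly +/-1.74 rad."""
    k = 1.0
    for i in range(n):
        k *= 1.0 / math.sqrt(1.0 + 2.0 ** (-2 * i))   # K_n from the gain formula
    x, y, z = k, 0.0, theta    # pre-scale by K so the 1/K iteration gain cancels
    for i in range(n):
        sigma = 1.0 if z >= 0.0 else -1.0             # drive residual angle to 0
        x, y, z = (x - sigma * y * 2.0 ** -i,
                   y + sigma * x * 2.0 ** -i,
                   z - sigma * math.atan(2.0 ** -i))
    return x, y

c, s = cordic_sincos(0.5)
```

Switching the decision to $\sigma_i = -\text{sign}(y_i)$ yields vectoring mode; in hardware the `2.0 ** -i` multiplies become arithmetic right shifts.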
#### Common Taylor Expansions

**Sine function:**

$$ \sin(x) = x - \frac{x^3}{3!} + \frac{x^5}{5!} - \frac{x^7}{7!} + \cdots = \sum_{n=0}^{\infty} \frac{(-1)^n x^{2n+1}}{(2n+1)!} $$

**Cosine function:**

$$ \cos(x) = 1 - \frac{x^2}{2!} + \frac{x^4}{4!} - \frac{x^6}{6!} + \cdots = \sum_{n=0}^{\infty} \frac{(-1)^n x^{2n}}{(2n)!} $$

**Exponential function:**

$$ e^x = 1 + x + \frac{x^2}{2!} + \frac{x^3}{3!} + \cdots = \sum_{n=0}^{\infty} \frac{x^n}{n!} $$

**Natural logarithm (for $|x| < 1$):**

$$ \ln(1+x) = x - \frac{x^2}{2} + \frac{x^3}{3} - \frac{x^4}{4} + \cdots = \sum_{n=1}^{\infty} \frac{(-1)^{n+1} x^n}{n} $$

#### Horner's Method for Evaluation

For polynomial $P(x) = a_0 + a_1 x + a_2 x^2 + \cdots + a_n x^n$:

$$ P(x) = a_0 + x(a_1 + x(a_2 + x(a_3 + \cdots + x(a_{n-1} + x \cdot a_n)\cdots))) $$

**Advantages:**

- Reduces multiplications from $\frac{n(n+1)}{2}$ (naive power evaluation) to $n$
- Additions stay at $n$
- Enables pipelining

#### Minimax Polynomial Approximation

Minimizes the maximum error over an interval:

$$ \min_{p \in P_n} \max_{x \in [a,b]} |f(x) - p(x)| $$

Where $P_n$ is the set of polynomials of degree $\leq n$.
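Horner's method translates directly into a software reference for a hardware multiply-accumulate pipeline. A Python sketch evaluating the degree-5 Taylor sine from the expansions above (names illustrative):

```python
def horner(coeffs, x):
    """Evaluate a0 + a1*x + ... + an*x^n with n multiplies and n adds
    (coefficients given lowest order first)."""
    acc = 0.0
    for a in reversed(coeffs):
        acc = acc * x + a      # one MAC per coefficient
    return acc

# Degree-5 Taylor approximation of sin(x): x - x^3/3! + x^5/5!
sin_coeffs = [0.0, 1.0, 0.0, -1.0 / 6.0, 0.0, 1.0 / 120.0]
approx = horner(sin_coeffs, 0.3)
```

Each loop iteration is one multiply-add, so an FPGA implementation maps to a chain (or pipeline) of $n$ MAC stages.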
**Chebyshev approximation** often provides near-minimax results:

$$ T_n(x) = \cos(n \cdot \arccos(x)) $$

### 2.3 Lookup Table (LUT) Methods

#### Direct Table Lookup

$$ f(x) \approx \text{LUT}[\lfloor x \cdot 2^n \rfloor] $$

**Memory requirement:**

$$ \text{Memory (bits)} = 2^{\text{input\_bits}} \times \text{output\_bits} $$

#### Linear Interpolation

$$ f(x) \approx f(x_0) + \frac{f(x_1) - f(x_0)}{x_1 - x_0} \cdot (x - x_0) $$

Simplified for uniform spacing ($\Delta x = x_1 - x_0$):

$$ f(x) \approx f(x_0) + \frac{\Delta f}{\Delta x} \cdot (x - x_0) $$

#### Bipartite Table Method

Decomposes the function into two smaller tables:

$$ f(x) \approx T_1(x_{\text{MSB}}) + T_2(x_{\text{MSB}}, x_{\text{LSB}}) $$

**Memory savings:**

$$ \text{Bipartite Memory} \approx 2 \cdot 2^{n/2} \ll 2^n \text{ (Direct)} $$

#### Piecewise Polynomial with LUT

$$ f(x) = \sum_{k=0}^{d} c_k[i] \cdot (x - x_i)^k \quad \text{for } x \in [x_i, x_{i+1}) $$

Where coefficients $c_k[i]$ are stored in lookup tables.

## 3. System-Level Mathematical Modeling

### 3.1 Boolean Algebra Model

#### Basic Operations

| Operation | Symbol | VHDL | Verilog |
|-----------|--------|------|---------|
| AND | $A \cdot B$ | `A and B` | `A & B` |
| OR | $A + B$ | `A or B` | `A ∣ B` |
| NOT | $\overline{A}$ | `not A` | `~A` |
| XOR | $A \oplus B$ | `A xor B` | `A ^ B` |
| NAND | $\overline{A \cdot B}$ | `A nand B` | `~(A & B)` |
| NOR | $\overline{A + B}$ | `A nor B` | `~(A ∣ B)` |

#### Boolean Algebra Laws

**Identity:**

$$ A + 0 = A \quad \text{and} \quad A \cdot 1 = A $$

**Complement:**

$$ A + \overline{A} = 1 \quad \text{and} \quad A \cdot \overline{A} = 0 $$

**De Morgan's Theorems:**

$$ \overline{A + B} = \overline{A} \cdot \overline{B} \quad \text{and} \quad \overline{A \cdot B} = \overline{A} + \overline{B} $$

**Absorption:**

$$ A + A \cdot B = A \quad \text{and} \quad A \cdot (A + B) = A $$

#### Sum of Products (SOP) Form

$$ F = \sum_{i} m_i \quad \text{where } m_i \text{ are minterms} $$

**Example:**

$$ F(A, B, C) = \overline{A} \cdot B \cdot C + A \cdot \overline{B} \cdot C + A \cdot B \cdot \overline{C} $$

### 3.2 Finite State Machine (FSM) Models

#### Moore Machine

$$ \begin{aligned} \text{Next State:} \quad S_{n+1} &= \delta(S_n, I_n) \\ \text{Output:} \quad O_n &= \lambda(S_n) \end{aligned} $$

#### Mealy Machine

$$ \begin{aligned} \text{Next State:} \quad S_{n+1} &= \delta(S_n, I_n) \\ \text{Output:} \quad O_n &= \lambda(S_n, I_n) \end{aligned} $$

#### State Encoding

| Encoding | States | Flip-Flops | Transitions |
|----------|--------|------------|-------------|
| Binary | $N$ | $\lceil \log_2 N \rceil$ | Complex |
| One-Hot | $N$ | $N$ | Simple |
| Gray | $N$ | $\lceil \log_2 N \rceil$ | Low power |

### 3.3 Dataflow Graph Model

#### Directed Acyclic Graph (DAG) Representation

- **Nodes ($V$):** Operations (add, multiply, etc.)
- **Edges ($E$):** Data dependencies
- **Graph:** $G = (V, E)$

#### Critical Path Length

$$ T_{\text{critical}} = \max_{\text{paths } p} \sum_{v \in p} t_v $$

Where $t_v$ is the delay of operation $v$.

#### Parallelism Metrics

**Available parallelism:**

$$ P = \frac{\sum_{v \in V} t_v}{T_{\text{critical}}} $$

**Resource-constrained schedule length:**

$$ T_{\text{schedule}} \geq \max\left( T_{\text{critical}}, \left\lceil \frac{\sum_{v \in V} t_v}{R} \right\rceil \right) $$

Where $R$ is the number of available resources.
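The critical-path and parallelism metrics of Section 3.3 reduce to a longest-path computation over the DAG. A Python sketch on a hypothetical 4-tap dot-product graph (node names and delays invented for illustration):

```python
# Dataflow DAG: node -> (delay_ns, predecessor nodes)
# Four parallel multiplies feeding a two-level adder tree
graph = {
    "m0": (3.0, []), "m1": (3.0, []), "m2": (3.0, []), "m3": (3.0, []),
    "a0": (1.0, ["m0", "m1"]), "a1": (1.0, ["m2", "m3"]),
    "a2": (1.0, ["a0", "a1"]),
}

def finish_time(node):
    """Longest-path arrival time at the output of `node` (DAG only)."""
    delay, preds = graph[node]
    return delay + max((finish_time(p) for p in preds), default=0.0)

t_critical = max(finish_time(v) for v in graph)   # T_critical
total_work = sum(d for d, _ in graph.values())    # sum of all t_v
parallelism = total_work / t_critical             # P = work / critical path
```

Here the span is 5 ns (multiply plus two adder levels) against 15 ns of total work, so three units' worth of operations can proceed concurrently on average.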
### 3.4 State-Space Models

#### Continuous-Time State-Space

$$ \begin{aligned} \dot{\mathbf{x}}(t) &= \mathbf{A}\mathbf{x}(t) + \mathbf{B}\mathbf{u}(t) \\ \mathbf{y}(t) &= \mathbf{C}\mathbf{x}(t) + \mathbf{D}\mathbf{u}(t) \end{aligned} $$

#### Discrete-Time State-Space (for FPGA)

$$ \begin{aligned} \mathbf{x}[k+1] &= \mathbf{A}_d\mathbf{x}[k] + \mathbf{B}_d\mathbf{u}[k] \\ \mathbf{y}[k] &= \mathbf{C}_d\mathbf{x}[k] + \mathbf{D}_d\mathbf{u}[k] \end{aligned} $$

#### Discretization (Zero-Order Hold)

$$ \mathbf{A}_d = e^{\mathbf{A}T_s} \quad \text{and} \quad \mathbf{B}_d = \mathbf{A}^{-1}(e^{\mathbf{A}T_s} - \mathbf{I})\mathbf{B} $$

Where $T_s$ is the sampling period.

## 4. Numerical Methods for FPGA

### 4.1 ODE Solvers

#### Forward Euler Method

$$ y_{n+1} = y_n + h \cdot f(t_n, y_n) $$

**Local truncation error:** $O(h^2)$
**Global error:** $O(h)$

#### Backward Euler Method

$$ y_{n+1} = y_n + h \cdot f(t_{n+1}, y_{n+1}) $$

Requires solving an implicit equation (more stable but more complex).

#### Trapezoidal Method (Crank-Nicolson)

$$ y_{n+1} = y_n + \frac{h}{2}\left[f(t_n, y_n) + f(t_{n+1}, y_{n+1})\right] $$

**Local truncation error:** $O(h^3)$

#### 4th Order Runge-Kutta (RK4)

$$ y_{n+1} = y_n + \frac{h}{6}(k_1 + 2k_2 + 2k_3 + k_4) $$

Where:

$$ \begin{aligned} k_1 &= f(t_n, y_n) \\ k_2 &= f\left(t_n + \frac{h}{2}, y_n + \frac{h}{2}k_1\right) \\ k_3 &= f\left(t_n + \frac{h}{2}, y_n + \frac{h}{2}k_2\right) \\ k_4 &= f(t_n + h, y_n + h \cdot k_3) \end{aligned} $$

**Local truncation error:** $O(h^5)$
**Global error:** $O(h^4)$

#### Method Comparison for FPGA

| Method | Accuracy | Stability | FPGA Resources | Latency |
|--------|----------|-----------|----------------|---------|
| Forward Euler | $O(h)$ | Conditional | Low | 1 mult |
| Backward Euler | $O(h)$ | Unconditional | Medium | Iterative |
| Trapezoidal | $O(h^2)$ | A-stable | Medium | Iterative |
| RK4 | $O(h^4)$ | Conditional | High | 4 stages |

### 4.2 Linear Algebra Operations

#### Matrix-Vector Multiplication

$$ \mathbf{y} = \mathbf{A}\mathbf{x} \quad \Rightarrow \quad y_i = \sum_{j=1}^{n} a_{ij} x_j $$

**FPGA Implementation:**

- Fully parallel: $n^2$ multipliers, $O(n)$ adder tree
- Systolic array: $n$ multipliers, $n$ cycles

#### Matrix-Matrix Multiplication

$$ \mathbf{C} = \mathbf{A}\mathbf{B} \quad \Rightarrow \quad c_{ij} = \sum_{k=1}^{n} a_{ik} b_{kj} $$

**Computational complexity:** $O(n^3)$ multiplications

#### LU Decomposition

$$ \mathbf{A} = \mathbf{L}\mathbf{U} $$

**FPGA considerations:**

- Division required for pivot elements
- Sequential dependencies limit parallelism

## 5. Timing and Resource Models

### 5.1 Static Timing Analysis (STA)

#### Setup Time Constraint

$$ T_{\text{clk-to-q}} + T_{\text{logic}} + T_{\text{routing}} + T_{\text{setup}} \leq T_{\text{clk}} $$

Rearranged:

$$ T_{\text{logic}} + T_{\text{routing}} \leq T_{\text{clk}} - T_{\text{clk-to-q}} - T_{\text{setup}} $$

#### Hold Time Constraint

$$ T_{\text{clk-to-q}} + T_{\text{logic}} + T_{\text{routing}} \geq T_{\text{hold}} $$

#### Slack Calculations

**Setup slack:**

$$ \text{Slack}_{\text{setup}} = T_{\text{clk}} - T_{\text{clk-to-q}} - T_{\text{setup}} - T_{\text{data\_path}} $$

**Hold slack:**

$$ \text{Slack}_{\text{hold}} = T_{\text{data\_path}} + T_{\text{clk-to-q}} - T_{\text{hold}} $$

**Condition for valid timing:**

$$ \text{Slack} \geq 0 $$

#### Maximum Operating Frequency

$$ f_{\text{max}} = \frac{1}{T_{\text{clk-to-q}} + T_{\text{logic}} + T_{\text{routing}} + T_{\text{setup}}} $$

### 5.2 Resource Utilization Models

#### LUT Utilization

For an $n$-input boolean function mapped onto $k$-input LUTs (where $k$ is typically 4-6), a tree decomposition needs roughly

$$ \text{LUTs required} \approx \left\lceil \frac{n-1}{k-1} \right\rceil $$

arranged in about $\lceil \log_k n \rceil$ logic levels, since each LUT reduces $k$ signals to one.
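The setup-slack and $f_{\text{max}}$ relations of Section 5.1 are simple enough to script during design-space exploration. A Python sketch with hypothetical path delays:

```python
def setup_slack(t_clk, t_clk_to_q, t_logic, t_routing, t_setup):
    """Setup slack in ns: positive means the path meets timing."""
    return t_clk - (t_clk_to_q + t_logic + t_routing + t_setup)

def f_max_mhz(t_clk_to_q, t_logic, t_routing, t_setup):
    """Maximum clock frequency in MHz for one register-to-register path."""
    return 1000.0 / (t_clk_to_q + t_logic + t_routing + t_setup)

# Hypothetical path at a 200 MHz (5 ns) target clock
slack = setup_slack(t_clk=5.0, t_clk_to_q=0.5, t_logic=2.0,
                    t_routing=1.0, t_setup=0.5)
fmax = f_max_mhz(0.5, 2.0, 1.0, 0.5)
```

The 4 ns total path delay leaves 1 ns of positive slack at 200 MHz, equivalently a 250 MHz ceiling for this path.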
#### DSP Slice Utilization

For multiplication $A \times B$:

$$ \text{DSP slices} = \left\lceil \frac{\text{bitwidth}_A}{18} \right\rceil \times \left\lceil \frac{\text{bitwidth}_B}{18} \right\rceil $$

(For typical 18×18 DSP multipliers)

#### Block RAM Utilization

$$ \text{BRAMs} = \left\lceil \frac{\text{depth}}{2^{k}} \right\rceil \times \left\lceil \frac{\text{width}}{w} \right\rceil $$

Where $k$ and $w$ are the BRAM address and data widths.

#### Register Utilization

$$ \text{Registers} = \sum_{\text{signals}} \text{bitwidth} \times \text{pipeline\_stages} $$

### 5.3 Power Models

#### Dynamic Power

$$ P_{\text{dynamic}} = \alpha \cdot C_L \cdot V_{DD}^2 \cdot f $$

Where:

- $\alpha$ = switching activity factor
- $C_L$ = load capacitance
- $V_{DD}$ = supply voltage
- $f$ = clock frequency

#### Static Power

$$ P_{\text{static}} = I_{\text{leakage}} \cdot V_{DD} $$

#### Total Power

$$ P_{\text{total}} = P_{\text{dynamic}} + P_{\text{static}} $$

## 6. High-Level Synthesis Models

### 6.1 Quantization Models

#### Fixed-Point Quantization

$$ x_q = \text{round}\left(\frac{x}{2^{-n}}\right) \cdot 2^{-n} $$

Where $n$ is the number of fractional bits.

#### Quantization Error

$$ \epsilon_q = x - x_q $$

**For rounding:**

$$ -\frac{2^{-n}}{2} \leq \epsilon_q < \frac{2^{-n}}{2} $$

**Quantization noise power (uniform distribution):**

$$ \sigma_q^2 = \frac{(2^{-n})^2}{12} = \frac{2^{-2n}}{12} $$

#### Signal-to-Quantization-Noise Ratio (SQNR)

$$ \text{SQNR (dB)} = 10 \log_{10}\left(\frac{\sigma_x^2}{\sigma_q^2}\right) \approx 6.02n + 1.76 \text{ dB} $$

For a full-scale sinusoidal signal with $n$ fractional bits.

### 6.2 Scheduling Models

#### ASAP (As Soon As Possible) Scheduling

$$ t_{\text{ASAP}}(v) = \max_{u \in \text{pred}(v)} \left( t_{\text{ASAP}}(u) + d_u \right) $$

Where $d_u$ is the delay of operation $u$.
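The quantization model of Section 6.1 can be checked numerically: the rounding error stays within half an LSB, and the SQNR rule of thumb follows directly from the bit count. A Python sketch (names illustrative):

```python
def quantize(x, n_frac):
    """Round x onto the fixed-point grid with n_frac fractional bits."""
    step = 2.0 ** -n_frac
    return round(x / step) * step

def sqnr_db(n_frac):
    """Rule-of-thumb SQNR for a full-scale sine with n_frac fractional bits."""
    return 6.02 * n_frac + 1.76

x = 0.7071067811865476    # sqrt(2)/2, a value not on the Q-grid
xq = quantize(x, 8)       # 8 fractional bits -> LSB = 2**-8
err = abs(x - xq)         # bounded by half an LSB, i.e. 2**-9
```

Sweeping `n_frac` during bit-width exploration shows the expected ~6 dB of SQNR gained per additional fractional bit.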
#### ALAP (As Late As Possible) Scheduling

$$ t_{\text{ALAP}}(v) = \min_{w \in \text{succ}(v)} \left( t_{\text{ALAP}}(w) - d_v \right) $$

#### Mobility (Slack)

$$ \text{Mobility}(v) = t_{\text{ALAP}}(v) - t_{\text{ASAP}}(v) $$

Operations with zero mobility are on the critical path.

### 6.3 Resource Binding Models

#### Resource Sharing Constraint

$$ \sum_{v: t(v) = t} r(v, k) \leq R_k \quad \forall t, k $$

Where:

- $r(v, k)$ = 1 if operation $v$ uses resource type $k$
- $R_k$ = number of available resources of type $k$

#### Conflict Graph

Two operations conflict if they:

1. Execute in the same time step
2. Require the same resource type

**Minimum resources = chromatic number of the conflict graph**

## 7. Optimization Models

### 7.1 Pipeline Optimization

#### Pipeline Stages

$$ \text{Stages} = \left\lceil \frac{T_{\text{combinational}}}{T_{\text{target\_clock}}} \right\rceil $$

#### Throughput

$$ \text{Throughput} = \frac{f_{\text{clk}}}{\text{II}} $$

Where II = Initiation Interval (cycles between new inputs).

#### Latency

$$ \text{Latency} = \text{Stages} \times T_{\text{clk}} $$

#### Pipeline Efficiency

$$ \eta = \frac{\text{Actual Throughput}}{\text{Ideal Throughput}} = \frac{1}{\text{II}} $$

### 7.2 Loop Optimization

#### Loop Unrolling

For unroll factor $U$:

$$ \text{New iterations} = \left\lceil \frac{N}{U} \right\rceil $$

**Resource increase:** ~$U \times$ original
**Throughput increase:** up to $U \times$

#### Loop Pipelining

$$ \text{II}_{\text{achieved}} = \max(\text{II}_{\text{resource}}, \text{II}_{\text{recurrence}}, \text{II}_{\text{target}}) $$

Where:

- $\text{II}_{\text{resource}}$ = resource-constrained II
- $\text{II}_{\text{recurrence}}$ = loop-carried dependency II

#### Loop Tiling

For tile size $T$:

$$ \text{for } (ii = 0; ii < N; ii += T) \\ \quad \text{for } (i = ii; i < \min(ii + T, N); i++) $$

**Memory accesses reduced by a factor of** $T$ for reused data.
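The loop-pipelining relations above combine into a quick throughput estimate. A Python sketch with an invented loop scenario (two multiplies sharing one DSP, a one-cycle accumulation recurrence):

```python
def achieved_ii(ii_resource, ii_recurrence, ii_target=1):
    """Initiation interval actually achieved by a pipelined loop:
    the max of the resource, recurrence, and target constraints."""
    return max(ii_resource, ii_recurrence, ii_target)

def throughput_msamples(f_clk_mhz, ii):
    """New inputs accepted per microsecond: f_clk / II."""
    return f_clk_mhz / ii

# Hypothetical loop body: two multiplies share one DSP (II_resource = 2)
# while the loop-carried accumulation closes in one cycle (II_recurrence = 1)
ii = achieved_ii(ii_resource=2, ii_recurrence=1)
rate = throughput_msamples(200.0, ii)   # at a 200 MHz clock
```

The resource constraint dominates here, halving throughput; adding a second DSP (or unrolling by 2) would restore II = 1.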
### 7.3 Area-Time Trade-offs

#### Area-Time Product

$$ \text{AT} = \text{Area} \times \text{Time} $$

#### Pareto Optimality

A design $D_1$ dominates $D_2$ if:

$$ \text{Area}(D_1) \leq \text{Area}(D_2) \quad \text{AND} \quad \text{Time}(D_1) \leq \text{Time}(D_2) $$

with at least one strict inequality.

#### Cost Function

$$ \text{Cost} = \alpha \cdot \text{Area} + \beta \cdot \text{Latency} + \gamma \cdot \text{Power} $$

Where $\alpha + \beta + \gamma = 1$ are weighting factors.

## 8. Practical Examples

### 8.1 FIR Filter Implementation

#### Mathematical Definition

$$ y[n] = \sum_{k=0}^{N-1} h[k] \cdot x[n-k] $$

#### Direct Form Implementation

**Resources:**

- Multipliers: $N$
- Adders: $N-1$
- Registers: $N-1$ (delay line)

**Accumulator sizing:**

$$ \text{Acc bits} = \text{input bits} + \text{coef bits} + \lceil \log_2(N) \rceil $$

#### Transpose Form (Better for FPGA)

**Advantages:**

- All multipliers fed by the same input
- Shorter critical path
- Better for high-speed designs

### 8.2 FFT Implementation

#### Radix-2 DIT FFT

$$ X[k] = \sum_{n=0}^{N-1} x[n] \cdot W_N^{nk} $$

Where $W_N = e^{-j2\pi/N}$ (twiddle factor).
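A bit-true software model is a common first step before committing to the direct- or transpose-form hardware of Section 8.1. A Python sketch of the FIR definition (taps and input invented; an impulse input recovers the taps):

```python
def fir_filter(h, x):
    """Direct-form FIR reference: y[n] = sum_k h[k] * x[n-k], zero initial state."""
    y = []
    for n in range(len(x)):
        acc = 0.0
        for k, hk in enumerate(h):
            if n - k >= 0:          # delay line is empty before the first sample
                acc += hk * x[n - k]
        y.append(acc)
    return y

h = [0.25, 0.5, 0.25]                      # hypothetical 3-tap smoothing filter
y = fir_filter(h, [1.0, 0.0, 0.0, 0.0])    # impulse input
```

In hardware the inner sum becomes $N$ parallel multipliers plus an adder chain (direct form) or a retimed adder cascade (transpose form); the software model provides the golden vectors for verifying either.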
#### Butterfly Operation

$$ \begin{aligned} X[k] &= A + W_N^k \cdot B \\ X[k + N/2] &= A - W_N^k \cdot B \end{aligned} $$

#### Resource Requirements

| Architecture | Multipliers | Memory | Throughput |
|--------------|-------------|--------|------------|
| Radix-2 Serial | 1 complex | $N$ | $N \log_2 N$ cycles |
| Radix-2 Parallel | $N/2$ complex | $N$ | $\log_2 N$ cycles |
| Pipelined | $\log_2 N$ complex | $2N$ | 1 sample/cycle |

### 8.3 PID Controller

#### Continuous-Time PID

$$ u(t) = K_p \cdot e(t) + K_i \int_0^t e(\tau) d\tau + K_d \frac{de(t)}{dt} $$

#### Discrete-Time PID (Tustin Transform)

$$ u[n] = u[n-1] + a_0 \cdot e[n] + a_1 \cdot e[n-1] + a_2 \cdot e[n-2] $$

Where:

$$ \begin{aligned} a_0 &= K_p + K_i \frac{T_s}{2} + K_d \frac{2}{T_s} \\ a_1 &= -K_p + K_i \frac{T_s}{2} - K_d \frac{4}{T_s} \\ a_2 &= K_d \frac{2}{T_s} \end{aligned} $$

#### FPGA Implementation Considerations

- **Coefficient quantization:** affects stability
- **Accumulator overflow:** requires saturation logic
- **Anti-windup:** needed for the integrator term

## Common Mathematical Constants for FPGA

| Constant | Value | Q16.16 Hex |
|----------|-------|------------|
| $\pi$ | 3.14159265 | `0x0003_243F` |
| $e$ | 2.71828183 | `0x0002_B7E1` |
| $\sqrt{2}$ | 1.41421356 | `0x0001_6A09` |
| $\ln(2)$ | 0.69314718 | `0x0000_B172` |
| $1/\pi$ | 0.31830989 | `0x0000_517C` |
| CORDIC $K$ | 0.60725293 | `0x0000_9B74` |

## FPGA Vendor DSP Specifications

| Vendor | Family | DSP Name | Multiplier Size | Pre-adder |
|--------|--------|----------|-----------------|-----------|
| AMD/Xilinx | 7 Series | DSP48E1 | 25×18 | Yes |
| AMD/Xilinx | UltraScale | DSP48E2 | 27×18 | Yes |
| Intel | Stratix 10 | DSP Block | 18×19 or 27×27 | Yes |
| Lattice | ECP5 | MULT18X18D | 18×18 | No |
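The discrete PID recurrence of Section 8.3 can be sanity-checked in software: with a constant error, the proportional and derivative contributions cancel between steps, leaving only the integral increment $K_i T_s$ per update. A Python sketch (gains and sample time hypothetical):

```python
def pid_coeffs(kp, ki, kd, ts):
    """Difference-equation coefficients a0, a1, a2 from the formulas above."""
    a0 = kp + ki * ts / 2.0 + kd * 2.0 / ts
    a1 = -kp + ki * ts / 2.0 - kd * 4.0 / ts
    a2 = kd * 2.0 / ts
    return a0, a1, a2

def pid_step(u_prev, e0, e1, e2, coeffs):
    """One update of u[n] = u[n-1] + a0*e[n] + a1*e[n-1] + a2*e[n-2]."""
    a0, a1, a2 = coeffs
    return u_prev + a0 * e0 + a1 * e1 + a2 * e2

coeffs = pid_coeffs(kp=1.0, ki=0.5, kd=0.1, ts=0.01)
# Steady error of 1.0: output grows by Ki*Ts = 0.005 per update
u = pid_step(0.0, 1.0, 1.0, 1.0, coeffs)
```

A fixed-point port of this model is where the coefficient-quantization and accumulator-overflow concerns listed above get evaluated against golden floating-point vectors.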
# FPGA Mathematical Modeling A comprehensive guide to mathematical techniques, representations, and methodologies for implementing algorithms in Field-Programmable Gate Arrays. ## 1. Number Representation Models ### 1.1 Fixed-Point Representation Fixed-point representation maintains the decimal point within a fixed position, enabling straightforward arithmetic operations. #### Notation Format - **Q-format:** $Q(m, n)$ or $(x, y)$ - $m$ or $x$ = number of integer bits - $n$ or $y$ = number of fractional bits - Total bits = $m + n + 1$ (including sign bit for signed) #### Mathematical Formulations **Integer bits calculation:** $$ \text{Integer Bits} = \lceil \log_2(\text{max\_value}) \rceil $$ **Value representation:** $$ \text{Value} = \sum_{i=-n}^{m-1} b_i \cdot 2^i $$ Where $b_i \in \{0, 1\}$ for unsigned, or using two's complement for signed. **Resolution (LSB value):** $$ \text{Resolution} = 2^{-n} $$ **Range for signed Q(m,n):** $$ \text{Range} = \left[ -2^{m-1}, \, 2^{m-1} - 2^{-n} \right] $$ #### Fixed-Point Arithmetic Rules - **Addition/Subtraction:** - Decimal points must be aligned - Result: $\max(m_1, m_2) + 1$ integer bits, $\max(n_1, n_2)$ fractional bits $$ Q(m_1, n_1) \pm Q(m_2, n_2) \rightarrow Q(\max(m_1, m_2) + 1, \max(n_1, n_2)) $$ - **Multiplication:** $$ Q(m_1, n_1) \times Q(m_2, n_2) \rightarrow Q(m_1 + m_2, n_1 + n_2) $$ - **Scaling by powers of 2:** - Left shift by $k$: multiply by $2^k$ - Right shift by $k$: divide by $2^k$ #### Example Calculation For a 16-bit signed Q8.8 format: - Integer range: $[-128, 127]$ - Fractional resolution: $2^{-8} = 0.00390625$ - Example: Binary `0000_0001.1000_0000` = $1.5_{10}$ ### 1.2 Floating-Point Representation #### IEEE 754 Standard Formats | Format | Total Bits | Sign | Exponent | Mantissa | |--------|-----------|------|----------|----------| | Half | 16 | 1 | 5 | 10 | | Single | 32 | 1 | 8 | 23 | | Double | 64 | 1 | 11 | 52 | #### Mathematical Representation $$ \text{Value} = (-1)^{s} \times (1 + M) 
\times 2^{E - \text{bias}} $$ Where: - $s$ = sign bit - $M$ = mantissa (fractional part) - $E$ = exponent - $\text{bias} = 2^{k-1} - 1$ (where $k$ = exponent bits) #### Bias Values - Half precision: $\text{bias} = 15$ - Single precision: $\text{bias} = 127$ - Double precision: $\text{bias} = 1023$ #### Resource Comparison | Operation | Fixed-Point LUTs | Floating-Point LUTs | Ratio | |-----------|-----------------|---------------------|-------| | Add (32-bit) | ~32 | ~400 | ~12.5× | | Multiply (32-bit) | ~64 (DSP) | ~600 | ~9× | | Divide (32-bit) | ~200 | ~1200 | ~6× | ## 2. Mathematical Function Implementation ### 2.1 CORDIC Algorithm **CORDIC** (COordinate Rotation DIgital Computer) computes trigonometric, hyperbolic, and other functions using only shift-add operations. #### Core Rotation Matrix $$ \begin{bmatrix} x' \\ y' \end{bmatrix} = \begin{bmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{bmatrix} \begin{bmatrix} x \\ y \end{bmatrix} $$ #### CORDIC Simplification By restricting $\tan(\theta_i) = 2^{-i}$, the rotation becomes: $$ \begin{bmatrix} x' \\ y' \end{bmatrix} = \cos\theta_i \begin{bmatrix} 1 & -\sigma_i 2^{-i} \\ \sigma_i 2^{-i} & 1 \end{bmatrix} \begin{bmatrix} x \\ y \end{bmatrix} $$ Where $\sigma_i \in \{-1, +1\}$ is the rotation direction. 
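The shift-add micro-rotations implied by this simplification are usually prototyped as a software reference model before committing to HDL. A minimal Python sketch (floating-point for clarity; a hardware version would use fixed-point and a precomputed arctan table):

```python
import math

def cordic_sin_cos(angle, iterations=16):
    """Rotation-mode CORDIC using only micro-rotations tan(theta_i) = 2**-i.
    Valid for |angle| <= ~1.74 rad (the sum of all arctan(2**-i))."""
    # Gain K = prod cos(atan(2^-i)) = prod 1/sqrt(1 + 2^-2i) ~ 0.6073
    K = 1.0
    for i in range(iterations):
        K /= math.sqrt(1.0 + 2.0 ** (-2 * i))
    x, y, z = K, 0.0, angle  # pre-scale by K so the rotation gain cancels
    for i in range(iterations):
        sigma = 1.0 if z >= 0.0 else -1.0
        x, y = (x - sigma * y * 2.0 ** -i,
                y + sigma * x * 2.0 ** -i)
        z -= sigma * math.atan(2.0 ** -i)
    return x, y  # approximately (cos(angle), sin(angle))

c, s = cordic_sin_cos(0.5)
```

With 16 iterations the result is accurate to roughly $2^{-16}$, mirroring the rule of thumb of about one output bit per iteration.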
#### Iterative Equations

$$ \begin{aligned} x_{i+1} &= x_i - \sigma_i \cdot y_i \cdot 2^{-i} \\ y_{i+1} &= y_i + \sigma_i \cdot x_i \cdot 2^{-i} \\ z_{i+1} &= z_i - \sigma_i \cdot \arctan(2^{-i}) \end{aligned} $$

#### CORDIC Gain (Scale Factor)

$$ K_n = \prod_{i=0}^{n-1} \cos(\arctan(2^{-i})) = \prod_{i=0}^{n-1} \frac{1}{\sqrt{1 + 2^{-2i}}} $$

For $n \rightarrow \infty$:

$$ K_\infty \approx 0.6072529350088812561694 $$

Because the iterative equations omit the $\cos\theta_i$ factor, each stage stretches the vector by $\sqrt{1 + 2^{-2i}}$, so after $n$ stages the result is scaled by $1/K_n \approx 1.6468$.

#### CORDIC Modes

| Mode | Condition | Computes |
|------|-----------|----------|
| Rotation | $\sigma_i = \text{sign}(z_i)$ | $x_n = \frac{1}{K}(x_0\cos z_0 - y_0\sin z_0)$ |
| Vectoring | $\sigma_i = -\text{sign}(y_i)$ | $z_n = z_0 + \arctan(y_0/x_0)$ |

#### CORDIC Functions Table

| Function | Mode | Initial Values | Result |
|----------|------|----------------|--------|
| $\sin(\theta)$ | Rotation | $x_0=K$, $y_0=0$, $z_0=\theta$ | $y_n$ |
| $\cos(\theta)$ | Rotation | $x_0=K$, $y_0=0$, $z_0=\theta$ | $x_n$ |
| $\arctan(a)$ | Vectoring | $x_0=1$, $y_0=a$, $z_0=0$ | $z_n$ |
| $\sqrt{x^2+y^2}$ | Vectoring | $x_0=x$, $y_0=y$, $z_0=0$ | $K \cdot x_n$ |

#### CORDIC VHDL Pseudo-Structure

```vhdl
-- CORDIC iteration stage
process(clk)
begin
    if rising_edge(clk) then
        if z(i) >= 0 then  -- sigma = +1
            x(i+1) <= x(i) - shift_right(y(i), i);
            y(i+1) <= y(i) + shift_right(x(i), i);
            z(i+1) <= z(i) - atan_table(i);
        else               -- sigma = -1
            x(i+1) <= x(i) + shift_right(y(i), i);
            y(i+1) <= y(i) - shift_right(x(i), i);
            z(i+1) <= z(i) + atan_table(i);
        end if;
    end if;
end process;
```

### 2.2 Polynomial Approximation

#### Taylor Series Expansion

$$ f(x) = \sum_{n=0}^{N} \frac{f^{(n)}(a)}{n!}(x-a)^n + R_N(x) $$

Where $R_N(x)$ is the remainder term.
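Before fixing a polynomial degree in hardware, the truncation error is typically measured numerically over the working interval. A Python sketch for a degree-7 sine truncation, evaluated in nested (Horner) form, over an assumed range-reduced interval $[-\pi/4, \pi/4]$:

```python
import math

def sin_taylor7(x):
    """Degree-7 truncation of the sine series, in nested form:
    x * (1 + x^2*(-1/6 + x^2*(1/120 - x^2/5040)))."""
    x2 = x * x
    return x * (1.0 + x2 * (-1.0 / 6.0 + x2 * (1.0 / 120.0 - x2 / 5040.0)))

# Worst-case truncation error over [-pi/4, pi/4]; the first omitted
# term bounds it at roughly (pi/4)^9 / 9! ~ 3e-7.
worst = max(abs(sin_taylor7(-math.pi / 4 + k * (math.pi / 2) / 1000)
                - math.sin(-math.pi / 4 + k * (math.pi / 2) / 1000))
            for k in range(1001))
```

A worst-case error near $3 \times 10^{-7}$ means about 21 fractional bits of accuracy, which then drives the fixed-point word-length choice.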
#### Common Taylor Expansions

**Sine function:**

$$ \sin(x) = x - \frac{x^3}{3!} + \frac{x^5}{5!} - \frac{x^7}{7!} + \cdots = \sum_{n=0}^{\infty} \frac{(-1)^n x^{2n+1}}{(2n+1)!} $$

**Cosine function:**

$$ \cos(x) = 1 - \frac{x^2}{2!} + \frac{x^4}{4!} - \frac{x^6}{6!} + \cdots = \sum_{n=0}^{\infty} \frac{(-1)^n x^{2n}}{(2n)!} $$

**Exponential function:**

$$ e^x = 1 + x + \frac{x^2}{2!} + \frac{x^3}{3!} + \cdots = \sum_{n=0}^{\infty} \frac{x^n}{n!} $$

**Natural logarithm (for $|x| < 1$):**

$$ \ln(1+x) = x - \frac{x^2}{2} + \frac{x^3}{3} - \frac{x^4}{4} + \cdots = \sum_{n=1}^{\infty} \frac{(-1)^{n+1} x^n}{n} $$

#### Horner's Method for Evaluation

For polynomial $P(x) = a_0 + a_1 x + a_2 x^2 + \cdots + a_n x^n$:

$$ P(x) = a_0 + x(a_1 + x(a_2 + x(a_3 + \cdots + x(a_{n-1} + x \cdot a_n)\cdots))) $$

**Advantages:**

- Reduces multiplications from up to $\frac{n(n+1)}{2}$ (evaluating each power term separately) to $n$
- Keeps additions at $n$ while avoiding explicit power computations
- Enables pipelining

#### Minimax Polynomial Approximation

Minimizes the maximum error over an interval:

$$ \min_{p \in P_n} \max_{x \in [a,b]} |f(x) - p(x)| $$

Where $P_n$ is the set of polynomials of degree $\leq n$.
**Chebyshev approximation** often provides near-minimax results: $$ T_n(x) = \cos(n \cdot \arccos(x)) $$ ### 2.3 Lookup Table (LUT) Methods #### Direct Table Lookup $$ f(x) \approx \text{LUT}[\lfloor x \cdot 2^n \rfloor] $$ **Memory requirement:** $$ \text{Memory (bits)} = 2^{\text{input\_bits}} \times \text{output\_bits} $$ #### Linear Interpolation $$ f(x) \approx f(x_0) + \frac{f(x_1) - f(x_0)}{x_1 - x_0} \cdot (x - x_0) $$ Simplified for uniform spacing ($\Delta x = x_1 - x_0$): $$ f(x) \approx f(x_0) + \frac{\Delta f}{\Delta x} \cdot (x - x_0) $$ #### Bipartite Table Method Decomposes the function into two smaller tables: $$ f(x) \approx T_1(x_{\text{MSB}}) + T_2(x_{\text{MSB}}, x_{\text{LSB}}) $$ **Memory savings:** $$ \text{Bipartite Memory} \approx 2 \cdot 2^{n/2} \ll 2^n \text{ (Direct)} $$ #### Piecewise Polynomial with LUT $$ f(x) = \sum_{k=0}^{d} c_k[i] \cdot (x - x_i)^k \quad \text{for } x \in [x_i, x_{i+1}) $$ Where coefficients $c_k[i]$ are stored in lookup tables. ## 3. System-Level Mathematical Modeling ### 3.1 Boolean Algebra Model #### Basic Operations | Operation | Symbol | VHDL | Verilog | |-----------|--------|------|---------| | AND | $A \cdot B$ | `A and B` | `A & B` | | OR | $A + B$ | `A or B` | `A ∣ B` | | NOT | $\overline{A}$ | `not A` | `~A` | | XOR | $A \oplus B$ | `A xor B` | `A ^ B` | | NAND | $\overline{A \cdot B}$ | `A nand B` | `~(A & B)` | | NOR | $\overline{A + B}$ | `A nor B` | `~(A ∣ B)` | #### Boolean Algebra Laws **Identity:** $$ A + 0 = A \quad \text{and} \quad A \cdot 1 = A $$ **Complement:** $$ A + \overline{A} = 1 \quad \text{and} \quad A \cdot \overline{A} = 0 $$ **De Morgan's Theorems:** $$ \overline{A + B} = \overline{A} \cdot \overline{B} \quad \text{and} \quad \overline{A \cdot B} = \overline{A} + \overline{B} $$ **Absorption:** $$ A + A \cdot B = A \quad \text{and} \quad A \cdot (A + B) = A $$ #### Sum of Products (SOP) Form $$ F = \sum_{i} m_i \quad \text{where } m_i \text{ are minterms} $$ **Example:** $$ F(A, B, 
C) = \overline{A} \cdot B \cdot C + A \cdot \overline{B} \cdot C + A \cdot B \cdot \overline{C} $$ ### 3.2 Finite State Machine (FSM) Models #### Moore Machine $$ \begin{aligned} \text{Next State:} \quad S_{n+1} &= \delta(S_n, I_n) \\ \text{Output:} \quad O_n &= \lambda(S_n) \end{aligned} $$ #### Mealy Machine $$ \begin{aligned} \text{Next State:} \quad S_{n+1} &= \delta(S_n, I_n) \\ \text{Output:} \quad O_n &= \lambda(S_n, I_n) \end{aligned} $$ #### State Encoding | Encoding | States | Flip-Flops | Transitions | |----------|--------|------------|-------------| | Binary | $N$ | $\lceil \log_2 N \rceil$ | Complex | | One-Hot | $N$ | $N$ | Simple | | Gray | $N$ | $\lceil \log_2 N \rceil$ | Low power | ### 3.3 Dataflow Graph Model #### Directed Acyclic Graph (DAG) Representation - **Nodes ($V$):** Operations (add, multiply, etc.) - **Edges ($E$):** Data dependencies - **Graph:** $G = (V, E)$ #### Critical Path Length $$ T_{\text{critical}} = \max_{\text{paths } p} \sum_{v \in p} t_v $$ Where $t_v$ is the delay of operation $v$. #### Parallelism Metrics **Available parallelism:** $$ P = \frac{\sum_{v \in V} t_v}{T_{\text{critical}}} $$ **Resource-constrained schedule length:** $$ T_{\text{schedule}} \geq \max\left( T_{\text{critical}}, \left\lceil \frac{\sum_{v \in V} t_v}{R} \right\rceil \right) $$ Where $R$ is the number of available resources. 
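The critical-path and parallelism definitions above can be evaluated directly on a small dataflow graph. A Python sketch using a hypothetical multiply-accumulate DAG (node delays in cycles are illustrative):

```python
def critical_path_length(delays, edges):
    """T_critical = longest delay-weighted path through a DAG.
    delays: {node: delay}; edges: iterable of (pred, succ) pairs."""
    succ = {v: [] for v in delays}
    indegree = dict.fromkeys(delays, 0)
    for u, v in edges:
        succ[u].append(v)
        indegree[v] += 1
    start = dict.fromkeys(delays, 0)  # earliest start time per node
    ready = [v for v in delays if indegree[v] == 0]
    finish_max = 0
    while ready:                      # topological traversal
        u = ready.pop()
        finish = start[u] + delays[u]
        finish_max = max(finish_max, finish)
        for v in succ[u]:
            start[v] = max(start[v], finish)
            indegree[v] -= 1
            if indegree[v] == 0:
                ready.append(v)
    return finish_max

# Two 3-cycle multiplies feeding a 1-cycle add:
delays = {"mul0": 3, "mul1": 3, "add": 1}
edges = [("mul0", "add"), ("mul1", "add")]
t_crit = critical_path_length(delays, edges)   # 4 cycles
parallelism = sum(delays.values()) / t_crit    # P = 7/4 = 1.75
```

Here $P = 1.75$ says that, on average, fewer than two operators can be kept busy, so allocating more than two parallel units would be wasted area.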
### 3.4 State-Space Models #### Continuous-Time State-Space $$ \begin{aligned} \dot{\mathbf{x}}(t) &= \mathbf{A}\mathbf{x}(t) + \mathbf{B}\mathbf{u}(t) \\ \mathbf{y}(t) &= \mathbf{C}\mathbf{x}(t) + \mathbf{D}\mathbf{u}(t) \end{aligned} $$ #### Discrete-Time State-Space (for FPGA) $$ \begin{aligned} \mathbf{x}[k+1] &= \mathbf{A}_d\mathbf{x}[k] + \mathbf{B}_d\mathbf{u}[k] \\ \mathbf{y}[k] &= \mathbf{C}_d\mathbf{x}[k] + \mathbf{D}_d\mathbf{u}[k] \end{aligned} $$ #### Discretization (Zero-Order Hold) $$ \mathbf{A}_d = e^{\mathbf{A}T_s} \quad \text{and} \quad \mathbf{B}_d = \mathbf{A}^{-1}(e^{\mathbf{A}T_s} - \mathbf{I})\mathbf{B} $$ Where $T_s$ is the sampling period. ## 4. Numerical Methods for FPGA ### 4.1 ODE Solvers #### Forward Euler Method $$ y_{n+1} = y_n + h \cdot f(t_n, y_n) $$ **Local truncation error:** $O(h^2)$ **Global error:** $O(h)$ #### Backward Euler Method $$ y_{n+1} = y_n + h \cdot f(t_{n+1}, y_{n+1}) $$ Requires solving implicit equation (more stable but complex). #### Trapezoidal Method (Crank-Nicolson) $$ y_{n+1} = y_n + \frac{h}{2}\left[f(t_n, y_n) + f(t_{n+1}, y_{n+1})\right] $$ **Local truncation error:** $O(h^3)$ #### 4th Order Runge-Kutta (RK4) $$ y_{n+1} = y_n + \frac{h}{6}(k_1 + 2k_2 + 2k_3 + k_4) $$ Where: $$ \begin{aligned} k_1 &= f(t_n, y_n) \\ k_2 &= f\left(t_n + \frac{h}{2}, y_n + \frac{h}{2}k_1\right) \\ k_3 &= f\left(t_n + \frac{h}{2}, y_n + \frac{h}{2}k_2\right) \\ k_4 &= f(t_n + h, y_n + h \cdot k_3) \end{aligned} $$ **Local truncation error:** $O(h^5)$ **Global error:** $O(h^4)$ #### Method Comparison for FPGA | Method | Accuracy | Stability | FPGA Resources | Latency | |--------|----------|-----------|----------------|---------| | Forward Euler | $O(h)$ | Conditional | Low | 1 mult | | Backward Euler | $O(h)$ | Unconditional | Medium | Iterative | | Trapezoidal | $O(h^2)$ | A-stable | Medium | Iterative | | RK4 | $O(h^4)$ | Conditional | High | 4 stages | ### 4.2 Linear Algebra Operations #### Matrix-Vector Multiplication $$ 
\mathbf{y} = \mathbf{A}\mathbf{x} \quad \Rightarrow \quad y_i = \sum_{j=1}^{n} a_{ij} x_j $$

**FPGA Implementation:**

- Fully parallel: $n^2$ multipliers, one adder tree per output row ($O(\log n)$ depth)
- Systolic array: $n$ multipliers, $n$ cycles

#### Matrix-Matrix Multiplication

$$ \mathbf{C} = \mathbf{A}\mathbf{B} \quad \Rightarrow \quad c_{ij} = \sum_{k=1}^{n} a_{ik} b_{kj} $$

**Computational complexity:** $O(n^3)$ multiplications

#### LU Decomposition

$$ \mathbf{A} = \mathbf{L}\mathbf{U} $$

**FPGA considerations:**

- Division required for pivot elements
- Sequential dependencies limit parallelism

## 5. Timing and Resource Models

### 5.1 Static Timing Analysis (STA)

#### Setup Time Constraint

$$ T_{\text{clk-to-q}} + T_{\text{logic}} + T_{\text{routing}} + T_{\text{setup}} \leq T_{\text{clk}} $$

Rearranged:

$$ T_{\text{logic}} + T_{\text{routing}} \leq T_{\text{clk}} - T_{\text{clk-to-q}} - T_{\text{setup}} $$

#### Hold Time Constraint

$$ T_{\text{clk-to-q}} + T_{\text{logic}} + T_{\text{routing}} \geq T_{\text{hold}} $$

#### Slack Calculations

**Setup slack:**

$$ \text{Slack}_{\text{setup}} = T_{\text{clk}} - T_{\text{clk-to-q}} - T_{\text{setup}} - T_{\text{data\_path}} $$

**Hold slack:**

$$ \text{Slack}_{\text{hold}} = T_{\text{data\_path}} + T_{\text{clk-to-q}} - T_{\text{hold}} $$

**Condition for valid timing:**

$$ \text{Slack} \geq 0 $$

#### Maximum Operating Frequency

$$ f_{\text{max}} = \frac{1}{T_{\text{clk-to-q}} + T_{\text{logic}} + T_{\text{routing}} + T_{\text{setup}}} $$

### 5.2 Resource Utilization Models

#### LUT Utilization

For an $n$-input Boolean function mapped to $k$-input LUTs (typically $k$ = 4–6), a tree decomposition needs about

$$ \text{LUTs required} \approx \left\lceil \frac{n-1}{k-1} \right\rceil $$

in roughly $\lceil \log_k n \rceil$ logic levels. This is an estimate for tree-decomposable functions (wide AND/OR/XOR); arbitrary functions can require far more.
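The setup-slack and $f_{\text{max}}$ relations in §5.1 are plain arithmetic and easy to sanity-check in software. A Python sketch (all delay values below are illustrative, in ns):

```python
def setup_slack(t_clk, t_clk_to_q, t_setup, t_data_path):
    """Setup slack in ns; timing passes when slack >= 0."""
    return t_clk - t_clk_to_q - t_setup - t_data_path

def f_max_mhz(t_clk_to_q, t_logic, t_routing, t_setup):
    """Maximum clock frequency in MHz for delays given in ns."""
    return 1000.0 / (t_clk_to_q + t_logic + t_routing + t_setup)

# Illustrative numbers: 10 ns clock period, 0.5 ns clk-to-q,
# 0.4 ns setup, 7 ns of logic + routing on the worst path:
slack = setup_slack(10.0, 0.5, 0.4, 7.0)   # 2.1 ns of margin
fmax = f_max_mhz(0.5, 5.0, 2.0, 0.4)       # ~126.6 MHz
```

Positive slack (2.1 ns here) means the period could be tightened; the same path's intrinsic limit is the $f_{\text{max}}$ value.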
#### DSP Slice Utilization

For multiplication $A \times B$:

$$ \text{DSP slices} = \left\lceil \frac{\text{bitwidth}_A}{18} \right\rceil \times \left\lceil \frac{\text{bitwidth}_B}{18} \right\rceil $$

(For typical 18×18 DSP multipliers)

#### Block RAM Utilization

$$ \text{BRAMs} = \left\lceil \frac{\text{depth}}{2^{k}} \right\rceil \times \left\lceil \frac{\text{width}}{w} \right\rceil $$

Where $k$ and $w$ are BRAM address and data widths.

#### Register Utilization

$$ \text{Registers} = \sum_{\text{signals}} \text{bitwidth} \times \text{pipeline\_stages} $$

### 5.3 Power Models

#### Dynamic Power

$$ P_{\text{dynamic}} = \alpha \cdot C_L \cdot V_{DD}^2 \cdot f $$

Where:

- $\alpha$ = switching activity factor
- $C_L$ = load capacitance
- $V_{DD}$ = supply voltage
- $f$ = clock frequency

#### Static Power

$$ P_{\text{static}} = I_{\text{leakage}} \cdot V_{DD} $$

#### Total Power

$$ P_{\text{total}} = P_{\text{dynamic}} + P_{\text{static}} $$

## 6. High-Level Synthesis Models

### 6.1 Quantization Models

#### Fixed-Point Quantization

$$ x_q = \text{round}\left(\frac{x}{2^{-n}}\right) \cdot 2^{-n} $$

Where $n$ is the number of fractional bits.

#### Quantization Error

$$ \epsilon_q = x - x_q $$

**For rounding:**

$$ -\frac{2^{-n}}{2} \leq \epsilon_q < \frac{2^{-n}}{2} $$

**Quantization noise power (uniform distribution):**

$$ \sigma_q^2 = \frac{(2^{-n})^2}{12} = \frac{2^{-2n}}{12} $$

#### Signal-to-Quantization-Noise Ratio (SQNR)

$$ \text{SQNR (dB)} = 10 \log_{10}\left(\frac{\sigma_x^2}{\sigma_q^2}\right) \approx 6.02N + 1.76 \text{ dB} $$

For a full-scale sinusoid in an $N$-bit word; for a signed $Q1.n$ signal spanning $[-1, 1)$, $N = n + 1$.

### 6.2 Scheduling Models

#### ASAP (As Soon As Possible) Scheduling

$$ t_{\text{ASAP}}(v) = \max_{u \in \text{pred}(v)} \left( t_{\text{ASAP}}(u) + d_u \right) $$

Where $d_u$ is the delay of operation $u$.
#### ALAP (As Late As Possible) Scheduling

$$ t_{\text{ALAP}}(v) = \min_{w \in \text{succ}(v)} \left( t_{\text{ALAP}}(w) - d_v \right) $$

#### Mobility (Slack)

$$ \text{Mobility}(v) = t_{\text{ALAP}}(v) - t_{\text{ASAP}}(v) $$

Operations with zero mobility are on the critical path.

### 6.3 Resource Binding Models

#### Resource Sharing Constraint

$$ \sum_{v: t(v) = t} r(v, k) \leq R_k \quad \forall t, k $$

Where:

- $r(v, k)$ = 1 if operation $v$ uses resource type $k$
- $R_k$ = number of available resources of type $k$

#### Conflict Graph

Two operations conflict if they:

1. Execute in the same time step
2. Require the same resource type

**Minimum resources = Chromatic number of conflict graph**

## 7. Optimization Models

### 7.1 Pipeline Optimization

#### Pipeline Stages

$$ \text{Stages} = \left\lceil \frac{T_{\text{combinational}}}{T_{\text{target\_clock}}} \right\rceil $$

#### Throughput

$$ \text{Throughput} = \frac{f_{\text{clk}}}{\text{II}} $$

Where II = Initiation Interval (cycles between new inputs).

#### Latency

$$ \text{Latency} = \text{Stages} \times T_{\text{clk}} $$

#### Pipeline Efficiency

$$ \eta = \frac{\text{Actual Throughput}}{\text{Ideal Throughput}} = \frac{1}{\text{II}} $$

### 7.2 Loop Optimization

#### Loop Unrolling

For unroll factor $U$:

$$ \text{New iterations} = \left\lceil \frac{N}{U} \right\rceil $$

**Resource increase:** ~$U \times$ original
**Throughput increase:** up to $U \times$

#### Loop Pipelining

$$ \text{II}_{\text{achieved}} = \max(\text{II}_{\text{resource}}, \text{II}_{\text{recurrence}}, \text{II}_{\text{target}}) $$

Where:

- $\text{II}_{\text{resource}}$ = resource-constrained II
- $\text{II}_{\text{recurrence}}$ = loop-carried dependency II

#### Loop Tiling

For tile size $T$:

```c
for (ii = 0; ii < N; ii += T)
    for (i = ii; i < min(ii + T, N); i++)
```

**Memory accesses reduced by factor of** $T$ for reused data.
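The throughput, latency, and initiation-interval relations of §7.1–7.2 combine into a quick what-if calculator. A Python sketch, where the cycle-count model `stages + (n_iters - 1) * II` (first result after the pipeline fills, then one result per II cycles) and all loop parameters are illustrative assumptions:

```python
def achieved_ii(ii_resource, ii_recurrence, ii_target=1):
    """Achieved II is limited by the worst of the three constraints."""
    return max(ii_resource, ii_recurrence, ii_target)

def loop_metrics(n_iters, ii, stages, f_clk_hz):
    """Total cycles and wall-clock time for a pipelined loop:
    pipeline fill (stages) plus one new result every II cycles."""
    cycles = stages + (n_iters - 1) * ii
    return cycles, cycles / f_clk_hz

# A loop-carried dependency forces II = 2 even with ample resources:
ii = achieved_ii(ii_resource=1, ii_recurrence=2)                  # 2
cycles, t = loop_metrics(n_iters=1024, ii=ii, stages=8,
                         f_clk_hz=200e6)
# cycles = 8 + 1023 * 2 = 2054
```

Seeing the recurrence dominate II is the cue to restructure the loop body (e.g., shorten the feedback path) rather than add more operators.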
### 7.3 Area-Time Trade-offs #### Area-Time Product $$ \text{AT} = \text{Area} \times \text{Time} $$ #### Pareto Optimality A design $D_1$ dominates $D_2$ if: $$ \text{Area}(D_1) \leq \text{Area}(D_2) \quad \text{AND} \quad \text{Time}(D_1) \leq \text{Time}(D_2) $$ with at least one strict inequality. #### Cost Function $$ \text{Cost} = \alpha \cdot \text{Area} + \beta \cdot \text{Latency} + \gamma \cdot \text{Power} $$ Where $\alpha + \beta + \gamma = 1$ are weighting factors. ## 8. Practical Examples ### 8.1 FIR Filter Implementation #### Mathematical Definition $$ y[n] = \sum_{k=0}^{N-1} h[k] \cdot x[n-k] $$ #### Direct Form Implementation **Resources:** - Multipliers: $N$ - Adders: $N-1$ - Registers: $N-1$ (delay line) **Accumulator sizing:** $$ \text{Acc bits} = \text{input bits} + \text{coef bits} + \lceil \log_2(N) \rceil $$ #### Transpose Form (Better for FPGA) **Advantages:** - All multipliers fed by same input - Shorter critical path - Better for high-speed designs ### 8.2 FFT Implementation #### Radix-2 DIT FFT $$ X[k] = \sum_{n=0}^{N-1} x[n] \cdot W_N^{nk} $$ Where $W_N = e^{-j2\pi/N}$ (twiddle factor). 
#### Butterfly Operation $$ \begin{aligned} X[k] &= A + W_N^k \cdot B \\ X[k + N/2] &= A - W_N^k \cdot B \end{aligned} $$ #### Resource Requirements | Architecture | Multipliers | Memory | Throughput | |--------------|-------------|--------|------------| | Radix-2 Serial | 1 complex | $N$ | $N \log_2 N$ cycles | | Radix-2 Parallel | $N/2$ complex | $N$ | $\log_2 N$ cycles | | Pipelined | $\log_2 N$ complex | $2N$ | 1 sample/cycle | ### 8.3 PID Controller #### Continuous-Time PID $$ u(t) = K_p \cdot e(t) + K_i \int_0^t e(\tau) d\tau + K_d \frac{de(t)}{dt} $$ #### Discrete-Time PID (Tustin Transform) $$ u[n] = u[n-1] + a_0 \cdot e[n] + a_1 \cdot e[n-1] + a_2 \cdot e[n-2] $$ Where: $$ \begin{aligned} a_0 &= K_p + K_i \frac{T_s}{2} + K_d \frac{2}{T_s} \\ a_1 &= -K_p + K_i \frac{T_s}{2} - K_d \frac{4}{T_s} \\ a_2 &= K_d \frac{2}{T_s} \end{aligned} $$ #### FPGA Implementation Considerations - **Coefficient quantization:** affects stability - **Accumulator overflow:** requires saturation logic - **Anti-windup:** needed for integrator term ## Common Mathematical Constants for FPGA | Constant | Value | Q16.16 Hex | |----------|-------|------------| | $\pi$ | 3.14159265 | `0x0003_243F` | | $e$ | 2.71828183 | `0x0002_B7E1` | | $\sqrt{2}$ | 1.41421356 | `0x0001_6A09` | | $\ln(2)$ | 0.69314718 | `0x0000_B172` | | $1/\pi$ | 0.31830989 | `0x0000_517C` | | CORDIC $K$ | 0.60725293 | `0x0000_9B74` | ## FPGA Vendor DSP Specifications | Vendor | Family | DSP Name | Multiplier Size | Pre-adder | |--------|--------|----------|-----------------|-----------| | AMD/Xilinx | 7 Series | DSP48E1 | 25×18 | Yes | | AMD/Xilinx | UltraScale | DSP48E2 | 27×18 | Yes | | Intel | Stratix 10 | DSP Block | 18×19 or 27×27 | Yes | | Lattice | ECP5 | MULT18X18D | 18×18 | No |
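The Q16.16 entries in the constants table can be regenerated (or extended with new constants) by a short conversion helper; note the table values truncate toward zero rather than round:

```python
import math

def q16_16_hex(x):
    """Non-negative value -> Q16.16 hex string, truncating toward zero
    (matching the table above; use rounding if half-LSB bias matters)."""
    w = int(x * 65536)  # 16 fractional bits
    return f"0x{(w >> 16) & 0xFFFF:04X}_{w & 0xFFFF:04X}"

pi_hex = q16_16_hex(math.pi)      # "0x0003_243F"
rt2_hex = q16_16_hex(math.sqrt(2))
```

The same helper doubles as a check on hand-entered ROM initialization values.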
Field-programmable gate arrays customized for ML.
FPGA = Field-Programmable Gate Array. Reconfigurable hardware. Good for prototyping and low-volume. Less efficient than ASIC.
Characterize complexity of rough surfaces.
Test subset of combinations to save time.
Design broken into write fields.
Fragment-based voice conversion decomposes speech into phonetic posteriorgrams and prosody for better control.
Frame interpolation generates intermediate frames between existing frames.
Generate intermediate frames.
Generate in-between frames to increase video frame rate or create slow motion.
Determine correct temporal order.
SEP licensing terms.