LLM training parameters explained
Quick overview of the parameters used for LoRA fine-tuning of LLMs with MLX. Two illustrative sketches of how these parameters enter a training step follow the table.
| Parameter | Description | Example values |
| --- | --- | --- |
| weight_decay | A regularization technique that adds a small penalty to the weights during training to keep them from growing too large, which helps reduce overfitting. Often implemented as L2 regularization or decoupled weight decay. | 0.00001 – 0.01 |
| grad_clip | Short for gradient clipping: limits (clips) the size of gradients during backpropagation to prevent exploding gradients and stabilize training. | 0.1 – 1.0 |
| rank | The rank (inner dimension) of the low-rank adapter matrices. A higher rank gives the adapter more capacity to approximate weight updates, at the cost of more trainable parameters. | 4, 8, 16, or 32 |
| scale | A multiplier applied to the adapter's output, controlling how strongly the LoRA update contributes relative to the frozen base weights. | 0.5 – 2.0 |
| dropout | A regularization method that randomly "drops out" (sets to zero) a fraction of activations during training, forcing the network to learn more robust, generalizable patterns. | 0.1 – 0.5 |
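
The sketch below shows how rank, scale, and dropout interact inside a LoRA-style adapter layer. It is a minimal, framework-agnostic illustration in plain NumPy, not MLX's actual implementation; the class and argument names are made up for the example.

```python
import numpy as np

class LoRALinear:
    """Illustrative LoRA adapter on top of a frozen linear weight (hypothetical API)."""

    def __init__(self, in_features, out_features, rank=8, scale=1.0, dropout=0.1, seed=0):
        self.rng = np.random.default_rng(seed)
        # Frozen pretrained weight (stands in for the base model's weight).
        self.W = self.rng.standard_normal((out_features, in_features)) * 0.02
        # Low-rank adapter: only A and B are trained during LoRA fine-tuning.
        # `rank` sets the inner dimension, i.e. the adapter's capacity.
        self.A = self.rng.standard_normal((rank, in_features)) * 0.01
        self.B = np.zeros((out_features, rank))  # zero init so the adapter starts as a no-op
        self.scale = scale      # multiplier on the adapter's contribution
        self.dropout = dropout  # fraction of adapter inputs zeroed during training

    def forward(self, x, training=True):
        # Base projection with the frozen weight.
        y = x @ self.W.T
        # Dropout is applied to the adapter path only.
        x_adapter = x
        if training and self.dropout > 0:
            mask = self.rng.random(x.shape) >= self.dropout
            x_adapter = x * mask / (1.0 - self.dropout)
        # Low-rank update: (B @ A) has rank at most `rank`; `scale` controls its magnitude.
        y += self.scale * (x_adapter @ self.A.T @ self.B.T)
        return y

layer = LoRALinear(in_features=64, out_features=64, rank=8, scale=1.0, dropout=0.1)
out = layer.forward(np.ones((2, 64)))
print(out.shape)  # (2, 64)
```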
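
weight_decay and grad_clip belong to the optimizer rather than the adapter. The sketch below shows where they typically enter a training step, using decoupled weight decay and global-norm gradient clipping; it is a generic illustration with made-up helper names, not MLX's exact update rule.

```python
import numpy as np

def clip_by_global_norm(grads, max_norm):
    # Scale all gradients down together if their combined L2 norm exceeds max_norm.
    total_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    if total_norm > max_norm:
        grads = [g * (max_norm / (total_norm + 1e-6)) for g in grads]
    return grads

def sgd_step(params, grads, lr=1e-4, weight_decay=0.01, grad_clip=1.0):
    grads = clip_by_global_norm(grads, grad_clip)
    new_params = []
    for p, g in zip(params, grads):
        # Decoupled weight decay: shrink each weight slightly every step,
        # independently of the gradient, to discourage large weights.
        new_params.append(p - lr * g - lr * weight_decay * p)
    return new_params

params = [np.ones((4, 4)), np.ones(4)]
grads = [np.full((4, 4), 10.0), np.full(4, 10.0)]
params = sgd_step(params, grads, lr=1e-3, weight_decay=0.01, grad_clip=1.0)
```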