Training LLMs on a Mac Studio using the MLX framework

I have run over 500 training sessions using Qwen2.5, Qwen3, Gemma and plenty of other publicly available LLMs to inject domain-specific knowledge into the models via low-rank adapters (LoRA). However, instead of giving you tons of unimportant facts, I will just stick to the most important things. Starting with the fact that I have used MLX on my Mac Studio M2 Ultra as well as on a MacBook Pro M1 Pro. Both fit this task well in terms of BF16 speed as well as unified memory capacity and bandwidth (up to 800 GB/s).

Memory bandwidth is the most important factor when comparing GPU hardware within similar generations of process technology. That is why the M2/M3 Ultra, with its higher memory bandwidth, beats the M4, which has lower overall memory bandwidth.
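To see why bandwidth dominates, here is a back-of-envelope estimate (illustrative numbers, not a benchmark): during token generation every weight has to be streamed through memory once per token, so decode throughput is roughly bounded by bandwidth divided by model size.

```python
# Rough decode-throughput ceiling: every generated token streams the whole
# model through memory once, so tokens/sec <= bandwidth / model size.
# The numbers below are illustrative, not measured.

bandwidth_gb_s = 800                 # M2 Ultra unified memory bandwidth
model_gb = 14e9 * 2 / 1e9            # 14B parameters in BF16 (2 bytes each) = 28 GB

ceiling_tok_s = bandwidth_gb_s / model_gb
print(round(ceiling_tok_s, 1))       # roughly a 28.6 tokens/sec ceiling
```

Real throughput lands below this ceiling, but the proportionality explains why a higher-bandwidth Ultra chip wins.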

LoRA and MLX

What is LoRA? With this type of training you take a large model and train only a small fraction of its parameters, typically 0.5–1%, which in most models gives us between roughly 100k and 50M parameters available for training. What is MLX? It is Apple’s array computation framework, which accelerates machine learning tasks on Apple Silicon.
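As a sketch of where those numbers come from (the dimensions below are hypothetical, chosen only for illustration): LoRA freezes a weight matrix W and trains two small matrices A (rank × d_in) and B (d_out × rank), so each adapted layer adds only rank · (d_in + d_out) trainable parameters.

```python
# Trainable parameters added by one LoRA-adapted linear layer:
# A is (rank x d_in), B is (d_out x rank), while W itself stays frozen.

def lora_params(d_in: int, d_out: int, rank: int) -> int:
    return rank * (d_in + d_out)

# Hypothetical square 4096-dim projection at rank 24:
per_matrix = lora_params(4096, 4096, rank=24)

# Adapting 7 projections in each of 24 layers (illustrative count):
total = per_matrix * 7 * 24
print(per_matrix, total)  # 196608 per matrix, ~33M in total
```

A few tens of millions of trainable parameters out of billions is exactly the "small part" fraction mentioned above.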

How do MLX and LoRA relate to other frameworks on other hardware? MLX uses a slightly different weights organization and a different way of achieving the same thing as other frameworks do, but with an Apple Silicon speed-up. Running modern, powerful NVIDIA RTX-based training is pricey in terms of both purchase and power consumption, and it is much more affordable to do this on a Mac Studio with, let’s say, 64 GB of RAM. Please note that for ML (generally speaking, GPU-related) tasks you get about 75% of your RAM capacity, so on a 64 GB Mac Studio I get around 45–46 GB available. Now go online and look for RTXs with a similar amount of VRAM 😉

Configuration

So…

Here you have a sample training configuration using Qwen2.5, a rather big model at 14B parameters, pre-trained for Instruct-type usage and storing weights in BF16, which runs up to 50% faster than comparable 16-bit float or even 8-bit weights. I have “only” 64 GB and 32 GB of memory respectively, so I use a lower batch_size and a higher gradient_accumulation, which effectively gives me a 4 × 8 = 32 batch size.

data: "data"
adapter_path: "adapters"

train: true
fine_tune_type: lora
optimizer: adamw
seed: 0
val_batches: 50
max_seq_length: 1024
grad_checkpoint: true
steps_per_report: 10
steps_per_eval: 50
save_every: 50

model: "mlx-community/Qwen2.5-14B-Instruct-bf16"
num_layers: 24
batch_size: 4
gradient_accumulation: 8
weight_decay: 0.001
grad_clip: 1.0
iters: 1000
learning_rate: 3.6e-5
lora_parameters:
  keys: ["self_attn.q_proj", "self_attn.k_proj", "self_attn.v_proj", "self_attn.o_proj", "mlp.down_proj","mlp.up_proj","mlp.gate_proj"]
  rank: 24
  scale: 6
  dropout: 0.1
lr_schedule:
  name: cosine_decay
  warmup: 200
  warmup_init: 1e-6
  arguments: [3.6e-5, 1000, 1e-6]
early_stopping_patience: 4

The most important parameters in terms of training are:

  • number of layers, which relates to the number of parameters available for training
  • weight_decay, in terms of generalization
  • grad_clip defines the threshold at which we clip gradients, to keep them from exploding (suddenly growing larger and larger)
  • learning_rate is how fast we ask the model to adapt to our data
  • lora_parameters/keys: we either stick only to self_attn.* or extend training to also cover mlp.*
  • rank defines the size of the low-rank space available for training
  • scale, also called alpha, is the influence factor of the adapter on the base weights
  • dropout is a random removal/regularization factor
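To make the learning_rate side concrete, here is a sketch of the warmup-plus-cosine-decay curve defined by the lr_schedule section of the config above (assuming the common cosine form lr = end + (init − end)·(1 + cos(π·t/T))/2; MLX’s exact step bookkeeping may differ slightly):

```python
import math

def lr_at(step, init=3.6e-5, end=1e-6, decay_steps=1000,
          warmup=200, warmup_init=1e-6):
    """Linear warmup followed by cosine decay (illustrative sketch)."""
    if step < warmup:
        # Linear ramp from warmup_init up to the peak learning rate.
        return warmup_init + (init - warmup_init) * step / warmup
    t = min(step - warmup, decay_steps)
    return end + (init - end) * 0.5 * (1 + math.cos(math.pi * t / decay_steps))

print(f"{lr_at(0):.2e}")     # warmup start: 1.00e-06
print(f"{lr_at(200):.2e}")   # peak:         3.60e-05
print(f"{lr_at(850):.2e}")   # well into the decay, close to the 1.065e-05 seen in the training log
```

The warmup phase protects the freshly initialized adapters from large early updates, and the cosine tail lets the final iterations settle gently.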

Now, at different points/phases of training these parameters should and will take different values depending on our use case. Every parameter is somehow related to the others. For example, the learning rate correlates indirectly with weight decay, grad clip, rank, scale and dropout. If you change the number of layers or the rank, then you need to adjust the other parameters as well. Key factors for changing your parameters:

  • number of QA pairs in the datasets
  • ratio of training data to validation data
  • data structure and quality
  • model parameter count
  • number of iterations/epochs (how many times the model sees your data during training)
  • whether you want to generalize or specialize the interaction between your data and the model

Training

You can run the training as follows, including W&B reporting for better analysis:

python -m mlx_lm lora -c train-config.yaml --wandb your-project

You can monitor your training either in the console or in W&B. The rule of thumb is that validation loss should go down, and it should go down together with the training loss. Training loss should not be much lower than validation loss, which would indicate overfitting and degrade the model’s ability to generalize. The ideal configuration goes as low as possible on both validation and training loss.
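The early_stopping_patience setting in the config builds on the same rule of thumb. A minimal sketch of such patience logic (the function is illustrative, not the mlx_lm implementation): stop when validation loss has failed to improve for a given number of consecutive evaluations.

```python
# Illustrative early-stopping check: stop once the last `patience`
# validation losses all failed to beat the best loss seen before them.

def should_stop(val_losses, patience=4):
    """Return True if the last `patience` evals did not improve on the best."""
    if len(val_losses) <= patience:
        return False
    best = min(val_losses[:-patience])
    return all(v >= best for v in val_losses[-patience:])

print(should_stop([0.90, 0.85, 0.80, 0.81, 0.82, 0.83, 0.84]))  # True: stalled
print(should_stop([0.90, 0.85, 0.80, 0.79, 0.78, 0.77, 0.76]))  # False: improving
```

Checkpoints are saved every save_every steps anyway, so stopping early never loses the best adapter.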

Iter 850: Val loss 0.757, Val took 99.444s
Iter 850: Train loss 0.564, Learning Rate 1.065e-05, It/sec 0.255, Tokens/sec 177.088, Trained Tokens 581033, Peak mem 33.410 GB
Iter 850: Saved adapter weights to adapters-drs2/adapters.safetensors and adapters-drs2/0000850_adapters.safetensors.
...
Iter 900: Val loss 0.805, Val took 99.701s
Iter 900: Train loss 0.422, Learning Rate 8.303e-06, It/sec 0.248, Tokens/sec 173.218, Trained Tokens 615120, Peak mem 33.410 GB
Iter 900: Saved adapter weights to adapters-drs2/adapters.safetensors and adapters-drs2/0000900_adapters.safetensors.
...
Iter 1000: Val loss 0.791, Val took 99.140s
Iter 1000: Train loss 0.396, Learning Rate 4.407e-06, It/sec 0.248, Tokens/sec 172.078, Trained Tokens 683991, Peak mem 33.410 GB
Iter 1000: Saved adapter weights to adapters-drs2/adapters.safetensors and adapters-drs2/0001000_adapters.safetensors.
Saved final weights to adapters-drs2/adapters.safetensors.

Fusing LoRA and exporting GGUF

Once you are done with your training, you can either use the LoRA adapter during generation or just fuse the adapter into the base model, which is more handy, as the fused model can also be copied into the LMStudio model directory for much more user-friendly use and evaluation of your newly trained model.

python -m mlx_lm.fuse --model $1 --adapter-path adapters --save-path model/$2
cp -r model/$2 /Users/your-user/.lmstudio/models/your-space/

Where $1 is the HuggingFace base model path and $2 is the model name in the output path. You can also fuse into the GGUF format using --export-gguf, and you can convert an HF model into GGUF using llama.cpp (https://github.com/ggml-org/llama.cpp.git). Please note that converting into GGUF, or converting into the Ollama “format”, may cause quality issues. The cause might be weight formatting, number representation or other graph differences which I have not yet identified on my side.

python convert_hf_to_gguf.py ~/.lmstudio/models/your-space/your-model-folder --outtype q8_0 --outfile ./out.gguf

Data

You need data to start training. Data preparation is a whole separate concern, aside from properly parametrizing the training process. It is not only the data itself but the whole augmentation process, including paraphrases, synonyms, negative examples, step-by-step reasoning examples, etc.
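A tiny sketch of what such augmentation can look like in practice (the templates are invented for this example): each original QA pair is expanded into several paraphrased records so the model sees the same fact phrased differently.

```python
# Illustrative augmentation: expand each QA pair into several paraphrased
# variants. The templates below are made up for the example.

TEMPLATES = [
    "{q}",
    "Explain briefly: {q}",
    "In one paragraph, {q}",
]

def augment(question: str, answer: str):
    """Yield one prompt/completion record per paraphrase template."""
    for t in TEMPLATES:
        yield {"prompt": t.format(q=question), "completion": answer}

records = list(augment("What is LoRA?", "LoRA trains low-rank adapters..."))
print(len(records))  # 3 variants per original pair
```

Real augmentation would go further (synonyms, negative examples, step-by-step variants), but the mechanics stay the same: multiply each fact across phrasings.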

Available formats are as follows:

{"messages": [{"role": "user", "content": "What is AI?"}, {"role": "assistant", "content": "AI is..."}]}
{"prompt": "Explain quantum computing", "completion": "Quantum computing uses..."}
{"text": "Complete text for language modeling"}

I tried all of them, and the most appealing seems to be the prompt/completion one.
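A minimal sketch of preparing such data (mlx_lm looks for train.jsonl and valid.jsonl inside the directory named by the data: key in the config; the records below are placeholders):

```python
# Write prompt/completion records as JSONL and re-validate them,
# the way mlx_lm expects them inside the `data` directory.

import json
import pathlib
import tempfile

records = [
    {"prompt": "Explain quantum computing", "completion": "Quantum computing uses..."},
    {"prompt": "What is AI?", "completion": "AI is..."},
]

data_dir = pathlib.Path(tempfile.mkdtemp())
with open(data_dir / "train.jsonl", "w") as f:
    for r in records:
        f.write(json.dumps(r) + "\n")

# Validation pass: every line must parse and carry both keys.
with open(data_dir / "train.jsonl") as f:
    parsed = [json.loads(line) for line in f]
assert all({"prompt", "completion"} <= set(r) for r in parsed)
print(len(parsed))
```

A validation pass like this catches malformed lines before the trainer does, which saves a failed run.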

Code generation and artifacts preview with WebUI and codegemma:7b

Generate WebGL, Three.JS, HTML, CSS, JavaScript, no Python code, single page with rotating cube, ambient lighting. Load libraries from CDN. Let ambient lighting be such that cube edges are visible. Add directional lighting also pointing at the cube. Scene needs to be navigable using arrow keys. Ensure browser compatibility.

With codegemma:7b you can generate source code. If asked properly, an artifacts feature will appear in the WebUI chat, interpreting your source code immediately after it is generated. This feature is useful for designers, developers and marketers who would like to speed up scaffolding and move from a brainstorm to a visible working result.

You can use visualization libraries such as Three.js or D3.js, or any other HTML, CSS and JavaScript code. It is important to ask for single-page, one-file content with all libraries loaded from a CDN. Use the words “change” or “update” to modify content, or “rewrite”/“redesign” for a complete rework.

And here is the source code generated by codegemma:7b:

<!DOCTYPE html>
<html>

<head>
  <meta charset="UTF-8" />
  <title>Rotating Cube with Ambient and Directional Lighting</title>
  <style>
    body {
      margin: 0;
      overflow: hidden;
    }

    canvas {
      display: block;
    }
  </style>
</head>

<body>
  <script src="https://cdnjs.cloudflare.com/ajax/libs/three.js/r134/three.min.js"></script>

  <script>
    const scene = new THREE.Scene();

    const camera = new THREE.PerspectiveCamera(75, window.innerWidth / window.innerHeight, 0.1, 100);
    camera.position.z = 3;

    const renderer = new THREE.WebGLRenderer();
    renderer.setSize(window.innerWidth, window.innerHeight);

    document.body.appendChild(renderer.domElement);

    const geometry = new THREE.BoxGeometry(1, 1, 1);
    const material = new THREE.MeshLambertMaterial({ color: 0x444444 });
    const cube = new THREE.Mesh(geometry, material);
    scene.add(cube);

    const ambientLight = new THREE.AmbientLight(0xffffff, 0.5);
    scene.add(ambientLight);

    const directionalLight = new THREE.DirectionalLight(0xffffff, 0.5);
    directionalLight.position.set(5, 5, 5);
    scene.add(directionalLight);

    let rotation = 0;

    const animate = () => {
      requestAnimationFrame(animate);

      rotation += 0.01;
      cube.rotation.y = rotation;

      renderer.render(scene, camera);
    };

    animate();

    const moveCamera = (event) => {
      switch (event.key) {
        case "ArrowUp":
          camera.position.z -= 0.1;
          break;
        case "ArrowDown":
          camera.position.z += 0.1;
          break;
        case "ArrowLeft":
          camera.position.x -= 0.1;
          break;
        case "ArrowRight":
          camera.position.x += 0.1;
          break;
      }
    };

    document.addEventListener("keydown", moveCamera);
  </script>
</body>

</html>

Qwen LLM

What is Qwen?

This is the organization of Qwen, which refers to the large language model family built by Alibaba Cloud. In this organization, we continuously release large language models (LLM), large multimodal models (LMM), and other AGI-related projects. Check them out and enjoy!

What models do they provide?

They provide a wide range of models, going back to 2023. The original model was just called Qwen and can still be found on GitHub. The current model, Qwen2.5, has its own repository, also on GitHub. The general-purpose models are just called Qwen, but there are also code-specific models, as well as Math, Audio and a few others.

Note from the creator:

We do not recommend using base language models for conversations. Instead, you can apply post-training, e.g., SFT, RLHF, continued pretraining, etc., or fill in the middle tasks on this model.

However, I tried the following models, because I can:

  • Qwen/Qwen-7B: 15 GB of model, 31 GB in RAM
  • Qwen/Qwen2-0.5B: 1 GB of model, 4 GB in RAM
  • Qwen/Qwen2.5-Coder-1.5B: 3 GB of model, 7 GB in RAM

Yes, you can run those models on the CPU, purely from system memory, rather than on the GPU. This will be significantly slower, but it works.

How to run?

In order to probe the sources of the data that have been used for training, I think we can ask something domain-specific:

from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-Coder-1.5B"

# Download the tokenizer and model weights from the HuggingFace Hub
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_name, trust_remote_code=True)

# Tokenize the domain-specific prompt
input_text = "How to install Proxmox on Hetzner bare-metal server?"
inputs = tokenizer(input_text, return_tensors="pt")

# Generate a continuation and decode it back to text
outputs = model.generate(**inputs, max_new_tokens=200)
generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)

print(generated_text)

This model uses the Qwen2ForCausalLM architecture and is released under the Apache 2.0 licence. To run it, we need to have a few additional Python packages installed:

transformers>=4.32.0,<4.38.0
accelerate
tiktoken
einops
transformers_stream_generator==0.0.4
scipy

Where did it get the data from?

So here is the output for the given question, “How to install Proxmox on Hetzner bare-metal server?”:

wget https://enterprise.proxmox.com/debian/proxmox-ve-release-6.x.gpg -O /etc/apt/trusted.gpg.d/proxmox-ve-release-6.x.gpg
echo "deb http://enterprise.proxmox.com/debian/pve buster pve-no-subscription" > /etc/apt/sources.list.d/pve-enterprise.list
apt-get update
apt-get install proxmox-ve

It suggests installing Proxmox 6, even though as of 2024 even Proxmox 7 is already outdated. Moreover, it suggests running Debian Buster and a specific hardware setup with 16 GB of RAM and 2 × 1 TB HDDs. It seems like some sort of forum, Stack Exchange or Stack Overflow material. It might also be a compilation or translation of a few such sources, as the small size of the model suggests.

Reading package lists... Done
Building dependency tree
Reading state information... Done
E: Unable to locate package proxmox-ve

It is a no-brainer: the model’s knowledge is an offline snapshot, now outdated. It’s very interesting that it still tries to answer, even when it cannot be precise.