LLM training parameters explanation

A quick overview of MLX LoRA training parameters for LLMs.

weight_decay: A regularization technique that adds a small penalty to the weights during training to prevent them from growing too large, helping to reduce overfitting. Often implemented as L2 regularization. Examples: 0.00001 – 0.01
grad_clip: Short for gradient clipping, a method that limits (clips) the size of gradients during backpropagation to prevent exploding gradients and stabilize training. Examples: 0.1 – 1.0
rank: The dimensionality, or number of independent directions, in a matrix or tensor. In low-rank adapters it controls how much the model compresses or approximates the original weight update. Examples: 4, 8, 16 or 32
scale: A multiplier or factor used to adjust the magnitude of values, for example scaling activations, gradients, or learning rates to maintain numerical stability or normalize features. Examples: 0.5 – 2.0
dropout: A regularization method that randomly “drops out” (sets to zero) a fraction of neurons during training, forcing the network to learn more robust and generalizable patterns. Examples: 0.1 – 0.5
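
As a rough illustration (not taken from any of the training scripts below), here is a minimal PyTorch sketch of where weight_decay and grad_clip act inside a single training step; the tiny linear layer just stands in for a real model:

import torch

# placeholder model standing in for an LLM
model = torch.nn.Linear(16, 16)
opt = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=0.001)

x, y = torch.randn(4, 16), torch.randn(4, 16)
loss = torch.nn.functional.mse_loss(model(x), y)
loss.backward()

# grad_clip: cap the global gradient norm before the optimizer step
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)

# weight_decay: AdamW additionally shrinks the weights a little on every step
opt.step()
opt.zero_grad()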

Full LLM fine-tuning using transformers, torch and accelerate with HF and GGUF

Full fine-tuning of mlx-community/Qwen2.5-3B-Instruct-bf16

Recently I posted an article on how to train a LoRA with MLX here. Then I asked myself how I could export or convert such an MLX model into HF or GGUF format. Even though MLX has an option to export to GGUF, most of the time it is not supported for the models I have been using. From what I recall, even where Qwen is supported it is version 2 rather than version 3, and quality suffers from such a conversion. I do not know exactly why it works like that.

So I decided to give full fine-tuning a try using transformers, torch and accelerate.

Input data

In terms of input data we can use the same format as for MLX LoRA training. There are two kinds of files, train.jsonl and valid.jsonl, with the following format:

{"prompt":"This is the question", "completion":"This is the answer"}

Remember that this is full training, not just low-rank adapters, so it is a little harder to get proper results. It is crucial to get as much good-quality data as possible. I take source documents and run an augmentation process using Langflow.
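
For completeness, here is a minimal sketch of how such a train.jsonl can be produced from a list of question/answer pairs; the file path and the record are placeholders, in practice the records come out of the augmentation pipeline:

import json

# placeholder records
records = [
    {"prompt": "This is the question", "completion": "This is the answer"},
]

with open("./data-folder/train.jsonl", "w", encoding="utf-8") as f:
    for rec in records:
        # one JSON object per line
        f.write(json.dumps(rec, ensure_ascii=False) + "\n")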

Full fine-tuning program

Next comes the source code of the training program. You can see that you need transformers, accelerate, PyTorch and Datasets. The first and only parameter is the output folder for the weights. After training is done, a few test questions are asked in order to verify the quality of the trained model.

import sys
from typing import List, Dict, Tuple
from datasets import Dataset
from transformers import (
    AutoTokenizer,
    AutoModelForCausalLM,
    TrainingArguments,
    Trainer,
    DataCollatorForLanguageModeling
)
import torch
import json

# output folder for weights and checkpoints (first and only CLI argument)
folder = sys.argv[1]

def prepare_dataset() -> Dataset:
    records: List[Dict[str, str]] = []
    with open("./data-folder/train.jsonl", "r", encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            line = line.strip()
            if not line:
                continue
            try:
                obj = json.loads(line)
                if isinstance(obj, dict):
                    records.append(obj)
                    print(obj)
                else:
                    print(f"Linia {line_no}: nie jest słownikiem, pomijam.")
            except json.JSONDecodeError as e:
                print(f"Linia {line_no}: błąd JSON ({e}) — pomijam.")
    return Dataset.from_list(records)

def format_instruction(example: Dict[str, str]) -> str:
    return f"<|user|>{example['prompt']}\n <|assistant> {example['completion']}"

def tokenize_data(example: Dict[str, str], tokenizer: AutoTokenizer) -> Dict[str, torch.Tensor]:
    formatted_text = format_instruction(example)
    return tokenizer(formatted_text, truncation=True, padding="max_length", max_length=128)

def fine_tune_model(base_model: str = "mlx-community/Qwen2.5-3B-Instruct-bf16") -> Tuple[AutoModelForCausalLM, AutoTokenizer]:
    tokenizer = AutoTokenizer.from_pretrained(base_model)
    # set the pad token before saving so the exported tokenizer carries it
    tokenizer.pad_token = tokenizer.eos_token
    tokenizer.save_pretrained(folder)
    model = AutoModelForCausalLM.from_pretrained(
        base_model,
        torch_dtype=torch.float32
    )
    model.to("mps")
    dataset = prepare_dataset()
    tokenized_dataset = dataset.map(
        lambda x: tokenize_data(x, tokenizer),
        remove_columns=dataset.column_names
    )
    split = tokenized_dataset.train_test_split(test_size=0.1)
    train_dataset = split["train"]
    eval_dataset = split["test"]
    training_args = TrainingArguments(
        output_dir=folder,
        num_train_epochs=4,
        per_device_train_batch_size=2,
        gradient_accumulation_steps=12,
        learning_rate=5e-5,
        weight_decay=0.001,
        max_grad_norm=1.0,
        warmup_ratio=0.10,
        lr_scheduler_type="cosine",
        bf16=True,
        fp16=False,
        logging_steps=10,
        save_total_limit=2,
        evaluation_strategy="steps",
        eval_steps=50,
        save_strategy="steps",
        save_steps=100,
        load_best_model_at_end=True,
        metric_for_best_model="loss",
        greater_is_better=False,
        gradient_checkpointing=True,
        group_by_length=True
    )
    data_collator = DataCollatorForLanguageModeling(
        tokenizer=tokenizer,
        mlm=False
    )
    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=train_dataset,
        eval_dataset=eval_dataset,
        data_collator=data_collator,
    )
    trainer.train()
    return model, tokenizer

def generate_response(prompt: str, model: AutoModelForCausalLM, tokenizer: AutoTokenizer, max_length: int = 512) -> str:
    formatted_prompt = f"<|user|>{prompt} <|assistant|>"
    inputs = tokenizer(formatted_prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(
        **inputs,
        max_length=max_length,
        do_sample=True,
        temperature=0.7,
        num_return_sequences=1,
    )
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

if __name__ == "__main__":
    model, tokenizer = fine_tune_model()
    test_prompts: List[str] = [
        "Question 1",
        "Question 2",
        "Question 3",
        "Question 4",
        "Question 5",
        "Question 6"
    ]
    for prompt in test_prompts:
        response = generate_response(prompt, model, tokenizer)
        print(f"\nPrompt: {prompt}")
        print(f"Response: {response}")

Parametrization

Let's take a look at the configuration parameters:

num_train_epochs=4,
per_device_train_batch_size=2,
gradient_accumulation_steps=12,
learning_rate=5e-5,
weight_decay=0.001,
max_grad_norm=1.0,
warmup_ratio=0.10,
lr_scheduler_type="cosine",
bf16=True,
fp16=False,
logging_steps=10,
save_total_limit=2,
evaluation_strategy="steps",
eval_steps=50,
save_strategy="steps",
save_steps=100,
load_best_model_at_end=True,
metric_for_best_model="loss",
greater_is_better=False,
gradient_checkpointing=True

The story is as follows:

We train for 4 epochs with a batch size of 2, accumulated by a factor of 12 (an effective batch of 24), at a learning rate of 5e-5. The warmup takes 10% of the run. We use a cosine decay schedule together with some weight decay and gradient norm clipping. We try to use BF16 and do not use FP16, but your mileage may vary. Logging takes place every 10 steps (for training loss). We evaluate every 50 steps (for validation loss), but save only every 100 steps. At the end we load the best checkpoint, keeping at most 2 checkpoints on disk.
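
As a quick sanity check (an illustration, not part of the training script), the effective batch size and the number of optimizer steps can be estimated like this; the sample count below is a made-up placeholder that happens to land near the 252 steps visible in the logs further down:

import math

per_device_train_batch_size = 2
gradient_accumulation_steps = 12
num_train_epochs = 4

# samples consumed per optimizer step
effective_batch = per_device_train_batch_size * gradient_accumulation_steps

def total_optimizer_steps(num_train_samples: int) -> int:
    # approximate number of optimizer steps the Trainer will run
    return math.ceil(num_train_samples / effective_batch) * num_train_epochs

print(effective_batch)               # 24
print(total_optimizer_steps(1500))   # 252 for a hypothetical ~1.5k training samples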

HF to GGUF conversion

After training finished we convert this HF format into GGUF in order to run it using LMStudio.

# get github.com/ggml-org/llama.cpp.git
# initialize venv and install libraries
python convert_hf_to_gguf.py ../rt/model/checkpoint-x/ --outfile ../rt/model/checkpoint-x-gguf

However, it is recommended to do this only after the test questions give reasonably good results. Otherwise it is pointless.

Running a training session

During the training session we may observe:

{'loss': 26.942, 'grad_norm': 204.8477783203125, 'learning_rate': 1.923076923076923e-05, 'epoch': 0.16}
{'loss': 17.2971, 'grad_norm': 62.03092956542969, 'learning_rate': 3.846153846153846e-05, 'epoch': 0.31}
{'loss': 16.1831, 'grad_norm': 55.732086181640625, 'learning_rate': 4.99613632163459e-05, 'epoch': 0.47}
{'loss': 15.3985, 'grad_norm': 52.239620208740234, 'learning_rate': 4.952806974561518e-05, 'epoch': 0.63}
{'loss': 14.6101, 'grad_norm': 47.203189849853516, 'learning_rate': 4.862157403595598e-05, 'epoch': 0.79}

20% | 50/252 [10:00<28:39, 8.51s/it]

{'eval_loss': 1.1458975076675415, 'eval_runtime': 10.0881, 'eval_samples_per_second': 16.852, 'eval_steps_per_second': 2.181, 'epoch': 0.79}
{'loss': 13.5673, 'grad_norm': 40.04380416870117, 'learning_rate': 4.7259364450857096e-05, 'epoch': 0.94}
{'loss': 10.3291, 'grad_norm': 40.06776428222656, 'learning_rate': 4.5467721110696685e-05, 'epoch': 1.11}
{'loss': 8.4045, 'grad_norm': 33.435096740722656, 'learning_rate': 4.3281208889462715e-05, 'epoch': 1.27}
{'loss': 8.2388, 'grad_norm': 40.08720779418945, 'learning_rate': 4.0742010579737855e-05, 'epoch': 1.42}
{'loss': 8.0016, 'grad_norm': 34.05099105834961, 'learning_rate': 3.7899113090712526e-05, 'epoch': 1.58}

40% | 100/252 [18:07<21:10, 8.36s/it]

{'eval_loss': 1.0009825229644775, 'eval_runtime': 27.7629, 'eval_samples_per_second': 6.123, 'eval_steps_per_second': 0.792, 'epoch': 1.58}
{'loss': 7.9294, 'grad_norm': 36.029380798339844, 'learning_rate': 3.4807362379317025e-05, 'epoch': 1.74}
{'loss': 7.7119, 'grad_norm': 33.554954528808594, 'learning_rate': 3.1526405346999946e-05, 'epoch': 1.9}

50% | 126/252 [24:58<19:44, 9.40s/it]

{'loss': 6.2079, 'grad_norm': 26.597759246826172, 'learning_rate': 2.8119539115370218e-05, 'epoch': 2.06}
{'loss': 3.6895, 'grad_norm': 36.123207092285156, 'learning_rate': 2.4652489880792128e-05, 'epoch': 2.22}
{'loss': 3.7563, 'grad_norm': 24.915979385375977, 'learning_rate': 2.1192144906604876e-05, 'epoch': 2.38}

60% | 150/252 [31:57<24:00, 14.12s/it]

...

On my Mac Studio M2 Ultra with 64GB of RAM it takes from 55 up to 60GB of memory to run a training session:

I am not sure if it just fits, or shrinks down a little bit to fit exactly within my memory limits.

Finally:

{'eval_loss': 1.0670475959777832, 'eval_runtime': 31.049, 'eval_samples_per_second': 5.475, 'eval_steps_per_second': 0.709, 'epoch': 3.96}
{'train_runtime': 4371.782, 'train_samples_per_second': 1.398, 'train_steps_per_second': 0.058, 'train_loss': 7.671164320574866, 'epoch': 3.99}

Conversion

If, after the test questions, we decide that the current weights are capable, we can convert the HF format into GGUF. There are two checkpoints, checkpoint-200 and checkpoint-252, as well as some other files like the vocab and tokenizer:

added_tokens.json
checkpoint-200
checkpoint-252
merges.txt
runs 
special_tokens_map.json
tokenizer.json 
tokenizer_config.json 
vocab.json

Single checkpoint:

We need to copy the tokenizer from the base model path into the checkpoint path:

cp tokenizer.json checkpoint-252

and then run conversion:

python convert_hf_to_gguf.py ../rt/model/checkpoint-252 --outfile  ../rt/model/checkpoint-252-gguf

Judge the quality yourself, especially compared to LoRA trainings. In my personal opinion, full fine-tuning requires a much higher level of expertise than training just a subset of the full model.

Train LLM on Mac Studio using MLX framework

I have done over 500 training sessions using Qwen2.5, Qwen3, Gemma and plenty of other publicly available LLMs to inject domain-specific knowledge into the models' low-rank adapters (LoRA). However, instead of giving you tons of unimportant facts, I will stick to the most important things, starting with the fact that I have used MLX on my Mac Studio M2 Ultra as well as on a MacBook Pro M1 Pro. Both fit this task well in terms of BF16 speed as well as unified memory capacity and bandwidth (up to 800GB/s).

Memory bandwidth is the most important factor when comparing GPU hardware within similar generations of process technology. That is why the M2/M3 Ultra, with its higher memory bandwidth, beats the M4, which has lower overall memory bandwidth.

LoRA and MLX

What is LoRA? With this type of training you take only a portion of a large model and train only a small part of the parameters, like 0.5 or 1%, which in most models gives us somewhere between 100k and 50M parameters available for training. What is MLX? It is Apple's array computation framework, which accelerates machine learning tasks.
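
A rough, back-of-the-envelope sketch of where numbers like that come from; the hidden size, layer count and the assumption of square projections below are placeholders, not exact values for any particular model:

def lora_param_count(hidden_size: int, num_layers: int, rank: int, matrices_per_layer: int = 4) -> int:
    # each adapted matrix gets two low-rank factors: (hidden_size x rank) and (rank x hidden_size)
    per_matrix = 2 * hidden_size * rank
    return per_matrix * matrices_per_layer * num_layers

# e.g. rank 16 on four attention projections over 24 layers of a 4096-wide model
print(lora_param_count(hidden_size=4096, num_layers=24, rank=16))  # ~12.6M trainable parameters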

How do MLX and LoRA relate to other frameworks on other hardware? MLX uses a slightly different weight organization and a different way of achieving the same thing as other frameworks, but with an Apple Silicon speed-up. Modern, powerful NVIDIA RTX-based training is pricey in terms of purchase cost and power consumption, and it is much more affordable to do this on a Mac Studio with, let's say, 64GB of RAM. Please note that for ML (GPU-related tasks generally speaking) you get about 75% of your RAM capacity, so on a 64GB Mac Studio I get around 45 – 46GB available. Now go online and look for RTXs with a similar amount of VRAM 😉

Configuration

So…

Here you have a sample training configuration using a rather big Qwen2.5 model (14B), pre-trained for Instruct-type usage and storing weights in BF16, which can run up to 50% faster than similar 16-bit floats or even 8-bit weights. I have "only" 64GB and 32GB of memory on my two machines, so I use a lower batch_size and a higher gradient_accumulation, which effectively gives me a 4 x 8 batch size.

data: "data"
adapter_path: "adapters"

train: true
fine_tune_type: lora
optimizer: adamw
seed: 0
val_batches: 50
max_seq_length: 1024
grad_checkpoint: true
steps_per_report: 10
steps_per_eval: 50
save_every: 50

model: "mlx-community/Qwen2.5-14B-Instruct-bf16"
num_layers: 24
batch_size: 4
gradient_accumulation: 8
weight_decay: 0.001
grad_clip: 1.0
iters: 1000
learning_rate: 3.6e-5
lora_parameters:
  keys: ["self_attn.q_proj", "self_attn.k_proj", "self_attn.v_proj", "self_attn.o_proj", "mlp.down_proj","mlp.up_proj","mlp.gate_proj"]
  rank: 24
  scale: 6
  dropout: 0.1
lr_schedule:
  name: cosine_decay
  warmup: 200
  warmup_init: 1e-6
  arguments: [3.6e-5, 1000, 1e-6]
early_stopping_patience: 4

The most important parameters in terms of training are:

  • number of layers, which relates to the number of parameters available for training
  • weight_decay, in terms of generalization
  • grad_clip, where we define how small or big the opening is through which we pull the gradients, so that they do not explode (suddenly grow larger and larger)
  • learning_rate, which is how fast we tell the model to learn from our data
  • lora_parameters/keys, where we either stick only to self_attn.* or extend training to also cover mlp.*
  • rank, which defines how much capacity the adapter gets for training (see the sketch after this list)
  • scale, also called alpha, which is the influence factor
  • dropout, a random removal factor used for regularization
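
To make rank, scale and dropout concrete, here is a minimal PyTorch sketch of a LoRA-style linear layer. This is only an illustration of the idea, not MLX's actual implementation (MLX may combine scale and rank differently under the hood):

import torch

class LoRALinear(torch.nn.Module):
    def __init__(self, base: torch.nn.Linear, rank: int = 24, scale: float = 6.0, dropout: float = 0.1):
        super().__init__()
        self.base = base                                                  # frozen pre-trained projection
        self.base.weight.requires_grad_(False)
        self.down = torch.nn.Linear(base.in_features, rank, bias=False)   # rank decides adapter capacity
        self.up = torch.nn.Linear(rank, base.out_features, bias=False)
        torch.nn.init.zeros_(self.up.weight)                              # adapter starts as a no-op
        self.scale = scale                                                # "alpha": adapter influence on the output
        self.drop = torch.nn.Dropout(dropout)                             # randomly zeroes part of the adapter input

    def forward(self, x):
        # frozen path plus scaled low-rank update
        return self.base(x) + self.scale * self.up(self.down(self.drop(x)))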

Now, at different points/phases of training these parameters should and will take different values depending on your use case. Every parameter is somehow related to the others. For example, the learning rate correlates indirectly with weight_decay, grad_clip, rank, scale and dropout. If you change the number of layers or the rank, then you need to adjust the other parameters as well. Key factors for changing your parameters:

  • number of QA in datasets
  • number of training data vs validation data
  • data structure and quality
  • model parameters size
  • number of iterations/epochs (how many times model sees your data in training)
  • where you want to either generalize or specialize your data and model interaction

Training

You can run training as follows, including W&B reporting for better analysis.

python -m mlx_lm lora -c train-config.yaml --wandb your-project

You can monitor your training either in the console or in W&B. The rule of thumb is that validation loss should go down, and it should go down together with training loss. Training loss should not be much lower than validation loss, which could indicate overfitting, and overfitting degrades the model's ability to generalize. The ideal configuration goes as low as possible on both validation and training loss.

Iter 850: Val loss 0.757, Val took 99.444s
Iter 850: Train loss 0.564, Learning Rate 1.065e-05, It/sec 0.255, Tokens/sec 177.088, Trained Tokens 581033, Peak mem 33.410 GB
Iter 850: Saved adapter weights to adapters-drs2/adapters.safetensors and adapters-drs2/0000850_adapters.safetensors.
...
Iter 900: Val loss 0.805, Val took 99.701s
Iter 900: Train loss 0.422, Learning Rate 8.303e-06, It/sec 0.248, Tokens/sec 173.218, Trained Tokens 615120, Peak mem 33.410 GB
Iter 900: Saved adapter weights to adapters-drs2/adapters.safetensors and adapters-drs2/0000900_adapters.safetensors.
...
Iter 1000: Val loss 0.791, Val took 99.140s
Iter 1000: Train loss 0.396, Learning Rate 4.407e-06, It/sec 0.248, Tokens/sec 172.078, Trained Tokens 683991, Peak mem 33.410 GB
Iter 1000: Saved adapter weights to adapters-drs2/adapters.safetensors and adapters-drs2/0001000_adapters.safetensors.
Saved final weights to adapters-drs2/adapters.safetensors.
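
If you prefer numbers over eyeballing, a small sketch like the one below (the regexes and the 0.3 threshold are arbitrary choices of mine, not an MLX feature) can pull the losses out of console lines like those above and flag a growing train/validation gap:

import re

log = """
Iter 850: Val loss 0.757, Val took 99.444s
Iter 850: Train loss 0.564, Learning Rate 1.065e-05
Iter 900: Val loss 0.805, Val took 99.701s
Iter 900: Train loss 0.422, Learning Rate 8.303e-06
"""

val = dict(re.findall(r"Iter (\d+): Val loss ([\d.]+)", log))
train = dict(re.findall(r"Iter (\d+): Train loss ([\d.]+)", log))

for it in sorted(val, key=int):
    gap = float(val[it]) - float(train[it])
    flag = "  <- watch for overfitting" if gap > 0.3 else ""
    print(f"iter {it}: val - train = {gap:.3f}{flag}")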

Fusing LoRA and exporting GGUF

Once you are ready and done with your training, you can either use the LoRA adapter directly during generation or fuse the adapter into the base model, which is more handy as the result can also be copied into the LM Studio model directory for much more user-friendly use and evaluation of your newly trained model.

python -m mlx_lm.fuse --model $1 --adapter-path adapters --save-path model/$2
cp -r model/$2 /Users/your-user/.lmstudio/models/your-space/

Where $1 is the HuggingFace base model path and $2 is the model name in the output path. You can also fuse into GGUF format by using --export-gguf, and you can also convert an HF model into GGUF using llama.cpp (https://github.com/ggml-org/llama.cpp.git). Please note that converting it into GGUF or into the Ollama “format” may cause quality issues. The cause might be weight formatting, number representation or other graph differences which I have not identified yet on my side.

python convert_hf_to_gguf.py ~/.lmstudio/models/your-space/your-model-folder --outtype q8_0 --outfile ./out.gguf

Data

You need data to start training. It is a whole separate concern aside from properly parametrizing your training process. It is not only the data itself but the whole augmentation process, including paraphrases, synonyms, negative examples, step-by-step variants, etc.

Available formats are as follows:

{"messages": [{"role": "user", "content": "What is AI?"}, {"role": "assistant", "content": "AI is..."}]}
{"prompt": "Explain quantum computing", "completion": "Quantum computing uses..."}
{"text": "Complete text for language modeling"}

I tried all of them and the most appealing seems to be the prompt/completion one.
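
If your data starts out in the messages format, a small helper like this converts it to prompt/completion; the field names follow the examples above, and it only handles simple single-turn records:

import json

def messages_to_prompt_completion(line: str) -> dict:
    # assumes one user message and one assistant message per record
    msgs = json.loads(line)["messages"]
    prompt = next(m["content"] for m in msgs if m["role"] == "user")
    completion = next(m["content"] for m in msgs if m["role"] == "assistant")
    return {"prompt": prompt, "completion": completion}

line = '{"messages": [{"role": "user", "content": "What is AI?"}, {"role": "assistant", "content": "AI is..."}]}'
print(json.dumps(messages_to_prompt_completion(line)))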

Run Bielik LLM from SpeakLeash using LM Studio on your local machine

Did you know that you can use the Polish LLM Bielik from SpeakLeash locally, on your private computer? The easiest way to do this is LM Studio (from lmstudio.ai).

  • download LM Studio
  • download the model (e.g. Bielik-11B-v2.2-Instruct-GGUF)
  • load model
  • open a new conversation
  • converse…

Why use a model locally? Just for fun. Where we don’t have internet. Because we don’t want to share our data and conversations etc…

You can run it on macOS, Windows and Linux. It requires support for AVX2 CPU instructions, a large amount of RAM and, preferably, a dedicated and modern graphics card.



Note: for example, a Thinkpad T460p with an i5 6300HQ and a dedicated 940MX card with 2GB VRAM basically does not want to work, but a Dell G15 with an i5 10200H and an RTX 3050 Ti works without any problem. I suspect that it is about Compute Capability and not the amount of VRAM in the graphics card, because on my old datacenter cards (Tesla, Quadro) these models and libraries do not work.

Block AI web-scrapers from stealing your website content

Did you know that you can block AI-related web-scrapers from downloading your whole website and actually stealing your content? This way LLM models will need to find a different data source for their learning process!

Why, you may ask? First of all, AI companies make money on their LLMs, so using your content without paying you is just stealing. It applies to texts, images and sounds. It is intellectual property which has a certain value. A long time ago I placed an “Attribution-NonCommercial-NoDerivatives” license on my website and guess what… it does not matter. I did not receive any attribution. Dozens of various bots visit my website and just download all the content. So I decided…

… to block those AI-related web-crawling and web-scraping bots. And no, not by modifying the robots.txt file (or any XML sitemaps), as that might not be sufficient in the case of some Chinese bots, since they just “don’t give a damn”. Nor did I decide to use any kind of plugins or server extensions. I decided to go the hard way:

location / {
  if ($http_user_agent ~* "Bytespider") { return 403; }
  ...
}

And decide exactly which HTTP User Agent (the client “browser”, in other words) I would like to show the middle finger to. For those who do not stare at server logs for at least a few minutes a day, “Bytespider” is a scraping bot from ByteDance, the company which owns TikTok. It is said that this bot may download content to feed some Chinese LLM. Chinese or US, it actually does not matter. If you would like to use my content, either pay me or attribute the usage of my content. How, you may ask? To be honest, I do not know.

There is either the hard way (as with NGINX blocking certain UAs) or the diplomatic way, which could lead to creating a catalogue of websites which do not want to participate in the AI feeding process for free. I think there are many more content creators who would like to get a piece of the AI birthday cake…

BLOOM LLM: how to use?

Asking BLOOM-560M “what is love?” it replies with “The woman who had my first kiss in my life had no idea that I was a man”. wtf?!

Intro

I’ve been into parallel computing since 2021, playing with OpenCL (you can read about it here) and looking to maximize device capabilities. I’ve got pretty decent in-depth knowledge about how the computational process works on GPUs, and I’m curious how the most recent AI/ML/LLM technology works. So here you have my little introduction to the LLM topic from a practical point of view.

Course of Action

  • BLOOM overview
  • vLLM
  • Transformers
  • Microsoft Azure NV VM
  • What’s next?

What is BLOOM?

It is the BigScience Large Open-science Open-access Multilingual language model. It is based on the transformer deep-learning concept, where text is converted into tokens and then into vectors via lookup tables. Deep learning itself is a machine learning method based on neural networks, where you train artificial neurons. BLOOM is free and was created by over 1000 researchers. It has been trained on about 1.6 TB of pre-processed multilingual text.

There are a few variants of this model: 176 billion parameters (called just BLOOM), but also BLOOM 1b7 with 1.7 billion parameters. There is even BLOOM 560M:

  • to load and run 176B you need about 700 GB of VRAM with FP32 and half of that with FP16
  • to load and run 1B7 you need somewhere between 10 and 12 GB of VRAM and half of that with FP16

So in order to use my NVIDIA GeForce RTX 3050 Ti with 4GB of VRAM I would either need to run BLOOM 560M, which requires 2 to 3 GB of VRAM (and even below 2 GB of VRAM when using FP16 mixed precision), or… use the CPU. On the CPU side, 176B requires about 700 GB of RAM, 1B7 requires 12 – 16 GB of RAM and 560M requires 8 – 10 GB of RAM.
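
A simple back-of-the-envelope check of those figures (weights only, ignoring activations, KV cache and framework overhead):

def weights_gb(params_billions: float, bytes_per_param: int) -> float:
    # raw weight storage: number of parameters times bytes per value
    return params_billions * 1e9 * bytes_per_param / 1e9

print(weights_gb(176, 4))    # ~704 GB for BLOOM 176B in FP32
print(weights_gb(176, 2))    # ~352 GB in FP16
print(weights_gb(0.56, 2))   # ~1.1 GB for BLOOM 560M in FP16 (weights only)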

Are those solid numbers? Let's find out!

vLLM

“vLLM is a Python library that also contains pre-compiled C++ and CUDA (12.1) binaries.”

“A high-throughput and memory-efficient inference and serving engine for LLMs”

You can download models (from Hugging Face, a company founded in 2016 in the USA) and serve them with these few steps:

pip install vllm
vllm serve "bigscience/bloom"

And then once it’s started (and to be honest it won’t start just like that…):

curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "bigscience/bloom",
		"messages": [
			{"role": "user", "content": "Hello!"}
		]
	}'

You can back your vLLM runtime with a GPU or CPU, but also ROCm, OpenVINO, Neuron, TPU and XPU. It requires GPU compute capability 7.0 or higher. My RTX 3050 Ti has 8.6, but my Tesla K20Xm with 6GB of VRAM has only 3.5, so vLLM will not be able to use it.
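
You can check the compute capability of your card locally, for example with a tiny PyTorch snippet like this:

import torch

if torch.cuda.is_available():
    # returns a (major, minor) tuple for the given device index
    major, minor = torch.cuda.get_device_capability(0)
    print(f"Compute capability: {major}.{minor}")
    print("Meets vLLM's >= 7.0 requirement:", (major, minor) >= (7, 0))
else:
    print("No CUDA device visible.")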

Here is the Python program:

from vllm import LLM, SamplingParams

model_name = "bigscience/bloom-560m"
# keep GPU usage modest and allow offloading, since we only have 4GB of VRAM
llm = LLM(model=model_name, gpu_memory_utilization=0.6, cpu_offload_gb=4, swap_space=2)
question = "What is love?"
sampling_params = SamplingParams(
    temperature=0.5,
    max_tokens=10,
)
output = llm.generate([question], sampling_params)
print(output[0].outputs[0].text)

In return, there is either:

[rank0]: torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 736.00 MiB. GPU 0 has a total capacity of 3.81 GiB of which 73.00 MiB is free. Including non-PyTorch memory, this process has 3.73 GiB memory in use. Of the allocated memory 3.56 GiB is allocated by PyTorch, and 69.88 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

or the following:

No available memory for the cache blocks. Try increasing `gpu_memory_utilization` when initializing the engine.

I may try later to check it out on a bigger GPU, but for now I will try to run it using the transformers library, which is the next topic.

Transformers

So I picked the same BLOOM 560M model. First, you need to install the following main packages and plenty of dependencies:

pip install transformers
pip install torch
pip install accelerate

The source code of the Python program using those libraries is as follows:

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
tokenizer = AutoTokenizer.from_pretrained("bigscience/bloom-560m")
model = AutoModelForCausalLM.from_pretrained(
    "bigscience/bloom-560m", 
    device_map="auto", 
    torch_dtype="auto"
)
def ask_bloom(question, max_length=100, temperature=0.7):
    inputs = tokenizer(question, return_tensors="pt").to(model.device)
    output = model.generate(
        inputs['input_ids'],
        max_length=max_length,  
        temperature=temperature,   
        pad_token_id=tokenizer.eos_token_id, 
        do_sample=True             
    )
    answer = tokenizer.decode(output[0], skip_special_tokens=True)
    return answer
question = "What is love?"
answer = ask_bloom(question)
print(f"Q: {question}\nAnwser: {answer}")

To run:

python3 transformers-torch.py

It will download the model and execute the Python program.

We can see that my NVIDIA GeForce 940MX with 2 GB of VRAM is loaded to around 60% of its memory with over 90% utilization (results from nvidia-smi).

“What is love?” The answer is as follows:

What is love? She never asked me the question but I am the one who has to give my answer. She is a beautiful, beautiful, very beautiful woman. The first thing you know about love is that it is the most complicated thing in the world. I was so shocked when I saw the man I thought I was marrying. My life was over. The woman who had my first kiss in my life had no idea that I was a man. She was so shocked. She said something

Next I asked: “The sun is shining and?”

I love this scene and I like when it comes to it. The sun, the sky and all the other stars are bright and I love that! I really enjoy this scene and it sets the scene for the rest of the story. It seems like the characters are just going about their day and the sun is shining. The world seems to be in its perfect place and everything is beautiful. I love the lighting in this scene and the warmth of the sunlight that

Does it make any sense? What is the concept of “sense” at all? Anyway, it works, somehow. Let's find out about the other possibilities.

Microsoft Azure N-series virtual machines

Instead of buying an MSI Vector, ASUS ROG, Lenovo Legion Pro, MSI Raider or any other kind of ultimate gaming laptop, you can go to Azure and pick one of their NV virtual machines, especially since they have 14 and 28 GB of VRAM on board. It costs around 400 Euro per month, but you will not be using it all the time (I suppose).

We have:

root@z92-az-bloom:/home/adminadmin# lspci 
0002:00:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Vega 10 [Instinct MI25 MxGPU/MI25x2 MxGPU/V340 MxGPU/V340L MxGPU]

And since I was not so sure how to use an AMD GPU, I decided instead to request a quota increase:

However, my account got rejected with that request:

Unfortunately, changing parameters and virtual machine types did not change the situation. I still got rejected and needed to submit a support ticket to Microsoft in order to have it processed manually. So until next time!

What’s next to check?

AWS g6 and Hetzner GEX44. Keep reading!

Further reading