BLOOM LLM: how to use?

When I ask BLOOM-560M “what is love?”, it replies with “The woman who had my first kiss in my life had no idea that I was a man”. Wtf?!

Intro

I’ve been into parallel computing since 2021, playing with OpenCL (you can read about it here) and trying to get the most out of my devices. I’ve gained pretty decent in-depth knowledge of how computation works on GPUs, and I’m curious how the most recent AI/ML/LLM technology works. So here is my little introduction to the LLM topic from a practical point of view.

Course of Action

  • BLOOM overview
  • vLLM
  • Transformers
  • Microsoft Azure NV VM
  • What’s next?

What is BLOOM?

It is a BigScience Large Open-science Open-access Multilingual language model. It is based on the transformer deep-learning concept, where text is converted into tokens and then into vectors via lookup tables. Deep learning itself is a machine learning method based on neural networks in which you train artificial neurons. BLOOM is free and it was created by over 1000 researchers. It has been trained on about 1.6 TB of pre-processed multilingual text.
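
To make the tokens-and-vectors idea concrete, here is a minimal sketch using the Hugging Face tokenizer for BLOOM (assuming the transformers package, covered later, is installed); the model then maps each token id to an embedding vector through a lookup table:

# Minimal sketch: turn text into token ids the way BLOOM sees it
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bigscience/bloom-560m")
text = "What is love?"
ids = tokenizer(text)["input_ids"]
print(ids)                                   # token ids: a short list of integers
print(tokenizer.convert_ids_to_tokens(ids))  # the text pieces those ids stand for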

There are a few variants of this model: the one with 176 billion parameters (called just BLOOM), but also BLOOM 1b7 with 1.7 billion parameters. There is even BLOOM 560M:

  • to load and run 176B you need around 700 GB of VRAM with FP32 and half of that (about 350 GB) with FP16
  • to load and run 1B7 you need somewhere between 10 and 12 GB of VRAM, and half of that with FP16

So in order to use my NVIDIA GeForce RTX 3050 Ti with 4 GB VRAM I would need to run BLOOM 560M, which requires 2 to 3 GB of VRAM (and even below 2 GB with FP16 mixed precision), or… use the CPU. On CPU, 176B requires about 700 GB RAM, 1B7 requires 12 to 16 GB RAM and 560M requires 8 to 10 GB RAM.
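
Those figures roughly follow from simple arithmetic: parameter count times bytes per parameter (4 for FP32, 2 for FP16) gives the weights alone, and the runtime needs extra headroom on top, which is why the quoted RAM numbers are higher. A quick sketch of my own back-of-envelope math, not official requirements:

# Back-of-envelope: model weights = parameter count x bytes per parameter.
# Runtime memory (activations, KV cache) comes on top of this.
def weights_gb(params, bytes_per_param):
    # weights only, in decimal GB
    return params * bytes_per_param / 1e9

for name, params in [("BLOOM 176B", 176e9), ("BLOOM 1b7", 1.7e9), ("BLOOM 560M", 560e6)]:
    print(f"{name}: ~{weights_gb(params, 4):.1f} GB at FP32, ~{weights_gb(params, 2):.1f} GB at FP16")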

Are those solid numbers? Let's find out!

vLLM

“vLLM is a Python library that also contains pre-compiled C++ and CUDA (12.1) binaries.”

“A high-throughput and memory-efficient inference and serving engine for LLMs”

You can download (from Hugging Face, a company founded in 2016 in the USA) and serve language models with these few steps:

pip install vllm
vllm serve "bigscience/bloom"

And then once it’s started (and to be honest it won’t start just like that…):

curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "bigscience/bloom",
		"messages": [
			{"role": "user", "content": "Hello!"}
		]
	}'
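
vLLM exposes an OpenAI-compatible API, so the same request can be made from Python with only the standard library (a sketch, assuming the server is actually up on localhost:8000):

# Same chat-completions request as the curl above, via the standard library
import json
import urllib.request

payload = {
    "model": "bigscience/bloom",
    "messages": [{"role": "user", "content": "Hello!"}],
}
req = urllib.request.Request(
    "http://localhost:8000/v1/chat/completions",
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.load(resp))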

You can back your vLLM runtime with an NVIDIA GPU or a CPU, but also with ROCm, OpenVINO, Neuron, TPU and XPU. For NVIDIA GPUs it requires compute capability 7.0 or higher. My RTX 3050 Ti has 8.6, but my Tesla K20Xm with 6 GB VRAM has only 3.5, so vLLM will not be able to use it.
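
If you are unsure what your card supports, you can check its compute capability with PyTorch (a quick sketch, assuming torch is installed):

# Print the CUDA compute capability of the first GPU, if any
import torch

if torch.cuda.is_available():
    major, minor = torch.cuda.get_device_capability(0)
    print(f"{torch.cuda.get_device_name(0)}: compute capability {major}.{minor}")
else:
    print("No CUDA device available")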

Here is the Python program:

from vllm import LLM, SamplingParams

# Load the model, capping vLLM at 60% of GPU memory and allowing
# 4 GB of CPU offload plus 2 GB of swap space.
model_name = "bigscience/bloom-560m"
llm = LLM(model=model_name, gpu_memory_utilization=0.6, cpu_offload_gb=4, swap_space=2)

question = "What is love?"
sampling_params = SamplingParams(
    temperature=0.5,  # moderate randomness
    max_tokens=10,    # keep the completion short
)
output = llm.generate([question], sampling_params)
print(output[0].outputs[0].text)

In return, there is either:

[rank0]: torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 736.00 MiB. GPU 0 has a total capacity of 3.81 GiB of which 73.00 MiB is free. Including non-PyTorch memory, this process has 3.73 GiB memory in use. Of the allocated memory 3.56 GiB is allocated by PyTorch, and 69.88 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

or the following:

No available memory for the cache blocks. Try increasing `gpu_memory_utilization` when initializing the engine.
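
Both messages hint at knobs to turn. Here is a sketch of what could be tuned next, not a verified fix on my 4 GB card:

# Sketch of the two hints from the errors above: set the allocator option
# before torch/vllm initialize CUDA, and give vLLM a bigger memory share.
import os
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

from vllm import LLM
llm = LLM(model="bigscience/bloom-560m", gpu_memory_utilization=0.9)  # up from 0.6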

I may check it out later on a bigger GPU, but for now I will try to run the model using the transformers library, which is the next topic.

Transformers

So I picked the same BLOOM 560M model. First, you need to install the following main packages and plenty of dependencies:

pip install transformers
pip install torch
pip install accelerate

The source code of a Python program using those libraries is as follows:

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

# Load the tokenizer and the model; device_map="auto" places weights on the
# GPU if one is available, torch_dtype="auto" picks the checkpoint precision.
tokenizer = AutoTokenizer.from_pretrained("bigscience/bloom-560m")
model = AutoModelForCausalLM.from_pretrained(
    "bigscience/bloom-560m",
    device_map="auto",
    torch_dtype="auto"
)

def ask_bloom(question, max_length=100, temperature=0.7):
    # Tokenize the prompt and move it to the same device as the model
    inputs = tokenizer(question, return_tensors="pt").to(model.device)
    output = model.generate(
        inputs['input_ids'],
        max_length=max_length,               # cap on prompt plus completion
        temperature=temperature,             # sampling randomness
        pad_token_id=tokenizer.eos_token_id,
        do_sample=True                       # sample instead of greedy decoding
    )
    answer = tokenizer.decode(output[0], skip_special_tokens=True)
    return answer

question = "What is love?"
answer = ask_bloom(question)
print(f"Q: {question}\nAnswer: {answer}")

To run:

python3 transformers-torch.py

It will download the model and then run the program.

We can see that my NVIDIA GeForce 940MX with 2 GB VRAM has its memory around 60% full, with over 90% GPU utilization (results from nvidia-smi).

“What is love?”. The answer is as follows:

What is love? She never asked me the question but I am the one who has to give my answer. She is a beautiful, beautiful, very beautiful woman. The first thing you know about love is that it is the most complicated thing in the world. I was so shocked when I saw the man I thought I was marrying. My life was over. The woman who had my first kiss in my life had no idea that I was a man. She was so shocked. She said something

Next I asked: “The sun is shining and?”

I love this scene and I like when it comes to it. The sun, the sky and all the other stars are bright and I love that! I really enjoy this scene and it sets the scene for the rest of the story. It seems like the characters are just going about their day and the sun is shining. The world seems to be in its perfect place and everything is beautiful. I love the lighting in this scene and the warmth of the sunlight that

Does it make any sense? What is the concept of “sense” at all? Anyway, it works, somehow. Let's find out about the other possibilities.

Microsoft Azure N-series virtual machines

Instead of buying an MSI Vector, ASUS ROG, Lenovo Legion Pro, MSI Raider or any kind of ultimate gaming laptop, you can go to Azure and pick one of their NV virtual machines, especially since they come with 14 or 28 GB of VRAM onboard. It costs around 400 Euro per month, but you will not be using it all the time (I suppose).

We have:

root@z92-az-bloom:/home/adminadmin# lspci 
0002:00:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Vega 10 [Instinct MI25 MxGPU/MI25x2 MxGPU/V340 MxGPU/V340L MxGPU]

And I was not so sure how to use an AMD GPU, so instead I decided to request a quota increase. However, that request got rejected on my account. Unfortunately, changing parameters and virtual machine types did not change the situation: I was still rejected and needed to submit a support ticket to Microsoft in order to have it processed manually. So until next time!

What’s next to check?

AWS g6 and Hetzner GEX44. Keep reading!

Further reading

Microsoft Azure AI Services: computer vision

Use Microsoft Azure AI Services to analyze images, voice and documents. No AI/ML or coding skills required. Responsible AI rules apply under the EU AI Act. Formerly known as Cognitive Services.

Course of Action

  • Create AI Services multi-service account in Azure
  • Run computer vision OCR on image

What is Microsoft Azure?

It is Microsoft’s public cloud platform offering a broad range of products and services, including virtual machines, managed containers, databases and analytics platforms, as well as AI Services. The major competitors of Azure are Amazon AWS and Google’s GCP.

What are AI Services (formerly Cognitive Services)?

It is a set of various services for recognition and analysis tasks based on already trained ML models (or even traditional programming techniques). You can use it to describe documents, run OCR tasks, do face recognition, etc. Those services used to be categorized under the Cognitive Services name, which fits: they concern recognition, and “cognitive” is close in meaning. The name change, which happened in July 2023, was more-or-less a rebranding with non-breaking changes only, done as a part of marketing. It is obvious that “AI Services” would sell better than “Cognitive Services”.

Create AI multi-service account in Azure portal

In order to create a Microsoft Azure AI Services multi-service account you need to have a valid Azure subscription, either Free Trial or a regular account. Type “AI” in the search field in portal.azure.com and you will find this service through the service catalog.

It is worth mentioning that you get a “Responsible AI Notice”, which relates to the AI Act covering the European Union, USA and UK. It defines what AI/ML models can do and what they should not be allowed to do. According to a KPMG source it covers, among others, social scoring, recruitment and deep-fake disclosure as the most crucial areas requiring regulation. What about the rest of the world? Well, it might be the same as with the CO2 emissions or plastic garbage recycling situation.

The deployment process in Azure is especially meaningful when speaking about configurable assets with data. For plain services, however, deployment is just a matter of linking them to our account, so the deployment of AI Services finishes within seconds.

AI Services account overview

To use Azure AI Services you need to go to Resource Management, Keys and Endpoint. There you will find the Endpoint to which you should send your API calls/requests and the access key. This key maps to the “Ocp-Apim-Subscription-Key” header, which should be passed with each HTTP call.
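
For illustration, here is a sketch of that header mapping from Python using only the standard library; the endpoint, key and image URL below are placeholders, not real values:

# Sketch: call the Image Analysis endpoint with the access key passed
# in the Ocp-Apim-Subscription-Key header. All values are placeholders.
import json
import urllib.request

endpoint = "https://<your-resource>.cognitiveservices.azure.com"  # from Keys and Endpoint
key = "xxx"

url = (endpoint + "/computervision/imageanalysis:analyze"
       "?features=read&model-version=latest&language=en&api-version=2024-02-01")
body = json.dumps({"url": "https://example.com/some-image.png"}).encode()
req = urllib.request.Request(url, data=body, headers={
    "Ocp-Apim-Subscription-Key": key,
    "Content-Type": "application/json",
})
with urllib.request.urlopen(req) as resp:
    print(json.load(resp))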

As for the S0 standard pricing tier on a Free Tier subscription, an estimated 1 000 API calls (requests made) would cost less than 1 Euro. That might be “cheap”, however it is the starting point of the pricing, and I suspect the real value might differ in a production use case, especially for decision-making services (still possibly ML-based only), as opposed to services which could be replaced by traditional programming techniques, such as OCR, which has nota bene been present on the market for a few decades already.

Run example recognition task

Instead of programming (aka coding) against the various AI Services SDKs (Python, JavaScript, etc.) you can also invoke such services with an HTTP request using the curl utility. As far as I know every Windows 10 and 11 installation should have curl present, and on Linux distributions you most probably have curl installed already.

So, in order to invoke a recognition task, pass the subscription key (here replaced by xxx), point at your specific Endpoint URL and pass the url parameter pointing at some publicly available image for the recognition service to run over. I found out that not every feature is available at every Endpoint; in that case you would need to modify the “features” parameter:

curl -H "Ocp-Apim-Subscription-Key: xxx" -H "Content-Type: application/json" "https://z92-azure-ai-services.cognitiveservices.azure.com//computervision/imageanalysis:analyze?features=read&model-version=latest&language=en&api-version=2024-02-01" -d "{'url':'https://michalasobczak.pl/wp-content/uploads/2024/08/image-6.png'}"

I passed this image for analysis; it contains a dozen rectangular boxes with text inside. It should be straightforward to get proper results, as the text is not rotated, it is written in a machine font and the color contrast is decent.

In return we receive the following JSON-formatted output. We can see that it properly detected the word “rkhunter” as well as “process”. However, we would need an additional layer of processing to merge adjacent words detected as separate lines into phrases instead of just separate words (see the sketch after the JSON below).

{
   "modelVersion":"2023-10-01",
   "metadata":{
      "width":860,
      "height":532
   },
   "readResult":{
      "blocks":[
         {
            "lines":[
               {
                  "text":"rkhunter",
                  "boundingPolygon":[
                     {
                        "x":462,
                        "y":78
                     },
                     {
                        "x":519,
                        "y":79
                     },
                     {
                        "x":519,
                        "y":92
                     },
                     {
                        "x":462,
                        "y":91
                     }
                  ],

                  ...

                  "words":[
                     {
                        "text":"process",
                        "boundingPolygon":[
                           {
                              "x":539,
                              "y":447
                           },
                           {
                              "x":586,
                              "y":447
                           },
                           {
                              "x":586,
                              "y":459
                           },
                           {
                              "x":539,
                              "y":459
                           }
                        ],
                        "confidence":0.993
                     }
                  ]
               }
            ]
         }
      ]
   }
}
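
One possible sketch of that extra processing layer: group detected lines whose bounding boxes sit on roughly the same row and join their text into phrases. The 10-pixel y-tolerance is my guess, and result stands for the parsed JSON above:

# Sketch: merge detected lines that sit on roughly the same row into phrases.
# `result` is the parsed JSON response from the analyze call above.
def merge_rows(result, y_tolerance=10):
    lines = [l for b in result["readResult"]["blocks"] for l in b["lines"]]
    # sort by the top-left corner: top to bottom, then left to right
    lines.sort(key=lambda l: (l["boundingPolygon"][0]["y"], l["boundingPolygon"][0]["x"]))
    rows, current, last_y = [], [], None
    for line in lines:
        y = line["boundingPolygon"][0]["y"]
        if last_y is not None and abs(y - last_y) > y_tolerance:
            rows.append(" ".join(current))
            current = []
        current.append(line["text"])
        last_y = y
    if current:
        rows.append(" ".join(current))
    return rows

# usage: print(merge_rows(json.loads(response_text)))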

Conclusion

I think that price-wise this AI Services offering, formerly known as Cognitive Services, is a reasonable way of running recognition tasks in an online environment. We could include such recognition in our applications for further automation, for instance in ERP FI invoice processing.

External and redundant Azure VM backups with Veeam to remote site

Backup is a must. Primary hardware fails. Local backups can also fail or can be inaccessible. Remote backups can fail too, but if you have 2, 3 or even more backup copies in different places and on various media, chances are high enough that you will survive major incidents without data loss or too much downtime.

Here I am talking about the Microsoft Azure public cloud platform, but in any infrastructure environment you should have working and verified backup tools. Azure has its own. To keep those backups in a secure remote place (in the context of a Storage Account) you can use Veeam Backup for Microsoft Azure, which covers up to 10 instances for free, besides the costs of storage and of the VM running Veeam itself, of course.

Source: Veeam Backup for Microsoft Azure Free Edition

To deploy Veeam you can use a VM template from Azure's marketplace. It's called “Veeam Backup for Microsoft Azure Free Edition”. You also need to have a storage account; I recommend setting it up with the firewall enabled and your remote public IP address configured. This is the place where the VM backups made by Veeam will go.

Unlike Veeam Backup and Replication Community Edition, this one comes with a browser-based user interface, and it also looks quite different from the desktop-based version. What you need to do first is define a backup policy (Management – Policies), add virtual machines and run it. That's all at this point.

Resources covered by this policy can be found in Management – Protected Data. During backup, Veeam spins up an additional VM from an Ubuntu template to take the backups. After the backup or snapshot job is completed these temporary VMs are gone.

As mentioned earlier, there are 10 slots within this free license, but you need to manually configure license usage, which is a little bit annoying of course. Keep in mind that even a single backup or snapshot uses a license seat; you need to remove it to free the seat up.

You could use Veeam as a replacement for the native backups coming from Azure. In this proposed scenario, Veeam backups are the first step towards having redundant and remote backups in case the environment becomes inaccessible.

Remote: Veeam Backup and Replication Community Edition

In order to move backups/snapshots out of the Azure Storage Account created by Veeam Backup for Microsoft Azure, you need to have the Community Edition of Veeam installed in a remote place. For the sake of compliance it should be a physically separate place, and in my opinion it must not be the same service provider. So your remote site could also be on a public cloud, but from a different provider.

In order to install Veeam Community you need to obtain a Windows license for your virtual machine. Install Windows from an official ISO from Microsoft and buy the license directly from the Microsoft Store. This way you can purchase an electronic license even for Windows 10, which is sometimes preferable over Windows 11. The Veeam installation itself is rather straightforward.

There is a variety of choices as to where you can copy your backups from, which means that a similar setup can be done with other public clouds like AWS or GCP. In the case of Microsoft Azure you need to copy the access token for the Storage Account with the backups from the Azure Portal. Adding the external repository is done at Backup Infrastructure – External Repositories.

You also need to have a local repository, which can be a virtual hard drive added to your Veeam Community VM and initialized with some drive letter in Windows.

There is a choice of what to back up and how to transfer it to the remote place. In this given scenario the optimum is to create a Backup Copy job, which will copy backups from the source as soon as they appear over there. Other scenarios are also possible, but only when additional requirements are met.

Once you have defined the Backup Copy job, run it. When it completes you will have your source backup secured in the remote place. Now you have a copy of those backups on a different medium.

How to restore backups to remote Proxmox server?

Now you have your source backups secured and placed in a remote site. The question arises: how to restore such a backup? You could run instant recovery, but to do this you need to have one of the supported commercial virtualization platforms set up, and Proxmox is not on that list. However, you can export content as a virtual disk, which will produce VMDK files with disk descriptors.

There is, however, one quirk you need to fix before continuing: the disk descriptors exported by Veeam are incompatible with the disk import in Proxmox. Surround the createType value with quotes:

createType="monolithicFlat"
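
Fixing this by hand works, but a tiny script can patch a descriptor in one go (a sketch; the filename is an example):

# Sketch: add quotes around the createType value in a VMDK descriptor file
import re

path = "disk.vmdk"  # the small text descriptor, not the flat data extent
with open(path) as f:
    text = f.read()
text = re.sub(r'createType=(\w+)', r'createType="\1"', text)
with open(path, "w") as f:
    f.write(text)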

Copy the exported disks to the Proxmox server. Now you can create an empty VM, i.e. without disk drives and possibly even without network adapters at first. Import the disk into this newly created VM with the qm utility, then add the drive to the VM and change its boot order. You are good to go.

To recap the procedure:

  • Export content as virtual disks
  • Fix createType variable in disk descriptor
  • Copy disk to Proxmox server
  • Create empty VM
  • Import disks into new VM
  • Configure VM and run it

Keep in mind that redundant backup is a must.

oc rsync takes down OKD master processes

It might sound a little weird, but that's the case. I was trying to set up an NFS mount in the OKD docker registry (from this tutorial). During an oc rsync from inside the docker-registry container I found that the OKD master processes went down, because the health check thought there was a connectivity problem. This arose because oc rsync does not have a rate limiting feature, and if it fully utilizes the local network then there is no bandwidth left for the cluster itself.

A few things taken out of the logs (/var/log/messages):

19,270,533,120  70%   57.87MB/s    0:02:19  The connection to the server okd-master:8443 was refused - did you specify the right host or port?
Liveness probe for "okd-master.local_kube-system (xxx):etcd" failed (failure): member xxx is unhealthy: hot unhealthy result
okd-master origin-node: cluster is unhealthy

The transfer from the docker-registry container starts at a rate of 200 MB/s; I'm not quite sure if the network is actually capable of such speed. The problem is repeatable: after the liveness probe is triggered, the master, etcd and webconsole are restarted, which could lead to an unstable cluster. We should avoid it if possible. Unfortunately the docker-registry container is a very basic one, without ip, ifconfig, ssh, scp or any utilities which could help with transferring files out. But…

  • you can check IP of the container in webconsole
  • you can start an HTTP server with python -m SimpleHTTPServer on port 8000 (a Python 3 note follows after this list)
  • you can then download the file with wget x.x.x.x:8000/file.tar --limit-rate=20000k
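
Note that SimpleHTTPServer is the Python 2 module name; if the image ships Python 3 instead (an assumption about the container), the equivalent is python3 -m http.server 8000, or programmatically:

# Python 3 equivalent of SimpleHTTPServer: serve the current directory on port 8000
from http.server import HTTPServer, SimpleHTTPRequestHandler

HTTPServer(("0.0.0.0", 8000), SimpleHTTPRequestHandler).serve_forever()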

It is really funny that the container lacks basic tools but has Python. Set the wget rate at a reasonable level so that the internal network will not be fully utilized. To sum up: I did not encounter such a problem in any other environment, either bare-metal or virtualized, so it might be related specifically to the Microsoft Azure SDN and how it behaves under such traffic load.