AI/ML

Generate images with Stable Diffusion, Gemma and WebUI on NVIDIA GPUs

With Ollama paired with the Gemma3 model, Open WebUI with RAG and search capabilities, and finally Automatic1111 running Stable Diffusion, you can have quite a complete set of AI features at home for the price of 2 consumer-grade GPUs and some home electricity. With 500 iterations and an image size of 512×256 it took around a minute to generate a response. I find it funny to be able to generate images with AI techniques. I tried Stable Diffusion in the past, but now, with the help of Gemma and the integration with Automatic1111 in Open WebUI, it’s damn easy. Step by step Prerequisites You can find information
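
A minimal sketch of what such a generation request could look like against the Automatic1111 API, assuming the WebUI was started with --api and listens on the default 127.0.0.1:7860; the prompt is just a placeholder, and mapping the post's 500 iterations to sampling steps is my assumption:

    import base64
    import requests

    # Placeholder prompt; in my setup the prompt text comes from Gemma3 via Open WebUI.
    payload = {
        "prompt": "a cozy home lab with two GPUs, digital art",
        "steps": 500,     # the post's 500 iterations, interpreted here as sampling steps
        "width": 512,
        "height": 256,
    }

    # Automatic1111 must be running with its API enabled (--api).
    r = requests.post("http://127.0.0.1:7860/sdapi/v1/txt2img", json=payload, timeout=600)
    r.raise_for_status()

    # The API returns base64-encoded PNG images.
    for i, img_b64 in enumerate(r.json()["images"]):
        with open(f"output_{i}.png", "wb") as f:
            f.write(base64.b64decode(img_b64))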

AI/ML

Run DeepSeek-R1:70b on CPU and RAM

Utilize CPU, RAM and GPU computational resources together. With Ollama you can use not only the GPU but also the CPU with regular RAM to run LLM models, like DeepSeek-R1:70b. Of course, you need both a fast CPU and fast RAM, and plenty of it. My Lab setup contains 24 vCPUs (2 x 6 cores * 2 threads) and from 128 to 384 GB of RAM. Once started, Ollama allocates 22.4 GB of RAM (RES) and 119 GB of virtual memory. It reaches 1200% CPU utilization, causing the system load to go up to 12. However, total CPU utilization is only 50%. It
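
A rough sketch of sending a prompt to such a CPU/RAM-backed model through Ollama's HTTP API, assuming Ollama listens on the default localhost:11434 and deepseek-r1:70b has already been pulled:

    import requests

    # Ollama's generate endpoint; the model spills over to CPU/RAM when it does not fit into VRAM.
    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={
            "model": "deepseek-r1:70b",
            "prompt": "Explain pipeline parallelism in two sentences.",
            "stream": False,   # wait for the full answer instead of streaming tokens
        },
        timeout=3600,          # a 70b model on CPU can take a long time per response
    )
    resp.raise_for_status()
    print(resp.json()["response"])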

AI/ML

Ollama with Open WebUI on 2 x RTX 3060 12 GB

Ollama with WebUI on 2 “powerful” GPUs feels like the commercial GPTs online. I thought that Exo would do the job and utilize both of my Lab servers. Unfortunately, it does not work on Linux/NVIDIA with my setup when following the official documentation. So I went back to Ollama and found it great. I have 2 x NVIDIA RTX 3060 with 12 GB VRAM each, giving me 24 GB in total, which can run Gemma3:27b or DeepSeek-r1:32b. Ollama can utilize both GPUs in my system, which can be seen in nvidia-smi. How to run Ollama in Docker with GPU acceleration you can read
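
To check whether both cards are actually carrying load, a small helper like this sketch can poll nvidia-smi (it assumes the NVIDIA driver and nvidia-smi are installed on the host):

    import subprocess

    # Query per-GPU memory usage; with a model split across both RTX 3060s,
    # both rows should show significant memory.used.
    out = subprocess.check_output(
        [
            "nvidia-smi",
            "--query-gpu=index,name,memory.used,memory.total,utilization.gpu",
            "--format=csv,noheader",
        ],
        text=True,
    )
    for line in out.strip().splitlines():
        print(line)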

AI/ML

Exo: the GPU cluster (tinygrad | MLX)

Theory: running AI workloads spread across various devices using pipeline-parallel inference. In theory, Exo provides a way to run memory-heavy AI/LLM model workloads on many different devices, spreading memory and computation across them. They say: “Unify your existing devices into one powerful GPU: iPhone, iPad, Android, Mac, NVIDIA, Raspberry Pi, pretty much any device!” People say: “It requires mlx but it is an Apple silicon-only library as far as I can tell. How is it supposed to be (I quote) “iPhone, iPad, Android, Mac, Linux, pretty much any device”? Has it been tested on anything else than the

AI/ML

Object detection and scene description: various libraries/frameworks tested lately

No, you can’t use a Tesla K20Xm with 6 GB VRAM for modern computation, as it has a Compute Capability lower than the required 7.0. Here is a table of my findings about libraries/frameworks, required hardware and their purpose. I started with DeepStack, where I was able to run an API server for object detection; Frigate has support for it. Later on, with TensorRT on an NVIDIA GPU I can run the Yolov7x-640 model, also for object detection; Frigate works well with it. With a Google Coral TPU USB module we can run SSD MobileNet or EfficientDet models with great power efficiency at a good price. Ollama with moondream
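
A quick way to check whether a given NVIDIA card clears the Compute Capability 7.0 bar is a short PyTorch snippet like this sketch (it assumes PyTorch with CUDA support is installed):

    import torch

    REQUIRED = (7, 0)  # minimum Compute Capability expected by the frameworks above

    for i in range(torch.cuda.device_count()):
        name = torch.cuda.get_device_name(i)
        cc = torch.cuda.get_device_capability(i)   # e.g. (3, 5) for a Tesla K20Xm
        verdict = "OK" if cc >= REQUIRED else "too old"
        print(f"GPU {i}: {name}, CC {cc[0]}.{cc[1]} -> {verdict}")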

AI/ML

Qwen LLM

What is Qwen? “This is the organization of Qwen, which refers to the large language model family built by Alibaba Cloud. In this organization, we continuously release large language models (LLM), large multimodal models (LMM), and other AGI-related projects. Check them out and enjoy!” What models do they provide? They have provided a wide range of models since 2023. The original model was just called Qwen and can still be found on GitHub. The current model, Qwen2.5, has its own repository, also on GitHub. General-purpose models are just Qwen, but there are also code-specific models. There are also Math, Audio and
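
As a minimal sketch (not from the post itself), one of the published Qwen2.5 instruct checkpoints could be tried locally with Hugging Face transformers; the model ID Qwen/Qwen2.5-7B-Instruct and the example prompt are just illustrations:

    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_id = "Qwen/Qwen2.5-7B-Instruct"  # one of the general-purpose instruct checkpoints
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    # device_map="auto" needs the accelerate package and enough GPU/CPU memory for the weights.
    model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

    messages = [{"role": "user", "content": "Give me a one-line summary of Qwen."}]
    inputs = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt"
    ).to(model.device)

    outputs = model.generate(inputs, max_new_tokens=64)
    print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))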

Technology

Install Proxmox on Scaleway using Dell’s iDRAC

Last time (somewhere around 2023) there was an option on Scaleway to install Proxmox 7 directly from an appliance. There was also the possibility to use Debian 11 and install Proxmox on top of it. This time (Nov 2024) there is no direct install, and installing on Debian gives me some unexpected errors which I do not want to work around, as it should just work. But there is an option to use Dell’s iDRAC interface for remote access.

AI/ML

NVIDIA CC 7.0+: how to run Ollama/moondream:1.8b

Well, in one of the previous articles I described how to invoke Ollama/moondream:1.8b using cURL; however, I forgot to explain how to even run it in a Docker container. So here you go: You can run a particular model in the background (-d) or in the foreground (without the -d parameter). You can also define parallelism and the maximum queue size on the Ollama server. One important note regarding the stability of the Ollama server: once it runs for more than a few hours, there might be an issue with the GPU driver which requires a restart, so Ollama needs to be monitored for such a scenario. Moreover, after minutes of idle time it
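
A sketch of the same idea using the Docker SDK for Python instead of the docker CLI; the image name ollama/ollama and port 11434 are the defaults, OLLAMA_NUM_PARALLEL and OLLAMA_MAX_QUEUE are the server-side tuning variables I refer to, and the values here are only examples:

    import docker

    client = docker.from_env()

    # Start the Ollama server in the background (detach=True is the SDK equivalent of `docker run -d`),
    # with all NVIDIA GPUs passed through and basic parallelism/queue limits set.
    container = client.containers.run(
        "ollama/ollama",
        detach=True,
        name="ollama",
        ports={"11434/tcp": 11434},
        environment={
            "OLLAMA_NUM_PARALLEL": "2",   # example value: concurrent requests per model
            "OLLAMA_MAX_QUEUE": "128",    # example value: maximum queued requests
        },
        device_requests=[
            docker.types.DeviceRequest(count=-1, capabilities=[["gpu"]])
        ],
        volumes={"ollama": {"bind": "/root/.ollama", "mode": "rw"}},
    )

    # Pull the moondream model inside the running container; prompts can then go over the HTTP API.
    result = container.exec_run("ollama pull moondream:1.8b")
    print(result.output.decode())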

AI/ML

OpenVINO in AI computer vision object detection (Frigate + OpenVINO)

What is OpenVINO? “Open-source software toolkit for optimizing and deploying deep learning models.” It has been developed by Intel since 2018. It supports LLMs, computer vision and generative AI. It runs on Windows, Linux and macOS. As for Ubuntu, it is recommended to run it on 22.04 LTS or higher. It utilizes OpenCL drivers. In theory, libraries using OpenCL (such as OpenVINO) should be cross-platform, contrary to vendor-locked solutions like CUDA. In theory, OpenVINO should then work on both Intel and AMD hardware. The internet says that it works, but as for now I need to order some additional hardware to
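
A rough sketch of the OpenVINO Python API for loading a detector; the model path is hypothetical, and the "GPU" device string picks the OpenCL-backed Intel GPU plugin when one is present:

    import openvino as ov

    core = ov.Core()
    # List the inference devices OpenVINO can see on this machine, e.g. ['CPU', 'GPU'].
    print("Available devices:", core.available_devices)

    # Hypothetical IR model (e.g. an SSDLite detector exported for use with Frigate).
    model = core.read_model("model/ssdlite_mobilenet_v2.xml")

    # Prefer the OpenCL-based GPU plugin, fall back to CPU if no GPU device is found.
    device = "GPU" if "GPU" in core.available_devices else "CPU"
    compiled = core.compile_model(model, device)
    print(f"Model compiled for {device}")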