Utilize CPU, RAM, and GPU computational resources

With Ollama you can use not only a GPU but also the CPU with regular RAM to run LLM models such as DeepSeek-R1:70b. Of course, you need both a fast CPU and fast RAM, and plenty of the latter. My Lab setup has 24 vCPUs (2 CPUs × 6 cores × 2 threads) and from 128 to 384 GB of RAM. Once started, Ollama allocates 22.4 GB of resident memory (RES) and 119 GB of virtual memory. It runs at 1200% CPU utilization, pushing the system load up to 12, yet that is still only 50% of total CPU capacity.
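A minimal sketch of how to reproduce those numbers, assuming Ollama is already installed; the model tag comes from the Ollama library, and ollama ps reports how a loaded model is split between CPU and GPU (output details vary between Ollama versions):

    $ ollama run deepseek-r1:70b "Why is the sky blue?"   # pulls and loads ~40 GB of weights
    $ ollama ps                                           # shows the CPU/GPU split of the loaded model
    $ top -p "$(pgrep -d, ollama)"                        # watch RES and %CPU of the Ollama processes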
Ollama with WebUI on 2 “powerful” GPUs feels like commercial GPTs online

I thought that Exo would do the job and utilize both of my Lab servers. Unfortunately, it does not work on Linux/NVIDIA with my setup, even when following the official documentation. So I went back to Ollama, and I found it great. I have 2 x NVIDIA RTX 3060 cards with 12 GB VRAM each, giving me 24 GB in total, which can run Gemma3:27b or DeepSeek-R1:32b. Ollama can utilize both GPUs in my system, which can be seen in nvidia-smi. How to run Ollama in Docker with GPU acceleration is sketched below.
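A rough sketch of that Docker setup, assuming the NVIDIA Container Toolkit is installed; the images, ports, and flags follow the Ollama and Open WebUI documentation, but verify them against the current READMEs:

    $ # Ollama with access to all GPUs
    $ docker run -d --gpus=all -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama
    $ # Open WebUI pointed at the Ollama API on the host
    $ docker run -d -p 3000:8080 --add-host=host.docker.internal:host-gateway \
        -v open-webui:/app/backend/data --name open-webui --restart always \
        ghcr.io/open-webui/open-webui:main
    $ # Confirm both RTX 3060 cards are visible and busy during inference
    $ nvidia-smi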
Theory: running AI workloads spread across various devices using pipeline-parallel inference

In theory, Exo provides a way to run memory-heavy AI/LLM workloads on many different devices, spreading memory and computation across them. They say: “Unify your existing devices into one powerful GPU: iPhone, iPad, Android, Mac, NVIDIA, Raspberry Pi, pretty much any device!” People say: “It requires mlx but it is an Apple silicon-only library as far as I can tell. How is it supposed to be (I quote) ‘iPhone, iPad, Android, Mac, Linux, pretty much any device’? Has it been tested on anything else than the…”
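For context, this is roughly what trying Exo looks like; the steps paraphrase the exo repository's README as I found it, so treat them as an assumption rather than a working recipe (on my Linux/NVIDIA setup this is exactly what failed):

    $ git clone https://github.com/exo-explore/exo.git
    $ cd exo
    $ pip install -e .
    $ exo   # run on every device; nodes are supposed to discover each other
            # over the LAN and pipeline the model's layers across them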