AI/ML

NVIDIA CC 7.0+: how to run Ollama/moondream:1.8b

In one of the previous articles I described how to invoke Ollama/moondream:1.8b using cURL, but I forgot to explain how to run it in a Docker container in the first place. So here you go:

sudo docker run -d --gpus=all -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama
sudo docker exec -it ollama ollama run moondream:1.8b

The -d flag runs the container in the background (detached); omit it to run the container in the foreground. You can also set the parallelism and the maximum queue length of the Ollama server:

sudo docker run -d --gpus=all -v ollama:/root/.ollama -e OLLAMA_NUM_PARALLEL=8 -e OLLAMA_MAX_QUEUE=32 -p 11434:11434 --name ollama ollama/ollama
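One way to see the effect of OLLAMA_NUM_PARALLEL is to send several requests at once. The following is only a minimal sketch, not a benchmark: the function name fire_parallel and the prompt are made up for illustration, and it assumes the server listens on localhost:11434 with moondream:1.8b already pulled, as in the commands above.

```shell
# Hypothetical helper: send N identical requests to Ollama concurrently.
# Assumes the container from above is running and moondream:1.8b is pulled.
fire_parallel() {
  n="$1"
  i=1
  while [ "$i" -le "$n" ]; do
    # Each curl runs in the background so the requests overlap.
    curl -s http://localhost:11434/api/generate \
      -d '{"model": "moondream:1.8b", "prompt": "Say hi", "stream": false}' &
    i=$((i + 1))
  done
  # Block until every background request has finished.
  wait
}
```

Calling fire_parallel 8 against the server started above should let the requests be processed side by side, memory permitting. Requests beyond OLLAMA_NUM_PARALLEL wait in the queue, and once the queue exceeds OLLAMA_MAX_QUEUE the server rejects new ones.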

One important note concerns the stability of the Ollama server. After it has been running for more than a few hours, it may hit a GPU driver issue that requires a restart, so Ollama needs to be monitored for this scenario. Moreover, after a few minutes of idle time (5 minutes by default, configurable with OLLAMA_KEEP_ALIVE) Ollama unloads the model from VRAM, freeing it up. Be aware that if you then allocate that VRAM to something else, Ollama might fail to load the model again with an out-of-memory error.
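Monitoring can be as simple as a periodic health check. Below is a minimal sketch under stated assumptions: the function name check_ollama is hypothetical, the container is called ollama and publishes port 11434 as in the commands above, and restarting the container is assumed to be an acceptable recovery step.

```shell
# Hypothetical watchdog step: restart the container if the API stops answering.
# The container name "ollama" and port 11434 match the docker run commands above.
check_ollama() {
  # Ollama answers a plain GET on / when the server is up;
  # -f makes curl fail on HTTP errors, --max-time bounds a hung server.
  if curl -sf --max-time 10 http://localhost:11434/ > /dev/null; then
    echo "ollama: ok"
  else
    echo "ollama: not responding, restarting container"
    sudo docker restart ollama
  fi
}
```

You could call check_ollama from cron, or in a loop such as: while true; do check_ollama; sleep 60; done. Whether restarting the container is enough depends on the failure; a wedged driver may need a host-level fix, which this sketch does not attempt.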

A final note: Ollama requires NVIDIA Compute Capability 7.0 or greater to run, which in practice means TITAN V, Quadro GV100, Tesla V100, or later consumer-grade cards with CC 7.5 such as the GTX 1650. You can therefore treat the GTX 1650 as the minimum card for running Ollama for now.
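To check a given machine against this threshold, you can compare the compute capability reported by the driver with 7.0. This is a sketch only: cc_at_least is a hypothetical helper name, and the usage lines assume an NVIDIA driver recent enough for nvidia-smi to support the compute_cap query field.

```shell
# Hypothetical helper: succeed if compute capability $1 is >= required $2 (default 7.0).
cc_at_least() {
  # Compare major and minor parts separately as integers.
  awk -v have="$1" -v need="${2:-7.0}" 'BEGIN {
    split(have, h, "."); split(need, n, ".")
    ok = (h[1] + 0 > n[1] + 0) || (h[1] + 0 == n[1] + 0 && h[2] + 0 >= n[2] + 0)
    exit !ok
  }'
}

# Usage (needs an NVIDIA driver that supports the compute_cap query field):
#   cc=$(nvidia-smi --query-gpu=compute_cap --format=csv,noheader | head -n 1)
#   cc_at_least "$cc" 7.0 && echo "GPU meets the CC 7.0 requirement"
```

On a multi-GPU host the usage lines only look at the first card; loop over the nvidia-smi output if you need to check all of them.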