Get up and running with Llama 3.3, DeepSeek-R1, Phi-4, Gemma 2, and other large language models.
Configuration
Ollama is configurable using environment variables; the available variables are listed in the table below, and an example of setting them follows the table.
| Variable | Description |
|---|---|
| OLLAMA_DEBUG | Show additional debug information (e.g. OLLAMA_DEBUG=1) |
| OLLAMA_FLASH_ATTENTION | Enable flash attention |
| OLLAMA_GPU_OVERHEAD | Reserve a portion of VRAM per GPU (bytes) |
| OLLAMA_HOST | IP address for the Ollama server (default 127.0.0.1:11434) |
| OLLAMA_KEEP_ALIVE | The duration that models stay loaded in memory (default "5m") |
| OLLAMA_LLM_LIBRARY | Set LLM library to bypass autodetection |
| OLLAMA_LOAD_TIMEOUT | How long to allow model loads to stall before giving up (default "5m") |
| OLLAMA_MAX_LOADED_MODELS | Maximum number of loaded models per GPU |
| OLLAMA_MAX_QUEUE | Maximum number of queued requests |
| OLLAMA_MODELS | The path to the models directory |
| OLLAMA_NOHISTORY | Do not preserve readline history |
| OLLAMA_NOPRUNE | Do not prune model blobs on startup |
| OLLAMA_NUM_PARALLEL | Maximum number of parallel requests |
| OLLAMA_ORIGINS | A comma-separated list of allowed origins |
| OLLAMA_SCHED_SPREAD | Always schedule models across all GPUs |
| OLLAMA_TMPDIR | Location for temporary files |
| OLLAMA_MULTIUSER_CACHE | Optimize prompt caching for multi-user scenarios |
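As a minimal sketch, these variables can be set in the shell environment that launches the server; the values below are illustrative only, not recommendations.

```shell
# Illustrative values only; adjust for your setup.
OLLAMA_HOST=0.0.0.0:11434 \
OLLAMA_KEEP_ALIVE=10m \
OLLAMA_MAX_LOADED_MODELS=2 \
ollama serve
```

Variables set this way apply only to that invocation of the server; when Ollama runs as a managed service, the variables must be set in the service's own environment instead.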