Get up and running with Llama 3.3, DeepSeek-R1, Phi-4, Gemma 2, and other large language models.
Configuration
Ollama is configurable using environment variables; the available variables are listed in the table below, and an example of setting them follows the table.
| Variable | Description |
|---|---|
| OLLAMA_DEBUG | Show additional debug information (e.g. OLLAMA_DEBUG=1) |
| OLLAMA_FLASH_ATTENTION | Enable flash attention |
| OLLAMA_GPU_OVERHEAD | Reserve a portion of VRAM per GPU (bytes) |
| OLLAMA_HOST | IP address for the Ollama server (default 127.0.0.1:11434) |
| OLLAMA_KEEP_ALIVE | The duration that models stay loaded in memory (default "5m") |
| OLLAMA_LLM_LIBRARY | Set LLM library to bypass autodetection |
| OLLAMA_LOAD_TIMEOUT | How long to allow model loads to stall before giving up (default "5m") |
| OLLAMA_MAX_LOADED_MODELS | Maximum number of loaded models per GPU |
| OLLAMA_MAX_QUEUE | Maximum number of queued requests |
| OLLAMA_MODELS | The path to the models directory |
| OLLAMA_NOHISTORY | Do not preserve readline history |
| OLLAMA_NOPRUNE | Do not prune model blobs on startup |
| OLLAMA_NUM_PARALLEL | Maximum number of parallel requests |
| OLLAMA_ORIGINS | A comma-separated list of allowed origins |
| OLLAMA_SCHED_SPREAD | Always schedule models across all GPUs |
| OLLAMA_TMPDIR | Location for temporary files |
| OLLAMA_MULTIUSER_CACHE | Optimize prompt caching for multi-user scenarios |
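As a minimal sketch, these variables can be set in the shell environment that launches the server; the values below are illustrative only, not recommendations.

```shell
# Illustrative values only; adjust for your setup.
OLLAMA_HOST=0.0.0.0:11434 \
OLLAMA_KEEP_ALIVE=10m \
OLLAMA_MAX_LOADED_MODELS=2 \
ollama serve
```

Variables set this way apply only to that invocation of the server; when Ollama runs as a managed service, the variables must be set in the service's own environment instead.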