Configuration

Ollama is configured through environment variables; the variables it recognizes are listed in the table below.
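For example, a variable can be set for a single run of the server by prefixing the command (a minimal shell sketch; the bind address shown is illustrative, not a recommendation):

```shell
# Enable verbose debug logging and bind to all interfaces for this
# invocation only (the default bind address is 127.0.0.1:11434).
OLLAMA_DEBUG=1 OLLAMA_HOST=0.0.0.0:11434 ollama serve
```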

| Variable | Description |
| --- | --- |
| `OLLAMA_DEBUG` | Show additional debug information (e.g. `OLLAMA_DEBUG=1`) |
| `OLLAMA_FLASH_ATTENTION` | Enable flash attention |
| `OLLAMA_GPU_OVERHEAD` | Reserve a portion of VRAM per GPU (bytes) |
| `OLLAMA_HOST` | IP address and port for the Ollama server (default `127.0.0.1:11434`) |
| `OLLAMA_KEEP_ALIVE` | The duration that models stay loaded in memory (default `5m`) |
| `OLLAMA_LLM_LIBRARY` | Set LLM library to bypass autodetection |
| `OLLAMA_LOAD_TIMEOUT` | How long to allow model loads to stall before giving up (default `5m`) |
| `OLLAMA_MAX_LOADED_MODELS` | Maximum number of loaded models per GPU |
| `OLLAMA_MAX_QUEUE` | Maximum number of queued requests |
| `OLLAMA_MODELS` | The path to the models directory |
| `OLLAMA_NOHISTORY` | Do not preserve readline history |
| `OLLAMA_NOPRUNE` | Do not prune model blobs on startup |
| `OLLAMA_NUM_PARALLEL` | Maximum number of parallel requests |
| `OLLAMA_ORIGINS` | A comma-separated list of allowed origins |
| `OLLAMA_SCHED_SPREAD` | Always schedule the model across all GPUs |
| `OLLAMA_TMPDIR` | Location for temporary files |
| `OLLAMA_MULTIUSER_CACHE` | Optimize prompt caching for multi-user scenarios |
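A persistent setup typically exports these variables in the environment of the process that starts the server. The sketch below combines a few of the variables from the table; the path, duration, and origin are illustrative values, not defaults:

```shell
# Illustrative persistent configuration (values are examples, not defaults).
export OLLAMA_MODELS=/data/ollama/models       # store models on a larger disk
export OLLAMA_KEEP_ALIVE=1h                    # keep loaded models in memory for an hour
export OLLAMA_ORIGINS=https://app.example.com  # allow this origin to call the API
ollama serve
```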