
LocalAI returns Server error error="could not load model: rpc error: code = Unavailable desc = error reading from server: EOF" #2692

@CyberGWJ

Description


LocalAI version:
Fresh install of the latest version

Environment, CPU architecture, OS, and Version:
Proxmox VM: Linux localAI 6.1.0-22-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.94-1 (2024-06-21) x86_64 GNU/Linux

Describe the bug
When trying to chat using the request below, LocalAI returns the error shown in the title.

curl http://localhost:8989/v1/chat/completions -H "Content-Type: application/json" -d '{
"model": "Luna-AI-Llama2-Uncensored-GGUF",
"messages": [{"role": "user", "content": "How are you?"}],
"temperature": 0.9
}'

To Reproduce
Do a fresh install on a VM (Proxmox) and do the following steps.

1.  apt install curl git (git may not be needed, but I normally install it)
2.  curl https://localai.io/install.sh | PORT=8989 USE_AIO=false sh (I have tried USE_AIO=true and get the same results)
3. nano /etc/localai.env and add GALLERIES=[{"name":"model-gallery", "url":"github:go-skynet/model-gallery/index.yaml"}, {"url": "github:go-skynet/model-gallery/huggingface.yaml","name":"huggingface"}]
4. nano /usr/share/local-ai/models/Luna-AI-Llama2-Uncensored-GGUF.yaml

name: Luna-AI-Llama2-Uncensored-GGUF
context_size: 2048
trimsuffix:
- "\n"
mmap: false
parameters:
  model: huggingface://TheBloke/Luna-AI-Llama2-Uncensored-GGUF/luna-ai-llama2-uncensored.Q5_K_M.gguf
  top_k: 80
  temperature: 0.2
  top_p: 0.7
backend: llama
roles:
  assistant: 'ASSISTANT:'
  system: 'SYSTEM:'
  user: 'USER:'
template:
  chat: lunademo-chat
  completion: lunademo-completion

5. nano /usr/share/local-ai/models/lunademo-chat.tmpl

USER: {{.Input}}

ASSISTANT:

6. nano /usr/share/local-ai/models/lunademo-completion.tmpl

Complete the following sentence: {{.Input}}

7. systemctl stop local-ai
8. local-ai --debug
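
For reference, a quick sanity check on the downloaded model file (the md5-named file in the models directory matches the one in the debug log below; the expected size is approximate and just what I'd assume for a Q5_K_M quant of a 7B model):

ls -lh /usr/share/local-ai/models/
# f83553a34a79b75aca661acbf73b8d62 should be roughly 4.8 GB;
# a much smaller file would point to a truncated download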

Expected behavior
LocalAI should return a chat completion response.
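
Something along the lines of the usual OpenAI-compatible completion shape (illustrative only, not a response I actually received):

{
  "object": "chat.completion",
  "model": "Luna-AI-Llama2-Uncensored-GGUF",
  "choices": [
    {
      "index": 0,
      "message": { "role": "assistant", "content": "I'm doing well, thank you!" },
      "finish_reason": "stop"
    }
  ]
}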

Logs
root@localAI:~# local-ai --debug
8:30AM INF env file found, loading environment variables from file envFile=/etc/localai.env
8:30AM DBG Setting logging to debug
8:30AM INF Starting LocalAI using 8 threads, with models path: /usr/share/local-ai/models
8:30AM INF LocalAI version: ()
8:30AM DBG CPU capabilities: [aes apic clflush cmov constant_tsc cpuid cpuid_fault cx16 cx8 de fpu fxsr ht hypervisor lahf_lm lm mca mce mmx msr mtrr nopl nx pae pat pge pni popcnt pse pse36 pti sep sse sse2 sse4_1 sse4_2 ssse3 syscall tsc tsc_known_freq x2apic xtopology]
8:30AM DBG GPU count: 1
8:30AM DBG GPU: card #0 @0000:00:02.0 -> driver: 'bochs-drm' class: 'Display controller' vendor: 'unknown' product: 'unknown'
8:30AM DBG guessDefaultsFromFile: template already set name=Luna-AI-Llama2-Uncensored-GGUF
8:30AM INF Preloading models from /usr/share/local-ai/models

Model name: Luna-AI-Llama2-Uncensored-GGUF

8:30AM DBG Model: Luna-AI-Llama2-Uncensored-GGUF (config: {PredictionOptions:{Model:f83553a34a79b75aca661acbf73b8d62 Language: Translate:false N:0 TopP:0xc000cf8de0 TopK:0xc000cf8db8 Temperature:0xc000cf8dc0 Maxtokens:0xc000cf8e98 Echo:false Batch:0 IgnoreEOS:false RepeatPenalty:0 RepeatLastN:0 Keep:0 FrequencyPenalty:0 PresencePenalty:0 TFZ:0xc000cf8e90 TypicalP:0xc000cf8e88 Seed:0xc000cf8eb0 NegativePrompt: RopeFreqBase:0 RopeFreqScale:0 NegativePromptScale:0 UseFastTokenizer:false ClipSkip:0 Tokenizer:} Name:Luna-AI-Llama2-Uncensored-GGUF F16:0xc000cf8e50 Threads:0xc000cf8e48 Debug:0xc000cf8ea8 Roles:map[assistant:ASSISTANT: system:SYSTEM: user:USER:] Embeddings:false Backend:llama TemplateConfig:{Chat:lunademo-chat ChatMessage: Completion:lunademo-completion Edit: Functions: UseTokenizerTemplate:false JoinChatMessagesByCharacter:} PromptStrings:[] InputStrings:[] InputToken:[] functionCallString: functionCallNameString: ResponseFormat: ResponseFormatMap:map[] FunctionsConfig:{DisableNoAction:false GrammarConfig:{ParallelCalls:false DisableParallelNewLines:false MixedMode:false NoMixedFreeString:false NoGrammar:false Prefix: ExpectStringsAfterJSON:false PropOrder:} NoActionFunctionName: NoActionDescriptionName: ResponseRegex:[] JSONRegexMatch:[] ReplaceFunctionResults:[] ReplaceLLMResult:[] CaptureLLMResult:[] FunctionName:false} FeatureFlag:map[] LLMConfig:{SystemPrompt: TensorSplit: MainGPU: RMSNormEps:0 NGQA:0 PromptCachePath: PromptCacheAll:false PromptCacheRO:false MirostatETA:0xc000cf8e80 MirostatTAU:0xc000cf8e78 Mirostat:0xc000cf8e70 NGPULayers:0xc000cf8ea0 MMap:0xc000cf8cb8 MMlock:0xc000cf8ea9 LowVRAM:0xc000cf8ea9 Grammar: StopWords:[] Cutstrings:[] TrimSpace:[] TrimSuffix:[
] ContextSize:0xc000cf8ca8 NUMA:false LoraAdapter: LoraBase: LoraScale:0 NoMulMatQ:false DraftModel: NDraft:0 Quantization: GPUMemoryUtilization:0 TrustRemoteCode:false EnforceEager:false SwapSpace:0 MaxModelLen:0 TensorParallelSize:0 MMProj: FlashAttention:false NoKVOffloading:false RopeScaling: ModelType: YarnExtFactor:0 YarnAttnFactor:0 YarnBetaFast:0 YarnBetaSlow:0} AutoGPTQ:{ModelBaseName: Device: Triton:false UseFastTokenizer:false} Diffusers:{CUDA:false PipelineType: SchedulerType: EnableParameters: CFGScale:0 IMG2IMG:false ClipSkip:0 ClipModel: ClipSubFolder: ControlNet:} Step:0 GRPC:{Attempts:0 AttemptsSleepTime:0} TTSConfig:{Voice: VallE:{AudioPath:}} CUDA:false DownloadFiles:[] Description: Usage:})
8:30AM DBG Extracting backend assets files to /tmp/localai/backend_data
8:30AM DBG processing api keys runtime update
8:30AM DBG processing external_backends.json
8:30AM DBG external backends loaded from external_backends.json
8:30AM INF core/startup process completed!
8:30AM DBG No configuration file found at /tmp/localai/upload/uploadedFiles.json
8:30AM DBG No configuration file found at /tmp/localai/config/assistants.json
8:30AM DBG No configuration file found at /tmp/localai/config/assistantsFile.json
8:30AM INF LocalAI API is listening! Please connect to the endpoint for API documentation. endpoint=http://0.0.0.0:8989
8:30AM DBG Request received: {"model":"Luna-AI-Llama2-Uncensored-GGUF","language":"","translate":false,"n":0,"top_p":null,"top_k":null,"temperature":0.9,"max_tokens":null,"echo":false,"batch":0,"ignore_eos":false,"repeat_penalty":0,"repeat_last_n":0,"n_keep":0,"frequency_penalty":0,"presence_penalty":0,"tfz":null,"typical_p":null,"seed":null,"negative_prompt":"","rope_freq_base":0,"rope_freq_scale":0,"negative_prompt_scale":0,"use_fast_tokenizer":false,"clip_skip":0,"tokenizer":"","file":"","size":"","prompt":null,"instruction":"","input":null,"stop":null,"messages":[{"role":"user","content":"How are you?"}],"functions":null,"function_call":null,"stream":false,"mode":0,"step":0,"grammar":"","grammar_json_functions":null,"grammar_json_name":null,"backend":"","model_base_name":""}
8:30AM DBG guessDefaultsFromFile: template already set name=Luna-AI-Llama2-Uncensored-GGUF
8:30AM DBG Configuration read: &{PredictionOptions:{Model:f83553a34a79b75aca661acbf73b8d62 Language: Translate:false N:0 TopP:0xc000cf8de0 TopK:0xc000cf8db8 Temperature:0xc000515930 Maxtokens:0xc000cf8e98 Echo:false Batch:0 IgnoreEOS:false RepeatPenalty:0 RepeatLastN:0 Keep:0 FrequencyPenalty:0 PresencePenalty:0 TFZ:0xc000cf8e90 TypicalP:0xc000cf8e88 Seed:0xc000cf8eb0 NegativePrompt: RopeFreqBase:0 RopeFreqScale:0 NegativePromptScale:0 UseFastTokenizer:false ClipSkip:0 Tokenizer:} Name:Luna-AI-Llama2-Uncensored-GGUF F16:0xc000cf8e50 Threads:0xc000cf8e48 Debug:0xc000515a20 Roles:map[assistant:ASSISTANT: system:SYSTEM: user:USER:] Embeddings:false Backend:llama TemplateConfig:{Chat:lunademo-chat ChatMessage: Completion:lunademo-completion Edit: Functions: UseTokenizerTemplate:false JoinChatMessagesByCharacter:} PromptStrings:[] InputStrings:[] InputToken:[] functionCallString: functionCallNameString: ResponseFormat: ResponseFormatMap:map[] FunctionsConfig:{DisableNoAction:false GrammarConfig:{ParallelCalls:false DisableParallelNewLines:false MixedMode:false NoMixedFreeString:false NoGrammar:false Prefix: ExpectStringsAfterJSON:false PropOrder:} NoActionFunctionName: NoActionDescriptionName: ResponseRegex:[] JSONRegexMatch:[] ReplaceFunctionResults:[] ReplaceLLMResult:[] CaptureLLMResult:[] FunctionName:false} FeatureFlag:map[] LLMConfig:{SystemPrompt: TensorSplit: MainGPU: RMSNormEps:0 NGQA:0 PromptCachePath: PromptCacheAll:false PromptCacheRO:false MirostatETA:0xc000cf8e80 MirostatTAU:0xc000cf8e78 Mirostat:0xc000cf8e70 NGPULayers:0xc000cf8ea0 MMap:0xc000cf8cb8 MMlock:0xc000cf8ea9 LowVRAM:0xc000cf8ea9 Grammar: StopWords:[] Cutstrings:[] TrimSpace:[] TrimSuffix:[
] ContextSize:0xc000cf8ca8 NUMA:false LoraAdapter: LoraBase: LoraScale:0 NoMulMatQ:false DraftModel: NDraft:0 Quantization: GPUMemoryUtilization:0 TrustRemoteCode:false EnforceEager:false SwapSpace:0 MaxModelLen:0 TensorParallelSize:0 MMProj: FlashAttention:false NoKVOffloading:false RopeScaling: ModelType: YarnExtFactor:0 YarnAttnFactor:0 YarnBetaFast:0 YarnBetaSlow:0} AutoGPTQ:{ModelBaseName: Device: Triton:false UseFastTokenizer:false} Diffusers:{CUDA:false PipelineType: SchedulerType: EnableParameters: CFGScale:0 IMG2IMG:false ClipSkip:0 ClipModel: ClipSubFolder: ControlNet:} Step:0 GRPC:{Attempts:0 AttemptsSleepTime:0} TTSConfig:{Voice: VallE:{AudioPath:}} CUDA:false DownloadFiles:[] Description: Usage:}
8:30AM DBG Parameters: &{PredictionOptions:{Model:f83553a34a79b75aca661acbf73b8d62 Language: Translate:false N:0 TopP:0xc000cf8de0 TopK:0xc000cf8db8 Temperature:0xc000515930 Maxtokens:0xc000cf8e98 Echo:false Batch:0 IgnoreEOS:false RepeatPenalty:0 RepeatLastN:0 Keep:0 FrequencyPenalty:0 PresencePenalty:0 TFZ:0xc000cf8e90 TypicalP:0xc000cf8e88 Seed:0xc000cf8eb0 NegativePrompt: RopeFreqBase:0 RopeFreqScale:0 NegativePromptScale:0 UseFastTokenizer:false ClipSkip:0 Tokenizer:} Name:Luna-AI-Llama2-Uncensored-GGUF F16:0xc000cf8e50 Threads:0xc000cf8e48 Debug:0xc000515a20 Roles:map[assistant:ASSISTANT: system:SYSTEM: user:USER:] Embeddings:false Backend:llama TemplateConfig:{Chat:lunademo-chat ChatMessage: Completion:lunademo-completion Edit: Functions: UseTokenizerTemplate:false JoinChatMessagesByCharacter:} PromptStrings:[] InputStrings:[] InputToken:[] functionCallString: functionCallNameString: ResponseFormat: ResponseFormatMap:map[] FunctionsConfig:{DisableNoAction:false GrammarConfig:{ParallelCalls:false DisableParallelNewLines:false MixedMode:false NoMixedFreeString:false NoGrammar:false Prefix: ExpectStringsAfterJSON:false PropOrder:} NoActionFunctionName: NoActionDescriptionName: ResponseRegex:[] JSONRegexMatch:[] ReplaceFunctionResults:[] ReplaceLLMResult:[] CaptureLLMResult:[] FunctionName:false} FeatureFlag:map[] LLMConfig:{SystemPrompt: TensorSplit: MainGPU: RMSNormEps:0 NGQA:0 PromptCachePath: PromptCacheAll:false PromptCacheRO:false MirostatETA:0xc000cf8e80 MirostatTAU:0xc000cf8e78 Mirostat:0xc000cf8e70 NGPULayers:0xc000cf8ea0 MMap:0xc000cf8cb8 MMlock:0xc000cf8ea9 LowVRAM:0xc000cf8ea9 Grammar: StopWords:[] Cutstrings:[] TrimSpace:[] TrimSuffix:[
] ContextSize:0xc000cf8ca8 NUMA:false LoraAdapter: LoraBase: LoraScale:0 NoMulMatQ:false DraftModel: NDraft:0 Quantization: GPUMemoryUtilization:0 TrustRemoteCode:false EnforceEager:false SwapSpace:0 MaxModelLen:0 TensorParallelSize:0 MMProj: FlashAttention:false NoKVOffloading:false RopeScaling: ModelType: YarnExtFactor:0 YarnAttnFactor:0 YarnBetaFast:0 YarnBetaSlow:0} AutoGPTQ:{ModelBaseName: Device: Triton:false UseFastTokenizer:false} Diffusers:{CUDA:false PipelineType: SchedulerType: EnableParameters: CFGScale:0 IMG2IMG:false ClipSkip:0 ClipModel: ClipSubFolder: ControlNet:} Step:0 GRPC:{Attempts:0 AttemptsSleepTime:0} TTSConfig:{Voice: VallE:{AudioPath:}} CUDA:false DownloadFiles:[] Description: Usage:}
8:30AM DBG Prompt (before templating): USER:How are you?
8:30AM DBG Template found, input modified to: USER: USER:How are you?

ASSISTANT:

8:30AM DBG Prompt (after templating): USER: USER:How are you?

ASSISTANT:

8:30AM INF Loading model 'f83553a34a79b75aca661acbf73b8d62' with backend llama
8:30AM DBG llama-cpp is an alias of llama-cpp
8:30AM DBG Loading model in memory from file: /usr/share/local-ai/models/f83553a34a79b75aca661acbf73b8d62
8:30AM DBG Loading Model f83553a34a79b75aca661acbf73b8d62 with gRPC (file: /usr/share/local-ai/models/f83553a34a79b75aca661acbf73b8d62) (backend: llama-cpp): {backendString:llama model:f83553a34a79b75aca661acbf73b8d62 threads:8 assetDir:/tmp/localai/backend_data context:{emptyCtx:{}} gRPCOptions:0xc0009b3688 externalBackends:map[] grpcAttempts:20 grpcAttemptsDelay:2 singleActiveBackend:false parallelRequests:false}
8:30AM INF [llama-cpp] attempting to load with fallback variant
8:30AM DBG ld.so found
8:30AM DBG Loading GRPC Process: /tmp/localai/backend_data/backend-assets/lib/ld.so
8:30AM DBG GRPC Service for f83553a34a79b75aca661acbf73b8d62 will be running at: '127.0.0.1:44937'
8:30AM DBG GRPC Service state dir: /tmp/go-processmanager644594595
8:30AM DBG GRPC Service Started
8:30AM DBG GRPC(f83553a34a79b75aca661acbf73b8d62-127.0.0.1:44937): stdout Server listening on 127.0.0.1:44937
8:30AM DBG GRPC Service Ready
8:30AM DBG GRPC: Loading model with options: {state:{NoUnkeyedLiterals:{} DoNotCompare:[] DoNotCopy:[] atomicMessageInfo:} sizeCache:0 unknownFields:[] Model:f83553a34a79b75aca661acbf73b8d62 ContextSize:2048 Seed:1541718028 NBatch:512 F16Memory:false MLock:false MMap:false VocabOnly:false LowVRAM:false Embeddings:false NUMA:false NGPULayers:99999999 MainGPU: TensorSplit: Threads:8 LibrarySearchPath: RopeFreqBase:0 RopeFreqScale:0 RMSNormEps:0 NGQA:0 ModelFile:/usr/share/local-ai/models/f83553a34a79b75aca661acbf73b8d62 Device: UseTriton:false ModelBaseName: UseFastTokenizer:false PipelineType: SchedulerType: CUDA:false CFGScale:0 IMG2IMG:false CLIPModel: CLIPSubfolder: CLIPSkip:0 ControlNet: Tokenizer: LoraBase: LoraAdapter: LoraScale:0 NoMulMatQ:false DraftModel: AudioPath: Quantization: GPUMemoryUtilization:0 TrustRemoteCode:false EnforceEager:false SwapSpace:0 MaxModelLen:0 TensorParallelSize:0 MMProj: RopeScaling: YarnExtFactor:0 YarnAttnFactor:0 YarnBetaFast:0 YarnBetaSlow:0 Type: FlashAttention:false NoKVOffload:false}
8:30AM ERR Server error error="could not load model: rpc error: code = Unavailable desc = error reading from server: EOF" ip=127.0.0.1 latency=2.045800508s method=POST status=500 url=/v1/chat/completions
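
Since the EOF suggests the llama-cpp gRPC child process exited while loading the model, one thing I can still check is whether it was OOM-killed (standard Linux diagnostics, nothing LocalAI-specific; a Q5_K_M 7B model needs several GB of free RAM):

free -h
dmesg | grep -iE 'oom|killed process'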

FYI, this is my first time using LocalAI, so if I missed something, please let me know. Thanks.
