Description
Currently I deploy my model on my server box using the FastAPI script below:
from fastapi import FastAPI, Request, Response
from fastapi.middleware.cors import CORSMiddleware
from fastapi.responses import JSONResponse
from llama_cpp import Llama
from pydantic import BaseModel
import json

app = FastAPI()
llm = Llama(model_path="./model/ggml-model-q8_0.bin")

# Configure CORS
app.add_middleware(
    CORSMiddleware,
    allow_origins=["*"],
    allow_credentials=True,
    allow_methods=["*"],
    allow_headers=["*"],
)

class InputData(BaseModel):
    prompt: str

@app.post("/api/ask")
async def ask(data: InputData):
    print("Received request:", data)
    output = llm(f"Q: {data.prompt} A: ",
                 max_tokens=400,
                 temperature=0.7,
                 top_p=0.9,
                 stop=["Q:", "\n"],
                 echo=False,
                 repeat_penalty=1.1,
                 top_k=40)
    # Extract the relevant information from the output
    response_text = output.get("choices", [])[0].get("text", "")
    return {"response": response_text}

if __name__ == "__main__":
    import uvicorn
    uvicorn.run(app, host="0.0.0.0", port=5000)
Then I connect my HTML front end to it from another box, basically using it as a chatbot. Ultimately my goal is not to serve it over HTTP; I am only using it this way for testing. That said, this setup actually works very effectively.
Is there anything else I can do here to optimize the deployment of the model? It is a little slow in its response times. The VPS I use is a 4-core box with 16 GB of RAM, and this is a 7B model, so I have twice the required RAM. Still, most responses take around 30 to 60 seconds. I can see on the server console that the model receives the input query within the first 2 or 3 seconds, and it then takes about 30 to 60 seconds to generate a response, so it is not an internet issue; it is the model itself taking time to generate a response. Is there anything I can modify in my script to speed up the generation process? I have tried lowering max_tokens, but it does not make much of a difference. Any guidance would be appreciated.
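One thing I have been considering is constructing the Llama object with explicit CPU settings, since I believe llama-cpp-python exposes n_ctx, n_threads, n_batch and use_mlock on the constructor (availability may depend on the installed version, and the values below are untested guesses for a 4-core box):

# Rough sketch, not benchmarked: tune the constructor for a 4-core, CPU-only VPS
llm = Llama(
    model_path="./model/ggml-model-q8_0.bin",
    n_ctx=2048,       # context window; keep it only as large as actually needed
    n_threads=4,      # match the number of physical cores on the VPS
    n_batch=256,      # prompt-processing batch size
    use_mlock=True,   # try to keep the model weights pinned in RAM
)

I have not confirmed yet whether these settings make a measurable difference on my box.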
Here is an example of a generated response:
INFO: xx.xx.xx.xx:37196 - "OPTIONS /api/ask HTTP/1.1" 200 OK
Received request: prompt='describe mars'
llama_print_timings: load time = 4102.40 ms
llama_print_timings: sample time = 97.96 ms / 103 runs ( 0.95 ms per run)
llama_print_timings: prompt eval time = 4102.33 ms / 8 tokens ( 512.79 ms per token)
llama_print_timings: eval time = 45712.52 ms / 102 runs ( 448.16 ms per run)
llama_print_timings: total time = 59561.53 ms
INFO: xx.xx.xx.xx:37196 - "POST /api/ask HTTP/1.1" 200 OK
As you can see, this response took 60 seconds; on my HTML front end it showed 60.4, so very little time is lost over HTTP, and the delay is the model itself. I have tried many models, and most are slower; this one has been the fastest yet.
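To at least improve the perceived latency while I look into the raw generation speed, I am also thinking about streaming tokens back as they are generated instead of waiting for the full completion. A rough sketch of what I have in mind (the /api/ask-stream route is hypothetical, and it assumes llama-cpp-python's stream=True generator and FastAPI's StreamingResponse behave the way I expect; untested):

from fastapi.responses import StreamingResponse

@app.post("/api/ask-stream")
async def ask_stream(data: InputData):
    # stream=True makes llama-cpp-python yield completion chunks as tokens are produced
    def token_generator():
        for chunk in llm(f"Q: {data.prompt} A: ",
                         max_tokens=400,
                         temperature=0.7,
                         top_p=0.9,
                         stop=["Q:", "\n"],
                         stream=True):
            # each chunk follows the OpenAI-style schema: choices[0]["text"]
            yield chunk["choices"][0]["text"]

    return StreamingResponse(token_generator(), media_type="text/plain")

This would not make total generation time any shorter, but the first words would appear in the chatbot within a few seconds instead of after the full 30 to 60 second wait.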
You can actually test it yourself; I have a website set up here for testing purposes: