Description
Currently I deploy my model on my server box using the FastAPI script below:
from fastapi import FastAPI, Request, Response
from fastapi.middleware.cors import CORSMiddleware
from fastapi.responses import JSONResponse
from llama_cpp import Llama
from pydantic import BaseModel
import json

app = FastAPI()
llm = Llama(model_path="./model/ggml-model-q8_0.bin")

# Configure CORS
app.add_middleware(
    CORSMiddleware,
    allow_origins=["*"],
    allow_credentials=True,
    allow_methods=["*"],
    allow_headers=["*"],
)

class InputData(BaseModel):
    prompt: str

@app.post("/api/ask")
async def ask(data: InputData):
    print("Received request:", data)
    output = llm(f"Q: {data.prompt} A: ",
                 max_tokens=400,
                 temperature=0.7,
                 top_p=0.9,
                 stop=["Q:", "\n"],
                 echo=False,
                 repeat_penalty=1.1,
                 top_k=40)
    # Extract the relevant information from the output
    response_text = output.get("choices", [])[0].get("text", "")
    return {"response": response_text}

if __name__ == "__main__":
    import uvicorn
    uvicorn.run(app, host="0.0.0.0", port=5000)
Then I connect my HTML front end to it from another box, basically using it as a chatbot. Ultimately my goal is not to serve it over HTTP; I am only using it this way for testing. That said, this setup actually works very effectively.
Is there anything else I can do here to optimize the deployment of the model? It is a little slow in its response times. The VPS I use is a 4-core box with 16 GB of RAM, and this is a 7B model, so I have twice the required RAM. Still, most responses take around 30 to 60 seconds. I can see on the server console that the model receives the input query within the first 2 or 3 seconds, and it then takes about 30 to 60 seconds to generate a response, so it is not an internet issue; it is the model itself taking time to generate a response. Is there anything I can modify in my script to speed up the generation process? I have tried lowering max_tokens, but it does not make much of a difference. Any guidance would be appreciated.
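One thing I have been considering is constructing the Llama object with explicit CPU settings, since I believe llama-cpp-python exposes n_ctx, n_threads, n_batch and use_mlock on the constructor (availability may depend on the installed version, and the values below are untested guesses for a 4-core box):

# Rough sketch, not benchmarked: tune the constructor for a 4-core, CPU-only VPS
llm = Llama(
    model_path="./model/ggml-model-q8_0.bin",
    n_ctx=2048,       # context window; keep it only as large as actually needed
    n_threads=4,      # match the number of physical cores on the VPS
    n_batch=256,      # prompt-processing batch size
    use_mlock=True,   # try to keep the model weights pinned in RAM
)

I have not confirmed yet whether these settings make a measurable difference on my box.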
Here is an example of a generated response:
INFO: xx.xx.xx.xx:37196 - "OPTIONS /api/ask HTTP/1.1" 200 OK
Received request: prompt='describe mars'
llama_print_timings: load time = 4102.40 ms
llama_print_timings: sample time = 97.96 ms / 103 runs ( 0.95 ms per run)
llama_print_timings: prompt eval time = 4102.33 ms / 8 tokens ( 512.79 ms per token)
llama_print_timings: eval time = 45712.52 ms / 102 runs ( 448.16 ms per run)
llama_print_timings: total time = 59561.53 ms
INFO: xx.xx.xx.xx:37196 - "POST /api/ask HTTP/1.1" 200 OK
As you can see, this response took 60 seconds; on my HTML front end it showed 60.4, so very little time is lost over HTTP, and the delay is the model itself. I have tried many models, and most are slower; this one has been the fastest yet.
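To at least improve the perceived latency while I look into the raw generation speed, I am also thinking about streaming tokens back as they are generated instead of waiting for the full completion. A rough sketch of what I have in mind (the /api/ask-stream route is hypothetical, and it assumes llama-cpp-python's stream=True generator and FastAPI's StreamingResponse behave the way I expect; untested):

from fastapi.responses import StreamingResponse

@app.post("/api/ask-stream")
async def ask_stream(data: InputData):
    # stream=True makes llama-cpp-python yield completion chunks as tokens are produced
    def token_generator():
        for chunk in llm(f"Q: {data.prompt} A: ",
                         max_tokens=400,
                         temperature=0.7,
                         top_p=0.9,
                         stop=["Q:", "\n"],
                         stream=True):
            # each chunk follows the OpenAI-style schema: choices[0]["text"]
            yield chunk["choices"][0]["text"]

    return StreamingResponse(token_generator(), media_type="text/plain")

This would not make total generation time any shorter, but the first words would appear in the chatbot within a few seconds instead of after the full 30 to 60 second wait.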
You can actually test it yourself; I have a website set up here for testing purposes: