
Using Llama-cpp with FastAPI to connect to an HTML page #166

Closed as not planned

Description

@raymerjacque

Currently I deploy my model on my server box using FastAPI, as shown below:

from fastapi import FastAPI
from fastapi.middleware.cors import CORSMiddleware
from llama_cpp import Llama
from pydantic import BaseModel

app = FastAPI()

llm = Llama(model_path="./model/ggml-model-q8_0.bin")

# Configure CORS
app.add_middleware(
    CORSMiddleware,
    allow_origins=["*"],
    allow_credentials=True,
    allow_methods=["*"],
    allow_headers=["*"],
)

class InputData(BaseModel):
    prompt: str

@app.post("/api/ask")
async def ask(data: InputData):
    print("Received request:", data)
    output = llm(f"Q: {data.prompt} A: ",
                 max_tokens=400,
                 temperature=0.7,
                 top_p=0.9,
                 stop=["Q:", "\n"],
                 echo=False,
                 repeat_penalty=1.1,
                 top_k=40)

    # Extract the relevant information from the output
    response_text = output.get("choices", [])[0].get("text", "")

    return {"response": response_text}

if __name__ == "__main__":
    import uvicorn

    uvicorn.run(app, host="0.0.0.0", port=5000)

Then I connect my HTML page to it from another box, basically using it as a chatbot. My goal is not ultimately to use it over HTTP; I am simply using it this way for testing. But this actually works very effectively.
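
For reference, calling the endpoint from Python looks roughly like this (a sketch only: the host and port are placeholders for my test box, and my real page uses a browser fetch rather than requests):

import requests

# Placeholder host/port for the test box running the FastAPI app above.
resp = requests.post(
    "http://127.0.0.1:5000/api/ask",
    json={"prompt": "describe mars"},
    timeout=120,  # generation can take up to a minute, so allow plenty of time
)
print(resp.json()["response"])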

Is there anything else I can do here to optimize the deployment of the model? It is a little slow in its response times. The VPS I use is a 4-core box with 16 GB of RAM, and this is a 7B model, so I have twice the required RAM, yet most responses still take around 30 to 60 seconds. I can see on the server console that the model receives the input query within the first 2 or 3 seconds; it then takes about 30 to 60 seconds to generate a response, so it is not an internet issue but the model itself taking that long to generate. Is there anything I can modify in my script to speed up the generation process? I have tried lowering max_tokens, but it doesn't make much of a difference. Any guidance would be appreciated.
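
For context, this is the kind of constructor-level tuning I have been guessing at (a sketch only: n_threads, n_batch and use_mlock are parameters of llama_cpp.Llama, but the values below are untested guesses for this 4-core box):

llm = Llama(
    model_path="./model/ggml-model-q8_0.bin",
    n_threads=4,     # untested guess: match the VPS's 4 cores
    n_batch=512,     # untested guess: larger batches to speed up prompt evaluation
    use_mlock=True,  # untested guess: keep the model resident in RAM to avoid paging
)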

Here is an example of a generated response:

INFO:     xx.xx.xx.xx:37196 - "OPTIONS /api/ask HTTP/1.1" 200 OK
Received request: prompt='describe mars'

llama_print_timings:        load time =  4102.40 ms
llama_print_timings:      sample time =    97.96 ms /   103 runs   (    0.95 ms per run)
llama_print_timings: prompt eval time =  4102.33 ms /     8 tokens (  512.79 ms per token)
llama_print_timings:        eval time = 45712.52 ms /   102 runs   (  448.16 ms per run)
llama_print_timings:       total time = 59561.53 ms
INFO:    xx.xx.xx.xx:37196 - "POST /api/ask HTTP/1.1" 200 OK

As you can see, this response took 60 seconds; on my HTML page it showed 60.4, so very little time is lost over HTTP and the delay is the model itself (the eval time alone is about 448 ms per token, which over the 102 generated tokens accounts for roughly 46 of those 60 seconds). I've tried many models; most are slower, and this one has been the fastest yet.
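
Streaming would not make generation itself faster, but since the model is the bottleneck I have also been considering sending tokens back as they are produced, so the page is not blank for a full minute. A rough, untested sketch (the /api/ask-stream route name is made up, and I am assuming stream=True yields chunks shaped like the non-streaming output):

from fastapi.responses import StreamingResponse

@app.post("/api/ask-stream")
async def ask_stream(data: InputData):
    def token_generator():
        # stream=True makes llama-cpp yield partial completions one chunk at a time
        for chunk in llm(f"Q: {data.prompt} A: ",
                         max_tokens=400,
                         temperature=0.7,
                         top_p=0.9,
                         stop=["Q:", "\n"],
                         stream=True):
            yield chunk["choices"][0]["text"]

    return StreamingResponse(token_generator(), media_type="text/plain")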

You can actually test it yourself; I have a website I set up here for testing purposes:

