Skip to content

Conversation

@isamu-isozaki
Copy link
Contributor

@isamu-isozaki isamu-isozaki commented Apr 23, 2024

This is a draft PR. Currently, the 3 main parts left to do to make this work is

  • Add support for kv cache in outlines or a fork of outlines(this was already handled by exllama2)
  • Support to the API to get what kind of generation we want(choice, json, pydantic class,regex etc)
  • Possibly add support for streaming to outlines as is currently done in the main repo of outlines

@isamu-isozaki isamu-isozaki marked this pull request as draft April 23, 2024 02:51
@isamu-isozaki
Copy link
Contributor Author

I think I'll pull from dottxt-ai/outlines#781 which will probably solve 1 and 3

@edk208
Copy link
Contributor

edk208 commented Apr 23, 2024

thanks, looking good so far... its nice that outlines already supports exl2

@isamu-isozaki
Copy link
Contributor Author

@edk208 Some notes

  1. I think I finished the main logic
  2. For the logic of first doing preprocess and then generating tokens that are currently unfortunately not supported by outlines. I can make an outline fork that supports it but I think it can be a bit hacky. Does doing the preprocess first across all prompts offer better performance?
  3. The current script doesn't support proper streaming but I can make it generate one token at a time-> stream using the PR mentioned above though this functionality is not in the main branch of outlines yet so is more experimental

So in summary I think these are all the changes that can work from the main branch of outlines so far. Happy to get feedback!

@isamu-isozaki isamu-isozaki marked this pull request as ready for review April 24, 2024 20:33
@isamu-isozaki isamu-isozaki marked this pull request as draft April 24, 2024 20:34
@isamu-isozaki
Copy link
Contributor Author

I'll do the streaming idea tonight

@edk208
Copy link
Contributor

edk208 commented Apr 27, 2024

what do you mean by the "logic of first doing preprocess and then generating tokens"? do you mean the first model.forward with preprocess_only = True?

@isamu-isozaki
Copy link
Contributor Author

isamu-isozaki commented Apr 27, 2024

@edk208 sry for the confusion and yes. To my understanding, the process is

  1. We get the prompts -> tokenize+preprocess in exllama2
  2. Generate 1 token for each of those prompts
  3. For all end of sequence tokens stop
    Put all in a while loop until all prompts and prompt ids are exhausted.

I think step 1 is technically not possible in outlines but steps 2 and 3 might be possible in the above pr. Let me try it tomorrow

@edk208
Copy link
Contributor

edk208 commented Apr 27, 2024

@isamu-isozaki yes that's correct. The preprocess runs the prompts through and sets up the KV cache, then you can round-robin through them and generate one token at a time. Interesting that outlines doesn't like step 1. I would imagine it would have to do that anyway. I can take a look too in the next few days.

@isamu-isozaki isamu-isozaki marked this pull request as ready for review April 29, 2024 04:35
@isamu-isozaki
Copy link
Contributor Author

Hi! I think the main logic is done. For the test I used config.ini

[settings]
host = 127.0.0.1
port = 12345
upload_url = https://url/api/upload
path_url = https://url/folder/

[phi3b]
string = phi3b
repo = ..../Phi-3-mini-128k-instruct-exl2

with the model from here
and I started the server with

python llm_exl2_client_multi.py --port=5000 --use_outlines --gpu_split="5" --max_context=512 --repo_str=phi3b

Then on the client side, I did

from langchain.prompts import PromptTemplate
from langchain.prompts.chat import (
    ChatPromptTemplate,
    SystemMessagePromptTemplate,
    AIMessagePromptTemplate,
    HumanMessagePromptTemplate,
)
from langchain.schema import HumanMessage, SystemMessage, AIMessage
import json
from langchain.chat_models import ChatOpenAI

llm = ChatOpenAI(temperature=1.0,
                openai_api_base="http://localhost:5000/v1", 
                openai_api_key="Test",
                streaming=True, 
                max_tokens=1024)
messages = [
    SystemMessage(
        content="You are a helpful assistant."
    ),
    HumanMessage(
        content="Who is more impressive? Bob or Fred?"
    )
]
choices = ["Bob", "Fred"]

for chunk in llm.stream(messages, extra_body={"stop_at":"done", "outlines_type": "choices", "choices": choices}):
    print(chunk.content, end="", flush=True)

which got me Bob. I can do more tests if you want but I think it's working. One main logic here is that for adding new parameters to the open ai API we use extra_body rather than function calling/tool calling since I couldn't think of an easy way to translate it.

@isamu-isozaki isamu-isozaki changed the title WIP: Attempt adding outlines Adding outlines Apr 29, 2024
@edk208 edk208 merged commit 91114ea into blockentropy:main Apr 29, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants