**@justinwlin** commented on Oct 1, 2025

## What does this PR do?

This PR fixes the RunPod adapter.
Fixes #3517.

## Test Plan

# RunPod Provider Quick Start

## Prerequisites
- Python 3.10+
- Git
- RunPod API token

## Setup for Development

```bash
# 1. Clone and enter the repository
git clone https://github.com/llamastack/llama-stack.git
cd llama-stack

# 2. Create and activate virtual environment
python3 -m venv .venv
source .venv/bin/activate

# 3. Remove any existing llama-stack installation
pip uninstall llama-stack llama-stack-client -y

# 4. Install llama-stack in development mode
pip install -e .

# 5. Build using local development code
#    (found this through the Discord)
LLAMA_STACK_DIR=. llama stack build

# When prompted during build:
# - Name: runpod-dev
# - Image type: venv
# - Inference provider: remote::runpod
# - Safety provider: "llama-guard" 
# - Other providers: accept the defaults
```

## Configure the Stack

After building, edit the generated config file:
`~/.llama/distributions/llamastack-runpod-dev/llamastack-runpod-dev-run.yaml`

## Add Your Models

Add your models to the `models` section, using aliases for cleaner naming:

```yaml
models:
  - metadata: {}
    model_id: qwen3-32b-awq
    model_type: llm
    provider_id: runpod
    provider_model_id: Qwen/Qwen3-32B-AWQ
```

## Run the Server

**Important:** use the build-created virtual environment.

```bash
# Exit the development venv if you're in it
deactivate

# Activate the build-created venv (NOT .venv)
cd llama-stack  # the repo cloned above
source llamastack-runpod-dev/bin/activate
```

### For the Qwen3-32B-AWQ Public Endpoint (Recommended)

```bash
# Set environment variables
export RUNPOD_URL="https://api.runpod.ai/v2/qwen3-32b-awq/openai/v1"
export RUNPOD_API_TOKEN="your_runpod_api_key"

# Start server
llama stack run ~/.llama/distributions/llamastack-runpod-dev/llamastack-runpod-dev-run.yaml
```

## Quick Test

### 1. Chat Completion (Non-streaming)

```bash
curl -X POST http://localhost:8321/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3-32b-awq",
    "messages": [{"role": "user", "content": "Hello, count to 3"}],
    "stream": false
  }'
```

Result:

```
llama-stack2 % curl -X POST http://localhost:8321/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3-32b-awq",
    "messages": [{"role": "user", "content": "Hello, count to 3"}],
    "stream": false
  }'
{"id":"chatcmpl-5eba94e9f07c4b1194ff61a054ad69b4","choices":[{"finish_reason":"stop","index":0,"logprobs":null,"message":{"content":"\n\n1. 🎉  \n2. 🎉  \n3. 🎉  \n\nDone! Let me know if you'd like me to count in a different way. 😊","refusal":null,"role":"assistant","annotations":null,"audio":null,"function_call":null,"tool_calls":null,"reasoning_content":"\nOkay, the user asked me to count to 3. Let me start by making sure I understand the request correctly. They want me to count from 1 to 3. That's straightforward. I should just list the numbers 1, 2, 3 in order. But maybe they want it in a friendly way? Like adding some emojis or making it more engaging.\n\nLet me think about the possible variations. The user might be testing if I can follow simple instructions, so keeping it simple is good. However, adding a bit of personality could make the response more pleasant. For example, using emojis or a cheerful tone. \n\nI should also check if there's any hidden meaning. Sometimes people ask for something simple to see if the AI can handle basic tasks, so accuracy is key here. No need for extra information unless the user asks for more. Just the count from 1 to 3. \n\nAnother angle: maybe they want it in a different language? But the request was in English, so sticking with that makes sense. Also, considering the user's possible intent—perhaps they're a child or just want a quick interaction. Keeping the response light and friendly would be appropriate. \n\nSo, the response should be clear, correct, and maybe a bit friendly. Let me structure it as 1, 2, 3 with some emojis to add a playful touch. That should cover the user's needs and provide a positive interaction.\n"},"stop_reason":null}],"created":1759354739,"model":"Qwen/Qwen3-32B-AWQ","object":"chat.completion","service_tier":null,"system_fingerprint":null,"usage":{"completion_tokens":336,"prompt_tokens":14,"total_tokens":350,"completion_tokens_details":null,"prompt_tokens_details":null},"cost":0.0035,"kv_transfer_params":null,"prompt_logprobs":null,"metrics":[{"trace_id":"03880bcc8a6eb115aa3ed74aaac02d31","span_id":"592f8e133bc63bba","timestamp":"2025-10-01T21:39:11.796145Z","attributes":{"model_id":"qwen3-32b-awq","provider_id":"runpod"},"type":"metric","metric":"prompt_tokens","value":14,"unit":"tokens"},{"trace_id":"03880bcc8a6eb115aa3ed74aaac02d31","span_id":"592f8e133bc63bba","timestamp":"2025-10-01T21:39:11.796160Z","attributes":{"model_id":"qwen3-32b-awq","provider_id":"runpod"},"type":"metric","metric":"completion_tokens","value":336,"unit":"tokens"},{"trace_id":"03880bcc8a6eb115aa3ed74aaac02d31","span_id":"592f8e133bc63bba","timestamp":"2025-10-01T21:39:11.796163Z","attributes":{"model_id":"qwen3-32b-awq","provider_id":"runpod"},"type":"metric","metric":"total_tokens","value":350,"unit":"tokens"}]}
```

### 2. Chat Completion (Streaming)

```bash
curl -X POST http://localhost:8321/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3-32b-awq",
    "messages": [{"role": "user", "content": "Count to 5"}],
    "stream": true
  }'
```

A cleaner one-liner that prints just the streamed tokens:

```bash
curl -N -X POST http://localhost:8321/v1/chat/completions -H "Content-Type: application/json" -d '{"model": "qwen3-32b-awq", "messages": [{"role": "user", "content": "Count to 5"}], "stream": true}' 2>/dev/null | while read -r line; do echo "$line" | grep "^data: " | sed 's/^data: //' | jq -r '.choices[0].delta.content // empty' 2>/dev/null; done
```

Result:

```
1
2
3
4
5
```
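A Python equivalent of the one-liner above, useful when `jq` is unavailable. This is a minimal sketch assuming `requests` and an OpenAI-style SSE stream with a `data: [DONE]` sentinel, which matches the `data:` lines the one-liner parses.

```python
# Stream the SSE response and print each content delta as it arrives.
import json
import requests

resp = requests.post(
    "http://localhost:8321/v1/chat/completions",
    json={
        "model": "qwen3-32b-awq",
        "messages": [{"role": "user", "content": "Count to 5"}],
        "stream": True,
    },
    stream=True,
)
for line in resp.iter_lines():
    if not line.startswith(b"data: "):
        continue  # skip blank keep-alive lines
    payload = line[len(b"data: "):]
    if payload == b"[DONE]":  # OpenAI-style end-of-stream sentinel (assumed)
        break
    delta = json.loads(payload)["choices"][0]["delta"].get("content")
    if delta:
        print(delta, end="", flush=True)
print()
```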

### Check Models

```bash
curl -X GET \
  -H "Content-Type: application/json" \
  "http://localhost:8321/v1/models"
```


**meta-cla bot** commented on Oct 1, 2025

Hi @justinwlin!

Thank you for your pull request and welcome to our community.

**Action Required**

In order to merge any pull request (code, docs, etc.), we require contributors to sign our Contributor License Agreement, and we don't seem to have one on file for you.

**Process**

In order for us to review and merge your suggested changes, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (e.g. your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA.

Once the CLA is signed, our tooling will perform checks and validations. Afterwards, the pull request will be tagged with CLA signed. The tagging process may take up to 1 hour after signing. Please give it that time before contacting us about it.

If you have received this in error or have any questions, please contact us at [email protected]. Thanks!

@justinwlin changed the title from "Runpod adapter fix" to "fix: runpod adapter" on Oct 1, 2025

**meta-cla bot** commented on Oct 1, 2025

Thank you for signing our Contributor License Agreement. We can now accept your code for this (and any) Meta Open Source project. Thanks!

The meta-cla bot added the **CLA Signed** label (managed by the Meta Open Source bot) on Oct 1, 2025
**@mattf** (Collaborator) left a comment

please run pre-commit and let it fix the formatting / import issues.

also, address inline comments.


On `register_model`:

```python
async def register_model(self, model: Model) -> Model:
    """
    Pass-through registration - accepts any model that the RunPod endpoint serves.
```

**@mattf** (Collaborator) commented:

> this will accept any model, even if it isn't supported by RunPod.
>
> if RunPod provides an openai-compatible `/v1/models`, you can use `check_model_availability` from https://github.com/llamastack/llama-stack/blob/main/llama_stack/providers/utils/inference/openai_mixin.py#L394
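A minimal sketch of what that suggestion could look like, assuming the adapter mixes in `OpenAIMixin` and that `check_model_availability` is an async helper returning a bool (check the linked source for the exact signature):

```python
async def register_model(self, model: Model) -> Model:
    # Reject models the endpoint does not actually serve, instead of
    # passing everything through. check_model_availability is assumed to
    # consult the endpoint's OpenAI-compatible /v1/models listing.
    if not await self.check_model_availability(model.provider_model_id):
        raise ValueError(
            f"{model.provider_model_id} is not served by the configured RunPod endpoint"
        )
    return model
```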

"""
return model

async def completion(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you can remove this, we've deprecated completion in favor of openai_completion

On `chat_completion`:

```python
else:
    return await self._nonstream_completion(request, self.client)

async def chat_completion(
```

**@mattf** (Collaborator) commented:

> you can remove this, it's replaced by `openai_chat_completion`


On `embeddings`:

```python
return params

async def embeddings(
```

**@mattf** (Collaborator) commented:

> please remove this, `embeddings` is replaced by `openai_embeddings`

On `openai_embeddings`:

```python
embeddings = [data.embedding for data in response.data]
return EmbeddingsResponse(embeddings=embeddings)

async def openai_embeddings(
```

**@mattf** (Collaborator) commented:

> is the default implementation from `OpenAIMixin` sufficient?
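If the mixin defaults are indeed sufficient, the adapter could shrink to little more than connection details. A hypothetical sketch (the class name and config field names are illustrative, not taken from this PR, and the mixin's required hooks should be checked against the linked source):

```python
from llama_stack.providers.utils.inference.openai_mixin import OpenAIMixin


class RunpodInferenceAdapter(OpenAIMixin):
    """Sketch: defer chat completions and embeddings to OpenAIMixin defaults."""

    def __init__(self, config):
        self.config = config

    def get_api_key(self) -> str:
        # Assumed hook: how the mixin obtains credentials may differ.
        return self.config.api_token

    def get_base_url(self) -> str:
        # Assumed hook: e.g. https://api.runpod.ai/v2/<endpoint>/openai/v1
        return self.config.url
```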
