**@justinwlin** commented on Oct 1, 2025

## What does this PR do?

This PR fixes the RunPod adapter.
Fixes #3517.

## Test Plan

# RunPod Provider Quick Start

## Prerequisites
- Python 3.10+
- Git
- RunPod API token

## Setup for Development

```bash
# 1. Clone and enter the repository
git clone https://github.com/llamastack/llama-stack.git
cd llama-stack

# 2. Create and activate virtual environment
python3 -m venv .venv
source .venv/bin/activate

# 3. Remove any existing llama-stack installation
pip uninstall llama-stack llama-stack-client -y

# 4. Install llama-stack in development mode
pip install -e .

# 5. Build using local development code
#    (found this through the Discord)
LLAMA_STACK_DIR=. llama stack build

# When prompted during build:
# - Name: runpod-dev
# - Image type: venv
# - Inference provider: remote::runpod
# - Safety provider: "llama-guard" 
# - Other providers: accept the defaults
```

## Configure the Stack

After building, edit the generated config file:
`~/.llama/distributions/llamastack-runpod-dev/llamastack-runpod-dev-run.yaml`

## Add Your Models

Add your models to the `models` section, using aliases for cleaner naming:

```yaml
models:
  - metadata: {}
    model_id: qwen3-32b-awq
    model_type: llm
    provider_id: runpod
    provider_model_id: Qwen/Qwen3-32B-AWQ
```

## Run the Server

**Important:** use the build-created virtual environment.

```bash
# Exit the development venv if you're in it
deactivate

# Activate the build-created venv (NOT .venv)
cd llama-stack  # the repo cloned above
source llamastack-runpod-dev/bin/activate
```

### For the Qwen3-32B-AWQ Public Endpoint (Recommended)

```bash
# Set environment variables
export RUNPOD_URL="https://api.runpod.ai/v2/qwen3-32b-awq/openai/v1"
export RUNPOD_API_TOKEN="your_runpod_api_key"

# Start server
llama stack run ~/.llama/distributions/llamastack-runpod-dev/llamastack-runpod-dev-run.yaml
```

## Quick Test

### 1. Chat Completion (Non-streaming)

```bash
curl -X POST http://localhost:8321/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3-32b-awq",
    "messages": [{"role": "user", "content": "Hello, count to 3"}],
    "stream": false
  }'
```

Result:

```
llama-stack2 % curl -X POST http://localhost:8321/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3-32b-awq",
    "messages": [{"role": "user", "content": "Hello, count to 3"}],
    "stream": false
  }'
{"id":"chatcmpl-5eba94e9f07c4b1194ff61a054ad69b4","choices":[{"finish_reason":"stop","index":0,"logprobs":null,"message":{"content":"\n\n1. 🎉  \n2. 🎉  \n3. 🎉  \n\nDone! Let me know if you'd like me to count in a different way. 😊","refusal":null,"role":"assistant","annotations":null,"audio":null,"function_call":null,"tool_calls":null,"reasoning_content":"\nOkay, the user asked me to count to 3. Let me start by making sure I understand the request correctly. They want me to count from 1 to 3. That's straightforward. I should just list the numbers 1, 2, 3 in order. But maybe they want it in a friendly way? Like adding some emojis or making it more engaging.\n\nLet me think about the possible variations. The user might be testing if I can follow simple instructions, so keeping it simple is good. However, adding a bit of personality could make the response more pleasant. For example, using emojis or a cheerful tone. \n\nI should also check if there's any hidden meaning. Sometimes people ask for something simple to see if the AI can handle basic tasks, so accuracy is key here. No need for extra information unless the user asks for more. Just the count from 1 to 3. \n\nAnother angle: maybe they want it in a different language? But the request was in English, so sticking with that makes sense. Also, considering the user's possible intent—perhaps they're a child or just want a quick interaction. Keeping the response light and friendly would be appropriate. \n\nSo, the response should be clear, correct, and maybe a bit friendly. Let me structure it as 1, 2, 3 with some emojis to add a playful touch. That should cover the user's needs and provide a positive interaction.\n"},"stop_reason":null}],"created":1759354739,"model":"Qwen/Qwen3-32B-AWQ","object":"chat.completion","service_tier":null,"system_fingerprint":null,"usage":{"completion_tokens":336,"prompt_tokens":14,"total_tokens":350,"completion_tokens_details":null,"prompt_tokens_details":null},"cost":0.0035,"kv_transfer_params":null,"prompt_logprobs":null,"metrics":[{"trace_id":"03880bcc8a6eb115aa3ed74aaac02d31","span_id":"592f8e133bc63bba","timestamp":"2025-10-01T21:39:11.796145Z","attributes":{"model_id":"qwen3-32b-awq","provider_id":"runpod"},"type":"metric","metric":"prompt_tokens","value":14,"unit":"tokens"},{"trace_id":"03880bcc8a6eb115aa3ed74aaac02d31","span_id":"592f8e133bc63bba","timestamp":"2025-10-01T21:39:11.796160Z","attributes":{"model_id":"qwen3-32b-awq","provider_id":"runpod"},"type":"metric","metric":"completion_tokens","value":336,"unit":"tokens"},{"trace_id":"03880bcc8a6eb115aa3ed74aaac02d31","span_id":"592f8e133bc63bba","timestamp":"2025-10-01T21:39:11.796163Z","attributes":{"model_id":"qwen3-32b-awq","provider_id":"runpod"},"type":"metric","metric":"total_tokens","value":350,"unit":"tokens"}]}
```

### 2. Chat Completion (Streaming)

```bash
curl -X POST http://localhost:8321/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3-32b-awq",
    "messages": [{"role": "user", "content": "Count to 5"}],
    "stream": true
  }'
```

A cleaner one-liner that prints just the streamed tokens:

```bash
curl -N -X POST http://localhost:8321/v1/chat/completions -H "Content-Type: application/json" -d '{"model": "qwen3-32b-awq", "messages": [{"role": "user", "content": "Count to 5"}], "stream": true}' 2>/dev/null | while read -r line; do echo "$line" | grep "^data: " | sed 's/^data: //' | jq -r '.choices[0].delta.content // empty' 2>/dev/null; done
```

Result:

```
1
2
3
4
5
```
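A Python equivalent of the one-liner above, useful when `jq` is unavailable. This is a minimal sketch assuming `requests` and an OpenAI-style SSE stream with a `data: [DONE]` sentinel, which matches the `data:` lines the one-liner parses.

```python
# Stream the SSE response and print each content delta as it arrives.
import json
import requests

resp = requests.post(
    "http://localhost:8321/v1/chat/completions",
    json={
        "model": "qwen3-32b-awq",
        "messages": [{"role": "user", "content": "Count to 5"}],
        "stream": True,
    },
    stream=True,
)
for line in resp.iter_lines():
    if not line.startswith(b"data: "):
        continue  # skip blank keep-alive lines
    payload = line[len(b"data: "):]
    if payload == b"[DONE]":  # OpenAI-style end-of-stream sentinel (assumed)
        break
    delta = json.loads(payload)["choices"][0]["delta"].get("content")
    if delta:
        print(delta, end="", flush=True)
print()
```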

### Check Models

```bash
curl -X GET \
  -H "Content-Type: application/json" \
  "http://localhost:8321/v1/models"
```


**meta-cla bot** commented on Oct 1, 2025

Hi @justinwlin!

Thank you for your pull request and welcome to our community.

**Action Required**

In order to merge any pull request (code, docs, etc.), we require contributors to sign our Contributor License Agreement, and we don't seem to have one on file for you.

**Process**

In order for us to review and merge your suggested changes, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (e.g. your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA.

Once the CLA is signed, our tooling will perform checks and validations. Afterwards, the pull request will be tagged with CLA signed. The tagging process may take up to 1 hour after signing. Please give it that time before contacting us about it.

If you have received this in error or have any questions, please contact us at [email protected]. Thanks!

@justinwlin changed the title from "Runpod adapter fix" to "fix: runpod adapter" on Oct 1, 2025

**meta-cla bot** commented on Oct 1, 2025

Thank you for signing our Contributor License Agreement. We can now accept your code for this (and any) Meta Open Source project. Thanks!

The meta-cla bot added the **CLA Signed** label (managed by the Meta Open Source bot) on Oct 1, 2025
**@mattf** (Collaborator) left a comment

please run pre-commit and let it fix the formatting / import issues.

also, address inline comments.


On `register_model`:

```python
async def register_model(self, model: Model) -> Model:
    """
    Pass-through registration - accepts any model that the RunPod endpoint serves.
```

**@mattf** (Collaborator) commented:

> this will accept any model, even if it isn't supported by RunPod.
>
> if RunPod provides an openai-compatible `/v1/models`, you can use `check_model_availability` from https://github.com/llamastack/llama-stack/blob/main/llama_stack/providers/utils/inference/openai_mixin.py#L394
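A minimal sketch of what that suggestion could look like, assuming the adapter mixes in `OpenAIMixin` and that `check_model_availability` is an async helper returning a bool (check the linked source for the exact signature):

```python
async def register_model(self, model: Model) -> Model:
    # Reject models the endpoint does not actually serve, instead of
    # passing everything through. check_model_availability is assumed to
    # consult the endpoint's OpenAI-compatible /v1/models listing.
    if not await self.check_model_availability(model.provider_model_id):
        raise ValueError(
            f"{model.provider_model_id} is not served by the configured RunPod endpoint"
        )
    return model
```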

"""
return model

async def completion(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you can remove this, we've deprecated completion in favor of openai_completion

On `chat_completion`:

```python
else:
    return await self._nonstream_completion(request, self.client)

async def chat_completion(
```

**@mattf** (Collaborator) commented:

> you can remove this, it's replaced by `openai_chat_completion`


On `embeddings`:

```python
return params

async def embeddings(
```

**@mattf** (Collaborator) commented:

> please remove this, `embeddings` is replaced by `openai_embeddings`

On `openai_embeddings`:

```python
embeddings = [data.embedding for data in response.data]
return EmbeddingsResponse(embeddings=embeddings)

async def openai_embeddings(
```

**@mattf** (Collaborator) commented:

> is the default implementation from `OpenAIMixin` sufficient?
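If the mixin defaults are indeed sufficient, the adapter could shrink to little more than connection details. A hypothetical sketch (the class name and config field names are illustrative, not taken from this PR, and the mixin's required hooks should be checked against the linked source):

```python
from llama_stack.providers.utils.inference.openai_mixin import OpenAIMixin


class RunpodInferenceAdapter(OpenAIMixin):
    """Sketch: defer chat completions and embeddings to OpenAIMixin defaults."""

    def __init__(self, config):
        self.config = config

    def get_api_key(self) -> str:
        # Assumed hook: how the mixin obtains credentials may differ.
        return self.config.api_token

    def get_base_url(self) -> str:
        # Assumed hook: e.g. https://api.runpod.ai/v2/<endpoint>/openai/v1
        return self.config.url
```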
