Conversation

@Aleksis99

No description provided.

@nelly-hateva nelly-hateva self-assigned this Oct 24, 2025
@nelly-hateva

This comment was marked as resolved.


@nelly-hateva nelly-hateva changed the title Merging evaluation of open source models Stanett-77: Merging evaluation of open source models Oct 30, 2025
@nelly-hateva nelly-hateva changed the title Stanett-77: Merging evaluation of open source models Stanett-77: Local LLM evaluation results Oct 30, 2025
@nelly-hateva

nelly-hateva commented Nov 19, 2025

@Aleksis99 The file `Qwen3-Coder-30B-A3B-Instruct_local/chat_responses_dev.jsonl` contains only lines such as `{"question_id": "question_68078138455045a9110143c26430ddf1", "error": "Connection error.", "status": "error"}`. The corresponding evaluation summary, on the other hand, says

  number_of_error_samples: 42
  number_of_success_samples: 148

Also, I think the results are not directly comparable, because some experiments have error samples:

  • Qwen3-Next-80B-A3B-Instruct_local - number_of_error_samples: 6
  • Qwen3-Coder-30B-A3B-Instruct+n_shot - number_of_error_samples: 26
  • Qwen3-235B-A22B-Instruct-2507-FP8_nshot - number_of_error_samples: 8
  • Qwen3-Next-80B-A3B-Instruct_local_nshot - number_of_error_samples: 9
  • Qwen3-Coder-30B-A3B-Instruct-FP8 - number_of_error_samples: 31
  • Qwen3-235B-A22B-Instruct-2507-FP8 - number_of_error_samples: 6

Also, there is a discrepancy between the model names in the evaluation results table https://github.com/statnett/Talk2PowerSystem/wiki/Evaluation-Results#open-source-llms and the folder names: there is a folder named Qwen3-30B-A3B-Instruct-2507 that is missing from the table, and the table contains an entry I wasn't able to map to a folder. My mapping is

Table name -> Folder name

  • Qwen3-Next-80B-A3B-Thinking -> Qwen3-Next-80B-A3B-Thinking
  • Qwen3-Next-80B-A3B-Instruct -> Qwen3-Next-80B-A3B-Instruct
  • Qwen3-Next-80B-A3B-Instruct Local Evaluation -> Qwen3-Next-80B-A3B-Instruct_local
  • Qwen3-Next-80B-A3B-Instruct Local evaluation + n-shot tool -> Qwen3-Next-80B-A3B-Instruct_local_nshot
  • Qwen3-235B-A22B-Instruct-2507 -> Qwen3-235B-A22B-Instruct-2507
  • Qwen3-235B-A22B-Instruct-2507-FP8 Local evaluation -> Qwen3-235B-A22B-Instruct-2507-FP8
  • Qwen3-235B-A22B-Instruct-2507-FP8 Local evaluation + n-shot tool -> Qwen3-235B-A22B-Instruct-2507-FP8_nshot
  • Qwen3-Coder-30B-A3B-Instruct -> Qwen3-Coder-30B-A3B-Instruct
  • Qwen3-Coder-30B-A3B-Instruct Local evaluation -> Qwen3-Coder-30B-A3B-Instruct_local
  • Qwen3-Coder-30B-A3B-Instruct Local evaluation + n-shot tool -> Qwen3-Coder-30B-A3B-Instruct+n_shot
  • Qwen3-Coder-30B-A3B-Instruct-FP8 Local evaluation -> Qwen3-Coder-30B-A3B-Instruct-FP8
  • Qwen3-Coder-30B-A3B-Instruct-FP8 Local evaluation + n-shot tool -> Missing
  • Qwen3-Coder-480B-A35B-Instruct -> Qwen3-Coder-480B-A35B-Instruct
  • Missing -> Qwen3-30B-A3B-Instruct-2507

@atagarev

atagarev commented Dec 1, 2025

@nelly-hateva, so we need to either rerun the tests or at least recompute the values including error samples as failures?

@nelly-hateva

> @nelly-hateva, so we need to either rerun the tests or at least recompute the values including error samples as failures?

Given that we keep chat_responses_dev.jsonl and chat_responses_test.jsonl, we can re-calculate the evaluation results. However, for some experiments we don't have them, so we can either re-run those experiments or add a disclaimer.

@nelly-hateva

nelly-hateva commented Dec 3, 2025

@Aleksis99 Please update the results with the new version of the library, and update the results table at https://github.com/statnett/Talk2PowerSystem/wiki/Evaluation-Results#open-source-llms
