Conversation

@Aleksis99

No description provided.

@nelly-hateva nelly-hateva self-assigned this Oct 24, 2025
@nelly-hateva

This comment was marked as resolved.


@nelly-hateva nelly-hateva changed the title Merging evaluation of open source models Stanett-77: Merging evaluation of open source models Oct 30, 2025
@nelly-hateva nelly-hateva changed the title Stanett-77: Merging evaluation of open source models Stanett-77: Local LLM evaluation results Oct 30, 2025
@nelly-hateva

nelly-hateva commented Nov 19, 2025

@Aleksis99 The file `Qwen3-Coder-30B-A3B-Instruct_local/chat_responses_dev.jsonl` contains only lines such as `{"question_id": "question_68078138455045a9110143c26430ddf1", "error": "Connection error.", "status": "error"}`. The corresponding evaluation summary, on the other hand, says

  number_of_error_samples: 42
  number_of_success_samples: 148

Also, I think the results are not directly comparable, because some experiments have error samples:

  • Qwen3-Next-80B-A3B-Instruct_local - number_of_error_samples: 6
  • Qwen3-Coder-30B-A3B-Instruct+n_shot - number_of_error_samples: 26
  • Qwen3-235B-A22B-Instruct-2507-FP8_nshot - number_of_error_samples: 8
  • Qwen3-Next-80B-A3B-Instruct_local_nshot - number_of_error_samples: 9
  • Qwen3-Coder-30B-A3B-Instruct-FP8 - number_of_error_samples: 31
  • Qwen3-235B-A22B-Instruct-2507-FP8 - number_of_error_samples: 6

Also, there is a discrepancy between the model names in the evaluation results table https://github.com/statnett/Talk2PowerSystem/wiki/Evaluation-Results#open-source-llms and the folder names: there is a folder named Qwen3-30B-A3B-Instruct-2507 that is missing from the table, and the table contains an entry I wasn't able to map to a folder. My mapping is

Table name -> Folder name

  • Qwen3-Next-80B-A3B-Thinking -> Qwen3-Next-80B-A3B-Thinking
  • Qwen3-Next-80B-A3B-Instruct -> Qwen3-Next-80B-A3B-Instruct
  • Qwen3-Next-80B-A3B-Instruct Local Evaluation -> Qwen3-Next-80B-A3B-Instruct_local
  • Qwen3-Next-80B-A3B-Instruct Local evaluation + n-shot tool -> Qwen3-Next-80B-A3B-Instruct_local_nshot
  • Qwen3-235B-A22B-Instruct-2507 -> Qwen3-235B-A22B-Instruct-2507
  • Qwen3-235B-A22B-Instruct-2507-FP8 Local evaluation -> Qwen3-235B-A22B-Instruct-2507-FP8
  • Qwen3-235B-A22B-Instruct-2507-FP8 Local evaluation + n-shot tool -> Qwen3-235B-A22B-Instruct-2507-FP8_nshot
  • Qwen3-Coder-30B-A3B-Instruct -> Qwen3-Coder-30B-A3B-Instruct
  • Qwen3-Coder-30B-A3B-Instruct Local evaluation -> Qwen3-Coder-30B-A3B-Instruct_local
  • Qwen3-Coder-30B-A3B-Instruct Local evaluation + n-shot tool -> Qwen3-Coder-30B-A3B-Instruct+n_shot
  • Qwen3-Coder-30B-A3B-Instruct-FP8 Local evaluation -> Qwen3-Coder-30B-A3B-Instruct-FP8
  • Qwen3-Coder-30B-A3B-Instruct-FP8 Local evaluation + n-shot tool -> Missing
  • Qwen3-Coder-480B-A35B-Instruct -> Qwen3-Coder-480B-A35B-Instruct
  • Missing -> Qwen3-30B-A3B-Instruct-2507

@atagarev

atagarev commented Dec 1, 2025

@nelly-hateva, so we need to either rerun the tests or at least recompute the values including error samples as failures?

@nelly-hateva

> @nelly-hateva, so we need to either rerun the tests or at least recompute the values including error samples as failures?

Given that we keep chat_responses_dev.jsonl and chat_responses_test.jsonl, we can re-calculate the evaluation results. However, for some experiments we don't have them, so we can either re-run those experiments or add a disclaimer.

@nelly-hateva

nelly-hateva commented Dec 3, 2025

@Aleksis99 Please update the results with the new version of the library, and update the results table at https://github.com/statnett/Talk2PowerSystem/wiki/Evaluation-Results#open-source-llms
