Commit ee8f34d

gary-huang and Yun-Kim authored
feat(llmobs): experiments multi run (#15071)
## Description

Adds the ability to run an experiment multiple times to account for the non-deterministic behavior of LLMs, so that users can produce consistently better results.

**Backwards compatibility of the return value of `run`**

The `rows` and `summary_evaluations` attributes of the `ExperimentResult` class contain only the results from the first run. A new `runs` attribute contains the results of each run in an ordered list.

Also propagates experiment-related IDs as tags to child spans through the baggage API.

## Testing

Given the following script, which runs an experiment multiple times:

```
import os
import math

from dotenv import load_dotenv

# Load environment variables from the .env file.
load_dotenv(override=True)

from typing import Dict, Any

from ddtrace.llmobs import LLMObs
from openai import OpenAI

LLMObs.enable(api_key=os.getenv("DD_API_KEY"), app_key=os.getenv("DD_APPLICATION_KEY"), project_name="Onboarding", ml_app="Onboarding-ML-App", agentless_enabled=True)

import ddtrace

print(ddtrace.get_version())

oai_client = OpenAI(api_key=os.getenv("OPENAI_API_KEY"))

dataset = LLMObs.pull_dataset("capitals-of-the-world")
print(dataset.as_dataframe())
print(dataset.url)


# the task function will accept a row of input and will manipulate against it using the config provided
def generate_capital(input_data: Dict[str, Any], config: Dict[str, Any]) -> str:
    output = oai_client.chat.completions.create(
        model=config["model"],
        messages=[{"role": "user", "content": f"you are a mischievous assistant, give an answer that is on the opposite side of the world from the following question: {input_data['question']}"}],
        temperature=config["temperature"]
    )
    return output.choices[0].message.content


# Evaluators receive `input_data`, `output_data` (the output to test against), and `expected_output` (ground truth). All of them come automatically from the dataset and the task.
# You can modify the logic to support different evaluation methods like fuzzy matching, semantic similarity, llm-as-a-judge, etc.
def exact_match(input_data, output_data, expected_output):
    return expected_output == output_data


def contains_answer(input_data, output_data, expected_output):
    return expected_output in output_data


experiment = LLMObs.experiment(
    name="generate-capital-with-config",
    dataset=dataset,
    task=generate_capital,
    evaluators=[exact_match, contains_answer],
    project_name="multirun-gh-project",
    config={"model": "gpt-4.1-nano", "temperature": 0},
    description="a cool basic experiment with config",
    runs=5,
)

results = experiment.run(jobs=1)
print(experiment.url)

print("======================FIRST ROW ONLY (.rows deprecated)======================")
print(results.get("rows"))
print(results.get("runs"))
print("==================================================================")
for i, run in enumerate(results.get("runs", [])):
    print("RUN {}".format(run.run_iteration))
    print("run_id {}".format(run.run_id))
    print(run.rows)
    print(run.summary_evaluations)
    print("==================================================================")
```

The following is returned: https://app.datadoghq.com/llm/experiments/27ae99d7-902c-4feb-91b8-708c32b9dbb8

```
3.19.0.dev42+g1f1eda22d.d20251114
input_data ... question ...
0 None ... {\n "question": "What is the capital of China...
1 Which city serves as the capital of South Africa? ... None
2 What's the capital city of Chad? ... None
3 Which city serves as the capital of Canada? ... 
None [4 rows x 4 columns] https://app.datadoghq.com/llm/datasets/b0e7397a-1017-438f-b490-52d8e0a137d6 https://app.datadoghq.com/llm/experiments/27ae99d7-902c-4feb-91b8-708c32b9dbb8 ======================FIRST ROW ONLY (.rows deprecated)====================== [{'idx': 0, 'span_id': '16468949231358205171', 'trace_id': '6917a9a200000000d2e4c774168185a4', 'timestamp': 1763158434066427000, 'record_id': 'cc69b59e-7136-4ddd-a7a5-ee97fa8c5ebf', 'input': '{\n "question": "What is the capital of China?"!!!!!!! \n}', 'expected_output': 'Shanghai ', 'output': None, 'evaluations': {'exact_match': {'value': False, 'error': None}, 'contains_answer': {'value': None, 'error': {'message': "argument of type 'NoneType' is not iterable", 'type': 'TypeError', 'stack': 'Traceback (most recent call last):\n File "/Users/gary.huang/go/src/github.com/DataDog/dd-trace-py/ddtrace/llmobs/_experiment.py", line 530, in _run_evaluators\n eval_result = evaluator(input_data, output_data, expected_output)\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File "/Users/gary.huang/go/src/github.com/DataDog/llm-observability/preview/experiments/notebooks/test-multi-run-experiment.py", line 45, in contains_answer\n return expected_output in output_data\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\nTypeError: argument of type \'NoneType\' is not iterable\n'}}}, 'metadata': {'tags': ['ddtrace.version:3.19.0.dev42+g1f1eda22d.d20251114', 'experiment_id:27ae99d7-902c-4feb-91b8-708c32b9dbb8', 'run_id:5f10eb82-e722-4cf2-9397-a129627d05bd', 'run_iteration:1'], 'dataset_record_index': 0, 'experiment_name': 'generate-capital-with-config', 'dataset_name': 'capitals-of-the-world'}, 'error': {'message': "string indices must be integers, not 'str'", 'stack': 'Traceback (most recent call last):\n File "/Users/gary.huang/go/src/github.com/DataDog/dd-trace-py/ddtrace/llmobs/_experiment.py", line 455, in _process_record\n output_data = self._task(input_data, self._config)\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File 
"/Users/gary.huang/go/src/github.com/DataDog/llm-observability/preview/experiments/notebooks/test-multi-run-experiment.py", line 33, in generate_capital\n messages=[{"role": "user", "content": f"you are a mischievous assistant, give an answer that is on the opposite side of the world from the following question: {input_data[\'question\']}"}],\n ~~~~~~~~~~^^^^^^^^^^^^\nTypeError: string indices must be integers, not \'str\'\n', 'type': 'builtins.TypeError'}}, {'idx': 1, 'span_id': '11554414457811350033', 'trace_id': '6917a9a2000000005319aa4451aa8aa1', 'timestamp': 1763158434103491000, 'record_id': 'adac36cd-66f7-4413-af19-9e9e839d9115', 'input': {'question': 'Which city serves as the capital of South Africa?'}, 'expected_output': 'Pretoria', 'output': "The city that serves as the capital of South Africa is Pretoria. So, the opposite side of the world from Pretoria would be somewhere in the Pacific Ocean, near New Zealand or the eastern coast of Australia. But if you're looking for a specific city on the opposite side of the globe, it would be approximately near Wellington, New Zealand.", 'evaluations': {'exact_match': {'value': False, 'error': None}, 'contains_answer': {'value': True, 'error': None}}, 'metadata': {'tags': ['ddtrace.version:3.19.0.dev42+g1f1eda22d.d20251114', 'experiment_id:27ae99d7-902c-4feb-91b8-708c32b9dbb8', 'run_id:5f10eb82-e722-4cf2-9397-a129627d05bd', 'run_iteration:1'], 'dataset_record_index': 1, 'experiment_name': 'generate-capital-with-config', 'dataset_name': 'capitals-of-the-world'}, 'error': {'message': None, 'stack': None, 'type': None}}, {'idx': 2, 'span_id': '17587620919939721579', 'trace_id': '6917a9a300000000817f024df93d3004', 'timestamp': 1763158435344584000, 'record_id': '482cfdf3-daad-4914-bf27-29d54fb9cce3', 'input': {'question': "What's the capital city of Chad?"}, 'expected_output': "N'Djamena", 'output': "The capital city of Chad is N'Djamena. 
\nOn the opposite side of the world, you'd find Wellington, New Zealand!", 'evaluations': {'exact_match': {'value': False, 'error': None}, 'contains_answer': {'value': True, 'error': None}}, 'metadata': {'tags': ['ddtrace.version:3.19.0.dev42+g1f1eda22d.d20251114', 'experiment_id:27ae99d7-902c-4feb-91b8-708c32b9dbb8', 'run_id:5f10eb82-e722-4cf2-9397-a129627d05bd', 'run_iteration:1'], 'dataset_record_index': 2, 'experiment_name': 'generate-capital-with-config', 'dataset_name': 'capitals-of-the-world'}, 'error': {'message': None, 'stack': None, 'type': None}}, {'idx': 3, 'span_id': '10293662124308192509', 'trace_id': '6917a9a5000000007c1576e71c94ac2f', 'timestamp': 1763158437170263000, 'record_id': 'f6e7c809-ede9-4d81-aa33-5613e2cf44a9', 'input': {'question': 'Which city serves as the capital of Canada?'}, 'expected_output': 'Ottawa', 'output': "The city that serves as the capital of Canada is Ottawa. So, on the opposite side of the world, you'd find yourself in Wellington, New Zealand!", 'evaluations': {'exact_match': {'value': False, 'error': None}, 'contains_answer': {'value': True, 'error': None}}, 'metadata': {'tags': ['ddtrace.version:3.19.0.dev42+g1f1eda22d.d20251114', 'experiment_id:27ae99d7-902c-4feb-91b8-708c32b9dbb8', 'run_id:5f10eb82-e722-4cf2-9397-a129627d05bd', 'run_iteration:1'], 'dataset_record_index': 3, 'experiment_name': 'generate-capital-with-config', 'dataset_name': 'capitals-of-the-world'}, 'error': {'message': None, 'stack': None, 'type': None}}] [<ddtrace.llmobs._experiment.ExperimentRun object at 0x105d06120>, <ddtrace.llmobs._experiment.ExperimentRun object at 0x1129ff470>, <ddtrace.llmobs._experiment.ExperimentRun object at 0x1102e4b60>, <ddtrace.llmobs._experiment.ExperimentRun object at 0x112290a40>, <ddtrace.llmobs._experiment.ExperimentRun object at 0x111e49eb0>] ================================================================== RUN 1 run_id 5f10eb82-e722-4cf2-9397-a129627d05bd [{'idx': 0, 'span_id': '16468949231358205171', 'trace_id': 
'6917a9a200000000d2e4c774168185a4', 'timestamp': 1763158434066427000, 'record_id': 'cc69b59e-7136-4ddd-a7a5-ee97fa8c5ebf', 'input': '{\n "question": "What is the capital of China?"!!!!!!! \n}', 'expected_output': 'Shanghai ', 'output': None, 'evaluations': {'exact_match': {'value': False, 'error': None}, 'contains_answer': {'value': None, 'error': {'message': "argument of type 'NoneType' is not iterable", 'type': 'TypeError', 'stack': 'Traceback (most recent call last):\n File "/Users/gary.huang/go/src/github.com/DataDog/dd-trace-py/ddtrace/llmobs/_experiment.py", line 530, in _run_evaluators\n eval_result = evaluator(input_data, output_data, expected_output)\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File "/Users/gary.huang/go/src/github.com/DataDog/llm-observability/preview/experiments/notebooks/test-multi-run-experiment.py", line 45, in contains_answer\n return expected_output in output_data\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\nTypeError: argument of type \'NoneType\' is not iterable\n'}}}, 'metadata': {'tags': ['ddtrace.version:3.19.0.dev42+g1f1eda22d.d20251114', 'experiment_id:27ae99d7-902c-4feb-91b8-708c32b9dbb8', 'run_id:5f10eb82-e722-4cf2-9397-a129627d05bd', 'run_iteration:1'], 'dataset_record_index': 0, 'experiment_name': 'generate-capital-with-config', 'dataset_name': 'capitals-of-the-world'}, 'error': {'message': "string indices must be integers, not 'str'", 'stack': 'Traceback (most recent call last):\n File "/Users/gary.huang/go/src/github.com/DataDog/dd-trace-py/ddtrace/llmobs/_experiment.py", line 455, in _process_record\n output_data = self._task(input_data, self._config)\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File "/Users/gary.huang/go/src/github.com/DataDog/llm-observability/preview/experiments/notebooks/test-multi-run-experiment.py", line 33, in generate_capital\n messages=[{"role": "user", "content": f"you are a mischievous assistant, give an answer that is on the opposite side of the world from the following question: 
{input_data[\'question\']}"}],\n ~~~~~~~~~~^^^^^^^^^^^^\nTypeError: string indices must be integers, not \'str\'\n', 'type': 'builtins.TypeError'}}, {'idx': 1, 'span_id': '11554414457811350033', 'trace_id': '6917a9a2000000005319aa4451aa8aa1', 'timestamp': 1763158434103491000, 'record_id': 'adac36cd-66f7-4413-af19-9e9e839d9115', 'input': {'question': 'Which city serves as the capital of South Africa?'}, 'expected_output': 'Pretoria', 'output': "The city that serves as the capital of South Africa is Pretoria. So, the opposite side of the world from Pretoria would be somewhere in the Pacific Ocean, near New Zealand or the eastern coast of Australia. But if you're looking for a specific city on the opposite side of the globe, it would be approximately near Wellington, New Zealand.", 'evaluations': {'exact_match': {'value': False, 'error': None}, 'contains_answer': {'value': True, 'error': None}}, 'metadata': {'tags': ['ddtrace.version:3.19.0.dev42+g1f1eda22d.d20251114', 'experiment_id:27ae99d7-902c-4feb-91b8-708c32b9dbb8', 'run_id:5f10eb82-e722-4cf2-9397-a129627d05bd', 'run_iteration:1'], 'dataset_record_index': 1, 'experiment_name': 'generate-capital-with-config', 'dataset_name': 'capitals-of-the-world'}, 'error': {'message': None, 'stack': None, 'type': None}}, {'idx': 2, 'span_id': '17587620919939721579', 'trace_id': '6917a9a300000000817f024df93d3004', 'timestamp': 1763158435344584000, 'record_id': '482cfdf3-daad-4914-bf27-29d54fb9cce3', 'input': {'question': "What's the capital city of Chad?"}, 'expected_output': "N'Djamena", 'output': "The capital city of Chad is N'Djamena. 
\nOn the opposite side of the world, you'd find Wellington, New Zealand!", 'evaluations': {'exact_match': {'value': False, 'error': None}, 'contains_answer': {'value': True, 'error': None}}, 'metadata': {'tags': ['ddtrace.version:3.19.0.dev42+g1f1eda22d.d20251114', 'experiment_id:27ae99d7-902c-4feb-91b8-708c32b9dbb8', 'run_id:5f10eb82-e722-4cf2-9397-a129627d05bd', 'run_iteration:1'], 'dataset_record_index': 2, 'experiment_name': 'generate-capital-with-config', 'dataset_name': 'capitals-of-the-world'}, 'error': {'message': None, 'stack': None, 'type': None}}, {'idx': 3, 'span_id': '10293662124308192509', 'trace_id': '6917a9a5000000007c1576e71c94ac2f', 'timestamp': 1763158437170263000, 'record_id': 'f6e7c809-ede9-4d81-aa33-5613e2cf44a9', 'input': {'question': 'Which city serves as the capital of Canada?'}, 'expected_output': 'Ottawa', 'output': "The city that serves as the capital of Canada is Ottawa. So, on the opposite side of the world, you'd find yourself in Wellington, New Zealand!", 'evaluations': {'exact_match': {'value': False, 'error': None}, 'contains_answer': {'value': True, 'error': None}}, 'metadata': {'tags': ['ddtrace.version:3.19.0.dev42+g1f1eda22d.d20251114', 'experiment_id:27ae99d7-902c-4feb-91b8-708c32b9dbb8', 'run_id:5f10eb82-e722-4cf2-9397-a129627d05bd', 'run_iteration:1'], 'dataset_record_index': 3, 'experiment_name': 'generate-capital-with-config', 'dataset_name': 'capitals-of-the-world'}, 'error': {'message': None, 'stack': None, 'type': None}}] {} ================================================================== RUN 2 run_id e530859c-ee42-4e12-9f41-cf3aed39c121 [{'idx': 0, 'span_id': '2569476113600916510', 'trace_id': '6917a9a600000000ab033e5fe3bcd820', 'timestamp': 1763158438539723000, 'record_id': 'cc69b59e-7136-4ddd-a7a5-ee97fa8c5ebf', 'input': '{\n "question": "What is the capital of China?"!!!!!!! 
\n}', 'expected_output': 'Shanghai ', 'output': None, 'evaluations': {'exact_match': {'value': False, 'error': None}, 'contains_answer': {'value': None, 'error': {'message': "argument of type 'NoneType' is not iterable", 'type': 'TypeError', 'stack': 'Traceback (most recent call last):\n File "/Users/gary.huang/go/src/github.com/DataDog/dd-trace-py/ddtrace/llmobs/_experiment.py", line 530, in _run_evaluators\n eval_result = evaluator(input_data, output_data, expected_output)\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File "/Users/gary.huang/go/src/github.com/DataDog/llm-observability/preview/experiments/notebooks/test-multi-run-experiment.py", line 45, in contains_answer\n return expected_output in output_data\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\nTypeError: argument of type \'NoneType\' is not iterable\n'}}}, 'metadata': {'tags': ['ddtrace.version:3.19.0.dev42+g1f1eda22d.d20251114', 'experiment_id:27ae99d7-902c-4feb-91b8-708c32b9dbb8', 'run_id:e530859c-ee42-4e12-9f41-cf3aed39c121', 'run_iteration:2'], 'dataset_record_index': 0, 'experiment_name': 'generate-capital-with-config', 'dataset_name': 'capitals-of-the-world'}, 'error': {'message': "string indices must be integers, not 'str'", 'stack': 'Traceback (most recent call last):\n File "/Users/gary.huang/go/src/github.com/DataDog/dd-trace-py/ddtrace/llmobs/_experiment.py", line 455, in _process_record\n output_data = self._task(input_data, self._config)\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File "/Users/gary.huang/go/src/github.com/DataDog/llm-observability/preview/experiments/notebooks/test-multi-run-experiment.py", line 33, in generate_capital\n messages=[{"role": "user", "content": f"you are a mischievous assistant, give an answer that is on the opposite side of the world from the following question: {input_data[\'question\']}"}],\n ~~~~~~~~~~^^^^^^^^^^^^\nTypeError: string indices must be integers, not \'str\'\n', 'type': 'builtins.TypeError'}}, {'idx': 1, 'span_id': '3527031424250312576', 
'trace_id': '6917a9a6000000000309f47acf60aaf6', 'timestamp': 1763158438541036000, 'record_id': 'adac36cd-66f7-4413-af19-9e9e839d9115', 'input': {'question': 'Which city serves as the capital of South Africa?'}, 'expected_output': 'Pretoria', 'output': "The city that serves as the capital of South Africa is Pretoria. (If you're looking for the opposite side of the world, you'd be heading toward Wellington, New Zealand!)", 'evaluations': {'exact_match': {'value': False, 'error': None}, 'contains_answer': {'value': True, 'error': None}}, 'metadata': {'tags': ['ddtrace.version:3.19.0.dev42+g1f1eda22d.d20251114', 'experiment_id:27ae99d7-902c-4feb-91b8-708c32b9dbb8', 'run_id:e530859c-ee42-4e12-9f41-cf3aed39c121', 'run_iteration:2'], 'dataset_record_index': 1, 'experiment_name': 'generate-capital-with-config', 'dataset_name': 'capitals-of-the-world'}, 'error': {'message': None, 'stack': None, 'type': None}}, {'idx': 2, 'span_id': '17731250253387251097', 'trace_id': '6917a9a7000000006f8b36de0362ac79', 'timestamp': 1763158439303330000, 'record_id': '482cfdf3-daad-4914-bf27-29d54fb9cce3', 'input': {'question': "What's the capital city of Chad?"}, 'expected_output': "N'Djamena", 'output': "The capital city of Chad is N'Djamena. 
\nOn the opposite side of the world, roughly in the Pacific Ocean, you'd find Wellington, New Zealand!", 'evaluations': {'exact_match': {'value': False, 'error': None}, 'contains_answer': {'value': True, 'error': None}}, 'metadata': {'tags': ['ddtrace.version:3.19.0.dev42+g1f1eda22d.d20251114', 'experiment_id:27ae99d7-902c-4feb-91b8-708c32b9dbb8', 'run_id:e530859c-ee42-4e12-9f41-cf3aed39c121', 'run_iteration:2'], 'dataset_record_index': 2, 'experiment_name': 'generate-capital-with-config', 'dataset_name': 'capitals-of-the-world'}, 'error': {'message': None, 'stack': None, 'type': None}}, {'idx': 3, 'span_id': '8713756859818521720', 'trace_id': '6917a9a800000000925b2029a4073997', 'timestamp': 1763158440412218000, 'record_id': 'f6e7c809-ede9-4d81-aa33-5613e2cf44a9', 'input': {'question': 'Which city serves as the capital of Canada?'}, 'expected_output': 'Ottawa', 'output': "The city that serves as the capital of Canada is Ottawa. So, on the opposite side of the world, you'd find yourself in a completely different place—perhaps somewhere in the Indian Ocean, like Réunion Island, which is a French overseas department. 
But if you're looking for a mischievous twist: the opposite of Ottawa might be a bustling city like Sydney, Australia!", 'evaluations': {'exact_match': {'value': False, 'error': None}, 'contains_answer': {'value': True, 'error': None}}, 'metadata': {'tags': ['ddtrace.version:3.19.0.dev42+g1f1eda22d.d20251114', 'experiment_id:27ae99d7-902c-4feb-91b8-708c32b9dbb8', 'run_id:e530859c-ee42-4e12-9f41-cf3aed39c121', 'run_iteration:2'], 'dataset_record_index': 3, 'experiment_name': 'generate-capital-with-config', 'dataset_name': 'capitals-of-the-world'}, 'error': {'message': None, 'stack': None, 'type': None}}] {} ================================================================== RUN 3 run_id aeb26d0b-0c58-4dd3-a614-e26382df0677 [{'idx': 0, 'span_id': '2938889205999723237', 'trace_id': '6917a9a900000000d35537e7911e4ebc', 'timestamp': 1763158441902750000, 'record_id': 'cc69b59e-7136-4ddd-a7a5-ee97fa8c5ebf', 'input': '{\n "question": "What is the capital of China?"!!!!!!! \n}', 'expected_output': 'Shanghai ', 'output': None, 'evaluations': {'exact_match': {'value': False, 'error': None}, 'contains_answer': {'value': None, 'error': {'message': "argument of type 'NoneType' is not iterable", 'type': 'TypeError', 'stack': 'Traceback (most recent call last):\n File "/Users/gary.huang/go/src/github.com/DataDog/dd-trace-py/ddtrace/llmobs/_experiment.py", line 530, in _run_evaluators\n eval_result = evaluator(input_data, output_data, expected_output)\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File "/Users/gary.huang/go/src/github.com/DataDog/llm-observability/preview/experiments/notebooks/test-multi-run-experiment.py", line 45, in contains_answer\n return expected_output in output_data\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\nTypeError: argument of type \'NoneType\' is not iterable\n'}}}, 'metadata': {'tags': ['ddtrace.version:3.19.0.dev42+g1f1eda22d.d20251114', 'experiment_id:27ae99d7-902c-4feb-91b8-708c32b9dbb8', 'run_id:aeb26d0b-0c58-4dd3-a614-e26382df0677', 
'run_iteration:3'], 'dataset_record_index': 0, 'experiment_name': 'generate-capital-with-config', 'dataset_name': 'capitals-of-the-world'}, 'error': {'message': "string indices must be integers, not 'str'", 'stack': 'Traceback (most recent call last):\n File "/Users/gary.huang/go/src/github.com/DataDog/dd-trace-py/ddtrace/llmobs/_experiment.py", line 455, in _process_record\n output_data = self._task(input_data, self._config)\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File "/Users/gary.huang/go/src/github.com/DataDog/llm-observability/preview/experiments/notebooks/test-multi-run-experiment.py", line 33, in generate_capital\n messages=[{"role": "user", "content": f"you are a mischievous assistant, give an answer that is on the opposite side of the world from the following question: {input_data[\'question\']}"}],\n ~~~~~~~~~~^^^^^^^^^^^^\nTypeError: string indices must be integers, not \'str\'\n', 'type': 'builtins.TypeError'}}, {'idx': 1, 'span_id': '4924180178938996782', 'trace_id': '6917a9a9000000003c43f9cc6c6cf9a8', 'timestamp': 1763158441903823000, 'record_id': 'adac36cd-66f7-4413-af19-9e9e839d9115', 'input': {'question': 'Which city serves as the capital of South Africa?'}, 'expected_output': 'Pretoria', 'output': "The capital of South Africa is Pretoria, but if you're looking for the opposite side of the world, you'd be heading to somewhere near Wellington, New Zealand!", 'evaluations': {'exact_match': {'value': False, 'error': None}, 'contains_answer': {'value': True, 'error': None}}, 'metadata': {'tags': ['ddtrace.version:3.19.0.dev42+g1f1eda22d.d20251114', 'experiment_id:27ae99d7-902c-4feb-91b8-708c32b9dbb8', 'run_id:aeb26d0b-0c58-4dd3-a614-e26382df0677', 'run_iteration:3'], 'dataset_record_index': 1, 'experiment_name': 'generate-capital-with-config', 'dataset_name': 'capitals-of-the-world'}, 'error': {'message': None, 'stack': None, 'type': None}}, {'idx': 2, 'span_id': '10856853095708846942', 'trace_id': '6917a9aa00000000339621f316f736f8', 'timestamp': 
1763158442723418000, 'record_id': '482cfdf3-daad-4914-bf27-29d54fb9cce3', 'input': {'question': "What's the capital city of Chad?"}, 'expected_output': "N'Djamena", 'output': "The capital city of Chad is N'Djamena. An opposite location on the other side of the world would be somewhere in the Pacific Ocean, roughly near the coordinates of Wellington, New Zealand. So, if you're looking for a city far from N'Djamena, you might consider Wellington!", 'evaluations': {'exact_match': {'value': False, 'error': None}, 'contains_answer': {'value': True, 'error': None}}, 'metadata': {'tags': ['ddtrace.version:3.19.0.dev42+g1f1eda22d.d20251114', 'experiment_id:27ae99d7-902c-4feb-91b8-708c32b9dbb8', 'run_id:aeb26d0b-0c58-4dd3-a614-e26382df0677', 'run_iteration:3'], 'dataset_record_index': 2, 'experiment_name': 'generate-capital-with-config', 'dataset_name': 'capitals-of-the-world'}, 'error': {'message': None, 'stack': None, 'type': None}}, {'idx': 3, 'span_id': '10489772149063902687', 'trace_id': '6917a9ab0000000093607fb697015a84', 'timestamp': 1763158443677974000, 'record_id': 'f6e7c809-ede9-4d81-aa33-5613e2cf44a9', 'input': {'question': 'Which city serves as the capital of Canada?'}, 'expected_output': 'Ottawa', 'output': 'The city that serves as the capital of Canada is Ottawa. 
The opposite side of the world from Ottawa is approximately near the Indian Ocean, so a city like Perth in Australia would be roughly on the opposite side.', 'evaluations': {'exact_match': {'value': False, 'error': None}, 'contains_answer': {'value': True, 'error': None}}, 'metadata': {'tags': ['ddtrace.version:3.19.0.dev42+g1f1eda22d.d20251114', 'experiment_id:27ae99d7-902c-4feb-91b8-708c32b9dbb8', 'run_id:aeb26d0b-0c58-4dd3-a614-e26382df0677', 'run_iteration:3'], 'dataset_record_index': 3, 'experiment_name': 'generate-capital-with-config', 'dataset_name': 'capitals-of-the-world'}, 'error': {'message': None, 'stack': None, 'type': None}}] {} ================================================================== RUN 4 run_id fdfce0bd-4b0f-4308-a1ee-59b3defc1695 [{'idx': 0, 'span_id': '14713543927380582734', 'trace_id': '6917a9ad000000006579f84259ad6bbd', 'timestamp': 1763158445912954000, 'record_id': 'cc69b59e-7136-4ddd-a7a5-ee97fa8c5ebf', 'input': '{\n "question": "What is the capital of China?"!!!!!!! 
\n}', 'expected_output': 'Shanghai ', 'output': None, 'evaluations': {'exact_match': {'value': False, 'error': None}, 'contains_answer': {'value': None, 'error': {'message': "argument of type 'NoneType' is not iterable", 'type': 'TypeError', 'stack': 'Traceback (most recent call last):\n File "/Users/gary.huang/go/src/github.com/DataDog/dd-trace-py/ddtrace/llmobs/_experiment.py", line 530, in _run_evaluators\n eval_result = evaluator(input_data, output_data, expected_output)\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File "/Users/gary.huang/go/src/github.com/DataDog/llm-observability/preview/experiments/notebooks/test-multi-run-experiment.py", line 45, in contains_answer\n return expected_output in output_data\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\nTypeError: argument of type \'NoneType\' is not iterable\n'}}}, 'metadata': {'tags': ['ddtrace.version:3.19.0.dev42+g1f1eda22d.d20251114', 'experiment_id:27ae99d7-902c-4feb-91b8-708c32b9dbb8', 'run_id:fdfce0bd-4b0f-4308-a1ee-59b3defc1695', 'run_iteration:4'], 'dataset_record_index': 0, 'experiment_name': 'generate-capital-with-config', 'dataset_name': 'capitals-of-the-world'}, 'error': {'message': "string indices must be integers, not 'str'", 'stack': 'Traceback (most recent call last):\n File "/Users/gary.huang/go/src/github.com/DataDog/dd-trace-py/ddtrace/llmobs/_experiment.py", line 455, in _process_record\n output_data = self._task(input_data, self._config)\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File "/Users/gary.huang/go/src/github.com/DataDog/llm-observability/preview/experiments/notebooks/test-multi-run-experiment.py", line 33, in generate_capital\n messages=[{"role": "user", "content": f"you are a mischievous assistant, give an answer that is on the opposite side of the world from the following question: {input_data[\'question\']}"}],\n ~~~~~~~~~~^^^^^^^^^^^^\nTypeError: string indices must be integers, not \'str\'\n', 'type': 'builtins.TypeError'}}, {'idx': 1, 'span_id': '5058499212801266493', 
'trace_id': '6917a9ad00000000cf5dedb72f760f67', 'timestamp': 1763158445914108000, 'record_id': 'adac36cd-66f7-4413-af19-9e9e839d9115', 'input': {'question': 'Which city serves as the capital of South Africa?'}, 'expected_output': 'Pretoria', 'output': "The city that serves as the capital of South Africa is Pretoria. (If you're looking for the opposite side of the world, that would be somewhere in the Pacific Ocean, near Easter Island or the Marquesas Islands.)", 'evaluations': {'exact_match': {'value': False, 'error': None}, 'contains_answer': {'value': True, 'error': None}}, 'metadata': {'tags': ['ddtrace.version:3.19.0.dev42+g1f1eda22d.d20251114', 'experiment_id:27ae99d7-902c-4feb-91b8-708c32b9dbb8', 'run_id:fdfce0bd-4b0f-4308-a1ee-59b3defc1695', 'run_iteration:4'], 'dataset_record_index': 1, 'experiment_name': 'generate-capital-with-config', 'dataset_name': 'capitals-of-the-world'}, 'error': {'message': None, 'stack': None, 'type': None}}, {'idx': 2, 'span_id': '15192264960817687335', 'trace_id': '6917a9ae00000000a07c9440c38211e6', 'timestamp': 1763158446774638000, 'record_id': '482cfdf3-daad-4914-bf27-29d54fb9cce3', 'input': {'question': "What's the capital city of Chad?"}, 'expected_output': "N'Djamena", 'output': "The capital city of Chad is N'Djamena. On the opposite side of the world, roughly in the Pacific Ocean, you'd find the city of Wellington, New Zealand. 
So, if you're asking about Chad's capital, the mischievous answer would be: Wellington!", 'evaluations': {'exact_match': {'value': False, 'error': None}, 'contains_answer': {'value': True, 'error': None}}, 'metadata': {'tags': ['ddtrace.version:3.19.0.dev42+g1f1eda22d.d20251114', 'experiment_id:27ae99d7-902c-4feb-91b8-708c32b9dbb8', 'run_id:fdfce0bd-4b0f-4308-a1ee-59b3defc1695', 'run_iteration:4'], 'dataset_record_index': 2, 'experiment_name': 'generate-capital-with-config', 'dataset_name': 'capitals-of-the-world'}, 'error': {'message': None, 'stack': None, 'type': None}}, {'idx': 3, 'span_id': '10465075997269134377', 'trace_id': '6917a9b0000000005e61c3067bad09fd', 'timestamp': 1763158448596497000, 'record_id': 'f6e7c809-ede9-4d81-aa33-5613e2cf44a9', 'input': {'question': 'Which city serves as the capital of Canada?'}, 'expected_output': 'Ottawa', 'output': "The city that serves as the capital of Canada is Ottawa. So, on the opposite side of the world, you'd find yourself in Wellington, New Zealand!", 'evaluations': {'exact_match': {'value': False, 'error': None}, 'contains_answer': {'value': True, 'error': None}}, 'metadata': {'tags': ['ddtrace.version:3.19.0.dev42+g1f1eda22d.d20251114', 'experiment_id:27ae99d7-902c-4feb-91b8-708c32b9dbb8', 'run_id:fdfce0bd-4b0f-4308-a1ee-59b3defc1695', 'run_iteration:4'], 'dataset_record_index': 3, 'experiment_name': 'generate-capital-with-config', 'dataset_name': 'capitals-of-the-world'}, 'error': {'message': None, 'stack': None, 'type': None}}] {} ================================================================== RUN 5 run_id 347f7f8c-0755-4dcb-ab14-03d5a39ae495 [{'idx': 0, 'span_id': '4374716693641656258', 'trace_id': '6917a9b1000000007c8805c9b5c2d3bc', 'timestamp': 1763158449950702000, 'record_id': 'cc69b59e-7136-4ddd-a7a5-ee97fa8c5ebf', 'input': '{\n "question": "What is the capital of China?"!!!!!!! 
\n}', 'expected_output': 'Shanghai ', 'output': None, 'evaluations': {'exact_match': {'value': False, 'error': None}, 'contains_answer': {'value': None, 'error': {'message': "argument of type 'NoneType' is not iterable", 'type': 'TypeError', 'stack': 'Traceback (most recent call last):\n File "/Users/gary.huang/go/src/github.com/DataDog/dd-trace-py/ddtrace/llmobs/_experiment.py", line 530, in _run_evaluators\n eval_result = evaluator(input_data, output_data, expected_output)\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File "/Users/gary.huang/go/src/github.com/DataDog/llm-observability/preview/experiments/notebooks/test-multi-run-experiment.py", line 45, in contains_answer\n return expected_output in output_data\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\nTypeError: argument of type \'NoneType\' is not iterable\n'}}}, 'metadata': {'tags': ['ddtrace.version:3.19.0.dev42+g1f1eda22d.d20251114', 'experiment_id:27ae99d7-902c-4feb-91b8-708c32b9dbb8', 'run_id:347f7f8c-0755-4dcb-ab14-03d5a39ae495', 'run_iteration:5'], 'dataset_record_index': 0, 'experiment_name': 'generate-capital-with-config', 'dataset_name': 'capitals-of-the-world'}, 'error': {'message': "string indices must be integers, not 'str'", 'stack': 'Traceback (most recent call last):\n File "/Users/gary.huang/go/src/github.com/DataDog/dd-trace-py/ddtrace/llmobs/_experiment.py", line 455, in _process_record\n output_data = self._task(input_data, self._config)\n ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n File "/Users/gary.huang/go/src/github.com/DataDog/llm-observability/preview/experiments/notebooks/test-multi-run-experiment.py", line 33, in generate_capital\n messages=[{"role": "user", "content": f"you are a mischievous assistant, give an answer that is on the opposite side of the world from the following question: {input_data[\'question\']}"}],\n ~~~~~~~~~~^^^^^^^^^^^^\nTypeError: string indices must be integers, not \'str\'\n', 'type': 'builtins.TypeError'}}, {'idx': 1, 'span_id': '9069185874972090328', 
'trace_id': '6917a9b100000000678fe97445566856', 'timestamp': 1763158449954194000, 'record_id': 'adac36cd-66f7-4413-af19-9e9e839d9115', 'input': {'question': 'Which city serves as the capital of South Africa?'}, 'expected_output': 'Pretoria', 'output': "The city that serves as the capital of South Africa is Pretoria. \n(If you're looking for the opposite side of the world, that would be somewhere in the Pacific Ocean, near New Zealand or the Chatham Islands, but there's no specific city there serving as a capital of South Africa!)", 'evaluations': {'exact_match': {'value': False, 'error': None}, 'contains_answer': {'value': True, 'error': None}}, 'metadata': {'tags': ['ddtrace.version:3.19.0.dev42+g1f1eda22d.d20251114', 'experiment_id:27ae99d7-902c-4feb-91b8-708c32b9dbb8', 'run_id:347f7f8c-0755-4dcb-ab14-03d5a39ae495', 'run_iteration:5'], 'dataset_record_index': 1, 'experiment_name': 'generate-capital-with-config', 'dataset_name': 'capitals-of-the-world'}, 'error': {'message': None, 'stack': None, 'type': None}}, {'idx': 2, 'span_id': '15493263152227118067', 'trace_id': '6917a9b400000000016f0738c2f3f59c', 'timestamp': 1763158452034765000, 'record_id': '482cfdf3-daad-4914-bf27-29d54fb9cce3', 'input': {'question': "What's the capital city of Chad?"}, 'expected_output': "N'Djamena", 'output': "The capital city of Chad is N'Djamena, which is located in Africa. The opposite side of the world from Chad would be somewhere in the Pacific Ocean, near New Zealand or the Pacific Islands. 
So, a playful opposite answer could be: Wellington, New Zealand!", 'evaluations': {'exact_match': {'value': False, 'error': None}, 'contains_answer': {'value': True, 'error': None}}, 'metadata': {'tags': ['ddtrace.version:3.19.0.dev42+g1f1eda22d.d20251114', 'experiment_id:27ae99d7-902c-4feb-91b8-708c32b9dbb8', 'run_id:347f7f8c-0755-4dcb-ab14-03d5a39ae495', 'run_iteration:5'], 'dataset_record_index': 2, 'experiment_name': 'generate-capital-with-config', 'dataset_name': 'capitals-of-the-world'}, 'error': {'message': None, 'stack': None, 'type': None}}, {'idx': 3, 'span_id': '18053746272004856775', 'trace_id': '6917a9b500000000821b820fc5ea90b0', 'timestamp': 1763158453031104000, 'record_id': 'f6e7c809-ede9-4d81-aa33-5613e2cf44a9', 'input': {'question': 'Which city serves as the capital of Canada?'}, 'expected_output': 'Ottawa', 'output': "The city that serves as the capital of Canada is Ottawa. So, on the opposite side of the world, you'd find yourself in Wellington, New Zealand!", 'evaluations': {'exact_match': {'value': False, 'error': None}, 'contains_answer': {'value': True, 'error': None}}, 'metadata': {'tags': ['ddtrace.version:3.19.0.dev42+g1f1eda22d.d20251114', 'experiment_id:27ae99d7-902c-4feb-91b8-708c32b9dbb8', 'run_id:347f7f8c-0755-4dcb-ab14-03d5a39ae495', 'run_iteration:5'], 'dataset_record_index': 3, 'experiment_name': 'generate-capital-with-config', 'dataset_name': 'capitals-of-the-world'}, 'error': {'message': None, 'stack': None, 'type': None}}] {} ================================================================== ``` ## Risks <!-- Note any risks associated with this change, or "None" if no risks --> ## Additional Notes <!-- Any other information that would be helpful for reviewers --> --------- Co-authored-by: Yun Kim <[email protected]>
1 parent 60a975f commit ee8f34d

17 files changed, +763 -67 lines changed

ddtrace/llmobs/_constants.py

Lines changed: 2 additions & 0 deletions
@@ -105,6 +105,8 @@
105105
PROXY_REQUEST = "llmobs.proxy_request"
106106

107107
EXPERIMENT_ID_KEY = "_ml_obs.experiment_id"
108+
EXPERIMENT_RUN_ID_KEY = "_ml_obs.experiment_run_id"
109+
EXPERIMENT_RUN_ITERATION_KEY = "_ml_obs.experiment_run_iteration"
108110
EXPERIMENT_EXPECTED_OUTPUT = "_ml_obs.meta.input.expected_output"
109111
EXPERIMENTS_INPUT = "_ml_obs.meta.input"
110112
EXPERIMENTS_OUTPUT = "_ml_obs.meta.output"

ddtrace/llmobs/_experiment.py

Lines changed: 64 additions & 22 deletions
@@ -1,5 +1,6 @@
11
from concurrent.futures import ThreadPoolExecutor
22
from copy import deepcopy
3+
import itertools
34
import sys
45
import traceback
56
from typing import TYPE_CHECKING
@@ -82,6 +83,13 @@ class EvaluationResult(TypedDict):
8283
evaluations: Dict[str, Dict[str, JSONType]]
8384

8485

86+
class _ExperimentRunInfo:
87+
def __init__(self, run_iteration: int):
88+
self._id = uuid.uuid4()
89+
# always increment the representation of iteration by 1 for readability
90+
self._run_iteration = run_iteration + 1
91+
92+
8593
class ExperimentRowResult(TypedDict):
8694
idx: int
8795
record_id: Optional[str]
@@ -96,9 +104,24 @@ class ExperimentRowResult(TypedDict):
96104
error: Dict[str, Optional[str]]
97105

98106

107+
class ExperimentRun:
108+
def __init__(
109+
self,
110+
run: _ExperimentRunInfo,
111+
summary_evaluations: Dict[str, Dict[str, JSONType]],
112+
rows: List[ExperimentRowResult],
113+
):
114+
self.run_id = run._id
115+
self.run_iteration = run._run_iteration
116+
self.summary_evaluations = summary_evaluations or {}
117+
self.rows = rows or []
118+
119+
99120
class ExperimentResult(TypedDict):
121+
# TODO: remove these fields (summary_evaluations, rows) in the next major release (5.x)
100122
summary_evaluations: Dict[str, Dict[str, JSONType]]
101123
rows: List[ExperimentRowResult]
124+
runs: List[ExperimentRun]
102125

103126

104127
class Dataset:
@@ -330,6 +353,7 @@ def __init__(
330353
]
331354
]
332355
] = None,
356+
runs: Optional[int] = None,
333357
) -> None:
334358
self.name = name
335359
self._task = task
@@ -340,6 +364,7 @@ def __init__(
340364
self._tags: Dict[str, str] = tags or {}
341365
self._tags["ddtrace.version"] = str(ddtrace.__version__)
342366
self._config: Dict[str, JSONType] = config or {}
367+
self._runs: int = runs or 1
343368
self._llmobs_instance = _llmobs_instance
344369

345370
if not project_name:
@@ -372,31 +397,47 @@ def run(self, jobs: int = 1, raise_errors: bool = False, sample_size: Optional[i
372397
self._config,
373398
convert_tags_dict_to_list(self._tags),
374399
self._description,
400+
self._runs,
375401
)
376402
self._id = experiment_id
377403
self._tags["experiment_id"] = str(experiment_id)
378404
self._run_name = experiment_run_name
379-
task_results = self._run_task(jobs, raise_errors, sample_size)
380-
evaluations = self._run_evaluators(task_results, raise_errors=raise_errors)
381-
summary_evals = self._run_summary_evaluators(task_results, evaluations, raise_errors)
382-
experiment_results = self._merge_results(task_results, evaluations, summary_evals)
383-
experiment_evals = self._generate_metrics_from_exp_results(experiment_results)
384-
self._llmobs_instance._dne_client.experiment_eval_post(
385-
self._id, experiment_evals, convert_tags_dict_to_list(self._tags)
386-
)
405+
run_results = []
406+
# for backwards compatibility
407+
for run_iteration in range(self._runs):
408+
run = _ExperimentRunInfo(run_iteration)
409+
self._tags["run_id"] = str(run._id)
410+
self._tags["run_iteration"] = str(run._run_iteration)
411+
task_results = self._run_task(jobs, run, raise_errors, sample_size)
412+
evaluations = self._run_evaluators(task_results, raise_errors=raise_errors)
413+
summary_evals = self._run_summary_evaluators(task_results, evaluations, raise_errors)
414+
run_result = self._merge_results(run, task_results, evaluations, summary_evals)
415+
experiment_evals = self._generate_metrics_from_exp_results(run_result)
416+
self._llmobs_instance._dne_client.experiment_eval_post(
417+
self._id, experiment_evals, convert_tags_dict_to_list(self._tags)
418+
)
419+
run_results.append(run_result)
387420

388-
return experiment_results
421+
experiment_result: ExperimentResult = {
422+
# for backwards compatibility, the first result fills the old fields of rows and summary evals
423+
"summary_evaluations": run_results[0].summary_evaluations if len(run_results) > 0 else {},
424+
"rows": run_results[0].rows if len(run_results) > 0 else [],
425+
"runs": run_results,
426+
}
427+
return experiment_result
389428

390429
@property
391430
def url(self) -> str:
392431
# FIXME: will not work for subdomain orgs
393432
return f"{_get_base_url()}/llm/experiments/{self._id}"
394433

395-
def _process_record(self, idx_record: Tuple[int, DatasetRecord]) -> Optional[TaskResult]:
434+
def _process_record(self, idx_record: Tuple[int, DatasetRecord], run: _ExperimentRunInfo) -> Optional[TaskResult]:
396435
if not self._llmobs_instance or not self._llmobs_instance.enabled:
397436
return None
398437
idx, record = idx_record
399-
with self._llmobs_instance._experiment(name=self._task.__name__, experiment_id=self._id) as span:
438+
with self._llmobs_instance._experiment(
439+
name=self._task.__name__, experiment_id=self._id, run_id=str(run._id), run_iteration=run._run_iteration
440+
) as span:
400441
span_context = self._llmobs_instance.export_span(span=span)
401442
if span_context:
402443
span_id = span_context.get("span_id", "")
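The return value assembled in `run` in the hunk above keeps the legacy keys pointing at the first run while `runs` holds every run in order. A minimal sketch of that shape follows, assuming a hypothetical `RunStandIn` class and `build_result` helper that only imitate `ExperimentRun` and the dict built in `run`; neither name exists in ddtrace.

```python
import uuid
from typing import Any, Dict, List


class RunStandIn:
    """Hypothetical stand-in for ExperimentRun; not the ddtrace class."""

    def __init__(self, iteration_index: int, rows: List[Dict[str, Any]]):
        self.run_id = uuid.uuid4()
        # Stored 1-based, following the "+ 1 for readability" comment in the diff.
        self.run_iteration = iteration_index + 1
        self.rows = rows
        self.summary_evaluations: Dict[str, Any] = {}


def build_result(run_results: List[RunStandIn]) -> Dict[str, Any]:
    # Legacy keys mirror the first run only; "runs" keeps every run in order.
    first = run_results[0] if run_results else None
    return {
        "summary_evaluations": first.summary_evaluations if first else {},
        "rows": first.rows if first else [],
        "runs": run_results,
    }


runs = [RunStandIn(i, [{"idx": 0, "output": f"answer from run {i + 1}"}]) for i in range(3)]
result = build_result(runs)

print(result["rows"][0]["output"])
print([r.run_iteration for r in result["runs"]])
```

Code that reads only `rows` or `summary_evaluations` keeps working under this shape, but it sees the first run alone, so comparisons across iterations need to go through `runs`.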
@@ -436,7 +477,9 @@ def _process_record(self, idx_record: Tuple[int, DatasetRecord]) -> Optional[Tas
436477
},
437478
}
438479

439-
def _run_task(self, jobs: int, raise_errors: bool = False, sample_size: Optional[int] = None) -> List[TaskResult]:
480+
def _run_task(
481+
self, jobs: int, run: _ExperimentRunInfo, raise_errors: bool = False, sample_size: Optional[int] = None
482+
) -> List[TaskResult]:
440483
if not self._llmobs_instance or not self._llmobs_instance.enabled:
441484
return []
442485
if sample_size is not None and sample_size < len(self._dataset):
@@ -456,7 +499,9 @@ def _run_task(self, jobs: int, raise_errors: bool = False, sample_size: Optional
456499
subset_dataset = self._dataset
457500
task_results = []
458501
with ThreadPoolExecutor(max_workers=jobs) as executor:
459-
for result in executor.map(self._process_record, enumerate(subset_dataset)):
502+
for result in executor.map(
503+
self._process_record, enumerate(subset_dataset), itertools.repeat(run, len(subset_dataset))
504+
):
460505
if not result:
461506
continue
462507
task_results.append(result)
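The hunk above passes the same `run` object to every `_process_record` call by pairing the enumerated dataset with `itertools.repeat`. A self-contained sketch of that pattern is shown here; `process_record`, `records`, and the `"run-1"` label are illustrative placeholders, not code from the commit.

```python
import itertools
from concurrent.futures import ThreadPoolExecutor
from typing import List, Tuple


def process_record(idx_record: Tuple[int, str], run_label: str) -> str:
    # Placeholder worker: receives one (index, record) pair plus the shared run label.
    idx, record = idx_record
    return f"{run_label}:{idx}:{record}"


records: List[str] = ["China", "South Africa", "Chad", "Canada"]

with ThreadPoolExecutor(max_workers=2) as executor:
    # executor.map accepts several iterables and calls the function with one item from each.
    # Results come back in input order, regardless of which worker finishes first.
    results = list(
        executor.map(process_record, enumerate(records), itertools.repeat("run-1", len(records)))
    )

print(results)
```

Giving `itertools.repeat` an explicit count, as the diff does, keeps the repeated iterable the same length as the dataset rather than relying on the shorter iterable to end the pairing.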
@@ -543,10 +588,11 @@ def _run_summary_evaluators(
543588

544589
def _merge_results(
545590
self,
591+
run: _ExperimentRunInfo,
546592
task_results: List[TaskResult],
547593
evaluations: List[EvaluationResult],
548594
summary_evaluations: Optional[List[EvaluationResult]],
549-
) -> ExperimentResult:
595+
) -> ExperimentRun:
550596
experiment_results = []
551597
for idx, task_result in enumerate(task_results):
552598
output_data = task_result["output"]
@@ -575,11 +621,7 @@ def _merge_results(
575621
for name, eval_data in summary_evaluation["evaluations"].items():
576622
summary_evals[name] = eval_data
577623

578-
result: ExperimentResult = {
579-
"summary_evaluations": summary_evals,
580-
"rows": experiment_results,
581-
}
582-
return result
624+
return ExperimentRun(run, summary_evals, experiment_results)
583625

584626
def _generate_metric_from_evaluation(
585627
self,
@@ -615,11 +657,11 @@ def _generate_metric_from_evaluation(
615657
}
616658

617659
def _generate_metrics_from_exp_results(
618-
self, experiment_result: ExperimentResult
660+
self, experiment_result: ExperimentRun
619661
) -> List["LLMObsExperimentEvalMetricEvent"]:
620662
eval_metrics = []
621663
latest_timestamp: int = 0
622-
for exp_result in experiment_result["rows"]:
664+
for exp_result in experiment_result.rows:
623665
evaluations = exp_result.get("evaluations") or {}
624666
span_id = exp_result.get("span_id", "")
625667
trace_id = exp_result.get("trace_id", "")
@@ -636,7 +678,7 @@ def _generate_metrics_from_exp_results(
636678
)
637679
eval_metrics.append(eval_metric)
638680

639-
for name, summary_eval_data in experiment_result.get("summary_evaluations", {}).items():
681+
for name, summary_eval_data in experiment_result.summary_evaluations.items():
640682
if not summary_eval_data:
641683
continue
642684
eval_metric = self._generate_metric_from_evaluation(

ddtrace/llmobs/_llmobs.py

Lines changed: 28 additions & 0 deletions
@@ -57,6 +57,8 @@
5757
from ddtrace.llmobs._constants import EXPERIMENT_CSV_FIELD_MAX_SIZE
5858
from ddtrace.llmobs._constants import EXPERIMENT_EXPECTED_OUTPUT
5959
from ddtrace.llmobs._constants import EXPERIMENT_ID_KEY
60+
from ddtrace.llmobs._constants import EXPERIMENT_RUN_ID_KEY
61+
from ddtrace.llmobs._constants import EXPERIMENT_RUN_ITERATION_KEY
6062
from ddtrace.llmobs._constants import EXPERIMENTS_INPUT
6163
from ddtrace.llmobs._constants import EXPERIMENTS_OUTPUT
6264
from ddtrace.llmobs._constants import INPUT_DOCUMENTS
@@ -480,6 +482,20 @@ def _llmobs_tags(span: Span, ml_app: str, session_id: Optional[str] = None) -> L
480482
existing_tags = span._get_ctx_item(TAGS)
481483
if existing_tags is not None:
482484
tags.update(existing_tags)
485+
486+
# set experiment tags on children spans if the tags do not already exist
487+
experiment_id = span.context.get_baggage_item(EXPERIMENT_ID_KEY)
488+
if experiment_id and "experiment_id" not in tags:
489+
tags["experiment_id"] = experiment_id
490+
491+
run_id = span.context.get_baggage_item(EXPERIMENT_RUN_ID_KEY)
492+
if run_id and "run_id" not in tags:
493+
tags["run_id"] = run_id
494+
495+
run_iteration = span.context.get_baggage_item(EXPERIMENT_RUN_ITERATION_KEY)
496+
if run_iteration and "run_iteration" not in tags:
497+
tags["run_iteration"] = run_iteration
498+
483499
return ["{}:{}".format(k, v) for k, v in tags.items()]
484500

485501
def _do_annotations(self, span: Span) -> None:
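The tag fallback added to `_llmobs_tags` above can be sketched without a tracer. This is a simplified illustration, assuming a plain dict in place of span baggage and a hypothetical `tags_with_experiment_context` helper; only the three baggage key strings are taken from `_constants.py` in this commit.

```python
from typing import Dict, List

# Tag name -> baggage key, using the key strings added in _constants.py.
EXPERIMENT_TAGS = (
    ("experiment_id", "_ml_obs.experiment_id"),
    ("run_id", "_ml_obs.experiment_run_id"),
    ("run_iteration", "_ml_obs.experiment_run_iteration"),
)


def tags_with_experiment_context(tags: Dict[str, object], baggage: Dict[str, object]) -> List[str]:
    # Hypothetical helper: `baggage` is a plain dict standing in for span.context baggage.
    merged = dict(tags)
    for tag_name, baggage_key in EXPERIMENT_TAGS:
        value = baggage.get(baggage_key)
        # Only fill in the tag when baggage has a value and the span has no tag of its own.
        if value and tag_name not in merged:
            merged[tag_name] = value
    return ["{}:{}".format(k, v) for k, v in merged.items()]


baggage = {
    "_ml_obs.experiment_id": "exp-1",
    "_ml_obs.experiment_run_id": "run-a",
    "_ml_obs.experiment_run_iteration": 2,
}
tag_list = tags_with_experiment_context({"run_id": "explicit"}, baggage)
print(tag_list)
```

An explicitly set tag wins over the inherited baggage value. The truthiness check would skip a `run_iteration` of 0, which appears safe here because the commit stores iterations 1-based.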
@@ -814,6 +830,7 @@ def experiment(
814830
]
815831
]
816832
] = None,
833+
runs: Optional[int] = 1,
817834
) -> Experiment:
818835
"""Initializes an Experiment to run a task on a Dataset and evaluators.
819836
@@ -830,6 +847,8 @@ def experiment(
830847
to produce a single value.
831848
Must accept parameters ``inputs``, ``outputs``, ``expected_outputs``,
832849
``evaluators_results``.
850+
:param runs: The number of times to run the experiment. Each run executes the task once against every
851+
dataset record.
833852
"""
834853
if not callable(task):
835854
raise TypeError("task must be a callable function.")
@@ -870,6 +889,7 @@ def experiment(
870889
config=config,
871890
_llmobs_instance=cls._instance,
872891
summary_evaluators=summary_evaluators,
892+
runs=runs,
873893
)
874894

875895
@classmethod
@@ -1336,6 +1356,8 @@ def _experiment(
13361356
session_id: Optional[str] = None,
13371357
ml_app: Optional[str] = None,
13381358
experiment_id: Optional[str] = None,
1359+
run_id: Optional[str] = None,
1360+
run_iteration: Optional[int] = None,
13391361
) -> Span:
13401362
"""
13411363
Trace an LLM experiment, only used internally by the experiments SDK.
@@ -1354,6 +1376,12 @@ def _experiment(
13541376
if experiment_id:
13551377
span.context.set_baggage_item(EXPERIMENT_ID_KEY, experiment_id)
13561378

1379+
if run_id:
1380+
span.context.set_baggage_item(EXPERIMENT_RUN_ID_KEY, run_id)
1381+
1382+
if run_iteration is not None:
1383+
span.context.set_baggage_item(EXPERIMENT_RUN_ITERATION_KEY, run_iteration)
1384+
13571385
return span
13581386

13591387
@classmethod

ddtrace/llmobs/_writer.py

Lines changed: 2 additions & 0 deletions
@@ -639,6 +639,7 @@ def experiment_create(
639639
exp_config: Optional[Dict[str, JSONType]] = None,
640640
tags: Optional[List[str]] = None,
641641
description: Optional[str] = None,
642+
runs: Optional[int] = 1,
642643
) -> Tuple[str, str]:
643644
path = "/api/unstable/llm-obs/v1/experiments"
644645
resp = self.request(
@@ -656,6 +657,7 @@ def experiment_create(
656657
"config": exp_config or {},
657658
"metadata": {"tags": cast(JSONType, tags or [])},
658659
"ensure_unique": True,
660+
"run_count": runs,
659661
},
660662
}
661663
},
Lines changed: 10 additions & 0 deletions
@@ -0,0 +1,10 @@
1+
---
2+
features:
3+
- |
4+
LLM Observability: Experiments can now be run multiple times by using the optional ``runs`` argument,
5+
to assess an experiment's performance despite the nondeterminism of LLMs. Use the new ``runs`` field of ``ExperimentResult`` to access the results and summary evaluations of each run iteration.
6+
- |
7+
LLM Observability: Non-root experiment spans are now tagged with experiment ID, run ID, and run iteration tags.
8+
deprecations:
9+
- |
10+
LLM Observability: The ``rows`` and ``summary_evaluations`` fields of ``ExperimentResult`` are deprecated and will be removed in the next major release. For multi-run experiments, these fields only store the results of the first run iteration. Use ``ExperimentResult.runs`` instead to access experiment results and summary evaluations.
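One way to use per-run results is to compute how often each evaluator passes across all iterations. The sketch below assumes stand-in data: `SimpleNamespace` objects imitate runs with a `rows` attribute, and the row dicts copy the `evaluations` shape shown in the test output. `pass_rate_by_evaluator` is a hypothetical helper, not part of ddtrace.

```python
from collections import defaultdict
from types import SimpleNamespace
from typing import Any, Dict, List


def evaluation(value: Any, error: Any = None) -> Dict[str, Any]:
    # Same shape as one entry of a row's "evaluations" dict in the test output.
    return {"value": value, "error": error}


# Stand-in runs; a real result would provide these under result["runs"].
runs = [
    SimpleNamespace(run_iteration=1, rows=[
        {"evaluations": {"exact_match": evaluation(False), "contains_answer": evaluation(True)}},
        {"evaluations": {"exact_match": evaluation(False), "contains_answer": evaluation(True)}},
    ]),
    SimpleNamespace(run_iteration=2, rows=[
        # An evaluator that raised: value is None and error is populated.
        {"evaluations": {"exact_match": evaluation(False),
                         "contains_answer": evaluation(None, {"type": "TypeError"})}},
        {"evaluations": {"exact_match": evaluation(False), "contains_answer": evaluation(False)}},
    ]),
]


def pass_rate_by_evaluator(run_list: List[Any]) -> Dict[str, float]:
    passed: Dict[str, int] = defaultdict(int)
    total: Dict[str, int] = defaultdict(int)
    for run in run_list:
        for row in run.rows:
            for name, result in row["evaluations"].items():
                if result["error"] is not None:
                    continue  # errored evaluations are left out of the rate
                total[name] += 1
                passed[name] += 1 if result["value"] else 0
    return {name: passed[name] / total[name] for name in total}


rates = pass_rate_by_evaluator(runs)
print(rates)
```

Excluding errored evaluations is a design choice of this sketch; counting them as failures instead would lower the rate for any evaluator that raised.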
@@ -0,0 +1,51 @@
1+
interactions:
2+
- request:
3+
body: '{"data": {"type": "experiments", "attributes": {"scope": "experiments",
4+
"metrics": [{"metric_source": "custom", "span_id": "123", "trace_id": "456",
5+
"timestamp_ms": 1234, "metric_type": "score", "label": "dummy_evaluator", "score_value":
6+
0, "error": null, "tags": ["ddtrace.version:1.2.3", "experiment_id:3f6922dd-477b-40dd-9fd2-baeaab0542a4",
7+
"run_id:12345678-abcd-abcd-abcd-123456789012", "run_iteration:1"], "experiment_id":
8+
"3f6922dd-477b-40dd-9fd2-baeaab0542a4"}], "tags": ["ddtrace.version:1.2.3",
9+
"experiment_id:3f6922dd-477b-40dd-9fd2-baeaab0542a4", "run_id:12345678-abcd-abcd-abcd-123456789012",
10+
"run_iteration:1"]}}}'
11+
headers:
12+
Accept:
13+
- '*/*'
14+
? !!python/object/apply:multidict._multidict.istr
15+
- Accept-Encoding
16+
: - identity
17+
Connection:
18+
- keep-alive
19+
Content-Length:
20+
- '626'
21+
? !!python/object/apply:multidict._multidict.istr
22+
- Content-Type
23+
: - application/json
24+
User-Agent:
25+
- python-requests/2.32.3
26+
method: POST
27+
uri: https://api.datadoghq.com/api/unstable/llm-obs/v1/experiments/3f6922dd-477b-40dd-9fd2-baeaab0542a4/events
28+
response:
29+
body:
30+
string: ''
31+
headers:
32+
content-length:
33+
- '0'
34+
content-security-policy:
35+
- frame-ancestors 'self'; report-uri https://logs.browser-intake-datadoghq.com/api/v2/logs?dd-api-key=pube4f163c23bbf91c16b8f57f56af9fc58&dd-evp-origin=content-security-policy&ddsource=csp-report&ddtags=site%3Adatadoghq.com
36+
content-type:
37+
- application/vnd.api+json
38+
date:
39+
- Wed, 12 Nov 2025 21:30:20 GMT
40+
strict-transport-security:
41+
- max-age=31536000; includeSubDomains; preload
42+
vary:
43+
- Accept-Encoding
44+
x-content-type-options:
45+
- nosniff
46+
x-frame-options:
47+
- SAMEORIGIN
48+
status:
49+
code: 202
50+
message: Accepted
51+
version: 1
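The request body recorded above carries run information as `key:value` tag strings. A small sketch for reading them back follows; `tags_to_dict` is a hypothetical helper, and the tag values are copied from the cassette.

```python
from typing import Dict, List


def tags_to_dict(tags: List[str]) -> Dict[str, str]:
    parsed: Dict[str, str] = {}
    for tag in tags:
        # Split on the first colon only, so values that contain colons stay intact.
        key, _, value = tag.partition(":")
        parsed[key] = value
    return parsed


tags = [
    "ddtrace.version:1.2.3",
    "experiment_id:3f6922dd-477b-40dd-9fd2-baeaab0542a4",
    "run_id:12345678-abcd-abcd-abcd-123456789012",
    "run_iteration:1",
]
parsed = tags_to_dict(tags)
print(parsed["run_id"], int(parsed["run_iteration"]))
```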
