An MNIST-like fashion product database. Benchmark 👇
OpenMMLab Pose Estimation Toolbox and Benchmark.
OpenCompass is an LLM evaluation platform, supporting a wide range of models (Llama3, Mistral, InternLM2, GPT-4, LLaMa2, Qwen, GLM, Claude, etc.) over 100+ datasets.
Benchmarks of approximate nearest neighbor libraries in Python
OpenMMLab's Next Generation Video Understanding Toolbox and Benchmark
A series of large language models developed by Baichuan Intelligent Technology
Chinese Language Understanding Evaluation Benchmark: datasets, baselines, pre-trained models, corpus and leaderboard
Python package for the evaluation of odometry and SLAM
A 13B large language model developed by Baichuan Intelligent Technology
SWE-bench [Multimodal]: Can Language Models Resolve Real-world Github Issues?
A unified evaluation framework for large language models
MTEB: Massive Text Embedding Benchmark
⚡FlashRAG: A Python Toolkit for Efficient RAG Research (WWW2025 Resource)
[ECCV2024] Video Foundation Models & Data for Multimodal Understanding
[NeurIPS 2024] OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments
A Heterogeneous Benchmark for Information Retrieval. Easy to use, evaluate your models across 15+ diverse IR datasets.
A machine learning toolkit for log parsing [ICSE'19, DSN'16]
Reference implementations of MLPerf™ training benchmarks
Efficient Retrieval Augmentation and Generation Framework
Rigorous evaluation of LLM-synthesized code - NeurIPS 2023 & COLM 2024