#

evaluation

Here are 2,853 public repositories matching this topic...

langfuse

langfuse / langfuse

🪢 Open source LLM engineering platform: LLM Observability, metrics, evals, prompt management, playground, datasets. Integrates with OpenTelemetry, Langchain, OpenAI SDK, LiteLLM, and more. 🍊YC W23

open-source playground monitoring analytics evaluation self-hosted ycombinator openai observability autogen large-language-models llm prompt-engineering langchain llmops llama-index prompt-management llm-evaluation llm-observability

Updated May 8, 2026
TypeScript

mlflow

mlflow / mlflow

The open source AI engineering platform for agents, LLMs, and ML models. MLflow enables teams of all sizes to debug, evaluate, monitor, and optimize production-quality AI applications while controlling costs and managing access to models and data.

open-source machine-learning ai apache-spark evaluation ml openai agents observability model-management mlops mlflow agentops prompt-engineering ai-governance langchain llmops llm-evaluation

Updated May 8, 2026
Python

promptfoo / promptfoo

Test your prompts, agents, and RAGs. Red teaming/pentesting/vulnerability scanning for AI. Compare performance of GPT, Claude, Gemini, Llama, and more. Simple declarative configs with command line and CI/CD integration. Used by OpenAI and Anthropic.

testing ci evaluation ci-cd pentesting cicd vulnerability-scanners prompts evaluation-framework red-teaming rag llm prompt-engineering llmops prompt-testing llm-eval llm-evaluation llm-evaluation-framework

Updated May 9, 2026
TypeScript

comet-ml / opik

Debug, evaluate, and monitor your LLM applications, RAG systems, and agentic workflows with comprehensive tracing, automated evaluations, and production-ready dashboards.

open-source playground evaluation openai hacktoberfest llm prompt-engineering hacktoberfest2025 langchain llmops llama-index llm-evaluation llm-observability

Updated May 8, 2026
Python

Tencent / WeKnora

Open-source LLM knowledge platform: turn raw documents into a queryable RAG, an autonomous reasoning agent, and a self-maintaining Wiki.

Updated May 8, 2026
Go

vibrantlabsai / ragas

Supercharge Your LLM Application Evaluations 🚀

evaluation llm llmops

Updated Feb 24, 2026
Python

mrgloom / awesome-semantic-segmentation

🤘 awesome-semantic-segmentation

benchmark evaluation deeplearning semantic-segmentation

Updated May 8, 2021

oumi-ai / oumi

Easily fine-tune, evaluate and deploy Gemma 4, Qwen3.5, Qwen3.6, gpt-oss, DeepSeek-R1, or any open source LLM / VLM!

evaluation inference llama fine-tuning sft dpo slms llms vlms gpt-oss gpt-oss-120b gpt-oss-20b

Updated May 9, 2026
Python

open-compass / opencompass

OpenCompass is an LLM evaluation platform, supporting a wide range of models (Llama3, Mistral, InternLM2,GPT-4,LLaMa2, Qwen,GLM, Claude, etc) over 100+ datasets.

benchmark evaluation openai llm chatgpt large-language-model llama2 llama3

Updated Apr 20, 2026
Python

Helicone / helicone

🧊 Open source LLM observability platform. One line of code to monitor, evaluate, and experiment. YC W23 🍓

open-source playground monitoring analytics evaluation ycombinator openai gpt large-language-models llm prompt-engineering langchain llmops llama-index prompt-management llm-evaluation llm-observability agent-monitoring llm-cost

Updated May 5, 2026
TypeScript

coze-dev / coze-loop

Next-generation AI Agent Optimization Platform: Cozeloop addresses challenges in AI agent development by providing full-lifecycle management capabilities from development, debugging, and evaluation to monitoring.

agent open-source playground ai monitoring evaluation openai observability agentops coze langchain llmops prompt-management llm-observability agent-evaluation eino agent-observability

Updated May 8, 2026
Go

Kiln

Kiln-AI / Kiln

Build, Evaluate, and Optimize AI Systems. Includes evals, RAG, agents, fine-tuning, synthetic data generation, dataset management, MCP, and more.

Updated May 8, 2026
Python

AutoRAG

Marker-Inc-Korea / AutoRAG

AutoRAG: An Open-Source Framework for Retrieval-Augmented Generation (RAG) Evaluation & Optimization with AutoML-Style Automation

python open-source qa benchmarking ops pipeline analysis optimization evaluation embeddings automl document-parser rag llm retrieval-augmented-generation llm-ops llm-evaluation rag-evaluation

Updated May 9, 2026
Python

evo

MichaelGrupp / evo

Python package for the evaluation of odometry and SLAM

benchmark robotics tum mapping metrics evaluation ros slam trajectory-analysis odometry trajectory ros2 kitti euroc trajectory-evaluation

Updated May 8, 2026
Python

open-compass / VLMEvalKit

Open-source evaluation toolkit of large multi-modality models (LMMs), support 220+ LMMs, 80+ benchmarks

computer-vision evaluation pytorch gemini openai vqa vit gpt multi-modal clip claude openai-api gpt4 large-language-models llm chatgpt llava qwen gpt-4v

Updated Apr 29, 2026
Python

agenta

Agenta-AI / agenta

The open-source LLMOps platform: prompt playground, prompt management, LLM evaluation, and LLM observability all in one place.

evaluation agents observability prompt-engineering llmops prompt-management llm-tools llm-framework llm-playground llm-platform llm-evaluation rag-evaluation llm-monitoring llm-as-a-judge llm-observability

Updated May 8, 2026
TypeScript

EvolvingLMMs-Lab / lmms-eval

One-for-All Multimodal Evaluation Toolkit Across Text, Image, Video, and Audio Tasks

benchmark evaluation agi video-understanding vlm multimodal large-language-models vision-language-model llm-evaluation audio-evaluation multimodal-evaluation

Updated May 8, 2026
Python

Knetic / govaluate

Arbitrary expression evaluation for golang

go parsing evaluation expression

Updated Mar 25, 2025
Go

sdiehl / write-you-a-haskell

Building a modern functional compiler from first principles. (http://dev.stephendiehl.com/fun/)

compiler functional-programming book lambda-calculus evaluation type-theory type pdf-book type-checking haskel type-system functional-language hindley-milner type-inference intermediate-representation

Updated Jan 11, 2021
Haskell

CLUEbenchmark / SuperCLUE

SuperCLUE: 中文通用大模型综合性基准 | A Benchmark for Foundation Models in Chinese

evaluation chinese gpt-4 foundation-models chatgpt

Updated Feb 6, 2026

Improve this page

Add a description, image, and links to the evaluation topic page so that developers can more easily learn about it.

Curate this topic

Add this topic to your repo

To associate your repository with the evaluation topic, visit your repo's landing page and select "manage topics."