agent-evaluation

Here are 143 public repositories matching this topic...

coze-dev / coze-loop

Next-generation AI Agent Optimization Platform: Cozeloop addresses challenges in AI agent development by providing full-lifecycle management capabilities from development, debugging, and evaluation to monitoring.

agent open-source playground ai monitoring evaluation openai observability agentops coze langchain llmops prompt-management llm-observability agent-evaluation eino agent-observability

Updated May 9, 2026
Go

Giskard-AI / giskard-oss

🐢 Open-Source Evaluation & Testing library for LLM Agents

ai-security mlops fairness-ai responsible-ai ml-validation red-team-tools trustworthy-ai ml-testing llm ai-red-team ai-testing llmops llm-security llm-eval llm-evaluation rag-evaluation agent-evaluation

Updated May 7, 2026
Python

truera / trulens

Evaluation and Tracking for LLM Experiments and AI Agents

machine-learning neural-networks ai-agents explainable-ml agentops ai-monitoring ai-observability llms llmops llm-eval evals llm-evaluation agent-evaluation

Updated May 8, 2026
Python

mozilla-ai / any-agent

A single interface to use and evaluate different agent frameworks

ai mcp agents a2a agent-evaluation

Updated May 1, 2026
Python

ifixai-ai / iFixAi

The open-source diagnostic for AI misalignment. 32 tests across fabrication, manipulation, deception, unpredictability, and opacity. Provider-agnostic. Runs against OpenAI, Anthropic, Bedrock, Azure, Gemini, and more. Letter grade in under 5 minutes, content-addressed manifest for bit-identical replay. Built by iMe.

Updated May 8, 2026
Python

rungalileo / agent-leaderboard

Ranking LLMs on agentic tasks

ai evaluation ai-agents synthetic-data ai-evaluation llms ai-benchmark agent-evaluation

Updated Apr 17, 2026
Jupyter Notebook

reacher-z / ClawBench

Open-source benchmark for browser AI agents on 153 everyday online tasks across 144 live websites. 5-layer recording + DOM-match + LLM judge. Top score 33.3%.

Updated May 9, 2026
Python

hwfengcs / DM-Code-Agent

Lightweight, auditable Python code agent (~1500 LOC) — ReAct + Planner + Reflexion + Hybrid RAG, with SWE-bench Lite eval and trace replay.

agent mcp rag llm llm-agent react-agent agent-skills agent-evaluation reflexion-agent code-agent swe-bench

Updated May 8, 2026
Python

hidai25 / eval-view

Regression testing for AI agents. Snapshot behavior, diff tool calls, catch regressions in CI. Works with LangGraph, CrewAI, OpenAI, Anthropic.

python testing cli mcp evaluation pytest regression-testing ai-agents autogen llm anthropic langchain-agent openai-assistants crewai langgraph agentic-ai agent-evaluation agent-benchmark

Updated May 5, 2026
Python

evaleval / every_eval_ever

Every Eval Ever is a shared schema and crowdsourced eval database. It defines a standardized metadata format for storing AI evaluation results — from leaderboard scrapes and research papers to local evaluation runs — so that results from different frameworks can be compared, reproduced, and reused.

evaluations infra ai-evaluation llm-evaluation agent-evaluation

Updated May 5, 2026
Python

Cre4T3Tiv3 / ai-agents-reality-check

Benchmarking the gap between AI agent hype and architecture. Three agent archetypes, 73-point performance spread, stress testing, network resilience, and ensemble coordination analysis with statistical validation.

python open-source benchmarking reproducible-research statistical-analysis performance-testing network-resilience llm-agent llm-tools agent-architecture agentic-workflow agentic-ai agent-performance agent-evaluation ai-benchmarking agent-benchmark reality-check-ai-agent architectural-evaluation ensemble-coordination

Updated Apr 2, 2026
Python

microsoft / ignite25-PREL13-observe-manage-and-scale-agentic-ai-apps-with-microsoft-foundry

Learn how to observe, manage, and scale agentic AI apps using Azure AI Foundry with this hands-on workshop.

observability quality-evaluation aiops distillation-model azure-openai azure-ai-search safety-evaluation azure-ai-foundry supervised-fine-tuning agent-evaluation azure-ai-foundry-models

Updated Mar 26, 2026
Jupyter Notebook

SparkBeyond / agentune

Tune your AI Agent to best meet its KPI with a cyclic process of analyze, improve and simulate

customer-support customer-service conversational-agents ai-agents chatbot-evaluation agent-simulator kpi-analysis agent-evaluation agent-optimization sales-agents customer-facing-agents kpi-optimization

Updated Jan 14, 2026
Python

chirpz-ai / pandaprobe

🐼 open source agent engineering platform: traces, evals, and metrics to debug and improve your AI agents. Integrates with LangGraph, CrewAI, Claude Agent SDK, and more.

open-source monitoring self-hosted tracing crewai langgraph agentic-ai agent-evaluation agent-engineering openai-agents-sdk agent-observability claude-agent-sdk

Updated May 7, 2026
Python

chaosync-org / awesome-ai-agent-testing

🤖 A curated list of resources for testing AI agents - frameworks, methodologies, benchmarks, tools, and best practices for ensuring reliable, safe, and effective autonomous AI systems

testing qa benchmark machine-learning evaluation chaos artificial-intelligence chaos-monkey testing-tools awesome-list quality-assurance ai-safety ai-agents chaos-engineering llm llm-evaluation agentic-ai ai-benchmark agent-evaluation

Updated May 28, 2025

solana8800 / langeval

Evaluation Infrastructure for AI Agents

ai-evaluation agent-evaluation ai-evals

Updated Feb 25, 2026
TypeScript

Not-Diamond / self-care

Agent trace analysis and context remediation plugin for Claude Code. Detects quality issues in your AI agent traces — goal drift, hallucinations, missed actions, and more.

plugin tracing observability ai-agents opentelemetry llm claude-code agent-evaluation

Updated Apr 16, 2026
JavaScript

dokimos-dev / dokimos

Evaluation Framework for LLM applications in Java and Kotlin

Updated May 9, 2026
Java

Arc-Computer / CL-Bench

Benchmark framework for evaluating LLM agent continual learning in stateful environments. Features production-realistic CRM workflows with multi-turn conversations, state mutations, and cross-entity relationships. Extensible to additional domains.

benchmark continual-learning agent-evaluation

Updated Nov 14, 2025
Python

alepot55 / agentrial

Statistical evaluation framework for AI agents

python testing ci-cd pytest confidence-intervals quality-assurance non-deterministic ai-agents mlops statistical-testing llm ai-testing llm-evaluation agent-evaluation

Updated Feb 6, 2026
Python
