Evaluation Infrastructure for AI Agents
-
Updated
Feb 25, 2026 - TypeScript
Evaluation Infrastructure for AI Agents
Learn to evaluate AI products for production — 21 hands-on lessons on evals, metrics, fairness, agents, red teaming, and release decisions for working PMs.
🔍 Benchmark jailbreak resilience in LLMs with JailBench for clear insights and improved model defenses against jailbreak attempts.
Benchmark LLM jailbreak resilience across providers with standardized tests, adversarial mode, rich analytics, and a clean Web UI.
Unofficial TypeScript starter for deterministic local contract testing around Foundry-oriented workflows with Themis.
Free, local Langfuse OSS setup with Ollama for LLM evaluation, scoring, and datasets.
End-to-end AI evals orchestration platform for comparing LLM outputs across providers with transcription, structured logging, human review, and Supabase-backed decision tracking.
Evaluation prompts for AI systems against the EU AI Act.
A lightweight workbench for dataset-driven agent and LLM evaluation.
Local web app that teaches PMs how to do open coding and error analysis on LLM traces. Keyboard-first labeling, native trace rendering, JSONL export.
Experimentation framework for LLM systems using simulated users, conversational behavioral metrics, and causal inference to evaluate prompt strategies, temperature, and model scaling.
Hands-on Agentic AI learning project — ReAct agents, memory systems, evals, and multi-agent architecture. Built as a structured AI PM curriculum.
Evaluation harness for domain-specific RAG and QA systems with benchmark datasets, scoring, and regression workflows.
Lightweight eval framework for LLMs & AI apps combining deterministic scoring, LLM-as-judge, and regression testing.
Multi-agent system orchestrating an AI-driven software team using the Claude Agents SDK. Agents take on defined roles and collaborate autonomously on software tasks.
LLM extraction pipeline for job postings; ~80% skill classification accuracy via chain-of-thought scaffolding.
Track AI evals, metrics, and tests with Git as the database.
A smart clinical AI Copilot for Diabetic Retinopathy that uses a multi-agent 'Maker-Checker' system to cross-reference medical scans, improving diagnostic accuracy while lowering running costs.
Portfolio project showing product strategy, evals, roadmap, and GTM for a GenAI coding assistant
Add a description, image, and links to the ai-evals topic page so that developers can more easily learn about it.
To associate your repository with the ai-evals topic, visit your repo's landing page and select "manage topics."