ai-evals

Star

Here are 25 public repositories matching this topic...

solana8800 / langeval

Sponsor

Star

Evaluation Infrastructure for AI Agents

ai-evaluation agent-evaluation ai-evals

Updated Feb 25, 2026
TypeScript

productfoundry101 / ai-evals-bootcamp

Star

Learn to evaluate AI products for production — 21 hands-on lessons on evals, metrics, fairness, agents, red teaming, and release decisions for working PMs.

bootcamp red-teaming rag prompt-engineering llmops ai-product-management llm-evaluation claude-code ai-pm ai-evals

Updated May 6, 2026

yiouli / pixie-qa

Star

Agent skill for AI agent development

skill dev eval llm agent-skills ai-evals

Updated Apr 22, 2026
HTML

RafaelParonis / jailbench

Star

🔍 Benchmark jailbreak resilience in LLMs with JailBench for clear insights and improved model defenses against jailbreak attempts.

python flask analytics openai alignment model-evaluation ai-safety security-testing red-teaming model-robustness anthropic litellm content-safety llm-jailbreaks tool-calling llm-benchmark ai-evals textual-tui

Updated May 9, 2026
Python

vibheksoni / jailbench

Star

Benchmark LLM jailbreak resilience across providers with standardized tests, adversarial mode, rich analytics, and a clean Web UI.

Updated Aug 12, 2025
Python

vitron-ai / aip-foundry-themis-starter

Star

Unofficial TypeScript starter for deterministic local contract testing around Foundry-oriented workflows with Themis.

react typescript schema-validation themis contract-testing osdk developer-tooling agentic-workflows ai-evals foundry-workflows

Updated Mar 28, 2026
TypeScript

MohsinCreed / LangfuseOllama

Star

Free, local Langfuse OSS setup with Ollama for LLM evaluation, scoring, and datasets.

docker open-source self-hosted free no-cost local-llm ollama langfuse llm-evaluation prompt-evaluation offline-ai llm-as-judge llm-observability ai-evals

Updated Apr 13, 2026
TypeScript

SuperfiedStudd / ai-evals-orchestration

Star

End-to-end AI evals orchestration platform for comparing LLM outputs across providers with transcription, structured logging, human review, and Supabase-backed decision tracking.

gemini openai multi-model transcription human-in-the-loop model-comparison supabase anthropic llm-evaluation ai-evals evaluation-pipeline

Updated Mar 10, 2026
TypeScript

aksiom-dev / eu-ai-act-evals

Star

Evaluation prompts for AI systems against the EU AI Act.

ai-safety regulatory ai-compliance ai-governance eu-ai-act ai-evals

Updated May 6, 2026
TypeScript

IsaacCavallaro / agent-evals-workbench

Star

A lightweight workbench for dataset-driven agent and LLM evaluation.

python cli regression-testing llm-evals agent-evals openai-compatible ai-evals eval-harness

Updated May 1, 2026
Python

mayankmankhand / trace-annotator

Star

Local web app that teaches PMs how to do open coding and error analysis on LLM traces. Keyboard-first labeling, native trace rendering, JSONL export.

nextjs annotation-tool error-analysis prompt-engineering llm-observability llm-evals ai-pm ai-evals

Updated May 8, 2026
TypeScript

vishal-labade / llm_exp_platform_v2

Star

Experimentation framework for LLM systems using simulated users, conversational behavioral metrics, and causal inference to evaluate prompt strategies, temperature, and model scaling.

experimentation causal-inference product-analytics llm-evaluation llm-benchmarking ai-evals

Updated Mar 8, 2026
Python

Abhi-2016 / agentic-ai-learning

Star

Hands-on Agentic AI learning project — ReAct agents, memory systems, evals, and multi-agent architecture. Built as a structured AI PM curriculum.

python product-management react-pattern ai-agent llm research-automation anthropic llm-as-judge agentic-ai ai-evals

Updated Apr 25, 2026
Python

EaCognitive / Metivta-Eval

Sponsor

Star

Evaluation harness for domain-specific RAG and QA systems with benchmark datasets, scoring, and regression workflows.

benchmarking domain-qa retrieval-augmented-generation llm-evaluation rag-evaluation evaluation-harness ai-evals

Updated Mar 8, 2026
Python

vineethcv / eval-engine

Star

Lightweight eval framework for LLMs & AI apps combining deterministic scoring, LLM-as-judge, and regression testing.

python testing evaluation openai llm ai-quality evals ai-evals ai-quality-assurance

Updated Apr 8, 2026
Python

RamyaLakshmiKS / agentic_software_team

Star

Multi-agent system orchestrating an AI-driven software team using the Claude Agents SDK. Agents take on defined roles and collaborate autonomously on software tasks.

jira ai orchestration multi-agent confluence atlassian llm generative-ai anthropic llm-agents agentic-ai mcp-server ai-evals claude-agent-sdk

Updated Feb 4, 2026
Python

AlejandroFuentePinero / ai-jie

Star

LLM extraction pipeline for job postings; ~80% skill classification accuracy via chain-of-thought scaffolding.

structured-output pydantic ayncio prompt-engineering ai-evals

Updated Apr 9, 2026
Python

bjk95 / defrost

Star

Track AI evals, metrics, and tests with Git as the database.

metrics developer-tools agents observability traces llm evals ai-evals

Updated May 5, 2026
Go

abhinov / agentmed

Star

A smart clinical AI Copilot for Diabetic Retinopathy that uses a multi-agent 'Maker-Checker' system to cross-reference medical scans, improving diagnostic accuracy while lowering running costs.

multi-agent vlm medical-ai streamlit healthcare-ai llama3 gpt-4o ai-evals

Updated May 7, 2026
Python

7ahir / project-forge

Star

Portfolio project showing product strategy, evals, roadmap, and GTM for a GenAI coding assistant

developer-tools product-management portfolio-project product-strategy llm genai ai-coding-assistant ai-evals

Updated Apr 2, 2026

Improve this page

Add a description, image, and links to the ai-evals topic page so that developers can more easily learn about it.

Curate this topic

Add this topic to your repo

To associate your repository with the ai-evals topic, visit your repo's landing page and select "manage topics."

Learn more

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ai-evals

Here are 25 public repositories matching this topic...

solana8800 / langeval

productfoundry101 / ai-evals-bootcamp

yiouli / pixie-qa

RafaelParonis / jailbench

vibheksoni / jailbench

vitron-ai / aip-foundry-themis-starter

MohsinCreed / LangfuseOllama

SuperfiedStudd / ai-evals-orchestration

aksiom-dev / eu-ai-act-evals

IsaacCavallaro / agent-evals-workbench

mayankmankhand / trace-annotator

vishal-labade / llm_exp_platform_v2

Abhi-2016 / agentic-ai-learning

EaCognitive / Metivta-Eval

vineethcv / eval-engine

RamyaLakshmiKS / agentic_software_team

AlejandroFuentePinero / ai-jie

bjk95 / defrost

abhinov / agentmed

7ahir / project-forge

Improve this page

Add this topic to your repo

Search code, repositories, users, issues, pull requests...

ai-evals

Here are 25 public repositories matching this topic...

solana8800 / langeval

productfoundry101 / ai-evals-bootcamp

yiouli / pixie-qa

RafaelParonis / jailbench

vibheksoni / jailbench

vitron-ai / aip-foundry-themis-starter

MohsinCreed / LangfuseOllama

SuperfiedStudd / ai-evals-orchestration

aksiom-dev / eu-ai-act-evals

IsaacCavallaro / agent-evals-workbench

mayankmankhand / trace-annotator

vishal-labade / llm_exp_platform_v2

Abhi-2016 / agentic-ai-learning

EaCognitive / Metivta-Eval

vineethcv / eval-engine

RamyaLakshmiKS / agentic_software_team

AlejandroFuentePinero / ai-jie

bjk95 / defrost

abhinov / agentmed

7ahir / project-forge

Improve this page

Add this topic to your repo