llm-eval

Here are 95 public repositories matching this topic...

promptfoo / promptfoo

Test your prompts, agents, and RAGs. Red teaming/pentesting/vulnerability scanning for AI. Compare performance of GPT, Claude, Gemini, Llama, and more. Simple declarative configs with command line and CI/CD integration. Used by OpenAI and Anthropic.

testing ci evaluation ci-cd pentesting cicd vulnerability-scanners prompts evaluation-framework red-teaming rag llm prompt-engineering llmops prompt-testing llm-eval llm-evaluation llm-evaluation-framework

Updated May 9, 2026
TypeScript

Arize-ai / phoenix

AI Observability & Evaluation

openai datasets agents ai-monitoring ai-observability prompt-engineering llms langchain llmops anthropic llamaindex llm-eval evals llm-evaluation aiengineering smolagents

Updated May 9, 2026
Python

Giskard-AI / giskard-oss

🐢 Open-Source Evaluation & Testing library for LLM Agents

ai-security mlops fairness-ai responsible-ai ml-validation red-team-tools trustworthy-ai ml-testing llm ai-red-team ai-testing llmops llm-security llm-eval llm-evaluation rag-evaluation agent-evaluation

Updated May 7, 2026
Python

truera / trulens

Evaluation and Tracking for LLM Experiments and AI Agents

machine-learning neural-networks ai-agents explainable-ml agentops ai-monitoring ai-observability llms llmops llm-eval evals llm-evaluation agent-evaluation

Updated May 8, 2026
Python

uptrain-ai / uptrain

UpTrain is an open-source unified platform to evaluate and improve Generative AI applications. We provide grades for 20+ preconfigured checks (covering language, code, embedding use-cases), perform root cause analysis on failure cases and give insights on how to resolve them.

machine-learning monitoring evaluation experimentation jailbreak-detection autoevaluation root-cause-analysis prompt-engineering llmops openai-evals llm-prompting llm-eval llm-test hallucination-detection

Updated Aug 18, 2024
Python

AI-QL / tuui

A desktop MCP client designed as a tool unitary utility integration, accelerating AI adoption through the Model Context Protocol (MCP) and enabling cross-vendor LLM API orchestration.

Updated Mar 11, 2026
TypeScript

athina-ai / athina-evals

Python SDK for running evaluations on LLM generated responses

evaluation evaluation-metrics evaluation-framework llmops llm-eval llm-ops llm-evaluation llm-evaluation-toolkit

Updated Jun 6, 2025
Python

verifywise-ai / verifywise

Complete AI governance and LLM Evals platform with support for EU AI Act, ISO 42001, NIST AI RMF and 20+ more AI frameworks and regulations. Join our Discord channel: https://discord.com/invite/d3k3E4uEpR

auditing ai audit compliance grc governance risk-management iso27001 ai-risk ai-compliance ai-governance ai-governance-model llm-eval llm-evaluation eu-ai-act ai-auditing iso42001 nist-ai-rmf

Updated May 9, 2026
TypeScript

Re-Align / just-eval

A simple GPT-based evaluation tool for multi-aspect, interpretable assessment of LLMs.

evaluation gpt4 llm llm-eval llm-evaluation llm-evaluation-toolkit

Updated Jan 29, 2024
Python

QuesmaOrg / BinaryAudit

An open-source benchmark for evaluating AI agents' ability to find backdoors hidden in compiled binaries.

benchmark ai reverse-engineering cybersecurity binary-analysis llm-eval

Updated Feb 27, 2026
Shell

parea-ai / parea-sdk-py

Python SDK for experimenting, testing, evaluating & monitoring LLM-powered applications - Parea AI (YC S23)

metrics good-first-issue llm prompt-engineering generative-ai llmops llm-eval llm-tools llm-evaluation llm-evaluation-toolkit llms-benchmarking llm-evaluation-framework

Updated Feb 13, 2025
Python

grigio / llm-eval-simple

llm-eval-simple is a simple LLM evaluation framework with intermediate actions and prompt pattern selection

llm llm-eval llm-evaluation-benchmark

Updated Feb 28, 2026
Python

whitecircle / circle-guard-bench

First-of-its-kind AI benchmark for evaluating the protection capabilities of large language model (LLM) guard systems (guardrails and safeguards)

benchmarking benchmark ai jailbreak safeguard guardrail guardrails large-language-models llm large-language-model llm-security llm-eval llm-evaluation llm-as-a-judge llm-jailbreaks

Updated Mar 7, 2026
Python

kuk / rulm-sbs2

A benchmark comparing Russian counterparts to ChatGPT: Saiga, YandexGPT, Gigachat

russian-specific llm-eval

Updated Sep 26, 2023
Jupyter Notebook

fastxyz / skill-optimizer

Benchmark, evaluate, and optimize skills to ensure reliable performance across all LLMs

cli benchmark sdk ai mcp evaluation optimizer eval evaluation-framework ai-agent llm llm-eval evals openrouter llm-evaluation-framework tool-calling llm-evals ai-skill

Updated May 8, 2026
TypeScript

izam-mohammed / ragrank

🎯 A free LLM evaluation toolkit for assessing factual accuracy, context understanding, tone, and more, so you can see how well your LLM applications perform.

machine-learning evaluation language-model rag llm prompt-engineering llmops llm-eval

Updated Apr 21, 2026
Python

multinear / multinear

Develop reliable AI apps

reliability evaluation llm llms llm-eval llm-evaluation llms-benchmarking llm-evaluation-framework

Updated Sep 2, 2025
Python

Striveworks / valor

Valor is a lightweight, numpy-based library designed for fast and seamless evaluation of machine learning models.

nlp computer-vision evaluation text-generation classification object-detection image-segmentation evaluation-metrics model-evaluation mlops llm-eval

Updated Feb 9, 2026
Python

alan-turing-institute / prompto

An open source library for asynchronous querying of LLM endpoints

python nlp machine-learning natural-language-processing deep-learning transformers transformer hut23 large-language-models llms llm-eval llm-evaluation

Updated Jul 18, 2025
Python

thedataquarry / structured-outputs

Structured output benchmarks comparing DSPy and BAML with different LLMs

information-extraction structured-evaluation structured-output baml dspy llm llm-eval llm-evaluation

Updated Dec 23, 2025
Python
