evalforge

Agent evaluation harness — repeatable evals, regression detection, CI-ready

Python 3.10+ TypeScript 5.4+ MIT License Provider Agnostic

EvalForge is a standalone, provider-agnostic harness that lets you define test cases, run any agent against them, score results with multiple strategies, generate reports, and track regressions over time. No vendor lock-in. No hidden magic. Just clean, composable eval primitives.

# Install

Python

```shell
$ pip install evalforge
# With all extras:
$ pip install "evalforge[all]"
```

TypeScript / Node.js

```shell
$ npm install evalforge
```

# Quick Start

Python

```python
from evalforge import EvalHarness, TestCase
from evalforge.scorer import fuzzy_match, exact_match

def my_agent(prompt: str) -> str:
    return "The capital of France is Paris."

harness = EvalHarness(agent=my_agent, suite_name="geo-smoke")

harness.add(TestCase(
    id="france-capital",
    description="Knows EU capitals",
    input="What is the capital of France?",
    expected_output="Paris",
    scoring=fuzzy_match(threshold=0.8),
    tags=["geography"],
))

result = harness.run(report_html="reports/run.html")
# Prints a rich table to the terminal
# Saves an HTML report to reports/run.html
```
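For intuition, `fuzzy_match(threshold=0.8)` passes a case when the actual output is sufficiently similar to the expected one. A minimal sketch of that idea using the standard library's `difflib` (illustrative only, not EvalForge's actual algorithm):

```python
from difflib import SequenceMatcher

def fuzzy_pass(expected: str, actual: str, threshold: float = 0.8) -> bool:
    """Illustrative similarity check; evalforge's real scorer may differ."""
    # SequenceMatcher.ratio() returns a similarity in [0, 1]
    similarity = SequenceMatcher(None, expected.lower(), actual.lower()).ratio()
    return similarity >= threshold
```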

TypeScript

```typescript
import { EvalHarness, TestCase } from "evalforge";
import { fuzzyMatch } from "evalforge/scorer";

const harness = new EvalHarness({
  agent: async (input) => "The capital of France is Paris.",
  suiteName: "geo-smoke",
});

harness.add(new TestCase({
  id: "france-capital",
  input: "What is the capital of France?",
  expectedOutput: "Paris",
  scoring: fuzzyMatch(0.8),
  tags: ["geography"],
}));

const result = await harness.run({ reportHtml: "reports/run.html" });
```

# Regression Tracking

```python
harness = EvalHarness(
    agent=my_agent,
    suite_name="my-suite",
    history_path="eval_history.jsonl",   # appends every run
)
result = harness.run()

# Or use RegressionTracker directly:
from evalforge.reporter import RegressionTracker
tracker = RegressionTracker("eval_history.jsonl")
regressions = tracker.compare_and_save(result)
if regressions:
    print(f"Regressions: {regressions}")
```
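Conceptually, regression detection compares each test's score against the previous run recorded in the history file. A minimal sketch of that comparison, assuming a JSONL history where each line holds a `{"scores": {test_id: score}}` record (EvalForge's actual on-disk schema may differ):

```python
import json

def find_regressions(history_path: str, current: dict) -> list:
    """Return test IDs whose score dropped versus the last recorded run.

    Assumes each JSONL line is a {"scores": {test_id: score}} record;
    the real evalforge history schema may differ.
    """
    last = None
    with open(history_path) as f:
        for line in f:
            if line.strip():
                last = json.loads(line)  # keep only the most recent record
    if last is None:
        return []  # no baseline yet, nothing can regress
    previous = last["scores"]
    return [tid for tid, score in current.items()
            if tid in previous and score < previous[tid]]
```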

# Scoring Strategies

| Strategy | Description |
| --- | --- |
| `exact` | Exact string equality (default) |
| `fuzzy` | Levenshtein / token similarity |
| `contains` | Expected is a substring of actual |
| `json_match` | Deep-compare JSON structures (with key ignore) |
| `llm_judge` | Use an LLM to score correctness (0-1) |
| `custom` | Bring your own scoring function |
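The `custom` strategy takes a plain callable. As a hedged example, assuming a scorer is a function from (expected, actual) to a score in [0, 1] (the exact signature EvalForge expects is an assumption here), a token-overlap scorer might look like:

```python
def token_overlap(expected: str, actual: str) -> float:
    """Score by the fraction of expected tokens present in the actual output.

    Bring-your-own-scorer sketch; the callable signature evalforge
    expects for `custom` scoring is assumed, not confirmed.
    """
    expected_tokens = set(expected.lower().split())
    actual_tokens = set(actual.lower().split())
    if not expected_tokens:
        return 1.0  # nothing required, trivially satisfied
    return len(expected_tokens & actual_tokens) / len(expected_tokens)
```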

# Report Formats

| Format | Usage |
| --- | --- |
| Console (rich) | `harness.run()` — verbose table by default |
| JSON | `harness.run(report_json="out/results.json")` |
| HTML | `harness.run(report_html="out/results.html")` |

# EvalHarness API

| Method | Description |
| --- | --- |
| `EvalHarness(agent, suite_name, ...)` | Create a harness with an agent function |
| `.add(test_case)` | Add a test case to the suite |
| `.run(**kwargs)` | Run all test cases and generate reports |
| `TestCase(id, input, expected_output, scoring, ...)` | Define a test case with scoring criteria |

# Registry

```python
# Register reusable suites and agents
import asyncio

from evalforge import TestCase
from evalforge.registry import registry
from evalforge.scorer import contains_match

@registry.suite("support-faq")
def support_faq():
    return [TestCase(id="refund", input="What is your refund policy?",
                     expected_output="30-day", scoring=contains_match())]

@registry.agent("support-bot")
async def support_bot(prompt):
    return "We offer a 30-day money back guarantee."

# Run a registered suite against a registered agent by name
result = asyncio.run(registry.run("support-faq", "support-bot"))
```

# Key Features