About EvalForge

An open-source, extensible evaluation framework for generative AI systems.

Pipeline Architecture

1. Prompt Generation

LLM-powered prompt generation with category-aware templates for video and agent tasks.

2. Content Generation

Multi-provider API orchestration to generate videos (Veo, Kling, Seedance, etc.) or agent conversations.

3. Classification

Intent classification for agent turns and category tagging for video prompts using LLM judges.

4. Evaluation

VBench-based metrics for video quality; multi-dimensional quality scoring for agent responses.

5. Analysis & Reporting

Statistical aggregation, comparative rankings, and formatted reports with visualizations.
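The five stages above form a linear pipeline: each stage consumes the previous stage's output. A minimal sketch of that flow, with placeholder stage functions (names and data shapes are illustrative, not EvalForge's actual API):

```python
from typing import Any, Callable

Stage = Callable[[Any], Any]

def run_pipeline(stages: list[Stage], seed: Any) -> Any:
    """Feed each stage's output into the next stage."""
    result = seed
    for stage in stages:
        result = stage(result)
    return result

# Placeholder stages mirroring the architecture above.
stages: list[Stage] = [
    lambda spec: {"prompts": [f"prompt for {spec}"]},     # 1. Prompt Generation
    lambda d: {**d, "outputs": ["<generated content>"]},  # 2. Content Generation
    lambda d: {**d, "labels": ["category-A"]},            # 3. Classification
    lambda d: {**d, "scores": [0.8]},                     # 4. Evaluation
    lambda d: {"report": d},                              # 5. Analysis & Reporting
]

report = run_pipeline(stages, "video:nature")
```

The point of the structure is that any stage can be swapped (a different video provider, a different LLM judge) without touching the rest of the chain.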

Video Evaluation Methodology

Video quality is assessed using 9 VBench-inspired metrics, each scored on a 0-1 scale. Scores are compared against baseline values derived from prior-generation models. The 5 evaluation categories (Narrative, Subject, Environment, Motion, Style) each aggregate a subset of these metrics for domain-specific analysis.

Subject Consistency

How well the main subject maintains identity across frames.

Background Consistency

Stability and coherence of background elements over time.

Temporal Flickering

Absence of unnatural brightness or color fluctuations between frames.

Motion Smoothness

Natural flow and continuity of movement without jitter.

Dynamic Degree

Amount and variety of meaningful motion in the generated video.

Aesthetic Quality

Visual appeal, composition, and artistic quality of individual frames.

Imaging Quality

Technical quality including sharpness, noise, and artifact absence.

Overall Consistency

Holistic coherence of the video as a unified visual narrative.

Text Alignment

Faithfulness of the generated video to the input text prompt.
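Category aggregation reduces the nine per-metric scores to the five category scores by averaging each category's metric subset. A sketch of that step; the metric-to-category mapping below is purely illustrative, as the source does not specify which metrics feed which category:

```python
# Nine 0-1 metric scores for one video (example values).
METRICS = {
    "subject_consistency": 0.91,
    "background_consistency": 0.88,
    "temporal_flickering": 0.95,
    "motion_smoothness": 0.93,
    "dynamic_degree": 0.62,
    "aesthetic_quality": 0.71,
    "imaging_quality": 0.78,
    "overall_consistency": 0.84,
    "text_alignment": 0.80,
}

# Assumed metric-to-category grouping (illustrative only; EvalForge's
# actual subsets may differ).
CATEGORIES = {
    "Narrative": ["overall_consistency", "text_alignment"],
    "Subject": ["subject_consistency"],
    "Environment": ["background_consistency", "imaging_quality"],
    "Motion": ["motion_smoothness", "dynamic_degree", "temporal_flickering"],
    "Style": ["aesthetic_quality", "overall_consistency"],
}

def category_scores(metrics: dict[str, float]) -> dict[str, float]:
    """Average each category's metric subset into one 0-1 score."""
    return {
        cat: sum(metrics[m] for m in members) / len(members)
        for cat, members in CATEGORIES.items()
    }

scores = category_scores(METRICS)
```

Comparing each category score against the prior-generation baseline then yields a per-domain delta rather than a single opaque number.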

Agent Evaluation Methodology

Agent conversations are classified into 7 intent categories (QA, Code Development, Content Generation, Data Analysis, Tool Action, Task Planning, Translation). Each response is evaluated by an LLM judge across 4 quality dimensions; scores are then aggregated per intent and overall.

Coverage

Completeness of the response in addressing all aspects of the user query.

Relevance

How directly the response addresses the specific question or task at hand.

Executability

Whether code, instructions, or suggestions can be directly executed or applied.

Practicality

Real-world applicability and usefulness of the provided answer.
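The per-intent and overall aggregation described above can be sketched as follows. Each judged response carries an intent label and four 0-1 dimension scores; the field names and example data are hypothetical, not EvalForge's actual schema:

```python
from collections import defaultdict
from statistics import mean

DIMENSIONS = ("coverage", "relevance", "executability", "practicality")

# Example judged responses (illustrative data).
judged = [
    {"intent": "Code Development", "coverage": 0.9, "relevance": 0.95,
     "executability": 0.85, "practicality": 0.8},
    {"intent": "QA", "coverage": 0.7, "relevance": 0.9,
     "executability": 0.6, "practicality": 0.75},
    {"intent": "Code Development", "coverage": 0.8, "relevance": 0.85,
     "executability": 0.9, "practicality": 0.9},
]

def aggregate(responses):
    """Mean of the four dimension scores, reported per intent and overall."""
    by_intent = defaultdict(list)
    for r in responses:
        by_intent[r["intent"]].append(mean(r[d] for d in DIMENSIONS))
    per_intent = {intent: mean(v) for intent, v in by_intent.items()}
    overall = mean(mean(r[d] for d in DIMENSIONS) for r in responses)
    return per_intent, overall

per_intent, overall = aggregate(judged)
```

Reporting per intent matters because a model can score well overall while being weak on one intent (e.g. strong QA but poor Tool Action).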

Open Source

EvalForge is open-source. Contributions, feedback, and new evaluation tracks are welcome.
