About EvalForge
An open-source, extensible evaluation framework for generative AI systems.
Pipeline Architecture
Prompt Generation
LLM-powered prompt generation with category-aware templates for video and agent tasks.
Content Generation
Multi-provider API orchestration to generate videos (Veo, Kling, Seedance, etc.) or agent conversations.
Classification
Intent classification for agent turns and category tagging for video prompts using LLM judges.
Evaluation
VBench-inspired metrics for video quality; multi-dimensional quality scoring for agent responses.
Analysis & Reporting
Statistical aggregation, comparative rankings, and formatted reports with visualizations.
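The five stages above can be sketched as a simple function composition. Everything below is an illustrative stub, not EvalForge's actual API: real stages would call LLM and provider APIs, and all names and data shapes are assumptions.

```python
# Minimal end-to-end sketch of the five pipeline stages.
# All function bodies are stubs standing in for real LLM/provider calls.

def generate_prompts(category: str) -> list[str]:
    # Stage 1: category-aware prompt generation (stubbed templates).
    return [f"{category} scene {i}" for i in range(3)]

def generate_content(prompt: str) -> dict:
    # Stage 2: would dispatch to a video provider (Veo, Kling, ...).
    return {"prompt": prompt, "artifact": f"<video for '{prompt}'>"}

def classify(item: dict) -> dict:
    # Stage 3: LLM-judge category tagging (fixed label stand-in).
    return {**item, "category": "Motion"}

def evaluate(item: dict) -> dict:
    # Stage 4: per-metric scores on the 0-1 scale (fixed stand-in).
    return {**item, "scores": {"motion_smoothness": 0.9}}

def report(items: list[dict]) -> dict:
    # Stage 5: aggregate a mean score per tagged category.
    totals: dict[str, list[float]] = {}
    for it in items:
        mean = sum(it["scores"].values()) / len(it["scores"])
        totals.setdefault(it["category"], []).append(mean)
    return {cat: sum(v) / len(v) for cat, v in totals.items()}

def run_pipeline(category: str) -> dict:
    items = [evaluate(classify(generate_content(p)))
             for p in generate_prompts(category)]
    return report(items)
```

The design point the sketch illustrates is that each stage consumes the previous stage's output unchanged plus one added field, so stages can be swapped independently (e.g. agent conversations instead of videos).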
Video Evaluation Methodology
Video quality is assessed using 9 VBench-inspired metrics, each scored on a 0-1 scale. Scores are compared against baseline values derived from prior-generation models. The 5 evaluation categories (Narrative, Subject, Environment, Motion, Style) aggregate subsets of these metrics for domain-specific analysis.
Subject Consistency
How well the main subject maintains identity across frames.
Background Consistency
Stability and coherence of background elements over time.
Temporal Flickering
Absence of unnatural brightness or color fluctuations between frames.
Motion Smoothness
Natural flow and continuity of movement without jitter.
Dynamic Degree
Amount and variety of meaningful motion in the generated video.
Aesthetic Quality
Visual appeal, composition, and artistic quality of individual frames.
Imaging Quality
Technical quality including sharpness, noise, and artifact absence.
Overall Consistency
Holistic coherence of the video as a unified visual narrative.
Text Alignment
Faithfulness of the generated video to the input text prompt.
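How the 5 categories aggregate metric subsets, and how scores compare to baselines, can be sketched as below. The metric-to-category mapping and the uniform example values are assumptions for illustration; EvalForge's real subset tables and baseline figures may differ.

```python
# Illustrative mapping of the 9 metrics into the 5 evaluation categories.
# This grouping is an assumption, not EvalForge's actual configuration.
CATEGORY_METRICS = {
    "Narrative": ["overall_consistency", "text_alignment"],
    "Subject": ["subject_consistency"],
    "Environment": ["background_consistency"],
    "Motion": ["motion_smoothness", "dynamic_degree", "temporal_flickering"],
    "Style": ["aesthetic_quality", "imaging_quality"],
}

def category_scores(metrics: dict[str, float]) -> dict[str, float]:
    """Mean of each category's metric subset (all scores on a 0-1 scale)."""
    return {
        cat: sum(metrics[m] for m in ms) / len(ms)
        for cat, ms in CATEGORY_METRICS.items()
    }

def baseline_deltas(metrics: dict[str, float],
                    baselines: dict[str, float]) -> dict[str, float]:
    """Signed gap vs. a prior-generation baseline; positive means better."""
    return {m: round(v - baselines[m], 4) for m, v in metrics.items()}
```

For example, a video scoring 0.8 on every metric against a 0.7 baseline would show a +0.1 delta everywhere and a 0.8 mean in each category.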
Agent Evaluation Methodology
Agent conversations are classified into 7 intent categories (QA, Code Development, Content Generation, Data Analysis, Tool Action, Task Planning, Translation). Each response is evaluated by an LLM judge across 4 quality dimensions, then aggregated per intent and overall.
Coverage
Completeness of the response in addressing all aspects of the user query.
Relevance
How directly the response addresses the specific question or task at hand.
Executability
Whether code, instructions, or suggestions can be directly executed or applied.
Practicality
Real-world applicability and usefulness of the provided answer.
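The per-intent and overall aggregation described above can be sketched as follows. The input shape (a list of judged turns, each carrying an intent label and a score per dimension) is an assumption about EvalForge's internal representation, made for illustration.

```python
# The four judge dimensions named in the sections above.
DIMENSIONS = ("coverage", "relevance", "executability", "practicality")

def aggregate_by_intent(judged_turns: list[dict]) -> tuple[dict, dict]:
    """Average each quality dimension per intent category, then overall.

    judged_turns: list of {"intent": str, "scores": {dimension: float}},
    an assumed shape standing in for the real judge output.
    """
    by_intent: dict[str, list[dict]] = {}
    for turn in judged_turns:
        by_intent.setdefault(turn["intent"], []).append(turn["scores"])
    per_intent = {
        intent: {d: sum(s[d] for s in scores) / len(scores)
                 for d in DIMENSIONS}
        for intent, scores in by_intent.items()
    }
    overall = {d: sum(t["scores"][d] for t in judged_turns) / len(judged_turns)
               for d in DIMENSIONS}
    return per_intent, overall
```

Aggregating per intent before reporting an overall mean keeps a strong QA track from masking a weak Code Development track, which is the point of the 7-way intent classification.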
Open Source
EvalForge is open-source. Contributions, feedback, and new evaluation tracks are welcome.
View on GitHub