Abstract
We present EVAL Engine (Evaluation Validation Architecture), a decentralized framework for evaluating AI agents, with a focus on crypto-native agents, through verifiable real-time assessments and continuous learning capabilities.
Our system utilizes Chromia's gas-free relational blockchain architecture to enable transparent, immutable, and cost-effective evaluation of AI agent performance. The system incorporates multiple LLM-as-a-judge evaluators[1] and social engagement metrics for continuous reinforcement learning via a feedback loop and reward system.
We demonstrate that EVAL Engine achieves efficient, secure evaluations while adapting to evolving performance standards through engagement-driven feedback loops.
We also present a comprehensive roadmap for the development of EVAL Engine, including API development, data preparation, model development, and model deployment.
Introduction
Crypto x AI agents have taken off on Crypto Twitter (CT), particularly on the social side. Yet most of the AI agents we have seen lack evaluation metrics altogether. This creates a significant risk of hallucination - the generation of plausible yet factually unsupported content[2]. Without standardized evaluation metrics, it becomes challenging to verify the reliability and trustworthiness of these AI agents' outputs, particularly in the context of sensitive financial and cryptocurrency-related information.
Key Challenges
- Lack of real-time verification capabilities.
- High cost of storing evaluation data on-chain reliably, due to gas fees.
- Absence of continuous learning mechanisms.
- Limited integration with engagement metrics.
Key Objectives
- Achieving real-time, on-chain verification of AI agent performance.
- Minimizing gas-related overhead through a gas-free, scalable relational blockchain.
- Building a continuous learning workflow that leverages social engagement metrics.
- Ensuring data integrity via double-signed transactions and robust smart contracts.
EVAL Engine addresses these challenges head-on with a decentralized system specifically designed for real-time, cost-effective, and transparent evaluations.
Built on Chromia's gas-free relational blockchain, our framework unlocks efficient data storage while preserving immutability and traceability.
By integrating LLM-based judgments, social engagement metrics, and continuous reinforcement learning strategies, EVAL Engine enables a dynamic environment where AI agents, particularly crypto-native ones, can evolve and improve continuously.
System Architecture
Data Format & Ingestion Layer
- Standardized APIs facilitate interoperability with different AI agents, evaluation modules, and social media platforms.
- Allows for quick retrieval of AI agent outputs and historical evaluation scores.
- Normalizes incoming data (e.g., raw tweets, platform engagement stats) to ensure consistency before evaluation, as sketched after this list.
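For illustration, a minimal Python sketch of the normalization step, assuming a Twitter API v2-style payload; the `NormalizedOutput` fields are hypothetical stand-ins, not EVAL Engine's actual schema:

```python
from dataclasses import dataclass
from datetime import datetime, timezone

# Hypothetical normalized record; field names are illustrative,
# not EVAL Engine's actual schema.
@dataclass
class NormalizedOutput:
    agent_id: str
    text: str
    likes: int
    retweets: int
    comments: int
    observed_at: str  # ISO-8601 UTC timestamp

def normalize_tweet(raw: dict) -> NormalizedOutput:
    """Map a raw Twitter API v2 payload onto the standardized format."""
    metrics = raw.get("public_metrics", {})
    return NormalizedOutput(
        agent_id=raw["author_id"],
        text=raw["text"].strip(),
        likes=int(metrics.get("like_count", 0)),
        retweets=int(metrics.get("retweet_count", 0)),
        comments=int(metrics.get("reply_count", 0)),
        observed_at=datetime.now(timezone.utc).isoformat(),
    )

raw = {"author_id": "agent-123", "text": " gm, market update ",
       "public_metrics": {"like_count": 250, "retweet_count": 40, "reply_count": 30}}
print(normalize_tweet(raw))
```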
EVAL Engine: Evaluation Engine
- The EVAL Engine orchestrates multiple LLM evaluators to form a composite score that reflects various performance metrics (accuracy, creativity, truthfulness, and engagement).
- Each LLM executes asynchronously, enabling parallel evaluation and reduced latency, powered by DSPy[6]; the sketch after this list shows the concurrency pattern.
- Final evaluation scores and relevant data are batched for submission to Chromia's gas-free blockchain.
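A minimal asyncio sketch of the parallel-evaluation pattern; the production engine orchestrates this with DSPy[6], and the judge call, weights, and returned scores below are placeholders:

```python
import asyncio

# Illustrative criterion weights; production weights are tuned by the
# engagement feedback loop described later in this paper.
WEIGHTS = {"accuracy": 0.3, "truthfulness": 0.3, "creativity": 0.2, "engagement": 0.2}

async def judge(criterion: str, output: str) -> float:
    """Placeholder for one LLM-as-a-judge call (e.g., served via Groq
    or Hyperbolic); returns a score in [0, 1]."""
    await asyncio.sleep(0.1)  # stands in for network latency
    return 0.8

async def evaluate(output: str) -> float:
    criteria = list(WEIGHTS)
    # All judges run concurrently, so total latency is roughly the
    # slowest single call rather than the sum of all calls.
    scores = await asyncio.gather(*(judge(c, output) for c in criteria))
    return sum(WEIGHTS[c] * s for c, s in zip(criteria, scores))

print(asyncio.run(evaluate("example agent tweet")))
```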
Storage & Smart Contract Layer
- Stores evaluation results in a structured table-based format.
- Relational indexing allows for advanced queries, efficient retrieval, and scalable data operations.
- Both the AI agent and the evaluation service provider must sign off on each transaction to ensure mutual transparency and accountability; the signing sketch below illustrates the idea.
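A minimal sketch of the double-signing idea using Ed25519 keys via PyNaCl; this is our substitution for illustration and stands in for Chromia's actual FT4 transaction signing, with keys and payload fields being illustrative:

```python
import json
from nacl.signing import SigningKey  # pip install pynacl

# Illustrative keys; in practice these correspond to the on-chain
# accounts of the AI agent and the evaluation service provider.
agent_key = SigningKey.generate()
engine_key = SigningKey.generate()

payload = json.dumps({"agent_id": "agent-123", "score": 0.76}).encode()

# Both parties sign the identical payload ("double-signed transaction").
agent_sig = agent_key.sign(payload).signature
engine_sig = engine_key.sign(payload).signature

# Anyone can verify both signatures before the record is accepted;
# verify() raises BadSignatureError if a signature does not match.
agent_key.verify_key.verify(payload, agent_sig)
engine_key.verify_key.verify(payload, engine_sig)
print("both signatures valid")
```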
Engagement Evaluation Module
- Ingests real-time engagement data from Twitter (likes, retweets, commentary) to gauge public reception.
- Uses these metrics to dynamically adjust weighting factors in the multi-evaluation engine.
- Incorporates user feedback and on-chain predictions into a long-term reinforcement learning algorithm.
Continuous Evaluation & Reinforcement Learning
- Submitted evaluations feed additional insights back into the AI agent's model, refining future performance outputs.
- Performance thresholds aren't static; they evolve based on community engagement and empirical performance data (one simple realization is sketched below).
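One simple way such an evolving threshold could be realized is an exponential moving average over recent composite scores; this is a sketch under assumed values, not EVAL Engine's actual update rule:

```python
def update_threshold(threshold: float, new_score: float, alpha: float = 0.1) -> float:
    """Exponential moving average: the pass threshold drifts toward
    recent observed performance instead of staying fixed."""
    return (1 - alpha) * threshold + alpha * new_score

threshold = 0.70
for score in [0.74, 0.81, 0.78]:  # recent composite scores
    threshold = update_threshold(threshold, score)
print(round(threshold, 3))  # threshold has risen with performance
```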
Evaluation Technicalities
Traditional single-prompt scoring systems are vulnerable to manipulation through prompt injection and leaking. EVAL Engine implements a robust multi-layer approach to ensure evaluation integrity:
Multi-LLM Evaluation Engine
- Parallel evaluation using multiple specialized LLMs, each focused on distinct criteria (truthfulness, creativity, accuracy and engagement)
- Our models are powered by Hyperbolic[4] and Groq[5], which serve open-source LLMs at low latency, keeping our response times within industry standards. Different LLMs are used for different criteria to ensure accuracy and diversity. The table below shows average response times with our providers for some of the LLMs powering EVAL Engine.
Note: The table below is not an exhaustive list of the LLMs we use, just a few examples; each figure is an average over 50 calls.
| Model | Average | Range | Std Dev | Median |
|---|---|---|---|---|
| DeepSeek R1 Distill LLaMA 70B | 2.39s | 1.94-2.94s | 0.29s | 2.32s |
| LLaMA 3.1 8B Instant | 0.56s | 0.49-0.70s | 0.07s | 0.53s |
Real-Time Engagement Analysis
- Integration of social signals (likes, shares, comments) for continuous evaluation refinement
- Feedback loop for continuous improvement of evaluation criteria
On-Chain Storage
- Final evaluation outcomes and intermediate results are stored on Chromia's relational blockchain, guaranteeing transparency, verifiability, and historical traceability.
Technical Process Flow
1. Client signs an evaluation request transaction.
2. Engine receives and double-signs the transaction.
3. Engine submits the signed transaction to the Chromia blockchain.
4. Engine processes the evaluation.
5. Engine stores the results on the blockchain.
6. Client receives the evaluation results.
* All transactions and results are permanently stored on Chromia blockchain, ensuring transparency and auditability. Clients interact with the system through a simple SDK/API interface, while the Engine handles all blockchain operations.
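A self-contained sketch of how the client-side flow might look; `EvalEngineClient` and its methods are hypothetical stand-ins mirroring the six steps above, not the published SDK:

```python
from dataclasses import dataclass

# Hypothetical client; class and method names are illustrative,
# not the published SDK.
@dataclass
class Receipt:
    tx_id: str

class EvalEngineClient:
    def __init__(self, private_key: str):
        self.private_key = private_key  # client key used for signing

    def sign_request(self, agent_id: str, output: str) -> dict:
        # Step 1: the client signs an evaluation request transaction.
        return {"agent_id": agent_id, "output": output, "client_sig": "<sig>"}

    def submit(self, request: dict) -> Receipt:
        # Steps 2-3: the engine counter-signs ("double-signs") and
        # submits the transaction to Chromia.
        return Receipt(tx_id="<tx-hash>")

    def get_result(self, tx_id: str) -> float:
        # Steps 4-6: the engine evaluates, stores the results on-chain,
        # and returns the composite score to the client.
        return 0.76

client = EvalEngineClient(private_key="<key>")
receipt = client.submit(client.sign_request("agent-123", "example tweet"))
print(client.get_result(receipt.tx_id))
```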
Evaluation Mathematics
Weighted Average Evaluation
Let {𝑠𝑐₁, 𝑠𝑐₂, …, 𝑠𝑐ₙ} be individual scores from different evaluation criteria (e.g., factual correctness, creativity, coherence) and {𝑤₁, 𝑤₂, …, 𝑤ₙ} be their respective weights where Σ(𝑤ᵢ) = 1.
𝑆 = Σ(𝑤ᵢ × 𝑠𝑐ᵢ), 𝑓𝑜𝑟 𝑖 = 1 𝑡𝑜 𝑁
Where:
𝑠𝑐ᵢ = sub-score from the i-th model or criterion
𝑤ᵢ = weighting factor reflecting importance of sub-score
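As an illustration with made-up numbers, take 𝑁 = 3 with sub-scores 𝑠𝑐 = (0.8, 0.6, 0.9) and weights 𝑤 = (0.5, 0.3, 0.2):
𝑆 = 0.5 × 0.8 + 0.3 × 0.6 + 0.2 × 0.9 = 0.40 + 0.18 + 0.18 = 0.76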
Engagement-Driven Feedback Loop
Real-time engagement metrics are incorporated through an engagement factor E:
𝐸 = α𝑅 + β𝐿 + γ𝐶
Where:
𝑅 = retweets, 𝐿 = likes, 𝐶 = comments
α, β, γ = importance coefficients for each metric
Dynamic weight adjustment:
𝑤ᵢ′ = 𝑤ᵢ + (δ × 𝐸)
Where:
δ = engagement impact factor
𝑤ᵢ′ = raw adjusted weight
The adjusted weights are then renormalized so they again sum to 1:
𝑤ᵢ* = 𝑤ᵢ′ / Σ(𝑤ⱼ′)
Final weighted score:
𝑆* = Σ(𝑤ᵢ* × 𝑠𝑐ᵢ)
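Putting the formulas together, a minimal Python sketch of the engagement-driven update; the coefficient values (α, β, γ, δ) are illustrative assumptions, and the renormalization step is made explicit:

```python
def engagement_factor(retweets: int, likes: int, comments: int,
                      a: float = 0.5, b: float = 0.3, g: float = 0.2) -> float:
    """E = alpha*R + beta*L + gamma*C (coefficients are illustrative)."""
    return a * retweets + b * likes + g * comments

def adjust_weights(weights: list[float], E: float, delta: float = 1e-4) -> list[float]:
    """w'_i = w_i + delta*E, then renormalize so the weights sum to 1."""
    raw = [w + delta * E for w in weights]
    total = sum(raw)
    return [w / total for w in raw]

scores  = [0.8, 0.6, 0.9]   # sub-scores sc_i
weights = [0.5, 0.3, 0.2]   # initial weights w_i
E = engagement_factor(retweets=40, likes=250, comments=30)
weights = adjust_weights(weights, E)
S_star = sum(w * sc for w, sc in zip(weights, scores))
print(round(S_star, 4))
```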
Tokenomics
$EVAL Token Overview
$EVAL is a fair-launch project via Virtuals and will be bridged to Chromia as an FT4 token down the road. How $EVAL bridges to $CHR is detailed below:
Account Creation Requirements
To set up a self-sovereign Evaluation Engine Account, users must hold a minimum of 10 $CHR. This requirement creates an entry barrier that discourages spam and ensures that all registered accounts have a stake in the network integrity.
Subscription Model in $EVAL
Each evaluation report on EVAL Engine is paid for using $EVAL tokens through a subscription model[3]. By tokenizing the cost per evaluation, the platform seamlessly handles real-time user demand for evaluation services.
Token Utility and Conversion
EVAL Engine uses $EVAL tokens to acquire $CHR when computational resources are required. This design leverages Chromia's gas-free architecture while maintaining a clear payment model. Because each evaluation is properly backed by tokens (in both $EVAL and $CHR), the network remains secure, cost-effective, and accessible for continuous AI evaluation needs.
Roadmap
EVAL Engine continually refines its AI evaluation capabilities, leverages decentralized data storage on Chromia, and integrates real-time social signals to maintain accuracy and adaptability.
Phase 1: API Development
- Deploy evaluation APIs using the Virtuals SDK for seamless integration with partner ecosystems
- Develop custom endpoints for both public and partner access
- Create an intuitive frontend interface for:
  - API access management
  - Evaluation history visualization
  - Performance metrics dashboard
Phase 2: Data Preparation
- Implement the ELT+L (Extract, Load, Transform + Load to Chromia) pipeline, sketched after this list
- Data processing workflow:
  - Aggregate raw input from multiple sources
  - Clean and normalize user feedback and social signals
  - Process upvotes and engagement metrics
- Establish manual and automated labeling workflows for specific tasks
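A compact skeleton of the ELT+L stages; the function bodies are placeholders standing in for the real connectors and the Chromia writer:

```python
# Illustrative ELT+L pipeline skeleton; stage names follow the
# workflow above, connector details are placeholders.
def extract() -> list[dict]:
    """Aggregate raw input from multiple sources (tweets, upvotes, feedback)."""
    return [{"text": " Example Tweet ", "likes": 250, "upvotes": 12}]

def load_staging(records: list[dict]) -> list[dict]:
    """Land raw records in staging storage before transformation."""
    return records

def transform(records: list[dict]) -> list[dict]:
    """Clean and normalize user feedback, social signals, and upvotes."""
    return [{**r, "text": r["text"].strip().lower()} for r in records]

def load_chromia(records: list[dict]) -> None:
    """Final load: write curated records to Chromia's relational tables."""
    print(f"loaded {len(records)} records on-chain")

load_chromia(transform(load_staging(extract())))
```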
Phase 3: Model Development
- Fine-tune lightweight open-source AI models:
  - Fine-tune open-source models on Together AI or another provider
  - Optimize for evaluation-specific tasks
- Data integration:
  - $GRASS UpVoteWeb dataset integration for engagement metrics
  - $Elfa.ai real-time data ingestion as a source of truth
  - Custom fine-tuning data prepared by the Chromia team
- Set up a real-time engagement tracking system for:
  - Social media likes and retweets
  - User comments and feedback
  - Performance metrics analysis
Phase 4: Model Deployment
- Hugging Face platform integration:
  - Public access deployment
  - Community collaboration framework
  - Version control and model tracking
- Continuous improvement cycle:
  - Global feedback integration
  - Dataset updates and refinement
  - Performance optimization
References
[1] Zheng, L., et al. (2023). "Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena." arXiv:2306.05685.
[2] Huang, L., et al. (2023). "A Survey on Hallucination in Large Language Models: Principles, Taxonomy, Challenges, and Open Questions." arXiv:2311.05232.
[3] Chromia Documentation. "Subscription Fee Strategy." Chromia FT4 Documentation.
[4] Hyperbolic: the open-access AI cloud.
[5] Groq: platform for fast AI inference.
[6] DSPy: framework for programming, rather than prompting, language models.