
Persona Evaluation — condition1__screen_no_metadata__sample500_seed99

| Field | Value |
|---|---|
| Persona | Nimesh Kulatunga |
| Judge model | openai/gpt-4.1-mini |
| Embed model | text-embedding-3-small |
| Rubric draws (k) | 3 |
| Total suggestions | 50 |
| Pipeline mode | hall_pass=anti_gt (active=True) |

Bucket Distribution

| Bucket | Count | % of total |
|---|---|---|
| Task Critical | 16 | 32.0% |
| Quality Of Life | 23 | 46.0% |
| Noise | 11 | 22.0% |
| Hallucinated | 0 | 0.0% |
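
The percentages above follow directly from the bucket counts. As a quick sanity check, the sketch below recomputes them; the function name `bucket_shares` is hypothetical and not part of the evaluation pipeline.

```python
# Bucket counts as reported in the table above.
counts = {
    "Task Critical": 16,
    "Quality Of Life": 23,
    "Noise": 11,
    "Hallucinated": 0,
}


def bucket_shares(counts):
    """Return each bucket's share of the total, as a percentage rounded to 1 dp."""
    total = sum(counts.values())
    return {name: round(100.0 * n / total, 1) for name, n in counts.items()}


total = sum(counts.values())  # should match "Total suggestions" in the header
shares = bucket_shares(counts)

for name, pct in shares.items():
    print(f"{name}: {counts[name]} ({pct}%)")
```

The counts sum to the 50 total suggestions listed in the header, so no suggestion is left unbucketed.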

Set-Level Diversity Metrics

| Metric | Value | Interpretation |
|---|---|---|
| DPP log-det | -35.8823 | Higher = more diverse + high-quality set |
| Cluster coverage | 1.000 | Fraction of BGT clusters with a task-critical hit |
| ILAD | 0.6824 | Mean pairwise distance; higher = more diverse |
| Redundancy rate | 0.000 | Fraction of near-duplicate suggestions (cos > 0.9) |
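
The report does not show how these metrics are computed, so the following is only a minimal sketch of one common way to define them. It rests on several assumptions: ILAD uses cosine distance (the table only says "mean pairwise distance"), a suggestion counts as redundant if its cosine similarity to any other suggestion exceeds 0.9, and the DPP kernel is the usual quality-weighted similarity matrix `L[i][j] = q_i * cos(v_i, v_j) * q_j` with a small diagonal jitter. All function names and the `jitter` parameter are hypothetical.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def ilad(vectors):
    """Intra-list average distance: mean cosine distance over all distinct pairs."""
    pairs = list(combinations(vectors, 2))
    return sum(1.0 - cosine(u, v) for u, v in pairs) / len(pairs)


def redundancy_rate(vectors, threshold=0.9):
    """Fraction of items with at least one other item above the similarity threshold."""
    n = len(vectors)
    flagged = sum(
        1
        for i in range(n)
        if any(cosine(vectors[i], vectors[j]) > threshold for j in range(n) if j != i)
    )
    return flagged / n


def dpp_log_det(vectors, quality, jitter=1e-6):
    """log det of the assumed DPP kernel, computed via a Cholesky factorisation."""
    n = len(vectors)
    # Quality-weighted similarity kernel; jitter keeps it positive definite.
    kernel = [
        [
            quality[i] * cosine(vectors[i], vectors[j]) * quality[j]
            + (jitter if i == j else 0.0)
            for j in range(n)
        ]
        for i in range(n)
    ]
    # Cholesky: kernel = C C^T, so log det(kernel) = 2 * sum(log C[i][i]).
    chol = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = kernel[i][j] - sum(chol[i][k] * chol[j][k] for k in range(j))
            chol[i][j] = math.sqrt(s) if i == j else s / chol[j][j]
    return 2.0 * sum(math.log(chol[i][i]) for i in range(n))


# Toy embeddings (not taken from the report): items 0 and 2 are duplicates.
toy = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
print(ilad(toy), redundancy_rate(toy))
```

Under these definitions, a log-det below zero is expected whenever quality scores and similarities are at most 1, which is why the value is only meaningful when compared across runs with the same set size.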

Composite Score

| Component | Weight | Value |
|---|---|---|
| DPP set score (normalised) | 0.5 | |
| Cluster coverage | 0.3 | 1.000 |
| Mean quality (non-hallucinated) | 0.2 | |
| Hallucination penalty | alpha=0.5 | x 1.0000 |

Composite score: 0.6167
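
The report lists the weights but not the formula that combines them. One plausible reading, sketched below, is a weighted sum of the three components multiplied by the hallucination penalty, with the penalty assumed to be `1 - alpha * rate`. Both the combination rule and the penalty form are assumptions, and the function names are hypothetical. The normalised DPP score and mean quality are not shown in the report, so the example call uses placeholder inputs rather than the report's values.

```python
def hallucination_penalty(rate, alpha=0.5):
    """Assumed penalty form: 1.0 when nothing is flagged, shrinking linearly with the rate."""
    return 1.0 - alpha * rate


def composite_score(dpp_norm, coverage, mean_quality, rate,
                    weights=(0.5, 0.3, 0.2), alpha=0.5):
    """Assumed composite: penalty * (weighted sum of the three components)."""
    w_dpp, w_cov, w_quality = weights
    base = w_dpp * dpp_norm + w_cov * coverage + w_quality * mean_quality
    return hallucination_penalty(rate, alpha) * base


# Placeholder inputs for illustration only; they are not the report's values.
example = composite_score(dpp_norm=0.5, coverage=1.0, mean_quality=0.5, rate=0.0)
print(round(example, 4))
```

If this reading is right, the cluster-coverage term alone contributes 0.3 to the score, and with a penalty of 1.0000 the remaining roughly 0.3167 would come from the two unreported components together.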

Hallucination Summary

Filter: anti_gt (active). Flagged 0 / 50 suggestions (rate 0.0%). Composite hallucination penalty: 1.0000.

Filtered Hallucinations

No suggestions were flagged as hallucinations.

Top 5 Task-Critical Suggestions

| # | ID | Quality | Title |
|---|---|---|---|
| 1 | 188 | 1.000 | Automate GUMBO Schema Sync to Google Drive |
| 2 | 212 | 1.000 | Use Regex for WhatsApp Export Cleaning |
| 3 | 229 | 1.000 | Fix Gmail API Scopes for Composio Integration |
| 4 | 234 | 1.000 | Implement Unified Provider Factory for GUMBO |
| 5 | 248 | 1.000 | Add 'SessionMetadata' to GUMBO Schema v1.1 |

Top 5 Quality-of-Life Suggestions

| # | ID | Quality | Title |
|---|---|---|---|
| 1 | 179 | 1.000 | Sync GUMBO 'Focus' Mode with Virtual Desktop 3 |
| 2 | 49 | 0.980 | Configure Local OCR for Screen Observer Data Extraction |
| 3 | 184 | 0.960 | Schedule High-Risk Ingestion During Low-Traffic Windows |
| 4 | 210 | 0.960 | Schedule Test Runs During 'logical-eng-ext' Peak Hours |
| 5 | 244 | 0.933 | Automate LaTeX Schema Sync to Google Drive |