Persona Evaluation — condition2__screen_with_metadata__sample500_seed99

| Field | Value |
|---|---|
| Persona | Nimesh Kulatunga |
| Judge model | openai/gpt-4.1-mini |
| Embed model | text-embedding-3-small |
| Rubric draws (k) | 3 |
| Total suggestions | 50 |
| Pipeline mode | hall_pass=anti_gt (active=True) |

Bucket Distribution

| Bucket | Count | % of total |
|---|---|---|
| Task Critical | 14 | 28.0% |
| Quality Of Life | 22 | 44.0% |
| Noise | 10 | 20.0% |
| Hallucinated | 4 | 8.0% |

Set-Level Diversity Metrics

| Metric | Value | Interpretation |
|---|---|---|
| DPP log-det | -29.9868 | Higher = more diverse + high-quality set |
| Cluster coverage | 1.000 | Fraction of BGT clusters with a task-critical hit |
| ILAD | 0.7387 | Mean pairwise distance; higher = more diverse |
| Redundancy rate | 0.000 | Fraction of near-duplicate suggestions (cos > 0.9) |
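
The metric definitions in the table can be sketched in code. The following is a minimal illustration, not the pipeline's confirmed implementation. It assumes cosine similarity over the suggestion embeddings, a quality-weighted DPP kernel of the form diag(q) · S · diag(q), and a per-item definition of "near-duplicate". All function names and the `jitter` parameter are hypothetical.

```python
import numpy as np

def _unit(emb):
    """L2-normalise each row so dot products are cosine similarities."""
    x = np.asarray(emb, dtype=float)
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def ilad(emb):
    """Mean pairwise cosine distance over all distinct pairs."""
    e = _unit(emb)
    sim = e @ e.T
    upper = np.triu_indices(len(e), k=1)
    return float(np.mean(1.0 - sim[upper]))

def redundancy_rate(emb, threshold=0.9):
    """Fraction of items whose nearest other item has cos > threshold (assumed definition)."""
    e = _unit(emb)
    sim = e @ e.T
    np.fill_diagonal(sim, -np.inf)  # ignore self-similarity
    return float(np.mean(sim.max(axis=1) > threshold))

def cluster_coverage(hit_clusters, all_clusters):
    """Fraction of clusters that received at least one task-critical hit."""
    all_set = set(all_clusters)
    return len(set(hit_clusters) & all_set) / len(all_set)

def dpp_log_det(emb, quality, jitter=1e-6):
    """log det of the assumed kernel L = diag(q) S diag(q), with jitter for stability."""
    e = _unit(emb)
    q = np.asarray(quality, dtype=float)
    kernel = q[:, None] * (e @ e.T) * q[None, :]
    kernel += jitter * np.eye(len(q))
    _, logdet = np.linalg.slogdet(kernel)
    return float(logdet)
```

Under this kernel, the log-determinant falls when items are similar to each other or have low quality, which matches the "higher = more diverse + high-quality" reading in the table.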

Composite Score

| Component | Weight | Value |
|---|---|---|
| DPP set score (normalised) | 0.5 | |
| Cluster coverage | 0.3 | 1.000 |
| Mean quality (non-hallucinated) | 0.2 | |
| Hallucination penalty | alpha=0.5 | x 0.9600 |

Composite score: 0.6030
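
The report does not state how the components combine. One reading consistent with the table layout is a weighted sum of the three weighted components, multiplied by the hallucination penalty. The sketch below assumes that form. The function name is hypothetical, and the inputs for the two components whose values are not shown above are placeholders for illustration, not results from this run.

```python
def composite_score(dpp_norm, coverage, mean_quality, penalty,
                    w_dpp=0.5, w_cov=0.3, w_quality=0.2):
    """Assumed form: (weighted sum of components) x hallucination penalty."""
    weighted = w_dpp * dpp_norm + w_cov * coverage + w_quality * mean_quality
    return weighted * penalty

# Hypothetical inputs: dpp_norm and mean_quality are NOT taken from the report.
score = composite_score(dpp_norm=0.5, coverage=1.0, mean_quality=0.9, penalty=0.96)
```

Because the penalty multiplies the whole sum under this assumption, a higher hallucination rate scales down every component equally rather than only the quality term.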

Hallucination Summary

Filter: anti_gt (active). Flagged 4 / 50 suggestions (rate 8.0%). Composite hallucination penalty: 0.9600.
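
The reported penalty is consistent with a linear form, penalty = 1 − alpha × rate, since 1 − 0.5 × (4 / 50) = 0.96. The sketch below assumes that form; the function name is illustrative, and the pipeline's actual formula is not shown in this report.

```python
def hallucination_penalty(flagged, total, alpha=0.5):
    """Assumed linear penalty: 1 - alpha * (flagged / total)."""
    rate = flagged / total
    return 1.0 - alpha * rate

penalty = hallucination_penalty(flagged=4, total=50)
print(f"{penalty:.4f}")  # 0.9600
```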

Filtered Hallucinations

| ID | Failure mode | Title | Reasoning |
|---|---|---|---|
| 14 | over_elaboration | Master CS3243: The 1982 CMU Vending Machine Cas... | Fabricates the 'Finger' protocol and ARPANET connection details not supported... |
| 21 | passive_viewing_as_active_interest | Target High-Impact Bitcoin Repositories for 2026 | The suggestion prescribes specific libraries and projects (LDK, Stratum V2, F... |
| 23 | over_elaboration | Optimize Cursor/Sonnet 4.6 for Bitcoin Protocol... | The suggestion over-elaborates by prescribing a detailed Sonnet prompt with s... |
| 40 | empty_fallback | Schedule Deep-Work for GitHub Issue Contributions | Suggestion is a generic productivity prompt lacking actionable value, matchin... |

Top 5 Task-Critical Suggestions

| # | ID | Quality | Title |
|---|---|---|---|
| 1 | 16 | 1.000 | Automate WhatsApp Export Parsing with Python |
| 2 | 22 | 1.000 | Automate WhatsApp Ingestion for GUMBO via Python Script |
| 3 | 36 | 1.000 | Map Python Skills to Nostr NIP-01 Implementation |
| 4 | 41 | 1.000 | Implement RPKI-to-ASmap Validation Logic |
| 5 | 2 | 0.980 | Configure Cursor 'Rules for AI' for GUMBO Context |

Top 5 Quality-of-Life Suggestions

| # | ID | Quality | Title |
|---|---|---|---|
| 1 | 7 | 1.000 | Automate DB Isolation with venv-aware Shell Script |
| 2 | 28 | 1.000 | Implement Local Polling for WhatsApp Export Parsing |
| 3 | 39 | 1.000 | Optimize RPKI Validation with Routinator Filters |
| 4 | 42 | 0.980 | Automate BGP Data Fetching via RIS-Live |
| 5 | 31 | 0.970 | Implement Erlay (PR #21515) to Mitigate Eclipse Attacks |