Persona Evaluation — Ground truth ablation: IGT only (anti_gt)

| Field | Value |
| --- | --- |
| Persona | Nimesh Kulatunga |
| Judge model | openai/gpt-4.1-mini |
| Embed model | text-embedding-3-small |
| Rubric draws (k) | 3 |
| Total suggestions | 16 |
| Pipeline mode | hall_pass=anti_gt (active=True) |

Bucket Distribution

| Bucket | Count | % of total |
| --- | --- | --- |
| Task Critical | 6 | 37.5% |
| Quality Of Life | 5 | 31.2% |
| Noise | 4 | 25.0% |
| Hallucinated | 1 | 6.2% |
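
The percentages in the bucket table follow directly from the counts and the total of 16 suggestions. The sketch below recomputes them; the function name `bucket_shares` is illustrative and not part of the pipeline. Note that 5/16 and 1/16 are exactly 31.25% and 6.25%, so the reported 31.2% and 6.2% are consistent with round-half-to-even formatting at one decimal place (an assumption about how the report was generated).

```python
from collections import Counter

# Bucket counts as reported in the table above.
bucket_counts = Counter({
    "Task Critical": 6,
    "Quality Of Life": 5,
    "Noise": 4,
    "Hallucinated": 1,
})


def bucket_shares(counts):
    """Return {bucket: percentage of total}, unrounded."""
    total = sum(counts.values())
    return {name: 100.0 * n / total for name, n in counts.items()}


shares = bucket_shares(bucket_counts)
for name, pct in shares.items():
    # Python's float formatting rounds exact ties to the even digit.
    print(f"{name:<16} {bucket_counts[name]:>2} {pct:.1f}%")
```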

Set-Level Diversity Metrics

| Metric | Value | Interpretation |
| --- | --- | --- |
| DPP log-det | -8.6813 | Higher = more diverse + high-quality set |
| Cluster coverage | 1.000 | Fraction of BGT clusters with a task-critical hit |
| ILAD | 0.6757 | Mean pairwise distance; higher = more diverse |
| Redundancy rate | 0.000 | Fraction of near-duplicate suggestions (cos > 0.9) |
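
To make the interpretations above concrete, here is a minimal sketch of how three of these metrics could be computed from suggestion embeddings. It is not the pipeline's code and rests on assumptions: ILAD is taken as mean pairwise cosine distance, the redundancy rate counts suggestions that have at least one neighbour with cosine similarity above 0.9, and the DPP kernel is the common quality-weighted form `q_i * cos(i, j) * q_j`. All function names and the `jitter` parameter are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def ilad(vectors):
    """Mean pairwise cosine distance (1 - cosine similarity)."""
    n = len(vectors)
    dists = [1.0 - cosine(vectors[i], vectors[j])
             for i in range(n) for j in range(i + 1, n)]
    return sum(dists) / len(dists)


def redundancy_rate(vectors, threshold=0.9):
    """Fraction of items with at least one near-duplicate (cos > threshold)."""
    n = len(vectors)
    flagged = sum(
        1 for i in range(n)
        if any(j != i and cosine(vectors[i], vectors[j]) > threshold
               for j in range(n))
    )
    return flagged / n


def dpp_log_det(vectors, qualities, jitter=1e-9):
    """log det of the assumed kernel L_ij = q_i * cos(i, j) * q_j.

    Uses a plain Cholesky factorisation; jitter on the diagonal keeps the
    kernel positive definite when items are nearly identical.
    """
    n = len(vectors)
    kernel = [[qualities[i] * cosine(vectors[i], vectors[j]) * qualities[j]
               for j in range(n)] for i in range(n)]
    chol = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(chol[i][k] * chol[j][k] for k in range(j))
            if i == j:
                chol[i][i] = math.sqrt(kernel[i][i] + jitter - s)
            else:
                chol[i][j] = (kernel[i][j] - s) / chol[j][j]
    # det(L) is the squared product of the Cholesky diagonal.
    return 2.0 * sum(math.log(chol[i][i]) for i in range(n))


# Toy embeddings for illustration only; these are not the report's data.
toy_vectors = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
print(ilad(toy_vectors), redundancy_rate(toy_vectors))
```

Under this kernel, a low-quality or near-duplicate item shrinks the determinant, which is why a higher log-det reads as a more diverse and higher-quality set.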

Composite Score

| Component | Weight | Value |
| --- | --- | --- |
| DPP set score (normalised) | 0.5 | |
| Cluster coverage | 0.3 | 1.000 |
| Mean quality (non-hallucinated) | 0.2 | |
| Hallucination penalty | alpha=0.5 | × 0.9688 |

Composite score: 0.6038
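
The report does not spell out how the components combine. One plausible reading, offered only as an assumption, is a weighted sum of the three scored components multiplied by the hallucination penalty. The sketch below encodes that reading; `composite_score` and its parameter names are hypothetical, and the values for the DPP set score and mean quality are left as inputs because the table does not show them.

```python
def composite_score(dpp_norm, cluster_coverage, mean_quality, penalty,
                    weights=(0.5, 0.3, 0.2)):
    """Weighted sum of components, scaled by a multiplicative penalty.

    ASSUMPTION: the multiplicative form is inferred from the "x 0.9688"
    entry in the table; the pipeline may combine the terms differently.
    """
    w_dpp, w_cov, w_qual = weights
    base = w_dpp * dpp_norm + w_cov * cluster_coverage + w_qual * mean_quality
    return base * penalty


# Sanity check under the same assumption, using only reported numbers:
# this is what 0.5 * dpp_norm + 0.2 * mean_quality would have to sum to.
implied_remainder = 0.6038 / 0.9688 - 0.3 * 1.000
print(f"implied 0.5*dpp_norm + 0.2*mean_quality = {implied_remainder:.4f}")
```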

Hallucination Summary

Filter: anti_gt (active). Flagged 1 / 16 suggestions (rate 6.2%). Composite hallucination penalty: 0.9688.
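
The reported penalty is consistent with the formula `1 - alpha * rate`, where `rate` is the flagged fraction. That formula is an assumption that happens to reproduce the number above, not a confirmed description of the pipeline; the function name is illustrative.

```python
def hallucination_penalty(flagged, total, alpha=0.5):
    """Multiplicative penalty: 1 - alpha * (flagged / total). Assumed form."""
    rate = flagged / total
    return 1.0 - alpha * rate


# 1 of 16 suggestions flagged, alpha = 0.5 as listed in the composite table.
penalty = hallucination_penalty(flagged=1, total=16, alpha=0.5)
print(f"rate={1 / 16:.4f} penalty={penalty:.4f}")  # rate=0.0625 penalty=0.9688
```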

Filtered Hallucinations

| ID | Failure mode | Title | Reasoning |
| --- | --- | --- | --- |
| 9 | source_context_confusion | Auto-apply Toastmasters brand guidelines to the... | Attributes Toastmasters brand guidelines to design, but prior anti-GT shows c... |

Top 5 Task-Critical Suggestions

| # | ID | Quality | Title |
| --- | --- | --- | --- |
| 1 | 2 | 0.970 | Rank SOB orgs against user's tech stack and stated prefer... |
| 2 | 5 | 0.970 | Pre-draft proposal sections using user's past proposal st... |
| 3 | 4 | 0.893 | Consolidate Matt Morehouse's published research and map t... |
| 4 | 11 | 0.870 | Propose batch creation of 8 new-member flyers from single... |
| 5 | 12 | 0.770 | Draft feedback-request message to Siyane PR team 25/26 Wh... |

Top 5 Quality-of-Life Suggestions

| # | ID | Quality | Title |
| --- | --- | --- | --- |
| 1 | 1 | 0.970 | Remind that SOB 2026 proposal deadline is April 30 (23 da... |
| 2 | 6 | 0.903 | Auto-draft SOB April 30 deadline and 4 proposal work-bloc... |
| 3 | 13 | 0.807 | Flag that caption writing is a deferred task (your usual ... |
| 4 | 8 | 0.690 | Recognise flyer request from TM Praveen Kumarasinghe in e... |
| 5 | 3 | 0.670 | Produce maintainer dossier with contribution preferences ... |