Persona Evaluation — Ground truth ablation: BGT + IGT (anti_gt)

| Field | Value |
| --- | --- |
| Persona | Nimesh Kulatunga |
| Judge model | openai/gpt-4.1-mini |
| Embed model | text-embedding-3-small |
| Rubric draws (k) | 3 |
| Total suggestions | 43 |
| Pipeline mode | hall_pass=anti_gt (active=True) |

Bucket Distribution

| Bucket | Count | % of total |
| --- | --- | --- |
| Task Critical | 25 | 58.1% |
| Quality Of Life | 8 | 18.6% |
| Noise | 6 | 14.0% |
| Hallucinated | 4 | 9.3% |
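The percentages can be re-derived from the counts alone. The sketch below is only a sanity check of the table's arithmetic; the function name `bucket_shares` is hypothetical and not part of the evaluation pipeline.

```python
# Bucket counts copied from the Bucket Distribution table.
counts = {
    "Task Critical": 25,
    "Quality Of Life": 8,
    "Noise": 6,
    "Hallucinated": 4,
}


def bucket_shares(counts):
    """Return each bucket's share of the total as a percentage (1 decimal place)."""
    total = sum(counts.values())
    return {name: round(100 * n / total, 1) for name, n in counts.items()}


shares = bucket_shares(counts)
for name, pct in shares.items():
    print(f"{name}: {pct:.1f}%")
```

The counts sum to 43, matching the "Total suggestions" field, and the rounded shares reproduce the reported column.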

Set-Level Diversity Metrics

| Metric | Value | Interpretation |
| --- | --- | --- |
| DPP log-det | -35.2258 | Higher = more diverse + high-quality set |
| Cluster coverage | 1.000 | Fraction of BGT clusters with a task-critical hit |
| ILAD | 0.6647 | Mean pairwise distance; higher = more diverse |
| Redundancy rate | 0.000 | Fraction of near-duplicate suggestions (cos > 0.9) |
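For readers unfamiliar with these set-level metrics, here is a minimal pure-Python sketch of how three of them could be computed from suggestion embeddings. Several details are assumptions, since the report does not specify them: the distance is taken to be cosine distance, a suggestion counts as redundant if any other suggestion has cosine similarity above 0.9 with it, and the DPP kernel is taken to be `L[i][j] = q_i * cos(e_i, e_j) * q_j` with per-suggestion quality scores `q`. All function names are hypothetical.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def ilad(embs):
    """Mean pairwise cosine distance (assumed metric); needs at least two items."""
    pairs = list(combinations(range(len(embs)), 2))
    return sum(1.0 - cosine(embs[i], embs[j]) for i, j in pairs) / len(pairs)


def redundancy_rate(embs, threshold=0.9):
    """Fraction of items with at least one near-duplicate (cos > threshold)."""
    dup = set()
    for i, j in combinations(range(len(embs)), 2):
        if cosine(embs[i], embs[j]) > threshold:
            dup.update((i, j))
    return len(dup) / len(embs)


def dpp_log_det(embs, quality):
    """log det of the assumed kernel q_i * cos(e_i, e_j) * q_j, via Cholesky."""
    n = len(embs)
    kernel = [[quality[i] * cosine(embs[i], embs[j]) * quality[j] for j in range(n)]
              for i in range(n)]
    chol = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = kernel[i][j] - sum(chol[i][k] * chol[j][k] for k in range(j))
            if i == j:
                if s <= 0.0:
                    raise ValueError("kernel is not positive definite")
                chol[i][i] = math.sqrt(s)
            else:
                chol[i][j] = s / chol[j][j]
    # det(kernel) = prod(diag(chol))^2, so the log-det is twice the sum of logs.
    return 2.0 * sum(math.log(chol[i][i]) for i in range(n))


# Toy data (hypothetical): three mutually orthogonal embeddings, quality 0.5 each.
toy_embs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
toy_quality = [0.5, 0.5, 0.5]
print(ilad(toy_embs), redundancy_rate(toy_embs), dpp_log_det(toy_embs, toy_quality))
```

Under this kernel the log-det is negative whenever quality scores are below 1 or items are correlated, which is consistent with a value such as -35.2258 for a set of a few dozen suggestions. The Cholesky route fails on exactly duplicated embeddings; a production implementation would likely add diagonal jitter or use a library log-determinant routine.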

Composite Score

| Component | Weight | Value |
| --- | --- | --- |
| DPP set score (normalised) | 0.5 | |
| Cluster coverage | 0.3 | 1.000 |
| Mean quality (non-hallucinated) | 0.2 | |
| Hallucination penalty | alpha=0.5 | x 0.9535 |

Composite score: 0.5672
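The report does not state how the components are combined. One reading consistent with the "x 0.9535" entry is a weighted sum of the three components multiplied by the hallucination penalty. The sketch below illustrates that reading only; the function name, the combination rule, and the example inputs for the DPP score and mean quality are all assumptions, because the report leaves those two values blank.

```python
def composite_score(dpp_norm, coverage, mean_quality, penalty,
                    w_dpp=0.5, w_cov=0.3, w_quality=0.2):
    """Weighted sum of components, scaled by a multiplicative penalty (assumed rule)."""
    base = w_dpp * dpp_norm + w_cov * coverage + w_quality * mean_quality
    return base * penalty


# Hypothetical inputs: only coverage (1.0) and penalty (0.9535) come from the report.
example = composite_score(dpp_norm=0.4, coverage=1.0, mean_quality=0.8, penalty=0.9535)
print(f"{example:.5f}")
```

If the multiplicative reading is right, the reported composite of 0.5672 would correspond to a pre-penalty weighted sum of roughly 0.5672 / 0.9535 ≈ 0.595.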

Hallucination Summary

Filter: anti_gt (active). Flagged 4 / 43 suggestions (rate 9.3%). Composite hallucination penalty: 0.9535.
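The reported penalty is numerically consistent with `1 - alpha * rate`, using alpha = 0.5 and the flagged rate 4 / 43. The report does not spell out this formula, so treat the following as an inferred sketch with a hypothetical function name.

```python
def hallucination_penalty(flagged, total, alpha=0.5):
    """Penalty factor 1 - alpha * (flagged / total); formula inferred from the report."""
    return 1.0 - alpha * (flagged / total)


rate = 4 / 43
penalty = hallucination_penalty(4, 43)
print(f"rate={rate:.1%}, penalty={penalty:.4f}")
```

With these inputs the rate rounds to 9.3% and the penalty to 0.9535, matching the summary line.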

Filtered Hallucinations

| ID | Failure mode | Title | Reasoning |
| --- | --- | --- | --- |
| 21 | source_context_confusion | Flag that Toastmasters Brand Manual is open in ... | The suggestion misattributes the Toastmasters brand manual viewing to the Bit... |
| 22 | source_context_confusion | Extract brand colors (hex), fonts, and logo rul... | The suggestion fabricates specific brand colors and fonts linked to the Bitco... |
| 35 | source_context_confusion | Recognise flyer request from TM Praveen Kumaras... | Suggestion misattributes the flyer request context to Toastmasters when user ... |
| 36 | source_context_confusion | Auto-apply Toastmasters brand guidelines to the... | The suggestion misattributes the Toastmasters brand manual viewing to the Bit... |

Top 5 Task-Critical Suggestions

| # | ID | Quality | Title |
| --- | --- | --- | --- |
| 1 | 4 | 0.980 | Auto-populate research doc with per-org summary stubs |
| 2 | 3 | 0.970 | Detect that 'Summer of Bitcoin Interested Projects' Googl... |
| 3 | 8 | 0.970 | Summarise SOB Student Guide pages user read into an appli... |
| 4 | 9 | 0.970 | Offer a proposal draft outline after 60+ minutes of SOB r... |
| 5 | 10 | 0.970 | Remind that SOB requires a competency test before proposa... |

Top 5 Quality-of-Life Suggestions

| # | ID | Quality | Title |
| --- | --- | --- | --- |
| 1 | 33 | 0.937 | Auto-draft SOB April 30 deadline and 4 proposal work-bloc... |
| 2 | 17 | 0.870 | Offer to resume GitHub docs task when Canva work wraps |
| 3 | 27 | 0.807 | Recognise Google Docs as primary note-taking surface for ... |
| 4 | 40 | 0.807 | Flag that caption writing is a deferred task (your usual ... |
| 5 | 30 | 0.670 | Produce maintainer dossier with contribution preferences ... |