Persona Evaluation — Ground truth ablation: BGT + IGT (anti_gt)
| Field | Value |
| Persona | Nimesh Kulatunga |
| Judge model | openai/gpt-4.1-mini |
| Embed model | text-embedding-3-small |
| Rubric draws (k) | 3 |
| Total suggestions | 43 |
| Pipeline mode | hall_pass=anti_gt (active=True) |
Bucket Distribution
| Bucket | Count | % of total |
| Task Critical | 25 | 58.1% |
| Quality Of Life | 8 | 18.6% |
| Noise | 6 | 14.0% |
| Hallucinated | 4 | 9.3% |
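The percentages in the bucket table follow directly from the counts and the total of 43 suggestions. A minimal sketch of that arithmetic is shown here; the function name `bucket_shares` is hypothetical and not taken from the evaluation pipeline.

```python
def bucket_shares(counts):
    """Return each bucket's share of the total as a percentage, rounded to 1 decimal."""
    total = sum(counts.values())
    return {name: round(100 * n / total, 1) for name, n in counts.items()}

# Counts copied from the Bucket Distribution table.
counts = {
    "Task Critical": 25,
    "Quality Of Life": 8,
    "Noise": 6,
    "Hallucinated": 4,
}

shares = bucket_shares(counts)
for name, pct in shares.items():
    print(f"{name}: {pct}%")
```

The four counts sum to 43, which matches the "Total suggestions" field.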
Set-Level Diversity Metrics
| Metric | Value | Interpretation |
| DPP log-det | -35.2258 | Higher = more diverse + high-quality set |
| Cluster coverage | 1.000 | Fraction of BGT clusters with a task-critical hit |
| ILAD | 0.6647 | Mean pairwise distance; higher = more diverse |
| Redundancy rate | 0.000 | Fraction of near-duplicate suggestions (cos > 0.9) |
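To make the ILAD and redundancy-rate rows concrete, the sketch below computes both from a list of embedding vectors. It rests on assumptions the report does not confirm: that "distance" means cosine distance (1 minus cosine similarity), and that a suggestion counts as a near-duplicate when its similarity to an earlier suggestion exceeds 0.9. The function names and the toy vectors are illustrative only.

```python
import math
from itertools import combinations

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def ilad(vectors):
    """Mean pairwise cosine distance over all unordered pairs (assumed definition)."""
    pairs = list(combinations(vectors, 2))
    return sum(1.0 - cosine(u, v) for u, v in pairs) / len(pairs)

def redundancy_rate(vectors, threshold=0.9):
    """Fraction of items whose similarity to an earlier item exceeds the threshold
    (assumed definition; the later item of each pair is counted as the duplicate)."""
    duplicates = set()
    for i, j in combinations(range(len(vectors)), 2):
        if cosine(vectors[i], vectors[j]) > threshold:
            duplicates.add(j)
    return len(duplicates) / len(vectors)

# Toy 2-D vectors, not real embeddings: the third nearly repeats the first.
toy = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.1]]
print(ilad(toy), redundancy_rate(toy))
```

Under these definitions, a redundancy rate of 0.000 means no pair of the 43 suggestions exceeds the 0.9 similarity threshold.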
Composite Score
| Component | Weight | Value |
| DPP set score (normalised) | 0.5 | — |
| Cluster coverage | 0.3 | 1.000 |
| Mean quality (non-hallucinated) | 0.2 | — |
| Hallucination penalty | alpha=0.5 | x 0.9535 |
Composite score: 0.5672
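The component table suggests that the composite is a weighted sum of the three components, multiplied by the hallucination penalty. The sketch below encodes that reading; it is an assumption, since the report does not state the formula and leaves two component values blank. The function name `composite_score` is hypothetical.

```python
def composite_score(dpp_norm, coverage, mean_quality, penalty,
                    weights=(0.5, 0.3, 0.2)):
    """Weighted sum of the three components, scaled by the hallucination penalty.

    Assumed form: (w_dpp * dpp_norm + w_cov * coverage + w_q * mean_quality) * penalty
    """
    w_dpp, w_cov, w_q = weights
    base = w_dpp * dpp_norm + w_cov * coverage + w_q * mean_quality
    return base * penalty

# Under this assumed form, the reported composite and penalty imply the
# pre-penalty weighted sum, even though two component values are not shown.
implied_base = 0.5672 / 0.9535
print(round(implied_base, 3))
```

If this reading is right, the pre-penalty weighted sum is roughly 0.595, of which cluster coverage contributes 0.3.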
Hallucination Summary
Filter: anti_gt (active). Flagged 4 / 43 suggestions (rate 9.3%). Composite hallucination penalty: 0.9535.
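The reported penalty of 0.9535 is consistent with a linear penalty of the form 1 - alpha x rate, using alpha=0.5 and the flagged rate of 4 / 43. The sketch below checks that arithmetic; the formula is inferred from the numbers, not stated in the report, and the function name is hypothetical.

```python
def hallucination_penalty(flagged, total, alpha=0.5):
    """Linear penalty: 1 - alpha * (flagged / total). Inferred form, not confirmed."""
    rate = flagged / total
    return 1.0 - alpha * rate

penalty = hallucination_penalty(4, 43)
print(round(penalty, 4))  # 0.9535
```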
Filtered Hallucinations
| ID | Failure mode | Title | Reasoning |
| 21 | source_context_confusion | Flag that Toastmasters Brand Manual is open in ... | The suggestion misattributes the Toastmasters brand manual viewing to the Bit... |
| 22 | source_context_confusion | Extract brand colors (hex), fonts, and logo rul... | The suggestion fabricates specific brand colors and fonts linked to the Bitco... |
| 35 | source_context_confusion | Recognise flyer request from TM Praveen Kumaras... | Suggestion misattributes the flyer request context to Toastmasters when user ... |
| 36 | source_context_confusion | Auto-apply Toastmasters brand guidelines to the... | The suggestion misattributes the Toastmasters brand manual viewing to the Bit... |
Top 5 Task-Critical Suggestions
| # | ID | Quality | Title |
| 1 | 4 | 0.980 | Auto-populate research doc with per-org summary stubs |
| 2 | 3 | 0.970 | Detect that 'Summer of Bitcoin Interested Projects' Googl... |
| 3 | 8 | 0.970 | Summarise SOB Student Guide pages user read into an appli... |
| 4 | 9 | 0.970 | Offer a proposal draft outline after 60+ minutes of SOB r... |
| 5 | 10 | 0.970 | Remind that SOB requires a competency test before proposa... |
Top 5 Quality-of-Life Suggestions
| # | ID | Quality | Title |
| 1 | 33 | 0.937 | Auto-draft SOB April 30 deadline and 4 proposal work-bloc... |
| 2 | 17 | 0.870 | Offer to resume GitHub docs task when Canva work wraps |
| 3 | 27 | 0.807 | Recognise Google Docs as primary note-taking surface for ... |
| 4 | 40 | 0.807 | Flag that caption writing is a deferred task (your usual ... |
| 5 | 30 | 0.670 | Produce maintainer dossier with contribution preferences ... |