Persona Evaluation — condition3__email_only__sample500_seed99
| Field | Value |
|---|---|
| Persona | Nimesh Kulatunga |
| Judge model | openai/gpt-4.1-mini |
| Embed model | text-embedding-3-small |
| Rubric draws (k) | 3 |
| Total suggestions | 50 |
| Pipeline mode | hall_pass=anti_gt (active=True) |
Bucket Distribution
| Bucket | Count | % of total |
|---|---|---|
| Task Critical | 1 | 2.0% |
| Quality Of Life | 19 | 38.0% |
| Noise | 29 | 58.0% |
| Hallucinated | 1 | 2.0% |
Set-Level Diversity Metrics
| Metric | Value | Interpretation |
|---|---|---|
| DPP log-det | -21.1657 | Higher = more diverse + high-quality set |
| Cluster coverage | 0.000 | Fraction of BGT clusters with a task-critical hit |
| ILAD | 0.6872 | Mean pairwise distance; higher = more diverse |
| Redundancy rate | 0.100 | Fraction of near-duplicate suggestions (cos > 0.9) |
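
For orientation, the three embedding-based metrics above can be sketched as follows. This is a minimal illustration, not the pipeline's actual code: the function name `set_diversity_metrics`, the quality-weighted kernel `L_ij = q_i * S_ij * q_j`, the use of cosine distance for ILAD, the jitter term, and the per-item reading of "near-duplicate" are all assumptions.

```python
import numpy as np

def set_diversity_metrics(embeddings, quality, dup_threshold=0.9):
    """Return (dpp_log_det, ilad, redundancy_rate) for one suggestion set.

    Hypothetical helper: the report does not state the exact formulas.
    """
    X = np.asarray(embeddings, dtype=float)
    q = np.asarray(quality, dtype=float)
    n = len(X)

    # Unit-normalise rows so that dot products are cosine similarities.
    X = X / np.linalg.norm(X, axis=1, keepdims=True)
    S = X @ X.T

    # Assumed DPP kernel: L_ij = q_i * S_ij * q_j. A small jitter keeps the
    # matrix positive definite; slogdet avoids underflow of the determinant.
    L = q[:, None] * S * q[None, :]
    sign, logdet = np.linalg.slogdet(L + 1e-6 * np.eye(n))
    dpp_log_det = float(logdet) if sign > 0 else float("-inf")

    # ILAD: mean cosine distance over all distinct pairs.
    upper = np.triu_indices(n, k=1)
    ilad = float(np.mean(1.0 - S[upper]))

    # Redundancy (assumed per-item): share of suggestions that have at least
    # one other suggestion with cosine similarity above the threshold.
    S_other = S.copy()
    np.fill_diagonal(S_other, -1.0)
    redundancy_rate = float(np.mean(S_other.max(axis=1) > dup_threshold))

    return dpp_log_det, ilad, redundancy_rate
```

Under these assumptions, a strongly negative log-det such as the one reported is expected for a 50-item set: each near-duplicate pair or low-quality item pushes the determinant toward zero.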
Composite Score
| Component | Weight | Value |
|---|---|---|
| DPP set score (normalised) | 0.5 | - |
| Cluster coverage | 0.3 | 0.000 |
| Mean quality (non-hallucinated) | 0.2 | - |
| Hallucination penalty | alpha=0.5 | x 0.9900 |
Composite score: 0.2786
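
The table suggests a weighted sum of the three components, scaled by the hallucination penalty. A minimal sketch of that reading follows; the function name `composite_score` and the multiplicative form are assumptions, and the inputs in the usage line are illustrative placeholders, since the report does not show the normalised DPP score or the mean quality.

```python
def composite_score(dpp_norm, cluster_coverage, mean_quality, penalty,
                    weights=(0.5, 0.3, 0.2)):
    """Weighted sum of the three components, multiplied by the penalty.

    Assumed form; the weights are taken from the Composite Score table.
    """
    w_dpp, w_cov, w_quality = weights
    base = (w_dpp * dpp_norm
            + w_cov * cluster_coverage
            + w_quality * mean_quality)
    return base * penalty


# Illustrative inputs only (not values from this run).
example = composite_score(dpp_norm=0.4, cluster_coverage=0.0,
                          mean_quality=0.8, penalty=0.99)
```

If this form holds, the reported 0.2786 implies a pre-penalty weighted sum of about 0.2814 (0.2786 / 0.9900), all of it contributed by the DPP and mean-quality terms because cluster coverage is 0.000.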
Hallucination Summary
Filter: anti_gt (active). Flagged 1 / 50 suggestions (rate 2.0%). Composite hallucination penalty: 0.9900.
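
The reported penalty is consistent with a linear form, penalty = 1 - alpha x rate. The sketch below assumes that form; it is inferred from the numbers in this report rather than from the pipeline source, and `hallucination_penalty` is a hypothetical name.

```python
def hallucination_penalty(flagged, total, alpha=0.5):
    """Assumed linear penalty: 1 - alpha * (flagged / total)."""
    if total <= 0:
        raise ValueError("total must be positive")
    rate = flagged / total
    return 1.0 - alpha * rate


# This run: 1 flagged out of 50 suggestions, alpha = 0.5.
penalty = hallucination_penalty(flagged=1, total=50)
```

With 1 of 50 suggestions flagged, the rate is 0.02, so the penalty evaluates to 1 - 0.5 x 0.02 = 0.99, matching the value above.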
Filtered Hallucinations
| ID | Failure mode | Title | Reasoning |
|---|---|---|---|
| 14 | over_elaboration | Elicit Onboarding Milestone Checklist | Suggestion invents specific onboarding steps and Customer.io triggers not sup... |
Top 5 Task-Critical Suggestions
| # | ID | Quality | Title |
|---|---|---|---|
| 1 | 32 | 0.690 | Fix CSE Pipeline: Update GitHub Actions YAML |
Top 5 Quality-of-Life Suggestions
| # | ID | Quality | Title |
|---|---|---|---|
| 1 | 7 | 0.947 | Master Elicit's 'Extract Data' for Literature Reviews |
| 2 | 1 | 0.903 | Master Elicit's 'Extract Data' for Literature Reviews |
| 3 | 31 | 0.853 | Sync Pipeline Maintenance with CSE Market Close |
| 4 | 8 | 0.840 | Use Semantic Scholar API for Citation Mapping |
| 5 | 19 | 0.837 | Automate Research-to-Trade Workflow with Elicit & IBKR |