Persona Evaluation — condition3__email_only__sample500_seed99

| Field | Value |
|---|---|
| Persona | Nimesh Kulatunga |
| Judge model | openai/gpt-4.1-mini |
| Embed model | text-embedding-3-small |
| Rubric draws (k) | 3 |
| Total suggestions | 50 |
| Pipeline mode | hall_pass=anti_gt (active=True) |

Bucket Distribution

| Bucket | Count | % of total |
|---|---|---|
| Task Critical | 1 | 2.0% |
| Quality Of Life | 19 | 38.0% |
| Noise | 29 | 58.0% |
| Hallucinated | 1 | 2.0% |

Set-Level Diversity Metrics

| Metric | Value | Interpretation |
|---|---|---|
| DPP log-det | -21.1657 | Higher = more diverse + high-quality set |
| Cluster coverage | 0.000 | Fraction of BGT clusters with a task-critical hit |
| ILAD | 0.6872 | Mean pairwise distance; higher = more diverse |
| Redundancy rate | 0.100 | Fraction of near-duplicate suggestions (cos > 0.9) |
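
The set-level metrics above can be sketched in a few lines. This is a minimal illustration, not the pipeline's actual code, and it rests on several assumptions: ILAD is taken as mean pairwise cosine distance, the redundancy rate counts suggestions that have at least one other suggestion with cosine similarity above 0.9, and the DPP kernel is the common quality-weighted form L[i][j] = q_i * cos(e_i, e_j) * q_j. The function names, the `jitter` parameter, and the toy vectors are hypothetical.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def ilad(embeddings):
    """Mean pairwise cosine distance (assumed definition of ILAD)."""
    pairs = list(combinations(embeddings, 2))
    return sum(1.0 - cosine(u, v) for u, v in pairs) / len(pairs)


def redundancy_rate(embeddings, threshold=0.9):
    """Fraction of items with at least one near-duplicate (assumed reading)."""
    flagged = set()
    for i, j in combinations(range(len(embeddings)), 2):
        if cosine(embeddings[i], embeddings[j]) > threshold:
            flagged.update((i, j))
    return len(flagged) / len(embeddings)


def dpp_log_det(embeddings, qualities, jitter=1e-6):
    """log det of the assumed kernel L = diag(q) * S * diag(q), via Cholesky."""
    n = len(embeddings)
    kernel = [
        [qualities[i] * cosine(embeddings[i], embeddings[j]) * qualities[j]
         for j in range(n)]
        for i in range(n)
    ]
    for i in range(n):
        kernel[i][i] += jitter  # keeps the kernel positive definite
    chol = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(chol[i][k] * chol[j][k] for k in range(j))
            if i == j:
                chol[i][i] = math.sqrt(kernel[i][i] - s)
            else:
                chol[i][j] = (kernel[i][j] - s) / chol[j][j]
    # det(L) is the squared product of the Cholesky diagonal
    return 2.0 * sum(math.log(chol[i][i]) for i in range(n))


# Toy 2-D vectors standing in for real embeddings (illustrative only).
toy = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
print(ilad(toy))
print(redundancy_rate(toy))
print(dpp_log_det(toy, [0.9, 0.8, 0.7]))
```

Under this reading, near-duplicate items push the kernel toward singularity, so the log-determinant becomes strongly negative. That is consistent with the "higher = more diverse" interpretation in the table.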

Composite Score

| Component | Weight | Value |
|---|---|---|
| DPP set score (normalised) | 0.5 | |
| Cluster coverage | 0.3 | 0.000 |
| Mean quality (non-hallucinated) | 0.2 | |
| Hallucination penalty | alpha=0.5 | x 0.9900 |

Composite score: 0.2786
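
One plausible reading of the table above is a weighted sum of the three components, multiplied by the hallucination penalty. The sketch below assumes that form; the report does not state the exact formula, and the function name and example inputs are hypothetical. The report omits the DPP set score and mean quality values, so the example does not reproduce 0.2786.

```python
def composite_score(dpp_norm, cluster_coverage, mean_quality, penalty,
                    weights=(0.5, 0.3, 0.2)):
    """Weighted sum of the components, scaled by the penalty (assumed form)."""
    w_dpp, w_cov, w_quality = weights
    base = (w_dpp * dpp_norm
            + w_cov * cluster_coverage
            + w_quality * mean_quality)
    return base * penalty


# Hypothetical inputs, not the values behind this report.
print(composite_score(dpp_norm=0.4, cluster_coverage=0.0,
                      mean_quality=0.5, penalty=0.99))
```

If this form is right, the reported composite of 0.2786 with a penalty of 0.9900 implies a pre-penalty weighted sum of about 0.281. With cluster coverage at 0.000, all of it would come from the DPP and mean-quality terms.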

Hallucination Summary

Filter: anti_gt (active). Flagged 1 / 50 suggestions (rate 2.0%). Composite hallucination penalty: 0.9900.
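
The reported penalty is consistent with a multiplier of the form 1 - alpha * rate. The sketch below assumes that form; the report does not state it explicitly, and the function name is hypothetical. With alpha = 0.5 and 1 of 50 suggestions flagged, it gives 1 - 0.5 * 0.02 = 0.99.

```python
def hallucination_penalty(flagged, total, alpha=0.5):
    """Multiplicative penalty, assumed to be 1 - alpha * hallucination rate."""
    rate = flagged / total
    return 1.0 - alpha * rate


# Values from this report: 1 flagged out of 50 suggestions, alpha = 0.5.
print(f"{hallucination_penalty(1, 50):.4f}")
```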

Filtered Hallucinations

| ID | Failure mode | Title | Reasoning |
|---|---|---|---|
| 14 | over_elaboration | Elicit Onboarding Milestone Checklist | Suggestion invents specific onboarding steps and Customer.io triggers not sup... |

Top 5 Task-Critical Suggestions

| # | ID | Quality | Title |
|---|---|---|---|
| 1 | 32 | 0.690 | Fix CSE Pipeline: Update GitHub Actions YAML |

Top 5 Quality-of-Life Suggestions

| # | ID | Quality | Title |
|---|---|---|---|
| 1 | 7 | 0.947 | Master Elicit's 'Extract Data' for Literature Reviews |
| 2 | 1 | 0.903 | Master Elicit's 'Extract Data' for Literature Reviews |
| 3 | 31 | 0.853 | Sync Pipeline Maintenance with CSE Market Close |
| 4 | 8 | 0.840 | Use Semantic Scholar API for Citation Mapping |
| 5 | 19 | 0.837 | Automate Research-to-Trade Workflow with Elicit & IBKR |