GUMBO Persona Evaluation¶
This site publishes the LLM-as-judge evaluation of GUMBO suggestion generation across seven observation conditions, scored against a single user persona.
- Persona judge model: `openai/gpt-4.1-mini` (`temperature=0.0` for bucketing / hallucination, `temperature=0.15` for rubric draws)
- Embedding model: `text-embedding-3-small`
- Sample size: 50 suggestions per condition (stratified, seed `99`). Condition 7 contains 30 because the focus-only export only has 30 raw suggestions to begin with.
- Pipeline: suggestions -> hallucination pass -> bucketize -> persona rubric -> diversity-aware aggregation
- Two runs per condition: with the anti-ground-truth hallucination pass active (`anti_gt`) and with it skipped, so the contribution of each stage is directly comparable.
- Numbers below are from the v2 pipeline (single-item bucketer cache, in-flight rubric dedup, anti-GT id remapping). Reproducing requires the cache files committed under `.persona_cache/v2/`.
Headline composite scores¶
| Condition | Source mix | Composite (active) | Composite (skipped) | Delta | Hallucinations |
|---|---|---|---|---|---|
| 1 | Screen, no metadata | 0.6167 | 0.6167 | +0.0000 | 0 / 50 (0.0%) |
| 2 | Screen, with metadata | 0.6030 | 0.6274 | -0.0244 | 4 / 50 (8.0%) |
| 3 | Email only | 0.2786 | 0.2815 | -0.0028 | 1 / 50 (2.0%) |
| 4 | Slack only | 0.5545 | 0.6185 | -0.0640 | 11 / 50 (22.0%) |
| 5 | Discord only | 0.4994 | 0.6115 | -0.1121 | 20 / 50 (40.0%) |
| 7 | Focus only | 0.5716 | 0.5777 | -0.0061 | 1 / 30 (3.3%) |
| 8 | All sources | 0.5600 | 0.6457 | -0.0857 | 14 / 50 (28.0%) |
The active column is the score we ship; skipped is the ablation that drops the hallucination penalty (`hallucination_penalty = 1.0`) and reuses the bucketer + rubric caches written by the active run. With v2's per-suggestion bucketer cache, the active vs. skipped delta now equals exactly `-0.5 * hallucination_rate` for every condition (modulo rubric cache reuse), so you can read the delta as the marginal contribution of the hallucination filter alone.
Bucket distribution (active runs)¶
| Condition | Task-critical | Quality-of-life | Noise | Hallucinated |
|---|---|---|---|---|
| 1 Screen, no metadata | 16 | 23 | 11 | 0 |
| 2 Screen, with metadata | 14 | 22 | 10 | 4 |
| 3 Email only | 1 | 19 | 29 | 1 |
| 4 Slack only | 5 | 21 | 13 | 11 |
| 5 Discord only | 7 | 17 | 6 | 20 |
| 7 Focus only | 6 | 18 | 5 | 1 |
| 8 All sources | 2 | 10 | 24 | 14 |
Ground-truth ablation (BGT / IGT)¶
These rows feed ground-truth items into the same pipeline, dressed up as suggestions. They establish a metric ceiling: any model run that scores higher than its corresponding ground-truth row is benefiting from a metric quirk, not from genuine quality. See Caveats > 6. Ground-truth ablation under-shoots the ceiling.
| # | Source | Composite (active) | Composite (skipped) | Hallucinations | Task-critical / total |
|---|---|---|---|---|---|
| 91 | BGT only | 0.5853 | 0.6026 | 2 / 27 (7.4%) | 18 / 27 |
| 92 | IGT only | 0.6038 | 0.6185 | 1 / 16 (6.3%) | 6 / 16 |
| 93 | BGT + IGT | 0.5672 | 0.5878 | 4 / 43 (9.3%) | 25 / 43 |
Note the BGT-only and BGT+IGT rows score below C1 (Screen, no metadata). That isn't because the ground truth is worse than the suggestions — it's because the composite's DPP-based set score normalises through `sigmoid(set_score / N)`, which penalises larger sets even when each item is on-target, and the rubric judge under-rates IGT items whose claims aren't directly verifiable from the limited observation context shown to the judge.
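The size penalty falls out of the normalisation itself. A minimal sketch (the function name is illustrative; the real aggregation lives in `gum/evaluation/persona/`):

```python
import math


def set_score_norm(set_score: float, n: int) -> float:
    """sigmoid(set_score / N): the same raw set score, divided by a
    larger N, lands closer to the sigmoid's midpoint of 0.5."""
    return 1.0 / (1.0 + math.exp(-set_score / n))
```

With a fixed raw set score of 8.0, N=16 normalises to roughly 0.62 while N=43 gives roughly 0.55 — the larger set is squeezed toward the middle even when per-item quality is identical.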
Per-condition reports¶
| # | Condition | With hallucination pass | Skipped (ablation) |
|---|---|---|---|
| 1 | Screen, no metadata | anti_gt | skipped |
| 2 | Screen, with metadata | anti_gt | skipped |
| 3 | Email only | anti_gt | skipped |
| 4 | Slack only | anti_gt | skipped |
| 5 | Discord only | anti_gt | skipped |
| 7 | Focus only | anti_gt | skipped |
| 8 | All sources | anti_gt | skipped |
| 91 | GT ablation: BGT only | anti_gt | skipped |
| 92 | GT ablation: IGT only | anti_gt | skipped |
| 93 | GT ablation: BGT+IGT | anti_gt | skipped |
Raw JSON reports are available alongside each Markdown file under `anti_gt/data/conditionN.json`, `skipped/data/conditionN.json`, and `ground_truth/{anti_gt,skipped}/data/conditionN.json`.
Methodology¶
The evaluator is implemented in `gum/evaluation/persona/` and runs in four stages.
1. Hallucination pass (modular)¶
The first stage applies a swappable `HallucinationFilter`. `AntiGTHallucinationFilter` (used here) prompts the judge with a curated list of anti-ground-truth (anti-GT) failure modes for the persona and asks whether each suggestion matches one. Flagged suggestions are bypassed from every downstream stage but still appear in the report under "Filtered Hallucinations" for auditability. `NoOpHallucinationFilter` (selected with `--skip-hallucination-pass`) short-circuits the stage so every suggestion flows through unchanged. The composite hallucination penalty defaults to `1.0`, keeping scores numerically comparable to the active run when no items would have been flagged anyway.
2. Bucketing¶
The judge classifies each surviving suggestion individually (one suggestion
per LLM call, temperature=0). Per-suggestion bucketing means a suggestion's
bucket is a function of its own content alone, so active and skipped runs
produce identical bucket assignments for every non-hallucinated suggestion.
Bucket assignments are persisted in persona_bucketer_cache_<condition>.json.
The four buckets are:
- `task_critical` -- directly serves an active goal in the persona's ground truth.
- `quality_of_life` -- helpful but not on the critical path.
- `noise` -- off-persona or low-signal.
- `hallucinated` -- only set by the filter in stage 1.
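The read-through cache that makes bucket assignments batch-independent can be sketched like this (illustrative only; the real bucketer lives in `gum/evaluation/persona/`):

```python
import json
from pathlib import Path

BUCKETS = {"task_critical", "quality_of_life", "noise"}  # "hallucinated" is stage-1 only


def bucket_suggestion(suggestion: dict, cache_path: Path, judge) -> str:
    """Per-suggestion bucketing with a read-through JSON cache.

    Because the cache is keyed by suggestion id alone (no batch context),
    active and skipped runs resolve identical buckets for every surviving
    suggestion.
    """
    cache = json.loads(cache_path.read_text()) if cache_path.exists() else {}
    sid = suggestion["id"]
    if sid not in cache:
        bucket = judge(suggestion)  # one LLM call, temperature=0
        assert bucket in BUCKETS
        cache[sid] = bucket
        cache_path.write_text(json.dumps(cache, indent=2))
    return cache[sid]
```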
3. Per-suggestion rubric¶
For every non-noise, non-hallucinated suggestion, the judge produces three
draws (k=3) on a four-axis rubric: Relevance, Accuracy,
Timeliness, Specificity. Per-suggestion quality is the mean across
draws and axes.
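Concretely, assuming each draw comes back as a dict of axis scores in [0, 1], the per-suggestion quality is:

```python
AXES = ("relevance", "accuracy", "timeliness", "specificity")


def rubric_quality(draws: list[dict]) -> float:
    """Mean across draws and axes (k = len(draws), 4 axes)."""
    scores = [draw[axis] for draw in draws for axis in AXES]
    return sum(scores) / len(scores)
```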
4. Diversity-aware aggregation¶
Set-level metrics are computed over the task_critical bucket using suggestion
embeddings:
- DPP log-determinant (quality-weighted determinantal point process) -- rewards sets that are simultaneously high quality and well spread.
- Cluster coverage -- fraction of beneficial-ground-truth (BGT) clusters hit by at least one task-critical suggestion.
- ILAD (mean intra-list angular distance) -- raw diversity.
- Redundancy rate -- fraction of near-duplicate pairs (cosine > 0.9).
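A sketch of the four set-level metrics over unit-normalised embeddings. The quality-weighted kernel and the cluster-assignment input are assumptions here, not the evaluator's verbatim implementation:

```python
import numpy as np


def set_metrics(emb: np.ndarray, quality: np.ndarray,
                bgt_cluster: list, n_clusters: int):
    """emb: (n, d) embeddings; quality: (n,) per-item rubric quality;
    bgt_cluster[i]: index of the BGT cluster suggestion i hits."""
    emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    sim = emb @ emb.T                                  # cosine similarity
    # Quality-weighted DPP kernel L = diag(q) S diag(q); its log-det
    # rewards sets that are high quality *and* well spread.
    L = np.diag(quality) @ sim @ np.diag(quality)
    dpp_log_det = float(np.linalg.slogdet(L + 1e-9 * np.eye(len(quality)))[1])
    coverage = len(set(bgt_cluster)) / n_clusters      # cluster coverage
    iu = np.triu_indices(len(quality), k=1)            # unique pairs
    ilad = float(np.mean(np.arccos(np.clip(sim[iu], -1.0, 1.0))))  # mean angular distance
    redundancy = float(np.mean(sim[iu] > 0.9))         # near-duplicate pairs
    return dpp_log_det, coverage, ilad, redundancy
```

Two orthogonal, full-quality items give a kernel of the identity (log-det 0), full cluster coverage, an ILAD of pi/2, and zero redundancy.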
The composite score is `hallucination_penalty * inner_score`, where `inner_score` aggregates the per-suggestion rubric quality and the set-level metrics above, and `hallucination_penalty = 1 - 0.5 * hallucination_rate`. When the hallucination pass is skipped the penalty is forced to `1.0` so the formula stays comparable.
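In code, the penalty and composite reduce to:

```python
def composite(inner_score: float, hallucination_rate: float,
              skip_pass: bool = False) -> float:
    """composite = hallucination_penalty * inner_score.

    With the pass skipped the penalty is pinned to 1.0 so active and
    skipped composites stay numerically comparable.
    """
    penalty = 1.0 if skip_pass else 1.0 - 0.5 * hallucination_rate
    return penalty * inner_score
```

For example, a 40% hallucination rate (condition 5) scales the inner score by 0.8, and the ~7% false-positive rate on the ground-truth rows caps their active-mode ceiling near 0.965 of the inner score.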
Caveats¶
The first three describe behaviour that was fixed during the v2 bug-fix pass. They are kept here for historical context — the numbers in the tables above are post-fix and reflect the corrected pipeline.
1. Bucketer batch-context coupling (FIXED in v2)¶
Status: Fixed. The bucketer now classifies one suggestion per LLM call
and persists bucket assignments in persona_bucketer_cache_<condition>.json.
Active and skipped runs reuse identical entries for every non-hallucinated
suggestion, so the active vs. skipped delta now equals exactly
-0.5 * hallucination_rate for every condition (modulo rubric cache reuse).
Original symptom: the bucketer prompted the judge in batches of 5
suggestions, so removing one suggestion in stage 1 shifted batch composition
for every downstream batch and could re-bucket other surviving suggestions.
Most visible on conditions 2, 3, 4, 5, 8 — the active vs. skipped delta did
not equal the analytical penalty (-0.5 * rate).
2. Anti-GT id leakage in failure_mode (FIXED in v2)¶
Status: Fixed. The hallucination prompt now demands the canonical
hallucination_modes[].name value in the failure_mode field and explicitly
forbids anti-GT ids. As a defensive backstop each anti_gt[] entry in
ground_truth_v3.json carries a failure_mode_id, and the server maps any
stray anti-GT id to its canonical mode name before returning a verdict. New
runs render the canonical name in the failure_mode column with no
Unknown failure_mode warnings.
Original symptom: anti_gt reports occasionally rendered - instead of
a canonical mode name when the judge echoed the anti-GT item id (e.g.
anti_02) into the failure_mode field. The is_hallucination flag was
unaffected, so headline numbers were not impacted — only the audit column.
3. Rubric-judge concurrency variance (FIXED in v2)¶
Status: Fixed. The rubric loop now uses an in-flight pending Future
table so concurrent cache misses for the same (suggestion_id, draw_index)
share a single LLM call. New CLI flags --rubric-temperature and
--rubric-draws give a deterministic verification path: invoke
--rubric-temperature 0.0 --rubric-draws 1 to force fully deterministic
rubric scoring on a fresh cache.
Original symptom: conditions with zero hallucinations should produce identical composites in active and skipped modes by construction, but a ~0.0015 residual remained from rubric-judge variance during concurrent cache fills. Within evaluator noise, not a real signal.
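The in-flight dedup pattern can be sketched with `concurrent.futures` (names are illustrative):

```python
from concurrent.futures import Future, ThreadPoolExecutor
from threading import Lock

_pending: dict = {}   # (suggestion_id, draw_index) -> in-flight Future
_lock = Lock()


def rubric_draw(key, call_llm, pool: ThreadPoolExecutor) -> Future:
    """Deduplicate concurrent cache misses for the same key.

    The first caller submits the real LLM call; every concurrent caller
    for the same (suggestion_id, draw_index) receives the same in-flight
    Future instead of issuing a duplicate call.
    """
    with _lock:
        fut = _pending.get(key)
        if fut is None:
            fut = pool.submit(call_llm, key)
            _pending[key] = fut
        return fut
```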
4. Sample size is 50, not the full corpus¶
Each row is 50 suggestions stratified-sampled from the per-condition export (typically ~500 suggestions). Condition 7 is 30 because the focus-only export only contains 30 raw suggestions. Numbers are stable enough for ranking conditions but should not be quoted as the absolute quality of the full generation run.
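A seeded stratified sample of this shape can be sketched as follows (the stratification key is an assumption — the export's actual strata aren't specified here):

```python
import random
from collections import defaultdict


def stratified_sample(suggestions: list[dict], n: int,
                      seed: int = 99, key: str = "bucket") -> list[dict]:
    """Draw a seeded sample of n items, proportionally per stratum."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for s in suggestions:
        strata[s[key]].append(s)
    total = len(suggestions)
    sample = []
    for group in strata.values():
        take = min(len(group), round(n * len(group) / total))
        sample.extend(rng.sample(group, take))
    return sample[:n]
```

Seeding with a fresh `random.Random(seed)` per call makes the draw reproducible, which is what lets the active and skipped runs score the same 50 suggestions.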
5. Persona is a single user¶
This is a one-persona snapshot. Cross-persona generalisation is not in scope for this report; treat it as a calibration of the evaluator pipeline, not as a verdict on GUMBO across users.
6. Ground-truth ablation under-shoots the ceiling¶
The ground-truth ablation rows (91 / 92 / 93) feed BGT, IGT, and BGT+IGT items through the same scoring path as real suggestions. Naively you would expect them to score ~1.0 — these are the items the metric was designed around. They don't, and the gap is informative:
- DPP normalisation penalises set size. The set score is `sigmoid(dpp_log_det / N)`, which scales the determinant by the number of scored items. BGT+IGT (43 items) gets divided by a larger N than C1 (16 task-critical out of 50), so the same per-item quality projects to a lower normalised set score; the right tail of the metric is squeezed.
- Rubric judge under-rates IGT. The judge is shown a small slice of observations and asked to score Accuracy + Specificity. IGT items are by design inferred from external knowledge of the persona, so their claims often aren't grounded in the observations the judge sees. Mean rubric quality on the ground-truth runs (~0.74–0.83) is below several real conditions (~0.85–0.89) for this reason, not because the items are wrong.
- Anti-GT filter false-positives bleed in. 7 of ~86 ground-truth items across the three conditions are flagged by the hallucination filter (~6–9% rate) — they shouldn't be. Most map to the `anti_07` brand-manual / source-confusion theme catching legitimate Toastmasters references. This caps the active-mode ceiling at `(1 - 0.5 * 0.07) ≈ 0.965 * inner_score`.
The takeaway: any model-generated condition that beats its corresponding ground-truth row is exploiting one of the three effects above (small set, in-context-verifiable claims, no false-positive hallucinations) rather than producing better content. C1 beating C91 is in this bucket, not a sign that C1's suggestions outperform the BGT.
Reproducing a single row¶
```bash
python -m gum.evaluation.persona.persona_runner \
  --suggestions gumbo_data_export_v2/exports/condition1__screen_no_metadata__suggestions.json \
  --persona persona_template.yaml \
  --ground-truth "gumbo ground truth v3.txt" \
  --judge-model openai/gpt-4.1-mini \
  --max-suggestions 50 \
  --seed 99 \
  --output-dir gumbo_data_export_v2/persona_evaluation/anti_gt
```
Add `--skip-hallucination-pass` and change `--output-dir` to `.../skipped` for the ablation row.
Run order matters for cache reuse. Run the active mode first, then the skipped mode; the skipped run will hit the rubric cache and yield numerically identical composites for any condition with zero hallucinations.
Deterministic verification. Pass --rubric-temperature 0.0 --rubric-draws 1
to force fully deterministic rubric scoring on a fresh cache. Use this when
you need byte-identical reports across reruns.
Reproducing the ground-truth ablation rows¶
The ground-truth rows feed BGT / IGT / BGT+IGT items from
gum/evaluation/ground_truth/ground_truth_v3.json through the same scoring
path, dressed up as suggestions. The pre-built input files live under
gumbo_data_export_v2/exports/oracle/ (the directory keeps its original
name since the cache files reference it):
```bash
python -m gum.evaluation.persona.persona_runner \
  --suggestions gumbo_data_export_v2/exports/oracle/condition91__bgt_oracle__suggestions.json \
  --persona persona_template.yaml \
  --ground-truth "gumbo ground truth v3.txt" \
  --judge-model openai/gpt-4.1-mini \
  --max-suggestions 100 \
  --seed 99 \
  --output-dir gumbo_data_export_v2/persona_evaluation/v2/oracle/anti_gt \
  --cache-path .persona_cache/oracle/persona_judge_cache_condition91.json
```
Swap `condition91__bgt_oracle` for `condition92__igt_oracle` (IGT only) or `condition93__bgt_igt_oracle` (combined). For the skipped row, add `--skip-hallucination-pass`, keep the cache path `.persona_cache/oracle/persona_judge_cache_condition91.json` (active and skipped share a cache), and point `--output-dir` at the `.../skipped` sibling.