GUMBO Persona Evaluation

This site publishes the LLM-as-judge evaluation of GUMBO suggestion generation across seven observation conditions, scored against a single user persona.

  • Persona judge model: openai/gpt-4.1-mini (temperature=0.0 for bucketing / hallucination, temperature=0.15 for rubric draws)
  • Embedding model: text-embedding-3-small
  • Sample size: 50 suggestions per condition (stratified, seed 99). Condition 7 contains 30 because the focus-only export only has 30 raw suggestions to begin with.
  • Pipeline: suggestions -> hallucination pass -> bucketize -> persona rubric -> diversity-aware aggregation
  • Two runs per condition: one with the anti-ground-truth hallucination pass active (anti_gt) and one with it skipped, so the contribution of the hallucination stage is directly measurable.
  • Numbers below are from the v2 pipeline (single-item bucketer cache, in-flight rubric dedup, anti-GT id remapping). Reproducing requires the cache files committed under .persona_cache/v2/.

Headline composite scores

| Condition | Source mix            | Composite (active) | Composite (skipped) | Delta   | Hallucinations  |
|-----------|-----------------------|--------------------|---------------------|---------|-----------------|
| 1         | Screen, no metadata   | 0.6167             | 0.6167              | +0.0000 | 0 / 50 (0.0%)   |
| 2         | Screen, with metadata | 0.6030             | 0.6274              | -0.0244 | 4 / 50 (8.0%)   |
| 3         | Email only            | 0.2786             | 0.2815              | -0.0028 | 1 / 50 (2.0%)   |
| 4         | Slack only            | 0.5545             | 0.6185              | -0.0640 | 11 / 50 (22.0%) |
| 5         | Discord only          | 0.4994             | 0.6115              | -0.1121 | 20 / 50 (40.0%) |
| 7         | Focus only            | 0.5716             | 0.5777              | -0.0061 | 1 / 30 (3.3%)   |
| 8         | All sources           | 0.5600             | 0.6457              | -0.0857 | 14 / 50 (28.0%) |

The active column is the score we ship; skipped is the ablation that drops the hallucination penalty (hallucination_penalty = 1.0) and reuses the bucketer + rubric caches written by the active run. With v2's per-suggestion bucketer cache the active vs. skipped delta now tracks the multiplicative hallucination penalty — roughly -0.5 * hallucination_rate times the composite — for every condition (within rubric cache reuse and the re-inclusion of filtered suggestions in the skipped run), so you can read the delta as the marginal contribution of the hallucination filter alone.
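
As a sanity check, the relationship can be sketched in a few lines. The numbers are condition 2's row from the table above; the relation is approximate because the skipped run also scores the suggestions the active run filtered out:

```python
def hallucination_penalty(rate):
    """Multiplicative penalty on the composite: 1 - 0.5 * hallucination_rate."""
    return 1.0 - 0.5 * rate


# Condition 2: 4 of 50 suggestions flagged (8.0%).
skipped_composite = 0.6274
expected_active = skipped_composite * hallucination_penalty(4 / 50)
expected_delta = expected_active - skipped_composite  # about -0.025
```

The observed delta for condition 2 (-0.0244) lands close to this -0.0251 estimate; the residual comes from the four filtered suggestions participating in skipped-mode scoring.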

Bucket distribution (active runs)

| Condition | Source mix            | Task-critical | Quality-of-life | Noise | Hallucinated |
|-----------|-----------------------|---------------|-----------------|-------|--------------|
| 1         | Screen, no metadata   | 16            | 23              | 11    | 0            |
| 2         | Screen, with metadata | 14            | 22              | 10    | 4            |
| 3         | Email only            | 1             | 19              | 29    | 1            |
| 4         | Slack only            | 5             | 21              | 13    | 11           |
| 5         | Discord only          | 7             | 17              | 6     | 20           |
| 7         | Focus only            | 6             | 18              | 5     | 1            |
| 8         | All sources           | 2             | 10              | 24    | 14           |

Ground-truth ablation (BGT / IGT)

These rows feed ground-truth items into the same pipeline, dressed up as suggestions. They establish a metric ceiling: any model run that scores higher than its corresponding ground-truth row is benefiting from a metric quirk, not from genuine quality. See Caveats > 6. Ground-truth ablation under-shoots.

| #  | Source    | Composite (active) | Composite (skipped) | Hallucinations | Task-critical / total |
|----|-----------|--------------------|---------------------|----------------|-----------------------|
| 91 | BGT only  | 0.5853             | 0.6026              | 2 / 27 (7.4%)  | 18 / 27               |
| 92 | IGT only  | 0.6038             | 0.6185              | 1 / 16 (6.3%)  | 6 / 16                |
| 93 | BGT + IGT | 0.5672             | 0.5878              | 4 / 43 (9.3%)  | 25 / 43               |

Note the BGT-only and BGT+IGT rows score below C1 (Screen, no metadata). That isn't because the ground truth is worse than the suggestions — it's because the composite's DPP-based set score normalises through sigmoid(dpp_log_det / N), which penalises larger sets even when each item is on-target, and because the rubric judge under-rates IGT items whose claims aren't directly verifiable from the limited observation context shown to the judge.

Per-condition reports

| #  | Condition              | With hallucination pass | Skipped (ablation) |
|----|------------------------|-------------------------|--------------------|
| 1  | Screen, no metadata    | anti_gt                 | skipped            |
| 2  | Screen, with metadata  | anti_gt                 | skipped            |
| 3  | Email only             | anti_gt                 | skipped            |
| 4  | Slack only             | anti_gt                 | skipped            |
| 5  | Discord only           | anti_gt                 | skipped            |
| 7  | Focus only             | anti_gt                 | skipped            |
| 8  | All sources            | anti_gt                 | skipped            |
| 91 | GT ablation: BGT only  | anti_gt                 | skipped            |
| 92 | GT ablation: IGT only  | anti_gt                 | skipped            |
| 93 | GT ablation: BGT+IGT   | anti_gt                 | skipped            |

Raw JSON reports are available alongside each Markdown file under anti_gt/data/conditionN.json, skipped/data/conditionN.json, and ground_truth/{anti_gt,skipped}/data/conditionN.json.


Methodology

The evaluator is implemented in gum/evaluation/persona/ and runs in four stages.

1. Hallucination pass (modular)

The first stage applies a swappable HallucinationFilter.

  • AntiGTHallucinationFilter (used here) prompts the judge with a curated list of anti-ground-truth (anti-GT) failure modes for the persona and asks whether each suggestion matches one. Flagged suggestions are excluded from every downstream stage but still appear in the report under "Filtered Hallucinations" for auditability.
  • NoOpHallucinationFilter (selected with --skip-hallucination-pass) short-circuits the stage so every suggestion flows through unchanged. The composite hallucination penalty defaults to 1.0, keeping scores numerically comparable to the active run when no items would have been flagged anyway.
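
A minimal sketch of the swappable-filter shape, assuming a Protocol-style interface (the Verdict layout and the judge method name are illustrative, not the actual gum/evaluation/persona API; only the two filter class names come from this report):

```python
from dataclasses import dataclass
from typing import List, Optional, Protocol


@dataclass
class Verdict:
    suggestion_id: str
    is_hallucination: bool
    failure_mode: Optional[str] = None  # canonical hallucination_modes[].name


class HallucinationFilter(Protocol):
    def judge(self, suggestions: List[dict]) -> List[Verdict]: ...


class NoOpHallucinationFilter:
    """Selected by --skip-hallucination-pass: everything flows through."""

    def judge(self, suggestions: List[dict]) -> List[Verdict]:
        return [Verdict(s["id"], False) for s in suggestions]
```

An anti-GT filter drops in behind the same judge method, prompting the LLM with the persona's anti-GT failure modes instead of returning a constant verdict.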

2. Bucketing

The judge classifies each surviving suggestion individually (one suggestion per LLM call, temperature=0). Per-suggestion bucketing means a suggestion's bucket is a function of its own content alone, so active and skipped runs produce identical bucket assignments for every non-hallucinated suggestion. Bucket assignments are persisted in persona_bucketer_cache_<condition>.json.

The four buckets are:

  • task_critical -- directly serves an active goal in the persona's ground truth.
  • quality_of_life -- helpful but not on the critical path.
  • noise -- off-persona or low-signal.
  • hallucinated -- only set by the filter in stage 1.

3. Per-suggestion rubric

For every non-noise, non-hallucinated suggestion, the judge produces three draws (k=3) on a four-axis rubric: Relevance, Accuracy, Timeliness, Specificity. Per-suggestion quality is the mean across draws and axes.
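
A sketch of this aggregation, assuming each draw arrives as a dict of axis scores in [0, 1] (the dict layout is an assumption; the mean-over-draws-and-axes rule is as described above):

```python
AXES = ("relevance", "accuracy", "timeliness", "specificity")


def suggestion_quality(draws):
    """Mean over k rubric draws and the four axes (k=3 in this report)."""
    scores = [draw[axis] for draw in draws for axis in AXES]
    return sum(scores) / len(scores)


# Three hypothetical draws for one suggestion (temperature=0.15 adds jitter).
draws = [
    {"relevance": 0.9, "accuracy": 0.8, "timeliness": 0.7, "specificity": 0.8},
    {"relevance": 0.9, "accuracy": 0.7, "timeliness": 0.7, "specificity": 0.8},
    {"relevance": 0.8, "accuracy": 0.8, "timeliness": 0.7, "specificity": 0.9},
]
quality = suggestion_quality(draws)
```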

4. Diversity-aware aggregation

Set-level metrics are computed over the task_critical bucket using suggestion embeddings:

  • DPP log-determinant (quality-weighted determinantal point process) -- rewards sets that are simultaneously high quality and well spread.
  • Cluster coverage -- fraction of beneficial-ground-truth (BGT) clusters hit by at least one task-critical suggestion.
  • ILAD (mean intra-list angular distance) -- raw diversity.
  • Redundancy rate -- fraction of near-duplicate pairs (cosine > 0.9).
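
Two of these metrics are simple enough to sketch directly over embedding vectors (a pure-Python illustration, not the actual implementation in gum/evaluation/persona/):

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)


def _pairs(n):
    return [(i, j) for i in range(n) for j in range(i + 1, n)]


def ilad(embs):
    """Mean intra-list angular distance: average of 1 - cosine over all pairs."""
    ps = _pairs(len(embs))
    return sum(1 - cosine(embs[i], embs[j]) for i, j in ps) / len(ps)


def redundancy_rate(embs, threshold=0.9):
    """Fraction of pairs that are near-duplicates (cosine > threshold)."""
    ps = _pairs(len(embs))
    return sum(cosine(embs[i], embs[j]) > threshold for i, j in ps) / len(ps)


# Three toy embeddings: the first and third are near-duplicates.
embs = [(1.0, 0.0), (0.0, 1.0), (0.99, 0.14)]
```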

The composite score is

composite = (0.5 * dpp_norm + 0.3 * cluster_coverage + 0.2 * mean_quality)
            * hallucination_penalty

with hallucination_penalty = 1 - 0.5 * hallucination_rate. When the hallucination pass is skipped the penalty is forced to 1.0 so the formula stays comparable.
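
The formula translates directly (weights and penalty as documented above; the input values below are made up):

```python
def composite(dpp_norm, cluster_coverage, mean_quality,
              hallucination_rate=0.0, skip_hallucination_pass=False):
    """Composite score: weighted inner score times the hallucination penalty."""
    penalty = 1.0 if skip_hallucination_pass else 1.0 - 0.5 * hallucination_rate
    inner = 0.5 * dpp_norm + 0.3 * cluster_coverage + 0.2 * mean_quality
    return inner * penalty


active = composite(0.6, 0.7, 0.85, hallucination_rate=0.08)   # penalised
skipped = composite(0.6, 0.7, 0.85, skip_hallucination_pass=True)
```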


Caveats

The first three describe behaviour that was fixed during the v2 bug-fix pass. They are kept here for historical context — the numbers in the tables above are post-fix and reflect the corrected pipeline.

1. Bucketer batch-context coupling (FIXED in v2)

Status: Fixed. The bucketer now classifies one suggestion per LLM call and persists bucket assignments in persona_bucketer_cache_<condition>.json. Active and skipped runs reuse identical entries for every non-hallucinated suggestion, so the active vs. skipped delta now tracks the analytical hallucination penalty — roughly -0.5 * hallucination_rate times the composite — for every condition (modulo rubric cache reuse).

Original symptom: the bucketer prompted the judge in batches of 5 suggestions, so removing one suggestion in stage 1 shifted batch composition for every downstream batch and could re-bucket other surviving suggestions. Most visible on conditions 2, 3, 4, 5, 8 — the active vs. skipped delta did not equal the analytical penalty (-0.5 * rate).

2. Anti-GT id leakage in failure_mode (FIXED in v2)

Status: Fixed. The hallucination prompt now demands the canonical hallucination_modes[].name value in the failure_mode field and explicitly forbids anti-GT ids. As a defensive backstop each anti_gt[] entry in ground_truth_v3.json carries a failure_mode_id, and the server maps any stray anti-GT id to its canonical mode name before returning a verdict. New runs render the canonical name in the failure_mode column with no Unknown failure_mode warnings.

Original symptom: anti_gt reports occasionally rendered - instead of a canonical mode name when the judge echoed the anti-GT item id (e.g. anti_02) into the failure_mode field. The is_hallucination flag was unaffected, so headline numbers were not impacted — only the audit column.

3. Rubric-judge concurrency variance (FIXED in v2)

Status: Fixed. The rubric loop now uses an in-flight pending Future table so concurrent cache misses for the same (suggestion_id, draw_index) share a single LLM call. New CLI flags --rubric-temperature and --rubric-draws give a deterministic verification path: invoke --rubric-temperature 0.0 --rubric-draws 1 to force fully deterministic rubric scoring on a fresh cache.

Original symptom: conditions with zero hallucinations should produce identical composites in active and skipped modes by construction, but a ~0.0015 residual remained from rubric-judge variance during concurrent cache fills. Within evaluator noise, not a real signal.

4. Sample size is 50, not the full corpus

Each row is 50 suggestions stratified-sampled from the per-condition export (typically ~500 suggestions). Condition 7 is 30 because the focus-only export only contains 30 raw suggestions. Numbers are stable enough for ranking conditions but should not be quoted as the absolute quality of the full generation run.
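
The sampling step might look roughly like this (the stratum key and the proportional-allocation rule are assumptions; only the sizes and the seed come from this report):

```python
import random
from collections import defaultdict


def stratified_sample(suggestions, n=50, seed=99, key=lambda s: s["source"]):
    """Deterministically sample up to n items, proportionally per stratum."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for s in suggestions:
        strata[key(s)].append(s)
    total = len(suggestions)
    picked = []
    for _, items in sorted(strata.items()):  # stable stratum order
        take = min(len(items), max(1, round(n * len(items) / total)))
        picked.extend(rng.sample(items, take))
    return picked[:n]
```

Seeding a private random.Random(99) rather than the global RNG is what makes a rerun reproduce the same 50-suggestion sample.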

5. Persona is a single user

This is a one-persona snapshot. Cross-persona generalisation is not in scope for this report; treat it as a calibration of the evaluator pipeline, not as a verdict on GUMBO across users.

6. Ground-truth ablation under-shoots the ceiling

The ground-truth ablation rows (91 / 92 / 93) feed BGT, IGT, and BGT+IGT items through the same scoring path as real suggestions. Naively you would expect them to score ~1.0 — these are the items the metric was designed around. They don't, and the gap is informative:

  • DPP normalisation penalises set size. The set score is sigmoid(dpp_log_det / N), which divides the log-determinant by the number of scored items. The log-determinant grows sublinearly once diversity saturates, so BGT+IGT (43 items) gets divided by a larger N than C1 (16 task-critical out of 50) and the same per-item quality projects to a lower normalised set score. The right tail of the metric is squeezed.
  • Rubric judge under-rates IGT. The judge is shown a small slice of observations and asked to score Accuracy + Specificity. IGT items are by design inferred from external knowledge of the persona, so their claims often aren't grounded in the observations the judge sees. Mean rubric quality on the ground-truth runs (~0.74–0.83) is below several real conditions (~0.85–0.89) for this reason, not because the items are wrong.
  • Anti-GT filter false-positives bleed in. 7 of ~86 ground-truth items across the three conditions are flagged by the hallucination filter (~6–9% rate) — they shouldn't be. Most map to the anti_07 brand-manual / source-confusion theme catching legitimate Toastmasters references. This caps the active-mode ceiling at (1 - 0.5 * 0.07) ≈ 0.965 * inner_score.
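
The size effect can be demonstrated numerically. The sketch below assumes the log-determinant saturates once the underlying topic clusters are covered — the regime a DPP produces for correlated items — and the per-item increments are made up:

```python
import math


def set_score(dpp_log_det, n_items):
    """sigmoid(dpp_log_det / N), the normalisation described above."""
    return 1.0 / (1.0 + math.exp(-dpp_log_det / n_items))


# Made-up increments: ~0.8 per genuinely novel item across ~12 topic clusters,
# then ~0.1 per additional on-target but correlated item.
log_det_16 = 0.8 * 12 + 0.1 * 4    # a C1-sized task-critical set (16 items)
log_det_43 = 0.8 * 12 + 0.1 * 31   # a BGT+IGT-sized set (43 items)
```

Under these assumptions the larger set normalises lower even though every item in it is on-target, which is exactly the squeeze described above.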

The takeaway: any model-generated condition that beats its corresponding ground-truth row is exploiting one of the three effects above (small set, in-context-verifiable claims, no false-positive hallucinations) rather than producing better content. C1 beating C91 is in this bucket, not a sign that C1's suggestions outperform the BGT.


Reproducing a single row

python -m gum.evaluation.persona.persona_runner \
    --suggestions gumbo_data_export_v2/exports/condition1__screen_no_metadata__suggestions.json \
    --persona persona_template.yaml \
    --ground-truth "gumbo ground truth v3.txt" \
    --judge-model openai/gpt-4.1-mini \
    --max-suggestions 50 \
    --seed 99 \
    --output-dir gumbo_data_export_v2/persona_evaluation/anti_gt

Add --skip-hallucination-pass and change --output-dir to .../skipped for the ablation row.

Run order matters for cache reuse. Run the active mode first, then the skipped mode; the skipped run will hit the rubric cache and yield numerically identical composites for any condition with zero hallucinations.

Deterministic verification. Pass --rubric-temperature 0.0 --rubric-draws 1 to force fully deterministic rubric scoring on a fresh cache. Use this when you need byte-identical reports across reruns.

Reproducing the ground-truth ablation rows

The ground-truth rows feed BGT / IGT / BGT+IGT items from gum/evaluation/ground_truth/ground_truth_v3.json through the same scoring path, dressed up as suggestions. The pre-built input files live under gumbo_data_export_v2/exports/oracle/ (the directory keeps its original name since the cache files reference it):

python -m gum.evaluation.persona.persona_runner \
    --suggestions gumbo_data_export_v2/exports/oracle/condition91__bgt_oracle__suggestions.json \
    --persona persona_template.yaml \
    --ground-truth "gumbo ground truth v3.txt" \
    --judge-model openai/gpt-4.1-mini \
    --max-suggestions 100 \
    --seed 99 \
    --output-dir gumbo_data_export_v2/persona_evaluation/v2/oracle/anti_gt \
    --cache-path .persona_cache/oracle/persona_judge_cache_condition91.json

Swap condition91__bgt_oracle for condition92__igt_oracle (IGT only) or condition93__bgt_igt_oracle (combined). For the skipped row, add --skip-hallucination-pass, keep the same --cache-path (active and skipped share a cache), and point --output-dir at the .../skipped sibling.