When you're tracking pricing data for 290+ SaaS tools, accuracy isn't optional—it's existential. A single false positive (claiming Notion raised prices when they didn't) destroys user trust. Here's how we built a verification system that combines AI extraction with human oversight to achieve 95%+ accuracy.
1. The Problem: AI Extraction Isn't Perfect
Our initial approach was straightforward: use Gemini Pro to extract pricing data from HTML, store it, compare against previous extractions, flag changes. Simple, right?
After analyzing 1,900+ snapshots, we discovered several failure modes:
- Billing period confusion: AI extracting "$48/year" as "$48/month" (12x error)
- Promotional pricing: Capturing sale prices instead of regular prices
- Duplicate plans: Same plan extracted multiple times with different names
- Missing context: Enterprise "Contact Sales" interpreted as $0
Real Example: Our system once flagged Notion with a "+1040% price increase" when the actual change was from €10/month to €114/year. Normalized, €114/year is roughly €9.50/month, so the price was effectively unchanged; only the billing period differed. Publishing that would have been a catastrophic false positive.
The core insight: you can't automate what you can't verify. We needed a way to establish a trusted "ground truth" before automation could take over.
2. The Baseline Concept
A baseline is the first verified snapshot for each tool. Think of it as the "anchor point" that all future comparisons reference. The key principle:
"The first extraction MUST be human-verified. After that, automation can detect deviations."
Here's the workflow:
- First crawl: AI extracts pricing data → stored as "candidate baseline"
- Human review: Admin reviews extraction against actual pricing page
- Approval/Rejection: If correct, baseline is approved. If wrong, admin corrects it.
- Automation activates: Future crawls compared against verified baseline
- Anomaly flagging: Significant deviations trigger new human review
This creates a "trust chain" where automated decisions inherit confidence from human-verified foundations.
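To make the trust chain concrete, here is a minimal sketch of how a crawl could be routed based on the monitor's current baseline status. The routeCrawl helper and status values below are illustrative names, not our production code:
// Illustrative sketch of the trust chain; names are assumptions, not production code
type BaselineStatus = "pending" | "verified" | "needs_review";

function routeCrawl(status: BaselineStatus, hasAnomalies: boolean): string {
  if (status !== "verified") {
    // No human-verified anchor yet (or it was demoted): a person must review first
    return "queue_for_human_review";
  }
  // Verified baseline: automation may accept, unless the crawl looks anomalous
  return hasAnomalies ? "flag_for_review" : "auto_accept";
}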
3. Anomaly Detection Thresholds
Not every difference is an anomaly. Pricing pages change formatting, add/remove trial badges, update copyright years. We needed thresholds that catch real changes while ignoring noise.
After analyzing historical data, we settled on these thresholds:
| Threshold | Value | Rationale |
|---|---|---|
| Price Change | >50% | Most real price changes are 10-30%. 50%+ usually indicates extraction error. |
| New Plans | >3 | Companies rarely add 4+ plans at once. Often indicates duplicate extraction. |
| Missing Plans | >2 | Removing 3+ plans is rare. Usually means incomplete page load. |
| Confidence Score | <0.5 | Below 50% confidence, human review is mandatory. |
Here's the actual implementation:
const ANOMALY_THRESHOLDS = {
  PRICE_CHANGE_PERCENT: 50, // >50% change is suspicious
  MAX_NEW_PLANS: 3,         // More than 3 new plans is suspicious
  MAX_MISSING_PLANS: 2,     // More than 2 missing plans is suspicious
  MIN_CONFIDENCE: 0.5,      // Below this, flag as anomaly
};
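These thresholds feed a comparison step that decides whether a crawl can be accepted automatically. A rough sketch of that check follows; the ComparisonStats shape and isAnomalous name are assumptions for illustration:
// Illustrative anomaly check built on ANOMALY_THRESHOLDS
interface ComparisonStats {
  maxPriceChangePercent: number; // largest absolute % change among matched plans
  newPlanCount: number;          // plans present now but not in the baseline
  missingPlanCount: number;      // baseline plans absent from the new extraction
  confidence: number;            // output of calculateConfidence (next section)
}

function isAnomalous(stats: ComparisonStats): boolean {
  return (
    Math.abs(stats.maxPriceChangePercent) > ANOMALY_THRESHOLDS.PRICE_CHANGE_PERCENT ||
    stats.newPlanCount > ANOMALY_THRESHOLDS.MAX_NEW_PLANS ||
    stats.missingPlanCount > ANOMALY_THRESHOLDS.MAX_MISSING_PLANS ||
    stats.confidence < ANOMALY_THRESHOLDS.MIN_CONFIDENCE
  );
}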
4. Confidence Scoring Algorithm
The confidence score quantifies how much we trust an extraction compared to the baseline. It starts at 1.0 (100%) and gets penalized for various issues:
| Issue | Penalty | Example |
|---|---|---|
| Plan count difference | -10% per plan | Baseline has 4 plans, new has 6 → -20% |
| Missing plan | -15% per plan | "Enterprise" plan disappeared → -15% |
| Anomalous price change | -20% per change | Pro: $10 → $100 (900% change) → -20% |
The algorithm in code:
function calculateConfidence(
  baseline: PlanData[],
  current: PlanData[]
): number {
  let confidence = 1.0;

  // Penalize plan count differences
  const planDiff = Math.abs(baseline.length - current.length);
  confidence -= planDiff * 0.10;

  // Penalize missing plans
  const missingPlans = baseline.filter(
    b => !current.some(c => normalizePlanName(c.name) === normalizePlanName(b.name))
  );
  confidence -= missingPlans.length * 0.15;

  // Penalize anomalous price changes
  for (const basePlan of baseline) {
    const match = findMatchingPlan(current, basePlan.name);
    if (match) {
      const change = calculatePriceChange(basePlan.price, match.price);
      if (Math.abs(change) > ANOMALY_THRESHOLDS.PRICE_CHANGE_PERCENT) {
        confidence -= 0.20;
      }
    }
  }

  return Math.max(0, Math.min(1, confidence));
}
Plan Name Normalization
Comparing plan names requires fuzzy matching. "Professional Plan" and "Professional" should match. We normalize by removing common suffixes:
function normalizePlanName(name: string): string {
  return name
    .toLowerCase()
    .replace(/\s*(plan|tier|package|edition)\s*/gi, '')
    .replace(/\s+/g, ' ')
    .trim();
}

// "Professional Plan" → "professional"
// "Enterprise Tier" → "enterprise"
// "Business Package" → "business"
Price Normalization
Different tools quote prices in different periods. We normalize everything to monthly:
function normalizeToMonthly(price: number, period: string): number {
  const periodLower = period.toLowerCase();
  if (periodLower.includes('year') || periodLower.includes('annual')) {
    return price / 12;
  }
  if (periodLower.includes('quarter')) {
    return price / 3;
  }
  if (periodLower.includes('week')) {
    return price * 4.33; // Average weeks per month
  }
  return price; // Assume monthly
}
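The last helper referenced in calculateConfidence, calculatePriceChange, only makes sense if both inputs have already been normalized to monthly. A minimal sketch under that assumption:
// Sketch of calculatePriceChange: percent change between two monthly prices
// Assumes both prices were already run through normalizeToMonthly
function calculatePriceChange(baselinePrice: number, currentPrice: number): number {
  if (baselinePrice === 0) {
    // Guard for "Contact Sales" plans captured as 0: no meaningful percentage
    return currentPrice === 0 ? 0 : Infinity;
  }
  return ((currentPrice - baselinePrice) / baselinePrice) * 100;
}
Replaying the Notion incident through these helpers shows why normalization matters: comparing the raw numbers gives calculatePriceChange(10, 114) = +1040%, while comparing normalized prices gives calculatePriceChange(10, normalizeToMonthly(114, 'year')) ≈ -5%, comfortably below the 50% threshold.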
5. The Admin Review Interface
Speed matters for human review. With 290 tools, even 2-minute reviews would take 9+ hours. We built an admin interface optimized for rapid decision-making:
Admin UI Features:
- ✓ Status tabs: Pending / Verified / Needs Review with counts
- ✓ Pre-loaded context: Tool name, URL, tier, last crawl date
- ✓ Plans table: Name, price, billing period, features at a glance
- ✓ One-click actions: Approve or Reject with optional notes
- ✓ Audit trail: Who approved, when, with what corrections
The interface uses a card-based design where each monitor is expandable:
// Simplified component structure
function BaselineCard({ monitor }: { monitor: Monitor }) {
  const [expanded, setExpanded] = useState(false);

  return (
    <div className="border rounded-lg p-4">
      <div className="flex justify-between items-center">
        <div>
          <h3>{monitor.name}</h3>
          <span className={statusBadge}>{monitor.baselineStatus}</span>
        </div>
        <button onClick={() => setExpanded(!expanded)}>
          {expanded ? 'Collapse' : 'Expand'}
        </button>
      </div>
      {expanded && (
        <>
          <PlansTable plans={monitor.candidateBaseline.plans} />
          <div className="flex gap-2 mt-4">
            <button onClick={handleApprove}>✓ Approve</button>
            <button onClick={handleReject}>✗ Reject</button>
          </div>
        </>
      )}
    </div>
  );
}
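The handleApprove and handleReject callbacks are omitted above for brevity. One plausible shape is a single POST to an admin endpoint that records the decision; the route and payload below are hypothetical, not our actual API:
// Hypothetical review handler; the endpoint and payload shape are assumptions
async function submitReview(
  monitorId: string,
  decision: "approved" | "rejected",
  notes?: string
): Promise<void> {
  const res = await fetch(`/api/admin/baselines/${monitorId}/review`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ decision, notes }),
  });
  if (!res.ok) {
    throw new Error(`Review failed with status ${res.status}`);
  }
}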
6. Database Schema Design
The audit trail is critical for debugging and compliance. We track every review decision:
// Drizzle ORM schema
import { pgTable, uuid, varchar, jsonb, text, timestamp } from "drizzle-orm/pg-core";

export const baselineReviews = pgTable("baseline_reviews", {
  id: uuid("id").defaultRandom().primaryKey(),
  snapshotId: uuid("snapshot_id").notNull(),
  monitorId: uuid("monitor_id").notNull(),
  reviewerEmail: varchar("reviewer_email", { length: 255 }).notNull(),
  status: varchar("status", { length: 50 }).notNull(), // approved | rejected
  originalPriceData: jsonb("original_price_data"),
  correctedPriceData: jsonb("corrected_price_data"), // If admin made corrections
  notes: text("notes"),
  reviewedAt: timestamp("reviewed_at").defaultNow(),
});

// Monitor table additions
export const monitors = pgTable("monitors", {
  // ... existing fields
  baselineSnapshotId: uuid("baseline_snapshot_id"),
  baselineStatus: varchar("baseline_status", { length: 50 })
    .default("pending"), // pending | verified | needs_review
});
The correctedPriceData field is key: when an admin rejects and corrects data, we store their corrections. This creates training data for improving the AI extractor.
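On the write side, an approval touches both tables: a review row is inserted and the monitor's baseline is promoted to verified. Roughly, with Drizzle (a sketch that assumes a configured db instance and an id column among the monitor's existing fields):
// Sketch of persisting an approval; `db` setup is assumed
import { eq } from "drizzle-orm";

async function approveBaseline(snapshotId: string, monitorId: string, reviewerEmail: string) {
  await db.transaction(async (tx) => {
    // Record the decision in the audit trail
    await tx.insert(baselineReviews).values({
      snapshotId,
      monitorId,
      reviewerEmail,
      status: "approved",
    });
    // Promote the snapshot to the verified baseline for this monitor
    await tx
      .update(monitors)
      .set({ baselineSnapshotId: snapshotId, baselineStatus: "verified" })
      .where(eq(monitors.id, monitorId));
  });
}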
7. Lessons Learned
What Worked
- Thresholds calibrated from real data: Analyzing 1,900+ snapshots before setting thresholds prevented both false positives and missed changes.
- Speed-optimized admin UI: Pre-loading all context reduced review time from 2+ minutes to 30-60 seconds per tool.
- Audit trail from day one: Every decision is logged, making debugging and pattern analysis possible.
What We'd Do Differently
- Earlier human review: We initially ran 3 months of automated extraction before realizing accuracy issues. Should have started with human review.
- Tighter price normalization: Our first version didn't handle quarterly/weekly periods, causing false positives.
- Batch review mode: Reviewing 290 tools one-by-one is tedious. A "quick approve similar" feature would help.
Key Takeaways
- Automation needs anchors: You can't trust automated comparisons without verified baselines.
- Thresholds are not universal: Calibrate from YOUR data, not industry assumptions.
- UX for internal tools matters: Admin interfaces deserve design attention—your team uses them daily.
- Audit trails are gold: Logged corrections become training data for improving extraction.
Try It Yourself
The baseline verification system powers SaaS Price Pulse's pricing intelligence. We track 290+ tools with 95%+ accuracy, catching real price changes while filtering false positives.

