
How We Built a Human-in-the-Loop Baseline Verification System for Pricing Data

A technical deep-dive into our baseline verification system that combines automated anomaly detection with human review to ensure 95%+ pricing data accuracy across 290+ SaaS tools.

SaaS Price Pulse Team · December 7, 2025 · 12 min read · Engineering
#engineering #data-quality #ai #verification #technical

When you're tracking pricing data for 290+ SaaS tools, accuracy isn't optional—it's existential. A single false positive (claiming Notion raised prices when they didn't) destroys user trust. Here's how we built a verification system that combines AI extraction with human oversight to achieve 95%+ accuracy.

📊 System Performance

  • 290+ tools monitored
  • 95%+ data accuracy
  • 50% anomaly threshold
  • 30-60s review time per tool

1. The Problem: AI Extraction Isn't Perfect

Our initial approach was straightforward: use Gemini Pro to extract pricing data from HTML, store it, compare against previous extractions, flag changes. Simple, right?

After analyzing 1,900+ snapshots, we discovered several failure modes:

  • Billing period confusion: AI extracting "$48/year" as "$48/month" (12x error)
  • Promotional pricing: Capturing sale prices instead of regular prices
  • Duplicate plans: Same plan extracted multiple times with different names
  • Missing context: Enterprise "Contact Sales" interpreted as $0

Real Example: Our system once flagged Notion as having a "+1040% price increase" when the actual change was from €10/month to €114/year, essentially the same price quoted on a different billing period. This would have been a catastrophic false positive if published.

The core insight: you can't automate what you can't verify. We needed a way to establish a trusted "ground truth" before automation could take over.

2. The Baseline Concept

A baseline is the first verified snapshot for each tool. Think of it as the "anchor point" that all future comparisons reference. The key principle:

"The first extraction MUST be human-verified. After that, automation can detect deviations."

Here's the workflow:

  1. First crawl: AI extracts pricing data → stored as "candidate baseline"
  2. Human review: Admin reviews extraction against actual pricing page
  3. Approval/Rejection: If correct, baseline is approved. If wrong, admin corrects it.
  4. Automation activates: Future crawls compared against verified baseline
  5. Anomaly flagging: Significant deviations trigger new human review

This creates a "trust chain" where automated decisions inherit confidence from human-verified foundations.

3. Anomaly Detection Thresholds

Not every difference is an anomaly. Pricing pages change formatting, add/remove trial badges, update copyright years. We needed thresholds that catch real changes while ignoring noise.

After analyzing historical data, we settled on these thresholds:

Threshold        | Value | Rationale
Price change     | >50%  | Most real price changes are 10-30%. A 50%+ swing usually indicates an extraction error.
New plans        | >3    | Companies rarely add 4+ plans at once. Often indicates duplicate extraction.
Missing plans    | >2    | Removing 3+ plans is rare. Usually means an incomplete page load.
Confidence score | <0.5  | Below 50% confidence, human review is mandatory.

Here's the actual implementation:

const ANOMALY_THRESHOLDS = {
  PRICE_CHANGE_PERCENT: 50,  // >50% change is suspicious
  MAX_NEW_PLANS: 3,          // More than 3 new plans is suspicious
  MAX_MISSING_PLANS: 2,      // More than 2 missing plans is suspicious
  MIN_CONFIDENCE: 0.5,       // Below this, flag as anomaly
};
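
To make the thresholds concrete, here's a sketch of how they might be applied when a new crawl is compared against the baseline. It reuses the helpers shown later in this post (normalizePlanName, findMatchingPlan, calculatePriceChange); treat it as illustrative rather than a verbatim excerpt:

function detectAnomalies(baseline: PlanData[], current: PlanData[]): string[] {
  const anomalies: string[] = [];

  // Too many new plans often means duplicate extraction
  const newPlans = current.filter(
    c => !baseline.some(b => normalizePlanName(b.name) === normalizePlanName(c.name))
  );
  if (newPlans.length > ANOMALY_THRESHOLDS.MAX_NEW_PLANS) {
    anomalies.push(`${newPlans.length} new plans detected`);
  }

  // Too many missing plans often means an incomplete page load
  const missingPlans = baseline.filter(
    b => !current.some(c => normalizePlanName(c.name) === normalizePlanName(b.name))
  );
  if (missingPlans.length > ANOMALY_THRESHOLDS.MAX_MISSING_PLANS) {
    anomalies.push(`${missingPlans.length} plans missing`);
  }

  // Large price swings on matched plans are usually extraction errors
  for (const basePlan of baseline) {
    const match = findMatchingPlan(current, basePlan.name);
    if (match) {
      const change = calculatePriceChange(basePlan.price, match.price);
      if (Math.abs(change) > ANOMALY_THRESHOLDS.PRICE_CHANGE_PERCENT) {
        anomalies.push(`${basePlan.name}: ${change.toFixed(0)}% price change`);
      }
    }
  }

  return anomalies;
}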

4. Confidence Scoring Algorithm

The confidence score quantifies how much we trust an extraction compared to the baseline. It starts at 1.0 (100%) and gets penalized for various issues:

Issue                  | Penalty         | Example
Plan count difference  | -10% per plan   | Baseline has 4 plans, new has 6 → -20%
Missing plan           | -15% per plan   | "Enterprise" plan disappeared → -15%
Anomalous price change | -20% per change | Pro: $10 → $100 (900% change) → -20%

The algorithm in code:

function calculateConfidence(
  baseline: PlanData[],
  current: PlanData[]
): number {
  let confidence = 1.0;

  // Penalize plan count differences
  const planDiff = Math.abs(baseline.length - current.length);
  confidence -= planDiff * 0.10;

  // Penalize missing plans
  const missingPlans = baseline.filter(
    b => !current.some(c => normalizePlanName(c.name) === normalizePlanName(b.name))
  );
  confidence -= missingPlans.length * 0.15;

  // Penalize anomalous price changes
  for (const basePlan of baseline) {
    const match = findMatchingPlan(current, basePlan.name);
    if (match) {
      const change = calculatePriceChange(basePlan.price, match.price);
      if (Math.abs(change) > ANOMALY_THRESHOLDS.PRICE_CHANGE_PERCENT) {
        confidence -= 0.20;
      }
    }
  }

  return Math.max(0, Math.min(1, confidence));
}
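
Two helpers referenced above, findMatchingPlan and calculatePriceChange, aren't shown in full in this post. Plausible versions, built on the normalization helpers that follow, could look like this (illustrative sketches, not necessarily our exact production code):

function findMatchingPlan(
  plans: PlanData[],
  name: string
): PlanData | undefined {
  const target = normalizePlanName(name);
  return plans.find(p => normalizePlanName(p.name) === target);
}

function calculatePriceChange(oldPrice: number, newPrice: number): number {
  // Prices are assumed to already be normalized to a monthly equivalent
  // (see normalizeToMonthly below).
  if (oldPrice === 0) return newPrice === 0 ? 0 : Infinity;
  return ((newPrice - oldPrice) / oldPrice) * 100;
}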

Plan Name Normalization

Comparing plan names requires fuzzy matching. "Professional Plan" and "Professional" should match. We normalize by removing common suffixes:

function normalizePlanName(name: string): string {
  return name
    .toLowerCase()
    .replace(/\s*(plan|tier|package|edition)\s*/gi, '')
    .replace(/\s+/g, ' ')
    .trim();
}

// "Professional Plan" → "professional"
// "Enterprise Tier" → "enterprise"
// "Business Package" → "business"

Price Normalization

Different tools quote prices in different periods. We normalize everything to monthly:

function normalizeToMonthly(price: number, period: string): number {
  const periodLower = period.toLowerCase();

  if (periodLower.includes('year') || periodLower.includes('annual')) {
    return price / 12;
  }
  if (periodLower.includes('quarter')) {
    return price / 3;
  }
  if (periodLower.includes('week')) {
    return price * 4.33; // Average weeks per month
  }

  return price; // Assume monthly
}
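
Applied to the Notion incident from section 1, this normalization is exactly what defuses the false positive:

// €114/year vs €10/month, from the earlier Notion example
normalizeToMonthly(114, 'per year');   // → 9.5
normalizeToMonthly(10, 'per month');   // → 10
// Comparing 9.5 against 10 is roughly a 5% drop, far below the 50%
// threshold, instead of a spurious "+1040% increase".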

5. The Admin Review Interface

Speed matters for human review. With 290 tools, even 2-minute reviews would take 9+ hours. We built an admin interface optimized for rapid decision-making:

Admin UI Features:

  • Status tabs: Pending / Verified / Needs Review with counts
  • Pre-loaded context: Tool name, URL, tier, last crawl date
  • Plans table: Name, price, billing period, features at a glance
  • One-click actions: Approve or Reject with optional notes
  • Audit trail: Who approved, when, with what corrections

The interface uses a card-based design where each monitor is expandable:

// Simplified component structure
function BaselineCard({ monitor }: { monitor: Monitor }) {
  const [expanded, setExpanded] = useState(false);

  return (
    <div className="border rounded-lg p-4">
      <div className="flex justify-between items-center">
        <div>
          <h3>{monitor.name}</h3>
          <span className={statusBadge}>{monitor.baselineStatus}</span>
        </div>
        <button onClick={() => setExpanded(!expanded)}>
          {expanded ? 'Collapse' : 'Expand'}
        </button>
      </div>

      {expanded && (
        <>
          <PlansTable plans={monitor.candidateBaseline.plans} />
          <div className="flex gap-2 mt-4">
            <button onClick={handleApprove}>✓ Approve</button>
            <button onClick={handleReject}>✗ Reject</button>
          </div>
        </>
      )}
    </div>
  );
}
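
The handleApprove and handleReject handlers aren't shown; conceptually they just call an admin API route. Here's a sketch of what an approve call might look like (the route path and payload shape are assumptions, not our actual API):

// Illustrative approve handler; the route and payload are assumed.
async function approveBaseline(monitorId: string, notes?: string) {
  const res = await fetch(`/api/admin/baselines/${monitorId}/approve`, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ notes }),
  });

  if (!res.ok) {
    throw new Error(`Approval failed: ${res.status}`);
  }
  // On success, the monitor's baselineStatus flips to "verified"
  // and future crawls are compared automatically.
}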

6. Database Schema Design

The audit trail is critical for debugging and compliance. We track every review decision:

// Drizzle ORM schema
export const baselineReviews = pgTable("baseline_reviews", {
  id: uuid("id").defaultRandom().primaryKey(),
  snapshotId: uuid("snapshot_id").notNull(),
  monitorId: uuid("monitor_id").notNull(),
  reviewerEmail: varchar("reviewer_email", { length: 255 }).notNull(),
  status: varchar("status", { length: 50 }).notNull(), // approved | rejected
  originalPriceData: jsonb("original_price_data"),
  correctedPriceData: jsonb("corrected_price_data"), // If admin made corrections
  notes: text("notes"),
  reviewedAt: timestamp("reviewed_at").defaultNow(),
});

// Monitor table additions
export const monitors = pgTable("monitors", {
  // ... existing fields
  baselineSnapshotId: uuid("baseline_snapshot_id"),
  baselineStatus: varchar("baseline_status", { length: 50 })
    .default("pending"), // pending | verified | needs_review
});

The correctedPriceData field is key—when an admin rejects and corrects data, we store their corrections. This creates training data for improving the AI extractor.
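
To illustrate, a rejection with corrections could be recorded roughly like this. The db import path and surrounding wiring are assumptions; only the schema above comes from our codebase:

import { eq } from "drizzle-orm";
import { db } from "@/lib/db"; // hypothetical path to the Drizzle client

async function recordRejection(
  monitorId: string,
  snapshotId: string,
  reviewerEmail: string,
  original: unknown,
  corrected: unknown,
  notes?: string
) {
  // Log the decision in the audit trail
  await db.insert(baselineReviews).values({
    monitorId,
    snapshotId,
    reviewerEmail,
    status: "rejected",
    originalPriceData: original,
    correctedPriceData: corrected,
    notes,
  });

  // Send the monitor back into the review queue
  await db
    .update(monitors)
    .set({ baselineStatus: "needs_review" })
    .where(eq(monitors.id, monitorId));
}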

7. Lessons Learned

What Worked

  • Thresholds calibrated from real data: Analyzing 1,900+ snapshots before setting thresholds prevented both false positives and missed changes.
  • Speed-optimized admin UI: Pre-loading all context reduced review time from 2+ minutes to 30-60 seconds per tool.
  • Audit trail from day one: Every decision is logged, making debugging and pattern analysis possible.

What We'd Do Differently

  • Earlier human review: We initially ran 3 months of automated extraction before realizing accuracy issues. Should have started with human review.
  • Tighter price normalization: Our first version didn't handle quarterly/weekly periods, causing false positives.
  • Batch review mode: Reviewing 290 tools one-by-one is tedious. A "quick approve similar" feature would help.

Key Takeaways

  1. Automation needs anchors: You can't trust automated comparisons without verified baselines.
  2. Thresholds are not universal: Calibrate from YOUR data, not industry assumptions.
  3. UX for internal tools matters: Admin interfaces deserve design attention—your team uses them daily.
  4. Audit trails are gold: Logged corrections become training data for improving extraction.

Try It Yourself

The baseline verification system powers SaaS Price Pulse's pricing intelligence. We track 290+ tools with 95%+ accuracy, catching real price changes while filtering false positives.

Start tracking competitor pricing →
