Stop LLM Model Extraction: A Practical Defense Playbook Inspired by the 100k‑Prompt Gemini Case

Attackers hammered Gemini with 100,000+ prompts—many in non‑English—trying to distill a cheaper clone. Here’s a hands‑on playbook to recognize extraction patterns, instrument your logs, set effective rate limits, harden outputs (without wrecking UX), and respond fast when you spot a campaign.


The Gemini 100k‑prompt incident is a wake‑up call for LLM builders

Google disclosed that commercially motivated actors attempted to clone Gemini by prompting it more than 100,000 times across non‑English languages—collecting outputs to train a cheaper copycat. Whether you frame it as model extraction or distillation, the core risk is practical: once your model is accessible at scale, determined actors can harvest (prompt, response) pairs and teach a smaller model to mimic your capabilities and style. This playbook focuses on prevention, detection, response, and ethical alternatives, so you can make extraction campaigns noisy, costly, and ultimately uneconomical.

Distillation isn’t new; it’s a widely used technique inside companies and across the open ecosystem. The risk is not that it exists, but that uncontrolled scraping of your model’s outputs—especially reasoning traces and refusal templates—lets competitors replicate your differentiated behavior. The Gemini case highlights common attacker tactics: multi‑language prompts, reasoning solicitation, broad task coverage, and automation tuned to evade naïve rate limits. If you run an LLM service, you need instrumentation, rate limiting, output hardening, and an incident runbook ready before your next “100k‑prompt” day.

How Distillation Attacks Actually Run (So You Can Recognize Them)

The typical extraction workflow is straightforward: attackers script large prompt sets, pull model outputs at scale, then fine‑tune a smaller student model on the harvested pairs. The resulting clone won’t replicate your training corpus or weights, but it can approximate your answer distributions, refusal behavior, and surface style. Imagine ordering every dish, recording the plating and flavor notes, then teaching a junior chef to recreate them—you won’t get the exact recipe, but you’ll get close enough to serve customers.

To maximize training value, attackers target reasoning. Prompts explicitly solicit step‑by‑step explanations (“explain your reasoning,” “show your chain of thought”) or propose to “think aloud.” Long, structured outputs dramatically increase signal for distillation—revealing intermediate steps, safety rubrics, and characteristic patterns like “Overall…” summaries or deterministic refusal openers. If your logs show a high fraction of such queries from a single session, you’re seeing the extraction signal directly.
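
One cheap way to surface this signal is a keyword flag computed at ingestion and logged per request (the contains_reasoning_kw field used later in this playbook). A minimal sketch, with an illustrative English-only pattern list you would need to extend per language:

# Python: flag prompts that solicit reasoning (illustrative keyword list)
import re

REASONING_PATTERNS = [
    r"explain your reasoning",
    r"show your (chain of thought|work|steps)",
    r"think (aloud|out loud|step[ -]by[ -]step)",
    r"step[ -]by[ -]step",
]
REASONING_RE = re.compile("|".join(REASONING_PATTERNS), re.IGNORECASE)

def contains_reasoning_kw(prompt: str) -> bool:
    # Cheap ingestion-time check; feeds the per-session reasoning-rate counters.
    return bool(REASONING_RE.search(prompt))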

Language pivoting is common. Campaigns spread prompts across many non‑English languages to broaden coverage and evade English‑centric heuristics. Expect bursts in Spanish, Hindi, Arabic, Korean, and niche locales—often mixed within the same day. Attackers aim to collect style and policy behavior across linguistic variations so their student model generalizes beyond English.

Coverage strategy matters: prompt corpora span math, code, writing, Q&A, translation, and safety boundary tests. This teaches the clone general behavior, including your refusal style and de‑escalation phrasing. Operationally, automation rotates IPs and API keys, staggers timing, interleaves human‑like prompt variations, and maintains concurrency to slip past simple per‑key rate limits. Recognizing these patterns in your telemetry is the first defense.

Instrument and Detect: Build an Extraction Early‑Warning Pipeline

Before you can stop an attack, you need the right signals in your logs. Capture per‑request tokens in/out, latency, language ID, prompt length, prompt entropy (to detect templated inputs), presence of reasoning keywords, session ID, account/org, IP and ASN, and user agent/device fingerprint. Persist raw prompts for high‑risk sessions, but take privacy into account; at minimum store hashes for similarity analysis. Logging scope flags (e.g., “reasoning_enabled”) helps you draw a bright line around high‑value content.

{
  "ts": "2026-02-14T12:34:56Z",
  "req_id": "uuid-1234",
  "account_id": "acct_789",
  "org_id": "org_456",
  "api_key_id": "key_abcd",
  "ip": "203.0.113.10",
  "asn": 64496,
  "ua": "sdk/1.2.3",
  "device_fp": "hash_efgh",
  "model": "chat-reasoner-1",
  "endpoint": "/v1/chat/completions",
  "lang_id": "es",
  "prompt_len_tokens": 112,
  "prompt_entropy": 3.21,
  "contains_reasoning_kw": true,
  "output_len_tokens": 672,
  "latency_ms": 932,
  "scopes": ["chat", "reasoning"],
  "risk_score": 0.12
}
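
The prompt_entropy field above can be computed cheaply at ingestion. One minimal option (an assumption, not the only choice) is character-level Shannon entropy, which is usually enough to separate templated, scripted prompts from organic ones; token- or n-gram-level entropy works the same way:

# Python: character-level Shannon entropy for the prompt_entropy field (one option)
import math
from collections import Counter

def prompt_entropy(text: str) -> float:
    if not text:
        return 0.0
    counts = Counter(text)
    total = len(text)
    # Bits per character; heavily templated prompts trend low and stay low across a session.
    return -sum((c / total) * math.log2(c / total) for c in counts.values())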

Sessionize across signals: stitch requests by API key + IP + device fingerprint over rolling windows (e.g., 24–72 hours) to spot long‑running, high‑volume sessions masked by concurrency. Strong sessionization defeats the attacker’s favorite trick: distributing traffic across many keys and IPs to stay under per‑key limits. Maintain counters for prompts/day, languages used, reasoning keyword fraction, and average output length per session.
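
Sessionization is essentially connected components over identifiers: treat each API key, IP, and device fingerprint as a node and link any that appear together on a request. A minimal in-memory sketch, assuming records shaped like the log example above (in production you would back this with Redis or a stream processor and expire identifiers that fall outside the 24–72 hour window):

# Python: link requests that share an API key, IP, or device fingerprint (union-find sketch)
class SessionLinker:
    def __init__(self):
        self.parent = {}  # identifier -> representative

    def _find(self, x):
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def _union(self, a, b):
        ra, rb = self._find(a), self._find(b)
        if ra != rb:
            self.parent[rb] = ra

    def session_id(self, rec):
        # Any shared key, IP, or fingerprint pulls records into the same session,
        # so rotating one identifier alone does not start a "fresh" session.
        ids = [("key", rec["api_key_id"]), ("ip", rec["ip"]), ("fp", rec["device_fp"])]
        for other in ids[1:]:
            self._union(ids[0], other)
        return self._find(ids[0])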

Track pattern features that correlate with extraction: an abnormal ratio of reasoning requests; rapid task switching across domains; unusually low prompt entropy indicating templated scripts; and multilingual spikes within the same hour. Practical heuristics: flag accounts with (a) >X prompts/day AND (b) >Y languages used AND (c) reasoning keyword rate > Z% over 24–72 hours. Choose X/Y/Z based on your baseline, but the combination (volume + multilingual + reasoning bias) is robust.

-- Example: flag likely extraction sessions in SQL (Postgres)
WITH session_stats AS (
  SELECT
    account_id,
    session_id,
    COUNT(*) AS prompt_count,
    COUNT(DISTINCT lang_id) AS languages_used,
    AVG(output_len_tokens) AS avg_out_tokens,
    100.0 * SUM(CASE WHEN contains_reasoning_kw THEN 1 ELSE 0 END)
      / COUNT(*) AS reasoning_rate_pct
  FROM requests
  WHERE ts > NOW() - INTERVAL '72 hours'
  GROUP BY account_id, session_id
)
SELECT *
FROM session_stats
WHERE prompt_count >= 5000
  AND languages_used >= 8
  AND reasoning_rate_pct >= 30;

Similarity clustering closes another gap. Compute prompt embeddings and look for high volumes of near‑duplicate prompts with minor suffix/prefix changes—classic scripted extraction. An additional output‑side check: if a client repeatedly solicits long, structured explanations or exact refusal wording, raise their risk score; cloners want your style distribution, not just answers.

# Python: cluster near-duplicate prompts using sentence-transformers + DBSCAN
from sentence_transformers import SentenceTransformer
from sklearn.cluster import DBSCAN
import numpy as np

model = SentenceTransformer("all-MiniLM-L6-v2")
prompts = [p["text"] for p in recent_requests]  # last 72h of prompts
emb = model.encode(prompts, normalize_embeddings=True)
clusters = DBSCAN(eps=0.15, min_samples=20, metric="cosine").fit(emb)

# Flag sessions with large clusters of near-duplicates
cluster_sizes = np.bincount(clusters.labels_[clusters.labels_ >= 0])
suspicious_clusters = [i for i, sz in enumerate(cluster_sizes) if sz >= 200]

Rate Limiting and Access Controls That Actually Slow Cloners

Set multi‑tier limits: enforce burst (e.g., 2–5 requests per second) and sustained quotas (e.g., 100–500 requests per minute) per API key, IP, and org. Combine these with regional and IP reputation throttles—certain ASNs are high risk for automation. Cloners bank on concurrency; layered controls break their economics by forcing slower harvests.

# NGINX: burst + sustained rate limiting per API key
limit_req_zone $http_x_api_key zone=keyburst:10m rate=5r/s;
limit_req_zone $http_x_api_key zone=keyrpm:10m rate=300r/m;

server {
  location /v1/ {
    limit_req zone=keyburst burst=10 nodelay;
    limit_req zone=keyrpm;
  }
}

Introduce dynamic trust tiers. Require KYC or verified billing for higher quotas, and gate multilingual/high‑token reasoning usage behind elevated tiers with stricter monitoring. Distillation needs long outputs—cap output tokens per minute/day. For example, keep default max_output_tokens at 256–512 and offer paid upgrades for 2k–4k token outputs with additional oversight. This reduces training value and raises attacker costs without crippling normal use.

// Pseudo: per-minute output-token budget enforcement with Redis
// Assumes the budget is seeded each minute elsewhere, e.g. SET budgetKey <quota> EX 60.
const budgetKey = `tok:${apiKey}:${minuteBucket}`;
const remaining = await redis.decrby(budgetKey, outputTokens);
if (remaining < 0) {
  await redis.incrby(budgetKey, outputTokens); // roll back the overdraft
  return http429("Token budget exceeded. Please wait or upgrade your plan.");
}

Use rotating, scoped keys. Issue short‑lived API keys bound to org + IP ranges or mTLS identities; scope sensitive features (e.g., "reasoning-enabled", "json-structured-only") separately. On anomalies, revoke or rotate affected keys, audit usage by scope, and add friction: step‑up challenges (phone/email re‑verification, business validation), a temporary downgrade to concise answers, or a human support touchpoint to lift caps.
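
One way to implement short-lived, scoped keys is to mint signed tokens whose claims carry the org, allowed scopes, and a bound network range. The claim names and the HS256/pyjwt choice below are illustrative assumptions, not a prescribed format:

# Python: mint a short-lived, scoped API credential (illustrative JWT sketch)
import time
import jwt  # pyjwt

SIGNING_KEY = "replace-with-kms-backed-secret"  # assumption: a managed, rotated secret

def mint_key(org_id: str, scopes: list[str], cidr: str, ttl_s: int = 3600) -> str:
    now = int(time.time())
    claims = {
        "org": org_id,
        "scopes": scopes,   # e.g., ["chat"] by default; "reasoning-enabled" only for vetted tiers
        "cidr": cidr,       # bind to the caller's declared IP range
        "iat": now,
        "exp": now + ttl_s, # short-lived; rotation is forced by expiry
    }
    return jwt.encode(claims, SIGNING_KEY, algorithm="HS256")

The gateway verifies the signature, checks expiry, matches the caller's IP against the bound range, and only then honors scopes like "reasoning-enabled".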

Don’t neglect web UI protections. Apply bot challenges and session binding; limit copy/export volume (CSV/JSON dumps) and throttle clipboard and automation endpoints. Bind sessions to device fingerprints and rotate CSRF tokens aggressively. If your UI supports bulk export, fence it behind verified business tiers with logging and quotas.
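
For the export fencing specifically, a per-session daily row budget is often enough. A rough sketch with assumed limits and an in-memory counter standing in for whatever store you already use:

# Python: cap bulk-export volume per UI session per day (sketch; swap the dict for Redis)
from datetime import date
from collections import defaultdict

EXPORT_ROW_LIMIT_PER_DAY = 5000  # assumption: tune to your verified-tier policy
_export_counts = defaultdict(int)

def allow_export(session_id: str, rows_requested: int) -> bool:
    bucket = (session_id, date.today().isoformat())
    if _export_counts[bucket] + rows_requested > EXPORT_ROW_LIMIT_PER_DAY:
        return False  # deny, log the attempt, and surface a "contact support / upgrade" path
    _export_counts[bucket] += rows_requested
    return True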

Harden Outputs: Reduce Training Value Without Ruining UX

First, stop leaking chain‑of‑thought by default. Provide final answers and short rationales; gate detailed step‑by‑step reasoning behind enterprise scopes with strict logging and quotas. You can still satisfy genuine use cases (audited users, research partners) while drastically cutting distillation signal.

// Pseudo: policy wrapper that suppresses chain-of-thought
if (!scopes.includes("reasoning-enabled")) {
  response = summarize_to_final_answer(model_output);
}

Randomize surface phrasing per account. Rotate refusal templates and synonym sets, seeded by account ID, so cloners can’t capture a stable, reusable style distribution. Vary sentence openers, hedge phrases, and closing summaries. The goal isn’t to be chaotic; it’s to break determinism in ways that don’t harm user comprehension.

// Example refusal templates
const templates = [
  "I can’t assist with that request.",
  "Sorry, I’m not able to help with that.",
  "I’m unable to provide that information."
];

function refusal(accountId) {
  const i = hash(accountId + dayOfYear()) % templates.length;
  return templates[i];
}

Prefer structure over prose when possible. For tool use, citations, and data extraction, return function calls or JSON schemas. Structured responses are excellent for user integration but lower the value for style/behavior cloning because they reduce the richness of prose where your signature lives.

{
  "answer": "Paris",
  "confidence": 0.92,
  "evidence": [
    {"source": "encyclopedia", "page": "France#Capital"}
  ]
}

Consider semantic watermarking for long text. Embed statistically detectable patterns in token selection (e.g., a rotating green‑list seeded by a secret per account) that let you later attribute training leakage. Rotate keys regularly to avoid easy removal. Watermarks should be subtle enough not to impact readability but strong enough to survive mild paraphrasing.

// Pseudo: green-list watermark per account
function chooseToken(candidates, accountId, position) {
  const seed = hmac(secret, accountId + ":" + position);
  const green = candidates.filter(t => inGreenList(t, seed));
  return pickPreferentially(green, candidates);
}

Use canary tokens and phrasing. Inject benign signature phrases or formatting markers per account—visible (subtle) or invisible (e.g., zero‑width spaces)—to trace if your outputs appear in another model’s training data. For JSON, add a nonfunctional field like "x_source":"acct_789_canary" for enterprise accounts. Canary rotation after suspected campaigns invalidates collected style data and helps attribution.
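
A minimal sketch of that JSON canary, injected per account; the monthly rotation window and hashing scheme are assumptions you would adapt, and post-incident rotation is then just a matter of moving the window:

# Python: add a rotating per-account canary field to JSON responses (sketch)
import hashlib
import json
from datetime import date

def canary_value(account_id: str, secret: str) -> str:
    # Rotate monthly so a suspected campaign can be invalidated by advancing the window.
    window = date.today().strftime("%Y-%m")
    digest = hashlib.sha256(f"{secret}:{account_id}:{window}".encode()).hexdigest()[:12]
    return f"{account_id}_canary_{digest}"

def tag_response(payload: dict, account_id: str, secret: str) -> str:
    payload["x_source"] = canary_value(account_id, secret)  # nonfunctional marker field
    return json.dumps(payload)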

Finally, shape safety prompts carefully. Avoid disclosing policy rubrics verbatim. Vary refusal rationales and avoid deterministic sentence openers that are easy to distill. If your policies require a specific rubric, use templated variation and keep the rubric private, not echoed to end users.

Response Playbook, Cost Math, and Ethical Alternatives

When you detect an extraction campaign, triage decisively. Throttle the session (HTTP 429 with decaying limits), require re‑verification, rotate keys and revoke the “reasoning” scope, and snapshot logs for forensic evidence. Preserve request/response bodies or hashes for suspect sessions, along with IP/ASN metadata and device fingerprints. Alert your trust & safety and legal teams; extraction at scale often violates ToS and may warrant formal action.

// Step-down throttling: drop high-risk keys from the default 300 RPM to 30 RPM
if (riskScore >= 0.8) {
  applyRateLimit(apiKey, burst=2, rpm=30);
  flagForVerification(apiKey);
  scopes.remove("reasoning-enabled");
  http429("Temporarily rate-limited due to unusual activity. Please verify your account.");
}

Attribution matters. Check watermark and canary hits in any public samples you can legally inspect. If your outputs (style or markers) appear in another model’s behavior, document ToS breaches and consult counsel. After incidents, rotate refusal templates, watermark seeds, and safety prompt variants to invalidate collected style data. Treat this as a cache‑busting exercise for your behavior distribution.
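
For the watermark check itself, detection reduces to a proportion test: count how many tokens of a suspect sample land on the green list for the account’s seed and compare against the chance rate. A minimal sketch, assuming a green-list fraction gamma and an is_green() helper that mirrors the generation-side logic from the earlier watermark pseudo:

# Python: test a suspect text for green-list watermark hits (sketch)
import math

def watermark_z_score(tokens, is_green, gamma=0.5):
    # Under "no watermark", each token lands on the green list with probability gamma.
    hits = sum(1 for pos, tok in enumerate(tokens) if is_green(tok, pos))
    n = len(tokens)
    expected = gamma * n
    std = math.sqrt(n * gamma * (1 - gamma))
    return (hits - expected) / std if std > 0 else 0.0

# A z-score of roughly 4+ over a few hundred tokens is strong evidence the sample
# was generated under this account's watermark seed.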

Do the cost math to tune quotas and pricing so large‑scale extraction becomes uneconomical. A simple estimate: total_cost ≈ prompts × (avg_output_tokens / 1000) × price_per_1k_tokens. If a campaign needs 100,000 prompts with 800 output tokens each at $0.50 per 1k tokens, the direct API cost is roughly $40,000. Raising costs on long outputs, adding friction for multilingual reasoning, and tightening default token budgets push attackers toward less valuable harvests or force them to stop.

// Cost estimator
function estimateCost(prompts, avgOutTokens, pricePer1k) {
  return prompts * (avgOutTokens / 1000.0) * pricePer1k;
}
console.log(estimateCost(100000, 800, 0.50)); // 40000

If you need a small model, choose ethical, low‑risk distillation paths. Distill from models you own or from permissively licensed/open‑weights sources where the license permits derivative work. Generate synthetic data with your own larger model and carefully filter it. Use open instruction sets curated by the community (e.g., instruction datasets similar to those used in projects like Alpaca, Dolly, OpenHermes) and public benchmarks. Avoid training on another vendor’s outputs if their ToS prohibit it; apart from legal exposure, you risk inheriting their refusal quirks and style markers.

In practice, many teams build compact models by distilling from their own flagship models (e.g., “mini” versions), or by bootstrapping with synthetic data plus human review. Pair instruction‑following with lightweight RL from human or AI feedback (RLAIF) to refine behavior without scraping a competitor. This preserves capability while respecting licenses and reduces your future risk when others scrutinize your provenance.

Putting It All Together

The Gemini 100k‑prompt case illustrates how fast a determined actor can harvest an LLM’s behavior once it’s accessible. A resilient defense layers telemetry, heuristics, clustering, rate controls, output hardening, and a rehearsed incident response. None of these steps alone is foolproof; together, they push attackers into uncomfortable trade‑offs—higher cost, lower yield, and greater attribution risk.

Adopt the pipeline now: instrument the right signals, sessionize aggressively, flag reasoning‑heavy multilingual surges, throttle intelligently, and treat long‑form reasoning as an enterprise feature with strict quotas. Randomize style, prefer structured outputs, and watermark responsibly. Have your runbook ready with step‑up verification and rapid signature rotation. And if you need distillation, do it with models and datasets you’re allowed to use. That’s how you stop model extraction from turning your work into someone else’s shortcut.

Tags: LLM security, model extraction, distillation, rate limiting, watermarking

Written by Tharun P Karun, Full-Stack Engineer & AI Enthusiast. Writing tutorials, reviews, and lessons learned.

Published February 14, 2026