Anthropic Accuses Alibaba of Illicitly Extracting Claude AI Model Capabilities: A Technical Analysis of Model Theft and Its Implications — AI News

Background & Context§

On June 24, 2025, Anthropic—creator of the Claude series of large language models—filed a legal complaint alleging that Alibaba Group systematically extracted the capabilities of Claude models through unauthorized API access. The accusation centers on a technique known as model extraction, where an adversary uses queries to a deployed model to reconstruct a functionally equivalent surrogate. This event underscores the growing tension between AI leaders and the rising threat of intellectual property theft in the rapidly commoditizing LLM landscape.

Anthropic’s Claude models are among the most advanced proprietary LLMs, leveraging constitutional AI and reinforcement learning from human feedback (RLHF) to achieve high alignment and reasoning capabilities. Alibaba, in turn, develops its own LLM family (Qwen) and has been investing heavily in generative AI. The complaint suggests that Alibaba used a distributed network of accounts to send millions of carefully crafted prompts to Claude’s API, harvesting both outputs and, importantly, log-probability vectors to reverse-engineer model internals. This is not a simple theft of output text; it is a systematic effort to replicate the model's learned representations.

Technical Deep-Dive§

Model extraction attacks exploit the fact that any publicly accessible ML model can be queried to create an input-output dataset. For generative LLMs, the attacker can further extract richer signals such as token probabilities, top-k logits, or hidden state embeddings if the API exposes them. While Anthropic likely limits such details, even raw output text can be leveraged with techniques like knowledge distillation or synthetic data generation.

Anatomy of an Extraction Attack§

A state-of-the-art extraction pipeline typically involves:

1. Data Harvesting: Send a diverse set of prompts covering various domains and difficulty levels. Use strategies like prompt perturbation, synonym substitution, and multi-turn conversations to elicit comprehensive coverage of the model's reasoning paths. 2. Response Collection: For each prompt, collect multiple completions (e.g., with varying temperature) to capture distributional information. Store not only the generated text but also any API-provided token-level probabilities. 3. Surrogate Training: Use the collected (prompt, completion) pairs to fine-tune a base model (e.g., LLaMA, Qwen) via supervised learning. Optionally, apply distillation by using the teacher’s logits as soft labels. 4. Iterative Refinement: Use the surrogate to generate new prompts that target areas of high teacher-student discrepancy, then query the teacher again to annotate them (active learning).

A simplified Python snippet illustrating the query step (using a hypothetical API) might look like:

import anthropic
import concurrent.futures

client = anthropic.Anthropic(api_key="<target_key>")

def query(prompt):
    response = client.completions.create(
        model="claude-3-opus-20240229",
        prompt=prompt,
        max_tokens=1024,
        temperature=0.7,
        logprobs=True,  # If available
    )
    return response.completion, response.logprobs

prompts = generate_diverse_prompts(n=100000)
with concurrent.futures.ThreadPoolExecutor(max_workers=64) as executor:
    results = list(executor.map(query, prompts))

This attack, if scaled to millions of queries, can yield enough data to train a surprisingly capable surrogate. Research [Tramer et al., 2016] shows that even a large model’s decision boundary can be extracted with polynomial query complexity. For transformers, the required number of queries scales with the model's dimension and depth.

Claude’s Defenses and Potential Exploitation§

Anthropic’s API likely implements several countermeasures:

Rate limiting per account.
Logit suppression (limiting access to probabilities).
Anomaly detection on query patterns (e.g., excessive repetition, low entropy prompts).

However, attackers can circumvent these by rotating accounts, using residential proxies, and shaping query distributions to mimic normal traffic. The fact that Anthropic publicly identified Alibaba suggests that Alibaba’s operation was either exceptionally large or left identifiable fingerprints—perhaps through specific system prompt structures or batch request fingerprints.

Claude’s architecture itself is not fully open, but it is known to be a transformer with around 100B-200B parameters (for Claude 3 Opus), trained on a mixture of filtered web data and RLHF. Constitutional AI adds a layer of behavioral constraints that are particularly hard to extract because they manifest only in specific edge cases. An adversary might need to craft adversarial prompts to elicit those constraints, which is computationally expensive.

Likely Alibaba’s Approach§

Given Alibaba’s access to large compute clusters and its own Qwen series, a plausible strategy is:

1. Query Claude via API to generate a synthetic instruction-following dataset covering diverse tasks (reasoning, coding, creative writing). 2. Fine-tune Qwen on this dataset using supervised fine-tuning (SFT) and possibly RLHF with a reward model trained on Claude’s outputs. 3. Distill further by matching logits on a fixed set of inputs, generating a model that mimics Claude’s style and factual knowledge.

This process is accelerated by techniques like LoRA (low-rank adaptation) to quickly adapt Qwen to Claude’s behavior. The cost of such an extraction is orders of magnitude lower than training Claude from scratch.

Cost & Resource Analysis§

API Query Costs§

To extract a model the size of Claude 3 Opus, an adversary needs a sufficient number of high-quality demonstrations. Estimates from the literature (e.g., in the context of OpenAI’s GPT-4) suggest that to replicate core reasoning abilities, one might need on the order of 10 million to 100 million queries, each generating on average 500 tokens.

Claude’s API pricing (as of 2025) is approximately:

Input: $15 per million tokens
Output: $75 per million tokens

For 100 million queries with average 500 output tokens, total output tokens = 50 billion → cost = 50,000 * $75 = $3.75 million. Input tokens would be much smaller if prompts are short (say 50 tokens each → 5 billion tokens → $75,000). So the total cost ranges from $1M to $5M, depending on query efficiency. This is a bargain compared to training a frontier model from scratch, which costs $100M–$500M for compute alone.

Training Resource Efficiency§

Training a surrogate like Qwen-72B on 50 billion tokens extracted from Claude would require roughly:

GPU cluster: 2,048 A100s (or equivalent H100s) for ~2 weeks.
Cost: ~$2–$4 million in cloud compute (at $2–4 per A100 hour).

Thus total extraction cost (queries + training) might be under $10M, while developing Claude-level capabilities from scratch would exceed $200M in R&D, including data curation, red teaming, and RLHF iterations.

Inference Latency and Serving§

A surrogate model trained via distillation may not be as efficient as the original, but fine-tuning Qwen on Claude outputs can yield similar performance on many benchmarks. The inference latency of Qwen-72B on an H100 is around 30 tokens/second, comparable to Claude’s reported latency. Thus extraction can achieve near-parity at a fraction of the upfront cost.

Developer & Pipeline Implications§

For Model Providers§

This incident calls for a rethinking of API security. Developers should consider:

Watermarking: Embed invisible statistical watermarks in model outputs (e.g., via token distribution skew) that can be detected in suspect surrogate models.
Query auditing: Use ML-based anomaly detection to flag potential extraction campaigns. For instance, measure the diversity of prompt lengths, topics, and the frequency of rare tokens.
Rate limiting per IP/account with cohort analysis to catch distributed attacks.
Differential privacy (DP): Train with DP-SGD to bound memorization. However, DP may degrade utility for high-capability models.

For Downstream Developers§

Developers integrating LLMs into production should be cautious about exposing too much information through APIs. If you are building on third-party models, be aware that your application might be used as a vector for extraction if you cache outputs or expose probabilities. Implement internal rate limiting and log all unusual query patterns.

Pipeline Adjustments§

Model extraction changes the calculus for choosing between open-source and proprietary models. Proprietary models now carry a risk of being imitated by competitors, potentially eroding their competitive advantage. This may accelerate the trend toward:

Self-hosted models where full access is gated by NDAs.
Entangled model releases where inference requires hardware-specific enclaves.
Legal contracts with rigorous audit rights.

In production, consider using models with built-in extraction resistance, such as those trained with adversarial robustness techniques, or deploy ensembles that make extraction harder.

Takeaways & Outlook§

Extraction is economically viable: For less than 5% of the cost of developing a frontier model, a determined adversary can replicate core capabilities.
Defenses must evolve: Anomaly detection, watermarking, and legal deterrence are necessary but not sufficient. Technical solutions like randomized response (adding noise to logits) can make extraction less precise, but they may degrade user experience.
Legal precedent: This case could set a landmark for intellectual property protection in AI, clarifying that unauthorized systematic querying for model training constitutes theft.
Open-source implications: If proprietary models are vulnerable, open-source models might become more attractive despite their lower starting performance, because they can be improved with community contributions without legal risk.

For practitioners, the key actions are: monitor your API usage statistics, implement behavioral tracking for suspicious activity, and keep abreast of new defense techniques like model splitting or trusted execution environments. The era of open-ended model sharing via APIs may be ending, giving way to more secure, audited access patterns.