Gemini 3.1 Pro: Google DeepMind's New Model Doubles ARC-AGI Score with 1M Context Window
Google DeepMind's Gemini 3.1 Pro scored 77.1% on ARC-AGI-2, more than double its predecessor. The new model features a 1M token context window and leads on 13 of 16 benchmarks at unchanged pricing.

Google DeepMind has released Gemini 3.1 Pro, and it puts Google back at the top of the AI benchmark charts. The model scored 77.1% on ARC-AGI-2, a test of pure logic and novel problem-solving, more than doubling its predecessor's score.
Released on February 19, 2026, Gemini 3.1 Pro leads on 13 of 16 major benchmarks while keeping pricing identical to Gemini 3 Pro. For developers and enterprises building agentic systems, this is now the strongest general-purpose model available.
The Benchmark Breakthrough
The headline number is 77.1% on ARC-AGI-2. This benchmark matters because models cannot memorize their way through it. The test requires genuine reasoning on novel problems, making it one of the hardest evaluations in AI.
Gemini 3 Pro scored around 35% on ARC-AGI-2. Gemini 3.1 Pro more than doubled that in a single generation.
On GPQA Diamond, which tests expert-level scientific knowledge across biology, physics, and chemistry, Gemini 3.1 Pro hit 94.3%. That puts it ahead of both Claude Opus 4.6 and GPT-5.2 on scientific reasoning.
The 1M Token Context Window
Gemini 3.1 Pro ships with a 1 million token context window. This allows the model to process roughly 750,000 words in a single conversation, equivalent to about 10 full-length novels.
For enterprises, this means:
- Large codebases: Analyze entire repositories without chunking
- Legal documents: Process complete contracts and case files
- Research papers: Synthesize hundreds of pages of technical content
- Customer data: Maintain full conversation history for personalized agents
The 1M context window matches what Anthropic offers with Claude Opus 4.6 and Sonnet 4.6, but at Google's Pro-tier pricing.
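Whether a given workload actually fits in the window is easy to sanity-check up front. A minimal sketch, using the common rule of thumb of roughly 4 characters per token for English text and code (an approximation; real counts depend on the model's tokenizer):

```python
# Rough check of whether a set of files fits in a 1M-token context window,
# using the ~4-characters-per-token heuristic. Actual counts vary by
# tokenizer; treat this as an estimate, not a guarantee.

CONTEXT_LIMIT = 1_000_000  # tokens
CHARS_PER_TOKEN = 4        # rough heuristic for English text and code

def estimate_tokens(text: str) -> int:
    """Approximate token count from character length."""
    return len(text) // CHARS_PER_TOKEN

def fits_in_context(documents: list[str], limit: int = CONTEXT_LIMIT) -> bool:
    """True if the combined documents likely fit in the context window."""
    total = sum(estimate_tokens(doc) for doc in documents)
    return total <= limit

# Example: ~3 MB of source text is roughly 750k estimated tokens.
repo_files = ["x" * 1_000_000, "y" * 2_000_000]
print(fits_in_context(repo_files))  # True
```

At this scale, many mid-sized repositories fit whole, which is what removes the chunking step mentioned above.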
Pricing That Undercuts Competitors
Google kept Gemini 3.1 Pro pricing identical to Gemini 3 Pro:
| Model | Input Cost | Output Cost |
|---|---|---|
| Gemini 3.1 Pro | $2.00/million tokens | $8.00/million tokens |
| Claude Opus 4.6 | $15/million tokens | $75/million tokens |
| Claude Sonnet 4.6 | $3/million tokens | $15/million tokens |
| GPT-5.3 Codex | ~$1.75/million tokens | ~$10/million tokens |
For an enterprise processing 10 million tokens per day, Gemini 3.1 Pro costs $20/day in input tokens. Claude Opus 4.6 would cost $150/day for the same volume.
The value proposition is clear: frontier-level performance at a fraction of flagship pricing.
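The arithmetic behind that comparison is straightforward to reproduce. A small sketch using the per-million-token prices from the table above (model keys here are just illustrative labels):

```python
# Daily token-cost comparison at the per-million-token rates listed above.
# Prices are (input, output) in USD per million tokens.
PRICES = {
    "gemini-3.1-pro":    (2.00, 8.00),
    "claude-opus-4.6":   (15.00, 75.00),
    "claude-sonnet-4.6": (3.00, 15.00),
}

def daily_cost(model: str, input_tokens: int, output_tokens: int = 0) -> float:
    """Cost in USD for a day's traffic at the listed per-million rates."""
    in_rate, out_rate = PRICES[model]
    return (input_tokens / 1e6) * in_rate + (output_tokens / 1e6) * out_rate

# 10M input tokens per day, as in the example above:
print(daily_cost("gemini-3.1-pro", 10_000_000))   # 20.0
print(daily_cost("claude-opus-4.6", 10_000_000))  # 150.0
```

Output tokens widen the gap further: at $8 vs. $75 per million, a workload that is output-heavy favors the cheaper model even more.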
What Makes Gemini 3.1 Pro Different
Multimodal Native
Gemini 3.1 Pro was built from the ground up to handle text, images, audio, video, and code in a single model. Unlike models that bolt vision capabilities onto a text-only foundation, Gemini processes all modalities through a unified architecture.
This matters for:
- Video analysis: Extract insights from meeting recordings, security footage, or content
- Document processing: Handle PDFs with embedded charts, diagrams, and images
- Code with context: Analyze repositories with documentation, screenshots, and diagrams
Agentic Capabilities
On multi-step reasoning and long-horizon planning tasks, Gemini 3.1 Pro shows significant improvements over its predecessor. The model maintains coherence across extended agent workflows, reducing the failure rate that plagues multi-turn deployments.
Early testing shows Gemini 3.1 Pro excels at:
- Multi-step research synthesis
- Complex code generation with testing and iteration
- Long-running data analysis pipelines
- Autonomous task completion with tool use
Access Via Multiple Platforms
Gemini 3.1 Pro is available through:
- Gemini API: Direct developer access
- Vertex AI: Google Cloud's enterprise AI platform
- Google Antigravity: Google's agentic development environment
- Gemini Advanced: Consumer subscription at ~$18.99/month
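For API access, requests follow the Gemini API's `generateContent` shape of role-tagged contents with text parts. A minimal sketch of building such a request body; the model ID `gemini-3.1-pro` is an assumption here, so check the published model list for the exact identifier:

```python
import json

# Hypothetical model ID -- verify against the Gemini API model list
# before using in production.
MODEL_ID = "gemini-3.1-pro"

def build_generate_content_request(prompt: str, model: str = MODEL_ID) -> dict:
    """Build a request body in the shape used by the Gemini API's
    generateContent endpoint: a list of role-tagged contents, each
    holding a list of parts."""
    return {
        "model": model,
        "contents": [
            {"role": "user", "parts": [{"text": prompt}]},
        ],
    }

payload = build_generate_content_request("Summarize this repository's README.")
print(json.dumps(payload, indent=2))
```

The same payload shape works whether you send it through the Gemini API directly or through Vertex AI's endpoint for the model.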
Competitive Landscape
Gemini 3.1 Pro enters a crowded February 2026 release window that included:
- Claude Opus 4.6 (Feb 4): Anthropic's flagship, 68.8% ARC-AGI-2
- Claude Sonnet 4.6 (Feb 17): Near-Opus performance at lower cost
- GPT-5.3 Codex (Feb 5): OpenAI's coding specialist
- Grok 4.20 (Feb 17): xAI's multi-agent architecture
- Qwen 3.5 (Feb 2026): Alibaba's open-weight contender
On raw benchmark breadth, Gemini 3.1 Pro leads the pack. On the GDPval-AA human preference benchmark, which measures real expert-level office work, Sonnet 4.6 still holds the top spot at 1,633 Elo points.
The trade-off: Gemini 3.1 Pro wins on logic and scientific reasoning benchmarks. Claude Sonnet 4.6 wins on human preference for complex professional work. Both are viable choices depending on your use case.
What This Means for Developers
If you're building agentic systems, Gemini 3.1 Pro is worth serious consideration:
1. Cost efficiency: At $2/M input tokens, you can run more iterations for the same budget
2. Context capacity: 1M tokens means less engineering around context limits
3. Multimodal: Single model handles text, images, audio, and video
4. Benchmark leadership: Leading on 13 of 16 benchmarks suggests consistent quality
The main consideration is human preference. If your users prioritize Claude's writing style and reasoning approach, Sonnet 4.6 at $3/M input remains competitive. If you need maximum benchmark performance for logic and reasoning tasks, Gemini 3.1 Pro has the edge.
The Bottom Line
Google DeepMind needed a win after months of relative quiet in the model release cycle. Gemini 3.1 Pro delivers.
The combination of 77.1% ARC-AGI-2, 1M context window, and unchanged pricing makes this the best value proposition for developers building agentic systems. Whether it displaces Claude in professional workflows depends on your specific use case, but the benchmark numbers speak for themselves.
For enterprises evaluating AI models in Q1 2026, Gemini 3.1 Pro belongs on your shortlist.