DiffusionGemma Goes Open Source, Microsoft's Agent Eval Framework, and Anthropic Bets $150M on Nonprofit Fellowships — AI Digest for June 11, 2026

Today's five items: Google DeepMind open-sources a text model that generates tokens in parallel via diffusion (4× faster, with real tradeoffs); Microsoft releases an MIT-licensed eval framework that turns written specs into agent test suites; Databricks proposes an open protocol for sharing AI models and agent skills across platforms without copying them; TestSprite open-sources a QA tool that gives coding agents a real browser-based test loop; and Anthropic commits $150 million to place 1,000 AI fellows at nonprofits across the US.

DiffusionGemma: a 26B open-weights model that generates text 4× faster — and why that comes with caveats

Google DeepMind released DiffusionGemma on June 10, an open-weights 26-billion-parameter model that abandons the standard token-by-token generation approach in favor of diffusion — the same mechanism behind image generators like Stable Diffusion. 1

Instead of predicting the next token autoregressively, DiffusionGemma starts with a block of 256 random placeholder tokens and iteratively denoises them into readable text using bidirectional attention. 2 Every position attends to every other in both directions simultaneously — structurally different from GPT-style generation where each token can only look backward. The model commits completed 256-token blocks to a key-value cache before generating the next block, allowing it to handle long outputs without reprocessing earlier context.
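To make the block-by-block loop concrete, here is a toy Python sketch. It is not DeepMind's algorithm. The denoiser is a stand-in that already knows the answer and assigns random confidences, mask tokens replace the random placeholders, and committing the most confident positions at each step is an assumption borrowed from how masked-diffusion decoders are commonly described. BLOCK_SIZE, STEPS, toy_denoiser, and generate are all hypothetical names.

```python
import random

MASK = "<mask>"
BLOCK_SIZE = 8  # the article cites 256-token blocks; 8 keeps the demo readable
STEPS = 4       # denoising steps per block (illustrative value)


def toy_denoiser(block, committed, target, rng):
    """Stand-in for the model: one call returns a (token, confidence) pair
    for every position in the block, mimicking a parallel prediction."""
    offset = len(committed)
    return [(target[offset + i], rng.random()) for i in range(len(block))]


def generate(target, seed=0):
    rng = random.Random(seed)
    committed = []  # plays the role of the KV cache: finished blocks are not revisited
    trace = []      # snapshot of the working block after every denoising step
    while len(committed) < len(target):
        size = min(BLOCK_SIZE, len(target) - len(committed))
        block = [MASK] * size
        per_step = -(-size // STEPS)  # ceiling division
        for _ in range(STEPS):
            proposals = toy_denoiser(block, committed, target, rng)
            masked = [i for i, tok in enumerate(block) if tok == MASK]
            # Unmask the most confident positions that are still masked.
            masked.sort(key=lambda i: proposals[i][1], reverse=True)
            for i in masked[:per_step]:
                block[i] = proposals[i][0]
            trace.append(list(block))
        committed.extend(block)  # commit the finished block, then start the next one
    return committed, trace


words = "the quick brown fox jumps over the lazy dog again and again".split()
output, steps = generate(words)
for snapshot in steps:
    print(" ".join(snapshot))
```

Each printed line is one denoising step. Positions fill in out of order within a block, but blocks themselves are finished strictly left to right.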

DiffusionGemma speed at a glance

Vendor-stated throughput figures on dedicated GPUs (single-user scenario)

H100 (FP8): 1,000+ tokens/sec
RTX 5090 (NVFP4): 700+ tokens/sec
VRAM needed: 18 GB
Active params: 3.8 B

Speed numbers: on a single H100 GPU at FP8 precision, DiffusionGemma exceeds 1,000 tokens per second. On a consumer RTX 5090 with Nvidia's NVFP4 (4-bit) quantization, throughput reaches 700+ tokens/sec and the model fits within 18GB of VRAM. 3 That's roughly four to five times faster than comparable autoregressive models in single-user GPU scenarios. In multi-request cloud serving, Google's documentation notes the advantage narrows or disappears — the throughput gains are specifically suited for local/edge deployment. 4

The model is a mixture-of-experts architecture with only 3.8 billion active parameters during inference despite the 26 billion total.
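A back-of-the-envelope check shows why these numbers hang together. The sketch below counts weight storage only, and it assumes exactly 4 or 8 bits per weight, ignoring quantization scale factors, the key-value cache, and activations. The helper name weight_gb is hypothetical.

```python
def weight_gb(params_billion, bits_per_weight):
    """Weight storage in GB (10^9 bytes). Ignores quantization scales,
    KV cache, and activations, so real usage will be higher."""
    return params_billion * bits_per_weight / 8


TOTAL_B = 26.0   # total parameters, in billions
ACTIVE_B = 3.8   # parameters active per token, in billions
VRAM_GB = 18.0   # vendor-stated footprint on the RTX 5090

four_bit = weight_gb(TOTAL_B, 4)
eight_bit = weight_gb(TOTAL_B, 8)

print(four_bit)                      # 13.0 GB of weights at 4 bits
print(eight_bit)                     # 26.0 GB of weights at 8 bits
print(VRAM_GB - four_bit)            # 5.0 GB left for everything else
print(round(ACTIVE_B / TOTAL_B, 3))  # 0.146 of the weights used per token
```

At 4 bits the weights alone take about 13 GB, which leaves room inside the quoted 18 GB. Only about 15% of the parameters are active for any one token, which is part of why the model runs quickly.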

The tradeoff is real: DiffusionGemma scores lower than standard Gemma 4 on MMLU and coding benchmarks. Google positions it as experimental and recommends Gemma 4 for production use cases where quality matters more than speed. The release is positioned as a research foothold on diffusion-based text generation, not a drop-in replacement. 4

Where it does have structural advantages: code infilling, inline text editing, constraint-satisfaction tasks, and anything requiring awareness of surrounding context rather than just prior tokens. A fine-tuned DiffusionGemma solved 80% of Sudoku puzzles in 12 denoising steps versus 0% for the base model, as a demonstration of what bidirectional attention enables. 2

Availability: Apache 2.0 license on Hugging Face, Kaggle, and Google Cloud's Vertex AI Model Garden. Compatible with HuggingFace Transformers, vLLM, SGLang, MLX, JAX, and Nvidia NIM on day one. 1

Microsoft open-sources ASSERT, an AI agent evaluation framework that converts specs into test suites

Most teams deploying AI agents don't test them before production: Gartner analyst Anushree Verma puts the share that skip testing at around 99%. 5 Microsoft released ASSERT (Adaptive Spec-driven Scoring for Evaluation and Regression Testing) on June 11 under an MIT license to address this gap. 5

The idea is to take what you already have — a product requirements doc, a governance policy, a plain-language spec — and have ASSERT convert it into evaluation scenarios, datasets, metrics, and scorecards automatically. Microsoft's blog post puts it plainly: "Agents fail in ways that are hard to see. They drift from policy, produce unsafe outputs in edge cases, and behave differently in production than they did in testing. Generic benchmarks do not catch these failures because they are not built around your policies, your agent, or your use case." 6

The framework uses LLMs as judges, with Microsoft reporting 80–90% agreement with human reviewers in internal validation. Forrester principal analyst Biswajeet Mahapatra's assessment: that level of agreement is useful but not sufficient as a standalone governance control — human oversight should remain for high-risk or regulated scenarios, and buyers should watch for bias when a single model acts as both generator and evaluator. 5
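For a sense of what an agreement figure like that does and does not tell you, here is a small Python sketch. It assumes "agreement" means the share of cases where judge and human give the same label, which the source does not spell out. It also computes Cohen's kappa, a standard statistic that discounts agreement expected by chance. The labels are invented for illustration, and none of this reflects ASSERT's actual scoring code.

```python
from collections import Counter


def agreement(judge, human):
    """Fraction of cases where the LLM judge and the human give the same label."""
    if len(judge) != len(human) or not judge:
        raise ValueError("label lists must be non-empty and the same length")
    return sum(j == h for j, h in zip(judge, human)) / len(judge)


def cohens_kappa(judge, human):
    """Agreement corrected for what two raters would match on by chance."""
    n = len(judge)
    observed = agreement(judge, human)
    judge_counts, human_counts = Counter(judge), Counter(human)
    labels = set(judge_counts) | set(human_counts)
    expected = sum(judge_counts[l] * human_counts[l] for l in labels) / (n * n)
    if expected == 1:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Invented example: 10 agent transcripts, mostly passing.
human = ["pass"] * 8 + ["fail"] * 2
judge = ["pass"] * 9 + ["fail"] * 1

print(agreement(judge, human))               # 0.9
print(round(cohens_kappa(judge, human), 3))  # 0.615
```

In this made-up sample the judge matches the human 90% of the time, yet it misses half of the failures. When most cases pass, raw agreement can look strong while the rare, important cases slip through.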

ASSERT enters a market that already has LangSmith, Braintrust, Patronus AI, Galileo, Arize AI's Phoenix, and Promptfoo. The open-source MIT release lowers the cost to try it and lets teams inspect and modify the scoring logic, though as Mahapatra notes, the originating vendor still influences how evaluation criteria are defined and encoded.

Databricks launches OpenSharing, an open protocol for sharing AI models and agent skills across platforms

On June 11, Databricks launched OpenSharing, an open protocol that lets enterprises share AI models, agent skills, dashboards, and unstructured data across platforms without copying or moving the assets. 7 It's now a sandbox project under the Linux Foundation AI & Data Foundation.

The mechanism is zero-copy credential vending: instead of replicating assets, a provider issues temporary, scoped credentials so a recipient can access shared content directly from the provider's cloud storage. This means an agent skill or fine-tuned model can be shared with a partner without duplicating it across their environment.
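The general idea of vending a temporary, scoped credential can be sketched in a few lines of Python. This is not the OpenSharing wire format. Real deployments would typically hand out cloud-provider tokens or pre-signed URLs. The signing key, the claim fields, and the function names vend_credential and authorize are all hypothetical.

```python
import base64
import hashlib
import hmac
import json
import time

PROVIDER_KEY = b"demo-signing-key"  # hypothetical; a real provider would use managed secrets


def vend_credential(asset, actions, ttl_seconds, now=None):
    """Provider side: issue a signed token limited to one asset, some actions, and a lifetime."""
    now = time.time() if now is None else now
    claims = {"asset": asset, "actions": sorted(actions), "exp": now + ttl_seconds}
    payload = base64.urlsafe_b64encode(json.dumps(claims, sort_keys=True).encode())
    signature = hmac.new(PROVIDER_KEY, payload, hashlib.sha256).hexdigest()
    return payload.decode() + "." + signature


def authorize(token, asset, action, now=None):
    """Storage side: accept the request only if the token is genuine, in scope, and unexpired."""
    now = time.time() if now is None else now
    try:
        payload, signature = token.rsplit(".", 1)
    except ValueError:
        return False
    expected = hmac.new(PROVIDER_KEY, payload.encode(), hashlib.sha256).hexdigest()
    if not hmac.compare_digest(signature, expected):
        return False
    claims = json.loads(base64.urlsafe_b64decode(payload))
    return claims["asset"] == asset and action in claims["actions"] and now < claims["exp"]


token = vend_credential("models/fraud-v3", {"read"}, ttl_seconds=900, now=1000)
print(authorize(token, "models/fraud-v3", "read", now=1500))   # True: in scope, not expired
print(authorize(token, "models/fraud-v3", "write", now=1500))  # False: action not granted
print(authorize(token, "models/fraud-v3", "read", now=2000))   # False: expired at 1900
```

The recipient never receives a copy of the model. It holds a short-lived token that the provider's storage layer checks on every request.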

The practical target is what HFS Research's Ashish Chaturvedi calls the "integration tax," the overhead that piles up when models, skills, and consumers all sit on different platforms: "The integration tax is enormous, and it grows exponentially with every new partner, customer, or internal team." 7

OpenSharing is an evolution of Databricks' existing Delta Sharing protocol, extended to cover the broader range of AI artifacts. The novelty compared to Snowflake's zero-copy integrations is that OpenSharing works across platforms — Snowflake's approach requires both provider and receiver to be on Snowflake.

Connectors currently available: Python, Apache Spark, Tableau, PowerBI, Snowflake, DuckDB, Clojure, Node.js, Java, Rust, Go, C++, and R. Coming soon: Google Spreadsheet, Excel, Airflow, and Lakehouse Sharing. 7

GitHub repository: delta-io/delta-sharing (https://github.com/delta-io/delta-sharing)

TestSprite open-sources a QA CLI that gives AI coding agents a real browser-based test loop

A persistent frustration with autonomous coding agents: they declare a feature "complete" while some tests are still failing, or they fix one bug and break three others. TestSprite, a software testing startup, today open-sourced its TestSprite CLI under Apache 2.0 to give coding agents a proper quality assurance loop rather than a spot check. 8

The tool works like this: the coding agent describes a behavior once. TestSprite runs it in the cloud against a live browser or live API, with no mocks. On failure, it returns a self-contained report: the failing step, its neighbors, screenshots, a DOM manifest, the test source, a root cause hypothesis, and a suggested fix. The agent reads the report, patches the code, and reruns. Coverage grows alongside the codebase: each new phase of work adds tests rather than replacing them.
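The shape of that loop can be shown with a local, in-process stand-in. The Python sketch below does not use TestSprite's API or a real browser. The app is a dictionary of functions, the "agent" is a hard-coded patch function that deliberately introduces a regression, and every name in it (Report, run_suite, qa_loop, careless_agent) is hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Report:
    test_name: str
    detail: str  # stand-in for the failing step, screenshots, and suggested fix


def run_suite(app, suite):
    """Run every accumulated check and collect a report for each failure."""
    failures = []
    for name, check in suite.items():
        try:
            check(app)
        except AssertionError as exc:
            failures.append(Report(name, str(exc)))
    return failures


def qa_loop(app, suite, patch, max_rounds=5):
    """Rerun the whole suite after every patch so regressions surface right away."""
    history = []
    for _ in range(max_rounds):
        failures = run_suite(app, suite)
        history.append([f.test_name for f in failures])
        if not failures:
            return history
        patch(app, failures)  # the coding agent reads the reports and edits the code
    raise RuntimeError("suite still failing after max_rounds")


def check_add(app):
    got = app["add"](2, 3)
    assert got == 5, f"add(2, 3) returned {got}, expected 5"


def check_sub(app):
    got = app["sub"](5, 3)
    assert got == 2, f"sub(5, 3) returned {got}, expected 2"


def careless_agent(app, failures):
    for report in failures:
        if report.test_name == "add":
            app["add"] = lambda a, b: a + b
            app["sub"] = lambda a, b: a + b  # simulated regression in working code
        elif report.test_name == "sub":
            app["sub"] = lambda a, b: a - b


app = {"add": lambda a, b: a - b, "sub": lambda a, b: a - b}
suite = {"add": check_add, "sub": check_sub}
print(qa_loop(app, suite, careless_agent))  # [['add'], ['sub'], []]
```

The second round fails only because the whole suite is rerun. A spot check of the "add" fix alone would have reported success while "sub" was broken.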

Install: npm install -g @testsprite/cli (requires Node.js 2.0+). Documentation and source are on GitHub.

GitHub repository: TestSprite/testsprite-cli (https://github.com/TestSprite/testsprite-cli)

To mark the launch, TestSprite also ran CoderCup, a public World Cup–themed benchmark that pitted Claude Code, OpenAI Codex, Google Antigravity, and Kimi against each other on the same app build under a clock, with TestSprite's CLI as the referee. Claude Code ranked highest on consistency. Codex and Antigravity were fastest overall (both under 100 cumulative minutes). 8

The surprise: Kimi (Moonshot AI) took the longest at around 350 minutes but posted the highest correctness score in the field at 0.89 — and the lowest total cost, outperforming agents many times its size on accuracy. Notably, every agent in the competition broke previously working features at some point.

Anthropic commits $150M to place 1,000 AI fellows at nonprofits across the US

Anthropic launched Claude Corps on June 11, a national fellowship program placing people early in their careers at nonprofits for 12 months, at a $150 million initial commitment. 9 The program is structured as a partnership between Anthropic (funding and Claude expertise), CodePath (the fellows' employer of record and training lead), and Social Finance (evaluation and longer-term financing).

Fellows receive an $85,000 full-time salary, five hours of weekly training, a dedicated Claude token budget, and a CodePath mentor. Over the next 12 months, at least 400 nonprofits will host fellows; participating organizations include food banks, veteran support nonprofits, marine conservation groups, and workforce development programs.

The first cohort of 100 fellows starts in October 2026; applications close July 17. Subsequent cohorts begin January 2027 and August 2027. Anyone over 18 with under two years of full-time work experience can apply — no specific educational background required.

Anthropic's stated framing: "The benefits of transformative AI systems could come at the cost of significant disruption. The companies building this technology have a responsibility to make sure the benefits are fully realized and widely shared." The program is explicitly designed to scale — Anthropic says it plans to open-source the core technology behind Claude Corps for other organizations to replicate, and to expand it internationally.

Whether this counts as meaningful workforce transition support or primarily as a brand move is a fair question. What's measurable: $150 million committed, 1,000 planned fellowships beginning with a first cohort of 100, and specific organizations with names and missions already on board. The efficacy measurement plan is being handled by Social Finance, an independent nonprofit and registered investment advisor. 9

Quick hits

Google DeepMind announced a $10M multi-agent AI safety research funding call in partnership with Schmidt Sciences, the Cooperative AI Foundation, ARIA, and Google.org, targeting four areas: sandboxes and testbeds, the science of agent networks, agent infrastructure (identity and reputation protocols), and oversight/control methods at deployment scale. Application deadline: August 8, 2026. 10
OpenAI published a threat report on June 10 detailing two clusters of ChatGPT accounts linked to Chinese state-origin influence operations targeting US AI policy debates. The "Data Center Bandwagon" campaign generated content claiming AI data center buildouts raised electricity prices; the "Tech and Tariffs" campaign pushed anti-tariff narratives while spreading false claims that ChatGPT user data had been compromised. OpenAI says it found no evidence either campaign achieved meaningful reach beyond its own activity. 11
Cursor Bugbot (the AI code review tool inside Cursor IDE) cut average review time from ~5 minutes to ~90 seconds in its June 10 update, powered by Composer 2.5. Vendor-stated figures: 10% more bugs found per run, 22% lower cost per run, 90% of runs under three minutes. A new /review command lets you run Bugbot before pushing, not just at PR time; duplicate-diff detection means a locally pre-reviewed PR won't be billed twice. All numbers are vendor-stated from Cursor's own changelog — on independent SWE-Bench Multilingual, Composer 2.5 sits 0.7 points behind Claude Opus 4.7. 12
