Meta unleashes Llama API running 18x faster than OpenAI: Cerebras partnership delivers 2,600 tokens per second

Meta today announced a partnership with Cerebras Systems to power its new Llama API, offering developers access to inference speeds up to 18 times faster than traditional GPU-based solutions.

The announcement, made at Meta's inaugural LlamaCon developer conference in Menlo Park, positions the company to compete directly with OpenAI, Anthropic, and Google in the rapidly growing AI inference service market, where developers purchase tokens by the billions to power their applications.

"Meta has selected Cerebras to collaborate to deliver the ultra-fast inference that they need to serve developers through their new Llama API," said Julie Shin Choi, chief marketing officer at Cerebras, during a press briefing. "We at Cerebras are really, really excited to announce our first CSP hyperscaler partnership to deliver ultra-fast inference to all developers."

The partnership marks Meta's formal entry into the business of selling AI computation, transforming its popular open-source Llama models into a commercial service. While Meta's Llama models have accumulated over one billion downloads, until now the company had not offered a first-party cloud infrastructure for developers to build applications with them.

"This is very exciting, even without talking about Cerebras specifically," said James Wang, a senior executive at Cerebras. "OpenAI, Anthropic, Google: they've built an entire new AI business from scratch, which is the AI inference business. Developers who are building AI apps will buy tokens by the millions, by the billions sometimes. And these are just like the new compute instructions that people need to build AI applications."

A benchmark chart shows Cerebras processing Llama 4 at 2,648 tokens per second, dramatically outpacing competitors SambaNova (747), Groq (600) and GPU-based services from Google and others, explaining Meta's hardware choice for its new API. (Credit: Cerebras)

Breaking the speed barrier: How Cerebras supercharges Llama models

What sets Metaโ€™s offering apart is the dramatic speed increase provided by Cerebrasโ€™ specialized AI chips. The Cerebras system delivers over 2,600 tokens per second for Llama 4 Scout, compared to approximately 130 tokens per second for ChatGPT and around 25 tokens per second for DeepSeek, according to benchmarks from Artificial Analysis.

"If you just compare on API-to-API basis, Gemini and GPT, they're all great models, but they all run at GPU speeds, which is roughly about 100 tokens per second," Wang explained. "And 100 tokens per second is okay for chat, but it's very slow for reasoning. It's very slow for agents. And people are struggling with that today."

This speed advantage enables entirely new categories of applications that were previously impractical, including real-time agents, low-latency conversational voice systems, interactive code generation, and instant multi-step reasoning, all of which require chaining multiple large language model calls that can now be completed in seconds rather than minutes.
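To make the difference concrete, here is a back-of-the-envelope sketch in Python. The throughput figures come from the Artificial Analysis benchmarks cited above; the workload shape, five chained calls of 1,000 output tokens each, is an illustrative assumption rather than a measured agent workload.

```python
# Back-of-the-envelope latency comparison for a chained agent workflow.
# Throughput numbers are the per-provider benchmarks cited in this
# article; the chain length and tokens per call are assumptions.

CALLS_IN_CHAIN = 5        # sequential LLM calls in a hypothetical agent loop
TOKENS_PER_CALL = 1_000   # assumed output tokens generated per call

throughputs = {
    "Cerebras (Llama 4 Scout)": 2_600,  # tokens per second
    "ChatGPT (GPU-based)": 130,
    "DeepSeek": 25,
}

for name, tps in throughputs.items():
    total_seconds = CALLS_IN_CHAIN * TOKENS_PER_CALL / tps
    print(f"{name:26s} ~{total_seconds:6.1f} s for the full chain")

# Cerebras (Llama 4 Scout)   ~   1.9 s for the full chain
# ChatGPT (GPU-based)        ~  38.5 s for the full chain
# DeepSeek                   ~ 200.0 s for the full chain
```

Under these assumptions, the same five-step agent that finishes in about two seconds on Cerebras takes over half a minute at typical GPU speeds, which is the seconds-versus-minutes gap the article describes.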

The Llama API represents a significant shift in Meta's AI strategy, transitioning from primarily being a model provider to becoming a full-service AI infrastructure company. By offering an API service, Meta is creating a revenue stream from its AI investments while maintaining its commitment to open models.

"Meta is now in the business of selling tokens, and it's great for the American kind of AI ecosystem," Wang noted during the press conference. "They bring a lot to the table."

The API will offer tools for fine-tuning and evaluation, starting with the Llama 3.3 8B model, allowing developers to generate data, train on it, and test the quality of their custom models. Meta emphasizes that it won't use customer data to train its own models, and models built using the Llama API can be transferred to other hosts, a clear differentiation from some competitors' more closed approaches.
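The article names the workflow (generate data, train on it, evaluate the result) but not the API surface. The sketch below is purely illustrative: the base URL, endpoint paths, and field names are assumptions, not Meta's published API.

```python
import requests

# Hypothetical sketch of the generate -> train -> evaluate loop described
# above. Every URL and field name here is an assumption for illustration.
BASE = "https://llama-api.example.com/v1"   # placeholder, not a real endpoint
HEADERS = {"Authorization": "Bearer YOUR_API_KEY"}

# 1. Generate candidate training data with the base model.
gen = requests.post(f"{BASE}/data/generate", headers=HEADERS, json={
    "model": "llama-3.3-8b",
    "seed_prompts": ["Summarize this support ticket:", "Classify this intent:"],
    "num_samples": 500,
})

# 2. Launch a fine-tuning job on the generated dataset.
job = requests.post(f"{BASE}/fine-tuning/jobs", headers=HEADERS, json={
    "base_model": "llama-3.3-8b",
    "dataset_id": gen.json()["dataset_id"],
})

# 3. Evaluate the custom model against a held-out set.
evaluation = requests.post(f"{BASE}/evaluations", headers=HEADERS, json={
    "model_id": job.json()["model_id"],
    "benchmark": "held_out_validation",
})
print(evaluation.json())
```

Because the resulting weights are portable, step 3's model could in principle be exported and served from any host, which is the openness Meta is emphasizing.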

Cerebras will power Metaโ€™s new service through its network of data centers located throughout North America, including facilities in Dallas, Oklahoma, Minnesota, Montreal, and California.

"All of our data centers that serve inference are in North America at this time," Choi explained. "We will be serving Meta with the full capacity of Cerebras. The workload will be balanced across all of these different data centers."

The business arrangement follows what Choi described as "the classic compute provider to a hyperscaler" model, similar to how Nvidia provides hardware to major cloud providers. "They are reserving blocks of our compute that they can serve their developer population," she said.

In addition to Cerebras, Meta has announced a partnership with Groq to provide fast inference options, giving developers multiple high-performance alternatives to traditional GPU-based inference.

Meta's entry into the inference API market with superior performance metrics could potentially disrupt the established order dominated by OpenAI, Google, and Anthropic. By combining the popularity of its open-source models with dramatically faster inference capabilities, Meta is positioning itself as a formidable competitor in the commercial AI space.

"Meta is in a unique position with 3 billion users, hyper-scale datacenters, and a huge developer ecosystem," according to Cerebras' presentation materials. The integration of Cerebras technology "helps Meta leapfrog OpenAI and Google in performance by approximately 20x."

For Cerebras, this partnership represents a major milestone and validation of its specialized AI hardware approach. "We have been building this wafer-scale engine for years, and we always knew that the technology's first-rate, but ultimately it has to end up as part of someone else's hyperscale cloud. That was the final target from a commercial strategy perspective, and we have finally reached that milestone," Wang said.

The Llama API is currently available as a limited preview, with Meta planning a broader rollout in the coming weeks and months. Developers interested in accessing the ultra-fast Llama 4 inference can request early access by selecting Cerebras from the model options within the Llama API.

"If you imagine a developer who doesn't know anything about Cerebras because we're a relatively small company, they can just click two buttons on Meta's standard software SDK, generate an API key, select the Cerebras flag, and then all of a sudden, their tokens are being processed on a giant wafer-scale engine," Wang explained. "That kind of having us be on the back end of Meta's whole developer ecosystem is just tremendous for us."
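In practice, the developer experience Wang describes might look something like the sketch below. The URL, the "provider" field, and the response shape are illustrative assumptions based on his description, not documented parameters of Meta's SDK.

```python
import requests

# Illustrative sketch of "selecting the Cerebras flag" on a request.
# The endpoint, field names, and model name are assumptions for clarity.
resp = requests.post(
    "https://llama-api.example.com/v1/chat/completions",  # placeholder URL
    headers={"Authorization": "Bearer YOUR_API_KEY"},
    json={
        "model": "llama-4-scout",
        "provider": "cerebras",  # hypothetical flag routing to wafer-scale hardware
        "messages": [
            {"role": "user", "content": "Plan a five-step research agent."}
        ],
    },
)
print(resp.json())
```

The point of the design, as Wang frames it, is that switching to specialized hardware is a one-line change for the developer rather than a new vendor relationship.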

Meta's choice of specialized silicon signals something profound: in the next phase of AI, it's not just what your models know, but how quickly they can deliver it. In that future, speed isn't just a feature; it's the whole point.

