
As LLMs become more capable, user expectations for speed and reliability continue to rise, especially for enterprise and productivity applications. That’s why we believe in the power of on-device AI, which can deliver real-time, high-quality experiences that match or exceed those of cloud-based or hybrid solutions. The challenge? There’s no established blueprint for optimizing and scaling on-device models.
We’ve tackled this challenge head-on by developing a systematic approach to optimizing language models (like T5) to run on-device—without any quality degradation. Specifically, we’ve reduced the memory and latency of our grammatical error correction (GEC) model by over 50%, enabling it to run efficiently on users’ devices. We have also created a custom software development kit (SDK) to deliver this optimized model to millions of users across various desktop platforms. By successfully shipping this functionality at scale, we’ve demonstrated that our on-device approach works, making us one of the trailblazing companies in this space.
In this blog post, we will share how we solved the technical challenges of optimizing and scaling our GEC model, establishing a foundation for future on-device AI development.
What does it take to run the GEC model on-device?
While Grammarly provides more than just grammatical error correction, it’s a core part of how we empower users to communicate effectively. To ensure a seamless writing experience, Grammarly must provide high-quality suggestions in real time—less than 100 milliseconds. Achieving this requires solving three unique challenges:
- Memory management: User devices have limited memory, which is often shared with other applications, making optimizing memory usage critical for performance. Our original model (designed for cloud servers) needed almost 4 GB, roughly the size of an average desktop’s RAM—making it impractical to run locally without significant optimization.
- Computational efficiency: As with memory, user devices have limited processing power that is often shared with other applications. For a real-time experience, the model’s resource-intensive operations (chiefly inference) must run quickly without compromising quality. If our model demands too many resources, it can cause lag, interfere with other applications, or rapidly drain battery life, leading to a poor user experience.
- Cross-platform deployment: To serve all Grammarly users, our solution must work across different platforms and devices. This is challenging because each platform uses different hardware, programming languages, and machine learning APIs. We must also consider varying device capabilities—from powerful MacBooks with dedicated GPUs to budget Chromebooks with minimal resources.
To overcome these challenges, we addressed each one in turn: memory management, computational efficiency, and cross-platform deployment.
Reducing memory footprint
To ensure that GEC would not impact the memory available to other applications, we set an ambitious target of getting the model to run in under 1 GB of memory. A common technique for this is quantization, in which model weights (typically 32-bit floats) are converted into smaller, less precise numbers (like 4-bit integers). While quantization greatly reduces the memory footprint, it can also degrade model accuracy.
To balance model quality and memory savings, we experimented with different quantization levels, including BFLOAT16 and INT4. BFLOAT16 gave the best results: it had minimal impact on accuracy while reducing memory usage by 50%. By combining quantization with other optimization techniques (like the graph optimizations described below), we got the final grammatical error correction model to run in under 300 MB of memory.
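To make the idea concrete, here is a minimal sketch (not our production pipeline) of what casting a T5 checkpoint from 32-bit floats to BFLOAT16 looks like in PyTorch. The public t5-base checkpoint and the size helper are stand-ins for illustration only.

```python
import torch
from transformers import T5ForConditionalGeneration

def param_megabytes(model: torch.nn.Module) -> float:
    """Total size of the model's parameters in megabytes."""
    return sum(p.numel() * p.element_size() for p in model.parameters()) / 1e6

# The public t5-base checkpoint stands in for our GEC model (an assumption for this sketch).
model = T5ForConditionalGeneration.from_pretrained("t5-base")
print(f"float32 weights:  {param_megabytes(model):.0f} MB")

# BFLOAT16 keeps float32's exponent range but stores half the bytes per weight,
# which is why it halves weight memory with little impact on accuracy.
model = model.to(torch.bfloat16)
print(f"bfloat16 weights: {param_megabytes(model):.0f} MB")
```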
Improving computational efficiency
To ensure a real-time user experience, we calculated that the GEC model needed to run end to end at 100+ tokens/second, which meant cutting our latency roughly in half.
We started by optimizing the T5 model, the core inference engine of the GEC pipeline. We reasoned that if we could raise the T5 model’s speed from 70 tokens/second to 200 tokens/second, we could cross-apply those learnings to the other components in the pipeline and meet our overall performance targets.
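The back-of-the-envelope arithmetic behind those targets looks like this (the correction length is an assumption, purely for illustration; the throughput figures are the ones above):

```python
def ms_per_token(tokens_per_second: float) -> float:
    """Convert decoding throughput into per-token latency."""
    return 1000.0 / tokens_per_second

span_tokens = 10  # assumed length of a corrected span, for illustration only

for label, tps in [("baseline T5", 70), ("overall target", 100), ("T5 target", 200)]:
    per_token = ms_per_token(tps)
    print(f"{label:>14}: {per_token:4.1f} ms/token, ~{span_tokens * per_token:3.0f} ms for {span_tokens} tokens")
```

At 70 tokens/second, even a modest correction overshoots a 100-millisecond budget; at the target speeds, it fits.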
We first performed manual graph optimizations on the model, reorganizing and streamlining operations much as libraries like TensorRT and ONNX Runtime do automatically. Next, we examined the calculations within each operation. Some optimizations were straightforward: removing unneeded operations (e.g., redundant typecasts), minimizing time-intensive operations (e.g., reshapes and transposes), and sequencing operations to keep data on the same processor as much as possible, since data transfers are expensive.
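For reference, the automatic counterpart of these rewrites is a few lines with ONNX Runtime, which applies its built-in passes (constant folding, redundant-node elimination, operator fusion) when a session is created. The file paths below are placeholders, and this sketch illustrates the technique rather than our actual build step.

```python
import onnxruntime as ort

opts = ort.SessionOptions()
# Enable all built-in graph rewrites and save the rewritten graph for inspection.
opts.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL
opts.optimized_model_filepath = "t5_encoder.opt.onnx"  # placeholder output path

session = ort.InferenceSession("t5_encoder.onnx", opts)  # placeholder model path
```

Doing the rewrites by hand also let us make model-specific choices, such as sequencing operations to keep data on the same processor.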
One big unlock was using optimized kernels and fused operations that take advantage of the device’s hardware. For example, one critical calculation in the network is multi-head attention (MHA), which lets the model weigh the relationships between different parts of the input text as it generates its output. MHA performs many calculations in parallel across different parts of the input. We replaced our naive MHA implementation with an optimized, fused function from MLX, Apple’s ML framework, thereby speeding up the model’s performance.
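The sketch below contrasts a naive multi-head attention computation with MLX’s fused kernel, mx.fast.scaled_dot_product_attention. The tensor shapes are invented for illustration, and this is the kind of fused call we mean rather than our exact integration.

```python
import mlx.core as mx

B, H, L, D = 1, 8, 128, 64  # batch, heads, sequence length, head dim (illustrative)
q = mx.random.normal((B, H, L, D))
k = mx.random.normal((B, H, L, D))
v = mx.random.normal((B, H, L, D))
scale = 1.0 / D**0.5

# Naive attention: several separate kernels and a materialized (L x L) score matrix.
scores = (q * scale) @ k.transpose(0, 1, 3, 2)
naive = mx.softmax(scores, axis=-1) @ v

# Fused kernel: the same math in a single hardware-optimized call.
fused = mx.fast.scaled_dot_product_attention(q, k, v, scale=scale)

print(mx.allclose(naive, fused, atol=1e-4))  # True, up to floating-point tolerance
```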
With these architectural and computational gains, we increased our T5 processing speed from 70 tokens/second to 297 tokens/second, exceeding our 200 tokens/second target and achieving our latency goals.
Seamless cross-platform deployment
Now that we had a performant model, we faced a new challenge: deploying it to millions of Grammarly desktop users across various platforms, each with its own hardware, programming languages, and ML APIs. A straightforward approach would be to build a separate SDK for each platform, but that makes future maintenance and iteration cumbersome, since every code change has to be applied to every SDK.
To solve this, we built a Rust-based SDK that runs Grammarly machine learning models across three key platforms: Mac, Windows, and the Chrome Extension. This approach lets us write the code once and compile it for each platform, simplifying model deployment and maintenance. The SDK also leverages native platform ML libraries (such as Metal for Apple devices), enabling us to take advantage of hardware-specific acceleration.
The future is on-device
Using our SDK, we shipped our on-device GEC model to millions of Grammarly desktop users. Early results show no degradation in quality or performance compared to the prior cloud-based model. This confirms our belief that on-device AI can deliver real-time, high-quality experiences without relying on cloud servers—turning what was once just a vision into reality.
This work is just the first step toward powerful, new AI experiences. Leveraging these optimization learnings and our SDK, we’re building on-device versions of other complex writing models. If you want to tackle challenging AI optimization problems at scale, come work with us. Check out our jobs page.
Special thanks to the entire team that worked on this project: Sri Malireddi, Illia Dzivinskyi, Ignat Blazhko, Dhruv Matani, and John Blatz.