TL;DR

LiteRT-LM is Google’s open-source framework for running LLMs on edge devices — phones, tablets, browsers, wearables, and IoT. Key facts:

  • Powers production Google products: Chrome, Chromebook Plus, Pixel Watch
  • Just added Gemma 4 support — Google’s most capable on-device model (Apache 2.0)
  • Cross-platform: Android, iOS (coming), Web, Desktop, Raspberry Pi
  • Hardware acceleration: GPU and NPU via platform-specific backends
  • Multi-modal: Vision and audio inputs, not just text
  • Function calling: Tool use support for agentic workflows on-device
  • Models: Gemma 4, Gemma 3n, Llama, Phi-4, Qwen, and more
  • 3,157 GitHub stars | Apache 2.0 | C++ core with Kotlin, Python, C++ APIs
  • One command to try: uv tool install litert-lm && litert-lm run --from-huggingface-repo=...

This isn’t a research project — it’s what actually runs AI in Google’s shipping products.


Why LiteRT-LM Matters

The AI industry has a cloud problem. Every API call to GPT, Claude, or Gemini costs money, adds latency, and sends user data to external servers. LiteRT-LM is Google’s answer: run the model directly on the user’s device.

What makes it different from Ollama or llama.cpp:

| Feature | LiteRT-LM | Ollama | llama.cpp |
|---|---|---|---|
| Target | Mobile/edge/IoT | Desktop/server | Desktop/server |
| Platforms | Android, iOS, Web, Pi | macOS, Linux, Windows | macOS, Linux, Windows |
| Optimization | GPU + NPU acceleration | CPU + GPU | CPU + Metal/CUDA |
| Production | Powers Chrome, Pixel Watch | Developer tool | Developer tool |
| Model format | .litertlm (optimized) | GGUF | GGUF |
| Function calling | Built-in | No | No |
| Multi-modal | Vision + audio | Text only | Text + vision |

The key differentiator: LiteRT-LM is specifically optimized for constrained devices. It memory-maps embedding layers (keeping them on disk until needed), uses NPU acceleration where available, and is designed for the 2-8GB RAM reality of phones and wearables.
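The memory-mapping idea is easy to see in miniature. Here is a generic sketch using NumPy's `memmap` — this illustrates the technique, not LiteRT-LM's actual implementation:

```python
import os
import tempfile
import numpy as np

# Write a toy "embedding table" to disk: 1,000-token vocab, 64 dims.
path = os.path.join(tempfile.mkdtemp(), "embeddings.bin")
table = np.arange(1000 * 64, dtype=np.float32).reshape(1000, 64)
table.tofile(path)

# Memory-map the file: the OS pages in only the rows actually touched,
# so the full table never has to sit in RAM at once.
mmap_table = np.memmap(path, dtype=np.float32, mode="r", shape=(1000, 64))

token_id = 42
embedding = mmap_table[token_id]  # reads a single 256-byte row from disk
print(embedding.shape)            # (64,)
```

On a phone with a few GB of RAM, the same trick lets a runtime keep a multi-hundred-MB embedding matrix on flash and fault in only the rows for the tokens in the current prompt.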


What’s New: Gemma 4 on Edge

Google just released Gemma 4 — their most capable open model — with day-one LiteRT-LM support:

  • Gemma 4 E2B (2B effective params): Runs on phones with ~1.5GB working memory
  • Gemma 4 E4B (4B effective params): Better quality, ~3GB working memory
  • Agentic capabilities: Function calling, tool use, multi-step reasoning — all on-device
  • Apache 2.0: Commercially permissive, no restrictions
  • Offline: Zero latency, zero cost, full privacy

From Google’s blog: “In collaboration with Pixel, Qualcomm, and MediaTek, these models run completely offline with near-zero latency across phones, Raspberry Pi, and NVIDIA Jetson Orin Nano.”


Quick Start

No-Code Trial (CLI)

# Install
uv tool install litert-lm

# Run Gemma 4 E2B
litert-lm run \
  --from-huggingface-repo=litert-community/gemma-4-E2B-it-litert-lm \
  gemma-4-E2B-it.litertlm \
  --prompt="What is the capital of France?"

Works on Linux, macOS, Windows (WSL), and Raspberry Pi.

Android (Kotlin)

// Load the .litertlm model, then run a single-turn generation.
val model = LiteRtLm.load(context, "gemma-4-E2B-it.litertlm")
val response = model.generateResponse("What's the weather like?")

Python

import litert_lm

# Load the downloaded .litertlm model, then generate with a token cap.
model = litert_lm.load("gemma-4-E2B-it.litertlm")
response = model.generate("Explain this code:", max_tokens=512)

Download the AI Edge Gallery app and run models on your phone — no code required.


Supported Models

| Model | Effective Size | Memory | Best For |
|---|---|---|---|
| Gemma 4 E2B | 2B | ~1.5GB | Phones, quick tasks |
| Gemma 4 E4B | 4B | ~3GB | Quality on phones/tablets |
| Gemma 3n E2B | 2B | ~1.2GB | Ultra-lightweight |
| Llama 3.2 3B | 3B | ~2GB | General purpose |
| Phi-4 Mini | 3.8B | ~2.5GB | Reasoning tasks |
| Qwen 2.5 3B | 3B | ~2GB | Multilingual |
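A practical way to use the numbers above is to pick the largest model that fits your device's memory budget. A tiny illustrative helper (the model names and figures come from the table; the function itself is not part of any SDK):

```python
# Approximate working-memory needs (GB), taken from the table above.
MODEL_MEMORY_GB = {
    "gemma-4-E2B": 1.5,
    "gemma-4-E4B": 3.0,
    "gemma-3n-E2B": 1.2,
    "llama-3.2-3B": 2.0,
    "phi-4-mini": 2.5,
    "qwen-2.5-3B": 2.0,
}

def largest_model_for(budget_gb: float) -> str:
    """Return the most memory-hungry model that still fits the budget.

    Assumes the budget fits at least one model in the table.
    """
    fitting = {m: gb for m, gb in MODEL_MEMORY_GB.items() if gb <= budget_gb}
    return max(fitting, key=fitting.get)

print(largest_model_for(2.0))  # a ~2 GB budget lands in the 3B class
```

Real deployments should also leave headroom for the KV cache, which grows with context length, so treat these figures as floors rather than totals.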

Practical Use Cases

1. Offline AI Assistant

Build a personal assistant that works without internet. Gemma 4’s agentic capabilities support function calling on-device.
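The shape of an on-device tool-use loop looks roughly like this. This is a pure-Python sketch with a stubbed model output and a made-up tool name — the real LiteRT-LM function-calling API differs:

```python
import json

# Hypothetical local tool registry — the name and schema are illustrative.
TOOLS = {
    "get_battery_level": lambda: {"percent": 87},
}

def run_turn(model_output: str) -> str:
    """If the model emitted a JSON tool call, dispatch it; else return the text."""
    try:
        call = json.loads(model_output)
    except json.JSONDecodeError:
        return model_output                 # plain-text reply, no tool needed
    result = TOOLS[call["name"]]()          # executed locally, fully offline
    return f"Tool {call['name']} returned {result}"

# Stub of what an on-device model might emit for "How's my battery?"
stub = json.dumps({"name": "get_battery_level", "arguments": {}})
print(run_turn(stub))
```

The point is that the whole loop — generation, tool dispatch, and the follow-up turn — stays on the device, which is what makes offline agentic workflows possible.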

2. Privacy-First Applications

Medical apps, legal tools, financial advisors — anything where data cannot leave the device.

3. IoT and Embedded

Run AI on Raspberry Pi for smart home automation, industrial monitoring, or edge analytics.

4. Browser-Based AI

LiteRT-LM powers on-device AI in Chrome. No server costs for AI features.

5. Wearables

Powers AI features on Pixel Watch — demonstrating extreme optimization capabilities.


Honest Limitations

  1. Small models only — targets 1-8B parameters. Don’t expect GPT-5 quality.
  2. iOS Swift API still in development — Android is first-class.
  3. Model conversion required — can’t use GGUF files directly.
  4. Limited model library — ~20 models vs Ollama’s hundreds.
  5. C++ complexity — building from source is non-trivial.

LiteRT-LM vs Alternatives

| Feature | LiteRT-LM | Ollama | llama.cpp | MLX |
|---|---|---|---|---|
| Mobile | Native | No | Partial | No |
| Browser | Yes | No | Via WASM | No |
| IoT/Pi | Yes | Yes | Yes | No |
| NPU accel | Yes | No | No | No |
| Function calling | Built-in | No | No | No |
| Production use | Chrome, Pixel | Dev tool | Dev tool | Dev tool |
| Model library | ~20 | 100+ | 100+ | 50+ |

Choose LiteRT-LM if you are building for mobile, browser, or IoT and need hardware acceleration. Choose Ollama if you want the simplest local LLM setup on desktop/server with a wide model selection.


FAQ

What is LiteRT-LM?

LiteRT-LM is Google’s open-source inference framework for running LLMs on edge devices. It powers Chrome, Chromebook Plus, and Pixel Watch. Supports Gemma 4, Llama, Phi-4, Qwen across Android, iOS, Web, Desktop, and Raspberry Pi. Apache 2.0.

How is LiteRT-LM different from Ollama?

LiteRT-LM targets mobile/edge with GPU+NPU acceleration and memory-mapped embeddings. Ollama targets desktop/server. LiteRT-LM powers production Google products; Ollama is a developer tool.

Can I run Gemma 4 on my phone?

Yes. Gemma 4 E2B needs ~1.5GB working memory. Download the Google AI Edge Gallery app or use the Kotlin SDK for integration.

Does LiteRT-LM work offline?

Yes, completely. Models run entirely on-device with zero network calls — private, offline, zero-latency inference.

What models does LiteRT-LM support?

Gemma 4 (E2B, E4B), Gemma 3n, Llama 3.2, Phi-4 Mini, Qwen 2.5, and more in .litertlm format from HuggingFace.


GitHub: github.com/google-ai-edge/LiteRT-LM
Product Site: ai.google.dev/edge/litert-lm
License: Apache 2.0 | Stars: 3,157 | Language: C++