TL;DR

LiteRT-LM is Google’s open-source framework for running LLMs on edge devices — phones, tablets, browsers, wearables, and IoT. Key facts:

  • Powers production Google products: Chrome, Chromebook Plus, Pixel Watch
  • Just added Gemma 4 support — Google’s most capable on-device model (Apache 2.0)
  • Cross-platform: Android, iOS (coming), Web, Desktop, Raspberry Pi
  • Hardware acceleration: GPU and NPU via platform-specific backends
  • Multi-modal: Vision and audio inputs, not just text
  • Function calling: Tool use support for agentic workflows on-device
  • Models: Gemma 4, Gemma 3n, Llama, Phi-4, Qwen, and more
  • 3,157 GitHub stars | Apache 2.0 | C++ core with Kotlin, Python, C++ APIs
  • One command to try: uv tool install litert-lm && litert-lm run --from-huggingface-repo=...

This isn’t a research project — it’s what actually runs AI in Google’s shipping products.


Why LiteRT-LM Matters

The AI industry has a cloud problem. Every API call to GPT, Claude, or Gemini costs money, adds latency, and sends user data to external servers. LiteRT-LM is Google’s answer: run the model directly on the user’s device.

What makes it different from Ollama or llama.cpp:

| Feature | LiteRT-LM | Ollama | llama.cpp |
|---|---|---|---|
| Target | Mobile/edge/IoT | Desktop/server | Desktop/server |
| Platforms | Android, iOS, Web, Pi | macOS, Linux, Windows | macOS, Linux, Windows |
| Optimization | GPU + NPU acceleration | CPU + GPU | CPU + Metal/CUDA |
| Production | Powers Chrome, Pixel Watch | Developer tool | Developer tool |
| Model format | .litertlm (optimized) | GGUF | GGUF |
| Function calling | Built-in | No | No |
| Multi-modal | Vision + audio | Text only | Text + vision |

The key differentiator: LiteRT-LM is specifically optimized for constrained devices. It memory-maps embedding layers (keeping them on disk until needed), uses NPU acceleration where available, and is designed for the 2-8GB RAM reality of phones and wearables.
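The memory-mapping idea is easy to see in miniature. Here is a generic sketch using NumPy's `memmap` — this illustrates the technique, not LiteRT-LM's actual implementation:

```python
import os
import tempfile
import numpy as np

# Write a toy "embedding table" to disk: 1,000-token vocab, 64 dims.
path = os.path.join(tempfile.mkdtemp(), "embeddings.bin")
table = np.arange(1000 * 64, dtype=np.float32).reshape(1000, 64)
table.tofile(path)

# Memory-map the file: the OS pages in only the rows actually touched,
# so the full table never has to sit in RAM at once.
mmap_table = np.memmap(path, dtype=np.float32, mode="r", shape=(1000, 64))

token_id = 42
embedding = mmap_table[token_id]  # reads a single 256-byte row from disk
print(embedding.shape)            # (64,)
```

On a phone with a few GB of RAM, the same trick lets a runtime keep a multi-hundred-MB embedding matrix on flash and fault in only the rows for the tokens in the current prompt.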


What’s New: Gemma 4 on Edge

Google just released Gemma 4 — their most capable open model — with day-one LiteRT-LM support:

  • Gemma 4 E2B (2B effective params): Runs on phones with ~1.5GB working memory
  • Gemma 4 E4B (4B effective params): Better quality, ~3GB working memory
  • Agentic capabilities: Function calling, tool use, multi-step reasoning — all on-device
  • Apache 2.0: Commercially permissive, no restrictions
  • Offline: Zero latency, zero cost, full privacy

From Google’s blog: “In collaboration with Pixel, Qualcomm, and MediaTek, these models run completely offline with near-zero latency across phones, Raspberry Pi, and NVIDIA Jetson Orin Nano.”


Quick Start

No-Code Trial (CLI)

# Install
uv tool install litert-lm

# Run Gemma 4 E2B
litert-lm run \
  --from-huggingface-repo=litert-community/gemma-4-E2B-it-litert-lm \
  gemma-4-E2B-it.litertlm \
  --prompt="What is the capital of France?"

Works on Linux, macOS, Windows (WSL), and Raspberry Pi.

Android (Kotlin)

// Load the .litertlm model, then run a single-turn generation.
val model = LiteRtLm.load(context, "gemma-4-E2B-it.litertlm")
val response = model.generateResponse("What's the weather like?")

Python

import litert_lm

# Load the downloaded .litertlm model, then generate with a token cap.
model = litert_lm.load("gemma-4-E2B-it.litertlm")
response = model.generate("Explain this code:", max_tokens=512)

Download the AI Edge Gallery app and run models on your phone — no code required.


Supported Models

| Model | Effective Size | Memory | Best For |
|---|---|---|---|
| Gemma 4 E2B | 2B | ~1.5GB | Phones, quick tasks |
| Gemma 4 E4B | 4B | ~3GB | Quality on phones/tablets |
| Gemma 3n E2B | 2B | ~1.2GB | Ultra-lightweight |
| Llama 3.2 3B | 3B | ~2GB | General purpose |
| Phi-4 Mini | 3.8B | ~2.5GB | Reasoning tasks |
| Qwen 2.5 3B | 3B | ~2GB | Multilingual |
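A practical way to use the numbers above is to pick the largest model that fits your device's memory budget. A tiny illustrative helper (the model names and figures come from the table; the function itself is not part of any SDK):

```python
# Approximate working-memory needs (GB), taken from the table above.
MODEL_MEMORY_GB = {
    "gemma-4-E2B": 1.5,
    "gemma-4-E4B": 3.0,
    "gemma-3n-E2B": 1.2,
    "llama-3.2-3B": 2.0,
    "phi-4-mini": 2.5,
    "qwen-2.5-3B": 2.0,
}

def largest_model_for(budget_gb: float) -> str:
    """Return the most memory-hungry model that still fits the budget.

    Assumes the budget fits at least one model in the table.
    """
    fitting = {m: gb for m, gb in MODEL_MEMORY_GB.items() if gb <= budget_gb}
    return max(fitting, key=fitting.get)

print(largest_model_for(2.0))  # a ~2 GB budget lands in the 3B class
```

Real deployments should also leave headroom for the KV cache, which grows with context length, so treat these figures as floors rather than totals.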

Practical Use Cases

1. Offline AI Assistant

Build a personal assistant that works without internet. Gemma 4’s agentic capabilities support function calling on-device.
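The shape of an on-device tool-use loop looks roughly like this. This is a pure-Python sketch with a stubbed model output and a made-up tool name — the real LiteRT-LM function-calling API differs:

```python
import json

# Hypothetical local tool registry — the name and schema are illustrative.
TOOLS = {
    "get_battery_level": lambda: {"percent": 87},
}

def run_turn(model_output: str) -> str:
    """If the model emitted a JSON tool call, dispatch it; else return the text."""
    try:
        call = json.loads(model_output)
    except json.JSONDecodeError:
        return model_output                 # plain-text reply, no tool needed
    result = TOOLS[call["name"]]()          # executed locally, fully offline
    return f"Tool {call['name']} returned {result}"

# Stub of what an on-device model might emit for "How's my battery?"
stub = json.dumps({"name": "get_battery_level", "arguments": {}})
print(run_turn(stub))
```

The point is that the whole loop — generation, tool dispatch, and the follow-up turn — stays on the device, which is what makes offline agentic workflows possible.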

2. Privacy-First Applications

Medical apps, legal tools, financial advisors — anything where data cannot leave the device.

3. IoT and Embedded

Run AI on Raspberry Pi for smart home automation, industrial monitoring, or edge analytics.

4. Browser-Based AI

LiteRT-LM powers on-device AI in Chrome. No server costs for AI features.

5. Wearables

Powers AI features on Pixel Watch — demonstrating extreme optimization capabilities.


Honest Limitations

  1. Small models only — targets 1-8B parameters. Don’t expect GPT-5 quality.
  2. iOS Swift API still in development — Android is first-class.
  3. Model conversion required — can’t use GGUF files directly.
  4. Limited model library — ~20 models vs Ollama’s hundreds.
  5. C++ complexity — building from source is non-trivial.

LiteRT-LM vs Alternatives

| Feature | LiteRT-LM | Ollama | llama.cpp | MLX |
|---|---|---|---|---|
| Mobile | Native | No | Partial | No |
| Browser | Yes | No | Via WASM | No |
| IoT/Pi | Yes | Yes | Yes | No |
| NPU accel | Yes | No | No | No |
| Function calling | Built-in | No | No | No |
| Production use | Chrome, Pixel | Dev tool | Dev tool | Dev tool |
| Model library | ~20 | 100+ | 100+ | 50+ |

Choose LiteRT-LM if you are building for mobile, browser, or IoT and need hardware acceleration. Choose Ollama if you want the simplest local LLM setup on desktop/server with a wide model selection.


FAQ

What is LiteRT-LM?

LiteRT-LM is Google’s open-source inference framework for running LLMs on edge devices. It powers Chrome, Chromebook Plus, and Pixel Watch. Supports Gemma 4, Llama, Phi-4, Qwen across Android, iOS, Web, Desktop, and Raspberry Pi. Apache 2.0.

How is LiteRT-LM different from Ollama?

LiteRT-LM targets mobile/edge with GPU+NPU acceleration and memory-mapped embeddings. Ollama targets desktop/server. LiteRT-LM powers production Google products; Ollama is a developer tool.

Can I run Gemma 4 on my phone?

Yes. Gemma 4 E2B needs ~1.5GB working memory. Download the Google AI Edge Gallery app or use the Kotlin SDK for integration.

Does LiteRT-LM work offline?

Yes, completely. Models run entirely on-device with zero network calls — private, offline, zero-latency inference.

What models does LiteRT-LM support?

Gemma 4 (E2B, E4B), Gemma 3n, Llama 3.2, Phi-4 Mini, Qwen 2.5, and more in .litertlm format from HuggingFace.


GitHub: github.com/google-ai-edge/LiteRT-LM
Product Site: ai.google.dev/edge/litert-lm
License: Apache 2.0 | Stars: 3,157 | Language: C++