How to Run Qwen 3.5 Small Locally: Complete Setup Guide


Qwen 3.5 Small is Alibaba’s March 2026 release—a family of models from 0.8B to 9B parameters that run entirely on your device. The 9B model matches GPT-OSS-120B on key benchmarks, and the 2B model runs on an iPhone with 4GB RAM. Here’s how to set it up on your hardware.

Last verified: March 2026

Choose Your Model Size

| Model | RAM Needed | Best For | Speed |
| --- | --- | --- | --- |
| Qwen 3.5 Small 0.8B | ~2GB | IoT, embedded, ultra-fast responses | Very fast |
| Qwen 3.5 Small 2B | ~4GB | Phones, lightweight laptops | Fast |
| Qwen 3.5 Small 4B | ~6GB | Laptops, tablets | Good |
| Qwen 3.5 Small 9B | ~8GB | Desktops, workstations (best quality) | Moderate |

Recommendation: Start with 9B if you have 16GB+ RAM. Use 2B or 4B on laptops with 8GB RAM.
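If you're scripting the setup, the sizing guidance above can be encoded as a small helper. This is just a sketch: the tag names follow the Ollama convention used later in this guide, and the RAM thresholds are the rough figures from the table and recommendation, not hard limits.

```python
def pick_model(ram_gb: float) -> str:
    """Suggest a Qwen 3.5 Small Ollama tag for the given amount of RAM.

    Thresholds mirror the sizing table above (rough guidance, not hard limits).
    """
    if ram_gb >= 16:
        return "qwen3.5:9b"    # best quality on desktops/workstations
    if ram_gb >= 6:
        return "qwen3.5:4b"    # balanced choice for laptops/tablets
    if ram_gb >= 4:
        return "qwen3.5:2b"    # phones and lightweight laptops
    return "qwen3.5:0.8b"      # IoT/embedded fallback

print(pick_model(32))  # qwen3.5:9b
```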

Method 1: Ollama (Easiest)

Ollama is the simplest way to run local models on Mac, Windows, and Linux.

Install Ollama

# macOS/Linux
curl -fsSL https://ollama.ai/install.sh | sh

# Windows: Download from ollama.ai

Download and Run Qwen 3.5 Small

# 9B model (best quality)
ollama run qwen3.5:9b

# 4B model (balanced)
ollama run qwen3.5:4b

# 2B model (lightweight)
ollama run qwen3.5:2b

That’s it. Ollama downloads the model and starts a chat interface. First run takes a few minutes to download; subsequent runs start in seconds.

Use as API

Ollama also serves a local API:

curl http://localhost:11434/api/chat -d '{
  "model": "qwen3.5:9b",
  "messages": [{"role": "user", "content": "Explain quantum computing simply"}]
}'
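The same endpoint is easy to call from Python. Here's a minimal sketch using only the standard library; it assumes Ollama's default port 11434 and sets `"stream": false` so the server returns one JSON object instead of a stream of newline-delimited chunks:

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/chat"

def build_payload(prompt: str, model: str = "qwen3.5:9b") -> dict:
    # stream=False asks Ollama for a single JSON response instead of
    # newline-delimited chunks.
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,
    }

def extract_reply(response: dict) -> str:
    # A non-streaming /api/chat response carries the text under message.content.
    return response["message"]["content"]

def chat(prompt: str) -> str:
    req = urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(build_payload(prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return extract_reply(json.load(resp))
```

`chat("Explain quantum computing simply")` mirrors the curl example; `build_payload` and `extract_reply` are split out so you can reuse them with any HTTP client.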

Method 2: LM Studio (Best GUI)

LM Studio provides a visual interface for running local models.

  1. Download LM Studio from lmstudio.ai
  2. Open the app and search for “Qwen 3.5 Small”
  3. Select the quantization that fits your RAM (Q4_K_M for most systems)
  4. Click Download, then Load
  5. Start chatting in the built-in interface

LM Studio also offers a local API server compatible with OpenAI’s format—great for connecting to other tools.
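As a sketch of what that looks like: LM Studio's local server speaks the OpenAI chat-completions format, by default on port 1234 (check the app's server tab for the actual address). The model name below is a placeholder — use whatever identifier LM Studio shows for the model you loaded.

```python
import json
import urllib.request

LMSTUDIO_URL = "http://localhost:1234/v1/chat/completions"

def openai_payload(prompt: str, model: str = "qwen3.5-small-9b") -> dict:
    # OpenAI-style request body; the model name must match the one
    # LM Studio reports for the loaded model.
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.7,
    }

def first_choice(response: dict) -> str:
    # OpenAI-format responses put the text under choices[0].message.content.
    return response["choices"][0]["message"]["content"]

def complete(prompt: str) -> str:
    req = urllib.request.Request(
        LMSTUDIO_URL,
        data=json.dumps(openai_payload(prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return first_choice(json.load(resp))
```

Because the format matches OpenAI's, any tool that lets you override the API base URL can point at this server instead of the cloud.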

Method 3: llama.cpp (Maximum Performance)

For developers who want maximum speed and control:

# Clone and build
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
make -j

# Download GGUF model from Hugging Face
# (search for Qwen3.5-Small-9B-GGUF)

# Run with optimal settings
./llama-cli -m qwen3.5-small-9b-q4_k_m.gguf \
  -c 32768 \
  -n 2048 \
  --temp 0.7 \
  -p "You are a helpful assistant."

Apple Silicon Optimization

On M1/M2/M3/M4 Macs, llama.cpp uses Metal acceleration automatically. For even better performance:

# Build with Metal support (usually automatic)
make LLAMA_METAL=1 -j

# Offload all layers to GPU
./llama-cli -m model.gguf -ngl 99

Method 4: MLX (Apple Silicon Only)

For Mac users, Apple’s MLX framework offers the best performance:

pip install mlx-lm

# Download and run
mlx_lm.generate \
  --model mlx-community/Qwen3.5-Small-9B-4bit \
  --prompt "What is quantum computing?"

MLX is optimized for Apple Silicon’s unified memory architecture, often delivering the fastest inference speeds on Mac.

Multimodal: Text + Images

Qwen 3.5 Small supports image understanding. With Ollama:

# Send an image for analysis (include the file path in the prompt)
ollama run qwen3.5:9b "What's in this image? ./photo.jpg"

This works offline—analyze documents, photos, diagrams, and screenshots without sending data to any cloud.
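Over the API, Ollama accepts images as base64-encoded strings in the message's `images` field. A minimal sketch, reusing the default port from earlier:

```python
import base64
import json
import urllib.request

def image_payload(prompt: str, image_bytes: bytes,
                  model: str = "qwen3.5:9b") -> dict:
    # Ollama's /api/chat takes images as a list of base64 strings
    # alongside the text prompt.
    return {
        "model": model,
        "messages": [{
            "role": "user",
            "content": prompt,
            "images": [base64.b64encode(image_bytes).decode()],
        }],
        "stream": False,
    }

def describe(path: str) -> str:
    with open(path, "rb") as f:
        payload = image_payload("What's in this image?", f.read())
    req = urllib.request.Request(
        "http://localhost:11434/api/chat",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["message"]["content"]
```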

Performance Tips

Quantization Guide

  • Q8_0: Best quality, needs more RAM
  • Q4_K_M: Best balance of quality and speed (recommended)
  • Q4_0: Smallest size, slightly lower quality
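To see what these choices mean on disk, a rough back-of-envelope: file size ≈ parameters × bits-per-weight ÷ 8. The bits-per-weight figures below are approximate averages for these GGUF schemes (they vary by model, and real files carry some metadata overhead):

```python
# Approximate average bits per weight for common GGUF quantizations.
BITS_PER_WEIGHT = {"Q8_0": 8.5, "Q4_K_M": 4.8, "Q4_0": 4.5}

def approx_size_gb(params_billion: float, quant: str) -> float:
    """Rough on-disk size in GB: params x bits / 8 (ignores overhead)."""
    bits = BITS_PER_WEIGHT[quant]
    return params_billion * 1e9 * bits / 8 / 1e9

for quant in ("Q8_0", "Q4_K_M", "Q4_0"):
    print(f"9B at {quant}: ~{approx_size_gb(9, quant):.1f} GB")
```

For the 9B model this works out to roughly 9–10 GB at Q8_0 versus 5–6 GB at the Q4 variants — which is why Q4_K_M is the usual default.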

Context Length

Qwen 3.5 Small supports long context. Set -c 32768 or higher for document analysis. Reduce if you’re low on RAM.
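Longer context costs RAM mainly through the KV cache, which grows linearly with context length. As a rule of thumb it can be estimated as below — the layer/head numbers used here are placeholder values for illustration, not Qwen 3.5 Small's actual architecture; read the real ones off the model card:

```python
def kv_cache_bytes(ctx: int, n_layers: int, n_kv_heads: int,
                   head_dim: int, bytes_per_elem: int = 2) -> int:
    # Keys + values: 2 tensors per layer, each ctx x n_kv_heads x head_dim,
    # stored at fp16 (2 bytes each) unless the cache itself is quantized.
    return 2 * n_layers * ctx * n_kv_heads * head_dim * bytes_per_elem

# Hypothetical architecture numbers -- check the model card for real ones.
gb = kv_cache_bytes(32768, n_layers=36, n_kv_heads=8, head_dim=128) / 1e9
print(f"~{gb:.1f} GB of KV cache at 32k context")
```

Halving the context halves the cache, which is why dropping `-c` is the first knob to turn when you run out of RAM.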

Batch Size

Increase batch size for faster processing of long inputs:

# Ollama: set in Modelfile
# llama.cpp: use -b 512 or higher

What Can You Do With It?

| Task | Works Well? | Notes |
| --- | --- | --- |
| Chat / Q&A | ✅ Excellent | 9B rivals cloud models |
| Summarization | ✅ Excellent | Great for documents |
| Code generation | ✅ Good | Simple tasks, autocomplete |
| Image analysis | ✅ Good | Documents, photos |
| Translation | ✅ Good | Multi-language support |
| Complex reasoning | ⚠️ Moderate | Cloud models still better |
| Creative writing | ⚠️ Moderate | Smaller models = less creative |

FAQ

How does Qwen 3.5 Small 9B compare to cloud AI?

It matches GPT-OSS-120B (a 13× larger model) on GPQA Diamond (81.7 vs 71.5) and HMMT Feb 2025 (83.2 vs 76.7). It won’t match GPT-5.4 or Claude Opus 4.6, but for a local model it’s remarkably capable.

Can I run it on my phone?

The 2B model runs on recent iPhones (4GB RAM) and Android phones (6GB+ RAM). Apps like MLC Chat and private LLM clients support Qwen models.

Is it really private?

Yes. When running locally, zero data leaves your device. No API calls, no logging, no training on your inputs. This is truly private AI.

Which method should I choose?

Ollama for simplicity, LM Studio for a nice GUI, llama.cpp for maximum performance and control, MLX for Apple Silicon optimization.

Can I fine-tune Qwen 3.5 Small?

Yes. The weights are open, so you can fine-tune with techniques like LoRA and QLoRA using tools such as Unsloth. The 2B and 4B models are practical to fine-tune on consumer GPUs.