How to Run Qwen 3.5 Small Locally: Complete Setup Guide
Qwen 3.5 Small is Alibaba’s March 2026 release—a family of models from 0.8B to 9B parameters that run entirely on your device. The 9B model matches GPT-OSS-120B on key benchmarks, and the 2B model runs on an iPhone with 4GB RAM. Here’s how to set it up on your hardware.
Last verified: March 2026
Choose Your Model Size
| Model | RAM Needed | Best For | Speed |
|---|---|---|---|
| Qwen 3.5 Small 0.8B | ~2GB | IoT, embedded, ultra-fast responses | Very fast |
| Qwen 3.5 Small 2B | ~4GB | Phones, lightweight laptops | Fast |
| Qwen 3.5 Small 4B | ~6GB | Laptops, tablets | Good |
| Qwen 3.5 Small 9B | ~8GB | Desktops, workstations (best quality) | Moderate |
Recommendation: Start with 9B if you have 16GB+ RAM. Use 2B or 4B on laptops with 8GB RAM.
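The RAM figures above follow a rough rule of thumb: at ~4-bit quantization, the weights take about half a gigabyte per billion parameters, plus overhead for the KV cache and runtime. A back-of-the-envelope sketch (the `bits_per_weight` and `overhead_gb` values are assumptions, not measurements; actual usage varies with context length):

```python
def estimate_ram_gb(params_billion: float, bits_per_weight: float = 4.5,
                    overhead_gb: float = 1.5) -> float:
    """Rough RAM estimate for a quantized model.

    Weights take roughly params * bits / 8 bytes; overhead_gb is a guess
    covering the KV cache, activations, and runtime buffers.
    """
    weights_gb = params_billion * bits_per_weight / 8
    return round(weights_gb + overhead_gb, 1)

# Ballpark figures for the Qwen 3.5 Small family at ~4.5 bits/weight:
for size in (0.8, 2, 4, 9):
    print(f"{size}B -> ~{estimate_ram_gb(size)} GB")
```

The table's numbers are a bit higher because they leave headroom for long contexts and the OS.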
Method 1: Ollama (Easiest)
Ollama is the simplest way to run local models on Mac, Windows, and Linux.
Install Ollama
# macOS/Linux
curl -fsSL https://ollama.ai/install.sh | sh
# Windows: Download from ollama.ai
Download and Run Qwen 3.5 Small
# 9B model (best quality)
ollama run qwen3.5:9b
# 4B model (balanced)
ollama run qwen3.5:4b
# 2B model (lightweight)
ollama run qwen3.5:2b
That’s it. Ollama downloads the model and starts a chat interface. First run takes a few minutes to download; subsequent runs start in seconds.
Use as API
Ollama also serves a local API:
curl http://localhost:11434/api/chat -d '{
"model": "qwen3.5:9b",
"messages": [{"role": "user", "content": "Explain quantum computing simply"}]
}'
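The same endpoint is easy to script. By default Ollama streams the reply as one JSON object per line, with the text in `message.content` and `"done": true` on the final object. A minimal sketch for collecting the streamed chunks (the sample lines below are fabricated for illustration; in practice you'd read them from a POST to `http://localhost:11434/api/chat`):

```python
import json

def collect_chat_stream(ndjson_lines):
    """Concatenate the content chunks from a streaming /api/chat reply."""
    parts = []
    for line in ndjson_lines:
        if not line.strip():
            continue
        obj = json.loads(line)
        parts.append(obj.get("message", {}).get("content", ""))
        if obj.get("done"):  # final object in the stream
            break
    return "".join(parts)

# Fabricated sample of what a streamed response looks like:
sample = [
    '{"message": {"role": "assistant", "content": "Qubits "}, "done": false}',
    '{"message": {"role": "assistant", "content": "superpose."}, "done": true}',
]
print(collect_chat_stream(sample))  # Qubits superpose.
```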
Method 2: LM Studio (Best GUI)
LM Studio provides a visual interface for running local models.
- Download LM Studio from lmstudio.ai
- Open the app and search for “Qwen 3.5 Small”
- Select the quantization that fits your RAM (Q4_K_M for most systems)
- Click Download, then Load
- Start chatting in the built-in interface
LM Studio also offers a local API server compatible with OpenAI’s format—great for connecting to other tools.
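Because the server speaks OpenAI's chat-completions format, any OpenAI-compatible client can point at it. A sketch of the request body you'd POST to the server's `/v1/chat/completions` endpoint (the default port 1234 and the model name here are assumptions; check LM Studio's server tab for the real values):

```python
import json

def chat_request(model: str, user_message: str, temperature: float = 0.7) -> dict:
    """Build an OpenAI-format chat-completions body for LM Studio's server.

    POST this as JSON to http://localhost:1234/v1/chat/completions
    (port and model name depend on your LM Studio configuration).
    """
    return {
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
        "temperature": temperature,
    }

body = chat_request("qwen3.5-small-9b", "Summarize this file for me")
print(json.dumps(body))
```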
Method 3: llama.cpp (Maximum Performance)
For developers who want maximum speed and control:
# Clone and build
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
make -j
# Download GGUF model from Hugging Face
# (search for Qwen3.5-Small-9B-GGUF)
# Run with optimal settings
./llama-cli -m qwen3.5-small-9b-q4_k_m.gguf \
-c 32768 \
-n 2048 \
--temp 0.7 \
-p "You are a helpful assistant."
Apple Silicon Optimization
On M1/M2/M3/M4 Macs, llama.cpp uses Metal acceleration automatically. For even better performance:
# Build with Metal support (usually automatic)
make LLAMA_METAL=1 -j
# Offload all layers to GPU
./llama-cli -m model.gguf -ngl 99
Method 4: MLX (Apple Silicon Only)
For Mac users, Apple’s MLX framework offers the best performance:
pip install mlx-lm
# Download and run
mlx_lm.generate \
--model mlx-community/Qwen3.5-Small-9B-4bit \
--prompt "What is quantum computing?"
MLX is optimized for Apple Silicon’s unified memory architecture, often delivering the fastest inference speeds on Mac.
Multimodal: Text + Images
Qwen 3.5 Small supports image understanding. With Ollama:
# Send an image for analysis: include the file path in the prompt
ollama run qwen3.5:9b "What's in this image? ./photo.jpg"
This works offline—analyze documents, photos, diagrams, and screenshots without sending data to any cloud.
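Over the API, images are attached as base64 strings in the message's `images` field. A sketch building such a request (the four stand-in bytes below take the place of a real image file):

```python
import base64
import json

def image_chat_payload(model: str, prompt: str, image_bytes: bytes) -> dict:
    """Build an Ollama /api/chat body with an attached image.

    Ollama expects images as base64-encoded strings in the message's
    "images" list.
    """
    return {
        "model": model,
        "messages": [{
            "role": "user",
            "content": prompt,
            "images": [base64.b64encode(image_bytes).decode("ascii")],
        }],
    }

# Stand-in bytes; in practice: image_bytes = open("photo.jpg", "rb").read()
payload = image_chat_payload("qwen3.5:9b", "What's in this image?", b"\x89PNG")
print(json.dumps(payload)[:80])
```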
Performance Tips
Quantization Guide
- Q8_0: Best quality, needs more RAM
- Q4_K_M: Best balance of quality and speed (recommended)
- Q4_0: Smallest size, slightly lower quality
Context Length
Qwen 3.5 Small supports long context. Set -c 32768 or higher for document analysis. Reduce if you’re low on RAM.
Batch Size
Increase batch size for faster processing of long inputs:
# Ollama: set in Modelfile
# llama.cpp: use -b 512 or higher
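In Ollama, per-model settings live in a Modelfile, from which you build a derived model. A sketch (whether `num_batch` is honored as a Modelfile parameter may depend on your Ollama version; it is accepted in the API's `options` field, while `num_ctx` is the standard way to set context length):

```
# Modelfile — build with: ollama create qwen35-tuned -f Modelfile
FROM qwen3.5:9b
PARAMETER num_ctx 32768
PARAMETER num_batch 512
```

Then run the derived model with `ollama run qwen35-tuned`.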
What Can You Do With It?
| Task | Works Well? | Notes |
|---|---|---|
| Chat / Q&A | ✅ Excellent | 9B rivals cloud models |
| Summarization | ✅ Excellent | Great for documents |
| Code generation | ✅ Good | Simple tasks, autocomplete |
| Image analysis | ✅ Good | Documents, photos |
| Translation | ✅ Good | Multi-language support |
| Complex reasoning | ⚠️ Moderate | Cloud models still better |
| Creative writing | ⚠️ Moderate | Smaller models = less creative |
FAQ
How does Qwen 3.5 Small 9B compare to cloud AI?
It matches GPT-OSS-120B (a 13× larger model) on GPQA Diamond (81.7 vs 71.5) and HMMT Feb 2025 (83.2 vs 76.7). It won’t match GPT-5.4 or Claude Opus 4.6, but for a local model it’s remarkably capable.
Can I run it on my phone?
The 2B model runs on recent iPhones (4GB RAM) and Android phones (6GB+ RAM). Apps like MLC Chat and private LLM clients support Qwen models.
Is it really private?
Yes. When running locally, zero data leaves your device. No API calls, no logging, no training on your inputs. This is truly private AI.
Which method should I choose?
Ollama for simplicity, LM Studio for a nice GUI, llama.cpp for maximum performance and control, MLX for Apple Silicon optimization.
Can I fine-tune Qwen 3.5 Small?
Yes. The open weights allow fine-tuning with tools like Unsloth, LoRA, or QLoRA. The 2B and 4B models are practical to fine-tune on consumer GPUs.