Small-to-Medium language models you can run on your laptop — fully offline

Recent advances in model compression and quantization have made it possible to run powerful language models entirely offline on a typical laptop. Thanks to the open-source community and generous licensing from some model developers, it's no longer necessary to rely on cloud-based APIs for many natural language processing or chatbot tasks. Below are five of the best-performing small-to-medium offline LLMs (ranging from 7B to 13B parameters).
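As a rough sanity check on the sizes quoted below, you can estimate a quantized model's footprint as parameters × bits-per-weight ÷ 8, plus a little overhead. Here is a back-of-envelope sketch in Python; the ~4.5 effective bits and 5% overhead are loose assumptions meant to approximate a typical 4-bit GGUF, not measured values:

    def quantized_size_gb(params_billion, bits_per_weight=4.5, overhead=1.05):
        """Back-of-envelope estimate of a quantized model's on-disk size.

        bits_per_weight ~4.5 roughly matches a common 4-bit GGUF mix (some
        tensors stay at higher precision); overhead covers embeddings and
        metadata. Both figures are assumptions, not exact numbers.
        """
        total_bytes = params_billion * 1e9 * bits_per_weight / 8 * overhead
        return total_bytes / 1e9  # decimal gigabytes

    for name, params in [("Qwen2.5-7B", 7.6), ("Gemma 2 9B", 9.0),
                         ("Mistral 7B", 7.0), ("Xwin-LM 13B", 13.0),
                         ("DeepSeek-R1 Distill 8B", 8.0)]:
        print(f"{name}: ~{quantized_size_gb(params):.1f} GB")

The results line up reasonably well with the file sizes listed for each model below.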

1. Qwen2.5-7B-Instruct

Model Overview
Qwen2.5-7B-Instruct is the instruction-tuned 7B model from Alibaba’s Qwen2.5 series. It excels at coding, general Q&A, and multilingual tasks, and I’ve personally found it very efficient at inference, thanks in part to its grouped-query attention (GQA) optimizations.

  • Size: 7.6B parameters. The quantized model weighs around 4–5 GB.
  • Performance: Scores around 74 on MMLU, outscoring even some older 13B models, and handles coding tasks well (~57.9% on HumanEval).
  • Hardware Requirements: When quantized to 4 bits, it can run on around 5 GB of RAM. If relying solely on CPU, I recommend using an 8-core or higher processor for decent generation speed, though a 6 GB VRAM GPU is even better.
  • Tools & OS Support:
    • Available as GGUF quantized files for llama.cpp.
    • Compatible with Windows, macOS, and Linux (LM Studio, Ollama, etc.).
  • License: Open-source under Apache 2.0, which allows commercial use.

What it's good for:
Qwen2.5-7B has proven to be a well-rounded model with strong multilingual capabilities and efficient inference, making it a great baseline option for an offline assistant.
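If you'd rather script it than use a GUI like LM Studio, the llama-cpp-python bindings can load a GGUF file directly. A minimal sketch, assuming you've already downloaded a 4-bit Qwen2.5-7B-Instruct GGUF (the file name below is a placeholder):

    from llama_cpp import Llama  # pip install llama-cpp-python

    # Point at whatever 4-bit GGUF you downloaded (placeholder file name).
    llm = Llama(
        model_path="./qwen2.5-7b-instruct-q4_k_m.gguf",
        n_ctx=4096,    # context window to allocate
        n_threads=8,   # roughly match your physical core count
    )

    out = llm.create_chat_completion(
        messages=[{"role": "user",
                   "content": "Summarize what a GGUF file is in one sentence."}],
        max_tokens=128,
    )
    print(out["choices"][0]["message"]["content"])

The same pattern works for every llama.cpp-compatible model in this post; only the GGUF file changes.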


2. Gemma 2 (9B Instruct)

Model Overview
Gemma 2 (9B) is Google DeepMind’s open LLM that focuses on high performance at relatively low parameter counts. It's an extension of the same research that led to Google’s Gemini.

  • Size: 9B parameters, with a 4-bit quantized file of about 5–6 GB.
  • Performance: Consistently tops benchmarks for models in the 7–13B range. The 27B version reportedly rivals some 70B models, though that’s less practical for a laptop.
  • Hardware Requirements:
    • A 4-bit quantization requires around 6–8 GB of system memory.
    • CPU-only mode is feasible if you have at least 16 GB of total system RAM.
    • A GPU with 8 GB VRAM accelerates inference nicely.
  • Tools & OS Support:
    • Integrated with Hugging Face Transformers and a dedicated gemma.cpp (similar to llama.cpp).
    • Runs on Windows, macOS, or Linux with Ollama or LM Studio.
  • License: Released under Google’s custom Gemma terms of use, which permit commercial use subject to a prohibited-use policy.

What it's good for:
Gemma 2 offers top-tier performance in this parameter range, making it an excellent choice if you need something more powerful than 7B but still want to keep resource usage manageable.
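Because Gemma 2 is integrated with Hugging Face Transformers, you can also quantize it on the fly with bitsandbytes instead of using a pre-made GGUF. This route needs a CUDA GPU and an accepted license on the google/gemma-2-9b-it model page; treat the settings below as a sketch rather than a tuned recipe:

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

    model_id = "google/gemma-2-9b-it"  # official 9B instruct checkpoint
    bnb = BitsAndBytesConfig(load_in_4bit=True,
                             bnb_4bit_compute_dtype=torch.bfloat16)

    tok = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id, quantization_config=bnb, device_map="auto")

    prompt = tok.apply_chat_template(
        [{"role": "user", "content": "Give me two tips for running LLMs on a laptop."}],
        tokenize=False, add_generation_prompt=True)
    inputs = tok(prompt, return_tensors="pt").to(model.device)
    output = model.generate(**inputs, max_new_tokens=150)
    print(tok.decode(output[0], skip_special_tokens=True))

On a CPU-only laptop, the GGUF route via gemma.cpp or llama.cpp remains the more practical option.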


3. Mistral 7B (OpenHermes Fine-Tune)

Model Overview
Mistral 7B has gained a reputation for outperforming larger models (like Llama 2 13B) in many benchmarks, thanks to its efficient architecture. Community fine-tunes such as OpenHermes-2.5 further refine its instruction-following capabilities.

  • Size: 7B parameters, with a 4-bit quantized file around 4 GB.
  • Performance: Known to beat some 13B models in benchmark tests. It also supports context windows of up to 8K tokens via sliding-window attention.
  • Hardware Requirements: Extremely lightweight, only about 4 GB in 4-bit form. Even a modest laptop CPU can handle it, and a 4–6 GB VRAM GPU is more than enough.
  • Tools & OS Support:
    • Compatible with llama.cpp-based tools (Ollama, LM Studio, text-generation-webui).
    • No OS restrictions: Windows, macOS, Linux are all supported.
  • License: Apache 2.0, fully open-source with commercial usage allowed.

What it's good for:
Mistral 7B is arguably the best all-around “tiny” LLM I’ve tried. Its efficiency and performance balance make it perfect for experimentation or general-purpose usage on almost any decent laptop.
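Ollama is a particularly convenient way to run this one: once the daemon is up, it exposes a local REST API on port 11434, so any script can query the model. A minimal sketch, assuming you've already pulled an OpenHermes build (e.g. with "ollama pull openhermes"):

    import requests

    # Ollama serves a local REST API once the daemon is running.
    resp = requests.post(
        "http://localhost:11434/api/chat",
        json={
            "model": "openhermes",  # assumes the model was pulled beforehand
            "messages": [{"role": "user",
                          "content": "Explain sliding-window attention in two sentences."}],
            "stream": False,        # return one JSON object instead of a stream
        },
        timeout=300,
    )
    resp.raise_for_status()
    print(resp.json()["message"]["content"])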


4. Xwin-LM 13B

Model Overview
Xwin-LM 13B is a refined variant of Meta’s Llama-2 13B. It has been extensively fine-tuned with Reinforcement Learning from Human Feedback (RLHF), resulting in high-quality, instruction-aligned outputs.

  • Size: 13B parameters, typically 7–8 GB in 4-bit quantized form.
  • Performance: Often compared to GPT-3.5-level quality in conversational tasks. Early versions have shown >90% win rates against text-davinci-003 on the AlpacaEval benchmark.
  • Hardware Requirements:
    • About 10–12 GB of RAM in 4-bit mode, so a laptop with 16 GB memory is recommended.
    • A GPU with 8+ GB VRAM is ideal.
  • Tools & OS Support:
    • Runs on Windows, macOS, Linux.
    • Functions with llama.cpp, text-generation-inference, Ollama, LM Studio, etc.
  • License: Inherits Meta’s Llama 2 Community License, which permits personal and commercial use under certain conditions.

What it's good for:
Xwin-LM 13B stands out for its well-rounded chat abilities and alignment. I’d consider it a close approximation to ChatGPT-level responses for offline usage, provided I have enough system resources.
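Since the 13B class is where most laptops start running out of headroom, I find it useful to check free memory before deciding which quantized file to load. A small sketch using psutil; the thresholds are rough rules of thumb derived from the figures above, not hard limits:

    import psutil

    def suggest_model_class():
        """Suggest a parameter class based on currently available RAM.
        Thresholds are rough rules of thumb for 4-bit quantized weights."""
        free_gb = psutil.virtual_memory().available / 1e9
        if free_gb >= 12:
            return "13B class (e.g. Xwin-LM 13B): ~7-8 GB of weights plus KV cache"
        if free_gb >= 6:
            return "7-9B class (e.g. Mistral 7B, Qwen2.5-7B, Gemma 2 9B)"
        return "Too little free memory for a comfortable 4-bit 7B; close some apps first"

    print(suggest_model_class())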


5. DeepSeek-R1 Distilled (8B)

Model Overview
DeepSeek-R1 Distilled Llama 8B is a Llama-based 8B model fine-tuned on reasoning data generated by the much larger DeepSeek-R1 “teacher” model. The result is an 8B model that demonstrates outstanding reasoning and problem-solving for its size.

  • Size: 8B parameters, 4.9 GB at 4-bit quantization.
  • Performance: Often rivals or exceeds older 13B models in complex tasks, especially chain-of-thought reasoning and math word problems.
  • Hardware Requirements:
    • Only about 5 GB of RAM needed for 4-bit.
    • Very CPU-friendly.
  • Tools & OS Support:
    • Provided as GGUF or GPTQ formats on Hugging Face.
    • Compatible with llama.cpp, LM Studio, and more, across Windows/macOS/Linux.
  • License: MIT license, which is highly permissive and suitable for commercial applications.

What it's good for:
DeepSeek-R1 Distilled 8B is an excellent balance of small footprint and surprisingly strong logical capabilities. It’s especially useful if you’re limited by RAM but still want solid reasoning skills.
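One practical note for building on top of it: R1-style distills emit their chain of thought between <think> and </think> tags before the final answer, so a local app usually wants to separate the two. A small, purely illustrative post-processing sketch that assumes the output follows that tag convention:

    import re

    def split_reasoning(text):
        """Split an R1-style response into (reasoning, answer).
        Assumes at most one leading <think>...</think> block."""
        match = re.search(r"<think>(.*?)</think>", text, flags=re.DOTALL)
        if not match:
            return "", text.strip()
        return match.group(1).strip(), text[match.end():].strip()

    raw = "<think>14 * 3 = 42, and 42 + 7 = 49.</think>The answer is 49."
    reasoning, answer = split_reasoning(raw)
    print("Reasoning:", reasoning)
    print("Answer:", answer)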


Quick Comparison Table

Here’s a side-by-side breakdown of the five models:

| Model | Params | Quant Size | Performance | Hardware | Platforms | License |
|---|---|---|---|---|---|---|
| Qwen2.5-7B | 7.6B | 4–5 GB (4-bit) | ~74 MMLU; excels at coding & multilingual tasks | ~5 GB RAM (4-bit), runs on CPU/GPU | llama.cpp, LM Studio, Ollama; multi-OS | Apache 2.0 |
| Gemma 2 (9B) | 9B | 5–6 GB (4-bit) | Near top-of-class for ≤13B; efficient architecture | ~6–8 GB RAM for CPU, or 8 GB VRAM GPU | gemma.cpp, HF Transformers, Ollama | Gemma terms (commercial use permitted) |
| Mistral 7B | 7B | ~4 GB (4-bit) | Surpasses Llama 2 13B on many benchmarks | ~4 GB RAM (4-bit), extremely fast and light | llama.cpp-based tools; multi-OS | Apache 2.0 |
| Xwin-LM 13B | 13B | ~7–8 GB (4-bit) | ChatGPT-like quality, extensive RLHF tuning | ~10 GB+ RAM; GPU recommended for speed | llama.cpp, text-gen UIs; multi-OS | Llama 2 license |
| DeepSeek Distill 8B | 8B | ~4.9 GB (4-bit) | Distilled from the ~670B DeepSeek-R1; excels at reasoning | ~5 GB RAM, CPU-friendly | llama.cpp, LM Studio; multi-OS | MIT |

Closing Thoughts

All of these models are capable of running fully offline, assuming you have enough memory to store and load the quantized weights. The sweet spot for most laptops is a 7–9B model, which offers a great balance between performance and resource requirements. Models like Mistral 7B and DeepSeek 8B are especially lightweight, making them ideal for those with limited hardware.

If you have at least 16 GB of RAM (and possibly a decently sized GPU), models like Xwin-LM 13B can deliver nearly ChatGPT-level responses while preserving privacy and independence from cloud services. The open-source licensing on most of these models allows for commercial use, which is a significant advantage if you’re integrating these solutions into your own apps or businesses.

Running an LLM locally means being in complete control of one’s data and workload. Whether exploring offline coding assistants, building chatbots, or experimenting with advanced NLP tasks, these models offer a variety of capabilities and resource footprints. This is an exciting time to be a software engineer, as the barriers to entry for offline AI continue to drop.

If you have any questions or if you’re experimenting with other local models, feel free to reach out in the comments!


Grouped-query attention

Grouped-query attention (GQA) is an optimization used in transformer models to reduce the memory and bandwidth overhead of the multi-head attention mechanism. In traditional multi-head attention, every head computes its own queries, keys, and values independently, and all of the per-head keys and values must be kept around during generation (the KV cache), which becomes expensive at longer context lengths and larger model sizes.

How GQA Works:

  • Grouping Query Heads: Instead of giving every query head its own key and value head, GQA divides the query heads into groups, and each group shares a single key/value head. Queries are still computed per head; it is the keys and values that are shared.
  • Reduced Memory and Computation: Sharing keys and values shrinks the KV cache and the number of key/value projections, which lowers memory usage and speeds up inference, especially during token-by-token generation, where memory bandwidth is the main bottleneck.
  • Maintained Performance: Because each head keeps its own query projection, attention patterns stay diverse, so quality stays close to full multi-head attention while approaching the efficiency of multi-query attention (where every head shares one key/value pair).
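To make the key/value sharing concrete, here is a tiny, self-contained NumPy sketch of the grouping logic, with toy shapes, no masking or batching, and randomly generated weights; it illustrates the mechanism rather than any production implementation:

    import numpy as np

    def grouped_query_attention(x, wq, wk, wv, n_q_heads, n_kv_heads):
        """Toy single-sequence GQA: n_q_heads query heads share n_kv_heads K/V heads.
        Shapes: x (seq, d_model); wq (d_model, n_q_heads*d_head);
                wk, wv (d_model, n_kv_heads*d_head)."""
        seq = x.shape[0]
        d_head = wq.shape[1] // n_q_heads
        group = n_q_heads // n_kv_heads                    # query heads per K/V head

        q = (x @ wq).reshape(seq, n_q_heads, d_head)
        k = (x @ wk).reshape(seq, n_kv_heads, d_head)      # far fewer K/V heads to cache
        v = (x @ wv).reshape(seq, n_kv_heads, d_head)

        outs = []
        for h in range(n_q_heads):
            kv = h // group                                # shared K/V head for this query head
            scores = q[:, h] @ k[:, kv].T / np.sqrt(d_head)
            weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
            weights /= weights.sum(axis=-1, keepdims=True) # softmax over key positions
            outs.append(weights @ v[:, kv])
        return np.concatenate(outs, axis=-1)               # (seq, n_q_heads*d_head)

    # Tiny demo: 8 query heads sharing 2 K/V heads, so the K/V cache is 4x smaller.
    rng = np.random.default_rng(0)
    x = rng.standard_normal((5, 32))
    out = grouped_query_attention(
        x,
        wq=rng.standard_normal((32, 8 * 8)),
        wk=rng.standard_normal((32, 2 * 8)),
        wv=rng.standard_normal((32, 2 * 8)),
        n_q_heads=8, n_kv_heads=2)
    print(out.shape)  # (5, 64)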

Why It Matters:
For models running on laptops or other resource-constrained hardware, such optimizations can be a game changer. They allow the model to handle longer inputs and generate responses more efficiently without sacrificing performance—crucial for offline applications where computational resources may be limited.

In summary, GQA is a smart tweak to the standard attention mechanism that makes language models more efficient, particularly useful when trying to run high-performing models locally.