Gemma 3 1B (llama.cpp)

Google

Production AI assistant via llama.cpp: 50-100x faster than Ollama with 114ms first token latency. Optimized for real-time chat interactions.

Text Generation Local Gemma Family v1B

Back to Models

Parameters

params

Context Window

tokens

Max Output

tokens

Input Price

per 1M tokens

Output Price

per 1M tokens

Gemma Family 7 models

The full Gemma line by generation — pricing and capabilities vary across the family.

Google

FunctionGemma

Native function calling for on-device agents. Routes complex tasks to larger models. Optimized for edge deployment.

context

Dec 2025

2 9B

General purpose, balanced

context

Jun 2024

2 2B

Edge devices, fast inference

context

Jun 2024

2 27B

High quality generation

context

Jun 2024

3 1B (llama.cpp) Current

Fast AI assistant for chat, code generation, and reasoning tasks

context

Feb 2024

3 12B

General-purpose, multimodal, coding

Complex reasoning, multimodal, research

131K

context

Mar 2025

Capabilities

👁️

Vision

⚡

Function Calling

📋

JSON Mode

🌊

Streaming

💬

System Prompt

🖥️

Code Execution

🔍

Web Search

🔌

MCP Support

Local Model Specs

Quantization

Q4_K

Architecture

Gemma

Runtime

llama.cpp

VRAM Usage

0.23 GB

Disk Size

0.78 GB

Details

Release Date: February 21, 2024
Knowledge Cutoff: September 1, 2024
Source: Local
License: Open Source
Model ID: gemma3-1b-llama-cpp

Last updated: November 15, 2025