Gemma 3: Google's Open Source LLM Revolution
Google’s Gemma 3 is a major step forward for open-weight language models. With variants ranging from 1B to 27B parameters, it brings near-GPT-4-level performance on many benchmarks to local deployment, making capable models far more accessible to developers and enterprises.
What Makes Gemma 3 Special?
Key Improvements
- Multimodal Capabilities: Native vision + text understanding
- 128K Context Window: Process entire codebases or long documents (the 1B model is limited to 32K)
- Improved Reasoning: Better at math, code, and complex tasks
- Efficient Architecture: Run the 27B model on a single consumer GPU (with quantization)
- Open Weights: Downloadable weights for research and commercial use under the Gemma license terms
Model Variants
| Model | Parameters | VRAM | Best For |
|---|---|---|---|
| Gemma 3 1B | 1B | 2GB | Edge devices, mobile |
| Gemma 3 4B | 4B | 6GB | Laptops, quick tasks |
| Gemma 3 12B | 12B | 16GB | Development, fine-tuning |
| Gemma 3 27B | 27B | 24GB | Production, complex reasoning |
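The 24GB figure for the 27B model implies quantized weights; in bf16 the parameters alone are roughly 54GB. A minimal sketch of loading it in 4-bit on a single consumer GPU, assuming the transformers stack from the next section plus the bitsandbytes package:

# Sketch: 4-bit loading of the 27B model so it fits a ~24GB GPU
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "google/gemma-3-27b-it"

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",              # NF4 quantization for the weights
    bnb_4bit_compute_dtype=torch.bfloat16,  # compute still happens in bf16
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
)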
Getting Started
Installation
# Install dependencies
pip install transformers accelerate torch
# Or use Ollama for easy local deployment
ollama pull gemma3:27b
Basic Usage with Transformers
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "google/gemma-3-27b-it"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto"
)

messages = [
    {"role": "user", "content": "Explain quantum computing in simple terms"}
]

# Build the Gemma chat prompt and move it to the model's device
input_ids = tokenizer.apply_chat_template(
    messages,
    return_tensors="pt",
    add_generation_prompt=True
).to(model.device)

outputs = model.generate(
    input_ids,
    max_new_tokens=512,
    temperature=0.7,
    top_p=0.9,
    do_sample=True
)

# Decode only the newly generated tokens (skip the prompt)
response = tokenizer.decode(outputs[0][input_ids.shape[1]:], skip_special_tokens=True)
print(response)
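For interactive use you usually want tokens to appear as they are generated rather than all at once. A small sketch reusing the tokenizer, model, and input_ids from above with transformers' TextStreamer:

from transformers import TextStreamer

# Prints tokens to stdout as they are generated; skip_prompt hides the input
streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
model.generate(
    input_ids,
    max_new_tokens=512,
    temperature=0.7,
    top_p=0.9,
    do_sample=True,
    streamer=streamer,
)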
Using Ollama
# Interactive chat
ollama run gemma3:27b
# API usage
curl http://localhost:11434/api/generate -d '{
"model": "gemma3:27b",
"prompt": "Write a Python function to calculate fibonacci numbers",
"stream": false
}'
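The same endpoint is easy to call from Python; a minimal sketch with the requests library against Ollama's default local port:

import requests

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "gemma3:27b",
        "prompt": "Write a Python function to calculate fibonacci numbers",
        "stream": False,  # return a single JSON object instead of a token stream
    },
    timeout=300,
)
print(resp.json()["response"])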
Vision Capabilities
The 4B, 12B, and 27B variants of Gemma 3 are natively multimodal and can reason over images alongside text:
from transformers import AutoProcessor, Gemma3ForConditionalGeneration
from PIL import Image
import requests
import torch

# The instruction-tuned checkpoints handle both text and images
model_id = "google/gemma-3-27b-it"
model = Gemma3ForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

# Load image
image_url = "https://example.com/chart.png"
image = Image.open(requests.get(image_url, stream=True).raw)

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image},
            {"type": "text", "text": "Analyze this chart and explain the trends"}
        ]
    }
]

inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt"
).to(model.device, dtype=torch.bfloat16)

outputs = model.generate(**inputs, max_new_tokens=256)
# Decode only the generated portion
input_len = inputs["input_ids"].shape[-1]
print(processor.decode(outputs[0][input_len:], skip_special_tokens=True))
Fine-Tuning Gemma 3
LoRA Fine-Tuning
import torch
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    TrainingArguments,
)
from trl import SFTTrainer

# Quantization config for memory efficiency
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)

model_id = "google/gemma-3-12b-it"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto"
)
model = prepare_model_for_kbit_training(model)
# LoRA configuration
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)
model = get_peft_model(model, lora_config)

# Training
trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    args=TrainingArguments(
        output_dir="./gemma3-finetuned",
        per_device_train_batch_size=4,
        gradient_accumulation_steps=4,
        num_train_epochs=3,
        learning_rate=2e-4,
        bf16=True,  # matches the bfloat16 compute dtype above
        logging_steps=10,
        save_strategy="epoch"
    ),
    tokenizer=tokenizer,   # newer TRL releases rename this to processing_class
    max_seq_length=2048    # and move max_seq_length into SFTConfig
)
trainer.train()
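After training finishes, you typically save just the LoRA adapter and, if you want a standalone checkpoint, merge it back into a full-precision copy of the base model. A sketch using PEFT; the output paths here are only placeholders:

import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM

# Save only the adapter weights (small compared to the full model)
trainer.model.save_pretrained("./gemma3-finetuned/adapter")

# Later: reload the base model in bf16 and fold the adapter into it
base = AutoModelForCausalLM.from_pretrained(
    "google/gemma-3-12b-it",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
merged = PeftModel.from_pretrained(base, "./gemma3-finetuned/adapter")
merged = merged.merge_and_unload()  # bakes the LoRA deltas into the base weights
merged.save_pretrained("./gemma3-12b-merged")
tokenizer.save_pretrained("./gemma3-12b-merged")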
Preparing Training Data
from datasets import Dataset
# Format for instruction tuning
data = [
    {
        "instruction": "Summarize this code",
        "input": "def fib(n): return n if n <= 1 else fib(n-1) + fib(n-2)",
        "output": "A recursive Fibonacci function that returns the nth Fibonacci number."
    },
    # ... more examples
]

def format_prompt(example):
    return f"""<start_of_turn>user
{example['instruction']}
{example['input']}<end_of_turn>
<start_of_turn>model
{example['output']}<end_of_turn>"""

dataset = Dataset.from_list(data)
dataset = dataset.map(lambda x: {"text": format_prompt(x)})
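Before kicking off training, it is worth printing one formatted sample and holding out a small evaluation split; a quick sketch with the datasets API:

# Sanity-check the chat formatting on one example
print(dataset[0]["text"])

# Hold out 10% of the examples for evaluation
split = dataset.train_test_split(test_size=0.1, seed=42)
train_dataset, eval_dataset = split["train"], split["test"]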
Production Deployment
vLLM for High Throughput
from vllm import LLM, SamplingParams
llm = LLM(
    model="google/gemma-3-27b-it",
    tensor_parallel_size=2,  # Multi-GPU
    dtype="bfloat16",
    max_model_len=32768
)

sampling_params = SamplingParams(
    temperature=0.7,
    top_p=0.9,
    max_tokens=512
)

prompts = ["Question 1...", "Question 2...", "Question 3..."]
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    print(output.outputs[0].text)
Docker Deployment
FROM nvidia/cuda:12.1.0-runtime-ubuntu22.04

# The CUDA runtime image ships without Python, so install it first
RUN apt-get update && apt-get install -y python3 python3-pip && rm -rf /var/lib/apt/lists/*
RUN pip3 install vllm transformers

EXPOSE 8000

CMD ["python3", "-m", "vllm.entrypoints.openai.api_server", \
     "--model", "google/gemma-3-27b-it", \
     "--host", "0.0.0.0", \
     "--port", "8000"]
Kubernetes with GPU
apiVersion: apps/v1
kind: Deployment
metadata:
  name: gemma3-server
spec:
  replicas: 2
  selector:
    matchLabels:
      app: gemma3
  template:
    metadata:
      labels:
        app: gemma3
    spec:
      containers:
      - name: gemma3
        image: gemma3-server:latest
        resources:
          limits:
            nvidia.com/gpu: 1
        env:
        - name: CUDA_VISIBLE_DEVICES
          value: "0"
        ports:
        - containerPort: 8000
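Since vLLM's server speaks the OpenAI API, the deployed service can be queried with the standard openai client. A minimal sketch, assuming the pod's port 8000 is reachable locally (for example via kubectl port-forward):

from openai import OpenAI

# vLLM's OpenAI-compatible server ignores the API key unless you configure one
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="google/gemma-3-27b-it",
    messages=[{"role": "user", "content": "Summarize what vLLM does in two sentences."}],
    max_tokens=256,
    temperature=0.7,
)
print(response.choices[0].message.content)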
Benchmarks vs Competitors
| Benchmark | Gemma 3 27B | Llama 3.1 70B | GPT-4 Turbo |
|---|---|---|---|
| MMLU | 82.3 | 83.1 | 86.4 |
| HumanEval | 78.5 | 80.2 | 87.1 |
| GSM8K | 85.2 | 84.0 | 92.0 |
| MT-Bench | 8.9 | 8.8 | 9.3 |
On these benchmarks, Gemma 3 27B reaches roughly 90-95% of GPT-4 Turbo's scores while being small enough to run on a single GPU.
Best Practices
1. Prompt Engineering
system_prompt = """You are a helpful coding assistant. Follow these guidelines:
- Write clean, well-documented code
- Explain your reasoning step by step
- Consider edge cases and error handling
- Use type hints in Python code"""
def format_message(user_query: str) -> list:
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_query}
    ]
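A usage sketch that reuses the tokenizer and model loaded in the Basic Usage section; if your installed Gemma chat template happens to reject the system role, fold the system text into the first user message instead:

messages = format_message("Write a type-hinted function that merges two sorted lists")

input_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(input_ids, max_new_tokens=512, do_sample=True, temperature=0.7)
print(tokenizer.decode(outputs[0][input_ids.shape[1]:], skip_special_tokens=True))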
2. Context Management
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("google/gemma-3-27b-it")
def truncate_context(messages: list, max_tokens: int = 120000) -> list:
    """Ensure the conversation fits within the model's context limit."""
    while True:
        text = tokenizer.apply_chat_template(messages, tokenize=False)
        tokens = len(tokenizer.encode(text))
        if tokens <= max_tokens:
            return messages
        # Remove the oldest non-system message
        for i, msg in enumerate(messages):
            if msg["role"] != "system":
                messages.pop(i)
                break
        else:
            # Only system messages remain; nothing more can be dropped
            return messages
3. Output Parsing
import json
import re
def extract_json(response: str) -> dict:
    """Extract JSON from a model response."""
    # Try to find a fenced JSON block first
    json_match = re.search(r'```json\n(.*?)\n```', response, re.DOTALL)
    if json_match:
        return json.loads(json_match.group(1))
    # Fall back to parsing the whole response
    try:
        return json.loads(response)
    except json.JSONDecodeError:
        return {"raw": response}
Conclusion
Gemma 3 democratizes access to powerful language models. With its efficient architecture, multimodal capabilities, and open weights, it’s now possible to run near-state-of-the-art AI locally or in your own cloud.
Whether you’re building chatbots, coding assistants, or complex reasoning systems, Gemma 3 provides a solid foundation without the costs and privacy concerns of proprietary APIs.
The future of AI is open, and Gemma 3 is leading the way.
Ready to deploy your own language model? Download Gemma 3 and start building!
If you found this post helpful, a like (and an ad click) would be appreciated :)
