🤖 What You Will Build

A fully local AI assistant that listens to your voice, thinks with Gemma 4, decides on its own when to look through the webcam, and speaks its answers back to you. No internet connection required after setup.

  • Hardware: Jetson Orin Nano 8GB
  • Model: Gemma 4 E2B (quantized)
  • Capabilities: Voice input → AI reasoning → Vision when needed → Voice output
  • Cloud dependency: None

You Can Now Run a Voice + Vision AI Assistant Locally — And It Decides When to Use the Camera on Its Own

The thing that makes this build genuinely interesting is not the voice capability or the vision capability on its own. It is that the model decides autonomously when to use the camera. You do not say "look at this" or trigger any camera mode manually. You just ask a question — and if the model determines that seeing would help it answer better, it calls a tool that captures a webcam frame and incorporates what it sees into its response.

This is what VLA — Vision-Language-Action — means in practice. The model reasons about which of its available capabilities to use, then uses them. It is a small but meaningful step toward AI systems that act rather than just respond.

And it runs entirely on a Jetson Orin Nano. No cloud. No API keys. No monthly fees. Real-time inference on hardware that fits in your hand.


Understanding the Pipeline

Before touching any commands, understand what is actually happening when the system runs:

You speak
    ↓
Speech-to-Text (STT) — converts your voice to text
    ↓
Gemma 4 — reasons about your question
    ↓
    ├── If vision is useful: captures webcam → analyzes → responds
    └── If vision not needed: responds directly
    ↓
Text-to-Speech (TTS) — converts response to audio
    ↓
Speaker output

The critical decision point is step 3 — Gemma 4 evaluating whether vision would help. This is handled through a tool-calling mechanism. The model has access to a tool called look_and_answer, and it calls that tool when it judges that visual information would improve its response. No hard rules, no keywords, no user triggers. The model makes the call.


Hardware Requirements

  • Jetson Orin Nano 8GB — the 8GB variant is required; 4GB does not have enough RAM
  • USB webcam — any standard UVC-compatible webcam works
  • USB microphone — a dedicated USB mic gives better results
  • USB speaker or headphones — for audio output
  • Keyboard — for initial setup
  • Internet connection — only needed during setup to download the model

Step 1: Get the Code

Clone the repository that contains the VLA demo script:

git clone https://github.com/asierarranz/Google_Gemma.git
cd Google_Gemma/Gemma4

If you prefer to download just the script without the full repository:

wget https://raw.githubusercontent.com/asierarranz/Google_Gemma/main/Gemma4/Gemma4_vla.py

Step 2: Install System Dependencies

Update your package list and install the required system packages:

sudo apt update
sudo apt install -y \
  git build-essential cmake curl wget pkg-config \
  python3-pip python3-venv python3-dev \
  alsa-utils pulseaudio-utils v4l-utils psmisc \
  ffmpeg libsndfile1

This step will take a few minutes. The most important packages are alsa-utils and v4l-utils, which you will need later to identify your microphone and webcam device identifiers.


Step 3: Set Up Python Environment

Create and activate a virtual environment, then install the required Python packages:

python3 -m venv .venv
source .venv/bin/activate
pip install --upgrade pip
pip install opencv-python-headless onnx_asr kokoro-onnx soundfile huggingface-hub numpy

What each package does:

  • opencv-python-headless — webcam capture without display server dependency
  • onnx_asr — speech-to-text conversion
  • kokoro-onnx — text-to-speech synthesis
  • soundfile — audio file reading and writing
  • huggingface-hub — model downloading utilities

Step 4: Optimize RAM — Critical on Jetson

The Jetson Orin Nano 8GB has shared CPU and GPU memory. Create a swap file and free up RAM by stopping services you do not need:

Create an 8GB swap file:

sudo fallocate -l 8G /swapfile
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile

Free up RAM by stopping non-essential services:

sudo systemctl stop docker 2>/dev/null || true
pkill -f gnome-software || true
free -h

After these steps you should see at least 4–5GB available memory plus 8GB swap. That is enough to load and run the quantized Gemma 4 model.


Step 5: Build llama.cpp With CUDA Support

You need to build llama.cpp from source with CUDA support to use the Jetson's GPU for acceleration:

git clone https://github.com/ggml-org/llama.cpp.git
cd llama.cpp
 
cmake -B build \
  -DGGML_CUDA=ON \
  -DCMAKE_CUDA_ARCHITECTURES="87" \
  -DGGML_NATIVE=ON \
  -DCMAKE_BUILD_TYPE=Release
 
cmake --build build --config Release -j4

This build will take 10–20 minutes on the Jetson. Use -j4 to limit parallel compilation — using more will exhaust RAM and fail. Be patient with this step.


Step 6: Download the Gemma 4 Model Files

You need two files — the main language model weights and the multimodal projection weights for vision:

mkdir -p ~/models && cd ~/models
 
wget -O gemma-4-E2B-it-Q4_K_M.gguf \
  https://huggingface.co/unsloth/gemma-4-E2B-it-GGUF/resolve/main/gemma-4-E2B-it-Q4_K_M.gguf
 
wget -O mmproj-gemma4-e2b-f16.gguf \
  https://huggingface.co/ggml-org/gemma-4-E2B-it-GGUF/resolve/main/mmproj-gemma4-e2b-f16.gguf
⚠️ If you have very limited RAM, use Q3_K_M quantization instead — replace Q4_K_M with Q3_K_M in both the filename and URL. Quality will be slightly lower but it will use less memory.

Step 7: Start the Model Server

Launch the llama-server with your downloaded model files:

~/llama.cpp/build/bin/llama-server \
  -m ~/models/gemma-4-E2B-it-Q4_K_M.gguf \
  --mmproj ~/models/mmproj-gemma4-e2b-f16.gguf \
  --host 0.0.0.0 --port 8080 \
  -ngl 99 \
  --flash-attn on \
  --jinja

What these flags do:

  • -ngl 99 — offload 99 layers to GPU for maximum acceleration
  • --flash-attn on — faster inference with less memory
  • --jinja — enables tool calling support
  • --host 0.0.0.0 — allows connections from other devices on your network

Wait for the server to finish loading — this takes 1–2 minutes. You will see a message when it is ready.


Step 8: Verify the Server Is Running

Open a new terminal and test the API:

curl http://localhost:8080/v1/chat/completions

You should receive a JSON response. If you get a connection refused error, the server is not yet ready — check the server terminal for error messages.


Step 9: Identify Your Audio and Camera Devices

Find the device identifiers for your microphone, speaker, and webcam:

arecord -l
pactl list short sinks
v4l2-ctl --list-devices

From arecord -l output — find your USB microphone card and device number. For example card 3, device 0 gives you plughw:3,0.

From pactl list short sinks — copy the full sink name for your speaker.

From v4l2-ctl --list-devices — note your webcam device number, typically 0.


Step 10: Run the Assistant

Activate your virtual environment and run the main script with your device identifiers:

source .venv/bin/activate
 
export MIC_DEVICE="plughw:3,0"
export SPK_DEVICE="alsa_output.usb-YourDevice.analog-stereo"
export WEBCAM=0
export VOICE="af_jessica"
 
python3 Gemma4_vla.py

For text-only mode without microphone:

python3 Gemma4_vla.py --text

How the Vision Tool Works

The model autonomously decides when to use the camera through tool calling. The tool is defined like this:

{
  "name": "look_and_answer",
  "description": "Take a photo with webcam and analyze what is visible.",
  "parameters": {
    "type": "object",
    "properties": {
      "question": {
        "type": "string",
        "description": "The question to answer about what the camera sees"
      }
    },
    "required": ["question"]
  }
}

When the model receives your question, it reasons about whether calling this tool would help. If you ask something visual — identifying an object, reading text in the environment, describing what is present — the model generates a tool call. The Python script intercepts this, captures a webcam frame, sends it back to the model, and gets a final response. No user trigger needed. The model decides.


Docker Alternative

If you prefer not to build llama.cpp from source, use the NVIDIA Docker image:

sudo docker run -it --rm \
  --runtime=nvidia \
  --network host \
  ghcr.io/nvidia-ai-iot/llama_cpp \
  llama-server -hf unsloth/gemma-4-E2B-it-GGUF:Q4_K_S

Troubleshooting

  • Out of memory / server crashes — Switch to Q3_K_M quantization. Verify swap is active with free -h.
  • No sound or mic not working — Run arecord -l and pactl list again. Device names are case-sensitive and must match exactly.
  • First response is slow — This is normal. Caches warm up on first inference. Subsequent responses will be faster.
  • Webcam not found — Run v4l2-ctl --list-devices and update your WEBCAM environment variable to match the actual device number.

The Bigger Picture

Step back and think about what this system does. It listens to natural speech. It understands meaning. It reasons about whether visual information from the physical environment would help it respond better. It captures and interprets images when appropriate. It speaks its response aloud. All locally, in real time, on a device that costs a few hundred dollars.

A year ago this required cloud connectivity, API keys, and ongoing subscription costs. Today it runs on a device you own, disconnected from the internet, with no ongoing costs and no data leaving your local network.

This is not a demo of what will be possible eventually. It is a working demonstration of what is possible right now.

⚡ Quick Reference — All Commands in Order

# 1. Get code
git clone https://github.com/asierarranz/Google_Gemma.git
cd Google_Gemma/Gemma4
 
# 2. System dependencies
sudo apt update
sudo apt install -y git build-essential cmake curl wget \
  pkg-config python3-pip python3-venv python3-dev \
  alsa-utils pulseaudio-utils v4l-utils psmisc ffmpeg libsndfile1
 
# 3. Python environment
python3 -m venv .venv
source .venv/bin/activate
pip install --upgrade pip
pip install opencv-python-headless onnx_asr kokoro-onnx soundfile huggingface-hub numpy
 
# 4. Swap and RAM
sudo fallocate -l 8G /swapfile
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile
sudo systemctl stop docker 2>/dev/null || true
 
# 5. Build llama.cpp
git clone https://github.com/ggml-org/llama.cpp.git
cd llama.cpp
cmake -B build -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES="87" \
  -DGGML_NATIVE=ON -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release -j4
 
# 6. Download models
mkdir -p ~/models && cd ~/models
wget -O gemma-4-E2B-it-Q4_K_M.gguf \
  https://huggingface.co/unsloth/gemma-4-E2B-it-GGUF/resolve/main/gemma-4-E2B-it-Q4_K_M.gguf
wget -O mmproj-gemma4-e2b-f16.gguf \
  https://huggingface.co/ggml-org/gemma-4-E2B-it-GGUF/resolve/main/mmproj-gemma4-e2b-f16.gguf
 
# 7. Start server
~/llama.cpp/build/bin/llama-server \
  -m ~/models/gemma-4-E2B-it-Q4_K_M.gguf \
  --mmproj ~/models/mmproj-gemma4-e2b-f16.gguf \
  --host 0.0.0.0 --port 8080 \
  -ngl 99 --flash-attn on --jinja
 
# 8. Identify devices (new terminal)
arecord -l
pactl list short sinks
v4l2-ctl --list-devices
 
# 9. Run assistant
source .venv/bin/activate
export MIC_DEVICE="plughw:3,0"
export SPK_DEVICE="your_speaker_sink"
export WEBCAM=0
export VOICE="af_jessica"
python3 Gemma4_vla.py