A port of llama2.c to Plan 9 (9front). Run LLaMA inference natively on Plan 9.
- Multi-format support - Load safetensors (HuggingFace) and GGUF (llama.cpp) models
- Quantized inference - Run Q4_0 and Q8_0 quantized GGUF models
- SIMD acceleration - SSE2 assembly for 5.7x matmul speedup
- Multi-threading - Parallel attention heads via Plan 9 libthread
- 9P file server (llmfs) - Multi-model inference over the network with LRU eviction
- Pure Plan 9 C - No external dependencies, uses only Plan 9 libc
- Pure C test suite - No Python or shell scripts required
- Cross-platform - Build and test on Linux, run natively on Plan 9
On Debian/Ubuntu:

```sh
sudo apt-get update
sudo apt-get install qemu-system-x86 qemu-utils mtools curl gcc
```

On Fedora:

```sh
sudo dnf install qemu-system-x86 qemu-img mtools curl gcc
```

On macOS:

```sh
# Install via Homebrew
brew install qemu mtools curl
# Or via MacPorts
sudo port install qemu mtools curl
```

```sh
# Build everything and run tests
make test
```

The test harness automatically:
- Downloads 9front VM image if needed
- Creates a FAT shared disk
- Boots Plan 9 in QEMU
- Compiles and runs all tests inside Plan 9
- Compares output against C reference implementations
```sh
# Install huggingface-cli
pip install huggingface-hub

# Download safetensors model (60MB)
huggingface-cli download Xenova/llama2.c-stories15M model.safetensors \
    --local-dir models/

# Download pre-quantized GGUF model (17MB, Q8_0)
huggingface-cli download tensorblock/Xenova_llama2.c-stories15M-GGUF \
    llama2.c-stories15M-Q8_0.gguf --local-dir models/

# Download tokenizer
wget -O tokenizer.bin \
    https://github.com/karpathy/llama2.c/raw/master/tokenizer.bin
```

9ml supports two model formats with automatic detection:
| Format | Source | Quantization | Use Case |
|---|---|---|---|
| Safetensors | HuggingFace | FP32/FP16 | Full precision inference |
| GGUF | llama.cpp | Q4_0, Q8_0, etc. | Memory-efficient quantized inference |
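Q4_0 and Q8_0 follow the ggml block layout: a Q4_0 block packs 32 weights as 4-bit values plus one fp16 scale (about 4.5 bits per weight), while Q8_0 stores 32 int8 values per scale. The sketch below illustrates Q4_0 dequantization under that layout; it is a reference illustration, not 9ml's actual code, and `f16tof32` is a minimal stand-in for whatever half-float conversion the parser uses:

```c
#include <u.h>
#include <libc.h>

enum { QK = 32 };	/* weights per Q4_0 block (ggml) */

typedef struct {
	u16int	d;		/* fp16 scale for the block */
	uchar	q[QK/2];	/* 32 4-bit quants, two per byte */
} BlockQ4;

/* minimal IEEE half -> float conversion (illustrative) */
static float
f16tof32(u16int h)
{
	uint s, e, m;
	union { u32int u; float f; } v;

	s = h>>15 & 1;
	e = h>>10 & 0x1F;
	m = h & 0x3FF;
	if(e == 0)
		v.f = m / 16777216.0;	/* zero or subnormal: m * 2^-24 */
	else{
		if(e == 31)
			e = 255;	/* inf/nan */
		else
			e = e - 15 + 127;	/* rebias exponent */
		v.u = e<<23 | m<<13;
	}
	return s ? -v.f : v.f;
}

/* ggml Q4_0: low nibbles hold values 0..15, high nibbles 16..31;
 * each 4-bit value is offset by 8, then scaled by the block delta */
void
dequantq4(BlockQ4 *b, float *out)
{
	int i;
	float d;

	d = f16tof32(b->d);
	for(i = 0; i < QK/2; i++){
		out[i]      = ((b->q[i] & 0xF) - 8) * d;
		out[i+QK/2] = ((b->q[i] >> 4) - 8) * d;
	}
}
```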
The loader automatically detects format from file magic:
- GGUF: magic `0x46554747` ("GGUF")
- Safetensors: JSON header with tensor metadata
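A sketch of what such detection can look like in Plan 9 C (hypothetical names, not 9ml's actual functions): GGUF opens with the 4-byte magic, while a safetensors file opens with a little-endian u64 JSON-header length followed by the JSON itself, so peeking at the first few bytes is enough:

```c
#include <u.h>
#include <libc.h>

enum {
	FmtUnknown,
	FmtGGUF,
	FmtSafetensors,
};

/* Illustrative sketch: pick a parser by sniffing the file's
 * leading bytes.  GGUF starts with "GGUF" (0x46554747 LE);
 * safetensors starts with a u64 header length, then JSON. */
int
sniffformat(int fd)
{
	uchar b[9];

	if(pread(fd, b, sizeof b, 0) != sizeof b)
		return FmtUnknown;
	if(memcmp(b, "GGUF", 4) == 0)
		return FmtGGUF;
	if(b[8] == '{')
		return FmtSafetensors;
	return FmtUnknown;
}
```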
For safetensors models, place a `config.json` in the same directory to provide the model configuration (dim, n_heads, rope_theta, etc.).
In Plan 9:
```rc
./run model.safetensors -z tokenizer.bin -n 50 -i 'Once upon a time'
./run model-Q8_0.gguf -z tokenizer.bin -n 50 -i 'Once upon a time'
```
| Option | Description |
|---|---|
| `-t <float>` | Temperature (0.0 = greedy, 1.0 = default) |
| `-p <float>` | Top-p sampling (0.9 = default) |
| `-s <int>` | Random seed |
| `-n <int>` | Number of tokens to generate |
| `-i <string>` | Input prompt |
| `-z <string>` | Path to tokenizer |
| `-m generate\|chat` | Mode (default: generate) |
| `--no-simd` | Disable SIMD acceleration |
| `-j <int>` | Number of threads |
llmfs is a 9P file server that exposes LLM inference as a Plan 9 filesystem. It supports multiple models with LRU eviction and enables distributed inference where a powerful server runs models and thin clients access them over the network.
```
/mnt/llm/
    ctl       # Server control: load, unload, limit
    info      # Pool status: loaded models, memory usage
    clone     # Create new session (returns ID)
    0/        # Session directory
        ctl       # Session control: model, temp, generate
        info      # Session info and status
        prompt    # Write prompt text
        output    # Read generated output (blocks until done)
        stream    # Streaming output
```
In Plan 9:
```rc
cd /usr/glenda/9ml/src
mk llmfs
```

To install globally:

```rc
cp llmfs /amd64/bin/
```
```rc
# Start the file server
llmfs -s llm

# Mount it
mount /srv/llm /mnt/llm

# Load a model (name, model file, tokenizer)
echo 'load small stories15M.safetensors tokenizer.bin' > /mnt/llm/ctl

# Check pool info
cat /mnt/llm/info

# Create a session
session=`{cat /mnt/llm/clone}

# Bind model to session and configure
echo 'model small' > /mnt/llm/$session/ctl
echo 'temp 0.0' > /mnt/llm/$session/ctl
echo 'steps 50' > /mnt/llm/$session/ctl

# Set prompt and generate
echo 'Once upon a time' > /mnt/llm/$session/prompt
echo 'generate' > /mnt/llm/$session/ctl

# Read output (blocks until complete)
cat /mnt/llm/$session/output

# Check session info
cat /mnt/llm/$session/info
```
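The same flow can be driven from C instead of rc. A hypothetical client sketch (it assumes llmfs is already mounted at /mnt/llm with a model named `small` loaded, and that reading `clone` returns the session id):

```c
#include <u.h>
#include <libc.h>

/* write one control string to /mnt/llm/<id>/ctl */
static void
ctl(char *id, char *msg)
{
	char path[64];
	int fd;

	snprint(path, sizeof path, "/mnt/llm/%s/ctl", id);
	fd = open(path, OWRITE);
	if(fd < 0)
		sysfatal("%s: %r", path);
	fprint(fd, "%s\n", msg);
	close(fd);
}

void
main(void)
{
	char id[16], path[64], buf[1024];
	int fd, n;

	/* reading clone allocates a session and yields its id */
	fd = open("/mnt/llm/clone", OREAD);
	if(fd < 0)
		sysfatal("clone: %r");
	n = read(fd, id, sizeof id - 1);
	if(n <= 0)
		sysfatal("read clone: %r");
	while(n > 0 && (id[n-1] == '\n' || id[n-1] == ' '))
		n--;
	id[n] = 0;
	close(fd);

	ctl(id, "model small");
	ctl(id, "steps 50");

	snprint(path, sizeof path, "/mnt/llm/%s/prompt", id);
	fd = open(path, OWRITE);
	fprint(fd, "Once upon a time");
	close(fd);

	ctl(id, "generate");

	/* output blocks until generation is complete */
	snprint(path, sizeof path, "/mnt/llm/%s/output", id);
	fd = open(path, OREAD);
	while((n = read(fd, buf, sizeof buf)) > 0)
		write(1, buf, n);
	close(fd);
	exits(nil);
}
```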
This setup allows you to run llmfs on a remote server and access it from your local machine using drawterm.
- Start QEMU with port forwarding:
```sh
qemu-system-x86_64 -m 1024 -cpu max -enable-kvm \
    -drive file=9front.qcow2,format=qcow2,if=virtio \
    -device virtio-net-pci,netdev=net0 \
    -netdev user,id=net0,hostfwd=tcp::9564-:564,hostfwd=tcp::17019-:17019 \
    -vnc :1
```

- Boot 9front and configure network:
```rc
# Accept defaults at boot prompts (press Enter twice)
# Configure network (QEMU user networking)
ip/ipconfig -6
```
- Set up authentication (factotum):
```rc
# Add authentication key
echo 'key proto=dp9ik dom=llmfs user=glenda !password=yourpassword' > /mnt/factotum/ctl
```
- Start CPU listener:
```rc
# Create service script
mkdir -p /rc/bin/service
cat > /rc/bin/service/tcp17019 <<'EOF'
#!/bin/rc
exec /bin/cpu -R
EOF
chmod +x /rc/bin/service/tcp17019

# Start listener
aux/listen1 -t tcp!*!17019 /rc/bin/service/tcp17019
```
- Start llmfs and load model:
```rc
llmfs -s llm
mount /srv/llm /mnt/llm
echo 'load small /path/to/model.safetensors /path/to/tokenizer.bin' > /mnt/llm/ctl
```
- Install drawterm:
```sh
# macOS / Linux (build from source)
git clone https://github.com/9front/drawterm
cd drawterm
# macOS
CONF=osx-cocoa make
# Linux
make
```

- Connect to the remote server:

```sh
drawterm -h <server-ip>:17019 -a <server-ip>:17019 -u glenda
```

When prompted, enter the password you set in factotum.
- Use llmfs from drawterm:
```rc
# Mount llmfs (already running on server)
mount /srv/llm /mnt/llm

# Create session and generate text
session=`{cat /mnt/llm/clone}
echo 'model small' > /mnt/llm/$session/ctl
echo 'Once upon a time' > /mnt/llm/$session/prompt
echo generate > /mnt/llm/$session/ctl
cat /mnt/llm/$session/output
```
- Resize text: Right-click on window, select "Resize" or use keyboard shortcuts
- Previous command: Use up arrow or Ctrl-P
- Copy/paste: Select text with mouse to copy, middle-click to paste
- Exit: Type `exit` or close the window
Session `ctl` commands:

| Command | Description |
|---|---|
| `model <name>` | Bind session to named model |
| `temp <float>` | Set temperature (0.0 = greedy) |
| `topp <float>` | Set top-p sampling (0.0-1.0) |
| `seed <int>` | Set random seed |
| `steps <int>` | Set max tokens to generate |
| `generate` | Start generation |
| `reset` | Reset session state |
| `close` | Close session |
Server `ctl` commands:

| Command | Description |
|---|---|
| `load <name> <model> <tokenizer>` | Load model into pool with given name |
| `unload <name>` | Unload model from pool |
| `limit <max_models> <max_memory>` | Set pool limits |
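When the pool exceeds its limits, the least-recently-used model that no session holds is evicted (models in use are reference-counted, per src/pool/pool.c). A sketch of the idea, with illustrative names rather than 9ml's actual structures:

```c
#include <u.h>
#include <libc.h>

/* Illustrative pool entry; the real structures in
 * src/pool/pool.c will differ. */
typedef struct Model Model;
struct Model {
	char	*name;
	int	refs;		/* sessions currently using the model */
	uvlong	lastuse;	/* tick of most recent use */
	Model	*next;
};

/* find the least-recently-used model that no session holds;
 * returns a handle the caller can unlink, or nil if all are in use */
Model**
lruvictim(Model **pool)
{
	Model **m, **victim;

	victim = nil;
	for(m = pool; *m != nil; m = &(*m)->next)
		if((*m)->refs == 0)
			if(victim == nil || (*m)->lastuse < (*victim)->lastuse)
				victim = m;
	return victim;
}
```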
| Operation | Implementation | Speedup |
|---|---|---|
| matmul | SSE2 assembly | ~5.7x |
| rmsnorm | SSE2 assembly | ~3x |
| dot_product | SSE2 assembly | ~4x |
| vec_add, vec_scale | SSE2 assembly | ~4x |
| softmax | SSE2 assembly | ~60x |
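The SSE2 kernels are validated against plain scalar references (see the simd_validation test). The scalar matmul is essentially llama2.c's inner loop, computing W (d,n) @ x (n,) -> xout (d,):

```c
/* scalar reference: xout = W @ x, with W stored row-major
 * as d rows of n floats; this is the hot loop of inference */
void
matmul(float *xout, float *x, float *w, int n, int d)
{
	int i, j;
	float val;

	for(i = 0; i < d; i++){
		val = 0.0;
		for(j = 0; j < n; j++)
			val += w[i*n + j] * x[j];
		xout[i] = val;
	}
}
```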
Tested on stories15M model:
| Mode | GFLOPS | Speedup |
|---|---|---|
| Scalar (1 thread) | 3.4 | 1.0x |
| SIMD (1 thread) | 19.0 | 5.7x |
| SIMD (4 threads) | 18.7 | 5.6x |
Token generation: ~180-240 tok/s on stories15M.
```
9ml/
├── src/
│   ├── run.c             # Main inference driver
│   ├── model.c           # Model loading (format detection)
│   ├── llmfs.c           # 9P file server
│   ├── simd_amd64.s      # SSE2 assembly
│   ├── parallel.c        # Thread pool
│   ├── arch/             # Architecture plugins
│   │   ├── arch.c        # Plugin registry
│   │   └── llama2.c      # LLaMA 2 architecture
│   ├── format/           # File format parsers
│   │   ├── gguf.c        # GGUF parser + dequantization
│   │   └── safetensors.c # Safetensors parser + config.json
│   ├── pool/             # Model pool management
│   │   └── pool.c        # LRU eviction, reference counting
│   └── tests/            # Plan 9 test source files
├── test/                 # C test harness (Linux host)
│   ├── harness.c         # Main test driver
│   ├── reference.c       # Reference implementations
│   └── qemu.c            # QEMU VM management
├── models/               # Model files (download separately)
├── Makefile              # Linux build file
├── mkfile                # Plan 9 build file
└── CLAUDE.md             # Development documentation
```
On Linux:

```sh
make test   # Build and run all tests in Plan 9 VM
make clean  # Clean build artifacts
```

In Plan 9:

```rc
mk          # Build all targets
mk clean    # Clean build artifacts
```
To compile and link manually in Plan 9:

```rc
# Compile
6c -w run.c

# Link with libraries
6l -o run run.6 simd_amd64.6 arch/arch.a6 format/format.a6
```
| Test | Description |
|---|---|
| rmsnorm, softmax, matmul | Core math operations |
| simd_validation | SIMD vs scalar correctness |
| generation | End-to-end text generation |
| format_generation | Safetensors vs GGUF output match |
| config_json | HuggingFace config.json parsing |
| gguf_dequant | Q4_0/Q8_0 dequantization |
| pool_lru | Model pool LRU eviction |
| llmfs_local | 9P server local mount |
| llmfs_remote | Dual-VM remote inference |
- 9front - Plan 9 fork
- Plan 9 C Programming
- llama2.c - Original implementation
- GGUF Format - GGUF specification
MIT License - Same as llama2.c