A Samsung Galaxy S25 just fine-tuned a billion-parameter language model in 78 minutes. Not on a server somewhere. Not through an API. On the phone itself, using its own GPU, with the data never leaving the device.
Two years ago that was science fiction. Today it is a benchmark result from Tether's QVAC Fabric framework, open-sourced on March 17, 2026.
## The Short Version
Tether (the $184 billion stablecoin company) open-sourced QVAC Fabric - a framework that fine-tunes and runs Microsoft's BitNet 1-bit AI models on phones, laptops, and consumer GPUs. It works across AMD, Intel, Apple, Qualcomm, and ARM chips. No NVIDIA monopoly. No cloud dependency.
## The Benchmarks
QVAC Fabric takes Microsoft's BitNet architecture - where each model weight is just -1, 0, or +1 instead of a 16-bit floating point number - and makes it trainable across every major consumer GPU. The framework uses Vulkan compute shaders with dynamic tiling to split large matrix operations into chunks that fit within mobile GPU constraints. The output is bit-exact whether you run it on a data center rack or a $799 phone.
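The dynamic-tiling idea is easy to sketch. Below is a minimal NumPy stand-in for what the Vulkan shaders do: break one large matrix multiply into fixed-size blocks so no single chunk of work exceeds a mobile-GPU-sized budget. The tile size and the NumPy framing are illustrative assumptions, not QVAC's actual kernel:

```python
import numpy as np

def tiled_matmul(a: np.ndarray, b: np.ndarray, tile: int = 64) -> np.ndarray:
    """Compute a @ b one small block at a time, so each chunk of work
    fits a constrained workgroup budget. Tile size is illustrative."""
    m, k = a.shape
    k2, n = b.shape
    assert k == k2, "inner dimensions must match"
    out = np.zeros((m, n), dtype=np.result_type(a, b))
    for i in range(0, m, tile):          # rows of the output
        for j in range(0, n, tile):      # columns of the output
            for p in range(0, k, tile):  # slices of the shared dimension
                out[i:i + tile, j:j + tile] += (
                    a[i:i + tile, p:p + tile] @ b[p:p + tile, j:j + tile]
                )
    return out
```

A real shader would dispatch each (i, j) block to a workgroup and accumulate over p; the slicing here handles ragged edge tiles the same way a dynamic tiler must.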
From the HuggingFace technical paper:
| Device | Model Size | Fine-Tune Time | Dataset |
|---|---|---|---|
| Samsung S25 | 125M | ~10 min | 297 biomedical docs |
| Samsung S25 | 1B | 1h 18min | 297 biomedical docs |
| iPhone 16 | 1B | 1h 45min | 297 biomedical docs |
| iPhone 16 | 13B | At device limit | 297 biomedical docs |
## Why This Is Different
Standard quantization methods (Q4, Q8) compress models after training, always losing quality. BitNet trains natively at 1.58 bits per weight from the start. The math collapses to additions and subtractions instead of floating-point multiplications. That simplicity is what unlocks phone-grade hardware for work that used to need server racks.
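To see why, here is a rough sketch of the arithmetic in plain NumPy. `ternarize` follows the published absmean recipe for BitNet b1.58, and the matvec shows each output element reducing to additions and subtractions. This illustrates the principle only; it is not QVAC's implementation:

```python
import numpy as np

def ternarize(w: np.ndarray, eps: float = 1e-8):
    """Quantize a weight tensor to {-1, 0, +1} plus one scale factor,
    following the absmean scheme described for BitNet b1.58."""
    scale = np.abs(w).mean() + eps
    wq = np.clip(np.round(w / scale), -1, 1).astype(np.int8)
    return wq, scale

def ternary_matvec(wq: np.ndarray, scale: float, x: np.ndarray) -> np.ndarray:
    """y = (wq * scale) @ x. Because every weight is -1, 0, or +1, each
    output element is just sums and differences of entries of x -- no
    multiplies until the single final rescale."""
    y = np.empty(wq.shape[0])
    for r, row in enumerate(wq):
        y[r] = x[row == 1].sum() - x[row == -1].sum()
    return y * scale
```

Replacing per-weight multiplies with add/subtract is exactly the simplification that lets phone-class ALUs keep up.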
## Inference Speed by Device
Most coverage stops at "it works on phones." Whether on-device AI feels usable or painful comes down to tokens per second. The QVAC Fabric BitNet benchmarks (HuggingFace, March 2026) measured each phone running a 1B model in TQ1_0 format; the highlights follow.
The iPhone result stands out. 130 tokens per second from a phone GPU matches what many developers get from small quantized models on a mid-range laptop. The Samsung S25 shows the biggest CPU-to-GPU jump at 11x, turning a device that is borderline useless on CPU (2.4 tok/s) into something practical.
## The Format Decision: TQ1_0 vs TQ2_0
QVAC ships two BitNet weight formats. Picking the wrong one can cost you 5x in training speed or lock you out of larger models entirely.
- **TQ1_0 (Compact)** - smaller memory footprint; the stronger inference choice on Apple GPUs.
- **TQ2_0 (Fast)** - larger footprint but much higher training throughput on Adreno-class mobile GPUs.
On the Pixel 9 with BitNet 2.7B, fine-tuning with TQ1_0 took 1,570 minutes per epoch. TQ2_0 took 304 minutes. But Apple's GPU architecture strongly favors TQ1_0 for inference, delivering 4-5x higher throughput than TQ2_0 on the iPhone 16. Samsung's Adreno GPU goes the other way.
No single "best" format exists. Apple deployment should default to TQ1_0 for both memory and speed. Samsung deployment should favor TQ2_0 for training throughput. The QVAC Vulkan abstraction layer handles hardware differences, but this choice still sits with the developer.
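That decision could be captured as a small lookup seeded from the benchmarks above. The function name and the mapping are my distillation of the quoted numbers, not official QVAC guidance:

```python
def pick_bitnet_format(gpu_family: str, workload: str) -> str:
    """Choose a BitNet weight format from the trade-offs quoted above.
    Unknown combinations fall back to the compact format."""
    preferred = {
        ("apple", "inference"): "TQ1_0",   # 4-5x faster than TQ2_0 on iPhone 16
        ("apple", "training"): "TQ1_0",    # memory-constrained default
        ("adreno", "training"): "TQ2_0",   # ~5x faster per epoch in the benchmarks
        ("adreno", "inference"): "TQ2_0",  # Adreno throughput favors TQ2_0
    }
    return preferred.get((gpu_family.lower(), workload.lower()), "TQ1_0")
```

Even a table this small makes the point: the format choice is per-deployment, and defaulting to one format everywhere leaves 4-5x on the floor somewhere.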
## Memory: Where 1-Bit Models Break the Rules
The memory savings go beyond incremental. BitNet runs bigger models in less space than competing architectures run smaller ones.
| Model | Parameters | VRAM | Bits/Weight |
|---|---|---|---|
| BitNet-1B (TQ1_0) | 1B | 614 MiB | 1.58 |
| Qwen3-0.6B (FP16) | 0.6B | 1,536 MiB | 16 |
| Gemma-3-1B (FP16) | 1B | 2,753 MiB | 16 |
| BitNet-2.7B (TQ1_0) | 2.7B | 1,228 MiB | 1.58 |
| Qwen3-1.7B (FP16) | 1.7B | 4,403 MiB | 16 |
| BitNet-13B (TQ1_0) | 13B | 2,789 MiB | 1.58 |
| Qwen3-4B (Q4) | 4B | ~3,600 MiB | 4 |
Data from QVAC Fabric HuggingFace benchmarks.
Read that table carefully. BitNet-13B uses less memory than a 4-bit quantized Qwen3-4B despite having 3.25x more parameters. BitNet-2.7B needs less than a third of what Qwen3-1.7B requires despite being 60% larger. The 1-bit architecture does not just compress better. It operates in a different category of hardware efficiency.
## Why a Stablecoin Company Is Building AI Infrastructure
Tether posted $10 billion in net profit in 2025. It holds $122 billion in U.S. Treasuries and manages $184 billion in USDT. The AI push is not a side experiment.
- **Tether** - $184B market cap generates the cash
- **QVAC Fabric** - edge AI framework, open-source
- **Holepunch** - P2P networking, no central servers
- **The combination** - local AI + P2P = no cloud needed
CEO Paolo Ardoino calls the strategy "Stable Intelligence." Tether has already invested $1.2 billion in German robotics startup Neura Robotics. QVAC Fabric is the second major release - the first introduced cross-platform LoRA fine-tuning for standard models like Qwen3 and Gemma3 earlier in 2026.
Through Holepunch (the team behind the Keet messaging app), Tether built a decentralized networking stack. Pair that with on-device AI that skips the server entirely:
"Intelligence will be a key determining factor in the future of society. The future of AI should be accessible, available, and open to people and builders everywhere, and it should not require an absurd amount of resources only available to a handful of cloud providers."Paolo Ardoino, CEO of Tether
The thesis stripped bare: AI models run locally, communicate peer-to-peer, and your data goes nowhere. The stablecoin business funds it. QVAC builds the compute layer. Holepunch handles networking. Whether you buy the vision or not, the open-source code is real and the benchmarks are reproducible.
## The Cost Problem Driving All of This
Cloud inference spending now exceeds 55% of all AI infrastructure costs, up from 33% in 2023. The inference-optimized chip market alone should pass $50 billion this year.
A concrete example: a construction company built an AI analytics tool for under $200/month during development. In production, the bill hit $10,000/month. Development runs hundreds of queries daily. Production runs millions. API pricing that looks manageable at prototype scale turns punishing at production volume.
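The break-even math is simple enough to sketch. All numbers below are hypothetical; a real comparison would also price power, maintenance, and any quality gap:

```python
import math

def breakeven_queries(hardware_cost: float, cloud_cost_per_query: float) -> int:
    """Number of queries after which a one-time hardware purchase beats
    per-query cloud pricing. Deliberately naive: ignores power,
    maintenance, depreciation, and model quality differences."""
    return math.ceil(hardware_cost / cloud_cost_per_query)

# Hypothetical: a $799 phone vs. $0.002 per cloud query
# -> break-even at 399,500 queries, i.e. days of traffic at the
#    millions-of-queries production volume described above.
```

The asymmetry is the point: the cloud line grows forever, the hardware line is flat after day one.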
### The Economics Flip

| Cloud API | On-Device |
|---|---|
| Pay per inference, forever | One-time hardware cost |
| Costs scale linearly with usage | Every inference after is free |
| Data leaves your network | Data stays on device |
| Latency depends on connectivity | Instant local response |
On-device inference flips the cost curve. You buy hardware once. Every query after that costs nothing. For high-volume, latency-sensitive, or privacy-locked applications, edge deployment wins on pure math.
## Quality Check: How Smart Are These Models?
No hype here. A BitNet model on your phone will not replace Claude or ChatGPT for complex reasoning.
Microsoft's BitNet b1.58 2B4T scores 53.2% on MMLU. Claude Opus scores roughly 87%. That gap is real and matters for general-purpose tasks.
But general-purpose benchmarks miss the use case. These models are not trying to be everything. A 1B BitNet model fine-tuned on your company's support tickets, your clinic's intake protocols, or a sensor classification pipeline can handle that specific job well. It runs offline, costs nothing per query, and exposes zero data.
### The Honest Limitation
On-device 1-bit models will not replace frontier cloud models for complex reasoning, creative writing, or multi-step analysis. They are purpose-built specialists: fast, free, private, and good enough for targeted tasks. Ask "is it smart enough for this specific job?" not "is it as smart as Claude?"
Where it gets interesting: BitNet is at 2B natively-trained parameters today. The architecture scales. When 13B and 70B natively-trained BitNet models ship (not post-quantized conversions, which degrade), running on Apple M-series chips through QVAC's framework, the quality gap will compress fast.
## The Platform Landscape
QVAC does not exist in a vacuum. Every major hardware maker already has an edge AI play:
| Platform | Scope | Hardware Lock-in | Fine-Tuning |
|---|---|---|---|
| Apple MLX | Training + inference | Apple Silicon only | Yes (Apple only) |
| Google Gemini Nano | Inference | Pixel / Android | No |
| Qualcomm AI Hub | Inference | Snapdragon | No |
| Samsung Galaxy AI | Inference | Galaxy devices | No |
| QVAC Fabric | Training + inference | None (Vulkan/Metal) | Yes (all GPUs) |
QVAC is the only framework offering cross-platform fine-tuning. Apple's MLX is locked to Apple silicon. Google's tools favor their chips. Qualcomm and Samsung only do inference. QVAC's Vulkan backend runs on all of them.
For developers building AI-powered products, the implication is that a growing set of applications - health monitoring, document classification, on-device translation, smart home control - no longer requires a network connection for AI to function.
For enterprises under data governance pressure, on-device processing is shifting from nice-to-have to compliance requirement. GDPR, HIPAA, and financial data sovereignty rules increasingly favor architectures where data never leaves the device. QVAC's federated approach - fine-tune locally, share only the weight updates - makes that architecturally feasible.
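That pattern is essentially federated averaging: each device trains locally, and only weight deltas travel. A minimal sketch in NumPy (this is the textbook FedAvg aggregation step, not QVAC's actual protocol):

```python
import numpy as np

def federated_average(base: np.ndarray, tuned: list) -> np.ndarray:
    """One FedAvg-style aggregation step: each device ships only its
    weight delta (tuned - base); raw training data never moves."""
    deltas = [w - base for w in tuned]
    return base + np.mean(deltas, axis=0)
```

Production systems weight each delta by local dataset size and add secure aggregation on top; this shows only the data-stays-local core that the compliance argument rests on.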
## What Happens Next
Three trends are compounding:
1. **Model efficiency.** BitNet matches last year's 7B quality at 2B parameters with a fraction of the memory. Extrapolate forward: today's cloud-grade intelligence fits on next year's phone chip.
2. **Hardware throughput.** Apple's M4, Qualcomm's Snapdragon 8 Elite, Samsung's Exynos - each generation adds meaningful ML throughput. The hardware runway extends years out.
3. **Portability.** GPU vendor no longer dictates what you can run. One codebase, every chip. That portability did not exist 12 months ago.
## If You Are Building AI Products Today
Architect with a hybrid deployment model in mind. Design data pipelines so models can be fine-tuned and pushed to edge devices, not just called through APIs. The infrastructure exists now. By the end of 2026, on-device quality will be a competitive edge, not a trade-off.
Cloud AI is not going anywhere. Frontier models will keep scaling, and some tasks will always need that power. But the lock that cloud providers held over useful AI is weakening with every release like this one. A stablecoin company just proved a billion-parameter model can train on a phone.
The question is not whether on-device AI becomes good enough. It is when. Based on the current curve, "when" is closer than most companies are planning for.