TLDR: I got a 7GB model down to 934MB using 4-bit quantization. Then I used Apple’s MLX framework to run it on iPhone. The app itself is pretty simple—around 200 lines of Swift. Model loads in about 5 seconds, translations happen in under 300ms. I don’t know if this is the best approach, but it works!
The Size Problem
So I had a trained model. Exciting! But it was way too big to fit on a phone.
Qwen3-1.7B in full precision comes out to around 7GB. That’s… a lot. More than most apps. More than you want to ask someone to download. And when you load it into memory, you’re eating up significant resources.
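The ~7GB figure falls out of simple arithmetic: model size is roughly parameter count times bytes per weight. (Roughly, because real checkpoints also carry embeddings, metadata, and sometimes a few layers kept at higher precision.) A quick sketch:

```python
# Size estimate: bytes ≈ parameter count × bytes per weight.
# 1.7 billion is the approximate parameter count of Qwen3-1.7B.
params = 1_700_000_000

fp32_gb = params * 4 / 1e9    # FP32: 4 bytes per weight  -> ~6.8 GB
int4_gb = params * 0.5 / 1e9  # INT4: half a byte per weight -> ~0.85 GB
```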
I needed to make it smaller.
Quantization
Quantization is basically compression for neural networks: you reduce the precision of the numbers the model uses internally. Fortunately, I've quantized several models before, so this was somewhat familiar territory.
Neural networks typically use 32-bit floating-point numbers (FP32) for their weights. Quantization reduces this:
| Format | Bits per Weight | Approximate Size |
|---|---|---|
| FP32 | 32 | ~7GB |
| FP16/BF16 | 16 | ~3.4GB |
| INT8 | 8 | ~1.7GB |
| INT4 | 4 | ~850MB |
The trade-off: smaller usually means some quality loss. The question is how much.
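To make that concrete, here's a toy Python sketch of symmetric 4-bit quantization. This is not MLX's actual scheme (MLX uses grouped affine quantization with per-group scales), just the simplest version of the idea: map each float to one of 16 integer levels and back, and see what precision you lose.

```python
# Toy symmetric 4-bit quantization: floats -> signed integers in [-8, 7].
# Not MLX's real grouped-affine scheme; this just illustrates the idea.

def quantize_4bit(weights, scale=None):
    """Quantize a list of floats to signed 4-bit integer levels."""
    if scale is None:
        scale = max(abs(w) for w in weights) / 7  # 7 = max positive level
    q = [max(-8, min(7, round(w / scale))) for w in weights]
    return q, scale

def dequantize_4bit(q, scale):
    """Map the integer levels back to approximate floats."""
    return [v * scale for v in q]

weights = [0.12, -0.07, 0.33, -0.29, 0.01]
q, scale = quantize_4bit(weights)
restored = dequantize_4bit(q, scale)
max_err = max(abs(a - b) for a, b in zip(weights, restored))
# Each weight now takes 4 bits instead of 32, at the cost of a rounding
# error of at most half a quantization step (scale / 2), absent clipping.
```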
What I Found
I tested several quantization levels to see what happened:
| Quantization | Model Size | Accuracy | My Notes |
|---|---|---|---|
| BF16 (baseline) | ~3.4GB | ~70% | Too big for comfort |
| 8-bit | ~1.7GB | ~70% | Still pretty big |
| 4-bit | ~934MB | ~69% | Sweet spot? |
| 3-bit | ~680MB | ~64% | Quality starts dropping |
| 2-bit | ~520MB | ~47% | Unusable |
4-bit seemed like the right choice. I lost maybe one percentage point of accuracy compared to the BF16 baseline, but gained a huge size reduction. The model fits comfortably under 1GB.
Why MLX?
Apple has a framework called MLX for running machine learning on its chips. I ended up using it. I'm honestly not 100% sure it was the right choice, but it has worked better than my past Core ML experiments.
Things I liked about MLX:
- It’s designed specifically for Apple Silicon
- The CPU and GPU share memory, which seems to help with performance
- Variable-length inputs work naturally (unlike some frameworks that want fixed sizes)
- There’s both Python and Swift support
Comparison with alternatives:
I tried two other approaches:
Core ML — Apple's older ML framework. It works, but it felt clunkier for transformer models: I had to deal with fixed-shape requirements that made variable-length translation annoying.
llama.cpp — This is a popular project for running LLMs efficiently. It’s really good! But I ran into issues with Qwen3’s tokenizer. Some special tokens weren’t handled correctly, leading to weird outputs.
In my (very unscientific) testing, MLX gave me better performance on my Mac and iPhone than the alternatives. Your mileage may vary.
The Learning Curve
I should be honest: getting all this working was not straightforward.
Some things that tripped me up:
Model format conversions — You can’t just take any model and load it with MLX. There’s a specific format with specific files (config.json, model.safetensors, tokenizer files). I had to learn how to convert my trained model into this format correctly.
Memory management — On the first few attempts, the app would crash shortly after loading the model. I was holding references wrong, causing memory to balloon. Took a while to figure out the right patterns.
Tokenizer quirks — Qwen3’s tokenizer has some special tokens that need to be handled carefully. I had a bug where the model would emit special tokens in the output, which looked like garbage. Fixing this required understanding the tokenizer configuration better than I wanted to.
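The eventual fix for the tokenizer bug was conceptually simple: strip special tokens from the model's output before showing it to the user. A hypothetical Python sketch of the idea (the token strings below are Qwen-style markers; the authoritative set lives in the model's tokenizer configuration files):

```python
# Hypothetical sketch: remove special/control tokens from generated text
# before display. These token strings are Qwen-style markers; the real list
# comes from the model's tokenizer configuration.
SPECIAL_TOKENS = {"<|im_start|>", "<|im_end|>", "<|endoftext|>"}

def clean_output(text: str) -> str:
    """Strip special tokens and surrounding whitespace from model output."""
    for tok in SPECIAL_TOKENS:
        text = text.replace(tok, "")
    return text.strip()

print(clean_output("Hello, how are you?<|im_end|>"))  # -> Hello, how are you?
```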
None of this is insurmountable. But I want to be clear: if you're thinking "I'll just throw a model on an iPhone," expect some friction.
The iOS App
The app itself is pretty simple. I’m not a professional iOS developer, so the code might not be idiomatic, but it works.
Here’s the core of the translation logic:
```swift
// Setting up the model container.
// MLX handles loading the quantized model from disk.
let configuration = ModelConfiguration(
    id: modelPath, // Path to our 4-bit model
    defaultPrompt: "Translate Albanian to English"
)

// Load the model (this takes ~5 seconds on my iPhone 14 Pro Max).
modelContainer = try await ModelContainer.load(configuration: configuration)
```
And the translation session:
```swift
// Create a chat session with our translation instructions.
// The /no_think flag tells Qwen to skip reasoning and just output.
chatSession = ChatSession(
    container,
    instructions: """
    You are a translator. Translate from Albanian to English.
    Output only the English translation, nothing else.
    /no_think
    """,
    generateParameters: GenerateParameters(
        temperature: 0.1, // Low = more consistent outputs
        maxTokens: 256    // Albanian sentences are rarely this long
    )
)
```
The actual translation is straightforward:
```swift
// Stream tokens as they're generated.
// This makes the UI feel more responsive.
for await token in session.generate(prompt: "Translate: \(albanianText)") {
    result += token
}
```
That's… basically it. The MLX framework does the heavy lifting; my code just coordinates loading and calling the model. (I'll polish the app once the model itself is better.)
Performance Numbers
Here’s what I measured on my iPhone 14 Pro Max:
| Metric | Value |
|---|---|
| Model load time | ~5 seconds |
| Tokens per second | ~10-90 tok/s |
| Short phrase translation | <150ms |
I don’t know if these numbers are good compared to other approaches. I don’t have a baseline to compare against. But they feel fast enough—when I type something and tap translate, the result appears quickly enough that it feels responsive.
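As a sanity check that these figures hang together: generation latency is roughly output tokens divided by throughput, ignoring prompt processing and warm-up. The token count below is an assumption on my part, not a measurement:

```python
# Rough latency model: time ≈ output_tokens / tokens_per_second.
# Prompt processing and warm-up are ignored, so this is a lower bound.
def generation_time_ms(num_tokens: int, tokens_per_second: float) -> float:
    return num_tokens / tokens_per_second * 1000

# A short phrase (~12 output tokens, assumed) at both ends of the range:
fast_ms = generation_time_ms(12, 90)  # ~133 ms, in line with "<150ms"
slow_ms = generation_time_ms(12, 10)  # ~1200 ms at the slow end
```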
Model Loading Strategy
I wanted flexibility during development, so I set up the app to look for the model in multiple places:
```swift
func findModelPath() -> String {
    // First: check the Documents folder (for testing new models).
    // I can drop a new model in via Finder without rebuilding the app.
    let documentsPath = FileManager.default
        .urls(for: .documentDirectory, in: .userDomainMask)[0]
        .appendingPathComponent("albanian-translator").path
    if FileManager.default.fileExists(atPath: documentsPath) {
        return documentsPath
    }

    // Second: check the app bundle (for production).
    if let bundlePath = Bundle.main.path(
        forResource: "albanian-translator", ofType: nil
    ) {
        return bundlePath
    }

    // Last resort: download from the Hugging Face Hub.
    // This should never happen in production, but it's a safety net.
    return "mlx-community/albanian-translator-4bit"
}
```
The Documents folder trick was really useful during development. I could test new model versions by just dragging files via Finder, without going through the whole Xcode build cycle.
What I Don’t Know
There’s a lot about iOS deployment that I’m unsure about:
Battery impact — I haven’t done rigorous battery testing. Running a 934MB model definitely uses power, but I don’t know how much compared to alternatives.
Memory pressure — The model uses up to ~900MB of RAM. On newer iPhones with 6GB+, that’s fine. On older devices with 4GB, it might cause issues. I haven’t tested extensively.
Background behavior — What happens when the app goes to background? I think the model gets unloaded, but I’m not certain about the details.
App Store approval — I haven’t submitted this to the App Store yet. I don’t know if there will be issues with the model size or performance requirements.
These are all things I need to figure out before a real release.
App Size
The final app breakdown is roughly:
| Component | Size |
|---|---|
| Model weights | ~934MB |
| Swift binary | ~12MB |
| MLX framework | ~8MB |
| Assets | ~2MB |
| Total | ~956MB |
Just under 1GB, which hits my target. But it’s still a hefty download. Users on cellular might not be thrilled.
The trade-off: you download once, and then it works forever offline. No ongoing data usage. No privacy concerns about your text going to servers.
I think that’s worth it for the right users, but it’s definitely a barrier.
What I Skipped
Some features I deliberately didn’t build:
Cloud sync — Your translations stay on your device. No iCloud, no server. Privacy-first means your data doesn’t leave, period.
Batch translation — One sentence at a time for now. Document translation would be cool but adds complexity.
Bidirectional translation — This model only does Albanian → English. English → Albanian would need a separate model or a different architecture.
Offline dictionary — When translation fails, a simple dictionary lookup might help. Future work.
Better streaming and UI touch-ups — both need more work.
I wanted to ship something that worked before adding features. Maybe some of these come later.
Lessons for Others
If someone else wants to deploy a model to iOS, here’s what I learned:
- Start with MLX examples. There are sample projects in the mlx-swift repository. Study them before trying to build your own thing.
- Test on device early. The simulator doesn’t behave the same as real hardware. Memory constraints are different. Performance is different. Test on a real iPhone as soon as possible.
- Budget for model conversion headaches. Getting your trained model into the right format for MLX will probably take longer than you expect. Plan for it.
- Keep the UI simple at first. I spent too much time on UI polish early on, when I should have been focused on getting the model working correctly. (Oops)
- Document your model loading path. Where does the model live? How does it get there? How do you update it? These questions are surprisingly annoying if you don’t think about them upfront.
I hope this helps someone. I’m definitely not an expert—I’m just sharing what worked for me, mistakes and all.
The coffee is poured. Let’s see how it tastes.