Results & Future Plans
TL;DR: 69.2% accuracy on my tiny 13-phrase test set. Basic phrases work. Proverbs mostly fail.
The big challenge: I can’t actually tell if this is good because there’s no Albanian benchmark! Building that benchmark is now priority #1.
The Frustrating Part
Let me show you my test results:
| Albanian | Expected | Model Output | Result |
|---|---|---|---|
| Sa kushton? | How much? | How much? | ✅ |
| plazh | beach | beach | ✅ |
| Faleminderit | Thank you | Thank you | ✅ |
| Mirëmëngjes! | Good morning! | Good evening! | ❌ |
| Po | Yes | Yes | ✅ |
| Jo | No | No | ✅ |
| Mirupafshim | Goodbye | Goodbye | ✅ |
| Si jeni? | How are you? | How are you? | ✅ |
| Ku është stacioni? | Where is the station? | Where is the station? | ✅ |
| Më mirë shëndet, se mbret | Better health than king | Better health, than a priest | ❌ |
| Fjala pa punë, si peshku pa lumë | Words without work, like fish without river | A word without work, like a stone without a face | ❌ |
| Kush punon, ha bukë | Who works, eats bread | Who works, eats | ❌ |
| Nuk ka tym pa zjarr | No smoke without fire | There’s no smoke without fire | ✅ |
Score: 9/13 = 69.2%
That looks decent, right? The simple stuff works. Proverbs struggle.
But here’s my frustration: I have no idea if 69% is good.
Thirteen phrases isn’t a real benchmark. I picked these phrases somewhat arbitrarily based on what I thought would be interesting test cases. Maybe I picked too many hard ones. Maybe I picked too many easy ones. Maybe my expected translations are wrong.
Is Qwen the right base model? I don’t know. I can’t compare because there’s no benchmark showing how GPT-5, Claude, Gemini, or any other model performs on Albanian.
Was my training data good? I don’t know. I have no baseline to compare against.
Did I choose the right training approach? I don’t know! Who knows!
It’s like going to school and the teacher never gives you grades. You hand in your homework and… nothing. Did you pass? Did you fail? Unclear.
This Is Why We Need an Albanian Benchmark
Let me say this louder for the people in the back:
We’re flying blind.
If a researcher wants to improve Albanian language technology, they have no way to measure progress. If a company wants to know if their translation is good enough, they have no standard to test against. If I want to know whether I should use Qwen or Gemma or Llama as my base model, I have to guess.
I’m not an expert. I’m not a linguist. But I can see that this gap needs to be filled. So I’m going to try to build something.
What An Albanian NMT Benchmark Might Look Like
Here’s my rough thinking (and I’d love feedback from people who know more than me):
| Domain | Target Sentences | Why |
|---|---|---|
| Conversational | 150 | Basic usage, most common need |
| Formal/business | 100 | Professional communications |
| News/factual | 100 | Information content |
| Proverbs | 200 | Cultural nuance, hardest test |
| Idioms | 50 | Figurative language |
| Technical | 50 | Domain-specific vocabulary |
That would be around 650 sentences. Each would need:
- Human-validated reference translations (ideally from multiple native speakers)
- Quality scores from established metrics like COMET
- Difficulty ratings so we can track performance by category
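To make those requirements concrete, here's a rough sketch of what a single benchmark entry could look like as a data structure. The field names are my own invention, not an existing standard, and the example sentence is pulled from my test set above:

```python
from dataclasses import dataclass, asdict

@dataclass
class BenchmarkItem:
    source: str            # Albanian source sentence
    references: list[str]  # human-validated English translations (ideally several)
    domain: str            # e.g. "proverbs", "conversational", "technical"
    difficulty: int        # 1 (easy) to 5 (hardest), for per-category tracking

item = BenchmarkItem(
    source="Nuk ka tym pa zjarr",
    references=["No smoke without fire",
                "There's no smoke without fire"],
    domain="proverbs",
    difficulty=3,
)
```

Storing multiple references per sentence matters because metrics like COMET reward any acceptable phrasing, not just one "official" translation.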
I’d want to make it open-source, with a public leaderboard where anyone can submit their model’s performance.
Is this the right approach? I’m genuinely not sure. But it seems better than what exists now, which is nothing.
The Dialect Problem (Again)
Any benchmark should probably address dialects, but this is hard.
Albanian has numerous dialects:
- Northwest Gheg, Northeast Gheg, Central Gheg, Southern Gheg
- Malsia Albanian, Upper Reka, Arbanasi
- Transitional dialects between Gheg and Tosk
- Northern Tosk, Labërisht, Çam
- Arvanitika (Greece), Arbëresh (Italy), Istrian Albanian
(Disclaimer: I compiled this list from various sources but I’m not a dialectologist. Some of these classifications might be outdated or disputed.)
My model is trained almost entirely on Tosk/standard Albanian. Testing on other dialects would probably show worse performance. But getting training data for minority dialects is difficult.
This is an unsolved problem. I’m noting it here because I think it matters, but I don’t have a solution.
Money Down the Drain
Let me be real about the cost of failures.
Every time I trained a model that didn’t work, that cost money. Cloud GPU time isn’t free. The DPO disaster where I trained on corrupted weights? Money gone. The targeted overfitting experiment? Money gone. The full fine-tuning that destabilized? Money gone.
If I'd had a benchmark from the start, I could have evaluated base models before training, tested outputs along the way, and caught problems earlier. I probably could have saved half that cost.
The benchmark isn’t just academically nice. It has practical value. It saves money and time.
Future Approach: SERA and Soft Verification
A paper came out literally today (January 28, 2026) that has me rethinking my whole approach.
SERA (Soft Extraction for Reproducibility Assessment) is a technique for generating training data using what they call “soft verification.” The basic idea, applied to my problem, would work something like this:
Back-Translation with Automatic Quality Checking
- Forward translation: Take Albanian sentence, translate to English with a good model
- Back-translation: Translate that English back to Albanian with a different model (Similar to my approach from Part 3, with a bit of extra flair.)
- Forward again: Translate the back-translated Albanian to English again
- Compare: If the two English translations are similar, the original translation was probably good
The insight is that consistency implies quality. If I translate A→B→A’→B’, and B ≈ B’, then the translation is probably reliable—even without a human checking.
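The round trip above can be sketched in a few lines. The translate functions here are placeholder lookup stubs standing in for real models, and the similarity measure is a crude character-level ratio where a real pipeline would compare sentence embeddings or a COMET-style score:

```python
import difflib

def similarity(a: str, b: str) -> float:
    # Crude proxy for semantic similarity; swap in embeddings or a
    # learned metric for real filtering.
    return difflib.SequenceMatcher(None, a.lower(), b.lower()).ratio()

def soft_verify(sq_text, sq_to_en, en_to_sq, threshold=0.8):
    """Round trip A -> B -> A' -> B'; keep the pair only if B is close to B'."""
    en = sq_to_en(sq_text)        # forward translation (model 1)
    sq_back = en_to_sq(en)        # back-translation (a different model)
    en_again = sq_to_en(sq_back)  # forward again
    score = similarity(en, en_again)
    return score >= threshold, score, (sq_text, en)

# Placeholder "models" (lookup tables) just to show the flow:
def stub_sq_to_en(t):
    return {"Nuk ka tym pa zjarr": "There is no smoke without fire"}.get(t, t)

def stub_en_to_sq(t):
    return {"There is no smoke without fire": "Nuk ka tym pa zjarr"}.get(t, t)

keep, score, pair = soft_verify("Nuk ka tym pa zjarr",
                                stub_sq_to_en, stub_en_to_sq)
```

Pairs that pass the threshold go straight into the training set; pairs that fail get routed to a human review queue.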
This could let me:
- Generate way more training data (50k+ pairs)
- Automatically filter for quality without native speaker review (huge)
- Focus human effort on the hard cases where models disagree
Why This Might Fix My DPO Problem
My DPO training failed because of framework incompatibility (MLX → PyTorch weight corruption). But SERA suggests I might not even need DPO—their results show that pure SFT with well-verified data can match models trained with RL or preference learning.
If true, I could skip the whole DPO nightmare and just generate better training data.
Vague Instructions for Diversity
SERA also suggests using vague prompts to get diverse outputs. Instead of “translate this Albanian proverb,” try:
- “Express this Albanian concept in English”
- “Convey the meaning of this phrase”
- “Rephrase this for English speakers”
Different prompts elicit different translation styles, all potentially useful as training signal.
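Mechanically, this is just template sampling. A tiny sketch using the three prompts listed above (the function and its parameters are hypothetical, not from the SERA paper):

```python
import random

# Prompt templates from the list above; {text} takes the Albanian source.
TEMPLATES = [
    "Express this Albanian concept in English: {text}",
    "Convey the meaning of this phrase: {text}",
    "Rephrase this for English speakers: {text}",
]

def diverse_prompts(text, k=2, seed=0):
    # Sample k distinct templates so repeated generations vary in style.
    rng = random.Random(seed)
    return [t.format(text=text) for t in rng.sample(TEMPLATES, k)]

prompts = diverse_prompts("Kush punon, ha bukë", k=3)
```

Each prompt would then be sent through the soft-verification loop, so only the stylistic variants that survive the round-trip check become training signal.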
My Tentative Plan
- Implement soft back-translation verification
- Generate 50k+ verified training pairs
- Train SFT-only (skip DPO entirely)
- See if this beats my current 69% model
I’m excited about this but also uncertain. The paper is brand new. I haven’t tried it yet. It might not work for translation the way it works for code. But it feels like a promising direction.
The Vision: Albania as an Inspirational Case
Here’s my concept:
What if Albanian became the best-documented example of building AI capabilities for a low-resource language?
Most AI research focuses on English, Chinese, Spanish—languages with huge speaker populations and massive data availability. Low-resource languages get ignored. 🙁
But someone has to go first. Someone has to figure out the techniques, document the failures, build the benchmarks, publish the data.
What if that someone is the Albanian community?
Albania is small but tech-savvy. There’s a diaspora of Albanian software engineers around the world. The language is unique and valuable.
I’d love to see Albania become known for having the best low-resource language AI infrastructure in the world. A reference implementation that other language communities can learn from. A proof that you don’t need FAANG-scale resources to build something meaningful.
Maybe that’s naive. Maybe it’s impossible. But I think it’s worth trying.
Building More.
ML researchers: I’m an amateur. If you see obvious mistakes in my approach, please tell me. I’d rather be embarrassed than wrong.
Other low-resource language communities: I’m trying to document everything I learn. If this helps you build something for your language, that would make this project worthwhile.
Standing on Shoulders
I’ve read so many papers over the course of this project. Tried to understand so many techniques. Watched so many tutorials. Asked so many questions in Discord servers and forums.
I’m truly standing on the shoulders of giants here, and I couldn’t possibly name everyone who contributed to the knowledge I built on. But thank you—to the researchers who publish their work openly, to the engineers who open-source their code, to the community members who answer questions from confused beginners like me.
Specific thanks to:
- Bonin (GitHub: bonin1) for the open-source Albanian-English dataset that anchored my training
- The MLX team at Apple for building a framework that made on-device deployment possible
- The Albanian Wikipedia and Wikiquote communities for public domain content
- The authors of the SERA paper for giving me a new direction to explore
What’s Next
Short term:
- Build a proper Albanian NMT benchmark
- Submit that to LLM-stats.com and pay to run SOTA models against it
- Try the SERA approach with soft verification
- Get more native speaker involvement
Medium term:
- Scale to 50k+ training pairs
- Investigate dialect support (this is hard but important)
- Get the app to a state where I’d be comfortable releasing it
Long term (dreams):
- Public benchmark with leaderboard
- Albanian becoming a reference case for low-resource language AI
First sip of a new blend. The extraction could be better. The grind needs adjustment. Some notes are there that shouldn’t be, and some are missing that should be.
But the foundation is poured. The process is documented. The problems are identified.
Now I iterate.
Series Navigation:
- Part 1: The Vision — Why build this
- Part 2: Sourcing the Beans — Data collection
- Part 3: The Roast — Training approaches
- Part 4: The Pour — iOS deployment
- Part 5: The Taste Test — Results & future (you are here)