Results & Future Plans
TL;DR: 69.2% accuracy on my tiny 13-phrase test set. Basic phrases work. Proverbs mostly fail.
The big challenge: I can’t actually tell if this is good because there’s no Albanian benchmark! Building that benchmark is now priority #1.
The Frustrating Part
Let me show you my test results:
| Albanian | Expected | Model Output | Result |
|---|---|---|---|
| Sa kushton? | How much? | How much? | ✅ |
| plazh | beach | beach | ✅ |
| Faleminderit | Thank you | Thank you | ✅ |
| Mirëmëngjes! | Good morning! | Good evening! | ❌ |
| Po | Yes | Yes | ✅ |
| Jo | No | No | ✅ |
| Mirupafshim | Goodbye | Goodbye | ✅ |
| Si jeni? | How are you? | How are you? | ✅ |
| Ku është stacioni? | Where is the station? | Where is the station? | ✅ |
| Më mirë shëndet, se mbret | Better health than king | Better health, than a priest | ❌ |
| Fjala pa punë, si peshku pa lumë | Words without work, like fish without river | A word without work, like a stone without a face | ❌ |
| Kush punon, ha bukë | Who works, eats bread | Who works, eats | ❌ |
| Nuk ka tym pa zjarr | No smoke without fire | There’s no smoke without fire | ✅ |
Score: 9/13 = 69.2%
That looks decent, right? The simple stuff works. Proverbs struggle.
But here’s my frustration: I have no idea if 69% is good.
Thirteen phrases isn’t a real benchmark. I picked these phrases somewhat arbitrarily based on what I thought would be interesting test cases. Maybe I picked too many hard ones. Maybe I picked too many easy ones. Maybe my expected translations are wrong.
Is Qwen the right base model? I don’t know. I can’t compare because there’s no benchmark showing how GPT-5, Claude, Gemini, or any other model performs on Albanian.
Was my training data good? I don’t know. I have no baseline to compare against.
Did I choose the right training approach? I don’t know! Who knows!
It’s like going to school and the teacher never gives you grades. You hand in your homework and… nothing. Did you pass? Did you fail? Unclear.
This Is Why We Need an Albanian Benchmark
Let me say this louder for the people in the back:
We’re flying blind.
If a researcher wants to improve Albanian language technology, they have no way to measure progress. If a company wants to know if their translation is good enough, they have no standard to test against. If I want to know whether I should use Qwen or Gemma or Llama as my base model, I have to guess.
I’m not an expert. I’m not a linguist. But I can see that this gap needs to be filled. So I’m going to try to build something.
What An Albanian NMT Benchmark Might Look Like
Here’s my rough thinking (and I’d love feedback from people who know more than me):
| Domain | Target Sentences | Why |
|---|---|---|
| Conversational | 150 | Basic usage, most common need |
| Formal/business | 100 | Professional communications |
| News/factual | 100 | Information content |
| Proverbs | 200 | Cultural nuance, hardest test |
| Idioms | 50 | Figurative language |
| Technical | 50 | Domain-specific vocabulary |
That would be around 650 sentences. Each would need:
- Human-validated reference translations (ideally from multiple native speakers)
- Quality scores from established metrics like COMET
- Difficulty ratings so we can track performance by category
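To make those requirements concrete, here's a rough sketch of what a single benchmark entry could look like as a data structure. The field names are my own invention, not an existing standard, and the example sentence is pulled from my test set above:

```python
from dataclasses import dataclass, asdict

@dataclass
class BenchmarkItem:
    source: str            # Albanian source sentence
    references: list[str]  # human-validated English translations (ideally several)
    domain: str            # e.g. "proverbs", "conversational", "technical"
    difficulty: int        # 1 (easy) to 5 (hardest), for per-category tracking

item = BenchmarkItem(
    source="Nuk ka tym pa zjarr",
    references=["No smoke without fire",
                "There's no smoke without fire"],
    domain="proverbs",
    difficulty=3,
)
```

Storing multiple references per sentence matters because metrics like COMET reward any acceptable phrasing, not just one "official" translation.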
I’d want to make it open-source, with a public leaderboard where anyone can submit their model’s performance.
Is this the right approach? I’m genuinely not sure. But it seems better than what exists now, which is nothing.
The Dialect Problem (Again)
Any benchmark should probably address dialects, but this is hard.
Albanian has numerous dialects:
- Northwest Gheg, Northeast Gheg, Central Gheg, Southern Gheg
- Malsia Albanian, Upper Reka, Arbanasi
- Transitional dialects between Gheg and Tosk
- Northern Tosk, Labërisht, Çam
- Arvanitika (Greece), Arbëresh (Italy), Istrian Albanian
(Disclaimer: I compiled this list from various sources but I’m not a dialectologist. Some of these classifications might be outdated or disputed.)
My model is trained almost entirely on Tosk/standard Albanian. Testing on other dialects would probably show worse performance. But getting training data for minority dialects is difficult.
This is an unsolved problem. I’m noting it here because I think it matters, but I don’t have a solution.
Money Down the Drain
Let me be real about the cost of failures.
Every time I trained a model that didn’t work, that cost money. Cloud GPU time isn’t free. The DPO disaster where I trained on corrupted weights? Money gone. The targeted overfitting experiment? Money gone. The full fine-tuning that destabilized? Money gone.
If I'd had a benchmark from the start, I could have evaluated base models before training, tested outputs along the way, and caught problems earlier. I probably could have saved half that cost.
The benchmark isn’t just academically nice. It has practical value. It saves money and time.
Future Approach: SERA and Soft Verification
A paper came out literally today (January 28, 2026) that has me rethinking my whole approach.
SERA (Soft Extraction for Reproducibility Assessment) is a technique for generating training data using what they call “soft verification.” The basic idea, applied to my problem, would work something like this:
Back-Translation with Automatic Quality Checking
- Forward translation: Take Albanian sentence, translate to English with a good model
- Back-translation: Translate that English back to Albanian with a different model (Similar to my approach from Part 3, with a bit of extra flair.)
- Forward again: Translate the back-translated Albanian to English again
- Compare: If the two English translations are similar, the original translation was probably good
The insight is that consistency implies quality. If I translate A→B→A’→B’, and B ≈ B’, then the translation is probably reliable—even without a human checking.
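The round trip above can be sketched in a few lines. The translate functions here are placeholder lookup stubs standing in for real models, and the similarity measure is a crude character-level ratio where a real pipeline would compare sentence embeddings or a COMET-style score:

```python
import difflib

def similarity(a: str, b: str) -> float:
    # Crude proxy for semantic similarity; swap in embeddings or a
    # learned metric for real filtering.
    return difflib.SequenceMatcher(None, a.lower(), b.lower()).ratio()

def soft_verify(sq_text, sq_to_en, en_to_sq, threshold=0.8):
    """Round trip A -> B -> A' -> B'; keep the pair only if B is close to B'."""
    en = sq_to_en(sq_text)        # forward translation (model 1)
    sq_back = en_to_sq(en)        # back-translation (a different model)
    en_again = sq_to_en(sq_back)  # forward again
    score = similarity(en, en_again)
    return score >= threshold, score, (sq_text, en)

# Placeholder "models" (lookup tables) just to show the flow:
def stub_sq_to_en(t):
    return {"Nuk ka tym pa zjarr": "There is no smoke without fire"}.get(t, t)

def stub_en_to_sq(t):
    return {"There is no smoke without fire": "Nuk ka tym pa zjarr"}.get(t, t)

keep, score, pair = soft_verify("Nuk ka tym pa zjarr",
                                stub_sq_to_en, stub_en_to_sq)
```

Pairs that pass the threshold go straight into the training set; pairs that fail get routed to a human review queue.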
This could let me:
- Generate way more training data (50k+ pairs)
- Automatically filter for quality without native speaker review (huge)
- Focus human effort on the hard cases where models disagree
Why This Might Fix My DPO Problem
My DPO training failed because of framework incompatibility (MLX → PyTorch weight corruption). But SERA suggests I might not even need DPO—their results show that pure SFT with well-verified data can match models trained with RL or preference learning.
If true, I could skip the whole DPO nightmare and just generate better training data.
Vague Instructions for Diversity
SERA also suggests using vague prompts to get diverse outputs. Instead of “translate this Albanian proverb,” try:
- “Express this Albanian concept in English”
- “Convey the meaning of this phrase”
- “Rephrase this for English speakers”
Different prompts elicit different translation styles, all potentially useful as training signal.
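Mechanically, this is just template sampling. A tiny sketch using the three prompts listed above (the function and its parameters are hypothetical, not from the SERA paper):

```python
import random

# Prompt templates from the list above; {text} takes the Albanian source.
TEMPLATES = [
    "Express this Albanian concept in English: {text}",
    "Convey the meaning of this phrase: {text}",
    "Rephrase this for English speakers: {text}",
]

def diverse_prompts(text, k=2, seed=0):
    # Sample k distinct templates so repeated generations vary in style.
    rng = random.Random(seed)
    return [t.format(text=text) for t in rng.sample(TEMPLATES, k)]

prompts = diverse_prompts("Kush punon, ha bukë", k=3)
```

Each prompt would then be sent through the soft-verification loop, so only the stylistic variants that survive the round-trip check become training signal.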
My Tentative Plan
- Implement soft back-translation verification
- Generate 50k+ verified training pairs
- Train SFT-only (skip DPO entirely)
- See if this beats my current 69% model
I’m excited about this but also uncertain. The paper is brand new. I haven’t tried it yet. It might not work for translation the way it works for code. But it feels like a promising direction.
The Vision: Albania as an Inspirational Case
Here’s my concept:
What if Albanian became the best-documented example of building AI capabilities for a low-resource language?
Most AI research focuses on English, Chinese, Spanish—languages with huge speaker populations and massive data availability. Low-resource languages get ignored. 🙁
But someone has to go first. Someone has to figure out the techniques, document the failures, build the benchmarks, publish the data.
What if that someone is the Albanian community?
Albania is small but tech-savvy. There’s a diaspora of Albanian software engineers around the world. The language is unique and valuable.
I’d love to see Albania become known for having the best low-resource language AI infrastructure in the world. A reference implementation that other language communities can learn from. A proof that you don’t need FAANG-scale resources to build something meaningful.
Maybe that’s naive. Maybe it’s impossible. But I think it’s worth trying.
Building More.
ML researchers: I’m an amateur. If you see obvious mistakes in my approach, please tell me. I’d rather be embarrassed than wrong.
Other low-resource language communities: I’m trying to document everything I learn. If this helps you build something for your language, that would make this project worthwhile.
Standing on Shoulders
I’ve read so many papers over the course of this project. Tried to understand so many techniques. Watched so many tutorials. Asked so many questions in Discord servers and forums.
I’m truly standing on the shoulders of giants here, and I couldn’t possibly name everyone who contributed to the knowledge I built on. But thank you—to the researchers who publish their work openly, to the engineers who open-source their code, to the community members who answer questions from confused beginners like me.
Specific thanks to:
- Bonin (GitHub: bonin1) for the open-source Albanian-English dataset that anchored my training
- The MLX team at Apple for building a framework that made on-device deployment possible
- The Albanian Wikipedia and Wikiquote communities for public domain content
- The authors of the SERA paper for giving me a new direction to explore
What’s Next
Short term:
- Build a proper Albanian NMT benchmark
- Submit that to LLM-stats.com and pay to run SOTA models against it
- Try the SERA approach with soft verification
- Get more native speaker involvement
Medium term:
- Scale to 50k+ training pairs
- Investigate dialect support (this is hard but important)
- Get the app to a state where I’d be comfortable releasing it
Long term (dreams):
- Public benchmark with leaderboard
- Albanian becoming a reference case for low-resource language AI
First sip of a new blend. The extraction could be better. The grind needs adjustment. Some notes are there that shouldn’t be, and some are missing that should be.
But the foundation is poured. The process is documented. The problems are identified.
Now I iterate.
Series Navigation:
- Part 1: The Vision — Why build this
- Part 2: Sourcing the Beans — Data collection
- Part 3: The Roast — Training approaches
- Part 4: The Pour — iOS deployment
- Part 5: The Taste Test — Results & future (you are here)