An Albanian Edge LLM – Part 2: Sourcing the Data

Data Collection & Synthetic Generation

TLDR: I ended up building 22,369 training pairs from scratch using multiple AI models. Finding clean, open-source Albanian translation data was way harder than I expected.

I made a constraint for myself: only CC0 or MIT-licensed data. That led me down the synthetic data rabbit hole, which turned out to be pretty interesting.


The Data Problem

I started this project assuming I could just… find data somewhere. Albanian-English parallel text must exist, right? Dictionaries, textbooks, government documents, something?

Data exists. Lots of it, actually. But I quickly ran into a constraint I imposed on myself: I only wanted to use data that was clearly CC0, MIT-licensed, or otherwise unambiguously open source.

I’m not a lawyer, and I want to respect copyright. So I drew a hard line: if I couldn’t clearly verify the licensing, I wouldn’t use it.

This turned out to be a significant constraint. I looked at various resources online—dictionaries, translation corpora, benchmark datasets—and kept hitting uncertainty about whether I could actually use them for training a model. Some had restrictive licenses. Some had ambiguous terms. Some I just couldn’t verify at all.

The reality I landed on: there’s surprisingly little clearly-licensed Albanian-English training data I could find.

One Bright Spot

There was one exception I got really excited about.

Shoutout to Bonin. GitHub user bonin1 maintains a repository called Al-En-Ger-opensourcedataset with 630 high-quality Albanian-English-German triplets, properly licensed and clearly meant for this kind of use.

Category      Approximate Count
Greetings     ~90
Business      ~120
Travel        ~160
Education     ~110
Healthcare    ~100
Small talk    ~50

These are hand-curated, natural-sounding translations. Exactly the kind of clean data you want to build on.

630 examples isn’t enough to train a translation model. But it gave me an anchor—something I could trust completely. I ended up upsampling this dataset 5x in my final training mix, so the model sees these high-quality examples repeatedly.

Thanks again Bonin!

The Synthetic Data Approach

So here’s where I went: if clean data doesn’t exist, maybe I can create it.

The basic idea is to use multiple frontier AI models to generate translations, then filter and combine them into a training dataset. This approach has become pretty common in the ML world—I certainly didn’t invent it—but I was curious to see if it could work for Albanian.

The process ended up being more involved than I initially expected, but also more interesting.

Starting with Albanian Source Sentences

First, I needed Albanian text to translate. I curated sentences across several categories:

Category             Rough Count   Where I Found Them
Proverbs             ~300          sq.wikiquote.org (public domain)
Conversational       ~400          Generated
Formal/business      ~300          Generated
Technical            ~200          Domain-specific terminology
Cultural references  ~150          Historical, religious, traditional

Total: around 1,350 unique Albanian source sentences.

Getting diversity here seemed important. A model trained only on casual conversation probably fails on formal text. A model trained only on simple sentences probably can’t handle complex grammar.

The Multi-Model Ensemble

I ran each source sentence through several state-of-the-art LLMs:

Model                   Why I Included It
Gemini 3 Flash          Fast, good multilingual reputation
Gemini 2.5 Flash Lite   Wanted to compare variants; this one benchmarks higher on some language-specific tasks
GPT-5.2                 Figured the flagship would be strong, if a bit expensive
GPT-5.2-nano            Lightweight comparison
GPT-4o-mini             Another perspective; I've heard from native Albanian speakers that it was best-in-class
Qwen3-next-80B          Open-weights model via OpenRouter

Why multiple models? Different LLMs have different training data and different biases. When they all agree on a translation, that seems like a good signal. When they diverge, maybe that’s a flag for review.
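The agreement signal can be sketched in a few lines. This is my own illustration, not the production pipeline: it uses difflib's string similarity as a cheap stand-in for "substantially similar" (a real setup might compare embeddings or chrF scores instead), and the `agreement_score` name and sample candidates are made up for the example:

```python
from difflib import SequenceMatcher
from itertools import combinations


def agreement_score(translations: list[str]) -> float:
    """Mean pairwise similarity across candidate translations.

    difflib's ratio is a cheap proxy for 'substantially similar';
    a single candidate trivially agrees with itself.
    """
    pairs = list(combinations(translations, 2))
    if not pairs:
        return 1.0
    sims = [SequenceMatcher(None, a.lower(), b.lower()).ratio() for a, b in pairs]
    return sum(sims) / len(sims)


candidates = [
    "Where there is a will, there is a way.",
    "Where there's a will, there's a way.",
    "If there is a will, there is a way.",
    "The mountain gave birth to a mouse.",  # an outlier worth flagging
]
score = agreement_score(candidates)
# High mean similarity -> keep; low -> route the sentence to manual review.
```

In practice you'd pick a threshold empirically and send low-scoring groups to a review queue rather than discarding them outright.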

Important note about Qwen: the Qwen3 documentation specifically lists Tosk Albanian among the languages in its base training dataset. This got me excited: having Albanian explicitly in the training data seemed like a good sign.

Quality Filtering

Not every generated translation is usable. I built a filtering pipeline to try to catch problems:

Length ratio checks — Albanian and English have roughly similar verbosity. If a translation is wildly different in length from the source, something might be wrong.

Character detection — Albanian uses Latin script with some specific characters (ë, ç). If English output contains Cyrillic, Chinese, or other unexpected characters, the model probably got confused.

Cross-model agreement — When 4+ models produce substantially similar translations, I weighted those higher.

I also ran everything through a safety filter (a 61-item blocklist for profanity, violence, and explicit content). The flagged rate was pretty low, and most flags were false positives, like legitimate uses of blocklisted words in proverbs.
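Roughly, those checks combine into a single pass/fail gate per pair. The sketch below is illustrative, not my exact code: the thresholds, the Unicode ranges, and the one-word stand-in blocklist are all placeholders for the real values:

```python
import re

# Placeholders for illustration; the real pipeline used a 61-item
# blocklist and empirically tuned cutoffs.
BLOCKLIST = {"badword"}          # stand-in for the real blocklist
MAX_LEN_RATIO = 2.0              # Albanian/English are roughly similar in verbosity
NON_LATIN = re.compile(r"[\u0400-\u04FF\u4E00-\u9FFF]")  # Cyrillic, CJK


def passes_filters(sq: str, en: str) -> bool:
    """Return True if an Albanian->English pair survives the basic checks."""
    # 1. Length ratio: a wildly longer or shorter translation is suspicious.
    ratio = max(len(sq), len(en)) / max(1, min(len(sq), len(en)))
    if ratio > MAX_LEN_RATIO:
        return False
    # 2. Character detection: English output shouldn't contain Cyrillic/CJK.
    if NON_LATIN.search(en):
        return False
    # 3. Safety blocklist, matched at word level so substrings inside
    #    legitimate words don't trigger false positives.
    words = set(re.findall(r"\w+", en.lower()))
    if words & BLOCKLIST:
        return False
    return True
```

Word-level matching is what keeps the false-positive rate down: a substring check would flag innocent words that merely contain a blocklisted term.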

What I Ended Up With

After all the filtering and deduplication:

Component                          Count
Bonin open-source (5x upsampled)   3,150
Synthetic high-quality             14,872
Synthetic medium-quality           4,347
Total training pairs               22,369

I also created some preference data for potential DPO training later:

Component                             Count
CPO triplets (chosen/rejected pairs)  1,155

22,369 translation pairs in total. I wish it were 10x that…maybe after the initial launch 🙂

A Major Weakness: No Native Speaker Validation

Here’s something I need to be upfront about: I did not have a native Albanian speaker systematically validate this data.

This is a real weakness. Frontier LLMs are pretty good at Albanian, but they’re not perfect. They sometimes:

  • Use archaic or unnatural phrasing
  • Miss register (formal when casual is appropriate, or vice versa)
  • Generate grammatically correct but semantically off translations
  • Handle false cognates incorrectly

Without native speaker review, I’m sure there are errors in my training data that I haven’t caught. The automated filters help, but they can’t catch everything.

This is something I want to fix in future iterations, especially for lesser-known Albanian dialects…speaking of which:

The Dialect Challenge

Albanian has a lot of dialect complexity.

The two major dialect groups are Gheg (northern) and Tosk (southern, basis for standard Albanian). But within those, there appear to be many sub-dialects:

Northwest Gheg, Northeast Gheg, Central Gheg, Southern Gheg, Malsia Albanian, Upper Reka, Arbanasi, Transitional, Northern Tosk, Labërisht, Çam, Arvanitika, Arbëresh, Istrian Albanian

Important disclaimer: I found this list through research, but I’m not a linguist and can’t verify that all these classifications are accurate or current. Albanian dialectology seems to be a specialized field, and I’m definitely out of my depth here.

What I can say is that my training data skews heavily toward Tosk/standard Albanian. Speakers of other dialects may find the model less accurate. Building dialect-specific data would be valuable future work, though I honestly don't know how I'd go about collecting it. (There's also an incredible Instagram page, @projeki_ftillimi, that catalogs how wildly the words for everyday objects differ across regions.)

Wikiquote as a Quick Test

For informal testing during development, I used Albanian proverbs from sq.wikiquote.org/wiki/Fjalë_të_urta_shqiptare.

This Wikimedia page has around 200 Albanian proverbs with community translations. The quality varies—some translations are more literal than natural—but it’s public domain and gave me quick directional feedback.

This is definitely not a rigorous benchmark. Just a sanity check during iteration.

Why This Approach Might Matter

If you’re working on a well-resourced language like Spanish or French, you just download a dataset and start training.

For low-resource languages like Albanian, you have to build the dataset first.

The pipeline I ended up with is roughly:

  1. Find whatever small clean dataset exists (your “Bonin”)
  2. Curate diverse source sentences in the target language
  3. Translate via multiple frontier LLMs
  4. Filter: safety, deduplication, character detection
  5. Rank: length ratios, cross-model agreement
  6. Merge with upsampled clean data
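Steps 4 through 6 can be sketched concretely. The function below is my own simplification (the name `build_training_mix` and the pair tuples are illustrative): dedupe the synthetic pairs on normalized text, then merge in the small clean set upsampled 5x:

```python
def build_training_mix(clean_pairs, synthetic_pairs, upsample=5):
    """Deduplicate synthetic pairs, then merge with upsampled clean data."""
    seen = set()
    deduped = []
    for sq, en in synthetic_pairs:
        # Normalize before comparing so trivial whitespace/case
        # variants count as duplicates.
        key = (sq.strip().lower(), en.strip().lower())
        if key not in seen:
            seen.add(key)
            deduped.append((sq, en))
    # Repeat the trusted dataset so the model sees it more often.
    return clean_pairs * upsample + deduped


mix = build_training_mix(
    clean_pairs=[("Mirëmëngjes.", "Good morning.")],
    synthetic_pairs=[
        ("Faleminderit.", "Thank you."),
        ("Faleminderit.", "Thank you."),  # duplicate, dropped
    ],
)
# -> 5 clean copies + 1 deduped synthetic pair = 6 pairs
```

In a real run you'd shuffle the merged list before training so the upsampled copies aren't clustered together.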

The specific models will change over time. Gemini 4 will replace Gemini 3. GPT-6 will replace GPT-5. But the general approach should remain valid.

I’m hopeful this could be a template. Any language with at least some native speakers willing to curate source sentences and basic Unicode support in frontier LLMs could potentially follow this approach.

Maybe Albanian could become a reference case—the best-documented example of building AI capabilities for a low-resource language? That would be pretty cool.

What I Didn’t Solve

Dialect coverage — As mentioned, I’m heavily skewed toward Tosk. Gheg and other variants need attention.

Register diversity — I have conversational and formal text, but I’m light on specialized domains like legal, medical, or technical content.

Native speaker validation — This is my biggest weakness and the thing I most want to address.

Adversarial testing — I haven’t tested how the model handles intentionally confusing inputs, code-switching, or non-standard orthography.

These are all future work. The foundation is laid, but there’s a lot more to do.


Beans sourced. 22,369 of them. Let’s see if I can roast them without burning the batch.