Building an Albanian Edge LLM – Part 3: The Roast

Training Approaches & Failures

TLDR: I tried a bunch of different approaches. Most failed! Tried M2M-100, full fine-tuning, DPO preference learning (which broke spectacularly due to framework incompatibility). What eventually worked: MLX LoRA on Qwen3-1.7B with my 22k samples. After 15+ model variants, I landed at around 69% accuracy. Not great, not terrible—but I learned a ton.


Honest Accounting

I want to be upfront: I don’t fully know if the approaches I tried were the right ones. I’m not an ML researcher. I read papers, followed tutorials, asked for help, and experimented. A lot.

Most of what I tried didn’t work. That’s actually the story of this project—failure after failure, with occasional glimmers of hope that kept me going.

I’m sharing the failures because I think they’re at least entertaining.

The Training Environment

Let me describe what I was working with:

My main dev machine: MacBook Air M2 with 24GB unified RAM

Testing device: iPhone 14 Pro Max

Cloud compute: I used modal.com for most of my GPU training. I also tried to get set up on canopywave and hyperbolic, but ran into various issues and ended up sticking with Modal.

One thing I found myself wishing for: an Albanian cloud compute provider. It would be cool to keep the money local, support the Albanian tech ecosystem, and maybe even get better support for language-specific needs. If anyone knows of one, let me know?

Models I Explored

Before settling on my final approach, I looked at a bunch of different models:

Translation-specific models:

  • M2M-100 (Meta’s multilingual translation model)
  • Various seq2seq architectures
  • OmniASR, though it lacks built-in speech-to-text translation (S2TT)

General LLMs I considered for fine-tuning:

  • Qwen3-1.7B (what I ended up using)
  • Qwen3-4B (larger, more expensive to run)
  • Gemma3 (Google’s open model)
  • EuroLLM-9B-Instruct (investigated, but Albanian doesn’t appear to be in their documented language list)

ASR models (for speech-to-text, which connects to my larger project):

  • Whisper Turbo
  • Whisper Tiny
  • Whisper Flutra
  • Meta’s OmniASR (got this working locally on iPhone! But it can’t do translation in the pipeline; needs more testing)

I should mention the ASR work is part of a bigger vision where speech could flow through to translation, but that’s a whole other blog series.

Why I Picked Qwen3-1.7B

Here’s my reasoning, such as it was:

Qwen explicitly mentions Albanian. In their model documentation, Qwen lists Albanian (specifically Tosk Albanian) as one of the languages included in their training data. Most other open-source models don’t mention Albanian at all.

My assumption was: a model that was deliberately trained on Albanian will probably perform better at Albanian than a model where Albanian is just incidental web scraping debris.

Is this assumption correct? Honestly, I don’t know. And this points to a bigger problem: there’s no benchmark for Albanian language capabilities in LLMs.

Think about it. If I want to know whether GPT-5 or Claude or Qwen handles Albanian better, where do I look? There’s no leaderboard. No standardized test. No grades.

I had to basically guess which model to use as my base, based on what the model creators say in their documentation and my own informal testing. That’s… not great.

Building an Albanian benchmark is high on my future work list. (I’m in talks with LLM-Stats and ready to send the benchmark.) More on that in Part 5.

What I Tried (The Failures)

There were more attempts than listed here, but I didn’t document them all properly.

Attempt 1: M2M-100

Meta’s M2M-100 is designed for translation. It handles 100 languages, including Albanian.

Out of the box, it does… okay? Simple phrases worked. “Hello” translates fine. “Where is the train station?” comes through.

But I wasn’t happy with the outputs on more complex text, and I ran into various issues with the architecture. Fundamentally, I was also chasing a general LLM rather than a pure machine-translation model.

I moved on.

Attempt 2: Full Fine-Tuning

Maybe I should just fine-tune all the parameters of a smaller model?

I tried this with Qwen3-1.7B. Training started okay, but around iteration 800, things went sideways. The loss spiked. The outputs became repetitive garbage.

My guess is that 22k examples isn’t enough to stably update 1.7 billion parameters. The model wandered into bad local minima.

Lesson learned: LoRA isn’t just more efficient—it seems to provide stability that full fine-tuning lacks on small datasets.
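For the curious, the core idea behind LoRA is small enough to sketch in a few lines of numpy. The dimensions below are toy values, not Qwen3’s real shapes: the pretrained weight W stays frozen, and only the low-rank factors A and B get trained.

```python
import numpy as np

# Toy sketch of a LoRA-adapted linear layer. W is frozen;
# only the low-rank factors A and B would receive gradients.
rng = np.random.default_rng(0)

d_in, d_out, r = 64, 64, 16            # made-up dims; rank 16 as in my runs
alpha = 16.0                           # common scaling choice: alpha / r

W = rng.normal(size=(d_out, d_in))     # frozen pretrained weight
A = rng.normal(size=(r, d_in)) * 0.01  # trainable, small random init
B = np.zeros((d_out, r))               # trainable, zero init

def lora_forward(x):
    # Base path plus the scaled low-rank update.
    return W @ x + (alpha / r) * (B @ (A @ x))

x = rng.normal(size=d_in)
# Because B starts at zero, the adapted layer initially matches
# the pretrained layer exactly.
assert np.allclose(lora_forward(x), W @ x)
```

That zero-initialized B is part of the stability story: training starts from exactly the pretrained model’s behavior and can only drift away slowly through a rank-16 bottleneck.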

Attempt 3: DPO on Base Model

I’d read about Direct Preference Optimization (DPO) and thought it sounded promising. Show the model pairs of translations, tell it which one is better, let it learn preferences.

I tried applying DPO directly to the base Qwen3-1.7B using my 1,155 CPO triplets.

Result: 19% accuracy. Worse than the unmodified model.

In retrospect, this makes sense. DPO refines existing capabilities. If the model can barely translate Albanian to begin with, there’s nothing to refine. I was trying to optimize a skill the model didn’t have. (I think? Again, no benchmarks.)
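To make the “nothing to refine” intuition concrete, here’s a toy version of the DPO objective, computed from sequence log-probabilities. This is my own simplification, not TRL’s actual implementation.

```python
import math

def dpo_loss(pol_chosen, pol_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Toy DPO objective on total sequence log-probs of the chosen and
    rejected translations under the policy and the frozen reference model."""
    margin = beta * ((pol_chosen - ref_chosen) - (pol_rejected - ref_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))  # -log(sigmoid(margin))

# If the policy hasn't moved from the reference, the margin is 0 and the
# loss sits at -log(0.5) ~ 0.693 no matter how bad both translations are.
# DPO only shifts preferences the model can already express.
print(dpo_loss(-50.0, -55.0, -50.0, -55.0))
```

The loss is driven entirely by how the policy’s preference *relative to the reference* separates chosen from rejected; if the base model assigns near-noise probabilities to all Albanian outputs, there’s no useful signal in that margin.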

Attempt 4: DPO on Fine-Tuned Model (The Disaster)

Okay, so DPO needs a model that can already do the task. I had a fine-tuned model from my LoRA experiments that could translate Albanian. Let me apply DPO to that!

This is where things got bad.

I trained my SFT (supervised fine-tuned) model using MLX, Apple’s framework for their chips. But the DPO library I was using (TRL from HuggingFace) requires PyTorch.

“No problem,” I thought. “I’ll just load the MLX weights into PyTorch.”

Result: 0% accuracy. Complete garbage. Random characters. Chinese text. Random symbols.

TLDR – MLX and PyTorch store tensor weights differently. When I loaded MLX weights into PyTorch, the values got misinterpreted. Imagine loading a JPEG as raw bitmap data—you get something, but it’s noise.
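You can reproduce the flavor of this bug in a couple of lines of numpy. (The actual MLX↔PyTorch failure was a storage-format mismatch rather than literally a dtype view, but the effect is the same class of bug: same bytes, wrong interpretation.)

```python
import numpy as np

# Reinterpret the raw bytes of float32 weights as float16.
w = np.array([0.1, -0.2, 0.3], dtype=np.float32)
scrambled = w.view(np.float16)  # same 12 bytes, now read as 6 float16s

print(w)          # the intended values
print(scrambled)  # plausible-looking but meaningless numbers
```

Nothing errors out, the shapes are even self-consistent, and the model happily “runs” — which is exactly why this kind of corruption survives all the way into a paid training job.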

The model was outputting garbage before DPO training even started. I trained DPO on a corrupted model. Oops.

Money spent on that training run: down the drain.

Attempt 5: Targeted Overfitting

I had 18 test phrases that kept failing. What if I just trained really hard on exactly those phrases?

I created a tiny dataset of just these 18 examples and trained with a high learning rate.

Result: 23% accuracy. Even worse than before.

The model memorized (sort of) those 18 phrases and forgot everything else. Classic catastrophic forgetting.

You can’t shortcut your way to capability by training intensively on evaluation data. Who knew? (Everyone knew. I should have known.)

What Actually Worked

SFT + LoRA

After all those failures, here’s what eventually produced reasonable results:

Setting        Value
Base model     Qwen3-1.7B
Method         LoRA (Low-Rank Adaptation)
LoRA rank      16
Training data  22,369 SFT pairs
Iterations     ~3,000
Framework      MLX

LoRA only updates about 2% of the model’s parameters. This turned out to be important for stability.
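Back-of-the-envelope, here’s how that fraction comes about. The layer shapes below are illustrative guesses, not Qwen3-1.7B’s actual config; the real fraction depends on the rank and on which projections you adapt (with these made-up shapes it lands well under 2%).

```python
# Rough estimate of the fraction of parameters LoRA trains at rank 16.
# Shapes are invented for illustration, not Qwen3-1.7B's real dimensions.
r = 16
hidden = 2048
layers = 28
# Suppose we adapt the four attention projections per layer, each hidden x hidden.
adapted = [(hidden, hidden)] * 4 * layers

# Each adapted (d_out x d_in) layer gains r*(d_in + d_out) LoRA params.
lora_params = sum(r * (d_in + d_out) for d_out, d_in in adapted)
total_params = 1_700_000_000

print(f"LoRA params: {lora_params:,}")
print(f"fraction of model: {lora_params / total_params:.2%}")
```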

One crucial discovery: the “no think” directive.

Qwen3 has a “thinking” mode where it reasons through problems before answering. Great for math problems. Bad for translation—the thinking tokens add latency and sometimes confuse the output.

I found that adding /no_think to my prompts dramatically improved both speed and consistency. The model just outputs the translation instead of “thinking” about it first.

This seems obvious in retrospect, but it took me a while to figure out. (I didn’t know the smaller Qwen models had this on by default.)
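The prompt I ended up with looked roughly like this. The wording and helper function are mine; the /no_think switch itself is documented Qwen3 behavior for suppressing the thinking block.

```python
def build_prompt(text: str, src: str = "English", tgt: str = "Albanian") -> str:
    """Sketch of a translation prompt with Qwen3's /no_think soft switch
    prepended so the model skips the <think>...</think> reasoning block."""
    return (
        f"/no_think Translate the following {src} text to {tgt}. "
        f"Output only the translation.\n\n{text}"
    )

print(build_prompt("Where is the train station?"))
```

In practice this string still goes through the model’s chat template before tokenization; the point is just that the switch rides along in the user turn.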

Result: 69.2% Accuracy

On my test set of 13 phrases (more on why that’s a problem in Part 5), the final model gets about 69% correct.

Basic phrases: near-perfect.
Simple sentences: very good.
Proverbs: not great.

Is 69% good? I genuinely don’t know. There’s no Albanian translation benchmark to compare against. I don’t know if my test set is representative. I don’t know if other approaches would do better.

It feels like a reasonable start. Room to improve.
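For what it’s worth, 9 of 13 is the only score that lands on 69.2%. My scoring was informal, but a minimal normalized exact-match scorer would look something like this:

```python
def normalize(s: str) -> str:
    # Drop punctuation and case so trivial differences don't count as errors.
    return "".join(ch for ch in s.lower() if ch.isalnum() or ch.isspace()).strip()

def exact_match_accuracy(preds, refs):
    hits = sum(normalize(p) == normalize(r) for p, r in zip(preds, refs))
    return hits / len(refs)

print(f"{9 / 13:.1%}")  # 69.2%
```

Exact match is a blunt instrument for translation (two different correct translations score zero), which is one more argument for building a proper benchmark with better metrics.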

The Model Zoo

I trained 15+ variants over the course of this project. Here’s a partial list:

Model                   Method                  Accuracy    Notes
albanian_mt_final_4bit  SFT + LoRA              ~69%        Best so far
adapters_qwen_final     LoRA adapters only      ~69%        Equivalent
albanian_mt_dpo         DPO on base             ~19%        Failed
albanian_mt_dpo_v2      DPO on SFT (corrupted)  ~0%         Disaster
albanian_mt_targeted    Targeted FT             ~23%        Forgetting
adapters_gemma          Gemma3 base             Didn’t log  Not as good
various others          various                 various     Learning experiences

Each training run cost money. Each failure taught me something. Some taught me more than the successes did.

Compute Costs and Learnings

Every time something failed, I had to decide whether to debug or try something different. Debugging often meant more training runs. More training runs meant more cost.

This is one reason I’m so interested in seeing an Albanian benchmark created. If I’d had a clear way to evaluate models before fine-tuning, I might have made better choices about which base model to use. I might have caught the MLX↔PyTorch incompatibility earlier. I might have spent less money on dead ends.

What I’d Do Differently

If I were starting over:

  1. Stay in one framework. I’d do everything in PyTorch if I needed DPO, then convert to MLX only at the end. (I only mixed frameworks because I kept expecting each run to be the last.)
  2. Test model outputs immediately after any weight loading. I’d never train on a model without first verifying it produces sensible outputs.
  3. Build evaluation infrastructure first. Having a good test set from the start would have helped me catch problems earlier.
  4. Talk to more Albanian speakers. Native speaker feedback would have helped validate my approach earlier in the process.
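Point 2 on that list can be almost free. A smoke test like the following (my own heuristic, threshold included) would have caught the corrupted-weights disaster, which produced CJK text and random symbols, before any DPO money was spent:

```python
import unicodedata

def looks_like_albanian(text: str, threshold: float = 0.9) -> bool:
    """Cheap post-load smoke test: after loading weights, generate a few
    outputs and check they are mostly Latin-script letters rather than
    CJK or symbol soup. The 0.9 threshold is a guess; tune it on
    known-good outputs."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return False
    latin = sum("LATIN" in unicodedata.name(ch, "") for ch in letters)
    return latin / len(letters) >= threshold

assert looks_like_albanian("Ku është stacioni i trenit?")   # healthy output
assert not looks_like_albanian("随机的汉字 ###")              # corrupted output
```

It’s not a quality check, just a sanity check — but that’s exactly the check I was missing between “load MLX weights into PyTorch” and “launch the DPO run.”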

The roast is done. Some beans burned. But I think there’s something usable here. Let’s see if it pours well.