Today we're doing something special. Here's an excerpt from The Educative Newsletter, a new exclusive resource for paying Educative subscribers.
I wanted to share this with you because the truth behind the OpenAI/DeepSeek distillation controversy is something every developer should understand.
It's not just a feud between AI giants. It illustrates the future of competitive AI, the interconnectedness of tech and policy, and what developers need to do to keep up.
I hope you enjoy the read.
OpenAI vs. DeepSeek: The Distillation Battle Shaping Next-Gen AI
The latest hype in the AI sphere has included accusations surrounding two heavyweights: OpenAI and DeepSeek.
But unlike the Tyson v. Paul fight, this battle doesn't have a clear winner.
Released on January 20th, DeepSeek's R1 made history within a week, triggering one of the largest single-day losses in US stock market history (a bad day for NVIDIA).
R1's power play was that it performed nearly as well as OpenAI's most advanced models (while allegedly being 1000x cheaper to build than GPT-4).
Now, we have an idea of just how they did it.
OpenAI recently accused DeepSeek of violating OpenAI's Terms of Use to train R1 — specifically, through a process called model distillation.
Model distillation is a deep learning technique that leverages powerful AI systems to create smaller, more efficient ones. Distillation isn't an evil in its own right, but it does cross into some ethical gray areas.
The thing is: we know for a fact that DeepSeek uses distillation... but so do others (including OpenAI).
Let's take an objective look at the distillation controversy today.
We'll cover:
What model distillation is (and how it works)
How DeepSeek uses distillation
What OpenAI's accusations mean for the future of AI
Let's dive in.
What is model distillation?
Outside the context of LLMs, distillation is all about extracting the most valuable parts from a messy mixture. Think of how we turn seawater into fresh drinking water: boil the water away from the salt and condense the good stuff. The key idea is separating what's essential from the excess.
Model distillation draws this parallel in AI — where essential knowledge is extracted from a massive AI model and passed down to a smaller one.
Model distillation involves knowledge transfer between:
A large, supercharged model (the teacher model)
A smaller, more efficient model (the student model)
Distillation provides a powerful method for a student model to learn to replicate the teacher’s intelligence — while also cutting down on size, energy use, and computing power.
How distillation works
In distillation, the teacher model's outputs become the student model's training data:
The teacher model processes input data and generates a label (correct answer) and a rationale (explanation).
The student model then learns from this enhanced data, capturing both the decisions and reasoning of the teacher.
As a result, the student model learns not only the answer but the reasoning process, as well.
As the diagram below depicts, the teacher model also helps the student learn nuanced inter-class relationships by providing:
Soft labels: Probabilistic predictions across classes
Hard labels: Correct answers
This enables the student model to efficiently approximate the teacher's performance.
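To make the soft-label idea concrete, here is a minimal NumPy sketch of the classic distillation loss: a temperature-softened KL-divergence term against the teacher's soft labels, blended with ordinary cross-entropy against the hard label. The function names, temperature, and alpha weighting here are illustrative assumptions, not any lab's actual recipe.

```python
import numpy as np

def softmax(logits, temperature=1.0):
    # Softmax with temperature: higher T spreads probability mass,
    # exposing the teacher's "soft" inter-class preferences.
    z = logits / temperature
    z = z - z.max()                       # numerical stability
    e = np.exp(z)
    return e / e.sum()

def distillation_loss(student_logits, teacher_logits, hard_label,
                      temperature=2.0, alpha=0.5):
    # Soft-label term: KL divergence between the teacher's and the
    # student's temperature-softened distributions, scaled by T^2
    # (the standard correction so gradients keep a consistent scale).
    t = softmax(teacher_logits, temperature)
    s = softmax(student_logits, temperature)
    soft_loss = np.sum(t * (np.log(t) - np.log(s))) * temperature ** 2

    # Hard-label term: ordinary cross-entropy on the correct answer.
    probs = softmax(student_logits)
    hard_loss = -np.log(probs[hard_label])

    # Blend the two signals; alpha weights soft vs. hard supervision.
    return alpha * soft_loss + (1 - alpha) * hard_loss

# Example: a student whose logits roughly (but not exactly) mimic
# a confident teacher, with class 0 as the hard label.
teacher = np.array([4.0, 1.0, 0.5])
student = np.array([2.0, 1.5, 0.5])
loss = distillation_loss(student, teacher, hard_label=0)
```

A student that exactly matched the teacher's logits would zero out the soft-label term, leaving only the (small) hard-label penalty.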
Benefits of distillation
A distilled model offers many benefits over larger models:
Efficiency & faster inference: Distilled models provide faster performance with lower latency (while retaining intelligence).
Reduced computational costs: Distilled models require less hardware and energy compared to large models.
Deployability on edge devices: These models can be deployed on devices with limited computing resources, such as smartphones and IoT devices.
Expedited development of specialized LLMs: We can fine-tune distilled models to create smaller, domain-specific models from general-purpose LLMs for areas like medicine, coding, etc.
One tradeoff of distillation is that you might sacrifice a sliver of accuracy, but that’s an acceptable compromise for many use cases.
Distillation has a distinct appeal over purely fine-tuning a pre-trained LLM. Fine-tuning involves training an LLM for specific tasks on specialized datasets. However, a large, fine-tuned LLM still boasts billions of parameters, soaking up loads of computing power (and running up energy bills).
Distillation offers a clever, more cost-effective route:
Extract the model's core intelligence, and trim the excess
Create a leaner, distilled model that's faster and cheaper to deploy
Fine-tune the smaller, distilled model for specialized use cases if needed
Often, distillation and fine-tuning are used in tandem to achieve effective LLMs with less demand on resources.
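The three steps above can be sketched end-to-end. This toy NumPy example is my illustration under simplifying assumptions, not any lab's actual pipeline: a fixed random linear classifier stands in for a large "teacher," a low-rank factorized model stands in for the leaner "student," and both training loops are plain gradient descent on cross-entropy.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def cross_entropy(X, targets, A, B):
    P = softmax(X @ A @ B)
    return -(targets * np.log(P + 1e-12)).sum() / len(X)

# Step 1: the "teacher" -- a fixed linear classifier standing in
# for a large pre-trained model (purely illustrative).
D, C = 8, 3                                   # features, classes
W_teacher = rng.normal(size=(D, C))
X = rng.normal(size=(200, D))                 # unlabeled inputs
soft_labels = softmax(X @ W_teacher)          # teacher's soft labels

# Step 2: distill into a leaner student. The student factors its
# weights as A @ B through a rank-2 bottleneck, so it carries
# fewer parameters than the teacher.
r, lr = 2, 0.5
A = rng.normal(size=(D, r)) * 0.1
B = rng.normal(size=(r, C)) * 0.1
loss_before = cross_entropy(X, soft_labels, A, B)
for _ in range(300):
    G = (softmax(X @ A @ B) - soft_labels) / len(X)  # dCE/dlogits
    A, B = A - lr * X.T @ G @ B.T, B - lr * A.T @ X.T @ G
loss_after = cross_entropy(X, soft_labels, A, B)

# Step 3: optionally fine-tune the distilled student on a small
# labeled, domain-specific dataset (hard labels only).
X_dom = rng.normal(size=(20, D))
y_dom = np.eye(C)[softmax(X_dom @ W_teacher).argmax(axis=1)]
for _ in range(100):
    G = (softmax(X_dom @ A @ B) - y_dom) / len(X_dom)
    A, B = A - lr * X_dom.T @ G @ B.T, B - lr * A.T @ X_dom.T @ G
```

The rank bottleneck is what makes the student "smaller"; real pipelines swap these toy linear models for neural networks and typically add a temperature-softened loss.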
🌏 Real-World Example
GPT-4o mini is distilled from GPT-4o, which is how this smaller model hits a sweet spot between performance and resource efficiency.
Ethical concerns and OpenAI's accusations
If a student model’s training data consists mostly of a teacher model’s outputs, would distillers need explicit permission to harvest that material?
As useful as this technique is, distillation wades into murky ethical waters, where it becomes difficult to draw the line between knowledge transfer and blatant copyright infringement.
While neither OpenAI nor Microsoft has disclosed evidence, OpenAI accuses DeepSeek of using unfair methods to achieve its low-cost models.
DeepSeek allegedly exploited OpenAI's API to perform distillation at scale, using multiple unrelated accounts.
Microsoft security researchers reportedly detected individuals performing data exfiltration through the OpenAI API in late 2024, activity they believe to be linked to DeepSeek.
There is some irony in this controversy, as OpenAI itself has been under fire for scraping people's data without permission to train ChatGPT. OpenAI considered this to be fair use — but not everyone agreed (and that's precisely why an Italian watchdog fined them 15 million euros last December).
With that in mind, one could argue that OpenAI is trying to position itself as an advocate of data privacy with this accusation.
While the question of whether DeepSeek inappropriately used OpenAI data remains speculative, there's no doubt that DeepSeek is using distillation to build powerful, efficient models.
Want to read the rest of the story?
If you already have an Educative subscription, you can click over to the Educative Newsletter to read on about:
How DeepSeek really uses distillation
What this controversy tells us about policy and competitive AI
What you need to do to keep up with the AI era
If you don't have an active subscription (and you like what you see), this is your chance to grab an Educative subscription at 50% off this week.
That way you can get industry news, career tips, and breakdowns of hot topics like distillation delivered straight to your inbox — in addition to 1401+ hands-on courses and projects covering in-demand tech skills.