Syntax and Script

Come on a tech journey with me.
How Data Is Gotten For Training An AI Model 

Chika O., July 8, 2025 (updated August 11, 2025)

How AI gets its training data, the hidden labor behind it, and the current controversies over AI training data

I’ve been obsessed with AI lately.

NO! I wasn’t living under a rock before now. I just have this pull to be more AI-aware, learn as much as I can about AI without needing to venture into it as a career. 

And this is my second wave of obsession, by the way.

When ChatGPT was first released, before it was connected to the internet, I tried a lot of things on it. But it was very limited in its abilities. For context, this was 2022, but its knowledge of the world and output were limited to 2021. Fast forward to now, with new models, model updates, and being hooked to the internet, you can even discover the latest news via ChatGPT. In fact, it has replaced Google search for many of its users. It is that good!

That kind of leap in ability didn’t happen magically. It happened because of data.

By the way, LLMs are not built to be used as search engines.

That AI model answering your questions, generating pictures that make you look like you’re in a commercial from just four lines of prompt: all of it starts and gets better with data. Mountains of data. Labeled by humans, scraped from the internet, sometimes without permission. And now, the data owners are fighting back.

  • How is AI training Data Gotten and Used?
  • Big Data Annotation and Labelling Companies
  • Scale AI’s Controversy
  • The Conversation About Personal Data Concerns
  • AI vs Copyright Law
  • Cloudflare Steps In
  • Conclusion: What This Means For You


Trending in the data branch of AI are data-startup controversies, copyright disputes, and a company that handles roughly 20% of the internet’s traffic stepping in like a bouncer.

This post aims to dissect the role of data in AI, especially in light of recent developments in the data community, such as Scale AI’s CEO being poached by Meta. Well, I suppose it’s not really poaching if you own the company. It includes an overview of the controversies surrounding Scale AI, and the news that their competitor, Surge AI, may be even better.

Okay! Now that the intro is out of the way, let’s dive in!

How is AI training Data Gotten and Used?

Before an AI system is deployed, it needs to be trained on a large amount of data. This data is manually tagged and identified by humans in a process known as data labeling or annotation. Data annotation is the backbone of the artificial intelligence revolution.

So, once an AI model is designed with a specific purpose in mind, it is then trained. But not on just any data. The data has to be:

  • Relevant to the AI’s goal: Say you’re building a finance AI; you train it on financial data. A legal AI? Legal documents. A general-purpose model? Books, websites, forums, and any useful human data that will help it generate human-like responses.
  • Clean: The data has to be cleaned so the model isn’t confused and doesn’t make avoidable mistakes when deployed. The quality of the training data makes or breaks the AI model.
  • Labeled: Most importantly, the data has to be labeled so the model understands what it is seeing. The labels and tags are identifying markers for the raw data.

Humans, like you and me, are paid to tag, label, and categorize raw data manually, so the AI can learn from it.  
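To make “labeled” concrete, here is a minimal sketch of what annotated data can look like. The text samples and label names are made up for illustration, not taken from any real dataset:

```python
# Raw data: unlabeled text collected from somewhere.
raw_samples = [
    "The stock closed 3% higher after strong earnings.",
    "The defendant filed a motion to dismiss the case.",
]

# After human annotation, each sample carries an identifying label
# so the model knows what it is looking at. Labels are illustrative.
labeled_samples = [
    {"text": raw_samples[0], "label": "finance"},
    {"text": raw_samples[1], "label": "legal"},
]

for sample in labeled_samples:
    print(f'{sample["label"]}: {sample["text"]}')
```

The label is the teaching signal: without it, the model sees only raw text with no idea what category it belongs to.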

Let me give a vivid example of how this process plays out. Shoutout to Professor Andrew Ng for this example; I got it from his AI For Everyone course.

Say you want to build an AI that helps farmers identify and kill weeds with precision on a farm.


Step one: You have to teach the model what a weed is. 

You first collect thousands of images of weeds, crops, pests, even dirt. You might have to gather these manually because they don’t already exist in a database somewhere, meaning you and your team will have to go to the farm and take the pictures. All the pictures are your raw data.

Step two: You label every single image.
“This is maize. That’s a weed. This is a healthy plant. That’s an infected one.” 

Step three: Once the AI learns to tell them apart, you then teach it what to do when it finds a weed, maybe activate a micro-spray to kill it. This takes a lot of trial and error in a lab.


Step four: Then you test it in the field. First in a controlled environment. Then in the wild.

And you shouldn’t wait until your system is perfect to release it. It has to be just good enough; then you release it to your beta users. They’re the ones who will test your model’s vulnerabilities in ways you and your team would never imagine.
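Steps one and two above can be sketched in code. This is a toy illustration of recording human annotations; the file names, label set, and CSV format are all hypothetical:

```python
# Toy sketch of steps one and two: gathering raw images from the field,
# then recording a human label for each one. All names are made up.
import csv

LABELS = {"maize", "weed", "healthy", "infected"}  # the allowed tags

def record_label(image_name: str, label: str, out_csv: str = "labels.csv") -> None:
    """Append one human annotation: image filename -> label (step two)."""
    if label not in LABELS:
        raise ValueError(f"unknown label: {label}")
    with open(out_csv, "a", newline="") as f:
        csv.writer(f).writerow([image_name, label])

# An annotator tags one photo at a time:
record_label("plot3_row12.jpg", "weed")
record_label("plot3_row13.jpg", "maize")
```

The resulting CSV of image–label pairs is exactly the kind of file a training pipeline would later consume.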

Every time the model sees new data, it learns.

AI gets better through continuous exposure to feedback and new situations. Even OpenAI collects data from ChatGPT users to train and fine-tune its models further. Your upvotes and downvotes on ChatGPT’s responses act as feedback.
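The feedback loop can be pictured with a toy snippet. This is just the concept, not OpenAI’s actual pipeline; the function and field names are invented:

```python
# Toy sketch of logging thumbs-up/thumbs-down feedback for later fine-tuning.
feedback_log = []

def record_feedback(prompt: str, response: str, vote: int) -> None:
    """Log one user reaction. vote is +1 (upvote) or -1 (downvote)."""
    if vote not in (1, -1):
        raise ValueError("vote must be +1 or -1")
    feedback_log.append({"prompt": prompt, "response": response, "vote": vote})

record_feedback("Identify this plant.", "That looks like maize.", 1)
record_feedback("Identify this plant.", "That is a cactus.", -1)

# Later, the preferred responses can be sampled as new training signal:
preferred = [entry for entry in feedback_log if entry["vote"] > 0]
```

In spirit, your votes help sort responses into “more like this” and “less like this” piles for future training rounds.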

However, keep in mind that the AI’s quality depends heavily on the quality of the initial training data. 

If there’s anything I have taken away from my foray into AI, it is this: if this were the creation story, we’ve gone past the part where God (humans, in this case) is sketching what the creation should look like. We’ve gone past gathering the clay. In fact, the clay is being mixed, but we are still shaping the 3D frame, the skeleton that will support the clay. We are still building AI’s brain up to the level of intelligence it needs to accomplish the things it’s being hyped for.

Big Data Annotation and Labelling Companies

Because data annotation and labelling are so tedious, most companies outsource this part of the AI development process to companies like Scale AI and Surge AI, two of the more prominent data-for-AI companies.

These companies specialize in creating AI-ready datasets. They have massive teams, tools, and processes to:

  • Source data
  • Label it accurately
  • Clean it up
  • Deliver it in formats AI engineers can use instantly

Essentially, they’re like the scaffolding builders for most big AI companies. 

Sometimes, they also subcontract the work to a fourth party that does the real grunt work of labelling and tagging. Remember the case of Sama in Kenya? That’s a whole story in itself about how data labelling companies operate. It’s practically an industry culture.

Scale AI’s Controversy

Let’s talk about Scale AI, the data-labeling startup founded by Alexandr, without the e, Wang.

Recently, Scale AI made headlines again after Meta’s $14.3 billion investment for a 49% stake. This time, it is being scrutinized over how it handles data and labor.


There are concerns about:

  • Whether some of the training data was ethically sourced
  • The working conditions of annotators, especially in outsourced labor markets: low wages and a lack of mental health support for traumatized staff. There’s a lot more to this than this post can cover.
  • Concerns about publicly exposing client data

Even one of Scale’s co-founders, Lucy Guo, caught heat online for promoting an “anti work-life-balance” hustle culture.
I’ll leave that debate for another post.

To put these controversies into perspective, Google was one of Scale’s biggest clients before Meta’s investment made them withdraw. There are also concerns that Google’s newest AI video generator, Veo 3, was trained on the data these contractors worked on, and that poor-quality training data is probably what allows Veo 3 to generate content most AI models are trained to refuse, or at least decline when users request it.

Meanwhile, Surge AI, a Scale AI competitor, is marketing itself as the “ethical alternative” using more structured, high-quality data labeling practices.

Since Meta’s investment, Scale AI and Surge AI have become the Coke vs. Pepsi of the data industry.

The Conversation About Personal Data Concerns

When the app is free, your personal data is the payment. In this digital age, even before the AI boom, data has been the new oil. And one of the ways we freely allow our data to be harvested or sold is by signing up for free apps. Because nothing is free. The owners of free apps are not running charities; they build these apps for profit. One of the ways they generate revenue is by selling your data, either for targeted advertising or for AI training purposes.

Do you remember the viral aging app that launched in 2019? It was one of a kind at the time, before TikTok filters.

I’m not going to say its name, but it’s the one where you upload an image of yourself to download an aged version of you for a tiny fee (I remember it being around ₦1,500 or so).

Everyone jumped on it, including celebrities.

Well, the filter was probably not the main product, even though you paid for it.
Your picture was probably used to train an AI model without your permission.

I keep saying probably because we can’t say for sure that these things happened until there is evidence, but we know they happen, and the companies involved never confirm their part in these allegations. So, until proven guilty, everything is “allegedly.”


And allegedly, every picture you uploaded became part of a dataset to help an AI model learn faces, aging patterns, skin tones… etc.

These days, apps don’t even ask you to pay.
Have you ever wondered why?


Because your data is more valuable than your money.

AI vs Copyright Law

Let’s get into the legal mess.

In the past year, artists, writers, and major publishers have sued AI companies, claiming their content was scraped without consent.
Entire books, blogs, images fed into models without permission.

Some of these court cases are still ongoing, and the outcomes could redefine:

  • What counts as “fair use” in training AI
  • Whether AI companies owe royalties for using public data
  • And how openly the internet can be used to train future models

I wanted to add that, due to mounting pressure for sustainable and ethical AI building, some AI companies have signed licensing agreements to pay for articles and data from some publishers’ websites. But the ones doing it this way are a very small fraction of the entire industry.

Cloudflare Steps In

And now Cloudflare has entered the scene.

Cloudflare is the company that makes you “Verify you are human,” especially when you’ve been browsing a little too much.

They just launched a tool that lets website owners charge AI bots for accessing their content.
They’re calling it a “permission-based approach to crawling.”
Meaning: If an AI wants to scrape your site, it has to ask first, and maybe pay to do so.
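The idea behind permission-based crawling can be sketched as a simple server-side check. This mimics the concept only, not Cloudflare’s actual implementation; the bot names below are real crawler user agents, but the “paid” list is entirely hypothetical:

```python
# Concept sketch: known AI bots that haven't arranged payment get
# HTTP 402 (Payment Required) instead of the page content.
KNOWN_AI_BOTS = {"GPTBot", "ClaudeBot", "CCBot"}  # example crawler user agents
PAID_BOTS = {"GPTBot"}  # hypothetical: this bot has a crawl agreement

def handle_crawl(user_agent: str) -> int:
    """Return the HTTP status code for an incoming request."""
    if user_agent in KNOWN_AI_BOTS and user_agent not in PAID_BOTS:
        return 402  # ask (and pay) before crawling
    return 200  # regular visitors and paying bots get the content
```

The key design choice is the 402 status code, a long-dormant part of HTTP that was reserved for exactly this kind of “payment required” handshake.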

It’s a major power shift in the data battle, and it could change the dynamics of compensation for data use in training AI. 

So far, the relationship between AI companies and content owners has been parasitic. Even the sites whose data was used to train these models have lost significant portions of their traffic to the same AI they helped train.

Case in point: Stack Overflow, a question-and-answer platform for software developers. After it was used to train AI models, programmers now prefer to ask AI models their debugging questions, even when AI debugging tools aren’t already built into their development environments.

Agentic AI, which is AI that acts on an individual’s behalf based on commands, poses an even bigger threat to publishers who depend on direct visits from their readers. Cloudflare’s pay-per-crawl tool is being positioned to save the day.

Conclusion: What This Means For You

So yeah!  Data is the new oil, but just like oil, how you get it matters.

AI doesn’t just wake up one day and know things.
It learns from us: our blogs, our tweets, our faces, and our behavioral patterns online.

As the fight over compensation for data use heats up, we’re all going to feel the impact, and hopefully, AI companies will do better in the future.

So, let me know:
If you own a website, would you allow it to be used to train an AI model in exchange for cash?
