Artificial Intelligence Lifecycle: A Brief Introduction
AI projects typically have three main parts. The first is the training data, which must be stored, managed, cleaned, and so on. Next, that data is used to train Deep Learning models, for which we iteratively run experiments to optimize performance metrics. Finally, when the results are good enough, we deploy the model to production so the business application can use it, either at the edge or in the cloud.
But it's not over...
Deep Learning models are created by learning from real-world data. However, as the world changes, the data changes during the lifetime of the model. This is why models have to be retrained.
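To make the retraining trigger concrete, here is a minimal sketch of a drift check, assuming a single numeric feature (image brightness) and a simple mean-shift score. The function names, the sample values, and the threshold of 2.0 are all hypothetical choices for illustration; real projects usually rely on more robust statistical tests.

```python
from statistics import mean, stdev

def drift_score(train_values, live_values):
    """Shift of the live mean, measured in training standard deviations."""
    return abs(mean(live_values) - mean(train_values)) / stdev(train_values)

def needs_retraining(train_values, live_values, threshold=2.0):
    """Flag the model for retraining when the shift exceeds the threshold.

    The default threshold is an arbitrary assumption, not a recommended value.
    """
    return drift_score(train_values, live_values) > threshold

# Hypothetical brightness values seen at training time.
train_brightness = [0.48, 0.52, 0.50, 0.47, 0.53, 0.51, 0.49, 0.50]

# Hypothetical production data: one batch similar to training, one much darker.
live_similar = [0.49, 0.51, 0.50, 0.52]
live_darker = [0.31, 0.29, 0.33, 0.30]

print(needs_retraining(train_brightness, live_similar))
print(needs_retraining(train_brightness, live_darker))
```

The first batch stays close to the training distribution, while the darker batch moves far away from it, which is the kind of change in the world that makes retraining necessary.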
As you may understand by now, an AI model's performance depends heavily on its data. That's why everyone has been talking about "Data-Centric AI" in 2021.
In this article, we will walk you through the history of Data-Centric AI and where it's going in the coming months and years! 🚀
Model-Centric AI: Those Days Are Gone
The iterative focus on the model described above is how AI has been practiced for many years.
In the words of Andrew Ng, AI systems are composed of Code + Data, where Code is the model, programmed with frameworks in Python, C++, R, etc. The challenge for research labs around the world was, for a given benchmark dataset such as COCO, to create a model architecture that would perform better and become the state of the art.
This is called a model-centric approach: keeping the data fixed and iterating over the model and its parameters to improve performance.
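As a rough illustration of that loop, the sketch below keeps a tiny dataset fixed and only iterates over one model parameter. The dataset, the threshold "model", and the candidate values are hypothetical; in a real project the loop would run over architectures and hyperparameters instead.

```python
# Toy model-centric loop: the dataset never changes, only the model parameter does.
# Every name and number below is a hypothetical illustration.

# Fixed dataset of (feature, label) pairs.
DATASET = [(0.1, 0), (0.3, 0), (0.4, 0), (0.6, 1), (0.7, 1), (0.9, 1)]

def predict(x, threshold):
    """The 'model': predict class 1 when the feature exceeds the threshold."""
    return 1 if x > threshold else 0

def accuracy(threshold, dataset):
    """Fraction of examples the model classifies correctly."""
    correct = sum(predict(x, threshold) == y for x, y in dataset)
    return correct / len(dataset)

# Iterate over the model parameter while the data stays untouched.
candidate_thresholds = [0.2, 0.5, 0.8]
scores = {t: accuracy(t, DATASET) for t in candidate_thresholds}
best_threshold = max(scores, key=scores.get)

print(scores)
print(best_threshold)
```

Note that nothing in this loop ever questions whether the labels in the dataset are right; all the effort goes into the model side.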
Sure, it was amazing for us ML engineers to have easy access to new and better models on GitHub and to be able to build the best model for our project. For a lot of machine learning engineers, it gave us the feeling that, after studying ML theory so hard, we were finally applying this science and trying to create something powerful.
What characterized this period is that data collection was a one-off task, performed at the beginning of the project. The goal was perhaps to grow the dataset over time, but little thought was given to its intrinsic quality.
Models were usually deployed at a small scale; a single server or device could handle all the load, and monitoring was barely a concern.
But the biggest hurdle was that everything was done manually: data cleaning (which is fairly normal), model training, validation, deployment, storage, sharing, and more.
It was obvious that there was a problem that needed to be solved. However, at that time, solutions such as big ML platforms were either nonexistent or too complicated for the majority of organizations to adopt.
From Model-Centric to Data-Centric AI
Times have changed, and some influential people in the field, such as Dr. Andrew Ng, started proposing new paradigms for model optimization, this time by focusing on data.
This approach is now called Data-Centric. You may have seen those words on a lot of startup websites, and they can have different meanings and applications, so I will start by introducing the concept.
A data-centric approach is when you systematically change or enhance your datasets to improve the performance of the model. This means that, contrary to the model-centric approach, the model is fixed this time and you only improve the data. Enhancing the dataset can mean different things: taking care of the consistency of the labels, sampling the training data carefully, or choosing the batches wisely. It does not always mean an effort to increase the dataset size.
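Label consistency is one of the easier enhancements to sketch in code. The example below assumes each item was labeled by several annotators and keeps a label only when a strict majority agrees; everything else is flagged for human review. The function name, the item IDs, and the majority rule are hypothetical choices, not a standard procedure.

```python
from collections import Counter

def consolidate_labels(annotations):
    """Keep majority labels and flag inconsistent items.

    annotations: dict mapping an item ID to the list of labels it received.
    Returns (consensus, to_review), where consensus maps item IDs to the
    agreed label and to_review lists the items without a strict majority.
    """
    consensus, to_review = {}, []
    for item_id, labels in annotations.items():
        top_label, top_count = Counter(labels).most_common(1)[0]
        if top_count * 2 > len(labels):  # strict majority of annotators
            consensus[item_id] = top_label
        else:
            to_review.append(item_id)
    return consensus, to_review

# Hypothetical annotations from several annotators.
annotations = {
    "img_001": ["cat", "cat", "cat"],
    "img_002": ["cat", "dog", "cat"],
    "img_003": ["dog", "cat"],
}

consensus, to_review = consolidate_labels(annotations)
print(consensus)
print(to_review)
```

The dataset does not get any bigger here. It only gets more consistent, which is the kind of change a data-centric iteration makes while the model stays the same.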
As an example of how models trained on benchmark datasets can be improved, a study showed that, on average, 3.4% of the data in those datasets was mislabeled (and mislabeling can take many different forms). Imagine the performance gain possible by bringing this number down to 0!
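To get a feel for why label errors matter, here is a toy sketch in which the model stays fixed (a 1-nearest-neighbor classifier) and only the training labels change. The data is invented and the noise rate is deliberately exaggerated (2 wrong labels out of 10), so the numbers say nothing about real benchmarks; they only show the mechanism.

```python
def nearest_neighbor_predict(x, train_x, train_y):
    """Fixed 'model': return the label of the closest training point."""
    nearest = min(range(len(train_x)), key=lambda i: abs(train_x[i] - x))
    return train_y[nearest]

def accuracy(train_x, train_y, test_x, test_y):
    """Fraction of test points whose predicted label matches the true label."""
    hits = sum(
        nearest_neighbor_predict(x, train_x, train_y) == y
        for x, y in zip(test_x, test_y)
    )
    return hits / len(test_x)

# Hypothetical 1-D dataset: class 0 below 5, class 1 from 5 upward.
train_x = [float(i) for i in range(10)]
clean_labels = [0] * 5 + [1] * 5

# Same data with two labeling mistakes (an exaggerated noise rate).
noisy_labels = list(clean_labels)
noisy_labels[2] = 1
noisy_labels[7] = 0

# Test points sit right next to each training point, with correct labels.
test_x = [i + 0.25 for i in range(10)]
test_y = clean_labels

noisy_accuracy = accuracy(train_x, noisy_labels, test_x, test_y)
clean_accuracy = accuracy(train_x, clean_labels, test_x, test_y)
print(noisy_accuracy, clean_accuracy)
```

The model code is identical in both runs. The whole difference in accuracy comes from fixing two labels.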
But focusing this much on the data has a prerequisite. Deployed models can collect the data they make predictions on, so data flows in continuously, and keeping up with it means that you have automated all the processes behind the model lifecycle, from training through validation to deployment.
This discipline is called MLOps (for Machine Learning Operations). If you'd like to learn more about MLOps and its most important concepts, you can check out the first article of our series here.