A recent Cognilytica study states that the greatest challenge faced by most AI/ML teams is data management and optimization. About 50% of the time they spend developing AI goes to training data, while another 15% goes to augmenting datasets to optimize processes around training data. In the long run, these optimizations can help them save a significant amount of money and time.
Time Allocation in Machine Learning Tasks
What is Data-Centric AI?
You’re probably well aware of the “Data-Centric” approach to AI, often referred to as DCAI. A lot of people have tried to define it. At Picsellia, we are aligned with Andrew Ng’s definition.
We think that the key takeaway from his definition is the term “systematically”, which implies that data should always be the first thing to keep in mind when starting a new project.
If you want to follow a Data-Centric AI approach, you should always ask yourself about the quality and quantity of your data before anything else. This means that all questions about your model’s implementation become secondary at best.
What are the most important questions to achieve DCAI?
- Do I have data for my use-case?
- How much data do I have?
- How relevant is the data?
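As a rough illustration, the first two questions can be answered with a simple summary of your annotations. The sketch below is a minimal example, not a Picsellia API: the function name `summarize_dataset` and the `(image_id, label)` record format are assumptions made for illustration.

```python
from collections import Counter


def summarize_dataset(annotations, expected_labels):
    """Summarize a dataset given as (image_id, label) pairs.

    The record format is hypothetical; adapt it to your own annotation files.
    """
    # Distinct images that carry at least one annotation.
    images = {image_id for image_id, _ in annotations}
    # Number of annotations per label.
    per_label = Counter(label for _, label in annotations)
    # Labels the use-case needs but the dataset does not contain at all.
    missing = sorted(set(expected_labels) - set(per_label))
    return {
        "num_images": len(images),
        "num_annotations": len(annotations),
        "per_label": dict(per_label),
        "missing_labels": missing,
    }


if __name__ == "__main__":
    sample = [("img1", "cat"), ("img1", "dog"), ("img2", "cat")]
    print(summarize_dataset(sample, expected_labels=["cat", "dog", "bird"]))
```

A summary like this only covers quantity and label coverage. Judging relevance still requires looking at the images themselves.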
You Need Data Management to be Data-Centric
Following the Data-Centric approach, it becomes obvious that the iteration speed of your computer vision models will be limited by the speed at which you iterate on your data. How quickly you can access the information needed to answer these questions will be a key factor in your development process.
In order to maximize the agility of your organization around your data, an efficient and centralized data management system is a key success factor for your AI projects.
At Picsellia, we are convinced that a successful strategy in computer vision requires appropriate data management. The current complexity of AI lies in its operations and processes and not in model development.
Key Features For An Efficient Data Management Solution
Earlier, we mentioned the importance of centralization in data management. However, this is not the only element to consider when setting up your data management strategy.
The objective is to be able to answer the three aforementioned questions about the quantity, relevance, and quality of your data. To achieve this, you need tools that let you navigate your data as efficiently as possible and extract relevant information. A poll we recently launched on LinkedIn shows that the most sought-after functionality when implementing a data management strategy is data mining.
The advent of cloud object storage technologies (AWS S3, Google Cloud Storage, etc.) has allowed companies to store more and more data at a lower cost. But when working on computer vision use-cases, centralized mass storage poses a major problem of visualization and exploration.
The unstructured character of an image makes navigation in these object-stores very complicated. Thus, one of the major functionalities required for a data management solution dedicated to computer vision is data visualization and exploration.
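To make the exploration problem concrete, here is a minimal sketch of a metadata index that sits next to an object store and makes images searchable by tag and size. It is an in-memory illustration under stated assumptions: the class name `ImageIndex`, the metadata fields, and the object keys are all hypothetical, and a real system would persist the index in a database.

```python
from collections import defaultdict


class ImageIndex:
    """In-memory index over image metadata (hypothetical schema)."""

    def __init__(self):
        self._by_tag = defaultdict(set)  # tag -> set of object keys
        self._metadata = {}              # object key -> metadata dict

    def add(self, object_key, width, height, tags):
        """Register one stored image and the metadata extracted from it."""
        self._metadata[object_key] = {
            "width": width,
            "height": height,
            "tags": frozenset(tags),
        }
        for tag in tags:
            self._by_tag[tag].add(object_key)

    def search(self, tags=(), min_width=0):
        """Return the sorted keys of images that have all tags and are wide enough."""
        if tags:
            # Intersect the key sets of every requested tag.
            candidates = set.intersection(
                *(self._by_tag.get(tag, set()) for tag in tags)
            )
        else:
            candidates = set(self._metadata)
        return sorted(
            key for key in candidates
            if self._metadata[key]["width"] >= min_width
        )


if __name__ == "__main__":
    index = ImageIndex()
    index.add("raw/line1/0001.jpg", 1920, 1080, ["defect", "line1"])
    index.add("raw/line1/0002.jpg", 640, 480, ["line1"])
    print(index.search(tags=["line1"], min_width=1000))
```

The point of the design is that queries run against the small metadata index, never against the image bytes in the object store.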
Then comes the traceability and versioning of your data. To let your organization reproduce and analyze your work, it is essential to keep a history of how your data has been used.
The development of computer vision models requires a total mastery and a 360° vision of the data that was used to create them. To sum it up, you need to be able to answer the following questions:
- What data was used to train model X?
- Which dataset was used in experiment Y?
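One way to make these two questions answerable is to fingerprint each dataset version and log which fingerprint every experiment and model used. The sketch below is a minimal illustration, not how any particular platform implements versioning: `dataset_fingerprint` and `LineageLog` are hypothetical names, and the log lives in memory only.

```python
import hashlib
from dataclasses import dataclass, field


def dataset_fingerprint(files):
    """Return a hex digest identifying a dataset version.

    `files` maps file names to their bytes content (an assumed format).
    The digest does not depend on the order of the mapping.
    """
    digest = hashlib.sha256()
    for name in sorted(files):
        digest.update(name.encode("utf-8"))
        digest.update(b"\0")  # separator between name and content hash
        digest.update(hashlib.sha256(files[name]).digest())
    return digest.hexdigest()


@dataclass
class LineageLog:
    """Records which dataset version each experiment and model used."""

    experiments: dict = field(default_factory=dict)  # experiment -> fingerprint
    models: dict = field(default_factory=dict)       # model -> experiment

    def record_experiment(self, experiment, fingerprint):
        self.experiments[experiment] = fingerprint

    def record_model(self, model, experiment):
        self.models[model] = experiment

    def dataset_for_experiment(self, experiment):
        return self.experiments[experiment]

    def dataset_for_model(self, model):
        return self.experiments[self.models[model]]


if __name__ == "__main__":
    version = dataset_fingerprint({"a.jpg": b"\x01", "b.jpg": b"\x02"})
    log = LineageLog()
    log.record_experiment("experiment-Y", version)
    log.record_model("model-X", "experiment-Y")
    print(log.dataset_for_model("model-X") == version)
```

Because the fingerprint is derived from file contents, any change to the data produces a new version identifier, so a logged experiment always points to exactly the data it was run on.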
Wrapping up
To make a Data-Centric strategy succeed, you need an efficient data management system. Centralizing data in a cloud storage solution such as AWS S3 is not enough on its own. You will also need visualization, search, and indexing functionalities to be able to answer the fundamental questions of a Data-Centric strategy.