The era of modern AI started with the rise of big data. Once you have large amounts of logged structured data, be it clicks on the products in an online store, or time spent on a certain webpage in a browser, or percentage of paid credits in a bank, data science steps in.
However, in reality, the data is often either not structured or, even worse, does not exist at all.
For example, a voice assistant will only learn to activate correctly after its model analyses thousands of hours of speech recorded by different voices, in different accents, amid background noise. Likewise, a search engine will only learn to rank the most relevant sites at the top after “seeing” millions of pairs of user queries and web documents, judged by the relevance of the match.
All the magic and power of artificial intelligence has a natural glass ceiling. And this ceiling is training data.
Y-Data, an advanced education program for data scientists, has managed to assemble some of the most active voices from the Israeli AI & ML community to speak on this exciting agenda.
The Data-Centric approach is the next frontier of the AI world, but it has its challenges and barriers. Data acquisition and annotation are a challenge we are all too familiar with, and as companies scale and require more data for larger models, this challenge becomes more and more painful. High-precision annotations are hard to come by, particularly for tasks that could benefit from 3D context; this is where synthetic data comes in, addressing some of these pains.
In this talk Lotem presents the hidden power of synthetic annotations for CV tasks, the challenges of combining these with human annotations, and how pixel-perfect labels can help you break the annotation barrier for multiple CV use cases.
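To make the idea of pixel-perfect labels concrete: when you generate the scene yourself, the segmentation mask comes for free and is exact by construction, with none of the boundary noise of human-drawn polygons. A minimal, hypothetical sketch (the shapes and names here are illustrative, not from the talk):

```python
# Minimal sketch of synthetic annotation: render a scene and get an
# exact, pixel-perfect segmentation mask as a by-product.
import numpy as np

def render_circle(h, w, cy, cx, r):
    """Render a grayscale image of a filled circle.
    Returns (image, mask); the mask is exact by construction."""
    yy, xx = np.mgrid[0:h, 0:w]
    mask = (yy - cy) ** 2 + (xx - cx) ** 2 <= r ** 2
    # Bright object on a dark background.
    image = np.where(mask, 200, 30).astype(np.uint8)
    return image, mask

image, mask = render_circle(64, 64, cy=32, cx=32, r=10)
print(mask.sum())  # exact object area in pixels, zero annotation noise
```

Real synthetic-data pipelines do the same thing with 3D renderers, where depth, occlusion, and instance identity are all known exactly at render time.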
We all know that models are only as good as the data we feed them. However, building quality datasets is inherently difficult for many reasons. Gong is lucky enough to have an abundance of data and a group of dedicated labelers readily available. Nevertheless, dataset creation is something on which the company spends a significant portion of its time. And, as Gong supports more and more languages, this aspect becomes even more important. How do we label data efficiently in multiple languages? How do we perform error analysis in a language we don’t understand? In this talk, Inbal Horev, an NLP team lead at Gong, presents some of the data-centric challenges Gong faced and the processes the company set in place to solve them.
Modern AI systems consist of a series of steps from the general idea to production, and most of them involve some kind of data manipulation: from data collection for training to test-case creation for quality control. In her talk, Olga Megorskaya, CEO of Toloka AI, shares how AI companies can use data labeling platforms to build large-scale AI systems with high-quality data. She demonstrates cases for data acquisition, data labeling, model quality assessment, and others.
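A core step any labeling platform performs is aggregating overlapping judgments from several annotators into a single label. The simplest form is a majority vote; the following sketch is a hypothetical illustration (item names and function are ours, not Toloka's API):

```python
# Hypothetical sketch of crowd-label aggregation by majority vote.
from collections import Counter

def majority_vote(labels_per_item):
    """labels_per_item maps an item id to the list of labels it
    received from different annotators. Returns the winning label
    per item (ties broken by first occurrence)."""
    return {item: Counter(labels).most_common(1)[0][0]
            for item, labels in labels_per_item.items()}

votes = {
    "img_1": ["cat", "cat", "dog"],
    "img_2": ["dog", "dog", "dog"],
}
print(majority_vote(votes))  # {'img_1': 'cat', 'img_2': 'dog'}
```

Production platforms go further, weighting each annotator's vote by an estimated skill level, but the majority vote captures the basic idea of turning redundant labels into quality.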