Data Cascades, AI for Football, and Protein Generation
Faulty dataset practices have particularly devastating downstream effects on digital technology initiatives across all industries. These so-called data cascades are framing the future of dataset management across the globe.
What's new
Recent research from Google finally formalizes the importance of data in Machine Learning Systems. Too often, companies prioritize building highly complex models instead of work related to the data itself. This is strange, as data is the fuel of data science: it is the major factor that impacts the performance, fairness, robustness, and scalability of ML systems.
The paper is the first to measure and discuss data cascades: compounding events in which data issues cause negative downstream effects and build up technical debt over time. Bad data practices often surface later in the production process as system-level issues and a loss of user trust.
Following in the footsteps of Andrew Ng, these researchers suggest that AI should become more data-centric and less model-centric. Recently, Landing AI and DeepLearning.AI launched a data-centric AI competition aiming to elevate data-centric approaches to improving model performance.

Why it matters
Data cascades can have a devastating effect on Machine Learning systems. In fact, out of all AI initiatives that fail, 87% do so because of faulty data practices. The most common example is deploying a model trained on noise-free data in a noisy environment, where it often turns out to be less accurate than it was during development. For this reason, it is incredibly important to retrain your model with new curated data from the production environment. Often, this is done using a human-in-the-loop strategy.
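The train-versus-production gap described above can be illustrated with a small simulation. This is only a sketch under stated assumptions: the 1-D readings, the nearest-centroid classifier, and the way production noise is modeled (a constant offset plus Gaussian jitter) are all hypothetical choices made for illustration, not something taken from the paper.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is reproducible


def sample(n, offset=0.0, jitter=0.0):
    """Draw n labelled 1-D readings: class 0 centres on 0, class 1 on 4.

    offset and jitter are hypothetical stand-ins for production distortion.
    """
    data = []
    for _ in range(n):
        label = random.randint(0, 1)
        value = random.gauss(4.0 * label, 0.5) + offset + random.gauss(0.0, jitter)
        data.append((value, label))
    return data


def fit_threshold(data):
    """Nearest-centroid rule in 1-D: the midpoint between the class means."""
    mean0 = statistics.mean(v for v, y in data if y == 0)
    mean1 = statistics.mean(v for v, y in data if y == 1)
    return (mean0 + mean1) / 2


def accuracy(threshold, data):
    """Share of readings classified correctly by the threshold."""
    hits = sum((v > threshold) == (y == 1) for v, y in data)
    return hits / len(data)


clean_train = sample(2000)
clean_test = sample(2000)
production = sample(2000, offset=1.5, jitter=1.0)  # noisy environment
curated = sample(2000, offset=1.5, jitter=1.0)     # newly labelled production data

stale = fit_threshold(clean_train)   # model trained on noise-free data
retrained = fit_threshold(curated)   # model retrained on curated production data

acc_clean = accuracy(stale, clean_test)
acc_stale = accuracy(stale, production)
acc_retrained = accuracy(retrained, production)

print(f"clean test, original model:  {acc_clean:.3f}")
print(f"production, original model:  {acc_stale:.3f}")
print(f"production, retrained model: {acc_retrained:.3f}")
```

In this toy setup the original model loses accuracy once the readings are distorted, and retraining on curated production samples recovers most of it. Real systems will behave differently depending on the kind of noise involved.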
The second most common data cascade occurs when practitioners rely on their technical data science expertise and completely disregard the domain of application. Certain kinds of information are of paramount importance when building statistical or machine learning models. Very often, domain expertise is absolutely necessary to understand and manage the data fed to the model.
Providing empirical evidence of data cascades as well as a formalization of the concept itself is an incredibly important step for the Machine Learning community.
As AI adoption becomes more widespread, it is essential to raise awareness about potentially life-changing data practices and to incentivize data excellence in parallel. These considerations will have pivotal effects on ensuring that AI initiatives deliver on their promised business benefits.
What's next
How do you address data cascades? The paper's authors suggest a four-step approach to tackling data cascades within your organization:
- Evaluate your data goodness: track standardized metrics such as phenomenological fidelity, validity, etc.
- Incentivize working on data: reward dataset maintenance, collection, labelling, and cleaning within your organization.
- Foster transparent data collaboration: involve technical experts as well as domain experts and data collectors.
- Support the open dataset market: help out lower income countries by sharing datasets online to address data inequalities around the globe.
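As a starting point for the first step, a validity metric can be as simple as the share of records that pass a set of explicit rules. The sketch below is an assumption about what such a check might look like, not the paper's definition: the function name validity_report, the toy records, and the two rules are all hypothetical.

```python
def validity_report(records, rules):
    """Return the share of records passing each rule, and passing all rules."""
    total = len(records)
    report = {
        name: sum(1 for r in records if rule(r)) / total
        for name, rule in rules.items()
    }
    report["all_rules"] = (
        sum(1 for r in records if all(rule(r) for rule in rules.values())) / total
    )
    return report


# Hypothetical toy records, made up purely for illustration.
records = [
    {"age": 34, "label": "healthy"},
    {"age": -2, "label": "healthy"},   # impossible age
    {"age": 51, "label": None},        # missing label
    {"age": 47, "label": "sick"},
]

# Hypothetical validity rules; real rules should come from domain experts.
rules = {
    "age_in_range": lambda r: r["age"] is not None and 0 <= r["age"] <= 120,
    "label_present": lambda r: r["label"] in {"healthy", "sick"},
}

report = validity_report(records, rules)
print(report)
```

Tracking numbers like these over time makes it easier to notice when a dataset starts to drift away from what the domain experts consider valid.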
Are you an ML practitioner? Check out the guidelines on data collection and evaluation for the ML community on PAIR, which were created thanks to the discussed paper.
Recent breakthroughs in sports analytics, particularly in football, demonstrate the business potential of simple data acquisition systems combined with state-of-the-art AI techniques for all industries alike.
What's new
Recent research from top institutions such as DeepMind is advancing sports analytics like never before. The availability of sports data is increasing in quantity, quality, and granularity. Thanks to technological progress in sensors and acquisition systems, analysts can now access fine-grained sports data and leverage it with state-of-the-art algorithms. Below is an example of predictive modeling in football. Such algorithms have immense business potential, as they could help teams develop data-driven strategies for different tournaments, leagues, and competitions.

Why it matters
The application of modern AI algorithms has the potential to revolutionize sports across many different axes. The advances carry immense potential for players, coaches, scouts, fans, and broadcasters alike.

For example, data scientists are using algorithms to predict player injuries. A football player can expect between 2.4 and 9.4 injuries per 1,000 hours of exertion. In this scenario, having access to an algorithm that tells you when to train (and when not to) in order to maximize recovery is a potential career-saver.
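To put the injury rate in perspective, it can be converted into an expected number of injuries for a given workload. The sketch below is simple arithmetic on the figures quoted above; the 300 hours of exertion per season is a hypothetical value chosen only for illustration.

```python
def expected_injuries(rate_per_1000_hours, exposure_hours):
    """Expected injury count for a given exposure, assuming a constant rate."""
    return rate_per_1000_hours * exposure_hours / 1000


season_hours = 300  # hypothetical workload, not a figure from the article

low = expected_injuries(2.4, season_hours)
high = expected_injuries(9.4, season_hours)

print(f"Expected injuries over {season_hours} hours: {low:.2f} to {high:.2f}")
```

Under that assumption, a player would expect somewhere between roughly 0.7 and 2.8 injuries per season, which shows why even a modest reduction in risk matters.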
Zooming out of the sports scenario, these recent advances demonstrate the incredible potential of AI applications in a modern era where data collection has become relatively cheap and accessible. In fact, the use of computer vision in combination with sensors and statistical learning is not restricted to sports. Manufacturing, distribution, and all other steps in your product's value chain can be modernized with the new data that you're capturing.
What's next
The future of sports is guaranteed to include more AI-driven analytics. As datasets increase in size and models gain in performance, the dependency on AI in sports analytics is set to increase.
The lesson learned from the sports industry lies in the immense potential of combining multiple, diverse data acquisition systems with state-of-the-art Machine Learning techniques. It also shows that you do not have to rely solely on existing datasets or on the way they were acquired: data sensors are more accessible and cheaper than ever.
Have an AI use-case in mind for your company? Verify its technical feasibility by contacting us today.
New research shows impressive progress in viable protein sequence generation. For healthcare and pharmaceutical companies alike, this new technology could boost R&D output by a large margin.
What's new
Researchers from Sweden have recently developed ProteinGAN, a Machine Learning network that has learned how to create and process different natural protein sequences. Their publication in Nature Machine Intelligence "demonstrates the potential of artificial intelligence to rapidly generate highly diverse functional proteins within the allowed biological constraints of the sequence space". Proteins are the workhorses of life. Found in all living organisms, each protein's unique sequence of amino acids determines its 3D structure and hence its functions and properties.
A Generative Adversarial Network (GAN) is a model that consists of two adversarial modules: a generator and a discriminator. The generator uses a random process to generate samples (in this case, protein sequences), while the discriminator's job is to judge whether each sample is real (in this case, a natural protein sequence taken from a database) or generated. The two modules are trained against each other: the discriminator is rewarded for telling real samples from generated ones, and the generator for producing samples that the discriminator accepts as real, which here pushes it toward sequences that resemble functional natural proteins.
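The adversarial setup can be sketched through its two training objectives. The snippet below computes the standard GAN losses (with the commonly used non-saturating generator loss) from discriminator scores. It is a minimal illustration, not ProteinGAN's implementation: the article does not describe the actual losses or architecture, and the function names and example scores are hypothetical.

```python
import math


def discriminator_loss(real_scores, fake_scores):
    """Binary cross-entropy for the discriminator.

    Each score is the discriminator's estimated probability that a sample
    is real. The loss is low when real samples score near 1 and generated
    samples score near 0.
    """
    real_term = sum(-math.log(p) for p in real_scores) / len(real_scores)
    fake_term = sum(-math.log(1.0 - p) for p in fake_scores) / len(fake_scores)
    return real_term + fake_term


def generator_loss(fake_scores):
    """Non-saturating generator loss.

    The loss is low when the discriminator scores generated samples near 1,
    i.e. when the generator manages to fool it.
    """
    return sum(-math.log(p) for p in fake_scores) / len(fake_scores)


# Hypothetical scores: a confident discriminator versus an undecided one.
confident_real, confident_fake = [0.9, 0.8], [0.1, 0.2]
undecided_real, undecided_fake = [0.5, 0.5], [0.5, 0.5]

print("D loss, confident:", discriminator_loss(confident_real, confident_fake))
print("D loss, undecided:", discriminator_loss(undecided_real, undecided_fake))
print("G loss, confident D:", generator_loss(confident_fake))
print("G loss, undecided D:", generator_loss(undecided_fake))
```

The two losses pull in opposite directions: whatever lowers the generator's loss (generated samples scoring closer to 1) raises the discriminator's, which is what makes the training adversarial.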

Why it matters
Creating functional protein sequences is extremely difficult. In fact, even the slightest perturbations in protein sequences can make a protein non-functional (meaning it has unwanted or harmful effects). This new technique has the potential to allow healthcare and pharmaceutical companies to boost the efficiency of their protein research. In fact, going from computer design to working proteins is possible in just a few weeks!
For other industries, this new development reiterates the power of generative algorithms and their incredible potential for business cases. In fact, the added value of having the capacity to generate realistic photographs, video, or audio is immense in departments such as Marketing or Sales and industries ranging from luxury to real estate.
Impressively, the techniques used in this particular GAN are extremely similar to widely used Natural Language Processing (NLP) techniques. This shows that the bridge between linguistics and genetics is a bit narrower than what we might perceive.
What's next
Following the recent AI breakthrough in protein folding prediction, the ProteinGAN research continues to show the impact Machine Learning algorithms have on R&D within the healthcare and pharmaceutical industries. In fact, the future of AI for protein research is looking very promising. Research on the classification and generation of viable sequences, such as ProteinGAN, TAPE, and ProtTrans, attests to the importance of the topic in recent months.
While techniques such as GANs and Transformers are often used in research, it's sometimes difficult to translate such projects into real-life business applications. Visium's extended business and technical expertise with highly complex Machine Learning systems allows its partners to put one foot in front of the other regarding AI adoption, and thus place them light-years ahead of the competition.