AI: Driven by data

The second in a series of blog posts by AI expert Dr Jeroen Vendrig

The technology responsible for AI’s recent breakthrough is data-driven machine learning. In this post we lift the veil on what ‘data-driven’ means, and why it’s important to understand even if you are just a user of AI. The terminology used in the AI world is borrowed from everyday language, but be aware that the meanings of the terms differ in subtle ways.

Humans can learn by experience, for example through on-the-job training. The concept of learning without an explicit knowledge transfer from man to machine has inspired the data-driven learning approach. Imagine the simple act of catching a thrown ball. Predicting where the ball is going to go takes a mix of physics formulas. We give medals to those few high school students who can work out those equations during their three-hour final exams. Yet we expect kids in kindergarten to catch the ball in a split second. These kids probably don’t even understand the concept of gravity, other than “falling is ouch”.

Youngsters don’t hit the books; they learn by example. So does supervised machine learning. A health AI may receive an example of a patient aged 30, weighing 60 kg, who recovered from surgery in 3 days. Another patient aged 40, weighing 80 kg, recovered in 4 days, and so on. With enough examples, a machine learning algorithm trains a model that takes the characteristics of a new patient (age 32, weight 72 kg) and predicts a recovery time of, say, 4.32 days. A prediction (inference) for a particular patient could be wrong, just as a kid won’t catch all balls. But a successful algorithm will be approximately right most of the time.
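To make “learning by example” concrete, here is a minimal sketch in Python. It is not the method behind any real health AI: it uses a deliberately simple technique (nearest neighbours, which averages the recovery times of the most similar known patients), and apart from the two patients mentioned above, all the data is made up for illustration. The function name and the choice of three neighbours are assumptions too, and the sketch will not reproduce the 4.32 days quoted above.

```python
import math

# Training examples: (age in years, weight in kg) -> recovery time in days.
# The first two rows come from the text; the rest are invented for illustration.
TRAINING_DATA = [
    ((30, 60), 3.0),
    ((40, 80), 4.0),
    ((25, 55), 2.5),
    ((35, 75), 4.0),
    ((50, 90), 5.5),
    ((45, 70), 4.5),
]


def predict_recovery_days(age, weight, k=3):
    """Average the recovery times of the k most similar training patients."""
    by_distance = sorted(
        TRAINING_DATA,
        key=lambda example: math.dist(example[0], (age, weight)),
    )
    nearest = by_distance[:k]
    return sum(days for _, days in nearest) / len(nearest)


# Inference for a new patient the model has never seen.
print(predict_recovery_days(32, 72))
```

Note that no formula for recovery time was ever written down. The prediction comes entirely from the examples, which is exactly why the quality of those examples matters so much.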

Data-driven machine learning is powerful, but there is a catch: it’s only as good as the data it was fed. A model trained on children’s hospital data may not work well in the geriatric ward. If blood pressure is a key factor, but it’s not recorded in the data set, the model has no choice but to ignore it. If the data set contains many characteristics that are barely or not at all related to recovery time, the model may find spurious correlations by coincidence.

Machine learning methods are designed to cater for the possibility that a new case is not exactly the same as one of the cases in the training data set. However, the better the distribution of characteristics used for training reflects the real world, the more accurately the AI will perform. That’s why “big data” has been one of the foundations for the rise of AI. Just remember that “big” is not all about size, but also about coverage and variety.
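One crude way to spot a coverage problem is to check whether a new case falls inside the range of values seen during training. The sketch below does that for the children’s hospital scenario. The data, the feature names and both helper functions are hypothetical, and a simple range check is only a first warning sign: it says nothing about how well the values inside the range are covered.

```python
# Hypothetical training set from a children's hospital: (age in years, weight in kg).
TRAINING_FEATURES = [(4, 16), (7, 23), (9, 30), (12, 41), (15, 55)]
FEATURE_NAMES = ("age", "weight")


def feature_ranges(rows):
    """Return the (min, max) seen in training for each feature column."""
    return [(min(column), max(column)) for column in zip(*rows)]


def outside_training_range(patient, ranges):
    """List the names of features whose value falls outside the training range."""
    return [
        name
        for name, value, (low, high) in zip(FEATURE_NAMES, patient, ranges)
        if not low <= value <= high
    ]


ranges = feature_ranges(TRAINING_FEATURES)
print(outside_training_range((10, 35), ranges))  # a child similar to the training data
print(outside_training_range((82, 68), ranges))  # a patient from the geriatric ward
```

The first patient raises no flags. The second is flagged on both age and weight, a sign that any prediction for that patient rests on data the model has never seen.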

In the above example, we showed input values (age, weight) that are available both at training time and at inference time, when a new patient is encountered. Each patient’s set of input values is called a sample or instance. The output variable (recovery time) is available at training time only, and is known as the target for the sample. At inference time, the challenge is to predict that value.
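The difference between training time and inference time can be sketched as a small data structure. The class and field names below are illustrative assumptions, not standard terminology from any library.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Sample:
    age: int                               # input value, known at training and inference time
    weight_kg: float                       # input value, known at training and inference time
    recovery_days: Optional[float] = None  # target, known at training time only


def has_target(sample):
    """True if the sample carries a known outcome and can be used for training."""
    return sample.recovery_days is not None


training_sample = Sample(age=30, weight_kg=60, recovery_days=3)
new_patient = Sample(age=32, weight_kg=72)  # target unknown: this is what we predict

print(has_target(training_sample), has_target(new_patient))
```

The new patient has the same input fields as the training sample, but the target is empty. Filling it in is the model’s job.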

Data is the key ingredient for machine learning, but how does an AI know what outcomes we expect it to extract from the data? In the next post, we’ll discuss how problem definitions can go horribly wrong.

Quick quiz

In this post’s example a surgery recovery time is predicted. What type of AI task is that? (See answer in blog post 1.)