Before data can be fed into a model or algorithm, it needs to be cleansed. This is usually one of the most cumbersome steps in the process – it can take up to 80% of a data scientist’s time. It is a crucial step, however. Make sure to reserve enough resources for it, or work with a partner who can help.
“Garbage in, garbage out” is an age-old principle of computer science that holds true for AI as well. The quality of your AI is only as good as the quality of the data you feed into it.
Imagine teaching your child to speak English. Suppose you make up every fifth word, and you mispronounce every tenth word. Your child will certainly learn a creative variation of English. But that variation will probably not get your child very far with classmates in the schoolyard!
The same goes for healthcare data, which is often extremely noisy. For example, two physicians might annotate the same tumor in different ways. This variability introduces label noise that can seriously degrade an algorithm being trained to detect tumors. Similar inconsistencies exist in the way information is recorded in EMRs, leading some authors to speak of an electronic Tower of Babel.
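One way to gauge how serious such disagreement is, before any training begins, is to measure inter-annotator agreement. The short Python sketch below computes Cohen’s kappa, one common agreement metric among several; the physician labels are invented purely for illustration.

```python
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa: agreement between two annotators, corrected for chance."""
    n = len(labels_a)
    # Observed agreement: fraction of items both annotators label identically.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected chance agreement, from each annotator's label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if p_e == 1:  # both annotators always chose the same single label
        return 1.0
    return (p_o - p_e) / (1 - p_e)

# Hypothetical per-lesion labels from two radiologists reading the same scans.
physician_1 = ["malignant", "benign", "malignant", "benign", "malignant"]
physician_2 = ["malignant", "malignant", "malignant", "benign", "benign"]
print(f"kappa = {cohen_kappa(physician_1, physician_2):.2f}")  # kappa = 0.17
```

A kappa near 1 means the annotators agree far beyond chance; a value near 0, as in this toy example, is a warning that the labels must be reconciled before they can serve as ground truth.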
Removing this noise and establishing a so-called “ground truth” is critical before training a model or algorithm. If you combine data from different systems, you need to normalize the data first, to ensure that equivalent annotations actually refer to the same thing. The input that goes into the model or algorithm needs to be consistent; otherwise, its output will be unpredictable.
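To make that normalization step concrete, here is a minimal Python sketch. The raw spellings and the canonical vocabulary are invented for illustration; real healthcare projects typically map annotations onto established terminologies such as SNOMED CT or ICD.

```python
# Invented examples of how two source systems might spell the same finding.
CANONICAL = {
    "tumor, malignant": "malignant_tumor",
    "malignant neoplasm": "malignant_tumor",
    "benign lesion": "benign_tumor",
    "benign tumour": "benign_tumor",
}

def normalize(raw_label: str) -> str:
    """Map one raw annotation onto the shared ground-truth vocabulary."""
    key = raw_label.strip().lower()
    if key not in CANONICAL:
        # Fail loudly: an unmapped label is exactly the kind of noise
        # that must be reviewed, not silently fed into training.
        raise ValueError(f"Unmapped annotation: {raw_label!r}")
    return CANONICAL[key]

records = ["Malignant Neoplasm", "benign lesion ", "Tumor, Malignant"]
print([normalize(r) for r in records])
# ['malignant_tumor', 'benign_tumor', 'malignant_tumor']
```

Note that unknown labels raise an error rather than pass through quietly: an annotation that cannot be reconciled is precisely the inconsistency that makes a model’s output unpredictable.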