Hunting for Bad Data

30-minute Talk

Defects in our data are as important as the defects in our code.

Timetable

11:10 a.m. – 11:40 a.m. Wednesday 6th

Room

Room F2 - Track 2: Talks

Audience

Developers, testers

Key-Learnings

  • Working with bad data causes trillions of dollars of losses each year.
  • Bad data is very context specific. You need to know your domain inside out to define it.
  • Implement data validity checks before write operations.
  • Learn how to build a system that will recover from bad data.
  • Simple algorithms can go a long way and most of the times they do good enough job compared to machine learning.

A Practitioner’s Guide to Self Healing Systems

In 2013, the total amount of data in the world was 4.4 zettabytes. In 2020 it is estimated to be 10 times more. With speed advances and miniaturization of computers, it is now easier than even to collect all sorts of data in vast amounts. But quantity does not equate quality. Some of the most expensive software defects were caused by handing an incorrect data. The most recent example is the crash of the Schiaparelli Mars Lander.

Back to Earth, at Falcon.io, we investigate every exception that is generated from our production environment. We were surprised to find out that 19% are caused by bad data. This includes missing data, data with the wrong type, truncated data, duplicated data, inconsistent format etc. As our core business is to collect and process and analyze data from the biggest social networks, this finding was a major wakeup call.

“Jidoka” is a term that comes from lean manufacturing. It means to give the machines “the ability to detect when an abnormal condition has occurred”. We wanted to go one step further and create a simple, yet robust system ability to not only recognize bad data but also to fix it automatically so that it can heal itself. As it turns out, in some cases, it’s faster/easier to fix bad data than to locate the code that produced it. This talk is a practical one, containing ideas and tips that you can quickly implement when dealing with data quality.

Related Sessions

Tue, Nov 05 • 11:10 a.m. – 11:40 a.m.
Room F1 - Track 1: Talks

30-min New Voice Talk

Thu, Nov 07 • 10:25 a.m. – 10:55 a.m.
Room F3 - Track 3: Talks

30-minute Talk

Mon, Nov 04 • 9:00 a.m. – 5:00 p.m.

Full-Day Tutorial (6 hours)

Wed, Nov 06 • 9:10 a.m. – 9:55 a.m.
Room F1+F2+F3 - Plenary

45-minute Keynote