Data is vital to the digital economy, as vital as red and white blood cells are to your body. Which makes dataflow the equivalent of your blood flow. What happens if you’re fed bad data? Corporate blood poisoning? You bet!
The question may seem rhetorical, but it isn’t. What we’re seeing these days is very real: the digital economy is devouring bad data to the extent that it’s on the verge of self-poisoning.
Here’s the thing: About 10 years ago the world suddenly discovered that the slogan ‘data is the new oil’ was real. The analogy worked well along several axes (value, process, importance, flow and so on) and created understanding. C-level executives and boardrooms around the world suddenly became aware of an (apparently) new resource: the data behind the graphs and tables they’d been using for years had been underutilized. There was more value to be extracted.
‘Save everything’ became the new slogan – empowered by the extremely low cost of data storage. Zillions of terabytes poured into huge data lakes, to the extent that cloud vendors had trouble finding enough disk drives. It is a bonanza that continues to this day. And while no one was paying attention, even the best-planned data lakes turned into data swamps.
Then AI became a hot issue and the corporate world got even more obsessed with saving everything. Enthusiastic (and quite naive) boardroom PowerPoint slides peddled the new-found value: “It’s food for our AI systems, gaining more understanding, more insight, more value”, and the flood continued. Meanwhile, many of us kept stating what seemed obvious – that drowning is more likely than finding any value in these oceans of data – to no avail.
As we’re approaching 2023, the situation is becoming critical and more dangerous than ever. We’re feeding advanced (as in smart) AI systems garbage and trusting the output. We’re hiring data experts (scientists, engineers, analysts, testers, quality assurance staff and more) head over heels without knowing what to expect from them or what qualifications they really need. Which means that the good ones leave fast, while the not-so-good ones stay and exacerbate the poisoning. Not because they’re bad but because they lack guidance. They deliver what they think management needs. And quite frequently they deliver the answers management expects instead of offering new insight.
If you thought that’s exactly what data science and analysis were supposed to avoid, you’re right – but here’s the thing: Data science is as much about algorithms and models as it is about data. The engineers, scientists, analysts etc. spend as much time, maybe more, tweaking models as they do cleaning, filtering, combining and quality-testing data. Tweaking the data will change the results. Tweaking the models and algorithms will also change the results. So there are no absolutes, only a lot of randomness and a lot of trust involved.
“Torture the data, and it will confess to anything.”
... which just about says it all. Don’t mistake data for hard facts unless you know exactly what you have. Which is rarely the case.
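To make that concrete, here is a minimal sketch in Python of how the same numbers can "confess" to very different things. Everything in it is made up for illustration: the `orders` values, the `drop_outliers` cleaning rule and its `factor` threshold are assumptions, not anyone's real pipeline.

```python
from statistics import mean, median

# Hypothetical daily order values. The 9_999.0 entry could be a genuine
# bulk order or a data-entry error; the data alone cannot tell us which.
orders = [102.0, 98.0, 110.0, 95.0, 9_999.0, 105.0]


def drop_outliers(values, factor=3.0):
    """Keep only values at or below `factor` times the median.

    This is an arbitrary cleaning rule chosen for illustration.
    """
    cutoff = factor * median(values)
    return [v for v in values if v <= cutoff]


# Same "model" (the mean), two different treatments of the data.
raw_mean = mean(orders)
clean_mean = mean(drop_outliers(orders))

# Same raw data, a different "model" (the median).
raw_median = median(orders)

print(f"mean of raw data:     {raw_mean}")
print(f"mean of cleaned data: {clean_mean}")
print(f"median of raw data:   {raw_median}")
```

Three defensible choices, three different "typical" order values. None of them is wrong in itself; which one is right depends entirely on knowing what that one odd record really is.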
Raieli’s post is a timely reminder that ‘data professionals’ – scientists, engineers etc. – quite often focus more on models than on data, because (my take) the former is much more interesting. Data is generally boring, and data lakes (and data swamps) are even more boring: the more garbage, the less interesting. Something Mark Twain, to whom the following line is commonly attributed, knew more than 100 years ago:
“Data is like garbage. You’d better know what you are going to do with it before you collect it.”
The point Twain so brilliantly conveys is that hoarding data – like we’ve been doing for more than 10 years on the assumption that it may become valuable – is worse than useless. It’s detrimental. Thus the blood analogy above: Assuming there is value in all this data is outright dangerous. Even if there is some value (which may well be the case), finding it and making use of it is so costly and risky that we’re better off deleting the data and draining the swamps.
Which in many companies becomes the first and most important task of a data professional: Delete data, rationalize expectations, clean the bloodstream. And not least: Create a new understanding of data. There is good data, bad data and unknown data. The last is the most dangerous. Delete!