Data: Five myths laid bare
So, your boss just attended a virtual tech conference, and she came back with a dangerous, terrifying, and exhilarating idea planted in her head by one of those canny system integrators…
“Data is the new oil,” she says. “We sit on a shed-load of data here at ACME,” she continues, eyes glimmering. “Machine learning will replace 97% of our workforce.” Lenny in HR shivers and doesn’t know why. “They are all doing it!” Now you are getting worried.
What comes next, if you are in charge of data in your company, is inevitable: “What is our Data Science / Machine Learning / Big Data / cloud migration strategy, and why have you never told me about this before?”
You think of the seven PowerPoint slides you presented to leadership last quarter, describing a well-delineated data-architecture strategy in descending order of complexity. But they never got beyond slide 2.
It is at that point that you need to land your key messages, strike while the iron is hot, and start the data revolution.
1. Big Data is not about big data, it’s about larger and cheaper containers
Data architects have always hated throwing away data, but retaining the most granular and the largest datasets was too expensive and process-intensive. Besides, changing the data model felt like performing open-heart surgery every time.
The big data revolution is the result of many converging factors:
- Disk space is virtually free
- Computing power is cheaper and on tap
- Machine learning and AI are finally feasible on commodity hardware
- Relational is not the only solution to data problems anymore: document, key-value, columnar, and graph databases are mature and production-ready
Big data is the ability to process large amounts of data, on commodity hardware, in a flexible environment, to solve problems, both simple and hard, that would have been too expensive or impossible to tackle with purely relational technologies.
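As a toy illustration of the model families listed above, here is the same hypothetical customer record in four shapes. All names and values are invented; this is a sketch of the data shapes, not of any particular product.

```python
# The same hypothetical fact ("customer c42 is ACME Ltd, in London")
# modelled in four common database shapes.

# Relational: a row in a table with a fixed, pre-declared schema
row = ("c42", "ACME Ltd", "London")            # (id, name, city)

# Document: a self-describing, nestable JSON-like object
document = {"id": "c42", "name": "ACME Ltd",
            "address": {"city": "London"}}

# Key-value: an opaque value looked up by a single key
kv_store = {"customer:c42": '{"name": "ACME Ltd", "city": "London"}'}

# Graph: facts as (subject, predicate, object) edges
graph = [("c42", "is_a", "Customer"),
         ("c42", "has_name", "ACME Ltd"),
         ("c42", "located_in", "London")]
```

Each shape trades structure for flexibility in a different place, which is why "which database?" depends on the question being asked.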
2. Big Data is not an IT problem, it’s an answer to a business question
Big-data business cases are, most of the time, drafted by IT architects; those professionals have little interest, inclination, or ability to identify hard business benefits. IT professionals (let’s come clean, I am one) want to play with the new shiny toys and leave the business case to the business.
That’s the main reason why data strategies usually describe a multi-year project that concentrates ever-larger Volumes of a great Variety of datasets at great Velocity: the three Vs of big data. This is wrong, and it is the main reason why most “data lake” projects fail.
You should start from a business question, identify the smallest dataset that is likely to hold the answer, select the right database and algorithm, and deliver value, fast. You’ll end up with a small data lake, a manageable architecture, and a business that sees tangible, quick results and is ready to invest more!
3. Metadata and data modelling are even more critical than before
Relational databases are schema-on-write: you need to know what your database model looks like before you can create it. This usually leads to well-documented, well-planned databases, at the expense of speed and flexibility.
Big data appliances are schema-on-read. You can write data without a schema and leave it to the reader to model the data as they please.
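A minimal sketch of the schema-on-read idea, in plain Python with JSON lines (all field names are invented): the writer stores whatever it has, with no upfront schema, and each reader imposes its own model at read time.

```python
import io
import json

# Schema-on-read sketch: write heterogeneous records with no upfront
# schema -- note the second record has an extra field the first lacks.
raw = io.StringIO()
for record in ({"id": 1, "name": "Ada"},
               {"id": 2, "name": "Grace", "dept": "R&D"}):
    raw.write(json.dumps(record) + "\n")

# The "schema" only appears here, when a reader decides how to
# interpret the raw records for its own purpose.
raw.seek(0)
rows = [json.loads(line) for line in raw]
by_dept = [(r["id"], r.get("dept", "unknown")) for r in rows]
```

The flexibility is real, but so is the risk: nothing in the store itself tells the next reader what `dept` means or which records have it, which is exactly why the metadata discipline below matters.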
The temptation to write data into the “data lake” without documenting the model or retaining the metadata is strong. Unfortunately, companies that give in to it end up with data swamps: very quickly the data turns dirty and polluted, and nobody knows what lurks beneath.
When it comes to big data, metadata and data modelling are the most critical pieces of the puzzle.
4. Relational databases are here to stay
Nobody wants to pay large maintenance fees to established software vendors for relational databases. However, there are areas where a relational database is the answer!
Transactional environments, with hundreds or thousands of small transactions per second and strict consistency requirements, are better served by relational databases. Forty years of evolution have given us amazing technology that has a clear place in the market today.
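As a toy illustration of that strict consistency, here is an atomic transfer using Python's built-in sqlite3 module (the accounts and amounts are invented); the point is the all-or-nothing transaction, not the particular engine.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id TEXT PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)",
                 [("alice", 100), ("bob", 50)])
conn.commit()

try:
    # One atomic transaction: either both updates commit, or neither does.
    with conn:
        conn.execute(
            "UPDATE accounts SET balance = balance - 30 WHERE id = 'alice'")
        conn.execute(
            "UPDATE accounts SET balance = balance + 30 WHERE id = 'bob'")
except sqlite3.Error:
    pass  # on any failure the whole transfer rolls back automatically

# The invariant (total money in the system) holds either way.
total = conn.execute("SELECT SUM(balance) FROM accounts").fetchone()[0]
```

This transactional guarantee is exactly what many schema-on-read stores relax in exchange for scale, and why "just keep the relational database" is often the right call for this workload.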
When it comes to defining a data architecture strategy, sometimes the right answer is: “Just keep your current appliance.” You will have a robust, battle-tested and secure environment. If it ain’t broke, don’t fix it.
5. Linked data is the lightweight solution to bringing it all together
Like AI, Linked Data has historically suffered from the lack of appliances powerful enough to harness its benefits. Today, we have the opportunity to use it to build a truly revolutionary data strategy.
Linked Data lets you use the same standards to store data, models, and metadata in the same place. Moreover, it lets you programmatically use the schema to navigate, transform, and publish your datasets. Linked Data is to databases what the World Wide Web is to documents: a clever and simple idea (another one from Sir Tim Berners-Lee) that lets you store your data in open-source technology and will put an end to messy data migrations and software upgrades, once and for all!
Store your data and metadata together; the wheel has already been invented, so just use it!
One bonus tip…
No matter how cutting-edge, simple to use, cheap, future-proof, and beautiful your solution is, when all is said and done, Miriam from accounting will tell you: “It’s exactly what I need and more, it makes my life easier, and I am going to be so much more productive. I only have one request: can I have an export button for Excel?”