Using Data Lakes as a Transformation Tool

The following excerpt is from The Age of Customer Equity: Data-Driven Strategies to Build a Sustainable Company. Available at Amazon, Barnes & Noble, Porchlight, and your local bookstore.

When we think about companies in the Listening Zone, the fundamental need is to trust the data. But when companies advance to the Learning Zone, the fundamental need is for a fast, connected data system that pulls each piece of customer data together across the organization. That can be a tall order, especially in large companies. 

In the Learning Zone, the database is the primary tool to understand customers. Some companies have advanced to a data lake, a massive landing zone for unprocessed data, including customer data. Let me briefly explain the difference. In the 1990s, when companies installed a database, it was a relational database. To stand up a relational database, the business intelligence or IT team went through a fairly large effort to map out relationships and keys so that specific information could be connected. For example, if you wanted to know which customer bought which product, the join between the customer and product tables would have to have been created in advance. This process could be good for the CFO, who might need the same reports again and again, but it was slow and messy for customer data analysis. No matter how well the relationships were mapped out, the data could change, creating a new need that had not been anticipated.
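To make the customer and product join concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table names, column names, and sample rows are hypothetical and invented for illustration; they do not come from the book. The point is that the keys in the purchases table must exist before the question can be answered.

```python
import sqlite3

def customers_and_products():
    """Return (customer name, product name) pairs from a tiny in-memory database."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE products  (product_id  INTEGER PRIMARY KEY, name TEXT);
        -- The purchases table carries the keys that connect the other two tables.
        CREATE TABLE purchases (
            customer_id INTEGER REFERENCES customers(customer_id),
            product_id  INTEGER REFERENCES products(product_id)
        );
        INSERT INTO customers VALUES (1, 'Ana'), (2, 'Ben');
        INSERT INTO products  VALUES (10, 'Boots'), (11, 'Scarf');
        INSERT INTO purchases VALUES (1, 10), (2, 11), (1, 11);
    """)
    # Which customer bought which product? Answerable only because the keys were mapped.
    rows = conn.execute("""
        SELECT c.name, p.name
        FROM purchases AS pu
        JOIN customers AS c ON c.customer_id = pu.customer_id
        JOIN products  AS p ON p.product_id  = pu.product_id
        ORDER BY c.name, p.name
    """).fetchall()
    conn.close()
    return rows
```

Calling customers_and_products() returns one row per purchase. A question that needs a field outside these three tables would require new mapping work first, which is the delay described above.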

For executives hungry for answers, the long delay in producing this information meant it could easily be obsolete before they received it. Adding to the problem, follow-up questions were often just as delayed. However, some questions are predictable and repetitive (e.g., How many sales did the Mid-Atlantic region have this month?), so data warehouses are still important and useful.

As pressure mounted to answer executive questions quickly and the cost of data storage decreased, the data lake became a popular solution. Unlike the data warehouse, which structures data prior to use, the data lake captures data in raw, unstructured form. This allows the query writer (often a data scientist) to assemble the pieces they need to answer executives’ questions on demand. Using our previous example, adding device type to the customer and product tables can be done by one person in minutes, not days or weeks. A successful data lake stores the wealth of the business in one place and includes a data dictionary that defines the source of record and the processes or calculations used to collect the data.
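As a rough illustration of assembling pieces on demand, the sketch below reads raw purchase events and counts them by device type at query time. The event format, the field names (customer_id, product, device), and the sample records are assumptions made up for this example, not a description of any particular data lake product.

```python
import json
from collections import Counter

# Hypothetical raw events, stored as landed: fields vary from record to record.
RAW_EVENTS = """\
{"customer_id": 1, "product": "Boots", "device": "mobile"}
{"customer_id": 2, "product": "Scarf"}
{"customer_id": 1, "product": "Scarf", "device": "desktop", "coupon": "FALL"}
"""

def purchases_by_device(raw_lines):
    """Count purchase events per device type, applying structure only when reading."""
    counts = Counter()
    for line in raw_lines.splitlines():
        if not line.strip():
            continue  # skip blank lines
        event = json.loads(line)
        # Records that never captured a device are kept and labeled, not dropped.
        counts[event.get("device", "unknown")] += 1
    return dict(counts)
```

No schema change was needed to bring device type into the analysis; the query writer simply reads the field where it exists. The cost is that missing or inconsistent fields have to be handled in every query.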

Customer data, including the unidentified behavior of future customers, breaks down data silos, especially across business units, and encourages alignment around a common external cause. Every department with customer data also touches the customer, but it is unlikely any of them has been asked to interlock its data with the others before. Learners put a governance process in place to gather and create the keys that interlock the data. A key is simply a piece of matching data. It could be a name, email address, or random series of numbers that is identical in two different datasets. Email is often used as a key because an email address is usually unique to one person. Of course, the challenge comes when the customer provides a first name, last name, and email address, but then uses a different email address on the next purchase.
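The idea of a key, and the way a second email address breaks it, can be sketched in a few lines of Python. The two datasets, their field names, and the sample values below are hypothetical; the normalization shown (trimming spaces and lowercasing) is one simple assumption about what counts as the same email.

```python
def normalize_email(email):
    """Trim whitespace and lowercase so trivially different spellings still match."""
    return email.strip().lower()

def match_on_email(crm_records, order_records):
    """Link orders to CRM customers using email as the key.

    Returns (matched, unmatched): matched is a list of (customer name, order id)
    pairs, unmatched is a list of order ids with no CRM record for that email.
    """
    crm_by_email = {normalize_email(r["email"]): r for r in crm_records}
    matched, unmatched = [], []
    for order in order_records:
        key = normalize_email(order["email"])
        if key in crm_by_email:
            matched.append((crm_by_email[key]["name"], order["order_id"]))
        else:
            unmatched.append(order["order_id"])
    return matched, unmatched

# Hypothetical sample data: the same person uses a different email on order A-2.
CRM = [{"name": "Ana Silva", "email": "Ana.Silva@example.com"}]
ORDERS = [
    {"order_id": "A-1", "email": " ana.silva@example.com "},
    {"order_id": "A-2", "email": "asilva@example.org"},
]
```

Here match_on_email(CRM, ORDERS) links order A-1 to Ana Silva but leaves A-2 unmatched, even though both orders belong to the same person. Email alone cannot resolve that case.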

Some companies generate their own specific customer ID and try to match all the different bits of customer data into one ID. Once the keys interlock the data, the next need is to build high-speed data systems which can run queries, reports, and analysis at the customer grain. 

Whether a company runs a database, data lake, or both, the problem that holds Listeners back remains the same. Data has no natural intelligence, so putting it in a bigger pile is of limited use. For example, data doesn’t differentiate between varied uses of the same word. One common example is the term “channel.” A single marketing channel could appear under multiple names and formats (e.g., eMail, e-mail, e mail, email), which is a governance problem. Further, the sales team might use the term “channel” in a completely different context (e.g., direct, wholesale, distributor).
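A minimal sketch of what a governance rule for “channel” might look like in Python follows. The function names and the team-prefix convention are assumptions for illustration only. The first function collapses spelling variants of one label; the second records which team's definition of “channel” a value belongs to.

```python
import re

def normalize_label(label):
    """Lowercase and strip everything except letters and digits."""
    return re.sub(r"[^a-z0-9]+", "", label.lower())

def tag_channel(team, label):
    """Qualify a channel value with the team that owns its definition."""
    return f"{team}_channel:{normalize_label(label)}"

# The four spellings from the text collapse to a single value.
EMAIL_VARIANTS = ["eMail", "e-mail", "e mail", "email"]
NORMALIZED = {normalize_label(v) for v in EMAIL_VARIANTS}
```

With this convention, tag_channel("marketing", "e-mail") and tag_channel("sales", "Wholesale") produce values that cannot be confused with each other, even though both teams call the field “channel.” The rule has to be agreed on by people; the data will not supply it.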

So, the first problem is consistent governance of the data. The second problem is that important context behind the data may not even be present in the raw data. Most data sets are easily missing a hundred points of context. This is why landing unprocessed data in a fast analysis system does not necessarily yield massive insights. But do not panic. More data is not necessarily better, because correlation is not causation.

To use a fun analogy, the data problem faced by companies in the Learning Zone is like a wardrobe malfunction. The belt does not match the shoes, the shirt is all wrong, and what is with that hat? But at least Learners have the clothes to step outside. Listeners in the previous stage are missing their pants completely.
