Data Preparation and Structure for Customer Identity Building using Machine Learning

The number of marketing channels a customer can use to engage with a brand grows every year. With the advent of IoT, the number of devices a customer interacts with has multiplied as well. As a result, the most valuable customer data is often trapped in silos that marketers find difficult to access.

Impenetrable silos, scattered customer data, and the proliferation of devices and channels together compound the problem of ‘Customer Identity’. To solve this problem, marketers have adopted technology solutions in the form of applied analytics, predictive models, and machine learning.

To explain how machine learning is applied to customer identity building and resolution, we have split the material into two articles. This is the first, and it explains what we do with data coming into FirstHive from different sources.

We will be focusing on:

  • The data components that contribute to high data quality
  • Data preparation

Before we dive deep into the application of machine learning to customer identity, there are other steps that improve the quality of the algorithms we rely on. These are the data components that add structure and cohesiveness to a Customer Identity. Using the right data infrastructure and aligning large data sets also contribute to building the customer identity.

Data Components and Structure

Unique Identifier

Unique identifiers are often also referred to as customer data identity criteria. They are judged by their accuracy, precision, reliability, and scalability. Imagine you have 7 different data sets streaming into your CDP from other databases. How would you tell whether a customer in one database is also present in another? This problem is resolved by using a ‘unique identifier’ attributed to each customer.

Each database provides a different set of unique identifiers. 

Personally Identifiable Information (PII): Common examples include user IDs, email IDs, national IDs, and credit card numbers.

Anonymous IDs: Unique identifiers can also take anonymous forms that are not provided by the customer but are collected by different marketing systems and interfaces. These include third-party cookie IDs, device IDs, CRM IDs, serial numbers, hashed identifiers, IP addresses, and offline IDs that enrich the customer profile.
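
As an illustration of the question raised above (is a customer in one database also present in another?), here is a minimal Python sketch that joins two sources on a normalized, hashed email. The record layouts, field names (crm_id, cookie_id), and the choice of SHA-256 are assumptions made for the example, not a description of how FirstHive implements it.

```python
import hashlib

def normalize_email(email: str) -> str:
    """Trim whitespace and lowercase so trivially different spellings compare equal."""
    return email.strip().lower()

def email_key(email: str) -> str:
    """SHA-256 hex digest of the normalized email, used here as a join key."""
    return hashlib.sha256(normalize_email(email).encode("utf-8")).hexdigest()

# Hypothetical records from two source systems.
crm_records = [
    {"crm_id": "C-1", "email": "Ana@Example.com "},
    {"crm_id": "C-2", "email": "bob@example.com"},
]
web_records = [
    {"cookie_id": "ck-9", "email": "ana@example.com"},
    {"cookie_id": "ck-7", "email": "carol@example.com"},
]

# Index one source by the key, then look up each record of the other source.
crm_by_key = {email_key(r["email"]): r for r in crm_records}
matches = [
    (crm_by_key[email_key(w["email"])]["crm_id"], w["cookie_id"])
    for w in web_records
    if email_key(w["email"]) in crm_by_key
]
```

Only the customer whose email appears in both sources (after normalization) ends up in matches. An exact key like this only works when both systems hold the same identifier, which is why a profile usually carries several identifiers at once.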

Data Connectors

Every data point pertaining to a customer sits in a standalone database or marketing system. Extracting every relevant data point from each system and attributing it to the customer ID is the role of a data connector. Multiple identifiers for a given customer profile are stored together and then used as internal identifiers. These internal identifiers are matched with one another to form the data connectors.

Data connectors also support other processes that are part of data ingestion and data preparation, such as data matching, clustering, blocking, pairing, and scoring. These micro-processes are specific to identity resolution.
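
To make blocking, pairing, and scoring more concrete, here is a small Python sketch. It assumes hypothetical records with name and zip fields, blocks on zip, and scores name similarity with the standard library's difflib.SequenceMatcher. The 0.85 threshold is an arbitrary choice for the example; a production system would typically use a trained model rather than a single string-similarity score.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations

# Hypothetical customer records from different sources.
records = [
    {"id": 1, "name": "Jon Smith", "zip": "10001"},
    {"id": 2, "name": "John Smith", "zip": "10001"},
    {"id": 3, "name": "Maria Lopez", "zip": "94105"},
    {"id": 4, "name": "Jon Smith", "zip": "94105"},
]

def block_by(recs, key):
    """Blocking: group records by a cheap key so only records in the same block are compared."""
    blocks = defaultdict(list)
    for rec in recs:
        blocks[rec[key]].append(rec)
    return blocks

def candidate_pairs(recs, key="zip"):
    """Pairing: yield every pair of records that share a block."""
    for block in block_by(recs, key).values():
        yield from combinations(block, 2)

def score(a, b):
    """Scoring: name similarity in [0, 1] based on matching character runs."""
    return SequenceMatcher(None, a["name"].lower(), b["name"].lower()).ratio()

scored = [(a["id"], b["id"], score(a, b)) for a, b in candidate_pairs(records)]
likely_matches = [(a, b) for a, b, s in scored if s >= 0.85]
```

Blocking keeps the number of comparisons manageable: records 1 and 4 have identical names but are never compared because they fall into different zip blocks, which also shows the trade-off in choosing a blocking key.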

Data Schema

This is the blueprint of how each source structures its data. It helps in capturing the right data from the right hive. Unlike relational or event-stream databases, profile-based databases use these data schemas to create enriched customer profiles. CDPs update the data connectors whenever any of the schemas changes.
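
One simple way to picture a per-source schema is as a mapping from each source's field names to the field names used in the profile. The Python sketch below assumes two hypothetical sources ("crm" and "web") and invented field names; it only illustrates the idea and is not FirstHive's schema format.

```python
# Hypothetical per-source schemas: source field name -> profile field name.
SOURCE_SCHEMAS = {
    "crm": {"EmailAddress": "email", "FullName": "name", "CustomerNo": "crm_id"},
    "web": {"user_email": "email", "cookie": "cookie_id"},
}

def to_profile_fields(source: str, record: dict) -> dict:
    """Rename the fields a schema knows about and drop everything else."""
    mapping = SOURCE_SCHEMAS[source]
    return {mapping[field]: value for field, value in record.items() if field in mapping}

crm_row = {"EmailAddress": "ana@example.com", "FullName": "Ana", "Unused": 1}
web_row = {"user_email": "ana@example.com", "cookie": "ck-9"}

crm_fields = to_profile_fields("crm", crm_row)
web_fields = to_profile_fields("web", web_row)
```

With a mapping like this, a schema change at a source (a renamed column, for example) is handled by updating one entry rather than the code that builds profiles.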

Data Preparation

From ingesting data, to qualifying it into a customer profile, to activating the customer identity for marketing activities, machine learning is applied everywhere. Better data quality helps machine learning algorithms enrich customer data more effectively, and so we begin with the process of Data Preparation.

Data ingested into FirstHive is put through the following steps before it reaches the common database. The common database is where the creation of customer profiles and IDs begins.

Define Data Connectors: The first step is to create data connectors that enable the ingestion of data from as many sources as possible. Data arrives in loads of different sizes: large bulk loads, historical datasets, and incremental data loads.

Data Normalization: Data redundancy is reduced, ideally to the point of complete elimination, using a combination of unique identifiers. This transforms all relational and event-streaming databases into profile-based databases.

Profile Databases: These are an indispensable part of the FirstHive CDP architecture. They not only store PII and other information labeled by the system for basic profiling, but also capture interactions, purchase behavior, customer history, intent, and so on across all configured sources. The consolidated data is stitched into the customer profile.

Amend Data Source Schemas: Schemas change as a result of data normalization exercises. When the process focuses on PII data, the relevant connectors are updated to re-calibrate the customer identity into an enriched version.
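
The Data Normalization step above can be sketched in Python as collapsing redundant event-style rows into one profile per unique identifier. The rows, the use of email as the identifier, and the profile layout are assumptions made for the example.

```python
# Hypothetical event-stream rows: customer attributes are repeated on every row.
rows = [
    {"email": "ana@example.com", "name": "Ana", "event": "signup"},
    {"email": "ana@example.com", "name": "Ana", "event": "purchase"},
    {"email": "bob@example.com", "name": "Bob", "event": "signup"},
]

def to_profiles(event_rows):
    """Collapse repeated rows into one profile per email, keeping events in arrival order."""
    profiles = {}
    for row in event_rows:
        profile = profiles.setdefault(
            row["email"],
            {"email": row["email"], "name": row["name"], "events": []},
        )
        profile["events"].append(row["event"])
    return profiles

profiles = to_profiles(rows)
```

Each customer's attributes are now stored once, and the interactions hang off the profile instead of duplicating the customer data on every row.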

Once the data is prepared, a CDP uses the ingested data for customer profile creation. Each customer identity is mapped to a customer profile using deterministic and probabilistic mapping. This runs through an end-to-end process called Uniqui-fication, which involves the use of machine learning for ID stitching, data clean-up, semantic tagging, data matching with blocking, clustering, and data comparison.
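
The deterministic side of ID stitching can be illustrated with a short Python sketch: records that share any identifier value are linked, and links are followed transitively using a union-find structure. The records and field names are hypothetical, and this covers exact matches only; the probabilistic, machine-learning-based mapping described above is not shown.

```python
def stitch(records):
    """Group record indexes that share any identifier value, directly or transitively."""
    parent = list(range(len(records)))

    def find(i):
        # Follow parent links to the group's root, shortening the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    def union(i, j):
        root_i, root_j = find(i), find(j)
        if root_i != root_j:
            parent[max(root_i, root_j)] = min(root_i, root_j)

    first_seen = {}  # (field, value) -> index of the first record carrying it
    for idx, rec in enumerate(records):
        for field, value in rec.items():
            if value is None:
                continue  # missing identifiers never link records
            key = (field, value)
            if key in first_seen:
                union(idx, first_seen[key])
            else:
                first_seen[key] = idx

    groups = {}
    for idx in range(len(records)):
        groups.setdefault(find(idx), []).append(idx)
    return sorted(groups.values())

# Hypothetical identifier records from different sources.
records = [
    {"email": "ana@example.com", "device_id": "d1"},
    {"device_id": "d1", "cookie_id": "ck-9"},
    {"cookie_id": "ck-9", "email": None},
    {"email": "bob@example.com", "device_id": "d2"},
]
stitched = stitch(records)
```

Here the first three records end up in one group even though the first and third share no identifier directly: they are connected through the device ID and cookie ID carried by the second record.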

Use of Data Preparation in Machine Learning

Supervised machine learning algorithms work best when provided with high-quality data sets. The Data Preparation process builds large volumes of high-quality data before the initial customer profile and ID creation begins. Unfortunately, high-quality data sets are not available in raw form, hence the need to understand the different components of the data structure and how the data is cleansed into ready-to-use inputs for a machine learning activity.