
The Stages of Data Management in a Headless CDP

Luca Ricci

These days, companies are increasingly investing in new technologies capable of collecting and aggregating various types of data to make work more agile and efficient. However, when using a complex technological stack, significant challenges often arise, such as:

  • The inability to manage data centrally because it’s organized in silos that don’t communicate with each other, resulting in duplicated information and increased maintenance for cleaning and updating.
  • The creation of incomplete and unreliable audience segments due to the dispersion of information.
  • The difficulty, if not impossibility, of integrating different platforms, leading to issues in activating data for creating personalized and effective marketing campaigns.

These challenges drive many companies to adopt a Customer Data Platform (CDP).

A CDP can collect, unify, and manage first- and third-party data in a single data warehouse, giving organizations a 360-degree view of their customers. This provides the business intelligence they need to increase sales, retain customers, and make data-driven strategic decisions.

In a fascinating episode of the Humans of Martech podcast, Michael Katz recalls the 8 essential steps that make up a Customer Data Platform, as described by Arpit Choudhury in his series of articles on Customer Data Platforms:

  1. Customer Data Infrastructure
  2. ETL (Extract, Transform, Load)
  3. Storage
  4. Identity Resolution
  5. Audience Segmentation
  6. Reverse ETL
  7. Data Quality
  8. Data Governance and Privacy Compliance

Each of these stages represents a challenge from both a technological and a strategic-ethical perspective. Let’s explore the ones we consider most significant for a solution that harnesses the power of the Cloud Data Warehouse instead of a traditional Customer Data Platform, namely:

  1. CDI (Customer Data Infrastructure)
  2. ETL (Extract, Transform, Load) and Data Ingestion
  3. Identity Resolution
  4. Audience Segmentation
  5. Reverse ETL


Customer Data Infrastructure (CDI)

This phase encompasses all user data acquisition activities. Customer Data Infrastructure includes all the collection tools and strategies: analytics tracking and advertising pixels, the convergence of data within CRM systems, and more advanced collection solutions such as loyalty cards at the point of sale, geolocated data, or synthetic data.

This phase is often underestimated and approached in the reverse order of the ideal workflow. The common approach tends to be “collect the data first, then work out how to use it,” which often leads to difficulties when merging sources later or to the absence of fundamental data. A classic example is online tracking that does not expose data typical of the physical world, such as the user identifiers used in store, which then makes it impossible to correlate online and offline behavior.

A solid Customer Data Infrastructure starts from business needs and objectives, covers the entire customer journey, and aims for ethical and robust data collection.

In this phase, it’s often essential to adopt a lean approach: track thoroughly, but only what is necessary. This approach helps dispel the illusion that everything should be tracked and ensures that only data with a clear purpose flows into the company’s systems.
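
One way to make the lean approach concrete is a tracking plan that every event is checked against before it enters the company's systems. The sketch below is only an illustration: the event names, the field names, and the `validate_event` helper are all hypothetical, and `customer_id` stands in for whatever shared identifier links online and offline behavior.

```python
# Hypothetical tracking plan: each event lists the fields it must carry.
# "customer_id" is an assumed shared identifier (for example a loyalty
# card number) that lets online and in-store behavior be correlated later.
TRACKING_PLAN = {
    "purchase": {"customer_id", "order_id", "amount", "channel"},
    "newsletter_signup": {"customer_id", "email"},
}


def validate_event(name, payload):
    """Return a list of problems; an empty list means the event is accepted."""
    if name not in TRACKING_PLAN:
        return [f"event '{name}' is not in the tracking plan"]
    required = TRACKING_PLAN[name]
    received = set(payload)
    # Fields the plan requires but the event does not carry.
    problems = [f"missing field '{f}'" for f in sorted(required - received)]
    # Fields nobody asked for: data without a clear purpose.
    problems += [f"unplanned field '{f}'" for f in sorted(received - required)]
    return problems
```

An in-store purchase and a web purchase that both carry `customer_id` pass the check, while a web purchase without it is reported as missing the field, which is exactly the gap that later blocks online and offline correlation.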


ETL (Extract, Transform, Load) and Data Ingestion

ETL is the second step in the user data journey and encompasses all processes that lead to data extraction, transformation into a common format, and loading into the Data Warehouse.

During this phase, companies often encounter issues with data transformation, data loss during ingestion, and the need to maintain data consistency. These problems frequently stem from suboptimal choices in the initial step: when large quantities of inconsistent and poorly structured data are collected, the responsibility for fixing them is pushed onto the ETL phase, where source issues are hard to resolve.

Even in this stage, starting from clear business objectives makes it possible to establish data models and structures with a clear purpose. This makes it easier to understand how tables should be related and which data format is optimal, so that ETL work becomes straightforward and robust and the resulting Data Warehouse is streamlined, efficient, and maintainable.
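
As a minimal sketch of "transformation into a common format", the example below maps rows from two sources onto one shared schema and sets aside rows that cannot be transformed instead of silently dropping them. The two sources, their column names (`Email`, `SignupDate`, `email_address`, `created_at`), and the target schema are assumptions made purely for illustration.

```python
from datetime import datetime


def from_crm(row):
    """Map a hypothetical CRM export row (DD/MM/YYYY dates) to the common schema."""
    return {
        "email": row["Email"].strip().lower(),
        "signup_date": datetime.strptime(row["SignupDate"], "%d/%m/%Y").date().isoformat(),
        "source": "crm",
    }


def from_shop(row):
    """Map a hypothetical e-commerce row (ISO timestamps) to the common schema."""
    return {
        "email": row["email_address"].strip().lower(),
        "signup_date": datetime.fromisoformat(row["created_at"]).date().isoformat(),
        "source": "shop",
    }


def transform(crm_rows, shop_rows):
    """Return (clean_rows, rejected_rows); rejected rows keep the error for review."""
    clean, rejected = [], []
    for mapper, rows in ((from_crm, crm_rows), (from_shop, shop_rows)):
        for row in rows:
            try:
                clean.append(mapper(row))
            except (KeyError, ValueError) as exc:
                # A missing column or an unparseable date is recorded, not lost.
                rejected.append({"row": row, "error": repr(exc)})
    return clean, rejected
```

Keeping a rejected list is one simple way to notice data loss during ingestion: the count of rejected rows can be monitored and traced back to the source that produced them.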


Identity Resolution

In this phase, a user’s identity is traced across various platforms using unique identifiers. This process is fundamental to the very concept of the Customer Data Platform. First and foremost, it shifts the focus from channels to the user, making targeted actions and loyalty initiatives widely achievable. Moreover, it allows for breaking down corporate silos and achieving a genuine, unified view of user behavior and interaction.

What makes identity resolution complex is that an entity can have multiple identifiers associated with it. These identifiers can vary based on the source or system they come from. For example, a person may have an identifier based on their phone number in one system, another identifier based on their email address in another system, and so on. Defining a hierarchy of IDs means organizing these identifiers into a structure or logical sequence that determines which ones are more reliable or take priority over others.

The crucial part of identity resolution is linking these identifiers together. This can be done through various techniques, such as analyzing similarities between identifiers, verifying equality between them, or using advanced correlation algorithms. The goal is to connect or map different identifiers to a primary or unique identifier for the entity in question.
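
The linking step can be sketched with a union-find structure: identifiers that appear together in the same record are merged into one group, and the group's primary identifier is then chosen according to the ID hierarchy. This is a simplified illustration that uses exact equality only (no similarity matching), and the `ID_PRIORITY` ranking and record layout are assumptions, not a description of any specific product.

```python
# Hypothetical hierarchy: a lower number means a more trusted identifier type.
ID_PRIORITY = {"customer_id": 0, "email": 1, "phone": 2, "cookie": 3}


def resolve_identities(records):
    """Link identifiers that co-occur in a record; return {identifier: primary}."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps lookups short
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    for record in records:
        ids = [(kind, value) for kind, value in record.items() if value]
        for ident in ids:
            union(ids[0], ident)  # everything in one record is the same entity

    # Collect the members of each linked group.
    groups = {}
    for ident in list(parent):
        groups.setdefault(find(ident), []).append(ident)

    # Pick the most trusted identifier of each group as its primary.
    primary_of = {}
    for members in groups.values():
        primary = min(members, key=lambda m: (ID_PRIORITY.get(m[0], 99), m[1]))
        for member in members:
            primary_of[member] = primary
    return primary_of
```

With this sketch, a cookie seen with an email, that email seen with a phone number, and that phone number seen with a customer ID all resolve to the customer ID, even though no single record contains all four.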

Audience AI solves this problem by assisting in the configuration of a unique identifier from the data collection phase, ensuring that the entire process remains coherent and rational. This approach helps avoid complex modeling and reconciliation activities that can be costly and result in low data quality.

Once the identity resolution phase is completed, you can be certain of having a 360-degree view of the user. Consequently, all the models and segments you apply can be attributed to the individual user, unleashing the full potential of automation and personalization in the user experience.


Audience Segmentation

Users are divided into homogeneous groups based on criteria such as interests, behaviors, or demographics. This process allows for customizing marketing strategies according to the needs and preferences of each group.

Let’s start by distinguishing two processes that are sometimes confused but quite different: segmentation and clustering.

By segmentation, we mean the division of our customer base into segments. Usually, this activity is based on qualitative criteria and business decisions. The audiences created this way do not consider the “similarity” between users, which statistical clustering techniques can take into account. Their business significance is undoubtedly strong, but their statistical value is low, which leads to poor data reliability and makes the segment difficult to use for retargeting or insights analysis.

Clustering, on the other hand, is a statistical analysis that divides an audience into groups of “similar” users based on the parameters we are using. For example, we can run an RFM (Recency, Frequency, Monetary) analysis aiming to identify high-potential customers, frequent but low-spending customers, and top customers. Using clustering techniques like K-Means, we can group users into effective and meaningful segments, assign the correct label, and statistically monitor whether our clustering continues to have good consistency.
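
To illustrate the idea, here is a small, dependency-free sketch that builds an RFM table from orders and groups customers with a plain K-Means (Lloyd's algorithm). The input layout, the min-max scaling, and the choice of initial centroids are assumptions for the example; in practice a library implementation and a more careful initialization would normally be used.

```python
from datetime import date


def rfm_table(orders, today):
    """orders: iterable of (customer_id, order_date, amount) -> {id: (R, F, M)}."""
    table = {}
    for customer, day, amount in orders:
        recency, frequency, monetary = table.get(customer, (None, 0, 0.0))
        days_ago = (today - day).days
        recency = days_ago if recency is None else min(recency, days_ago)
        table[customer] = (recency, frequency + 1, monetary + amount)
    return table


def min_max_scale(points):
    """Rescale every column to [0, 1] so no single metric dominates the distance."""
    columns = list(zip(*points))
    lows = [min(c) for c in columns]
    spans = [(max(c) - min(c)) or 1 for c in columns]
    return [tuple((v - lo) / s for v, lo, s in zip(p, lows, spans)) for p in points]


def k_means(points, centroids, iterations=20):
    """Plain Lloyd's algorithm; returns (labels, centroids)."""
    labels = []
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid.
        labels = [
            min(range(len(centroids)),
                key=lambda k: sum((a - b) ** 2 for a, b in zip(p, centroids[k])))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        new_centroids = []
        for k, old in enumerate(centroids):
            members = [p for p, label in zip(points, labels) if label == k]
            new_centroids.append(
                tuple(sum(c) / len(members) for c in zip(*members)) if members else old
            )
        if new_centroids == centroids:
            break  # converged
        centroids = new_centroids
    return labels, centroids
```

The cluster numbers themselves carry no meaning; a label such as "top customers" is assigned afterwards by inspecting each centroid's recency, frequency, and monetary values.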

Often, user segmentation suffers from problems that completely compromise its effectiveness. Let’s delve into the most common ones:

Incorrect Group Membership

This typically occurs when we fail to use statistical segmentation methodologies or when the incoming data is inaccurate. In the first case, we have been too arbitrary in creating the audience segment, including users who probably do not belong directly to that group. The classic example is including users in the “Top Clients” category who are not currently the company’s best customers, but will still receive messages and promotions as if they were.

In the second case, the issue lies not in the segmentation system but in data collection: we may have lost some important transactions due to tracking issues or mishandled data from a particular source. Consequently, a very important customer might end up in a lower-value cluster and not fully benefit from all strategies dedicated to them.

Group Size That Is Too Small

Another common error in segmentation is the tendency to create groups that are too small to be statistically significant and usable on advertising platforms. If we want to ensure that our marketing strategies make the most of segmentation potential, our groups must be of a size that allows them to be targeted in campaigns on advertising platforms as well as within our direct marketing systems.

Regarding advertising platforms, we must consider privacy limitations and actual delivery limitations. The first is a protection mechanism that prevents advertisers from easily identifying the individual users they upload and thus accessing information about them without their consent. This protection is absolutely necessary, but it poses a targeting challenge. We must always be able to create audiences of at least 800 to 1,000 users if we want to ensure activation through Meta or Google.

We must also remember that not all users will be recognized when we send these segments to the platforms. Match percentages vary significantly from industry to industry, and we can only verify post hoc whether our segmentation strategies are indeed creating usable audiences.

Even in the case of direct marketing campaigns, having the right audience size is important. Segments with only one or two users offer no advantage in terms of aggregation and automation and may prevent us from sending effective messages.
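
A quick pre-activation check can combine the two constraints above: the minimum audience size and the fact that only a share of users will be matched. The sketch below is illustrative only; the `activation_check` helper is hypothetical, the default floor of 1,000 comes from the range mentioned above, and the match rate is an assumption that can only be verified after upload.

```python
def activation_check(segment_sizes, match_rate, min_matched=1000):
    """Flag segments whose expected matched size falls below the platform floor.

    match_rate is an assumed share of uploaded users the platform recognizes.
    """
    report = {}
    for name, size in segment_sizes.items():
        expected = int(size * match_rate)
        report[name] = {
            "expected_matched": expected,
            "activatable": expected >= min_matched,
        }
    return report
```

For example, with an assumed 50% match rate, a segment of 2,500 users is expected to yield 1,250 matched users and passes, while a segment of 1,200 users yields only 600 and would need to be merged with another segment or broadened.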

The Limited Relevance of Segmentation to Marketing Strategies

Often, segmentation activities are carried out without considering the marketing strategy and business objectives. Typically, companies are divided into silos, and segmentation is performed either by the IT/Data Science team or the Marketing team.

This division of roles often leads to segments that are not perfectly aligned with the strategy and, therefore, challenging to use. For instance, having a segment of frequent clients may not always be strategic if our ultimate goal is not to increase the number of top clients but to grow the customer base.

In the case of demographic or interest-based segments, the issue becomes even more sensitive. Providing data on gender or age is standard practice, yet it often involves complexity and GDPR-related challenges, even when no campaign actually uses this segmentation.

Only the active involvement of business professionals in the Customer Data Enrichment project ensures that the audience aligns with strategic needs and is immediately applicable.

Low Similarity Among Individuals Within the Group

As we saw at the beginning, manual segmentation that doesn’t take a statistical approach risks placing users in segments where they are not actually similar to the other members. While this flexibility may be highly appreciated from a strategic standpoint, audiences constructed in this way often lead to poor campaign performance.


Reverse ETL

The reverse ETL process is a crucial part of a CDP’s ability to integrate into a marketing ecosystem and effectively activate customer data segmentation and enrichment.

The reverse process starts from the single source of truth, the data warehouse populated during the ETL phase, and extracts data to push into activation platforms such as CRM, Marketing Automation platforms, and PPC platforms (Google, Meta, TikTok).

On its own, the reverse ETL process is responsible only for creating efficient queries and stable integrations, so that data flows correctly and the marketing team can achieve maximum results.
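
As a rough sketch of this flow, the example below queries one segment from a warehouse table and shapes it into an upload payload. An in-memory SQLite database stands in for the cloud data warehouse, and the `customer_segments` table, the payload layout, and the `build_audience_payload` helper are all hypothetical. Emails are normalized and hashed with SHA-256 before leaving the warehouse, which is a common requirement for customer-list uploads, although the exact format each platform expects should be checked in its own documentation.

```python
import hashlib
import sqlite3


def build_audience_payload(conn, segment):
    """Select one segment from the warehouse and shape it for upload."""
    rows = conn.execute(
        "SELECT email FROM customer_segments WHERE segment = ? ORDER BY email",
        (segment,),
    ).fetchall()
    return {
        "audience": segment,
        # Normalize, then hash, so raw emails never leave the warehouse.
        "members": [
            hashlib.sha256(email.strip().lower().encode("utf-8")).hexdigest()
            for (email,) in rows
        ],
    }
```

The actual delivery step (the API call to the CRM or advertising platform) is deliberately left out, since it depends on each platform's client library and authentication.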

However, it’s often not enough to have well-structured queries. We need to focus on the specific needs of a campaign or automation journey to ensure that the data flow can be activated with a clear and impactful strategy.

Audience AI is built on this assumption, transforming the data paradigm not only at the technical level but, more importantly, at the strategic level.

We have already developed activation and performance enhancement strategies such as enriched bidding. With them, we can reverse the data flow and expose all the necessary data so that platforms like Google Ads can leverage first-party data and significantly improve campaign performance.

This process is made possible by extensive experience in marketing data collection and integration, as well as a deep understanding of digital marketing dynamics.

In this way, the Reverse ETL process can truly be described as end-to-end.