What is entity resolution? What is identity resolution? How are they related, and why is getting them right such a hard problem?
Let’s dive in:
Q. What exactly is Entity Resolution?
Entity resolution sounds like a lot of jargon, but at the heart of it, it's actually saying that multiple records in your warehouse or your data lake belong to the same real-world entity. And this entity could be a customer, it could be a supplier, it could be a location. Just any noun that the business deals with.
Q. So identity resolution is a subset of entity resolution, right? Can you please explain?
Yeah, so when we talk about identity, it's actually who you are. And entity resolution, at a broader level, is establishing what an entity is. So entity and identity are in fact very closely tied, but when we talk about identity it's more related to a person, like a customer or a citizen. Technically, that's what identity resolution is.
Q. Considering entity resolution is so important, why is it still a largely unsolved problem?
It's a very important problem, but for a company to even realize it has the problem, I think there are some building blocks that need to be there. First of all, the enterprise has to be ready with all the data in one place, with their analytics, their pipelines, their ETL, before they figure out that they now need to establish the linkages to be ready for analysis. That is the point at which they realize that there are five records belonging to one customer and they're not able to tie them together. So for entity resolution to really happen, some of the building blocks need to be in place. And that is what we are seeing emerge more and more: the base tech for companies is ready, and that's why the need for entity resolution is growing.
Q. So why did you choose to work on entity resolution? Like why is it exciting?
So I chose to work on it primarily because I failed to solve it on my first go. I was working as a data consultant. I was tasked with building a data lake and we had to resolve some entities from multiple databases. And when we got to solving it, we really, really had a tough time. And that's where it hit me. I saw this problem again and again as part of my consulting. And I felt that, one, this is a tough problem to solve. Second, it is a problem that, if solved in a very domain-agnostic way, can actually serve multiple industries and multiple data sets. So that's what excites me very much.
You’re building Zingg, an open-source project to solve this problem.
Q. Can you give us a high-level overview of how Zingg works?
So Zingg, as an open-source project, follows a very simple workflow. You say, here are my records, and here are the attributes on which I want to match. On some of these attributes I am okay with variations; on others I want an exact match. That is what we call fuzzy and exact matching, and that's what you can configure in the system. You also tell it where the data resides. Then Zingg starts showing you some pairs and asks you to tell it whether they are matches or non-matches according to your business logic. You run a few rounds like that, and the AI models behind Zingg start getting refined. After a few rounds of labeling, you are pretty much set with your entity resolution models.
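To make the fuzzy versus exact distinction concrete, here is a minimal Python sketch of comparing one pair of records attribute by attribute. It is an illustration only, not Zingg's configuration format or API: the `MATCH_TYPES` mapping, the field names, the sample records, and the use of `difflib.SequenceMatcher` as the similarity measure are all assumptions made for this example.

```python
from difflib import SequenceMatcher

# Hypothetical per-attribute settings: "fuzzy" fields tolerate variation,
# "exact" fields must agree completely. This is not Zingg's real schema.
MATCH_TYPES = {"name": "fuzzy", "city": "fuzzy", "dob": "exact"}


def field_similarity(a: str, b: str, match_type: str) -> float:
    """Return a similarity score in [0, 1] for one attribute."""
    a, b = a.strip().lower(), b.strip().lower()
    if match_type == "exact":
        return 1.0 if a == b else 0.0
    # Fuzzy comparison: ratio of matching characters between the two strings.
    return SequenceMatcher(None, a, b).ratio()


def compare(rec_a: dict, rec_b: dict) -> dict:
    """Score every configured attribute for a pair of records."""
    return {
        field: field_similarity(rec_a[field], rec_b[field], match_type)
        for field, match_type in MATCH_TYPES.items()
    }


# Two made-up records that a human reviewer would likely label as a match.
a = {"name": "Jonathan Smith", "city": "New York", "dob": "1985-03-02"}
b = {"name": "Jonathon Smith", "city": "new york", "dob": "1985-03-02"}
scores = compare(a, b)
print(scores)
```

In a labeling workflow like the one described above, per-attribute scores of this kind could serve as the inputs a model learns from, with the reviewer's match and non-match answers as the training labels. How Zingg actually does this internally is not covered here.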
Q. Can you briefly shed some light on the difference between using Zingg versus building one's own models using SQL?
So see, entity resolution can be quite simple in some cases. Maybe you have a user email already, and then you know that two records with this email belong to the same individual. Or you have guest checkouts, and if the user finally logs in and you capture that email, you know how to associate the anonymous activity with the logged-in activity and thus with the user ID.
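The simple, deterministic case can be sketched in a few lines. The Python below groups records that share the same email after light normalization. The record layout, the IDs, and the `normalize_email` helper are hypothetical and exist only for illustration.

```python
from collections import defaultdict


def normalize_email(email: str) -> str:
    """Trim whitespace and lowercase so trivial variations still line up."""
    return email.strip().lower()


def group_by_email(records: list[dict]) -> dict[str, list[str]]:
    """Map each normalized email to the IDs of the records that carry it."""
    groups = defaultdict(list)
    for rec in records:
        email = rec.get("email")
        if email:  # records without an email cannot be linked this way
            groups[normalize_email(email)].append(rec["id"])
    return dict(groups)


# Made-up records from a CRM, a web store, and an anonymous guest checkout.
records = [
    {"id": "crm-1", "email": "Ana@Example.com"},
    {"id": "web-7", "email": "ana@example.com "},
    {"id": "web-9", "email": "bo@example.com"},
    {"id": "guest-3", "email": None},
]
clusters = group_by_email(records)
print(clusters)
```

The guest record stays unlinked until an email is captured for it, which mirrors the guest checkout scenario: the association only becomes possible once the user logs in.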
If it is as simple as one or two systems and very deterministic attributes on which you can actually link and match, I think it is okay to go with SQL, and it's a fair choice to make. But in most cases the data is coming from multiple sources, and they all have variations across name, age, address, telephone numbers, and multiple emails.
And then with the growing size of the data as well as the variation in the attributes, this becomes an increasingly tough problem to solve. Because it's a classic join problem, right? We all talk about tuning your joins in a database, but if you don't have unique identifiers, what are you really going to join on? So it gets very complex very quickly.
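A small Python sketch can illustrate both halves of this point, under assumptions made purely for the example (the sample records and both helper functions are hypothetical). An exact join finds nothing when the same person is written differently in two sources, and once there is no key to join on, the number of record pairs you would have to compare grows quadratically with the size of the data.

```python
def exact_join(left: list[dict], right: list[dict], key: str) -> list[tuple]:
    """Equi-join two record lists on a single key, like a SQL inner join."""
    index: dict = {}
    for r in right:
        index.setdefault(r[key], []).append(r)
    return [(l, r) for l in left for r in index.get(l[key], [])]


def candidate_pair_count(n: int) -> int:
    """Number of distinct record pairs when every record is compared to every other."""
    return n * (n - 1) // 2


# The same made-up customer as recorded by two different systems.
crm = [{"name": "Jonathan Smith", "phone": "555-0100"}]
billing = [{"name": "Jon Smith", "phone": "(555) 0100"}]

print(exact_join(crm, billing, "name"))   # no rows: the names differ
print(exact_join(crm, billing, "phone"))  # no rows: the formats differ
print(candidate_pair_count(1_000))
print(candidate_pair_count(1_000_000))
```

A thousand records already imply 499,500 possible pairs, and a million records imply roughly half a trillion, which is why comparing everything against everything stops being practical as the data grows.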
And I think that's where you have to understand your data and decide whether a simple SQL model is good enough for you, or whether your data set needs a more advanced solution that combines fuzzy and deterministic matching across multiple attributes.
Q. Can you tell us how this approach differs from the identity resolution capabilities offered by CDP vendors?
So first, Zingg does entity resolution, which is a much broader problem statement: across different languages, across scale, across entities. But coming to identity resolution per se, most CDPs are a third-party system, right? They're not working directly on your warehouse. So your data warehouse still has these unresolved entities.
The CDP data actually sits with the CDP vendor, and you then have to get it back into your systems. Zingg, on the other hand, works natively on the warehouse or the data lake. So you have control over your matching process, over the frequency at which you're running it, and over your data model. CDPs define their own data models, whereas Zingg is very flexible about you defining your own. And I think the whole approach of Zingg favors different use cases and many varieties of data, which a CDP does only in a very minimal sense. Obviously, Zingg is a very focused product for entity resolution compared to the offerings by CDPs.
Q. Since you mentioned use cases besides the classic ID resolution use case, what are some other important use cases of entity resolution?
Yeah, so I think in the web world we talk about CDPs and we know identity resolution. But identity resolution in the web world is very much a marketing- or sales-defined need. When you look at traditional industries like banking or healthcare, there is a lot of compliance: know your customer, anti-money laundering, GDPR, healthcare provider data.
In healthcare you have the Sunshine Act, where, as a healthcare company, you have to declare your affiliations with healthcare providers. In all those cases you actually need to resolve those entities against the various touch points you've had with them. So beyond the classic CDP identity resolution, entity resolution by itself has a lot more use cases.
Q. Can you explain how e-commerce merchants can benefit from entity resolution?
So that's interesting, because the first use case going into production with Zingg is actually an e-commerce use case. We have a user who is building a review website for products. The team has scraped product data from multiple websites, which they are resolving through Zingg and then attaching the reviews to.
So that is one use case: item matching, product catalog item matching. The other is the guest checkout, realizing that somebody has checked out and building the user journey, the customer 360 for the user. So those are broadly the two use cases in the e-commerce space.
Q. Last question for you — what's your advice for companies looking to get started with entity resolution in general and identity resolution in particular?
I would say: know your data. Just like with any data project, right, know your data. Start small with one or two use cases or one or two data sets. Don't underestimate your solution.
Identity resolution in the Data Warehouse vs in a CDP? If you’re in the process of figuring out which one’s right for you, check out this guide.