AI User Interest Classification

Challenges and solutions in using AI to classify user interests


Artificial intelligence is revolutionising the corporate sector, offering advanced tools to analyse data and understand user behaviour. Through machine learning and natural language processing techniques, companies can extract valuable information from digital interactions, personalise customer experiences and optimise marketing strategies.

In this article, we will explore how AI enables the accurate identification of user interests from browsing data, focusing on the challenges and technical solutions associated with this process.

Interest Analysis: how it works

In order to identify the interests of a user visiting a website, it is sufficient to collect and analyse browsing data, i.e. the pages visited. This data, being collected directly from the site, falls into the category of first-party data.

The process involves assigning a label indicating a thematic and/or product interest to each page on the site. Once these labels have been assigned, it is sufficient to analyse how the user navigated through different interests.

Let us imagine that we want to identify the product interests of a user browsing the Amazon site. Using a simple algorithm that considers the pages viewed and the time spent on them, we can deduce that the user is interested in, say, carpets, simply by providing the algorithm with the product categories associated with the various pages.
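
Such a simple algorithm could be sketched as follows. The page-to-category mapping, the example URLs and the 60-second threshold are illustrative assumptions, not an actual implementation:

```python
# Hypothetical sketch: infer product interests from pages viewed and time spent.
from collections import defaultdict

# Assumed first-party mapping of site URLs to product categories.
PAGE_CATEGORIES = {
    "/p/red-kilim": "carpets",
    "/p/wool-rug": "carpets",
    "/p/wine-glass": "glassware",
}

def product_interests(visits, min_seconds=60):
    """Sum dwell time per category; keep categories above a minimal threshold."""
    seconds_per_category = defaultdict(int)
    for url, seconds in visits:
        category = PAGE_CATEGORIES.get(url)
        if category:
            seconds_per_category[category] += seconds
    return {c: s for c, s in seconds_per_category.items() if s >= min_seconds}

visits = [("/p/red-kilim", 45), ("/p/wool-rug", 40), ("/p/wine-glass", 10)]
print(product_interests(visits))  # carpets (85s) qualifies; glassware (10s) does not
```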

However, attributing thematic interest to users visiting different pages can be more complex. Suppose the same user visited three pages which, based only on product type, have nothing in common. These pages, however, might share a cross-cutting or ‘custom’ interest, such as a focus on sustainability and the environment.

In this case, we could assign the user both a product interest and a custom interest, indicating that they are sensitive to environmental issues. Consequently, it would be appropriate to offer them products related to this sphere.

The process of assigning this interest seems simple: we need to analyse the textual content of the pages and assign the same label to all those pages that deal with the same topic. However, the main challenges emerge here, as sophisticated semantic analysis and the ability to recognise common themes within heterogeneous content are required.

Approaches and Algorithms for Interest Analysis

In order to categorise the pages of a site according to thematic interests, several approaches can be taken. Here we present three of them, highlighting the advantages and disadvantages of each.

Machine Learning

The first approach, the more traditional one, involves the use of a classic classification model based on machine learning. In this case, it is necessary to select a classification algorithm from the many available and proceed to construct a dictionary for training the algorithm. This implies having a clear definition of all the customised labels with which one wishes to classify the pages of the site and providing an adequate number of examples, texts and descriptions associated with each label.
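
A minimal sketch of this approach, assuming scikit-learn as the toolkit: a hand-built dictionary of labelled example texts trains a TF-IDF plus logistic-regression pipeline. The labels and texts below are illustrative (in practice far more examples per label would be needed), and this is not the specific model used by any particular platform:

```python
# Classic supervised classification: dictionary of labelled examples -> trained model.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# The "dictionary": each label needs many example texts; only two per label are shown.
training_texts = [
    "hand-woven wool carpet with traditional patterns",
    "soft area rug for living rooms",
    "crystal wine glass, dishwasher safe",
    "set of six tumbler glasses",
]
training_labels = ["carpets", "carpets", "glassware", "glassware"]

model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(training_texts, training_labels)

print(model.predict(["durable kilim carpet for hallways"])[0])
```

Note that adding a new label means extending the dictionary and re-fitting the model, which is exactly the maintenance burden discussed below.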

These models have some significant advantages. Firstly, they are deterministic, which means that applying the same algorithm to the same data will always produce the same result. Furthermore, they are low cost and can be implemented in-house without the need for providers.

However, there are also limitations. Training the model requires the creation of an often very extensive dictionary; it is not sufficient to provide one or two examples per label, but many more are needed. This means that considerable effort is required to create a dictionary suitable for training the model. Furthermore, these models work effectively when the number of clusters is limited. On a site like Amazon, where hundreds of clusters can be identified, the approach becomes less feasible: every time a new cluster, label or interest is added, the dictionary must be updated and the model re-trained, making this method unsustainable.

Generative AI

The second approach involves the use of generative artificial intelligence models, which can be implemented in two ways: with limitations and without limitations.

Generative AI with limitations

Using generative AI ‘with limitations’, you provide the tool with the entire list of desired clusters and a few examples for each, significantly fewer than the 30-40 examples needed in traditional machine learning models. These pre-trained models excel at handling a large number of clusters and require little maintenance, as platforms such as OpenAI handle much of the process. To add a new cluster, simply edit the initial prompt, include the new cluster and add a couple of examples.
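
The shape of such a prompt can be sketched as follows. The cluster names and examples are invented for illustration, and no provider API is called here; the point is that the full cluster list travels with every request, which is what drives up token costs:

```python
# Hypothetical "with limitations" prompt: every cluster plus a couple of examples
# is embedded in each classification request.
CLUSTERS = {
    "sustainability": ["bamboo toothbrush description", "recycled cotton tote page"],
    "home textiles": ["wool carpet product page", "linen curtain product page"],
}

def build_classification_prompt(page_text):
    lines = ["Classify the page into exactly one of these clusters:"]
    for name, examples in CLUSTERS.items():
        lines.append(f"- {name} (examples: {'; '.join(examples)})")
    # Adding a new cluster only requires editing the CLUSTERS dict above.
    lines.append(f"Page text: {page_text}")
    return "\n".join(lines)

prompt = build_classification_prompt("organic hemp rug, plastic-free packaging")
print(prompt)  # the prompt grows with every cluster added
```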

However, this method has two fundamental problems:

  • The model is not deterministic: applying the same algorithm to the same data several times can lead to different results, which is undesirable for data consistency.
  • High costs: if you have a site with, say, 20,000 pages, using the API provided by a provider becomes economically unsustainable. This is not only because of the large amount of data to be analysed, but also because the long prompts required for the process significantly increase costs.

Unrestricted Generative AI

In this case, instead of providing all clusters, only a few examples are passed to generative AI, asking the model to perform the classification based on them and inventing other clusters useful for classification. The first run can work very well: the tool reads, classifies and identifies meaningful clusters. With a well-formulated prompt, the results obtained can be very satisfactory. Moreover, the AI can autonomously identify new emerging clusters. This method is effective with many clusters and requires very few examples. Despite the advantages, there are some criticalities:

  • The model remains non-deterministic: every time the model is re-run to update interests, it can generate different clusters, not remembering previous classifications. For example, what is labelled ‘eco-sustainability’ today might become ‘environmental friendliness’ in a subsequent run. This leads to the need to implement a system for cleaning and standardising the results, partially nullifying the initial benefits.
  • Although the costs are lower than with the restricted version, they are still quite high.
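
One possible cleanup step for the non-determinism problem is to map synonymous cluster names generated across runs onto a single canonical label. The alias table below is illustrative and must be maintained by hand, which partially erodes the low-maintenance benefit:

```python
# Normalise label variants produced by different runs to one canonical cluster name.
CANONICAL = {
    "environmental friendliness": "eco-sustainability",
    "green living": "eco-sustainability",
}

def normalise_label(label):
    key = label.strip().lower()
    return CANONICAL.get(key, key)

print(normalise_label("Environmental Friendliness"))  # -> eco-sustainability
```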

In an ideal scenario, we would have a deterministic model to which we could supply a relatively limited dictionary, thus avoiding spending too much time on its construction. The model should work effectively with numerous clusters, require little maintenance and be low in cost.

Our Solution with Embeddings

To achieve this, we adopted – and integrated into our Bytek Prediction Platform – an approach based on embeddings. This technique is widely used in natural language processing and other artificial intelligence applications to improve text comprehension, semantic search, classification and content generation. Embeddings transform text into high-dimensional numerical vectors while preserving the semantic meaning of the content. This transformation is crucial because it allows texts to be compared on the basis of their semantic content.

Once each text is represented as a numerical vector, it is possible to calculate the distance between two texts, analogous to calculating the distance between two points in numerical space. Even using the simple Euclidean distance, significant results can be obtained. For example, the distance between the words ‘carpet’ and ‘kilim’ (a type of carpet) is much smaller than that between ‘carpet’ and ‘glass’, as the embeddings capture the semantic relationships between the words.
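
A toy illustration of this comparison: real embeddings have hundreds or thousands of dimensions and come from a trained model, whereas these 3-D vectors are hand-crafted solely to show that semantically close words end up geometrically close:

```python
# Compare texts via Euclidean distance between their embedding vectors.
import numpy as np

embeddings = {
    "carpet": np.array([0.9, 0.1, 0.0]),
    "kilim":  np.array([0.8, 0.2, 0.1]),  # a type of carpet: near "carpet"
    "glass":  np.array([0.1, 0.9, 0.3]),
}

def euclidean(a, b):
    return float(np.linalg.norm(embeddings[a] - embeddings[b]))

print(euclidean("carpet", "kilim") < euclidean("carpet", "glass"))  # True
```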

Advantages of the Embeddings Approach

This method effectively solves the previous problems:

  • Determinism: The model produces consistent results at every run.
  • Predefined clusters: Clusters are provided in the training phase, avoiding the generation of unwanted clusters.
  • Limited Dictionary: 3-4 examples per cluster are sufficient, reducing preparation time.
  • Scalability: Works well with a large number of clusters.
  • Low Maintenance: Requires minimal intervention after initial implementation.
  • Low Costs: Costs are manageable and lower than those of the other solutions.
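
One common way to realise such an approach, sketched here under stated assumptions: average the embeddings of the 3-4 examples per cluster into a centroid, then assign each page to the nearest centroid. The `embed` function is a stand-in for a real embedding model; the toy vectors exist only to make the sketch runnable and deterministic, and this is not the platform's actual algorithm:

```python
# Nearest-centroid classification over embedding vectors.
import numpy as np

TOY_VECTORS = {
    "recycled materials guide":  [0.9, 0.1],
    "zero-waste packaging":      [0.8, 0.2],
    "wool carpet care":          [0.1, 0.9],
    "rug size guide":            [0.2, 0.8],
    "plastic-free product page": [0.85, 0.15],
}

def embed(text):
    """Stand-in for a real embedding model."""
    return np.array(TOY_VECTORS[text])

CLUSTER_EXAMPLES = {
    "eco-sustainability": ["recycled materials guide", "zero-waste packaging"],
    "carpets": ["wool carpet care", "rug size guide"],
}

# One centroid per cluster, built from the handful of examples.
centroids = {
    name: np.mean([embed(t) for t in texts], axis=0)
    for name, texts in CLUSTER_EXAMPLES.items()
}

def classify(text):
    v = embed(text)
    return min(centroids, key=lambda c: float(np.linalg.norm(v - centroids[c])))

print(classify("plastic-free product page"))  # deterministic on every run
```

Adding a new cluster only requires embedding a few new examples and computing one more centroid; no re-training is involved.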

Assignment of Interests to Users

Once labels have been assigned to all pages, the next step is to assign products and/or interests to each user. To do this, we have developed a proprietary algorithm that considers both individual user behaviour and the collective behaviour of other users. For example, a user who is genuinely interested in eco-sustainability tends to visit many related pages, spends time reading product descriptions in detail and interacts with the content in depth. Consequently, an occasional access to a page on sustainability is not sufficient to conclude that the user is interested in this topic.

It is essential to analyse:

  • Transversal Behaviour: How the user interacts with different interests.
  • Collective Comparison: How their behaviour relates to that of other users with similar interests.
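
The actual algorithm is proprietary, but the two criteria above can be sketched in simplified form: a user qualifies for an interest only if their engagement with the topic is both non-trivial in absolute terms and above the average of comparable users. Thresholds and data shapes here are assumptions for illustration:

```python
# Simplified interest assignment combining individual and collective signals.
def assign_interest(user_visits, population_avg_visits, min_visits=3):
    """user_visits and population_avg_visits map topic -> visit count."""
    interests = []
    for topic, visits in user_visits.items():
        deep_enough = visits >= min_visits                          # transversal behaviour
        above_peers = visits > population_avg_visits.get(topic, 0)  # collective comparison
        if deep_enough and above_peers:
            interests.append(topic)
    return interests

user = {"eco-sustainability": 5, "carpets": 1}
population = {"eco-sustainability": 2.1, "carpets": 3.4}
print(assign_interest(user, population))  # a single occasional visit does not qualify
```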

Only through this in-depth analysis can the user’s genuine interests be determined with a high degree of confidence. This integrated approach allows for a more accurate and reliable classification of interests, improving the personalisation and effectiveness of marketing and user interaction strategies.

Conclusion