NLP text classification: from data collection to model inference

Ihor Kozlov
10 min read · Jan 18, 2022


I have worked with several eCommerce platforms that collect data from different shops and advertise their products. One of the common issues during product import from partners is matching the partner's product category to the platform's own category, or categorizing the product if there is no category at all. There are two common ways to categorize a product: by its images or by its description.

Over the last few months I took a classic ML course and chose product categorization as my course project. Now I want to share what I was able to achieve.

Data preparation

As a source for the data, I chose a popular marketplace in Ukraine. The data I needed is the title, description, and category tree. The tree has from 3 to 5 category levels, so I decided to save the main category, the deepest category, and the one right before it.

There are different ways to fetch the data: an API, data export, and scraping. I used the last one, but I always recommend using one of the others if possible. Scraping can overload the site, so it is better not to use it, or to do it during off-peak hours.

I used the Scrapy library. It's pretty simple to write web spiders with it, as a project template is already provided. Starting a new project is a single command.
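For example, with a placeholder project name:

```
scrapy startproject product_scraper
```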

It will automatically create the project structure for you.

The next step is to get the correct XPath for the data we need. XPath is the path to the required node in the XML document. I used the “XPath Helper” extension to get the correct value. To get the XPath, right-click on the element with the data and choose “Copy” -> “Copy XPath”.

You will get something like this:

/html/body/app-root/div/div/rz-category/div/main/rz-catalog/div/div/section/rz-grid/ul/li[3]/rz-catalog-tile/app-goods-tile-default/div/div[2]/a[2]/span

Then paste this value into “XPath Helper” to check that the XPath is correct and selects all the needed elements.

As you can see, it selects only one element on the page. You need to change it to select all the needed elements. In our case, the issue is that it points to a specific item in the list (“li[3]”). Remove that index and it will select all the needed elements:

/html/body/app-root/div/div/rz-category/div/main/rz-catalog/div/div/section/rz-grid/ul/li/rz-catalog-tile/app-goods-tile-default/div/div[2]/a[2]/span

After you find all the needed XPaths, you can start crawling. Starting from the main page with all the categories, the process looks like this (a minimal spider sketch follows the list):

  1. Find all the categories.
  2. For each category, grab all the subcategories.
  3. For each subcategory, collect all products from all pages.
  4. For each product, get the title, description and all categories.
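A minimal Scrapy spider following these steps might look like the sketch below; the class name, start URL, and XPath selectors are illustrative, not the ones from my actual spider.

```python
import scrapy


class ProductsSpider(scrapy.Spider):
    """Sketch of a spider that walks categories -> subcategories -> products."""
    name = "products"
    start_urls = ["https://example-marketplace.ua/"]  # hypothetical entry point

    def parse(self, response):
        # 1. Find all the categories on the main page
        for href in response.xpath("//rz-category//a/@href").getall():
            yield response.follow(href, callback=self.parse_category)

    def parse_category(self, response):
        # 2. For each category, grab all the subcategories
        for href in response.xpath("//rz-catalog//a/@href").getall():
            yield response.follow(href, callback=self.parse_subcategory)

    def parse_subcategory(self, response):
        # 3. Collect all products from all pages of the subcategory
        for href in response.xpath("//rz-catalog-tile//a/@href").getall():
            yield response.follow(href, callback=self.parse_product)
        next_page = response.xpath("//a[@rel='next']/@href").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse_subcategory)

    def parse_product(self, response):
        # 4. For each product, get the title, description and category tree
        yield {
            "title": response.xpath("//h1/text()").get(),
            "description": " ".join(response.xpath("//rz-product-description//text()").getall()),
            "categories": response.xpath("//rz-breadcrumbs//a/text()").getall(),
        }
```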

You can find my implementation here.

Exploratory Data Analysis

Let's take a look at our data.

The first thing to notice is that our data is in Russian. So, if we use text-processing libraries or do transfer learning, they have to support this language.

We should check how many empty fields we have and fill the blank descriptions with the title values. Also, let's check how many categories we have and how many of the deepest category names include the previous category's title.
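A rough sketch of these checks with pandas (the column names and file path are assumptions about the scraped dataset):

```python
import pandas as pd

df = pd.read_csv("products.csv")  # hypothetical path to the scraped data

# How many empty fields do we have per column?
print(df.isna().sum())

# Fill blank descriptions with the title value
df["description"] = df["description"].fillna(df["title"])

# How many categories are there at each level?
print(df["main_category"].nunique(), df["previous_category"].nunique(), df["deepest_category"].nunique())

# How many of the deepest category names include the previous category's title?
contains_prev = df.apply(
    lambda row: str(row["previous_category"]).lower() in str(row["deepest_category"]).lower(),
    axis=1,
)
print(contains_prev.sum(), "of", len(df))
```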

We have 80,047 deepest categories, and 682K products out of 780K have the previous category's title included in the deepest category's title; the main difference is usually a brand name appended to the deepest one. So, considering the small number of products per deepest category, we will predict the previous category instead.

The full notebook is available here.

Data split

We need two columns: the target category and the description. To make our experiments flexible, I made a script that takes the target column and the sample size as parameters. We also drop all the categories with fewer than 50 products.

Also, I used a stratified split by the target column, so all categories are represented proportionally if they have enough products.
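A sketch of such a split with scikit-learn (the column names, threshold, and test size here mirror the description above; the real script takes them as parameters):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("products.csv")   # hypothetical path
target_col = "previous_category"   # the target column is a script parameter

# Drop categories with fewer than 50 products
df = df.groupby(target_col).filter(lambda g: len(g) >= 50)

# A stratified split keeps the category proportions in both parts
train_df, test_df = train_test_split(
    df[["description", target_col]],
    test_size=0.2,
    stratify=df[target_col],
    random_state=42,
)
```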

You can check the full script here.

Experiments

For our task, it's essential to build word embeddings, as machine learning models do not accept raw text as input data. We need to convert descriptions into vectors of numbers. Word embeddings are word representations that allow words with similar meanings to have similar representations.

The Universal Sentence Encoder (USE) is a family of models for encoding sentences into embedding vectors that specifically target transfer learning to other NLP tasks. We will use USE for the baseline and for transfer learning. It returns a fixed-size 512-dimensional vector for any input text.
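Loading the multilingual model from TensorFlow Hub looks roughly like this; note that tensorflow_text must be imported so the model's custom ops are registered:

```python
import tensorflow_hub as hub
import tensorflow_text  # noqa: F401 - registers ops required by the multilingual model

embed = hub.load("https://tfhub.dev/google/universal-sentence-encoder-multilingual/3")

vectors = embed(["Смартфон Apple iPhone 12", "Ноутбук Lenovo IdeaPad 5"])
print(vectors.shape)  # (2, 512): one 512-dimensional vector per input text
```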

Image source: https://amitness.com/2020/02/tensorflow-hub-for-transfer-learning/

Metrics

We will use two metrics to determine our model performance: F1 Score and Sparse Categorical Accuracy.

According to the documentation, the F1 score can be interpreted as a harmonic mean of the precision and recall, where an F1 score reaches its best value at 1 and worst score at 0. The relative contribution of precision and recall to the F1 score is equal. The formula for the F1 score is:

F1 = 2 * (precision * recall) / (precision + recall)

F1 Score is good for an unbalanced dataset, which is our case.

Sparse Categorical Accuracy calculates how often predictions match integer labels: it checks whether the integer true label equals the index of the maximal predicted value.
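A toy illustration of how both metrics are computed (the labels and probabilities below are made up):

```python
import numpy as np
import tensorflow as tf
from sklearn.metrics import f1_score

# Weighted F1 score over true vs. predicted category names
y_true = ["phones", "laptops", "phones"]
y_pred = ["phones", "phones", "phones"]
print(f1_score(y_true, y_pred, average="weighted"))

# Sparse categorical accuracy: integer labels vs. predicted probability vectors
metric = tf.keras.metrics.SparseCategoricalAccuracy()
metric.update_state(
    np.array([0, 1, 0]),                             # true class indices
    np.array([[0.9, 0.1], [0.6, 0.4], [0.8, 0.2]]),  # predicted probabilities
)
print(metric.result().numpy())
```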

Baseline

For our baseline, we transform all the target categories and descriptions into their embeddings using USE. As mentioned previously, the multilingual model is used.

Then we compute the cosine similarity between each description and each target category. Cosine similarity measures the similarity between two non-zero vectors of an inner product space; it gives a useful measure of how similar two documents are likely to be in terms of their subject matter. Our prediction is the category with the maximum cosine similarity.

We stored our prediction in a separate column, “prediction”. Now it’s time to measure the error. Let's calculate our F1 Score:
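A sketch of the whole baseline, ending with the F1 calculation (the column names and file path are assumptions, and the embeddings are normalized explicitly before taking dot products):

```python
import numpy as np
import pandas as pd
import tensorflow_hub as hub
import tensorflow_text  # noqa: F401
from sklearn.metrics import f1_score

embed = hub.load("https://tfhub.dev/google/universal-sentence-encoder-multilingual/3")
df = pd.read_csv("products.csv")  # hypothetical path; embed large datasets in batches

def normalize(m: np.ndarray) -> np.ndarray:
    return m / np.linalg.norm(m, axis=1, keepdims=True)

categories = df["previous_category"].unique()
category_vecs = normalize(np.asarray(embed(list(categories))))                # (n_categories, 512)
description_vecs = normalize(np.asarray(embed(df["description"].tolist())))  # (n_products, 512)

# Cosine similarity between every description and every category name;
# the prediction is the category with the highest similarity
similarity = description_vecs @ category_vecs.T
df["prediction"] = categories[similarity.argmax(axis=1)]

print(f1_score(df["previous_category"], df["prediction"], average="weighted"))
```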

The result is 0.10, which is not optimal, but we can take it as a baseline.

We took the category names as our target embeddings. If we had category descriptions, the baseline would perform better, as they carry more information about what each category is.

The full implementation is here.

Logistic regression

Besides USE, there are other ways of transforming text into numeric vectors: Bag of Words, Bag-of-n-Grams, and the TF-IDF vectorizer. I used the last one for my model.

TF-IDF stands for “Term Frequency - Inverse Document Frequency”. It is a technique to quantify words in a set of documents: we compute a score for each word to signify its importance in the document and the corpus. I recommend reading this article to dive deeper into how it works.

To configure TF-IDF, we need to specify stopwords. Stopwords are words that appear very commonly across documents and therefore lose their representativeness. We will use the NLTK library to get the stopwords.

Also, I used a regexp in “token_pattern” to remove all non-Russian symbols and numbers from the text.

You can control the vocabulary size with the “max_df” and “min_df” parameters. “max_df” removes terms that appear too frequently, also known as “corpus-specific stop words”, while “min_df” removes terms that occur too infrequently.
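A sketch of the vectorizer and classifier setup (the token pattern and the max_df/min_df values here are illustrative; train_df and test_df come from the split sketch above):

```python
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

# nltk.download("stopwords") may be needed on the first run
russian_stopwords = stopwords.words("russian")

vectorizer = TfidfVectorizer(
    stop_words=russian_stopwords,
    token_pattern=r"(?u)\b[а-яА-ЯёЁ]{2,}\b",  # keep only Cyrillic tokens, drop digits and Latin
    max_df=0.95,   # drop corpus-specific stop words
    min_df=5,      # drop terms that occur too rarely
)

X_train = vectorizer.fit_transform(train_df["description"])
X_test = vectorizer.transform(test_df["description"])

clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, train_df["previous_category"])

pred = clf.predict(X_test)
print(f1_score(test_df["previous_category"], pred, average="weighted"))
```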

After training, the model's F1 Score on the test data is 0.67, a lot better than the baseline. Still, checking the actual scores per category shows that most categories score above 0.71, but some get 0. It is important to check the reason for this. Some categories simply have too few products, but to find out what happens with the others we can use the ELI5 library and its TextExplainer.

ELI5 is a Python package that helps debug machine learning classifiers and explain their predictions. eli5.lime.TextExplainer helps debug a prediction: it checks what in the document was important for making this decision.

There are two main ways to look at the classification model:

  1. inspect model parameters and try to figure out how the model works globally;
  2. inspect an individual prediction of a model and try to figure out why the model makes the decision it makes.

To inspect an individual prediction, ELI5 provides the eli5.show_prediction() function:
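For example, with the fitted vectorizer and classifier from the sketch above (the document index is arbitrary):

```python
import eli5

# Highlight which TF-IDF features pushed this document towards (or away from) each category
doc = test_df["description"].iloc[0]
eli5.show_prediction(clf, doc, vec=vectorizer, top=10)
```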

The result looks like the following:

For model parameter inspection, ELI5 provides the eli5.explain_weights() function:
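Again with the fitted vectorizer and classifier from above:

```python
import eli5

# Top words with the largest weights for each category
explanation = eli5.explain_weights(clf, vec=vectorizer, top=20)
print(eli5.format_as_text(explanation))
```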

These functions can help us understand why a particular prediction went wrong.

How can we improve the logistic regression results?

We can use text lemmatization and stemming on our data.

Stemming algorithms work by cutting off the end or the beginning of the word, taking into account a list of common prefixes and suffixes that can be found in an inflected word. Lemmatization, on the other hand, takes into consideration the morphological analysis of the words; it requires detailed dictionaries that the algorithm can look through to link the word form back to its lemma.

You can find stemming and lemmatization algorithms for different languages. But keep in mind that processing a big dataset can take a long time: in my case, lemmatizing the training dataset took 12 hours.
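For Russian, one option is NLTK's Snowball stemmer for stemming and pymorphy2 for lemmatization; this only illustrates the difference between the two, not necessarily the exact tools I used:

```python
import pymorphy2
from nltk.stem.snowball import SnowballStemmer

stemmer = SnowballStemmer("russian")
morph = pymorphy2.MorphAnalyzer()

word = "телефонами"  # "phones" in the instrumental case
print(stemmer.stem(word))                # stemming cuts the suffix off
print(morph.parse(word)[0].normal_form)  # lemmatization returns the dictionary form "телефон"
```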

Another option to try is character embeddings with n-grams, instead of the word-level features I used. Character embeddings are constructed similarly to word embeddings, but instead of embedding at the word level, the vectors represent each character (or group of characters) in a language.

Character-level embeddings have some advantages over word-level embeddings, such as:

  • They can handle new slang words and misspellings.
  • The required embedding matrix is much smaller than for word-level embeddings.

And the last thing that I think can improve the performance is text cleaning. For example, using the spaCy Python library, we can filter our texts so that only nouns, pronouns, and adjectives remain.
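A sketch of such filtering with spaCy (the Russian model name is an assumption and has to be downloaded separately):

```python
import spacy

# Requires: python -m spacy download ru_core_news_sm  (small Russian pipeline)
nlp = spacy.load("ru_core_news_sm")

def keep_content_words(text: str) -> str:
    # Keep only nouns, pronouns and adjectives
    doc = nlp(text)
    return " ".join(token.text for token in doc if token.pos_ in {"NOUN", "PRON", "ADJ"})

print(keep_content_words("Этот мощный смартфон имеет отличную камеру"))
```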

Here you can find the notebook for logistic regression.

Universal Sentence Encoder transfer learning

We take the multilingual USE as a layer for this model, add an output layer with the shape of our classes, and train the resulting model.

Model summary:

So, we have a model that takes text data, projects it into a 512-dimensional embedding, and passes that through a feed-forward neural network with softmax activation to give category probabilities.
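A sketch of this architecture in Keras (the hidden layer size, class count, and training settings here are assumptions):

```python
import tensorflow as tf
import tensorflow_hub as hub
import tensorflow_text  # noqa: F401

num_classes = 450  # hypothetical number of target categories

model = tf.keras.Sequential([
    hub.KerasLayer(
        "https://tfhub.dev/google/universal-sentence-encoder-multilingual/3",
        input_shape=[], dtype=tf.string, trainable=False),   # frozen USE: text -> 512-dim vector
    tf.keras.layers.Dense(256, activation="relu"),
    tf.keras.layers.Dense(num_classes, activation="softmax"),
])

model.compile(
    optimizer="adam",
    loss="sparse_categorical_crossentropy",
    metrics=["sparse_categorical_accuracy"],
)
model.summary()
# model.fit(train_texts, train_labels, validation_split=0.1, epochs=25)
```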

This model reaches a validation accuracy of 0.77 after 25 epochs.

Check the full notebook here.

Zero-shot classification

In zero-shot classification you define your own labels and then run a classifier to assign a probability to each label. There is an option to do multi-label classification too; in that case, the scores are independent and each falls between 0 and 1. You can use a pre-trained model to classify data the model wasn't trained on.

Hugging Face has its own ZeroShotClassificationPipeline: an NLI-based zero-shot classification pipeline using a ModelForSequenceClassification trained on NLI (natural language inference) tasks.

I used a multilingual RoBERTa model with this pipeline; the code is available here.
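A sketch with the Hugging Face pipeline (the exact checkpoint and label set are assumptions; any multilingual NLI model should work):

```python
from transformers import pipeline

# Hypothetical checkpoint choice: an XLM-RoBERTa model fine-tuned on XNLI
classifier = pipeline("zero-shot-classification", model="joeddav/xlm-roberta-large-xnli")

main_categories = ["Смартфоны", "Ноутбуки", "Бытовая техника"]  # illustrative label set
result = classifier("Мощный игровой ноутбук с видеокартой RTX 3060", main_categories)
print(result["labels"][0], result["scores"][0])  # best-matching category and its score
```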

You need a lot of memory to do zero-shot classification with many classes. In my case, I got OOM errors even with a GPU Colab instance and 18 GB of memory. So for this example, I predicted just the main category, which has only 16 classes. In general, it shows pretty good results, but I couldn't measure the performance on the whole test dataset.

Models inference

There are different ways to build a UI on top of your model: you can expose an API and integrate it with your frontend, create chatbots, or use various tools available on the internet. I chose Streamlit to build a UI for model inference.

Streamlit is an open-source app framework for Machine Learning. The main benefit for me is that all the UI is described in Python; no HTML or CSS tuning (I hate that part :)).

One of my struggles was how to ship my models. The options were to store them on AWS S3 and download them at startup, or to keep them on GitHub or Google Drive. My models were pretty big (400 MB and 1.2 GB), so I chose Git LFS and stored the models together with the source code.

Building the Streamlit app is pretty straightforward: install Streamlit, create an entry point, follow the docs or the various examples, and build the UI. You can run the app locally or serve it on the Streamlit site (for free!).

I built a very simple UI where you can choose the model for inference and paste the text to classify.
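A minimal sketch of such an app (the model names and file paths are placeholders):

```python
import joblib
import streamlit as st

st.title("Product category classifier")

MODELS = {  # hypothetical file names; the real models are stored via Git LFS
    "Logistic Regression (TF-IDF)": "models/logreg_tfidf.joblib",
}

@st.cache_resource
def load_model(path: str):
    # Load a model lazily and cache it, so it is not re-read on every interaction
    return joblib.load(path)

model_name = st.selectbox("Model", list(MODELS))
text = st.text_area("Product description")

if st.button("Classify") and text:
    model = load_model(MODELS[model_name])
    st.write("Predicted category:", model.predict([text])[0])
```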

Check the source code here.

A few tips for deploying the application to the Streamlit server:

  1. Create a requirements.txt file with all the dependencies.
  2. Do not load all the models at once, to avoid Out Of Memory errors.
  3. Don't forget to specify secrets if you are using any.

If there are any errors during deployment or while using the app, there is a console with all the logs available.

Summary

It was exciting to try out different approaches to text classification. The transformer-based model showed better results than a classic approach like logistic regression. Although a production-ready result wasn't reached, the accuracy is pretty good for a multilingual model. And, of course, there is always room for improvement and fine-tuning!

You can check all the source code here.

I hope that this article was helpful for you. Follow me to read more about Big Data, Machine Learning and Cloud Computing!


Written by Ihor Kozlov

Python Software Engineer, Cloud Enthusiast
