3 Tips for Working with Imbalanced Datasets

Some little-known ways of dealing with those pesky skewed class distributions

Zito Relova
Towards Data Science



Real-world datasets come in all shapes and sizes. At some point, you will come across a dataset with imbalanced target classes. What exactly does this mean? Let’s take a look at an example from Kaggle. This dataset contains details of credit card clients and defaults on their payments.

Our target variable here is default.payment.next.month, a binary variable that takes the value 0 if a client did not default on their payment and 1 if they did. First, we'll check how many entries there are for each of these classes.

Plotting our class distribution
Class distribution for our target variable
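As a rough sketch of what this check might look like (assuming the Kaggle CSV is named UCI_Credit_Card.csv and is loaded into a pandas DataFrame called df; the file and variable names are illustrative):

```python
import pandas as pd
import matplotlib.pyplot as plt

# Load the credit card default dataset (file name assumed here).
df = pd.read_csv("UCI_Credit_Card.csv")

# Count how many entries fall into each target class.
counts = df["default.payment.next.month"].value_counts()
print(counts / len(df))  # roughly 78% non-defaults (0) vs. 22% defaults (1)

# Bar plot of the class distribution.
counts.plot(kind="bar", title="Class distribution for our target variable")
plt.show()
```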

We see from the graph above that almost 80% of the target variable has a class of 0. This is what is known as an imbalanced dataset: there are significantly more entries for one class than another, resulting in a skewed class distribution. You might be wondering: why would this be a problem, and why would we have to tackle this dataset differently?

Imagine we had a very simple classifier for this dataset that always predicts 0 regardless of the data we pass to it. Using raw accuracy as our metric, we would get an accuracy close to 80%. This might seem like a respectable score for a classifier, but it would actually be very misleading: we would not be able to predict when someone defaults on their payments at all! Let's create this simple classifier together with a Random Forest classifier as our baseline models.

Checking the accuracy of a dummy classifier and a Random Forest classifier
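A minimal sketch of these two baselines, continuing with the df assumed above (variable names are illustrative):

```python
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X = df.drop(columns=["default.payment.next.month"])
y = df["default.payment.next.month"]
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# A "classifier" that always predicts the majority class (0).
dummy = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
print("Dummy accuracy: ", dummy.score(X_test, y_test))   # close to 0.78

# A Random Forest as a more realistic baseline.
forest = RandomForestClassifier(random_state=42).fit(X_train, y_train)
print("Forest accuracy:", forest.score(X_test, y_test))
```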
1. Choose the right metric

That leads us to our first tip: making sure we use an appropriate metric to measure the performance of our model. From what we saw earlier, accuracy would not be an appropriate metric for this use case, precisely because so many of our dataset entries fall into a single class. Let's take a look at a few alternatives.

Precision — Precision is the number of correctly classified positive (1 in our case) examples divided by the total number of examples that were classified as being positive (1). If we use the classifier that always predicts 0, our precision will be 0 as well because we haven't classified any examples as being positive.

Checking precision score
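Continuing the sketch above (dummy, forest, X_test and y_test come from the baseline snippet), the precision check might look like this:

```python
from sklearn.metrics import precision_score

# Precision = correctly predicted positives / all predicted positives.
print(precision_score(y_test, dummy.predict(X_test), zero_division=0))  # 0.0
print(precision_score(y_test, forest.predict(X_test)))
```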

Recall — Recall, also known as “sensitivity”, is the number of positive examples that were correctly classified divided by the total number of positive examples. In our case, this would be the number of entries with value 1 that were correctly identified by our model. Using our dummy classifier, our recall would also be 0. This metric would help us identify whether our model was correctly getting the 1 entries in our dataset, but it still has a major flaw. If our classifier just always predicted 1 for every example, it would get perfect recall! Its accuracy, on the other hand, would not be so great. We need some sort of balance between these metrics to evaluate our model reasonably.

Checking recall score
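And the corresponding recall check, under the same assumptions:

```python
from sklearn.metrics import recall_score

# Recall = correctly predicted positives / all actual positives.
print(recall_score(y_test, dummy.predict(X_test)))   # 0.0: no defaults caught
print(recall_score(y_test, forest.predict(X_test)))
```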

F1 Score — The F1 Score is a balance between precision and recall. It is given by the formula (2 * Precision * Recall) / (Precision + Recall). Because it accounts for both false positives and false negatives, it is far better suited to imbalanced datasets than plain accuracy, and it gives us a more reliable way to compare our models.

Checking F1 score
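Again under the same assumptions, the F1 score for both baselines:

```python
from sklearn.metrics import f1_score

# F1 = (2 * precision * recall) / (precision + recall)
print(f1_score(y_test, dummy.predict(X_test)))    # 0.0 for the dummy baseline
print(f1_score(y_test, forest.predict(X_test)))
```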

2. Set up a cross-validation strategy

Rather than calling scikit-learn's train_test_split with its default settings, we should make sure our splits accurately represent the distribution of our target variable. A very simple way to do this is to pass the stratify parameter when calling the train_test_split function.

Plotting the distribution of our data after splitting
Class distribution after splitting into train and test sets
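A sketch of the stratified split and the distribution check, reusing the X and y assumed earlier (the test size is illustrative):

```python
from sklearn.model_selection import train_test_split

# stratify=y keeps the original class proportions in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

print(y_train.value_counts(normalize=True))  # ~78% / 22% in the train set
print(y_test.value_counts(normalize=True))   # ~78% / 22% in the test set
```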

Making this small change ensures that the train and test sets follow the same distribution as our original dataset.

Another way to make your cross-validation strategy more robust to class imbalances is to use several folds or train on different subsets of your data. For this, we can use StratifiedKFold and StratifiedShuffleSplit to ensure that we still follow our target variable’s distribution.

StratifiedKFold will split our original dataset into several folds, with each fold having a distribution that is similar to the original. This means that we can train and evaluate a model on each of these folds while still being sure that our distribution stays consistent. StratifiedShuffleSplit also preserves our target variable's distribution, but instead of partitioning the data into disjoint folds it draws a fresh random train/test split on each iteration, so the splits may overlap.

Using StratifiedKFold to evaluate a model
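One way this might look in code, again with the X, y and Random Forest assumed earlier (the choice of five folds and the F1 scoring metric are illustrative):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Each of the 5 folds preserves the ~78/22 class split of the full dataset.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(
    RandomForestClassifier(random_state=42), X, y, cv=cv, scoring="f1"
)
print(scores, scores.mean())
```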

3. Change target weights in your model

By default, models will assign the same weight to every class in our target variable. This would be fine if our dataset had a relatively even distribution among its target classes. In our case, we will want to use different weights for each class depending on how skewed our dataset is.

How should we determine what weights to use for each class? Many classifiers have a class_weight parameter where you can pass in a string like balanced, which computes appropriate weights from the data you pass in. If your model does not expose this parameter, scikit-learn also has a utility function, compute_class_weight, that computes the weights for us. Let's see how we can use class_weight in our model.

Using class_weight when training a model
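A sketch of both options, using the stratified X_train and y_train from earlier; the printed weights are approximate:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.utils.class_weight import compute_class_weight

# Option 1: let the classifier derive weights inversely proportional
# to class frequencies.
forest = RandomForestClassifier(class_weight="balanced", random_state=42)
forest.fit(X_train, y_train)

# Option 2: compute the weights explicitly and pass them in as a dict.
classes = np.unique(y_train)
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y_train)
print(dict(zip(classes, weights)))  # roughly {0: 0.64, 1: 2.26}
forest = RandomForestClassifier(
    class_weight=dict(zip(classes, weights)), random_state=42
).fit(X_train, y_train)
```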

Conclusion

We’ve looked at 3 different ways to help us handle imbalanced datasets.

  1. Choosing the appropriate metric for our task. We have seen that sometimes, metrics like accuracy can be very misleading when evaluating a model.
  2. Using a good cross-validation strategy. We can ensure our train and test sets follow a similar distribution using several methods like StratifiedKFold and StratifiedShuffleSplit.
  3. Setting class weights on your target classes to give more weight to the minority class. This strategy makes our model put more importance on the minority class, potentially helping it classify that class better.

Armed with these strategies, you should be able to tackle imbalanced datasets with ease in the future.

Thank you for reading!
