# Overview

## Semi-supervised learning

Suppose we want to predict which novels our friend Justine will enjoy. We have a list of the handful of novels she has read recently, along with whether or not she enjoyed each one. The standard supervised machine learning paradigm for a problem like this would involve comparing the features of the positive examples (novels she enjoyed) with the features of the negative examples (novels she did not enjoy), to devise a system for predicting whether future as-yet-unseen novels will be received positively or negatively.

On the other hand, the semi-supervised paradigm would involve using not only novels Justine has read, but those she hasn’t read, to prepare to make predictions about her preferences. There are far more unlabeled examples (i.e. novels that we don’t know Justine’s feeling about) than labeled (i.e. positive and negative) ones. Surprisingly, in situations like this, the unlabeled data can actually be put to effective use.

To begin to understand how unlabeled data can help us develop a model for predicting labels, consider the following caricature. Suppose we have a few positive data points (illustrated in green below) and a few negative data points (illustrated in red). For simplicity, let’s say each data point is described by two features. When we form a model for predicting whether other data points will be positive or negative, it’s like dividing the feature plane into two regions, where points are predicted to be positive in one region and negative in the other. Our decision about how to do this could be very different depending on whether we know where unlabeled data points (illustrated in gray) lie.

## PU learning

Positive-unlabeled learning is an important subparadigm of semi-supervised learning, where the only labeled data points available are positive. For example, suppose we only know about Justine’s favorite novels. (Why would she talk about the ones she didn’t enjoy, anyway?)

The standard supervised machine learning paradigm would be out of luck in this situation, because any standard model’s algorithm requires both positive and negative examples to train. But again, despite their ambiguity, unlabeled data points can be helpful here if we’re clever about how to use them.

Consider the following new caricature. We have a few positive data points (illustrated in green below), but no negatives. We must divide the feature plane into two regions, where points are predicted to be positive in one region and negative in the other. Our decision about how to do this will be very different indeed if we know where unlabeled data points lie!

# Some techniques for PU learning

There has been a great deal of interest in developing approaches to PU learning problems. Since standard machine learning problems (in which a large number of positive and negative data points can be used to train a model) are a well-explored field with many solution methods, most approaches to PU learning involve clever adaptations of standard methods.

Following is a quick summary of a few techniques, most of which will be applied experimentally later.

## Direct application of a standard classifier

Perhaps the most obvious approach to a PU problem is as follows: treat the positive and unlabeled data points as positives and negatives, respectively. Train a standard classifier model on the data. The classifier will assign a score to each data point, with positives generally scored higher than negatives. Among the unlabeled data points (temporarily labeled as negatives), the ones with the highest scores should be most likely to be positives.

This naïvest of approaches is explored and somewhat justified in *Learning classifiers from only positive and unlabeled data* (2008) by Elkan and Noto. The central result of that paper is that, under certain assumptions that are basic but probably slightly unreasonable for real-life purposes, a standard classifier trained on positive and unlabeled data should give scores that are proportional to the scores it would give if it were properly trained on positive and negative data.

As the authors note, “This means that if the [hypothetically properly trained] classifier is only used to rank examples according to the chance that they belong to [the positives], then the classifier [trained on positive and unlabeled data] can be used directly instead.”

## PU bagging

A related but more sophisticated approach to a given PU problem involves a variation on bagging:

- Create a training set by combining all positive data points with a random sample from the unlabeled points, with replacement.
- Build a classifier from this “bootstrap” sample, treating positive and unlabeled data points as positives and negatives, respectively.
- Apply the classifier to whatever unlabeled data points were not included in the random sample – hereafter called OOB (“out of bag”) points – and record their scores.
- Repeat the three steps above many times and finally assign to each point the average of the OOB scores it has received.

One paper describing this approach is *A bagging SVM to learn from positive and unlabeled examples* (2013) by Mordelet and Vert. According to the authors, “the method can match and even outperform the performance of state-of-the-art methods for PU learning, particularly when the number of positive examples is limited and the fraction of negatives among the unlabeled examples is small. The proposed method can also run considerably faster than state-of-the-art methods, particularly when the set of unlabeled examples is large.”

## Two-step approaches

A wide variety of PU learning strategies fall into the general category of “two-step approaches.” One decent introduction to these is given in *An Evaluation of Two-Step Techniques for Positive-Unlabeled Learning in Text Classification* (2014) by Kaboutari, Bagherzadeh, and Kheradmand.

Here are the “two steps” of this kind of approach:

- Identify a subset of the unlabeled data points that can be confidently labeled as negatives. (The authors of the paper above call these points “reliable negatives.”)
- Use the positive and negative data points to train a standard classifier and apply it to the remaining unlabeled points.

Typically, the results of the second step are used in order to return to the first step and iterate. As the authors above state, “If the [reliable negative] set contains mostly negative documents and is sufficiently large, a learning algorithm… works very well and will be able to build a good classifier. But often a very small set of negative documents identified in step 1… then a learning algorithm iteratively runs till it converges or some stopping criterion is met.”

A similar approach is laid out by Shubham Jain in the recent blog post Introduction to Pseudo-Labelling : A Semi-Supervised learning technique, but the details there are not specifically aimed at PU problems.

## Positive unlabeled random forest

One last development worth mentioning here, although it will not be included in our experiments below, is an algorithm introduced in *Towards Positive Unlabeled Learning for Parallel Data Mining: A Random Forest Framework* (2014) by Li and Hua.

According to the authors, “The proposed framework, termed PURF (Positive Unlabeled Random Forest), is able to learn from positive and unlabeled instances and achieve comparable classification performance with RF trained by fully labeled data through parallel computing according to experiments on both synthetic and real-world UCI datasets… This framework combines PU learning techniques including widely used PU information gain (PURF-IG) and newly developed PU Gini index (PURF-GI) with an extendable parallel computing algorithm (i.e. RF).”

Furthermore, the authors report that they have “implemented PURF with Python based on the scikit-learn package,” so if the code is made publically available at some point, it may be a promising tool.

# Python implementation

## Data

We will need the following tools for data manipulation and randomized sampling:

import pandas as pd import numpy as np

In all code below, it is assumed that we have a pandas dataframe X of features, and a pandas series y of targets (with 1 for positives and 0 for the rest).

We will “hide” some of the positives.

# Keep the original targets safe for later y_orig = y.copy() # Unlabel a certain number of data points hidden_size = ### ENTER A NUMBER HERE ### y.loc[ np.random.choice( y[y == 1].index, replace = False, size = hidden_size ) ] = 0

The resulting data set is what we will use to build models based on the approaches outlined before. Afterwards, we can judge their effectiveness by determining how well they manage to identify the positives we have hidden from them.

## Direct application of a standard classifier

For the first approach, we’ll use a fairly generic random forest:

from sklearn.ensemble import RandomForestClassifier rf = RandomForestClassifier( n_estimators = 1000, # 1000 trees n_jobs = -1 # Use all CPU cores ) rf.fit(X, y)

Then we simply store the scores assigned by the random forest:

results = pd.DataFrame({ 'truth' : y_orig, # True labels 'label' : y, # Labels shown to models 'output_std' : rf.predict_proba(X)[:,1] # Random forest's scores }, columns = ['truth', 'label', 'output_std'])

For convenience, we’ll store the outputs of the other approaches in this same dataframe.

## PU bagging

### The slow way

As with the random forest, we will build 1000 decision trees.

from sklearn.tree import DecisionTreeClassifier n_estimators = 1000 estimator = DecisionTreeClassifier()

It is convenient to keep track of the indices of positive and unlabeled data points:

iP = y[y > 0].index iU = y[y <= 0].index

For each data point, we’ll also record how many times it has been OOB, and the sum of the OOB scores it receives:

num_oob = pd.DataFrame(np.zeros(shape = y.shape), index = y.index) sum_oob = pd.DataFrame(np.zeros(shape = y.shape), index = y.index)

And away we go!

for _ in range(n_estimators): # Get a bootstrap sample of unlabeled points for this round ib = np.random.choice(iU, replace = True, size = len(iP)) # Find the OOB data points for this round i_oob = list(set(iU) - set(ib)) # Get the training data (ALL positives and the bootstrap # sample of unlabeled points) and build the tree Xb = X[y > 0].append(X.loc[ib]) yb = y[y > 0].append(y.loc[ib]) estimator.fit(Xb, yb) # Record the OOB scores from this round sum_oob.loc[i_oob, 0] += estimator.predict_proba(X.loc[i_oob])[:,1] num_oob.loc[i_oob, 0] += 1

Finally, store the scores assigned by this approach.

results['output_bag'] = sum_oob / num_oob

### A custom PU classifier

The PU bagging approach is so elegant (and powerful, as we will see) that I decided to create a more user-friendly implementation of it that takes advantage of parallelization. My implementation is a fairly straightforward adaptation of the BaggingClassifier from scikit-learn. The code can be found in the baggingPU.py file available here. Using my implementation, all of the code above could be replaced with the following:

from baggingPU import BaggingClassifierPU bc = BaggingClassifierPU( DecisionTreeClassifier(), n_estimators = 1000, n_jobs = -1, max_samples = sum(y) # Each training sample will be balanced ) bc.fit(X, y) results['output_bag'] = bc.oob_decision_function_[:,1]

On a machine with 4 cores, this new code snippet runs in roughly one quarter the previous time.

## A two-step approach

During a two-step approach, data points will carry one of three designations (positive, unlabeled, or negative), and many of the points will be re-designated along the way. So we begin by creating a new target vector, with 1 for positive, -1 for unlabeled, and 0 for “reliable negative.”

ys = 2 * y - 1

Of course, there are no reliable negatives to start with. For step 1, we’ll use the scores from before:

pred = rf.predict_proba(X)[:,1]

Let’s find the range of scores given to the known positive data points:

range_P = [min(pred * (ys > 0)), max(pred * (ys > 0))]

Step 1: If any unlabeled point has a score below all known positives, label it negative. On the other hand, if its score is above all known positives, we might as well label it positive.

iP_new = ys[(ys < 0) & (pred >= range_P[1])].index iN_new = ys[(ys < 0) & (pred <= range_P[0])].index ys.loc[iP_new] = 1 ys.loc[iN_new] = 0

Let’s prepare a classifier to be used for step 2:

rf2 = RandomForestClassifier(n_estimators = 1000, n_jobs = -1)

We will limit ourselves to 10 iterations at the most. It’s an arbitrary choice, but this approach can take a very long time if we simply wait for it to stop finding and labeling new negatives. (Besides, as mentioned earlier and as we will see experimentally, it might even be good enough to limit ourselves to a single iteration.)

for i in range(10): # If step 1 didn't find new labels, we're done if len(iP_new) + len(iN_new) == 0 and i > 0: break print( 'Step 1 labeled %d new positives and %d new negatives.' % (len(iP_new), len(iN_new)) ) print('Doing step 2... ', end = '') # Retrain on new labels and get new scores rf2.fit(X, ys) pred = rf2.predict_proba(X)[:,-1] # Find the range of scores given to positive data points range_P = [min(pred * (ys > 0)), max(pred * (ys > 0))] # Repeat step 1 iP_new = ys[(ys < 0) & (pred >= range_P[1])].index iN_new = ys[(ys < 0) & (pred <= range_P[0])].index ys.loc[iP_new] = 1 ys.loc[iN_new] = 0

Afterwards, as usual, we store the final scores assigned:

results['output_stp'] = pred

## Averaging

As a sort of fourth approach to compare to the rest, we can calculate the average score that the three main approaches assign to each data point.

results['output_all'] = results[[ 'output_std', 'output_bag', 'output_stp' ]].mean(axis = 1)

After this, we are prepared the compare the performance of the different approaches – that is, how well they can identify the hidden positives.

# Experiments with artificial data

We will test drive some of the approaches developed above by applying them to a few artificial data sets. In a subsequent post, I will show the results when a certain real-life data set is used instead.

## Circles

The complete code for this experiment is available at this github link. Some of the less interesting code, especially for producing graphs, is not included in this post.

To create our “circles” data set:

from sklearn.datasets import make_circles X, y = make_circles( n_samples = 6000, noise = 0.1, shuffle = True, factor = .65 ) X = pd.DataFrame(X, columns = ['feature1', 'feature2']) y = pd.Series(y)

In this data set, there are 3000 positives.

### 1000 hidden positives

Let’s set

hidden_size = 1000

– so one third of the positives will be unlabeled. The resulting data set is visualized here. We can see that the artificial data consists of two concentric circular regions of points, with the positives comprising the innermost region.

To test the various approaches to PU learning, we’ll run all the code previously listed. The graphs below visualize the scores that are assigned to unlabeled points by using a standard classifier, and by using a PU bagging approach.

On a visual level, there appears to be little difference between the approaches, except that PU bagging assigns a wider range of scores than the standard classifier does.

The two different implementations of PU bagging apparently perform almost identically. This continues to be true in all subsequent experiments (described below), indicating that the two implementations are in fact the same – as intended! – and the only real difference is that my parallelized version adapted from scikit-learn is much faster.

The next graphs visualize the two-step approach. On the left, we can see which data points are marked as “reliable negatives” in the first pass through step 1. Note that a considerable number of the intended negatives are immediately relabeled as such. The graph on the right shows the scores that are assigned to unlabeled points after several iterations of the two-step process.

Visually, the output of my two-step method seems to differ very little from the standard classifier.

Let’s take a more careful look at the performance of these three main approaches, judging each one by its ability to identify the positives we have hidden among the unlabeled points. Here is a visual summary:

No matter what approach we use, if we select only the 500 “best” unlabeled points – that is, those with the highest scores – it turns out that over 90% of them are actually hidden positives. The three different approaches perform at roughly the same level, but the performance of the PU bagging approach is noticeably behind the others. This shouldn’t be too surprising; as quoted earlier, this approach works particularly well “when the number of positive examples is limited,” which is not the case here.

### 2700 hidden positives

Let’s try setting

hidden_size = 2700

so that 90% of the positives will be unlabeled. The new data set is visualized here.

After running all the same code as before, the following graphs visualize the scores that are assigned by each of the three main approaches. Judging by this, PU bagging seems to have done a better job than the other methods.

Like before, here is a comparison of the performance of the three main approaches:

As promised, PU bagging definitively outperforms the other two approaches in this situation. Let’s trying going to a further extreme.

### 2970 hidden positives

Here we will set

hidden_size = 2970

so now 99% of the positives become unlabeled!

Again running the same code, here is a comparison of the three main approaches:

The difference between PU bagging and the other methods is really remarkable this time.

## Moons

The complete code for the next experiment is available at this github link. To create our “moons” data set:

from sklearn.datasets import make_moons X, y = make_moons(n_samples = 6000, noise = 0.1, shuffle = True) X = pd.DataFrame(X, columns = ['feature1', 'feature2']) y = pd.Series(y)

This data set has 3000 positives, like the previous one.

### 1000 hidden positives

Here is a visualization of this second data set, with the labels removed from 1000 of the positives.

The scores given to unlabeled data points by a standard classifier and by PU bagging are shown below.

Here are the visualizations of the two-step approach:

Finally, the comparison of performance:

In this case, the different approaches seem pretty similar in their ability to identify the hidden positives.

### 2700 hidden positives

With 90% of the positives unlabeled, here are the score visualizations and the performance comparison.

### 2970 hidden positives

With 99% of the positives hidden, here are the results.

Again, there is a dramatic difference between PU bagging and other approaches.

## Blobs

The complete code for this experiment is available at this link. To create our “blobs” data set:

from sklearn.datasets.samples_generator import make_blobs X, y = make_blobs( n_samples = 6000, centers = [[1,5], [5,1], [0,0], [6,6]] ) X = pd.DataFrame(X, columns = ['feature1', 'feature2']) # Convert the original labels from [0,1,2,3] to just [0,1] y = (y > 1).astype(int) y = pd.Series(y)

Again, the data set has 3000 positives.

### 1000 hidden positives

Here is a look at this last artificial data set, with labels removed from 1000 positives.

The scores given by the three main approaches are visualized below, followed by the performance comparisons.

As before, with only one third of positives hidden, the PU bagging approach lags a bit behind the others.

### 2700 hidden positives

With 90% of the positives unlabeled, here are the score visualizations and the performance comparison.

### 2970 hidden positives

For this most extreme version of our final experiment, with 99% of the positives hidden, let’s try something new. We’ll run the usual three main methods, but we’ll also try PU bagging using support vector machines (also known as SVM, or specifically SVC in the sklearn implementation we’re using) as the underlying classifier, in addition to the usual decision tree implementation.

Here are the visualizations of the scores assigned by the three usual approaches and the new one.

Using SVM instead of trees (as done, incidentally, in the paper the PU bagging approach is based on) seems to be best by a wide margin. To make sure, let’s take a closer look. Here are the comparative performances of the usual three methods:

There is a huge difference between PU bagging (with decision trees) and other approaches. Now let’s compare tree- and SVM-based PU bagging.

In this experiment with a toy data set, and with the vast majority of positives hidden among the unlabeled points, PU bagging of SVM classifiers is the clear winner.