🚀 Snorkel Intro Tutorial: Data Labeling¶
In this tutorial, we will walk through the process of using Snorkel to build a training set for classifying YouTube comments as spam or not spam. The goal of this tutorial is to illustrate the basic components and concepts of Snorkel in a simple way, but also to dive into the actual process of iteratively developing real applications in Snorkel.
Note that this is a toy dataset that helps highlight the different features of Snorkel. For examples of high-performance, real-world uses of Snorkel, see our publications list.
Additionally:
- For an overview of Snorkel, visit snorkel.org
- You can also check out the Snorkel API documentation
Our goal is to train a classifier over the comment data that can predict whether a comment is spam or not spam. We have access to a large amount of unlabeled data in the form of YouTube comments with some metadata. In order to train a classifier, we need to label our data, but doing so by hand for real world applications can often be prohibitively slow and expensive.
In these cases, we can turn to a weak supervision approach, using labeling functions (LFs) in Snorkel: noisy, programmatic rules and heuristics that assign labels to unlabeled training data.
We'll dive into the Snorkel API and how we write labeling functions later in this tutorial, but as an example,
we can write an LF that labels data points with "http"
in the comment text as spam since many spam
comments contain links:
from snorkel.labeling import labeling_function
@labeling_function()
def lf_contains_link(x):
    # Return a label of SPAM if "http" in comment text, otherwise ABSTAIN
    # (SPAM and ABSTAIN are integer label constants defined in Section 1 below)
    return SPAM if "http" in x.text.lower() else ABSTAIN
The tutorial is divided into four parts:
Loading Data: We load a YouTube comments dataset, originally introduced in "TubeSpam: Comment Spam Filtering on YouTube", ICMLA'15 (T.C. Alberto, J.V. Lochter, T.A. Almeida).
Writing Labeling Functions: We write Python programs that take as input a data point and assign labels (or abstain) using heuristics, pattern matching, and third-party models.
Combining Labeling Function Outputs with the Label Model: We model the outputs of the labeling functions over the training set using a novel, theoretically-grounded modeling approach, which estimates the accuracies and correlations of the labeling functions using only their agreements and disagreements, and then uses this to reweight and combine their outputs, which we then use as probabilistic training labels.
Training a Classifier: We train a classifier that can predict labels for any YouTube comment (not just the ones labeled by the labeling functions) using the probabilistic training labels from step 3.
Task: Spam Detection¶
We use a YouTube comments dataset that consists of YouTube comments from 5 videos. The task is to classify each comment as either:
- HAM: comments relevant to the video (even very simple ones), or
- SPAM: irrelevant (often trying to advertise something) or inappropriate messages
For example, the following comments are SPAM:
"Subscribe to me for free Android games, apps.."
"Please check out my vidios"
"Subscribe to me and I'll subscribe back!!!"
and these are HAM:
"3:46 so cute!"
"This looks so fun and it's a good song"
"This is a weird video."
Data Splits in Snorkel¶
We split our data into two sets:
- Training Set: The largest split of the dataset, and the one without any ground truth ("gold") labels. We will generate labels for these data points with weak supervision.
- Test Set: A small, standard held-out blind hand-labeled set for final evaluation of our classifier. This set should only be used for final evaluation, not error analysis.
Note that in more advanced production settings, we will often further split up the available hand-labeled data into a development split, for getting ideas to write labeling functions, and a validation split for e.g. checking our performance without looking at test set scores, hyperparameter tuning, etc. These splits are used in some of the other advanced tutorials, but omitted for simplicity here.
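For illustration, here is a minimal sketch (our own, not part of this tutorial's pipeline) of how one might carve such a development split out of the training set with pandas; the function name and split size are assumptions:
import pandas as pd

def make_dev_split(df_train, n=200, seed=123):
    # Hold out n training points for hand-labeling as a development split
    df_dev = df_train.sample(n=n, random_state=seed)
    df_train_rest = df_train.drop(index=df_dev.index)
    return df_train_rest, df_dev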
1. Loading Data¶
We load the YouTube comments dataset and create Pandas DataFrame objects for the train and test sets. DataFrames are extremely popular in Python data analysis workloads, and Snorkel provides native support for several DataFrame-like data structures, including Pandas, Dask, and PySpark. For more information on working with Pandas DataFrames, see the Pandas DataFrame guide.
Each DataFrame consists of the following fields:
- author: Username of the comment author
- date: Date and time the comment was posted
- text: Raw text content of the comment
- label: Whether the comment is SPAM (1), HAM (0), or UNKNOWN/ABSTAIN (-1)
- video: Video the comment is associated with
We start by loading our data. The load_spam_dataset() method downloads the raw CSV files from the internet, divides them into splits, converts them into DataFrames, and shuffles them. As mentioned above, the dataset contains comments from 5 of the most popular YouTube videos during a period between 2014 and 2015.
- The first four videos' comments are combined to form the train set. This set has no gold labels.
- The fifth video makes up the test set.
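For intuition, here is a simplified sketch of what a loader like load_spam_dataset() might do; the file names and details are assumptions, not the actual implementation (see the tutorial's utils module for that):
import glob
import pandas as pd

def load_spam_dataset_sketch(data_dir="data"):
    # One CSV file per video (assumed naming scheme)
    filenames = sorted(glob.glob(f"{data_dir}/Youtube*.csv"))
    dfs = []
    for i, filename in enumerate(filenames, start=1):
        df = pd.read_csv(filename)
        df["video"] = i  # record which video the comments came from
        dfs.append(df)
    # First four videos form the train split (shuffled, gold labels hidden);
    # the fifth video forms the test split
    df_train = pd.concat(dfs[:4]).sample(frac=1, random_state=123).reset_index(drop=True)
    df_train["label"] = -1
    df_test = dfs[4].reset_index(drop=True)
    return df_train, df_test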
This next cell takes care of some notebook-specific housekeeping. You can ignore it.
%matplotlib inline
import os
# Turn off TensorFlow logging messages
os.environ["TF_CPP_MIN_LOG_LEVEL"] = "3"
# For reproducibility
os.environ["PYTHONHASHSEED"] = "0"
If you want to display all comment text untruncated, change DISPLAY_ALL_TEXT to True below.
import pandas as pd
DISPLAY_ALL_TEXT = False
pd.set_option("display.max_colwidth", 0 if DISPLAY_ALL_TEXT else 50)
This next cell makes sure a spaCy English model is downloaded. If this is your first time downloading this model, restart the kernel after executing the next cell.
# Download the spaCy english model
# !python -m spacy download en_core_web_sm
✔ Download and installation successful. You can now load the model via spacy.load('en_core_web_sm')
from utils import load_spam_dataset
df_train, df_test = load_spam_dataset()
# We pull out the label vectors for ease of use later
Y_test = df_test.label.values
The class distribution varies slightly between SPAM and HAM, but they're approximately class-balanced.
# For clarity, we define constants to represent the class labels for spam, ham, and abstaining.
ABSTAIN = -1
HAM = 0
SPAM = 1
2. Writing Labeling Functions (LFs)¶
A gentle introduction to LFs¶
Labeling functions (LFs) help users encode domain knowledge and other supervision sources programmatically.
LFs are heuristics that take as input a data point and either assign a label to it (in this case, HAM or SPAM) or abstain (don't assign any label). Labeling functions can be noisy: they don't have perfect accuracy and don't have to label every data point.
Moreover, different labeling functions can overlap (label the same data point) and even conflict (assign different labels to the same data point). This is expected, and we demonstrate how we deal with this later.
Because their only requirement is that they map a data point to a label (or abstain), they can wrap a wide variety of forms of supervision. Examples include, but are not limited to:
- Keyword searches: looking for specific words in a sentence
- Pattern matching: looking for specific syntactical patterns
- Third-party models: using a pre-trained model (usually a model for a different task than the one at hand)
- Distant supervision: using an external knowledge base (a sketch of this style follows the list)
- Crowdworker labels: treating each crowdworker as a black-box function that assigns labels to subsets of the data
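To give a flavor of the distant supervision style mentioned above, here is an illustrative LF built on a made-up miniature "knowledge base" of spammy link domains; it is not part of this tutorial's final LF set, and a real distant-supervision LF would consult an actual external resource:
from snorkel.labeling import labeling_function

# Made-up miniature "knowledge base" of domains often seen in spam links
SPAMMY_DOMAINS = frozenset({"bit.ly", "goo.gl", "tinyurl.com"})

@labeling_function()
def lf_spammy_domain(x):
    # SPAM and ABSTAIN are the integer label constants defined in Section 1
    return SPAM if any(d in x.text.lower() for d in SPAMMY_DOMAINS) else ABSTAIN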
Recommended practice for LF development¶
Typical LF development cycles include multiple iterations of ideation, refining, evaluation, and debugging. A typical cycle consists of the following steps:
- Look at examples to generate ideas for LFs
- Write an initial version of an LF
- Spot check its performance by looking at its output on data points in the training set (or development set if available)
- Refine and debug to improve coverage or accuracy as necessary
Our goal for LF development is to create a high quality set of training labels for our unlabeled dataset, not to label everything or directly create a model for inference using the LFs. The training labels are used to train a separate discriminative model (in this case, one which just uses the comment text) in order to generalize to new, unseen data points. Using this model, we can make predictions for data points that our LFs don't cover.
We'll walk through the development of two LFs using basic analysis tools in Snorkel, then provide a full set of LFs that we developed for this tutorial.
a) Exploring the training set for initial ideas¶
We'll start by looking at 20 random data points from the train set to generate some ideas for LFs.
df_train.head()
 | author | date | text | label | video |
---|---|---|---|---|---|
0 | Alessandro leite | 2014-11-05T22:21:36 | pls http://www10.vakinha.com.br/VaquinhaE.aspx... | -1.0 | 1 |
1 | Salim Tayara | 2014-11-02T14:33:30 | if your like drones, plz subscribe to Kamal Ta... | -1.0 | 1 |
2 | Phuc Ly | 2014-01-20T15:27:47 | go here to check the views :3 | -1.0 | 1 |
3 | DropShotSk8r | 2014-01-19T04:27:18 | Came here to check the views, goodbye. | -1.0 | 1 |
4 | css403 | 2014-11-07T14:25:48 | i am 2,126,492,636 viewer :D | -1.0 | 1 |
df_train[["author", "text", "video"]].sample(20, random_state=2)
 | author | text | video |
---|---|---|---|
4 | ambareesh nimkar | "eye of the tiger" "i am the champion" seems l... | 2 |
87 | pratik patel | mindblowing dance.,.,.superbbb song | 3 |
14 | RaMpAgE420 | Check out Berzerk video on my channel ! :D | 4 |
80 | Jason Haddad | Hey, check out my new website!! This site is a... | 1 |
104 | austin green | Eminem is my insperasen and fav | 4 |
305 | M.E.S | hey guys look im aware im spamming and it piss... | 4 |
22 | John Monster | Οh my god ... Roar is the most liked video at ... | 2 |
338 | Alanoud Alsaleh | I started hating Katy Perry after finding out ... | 2 |
336 | Leonardo Baptista | http://www.avaaz.org/po/petition/Youtube_Corpo... | 1 |
143 | UKz DoleSnacher | Remove This video its wank | 1 |
163 | Monica Parker | Check out this video on YouTube: | 3 |
129 | b0b1t.48058475 | i rekt ur mum last nite. cuz da haterz were 2 ... | 2 |
277 | MeSoHornyMeLuvULongTime | This video is so racist!!! There are only anim... | 2 |
265 | HarveyIsTheBoss | You gotta say its funny. well not 2 billion wo... | 1 |
214 | janez novak | share and like this page to win a hand signed ... | 4 |
76 | Bizzle Sperq | https://www.facebook.com/nicushorbboy add mee ... | 1 |
123 | Gaming and Stuff PRO | Hello! Do you like gaming, art videos, scienti... | 1 |
268 | Young IncoVEVO | Check out my Music Videos! and PLEASE SUBSCRIB... | 1 |
433 | Chris Edgar | Love the way you lie - Driveshaft | 4 |
40 | rap classics | check out my channel for rap and hip hop music | 4 |
One dominant pattern in the comments that look like spam (which we might know from prior domain experience, or from inspection of a few training data points) is the use of the phrase "check out" (e.g. "check out my channel"). Let's start with that.
b) Writing an LF to identify spammy comments that use the phrase "check out"¶
Labeling functions in Snorkel are created with the @labeling_function decorator. The decorator can be applied to any Python function that returns a label for a single data point.
Let's start developing an LF to catch instances of commenters trying to get people to "check out" their channel, video, or website. We'll start by just looking for the exact string "check out" in the text, and see how that compares to looking for just "check" in the text. For each of the two versions of our rule, we'll write a Python function over a single data point that expresses it, then add the decorator.
from snorkel.labeling import labeling_function
@labeling_function()
def check(x):
    return SPAM if "check" in x.text.lower() else ABSTAIN

@labeling_function()
def check_out(x):
    return SPAM if "check out" in x.text.lower() else ABSTAIN
To apply one or more LFs that we've written to a collection of data points, we use an LFApplier. Because our data points are represented with a Pandas DataFrame in this tutorial, we use the PandasLFApplier. Correspondingly, a single data point x that's passed into our LFs will be a Pandas Series object.
It's important to note that these LFs will work for any object with an attribute named text, not just Pandas objects. Snorkel has several other appliers for different data point collection types which you can browse in the API documentation.
The output of the apply(...) method is a label matrix, a fundamental concept in Snorkel. It's a NumPy array L with one column for each LF and one row for each data point, where L[i, j] is the label that the jth labeling function output for the ith data point. We'll create a label matrix for the train set.
from snorkel.labeling import PandasLFApplier
lfs = [check_out, check]
applier = PandasLFApplier(lfs=lfs)
L_train = applier.apply(df=df_train)
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1586/1586 [00:00<00:00, 48761.32it/s]
L_train
array([[-1, -1], [-1, -1], [-1, 1], ..., [ 1, 1], [-1, 1], [ 1, 1]])
c) Evaluate performance on training set¶
We can easily calculate the coverage of these LFs (i.e., the percentage of the dataset that they label) as follows:
coverage_check_out, coverage_check = (L_train != ABSTAIN).mean(axis=0)
print(f"check_out coverage: {coverage_check_out * 100:.6f}%")
print(f"check coverage: {coverage_check * 100:.6f}%")
check_out coverage: 21.437579% check coverage: 25.788146%
Many statistics about labeling functions, like coverage, are useful when building any Snorkel application, so Snorkel provides tooling for common LF analyses via the LFAnalysis utility. We report the following summary statistics for multiple LFs at once:
- Polarity: The set of unique labels this LF outputs (excluding abstains)
- Coverage: The fraction of the dataset the LF labels
- Overlaps: The fraction of the dataset where this LF and at least one other LF label
- Conflicts: The fraction of the dataset where this LF and at least one other LF label and disagree
- Correct: The number of data points this LF labels correctly (if gold labels are provided)
- Incorrect: The number of data points this LF labels incorrectly (if gold labels are provided)
- Empirical Accuracy: The empirical accuracy of this LF (if gold labels are provided)
For Correct, Incorrect, and Empirical Accuracy, we don't want to penalize the LF for data points where it abstained, so we calculate these statistics only over those data points where the LF output a label.
Note that in our current setup, we can't compute these statistics because we don't have any ground-truth labels (other than in the test set, which we cannot look at). Not to worry: Snorkel's LabelModel will estimate them without needing any ground-truth labels in the next step!
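(For reference, if we did have a hand-labeled development split, the same summary call could also report the gold-dependent columns; the names df_dev and Y_dev below are hypothetical and don't exist in this tutorial.)
# Hypothetical: with a hand-labeled dev split and its gold labels,
# lf_summary would also report Correct, Incorrect, and Empirical Accuracy
# L_dev = applier.apply(df=df_dev)
# LFAnalysis(L=L_dev, lfs=lfs).lf_summary(Y=Y_dev)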
from snorkel.labeling import LFAnalysis
LFAnalysis(L=L_train, lfs=lfs).lf_summary()
 | j | Polarity | Coverage | Overlaps | Conflicts |
---|---|---|---|---|---|
check_out | 0 | [1] | 0.214376 | 0.214376 | 0.0 |
check | 1 | [1] | 0.257881 | 0.214376 | 0.0 |
We might want to pick the check rule, since check has higher coverage. Let's take a look at 10 random train set data points where check labeled SPAM to see if it matches our intuition or if we can identify some false positives.
df_train.iloc[L_train[:, 1] == SPAM].sample(10, random_state=1)
 | author | date | text | label | video |
---|---|---|---|---|---|
305 | M.E.S | NaN | hey guys look im aware im spamming and it piss... | -1.0 | 4 |
265 | Kawiana Lewis | 2015-02-27T02:20:40.987000 | Check out this video on YouTube:opponents mm <... | -1.0 | 3 |
89 | Stricker Stric | NaN | eminem new song check out my videos | -1.0 | 4 |
147 | TheGenieBoy | NaN | check out fantasy music right here -------&... | -1.0 | 4 |
240 | Made2Falter | 2014-09-09T23:55:30 | Check out our vids, our songs are awesome! And... | -1.0 | 2 |
273 | Artady | 2014-08-11T16:27:55 | https://soundcloud.com/artady please check my ... | -1.0 | 2 |
94 | Nick McGoldrick | 2014-10-27T13:19:06 | Check out my drum cover of E.T. here! thanks -... | -1.0 | 2 |
139 | MFkin PRXPHETZ | 2014-01-20T09:08:39 | if you like raw talent, raw lyrics, straight r... | -1.0 | 1 |
303 | 이 정훈 | NaN | This great Warning will happen soon. ,0\nLneaD... | -1.0 | 4 |
246 | media.uploader | NaN | Check out my channel to see Rihanna short mix ... | -1.0 | 4 |
No clear false positives here, but many of these comments look like they could be labeled by check_out as well.
Let's see 10 data points where check_out abstained, but check labeled. We can use the get_label_buckets(...) utility to group data points by their predicted and/or true labels.
from snorkel.analysis import get_label_buckets
buckets = get_label_buckets(L_train[:, 0], L_train[:, 1])
df_train.iloc[buckets[(ABSTAIN, SPAM)]].sample(10, random_state=1)
 | author | date | text | label | video |
---|---|---|---|---|---|
403 | ownpear902 | 2014-07-22T18:44:36.299000 | check it out free stuff for watching videos an... | -1.0 | 3 |
256 | PacKmaN | 2014-11-05T21:56:39 | check men out i put allot of effort into my mu... | -1.0 | 1 |
196 | Angek95 | 2014-11-03T22:28:56 | Check my channel, please! | -1.0 | 1 |
282 | CronicleFPS | 2014-11-06T03:10:26 | Check me out I'm all about gaming | -1.0 | 1 |
352 | MrJtill0317 | NaN | ┏━━━┓┏┓╋┏┓┏━━━┓┏━━━┓┏┓╋╋┏┓ ┃┏━┓┃┃┃╋┃┃┃┏━┓┃┗┓┏... | -1.0 | 4 |
161 | MarianMusicChannel | 2014-08-24T03:57:52 | Hello! I'm Marian, I'm a singer from Venezuela... | -1.0 | 2 |
270 | Kyle Jaber | 2014-01-19T00:21:29 | Check me out! I'm kyle. I rap so yeah | -1.0 | 1 |
292 | Soundhase | 2014-08-19T18:59:38 | Hi Guys! check this awesome EDM & House mi... | -1.0 | 2 |
179 | Nerdy Peach | 2014-10-29T22:44:41 | Hey! I'm NERDY PEACH and I'm a new youtuber an... | -1.0 | 2 |
16 | zhichao wang | 2013-11-29T02:13:56 | i think about 100 millions of the views come f... | -1.0 | 1 |
Most of these seem like small modifications of "check out", like "check me out" or "check it out". Can we get the best of both worlds?
d) Balance accuracy and coverage¶
Let's see if we can use regular expressions to account for modifications of "check out" and get the coverage of check
plus the accuracy of check_out
.
import re
@labeling_function()
def regex_check_out(x):
    return SPAM if re.search(r"check.*out", x.text, flags=re.I) else ABSTAIN
Again, let's generate our label matrices and see how we do.
lfs = [check_out, check, regex_check_out]
applier = PandasLFApplier(lfs=lfs)
L_train = applier.apply(df=df_train)
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1586/1586 [00:00<00:00, 36046.70it/s]
LFAnalysis(L=L_train, lfs=lfs).lf_summary()
 | j | Polarity | Coverage | Overlaps | Conflicts |
---|---|---|---|---|---|
check_out | 0 | [1] | 0.214376 | 0.214376 | 0.0 |
check | 1 | [1] | 0.257881 | 0.233922 | 0.0 |
regex_check_out | 2 | [1] | 0.233922 | 0.233922 | 0.0 |
We've split the difference in train set coverage; this looks promising! Let's also make sure the extra coverage of check isn't coming from false positives. To understand the coverage difference between check and regex_check_out, we'll take a look at 10 train set data points where check labeled but regex_check_out abstained. Remember: coverage isn't always good, since adding false positives will also increase coverage.
buckets = get_label_buckets(L_train[:, 1], L_train[:, 2])
df_train.iloc[buckets[(SPAM, ABSTAIN)]].sample(10, random_state=1)
 | author | date | text | label | video |
---|---|---|---|---|---|
16 | zhichao wang | 2013-11-29T02:13:56 | i think about 100 millions of the views come f... | -1.0 | 1 |
99 | Santeri Saariokari | 2014-09-03T16:32:59 | Hey guys go to check my video name "growtopia ... | -1.0 | 2 |
21 | BeBe Burkey | 2013-11-28T16:30:13 | and u should.d check my channel and tell me wh... | -1.0 | 1 |
239 | Cony | 2013-11-28T16:01:47 | You should check my channel for Funny VIDEOS!! | -1.0 | 1 |
288 | Kochos | 2014-01-20T17:08:37 | i check back often to help reach 2x10^9 views ... | -1.0 | 1 |
65 | by.Ovskiy | 2014-10-13T17:09:46 | Rap from Belarus, check my channel:) | -1.0 | 2 |
196 | Angek95 | 2014-11-03T22:28:56 | Check my channel, please! | -1.0 | 1 |
333 | FreexGaming | 2014-10-18T08:12:26 | want to win borderlands the pre-sequel? check ... | -1.0 | 2 |
167 | Brandon Pryor | 2014-01-19T00:36:25 | I dont even watch it anymore i just come here ... | -1.0 | 1 |
266 | Zielimeek21 | 2013-11-28T21:49:00 | I'm only checking the views | -1.0 | 1 |
Most of these are SPAM, but a good number are false positives. To keep precision high (while not sacrificing much in terms of coverage), we'd choose our regex-based rule.
e) Writing an LF that uses a third-party model¶
The LF interface is extremely flexible, and can wrap existing models. A common technique is to use a commodity model trained for other tasks that are related to, but not the same as, the one we care about.
For example, the TextBlob tool provides a pretrained sentiment analyzer. Our spam classification task is not the same as sentiment classification, but we may believe that SPAM and HAM comments have different distributions of sentiment scores.
We'll focus on writing LFs for HAM, since we already identified several patterns for SPAM above.
A brief intro to Preprocessors¶
A Snorkel Preprocessor is constructed from a black-box Python function that maps a data point to a new data point. LabelingFunctions can use Preprocessors, which lets us write LFs over transformed or enhanced data points. We add the @preprocessor(...) decorator to preprocessing functions to create Preprocessors. Preprocessors also have extra functionality, such as memoization (i.e. input/output caching, so a preprocessor doesn't re-execute for each LF that uses it).
We'll start by creating a Preprocessor that runs TextBlob on our comments, then extracts the polarity and subjectivity scores.
from snorkel.preprocess import preprocessor
from textblob import TextBlob
@preprocessor(memoize=True)
def textblob_sentiment(x):
    scores = TextBlob(x.text)
    x.polarity = scores.sentiment.polarity
    x.subjectivity = scores.sentiment.subjectivity
    return x
We can now pick a reasonable threshold and write a corresponding labeling function (note that it doesn't have to be perfect, as the LabelModel will soon help us estimate each labeling function's accuracy and reweight its outputs accordingly):
@labeling_function(pre=[textblob_sentiment])
def textblob_polarity(x):
    return HAM if x.polarity > 0.9 else ABSTAIN
Let's do the same for the subjectivity scores. This will run faster than the last cell, since we memoized the Preprocessor outputs.
@labeling_function(pre=[textblob_sentiment])
def textblob_subjectivity(x):
    return HAM if x.subjectivity >= 0.5 else ABSTAIN
Let's apply our LFs so we can analyze their performance.
lfs = [textblob_polarity, textblob_subjectivity]
applier = PandasLFApplier(lfs)
L_train = applier.apply(df_train)
100%|██████████| 1586/1586 [00:00<00:00, 2348.71it/s]
LFAnalysis(L_train, lfs).lf_summary()
 | j | Polarity | Coverage | Overlaps | Conflicts |
---|---|---|---|---|---|
textblob_polarity | 0 | [0] | 0.035309 | 0.013871 | 0.0 |
textblob_subjectivity | 1 | [0] | 0.357503 | 0.013871 | 0.0 |
Again, these LFs aren't perfect. Note that the textblob_subjectivity LF has fairly high coverage and could have a high rate of false positives. We'll rely on Snorkel's LabelModel to estimate the labeling function accuracies and reweight and combine their outputs accordingly.
3. Writing More Labeling Functions¶
If a single LF had high enough coverage to label our entire test dataset accurately, then we wouldn't need a classifier at all. We could just use that single simple heuristic to complete the task. But most problems are not that simple. Instead, we usually need to combine multiple LFs to label our dataset, both to increase the size of the generated training set (since we can't generate training labels for data points that no LF voted on) and to improve the overall accuracy of the training labels we generate by factoring in multiple different signals.
In the following sections, we'll show just a few of the many types of LFs that you could write to generate a training dataset for this problem.
a) Keyword LFs¶
For text applications, some of the simplest LFs to write are often just keyword lookups. These will often follow the same execution pattern, so we can create a template and use the resources parameter to pass in LF-specific keywords. Similar to the labeling_function decorator, the LabelingFunction class wraps a Python function (the f parameter), and we can use the resources parameter to pass in keyword arguments (here, our keywords to look up) to that function.
from snorkel.labeling import LabelingFunction
def keyword_lookup(x, keywords, label):
    if any(word in x.text.lower() for word in keywords):
        return label
    return ABSTAIN

def make_keyword_lf(keywords, label=SPAM):
    return LabelingFunction(
        name=f"keyword_{keywords[0]}",
        f=keyword_lookup,
        resources=dict(keywords=keywords, label=label),
    )
"""Spam comments talk about 'my channel', 'my video', etc."""
keyword_my = make_keyword_lf(keywords=["my"])
"""Spam comments ask users to subscribe to their channels."""
keyword_subscribe = make_keyword_lf(keywords=["subscribe"])
"""Spam comments post links to other channels."""
keyword_link = make_keyword_lf(keywords=["http"])
"""Spam comments make requests rather than commenting."""
keyword_please = make_keyword_lf(keywords=["please", "plz"])
"""Ham comments actually talk about the video's content."""
keyword_song = make_keyword_lf(keywords=["song"], label=HAM)
b) Pattern-matching LFs (regular expressions)¶
If we want a little more control over a keyword search, we can look for regular expressions instead. The LF we developed above (regex_check_out) is an example of this; another sketch follows below.
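For instance, here is an illustrative regex LF of our own (not part of the final LF set) that catches "subscribe ... back" bargains like the example comment we saw earlier ("Subscribe to me and I'll subscribe back!!!"):
import re
from snorkel.labeling import labeling_function

@labeling_function()
def regex_sub_back(x):
    # Catches "sub ... back" patterns, e.g. "subscribe to me and I'll subscribe back"
    return SPAM if re.search(r"sub.*back", x.text, flags=re.I) else ABSTAIN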
c) Heuristic LFs¶
There may be other heuristics or "rules of thumb" that you come up with as you look at the data. So long as you can express it in a function, it's a viable LF!
@labeling_function()
def short_comment(x):
    """Ham comments are often short, such as 'cool video!'"""
    return HAM if len(x.text.split()) < 5 else ABSTAIN
d) LFs with Complex Preprocessors¶
Some LFs rely on fields that aren't present in the raw data, but can be derived from it. We can enrich our data (providing more fields for the LFs to refer to) using Preprocessors.
For example, we can use the fantastic NLP (natural language processing) tool spaCy to add lemmas, part-of-speech (pos) tags, etc. to each token. Snorkel provides a prebuilt preprocessor for spaCy called SpacyPreprocessor, which adds a new field to the data point containing a spaCy Doc object. For more info, see the SpacyPreprocessor documentation.
If you prefer to use a different NLP tool, you can also wrap that as a Preprocessor and use it in the same way. For more info, see the preprocessor documentation.
If the spaCy English model wasn't already installed, the next cell may raise an exception. If this happens, restart the kernel and re-execute the cells up to this point.
from snorkel.preprocess.nlp import SpacyPreprocessor
# The SpacyPreprocessor parses the text in text_field and
# stores the new enriched representation in doc_field
spacy = SpacyPreprocessor(text_field="text", doc_field="doc", memoize=True)
@labeling_function(pre=[spacy])
def has_person(x):
    """Ham comments mention specific people and are short."""
    if len(x.doc) < 20 and any([ent.label_ == "PERSON" for ent in x.doc.ents]):
        return HAM
    else:
        return ABSTAIN
Because spaCy is such a common preprocessor for NLP applications, we also provide a prebuilt labeling_function-like decorator that uses spaCy. The resulting LF is identical to the one defined manually above.
from snorkel.labeling.lf.nlp import nlp_labeling_function
@nlp_labeling_function()
def has_person_nlp(x):
    """Ham comments mention specific people and are short."""
    if len(x.doc) < 20 and any([ent.label_ == "PERSON" for ent in x.doc.ents]):
        return HAM
    else:
        return ABSTAIN
Adding new domain-specific preprocessors and LF types is a great way to contribute to Snorkel! If you have an idea, feel free to reach out to the maintainers or submit a PR!
e) Third-party Model LFs¶
We can also utilize other models, including ones trained for other tasks that are related to, but not the same as, the one we care about. The TextBlob-based LFs we created above are great examples of this!
4. Combining Labeling Function Outputs with the Label Model¶
This tutorial demonstrates just a handful of the types of LFs that one might write for this task.
One of the key goals of Snorkel is not to replace the effort, creativity, and subject matter expertise required to come up with these labeling functions, but rather to make it faster to write them, since in Snorkel the labeling functions are assumed to be noisy, i.e. inaccurate, overlapping, etc.
Said another way: the LF abstraction provides a flexible interface for conveying a huge variety of supervision signals, and the LabelModel is able to denoise these signals, reducing the need for painstaking manual fine-tuning.
lfs = [
keyword_my,
keyword_subscribe,
keyword_link,
keyword_please,
keyword_song,
regex_check_out,
short_comment,
has_person_nlp,
textblob_polarity,
textblob_subjectivity,
]
With our full set of LFs, we can now apply these once again with the PandasLFApplier to get the label matrices. The Pandas format provides an easy interface that many practitioners are familiar with, but it is also less optimized for scale. For larger datasets, more compute-intensive LFs, or larger LF sets, you may decide to use one of the other data formats that Snorkel supports natively, such as Dask DataFrames or PySpark DataFrames, and their corresponding applier objects (a commented sketch follows). For more info, check out the Snorkel API documentation.
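As a hypothetical illustration (assuming Dask is installed; we don't run this here), the swap might look like:
# import dask.dataframe as dd
# from snorkel.labeling.apply.dask import DaskLFApplier
#
# ddf_train = dd.from_pandas(df_train, npartitions=8)
# dask_applier = DaskLFApplier(lfs=lfs)
# L_train = dask_applier.apply(ddf_train)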
applier = PandasLFApplier(lfs=lfs)
L_train = applier.apply(df=df_train)
L_test = applier.apply(df=df_test)
100%|██████████| 1586/1586 [00:14<00:00, 108.12it/s] 100%|██████████| 250/250 [00:03<00:00, 79.80it/s]
LFAnalysis(L=L_train, lfs=lfs).lf_summary()
 | j | Polarity | Coverage | Overlaps | Conflicts |
---|---|---|---|---|---|
keyword_my | 0 | [1] | 0.198613 | 0.186003 | 0.110971 |
keyword_subscribe | 1 | [1] | 0.127364 | 0.107188 | 0.067465 |
keyword_http | 2 | [1] | 0.119168 | 0.101513 | 0.082598 |
keyword_please | 3 | [1] | 0.112232 | 0.109710 | 0.057377 |
keyword_song | 4 | [0] | 0.141866 | 0.111602 | 0.043506 |
regex_check_out | 5 | [1] | 0.233922 | 0.129256 | 0.083859 |
short_comment | 6 | [0] | 0.225725 | 0.144388 | 0.074401 |
has_person_nlp | 7 | [0] | 0.083859 | 0.062421 | 0.027743 |
textblob_polarity | 8 | [0] | 0.035309 | 0.030895 | 0.005044 |
textblob_subjectivity | 9 | [0] | 0.357503 | 0.261665 | 0.160151 |
We see that our labeling functions vary in coverage, how much they overlap/conflict with one another, and almost certainly their accuracies as well. We can view a histogram of how many LF labels the data points in our train set have to get an idea of our total coverage.
import matplotlib.pyplot as plt
%matplotlib inline
def plot_label_frequency(L):
plt.hist((L != ABSTAIN).sum(axis=1), density=True, bins=range(L.shape[1]))
plt.xlabel("Number of labels")
plt.ylabel("Fraction of dataset")
plt.show()
plot_label_frequency(L_train)
We see that over half of our train set data points have 2 or fewer labels from LFs. Fortunately, the labels we do have can be used to train a classifier over the comment text directly, allowing this final machine learning model to generalize beyond what our labeling functions label.
Our goal is now to convert the labels from our LFs into a single noise-aware probabilistic (or confidence-weighted) label per data point. A simple baseline for doing this is to take the majority vote on a per-data-point basis: if more LFs voted SPAM than HAM, label it SPAM (and vice versa). We can test this with the MajorityLabelVoter baseline model.
from snorkel.labeling.model import MajorityLabelVoter
majority_model = MajorityLabelVoter()
preds_train = majority_model.predict(L=L_train)
preds_train
array([ 1, 1, -1, ..., 1, 1, 1])
However, as we can see from the summary statistics of our LFs in the previous section, they have varying properties and should not be treated identically. In addition to having varied accuracies and coverages, LFs may be correlated, resulting in certain signals being overrepresented in a majority-vote-based model. To handle these issues appropriately, we will instead use a more sophisticated Snorkel LabelModel to combine the outputs of the LFs.
This model will ultimately produce a single set of noise-aware training labels, which are probabilistic or confidence-weighted labels. We will then use these labels to train a classifier for our task. For more technical details of this overall approach, see our NeurIPS 2016 and AAAI 2019 papers. For more info on the API, see the LabelModel documentation.
Note that no gold labels are used during the training process. The only information we need is the label matrix, which contains the output of the LFs on our training set. The LabelModel is able to learn weights for the labeling functions using only the label matrix as input. We also specify the cardinality, or number of classes.
from snorkel.labeling.model import LabelModel
label_model = LabelModel(cardinality=2, verbose=True)
label_model.fit(L_train=L_train, n_epochs=500, log_freq=100, seed=123)
INFO:root:Computing O... INFO:root:Estimating \mu... INFO:root:[0 epochs]: TRAIN:[loss=0.164] INFO:root:[100 epochs]: TRAIN:[loss=0.011] INFO:root:[200 epochs]: TRAIN:[loss=0.009] INFO:root:[300 epochs]: TRAIN:[loss=0.008] INFO:root:[400 epochs]: TRAIN:[loss=0.008] 100%|██████████| 500/500 [00:01<00:00, 417.75epoch/s] INFO:root:Finished Training
majority_acc = majority_model.score(L=L_test, Y=Y_test, tie_break_policy="random")[
"accuracy"
]
print(f"{'Majority Vote Accuracy:':<25} {majority_acc * 100:.1f}%")
label_model_acc = label_model.score(L=L_test, Y=Y_test, tie_break_policy="random")[
"accuracy"
]
print(f"{'Label Model Accuracy:':<25} {label_model_acc * 100:.1f}%")
Majority Vote Accuracy: 84.4% Label Model Accuracy: 87.6%
The majority vote model or the more sophisticated LabelModel could in principle be used directly as a classifier if the outputs of our labeling functions were made available at test time. However, these models (i.e. these re-weighted combinations of our labeling functions' votes) will abstain on the data points that our labeling functions don't cover (and additionally, may require slow or unavailable features to execute at test time).
In the next section, we will instead use the outputs of the LabelModel as training labels to train a discriminative classifier which can generalize beyond the labeling function outputs, to see if we can improve performance further. This classifier will also only need the text of the comment to make predictions, making it much more suitable for inference over unseen comments. For more information on the properties of the label model, see the Snorkel documentation.
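To see the abstention behavior concretely, here is a quick check of our own using the objects defined above; with the "abstain" tie-break policy, uncovered (and tied) data points receive ABSTAIN (-1):
import numpy as np

# Fraction of test points on which the LabelModel abstains
preds_test_abstain = label_model.predict(L=L_test, tie_break_policy="abstain")
print(f"Fraction of test points abstained on: {np.mean(preds_test_abstain == ABSTAIN):.3f}")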
Let's briefly confirm that the labels the LabelModel produces are indeed probabilistic in nature. The following histogram shows the confidences we have that each data point has the label SPAM. The points we are least certain about will have labels close to 0.5.
def plot_probabilities_histogram(Y):
    plt.hist(Y, bins=10)
    plt.xlabel("Probability of SPAM")
    plt.ylabel("Number of data points")
    plt.show()
probs_train = label_model.predict_proba(L=L_train)
plot_probabilities_histogram(probs_train[:, SPAM])
Filtering out unlabeled data points¶
As we saw earlier, some of the data points in our train
set received no labels from any of our LFs.
These data points convey no supervision signal and tend to hurt performance, so we filter them out before training using a
built-in utility.
from snorkel.labeling import filter_unlabeled_dataframe
df_train_filtered, probs_train_filtered = filter_unlabeled_dataframe(
    X=df_train, y=probs_train, L=L_train
)
5. Training a Classifier¶
In this final section of the tutorial, we'll use the probabilistic training labels we generated in the last section to train a classifier for our task.
The output of the Snorkel LabelModel is just a set of labels, which can be used with most popular libraries for performing supervised learning, such as TensorFlow, Keras, PyTorch, Scikit-Learn, Ludwig, and XGBoost. In this tutorial, we use the well-known library Scikit-Learn.
Note that typically, Snorkel is used (and really shines!) with much more complex, training data-hungry models, but we will use Logistic Regression here for simplicity of exposition.
Featurization¶
Note that an ML model requires numerical features, so we must first convert the raw comment text into a numerical representation. For simplicity and speed, we use a simple "bag of n-grams" feature representation: each data point is represented by a sparse vector marking which words and short phrases (n-grams up to length 5, matching the vectorizer below) are present in the comment text.
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer(ngram_range=(1, 5))
X_train = vectorizer.fit_transform(df_train_filtered.text.tolist())
X_test = vectorizer.transform(df_test.text.tolist())
Scikit-Learn Classifier¶
As we saw in Section 4, the LabelModel outputs probabilistic (float) labels. If the classifier we are training accepts target labels as floats, we can train on these labels directly (we describe the properties of this type of "noise-aware" loss in our NeurIPS 2016 paper).
If we want to use a library or model that doesn't accept probabilistic labels (such as Scikit-Learn), we can instead replace each label distribution with the label of the class that has the maximum probability. This can easily be done using the probs_to_preds helper method. We do note, however, that this transformation is lossy, as we no longer have values for our confidence in each label.
from snorkel.utils import probs_to_preds
preds_train_filtered = probs_to_preds(probs=probs_train_filtered)
We then use these labels to train a classifier as usual.
probs_train_filtered
array([[6.57400115e-01, 3.42599885e-01], [2.25181050e-06, 9.99997748e-01], [8.90998467e-01, 1.09001533e-01], ..., [2.27481865e-07, 9.99999773e-01], [1.08918677e-03, 9.98910813e-01], [1.22570299e-08, 9.99999988e-01]])
from sklearn.linear_model import LogisticRegression
sklearn_model = LogisticRegression(C=1e3, solver="liblinear")
sklearn_model.fit(X=X_train, y=preds_train_filtered)
LogisticRegression(C=1000.0, solver='liblinear')
print(f"Test Accuracy: {sklearn_model.score(X=X_test, y=Y_test) * 100:.1f}%")
Test Accuracy: 93.6%
We observe an additional boost in accuracy over the LabelModel by multiple points! This is in part because the discriminative model generalizes beyond the labeling functions' labels and makes good predictions on all data points, not just the ones covered by labeling functions.
By using the label model to transfer the domain knowledge encoded in our LFs to the discriminative model,
we were able to generalize beyond the noisy labeling heuristics.
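As an aside, some of the confidence information discarded by probs_to_preds can still be exploited by libraries that accept per-example weights. A minimal sketch of this variant (our own, not part of the tutorial), weighting each hard label by the LabelModel's confidence in it:
from sklearn.linear_model import LogisticRegression

# Confidence of the most likely class for each filtered training point
confidences = probs_train_filtered.max(axis=1)
# Same model class as above, but each example is weighted by its label confidence
weighted_model = LogisticRegression(C=1e3, solver="liblinear")
weighted_model.fit(X=X_train, y=preds_train_filtered, sample_weight=confidences)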
Summary¶
In this tutorial, we accomplished the following:
- We introduced the concept of Labeling Functions (LFs) and demonstrated some of the forms they can take.
- We used the Snorkel LabelModel to automatically learn how to combine the outputs of our LFs into strong probabilistic labels.
- We showed that a classifier trained on a weakly supervised dataset can outperform an approach based on the LFs alone, as it learns to generalize beyond the noisy heuristics we provide.
Next Steps¶
If you enjoyed this tutorial and you've already checked out the Getting Started tutorial, check out the Tutorials page for other tutorials that you may find interesting, including demonstrations of how to use Snorkel
- As part of a hybrid crowdsourcing pipeline
- For visual relationship detection over images
- For information extraction over text
- For data augmentation
and more! You can also visit the Snorkel website or Snorkel API documentation for more info!