Detecting novelty and redundancy in text
Pythia is Lab41's exploration of approaches to novel content detection. It adopts a supervised machine learning approach to the problem, and provides an interface for processing data, training classification systems, and evaluating their performance.
Usage and Runtime Options
A simple experiment can be kicked off with experiments/
The basic syntax pairs the keyword 'with' with a number of valid Python expressions setting the various run parameters:
experiments/ with OPTION=value OPTION_2=value_2 OPTION_3='value with spaces'
A stock logistic regression on binary bag-of-words features would be specified like this:
experiments/ with BOW_APPEND=True LOG_REG=True
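As a rough illustration of how such OPTION=value pairs map onto Python values, here is a minimal sketch. It is not the parser Pythia's experiment runner uses; `parse_options` is a hypothetical helper, and the fallback to plain strings for unparseable values is an assumption made for this example.

```python
import ast
import shlex


def parse_options(option_string):
    """Parse 'NAME=value' pairs into a dict of Python values (illustrative only)."""
    options = {}
    # shlex handles shell-style quoting, so 'value with spaces' stays one token.
    for token in shlex.split(option_string):
        name, _, raw_value = token.partition("=")
        try:
            # Values such as True, 1e-4 or 300 are read as Python literals.
            options[name] = ast.literal_eval(raw_value)
        except (ValueError, SyntaxError):
            # Assumption: anything else (a path, free text) is kept as a string.
            options[name] = raw_value
    return options


opts = parse_options("BOW_APPEND=True LOG_REG=True OPTION_3='value with spaces'")
print(opts)
```

Note that the booleans arrive as real Python `True` values rather than strings, which is why the options are described as Python expressions.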
Runtime parameters are all documented below.
corpus
The whole collection of documents in a given experiment.
cluster
One of a number of small, potentially overlapping groups of mutually relevant documents. Pythia assumes documents have already been grouped with topically similar neighbors via some mechanism. Clusters should also have some kind of linear ordering, either in time of publication, time of ingest, or whatever is relevant to the data.
query
A document of interest, which a classification system should decide is novel or redundant.
background
The documents against which a query should be compared. In Pythia, these are the members of the query's cluster that come before it (hence the need for linear order).
observation, case
A query, its background documents, and a novelty label (novel or redundant).
The pipeline
Raw data > Processed Data > Featurization > Training > Testing/Evaluation
Data processing
The user (you!) is responsible for converting data into the form Pythia consumes. src/data/ has scripts for (acquiring and) processing two sources of data. The processed data should be a folder of .json files, each one with data on a cluster of documents. Each line of the cluster file is a JSON object whose form is documented at data/cluster_schema.json, and an example cluster can be found at data/SKNews.json.
A full sample corpus is contained at data/stackexchange/anime
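A folder laid out this way (one .json file per cluster, one JSON object per line) can be read with a few lines of standard-library Python. This is only a sketch: `load_clusters` is a hypothetical helper, not part of Pythia, and it makes no assumptions about the fields inside each object, which are defined by the schema file.

```python
import json
from pathlib import Path


def load_clusters(directory):
    """Return {file name: list of parsed JSON objects}, one entry per cluster file."""
    clusters = {}
    for path in sorted(Path(directory).glob("*.json")):
        with path.open(encoding="utf-8") as handle:
            # Each non-empty line of a cluster file is parsed as one JSON object.
            clusters[path.name] = [json.loads(line) for line in handle if line.strip()]
    return clusters
```

For example, `load_clusters('data/stackexchange/anime')` would return one list of objects per cluster file in the sample corpus.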
Once you have supplied the data, Pythia generates observations pairing query documents with background documents.
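The pairing of each query with the documents that precede it in its cluster can be sketched as follows. The `Observation` class, the `(text, is_novel)` input format, and the choice to skip a document with no predecessors are all assumptions made for illustration; Pythia's own data structures may differ.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Observation:
    query: str
    background: List[str]
    novel: bool


def make_observations(cluster):
    """Build observations from a cluster given as ordered (text, is_novel) pairs."""
    observations = []
    for position, (text, is_novel) in enumerate(cluster):
        # Background = every document that comes before the query in the cluster.
        background = [doc for doc, _ in cluster[:position]]
        if not background:
            # Assumption: with nothing to compare against, this sketch skips the document.
            continue
        observations.append(Observation(text, background, is_novel))
    return observations


cluster = [("first post", True), ("a repost", False), ("a new angle", True)]
obs = make_observations(cluster)
print(obs[0])
```

The linear ordering of the cluster matters here: reordering the input changes which documents count as background for each query.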
Numerous methods for converting observations into feature vectors are available for experimentation. They are documented below. All Pythia experiments must use at least one featurization method.
Training, testing, evaluation
Pythia does training, testing, and evaluation all in one fell swoop, since it is mostly an experimentation platform. Available learning methods are documented below. You must choose exactly one classifier for each experiment.
Usage and runtime options
Data location
directory ('data/stackexchange/anime')
Path to a folder with .json
files describing the clusters in a corpus.
Featurization techniques
Bag of words
Bag-of-words features can be generated for query and background documents. Aggregating the vectors for query document and background documents can be done by concatenating them, subtracting one from the other, or other operations. A temporal score is also available in this family of settings.
Bag-of-words vectors will automatically be used if any of the following aggregation parameters is set to True:
Calculate bag-of-words vectors for query document, background documents. Concatenate query document and sum of vectors for background documents.
Use difference vector, i.e. bow(query) - bow(background)
Take the product of bag-of-words vectors
BOW_COS (False)
Take the cosine similarity of query and background vectors
Take the temporal TF-IDF score for the cluster
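The aggregation options above (append, difference, product, cosine) can be illustrated on toy vectors. This is a minimal sketch in plain Python, not Pythia's implementation: the helper names, the fixed three-word vocabulary, and the use of raw counts rather than binary indicators are assumptions. The temporal TF-IDF score is not shown.

```python
import math
from collections import Counter


def bow_vector(text, vocabulary):
    """Count how often each vocabulary word occurs in the text."""
    counts = Counter(text.lower().split())
    return [counts[word] for word in vocabulary]


def cosine(u, v):
    """Cosine similarity, defined as 0.0 when either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


vocabulary = ["cat", "dog", "sat"]
query = bow_vector("the cat sat", vocabulary)
background_docs = ["the dog sat", "a dog and a cat"]
# Sum the vectors of all background documents into a single background vector.
background = [sum(col) for col in zip(*(bow_vector(d, vocabulary) for d in background_docs))]

appended = query + background                              # concatenation
difference = [q - b for q, b in zip(query, background)]    # bow(query) - bow(background)
product = [q * b for q, b in zip(query, background)]       # elementwise product
similarity = cosine(query, background)                     # single scalar feature
```

Concatenation doubles the feature dimension, difference and product keep it unchanged, and cosine similarity collapses the pair to one number.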
Skip-thought vectors
Skip-thought vectors are a method for representing the structure and content of sentences in a fixed-size, dense vector. They can be concatenated, subtracted, multiplied elementwise, or compared with the cosine distance.
Description of method:
Basis of implementation:
See notes on Bag-of-Words for all of the below options for aggregating skip-thought feature vectors:
Concatenate vectors
Difference of vectors
Product of vectors
ST_COS (False)
Latent Dirichlet Allocation
Latent Dirichlet Allocation (LDA) is a widely-used method for representing the topic distribution in a corpus of documents. Given a corpus of documents with unique words, LDA yields two matrices, one representing the 'weight' of each document across the possible topics, and one describing which unique words are associated with each topic.
LDA posits that co-occurring words within documents are 'generated' by the hidden topic variables with document-specific frequencies, so it is a good way of expressing the assumptions that a) some words naturally co-occur with each other and b) which co-occurrence patterns are relevant for a given document depends on what the document is talking about.
Original paper:
Library used: scikit-learn
The number of possible topics represented in the corpus. Higher is often better as LDA should learn not to use unnecessary topics; in practice some tuning is usually necessary to find a happy medium.
LDA_COS (False)
Dynamic Memory Networks
Dynamic memory networks (DMN) are a 2016 deep learning architecture designed for automatic question answering. They take in natural language representations of background knowledge and natural language representations of a query and learn a decoding function to provide an answer. In our adaptation of DMN, background documents are fed into the background knowledge module, the query document is used as the query, and the possible responses are True (novel) or False (redundant).
Original manuscript:
Basis implementation:
MEM_NET (False)
Set to True to use DMN algorithm
Vocabulary size for encoding functions. Ideally, this would be rather large, but memory and processing constraints may force the use of unnaturally small values.
MEM_TYPE ('dmn_basic')
Architectural variant to use. 'dmn_basic' is the only supported value at present.
Minibatch size for gradient-based training of DMN. Currently values other than 1 are unsupported.
Number of training epochs to conduct.
MEM_MASK_MODE ('word')
Accepts values 'word' and 'sentence'. Tells DMN which unit to treat as an 'episode' to encode in memory.
MEM_EMBED_MODE ('word2vec')
is 'word_onehot', set the minimum length of a one-hot-encoded document
Maximum length of a one-hot-encoded document
word2vec is a popular algorithm for learning 'distributed' vector representations of words. In practice, word2vec learns to represent each unique word in a corpus as a 50- to 300-dimensional vector of real numbers, providing plenty of room to account for semantic and syntactic similarities and differences between words.
In Pythia, documents are represented with word2vec by looking up the vectors for the individual words in the input documents and aggregating these vectors in several ways. Word vectors are extracted from the first and last sentences of input documents and then combined using averaging, concatenation, elementwise max, elementwise min, or absolute elementwise max. Once query and background vectors have been generated, they are combined using any of the customary aggregation techniques (see bag of words for discussion).
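The word-vector combination methods named above can be sketched on toy data. This is not Pythia's code: `aggregate` is a hypothetical helper, the two-dimensional vectors are made up, and 'absolute elementwise max' is interpreted here as keeping, per dimension, the value with the largest magnitude (an assumption, since the text does not define it).

```python
def aggregate(vectors, method):
    """Combine a list of equal-length word vectors into one vector."""
    columns = list(zip(*vectors))  # one tuple of values per dimension
    if method == "average":
        return [sum(col) / len(col) for col in columns]
    if method == "max":
        return [max(col) for col in columns]
    if method == "min":
        return [min(col) for col in columns]
    if method == "absmax":
        # Assumption: keep the signed value with the largest absolute value.
        return [max(col, key=abs) for col in columns]
    if method == "concat":
        return [x for vec in vectors for x in vec]
    raise ValueError(f"unknown method: {method}")


# Toy 2-dimensional vectors standing in for a trained word2vec model.
word_vectors = {"cats": [0.5, -1.0], "purr": [-2.0, 0.25]}
sentence = [word_vectors[w] for w in "cats purr".split()]

print(aggregate(sentence, "average"))
```

All methods except concatenation return a vector of the same dimensionality as the word vectors, regardless of sentence length; concatenation grows with the number of words.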
Original paper:
Implementation used: gensim
Aggregating query and background vectors:
W2V_APPEND (False)
W2V_COS (False)
Other parameters:
Use pretrained model? This should be available in the directory described by the
PYTHIA_MODELS_PATH environment variable. Currently only the 300-dimensional
Google News model is supported, so this should have the file name GoogleNews-vectors-negative300.bin
is False, Pythia will train a word2vec model based on your corpus (not recommended for small collections).
The following parameters control word2vec training:
Minimum number of times a unique word must appear in corpus to be given a vector representation
Window size, in words, to the left or right of the word being trained.
W2V_SIZE (100)
Dimensionality of trained word vectors.
If >1, number of cores to do parallel training on. Parallel-trained word2vec models will converge much more quickly but training behavior is non-deterministic and not strictly replicable.
One-hot CNN activation features
The one-hot CNN will use the full_vocab parameters
CNN_COS (False)
Word-level one-hot encoding
Not currently used.
Use word-level one-hot encoding?
Classification Algorithms
If traditional (non-DMN) featurization techniques are chosen, a classifier must also be selected. Pythia supports batch logistic regression, batch SVM, and the popular boosting algorithm XGBoost, as well as SGD-based (minibatch) logistic regression and linear SVM, which may have favorable memory performance for very large corpora.
Logistic Regression
Tried-and-true statistical classification technique. Learns a linear combination of input features and applies a nonlinear transform to output a hypothesis between 0.0 and 1.0, with values equal to or above 0.5 typically taken as true and the rest as false.
Pythia uses scikit-learn to do logistic regression.
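The decision rule described above (a linear combination of features passed through a nonlinear transform, then thresholded at 0.5) can be written out directly. This sketch only shows prediction with given weights; it is not how scikit-learn fits or stores the model, and the function names are illustrative.

```python
import math


def predict_proba(weights, bias, features):
    """Logistic (sigmoid) transform of a linear combination of the features."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))


def predict(weights, bias, features, threshold=0.5):
    """True when the hypothesis is at or above the threshold, otherwise False."""
    return predict_proba(weights, bias, features) >= threshold


print(predict_proba([1.0, -1.0], 0.0, [0.0, 2.0]))
```

Training consists of choosing the weights and bias; the regularization and tolerance parameters documented in this section control that fitting step.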
LOG_REG (False)
Set to True to use logistic regression
Form of regularization penalty to use (see scikit-learn docs)
LOG_TOL (1e-4)
Convergence criterion during model fitting
LOG_C (1e-4)
Inverse regularization strength
Support Vector Machine
Nonparametric classifier. Can take a prohibitively long time to converge for large datasets.
Also uses scikit-learn's implementation.
SVM (False)
Set to True to use SVM
SVM_C (2000)
Inverse regularization strength
SVM_KERNEL ('linear')
Kernel function to use. Can be any predefined setting accepted by sklearn.svm.SVC
SVM_GAMMA ('auto')
If kernel is 'poly', 'rbf', or 'sigmoid', the kernel coefficient.
Boosted decision tree algorithm. Fast and performant, but may not scale to much larger datasets.
Original manuscript:
Implementation used:
XGB (False)
Set to True to use XGBoost
"Learning rate" (see documentation)
Maximum depth of tree
Typically, minimum number of children allowed in any child node
Proportion (0 to 1) of features to sample when building a tree
Other run parameters
Resample observations by label to achieve a 1-to-1 ratio of positive to negative observations
Resample so that total number of observations per class is equal to the largest class. Implies REPLACEMENT=True
When doing resampling, choose observations with replacement from original samples.
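Oversampling to the size of the largest class can be sketched as below. This is one possible reading of the option, not Pythia's implementation: `oversample` is a hypothetical helper, and keeping every original observation while drawing only the extra ones with replacement is an assumption.

```python
import random


def oversample(observations, labels, seed=41):
    """Grow every class to the size of the largest one by sampling with replacement."""
    rng = random.Random(seed)
    by_label = {}
    for obs, label in zip(observations, labels):
        by_label.setdefault(label, []).append(obs)
    target = max(len(group) for group in by_label.values())
    balanced = []
    for label, group in by_label.items():
        # Keep the originals, then draw the shortfall with replacement.
        extras = [rng.choice(group) for _ in range(target - len(group))]
        balanced.extend((obs, label) for obs in group + extras)
    rng.shuffle(balanced)
    return balanced


data = oversample(["a", "b", "c", "d"], [True, True, True, False])
print(data)
```

Because the minority class has fewer distinct observations than the target size, duplicates are unavoidable, which is why oversampling implies sampling with replacement.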
Save training data for grid search
When using experiments/, use these variables to allow GridSearchCV to cooperate with the Pythia pipeline.
EXPERIMENTDATAFILE ('data/experimentdatafile.pkl')
Two separate vocabularies are computed. One, a reduced vocabulary, excludes stop words and punctuation tokens, and the other, a full vocabulary, retains them. FULL_VOCAB can also be set to use character-level tokenization instead of word-level tokenization.
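The difference between the two vocabularies, and between word-level and character-level tokenization, can be sketched as follows. The tokenizer, the tiny stop-word list, and the helper names are assumptions for illustration only; Pythia's preprocessing may tokenize and rank tokens differently.

```python
import re
import string
from collections import Counter

STOP_WORDS = {"the", "a", "and", "of"}  # tiny illustrative list, not Pythia's


def tokenize(text):
    """Split into lowercase word tokens and single punctuation tokens."""
    return re.findall(r"\w+|[^\w\s]", text.lower())


def build_vocabularies(documents, reduced_size):
    """Return (full, reduced) vocabularies, most frequent tokens first."""
    counts = Counter(token for doc in documents for token in tokenize(doc))
    full = [token for token, _ in counts.most_common()]
    # The reduced vocabulary drops stop words and punctuation, then is truncated.
    reduced = [t for t in full if t not in STOP_WORDS and t not in string.punctuation]
    return full, reduced[:reduced_size]


def char_tokenize(text, alphabet):
    """Character-level tokenization restricted to a fixed alphabet."""
    return [ch for ch in text.lower() if ch in alphabet]


full, reduced = build_vocabularies(["The cat sat, and the dog sat."], reduced_size=3)
print(full, reduced)
```

With character-level tokenization the vocabulary is simply the fixed alphabet, so its size does not grow with the corpus.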
VOCAB_SIZE (10000)
Size of reduced vocabulary
STEM (False)
Conduct stemming?
Number of unique tokens in word-level full vocabulary.
FULL_VOCAB_TYPE ('character')
Either 'word' or 'character'. Determines tokenization strategy for full vocabulary.
FULL_CHAR_VOCAB ("abcdefghijklmnopqrstuvwxyz0123456789,;.!?:'\"/|_@#$%^&*~`+-=<>()[]{}")
SEED (41)
Random number generator seed value.
Cache preprocessed JSON documents in ./.cache; this can reduce experiment time significantly for large corpora.