Pythia

Detecting novelty and redundancy in text

Pythia is Lab41's exploration of approaches to novel content detection. It adopts a supervised machine learning approach to the problem, and provides an interface for processing data, training classification systems, and evaluating their performance.

Usage and Runtime Options

A simple experiment can be kicked off with experiments/experiments.py. The basic syntax follows the keyword 'with' with a number of valid Python expressions that set the various run parameters:

experiments/experiments.py with OPTION=value OPTION_2=value_2 OPTION_3='value with spaces'

A stock logistic regression on binary bag-of-words features would be specified like this:

experiments/experiments.py with BOW_APPEND=True LOG_REG=True

Runtime parameters are all documented below.

Terminology

corpus

The whole collection of documents in a given experiment.

cluster

One of a number of small, potentially overlapping groups of mutually relevant documents. Pythia assumes documents have already been grouped with topically similar neighbors via some mechanism. Clusters should also have some kind of linear ordering, whether by time of publication, time of ingest, or whatever ordering is relevant to the data.

query

A document of interest, which a classification system should decide is novel or redundant.

background

The documents against which a query should be compared. In Pythia, these are the members of the query's cluster that come before it (hence the need for linear order).

observation, case

A query, its background documents, and a novelty label (novel or redundant).

The pipeline

Raw data > Processed data > Featurization > Training > Testing/Evaluation

Data processing

The user (you!) is responsible for converting data into the form Pythia consumes. src/data/ has scripts for (acquiring and) processing two sources of data. The processed data should be a folder of .json files, each one with data on a cluster of documents. Each line of a cluster file is a JSON object whose form is documented at data/cluster_schema.json, and an example cluster can be found at data/SKNews.json.

A full sample corpus is provided at data/stackexchange/anime.
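
As a quick sanity check (not part of Pythia itself), a minimal sketch like the following walks a corpus directory and counts the documents in each cluster file. It assumes only the one-JSON-object-per-line layout described above and leaves the field names to data/cluster_schema.json.

import json
from pathlib import Path

# Walk a corpus directory and count the documents in each cluster file.
# The fields of each JSON object are defined by data/cluster_schema.json;
# only the one-object-per-line layout described above is assumed here.
corpus_dir = Path("data/stackexchange/anime")

for cluster_file in sorted(corpus_dir.glob("*.json")):
    with cluster_file.open() as f:
        documents = [json.loads(line) for line in f if line.strip()]
    print(cluster_file.name, "-", len(documents), "documents")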

Once you have supplied the data, Pythia generates observations pairing query documents with background documents.

Featurization

Numerous methods for converting observations into feature vectors are available for experimentation. They are documented below. All Pythia experiments must use at least one featurization method.

Training, testing, evaluation

Pythia does training, testing, and evaluation all in one fell swoop, since it is mostly an experimentation platform. Available learning methods are documented below. You must choose exactly one classifier for each experiment.

Runtime options

Data location

directory ('data/stackexchange/anime')

Path to a folder with .json files describing the clusters in a corpus.

Featurization techniques

Bag of words

Bag-of-words features can be generated for query and background documents. Aggregating the vectors for the query document and background documents can be done by concatenating them, subtracting one from the other, or other operations, as sketched after the option list below. A temporal score is also available in this family of settings.

Bag-of-words vectors will automatically be used if any of the following aggregation parameters is set to True:

BOW_APPEND (False)

Calculate bag-of-words vectors for the query document and the background documents, then concatenate the query vector with the sum of the background vectors.

BOW_DIFFERENCE (False)

Use difference vector, i.e. bow(query) - bow(background)

BOW_PRODUCT (False)

Take the product of bag-of-words vectors

BOW_COS (False)

Take the cosine similarity of query and background vectors

BOW_TFIDF (False)

Take the temporal TF-IDF score for the cluster
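
For illustration only (not Pythia's internal code), the sketch below computes the first four aggregations on a toy query and background using binary bag-of-words vectors from scikit-learn.

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# Toy illustration of the aggregation options above.
query = "a new episode aired this week"
background = ["the first episode aired last year", "fans discussed the episode"]

vectorizer = CountVectorizer(binary=True)
vectorizer.fit([query] + background)

q = vectorizer.transform([query]).toarray()[0].astype(float)
b = vectorizer.transform(background).toarray().sum(axis=0).astype(float)

appended = np.concatenate([q, b])                                 # BOW_APPEND
difference = q - b                                                # BOW_DIFFERENCE
product = q * b                                                   # BOW_PRODUCT
cosine = q @ b / (np.linalg.norm(q) * np.linalg.norm(b) + 1e-12)  # BOW_COS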

Skip-thought vectors

Skip-thought vectors are a method for representing the structure and content of sentences in a fixed-size, dense vector. They can be concatenated, subtracted, multiplied elementwise, or compared with the cosine distance.

Description of method: https://arxiv.org/abs/1506.06726
Basis of implementation: https://github.com/ryankiros/skip-thoughts

The options below aggregate skip-thought feature vectors in the same ways described for bag-of-words:

ST_APPEND (False)

Concatenate vectors

ST_DIFFERENCE (False)

Difference of vectors

ST_PRODUCT (False)

Product of vectors

ST_COS (False)

Cosine similarity of vectors

Latent Dirichlet Allocation

Latent Dirichlet Allocation (LDA) is a widely-used method for representing the topic distribution in a corpus of documents. Given a corpus of documents with unique words, LDA yields two matrices, one representing the 'weight' of each document across the possible topics, and one describing which unique words are associated with each topic.

LDA posits that co-occurring words within documents are 'generated' by hidden topic variables with document-specific frequencies, so it is a good way of expressing the assumptions that a) some words naturally co-occur with each other and b) the co-occurrence patterns that appear in a given document depend on what the document is talking about.

Original paper: http://www.jmlr.org/papers/volume3/blei03a/blei03a.pdf
Library used: scikit-learn
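
As an illustration of the two matrices (not Pythia's internal code), the sketch below runs scikit-learn's LatentDirichletAllocation on a toy corpus.

from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Toy illustration of the two matrices LDA yields.
docs = ["cats chase mice", "dogs chase cats",
        "stocks rose sharply", "markets and stocks fell"]
counts = CountVectorizer().fit_transform(docs)

lda = LatentDirichletAllocation(n_components=2, random_state=41)  # cf. LDA_TOPICS
doc_topics = lda.fit_transform(counts)  # document-by-topic 'weight' matrix
topic_words = lda.components_           # topic-by-word association matrix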

LDA_TOPICS (50)

The number of possible topics represented in the corpus. Higher is often better as LDA should learn not to use unnecessary topics; in practice some tuning is usually necessary to find a happy medium.

These aggregation options behave as described for bag-of-words:

LDA_APPEND (False)

LDA_DIFFERENCE (False)

LDA_PRODUCT (False)

LDA_COS (False)

Dynamic Memory Networks

Dynamic memory networks (DMN) are a 2016 deep learning architecture designed for automatic question answering. They take in natural language representations of background knowledge and of a query, and learn a decoding function to provide an answer. In our adaptation of DMN, background documents are fed into the background knowledge module, the query document is used as the query, and the possible responses are True (novel) or False (redundant).

Original manuscript: http://arxiv.org/abs/1506.07285
Basis of implementation: https://github.com/YerevaNN/Dynamic-memory-networks-in-Theano

MEM_NET (False)

Set to True to use DMN algorithm

MEM_VOCAB (50)

Vocabulary size for encoding functions. Ideally, this would be rather large, but memory and processing constraints may force the use of unnaturally small values.

MEM_TYPE ('dmn_basic')

Architectural variant to use. 'dmn_basic' is the only supported value at present.

MEM_BATCH (1)

Minibatch size for gradient-based training of DMN. Currently values other than 1 are unsupported.

MEM_EPOCHS (5)

Number of training epochs to conduct.

MEM_MASK_MODE ('word')

Accepts values 'word' and 'sentence'. Tells DMN which unit to treat as an 'episode' to encode in memory.

MEM_EMBED_MODE ('word2vec')

Embedding method used to encode input text for the DMN.

MEM_ONEHOT_MIN_LEN (140)

If one-hot encoding is selected (see MEM_EMBED_MODE), the minimum length of a one-hot-encoded document

MEM_ONEHOT_MAX_LEN (1000)

Maximum length of a one-hot-encoded document

word2vec

word2vec is a popular algorithm for learning 'distributed' vector representations of words. In practice, word2vec learns to represent each unique word in a corpus as a 50- to 300-dimensional vector of real numbers, providing plenty of room to account for semantic and syntactic similarities and differences between words.

In Pythia, documents are represented with word2vec by looking up vectors for the individual words in the input documents and aggregating those vectors in creative ways. Word vectors are extracted from the first and last sentences of input documents and then combined using averaging, concatenation, elementwise max, elementwise min, or absolute elementwise max. Once query and background vectors have been generated, they are combined using any of the customary aggregation techniques (see bag-of-words for discussion).

Original paper: https://arxiv.org/abs/1301.3781
Implementation used: gensim
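
As an illustration (not Pythia's internal code), the sketch below trains a toy gensim model and averages word vectors into a document vector; the parameter names assume gensim 4 (older releases name vector_size simply 'size').

import numpy as np
from gensim.models import Word2Vec

# Toy illustration of document vectors built from word2vec. vector_size, window,
# min_count and workers correspond to W2V_SIZE, W2V_WINDOW, W2V_MIN_COUNT and
# W2V_WORKERS.
sentences = [["the", "new", "episode", "aired"],
             ["fans", "discussed", "the", "episode"]]
model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, workers=3)

def doc_vector(tokens):
    # Average the vectors of in-vocabulary words (one of the combinations above).
    vecs = [model.wv[t] for t in tokens if t in model.wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(model.vector_size)

query_vec = doc_vector(["the", "new", "episode", "aired"])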

Aggregating query and background vectors:

W2V_APPEND (False)

W2V_DIFFERENCE (False)

W2V_PRODUCT (False)

W2V_COS (False)

Other parameters:

W2V_PRETRAINED (False)

Use a pretrained model? The model should be available in the directory given by the PYTHIA_MODELS_PATH environment variable. Currently only the 300-dimensional Google News model is supported, so the file should be named GoogleNews-vectors-negative300.bin or GoogleNews-vectors-negative300.bin.gz.

If W2V_PRETRAINED is False, Pythia will train a word2vec model based on your corpus (not recommended for small collections). The following parameters control word2vec training:

W2V_MIN_COUNT (5)

Minimum number of times a unique word must appear in corpus to be given a vector representation

W2V_WINDOW (5)

Window size, in words, to the left or right of the word being trained.

W2V_SIZE (100)

Dimensionality of trained word vectors.

W2V_WORKERS (3)

If >1, number of cores to do parallel training on. Parallel-trained word2vec models will converge much more quickly but training behavior is non-deterministic and not strictly replicable.

One-hot CNN activation features

The one-hot CNN uses the full-vocabulary parameters (FULL_VOCAB_SIZE, FULL_VOCAB_TYPE, FULL_CHAR_VOCAB) documented under Vocabulary below.

CNN_APPEND (False)

CNN_DIFFERENCE (False)

CNN_PRODUCT (False)

CNN_COS (False)

Word-level one-hot encoding

Not currently used.

WORDONEHOT (False)

Use word-level one-hot encoding?

WORDONEHOT_VOCAB (5000)

Vocabulary size for word-level one-hot encoding

Classification Algorithms

If traditional (non-DMN) featurization techniques are chosen, a classifier must also be selected. Pythia supports batch logistic regression, batch SVM, and the popular boosting algorithm XGBoost, as well as SGD-based (minibatch) logistic regression and linear SVM, which may have favorable memory performance for very large corpora.

Logistic Regression

Tried-and-true statistical classification technique. Learns a linear combination of input features and applies a nonlinear transform to output a hypothesis between 0.0 and 1.0, with values equal to or above 0.5 typically taken as true and the rest as false.

Pythia uses scikit-learn to do logistic regression. A rough sketch of how the options below map onto scikit-learn follows the option list.

LOG_REG (False)

Set to True to use logistic regression

LOG_PENALTY ('l2')

Form of regularization penalty to use (see scikit-learn docs)

LOG_TOL (1e-4)

Convergence criterion during model fitting

LOG_C (1e-4)

Inverse regularization strength
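
A rough sketch, assuming a direct mapping of LOG_PENALTY, LOG_TOL, and LOG_C onto scikit-learn's LogisticRegression parameters; the toy feature vectors stand in for featurized observations from the pipeline above.

import numpy as np
from sklearn.linear_model import LogisticRegression

# Stand-in features and labels; in Pythia these come from the featurization step.
rng = np.random.RandomState(41)
X = rng.rand(20, 5)
y = rng.randint(0, 2, size=20)

clf = LogisticRegression(penalty="l2", tol=1e-4, C=1e-4)  # LOG_PENALTY, LOG_TOL, LOG_C
clf.fit(X, y)
print(clf.predict(X[:3]))  # 1 = novel, 0 = redundant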

Support Vector Machine

Nonparametric classifier. Can take a prohibitively long time to converge for large datasets.

Also uses scikit-learn's implementation.

SVM (False)

Set to True to use SVM

SVM_C (2000)

Inverse regularization strength

SVM_KERNEL ('linear')

Kernel function to use. Can be any predefined setting accepted by sklearn.svm.SVC

SVM_GAMMA ('auto')

If kernel is 'poly', 'rbf', or 'sigmoid', the kernel coefficient.

XGBoost

Boosted decision tree algorithm. Fast and performant, but may not scale to much larger datasets. A rough sketch of the option-to-parameter mapping follows the options below.

Original manuscript: http://arxiv.org/abs/1603.02754
Implementation used: https://github.com/dmlc/xgboost

XGB (False)

Set to True to use XGBoost

XGB_LEARNRATE (0.1)

"Learning rate" (see documentation)

XGB_MAXDEPTH (3)

Maximum depth of tree

XGB_MINCHILDWEIGHT (1)

Minimum sum of instance weights required in a child node (see XGBoost documentation)

XGB_COLSAMPLEBYTREE (1)

Proportion (0 to 1) of features to sample when building a tree
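
A rough sketch, assuming a direct mapping of the XGB_* options onto the xgboost scikit-learn wrapper; the toy data stands in for featurized observations.

import numpy as np
from xgboost import XGBClassifier

# Stand-in features and labels; in Pythia these come from the featurization step.
rng = np.random.RandomState(41)
X = rng.rand(20, 5)
y = rng.randint(0, 2, size=20)

clf = XGBClassifier(learning_rate=0.1,     # XGB_LEARNRATE
                    max_depth=3,           # XGB_MAXDEPTH
                    min_child_weight=1,    # XGB_MINCHILDWEIGHT
                    colsample_bytree=1.0)  # XGB_COLSAMPLEBYTREE
clf.fit(X, y)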

Other run parameters

Resampling

RESAMPLING (True)

Resample observations by label to achieve a 1-to-1 ratio of positive to negative observations

OVERSAMPLING (False)

Resample so that total number of observations per class is equal to the largest class. Implies REPLACEMENT=True

REPLACEMENT (False)

When doing resampling, choose observations with replacement from original samples.
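
A toy sketch of the resampling idea (not Pythia's internal code): downsample the larger class to a 1-to-1 ratio; with REPLACEMENT=True, indices would instead be drawn with replacement.

import numpy as np

# Downsample the majority class so positive and negative counts match.
rng = np.random.RandomState(41)
labels = np.array([1, 0, 0, 0, 1, 0, 0, 1])

pos, neg = np.where(labels == 1)[0], np.where(labels == 0)[0]
n = min(len(pos), len(neg))
keep = np.concatenate([rng.choice(pos, n, replace=False),
                       rng.choice(neg, n, replace=False)])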

When using experiments/conduct_grid_search.py, use these variables to allow GridSearchCV to cooperate with the Pythia pipeline.

SAVEEXPERIMENTDATA (False)

Save the featurized experiment data to disk?

EXPERIMENTDATAFILE ('data/experimentdatafile.pkl')

Path of the pickle file in which experiment data is saved

Vocabulary

Two separate vocabularies are computed. One, a reduced vocabulary, excludes stop words and punctuation tokens, and the other, the full vocabulary (FULL_VOCAB), retains them. The full vocabulary can also use character-level tokenization instead of word-level tokenization. A toy sketch follows the options below.

VOCAB_SIZE (10000)

Size of reduced vocabulary

STEM (False)

Conduct stemming?

FULL_VOCAB_SIZE (1000)

Number of unique tokens in word-level full vocabulary.

FULL_VOCAB_TYPE ('character')

Either 'word' or 'character'. Determines tokenization strategy for full vocabulary

FULL_CHAR_VOCAB ("abcdefghijklmnopqrstuvwxyz0123456789,;.!?:'\"/|_@#$%^&*~`+-=<>()[]{}")

Characters included in the character-level full vocabulary
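
A toy sketch, using scikit-learn as a stand-in (an assumption, not Pythia's internal code), of the word-level reduced vocabulary versus the character-level full vocabulary:

from sklearn.feature_extraction.text import CountVectorizer

# Reduced word-level vocabulary with English stop words removed, versus a
# character-level 'full' vocabulary as chosen by FULL_VOCAB_TYPE='character'.
docs = ["A new episode aired!", "Fans discussed the episode."]

reduced = CountVectorizer(stop_words="english", max_features=10000)  # cf. VOCAB_SIZE
full_chars = CountVectorizer(analyzer="char")                        # cf. FULL_VOCAB_TYPE
print(sorted(reduced.fit(docs).vocabulary_))
print(sorted(full_chars.fit(docs).vocabulary_))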

SEED (41)

Random number generator seed value.

USE_CACHE (False)

Cache preprocessed JSON documents in ./.cache; this can reduce experiment time significantly for large corpora.