Media Analytics

A site that allows users to query a corpus of all articles published by the New York Times between 1970 and 2018. The Timeline functionality tracks the frequency of word usage over time in the New York Times corpus. The NLP functionality allows querying of word embeddings derived from the corpus using word2vec. Word embeddings are vector representations of words learned from a text corpus based on contextual co-occurrence statistics; they capture the semantic content of words.
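As a rough illustration (not the site's actual training pipeline), embeddings of this kind can be trained with gensim's word2vec implementation; the file name, tokenisation, and hyperparameters below are assumptions:

```python
# Minimal sketch of training word2vec embeddings with gensim.
# File name and hyperparameters are illustrative assumptions.
from gensim.models import Word2Vec

# Hypothetical layout: one tokenised article per line.
with open("nyt_articles.txt", encoding="utf-8") as f:
    sentences = [line.lower().split() for line in f]

model = Word2Vec(
    sentences,
    vector_size=300,  # matches the 300-dimensional vectors mentioned in the t-SNE section
    window=5,         # contextual co-occurrence window
    min_count=5,      # drop very rare words to keep the model small
    workers=4,
)
model.save("nyt_word2vec.model")
```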

Timeline

Compares the frequency of word usage within a time range (e.g. if the user selects 1975 to 2017, it returns a time series graph of frequencies for 1975, 1976, 1977, ..., 2017). Note that, to keep the site responsive by keeping the word embedding models within a reasonable size, only words that appear more than five times in a given year can be tracked.
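A minimal sketch of how the per-year counts and the five-occurrence cutoff could be computed, assuming articles are available as (year, text) pairs; the data layout and helper names are illustrative, not the site's actual implementation:

```python
# Sketch: per-year word frequency counts with a minimum-occurrence cutoff.
from collections import Counter, defaultdict

MIN_COUNT = 5  # words must appear more than five times in a year to be trackable

def build_frequency_index(articles):
    """articles: iterable of (year, text) tuples (assumed layout)."""
    counts = defaultdict(Counter)
    for year, text in articles:
        counts[year].update(text.lower().split())
    # Keep only words above the cutoff for each year.
    return {
        year: {w: c for w, c in counter.items() if c > MIN_COUNT}
        for year, counter in counts.items()
    }

def timeline(index, word, start, end):
    """Frequency of `word` for each year in [start, end]."""
    return {year: index.get(year, {}).get(word, 0) for year in range(start, end + 1)}
```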

NLP

Most similar word

Returns the words closest to the input word in the selected model's embedding space, ranked by cosine similarity.
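A hedged sketch using gensim's KeyedVectors (the model path and query word are assumptions):

```python
# Sketch: nearest neighbours of a word by cosine similarity.
from gensim.models import KeyedVectors

vectors = KeyedVectors.load("nyt_word2vec.kv")  # hypothetical path
print(vectors.most_similar("economy", topn=5))  # top 5 most similar words
```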

Similarity between two words

Computes the cosine similarity between the two input words using the word2vec model trained on the selected corpus.
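A minimal sketch, again assuming a gensim KeyedVectors model (path and example words are illustrative):

```python
# Sketch: cosine similarity between two words.
from gensim.models import KeyedVectors

vectors = KeyedVectors.load("nyt_word2vec.kv")  # hypothetical path
print(vectors.similarity("war", "conflict"))    # value in [-1, 1]
```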

t-SNE visualisation

Word embeddings (aka word vectors) inhabit a high-dimensional space; in this corpus's case, a 300-dimensional space. It is impossible for humans to visualize the geometric structure of such a data set unassisted. The t-SNE algorithm takes a set of points in a high-dimensional space and tries to find a faithful representation of those points in a lower-dimensional space, typically the 2D plane. Its objective function maps each high-dimensional vector to a 2- or 3-dimensional representation such that, with high probability, similar objects in the high-dimensional space are modelled by nearby points in the low-dimensional space and dissimilar objects by distant points.
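A sketch of such a projection with scikit-learn's t-SNE, assuming the same gensim vectors as above (word list and parameters are illustrative):

```python
# Sketch: projecting 300-dimensional word vectors to 2D with t-SNE.
import numpy as np
from gensim.models import KeyedVectors
from sklearn.manifold import TSNE

vectors = KeyedVectors.load("nyt_word2vec.kv")  # hypothetical path
words = ["president", "senate", "election", "music", "theatre", "film"]
X = np.stack([vectors[w] for w in words])       # shape (n_words, 300)

tsne = TSNE(n_components=2, perplexity=5, random_state=0)
X_2d = tsne.fit_transform(X)                    # shape (n_words, 2)

for word, (x, y) in zip(words, X_2d):
    print(f"{word}: ({x:.2f}, {y:.2f})")
```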

Analogical reasoning

An important property of word embeddings is that the learned representations capture meaningful syntactic and semantic regularities between words, such as gender or verb tense, and that these regularities are consistent across the vector space. They appear as roughly constant vector offsets between pairs of words sharing a particular relationship. This property permits analogical reasoning: questions such as “man is to woman as king is to…” can be answered with vector algebra of the form v_woman - v_man + v_king (where v_n stands for the vector representation of word n). In a properly trained word embedding model built from a sufficiently large and relevant text corpus, the result of this operation is a vector whose closest neighbor is the vector for the word “queen”.
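In gensim, this vector arithmetic can be expressed with most_similar, which adds the "positive" vectors, subtracts the "negative" ones, and returns the nearest neighbours of the result; the model path below is an assumption:

```python
# Sketch: the analogy v_woman - v_man + v_king.
from gensim.models import KeyedVectors

vectors = KeyedVectors.load("nyt_word2vec.kv")  # hypothetical path
print(vectors.most_similar(positive=["woman", "king"], negative=["man"], topn=1))
# Expected answer for a well-trained model: [('queen', <similarity score>)]
```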

Odd one out

Which word from the given list doesn’t go with the others according to the selected model? This is computed by finding the word whose embedding is furthest from the mean word embedding of all the input words. Example: the word out of place in the list [breakfast, cereal, dinner, lunch] is cereal.
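gensim's doesnt_match follows this same furthest-from-the-mean approach; a minimal sketch (model path assumed):

```python
# Sketch: odd-one-out via gensim's doesnt_match.
from gensim.models import KeyedVectors

vectors = KeyedVectors.load("nyt_word2vec.kv")  # hypothetical path
print(vectors.doesnt_match(["breakfast", "cereal", "dinner", "lunch"]))  # -> "cereal"
```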

Show ranking of specific word

Shows the rank of the given word in the word-frequency ranking for the selected year.
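A minimal sketch, reusing the hypothetical per-year frequency index from the Timeline sketch above:

```python
# Sketch: rank of a word within a year's frequency table (1 = most frequent).
def word_rank(index, word, year):
    ranked = sorted(index[year].items(), key=lambda kv: kv[1], reverse=True)
    for position, (w, _) in enumerate(ranked, start=1):
        if w == word:
            return position
    return None  # word not tracked for that year
```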

Show top words

Retrieves the top-n most frequent words for the selected year.
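A similarly hedged sketch of the top-n query, against the same hypothetical frequency index:

```python
# Sketch: top-n most frequent words for a given year.
def top_words(index, year, n=10):
    return sorted(index[year].items(), key=lambda kv: kv[1], reverse=True)[:n]
```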