Finds the top-10 words most similar to the input word, using the word2vec model trained on the selected University corpus. The similarity measure used is cosine similarity.
Computes the cosine similarity between the two input words using the word2vec model trained on the selected University corpus.
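As a minimal sketch of the two operations above, here is cosine similarity and a top-n lookup over a toy vocabulary. The tiny 3-dimensional vectors stand in for the real 300-dimensional trained models, and the function names are illustrative (a word2vec library such as gensim exposes equivalent `most_similar` and `similarity` methods on a trained model):

```python
import numpy as np

# Toy word vectors standing in for a trained word2vec model; the real
# models are 300-dimensional, these are 3-dimensional for illustration.
embeddings = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "queen": np.array([0.85, 0.75, 0.2]),
    "man":   np.array([0.7, 0.2, 0.1]),
    "woman": np.array([0.65, 0.15, 0.2]),
    "apple": np.array([0.1, 0.2, 0.9]),
}

def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b, in [-1, 1]."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def most_similar(word, topn=10):
    """Rank every other vocabulary word by cosine similarity to `word`."""
    query = embeddings[word]
    scores = [(other, cosine_similarity(query, vec))
              for other, vec in embeddings.items() if other != word]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]

print(most_similar("king", topn=3))
```

With these toy vectors, "queen" ranks first for "king" because their vectors point in nearly the same direction, while "apple" ranks last.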
Word embeddings (also known as word vectors) inhabit a high-dimensional space; in the case of the university corpora, a 300-dimensional space. It is impossible for humans to visualize the geometrical structure of such a multidimensional data set unassisted. The t-SNE algorithm takes a set of points in a high-dimensional space and tries to find a faithful representation of those points in a lower-dimensional space, typically the 2D plane. Its objective function maps each high-dimensional vector to a 2- or 3-dimensional representation such that, with high probability, similar objects in the high-dimensional space are modelled by nearby points in the low-dimensional space and dissimilar objects are modelled by distant points.
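A minimal sketch of this projection, assuming scikit-learn's t-SNE implementation and random vectors standing in for real word embeddings:

```python
import numpy as np
from sklearn.manifold import TSNE

# 20 random "word vectors" in 300 dimensions standing in for real embeddings.
rng = np.random.default_rng(0)
vectors = rng.normal(size=(20, 300))

# Project to 2D; perplexity must be smaller than the number of points.
tsne = TSNE(n_components=2, perplexity=5, init="random", random_state=0)
points_2d = tsne.fit_transform(vectors)

print(points_2d.shape)  # (20, 2)
```

Each row of `points_2d` is then plotted and labelled with its word, which is how the cluster structure of the vocabulary becomes visible.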
An important property of word embeddings is that the learned word representations capture meaningful syntactic and semantic regularities between words, such as gender or verb tense, and that these regularities are consistent across the vector space. The regularities are observed as roughly constant vector offsets between pairs of words sharing a particular relationship. This property permits analogical reasoning to answer questions such as “man is to woman as king is to…” using vector algebra of the form v_woman − v_man + v_king (where v_n stands for the vector representation of word n). In a word embedding model properly trained on a sufficiently large and relevant text corpus, the result of this operation is a vector whose closest neighbor is the vector for the word “queen”.
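The vector algebra above can be sketched with toy 2-dimensional embeddings constructed so that the gender offset is constant, mimicking the regularity described (a real word2vec model would perform the same arithmetic in 300 dimensions):

```python
import numpy as np

# Toy embeddings in which the "gender" offset is a constant vector,
# mimicking the regularity described in the text.
embeddings = {
    "man":   np.array([1.0, 0.0]),
    "woman": np.array([1.0, 1.0]),
    "king":  np.array([3.0, 0.0]),
    "queen": np.array([3.0, 1.0]),
}

def analogy(a, b, c):
    """Solve 'a is to b as c is to ?' via v_b - v_a + v_c, cosine-ranked."""
    target = embeddings[b] - embeddings[a] + embeddings[c]
    def cos(u, v):
        return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    # Exclude the three query words, as word2vec tooling usually does.
    candidates = [w for w in embeddings if w not in (a, b, c)]
    return max(candidates, key=lambda w: cos(embeddings[w], target))

print(analogy("man", "woman", "king"))  # queen
```

Here v_woman − v_man + v_king lands exactly on v_queen, so the nearest-neighbor search returns “queen”.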
Which word from the given list doesn’t go with the others, according to the selected model? This is computed by finding the word whose embedding is furthest from the mean embedding of all the input words. Example: the word out of place in the list [breakfast, cereal, dinner, lunch] is cereal.
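A minimal sketch of the furthest-from-the-mean computation described above, with hand-picked toy vectors in which the three meals cluster together (the vectors and the Euclidean distance choice are assumptions for illustration; some libraries rank by cosine similarity to the mean instead):

```python
import numpy as np

# Toy embeddings: the three meals cluster together, "cereal" sits apart.
embeddings = {
    "breakfast": np.array([1.0, 1.0]),
    "dinner":    np.array([1.1, 0.9]),
    "lunch":     np.array([0.9, 1.1]),
    "cereal":    np.array([3.0, 0.0]),
}

def doesnt_match(words):
    """Return the word whose vector is furthest from the mean of the group."""
    mean = np.mean([embeddings[w] for w in words], axis=0)
    return max(words, key=lambda w: np.linalg.norm(embeddings[w] - mean))

print(doesnt_match(["breakfast", "cereal", "dinner", "lunch"]))  # cereal
```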
Word embedding models derived using word2vec from large corpora of textual data gathered from the institutional web domains of 50 elite U.S. universities. The model named "All Universities Corpuses Concatenated" was derived from a combined corpus consisting of all the individual university corpora concatenated. Owing to its size (17 GB of textual data), this is the most accurate model, but also the one that takes the longest to generate an output for user queries.
Retrieves the top-n most frequent words in the selected University corpus.
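A minimal sketch of this frequency lookup using the standard library, with a tiny stand-in corpus (the real corpora are gigabytes of scraped text, and a real pipeline would tokenize more carefully than whitespace splitting):

```python
from collections import Counter

# Tiny stand-in corpus; the real corpora are gigabytes of scraped text.
corpus = "the university of the state and the city and the campus"

def top_n_words(text, n):
    """Return the n most frequent tokens with their counts."""
    return Counter(text.split()).most_common(n)

print(top_n_words(corpus, 2))  # [('the', 4), ('and', 2)]
```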