Text comparison using word vector representations and dimensionality reduction

Hendrik Heuer

Audience level:


Using word2vec word vector representations and t-SNE dimensionality reduction, a bird’s-eye view of one or more text sources can be computed. word2vec learns a vector for each word, and t-SNE projects these vectors into 2D so that semantically similar words lie close to each other. This enables users to explore a text source like a geographical map.


This talk describes the development of a tool for the text analysis of book summaries. The tool uses word2vec word representations from the gensim Python library and t-SNE from scikit-learn to visualize and compare the topics in book summaries and their source material. Word vector representations capture many linguistic properties such as gender, tense and plurality, and even semantic relations like "capital city of". Applying dimensionality reduction to these vectors yields a 2D map in which semantically similar words are close to each other.
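The "capital city of" relation mentioned above is usually demonstrated with vector arithmetic: the offset between a country and its capital is roughly the same for different countries. The sketch below illustrates the idea with hand-made three-dimensional vectors. The vectors, the vocabulary and the helper name `analogy` are invented for illustration only; real word2vec vectors are learned from text and have many more dimensions.

```python
import math

# Hand-made toy vectors (NOT learned): [french, german, is_capital].
toy_vectors = {
    "france":  [1.0, 0.0, 0.0],
    "paris":   [1.0, 0.0, 1.0],
    "germany": [0.0, 1.0, 0.0],
    "berlin":  [0.0, 1.0, 1.0],
    "munich":  [0.0, 1.0, 0.2],
}

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def analogy(a, b, c, vectors=toy_vectors):
    """Answer 'a is to b as c is to ?' by finding the word nearest to b - a + c."""
    target = [vb - va + vc
              for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    # The three query words are excluded from the candidates.
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(vectors[w], target))

print(analogy("france", "paris", "germany"))  # berlin
```

With trained vectors in gensim, a comparable query can be made with `model.wv.most_similar(positive=[...], negative=[...])`, although the quality of the answers depends on the training corpus.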