Source code machine learning tool stack
source{d}, a startup in Madrid, Spain, has recently released a number of advanced open source projects related to machine learning on the world's source code. The talk is supposed to spread the word about them and show how they can be useful. The list is as follows:
- Babelfish - an ambitious initiative to extract Universal Abstract Syntax Trees from every software repository in the world. It splits into the server, the tools and the language drivers. The Python bindings will be presented; extracting ASTs from arbitrary codebases has never been so easy before.
- ast2vec is the complete end-to-end solution to embed source code identifiers into dense vectors (like in word2vec) using UASTs from Babelfish. It includes the customized Swivel model written in Tensorflow.
- vecino finds similar repositories by comparing their contents using ast2vec embeddings. E.g., it allows one to query for repositories related to Tensorflow or to beer. The corresponding paper is under development.
- wmd-relax is an optimized and high performing package to calculate Word Mover's Distance. Vecino is based on it.
- repotm is the complete end-to-end solution to do topic modeling of tens of millions of GitHub repositories. The corresponding paper was accepted to AUAI and can be downloaded from ArXiV.
- minhashcuda - GPGPU Weighted MinHash calculation. Other projects use it to filter fuzzy duplicates.
- hercules analyzes line burndown in Git repositories at blazing speed, allowing for revolutionary hot spot and software architecture analysis.
There will be tons of technical details and hands-on experience. These projects were designed with scalability in mind from the very beginning, and can tackle with the whole set of GitHub repositories. We believe that our growing ecosystem constitutes the foundation for the next generation machine learning on source code.