PyPy core of TaLTaC3 for high performance and big data text analysis

TaLTaC3 was presented to the European text analysis research community at JADT 2016 as a promising text analysis tool with good performance on big data set, such as Repubblica-90, which a 10-years text data set of the "La Repubblica" Italian news paper. The time to extract the lexicon from the 1.51 GB data set, composed of 400.276 articles, with 1.167.723 types and 278.940.753 tokens was reported as 844 seconds on a Mac Book Pro 2013, with 16GB of RAM and 2,7 GHz Intel Core i7.

Here the TaLTaC3 multi-platform software architecture will be shown and discussed, with case studies over increasing size text corpora, compatible with the time of the presentation. The architecture comprises a web-like GUI application based on Electron connected through asyncronous methods call to the PyPy core through the a Redis NOSQL server, which also support user session persistence.

Reference:

BOLASCO, Sergio; CANZONETTI, Francesco Baiocchi2 Alessio; DE GASPERIS, Giovanni. TaLTaC 3.0: un software multi-lessicale e uni-testuale ad architettura web. JADT 2016, Nice, FR. Paper