We describe the process for matching e-commerce offers between two catalogues using open-source Python scientific libraries (numpy/scipy, pandas and scikit-learn).
E-commerce sales reached $1.4 trillion in 2014 and are expected to grow further in the next few years. The sheer number of products for sale on the internet makes them hard to manage. Matching similar offers across several e-commerce catalogues is a key issue, with applications such as fraud detection, competition monitoring and database cleaning. The company Pricing Assistant is interested in identifying and linking offers of the same product across several e-merchant catalogues. An e-merchant catalogue is a set of HTML pages (dynamic pages are left out of this presentation for clarity's sake). Catalogues often contain more than ten thousand offers, so we quickly face scaling problems. Finally, the descriptions of offers for a single product are very heterogeneous and often lack a unique identifier (such as a European Article Number, EAN). We build a data matching algorithm that deals with data that is unstructured, massive and heterogeneous.
Throughout this work, we describe how we use a variety of open-source Python libraries, such as numpy, pandas and scikit-learn, and how they interface. The entity matching process can be divided into three sub-processes: extraction, indexing and matching.
In the extraction part, we start from the HTML source code of a product web page. From this code we extract description fields (names, images, descriptions, attribute tables, price) using the lxml Python library. From the resulting text fields, we then extract product attributes, such as the manufacturer, references or dimensions, using domain knowledge, dictionaries or context.
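The two extraction steps can be sketched as follows. This is a minimal illustration rather than the pipeline actually used: the page markup, the XPath expressions, the `MANUFACTURERS` dictionary and the helper names are all hypothetical, and real merchant pages need their own rules.

```python
import re
from lxml import html

# Hypothetical product page; real merchant pages differ and need their own XPath rules.
PAGE = """
<html><body>
  <h1 class="product-name">Acme Trail Jacket 2000 - Blue</h1>
  <span class="price">59,90 EUR</span>
  <div class="description">Waterproof jacket, 30 x 20 x 5 cm when packed.</div>
  <img class="product-image" src="/img/jacket-blue.jpg"/>
</body></html>
"""

# Hypothetical dictionary of known manufacturers (domain knowledge).
MANUFACTURERS = {"acme", "globex"}


def first_text(tree, xpath):
    """Return the stripped text of the first node matching xpath, or None."""
    nodes = tree.xpath(xpath)
    return nodes[0].text_content().strip() if nodes else None


def extract_fields(page_source):
    """Step 1: pull raw description fields out of the HTML."""
    tree = html.fromstring(page_source)
    images = tree.xpath('//img[@class="product-image"]/@src')
    return {
        "name": first_text(tree, '//h1[@class="product-name"]'),
        "price_text": first_text(tree, '//span[@class="price"]'),
        "description": first_text(tree, '//div[@class="description"]'),
        "image": images[0] if images else None,
    }


def extract_attributes(fields):
    """Step 2: derive structured attributes from the raw text fields."""
    name = fields["name"] or ""
    description = fields["description"] or ""
    # Manufacturer: dictionary lookup on the tokens of the product name.
    tokens = re.findall(r"\w+", name.lower())
    manufacturer = next((t for t in tokens if t in MANUFACTURERS), None)
    # Dimensions: a pattern such as "30 x 20 x 5 cm" in the description.
    dims = re.search(r"(\d+)\s*x\s*(\d+)\s*x\s*(\d+)\s*cm", description)
    # Price: first number in the price text, with a decimal comma or point.
    price = re.search(r"\d+(?:[.,]\d+)?", fields["price_text"] or "")
    return {
        "manufacturer": manufacturer,
        "dimensions_cm": tuple(int(d) for d in dims.groups()) if dims else None,
        "price": float(price.group().replace(",", ".")) if price else None,
    }


fields = extract_fields(PAGE)
attributes = extract_attributes(fields)
```

Keeping field extraction separate from attribute extraction means the site-specific part (the XPath rules) can change without touching the dictionary and pattern logic.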
Before matching, we index the data in order to reduce the number of pairs to go through. Using some of the features we just extracted, we keep only the most plausible candidate pairs. In this part, we take advantage of pandas' indexing capabilities.
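One simple way to reduce the number of pairs is to compare only offers that share an extracted attribute. The sketch below assumes the manufacturer is used as the blocking key and uses a plain pandas merge. The column names and toy data are hypothetical, and the indexing scheme actually used may differ.

```python
import pandas as pd

# Toy catalogues; column names and values are made up for illustration.
catalog_a = pd.DataFrame({
    "offer_id": ["a1", "a2", "a3"],
    "manufacturer": ["acme", "acme", "globex"],
    "name": ["Trail Jacket Blue", "Trail Jacket Red", "Thermo Flask 1L"],
})
catalog_b = pd.DataFrame({
    "offer_id": ["b1", "b2", "b3"],
    "manufacturer": ["acme", "globex", "initech"],
    "name": ["Jacket Trail 2000 blue", "Flask Thermo 1 L", "Stapler"],
})

# Without blocking, every offer of A is compared with every offer of B.
n_all_pairs = len(catalog_a) * len(catalog_b)

# Blocking: an inner join on the key keeps only pairs that share a manufacturer.
candidates = catalog_a.merge(catalog_b, on="manufacturer", suffixes=("_a", "_b"))
candidate_pairs = set(zip(candidates["offer_id_a"], candidates["offer_id_b"]))
```

Here the 9 possible pairs shrink to 3 candidates. The trade-off is that a wrongly extracted key silently removes a true match, so the blocking attribute has to be reliable.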
The matching starts with the definition of features, which are the result of distance calculations between the attributes and descriptions of the two products. We compare product names, image colors, prices, descriptions and extracted attributes. We use four types of distances, computed with the scipy/numpy stack, for four types of attributes: images, texts, numerical attributes and normalized attributes. For images, we use a color comparison algorithm. This image feature is the simplest to compare and proved to be the most relevant in the context of products, as they are often differentiated by color. A Levenshtein-type distance is used for texts. Finally, we use both a ratio and a simple difference for numerical attributes, and a simple boolean distance for normalized attributes.
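The four kinds of distance can be illustrated with numpy. This is a sketch under stated assumptions: the color comparison is stood in for by a distance between coarse RGB histograms, the text distance is a plain Levenshtein distance normalized by length, and all function names are hypothetical.

```python
import numpy as np


def levenshtein(a, b):
    """Edit distance between two strings (insertions, deletions, substitutions)."""
    previous = np.arange(len(b) + 1)
    for i, char_a in enumerate(a, start=1):
        current = np.empty(len(b) + 1, dtype=int)
        current[0] = i
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current[j] = min(previous[j] + 1,         # deletion
                             current[j - 1] + 1,      # insertion
                             previous[j - 1] + cost)  # substitution
        previous = current
    return int(previous[-1])


def text_distance(a, b):
    """Levenshtein distance scaled to [0, 1] by the longer string length."""
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0


def numeric_features(x, y):
    """Absolute difference and min/max ratio of two positive numbers."""
    return abs(x - y), min(x, y) / max(x, y)


def boolean_distance(x, y):
    """0.0 when two normalized attribute values are equal, 1.0 otherwise."""
    return 0.0 if x == y else 1.0


def color_distance(image_a, image_b, bins=4):
    """Half the L1 distance between RGB histograms of H x W x 3 arrays, in [0, 1]."""
    def histogram(image):
        pixels = image.reshape(-1, 3).astype(float)
        hist, _ = np.histogramdd(pixels, bins=bins, range=[(0, 256)] * 3)
        return hist.ravel() / pixels.shape[0]
    return float(np.abs(histogram(image_a) - histogram(image_b)).sum() / 2.0)


# Toy inputs: two solid-color 8 x 8 images, one red and one blue.
red = np.zeros((8, 8, 3), dtype=np.uint8)
red[..., 0] = 255
blue = np.zeros((8, 8, 3), dtype=np.uint8)
blue[..., 2] = 255

price_diff, price_ratio = numeric_features(10.0, 12.5)
feature_vector = np.array([
    text_distance("kitten", "sitting"),
    color_distance(red, blue),
    price_diff,
    price_ratio,
    boolean_distance("acme", "acme"),
])
```

Each product pair thus becomes a fixed-length numeric vector, which is the form a classifier expects.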
Each of these distances can output one or several features, which are then fed to a classification algorithm. The problem is formulated as a classification problem in which pairs of products are classified as matches or non-matches. A kernelized support vector machine classifier is used to score product pairs. Finally, when comparing two product catalogues, we assume that each product appears only once on each side, and exploit this with an assignment algorithm that consolidates the results. For both of these algorithms we use the scikit-learn library.
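Scoring followed by consolidation can be sketched as below. The training data is synthetic and the two features are hypothetical. The assignment step uses `scipy.optimize.linear_sum_assignment` as a stand-in, which is an assumption made for this example and not necessarily the routine used in this work.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Synthetic pair features (hypothetical): [text distance, price ratio].
matches = rng.normal(loc=[0.1, 0.95], scale=0.05, size=(50, 2))
non_matches = rng.normal(loc=[0.7, 0.5], scale=0.05, size=(50, 2))
X_train = np.vstack([matches, non_matches])
y_train = np.array([1] * 50 + [0] * 50)

# Kernelized SVM; decision_function gives a signed score per pair
# (positive means the pair is on the "match" side of the boundary).
classifier = SVC(kernel="rbf", gamma="scale")
classifier.fit(X_train, y_train)

# Features for every pair between 3 offers of catalogue A and 3 of catalogue B.
pair_features = np.empty((3, 3, 2))
pair_features[:] = [0.7, 0.5]              # default: clearly different products
for i in range(3):
    pair_features[i, i] = [0.1, 0.95]      # true matches on the diagonal
pair_features[0, 1] = [0.35, 0.8]          # a pair placed between the two clusters

scores = classifier.decision_function(pair_features.reshape(-1, 2)).reshape(3, 3)

# Assignment: each offer is matched at most once on each side, maximizing
# the total score (the routine minimizes cost, hence the sign flip).
rows, cols = linear_sum_assignment(-scores)
matched = [(int(r), int(c)) for r, c in zip(rows, cols) if scores[r, c] > 0]
```

The assignment step prevents one offer from being linked to two offers of the other catalogue, even when several pairs involving it receive a high score.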
Since this work involves machine learning, we need training and testing datasets. Using products with EANs, we were able to build a dataset of around 50,000 pairs of matching products (hence a dataset of 100,000 pairs).
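A labelled dataset of this kind can be assembled with pandas by joining two catalogues on the EAN. In the sketch below the EAN values and column names are made up, and sampling as many non-matching pairs as matching ones is an assumption made for the example.

```python
import pandas as pd

# Toy catalogues with made-up EAN values.
catalog_a = pd.DataFrame({
    "offer_id": ["a1", "a2", "a3"],
    "ean": ["3000000000017", "3000000000024", "3000000000031"],
})
catalog_b = pd.DataFrame({
    "offer_id": ["b1", "b2", "b3"],
    "ean": ["3000000000024", "3000000000017", "3000000000048"],
})

# Matching pairs: offers that carry the same EAN in both catalogues.
positives = (
    catalog_a.merge(catalog_b, on="ean", suffixes=("_a", "_b"))
    [["offer_id_a", "offer_id_b"]]
    .assign(label=1)
)

# All pairs (cross join through a constant key), then keep differing EANs.
all_pairs = catalog_a.assign(key=1).merge(
    catalog_b.assign(key=1), on="key", suffixes=("_a", "_b")
)
negatives = (
    all_pairs[all_pairs["ean_a"] != all_pairs["ean_b"]]
    [["offer_id_a", "offer_id_b"]]
    .sample(n=len(positives), random_state=0)  # assumption: balanced classes
    .assign(label=0)
)

dataset = pd.concat([positives, negatives], ignore_index=True)
positive_pairs = set(zip(positives["offer_id_a"], positives["offer_id_b"]))
```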
The method combines machine learning and semantic techniques, a combination that has proven powerful for matching e-commerce offers. By taking advantage of the structure of e-commerce web pages, we obtain much better recall and precision than reported in the literature [1, 2, 3]. Most importantly, we are able to classify more than 80% of pairs with a precision above 95%. Moreover, most published articles deal with specific technical product categories, such as cameras, whereas our study covers categories such as textiles and parapharmacy.
Further work is under way on the image analysis part (shape detection), and feature selection algorithms are being studied in detail.
[1] Köpcke, H., Thor, A., Thomas, S., & Rahm, E. (2012, March). Tailoring entity resolution for matching product offers. In Proceedings of the 15th International Conference on Extending Database Technology (pp. 545-550). ACM.
[2] Kannan, A., Givoni, I. E., Agrawal, R., & Fuxman, A. (2011, August). Matching unstructured product offers to structured product specifications. In Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 404-412). ACM.
[3] Gopalakrishnan, V., Iyengar, S. P., Madaan, A., Rastogi, R., & Sengamedu, S. (2012, October). Matching product titles using web-based enrichment. In Proceedings of the 21st ACM International Conference on Information and Knowledge Management (pp. 605-614). ACM.