64th ISI World Statistics Congress - Ottawa, Canada

Categorizing company websites based on website texts


Format: IPS Abstract

Keywords: machine learning, nlp, text analysis

Session: IPS 200 - Challenges of Natural Language Processing techniques in official statistics

Tuesday 18 July 10 a.m. - noon (Canada/Eastern)


Different types of companies can be identified from differences in the texts on their websites. This approach has been used to identify innovative and platform economy companies in the Netherlands and drone companies in several European countries. Usually, an initial test is performed to determine whether, and by how much, the website texts for the topic studied actually differ. For this, at least 2000 company website texts, 50% positive and 50% negative cases, are routinely used. Survey data or expert findings are used to determine the actual type of each company. Next, the website texts are preprocessed, and various classification algorithms from the Python scikit-learn library are applied to determine which of them best discriminates between the positive and negative cases, e.g. platform vs non-platform. In addition, the effect of adding word-embedding-based features is routinely tested. We found that logistic regression with word embeddings worked best to detect innovative company websites (accuracy 88%), a linear SVM worked best for platform economy websites (accuracy 82%), and logistic regression worked best to detect drone companies (accuracy 82-86%) in three languages. The results are usually used to obtain a small subset of companies of the type studied, which are subsequently investigated in further detail.
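
The comparison step described above can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the toy corpus, the TF-IDF features, the three candidate classifiers, and the helper names `make_toy_corpus` and `compare_classifiers` are all assumptions made for illustration, and word-embedding features are left out. It only shows how several scikit-learn classifiers might be compared by cross-validated accuracy on a balanced set of labelled texts.

```python
import random

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC


def make_toy_corpus(n_per_class=100, seed=0):
    """Build a balanced, synthetic stand-in for labelled website texts.

    Hypothetical data: real studies would use scraped website texts with
    labels taken from survey data or expert findings.
    """
    rng = random.Random(seed)
    shared = ["company", "service", "contact", "team", "about", "customers"]
    positive = ["platform", "marketplace", "providers", "booking", "app"]
    negative = ["factory", "wholesale", "machinery", "retail", "warehouse"]
    texts, labels = [], []
    for label, topical in ((1, positive), (0, negative)):
        for _ in range(n_per_class):
            # Mix generic words with a few class-specific words.
            words = rng.choices(shared, k=8) + rng.choices(topical, k=3)
            rng.shuffle(words)
            texts.append(" ".join(words))
            labels.append(label)
    return texts, labels


def compare_classifiers(texts, labels, cv=5):
    """Return the mean cross-validated accuracy for each candidate classifier."""
    # Assumed candidate set; the abstract does not list the algorithms tried.
    candidates = {
        "logistic_regression": LogisticRegression(max_iter=1000),
        "linear_svm": LinearSVC(),
        "naive_bayes": MultinomialNB(),
    }
    scores = {}
    for name, classifier in candidates.items():
        # The vectorizer is inside the pipeline so it is refit on each fold.
        pipeline = make_pipeline(TfidfVectorizer(lowercase=True), classifier)
        fold_scores = cross_val_score(
            pipeline, texts, labels, cv=cv, scoring="accuracy"
        )
        scores[name] = float(fold_scores.mean())
    return scores


texts, labels = make_toy_corpus()
scores = compare_classifiers(texts, labels)
best = max(scores, key=scores.get)
for name, accuracy in scores.items():
    print(f"{name}: {accuracy:.3f}")
print("best:", best)
```

Placing the vectorizer inside the pipeline keeps vocabulary statistics from the held-out fold out of training, so the accuracies being compared are less likely to be optimistic.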