Publication

Text Mining mit der "Temi-Box"

Abstract

"The constantly growing amount of digitally accessible text data and advances in natural language processing (NLP) have made text mining a key technology. The "Temi-Box" is a modular construction kit designed for text mining applications. It enables automated text classification, topic categorization, and clustering without necessitating extensive programming expertise. Developed on the basis of the keywording and topic assignment of publications for the IAB Info Platform and financed by EU funds, it is available as an open source project. This research report documents the development and application of the Temi-Box and illustrates its use and the interpretation of the results obtained. Text mining extracts knowledge from unstructured texts using methods such as classification and clustering. The modular Temi-Box provides users with established methods in a user-friendly way and supports users with a pipeline architecture that simplifies standardised processes such as data preparation and model training. It incorporates both current and traditional approaches to text representation, such as BERT and TF-IDF, and offers a variety of algorithms for text classification and clustering, including K-Nearest Neighbors (KNN), binary and multinomial classifiers as layers in neural networks and K-Means. Various evaluation metrics facilitate the assessment of model performance and the comparison of different approaches. Experiments on automated topic assignment and the identification of key topics illustrate the use of the Temi-Box and the interpretation of the results. Based on a dataset with 1,932 IAB publications and 105 topics, the results show that BERT-based models, such as GermanBERT, consistently achieve the best results. Binary classifiers prove to be particularly flexible and accurate, while TF-IDF-based approaches offer robust alternatives with less complexity. Clustering remains a challenge, especially when content overlaps. 
The Temi-Box is a highly versatile instrument. In addition to the application for the IAB Info Platform described in this research report, it can be used in numerous areas, such as the analysis of job advertisements, job and company descriptions, keywording of publications or for sentiment analysis. It can also be extended for use in question-and-answer systems or for named entity recognition. The Temi-Box facilitates the application of text mining methods for a broad user base and offers numerous customization options. It reduces the effort involved in developing and comparing models. Its open source availability promotes the further development and integration of the Temi-Box into various research projects. This enables users to adapt the platform to specific needs and integrate new functions. The report shows the potential of the Temi-Box to advance the digitization and automation of text data analysis. At the same time, challenges such as ensuring data quality and the interpretability of the models remain. These aspects require continuous validation and further development in order to further improve the effectiveness and reliability of text mining methods." (Author's abstract, IAB-Doku)
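To make the pairing of a TF-IDF text representation with a K-Nearest Neighbors classifier, as named in the abstract, more concrete, here is a minimal pure-Python sketch. It is not the Temi-Box API and does not reproduce the report's pipeline. The function names, the whitespace tokenizer, the smoothed IDF formula, the choice of cosine similarity, and the four toy documents with their two topic labels are all assumptions made for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: lowercase whitespace tokenization; a real pipeline would do more.
    return text.lower().split()


def fit_idf(docs):
    # Smoothed inverse document frequency: log((1 + n) / (1 + df)) + 1.
    n = len(docs)
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(tokenize(doc)))
    return {term: math.log((1 + n) / (1 + df)) + 1.0 for term, df in doc_freq.items()}


def tfidf_vector(text, idf):
    # Sparse, L2-normalised TF-IDF vector; terms unseen in training are dropped.
    counts = Counter(t for t in tokenize(text) if t in idf)
    vec = {t: c * idf[t] for t, c in counts.items()}
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm else {}


def cosine(a, b):
    # Both vectors are unit length, so the dot product is the cosine similarity.
    return sum(w * b.get(t, 0.0) for t, w in a.items())


def knn_predict(text, train_vecs, train_labels, idf, k=3):
    # Majority vote among the k most similar training documents.
    query = tfidf_vector(text, idf)
    ranked = sorted(
        zip(train_vecs, train_labels),
        key=lambda pair: cosine(query, pair[0]),
        reverse=True,
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Invented toy data (hypothetical topics, not the IAB dataset).
train_docs = [
    "minimum wage effects on employment",
    "wage inequality and collective bargaining",
    "vocational training and apprenticeship entry",
    "further training for low skilled workers",
]
train_labels = ["wages", "wages", "training", "training"]

idf = fit_idf(train_docs)
train_vecs = [tfidf_vector(doc, idf) for doc in train_docs]

if __name__ == "__main__":
    print(knn_predict("collective wage bargaining", train_vecs, train_labels, idf))
```

Because the vectors are stored as sparse dictionaries and normalised once, each similarity is a single pass over the query's terms. Replacing `tfidf_vector` with an embedding function would leave `knn_predict` unchanged, which is the kind of modular swap the abstract describes.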

Cite article

Hirmer, C. & Metzger, L. (2025): Text Mining mit der "Temi-Box". (IAB-Forschungsbericht 13/2025), Nürnberg, 58 p. DOI:10.48720/IAB.FB.2513

Open Access