

Text data mining (TDM) can be a powerful tool for efficiently extracting hidden patterns from large volumes of unstructured text. Learn more here about how to start your TDM project and how to avoid legal pitfalls.
Please note: Whether using copyright-protected material for LLM training is legally permitted is still under debate. Lib4RI has agreements with some publishers (for example, Elsevier and Wiley) that specify the conditions for using their published material for LLM training. Creative Commons (CC) offers a detailed guide to using works published under CC licences for LLM training.
If you have any questions about copyright for your text data mining / AI / LLM project, check our FAQs or contact us directly at @email.
In some cases, the information you need is contained in a set of full-text documents. For many analyses, however, the necessary information is accessible through the bibliographic data of a set of documents. For instance, titles and abstracts of publications often contain sufficient information for topic and trend analysis. Bibliographic data can be accessed with considerably less effort than full-text corpora. Refer to our bibliometrics page to learn more about sources and analyses.
To conduct a TDM project, you need a textual dataset, or corpus, and tools to transform and analyse the data.
Starting your project
Almost any written resource can be used to compile a corpus. This includes scientific publications as well as newspaper articles or web posts. In most cases, data can be accessed either via APIs or as a snapshot.
From a computational perspective, text data are inherently unstructured. Therefore, a corpus must be pre-processed before conducting a computational analysis. Depending on your corpus, this can include:
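As an illustration of what pre-processing can look like in practice, the following is a minimal Python sketch using only the standard library. The chosen steps (Unicode normalisation, lowercasing, tokenisation, stop-word removal), the function name `preprocess` and the short stop-word list are assumptions made for this example; real projects typically rely on fuller, language-specific resources and may need different steps depending on the corpus.

```python
import re
import unicodedata
from typing import List

# Illustrative stop-word list (an assumption for this sketch;
# real projects use fuller, language-specific lists).
STOP_WORDS = {"a", "an", "and", "in", "is", "of", "on", "the", "to"}


def preprocess(text: str) -> List[str]:
    """Normalise a raw string and return a list of lowercase tokens."""
    # Normalise Unicode so visually identical characters compare equal.
    text = unicodedata.normalize("NFKC", text)
    # Lowercase so "Water" and "water" count as the same term.
    text = text.lower()
    # Tokenise: keep runs of letters, dropping digits and punctuation.
    tokens = re.findall(r"[^\W\d_]+", text)
    # Remove stop words and single-character tokens.
    return [t for t in tokens if t not in STOP_WORDS and len(t) > 1]


if __name__ == "__main__":
    print(preprocess("The Water Quality of Lakes in 2020!"))
```

Whether to drop digits or stop words depends on the research question; for example, years or chemical formulas may carry meaning that this sketch would discard.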
After the corpus is transformed into a machine-interpretable dataset, it can be analysed. Models use techniques from computational linguistics, natural language processing, machine learning and statistics. Possible analyses are:
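To give a rough idea of a simple analysis, the sketch below counts term frequencies across a corpus and tracks how many documents per year mention a given term, a very basic form of trend analysis. The sample records, the `(year, tokens)` layout and the function names are hypothetical and invented for illustration only; more advanced analyses would normally use dedicated NLP or machine-learning libraries.

```python
from collections import Counter

# Hypothetical mini-corpus of (year, tokens) records, invented for illustration.
corpus = [
    (2019, ["lake", "water", "quality"]),
    (2020, ["water", "treatment", "membrane"]),
    (2020, ["membrane", "fouling", "water"]),
]


def term_frequencies(records):
    """Count how often each term occurs across all documents."""
    counts = Counter()
    for _year, tokens in records:
        counts.update(tokens)
    return counts


def documents_per_year(records, term):
    """Count, per year, the documents that contain the given term."""
    per_year = Counter()
    for year, tokens in records:
        if term in tokens:
            per_year[year] += 1
    # Return a plain dict ordered by year for easier reading.
    return dict(sorted(per_year.items()))


if __name__ == "__main__":
    print(term_frequencies(corpus).most_common(3))
    print(documents_per_year(corpus, "membrane"))
```

Raw counts like these are only a starting point: they are sensitive to corpus size per year, so comparing shares rather than absolute numbers is usually more informative.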
Further resources