Text data mining

Text data mining (TDM) can be a powerful tool for efficiently extracting hidden patterns from vast amounts of unstructured text data. Learn more here about how to start your TDM project and how to avoid legal pitfalls.

Determining your data needs

In some cases, the information you need is contained in a set of full-text documents. For many analyses, however, the necessary information is accessible through the bibliographic data of a set of documents. For instance, titles and abstracts of publications often contain sufficient information for topic and trend analysis. Bibliographic data can be accessed with considerably less effort than full-text corpora. Refer to our bibliometrics page to learn more about sources and analyses.

Starting your TDM project

To conduct a TDM project, you need a textual dataset, or corpus, and tools to transform and analyse the data.

  • Almost any written resource can be used to compile a corpus, including scientific publications, newspaper articles and web posts. In most cases, data can be accessed either via APIs or as a snapshot. 

    • Find a list of Open Access and licenced data sources here.
    • Please respect access regulations. In most cases only authorized tools can be used for downloading! Refrain from using self-scripted bots or crawlers unless it is explicitly allowed. 
    • To optimize your search strategy, and therefore the quality of your corpus, refer to the Info Sheet on Topic Search or get in touch with our information specialists.
  • From a computational perspective, text data are inherently unstructured. Therefore, a corpus must be pre-processed before conducting a computational analysis. Depending on your corpus, this can include: 

    • Normalising the raw data - remove extra white spaces, coding tags, punctuation or special characters. Lowercasing all text is often recommended to reduce the vocabulary size. 
    • Stop word removal - remove words with little to no information, such as "and" or "the". Dictionaries for stop words are often included in R and Python packages and other tools.
    • Tokenisation - split the text into meaningful bits. Commonly, tokens are built on the level of words or sub-words. For example, the sentence "Researching is fun" could be split into the tokens ["Researching", "is", "fun"] on a word level or ["Research", "ing", "is", "fun"] on a sub-word level. Tokenisation transforms the text into a format suitable for computational analysis.
  • After the corpus is transformed into a machine-interpretable dataset, it can be analysed. Models use techniques from computational linguistics, natural language processing, machine learning and statistics. Possible analyses are:

    • Topic modelling and text clustering - discover the inherent topics of a body of text. Unsupervised classification algorithms can divide bodies of text into natural groups. The underlying mechanisms recognize topics as clusters of words and can assign documents to one or more topics.
    • Named entity recognition (NER) - identify and extract named entities from a large body of text. The applied algorithms can automatically classify entities into pre-defined categories, such as name, place, date and others. NER is a sub-task in the larger field of information extraction.
    • Sentiment analysis and opinion mining - understand the emotional intent of words using sentiment dictionaries. Common use cases include social media posts as well as articles and opinion pieces in newspapers.
    • Trend analysis - derive trends from frequency distributions of terms. For example, the evolution of research topics over time or the shift of research foci in different countries. 
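The pre-processing steps above (normalisation, stop word removal and word-level tokenisation) can be sketched with Python's standard library alone. The stop word list here is a minimal illustrative stand-in; in practice you would use a dictionary from an R or Python package, as mentioned above.

```python
import re

# Minimal illustrative stop word list; real projects use package dictionaries
STOP_WORDS = {"and", "the", "is", "a", "of"}

def preprocess(raw_text):
    """Normalise, tokenise, and remove stop words from a raw text string."""
    # Normalise: lowercase, strip coding tags, drop punctuation/special characters
    text = re.sub(r"<[^>]+>", " ", raw_text.lower())
    text = re.sub(r"[^a-z\s]", " ", text)
    # Word-level tokenisation (also collapses extra white space)
    tokens = text.split()
    # Stop word removal
    return [t for t in tokens if t not in STOP_WORDS]

preprocess("<p>Researching is fun, and rewarding!</p>")
# → ['researching', 'fun', 'rewarding']
```

Real corpora usually need more care (e.g. handling hyphenation, numbers or non-Latin scripts), but the overall pipeline shape stays the same.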
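As a simple illustration of trend analysis, term frequencies can be aggregated per year to trace the evolution of a research topic over time. The corpus below is a hypothetical toy example of pre-processed (year, tokens) pairs, not real data.

```python
from collections import Counter

# Hypothetical toy corpus: (publication year, pre-processed tokens) pairs
corpus = [
    (2020, ["neural", "networks", "vision"]),
    (2021, ["neural", "transformers", "language"]),
    (2021, ["transformers", "vision"]),
    (2022, ["transformers", "language", "models"]),
]

def term_frequency_by_year(corpus, term):
    """Count how often a term occurs in each year's documents."""
    counts = Counter()
    for year, tokens in corpus:
        counts[year] += tokens.count(term)
    return dict(sorted(counts.items()))

term_frequency_by_year(corpus, "transformers")
# → {2020: 0, 2021: 2, 2022: 1}
```

Plotting such per-year frequencies for several terms makes shifts in research foci visible at a glance.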

Do you have any questions?

We are happy to advise you on finding open and licenced data sources. Our information specialists can also help you form your search strategy. 

Further resources

Info Sheet on Topic Search