Keywords

This function extracts keywords and/or bigrams from the corpus (the algorithm is based on TF-IDF and other metrics). The function also allows the comparison of keywords across groups or timeframes.
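To make the idea of TF-IDF ranking concrete, here is a minimal sketch in plain Python. It is not the tool's actual implementation: the function name `tfidf_keywords`, the whitespace tokenization, and the way scores are summed across documents are all assumptions made for illustration, and the "other metrics" mentioned above are not modeled.

```python
import math
from collections import Counter


def tfidf_keywords(documents, top_n=5):
    """Rank terms by their TF-IDF score summed over all documents.

    Hypothetical helper: tokenization is a plain lowercase whitespace split.
    """
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)

    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

    scores = Counter()
    for tokens in tokenized:
        if not tokens:
            continue
        counts = Counter(tokens)
        for term, count in counts.items():
            tf = count / len(tokens)                # term frequency in this document
            idf = math.log(n_docs / doc_freq[term])  # 0 for terms found in every document
            scores[term] += tf * idf
    return scores.most_common(top_n)
```

With this scoring, a term that occurs in every document (such as "the") receives a score of zero, while terms concentrated in few documents rank higher.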

Parameters

The Type of Analysis parameter selects between Keyword Extraction and Group Differences [1]. Keyword Extraction extracts keywords from all the text documents and ranks them by importance, as described below. This option also allows the extraction of bigrams. Group Differences, by contrast, focuses on identifying the keywords that differentiate groups of documents. If you run this analysis, please provide an input file that includes the group labels in its third column (strings are preferable to numbers; please do not try to validate them as source weights). The Group Differences analysis ignores the bigrams and time parameters.
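The following sketch illustrates one simple way to find terms that differentiate groups. It is an assumption-laden illustration, not the tool's method: only the third-column position of the group label comes from the description above. The position of the text column (`text_col`), the absence of a header row, the function names, and the scoring rule (relative frequency inside the group minus relative frequency elsewhere) are all hypothetical.

```python
import csv
import io
from collections import Counter, defaultdict


def read_groups(csv_text, separator=",", text_col=1, group_col=2):
    """Collect document texts per group label, keeping labels as strings.

    group_col=2 is the third column; text_col is an assumption.
    """
    groups = defaultdict(list)
    for row in csv.reader(io.StringIO(csv_text), delimiter=separator):
        if len(row) > max(text_col, group_col):
            groups[row[group_col].strip()].append(row[text_col])
    return dict(groups)


def distinctive_terms(groups, target, top_n=3):
    """Rank terms by relative frequency in `target` minus that in all other groups."""

    def rel_freq(texts):
        counts = Counter(word for text in texts for word in text.lower().split())
        total = sum(counts.values()) or 1  # avoid division by zero for empty input
        return {word: n / total for word, n in counts.items()}

    inside = rel_freq(groups[target])
    outside = rel_freq(
        [text for label, texts in groups.items() if label != target for text in texts]
    )
    diffs = {word: freq - outside.get(word, 0.0) for word, freq in inside.items()}
    return sorted(diffs.items(), key=lambda item: item[1], reverse=True)[:top_n]
```

Keeping the labels as strings avoids any ambiguity between a numeric group code and a numeric weight.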

Please specify the CSV separator, the maximum number of keywords [2], the language of the analysis, whether to perform stemming [3], and the percentage of text to analyze. A value of 1 for this last field means that the full text will be analyzed; lower values restrict the analysis to the initial part of each document (e.g., 0.5 = the initial 50% of each document).
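As an illustration of the percentage-of-text setting, the sketch below keeps only the initial share of a document. Whether the tool measures that share in words or in characters is not stated here; counting words, rounding up, and the name `leading_fraction` are assumptions.

```python
import math


def leading_fraction(text, fraction=1.0):
    """Return the initial `fraction` of a document, measured in words (assumption)."""
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be greater than 0 and at most 1")
    words = text.split()
    keep = math.ceil(len(words) * fraction)  # round up so short texts keep at least one word
    return " ".join(words[:keep])
```

For example, `leading_fraction("one two three four", 0.5)` keeps the first two words, and a value of 1 returns the whole text.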

You can also specify whether to extract keywords from the full dataset or from separate time intervals. You can additionally choose to extract the most frequent bigrams.
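Extracting the most frequent bigrams can be pictured as counting pairs of adjacent words. The sketch below is only an illustration under simple assumptions (lowercase whitespace tokenization, no stop-word removal or stemming); `top_bigrams` is a hypothetical name, not part of the tool.

```python
from collections import Counter


def top_bigrams(documents, top_n=3):
    """Count adjacent word pairs across documents and return the most frequent ones."""
    counts = Counter()
    for doc in documents:
        tokens = doc.lower().split()
        # Pair each token with its successor; pairs never span two documents.
        counts.update(zip(tokens, tokens[1:]))
    return counts.most_common(top_n)
```

Running the same function separately on the documents of each time interval would give per-interval results.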

Output

In the case of Keyword Extraction type of analysis:

In the case of Group Differences type of analysis:

  1. This analysis is memory intensive; avoid running it on large corpora.

  2. Setting a high value here is recommended if the option Group Differences has been selected. 

  3. If the option is selected, stemming will be performed for all languages except Polish. For this language, the system will automatically apply lemmatization.