Keywords

This function extracts keywords, bigrams, or both from the corpus, using an algorithm based on TF-IDF.

Please specify the CSV separator, the maximum number of keywords, the language of the analysis (used for stemming), and the fraction of text to analyze. A value of 1 for this last field means that the full text is analyzed; lower values restrict the analysis to the initial part of each document (e.g., 0.5 = the first 50% of each document).
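As an illustration of the last parameter, here is a minimal Python sketch of keeping only the initial part of a document. It is not the app's actual code: the function name truncate_text is hypothetical, and measuring the fraction in whitespace-separated words (rather than characters or sentences) is an assumption.

```python
import math


def truncate_text(text: str, fraction: float = 1.0) -> str:
    """Keep the initial `fraction` of a document, measured in words.

    Assumption: the fraction is applied to whitespace-separated words;
    the app may measure it differently.
    """
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be greater than 0 and at most 1")
    words = text.split()
    # Round up so that a non-empty document always keeps at least one word.
    keep = math.ceil(len(words) * fraction)
    return " ".join(words[:keep])
```

Under these assumptions, truncate_text("one two three four", 0.5) keeps "one two", while a value of 1 returns all the words.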

You can also specify whether to extract keywords from the full dataset or separately for each time interval. You can additionally choose to extract the most frequent bigrams.

Output

The function produces a CSV file where keywords are in the first column. The second column is their frequency count, and the third is the number of documents where they appear. The fourth column contains the TF-IDF metric. The fifth column, which is used to rank keywords, contains the TF-IDF metric with L2 normalization; this matters when documents differ in length. When the analysis is carried out by time intervals, there is an additional column indicating the end date of the interval.
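To make the columns more concrete, the following Python sketch builds a comparable table from already tokenized documents. It is only an approximation under stated assumptions: the function name keyword_table is hypothetical, the smoothed IDF formula is one common variant and not necessarily the one the app uses, and summing per-document weights to get one value per keyword is also an assumption.

```python
import math
from collections import Counter


def keyword_table(docs):
    """Return rows (term, frequency, doc_count, tfidf, tfidf_l2).

    `docs` is a list of token lists. Rows are sorted by the
    L2-normalized score, highest first.
    """
    n_docs = len(docs)
    per_doc = [Counter(tokens) for tokens in docs]

    frequency = Counter()   # total occurrences of each term
    doc_count = Counter()   # number of documents containing each term
    for counts in per_doc:
        frequency.update(counts)
        doc_count.update(counts.keys())

    # Assumed smoothed IDF variant: ln((1 + N) / (1 + df)) + 1
    idf = {t: math.log((1 + n_docs) / (1 + doc_count[t])) + 1 for t in frequency}

    tfidf = Counter()
    tfidf_l2 = Counter()
    for counts in per_doc:
        weights = {t: n * idf[t] for t, n in counts.items()}
        # L2 norm of the document vector, so long documents do not dominate.
        norm = math.sqrt(sum(w * w for w in weights.values()))
        for t, w in weights.items():
            tfidf[t] += w
            tfidf_l2[t] += w / norm

    rows = [(t, frequency[t], doc_count[t], tfidf[t], tfidf_l2[t]) for t in frequency]
    rows.sort(key=lambda row: row[4], reverse=True)
    return rows
```

Dividing each document's weights by the L2 norm of that document gives every document a vector of length 1, which is why the normalized score is less sensitive to document length than the raw TF-IDF sum.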

If the bigrams option is selected, the app also produces a bigram list with terms ordered by frequency count.
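A bigram list of this kind can be sketched in a few lines of Python. The function name bigram_counts is hypothetical, and the sketch assumes documents are already tokenized and that bigrams are pairs of adjacent tokens within the same document; the app's preprocessing (such as stemming or stop-word removal) is not reproduced here.

```python
from collections import Counter


def bigram_counts(docs):
    """Count adjacent token pairs across all documents.

    `docs` is a list of token lists. Returns (bigram, count) pairs
    ordered by count, highest first.
    """
    counts = Counter()
    for tokens in docs:
        # Pair each token with its successor; pairs never span two documents.
        counts.update(zip(tokens, tokens[1:]))
    return counts.most_common()
```

For example, with the two documents ["big", "data", "tools"] and ["big", "data"], the pair ("big", "data") is counted twice and appears first in the list.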