
# Keywords

This function extracts keywords and/or bigrams from the corpus (the algorithm is based on TF-IDF and other metrics). The function also allows the comparison of keywords across groups or timeframes.

## Parameters

The Type of Analysis parameter selects between two options: Keyword Extraction and Group Differences (see note 1). The first option extracts keywords from all the text documents and ranks them by importance, as described in the Output section. This option also allows the extraction of bigrams. The Group Differences analysis, on the other hand, focuses on identifying keywords that differentiate groups of documents. If you run this analysis, please provide an input file that includes the group labels in its third column (use strings rather than numbers for the labels, and do not try to validate them as source weights). The Group Differences analysis ignores the bigrams and time parameters.

Please specify the CSV separator, the maximum number of keywords (see note 2), the language of the analysis, whether to perform stemming (see note 3), and the percentage of text to analyze. A value of 1 for this last field means that the full text will be analyzed; lower values indicate that only the initial part of each document will be analyzed (e.g., 0.5 = the initial 50% of each document).
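The percentage-of-text setting can be pictured with a minimal Python sketch. The helper name `leading_fraction` is hypothetical, and measuring the percentage in whitespace-separated words is an assumption: this page does not state how the app measures it.

```python
def leading_fraction(text: str, fraction: float) -> str:
    """Return the initial `fraction` of a document.

    Assumption: the fraction is measured in whitespace-separated words;
    the app may use a different unit (e.g. characters).
    """
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in the interval (0, 1]")
    words = text.split()
    # Number of leading words to keep, rounded down.
    keep = int(len(words) * fraction)
    return " ".join(words[:keep])


doc = "one two three four five six seven eight"
print(leading_fraction(doc, 0.5))  # one two three four
print(leading_fraction(doc, 1))    # the full text
```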

You can also specify whether to extract keywords from the full dataset or from separate time intervals. Additionally, you can choose to extract the most frequent bigrams.
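For readers unfamiliar with bigrams, the following is a minimal sketch of counting pairs of adjacent words and ordering them by frequency. The function name `top_bigrams` and the simple regular-expression tokenizer are assumptions made for illustration, not the app's actual preprocessing.

```python
import re
from collections import Counter


def top_bigrams(documents, n=10):
    """Return the `n` most frequent bigrams across a list of texts."""
    counts = Counter()
    for text in documents:
        # Assumed tokenization: lowercase word characters only.
        tokens = re.findall(r"\w+", text.lower())
        # Pair each token with its successor; bigrams never span two documents.
        counts.update(zip(tokens, tokens[1:]))
    return counts.most_common(n)


docs = ["machine learning is fun", "machine learning needs data"]
print(top_bigrams(docs, 1))  # [(('machine', 'learning'), 2)]
```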

## Output

For the Keyword Extraction type of analysis:

• The function produces a CSV file containing keywords in the first column. The second column is their frequency count, and the third is the number of documents where they appear. In the fourth column, the TF-IDF metric is computed. The fifth column is used to rank keywords. It considers the TF-IDF metric with l2 normalization, which is important when documents can have different lengths. When the analysis is carried out by time intervals, there is an additional column indicating the end date of the interval.
• If the bigrams option is selected, the app also produces a bigram list with terms ordered by frequency count.
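The keyword columns described above can be approximated with a short Python sketch. It is an illustration under stated assumptions, not the app's algorithm: the IDF variant (`log(N / df) + 1`), the tokenizer, the aggregation of per-document scores by summation, and the name `keyword_table` are all hypothetical choices.

```python
import math
import re
from collections import Counter, defaultdict


def keyword_table(documents):
    """Return rows of (keyword, frequency, doc_count, tfidf, tfidf_l2)."""
    tokenized = [re.findall(r"\w+", d.lower()) for d in documents]
    n_docs = len(tokenized)

    freq = Counter()       # total occurrences of each term
    doc_count = Counter()  # number of documents containing each term
    for tokens in tokenized:
        freq.update(tokens)
        doc_count.update(set(tokens))

    # Assumed IDF variant; the app's exact formula is not documented here.
    idf = {t: math.log(n_docs / doc_count[t]) + 1.0 for t in doc_count}

    tfidf = defaultdict(float)     # summed raw TF-IDF
    tfidf_l2 = defaultdict(float)  # summed TF-IDF after per-document L2 normalization
    for tokens in tokenized:
        weights = {t: c * idf[t] for t, c in Counter(tokens).items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        for t, w in weights.items():
            tfidf[t] += w
            # Dividing by the document norm reduces the effect of document length.
            tfidf_l2[t] += w / norm

    rows = [(t, freq[t], doc_count[t], tfidf[t], tfidf_l2[t]) for t in freq]
    rows.sort(key=lambda r: r[4], reverse=True)  # rank by the fifth column
    return rows


for row in keyword_table(["apple banana apple", "banana cherry"]):
    print(row)
```

The L2 step divides each document's TF-IDF vector by its Euclidean norm, so a long document does not dominate the ranking simply because it contains more words.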

For the Group Differences type of analysis:

• The function produces a CSV file containing keywords in the first column. The subsequent columns are generated for each group label:
• The prefix RelFreq_ indicates, for each group, the sum over the group's documents of the word's frequency count divided by the document length. TotalRelFrequency is the sum of these columns.
• The prefix DocCount_ indicates the number of documents in that group where the word appears. TotalDocs is the sum of these columns.
• Columns with the suffix _SampleDocs are provided for convenience and report the total number of documents for their group.
• Values in the columns with the suffix _Combined are calculated by using the following formula, which indicates the importance of the term i for the generic group A:

$TF_d^i$ is the frequency of occurrence of the term $i$ in the document $d$. $Len_d$ is the length of the document $d$, measured as its number of words. $A$ is the set of documents in group A, and $D$ is the set of all documents, with $|A|$ and $|D|$ indicating the cardinalities of the two sets. $I_{(TF_d^i>0)}$ is an indicator function that equals 1 if the frequency of term $i$ in document $d$ is greater than zero, i.e., if $i$ appears in $d$.

• Values in the columns with the _CombinedDocsOnly suffix are calculated in the same way as _Combined, but without the $\sum_{d \in A}\frac{TF_d^i}{Len_d}$ and $|A|/|D|$ terms in the importance formula.
• Columns with the suffix _CombinedDiff and _CombinedDocsOnlyDiff are obtained by subtracting the values of the other columns from the main one, considering either the _Combined metric or the _CombinedDocsOnly one.
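
The RelFreq_, DocCount_ and _SampleDocs columns listed above can be sketched in Python as follows. This is a minimal illustration: the function name `group_stats`, the input shape (a list of `(text, group_label)` pairs) and the tokenizer are assumptions, and the _Combined metrics are deliberately left out.

```python
import re
from collections import Counter, defaultdict


def group_stats(documents):
    """Compute per-group columns from (text, group_label) pairs."""
    rel_freq = defaultdict(lambda: defaultdict(float))
    doc_count = defaultdict(lambda: defaultdict(int))
    sample_docs = defaultdict(int)

    for text, group in documents:
        tokens = re.findall(r"\w+", text.lower())  # assumed tokenizer
        sample_docs[group] += 1
        if not tokens:
            continue
        for term, tf in Counter(tokens).items():
            # Term frequency divided by document length, summed within the group.
            rel_freq[term][group] += tf / len(tokens)
            doc_count[term][group] += 1

    rows = {}
    for term in rel_freq:
        row = {}
        for group, n in sample_docs.items():
            row[f"RelFreq_{group}"] = rel_freq[term].get(group, 0.0)
            row[f"DocCount_{group}"] = doc_count[term].get(group, 0)
            row[f"{group}_SampleDocs"] = n
        row["TotalRelFrequency"] = sum(rel_freq[term].values())
        row["TotalDocs"] = sum(doc_count[term].values())
        rows[term] = row
    return rows


docs = [("good good service", "A"), ("bad service", "B"), ("good price", "A")]
print(group_stats(docs)["good"])
```

In this toy input, "good" appears only in group A, so its DocCount_B and RelFreq_B are zero, which is the kind of contrast the Group Differences analysis is designed to surface.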

1. The Group Differences analysis is memory intensive; it is better not to run it on large corpora.

2. Setting a high maximum number of keywords is recommended if the Group Differences option has been selected.

3. If the option is selected, stemming will be performed for all languages except Polish. For Polish, the system will automatically apply lemmatization instead.