Text Analysis
The module computes a set of textual and semantic metrics for the documents in the corpus. A brief description is provided in the following sections.
Analyses
Basic textual metrics (such as the number of types and tokens for each document) are calculated at every run. In addition, you can choose to carry out the following, more advanced analyses:
Complexity
: calculates the language complexity of each document. The function provides several metrics: the number of words of six or more letters (absolute and relative frequencies), the average word length, and other complexity scores calculated using the TF-IDF function and considering the word frequency distribution. The function also calculates the numerical intensity and readability of the text (Gunning-Fog index).
Sentiment
: calculates the average language sentiment of each document. Scores range from -1 to 1, where positive values indicate positive sentiment and negative values indicate negative sentiment.
Emotions partial
: calculates several dimensions of the language used (such as the degree of positive and negative emotions or the language orientation towards the past or future). Scores are normalized considering the length of the document and can range from 0 to 100. To obtain the raw scores, you can multiply values by WordCountOriginal and divide by 100.
Emotions full
: calculates additional emotions, such as anger or joy, as well as scores for valence, dominance, and arousal. These scores are inspired by the NRC lexicon (for more information, see https://saifmohammad.com/WebPages/NRC-Emotion-Lexicon.htm). Valence, dominance, and arousal can be positive or negative (without a predefined range). The other emotion scores are normalized by the length of the document and can range from 0 to 100. To obtain the raw scores, multiply the values by WordCountOriginal and divide by 100, as in the sketch below.
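As a minimal sketch, assuming the “Emotions full” analysis was run and using the “AngerNRCCount” and “WordCountOriginal” columns described in the Output section (the new column name is illustrative):

```python
import pandas as pd

df = pd.read_csv("TextAnalysis.csv")

# Undo the 0-100 length normalization to recover raw word counts.
# "AngerNRCCount" is one of the emotion columns produced by the
# "Emotions full" analysis (see the Output section).
df["AngerRawCount"] = df["AngerNRCCount"] * df["WordCountOriginal"] / 100
```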
Crovitz
: calculates the relative frequencies of Crovitz’s relational words appearing in each document.
Parameters
Csv separator
: specifies the separator used in the CSV file. Insert a single character, without quotes.
Percentage of text to analyze
: used when the analysis has to consider only a portion of each text document, for example, just the title and lead of online news. A value of 1 means that the entire text will be analyzed; lower values indicate a lower percentage of text to analyze (e.g., 0.5 = the initial 50% of each document).
Language
: the language of the uploaded texts. Please be consistent and analyze one language at a time.
Choose one or more analyses
: select the analyses you want to carry out.
Optional parameters
Dimensions
: this field can be used to specify custom dimensions/dictionaries, with one dictionary representing each dimension. The following syntax has to be used: "dimension_name1":["word1","word2",..], "dimension_name2":["word6","word8",..],.. The same word may be repeated in different dimensions. Please use lowercase only. Hyphens will be replaced by whitespace, so that if "zero-emission" is used, the software will count both "zero-emission" and "zero emission". The dimension name itself will not be counted as a word in the analysis. Additionally, an asterisk can be used at the end of a word to indicate that the word can be completed by any set of characters: for example, "asp*" will match both "aspirin" and "aspire". This does not work with multi-word entries. A concrete example is shown below.
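For illustration, a Dimensions value defining two hypothetical dimensions (the dimension names and words are examples only) might look like this:

```
"sustainability":["green","zero-emission","renewab*"], "innovation":["innovat*","patent"]
```

Here, "renewab*" would match, e.g., "renewable" and "renewables", and "zero-emission" would be counted both with the hyphen and as "zero emission".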
Brand list
: use this field if you also want to check for the presence of brands in the documents and then average the general results by brand. A brand can be any word, such as the name of a product or a person. List the brands without quotes, separated by commas.
Cluster brands
: sometimes it is useful to merge multiple words representing the same brand or concept, with each brand/concept represented by a set of keywords. If this is the case, you can use the cluster brands field to specify the words to merge; for example, we may want a single node for the words “pope” and “Francis”. The following syntax has to be used: "cluster1":["word1","word2",..], "cluster2":["word6","word8",..],.. The same word cannot appear in multiple clusters. Hyphens cannot be used in clustered words; please replace them with whitespace (e.g., to cluster the word “zero-emission”, put “zero emission” in the cluster). Additionally, an asterisk can be used at the end of a word to indicate that the word can be completed by any set of characters: for example, "asp*" will match both "aspirin" and "aspire". This does not work with multi-word entries. All words in a cluster will be replaced with the cluster label, as in the example below.
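Continuing the example above, a Cluster brands value merging the words “pope” and “francis” under a single label might look like this:

```
"pope":["pope","francis"]
```

Every occurrence of either word would then be counted under the cluster label “pope”.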
Time unit
and Time intervals
: use these fields if you also want to average results by time intervals. If the “full dataset” option is selected, the analysis will run on all the text documents without considering time intervals.
Start time
and End time
: define the start and end times of the analysis.
Output
Documents are analyzed following their order in the input file. The function will produce the following files:
- “TextAnalysis.csv”: this is the main file with the results. A brief explanation of its columns is provided below. Please note that, depending on the parameters you set, some of the columns listed here might not appear in the file:
- “Date” is the document date.
- “WordCountOriginal” is the number of words (tokens) in the original document before text preprocessing.
- “WordCountClean” is the number of words (tokens) in the clean document after text preprocessing.
- “Types” is the number of unique word stems.
- “LexicalDiversity” is the ratio of unique word stems (“Types”) to the total number of words (“WordCountClean”).
- “Hapaxes” is the number of words that appear only once within a document. For example, the sentence “Sun, look at the sun!” will have 3 hapaxes, 4 types, and 5 tokens.
- If brands were specified in the parameters, a series of columns named after the brands might appear. Cells in these columns have the value 1 if the brand name appears in the document and 0 otherwise.
- “Sentiment” is a measure of the language sentiment, calculated for the entire document. Scores will differ from those of sentiment calculated specifically for brands (see the main analysis). The calculation of sentiment also differs depending on the language of the documents (for example, we use the VADER lexicon for English). Scores range from -1 (negative) to +1 (positive).
- “LoughranSentimentNEG”, “LoughranSentimentPOS”, “LoughranUncertainty”, “LoughranModalStrong” and “LoughranModalWeak” columns report the scores obtained using the Loughran-McDonald lexicon, which has proved useful for the calculation of uncertainty and of positive and negative sentiment in financial contexts. The lexicon also includes dictionaries for weak and strong modals. Scores represent the total frequency of the words of each dictionary appearing in the text, divided by the document length and multiplied by 100 (a sketch of this scoring is provided after this list). This function is only available for some languages.
- “AvgWlen” is the average length of words in the documents, measured by the average number of characters.
- “SixLtrs” and “SixLtrs_perc” are the number of words in the text that have six or more letters and their ratio to the total number of words in the document.
- “GunningFog” is the Gunning-Fog index, used to measure the readability of a text document (the standard formula is sketched after this list).
- “NumDigits” is the count of numbers (expressed with digits) appearing in the text, divided by the document length (number of words) and multiplied by 100.
- “NumIntensity” is the sum of the values in the columns labeled “NumDigits”, “Number”, and “NumOperations”. This dimension captures the amount of quantitative information provided in the document, counting numerical terms (including integers, numbers in lexical format, and terms referring to numerical operations) and dividing this count by the total word count. See the related paper for more information and a use case.
- “ComplexityStDev” is a measure of language complexity calculated as the standard deviation of the frequency counts of the words appearing in a document. You might consider normalizing this value to take the document length into account.
- “ComplexityTFIDFsum” is a measure of the informativeness of the text, calculated following a TF-IDF logic. The idea is that a text is more informative if it has more words that deviate from the common language. Longer texts will get higher informativeness scores. The formula used is the same as “ComplexityTFIDFavg”, without the 1/n term.
- “ComplexityTFIDFavg” is a measure of the average complexity/distinctiveness of the text, calculated following a TF-IDF logic. The idea is that a document is distinctive if it contains words that do not commonly appear in the other documents, and if these words are not lost in uninformative text blobs. For more information and the exact formula, see the “Measuring linguistic distinctiveness” section of the related paper; a sketch of the logic is provided after this list.
- Regarding the columns “valence”, “arousal” and “dominance”, please refer to the documentation of the NRC-VAD lexicon. The software calculates scores for these three dimensions, which can be negative or positive (without a predefined range). The contribution of each word appearing in a document is summed up, after rescaling the values of the original lexicon from [0,1] to [-1,1] (see the sketch after this list).
- For columns from “AngerNRCCount” to “sadnessIntensity” please refer to the documentation of the NRC lexicon. The software calculates scores for the emotions of anger, anticipation, disgust, fear, joy, sadness, surprise, and trust expressed in the text. It will also show the aggregate counts of words representing positive and negative emotions. Please consider that, for each emotion, you will get two scores: the first one (“Count”) represents the percentage of words in the text related to each emotion, with values from 0 to 100; the second one (“Intensity”) considers the intensity of words (e.g., attributing a higher intensity score to the word “magnificent” than to the word “pretty”).
- For columns from “Posemo” to “NumOperations”, please refer to the documentation of the LIWC software and its related publications. The SBS BI app maps dimensions similar to those of LIWC, but using different dictionaries; we kept the labeling scheme to support researchers already familiar with it. Scores represent the total frequency of the words of each dimension appearing in the text, divided by the document length and multiplied by 100. Please note that the names of these columns might change depending on the language chosen for the analysis. To interpret the labeling, please refer to the LIWC software dictionaries and manual.
- Columns from “about” to “with” represent the frequency counts of Crovitz’s relational words appearing in each text document. Scores can vary from 0 to 100. Please note that column names will change depending on the language chosen for the analysis (the original list is in English). This function is only available for some languages.
- The file will end with additional columns representing the custom dimensions you might have specified in the parameters. Columns will be labeled with the names you assigned to each dimension. Scores will represent the total frequency of words of each dimension appearing in the text, divided by the document length and multiplied by 100.
- “TextAnalysisBrandsTime.csv”: this is the same as the previous file, but with results averaged by brand and/or time interval, if these were specified.
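The minimal sketches below illustrate some of the computations described in this section. They are illustrations under stated assumptions (toy dictionaries, simplified tokenization, standard textbook formulas), not the app’s actual implementation.

A dictionary-based percentage score, as used for the Loughran-McDonald columns, the LIWC-like dimensions, and the custom dimensions:

```python
def dictionary_score(tokens, dictionary):
    # Percentage of document tokens belonging to a dictionary:
    # total frequency / document length * 100, as described above.
    hits = sum(1 for t in tokens if t in dictionary)
    return 100 * hits / len(tokens)

# Toy example with an illustrative "uncertainty" dictionary.
tokens = "the outlook remains uncertain and results may vary".split()
print(dictionary_score(tokens, {"uncertain", "may", "approximately"}))  # 25.0
```

The standard Gunning-Fog formula (conventionally, “complex words” are those with three or more syllables; the app’s exact criterion is not documented here):

```python
def gunning_fog(n_words, n_sentences, n_complex_words):
    # Standard Gunning-Fog readability index.
    return 0.4 * (n_words / n_sentences + 100 * n_complex_words / n_words)
```

The ComplexityTFIDF logic, assuming a standard TF-IDF weighting (the exact formula is in the referenced paper and may differ):

```python
import math

def complexity_tfidf(doc_counts, doc_freq, n_docs):
    # doc_counts: word -> count in this document
    # doc_freq: word -> number of corpus documents containing the word
    total = sum(count * math.log(n_docs / doc_freq[w])
                for w, count in doc_counts.items())
    n = sum(doc_counts.values())  # document length in tokens
    return total, total / n       # ComplexityTFIDFsum, ComplexityTFIDFavg
```

The valence/arousal/dominance aggregation, with each lexicon value rescaled from [0, 1] to [-1, 1]:

```python
def vad_score(tokens, lexicon):
    # Sum each word's contribution after rescaling the original
    # NRC-VAD value from [0, 1] to [-1, 1].
    return sum(2 * lexicon[t] - 1 for t in tokens if t in lexicon)
```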