This function can be used to pre-process and clean the documents, with the option to remove stop-words and perform language stemming or lemmatization. It can also be used to substitute words in the text and to calculate corpus statistics. In particular, the software will calculate: the number of Tokens and Types, the Type-Token Ratio, the number of Hapaxes, and the Hapax-Type Ratio.
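As a rough illustration of the corpus statistics listed above, the Python sketch below computes them for a list of tokens. It assumes the usual definitions (Types are distinct tokens, Hapaxes are types that occur exactly once, and each ratio divides the first quantity by the second). The function name `corpus_statistics` is hypothetical, and the software's own tokenization and counting rules may differ.

```python
from collections import Counter


def corpus_statistics(tokens):
    """Return basic lexical statistics for a list of tokens.

    Assumed definitions (not taken from the software itself):
    types = distinct tokens, hapaxes = types occurring exactly once.
    """
    counts = Counter(tokens)
    n_tokens = len(tokens)
    n_types = len(counts)
    n_hapaxes = sum(1 for c in counts.values() if c == 1)
    return {
        "tokens": n_tokens,
        "types": n_types,
        "type_token_ratio": n_types / n_tokens if n_tokens else 0.0,
        "hapaxes": n_hapaxes,
        "hapax_type_ratio": n_hapaxes / n_types if n_types else 0.0,
    }


# Whitespace tokenization is used here only to keep the example short.
stats = corpus_statistics("the cat sat on the mat".split())
print(stats)
```

In this toy sentence "the" occurs twice, so there are 6 tokens, 5 types, and 4 hapaxes.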
Csv separator: specifies the separator used in the CSV file. Insert a single character, without quotes.
Percentage of text to retain: useful to retain only the initial part of each text document, for example, just the title and lead of online news. A value of 1 means that the full text will be analyzed; lower values indicate a lower percentage of text to analyze (e.g., 0.5 = the initial 50% of each document).
Stemming or lemmatization: choose one of the two operations. You can also skip both by choosing “SKIP” as the language for stemming/lemmatization.
Language for stopwords: used to load language-specific stopword dictionaries. Choose “SKIP” to avoid removing stopwords.
Language for stemming/lemmatization: indicates the language used for stemming or lemmatization. Choose “SKIP” to avoid both operations.
Custom stopwords: can be used to specify custom stopwords, i.e., words that will be ignored during the analysis. List custom stopwords separated by commas, without quotes. Multi-word entries (e.g., formula 1) are allowed.
Cluster word substitution: sometimes it is useful to merge multiple words that represent a common concept, where each concept is described by a set of keywords. In this case, you can use the cluster word substitution field to specify the words to merge. For example, we may want a single word to replace both “pope” and “Francis”. The following syntax has to be used: "cluster1":["word1","word2",..], "cluster2":["word6","word8",..],.. The same word cannot appear in multiple clusters. Hyphens cannot be used in cluster words; replace them with a whitespace (e.g., to replace the word “zero-emission”, put “zero emission” in the cluster). An asterisk can be used at the end of a word to indicate that the word may be completed with any sequence of characters. For example, "asp*" will match both "aspirin" and "aspire". The asterisk does not work with multi-word entries. All words in a cluster will be replaced with the cluster label.
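The cluster word substitution described above can be sketched in Python as follows. This is only an illustration of the documented behavior, not the software's actual implementation: the function names `parse_clusters`, `build_pattern`, and `substitute_clusters` are hypothetical, and case-insensitive matching on word boundaries is an assumption.

```python
import json
import re


def parse_clusters(field):
    """Parse the field content, e.g. '"pope":["pope","francis"]', into a dict.

    Assumption: the field is the body of a JSON object without the outer braces.
    """
    clusters = json.loads("{" + field + "}")
    seen = {}
    for label, words in clusters.items():
        for word in words:
            # The same word cannot appear in multiple clusters.
            if word in seen:
                raise ValueError(f"{word!r} is listed in {seen[word]!r} and {label!r}")
            seen[word] = label
    return clusters


def build_pattern(word):
    """Turn a cluster word into a regex; a trailing asterisk matches any word ending."""
    if word.endswith("*"):
        return re.escape(word[:-1]) + r"\w*"
    return re.escape(word)


def substitute_clusters(text, clusters):
    """Replace every cluster word found in the text with its cluster label."""
    for label, words in clusters.items():
        # Longer entries first, so multi-word entries win over their parts.
        ordered = sorted(words, key=len, reverse=True)
        alternatives = "|".join(build_pattern(w) for w in ordered)
        text = re.sub(r"\b(?:" + alternatives + r")\b", label, text,
                      flags=re.IGNORECASE)
    return text


# Hypothetical field content; "asp_word" is just an example label.
clusters = parse_clusters('"pope":["pope","francis"], "asp_word":["asp*"]')
result = substitute_clusters("Francis took an aspirin", clusters)
print(result)  # pope took an asp_word
```

Clusters are applied one after another in this sketch, so a label produced by an earlier cluster could in principle be matched by a later one; choosing labels that do not match any cluster word avoids that.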