
# Corpus Cleaner

This function pre-processes and cleans text documents, optionally removing stop-words and applying stemming or lemmatization.

## Parameters

• Csv separator: specifies the separator used in the CSV file. Enter a single character, without quotes.
• Percentage of text to retain: useful to retain only the initial part of each text document, for example, just the title and lead of online news. A value of 1 means that the full text will be analyzed; lower values indicate a lower percentage of text to analyze (e.g., 0.5 = the initial 50% of each document).
• Stemming or lemmatization: choose which of the two operations to apply. To skip both, select “SKIP” as the language for stemming/lemmatization.
• Language for stopwords: this is used to load language-based stop-word dictionaries. Choose “SKIP” to avoid removing stop-words.
• Language for stemming/lemmatization: indicates the language used for stemming or lemmatization. Choose “SKIP” to avoid both operations.
• Custom stopwords: additional words that will be ignored during the analysis. List them separated by commas, without quotes. Multi-word entries (e.g., formula 1) are allowed.
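
To make the effect of the "Percentage of text to retain" and "Custom stopwords" parameters more concrete, here is a minimal Python sketch. It is not the app's actual implementation: the function names (`retain_initial`, `parse_custom_stopwords`, `remove_stopwords`, `clean`) are hypothetical, and it assumes that the retained percentage is measured in words and that matching is case-insensitive, neither of which is confirmed by this page.

```python
import re


def retain_initial(text: str, fraction: float) -> str:
    """Keep the initial share of a text, measured in words (an assumption)."""
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    words = text.split()
    if not words:
        return ""
    # Round down, but always keep at least one word.
    keep = max(1, int(len(words) * fraction))
    return " ".join(words[:keep])


def parse_custom_stopwords(raw: str) -> list[str]:
    """Split a comma-separated list, trimming spaces and lowercasing."""
    return [w.strip().lower() for w in raw.split(",") if w.strip()]


def remove_stopwords(text: str, stopwords: list[str]) -> str:
    """Remove whole-word or whole-phrase matches, case-insensitively."""
    # Longest entries first, so multi-word phrases go before their parts.
    for sw in sorted(stopwords, key=len, reverse=True):
        pattern = r"\b" + r"\s+".join(re.escape(p) for p in sw.split()) + r"\b"
        text = re.sub(pattern, " ", text, flags=re.IGNORECASE)
    # Collapse the whitespace left behind by removed words.
    return " ".join(text.split())


def clean(text: str, fraction: float = 1.0, custom_stopwords: str = "") -> str:
    """Truncate the text, lowercase it, then drop the custom stopwords."""
    shortened = retain_initial(text, fraction)
    return remove_stopwords(shortened.lower(), parse_custom_stopwords(custom_stopwords))


if __name__ == "__main__":
    doc = "Formula 1 race results are in today and more follows"
    print(clean(doc, fraction=0.5, custom_stopwords="formula 1, are"))
```

With a fraction of 0.5, only the first five of the ten words are kept, and the multi-word entry `formula 1` is removed as a single unit. Language-based stop-word dictionaries and stemming or lemmatization are left out of this sketch, since they depend on external language resources.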