
# Text Classification

This is an experimental function created to extract the text features that best distinguish different groups of documents. It uses different machine learning algorithms and NLP procedures to make a prediction. Feature importance is evaluated using each feature’s Shapley values.
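The idea behind Shapley values can be illustrated with a minimal, pure-Python sketch that enumerates all feature coalitions exactly (feasible only for a handful of features; the toy value function and feature names below are hypothetical, not part of this function):

```python
from itertools import combinations
from math import factorial

def shapley_values(features, value):
    """Exact Shapley values by enumerating every coalition of features."""
    n = len(features)
    phi = {}
    for f in features:
        others = [g for g in features if g != f]
        total = 0.0
        for k in range(n):
            for coalition in combinations(others, k):
                s = len(coalition)
                # Shapley weight for a coalition of size s
                weight = factorial(s) * factorial(n - s - 1) / factorial(n)
                # marginal contribution of f to this coalition
                total += weight * (value(set(coalition) | {f}) - value(set(coalition)))
        phi[f] = total
    return phi

# Hypothetical "model output" for each subset of present features.
scores = {frozenset(): 0.0, frozenset({"w1"}): 0.6,
          frozenset({"w2"}): 0.2, frozenset({"w1", "w2"}): 1.0}
v = lambda s: scores[frozenset(s)]

phi = shapley_values(["w1", "w2"], v)
# Efficiency property: the contributions sum to v(all) - v(empty).
```

In practice the function approximates these values with a SHAP-style explainer rather than exact enumeration, which is exponential in the number of features.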

This is a beta function that is constantly being improved. Also, it is very resource intensive. Please only use it with small datasets.

## Parameters

Please provide an input CSV file that includes the group labels in its third column (string labels work better than numbers; please do not try to validate them as source weights). Please note that the function only works for classification problems, NOT for regression.

• CSV separator: specifies the separator used in the CSV file. Insert a single character without quoting.
• Percentage of text to analyze: used when the analysis has to consider only a portion of each text document, for example, just the title and lead of online news. A value of 1 means that the entire text will be analyzed; lower values indicate a lower percentage of text to analyze (e.g., 0.5 = the initial 50% of each document).
• Language: this is the language of uploaded texts (please be consistent and try to analyze one language at a time).
• Stemming: choose whether to apply stemming or not.
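The separator and text-percentage parameters can be sketched together in a few lines of pure Python. The column layout below (text in the first column, label in the third, as the docs state) and the helper name `load_documents` are assumptions for illustration:

```python
import csv
import io

def load_documents(csv_text, sep=",", text_fraction=1.0):
    """Parse a delimited file and keep only the initial fraction of each text.

    Assumes the document text is in column 1 and the group label in
    column 3 (the label position is stated in the docs; the text
    position is an assumption).
    """
    rows = csv.reader(io.StringIO(csv_text), delimiter=sep)
    docs, labels = [], []
    for row in rows:
        text, label = row[0], row[2]
        keep = max(1, int(len(text) * text_fraction))  # e.g. 0.5 = initial 50%
        docs.append(text[:keep])
        labels.append(label)
    return docs, labels

sample = "the quick brown fox jumps;doc1;news\nlorem ipsum dolor sit amet;doc2;blog\n"
docs, labels = load_documents(sample, sep=";", text_fraction=0.5)
```

With `text_fraction=0.5`, only the first half of each document (by character count) is kept, mirroring the "initial 50%" behavior described above.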

• Classification method:
• “Bag of Words - Xgboost”: this algorithm converts a collection of text documents into a document-term matrix, which is further transformed if the TF-IDF option is selected. Xgboost is then applied to classify documents and evaluate the most prominent features.
• “Bag of Words - Random Forest”: same as the one above, but with the Random Forest method used in place of Xgboost.
• Max features: this is the maximum number of features (words or groups of words) that the classification algorithms will consider.
• Cross Validation Folds: the number of folds used for cross-validation of results and the assessment of model fit. Values of 3, 5, or 10 are recommended, depending on the sample size. If the value of 1 is selected, the algorithm will work on all data, using the full sample as the training set.
• Percentage of the test set used to create the evaluation set: some algorithms (such as Xgboost) might use an evaluation set that the function will extract (and exclude) from the test set. This field is used to specify the proportion of the test set used to create the evaluation set. With cross-validation, multiple test and evaluation sets will be created.
• Early Stopping: this parameter is considered by some algorithms (such as Xgboost) that have an Early Stopping option, useful to avoid overfitting.
• Search for Optimal Parameters: choose whether to perform a randomized search of the optimal hyperparameters (with cross-validation, 3 folds).
• TF-IDF: choose whether to use TF-IDF in text vectorization.
• SMOTE: choose whether to use the SMOTE algorithm on the train set to handle class imbalance. This choice does not affect the test and evaluation sets.
• Fixed Random State: if selected, it will make results reproducible, as all the algorithms will use a fixed and common seed for their random parts.
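The bag-of-words step with the Max features and TF-IDF options can be sketched in pure Python (the actual function very likely relies on a library vectorizer; the smoothed IDF formula below is one common variant, chosen here as an assumption):

```python
from collections import Counter
from math import log

def bag_of_words(docs, max_features=None, tfidf=False):
    """Build a document-term matrix, optionally TF-IDF weighted."""
    tokenized = [doc.lower().split() for doc in docs]
    # document frequency of each term
    df = Counter()
    for toks in tokenized:
        df.update(set(toks))
    # keep the max_features most frequent terms (ties broken alphabetically)
    vocab = sorted(df, key=lambda w: (-df[w], w))[:max_features]
    n = len(docs)
    matrix = []
    for toks in tokenized:
        tf = Counter(toks)
        row = []
        for w in vocab:
            weight = tf[w]
            if tfidf and weight:
                weight *= log(n / df[w]) + 1.0  # smoothed IDF (one common variant)
            row.append(weight)
        matrix.append(row)
    return vocab, matrix

vocab, matrix = bag_of_words(["cat sat", "cat ran", "dog ran"])
```

Each row of the matrix corresponds to a document and each column to a vocabulary term; the classifier (Xgboost or Random Forest) is then fit on these rows.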

## Output

The function will produce the following files:

• “Fit.csv”: this file reports fit statistics for each cross-validation fold: the accuracy, the area under the ROC curve (for binary classification problems), and Cohen’s Kappa.
• “Shap.csv”: this file reports the Shapley values of each model feature, indicating its importance for the prediction of each document class. In particular, for each class and feature, the file includes the average of the Shapley values and the average of their absolute values. Please note that this represents a feature’s contribution to the model prediction (if the prediction is wrong, the score is of little value).
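The two per-class, per-feature statistics reported in “Shap.csv” (the signed mean and the mean of absolute values) can be sketched as a small aggregation. The input layout used here, a list of `(class_label, feature, shap_value)` triples, is an assumption for illustration:

```python
from statistics import mean

def summarize_shap(shap_rows):
    """Aggregate per-document Shapley values into (mean, mean of |value|)
    for each (class, feature) pair -- the two columns described for Shap.csv.
    Input layout is a hypothetical list of (class_label, feature, value) triples."""
    grouped = {}
    for cls, feat, val in shap_rows:
        grouped.setdefault((cls, feat), []).append(val)
    return {key: (mean(vals), mean(abs(v) for v in vals))
            for key, vals in grouped.items()}

# Hypothetical per-document values for one class/feature pair.
summary = summarize_shap([("news", "tax", 0.2), ("news", "tax", -0.4)])
```

The signed mean shows whether a feature pushes predictions toward or away from a class on average, while the mean absolute value measures its overall importance regardless of direction.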