This is an experimental function created to extract the text features that best discriminate between different groups of documents. It combines machine learning algorithms and NLP procedures to make predictions. Feature importance is evaluated using each feature's Shapley values.
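To illustrate the idea behind Shapley values (this is a generic sketch, not the function's actual implementation), the snippet below computes exact Shapley values for a toy additive "model" by enumerating all feature coalitions. The feature names and the `value_fn` are invented for the example; exact enumeration is exponential, so it only suits tiny feature sets.

```python
from itertools import combinations
from math import factorial

def shapley_values(features, value_fn):
    """Exact Shapley values over a small feature set.

    value_fn maps a frozenset of features to a model score.
    """
    n = len(features)
    phi = {}
    for f in features:
        others = [g for g in features if g != f]
        total = 0.0
        for k in range(n):
            for subset in combinations(others, k):
                s = frozenset(subset)
                # Classic Shapley weight: |S|! (n - |S| - 1)! / n!
                weight = factorial(k) * factorial(n - k - 1) / factorial(n)
                total += weight * (value_fn(s | {f}) - value_fn(s))
        phi[f] = total
    return phi

# Toy "model": the score is a weighted sum of the features present.
weights = {"word_a": 0.6, "word_b": 0.3, "word_c": 0.1}
score = lambda s: sum(weights[f] for f in s)

phi = shapley_values(list(weights), score)
# For a purely additive model, each Shapley value equals the feature's own weight.
```

In practice, libraries approximate these values efficiently for real models instead of enumerating all coalitions.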
This is a beta function that is constantly being improved. It is also very resource intensive, so please use it only with small datasets.
Please provide an input CSV file that includes the group labels in its third column (strings are preferable to numbers). Please note that the function only works for classification problems and NOT for regression.
CSV separator: specifies the character used as the field separator in the CSV file. Enter a single character without quotes.
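As a minimal illustration of these two points (the file content and separator below are invented for the example), this is how a semicolon-separated file with group labels in the third column can be read:

```python
import csv
import io

# Hypothetical input: a semicolon-separated file whose third column
# holds the group label (a string, as recommended).
raw = "id;text;label\n1;first document;news\n2;second document;blog\n"

reader = csv.reader(io.StringIO(raw), delimiter=";")  # single-character separator
header = next(reader)
labels = [row[2] for row in reader]  # third column: group labels
```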
Percentage of text to analyze: used when the analysis should consider only a portion of each text document, for example just the title and lead of online news. A value of 1 means that the entire text will be analyzed; lower values analyze only the initial portion of each document (e.g., 0.5 = the first 50% of each document).
Language: the language of the uploaded texts (please be consistent and analyze one language at a time).
Stemming: choose whether to apply stemming or not.
Max features: the maximum number of features (words or groups of words) that the classification algorithms will consider.
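One common way to apply such a cap (a generic sketch, not necessarily this function's exact selection rule) is to keep only the most frequent tokens across the corpus:

```python
from collections import Counter

def top_features(documents, max_features):
    """Keep only the `max_features` most frequent tokens across all documents."""
    counts = Counter(token for doc in documents for token in doc.lower().split())
    return [token for token, _ in counts.most_common(max_features)]

docs = ["the cat sat", "the cat ran", "a dog ran"]
vocab = top_features(docs, 3)  # the three most frequent tokens
```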
Cross Validation Folds: the number of folds used for cross-validation of results and the assessment of model fit. Values of 3, 5, or 10 are recommended, depending also on the sample size. If the value 1 is selected, the algorithm will work on all data, using the full sample as the training set.
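The fold mechanics can be sketched as follows (a generic, unstratified split over sample indices; the function itself may stratify by class):

```python
def kfold_indices(n_samples, n_folds):
    """Split sample indices into `n_folds` (train, test) index pairs."""
    indices = list(range(n_samples))
    fold_size, remainder = divmod(n_samples, n_folds)
    folds, start = [], 0
    for i in range(n_folds):
        # Early folds absorb the remainder so every sample is used exactly once.
        stop = start + fold_size + (1 if i < remainder else 0)
        test = indices[start:stop]
        train = indices[:start] + indices[stop:]
        folds.append((train, test))
        start = stop
    return folds

splits = kfold_indices(10, 5)
# Each sample appears in exactly one test fold across the 5 splits.
```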
Percentage of the test set used to create the evaluation set: some algorithms (such as XGBoost) may use an evaluation set, which the function extracts (and excludes) from the test set. This field specifies the proportion of the test set set aside for evaluation. With cross-validation, multiple test and evaluation sets will be created.
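A minimal sketch of carving an evaluation subset out of a test set (the shuffling and seed are illustrative choices, not the function's documented behavior):

```python
import random

def split_eval_from_test(test_indices, eval_fraction, seed=42):
    """Move a fraction of the test set into a separate evaluation set."""
    rng = random.Random(seed)  # illustrative seed
    shuffled = test_indices[:]
    rng.shuffle(shuffled)
    n_eval = int(len(shuffled) * eval_fraction)
    # Evaluation samples are excluded from the remaining test set.
    return shuffled[n_eval:], shuffled[:n_eval]

test_idx = list(range(20))
remaining_test, eval_set = split_eval_from_test(test_idx, 0.25)
# 25% of the 20 test samples (5) go to evaluation; the two sets are disjoint.
```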
Early Stopping: this parameter is used by algorithms (such as XGBoost) that support an early stopping option, which helps avoid overfitting.
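The logic behind early stopping can be sketched as a loop that monitors a validation loss and stops after a fixed number of rounds without improvement (the `losses` list and `patience` value below are invented for the example):

```python
def train_with_early_stopping(losses, patience=3):
    """Stop when the validation loss has not improved for `patience` rounds.

    `losses` stands in for the per-round validation losses produced
    during training. Returns the index of the best (lowest-loss) round.
    """
    best_loss = float("inf")
    best_round = 0
    rounds_without_improvement = 0
    for i, loss in enumerate(losses):
        if loss < best_loss:
            best_loss, best_round = loss, i
            rounds_without_improvement = 0
        else:
            rounds_without_improvement += 1
            if rounds_without_improvement >= patience:
                break  # early stop: further rounds would likely overfit
    return best_round

# Validation loss improves, then degrades: training stops after 3 flat rounds.
best = train_with_early_stopping([0.9, 0.7, 0.6, 0.61, 0.62, 0.63, 0.64])
```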
Search for Optimal Parameters: choose whether to perform a randomized search for the optimal hyperparameters (with 3-fold cross-validation).
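At its core, a randomized search samples hyperparameter combinations at random and keeps the best-scoring one. The sketch below omits the cross-validation step and uses an invented toy objective:

```python
import random

def randomized_search(param_space, score_fn, n_iter=10, seed=0):
    """Sample `n_iter` random hyperparameter combinations; keep the best."""
    rng = random.Random(seed)
    best_params, best_score = None, float("-inf")
    for _ in range(n_iter):
        params = {name: rng.choice(values) for name, values in param_space.items()}
        score = score_fn(params)  # in practice, a cross-validated score
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

# Toy objective: prefers a deeper tree and a learning rate near 0.1.
space = {"max_depth": [2, 4, 6], "learning_rate": [0.01, 0.1, 0.3]}
score = lambda p: p["max_depth"] - abs(p["learning_rate"] - 0.1)
best_params, best_score = randomized_search(space, score, n_iter=20)
```

Unlike an exhaustive grid search, only a fixed number of combinations is evaluated, which keeps the cost predictable.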
TF-IDF: choose whether to use TF-IDF weighting in text vectorization.
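For reference, TF-IDF weights each term by its frequency in a document, discounted by how many documents contain it. The sketch below uses one common smoothed-IDF variant; the function's vectorizer may use a slightly different formula:

```python
import math
from collections import Counter

def tf_idf(documents):
    """TF-IDF per document, with smoothed IDF: log((1 + N) / (1 + df))."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    df = Counter()  # document frequency of each term
    for tokens in tokenized:
        df.update(set(tokens))
    weights = []
    for tokens in tokenized:
        tf = Counter(tokens)
        weights.append({
            term: (count / len(tokens)) * math.log((1 + n_docs) / (1 + df[term]))
            for term, count in tf.items()
        })
    return weights

docs = ["the cat sat", "the dog ran"]
w = tf_idf(docs)
# "the" appears in every document, so its smoothed IDF (and weight) is 0;
# distinctive terms like "cat" receive a positive weight.
```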
SMOTE: choose whether to apply the SMOTE algorithm to the training set to handle class imbalance. This choice does not affect the test and evaluation sets.
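SMOTE oversamples the minority class by interpolating between existing minority samples rather than duplicating them. The sketch below conveys only the interpolation idea: real SMOTE interpolates toward one of the k nearest neighbours, while here a random other sample is used for brevity, and the data points are invented:

```python
import random

def smote_like_oversample(minority, n_new, seed=0):
    """SMOTE-style sketch: synthesize minority samples by interpolation.

    Real SMOTE picks among the k nearest neighbours; this simplified
    version interpolates toward any other minority sample.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        a = rng.choice(minority)
        b = rng.choice([x for x in minority if x is not a])
        gap = rng.random()  # position along the segment between a and b
        synthetic.append([ai + gap * (bi - ai) for ai, bi in zip(a, b)])
    return synthetic

minority = [[1.0, 1.0], [2.0, 2.0], [1.5, 3.0]]
new_points = smote_like_oversample(minority, 4)
# Each synthetic point lies on a segment between two real minority samples.
```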
Fixed Random State: if selected, results become reproducible, as all the algorithms will use a fixed, shared seed for their random components.
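The effect of a fixed seed is easy to demonstrate (the seed value 42 below is purely illustrative):

```python
import random

def shuffled_sample(data, seed=None):
    """Shuffle a copy of the data; a fixed seed makes the order reproducible."""
    rng = random.Random(seed)
    out = data[:]
    rng.shuffle(out)
    return out

data = list(range(10))
run_1 = shuffled_sample(data, seed=42)
run_2 = shuffled_sample(data, seed=42)
# run_1 == run_2: identical seeds yield identical "random" results,
# which is exactly what makes a fixed random state reproducible.
```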
The function will produce the following files: