This function computes the similarity between individual documents or groups of documents provided in the input file. It uses a bag-of-words approach with a TF-IDF transformation and L2 normalization, to account for potential differences in text length.
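As a rough sketch of this pipeline (not the application's actual code; tokenization and weighting details are simplified), the bag-of-words, TF-IDF and L2 steps can be illustrated as follows:

```python
import math

def tfidf_l2(docs):
    """Toy bag-of-words pipeline: TF-IDF with idf(t) = log(n / df(t)) + 1
    and L2-normalized rows, so that dot products between rows are cosine
    similarities regardless of text length."""
    tokenized = [d.lower().split() for d in docs]
    vocab = sorted({w for d in tokenized for w in d})
    n = len(tokenized)
    df = {w: sum(1 for d in tokenized if w in d) for w in vocab}
    rows = []
    for d in tokenized:
        vec = [d.count(w) * (math.log(n / df[w]) + 1) for w in vocab]
        norm = math.sqrt(sum(x * x for x in vec)) or 1.0
        rows.append([x / norm for x in vec])  # L2 normalization
    return vocab, rows

def cosine(u, v):
    # on L2-normalized vectors, cosine similarity is just the dot product
    return sum(a * b for a, b in zip(u, v))

docs = ["red car fast car", "red bike", "fast red car"]
_, m = tfidf_l2(docs)
```

With these three toy documents, the first and third texts (which share most words) come out far more similar than the first and second.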

`Csv separator`

: specifies the separator used in the CSV file. Insert a single character, without quotes.

`Language`

: the language of the uploaded texts (please be consistent and analyze one language at a time).

`Use Groups`

: specifies how documents should be grouped before calculating similarities.

- *“no”*: similarity is calculated document by document.
- *“by date”*: documents are grouped by date.
- *“by third CSV column”*: documents are grouped according to the label provided in the third column of the input file.
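A minimal illustration of the three grouping modes (the column names and the `;` separator below are assumptions, not the tool's actual field names):

```python
import csv, io
from collections import defaultdict

# Hypothetical three-column input: text, date, group label.
raw = ("text;date;label\n"
       "first doc;2020;BrandA\n"
       "second doc;2020;BrandB\n"
       "third doc;2021;BrandA\n")
rows = list(csv.DictReader(io.StringIO(raw), delimiter=";"))

def group_texts(rows, mode):
    """'no' keeps every document separate; 'by date' and 'by third CSV column'
    concatenate all texts sharing the same key before similarities are computed."""
    if mode == "no":
        return {i: r["text"] for i, r in enumerate(rows)}
    key = "date" if mode == "by date" else "label"
    grouped = defaultdict(list)
    for r in rows:
        grouped[r[key]].append(r["text"])
    return {k: " ".join(texts) for k, texts in grouped.items()}
```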

`Minimum word frequency`

: a percentage indicating the minimum frequency a word must have to be considered in the computation of similarities. For example, a value of 0.001 means that words appearing in less than 0.1% of the documents will be discarded.

`Maximum word frequency`

: a percentage indicating the maximum frequency a word can have to be considered in the computation of similarities. For example, a value of 0.8 means that words appearing in more than 80% of the documents will be discarded.

`Max number of words to consider`

: the maximum number of words to consider in the document-term matrix, after filtering by minimum and maximum word frequencies.

`Preprocess Text (Cleaning)`

: choose whether to pre-process the input file to remove stopwords, punctuation, etc., and to apply stemming. This is highly recommended. In addition to the usual preprocessing carried out by the application's other functions (such as stemming), this function also removes numbers and words that begin with digits.

`Dichotomize Matrix`

: if selected, the occurrence frequencies of words are not considered and the document-term matrix is binarized. This does not apply if SBS is the chosen similarity method.

`Similarity Method`

: choose the method used to calculate the similarity between documents or groups of documents. Cosine similarity is computed on a (documents x terms) matrix generated with either the TF-IDF method or the SBS method.

- In the first case, we use idf(t) = log [ n / df(t) ] + 1, where n is the total number of documents in the document set and df(t) is the document frequency of t. Please be aware that our calculation of IDF and the removal of overly rare and common words consider the *total number of groups* and not that of *individual documents* (unless "no" is selected for the Use Groups parameter).
- The SBS method applies a different normalization of the frequency matrix, which is multiplied by the sum of the standardized diversity and connectivity values of relevant words in the semantic network. The network is generated considering a maximum co-occurrence range of 5 words, with an automatic determination of the threshold for discarding negligible co-occurrences.
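The removal of overly rare and common words can be sketched as follows (Python, not the application's actual code; the tie-breaking rule used when capping the vocabulary is an assumption):

```python
def filter_vocabulary(tokenized_docs, min_freq, max_freq, max_words):
    """Keep words whose document frequency (as a share of the documents or
    groups being compared) lies within [min_freq, max_freq], then cap the
    vocabulary at max_words, preferring the most frequent terms."""
    n = len(tokenized_docs)
    df = {}
    for doc in tokenized_docs:
        for w in set(doc):          # count each word once per document
            df[w] = df.get(w, 0) + 1
    kept = [w for w, c in df.items() if min_freq <= c / n <= max_freq]
    kept.sort(key=lambda w: (-df[w], w))  # most frequent first, then alphabetically
    return kept[:max_words]

docs = [["the", "cat"], ["the", "dog"], ["the", "cat"],
        ["the", "fish"], ["the", "cat"]]
# "the" appears in 100% of documents (> 0.8) and "dog"/"fish" in 20% (< 0.25),
# so only "cat" (60%) survives the filter.
vocab = filter_vocabulary(docs, 0.25, 0.8, 1000)
```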

`Clustering`

: choose a clustering algorithm to be applied to the similarity matrix. Only PAM (k-medoids) is implemented at the moment.

`Number of Clusters`

: indicates the number of clusters to be determined. If zero is chosen, the system automatically searches for the best number of clusters, evaluating options in the range from 2 to 50 and considering the mean Silhouette Coefficient of all samples.
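The automatic search can be sketched as below (a toy illustration: `chunk_labels` is a naive stand-in for the PAM/k-medoids step the software actually runs):

```python
def mean_silhouette(dist, labels):
    """Mean silhouette coefficient of all samples, given a precomputed
    distance matrix (list of lists) and cluster labels."""
    clusters = {}
    for i, c in enumerate(labels):
        clusters.setdefault(c, []).append(i)
    scores = []
    for i in range(len(labels)):
        own = [j for j in clusters[labels[i]] if j != i]
        if not own:                 # singleton cluster: score defined as 0
            scores.append(0.0)
            continue
        a = sum(dist[i][j] for j in own) / len(own)
        b = min(sum(dist[i][j] for j in members) / len(members)
                for c, members in clusters.items() if c != labels[i])
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

def best_k(dist, cluster_fn, k_min=2, k_max=50):
    """Evaluate every candidate k and keep the one with the highest mean
    silhouette; cluster_fn(dist, k) stands in for the clustering step."""
    k_max = min(k_max, len(dist) - 1)
    return max(range(k_min, k_max + 1),
               key=lambda k: mean_silhouette(dist, cluster_fn(dist, k)))

# Toy example: six points on a line forming two obvious clusters.
points = [0.0, 0.1, 0.2, 10.0, 10.1, 10.2]
dist = [[abs(p - q) for q in points] for p in points]

def chunk_labels(dist, k):          # naive stand-in, not a real clustering
    n = len(dist)
    return [i * k // n for i in range(n)]
```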

`Output Format`

: choose whether to save the similarity values as a list (smaller file size) or as a matrix.

`Create Similarity Network`

: choose whether to create a similarity network in the Pajek format.

`Calculate Innovation/Impact`

: choose whether to calculate innovation and impact scores instead of brand similarity. To calculate these scores, the third column of the input file must provide a label of the form LABEL_PERIOD, where PERIOD must be an integer and LABEL is an arbitrary string.

`Innovation Span (0 = no limit)`

: the time span (expressed as a number of periods) used for calculating innovation and impact. For instance, a value of 3, with periods expressed in years, means that the software will incorporate data from 3 years before and 3 years after a given group of texts. A value of zero means that calculations will encompass all time periods before and after a given one. Be cautious, as this method calculates similarity for documents at each time step by considering sets of periods of different lengths before and after. Please also consider potential truncation biases.

`Custom stopwords`

: can be used to specify custom stopwords, i.e., words that will be ignored during the analysis. List custom stopwords separated by commas, without quotes. Entries of multiple words (e.g., "formula 1") are allowed.
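A minimal sketch of how such a list could be applied (the whole-phrase, case-insensitive, word-boundary matching below is an assumption about the implementation):

```python
import re

def remove_custom_stopwords(text, stopword_list):
    """Delete comma-separated custom stopwords from a text; multi-word
    entries such as "formula 1" are matched as whole phrases,
    case-insensitively and on word boundaries."""
    for sw in (s.strip() for s in stopword_list.split(",")):
        if sw:
            text = re.sub(r"\b" + re.escape(sw) + r"\b", " ", text,
                          flags=re.IGNORECASE)
    return " ".join(text.split())   # collapse leftover whitespace

cleaned = remove_custom_stopwords("Formula 1 is fast, very fast",
                                  "formula 1, very")
```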

The function will produce the following files:

- *“TxtSimilarity.csv”*: a file listing all possible pairs of groups and their similarity values, excluding self-pairs and repeated pairs, since similarity is symmetric.
- *“TxtSimilarity.net”*: a network file in Pajek format in which the groups for which similarity has been calculated are nodes, and links are weighted by the similarity value between two nodes (groups).
- *“SemanticNet.net”*: a network file in Pajek format representing the overall semantic network, generated when the SBS similarity method is selected.
- *“Clustering.csv”*: a file with the results of the clustering algorithm; each document or group of documents is assigned a cluster number.
- *“SilhouetteScores.csv”*: produced when the optimal number of clusters is determined by the system rather than specified by the user. For each clustering option, it reports the corresponding Silhouette Coefficient (mean of all samples).
- *“Impact.csv”*: displays the impact and innovation scores of each group of texts labeled as LABEL_PERIOD, along with variables measuring new words and new words reused. Each metric is calculated as follows:
  - *Novelty* is the average cosine distance of one group of texts from those written earlier.
  - *Impact* is the average cosine similarity to the groups of texts written after the time period under analysis (excluding the analyzed period itself).
  - *Impact/(1-Novelty)* is the ratio of the similarity with future documents to the similarity with past documents. This approach closely follows the methodology of Kelly et al. (2018).
  - *New Words* and *New Words Reused* are computed as in Arts et al. (2021); *NewWordsReuseNet* is obtained by subtracting the former from the latter.
  - *NewWordsPerc* measures the proportion of new words that appear in the vocabulary of subsequent documents (or groups of documents), averaged over those documents.
  - *NumReuse* is the number of future documents (or groups) that use at least one of the newly introduced words; *AtLeastTwoWords* is the number of documents that include at least two of them. The other output columns follow a similar logic.
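The label parsing and the core Novelty/Impact scores can be sketched as follows (a simplified illustration, not the software's actual code; it assumes a precomputed similarity matrix between period groups):

```python
def parse_label(tag):
    """Split a 'LABEL_PERIOD' tag into (label, period); the label itself
    may contain underscores, the period must be an integer."""
    name, _, period = tag.rpartition("_")
    return name, int(period)

def novelty_and_impact(sim, periods, t, span=0):
    """sim[i][j]: cosine similarity between the groups at periods[i] and
    periods[j]. Novelty = 1 - mean similarity to earlier groups; Impact =
    mean similarity to later groups. span limits how far back and forward
    we look (0 = no limit), mirroring the Innovation Span parameter."""
    i = periods.index(t)
    past = [j for j, p in enumerate(periods)
            if p < t and (span == 0 or t - p <= span)]
    future = [j for j, p in enumerate(periods)
              if p > t and (span == 0 or p - t <= span)]
    novelty = 1 - sum(sim[i][j] for j in past) / len(past) if past else None
    impact = sum(sim[i][j] for j in future) / len(future) if future else None
    return novelty, impact

# Three yearly groups: the 2011 group is dissimilar to the past (0.2)
# and very similar to the future (0.9).
sim = [[1.0, 0.2, 0.4], [0.2, 1.0, 0.9], [0.4, 0.9, 1.0]]
nov, imp = novelty_and_impact(sim, [2010, 2011, 2012], 2011)
```

Boundary periods illustrate the missing values mentioned below: the earliest period has no past to compare with, so its Novelty is undefined.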

- There might be missing values, for example when no previous years are available for calculating similarity with the past, or when the TF-IDF transformation has removed all words from some document vectors.
- Please note that the TF-IDF transformation applied here differs from the one used in the primary similarity analysis. Our calculation of IDF and the removal of overly rare and common words consider the *total number of individual documents* rather than the number of *groups*. In addition, the IDF is calculated by *time periods* and not by groups of documents. The *NewWords* computation considers all documents from previous periods, ignoring the innovation span parameter. Conversely, *Novelty* is calculated solely on the timeframe specified by the innovation span parameter.
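The *NewWords* logic described above can be sketched as follows (a toy version: it compares a period's vocabulary with the union of all earlier periods' vocabularies, ignoring the span):

```python
def new_words(vocab_by_period, t):
    """Words appearing in period t for the first time. The comparison uses
    ALL earlier periods, as the NewWords columns ignore the innovation
    span parameter."""
    earlier = set()
    for p, vocab in vocab_by_period.items():
        if p < t:
            earlier |= set(vocab)
    return set(vocab_by_period[t]) - earlier

vocabs = {2010: {"engine", "wheel"},
          2011: {"wheel", "hybrid"},
          2012: {"hybrid", "electric"}}
```

In the earliest period every word is trivially "new", which is another reason to interpret scores at the boundaries of the time range with care.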