Text Similarity
This function computes the similarity between individual documents or groups of documents provided in the input file. The default technique uses a bag-of-words approach with a subsequent TF-IDF transformation and L2 normalization to account for potential differences in text length.
Parameters
Csv separator
: specifies the separator used in the CSV file. Insert a single character without quoting.
Language
: the language of the uploaded texts (please be consistent and analyze one language at a time).
Use Groups
: specifies how documents should be grouped before calculating similarities.
- “no”: similarity is calculated document by document.
- “by date”: documents are grouped by date.
- “by third CSV column”: documents are grouped according to the label provided in the third column of the input file.
Minimum word frequency
: a percentage indicating the minimum frequency a word must have to be considered in the computation of similarities. For example, a value of 0.001 means that words appearing in less than 0.1% of the documents will be discarded.
Maximum word frequency
: a percentage indicating the maximum frequency a word can have to be considered in the computation of similarities. For example, a value of 0.8 means that words appearing in more than 80% of the documents will be discarded.
Max # of Matrix Words
: maximum number of words to consider in the doc-term matrix, after having filtered by maximum and minimum word frequencies. This does not impact variables such as "NewWords".
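For illustration, the combined effect of the two frequency filters and the word cap can be approximated with scikit-learn's CountVectorizer (a minimal sketch of the general idea, not the tool's actual implementation):

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["the cat sat on the mat", "the dog sat on the log", "cats and dogs play"]

vectorizer = CountVectorizer(
    min_df=0.001,       # discard words appearing in less than 0.1% of documents
    max_df=0.8,         # discard words appearing in more than 80% of documents
    max_features=5000,  # keep at most 5,000 words after frequency filtering
)
dtm = vectorizer.fit_transform(docs)  # sparse document-term matrix
```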
Preprocess Text (Cleaning)
: choose whether to pre-process the input file to remove stopwords, punctuation, etc., and to apply stemming. This is highly recommended. In addition to the usual preprocessing performed by the application's other functions (such as stemming), this function also removes numbers and words that begin with digits.
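A rough sketch of this cleaning step, using NLTK for stopwords and stemming (illustrative only; the application's actual pipeline may differ in its details):

```python
import re

from nltk.corpus import stopwords              # requires nltk.download("stopwords")
from nltk.stem.snowball import SnowballStemmer

stemmer = SnowballStemmer("english")
stop_words = set(stopwords.words("english"))

def preprocess(text: str) -> str:
    text = re.sub(r"[^\w\s]", " ", text.lower())  # strip punctuation
    tokens = [t for t in text.split()
              if t not in stop_words              # remove stopwords
              and not t[0].isdigit()]             # remove numbers and words starting with digits
    return " ".join(stemmer.stem(t) for t in tokens)

print(preprocess("The 3rd quarter revenues grew 12% in 2023!"))
# -> "quarter revenu grew"
```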
Dichotomize Matrix
: if selected, the occurrence frequencies of each word will not be considered, and the document-term matrix will be binarized. This does not apply if the method chosen for calculating similarity is SBS.
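For example, dichotomization turns raw counts into simple presence/absence indicators (a toy example, not the tool's code):

```python
import numpy as np

dtm = np.array([[3, 0, 1],          # word counts for document 1
                [0, 2, 2]])         # word counts for document 2
dtm_binary = (dtm > 0).astype(int)  # 1 if the word occurs at all, 0 otherwise
# -> [[1, 0, 1], [0, 1, 1]]
```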
Similarity Method
: choose the method for the calculation of similarity between documents or groups of documents.
Cosine Similarity is computed on a matrix (documents x terms) generated using the TF-IDF method or the SBS method.
- In the first case, we use idf(t) = log [ (1 + n) / (1 + df(t)) ] + 1, where n is the total number of documents in the document set and df(t) is the document frequency of t. Please be aware that our calculation of IDF and the removal of overly rare and common words consider the total number of groups and not that of individual documents (unless "no" is selected for the Use Groups parameter).
- The SBS method involves a different normalization of the frequency matrix that is multiplied by the sum of the standardized diversity and connectivity values of relevant words in the semantic network. The network is generated considering a maximum co-occurrence range of 5 words and with an automatic determination of the filter value on negligible co-occurrences.
In both methods, words that are too frequent or too infrequent are excluded, L2 normalization is applied, and the maximum number of words specified by the user is enforced.
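As a reference, scikit-learn's TfidfTransformer with its default settings reproduces the IDF formula above and the L2 normalization, so the TF-IDF branch can be sketched as follows (an approximation of the general approach, not the tool's exact code):

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.metrics.pairwise import cosine_similarity

docs = ["solar energy panel", "wind energy turbine", "solar panel efficiency"]
dtm = CountVectorizer().fit_transform(docs)

# smooth_idf=True implements idf(t) = log((1 + n) / (1 + df(t))) + 1,
# and norm="l2" applies the L2 normalization mentioned above
tfidf = TfidfTransformer(smooth_idf=True, norm="l2").fit_transform(dtm)

sim = cosine_similarity(tfidf)  # symmetric (documents x documents) matrix
```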
Alternatively, the system uses transformers to generate document or group embeddings, and then applies cosine similarity to calculate the similarity between them. Please note that this approach is memory-intensive and may not work with large datasets. In addition, this method requires inputting raw text, not text that has already been processed (for example, text that has been stemmed). If using transformers, all options related to the similarity matrix will be disabled.
The system allows a choice of embedding models: one general multilingual model, paraphrase-multilingual-MiniLM-L12-v2; one faster general model for English, all-MiniLM-L6-v2; and another specifically fine-tuned for patents written in English, patent/sbert-all-MiniLM-L6-v2. Please consider that different models might better suit your analysis. If that is the case, we recommend using your own code.
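A minimal sketch of the transformer branch, using the sentence-transformers library with one of the models listed above (remember to feed raw, unprocessed text):

```python
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

texts = ["Raw, unprocessed document text.", "Another raw document."]
embeddings = model.encode(texts)     # one dense vector per document (or group)
sim = cosine_similarity(embeddings)  # pairwise cosine similarity of embeddings
```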
Clustering
: choose a clustering algorithm to be applied to the similarity matrix. Only PAM (k-medoids) is currently implemented.
Number of Clusters
: indicates the number of clusters to be determined. If a value of zero is chosen, the system will perform an automatic search for the best number of clusters, evaluating options in the range from 2 to 50 and considering the mean Silhouette Coefficient of all samples.
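The automatic search can be pictured as follows, assuming scikit-learn-extra's KMedoids (PAM) on a precomputed distance matrix derived from the similarities (a sketch of the logic, not the tool's implementation):

```python
import numpy as np
from sklearn.metrics import silhouette_score
from sklearn_extra.cluster import KMedoids

def best_n_clusters(similarity: np.ndarray, k_range=range(2, 51)) -> int:
    distance = 1.0 - similarity  # turn similarities into distances
    scores = {}
    for k in k_range:
        labels = KMedoids(n_clusters=k, metric="precomputed",
                          method="pam", random_state=0).fit_predict(distance)
        # mean Silhouette Coefficient of all samples for this choice of k
        scores[k] = silhouette_score(distance, labels, metric="precomputed")
    return max(scores, key=scores.get)
```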
Output Format
: choose whether to save the similarity values as a list (smaller size) or as a matrix.
Create Similarity Network
: choose whether to create a similarity network in the Pajek format.
Calculate Innovation/Impact
: choose whether to calculate innovation, uniqueness, and impact scores instead of text similarity. To calculate these scores, the third column of the input file must contain a label of the form LABEL_PERIOD or LABEL_PERIOD_CATEGORY, where PERIOD must be an integer and LABEL is an arbitrary string. CATEGORY can be used to calculate both intra- and extra-category scores. For example, for a patent, its CPC class could serve as a category label, allowing us to assess its impact both within and outside its own class.
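For instance, a label such as US123_2015_G06F could be decomposed as follows (illustrative only, and assuming LABEL itself contains no underscores):

```python
label = "US123_2015_G06F"                  # LABEL_PERIOD_CATEGORY
name, period, category = label.split("_")
period = int(period)                       # PERIOD must be an integer
```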
Innovation Span (0 = no limit)
: select the time span (expressed in number of periods) for calculating innovation and impact. For instance, inputting a value of 3, with periods expressed in years, means that the software will incorporate data from 3 years before and 3 years after a specific group of texts. Opting for a value of zero means that calculations will encompass all time periods before and after a given one. Be cautious, as this method calculates similarity for documents at each time step by considering sets of periods of different lengths before and after. Please also consider potential truncation biases.
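The windowing logic can be illustrated as follows (a hypothetical helper, not the tool's code):

```python
def span_windows(t: int, periods: list[int], span: int = 3):
    """Return the past and future periods considered for period t."""
    past = [p for p in periods if p < t and (span == 0 or t - p <= span)]
    future = [p for p in periods if p > t and (span == 0 or p - t <= span)]
    return past, future

print(span_windows(2015, list(range(2010, 2021)), span=3))
# -> ([2012, 2013, 2014], [2016, 2017, 2018])
```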
Custom stopwords
: can be used to specify custom stopwords, i.e., words that will be ignored during the analysis. List custom stopwords separated by commas, without quotes. Multi-word expressions (e.g., "formula 1") are also allowed.
Output
The function will produce the following files:
- “TxtSimilarity.csv”: a file listing all possible pairs of groups and their similarity values, excluding self-pairs and repeated relationships, as similarity is symmetric.
- “TxtSimilarity.net”: a network file in Pajek format representing the groups for which similarity has been calculated as nodes, and where links are weighted based on the similarity value between two nodes (groups).
- “SemanticNet.net”: a network file in Pajek format representing the overall semantic network, generated when the SBS similarity method is selected.
- “Clustering.csv”: a file with the results of the clustering algorithm. Each document or group of documents is assigned a cluster number.
- “SilhouetteScores.csv”: this file is produced when the optimal number of clusters is determined by the system and is not specified by the user. For each clustering option, the file reports the corresponding Silhouette Coefficient (mean of all samples).
- “Impact.csv”: this file displays the impact and innovation scores of each group of texts labeled as LABEL_PERIOD (or LABEL_PERIOD_CATEGORY), along with variables measuring new words and new words reused. Documents with the same period, label, and category (if any) will be grouped as one.
Each metric is calculated as detailed in the following:
- Novelty is calculated as the average cosine distance of one group of texts from those written earlier. When using the TF-IDF method, we calculate IDF using past documents within the innovation span. For words that only appear in present docs (therefore not included in the IDF), their weight is assigned as log(N+1), where N represents the total number of past documents. Words that are too frequent are identified for removal based on their occurrence in present and past documents (within the innovation span). Words too rare in this span are not removed here, but removal may happen when considering the Max # of Matrix Words.
- Impact is calculated as the average cosine similarity to the groups of texts written after the time period under analysis (excluding the analyzed period itself). When using the TF-IDF method, we calculate the IDF using both present and past documents within the innovation span. For words that only appear in future docs (therefore not included in the IDF), their weight is assigned as log(N+1), where N represents the total number of present and past documents. Words that are too frequent are identified for removal based on their occurrence in present and past documents (within the innovation span). Words too rare in this span are not removed here, but removal may happen when considering the Max # of Matrix Words.
- Uniqueness is calculated as the average cosine distance of one group of texts from those written in the same period. When using the TF-IDF method, we calculate the IDF using past documents within the innovation span. For words that only appear in present docs (therefore not included in the IDF), their weight is assigned as log(N+1), where N represents the total number of past documents. Words that are too frequent are identified for removal based on their occurrence in present and past documents (within the innovation span). Words too rare in this span are not removed here, but removal may happen when considering the Max # of Matrix Words.
- If a Category is specified, additional columns will appear in the output, displaying Novelty, Uniqueness, and Impact based on documents that are either within the category or outside the category (for example, a patent CPC class).
- Impact/(1-Novelty) is the ratio of similarity with future documents to the similarity with past documents. This approach is closely aligned with the methodology described by Kelly et al. (2018).
Note: Some variables have the suffix "MIN," indicating that the minimum distance (representing the maximum similarity) with other documents has been calculated instead of the average. For example, this is the case for NoveltyMIN, ImpactMIN, UniquenessMIN, and Impact/(1-NoveltyMIN).
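Under the assumption of L2-normalized TF-IDF row vectors (where a dot product equals cosine similarity), the scores can be sketched as follows (a hypothetical helper, not the tool's implementation):

```python
import numpy as np

def group_scores(current: np.ndarray, past: np.ndarray,
                 same_period: np.ndarray, future: np.ndarray) -> dict:
    # rows are L2-normalized, so matrix-vector products give cosine similarities
    novelty = float(np.mean(1.0 - past @ current))            # avg distance from the past
    uniqueness = float(np.mean(1.0 - same_period @ current))  # avg distance from peers
    impact = float(np.mean(future @ current))                 # avg similarity to the future
    novelty_min = float(np.min(1.0 - past @ current))         # "MIN" variant: closest past doc
    return {"Novelty": novelty, "Uniqueness": uniqueness, "Impact": impact,
            "NoveltyMIN": novelty_min,
            "Impact/(1-Novelty)": impact / (1.0 - novelty)}
```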
When using the TF-IDF method:
- the software also computes NewWords and NewWordsReused as in Arts et al. (2021). The baseline for identifying a word as new is the first period in the sample, and results for this period should therefore be disregarded. NewWordsReuseNet is obtained by subtracting NewWords from NewWordsReused.
- NewWordsPerc measures the proportion of new words that appear in the vocabulary of subsequent documents (or groups of documents), and then averages these values.
- NumReuse is the number of future documents (or groups) that use at least one of the newly introduced words. AtLeastTwoWords is the number of future documents (or groups) that include at least two of the newly introduced words. The other output columns are calculated following a similar logic.
- The columns NewWordsThresholdPast, NewWordsReusedThresholdPast, and all other columns with the "ThresholdPast" suffix indicate that past words were considered only within the innovation span time periods, rather than across the entire sample of past periods for each period of analysis.
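The NewWords logic can be illustrated on per-period vocabularies (a simplified sketch of the idea described above, not the exact computation):

```python
def new_word_counts(vocab_by_period: dict[int, set[str]]) -> None:
    seen: set[str] = set()
    for t in sorted(vocab_by_period):
        new = vocab_by_period[t] - seen        # words never used in earlier periods
        future_vocab = set().union(*(v for u, v in vocab_by_period.items() if u > t))
        reused = new & future_vocab            # new words picked up by later periods
        print(t, "NewWords:", len(new), "NewWordsReused:", len(reused))
        seen |= vocab_by_period[t]

# period 1 is the baseline, so its counts should be disregarded
new_word_counts({1: {"solar", "panel"}, 2: {"solar", "grid"}, 3: {"grid"}})
```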
Notes:
- There might be missing values, for example, in cases where no previous years are available for calculating similarity with the past, or where the TF-IDF transformation has removed all words from some document vectors.
- Please note that the TF-IDF transformation applied here differs from that of the primary similarity analysis. Our calculation of IDF and the removal of overly rare and common words consider the total number of individual documents rather than the number of groups. In addition, the IDF is calculated by time periods and not by groups of documents.
- The NewWords computation considers all documents from previous periods, ignoring the innovation span parameter. Conversely, Novelty and, for example, NewWordsThresholdPast are calculated solely on the timeframe specified by the innovation span parameter.