SBS and Topic Modeling
These functions produce the main analyses and reports offered by our app. They
calculate the Semantic Brand Score and generate graphical reports according to the uploaded text file and the chosen parameters.
- Clicking on Run will start the analysis, which could take some time. You can exit the app and come back to check the status, or just click on the Check Status button. You will receive an email when the analysis ends if this has been set in the parameters.
- When the analysis ends, you can download the results, the semantic networks, and the processed input, and view the reports online. Save each report you want to keep, because running a new analysis deletes the last report. Please note that each user has a maximum number of reports that can be saved.
- The analysis can be stopped in case of errors.
- Please note that some graphs (such as the topic modeling network) are only visible online.
Interpretation of reports
For a more comprehensive description of the graphs and full examples, see the papers listed here or watch the videos on this YouTube channel.
The main graphs of the report are briefly described below, and an example is provided here:
- SBS time trends: shows the time trends of the SBS scores. The first tab shows absolute (standardized) values; the second tab shows the relative values with respect to the other brands.
- Brand positioning: compares brand importance (SBS) and brand sentiment.
- SBS ranking and dimensions: shows how each dimension of the SBS contributes to the average final ranking (SBS is rescaled in the range [0,300], and each dimension score can vary from 0 to 100).
- Brand associations: shows the most frequent and unique associations, together with their frequency values.
- Most frequent words: shows the most frequent words for each time period.
- Time trends of unique associations: shows the time trends of the number of unique associations for each brand.
- Brand Image Similarity: shows the degree of similarity of different brands based on their shared associations. Brands/concepts that are associated with the same words will be close to one another in the graph. When reading the graph, focus on the distances between brands. The axes have no direct meaning (coordinates result from multidimensional scaling).
- Target words: shows a ranking of the words with the highest connectivity scores, considering the full period of analysis, with values rescaled in the range [0, 100].
- Topic modeling: shows the results of the topic modeling.
  - Network: shows a network representing the main discourse topics. Each topic is represented by a white node and a network cluster. Brand nodes are shown in red. This graph might not be visible when you download the results on your local PC. Please put the files on a server in order to see it.
  - Representative words: shows the most representative words for each topic, ranked based on their importance score.
  - Brand/Topic associations: shows how much each brand is related to the different topics (percentage) and, in the second tab, how important each brand is for the different topics. These values are calculated considering the strength of the brand connections with the most representative words of each topic (typically the top 300, not all words in the topic). In rare cases, the prevailing topic for a brand may differ from the one shown in the network graph. This is because, in the network graph, all the words of a topic are considered and not only the most representative ones. If this is the case, we suggest referring to the results of the bar charts, which are more accurate than the network.
  - Top topics: ranks the discourse topics by importance.
  - Topic connections: shows the strength of connections among topics.
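The two views of the Brand/Topic associations graph can be pictured as row and column percentages of the same table of link weights. The sketch below is only an illustration under that assumption: the app's exact calculation is not described here, and the brand names, topic names, weights, and function names are all hypothetical.

```python
def row_percentages(matrix):
    """Rescale each row of a nested dict so that its values sum to 100."""
    result = {}
    for row, cells in matrix.items():
        total = sum(cells.values())
        result[row] = {
            col: (100 * weight / total if total else 0.0)
            for col, weight in cells.items()
        }
    return result


def transpose(matrix):
    """Swap rows and columns of a nested dict."""
    flipped = {}
    for row, cells in matrix.items():
        for col, weight in cells.items():
            flipped.setdefault(col, {})[row] = weight
    return flipped


# Hypothetical link weights between each brand and the top words of each topic
weights = {
    "brand_a": {"topic1": 30.0, "topic2": 10.0},
    "brand_b": {"topic1": 10.0, "topic2": 10.0},
}

# First view: how each brand's connections are split across topics
brand_topic = row_percentages(weights)
# Second view: how each topic's connections are split across brands
topic_brand = row_percentages(transpose(weights))

print(brand_topic["brand_a"])  # share of brand_a's weight per topic
print(topic_brand["topic2"])   # share of topic2's weight per brand
```

In this toy table, brand_a dominates topic1 in both views, while topic2 is split evenly between the two brands even though it accounts for only a quarter of brand_a's connections.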
You can download the processed input files. In these files, you will find the original documents organized by period of analysis. The original text is preprocessed according to the configured parameters (e.g., removal of stopwords, stemming, general cleaning, etc.). Source weights are retained if present, and an additional column indicates the document number.
The software lets you download the semantic networks generated for each analysis period. Networks are saved in the Pajek “.net” format (see here for more information or look at the Networks page).
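As a rough illustration of how a downloaded network could be inspected, the sketch below parses a minimal Pajek-style text with a vertex section and an edge section, then sums the link weights of each word. It assumes the simplest layout of the format (numeric vertex ids, optional quoted labels, one weight per edge); the sample content and the function name `parse_pajek` are hypothetical, and real files may carry extra fields that this parser ignores or does not handle.

```python
import shlex


def parse_pajek(text):
    """Parse the *Vertices and *Edges/*Arcs sections of a Pajek-style text."""
    labels = {}   # vertex id -> label
    edges = []    # (source label, target label, weight)
    section = None
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("%"):
            continue  # skip blank lines and comments
        if line.startswith("*"):
            section = line.split()[0].lower()
            continue
        parts = shlex.split(line)  # keeps quoted labels together
        if section == "*vertices":
            labels[parts[0]] = parts[1] if len(parts) > 1 else parts[0]
        elif section in ("*edges", "*arcs"):
            weight = float(parts[2]) if len(parts) > 2 else 1.0
            edges.append((labels[parts[0]], labels[parts[1]], weight))
    return labels, edges


# Hypothetical network with one brand and two associated words
sample = """*Vertices 3
1 "brand_a"
2 "coffee"
3 "price"
*Edges
1 2 5
1 3 2
"""

labels, edges = parse_pajek(sample)

# Weighted degree: total co-occurrence weight attached to each word
strength = {}
for source, target, weight in edges:
    strength[source] = strength.get(source, 0.0) + weight
    strength[target] = strength.get(target, 0.0) + weight

print(strength)
```

For real files, a network library is likely a better choice than a hand-written parser; for example, networkx provides a `read_pajek` function that loads a ".net" file into a graph object.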
These are the files with the full results of the main analysis, the same used to generate the graphical reports. For a complete description, we recommend participating in an SBS BI workshop. Please unzip the file for a proper view of the results. We list and briefly describe each file here:
- REPORT.html: this is the graphical report you can also see online. Please note that you must put this file and its dependencies on a server to view the topic network graph.
- PARAMINFO.py: this is a file for developers, with a recap of the parameters used for the analysis.
- SBS_associations.csv: this file lists all the top brand associations (up to 300 per time period). You can see which words are most associated with each brand and the related co-occurrence frequency (“Weight”). When more than one brand is analyzed, the “Unique” column indicates whether an association is unique (the word is connected to one brand only) or not (scores of 1 indicate unique associations). The uniqueness analysis considers all the text documents and not a single analysis period. You can disregard the “UniqueWeighted” column (which results from multiplying “Unique” and “Weight”).
- SBS_assoXY.csv: this is the file resulting from Multidimensional Scaling and the calculation of brand similarity. What you see here are the coordinates of the brand similarity graph, reported for each time period and overall.
- SBS_mostcommon.csv: here, you will find a list of the most common words in the corpus for each time period. The “weight” column indicates the word frequency.
- SBS_R_SMALL.csv: this is the file with the main results of the SBS analysis. Brand scores are presented in rows, one for each time interval. The file has the following columns:
  - “NumDocs”: indicates the total number of documents in the time period (for all brands);
  - “Sentiment”: is a sentiment score. The calculation of sentiment differs depending on the language of the documents (for example, we use the VADER lexicon for English). Sentiment is calculated for brands, not the entire document (you might have a document that praises one brand and criticizes another). Scores range from -1 (negative) to +1 (positive).
  - “PR_coef”, “PR_std”, “PR_100”: “PR” stands for prevalence. The label “_coef” indicates the raw coefficient of the metric; the label “_std” indicates a standardized score obtained by subtracting the average score of all terms in the discourse from the raw coefficient and dividing by their standard deviation; the label “_100” indicates scores that vary from 0 to 100, after min-max normalization.
  - “DI_coef”, “DI_std”, “DI_100”: same as above, for diversity (calculated through distinctiveness centrality).
  - “CO_coef”, “CO_std”, “CO_100”: same as above, for connectivity (calculated through weighted betweenness centrality).
  - “SBS”: this is the Semantic Brand Score value resulting from the sum of the standardized scores of prevalence, diversity, and connectivity. It has no predefined range and can be negative when a brand scores below the average of the terms in the discourse.
  - “SBS_100”: this is the Semantic Brand Score value resulting from the sum of prevalence, diversity, and connectivity after min-max normalization. It ranges from 0 to 300.
- SBS_R_FULL.csv: this is a file with the full set of SBS metrics. This guide will not discuss the file in detail, as it is huge and mainly used by our research partners.
- SBS_targetwords.csv: this is the list of words with the highest betweenness centrality (calculated considering the full dataset). The values of weighted betweenness centrality are normalized to produce a connectivity score equal to 100 for the most central word. The other words have a score proportional to the maximum value (<= 100).
- SBS_topics_brandtopic.csv: shows how much each brand is connected to the top words of each topic, considering the absolute weight of links. Percentage values are also reported.
- SBS_topics_topicbrand.csv: similar to the previous file, this report shows how important each brand is for each topic, providing absolute and percentage values.
- SBS_topics_docs.csv: this file shows how much each document in the corpus relates to each topic.
  - “TOPIC1” indicates how many relevant words of Topic 1 appear in the text of a document, and “TOPIC1_perclen” is this number multiplied by 100 and divided by the length of the clean document (number of words).
  - “TOPIC1_weight1” and “TOPIC1_weight2” are measures that sum the rescaled importance scores of Topic 1 words that appear in a document. Note that the importance scores of the words of a topic (those reported in SBS_topics_words.csv) are rescaled before the sum. The difference between the two weighting schemes (1 and 2) is the normalization technique used for the rescaling: in the first case, raw importance values are divided by the sum of the importance scores for each topic; in the second case, we divide them by the maximum importance value for each topic. “TOPIC1_weight1_len” is “TOPIC1_weight1” divided by the length of the clean document (its number of words). The same is true for “TOPIC1_weight2_len”.
  - “HHI” and “HHI_weight” correspond to the Herfindahl–Hirschman index calculated considering the document scores on the TOPICx and TOPICx_weight columns. These scores can be used as a measure of concentration to understand whether a document is related to a few or many topics. See here for more information.
  - “Gini” and “Gini_weight” follow the same logic as the HHI values but, in this case, the Gini coefficient is calculated to represent topic inequality.
- SBS_topics_matrix.csv: this matrix represents the strength of the connections between topics. The maximum value is rescaled to 100, and the other values are rescaled proportionally.
- SBS_topics_relevance.csv: indicates the relevance score for each topic. The most important topic will have a relevance score of 100, and the other topics will have a score proportional to the maximum value. Some topics may be excluded from the analysis and not shown in the list because of their negligible importance. Accordingly, the cells “Total Word Weight”, “Weight of Selected Topics”, and “Weight of Excluded Topic” provide an indication of the weight of the excluded topics (the part of the discourse that the topic model does not represent). Lastly, the HHI and Gini indexes are provided (see above), calculated on the topic relevance scores.
- SBS_topics_words.csv: this file provides a list of the most representative words for each topic. Importance scores are calculated according to the formula presented in this paper.
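To make the score conventions in the file list above more concrete, the following sketch shows standardization, min-max normalization, and the two concentration indexes on toy data. It is not the app's implementation. The use of the population standard deviation, the HHI computed on shares that sum to 1, and the rank-based Gini formula are assumptions, and all values and function names are hypothetical. The upper bound of the min-max range is a parameter (`scale`), set here to 100 so that three dimensions sum to at most 300, in line with the [0, 300] range mentioned for the ranking graph.

```python
from statistics import mean, pstdev


def standardize(values):
    """Z-scores: subtract the mean, divide by the population standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma if sigma else 0.0 for v in values]


def min_max(values, scale=100.0):
    """Min-max normalization to the range [0, scale]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) * scale for v in values]


def hhi(scores):
    """Herfindahl-Hirschman index: sum of squared shares (assumed to sum to 1)."""
    total = sum(scores)
    if total == 0:
        return 0.0
    return sum((s / total) ** 2 for s in scores)


def gini(scores):
    """Gini coefficient from the ranks of the sorted, non-negative scores."""
    xs = sorted(scores)
    n, total = len(xs), sum(xs)
    if n == 0 or total == 0:
        return 0.0
    ranked = sum(rank * x for rank, x in enumerate(xs, start=1))
    return 2 * ranked / (n * total) - (n + 1) / n


# Hypothetical raw coefficients for four terms of a discourse
terms = ["brand_a", "brand_b", "coffee", "price"]
prevalence = [10.0, 4.0, 6.0, 0.0]
diversity = [8.0, 2.0, 5.0, 1.0]
connectivity = [30.0, 0.0, 10.0, 0.0]
dimensions = (prevalence, diversity, connectivity)

# Sum of standardized dimensions: no fixed range, can be negative
sbs = [sum(t) for t in zip(*(standardize(d) for d in dimensions))]
# Sum of min-max normalized dimensions: between 0 and 3 * scale
sbs_rescaled = [sum(t) for t in zip(*(min_max(d) for d in dimensions))]

for term, raw, rescaled in zip(terms, sbs, sbs_rescaled):
    print(term, round(raw, 2), round(rescaled, 1))

# A document spread evenly over four topics vs. one focused on a single topic
print(hhi([1, 1, 1, 1]), gini([1, 1, 1, 1]))
print(hhi([0, 0, 0, 1]), gini([0, 0, 0, 1]))
```

In both indexes, higher values point to a document (or discourse) concentrated on fewer topics, while values near the minimum indicate scores spread evenly across topics.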
On this page, you can see your saved reports and delete those you do not need anymore. Please consider that every user has a maximum number of reports that can be saved. After this limit is reached, you have to delete old reports to save new ones.