This function performs advanced NLP tasks to extract Named Entities from the text documents in the input file. For more information about Named Entity Recognition, please read here.
Time intervals: specifies the time interval for the analysis. For example, you might want to calculate a daily score or aggregate the texts produced in one week. If the “full dataset” option is selected, the analysis will run on all the text documents without considering time intervals.
Start time and End time: the start and end of the period to analyze. These settings are particularly useful when you want to restrict the analysis to a specific timeframe and the CSV you uploaded covers a longer period.
CSV separator: specifies the separator used in the CSV file. Insert a single character without quoting.
Percentage of text to analyze: used when the analysis has to consider only a portion of each text document, for example, just the title and lead of online news. A value of 1 means that the full text will be analyzed; lower values indicate a lower percentage of text to analyze (e.g., 0.5 = the initial 50% of each document).
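As an illustration of the setting above, the following minimal Python sketch keeps only the initial portion of a document. It assumes the portion is measured in whitespace-separated words and rounded up; the software may measure it differently (for example, in characters). The function name leading_portion is hypothetical.

```python
import math


def leading_portion(text: str, fraction: float) -> str:
    """Return the initial `fraction` of a document (assumption: measured in words)."""
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be greater than 0 and at most 1")
    words = text.split()
    # Round up so that a very short document still keeps at least one word.
    keep = math.ceil(len(words) * fraction)
    return " ".join(words[:keep])


# A value of 1 keeps the full text; 0.5 keeps the first half.
print(leading_portion("Title of the news. Lead paragraph. Body text follows here.", 0.5))
```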
Language: this is the language of uploaded texts (please analyze one language at a time).
Max number of Named Entities to extract: this is the maximum number of Named Entities that will be extracted by the software.
Word co-occurrence range: the maximum distance between two words for them to count as co-occurring. Values of 5 or 7 usually work well, and results are often robust to this parameter as long as it stays within a reasonable range (2 to 20).
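To make the idea of a co-occurrence range concrete, here is a small Python sketch that counts pairs of entities mentioned within a given distance of each other. It assumes each mention is stored with its word position in the document; the function name and data layout are illustrative and are not taken from the software.

```python
from collections import Counter
from itertools import combinations


def cooccurrences(mentions, max_distance):
    """Count entity pairs whose mentions are at most `max_distance` words apart.

    `mentions` is a list of (word_position, entity_name) tuples (assumed layout).
    """
    counts = Counter()
    for (pos_a, ent_a), (pos_b, ent_b) in combinations(sorted(mentions), 2):
        if ent_a != ent_b and abs(pos_b - pos_a) <= max_distance:
            # Sort the names so that (A, B) and (B, A) are the same undirected pair.
            counts[tuple(sorted((ent_a, ent_b)))] += 1
    return counts


mentions = [(0, "Rome"), (3, "Italy"), (30, "Paris")]
print(cooccurrences(mentions, max_distance=5))
```

With a range of 5, only Rome and Italy are linked here; Paris is too far from both.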
Create Social Networks: if flagged, the function will produce social networks where nodes are Named Entities and links represent their co-occurrence in the text.
Custom stopwords: can be used to specify custom stopwords, i.e., words that will be ignored during the analysis. Custom stopwords should be listed separated by a comma, without quotes. Including multiple words (e.g., formula 1) is possible.
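The expected input format can be sketched as follows. This is only an illustration of how a comma-separated list with multi-word entries might be split; the function name parse_stopwords is hypothetical, and the software may apply further normalization (such as lowercasing) that is not shown here.

```python
def parse_stopwords(raw: str) -> list[str]:
    """Split a comma-separated stopword string, keeping multi-word entries intact."""
    # Strip surrounding spaces and skip empty items left by stray commas.
    return [item.strip() for item in raw.split(",") if item.strip()]


print(parse_stopwords("said, formula 1, reuters"))
```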
The function will produce a file with the identified Named Entities ranked based on their occurrence frequencies (indicated as “Count” in the output file). If time intervals are specified, Named Entities will be listed for each interval, and a “Time” column will appear in the output file, indicating the last day of each interval. The output file also includes a code used to categorize each entity and a column describing each category. Categories might change depending on the language selected.
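The ranking described above can be approximated with a simple frequency count. In this hedged sketch, the "Count" column name comes from the output file, while the "Entity" and "Category" keys, the category codes, and the function name rank_entities are assumptions made for illustration.

```python
from collections import Counter


def rank_entities(mentions, max_entities):
    """Rank (entity, category_code) pairs by how often they occur."""
    counts = Counter(mentions)
    return [
        {"Entity": entity, "Category": category, "Count": n}
        for (entity, category), n in counts.most_common(max_entities)
    ]


# Illustrative category codes; the actual codes depend on the language selected.
mentions = [("Rome", "GPE"), ("Fiat", "ORG"), ("Rome", "GPE")]
for row in rank_entities(mentions, max_entities=10):
    print(row)
```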
If the Create Social Networks option is flagged, the function will produce social networks where nodes are Named Entities and links represent their co-occurrence in the text. One network will be generated for each time interval. Networks are saved in the Pajek “.net” format (see here for more information or look at the Networks page).
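For readers unfamiliar with the Pajek format, the sketch below writes a small undirected, weighted network as “.net” text: a *Vertices section listing numbered node labels, followed by an *Edges section with one "source target weight" line per link. The function name to_pajek and the input layout are assumptions, and only nodes that have at least one link are written; the files produced by the software may contain additional details.

```python
def to_pajek(edge_weights):
    """Serialize {(entity_a, entity_b): weight} as Pajek .net text."""
    nodes = sorted({name for pair in edge_weights for name in pair})
    index = {name: i for i, name in enumerate(nodes, start=1)}  # Pajek ids start at 1
    lines = [f"*Vertices {len(nodes)}"]
    lines += [f'{i} "{name}"' for name, i in index.items()]
    lines.append("*Edges")
    for (a, b), weight in sorted(edge_weights.items()):
        lines.append(f"{index[a]} {index[b]} {weight}")
    return "\n".join(lines) + "\n"


print(to_pajek({("Italy", "Rome"): 2, ("Fiat", "Italy"): 1}))
```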
If you compare the number of Named Entities identified in the general output file with the number of nodes in each network, the former may be higher. This can happen because the same entity may be classified more than once (e.g., once as a person’s name and once as a location). The network generator function disregards these differences in classification.
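The difference can be illustrated with a few lines of Python. The example assumes that the output file has one row per (entity, category) pair, while the network has one node per entity name; the data and category codes are made up for illustration.

```python
def count_rows_and_nodes(rows):
    """Return (number of output rows, number of distinct entity names)."""
    node_names = {name for name, _category in rows}
    return len(rows), len(node_names)


# "Washington" is classified twice, so it fills two rows but becomes one node.
rows = [("Washington", "PERSON"), ("Washington", "GPE"), ("Fiat", "ORG")]
print(count_rows_and_nodes(rows))
```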