Abstract:
Objective: To exemplify how topic modeling can be used in management research, my objectives are two-fold. First, I introduce topic modeling as a social sciences research tool and map critical published studies in management and other social sciences that employed topic modeling in a proper manner. Second, I illustrate how to do topic modeling by applying topic modeling in an analysis of the last five years of published research in this journal: the Iberoamerican Journal of Strategic Management (IJSM).
Methodology: I analyze the last five years (2014 to 2018) of published articles in the IJSM. The sample is 164 articles. The abstracts were subjected to a standard topic modeling text pre-processing routine, generating 1,252 unique tokens.
Originality/Relevance: By proposing topic modeling as a valid and opportune methodology for analyzing textual data, this study can help shift the old paradigm that textual data belongs only to the qualitative realm. Furthermore, topic modeling allows textual data to be labeled and quantified in a reproducible manner that mitigates (or almost fully eliminates) researcher bias.
Main Results: Six topics were generated through Latent Dirichlet Allocation (LDA): Topic 1 – Strategy and Competitive Advantage; Topic 2 – International Business and Top Management Team; Topic 3 – Entrepreneurship; Topic 4 – Learning and Cooperation; Topic 5 – Finance and Strategy; and Topic 6 – Dynamic Capabilities.
Theoretical/methodological Contributions: I present the state of the art of the literature published in IJSM and also show how readers can perform their own topic modeling. The full data and code used are available in free open science repositories on the Open Science Framework (OSF) and GitHub.
Keywords: Topic modeling, Latent Dirichlet allocation, Computer-aided text analysis, Machine learning, Big data.
Resumen:
Objetivo: para ejemplificar cómo se puede utilizar el modelado de tópicos en la investigación de gestión, mis objetivos son dobles. Primero, introduzco el modelado de tópicos como una herramienta de investigación en ciencias sociales y mapeo los estudios críticos publicados en administración y otras ciencias sociales que emplearon el modelado de tópicos de manera adecuada. En segundo lugar, ilustro cómo hacer modelos de tópicos aplicando modelos de tópicos en un análisis de los últimos cinco años de investigación publicada en esta revista: Iberoamerican Journal of Strategic Management (IJSM).
Metodología: analizo los últimos cinco años (2014 a 2018) de artículos publicados en el IJSM. La muestra es de 164 artículos. Los resúmenes se sometieron a una rutina estándar de preprocesamiento de texto de modelado de tópicos, generando 1,252 tokens únicos.
Originalidad / Relevancia: al proponer el modelado de tópicos como una metodología válida y oportunista para analizar datos textuales, puede cambiar el viejo paradigma de que los datos textuales pertenecen solo al ámbito cualitativo. Además, permitir que los datos textuales se etiqueten y cuantifiquen de una manera reproducible que mitigue (o elimine por completo) el sesgo del investigador.
Resultados principales: Se generaron seis tópicos a través de Latent Dirichlet Allocation (LDA): Tema 1 - Estrategia y ventaja competitiva; Tema 2 - Negocios internacionales y equipo de alta dirección; Tema 3 - Emprendimiento; Tema 4 - Aprendizaje y cooperación; Tema 5 - Finanzas y estrategia; y Tema 6 - Capacidades dinámicas.
Contribuciones teóricas / metodológicas: presento el estado del arte de la literatura publicada en IJSM y también muestro cómo el lector puede realizar su propio modelado de tópicos. Los datos completos y el código que se utilizaron están disponibles en repositorios de ciencia abiertos gratuitos en Open Science Framework (OSF) y GitHub.
Palabras clave: modelado de tópicos, latent Dirichlet allocation, análisis de texto asistido por computadora, aprendizaje automático, big data.
Resumo:
Objetivo do estudo: Para exemplificar como a modelagem de tópicos pode ser usada na pesquisa de administração, meus objetivos são duplos. Primeiro, introduzo a modelagem de tópicos como uma ferramenta de pesquisa em ciências sociais e mapeio estudos críticos publicados em administração e em outras ciências sociais que usaram a modelagem de tópicos de maneira adequada. Segundo, ilustro como fazer a modelagem de tópicos aplicando-a em uma análise dos últimos cinco anos de pesquisa publicada nesta revista: Iberoamerican Journal of Strategic Management (IJSM).
Metodologia: Analiso os últimos cinco anos (2014 a 2018) de artigos publicados no IJSM. A amostra é de 164 artigos. Os resumos foram submetidos a uma rotina padrão de pré-processamento de texto para modelagem de tópicos, gerando 1.252 tokens exclusivos.
Originalidade/Relevância: Ao propor a modelagem de tópicos como uma metodologia válida e oportunista para a análise de dados textuais, pode-se mudar o antigo paradigma de que os dados textuais pertencem apenas ao domínio qualitativo. Além disso, permitindo que os dados textuais sejam rotulados e quantificados de maneira reproduzível que mitigue (ou elimine completamente) o viés do pesquisador.
Principais resultados: Seis tópicos foram gerados por meio de Latent Dirichlet Allocation (LDA): Tópico 1 – Estratégia e Vantagem Competitiva; Tópico 2 – Negócios Internacionais e Equipe de Alta Administração; Tópico 3 – Empreendedorismo; Tópico 4 – Aprendizagem e Cooperação; Tópico 5 – Finanças e Estratégia; e Tópico 6 – Capacidades Dinâmicas.
Contribuições teóricas/metodológicas: Apresento o estado da arte da literatura publicada no IJSM e também mostro como o leitor pode executar sua própria modelagem de tópicos. Os dados completos e o código usado estão disponíveis em repositórios gratuitos de ciência aberta no Open Science Framework (OSF) e no GitHub.
Palavras-chave: modelagem de tópicos, latent Dirichlet allocation, análise de texto auxiliada por computador, aprendizado de máquina, big data.
Perspectives
TOPIC MODELING: HOW AND WHY TO USE IN MANAGEMENT RESEARCH
MODELADO DE TÓPICOS: CÓMO Y POR QUÉ USAR EN LA INVESTIGACIÓN DE GESTIÓN
MODELAGEM DE TÓPICOS: COMO E POR QUÊ USAR NAS PESQUISAS EM ADMINISTRAÇÃO
Received: 19 January 2019
Accepted: 15 April 2019
Big Data is defined as "large-scale data streams taken from the Internet, social media sites, or archives" (Mohr & Bogdanov, 2013, p. 561). It offers research opportunities because it provides access to almost unlimited data. Most Big Data consists of uncategorized textual data. Analyzing such unstructured data can be a challenge, since researchers cannot apply traditional textual analysis (coding, for instance) given the sheer volume of text (Bendle & Wang, 2016). Topic modeling (Hannigan et al., 2019), and specifically Latent Dirichlet Allocation (LDA) (Blei et al., 2003), can analyze huge amounts of text and describe the content as a weighted mixture of unobserved (latent) attributes. The use of computer-aided text analysis (CATA) in the Management and Social Sciences literature is growing (Nelson, 2017; Nelson, Burk, Knudsen, & McCall, 2018) and can complement, if not fully replace (Baumer et al., 2017), traditional text analysis approaches. Management researchers would do well to keep Topic Modeling approaches in their toolkit.
My aim is to introduce topic modeling as a valuable research approach to generate management theory from textual data. Through topic modeling, a researcher can label and categorize textual data in order to generate a quantity or measure that can later be used for statistical analysis and hypothesis testing. Textual data, which has mostly been used in management research in an exploratory and qualitative way, can also be employed in a descriptive and quantitative way. Researchers can use topic modeling to shift the paradigm of textual data from qualitative propositions to quantitative hypotheses.
To exemplify how topic modeling can be used in management research, my objectives are two-fold. First, I introduce topic modeling as a social sciences research tool and map critical published studies in management and other social sciences that employed topic modeling in a proper manner. Second, I illustrate how to do topic modeling by applying it in an analysis of the last five years of published research in this journal: the Iberoamerican Journal of Strategic Management (IJSM). I treated the abstract of each published article as a document and derived 6 topics. The main results show that competitive advantage and entrepreneurship, once predominant, are declining, and that international business and finance are increasing in importance. The R and Python code used for all data collection, data analysis, and generation of images and tables is available in an online Open Science Framework public repository (Storopoli, 2019).
I contribute by mapping the core concepts around Topic Modeling to guide researchers in their future endeavors. I also present a curated sample of good examples of topic modeling research articles for further inquiry. I then explain the main procedures and precautions involved in conducting a Topic Modeling analysis. Moreover, I illustrate which further quantitative analyses the researcher can perform with the labeling and quantification of textual data generated through Topic Modeling. Finally, I apply Topic Modeling to a sample of five years of published articles in IJSM to demonstrate how the technique can be useful for management research with unlabelled or uncategorized textual data.
To address my first objective, this section introduces topic modeling by defining it and giving a brief overview. Next, I explain the main technique behind topic modeling: Latent Dirichlet Allocation (LDA). I then present the main benefits of topic modeling and address prescriptive issues such as how to do topic modeling and how to choose the best model. After that, I move to more theoretical issues, such as how to build theory with topic models; and, finally, I conclude with a curated selection of examples of published research in the social sciences that have employed topic modeling.
Topic modeling is a class of text analysis that has arisen with the advent of both machine learning and big data. Today's personal computers have substantially more computational power than computers had when they first appeared in the early 1950s. For example, the computer responsible for guiding the Apollo mission to the Moon in 1969 had only 2 kilobytes (KB) of memory (RAM). This may sound minuscule, since many of today's smartphones have at least 4 gigabytes (GB) of RAM and most commercially available notebooks can have up to 16 GB of RAM. This massive computational power makes it possible for many scholars and researchers to do heavy bouts of data analysis. In the early days of quantitative data analysis, it was impossible to run such analyses on a personal computer, and the researcher had to have access to a mainframe computer, mostly located in universities, research institutes, or resourceful firms. The rise in the computational power of personal computers thus gave researchers the freedom to run massive data analyses on their laps.
Big Data, the second factor behind the advent of Topic Modeling, is defined as "large-scale data streams taken from the Internet, social media sites, or archives" (Mohr & Bogdanov, 2013, p. 561). It offers research opportunities because it provides access to almost unlimited data, of unfathomable proportions. By combining accessible high computational power with an unlimited supply of data, researchers can apply complex algorithms and quantitative techniques to large bodies of text in order to generate broad categories and comprehensive analyses, and so understand what underlying phenomena may lie behind the data. Topic modeling is an instance of probabilistic modeling (Mohr & Bogdanov, 2013) and uses statistical associations of words in a text to generate latent topics (clusters of co-occurring words that jointly represent higher-order concepts) without the aid of pre-defined, explicit dictionaries or interpretive rules (Hannigan et al., 2019). It does so not by providing an automatic text analysis application but rather by providing a lens that allows researchers working on a problem to view relevant textual data in a different light and at a different scale (Mohr & Bogdanov, 2013). Topic models provide, without predefined codes or categories of meaning, an automated procedure for coding the content of texts (including the abundant textual data obtained through Big Data) into a set of substantively meaningful coding categories called "topics". The most widely used topic modeling technique is Latent Dirichlet Allocation (LDA): a generative probabilistic model for collections of discrete data such as textual data (Blei et al., 2003). In LDA, each topic can be viewed as a theme because it is a distribution over all observed words in the texts; in other words, a bag of words that frequently appear together across documents. Moreover, every document (text) analyzed has topic probabilities that indicate which topics it is most associated with.
In terms of statistical specification, LDA assumes that the length of each document is Poisson distributed and that the proportion of each document belonging to each topic is Dirichlet distributed. Dirichlet distributions are commonly used as prior distributions in Bayesian statistics.
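The generative assumptions just described can be made concrete with a small simulation. The following is a minimal sketch, not the estimation procedure itself: it only generates one synthetic document the way LDA assumes documents arise. The toy topics, the `alpha` values, and the mean length are illustrative assumptions, and the helper names are hypothetical.

```python
import math
import random


def sample_dirichlet(alpha, rng):
    """Draw one sample from a Dirichlet distribution via normalized Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def sample_poisson(lam, rng):
    """Draw a Poisson-distributed integer (Knuth's multiplication method)."""
    threshold = math.exp(-lam)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def generate_document(topics, alpha, mean_length, rng):
    """Generate one synthetic document following the LDA generative story."""
    length = sample_poisson(mean_length, rng)   # document length ~ Poisson
    theta = sample_dirichlet(alpha, rng)        # topic proportions ~ Dirichlet
    words = []
    for _ in range(length):
        # Pick a topic for this word, then a word from that topic's distribution.
        k = rng.choices(range(len(topics)), weights=theta)[0]
        vocab, probs = zip(*topics[k].items())
        words.append(rng.choices(vocab, weights=probs)[0])
    return theta, words


# Toy topics (assumed for illustration): each maps words to probabilities.
toy_topics = [
    {"strategy": 0.5, "advantage": 0.3, "market": 0.2},
    {"export": 0.5, "country": 0.3, "market": 0.2},
]
rng = random.Random(42)
theta, words = generate_document(toy_topics, alpha=[0.5, 0.5], mean_length=8, rng=rng)
print(theta, words)
```

Note that "market" appears in both toy topics with its own probability, which is how the model accommodates words shared across topics.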
The benefits of Topic Modeling are many. First, it does not impose dictionaries or interpretative rules on the researcher, enabling the identification of important themes that human readers are unable to discern. It also allows for polysemy because topics are not mutually exclusive; individual words appear across topics with differing probabilities, and topics themselves may overlap or cluster (DiMaggio, Nag & Blei, 2013). Second, when dealing with large amounts of textual data (such as in a Big Data scenario), it gives researchers a way to obtain reasonable automated content coding, making it possible to take the measure of large-scale social phenomena that could not previously be measured (Mohr & Bogdanov, 2013). Third, it shifts the researcher's burden from manually coding text data to interpreting and validating the results of topic models, epitomizing a shift from interpretive methods borrowed from the humanities to disciplining the results through statistical validation (DiMaggio, 2015).
So how does Topic Modeling, a CATA technique, fare against traditional human-powered text analysis? Surprisingly well, some might argue. In a study that compared Topic Modeling (Blei et al., 2003) with Grounded Theory (Glaser & Strauss, 1967), the authors identified several correspondences between the grounded theory themes and the algorithmically generated topics (Baumer et al., 2017). In another study, Topic Modeling was able to match the results of human-coded text analysis. Applying CATA as the primary approach to qualitative research can result in "an efficient, rigorous, and fully reproducible computational grounded theory" (Nelson, 2017, p. 32). The CATA framework can also be applied to any qualitative text as data, including transcribed speeches, interviews, open-ended survey data, or ethnographic field notes, and can address many potential research questions (Nelson, 2017). This change to Topic Modeling requires many of us social scientists to move outside our comfort zone by accepting interpretive uncertainty and to develop robust ways to interpret and validate the results of our models (DiMaggio, 2015).
The first procedure is to collect data, which in topic modeling means textual data. This step can be done in several ways: by transcribing interviews, rendering reports to text, web scraping, or using any other source for collecting and generating textual data. It is important to note that the number of texts in a sample can be decisive in a Topic Modeling analysis. The most common approach is to treat each document as an individual text in the sample. However, sometimes the sample consists of one large mass of text, such as an entire book or a lengthy interview. The researcher can either keep the sample as a single text or break the document up into chunks that are interdependent: an interview can be divided by topic and a book by chapter.
Once the researcher has the data he or she needs, it is time to pre-process the data. Pre-processing means performing some fundamental transformations on the data so that it is much more useful for further, more meaningful analysis (Denny & Spirling, 2018). The first pre-processing procedure is text normalization, a set of transformations intended to render textual data that can be quantified and compared within itself. These transformations include: (1) converting all letters to lower or upper case; (2) converting numbers into words or removing numbers; (3) removing punctuation, accent marks, and other diacritics; (4) removing white spaces (leading and trailing spaces in a text); and (5) expanding abbreviations. The second pre-processing procedure is the removal of "stop words", defined as the "most common words in a language like 'the', 'a', 'on', 'is', 'all'" (Debortoli et al., 2016, p. 111). These words do not carry significant meaning and are usually removed from texts. The third and final pre-processing procedure is called stemming or lemmatization. Stemming reduces inflected (or sometimes derived) words to their word stem, base, or root form, which is generally a written word form; one of the most widely applied stemming algorithms is Porter's stemming algorithm (Porter, 1980).
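The three pre-processing procedures above can be sketched in a few lines of standard-library Python. This is only an illustration under stated assumptions: the stop-word list is a tiny hand-picked sample, and `crude_stem` is a deliberately simple suffix stripper standing in for a real stemmer; it is not Porter's algorithm. All function names are hypothetical.

```python
import re
import unicodedata

# Tiny illustrative stop-word list (assumption); real analyses use fuller lists.
STOP_WORDS = {"the", "a", "an", "on", "is", "all", "of", "and", "in", "to"}
# Crude suffix list standing in for a real stemmer (NOT Porter's algorithm).
SUFFIXES = ("ing", "ies", "es", "ed", "s")


def normalize(text):
    """Lower-case, strip diacritics, drop digits/punctuation, collapse spaces."""
    text = text.lower()
    text = unicodedata.normalize("NFKD", text)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = re.sub(r"[^a-z\s]", " ", text)      # removes numbers and punctuation
    return re.sub(r"\s+", " ", text).strip()   # removes extra white space


def crude_stem(token):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    """Normalize, remove stop words, then stem each remaining token."""
    tokens = normalize(text).split()
    return [crude_stem(t) for t in tokens if t not in STOP_WORDS]


tokens = preprocess("The firms' strategies in 2018: Competing on Análise!")
print(tokens)
```

In practice, a library stemmer or lemmatizer suited to the language of the corpus would replace `crude_stem`.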
After the text is pre-processed, the researcher can carry on with the Topic Modeling by applying an algorithm to estimate a probabilistic model. The simplest and most widely used model is Latent Dirichlet Allocation (LDA), introduced by Blei et al. (2003). LDA inputs are: (1) a set of documents represented as a document-term matrix (DTM); and (2) the number of topics to be estimated by the algorithm. The DTM is a matrix in which the rows are the documents in the sample, the columns are the unique words in the sample, and the cells are the number of times each word occurs in each document. Because the number of topics must be supplied manually, choosing it is an obstacle for the researcher. Most researchers deal with it by generating a model for each candidate number of topics and analyzing which one best fits the data. The LDA outputs are: (1) a topic-word matrix; and (2) a topic-document matrix. Both are matrices of weights: the topic-word matrix holds the weights of words in each topic, and the topic-document matrix holds the weights of topics in each document. The main operation behind the LDA algorithm consists of vector space calculations, based on similarity comparisons, that use the inputs to generate the outputs. LDA assumes that the similarity comparisons are probabilistic in nature and that each word in a document is modeled as a sample from a mixture model, where the mixture components are multinomial random variables that can be viewed as representations of 'topics'.
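To make the first LDA input concrete, the sketch below builds a document-term matrix from already pre-processed documents using only the standard library. The two toy documents and the function name `build_dtm` are assumptions for illustration; the stemmed tokens merely imitate the kind of tokens described in this article.

```python
from collections import Counter


def build_dtm(tokenized_docs):
    """Return (vocabulary, matrix): rows are documents, columns are unique terms."""
    vocabulary = sorted({tok for doc in tokenized_docs for tok in doc})
    column = {term: j for j, term in enumerate(vocabulary)}
    matrix = []
    for doc in tokenized_docs:
        row = [0] * len(vocabulary)
        for term, count in Counter(doc).items():
            row[column[term]] = count   # cell = occurrences of term in document
        matrix.append(row)
    return vocabulary, matrix


# Toy pre-processed documents (assumed for illustration).
docs = [
    ["strateg", "compet", "strateg"],
    ["internac", "negoci", "strateg"],
]
vocabulary, dtm = build_dtm(docs)
print(vocabulary)
for row in dtm:
    print(row)
```

The number of columns in this matrix equals the number of unique tokens, which is the quantity reported after pre-processing.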
With no fixed number of topics to be generated, researchers have to generate one model for each number of topics and then assess which one is the right model. The assessment can be either quantitative or qualitative. The quantitative metrics measure the fit of the model, and the qualitative assessments are interpretative and left to the discretion of the researcher. One must not base the analysis solely on quantitative metrics, since they can be unreliable (Maier et al., 2018). Topic models that perform better on quantitative metrics tend to infer topics that humans judge to be semantically less meaningful (DiMaggio, 2015). There is no statistical test for the optimal number of topics or the quality of a solution. The point is not to estimate population parameters correctly but to identify the lens through which one can see the data most clearly.
There are three quantitative metrics of model fit: (1) perplexity; (2) log-likelihood; and (3) coherence. Perplexity is a global indicator of the model, and it represents the model's "surprise" at the data (Blei et al., 2003). A lower perplexity score indicates better generalization performance. Log-likelihood is also a global indicator and represents how plausible the model parameters are given the data (Bordag, 2008). Coherence is a metric for assessing topic quality; it is a local indicator, computed for each topic, that examines the top words in a topic to judge whether they make sense together (Mimno et al., 2011). Coherence is noted as the optimal quantitative metric for topic modeling assessment (DiMaggio et al., 2013). Indeed, a statistical test for an overall solution (as opposed to one for the quality of particular topics) would be misleading, because models often shunt noisy data into uninterpretable topics in ways that strengthen the coherence of the topics that remain. Thus, the test of the model as a whole is its ability to identify a number of substantively meaningful and analytically useful topics, not its success in optimizing across all topics.
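Two of these metrics are simple enough to sketch directly. The code below computes perplexity from a held-out log-likelihood, and a per-topic coherence score following the co-document-frequency formulation of Mimno et al. (2011), often called UMass coherence. This is a sketch under assumptions: other coherence variants exist and may be the ones reported elsewhere in this article, and the toy documents are invented for illustration.

```python
import math


def perplexity(log_likelihood, n_tokens):
    """Perplexity = exp(-log-likelihood per token); lower means less 'surprise'."""
    return math.exp(-log_likelihood / n_tokens)


def umass_coherence(top_terms, tokenized_docs):
    """Sum of log((D(v_m, v_l) + 1) / D(v_l)) over ordered pairs of top terms.

    top_terms must be ordered from most to least probable in the topic, and
    every term must occur in at least one document (so D(v_l) > 0).
    """
    doc_sets = [set(doc) for doc in tokenized_docs]

    def doc_freq(*terms):
        # Number of documents containing all of the given terms.
        return sum(1 for s in doc_sets if all(t in s for t in terms))

    score = 0.0
    for m in range(1, len(top_terms)):
        for l in range(m):
            joint = doc_freq(top_terms[m], top_terms[l])
            score += math.log((joint + 1) / doc_freq(top_terms[l]))
    return score


# Toy corpus (assumed for illustration).
toy_docs = [
    ["strateg", "compet"],
    ["strateg", "compet"],
    ["strateg", "internac"],
]
print(umass_coherence(["strateg", "compet"], toy_docs))
print(umass_coherence(["strateg", "internac"], toy_docs))
print(perplexity(log_likelihood=-120.0, n_tokens=40))
```

Terms that co-occur in many documents yield a higher score than terms that rarely appear together, which is the intuition behind using coherence as a per-topic quality check.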
Qualitative assessment is based on two types of validity: internal and external (DiMaggio, 2015). Semantic or internal validity confirms that the model meaningfully discriminates between different senses of the same or similar terms. This validity answers the following question: "how can we be sure that our interpretation of the meaning of a topic is better than an alternative interpretation?". Predictive or external validity determines whether particular topics correspond to information external to the topic model (e.g., by confirming that certain topics became more salient when an external event relevant to those topics occurred). It recognizes that the same text will speak in different ways and be interpreted differently by different audiences. The researcher must locate the optimal balance between the two logics of validity.
In order to generate theory from topic models, one must understand that the main contribution of topic modeling is the development of a system of inductive classification. In a classic scenario of content classification and coding (mostly textual content), researchers are usually looking for shared structures of meaning that are not formally materialized. Topic modeling can serve the same purpose of finding these shared structures but without introducing researcher bias; note that the only human intervention would be to choose the number of topics to be generated for the model.
The iteration between theory and the topics that emerge from the chosen model creates new theoretical artifacts or builds theory with them. A researcher must always ask whether such topics represent meaningful structures. That is, for every topic, one must consider whether it resonates theoretically with the content from which it was derived and also whether it provides substantial theoretical contributions and discussions.
Presenting topics without particular concern for theoretical artifacts risks presenting disembodied arguments about the artifacts' importance and role regarding the theory and the data. If one applies topic modeling naively, one may miss essential distinctions in how to capture meaning and meaning structures in the data, and so fail to generate significant theoretical discussions and contributions.
Topic modeling may not be the final destination of analysis and theory building in a study. Researchers may use topic modeling as a means to generate unbiased classifications and metrics of textual (qualitative) data. Textual data can then be measured and used in quantitative analysis, especially in hypothesis testing. This shifts the paradigm and the assumption that textual data belongs only in the realm of qualitative analysis and exploratory research settings.
Researchers may find new and innovative ways of measuring variables in order to test hypotheses in contexts that have not been explored before.
Researchers have been applying topic modeling and other CATA techniques since its inception in 2003 (Blei et al., 2003). In table 1, I present a curated sample of 23 good examples of topic modeling research articles. Most of the articles are from management (65%), but there are some from other social sciences. Also, the articles' sample sizes are notably large: the median is 8,000, and the mean is 18,063.
Regarding the type of probabilistic model, the majority of the articles (87%) employ LDA, with a few exceptions. Two studies employed correlated topic modeling (CTM). CTM topic proportions allow for topic correlation (Blei & Lafferty, 2007), admitting that some topics are closer to each other and share words with each other (Nikolenko, Koltcov & Koltsova, 2017). CTMs use a logistic normal distribution instead of the Dirichlet distribution used in LDA to model correlations between topics. One study employed structural topic modeling (STM). The STM provides a flexible way to incorporate metadata associated with the text into the analysis, such as: when the text was written; where (e.g., which country) it was written; who wrote it; and characteristics of the author (Roberts et al., 2014). In turn, it allows the understanding of relationships between metadata and topics in the texts (Lucas et al., 2015).
The studies in table 1 usually employ either a large number of topics (75, 100 or more) or a small number of topics (fewer than 20). This polarization in the choice of the number of topics reflects a split in the traditions on which the studies base their analyses. A researcher from a traditional social sciences background will usually choose a small number of topics in order to focus on a comprehensive and meticulous description and discussion of each topic. On the other hand, if the researcher's background is in a more quantitative "hard science", the number of topics will be higher because the focus may be on getting the best model fit to the data (which is generally achieved with a very large number of topics in a model). There is no consensus on how a researcher should decide on the number of topics. Some argue that it depends on the level of 'resolution' a social scientist desires to obtain (Nikolenko, Koltcov & Koltsova, 2017), while others argue that it depends on the performance metrics of the model, such as perplexity (Blei et al., 2003) or coherence (Mimno et al., 2011). Perhaps a more sensible approach would be to let the performance metrics guide the choice but leave the final decision pending a thorough inspection by the researchers (DiMaggio, 2015).
Some studies depicted in table 1 also employ further quantitative analysis of the data generated by topic models. As already covered in previous sections, researchers can apply topic modeling to textual data to generate non-biased categories and labels. These measures may later be included in quantitative analysis to test hypotheses. Most of the studies in table 1 employ a supervised statistical analysis, which means that a dependent variable guides the analysis (ANOVA, t-tests, regressions, structural equation modeling, etc.). A minority of studies use unsupervised statistical analyses, which have no dependent variable to guide the analysis; instead, the focus is on finding similarities within the sample (clustering, social networks, measures of similarity, etc.).
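As a sketch of how topic-model output can become a measure for later quantitative analysis, the code below takes a topic-document weight matrix, assigns each document its dominant topic, and averages the topic weights by a grouping variable such as publication year. The weights and years are invented for illustration, and the function names are hypothetical.

```python
from collections import defaultdict


def dominant_topic(weights):
    """Index of the topic with the largest weight in one document."""
    return max(range(len(weights)), key=lambda k: weights[k])


def mean_weights_by_group(doc_topic, groups):
    """Average the topic weights of documents sharing the same group label."""
    sums = {}
    counts = defaultdict(int)
    for weights, group in zip(doc_topic, groups):
        if group not in sums:
            sums[group] = [0.0] * len(weights)
        sums[group] = [s + w for s, w in zip(sums[group], weights)]
        counts[group] += 1
    return {g: [s / counts[g] for s in sums[g]] for g in sums}


# Invented topic-document weights (rows: documents, columns: topics).
doc_topic = [[0.7, 0.3], [0.5, 0.5], [0.2, 0.8], [0.4, 0.6]]
years = [2014, 2014, 2018, 2018]

labels = [dominant_topic(w) for w in doc_topic]
by_year = mean_weights_by_group(doc_topic, years)
print(labels)
print(by_year)
```

The per-document labels could serve as a categorical variable and the per-document weights as continuous variables in analyses such as the ANOVA or regressions mentioned above.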
Finally, topic modeling can be freely applied using either R, Python or Java. For R there are three main packages: 'topicmodels' (Hornik & Grün, 2011); 'lda' (Chang, 2011); and 'stm' (Roberts, Stewart & Tingley, 2014). The most up to date of these is 'topicmodels', which can apply either LDA or CTM as the type of probabilistic model, but not STM. For STM, researchers working in the R ecosystem must use the 'stm' package. The least up to date of the R packages is 'lda', which can only perform the LDA type of probabilistic model. Researchers looking for topic modeling in a Python setting must turn to the 'gensim' library (Rehurek & Sojka, 2010). 'gensim' is more frequently updated than any of the R packages and is used by firms such as Amazon, Cisco and Capital One. 'gensim' can only perform LDA; it cannot perform STM or CTM. Despite those drawbacks, 'gensim' has some advantages over the R packages: scalability (it can be deployed in large environments and process huge chunks of data); and text parsing (it can parse and work with different types and sources of textual data, such as wiki and XML). Also, thanks to the Python ecosystem, 'gensim' can interact natively with popular Python machine learning libraries, for example 'TensorFlow' and 'scikit-learn'. This makes 'gensim' very attractive to computer scientists and social scientists looking to derive the best-fitting model to explain the data (mind the overfitting issue). Finally, the first tool that was available for topic modeling is the Java-based 'MALLET' (McCallum, 2002). It is still maintained and updated but, like 'gensim', can only perform LDA, not CTM or STM. It interfaces with both the R and Python ecosystems through wrappers: the 'gensim' library for Python and the 'mallet' package for R (Mimno, 2013). All of the packages and libraries described here can be downloaded and used for free.
This section will address my second objective: to illustrate how to do topic modeling by applying it in an analysis of the last five years of published research in this journal: the Iberoamerican Journal of Strategic Management.
All the data and procedures described in this section can be found in an online Open Science Framework public repository (Storopoli, 2019). I encourage the curious reader to access it and browse the code and files to explore any procedure of interest. The online repository also addresses replicability issues and other ethical concerns. This section consists of an extensive description of the data collection and data analysis procedures, followed by a thorough presentation of the results.
The first procedure was to extract the last five years of published articles in the IJSM. This encompasses a timespan from 2014 to 2018. I chose full years in order to make the data collection and, subsequently, the analysis and the results replicable for future studies.
To generate the data, I have downloaded a BibTeX file exported from Scientific Periodicals Electronic Library - SPELL (www.spell.org.br) with all the published articles in IJSM. IJSM has published a total of 197 documents in SPELL from 2014 to 2018. To parse the content of the BibTeX file, I used an R package called 'bib2df' (Ottolinger, 2019). The variables that it was able to gather from SPELL for each of the 197 documents published are: title, authors, year, volume, number, pages, full abstract, URL of submission and DOI. Since my focus is on the published articles, I have removed 25 editorials and 8 book reviews from the sample; arriving in a final sample of 164 articles published in the last five years.
The 164 articles are the inputs documents to the topic modeling. However, first, it was necessary to pre-process the abstracts. The first pre-processing procedure was text normalization: (1) converting all letters to lower or upper case; (2) removing numbers; (3) removing punctuations, accent marks, and other diacritics; (4) and removing white spaces. Since IJSM adopted a structured abstract template during the later years of the analysis timeframe, I also employed a procedure to convert structured abstracts to regular abstracts. The second pre-processing procedure was to remove the stop words by using spaCy (Honnibal & Montani 2017), which is a free, open-source library for advanced Natural Language Processing (NLP) in Python. The classes of stop words that were removed with spaCy in this procedure are adverbs (very, tomorrow, down, where, there), pronouns (I, you, he, she, myself, themselves, somebody), conjunctions (and, or, but), determiners (a, an, the) and adpositions (in, to, during). The third and last pre-processing procedure was to lemmatize the text by reducing inflected (or sometimes derived) words to their word stem, base or root form—generally a written word form; the stemming was done following Porter's stemming algorithm (Porter, 1980).
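The conversion of structured abstracts to regular abstracts can be approximated by stripping the section labels. The sketch below is one possible way to do it, not necessarily the procedure used in this study: it assumes the English labels that appear in this article's own abstract, and the function name is hypothetical. Abstracts in other languages would need their own label list.

```python
import re

# Section labels as they appear in this article's structured abstract.
SECTION_LABELS = [
    "Objective",
    "Methodology",
    "Originality/Relevance",
    "Main Results",
    "Theoretical/methodological Contributions",
]

# Match a label at the start of the text or after white space, followed by a colon.
LABEL_PATTERN = re.compile(
    r"(?:^|\s)(?:" + "|".join(re.escape(label) for label in SECTION_LABELS) + r")\s*:\s*",
    flags=re.IGNORECASE,
)


def flatten_structured_abstract(text):
    """Remove section labels and collapse white space into single spaces."""
    flat = LABEL_PATTERN.sub(" ", text)
    return re.sub(r"\s+", " ", flat).strip()


structured = "Objective: To map studies.\nMethodology: I analyze abstracts."
print(flatten_structured_abstract(structured))
```

A limitation of this simple pattern is that it also removes a label word followed by a colon inside running text, so the output is worth spot-checking.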
After the pre-processing procedures, I inspected the documents, added 219 custom stemmed tokens as stop words, and removed them from the text. The final pre-processed text for the documents has a total of 1,252 unique tokens.
With the pre-processed documents as inputs, I generated an LDA model for each number of topics, from 2 up to 40. In Figure 1, it is possible to see the coherence slowly creeping up as the number of topics increases. There is a tradeoff between coherence and the number of topics in a specific LDA topic model. In this study, two topic models shared the highest coherence value of 0.336: the first has 6 topics and the second has 10 topics. Since the coherence value is the same, I chose (following the Occam's razor principle) the model with the fewest topics.
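The selection rule described above (highest coherence, ties broken in favor of fewer topics) can be sketched as follows. The function name and the tolerance parameter are hypothetical, and the coherence values are illustrative placeholders, except for the 0.336 reported for the 6-topic and 10-topic models.

```python
def select_num_topics(coherence_by_k, tol=1e-9):
    """Return (k, coherence) for the smallest k that reaches the best coherence.

    coherence_by_k maps a number of topics to the coherence of its model.
    Values within tol of the maximum are treated as tied.
    """
    best = max(coherence_by_k.values())
    tied = [k for k, score in coherence_by_k.items() if best - score <= tol]
    return min(tied), best


# Placeholder scores; only the two 0.336 entries come from the text.
scores = {2: 0.281, 4: 0.310, 6: 0.336, 8: 0.329, 10: 0.336, 12: 0.331}
chosen_k, chosen_score = select_num_topics(scores)
print(chosen_k, chosen_score)
```

In a full pipeline, the dictionary would be filled by fitting one LDA model per candidate number of topics and recording its coherence.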
The topic model has 6 topics. Each term's weight for all the topics is listed in Appendix I. In addition, Figure 2 shows the term counts and weights for each of the 6 topics. The first topic comprises the terms: "estratég"; "organiz"; "competi"; "prát"; "merc"; "conceit"; "empr"; "vantag"; "gerenc"; and "comport".
Topic 1 therefore deals with the major strategy themes and competitive advantage, and I name it "Strategy and Competitive Advantage". The most representative document in the sample for Topic 1 is titled "Comportamento Estratégico Organizacional e a Prática de Gerenciamento de Resultados nas Empresas Brasileiras".
Topic 2 has the following terms: "gest"; "teor"; "negóci"; "internac"; "caracterís"; "futur"; "decis"; "abord"; "país"; and "relacion". The most representative IJSM article in the sample has the following title: "Mulheres na gestão de topo: a problemática do GAP de gênero e salarial". Since most terms in Topic 2 relate to International Business and the most representative article deals with gender inequality in the top management team, Topic 2 is entitled "International Business and Top Management Team".
Topic 3 has the following terms: "desenvolv"; "ambi"; "inform"; "empreend"; "públic"; "sustent"; "instituc"; "institu"; "internacion"; and "empreendedor". The most representative document in the sample for this topic is "Do homo empreendedor ao empreendedor contemporâneo: evolução das características empreendedoras de 1848 a 2014". Since all terms and the most representative document concern entrepreneurship, Topic 3 is named "Entrepreneurship".
Topic 4 has the following terms: "process"; "conhec"; "entrev"; "context"; "qualit"; "perspec"; "form"; "particip"; "envolv"; and "mudanç". The topic's most representative article in the sample is titled "Aprendizado de Rede no Contexto de Intercooperação e Fusão de Redes: A Opção de Não-Fusão". As the topic terms indicate, it is profoundly influenced by qualitative methods and techniques. The term "entrev" is the stemmed version of interview and the term "qualit" is the stem of qualitative. Since the topic terms and the title of the topic's most representative document relate to learning and cooperation, Topic 4 is defined as "Learning and Cooperation".
Topic 5 has the following terms: "empr"; "fat"; "recurs"; "brasil"; "ativ"; "estrut"; "corpor"; "financ"; "efici"; and "capit". The most representative document in the sample for the topic is "Determinantes da Estrutura de Capital de Empresas Brasileiras: Uma Análise Empírica das Teorias de Pecking Order e Trade-Off no Período de 2005 e 2014". This topic is dominated by finance terms; thus, it is named "Finance and Strategy".
The final topic, Topic 6, has the following terms: "inov"; "desempenh"; "capac"; "organizac"; "model"; "dimens"; "dinâm"; "produt"; "ges"; and "pequen". The most representative document for the topic is "Capacidades de Inovação em Serviços: Um Estudo nos Supermercados em Santa Catarina". The topic comprises the terms "capac" and "dinâm", which together point to dynamic capabilities, and the term "pequen", which refers to small firms. Therefore, Topic 6 is defined as "Dynamic Capabilities and Small Firms".
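Each topic above is paired with its most representative document. The text does not state how that document was selected; a common choice, assumed here, is the document with the highest proportion of the topic in the fitted document-topic distribution. The sketch below uses a hypothetical function name, placeholder titles, and made-up proportions.

```python
def most_representative(doc_topic, titles, topic):
    """Return (title, weight) of the document with the largest share of `topic`.

    doc_topic is a list of rows, one per document, each row holding the
    document's topic proportions; titles follows the same order.
    """
    idx = max(range(len(doc_topic)), key=lambda i: doc_topic[i][topic])
    return titles[idx], doc_topic[idx][topic]


# Made-up proportions for three documents and three topics.
doc_topic = [
    [0.70, 0.20, 0.10],
    [0.15, 0.80, 0.05],
    [0.30, 0.10, 0.60],
]
titles = ["Doc A", "Doc B", "Doc C"]

for topic in range(3):
    print(topic, most_representative(doc_topic, titles, topic))
```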
Now that the topics have been presented, I turn to the relationships among them. By using the Python port of the R package 'LDAvis' (Sievert & Shirley, 2014), I was able to generate a multidimensional scaling (MDS) (Torgerson, 1958) of the topics. In Figure 3, there are two main axes of the MDS: the x-axis and y-axis. In this Cartesian space, all distances are assumed to be Euclidean. We can see that there are clearly 3 groups. Topics 2 and 4 are grouped together. Topics 1, 3 and 6 are also grouped together, opposite to Topics 2 and 4. The final group is composed of a single topic, Topic 5, which sits opposite to both other groups. The MDS elucidates that "Topic 4 - Learning and Cooperation" is very similar to "Topic 2 - International Business and Top Management Team", and that "Topic 1 - Strategy and Competitive Advantage", along with "Topic 3 - Entrepreneurship" and "Topic 6 - Dynamic Capabilities and Small Firms", share similarities. These two groups are opposed to each other on the main axis (the PC1 on the x-axis). Isolated, without similarities to any other topic, is "Topic 5 - Finance and Strategy".
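Before the scaling step, a distance between every pair of topics is needed. By default, LDAvis measures inter-topic distance with the Jensen-Shannon divergence between topic-term distributions; the sketch below assumes that default was kept, which the text does not confirm. The function names and the toy distributions are hypothetical.

```python
import math


def kl_divergence(p, q):
    """Kullback-Leibler divergence KL(p || q), skipping zero entries of p."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def jensen_shannon(p, q):
    """Jensen-Shannon divergence between two discrete distributions."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


# Toy topic-term distributions over a four-term vocabulary.
topic_a = [0.50, 0.30, 0.10, 0.10]
topic_b = [0.45, 0.35, 0.10, 0.10]
topic_c = [0.05, 0.05, 0.40, 0.50]

print(jensen_shannon(topic_a, topic_b))  # small: similar topics
print(jensen_shannon(topic_a, topic_c))  # larger: dissimilar topics
```

The resulting pairwise divergences form the distance matrix that the scaling step projects onto two axes.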
Finally, I present the topic predominance over the years in Figure 4. Some trends are worth noting. First, Topic 1 - Strategy and Competitive Advantage and Topic 3 - Entrepreneurship were predominant in 2014 and declined towards 2018. Second, Topic 2 - International Business and Top Management Team and Topic 5 - Finance and Strategy are on the rise from 2016 onwards. Third, Topic 4 - Learning and Cooperation and Topic 6 - Dynamic Capabilities remained steady during the timeframe of analysis. The decline of Topic 1, which is a general topic, can be interpreted as a specialization of the journal: over the years, IJSM started to publish more specialized content that was grouped into other topics. The rise of Topic 2 and Topic 5 can be seen as a trend towards more publications about finance, international business, and top management team.
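One way to compute topic predominance per year is sketched below. It assumes that predominance means the share of a year's articles whose dominant topic is a given topic; the text does not specify the measure, and the mean topic proportion per year would be an equally plausible alternative. The function name and the records are hypothetical.

```python
from collections import Counter, defaultdict


def topic_share_by_year(records):
    """Map each year to {topic: share of that year's documents}.

    records is an iterable of (year, dominant_topic) pairs.
    """
    counts = defaultdict(Counter)
    for year, topic in records:
        counts[year][topic] += 1
    shares = {}
    for year, counter in counts.items():
        total = sum(counter.values())
        shares[year] = {topic: n / total for topic, n in counter.items()}
    return shares


# Hypothetical (year, dominant topic) pairs, not the study's data.
records = [
    (2014, 1), (2014, 1), (2014, 3), (2014, 5),
    (2018, 2), (2018, 5), (2018, 5), (2018, 1),
]
shares = topic_share_by_year(records)
for year in sorted(shares):
    print(year, shares[year])
```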
My objectives in this article were two-fold. First, I introduced topic modeling as a social sciences research tool and mapped critical published studies in management and other social sciences that employed topic modeling in a proper manner. Second, I illustrated how to do topic modeling by applying it in an analysis of the last five years of published research in IJSM. Topic modeling is a valuable tool for management researchers to use in their theory-building process. It can shift the old paradigm that textual data belongs only to the qualitative realm by allowing textual data to be labeled and quantified in a reproducible manner that mitigates (or almost fully eliminates) researcher bias.
For further research, while addressing the limitations, I propose the use of CTM and STM instead of LDA to analyze the data and draw more insights from the results. CTM allows the topics to correlate with each other and can give a better fit, but its topics are harder to interpret (Steyvers & Griffiths, 2007). STM can exploit the metadata that accompanies textual data, bringing relationships between metadata and topics into the analysis (Robert et al., 2014). Also, note that the median sample size of the 23 topic modeling articles in Table 1 is 8,000 documents. This suggests that topic modeling may benefit from a larger sample.
Evaluation Process: Double Blind Review E-ISSN: 2176-0756