Abstract: The research aims to identify the relationship between the information search behavior on the Internet to solve a research task and the answers given by a group of university students. For this purpose, a quantitative quasi-experimental study was designed, in which both the words used in the web search process and the answers elaborated from it were analyzed. The data were processed thanks to the use of the GoNSA2 platform, which allows tracking the search process, and the Iramuteq software, oriented towards the analysis of lexical information. Among the main results, we highlight a shift between the topics used in the search and those observed in the response stage and an increase in the categories present in the latter stage, which allows us to consider the search process as a learning instance.
Keywords: information search, learning, learning analytics, lexical analysis.
Conceptual displacement: Web search as a learning experience
Received: 23 July 2020
Accepted: 9 January 2021
Search engines enjoy great popularity among Internet users because they give immediate access to hundreds of pages of results. It has been estimated that search engines index at least 5 trillion web pages. However, this growth in indexing brings with it several difficulties in accessing information, such as saturation, algorithmic filtering, and misinformation. Thus, the very attribute that makes search systems one of the favorite applications of Internet users also complicates the selection of information.
The saturation of results is a consequence of the increased indexing performed by search engines, and it directly affects people's working memory, because so much data is difficult to process in a short time (Rivas, 2008), which decreases the effectiveness of decision making. To reduce information saturation, search systems have developed filtering algorithms that personalize searches, facilitating access to information through results adjusted to each user's historical behavior. Among the most important filtering algorithms are needs detection, query detection, query suggestion, search personalization and result ranking. The result pages are ranked based on these filters (Yoganarasimhan, 2020). However, these algorithmic filters confine users in comfortable bubbles that fit the profile that search engines have built from their clicks. Thus, web browsing is algorithmically mediated by profile information, geographic location, search engine usage history, and language, among other elements.
Consequently, the efficiency of search systems depends on the personalization of results for each user, which sets boundaries on the user's search space and limits access to diverse information.
This makes it urgent to develop digital literacy plans that allow a critical approach to information and to build tools that encourage students' reflection, to promote the development of creative ideas in the different areas of study and in everyday life.
This research contributes a characterization of the queries issued by university students and questions the belief that new generations of students have mastered these technologies. For example, it has been established that, although students show ease and familiarity in the use of computers, they are dependent on search engine results. Characterizing their queries therefore highlights the need for digital literacy as a key competence for 21st-century students.
To analyze the queries issued in the search engine, a quasi-experimental study was designed, since this approach allows students with similar demographic characteristics (age, gender, language, degree program and geographic location) to search naturally during the same period while solving factual and research/exploration tasks. The study was supported by the GoNSA2 technology platform (Olivares-Rodríguez, Guenaga, & Garaizar, 2018), which allows implicit recording of all user actions.
In this quasi-experimental study, a group of 58 first-year university students searched for information on the web to solve four search tasks, both factual and research/exploratory. This article presents the results obtained from the analysis of the research/exploratory task: "How to fight crime in Chile". The content of the task corresponds to a recurrent topic in the national and international media agenda. The objective of this research is to identify the relationship between the information search behavior on the Internet to solve a research task and the answers provided by a group of university students, through a contrastive study of the keywords present in the queries issued by the students in the search system and the answers they produced.
In broad terms, we propose to categorize the keywords of the queries issued by the students in the search engine and to contrast them with the answers elaborated for the research task. First, the level of web exploration is determined from the number of queries issued. Second, the keywords of the queries issued by the students in the search engine are categorized. Third, the categories found in the search stage are contrasted with the words used and the categories present in the response stage.
The literature offers evidence of the difficulties users face in initiating and completing a search, in terms of the number of queries and terms used, as well as of the low effort they invest in searching for information. This is particularly true of the youngest users, owing to the cognitive barriers inherent to their development (Duarte Torres and Weber, 2011; Usta, Altingovde, and Vidinli, 2014). A low tendency to use advanced search engine functions has also been observed (Yamamoto, Yamamoto, Oshima, and Kawakami, 2018). In addition, poor performance in the search process affects people's mood (Rosman, Mayer, & Krampen, 2015). Furthermore, the great diversity of pages, the inherent ambiguity of queries, and low levels of reading comprehension make the task difficult, since queries turn out to be vague and redundant; information retrieval therefore becomes frustrating and involves little exploration of the search space, especially among minors and young people (Foss and Druin, 2014).
The level of effort spent on search depends mainly on external factors and user experience (Zach, 2005). It has also been established that the decision to stop a search is highly influenced by the relevance of the results presented on the first page of search engine results, the quality of the terms users are able to employ, their ability to generate new terms for queries, and their personal assessment of the effort required to solve the search task, based on the complexity of the search (Wu and Kelly, 2014, p. 34).
Toms and Freund triangulated surveys with quantitative information from logs of information-seeking sessions and identified the actions and transitions that were preferentially used at the end of the information-seeking process (Toms and Freund, 2009, p. 90).
Consequently, the formulation of queries and the criteria for closing the information search depend mainly on the subject's personal judgment, based on the perceived usefulness and relevance of the results. Thus, transitions and criteria are an essential part in the characterization of web exploration/exploitation strategies.
Currently, most access to information is mediated by technology. Search engines allow retrieving information indexed on the web by linking it to queries issued by students, according to their relevance, but the closeness of the results to the query made depends on technological and human variables, which combine to make the information search process more complex.
The search process is made difficult by three factors: the diversity of pages, the ambiguity of the queries made by the subjects and low levels of reading comprehension. These aspects lead to vague and redundant queries that increase frustration and, therefore, reduce the exploration of SERPs, especially in the case of children and young people (Foss and Druin, 2014). Different studies have shown that most students lack the skills needed to access information efficiently (Druin, Foss, Hatley, & Golub, 2009; Qureshi, Bokhari, Pirvani, & Dawani, 2015; Şendurur & Yildirim, 2015). In addition, it is necessary to consider that the search process also involves the cognitive, physical, and affective variables of the learners (Kuhlthau, 1991), as well as the capabilities of the technology itself to respond to the users' needs.
This growing information saturation appears to affect users' attention, hindering their ability to process the pieces of information that overload their working memory (Rivas, 2008, p. 185). Along these lines, and to improve the efficiency of the search process, search engines have incorporated query recommendation models (Duarte Torres, Hiemstra and Weber, 2012) and user purpose detection (Sadikov, Madhavan, Wang and Halevy, 2010; Santos, Nguyen and Zhao, 2003), which emerge as alternatives for reducing the overwhelm produced by hundreds of results with diverse information. However, algorithmic mediation personalizes searches by reducing the search space and the diversity of the results, with filters that select the links closest to the user's interests and leave out those that are distant or that present contradictory information. In short, these filters shape an informational bubble (Pariser, 2017) that limits the opportunity to access information. Thus, when exploring the web to solve a learning task, users do not have access to the full diversity of content; instead, it is reduced to the sites they consult on a recurring basis and other similar ones. For Jiang (2014b), the ranking of results depends on user clicks as well as on the language used, the popularity of the site, and geolocation. This last factor appears to be a determinant of the results received, since significant differences have been established in the information users receive depending on their geographic location (Jiang, 2014a; Jiang, 2014b; Cano-Orón, 2019). Likewise, search results are also influenced by advertising as part of the business of search engines (Rieder and Sire, 2014).
In this sense, delivering results through a content ranking implements a biased model, in which the algorithm determines the priority of some content over others (Lewandowski, 2017; Rieder and Sire, 2014; Jiang, 2014a; Jiang, 2014b), directly influencing users' access to information. Another factor to consider is the media influence of each country, as it would also influence the decisions of the search algorithm (Cano-Orón, 2019, p. 98).
However, some researchers deny the existence of a bubble that isolates Internet users, arguing that personalization does not limit access to information (Haim, Graefe, & Brosius, 2018). These studies questioning the bubble nevertheless have limitations, in that they are not conducted in real contexts: the information sources are news from a single newspaper, and the object of study considers only a particular type of diversity (Möller, Trilling, Helberger, & van Es, 2018), or the participants are simulated with profile generation algorithms (Haim, Arendt, & Scherr, 2017).
This quasi-experimental research aims to identify the relationship between the information search behavior on the Internet to solve a research task and the answers given by a group of university students. The task consisted of answering the question "How to fight crime in Chile". First, the level of exploration of the web is determined from the number of queries issued. Second, the keywords of the queries issued by the students in the search engine are categorized. Finally, the categories found in the search stage are contrasted with two aspects of the response stage: a) words used and b) categories present.
The sample is composed of first-year engineering students between 18 and 19 years of age, with 53 male students and 5 female students. Participants are volunteers and agree to the use of their data in a confidential, anonymous and aggregated manner, by signing an informed consent form.
The research task was designed based on the proposal of Wildemuth and Freund (2012), which consists of a search challenge oriented towards the resolution of a complex problem, allowing the collection of information on both the search process and the results reached by the participants. This type of task was chosen because it demands a greater intellectual effort, since it requires a greater exploration and exploitation of the web to elaborate an answer. Therefore, the greater exploration of the web ensures a search for information that provides sufficient data for the research.
The research task "How to fight crime in Chile" is proposed, whose selection criteria are as follows:
The context of the task is a topic on the media agenda and, therefore, in the public domain.
Existence of diverse sources of information.
The task instructions are understandable to students from different disciplines.
To control for domain knowledge, the task topic lies outside the participants' formal learning. In addition, tasks like the one proposed have been used previously in the literature (Arguello, Wu, Kelly, & Edwards, 2012; Kules & Shneiderman, 2008), which supports their use.
The objective of the proposed task was to determine the components that underlie a plan to combat crime in Chile; participants had 15 minutes to submit their response.
The searches are carried out on the GoNSA2 technology platform (Olivares-Rodríguez et al., 2018), which interacts with Microsoft's Bing search engine. GoNSA2 supports the design of information tasks, the execution of the search process, the integration of information on student behavior, and the evaluation of the solutions produced. The task-solving interface presented to students comprises a) the task, b) the snippet, c) the answer box and d) the personal library. The main contribution of the platform is that it provides detailed information about the search strategies, the queries issued, the documents reached with such queries and the solutions delivered by the participants.
In this way, the queries elaborated by each user are recorded, while for each query issued, the documents and sources provided by the search system are stored, as well as the actions performed on the documents, i.e., whether they were viewed, stored in the task's personal library or deleted. In addition, the intermediate solutions elaborated by the users are recorded. Finally, the timestamps when the user performs a particular action, from the beginning to the end of the task, are recorded.
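As a minimal sketch of how the records described above could be represented, the following Python fragment models a single logged action. The class name, field names and action labels are hypothetical, since the text does not specify the internal schema of GoNSA2, and the sample events are invented for illustration only.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional

# Hypothetical action labels; the platform's real schema is not given in the text.
ACTIONS = {"query", "view", "save", "delete", "draft_answer"}


@dataclass(frozen=True)
class SearchEvent:
    """One timestamped user action during a search task (illustrative only)."""
    user_id: str
    timestamp: datetime
    action: str
    query: Optional[str] = None         # set when action == "query"
    document_url: Optional[str] = None  # set for view/save/delete
    text: Optional[str] = None          # set for intermediate solutions

    def __post_init__(self):
        if self.action not in ACTIONS:
            raise ValueError(f"unknown action: {self.action}")


def session_duration(events):
    """Seconds elapsed between the first and last recorded action."""
    times = [e.timestamp for e in events]
    return (max(times) - min(times)).total_seconds()


# Invented sample session, not data from the study.
events = [
    SearchEvent("u01", datetime(2020, 1, 1, 10, 0, 0), "query",
                query="crime in Chile"),
    SearchEvent("u01", datetime(2020, 1, 1, 10, 2, 30), "save",
                document_url="https://example.org/a"),
    SearchEvent("u01", datetime(2020, 1, 1, 10, 12, 0), "draft_answer",
                text="..."),
]
print(session_duration(events))
```

Keeping one flat record per action, under these assumptions, makes it straightforward to derive per-user counts of queries, saved documents and elapsed time from the same log.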
For the analysis of lexical data, the Iramuteq program was used, an open access software based on the R program that offers textual and lexicometric statistics that can even be used as learning metrics (Valdés-León, 2021, p. 434).
The quasi-experiment is conducted in one session. At the beginning, the technical functionalities of GoNSA2 are presented. Then, the students have 15 minutes to perform the task described above. The second stage corresponds to the elaboration of analysis categories, based on the keywords of the queries issued by the students in the search engine. For the elaboration of these categories, the judgment of three specialists was used, who reviewed the data and grouped the words by applying semantic criteria. Subsequently, in the response stage, the same expert judgment methodology was used. Likewise, to enrich the comparison between the search and response generation instances, not only the categories present were contrasted, but also the lexicon used in each of the stages. For this purpose, a corpus containing the lexemes present in the searches carried out by the students was elaborated and, subsequently, the same was done with the texts generated to respond to the research task. This made it possible to carry out lexical comparisons of frequency (word cloud) and similarity. The purpose of the latter was "the study of the proximity and relationship between the elements of a set" (Ruiz, 2017).
The results section is organized as follows: first, we provide general information related to the global statistics that emerge after the activity; then, we present the distribution of the results in the search and response stages, considering the coding that has been established; and finally, we offer a lexical analysis that provides information regarding the frequency of occurrence and the network of lexical associations constructed in each stage.
Global statistics: characterization of student queries
The GoNSA2 platform offers detailed information on users' information search process, ranging, for example, from the number of participants to the average number of documents they saved. However, since our objective is to identify the relationship between the information search behavior on the Internet to solve a research task and the answers provided by a group of university students, we find particularly interesting the average number of unique queries per user and the average number of documents saved in the library (Table 1).
Table 1: Overall statistics of the experience.
Source: own elaboration.
The average number of queries per user shows that, to reach an answer to a complex research task, students perform between two and three searches. This is consistent with the results of Fuentes and Monereo (2008) regarding the processes developed by adolescents to find information, since their study indicates that this age group considers "the search process and the selection of information as not very relevant" (p. 51). This is reinforced by the low number of documents saved (close to two), which seems rather low given the complexity of the proposed task.
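The indicator discussed above, the average number of unique queries per user, can be sketched as follows. This is an illustration under stated assumptions: the function name, the normalization rule (lowercasing and collapsing whitespace) and the toy log are all hypothetical and do not reproduce the study's data or the platform's own computation.

```python
from collections import defaultdict
from statistics import mean


def mean_unique_queries(log):
    """Average number of distinct queries per user.

    `log` is an iterable of (user_id, query) pairs. Queries are normalized
    by lowercasing and collapsing whitespace (an assumed rule).
    """
    per_user = defaultdict(set)
    for user, query in log:
        per_user[user].add(" ".join(query.lower().split()))
    return mean(len(queries) for queries in per_user.values())


# Invented toy log, not data from the study.
toy_log = [
    ("u01", "how to fight crime in Chile"),
    ("u01", "How to fight crime  in Chile"),  # same query after normalization
    ("u01", "crime prevention programs"),
    ("u02", "crime statistics Chile"),
    ("u02", "police reform"),
    ("u02", "rehabilitation of offenders"),
]
print(mean_unique_queries(toy_log))
```

Counting distinct rather than raw queries avoids inflating the exploration measure when a student resubmits the same query.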
In the query stage, the codes with the highest presence are "description of crimes" (DD) and "proposals to reduce crime" (PRC), which are closely related to the words that make up the task statement (Figure 1). In other words, most students simply copied the proposed statement and pasted it into the search box, which corresponds to the description of the "copy-paste" student.
Figure 1: Comparison of queries and solutions. Source: own elaboration
As for the categories with a greater presence in the solution stage, it is interesting to note that they correspond to topics that emerged precisely at this stage. In other words, the contrast between the queries and the formulation of solutions shows a shift in the topics present in the corpora, which suggests that solving a complex search task involves a knowledge construction process mediated by cognitive factors, experience, age and gender, among others (Ford, Miller and Moss, 2001). On this basis, and considering the search characteristics of new university students, it is not surprising that the process starts from similar categories but that the elaborated answers show a greater dispersion of topics.
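The contrast between stages described above can be expressed as a simple set comparison. In this sketch, the codes DD and PRC come from the text, whereas EDU and POL are hypothetical labels invented for illustration; the function name is also an assumption.

```python
def category_shift(query_codes, answer_codes):
    """Split category codes into shared, query-only and emergent groups."""
    q, a = set(query_codes), set(answer_codes)
    return {
        "shared": q & a,           # present in both stages
        "only_in_queries": q - a,  # dropped after the search stage
        "emergent": a - q,         # appear only in the answers
    }


# DD and PRC are codes mentioned in the text; EDU and POL are hypothetical.
shift = category_shift(
    query_codes=["DD", "PRC", "PRC"],
    answer_codes=["PRC", "EDU", "POL", "EDU"],
)
print(sorted(shift["emergent"]))
```

Under this framing, a non-empty "emergent" group alongside a small "shared" group is what a topic shift between stages would look like.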
First of all, we would like to point out that we are aware that the corpora of queries and solutions have different characteristics: the former corresponds to a list of structures that do not go beyond the sentence level, while the latter is based on the set of short texts written by the students. However, if we take into consideration only the notional lexicon (i.e., leaving out words such as connectors, pronouns, articles, etc.), it is possible to identify which lexemes predominate in each of them and, thanks to this, corroborate the findings of the classification into categories.
Thanks to the above, it is possible to identify that the first scheme, which corresponds to the queries, shows an evident predominance of three lemmas (Chile, crime and reduce), which, as we pointed out in §3.2, coincide with the three main words used in the statement of the task. The dispersion of the topics addressed in the responses, however, is much wider, as emerging categories appear around concepts such as "person", "police" and "victim", for example.
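The frequency comparison restricted to the notional lexicon can be sketched as below. This is not Iramuteq's procedure: the stopword list is a small hypothetical sample, no lemmatization is performed, and the two mini-corpora are invented for illustration.

```python
import re
from collections import Counter

# Small hypothetical list of function words; a real analysis would use a
# full stopword list and lemmatization, which this sketch omits.
STOPWORDS = {"the", "a", "an", "of", "in", "to", "and", "how", "is", "for"}


def notional_frequencies(texts, stopwords=STOPWORDS):
    """Count content words across texts, ignoring case and function words."""
    counts = Counter()
    for text in texts:
        tokens = re.findall(r"[^\W\d_]+", text.lower())  # letter runs only
        counts.update(t for t in tokens if t not in stopwords)
    return counts


# Invented mini-corpora, not the students' actual queries or answers.
queries = ["how to fight crime in Chile", "reduce crime in Chile"]
answers = [
    "Improve the education of the person and support the victim.",
    "More police and rehabilitation for the person.",
]

print(notional_frequencies(queries).most_common(2))
print(notional_frequencies(answers).most_common(1))
```

Comparing the size of each vocabulary and its most frequent items gives a rough picture of how concentrated the queries are relative to the answers.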
The analysis of similarities "is based on graph theory, which allows identifying co-occurrences between words and its result brings indications of the connection between words, helping in the identification of the structure of a textual corpus" (Camargo and Justo, 2013, p.516). Thanks to this, it is possible to visualize not only a greater dispersion of words, but also a greater diversity in the solutions: at least four major areas in which the students' answers were concentrated are observed, answers that are influenced both by the results found and by aspects of a cognitive and even moral nature. By way of example, we can point out that there is a branch in which terms related to education (rehabilitation, improve, quality...) are agglutinated, while, in another, a much more punitive semantic field is constructed (crime, penal, increase, etc.).
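To make the idea of co-occurrence concrete, the following sketch counts how often two words appear in the same text segment and treats each count as the weight of an edge in a graph. It is an assumption-laden illustration: the function name and the toy segments are invented, and Iramuteq's own similarity analysis may compute and display these links differently.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_edges(segments):
    """Count, for each unordered word pair, the segments containing both."""
    edges = Counter()
    for words in segments:
        # sorted(set(...)) makes each pair unique and order-independent
        for a, b in combinations(sorted(set(words)), 2):
            edges[(a, b)] += 1
    return edges


# Invented segments that echo the example terms in the text.
segments = [
    ["education", "rehabilitation", "improve"],
    ["education", "improve", "quality"],
    ["crime", "penal", "increase"],
]
edges = cooccurrence_edges(segments)
print(edges.most_common(1))
```

In this toy case the education-related words link to one another but never to the punitive terms, which is the kind of separate branch the analysis describes.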
Based on these findings, it seems pertinent to highlight the position of those who advocate empowering students to use the vast knowledge that the web provides in order to learn and, in turn, to develop higher-level skills such as autonomous and critical thinking.
The objective of this research was to identify the relationship between the information search behavior on the Internet to solve a research task and the answers provided by a group of university students. In this regard, the findings indicate that this link is given by a) a shift between the topics used in the search and those observed in the response stage and b) an increase in the categories present in the latter stage.
The two aspects are interrelated. The shift from initial to emergent topics occurs because students first orient their search by using the keywords found in the task statement itself; however, both the results found and external factors (social, emotional, psychological, etc.) lead to varied solutions to the same problem, although in our study two topics predominate: social and criminal.
Based on the above, we consider that, in the case of solving research tasks such as the one presented here, the information search process is not only a means to reach a content, but represents a learning instance, as students mobilize skills such as critical and analytical thinking during the process of developing solutions to complex problems.