Peer Reviewed Research Manuscript
A Study of Relationship to Absentees and Score Using Machine Learning Method: A Case Study of Linear Regression Analysis
IARS' International Research Journal, vol. 12, no. 1, pp. 33-39, 2022
International Association of Research Scholars
Revised: 15 February 2022
Accepted: 27 February 2022
Published: 28 February 2022
Abstract: Classroom absenteeism among students is an international problem that is not confined to India. This research focuses on the relationship between students' class absences and their scores, and it was carried out using linear regression analysis, a widely used machine learning method. Descriptive statistics, Student's t-test, Pearson correlation, and regression models were used in the statistical analysis. According to the results, scores differ significantly with the number of absences (t = -4.06075, p < 0.05). The study also found that absenteeism from class is negatively correlated with score (r = -0.6088). A regression model was built to investigate the impact of class absences on student score. This study will benefit both the college administration and the students by raising awareness of the disadvantages of not attending classes.
Keywords: Linear Regression model, Machine Learning, Descriptive Statistical Analysis, Statistical Analysis.
I. Introduction
In today's technologically advanced world, a major difficulty in primary and middle schools, intermediate colleges, postgraduate colleges, and universities is the rising tendency of student absenteeism. India currently faces this problem as well. Absenteeism is described as the habit of failing to show up for class or an event without a valid justification, and the term absentee characterizes someone who does this regularly. Various studies and informal observation suggest that students who do not attend classes receive worse grades, whereas those with a higher attendance rate than their peers receive higher grades. A student's grade is thus influenced by his or her attendance.
Various studies have shown that successful students have a better attendance record as well as higher grades. According to Ahmat and Zahari's research, there is a negative relationship between absenteeism and grades, implying that more absences correspond to lower grades [1]. Yao and Chiang examine the link between students' class attendance and their total marks in a variety of computer science courses [2].
Zhang and Wang's research, using a linear regression model, reveals a positive correlation between test score and classroom attendance [3]. On the basis of test score and classroom attendance data collected over two years, they introduce regression analysis and plot correlation curves between the two variables [3]. Another regression model revealed a robust link between semester GPA and attendance percentage, as well as overall GPA [4]. Similarly, Narula and Nagar's research demonstrates a strong link between attendance and grades using machine learning methods [5]. Research at a Finnish university examines the relationship between attendance and performance using a clustering method [6]. Santibanez and Guarino examine how absenteeism affects student outcomes, using administrative panel data from California to estimate the pandemic's impact [7].
Khan et al. found that better performance in professional assessment tests among medical undergraduate students is negatively associated with absenteeism and positively correlated with a high attendance percentage [8]. Taken together, this evidence, much of it obtained through regression analysis, indicates a substantial positive relationship between attendance and grades.
In this study the assumptions are stated as hypotheses, an appropriate statistical model is used to test them, and relationships are proposed to show how the score changes as each factor changes.
In this research, we performed several statistical tests to draw conclusions about the relationships, and we used a regression model to obtain the least squares best-fit equation showing how the grade varies with each factor. The Pearson correlation coefficient measures how closely two variables are linearly related, if they are related at all. This research provides a machine learning based linear regression model that depicts how a student's score changes with absenteeism.
II. Machine Learning
Machine learning (ML) is a subfield of artificial intelligence and one of its applications. It gives systems the ability to learn and improve automatically from experience without being explicitly programmed. Machine learning focuses on developing computer programs that access data and use it to learn for themselves. The learning process starts with data and observations, such as labeled training and testing data, examples, direct experience, or instruction, in order to find patterns in the data. Using these examples, machine learning algorithms make better decisions in the future. The main purpose of machine learning methods is to allow computers to learn automatically, without human intervention or assistance, and to adjust their actions accordingly. In this research work we apply two approaches to the data analysis: machine learning based linear regression and statistical methods.
III. Linear Regression Analysis
Linear regression models the relationship between an independent variable (d) and a dependent variable (s). The equation of linear regression is given below:
$$ s = \beta_1 d + \beta_0 + \varepsilon $$
Here β1 and β0 are unknown parameters, known respectively as the gradient or slope of the line and the intercept on the y-axis, and ε is a normal random variable with a mean of zero and an unknown standard deviation σ. Note that this model is offered for the entire population of students enrolled in this course, not just those enrolled this semester and especially not just those in the sample. The parameters β1 and β0 describe this whole population and are also referred to as regression model parameters. They can be learned using the least squares approach, artificial neural networks, evolutionary algorithms, and other applicable learning approaches. In particular, least squares estimation and the gradient descent algorithm can both be used to learn β1 and β0.
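The two learning approaches named above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the learning rate, the number of steps, and the five data points are all assumptions chosen so that the example is easy to check by hand.

```python
def fit_least_squares(d, s):
    """Closed-form least squares estimates (beta0, beta1) for s = beta1*d + beta0."""
    n = len(d)
    d_mean = sum(d) / n
    s_mean = sum(s) / n
    ss_dd = sum((x - d_mean) ** 2 for x in d)
    ss_ds = sum((x - d_mean) * (y - s_mean) for x, y in zip(d, s))
    beta1 = ss_ds / ss_dd
    beta0 = s_mean - beta1 * d_mean
    return beta0, beta1


def fit_gradient_descent(d, s, learning_rate=0.05, steps=5000):
    """Estimate (beta0, beta1) by gradient descent on the mean squared error."""
    n = len(d)
    beta0, beta1 = 0.0, 0.0
    for _ in range(steps):
        # Signed prediction errors under the current parameters.
        errors = [beta0 + beta1 * x - y for x, y in zip(d, s)]
        grad0 = 2.0 * sum(errors) / n
        grad1 = 2.0 * sum(e * x for e, x in zip(errors, d)) / n
        beta0 -= learning_rate * grad0
        beta1 -= learning_rate * grad1
    return beta0, beta1


# Hypothetical data (NOT the study's Table 1): absences d and scores s.
d = [0, 1, 2, 3, 4]
s = [90, 85, 80, 75, 70]

ls_beta0, ls_beta1 = fit_least_squares(d, s)
gd_beta0, gd_beta1 = fit_gradient_descent(d, s)
print("least squares:   ", ls_beta0, ls_beta1)
print("gradient descent:", round(gd_beta0, 4), round(gd_beta1, 4))
```

On well-behaved data both routes should agree closely; the closed form is exact, while gradient descent depends on the assumed learning rate and step count.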
IV. Experimental Results and Discussion
In general, educators believe that class attendance has a considerable impact on course achievement, all other conditions being equal. To evaluate the relationship between attendance and performance, an education researcher chooses a multi-section basic computer science course at a large university, and the course instructors agree to keep an accurate record of attendance throughout the semester. For the experiments in this study, we took data on final-year BCA students of Dr. Rammanohar Lohia Avadh University. The sample size is 30: at the end of the semester, 30 students were selected at random out of 60. Two measurements were taken for each student in the sample: the number of absences (d) and the score (s).
Table 1 shows the dataset of 30 students, and Figure 1 depicts a scatter plot of the data.
Table 2 shows the calculation of d², ds, and s². In Table 2 we have added the numbers in each column; these column sums are used to generate the trendline, or predicted line.
The column sums in Table 2 give the following expressions:
We start by determining the least squares regression line, or predicted line, which is the line that best fits the data. Its y-intercept and slope are given below:
The least squares regression line for this data, rounded to two decimal places, is:
Figure 2 depicts the fitted regression or predicted line of 30 students’ data.
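For reference, the standard textbook formulas that turn column sums such as those in Table 2 into a fitted line are sketched below. The symbols SS_dd, SS_ds, and SS_ss are notation assumed here for illustration; the study's own numerical expressions are not reproduced.

```latex
\begin{aligned}
SS_{dd} &= \sum d^2 - \frac{\left(\sum d\right)^2}{n}, &
SS_{ds} &= \sum ds - \frac{\left(\sum d\right)\left(\sum s\right)}{n}, &
SS_{ss} &= \sum s^2 - \frac{\left(\sum s\right)^2}{n}, \\[4pt]
\hat{\beta}_1 &= \frac{SS_{ds}}{SS_{dd}}, &
\hat{\beta}_0 &= \bar{s} - \hat{\beta}_1 \bar{d}, &
\hat{s} &= \hat{\beta}_1 d + \hat{\beta}_0 .
\end{aligned}
```

Here n is the sample size, and \bar{d} and \bar{s} are the sample means of the absences and the scores.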
Figure 2 also shows a decreasing trend, indicating that students with more absences perform worse on the final exam on average. The sum of the squared errors (SSE), which measures this line's goodness of fit to the scatter plot, is:
This number is large and not particularly informative on its own, but we use it to calculate a crucial statistic:
The statistic s_ε estimates the standard deviation σ of the model's normal random variable ε. It means that the standard deviation of final exam scores for all students with the same number of absences is around 13.85 points. Because 13.85 points is a large spread on a 100-point exam, the final exam scores within each sub-population of students are highly variable. The size and sign of the slope of the predicted line imply that, on average, students score 4.54 points lower on the final exam for each class missed. Similarly, students tend to score 2 × 4.54 = 9.08 points lower for every two classes missed, or around a letter grade lower on average. The s-intercept is meaningful in this situation because 0 is inside the range of d-values in the data set. It estimates the average final exam score of all students who have perfect attendance; that is, the estimated intercept is the anticipated average for such students.
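The relationship between the sum of squared errors and the statistic s_ε can be sketched in a few lines. This is an illustrative example under stated assumptions: the helper names and the five data points are hypothetical and are not the study's data, and the formula used is the usual s_ε = sqrt(SSE / (n - 2)).

```python
import math


def fit_line(d, s):
    """Least squares intercept and slope for s = beta1*d + beta0."""
    n = len(d)
    d_mean = sum(d) / n
    s_mean = sum(s) / n
    ss_dd = sum((x - d_mean) ** 2 for x in d)
    ss_ds = sum((x - d_mean) * (y - s_mean) for x, y in zip(d, s))
    beta1 = ss_ds / ss_dd
    return s_mean - beta1 * d_mean, beta1


def sse_and_std_error(d, s, beta0, beta1):
    """Return (SSE, s_eps), where s_eps = sqrt(SSE / (n - 2))."""
    sse = sum((y - (beta0 + beta1 * x)) ** 2 for x, y in zip(d, s))
    return sse, math.sqrt(sse / (len(d) - 2))


# Hypothetical data (NOT the study's Table 1).
d = [0, 1, 2, 3, 4]
s = [88, 87, 78, 77, 70]

beta0, beta1 = fit_line(d, s)
sse, s_eps = sse_and_std_error(d, s, beta0, beta1)
print("slope:", round(beta1, 2), "intercept:", round(beta0, 2))
print("SSE:", round(sse, 2), "s_eps:", round(s_eps, 3))
```

The divisor n - 2 reflects the two parameters estimated from the data; with the study's 30 students it would be 28.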
Before we go any further with the regression equation or run any other analysis, it is a good idea to examine how useful the linear regression model is. This can be accomplished in two ways:
The correlation coefficient is r = -0.6088.
There is a moderately negative relationship between the two variables: as a student's absences increase, the score tends to decrease. Scores and absences are the two key variables in this study. The hypotheses for absenteeism and student scores are stated below:
Null Hypothesis (H0): Absences do not affect scores.
Alternative Hypothesis (Ha): Absences affect scores.
We have stated two hypotheses, and we now use a t-test to decide between the null and the alternative hypothesis. The t-test is performed to see whether there is a linear relationship between absences and student score. In simple linear regression, t-tests are used to test hypotheses about the regression coefficients. The hypothesis that the true slope, β1, equals some constant value, β1,0, is tested using a statistic based on the t distribution. We carry out the test at the commonly used 5% level of significance. The hypothesis test statements are written as follows:
From the table "Critical Values of t" with d.f. = 30 - 2 = 28 degrees of freedom, t0.05 = 1.701, so the rejection region is (-∞, -1.701].
The value of the standardized test statistic is t = -4.06075.
This value lies in the rejection region, so we reject H0 in favor of Ha. At the 5% level of significance, the data support the conclusion that β1 is negative, implying that as the number of absences grows, the average final exam score declines. As previously stated, the estimated slope is a point estimate of how much one additional absence affects the average final exam score: the average falls by around 4.54227 points for each additional absence.
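The slope test described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' computation: the function names and the five data points are hypothetical, and because the toy sample has only 5 - 2 = 3 degrees of freedom, the critical values used are the standard t-table entries for 3 degrees of freedom (2.353 one-sided and 3.182 two-sided at the 5% level), not the values for the study's 28 degrees of freedom.

```python
import math


def slope_t_test(d, s):
    """Return (beta1, se_beta1, t_stat) for the test of H0: beta1 = 0."""
    n = len(d)
    d_mean = sum(d) / n
    s_mean = sum(s) / n
    ss_dd = sum((x - d_mean) ** 2 for x in d)
    ss_ds = sum((x - d_mean) * (y - s_mean) for x, y in zip(d, s))
    beta1 = ss_ds / ss_dd
    beta0 = s_mean - beta1 * d_mean
    sse = sum((y - (beta0 + beta1 * x)) ** 2 for x, y in zip(d, s))
    s_eps = math.sqrt(sse / (n - 2))       # residual standard error
    se_beta1 = s_eps / math.sqrt(ss_dd)    # standard error of the slope
    return beta1, se_beta1, beta1 / se_beta1


def slope_confidence_interval(beta1, se_beta1, t_crit):
    """Two-sided interval beta1 +/- t_crit * se_beta1."""
    half_width = t_crit * se_beta1
    return beta1 - half_width, beta1 + half_width


# Hypothetical data (NOT the study's Table 1).
d = [0, 1, 2, 3, 4]
s = [88, 87, 78, 77, 70]

T_CRIT_ONE_SIDED = 2.353   # t_0.05 with 3 degrees of freedom (t table)
T_CRIT_TWO_SIDED = 3.182   # t_0.025 with 3 degrees of freedom (t table)

beta1, se_beta1, t_stat = slope_t_test(d, s)
reject_h0 = t_stat <= -T_CRIT_ONE_SIDED    # left-tailed rejection region
low, high = slope_confidence_interval(beta1, se_beta1, T_CRIT_TWO_SIDED)
print("t =", round(t_stat, 3), "reject H0:", reject_h0)
print("95% CI for slope:", round(low, 2), "to", round(high, 2))
```

The same standard error drives both the test statistic and the confidence interval for the slope, which is why the two analyses usually lead to the same conclusion.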
The frequency of absences is visualized as a histogram in Figure 3.
Figure 3: Histogram of absences with their frequency
We can observe from this histogram that the majority of students have absences in the range of zero to three percent. We can expand the point estimate of β1 to a confidence interval. From the table "Critical Values of t" with d.f. = 30 - 2 = 28 degrees of freedom, t_{α/2} = t_{0.025} = 2.048 at the 95 percent confidence level. Based on our sample data, the 95 percent confidence interval for β1 is:
We are 95 percent confident that, among all students who have ever taken this course, the average final exam score drops by 2.94 to 7.52 points for each extra class missed. If we focus on the sub-population of all students who had exactly five absences, we may estimate their average final exam score using the least squares regression equation:
This is also our best estimate of the final exam score of a student who is absent five times. A 95% confidence interval for the average final exam score of all students with five absences is:
According to this confidence interval, the true mean final exam score for all students who miss class exactly five times over the semester is expected to be between 59.92 and 75.14. If a particular student misses exactly five classes during the semester, we can also form a 95 percent prediction interval for his or her individual final exam score.
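For reference, the standard textbook forms of the two intervals are sketched below. The notation is assumed for illustration: d_p is the number of absences of interest (here five), \hat{s}_p is the fitted score at d_p, s_ε is the residual standard error, SS_dd is the sum of squared deviations of d, and t_{α/2} is taken with n - 2 degrees of freedom.

```latex
\begin{aligned}
\text{Confidence interval for the mean score at } d_p:\quad
& \hat{s}_p \pm t_{\alpha/2}\, s_{\varepsilon}
  \sqrt{\frac{1}{n} + \frac{(d_p - \bar{d})^2}{SS_{dd}}} \\[6pt]
\text{Prediction interval for an individual score at } d_p:\quad
& \hat{s}_p \pm t_{\alpha/2}\, s_{\varepsilon}
  \sqrt{1 + \frac{1}{n} + \frac{(d_p - \bar{d})^2}{SS_{dd}}}
\end{aligned}
```

The only difference between the two is the additional 1 under the square root, which accounts for the variability of a single student's score around the mean.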
This prediction interval indicates that this student's final exam score will most likely fall somewhere between 38.16 and 96.90. Unlike the 95 percent confidence interval for the average score of all students with five absences, which provided useful information, this interval is so wide that it reveals almost nothing about the final exam score of any particular student. The extra summand 1 under the square root sign in the prediction interval has a dramatic effect in this case. Finally, the coefficient of determination, r², estimates the proportion of the variability in students' final exam scores that is explained by the linear relationship between that score and the number of absences. Since we have already calculated r, we can readily deduce:
As a result, the regression model explains about 37 percent of the variability in the final exam scores, indicating a moderate fit of the regression model. Although there is a clear link between attendance and final exam performance, and although we can estimate the average score of students who miss a specific number of classes with reasonable accuracy, the number of absences accounts for less than half of the total variability in exam scores in the sample. This is hardly surprising, given that student exam performance is influenced by a variety of factors other than attendance.
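The link between r and the coefficient of determination can be checked with a short sketch. The five data points and the function name are hypothetical and serve only to illustrate the computation; the only figure taken from the study is the reported r = -0.6088.

```python
import math


def pearson_r(d, s):
    """Pearson correlation coefficient between two equal-length samples."""
    n = len(d)
    d_mean = sum(d) / n
    s_mean = sum(s) / n
    ss_dd = sum((x - d_mean) ** 2 for x in d)
    ss_ss = sum((y - s_mean) ** 2 for y in s)
    ss_ds = sum((x - d_mean) * (y - s_mean) for x, y in zip(d, s))
    return ss_ds / math.sqrt(ss_dd * ss_ss)


# Hypothetical data (NOT the study's Table 1).
d = [0, 1, 2, 3, 4]
s = [88, 87, 78, 77, 70]

r = pearson_r(d, s)
r_squared = r ** 2
print("toy data: r =", round(r, 4), "r^2 =", round(r_squared, 4))

# Squaring the correlation reported in the study gives its r^2.
reported_r = -0.6088
reported_r_squared = round(reported_r ** 2, 4)
print("reported r^2 =", reported_r_squared)  # 0.3706, i.e. about 37 percent
```

In simple linear regression, r² equals 1 - SSE / SS_ss, so the same quantity can also be obtained from the sum of squared errors.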
A residual plot is a graph in which the residuals are displayed on the vertical axis and the independent variable is displayed on the horizontal axis. A linear regression model is appropriate for the data if the points in a residual plot are scattered randomly around the horizontal axis; otherwise, a nonlinear model is more suitable. Table 3 shows the output of the linear regression model and the residuals.
Figure 4 shows that the residual plot displays a random pattern: some residuals are positive, while others are negative. This random pattern implies that the data are well fit by the proposed linear model.
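The residuals behind such a plot can be computed as in the sketch below. The data and helper names are hypothetical, not the contents of Table 3; the sketch only shows that each residual is the observed score minus the fitted score, and that least squares residuals sum to zero with a mix of signs.

```python
def fit_line(d, s):
    """Least squares intercept and slope for s = beta1*d + beta0."""
    n = len(d)
    d_mean = sum(d) / n
    s_mean = sum(s) / n
    ss_dd = sum((x - d_mean) ** 2 for x in d)
    ss_ds = sum((x - d_mean) * (y - s_mean) for x, y in zip(d, s))
    beta1 = ss_ds / ss_dd
    return s_mean - beta1 * d_mean, beta1


def residuals(d, s, beta0, beta1):
    """Observed score minus fitted score for every observation."""
    return [y - (beta0 + beta1 * x) for x, y in zip(d, s)]


# Hypothetical data (NOT the study's Table 3).
d = [0, 1, 2, 3, 4]
s = [88, 87, 78, 77, 70]

beta0, beta1 = fit_line(d, s)
res = residuals(d, s, beta0, beta1)
positives = sum(1 for e in res if e > 0)
negatives = sum(1 for e in res if e < 0)

for absences, e in zip(d, res):
    print(absences, round(e, 2))
print("positive:", positives, "negative:", negatives)
```

Plotting these pairs with d on the horizontal axis and the residual on the vertical axis gives the residual plot; a curved or funnel-shaped pattern would suggest that a straight line is not adequate.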
V. Conclusion
The findings of this study indicate that absence has a major impact on academic achievement. The research was carried out in depth and includes data visualization and statistical modelling. The findings revealed a moderately negative relationship between the number of absences and the final score (r = -0.6088, p = 0.00036, which is less than 0.05) and a significant difference in exam scores between students who missed 22% or fewer of their classes and students who missed more than 23% of their classes (t = -4.06075, p < 0.05). The key finding was that for each class a student misses, the final exam score is projected to drop by 4.54 points on average. It is hoped that the findings of this study will help colleges and universities plan for students to graduate on time. Furthermore, this study has the potential to raise student awareness of the impact of missing classes on academic performance.
VI. References
[1] Nurhafizah Ahmad, Ahmad Zia Ul-Saufie, Siti Asmah Mohamed, Hasfazilah Ahmat and Mohd Fahmi Zahari. The Impact of Class Absenteeism on Student's Academic Performance using Regression Models. Conference Proceeding of the 25th National Symposium on Mathematical Sciences, American Institute of Physics, (June 2018), pp. 1-5. https://aip.scitation.org/doi/abs/10.1063/1.5041712
[2] Jenq-Foung "Jf" Yao and Tsu-Ming Chiang. Correlation Between Class Attendance and Grade. Journal of Computing Sciences in Colleges, Vol. 27, Issue 2, (December 2011), pp. 142-147. https://dl.acm.org/doi/abs/10.5555/2038836.2038857
[3] Chengcheng Zhang and Fei Wang. Research on Correlation Analysis between Test Score and Classroom Attendance Based on Linear Regression Model. Proceeding of 2010 2nd International Conference on Industrial Mechatronics and Automation (ICIMA2010), IEEE, (May 2010), pp. 545-548. https://ieeexplore.ieee.org/document/5538079
[4] Suleiman Obeidat, Adnan Bashir and Wisam Abu Jadayil. The Importance of Class Attendance and Cumulative GPA for Academic Success in Industrial Engineering Classes. World Academy of Science, Engineering and Technology International Journal of Humanities and Social Sciences, Vol. 6, No. 1, (2012), pp. 120-123. https://publications.waset.org/2974/the-importance-of-class-attendance-and-cumulative-gpa-for-academic-success-in-industrial-engineering-classes
[5] Meenakshi Narula and Pankaj Nagar. Relationship Between Students' Performance and Class Attendance in a Programming Language Subject in a Computer Course. International Journal of Computer Science and Mobile Computing, Vol. 2, Issue 8, (August 2013), pp. 206-210. https://www.ijcsmc.com/docs/papers/August2013/abstracts/V2I8201318.pdf
[6] Anna Lukkarinen, Paula Koivukangas and Tomi Seppala. Relationship between class attendance and student performance. 2nd International Conference on Higher Education Advances, HEAd'16, Procedia - Social and Behavioral Sciences 228, (2016), pp. 341-347. https://www.sciencedirect.com/science/article/pii/S1877042816309776
[7] Lucrecia Santibanez and Cassandra M. Guarino. The Effects of Absenteeism on Academic and Social-Emotional Outcomes: Lessons for COVID-19. Educational Researcher, Vol. 50, No. 6, (February 25, 2021), pp. 392-400. https://journals.sagepub.com/doi/10.3102/0013189X21994488
[8] Yousaf Latif Khan, Sohail Khursheed Lodhi, Shahzad Bhatti and Waqas Ali. Does Absenteeism Affect Academic Performance Among Undergraduate Medical Students? Evidence From "Rashid Latif Medical College (RLMC)." Advances in Medical Education and Practice, Vol. 10, Issue 1, (2019), pp. 999-1008. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6897060/