About the effects of combining Latent Semantic Analysis with natural language processing techniques for free-text assessment*
Received: 30-VI-2005
Accepted: 7-X-2005
Revista Signos 2005, 38(59), 325-343

Correspondence: Diana Pérez (diana.perez@uam.es). Tel.: (34-91) 4972267. Fax: 4972235. Departamento de Ingeniería Informática, Escuela Politécnica Superior, Universidad Autónoma de Madrid. Ciudad Universitaria de Cantoblanco, Calle Francisco Tomás y Valiente 11, Madrid, España.

* Project TIN2004-03140. Spanish Ministry of Science and Technology.
Diana Pérez, Enrique Alfonseca & Pilar Rodríguez
Universidad Autónoma de Madrid, España

Alfio Gliozzo, Carlo Strapparava & Bernardo Magnini
Instituto para la Investigación Científica y Tecnológica, Italia
Abstract:
This article presents the combination of Latent Semantic Analysis (LSA) with other natural language processing techniques (stemming, removal of closed-class words and word sense disambiguation) to improve the automatic assessment of students' free-text answers. The combination schema has been tested in the experimental framework provided by the free-text Computer Assisted Assessment (CAA) system called Atenea (Alfonseca & Pérez, 2004). This system is able to ask the student an open-ended question, chosen randomly or according to the student's profile, and then assign a score to the answer. The results prove that, for all datasets, when the NLP techniques are combined with LSA, the Pearson correlation between the scores given by Atenea and the scores given by the teachers for the same dataset of questions improves. We believe that this is due to the complementarity between LSA, which works at a shallow semantic level, and the rest of the NLP techniques used in Atenea, which are more focused on the lexical and syntactic levels.
Keywords:
LSA, free-text assessment, computer assisted assessment, e-learning.
On the effects of combining Latent Semantic Analysis with other natural language processing techniques for the assessment of open-ended questions

Abstract (translated from the Spanish resumen):
This article presents the combination of Latent Semantic Analysis (LSA) with other natural language processing techniques (stemming, removal of closed-class words and word sense disambiguation) to improve the automatic assessment of free-text answers. The free-text assessment system called Atenea (Alfonseca & Pérez, 2004) has served as the experimental framework to test the combination schema. Atenea is a system capable of asking questions, chosen randomly or according to the student's profile, and assigning them a numerical score. The experimental results show that, for all the datasets in which the NLP techniques were combined with LSA, the Pearson correlation between the scores given by Atenea and the scores given by the teachers for the same set of questions improves. The cause may lie in the complementarity between LSA, which works at a shallow semantic level, and the rest of the NLP techniques used in Atenea, which are more focused on the lexical and syntactic levels.

Keywords: LSA, open-ended questions, computer assisted assessment, e-learning.
INTRODUCTION
Computer Assisted Assessment (CAA) can be defined as the field that studies how to use computers effectively in the assessment of student learning. This is a very general definition, since CAA is a broad field that ranges from systems as simple as HTML forms to sophisticated systems with complex assessment and feedback processes. In this work we focus on the assessment of open-ended questions because, according to the general opinion in the field, it is also necessary to address this kind of assessment in order to fully assess the student learning process.
On the other hand, automatically assessing free-text answers is a difficult task. Although it has been studied since the 60s (Page, 1966), real progress was not shown until the late 90s, when Natural Language Processing (NLP) techniques were applied to assess students' free-text answers for the first time.
Nowadays, more than fifteen different systems have appeared to tackle this task (Valenti, Neri & Cucchiarelli, 2003). Although they are based on different techniques, all of them share a common core idea: a student's answer should receive a higher score when it is closer to a reference, or to a group of references, written by an expert in the topic or collected from other sources such as textbooks or the Internet. It is important to note that students can write the same idea in hundreds of different ways. Due to this paraphrasing problem, the automatic scores depend highly on the quality of these references.
In particular, our approach relies on the combination of statistical NLP techniques (stemming, removal of closed-class words and Word Sense Disambiguation) and Latent Semantic Analysis (LSA). All of these modules are integrated into Atenea (Alfonseca & Pérez, 2004), the system that we have developed for evaluating students' free-text answers written both in Spanish and in English.
LSA is a statistical method for inferring meaning from a text. It has already been used for evaluating students' free-text answers with good results (Foltz, Laham & Landauer, 1999; Dessus, Lemaire & Vernier, 2000). In this work we study different possibilities for combining our CAA system Atenea with LSA. Atenea currently has several modules, including one that calculates n-gram statistics between the student's answer and the references, and several NLP components. In previous work, we have shown that combining Atenea's statistical module with LSA achieves better results than using each of them independently (Pérez, Gliozzo, Strapparava, Alfonseca, Rodríguez & Magnini, 2005). In this paper, our motivation is to study how the use of the NLP components in the combination with LSA affects the results. Hence, we present different configurations of Atenea and how the results vary when the LSA module is used.
The paper is organised as follows: Section 1 presents the state of the art of Computer Assisted Assessment of students' free-text answers and the use of LSA for this task; Section 2 describes the Atenea system without LSA; Section 3 details the LSA configuration used; and Section 4 focuses on its integration with other NLP techniques within Atenea. Finally, the article ends with the main conclusions and some lines of future work.
1. Review of the state-of-the-art
1.1. Free-text CAA
Table 1 presents several of the existing free-text CAA systems, with the technique that underpins each of them and the results provided by their authors. It is important to highlight that the results are not fully comparable, because these systems use different corpora and metrics.
| System | Reference | Technique | Results | Domain |
|---|---|---|---|---|
| PEG | (Page, 1966) | Statistical | Corr: .87 | Non-factual disciplines |
| E-rater | (Burstein et al., 1998) | Statistical and NLP | Agr: .97 | Non-native English writing |
| Larkey's system | (Larkey, 1998) | Text Categorization | EAgr: .55 | Social and opinion |
| IEA | (Foltz, Laham & Landauer, 1999) | LSA | Agr: .85 | Psychology and military |
| SEAR | (Christie, 1999) | Information Extraction | Corr: .45 | History |
| Apex Assessor | (Dessus, Lemaire & Vernier, 2000) | LSA | Corr: .59 | Sociology of education |
| IEMS | (Ming, Mikhailov & Kuan, 2000) | Indextron | Corr: .8 | Non-mathematical |
| IntelliMetric | (Vantage, 2000) | NLP | Agr: .98 | K-12 and creative writing |
| ATM | (Callear, Jerrams-Smith & Soh, 2001) | Information Extraction | Not available | Factual disciplines |
| C-rater | (Burstein, Leacock & Swartz, 2001) | NLP | Agr: .83 | Comprehension & algebra |
| Automark | (Mitchell, Russell, Broomhead & Aldridge, 2002) | Information Extraction | Corr: .95 | Science |
| BETSY | (Rudner & Liang, 2002) | Bayesian networks | CAcc: .77 | Any text classification |
| PS-ME | (Mason & Grove-Stephenson, 2002) | NLP | Not available | NCA or GCSE exam |
| CarmelTC | (Rosé, Roque & VanLehn, 2003) | Machine Learning and Bayes classif. | f-S: .85 | Physics |
| Auto-marking | (Sukkarieh, Pulman & Raikes, 2003) | NLP & Pattern Match. | EAgr: .88 | Biology |
| Atenea | (Alfonseca & Pérez, 2004) | Statistical, NLP and LSA | Corr: .56 | Computer Science |

Table 1. Overview of the main features of several free-text CAA systems¹.
Nonetheless, it can be seen that the values achieved reflect the great amount of international research that has been dedicated to the field in recent years, a fact that has even given rise to commercial systems. Institutions such as the Scottish Qualifications Authority (SQA) and the Computer-Assisted Assessment (CAA) Centre in the U.K., or the Department of Education and the Educational Testing Service (ETS) in the United States, support the research in this field. In fact, CAA has many possibilities of application, among them: assigning problems to students, scoring them (summative assessment), returning feedback (formative assessment), and evaluating the assessment effectiveness (Blayney & Freeman, 2003). In particular, free-text CAA systems offer the following capabilities: creation of links to theoretical explanations to clarify the weak points exposed by the assessment of the students' free-text answers, support for teachers who cannot assist a large number of students, help for students by showing them where their mistakes are, and instantaneous feedback.
1.2. Application of LSA for free-text CAA
Though LSA was not originally created for assessing free-text answers, its ability to capture the semantic similarity between texts and to provide content-based feedback makes it particularly suited to e-assessment (Miller, 2003; Haley, Pete, Nuseibeh, Taylor & Lefrere, 2003).
Therefore, this technique has been used in several of the aforementioned free-text CAA systems. In particular, the Intelligent Essay Assessor (IEA, Foltz et al., 1999) and the Apex Assessor (Dessus et al., 2000) rely on LSA as the core technique of their systems, with good results.
2. Atenea
Atenea (Alfonseca & Pérez, 2004) is a CAA system for automatically scoring students' short answers written in Spanish or in English. It relies on the combination of shallow NLP techniques and statistically-based evaluation procedures. Figure 1 shows a snapshot of the interface, in Spanish, of the on-line version of Atenea.
After a student logs into the system, Atenea chooses a question according to his or her profile (Pérez & Alfonseca, 2005). This profile takes into account the student's previous performance on other questions in a test set, and on other test sets. It is also possible for the teacher to define alternative reference texts depending on stereotypes (e.g. age, language, experience, etc.), which will also adapt the assessment process.
Once the question and the reference answers have been chosen by the system, it compares the student's answer with the ideal answers to see how similar they are. The student receives a numerical score, and the answer is marked up with a colour background indicating which portions have more coincidences with the reference texts. From this output, students can discern their weak points. Figure 2 shows an example of a feedback page.

Figure 1. Interface of Atenea.

Figure 2. Feedback for a student answer (the Spanish statement "Tu nota es un:" can be translated as "Your score is:" and "Tu texto corregido es:" as "Your processed answer is:").
A web-based wizard (see Figure 3) has been developed to facilitate the task of introducing new datasets of questions. It allows augmenting an existing dataset or creating a new one; modifying existing questions or adding new ones; and modifying existing question statements, maximum scores or references.

Figure 3. Atenea question authoring tool.
The system can also be retrained so that the references for each question can be chosen from those written by the teacher and from the best answers written by other students, in a way that maximises the accuracy of the assessment, as the sketch below illustrates.
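The paper does not detail this retraining procedure, so the following is only a minimal illustrative sketch of one plausible reading, a greedy search: student answers are promoted to references while the Pearson correlation with the teachers' scores keeps improving. The function `score` is a hypothetical stand-in for the ERB scorer, and all names are our own.

```python
from scipy.stats import pearsonr

def retrain_references(teacher_refs, answers, teacher_scores, score, max_refs=10):
    """Greedily add student answers as extra references while the Pearson
    correlation between automatic and teacher scores keeps improving."""
    refs = list(teacher_refs)

    def corr(candidate_refs):
        automatic = [score(candidate_refs, a) for a in answers]
        return pearsonr(automatic, teacher_scores)[0]

    best = corr(refs)
    pool = list(answers)
    while len(refs) < max_refs and pool:
        # Try each remaining answer as an extra reference; keep the best gain.
        gain, chosen = max((corr(refs + [c]), c) for c in pool)
        if gain <= best:
            break
        best, refs = gain, refs + [chosen]
        pool.remove(chosen)
    return refs, best
```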
The internal architecture of Atenea is composed of a statistical module, called ERB, and
several Natural Language Processing (NLP) modules based on the wraetlic tools (Alfonseca,
2003).
The statistical module of Atenea relies on the BiLingual Evaluation Understudy (BLEU)
algorithm (Papineni, Roukos, Ward & Zhu, 2001). Basically, it looks for n-gram coincidences
between the student’s answer and the references. Its pseudocode is as follows:
1. For several values of $n$ (usually from 1 to 3), calculate the Modified Unified Precision (MUP) of the student's answer, i.e. the percentage of n-grams from the student's answer which appear in any of the references:

$$\mathrm{MUP}(n) = \frac{\sum_{i=0}^{num} \min(count_i,\, maxRC_i)}{lCand}$$
where $count_i$ is the frequency of the $i$-th n-gram in the student's answer, $maxRC_i$ is the maximum frequency of that n-gram in a reference text, and $lCand$ is the length of the student's answer.
2. Combine the MUPs obtained for each value of $n$ as:

$$combMUP = \frac{\sum_{n=1}^{3} \log \mathrm{MUP}(n)}{3}$$
3. Apply a brevity factor to penalise texts shorter than all the references:

$$BP = e^{(1 - lRef/lCand)} \quad \text{if } lCand < lRef$$

4. The final BLEU score is:

$$\mathrm{BLEU} = BP \times e^{combMUP}$$
BLEU is basically a precision metric: it measures which n-grams in the candidate answer appear in the references. In the case of scoring students' answers, we want the answer to be both correct and complete. Therefore, we have modified this metric to calculate as well the percentage of the references that is covered by the student's answer. To do that, the Brevity Factor is substituted by a Modified Brevity Penalty (MBP), calculated in the following way: for each reference, calculate the percentage of its n-grams that is covered by the candidate text and, next, add up all those percentages. Figure 4 shows an example in which 5% of the first reference, 10% of the second one and 20% of the third one appear in the student's answer. Therefore, we can assume that 35% of a complete answer is covered by the candidate text. The results using this MBP clearly outperform those obtained using the original algorithm.
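As a concrete illustration, here is a minimal Python sketch of this modified BLEU scoring. It follows the pseudocode above (clipped n-gram precision, the log-averaged combMUP, and the MBP obtained by summing the covered share of each reference); the function names, the unigram-based coverage and the epsilon guard are our own simplifications, not Atenea's implementation.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def mup(candidate, references, n):
    """Modified Unified Precision: clipped n-gram precision (step 1)."""
    cand = Counter(ngrams(candidate, n))
    max_ref = Counter()
    for ref in references:
        for g, c in Counter(ngrams(ref, n)).items():
            max_ref[g] = max(max_ref[g], c)
    return sum(min(c, max_ref[g]) for g, c in cand.items()) / max(len(candidate), 1)

def mbp(candidate, references, n=1):
    """Modified Brevity Penalty: add up, over all references, the
    percentage of each reference covered by the candidate."""
    cand = set(ngrams(candidate, n))
    return sum(sum(g in cand for g in ngrams(ref, n)) / max(len(ngrams(ref, n)), 1)
               for ref in references)

def erb_score(candidate, references, max_n=3):
    """Steps 2-4: combMUP = mean of log MUP(n); score = MBP * exp(combMUP)."""
    comb = sum(math.log(max(mup(candidate, references, n), 1e-9))
               for n in range(1, max_n + 1)) / max_n
    return mbp(candidate, references) * math.exp(comb)
```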
Concerning the NLP modules, it is possible to configure Atenea to indicate which modules to use. The possibilities that they add to the system are the following (a preprocessing sketch follows the list):

· Stemming: matching nouns and verbs inflected in different ways.
· Removal of closed-class words: ignoring functional words that are irrelevant to the meaning of the student's answer.
· Word-Sense Disambiguation: identifying the sense intended by both the teacher and the student and thus checking whether they coincide.
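The sketch below illustrates these three options. Atenea implements them with the wraetlic tools; here NLTK's Porter stemmer, stopword list and Lesk-based WSD merely stand in, so the exact behaviour (and the function `preprocess` itself) is our assumption, not the paper's code.

```python
# Requires: nltk.download('punkt'), nltk.download('stopwords'), nltk.download('wordnet')
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.wsd import lesk

def preprocess(text, stem=False, drop_closed_class=False, wsd=False):
    tokens = nltk.word_tokenize(text)
    if drop_closed_class:
        # Removal of closed-class words: drop functional words.
        stop = set(stopwords.words('english'))
        tokens = [t for t in tokens if t.lower() not in stop]
    if wsd:
        # Word-Sense Disambiguation: replace each word by a WordNet sense
        # identifier when one can be found (in the paper's configurations
        # WSD is not combined with stemming).
        tokens = [(s.name() if (s := lesk(tokens, t)) else t) for t in tokens]
    elif stem:
        # Stemming: collapse different inflections to a common form.
        stemmer = PorterStemmer()
        tokens = [stemmer.stem(t) for t in tokens]
    return tokens
```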
3. LSA
The simplest technique to estimate the similarity between the topics of two texts is the Vector Space Model (VSM), which is based on the calculation of the cosine between the vectors that represent each text. Let $\Gamma = \{t_1, t_2, \ldots, t_n\}$ be a corpus, $V = \{w_1, w_2, \ldots, w_k\}$ its vocabulary, and $T$ the $k \times n$ term-by-document matrix representing $\Gamma$, such that $t_{i,j}$ is the frequency of word $w_i$ in the text $t_j$. The VSM is a $k$-dimensional space $\mathbb{R}^k$, in which the text $t_j \in \Gamma$ is represented by means of the vector $\vec{t}_j$ such that the $i$-th component of $\vec{t}_j$ is $t_{i,j}$.
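A minimal sketch of this classical VSM follows; it is our own illustrative code (assuming whitespace-tokenized texts), building the $k \times n$ term-by-document matrix $T$ and comparing two of its column vectors with the cosine.

```python
import numpy as np

def term_by_document(texts):
    """Build the k x n term-by-document matrix T of raw term frequencies."""
    vocab = sorted({w for text in texts for w in text.split()})
    index = {w: i for i, w in enumerate(vocab)}
    T = np.zeros((len(vocab), len(texts)))
    for j, text in enumerate(texts):
        for w in text.split():
            T[index[w], j] += 1
    return T, vocab

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Column j of T is the VSM vector of text j.
T, vocab = term_by_document(["hiv is a virus", "he is affected by aids"])
print(cosine(T[:, 0], T[:, 1]))  # low: only the shared word "is" contributes
```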
However, this approach does not deal well with lexical variability and ambiguity. For example, the two sentences "he is affected by AIDS" and "HIV is a virus" do not have any words in common and thus, using the VSM, their similarity is zero because they have orthogonal vectors, although the concepts they express are very closely related. On the other hand, the similarity between the two sentences "the laptop has been infected by a virus" and "HIV is a virus" would turn out very high, due to the ambiguity of the word virus. To overcome this problem, the notion of Domain Model (DM) was introduced. A DM is composed of soft clusters of terms, where each cluster represents a semantic domain (Gliozzo, Magnini & Strapparava, 2004), i.e. a set of terms that often co-occur in texts having similar topics. A DM is represented by a $k \times k'$ rectangular matrix $D$, containing the degree of association among terms and domains, as illustrated in Table 2.
Figure 4. Procedure for calculating the MBP factor.
Table 2. Example of Domain Matrix.

| Word | Medicine | Computer Science |
|---|---|---|
| HIV | 1 | 0 |
| AIDS | 1 | 0 |
| virus | 0.5 | 0.5 |
| laptop | 0 | 1 |
Domain Models can be used to describe lexical ambiguity and variability. Lexical ambiguity is represented by associating one term with more than one domain, while variability is represented by associating different terms with the same domain. For example, the term virus is associated with both the domain Computer Science and the domain Medicine (ambiguity), while the domain Medicine is associated with both the terms AIDS and HIV (variability). More formally, let $\mathcal{D} = \{D_1, D_2, \ldots, D_{k'}\}$ be a set of domains, such that $k' \ll k$. A Domain Model is fully defined by a $k \times k'$ domain matrix $D$ representing in each cell $d_{i,z}$ the domain relevance of term $w_i$ with respect to the domain $D_z$. The domain matrix $D$ is used to define a function $\mathcal{D} : \mathbb{R}^k \to \mathbb{R}^{k'}$, which maps the vectors $\vec{t}_j$, expressed in the classical VSM, into the vectors $\vec{t}'_j$ in the domain VSM. $\mathcal{D}$ is defined by²:

$$\mathcal{D}(\vec{t}_j) = \vec{t}_j (I^{IDF} D) = \vec{t}'_j \qquad (1)$$

where $I^{IDF}$ is a diagonal matrix such that $i^{IDF}_{i,i} = IDF(w_i)$, $\vec{t}_j$ is represented as a row vector, and $IDF(w_i)$ is the Inverse Document Frequency of $w_i$.
Vectors in the domain VSM are called Domain Vectors. Domain Vectors for texts are estimated by exploiting Formula 1, while the Domain Vector $\vec{w}'_i$ corresponding to the word $w_i \in V$ is the $i$-th row of the domain matrix $D$. To be a valid domain matrix, such vectors should be normalized (i.e. $\langle \vec{w}'_i, \vec{w}'_i \rangle = 1$).
In the Domain VSM, the similarity among Domain Vectors is estimated by taking into account second-order relations among terms. For example, the similarity of the two sentences "He is affected by AIDS" and "HIV is a virus" is very high, because the terms AIDS, HIV and virus are highly associated with the domain Medicine.
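To make this second-order effect concrete, here is a small sketch that applies the toy Domain Matrix of Table 2 to those two sentences. It is illustrative only: the $I^{IDF}$ rescaling of Formula 1 is omitted for brevity, and the numbers follow from the toy matrix rather than a real corpus.

```python
import numpy as np

vocab = ['HIV', 'AIDS', 'virus', 'laptop']   # rows of Table 2
D = np.array([[1.0, 0.0],                    # columns: Medicine, Computer Science
              [1.0, 0.0],
              [0.5, 0.5],
              [0.0, 1.0]])

def domain_vector(text):
    words = text.lower().split()
    t = np.array([words.count(w.lower()) for w in vocab], dtype=float)  # VSM vector
    return t @ D                             # Formula 1, without the I^IDF factor

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

a = domain_vector("he is affected by AIDS")  # -> [1.0, 0.0]
b = domain_vector("HIV is a virus")          # -> [1.5, 0.5]
print(cosine(a, b))                          # about 0.95: high despite no shared words
```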
LSA (Deerwester, Dumais, Furnas, Landauer & Harshman, 1990) is an unsupervised technique for estimating the similarity among texts and terms in a corpus. It has been proposed to induce Domain Models from corpora (Gliozzo, Giuliano & Strapparava, 2005a; Gliozzo & Strapparava, 2005b). It is performed by means of a Singular Value Decomposition (SVD) of the term-by-document matrix $T$ describing the corpus. The SVD algorithm can be exploited to acquire a domain matrix $D$ from a large corpus $\Gamma$ in a totally unsupervised way. SVD decomposes the term-by-document matrix $T$ into three matrices, $T \approx V \Sigma_{k'} U^{T}$, where $\Sigma_{k'}$ is the diagonal $k \times k$ matrix containing the highest $k' \ll k$ eigenvalues of $T$, with all the remaining elements set to 0. The parameter $k'$ is the dimensionality of the Domain VSM and can be fixed in advance³. Under this setting, we define the domain matrix $D_{LSA}$⁴ as:

$$D_{LSA} = I^{N} V \sqrt{\Sigma_{k'}} \qquad (2)$$

where $I^{N}$ is a diagonal matrix that normalizes the rows of $V \sqrt{\Sigma_{k'}}$ to unit length, i.e. $i^{N}_{i,i} = 1 / \sqrt{\langle \vec{w}'_i, \vec{w}'_i \rangle}$.
The Domain Kernel, denoted by $K_D$, can be exploited to estimate the topic similarity between two texts while taking into account the external knowledge provided by a Domain Model. It is defined by:

$$K_D(t_i, t_j) = \frac{\langle \mathcal{D}(\vec{t}_i), \mathcal{D}(\vec{t}_j) \rangle}{\sqrt{\langle \mathcal{D}(\vec{t}_i), \mathcal{D}(\vec{t}_i) \rangle \langle \mathcal{D}(\vec{t}_j), \mathcal{D}(\vec{t}_j) \rangle}} \qquad (3)$$

where $\mathcal{D}$ is the Domain Mapping defined in Formula 1. To be fully defined, the Domain Kernel requires a Domain Matrix $D$. In principle, $D$ can be acquired from any corpus by exploiting any (soft) term clustering algorithm. Nevertheless, we believe that adequate Domain Models for particular tasks can be better acquired from collections of documents from the same source. The matrix $D_{LSA}$, defined by Formula 2, is acquired using the whole unlabeled training corpus available for each task, thus tuning the Domain Model to the particular task in which it will be applied.
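Under these definitions, acquiring $D_{LSA}$ and evaluating the Domain Kernel amounts to a truncated SVD plus a cosine, as in the following sketch. It is our own illustration: numpy's svd returns, as its first factor, the left singular vectors that play the role of $V$ in Formula 2; the row normalization stands in for $I^{N}$, and the $I^{IDF}$ rescaling of the mapping is again omitted.

```python
import numpy as np

def domain_matrix_lsa(T, k_prime):
    """Formula 2: D_LSA = I^N V sqrt(Sigma_k'), via a truncated SVD of T."""
    V, s, _ = np.linalg.svd(T, full_matrices=False)   # T = V diag(s) U^T
    W = V[:, :k_prime] * np.sqrt(s[:k_prime])         # keep the k' largest values
    norms = np.linalg.norm(W, axis=1, keepdims=True)  # I^N: normalize term rows
    return W / np.where(norms == 0.0, 1.0, norms)

def domain_kernel(t_i, t_j, D):
    """Formula 3: cosine between the two domain-mapped (row) vectors."""
    u, v = t_i @ D, t_j @ D
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))
```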
4. Experimental settings
We have modified Atenea's architecture so that, after the chosen NLP modules have processed the student's answer and its references, both the ERB and the LSA modules are in charge of comparing them. With the introduction of the LSA module, we expect to add a notion of semantic similarity between the student's answer and the references. In this way, not only the style of the answer but also its content is addressed.
The LSA algorithm applied follows the pseudo-document methodology described by Berry (1992). We have defined the LSA score as the mean of the pseudo-document similarities between the student's answer and each vector representing a reference. Because of the results obtained in previous experiments (Pérez et al., 2005), we have chosen for training the Ziff-Davis part of the North American Collection, with 142,580 articles from different journals and magazines in Computer Science. The LSA score is combined with the ERB one using the following linear combination:

$$\text{Atenea's score} = \alpha \cdot \mathrm{ERB} + (1 - \alpha) \cdot \mathrm{LSA} \qquad (4)$$

The best value for $\alpha$ is obtained empirically for each test set, so as to maximize Atenea's assessment accuracy, measured as the Pearson correlation between Atenea's and the teachers' scores for the same question dataset; a sketch of this search follows.
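The empirical search for $\alpha$ can be as simple as the following sketch (illustrative only; the names and the grid step are our own), which scans values of $\alpha$ in [0, 1] and keeps the one that maximizes the Pearson correlation of the Formula 4 combined score with the teachers' scores.

```python
import numpy as np
from scipy.stats import pearsonr

def best_alpha(erb_scores, lsa_scores, teacher_scores, step=0.001):
    """Scan alpha in [0, 1]; return the value maximizing Pearson correlation."""
    erb, lsa = np.asarray(erb_scores), np.asarray(lsa_scores)
    corr, alpha = max(
        (pearsonr(a * erb + (1 - a) * lsa, teacher_scores)[0], a)
        for a in np.arange(0.0, 1.0 + step, step)
    )
    return alpha, corr
```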
The test corpus used contains ten different questions about Operating Systems and Object Oriented Programming; nine of them were obtained from exams at our home university, and the last one consists of a set of definitions of "Operating System" obtained from the Internet. In total, there are 924 student answers and 44 references written by teachers. Table 3 describes the datasets.
In previous work we have observed the robustness of the n-gram-based algorithm in coping with automatic translations of the answers and the references. The correlation between the teachers' scores and the system's does not vary in a statistically significant way whether we work with the original students' answers or with automatic translations (Alfonseca & Pérez, 2004; Pérez & Alfonseca, 2005). Given that the LSA system had been trained on an English corpus, and we did not have a large corpus on Computer Science available in Spanish, we have chosen to work with a translation of our test corpus into English, performed by Altavista Babelfish⁵.
Table 3. Evaluation sets⁶.

| SET | NS | MS | NR | MR | Type | ERB | LSA | P(E+L) |
|---|---|---|---|---|---|---|---|---|
| 1 | 38 | 67 | 4 | 130 | Def. | 0.61 | 0.49 | 0.61 |
| 2 | 79 | 51 | 3 | 42 | Def. | 0.54 | 0.20 | 0.38 |
| 3 | 96 | 44 | 4 | 30 | Def. | 0.20 | -0.01 | 0.10 |
| 4 | 11 | 81 | 4 | 64 | Def. | 0.29 | 0.52 | 0.48 |
| 5 | 143 | 48 | 7 | 27 | Def. | 0.61 | 0.50 | 0.64 |
| 6 | 295 | 56 | 8 | 55 | A/D | 0.19 | 0.24 | 0.23 |
| 7 | 117 | 127 | 5 | 71 | Y/NJ | 0.33 | 0.29 | 0.38 |
| 8 | 117 | 166 | 3 | 186 | A/D | 0.39 | 0.39 | 0.46 |
| 9 | 14 | 118 | 3 | 108 | Y/NJ | 0.75 | 0.78 | 0.81 |
| 10 | 14 | 116 | 3 | 105 | Def. | 0.78 | 0.87 | 0.90 |
| MEAN | 92.4 | 87.4 | 4.4 | 81.8 | — | 0.47 | 0.43 | 0.50 |
5. Experiment and results
We have tested five different configurations of Atenea: stemming (A1); removal of closed-class words (A2); stemming and removal of closed-class words (A3); Word Sense Disambiguation (WSD, A4); and WSD and removal of closed-class words (A5). In each case, the scoring is done by calculating the n-gram score described above on the processed text. In the case of WSD, nouns, verbs, adjectives and adverbs are substituted by sense identifiers from WordNet before the score is calculated.
Table 4 shows the results obtained for these configurations without using the LSA module. Table 5 shows how they improve when the LSA module is used and the same weight is given to Atenea and LSA ($\alpha$ = 0.5). Values highlighted in bold indicate that the combination with LSA has produced a better result. Finally, Table 6 shows the results obtained after optimising the value of $\alpha$ empirically.
Table 4. Results of Atenea for five different configurations.

| SET | A1 | A2 | A3 | A4 | A5 |
|---|---|---|---|---|---|
| 1 | 0.618881 | 0.540396 | 0.582110 | 0.632213 | 0.595169 |
| 2 | 0.468772 | 0.582403 | 0.550087 | 0.440452 | 0.495139 |
| 3 | 0.232850 | 0.450573 | 0.405473 | 0.234355 | 0.418775 |
| 4 | 0.346364 | 0.462735 | 0.536814 | 0.311094 | 0.494748 |
| 5 | 0.639496 | 0.665886 | 0.708401 | 0.655028 | 0.718707 |
| 6 | 0.236440 | 0.306757 | 0.337241 | 0.225270 | 0.322892 |
| 7 | 0.291880 | 0.369568 | 0.411027 | 0.282297 | 0.408539 |
| 8 | 0.374665 | 0.412187 | 0.440874 | 0.368593 | 0.434522 |
| 9 | 0.765208 | 0.740003 | 0.677579 | 0.761126 | 0.691731 |
| 10 | 0.850650 | 0.709258 | 0.718523 | 0.829618 | 0.706292 |
| MEAN | 0.482521 | 0.523977 | 0.536813 | 0.474005 | 0.528651 |
Table 5. Results of the combination (Atenea+LSA) for the five different configurations.

| SET | M(A1+L) | M(A2+L) | M(A3+L) | M(A4+L) | M(A5+L) |
|---|---|---|---|---|---|
| 1 | **0.630219** | **0.564336** | **0.599315** | **0.642924** | **0.611331** |
| 2 | 0.451979 | 0.573117 | **0.551557** | 0.428057 | **0.506119** |
| 3 | 0.212284 | 0.419053 | 0.377003 | 0.211843 | 0.385953 |
| 4 | **0.388911** | **0.499870** | **0.555359** | **0.355465** | **0.518276** |
| 5 | **0.645359** | **0.674847** | **0.714907** | **0.659699** | **0.724796** |
| 6 | **0.244271** | **0.312712** | **0.340969** | **0.233106** | **0.326627** |
| 7 | **0.302367** | **0.380963** | **0.417945** | **0.293451** | **0.416635** |
| 8 | **0.391221** | **0.430773** | **0.454752** | **0.385953** | **0.449158** |
| 9 | **0.784240** | **0.764052** | **0.712132** | **0.780152** | **0.722886** |
| 10 | **0.884809** | **0.780588** | **0.783078** | **0.868666** | **0.771207** |
| MEAN | **0.493566** | **0.540031** | **0.550702** | **0.485932** | **0.543299** |
Table 6. Results of the combination ($\alpha$ = 0.174, 0.346, 0.323, 0.151 and 0.298, respectively).

| SET | P(A1+L) | P(A2+L) | P(A3+L) | P(A4+L) | P(A5+L) |
|---|---|---|---|---|---|
| 1 | 0.638036 | 0.577326 | 0.610560 | 0.645999 | 0.622724 |
| 2 | 0.344846 | 0.534323 | 0.509780 | 0.316856 | 0.463320 |
| 3 | 0.138333 | 0.367555 | 0.334028 | 0.123294 | 0.329082 |
| 4 | 0.486621 | 0.521720 | 0.566815 | 0.473688 | 0.537167 |
| 5 | 0.649906 | 0.680228 | 0.719279 | 0.659279 | 0.729381 |
| 6 | 0.265019 | 0.316759 | 0.343870 | 0.257485 | 0.330220 |
| 7 | 0.329238 | 0.388714 | 0.423370 | 0.326698 | 0.424366 |
| 8 | 0.433712 | 0.444291 | 0.466936 | 0.436597 | 0.464349 |
| 9 | 0.806440 | 0.777007 | 0.737520 | 0.805299 | 0.749767 |
| 10 | 0.926590 | 0.821233 | 0.828300 | 0.924139 | 0.827158 |
| MEAN | 0.501874 | 0.542916 | 0.554046 | 0.496933 | 0.547754 |
Before these experiments were performed, the best correlation obtained using the statistical module of Atenea combined with LSA was 50% (Pérez et al., 2005). As can be seen in Table 6, when the optimisation procedure is applied together with some NLP modules, the average correlation increases up to 55.40%.
CONCLUSIONS
Automatically assessing students' free-text answers is a long-standing problem that is still far from being completely solved. In this paper, we have described a new approach that consists of combining statistical, Natural Language Processing and Latent Semantic Analysis techniques. This approach has been implemented in the Atenea system, an adaptive free-text CAA system able to assess students' answers. The core idea of the system is that a student's answer is better, and thus should receive a higher score, when it is more similar to the teachers' answers (or references) for the same question.
In order to handle some of the many different expressions that convey the same meaning, Atenea can be configured to process both the student's answer and the references with stemming, removal of closed-class words, and/or Word Sense Disambiguation (WSD) techniques. Thanks to the stemming step, morphological differences are not taken into account in the comparison. The removal of closed-class words is useful to ignore coincidences in those words that are less relevant to the general meaning of the answers. Finally, WSD attempts to identify the sense in which polysemous words are used, to find out whether the senses intended by the teacher and the student are the same.
After this NLP processing, the n-gram co-occurrence scoring procedure ERB produces a value that is combined with the LSA one to finally give the students their score.
The goodness of the procedure is measured with the Pearson correlation between the automatic scores given by the system and the manual scores given by the teachers for the same datasets of questions. In particular, the mean correlation has improved up to the state-of-the-art 56% value when both the students' answers and the references were first stemmed, then the closed-class words were removed, and the ERB and LSA techniques were combined. In this way, it has been seen that LSA and the rest of the NLP techniques combined can produce better results than any of them separately.
This result is also interesting because it determines the optimum combination of techniques for using Atenea with students. That is, the on-line version of Atenea will be configured to stem, remove closed-class words, use ERB and LSA, and combine their scores to finally produce the student's score.
Atenea benefits from a Machine Translation engine to be able to process answers written in Spanish and in English. In particular, the system is able to automatically translate the students' answers into the language in which the references (teachers' answers) are written, without a significant decrease in accuracy. In fact, in a few cases, the Pearson correlation even improves slightly, as noted in (Pérez & Alfonseca, 2005).
Given that the LSA subsystem had been trained only for the English language, it is not yet able to process Spanish answers. Therefore, we have taken advantage of this translation feature of Atenea in the evaluation performed: by automatically translating Spanish answers into English, we have been able to score them with LSA as well.
This paper opens many promising prospective lines. Firstly, to try a more complex combination schema between Atenea and LSA. Secondly, the general architecture of Atenea makes it easy to integrate other NLP tools, such as a rhetorical analyzer. It would be very interesting to incorporate one, because it would identify, both in the student's answer and in the teacher's answer, the fragments of text in which a new idea is introduced, advantages or disadvantages are given, a concluding remark is provided, etc., achieving a better comparison process that could lead to a more accurate score to combine with the LSA score.
Other interesting possibilities are to try this combination in other systems. For instance, it should be possible to add an LSA module to other free-text CAA systems such as SEAR (Christie, 1999), which is based on Information Extraction (IE). Conversely, systems such as IEA (Foltz et al., 1999), which relies only on LSA, might be combined with shallow NLP techniques.
REFERENCES

Alfonseca, E. (2003). Wraetlic user guide version 1.0 [on line]. Retrieved from: http://www.ii.uam.es/~ealfon/eng/download.html

Alfonseca, E. & Pérez, D. (2004). Automatic assessment of short questions with a Bleu-inspired algorithm and shallow NLP. Proceedings of the 4th International Conference, EsTAL 2004, Alicante, Spain.

Berry, M. (1992). Large-scale sparse singular value computations. International Journal of Supercomputer Applications, 6(1), 13-49.

Blayney, P. & Freeman, M. (2003). Automated marking of individualised spreadsheet assignments: The impact of different formative self-assessment options. Proceedings of the 7th Computer Assisted Assessment International Conference, Loughborough, U.K.

Burstein, J., Kukich, K., Wolff, S., Lu, C., Chodorow, M., Bradenharder, L. & Harris, M. (1998). Automated scoring using a hybrid feature identification technique. Proceedings of the Annual Meeting of the Association for Computational Linguistics, Montreal, Canada.

Burstein, J., Leacock, C. & Swartz, R. (2001). Automated evaluation of essays and short answers. Proceedings of the 5th Computer Assisted Assessment International Conference, Loughborough, U.K.

Callear, D., Jerrams-Smith, J. & Soh, V. (2001). CAA of short non-MCQ answers. Proceedings of the 5th Computer Assisted Assessment International Conference, Loughborough, U.K.

Christie, J.R. (1999). Automated essay marking - for both style and content. Proceedings of the 3rd Computer Assisted Assessment International Conference, Loughborough, U.K.

Deerwester, S., Dumais, S., Furnas, G., Landauer, T. & Harshman, R. (1990). Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41(6), 391-407.

Dessus, P., Lemaire, B. & Vernier, A. (2000). Free text assessment in a virtual campus. Proceedings of the 3rd International Conference on Human System Learning, Paris, France.

Foltz, P., Laham, D. & Landauer, T. (1999). The intelligent essay assessor: Applications to educational technology. Interactive Multimedia Electronic Journal of Computer-Enhanced Learning, 1(2) [on line]. Retrieved from: http://imej.wfu.edu/articles/1999/2/04/index.asp

Gliozzo, A., Magnini, B. & Strapparava, C. (2004). Unsupervised domain relevance estimation for word sense disambiguation. Proceedings of the Empirical Methods in Natural Language Processing Conference, Barcelona, Spain.

Gliozzo, A., Giuliano, C. & Strapparava, C. (2005a). Domain kernels for word sense disambiguation. Proceedings of ACL, Michigan, U.S.A.

Gliozzo, A. & Strapparava, C. (2005b). Domain kernels for text categorization. Proceedings of CONLL, Michigan, U.S.A.

Haley, D., Pete, T., Nuseibeh, B., Taylor, J. & Lefrere, P. (2003). E-Assessment using Latent Semantic Analysis. Proceedings of the 3rd International LeGE-WG Workshop: Towards a European learning GRID infrastructure to support future technology enhanced learning, Berlin, Germany.

Larkey, L. (1998). Automatic essay grading using text categorization techniques. Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, New York, U.S.A.

Mason, O. & Grove-Stephenson, I. (2002). Automated free text marking with paperless school. Proceedings of the 6th International Computer Assisted Assessment Conference, Loughborough, U.K.

Miller, T. (2003). Essay assessment with Latent Semantic Analysis. Journal of Educational Computing Research, 29(4), 495-512.

Ming, Y., Mikhailov, A. & Kuan, T. (2000). Intelligent essay marking system [on line]. Retrieved

Mitchell, T., Russell, T., Broomhead, P. & Aldridge, N. (2002). Towards robust computerised marking of free-text responses. Proceedings of the 6th International Computer Assisted Assessment Conference, Loughborough, U.K.

Page, E. (1966). The imminence of grading essays by computer. Phi Delta Kappan, 47(1), 238-243.

Papineni, K., Roukos, S., Ward, T. & Zhu, W. (2001). BLEU: A method for automatic evaluation of machine translation. Technical Report RC22176 (W0109-022), IBM Research Division, Thomas J. Watson Research Center, Yorktown Heights, New York, U.S.A.

Pérez, D., Gliozzo, A., Strapparava, C., Alfonseca, E., Rodríguez, P. & Magnini, B. (2005). Automatic assessment of students' free-text answers underpinned by the combination of a BLEU-inspired algorithm and Latent Semantic Analysis. Proceedings of the 18th FLAIRS International Conference, Florida, U.S.A.: American Association for Artificial Intelligence (AAAI) Press.

Pérez, D. & Alfonseca, E. (2005). Adapting the automatic assessment of free-text answers to the students. Proceedings of the 9th Computer Assisted Assessment (CAA) International Conference, Loughborough, U.K.

Rosé, C., Roque, A., Bhembe, D. & VanLehn, K. (2003). A hybrid text classification approach for analysis of student essays. Proceedings of the HLT-NAACL Workshop on Building Educational Applications Using Natural Language Processing, Edmonton, Canada.

Rudner, L. & Liang, T. (2002). Automated essay scoring using Bayes' theorem. Journal of Technology, Learning, and Assessment, 1(2) [on line]. Retrieved from: http://www.bc.edu/research/intasc/jtla/journal/v1n2.shtml

Salton, G. & McGill, M. (1983). Introduction to Modern Information Retrieval. New York, NY: McGraw-Hill.

Sukkarieh, J., Pulman, S. & Raikes, N. (2003). Auto-marking: Using computational linguistics to score short, free text responses. Proceedings of the 29th Annual Conference of the International Association for Educational Assessment, Manchester, U.K.

Valenti, S., Neri, F. & Cucchiarelli, A. (2003). An overview of current research on automated essay grading. Journal of Information Technology Education, 2, 319-330.

Vantage Learning Tech. (2000). A study of expert scoring and IntelliMetric scoring accuracy for dimensional scoring of grade 11 student writing responses. Technical Report RB-397, Vantage Learning Technology, Newtown, Philadelphia, U.S.A.

Wong, S., Ziarko, W. & Wong, P. (1985). Generalized vector space model in information retrieval. Proceedings of the 8th International ACM SIGIR Conference on Research and Development in Information Retrieval, New York, U.S.A.
NOTES

1. The results are presented according to the metrics indicated by their authors (Corr: correlation; Agr: agreement; EAgr: exact agreement; CAcc: classification accuracy; f-S: f-score). When the authors have presented several values for the results, the mean value has been taken.

2. In (Wong, Ziarko & Wong, 1985) a similar schema is adopted to define a Generalized Vector Space Model, of which the Domain VSM is a particular instance.

3. It is not clear how to choose the right dimensionality. In our experiments we used 400 dimensions.

4. When $D_{LSA}$ is substituted in Formula 1, the Domain VSM is equivalent to a Latent Semantic Space (Deerwester et al., 1990). The only difference in our formulation is that the vectors representing the terms in the Domain VSM are normalized by the matrix $I^{N}$ and then rescaled, according to their IDF values, by the matrix $I^{IDF}$. Note the analogy with the tf-idf term weighting schema (Salton & McGill, 1983), widely adopted in Information Retrieval.

5.

6. Columns indicate: number of student answers (NS), their mean length (MS), number of references (NR), their mean length (MR), question type (Def., definitions; A/D, advantages and disadvantages; Y/NJ, yes-no with justification), and the results of ERB, LSA and their combination.