Math Operation Embeddings for Open-ended Solution Analysis and Feedback
Mengxue Zhang1, Zichao Wang2, Richard Baraniuk2, Andrew Lan1*
1University of Massachusetts Amherst, 2Rice University
*This work is supported by the National Science Foundation under grant IIS-1917713.
ABSTRACT
Feedback on student answers and even during intermediate steps in their solutions to open-ended questions is an important element in math education. Such feedback can help students correct their errors and ultimately lead to improved learning outcomes. Most existing approaches for automated student solution analysis and feedback require manually constructing cognitive models and anticipating student errors for each question. This process requires significant human effort and does not scale to most questions used in homeworks and practices that do not come with this information. In this paper, we analyze students' step-by-step solution processes to equation solving questions in an attempt to scale up error diagnostics and feedback mechanisms developed for a small number of questions to a much larger number of questions. Leveraging a recent math expression encoding method, we represent each math operation applied in solution steps as a transition in the math embedding vector space. We use a dataset that contains student solution steps in the Cognitive Tutor system to learn implicit and explicit representations of math operations. We explore whether these representations can i) identify math operations a student intends to perform in each solution step, regardless of whether they did it correctly or not, and ii) select the appropriate feedback type for incorrect steps. Experimental results show that our learned math operation representations generalize well across different data distributions.

Keywords
Embeddings, Feedback, Math expressions, Math operations

1. INTRODUCTION
Math education is of crucial importance to a competitive future science, technology, engineering, and mathematics (STEM) workforce since math knowledge and skills are required in many STEM subjects [11]. One important way to help struggling students improve in math is to diagnose errors from student answers to math questions and deliver personalized support to help them correct these errors [1]. In short-answer questions, feedback of various types [39] can be deployed according to the specific incorrect final answers students submit, while in open-ended questions, feedback can be deployed at intermediate solution steps according to the specific actions they take and their outcomes [22]. In traditional educational settings, this feedback process relies on teachers going over student work, identifying errors, and providing feedback [15], which results in a labor-intensive process and a slow feedback cycle for students. Such a setting is even more limited as a result of the COVID-19 pandemic, which introduced new barriers to face-to-face interactions between teachers and students.

In intelligent tutoring systems, a more scalable approach to math feedback is to automatically deploy feedback based on students' final answers or certain incorrect intermediate solution steps. For example, in ASSISTments [12], teachers can create hints and feedback messages for specific incorrect student answers to short-answer questions that they anticipate [28], which the system can automatically deploy when students submit these incorrect answers. This crowdsourcing approach efficiently scales up teachers' effort so that they can benefit a large number of students without putting in additional effort. In many other systems such as Cognitive Tutor [34] and Algebra Notepad [27], researchers use cognitive models to anticipate student errors as results of buggy production rules or insufficient knowledge of key math concepts [20, 24]. They then develop corresponding feedback for intermediate solution steps in multi-step questions (e.g., those on equation solving). This cognitive model-based approach requires significant effort by domain experts and has been shown to be highly effective in large-scale studies.

However, these approaches for student feedback are still limited in their generalizability to many math questions deployed in daily homeworks and practices. For the teacher crowdsourcing approach, hint and feedback messages have to be written for each individual question (or group of questions generated from the same template with different numerical values). For the cognitive model-based approach, a rigorous solution process has to be specified for each question with annotations on the math operations that should be applied at each solution step. However, questions used in many real-world educational settings do not come with such information; teachers simply adopt them from sources
such as textbooks and open education resources and assign them to students without developing corresponding feedback mechanisms. Moreover, past research has shown that a large portion of incorrect student answers cannot be anticipated by cognitive models [43], teachers/domain experts [8], or numerical simulations [37]. Therefore, it may be hard for high-quality feedback developed for questions used in intelligent tutoring systems to generalize to questions in the wild.

1.1 Contributions
In this paper, we develop data-driven methods that enable us to analyze step-by-step solutions to open-ended math questions. In contrast to existing methods that rely on a top-down approach, i.e., defining the structure of the solution process and anticipating student errors, we propose a bottom-up approach, i.e., using learned representations of math expressions and math operations to predict i) math operations in student solution steps and ii) the appropriate feedback for incorrect solution steps. We restrict ourselves to the specific domain of equation solving, where the solution process consists of applying specific math operations between math expressions in consecutive steps; other subdomains of math such as algebra word problems [45] and questions involving graphs and geometry [16] are left as future work. Specifically, our contributions are:

• First, we characterize math operations by how they transform math expressions in the math embedding space in each solution step. We leverage recent work on learning math symbol embeddings from large-scale scientific formula data [46] to encode math expressions in student solutions: each math expression is mapped to a point in the math embedding vector space. We use synthetically generated data as well as solution steps generated by real students to learn the representation of each math operation. We explore several methods for learning both implicit and explicit math operation representations: a classification-based method that does not explicitly impose a structure on math operations, a linear model that assumes each operation is characterized by an additive vector in the embedding space, and a nonlinear model where math operations live in their own, interconnected embedding spaces.

• Second, we apply these math operation representation learning methods to a real-world student step-by-step solution dataset collected while students learn equation solving in an intelligent tutoring system, Cognitive Tutor [34]. We validate our math operation representation learning methods via two tasks: i) predicting the specific math operation the student intended to apply in a solution step from the math expressions before and after the step and ii) predicting the appropriate feedback deployed to students from the incorrect math expressions they enter. Quantitative results show that tree embedding-based math expression encoding methods outperform other encoding methods since they are able to explicitly capture the semantic and structural characteristics of math expressions. They also have better generalizability across different data distributions and remain effective across different question difficulty levels and even when student solution steps contain errors.

[Figure 1. Demonstration of the generalizability of our math operation representations to other data sources for a solution process provided on Algebra.com. Our methods can successfully predict the math operations applied in each step (e.g., step 1: COMBINE_ADD (100%); step 2: COMBINE_ADD (60%), ADD_SIDE (40%); step 3: DIV_SIDE (100%); step 4: COMBINE_MUL (92%), COMBINE_ADD (8%)) and the appropriate feedback type in an incorrect step (step 5: CHECK_SIGN (71%), SIMPLIFY_FRACTION_WRONG (28%)).]

1.2 Use Case
Before diving into the technical details, we first illustrate a potential use case for our math operation representation learning methods and corresponding operation/feedback classifiers. Our goal is to transfer expert designs in intelligent tutoring systems for math education to questions in the wild. Specifically, we apply the math operation representations learned from student solution steps and corresponding labels (step name, feedback message) in the highly structured Cognitive Tutor system to environments that are not highly structured. Figure 1 shows the solution process to an equation solving question on Algebra.com (the original question and solution process can be found at https://www.algebra.com/algebra/homework/equations/Equations.faq.question.4872.html) and the corresponding math operation and feedback predictions at each step. We see that our math operation representation learning methods can accurately predict the math operations applied in solution steps 1, 3, and 4 using the operation names provided in the Cognitive Tutor system. Even in step 2, where two different math operations are combined into a single step, i.e.,

7x + 9 = 7 − x
↓ ADD x TO BOTH SIDES
7x + 9 + x = 7 − x + x
↓ COMBINE TERMS ON RIGHT SIDE
7x + 9 + x = 7,

despite only training on steps in Cognitive Tutor that involve a single math operation, the classifier is able to recognize both of them with high predictive probability.
We also change one of the solution steps, i.e., step 5, to make it incorrect and test our feedback classifier. In this case, the classifier is able to recognize the error in this step and find the corresponding feedback types in Cognitive Tutor. This potential use case demonstrates the utility of our math operation representation learning methods: by transferring knowledge learned in well-designed, highly-structured systems such as Cognitive Tutor, especially on what feedback to deploy for each student error, to other domains such as online math Q&A sites, we are scaling up the effort domain experts put into the design of these feedback mechanisms.

2. RELATED WORK
One related body of work in math education studies student solution processes to identify student strategies and assess errors. Specifically, [33] uses inverse Bayesian planning to learn solution strategies (i.e., policies) in equation solving and capture student misunderstandings in a Markov decision process framework. Our work focuses on a different aspect of the solution process: the representation of the math expressions at each solution step and the modeling of the transitions between different math expressions under math operations. [9] uses basic math operations to construct programs to understand errors that students make in their solutions to arithmetic questions. Our work focuses on equation solving, a more difficult problem in which student responses are more diverse and less structured than arithmetic calculations.

Another related body of work focuses on learning representations of student answers to short-answer questions. [21] analyzes incorrect student answers across multiple questions, learns representations of errors, and generalizes misconception feedback across questions. Our work analyzes the full math expressions in intermediate solution steps, while their work represents short answers according to the frequency with which they occur in an answer pool. [8] uses trained word embeddings to represent short answers for automated grading purposes. Our work focuses on learning transitions of math expressions across solution steps instead of learning representations of only the final answer.

In domains other than math education, there exist methods for automated feedback generation, including programming [30, 31, 40] and essays [35]. However, transferring these methods to math solutions is not trivial since i) open-ended math solutions are less structured than programming code and ii) data-driven representations of math symbols have not been developed until recently [46], whereas such representations have been studied for a long time in natural language processing [6, 7, 26].

Another body of remotely-related work focuses on using computer vision techniques to identify math expressions from images for similar math expression retrieval [29], turning handwritten math expressions into LaTeX [47], and automatically identifying and correcting student errors [14]. These works often bypass the inherent structure of math expressions and directly use an end-to-end model for their tasks, which means that they cannot be used to analyze student knowledge. Nevertheless, these techniques can be used to build large-scale datasets containing handwritten student solutions, which we can use in the future.

3. BACKGROUND: EMBEDDING MATH EXPRESSIONS INTO VECTOR SPACES
In this section, we provide an overview of a recent method that we developed to embed math expressions into a vector space, i.e., a math embedding space. Doing so turns discrete, symbolic math expression representations into continuous, distributed representations [2], which enables us to manipulate math expressions in a manner compatible with modern machine learning methodologies.

Our embedding method is a tree-structured encoder illustrated in Figure 2. The key observation is that any math expression has a corresponding symbolic tree-structured representation in the operator tree format. In the operator tree, the non-terminal (non-leaf) nodes are math operators, e.g., addition and subtraction, and the terminal (leaf) nodes are numbers or variables; see Figure 2 for an illustration. Thus, an operator tree explicitly captures the semantic and structural properties of a math expression. A number of existing works have demonstrated the superior performance of operator tree representations of math expressions compared to other math expression representations in applications such as automatic math word problem solving [32, 48, 51] and math formulae retrieval [5, 25, 49, 50].

[Figure 2. Illustration of the math expression encoding method that we employ in this work.]

Therefore, we built a math expression encoder that leverages the operator tree representation of math expressions. Specifically, during the encoding process, it first converts a math expression into its corresponding tree format using the parser introduced in [5]. It then linearizes the tree by depth-first search, which enables us to process nodes as a sequence in which each math symbol is associated with its own trainable embedding. Next, it leverages positional encoding, similar to [44, 38], to retain the relative position of each node in the tree. The output of our encoder is a fixed-dimensional embedding vector that represents the input math expression, which we will use to learn representations of math operations for the math operation classification and feedback prediction tasks. We pretrained the encoder on a large corpus of math expressions extracted from Wikipedia and arXiv articles, and it demonstrated superior performance in reconstructing math expressions (and scientific formulae) and retrieving similar expressions; see [46] for details. We will refer to the trained encoder as the math expression encoding method in what follows.

4. LEARNING REPRESENTATIONS OF MATH OPERATIONS
In this section, we detail the methods we use to learn both implicit and explicit math operation representations by studying how they transform math expressions in each solution step in the math embedding space. In these methods, we leverage the math expression encoding method developed in our prior work that we reviewed above to embed math expressions into vectors and work with these embedding vectors. However, since these embeddings are trained on math expressions that are very different from those occurring in actual student solution steps, we use an additional trainable, fully-connected neural network to adapt these embeddings to our dataset, following a popular approach in natural language processing [13]. Specifically, we have e = g_γ(m), where m and e are the embedded vectors of a math expression in our dataset before and after the adaptation, respectively, and γ denotes the set of parameters in the fully-connected network that we will learn during the training process.
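To make this pipeline concrete, the following minimal sketch (our illustration, not the authors' released code) shows the two pieces the rest of the paper builds on: a depth-first linearization of an operator tree and the trainable adapter g_γ that maps the pretrained 512-dimensional encoder output to the 32-dimensional embeddings used downstream. The 512 and 32 dimensions come from Section 5.3; the parser of [5], the positional encoding, and the pretrained encoder itself are omitted, and the 128-unit hidden layer is an assumption.

```python
# Minimal sketch, assuming nested-tuple operator trees; not the paper's
# actual implementation.
import torch
import torch.nn as nn

def dfs_linearize(tree):
    """Flatten an operator tree into a (symbol, depth) sequence via
    depth-first search; a positional encoding can consume the depths."""
    seq = []
    def visit(node, depth):
        if isinstance(node, tuple):          # non-terminal: (operator, children...)
            seq.append((node[0], depth))
            for child in node[1:]:
                visit(child, depth + 1)
        else:                                # terminal: number or variable
            seq.append((node, depth))
    visit(tree, 0)
    return seq

# Operator tree for 3x + 2x: ("+", ("*", "3", "x"), ("*", "2", "x"))
print(dfs_linearize(("+", ("*", "3", "x"), ("*", "2", "x"))))

class Adapter(nn.Module):
    """g_gamma: maps the pretrained embedding m (512-d) to the fine-tuned
    embedding e (32-d) used by all downstream models."""
    def __init__(self, d_in=512, d_out=32):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(d_in, 128), nn.ReLU(),
                                 nn.Linear(128, d_out))
    def forward(self, m):
        return self.net(m)

m = torch.randn(1, 512)   # stand-in for a pretrained expression embedding
e = Adapter()(m)          # e = g_gamma(m), shape (1, 32)
```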
We define a step in a student's solution to open-ended math questions as a tuple (E_1, E_2, z), where z ∈ Z is the math operation applied in this step, with Z denoting the set of possible math operations. E_1 ∈ E and E_2 ∈ E denote the math expressions involved in this step before and after applying this math operation, i.e., the step can be expressed as E_1 -z-> E_2. E denotes the set of all unique math expressions (across all steps in a dataset). For simplicity, we assume that only one math operation is applied in each step; an extension to cases where multiple math operations are applied is trivial and will be discussed in what follows. e_1 ∈ R^D and e_2 ∈ R^D are the fine-tuned embedding vectors that correspond to math expressions E_1 and E_2, respectively, where D is the dimension of the embedding.

4.1 Math Operation Classification
The first task we will study in this paper is to classify the math operation applied in a solution step given the math expression embeddings before and after applying it, e_1 and e_2. The same notation and approaches also apply to our second task, feedback classification. This task can simply be solved using a supervised learning method, e.g., a regression model where the predicted probability of a math operation ẑ is given by

p(ẑ = z) = softmax(v_z^T [e_1^T, e_2^T]^T),

where softmax(·) is the softmax function for multi-label classification [10] and v_z is a parameter vector associated with each math operation z, used to compute an inner product with the concatenation of e_1 and e_2 before being fed into the softmax function. On a training dataset with given tuples (e_1, e_2, z), we can learn the parameters (v_z) by minimizing the cross-entropy loss [10] between the predicted math operation ẑ and the actual math operation. This approach can be seen as learning implicit representations of math operations since they are captured by the classifier parameters.
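The following is a minimal sketch of this implicit classifier (our illustration; we assume the 32-dimensional adapted embeddings and the seven operations of Table 1): a single linear layer whose rows play the role of v_z, applied to the concatenation [e_1; e_2] and trained with cross-entropy.

```python
# Minimal sketch of the Section 4.1 classifier; batch data is random
# stand-in for (e1, e2, z) tuples.
import torch
import torch.nn as nn

D, NUM_OPS = 32, 7
classifier = nn.Linear(2 * D, NUM_OPS)        # rows act as the vectors v_z
loss_fn = nn.CrossEntropyLoss()               # applies log-softmax internally

e1, e2 = torch.randn(64, D), torch.randn(64, D)   # embeddings before/after a step
z = torch.randint(0, NUM_OPS, (64,))              # true operation labels

logits = classifier(torch.cat([e1, e2], dim=-1))
loss = loss_fn(logits, z)                         # cross-entropy training loss
z_hat = logits.softmax(dim=-1).argmax(dim=-1)     # predicted operation
```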
4.2 Learning Math Operation Representations
The classification approach we detailed above can help us classify the math operation applied in a solution step but falls short of learning explicit representations of math operations. The latter is important, however, to help us understand students' math solution processes and diagnose their errors. We now detail a series of methods for learning explicit representations of math operations.

4.2.1 Translating embeddings
We will leverage the translating embedding (TransE) framework [3], which has found success in embedding entities and characterizing relationships between entities in multi-relational data. Our key assumption in this framework is that math operations are linear and additive, i.e., the relationship between the math expressions before and after a math operation satisfies

e_2 ≈ e_1 + h_z,

where h_z ∈ R^D is the embedding of the math operation z. In other words, we assume that the effect of a math operation z is characterized by the difference in the embedding vectors between the math expressions before and after it in a single step; adding it to the embedded vector of E_1 results in the embedded vector of E_2 after the step.

To learn these math operation embeddings from data, we use two loss functions. The first loss function promotes this linear and additive relationship between embeddings of the math expressions and operations on the training data. To this end, we define a distance function as d(e_1, e_2, h_z) = ||e_1 + h_z − e_2||_2^2 and define the loss function as

L_1 = Σ_{(E_1,E_2,z)} d(e_1, e_2, h_z).
The second loss function pushes counterfeit step tuples, generated by replacing elements in an observed step tuple with other ones in the dataset, to not satisfy the aforementioned linear and additive relationship. To this end, we minimize the pairwise marginal distance ranking-based loss given by

L_2 = Σ_{(E_1,E_2,z)∈S} Σ_{(E_1',E_2',z')∈S'_(E_1,E_2,z)} [γ + d(e_1, e_2, h_z) − d(e_1', e_2', h_z')]_+,

where [x]_+ = x when x > 0 and 0 otherwise, and γ > 0 is a hyper-parameter that controls the margin of the distance ranking. S denotes the set of steps in the dataset and S'_(E_1,E_2,z) is a set of counterfeit steps that are perturbed versions of the actual step (E_1, E_2, z), generated by randomly replacing one of the triplet elements in the step by a different math expression or math operation from another step, i.e.,

S'_(E_1,E_2,z) ∼ A ∪ B ∪ C, where
A = {(E_1', E_2, z) : E_1' ≠ E_1 ∈ E},
B = {(E_1, E_2', z) : E_2' ≠ E_2 ∈ E},
C = {(E_1, E_2, z') : z' ≠ z ∈ Z}.

Intuitively speaking, our objective encourages the distance function calculated on an actual tuple in the dataset to be smaller than that calculated on a perturbed version of it. Figure 3 illustrates the whole process.

[Figure 3. Illustration of the TransE and TransR frameworks. TransE puts the embeddings of expressions e_1, e_2, and math operation z in the same embedding space, whereas TransR puts them in their own embedding spaces.]

The final loss function that we minimize is simply the combination of these two loss functions, L = L_1 + L_2. Using the learned embedding of each math operation, we can classify the operation applied between math expressions E_1 and E_2 using the nearest neighbor classifier, i.e., ẑ = argmin_z d(e_1, e_2, h_z).
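Below is a minimal TransE-style sketch under the paper's assumptions: per-operation embeddings h_z, the squared-Euclidean distance d, the margin ranking loss over corrupted triplets, and nearest-neighbor prediction. The batch data and the corruption sampler are simplified stand-ins (for brevity we only corrupt z and do not resample when the corrupted label happens to equal the true one).

```python
# Minimal TransE sketch for math operation embeddings; not the authors'
# exact training code.
import torch
import torch.nn as nn

D, NUM_OPS, GAMMA = 32, 7, 1.0
h = nn.Embedding(NUM_OPS, D)                    # one embedding h_z per operation

def dist(e1, e2, hz):
    return ((e1 + hz - e2) ** 2).sum(dim=-1)    # d(e1, e2, h_z)

def transe_loss(e1, e2, z):
    l1 = dist(e1, e2, h(z))                     # L1: pull real steps together
    z_neg = torch.randint(0, NUM_OPS, z.shape)  # corrupted triplet (set C);
    l2 = torch.relu(GAMMA + l1 - dist(e1, e2, h(z_neg)))  # corrupting E1/E2 is analogous
    return (l1 + l2).sum()                      # L = L1 + L2

def predict(e1, e2):
    # Nearest-neighbor classification: argmin_z d(e1, e2, h_z).
    all_d = dist(e1.unsqueeze(1), e2.unsqueeze(1), h.weight)  # (batch, NUM_OPS)
    return all_d.argmin(dim=-1)

e1, e2 = torch.randn(64, D), torch.randn(64, D)
z = torch.randint(0, NUM_OPS, (64,))
loss = transe_loss(e1, e2, z)
z_hat = predict(e1, e2)
```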
4.2.2 Learning Entity and Relation Embeddings
Despite potentially exhibiting excellent interpretability, TransE's assumption that math operations are linear and additive in the math expression embedding space may be too restrictive. This assumption puts math operations as vectors in the same latent space, where similar math expressions will be close to each other. However, different math operations are fundamentally different and can transform the same math expression into dramatically different math expressions that are far apart in the embedding space. For example, different math operations can focus on transforming different parts of the same math expression. The steps (3+5+2x = x+1, 8+2x = x+1, combine similar terms) and (3+5+2x = x+1, 3+5+2x−x = x+1−x, subtract from each side) have the same starting math expression E_1. In the first step, only similar terms on the left-hand side of the equation are combined, regardless of the other side of the equation. In the second step, x is subtracted from both sides of the equation; this operation is a consequence of the equality symbol in the equation, i.e., it depends on subtracting the same term from both sides but not on what exactly is on each side. Therefore, TransE's linear and additive assumption implies that the resulting E_2 in these steps will be very different due to the different math operations applied, which conflicts with the observation that they are very similar. To address this limitation, we explore the Learning Entity and Relation Embeddings (TransR) [23] model, which models math expressions and math operations in different spaces, i.e., there will be a shared embedding space for all math expressions but separate relation spaces for different math operations.

TransR learns the embeddings of math operations by projecting math expressions to the operations' corresponding relation spaces and then learning translations between those projected expressions. For each math operation z, we set a projection matrix M_z ∈ R^(D×D) that projects math expressions to its relation space. To make this projection nonlinear, we apply the rectified linear unit (ReLU) activation function [10] to it and define the corresponding distance function as

d_z(e_1, e_2, h_z) = ||ReLU(M_z e_1) + h_z − ReLU(M_z e_2)||_2^2.

Correspondingly, the two loss functions in the TransR framework are given by

L_1 = Σ_{(E_1,E_2,z)} d_z(e_1, e_2, h_z),
L_2 = Σ_{(E_1,E_2,z)∈S} Σ_{(E_1',E_2',z')∈S'_(E_1,E_2,z)} [γ + d_z(e_1, e_2, h_z) − d_z(e_1', e_2', h_z')]_+.

The projection matrices M_z, ∀z ∈ Z, are included as part of the trainable parameters. The rest of the training and the resulting math operation classification procedure remain unchanged from the TransE framework.
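A minimal TransR-style sketch follows (same assumptions and stand-in data as the TransE sketch above): per-operation projection matrices M_z with a ReLU nonlinearity, so that distances are computed in each operation's own relation space; the resulting d_z plugs into the same L_1/L_2 losses.

```python
# Minimal TransR sketch; initialization scale and sizes are assumptions.
import torch
import torch.nn as nn

D, NUM_OPS = 32, 7
M = nn.Parameter(torch.randn(NUM_OPS, D, D) * 0.1)   # one M_z per operation
h = nn.Embedding(NUM_OPS, D)

def dist_z(e1, e2, z):
    Mz = M[z]                                          # (batch, D, D)
    p1 = torch.relu(torch.bmm(Mz, e1.unsqueeze(-1))).squeeze(-1)
    p2 = torch.relu(torch.bmm(Mz, e2.unsqueeze(-1))).squeeze(-1)
    return ((p1 + h(z) - p2) ** 2).sum(dim=-1)         # d_z(e1, e2, h_z)

e1, e2 = torch.randn(64, D), torch.randn(64, D)
z = torch.randint(0, NUM_OPS, (64,))
d = dist_z(e1, e2, z)    # substitute for d(...) in the TransE losses
```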
5. EXPERIMENTS
We now detail a series of quantitative and qualitative experiments that we have conducted to validate the learned representations of math operations. Using the Cognitive Tutor 2010 equation solving (CogTutor) dataset (available at https://pslcdatashop.web.cmu.edu/DatasetInfo?datasetId=660), we focus on two tasks: i) classifying the math operation a student applies in a solution step and ii) classifying the feedback category corresponding to certain types of incorrect steps, from the math expressions the student enters before and after the step.

5.1 Dataset
We use the CogTutor dataset, which we accessed via the PSLC DataShop [19]. The dataset contains detailed tutor logs generated as students in a school use the Cognitive Tutor system [34] for their Algebra I class. These logs contain the students' step-by-step solutions to equation solving problems, where each step is a tuple with three elements: a math expression E_1 at the beginning of the step, the step name z, i.e., the math operation the student selected to apply to this math expression, and the resulting math expression E_2 after the step. Students can select math operations from a built-in list in Cognitive Tutor: COMBINE_ADD, COMBINE_MUL, ADD_SIDE, SUB_SIDE, MUL_SIDE, DIV_SIDE, and DISTRIBUTE; see Table 1 for descriptions of these operations and examples of the corresponding math expressions before and after them in a step.
Step (math operation)   Description                                                   Example
COMBINE_ADD             combine two similar terms with the add/subtract operator      3x + 2x → 5x
COMBINE_MUL             combine two similar terms with the multiply/divide operator   x * x → x^2
ADD_SIDE                add a math term on each side                                  x = 1 → x + 1 = 1 + 1
SUB_SIDE                subtract a math term on each side                             x = 1 → x − 1 = 1 − 1
MUL_SIDE                multiply by a math term on each side                          x = 1 → x * 2 = 2
DIV_SIDE                divide by a math term on each side                            x = 1 → x/2 = 1/2
DISTRIBUTE              distribute (expand) the terms                                 (x + 1)x → x*x + x

Table 1. Detailed descriptions and examples for each math operation in the CogTutor dataset.
There are a total of 50,406 steps in this dataset, which can be further divided into three subsets according to their outcomes: OK (43,413 steps), ERROR (6,377 steps), and BUG (5,744 steps). The OK subset contains steps that are correct, i.e., the student both selected the correct math operation and arrived at the correct math expression. The BUG and ERROR subsets contain incorrect student steps, either because the operation they selected was incorrect or because they selected the correct operation but did not apply it correctly, i.e., arrived at an incorrect math expression after the step. The difference between these two subsets is that BUG contains steps that fit one of the predefined error templates in the Cognitive Tutor system; in this case, the system can automatically diagnose the error and deploy predefined feedback. On the other hand, ERROR contains incorrect steps for which Cognitive Tutor could not automatically diagnose the underlying error. The OK subset can be further split into six predefined difficulty levels (named ES_01, ES_02, ES_03, ES_04, ES_05, and ES_07), with 2,068, 7,546, 8,183, 13,393, 5,484, and 2,801 steps, respectively. We do not further split the BUG and ERROR subsets for the math operation classification task due to their limited sizes.

To learn the representation of math operations, we need examples of how they transform one math expression into another. However, the CogTutor dataset may not contain enough data that is rich in both quantity and diversity for neural network-based models to learn from. Therefore, we designed a synthetic data generator stemming from the math question answering dataset created by DeepMind [36]. The generator produces steps by first generating an initial math expression and then applying the math operations listed in Table 1 to arrive at a resulting math expression. We have full control over the generated steps through the entropy, degree, and flip parameters. Increasing entropy introduces more complexity to the math expressions as the generated numerical constants get larger. Increasing the degree parameter introduces monomials of higher degrees and also adds more terms to the math expression. Finally, the flip parameter allows us to control which side of an equation has a higher chance of being more complicated than the other. Tuning these parameters within this flexible synthetic data generation method enables us to generate a large number of steps that closely resemble those in the CogTutor dataset.
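The sketch below illustrates the spirit of such a generator (function and parameter names are hypothetical; the actual generator follows [36] and also exposes the degree and flip controls described above): it samples a random linear equation E_1 and applies one labeled operation from Table 1 to produce a step (E_1, E_2, z).

```python
# Minimal synthetic step generator sketch using SymPy; the "both sides"
# operations of Table 1 only, with entropy controlling constant size.
import random
import sympy as sp

x = sp.Symbol('x')

def random_equation(entropy=5):
    # Larger entropy -> larger random constants, i.e., more complex expressions.
    a, b, c = (random.randint(1, 2 ** entropy) for _ in range(3))
    return sp.Eq(a * x + b, c)

def apply_operation(eq, z, term):
    # E1 -z-> E2: apply the named operation to both sides of the equation.
    ops = {"ADD_SIDE": lambda s: s + term, "SUB_SIDE": lambda s: s - term,
           "MUL_SIDE": lambda s: s * term, "DIV_SIDE": lambda s: s / term}
    f = ops[z]
    return sp.Eq(f(eq.lhs), f(eq.rhs), evaluate=False)

e1 = random_equation()
z = random.choice(["ADD_SIDE", "SUB_SIDE", "MUL_SIDE", "DIV_SIDE"])
e2 = apply_operation(e1, z, random.randint(1, 9))
print(e1, "-", z, "->", e2)    # one labeled training step (E1, E2, z)
```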
5.2 Methods
To fully evaluate the effectiveness of our math operation representations, we also experiment with two other ways of encoding math expressions commonly used in natural language processing tasks, in addition to the tree embedding-based and translation-based encoders that we introduced in Section 4.2. These two encoders are a gated recurrent unit (GRU)-based encoder [4] and a convolutional neural network (CNN)-based encoder [17]; we use the output of these encoders to replace [e_1^T, e_2^T]^T as input to the classifier detailed in Section 4.1.

Specifically, these two encoders first concatenate the two math expressions before and after the step, i.e., E = [E_1, E_2]. For each character x_t in E, we compute its embedding

x_t = W^T onehot(x_t),

where W is a trainable embedding matrix. Using these character embeddings, the GRU encoder computes

h_t = GRU_θ(x_t, h_{t−1}),

where θ represents all the trainable parameters in the GRU. We then replace [e_1^T, e_2^T]^T with h_T as input to the classifier, where T is the total number of characters in E. Similarly, the CNN encoder computes

h = max_pool(CNN_φ([x_1, ..., x_T])),

where CNN_φ represents a 2D CNN with parameters φ and max_pool is a 1D max pooling operator. Combined, they return a fixed-dimensional feature vector h that replaces [e_1^T, e_2^T]^T as input to the classifier. For each of these two models, we learn the parameters jointly with the classification task using the cross-entropy loss that we described in Section 4.1.

Overall, we test five different methods for the math operation classification and feedback classification tasks. The first three methods use different encoding methods in conjunction with a classifier: i) using the GRU encoder to encode math expressions as input to the classifier, which we dub GRU+C, ii) using the tree embedding-based encoder instead, which we dub TE+C, and iii) using the CNN encoder instead, which we dub CNN+C. These methods do not learn explicit representations of math operations. The next two methods use the TransE and TransR frameworks to learn these representations using tree embeddings: iv) using the tree embedding-based encoder as input to the TransE framework in conjunction with a nearest neighbor classifier, which we dub TE+TransE, and v) using the TransR framework instead of the TransE framework to study math operations in multiple relation spaces, which we dub TE+TransR.
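For concreteness, here is a minimal sketch of the GRU-based baseline encoder (the character vocabulary and layer sizes are our assumptions): it embeds the characters of E = [E_1, E_2] and uses the final hidden state h_T in place of [e_1; e_2] as input to the classifier.

```python
# Minimal GRU+C encoder sketch; not the paper's exact architecture.
import torch
import torch.nn as nn

VOCAB = list("0123456789xyzmkngs+-*/=().,")    # hypothetical character set
EMB, HID, NUM_OPS = 32, 64, 7

embed = nn.Embedding(len(VOCAB), EMB)          # rows of the matrix W
gru = nn.GRU(EMB, HID, batch_first=True)
classifier = nn.Linear(HID, NUM_OPS)

def encode_step(E1, E2):
    chars = torch.tensor([[VOCAB.index(c) for c in E1 + E2]])
    _, h_T = gru(embed(chars))                 # final hidden state, (1, 1, HID)
    return h_T.squeeze(0)

logits = classifier(encode_step("7x+9=7-x", "7x+9+x=7-x+x"))
```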
5.3 Experimental Setup
We first test our math operation representation learning methods on the OK subset via 5-fold cross-validation, i.e., training on 80% of the steps in the subset to learn representations of math operations and testing on the remaining 20%. We also test the generalizability of the learned representations to incorrect steps, i.e., we replace the test set with the ERROR and BUG subsets and check whether we can still recognize the math operation a student applied in an incorrect step. The results are detailed in Section 5.4.1.

Since the distributions of math expressions in the OK, ERROR, and BUG data subsets are mostly similar with minor differences, the previous experiment does not give us a good idea of the generalization ability of our math operation representation learning methods. Therefore, we further divide the OK subset into six smaller subsets, each corresponding to a different difficulty level (with different structure and complexity) according to the questions within it, and test the generalizability of the learned math operation representations. The results are detailed in Section 5.4.2. In practice, solution step data generated by real students is often limited. Therefore, we conduct two more experiments to test whether synthetically generated steps can help us learn math operation representations that generalize to real data. First, we repeat the experiments above using synthetically generated steps as the training set. This synthetic training set consists of 1,000 steps for each math operation defined in Table 1 (adding up to a total of 7,000 across different difficulty levels). The results are detailed in Section 5.4.3. Second, to study the impact of synthetically generated data when real data is limited, we pre-train the math operation representations with synthetic data, fine-tune them on a small amount of real data from each difficulty level in the OK subset, and test on the rest. The results are detailed in Section 5.4.4.

To test the ability of our learned math operation representations to recognize student errors, we use them to classify the feedback types provided by CogTutor in the BUG data subset. Examples of such errors include when a student calculated the wrong simplification result, used the wrong sign in front of terms, or applied useless/illogical steps to solve the problem. The results are detailed in Section 5.4.5.

We use the Adam optimizer [18] with learning rate 0.001 and batch size 64 and run 10 training epochs for each experiment. The math expression encoder outputs length-512 embedding vectors for each math expression, which we adapt to length-32 embedding vectors using a trainable fully-connected neural network. All of our experiments were conducted on a server with a single Nvidia RTX8000 GPU.
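The following sketch summarizes this training configuration (the model, dataset, and loss objects are placeholders; only the optimizer, learning rate, batch size, epoch count, and 5-fold split are taken from the text above).

```python
# Minimal training-setup sketch for the experiments in Section 5.3.
import numpy as np
import torch
from torch.utils.data import DataLoader
from sklearn.model_selection import KFold

def train(model, dataset, loss_fn, epochs=10):
    opt = torch.optim.Adam(model.parameters(), lr=0.001)
    loader = DataLoader(dataset, batch_size=64, shuffle=True)
    for _ in range(epochs):
        for e1, e2, z in loader:
            opt.zero_grad()
            loss_fn(model, e1, e2, z).backward()
            opt.step()

# 5-fold cross-validation over the 43,413 OK steps (80%/20% per fold).
for train_idx, test_idx in KFold(n_splits=5, shuffle=True).split(np.arange(43413)):
    pass  # build the fold's train/test datasets from these indices, then train(...)
```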
5.4 Results and Discussion

5.4.1 Generalizing to incorrect steps

            OK           ERROR        BUG
GRU+C       99.18±0.23   93.87±0.66   95.89±0.63
TE+C        99.82±0.04   93.30±0.65   95.38±0.62
CNN+C       95.37±0.44   86.82±1.38   91.02±0.59
TE+TransE   96.27±0.17   86.32±1.23   84.21±2.13
TE+TransR   99.17±0.21   91.28±1.12   91.31±1.87

Table 2. Math operation classification accuracy for all methods training on the OK subset of the CogTutor dataset and testing on different data subsets. Accuracy is high across the board, while GRU-based encoding and tree embedding-based encoding in conjunction with a classifier result in the best performance.

Table 2 shows the averages and standard deviations of math operation classification accuracy for every method we experimented with, using the OK subset as the training set. As expected, testing on the ERROR and BUG subsets results in slightly lower (5-10%) math operation classification accuracy for all methods since the training set does not contain incorrect steps. However, even on steps that are incorrect, these methods can still effectively identify the math operation a student intended to apply (with up to 95% accuracy), suggesting that they may be applicable to fully open-ended question solving solutions that are not highly structured, unlike those in Cognitive Tutor, to provide feedback to teachers on students' solution approaches.

We observe that using GRUs and tree embeddings as representations for math expressions and applying a classification method on top of these representations result in similar performance; GRUs slightly outperform tree embeddings in the cases where we use the ERROR and BUG subsets as the test set, while tree embeddings slightly outperform GRUs in the case where we use a part of the OK subset as the test set. Using CNNs to encode math expressions as input to a classifier results in worse performance, suggesting that they do not capture the semantic and structural information in math expressions as well as GRUs and tree embeddings. As expected, using tree embeddings under the TransE and TransR frameworks leads to worse performance than the first two methods, with TransE achieving low performance (especially on the BUG subset) and TransR achieving comparable performance to the classification-based methods on the OK subset but lower performance on the ERROR and BUG subsets. This result can be explained by the additional structural restriction that math operations are represented as linear and additive in some embedding space in the TransE framework, which makes it less robust against incorrect student solution steps. Using the TransR framework mitigates this problem due to its use of different relation spaces for each math operation.

These methods perform similarly in the math operation classification task on real data largely due to the limited variation and complexity in the math expressions. The Cognitive Tutor system limits the degrees of freedom in a student's response by splitting an open-ended step into the separate actions of selecting a single math operation and entering the resulting math expression, which limits the variability in the data. In the next experiment, we see that when we control for different levels of complexity in these math expressions and force these methods to generalize across complexities, their performance varies significantly.
[Figure 4. Details of TE+TransE for the math operation classification task on the OK subset: (a) confusion matrix of math operation classification; (b) Euclidean distances between learned math operation embedding vectors. These results match our intuition on how these math operations are related.]

[Figure 5. Visualization of learned math expression changes for a randomly sampled subset of student solution steps in 2-D and the corresponding operations (best viewed in color).]

Figure 4 visualizes the confusion matrix for math operation classification on the OK subset and the pairwise Euclidean distances between math operation embeddings learned via the TransE framework using tree embeddings for math expressions. Rows correspond to the true math operations applied in steps and columns correspond to predicted ones. Percentages in the confusion matrix (Figure 4a) are normalized w.r.t. the number of appearances of each math operation. We see that our math operation representation learning method captures some of the meaning of these operations (Figure 4b); the learned math operation embeddings capture the structural changes in math expressions in ways that match our intuition. For instance, both COMBINE_ADD and COMBINE_MUL can be considered types of simplifications, so the Euclidean distance between the learned embeddings for these two operations is low. This observation is not surprising due to the similar nature of these operations. Moreover, COMBINE_ADD, COMBINE_MUL, and DISTRIBUTE are often confused with one another. These results are also validated by a 2-D visualization (using t-SNE [42] as a dimensionality reduction method) of the learned math operation embeddings in Figure 5, where different math operations are mostly well separated except for COMBINE_ADD, COMBINE_MUL, and DISTRIBUTE. One possible explanation is that these three operations are all applied to one side of the equation during a solution step, leaving the other side unchanged, while the other operations, such as ADD_SIDE, SUB_SIDE, MUL_SIDE, and DIV_SIDE, are all applied to both sides of the equation. Therefore, this result suggests that tree embeddings enable us to characterize a math operation by the structural change in math expressions before and after a solution step where it is applied. Furthermore, the classification accuracy for the DISTRIBUTE operation is significantly lower than that for other operations. This result is likely due to the fact that the number of steps with this operation is significantly lower than that for other operations.
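A Figure 5-style visualization can be produced with a few lines of standard tooling; the sketch below (variable names and the random stand-in data are ours) projects learned 32-dimensional per-step transition vectors to 2-D with t-SNE [42] and plots them colored by operation.

```python
# Minimal sketch of the Figure 5 visualization.
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

ops = ["COMBINE_ADD", "COMBINE_MUL", "ADD_SIDE", "SUB_SIDE",
       "MUL_SIDE", "DIV_SIDE", "DISTRIBUTE"]
# Stand-in for per-step transition vectors (e2 - e1) with operation labels.
vecs = np.random.randn(700, 32)
labels = np.random.randint(0, len(ops), 700)

xy = TSNE(n_components=2, perplexity=30).fit_transform(vecs)
for i, name in enumerate(ops):
    pts = xy[labels == i]
    plt.scatter(pts[:, 0], pts[:, 1], s=5, label=name)
plt.legend()
plt.show()
```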
5.4.2 Generalizing to different difficulty levels
In this experiment, we test the ability of our learned math operation representations to generalize to math expressions with different levels of complexity in questions at different levels of difficulty. Although they are all about equation solving, questions at different difficulty levels in Cognitive Tutor involve math expressions that look very different. For example, in the easiest level (ES_01), the equation that needs to be solved in a question looks like x + 5 = 9, with only a single variable and without numbers with decimals. In contrast, in the hardest level (ES_07), a question may contain coefficients with several decimal places and multiple variables, e.g., solve for m in the equation m(k − n) = gs. We only compare the GRU-based encoder and the tree embedding-based encoder in conjunction with a classifier since they are the best performing methods in the previous experiment. Table 3 lists the math operation classification accuracy for both methods after training on steps at different difficulty levels in the OK subset and testing on steps at other difficulty levels (including incorrect ones). We see that TE+C overall outperforms GRU+C in almost every case. This result suggests that tree embeddings are effective at capturing the structural properties of a math expression. As a result, math operation representations based on tree embeddings excel at capturing the structural change in math expressions before and after applying a math operation, leading to better generalizability than GRU-based encoding, which does not explicitly account for this change.

5.4.3 Generalizing to different data distributions
In this experiment, we test the ability of our methods to generalize from synthetically generated data to real student data. We train different math operation classification methods on the 7,000 synthetically generated steps and test them on steps generated by real students in the CogTutor dataset. Table 4 shows the mean and standard deviation for each method on each real data subset. We see that TE+C significantly outperforms GRU+C and CNN+C on all data subsets, which is in stark contrast to the previous experiment, where the difference in performance across all methods is much smaller. This observation suggests that tree embeddings are more effective at capturing the semantic/structural effect of math operations on math expressions, thus generalizing better to different data distributions. Indeed, although the synthetically generated steps and the real steps have the same set of math operations, the distributions of numbers
(1, 0.5, −7, etc.) and variables (x, u, t, etc.) differ, resulting in a mismatch between the data distributions. Tree embedding-based methods benefit from the tree-based representations of math expressions that can effectively capture structural information, making it easy for the learned embeddings of math expressions to generalize to unseen data.

Train on   Method   OK            ERROR         BUG
ES_01      GRU+C    58.82±1.12    63.74±1.13    66.02±1.12
           TE+C     76.51±0.62    84.24±0.87    67.49±1.10
ES_02      GRU+C    71.05±1.12    76.66±1.11    69.01±1.14
           TE+C     87.89±0.34    93.96±0.72    80.44±0.78
ES_03      GRU+C    82.39±3.93    79.24±1.47    80.01±1.67
           TE+C     90.79±1.12    93.83±1.32    84.70±1.54
ES_04      GRU+C    76.72±0.14    71.35±6.12    83.32±2.24
           TE+C     94.65±0.12    92.72±1.32    90.99±1.72
ES_05      GRU+C    81.74±0.33    73.36±1.69    78.36±1.07
           TE+C     87.66±0.25    80.00±1.32    77.81±0.99
ES_07      GRU+C    76.25±3.21    73.15±3.42    67.35±3.62
           TE+C     79.44±0.62    79.29±0.72    72.53±2.26

Table 3. Math operation classification accuracy after training on steps at different difficulty levels and testing on the OK, ERROR, and BUG subsets. Tree embedding-based encoding outperforms GRU-based encoding.

Method      OK            ERROR         BUG
GRU+C       62.89±3.93    64.06±4.70    62.94±2.24
TE+C        83.79±0.14    75.49±0.90    75.16±0.55
CNN+C       51.12±1.64    45.52±0.98    59.82±1.68
TE+TransE   80.17±2.32    71.86±3.24    72.32±2.72
TE+TransR   82.22±2.88    73.83±3.46    74.85±3.23

Table 4. Math operation classification accuracy for all methods training on 7,000 synthetically generated steps and testing on different subsets of the CogTutor dataset. Tree embedding-based methods significantly outperform other methods, showing better ability to generalize to different data distributions.

5.4.4 Generalizing from synthetic data
Ideally, if there were a large amount of training data, i.e., steps generated by real students containing different types of math expressions and detailed labels on these steps such as the math operation(s) applied, the error(s) if a step is incorrect, and the corresponding feedback, we could simply use that data to learn our math operation representations. However, in practice, the amount of real data is often limited. Figure 6 plots the performance of TE+C on all subsets of the CogTutor dataset, training on a portion of the steps in each subset and testing on the rest. We see that the performance on math operation classification suffers considerably when we only have limited training data.

[Figure 6. Math operation classification accuracy for the TE+C method when real data is limited. Using synthetically generated steps as a starting point, we already start with acceptable classification accuracy even with few real steps generated by students. The performance steadily improves as more real data becomes available.]

Therefore, synthetically generated data can play a vital role in improving performance under this circumstance; the strategy of fine-tuning models trained on synthetically generated data using a small amount of real data can be effective. Specifically, we start with a math operation classification model pre-trained on the 7,000 synthetically generated steps and fine-tune it on a small number of real steps by doing gradient descent on these steps for 10 epochs. Figure 7 plots the improvement in math operation classification accuracy for the fine-tuned model over a model trained only on real data of various amounts on all data subsets. We see that the pre-trained model always performs better, with significant improvement when the real data is extremely limited. This result suggests that i) effectively leveraging synthetically generated data can mitigate the problem of limited real data and ii) our math operation representation learning methods are capable of generalizing across different data distributions (synthetic → real).

[Figure 7. Math operation classification accuracy difference (in percentage) for TE+C, pre-training on synthetic data before fine-tuning on real data versus training only on real data. When real data is limited, pre-training on synthetic data results in significantly better performance.]
mathoperation(s)applied,theerror(s)ifastepisincorrect,
and corresponding feedback, we can simply use that data
to learn our math operation representations. However, in 5.4.5 Feedbacktypeclassification
practice, the amount of real data is often limited. Figure 6 In this experiment, we evaluate our math operation rep-
plots the performance of TE+C on all subsets of the Cog- resentation learning methods on the feedback type classi-
Tutor dataset, training on a portion of steps in the subset fication task. These feedback items were automatically de-
fortrainingandtestingontherest. Weseethattheperfor- ployedbyCognitiveTutorforincorrectstepsintheBUGsub-
mance on math operation classification suffers considerably set. We pre-processed these steps and grouped the detailed
when we only have limited training data. Therefore, syn- feedback items according to the students’ errors that each
feedback item addresses, narrowing them down to a total of 24 types that occur multiple times. We perform 5-fold cross-validation on this subset. Table 5 shows the averages and standard deviations of feedback classification accuracy for all methods on this task across the five folds. We see that, due to the limited size of the BUG subset (only 5,744 steps) and the high number of classes (24), all methods perform worse than they do on the math operation classification task. Specifically, we see that the tree embedding-based encoder in conjunction with a classifier performs best, while GRU-based encoding also performs well. This result shows that although tree embeddings are superior at capturing the meaning of math expressions, their advantage over simple encoding methods such as GRU-based encoding decreases due to increased noise in the data; some math expressions submitted by students in incorrect steps are ill-posed and do not make sense. Using the TransE and TransR frameworks results in slightly worse performance than the classifiers since these methods explicitly learn a representation for each math operation, which limits their performance on this task due to the shortage of training data. However, since they capture the structural difference in math expressions before and after the step, they can cancel out some of the noise in erroneous steps, resulting in acceptable performance.

Method       Accuracy
GRU+C        75.35±1.41
TE+C         78.71±1.74
CNN+C        67.23±1.54
TE+TransE    69.15±1.13
TE+TransR    73.21±1.63

Table 5. Feedback type classification accuracy for all methods on the BUG subset. Tree embedding-based encoding outperforms other encoding methods, while the TransE and TransR frameworks do not reach similar performance levels due to the shortage of training data.

5.5 Discussions
Overall, we find that the GRU-based and tree embedding-based math expression encoders in conjunction with a classifier perform almost equally well in most situations, while the CNN-based encoder performs worse. The tree embedding-based encoder has stronger generalizability across different data distributions. We believe that as the math expressions and operations get more complicated, methods that leverage the tree structure of math expressions will be more advantageous. We also observe that TransR outperforms TransE most of the time, although in some experiments using TransE and TransR to explicitly learn math operation embeddings leads to slightly worse performance than classifiers using implicit representations of math operations. However, TransE and TransR are much more powerful and enable us to study more tasks such as clustering solution steps, identifying typical student errors, and learning solution strategies; see Section 6 for a detailed discussion.

6. CONCLUSIONS AND FUTURE WORK
In this paper, we developed a series of methods to learn representations of math operations by observing how math expressions change as a result of these operations in step-by-step solutions to open-ended math questions. Our methods leverage math expression encoding methods that map tree-structured math expressions into a math embedding vector space. We demonstrated the effectiveness of our methods on a dataset containing detailed student solution steps to equation solving questions in the Cognitive Tutor system on two tasks: i) classifying the math operation applied in each step and ii) classifying the feedback the system deploys for each incorrect step. Results show that our learned math operation representations are meaningful and can often effectively generalize across different data distributions, such as questions with different difficulty levels.

However, the success of our methods heavily depends on the availability of diverse, large-scale training data. The Cognitive Tutor dataset that we used in this work represents a heavily restricted solution process since the list of math operations a student can apply in a step is pre-defined. Therefore, additional work has to be done to extend our methods to truly open-ended step-by-step solution processes that are less structured. Moreover, our methods are restricted to a single solution step and do not consider the relationship across multiple steps, which is related to another important aspect of solving open-ended math questions: the overall solution strategy, i.e., which math operation to apply next. Furthermore, in both classification tasks, using tree embeddings to encode math expressions in conjunction with a classifier outperforms explicitly learning vectorized representations of math operations in the TransE and TransR frameworks. However, these explicit representations may enable us to perform other tasks, such as clustering solution steps and identifying typical student errors. Nevertheless, our work provides a series of tools to analyze the math expressions students write down in their solutions by bridging the gap between symbolic math representations and continuous representations in vector spaces, enabling the use of state-of-the-art neural network-based methods. We believe that this work can potentially open up a new line of research that studies how to automatically analyze student solutions for grading and feedback purposes.

There are many avenues for future work. First, since most real-world open-ended solutions contain a mixture of math expressions and text, there is a need to learn a joint representation of math expressions and text in a shared embedding space. Second, this joint representation will enable us to train automated feedback generation methods in an end-to-end manner, using sequence-to-sequence learning methods [41]. Third, using learned math expression representations as the states and learned math operation representations from the TransE and TransR frameworks as the state transition model, we can apply reinforcement learning and inverse reinforcement learning methods to learn solution strategies, i.e., which math operation to apply in the next step. We can also study solution strategies employed by real students [33], diagnose their errors, and design corresponding feedback mechanisms to improve their learning outcomes. These future work directions will enable us to tap into the full potential of explicit math operation representations, which is not fully demonstrated in this paper: on the CogTutor dataset, the only relevant real-world dataset we found, we could only evaluate these explicit representations on the math operation and feedback prediction tasks, where they may not outperform tree embedding-based classification methods.