Automatically classifying student help requests: a multi-year analysis

Zhikai Gao, North Carolina State University, [email protected]
Collin Lynch, North Carolina State University, cfl[email protected]
Sarah Heckman, North Carolina State University, [email protected]
Tiffany Barnes, North Carolina State University, [email protected]

ABSTRACT
As Computer Science has increased in popularity, so too have class sizes and demands on faculty to provide support. It is therefore more important than ever for us to identify new ways to triage student questions, identify common problems, target students who need the most help, and better manage instructors' time. By analyzing interaction data from office hours we can identify common patterns and help to guide future help-seeking. My Digital Hand (MDH) is an online ticketing system that allows students to post help requests and instructors to prioritize support and track common issues. In this research we collected and analyzed a corpus of student questions from across six semesters of a CS2 course focused on object-oriented programming [17]. As part of this work we grouped the interactions into five categories, analyzed the distribution of help requests, balanced the categories with the Synthetic Minority Oversampling Technique (SMOTE), and trained an automatic classifier based upon LightGBM to classify student requests. We found that over 69% of the questions were unclear or barely specified. We demonstrated the stability of the model across semesters through leave-one-out cross-validation, and the final model achieves an accuracy of 91.8%. Finally, we find that online office hours can provide more help for more students.

Keywords
Office Hours, Computer Science Education Research, Text Analysis, Help-Seeking Requests

1. INTRODUCTION
Over the past decade the popularity of CS majors has increased and enrollments have skyrocketed [2]. This has created challenges for instructors, with increasing demands for individual support, collaborative learning, and automated guidance [12, 2, 16, 15]. As the size of courses and cohorts has increased, the demand for office hours has begun to exceed the time that instructors and staff have available [12, 13]. To address these needs instructors have adopted a wide range of innovative support models including virtual office hours [10], peer support [7], and ticketing systems for help-seeking interactions [13]. The last approach is exemplified by My Digital Hand (MDH) [21], an online support system for office hours which allows students to queue for office hours, post questions in advance, and record the outcome of interactions. MDH assists students in structuring their help-seeking interactions with teaching staff. It also assists instructors and teaching assistants (TAs) in managing their courses by allowing them to triage student questions and target their effort during office hours to be efficient and to meet group and individual needs. MDH also tracks help-seeking and interaction data throughout the whole semester. Using this data, we can identify patterns in students' help requests and automatically classify the questions. One common challenge for help-seeking in large classes arises when many students ask the same or similar questions but must be helped separately, eating up limited instructor time. One approach to address this is to develop automated Q&A systems which can leverage common problems. In order for this to work, however, students must provide sufficient information about their problems so that they can receive targeted support.

Our goal is to develop analytical methods to understand what kinds of help students seek during office hours, how they frame their questions to the instructors, and whether or not we can automatically classify questions to support guidance and time management. By analyzing students' help requests across course offerings we can better understand what kinds of challenges the students are facing and how the teaching staff can better anticipate students' needs and target their limited support. Moreover, by automatically classifying help requests we can help teaching staff to efficiently triage student questions and identify common problems that may be solved with group support or peer assistance. Over the long term we will develop summary statistics which can be used to support instructors in course management, and we will augment our existing ticketing system with support for automatic categorization.
In this paper we will address four specific research questions in the context of a CS2 course focused on object-oriented programming:

• RQ1: What types of questions do students ask on MDH during office hours, and how do they formulate their descriptions?

• RQ2: How can we automatically classify student help requests and tickets?

• RQ3: How robust is our classification model across different offerings?

• RQ4: Compared to regular office hours, do online office hours provide more benefits?

In order to address RQ1 we analyzed our dataset to identify common patterns in student questions and to classify them into five categories. We then addressed RQ2 and RQ3 by training an automated classifier for student questions with the goal of evaluating its stability across semesters. Due to the COVID-19 pandemic, all courses operated online in Fall 2020, which gave us an opportunity to study the advantages and disadvantages of hosting office hours online. We therefore analyzed and compared the data patterns of Fall 2020 (F20) with the other, regular semesters to address RQ4.

1.1 Background
Prior researchers have analyzed student help requests with the goal of understanding student behaviors. Xu and Lynch, for example, applied deep learning approaches to classify student question topics in MOOC discussion forums [24]. In that work they collected student posts from two offerings of a MOOC on Big Data in Education, classified the questions into one of three types (Course Content Question, Technique Question, and Course Logic Question), and developed an automatic classifier using Recurrent Neural Networks to divide questions into those three categories. While the models were successful within a single offering, Xu and Lynch found that they did not generalize across offerings. Thus, the system suffered from a cold-start problem in each semester.

Vellukunnel et al. in turn collected Piazza posts from CS2 courses offered at two institutions and analyzed the type and distribution of the questions students asked [22]. As part of this work they manually partitioned the questions into five categories and then analyzed the impact of students' question types on their final grades. They concluded that asking constructive questions can help students to develop a better understanding of the course materials and in turn receive better grades. This analysis has informed our own work. However the Piazza platform, unlike MDH, is designed to support interactive discussion and online peer support through the use of threads and replies. By contrast the MDH system is focused on initial help-seeking and not on collaborative dialogue. Therefore it is unclear whether our results will align with theirs.

Prior researchers have also studied how instructors manage office hours and how to make face-to-face support time more efficient and effective. Guzdial, for example, argued that office hours should incorporate diverse teaching techniques including pair programming, peer instruction, and backward design. These approaches, he argued, would potentially work to reduce wait times and support enhanced learning outcomes [8]. In order to provide more convenience for students, Harvard University introduced virtual office hours in an introductory programming course, CS50, so that students could interact with teaching staff online [14]. However, they found that those virtual sessions were often inefficient and took more time to address the students' problems. This research is complicated by the fact that students frequently avoid seeking help from teaching staff when they need it [1]. Some of the factors behind this help-avoidance include a lack of trust in the tutor's abilities, inaccessibility of office hours due to timing or other constraints, and a desire for independence in learning [18]. While prior research provides some guidance on the design of office hours and the need to reach out to students, the impact of how students frame their help requests has not yet been analyzed extensively. One notable exception is the work of Ren, Krishnamurthi, and Fisler, who designed a survey-based method to help track students' help-seeking interactions during office hours in programming-based CS courses [20]. While informative, their approach is difficult to generalize as it depends on requiring the teaching staff to complete a detailed form after every interaction. In MDH, by contrast, we collect much of the data upfront as an integral part of the process.

2. METHODS
2.1 MDH System
My Digital Hand (MDH) [21] is a ticketing system for office hours that was developed to facilitate large CS courses. Students using MDH request help during office hours by "raising a virtual hand", that is, creating a ticket which lists the topic they need help on, describes the issue they are facing, and records the steps they have taken to address it. Once the ticket is created it is visible to the teaching staff, who can then use it to prioritize interactions or even group students together for help. Once the interaction is complete the teaching staff can close the ticket and describe how the interaction played out. Students are also given the opportunity to evaluate the help received. These feedback questions are configurable and set by instructors at the start of the semester.

This data allows instructors to identify common issues facing students and to track the time it takes for students to receive support from the teaching staff as well as the duration of each help session. In a prior analysis of MDH data, Smith et al. found that 5% of students in a course accounted for 50% of office hour time, and that long individual interaction times, representing students who needed long and detailed guidance, served to delay many other short questions [21]. They concluded that a small but critical group of students is reliant on individual tutoring via office hours, while other students who need intermittent help are often unable to obtain support. These findings have motivated our own focus on developing analytical tools which can be used to analyze, prioritize, and manage help requests so that high-demand students do not shut out their peers.
2.2 Data Collection
We analyzed data from seven semesters of a typical second-semester object-oriented programming course [17] at a research-intensive public university in the southeastern United States. Basic descriptive statistics for the dataset are shown in Table 1. Students produced an average of 1477 tickets per semester, with higher volumes in the fall semesters due to larger class sizes. The number of tickets also increased year over year due in part to larger class sizes, greater emphasis on tool use by the course instructor, and higher per-capita demand for office hours.

Table 1: Number of tickets and students for each semester (F = Fall, S = Spring)

              F17    S18    F18    S19    F19    S20    F20
  tickets    1146    609   1224    860   1650   1401   3452
  students    208    157    259    174    256    191    303

The course is structured as a single lecture section with 12 small-group lab sessions which are held weekly. Over the course of the semester students complete weekly lab assignments, 2-3 individual or team Projects (C-Projects), and 3 separate Guided Projects (G-Projects). The G-Projects are designed to provide a review of prerequisite materials and to introduce students to new concepts. The C-Projects are generally structured as two sub-assignments, one focused on design and system testing, and the other on implementation and unit testing. Students manage their code via the GitHub platform with integrated support for the Jenkins automated testing server. When students commit code they receive automated testing results from instructor-authored test cases as well as test cases that they supplied. The students use feedback from test failures to guide their work and their help-seeking [6].

In Fall 2020 this course moved fully online due to the pandemic. All office hours were hosted through Zoom meetings where students can share their screens with the teaching staff to show their code or any problems.

The interaction data is the most important for our current analysis. The format of the interaction records, along with selected examples, is shown in Table 2. For this analysis each ticket consists of three major parts: the participants, the time and duration of the interaction, and the context.

Table 2: Attributes of the interaction data

  Attribute                Explanation                                     Example
  interaction id           ID for each ticket                              30072
  student id               ID for the student who raised the ticket        1950
  teacher id               ID for the teacher who handled the ticket       20810
  time raised hand         Timestamp when the ticket was raised            2019-03-08 19:34:09
  time interaction began   Timestamp when the interaction began            2019-03-08 19:38:39
  time interaction ended   Timestamp when the interaction ended            2019-03-08 20:01:02
  I'm working on           Topic of the question                           Program 1 Part 1
  my problem is            Detailed statement of the question              Null Pointer on TS test
  I've tried               The solutions the student tried before          Debugging
                           raising the ticket

2.3 RQ1: Categorization of Questions
Our primary focus in RQ1 is to identify the types of questions the students are asking and to understand how they describe their work. MDH allows students to frame their question topic or description in any way that they wish; the platform does not provide a list of suggested topics or mandate content beyond the basic text. This, in turn, leads students to vary widely in the descriptions and content that they provide. We therefore studied two features of the questions with the goal of supporting classification: the students' topic, as contained in the "I'm working on" field, and the longer problem description, as stated in the "my problem is" field.

2.3.1 Classified by Topic
Rapidly identifying, or even anticipating, students' question topics would allow teaching staff to anticipate the kinds of issues they should be prepared for, and may also allow them to set up mini-groups within office hours to deal with problems assignment by assignment, or to separate code questions from conceptual ones. We therefore performed a manual analysis of the topics in our study dataset over all semesters with the goal of determining how students label their topics, and whether it is possible to either anticipate or sort their posts as they come in.

Our preliminary analysis showed that in most cases the students simply entered the name of their current assignment, or an abbreviation of it, and provided no other details. Moreover, due to the structure of the course deadlines, almost every help request in a given session was focused on the same assignment. In the newest version of MDH this field is now a checkbox and the instructor can set the assignment options. As a consequence we decided to omit this field from our classification task and to focus on the types of help being sought.

2.3.2 Classified by Description
In the description section ("my problem is"), the students can provide a rich summary of their problem including a text description, bug reports, or even code snippets. If it is possible to automatically classify student posts then we can use that approach to triage student questions as they come in, perhaps separating long questions from short. We therefore performed a manual analysis of the description content as well, with the goal of identifying useful categories of posts. We also sought to examine how complex the problem descriptions were. In our prior discussions with the teaching staff they reported that many students provide too little information in the description (e.g. a single word such as "Errors"), provide too much (e.g. a full execution dump and error log), or simply type gibberish with the simple goal of securing a place in line.
All of these strategies are problematic, either because they provide too little information to effectively triage posts, or because the dump is too complex and likely out of date before the student reaches the head of the line. In our analysis we therefore examined both the length and the structure of the students' submissions, with the goal of understanding whether we can provide automatic scaffolding for useful posts and automatic triage of the submissions.

Figure 1: Distribution of the number of words in the questions' descriptions.

We examined the length, content, and complexity of the students' problem descriptions, including the specific "my problem is" prompt. We also grouped common words to find keywords that are associated with specific types of submissions. We then used this information to inform our understanding of the students' posting behavior and the design of the posting categories. This preliminary analysis was consistent with the experience described by the teaching staff. Figure 1 shows that the average description was less than five words long. When reviewing those short posts we noted that many students preferred to use keywords to indicate their question topics and problems rather than spelling their issues out. For example, when encountering an error in implementing an add() function, they often put "add()" as the problem description, assuming that the method name, together with the assignment information, provided enough context for the help request.

This led us to focus on the specific terms that students use in their problem descriptions. In this analysis we grouped words by stemming and ignored stopwords to focus on the primary content information. The top 20 words are shown in Figure 2, top among them being test. This is consistent with the design of the assignments, where students were provided with tests and required to develop their own. It is also consistent with the teaching staff's observation that many students focus on the tests as a guide for their progress and for where they need help. Upon closer examination of questions using this word we found that most interactions were focused on failed test cases; a typical description for a question of this type was "2 test case fail". It would be difficult for instructors to interpret these without reviewing the code and the test results in more detail, but such a review takes time. Another closely related word that was common in this dataset is error, which was used primarily when students encountered bugs or other failures. In these cases in particular, the teaching staff noted that some students would simply paste the crash report into the question with little other context. This kind of behavior is rare in the data but was also useful for instructors, so we used it as an additional factor.

Figure 2: Top 20 most common words in the question descriptions.
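This word-frequency analysis can be reproduced with standard tools. The sketch below is a minimal illustration using NLTK's Porter stemmer and English stopword list; the paper does not specify which stemmer or stopword list was used, and the example descriptions here are invented.

    # Minimal sketch of the keyword analysis: stem words, drop stopwords,
    # and count the most frequent stems. Requires nltk.download("stopwords").
    from collections import Counter
    from nltk.corpus import stopwords
    from nltk.stem import PorterStemmer

    descriptions = ["2 test case fail", "failing tests", "jenkins errors"]  # illustrative only

    stemmer = PorterStemmer()
    stops = set(stopwords.words("english"))
    counts = Counter(
        stemmer.stem(word)
        for text in descriptions
        for word in text.lower().split()
        if word not in stops
    )
    print(counts.most_common(20))  # top 20 stems, as in Figure 2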
Based upon this preliminary analysis we defined five categories of help requests based upon the problem descriptions. These categories are shown in Table 3. We then manually labeled all of the interactions. For each question we also ranked the clarity or comprehensibility of the question based upon the description provided. As we discuss below, most of the questions provided insufficient information to diagnose the problem. However, as Figure 1 shows, some students did elaborate on their problems thoroughly, as measured by the number of words in the problem description.

Table 3: Explanation and example for all five categories we developed

  Category      Explanation                                       Example description ("my problem is")
  Useless       The description contains nothing related          I would like to check of it
                to the question.
  Insufficient  The description contains partial information      TicketManager getInstanceOf
                about the student question, but not enough
                for instructors to understand the details.
  Sufficient    Contains enough detail on the question for        CourseRecordIOTest, I seem to be failing
                instructors to understand. Usually a very         reading the files, at the moment it is
                clear sentence.                                   testing the size of the ArrayLists, but
                                                                  I am passing writing the files
  Copied Error  Contains a copied error from the compiler.        I got this error: TypeError: barh() got
                                                                  multiple values for argument 'width'
  Test          A problem related to failing test cases.          2 test cases failed

2.4 Labeling Process
2.4.1 Codebook
To investigate the distribution of the above five categories in our data, we first needed to set up a standard for categorizing the data and then apply it. All seven semesters' data was labeled by one researcher according to the following rules:
• Check if there is any text that is clearly an error message copied from the compiler or a test failure. If so, label it as Copied Error. Note that if the student describes the error message in their own words, the ticket should instead be classified as Sufficient.

• Check if there is any text indicating that this is a test problem, regardless of whether the description gives the detail of the test error. If you are sure that it is a test problem, label it as Test. If it also qualifies as Copied Error, classify it as Copied Error.

• If the text does not provide any information about the question and you cannot understand or deduce anything related to it, classify it as Useless.
• If the text does indicate what or where the problem is, but not in enough detail for you to fully understand or identify the question, mark it as Insufficient.

• If the text only contains one or more words naming the method associated with a problem, it can help to localize the problem but provides no additional details. It should be classified as Insufficient.

• If the text is of the form "I don't understand xxxx" without further explanation of which part they do not understand or other details, classify it as Insufficient.

• If the text is of the form "I don't understand xxxx" with some further explanation, classify it as Sufficient.

• If the text tells you what the question is, or describes how the student encountered the problem, classify it as Sufficient.
2.4.2 Interrater Reliability
After all of the data was labeled, we randomly generated a subset of 150 unique questions (30 for each category) and sent it to another researcher to rate. In this subset we revealed the labels of 10 questions in each category as example data, and the rater classified the remaining questions based on those examples and the codebook. We then compared the result with the original labels and calculated Kappa to represent the interrater reliability. Kappa [4] is widely applied for measuring the agreement between two coders while accounting for chance agreement; a score higher than 0.8 is generally considered acceptable. In our case, the final unweighted Kappa value was 0.815, which is acceptable.
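This agreement check corresponds to standard unweighted Cohen's kappa. A minimal sketch using scikit-learn is shown below; the two label lists are invented stand-ins for the 150-question subset.

    # Minimal sketch of the interrater check with unweighted Cohen's kappa.
    from sklearn.metrics import cohen_kappa_score

    # Hypothetical parallel labels from the two raters (illustrative only).
    rater_1 = ["Insufficient", "Test", "Useless", "Sufficient", "Copied Error"]
    rater_2 = ["Insufficient", "Test", "Useless", "Insufficient", "Copied Error"]

    kappa = cohen_kappa_score(rater_1, rater_2)
    print(f"Unweighted kappa: {kappa:.3f}")  # the full subset yielded 0.815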
2.5 RQ2: Modeling
In addressing RQ2 we drew on the basic categorization developed in RQ1 to train automatic classifiers that can triage posts by topic and content. We used the first six semesters of data to train our model and the last semester (F20) as the testing set to evaluate it. To train our classification model we first extracted training features from the problem descriptions across our dataset. The features included content features, such as the keywords described above, as well as meta-text features such as the length, the number of stop-words (as a general proxy for specificity), the punctuation, and the character case. These meta-text features have the advantage that they are easy to extract automatically and can therefore be used for automated triage. Length, for example, is a suitable proxy for completeness and coherence, while punctuation and case shifting are common in error messages. The full list of these features is shown in Table 4.

Table 4: Text meta-feature list

  Feature      Explanation                                                Example value
  length       number of words in the problem description                 12
  character    number of characters in the problem description            12
  stop words   number of stop words in the problem description            2
  punctuation  number of punctuation marks in the problem description     0
  uppercase    number of uppercase words in the problem description       0
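A minimal sketch of how such meta-text features might be extracted is shown below. The exact definitions (for example, which stop-word list is used, or whether "uppercase words" means fully capitalized tokens) are not specified above, so the choices here are assumptions.

    # Sketch of the Table 4 meta-text features for one problem description.
    import string

    STOP_WORDS = {"i", "a", "an", "the", "is", "it", "my", "on", "to", "and"}  # assumed list

    def meta_text_features(description):
        words = description.split()
        return {
            "length": len(words),                                     # word count
            "character": len(description),                            # character count
            "stop_words": sum(w.lower() in STOP_WORDS for w in words),
            "punctuation": sum(c in string.punctuation for c in description),
            "uppercase": sum(w.isupper() for w in words),             # fully uppercase tokens
        }

    print(meta_text_features("2 test cases failed"))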
We represented the text features as a tf-idf [19] matrix and a basic word count matrix over the content. The word count matrix is simply a 2D array which describes how many times each term appears in each question text. The tf-idf matrix is the product of two statistics: term frequency and inverse document frequency. The term frequency is the raw count of a term in a text. The inverse document frequency is a measure of how much information the word provides. Some common words like "is" or "that" do not provide much information but usually have a high term frequency; those words should have a low inverse document frequency (idf). We can calculate the value as:

    idf(t) = ln( Total number of documents / Number of documents containing term t )    (1)
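Equation 1 can be computed directly, and libraries such as scikit-learn provide the full count and tf-idf transforms (note that scikit-learn's TfidfVectorizer adds smoothing terms, so it does not match Equation 1 exactly). The corpus below is illustrative.

    # Plain idf per Equation 1, plus the sparse count and tf-idf matrices.
    import math
    from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

    docs = ["2 test case fail", "null pointer on ts test", "jenkins errors"]  # illustrative

    def idf(term, documents):
        df = sum(term in d.split() for d in documents)  # documents containing the term
        return math.log(len(documents) / df)

    print(idf("test", docs))  # ln(3 / 2)

    count_matrix = CountVectorizer().fit_transform(docs)   # word count matrix
    tfidf_matrix = TfidfVectorizer().fit_transform(docs)   # tf-idf matrix (smoothed variant)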
In our preliminary analysis we found that these matrices performed poorly in classification because both were extremely sparse. We therefore opted to compress them so that they would be compatible with dense-feature approaches. To that end we built a Naive Bayes model [9] using the sparse tf-idf features and then used its predictions as features. From this model we generated five shallow prediction features which correspond to the probability that the question belongs to each category. We followed the same approach with the word count vector and used those predicted probabilities as features as well. The final list of extracted content features is shown in Table 5.

Table 5: Text content feature list

  Feature           Explanation                                                          Example value
  prob-tf-useless   probability that the ticket belongs to "Useless" (tf-idf)            0.75
  prob-tf-ins       probability that the ticket belongs to "Insufficient" (tf-idf)       0.82
  prob-tf-suf       probability that the ticket belongs to "Sufficient" (tf-idf)         0.77
  prob-tf-error     probability that the ticket belongs to "Copied Error" (tf-idf)       0.99
  prob-tf-test      probability that the ticket belongs to "Test" (tf-idf)               0.65
  prob-cnt-useless  probability that the ticket belongs to "Useless" (word count)        0.75
  prob-cnt-ins      probability that the ticket belongs to "Insufficient" (word count)   0.82
  prob-cnt-suf      probability that the ticket belongs to "Sufficient" (word count)     0.77
  prob-cnt-error    probability that the ticket belongs to "Copied Error" (word count)   0.99
  prob-cnt-test     probability that the ticket belongs to "Test" (word count)           0.65
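A minimal sketch of this compression step is shown below, assuming a multinomial Naive Bayes model; the text above cites a Naive Bayes model [9] without naming the variant, and the labeled examples here are invented.

    # Sketch: compress sparse text matrices into five dense probability features.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.naive_bayes import MultinomialNB

    docs = ["2 test case fail", "null pointer on ts test", "jenkins errors", "add()"]
    labels = ["Test", "Sufficient", "Copied Error", "Insufficient"]  # illustrative gold labels

    X_sparse = TfidfVectorizer().fit_transform(docs)
    nb = MultinomialNB().fit(X_sparse, labels)

    # One column per category (the prob-tf-* features in Table 5); repeating
    # this with a CountVectorizer matrix yields the prob-cnt-* features.
    shallow_features = nb.predict_proba(X_sparse)
    print(nb.classes_, shallow_features.shape)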
2.5.1 Model Training
We trained our classification models using LightGBM [11], a Gradient Boosting Decision Tree (GBDT) algorithm provided by Microsoft. GBDT is an ensemble model of decision trees trained in sequence; in each iteration, the next tree is fit to the negative gradients (also known as the residual errors). To reduce the complexity of GBDT, LightGBM adds two novel techniques: Gradient-based One-Side Sampling and Exclusive Feature Bundling. It also uses a leaf-wise tree growth algorithm to optimize the accuracy of the model, with a maximum tree depth to counter the over-fitting this might otherwise cause. Further, it optimizes training speed by calculating the gain for each split using histogram subtraction. LightGBM is known for its strong performance and relatively good speed, and many researchers have applied it to machine learning tasks.
The implementation of LightGBM is provided by Microsoft [11], and we used its Python library for the modeling process. We passed the features and labels of the training data to LightGBM to train a model, and then applied that model to the testing data to predict the category of each question. By calculating the accuracy of the predictions we can evaluate the performance of the model. We ran a series of 20 preliminary experiments to explore the parameter space before settling on the values listed in Table 6, which also lists the crucial parameters that generally need to be tuned to improve classification performance. Since our goal was higher accuracy, in each experiment we tuned toward a larger max_bin, a smaller learning_rate with more num_iterations, more num_leaves, and a larger max_depth, until the accuracy stopped improving.

Table 6: Common LightGBM training parameters and the final values after tuning our model

  Parameter       Meaning                                                   Final value
  num_leaves      number of leaves in one tree                              1000
  max_depth       maximum depth to which each tree will grow                10
  max_bin         maximum number of bins used to bucket feature values      150
  learning_rate   learning rate of the gradient boosting                    0.1
  num_iterations  number of boosting iterations to be performed             32
  num_class       number of classes (used only for multi-class tasks)       5
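A minimal training sketch with these settings is shown below; the feature matrix and labels are random placeholders rather than the real extracted features.

    # Sketch: train the multiclass LightGBM model with the Table 6 parameters.
    import lightgbm as lgb
    import numpy as np

    rng = np.random.default_rng(0)
    X_train = rng.random((500, 15))       # placeholder feature matrix
    y_train = rng.integers(0, 5, 500)     # placeholder category labels (0..4)

    params = {
        "objective": "multiclass",
        "num_class": 5,
        "num_leaves": 1000,
        "max_depth": 10,
        "learning_rate": 0.1,
    }
    # max_bin is a dataset-level parameter in the Python API.
    train_set = lgb.Dataset(X_train, label=y_train, params={"max_bin": 150})
    model = lgb.train(params, train_set, num_boost_round=32)  # num_iterations

    X_test = rng.random((100, 15))
    predicted = model.predict(X_test).argmax(axis=1)  # most probable category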
2.5.2 SMOTE
Another issue we faced during the modeling process is that the categories are highly imbalanced: over half of the questions are in the Insufficient category, while the Copied Error category contains fewer than one percent of the questions. To address this problem we applied the SMOTE method, which over-samples examples in the minority classes. SMOTE [5] first selects a minority-class instance at random, then creates a synthetic instance by choosing one of its k nearest neighbors at random and sampling along the line segment connecting the two instances in feature space. We applied this method with k=5, oversampling the data to generate the training and testing datasets for further model training.
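A minimal sketch using the imbalanced-learn implementation of SMOTE is shown below; the library choice and the class counts are ours, with the counts invented to mimic the skew described above.

    # Sketch: oversample the minority categories with SMOTE (k = 5).
    from collections import Counter
    import numpy as np
    from imblearn.over_sampling import SMOTE

    rng = np.random.default_rng(0)
    X = rng.random((1000, 15))  # placeholder features
    y = np.array([1] * 700 + [4] * 150 + [2] * 120 + [0] * 20 + [3] * 10)

    X_res, y_res = SMOTE(k_neighbors=5, random_state=0).fit_resample(X, y)
    print(Counter(y))      # highly imbalanced
    print(Counter(y_res))  # every class oversampled to the majority count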
2.6 RQ3: Model Stability over Semesters
For a trained model to be useful, however, it must be stable across semesters, or else we suffer from a cold-start problem [3]. In order to assess the model's stability we ran a series of experiments in which we assessed the relative utility of the models by applying a leave-one-out validation strategy on a semester-by-semester basis. Showing that all of the models perform at a comparable level provides a strong indication that the models are consistent and useful, even early in the semester.
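The leave-one-semester-out loop can be sketched as follows; the per-semester feature matrices and labels here are random placeholders standing in for the real data.

    # Sketch: leave-one-semester-out validation over the six training semesters.
    import numpy as np
    import lightgbm as lgb

    rng = np.random.default_rng(0)
    semesters = ["F17", "S18", "F18", "S19", "F19", "S20"]
    features = {s: rng.random((200, 15)) for s in semesters}  # placeholders
    labels = {s: rng.integers(0, 5, 200) for s in semesters}

    for held_out in semesters:
        train = [s for s in semesters if s != held_out]
        X = np.vstack([features[s] for s in train])
        y = np.concatenate([labels[s] for s in train])
        model = lgb.train({"objective": "multiclass", "num_class": 5, "verbose": -1},
                          lgb.Dataset(X, label=y), num_boost_round=32)
        accuracy = (model.predict(features[held_out]).argmax(axis=1)
                    == labels[held_out]).mean()
        print(held_out, round(accuracy, 3))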
2.7 RQ4: Online Office Hour Analysis
In Fall 2020 all of the office hours were held online, which provided valuable data about online office hour interactions. We wanted to analyze whether students' behavior changed with the move to online office hours, and whether we should keep some online office hour sessions once we resume in-person instruction.

We first analyzed whether the online sessions attracted more students to seek help during office hours. For an in-person session, students need to physically find the teaching staff in the office and physically stay in line; for an online session, students only need to click a link to join the meeting with the teaching staff. With online office hours the friction of physically going to a campus location is removed. However, online office hours add overhead in creating a connection between the parties and in transitioning between students. To better understand online office hour help-seeking, we calculated the average number of tickets per student and the percentage of students who used office hours in each semester, and compared the earlier semesters with in-person office hours to the Fall 2020 semester with online office hours.
Online office hours could also have an impact on the efficiency of communication. When teaching staff and students meet online it is harder for the teaching staff to point out problems to the students and to explain why their code is failing. Screen sharing allows the staff to view the students' work, but physical interactions, like pointing to a portion of the screen to indicate which button to click, are lost; the teaching staff member must verbally describe the debugging process and ask the students to follow it. We therefore calculated the interaction time and the wait time for each ticket and compared the distributions for F20 tickets with those of the remaining tickets. Additional overhead is also incurred when connecting to a meeting: there is a lag after a student joins while their audio is set up before the conversation can start.
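Both measures can be derived directly from the Table 2 timestamps. The sketch below assumes the records are loaded into a pandas DataFrame whose columns follow the Table 2 attribute names (with underscores substituted for spaces, which is our assumption); the single row shown is the Table 2 example.

    # Sketch: compute wait time and interaction time from the ticket timestamps.
    import pandas as pd

    tickets = pd.DataFrame({
        "time_raised_hand":       ["2019-03-08 19:34:09"],
        "time_interaction_began": ["2019-03-08 19:38:39"],
        "time_interaction_ended": ["2019-03-08 20:01:02"],
    }).apply(pd.to_datetime)

    tickets["wait_time"] = tickets["time_interaction_began"] - tickets["time_raised_hand"]
    tickets["interaction_time"] = tickets["time_interaction_ended"] - tickets["time_interaction_began"]
    print(tickets[["wait_time", "interaction_time"]])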
3. RESULTS
3.1 RQ1: Categorization Results
Table 7 shows the distribution of question categories across our dataset. The most common category is Insufficient, which covers over 69 percent of the questions, with the Test category coming next at roughly 15 percent. Approximately 12 percent of the questions belong to the Sufficient category, while 3 percent were rated as Useless. Surprisingly, despite comments from the teaching staff, the least common category was Copied Error, with at most 10-15 questions per semester falling into this group. As these results show, the students tended to use the system primarily as a way of getting in line, and typically provided little useful information for the teaching staff. These results also highlight the significance of the testing tasks for the assignments and for students' help-seeking, given the high proportion of help tickets that are triggered by them.

Table 7: Distribution of labeled questions over the five categories in all seven semesters

  Useless   Insufficient   Sufficient   Copied Error   Test
  3.01%     69.02%         12.04%       0.10%          15.83%

To assess the stability of these results we also examined the frequencies within each semester. Figure 3 shows this breakdown. We found that the relative distribution is generally similar across semesters, while the absolute percentages vary. In more recent semesters the students have authored more Sufficient tickets than in prior years, suggesting that there has been greater effort by the instructional staff to encourage good communication. Yet the persistence of the other ticket types suggests that automatic classification and triage remain important features.

Figure 3: Frequency of each category for each semester.

Table 8 shows the average and standard deviation of interaction time (the difference between when the interaction began and when it was closed) for each category across the semesters. For this calculation we did not consider tickets with an interaction time of less than 10 seconds or longer than one hour. Our discussions with the teaching staff and the instructors showed that the former were cases that were never seen, where the student set a placeholder but fixed their problem before their turn came up or changed their mind, while the latter represent cases where the teaching staff offered help but did not close the ticket, often until well after the tutoring session was over. The Useless, Insufficient, and Sufficient categories averaged around twenty minutes in length with no meaningful difference in their interaction times. The Copied Error category was slightly shorter on average, which may reflect the specificity of the students' problems, while the Test category had a slightly longer average interaction time. This may indicate that this kind of question is more complex or more substantive relative to the others. Overall these results indicate that the amount of information provided does not necessarily affect the speed with which the issue can be addressed.

Table 8: Average interaction time (in minutes) and standard deviation for each category

         Useless   Insufficient   Sufficient   Copied Error   Test
  AVG    19.7      21.9           18.3         11.5           24.8
  STD    125.3     237.0          103.6        67.5           208.2

3.2 RQ2: Modeling Results
To evaluate the performance of our model we trained it using the first six semesters of data and tested it on the Fall 2020 data. The SMOTE method was applied to the training dataset to oversample the minority categories, resulting in each category having the same number (5037) of questions in the training set. The model achieved an overall accuracy of 91.8%. We then conducted a more detailed analysis of the precision, recall, and F-score for each question type; the results are shown in Table 9. As the results show, the model is relatively balanced across the categories, with the exception of the Test category, which had substantially higher precision and lower recall. This indicates that it was far more likely for questions from other categories to be erroneously classified as Test when the students submitted other information. One comment that triggered this error was "Jenkins errors".
Table 9: Precision, recall, and F-measure for each category

             Useless   Insufficient   Sufficient   Test    AVG
  Precision  0.902     0.915          0.913        0.967   0.922
  Recall     0.901     0.903          0.905        0.877   0.901
  F-measure  0.896     0.894          0.901        0.914   0.901
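Per-category precision, recall, and F-measure can be computed with scikit-learn; the predictions in the sketch below are invented for illustration.

    # Sketch: per-category precision, recall, and F-measure on the test set.
    from sklearn.metrics import classification_report

    y_true = ["Test", "Insufficient", "Useless", "Sufficient", "Insufficient"]  # illustrative
    y_pred = ["Test", "Insufficient", "Useless", "Test", "Insufficient"]

    print(classification_report(y_true, y_pred, zero_division=0))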
3.3 RQ3: Leave-One-Out Results
Having shown that the model is relatively balanced across categories, we then analyzed the relative accuracy of the model on each semester. The results are shown in Table 10. With the exception of Fall 2019, the model achieved an accuracy of at least 0.91 for each held-out semester. Fall 2019 was the largest and busiest semester in our training dataset, which may indicate that students were more diverse in their habits or posting behavior, but even there we achieved an accuracy of 0.899. In light of these results we believe that our modeling method is sufficiently stable to assist in processing unseen semesters without a cold-start problem.

Table 10: Leave-one-out accuracy results

  left-out semester   F17     S18     F18     S19     F19     S20
  accuracy            0.913   0.915   0.908   0.907   0.899   0.912
3.4 RQ4: Online Office Hour Analysis Results
Table 11 shows the summary measures associated with office hour interactions for each of the semesters studied.
Table 11: Summary statistics for each semester

                                               F17     S18     F18     S19     F19     S20     F20
  total tickets                                1146    609     1224    860     1650    1401    3452
  total students                               208     157     259     174     256     191     303
  students using office hours                  104     63      141     84      158     108     209
  average tickets per student                  5.51    3.88    4.73    4.94    6.44    7.34    11.40
  percentage of students using office hours    50.0%   40.1%   54.4%   48.3%   61.7%   56.5%   69.0%

The Fall 2020 semester had the highest average number of tickets per student and the largest percentage of students utilizing office hours. Additionally, the average number of tickets per student in Fall 2020 was nearly twice that of Fall 2019. This suggests that online office hours support increased student participation in office hour interactions. However, this increase could be explained by other factors. The data shows that, in general, more students take advantage of the help each year. This is consistent with the increasing class sizes, but it is important to note that there are other patterns as well, such as regular dips in each spring; overall this serves to highlight the need for better course management. Secondly, since the course lectures were also held online in F20, the increase in office hour usage supports a general expectation that students face additional challenges with online classes. We still believe, however, that online office hours are a practical means to minimize the cost of help-seeking and thus encourage more students.

The distribution of interaction times shown in Figure 4 indicates that the teaching staff generally took slightly longer to support students in Fall 2020 than in other semesters: the median interaction time in Fall 2020 was 8.78 minutes, while in the other semesters it was 8.17 minutes. We believe that this was caused by the inconvenience of remote instruction and debugging. Additionally, the high percentage of interactions shorter than one minute, in all semesters, is usually caused by teaching staff forgetting to open the ticket when the interaction begins. In Fall 2020 the percentage of such short tickets was much higher, suggesting that the teaching staff were more likely to make this mistake because they were working with both MDH and the online meeting tool. After notifying a student, the teaching staff had to wait in Zoom until
the student actually joined the meeting to start the interaction, and it was very common for the teaching staff to simply start the interaction in Zoom without opening the ticket in MDH.

Figure 4: Histograms of interaction time (the difference between open time and close time) for tickets in (a) the regular semesters and (b) Fall 2020.

The wait time in Fall 2020 was also much longer than in the regular semesters, as shown in Figure 5. While we had 37 hours of regular office hour time each week, covered by 17 TAs and 2 instructors, the increased demand for office hour help-seeking did have an impact on student wait time and throughput. We additionally observed an increase in the number of canceled tickets. While online office hours lower the barriers to student attendance, additional resources are needed to meet the demand and ensure timely support. As we transition back to in-person instruction, the course instructor will continue to offer some online office hours to support access to help-seeking.

Figure 5: Histograms of wait time (the difference between open time and raised time) for tickets in (a) the regular semesters and (b) Fall 2020.

4. CONCLUSIONS
The results of our analysis for RQ1 show that students' use of the MDH platform varies substantially from the developers' intentions. Far from using it to write complex help requests, many use it merely to reserve a place in line. Across the five categories, our labeling showed that most of the help tickets submitted lack sufficient detail to make clear what the student is asking about, while those that do provide detail are most commonly focused on test cases, which constitute a major feature of the class. Contrary to our initial expectations, the students rarely use the system to enter specific error messages, even when they have them. Thus the teaching staff have relatively little to go on when triaging questions. The proportion of useful information in the tickets clearly increased in the more recent years of the course, but insufficient detail remains the most common feature. Despite this, however, our results also show that there are clear categories of use that we can build upon to assist teaching staff, and that it may be possible to extend the system with minimal automated interventions, such as detectors for word counts or grammar, that can be used to scaffold, or simply enforce, good posting behavior.

Informed by this analysis, we were able to address our modeling questions, RQ2 and RQ3, by developing accurate and robust classification models that achieved an overall accuracy of 91.8% and per-semester leave-one-out accuracies from 0.899 to 0.915. Moreover, the results are robust on a per-category basis. While these results are not perfect, they show that we have the potential to use models of this type for effective triage of student questions, as well as to provide scaffolding and immediate guidance for students as they author help tickets. While such guidance has not yet been evaluated for its educational impact, prior work on self-explanation (e.g. [23]) leads us to conclude that it may help students to diagnose their own challenges.

For RQ4, the comparison between a semester with online sessions and the regular semesters shows both the strengths and weaknesses of online office hours. The advantage of hosting office hours online is that it can encourage students to utilize help-seeking resources; however, the large volume of help-seeking requests can overload the teaching staff, and remote debugging through screen sharing is clearly less efficient than face-to-face interaction.

5. LIMITATIONS
There are several limitations to our work that must be acknowledged. While our results span semesters, they are still taken from a single course with a single instructor. As a consequence our results are necessarily dependent on the training that the students have received, and it is not yet clear whether this stability will be apparent in models created from interaction data for other courses, particularly those that are not as large, do not use the same assignment structure, or do not rely so heavily on tests.

Additionally, in our analysis of online office hours we did not consider that taking lectures online could itself raise more challenges for students and thus increase the usage of office hours. Our conclusion that online office hours encourage students to seek help is based on the assumption that there was no significant difference in academic difficulty between F20 and the other semesters.

6. FUTURE WORK
This research can support future instructors in course management and automatic categorization for the MDH system. We therefore plan to address these limitations, expand our dataset, and build upon the models that we have obtained. First, we plan to conduct a more robust process of tagging and classifying our tickets, with the goal of assessing the stability of our categories with other evaluators and of identifying other important ways of grouping the tickets themselves.

Second, we will extend My Digital Hand to take advantage of these trained models in supporting both students and instructors. We will support instructors by providing automatic triage approaches that can help to guide their planning, and we will use automated guidance to prompt students to produce better tickets in the first place.

Third, we also plan to investigate other aspects of the office hours that are captured in the MDH data. These include: whether students in the same office hours post similar tickets, highlighting the potential of peer feedback; and the presence or absence of serial ticketers, that is, students who keep multiple follow-up tickets going and monopolize support. We plan to build models for these features with the goal of understanding how help time is being used and, by extension, how to better coordinate limited support.

Finally, we plan to apply our models to provide automated scaffolding for students when they provide insufficient comments or errors. Specifically, we will integrate the model into the MDH system, and every time a student raises a hand we will use it to predict the question category. If the description is insufficient or useless, we can immediately prompt the student to revise it. This will help students to better frame their questions, and it will help the teaching staff to be better prepared to answer them. This initial filter can be followed by additional models that suggest debugging steps or common answers based upon the revised question.

7. ACKNOWLEDGEMENTS
This research was supported by NSF #1821475 "Concert: Coordinating Educational Interactions for Student Engagement", Collin F. Lynch, Tiffany Barnes, and Sarah Heckman (Co-PIs).