Day 8a. Free text marking of short answer questions. I’ve already talked about the automatic marking of short answers quite a lot on this blog, so today, the first day of 2013, I’d like to do two slightly different things: (1) set our work in this area in context; (2) give an example of the actual PMatch answer matching that we use for one question. Actually, we’ve been out walking and I wanted to get my other blog written up, so I will only do (1) today, letting the ‘Twelve Days’ slip and ending on 6th January (which some people consider to be the 12th Day of Christmas in any case) rather than 5th. A clever solution, and another reminder that assessment questions need to be unambiguous (not like ‘What date is the 12th Day of Christmas?’) and to mark all correct answers as correct and all incorrect answers as incorrect – that is very important for short-answer free text questions too.
Just to recap our work: we started off using Intelligent Assessment Technologies’ ‘FreeText Author’ to mark our short answer questions. The marking accuracy was good, always at least as good as that of human markers. We initially allowed students to give answers of any length. Most answers given were of an appropriate length, but a few were very, very long. The software coped remarkably well with this, but the longer an answer is, the more likely it is to contain a correct answer within an incorrect one (or perhaps an incorrect answer within a correct one?). This could be (though in general I don’t think it is) because a student is trying to ‘play the system’, or it could be that they are genuinely confused. Answers that are both right and wrong are problematic for automatic marking of any sort – this is the one area where human markers do better. So we started only accepting answers of up to 20 words in length.
IAT’s FreeText Author uses the Natural Language Processing technique of ‘information extraction’. It forms ‘templates’ for each model answer (right and wrong), and student responses are matched against these. We chose FreeText Author partly because it has an authoring tool that can be used by someone (like me) with no expertise in either computer programming or NLP.
Then came the surprising discovery that we could obtain equally accurate marking using a system that doesn’t need NLP but rather relies on the recognition of keywords and their synonyms and also on the order of the words and their proximity. So we moved to this system – now known as ‘Pattern Match’ or PMatch and available as a Moodle question type. It continues to provide remarkably accurate answer matching.
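To give a feel for this style of matching, here is a minimal Python sketch of keyword matching with synonyms, word order and proximity. It is an invented illustration only – the function name, parameters and behaviour are assumptions made for this sketch, not the real PMatch syntax or implementation.

```python
import re

def matches(response, keywords, synonyms=None, max_gap=2, ordered=True):
    """Accept a response if every keyword (or one of its listed synonyms)
    appears, in the given order, with at most `max_gap` intervening words
    between consecutive matched keywords. (Illustrative only, not PMatch.)"""
    synonyms = synonyms or {}
    words = re.findall(r"[a-z']+", response.lower())
    positions = []
    start = 0
    for kw in keywords:
        accepted = {kw} | set(synonyms.get(kw, []))
        for i in range(start if ordered else 0, len(words)):
            if words[i] in accepted:
                positions.append(i)
                start = i + 1
                break
        else:
            return False  # keyword (and its synonyms) not found
    # proximity check between consecutive matched keywords
    return all(b - a - 1 <= max_gap for a, b in zip(positions, positions[1:]))

syns = {"balanced": ["equal", "equilibrium"]}
print(matches("the forces are balanced", ["forces", "balanced"], syns))       # True
print(matches("the forces are in equilibrium", ["forces", "balanced"], syns)) # True – synonym accepted
print(matches("balanced are the forces", ["forces", "balanced"]))             # False – wrong word order
```

Adding an unanticipated synonym is then a one-line change to the synonym table, which is in the spirit of the iterative development described below.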
Two important points in passing: firstly, I am absolutely sure (and have evidence of this) that the success of our answer matching, whether by FreeText Author or PMatch, is a result of the fact that it has been developed iteratively, using responses from real students. I believe that the difficulty of collecting and marking student responses, and of developing the answer matching after a question’s first use, is why so few people use questions of this type. They are hard work and require a fair number of student responses.
My second point is that, where we have had responses that don’t match, this is generally not for some sophisticated reason, but rather because students have used a synonym which we had not thought of. It is also important to deal sensitively with incorrectly spelt responses.
There is a useful literature review of technologies for marking short-answer free text questions. See
Valenti, S., Neri, F. & Cucchiarelli, A. (2003). An overview of current research on automated essay grading. Journal of Information Technology Education, 2, 319-330. [Note that ‘FreeText Author’ is described as ‘Automark’ in the review.]
The review was written before our work on PMatch, and the other technologies it describes for the marking of short answers also rely on computational linguistics or similar. But however they are marked, in short answers we are marking the content of the answers (which doesn’t mean to say that they have to assess just knowledge-based learning outcomes). In contrast, many of the systems used to mark longer free text answers (e.g. E-rater, for essays) are actually marking writing style, not content. And paradoxically, when the content of longer answers is assessed, it is in some senses simpler to do than for shorter answers. In a short answer question, the answers ‘the forces are balanced’ and ‘the forces are not balanced’ need to be distinguished, but if an essay discusses balanced forces, that is probably good enough.
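The balanced/not-balanced pitfall can be shown in a few lines of Python. This is an invented sketch, not PMatch itself: a bare keyword check accepts the wrong answer, so the matcher must also test for excluded words such as ‘not’.

```python
import re

def naive_match(response, keywords):
    # Bare keyword check: ignores negation entirely.
    words = set(re.findall(r"[a-z]+", response.lower()))
    return all(kw in words for kw in keywords)

def match_excluding(response, keywords, excluded=("not", "unbalanced")):
    # Same keyword check, but reject if any excluded word is present.
    # (The excluded-word list here is an assumption for this sketch.)
    words = re.findall(r"[a-z]+", response.lower())
    return (all(kw in words for kw in keywords)
            and not any(x in words for x in excluded))

print(naive_match("the forces are not balanced", ["forces", "balanced"]))      # True – wrongly accepted
print(match_excluding("the forces are not balanced", ["forces", "balanced"]))  # False – correctly rejected
print(match_excluding("the forces are balanced", ["forces", "balanced"]))      # True
```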