At CAA 2012 I gave a paper with the title ‘Short-answer e-assessment questions: five years on’, in which I discussed OU work in this area. There was a lot of interest in what I said, especially concerning evaluation findings. However, I wanted to get a discussion going on the reasons why more people don’t use assessment items of this type, and this didn’t really happen. So I’m trying again here. (My CAA 2012 paper is at Open Research Online if you want more background information.)
What might be the reasons for the lack of widespread use of short-answer free-text questions? People have suggested that students might not like questions of this type or might find them difficult; I have evidence that implies that neither of these things is generally true. So is the marking sufficiently accurate? Yes, it is – the point that most people seemed to take away from my presentation is that computers mark at least as accurately as human markers. I’ve reported this before, and others have reported on the lack of accuracy in human marking of GCSE and A level papers, so it should no longer surprise us. But it does. Similarly, people remain surprised that PMatch’s relatively simple answer matching is as effective as more sophisticated answer matching. In summary, I’d want to be careful before using PMatch for high-stakes summative e-assessment questions (though, in similar fashion, I will never again be able to have great confidence in outcomes determined by human markers), but for low-stakes use it is fine…
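To give a flavour of what ‘relatively simple’ answer matching means here, this is a rough Python sketch of the idea – my own illustration, not PMatch’s actual rule syntax, and the concepts and synonym lists are invented: a response is accepted if it contains at least one synonym for each required concept.

```python
import re

# A hypothetical rule (illustration only, not PMatch syntax): each
# required concept is paired with the synonyms I'm prepared to accept.
RULE = {
    "decrease": {"decrease", "decreases", "reduce", "reduces",
                 "fall", "falls", "drop", "drops"},
    "temperature": {"temperature", "temp"},
}

def matches(response, rule):
    """Accept a response if it contains at least one synonym for
    every required concept."""
    words = set(re.findall(r"[a-z]+", response.lower()))
    return all(words & synonyms for synonyms in rule.values())

print(matches("The temperature drops", RULE))  # True
print(matches("It gets colder", RULE))         # False: 'colder' is a
                                               # synonym nobody thought of
```

The second example is exactly the failure mode discussed below: the matching is only as good as the synonym lists behind it.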
Provided, that is, that answer matching is developed iteratively using marked responses from real students. Most of the problems I have encountered have not been to do with particularly sophisticated or borderline responses, but rather with synonyms that I simply hadn’t thought of (see the sketch after this list for what that iterative checking might look like). The number of responses needed to develop sufficiently accurate answer matching varies from question to question, but it is usually at least 200. This means two things:
1. Developing the answer matching rules and amending them is a time-consuming activity. True. I was only able to do this work because I was fortunate to have a CETL-funded teaching fellowship. And, at the OU, with large student numbers (approx. 4000 per year for the module in question) and an ability to re-use questions from year to year, time spent in the early years of a module is recouped later. Machine learning offers the potential to take some of the drudgery out of developing the answer matching rules – but an ‘expert’ human marker has to mark the responses in the first place (or you could get several markers to mark the responses, and then consider situations in which the human markers disagree).
2. If you simply don’t have the student numbers to gather a large number of responses, you’ve got a problem. True. We chose to gather responses online (because we were not sure that responses to questions asked on paper would be sufficiently similar), but I think responses gathered on paper would probably be similar enough to develop answer matching, and other people have done this. It would also be OK to offer a question as an ‘add-on’ for a small number of students, and to gather responses until you felt confident enough to ‘use it in anger’. However, in my opinion, the ultimate limitation to the use of short-answer free-text questions is not having sufficient marked responses.
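The iterative development loop itself is mundane: run the candidate rules over a batch of human-marked responses, count the disagreements, add the synonyms you missed, and repeat until the accuracy is acceptable. A minimal sketch of that loop, reusing the `matches`/`RULE` illustration above (the marked responses here are invented):

```python
# Invented data: each item pairs a student response with its human mark.
marked_responses = [
    ("the temperature drops", True),
    ("it gets colder", True),      # human accepts; the rule misses 'colder'
    ("the pressure rises", False),
]

def evaluate(matcher, marked):
    """Compare the matcher's verdicts with the human marks and return
    the accuracy plus the disagreements that need attention."""
    disagreements = [(resp, human) for resp, human in marked
                     if matcher(resp) != human]
    return 1 - len(disagreements) / len(marked), disagreements

accuracy, disagreements = evaluate(lambda r: matches(r, RULE),
                                   marked_responses)
print(f"accuracy: {accuracy:.0%}")  # 67% on this toy data
for resp, human in disagreements:
    print(f"revisit rules: human marked {human} for {resp!r}")
```

With a couple of hundred real responses rather than three invented ones, the disagreement list is exactly where the unanticipated synonyms show up.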
This leads to a research question that I don’t think anyone has addressed, relevant to the CAA Conference’s theme of sharing questions: do students at different universities give sufficiently similar responses to allow questions that we have developed in one organisation, on the basis of thousands of responses from its students, to be used elsewhere?