The multiple limitations of assessment criteria

Sadly, I don’t get as much time as I used to in which to think about assessment. So last Wednesday was a particular joy. First thing in the morning I participated in a fantastic webinar that marked the start of a brand new collaboration between two initiatives that are close to my heart – Transforming Assessment (who run a webinar series that I have been following for a long time) and Assessment in Higher Education (whose International Conferences I have helped to organise for 4 years or so). Then I spent most of the afternoon in a workshop discussing peer review. The workshop was good too, and I will post about it when time permits. For now, I’d like to talk about that webinar.


The speaker was Sue Bloxham, Emeritus Professor at the University of Cumbria and the founding Chair of the Assessment in Higher Education Conference. It was thus entirely fitting that Sue gave this webinar and, despite never having used the technology before, she did a brilliant job – lots of good ideas but also lots of discussion. Well done Sue!


Assessment criteria are designed to make the processes and judgement of assessment more transparent to staff and students and to reduce the arbitrariness of staff decisions. The aim of the webinar was to draw on research to explore the use of assessment criteria by experienced markers and discuss the implications for fairness, standards and guidance to students.

Sue talked about the evidence of poor reliability and consistency of standards amongst those assessing complex performance at higher education level, and suggested some reasons for this, including different understanding, different interpretation of criteria, ‘marking habits’ and ignoring or choosing not to use criteria.

Sue then described a study, joint with colleagues from the ASKe Pedagogical research centre at Oxford Brookes University, which had sought to investigate the consistency of standards between examiners within and between disciplines. 24 experienced examiners from 4 disciplines and 20 diverse UK universities were employed, and each considered 5 borderline (2i/2ii or B/C) examples of typical assignments for the discipline.

The headline finding was that overall agreement on a mark by assessors appears to mask considerable variability in individual criteria. The difference in the historians’ appraisal of individual constructs was further investigated and five potential reasons were identified that link judgement about specific elements of assignments to potential variation in grading:

  • Assessors use different criteria from those published
  • Assessors have different understandings of shared criteria
  • Assessors have a different sense of appropriate standards for each criterion
  • The constructs/criteria are complex in themselves, even comprising various sub-criteria which are hidden from view
  • Assessors value and weight criteria differently in their judgements
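To make the headline finding concrete, here is a minimal sketch of how identical overall marks can hide disagreement on individual criteria. Everything in it is an assumption for illustration: the marks are invented, the criterion names are hypothetical, and the overall mark is taken as a simple unweighted mean, which is not necessarily how the study or any real marking scheme works.

```python
from statistics import pstdev

# Hypothetical marks (0-100), invented purely for illustration: three
# assessors mark the same assignment against three criteria.
marks = {
    "assessor_A": {"argument": 70, "evidence": 50, "structure": 60},
    "assessor_B": {"argument": 50, "evidence": 70, "structure": 60},
    "assessor_C": {"argument": 60, "evidence": 60, "structure": 60},
}


def overall_mark(criterion_marks):
    """Unweighted mean of one assessor's criterion marks (an assumption)."""
    return sum(criterion_marks.values()) / len(criterion_marks)


def spread_by_criterion(all_marks):
    """Population standard deviation across assessors, for each criterion."""
    criteria = list(next(iter(all_marks.values())).keys())
    return {
        c: pstdev([m[c] for m in all_marks.values()])
        for c in criteria
    }


overall = {name: overall_mark(m) for name, m in marks.items()}
overall_spread = pstdev(list(overall.values()))
criterion_spread = spread_by_criterion(marks)

print("Overall marks:", overall)
print("Spread of overall marks:", overall_spread)
print("Spread per criterion:", criterion_spread)
```

In this made-up case all three assessors arrive at the same overall mark, yet two of the three criteria show a clear spread between assessors. Weighting the criteria differently (the fifth reason in the list) would be one way for that hidden disagreement to surface in the final grade.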

Sue led us into a discussion of the implications of all of this. Should we recognise the impossibility of giving a “right” mark for complex assessments? (For what it’s worth, my personal response to this question is “yes” – but we should still do everything in our power to be as consistent as possible.) Sue also discussed the possibility of ‘flipping’ the assessment cycle, with much more discussion pre-assessment and sharing the nature of professional judgement with students. Yes, yes, yes!

If I have a complaint about the webinar it is purely that some of the participants took a slightly holier-than-thou approach, assuming that the results from the study Sue described were the result of poor assessment tasks, insufficiently detailed criteria (Sue explained that she didn’t think more detailed criteria would help, and I agree) or examiners who were below par in some sense. Oh dear, oh dear, how I wanted to tell those people to carry out a study like this in their own context. Moderation helps, but those who assume a high level of consistency are only deluding themselves.

While we are on the subject of the subjective nature of assessment, don’t take my word for the high quality of this webinar, watch it yourself at

Posted in assessment criteria, human marking, marking accuracy

So what is assessment really about?

I’ve just returned home from Barcelona, where I was visiting the eLearn Center at the Universitat Oberta de Catalunya (UOC), the Open University of Catalonia. UOC has an “educational model” which is similar to that used at the UK Open University, though they are not “open” in the same sense (they have entry qualifications) and they are an entirely online university. Overall I was extremely impressed (and Barcelona was quite nice too…).

Partly as a result of my discussions in Barcelona and partly just as a result of taking a break from the usual routine, I have been reflecting on what we do in the name of assessment. It’s so easy to make assessment an “add on” at the end of a module (and if you have an exam, I guess it is exactly that). But even if you are then using that assessment as assessment of learning, are you really assessing what you hope that your students have learnt (i.e. your learning outcomes), or are you assessing something altogether different? And if assessment is assessment for learning, is it really driving learning in the way you hope?

At least some courses at UOC make good use of collaborative assessment and surely, in principle at least, a solution is to assess all of the actual activities that you expect your students to do, i.e. to put assessment at the centre. In an online environment it should be possible to assess the way in which students actually engage with the materials and collaborate with their peers. However, in my experience, practice is often somewhat different. At the very least, if you have an activity where students work together to produce an output of some sort, it makes sense to assess that output, not a different one, even if you then have to enter the murky world of assessing an individual contribution to a group project.

So where does that leave all my work on sophisticated online computer-marked assessment? I still think it can be very useful as a means of delivering instantaneous, unbiased and targeted feedback interventions and as a way of motivating students and helping them to pace their studies. But that’s about learning, not assessment… I need to think about this some more. Perhaps assessment is a bit like quantum mechanics: the more you think you understand it, the more problematic it becomes…


Posted in authentic assessment

Gender differences on force concept inventory

Hot on the heels of my last post, reporting on work which did not provide support for the previous finding that men and women perform differentially on different types of assessed tasks, I bring you a very interesting finding from work done at the University of Hull (with Ross Galloway from the University of Edinburgh, and me). David Sands from Hull and two of his students came to the OU on Thursday and gave a presentation which is to be repeated at the GIREP-EPEC conference in Poland next week.

We are seeking to investigate whether findings from use of the well-established force concept inventory (FCI) (Hestenes et al., 1992) are replicated when the questions are asked as free text rather than multiple choice questions. Free text versions of the questions have been trialled at Hull and Edinburgh, and the next step is to attempt to write automatically marked versions of these.

However, the interesting finding for now is that whilst in general students perform in a similar way on the free text and multiple choice versions of the FCI, there are some variations in the detail. In particular, whilst men outperform women in the MCQ version of the FCI (Bates et al., 2013), it seems that the gender difference may be reduced or even reversed with the free text version. We don’t have enough responses yet to be sure, but watch this space!

Bates, S., Donnelly, R., MacPhee, C., Sands, D., Birch, M., & Walet, N. R. (2013). Gender differences in conceptual understanding of Newtonian mechanics: a UK cross-institution comparison. European Journal of Physics, 34(2), 421-434.

Hestenes, D., Wells, M., & Swackhamer, G. (1992). Force concept inventory. The Physics Teacher, 30(3), 141-158.

Posted in force concept inventory, gender

More on the gender differences on our level 2 physics module

I’m returning to the topic raised here. To summarise, significantly fewer women than men study our level 2 (FHEQ L5) physics module S207 and, more worryingly, those who do are significantly less likely to complete it, and those who complete it are less likely to pass… It’s a depressing situation and we have been trying to find out more about what is going on. We don’t have definite answers yet, but we do have some pointers – and we are hoping that if we can begin to address the issues we will be able to improve the outcomes for all students on S207 (both men and women).

In my previous post I explained that women do less well on almost all interactive computer-marked assessment (iCMA) questions, but the amount by which they do less well varies from question to question. This does not appear to depend on question type.

Next, let’s consider the S207 exam. The exam has three parts with (a) multiple-choice questions; (b) short-answer questions; (c) longer questions. Students are expected to attempt all questions in part (a) and part (b), whilst in part (c) they should attempt three questions from a choice of 7 (one on each of the main books in the module).

Let’s start by considering performance on each of the three parts of the exam (all the data are for the 2013-14 presentation). The average scores for men and women on each of the three parts are shown in the figure below (blue = men; pink = women, with my apologies for any offence caused by my sexist stereotyping on colour, but I’m sticking with it because it is obvious).









So, women do less well on multiple-choice questions, as you would have been expecting if you’ve read the literature…but they also do less well on short-answer and long-answer questions (though do note the fact that the error bars overlap)…Hmmm.

Things get much more interesting if we consider how many men and women choose to answer each of the longer questions in part (c):









So relatively fewer women are choosing to answer the first two questions; relatively more are choosing to answer the others. And how well do they do on each question? See below:









So, not all questions are equal. Men and women appear to prefer different questions and to perform differently on different questions. And we have also seen that we are more likely to lose students when they are studying the materials that are assessed in the first two exam questions. So it looks as if women do even worse on some parts of the module than others. What we don’t know yet is whether this is a result of the topic in question (in this case Newtonian mechanics), or whether they are less good at problem solving, less familiar with the abstract types of questions that we are asking, or put off by less structured long questions (cf. questions where they are given more tips). We also need to do more work to test some hypotheses that might explain some of these factors, e.g. that whilst women are as likely as men to have A levels, they may be less likely to have A level physics or maths. Our investigation is far from over.

Posted in gender

Reflections on AHEC 2: Assessment transparency

I should start by saying that Tim Hunt’s summary of last week’s Assessment in Higher Education Conference is excellent, so I don’t know why I’m bothering! Seriously, we went to some different sessions; in particular Tim went to many more sessions on feedback than I did, so do take a look at his blog posting.

Moving on to “Assessment transparency”. I’m picking up here on one of the themes that Tim also alludes to: the extent to which our students do, or don’t, understand what is required of them in assessed tasks. The fact that students don’t understand what we expect them to do is one of the findings I reported on in my presentation “Formative thresholded evaluation: Reflections on the evaluation of a faculty-wide change in assessment practice” which is on Slideshare here. Similar issues were raised in the presentation I attended immediately beforehand (by Anke Buttner and entitled “Charting the assessment landscape: Preliminary evaluations of an assessment map”). This is not complicated stuff we’re talking about – not anything as sophisticated as having a shared understanding of the purpose of assessment (though that would be nice!).

It might seem obvious that we want students to know what they have to do in assessment tasks, but there is actually a paradox in all of this. To quote Tim Hunt’s description of a point in Jo-Anne Baird’s final keynote: “if assessment is too transparent it encourages pathological teaching to the test. This is probably where most school assessment is right now, and it is exacerbated by the excessive ways school exams are made high stakes, for the student, the teacher and the school. Too much transparency (and risk averseness) in setting assessment can lead to exams that are too predictable, hence students can get a good mark by studying just those things that are likely to be on the exam. This damages validity, and more importantly damages education.” Suddenly things don’t seem quite so straightforward.

Posted in conferences

Reflections on AHEC 1: remembering that students are individuals

I’ve been at the 5th Assessment in Higher Education Conference, now truly international, and a superb conference. As in 2013, the conference was at Maple House in Birmingham. With 200 delegates we filled the venue completely, but it was a deliberate decision to use the same venue and to keep the conference relatively small. As the conference goes from strength to strength we will need to review that decision for 2017, but a small and friendly conference has a lot to commend it. We had some ‘big names’, with Masterclasses from David Boud, David Carless, Tansy Jessop and Margaret Price, and keynotes from Maddalena Taras and Jo-Anne Baird. There were also practitioners from a variety of backgrounds and with varying knowledge of assessment literature.

For various reasons I attended some talks that only attracted small audiences, but I learnt a lot from these. One talk that had a lot of resonance with my own experience was Robert Prince’s presentation on “Placement for Access and a fair chance of success in South African Higher Education institutions”. Robert talked about the different educational success of students of different ethnicity, both at school and at South African HE institutions. The differences are really shocking. They are seeking to address the situation at school level, but Robert rightly recognises that universities also need to be able to respond appropriately to students from different backgrounds, perhaps allowing the qualification to be completed over a longer period of time.

Robert went on to talk about the ‘National Benchmark Tests (NBT)’ Project, which has produced tests of academic literacy, quantitative literacy and mathematics. The really scary, though sadly predictable, finding is that the National Benchmark Tests are extremely good at predicting outcome. But the hope is that the tests can be used to direct students to extended or flexible programmes of study.

In my mind, Robert’s talk sits alongside Gwyneth Hughes’s talk on ipsative assessment i.e. assessing the progress of an individual student (looking for ‘value added’). Gwyneth talked about ways in which ipsative assessment (with a grade for progress) might be combined with conventional summative assessment, but that for me is the problem area. If we are assessing someone’s progress and they have just not progressed far enough I’m not convinced it is helpful to use the same assessment as for students who are flying high.

But the important thing is that we are looking at the needs of individual students rather than teaching and assessing a phantom average student.

Posted in conferences

Performance on interactive computer-marked questions – gender differences

We have become aware of a significant difference in outcome for male and female students on our level 2 physics module; around 25% of the students on the module are women, and they are both less likely to complete the module and less likely to pass if they survive to the end. This effect is not present in our level 1 Science Module or for other scientific disciplines apart from astronomy at Level 2; and the women who get through to Level 3 do better than men.

Many theories have been proposed as to the reason for the effect, which may be related to persistent gender differences in performance on the force concept inventory – see for example Bates et al. (2013). I proposed that the effect might have been related to the assessment we use; there is evidence (e.g. Gipps & Murphy, 1994; Hazel et al., 1997) that girls are less happy with multiple-choice questions.

One of the things we have done is to look at performance differences on each interactive computer-marked assignment question. The results are summarised in the figure below.






Points to note are as follows:

  • Women score less well than men on most questions, but the effect is no greater than for tutor-marked assignment questions.
  • The gender difference is much greater for some questions than others, but the questions with large differences are not all of one type. So multiple-choice is not to blame! It appears more likely that the issue is with what the questions are assessing; there is some indication that our female students are less good at complex, abstract, problem-solving type questions.
  • The gender difference is much less for the final iCMA. My hypothesis is that the usual reasons for women doing less well are counter-balanced by the fact that women are more persistent; they are more likely to attempt this iCMA whilst men are more likely to reckon they have reached the required threshold on 5 out of 7 iCMAs, and so not to bother.
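The "5 out of 7" threshold rule behind that last hypothesis can be sketched in a few lines. This is only an illustration under stated assumptions: the threshold value of 30, the function names and the student record are all hypothetical, and the real module rules may differ in detail.

```python
# Assumed rule, paraphrasing the post: a student meets the requirement by
# reaching a threshold score on at least 5 of the 7 iCMAs. The threshold
# value itself (30 here) is hypothetical, not taken from the module.
REQUIRED_ICMAS = 5
THRESHOLD = 30


def icmas_passed(scores):
    """Count attempted iCMAs (None = not attempted) at or above THRESHOLD."""
    return sum(1 for s in scores if s is not None and s >= THRESHOLD)


def requirement_met(scores):
    """True once enough iCMAs have reached the threshold."""
    return icmas_passed(scores) >= REQUIRED_ICMAS


# Invented record: five iCMAs above threshold, one below, final one skipped.
scores_before_final = [55, 62, 48, 25, 71, 66, None]

print("iCMAs at or above threshold:", icmas_passed(scores_before_final))
print("Requirement already met:", requirement_met(scores_before_final))
```

A student in this position gains nothing towards the threshold by attempting the final iCMA, so the group that does attempt it is self-selected. That is one reason the smaller gender difference on the final iCMA needs careful interpretation.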

More work is required before I can be confident of this analysis; it is an interesting and extremely important investigation.


Bates, S., Donnelly, R., MacPhee, C., Sands, D., Birch, M., & Walet, N. R. (2013). Gender differences in conceptual understanding of Newtonian mechanics: a UK cross-institution comparison. European Journal of Physics, 34(2), 421-434.

Gipps, C. V. & Murphy, P. (1994). A fair test? Assessment, achievement and equity. Buckingham: Open University Press.

Hazel, E., Logan, P., & Gallagher, P. (1997). Equitable assessment of students in physics: importance of gender and language background. International Journal of Science Education, 19(4), 381-392.



Posted in gender, multiple-choice questions

Pixelated assessment

I’m indebted to the colleague who told me about Cees van der Vleuten’s keynote at the EARLI Assessment SIG Conference in Madrid last August. I should perhaps point out that I am reporting third-hand, so I may have got it all wrong. All that I can claim is that I am reporting on my own reaction to what I think my colleague said about what she thinks Professor van der Vleuten said…

I understand that he was talking about the assessment of professional competence, which is very important. The point that really grabbed my attention, though, was that since we need professionals to be able to do their job day after day, in a reliable fashion, ‘one-off’ assessment at the end of a programme of study isn’t really appropriate. Of course, one-off assessment is always open to challenge – you will do less well if you have a headache on the day of the exam; you will do better if you happen to have revised the ‘right’ things. But there has been something of a backlash against continuous assessment recently, most obviously in the renewed emphasis placed on exams at the expense of coursework in UK schools (courtesy of governmental policy). Perhaps with more justification, some argue that you should assess outcomes at the end of a module rather than progress towards those outcomes, and I have argued (e.g. here) that summative continuous assessment can lead to confusion over its purpose (is it formative or summative; is it for learning or of learning?).

Professor van der Vleuten’s keynote suggested that we should use ‘little and often’ continuous assessment that is very low stakes, perhaps with the stakes increasing as the module progresses – so that a student’s overall assessment record builds up slowly, in the same way that pixels build up to make a picture. Pixelated assessment. Nice!


Posted in continuous assessment

Authentic assessment vs authentication of identity

This is a post on which I would particularly welcome comments. I am aware of the issues but distinctly lacking in solutions.

A couple of years ago I posted (here) about the fact that there is a range of skills (e.g. practical work) that are difficult to assess authentically by examination. So, in general terms, the answer is easy; we should assess these skills in better, more authentic ways. So we should be making less use of examinations…

But in our distance-learning environment, we have a problem. At some stage in a qualification, we really ought to check that the student we think we are assessing is actually the person doing the work. Examinations provide us with a mechanism for doing this; student identity can be checked in the good old-fashioned way (by photo ID etc.). In conventional environments, student identity can be verified for a range of other assessed tasks too, but that is much more difficult when we simply do not meet our students. At the Open University, exams are just about the only occasion when our students are required to be physically present in a particular place (and for students for whom this is not possible, the invigilator goes to them). So we should be making more use of examinations…

As in so many of the topics I post about in this blog, there is a tension. What’s the way forward?

Here are a few of my thoughts, and those from some colleagues. We could:

1. review what we do in “examinations” to make the assessed tasks more authentic;

2. make greater use of open book exams;

3. tell students the questions in advance, and allow notes into an examination hall;

4. is there a technical solution? If we truly crack the issue of secure home exams at scale, then the assessed tasks could perhaps be longer and more open ended, with a remote invigilator just looking in from time to time;

5. Are there any other technical solutions?

6. moving away from examinations in the conventional sense, our Masters programmes sometimes require students to turn up for an assessment ‘Poster Day’. We have had some success in replicating this in a synchronous online environment.

7. we could have an examinable component that requires a student to reflect on collaborative work in forums. The student’s tutor could then check that the student has posted what they say they have posted throughout the presentation of the module.

8. Option (6) is essentially a viva. We could extend this approach by requiring every student (or a certain percentage) to have a conversation with their tutor or a module team member (by phone or Skype etc.) about their progress through the module/qualification.

We would be extremely grateful for comments and other ideas.

Posted in authentic assessment, authentication, exams, identity

Effective feedback

“As long as we hold this image of feedback as being something that one person gives to another person to educate that person, we’ve missed the ultimate point of the feedback system…”

Sound familiar? How about

“Feedback as a concept (or the thing that happens when you talk into your microphone too close to the speaker) is simply information that goes into a system (and comes back at you with a high-pitched squeal). What happens next is where things get interesting – the postfeedback learning, which is the point of feedback in the first place.”

However, it may surprise you to hear that these quotes are not from a book on assessment, but rather from “Changing on the job: Developing leaders for a complex world” by Jennifer Garvey Berger. I’ve been on a leadership course at work for much of 2014 and I’ve been thinking a lot about the concepts, especially the challenging issue of academic leadership. Just how do you get the best out of clever people? The quotes highlight some extremely interesting similarities with what I have been banging on about for years, in this blog and elsewhere.

The first point of similarity is that it is not the feedback intervention itself that is significant but rather the way in which the person receiving the feedback intervention responds to it. And if the person receiving the feedback intervention is in charge of their own response, so much the better.

However, feedback, purely as information, still needs to happen. In the staff management situation, sadly sometimes people don’t appreciate when there are issues that need to be addressed. So there is a need for a very clear exchange of information. In the case of feedback on assessed tasks, this is one area where e-assessment has huge potential. Computers can give information in a non-judgemental and impersonal way, leaving the interpretation of this information to people.

Posted in feedback