Researching engagement with assessment, as a physicist

Posted on November 21st, 2015 at 1:35 pm by Sally Jordan

I have not posted as much as I might have wished recently, and when I have, I’ve tended to start with a grovelling apology on the grounds of lack of time because of my head of department duties. I sometimes also hesitate to post because of a lack of confidence: I’m not really an expert; what grounds do I have to be so opinionated? However, following my seminar in our own Department of Physical Sciences’ Seminar Series at the Open University on Thursday (for which the slides – I hope – are at SEJ DPS Seminar Nov 2015) I have decided that it is time to take a more robust attitude. OK, it’s unusual for a physicist, let alone the head of a Department of Physical Sciences, to be doing pedagogic research. But that’s what I am; that’s who I am. The point is that I am researching learning, but I am doing so as a numerate scientist. I’m going to stop apologising for the fact, and I might even stop moaning about the resultant difficulty that I sometimes have in getting papers published. I am not a social scientist; I’m a physicist.

So what does that mean? It means that I try to use scientific methodology; I listen to student opinion because it is important, but I also look for hard data. I don’t say that one thing causes another unless there is evidence that it does. Furthermore – and scientists sometimes fall down here too – I report my findings even when they don’t show what I was expecting. Well, that’s my aspiration. As frequently happens, I was slightly worried by some of the comments following my talk on Thursday – people said “ah yes, we have found such and such”. Have they REALLY found this, or is it what they think might be happening? Hypotheses are important, but they need testing. Even more worryingly, I’m writing a paper at the moment and it is very tempting to ignore findings that don’t support the story I want to tell. Please don’t let me do that. Please give me the courage to stand my ground and to report the truth, the whole truth and nothing but the truth.

I have just realised that I don’t seem to have posted about the talk that Tim Hunt and I gave at the Assessment in Higher Education Conference in the summer on “I wish I could believe you: the frustrating unreliability of some assessment research”. I will rectify that as soon as possible (…remember, I’m a head of department…) but in the meantime, our slides are on slideshare here.

The multiple limitations of assessment criteria

Posted on November 8th, 2015 at 6:31 pm by Sally Jordan

Sadly, I don’t get as much time as I used to in which to think about assessment. So last Wednesday was a particular joy. First thing in the morning I participated in a fantastic webinar that marked the start of a brand new collaboration between two initiatives that are close to my heart – Transforming Assessment (who run a webinar series that I have been following for a long time) and Assessment in Higher Education (whose International Conferences I have helped to organise for 4 years or so). Then I spent most of the afternoon in a workshop discussing peer review. The workshop was good too, and I will post about it when time permits. For now, I’d like to talk about that webinar.

The speaker was Sue Bloxham, Emeritus Professor at the University of Cumbria and the founding Chair of the Assessment in Higher Education Conference. It was thus entirely fitting that Sue gave this webinar and, despite never having used the technology before, she did a brilliant job – lots of good ideas but also lots of discussion. Well done Sue!

Assessment criteria are designed to make the processes and judgement of assessment more transparent to staff and students and to reduce the arbitrariness of staff decisions. The aim of the webinar was to draw on research to explore the use of assessment criteria by experienced markers and discuss the implications for fairness, standards and guidance to students.

Sue talked about the evidence of poor reliability and consistency of standards amongst those assessing complex performance at higher education level, and suggested some reasons for this, including different understandings and interpretations of the criteria, ‘marking habits’, and ignoring or choosing not to use the criteria.

Sue then described a study, carried out jointly with colleagues from the ASKe pedagogical research centre at Oxford Brookes University, which sought to investigate the consistency of standards between examiners within and between disciplines. The study involved 24 experienced examiners from 4 disciplines and 20 diverse UK universities, each of whom considered 5 borderline (2i/2ii or B/C) examples of typical assignments for their discipline.

The headline finding was that overall agreement on a mark by assessors appears to mask considerable variability in their judgements on individual criteria. The differences in the historians’ appraisals of individual constructs were investigated further, and five potential reasons were identified that link judgements about specific elements of assignments to variation in grading:

  • Using different criteria from those published
  • Assessors have different understanding of shared criteria
  • Assessors have a different sense of appropriate standards for each criterion
  • The constructs/criteria are complex in themselves, even comprising various sub-criteria which are hidden from view
  • Assessors value and weight criteria differently in their judgements (see the toy illustration below)
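
This last point, and the way in which agreement on an overall mark can mask criterion-level disagreement, is easy to reproduce with made-up numbers. The sketch below is purely illustrative – the criteria, scores and weightings are my own inventions, not data from Sue’s study – and shows two markers arriving at almost the same overall mark for the same piece of work while scoring and weighting the individual criteria quite differently.

```python
# Toy illustration (invented numbers, not data from the study described above):
# two markers give almost the same overall mark while disagreeing at the level
# of individual criteria, because they score and weight the criteria differently.

criteria = ["argument", "evidence", "structure", "style"]

# Hypothetical criterion scores (out of 100) from two experienced markers.
scores_a = {"argument": 68, "evidence": 55, "structure": 60, "style": 57}
scores_b = {"argument": 56, "evidence": 64, "structure": 58, "style": 68}

# Hypothetical weightings: marker A privileges argument, marker B weights evenly.
weights_a = {"argument": 0.4, "evidence": 0.3, "structure": 0.2, "style": 0.1}
weights_b = {c: 0.25 for c in criteria}

def overall(scores, weights):
    """Weighted overall mark from criterion-level scores."""
    return sum(scores[c] * weights[c] for c in criteria)

print(f"Marker A overall: {overall(scores_a, weights_a):.1f}")  # 61.4
print(f"Marker B overall: {overall(scores_b, weights_b):.1f}")  # 61.5

# The criterion-level disagreement hidden beneath that apparent agreement.
for c in criteria:
    print(f"{c:>9}: A={scores_a[c]}, B={scores_b[c]}, gap={scores_a[c] - scores_b[c]:+d}")
```

The agreement at the level of the overall mark is, in effect, an accident of compensating differences lower down – which is exactly why it can coexist with the five sources of variation listed above.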

Sue led us into a discussion of the implications of all of this. Should we recognise the impossibility of giving a “right” mark for complex assessments? (For what it’s worth, my personal response to this question is “yes” – but we should still do everything in our power to be as consistent as possible.) Sue also discussed the possibility of ‘flipping’ the assessment cycle, with much more discussion pre-assessment and sharing the nature of professional judgement with students. Yes, yes, yes!

If I have a complaint about the webinar it is purely that some of the participants took a slightly holier-than-thou approach, assuming that the results from the study Sue described were the result of poor assessment tasks or insufficiently detailed criteria (Sue explained that she didn’t think more detailed criteria would help, and I agree) or examiners who were below par in some sense. Oh dear, oh dear, how I wanted to tell those people to carry out a study like this in their own context. Moderation helps, but those who assume a high level of consistency are only deluding themselves.

While we are on the subject of the subjective nature of assessment, don’t take my word for the high quality of this webinar: watch it yourself at

So what is assessment really about?

Posted on October 4th, 2015 at 5:06 pm by Sally Jordan

I’ve just returned home from Barcelona, where I was visiting the eLearn Center at the Universitat Oberta de Catalunya (UOC), the Open University of Catalonia. UOC has an “educational model” which is similar to that used at the UK Open University, though they are not “open” in the same sense (they have entry qualifications) and they are an entirely online university. Overall I was extremely impressed (and Barcelona was quite nice too…).

Partly as a result of my discussions in Barcelona and partly just as a result of taking a break from the usual routine, I have been reflecting on what we do in the name of assessment. It’s so easy to make assessment an “add-on” at the end of a module (and if you have an exam, I guess it is exactly that). But even if you are then using that assessment as assessment of learning, are you really assessing what you hope that your students have learnt (i.e. your learning outcomes), or are you assessing something altogether different? And if assessment is assessment for learning, is it really driving learning in the way you hope?

At least some courses at UOC make good use of collaborative assessment and surely, in principle at least, a solution is to assess all of the actual activities that you expect your students to do, i.e. to put assessment at the centre. In an online environment it should be possible to assess the way in which students actually engage with the materials and collaborate with their peers. However, in my experience, practice is often somewhat different. At the very least, if you have an activity where students work together to produce an output of some sort, it makes sense to assess that output, not a different one, even if you then have to enter the murky world of assessing an individual contribution to a group project.

So where does that leave all my work on sophisticated online computer-marked assessment? I still think it can be very useful as a means of delivering instantaneous, unbiased and targeted feedback interventions and as a way of motivating students and helping them to pace their studies. But that’s about learning, not assessment… I need to think about this some more. Perhaps assessment is a bit like quantum mechanics: the more you think you understand it, the more problematic it becomes…


Gender differences on force concept inventory

Posted on July 4th, 2015 at 10:59 am by Sally Jordan

Hot on the heels of my last post, reporting on work which did not provide support for the previous finding that men and women perform differentially on different types of assessed tasks, I bring you a very interesting finding from work done at the University of Hull (with Ross Galloway from the University of Edinburgh, and me). David Sands from Hull and two of his students came to the OU on Thursday and gave a presentation which is to be repeated at the GIREP-EPEC conference in Poland next week.

We are seeking to investigate whether findings from use of the well-established force concept inventory (FCI) (Hestenes et al., 1992) are replicated when the questions are asked as free-text rather than multiple-choice questions. Free-text versions of the questions have been trialled at Hull and Edinburgh, and the next step is to attempt to write automatically marked versions of these.

However, the interesting finding for now is that whilst in general students perform in a similar way on the free-text and multiple-choice versions of the FCI, there are some variations in the detail. In particular, whilst men outperform women on the multiple-choice version of the FCI (Bates et al., 2013), it seems that the gender difference may be reduced or even reversed with the free-text version. We don’t have enough responses yet to be sure, but watch this space!

Bates, S., Donnelly, R., MacPhee, C., Sands, D., Birch, M., & Walet, N. R. (2013). Gender differences in conceptual understanding of Newtonian mechanics: a UK cross-institution comparison. European Journal of Physics, 34(2), 421-434.

Hestenes, D., Wells, M., & Swackhamer, G. (1992). Force concept inventory. The Physics Teacher, 30(3), 141-158.

More on the gender differences on our level 2 physics module

Posted on June 27th, 2015 at 2:28 pm by Sally Jordan

I’m returning to the topic raised here. To summarise, significantly fewer women than men study our level 2 (FHEQ L5) physics module S207 and, more worryingly, those who do are significantly less likely to complete it, and those who complete it are less likely to pass… It’s a depressing situation and we have been trying to find out more about what is going on. We don’t have definite answers yet, but we do have some pointers – and we are hoping that if we can begin to address the issues we will be able to improve the outcomes for all students on S207 (both men and women).

In my previous post I explained that women do less well on almost all interactive computer-marked assessment (iCMA) questions, but the amount by which they do less well varies from question to question. This does not appear to depend on question type.

Next, let’s consider the S207 exam. The exam has three parts, with (a) multiple-choice questions; (b) short-answer questions; (c) longer questions. Students are expected to attempt all questions in parts (a) and (b), whilst in part (c) they should attempt three questions from a choice of seven (one on each of the main books in the module).

Let’s start by considering performance on each of the three parts of the exam (all the data are for the 2013-14 presentation). The average scores for men and women on each of the three parts are shown in the figure below (blue = men; pink = women, with my apologies for any offence caused by my sexist stereotyping on colour, but I’m sticking with it because it is obvious).

[Figure: average exam scores for men (blue) and women (pink) on parts (a), (b) and (c), with error bars.]

So, women do less well on multiple-choice questions, as you would expect if you’ve read the literature… but they also do less well on short-answer and long-answer questions (though note that the error bars overlap)… Hmmm.
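
For anyone wanting to run this kind of comparison on their own module, a minimal sketch of the calculation behind a figure like the one above is given below. The file name, column names and gender coding are illustrative assumptions, not the real S207 records.

```python
import pandas as pd

# Hypothetical export of exam scores: one row per student per exam part,
# with columns gender ("M"/"F"), part ("a"/"b"/"c") and score (a percentage).
df = pd.read_csv("s207_exam_2013J.csv")

# Mean score, standard error of the mean and count, by part and gender.
summary = (
    df.groupby(["part", "gender"])["score"]
      .agg(mean="mean", sem="sem", n="count")
      .reset_index()
)
print(summary)

# Treat the error bars as mean +/- 1 standard error and flag the parts where
# the men's and women's intervals overlap. Overlap is a caution against
# over-reading the difference, not a formal significance test.
for part, grp in summary.groupby("part"):
    men = grp[grp["gender"] == "M"].iloc[0]
    women = grp[grp["gender"] == "F"].iloc[0]
    overlap = (men["mean"] - men["sem"] <= women["mean"] + women["sem"]
               and women["mean"] - women["sem"] <= men["mean"] + men["sem"])
    print(part, "error bars overlap" if overlap else "error bars separate")
```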

Things get much more interesting if we consider how many men and women choose to answer each of the longer questions in part (c):

[Figure: the number of men and women choosing to answer each of the seven part (c) questions.]

So relatively fewer women are choosing to answer the first two questions; relatively more are choosing to answer the others. And how well do they do on each question? See below:

[Figure: average scores for men and women on each of the seven part (c) questions.]

So, all questions are not equal. Men and women appear to prefer different questions and to perform differently on different questions. And we have also seen that we are more likely to lose students when they are studying the materials that are assessed in the first two exam questions. So it looks as if women do even worse on some parts of the module than on others. What we don’t know yet is whether this is a result of the topic in question (in this case Newtonian mechanics), or whether women are less good at problem solving, less familiar with the abstract types of question that we are asking, or put off by less structured long questions (cf. questions where they are given more tips). We also need to do more work to test some hypotheses that might explain some of these factors, e.g. that whilst women are as likely as men to have A levels, they may be less likely to have A level physics or maths. Our investigation is far from over.

Reflections on AHEC 2: Assessment transparency

Posted on June 27th, 2015 at 1:17 pm by Sally Jordan

I should start by saying that Tim Hunt’s summary of last week’s Assessment in Higher Education Conference is excellent, so I don’t know why I’m bothering! Seriously, we went to some different sessions; in particular, Tim went to many more sessions on feedback than I did, so do take a look at his blog posting.

Moving on to “Assessment transparency”. I’m picking up here on one of the themes that Tim also alludes to: the extent to which our students do, or don’t, understand what is required of them in assessed tasks. The fact that students don’t understand what we expect them to do is one of the findings I reported on in my presentation “Formative thresholded evaluation: Reflections on the evaluation of a faculty-wide change in assessment practice”, which is on Slideshare here. Similar issues were raised in the presentation I attended immediately beforehand (by Anke Buttner and entitled “Charting the assessment landscape: Preliminary evaluations of an assessment map”). This is not complicated stuff we’re talking about – not anything as sophisticated as having a shared understanding of the purpose of assessment (though that would be nice!).

It might seem obvious that we want students to know what they have to do in assessment tasks, but there is actually a paradox in all of this. To quote Tim Hunt’s description of a point in Jo-Anne Baird’s final keynote: “if assessment is too transparent it encourages pathological teaching to the test. This is probably where most school assessment is right now, and it is exacerbated by the excessive ways school exams are made high stakes, for the student, the teacher and the school. Too much transparency (and risk averseness) in setting assessment can lead to exams that are too predictable, hence students can get a good mark by studying just those things that are likely to be on the exam. This damages validity, and more importantly damages education.” Suddenly things don’t seem quite so straightforward.

Reflections on AHEC 1: remembering that students are individuals

Posted on June 25th, 2015 at 9:05 pm by Sally Jordan

I’ve been at the 5th Assessment in Higher Education Conference, now truly international, and a superb conference. As in 2013, the conference was at Maple House in Birmingham.  With 200 delegates we filled the venue completely, but it was a deliberate decision to use the same venue and to keep the conference relatively small. As the conference goes from strength to strength we will need to review that decision again for 2017, but a small and friendly conference has a lot to commend it. We had some ‘big names’, with Masterclasses from David Boud, David Carless, Tansy Jessop and Margaret Price, and keynotes from Maddalena Taras and Jo-Anne Baird. There were also practitioners from a variety of backgrounds and with varying knowledge of assessment literature.

For various reasons I attended some talks that only attracted small audiences, but I learnt a lot from these. One talk that had a lot of resonance with my own experience was Robert Prince’s presentation on “Placement for Access and a fair chance of success in South African Higher Education institutions”. Robert talked about the differing educational success of students of different ethnicities, both at school and at South African HE institutions. The differences are really shocking. They are seeking to address the situation at school level, but Robert rightly recognises that universities also need to be able to respond appropriately to students from different backgrounds, perhaps allowing the qualification to be completed over a longer period of time.

Robert went on to talk about the ‘National Benchmark Tests (NBT)’ Project, which has produced tests of academic literacy, quantitative literacy and mathematics. The really scary, though sadly predictable, finding is that the National Benchmark Tests are extremely good at predicting outcome. But the hope is that the tests can be used to direct students to extended or flexible programmes of study.

In my mind, Robert’s talk sits alongside Gwyneth Hughes’s talk on ipsative assessment i.e. assessing the progress of an individual student (looking for ‘value added’). Gwyneth talked about ways in which ipsative assessment (with a grade for progress) might be combined with conventional summative assessment, but that for me is the problem area. If we are assessing someone’s progress and they have just not progressed far enough I’m not convinced it is helpful to use the same assessment as for students who are flying high.

But the important thing is that we are looking at the needs of individual students rather than teaching and assessing a phantom average student.

Special edition of Open Learning

Posted on May 27th, 2015 at 9:42 pm by Sally Jordan

If you are interested in assessment in a distance learning context, you may be interested to know about a special edition of the journal ‘Open Learning’, with the theme of assessment. Click on the link to see the call for papers.

Open Learning The Journal of Open Distance and eLearning CFP for Assessment Special Issue March 2015

Performance on interactive computer-marked questions – gender differences

Posted on March 15th, 2015 at 8:47 pm by Sally Jordan

We have become aware of a significant difference in outcome for male and female students on our level 2 physics module: around 25% of the students on the module are women, and they are both less likely to complete the module and less likely to pass if they survive to the end. This effect is not present in our level 1 science module, or at level 2 in other scientific disciplines apart from astronomy; and the women who get through to level 3 do better than men.

Many theories have been proposed as to the reason for the effect, which may be related to persistent gender differences in performance on the force concept inventory – see for example Bates et al. (2013). I proposed that the effect might have been related to the assessment we use; there is evidence (e.g. Gipps & Murphy, 1994; Hazel et al., 1997) that girls are less happy with multiple-choice questions.

One of the things we have done is to look at performance differences on each interactive computer-marked assignment (iCMA) question. The results are summarised in the figure below (click on it to see the detail).

[Figure: the difference in performance between men and women on each iCMA question.]

Points to note are as follows:

Women score less well than men on most questions, but the effect is no greater than for tutor-marked assignment questions.

The gender difference is much greater for some questions than others; but the questions with large differences are not all questions of one type. So multiple choice is not to blame! It appears more likely that the issue is with what the questions are assessing; there is some indication that our female students are less good at complex, abstract, problem-solving type questions. (A sketch of how to check this appears after these points.)

The gender difference is much less for the final iCMA. My hypothesis is that the usual reasons for women doing less well are counter-balanced by the fact that women are more persistent; they are more likely to attempt this iCMA whilst men are more likely to reckon they have reached the required threshold on 5 out of 7 iCMAs, and so not to bother.
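
For completeness, here is a sketch of the sort of calculation behind the second point above: rank the iCMA questions by the size of the male-female score gap and check whether any one question type dominates the top of the list. The file and column names are hypothetical, not our actual records.

```python
import pandas as pd

# Hypothetical export: one row per student per iCMA question, with columns
# question_id, question_type, gender ("M"/"F") and score (a percentage).
df = pd.read_csv("s207_icma_2013J.csv")

# Mean score by question and gender, then the male-female gap per question.
by_q = (
    df.pivot_table(index=["question_id", "question_type"],
                   columns="gender", values="score", aggfunc="mean")
      .rename(columns={"M": "mean_men", "F": "mean_women"})
)
by_q["gap"] = by_q["mean_men"] - by_q["mean_women"]

# Largest gaps first: if one question type were "to blame", it would
# dominate the top of this ranking.
print(by_q.sort_values("gap", ascending=False).head(10))

# Average gap by question type.
print(by_q.groupby(level="question_type")["gap"].mean())
```

On S207, as noted above, the questions at the top of such a ranking are not all of one type, which is why I don’t think multiple choice is the culprit.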

More work is required before I can be confident of this analysis; it is an interesting and extremely important investigation.


Bates, S., Donnelly, R., MacPhee, C., Sands, D., Birch, M., & Walet, N. R. (2013). Gender differences in conceptual understanding of Newtonian mechanics: a UK cross-institution comparison. European Journal of Physics, 34(2), 421-434.

Gipps, C. V. & Murphy, P. (1994). A fair test? Assessment, achievement and equity. Buckingham: Open University Press.

Hazel, E., Logan, P., & Gallagher, P. (1997). Equitable assessment of students in physics: importance of gender and language background. International Journal of Science Education, 19(4), 381-392.



Pixelated assessment

Posted on January 17th, 2015 at 5:27 pm by Sally Jordan

I’m indebted to the colleague who told me about Cees van der Vleuten’s keynote at the EARLI Assessment SIG Conference in Madrid last August. (I should perhaps point out that I am reporting third hand, so I may have got it all wrong. All that I can claim is that I am reporting on my own reaction to what I think my colleague said about what she thinks Professor van der Vleuten said…)

I understand that he was talking about the assessment of professional competence, which is very important. The point that really grabbed my attention, though, was that since we need professionals to be able to do their job day after day, in a reliable fashion, ‘one-off’ assessment at the end of a programme of study isn’t really appropriate. Of course, one-off assessment is always open to challenge – you will do less well if you have a headache on the day of the exam; you will do better if you happen to have revised the ‘right’ things. But there has been something of a backlash against continuous assessment recently, most obviously in the renewed emphasis placed on exams at the expense of coursework in UK schools (courtesy of governmental policy). Perhaps with more justification, some argue that you should assess outcomes at the end of a module rather than progress towards those outcomes, and I have argued (e.g. here) that summative continuous assessment can lead to confusion over its purpose (is it formative or summative; is it for learning or of learning?).

Professor van der Vleuten’s keynote suggested that we should use ‘little and often’ continuous assessment that is very low stakes, perhaps with the stakes increasing as the module progresses – so that a student’s overall assessment record builds up slowly, in the same way that pixels build up to make a picture. Pixelated assessment. Nice!
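
As a toy illustration of what ‘increasing stakes’ might look like arithmetically – my own made-up scheme, not anything Professor van der Vleuten proposed – here are ten short assessments whose weights grow linearly, so that the early pieces are genuinely low stakes and no single data point dominates the overall picture.

```python
# Ten short assessments with linearly increasing weights: a toy version of
# 'little and often' assessment whose stakes rise as the module progresses.
n = 10
total = sum(range(1, n + 1))                      # 1 + 2 + ... + 10 = 55
weights = [i / total for i in range(1, n + 1)]    # 1/55, 2/55, ..., 10/55

marks = [62, 58, 65, 70, 61, 68, 72, 66, 74, 71]  # hypothetical marks out of 100

overall = sum(w * m for w, m in zip(weights, marks))
print(f"Overall mark: {overall:.1f}")
print(f"Weight of first assessment: {weights[0]:.1%}; of last: {weights[-1]:.1%}")
```

In this scheme the first assessment contributes less than 2% of the overall mark and the last about 18%, so an early stumble is cheap, but every pixel still contributes to the final picture.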