The unscientific method

The title of this post is copied from another New Scientist article, this time by Sonia van Gilder Cooke, and published in Issue number 3069 (16th April 2016) on pages 39-41. The article starts “Listening to When I’m Sixty-Four by The Beatles can make you younger. This miraculous effect, dubbed ‘chronological rejuvenation’ was revealed in the journal Psychological Science in 2011. It wasn’t a hoax, but you’d be right to be suspicious. The aim was to show how easy it is to generate statistical evidence for pretty much anything, simply by picking and choosing methods and data in ways that researchers do every day.”

The article is wider ranging than the one that I’ve just posted about here. However, what is most worrying is that it goes on to point out that dubious results are alarmingly common in many fields of science. The summary of causes of bias includes some things that I suspect I have been guilty of:

  • Wishful thinking – unconsciously biasing methods to confirm your hypothesis
  • Sneaky stats – using the statistical analysis that best supports your hypothesis
  • Burying evidence – not sharing research data so that results can be scrutinised
  • Rewriting history – inventing a new hypothesis in order to explain unexpected results
  • Tidying up – ignoring inconvenient data points and analyses in the write-up
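The “sneaky stats” item can be made concrete with a little arithmetic. The sketch below is only an illustration under stated assumptions: there is no real effect, each analysis is independent of the others, and “significant” means p < 0.05. The function name is made up for the example. It shows how quickly the chance of at least one spurious “finding” grows as more analyses are tried.

```python
def chance_of_false_positive(n_analyses: int, alpha: float = 0.05) -> float:
    """Probability that at least one of n independent analyses of pure
    noise comes out 'significant' at threshold alpha."""
    if n_analyses < 0:
        raise ValueError("n_analyses must be non-negative")
    # Each analysis avoids a false positive with probability (1 - alpha);
    # all n avoid it with probability (1 - alpha) ** n.
    return 1 - (1 - alpha) ** n_analyses


if __name__ == "__main__":
    for n in (1, 5, 14, 20):
        print(f"{n:2d} analyses tried: "
              f"{chance_of_false_positive(n):.0%} chance of a spurious result")
```

Under these assumptions, the chance of a false positive passes one in two by the time fourteen analyses have been tried. Real analyses of the same data are rarely independent, so the true figure will differ, but the direction of the effect is the same.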

I will discuss one cause that isn’t explicitly mentioned in the summary, namely our wish to publish only ‘positive’ results, in my next post in this morning’s trilogy.

The article goes on to suggest a number of fixes:

  • Pre-registration – publicly declaring procedures before doing a study
  • Blindfolding – deciding on a data analysis method before the data are collected
  • Sharing – making methods and data transparent and available to others
  • Collaboration – working with others to increase the rigour of experiments
  • Statistical education – acquiring the tools required to assess data meaningfully

Posted in research methods, statistics

Simpson’s paradox

Back in November, I posted that I was going to be more bullish about the fact that I am a physicist who does educational research. As I try to build my confidence to say some of the things that follow from that in my own voice, I’ll start by quoting some more articles I have read in the past few months.

To start with, there was a piece in New Scientist back in February (Issue number 3062, 27th February 2016, pp. 35-37), by Michael Brooks and entitled “Thinking 2.0”. This article starts by pointing out Newton’s genius in recognising the hidden variable (gravity) that connects a falling apple and the rising sun. Brooks goes on to explain that “we know that correlation does not equal causation, but we don’t grasp the depth of it” – and to point out that our sloppy understanding of statistics can lead us into deep water.

Brooks gives a powerful hypothetical example of Simpson’s paradox, defined by Wikipedia as a paradox “in which a trend appears in different groups of data but disappears or reverses when these groups are combined” (the Wikipedia article gives some more examples and is worth reading). The example in the New Scientist article is about a clinical trial involving 400 men and 400 women that, taking the 800 participants as a whole, apparently shows that a new drug is effective in treating an illness. However, if you look at the men and the women separately, it becomes apparent that in each group a higher proportion of those who were NOT given the drug recovered than of those who received it. How so? Well, although the sample was nicely balanced between men and women, and half of the participants received the drug whilst half didn’t, it turns out that far more men were given the drug in this particular study, and men are much more likely to recover, whether or not they receive the drug. The men’s higher overall recovery rate masked the drug’s negative effect. This is a hypothetical example, and in a structured environment such as a clinical trial, such potential pitfalls can generally be circumvented. But medical – and educational – research often operates in what Brooks rightly describes as muddy waters. Controls may not be possible and we can be led astray by irrelevant, confusing or missing data.
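A minimal sketch of the arithmetic may help. The counts below are invented to match the set-up described (400 men, 400 women, half of the participants given the drug, far more men than women given it, and men more likely to recover either way); they are an assumption for illustration, not the figures from the article. With these particular numbers the pooled comparison and the within-group comparisons point in opposite directions.

```python
from fractions import Fraction

# (recovered, total) for each group; illustrative numbers only
trial = {
    ("men", "drug"): (180, 300),
    ("men", "no drug"): (70, 100),
    ("women", "drug"): (20, 100),
    ("women", "no drug"): (90, 300),
}


def recovery_rate(sex=None, arm=None):
    """Recovery rate for the selected cells; None means 'pool over this factor'."""
    cells = [
        counts
        for (s, a), counts in trial.items()
        if sex in (None, s) and arm in (None, a)
    ]
    recovered = sum(r for r, _ in cells)
    total = sum(n for _, n in cells)
    return Fraction(recovered, total)


if __name__ == "__main__":
    for sex in ("men", "women", None):
        label = sex or "everyone"
        drug = recovery_rate(sex, "drug")
        none = recovery_rate(sex, "no drug")
        print(f"{label:8s} drug: {float(drug):.0%}   no drug: {float(none):.0%}")
```

In this made-up data set the drug arm does worse than the no-drug arm among the men and among the women, yet better when everyone is pooled, purely because most of the people given the drug belong to the group that recovers more often anyway.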

Although I was aware of Simpson’s paradox and thought I had a reasonable understanding of ‘lies, damned lies and statistics’ it took me some time to get my head around what is going on here. We need to be really careful.

Posted in research methods, Simpson's paradox, statistics

Do we need assessment at all?

I’m surprised I haven’t posted on this before, but it looks as if I haven’t, and I am reminded to do so now by another New Scientist piece, this time from back in January:

Rutkin, A. (2nd Jan 2016) Robotutor is a class act. New Scientist, 3054, p. 22.

The article talks about an algorithm developed by researchers at Stanford University and Google in California which analyses students’ performance on past problems, identifies where they tend to go wrong and forms a picture of their overall knowledge.

Chris Piech from Stanford goes on to say “Our intuition tells us if you pay enough attention to what a student did as they were learning, you wouldn’t need to have them sit down and do a test.”

The first paper I heard suggesting that we might assess students by analysing their engagement with an online learning environment (rather than adding a separate test) was Redecker et al. (2012) and it blew me away.

Redecker, C., Punie, Y., & Ferrari, A. (2012). eAssessment for 21st Century Learning and Skills. In A. Ravenscroft, S. Lindstaedt, C.D. Kloos & D. Hernandez-Leo (Eds.), 21st Century Learning for 21st Century Skills (pp. 292-305). Berlin: Springer.

In reality, of course, and as much discussed in this blog, I would never want to do away with interaction with humans, and there are things (e.g. essays, problem solving) where I think marking should be done by human markers. However, if we can do away with separate tests that are just tests, I’d be delighted.

Posted in learning analytics

Positive discrimination?

This isn’t really about assessment, or perhaps it is. First of all, some background. Because of a change in the dates used to establish school years where I lived when I was small, I missed a year at primary school. So, in a sense, I was disadvantaged. But I understand that, up to a certain age, they then gave those of us affected extra marks in exams. I’ve no idea whether that was actually the case. What I do know is that if I felt I’d been given unfair advantage over others in my more recent career (in particular as a female physicist) I would not be happy.

My definition of equality of opportunity has to do with levelling the playing field. I once arrived at a tutorial venue to give a tutorial, having requested a ground floor room because I knew someone in a wheelchair would be there. The problem was that the venue had given me a room in a portacabin up three steps. Only three steps but the effect was the same – the student couldn’t access the tutorial (well, not until I got angry and got us moved to another room). Sometimes apparently small things can get in the way of learning, for some students not for others, and promoting equal opportunity is to do with ensuring that these “small things” are removed. In my book, equality of opportunity is not the same as positive discrimination; I’d give a student extra time in an exam if a medical condition suggested it was necessary; I would not give a student extra marks just by virtue of the medical condition. I’m happy to argue my case for that…or at least I was…

At the Open University we have found that female students do less well on one of our physics modules, and we continue to investigate the causes of this and to seek to put it right. Start here to learn more about this work. However, I’d never have thought of increasing marks just for women or others in minority groups. After all, these are averages; some women and some black students do well, even if their average attainment is lower.

Then, in my catch-up reading of old copies of New Scientist, I came across an opinion piece from Joshua Sokol entitled “Mix it up”. This points out, as I know from other sources, that there can be a mismatch between scores in tests and future performance. So if women and black students do less well in a test, and we use that test to determine entry onto a subsequent programme (in this case an Astronomy PhD), we are both disadvantaging racial minorities and women, and failing to get the best students on the subsequent programme.

By coincidence, I have been trying to come to terms with all of this in the week when my Department at the Open University has been awarded Institute of Physics Juno Champion Status for our commitment to gender equality. It’s great news, but it doesn’t mean we have arrived! More thought is needed, and I think my conclusion to the conundrum described in this post is probably to be careful not to judge ANYONE on a single measure.

Sokol, Joshua (9th January 2016). Mix it up. New Scientist, number 3055, p. 24.

Posted in gender, positive discrimination

Feedback from a computer

Back in Feb 2011 – gosh that’s five years ago – I was blogging about some contradictory results on how people respond to feedback from a computer. The “computers as social actors” hypothesis contends that people react to feedback from a computer as if it were from a human. In my own work, I found some evidence of that, though I also found evidence that when people don’t agree with the feedback, or perhaps just when they don’t understand it, they are quick to blame the computer as having “got it wrong”.

The other side to this is that computers are objective, and – in theory at least – there is less emotional baggage in dealing with feedback from a computer than in dealing with feedback from a person; you don’t have to deal with the aspect that “my tutor thinks I’m stupid” or, perhaps even worse for peer feedback, “my peers think I’m stupid”.

I was reminded of this in reading an interesting little piece in this week’s New Scientist. The article is about practising public performance to a virtual audience, and describes a system developed by Charles Hughes at the University of Central Florida. The audience are avatars, deliberately designed to look like cartoon characters. A user who has tried the system says “We all know that’s fake but when you start interacting with it you feel like it’s real” – that’s Computers as Social Actors. However, Charles Hughes goes on to comment “Even if we give feedback from a computer and it actually came from a human, people buy into it more because they view it as objective”.

Wong, S. (6th Feb 2016). Virtual confidence, New Scientist, number 3059, p. 20

Posted in Computers as Social Actors, feedback from a computer

Tails wagging dogs

Earlier in the week I gave a workshop at another University. I’m not going to say where I was, because it might sound as if I’m criticising their practice. Actually I’m not criticising them particularly, and indeed the honest, reflective conversation we had was amazing. I think that many of us would be advised to stop and think in the way we all did on Wednesday.

I was running a workshop on the electronic handling of human-marked assignments. I should perhaps mention that I was in a physics department – and I am a physicist too. This is significant, because if we expect students to submit assignments electronically, then they have to produce them electronically – complete with all the symbolic notation and graphs etc. that we use. This can be a real challenge. At the Open University we encourage students to write their answers by hand and scan them, but then the quality can be mixed – and plagiarism-checking software doesn’t work. Many OU students choose to input their maths in Word Equation Editor or LaTeX, but it takes them time and they make mistakes – and frequently they don’t show as much working as they should.

Then there’s the problem of marking; how do we put comments on the scripts in a way that’s helpful without it taking an unreasonable amount of time? At the OU, we comment in Word or using PDF annotator, using various hardware like the iPad Pro and the Microsoft Surface. We can make it work reasonably well, and actually some of our tutors now get on quite well with the technology. As a distance-learning University, we can at least argue that electronic handling of assignments speeds the process up and saves postage costs – and trees!

I’d been asked to run a workshop about what we do at the OU; I think I failed in that regard – they knew as much as I do. However, halfway through, someone commented to the effect that if the best we can do is to mimic handwritten submission and marking, why are we doing this? They’ve been told they have to, so that there’s an audit trail, but is that a good enough reason? They are allowed to make a special case and the mathematicians have done this; but isn’t it time to stop and think about the policy?

We then started thinking about feedback – there is evidence that audio/video feedback can be more useful than written feedback. So why are we giving written feedback? Indeed, is the feedback we give really worth the time we spend on it? We’re driven to give more and more feedback because we want high scores on the NSS, and students tell us they want more feedback. But is it really useful? I’ve blogged on that before and I expect that I will again, but my general point in this post is that we should stop and think about our practice rather than just looking for solutions.

On a related point, note that JISC have run a big project on the “Electronic Management of Assessment” – there is more on this here.

Posted in electronic management of assessment

What do I really think about learning analytics?

There have been two very good webinars on learning analytics recently in the Transforming Assessment series. On 9th Sept 2015, Cath Ellis from the University of New South Wales and Rachel Forsyth from Manchester Metropolitan University spoke on “What can we do with assessment analytics?”. On 9th December 2015, Gregor Kennedy, Linda Corrin and Paula de Barba from the University of Melbourne spoke on “Providing meaningful learning analytics to teachers: a tool to complete the loop”. I would heartily recommend both recordings to you.

Having said that, talk of learning analytics is suddenly everywhere. If we take Doug Clow’s (2013, p. 683) short definition of learning analytics as “the analysis and representation of data about learners in order to improve learning” then it is beyond argument that this is something we should be doing. And I agree that student engagement in assessment is something that must be included in the analysis, hence “assessment analytics”. It could be argued that I’ve been using assessment analytics since before the term was invented, though my approach has been a bit of a cottage industry; one of the recent changes is that learning analytics now (rightly) tends to be available at the whole institution level.

There seems to be some dispute about (i) whether the terms learning and assessment analytics apply to analysis at the cohort level as well as analysis at the individual student level and (ii) whether it is legitimate to include analysis done retrospectively to learn more about learning. I would include all of these; all are important if we are to improve the student experience and their chances of success.

So far so good. However, learning analytics has become fashionable; that in itself is perhaps a cause for some anxiety. It is all too easy to leap on board the bandwagon without giving the matter sufficient thought. I know that many mainstream academics (i.e. those who probably don’t read blogs like this one) are deeply uneasy about the approach. This is partly because the information given to academics is sometimes blindingly obvious…e.g. telling us that students who do not engage at all with module materials are not very likely to pass. Some of us have been banging on about this for years. I am also anxious that the analytics are sometimes simplistic, equating clicking on an activity with engagement with it, something I’ve posted about before.

So, what’s the way forward? I think learning analytics has to cease being the preserve of the few and become something that ordinary lecturers use in their teaching as a matter of routine; if this is to happen they need to trust and understand the data and to take ownership of it. Furthermore, the emphasis needs to change from the giving of data to the real use of data. When learning analytics (at the individual or the cohort level, and in real time or retrospectively) reveals a problem, let’s do something about it.

Clow, D. (2013). An overview of learning analytics. Teaching in Higher Education, 18(6), 683-695.

Posted in assessment analytics, learning analytics

Can multiple-choice questions be used to give useful feedback?

I was asked the answer to this question recently, and I thought it was worth a blog post. My simple answer to the question in the title, I’m afraid to say, is “no”. Perhaps that’s a bit unfair, but I think that relying on MCQs to provide meaningful feedback is somewhat short-sighted; surely we can do better.

My argument goes thus: a question author provides feedback which they believe to be meaningful on each of the distractors, but that assumes that the student was using the same logic as the question author in reaching that distractor. In reality, students may guess the answer, or work backwards from the distractors, or something in between, e.g. they may rule out distractors they know to be wrong and then guess from amongst the rest. Feedback is only effective when the student encounters it in a receptive frame of mind (timing and concepts such as response certitude come into play here); if the student has given a response for one reason and the feedback assumes a different logic then the feedback is, at best, of dubious value. There is also growing evidence that when given the option to give responses without the ‘hint’ provided by MCQs, students give answers that were not amongst those provided in the distractors.

It is no secret that I am not a fan of selected response questions, though my views have mellowed slightly over the years. My biggest problem with them is the lack of authenticity. However, if that is not an issue for the use being made, and the questions are well written, based on responses that students are known to give (rather than those that ‘experts’ assume students will give), then perhaps MCQs are OK. Even relatively simple multiple-choice questions can create “moments of contingency” (Black & Wiliam, 2009; Dermo & Carpenter, 2011) and Draper’s (2009) concept of catalytic assessment is based on the use of selected-response questions to trigger subsequent deep learning without direct teacher involvement. However, I think the usefulness here is in making students think, not in the direct provision of feedback.

There are other things that can be done to improve the usefulness of multiple-choice questions e.g. certainty-based marking (Gardner-Medwin, 2006). However, when there are so many better question types, why not use them? For example, there are the free-text questions – with feedback – used by the free language courses at https://www.duolingo.com/. I’m not sure what technology they are using, but I think it is linked to crowd-sourcing, which I definitely see as the way ahead for developing automatic marking and feedback on short-answer constructed response questions.
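For readers who have not met certainty-based marking, the idea can be sketched in a few lines. The mark scheme below is an assumption: it uses the values usually quoted for Gardner-Medwin’s scheme (1, 2 or 3 marks for a correct answer given at low, medium or high certainty; 0, -2 or -6 for a wrong one), and the function names are made up for the example. The sketch works out which certainty level gives the best expected mark for a student who honestly believes they have a given chance of being right.

```python
# Assumed scheme: certainty level -> (mark if correct, mark if wrong)
CBM_MARKS = {1: (1, 0), 2: (2, -2), 3: (3, -6)}


def expected_mark(certainty: int, p_correct: float) -> float:
    """Expected mark at a certainty level, given the chance of being right."""
    right, wrong = CBM_MARKS[certainty]
    return p_correct * right + (1 - p_correct) * wrong


def best_certainty(p_correct: float) -> int:
    """Certainty level with the highest expected mark for this chance of being right."""
    return max(CBM_MARKS, key=lambda level: expected_mark(level, p_correct))


if __name__ == "__main__":
    for p in (0.5, 0.75, 0.9):
        print(f"{p:.0%} sure of being right: best to declare certainty "
              f"{best_certainty(p)}")
```

With these assumed marks, claiming high certainty only pays off when the student really is very likely to be right, which is the sense in which such a scheme rewards honest reporting of certainty rather than bluffing.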

Let’s make 2016 the year in which we really look at the evidence and improve the quality of what we do in the name of computer-marked assessment and computer-generated feedback. Please.

References

Black, P. & Wiliam, D. (2009). Developing the theory of formative assessment. Educational Assessment, Evaluation and Accountability, 21(1), 5-31.

Dermo, J. & Carpenter, L. (2011). e-Assessment for learning: Can online selected response questions really provide useful formative feedback? In Proceedings of the 2011 International Computer Assisted Assessment (CAA) Conference, Southampton, 5th-6th July 2011.

Draper, S. (2009). Catalytic assessment: Understanding how MCQs and EVS can foster deep learning. British Journal of Educational Technology, 40(2), 285-293.

Gardner-Medwin, A. R. (2006). Confidence-based marking: Towards deeper learning and better exams. In C. Bryan & K. Clegg (Eds.), Innovative Assessment in Higher Education (pp. 141-149). London: Routledge.

Posted in feedback, multiple-choice questions

Researching engagement with assessment, as a physicist

I have not posted as much as I might have wished recently, and when I have, I’ve tended to start with a grovelling apology on the grounds of lack of time because of my head of department duties. I sometimes also hesitate to post because of a lack of confidence: I’m not really an expert; what grounds do I have to be so opinionated? However, following my seminar in our own Department of Physical Sciences’ Seminar Series at the Open University on Thursday, I have decided that it is time to take a more robust attitude. OK, I’m unusual to be a physicist, let alone the head of a Department of Physical Sciences, doing pedagogic research. But that’s what I am; that’s who I am. The point is that I am researching learning, but I am doing so as a numerate scientist. I’m going to stop apologising for the fact and I might even stop moaning about the resultant difficulty that I sometimes have in getting papers published. I am not a social scientist, I’m a physicist.

So what does that mean? It means that I try to use scientific methodology; I listen to student opinion because it is important, but I also look for hard data. I don’t say that one thing causes another unless there is evidence that it does. Furthermore – and scientists sometimes fall down here too – I report my findings even when they don’t show what I was expecting. Well, that’s my aspiration. As frequently happens, I was slightly worried by some of the comments following my talk on Thursday – people say “ah yes, we have found such and such”. Have they REALLY found this, or is it what they think might be happening? Hypotheses are important but they need testing. Even more worryingly, I’m writing a paper at the moment and it is very tempting to ignore findings that don’t support the story I want to tell. Please don’t let me do that. Please give me the courage to stand my ground and to report the truth, the whole truth and nothing but the truth.

I have just realised that I don’t seem to have posted about the talk that Tim Hunt and I gave at the Assessment in Higher Education Conference in the summer on “I wish I could believe you: the frustrating unreliability of some assessment research”. I will rectify that as soon as possible (…remember, I’m a head of department…) but in the meantime, our slides are on slideshare here.

Posted in research methods

The multiple limitations of assessment criteria

Sadly, I don’t get as much time as I used to in which to think about assessment. So last Wednesday was a particular joy. First thing in the morning I participated in a fantastic webinar that marked the start of a brand new collaboration between two initiatives that are close to my heart – Transforming Assessment (who run a webinar series that I have been following for a long time) and Assessment in Higher Education (whose International Conferences I have helped to organise for 4 years or so). Then I spent most of the afternoon in a workshop discussing peer review. The workshop was good too, and I will post about it when time permits. For now, I’d like to talk about that webinar.

The speaker was Sue Bloxham, Emeritus Professor at the University of Cumbria and the founding Chair of the Assessment in Higher Education Conference. It was thus entirely fitting that Sue gave this webinar and, despite never having used the technology before, she did a brilliant job – lots of good ideas but also lots of discussion. Well done Sue!

Assessment criteria are designed to make the processes and judgement of assessment more transparent to staff and students and to reduce the arbitrariness of staff decisions. The aim of the webinar was to draw on research to explore the use of assessment criteria by experienced markers and discuss the implications for fairness, standards and guidance to students.

Sue talked about the evidence of poor reliability and consistency of standards amongst those assessing complex performance at higher education level, and suggested some reasons for this, including different understanding, different interpretation of criteria, ‘marking habits’ and ignoring or choosing not to use criteria.

Sue then described a study, carried out jointly with colleagues from the ASKe Pedagogical Research Centre at Oxford Brookes University, which had sought to investigate the consistency of standards between examiners within and between disciplines. 24 experienced examiners from 4 disciplines and 20 diverse UK universities were employed and each considered 5 borderline (2i/2ii or B/C) examples of typical assignments for the discipline.

The headline finding was that overall agreement on a mark by assessors appears to mask considerable variability in individual criteria. The difference in the historians’ appraisal of individual constructs was further investigated and five potential reasons were identified that link judgement about specific elements of assignments to potential variation in grading:

  • Using different criteria from those published
  • Assessors have different understanding of shared criteria
  • Assessors have a different sense of appropriate standards for each criterion
  • The constructs/criteria are complex in themselves, even comprising various sub-criteria which are hidden to view
  • Assessors value and weight criteria differently in their judgements

Sue led us into a discussion of the implications of all of this. Should we recognise the impossibility of giving a “right” mark for complex assessments? (for what it’s worth, my personal response to this question is “yes” – but we should still do everything in our power to be as consistent as possible). Sue also discussed the possibility of ‘flipping’ the assessment cycle, with much more discussion pre assessment and sharing the nature of professional judgement with students. Yes, yes, yes!

If I have a complaint about the webinar it is purely that some of the participants took a slightly holier-than-thou approach, assuming that the results from the study Sue described were the result of poor assessment tasks, or insufficiently detailed criteria (Sue explained that she didn’t think more detailed criteria would help, and I agree), or examiners who were below par in some sense. Oh dear, oh dear, how I wanted to tell those people to carry out a study like this in their own context. Moderation helps, but those who assume high-level consistency are only deluding themselves.

While we are on the subject of the subjective nature of assessment, don’t take my word for the high quality of this webinar; watch it yourself at http://ta.vu/4N2015

Posted in assessment criteria, human marking, marking accuracy