
Lies, Damn Lies and GCSE Grade boundaries

October 12, 2012

Some people think that in the Parliamentary Select Committee’s discussions of the GCSE grading fiasco, Glenys Stacey from Ofqual got a relatively easy ride. She was allowed to look like a regulator who was regulating, taking some firm and uncomfortable decisions to bring the inefficient examination boards into line. Yes, she would have forced some boards to comply, because the evidence backed up the Ofqual analysis and that was her responsibility. The committee didn’t challenge her evidence then but, perhaps, they wish now that they had – given some of the other information which has since leaked into the public domain.

They didn’t because they were blinded by science and statistics. The means of ensuring that examination outcomes are (allegedly) accurate across a range of boards, and across a varied and shifting population of candidates, have become increasingly complicated. There is a worldwide industry in statistical comparability, and a degree of competition in suggesting that one way of making comparative judgements is better than another. There is also an understanding that no method is perfect, and that it is as important to understand the range and influence of potential error as it is to get an assessment in the right place.

Of course, the effect of complex statistical analysis is also to mystify the whole process. Arguably, the achievement of transparency might be more important. Whatever one thinks about that, one of the things which clearly happened in 2012 was that the statistics took over.

This has not always been the case. When I first started examining English in the early 1970s under the old GCE framework which Michael Gove would rather like to hearken back to, I went as a raw assistant marker to a meeting at the Oxford Delegacy chaired by Colin Dexter at a time when Inspector Morse was no more than an idea in the back of his head.

The marking process was straightforward. Work through two pieces of continuous writing and correct every error by ringing the spellings, underlining grammatical inaccuracy and putting a wavy underline beneath any inappropriate phraseology. The selective use of marginal ticks was also encouraged. Having done that, you made a holistic judgement about quality. Colin Dexter, not known for a lack of confidence, reticence or modesty about his talents, then told the audience in no uncertain terms what the script was worth. His criteria were not exactly apparent but they did seem to involve separating sheep from goats, common oiks from the literati, rejoicing over the occasional use of the semi-colon, and luxuriating in classical references so that if Phoebus and his golden orb made an appearance that was a definite tick. The intention to norm reference was as clear as was the attention to where the school was located and the kind of chaps it attracted. I suspect that if you were called Cholmondeley it helped as well!

While it would be perfectly reasonable to question an English assessment rooted in the prejudices of Inspector Morse (and he certainly had plenty) what you did get from this meeting was a clear benchmark. As you went on to complete your marking, you could tell relatively quickly – as you slavishly identified the errors – whether this was going to be a pass or fail and you could do this well before Phoebus bade his farewell.

Since then, the importance of judgement in examinations has been gradually eroded. Examiners are asked to make decisions in terms of marks (the notion of grades is studiously avoided) as they go along, and it is considered a good thing that grade awarding is kept separate. In a multi-component assessment like modular English, each component is graded only at the end of the marking process, when it is decided which specific mark is actually worth a Grade C rather than a Grade D, or a Grade A rather than a Grade B.

This is a committee job informed by a statistician who already has a clear understanding of what the outcomes should be. New members of these committees frequently complain at the limitations of the samples they are presented with and the way in which the judgement is downplayed. Examination board officers would quickly point out in reply that these people do not see enough evidence to do more than confirm the statistics because, unlike in the case of Colin Dexter, the judgement has been separated from the assessment.

In 2012, the committees and the statisticians didn’t know what they were weighing and the reason is that they did not have proper benchmarks. A benchmark in these terms is the judgmental standard which constitutes a reasonable expectation of performance. Also, if you know how a similar cohort (in terms of socio-economic status and prior performance) scored last year then you have an expectation of how this year’s cohort ought to do. If they do better, standards are rising and if they do worse the Daily Mail will be happy. That is statistical benchmarking and, although given lots of fancy names, both forms of benchmarking remain at the heart of comparative assessment.

In 2012, there was a further complication in the introduction of controlled assessment, the marking of which was underpinned or overwhelmed (choose your word) by a mass of criterial detail and exemplification referring to a mixture of content, quality, accuracy and appropriateness in relation to set tasks. Different examination boards had different expectations of controlled assessment and different mark schemes. Teachers were given reams of guidance on how to approach it and assess it, but no one had a genuine benchmark.

Prior performance measures could have provided this in 2012 and were desperately needed, because this was a new specification, but they weren’t available. Mass testing at key stage 3 was, despite all of its intrinsic faults, actually quite an accurate predictor of GCSE performance, but that had gone for this cohort. So, almost absurdly in terms of any commonsense view, Ofqual and the boards used key stage 2 data – five years old, rubbished by headteachers and latterly discredited – to create an expectation of performance.

Because it isn’t unknown for this kind of data to be unreliable, the statisticians have other tricks up their sleeves. One of their favourites is to compare performance in a particular component with the overall performance achieved in every other component when those are aggregated together and this one is excluded. The argument is that the performance should not be the same (because otherwise it would be obvious you were simply testing the same thing twice) but it should also not be too different. If it is, the judicious application of your statistical sledgehammer can knock it into shape.
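The component-versus-rest check described above can be sketched in a few lines. This is only an illustration of the idea, not any board’s actual procedure: the candidate data, component names and the reading of the correlation are all invented for the example.

```python
# Sketch of the "component vs. the rest" check: compare each candidate's
# mark on one component with their aggregate mark on all other components.
# A correlation near 1.0 would suggest the component merely repeats the
# others; one near 0 would suggest it measures something unrelated.
# All data below is invented for illustration.

def pearson(xs, ys):
    """Plain Pearson correlation coefficient between two mark lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

def component_vs_rest(marks, component):
    """marks: one dict per candidate mapping component name -> mark."""
    own = [m[component] for m in marks]
    rest = [sum(v for k, v in m.items() if k != component) for m in marks]
    return pearson(own, rest)

# Hypothetical candidates sitting three components of a modular exam.
candidates = [
    {"reading": 55, "writing": 60, "controlled": 58},
    {"reading": 40, "writing": 42, "controlled": 45},
    {"reading": 70, "writing": 65, "controlled": 68},
    {"reading": 30, "writing": 35, "controlled": 33},
]

r = component_vs_rest(candidates, "controlled")
```

If `r` strays outside whatever band the statisticians deem acceptable, out comes the sledgehammer.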

Another trick is to compare the outcomes for the subject with the outcomes for a number of other subjects offered by the board. This works on the broad assumption that if the number of candidates is large enough there will be enough overlap to let you see if performance in one subject is out of line with performances overall. Clearly, this was a factor in Ofqual’s advice to specific examination boards in 2012.
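The cross-subject comparison amounts to asking whether one subject’s results sit out of line with the board’s results elsewhere. A minimal sketch, with invented grade data and an invented 10% tolerance, might look like this:

```python
# Sketch of the cross-subject check: compare the proportion of candidates
# reaching grade C or better in one subject with the proportion across all
# the board's other subjects, and flag the subject if the gap is large.
# The grades and the tolerance threshold are invented for illustration.

def c_or_better_rate(grades, passing=("A*", "A", "B", "C")):
    return sum(1 for g in grades if g in passing) / len(grades)

def out_of_line(subjects, subject, tolerance=0.10):
    own = c_or_better_rate(subjects[subject])
    others = [g for k, v in subjects.items() if k != subject for g in v]
    overall = c_or_better_rate(others)
    return abs(own - overall) > tolerance, own, overall

# Invented grade lists per subject for one hypothetical board.
subjects = {
    "English": ["A", "C", "D", "B", "C", "D", "E", "C"],
    "Maths":   ["A", "B", "C", "C", "B", "D", "C", "A"],
    "History": ["B", "C", "C", "A", "D", "C", "B", "C"],
}

flagged, own_rate, overall_rate = out_of_line(subjects, "English")
```

The broad assumption, as noted above, is that the candidate overlap between subjects is large enough to make such a comparison meaningful in the first place.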

Something else you can do is to create a legacy group. In this case, you might identify a group of students who were taking the examination and also had key stage 2 scores. The prediction is that, whichever examination board they sat their GCSE with, the outcomes should be similar. You don’t have to say what those outcomes will be, but you can say that if this standardised group does better with Edexcel than with OCR, the two boards are not working to the same standard.
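The legacy-group idea can also be sketched briefly. Everything here is hypothetical – the cohort, the key stage 2 scores, the grade-to-points mapping – and serves only to show the shape of the comparison:

```python
# Sketch of the legacy-group check: take candidates who have key stage 2
# scores, and compare the mean grade each board awarded them. A persistent
# gap on a matched cohort suggests the boards are not working to the same
# standard. All records and the points mapping are invented.

GRADE_POINTS = {"A*": 8, "A": 7, "B": 6, "C": 5, "D": 4, "E": 3}

# (ks2_score, board, gcse_grade) for a hypothetical matched cohort:
# each ks2 score appears once per board, so prior attainment is balanced.
legacy_group = [
    (28, "Edexcel", "B"), (28, "OCR", "C"),
    (25, "Edexcel", "C"), (25, "OCR", "C"),
    (31, "Edexcel", "A"), (31, "OCR", "B"),
]

def mean_points(records, board):
    pts = [GRADE_POINTS[g] for _, b, g in records if b == board]
    return sum(pts) / len(pts)

gap = mean_points(legacy_group, "Edexcel") - mean_points(legacy_group, "OCR")
```

A positive `gap` on this invented data would be read as Edexcel grading more generously than OCR – without anyone having to say what either board’s results “should” have been.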

The problem for Ofqual in 2012 was that it had neither reliable judgements nor reliable statistics. When it instructed individual boards to move their grade boundaries, it did so simply on the basis of the year-on-year figures. Worse still, the regulator does not appear to have been left with a benchmark if, as is generally recognised, the June standard was shifted to compensate for generosity in January.

The question that leaves us with is where the benchmark should be drawn for 2013 and for the forthcoming resit opportunity which has been offered to candidates free of charge. Statistically, because this is a one-off, it will be even harder to make sense of the outcomes and it will not be possible to carry them forward. If the severity of June is carried forward, candidates will do no better in October and standards will appear to decline once again in 2013. If it is not carried forward, that raises the question of what will be.

This is a significant issue which is bound to emerge at the judicial review. It is also possible that the assertion that only 1.5% fewer candidates achieved a grade C or better in English in 2012 might be challenged. Anecdotally, it looks to be closer to a minimum of 5%.

The problem now for Glenys Stacey is that being firm is not enough. What she really needs is Colin Dexter and the kind of evidence that would convince Inspector Morse. When she met the Select Committee back in February, she established her capacity to be firm with people on the basis that she had closed a footpath during a foot-and-mouth epidemic so that it was easier for a group of marksmen to shoot some wild cattle. She thought that the presence of journalists and photographers might put the guns off their gory task.

Glenys Stacey hasn’t shot anybody this autumn – yet!

