Why should we judge the Judges? Competition is done, scores are posted, who cares?
The
first time I heard this question was after my first "attempt" as a
shadow judge at an international competition in Le Touquet, France in
1991. I wasn't sure judging would ever be something for me, and I
wanted to know how one could judge judging quality. After judging in
some 65 international events (including 5 World cups, 7 Euro cups, 3
French, 1 German, numerous Dutch nationals) I still wonder, and every
now and then I hear remarks about "good" or "bad" judges, so it seems
people must still be judging the judges.
Regardless
of how it is done, competitors and others seem to strongly believe it
does make a difference to competition if judges are "good".
If the goal of competition is to find the best flier, then a good judge
should pick the best flight routine, following the rules and criteria as
set, and score the others relative to that number one. If she or he is
not able to do that, "competition" is no longer competition.
Defining "good judging" is not easy though, and some might say it is probably impossible.
Which "activities" of the judges could (or should) be checked?
The first one might be knowledge of the rules and guidelines, and of
flying a kite or all kites. We can test knowledge of the rules by
setting a kind of exam. It would do no harm if we tried that, but I am
not sure it would make a big difference. On the field the rules play a
more important role in establishing a format for the results than in
actually judging the figures or routines.
Second,
we can check a judge’s knowledge about kite-flying, which is of course
essential. We might use videos to check if the judge sees what "we all
see". Personally I like the training sessions we have had in the
Netherlands every now and then, where I learn probably more from the
pilots than they learn from me as a judge. Discussing the differences
between what I see and know, and what they fly or claim to fly is good
and needed feedback. We might assume the flier knows enough about
flying; that does not always mean, though, that the different talent
needed to judge is there too.
Experience is not too difficult to check. But having judged often in
the past might mean that someone is good at it, or just that they are a
popular person.
Whether a judge can explain his or her conclusions during debriefings
could be another activity we might look into, kind of like checking
their "bedside manner". To me this feedback to and from fliers is the
most important part of a judge's activities. Explaining (I definitely
do not mean arguing!) your views and opinions to other
judges enables you to learn. Talking with competitors, or answering
their questions can give you a good idea about what the flier wanted to
show you, which might be quite different from what you have seen (and
the flier will appreciate this feedback from a trained observer,
especially if it comes from a 'good' judge). It is a less numerical way
to address the judges' quality, usually far more informative to both
fliers and judges than the "why did I score ..." question. Judging
(de-)briefings are an essential part of judging a competition, general
debriefings (and the discussions with pilots just after) are just as
essential for competition in general.
The last "activity", the scoring, is the most obvious to check and the
most questioned, but also the most difficult part of judging to interpret.
Obvious, because it is the only visible "result" of the judging
process. Difficult, because the numbers alone lose a great deal of
meaning without the "attached" flier and judge.
It
might be good to analyze these numbers… You should actually judge
yourselves in the same competition, but then you would also double the
problem.
As in real life (I know, for some of you kite flying is real life),
judges might have the best opportunity to judge the judges. Sure,
comparing your own conclusions about your flying with those of the judge
("why is my score so low") can give you some idea of judging, but only
comparison with the other judges' scores really shows the value of
'your' score.
So that is why I, as a judge/scorer in a competition, tend to combine
all scores and analyze what has happened with the scoring. I do not
presume that scores are objective, or that judges will always score the
same routine with the same number. Scores are just as much an opinion as
a conclusion, and so they will always differ between judges.
Still, the rules ask judges to be objective, and judges usually try to
be as objective as possible. If they really were objective, just one
judge would suffice.
The first thing I do is calculate the average score given by each judge,
so that a "low" or "high" score can be compared with that judge's own
average. It might show that a "low" score from one judge gets you a
higher ranking than a "high" score from another!
I check the "spread" in scores for each judge to find which judge stays
a bit in the middle, and which one is more extreme in her or his
scoring.
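
As a rough illustration of these first two checks, here is a minimal Python sketch. It is not the spreadsheet mentioned at the end of this article; the judge names, team names and scores are invented, and expressing each score as a distance from the judge's own average, measured in units of that judge's spread, is only one possible way to make the comparison.

```python
from statistics import mean, pstdev

# Hypothetical scores, invented for illustration: judge -> {competitor: score}.
scores = {
    "Judge A": {"Team 1": 60.0, "Team 2": 70.0, "Team 3": 80.0},
    "Judge B": {"Team 1": 74.0, "Team 2": 75.0, "Team 3": 76.0},
}


def judge_profile(judge_scores):
    """Return (average, spread) of the scores one judge has given."""
    values = list(judge_scores.values())
    return mean(values), pstdev(values)


def relative_scores(judge_scores):
    """Express each score relative to the judge's own average and spread."""
    average, spread = judge_profile(judge_scores)
    if spread == 0:
        # A judge who gave everyone the same score separates nobody.
        return {name: 0.0 for name in judge_scores}
    return {name: (s - average) / spread for name, s in judge_scores.items()}


profiles = {judge: judge_profile(s) for judge, s in scores.items()}
relative = {judge: relative_scores(s) for judge, s in scores.items()}

for judge, (average, spread) in profiles.items():
    print(f"{judge}: average {average:.1f}, spread {spread:.2f}")
```

With these invented numbers, the 74 that Judge B gives Team 1 is below Judge B's own average, while the 70 that Judge A gives Team 2 sits exactly on Judge A's average, so the numerically higher score is the weaker one.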
I then look for the
differences in ranking for each judge, since finding the best, and
second and third best, is the more interesting part of judging for the
fliers!
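
The ranking comparison can be sketched in the same spirit. This is only an assumed, simplified version: the data are invented, every judge is assumed to have scored every competitor, and ties are not treated specially (they simply keep the order in which the competitors were entered).

```python
# Hypothetical scores, invented for illustration: judge -> {competitor: score}.
scores = {
    "Judge A": {"Team 1": 60.0, "Team 2": 70.0, "Team 3": 80.0},
    "Judge B": {"Team 1": 76.0, "Team 2": 75.0, "Team 3": 78.0},
}


def ranking(judge_scores):
    """Return {competitor: place} for one judge, where place 1 is the best."""
    ordered = sorted(judge_scores, key=judge_scores.get, reverse=True)
    return {name: place for place, name in enumerate(ordered, start=1)}


def rank_disagreement(all_scores):
    """For each competitor, the gap between the best and worst place given."""
    ranks = {judge: ranking(s) for judge, s in all_scores.items()}
    competitors = next(iter(all_scores.values()))
    result = {}
    for name in competitors:
        places = [judge_ranks[name] for judge_ranks in ranks.values()]
        result[name] = max(places) - min(places)
    return result


disagreement = rank_disagreement(scores)
for name, gap in disagreement.items():
    print(f"{name}: places differ by {gap}")
```

In this made-up panel both judges put Team 3 first, but they swap second and third place, which is exactly the kind of difference worth discussing afterwards.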
Combined, this information gives me some insight into how well the
judges agree. It shows which routines and figures might give cause for
discussion, and for which routines and figures the quality is
appreciated roughly the same.
It is in the points on which judges seriously disagree that we can find
what troubled the judges, how well they succeeded in striving for
objectivity, and even which elements in routines haven't found their
place yet, like some new tricks.
Most interesting is the analysis of the scores for compulsory figures.
Short, simple (for the judges) and well described, they should result in
very similar scores from a panel of "objective" judges.
Flying
is not done to please the judges. Judges deliver a service to fliers to
establish who is the best competitor in each discipline. Of course the
"tools" must make that possible. The agreement between fliers and
judges (the rules and guidelines) must have a form and content that
allows judges to work with them. To give an example, doing something
totally new (and difficult) will no doubt impress other fliers, but
having "originality" in the rules as a criterion might eventually have
more to do with the knowledge of the judge than with the ability of the
fliers.
The way compulsory figures are defined is another example. The
definitions of figures, and the figures themselves, have changed over
the years, and it seems not for the better. It must be so, I think,
because over the last years (I kept track for 13 years) the differences
in scoring of compulsories have grown steadily and considerably. To
compare two comparable events: the World Cup in Long Beach, USA in 1998
showed a maximum difference of 20 points (one compulsory by one team,
two different judges); in the team event in Berck, France this year
(2004) it was 48 (and it was just as bad at the Euro cup this year). Of
course, part of the problem is the diminishing time that is spent on
actually discussing compulsories and rules amongst judges at big events:
from more than 20 hours in Guadeloupe and about 15 in Long Beach to
barely 3 hours at the Euro cup.
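
The "maximum difference" figure quoted above (one compulsory, one team, two different judges) could be computed along the following lines. This is a sketch under assumptions: the teams, figures, judges and scores are invented and are not the Long Beach or Berck results.

```python
# Hypothetical compulsory scores, invented for illustration:
# (team, figure) -> {judge: score}.
compulsory_scores = {
    ("Team 1", "Figure 1"): {"Judge A": 62, "Judge B": 70, "Judge C": 66},
    ("Team 1", "Figure 2"): {"Judge A": 55, "Judge B": 58, "Judge C": 81},
    ("Team 2", "Figure 1"): {"Judge A": 74, "Judge B": 71, "Judge C": 78},
}


def score_range(judge_scores):
    """Return (difference, lowest judge, highest judge) for one figure."""
    low_judge = min(judge_scores, key=judge_scores.get)
    high_judge = max(judge_scores, key=judge_scores.get)
    difference = judge_scores[high_judge] - judge_scores[low_judge]
    return difference, low_judge, high_judge


def largest_disagreement(all_scores):
    """Find the (team, figure) entry on which two judges differ the most."""
    entry = max(all_scores, key=lambda key: score_range(all_scores[key])[0])
    return entry, score_range(all_scores[entry])


entry, (difference, low_judge, high_judge) = largest_disagreement(compulsory_scores)
print(f"{entry}: {difference} points between {low_judge} and {high_judge}")
```

Tracking this single number from event to event is crude, but it points straight at the figure and the pair of judges that should be discussed at the next judges' meeting.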
Monitoring the
quality of judging might not be so interesting for competitors. The
competitors may just need to trust that the judges will declare the
best flier as number one. Other, older, judged sports show that when
that trust is lost, establishing the actual quality of judging is
difficult, certainly if it has not been done seriously before. Judging
kite acrobatics is about as difficult as it can get (in team ballet:
full '3D', 3 or more kites, five minutes totally free, no prescribed
structure or format, no previous knowledge) but the end result of that
judging is simply a list of competitors, the best performance on top.
When
I started this analysis in 1994, the main reason to do so was to assure
other judges that their fear of having given their friends or fellow
countrymen an unfair advantage was unjustified. In almost all cases,
judges are too strict with the people or routines they know very well,
and only a very few actually show any bias. Over the last years I have
seen a gradual change in this (in Europe). The lack of exchange between
countries (a lack of international competition) has not given judges
enough opportunity to compare different styles and ideas, and more and
more they start to see differences in style and ideas as differences in
quality.
Judges (both the
flier/judges and the ones who "just" judge) should keep an eye on each
other, outside the field of course, to maintain and improve the quality
of judging. Analysis of scores, good judges meetings and debriefings,
and talking to competitors will help. Maybe even flying a kite every
now and then might help.
Best winds,
Hans Jansen op de Haar
P.s. - For those interested in the 'statistical' analysis of scores, a
spreadsheet with explanations, containing the public scores of the
Berck 2004 event as an example, is available; just drop me a message by
clicking on my name above.
P.p.s - I
have been in kite acrobatics since 1988, first as team pilot (Dike
Hoppers), and since 1991 as judge. I have judged thousands of routines
and even a lot more figures. Over the last 35 years I have been
interested in cognitive and design-processes, intuitive reasoning, and
artificial intelligence. In my former job as building cost engineer
(the actual Dutch profession title and position are hard to translate)
being able to analyze numbers is essential. The main drive to put these
thoughts toward improving judging and competition is, of course,
friendship!
A longtime contributor to Kitelife and STACK panel member, Hans
was selected as a judge for the 1994, 1995 and 1996 World Cups, and
as Chief Judge at the 1997 and 1998 World Cups. You can visit his
home page here:
http://hans.kitesonlines.org/