Datafilos: How to design and evaluate tests in schools

There two ways how to end up with high quality products. Either you invest more energy into a design phase, or into a test phase. Following paragraphs describe the second approach.

Premise: If you replace paper-based tests with electronic tests, you can easily analyze the student's responses.

The first trivial analysis is to rank the questions based on the students' average success rate on the question. By looking at the bottom 10% of questions, you may identify badly formulated questions or flawed answer options. Alternatively, the topic may have not be thoroughly explained and practiced. Similarly, you may look at the top 10% of questions and identify accidentally too easy questions.

The second analysis is based on correlation coefficients. Pearson correlation (although we should call it point biserial correlation) is calculated for each question between the vector of students' responses on that question and the students' overall gain of points from the test. Ideally, each question would highly correlate with the overall test score. If we sort the questions in the descending order and look at the bottom 10% of questions, we may identify irrelevant, too easy, too hard or simply flawed questions.

An example of a irrelevant question is calculation of Euclidean distance between 2 points in 3 dimensional space - each university student should already know how to calculate that. But this question could have been missed by the first analysis simply because of frequent numerical errors.

On the other end, an example of a relevant question is computation of Chebyshev distance - it is unlikely that students have ever heard about it before the course. Plus, it is easier to compute than Euclidean distance. Hence, the effect of numerical errors is minimized.

After identifying 10% of the worst questions, give students free points for these questions. It will make students happy for following reasons:

If a test consists of more than a single question, students will almost inevitably complain, that some of the questions were too difficult. Not counting the toughest questions solves the problem.
If a test is long, students will almost inevitably complain, that something was not properly covered. Not counting the toughest questions solves the problem.
The correction may improve students' scores, but never hurt.

Professors like it because:

They may dedicate the test creation to unskilled workforce (read: teacher assistants), as the questions do not have to perfect. The students themselves will identify troublesome questions and the system will deal with them.
It is a systematic (and proactive) approach how deal with students' complains about the tests.
As a side product, it allows generation of a large set of relevant, correct and appropriately difficult test questions, which may work as poor man's replacement for absent/obsolete textbooks.

Summary: The described approach is suitable for fast evolving domains like machine learning (or any computer science related domain), where battle-field tested questions are still not available or are already obsolete.

Datafilos

sobota 27. srpna 2016

How to design and evaluate tests in schools

Žádné komentáře:

Okomentovat