Approximate randomization tests

Introduction

This package provides a library and utilities to perform paired and unpaired approximate randomization tests. These can be used to test if two samples differ significantly without assuming an underlying distribution.

The utilities can perform randomization tests and draw histograms of the test statistic for the randomized samples.

Download

Downloads for Windows, Mac OS X, and Linux will be available soon. If you feel adventurous, you can use the instructions below to compile the software yourself.

Usage

The package provides two command-line utilities: approx_rand_test and approx_rand_test_paired. Both provide nearly the same options; the former performs unpaired tests, the latter paired tests.

The format for samples is simple: one value per line. Three samples are provided in the examples directory (ngram.scores, fluency.scores, and reversible.scores); they contain evaluation scores of fluency ranking components.
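
For example, a sample file is just a list of scores, one per line. The values below are purely illustrative and are not taken from the example files:

0.4532
0.4810
0.4675
0.4988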

Running a test

We can now use the paired test utility to see that the evaluation scores of the n-gram language model and the feature-based fluency model differ significantly:

% approx_rand_test_paired -i 10000 -p 0.05 examples/ngram.scores examples/fluency.scores
Iterations: 10000
Sample size: 1621
Test statistic: -0.030646088066079557
Test type: TwoTailed
Test significance: 0.05
Tail significance: 0.025
Significant: 0.00009999000099990002

Here we generate 10,000 shuffled samples and test at a significance level of p = 0.05. Likewise, we can compare the scores of the feature-based fluency model and the reversible model:

% approx_rand_test_paired -i 10000 -p 0.05 examples/fluency.scores examples/reversible.scores 
Iterations: 10000
Sample size: 1621
Test statistic: 0.0032431465344367667
Test type: TwoTailed
Test significance: 0.05
Tail significance: 0.025
Not significant: 0.0273972602739726

In this case, the samples do not differ significantly.
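
The values printed on the last line appear to match the (r + 1) / (N + 1) ratio described in the Background section below. Under that assumption, the two runs above work out as:

(0 + 1) / (10000 + 1) ≈ 0.0001, which is below the tail significance of 0.025
(273 + 1) / (10000 + 1) ≈ 0.0274, which is above it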

Visualizing test score distributions

Both utilities can also draw the distribution of test statistics for the randomized samples, and show how it relates to the test statistic of the original samples:

% approx_rand_test_paired -h -i 10000 -p 0.05 examples/fluency.scores examples/reversible.scores 
Iterations: 10000
Sample size: 1621
Test type: TwoTailed
Test significance: 0.05
Tail significance: 0.025
Test statistic: 0.0032431465344367667
Not significant: 0.025997400259974

   -4.402e-3 | █
   -3.794e-3 | ████
   -3.187e-3 | ████████
   -2.579e-3 | ███████████████
   -1.972e-3 | █████████████████████████
   -1.364e-3 | ███████████████████████████████████
   -7.567e-4 | ███████████████████████████████████████████████
   -1.492e-4 | ███████████████████████████████████████████████████
    4.583e-4 | ██████████████████████████████████████████████████
    1.066e-3 | ███████████████████████████████████████
    1.673e-3 | ███████████████████████████████
    2.281e-3 | ██████████████████████
    2.888e-3 | ███████████
    3.496e-3 | ██████
    4.103e-3 | ██
    4.711e-3 | █

Or, if you prefer, you can create a chart in a format such as SVG:

% approx_rand_test_paired -w chart.svg -i 10000 -p 0.05 examples/fluency.scores examples/reversible.scores

Compiling from source

Install the latest version of the Haskell Platform and run:

cabal update
cabal install

The command-line utilities also have support for writing histograms to chart files, as shown above with the -w option. By default, this uses the diagrams backend, which can produce EPS and SVG files. The Cairo backend can produce more file types (such as PNG and PDF), but may be more difficult to install on some systems. To compile the package with Cairo support, use:

cabal install -fwithCairo

Background

Approximate randomization tests rely on a simple premise: given a test statistic, if the null hypothesis holds (the two samples do not differ), we can randomly swap values between the samples without drastically changing the test statistic.

The test works by generating a given number of shuffles of the samples and computing the test statistic for each shuffle. Let r be the number of shuffled samples for which the test statistic is at least as high as the test statistic computed on the original samples, and let N be the number of shuffles. The null hypothesis is then rejected iff (r + 1) / (N + 1) < p (for one-sided tests).
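
The following Haskell sketch illustrates this procedure for the paired, one-sided case, using the difference of means as the test statistic. It is a minimal illustration of the technique under those assumptions, not the implementation used by this package:

-- A minimal sketch of a one-sided paired approximate randomization test,
-- using the difference of means as the test statistic. This illustrates
-- the procedure described above; it is not the package's implementation.
import Control.Monad (replicateM)
import System.Random (randomIO)

-- Difference of means between two equally sized samples.
meanDiff :: [Double] -> [Double] -> Double
meanDiff xs ys = mean xs - mean ys
  where mean zs = sum zs / fromIntegral (length zs)

-- One shuffle: swap each pair of values with probability 0.5.
shufflePairs :: [(Double, Double)] -> IO [(Double, Double)]
shufflePairs = mapM swapMaybe
  where
    swapMaybe (x, y) = do
      b <- randomIO
      return (if b then (y, x) else (x, y))

-- Reject the null hypothesis iff (r + 1) / (n + 1) < p, where r is the
-- number of shuffles whose test statistic is at least as high as that
-- of the original samples.
approxRandTestPaired :: Int -> Double -> [Double] -> [Double] -> IO Bool
approxRandTestPaired n p xs ys = do
  let t0 = meanDiff xs ys
  stats <- replicateM n $ do
    (xs', ys') <- unzip <$> shufflePairs (zip xs ys)
    return (meanDiff xs' ys')
  let r = length (filter (>= t0) stats)
  return (fromIntegral (r + 1) / fromIntegral (n + 1) < p)

For instance, approxRandTestPaired 10000 0.05 sample1 sample2 would perform 10,000 shuffles and report whether the null hypothesis is rejected at p = 0.05.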