*For the code, see GitHub.*

Recently I stumbled upon a problem while working on benchmarking machine learning methods. I wanted to do cross-validation (CV) on different methods, but some were implemented in R through certain libraries, and some were implemented in Python through scikit-learn. The problem was that in order to have proper cross-validation, it had to be possible to create the same train and test folds in both languages.

A possible solution to this problem is to explicitly generate all folds in one language, and then save them in data files to be loaded from the other language. Although possible, this would lead to a lot of redundant datasets, and in my case disk space was getting a bit scarce. So instead I opted for the alternative: generating the CV folds in both languages using the same random numbers.

Both R and NumPy (which scikit-learn uses underneath) implement the Mersenne Twister random number generator (RNG). However, the implementations differ in such a way that setting the same seed does not yield the same sequence of random numbers in both languages. Unfortunately for me, that meant I had to provide my own RNG for both languages.

Initially I started by implementing the Mersenne Twister RNG in R and Python separately. This soon gave me problems, because bitwise operations on integers in R can lead to NA values being generated. So instead, I chose to implement the RNG in C, and link this to both R and Python. Because I wanted fast random number generation and a small state vector, I swapped the Mersenne Twister for a Tausworthe RNG, which is a very efficient random number generator with a state of only 4 integers. The implementation I chose comes straight from Pierre L'Ecuyer's 1999 paper entitled *Tables of Maximally Equidistributed Combined LFSR Generators*.
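
To give a feel for how small such a generator is, here is a minimal Python sketch of a four-component combined Tausworthe step. The shift and mask constants are the ones usually quoted for the LFSR113 generator from that paper and should be checked against it before relying on them. The `seed_state` helper is a hypothetical placeholder for seeding, and none of this is taken from the SyncRNG source, so the numbers it produces should not be expected to match SyncRNG's output.

```python
MASK32 = 0xFFFFFFFF  # emulate C's 32-bit unsigned arithmetic


def seed_state(seed):
    """Hypothetical seeding (an assumption, not SyncRNG's scheme).

    Derives four 32-bit words with a simple LCG and enforces the minimum
    values that the four components are usually said to require.
    """
    state = []
    x = seed & MASK32
    for minimum in (2, 8, 16, 128):
        x = (1664525 * x + 1013904223) & MASK32
        state.append(x if x >= minimum else x + minimum)
    return state


def taus_step(state):
    """Advance the four-word state in place and return a 32-bit integer."""
    z1, z2, z3, z4 = state
    b = (((z1 << 6) & MASK32) ^ z1) >> 13
    z1 = (((z1 & 0xFFFFFFFE) << 18) & MASK32) ^ b
    b = (((z2 << 2) & MASK32) ^ z2) >> 27
    z2 = (((z2 & 0xFFFFFFF8) << 2) & MASK32) ^ b
    b = (((z3 << 13) & MASK32) ^ z3) >> 21
    z3 = (((z3 & 0xFFFFFFF0) << 7) & MASK32) ^ b
    b = (((z4 << 3) & MASK32) ^ z4) >> 12
    z4 = (((z4 & 0xFFFFFF80) << 13) & MASK32) ^ b
    state[:] = [z1, z2, z3, z4]
    # The output is the XOR of the four component generators.
    return z1 ^ z2 ^ z3 ^ z4


state = seed_state(123)
numbers = [taus_step(state) for _ in range(5)]
```

Every intermediate result is masked to 32 bits, which is what unsigned integers do automatically in C. That explicit masking is the part that is awkward to reproduce with R's signed integers.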

I implemented the Tausworthe RNG in C, and used the same C file to write the R
and Python interfaces to the RNG. This way, I could create both an R library
and a Python package without writing the C code twice. The different sections
of the C file are separated using a compile macro, so the R library doesn’t
contain the Python code, and vice versa. The Python interface contains a few
more defines to support both Python 2 and 3. The resulting package was dubbed
*SyncRNG*.

The Python package implements a *SyncRNG* class and the R library implements a
*SyncRNG* reference class, to support comparable usage of the RNG. For
instance, in Python the package can be used as follows:

```
>>> from SyncRNG import SyncRNG
>>> s = SyncRNG(seed=123)
>>> r = s.randi()
```

and in R:

```
> library(SyncRNG)
> s <- SyncRNG(seed=123)
> r <- s$randi()
```

As you can see, usage of the classes is syntactically very similar. In both
languages a shuffle method is implemented for the *SyncRNG* class, which
returns a shuffled copy of a supplied vector. I can then use this to
shuffle a list of indices, which in turn can be used to create train and test
folds for cross-validation. Because the random numbers are guaranteed to be
the same, I can be confident that the CV folds will be the same too, which
solves my original problem.
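
As a rough illustration of that last step, the sketch below turns a shuffled index list into train and test folds. The `make_folds` function and its fold-assignment rule (every `n_folds`-th shuffled index goes to the same test fold) are assumptions made for this example, not part of SyncRNG. The `shuffle` argument can be any callable that returns a shuffled copy of a list, such as the shuffle method described above.

```python
def make_folds(n_samples, n_folds, shuffle):
    """Return a list of (train, test) index lists, one pair per fold.

    `shuffle` is any callable that returns a shuffled copy of a list.
    """
    indices = shuffle(list(range(n_samples)))
    folds = []
    for k in range(n_folds):
        # Every n_folds-th shuffled index, starting at k, forms test fold k.
        test = indices[k::n_folds]
        test_set = set(test)
        train = [i for i in indices if i not in test_set]
        folds.append((train, test))
    return folds


# Stand-in shuffle so the example runs without SyncRNG installed.
# With SyncRNG one would pass something like SyncRNG(seed=123).shuffle.
folds = make_folds(10, 3, lambda xs: xs[::-1])
```

For the folds to match across languages, the R side has to apply the same assignment rule to the shuffled indices, keeping in mind that R vectors are 1-based.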

You can install *SyncRNG* in Python through pip, and in R through CRAN. For
the source code, see GitHub.