For the code, see GitHub.
Recently I stumbled upon a problem while working on benchmarking machine learning methods. I wanted to do cross-validation (CV) on different methods, but some were implemented in R through certain libraries, and some were implemented in Python through scikit-learn. The problem was that in order to have proper cross-validation, it had to be possible to create the same train and test folds in both languages.
One possible solution is to generate all folds explicitly in one language and save them in data files to be loaded from the other language. This works, but it leads to a lot of redundant datasets, and in my case disk space was getting a bit scarce. So I opted for the alternative: generating the CV folds in both languages using the same random numbers.
Both R and NumPy (which scikit-learn uses underneath) implement the Mersenne-Twister random number generator (RNG). However, the implementations differ such that setting the same seed does not yield the same sequence of random numbers in both languages. Unfortunately for me, that meant I had to implement my own RNG in both languages separately.
Initially I tried implementing the Mersenne-Twister RNG separately in R and in Python. This soon gave me problems, because bitwise operations on integers in R can produce NA values. So instead, I chose to implement the RNG in C and link it to both R and Python. Because I wanted fast random number generation and a small state vector, I swapped the Mersenne-Twister for a Tausworthe RNG, which is a very efficient random number generator with a state of only 4 integers. The implementation I chose comes straight from Pierre L’Ecuyer’s 1999 paper entitled Tables of Maximally Equidistributed Combined LFSR Generators.
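To give an idea of how small such a generator is, here is a minimal Python sketch of a four-component combined Tausworthe generator. It assumes the LFSR113 variant from L’Ecuyer’s paper, with the shift and mask constants as I know them for that generator. The class name Taus113 and the way the four state words are passed in directly are hypothetical choices for illustration; this is not the SyncRNG C code, and it does not reproduce SyncRNG's seeding or output.

```python
MASK32 = 0xFFFFFFFF  # emulate C's 32-bit unsigned arithmetic with Python's big ints


class Taus113:
    """Illustrative LFSR113-style generator (hypothetical class, not SyncRNG)."""

    def __init__(self, z1, z2, z3, z4):
        state = [z & MASK32 for z in (z1, z2, z3, z4)]
        # Each component needs a minimum value; otherwise the bits that
        # survive its mask are all zero and the component stays stuck at 0.
        minimums = (2, 8, 16, 128)
        if any(z < m for z, m in zip(state, minimums)):
            raise ValueError("state words must be >= 2, 8, 16 and 128")
        self.state = state

    def randi(self):
        """Advance the four components and return a 32-bit unsigned integer."""
        z1, z2, z3, z4 = self.state
        b = (((z1 << 6) & MASK32) ^ z1) >> 13
        z1 = (((z1 & 0xFFFFFFFE) << 18) & MASK32) ^ b
        b = (((z2 << 2) & MASK32) ^ z2) >> 27
        z2 = (((z2 & 0xFFFFFFF8) << 2) & MASK32) ^ b
        b = (((z3 << 13) & MASK32) ^ z3) >> 21
        z3 = (((z3 & 0xFFFFFFF0) << 7) & MASK32) ^ b
        b = (((z4 << 3) & MASK32) ^ z4) >> 12
        z4 = (((z4 & 0xFFFFFF80) << 13) & MASK32) ^ b
        self.state = [z1, z2, z3, z4]
        # The output is the XOR of the four component states.
        return z1 ^ z2 ^ z3 ^ z4


if __name__ == "__main__":
    rng = Taus113(12345, 12345, 12345, 12345)
    print([rng.randi() for _ in range(3)])
```

Everything here is shifts, masks and XORs on 32-bit words, which is exactly the kind of arithmetic that is natural in C and awkward in R.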
I implemented the Tausworthe RNG in C, and used the same C file to write the R and Python interfaces to the RNG. This way, I could create both an R library and a Python package without writing the C code twice. The different sections of the C file are separated using a compile-time macro, so the R library doesn’t contain the Python code, and vice versa. The Python interface contains a few more defines to support both Python 2 and 3. The resulting package was dubbed SyncRNG.
The Python package implements a SyncRNG class and the R library implements a SyncRNG reference class, to support comparable usage of the RNG. For instance, in Python the package can be used as follows:
>>> from SyncRNG import SyncRNG
>>> s = SyncRNG(seed=123)
>>> r = s.randi()
and in R:
> library(SyncRNG)
> s <- SyncRNG(seed=123)
> r <- s$randi()
As you can see, usage of the classes is syntactically very similar. In both languages the SyncRNG class also has a shuffle method, which returns a shuffled copy of a supplied vector. I can use this to shuffle a list of indices, which in turn can be used to create train and test folds for cross-validation. Because the random numbers are guaranteed to be the same in both languages, I can be confident that the CV folds will be too, which solves my original problem.
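The step from a shuffled index list to folds can be sketched as follows. This is a minimal illustration under stated assumptions, not the SyncRNG implementation: ToyRNG is a hypothetical stand-in (a plain 32-bit linear congruential generator) so that the example runs without the package, and the Fisher-Yates shuffle and the helper names shuffle and cv_folds are my own choices for this sketch. The important point is that everything uses integer arithmetic only, so the same logic gives the same folds in any language that sees the same integer stream.

```python
class ToyRNG:
    """Hypothetical stand-in RNG (a simple 32-bit LCG), NOT SyncRNG."""

    def __init__(self, seed):
        self.state = seed & 0xFFFFFFFF

    def randi(self):
        # Integer-only update, so the stream is trivially reproducible elsewhere.
        self.state = (1664525 * self.state + 1013904223) & 0xFFFFFFFF
        return self.state


def shuffle(rng, items):
    """Return a shuffled copy of items using a Fisher-Yates shuffle."""
    out = list(items)
    for i in range(len(out) - 1, 0, -1):
        # Modulo has a slight bias, but it is deterministic given the draws.
        j = rng.randi() % (i + 1)
        out[i], out[j] = out[j], out[i]
    return out


def cv_folds(n_samples, n_folds, seed):
    """Return a list of (train_indices, test_indices) pairs, one per fold."""
    order = shuffle(ToyRNG(seed), range(n_samples))
    # Deal the shuffled indices round-robin into n_folds test sets.
    test_sets = [order[k::n_folds] for k in range(n_folds)]
    folds = []
    for k, test in enumerate(test_sets):
        train = [i for m, other in enumerate(test_sets) if m != k for i in other]
        folds.append((sorted(train), sorted(test)))
    return folds


if __name__ == "__main__":
    for train, test in cv_folds(n_samples=10, n_folds=3, seed=123):
        print("train:", train, "test:", test)
```

With the real package, the ToyRNG instance would be replaced by a SyncRNG object seeded identically in Python and R, and the index list would be shuffled with its own shuffle method.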
You can install SyncRNG in Python through pip, and in R through CRAN. For the source code, see GitHub.