I am broadly interested in the development and application of machine learning methods. Currently I focus on developing AI-based tools for data wrangling, in an effort to automate the tedious tasks of data preparation and data cleaning that often precede a machine learning analysis. My past research has been on multiclass classification — typically involving SVMs — as well as meta-learning and hierarchical classifier design. I have also worked on regularized regression methods, where I am mainly interested in optimization algorithms for non-convex problems. On the more practical side, I have developed software packages for most of my research projects, as well as a command-line tool to automate benchmarking of machine learning methods on distributed architectures.
- Wrangling Messy CSV Files by Detecting Row and Type Patterns (HTML; PDF)
G. J. J. van den Burg, A. Nazabal, and C. Sutton. Data Mining and Knowledge Discovery, 2019.
Abstract: Data scientists spend the majority of their time on preparing data for analysis. One of the first steps in this preparation phase is to load the data from the raw storage format. Comma-separated value (CSV) files are a popular format for tabular data due to their simplicity and ostensible ease of use. However, formatting standards for CSV files are not followed consistently, so each file requires manual inspection and potentially repair before the data can be loaded, an enormous waste of human effort for a task that should be one of the simplest parts of data science. The first and most essential step in retrieving data from CSV files is deciding on the dialect of the file, such as the cell delimiter and quote character. Existing dialect detection approaches are few and non-robust. In this paper, we propose a dialect detection method based on a novel measure of data consistency of parsed data files. Our method achieves 97% overall accuracy on a large corpus of real-world CSV files and improves the accuracy on messy CSV files by almost 22% compared to existing approaches, including those in the Python standard library. Our measure of data consistency is not specific to the data parsing problem, and has potential for more general applicability.
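A toy version of this consistency idea can be sketched in a few lines: try each candidate dialect, parse the file with it, and prefer the parse that yields many columns of uniform length. This is an illustration only, not CleverCSV's actual implementation; the scoring function and candidate sets below are invented for the example.

```python
import csv
import io

def row_length_consistency(text, delimiter, quotechar):
    """Parse `text` with the given dialect and score how regular the result is.

    A toy stand-in for the paper's data-consistency measure: favour parses
    that yield many columns with identical row lengths.
    """
    rows = list(csv.reader(io.StringIO(text), delimiter=delimiter, quotechar=quotechar))
    if not rows:
        return 0.0
    lengths = [len(r) for r in rows]
    modal = max(set(lengths), key=lengths.count)
    uniformity = lengths.count(modal) / len(lengths)
    return uniformity * modal  # reward consistent, multi-column parses

def detect_dialect(text, delimiters=",;|\t", quotechars="\"'"):
    """Return the (delimiter, quotechar) pair with the highest score."""
    candidates = [(d, q) for d in delimiters for q in quotechars]
    return max(candidates, key=lambda dq: row_length_consistency(text, *dq))

messy = "a;b;c\n1;2;3\n4;5;6\n"
print(detect_dialect(messy))  # -> (';', '"')
```

A real detector has to handle far messier inputs (quoted delimiters, ragged rows, comment lines), which is where a principled consistency measure over both row and type patterns matters.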
- GenSVM: A Generalized Multiclass Support Vector Machine (PDF)
G. J. J. van den Burg and P. J. F. Groenen. Journal of Machine Learning Research, 17(224):1–42, 2016.
Abstract: Traditional extensions of the binary support vector machine (SVM) to multiclass problems are either heuristics or require solving a large dual optimization problem. Here, a generalized multiclass SVM is proposed called GenSVM. In this method classification boundaries for a K-class problem are constructed in a (K−1)-dimensional space using a simplex encoding. Additionally, several different weightings of the misclassification errors are incorporated in the loss function, such that it generalizes three existing multiclass SVMs through a single optimization problem. An iterative majorization algorithm is derived that solves the optimization problem without the need of a dual formulation. This algorithm has the advantage that it can use warm starts during cross validation and during a grid search, which significantly speeds up the training phase. Rigorous numerical experiments compare linear GenSVM with seven existing multiclass SVMs on both small and large data sets. These comparisons show that the proposed method is competitive with existing methods in both predictive accuracy and training time, and that it significantly outperforms several existing methods on these criteria.
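The simplex encoding at the heart of the method places the K classes at the vertices of a regular simplex in K−1 dimensions, so that all classes are mutually equidistant. One standard construction — a sketch for intuition; the paper's exact encoding matrix may differ — is to centre the K standard basis vectors of R^K and express them in an orthonormal basis of the resulting (K−1)-dimensional hyperplane:

```python
import numpy as np

def simplex_vertices(K):
    """Return K points in R^(K-1) forming a regular simplex.

    Construction: centre the K standard basis vectors of R^K so they lie in
    a (K-1)-dimensional hyperplane, then express them in an orthonormal
    basis of that hyperplane (obtained from the SVD).
    """
    E = np.eye(K)
    C = E - E.mean(axis=0)              # centred basis vectors, rank K-1
    U, s, Vt = np.linalg.svd(C, full_matrices=False)
    return C @ Vt[:K - 1].T             # coordinates, shape (K, K-1)

V = simplex_vertices(4)
d = np.linalg.norm(V[0] - V[1])         # equals sqrt(2) for every pair
```

Because every pair of vertices is the same distance apart, no class is structurally favoured, which is what makes the single joint optimization over all K classes well-posed.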
- Fast Meta-Learning for Adaptive Hierarchical Classifier Design (PDF)
G. J. J. van den Burg and A. O. Hero. arXiv preprint 1711.03512, 2017. Code: Python.
Abstract: We propose a new splitting criterion for a meta-learning approach to multiclass classifier design that adaptively merges the classes into a tree-structured hierarchy of increasingly difficult binary classification problems. The classification tree is constructed from empirical estimates of the Henze-Penrose bounds on the pairwise Bayes misclassification rates that rank the binary subproblems in terms of difficulty of classification. The proposed empirical estimates of the Bayes error rate are computed from the minimal spanning tree (MST) of the samples from each pair of classes. Moreover, a meta-learning technique is presented for quantifying the one-vs-rest Bayes error rate for each individual class from a single MST on the entire dataset. Extensive simulations on benchmark datasets show that the proposed hierarchical method can often be learned much faster than competing methods, while achieving competitive accuracy.
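The Friedman–Rafsky statistic underlying these Henze–Penrose estimates can be computed with off-the-shelf tools: build the Euclidean MST over the pooled sample from two classes and count the edges that join points from different classes. A minimal sketch (the function name and details are illustrative, not the paper's code):

```python
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree
from scipy.spatial.distance import pdist, squareform

def friedman_rafsky_cross_edges(X, y):
    """Count MST edges that connect samples from different classes.

    Builds the Euclidean MST over the pooled sample and counts the
    cross-class edges -- the Friedman-Rafsky statistic that underlies
    the Henze-Penrose divergence estimate.
    """
    D = squareform(pdist(X))                # dense pairwise distances
    mst = minimum_spanning_tree(D).tocoo()  # the N-1 MST edges
    return int(sum(y[i] != y[j] for i, j in zip(mst.row, mst.col)))
```

Well-separated classes yield very few cross edges, while heavily overlapping classes yield many; normalizing this count gives the divergence estimate that ranks the binary subproblems by difficulty.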
- SparseStep: Approximating the Counting Norm for Sparse Regularization (PDF)
G. J. J. van den Burg, P. J. F. Groenen, and A. Alfons. arXiv preprint 1701.06967, 2017. Code: R.
Abstract: The SparseStep algorithm is presented for the estimation of a sparse parameter vector in the linear regression problem. The algorithm works by adding an approximation of the exact counting norm as a constraint on the model parameters and iteratively strengthening this approximation to arrive at a sparse solution. Theoretical analysis of the penalty function shows that the estimator yields unbiased estimates of the parameter vector. An iterative majorization algorithm is derived which has a straightforward implementation reminiscent of ridge regression. In addition, the SparseStep algorithm is compared with similar methods through a rigorous simulation study which shows it often outperforms existing methods in both model fit and prediction accuracy.
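The majorization idea can be sketched schematically: approximate the counting norm by the smooth surrogate Σ_j β_j²/(β_j² + γ²), majorize each term by a quadratic so that every inner update is a weighted ridge regression, and shrink γ between outer iterations to tighten the approximation. The sketch below is illustrative only — the parameter names, schedules, and stopping rules are invented for the example, not the paper's exact algorithm:

```python
import numpy as np

def sparsestep_sketch(X, y, lam=1.0, gamma0=10.0, shrink=0.5, n_outer=20, tol=1e-8):
    """Schematic graduated-nonconvexity loop for sparse regression.

    Approximates the counting (L0) norm by sum_j b_j^2 / (b_j^2 + gamma^2),
    majorizes it by a weighted ridge penalty, and tightens the approximation
    by shrinking gamma. Illustrative; see the paper for the exact updates.
    """
    n, p = X.shape
    beta = np.zeros(p)
    gamma = gamma0
    XtX, Xty = X.T @ X, X.T @ y
    for _ in range(n_outer):
        for _ in range(50):  # inner majorization iterations
            # Quadratic majorizer weights for b^2 / (b^2 + gamma^2):
            w = gamma ** 2 / (beta ** 2 + gamma ** 2) ** 2
            beta_new = np.linalg.solve(XtX + lam * np.diag(w), Xty)
            converged = np.max(np.abs(beta_new - beta)) < tol
            beta = beta_new
            if converged:
                break
        gamma *= shrink  # graduated nonconvexity: tighten the L0 approximation
    return beta
```

As γ shrinks, near-zero coefficients are penalized ever more heavily and are driven to exactly (numerically) zero, while large coefficients see a vanishing penalty — which is the intuition behind the unbiasedness result mentioned in the abstract.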
- Algorithms for Multiclass Classification and Regularized Regression (PDF)
G. J. J. van den Burg. PhD dissertation, Erasmus University Rotterdam, 2018.
Abstract:
Multiclass classification and regularized regression problems are very common in modern statistical and machine learning applications. On the one hand, multiclass classification problems require the prediction of class labels: given observations of objects that belong to certain classes, can we predict to which class a new object belongs? On the other hand, the regularized regression problem is a variation of the common regression problem, which measures how changes in independent variables influence an observed outcome. In regularized regression, constraints are placed on the coefficients of the regression model to enforce certain properties in the solution, such as sparsity or limited size.
In this dissertation several new algorithms are presented for both multiclass classification and regularized regression problems. For multiclass classification the GenSVM method is presented. This method extends the binary support vector machine to multiclass classification problems in a way that is both flexible and general, while maintaining competitive performance and training time. In a different chapter, accurate estimates of the Bayes error are applied to both meta-learning and the construction of so-called classification hierarchies: structures in which a multiclass classification problem is decomposed into several binary classification problems.
For regularized regression problems a new algorithm is presented in two parts: first for the sparse regression problem and second as a general algorithm for regularized regression where the regularization function is a measure of the size of the coefficients. In the proposed algorithm graduated nonconvexity is used to slowly introduce the nonconvexity in the problem while iterating towards a solution. The empirical performance and theoretical convergence properties of the algorithm are analyzed with numerical experiments that demonstrate the ability for the algorithm to obtain globally optimal solutions.
I aim to make my research accessible by providing software packages for the methods I develop.
- CleverCSV. Implements the method from this paper. PyPI - GitHub.
- SmartSVM. Implements the SmartSVM classifier from this paper. PyPI - GitHub.
- SparseStep. Implements the SparseStep method from this paper. CRAN - GitHub.
- GenSVM. Implements the GenSVM method from this paper. PyPI - CRAN - GitHub.
- Abed. Tool for benchmarking ML methods on compute clusters. PyPI - GitHub.
- SyncRNG. The same random numbers in R and Python. CRAN - PyPI - GitHub.
- Programming – part-time lecturer; set up and pioneered the use of Autolab for this course (2015, 2016)
- Supervised two MSc thesis students in Econometrics, among whom:
- G. van Rooij, Clustering Stores of Retailers via Consumer Behavior, 2017.
- Supervised four BSc thesis students in Econometrics, among whom:
- L.W. Hoogenboom, Recommender System Optimization through Collaborative Filtering, 2016.
- E.L.J. Mathol, Neighborhood-based Collaborative Filtering: Providing the Best Recommendations, 2016.
- M.L. Jongsma, Categorised Neighborhood-based Collaborative Filtering, 2016.