I would classify myself as a data scientist because my research is always problem (data) driven. My guiding principle in research is to invest in statistical tools that, once developed, can be applied to solve real problems. Pure theories, however elegant, are not of much interest to me. I believe in collaborative work, with everyone contributing their best skills. I like to work on topics that are fairly new and where little research has been done so far. In such areas, I feel that imagination and creativity are the only limits.

I very much associate myself with the following approach to data analysis:

                        "R for Data Science" by Hadley Wickham and Garrett Grolemund (2016)

Data visualisation plays a large part in my research. I visualise the data in many different ways to understand it better, and I rely heavily on plots to motivate my analysis and communicate the results. I often need to come up with an innovative way to represent my findings.

When it comes to modelling, I use a variety of non-parametric and data mining techniques. I also adapt ideas and approaches from existing methods to develop methodologies for solving new problems.

During my studies I worked in a number of areas:

  • My bachelor's senior project was an application of the Gale-Shapley matching algorithm to the assignment of students to the Technion dormitories.
  • In my master's thesis I studied the relationship between government expenditure and Gross Domestic Product using asymmetric cointegration models.
  • During my PhD my research included kernel conditional density estimation and its potential applications, nonparametric modelling and computational statistics.
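
For readers unfamiliar with the Gale-Shapley algorithm mentioned in the first item, here is a minimal sketch of the textbook student-proposing version with dormitory capacities. It is only an illustration: it does not include the entrance criterion studied in the published paper, and the function name, argument names and toy data are all hypothetical.

```python
from collections import deque


def deferred_acceptance(student_prefs, dorm_prefs, capacity):
    """Student-proposing Gale-Shapley with dorm capacities (textbook sketch).

    student_prefs: dict student -> list of dorms, most preferred first
    dorm_prefs:    dict dorm -> list of students, most preferred first
    capacity:      dict dorm -> number of places
    Returns dict student -> dorm; unmatched students are omitted.
    """
    # rank[d][s] is the position of student s in dorm d's list (lower is better).
    rank = {d: {s: i for i, s in enumerate(p)} for d, p in dorm_prefs.items()}
    next_choice = {s: 0 for s in student_prefs}  # index of the next dorm to try
    held = {d: [] for d in dorm_prefs}           # students tentatively accepted
    free = deque(student_prefs)                  # students who still propose

    while free:
        s = free.popleft()
        if next_choice[s] >= len(student_prefs[s]):
            continue                             # s has exhausted the list
        d = student_prefs[s][next_choice[s]]
        next_choice[s] += 1
        if s not in rank[d]:
            free.append(s)                       # d does not list s; try next dorm
            continue
        held[d].append(s)
        if len(held[d]) > capacity[d]:
            # Over capacity: reject the least preferred tentatively held student.
            worst = max(held[d], key=lambda x: rank[d][x])
            held[d].remove(worst)
            free.append(worst)

    return {s: d for d, students in held.items() for s in students}


# Toy data (hypothetical): three students, dorm X has 1 place, dorm Y has 2.
student_prefs = {"a": ["X", "Y"], "b": ["X", "Y"], "c": ["Y", "X"]}
dorm_prefs = {"X": ["b", "a", "c"], "Y": ["a", "c", "b"]}
capacity = {"X": 1, "Y": 2}
matching = deferred_acceptance(student_prefs, dorm_prefs, capacity)
```

In this toy example students a and b both apply to X first; X keeps b, whom it ranks higher, and a then moves on to Y.
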

In September 2012 I joined the Centre for Molecular, Environmental, Genetic and Analytic (MEGA) Epidemiology, University of Melbourne, and the VicBiostat centre as a Research Fellow in biostatistics. As part of my role I collaborated on methods for causal inference in cohort studies that aim to identify factors associated with the risk of death, lifestyle and common chronic conditions of childhood and adult life. I have also collaborated with Dr Morgan Sangeux, senior biomedical engineer at the Hugh Williamson Gait Analysis Laboratory, The Royal Children's Hospital, Melbourne. Together we developed a method to choose the most representative stride and detect outliers using tools from functional data analysis (the full paper can be downloaded from here).
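
To give a flavour of what "choosing the most representative stride" can mean, here is a simplified sketch. It is a stand-in, not the method from the paper: it assumes each stride is a curve sampled on a common grid, picks the curve closest (in Euclidean distance) to the pointwise median curve, and flags as outliers the curves whose distance exceeds the median distance by more than k median absolute deviations. The function names, the threshold k and the toy strides are hypothetical.

```python
from statistics import median


def most_representative(curves):
    """Return (index, distances) for the curve closest to the pointwise median."""
    n_points = len(curves[0])
    if any(len(c) != n_points for c in curves):
        raise ValueError("all curves must be sampled on the same grid")
    # Pointwise median curve across all repeated measurements.
    med = [median(c[t] for c in curves) for t in range(n_points)]
    # Euclidean distance of each curve from the median curve.
    dists = [
        sum((c[t] - med[t]) ** 2 for t in range(n_points)) ** 0.5
        for c in curves
    ]
    best = min(range(len(curves)), key=dists.__getitem__)
    return best, dists


def flag_outliers(dists, k=3.0):
    """Indices whose distance exceeds median + k * MAD (assumed rule of thumb)."""
    centre = median(dists)
    mad = median(abs(d - centre) for d in dists)
    return [i for i, d in enumerate(dists) if d > centre + k * mad]


# Toy strides (hypothetical): five repetitions of one joint angle; stride 3 is odd.
strides = [
    [0.0, 1.0, 2.0, 1.0, 0.0],
    [0.0, 1.2, 2.2, 1.1, 0.0],
    [0.0, 0.8, 1.8, 0.9, 0.0],
    [0.0, 3.0, 5.0, 3.0, 0.0],
    [0.0, 0.9, 1.9, 1.0, 0.0],
]
best, dists = most_representative(strides)
outliers = flag_outliers(dists)
```

Here the first stride coincides with the pointwise median curve and is selected, while the fourth stride is flagged as an outlier.
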

I am currently involved in two large projects: 

  • Gait analysis via functional data analysis tools.
    This is joint work with Dr Morgan Sangeux.

    Certain gait pathologies are associated with particular patterns of hip, knee and ankle movement. Classifying patients into specific pathologies requires the recognition of characteristic patterns in 3-dimensional data observed simultaneously in three locations.
    This project’s goal is to develop a new functional classification method that considers not only the movement of a joint in space but also the correlation of movement of all three joints, i.e. the spatial dependency of the movement. Such a method can help a clinician in decision making by highlighting the characteristics of particular pathologies, and allows comparisons to be made with previously treated patients with similar movement patterns.
    So far we have developed a method to select the most representative movement in a set of repeatedly measured movements (the full paper can be downloaded from here). We have also developed a method to order patients based on their (most representative) movements at one particular location on the body in a particular dimension. The ordering runs from the movement most similar to the pattern observed in healthy children to the one with the most severe abnormality. We are now working on extending this ordering method to take into account patients’ movement patterns at all locations and dimensions together. Finally, we plan to extend this approach to the classification problem.
  • Mixed distribution approach for identification of gene expression importance.
    This is joint work with Prof. Murray Aitkin.

    In a JRSSB paper of 2013, Lee and Bjornstad developed an analysis of 6,033 gene expression differences between 52 controls and 50 prostate cancer patients, identifying a number of important genes which showed substantial differences in mean expression levels between cancer cases and controls. Their analysis was based on t-tests of differences between mean levels.
    The human being is an extremely complicated organism, the result of hundreds of thousands of years of evolution. Surely, the search for the reason for resistance to a particular disease should go beyond a comparison of individual gene expression means. Lee and Bjornstad's work motivated us to search beyond the difference in means. An additional challenge is the dimensionality of the data set. It is a 'small' example of "big data", in which the number of features (gene expressions) is 59 times the number of observations. Moreover, how can one examine, or even just look at, 6,033 features?
    We started by developing an innovative graphical display that can be helpful in interpreting data structures of this kind. Visual representation of the peculiar structure of the gene expression levels within and across genes led us to model the expression levels with a shift-transformed mixed Weibull distribution, in which the mixture components appear in both cancer cases and controls. This is very much a work in progress, but so far our findings are very promising. With this work we hope to propose a 'cleverer' way of analysing genetic data and identifying the 'important genes'.
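
As a point of reference for the mean-comparison baseline described in the second project, the sketch below computes a two-sample t statistic for each gene and ranks the genes by its absolute value. It is an illustration under assumptions: the Welch (unequal variance) form is assumed because the exact variant used in the original analysis is not stated here, and the function names and the two-gene toy data are hypothetical. The last line checks the features-to-observations ratio quoted above.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(x, y):
    """Welch two-sample t statistic for the difference in means of x and y."""
    return (mean(x) - mean(y)) / sqrt(variance(x) / len(x) + variance(y) / len(y))


def rank_genes(controls, cases):
    """Rank genes by |t|, largest first.

    controls, cases: dict gene -> list of expression levels (same gene keys).
    Returns a list of (gene, t) pairs.
    """
    t_stats = {g: welch_t(cases[g], controls[g]) for g in controls}
    return sorted(t_stats.items(), key=lambda kv: abs(kv[1]), reverse=True)


# Toy data (hypothetical): g1 has equal means, g2 is shifted upwards in cases.
controls = {"g1": [1.0, 1.2, 0.9, 1.1], "g2": [2.0, 2.1, 1.9, 2.0]}
cases = {"g1": [1.1, 1.0, 1.2, 0.9], "g2": [3.0, 3.2, 2.9, 3.1]}
ranked = rank_genes(controls, cases)

# Ratio of features to observations in the prostate data: 6,033 genes, 52 + 50 subjects.
ratio = 6033 / (52 + 50)
```

A ranking of this kind looks only at differences in means, which is exactly the limitation the mixture-based approach aims to move beyond.
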

Refereed research papers:
  • Perach, N., J. Polak, and U. G. Rothblum (2007) A Stable Matching Model with an Entrance Criterion Applied to the Assignment of Students to Dormitories at the Technion. International Journal of Game Theory, Volume 36, Pages 519-535.
  • Sangeux, M., and J. Polak (published online December 9, 2014). A simple method to choose the most representative stride and detect outliers. Gait & Posture.

Papers in preparation:
  • Polak, J., M. King, and X. Zhang (----). A Model Validation Procedure (abstract)
  • Polak, J., M. King, and X. Zhang (----). Model clarification by testing the dynamics of functional data by scores density (abstract)
  • Polak, J., M. King, and X. Zhang (----). Improving conditional density estimation to make it more useable for econometricians (abstract)