A Distributed Random Forest Algorithm for Massive Data
Abstract: The rapid development of technology has greatly improved the ability to produce and collect data, and large sample sizes and high dimensionality have become defining characteristics of more and more datasets. Such large-scale data poses computational-efficiency challenges for statistical analysis. An analysis framework based on the idea of "divide and conquer" is proposed, and a distributed random forest algorithm for massive data (BLOCK-SDB-RF) is designed. The advantages of the proposed algorithm are analyzed from the perspectives of data coverage and time complexity, and its performance is examined through numerical simulation. The simulations show that: (1) as the sample size and feature dimension grow, the computational-efficiency advantage of the algorithm becomes increasingly pronounced; (2) although correlation between variables does not affect the algorithm's computational efficiency, some prediction accuracy is sacrificed as the correlation increases. In the empirical analysis, log data provided by the music streaming service provider KKBOX is used as an example to further explore the role of the BLOCK-SDB-RF algorithm in massive data analysis.
Keywords: big data; computational efficiency; random forest; distributed computing
LI Yang, Professor at RUC, Member of NCSRC
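
To make the "divide and conquer" idea concrete, the following is a minimal Python sketch of a block-wise random forest: the data are split into row blocks, one forest is fitted per block (in a real deployment each block would live on a separate worker), and block-level predictions are averaged. The scikit-learn classifier, the synthetic data, and parameters such as n_blocks and trees_per_block are illustrative assumptions; this is not the authors' exact BLOCK-SDB-RF procedure.

    # Illustrative sketch of a divide-and-conquer (block-wise) random forest.
    # Assumption: scikit-learn is available; data and parameters are synthetic.
    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split

    def fit_block_forests(X, y, n_blocks=4, trees_per_block=50, seed=0):
        """Split (X, y) into row blocks and fit one random forest per block."""
        rng = np.random.default_rng(seed)
        order = rng.permutation(len(y))          # shuffle rows before blocking
        forests = []
        for block in np.array_split(order, n_blocks):
            rf = RandomForestClassifier(n_estimators=trees_per_block,
                                        random_state=seed, n_jobs=-1)
            rf.fit(X[block], y[block])           # in practice: on a separate worker
            forests.append(rf)
        return forests

    def predict_ensemble(forests, X_new):
        """Average the class-probability estimates of the block forests."""
        probs = np.mean([rf.predict_proba(X_new) for rf in forests], axis=0)
        return probs.argmax(axis=1)

    if __name__ == "__main__":
        X, y = make_classification(n_samples=20000, n_features=50, random_state=1)
        X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2,
                                                  random_state=1)
        forests = fit_block_forests(X_tr, y_tr, n_blocks=4)
        acc = (predict_ensemble(forests, X_te) == y_te).mean()
        print(f"test accuracy of block-wise ensemble: {acc:.3f}")

Because each block is processed independently, the fitting step parallelizes naturally across machines, which is the source of the computational-efficiency gains the abstract describes for growing sample sizes and feature dimensions.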