[Abstract] The Very Fast Decision Tree (VFDT) algorithm is a classification algorithm for data streams. When processing large amounts of data, VFDT requires less time than traditional decision tree algorithms. However, when training samples become fewer, the label values of VFDT leaf nodes will have more errors, and the classification ability of single VFDT decision tree is limited. The Random Forest algorithm is a combinational classifier with high prediction accuracy and noise⁃tolerant ability. It is constituted by multiple decision trees and can make up for the shortage of single decision tree. In this paper, in order to improve the classification accuracy on data streams, the Random Forest algorithm is integrated into the process of tree building of the VFDT algorithm, and a new Random Forest Based Very Fast Decision Tree algorithm named RFVFDT is designed. The RFVFDT algorithm adopts the decision tree building criterion of a Random Forest classifier, and improves Random Forest algorithm with sliding window to meet the unboundedness of data streams and avoid process delay and data loss. Experimental results of the classification of KDD CUP data sets show that the classification accuracy of RFVFDT algorithm is higher than that of VFDT. The less the samples are, the more obvious the advantage is. RFVFDT is fast when running in the multi⁃thread mode.
[Keywords] data stream; data classification; Random Forest algorithm; VFDT algorithm