Abstract
Storing and processing large-volume datasets is one of the most critical challenges in large-scale data processing, so reducing dataset size before further processing is essential. This paper proposes a MapReduce-based framework for data reduction in large-scale datasets. The framework comprises three steps. First, reservoir sampling selects instances from the dataset. Second, the features of the selected instances are weighted with the ReliefF algorithm; the weights of each feature are then averaged, and the features with the highest average weights are retained. Finally, the retained features are used in the classification process. Implementation results demonstrate that the proposed framework effectively reduces processing time and, even when a large amount of data is removed by eliminating irrelevant features, maintains or increases classification accuracy.
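To make the pipeline concrete, the following minimal Python sketch mirrors the three steps on data already split into partitions (emulating MapReduce map tasks): per-partition reservoir sampling, per-sample feature weighting via a `weigh` callback standing in for ReliefF, and weight averaging with top-m selection in the reduce step. The names `partitions`, `weigh`, and `top_m` are illustrative assumptions, not the paper's implementation.

```python
import random
import numpy as np

def reservoir_sample(rows, k, seed=0):
    """Uniformly sample k instances from an iterable of rows (reservoir sampling)."""
    rng = random.Random(seed)
    reservoir = []
    for i, row in enumerate(rows):
        if i < k:
            reservoir.append(row)
        else:
            j = rng.randint(0, i)  # row i survives with probability k / (i + 1)
            if j < k:
                reservoir[j] = row
    return reservoir

def select_features(partitions, k, weigh, top_m):
    """Map step: sample each partition and weight its features (`weigh` is a
    hypothetical stand-in for ReliefF, returning one weight per feature).
    Reduce step: average the weights across partitions and keep the indices
    of the top_m features; a classifier would then be trained on these."""
    weights = [weigh(reservoir_sample(p, k, seed=s))
               for s, p in enumerate(partitions)]
    avg = np.mean(weights, axis=0)           # per-feature average weight
    return np.argsort(avg)[::-1][:top_m]     # indices of highest-weighted features
```

Averaging the per-sample weights across independently sampled partitions approximates the feature ranking ReliefF would produce on the full dataset, while each map task only ever touches a k-instance reservoir.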