Problem Statement:
- Dataset Pruning: What is the effect on Machine Learning Performance?
If you prune a dataset, i.e. remove data that you consider suboptimal (e.g. all users with fewer than x ratings), how does that affect the evaluation of machine learning algorithms? For instance, how much “better” does an algorithm appear to perform once the data is pruned? Could an unpruned dataset show that algorithm B is better than algorithm A, while a pruned dataset shows that algorithm A is better than algorithm B? And if you read a research article whose authors report that x% of the original data was removed, how meaningful is the evaluation?
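To make the question concrete, here is a minimal sketch of one way such a comparison could be set up. Everything in it is an assumption made for illustration: the toy ratings, the threshold of 2, the helper names (`prune_users`, `fraction_removed`, `leave_one_out_mae`), and the two deliberately simple predictors standing in for "algorithm A" and "algorithm B". It is not a prescribed methodology, only a starting point for the experiment described above.

```python
from collections import Counter


def prune_users(ratings, min_ratings):
    """Keep only ratings from users who have at least `min_ratings` ratings."""
    counts = Counter(user for user, _item, _rating in ratings)
    return [r for r in ratings if counts[r[0]] >= min_ratings]


def fraction_removed(original, pruned):
    """Share of ratings dropped by pruning, between 0.0 and 1.0."""
    return 1 - len(pruned) / len(original) if original else 0.0


def global_mean(train, user, item):
    """Hypothetical 'algorithm A': predict the mean of all training ratings."""
    return sum(r for _, _, r in train) / len(train)


def user_mean(train, user, item):
    """Hypothetical 'algorithm B': predict the user's own mean rating,
    falling back to the global mean for users without training ratings."""
    own = [r for u, _, r in train if u == user]
    return sum(own) / len(own) if own else global_mean(train, user, item)


def leave_one_out_mae(predict, ratings):
    """Mean absolute error when each rating is predicted from all the others."""
    errors = []
    for k, (user, item, rating) in enumerate(ratings):
        train = ratings[:k] + ratings[k + 1:]
        errors.append(abs(predict(train, user, item) - rating))
    return sum(errors) / len(errors)


# Toy data (invented): (user, item, rating). Users "c" and "d" have one rating each.
ratings = [
    ("a", "i1", 5), ("a", "i2", 5), ("a", "i3", 4),
    ("b", "i1", 1), ("b", "i2", 2), ("b", "i3", 1),
    ("c", "i1", 3),
    ("d", "i2", 5),
]

pruned = prune_users(ratings, min_ratings=2)
print(f"removed: {fraction_removed(ratings, pruned):.0%} of ratings")

# Compare both predictors on the unpruned and the pruned data.
for name, data in [("unpruned", ratings), ("pruned", pruned)]:
    mae_a = leave_one_out_mae(global_mean, data)
    mae_b = leave_one_out_mae(user_mean, data)
    print(f"{name}: global_mean MAE={mae_a:.3f}, user_mean MAE={mae_b:.3f}")
```

Running the same comparison across several thresholds and on real datasets would show whether the ranking of the algorithms, and not just their absolute error, depends on how much data was pruned.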