Thursday 1 April 2021

A statistical solution for processing very large datasets efficiently under a memory limit

Any high-performance computing system should be able to handle a vast amount of data in a short amount of time, a requirement on which entire fields (data science, Big Data) are based. Usually, the first step in managing a large amount of data is either to classify it based on well-defined attributes or, as is typical in machine learning, to "cluster" it into groups such that data points in the same group are more similar to one another than to those in another group. However, for an extremely large dataset, which can have trillions of sample points, even grouping the data points into a single cluster is tedious without a huge amount of memory.
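
To make the memory problem concrete, here is a minimal sketch of clustering a stream of points while keeping only the cluster centers in memory. It uses sequential (online) k-means, a standard textbook technique chosen purely for illustration; it is not the statistical method the article reports on. The names `online_kmeans` and `sq_dist` are hypothetical, and seeding the centers with the first `k` points is a simplifying assumption.

```python
from typing import Iterable, List, Sequence


def sq_dist(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def online_kmeans(stream: Iterable[Sequence[float]], k: int) -> List[List[float]]:
    """Sequential k-means: one pass over the stream, O(k * dim) memory.

    Assumption for this sketch: the first k points seed the k centers.
    Each later point is assigned to its nearest center, which is then
    moved to the running mean of all points assigned to it so far.
    """
    centroids: List[List[float]] = []
    counts: List[int] = []
    for point in stream:
        if len(centroids) < k:
            # Seed phase: take the first k points as initial centers.
            centroids.append([float(x) for x in point])
            counts.append(1)
            continue
        # Find the index of the nearest center.
        j = min(range(k), key=lambda i: sq_dist(point, centroids[i]))
        counts[j] += 1
        rate = 1.0 / counts[j]
        # Incremental mean update: c <- c + (x - c) / n
        centroids[j] = [c + rate * (x - c) for c, x in zip(centroids[j], point)]
    return centroids


def demo_stream():
    """Hypothetical data source; yields points one at a time."""
    for p in [(0, 0), (10, 10), (0, 1), (10, 11), (1, 0), (11, 10)]:
        yield p


centers = online_kmeans(demo_stream(), k=2)
```

Because the points arrive through a generator and are discarded after each update, memory use depends on the number of clusters and the dimension, not on the number of points. The trade-off is that the result depends on the order of the stream and on the seeding, which is one reason more careful statistical approaches are of interest for very large datasets.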

Source: https://techxplore.com/news/2021-04-statistical-solution-large-datasets-efficiently.html