By Dorit Simona Hochbaum.
The dominant algorithms for machine learning tasks fall most often in the realm of AI or continuous optimization of intractable problems. This tutorial presents combinatorial algorithms for machine learning, data mining, and image segmentation that, unlike the majority of existing machine learning methods, utilize pairwise similarities. These algorithms are efficient and reduce the classification problem to a network flow problem on a graph. One of these algorithms addresses the problem of finding a cluster that is as dissimilar as possible from the complement, while having as much similarity as possible within the cluster. These two objectives are combined either as a ratio or with linear weights. This problem is a variant of normalized cut, which is intractable. The problem and the polynomial-time algorithm solving it are called HNC. It is demonstrated here, via an extensive empirical study, that incorporating the use of pairwise similarities improves accuracy of classification and clustering. However, a drawback of the use of similarities is the quadratic rate of growth in the size of the data. A methodology called “sparse computation” has been devised to address and eliminate this quadratic growth. It is demonstrated that the technique of “sparse computation” enables the scalability of similarity-based algorithms to very large-scale data sets while maintaining high levels of accuracy. We demonstrate several applications of variants of HNC for data mining, medical imaging, and image segmentation tasks, including a recent one in which HNC is among the top performing methods in a benchmark for cell identification in calcium imaging movies for neuroscience brain research.