# Clusters

[GoDoc](https://godoc.org/github.com/mpraski/clusters)

Go implementations of several clustering algorithms (k-means++, DBSCAN, OPTICS), as well as utilities for importing data and estimating the optimal number of clusters.

## The reason

This library was built out of the need for a collection of performant cluster analysis utilities for Go. Thanks to its numerous advantages (single-binary distribution, solid performance, a growing community), Go is becoming an attractive alternative to the languages commonly used in statistical computing and machine learning, yet it still lacks crucial tools and libraries. I use the [*floats* package](https://github.com/gonum/gonum/tree/master/floats) from the robust Gonum library to perform optimized vector calculations in tight loops.

## Install

If you have Go 1.7+:

```bash
go get github.com/mpraski/clusters
```

## Usage

The currently supported hard clustering algorithms are represented by the *HardClusterer* interface, which defines several common operations. As an example, we create, train and use a KMeans++ clusterer:

```go
var data [][]float64
var observation []float64

// Create a new KMeans++ clusterer with 1000 iterations, 8 clusters and a
// distance measurement function of type func([]float64, []float64) float64.
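// Note: any function with signature func(a, b []float64) float64 can be
// supplied as the distance measure. For illustration only (this helper is
// not part of the library), a Manhattan distance could be written as:
manhattanDistance := func(a, b []float64) float64 {
	var sum float64
	for i := range a {
		sum += math.Abs(a[i] - b[i])
	}
	return sum
}
_ = manhattanDistance // pass it below in place of clusters.EuclideanDistance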
// Pass nil to use clusters.EuclideanDistance.
c, e := clusters.KMeans(1000, 8, clusters.EuclideanDistance)
if e != nil {
	panic(e)
}

// Use the data to train the clusterer
if e = c.Learn(data); e != nil {
	panic(e)
}

fmt.Printf("Clustered data set into %d\n", c.Sizes())

fmt.Printf("Assigned observation %v to cluster %d\n", observation, c.Predict(observation))

for index, number := range c.Guesses() {
	fmt.Printf("Assigned data point %v to cluster %d\n", data[index], number)
}
```

The algorithms currently supported are KMeans++, DBSCAN and OPTICS.

Algorithms which support online learning can be trained using the Online() function, which relies on channel communication to coordinate the process:

```go
c, e := clusters.KMeans(1000, 8, clusters.EuclideanDistance)
if e != nil {
	panic(e)
}

c = c.WithOnline(clusters.Online{
	Alpha:     0.5,
	Dimension: 4,
})

var (
	send   = make(chan []float64)
	finish = make(chan struct{})
)

events := c.Online(send, finish)

go func() {
	for e := range events {
		fmt.Printf("Classified observation %v into cluster: %d\n", e.Observation, e.Cluster)
	}
}()

for i := 0; i < 10000; i++ {
	point := make([]float64, 4)
	for j := 0; j < 4; j++ {
		point[j] = 10 * (rand.Float64() - 0.5)
	}
	send <- point
}

finish <- struct{}{}

fmt.Printf("Clustered data set into %d\n", c.Sizes())
```

The Estimator interface defines an operation of guessing the optimal number of clusters in a dataset.
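To make the notion of an optimal cluster count concrete: the gap statistic (Tibshirani, Walther & Hastie) compares the log of the pooled within-cluster dispersion W_k against its expectation under a uniform reference distribution, choosing the k where the gap is largest subject to a standard-error correction. A minimal, self-contained sketch of the W_k term, illustrative only and not this library's API:

```go
package main

import "fmt"

// withinDispersion computes the pooled within-cluster dispersion W_k used by
// the gap statistic: for each cluster r, the sum of pairwise squared Euclidean
// distances D_r is divided by 2*n_r, then summed over all clusters.
func withinDispersion(points [][]float64, assign []int, k int) float64 {
	groups := make([][][]float64, k)
	for i, a := range assign {
		groups[a] = append(groups[a], points[i])
	}
	var w float64
	for _, g := range groups {
		if len(g) == 0 {
			continue
		}
		var d float64
		for i := range g {
			for j := range g {
				var sq float64
				for t := range g[i] {
					diff := g[i][t] - g[j][t]
					sq += diff * diff
				}
				d += sq
			}
		}
		w += d / (2 * float64(len(g)))
	}
	return w
}

func main() {
	points := [][]float64{{0, 0}, {0, 1}, {10, 10}, {10, 11}}
	assign := []int{0, 0, 1, 1}
	fmt.Println(withinDispersion(points, assign, 2)) // → 1
}
```

Tight clusters yield a small W_k; the estimator described in the paper compares log(W_k) across candidate values of k against reference datasets drawn uniformly over the data's bounding box.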
As of now the KMeansEstimator is implemented, using the gap statistic with k-means++ as the clustering algorithm (see https://web.stanford.edu/~hastie/Papers/gap.pdf):

```go
var data [][]float64

// Create a new KMeans++ estimator with 1000 iterations, a maximum of
// 8 clusters and the default (EuclideanDistance) distance measurement
c, e := clusters.KMeansEstimator(1000, 8, clusters.EuclideanDistance)
if e != nil {
	panic(e)
}

r, e := c.Estimate(data)
if e != nil {
	panic(e)
}

fmt.Printf("Estimated number of clusters: %d\n", r)
```

The library also provides an Importer to load data from a file (as of now a CSV importer is implemented):

```go
// Given a CSV Importer i, import the first three columns
// (indices 0 through 2, inclusive) from data.csv
d, e := i.Import("data.csv", 0, 2)
if e != nil {
	panic(e)
}
```

## Development

The list of project goals includes:

- [x] Implement commonly used hard clustering algorithms
- [ ] Implement commonly used soft clustering algorithms
- [ ] Devise reliable tests of the performance and quality of each algorithm

## Benchmarks

Soon to come.

## Licence

MIT