
# Clusters

[![](https://godoc.org/github.com/mpraski/clusters?status.svg)](https://godoc.org/github.com/mpraski/clusters)

Go implementations of several clustering algorithms (k-means++, DBSCAN, OPTICS), as well as utilities for importing data and estimating the optimal number of clusters.

## The reason

This library was built out of the need for a collection of performant cluster analysis utilities for Go. Thanks to its numerous advantages (single-binary distribution, solid performance, a growing community), Go is becoming an attractive alternative to the languages commonly used for statistical computing and machine learning, yet it still lacks crucial tools and libraries. I use the [*floats* package](https://github.com/gonum/gonum/tree/master/floats) from the robust Gonum library to perform optimized vector calculations in tight loops.
    10  
## Install

If you have Go 1.7+:
```bash
go get github.com/mpraski/clusters
```
    17  
## Usage

The currently supported hard clustering algorithms are represented by the *HardClusterer* interface, which defines several common operations. As an example, we create, train, and use a k-means++ clusterer:
    21  
```go
var data [][]float64
var observation []float64

// Create a new KMeans++ clusterer with 1000 iterations,
// 8 clusters and a distance measurement function of type func([]float64, []float64) float64.
// Pass nil to use the default, clusters.EuclideanDistance
c, e := clusters.KMeans(1000, 8, clusters.EuclideanDistance)
if e != nil {
	panic(e)
}

// Use the data to train the clusterer
if e = c.Learn(data); e != nil {
	panic(e)
}

fmt.Printf("Clustered data set into %d\n", c.Sizes())

fmt.Printf("Assigned observation %v to cluster %d\n", observation, c.Predict(observation))

for index, number := range c.Guesses() {
	fmt.Printf("Assigned data point %v to cluster %d\n", data[index], number)
}
```
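The distance measure is pluggable: any function with the `func([]float64, []float64) float64` signature can be supplied in place of `clusters.EuclideanDistance`. As an illustrative sketch (the `manhattan` helper below is not part of the library), a Manhattan distance measure could look like:

```go
package main

import (
	"fmt"
	"math"
)

// manhattan satisfies the func([]float64, []float64) float64 shape
// expected of a distance measure: it sums the absolute coordinate
// differences between two observations of equal dimension.
func manhattan(a, b []float64) float64 {
	var sum float64
	for i := range a {
		sum += math.Abs(a[i] - b[i])
	}
	return sum
}

func main() {
	// Distance between (1, 2) and (4, 6) is |1-4| + |2-6| = 3 + 4 = 7
	fmt.Println(manhattan([]float64{1, 2}, []float64{4, 6}))
}
```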
    47  
The algorithms currently supported are KMeans++, DBSCAN and OPTICS.

Algorithms which support online learning can be trained via the Online() function, which relies on channel communication to coordinate the process:
    51  
```go
c, e := clusters.KMeans(1000, 8, clusters.EuclideanDistance)
if e != nil {
	panic(e)
}

c = c.WithOnline(clusters.Online{
	Alpha:     0.5,
	Dimension: 4,
})

var (
	send   = make(chan []float64)
	finish = make(chan struct{})
)

events := c.Online(send, finish)

go func() {
	for e := range events {
		fmt.Printf("Classified observation %v into cluster: %d\n", e.Observation, e.Cluster)
	}
}()

for i := 0; i < 10000; i++ {
	point := make([]float64, 4)
	for j := 0; j < 4; j++ {
		point[j] = 10 * (rand.Float64() - 0.5)
	}
	send <- point
}

finish <- struct{}{}

fmt.Printf("Clustered data set into %d\n", c.Sizes())
```
    91  
The Estimator interface defines the operation of guessing an optimal number of clusters in a dataset. As of now, the KMeansEstimator is implemented using the gap statistic with k-means++ as the clustering algorithm (see https://web.stanford.edu/~hastie/Papers/gap.pdf):
    93  
```go
var data [][]float64

// Create a new KMeans++ estimator with 1000 iterations,
// a maximum of 8 clusters and default (EuclideanDistance) distance measurement
c, e := clusters.KMeansEstimator(1000, 8, clusters.EuclideanDistance)
if e != nil {
	panic(e)
}

r, e := c.Estimate(data)
if e != nil {
	panic(e)
}

fmt.Printf("Estimated number of clusters: %d\n", r)
```
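For intuition, the gap statistic is built on the pooled within-cluster dispersion W_k: it compares log(W_k) of the clustered data against the expected log(W_k) under a uniform reference distribution, and picks the k that maximizes the gap. A minimal, self-contained sketch of computing W_k for a given cluster assignment (the names below are illustrative, not the library's internals):

```go
package main

import "fmt"

// withinDispersion computes W_k: the pooled sum, over all points, of the
// squared distance from each point to the centroid of its assigned cluster.
// data holds the observations, assign[i] is the cluster index of data[i],
// and k is the number of clusters.
func withinDispersion(data [][]float64, assign []int, k int) float64 {
	dim := len(data[0])
	centroids := make([][]float64, k)
	counts := make([]int, k)
	for c := range centroids {
		centroids[c] = make([]float64, dim)
	}
	// Accumulate sums per cluster, then divide to get centroids.
	for i, p := range data {
		c := assign[i]
		counts[c]++
		for j, v := range p {
			centroids[c][j] += v
		}
	}
	for c := range centroids {
		for j := range centroids[c] {
			centroids[c][j] /= float64(counts[c])
		}
	}
	// Sum squared distances of every point to its cluster centroid.
	var w float64
	for i, p := range data {
		for j, v := range p {
			d := v - centroids[assign[i]][j]
			w += d * d
		}
	}
	return w
}

func main() {
	// Two tight clusters of two points each; each point lies at squared
	// distance 1 from its centroid, so W_2 = 4.
	data := [][]float64{{0, 0}, {0, 2}, {10, 0}, {10, 2}}
	fmt.Println(withinDispersion(data, []int{0, 0, 1, 1}, 2))
}
```

A lower W_k for the data, relative to the reference, signals genuinely compact clusters rather than an artifact of simply increasing k.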
   112  
The library also provides an Importer to load data from a file (as of now, only the CSV importer is implemented):
   114  
```go
// i is an Importer instance (here, the CSV importer).
// Import the first three columns (0 through 2) from data.csv
d, e := i.Import("data.csv", 0, 2)
if e != nil {
	panic(e)
}
```
   122  
## Development

Project goals include:
- [x] Implement commonly used hard clustering algorithms
- [ ] Implement commonly used soft clustering algorithms
- [ ] Devise reliable tests of performance and quality of each algorithm

## Benchmarks

Soon to come.

## Licence

MIT