# go-enry [![GoDoc](https://godoc.org/gitlab.com/thomasboni/go-enry?status.svg)](https://pkg.go.dev/gitlab.com/thomasboni/go-enry/v2) [![Test](https://gitlab.com/thomasboni/go-enry/workflows/Test/badge.svg)](https://gitlab.com/thomasboni/go-enry/actions?query=workflow%3ATest+branch%3Amaster) [![codecov](https://codecov.io/gh/go-enry/go-enry/branch/master/graph/badge.svg)](https://codecov.io/gh/go-enry/go-enry)

Programming language detector and toolbox to ignore binary or vendored files. _enry_ started as a port to _Go_ of the original [Linguist](https://github.com/github/linguist) _Ruby_ library, with an improved _2x performance_.

- [CLI](#cli)
- [Library](#library)
  - [Use cases](#use-cases)
    - [By filename](#by-filename)
    - [By text](#by-text)
    - [By file](#by-file)
    - [Filtering](#filtering-vendoring-binaries-etc)
    - [Coloring](#language-colors-and-groups)
  - [Languages](#languages)
    - [Go](#go)
    - [Java bindings](#java-bindings)
    - [Python bindings](#python-bindings)
    - [Rust bindings](#rust-bindings)
- [Divergences from linguist](#divergences-from-linguist)
- [Benchmarks](#benchmarks)
- [Why Enry?](#why-enry)
- [Development](#development)
  - [Sync with github/linguist upstream](#sync-with-githublinguist-upstream)
- [Misc](#misc)
- [License](#license)

# CLI

The CLI binary is hosted in a separate repository [go-enry/enry](https://github.com/go-enry/enry).

# Library

_enry_ is also a Go library for guessing a programming language; it exposes its API through FFI to multiple programming environments.

## Use cases

_enry_ guesses a programming language using a sequence of matching _strategies_ that are
applied progressively to narrow down the possible options.
Each _strategy_ varies in the type
of input data that it needs to make a decision: file name, extension, the first line of the file, the full content of the file, etc.

Depending on the available input data, the enry API can be roughly divided into the following categories or use cases.

### By filename

The following functions require only the name of the file to make a guess:

- `GetLanguageByExtension` uses only the file extension (which may be ambiguous)
- `GetLanguageByFilename` is useful for cases like `.gitignore`, `.bashrc`, etc.
- all [filtering helpers](#filtering-vendoring-binaries-etc)

Please note that such guesses are expected not to be very accurate.

### By text

To make a guess based only on the content of the file or a text snippet, use:

- `GetLanguageByShebang` reads only the first line of text to identify the [shebang](<https://en.wikipedia.org/wiki/Shebang_(Unix)>).
- `GetLanguageByModeline` for cases when a Vim/Emacs modeline, e.g. `/* vim: set ft=cpp: */`, may be present at the head or the tail of the text.
- `GetLanguageByClassifier` uses a Bayesian classifier trained on all the `./samples/` from Linguist.

  It usually is a last-resort strategy that is used to disambiguate the guess of the previous strategies, and thus it requires a list of "candidate" guesses. If more intelligent hypotheses are not available, one can provide a list of all known languages (the keys of `data.LanguagesLogProbabilities`) as possible candidates, at the price of possibly suboptimal accuracy.

### By file

The most accurate guess is possible when both the file name and the content are available:

- `GetLanguagesByContent` uses only the file extension and a set of regexp-based content heuristics.
- `GetLanguages` uses the full set of matching strategies and is expected to be the most accurate.
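
As a rough illustration of how strategies can narrow the candidates step by step, here is a minimal, self-contained sketch. It is not enry's actual implementation: the `strategy` type, the `detect` function, and the lookup tables `byExtension` and `markers` are hypothetical names with toy data, invented for this example.

```go
package main

import (
	"bytes"
	"fmt"
	"path/filepath"
)

// strategy narrows a list of candidate languages using one kind of input.
// A nil or empty candidates slice means "no prior guess".
// (Hypothetical type for illustration only.)
type strategy func(filename string, content []byte, candidates []string) []string

// byExtension is toy data; a real table would be generated from Linguist.
var byExtension = map[string][]string{
	".go": {"Go"},
	".h":  {"C", "C++", "Objective-C"},
}

// markers maps a language to a byte sequence assumed to be typical of it.
var markers = map[string][]byte{
	"C++":         []byte("std::"),
	"Objective-C": []byte("#import"),
}

// extensionStrategy looks only at the file extension.
func extensionStrategy(filename string, _ []byte, _ []string) []string {
	return byExtension[filepath.Ext(filename)]
}

// contentStrategy keeps the candidates whose marker occurs in the content.
func contentStrategy(_ string, content []byte, candidates []string) []string {
	var kept []string
	for _, lang := range candidates {
		if m, ok := markers[lang]; ok && bytes.Contains(content, m) {
			kept = append(kept, lang)
		}
	}
	return kept
}

// detect applies the strategies in order and stops as soon as exactly one
// candidate remains. A strategy that returns nothing leaves the earlier
// candidates untouched.
func detect(filename string, content []byte, strategies ...strategy) []string {
	var candidates []string
	for _, s := range strategies {
		next := s(filename, content, candidates)
		if len(next) == 0 {
			continue
		}
		candidates = next
		if len(candidates) == 1 {
			break
		}
	}
	return candidates
}

func main() {
	content := []byte("#import <Foundation/Foundation.h>")
	fmt.Println(detect("foo.h", content, extensionStrategy, contentStrategy))
}
```

Running it prints `[Objective-C]`: the extension alone leaves three candidates, and the content check narrows them to one. In enry itself, `GetLanguages` plays this role using its full set of strategies.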

### Filtering: vendoring, binaries, etc

_enry_ exposes a set of file-level helpers `Is*` to simplify filtering out the files that are less interesting for the purpose of source code analysis:

- `IsBinary`
- `IsVendor`
- `IsConfiguration`
- `IsDocumentation`
- `IsDotFile`
- `IsImage`
- `IsTest`
- `IsGenerated`

### Language colors and groups

_enry_ exposes functions to get the color of a language, for example to use when presenting statistics in graphs:

- `GetColor`
- `GetLanguageGroup` can be used to group similar languages together, e.g. for `Less` this function will return `CSS`

## Languages

### Go

In a [Go module](https://github.com/golang/go/wiki/Modules),
add `enry` to the module by running:

```sh
go get gitlab.com/thomasboni/go-enry/v2
```

The rest of the examples will assume you have either done this or fetched the
library into your `GOPATH`.

```go
// The examples here and below assume you have imported the library.
import (
	"fmt"

	"gitlab.com/thomasboni/go-enry/v2"
)

lang, safe := enry.GetLanguageByExtension("foo.go")
fmt.Println(lang, safe)
// result: Go true

// lang and safe are already declared, so plain assignment is used from here on.
lang, safe = enry.GetLanguageByContent("foo.m", []byte("<matlab-code>"))
fmt.Println(lang, safe)
// result: Matlab true

lang, safe = enry.GetLanguageByContent("bar.m", []byte("<objective-c-code>"))
fmt.Println(lang, safe)
// result: Objective-C true

// all strategies together; GetLanguage returns only the language name
lang = enry.GetLanguage("foo.cpp", []byte("<cpp-code>"))
fmt.Println(lang)
// result: C++
```

Note that the returned boolean value `safe` is `true` if there is only one possible language detected.

A plural version of the same API allows getting a list of all possible languages for a given file.

```go
langs := enry.GetLanguages("foo.h", []byte("<cpp-code>"))
// result: []string{"C", "C++", "Objective-C"}

langs = enry.GetLanguagesByExtension("foo.asc", []byte("<content>"), nil)
// result: []string{"AGS Script", "AsciiDoc", "Public Key"}

langs = enry.GetLanguagesByFilename("Gemfile", []byte("<content>"), []string{})
// result: []string{"Ruby"}
```

### Java bindings

Generated Java bindings using a C shared library and JNI are available under [`java`](https://gitlab.com/thomasboni/go-enry/blob/master/java).

A library is published on Maven as [tech.sourced:enry-java](https://mvnrepository.com/artifact/tech.sourced/enry-java) for macOS and Linux platforms. Windows support is planned under [src-d/enry#150](https://github.com/src-d/enry/issues/150).

### Python bindings

Generated Python bindings using a C shared library and cffi are WIP under [src-d/enry#154](https://github.com/src-d/enry/issues/154).

A library is going to be published on PyPI as [enry](https://pypi.org/project/enry/) for
macOS and Linux platforms. Windows support is planned under [src-d/enry#150](https://github.com/src-d/enry/issues/150).

### Rust bindings

Generated Rust bindings using a C static library are available at https://github.com/go-enry/rs-enry.

## Divergences from Linguist

The `enry` library is based on the data from `github/linguist` version **v7.20.0**.

When parsing [linguist/samples](https://github.com/github/linguist/tree/master/samples), the following `enry` results differ from Linguist:

- [Heuristics for ".txt" extension](https://github.com/github/linguist/blob/8083cb5a89cee2d99f5a988f165994d0243f0d1e/lib/linguist/heuristics.yml#L521) in Vim Help File could not be parsed, due to unsupported negative lookahead in RE2 regexp engine.

- [Heuristics for ".sol" extension](https://github.com/github/linguist/blob/8083cb5a89cee2d99f5a988f165994d0243f0d1e/lib/linguist/heuristics.yml#L464) in Solidity could not be parsed, due to unsupported negative lookahead in RE2 regexp engine.

- [Heuristics for ".es" extension](https://github.com/github/linguist/blob/e761f9b013e5b61161481fcb898b59721ee40e3d/lib/linguist/heuristics.yml#L103) in JavaScript could not be parsed, due to unsupported backreference in RE2 regexp engine.

- [Heuristics for ".rno" extension](https://github.com/github/linguist/blob/3a1bd3c3d3e741a8aaec4704f782e06f5cd2a00d/lib/linguist/heuristics.yml#L365) in RUNOFF could not be parsed, due to unsupported lookahead in RE2 regexp engine.

- [Heuristics for ".inc" extension](https://github.com/github/linguist/blob/f0e2d0d7f1ce600b2a5acccaef6b149c87d8b99c/lib/linguist/heuristics.yml#L222) in NASL could not be parsed, due to unsupported possessive quantifier in RE2 regexp engine.

- [Heuristics for ".as" extension](https://github.com/github/linguist/blob/223c00bb80eff04788e29010f98c5778993d2b2a/lib/linguist/heuristics.yml#L67) in ActionScript could not be parsed, due to unsupported positive lookahead in RE2 regexp engine.

- [Heuristics for ".csc", ".gsc" and ".gsh" extension](https://github.com/github/linguist/blob/7469c7982d93f2ad922230d712f586a353dc1a42/lib/linguist/heuristics.yml#L650-L651) in GSC could not be parsed, due to unsupported non-backtracking subexpressions in RE2 regexp engine.

- As of [Linguist v5.3.2](https://github.com/github/linguist/releases/tag/v5.3.2), Linguist uses a [flex-based scanner in C for tokenization](https://github.com/github/linguist/pull/3846). Enry still uses the [extract_token](https://github.com/github/linguist/pull/3846/files#diff-d5179df0b71620e3fac4535cd1368d15L60) regex-based algorithm. See [#193](https://github.com/src-d/enry/issues/193).

- Bayesian classifier can't distinguish "SQL" from "PLpgSQL".
See [#194](https://github.com/src-d/enry/issues/194).

- Overriding languages and types through `.gitattributes` is not yet supported. See [#18](https://github.com/src-d/enry/issues/18).

- `enry` CLI output does NOT exclude `.gitignore`ed files and git submodules, as Linguist does.

In all the cases above that have an issue number, we plan to update enry to match Linguist behavior.

## Benchmarks

Enry's language detection has been compared with Linguist's on [_linguist/samples_](https://github.com/github/linguist/tree/master/samples).

We got these results:

![histogram](benchmarks/histogram/distribution.png)

The histogram shows the _number of files_ (y-axis) per _time interval bucket_ (x-axis).
Most of the files were detected faster by enry.

There are several cases where enry is slower than Linguist because the
Go regexp engine is slower than Ruby's, which is based on the [oniguruma](https://github.com/kkos/oniguruma) library, written in C.

See [instructions](#misc) for running enry with oniguruma.

## Why Enry?

In the movie [My Fair Lady](https://en.wikipedia.org/wiki/My_Fair_Lady), [Professor Henry Higgins](http://www.imdb.com/character/ch0011719/) is a linguist who at the very beginning of the movie enjoys guessing the origin of people based on their accent.

"Enry Iggins" is how [Eliza Doolittle](http://www.imdb.com/character/ch0011720/) [pronounces](https://www.youtube.com/watch?v=pwNKyTktDIE) the name of the Professor.

## Development

To run the tests use:

    go test ./...

Setting `ENRY_TEST_REPO` to the path of an existing checkout of Linguist will avoid cloning it and speed the tests up.
Setting `ENRY_DEBUG=1` will provide insight into the Bayesian classifier building done by `make code-generate`.

### Sync with github/linguist upstream

_enry_ re-uses parts of the original [github/linguist](https://github.com/github/linguist) to generate internal data structures.
To update to the latest release of Linguist, do:

```bash
$ git clone https://github.com/github/linguist.git .linguist
$ cd .linguist; git checkout <release-tag>; cd ..

# put the new release's commit sha in the generator_test.go (to re-generate .gold test fixtures)
# https://gitlab.com/thomasboni/go-enry/blob/13d3d66d37a87f23a013246a1b0678c9ee3d524b/internal/code-generator/generator/generator_test.go#L18

$ make code-generate
```

To stay in sync, enry needs to be updated when a new release of Linguist includes changes to any of the following files:

- [languages.yml](https://github.com/github/linguist/blob/master/lib/linguist/languages.yml)
- [heuristics.yml](https://github.com/github/linguist/blob/master/lib/linguist/heuristics.yml)
- [vendor.yml](https://github.com/github/linguist/blob/master/lib/linguist/vendor.yml)
- [documentation.yml](https://github.com/github/linguist/blob/master/lib/linguist/documentation.yml)

There is no automation for detecting changes in the Linguist project, so the process above has to be done manually from time to time.

When submitting a pull request syncing up to a new release, please make sure it only contains the changes in
the generated files (in the [data](https://gitlab.com/thomasboni/go-enry/blob/master/data) subdirectory).

Separating all the necessary "manual" code changes into a different PR that includes some background description and an update to the documentation on ["divergences from linguist"](#divergences-from-linguist) is very much appreciated, as it simplifies the maintenance (review/release notes/etc).

## Misc

<details>
  <summary>Running a benchmark & faster regexp engine</summary>

### Benchmark

All benchmark scripts are in the [_benchmarks_](https://gitlab.com/thomasboni/go-enry/blob/master/benchmarks) directory.

#### Dependencies

As the benchmarks depend on Ruby and the Github-Linguist gem, make sure you have:

- Ruby (e.g. using [`rbenv`](https://github.com/rbenv/rbenv)), [`bundler`](https://bundler.io/) installed
- Docker
- [native dependencies](https://github.com/github/linguist/#dependencies) installed
- Build the gem `cd .linguist && bundle install && rake build_gem && cd -`
- Install it `gem install --no-rdoc --no-ri --local .linguist/github-linguist-*.gem`

#### Quick benchmark

To run quicker benchmarks, use

    make benchmarks

to get average times for the primary detection function and strategies for the whole samples set. If you want to see measures per sample file use:

    make benchmarks-samples

#### Full benchmark

If you want to reproduce the same benchmarks as reported above:

- Make sure all [dependencies](#dependencies) are installed
- Install [gnuplot](http://gnuplot.info) (in order to plot the histogram)
- Run `ENRY_TEST_REPO="$PWD/.linguist" benchmarks/run.sh` (takes ~15h)

It will run the benchmarks for enry and Linguist, parse the output, create csv files and plot the histogram.

### Faster regexp engine (optional)

[Oniguruma](https://github.com/kkos/oniguruma) is CRuby's regular expression engine.
It is very fast and performs better than the one built into the Go runtime. _enry_ supports swapping
between those two engines thanks to the [rubex](https://github.com/moovweb/rubex) project.
The typical overall speedup from using Oniguruma is 1.5-2x. However, it requires CGo and the external shared library.
On macOS with [Homebrew](https://brew.sh/), it can be installed with:

```
brew install oniguruma
```

On Ubuntu, it can be installed with:

```
sudo apt install libonig-dev
```

To build enry with Oniguruma regexps, use the `oniguruma` build tag

```
go get -v -t --tags oniguruma ./...
```

and then rebuild the project.

</details>

## License

Apache License, Version 2.0. See [LICENSE](LICENSE)