# enry [![GoDoc](https://godoc.org/gopkg.in/src-d/enry.v1?status.svg)](https://godoc.org/gopkg.in/src-d/enry.v1) [![Build Status](https://travis-ci.com/src-d/enry.svg?branch=master)](https://travis-ci.com/src-d/enry) [![codecov](https://codecov.io/gh/src-d/enry/branch/master/graph/badge.svg)](https://codecov.io/gh/src-d/enry)

File programming language detector and toolbox to ignore binary or vendored files. *enry* started as a port to _Go_ of the original [linguist](https://github.com/github/linguist) _Ruby_ library, and offers *2x improved performance*.


Installation
------------

The recommended way to install enry is

```
go get gopkg.in/src-d/enry.v1/cmd/enry
```

To build enry's CLI you must run

    make build

This will generate a binary called `enry` in the project's root directory. You can then move this binary to anywhere in your `PATH`.

This project is now part of [source{d} Engine](https://sourced.tech/engine),
which provides the simplest way to get started with a single command.
Visit [sourced.tech/engine](https://sourced.tech/engine) for more information.

### Faster regexp engine (optional)

[Oniguruma](https://github.com/kkos/oniguruma) is CRuby's regular expression engine.
It is very fast and performs better than the one built into the Go runtime. *enry* supports swapping
between those two engines thanks to the [rubex](https://github.com/moovweb/rubex) project.
The typical overall speedup from using Oniguruma is 1.5-2x. However, it requires CGo and the external shared library.
On macOS with brew, install it with

```
brew install oniguruma
```

On Ubuntu, install it with

```
sudo apt install libonig-dev
```

To build enry with Oniguruma regexps, use the `oniguruma` build tag

```
go get -v -t --tags oniguruma ./...
```

and then rebuild the project.


Examples
------------

```go
lang, safe := enry.GetLanguageByExtension("foo.go")
fmt.Println(lang, safe)
// result: Go true

lang, safe = enry.GetLanguageByContent("foo.m", []byte("<matlab-code>"))
fmt.Println(lang, safe)
// result: Matlab true

lang, safe = enry.GetLanguageByContent("bar.m", []byte("<objective-c-code>"))
fmt.Println(lang, safe)
// result: Objective-C true

// all strategies together
lang = enry.GetLanguage("foo.cpp", []byte("<cpp-code>"))
fmt.Println(lang)
// result: C++
```

Note that the returned boolean value `safe` is set to `true` if only one possible language was detected, and to `false` otherwise.

To get a list of possible languages for a given file, you can use the plural version of the detecting functions.

```go
langs := enry.GetLanguages("foo.h", []byte("<cpp-code>"))
// result: []string{"C", "C++", "Objective-C"}

langs = enry.GetLanguagesByExtension("foo.asc", []byte("<content>"), nil)
// result: []string{"AGS Script", "AsciiDoc", "Public Key"}

langs = enry.GetLanguagesByFilename("Gemfile", []byte("<content>"), []string{})
// result: []string{"Ruby"}
```


CLI
------------

You can use enry as a command,

```bash
$ enry --help
  enry v1.5.0 build: 10-02-2017_14_01_07 commit: 95ef0a6cf3, based on linguist commit: 37979b2
  enry, A simple (and faster) implementation of github/linguist
  usage: enry <path>
         enry [-json] [-breakdown] <path>
         enry [-json] [-breakdown]
         enry [-version]
```

and it'll return an output similar to *linguist*'s output,

```bash
$ enry
55.56%    Shell
22.22%    Ruby
11.11%    Gnuplot
11.11%    Go
```

Not only the output but also the flags are the same as *linguist*'s,

```bash
$ enry --breakdown
55.56%    Shell
22.22%    Ruby
11.11%    Gnuplot
11.11%    Go

Gnuplot
plot-histogram.gp

Ruby
linguist-samples.rb
linguist-total.rb

Shell
parse.sh
plot-histogram.sh
run-benchmark.sh
run-slow-benchmark.sh
run.sh

Go
parser/main.go
```

even the JSON flag,

```bash
$ enry --json
{"Gnuplot":["plot-histogram.gp"],"Go":["parser/main.go"],"Ruby":["linguist-samples.rb","linguist-total.rb"],"Shell":["parse.sh","plot-histogram.sh","run-benchmark.sh","run-slow-benchmark.sh","run.sh"]}
```

Note that although enry's CLI is compatible with linguist's, its main advantage is that **_enry doesn't need a git repository to work!_**

Java bindings
------------

Generated Java bindings using a C-shared library and JNI are located under [`java`](https://github.com/src-d/enry/blob/master/java).

Development
------------

*enry* re-uses parts of the original [linguist](https://github.com/github/linguist) to generate internal data structures. In order to update to the latest upstream and generate all the necessary code you must run:

    git clone https://github.com/github/linguist.git .linguist
    # update commit in generator_test.go (to re-generate .gold fixtures)
    # https://github.com/src-d/enry/blob/13d3d66d37a87f23a013246a1b0678c9ee3d524b/internal/code-generator/generator/generator_test.go#L18
    go generate

We update enry when changes are made to the following files on linguist's master branch:

* [languages.yml](https://github.com/github/linguist/blob/master/lib/linguist/languages.yml)
* [heuristics.yml](https://github.com/github/linguist/blob/master/lib/linguist/heuristics.yml)
* [vendor.yml](https://github.com/github/linguist/blob/master/lib/linguist/vendor.yml)
* [documentation.yml](https://github.com/github/linguist/blob/master/lib/linguist/documentation.yml)

Currently we don't have any procedure established to automatically detect changes in the linguist project and regenerate the code.
So we update the generated code as needed, without any specific criteria.

If you want to update *enry* because of changes in linguist, you can run the
`go generate` command and open a pull request that only contains the changes in
generated files (those files in the subdirectory [data](https://github.com/src-d/enry/blob/master/data)).

To run the tests,

    make test


Divergences from linguist
------------

The `enry` [CLI tool](#cli) does *not* require a full Git repository to be present in the filesystem in order to report languages.

Using [linguist/samples](https://github.com/github/linguist/tree/master/samples)
as a set for the tests, the following issues were found:

* [Heuristics for ".es" extension](https://github.com/github/linguist/blob/e761f9b013e5b61161481fcb898b59721ee40e3d/lib/linguist/heuristics.yml#L103) in JavaScript could not be parsed, due to an unsupported backreference in the RE2 regexp engine.

* As of [Linguist v5.3.2](https://github.com/github/linguist/releases/tag/v5.3.2), it uses a [flex-based scanner in C for tokenization](https://github.com/github/linguist/pull/3846). Enry still uses the [extract_token](https://github.com/github/linguist/pull/3846/files#diff-d5179df0b71620e3fac4535cd1368d15L60) regex-based algorithm. See [#193](https://github.com/src-d/enry/issues/193).

* The Bayesian classifier can't distinguish "SQL" from "PLpgSQL". See [#194](https://github.com/src-d/enry/issues/194).

* Detection of [generated files](https://github.com/github/linguist/blob/bf95666fc15e49d556f2def4d0a85338423c25f3/lib/linguist/generated.rb#L53) is not supported yet.
 (Thus they are not excluded from CLI output). See [#213](https://github.com/src-d/enry/issues/213).

* The XML detection strategy is not implemented. See [#192](https://github.com/src-d/enry/issues/192).

* Overriding languages and types through `.gitattributes` is not yet supported. See [#18](https://github.com/src-d/enry/issues/18).

* `enry` CLI output does NOT exclude `.gitignore`ed files and git submodules, as linguist does.

For all the cases above that have an issue number, we plan to update enry to match Linguist's behaviour.


Benchmarks
------------

Enry's language detection has been compared with Linguist's. To do that, Linguist's project directory [*linguist/samples*](https://github.com/github/linguist/tree/master/samples) was used as a set of files to run benchmarks against.

We got these results:

![histogram](benchmarks/histogram/distribution.png)

The histogram shows the number of files whose language detection time fell
within each time interval indicated on the x axis.

Most of the files were detected faster by enry.

We found a few cases where enry is slower than linguist. This is due to
Go's regexp engine being slower than Ruby's, which uses the [oniguruma](https://github.com/kkos/oniguruma) library, written in C.

You can find scripts and additional information (like software and hardware used
and benchmarks' results per sample file) in the [*benchmarks*](https://github.com/src-d/enry/blob/master/benchmarks) directory.


### Benchmark Dependencies
As the benchmarks depend on Ruby and the Github-Linguist gem, make sure you have:
 - Ruby (e.g. using [`rbenv`](https://github.com/rbenv/rbenv)) and [`bundler`](https://bundler.io/) installed
 - Docker
 - [native dependencies](https://github.com/github/linguist/#dependencies) installed
 - Build the gem `cd .linguist && bundle install && rake build_gem && cd -`
 - Install it `gem install --no-rdoc --no-ri --local .linguist/github-linguist-*.gem`


### How to reproduce current results

If you want to reproduce the same benchmarks as reported above:
 - Make sure all [dependencies](#benchmark-dependencies) are installed
 - Install [gnuplot](http://gnuplot.info) (in order to plot the histogram)
 - Run `ENRY_TEST_REPO="$PWD/.linguist" benchmarks/run.sh` (takes ~15h)

It will run the benchmarks for enry and linguist, parse the output, create CSV files and plot the histogram.

### Quick
To run quicker benchmarks you can either run:

    make benchmarks

to get average times for the main detection function and strategies for the whole samples set, or:

    make benchmarks-samples

if you want to see measures per sample file.


Why Enry?
------------

In the movie [My Fair Lady](https://en.wikipedia.org/wiki/My_Fair_Lady), [Professor Henry Higgins](http://www.imdb.com/character/ch0011719/?ref_=tt_cl_t2) is one of the main characters. Henry is a linguist and at the very beginning of the movie enjoys guessing the origin of people based on their accent.

`Enry Iggins` is how [Eliza Doolittle](http://www.imdb.com/character/ch0011720/?ref_=tt_cl_t1) [pronounces](https://www.youtube.com/watch?v=pwNKyTktDIE) the name of the Professor during the first half of the movie.


License
------------

Apache License, Version 2.0. See [LICENSE](LICENSE)