# enry [![GoDoc](https://godoc.org/gopkg.in/src-d/enry.v1?status.svg)](https://godoc.org/gopkg.in/src-d/enry.v1) [![Build Status](https://travis-ci.org/src-d/enry.svg?branch=master)](https://travis-ci.org/src-d/enry) [![codecov](https://codecov.io/gh/src-d/enry/branch/master/graph/badge.svg)](https://codecov.io/gh/src-d/enry)

File programming language detector and toolbox to ignore binary or vendored files. *enry* started as a port to _Go_ of the original [linguist](https://github.com/github/linguist) _Ruby_ library, with *2x better performance*.


Installation
------------

The recommended way to install enry is

```
go get gopkg.in/src-d/enry.v1/...
```

To build enry's CLI, run

    make build-cli

This generates a binary called `enry` in the project's root directory. You can then move this binary anywhere in your `PATH`.


### Faster regexp engine (optional)

[Oniguruma](https://github.com/kkos/oniguruma) is CRuby's regular expression engine.
It is very fast and performs better than the one in Go's standard library. *enry* supports swapping
between those two engines thanks to the [rubex](https://github.com/moovweb/rubex) project.
The typical overall speedup from using Oniguruma is 1.5-2x. However, it requires CGo and the external shared library.
On macOS with brew, install it with

```
brew install oniguruma
```

On Ubuntu, install it with

```
sudo apt install libonig-dev
```

To build enry with Oniguruma regexps, patch the imports with

```
make onigumura
```

and then rebuild the project.


Examples
------------

```go
lang, safe := enry.GetLanguageByExtension("foo.go")
fmt.Println(lang, safe)
// result: Go true

lang, safe = enry.GetLanguageByContent("foo.m", []byte("<matlab-code>"))
fmt.Println(lang, safe)
// result: Matlab true

lang, safe = enry.GetLanguageByContent("bar.m", []byte("<objective-c-code>"))
fmt.Println(lang, safe)
// result: Objective-C true

// all strategies together
lang = enry.GetLanguage("foo.cpp", []byte("<cpp-code>"))
// result: C++
```

Note that the returned boolean value `safe` is `true` if only one possible language was detected, and `false` otherwise.

To get a list of possible languages for a given file, use the plural version of the detecting functions.

```go
langs := enry.GetLanguages("foo.h", []byte("<cpp-code>"))
// result: []string{"C", "C++", "Objective-C"}

langs = enry.GetLanguagesByExtension("foo.asc", []byte("<content>"), nil)
// result: []string{"AGS Script", "AsciiDoc", "Public Key"}

langs = enry.GetLanguagesByFilename("Gemfile", []byte("<content>"), []string{})
// result: []string{"Ruby"}
```


CLI
------------

You can use enry as a command,

```bash
$ enry --help
enry v1.5.0 build: 10-02-2017_14_01_07 commit: 95ef0a6cf3, based on linguist commit: 37979b2
enry, A simple (and faster) implementation of github/linguist
usage: enry <path>
       enry [-json] [-breakdown] <path>
       enry [-json] [-breakdown]
       enry [-version]
```

and it returns output similar to *linguist*'s,

```bash
$ enry
55.56%	Shell
22.22%	Ruby
11.11%	Gnuplot
11.11%	Go
```

Its flags also match *linguist*'s,

```bash
$ enry --breakdown
55.56%	Shell
22.22%	Ruby
11.11%	Gnuplot
11.11%	Go

Gnuplot
plot-histogram.gp

Ruby
linguist-samples.rb
linguist-total.rb

Shell
parse.sh
plot-histogram.sh
run-benchmark.sh
run-slow-benchmark.sh
run.sh

Go
parser/main.go
```

including the JSON flag,

```bash
$ enry --json
{"Gnuplot":["plot-histogram.gp"],"Go":["parser/main.go"],"Ruby":["linguist-samples.rb","linguist-total.rb"],"Shell":["parse.sh","plot-histogram.sh","run-benchmark.sh","run-slow-benchmark.sh","run.sh"]}
```

Note that even though enry's CLI is compatible with linguist's, its main point is that **_enry doesn't need a git repository to work!_**

Java bindings
------------

Generated Java bindings using a C shared library + JNI are located under [`java`](https://github.com/src-d/enry/blob/master/java)

Development
------------

*enry* re-uses parts of the original [linguist](https://github.com/github/linguist) to generate internal data structures. To update to the latest upstream and generate the necessary code, run:

    go generate

We update enry when changes are made in linguist's master branch to the following files:

* [languages.yml](https://github.com/github/linguist/blob/master/lib/linguist/languages.yml)
* [heuristics.rb](https://github.com/github/linguist/blob/master/lib/linguist/heuristics.rb)
* [vendor.yml](https://github.com/github/linguist/blob/master/lib/linguist/vendor.yml)
* [documentation.yml](https://github.com/github/linguist/blob/master/lib/linguist/documentation.yml)

Currently we don't have any established procedure to automatically detect changes in the linguist project and regenerate the code.
So we update the generated code as needed, without any specific criteria.


If you want to update *enry* because of changes in linguist, you can run the *go
generate* command and open a pull request that contains only the changes in
generated files (those files in the subdirectory [data](https://github.com/src-d/enry/blob/master/data)).

To run the tests,

    make test


Divergences from linguist
------------

Using [linguist/samples](https://github.com/github/linguist/tree/master/samples)
as a set for the tests, the following issues were found:

* With [hello.ms](https://github.com/github/linguist/blob/master/samples/Unix%20Assembly/hello.ms) we can't detect the language (Unix Assembly) because we don't have a matcher in contentMatchers (content.go) for Unix Assembly. Linguist uses this [regexp](https://github.com/github/linguist/blob/master/lib/linguist/heuristics.rb#L300) in its code,

  `elsif /(?<!\S)\.(include|globa?l)\s/.match(data) || /(?<!\/\*)(\A|\n)\s*\.[A-Za-z][_A-Za-z0-9]*:/.match(data.gsub(/"([^\\"]|\\.)*"|'([^\\']|\\.)*'|\\\s*(?:--.*)?\n/, ""))`

  which we can't port. Note that Go's standard `regexp` package does not support lookbehind assertions such as `(?<!\S)`.

* All files for the SQL language fall to the classifier because we don't parse
this [disambiguator
expression](https://github.com/github/linguist/blob/master/lib/linguist/heuristics.rb#L433)
for `*.sql` files correctly. This expression doesn't follow the pattern used by the
rest of [heuristics.rb](https://github.com/github/linguist/blob/master/lib/linguist/heuristics.rb).


Benchmarks
------------

Enry's language detection has been benchmarked against Linguist's. To do that, linguist's project directory [*linguist/samples*](https://github.com/github/linguist/tree/master/samples) was used as the set of files to run benchmarks against.


We got these results:

![histogram](https://raw.githubusercontent.com/src-d/enry/master/benchmarks/histogram/distribution.png)

The histogram shows the number of files whose language detection time fell
within each time interval on the x axis.

As you can see, most of the files were detected faster by enry.

We found a few cases where enry is slower than linguist. This is due to
Go's regexp engine being slower than Ruby's, which uses the [oniguruma](https://github.com/kkos/oniguruma) library, written in C.

You can find scripts and additional information (like the software and hardware used
and the benchmark results per sample file) in the [*benchmarks*](https://github.com/src-d/enry/blob/master/benchmarks) directory.

If you want to reproduce the same benchmarks you can run:

    benchmarks/run.sh

from the project's root directory. It runs the benchmarks for enry and linguist, parses the output, creates CSV files and creates a histogram (you must have [gnuplot](http://gnuplot.info) installed on your system to get the histogram).

This can take some time, so to run local benchmarks for a quick check you can either run:

    make benchmarks

to get average times for the main detection function and strategies over the whole sample set, or:

    make benchmarks-samples

if you want to see measurements per sample file.


Why Enry?
------------

In the movie [My Fair Lady](https://en.wikipedia.org/wiki/My_Fair_Lady), [Professor Henry Higgins](http://www.imdb.com/character/ch0011719/?ref_=tt_cl_t2) is one of the main characters. Henry is a linguist and at the very beginning of the movie enjoys guessing the origin of people based on their accent.


`Enry Iggins` is how [Eliza Doolittle](http://www.imdb.com/character/ch0011720/?ref_=tt_cl_t1) [pronounces](https://www.youtube.com/watch?v=pwNKyTktDIE) the name of the Professor during the first half of the movie.


License
------------

Apache License, Version 2.0. See [LICENSE](LICENSE)