# enry [![GoDoc](https://godoc.org/gopkg.in/src-d/enry.v1?status.svg)](https://godoc.org/gopkg.in/src-d/enry.v1) [![Build Status](https://travis-ci.com/src-d/enry.svg?branch=master)](https://travis-ci.com/src-d/enry) [![codecov](https://codecov.io/gh/src-d/enry/branch/master/graph/badge.svg)](https://codecov.io/gh/src-d/enry)

File programming language detector and toolbox to ignore binary or vendored files. *enry* started as a port to _Go_ of the original [linguist](https://github.com/github/linguist) _Ruby_ library and has *2x better performance*.


Installation
------------

The recommended way to install enry is

```
go get gopkg.in/src-d/enry.v1/cmd/enry
```

To build enry's CLI, run

    make build

This generates a binary called `enry` in the project's root directory. You can then move this binary anywhere in your `PATH`.

This project is now part of [source{d} Engine](https://sourced.tech/engine),
which provides the simplest way to get started with a single command.
Visit [sourced.tech/engine](https://sourced.tech/engine) for more information.

### Faster regexp engine (optional)

[Oniguruma](https://github.com/kkos/oniguruma) is CRuby's regular expression engine.
It is very fast and performs better than the one built into Go's standard library. *enry* supports swapping
between those two engines thanks to the [rubex](https://github.com/moovweb/rubex) project.
The typical overall speedup from using Oniguruma is 1.5-2x. However, it requires CGo and an external shared library.
On macOS with brew, install it with

```
brew install oniguruma
```

On Ubuntu, install it with

```
sudo apt install libonig-dev
```

To build enry with Oniguruma regexps, use the `oniguruma` build tag

```
go get -v -t --tags oniguruma ./...
```

and then rebuild the project.

Examples
------------

```go
// import "gopkg.in/src-d/enry.v1"

lang, safe := enry.GetLanguageByExtension("foo.go")
fmt.Println(lang, safe)
// result: Go true

// lang and safe are already declared, so plain assignment is used from here on
lang, safe = enry.GetLanguageByContent("foo.m", []byte("<matlab-code>"))
fmt.Println(lang, safe)
// result: Matlab true

lang, safe = enry.GetLanguageByContent("bar.m", []byte("<objective-c-code>"))
fmt.Println(lang, safe)
// result: Objective-C true

// all strategies together; GetLanguage returns only the language name
lang = enry.GetLanguage("foo.cpp", []byte("<cpp-code>"))
fmt.Println(lang)
// result: C++
```

Note that the returned boolean value `safe` is `true` if only one possible language was detected, and `false` otherwise.

To get a list of possible languages for a given file, you can use the plural version of the detecting functions.

```go
langs := enry.GetLanguages("foo.h", []byte("<cpp-code>"))
// result: []string{"C", "C++", "Objective-C"}

langs = enry.GetLanguagesByExtension("foo.asc", []byte("<content>"), nil)
// result: []string{"AGS Script", "AsciiDoc", "Public Key"}

langs = enry.GetLanguagesByFilename("Gemfile", []byte("<content>"), []string{})
// result: []string{"Ruby"}
```
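
The relation between the plural functions and the `safe` flag can be sketched without enry itself. The helper `firstAndSafe` below is hypothetical (it is not part of enry's API) and only mirrors the rule stated above: `safe` is `true` when exactly one candidate language remains. Returning the first candidate as the result is an assumption made for illustration.

```go
package main

import "fmt"

// firstAndSafe is a hypothetical helper, not an enry function.
// It returns the first candidate and reports safe == true only when
// exactly one candidate language was detected.
func firstAndSafe(candidates []string) (lang string, safe bool) {
	if len(candidates) == 0 {
		return "", false
	}
	return candidates[0], len(candidates) == 1
}

func main() {
	// Candidate lists copied from the examples above.
	lang, safe := firstAndSafe([]string{"Ruby"})
	fmt.Println(lang, safe) // Ruby true

	lang, safe = firstAndSafe([]string{"C", "C++", "Objective-C"})
	fmt.Println(lang, safe) // C false
}
```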


CLI
------------

You can use enry as a command,

```bash
$ enry --help
  enry v1.5.0 build: 10-02-2017_14_01_07 commit: 95ef0a6cf3, based on linguist commit: 37979b2
  enry, A simple (and faster) implementation of github/linguist
  usage: enry <path>
         enry [-json] [-breakdown] <path>
         enry [-json] [-breakdown]
         enry [-version]
```

and it returns output similar to *linguist*'s,

```bash
$ enry
55.56%    Shell
22.22%    Ruby
11.11%    Gnuplot
11.11%    Go
```

and not only the output: its flags are also the same as *linguist*'s,

```bash
$ enry --breakdown
55.56%    Shell
22.22%    Ruby
11.11%    Gnuplot
11.11%    Go

Gnuplot
plot-histogram.gp

Ruby
linguist-samples.rb
linguist-total.rb

Shell
parse.sh
plot-histogram.sh
run-benchmark.sh
run-slow-benchmark.sh
run.sh

Go
parser/main.go
```

even the JSON flag,

```bash
$ enry --json
{"Gnuplot":["plot-histogram.gp"],"Go":["parser/main.go"],"Ruby":["linguist-samples.rb","linguist-total.rb"],"Shell":["parse.sh","plot-histogram.sh","run-benchmark.sh","run-slow-benchmark.sh","run.sh"]}
```
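
The JSON output maps each language to its list of files, so it is easy to consume from other programs. The following is a minimal sketch in Go using only the standard library. The names `share` and `sharesFromJSON` are hypothetical, and deriving percentages from per-language file counts is an assumption that happens to match the sample output above (5 of 9 files are Shell, i.e. 55.56%); it is not taken from enry's source.

```go
package main

import (
	"encoding/json"
	"fmt"
	"sort"
)

// share is one row of the percentage report.
type share struct {
	Language string
	Percent  float64
}

// sharesFromJSON parses a language -> files mapping (the shape of the
// JSON output shown above) and computes each language's share of the
// total file count. Counting files is an assumption, see the lead-in.
func sharesFromJSON(data []byte) ([]share, error) {
	var byLang map[string][]string
	if err := json.Unmarshal(data, &byLang); err != nil {
		return nil, err
	}
	total := 0
	for _, files := range byLang {
		total += len(files)
	}
	shares := make([]share, 0, len(byLang))
	for lang, files := range byLang {
		pct := 100 * float64(len(files)) / float64(total)
		shares = append(shares, share{Language: lang, Percent: pct})
	}
	// Largest share first; ties are broken by name for a stable order.
	sort.Slice(shares, func(i, j int) bool {
		if shares[i].Percent != shares[j].Percent {
			return shares[i].Percent > shares[j].Percent
		}
		return shares[i].Language < shares[j].Language
	})
	return shares, nil
}

func main() {
	// Sample copied from the JSON output above.
	out := []byte(`{"Gnuplot":["plot-histogram.gp"],"Go":["parser/main.go"],"Ruby":["linguist-samples.rb","linguist-total.rb"],"Shell":["parse.sh","plot-histogram.sh","run-benchmark.sh","run-slow-benchmark.sh","run.sh"]}`)
	shares, err := sharesFromJSON(out)
	if err != nil {
		panic(err)
	}
	for _, s := range shares {
		fmt.Printf("%.2f%%    %s\n", s.Percent, s.Language)
	}
}
```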

Note that although enry's CLI is compatible with linguist's, its main point is that **_enry doesn't need a git repository to work!_**

Java bindings
------------

Generated Java bindings using a C-shared library and JNI are located under [`java`](https://github.com/src-d/enry/blob/master/java).

Development
------------

*enry* re-uses parts of the original [linguist](https://github.com/github/linguist) to generate internal data structures. To update to the latest upstream and generate all the necessary code, run:

    git clone https://github.com/github/linguist.git .linguist
    # update commit in generator_test.go (to re-generate .gold fixtures)
    # https://github.com/src-d/enry/blob/13d3d66d37a87f23a013246a1b0678c9ee3d524b/internal/code-generator/generator/generator_test.go#L18
    go generate

We update enry when changes are made to the following files in linguist's master branch:

* [languages.yml](https://github.com/github/linguist/blob/master/lib/linguist/languages.yml)
* [heuristics.yml](https://github.com/github/linguist/blob/master/lib/linguist/heuristics.yml)
* [vendor.yml](https://github.com/github/linguist/blob/master/lib/linguist/vendor.yml)
* [documentation.yml](https://github.com/github/linguist/blob/master/lib/linguist/documentation.yml)

Currently we don't have any procedure established to automatically detect changes in the linguist project and regenerate the code,
so we update the generated code as needed, without any specific criteria.

If you want to update *enry* because of changes in linguist, you can run the *go
generate* command and open a pull request that only contains the changes in
generated files (those files in the subdirectory [data](https://github.com/src-d/enry/blob/master/data)).

To run the tests,

    make test


Divergences from linguist
------------

`enry` [CLI tool](#cli) does *not* require a full Git repository to be present in the filesystem in order to report languages.

Using [linguist/samples](https://github.com/github/linguist/tree/master/samples)
as a set for the tests, the following issues were found:

* [Heuristics for ".es" extension](https://github.com/github/linguist/blob/e761f9b013e5b61161481fcb898b59721ee40e3d/lib/linguist/heuristics.yml#L103) in JavaScript could not be parsed due to an unsupported backreference in the RE2 regexp engine.

* As of [Linguist v5.3.2](https://github.com/github/linguist/releases/tag/v5.3.2), it uses a [flex-based scanner in C for tokenization](https://github.com/github/linguist/pull/3846). Enry still uses the [extract_token](https://github.com/github/linguist/pull/3846/files#diff-d5179df0b71620e3fac4535cd1368d15L60) regex-based algorithm. See [#193](https://github.com/src-d/enry/issues/193).

* The Bayesian classifier can't distinguish "SQL" from "PLpgSQL". See [#194](https://github.com/src-d/enry/issues/194).

* Detection of [generated files](https://github.com/github/linguist/blob/bf95666fc15e49d556f2def4d0a85338423c25f3/lib/linguist/generated.rb#L53) is not supported yet,
 so they are not excluded from CLI output. See [#213](https://github.com/src-d/enry/issues/213).

* XML detection strategy is not implemented. See [#192](https://github.com/src-d/enry/issues/192).

* Overriding languages and types through `.gitattributes` is not yet supported. See [#18](https://github.com/src-d/enry/issues/18).

* `enry` CLI output does NOT exclude `.gitignore`ed files and git submodules, whereas linguist does.

For all the cases above that have an issue number, we plan to update enry to match Linguist's behaviour.
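
The first item in the list follows from Go's standard `regexp` package, which implements RE2 syntax and rejects backreferences at compile time. A minimal sketch of that behaviour; the helper `compiles` is hypothetical and the patterns are illustrative, not a copy of the actual ".es" heuristic:

```go
package main

import (
	"fmt"
	"regexp"
)

// compiles reports whether Go's standard regexp package (RE2 syntax)
// accepts the given pattern.
func compiles(pattern string) bool {
	_, err := regexp.Compile(pattern)
	return err == nil
}

func main() {
	// Two capture groups, no backreference: valid RE2.
	fmt.Println(compiles(`(['"])use strict(['"])`)) // true

	// \1 refers back to the first group: not supported by RE2.
	fmt.Println(compiles(`(['"])use strict\1`)) // false
}
```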


Benchmarks
------------

Enry's language detection has been compared with Linguist's. To do that, Linguist's project directory [*linguist/samples*](https://github.com/github/linguist/tree/master/samples) was used as the set of files to run benchmarks against.

We got these results:

![histogram](benchmarks/histogram/distribution.png)

The histogram shows the number of files whose language detection time
fell within each time interval on the x axis.

So you can see that most of the files were detected faster by enry.

We found a few cases where enry is slower than linguist. This is due to
Go's regexp engine being slower than Ruby's, which uses the [oniguruma](https://github.com/kkos/oniguruma) library, written in C.

You can find scripts and additional information (like the software and hardware used
and the benchmark results per sample file) in the [*benchmarks*](https://github.com/src-d/enry/blob/master/benchmarks) directory.


### Benchmark Dependencies
As the benchmarks depend on Ruby and the github-linguist gem, make sure you:
 - have Ruby (e.g. using [`rbenv`](https://github.com/rbenv/rbenv)) and [`bundler`](https://bundler.io/) installed
 - have Docker installed
 - have the [native dependencies](https://github.com/github/linguist/#dependencies) installed
 - build the gem: `cd .linguist && bundle install && rake build_gem && cd -`
 - install it: `gem install --no-rdoc --no-ri --local .linguist/github-linguist-*.gem`


### How to reproduce current results

If you want to reproduce the same benchmarks as reported above:
 - Make sure all [dependencies](#benchmark-dependencies) are installed
 - Install [gnuplot](http://gnuplot.info) (in order to plot the histogram)
 - Run `ENRY_TEST_REPO="$PWD/.linguist" benchmarks/run.sh` (takes ~15h)

This runs the benchmarks for enry and linguist, parses the output, creates CSV files and plots the histogram.

### Quick
To run quicker benchmarks you can either run

    make benchmarks

to get average times for the main detection function and strategies over the whole samples set, or

    make benchmarks-samples

if you want to see measurements per sample file.


Why Enry?
------------

In the movie [My Fair Lady](https://en.wikipedia.org/wiki/My_Fair_Lady), [Professor Henry Higgins](http://www.imdb.com/character/ch0011719/?ref_=tt_cl_t2) is one of the main characters. Henry is a linguist and at the very beginning of the movie enjoys guessing the origin of people based on their accent.

`Enry Iggins` is how [Eliza Doolittle](http://www.imdb.com/character/ch0011720/?ref_=tt_cl_t1) [pronounces](https://www.youtube.com/watch?v=pwNKyTktDIE) the name of the Professor during the first half of the movie.


License
------------

Apache License, Version 2.0. See [LICENSE](LICENSE)