# go-enry [![GoDoc](https://godoc.org/gitlab.com/thomasboni/go-enry?status.svg)](https://pkg.go.dev/gitlab.com/thomasboni/go-enry/v2) [![Test](https://gitlab.com/thomasboni/go-enry/workflows/Test/badge.svg)](https://gitlab.com/thomasboni/go-enry/actions?query=workflow%3ATest+branch%3Amaster) [![codecov](https://codecov.io/gh/go-enry/go-enry/branch/master/graph/badge.svg)](https://codecov.io/gh/go-enry/go-enry)

Programming language detector and toolbox to ignore binary or vendored files. _enry_ started as a port to _Go_ of the original [Linguist](https://github.com/github/linguist) _Ruby_ library and offers roughly _2x better performance_.

- [CLI](#cli)
- [Library](#library)
  - [Use cases](#use-cases)
    - [By filename](#by-filename)
    - [By text](#by-text)
    - [By file](#by-file)
    - [Filtering](#filtering-vendoring-binaries-etc)
    - [Coloring](#language-colors-and-groups)
  - [Languages](#languages)
    - [Go](#go)
    - [Java bindings](#java-bindings)
    - [Python bindings](#python-bindings)
    - [Rust bindings](#rust-bindings)
- [Divergences from linguist](#divergences-from-linguist)
- [Benchmarks](#benchmarks)
- [Why Enry?](#why-enry)
- [Development](#development)
  - [Sync with github/linguist upstream](#sync-with-githublinguist-upstream)
- [Misc](#misc)
- [License](#license)

# CLI

The CLI binary is hosted in a separate repository [go-enry/enry](https://github.com/go-enry/enry).

# Library

_enry_ is also a Go library for guessing a programming language. It exposes its API through FFI to multiple programming environments.

## Use cases

_enry_ guesses a programming language using a sequence of matching _strategies_ that are
applied progressively to narrow down the possible options. Strategies differ in the type
of input data they need to make a decision: file name, extension, the first line of the file, the full content of the file, etc.

Depending on the available input data, the enry API can be roughly divided into the following categories or use cases.

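As a rough illustration of the narrowing idea described above, the sketch below chains two toy strategies using only the standard library. The `strategy` type, the `byExtension` and `byContent` functions and their tiny lookup tables are hypothetical names invented for this example. They are not enry's implementation or data.

```go
package main

import (
	"bytes"
	"fmt"
	"path/filepath"
)

// strategy is a hypothetical signature for this sketch: it receives the
// candidates left by earlier strategies and returns its own guesses.
type strategy func(filename string, content []byte, candidates []string) []string

// byExtension is a toy strategy backed by a made-up two-entry table.
func byExtension(filename string, _ []byte, _ []string) []string {
	table := map[string][]string{
		".go": {"Go"},
		".m":  {"Matlab", "Objective-C"},
	}
	return table[filepath.Ext(filename)]
}

// byContent is a toy heuristic: it picks Objective-C among the candidates
// when the content contains "@interface".
func byContent(_ string, content []byte, candidates []string) []string {
	if !bytes.Contains(content, []byte("@interface")) {
		return nil
	}
	for _, c := range candidates {
		if c == "Objective-C" {
			return []string{c}
		}
	}
	return nil
}

// detect applies the strategies in order and stops as soon as a single
// candidate remains. A strategy that returns nothing leaves the
// candidates from the previous step untouched.
func detect(filename string, content []byte, strategies ...strategy) []string {
	var candidates []string
	for _, s := range strategies {
		if langs := s(filename, content, candidates); len(langs) > 0 {
			candidates = langs
		}
		if len(candidates) == 1 {
			break
		}
	}
	return candidates
}

func main() {
	fmt.Println(detect("main.go", nil, byExtension, byContent))
	fmt.Println(detect("view.m", []byte("@interface View"), byExtension, byContent))
	fmt.Println(detect("plot.m", []byte("x = 1;"), byExtension, byContent))
}
```

In this sketch the extension alone settles `main.go`, the content check resolves `view.m`, and `plot.m` stays ambiguous because no strategy could narrow it further.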
### By filename

The following functions require only the name of the file to make a guess:

- `GetLanguageByExtension` uses only the file extension (which may be ambiguous)
- `GetLanguageByFilename` is useful for cases like `.gitignore`, `.bashrc`, etc.
- all [filtering helpers](#filtering-vendoring-binaries-etc)

Please note that such guesses are expected not to be very accurate.

### By text

To make a guess based only on the content of the file or a text snippet, use:

- `GetLanguageByShebang` reads only the first line of text to identify the [shebang](<https://en.wikipedia.org/wiki/Shebang_(Unix)>).
- `GetLanguageByModeline` for cases when a Vim/Emacs modeline, e.g. `/* vim: set ft=cpp: */`, may be present at the head or tail of the text.
- `GetLanguageByClassifier` uses a Bayesian classifier trained on all the `./samples/` from Linguist.

  It is usually a last-resort strategy used to disambiguate the guesses of the previous strategies, and thus it requires a list of "candidate" guesses. If more intelligent hypotheses are not available, one can provide a list of all known languages (the keys of `data.LanguagesLogProbabilities`) as possible candidates, at the price of possibly suboptimal accuracy.

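To make the role of the candidate list more concrete, here is a minimal stdlib-only sketch of a classifier that scores only the candidates it is given. The `tokenLogProbs` table, the `unseenLogProb` penalty and the `classify` function are made up for this example and assume a simple sum of per-token log-probabilities. enry's real model and tokenizer differ.

```go
package main

import (
	"fmt"
	"math"
	"strings"
)

// tokenLogProbs holds made-up per-language token log-probabilities,
// used only for this sketch.
var tokenLogProbs = map[string]map[string]float64{
	"Go":     {"func": -1.0, "package": -1.2, "def": -6.0},
	"Python": {"def": -1.0, "import": -1.5, "func": -6.0},
	"Ruby":   {"def": -1.1, "end": -1.3, "func": -6.0},
}

// unseenLogProb is an assumed penalty for tokens a language never saw.
const unseenLogProb = -10.0

// classify scores only the given candidates and returns the best one.
// Each score is the sum of the log-probabilities of the text's tokens.
func classify(text string, candidates []string) string {
	best, bestScore := "", math.Inf(-1)
	for _, lang := range candidates {
		score := 0.0
		for _, tok := range strings.Fields(text) {
			if lp, ok := tokenLogProbs[lang][tok]; ok {
				score += lp
			} else {
				score += unseenLogProb
			}
		}
		if score > bestScore {
			best, bestScore = lang, score
		}
	}
	return best
}

func main() {
	text := "def foo end"

	// Candidates narrowed down by an earlier strategy.
	fmt.Println(classify(text, []string{"Go", "Python"}))

	// Fallback: every known language is a candidate.
	all := make([]string, 0, len(tokenLogProbs))
	for lang := range tokenLogProbs {
		all = append(all, lang)
	}
	fmt.Println(classify(text, all))
}
```

The same text gets a different answer depending on the candidates, which is why the quality of the candidate list matters for this strategy.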
### By file

The most accurate guess is possible when both the file name and the content are available:

- `GetLanguagesByContent` only uses the file extension and a set of regexp-based content heuristics.
- `GetLanguages` uses the full set of matching strategies and is expected to be the most accurate.

### Filtering: vendoring, binaries, etc

_enry_ exposes a set of file-level `Is*` helpers to simplify filtering out files that are less interesting for the purpose of source code analysis:

- `IsBinary`
- `IsVendor`
- `IsConfiguration`
- `IsDocumentation`
- `IsDotFile`
- `IsImage`
- `IsTest`
- `IsGenerated`

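The following stdlib-only sketch shows the kind of filtering loop such helpers are meant for. `looksBinary`, `isDotFile` and `inVendorDir` are simplified, hypothetical stand-ins written for this example (for instance, the binary check just looks for a NUL byte near the start of the content). They do not reproduce enry's actual rules.

```go
package main

import (
	"bytes"
	"fmt"
	"path"
	"sort"
	"strings"
)

// looksBinary is a hypothetical stand-in for a binary check: it reports
// whether a NUL byte occurs in the first 8000 bytes of the content.
func looksBinary(content []byte) bool {
	if len(content) > 8000 {
		content = content[:8000]
	}
	return bytes.IndexByte(content, 0) != -1
}

// isDotFile is a hypothetical stand-in that matches names starting with ".".
func isDotFile(name string) bool {
	return strings.HasPrefix(path.Base(name), ".")
}

// inVendorDir is a hypothetical stand-in that only knows two directory names.
func inVendorDir(name string) bool {
	for _, part := range strings.Split(path.Dir(name), "/") {
		if part == "vendor" || part == "node_modules" {
			return true
		}
	}
	return false
}

// sourceFiles returns, in sorted order, the paths that pass all filters.
func sourceFiles(files map[string][]byte) []string {
	var kept []string
	for name, content := range files {
		if isDotFile(name) || inVendorDir(name) || looksBinary(content) {
			continue
		}
		kept = append(kept, name)
	}
	sort.Strings(kept)
	return kept
}

func main() {
	files := map[string][]byte{
		"main.go":           []byte("package main\n"),
		".gitignore":        []byte("*.o\n"),
		"vendor/lib/lib.go": []byte("package lib\n"),
		"assets/logo.png":   {0x89, 'P', 'N', 'G', 0x00},
		"docs/guide.md":     []byte("# Guide\n"),
	}
	fmt.Println(sourceFiles(files))
}
```

In real code the stand-ins would be replaced by the `Is*` helpers listed above, keeping the same skip-and-continue shape.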
### Language colors and groups

_enry_ exposes functions to get a language's color and group, for example for presenting statistics in graphs:

- `GetColor`
- `GetLanguageGroup` can be used to group similar languages together, e.g. for `Less` this function will return `CSS`

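One possible use of grouping is to merge per-language statistics before plotting them. The sketch below is stdlib-only: the `groups` table contains just the `Less` to `CSS` pair mentioned above, and `groupOf` and `byGroup` are hypothetical helpers. Falling back to the language's own name when no group is known is an assumption of this sketch, not documented enry behavior.

```go
package main

import "fmt"

// groups is a one-entry excerpt of a language-to-group table, taken from
// the Less -> CSS example in the text.
var groups = map[string]string{
	"Less": "CSS",
}

// groupOf returns the group of a language, or the language itself when
// the table has no group for it (an assumption of this sketch).
func groupOf(lang string) string {
	if g, ok := groups[lang]; ok {
		return g
	}
	return lang
}

// byGroup sums per-language byte counts into per-group totals.
func byGroup(bytesPerLang map[string]int) map[string]int {
	totals := make(map[string]int)
	for lang, n := range bytesPerLang {
		totals[groupOf(lang)] += n
	}
	return totals
}

func main() {
	totals := byGroup(map[string]int{"CSS": 100, "Less": 40, "Go": 500})
	fmt.Println(totals["CSS"], totals["Go"])
}
```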
## Languages

### Go

In a [Go module](https://github.com/golang/go/wiki/Modules),
add `enry` to the module by running:

```sh
go get gitlab.com/thomasboni/go-enry/v2
```

The rest of the examples assume you have either done this or fetched the
library into your `GOPATH`.

```go
// The examples here and below assume you have imported the library.
import "gitlab.com/thomasboni/go-enry/v2"

lang, safe := enry.GetLanguageByExtension("foo.go")
fmt.Println(lang, safe)
// result: Go true

// lang and safe are already declared above, so assign with "=".
lang, safe = enry.GetLanguageByContent("foo.m", []byte("<matlab-code>"))
fmt.Println(lang, safe)
// result: Matlab true

lang, safe = enry.GetLanguageByContent("bar.m", []byte("<objective-c-code>"))
fmt.Println(lang, safe)
// result: Objective-C true

// all strategies together; GetLanguage returns only the language name
lang = enry.GetLanguage("foo.cpp", []byte("<cpp-code>"))
fmt.Println(lang)
// result: C++
```

Note that the returned boolean value `safe` is `true` if there is only one possible language detected.

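The relationship between a candidate list and the `safe` flag can be pictured with the small sketch below. `firstCandidate` is a hypothetical helper written for this illustration and is not part of the enry API.

```go
package main

import "fmt"

// firstCandidate is a hypothetical helper: it returns the first candidate
// and reports the guess as safe only when it was the sole candidate.
func firstCandidate(candidates []string) (lang string, safe bool) {
	if len(candidates) == 0 {
		return "", false
	}
	return candidates[0], len(candidates) == 1
}

func main() {
	lang, safe := firstCandidate([]string{"Go"})
	fmt.Println(lang, safe)

	lang, safe = firstCandidate([]string{"C", "C++", "Objective-C"})
	fmt.Println(lang, safe)
}
```

A caller that needs certainty can treat `safe == false` as a signal to fall back to the plural API and inspect every candidate.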
A plural version of the same API allows getting a list of all possible languages for a given file.

```go
langs := enry.GetLanguages("foo.h", []byte("<cpp-code>"))
// result: []string{"C", "C++", "Objective-C"}

// langs is already declared above, so assign with "=".
langs = enry.GetLanguagesByExtension("foo.asc", []byte("<content>"), nil)
// result: []string{"AGS Script", "AsciiDoc", "Public Key"}

langs = enry.GetLanguagesByFilename("Gemfile", []byte("<content>"), []string{})
// result: []string{"Ruby"}
```

### Java bindings

Generated Java bindings using a C shared library and JNI are available under [`java`](https://gitlab.com/thomasboni/go-enry/blob/master/java).

A library is published on Maven as [tech.sourced:enry-java](https://mvnrepository.com/artifact/tech.sourced/enry-java) for macOS and Linux platforms. Windows support is planned under [src-d/enry#150](https://github.com/src-d/enry/issues/150).

### Python bindings

Generated Python bindings using a C shared library and cffi are WIP under [src-d/enry#154](https://github.com/src-d/enry/issues/154).

A library is going to be published on PyPI as [enry](https://pypi.org/project/enry/) for
macOS and Linux platforms. Windows support is planned under [src-d/enry#150](https://github.com/src-d/enry/issues/150).

### Rust bindings

Generated Rust bindings using a C static library are available at [go-enry/rs-enry](https://github.com/go-enry/rs-enry).

## Divergences from Linguist

The `enry` library is based on the data from `github/linguist` version **v7.20.0**.

When parsing [linguist/samples](https://github.com/github/linguist/tree/master/samples), the following `enry` results differ from Linguist's:

- [Heuristics for ".txt" extension](https://github.com/github/linguist/blob/8083cb5a89cee2d99f5a988f165994d0243f0d1e/lib/linguist/heuristics.yml#L521) in Vim Help File could not be parsed, due to unsupported negative lookahead in the RE2 regexp engine.

- [Heuristics for ".sol" extension](https://github.com/github/linguist/blob/8083cb5a89cee2d99f5a988f165994d0243f0d1e/lib/linguist/heuristics.yml#L464) in Solidity could not be parsed, due to unsupported negative lookahead in the RE2 regexp engine.

- [Heuristics for ".es" extension](https://github.com/github/linguist/blob/e761f9b013e5b61161481fcb898b59721ee40e3d/lib/linguist/heuristics.yml#L103) in JavaScript could not be parsed, due to unsupported backreference in the RE2 regexp engine.

- [Heuristics for ".rno" extension](https://github.com/github/linguist/blob/3a1bd3c3d3e741a8aaec4704f782e06f5cd2a00d/lib/linguist/heuristics.yml#L365) in RUNOFF could not be parsed, due to unsupported lookahead in the RE2 regexp engine.

- [Heuristics for ".inc" extension](https://github.com/github/linguist/blob/f0e2d0d7f1ce600b2a5acccaef6b149c87d8b99c/lib/linguist/heuristics.yml#L222) in NASL could not be parsed, due to unsupported possessive quantifier in the RE2 regexp engine.

- [Heuristics for ".as" extension](https://github.com/github/linguist/blob/223c00bb80eff04788e29010f98c5778993d2b2a/lib/linguist/heuristics.yml#L67) in ActionScript could not be parsed, due to unsupported positive lookahead in the RE2 regexp engine.

- [Heuristics for ".csc", ".gsc" and ".gsh" extension](https://github.com/github/linguist/blob/7469c7982d93f2ad922230d712f586a353dc1a42/lib/linguist/heuristics.yml#L650-L651) in GSC could not be parsed, due to unsupported non-backtracking subexpressions in the RE2 regexp engine.

- As of [Linguist v5.3.2](https://github.com/github/linguist/releases/tag/v5.3.2) it is using a [flex-based scanner in C for tokenization](https://github.com/github/linguist/pull/3846). Enry still uses the [extract_token](https://github.com/github/linguist/pull/3846/files#diff-d5179df0b71620e3fac4535cd1368d15L60) regex-based algorithm. See [#193](https://github.com/src-d/enry/issues/193).

- The Bayesian classifier can't distinguish "SQL" from "PLpgSQL". See [#194](https://github.com/src-d/enry/issues/194).

- Overriding languages and types through `.gitattributes` is not yet supported. See [#18](https://github.com/src-d/enry/issues/18).

- `enry` CLI output does NOT exclude `.gitignore`ed files and git submodules, as Linguist does.

For all the cases above that have an issue number, we plan to update enry to match Linguist behavior.

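The regexp-related divergences above come from syntax that RE2-style engines reject. Go's standard `regexp` package follows RE2 syntax, so the effect can be reproduced with a short check. The patterns below are illustrative only and are not the actual Linguist heuristics.

```go
package main

import (
	"fmt"
	"regexp"
)

// unsupported reports whether Go's regexp package rejects the pattern.
func unsupported(pattern string) bool {
	_, err := regexp.Compile(pattern)
	return err != nil
}

func main() {
	patterns := []struct{ name, pattern string }{
		{"negative lookahead", `foo(?!bar)`},
		{"positive lookahead", `foo(?=bar)`},
		{"backreference", `(a)\1`},
		{"possessive quantifier", `a++`},
		{"atomic group", `(?>a+)b`},
		{"plain alternation", `foo|bar`},
	}
	for _, p := range patterns {
		fmt.Printf("%-22s unsupported=%v\n", p.name, unsupported(p.pattern))
	}
}
```

Only the last pattern compiles. The others fail at compile time, so a heuristic that depends on them cannot be carried over unchanged.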
## Benchmarks

Enry's language detection has been compared with Linguist's on [_linguist/samples_](https://github.com/github/linguist/tree/master/samples).

We got these results:

![histogram](benchmarks/histogram/distribution.png)

The histogram shows the _number of files_ (y-axis) per _time interval bucket_ (x-axis).
Most of the files were detected faster by enry.

There are several cases where enry is slower than Linguist due to
the Go regexp engine being slower than Ruby's, which is based on the [oniguruma](https://github.com/kkos/oniguruma) library, written in C.

See [instructions](#misc) for running enry with oniguruma.

## Why Enry?

In the movie [My Fair Lady](https://en.wikipedia.org/wiki/My_Fair_Lady), [Professor Henry Higgins](http://www.imdb.com/character/ch0011719/) is a linguist who at the very beginning of the movie enjoys guessing the origin of people based on their accent.

"Enry Iggins" is how [Eliza Doolittle](http://www.imdb.com/character/ch0011720/) [pronounces](https://www.youtube.com/watch?v=pwNKyTktDIE) the name of the Professor.

## Development

To run the tests use:

    go test ./...

Setting `ENRY_TEST_REPO` to the path of an existing checkout of Linguist will avoid cloning it and speed the tests up.
Setting `ENRY_DEBUG=1` will provide insight into the Bayesian classifier building done by `make code-generate`.

### Sync with github/linguist upstream

_enry_ re-uses parts of the original [github/linguist](https://github.com/github/linguist) to generate internal data structures.
In order to update to the latest release of linguist do:

```bash
$ git clone https://github.com/github/linguist.git .linguist
$ cd .linguist; git checkout <release-tag>; cd ..

# put the new release's commit sha in the generator_test.go (to re-generate .gold test fixtures)
# https://gitlab.com/thomasboni/go-enry/blob/13d3d66d37a87f23a013246a1b0678c9ee3d524b/internal/code-generator/generator/generator_test.go#L18

$ make code-generate
```

To stay in sync, enry needs to be updated when a new release of linguist includes changes to any of the following files:

- [languages.yml](https://github.com/github/linguist/blob/master/lib/linguist/languages.yml)
- [heuristics.yml](https://github.com/github/linguist/blob/master/lib/linguist/heuristics.yml)
- [vendor.yml](https://github.com/github/linguist/blob/master/lib/linguist/vendor.yml)
- [documentation.yml](https://github.com/github/linguist/blob/master/lib/linguist/documentation.yml)

There is no automation for detecting changes in the linguist project, so the process above has to be done manually from time to time.

When submitting a pull request syncing up to a new release, please make sure it only contains the changes in
the generated files (in the [data](https://gitlab.com/thomasboni/go-enry/blob/master/data) subdirectory).

Separating all the necessary "manual" code changes into a different PR that includes some background description and an update to the documentation on ["divergences from linguist"](#divergences-from-linguist) is very much appreciated, as it simplifies the maintenance (review/release notes/etc).

## Misc

<details>
  <summary>Running a benchmark & faster regexp engine</summary>

### Benchmark

All benchmark scripts are in the [_benchmarks_](https://gitlab.com/thomasboni/go-enry/blob/master/benchmarks) directory.

#### Dependencies

As the benchmarks depend on Ruby and the GitHub Linguist gem, make sure you have:

- Ruby (e.g. using [`rbenv`](https://github.com/rbenv/rbenv)), [`bundler`](https://bundler.io/) installed
- Docker
- [native dependencies](https://github.com/github/linguist/#dependencies) installed
- Build the gem `cd .linguist && bundle install && rake build_gem && cd -`
- Install it `gem install --no-rdoc --no-ri --local .linguist/github-linguist-*.gem`

#### Quick benchmark

To run the quicker benchmarks, use:

    make benchmarks

This reports average times for the primary detection function and strategies over the whole samples set. If you want to see measurements per sample file use:

    make benchmarks-samples

#### Full benchmark

If you want to reproduce the same benchmarks as reported above:

- Make sure all [dependencies](#dependencies) are installed
- Install [gnuplot](http://gnuplot.info) (in order to plot the histogram)
- Run `ENRY_TEST_REPO="$PWD/.linguist" benchmarks/run.sh` (takes ~15h)

It will run the benchmarks for enry and Linguist, parse the output, create csv files and plot the histogram.

### Faster regexp engine (optional)

[Oniguruma](https://github.com/kkos/oniguruma) is CRuby's regular expression engine.
It is very fast and performs better than the one built into Go. _enry_ supports swapping
between those two engines thanks to the [rubex](https://github.com/moovweb/rubex) project.
The typical overall speedup from using Oniguruma is 1.5-2x. However, it requires CGo and the external shared library.
On macOS with [Homebrew](https://brew.sh/), install it with:

```
brew install oniguruma
```

On Ubuntu, install it with:

```
sudo apt install libonig-dev
```

To build enry with Oniguruma regexps, use the `oniguruma` build tag:

```
go get -v -t --tags oniguruma ./...
```

and then rebuild the project.

</details>

## License

Apache License, Version 2.0. See [LICENSE](LICENSE).