github.com/everdrone/grab@v0.1.7-0.20230416223925-40674b995521/README.md (about)

     1  <div align="center">
     2      <h1>
     3          <img width="750" src="https://raw.githubusercontent.com/everdrone/grab/main/.github/media/Dark@2x.png#gh-light-mode-only" alt="GRAB" />
     4          <img width="750" src="https://raw.githubusercontent.com/everdrone/grab/main/.github/media/Light@2x.png#gh-dark-mode-only" alt="GRAB" />
     5      </h1>
     6      <h3>Greedy, Regex-Aware Binary Downloader</h3>
     7  </div>
     8  
     9  <p align="center">
    10  <a href="https://github.com/everdrone/grab/stargazers">
    11      <img src="https://img.shields.io/github/stars/everdrone/grab?color=8bd5ca&logo=github&logoColor=d9e0ee&labelColor=1e1d2f&style=for-the-badge" alt="Stargazers">
    12  </a>
    13  <a href="https://github.com/everdrone/grab/releases/latest">
    14      <img src="https://img.shields.io/github/v/release/everdrone/grab?color=b7bdf8&logo=&logoColor=d9e0ee&labelColor=1e1d2f&style=for-the-badge" alt="Latest Release">
    15  </a>
    16  <a href="https://app.codecov.io/gh/everdrone/grab" target="_blank">
    17      <img src="https://img.shields.io/codecov/c/github/everdrone/grab?color=c6a0f6&logo=codecov&logoColor=d9e0ee&labelColor=1e1d2f&style=for-the-badge&token=NkRjXNdxZI" alt="Codecov">
    18  </a>
    19  <a href="https://github.com/everdrone/grab/issues">
    20      <img src="https://img.shields.io/github/issues/everdrone/grab?color=f8bd96&logo=&logoColor=d9e0ee&labelColor=1e1d2f&style=for-the-badge" alt="GitHub issues">
    21  </a>
    22  </p>
    23  
    24  # Table of contents
    25  
    26  - [Motivation](#why)
    27  - [Installation](#installation)
    28  - [Usage](#usage)
    29  - [Quickstart](#quickstart)
    30  - [Options](#command-options)
    31  - [Next steps](#next-steps)
    32  
    33  # Why
    34  
    35  This project helps you automate scraping data and downloading assets from the internet. It is built on Go's regular expression engine and HCL for ease of use, performance, and flexibility.
    36  
    37  # Installation
    38  
    39  Download and install the [latest release](https://github.com/everdrone/grab/releases/latest).
    40  
    41  # Usage
    42  
    43  Run the following command to generate a new configuration file in the current directory.
    44  
    45  ```
    46  grab config generate
    47  ```
    48  
    49  > **Note**  
    50  > Grab's configuration file uses [HashiCorp's HCL](https://github.com/hashicorp/hcl).  
    51  > You can always refer to their specification for topics not covered by the documentation in this repo.
    52  
    53  Once you're happy with your configuration, you can validate it by running:
    54  
    55  ```
    56  grab config check
    57  ```
    58  
    59  To scrape and download assets, pass one or more URLs to the `get` subcommand:
    60  
    61  ```ini
    62  # single URL
    63  grab get https://url.to/scrape/files?from
    64  
    65  # list of URLs
    66  grab get urls.ini
    67  
    68  # URLs and lists combined
    69  grab get https://my.url/and urls.ini list.ini
    70  ```
    71  
    72  > **Note**  
    73  > The list of URLs can contain comments, as in the `ini` format: all lines starting with `#` or `;` are ignored.
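
The comment rule for URL lists can be illustrated with a minimal Go sketch. This is not Grab's actual parser: the function name `parseURLList` is hypothetical, and trimming whitespace and skipping blank lines are assumptions made for the example.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// parseURLList returns the URLs found in a list, one per line.
// Lines starting with '#' or ';' are treated as comments and ignored.
// Assumption: surrounding whitespace is trimmed and blank lines are skipped.
func parseURLList(content string) []string {
	var urls []string
	scanner := bufio.NewScanner(strings.NewReader(content))
	for scanner.Scan() {
		line := strings.TrimSpace(scanner.Text())
		if line == "" || strings.HasPrefix(line, "#") || strings.HasPrefix(line, ";") {
			continue
		}
		urls = append(urls, line)
	}
	return urls
}

func main() {
	list := "# galleries\nhttps://example.com/gallery/1\n; skipped\nhttps://example.com/gallery/2\n"
	for _, u := range parseURLList(list) {
		fmt.Println(u)
	}
}
```

Running the sketch prints only the two `https://example.com/gallery/...` lines; both comment lines are dropped.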
    74  
    75  # Quickstart
    76  
    77  The default configuration, generated with `grab config generate`, already works out of the box.
    78  
    79  ```hcl
    80  global {
    81    location = "/home/yourusername/Downloads/grab"
    82  }
    83  
    84  site "unsplash" {
    85    test = "unsplash"
    86  
    87    asset "image" {
    88      pattern = "contentUrl\":\"([^\"]+)\""
    89      capture = 1
    90  
    91      transform filename {
    92        pattern = "(?:.+)photos\\/(.*)"
    93        replace = "$${1}.jpg"
    94      }
    95    }
    96  
    97    info "title" {
    98      pattern = "meta[^>]+property=\"og:title\"[^>]+content=\"(?P<title>[^\"]+)\""
    99      capture = "title"
   100    }
   101  
   102    subdirectory {
   103      pattern = "\\(@(?P<username>\\w+)\\)"
   104      capture = "username"
   105      from    = body
   106    }
   107  }
   108  ```
   109  
   110  For demonstration purposes, you can already download pictures from [Unsplash](https://unsplash.com) with the following command:
   111  
   112  ```
   113  grab get https://unsplash.com/photos/uOi3lg8fGl4
   114  ```
   115  
   116  > **Warning**  
   117  > Please use this tool responsibly. Don't use it for Denial of Service attacks, and don't violate copyright or other intellectual property rights!
   118  
   119  Internally, the program checks each URL passed to `get`. If the URL matches the `test` pattern of any `site` block, the program parses the page and finds all matches for the assets and data defined in that site's `asset` and `info` blocks.
   120  Once all the asset URLs are gathered, the download starts.
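
The matching flow described above can be sketched with Go's standard `regexp` package, using the patterns from the default configuration after HCL unescaping. This is an illustration only, not Grab's implementation: the function names, the sample page body, and the `images.example.com` asset URL are hypothetical, and applying the `transform filename` rule to the asset URL is an assumption.

```go
package main

import (
	"fmt"
	"regexp"
)

// Patterns taken from the default configuration, after HCL unescaping.
var (
	siteTest     = regexp.MustCompile(`unsplash`)
	assetPattern = regexp.MustCompile(`contentUrl":"([^"]+)"`)
	namePattern  = regexp.MustCompile(`(?:.+)photos\/(.*)`)
)

// assetURLs returns capture group 1 of every asset match in body,
// or nil when pageURL does not match the site's test pattern.
func assetURLs(pageURL, body string) []string {
	if !siteTest.MatchString(pageURL) {
		return nil
	}
	var urls []string
	for _, m := range assetPattern.FindAllStringSubmatch(body, -1) {
		urls = append(urls, m[1]) // corresponds to `capture = 1`
	}
	return urls
}

// filename applies the `transform filename` rule to an asset URL (assumed input).
// HCL's "$${1}.jpg" is the literal "${1}.jpg", which Go expands to group 1.
func filename(assetURL string) string {
	return namePattern.ReplaceAllString(assetURL, "${1}.jpg")
}

func main() {
	// Hypothetical page body, for illustration only.
	body := `{"contentUrl":"https://images.example.com/photos/abc123"}`
	for _, u := range assetURLs("https://unsplash.com/photos/uOi3lg8fGl4", body) {
		fmt.Println(u, "->", filename(u))
	}
	// prints: https://images.example.com/photos/abc123 -> abc123.jpg
}
```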
   121  
   122  After running the above command, you should have a new `grab` directory in your `~/Downloads` folder, containing a subdirectory for each site defined in the configuration. Inside each site directory you will find all the assets extracted from the provided URLs.
   123  
   124  The configuration syntax is based on a few fundamental blocks:
   125  
   126  - `global` block defines the main download directory and global network options.
   127  - `site <name>` blocks group other blocks based on the site URL.
   128  - `asset <name>` blocks define what to look for from each site and how to download it.
   129  - `info <name>` blocks define what strings to extract from the page body.
   130  
   131  Additional configuration settings can be specified:
   132  
   133  - `network` blocks to pass headers and other network options when making requests.
   134  - `transform url` blocks to replace the asset URL before downloading.
   135  - `transform filename` blocks to replace the asset's destination path.
   136  - `subdirectory` blocks to organize downloads into subdirectories named by strings present in the page body or URL.
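
The `info` and `subdirectory` blocks in the default configuration select a named capture group instead of a numbered one. The Go sketch below shows how such a lookup can be done with the standard `regexp` package. The helper `namedCapture` and the sample HTML are hypothetical, and how Grab resolves named groups internally is not covered here.

```go
package main

import (
	"fmt"
	"regexp"
)

// Patterns taken from the default configuration, after HCL unescaping.
var (
	titlePattern    = regexp.MustCompile(`meta[^>]+property="og:title"[^>]+content="(?P<title>[^"]+)"`)
	usernamePattern = regexp.MustCompile(`\(@(?P<username>\w+)\)`)
)

// namedCapture returns the text of the named group in the first match of
// pattern within text. It returns false if nothing matches or the group
// does not exist in the pattern.
func namedCapture(pattern *regexp.Regexp, text, group string) (string, bool) {
	m := pattern.FindStringSubmatch(text)
	idx := pattern.SubexpIndex(group)
	if m == nil || idx < 0 {
		return "", false
	}
	return m[idx], true
}

func main() {
	// Hypothetical page body, for illustration only.
	body := `<meta property="og:title" content="Sunset by Jane Doe (@janedoe)">`

	if title, ok := namedCapture(titlePattern, body, "title"); ok {
		fmt.Println("title:", title)
	}
	if user, ok := namedCapture(usernamePattern, body, "username"); ok {
		fmt.Println("subdirectory:", user)
	}
}
```

With this sample body, the `title` group yields `Sunset by Jane Doe (@janedoe)` and the `username` group yields `janedoe`, which is the kind of string a `subdirectory` block with `from = body` would use as a folder name.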
   137  
   138  For a more in-depth look into Grab's configuration options, check out [the guide](/docs/guide.md).
   139  
   140  # Command Options
   141  
   142  To get help about any command, use the `help` subcommand or the `--help` flag:
   143  
   144  ```ini
   145  # to list all available commands:
   146  grab help
   147  
   148  # to show instructions for a specific subcommand:
   149  grab help <subcommand>
   150  ```
   151  
   152  ### `get`
   153  
   154  #### Arguments
   155  
   156  Accepts URLs, paths to files containing lists of URLs, or both at the same time.
   157  
   158  ```sh
   159  # grab get <url|file> [url|file...] [options]
   160  
   161  grab get https://example.com/gallery/1 \
   162           https://example.com/gallery/2 \
   163           path/to/list.ini \
   164           other/file.ini -n
   165  ```
   166  
   167  #### Options
   168  
   169  | Long       | Short | Default | Description                                                                                                                    |
   170  | ---------- | ----- | ------- | ------------------------------------------------------------------------------------------------------------------------------ |
   171  | `force`    | `f`   | `false` | To overwrite already existing files                                                                                            |
   172  | `config`   | `c`   | `nil`   | To specify the path to a configuration file                                                                                    |
   173  | `strict`   | `s`   | `false` | To stop the program at the first encountered error                                                                             |
   174  | `dry-run`  | `n`   | `false` | To send requests without writing to the disk                                                                                   |
   175  | `progress` | `p`   | `false` | To show a progress bar                                                                                                         |
   176  | `quiet`    | `q`   | `false` | To suppress all output to `stdout` (errors will still be printed to `stderr`).<br/>This option takes precedence over `verbose` |
   177  | `verbose`  | `v`   | `1`     | To set the verbosity level:<br/>`-v` is 1, `-vv` is 2 and so on...<br/>`quiet` overrides this option.                          |
   178  
   179  ## Next steps
   180  
   181  - [x] Retries & Timeout
   182  - [x] Network options with inheritance
   183  - [x] URL manipulation
   184  - [x] Destination manipulation
   185  - [x] Improve logging
   186  - [x] Check for updates
   187  - [ ] Display a progress bar
   188  - [ ] Add HCL eval context functions
   189  - [ ] Distribute via various package managers:
   190    - [ ] Homebrew
   191    - [ ] Apt
   192    - [ ] Chocolatey
   193    - [ ] Scoop
   194  - [ ] Scripting language integration
   195  - [ ] Plugin system
   196  - [ ] Sequential jobs (like GitHub workflows)
   197  
   198  ## Credits
   199  
   200  - [Catppuccin](https://github.com/catppuccin/) for the color palette
   201  - [Shields.io](https://github.com/badges/shields) for the badges
   202  
   203  ## License
   204  
   205  Distributed under the [MIT License](/LICENSE).