<div align="center">
  <h1>
    <img width="750" src="https://raw.githubusercontent.com/everdrone/grab/main/.github/media/Dark@2x.png#gh-light-mode-only" alt="GRAB" />
    <img width="750" src="https://raw.githubusercontent.com/everdrone/grab/main/.github/media/Light@2x.png#gh-dark-mode-only" alt="GRAB" />
  </h1>
  <h3>Greedy, Regex-Aware Binary Downloader</h3>
</div>

# Table of contents

- [Motivation](#why)
- [Installation](#installation)
- [Usage](#usage)
- [Quickstart](#quickstart)
- [Options](#command-options)
- [Next steps](#next-steps)

# Why

This project helps you automate scraping data and downloading assets from the internet. It is built on Go's regular expression engine and HCL for ease of use, performance, and flexibility.

# Installation

Download and install the [latest release](https://github.com/everdrone/grab/releases/latest).

# Usage

Run the following command to generate a new configuration file in the current directory.

```
grab config generate
```

> **Note**
> Grab's configuration file uses [HashiCorp's HCL](https://github.com/hashicorp/hcl).
> You can always refer to the HCL specification for topics not covered by the documentation in this repo.

Once you're happy with your configuration, you can validate it by running:

```
grab config check
```

To scrape and download assets, pass one or more URLs to the `get` subcommand:

```ini
# single URL
grab get https://url.to/scrape/files?from

# list of URLs
grab get urls.ini

# at least one of each
grab get https://my.url/and urls.ini list.ini
```

> **Note**
> The list of URLs can contain comments, as in the `ini` format: all lines starting with `#` or `;` will be ignored.

# Quickstart

The default configuration, generated with `grab config generate`, already works out of the box.

```hcl
global {
  location = "/home/yourusername/Downloads/grab"
}

site "unsplash" {
  test = "unsplash"

  asset "image" {
    pattern = "contentUrl\":\"([^\"]+)\""
    capture = 1

    transform filename {
      pattern = "(?:.+)photos\\/(.*)"
      replace = "$${1}.jpg"
    }
  }

  info "title" {
    pattern = "meta[^>]+property=\"og:title\"[^>]+content=\"(?P<title>[^\"]+)\""
    capture = "title"
  }

  subdirectory {
    pattern = "\\(@(?P<username>\\w+)\\)"
    capture = "username"
    from = body
  }
}
```

For demonstration purposes, we can already download pictures from [unsplash](https://unsplash.com) by using the following command:

```
grab get https://unsplash.com/photos/uOi3lg8fGl4
```

> **Warning**
> Please use this tool responsibly. Don't use this tool for Denial of Service attacks! Don't violate copyright or intellectual property!
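
To make the patterns in the default configuration more concrete, the sketch below runs them through Go's standard `regexp` package. It is not Grab's own code: how Grab applies the patterns internally is an assumption here, the function names are hypothetical, and the sample page body and `images.example.com` URL are made up for illustration. The patterns are the ones from the configuration above, with the HCL string escaping (`\"`, `\\`, `$${`) removed.

```go
package main

import (
	"fmt"
	"regexp"
)

// Patterns from the default configuration, written as Go raw strings.
var (
	assetPattern    = regexp.MustCompile(`contentUrl":"([^"]+)"`)
	filenamePattern = regexp.MustCompile(`(?:.+)photos\/(.*)`)
	subdirPattern   = regexp.MustCompile(`\(@(?P<username>\w+)\)`)
)

// extractAssets (hypothetical helper) returns capture group 1 of every
// asset match found in the page body, mirroring `capture = 1`.
func extractAssets(body string) []string {
	var urls []string
	for _, m := range assetPattern.FindAllStringSubmatch(body, -1) {
		urls = append(urls, m[1])
	}
	return urls
}

// filenameFor (hypothetical helper) applies the `transform filename`
// replacement "${1}.jpg" to an asset URL.
func filenameFor(assetURL string) string {
	return filenamePattern.ReplaceAllString(assetURL, "${1}.jpg")
}

// subdirectoryFor (hypothetical helper) returns the named "username"
// capture from the page body, or "" when the pattern does not match.
func subdirectoryFor(body string) string {
	m := subdirPattern.FindStringSubmatch(body)
	if m == nil {
		return ""
	}
	return m[subdirPattern.SubexpIndex("username")]
}

func main() {
	// Made-up page body; the real markup on unsplash.com may differ.
	body := `<title>Sunset (@jane_doe)</title>{"contentUrl":"https://images.example.com/photos/abc123"}`

	for _, u := range extractAssets(body) {
		fmt.Println(subdirectoryFor(body), filenameFor(u), u)
	}
}
```

Note that `(?:.+)` is greedy, so when a URL contains `photos/` more than once, the filename is taken from the text after the last occurrence.
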

Internally, the program checks each URL passed to `get`. If the URL matches the `test` pattern of any `site` block, the program parses the page and finds all matches for the assets and data defined in that site's `asset` and `info` blocks.
Once all the asset URLs are gathered, the download starts.

After running the above command, you should have a new `grab` directory in your `~/Downloads` folder, containing a subdirectory for each site defined in the configuration. Inside each site directory you will find all the assets extracted from the provided URLs.

The configuration syntax is based on a few fundamental blocks:

- `global` block defines the main download directory and global network options.
- `site <name>` blocks group other blocks based on the site URL.
- `asset <name>` blocks define what to look for on each site and how to download it.
- `info <name>` blocks define what strings to extract from the page body.

Additional configuration settings can be specified:

- `network` blocks to pass headers and other network options when making requests.
- `transform url` blocks to replace the asset URL before downloading.
- `transform filename` blocks to replace the asset's destination path.
- `subdirectory` blocks to organize downloads into subdirectories named by strings present in the page body or URL.

For a more in-depth look into Grab's configuration options, check out [the guide](/docs/guide.md).

# Command Options

To get help about any command, use the `help` subcommand or the `--help` flag:

```ini
# to list all available commands:
grab help

# to show instructions for a specific subcommand:
grab help <subcommand>
```

### `get`

#### Arguments

Accepts URLs, paths to files containing lists of URLs, or both at the same time.

```sh
# grab get <url|file> [url|file...] [options]

grab get https://example.com/gallery/1 \
         https://example.com/gallery/2 \
         path/to/list.ini \
         other/file.ini -n
```

#### Options

| Long       | Short | Default | Description                                                                                                                    |
| ---------- | ----- | ------- | ------------------------------------------------------------------------------------------------------------------------------ |
| `force`    | `f`   | `false` | To overwrite already existing files                                                                                            |
| `config`   | `c`   | `nil`   | To specify the path to a configuration file                                                                                    |
| `strict`   | `s`   | `false` | To stop the program at the first encountered error                                                                             |
| `dry-run`  | `n`   | `false` | To send requests without writing to the disk                                                                                   |
| `progress` | `p`   | `false` | To show a progress bar                                                                                                         |
| `quiet`    | `q`   | `false` | To suppress all output to `stdout` (errors will still be printed to `stderr`).<br/>This option takes precedence over `verbose` |
| `verbose`  | `v`   | `1`     | To set the verbosity level:<br/>`-v` is 1, `-vv` is 2 and so on...<br/>`quiet` overrides this option.                          |

## Next steps

- [x] Retries & Timeout
- [x] Network options with inheritance
- [x] URL manipulation
- [x] Destination manipulation
- [x] Improve logging
- [x] Check for updates
- [ ] Display a progress bar
- [ ] Add HCL eval context functions
- [ ] Distribute via various package managers:
  - [ ] Homebrew
  - [ ] Apt
  - [ ] Chocolatey
  - [ ] Scoop
- [ ] Scripting language integration
- [ ] Plugin system
- [ ] Sequential jobs (like GitHub workflows)

## Credits

- [Catppuccin](https://github.com/catppuccin/) for the color palette
- [Shields.io](https://github.com/badges/shields) for the badges

## License

Distributed under the [MIT License](/LICENSE).