---
title: "Website"
lang: "en-US"
draft: false
description: "Learn how to set up a VDP Website connector https://github.com/instill-ai/instill-core"
---

The Website component is a data connector that allows users to scrape websites.
It can carry out the following tasks:

- [Scrape Website](#scrape-website)

## Release Stage

`Alpha`

## Configuration

The component configuration is defined and maintained [here](https://github.com/instill-ai/component/blob/main/pkg/connector/website/v0/config/definition.json).

## Supported Tasks

### Scrape Website

Scrape the website contents.

| Input | ID | Type | Description |
| :--- | :--- | :--- | :--- |
| Task ID (required) | `task` | string | `TASK_SCRAPE_WEBSITE` |
| Query (required) | `target_url` | string | The root URL to scrape. All links on this page will be scraped, and all links on those pages, and so on. |
| Allowed Domains | `allowed_domains` | array[string] | A list of domains that are allowed to be scraped. If empty, all domains are allowed. |
| Max Number of Pages (required) | `max_k` | integer | The maximum number of pages to return. If set to 0, all pages are returned. If set to a positive integer, at most `max_k` pages are returned. |
| Include Link Text | `include_link_text` | boolean | Whether to scrape the link and include the text of the link associated with this page in the `link_text` field. |
| Include Link HTML | `include_link_html` | boolean | Whether to scrape the link and include the raw HTML of the link associated with this page in the `link_html` field. |

| Output | ID | Type | Description |
| :--- | :--- | :--- | :--- |
| Pages | `pages` | array[object] | The scraped webpages. |
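
As an illustration of the `TASK_SCRAPE_WEBSITE` input fields listed above, the following minimal sketch assembles them into a JSON object. The helper name `build_scrape_input` is hypothetical and not part of the component, the rejection of negative `max_k` values is an assumption (the table only defines 0 and positive integers), and how the resulting object is handed to a pipeline is not shown here.

```python
import json


def build_scrape_input(target_url, max_k, allowed_domains=None,
                       include_link_text=False, include_link_html=False):
    """Assemble the documented input fields (hypothetical helper)."""
    # Assumption: only non-negative integers are accepted for max_k,
    # since the table defines behaviour for 0 and positive values only.
    if isinstance(max_k, bool) or not isinstance(max_k, int) or max_k < 0:
        raise ValueError("max_k must be a non-negative integer")
    return {
        "task": "TASK_SCRAPE_WEBSITE",
        "target_url": target_url,
        # An empty list means all domains are allowed.
        "allowed_domains": list(allowed_domains or []),
        # 0 means "return all pages"; a positive value caps the result.
        "max_k": max_k,
        "include_link_text": include_link_text,
        "include_link_html": include_link_html,
    }


payload = build_scrape_input(
    "https://example.com",
    max_k=10,
    allowed_domains=["example.com"],
    include_link_text=True,
)
print(json.dumps(payload, indent=2))
```

Restricting `allowed_domains` to the root site and setting a positive `max_k` keeps the crawl bounded, which matters because the scraper follows links recursively from `target_url`.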