---
title: "Website"
lang: "en-US"
draft: false
description: "Learn how to set up a VDP Website connector https://github.com/instill-ai/instill-core"
---

The Website component is a data connector that allows users to scrape websites.
It can carry out the following tasks:

- [Scrape Website](#scrape-website)

## Release Stage

`Alpha`

## Configuration

The component configuration is defined and maintained in [definition.json](https://github.com/instill-ai/component/blob/main/pkg/connector/website/v0/config/definition.json).

## Supported Tasks

### Scrape Website

Scrape the website contents.

| Input | ID | Type | Description |
| :--- | :--- | :--- | :--- |
| Task ID (required) | `task` | string | `TASK_SCRAPE_WEBSITE` |
| Query (required) | `target_url` | string | The root URL to scrape. Links are followed recursively: all links on this page are scraped, then all links on those pages, and so on. |
| Allowed Domains | `allowed_domains` | array[string] | A list of domains that are allowed to be scraped. If empty, all domains are allowed. |
| Max Number of Pages (required) | `max_k` | integer | The maximum number of pages to return. If set to 0, all pages are returned. If set to a positive integer, at most `max_k` pages are returned. |
| Include Link Text | `include_link_text` | boolean | Whether to scrape the link and include the text of the link associated with this page in the `link_text` field. |
| Include Link HTML | `include_link_html` | boolean | Whether to scrape the link and include the raw HTML of the link associated with this page in the `link_html` field. |

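As an illustration of the input fields for this task, here is a minimal Python sketch that assembles them into a dictionary and prints it as JSON. The helper name `build_scrape_input` and the example URL are hypothetical, and rejecting negative `max_k` values is an assumption (only 0 and positive integers are documented). How the resulting object is passed to the component, for example through a pipeline recipe or an API request, is not covered here.

```python
import json


def build_scrape_input(target_url, max_k, allowed_domains=None,
                       include_link_text=False, include_link_html=False):
    """Assemble the documented input fields for TASK_SCRAPE_WEBSITE (hypothetical helper)."""
    # Assumption: only 0 (all pages) and positive integers are meaningful for max_k.
    if max_k < 0:
        raise ValueError("max_k must be 0 (all pages) or a positive integer")
    payload = {
        "task": "TASK_SCRAPE_WEBSITE",
        "target_url": target_url,
        "max_k": max_k,
        "include_link_text": include_link_text,
        "include_link_html": include_link_html,
    }
    # An empty or omitted list means all domains are allowed, so the key is left out.
    if allowed_domains:
        payload["allowed_domains"] = list(allowed_domains)
    return payload


example = build_scrape_input(
    "https://example.com",
    max_k=10,
    allowed_domains=["example.com"],
    include_link_text=True,
)
print(json.dumps(example, indent=2))
```
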
| Output | ID | Type | Description |
| :--- | :--- | :--- | :--- |
| Pages | `pages` | array[object] | The scraped webpages. |
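
To show how the `pages` output might be consumed, the following Python sketch reads a hand-written sample and collects the link text of each page. Only the `pages` key and the `link_text` and `link_html` field names come from the tables above. The sample data and the `extract_texts` helper are hypothetical, and real page objects may carry additional fields.

```python
import json

# Hypothetical sample output; not captured from a real run of the component.
raw_output = """
{
  "pages": [
    {"link_text": "Hello", "link_html": "<html><body>Hello</body></html>"},
    {"link_text": "World"}
  ]
}
"""


def extract_texts(output):
    """Return the link_text of every page that has one (hypothetical helper)."""
    pages = output.get("pages", [])
    # link_text is only expected when include_link_text was enabled, so skip pages without it.
    return [page["link_text"] for page in pages if "link_text" in page]


texts = extract_texts(json.loads(raw_output))
print(texts)  # ['Hello', 'World']
```

Treating `link_text` and `link_html` as optional keeps the reader code working whether or not the corresponding `include_link_text` and `include_link_html` inputs were set.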