---
title: "Text"
lang: "en-US"
draft: false
description: "Learn how to set up a VDP Text operator https://github.com/instill-ai/instill-core"
---

The Text component is an operator that allows users to extract and manipulate text from different sources.
It can carry out the following tasks:

- [Convert To Text](#convert-to-text)
- [Split By Token](#split-by-token)

## Release Stage

`Alpha`

## Configuration

The component configuration is defined and maintained [here](https://github.com/instill-ai/component/blob/main/pkg/operator/text/v0/config/definition.json).

## Supported Tasks

### Convert To Text

Convert a document to plain text.

| Input | ID | Type | Description |
| :--- | :--- | :--- | :--- |
| Task ID (required) | `task` | string | `TASK_CONVERT_TO_TEXT` |
| Document (required) | `doc` | string | Base64-encoded document (PDF, DOC, DOCX, XML, HTML, RTF, etc.) to be converted to plain text |

| Output | ID | Type | Description |
| :--- | :--- | :--- | :--- |
| Body | `body` | string | Plain text converted from the document |
| Meta | `meta` | object | Metadata extracted from the document |
| MSecs | `msecs` | number | Time taken to convert the document, in milliseconds |
| Error | `error` | string | Error message, if any, raised during the conversion process |

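As a rough illustration of the input and output shapes in the tables above, the following Python sketch builds a `TASK_CONVERT_TO_TEXT` input and reads a result. The helper names (`build_convert_to_text_input`, `read_convert_to_text_output`) and the sample output values are hypothetical and not part of the component. The sketch also assumes that an empty `error` string means the conversion succeeded.

```python
import base64


def build_convert_to_text_input(document_bytes: bytes) -> dict:
    """Build an input object shaped like the Convert To Text input table."""
    return {
        "task": "TASK_CONVERT_TO_TEXT",
        # `doc` carries the raw document bytes as a Base64 string.
        "doc": base64.b64encode(document_bytes).decode("ascii"),
    }


def read_convert_to_text_output(output: dict) -> str:
    """Return the plain text, raising if the output carries an error message.

    Assumption: an empty or missing `error` field means success.
    """
    if output.get("error"):
        raise RuntimeError(f"conversion failed: {output['error']}")
    return output["body"]


# Hypothetical document and hypothetical operator output, for illustration only.
sample_input = build_convert_to_text_input(b"<html><body>Hello</body></html>")
sample_output = {"body": "Hello", "meta": {}, "msecs": 3, "error": ""}
plain_text = read_convert_to_text_output(sample_output)
```

How the input object is submitted depends on the pipeline that hosts the operator and is outside the scope of this sketch.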
### Split By Token

Split text into chunks by token count.

| Input | ID | Type | Description |
| :--- | :--- | :--- | :--- |
| Task ID (required) | `task` | string | `TASK_SPLIT_BY_TOKEN` |
| Text (required) | `text` | string | Text to be split |
| Model (required) | `model` | string | ID of the model to use for tokenization |
| Chunk Token Size | `chunk_token_size` | integer | Number of tokens per text chunk |

| Output | ID | Type | Description |
| :--- | :--- | :--- | :--- |
| Token Count | `token_count` | integer | Total count of tokens in the input text |
| Text Chunks | `text_chunks` | array[string] | Text chunks after splitting |
| Number of Text Chunks | `chunk_num` | integer | Total number of output text chunks |
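
To show how the three output fields relate to each other, here is a minimal Python sketch of fixed-size token chunking. It is not the operator's implementation. It swaps in whitespace splitting as a stand-in tokenizer, whereas the operator tokenizes according to the `model` input, so real token counts and chunk boundaries will differ. The function name `split_by_token` is hypothetical.

```python
def split_by_token(text: str, chunk_token_size: int) -> dict:
    """Split text into chunks of at most `chunk_token_size` tokens.

    Assumption: tokens are whitespace-separated words. This is a stand-in
    for the model-specific tokenizer selected by the `model` input.
    """
    if chunk_token_size <= 0:
        raise ValueError("chunk_token_size must be a positive integer")

    tokens = text.split()
    # Group consecutive tokens into chunks; the last chunk may be shorter.
    chunks = [
        " ".join(tokens[start:start + chunk_token_size])
        for start in range(0, len(tokens), chunk_token_size)
    ]
    return {
        "token_count": len(tokens),
        "text_chunks": chunks,
        "chunk_num": len(chunks),
    }


result = split_by_token(
    "the quick brown fox jumps over the lazy dog", chunk_token_size=4
)
```

With nine whitespace tokens and a chunk size of four, the sketch yields three chunks, so `chunk_num` always equals the length of `text_chunks`.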