---
title: "Text"
lang: "en-US"
draft: false
description: "Learn about how to set up a VDP Text operator https://github.com/instill-ai/instill-core"
---

The Text component is an operator that allows users to extract and manipulate text from different sources.
It can carry out the following tasks:

- [Convert To Text](#convert-to-text)
- [Split By Token](#split-by-token)

## Release Stage

`Alpha`

## Configuration

The component configuration is defined and maintained [here](https://github.com/instill-ai/component/blob/main/pkg/operator/text/v0/config/definition.json).

## Supported Tasks

### Convert To Text

Convert document to text.

| Input | ID | Type | Description |
| :--- | :--- | :--- | :--- |
| Task ID (required) | `task` | string | `TASK_CONVERT_TO_TEXT` |
| Document (required) | `doc` | string | Base64 encoded document (PDF, DOC, DOCX, XML, HTML, RTF, etc.) to be converted to plain text |

| Output | ID | Type | Description |
| :--- | :--- | :--- | :--- |
| Body | `body` | string | Plain text converted from the document |
| Meta | `meta` | object | Metadata extracted from the document |
| MSecs | `msecs` | number | Time taken to convert the document |
| Error | `error` | string | Error message if any during the conversion process |

### Split By Token

Split text by token.

| Input | ID | Type | Description |
| :--- | :--- | :--- | :--- |
| Task ID (required) | `task` | string | `TASK_SPLIT_BY_TOKEN` |
| Text (required) | `text` | string | Text to be split |
| Model (required) | `model` | string | ID of the model to use for tokenization |
| Chunk Token Size | `chunk_token_size` | integer | Number of tokens per text chunk |

| Output | ID | Type | Description |
| :--- | :--- | :--- | :--- |
| Token Count | `token_count` | integer | Total count of tokens in the input text |
| Text Chunks | `text_chunks` | array[string] | Text chunks after splitting |
| Number of Text Chunks | `chunk_num` | integer | Total number of output text chunks |
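
The input tables above can be illustrated with a minimal Python sketch that builds the input object for each task. The field IDs (`task`, `doc`, `text`, `model`, `chunk_token_size`) and the output IDs (`token_count`, `text_chunks`, `chunk_num`) come from the tables; everything else is an assumption made for illustration. In particular, the helper names are hypothetical, the model ID is a placeholder, and `toy_split` uses whitespace-separated words as stand-in tokens only to show the shape of the output. It is not the operator's actual tokenizer or splitting logic.

```python
import base64
import json
from typing import Optional


def convert_to_text_input(document_bytes: bytes) -> dict:
    """Build an input object for TASK_CONVERT_TO_TEXT (hypothetical helper)."""
    return {
        "task": "TASK_CONVERT_TO_TEXT",
        # The `doc` field expects the document as a Base64-encoded string.
        "doc": base64.b64encode(document_bytes).decode("ascii"),
    }


def split_by_token_input(
    text: str, model: str, chunk_token_size: Optional[int] = None
) -> dict:
    """Build an input object for TASK_SPLIT_BY_TOKEN (hypothetical helper)."""
    payload = {"task": "TASK_SPLIT_BY_TOKEN", "text": text, "model": model}
    # `chunk_token_size` is not marked as required, so include it only when set.
    if chunk_token_size is not None:
        payload["chunk_token_size"] = chunk_token_size
    return payload


def toy_split(text: str, chunk_token_size: int) -> dict:
    """Illustrate the output shape only, treating whitespace-separated
    words as tokens. This is an assumption; a real model tokenizer will
    produce different token counts and chunk boundaries."""
    tokens = text.split()
    chunks = [
        " ".join(tokens[i : i + chunk_token_size])
        for i in range(0, len(tokens), chunk_token_size)
    ]
    return {
        "token_count": len(tokens),
        "text_chunks": chunks,
        "chunk_num": len(chunks),
    }


if __name__ == "__main__":
    print(json.dumps(convert_to_text_input(b"hello"), indent=2))
    # "placeholder-model-id" is not a real model ID; substitute a supported one.
    print(json.dumps(split_by_token_input("a b c d e", "placeholder-model-id", 2), indent=2))
    print(json.dumps(toy_split("a b c d e", 2), indent=2))
```

Keeping the optional `chunk_token_size` out of the object when it is unset lets the component fall back to whatever default its configuration defines.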