<!--
    Licensed to the Apache Software Foundation (ASF) under one
    or more contributor license agreements.  See the NOTICE file
    distributed with this work for additional information
    regarding copyright ownership.  The ASF licenses this file
    to you under the Apache License, Version 2.0 (the
    "License"); you may not use this file except in compliance
    with the License.  You may obtain a copy of the License at

      http://www.apache.org/licenses/LICENSE-2.0

    Unless required by applicable law or agreed to in writing,
    software distributed under the License is distributed on an
    "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
    KIND, either express or implied.  See the License for the
    specific language governing permissions and limitations
    under the License.
-->

# CdapIO
CdapIO provides I/O transforms for [CDAP](https://cdap.io/) plugins.

## What is CDAP?

[CDAP](https://cdap.io/) is an application platform for building and managing data applications in hybrid and multi-cloud environments.
It gives developers, business analysts, and data scientists a visual rapid development environment along with common patterns,
data, and application abstractions that speed up the development of data applications for a broad range of real-time and batch use cases.

[CDAP plugin](https://github.com/data-integrations) types:
- Batch source
- Batch sink
- Streaming source

To learn more about CDAP plugins, see [io.cdap.cdap.api.annotation.Plugin](https://javadoc.io/static/io.cdap.cdap/cdap-api/6.7.2/io/cdap/cdap/api/annotation/Plugin.html) and the [Data Integrations](https://github.com/data-integrations) plugins repository.

## CDAP Batch plugins support in CDAP IO

CdapIO supports CDAP Batch plugins based on Hadoop [InputFormat](https://hadoop.apache.org/docs/stable/api/org/apache/hadoop/mapred/InputFormat.html) and [OutputFormat](https://hadoop.apache.org/docs/stable/api/org/apache/hadoop/mapred/OutputFormat.html).
Support for CDAP Batch plugins is implemented using [HadoopFormatIO](https://beam.apache.org/documentation/io/built-in/hadoop/).

CdapIO currently supports the following CDAP Batch plugins, which can be referenced directly by their CDAP plugin class:
* [Hubspot Batch Source](https://github.com/data-integrations/hubspot/blob/develop/src/main/java/io/cdap/plugin/hubspot/source/batch/HubspotBatchSource.java)
* [Hubspot Batch Sink](https://github.com/data-integrations/hubspot/blob/develop/src/main/java/io/cdap/plugin/hubspot/sink/batch/HubspotBatchSink.java)
* [Salesforce Batch Source](https://github.com/data-integrations/salesforce/blob/develop/src/main/java/io/cdap/plugin/salesforce/plugin/source/batch/SalesforceBatchSource.java)
* [Salesforce Batch Sink](https://github.com/data-integrations/salesforce/blob/develop/src/main/java/io/cdap/plugin/salesforce/plugin/sink/batch/SalesforceBatchSink.java)
* [ServiceNow Batch Source](https://github.com/data-integrations/servicenow-plugins/blob/develop/src/main/java/io/cdap/plugin/servicenow/source/ServiceNowSource.java)
* [Zendesk Batch Source](https://github.com/data-integrations/zendesk/blob/develop/src/main/java/io/cdap/plugin/zendesk/source/batch/ZendeskBatchSource.java)

Any of these plugins can be used by passing its class to CdapIO, for example:
``CdapIO.withCdapPluginClass(HubspotBatchSource.class)``

### Requirements for Cdap Batch plugins

A CDAP Batch plugin must be based on a Hadoop `InputFormat` or `OutputFormat` implementation.

### How to add support for a new CDAP Batch plugin

To add CdapIO support for a new CDAP Batch [Plugin](src/main/java/org/apache/beam/sdk/io/cdap/Plugin.java), perform the following steps:
1. Find the CDAP plugin artifacts in the Maven Central repository. *Example:* [Hubspot plugin Maven repository](https://mvnrepository.com/artifact/io.cdap/hubspot-plugins/1.0.0). *Note:* To add a custom CDAP plugin, follow the [Sonatype publishing guidelines](https://central.sonatype.org/publish/).
2. Add the CDAP plugin Maven dependency to the `build.gradle` file. *Example:* ``implementation "io.cdap:hubspot-plugins:1.0.0"``.
3. Use the CDAP Batch plugin with CdapIO in one of two ways:
   1. Using the `Plugin.createBatch()` method. Pass the CDAP plugin class and the matching `InputFormat` (or `OutputFormat`) and `InputFormatProvider` (or `OutputFormatProvider`) classes to CdapIO. *Example:*
      ```
      CdapIO.withCdapPlugin(
          Plugin.createBatch(
              EmployeeBatchSource.class,
              EmployeeInputFormat.class,
              EmployeeInputFormatProvider.class));
      ```
   2. Using `MappingUtils`.
      1. Navigate to the [MappingUtils](src/main/java/org/apache/beam/sdk/io/cdap/MappingUtils.java) class.
      2. Modify the `getPluginClassByName()` method.
      3. Add the code that maps the CDAP plugin class to its `InputFormat`/`OutputFormat` and `FormatProvider` classes.
         *Example:*
         ```
         if (pluginClass.equals(EmployeeBatchSource.class)) {
           return Plugin.createBatch(pluginClass,
               EmployeeInputFormat.class,
               EmployeeInputFormatProvider.class);
         }
         ```
      4. After these steps, you can use the CDAP plugin by class like this: ``CdapIO.withCdapPluginClass(EmployeeBatchSource.class)``

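The `MappingUtils` approach above amounts to a lookup from a plugin class to the format classes that back it. The following is a minimal, dependency-free sketch of that idea and not the actual `MappingUtils` code: `BatchPluginSpec`, `PluginRegistrySketch`, and the nested `Employee*` placeholder classes are hypothetical stand-ins for Beam's `Plugin` type and for real CDAP plugin classes, and a map is used here in place of the `if` chain shown in the example.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-in for the three classes passed to Plugin.createBatch(). */
final class BatchPluginSpec {
  final Class<?> pluginClass;
  final Class<?> formatClass;
  final Class<?> formatProviderClass;

  BatchPluginSpec(Class<?> pluginClass, Class<?> formatClass, Class<?> formatProviderClass) {
    this.pluginClass = pluginClass;
    this.formatClass = formatClass;
    this.formatProviderClass = formatProviderClass;
  }
}

/** Illustrative registry that resolves a plugin class to its format classes. */
final class PluginRegistrySketch {
  // Placeholder types standing in for a real CDAP plugin and its Hadoop classes.
  static final class EmployeeBatchSource {}
  static final class EmployeeInputFormat {}
  static final class EmployeeInputFormatProvider {}

  private final Map<Class<?>, BatchPluginSpec> specs = new HashMap<>();

  /** Registers the format classes that back the given plugin class. */
  void register(Class<?> pluginClass, Class<?> formatClass, Class<?> formatProviderClass) {
    specs.put(pluginClass, new BatchPluginSpec(pluginClass, formatClass, formatProviderClass));
  }

  /** Returns the registered spec, or fails loudly for an unmapped plugin class. */
  BatchPluginSpec lookup(Class<?> pluginClass) {
    BatchPluginSpec spec = specs.get(pluginClass);
    if (spec == null) {
      throw new UnsupportedOperationException(
          "No mapping registered for " + pluginClass.getName());
    }
    return spec;
  }

  public static void main(String[] args) {
    PluginRegistrySketch registry = new PluginRegistrySketch();
    registry.register(
        EmployeeBatchSource.class, EmployeeInputFormat.class, EmployeeInputFormatProvider.class);
    // Once registered, the plugin class alone is enough to find its format classes.
    BatchPluginSpec spec = registry.lookup(EmployeeBatchSource.class);
    System.out.println(spec.formatClass.getSimpleName());
  }
}
```
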
To learn more, see the [complete examples](https://github.com/apache/beam/tree/master/examples/java/cdap/src/main/java/org/apache/beam/examples/complete/cdap).

## CDAP Streaming plugins support in CDAP IO

CdapIO supports CDAP Streaming plugins based on [Apache Spark Receiver](https://spark.apache.org/docs/2.4.0/streaming-custom-receivers.html).
Support for CDAP Streaming plugins is implemented using [SparkReceiverIO](https://github.com/apache/beam/tree/master/sdks/java/io/sparkreceiver).

### Requirements for Cdap Streaming plugins

1. The CDAP Streaming plugin must be based on a Spark `Receiver`.
2. The CDAP Streaming plugin must support working with offsets.
   1. The corresponding Spark Receiver must implement the [HasOffset](https://github.com/apache/beam/blob/master/sdks/java/io/sparkreceiver/src/main/java/org/apache/beam/sdk/io/sparkreceiver/HasOffset.java) interface.
   2. Records must have a numeric field that represents the record offset. *Example:* the `RecordId` field for Salesforce and the `vid` field for Hubspot plugins.
      For more details, see the [GetOffsetUtils](https://github.com/apache/beam/tree/master/examples/java/cdap/src/main/java/org/apache/beam/examples/complete/cdap/utils/GetOffsetUtils.java) class from the examples.

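To make the offset requirement concrete, here is a minimal sketch of an offset function and of resuming from a start offset. It is dependency-free and illustrative only: it uses `java.util.function.Function` and a regular expression where a real pipeline would more likely use a Beam serializable function and a JSON parser. The class `OffsetSketch`, its members, the sample records, and the choice to return 0 for a missing field are all assumptions; only the `vid` field name comes from the Hubspot example above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper showing how a numeric record field can serve as an offset. */
final class OffsetSketch {
  // Matches a numeric field such as "vid": 42 (simplified; this is not a JSON parser).
  private static final Pattern VID = Pattern.compile("\"vid\"\\s*:\\s*(\\d+)");

  /** Returns the numeric "vid" of a JSON-like record, or 0 if the field is absent. */
  static final Function<String, Long> OFFSET_FN =
      record -> {
        Matcher m = VID.matcher(record);
        return m.find() ? Long.parseLong(m.group(1)) : 0L;
      };

  /** Keeps only the records whose offset is at or after startOffset. */
  static List<String> resumeFrom(long startOffset, List<String> records) {
    List<String> result = new ArrayList<>();
    for (String record : records) {
      if (OFFSET_FN.apply(record) >= startOffset) {
        result.add(record);
      }
    }
    return result;
  }

  public static void main(String[] args) {
    List<String> records = Arrays.asList("{\"vid\": 1}", "{\"vid\": 2}", "{\"vid\": 3}");
    // Resuming from offset 2 skips the first record and keeps the other two.
    System.out.println(resumeFrom(2L, records));
  }
}
```
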
### How to add support for a new CDAP Streaming plugin

To add CdapIO support for a new CDAP Streaming SparkReceiver [Plugin](src/main/java/org/apache/beam/sdk/io/cdap/Plugin.java), perform the following steps:
1. Find the CDAP plugin artifacts in the Maven Central repository. *Example:* [Hubspot plugin Maven repository](https://mvnrepository.com/artifact/io.cdap/hubspot-plugins/1.0.0). *Note:* To add a custom CDAP plugin, follow the [Sonatype publishing guidelines](https://central.sonatype.org/publish/).
2. Add the CDAP plugin Maven dependency to the `build.gradle` file. *Example:* ``implementation "io.cdap:hubspot-plugins:1.0.0"``.
3. Implement a function that defines how to get the `Long` offset from a record of the CDAP plugin.
   *Example:* see the [GetOffsetUtils](https://github.com/apache/beam/tree/master/examples/java/cdap/src/main/java/org/apache/beam/examples/complete/cdap/utils/GetOffsetUtils.java) class from the examples.
4. Use the CDAP Streaming plugin with CdapIO in one of two ways:
   1. Using the `Plugin.createStreaming()` method. Pass the CDAP plugin class, the `getOffsetFn` function (from step 3), and the Spark `Receiver` class to CdapIO. *Example:*
      ```
      CdapIO.withCdapPlugin(
          Plugin.createStreaming(
              HubspotStreamingSource.class,
              offsetFnForHubspot,
              HubspotReceiver.class));
      ```
   2. Using `MappingUtils`.
      1. Navigate to the [MappingUtils](src/main/java/org/apache/beam/sdk/io/cdap/MappingUtils.java) class.
      2. Modify the `getPluginClassByName()` method.
      3. Add the code that maps the CDAP plugin class to its `getOffsetFn` function and Spark `Receiver` class.
         *Example:*
         ```
         if (pluginClass.equals(HubspotStreamingSource.class)) {
           return Plugin.createStreaming(pluginClass,
               getOffsetFnForHubspot(),
               HubspotReceiver.class);
         }
         ```
      4. After these steps, you can use the CDAP plugin by class like this: ``CdapIO.withCdapPluginClass(HubspotStreamingSource.class)``

To learn more, see the [complete examples](https://github.com/apache/beam/tree/master/examples/java/cdap).

## Dependencies

To use CdapIO, add a dependency on `beam-sdks-java-io-cdap`.

```xml
<dependency>
    <groupId>org.apache.beam</groupId>
    <artifactId>beam-sdks-java-io-cdap</artifactId>
    <version>...</version>
</dependency>
```

## Documentation

The documentation and usage examples are maintained in the JavaDoc for [CdapIO.java](src/main/java/org/apache/beam/sdk/io/cdap/CdapIO.java).