<!--
    Licensed to the Apache Software Foundation (ASF) under one
    or more contributor license agreements.  See the NOTICE file
    distributed with this work for additional information
    regarding copyright ownership.  The ASF licenses this file
    to you under the Apache License, Version 2.0 (the
    "License"); you may not use this file except in compliance
    with the License.  You may obtain a copy of the License at

      http://www.apache.org/licenses/LICENSE-2.0

    Unless required by applicable law or agreed to in writing,
    software distributed under the License is distributed on an
    "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
    KIND, either express or implied.  See the License for the
    specific language governing permissions and limitations
    under the License.
-->

# TypeScript Beam SDK

A library for writing [Apache Beam](https://beam.apache.org/)
pipelines in TypeScript.

As well as being a fully functioning SDK, it serves as a cleaner, more modern
template for building SDKs in other languages
(see README-dev.md for more details).


## Getting started

The TypeScript SDK can be installed with

```
npm install apache-beam
```

Because the SDK makes extensive use of cross-language transforms, it is
recommended that Python 3 and Java be available on the system as well.

A fully working setup is provided as a cloneable
[starter project on GitHub](https://github.com/apache/beam-starter-typescript).


### Running a pipeline

Beam pipelines can be run on a variety of
[runners](https://beam.apache.org/documentation/#runners).
The typical way to create a runner is with
`beam.runners.runner.create_runner({runner: "runnerType", ...})`,
as seen in the [wordcount example](https://github.com/apache/beam/blob/master/sdks/typescript/src/apache_beam/examples/wordcount.ts).

After building, run the pipeline locally with:

```
node path/to/main.js --runner=direct
```

To run against Flink, where the local infrastructure is automatically
downloaded and set up:

```
node path/to/main.js --runner=flink
```

To run on Dataflow:

```
node path/to/main.js \
    --runner=dataflow \
    --project=${PROJECT_ID} \
    --tempLocation=gs://${GCS_BUCKET}/wordcount-js/temp \
    --region=${REGION}
```


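The flags in the commands above appear to correspond to entries in the options object passed to the runner factory. As a rough sketch of that mapping, the snippet below hand-parses `--key=value` arguments into a plain object. `parseFlags`, `PipelineOptions`, and the sample values (`my-project`, `my-bucket`, `us-central1`) are hypothetical and not part of the SDK; a real program would more likely rely on an argument-parsing library.

```typescript
// Hypothetical helper, not part of the Beam SDK: turns
// ["--runner=direct", ...] into { runner: "direct", ... }.
type PipelineOptions = Record<string, string>;

function parseFlags(argv: string[]): PipelineOptions {
  const options: PipelineOptions = {};
  for (const arg of argv) {
    // Only arguments of the form --key=value are considered.
    if (!arg.startsWith("--")) {
      continue;
    }
    const eq = arg.indexOf("=");
    if (eq < 0) {
      continue;
    }
    // Everything after the first "=" is the value, so values such as
    // gs:// paths are kept intact.
    options[arg.slice(2, eq)] = arg.slice(eq + 1);
  }
  return options;
}

// Sample values below are placeholders, not real resources.
const dataflowOptions = parseFlags([
  "--runner=dataflow",
  "--project=my-project",
  "--tempLocation=gs://my-bucket/wordcount-js/temp",
  "--region=us-central1",
]);
```

The resulting object has the `{runner: "runnerType", ...}` shape mentioned above and could be handed to the runner factory.
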
## API

We generally try to apply the concepts of the Beam API in a way that is
idiomatic to TypeScript. Few of the initial developers have extensive
(if any) JavaScript/TypeScript development experience, however, so
feedback is greatly appreciated.

In addition, the SDK departs from the traditional SDKs in some notable ways:

* We take a "relational foundations" approach, where
[schema'd data](https://docs.google.com/document/d/1tnG2DPHZYbsomvihIpXruUmQ12pHGK0QIvXS1FOTgRc/edit#heading=h.puuotbien1gf)
is the primary way to interact with data, and we generally eschew transforms
that require key-value pairs in favor of a more flexible approach of naming
fields or expressions. JavaScript's native Object is used as the row type.

* As part of being schema-first, we also de-emphasize Coders as a first-class
concept in the SDK, relegating them to an advanced feature used for interop.
Though we can infer schemas from individual elements, it is still to be
determined whether and how we can leverage the type system and/or function
introspection to regularly infer schemas at construction time. A fallback
coder using BSON encoding is used when we don't have sufficient type
information.

* We have added additional methods to the PCollection object, notably `map`
and `flatMap`, [rather than only allowing apply](https://www.mail-archive.com/dev@beam.apache.org/msg06035.html).
In addition, `apply` accepts either a PTransform subclass or a function
`(PCollection) => ...`; the function is treated as if it were a
PTransform's `expand`.

* In the other direction, we have eliminated the
[problematic Pipeline object](https://s.apache.org/no-beam-pipeline)
from the API. Instead, pipelines are built on a `Root` PValue, and `run()` is
invoked on a Runner. We offer a less error-prone `Runner.run`, which
completes only when the pipeline has completely finished, as well as
`Runner.runAsync`, which returns a handle to the running pipeline.

* Rather than introduce PCollectionTuple, PCollectionList, etc., we let PValue
literally be an
[array or object with PValue values](https://github.com/robertwb/beam-javascript/blob/de4390dd767f046903ac23fead5db333290462db/sdks/node-ts/src/apache_beam/pvalue.ts#L116)
which transforms can consume or produce.
Transforms are applied to these by wrapping them with the `P` operator, e.g.
`P([pc1, pc2, pc3]).apply(new Flatten())`.

* As in Python, `flatMap` and `ParDo.process` return multiple elements by
yielding them from a generator, rather than invoking a passed-in callback.
How to output to multiple distinct PCollections is still to be determined.
There is currently an operation to split a PCollection into multiple
PCollections based on the properties of the elements, and
we may consider using a callback for side outputs.

* The `map`, `flatMap`, and `ParDo.process` methods take an additional
(optional) context argument, which is similar to the keyword arguments
used in Python. The context is a JavaScript object whose members may be
constants (which are passed as is) or special DoFnParam objects which provide
getters to element-specific information (such as the current timestamp,
window, or side input) at runtime.

* Rather than introduce multiple-output complexity into the map/do operations
themselves, producing multiple outputs is done by following with a new
`Split` primitive that takes a
`PCollection<{a?: AType, b: BType, ... }>` and produces an object
`{a: PCollection<AType>, b: PCollection<BType>, ...}`.

* JavaScript supports (and encourages) an asynchronous programming model, with
many libraries requiring use of the async/await paradigm.
As there is no way (by design) to go from the asynchronous style back to
the synchronous style, this needs to be taken into account
when designing the API.
We currently offer asynchronous variants of `PValue.apply(...)` (in addition
to the synchronous ones, as they are easier to chain) and have made
`Runner.run` asynchronous. Whether to do the same for all user callbacks is
still to be determined.

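Several of the points above (the `map` and `flatMap` methods, generator-based output, and passing a plain function to `apply`) can be illustrated with a small in-memory model. The sketch below is not the SDK: `ToyCollection`, `splitWords`, and `countWords` are hypothetical names, and an array stands in for a distributed PCollection.

```typescript
// Toy in-memory stand-in for a PCollection, for illustration only.
class ToyCollection<T> {
  readonly elements: T[];

  constructor(elements: T[]) {
    this.elements = elements;
  }

  // One output element per input element.
  map<O>(fn: (element: T) => O): ToyCollection<O> {
    return new ToyCollection(this.elements.map(fn));
  }

  // Zero or more output elements per input, produced by iterating
  // whatever fn returns (typically a generator).
  flatMap<O>(fn: (element: T) => Iterable<O>): ToyCollection<O> {
    const out: O[] = [];
    for (const element of this.elements) {
      for (const result of fn(element)) {
        out.push(result);
      }
    }
    return new ToyCollection(out);
  }

  // Accepts a plain function and treats it like a transform's expansion.
  apply<O>(transform: (input: ToyCollection<T>) => O): O {
    return transform(this);
  }
}

// Yields each non-empty lowercase word instead of invoking a callback.
function* splitWords(line: string): Generator<string> {
  for (const word of line.toLowerCase().split(/[^a-z]+/)) {
    if (word.length > 0) {
      yield word;
    }
  }
}

// A composite "transform" written as an ordinary function.
function countWords(
  words: ToyCollection<string>
): ToyCollection<{ word: string; count: number }> {
  const counts = new Map<string, number>();
  for (const word of words.elements) {
    counts.set(word, (counts.get(word) ?? 0) + 1);
  }
  return new ToyCollection(
    Array.from(counts, ([word, count]) => ({ word, count }))
  );
}

const counts = new ToyCollection(["To be or not", "to be"])
  .flatMap(splitWords)
  .apply(countWords);
```

Note that the output rows are plain objects with named fields (`word`, `count`) rather than key-value pairs, in keeping with the schema-first approach described above.
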
An example pipeline can be found in the
[wordcount example](https://github.com/apache/beam/blob/master/sdks/typescript/src/apache_beam/examples/wordcount.ts),
and more documentation can be found in the [Beam programming guide](https://beam.apache.org/documentation/programming-guide/).

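
To make the `Split` idea from the list above concrete, here is a minimal sketch over plain arrays. `Tagged`, `SplitResult`, and `splitEvenOdd` are hypothetical names chosen for illustration; the real primitive operates on PCollections and is generic over the field names.

```typescript
// Each element carries its value under the field naming its destination.
interface Tagged {
  even?: number;
  odd?: number;
}

// One output array per field, mirroring {a: PCollection<AType>, ...}.
interface SplitResult {
  even: number[];
  odd: number[];
}

// Toy model of a Split-style step: routes each value to the output
// named by the field it was stored under.
function splitEvenOdd(elements: Tagged[]): SplitResult {
  const result: SplitResult = { even: [], odd: [] };
  for (const element of elements) {
    if (element.even !== undefined) {
      result.even.push(element.even);
    }
    if (element.odd !== undefined) {
      result.odd.push(element.odd);
    }
  }
  return result;
}

// An upstream map step tags each element by wrapping it in a
// single-field object; the split step then separates the outputs.
const tagged: Tagged[] = [1, 2, 3, 4, 5].map(
  (n): Tagged => (n % 2 === 0 ? { even: n } : { odd: n })
);
const { even, odd } = splitEvenOdd(tagged);
```

Keeping the tagging in an ordinary `map` and the routing in a separate step is what lets the map/do operations themselves stay single-output.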