<!--
    Licensed to the Apache Software Foundation (ASF) under one
    or more contributor license agreements.  See the NOTICE file
    distributed with this work for additional information
    regarding copyright ownership.  The ASF licenses this file
    to you under the Apache License, Version 2.0 (the
    "License"); you may not use this file except in compliance
    with the License.  You may obtain a copy of the License at

      http://www.apache.org/licenses/LICENSE-2.0

    Unless required by applicable law or agreed to in writing,
    software distributed under the License is distributed on an
    "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
    KIND, either express or implied.  See the License for the
    specific language governing permissions and limitations
    under the License.
-->

# TypeScript Beam SDK

A library for writing [Apache Beam](https://beam.apache.org/)
pipelines in TypeScript.

As well as being a fully-functioning SDK, it serves as a cleaner, more modern
template for building SDKs in other languages
(see README-dev.md for more details).


## Getting started

The TypeScript SDK can be installed with

```
npm install apache-beam
```

Due to its extensive use of cross-language transforms, it is recommended that
Python 3 and Java be available on the system as well.

A fully working setup is provided as a clonable
[starter project on GitHub](https://github.com/apache/beam-starter-typescript).


### Running a pipeline

Beam pipelines can be run on a variety of
[runners](https://beam.apache.org/documentation/#runners).
The typical way to create a runner is with
`beam.runners.runner.create_runner({runner: "runnerType", ...})`,
as seen in the [wordcount example](https://github.com/apache/beam/blob/master/sdks/typescript/src/apache_beam/examples/wordcount.ts).
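
To make the options object concrete, here is a minimal sketch, in plain TypeScript with no Beam imports, of how flags of the form `--runner=direct` could be collected into an object of the shape `{runner: "runnerType", ...}`. The helper `parseFlags` and the sample flag values are hypothetical and not part of the SDK; the SDK's own examples may parse arguments differently.

```typescript
// Hypothetical helper (NOT part of the Beam SDK): collects "--key=value"
// arguments into a plain options object. Arguments that do not match the
// "--key=value" form are ignored.
function parseFlags(argv: string[]): Record<string, string> {
  const options: Record<string, string> = {};
  for (const arg of argv) {
    const match = /^--([^=]+)=(.*)$/.exec(arg);
    if (match) {
      options[match[1]] = match[2];
    }
  }
  return options;
}

// Sample flag values are made up for illustration.
const runnerOptions = parseFlags([
  "--runner=dataflow",
  "--project=my-project",
  "--region=us-central1",
]);
console.log(runnerOptions.runner); // prints "dataflow"
```

In a real program the input would typically be `process.argv.slice(2)`, and the resulting object would be handed to whatever runner factory the pipeline uses.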

After building, to run locally one can execute:

```
node path/to/main.js --runner=direct
```

To run against Flink, where the local infrastructure is automatically
downloaded and set up:

```
node path/to/main.js --runner=flink
```

To run on Dataflow:

```
node path/to/main.js \
    --runner=dataflow \
    --project=${PROJECT_ID} \
    --tempLocation=gs://${GCS_BUCKET}/wordcount-js/temp --region=${REGION}
```


## API

We generally try to apply the concepts from the Beam API in a TypeScript
idiomatic way, but it should be noted that few of the initial developers
have extensive (if any) JavaScript/TypeScript development experience, so
feedback is greatly appreciated.

In addition, some notable departures are taken from the traditional SDKs:

* We take a "relational foundations" approach, where
[schema'd data](https://docs.google.com/document/d/1tnG2DPHZYbsomvihIpXruUmQ12pHGK0QIvXS1FOTgRc/edit#heading=h.puuotbien1gf)
is the primary way to interact with data, and we generally eschew transforms
that require key-value pairs in favor of a more flexible approach based on
naming fields or expressions. JavaScript's native Object is used as the row type.

* As part of being schema-first we also de-emphasize Coders as a first-class
concept in the SDK, relegating them to an advanced feature used for interop.
Though we can infer schemas from individual elements, it is still TBD
whether and how we can leverage the type system and/or function introspection
to regularly infer schemas at construction time. A fallback coder using BSON
encoding is used when we don't have sufficient type information.

* We have added additional methods to the PCollection object, notably `map`
and `flatMap`, [rather than only allowing apply](https://www.mail-archive.com/dev@beam.apache.org/msg06035.html).
In addition, `apply` can accept a function argument `(PCollection) => ...` as
well as a PTransform subclass, in which case the callable is treated as if it
were a PTransform's `expand`.

* In the other direction, we have eliminated the
[problematic Pipeline object](https://s.apache.org/no-beam-pipeline)
from the API, instead providing a `Root` PValue on which pipelines are built,
and invoking run() on a Runner. We offer a less error-prone `Runner.run`,
which finishes only when the pipeline is completely finished, as well as
`Runner.runAsync`, which returns a handle to the running pipeline.

* Rather than introduce PCollectionTuple, PCollectionList, etc. we let PValue
literally be an
[array or object with PValue values](https://github.com/robertwb/beam-javascript/blob/de4390dd767f046903ac23fead5db333290462db/sdks/node-ts/src/apache_beam/pvalue.ts#L116)
which transforms can consume or produce.
These are applied by wrapping them with the `P` operator, e.g.
`P([pc1, pc2, pc3]).apply(new Flatten())`.

* As in Python, `flatMap` and `ParDo.process` return multiple elements by
yielding them from a generator, rather than invoking a passed-in callback.
TBD how to output to multiple distinct PCollections.
There is currently an operation to split a PCollection into multiple
PCollections based on the properties of the elements, and
we may consider using a callback for side outputs.

* The `map`, `flatMap`, and `ParDo.process` methods take an additional
(optional) context argument, which is similar to the keyword arguments
used in Python. These are JavaScript objects whose members may be constants
(which are passed as is) or special DoFnParam objects which provide getters to
element-specific information (such as the current timestamp, window,
or side input) at runtime.

* Rather than introduce multiple-output complexity into the map/do operations
themselves, producing multiple outputs is done by following with a new
`Split` primitive that takes a
`PCollection<{a?: AType, b: BType, ... }>` and produces an object
`{a: PCollection<AType>, b: PCollection<BType>, ...}`.

* JavaScript supports (and encourages) an asynchronous programming model, with
many libraries requiring use of the async/await paradigm.
As there is no way (by design) to go from the asynchronous style back to
the synchronous style, this needs to be taken into account
when designing the API.
We currently offer asynchronous variants of `PValue.apply(...)` (in addition
to the synchronous ones, as they are easier to chain) as well as making
`Runner.run` asynchronous. Whether to do the same for all user callbacks is TBD.

An example pipeline can be found at https://github.com/apache/beam/blob/master/sdks/typescript/src/apache_beam/examples/wordcount.ts
and more documentation can be found in the [Beam programming guide](https://beam.apache.org/documentation/programming-guide/).
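
Two ideas from the list above, generator-based `flatMap` and the `Split` shape `{a?: AType, b?: BType}` to `{a: ..., b: ...}`, can be illustrated without the SDK. The following is a toy model over in-memory arrays, not the Beam API: `flatMapAll`, `splitTagged`, `tokenize`, and the `Tagged` type are hypothetical names chosen for this sketch, and plain arrays stand in for PCollections.

```typescript
// Toy stand-ins operating on arrays; NOT the Beam SDK API.

// Each element carries at most one of the tagged fields.
type Tagged = { word?: string; num?: number };

// Generator-style flatMap: the callback yields zero or more outputs per
// input instead of invoking a passed-in callback.
function flatMapAll<T, U>(items: T[], fn: (item: T) => Iterable<U>): U[] {
  const out: U[] = [];
  for (const item of items) {
    for (const result of fn(item)) {
      out.push(result);
    }
  }
  return out;
}

// Split-style fan-out: one collection of tagged objects becomes an object
// holding one collection per tag.
function splitTagged(items: Tagged[]): { word: string[]; num: number[] } {
  const out = { word: [] as string[], num: [] as number[] };
  for (const item of items) {
    if (item.word !== undefined) out.word.push(item.word);
    if (item.num !== undefined) out.num.push(item.num);
  }
  return out;
}

// Yields one tagged object per token: digit-only tokens go to `num`,
// everything else to `word`.
function* tokenize(line: string): Generator<Tagged> {
  for (const token of line.toLowerCase().split(/[^a-z0-9]+/)) {
    if (token === "") continue;
    if (/^[0-9]+$/.test(token)) {
      yield { num: Number(token) };
    } else {
      yield { word: token };
    }
  }
}

const tagged = flatMapAll(["Beam 2 beam", "42 runners"], tokenize);
const parts = splitTagged(tagged);
console.log(parts.word); // [ 'beam', 'beam', 'runners' ]
console.log(parts.num); // [ 2, 42 ]
```

Keeping the fan-out in a separate step means the per-element function only ever yields a single kind of value, which is the design choice the `Split` bullet describes.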