github.com/apache/beam/sdks/v2@v2.48.2/python/apache_beam/io/gcp/bigquery.py

#
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements. See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License. You may obtain a copy of the License at
#
#    http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#

"""BigQuery sources and sinks.

This module implements reading from and writing to BigQuery tables. It relies
on several classes exposed by the BigQuery API: TableSchema, TableFieldSchema,
TableRow, and TableCell. The default mode is to return table rows read from a
BigQuery source as dictionaries. Similarly a Write transform to a BigQuerySink
accepts PCollections of dictionaries. This is done for more convenient
programming. If desired, the native TableRow objects can be used throughout to
represent rows (use an instance of TableRowJsonCoder as a coder argument when
creating the sources or sinks respectively).

Also, for programming convenience, instances of TableReference and TableSchema
have a string representation that can be used for the corresponding arguments:

- TableReference can be a PROJECT:DATASET.TABLE or DATASET.TABLE string.
- TableSchema can be a NAME:TYPE{,NAME:TYPE}* string
  (e.g. 'month:STRING,event_count:INTEGER').

The syntax supported is described here:
https://cloud.google.com/bigquery/bq-command-line-tool-quickstart

BigQuery sources can be used as main inputs or side inputs. A main input
(common case) is expected to be massive and will be split into manageable
chunks and processed in parallel. Side inputs are expected to be small and
will be read completely every time a ParDo DoFn gets executed. In the example
below the lambda function implementing the DoFn for the Map transform will get
on each call *one* row of the main table and *all* rows of the side table. The
runner may use some caching techniques to share the side inputs between calls
in order to avoid excessive reading::

  main_table = pipeline | 'VeryBig' >> beam.io.ReadFromBigQuery(...)
  side_table = pipeline | 'NotBig' >> beam.io.ReadFromBigQuery(...)
  results = (
      main_table
      | 'ProcessData' >> beam.Map(
          lambda element, side_input: ..., AsList(side_table)))

There is no difference in how main and side inputs are read. What makes the
side_table a 'side input' is the AsList wrapper used when passing the table
as a parameter to the Map transform. AsList signals to the execution framework
that its input should be made available whole.

The main and side inputs are implemented differently. Reading a BigQuery table
as main input entails exporting the table to a set of GCS files (in AVRO or in
JSON format) and then processing those files.
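For example, such a read can pin where those export files are staged by
passing the `gcs_location` parameter. This is a minimal sketch; the table and
bucket names below are placeholders::

  weather_data = pipeline | 'ReadBig' >> beam.io.ReadFromBigQuery(
      table='my_project:samples.weather_stations',
      gcs_location='gs://my-bucket/tmp/bq_export')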
Users may provide a query to read from rather than reading all of a BigQuery
table. If specified, the result obtained by executing the specified query will
be used as the data of the input transform::

  query_results = pipeline | beam.io.gcp.bigquery.ReadFromBigQuery(
      query='SELECT year, mean_temp FROM samples.weather_stations')

When creating a BigQuery input transform, users should provide either a query
or a table. Pipeline construction will fail with a validation error if neither
or both are specified.

When reading via `ReadFromBigQuery`, BYTES values are returned as raw
(already decoded) bytes, because `ReadFromBigQuery` uses Avro exports by
default. When reading from BigQuery using `apache_beam.io.BigQuerySource`,
bytes are returned as base64-encoded bytes. To get base64-encoded bytes using
`ReadFromBigQuery`, you can use the flag `use_json_exports` to export
data as JSON and receive base64-encoded bytes.

ReadAllFromBigQuery
-------------------
Beam 2.27.0 introduces a new transform called `ReadAllFromBigQuery` which
allows you to define table and query reads from BigQuery at pipeline
runtime::

  read_requests = p | beam.Create([
      ReadFromBigQueryRequest(query='SELECT * FROM mydataset.mytable'),
      ReadFromBigQueryRequest(table='myproject.mydataset.mytable')])
  results = read_requests | ReadAllFromBigQuery()

A good application for this transform is in streaming pipelines to
refresh a side input coming from BigQuery. This would work like so::

  side_input = (
      p
      | 'PeriodicImpulse' >> PeriodicImpulse(
          first_timestamp, last_timestamp, interval, True)
      | 'MapToReadRequest' >> beam.Map(
          lambda x: ReadFromBigQueryRequest(table='dataset.table'))
      | beam.io.ReadAllFromBigQuery())
  main_input = (
      p
      | 'MpImpulse' >> beam.Create(sample_main_input_elements)
      | 'MapMpToTimestamped' >> beam.Map(
          lambda src: TimestampedValue(src, src))
      | 'WindowMpInto' >> beam.WindowInto(
          window.FixedWindows(main_input_windowing_interval)))
  result = (
      main_input
      | 'ApplyCrossJoin' >> beam.FlatMap(
          cross_join, rights=beam.pvalue.AsIter(side_input)))

**Note**: This transform is supported on Portable and Dataflow v2 runners.

**Note**: This transform does not currently clean up temporary datasets
created for its execution. (BEAM-11359)

Writing Data to BigQuery
========================

The `WriteToBigQuery` transform is the recommended way of writing data to
BigQuery. It supports a large set of parameters to customize how you'd like to
write to BigQuery.
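In its simplest form, writing a PCollection of dictionaries to a table with a
known schema might look like the following. This is a minimal sketch; the
project, dataset and table names are placeholders::

  quotes = p | beam.Create([
      {'source': 'BigQuery', 'quote': 'Hello world'}])
  quotes | beam.io.WriteToBigQuery(
      'my_project:my_dataset.my_table',
      schema='source:STRING,quote:STRING',
      create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
      write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND)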
Table References
----------------

This transform allows you to provide static `project`, `dataset` and `table`
parameters which point to a specific BigQuery table to be created. The `table`
parameter can also be a dynamic parameter (i.e. a callable), which receives an
element to be written to BigQuery, and returns the table to which that element
should be written.

You may also provide a tuple of PCollectionView elements to be passed as side
inputs to your callable. For example, suppose that one wishes to send events
of different types to different tables, and the table names are computed at
pipeline runtime. One may then do something like the following::

  with Pipeline() as p:
    elements = (p | beam.Create([
        {'type': 'error', 'timestamp': '12:34:56', 'message': 'bad'},
        {'type': 'user_log', 'timestamp': '12:34:59', 'query': 'flu symptom'},
    ]))

    table_names = (p | beam.Create([
        ('error', 'my_project:dataset1.error_table_for_today'),
        ('user_log', 'my_project:dataset1.query_table_for_today'),
    ]))

    table_names_dict = beam.pvalue.AsDict(table_names)

    elements | beam.io.gcp.bigquery.WriteToBigQuery(
        table=lambda row, table_dict: table_dict[row['type']],
        table_side_inputs=(table_names_dict,))

In the example above, the `table_dict` argument passed to the lambda in
`table` is the side input coming from `table_names_dict`, which is passed
as part of the `table_side_inputs` argument.

Schemas
-------

This transform also allows you to provide a static or dynamic `schema`
parameter (i.e. a callable).

If providing a callable, this should take in a table reference (as returned by
the `table` parameter) and return the corresponding schema for that table.
This allows you to provide different schemas for different tables::

  def compute_table_name(row):
    ...

  errors_schema = {'fields': [
      {'name': 'type', 'type': 'STRING', 'mode': 'NULLABLE'},
      {'name': 'message', 'type': 'STRING', 'mode': 'NULLABLE'}]}
  queries_schema = {'fields': [
      {'name': 'type', 'type': 'STRING', 'mode': 'NULLABLE'},
      {'name': 'query', 'type': 'STRING', 'mode': 'NULLABLE'}]}

  with Pipeline() as p:
    elements = (p | beam.Create([
        {'type': 'error', 'timestamp': '12:34:56', 'message': 'bad'},
        {'type': 'user_log', 'timestamp': '12:34:59', 'query': 'flu symptom'},
    ]))

    elements | beam.io.gcp.bigquery.WriteToBigQuery(
        table=compute_table_name,
        schema=lambda table: (errors_schema
                              if 'errors' in table
                              else queries_schema))

It may be the case that schemas are computed at pipeline runtime. In cases
like these, one can also provide a `schema_side_inputs` parameter, which is
a tuple of PCollectionViews to be passed to the schema callable (much like
the `table_side_inputs` parameter).
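For instance, building on the example above, per-table schemas could be built
at runtime and handed to the schema callable as a side input. This is a
sketch; it assumes the destination strings returned by `compute_table_name`
match the keys in the side input::

  table_to_schema = (p | 'SchemaPerTable' >> beam.Create([
      ('my_project:dataset1.error_table_for_today', errors_schema),
      ('my_project:dataset1.query_table_for_today', queries_schema),
  ]))
  schemas_dict = beam.pvalue.AsDict(table_to_schema)

  elements | beam.io.gcp.bigquery.WriteToBigQuery(
      table=compute_table_name,
      schema=lambda table, schema_map: schema_map[table],
      schema_side_inputs=(schemas_dict,))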
Additional Parameters for BigQuery Tables
-----------------------------------------

This sink is able to create tables in BigQuery if they don't already exist. It
also relies on creating temporary tables when performing file loads.

The WriteToBigQuery transform creates tables using the BigQuery API by
inserting a load job (see the API reference [1]), or by inserting a new table
(see the API reference for that [2][3]).

When creating a new BigQuery table, there are a number of extra parameters
that one may need to specify. For example, clustering, partitioning, data
encoding, etc. It is possible to provide these additional parameters by
passing a Python dictionary as `additional_bq_parameters` to the transform.
As an example, to create a table that has specific partitioning and
clustering properties, one would do the following::

  additional_bq_parameters = {
      'timePartitioning': {'type': 'DAY'},
      'clustering': {'fields': ['country']}}
  with Pipeline() as p:
    elements = (p | beam.Create([
        {'country': 'mexico', 'timestamp': '12:34:56', 'query': 'acapulco'},
        {'country': 'canada', 'timestamp': '12:34:59', 'query': 'influenza'},
    ]))

    elements | beam.io.gcp.bigquery.WriteToBigQuery(
        table='project_name1:dataset_2.query_events_table',
        additional_bq_parameters=additional_bq_parameters)

Much like the schema case, `additional_bq_parameters` can also be a callable
that receives a table reference.
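For example, such a callable could restrict partitioning and clustering to a
particular destination. This is a sketch; the match on the table name is
purely illustrative::

  def additional_parameters(table_reference):
    # Only partition and cluster the events table; create any other table
    # with no extra settings.
    if 'query_events' in str(table_reference):
      return {
          'timePartitioning': {'type': 'DAY'},
          'clustering': {'fields': ['country']}}
    return {}

  elements | beam.io.gcp.bigquery.WriteToBigQuery(
      table='project_name1:dataset_2.query_events_table',
      additional_bq_parameters=additional_parameters)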
[1] https://cloud.google.com/bigquery/docs/reference/rest/v2/Job#jobconfigurationload
[2] https://cloud.google.com/bigquery/docs/reference/rest/v2/tables/insert
[3] https://cloud.google.com/bigquery/docs/reference/rest/v2/tables#resource

Chaining of operations after WriteToBigQuery
--------------------------------------------
WriteToBigQuery returns an object with several PCollections that consist of
metadata about the write operations. These are useful for inspecting the write
operation and for following up on its results::

  schema = {'fields': [
      {'name': 'column', 'type': 'STRING', 'mode': 'NULLABLE'}]}

  error_schema = {'fields': [
      {'name': 'destination', 'type': 'STRING', 'mode': 'NULLABLE'},
      {'name': 'row', 'type': 'STRING', 'mode': 'NULLABLE'},
      {'name': 'error_message', 'type': 'STRING', 'mode': 'NULLABLE'}]}

  with Pipeline() as p:
    result = (p
        | 'Create Columns' >> beam.Create([
            {'column': 'value'},
            {'bad_column': 'bad_value'}
        ])
        | 'Write Data' >> WriteToBigQuery(
            method=WriteToBigQuery.Method.STREAMING_INSERTS,
            table=my_table,
            schema=schema,
            insert_retry_strategy=RetryStrategy.RETRY_NEVER
        ))

    _ = (result.failed_rows_with_errors
        | 'Get Errors' >> beam.Map(lambda e: {
            "destination": e[0],
            "row": json.dumps(e[1]),
            "error_message": e[2][0]['message']
        })
        | 'Write Errors' >> WriteToBigQuery(
            method=WriteToBigQuery.Method.STREAMING_INSERTS,
            table=error_log_table,
            schema=error_schema,
        ))

Often, the simplest use case is to chain an operation after writing data to
BigQuery. To do this, one can chain the operation after one of the output
PCollections. A generic way to do this, independent of the write method,
looks like the following::

  def chain_after(result):
    try:
      # This works for FILE_LOADS, where we run load and possibly copy jobs.
      return (result.load_jobid_pairs, result.copy_jobid_pairs) | beam.Flatten()
    except AttributeError:
      # Works for STREAMING_INSERTS, where we return the rows BigQuery rejected.
      return result.failed_rows

  result = (pcoll | WriteToBigQuery(...))

  _ = (chain_after(result)
       | beam.Reshuffle()  # Force a 'commit' of the intermediate data.
       | MyOperationAfterWriteToBQ())

Attributes can be accessed using dot notation or bracket notation:
```
result.failed_rows                  <--> result['FailedRows']
result.failed_rows_with_errors      <--> result['FailedRowsWithErrors']
result.destination_load_jobid_pairs <--> result['destination_load_jobid_pairs']
result.destination_file_pairs       <--> result['destination_file_pairs']
result.destination_copy_jobid_pairs <--> result['destination_copy_jobid_pairs']
```

Writing with Storage Write API using Cross Language
---------------------------------------------------
This sink is able to write with BigQuery's Storage Write API. To do so, specify
the method `WriteToBigQuery.Method.STORAGE_WRITE_API`. This will use the
StorageWriteToBigQuery() transform to discover and use the Java implementation.
Using this transform directly will require the use of beam.Row() elements.

Similar to streaming inserts, it returns two dead-letter queue PCollections:
one containing just the failed rows and the other containing failed rows and
errors. They can be accessed with `failed_rows` and `failed_rows_with_errors`,
respectively. See the examples above for how to do this.


*** Short introduction to BigQuery concepts ***
Tables have rows (TableRow) and each row has cells (TableCell).
A table has a schema (TableSchema), which in turn describes the schema of each
cell (TableFieldSchema). The terms field and cell are used interchangeably.

TableSchema: Describes the schema (types and order) for values in each row.
  Has one attribute, 'fields', which is a list of TableFieldSchema objects.

TableFieldSchema: Describes the schema (type, name) for one field.
  Has several attributes, including 'name' and 'type'. Common values for
  the type attribute are: 'STRING', 'INTEGER', 'FLOAT', 'BOOLEAN', 'NUMERIC',
  'GEOGRAPHY'.
  All possible values are described at:
  https://cloud.google.com/bigquery/docs/reference/standard-sql/data-types

TableRow: Holds all values in a table row. Has one attribute, 'f', which is a
  list of TableCell instances.

TableCell: Holds the value for one cell (or field). Has one attribute,
  'v', which is a JsonValue instance. This class is defined in the
  apitools.base.py.extra_types module.

As of Beam 2.7.0, the NUMERIC data type is supported. This data type supports
high-precision decimal numbers (precision of 38 digits, scale of 9 digits).
The GEOGRAPHY data type works with Well-Known Text (see
https://en.wikipedia.org/wiki/Well-known_text) format for reading and writing
to BigQuery.
BigQuery IO requires values of BYTES datatype to be encoded using base64
encoding when writing to BigQuery.
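For example, a row with a BYTES column could be prepared like this before
being passed to `WriteToBigQuery`. This is a minimal sketch; the table and
field names are placeholders::

  import base64

  rows = pipeline | beam.Create([{
      'name': 'example',
      # BYTES values are passed as base64-encoded strings.
      'payload': base64.b64encode(b'\x00\x01\x02').decode('ascii'),
  }])
  rows | beam.io.WriteToBigQuery(
      'my_project:my_dataset.my_table',
      schema='name:STRING,payload:BYTES')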
347 348 **Updates to the I/O connector code** 349 350 For any significant updates to this I/O connector, please consider involving 351 corresponding code reviewers mentioned in 352 https://github.com/apache/beam/blob/master/sdks/python/OWNERS 353 """ 354 355 # pytype: skip-file 356 357 import collections 358 import io 359 import itertools 360 import json 361 import logging 362 import random 363 import time 364 import uuid 365 import warnings 366 from dataclasses import dataclass 367 from typing import Dict 368 from typing import List 369 from typing import Optional 370 from typing import Tuple 371 from typing import Union 372 373 import fastavro 374 375 import apache_beam as beam 376 from apache_beam import coders 377 from apache_beam import pvalue 378 from apache_beam.internal.gcp.json_value import from_json_value 379 from apache_beam.internal.gcp.json_value import to_json_value 380 from apache_beam.io import range_trackers 381 from apache_beam.io.avroio import _create_avro_source as create_avro_source 382 from apache_beam.io.filesystems import CompressionTypes 383 from apache_beam.io.filesystems import FileSystems 384 from apache_beam.io.gcp import bigquery_schema_tools 385 from apache_beam.io.gcp import bigquery_tools 386 from apache_beam.io.gcp.bigquery_io_metadata import create_bigquery_io_metadata 387 from apache_beam.io.gcp.bigquery_read_internal import _BigQueryReadSplit 388 from apache_beam.io.gcp.bigquery_read_internal import _JsonToDictCoder 389 from apache_beam.io.gcp.bigquery_read_internal import _PassThroughThenCleanup 390 from apache_beam.io.gcp.bigquery_read_internal import _PassThroughThenCleanupTempDatasets 391 from apache_beam.io.gcp.bigquery_read_internal import bigquery_export_destination_uri 392 from apache_beam.io.gcp.bigquery_tools import RetryStrategy 393 from apache_beam.io.gcp.internal.clients import bigquery 394 from apache_beam.io.iobase import BoundedSource 395 from apache_beam.io.iobase import RangeTracker 396 from apache_beam.io.iobase import SDFBoundedSourceReader 397 from apache_beam.io.iobase import SourceBundle 398 from apache_beam.io.textio import _TextSource as TextSource 399 from apache_beam.metrics import Metrics 400 from apache_beam.options import value_provider as vp 401 from apache_beam.options.pipeline_options import DebugOptions 402 from apache_beam.options.pipeline_options import GoogleCloudOptions 403 from apache_beam.options.pipeline_options import StandardOptions 404 from apache_beam.options.value_provider import StaticValueProvider 405 from apache_beam.options.value_provider import ValueProvider 406 from apache_beam.options.value_provider import check_accessible 407 from apache_beam.pvalue import PCollection 408 from apache_beam.transforms import DoFn 409 from apache_beam.transforms import ParDo 410 from apache_beam.transforms import PTransform 411 from apache_beam.transforms.display import DisplayDataItem 412 from apache_beam.transforms.external import BeamJarExpansionService 413 from apache_beam.transforms.external import SchemaAwareExternalTransform 414 from apache_beam.transforms.sideinputs import SIDE_INPUT_PREFIX 415 from apache_beam.transforms.sideinputs import get_sideinput_index 416 from apache_beam.transforms.util import ReshufflePerKey 417 from apache_beam.transforms.window import GlobalWindows 418 from apache_beam.typehints.row_type import RowTypeConstraint 419 from apache_beam.utils import retry 420 from apache_beam.utils.annotations import deprecated 421 422 try: 423 from apache_beam.io.gcp.internal.clients.bigquery import 
DatasetReference 424 from apache_beam.io.gcp.internal.clients.bigquery import TableReference 425 from apache_beam.io.gcp.internal.clients.bigquery import JobReference 426 except ImportError: 427 DatasetReference = None 428 TableReference = None 429 JobReference = None 430 431 _LOGGER = logging.getLogger(__name__) 432 433 try: 434 import google.cloud.bigquery_storage_v1 as bq_storage 435 except ImportError: 436 _LOGGER.info( 437 'No module named google.cloud.bigquery_storage_v1. ' 438 'As a result, the ReadFromBigQuery transform *CANNOT* be ' 439 'used with `method=DIRECT_READ`.') 440 441 __all__ = [ 442 'TableRowJsonCoder', 443 'BigQueryDisposition', 444 'BigQuerySource', 445 'BigQuerySink', 446 'BigQueryQueryPriority', 447 'WriteToBigQuery', 448 'WriteResult', 449 'ReadFromBigQuery', 450 'ReadFromBigQueryRequest', 451 'ReadAllFromBigQuery', 452 'SCHEMA_AUTODETECT', 453 ] 454 """ 455 Template for BigQuery jobs created by BigQueryIO. This template is: 456 `"beam_bq_job_{job_type}_{job_id}_{step_id}_{random}"`, where: 457 458 - `job_type` represents the BigQuery job type (e.g. extract / copy / load / 459 query). 460 - `job_id` is the Beam job name. 461 - `step_id` is a UUID representing the Dataflow step that created the 462 BQ job. 463 - `random` is a random string. 464 465 NOTE: This job name template does not have backwards compatibility guarantees. 466 """ 467 BQ_JOB_NAME_TEMPLATE = "beam_bq_job_{job_type}_{job_id}_{step_id}{random}" 468 """ 469 The maximum number of times that a bundle of rows that errors out should be 470 sent for insertion into BigQuery. 471 472 The default is 10,000 with exponential backoffs, so a bundle of rows may be 473 tried for a very long time. You may reduce this property to reduce the number 474 of retries. 475 """ 476 MAX_INSERT_RETRIES = 10000 477 478 479 @deprecated(since='2.11.0', current="bigquery_tools.parse_table_reference") 480 def _parse_table_reference(table, dataset=None, project=None): 481 return bigquery_tools.parse_table_reference(table, dataset, project) 482 483 484 @deprecated( 485 since='2.11.0', current="bigquery_tools.parse_table_schema_from_json") 486 def parse_table_schema_from_json(schema_string): 487 return bigquery_tools.parse_table_schema_from_json(schema_string) 488 489 490 @deprecated(since='2.11.0', current="bigquery_tools.default_encoder") 491 def default_encoder(obj): 492 return bigquery_tools.default_encoder(obj) 493 494 495 @deprecated(since='2.11.0', current="bigquery_tools.RowAsDictJsonCoder") 496 def RowAsDictJsonCoder(*args, **kwargs): 497 return bigquery_tools.RowAsDictJsonCoder(*args, **kwargs) 498 499 500 @deprecated(since='2.11.0', current="bigquery_tools.BigQueryWrapper") 501 def BigQueryWrapper(*args, **kwargs): 502 return bigquery_tools.BigQueryWrapper(*args, **kwargs) 503 504 505 class TableRowJsonCoder(coders.Coder): 506 """A coder for a TableRow instance to/from a JSON string. 507 508 Note that the encoding operation (used when writing to sinks) requires the 509 table schema in order to obtain the ordered list of field names. Reading from 510 sources on the other hand does not need the table schema. 511 """ 512 def __init__(self, table_schema=None): 513 # The table schema is needed for encoding TableRows as JSON (writing to 514 # sinks) because the ordered list of field names is used in the JSON 515 # representation. 516 self.table_schema = table_schema 517 # Precompute field names since we need them for row encoding. 
518 if self.table_schema: 519 self.field_names = tuple(fs.name for fs in self.table_schema.fields) 520 self.field_types = tuple(fs.type for fs in self.table_schema.fields) 521 522 def encode(self, table_row): 523 if self.table_schema is None: 524 raise AttributeError( 525 'The TableRowJsonCoder requires a table schema for ' 526 'encoding operations. Please specify a table_schema argument.') 527 try: 528 return json.dumps( 529 collections.OrderedDict( 530 zip( 531 self.field_names, 532 [from_json_value(f.v) for f in table_row.f])), 533 allow_nan=False, 534 default=bigquery_tools.default_encoder) 535 except ValueError as e: 536 raise ValueError('%s. %s' % (e, bigquery_tools.JSON_COMPLIANCE_ERROR)) 537 538 def decode(self, encoded_table_row): 539 od = json.loads( 540 encoded_table_row, object_pairs_hook=collections.OrderedDict) 541 return bigquery.TableRow( 542 f=[bigquery.TableCell(v=to_json_value(e)) for e in od.values()]) 543 544 545 class BigQueryDisposition(object): 546 """Class holding standard strings used for create and write dispositions.""" 547 548 CREATE_NEVER = 'CREATE_NEVER' 549 CREATE_IF_NEEDED = 'CREATE_IF_NEEDED' 550 WRITE_TRUNCATE = 'WRITE_TRUNCATE' 551 WRITE_APPEND = 'WRITE_APPEND' 552 WRITE_EMPTY = 'WRITE_EMPTY' 553 554 @staticmethod 555 def validate_create(disposition): 556 values = ( 557 BigQueryDisposition.CREATE_NEVER, BigQueryDisposition.CREATE_IF_NEEDED) 558 if disposition not in values: 559 raise ValueError( 560 'Invalid create disposition %s. Expecting %s' % (disposition, values)) 561 return disposition 562 563 @staticmethod 564 def validate_write(disposition): 565 values = ( 566 BigQueryDisposition.WRITE_TRUNCATE, 567 BigQueryDisposition.WRITE_APPEND, 568 BigQueryDisposition.WRITE_EMPTY) 569 if disposition not in values: 570 raise ValueError( 571 'Invalid write disposition %s. Expecting %s' % (disposition, values)) 572 return disposition 573 574 575 class BigQueryQueryPriority(object): 576 """Class holding standard strings used for query priority.""" 577 578 INTERACTIVE = 'INTERACTIVE' 579 BATCH = 'BATCH' 580 581 582 # ----------------------------------------------------------------------------- 583 # BigQuerySource, BigQuerySink. 584 585 586 @deprecated(since='2.25.0', current="ReadFromBigQuery") 587 def BigQuerySource( 588 table=None, 589 dataset=None, 590 project=None, 591 query=None, 592 validate=False, 593 coder=None, 594 use_standard_sql=False, 595 flatten_results=True, 596 kms_key=None, 597 use_dataflow_native_source=False): 598 if use_dataflow_native_source: 599 warnings.warn( 600 "Native sources no longer implemented; " 601 "falling back to standard Beam source.") 602 return ReadFromBigQuery( 603 table=table, 604 dataset=dataset, 605 project=project, 606 query=query, 607 validate=validate, 608 coder=coder, 609 use_standard_sql=use_standard_sql, 610 flatten_results=flatten_results, 611 use_json_exports=True, 612 kms_key=kms_key) 613 614 615 @deprecated(since='2.25.0', current="ReadFromBigQuery") 616 def _BigQuerySource(*args, **kwargs): 617 """A source based on a BigQuery table.""" 618 warnings.warn( 619 "Native sources no longer implemented; " 620 "falling back to standard Beam source.") 621 return ReadFromBigQuery(*args, **kwargs) 622 623 624 # TODO(https://github.com/apache/beam/issues/21622): remove the serialization 625 # restriction in transform implementation once InteractiveRunner can work 626 # without runner api roundtrips. 
627 @dataclass 628 class _BigQueryExportResult: 629 coder: beam.coders.Coder 630 paths: List[str] 631 632 633 class _CustomBigQuerySource(BoundedSource): 634 def __init__( 635 self, 636 method, 637 gcs_location=None, 638 table=None, 639 dataset=None, 640 project=None, 641 query=None, 642 validate=False, 643 pipeline_options=None, 644 coder=None, 645 use_standard_sql=False, 646 flatten_results=True, 647 kms_key=None, 648 bigquery_job_labels=None, 649 use_json_exports=False, 650 job_name=None, 651 step_name=None, 652 unique_id=None, 653 temp_dataset=None, 654 query_priority=BigQueryQueryPriority.BATCH): 655 if table is not None and query is not None: 656 raise ValueError( 657 'Both a BigQuery table and a query were specified.' 658 ' Please specify only one of these.') 659 elif table is None and query is None: 660 raise ValueError('A BigQuery table or a query must be specified') 661 elif table is not None: 662 self.table_reference = bigquery_tools.parse_table_reference( 663 table, dataset, project) 664 self.query = None 665 self.use_legacy_sql = True 666 else: 667 if isinstance(query, str): 668 query = StaticValueProvider(str, query) 669 self.query = query 670 # TODO(BEAM-1082): Change the internal flag to be standard_sql 671 self.use_legacy_sql = not use_standard_sql 672 self.table_reference = None 673 674 self.method = method 675 self.gcs_location = gcs_location 676 self.project = project 677 self.validate = validate 678 self.flatten_results = flatten_results 679 self.coder = coder or _JsonToDictCoder 680 self.kms_key = kms_key 681 self.export_result = None 682 self.options = pipeline_options 683 self.bq_io_metadata = None # Populate in setup, as it may make an RPC 684 self.bigquery_job_labels = bigquery_job_labels or {} 685 self.use_json_exports = use_json_exports 686 self.temp_dataset = temp_dataset 687 self.query_priority = query_priority 688 self._job_name = job_name or 'BQ_EXPORT_JOB' 689 self._step_name = step_name 690 self._source_uuid = unique_id 691 692 def _get_bq_metadata(self): 693 if not self.bq_io_metadata: 694 self.bq_io_metadata = create_bigquery_io_metadata(self._step_name) 695 return self.bq_io_metadata 696 697 def display_data(self): 698 export_format = 'JSON' if self.use_json_exports else 'AVRO' 699 return { 700 'method': str(self.method), 701 'table': str(self.table_reference), 702 'query': str(self.query), 703 'project': str(self.project), 704 'use_legacy_sql': self.use_legacy_sql, 705 'bigquery_job_labels': json.dumps(self.bigquery_job_labels), 706 'export_file_format': export_format, 707 'launchesBigQueryJobs': DisplayDataItem( 708 True, label="This Dataflow job launches bigquery jobs."), 709 } 710 711 def estimate_size(self): 712 bq = bigquery_tools.BigQueryWrapper() 713 if self.table_reference is not None: 714 table_ref = self.table_reference 715 if (isinstance(self.table_reference, vp.ValueProvider) and 716 self.table_reference.is_accessible()): 717 table_ref = bigquery_tools.parse_table_reference( 718 self.table_reference.get(), project=self._get_project()) 719 elif isinstance(self.table_reference, vp.ValueProvider): 720 # Size estimation is best effort. We return None as we have 721 # no access to the table that we're querying. 
722 return None 723 if not table_ref.projectId: 724 table_ref.projectId = self._get_project() 725 table = bq.get_table( 726 table_ref.projectId, table_ref.datasetId, table_ref.tableId) 727 return int(table.numBytes) 728 elif self.query is not None and self.query.is_accessible(): 729 project = self._get_project() 730 query_job_name = bigquery_tools.generate_bq_job_name( 731 self._job_name, 732 self._source_uuid, 733 bigquery_tools.BigQueryJobTypes.QUERY, 734 '%s_%s' % (int(time.time()), random.randint(0, 1000))) 735 job = bq._start_query_job( 736 project, 737 self.query.get(), 738 self.use_legacy_sql, 739 self.flatten_results, 740 job_id=query_job_name, 741 priority=self.query_priority, 742 dry_run=True, 743 kms_key=self.kms_key, 744 job_labels=self._get_bq_metadata().add_additional_bq_job_labels( 745 self.bigquery_job_labels)) 746 size = int(job.statistics.totalBytesProcessed) 747 return size 748 else: 749 # Size estimation is best effort. We return None as we have 750 # no access to the query that we're running. 751 return None 752 753 def _get_project(self): 754 """Returns the project that queries and exports will be billed to.""" 755 756 project = self.options.view_as(GoogleCloudOptions).project 757 if isinstance(project, vp.ValueProvider): 758 project = project.get() 759 if self.temp_dataset: 760 return self.temp_dataset.projectId 761 if not project: 762 project = self.project 763 return project 764 765 def _create_source(self, path, coder): 766 if not self.use_json_exports: 767 return create_avro_source(path) 768 else: 769 return TextSource( 770 path, 771 min_bundle_size=0, 772 compression_type=CompressionTypes.UNCOMPRESSED, 773 strip_trailing_newlines=True, 774 coder=coder) 775 776 def split(self, desired_bundle_size, start_position=None, stop_position=None): 777 if self.export_result is None: 778 bq = bigquery_tools.BigQueryWrapper( 779 temp_dataset_id=( 780 self.temp_dataset.datasetId if self.temp_dataset else None)) 781 782 if self.query is not None: 783 self._setup_temporary_dataset(bq) 784 self.table_reference = self._execute_query(bq) 785 786 if isinstance(self.table_reference, vp.ValueProvider): 787 self.table_reference = bigquery_tools.parse_table_reference( 788 self.table_reference.get(), project=self._get_project()) 789 elif not self.table_reference.projectId: 790 self.table_reference.projectId = self._get_project() 791 792 schema, metadata_list = self._export_files(bq) 793 self.export_result = _BigQueryExportResult( 794 coder=self.coder(schema), 795 paths=[metadata.path for metadata in metadata_list]) 796 797 if self.query is not None: 798 bq.clean_up_temporary_dataset(self._get_project()) 799 800 for path in self.export_result.paths: 801 source = self._create_source(path, self.export_result.coder) 802 yield SourceBundle( 803 weight=1.0, source=source, start_position=None, stop_position=None) 804 805 def get_range_tracker(self, start_position, stop_position): 806 class CustomBigQuerySourceRangeTracker(RangeTracker): 807 """A RangeTracker that always returns positions as None.""" 808 def start_position(self): 809 return None 810 811 def stop_position(self): 812 return None 813 814 return CustomBigQuerySourceRangeTracker() 815 816 def read(self, range_tracker): 817 raise NotImplementedError('BigQuery source must be split before being read') 818 819 @check_accessible(['query']) 820 def _setup_temporary_dataset(self, bq): 821 if self.temp_dataset: 822 # Temp dataset was provided by the user so we can just return. 
823 return 824 location = bq.get_query_location( 825 self._get_project(), self.query.get(), self.use_legacy_sql) 826 bq.create_temporary_dataset(self._get_project(), location) 827 828 @check_accessible(['query']) 829 def _execute_query(self, bq): 830 query_job_name = bigquery_tools.generate_bq_job_name( 831 self._job_name, 832 self._source_uuid, 833 bigquery_tools.BigQueryJobTypes.QUERY, 834 '%s_%s' % (int(time.time()), random.randint(0, 1000))) 835 job = bq._start_query_job( 836 self._get_project(), 837 self.query.get(), 838 self.use_legacy_sql, 839 self.flatten_results, 840 job_id=query_job_name, 841 priority=self.query_priority, 842 kms_key=self.kms_key, 843 job_labels=self._get_bq_metadata().add_additional_bq_job_labels( 844 self.bigquery_job_labels)) 845 job_ref = job.jobReference 846 bq.wait_for_bq_job(job_ref, max_retries=0) 847 return bq._get_temp_table(self._get_project()) 848 849 def _export_files(self, bq): 850 """Runs a BigQuery export job. 851 852 Returns: 853 bigquery.TableSchema instance, a list of FileMetadata instances 854 """ 855 job_labels = self._get_bq_metadata().add_additional_bq_job_labels( 856 self.bigquery_job_labels) 857 export_job_name = bigquery_tools.generate_bq_job_name( 858 self._job_name, 859 self._source_uuid, 860 bigquery_tools.BigQueryJobTypes.EXPORT, 861 '%s_%s' % (int(time.time()), random.randint(0, 1000))) 862 temp_location = self.options.view_as(GoogleCloudOptions).temp_location 863 gcs_location = bigquery_export_destination_uri( 864 self.gcs_location, temp_location, self._source_uuid) 865 try: 866 if self.use_json_exports: 867 job_ref = bq.perform_extract_job([gcs_location], 868 export_job_name, 869 self.table_reference, 870 bigquery_tools.FileFormat.JSON, 871 project=self._get_project(), 872 job_labels=job_labels, 873 include_header=False) 874 else: 875 job_ref = bq.perform_extract_job([gcs_location], 876 export_job_name, 877 self.table_reference, 878 bigquery_tools.FileFormat.AVRO, 879 project=self._get_project(), 880 include_header=False, 881 job_labels=job_labels, 882 use_avro_logical_types=True) 883 bq.wait_for_bq_job(job_ref) 884 except Exception as exn: # pylint: disable=broad-except 885 # The error messages thrown in this case are generic and misleading, 886 # so leave this breadcrumb in case it's the root cause. 887 logging.warning( 888 "Error exporting table: %s. " 889 "Note that external tables cannot be exported: " 890 "https://cloud.google.com/bigquery/docs/external-tables" 891 "#external_table_limitations", 892 exn) 893 raise 894 metadata_list = FileSystems.match([gcs_location])[0].metadata_list 895 896 if isinstance(self.table_reference, vp.ValueProvider): 897 table_ref = bigquery_tools.parse_table_reference( 898 self.table_reference.get(), project=self.project) 899 else: 900 table_ref = self.table_reference 901 table = bq.get_table( 902 table_ref.projectId, table_ref.datasetId, table_ref.tableId) 903 904 return table.schema, metadata_list 905 906 907 class _CustomBigQueryStorageSource(BoundedSource): 908 """A base class for BoundedSource implementations which read from BigQuery 909 using the BigQuery Storage API. 910 Args: 911 table (str, TableReference): The ID of the table. If **dataset** argument is 912 :data:`None` then the table argument must contain the entire table 913 reference specified as: ``'PROJECT:DATASET.TABLE'`` or must specify a 914 TableReference. 915 dataset (str): Optional ID of the dataset containing this table or 916 :data:`None` if the table argument specifies a TableReference. 
917 project (str): Optional ID of the project containing this table or 918 :data:`None` if the table argument specifies a TableReference. 919 selected_fields (List[str]): Optional List of names of the fields in the 920 table that should be read. If empty, all fields will be read. If the 921 specified field is a nested field, all the sub-fields in the field will be 922 selected. The output field order is unrelated to the order of fields in 923 selected_fields. 924 row_restriction (str): Optional SQL text filtering statement, similar to a 925 WHERE clause in a query. Aggregates are not supported. Restricted to a 926 maximum length for 1 MB. 927 use_native_datetime (bool): If :data:`True`, BigQuery DATETIME fields will 928 be returned as native Python datetime objects. If :data:`False`, 929 DATETIME fields will be returned as formatted strings (for example: 930 2021-01-01T12:59:59). The default is :data:`False`. 931 """ 932 933 # The maximum number of streams which will be requested when creating a read 934 # session, regardless of the desired bundle size. 935 MAX_SPLIT_COUNT = 10000 936 # The minimum number of streams which will be requested when creating a read 937 # session, regardless of the desired bundle size. Note that the server may 938 # still choose to return fewer than ten streams based on the layout of the 939 # table. 940 MIN_SPLIT_COUNT = 10 941 942 def __init__( 943 self, 944 method: str, 945 query_priority: [BigQueryQueryPriority] = BigQueryQueryPriority.BATCH, 946 table: Optional[Union[str, TableReference]] = None, 947 dataset: Optional[str] = None, 948 project: Optional[str] = None, 949 query: Optional[str] = None, 950 selected_fields: Optional[List[str]] = None, 951 row_restriction: Optional[str] = None, 952 pipeline_options: Optional[GoogleCloudOptions] = None, 953 unique_id: Optional[uuid.UUID] = None, 954 bigquery_job_labels: Optional[Dict] = None, 955 bigquery_dataset_labels: Optional[Dict] = None, 956 job_name: Optional[str] = None, 957 step_name: Optional[str] = None, 958 use_standard_sql: Optional[bool] = False, 959 flatten_results: Optional[bool] = True, 960 kms_key: Optional[str] = None, 961 temp_dataset: Optional[DatasetReference] = None, 962 temp_table: Optional[TableReference] = None, 963 use_native_datetime: Optional[bool] = False): 964 965 if table is not None and query is not None: 966 raise ValueError( 967 'Both a BigQuery table and a query were specified.' 
968 ' Please specify only one of these.') 969 elif table is None and query is None: 970 raise ValueError('A BigQuery table or a query must be specified') 971 elif table is not None: 972 self.table_reference = bigquery_tools.parse_table_reference( 973 table, dataset, project) 974 self.query = None 975 self.use_legacy_sql = True 976 else: 977 if isinstance(query, str): 978 query = StaticValueProvider(str, query) 979 self.query = query 980 # TODO(BEAM-1082): Change the internal flag to be standard_sql 981 self.use_legacy_sql = not use_standard_sql 982 self.table_reference = None 983 984 self.method = method 985 self.project = project 986 self.selected_fields = selected_fields 987 self.row_restriction = row_restriction 988 self.pipeline_options = pipeline_options 989 self.split_result = None 990 self.bigquery_job_labels = bigquery_job_labels or {} 991 self.bigquery_dataset_labels = bigquery_dataset_labels or {} 992 self.bq_io_metadata = None # Populate in setup, as it may make an RPC 993 self.flatten_results = flatten_results 994 self.kms_key = kms_key 995 self.temp_table = temp_table 996 self.query_priority = query_priority 997 self.use_native_datetime = use_native_datetime 998 self._job_name = job_name or 'BQ_DIRECT_READ_JOB' 999 self._step_name = step_name 1000 self._source_uuid = unique_id 1001 1002 def _get_parent_project(self): 1003 """Returns the project that will be billed.""" 1004 if self.temp_table: 1005 return self.temp_table.projectId 1006 1007 project = self.pipeline_options.view_as(GoogleCloudOptions).project 1008 if isinstance(project, vp.ValueProvider): 1009 project = project.get() 1010 if not project: 1011 project = self.project 1012 return project 1013 1014 def _get_table_size(self, bq, table_reference): 1015 project = ( 1016 table_reference.projectId 1017 if table_reference.projectId else self._get_parent_project()) 1018 table = bq.get_table( 1019 project, table_reference.datasetId, table_reference.tableId) 1020 return table.numBytes 1021 1022 def _get_bq_metadata(self): 1023 if not self.bq_io_metadata: 1024 self.bq_io_metadata = create_bigquery_io_metadata(self._step_name) 1025 return self.bq_io_metadata 1026 1027 @check_accessible(['query']) 1028 def _setup_temporary_dataset(self, bq): 1029 if self.temp_table: 1030 # Temp dataset was provided by the user so we can just return. 
1031 return 1032 location = bq.get_query_location( 1033 self._get_parent_project(), self.query.get(), self.use_legacy_sql) 1034 _LOGGER.warning("### Labels: %s", str(self.bigquery_dataset_labels)) 1035 bq.create_temporary_dataset( 1036 self._get_parent_project(), location, self.bigquery_dataset_labels) 1037 1038 @check_accessible(['query']) 1039 def _execute_query(self, bq): 1040 query_job_name = bigquery_tools.generate_bq_job_name( 1041 self._job_name, 1042 self._source_uuid, 1043 bigquery_tools.BigQueryJobTypes.QUERY, 1044 '%s_%s' % (int(time.time()), random.randint(0, 1000))) 1045 job = bq._start_query_job( 1046 self._get_parent_project(), 1047 self.query.get(), 1048 self.use_legacy_sql, 1049 self.flatten_results, 1050 job_id=query_job_name, 1051 priority=self.query_priority, 1052 kms_key=self.kms_key, 1053 job_labels=self._get_bq_metadata().add_additional_bq_job_labels( 1054 self.bigquery_job_labels)) 1055 job_ref = job.jobReference 1056 bq.wait_for_bq_job(job_ref, max_retries=0) 1057 table_reference = bq._get_temp_table(self._get_parent_project()) 1058 return table_reference 1059 1060 def display_data(self): 1061 return { 1062 'method': self.method, 1063 'output_format': 'ARROW' if self.use_native_datetime else 'AVRO', 1064 'project': str(self.project), 1065 'table_reference': str(self.table_reference), 1066 'query': str(self.query), 1067 'use_legacy_sql': self.use_legacy_sql, 1068 'use_native_datetime': self.use_native_datetime, 1069 'selected_fields': str(self.selected_fields), 1070 'row_restriction': str(self.row_restriction), 1071 'launchesBigQueryJobs': DisplayDataItem( 1072 True, label="This Dataflow job launches bigquery jobs."), 1073 } 1074 1075 def estimate_size(self): 1076 # Returns the pre-filtering size of the (temporary) table being read. 1077 bq = bigquery_tools.BigQueryWrapper() 1078 if self.table_reference is not None: 1079 return self._get_table_size(bq, self.table_reference) 1080 elif self.query is not None and self.query.is_accessible(): 1081 query_job_name = bigquery_tools.generate_bq_job_name( 1082 self._job_name, 1083 self._source_uuid, 1084 bigquery_tools.BigQueryJobTypes.QUERY, 1085 '%s_%s' % (int(time.time()), random.randint(0, 1000))) 1086 job = bq._start_query_job( 1087 self._get_parent_project(), 1088 self.query.get(), 1089 self.use_legacy_sql, 1090 self.flatten_results, 1091 job_id=query_job_name, 1092 priority=self.query_priority, 1093 dry_run=True, 1094 kms_key=self.kms_key, 1095 job_labels=self._get_bq_metadata().add_additional_bq_job_labels( 1096 self.bigquery_job_labels)) 1097 size = int(job.statistics.totalBytesProcessed) 1098 return size 1099 else: 1100 # Size estimation is best effort. We return None as we have 1101 # no access to the query that we're running. 
1102 return None 1103 1104 def split(self, desired_bundle_size, start_position=None, stop_position=None): 1105 if self.split_result is None: 1106 bq = bigquery_tools.BigQueryWrapper( 1107 temp_table_ref=(self.temp_table if self.temp_table else None)) 1108 1109 if self.query is not None: 1110 self._setup_temporary_dataset(bq) 1111 self.table_reference = self._execute_query(bq) 1112 1113 requested_session = bq_storage.types.ReadSession() 1114 requested_session.table = 'projects/{}/datasets/{}/tables/{}'.format( 1115 self.table_reference.projectId, 1116 self.table_reference.datasetId, 1117 self.table_reference.tableId) 1118 1119 if self.use_native_datetime: 1120 requested_session.data_format = bq_storage.types.DataFormat.ARROW 1121 requested_session.read_options\ 1122 .arrow_serialization_options.buffer_compression = \ 1123 bq_storage.types.ArrowSerializationOptions.CompressionCodec.LZ4_FRAME 1124 else: 1125 requested_session.data_format = bq_storage.types.DataFormat.AVRO 1126 1127 if self.selected_fields is not None: 1128 requested_session.read_options.selected_fields = self.selected_fields 1129 if self.row_restriction is not None: 1130 requested_session.read_options.row_restriction = self.row_restriction 1131 1132 storage_client = bq_storage.BigQueryReadClient() 1133 stream_count = 0 1134 if desired_bundle_size > 0: 1135 table_size = self._get_table_size(bq, self.table_reference) 1136 stream_count = min( 1137 int(table_size / desired_bundle_size), 1138 _CustomBigQueryStorageSource.MAX_SPLIT_COUNT) 1139 stream_count = max( 1140 stream_count, _CustomBigQueryStorageSource.MIN_SPLIT_COUNT) 1141 1142 parent = 'projects/{}'.format(self.table_reference.projectId) 1143 read_session = storage_client.create_read_session( 1144 parent=parent, 1145 read_session=requested_session, 1146 max_stream_count=stream_count) 1147 _LOGGER.info( 1148 'Sent BigQuery Storage API CreateReadSession request: \n %s \n' 1149 'Received response \n %s.', 1150 requested_session, 1151 read_session) 1152 1153 self.split_result = [ 1154 _CustomBigQueryStorageStreamSource( 1155 stream.name, self.use_native_datetime) 1156 for stream in read_session.streams 1157 ] 1158 1159 for source in self.split_result: 1160 yield SourceBundle( 1161 weight=1.0, source=source, start_position=None, stop_position=None) 1162 1163 def get_range_tracker(self, start_position, stop_position): 1164 class NonePositionRangeTracker(RangeTracker): 1165 """A RangeTracker that always returns positions as None. Prevents the 1166 BigQuery Storage source from being read() before being split().""" 1167 def start_position(self): 1168 return None 1169 1170 def stop_position(self): 1171 return None 1172 1173 return NonePositionRangeTracker() 1174 1175 def read(self, range_tracker): 1176 raise NotImplementedError( 1177 'BigQuery storage source must be split before being read') 1178 1179 1180 class _CustomBigQueryStorageStreamSource(BoundedSource): 1181 """A source representing a single stream in a read session.""" 1182 def __init__( 1183 self, read_stream_name: str, use_native_datetime: Optional[bool] = True): 1184 self.read_stream_name = read_stream_name 1185 self.use_native_datetime = use_native_datetime 1186 1187 def display_data(self): 1188 return { 1189 'output_format': 'ARROW' if self.use_native_datetime else 'AVRO', 1190 'read_stream': str(self.read_stream_name), 1191 'use_native_datetime': str(self.use_native_datetime) 1192 } 1193 1194 def estimate_size(self): 1195 # The size of stream source cannot be estimate due to server-side liquid 1196 # sharding. 
1197 # TODO(https://github.com/apache/beam/issues/21126): Implement progress 1198 # reporting. 1199 return None 1200 1201 def split(self, desired_bundle_size, start_position=None, stop_position=None): 1202 # A stream source can't be split without reading from it due to 1203 # server-side liquid sharding. A split will simply return the current source 1204 # for now. 1205 return SourceBundle( 1206 weight=1.0, 1207 source=_CustomBigQueryStorageStreamSource( 1208 self.read_stream_name, self.use_native_datetime), 1209 start_position=None, 1210 stop_position=None) 1211 1212 def get_range_tracker(self, start_position, stop_position): 1213 # TODO(https://github.com/apache/beam/issues/21127): Implement dynamic work 1214 # rebalancing. 1215 assert start_position is None 1216 # Defaulting to the start of the stream. 1217 start_position = 0 1218 # Since the streams are unsplittable we choose OFFSET_INFINITY as the 1219 # default end offset so that all data of the source gets read. 1220 stop_position = range_trackers.OffsetRangeTracker.OFFSET_INFINITY 1221 range_tracker = range_trackers.OffsetRangeTracker( 1222 start_position, stop_position) 1223 # Ensuring that all try_split() calls will be ignored by the Rangetracker. 1224 range_tracker = range_trackers.UnsplittableRangeTracker(range_tracker) 1225 1226 return range_tracker 1227 1228 def read(self, range_tracker): 1229 _LOGGER.info( 1230 "Started BigQuery Storage API read from stream %s.", 1231 self.read_stream_name) 1232 if self.use_native_datetime: 1233 return self.read_arrow() 1234 else: 1235 return self.read_avro() 1236 1237 def read_arrow(self): 1238 storage_client = bq_storage.BigQueryReadClient() 1239 row_iter = iter(storage_client.read_rows(self.read_stream_name).rows()) 1240 row = next(row_iter, None) 1241 # Handling the case where the user might provide very selective filters 1242 # which can result in read_rows_response being empty. 1243 if row is None: 1244 return iter([]) 1245 1246 while row is not None: 1247 py_row = dict(map(lambda item: (item[0], item[1].as_py()), row.items())) 1248 row = next(row_iter, None) 1249 yield py_row 1250 1251 def read_avro(self): 1252 storage_client = bq_storage.BigQueryReadClient() 1253 read_rows_iterator = iter(storage_client.read_rows(self.read_stream_name)) 1254 # Handling the case where the user might provide very selective filters 1255 # which can result in read_rows_response being empty. 
1256 first_read_rows_response = next(read_rows_iterator, None) 1257 if first_read_rows_response is None: 1258 return iter([]) 1259 1260 row_reader = _ReadReadRowsResponsesWithFastAvro( 1261 read_rows_iterator, first_read_rows_response) 1262 return iter(row_reader) 1263 1264 1265 class _ReadReadRowsResponsesWithFastAvro(): 1266 """An iterator that deserializes ReadRowsResponses using the fastavro 1267 library.""" 1268 def __init__(self, read_rows_iterator, read_rows_response): 1269 self.read_rows_iterator = read_rows_iterator 1270 self.read_rows_response = read_rows_response 1271 self.avro_schema = fastavro.parse_schema( 1272 json.loads(self.read_rows_response.avro_schema.schema)) 1273 self.bytes_reader = io.BytesIO( 1274 self.read_rows_response.avro_rows.serialized_binary_rows) 1275 1276 def __iter__(self): 1277 return self 1278 1279 def __next__(self): 1280 try: 1281 return fastavro.schemaless_reader(self.bytes_reader, self.avro_schema) 1282 except StopIteration: 1283 self.read_rows_response = next(self.read_rows_iterator, None) 1284 if self.read_rows_response is not None: 1285 self.bytes_reader = io.BytesIO( 1286 self.read_rows_response.avro_rows.serialized_binary_rows) 1287 return fastavro.schemaless_reader(self.bytes_reader, self.avro_schema) 1288 else: 1289 raise StopIteration 1290 1291 1292 @deprecated(since='2.11.0', current="WriteToBigQuery") 1293 def BigQuerySink(*args, validate=False, **kwargs): 1294 """A deprecated alias for WriteToBigQuery.""" 1295 warnings.warn( 1296 "Native sinks no longer implemented; " 1297 "falling back to standard Beam sink.") 1298 return WriteToBigQuery(*args, validate=validate, **kwargs) 1299 1300 1301 _KNOWN_TABLES = set() 1302 1303 1304 class BigQueryWriteFn(DoFn): 1305 """A ``DoFn`` that streams writes to BigQuery once the table is created.""" 1306 1307 DEFAULT_MAX_BUFFERED_ROWS = 2000 1308 DEFAULT_MAX_BATCH_SIZE = 500 1309 1310 FAILED_ROWS = 'FailedRows' 1311 FAILED_ROWS_WITH_ERRORS = 'FailedRowsWithErrors' 1312 STREAMING_API_LOGGING_FREQUENCY_SEC = 300 1313 1314 def __init__( 1315 self, 1316 batch_size, 1317 schema=None, 1318 create_disposition=None, 1319 write_disposition=None, 1320 kms_key=None, 1321 test_client=None, 1322 max_buffered_rows=None, 1323 retry_strategy=None, 1324 additional_bq_parameters=None, 1325 ignore_insert_ids=False, 1326 with_batched_input=False, 1327 ignore_unknown_columns=False, 1328 max_retries=MAX_INSERT_RETRIES): 1329 """Initialize a WriteToBigQuery transform. 1330 1331 Args: 1332 batch_size: Number of rows to be written to BQ per streaming API insert. 1333 schema: The schema to be used if the BigQuery table to write has to be 1334 created. This can be either specified as a 'bigquery.TableSchema' object 1335 or a single string of the form 'field1:type1,field2:type2,field3:type3' 1336 that defines a comma separated list of fields. Here 'type' should 1337 specify the BigQuery type of the field. Single string based schemas do 1338 not support nested fields, repeated fields, or specifying a BigQuery 1339 mode for fields (mode will always be set to 'NULLABLE'). 1340 create_disposition: A string describing what happens if the table does not 1341 exist. Possible values are: 1342 - BigQueryDisposition.CREATE_IF_NEEDED: create if does not exist. 1343 - BigQueryDisposition.CREATE_NEVER: fail the write if does not exist. 1344 write_disposition: A string describing what happens if the table has 1345 already some data. Possible values are: 1346 - BigQueryDisposition.WRITE_TRUNCATE: delete existing rows. 
1347 - BigQueryDisposition.WRITE_APPEND: add to existing rows. 1348 - BigQueryDisposition.WRITE_EMPTY: fail the write if table not empty. 1349 For streaming pipelines WriteTruncate can not be used. 1350 kms_key: Optional Cloud KMS key name for use when creating new tables. 1351 test_client: Override the default bigquery client used for testing. 1352 1353 max_buffered_rows: The maximum number of rows that are allowed to stay 1354 buffered when running dynamic destinations. When destinations are 1355 dynamic, it is important to keep caches small even when a single 1356 batch has not been completely filled up. 1357 retry_strategy: The strategy to use when retrying streaming inserts 1358 into BigQuery. Options are shown in bigquery_tools.RetryStrategy attrs. 1359 additional_bq_parameters (dict, callable): A set of additional parameters 1360 to be passed when creating a BigQuery table. These are passed when 1361 triggering a load job for FILE_LOADS, and when creating a new table for 1362 STREAMING_INSERTS. 1363 ignore_insert_ids: When using the STREAMING_INSERTS method to write data 1364 to BigQuery, `insert_ids` are a feature of BigQuery that support 1365 deduplication of events. If your use case is not sensitive to 1366 duplication of data inserted to BigQuery, set `ignore_insert_ids` 1367 to True to increase the throughput for BQ writing. See: 1368 https://cloud.google.com/bigquery/streaming-data-into-bigquery#disabling_best_effort_de-duplication 1369 with_batched_input: Whether the input has already been batched per 1370 destination. If not, perform best-effort batching per destination within 1371 a bundle. 1372 ignore_unknown_columns: Accept rows that contain values that do not match 1373 the schema. The unknown values are ignored. Default is False, 1374 which treats unknown values as errors. See reference: 1375 https://cloud.google.com/bigquery/docs/reference/rest/v2/tabledata/insertAll 1376 max_retries: The number of times that we will retry inserting a group of 1377 rows into BigQuery. By default, we retry 10000 times with exponential 1378 backoffs (effectively retry forever). 
1379 1380 """ 1381 self.schema = schema 1382 self.test_client = test_client 1383 self.create_disposition = create_disposition 1384 self.write_disposition = write_disposition 1385 if write_disposition in (BigQueryDisposition.WRITE_EMPTY, 1386 BigQueryDisposition.WRITE_TRUNCATE): 1387 raise ValueError( 1388 'Write disposition %s is not supported for' 1389 ' streaming inserts to BigQuery' % write_disposition) 1390 self._rows_buffer = [] 1391 self._reset_rows_buffer() 1392 1393 self._total_buffered_rows = 0 1394 self.kms_key = kms_key 1395 self._max_batch_size = batch_size or BigQueryWriteFn.DEFAULT_MAX_BATCH_SIZE 1396 self._max_buffered_rows = ( 1397 max_buffered_rows or BigQueryWriteFn.DEFAULT_MAX_BUFFERED_ROWS) 1398 self._retry_strategy = retry_strategy or RetryStrategy.RETRY_ALWAYS 1399 self.ignore_insert_ids = ignore_insert_ids 1400 self.with_batched_input = with_batched_input 1401 1402 self.additional_bq_parameters = additional_bq_parameters or {} 1403 1404 # accumulate the total time spent in exponential backoff 1405 self._throttled_secs = Metrics.counter( 1406 BigQueryWriteFn, "cumulativeThrottlingSeconds") 1407 self.batch_size_metric = Metrics.distribution(self.__class__, "batch_size") 1408 self.batch_latency_metric = Metrics.distribution( 1409 self.__class__, "batch_latency_ms") 1410 self.failed_rows_metric = Metrics.distribution( 1411 self.__class__, "rows_failed_per_batch") 1412 self.bigquery_wrapper = None 1413 self.streaming_api_logging_frequency_sec = ( 1414 BigQueryWriteFn.STREAMING_API_LOGGING_FREQUENCY_SEC) 1415 self.ignore_unknown_columns = ignore_unknown_columns 1416 self._max_retries = max_retries 1417 1418 def display_data(self): 1419 return { 1420 'max_batch_size': self._max_batch_size, 1421 'max_buffered_rows': self._max_buffered_rows, 1422 'retry_strategy': self._retry_strategy, 1423 'create_disposition': str(self.create_disposition), 1424 'write_disposition': str(self.write_disposition), 1425 'additional_bq_parameters': str(self.additional_bq_parameters), 1426 'ignore_insert_ids': str(self.ignore_insert_ids), 1427 'ignore_unknown_columns': str(self.ignore_unknown_columns) 1428 } 1429 1430 def _reset_rows_buffer(self): 1431 self._rows_buffer = collections.defaultdict(lambda: []) 1432 1433 @staticmethod 1434 def get_table_schema(schema): 1435 """Transform the table schema into a bigquery.TableSchema instance. 1436 1437 Args: 1438 schema: The schema to be used if the BigQuery table to write has to be 1439 created. This is a dictionary object created in the WriteToBigQuery 1440 transform. 1441 Returns: 1442 table_schema: The schema to be used if the BigQuery table to write has 1443 to be created but in the bigquery.TableSchema format. 1444 """ 1445 if schema is None: 1446 return schema 1447 elif isinstance(schema, str): 1448 return bigquery_tools.parse_table_schema_from_json(schema) 1449 elif isinstance(schema, dict): 1450 return bigquery_tools.parse_table_schema_from_json(json.dumps(schema)) 1451 else: 1452 raise TypeError('Unexpected schema argument: %s.' % schema) 1453 1454 def start_bundle(self): 1455 self._reset_rows_buffer() 1456 1457 if not self.bigquery_wrapper: 1458 self.bigquery_wrapper = bigquery_tools.BigQueryWrapper( 1459 client=self.test_client) 1460 1461 ( 1462 bigquery_tools.BigQueryWrapper.HISTOGRAM_METRIC_LOGGER. 
1463 minimum_logging_frequency_msec 1464 ) = self.streaming_api_logging_frequency_sec * 1000 1465 1466 self._backoff_calculator = iter( 1467 retry.FuzzedExponentialIntervals( 1468 initial_delay_secs=0.2, 1469 num_retries=self._max_retries, 1470 max_delay_secs=1500)) 1471 1472 def _create_table_if_needed(self, table_reference, schema=None): 1473 str_table_reference = '%s:%s.%s' % ( 1474 table_reference.projectId, 1475 table_reference.datasetId, 1476 table_reference.tableId) 1477 if str_table_reference in _KNOWN_TABLES: 1478 return 1479 1480 if self.create_disposition == BigQueryDisposition.CREATE_NEVER: 1481 # If we never want to create the table, we assume it already exists, 1482 # and avoid the get-or-create step. 1483 return 1484 1485 _LOGGER.debug( 1486 'Creating or getting table %s with schema %s.', table_reference, schema) 1487 1488 table_schema = self.get_table_schema(schema) 1489 1490 if table_reference.projectId is None: 1491 table_reference.projectId = vp.RuntimeValueProvider.get_value( 1492 'project', str, '') 1493 self.bigquery_wrapper.get_or_create_table( 1494 table_reference.projectId, 1495 table_reference.datasetId, 1496 table_reference.tableId, 1497 table_schema, 1498 self.create_disposition, 1499 self.write_disposition, 1500 additional_create_parameters=self.additional_bq_parameters) 1501 _KNOWN_TABLES.add(str_table_reference) 1502 1503 def process(self, element, *schema_side_inputs): 1504 destination = bigquery_tools.get_hashable_destination(element[0]) 1505 1506 if callable(self.schema): 1507 schema = self.schema(destination, *schema_side_inputs) 1508 elif isinstance(self.schema, vp.ValueProvider): 1509 schema = self.schema.get() 1510 else: 1511 schema = self.schema 1512 1513 self._create_table_if_needed( 1514 bigquery_tools.parse_table_reference(destination), schema) 1515 1516 if not self.with_batched_input: 1517 row_and_insert_id = element[1] 1518 self._rows_buffer[destination].append(row_and_insert_id) 1519 self._total_buffered_rows += 1 1520 if len(self._rows_buffer[destination]) >= self._max_batch_size: 1521 return self._flush_batch(destination) 1522 elif self._total_buffered_rows >= self._max_buffered_rows: 1523 return self._flush_all_batches() 1524 else: 1525 # The input is already batched per destination, flush the rows now. 1526 batched_rows = element[1] 1527 self._rows_buffer[destination].extend(batched_rows) 1528 return self._flush_batch(destination) 1529 1530 def finish_bundle(self): 1531 bigquery_tools.BigQueryWrapper.HISTOGRAM_METRIC_LOGGER.log_metrics( 1532 reset_after_logging=True) 1533 return self._flush_all_batches() 1534 1535 def _flush_all_batches(self): 1536 _LOGGER.debug( 1537 'Attempting to flush to all destinations. Total buffered: %s', 1538 self._total_buffered_rows) 1539 1540 return itertools.chain( 1541 *[ 1542 self._flush_batch(destination) 1543 for destination in list(self._rows_buffer.keys()) 1544 if self._rows_buffer[destination] 1545 ]) 1546 1547 def _flush_batch(self, destination): 1548 1549 # Flush the current batch of rows to BigQuery. 1550 rows_and_insert_ids = self._rows_buffer[destination] 1551 table_reference = bigquery_tools.parse_table_reference(destination) 1552 1553 if table_reference.projectId is None: 1554 table_reference.projectId = vp.RuntimeValueProvider.get_value( 1555 'project', str, '') 1556 1557 _LOGGER.debug( 1558 'Flushing data to %s. 
Total %s rows.', 1559 destination, 1560 len(rows_and_insert_ids)) 1561 self.batch_size_metric.update(len(rows_and_insert_ids)) 1562 1563 rows = [r[0] for r in rows_and_insert_ids] 1564 if self.ignore_insert_ids: 1565 insert_ids = [None for r in rows_and_insert_ids] 1566 else: 1567 insert_ids = [r[1] for r in rows_and_insert_ids] 1568 1569 while True: 1570 start = time.time() 1571 passed, errors = self.bigquery_wrapper.insert_rows( 1572 project_id=table_reference.projectId, 1573 dataset_id=table_reference.datasetId, 1574 table_id=table_reference.tableId, 1575 rows=rows, 1576 insert_ids=insert_ids, 1577 skip_invalid_rows=True, 1578 ignore_unknown_values=self.ignore_unknown_columns) 1579 self.batch_latency_metric.update((time.time() - start) * 1000) 1580 1581 failed_rows = [(rows[entry['index']], entry["errors"]) 1582 for entry in errors] 1583 retry_backoff = next(self._backoff_calculator, None) 1584 1585 # If retry_backoff is None, then we will not retry and must log. 1586 should_retry = any( 1587 RetryStrategy.should_retry( 1588 self._retry_strategy, entry['errors'][0]['reason']) 1589 for entry in errors) and retry_backoff is not None 1590 1591 if not passed: 1592 self.failed_rows_metric.update(len(failed_rows)) 1593 message = ( 1594 'There were errors inserting to BigQuery. Will{} retry. ' 1595 'Errors were {}'.format(("" if should_retry else " not"), errors)) 1596 1597 # The log level is: 1598 # - WARNING when we are continuing to retry, and have a deadline. 1599 # - ERROR when we will no longer retry, or MAY retry forever. 1600 log_level = ( 1601 logging.WARN if should_retry or 1602 self._retry_strategy != RetryStrategy.RETRY_ALWAYS else 1603 logging.ERROR) 1604 1605 _LOGGER.log(log_level, message) 1606 1607 if not should_retry: 1608 break 1609 else: 1610 _LOGGER.info( 1611 'Sleeping %s seconds before retrying insertion.', retry_backoff) 1612 time.sleep(retry_backoff) 1613 rows = [fr[0] for fr in failed_rows] 1614 self._throttled_secs.inc(retry_backoff) 1615 1616 self._total_buffered_rows -= len(self._rows_buffer[destination]) 1617 del self._rows_buffer[destination] 1618 1619 return itertools.chain([ 1620 pvalue.TaggedOutput( 1621 BigQueryWriteFn.FAILED_ROWS_WITH_ERRORS, 1622 GlobalWindows.windowed_value((destination, row, err))) for row, 1623 err in failed_rows 1624 ], 1625 [ 1626 pvalue.TaggedOutput( 1627 BigQueryWriteFn.FAILED_ROWS, 1628 GlobalWindows.windowed_value( 1629 (destination, row))) for row, 1630 unused_err in failed_rows 1631 ]) 1632 1633 1634 # The number of shards per destination when writing via streaming inserts. 1635 DEFAULT_SHARDS_PER_DESTINATION = 500 1636 # The max duration a batch of elements is allowed to be buffered before being 1637 # flushed to BigQuery. 
1638 DEFAULT_BATCH_BUFFERING_DURATION_LIMIT_SEC = 0.2 1639 1640 1641 class _StreamToBigQuery(PTransform): 1642 def __init__( 1643 self, 1644 table_reference, 1645 table_side_inputs, 1646 schema_side_inputs, 1647 schema, 1648 batch_size, 1649 triggering_frequency, 1650 create_disposition, 1651 write_disposition, 1652 kms_key, 1653 retry_strategy, 1654 additional_bq_parameters, 1655 ignore_insert_ids, 1656 ignore_unknown_columns, 1657 with_auto_sharding, 1658 num_streaming_keys=DEFAULT_SHARDS_PER_DESTINATION, 1659 test_client=None, 1660 max_retries=None): 1661 self.table_reference = table_reference 1662 self.table_side_inputs = table_side_inputs 1663 self.schema_side_inputs = schema_side_inputs 1664 self.schema = schema 1665 self.batch_size = batch_size 1666 self.triggering_frequency = triggering_frequency 1667 self.create_disposition = create_disposition 1668 self.write_disposition = write_disposition 1669 self.kms_key = kms_key 1670 self.retry_strategy = retry_strategy 1671 self.test_client = test_client 1672 self.additional_bq_parameters = additional_bq_parameters 1673 self.ignore_insert_ids = ignore_insert_ids 1674 self.ignore_unknown_columns = ignore_unknown_columns 1675 self.with_auto_sharding = with_auto_sharding 1676 self._num_streaming_keys = num_streaming_keys 1677 self.max_retries = max_retries or MAX_INSERT_RETRIES 1678 1679 class InsertIdPrefixFn(DoFn): 1680 def start_bundle(self): 1681 self.prefix = str(uuid.uuid4()) 1682 self._row_count = 0 1683 1684 def process(self, element): 1685 key = element[0] 1686 value = element[1] 1687 insert_id = '%s-%s' % (self.prefix, self._row_count) 1688 self._row_count += 1 1689 yield (key, (value, insert_id)) 1690 1691 def expand(self, input): 1692 bigquery_write_fn = BigQueryWriteFn( 1693 schema=self.schema, 1694 batch_size=self.batch_size, 1695 create_disposition=self.create_disposition, 1696 write_disposition=self.write_disposition, 1697 kms_key=self.kms_key, 1698 retry_strategy=self.retry_strategy, 1699 test_client=self.test_client, 1700 additional_bq_parameters=self.additional_bq_parameters, 1701 ignore_insert_ids=self.ignore_insert_ids, 1702 ignore_unknown_columns=self.ignore_unknown_columns, 1703 with_batched_input=self.with_auto_sharding, 1704 max_retries=self.max_retries) 1705 1706 def _add_random_shard(element): 1707 key = element[0] 1708 value = element[1] 1709 return ((key, random.randint(0, self._num_streaming_keys)), value) 1710 1711 def _restore_table_ref(sharded_table_ref_elems_kv): 1712 sharded_table_ref = sharded_table_ref_elems_kv[0] 1713 table_ref = bigquery_tools.parse_table_reference(sharded_table_ref) 1714 return (table_ref, sharded_table_ref_elems_kv[1]) 1715 1716 tagged_data = ( 1717 input 1718 | 'AppendDestination' >> beam.ParDo( 1719 bigquery_tools.AppendDestinationsFn(self.table_reference), 1720 *self.table_side_inputs) 1721 | 'AddInsertIds' >> beam.ParDo(_StreamToBigQuery.InsertIdPrefixFn()) 1722 | 1723 'ToHashableTableRef' >> beam.Map(bigquery_tools.to_hashable_table_ref)) 1724 1725 if not self.with_auto_sharding: 1726 tagged_data = ( 1727 tagged_data 1728 | 'WithFixedSharding' >> beam.Map(_add_random_shard) 1729 | 'CommitInsertIds' >> ReshufflePerKey() 1730 | 'DropShard' >> beam.Map(lambda kv: (kv[0][0], kv[1]))) 1731 else: 1732 # Auto-sharding is achieved via GroupIntoBatches.WithShardedKey 1733 # transform which shards, groups and at the same time batches the table 1734 # rows to be inserted to BigQuery. 1735 1736 # Firstly the keys of tagged_data (table references) are converted to a 1737 # hashable format. 
This is needed to work with the keyed states used by 1738 # GroupIntoBatches. After grouping and batching is done, original table 1739 # references are restored. 1740 tagged_data = ( 1741 tagged_data 1742 | 'WithAutoSharding' >> beam.GroupIntoBatches.WithShardedKey( 1743 (self.batch_size or BigQueryWriteFn.DEFAULT_MAX_BUFFERED_ROWS), 1744 self.triggering_frequency or 1745 DEFAULT_BATCH_BUFFERING_DURATION_LIMIT_SEC) 1746 | 'DropShard' >> beam.Map(lambda kv: (kv[0].key, kv[1]))) 1747 1748 return ( 1749 tagged_data 1750 | 'FromHashableTableRef' >> beam.Map(_restore_table_ref) 1751 | 'StreamInsertRows' >> ParDo( 1752 bigquery_write_fn, *self.schema_side_inputs).with_outputs( 1753 BigQueryWriteFn.FAILED_ROWS, 1754 BigQueryWriteFn.FAILED_ROWS_WITH_ERRORS, 1755 main='main')) 1756 1757 1758 # Flag to be passed to WriteToBigQuery to force schema autodetection 1759 SCHEMA_AUTODETECT = 'SCHEMA_AUTODETECT' 1760 1761 1762 class WriteToBigQuery(PTransform): 1763 """Write data to BigQuery. 1764 1765 This transform receives a PCollection of elements to be inserted into BigQuery 1766 tables. The elements would come in as Python dictionaries, or as `TableRow` 1767 instances. 1768 """ 1769 class Method(object): 1770 DEFAULT = 'DEFAULT' 1771 STREAMING_INSERTS = 'STREAMING_INSERTS' 1772 FILE_LOADS = 'FILE_LOADS' 1773 STORAGE_WRITE_API = 'STORAGE_WRITE_API' 1774 1775 def __init__( 1776 self, 1777 table, 1778 dataset=None, 1779 project=None, 1780 schema=None, 1781 create_disposition=BigQueryDisposition.CREATE_IF_NEEDED, 1782 write_disposition=BigQueryDisposition.WRITE_APPEND, 1783 kms_key=None, 1784 batch_size=None, 1785 max_file_size=None, 1786 max_files_per_bundle=None, 1787 test_client=None, 1788 custom_gcs_temp_location=None, 1789 method=None, 1790 insert_retry_strategy=None, 1791 additional_bq_parameters=None, 1792 table_side_inputs=None, 1793 schema_side_inputs=None, 1794 triggering_frequency=None, 1795 use_at_least_once=False, 1796 validate=True, 1797 temp_file_format=None, 1798 ignore_insert_ids=False, 1799 # TODO(https://github.com/apache/beam/issues/20712): Switch the default 1800 # when the feature is mature. 1801 with_auto_sharding=False, 1802 ignore_unknown_columns=False, 1803 load_job_project_id=None, 1804 num_streaming_keys=DEFAULT_SHARDS_PER_DESTINATION, 1805 expansion_service=None): 1806 """Initialize a WriteToBigQuery transform. 1807 1808 Args: 1809 table (str, callable, ValueProvider): The ID of the table, or a callable 1810 that returns it. The ID must contain only letters ``a-z``, ``A-Z``, 1811 numbers ``0-9``, or connectors ``-_``. If dataset argument is 1812 :data:`None` then the table argument must contain the entire table 1813 reference specified as: ``'DATASET.TABLE'`` 1814 or ``'PROJECT:DATASET.TABLE'``. If it's a callable, it must receive one 1815 argument representing an element to be written to BigQuery, and return 1816 a TableReference, or a string table name as specified above. 1817 dataset (str): The ID of the dataset containing this table or 1818 :data:`None` if the table reference is specified entirely by the table 1819 argument. 1820 project (str): The ID of the project containing this table or 1821 :data:`None` if the table reference is specified entirely by the table 1822 argument. 1823 schema (str,dict,ValueProvider,callable): The schema to be used if the 1824 BigQuery table to write has to be created. This can be either specified 1825 as a :class:`~apache_beam.io.gcp.internal.clients.bigquery.\ 1826 bigquery_v2_messages.TableSchema`. 
It may also be provided as a `ValueProvider` holding a JSON-serialized 1827 schema string, as a python dictionary describing the schema, 1828 or as a single string of the form 1829 ``'field1:type1,field2:type2,field3:type3'`` that defines a comma 1830 separated list of fields. Here ``'type'`` should specify the BigQuery 1831 type of the field. Single string based schemas do not support nested 1832 fields, repeated fields, or specifying a BigQuery mode for fields 1833 (mode will always be set to ``'NULLABLE'``). 1834 If a callable, then it should receive a destination (in the form of 1835 a str) and return a str, dict or TableSchema. 1836 One may also pass ``SCHEMA_AUTODETECT`` here when using JSON-based 1837 file loads, and BigQuery will try to infer the schema for the files 1838 that are being loaded. 1839 create_disposition (BigQueryDisposition): A string describing what 1840 happens if the table does not exist. Possible values are: 1841 1842 * :attr:`BigQueryDisposition.CREATE_IF_NEEDED`: create if does not 1843 exist. 1844 * :attr:`BigQueryDisposition.CREATE_NEVER`: fail the write if does not 1845 exist. 1846 1847 write_disposition (BigQueryDisposition): A string describing what happens 1848 if the table already has some data. Possible values are: 1849 1850 * :attr:`BigQueryDisposition.WRITE_TRUNCATE`: delete existing rows. 1851 * :attr:`BigQueryDisposition.WRITE_APPEND`: add to existing rows. 1852 * :attr:`BigQueryDisposition.WRITE_EMPTY`: fail the write if table not 1853 empty. 1854 1855 For streaming pipelines, WRITE_TRUNCATE cannot be used. 1856 kms_key (str): Optional Cloud KMS key name for use when creating new 1857 tables. 1858 batch_size (int): Number of rows to be written to BQ per streaming API 1859 insert. The default is 500. 1860 test_client: Override the default bigquery client used for testing. 1861 max_file_size (int): The maximum size for a file to be written and then 1862 loaded into BigQuery. The default value is 4TB, which is 80% of the 1863 limit of 5TB for BigQuery to load any file. 1864 max_files_per_bundle (int): The maximum number of files to be concurrently 1865 written by a worker. The default here is 20. Larger values will allow 1866 writing to multiple destinations without having to reshard, but they 1867 increase the memory burden on the workers. 1868 custom_gcs_temp_location (str): A GCS location to store files to be used 1869 for file loads into BigQuery. By default, this will use the pipeline's 1870 temp_location, but for pipelines whose temp_location is not appropriate 1871 for BQ File Loads, users should pass a specific one. 1872 method: The method to use to write to BigQuery. It may be 1873 STREAMING_INSERTS, FILE_LOADS, STORAGE_WRITE_API or DEFAULT. An 1874 introduction on loading data to BigQuery: 1875 https://cloud.google.com/bigquery/docs/loading-data. 1876 DEFAULT will use STREAMING_INSERTS on streaming pipelines and 1877 FILE_LOADS on batch pipelines. 1878 Note: FILE_LOADS currently does not support BigQuery's JSON data type: 1879 https://cloud.google.com/bigquery/docs/reference/standard-sql/data-types#json_type 1880 insert_retry_strategy: The strategy to use when retrying streaming inserts 1881 into BigQuery. Options are shown in bigquery_tools.RetryStrategy attrs. 1882 Default is to retry always. This means that whenever there are rows 1883 that fail to be inserted to BigQuery, they will be retried indefinitely. 1884 Other retry strategy settings will produce a dead-letter PCollection 1885 as output.
Appropriate values are: 1886 1887 * `RetryStrategy.RETRY_ALWAYS`: retry all rows if 1888 there are any kind of errors. Note that this will hold your pipeline 1889 back if there are errors until you cancel or update it. 1890 * `RetryStrategy.RETRY_NEVER`: rows with errors 1891 will not be retried. Instead they will be output to a dead letter 1892 queue under the `'FailedRows'` tag. 1893 * `RetryStrategy.RETRY_ON_TRANSIENT_ERROR`: retry 1894 rows with transient errors (e.g. timeouts). Rows with permanent errors 1895 will be output to dead letter queue under `'FailedRows'` tag. 1896 1897 additional_bq_parameters (dict, callable): Additional parameters to pass 1898 to BQ when creating / loading data into a table. If a callable, it 1899 should be a function that receives a table reference indicating 1900 the destination and returns a dictionary. 1901 These can be 'timePartitioning', 'clustering', etc. They are passed 1902 directly to the job load configuration. See 1903 https://cloud.google.com/bigquery/docs/reference/rest/v2/Job#jobconfigurationload 1904 table_side_inputs (tuple): A tuple with ``AsSideInput`` PCollections to be 1905 passed to the table callable (if one is provided). 1906 schema_side_inputs: A tuple with ``AsSideInput`` PCollections to be 1907 passed to the schema callable (if one is provided). 1908 triggering_frequency (float): 1909 When method is FILE_LOADS: 1910 Value will be converted to int. Every triggering_frequency seconds, a 1911 BigQuery load job will be triggered for all the data written since the 1912 last load job. BigQuery has limits on how many load jobs can be 1913 triggered per day, so be careful not to set this duration too low, or 1914 you may exceed daily quota. Often this is set to 5 or 10 minutes to 1915 ensure that the project stays well under the BigQuery quota. See 1916 https://cloud.google.com/bigquery/quota-policy for more information 1917 about BigQuery quotas. 1918 1919 When method is STREAMING_INSERTS and with_auto_sharding=True: 1920 A streaming inserts batch will be submitted at least every 1921 triggering_frequency seconds when data is waiting. The batch can be 1922 sent earlier if it reaches the maximum batch size set by batch_size. 1923 Default value is 0.2 seconds. 1924 1925 When method is STORAGE_WRITE_API: 1926 A stream of rows will be committed every triggering_frequency seconds. 1927 By default, this will be 5 seconds to ensure exactly-once semantics. 1928 use_at_least_once: Intended only for STORAGE_WRITE_API. When True, will 1929 use at-least-once semantics. This is cheaper and provides lower 1930 latency, but will potentially duplicate records. 1931 validate: Indicates whether to perform validation checks on 1932 inputs. This parameter is primarily used for testing. 1933 temp_file_format: The format to use for file loads into BigQuery. The 1934 options are NEWLINE_DELIMITED_JSON or AVRO, with NEWLINE_DELIMITED_JSON 1935 being used by default. For advantages and limitations of the two 1936 formats, see 1937 https://cloud.google.com/bigquery/docs/loading-data-cloud-storage-avro 1938 and 1939 https://cloud.google.com/bigquery/docs/loading-data-cloud-storage-json. 1940 ignore_insert_ids: When using the STREAMING_INSERTS method to write data 1941 to BigQuery, `insert_ids` are a feature of BigQuery that support 1942 deduplication of events. If your use case is not sensitive to 1943 duplication of data inserted to BigQuery, set `ignore_insert_ids` 1944 to True to increase the throughput for BQ writing. 
See: 1945 https://cloud.google.com/bigquery/streaming-data-into-bigquery#disabling_best_effort_de-duplication 1946 with_auto_sharding: Experimental. If true, enables using a dynamically 1947 determined number of shards to write to BigQuery. This can be used for 1948 all of FILE_LOADS, STREAMING_INSERTS, and STORAGE_WRITE_API. Only 1949 applicable to unbounded input. 1950 ignore_unknown_columns: Accept rows that contain values that do not match 1951 the schema. The unknown values are ignored. Default is False, 1952 which treats unknown values as errors. This option is only valid for 1953 method=STREAMING_INSERTS. See reference: 1954 https://cloud.google.com/bigquery/docs/reference/rest/v2/tabledata/insertAll 1955 load_job_project_id: Specifies an alternate GCP project id to use for 1956 billing when using Batch File Loads. By default, the project id of the table is 1957 used. 1958 num_streaming_keys: The number of shards per destination when writing via 1959 streaming inserts. 1960 expansion_service: The address (host:port) of the expansion service. 1961 If no expansion service is provided, will attempt to run the default 1962 GCP expansion service. Used for the STORAGE_WRITE_API method. 1963 """ 1964 self._table = table 1965 self._dataset = dataset 1966 self._project = project 1967 self.table_reference = bigquery_tools.parse_table_reference( 1968 table, dataset, project) 1969 self.create_disposition = BigQueryDisposition.validate_create( 1970 create_disposition) 1971 self.write_disposition = BigQueryDisposition.validate_write( 1972 write_disposition) 1973 if schema == SCHEMA_AUTODETECT: 1974 self.schema = schema 1975 else: 1976 self.schema = bigquery_tools.get_dict_table_schema(schema) 1977 self.batch_size = batch_size 1978 self.kms_key = kms_key 1979 self.test_client = test_client 1980 1981 # TODO(pabloem): Consider handling ValueProvider for this location. 1982 self.custom_gcs_temp_location = custom_gcs_temp_location 1983 self.max_file_size = max_file_size 1984 self.max_files_per_bundle = max_files_per_bundle 1985 self.method = method or WriteToBigQuery.Method.DEFAULT 1986 self.triggering_frequency = triggering_frequency 1987 self.use_at_least_once = use_at_least_once 1988 self.expansion_service = expansion_service 1989 self.with_auto_sharding = with_auto_sharding 1990 self.insert_retry_strategy = insert_retry_strategy 1991 self._validate = validate 1992 self._temp_file_format = temp_file_format or bigquery_tools.FileFormat.JSON 1993 1994 self.additional_bq_parameters = additional_bq_parameters or {} 1995 self.table_side_inputs = table_side_inputs or () 1996 self.schema_side_inputs = schema_side_inputs or () 1997 self._ignore_insert_ids = ignore_insert_ids 1998 self._ignore_unknown_columns = ignore_unknown_columns 1999 self.load_job_project_id = load_job_project_id 2000 self._num_streaming_keys = num_streaming_keys 2001 2002 # Dict/schema methods were moved to bigquery_tools, but keep references 2003 # here for backward compatibility. 2004 get_table_schema_from_string = \ 2005 staticmethod(bigquery_tools.get_table_schema_from_string) 2006 table_schema_to_dict = staticmethod(bigquery_tools.table_schema_to_dict) 2007 get_dict_table_schema = staticmethod(bigquery_tools.get_dict_table_schema) 2008 2009 def _compute_method(self, experiments, is_streaming_pipeline): 2010 # If the new BQ sink is not activated for experiment flags, then we use 2011 # streaming inserts by default (it gets overridden in dataflow_runner.py).
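# As an illustrative summary of the branches below:
#   Method.DEFAULT on a streaming pipeline -> STREAMING_INSERTS
#   Method.DEFAULT on a batch pipeline     -> FILE_LOADS
#   any explicitly chosen method           -> returned unchanged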
2012 if self.method == self.Method.DEFAULT and is_streaming_pipeline: 2013 return self.Method.STREAMING_INSERTS 2014 elif self.method == self.Method.DEFAULT and not is_streaming_pipeline: 2015 return self.Method.FILE_LOADS 2016 else: 2017 return self.method 2018 2019 def expand(self, pcoll): 2020 p = pcoll.pipeline 2021 2022 if (isinstance(self.table_reference, TableReference) and 2023 self.table_reference.projectId is None): 2024 self.table_reference.projectId = pcoll.pipeline.options.view_as( 2025 GoogleCloudOptions).project 2026 2027 # TODO(pabloem): Use a different method to determine if streaming or batch. 2028 is_streaming_pipeline = p.options.view_as(StandardOptions).streaming 2029 2030 if not is_streaming_pipeline and self.with_auto_sharding: 2031 raise ValueError( 2032 'with_auto_sharding is not applicable to batch pipelines.') 2033 2034 experiments = p.options.view_as(DebugOptions).experiments or [] 2035 method_to_use = self._compute_method(experiments, is_streaming_pipeline) 2036 2037 if method_to_use == WriteToBigQuery.Method.STREAMING_INSERTS: 2038 if self.schema == SCHEMA_AUTODETECT: 2039 raise ValueError( 2040 'Schema auto-detection is not supported for streaming ' 2041 'inserts into BigQuery. Only for File Loads.') 2042 2043 if self.triggering_frequency is not None and not self.with_auto_sharding: 2044 raise ValueError( 2045 'triggering_frequency with STREAMING_INSERTS can only be used with ' 2046 'with_auto_sharding=True.') 2047 2048 outputs = pcoll | _StreamToBigQuery( 2049 table_reference=self.table_reference, 2050 table_side_inputs=self.table_side_inputs, 2051 schema_side_inputs=self.schema_side_inputs, 2052 schema=self.schema, 2053 batch_size=self.batch_size, 2054 triggering_frequency=self.triggering_frequency, 2055 create_disposition=self.create_disposition, 2056 write_disposition=self.write_disposition, 2057 kms_key=self.kms_key, 2058 retry_strategy=self.insert_retry_strategy, 2059 additional_bq_parameters=self.additional_bq_parameters, 2060 ignore_insert_ids=self._ignore_insert_ids, 2061 ignore_unknown_columns=self._ignore_unknown_columns, 2062 with_auto_sharding=self.with_auto_sharding, 2063 test_client=self.test_client, 2064 num_streaming_keys=self._num_streaming_keys) 2065 2066 return WriteResult( 2067 method=WriteToBigQuery.Method.STREAMING_INSERTS, 2068 failed_rows=outputs[BigQueryWriteFn.FAILED_ROWS], 2069 failed_rows_with_errors=outputs[ 2070 BigQueryWriteFn.FAILED_ROWS_WITH_ERRORS]) 2071 elif method_to_use == WriteToBigQuery.Method.FILE_LOADS: 2072 if self._temp_file_format == bigquery_tools.FileFormat.AVRO: 2073 if self.schema == SCHEMA_AUTODETECT: 2074 raise ValueError( 2075 'Schema auto-detection is not supported when using Avro based ' 2076 'file loads into BigQuery. Please specify a schema or set ' 2077 'temp_file_format="NEWLINE_DELIMITED_JSON"') 2078 if self.schema is None: 2079 raise ValueError( 2080 'A schema must be provided when writing to BigQuery using ' 2081 'Avro based file loads') 2082 2083 if self.schema and type(self.schema) is dict: 2084 2085 def find_in_nested_dict(schema): 2086 for field in schema['fields']: 2087 if field['type'] == 'JSON': 2088 raise ValueError( 2089 'Found JSON type in table schema. JSON data ' 2090 'insertion is currently not supported with ' 2091 'FILE_LOADS write method. This is supported with ' 2092 'STREAMING_INSERTS. 
For more information: ' 2093 'https://cloud.google.com/bigquery/docs/reference/' 2094 'standard-sql/json-data#ingest_json_data') 2095 elif field['type'] == 'STRUCT': 2096 find_in_nested_dict(field) 2097 2098 find_in_nested_dict(self.schema) 2099 2100 from apache_beam.io.gcp.bigquery_file_loads import BigQueryBatchFileLoads 2101 # Only cast to int when a value is given. 2102 # We only use an int for BigQueryBatchFileLoads 2103 if self.triggering_frequency is not None: 2104 triggering_frequency = int(self.triggering_frequency) 2105 else: 2106 triggering_frequency = self.triggering_frequency 2107 output = pcoll | BigQueryBatchFileLoads( 2108 destination=self.table_reference, 2109 schema=self.schema, 2110 project=self._project, 2111 create_disposition=self.create_disposition, 2112 write_disposition=self.write_disposition, 2113 triggering_frequency=triggering_frequency, 2114 with_auto_sharding=self.with_auto_sharding, 2115 temp_file_format=self._temp_file_format, 2116 max_file_size=self.max_file_size, 2117 max_files_per_bundle=self.max_files_per_bundle, 2118 custom_gcs_temp_location=self.custom_gcs_temp_location, 2119 test_client=self.test_client, 2120 table_side_inputs=self.table_side_inputs, 2121 schema_side_inputs=self.schema_side_inputs, 2122 additional_bq_parameters=self.additional_bq_parameters, 2123 validate=self._validate, 2124 is_streaming_pipeline=is_streaming_pipeline, 2125 load_job_project_id=self.load_job_project_id) 2126 2127 return WriteResult( 2128 method=WriteToBigQuery.Method.FILE_LOADS, 2129 destination_load_jobid_pairs=output[ 2130 BigQueryBatchFileLoads.DESTINATION_JOBID_PAIRS], 2131 destination_file_pairs=output[ 2132 BigQueryBatchFileLoads.DESTINATION_FILE_PAIRS], 2133 destination_copy_jobid_pairs=output[ 2134 BigQueryBatchFileLoads.DESTINATION_COPY_JOBID_PAIRS]) 2135 else: 2136 # Storage Write API 2137 if self.schema is None: 2138 raise AttributeError( 2139 "A schema is required in order to prepare rows " 2140 "for writing with STORAGE_WRITE_API.") 2141 if callable(self.schema): 2142 raise NotImplementedError( 2143 "Writing to dynamic destinations is not " 2144 "supported for this write method.") 2145 elif isinstance(self.schema, vp.ValueProvider): 2146 schema = self.schema.get() 2147 else: 2148 schema = self.schema 2149 2150 table = bigquery_tools.get_hashable_destination(self.table_reference) 2151 # A triggering_frequency of None is not supported, so default to 0. 2152 triggering_frequency = self.triggering_frequency or 0 2153 # SchemaTransform expects Beam Rows, so map to Rows first 2154 output_beam_rows = ( 2155 pcoll 2156 | 2157 beam.Map(lambda row: bigquery_tools.beam_row_from_dict(row, schema)).
2158 with_output_types( 2159 RowTypeConstraint.from_fields( 2160 bigquery_tools.get_beam_typehints_from_tableschema(schema))) 2161 | StorageWriteToBigQuery( 2162 table=table, 2163 create_disposition=self.create_disposition, 2164 write_disposition=self.write_disposition, 2165 triggering_frequency=triggering_frequency, 2166 use_at_least_once=self.use_at_least_once, 2167 with_auto_sharding=self.with_auto_sharding, 2168 expansion_service=self.expansion_service)) 2169 2170 # Convert the Beam Rows back to Python dict elements. 2171 failed_rows = ( 2172 output_beam_rows[StorageWriteToBigQuery.FAILED_ROWS] 2173 | beam.Map(lambda row: row.as_dict())) 2174 failed_rows_with_errors = ( 2175 output_beam_rows[StorageWriteToBigQuery.FAILED_ROWS_WITH_ERRORS] 2176 | beam.Map( 2177 lambda row: { 2178 "error_message": row.error_message, 2179 "failed_row": row.failed_row.as_dict() 2180 })) 2181 2182 return WriteResult( 2183 method=WriteToBigQuery.Method.STORAGE_WRITE_API, 2184 failed_rows=failed_rows, 2185 failed_rows_with_errors=failed_rows_with_errors) 2186 2187 def display_data(self): 2188 res = {} 2189 if self.table_reference is not None and isinstance(self.table_reference, 2190 TableReference): 2191 tableSpec = '{}.{}'.format( 2192 self.table_reference.datasetId, self.table_reference.tableId) 2193 if self.table_reference.projectId is not None: 2194 tableSpec = '{}:{}'.format(self.table_reference.projectId, tableSpec) 2195 res['table'] = DisplayDataItem(tableSpec, label='Table') 2196 2197 res['validation'] = DisplayDataItem( 2198 self._validate, label="Validation Enabled") 2199 return res 2200 2201 def to_runner_api_parameter(self, context): 2202 from apache_beam.internal import pickler 2203 2204 # It'd be nice to name these according to their actual 2205 # names/positions in the original argument list, but such a 2206 # transformation is currently irreversible given how 2207 # remove_objects_from_args and insert_values_in_args 2208 # are currently implemented.
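# Consequently, each tuple of side inputs is serialized below into a dict
# keyed by SIDE_INPUT_PREFIX plus the side input's position (for example
# SIDE_INPUT_PREFIX + '0', SIDE_INPUT_PREFIX + '1', ...); from_runner_api
# later sorts the entries back by that index to recover the original order.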
2209 def serialize(side_inputs): 2210 return {(SIDE_INPUT_PREFIX + '%s') % ix: 2211 si.to_runner_api(context).SerializeToString() 2212 for ix, 2213 si in enumerate(side_inputs)} 2214 2215 table_side_inputs = serialize(self.table_side_inputs) 2216 schema_side_inputs = serialize(self.schema_side_inputs) 2217 2218 config = { 2219 'table': self._table, 2220 'dataset': self._dataset, 2221 'project': self._project, 2222 'schema': self.schema, 2223 'create_disposition': self.create_disposition, 2224 'write_disposition': self.write_disposition, 2225 'kms_key': self.kms_key, 2226 'batch_size': self.batch_size, 2227 'max_file_size': self.max_file_size, 2228 'max_files_per_bundle': self.max_files_per_bundle, 2229 'custom_gcs_temp_location': self.custom_gcs_temp_location, 2230 'method': self.method, 2231 'insert_retry_strategy': self.insert_retry_strategy, 2232 'additional_bq_parameters': self.additional_bq_parameters, 2233 'table_side_inputs': table_side_inputs, 2234 'schema_side_inputs': schema_side_inputs, 2235 'triggering_frequency': self.triggering_frequency, 2236 'validate': self._validate, 2237 'temp_file_format': self._temp_file_format, 2238 'ignore_insert_ids': self._ignore_insert_ids, 2239 'with_auto_sharding': self.with_auto_sharding, 2240 } 2241 return 'beam:transform:write_to_big_query:v0', pickler.dumps(config) 2242 2243 @PTransform.register_urn('beam:transform:write_to_big_query:v0', bytes) 2244 def from_runner_api(unused_ptransform, payload, context): 2245 from apache_beam.internal import pickler 2246 from apache_beam.portability.api import beam_runner_api_pb2 2247 2248 config = pickler.loads(payload) 2249 2250 def deserialize(side_inputs): 2251 deserialized_side_inputs = {} 2252 for k, v in side_inputs.items(): 2253 side_input = beam_runner_api_pb2.SideInput() 2254 side_input.ParseFromString(v) 2255 deserialized_side_inputs[k] = side_input 2256 2257 # This is an ordered list stored as a dict (see the comments in 2258 # to_runner_api_parameter above). 2259 indexed_side_inputs = [( 2260 get_sideinput_index(tag), 2261 pvalue.AsSideInput.from_runner_api(si, context)) for tag, 2262 si in deserialized_side_inputs.items()] 2263 return [si for _, si in sorted(indexed_side_inputs)] 2264 2265 config['table_side_inputs'] = deserialize(config['table_side_inputs']) 2266 config['schema_side_inputs'] = deserialize(config['schema_side_inputs']) 2267 2268 return WriteToBigQuery(**config) 2269 2270 2271 class WriteResult: 2272 """The result of a WriteToBigQuery transform. 2273 """ 2274 def __init__( 2275 self, 2276 method: WriteToBigQuery.Method = None, 2277 destination_load_jobid_pairs: PCollection[Tuple[str, 2278 JobReference]] = None, 2279 destination_file_pairs: PCollection[Tuple[str, Tuple[str, int]]] = None, 2280 destination_copy_jobid_pairs: PCollection[Tuple[str, 2281 JobReference]] = None, 2282 failed_rows: PCollection[Tuple[str, dict]] = None, 2283 failed_rows_with_errors: PCollection[Tuple[str, dict, list]] = None): 2284 2285 self._method = method 2286 self._destination_load_jobid_pairs = destination_load_jobid_pairs 2287 self._destination_file_pairs = destination_file_pairs 2288 self._destination_copy_jobid_pairs = destination_copy_jobid_pairs 2289 self._failed_rows = failed_rows 2290 self._failed_rows_with_errors = failed_rows_with_errors 2291 2292 from apache_beam.io.gcp.bigquery_file_loads import BigQueryBatchFileLoads 2293 self.attributes = { 2294 BigQueryWriteFn.FAILED_ROWS: WriteResult.failed_rows, 2295 BigQueryWriteFn.FAILED_ROWS_WITH_ERRORS: WriteResult. 
2296 failed_rows_with_errors, 2297 BigQueryBatchFileLoads.DESTINATION_JOBID_PAIRS: WriteResult. 2298 destination_load_jobid_pairs, 2299 BigQueryBatchFileLoads.DESTINATION_FILE_PAIRS: WriteResult. 2300 destination_file_pairs, 2301 BigQueryBatchFileLoads.DESTINATION_COPY_JOBID_PAIRS: WriteResult. 2302 destination_copy_jobid_pairs, 2303 } 2304 2305 def validate(self, valid_methods, attribute): 2306 if self._method not in valid_methods: 2307 raise AttributeError( 2308 f'Cannot get {attribute} because it is not produced ' 2309 f'by the {self._method} write method. Note: only ' 2310 f'{valid_methods} produces this attribute.') 2311 2312 @property 2313 def destination_load_jobid_pairs( 2314 self) -> PCollection[Tuple[str, JobReference]]: 2315 """A ``FILE_LOADS`` method attribute 2316 2317 Returns: A PCollection of the table destinations that were successfully 2318 loaded to using the batch load API, along with the load job IDs. 2319 2320 Raises: AttributeError: if accessed with a write method 2321 besides ``FILE_LOADS``.""" 2322 self.validate([WriteToBigQuery.Method.FILE_LOADS], 2323 'DESTINATION_JOBID_PAIRS') 2324 2325 return self._destination_load_jobid_pairs 2326 2327 @property 2328 def destination_file_pairs(self) -> PCollection[Tuple[str, Tuple[str, int]]]: 2329 """A ``FILE_LOADS`` method attribute 2330 2331 Returns: A PCollection of the table destinations along with the 2332 temp files used as sources to load from. 2333 2334 Raises: AttributeError: if accessed with a write method 2335 besides ``FILE_LOADS``.""" 2336 self.validate([WriteToBigQuery.Method.FILE_LOADS], 'DESTINATION_FILE_PAIRS') 2337 2338 return self._destination_file_pairs 2339 2340 @property 2341 def destination_copy_jobid_pairs( 2342 self) -> PCollection[Tuple[str, JobReference]]: 2343 """A ``FILE_LOADS`` method attribute 2344 2345 Returns: A PCollection of the table destinations that were successfully 2346 copied to, along with the copy job ID. 2347 2348 Raises: AttributeError: if accessed with a write method 2349 besides ``FILE_LOADS``.""" 2350 self.validate([WriteToBigQuery.Method.FILE_LOADS], 2351 'DESTINATION_COPY_JOBID_PAIRS') 2352 2353 return self._destination_copy_jobid_pairs 2354 2355 @property 2356 def failed_rows(self) -> PCollection[Tuple[str, dict]]: 2357 """A ``[STREAMING_INSERTS, STORAGE_WRITE_API]`` method attribute 2358 2359 Returns: A PCollection of rows that failed when inserting to BigQuery. 2360 2361 Raises: AttributeError: if accessed with a write method 2362 besides ``[STREAMING_INSERTS, STORAGE_WRITE_API]``.""" 2363 self.validate([ 2364 WriteToBigQuery.Method.STREAMING_INSERTS, 2365 WriteToBigQuery.Method.STORAGE_WRITE_API 2366 ], 2367 'FAILED_ROWS') 2368 2369 return self._failed_rows 2370 2371 @property 2372 def failed_rows_with_errors(self) -> PCollection[Tuple[str, dict, list]]: 2373 """A ``[STREAMING_INSERTS, STORAGE_WRITE_API]`` method attribute 2374 2375 Returns: 2376 A PCollection of rows that failed when inserting to BigQuery, 2377 along with their errors. 2378 2379 Raises: 2380 AttributeError: if accessed with a write method 2381 besides ``[STREAMING_INSERTS, STORAGE_WRITE_API]``.""" 2382 self.validate([ 2383 WriteToBigQuery.Method.STREAMING_INSERTS, 2384 WriteToBigQuery.Method.STORAGE_WRITE_API 2385 ], 2386 'FAILED_ROWS_WITH_ERRORS') 2387 2388 return self._failed_rows_with_errors 2389 2390 def __getitem__(self, key): 2391 if key not in self.attributes: 2392 raise AttributeError( 2393 f'Error trying to access nonexistent attribute `{key}` in write ' 2394 'result. 
Please see __documentation__ for available attributes.') 2395 2396 return self.attributes[key].__get__(self, WriteResult) 2397 2398 2399 def _default_io_expansion_service(append_args=None): 2400 return BeamJarExpansionService( 2401 'sdks:java:io:google-cloud-platform:expansion-service:build', 2402 append_args=append_args) 2403 2404 2405 class StorageWriteToBigQuery(PTransform): 2406 """Writes data to BigQuery using Storage API. 2407 2408 Experimental; no backwards compatibility guarantees. 2409 """ 2410 URN = "beam:schematransform:org.apache.beam:bigquery_storage_write:v1" 2411 FAILED_ROWS = "FailedRows" 2412 FAILED_ROWS_WITH_ERRORS = "FailedRowsWithErrors" 2413 2414 def __init__( 2415 self, 2416 table, 2417 create_disposition=BigQueryDisposition.CREATE_IF_NEEDED, 2418 write_disposition=BigQueryDisposition.WRITE_APPEND, 2419 triggering_frequency=0, 2420 use_at_least_once=False, 2421 with_auto_sharding=False, 2422 expansion_service=None): 2423 """Initialize a StorageWriteToBigQuery transform. 2424 2425 :param table: 2426 Fully-qualified table ID specified as ``'PROJECT:DATASET.TABLE'``. 2427 :param create_disposition: 2428 String specifying the strategy to take when the table doesn't 2429 exist. Possible values are: 2430 * ``'CREATE_IF_NEEDED'``: create if does not exist. 2431 * ``'CREATE_NEVER'``: fail the write if does not exist. 2432 :param write_disposition: 2433 String specifying the strategy to take when the table already 2434 contains data. Possible values are: 2435 * ``'WRITE_TRUNCATE'``: delete existing rows. 2436 * ``'WRITE_APPEND'``: add to existing rows. 2437 * ``'WRITE_EMPTY'``: fail the write if table not empty. 2438 :param triggering_frequency: 2439 The time in seconds between write commits. Should only be specified 2440 for streaming pipelines. Defaults to 5 seconds. 2441 :param use_at_least_once: 2442 Use at-least-once semantics. Is cheaper and provides lower latency, 2443 but will potentially duplicate records. 2444 :param with_auto_sharding: 2445 Experimental. If true, enables using a dynamically determined number of 2446 shards to write to BigQuery. Only applicable to unbounded input. 2447 :param expansion_service: 2448 The address (host:port) of the expansion service. If no expansion 2449 service is provided, will attempt to run the default GCP expansion 2450 service. 
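    A minimal usage sketch (the table name is a placeholder, and ``rows`` is
    assumed to be a PCollection of schema'd Beam Rows; most pipelines reach
    this transform indirectly through ``WriteToBigQuery`` with
    ``method='STORAGE_WRITE_API'`` rather than applying it directly)::

      result = rows | StorageWriteToBigQuery(
          table='my_project:my_dataset.my_table',
          triggering_frequency=5,
          use_at_least_once=False)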
2451 """ 2452 super().__init__() 2453 self._table = table 2454 self._create_disposition = create_disposition 2455 self._write_disposition = write_disposition 2456 self._triggering_frequency = triggering_frequency 2457 self._use_at_least_once = use_at_least_once 2458 self._with_auto_sharding = with_auto_sharding 2459 self._expansion_service = ( 2460 expansion_service or _default_io_expansion_service()) 2461 self.schematransform_config = SchemaAwareExternalTransform.discover_config( 2462 self._expansion_service, self.URN) 2463 2464 def expand(self, input): 2465 external_storage_write = SchemaAwareExternalTransform( 2466 identifier=self.schematransform_config.identifier, 2467 expansion_service=self._expansion_service, 2468 rearrange_based_on_discovery=True, 2469 autoSharding=self._with_auto_sharding, 2470 createDisposition=self._create_disposition, 2471 table=self._table, 2472 triggeringFrequencySeconds=self._triggering_frequency, 2473 useAtLeastOnceSemantics=self._use_at_least_once, 2474 writeDisposition=self._write_disposition, 2475 ) 2476 2477 input_tag = self.schematransform_config.inputs[0] 2478 2479 return {input_tag: input} | external_storage_write 2480 2481 2482 class ReadFromBigQuery(PTransform): 2483 """Read data from BigQuery. 2484 2485 This PTransform uses a BigQuery export job to take a snapshot of the table 2486 on GCS, and then reads from each produced file. File format is Avro by 2487 default. 2488 2489 Args: 2490 method: The method to use to read from BigQuery. It may be EXPORT or 2491 DIRECT_READ. EXPORT invokes a BigQuery export request 2492 (https://cloud.google.com/bigquery/docs/exporting-data). DIRECT_READ reads 2493 directly from BigQuery storage using the BigQuery Read API 2494 (https://cloud.google.com/bigquery/docs/reference/storage). If 2495 unspecified, the default is currently EXPORT. 2496 use_native_datetime (bool): By default this transform exports BigQuery 2497 DATETIME fields as formatted strings (for example: 2498 2021-01-01T12:59:59). If :data:`True`, BigQuery DATETIME fields will 2499 be returned as native Python datetime objects. This can only be used when 2500 'method' is 'DIRECT_READ'. 2501 table (str, callable, ValueProvider): The ID of the table, or a callable 2502 that returns it. If dataset argument is :data:`None` then the table 2503 argument must contain the entire table reference specified as: 2504 ``'DATASET.TABLE'`` or ``'PROJECT:DATASET.TABLE'``. If it's a callable, 2505 it must receive one argument representing an element to be written to 2506 BigQuery, and return a TableReference, or a string table name as specified 2507 above. 2508 dataset (str): The ID of the dataset containing this table or 2509 :data:`None` if the table reference is specified entirely by the table 2510 argument. 2511 project (str): The ID of the project containing this table. 2512 query (str, ValueProvider): A query to be used instead of arguments 2513 table, dataset, and project. 2514 validate (bool): If :data:`True`, various checks will be done when source 2515 gets initialized (e.g., is table present?). This should be 2516 :data:`True` for most scenarios in order to catch errors as early as 2517 possible (pipeline construction instead of pipeline execution). It 2518 should be :data:`False` if the table is created during pipeline 2519 execution by a previous step. 2520 coder (~apache_beam.coders.coders.Coder): The coder for the table 2521 rows. 
If :data:`None`, then the default coder is 2522 _JsonToDictCoder, which will interpret every row as a JSON 2523 serialized dictionary. 2524 use_standard_sql (bool): Specifies whether to use BigQuery's standard SQL 2525 dialect for this query. The default value is :data:`False`. 2526 If set to :data:`True`, the query will use BigQuery's updated SQL 2527 dialect with improved standards compliance. 2528 This parameter is ignored for table inputs. 2529 flatten_results (bool): Flattens all nested and repeated fields in the 2530 query results. The default value is :data:`True`. 2531 kms_key (str): Optional Cloud KMS key name for use when creating new 2532 temporary tables. 2533 gcs_location (str, ValueProvider): The name of the Google Cloud Storage 2534 bucket where the extracted table should be written, given as a string or 2535 a :class:`~apache_beam.options.value_provider.ValueProvider`. If 2536 :data:`None`, then the temp_location parameter is used. 2537 bigquery_job_labels (dict): A dictionary with string labels to be passed 2538 to BigQuery export and query jobs created by this transform. See: 2539 https://cloud.google.com/bigquery/docs/reference/rest/v2/\ 2540 Job#JobConfiguration 2541 use_json_exports (bool): By default, this transform works by exporting 2542 BigQuery data into Avro files, and reading those files. With this 2543 parameter, the transform will instead export to JSON files. JSON files 2544 are slower to read due to their larger size. 2545 When using JSON exports, the BigQuery types for DATE, DATETIME, TIME, and 2546 TIMESTAMP will be exported as strings. This behavior is consistent with 2547 BigQuerySource. 2548 When using Avro exports, these fields will be exported as native Python 2549 types (datetime.date, datetime.datetime, datetime.time, 2550 and datetime.datetime respectively). Avro exports are recommended. 2551 To learn more about BigQuery types, and Time-related type 2552 representations, see: https://cloud.google.com/bigquery/docs/reference/\ 2553 standard-sql/data-types 2554 To learn more about type conversions between BigQuery and Avro, see: 2555 https://cloud.google.com/bigquery/docs/loading-data-cloud-storage-avro\ 2556 #avro_conversions 2557 temp_dataset (``apache_beam.io.gcp.internal.clients.bigquery.\ 2558 DatasetReference``): 2559 Temporary dataset reference to use when reading from BigQuery using a 2560 query. When reading using a query, BigQuery source will create a 2561 temporary dataset and a temporary table to store the results of the 2562 query. With this option, you can set an existing dataset to create the 2563 temporary table in. BigQuery source will create a temporary table in 2564 that dataset, and will remove it once it is not needed. The job needs access 2565 to create and delete tables within the given dataset. The dataset name 2566 should *not* start with the reserved prefix `beam_temp_dataset_`. 2567 query_priority (BigQueryQueryPriority): By default, this transform runs 2568 queries with BATCH priority. Use :attr:`BigQueryQueryPriority.INTERACTIVE` 2569 to run queries with INTERACTIVE priority. This option is ignored when 2570 reading from a table rather than a query. To learn more about query 2571 priority, see: https://cloud.google.com/bigquery/docs/running-queries 2572 output_type (str): By default, this source yields Python dictionaries 2573 (`PYTHON_DICT`). There is experimental support for producing a 2574 PCollection with a schema and yielding Beam Rows via the option 2575 `BEAM_ROW`.
For more information on schemas, see 2576 https://beam.apache.org/documentation/programming-guide/\ 2577 #what-is-a-schema 2578 """ 2579 class Method(object): 2580 EXPORT = 'EXPORT' # This is currently the default. 2581 DIRECT_READ = 'DIRECT_READ' 2582 2583 COUNTER = 0 2584 2585 def __init__( 2586 self, 2587 gcs_location=None, 2588 method=None, 2589 use_native_datetime=False, 2590 output_type=None, 2591 *args, 2592 **kwargs): 2593 self.method = method or ReadFromBigQuery.Method.EXPORT 2594 self.use_native_datetime = use_native_datetime 2595 self.output_type = output_type 2596 self._args = args 2597 self._kwargs = kwargs 2598 2599 if self.method is ReadFromBigQuery.Method.EXPORT \ 2600 and self.use_native_datetime is True: 2601 raise TypeError( 2602 'The "use_native_datetime" parameter cannot be True for EXPORT.' 2603 ' Please set the "use_native_datetime" parameter to False *OR*' 2604 ' set the "method" parameter to ReadFromBigQuery.Method.DIRECT_READ.') 2605 2606 if gcs_location and self.method is ReadFromBigQuery.Method.EXPORT: 2607 if not isinstance(gcs_location, (str, ValueProvider)): 2608 raise TypeError( 2609 '%s: gcs_location must be of type string' 2610 ' or ValueProvider; got %r instead' % 2611 (self.__class__.__name__, type(gcs_location))) 2612 if isinstance(gcs_location, str): 2613 gcs_location = StaticValueProvider(str, gcs_location) 2614 2615 if self.output_type == 'BEAM_ROW' and self._kwargs.get('query', 2616 None) is not None: 2617 raise ValueError( 2618 "Both a query and an output type of 'BEAM_ROW' were specified. " 2619 "'BEAM_ROW' is not currently supported with queries.") 2620 2621 self.gcs_location = gcs_location 2622 self.bigquery_dataset_labels = { 2623 'type': 'bq_direct_read_' + str(uuid.uuid4())[0:10] 2624 } 2625 2626 def expand(self, pcoll): 2627 if self.method is ReadFromBigQuery.Method.EXPORT: 2628 output_pcollection = self._expand_export(pcoll) 2629 elif self.method is ReadFromBigQuery.Method.DIRECT_READ: 2630 output_pcollection = self._expand_direct_read(pcoll) 2631 2632 else: 2633 raise ValueError( 2634 'The method to read from BigQuery must be either EXPORT ' 2635 'or DIRECT_READ.') 2636 return self._expand_output_type(output_pcollection) 2637 2638 def _expand_output_type(self, output_pcollection): 2639 if self.output_type == 'PYTHON_DICT' or self.output_type is None: 2640 return output_pcollection 2641 elif self.output_type == 'BEAM_ROW': 2642 table_details = bigquery_tools.parse_table_reference( 2643 table=self._kwargs.get("table", None), 2644 dataset=self._kwargs.get("dataset", None), 2645 project=self._kwargs.get("project", None)) 2646 if isinstance(self._kwargs['table'], ValueProvider): 2647 raise TypeError( 2648 '%s: table must be of type string' 2649 '; got ValueProvider instead' % self.__class__.__name__) 2650 elif callable(self._kwargs['table']): 2651 raise TypeError( 2652 '%s: table must be of type string' 2653 '; got a callable instead' % self.__class__.__name__) 2654 return output_pcollection | bigquery_schema_tools.\ 2655 convert_to_usertype( 2656 bigquery_tools.BigQueryWrapper().get_table( 2657 project_id=table_details.projectId, 2658 dataset_id=table_details.datasetId, 2659 table_id=table_details.tableId).schema) 2660 2661 def _expand_export(self, pcoll): 2662 # TODO(https://github.com/apache/beam/issues/20683): Make ReadFromBQ rely 2663 # on ReadAllFromBQ implementation.
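# Roughly, the export path below (1) builds a unique GCS directory for this
# read, (2) reads the exported files through _CustomBigQuerySource, and
# (3) removes the exported files afterwards via _PassThroughThenCleanup.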
2664 temp_location = pcoll.pipeline.options.view_as( 2665 GoogleCloudOptions).temp_location 2666 job_name = pcoll.pipeline.options.view_as(GoogleCloudOptions).job_name 2667 gcs_location_vp = self.gcs_location 2668 unique_id = str(uuid.uuid4())[0:10] 2669 2670 def file_path_to_remove(unused_elm): 2671 gcs_location = bigquery_export_destination_uri( 2672 gcs_location_vp, temp_location, unique_id, True) 2673 return gcs_location + '/' 2674 2675 files_to_remove_pcoll = beam.pvalue.AsList( 2676 pcoll.pipeline 2677 | 'FilesToRemoveImpulse' >> beam.Create([None]) 2678 | 'MapFilesToRemove' >> beam.Map(file_path_to_remove)) 2679 2680 try: 2681 step_name = self.label 2682 except AttributeError: 2683 step_name = 'ReadFromBigQuery_%d' % ReadFromBigQuery.COUNTER 2684 ReadFromBigQuery.COUNTER += 1 2685 return ( 2686 pcoll 2687 | beam.io.Read( 2688 _CustomBigQuerySource( 2689 gcs_location=self.gcs_location, 2690 pipeline_options=pcoll.pipeline.options, 2691 method=self.method, 2692 job_name=job_name, 2693 step_name=step_name, 2694 unique_id=unique_id, 2695 *self._args, 2696 **self._kwargs)) 2697 | _PassThroughThenCleanup(files_to_remove_pcoll)) 2698 2699 def _expand_direct_read(self, pcoll): 2700 project_id = None 2701 temp_table_ref = None 2702 if 'temp_dataset' in self._kwargs: 2703 temp_table_ref = bigquery.TableReference( 2704 projectId=self._kwargs['temp_dataset'].projectId, 2705 datasetId=self._kwargs['temp_dataset'].datasetId, 2706 tableId='beam_temp_table_' + uuid.uuid4().hex) 2707 else: 2708 project_id = pcoll.pipeline.options.view_as(GoogleCloudOptions).project 2709 2710 def _get_pipeline_details(unused_elm): 2711 pipeline_details = {} 2712 if temp_table_ref is not None: 2713 pipeline_details['temp_table_ref'] = temp_table_ref 2714 elif project_id is not None: 2715 pipeline_details['project_id'] = project_id 2716 pipeline_details[ 2717 'bigquery_dataset_labels'] = self.bigquery_dataset_labels 2718 return pipeline_details 2719 2720 project_to_cleanup_pcoll = beam.pvalue.AsList( 2721 pcoll.pipeline 2722 | 'ProjectToCleanupImpulse' >> beam.Create([None]) 2723 | 'MapProjectToCleanup' >> beam.Map(_get_pipeline_details)) 2724 2725 return ( 2726 pcoll 2727 | beam.io.Read( 2728 _CustomBigQueryStorageSource( 2729 pipeline_options=pcoll.pipeline.options, 2730 method=self.method, 2731 use_native_datetime=self.use_native_datetime, 2732 temp_table=temp_table_ref, 2733 bigquery_dataset_labels=self.bigquery_dataset_labels, 2734 *self._args, 2735 **self._kwargs)) 2736 | _PassThroughThenCleanupTempDatasets(project_to_cleanup_pcoll)) 2737 2738 2739 class ReadFromBigQueryRequest: 2740 """ 2741 Class that defines data to read from BQ. 2742 """ 2743 def __init__( 2744 self, 2745 query: str = None, 2746 use_standard_sql: bool = True, 2747 table: Union[str, TableReference] = None, 2748 flatten_results: bool = False): 2749 """ 2750 Only one of query or table should be specified. 2751 2752 :param query: SQL query to fetch data. 2753 :param use_standard_sql: 2754 Specifies whether to use BigQuery's standard SQL dialect for this query. 2755 The default value is :data:`True`. If set to :data:`False`, 2756 the query will use BigQuery's legacy SQL dialect. 2757 This parameter is ignored for table inputs. 2758 :param table: 2759 The ID of the table to read. Table should define project and dataset 2760 (ex.: ``'PROJECT:DATASET.TABLE'``). 2761 :param flatten_results: 2762 Flattens all nested and repeated fields in the query results. 2763 The default value is :data:`False`. 
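    For example, a single request configured with the parameters above
    (the query text is a placeholder)::

      request = ReadFromBigQueryRequest(
          query='SELECT country, COUNT(*) AS n FROM my_dataset.events GROUP BY country',
          use_standard_sql=True,
          flatten_results=False)

    Passing both ``table`` and ``query``, or neither, raises a ``ValueError``
    from :meth:`validate` at construction time.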
2764 """ 2765 self.flatten_results = flatten_results 2766 self.query = query 2767 self.use_standard_sql = use_standard_sql 2768 self.table = table 2769 self.validate() 2770 2771 # We use this internal object ID to generate BigQuery export directories. 2772 self.obj_id = random.randint(0, 100000) 2773 2774 def validate(self): 2775 if self.table is not None and self.query is not None: 2776 raise ValueError( 2777 'Both a BigQuery table and a query were specified.' 2778 ' Please specify only one of these.') 2779 elif self.table is None and self.query is None: 2780 raise ValueError('A BigQuery table or a query must be specified') 2781 if self.table is not None: 2782 if isinstance(self.table, str): 2783 assert self.table.find('.') != -1, ( 2784 'Expected a table reference ' 2785 '(PROJECT:DATASET.TABLE or DATASET.TABLE) instead of %s' 2786 % self.table) 2787 2788 2789 class ReadAllFromBigQuery(PTransform): 2790 """Read data from BigQuery. 2791 2792 PTransform:ReadFromBigQueryRequest->Rows 2793 2794 This PTransform uses a BigQuery export job to take a snapshot of the table 2795 on GCS, and then reads from each produced file. Data is exported into 2796 a new subdirectory for each export using UUIDs generated in 2797 `ReadFromBigQueryRequest` objects. 2798 2799 It is recommended not to use this PTransform for streaming jobs on 2800 the GlobalWindow, since it will not be able to clean up snapshots. 2801 2802 Args: 2803 gcs_location (str): The name of the Google Cloud Storage 2804 bucket where the extracted table should be written, given as a string. If 2805 :data:`None`, then the temp_location parameter is used. 2806 validate (bool): If :data:`True`, various checks will be done when source 2807 gets initialized (e.g., is table present?). 2808 kms_key (str): Experimental. Optional Cloud KMS key name for use when 2809 creating new temporary tables.
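  A minimal sketch of applying this transform (the bucket name is a
  placeholder, and ``requests`` is assumed to be a PCollection of
  `ReadFromBigQueryRequest` objects built upstream)::

    rows = (
        requests
        | ReadAllFromBigQuery(
            gcs_location='gs://my-bucket/tmp',
            bigquery_job_labels={'team': 'analytics'},
            query_priority=BigQueryQueryPriority.BATCH))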
2810 """ 2811 COUNTER = 0 2812 2813 def __init__( 2814 self, 2815 gcs_location: Union[str, ValueProvider] = None, 2816 validate: bool = False, 2817 kms_key: str = None, 2818 temp_dataset: Union[str, DatasetReference] = None, 2819 bigquery_job_labels: Dict[str, str] = None, 2820 query_priority: str = BigQueryQueryPriority.BATCH): 2821 if gcs_location: 2822 if not isinstance(gcs_location, (str, ValueProvider)): 2823 raise TypeError( 2824 '%s: gcs_location must be of type string' 2825 ' or ValueProvider; got %r instead' % 2826 (self.__class__.__name__, type(gcs_location))) 2827 2828 self.gcs_location = gcs_location 2829 self.validate = validate 2830 self.kms_key = kms_key 2831 self.bigquery_job_labels = bigquery_job_labels 2832 self.temp_dataset = temp_dataset 2833 self.query_priority = query_priority 2834 2835 def expand(self, pcoll): 2836 job_name = pcoll.pipeline.options.view_as(GoogleCloudOptions).job_name 2837 project = pcoll.pipeline.options.view_as(GoogleCloudOptions).project 2838 unique_id = str(uuid.uuid4())[0:10] 2839 2840 try: 2841 step_name = self.label 2842 except AttributeError: 2843 step_name = 'ReadAllFromBigQuery_%d' % ReadAllFromBigQuery.COUNTER 2844 ReadAllFromBigQuery.COUNTER += 1 2845 2846 sources_to_read, cleanup_locations = ( 2847 pcoll 2848 | beam.ParDo( 2849 _BigQueryReadSplit( 2850 options=pcoll.pipeline.options, 2851 gcs_location=self.gcs_location, 2852 bigquery_job_labels=self.bigquery_job_labels, 2853 job_name=job_name, 2854 step_name=step_name, 2855 unique_id=unique_id, 2856 kms_key=self.kms_key, 2857 project=project, 2858 temp_dataset=self.temp_dataset, 2859 query_priority=self.query_priority)).with_outputs( 2860 "location_to_cleanup", main="files_to_read") 2861 ) 2862 2863 return ( 2864 sources_to_read 2865 | SDFBoundedSourceReader(data_to_display=self.display_data()) 2866 | _PassThroughThenCleanup(beam.pvalue.AsIter(cleanup_locations)))