github.com/apache/beam/sdks/v2@v2.48.2/python/apache_beam/io/gcp/bigquery.py

#
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements. See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License. You may obtain a copy of the License at
#
#    http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#

"""BigQuery sources and sinks.

This module implements reading from and writing to BigQuery tables. It relies
on several classes exposed by the BigQuery API: TableSchema, TableFieldSchema,
TableRow, and TableCell. The default mode is to return table rows read from a
BigQuery source as dictionaries. Similarly a Write transform to a BigQuerySink
accepts PCollections of dictionaries. This is done for more convenient
programming. If desired, the native TableRow objects can be used throughout to
represent rows (use an instance of TableRowJsonCoder as a coder argument when
creating the sources or sinks respectively).

Also, for programming convenience, instances of TableReference and TableSchema
have a string representation that can be used for the corresponding arguments:

- TableReference can be a PROJECT:DATASET.TABLE or DATASET.TABLE string.
- TableSchema can be a NAME:TYPE{,NAME:TYPE}* string
  (e.g. 'month:STRING,event_count:INTEGER').

The syntax supported is described here:
https://cloud.google.com/bigquery/bq-command-line-tool-quickstart

BigQuery sources can be used as main inputs or side inputs. A main input
(common case) is expected to be massive and will be split into manageable
chunks and processed in parallel. Side inputs are expected to be small and
will be read completely every time a ParDo DoFn gets executed. In the example
below the lambda function implementing the DoFn for the Map transform will get
on each call *one* row of the main table and *all* rows of the side table. The
runner may use some caching techniques to share the side inputs between calls
in order to avoid excessive reading::

  main_table = pipeline | 'VeryBig' >> beam.io.ReadFromBigQuery(...)
  side_table = pipeline | 'NotBig' >> beam.io.ReadFromBigQuery(...)
  results = (
      main_table
      | 'ProcessData' >> beam.Map(
          lambda element, side_input: ..., AsList(side_table)))

There is no difference in how main and side inputs are read. What makes the
side_table a 'side input' is the AsList wrapper used when passing the table
as a parameter to the Map transform. AsList signals to the execution framework
that its input should be made available whole.

The main and side inputs are implemented differently. Reading a BigQuery table
as main input entails exporting the table to a set of GCS files (in AVRO or in
JSON format) and then processing those files.
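For example, such a read can pin where those export files are staged by
passing the `gcs_location` parameter. This is a minimal sketch; the table and
bucket names below are placeholders::

  weather_data = pipeline | 'ReadBig' >> beam.io.ReadFromBigQuery(
      table='my_project:samples.weather_stations',
      gcs_location='gs://my-bucket/tmp/bq_export')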
Users may provide a query to read from rather than reading all of a BigQuery
table. If specified, the result obtained by executing the specified query will
be used as the data of the input transform::

  query_results = pipeline | beam.io.gcp.bigquery.ReadFromBigQuery(
      query='SELECT year, mean_temp FROM samples.weather_stations')

When creating a BigQuery input transform, users should provide either a query
or a table. Pipeline construction will fail with a validation error if neither
or both are specified.

When reading via `ReadFromBigQuery`, BYTES values are returned as raw
(already decoded) bytes, because `ReadFromBigQuery` uses Avro exports by
default. When reading from BigQuery using `apache_beam.io.BigQuerySource`,
bytes are returned as base64-encoded bytes. To get base64-encoded bytes using
`ReadFromBigQuery`, you can use the flag `use_json_exports` to export
data as JSON and receive base64-encoded bytes.

ReadAllFromBigQuery
-------------------
Beam 2.27.0 introduces a new transform called `ReadAllFromBigQuery` which
allows you to define table and query reads from BigQuery at pipeline
runtime::

  read_requests = p | beam.Create([
      ReadFromBigQueryRequest(query='SELECT * FROM mydataset.mytable'),
      ReadFromBigQueryRequest(table='myproject.mydataset.mytable')])
  results = read_requests | ReadAllFromBigQuery()

A good application for this transform is in streaming pipelines to
refresh a side input coming from BigQuery. This would work like so::

  side_input = (
      p
      | 'PeriodicImpulse' >> PeriodicImpulse(
          first_timestamp, last_timestamp, interval, True)
      | 'MapToReadRequest' >> beam.Map(
          lambda x: ReadFromBigQueryRequest(table='dataset.table'))
      | beam.io.ReadAllFromBigQuery())
  main_input = (
      p
      | 'MpImpulse' >> beam.Create(sample_main_input_elements)
      | 'MapMpToTimestamped' >> beam.Map(
          lambda src: TimestampedValue(src, src))
      | 'WindowMpInto' >> beam.WindowInto(
          window.FixedWindows(main_input_windowing_interval)))
  result = (
      main_input
      | 'ApplyCrossJoin' >> beam.FlatMap(
          cross_join, rights=beam.pvalue.AsIter(side_input)))

**Note**: This transform is supported on Portable and Dataflow v2 runners.

**Note**: This transform does not currently clean up temporary datasets
created for its execution. (BEAM-11359)

Writing Data to BigQuery
========================

The `WriteToBigQuery` transform is the recommended way of writing data to
BigQuery. It supports a large set of parameters to customize how you'd like to
write to BigQuery.
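In its simplest form, writing a PCollection of dictionaries to a table with a
known schema might look like the following. This is a minimal sketch; the
project, dataset and table names are placeholders::

  quotes = p | beam.Create([
      {'source': 'BigQuery', 'quote': 'Hello world'}])
  quotes | beam.io.WriteToBigQuery(
      'my_project:my_dataset.my_table',
      schema='source:STRING,quote:STRING',
      create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
      write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND)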
Table References
----------------

This transform allows you to provide static `project`, `dataset` and `table`
parameters which point to a specific BigQuery table to be created. The `table`
parameter can also be a dynamic parameter (i.e. a callable), which receives an
element to be written to BigQuery, and returns the table to which that element
should be written.

You may also provide a tuple of PCollectionView elements to be passed as side
inputs to your callable. For example, suppose that one wishes to send events
of different types to different tables, and the table names are computed at
pipeline runtime. One may then do something like the following::

  with Pipeline() as p:
    elements = (p | beam.Create([
        {'type': 'error', 'timestamp': '12:34:56', 'message': 'bad'},
        {'type': 'user_log', 'timestamp': '12:34:59', 'query': 'flu symptom'},
    ]))

    table_names = (p | beam.Create([
        ('error', 'my_project:dataset1.error_table_for_today'),
        ('user_log', 'my_project:dataset1.query_table_for_today'),
    ]))

    table_names_dict = beam.pvalue.AsDict(table_names)

    elements | beam.io.gcp.bigquery.WriteToBigQuery(
        table=lambda row, table_dict: table_dict[row['type']],
        table_side_inputs=(table_names_dict,))

In the example above, the `table_dict` argument passed to the lambda in
`table` is the side input coming from `table_names_dict`, which is passed
as part of the `table_side_inputs` argument.

Schemas
-------

This transform also allows you to provide a static or dynamic `schema`
parameter (i.e. a callable).

If providing a callable, this should take in a table reference (as returned by
the `table` parameter) and return the corresponding schema for that table.
This allows you to provide different schemas for different tables::

  def compute_table_name(row):
    ...

  errors_schema = {'fields': [
      {'name': 'type', 'type': 'STRING', 'mode': 'NULLABLE'},
      {'name': 'message', 'type': 'STRING', 'mode': 'NULLABLE'}]}
  queries_schema = {'fields': [
      {'name': 'type', 'type': 'STRING', 'mode': 'NULLABLE'},
      {'name': 'query', 'type': 'STRING', 'mode': 'NULLABLE'}]}

  with Pipeline() as p:
    elements = (p | beam.Create([
        {'type': 'error', 'timestamp': '12:34:56', 'message': 'bad'},
        {'type': 'user_log', 'timestamp': '12:34:59', 'query': 'flu symptom'},
    ]))

    elements | beam.io.gcp.bigquery.WriteToBigQuery(
        table=compute_table_name,
        schema=lambda table: (errors_schema
                              if 'errors' in table
                              else queries_schema))

It may be the case that schemas are computed at pipeline runtime. In cases
like these, one can also provide a `schema_side_inputs` parameter, which is
a tuple of PCollectionViews to be passed to the schema callable (much like
the `table_side_inputs` parameter).
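For instance, building on the example above, per-table schemas could be built
at runtime and handed to the schema callable as a side input. This is a
sketch; it assumes the destination strings returned by `compute_table_name`
match the keys in the side input::

  table_to_schema = (p | 'SchemaPerTable' >> beam.Create([
      ('my_project:dataset1.error_table_for_today', errors_schema),
      ('my_project:dataset1.query_table_for_today', queries_schema),
  ]))
  schemas_dict = beam.pvalue.AsDict(table_to_schema)

  elements | beam.io.gcp.bigquery.WriteToBigQuery(
      table=compute_table_name,
      schema=lambda table, schema_map: schema_map[table],
      schema_side_inputs=(schemas_dict,))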
Additional Parameters for BigQuery Tables
-----------------------------------------

This sink is able to create tables in BigQuery if they don't already exist. It
also relies on creating temporary tables when performing file loads.

The WriteToBigQuery transform creates tables using the BigQuery API by
inserting a load job (see the API reference [1]), or by inserting a new table
(see the API reference for that [2][3]).

When creating a new BigQuery table, there are a number of extra parameters
that one may need to specify. For example, clustering, partitioning, data
encoding, etc. It is possible to provide these additional parameters by
passing a Python dictionary as `additional_bq_parameters` to the transform.
As an example, to create a table that has specific partitioning and
clustering properties, one would do the following::

  additional_bq_parameters = {
      'timePartitioning': {'type': 'DAY'},
      'clustering': {'fields': ['country']}}
  with Pipeline() as p:
    elements = (p | beam.Create([
        {'country': 'mexico', 'timestamp': '12:34:56', 'query': 'acapulco'},
        {'country': 'canada', 'timestamp': '12:34:59', 'query': 'influenza'},
    ]))

    elements | beam.io.gcp.bigquery.WriteToBigQuery(
        table='project_name1:dataset_2.query_events_table',
        additional_bq_parameters=additional_bq_parameters)

Much like the schema case, `additional_bq_parameters` can also be a callable
that receives a table reference.
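For example, such a callable could restrict partitioning and clustering to a
particular destination. This is a sketch; the match on the table name is
purely illustrative::

  def additional_parameters(table_reference):
    # Only partition and cluster the events table; create any other table
    # with no extra settings.
    if 'query_events' in str(table_reference):
      return {
          'timePartitioning': {'type': 'DAY'},
          'clustering': {'fields': ['country']}}
    return {}

  elements | beam.io.gcp.bigquery.WriteToBigQuery(
      table='project_name1:dataset_2.query_events_table',
      additional_bq_parameters=additional_parameters)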
[1] https://cloud.google.com/bigquery/docs/reference/rest/v2/Job#jobconfigurationload
[2] https://cloud.google.com/bigquery/docs/reference/rest/v2/tables/insert
[3] https://cloud.google.com/bigquery/docs/reference/rest/v2/tables#resource

Chaining of operations after WriteToBigQuery
--------------------------------------------
WriteToBigQuery returns an object with several PCollections that consist of
metadata about the write operations. These are useful for inspecting the write
operation and for following up on its results::

  schema = {'fields': [
      {'name': 'column', 'type': 'STRING', 'mode': 'NULLABLE'}]}

  error_schema = {'fields': [
      {'name': 'destination', 'type': 'STRING', 'mode': 'NULLABLE'},
      {'name': 'row', 'type': 'STRING', 'mode': 'NULLABLE'},
      {'name': 'error_message', 'type': 'STRING', 'mode': 'NULLABLE'}]}

  with Pipeline() as p:
    result = (p
        | 'Create Columns' >> beam.Create([
            {'column': 'value'},
            {'bad_column': 'bad_value'}
        ])
        | 'Write Data' >> WriteToBigQuery(
            method=WriteToBigQuery.Method.STREAMING_INSERTS,
            table=my_table,
            schema=schema,
            insert_retry_strategy=RetryStrategy.RETRY_NEVER
        ))

    _ = (result.failed_rows_with_errors
        | 'Get Errors' >> beam.Map(lambda e: {
            "destination": e[0],
            "row": json.dumps(e[1]),
            "error_message": e[2][0]['message']
        })
        | 'Write Errors' >> WriteToBigQuery(
            method=WriteToBigQuery.Method.STREAMING_INSERTS,
            table=error_log_table,
            schema=error_schema,
        ))

Often, the simplest use case is to chain an operation after writing data to
BigQuery. To do this, one can chain the operation after one of the output
PCollections. A generic way to do this, independent of the write method,
looks like the following::

  def chain_after(result):
    try:
      # This works for FILE_LOADS, where we run load and possibly copy jobs.
      return (result.load_jobid_pairs, result.copy_jobid_pairs) | beam.Flatten()
    except AttributeError:
      # Works for STREAMING_INSERTS, where we return the rows BigQuery rejected.
      return result.failed_rows

  result = (pcoll | WriteToBigQuery(...))

  _ = (chain_after(result)
       | beam.Reshuffle()  # Force a 'commit' of the intermediate data.
       | MyOperationAfterWriteToBQ())

Attributes can be accessed using dot notation or bracket notation:
```
result.failed_rows                  <--> result['FailedRows']
result.failed_rows_with_errors      <--> result['FailedRowsWithErrors']
result.destination_load_jobid_pairs <--> result['destination_load_jobid_pairs']
result.destination_file_pairs       <--> result['destination_file_pairs']
result.destination_copy_jobid_pairs <--> result['destination_copy_jobid_pairs']
```

Writing with Storage Write API using Cross Language
---------------------------------------------------
This sink is able to write with BigQuery's Storage Write API. To do so, specify
the method `WriteToBigQuery.Method.STORAGE_WRITE_API`. This will use the
StorageWriteToBigQuery() transform to discover and use the Java implementation.
Using this transform directly will require the use of beam.Row() elements.

Similar to streaming inserts, it returns two dead-letter queue PCollections:
one containing just the failed rows and the other containing failed rows and
errors. They can be accessed with `failed_rows` and `failed_rows_with_errors`,
respectively. See the examples above for how to do this.


*** Short introduction to BigQuery concepts ***
Tables have rows (TableRow) and each row has cells (TableCell).
A table has a schema (TableSchema), which in turn describes the schema of each
cell (TableFieldSchema). The terms field and cell are used interchangeably.

TableSchema: Describes the schema (types and order) for values in each row.
  Has one attribute, 'fields', which is a list of TableFieldSchema objects.

TableFieldSchema: Describes the schema (type, name) for one field.
  Has several attributes, including 'name' and 'type'. Common values for
  the type attribute are: 'STRING', 'INTEGER', 'FLOAT', 'BOOLEAN', 'NUMERIC',
  'GEOGRAPHY'.
  All possible values are described at:
  https://cloud.google.com/bigquery/docs/reference/standard-sql/data-types

TableRow: Holds all values in a table row. Has one attribute, 'f', which is a
  list of TableCell instances.

TableCell: Holds the value for one cell (or field). Has one attribute,
  'v', which is a JsonValue instance. This class is defined in the
  apitools.base.py.extra_types module.

As of Beam 2.7.0, the NUMERIC data type is supported. This data type supports
high-precision decimal numbers (precision of 38 digits, scale of 9 digits).
The GEOGRAPHY data type works with Well-Known Text (see
https://en.wikipedia.org/wiki/Well-known_text) format for reading and writing
to BigQuery.
BigQuery IO requires values of BYTES datatype to be encoded using base64
encoding when writing to BigQuery.
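For example, a row with a BYTES column could be prepared like this before
being passed to `WriteToBigQuery`. This is a minimal sketch; the table and
field names are placeholders::

  import base64

  rows = pipeline | beam.Create([{
      'name': 'example',
      # BYTES values are passed as base64-encoded strings.
      'payload': base64.b64encode(b'\x00\x01\x02').decode('ascii'),
  }])
  rows | beam.io.WriteToBigQuery(
      'my_project:my_dataset.my_table',
      schema='name:STRING,payload:BYTES')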
347 348 **Updates to the I/O connector code** 349 350 For any significant updates to this I/O connector, please consider involving 351 corresponding code reviewers mentioned in 352 https://github.com/apache/beam/blob/master/sdks/python/OWNERS 353 """ 354 355 # pytype: skip-file 356 357 import collections 358 import io 359 import itertools 360 import json 361 import logging 362 import random 363 import time 364 import uuid 365 import warnings 366 from dataclasses import dataclass 367 from typing import Dict 368 from typing import List 369 from typing import Optional 370 from typing import Tuple 371 from typing import Union 372 373 import fastavro 374 375 import apache_beam as beam 376 from apache_beam import coders 377 from apache_beam import pvalue 378 from apache_beam.internal.gcp.json_value import from_json_value 379 from apache_beam.internal.gcp.json_value import to_json_value 380 from apache_beam.io import range_trackers 381 from apache_beam.io.avroio import _create_avro_source as create_avro_source 382 from apache_beam.io.filesystems import CompressionTypes 383 from apache_beam.io.filesystems import FileSystems 384 from apache_beam.io.gcp import bigquery_schema_tools 385 from apache_beam.io.gcp import bigquery_tools 386 from apache_beam.io.gcp.bigquery_io_metadata import create_bigquery_io_metadata 387 from apache_beam.io.gcp.bigquery_read_internal import _BigQueryReadSplit 388 from apache_beam.io.gcp.bigquery_read_internal import _JsonToDictCoder 389 from apache_beam.io.gcp.bigquery_read_internal import _PassThroughThenCleanup 390 from apache_beam.io.gcp.bigquery_read_internal import _PassThroughThenCleanupTempDatasets 391 from apache_beam.io.gcp.bigquery_read_internal import bigquery_export_destination_uri 392 from apache_beam.io.gcp.bigquery_tools import RetryStrategy 393 from apache_beam.io.gcp.internal.clients import bigquery 394 from apache_beam.io.iobase import BoundedSource 395 from apache_beam.io.iobase import RangeTracker 396 from apache_beam.io.iobase import SDFBoundedSourceReader 397 from apache_beam.io.iobase import SourceBundle 398 from apache_beam.io.textio import _TextSource as TextSource 399 from apache_beam.metrics import Metrics 400 from apache_beam.options import value_provider as vp 401 from apache_beam.options.pipeline_options import DebugOptions 402 from apache_beam.options.pipeline_options import GoogleCloudOptions 403 from apache_beam.options.pipeline_options import StandardOptions 404 from apache_beam.options.value_provider import StaticValueProvider 405 from apache_beam.options.value_provider import ValueProvider 406 from apache_beam.options.value_provider import check_accessible 407 from apache_beam.pvalue import PCollection 408 from apache_beam.transforms import DoFn 409 from apache_beam.transforms import ParDo 410 from apache_beam.transforms import PTransform 411 from apache_beam.transforms.display import DisplayDataItem 412 from apache_beam.transforms.external import BeamJarExpansionService 413 from apache_beam.transforms.external import SchemaAwareExternalTransform 414 from apache_beam.transforms.sideinputs import SIDE_INPUT_PREFIX 415 from apache_beam.transforms.sideinputs import get_sideinput_index 416 from apache_beam.transforms.util import ReshufflePerKey 417 from apache_beam.transforms.window import GlobalWindows 418 from apache_beam.typehints.row_type import RowTypeConstraint 419 from apache_beam.utils import retry 420 from apache_beam.utils.annotations import deprecated 421 422 try: 423 from apache_beam.io.gcp.internal.clients.bigquery import 
DatasetReference 424 from apache_beam.io.gcp.internal.clients.bigquery import TableReference 425 from apache_beam.io.gcp.internal.clients.bigquery import JobReference 426 except ImportError: 427 DatasetReference = None 428 TableReference = None 429 JobReference = None 430 431 _LOGGER = logging.getLogger(__name__) 432 433 try: 434 import google.cloud.bigquery_storage_v1 as bq_storage 435 except ImportError: 436 _LOGGER.info( 437 'No module named google.cloud.bigquery_storage_v1. ' 438 'As a result, the ReadFromBigQuery transform *CANNOT* be ' 439 'used with `method=DIRECT_READ`.') 440 441 __all__ = [ 442 'TableRowJsonCoder', 443 'BigQueryDisposition', 444 'BigQuerySource', 445 'BigQuerySink', 446 'BigQueryQueryPriority', 447 'WriteToBigQuery', 448 'WriteResult', 449 'ReadFromBigQuery', 450 'ReadFromBigQueryRequest', 451 'ReadAllFromBigQuery', 452 'SCHEMA_AUTODETECT', 453 ] 454 """ 455 Template for BigQuery jobs created by BigQueryIO. This template is: 456 `"beam_bq_job_{job_type}_{job_id}_{step_id}_{random}"`, where: 457 458 - `job_type` represents the BigQuery job type (e.g. extract / copy / load / 459 query). 460 - `job_id` is the Beam job name. 461 - `step_id` is a UUID representing the Dataflow step that created the 462 BQ job. 463 - `random` is a random string. 464 465 NOTE: This job name template does not have backwards compatibility guarantees. 466 """ 467 BQ_JOB_NAME_TEMPLATE = "beam_bq_job_{job_type}_{job_id}_{step_id}{random}" 468 """ 469 The maximum number of times that a bundle of rows that errors out should be 470 sent for insertion into BigQuery. 471 472 The default is 10,000 with exponential backoffs, so a bundle of rows may be 473 tried for a very long time. You may reduce this property to reduce the number 474 of retries. 475 """ 476 MAX_INSERT_RETRIES = 10000 477 478 479 @deprecated(since='2.11.0', current="bigquery_tools.parse_table_reference") 480 def _parse_table_reference(table, dataset=None, project=None): 481 return bigquery_tools.parse_table_reference(table, dataset, project) 482 483 484 @deprecated( 485 since='2.11.0', current="bigquery_tools.parse_table_schema_from_json") 486 def parse_table_schema_from_json(schema_string): 487 return bigquery_tools.parse_table_schema_from_json(schema_string) 488 489 490 @deprecated(since='2.11.0', current="bigquery_tools.default_encoder") 491 def default_encoder(obj): 492 return bigquery_tools.default_encoder(obj) 493 494 495 @deprecated(since='2.11.0', current="bigquery_tools.RowAsDictJsonCoder") 496 def RowAsDictJsonCoder(*args, **kwargs): 497 return bigquery_tools.RowAsDictJsonCoder(*args, **kwargs) 498 499 500 @deprecated(since='2.11.0', current="bigquery_tools.BigQueryWrapper") 501 def BigQueryWrapper(*args, **kwargs): 502 return bigquery_tools.BigQueryWrapper(*args, **kwargs) 503 504 505 class TableRowJsonCoder(coders.Coder): 506 """A coder for a TableRow instance to/from a JSON string. 507 508 Note that the encoding operation (used when writing to sinks) requires the 509 table schema in order to obtain the ordered list of field names. Reading from 510 sources on the other hand does not need the table schema. 511 """ 512 def __init__(self, table_schema=None): 513 # The table schema is needed for encoding TableRows as JSON (writing to 514 # sinks) because the ordered list of field names is used in the JSON 515 # representation. 516 self.table_schema = table_schema 517 # Precompute field names since we need them for row encoding. 
518 if self.table_schema: 519 self.field_names = tuple(fs.name for fs in self.table_schema.fields) 520 self.field_types = tuple(fs.type for fs in self.table_schema.fields) 521 522 def encode(self, table_row): 523 if self.table_schema is None: 524 raise AttributeError( 525 'The TableRowJsonCoder requires a table schema for ' 526 'encoding operations. Please specify a table_schema argument.') 527 try: 528 return json.dumps( 529 collections.OrderedDict( 530 zip( 531 self.field_names, 532 [from_json_value(f.v) for f in table_row.f])), 533 allow_nan=False, 534 default=bigquery_tools.default_encoder) 535 except ValueError as e: 536 raise ValueError('%s. %s' % (e, bigquery_tools.JSON_COMPLIANCE_ERROR)) 537 538 def decode(self, encoded_table_row): 539 od = json.loads( 540 encoded_table_row, object_pairs_hook=collections.OrderedDict) 541 return bigquery.TableRow( 542 f=[bigquery.TableCell(v=to_json_value(e)) for e in od.values()]) 543 544 545 class BigQueryDisposition(object): 546 """Class holding standard strings used for create and write dispositions.""" 547 548 CREATE_NEVER = 'CREATE_NEVER' 549 CREATE_IF_NEEDED = 'CREATE_IF_NEEDED' 550 WRITE_TRUNCATE = 'WRITE_TRUNCATE' 551 WRITE_APPEND = 'WRITE_APPEND' 552 WRITE_EMPTY = 'WRITE_EMPTY' 553 554 @staticmethod 555 def validate_create(disposition): 556 values = ( 557 BigQueryDisposition.CREATE_NEVER, BigQueryDisposition.CREATE_IF_NEEDED) 558 if disposition not in values: 559 raise ValueError( 560 'Invalid create disposition %s. Expecting %s' % (disposition, values)) 561 return disposition 562 563 @staticmethod 564 def validate_write(disposition): 565 values = ( 566 BigQueryDisposition.WRITE_TRUNCATE, 567 BigQueryDisposition.WRITE_APPEND, 568 BigQueryDisposition.WRITE_EMPTY) 569 if disposition not in values: 570 raise ValueError( 571 'Invalid write disposition %s. Expecting %s' % (disposition, values)) 572 return disposition 573 574 575 class BigQueryQueryPriority(object): 576 """Class holding standard strings used for query priority.""" 577 578 INTERACTIVE = 'INTERACTIVE' 579 BATCH = 'BATCH' 580 581 582 # ----------------------------------------------------------------------------- 583 # BigQuerySource, BigQuerySink. 584 585 586 @deprecated(since='2.25.0', current="ReadFromBigQuery") 587 def BigQuerySource( 588 table=None, 589 dataset=None, 590 project=None, 591 query=None, 592 validate=False, 593 coder=None, 594 use_standard_sql=False, 595 flatten_results=True, 596 kms_key=None, 597 use_dataflow_native_source=False): 598 if use_dataflow_native_source: 599 warnings.warn( 600 "Native sources no longer implemented; " 601 "falling back to standard Beam source.") 602 return ReadFromBigQuery( 603 table=table, 604 dataset=dataset, 605 project=project, 606 query=query, 607 validate=validate, 608 coder=coder, 609 use_standard_sql=use_standard_sql, 610 flatten_results=flatten_results, 611 use_json_exports=True, 612 kms_key=kms_key) 613 614 615 @deprecated(since='2.25.0', current="ReadFromBigQuery") 616 def _BigQuerySource(*args, **kwargs): 617 """A source based on a BigQuery table.""" 618 warnings.warn( 619 "Native sources no longer implemented; " 620 "falling back to standard Beam source.") 621 return ReadFromBigQuery(*args, **kwargs) 622 623 624 # TODO(https://github.com/apache/beam/issues/21622): remove the serialization 625 # restriction in transform implementation once InteractiveRunner can work 626 # without runner api roundtrips. 
627 @dataclass 628 class _BigQueryExportResult: 629 coder: beam.coders.Coder 630 paths: List[str] 631 632 633 class _CustomBigQuerySource(BoundedSource): 634 def __init__( 635 self, 636 method, 637 gcs_location=None, 638 table=None, 639 dataset=None, 640 project=None, 641 query=None, 642 validate=False, 643 pipeline_options=None, 644 coder=None, 645 use_standard_sql=False, 646 flatten_results=True, 647 kms_key=None, 648 bigquery_job_labels=None, 649 use_json_exports=False, 650 job_name=None, 651 step_name=None, 652 unique_id=None, 653 temp_dataset=None, 654 query_priority=BigQueryQueryPriority.BATCH): 655 if table is not None and query is not None: 656 raise ValueError( 657 'Both a BigQuery table and a query were specified.' 658 ' Please specify only one of these.') 659 elif table is None and query is None: 660 raise ValueError('A BigQuery table or a query must be specified') 661 elif table is not None: 662 self.table_reference = bigquery_tools.parse_table_reference( 663 table, dataset, project) 664 self.query = None 665 self.use_legacy_sql = True 666 else: 667 if isinstance(query, str): 668 query = StaticValueProvider(str, query) 669 self.query = query 670 # TODO(BEAM-1082): Change the internal flag to be standard_sql 671 self.use_legacy_sql = not use_standard_sql 672 self.table_reference = None 673 674 self.method = method 675 self.gcs_location = gcs_location 676 self.project = project 677 self.validate = validate 678 self.flatten_results = flatten_results 679 self.coder = coder or _JsonToDictCoder 680 self.kms_key = kms_key 681 self.export_result = None 682 self.options = pipeline_options 683 self.bq_io_metadata = None # Populate in setup, as it may make an RPC 684 self.bigquery_job_labels = bigquery_job_labels or {} 685 self.use_json_exports = use_json_exports 686 self.temp_dataset = temp_dataset 687 self.query_priority = query_priority 688 self._job_name = job_name or 'BQ_EXPORT_JOB' 689 self._step_name = step_name 690 self._source_uuid = unique_id 691 692 def _get_bq_metadata(self): 693 if not self.bq_io_metadata: 694 self.bq_io_metadata = create_bigquery_io_metadata(self._step_name) 695 return self.bq_io_metadata 696 697 def display_data(self): 698 export_format = 'JSON' if self.use_json_exports else 'AVRO' 699 return { 700 'method': str(self.method), 701 'table': str(self.table_reference), 702 'query': str(self.query), 703 'project': str(self.project), 704 'use_legacy_sql': self.use_legacy_sql, 705 'bigquery_job_labels': json.dumps(self.bigquery_job_labels), 706 'export_file_format': export_format, 707 'launchesBigQueryJobs': DisplayDataItem( 708 True, label="This Dataflow job launches bigquery jobs."), 709 } 710 711 def estimate_size(self): 712 bq = bigquery_tools.BigQueryWrapper() 713 if self.table_reference is not None: 714 table_ref = self.table_reference 715 if (isinstance(self.table_reference, vp.ValueProvider) and 716 self.table_reference.is_accessible()): 717 table_ref = bigquery_tools.parse_table_reference( 718 self.table_reference.get(), project=self._get_project()) 719 elif isinstance(self.table_reference, vp.ValueProvider): 720 # Size estimation is best effort. We return None as we have 721 # no access to the table that we're querying. 
722 return None 723 if not table_ref.projectId: 724 table_ref.projectId = self._get_project() 725 table = bq.get_table( 726 table_ref.projectId, table_ref.datasetId, table_ref.tableId) 727 return int(table.numBytes) 728 elif self.query is not None and self.query.is_accessible(): 729 project = self._get_project() 730 query_job_name = bigquery_tools.generate_bq_job_name( 731 self._job_name, 732 self._source_uuid, 733 bigquery_tools.BigQueryJobTypes.QUERY, 734 '%s_%s' % (int(time.time()), random.randint(0, 1000))) 735 job = bq._start_query_job( 736 project, 737 self.query.get(), 738 self.use_legacy_sql, 739 self.flatten_results, 740 job_id=query_job_name, 741 priority=self.query_priority, 742 dry_run=True, 743 kms_key=self.kms_key, 744 job_labels=self._get_bq_metadata().add_additional_bq_job_labels( 745 self.bigquery_job_labels)) 746 size = int(job.statistics.totalBytesProcessed) 747 return size 748 else: 749 # Size estimation is best effort. We return None as we have 750 # no access to the query that we're running. 751 return None 752 753 def _get_project(self): 754 """Returns the project that queries and exports will be billed to.""" 755 756 project = self.options.view_as(GoogleCloudOptions).project 757 if isinstance(project, vp.ValueProvider): 758 project = project.get() 759 if self.temp_dataset: 760 return self.temp_dataset.projectId 761 if not project: 762 project = self.project 763 return project 764 765 def _create_source(self, path, coder): 766 if not self.use_json_exports: 767 return create_avro_source(path) 768 else: 769 return TextSource( 770 path, 771 min_bundle_size=0, 772 compression_type=CompressionTypes.UNCOMPRESSED, 773 strip_trailing_newlines=True, 774 coder=coder) 775 776 def split(self, desired_bundle_size, start_position=None, stop_position=None): 777 if self.export_result is None: 778 bq = bigquery_tools.BigQueryWrapper( 779 temp_dataset_id=( 780 self.temp_dataset.datasetId if self.temp_dataset else None)) 781 782 if self.query is not None: 783 self._setup_temporary_dataset(bq) 784 self.table_reference = self._execute_query(bq) 785 786 if isinstance(self.table_reference, vp.ValueProvider): 787 self.table_reference = bigquery_tools.parse_table_reference( 788 self.table_reference.get(), project=self._get_project()) 789 elif not self.table_reference.projectId: 790 self.table_reference.projectId = self._get_project() 791 792 schema, metadata_list = self._export_files(bq) 793 self.export_result = _BigQueryExportResult( 794 coder=self.coder(schema), 795 paths=[metadata.path for metadata in metadata_list]) 796 797 if self.query is not None: 798 bq.clean_up_temporary_dataset(self._get_project()) 799 800 for path in self.export_result.paths: 801 source = self._create_source(path, self.export_result.coder) 802 yield SourceBundle( 803 weight=1.0, source=source, start_position=None, stop_position=None) 804 805 def get_range_tracker(self, start_position, stop_position): 806 class CustomBigQuerySourceRangeTracker(RangeTracker): 807 """A RangeTracker that always returns positions as None.""" 808 def start_position(self): 809 return None 810 811 def stop_position(self): 812 return None 813 814 return CustomBigQuerySourceRangeTracker() 815 816 def read(self, range_tracker): 817 raise NotImplementedError('BigQuery source must be split before being read') 818 819 @check_accessible(['query']) 820 def _setup_temporary_dataset(self, bq): 821 if self.temp_dataset: 822 # Temp dataset was provided by the user so we can just return. 
823 return 824 location = bq.get_query_location( 825 self._get_project(), self.query.get(), self.use_legacy_sql) 826 bq.create_temporary_dataset(self._get_project(), location) 827 828 @check_accessible(['query']) 829 def _execute_query(self, bq): 830 query_job_name = bigquery_tools.generate_bq_job_name( 831 self._job_name, 832 self._source_uuid, 833 bigquery_tools.BigQueryJobTypes.QUERY, 834 '%s_%s' % (int(time.time()), random.randint(0, 1000))) 835 job = bq._start_query_job( 836 self._get_project(), 837 self.query.get(), 838 self.use_legacy_sql, 839 self.flatten_results, 840 job_id=query_job_name, 841 priority=self.query_priority, 842 kms_key=self.kms_key, 843 job_labels=self._get_bq_metadata().add_additional_bq_job_labels( 844 self.bigquery_job_labels)) 845 job_ref = job.jobReference 846 bq.wait_for_bq_job(job_ref, max_retries=0) 847 return bq._get_temp_table(self._get_project()) 848 849 def _export_files(self, bq): 850 """Runs a BigQuery export job. 851 852 Returns: 853 bigquery.TableSchema instance, a list of FileMetadata instances 854 """ 855 job_labels = self._get_bq_metadata().add_additional_bq_job_labels( 856 self.bigquery_job_labels) 857 export_job_name = bigquery_tools.generate_bq_job_name( 858 self._job_name, 859 self._source_uuid, 860 bigquery_tools.BigQueryJobTypes.EXPORT, 861 '%s_%s' % (int(time.time()), random.randint(0, 1000))) 862 temp_location = self.options.view_as(GoogleCloudOptions).temp_location 863 gcs_location = bigquery_export_destination_uri( 864 self.gcs_location, temp_location, self._source_uuid) 865 try: 866 if self.use_json_exports: 867 job_ref = bq.perform_extract_job([gcs_location], 868 export_job_name, 869 self.table_reference, 870 bigquery_tools.FileFormat.JSON, 871 project=self._get_project(), 872 job_labels=job_labels, 873 include_header=False) 874 else: 875 job_ref = bq.perform_extract_job([gcs_location], 876 export_job_name, 877 self.table_reference, 878 bigquery_tools.FileFormat.AVRO, 879 project=self._get_project(), 880 include_header=False, 881 job_labels=job_labels, 882 use_avro_logical_types=True) 883 bq.wait_for_bq_job(job_ref) 884 except Exception as exn: # pylint: disable=broad-except 885 # The error messages thrown in this case are generic and misleading, 886 # so leave this breadcrumb in case it's the root cause. 887 logging.warning( 888 "Error exporting table: %s. " 889 "Note that external tables cannot be exported: " 890 "https://cloud.google.com/bigquery/docs/external-tables" 891 "#external_table_limitations", 892 exn) 893 raise 894 metadata_list = FileSystems.match([gcs_location])[0].metadata_list 895 896 if isinstance(self.table_reference, vp.ValueProvider): 897 table_ref = bigquery_tools.parse_table_reference( 898 self.table_reference.get(), project=self.project) 899 else: 900 table_ref = self.table_reference 901 table = bq.get_table( 902 table_ref.projectId, table_ref.datasetId, table_ref.tableId) 903 904 return table.schema, metadata_list 905 906 907 class _CustomBigQueryStorageSource(BoundedSource): 908 """A base class for BoundedSource implementations which read from BigQuery 909 using the BigQuery Storage API. 910 Args: 911 table (str, TableReference): The ID of the table. If **dataset** argument is 912 :data:`None` then the table argument must contain the entire table 913 reference specified as: ``'PROJECT:DATASET.TABLE'`` or must specify a 914 TableReference. 915 dataset (str): Optional ID of the dataset containing this table or 916 :data:`None` if the table argument specifies a TableReference. 
917 project (str): Optional ID of the project containing this table or 918 :data:`None` if the table argument specifies a TableReference. 919 selected_fields (List[str]): Optional List of names of the fields in the 920 table that should be read. If empty, all fields will be read. If the 921 specified field is a nested field, all the sub-fields in the field will be 922 selected. The output field order is unrelated to the order of fields in 923 selected_fields. 924 row_restriction (str): Optional SQL text filtering statement, similar to a 925 WHERE clause in a query. Aggregates are not supported. Restricted to a 926 maximum length for 1 MB. 927 use_native_datetime (bool): If :data:`True`, BigQuery DATETIME fields will 928 be returned as native Python datetime objects. If :data:`False`, 929 DATETIME fields will be returned as formatted strings (for example: 930 2021-01-01T12:59:59). The default is :data:`False`. 931 """ 932 933 # The maximum number of streams which will be requested when creating a read 934 # session, regardless of the desired bundle size. 935 MAX_SPLIT_COUNT = 10000 936 # The minimum number of streams which will be requested when creating a read 937 # session, regardless of the desired bundle size. Note that the server may 938 # still choose to return fewer than ten streams based on the layout of the 939 # table. 940 MIN_SPLIT_COUNT = 10 941 942 def __init__( 943 self, 944 method: str, 945 query_priority: [BigQueryQueryPriority] = BigQueryQueryPriority.BATCH, 946 table: Optional[Union[str, TableReference]] = None, 947 dataset: Optional[str] = None, 948 project: Optional[str] = None, 949 query: Optional[str] = None, 950 selected_fields: Optional[List[str]] = None, 951 row_restriction: Optional[str] = None, 952 pipeline_options: Optional[GoogleCloudOptions] = None, 953 unique_id: Optional[uuid.UUID] = None, 954 bigquery_job_labels: Optional[Dict] = None, 955 bigquery_dataset_labels: Optional[Dict] = None, 956 job_name: Optional[str] = None, 957 step_name: Optional[str] = None, 958 use_standard_sql: Optional[bool] = False, 959 flatten_results: Optional[bool] = True, 960 kms_key: Optional[str] = None, 961 temp_dataset: Optional[DatasetReference] = None, 962 temp_table: Optional[TableReference] = None, 963 use_native_datetime: Optional[bool] = False): 964 965 if table is not None and query is not None: 966 raise ValueError( 967 'Both a BigQuery table and a query were specified.' 
968 ' Please specify only one of these.') 969 elif table is None and query is None: 970 raise ValueError('A BigQuery table or a query must be specified') 971 elif table is not None: 972 self.table_reference = bigquery_tools.parse_table_reference( 973 table, dataset, project) 974 self.query = None 975 self.use_legacy_sql = True 976 else: 977 if isinstance(query, str): 978 query = StaticValueProvider(str, query) 979 self.query = query 980 # TODO(BEAM-1082): Change the internal flag to be standard_sql 981 self.use_legacy_sql = not use_standard_sql 982 self.table_reference = None 983 984 self.method = method 985 self.project = project 986 self.selected_fields = selected_fields 987 self.row_restriction = row_restriction 988 self.pipeline_options = pipeline_options 989 self.split_result = None 990 self.bigquery_job_labels = bigquery_job_labels or {} 991 self.bigquery_dataset_labels = bigquery_dataset_labels or {} 992 self.bq_io_metadata = None # Populate in setup, as it may make an RPC 993 self.flatten_results = flatten_results 994 self.kms_key = kms_key 995 self.temp_table = temp_table 996 self.query_priority = query_priority 997 self.use_native_datetime = use_native_datetime 998 self._job_name = job_name or 'BQ_DIRECT_READ_JOB' 999 self._step_name = step_name 1000 self._source_uuid = unique_id 1001 1002 def _get_parent_project(self): 1003 """Returns the project that will be billed.""" 1004 if self.temp_table: 1005 return self.temp_table.projectId 1006 1007 project = self.pipeline_options.view_as(GoogleCloudOptions).project 1008 if isinstance(project, vp.ValueProvider): 1009 project = project.get() 1010 if not project: 1011 project = self.project 1012 return project 1013 1014 def _get_table_size(self, bq, table_reference): 1015 project = ( 1016 table_reference.projectId 1017 if table_reference.projectId else self._get_parent_project()) 1018 table = bq.get_table( 1019 project, table_reference.datasetId, table_reference.tableId) 1020 return table.numBytes 1021 1022 def _get_bq_metadata(self): 1023 if not self.bq_io_metadata: 1024 self.bq_io_metadata = create_bigquery_io_metadata(self._step_name) 1025 return self.bq_io_metadata 1026 1027 @check_accessible(['query']) 1028 def _setup_temporary_dataset(self, bq): 1029 if self.temp_table: 1030 # Temp dataset was provided by the user so we can just return. 
1031 return 1032 location = bq.get_query_location( 1033 self._get_parent_project(), self.query.get(), self.use_legacy_sql) 1034 _LOGGER.warning("### Labels: %s", str(self.bigquery_dataset_labels)) 1035 bq.create_temporary_dataset( 1036 self._get_parent_project(), location, self.bigquery_dataset_labels) 1037 1038 @check_accessible(['query']) 1039 def _execute_query(self, bq): 1040 query_job_name = bigquery_tools.generate_bq_job_name( 1041 self._job_name, 1042 self._source_uuid, 1043 bigquery_tools.BigQueryJobTypes.QUERY, 1044 '%s_%s' % (int(time.time()), random.randint(0, 1000))) 1045 job = bq._start_query_job( 1046 self._get_parent_project(), 1047 self.query.get(), 1048 self.use_legacy_sql, 1049 self.flatten_results, 1050 job_id=query_job_name, 1051 priority=self.query_priority, 1052 kms_key=self.kms_key, 1053 job_labels=self._get_bq_metadata().add_additional_bq_job_labels( 1054 self.bigquery_job_labels)) 1055 job_ref = job.jobReference 1056 bq.wait_for_bq_job(job_ref, max_retries=0) 1057 table_reference = bq._get_temp_table(self._get_parent_project()) 1058 return table_reference 1059 1060 def display_data(self): 1061 return { 1062 'method': self.method, 1063 'output_format': 'ARROW' if self.use_native_datetime else 'AVRO', 1064 'project': str(self.project), 1065 'table_reference': str(self.table_reference), 1066 'query': str(self.query), 1067 'use_legacy_sql': self.use_legacy_sql, 1068 'use_native_datetime': self.use_native_datetime, 1069 'selected_fields': str(self.selected_fields), 1070 'row_restriction': str(self.row_restriction), 1071 'launchesBigQueryJobs': DisplayDataItem( 1072 True, label="This Dataflow job launches bigquery jobs."), 1073 } 1074 1075 def estimate_size(self): 1076 # Returns the pre-filtering size of the (temporary) table being read. 1077 bq = bigquery_tools.BigQueryWrapper() 1078 if self.table_reference is not None: 1079 return self._get_table_size(bq, self.table_reference) 1080 elif self.query is not None and self.query.is_accessible(): 1081 query_job_name = bigquery_tools.generate_bq_job_name( 1082 self._job_name, 1083 self._source_uuid, 1084 bigquery_tools.BigQueryJobTypes.QUERY, 1085 '%s_%s' % (int(time.time()), random.randint(0, 1000))) 1086 job = bq._start_query_job( 1087 self._get_parent_project(), 1088 self.query.get(), 1089 self.use_legacy_sql, 1090 self.flatten_results, 1091 job_id=query_job_name, 1092 priority=self.query_priority, 1093 dry_run=True, 1094 kms_key=self.kms_key, 1095 job_labels=self._get_bq_metadata().add_additional_bq_job_labels( 1096 self.bigquery_job_labels)) 1097 size = int(job.statistics.totalBytesProcessed) 1098 return size 1099 else: 1100 # Size estimation is best effort. We return None as we have 1101 # no access to the query that we're running. 
1102 return None 1103 1104 def split(self, desired_bundle_size, start_position=None, stop_position=None): 1105 if self.split_result is None: 1106 bq = bigquery_tools.BigQueryWrapper( 1107 temp_table_ref=(self.temp_table if self.temp_table else None)) 1108 1109 if self.query is not None: 1110 self._setup_temporary_dataset(bq) 1111 self.table_reference = self._execute_query(bq) 1112 1113 requested_session = bq_storage.types.ReadSession() 1114 requested_session.table = 'projects/{}/datasets/{}/tables/{}'.format( 1115 self.table_reference.projectId, 1116 self.table_reference.datasetId, 1117 self.table_reference.tableId) 1118 1119 if self.use_native_datetime: 1120 requested_session.data_format = bq_storage.types.DataFormat.ARROW 1121 requested_session.read_options\ 1122 .arrow_serialization_options.buffer_compression = \ 1123 bq_storage.types.ArrowSerializationOptions.CompressionCodec.LZ4_FRAME 1124 else: 1125 requested_session.data_format = bq_storage.types.DataFormat.AVRO 1126 1127 if self.selected_fields is not None: 1128 requested_session.read_options.selected_fields = self.selected_fields 1129 if self.row_restriction is not None: 1130 requested_session.read_options.row_restriction = self.row_restriction 1131 1132 storage_client = bq_storage.BigQueryReadClient() 1133 stream_count = 0 1134 if desired_bundle_size > 0: 1135 table_size = self._get_table_size(bq, self.table_reference) 1136 stream_count = min( 1137 int(table_size / desired_bundle_size), 1138 _CustomBigQueryStorageSource.MAX_SPLIT_COUNT) 1139 stream_count = max( 1140 stream_count, _CustomBigQueryStorageSource.MIN_SPLIT_COUNT) 1141 1142 parent = 'projects/{}'.format(self.table_reference.projectId) 1143 read_session = storage_client.create_read_session( 1144 parent=parent, 1145 read_session=requested_session, 1146 max_stream_count=stream_count) 1147 _LOGGER.info( 1148 'Sent BigQuery Storage API CreateReadSession request: \n %s \n' 1149 'Received response \n %s.', 1150 requested_session, 1151 read_session) 1152 1153 self.split_result = [ 1154 _CustomBigQueryStorageStreamSource( 1155 stream.name, self.use_native_datetime) 1156 for stream in read_session.streams 1157 ] 1158 1159 for source in self.split_result: 1160 yield SourceBundle( 1161 weight=1.0, source=source, start_position=None, stop_position=None) 1162 1163 def get_range_tracker(self, start_position, stop_position): 1164 class NonePositionRangeTracker(RangeTracker): 1165 """A RangeTracker that always returns positions as None. Prevents the 1166 BigQuery Storage source from being read() before being split().""" 1167 def start_position(self): 1168 return None 1169 1170 def stop_position(self): 1171 return None 1172 1173 return NonePositionRangeTracker() 1174 1175 def read(self, range_tracker): 1176 raise NotImplementedError( 1177 'BigQuery storage source must be split before being read') 1178 1179 1180 class _CustomBigQueryStorageStreamSource(BoundedSource): 1181 """A source representing a single stream in a read session.""" 1182 def __init__( 1183 self, read_stream_name: str, use_native_datetime: Optional[bool] = True): 1184 self.read_stream_name = read_stream_name 1185 self.use_native_datetime = use_native_datetime 1186 1187 def display_data(self): 1188 return { 1189 'output_format': 'ARROW' if self.use_native_datetime else 'AVRO', 1190 'read_stream': str(self.read_stream_name), 1191 'use_native_datetime': str(self.use_native_datetime) 1192 } 1193 1194 def estimate_size(self): 1195 # The size of stream source cannot be estimate due to server-side liquid 1196 # sharding. 
1197 # TODO(https://github.com/apache/beam/issues/21126): Implement progress 1198 # reporting. 1199 return None 1200 1201 def split(self, desired_bundle_size, start_position=None, stop_position=None): 1202 # A stream source can't be split without reading from it due to 1203 # server-side liquid sharding. A split will simply return the current source 1204 # for now. 1205 return SourceBundle( 1206 weight=1.0, 1207 source=_CustomBigQueryStorageStreamSource( 1208 self.read_stream_name, self.use_native_datetime), 1209 start_position=None, 1210 stop_position=None) 1211 1212 def get_range_tracker(self, start_position, stop_position): 1213 # TODO(https://github.com/apache/beam/issues/21127): Implement dynamic work 1214 # rebalancing. 1215 assert start_position is None 1216 # Defaulting to the start of the stream. 1217 start_position = 0 1218 # Since the streams are unsplittable we choose OFFSET_INFINITY as the 1219 # default end offset so that all data of the source gets read. 1220 stop_position = range_trackers.OffsetRangeTracker.OFFSET_INFINITY 1221 range_tracker = range_trackers.OffsetRangeTracker( 1222 start_position, stop_position) 1223 # Ensuring that all try_split() calls will be ignored by the Rangetracker. 1224 range_tracker = range_trackers.UnsplittableRangeTracker(range_tracker) 1225 1226 return range_tracker 1227 1228 def read(self, range_tracker): 1229 _LOGGER.info( 1230 "Started BigQuery Storage API read from stream %s.", 1231 self.read_stream_name) 1232 if self.use_native_datetime: 1233 return self.read_arrow() 1234 else: 1235 return self.read_avro() 1236 1237 def read_arrow(self): 1238 storage_client = bq_storage.BigQueryReadClient() 1239 row_iter = iter(storage_client.read_rows(self.read_stream_name).rows()) 1240 row = next(row_iter, None) 1241 # Handling the case where the user might provide very selective filters 1242 # which can result in read_rows_response being empty. 1243 if row is None: 1244 return iter([]) 1245 1246 while row is not None: 1247 py_row = dict(map(lambda item: (item[0], item[1].as_py()), row.items())) 1248 row = next(row_iter, None) 1249 yield py_row 1250 1251 def read_avro(self): 1252 storage_client = bq_storage.BigQueryReadClient() 1253 read_rows_iterator = iter(storage_client.read_rows(self.read_stream_name)) 1254 # Handling the case where the user might provide very selective filters 1255 # which can result in read_rows_response being empty. 
1256 first_read_rows_response = next(read_rows_iterator, None) 1257 if first_read_rows_response is None: 1258 return iter([]) 1259 1260 row_reader = _ReadReadRowsResponsesWithFastAvro( 1261 read_rows_iterator, first_read_rows_response) 1262 return iter(row_reader) 1263 1264 1265 class _ReadReadRowsResponsesWithFastAvro(): 1266 """An iterator that deserializes ReadRowsResponses using the fastavro 1267 library.""" 1268 def __init__(self, read_rows_iterator, read_rows_response): 1269 self.read_rows_iterator = read_rows_iterator 1270 self.read_rows_response = read_rows_response 1271 self.avro_schema = fastavro.parse_schema( 1272 json.loads(self.read_rows_response.avro_schema.schema)) 1273 self.bytes_reader = io.BytesIO( 1274 self.read_rows_response.avro_rows.serialized_binary_rows) 1275 1276 def __iter__(self): 1277 return self 1278 1279 def __next__(self): 1280 try: 1281 return fastavro.schemaless_reader(self.bytes_reader, self.avro_schema) 1282 except StopIteration: 1283 self.read_rows_response = next(self.read_rows_iterator, None) 1284 if self.read_rows_response is not None: 1285 self.bytes_reader = io.BytesIO( 1286 self.read_rows_response.avro_rows.serialized_binary_rows) 1287 return fastavro.schemaless_reader(self.bytes_reader, self.avro_schema) 1288 else: 1289 raise StopIteration 1290 1291 1292 @deprecated(since='2.11.0', current="WriteToBigQuery") 1293 def BigQuerySink(*args, validate=False, **kwargs): 1294 """A deprecated alias for WriteToBigQuery.""" 1295 warnings.warn( 1296 "Native sinks no longer implemented; " 1297 "falling back to standard Beam sink.") 1298 return WriteToBigQuery(*args, validate=validate, **kwargs) 1299 1300 1301 _KNOWN_TABLES = set() 1302 1303 1304 class BigQueryWriteFn(DoFn): 1305 """A ``DoFn`` that streams writes to BigQuery once the table is created.""" 1306 1307 DEFAULT_MAX_BUFFERED_ROWS = 2000 1308 DEFAULT_MAX_BATCH_SIZE = 500 1309 1310 FAILED_ROWS = 'FailedRows' 1311 FAILED_ROWS_WITH_ERRORS = 'FailedRowsWithErrors' 1312 STREAMING_API_LOGGING_FREQUENCY_SEC = 300 1313 1314 def __init__( 1315 self, 1316 batch_size, 1317 schema=None, 1318 create_disposition=None, 1319 write_disposition=None, 1320 kms_key=None, 1321 test_client=None, 1322 max_buffered_rows=None, 1323 retry_strategy=None, 1324 additional_bq_parameters=None, 1325 ignore_insert_ids=False, 1326 with_batched_input=False, 1327 ignore_unknown_columns=False, 1328 max_retries=MAX_INSERT_RETRIES): 1329 """Initialize a WriteToBigQuery transform. 1330 1331 Args: 1332 batch_size: Number of rows to be written to BQ per streaming API insert. 1333 schema: The schema to be used if the BigQuery table to write has to be 1334 created. This can be either specified as a 'bigquery.TableSchema' object 1335 or a single string of the form 'field1:type1,field2:type2,field3:type3' 1336 that defines a comma separated list of fields. Here 'type' should 1337 specify the BigQuery type of the field. Single string based schemas do 1338 not support nested fields, repeated fields, or specifying a BigQuery 1339 mode for fields (mode will always be set to 'NULLABLE'). 1340 create_disposition: A string describing what happens if the table does not 1341 exist. Possible values are: 1342 - BigQueryDisposition.CREATE_IF_NEEDED: create if does not exist. 1343 - BigQueryDisposition.CREATE_NEVER: fail the write if does not exist. 1344 write_disposition: A string describing what happens if the table has 1345 already some data. Possible values are: 1346 - BigQueryDisposition.WRITE_TRUNCATE: delete existing rows. 
1347 - BigQueryDisposition.WRITE_APPEND: add to existing rows. 1348 - BigQueryDisposition.WRITE_EMPTY: fail the write if table not empty. 1349 For streaming pipelines WriteTruncate can not be used. 1350 kms_key: Optional Cloud KMS key name for use when creating new tables. 1351 test_client: Override the default bigquery client used for testing. 1352 1353 max_buffered_rows: The maximum number of rows that are allowed to stay 1354 buffered when running dynamic destinations. When destinations are 1355 dynamic, it is important to keep caches small even when a single 1356 batch has not been completely filled up. 1357 retry_strategy: The strategy to use when retrying streaming inserts 1358 into BigQuery. Options are shown in bigquery_tools.RetryStrategy attrs. 1359 additional_bq_parameters (dict, callable): A set of additional parameters 1360 to be passed when creating a BigQuery table. These are passed when 1361 triggering a load job for FILE_LOADS, and when creating a new table for 1362 STREAMING_INSERTS. 1363 ignore_insert_ids: When using the STREAMING_INSERTS method to write data 1364 to BigQuery, `insert_ids` are a feature of BigQuery that support 1365 deduplication of events. If your use case is not sensitive to 1366 duplication of data inserted to BigQuery, set `ignore_insert_ids` 1367 to True to increase the throughput for BQ writing. See: 1368 https://cloud.google.com/bigquery/streaming-data-into-bigquery#disabling_best_effort_de-duplication 1369 with_batched_input: Whether the input has already been batched per 1370 destination. If not, perform best-effort batching per destination within 1371 a bundle. 1372 ignore_unknown_columns: Accept rows that contain values that do not match 1373 the schema. The unknown values are ignored. Default is False, 1374 which treats unknown values as errors. See reference: 1375 https://cloud.google.com/bigquery/docs/reference/rest/v2/tabledata/insertAll 1376 max_retries: The number of times that we will retry inserting a group of 1377 rows into BigQuery. By default, we retry 10000 times with exponential 1378 backoffs (effectively retry forever). 
1379 1380 """ 1381 self.schema = schema 1382 self.test_client = test_client 1383 self.create_disposition = create_disposition 1384 self.write_disposition = write_disposition 1385 if write_disposition in (BigQueryDisposition.WRITE_EMPTY, 1386 BigQueryDisposition.WRITE_TRUNCATE): 1387 raise ValueError( 1388 'Write disposition %s is not supported for' 1389 ' streaming inserts to BigQuery' % write_disposition) 1390 self._rows_buffer = [] 1391 self._reset_rows_buffer() 1392 1393 self._total_buffered_rows = 0 1394 self.kms_key = kms_key 1395 self._max_batch_size = batch_size or BigQueryWriteFn.DEFAULT_MAX_BATCH_SIZE 1396 self._max_buffered_rows = ( 1397 max_buffered_rows or BigQueryWriteFn.DEFAULT_MAX_BUFFERED_ROWS) 1398 self._retry_strategy = retry_strategy or RetryStrategy.RETRY_ALWAYS 1399 self.ignore_insert_ids = ignore_insert_ids 1400 self.with_batched_input = with_batched_input 1401 1402 self.additional_bq_parameters = additional_bq_parameters or {} 1403 1404 # accumulate the total time spent in exponential backoff 1405 self._throttled_secs = Metrics.counter( 1406 BigQueryWriteFn, "cumulativeThrottlingSeconds") 1407 self.batch_size_metric = Metrics.distribution(self.__class__, "batch_size") 1408 self.batch_latency_metric = Metrics.distribution( 1409 self.__class__, "batch_latency_ms") 1410 self.failed_rows_metric = Metrics.distribution( 1411 self.__class__, "rows_failed_per_batch") 1412 self.bigquery_wrapper = None 1413 self.streaming_api_logging_frequency_sec = ( 1414 BigQueryWriteFn.STREAMING_API_LOGGING_FREQUENCY_SEC) 1415 self.ignore_unknown_columns = ignore_unknown_columns 1416 self._max_retries = max_retries 1417 1418 def display_data(self): 1419 return { 1420 'max_batch_size': self._max_batch_size, 1421 'max_buffered_rows': self._max_buffered_rows, 1422 'retry_strategy': self._retry_strategy, 1423 'create_disposition': str(self.create_disposition), 1424 'write_disposition': str(self.write_disposition), 1425 'additional_bq_parameters': str(self.additional_bq_parameters), 1426 'ignore_insert_ids': str(self.ignore_insert_ids), 1427 'ignore_unknown_columns': str(self.ignore_unknown_columns) 1428 } 1429 1430 def _reset_rows_buffer(self): 1431 self._rows_buffer = collections.defaultdict(lambda: []) 1432 1433 @staticmethod 1434 def get_table_schema(schema): 1435 """Transform the table schema into a bigquery.TableSchema instance. 1436 1437 Args: 1438 schema: The schema to be used if the BigQuery table to write has to be 1439 created. This is a dictionary object created in the WriteToBigQuery 1440 transform. 1441 Returns: 1442 table_schema: The schema to be used if the BigQuery table to write has 1443 to be created but in the bigquery.TableSchema format. 1444 """ 1445 if schema is None: 1446 return schema 1447 elif isinstance(schema, str): 1448 return bigquery_tools.parse_table_schema_from_json(schema) 1449 elif isinstance(schema, dict): 1450 return bigquery_tools.parse_table_schema_from_json(json.dumps(schema)) 1451 else: 1452 raise TypeError('Unexpected schema argument: %s.' % schema) 1453 1454 def start_bundle(self): 1455 self._reset_rows_buffer() 1456 1457 if not self.bigquery_wrapper: 1458 self.bigquery_wrapper = bigquery_tools.BigQueryWrapper( 1459 client=self.test_client) 1460 1461 ( 1462 bigquery_tools.BigQueryWrapper.HISTOGRAM_METRIC_LOGGER. 
1463 minimum_logging_frequency_msec 1464 ) = self.streaming_api_logging_frequency_sec * 1000 1465 1466 self._backoff_calculator = iter( 1467 retry.FuzzedExponentialIntervals( 1468 initial_delay_secs=0.2, 1469 num_retries=self._max_retries, 1470 max_delay_secs=1500)) 1471 1472 def _create_table_if_needed(self, table_reference, schema=None): 1473 str_table_reference = '%s:%s.%s' % ( 1474 table_reference.projectId, 1475 table_reference.datasetId, 1476 table_reference.tableId) 1477 if str_table_reference in _KNOWN_TABLES: 1478 return 1479 1480 if self.create_disposition == BigQueryDisposition.CREATE_NEVER: 1481 # If we never want to create the table, we assume it already exists, 1482 # and avoid the get-or-create step. 1483 return 1484 1485 _LOGGER.debug( 1486 'Creating or getting table %s with schema %s.', table_reference, schema) 1487 1488 table_schema = self.get_table_schema(schema) 1489 1490 if table_reference.projectId is None: 1491 table_reference.projectId = vp.RuntimeValueProvider.get_value( 1492 'project', str, '') 1493 self.bigquery_wrapper.get_or_create_table( 1494 table_reference.projectId, 1495 table_reference.datasetId, 1496 table_reference.tableId, 1497 table_schema, 1498 self.create_disposition, 1499 self.write_disposition, 1500 additional_create_parameters=self.additional_bq_parameters) 1501 _KNOWN_TABLES.add(str_table_reference) 1502 1503 def process(self, element, *schema_side_inputs): 1504 destination = bigquery_tools.get_hashable_destination(element[0]) 1505 1506 if callable(self.schema): 1507 schema = self.schema(destination, *schema_side_inputs) 1508 elif isinstance(self.schema, vp.ValueProvider): 1509 schema = self.schema.get() 1510 else: 1511 schema = self.schema 1512 1513 self._create_table_if_needed( 1514 bigquery_tools.parse_table_reference(destination), schema) 1515 1516 if not self.with_batched_input: 1517 row_and_insert_id = element[1] 1518 self._rows_buffer[destination].append(row_and_insert_id) 1519 self._total_buffered_rows += 1 1520 if len(self._rows_buffer[destination]) >= self._max_batch_size: 1521 return self._flush_batch(destination) 1522 elif self._total_buffered_rows >= self._max_buffered_rows: 1523 return self._flush_all_batches() 1524 else: 1525 # The input is already batched per destination, flush the rows now. 1526 batched_rows = element[1] 1527 self._rows_buffer[destination].extend(batched_rows) 1528 return self._flush_batch(destination) 1529 1530 def finish_bundle(self): 1531 bigquery_tools.BigQueryWrapper.HISTOGRAM_METRIC_LOGGER.log_metrics( 1532 reset_after_logging=True) 1533 return self._flush_all_batches() 1534 1535 def _flush_all_batches(self): 1536 _LOGGER.debug( 1537 'Attempting to flush to all destinations. Total buffered: %s', 1538 self._total_buffered_rows) 1539 1540 return itertools.chain( 1541 *[ 1542 self._flush_batch(destination) 1543 for destination in list(self._rows_buffer.keys()) 1544 if self._rows_buffer[destination] 1545 ]) 1546 1547 def _flush_batch(self, destination): 1548 1549 # Flush the current batch of rows to BigQuery. 1550 rows_and_insert_ids = self._rows_buffer[destination] 1551 table_reference = bigquery_tools.parse_table_reference(destination) 1552 1553 if table_reference.projectId is None: 1554 table_reference.projectId = vp.RuntimeValueProvider.get_value( 1555 'project', str, '') 1556 1557 _LOGGER.debug( 1558 'Flushing data to %s. 
Total %s rows.', 1559 destination, 1560 len(rows_and_insert_ids)) 1561 self.batch_size_metric.update(len(rows_and_insert_ids)) 1562 1563 rows = [r[0] for r in rows_and_insert_ids] 1564 if self.ignore_insert_ids: 1565 insert_ids = [None for r in rows_and_insert_ids] 1566 else: 1567 insert_ids = [r[1] for r in rows_and_insert_ids] 1568 1569 while True: 1570 start = time.time() 1571 passed, errors = self.bigquery_wrapper.insert_rows( 1572 project_id=table_reference.projectId, 1573 dataset_id=table_reference.datasetId, 1574 table_id=table_reference.tableId, 1575 rows=rows, 1576 insert_ids=insert_ids, 1577 skip_invalid_rows=True, 1578 ignore_unknown_values=self.ignore_unknown_columns) 1579 self.batch_latency_metric.update((time.time() - start) * 1000) 1580 1581 failed_rows = [(rows[entry['index']], entry["errors"]) 1582 for entry in errors] 1583 retry_backoff = next(self._backoff_calculator, None) 1584 1585 # If retry_backoff is None, then we will not retry and must log. 1586 should_retry = any( 1587 RetryStrategy.should_retry( 1588 self._retry_strategy, entry['errors'][0]['reason']) 1589 for entry in errors) and retry_backoff is not None 1590 1591 if not passed: 1592 self.failed_rows_metric.update(len(failed_rows)) 1593 message = ( 1594 'There were errors inserting to BigQuery. Will{} retry. ' 1595 'Errors were {}'.format(("" if should_retry else " not"), errors)) 1596 1597 # The log level is: 1598 # - WARNING when we are continuing to retry, and have a deadline. 1599 # - ERROR when we will no longer retry, or MAY retry forever. 1600 log_level = ( 1601 logging.WARN if should_retry or 1602 self._retry_strategy != RetryStrategy.RETRY_ALWAYS else 1603 logging.ERROR) 1604 1605 _LOGGER.log(log_level, message) 1606 1607 if not should_retry: 1608 break 1609 else: 1610 _LOGGER.info( 1611 'Sleeping %s seconds before retrying insertion.', retry_backoff) 1612 time.sleep(retry_backoff) 1613 rows = [fr[0] for fr in failed_rows] 1614 self._throttled_secs.inc(retry_backoff) 1615 1616 self._total_buffered_rows -= len(self._rows_buffer[destination]) 1617 del self._rows_buffer[destination] 1618 1619 return itertools.chain([ 1620 pvalue.TaggedOutput( 1621 BigQueryWriteFn.FAILED_ROWS_WITH_ERRORS, 1622 GlobalWindows.windowed_value((destination, row, err))) for row, 1623 err in failed_rows 1624 ], 1625 [ 1626 pvalue.TaggedOutput( 1627 BigQueryWriteFn.FAILED_ROWS, 1628 GlobalWindows.windowed_value( 1629 (destination, row))) for row, 1630 unused_err in failed_rows 1631 ]) 1632 1633 1634 # The number of shards per destination when writing via streaming inserts. 1635 DEFAULT_SHARDS_PER_DESTINATION = 500 1636 # The max duration a batch of elements is allowed to be buffered before being 1637 # flushed to BigQuery. 
1638 DEFAULT_BATCH_BUFFERING_DURATION_LIMIT_SEC = 0.2 1639 1640 1641 class _StreamToBigQuery(PTransform): 1642 def __init__( 1643 self, 1644 table_reference, 1645 table_side_inputs, 1646 schema_side_inputs, 1647 schema, 1648 batch_size, 1649 triggering_frequency, 1650 create_disposition, 1651 write_disposition, 1652 kms_key, 1653 retry_strategy, 1654 additional_bq_parameters, 1655 ignore_insert_ids, 1656 ignore_unknown_columns, 1657 with_auto_sharding, 1658 num_streaming_keys=DEFAULT_SHARDS_PER_DESTINATION, 1659 test_client=None, 1660 max_retries=None): 1661 self.table_reference = table_reference 1662 self.table_side_inputs = table_side_inputs 1663 self.schema_side_inputs = schema_side_inputs 1664 self.schema = schema 1665 self.batch_size = batch_size 1666 self.triggering_frequency = triggering_frequency 1667 self.create_disposition = create_disposition 1668 self.write_disposition = write_disposition 1669 self.kms_key = kms_key 1670 self.retry_strategy = retry_strategy 1671 self.test_client = test_client 1672 self.additional_bq_parameters = additional_bq_parameters 1673 self.ignore_insert_ids = ignore_insert_ids 1674 self.ignore_unknown_columns = ignore_unknown_columns 1675 self.with_auto_sharding = with_auto_sharding 1676 self._num_streaming_keys = num_streaming_keys 1677 self.max_retries = max_retries or MAX_INSERT_RETRIES 1678 1679 class InsertIdPrefixFn(DoFn): 1680 def start_bundle(self): 1681 self.prefix = str(uuid.uuid4()) 1682 self._row_count = 0 1683 1684 def process(self, element): 1685 key = element[0] 1686 value = element[1] 1687 insert_id = '%s-%s' % (self.prefix, self._row_count) 1688 self._row_count += 1 1689 yield (key, (value, insert_id)) 1690 1691 def expand(self, input): 1692 bigquery_write_fn = BigQueryWriteFn( 1693 schema=self.schema, 1694 batch_size=self.batch_size, 1695 create_disposition=self.create_disposition, 1696 write_disposition=self.write_disposition, 1697 kms_key=self.kms_key, 1698 retry_strategy=self.retry_strategy, 1699 test_client=self.test_client, 1700 additional_bq_parameters=self.additional_bq_parameters, 1701 ignore_insert_ids=self.ignore_insert_ids, 1702 ignore_unknown_columns=self.ignore_unknown_columns, 1703 with_batched_input=self.with_auto_sharding, 1704 max_retries=self.max_retries) 1705 1706 def _add_random_shard(element): 1707 key = element[0] 1708 value = element[1] 1709 return ((key, random.randint(0, self._num_streaming_keys)), value) 1710 1711 def _restore_table_ref(sharded_table_ref_elems_kv): 1712 sharded_table_ref = sharded_table_ref_elems_kv[0] 1713 table_ref = bigquery_tools.parse_table_reference(sharded_table_ref) 1714 return (table_ref, sharded_table_ref_elems_kv[1]) 1715 1716 tagged_data = ( 1717 input 1718 | 'AppendDestination' >> beam.ParDo( 1719 bigquery_tools.AppendDestinationsFn(self.table_reference), 1720 *self.table_side_inputs) 1721 | 'AddInsertIds' >> beam.ParDo(_StreamToBigQuery.InsertIdPrefixFn()) 1722 | 1723 'ToHashableTableRef' >> beam.Map(bigquery_tools.to_hashable_table_ref)) 1724 1725 if not self.with_auto_sharding: 1726 tagged_data = ( 1727 tagged_data 1728 | 'WithFixedSharding' >> beam.Map(_add_random_shard) 1729 | 'CommitInsertIds' >> ReshufflePerKey() 1730 | 'DropShard' >> beam.Map(lambda kv: (kv[0][0], kv[1]))) 1731 else: 1732 # Auto-sharding is achieved via GroupIntoBatches.WithShardedKey 1733 # transform which shards, groups and at the same time batches the table 1734 # rows to be inserted to BigQuery. 1735 1736 # Firstly the keys of tagged_data (table references) are converted to a 1737 # hashable format. 
This is needed to work with the keyed states used by 1738 # GroupIntoBatches. After grouping and batching is done, original table 1739 # references are restored. 1740 tagged_data = ( 1741 tagged_data 1742 | 'WithAutoSharding' >> beam.GroupIntoBatches.WithShardedKey( 1743 (self.batch_size or BigQueryWriteFn.DEFAULT_MAX_BUFFERED_ROWS), 1744 self.triggering_frequency or 1745 DEFAULT_BATCH_BUFFERING_DURATION_LIMIT_SEC) 1746 | 'DropShard' >> beam.Map(lambda kv: (kv[0].key, kv[1]))) 1747 1748 return ( 1749 tagged_data 1750 | 'FromHashableTableRef' >> beam.Map(_restore_table_ref) 1751 | 'StreamInsertRows' >> ParDo( 1752 bigquery_write_fn, *self.schema_side_inputs).with_outputs( 1753 BigQueryWriteFn.FAILED_ROWS, 1754 BigQueryWriteFn.FAILED_ROWS_WITH_ERRORS, 1755 main='main')) 1756 1757 1758 # Flag to be passed to WriteToBigQuery to force schema autodetection 1759 SCHEMA_AUTODETECT = 'SCHEMA_AUTODETECT' 1760 1761 1762 class WriteToBigQuery(PTransform): 1763 """Write data to BigQuery. 1764 1765 This transform receives a PCollection of elements to be inserted into BigQuery 1766 tables. The elements would come in as Python dictionaries, or as `TableRow` 1767 instances. 1768 """ 1769 class Method(object): 1770 DEFAULT = 'DEFAULT' 1771 STREAMING_INSERTS = 'STREAMING_INSERTS' 1772 FILE_LOADS = 'FILE_LOADS' 1773 STORAGE_WRITE_API = 'STORAGE_WRITE_API' 1774 1775 def __init__( 1776 self, 1777 table, 1778 dataset=None, 1779 project=None, 1780 schema=None, 1781 create_disposition=BigQueryDisposition.CREATE_IF_NEEDED, 1782 write_disposition=BigQueryDisposition.WRITE_APPEND, 1783 kms_key=None, 1784 batch_size=None, 1785 max_file_size=None, 1786 max_files_per_bundle=None, 1787 test_client=None, 1788 custom_gcs_temp_location=None, 1789 method=None, 1790 insert_retry_strategy=None, 1791 additional_bq_parameters=None, 1792 table_side_inputs=None, 1793 schema_side_inputs=None, 1794 triggering_frequency=None, 1795 use_at_least_once=False, 1796 validate=True, 1797 temp_file_format=None, 1798 ignore_insert_ids=False, 1799 # TODO(https://github.com/apache/beam/issues/20712): Switch the default 1800 # when the feature is mature. 1801 with_auto_sharding=False, 1802 ignore_unknown_columns=False, 1803 load_job_project_id=None, 1804 num_streaming_keys=DEFAULT_SHARDS_PER_DESTINATION, 1805 expansion_service=None): 1806 """Initialize a WriteToBigQuery transform. 1807 1808 Args: 1809 table (str, callable, ValueProvider): The ID of the table, or a callable 1810 that returns it. The ID must contain only letters ``a-z``, ``A-Z``, 1811 numbers ``0-9``, or connectors ``-_``. If dataset argument is 1812 :data:`None` then the table argument must contain the entire table 1813 reference specified as: ``'DATASET.TABLE'`` 1814 or ``'PROJECT:DATASET.TABLE'``. If it's a callable, it must receive one 1815 argument representing an element to be written to BigQuery, and return 1816 a TableReference, or a string table name as specified above. 1817 dataset (str): The ID of the dataset containing this table or 1818 :data:`None` if the table reference is specified entirely by the table 1819 argument. 1820 project (str): The ID of the project containing this table or 1821 :data:`None` if the table reference is specified entirely by the table 1822 argument. 1823 schema (str,dict,ValueProvider,callable): The schema to be used if the 1824 BigQuery table to write has to be created. This can be either specified 1825 as a :class:`~apache_beam.io.gcp.internal.clients.bigquery.\ 1826 bigquery_v2_messages.TableSchema`. 
It may also be provided as a `ValueProvider` holding a JSON-serialized 1827 schema string, as a python dictionary describing the schema, 1828 or as a single string of the form 1829 ``'field1:type1,field2:type2,field3:type3'`` that defines a comma 1830 separated list of fields. Here ``'type'`` should specify the BigQuery 1831 type of the field. Single string based schemas do not support nested 1832 fields, repeated fields, or specifying a BigQuery mode for fields 1833 (mode will always be set to ``'NULLABLE'``). 1834 If a callable, then it should receive a destination (in the form of 1835 a str) and return a str, dict or TableSchema. 1836 One may also pass ``SCHEMA_AUTODETECT`` here when using JSON-based 1837 file loads, and BigQuery will try to infer the schema for the files 1838 that are being loaded. 1839 create_disposition (BigQueryDisposition): A string describing what 1840 happens if the table does not exist. Possible values are: 1841 1842 * :attr:`BigQueryDisposition.CREATE_IF_NEEDED`: create if does not 1843 exist. 1844 * :attr:`BigQueryDisposition.CREATE_NEVER`: fail the write if does not 1845 exist. 1846 1847 write_disposition (BigQueryDisposition): A string describing what happens 1848 if the table already has some data. Possible values are: 1849 1850 * :attr:`BigQueryDisposition.WRITE_TRUNCATE`: delete existing rows. 1851 * :attr:`BigQueryDisposition.WRITE_APPEND`: add to existing rows. 1852 * :attr:`BigQueryDisposition.WRITE_EMPTY`: fail the write if table not 1853 empty. 1854 1855 For streaming pipelines, WRITE_TRUNCATE cannot be used. 1856 kms_key (str): Optional Cloud KMS key name for use when creating new 1857 tables. 1858 batch_size (int): Number of rows to be written to BQ per streaming API 1859 insert. The default is 500. 1860 test_client: Override the default bigquery client used for testing. 1861 max_file_size (int): The maximum size for a file to be written and then 1862 loaded into BigQuery. The default value is 4TB, which is 80% of the 1863 limit of 5TB for BigQuery to load any file. 1864 max_files_per_bundle (int): The maximum number of files to be concurrently 1865 written by a worker. The default here is 20. Larger values will allow 1866 writing to multiple destinations without having to reshard, but they 1867 increase the memory burden on the workers. 1868 custom_gcs_temp_location (str): A GCS location to store files to be used 1869 for file loads into BigQuery. By default, this will use the pipeline's 1870 temp_location, but for pipelines whose temp_location is not appropriate 1871 for BQ File Loads, users should pass a specific one. 1872 method: The method to use to write to BigQuery. It may be 1873 STREAMING_INSERTS, FILE_LOADS, STORAGE_WRITE_API or DEFAULT. An 1874 introduction on loading data to BigQuery: 1875 https://cloud.google.com/bigquery/docs/loading-data. 1876 DEFAULT will use STREAMING_INSERTS on streaming pipelines and 1877 FILE_LOADS on batch pipelines. 1878 Note: FILE_LOADS currently does not support BigQuery's JSON data type: 1879 https://cloud.google.com/bigquery/docs/reference/standard-sql/data-types#json_type 1880 insert_retry_strategy: The strategy to use when retrying streaming inserts 1881 into BigQuery. Options are shown in bigquery_tools.RetryStrategy attrs. 1882 Default is to retry always. This means that whenever there are rows 1883 that fail to be inserted to BigQuery, they will be retried indefinitely. 1884 Other retry strategy settings will produce a dead-letter PCollection 1885 as output.
Appropriate values are: 1886 1887 * `RetryStrategy.RETRY_ALWAYS`: retry all rows if 1888 there are any kind of errors. Note that this will hold your pipeline 1889 back if there are errors until you cancel or update it. 1890 * `RetryStrategy.RETRY_NEVER`: rows with errors 1891 will not be retried. Instead they will be output to a dead letter 1892 queue under the `'FailedRows'` tag. 1893 * `RetryStrategy.RETRY_ON_TRANSIENT_ERROR`: retry 1894 rows with transient errors (e.g. timeouts). Rows with permanent errors 1895 will be output to dead letter queue under `'FailedRows'` tag. 1896 1897 additional_bq_parameters (dict, callable): Additional parameters to pass 1898 to BQ when creating / loading data into a table. If a callable, it 1899 should be a function that receives a table reference indicating 1900 the destination and returns a dictionary. 1901 These can be 'timePartitioning', 'clustering', etc. They are passed 1902 directly to the job load configuration. See 1903 https://cloud.google.com/bigquery/docs/reference/rest/v2/Job#jobconfigurationload 1904 table_side_inputs (tuple): A tuple with ``AsSideInput`` PCollections to be 1905 passed to the table callable (if one is provided). 1906 schema_side_inputs: A tuple with ``AsSideInput`` PCollections to be 1907 passed to the schema callable (if one is provided). 1908 triggering_frequency (float): 1909 When method is FILE_LOADS: 1910 Value will be converted to int. Every triggering_frequency seconds, a 1911 BigQuery load job will be triggered for all the data written since the 1912 last load job. BigQuery has limits on how many load jobs can be 1913 triggered per day, so be careful not to set this duration too low, or 1914 you may exceed daily quota. Often this is set to 5 or 10 minutes to 1915 ensure that the project stays well under the BigQuery quota. See 1916 https://cloud.google.com/bigquery/quota-policy for more information 1917 about BigQuery quotas. 1918 1919 When method is STREAMING_INSERTS and with_auto_sharding=True: 1920 A streaming inserts batch will be submitted at least every 1921 triggering_frequency seconds when data is waiting. The batch can be 1922 sent earlier if it reaches the maximum batch size set by batch_size. 1923 Default value is 0.2 seconds. 1924 1925 When method is STORAGE_WRITE_API: 1926 A stream of rows will be committed every triggering_frequency seconds. 1927 By default, this will be 5 seconds to ensure exactly-once semantics. 1928 use_at_least_once: Intended only for STORAGE_WRITE_API. When True, will 1929 use at-least-once semantics. This is cheaper and provides lower 1930 latency, but will potentially duplicate records. 1931 validate: Indicates whether to perform validation checks on 1932 inputs. This parameter is primarily used for testing. 1933 temp_file_format: The format to use for file loads into BigQuery. The 1934 options are NEWLINE_DELIMITED_JSON or AVRO, with NEWLINE_DELIMITED_JSON 1935 being used by default. For advantages and limitations of the two 1936 formats, see 1937 https://cloud.google.com/bigquery/docs/loading-data-cloud-storage-avro 1938 and 1939 https://cloud.google.com/bigquery/docs/loading-data-cloud-storage-json. 1940 ignore_insert_ids: When using the STREAMING_INSERTS method to write data 1941 to BigQuery, `insert_ids` are a feature of BigQuery that support 1942 deduplication of events. If your use case is not sensitive to 1943 duplication of data inserted to BigQuery, set `ignore_insert_ids` 1944 to True to increase the throughput for BQ writing. 
See: 1945 https://cloud.google.com/bigquery/streaming-data-into-bigquery#disabling_best_effort_de-duplication 1946 with_auto_sharding: Experimental. If true, enables using a dynamically 1947 determined number of shards to write to BigQuery. This can be used for 1948 all of FILE_LOADS, STREAMING_INSERTS, and STORAGE_WRITE_API. Only 1949 applicable to unbounded input. 1950 ignore_unknown_columns: Accept rows that contain values that do not match 1951 the schema. The unknown values are ignored. Default is False, 1952 which treats unknown values as errors. This option is only valid for 1953 method=STREAMING_INSERTS. See reference: 1954 https://cloud.google.com/bigquery/docs/reference/rest/v2/tabledata/insertAll 1955 load_job_project_id: Specifies an alternate GCP project id to use for 1956 billing when using Batch File Loads. By default, the project id of the table is 1957 used. 1958 num_streaming_keys: The number of shards per destination when writing via 1959 streaming inserts. 1960 expansion_service: The address (host:port) of the expansion service. 1961 If no expansion service is provided, will attempt to run the default 1962 GCP expansion service. Used for the STORAGE_WRITE_API method. 1963 """ 1964 self._table = table 1965 self._dataset = dataset 1966 self._project = project 1967 self.table_reference = bigquery_tools.parse_table_reference( 1968 table, dataset, project) 1969 self.create_disposition = BigQueryDisposition.validate_create( 1970 create_disposition) 1971 self.write_disposition = BigQueryDisposition.validate_write( 1972 write_disposition) 1973 if schema == SCHEMA_AUTODETECT: 1974 self.schema = schema 1975 else: 1976 self.schema = bigquery_tools.get_dict_table_schema(schema) 1977 self.batch_size = batch_size 1978 self.kms_key = kms_key 1979 self.test_client = test_client 1980 1981 # TODO(pabloem): Consider handling ValueProvider for this location. 1982 self.custom_gcs_temp_location = custom_gcs_temp_location 1983 self.max_file_size = max_file_size 1984 self.max_files_per_bundle = max_files_per_bundle 1985 self.method = method or WriteToBigQuery.Method.DEFAULT 1986 self.triggering_frequency = triggering_frequency 1987 self.use_at_least_once = use_at_least_once 1988 self.expansion_service = expansion_service 1989 self.with_auto_sharding = with_auto_sharding 1990 self.insert_retry_strategy = insert_retry_strategy 1991 self._validate = validate 1992 self._temp_file_format = temp_file_format or bigquery_tools.FileFormat.JSON 1993 1994 self.additional_bq_parameters = additional_bq_parameters or {} 1995 self.table_side_inputs = table_side_inputs or () 1996 self.schema_side_inputs = schema_side_inputs or () 1997 self._ignore_insert_ids = ignore_insert_ids 1998 self._ignore_unknown_columns = ignore_unknown_columns 1999 self.load_job_project_id = load_job_project_id 2000 self._num_streaming_keys = num_streaming_keys 2001 2002 # Dict/schema methods were moved to bigquery_tools, but keep references 2003 # here for backward compatibility. 2004 get_table_schema_from_string = \ 2005 staticmethod(bigquery_tools.get_table_schema_from_string) 2006 table_schema_to_dict = staticmethod(bigquery_tools.table_schema_to_dict) 2007 get_dict_table_schema = staticmethod(bigquery_tools.get_dict_table_schema) 2008 2009 def _compute_method(self, experiments, is_streaming_pipeline): 2010 # If the new BQ sink is not activated for experiment flags, then we use 2011 # streaming inserts by default (it gets overridden in dataflow_runner.py).
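# As an illustrative summary of the branches below:
#   Method.DEFAULT on a streaming pipeline -> STREAMING_INSERTS
#   Method.DEFAULT on a batch pipeline     -> FILE_LOADS
#   any explicitly chosen method           -> returned unchanged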
2012 if self.method == self.Method.DEFAULT and is_streaming_pipeline: 2013 return self.Method.STREAMING_INSERTS 2014 elif self.method == self.Method.DEFAULT and not is_streaming_pipeline: 2015 return self.Method.FILE_LOADS 2016 else: 2017 return self.method 2018 2019 def expand(self, pcoll): 2020 p = pcoll.pipeline 2021 2022 if (isinstance(self.table_reference, TableReference) and 2023 self.table_reference.projectId is None): 2024 self.table_reference.projectId = pcoll.pipeline.options.view_as( 2025 GoogleCloudOptions).project 2026 2027 # TODO(pabloem): Use a different method to determine if streaming or batch. 2028 is_streaming_pipeline = p.options.view_as(StandardOptions).streaming 2029 2030 if not is_streaming_pipeline and self.with_auto_sharding: 2031 raise ValueError( 2032 'with_auto_sharding is not applicable to batch pipelines.') 2033 2034 experiments = p.options.view_as(DebugOptions).experiments or [] 2035 method_to_use = self._compute_method(experiments, is_streaming_pipeline) 2036 2037 if method_to_use == WriteToBigQuery.Method.STREAMING_INSERTS: 2038 if self.schema == SCHEMA_AUTODETECT: 2039 raise ValueError( 2040 'Schema auto-detection is not supported for streaming ' 2041 'inserts into BigQuery. Only for File Loads.') 2042 2043 if self.triggering_frequency is not None and not self.with_auto_sharding: 2044 raise ValueError( 2045 'triggering_frequency with STREAMING_INSERTS can only be used with ' 2046 'with_auto_sharding=True.') 2047 2048 outputs = pcoll | _StreamToBigQuery( 2049 table_reference=self.table_reference, 2050 table_side_inputs=self.table_side_inputs, 2051 schema_side_inputs=self.schema_side_inputs, 2052 schema=self.schema, 2053 batch_size=self.batch_size, 2054 triggering_frequency=self.triggering_frequency, 2055 create_disposition=self.create_disposition, 2056 write_disposition=self.write_disposition, 2057 kms_key=self.kms_key, 2058 retry_strategy=self.insert_retry_strategy, 2059 additional_bq_parameters=self.additional_bq_parameters, 2060 ignore_insert_ids=self._ignore_insert_ids, 2061 ignore_unknown_columns=self._ignore_unknown_columns, 2062 with_auto_sharding=self.with_auto_sharding, 2063 test_client=self.test_client, 2064 num_streaming_keys=self._num_streaming_keys) 2065 2066 return WriteResult( 2067 method=WriteToBigQuery.Method.STREAMING_INSERTS, 2068 failed_rows=outputs[BigQueryWriteFn.FAILED_ROWS], 2069 failed_rows_with_errors=outputs[ 2070 BigQueryWriteFn.FAILED_ROWS_WITH_ERRORS]) 2071 elif method_to_use == WriteToBigQuery.Method.FILE_LOADS: 2072 if self._temp_file_format == bigquery_tools.FileFormat.AVRO: 2073 if self.schema == SCHEMA_AUTODETECT: 2074 raise ValueError( 2075 'Schema auto-detection is not supported when using Avro based ' 2076 'file loads into BigQuery. Please specify a schema or set ' 2077 'temp_file_format="NEWLINE_DELIMITED_JSON"') 2078 if self.schema is None: 2079 raise ValueError( 2080 'A schema must be provided when writing to BigQuery using ' 2081 'Avro based file loads') 2082 2083 if self.schema and type(self.schema) is dict: 2084 2085 def find_in_nested_dict(schema): 2086 for field in schema['fields']: 2087 if field['type'] == 'JSON': 2088 raise ValueError( 2089 'Found JSON type in table schema. JSON data ' 2090 'insertion is currently not supported with ' 2091 'FILE_LOADS write method. This is supported with ' 2092 'STREAMING_INSERTS. 
For more information: ' 2093 'https://cloud.google.com/bigquery/docs/reference/' 2094 'standard-sql/json-data#ingest_json_data') 2095 elif field['type'] == 'STRUCT': 2096 find_in_nested_dict(field) 2097 2098 find_in_nested_dict(self.schema) 2099 2100 from apache_beam.io.gcp.bigquery_file_loads import BigQueryBatchFileLoads 2101 # Only cast to int when a value is given. 2102 # We only use an int for BigQueryBatchFileLoads 2103 if self.triggering_frequency is not None: 2104 triggering_frequency = int(self.triggering_frequency) 2105 else: 2106 triggering_frequency = self.triggering_frequency 2107 output = pcoll | BigQueryBatchFileLoads( 2108 destination=self.table_reference, 2109 schema=self.schema, 2110 project=self._project, 2111 create_disposition=self.create_disposition, 2112 write_disposition=self.write_disposition, 2113 triggering_frequency=triggering_frequency, 2114 with_auto_sharding=self.with_auto_sharding, 2115 temp_file_format=self._temp_file_format, 2116 max_file_size=self.max_file_size, 2117 max_files_per_bundle=self.max_files_per_bundle, 2118 custom_gcs_temp_location=self.custom_gcs_temp_location, 2119 test_client=self.test_client, 2120 table_side_inputs=self.table_side_inputs, 2121 schema_side_inputs=self.schema_side_inputs, 2122 additional_bq_parameters=self.additional_bq_parameters, 2123 validate=self._validate, 2124 is_streaming_pipeline=is_streaming_pipeline, 2125 load_job_project_id=self.load_job_project_id) 2126 2127 return WriteResult( 2128 method=WriteToBigQuery.Method.FILE_LOADS, 2129 destination_load_jobid_pairs=output[ 2130 BigQueryBatchFileLoads.DESTINATION_JOBID_PAIRS], 2131 destination_file_pairs=output[ 2132 BigQueryBatchFileLoads.DESTINATION_FILE_PAIRS], 2133 destination_copy_jobid_pairs=output[ 2134 BigQueryBatchFileLoads.DESTINATION_COPY_JOBID_PAIRS]) 2135 else: 2136 # Storage Write API 2137 if self.schema is None: 2138 raise AttributeError( 2139 "A schema is required in order to prepare rows " 2140 "for writing with STORAGE_WRITE_API.") 2141 if callable(self.schema): 2142 raise NotImplementedError( 2143 "Writing to dynamic destinations is not " 2144 "supported for this write method.") 2145 elif isinstance(self.schema, vp.ValueProvider): 2146 schema = self.schema.get() 2147 else: 2148 schema = self.schema 2149 2150 table = bigquery_tools.get_hashable_destination(self.table_reference) 2151 # A triggering_frequency of None is not supported, so default to 0. 2152 triggering_frequency = self.triggering_frequency or 0 2153 # SchemaTransform expects Beam Rows, so map to Rows first 2154 output_beam_rows = ( 2155 pcoll 2156 | 2157 beam.Map(lambda row: bigquery_tools.beam_row_from_dict(row, schema)).
2158 with_output_types( 2159 RowTypeConstraint.from_fields( 2160 bigquery_tools.get_beam_typehints_from_tableschema(schema))) 2161 | StorageWriteToBigQuery( 2162 table=table, 2163 create_disposition=self.create_disposition, 2164 write_disposition=self.write_disposition, 2165 triggering_frequency=triggering_frequency, 2166 use_at_least_once=self.use_at_least_once, 2167 with_auto_sharding=self.with_auto_sharding, 2168 expansion_service=self.expansion_service)) 2169 2170 # Convert the Beam Rows back to Python dict elements. 2171 failed_rows = ( 2172 output_beam_rows[StorageWriteToBigQuery.FAILED_ROWS] 2173 | beam.Map(lambda row: row.as_dict())) 2174 failed_rows_with_errors = ( 2175 output_beam_rows[StorageWriteToBigQuery.FAILED_ROWS_WITH_ERRORS] 2176 | beam.Map( 2177 lambda row: { 2178 "error_message": row.error_message, 2179 "failed_row": row.failed_row.as_dict() 2180 })) 2181 2182 return WriteResult( 2183 method=WriteToBigQuery.Method.STORAGE_WRITE_API, 2184 failed_rows=failed_rows, 2185 failed_rows_with_errors=failed_rows_with_errors) 2186 2187 def display_data(self): 2188 res = {} 2189 if self.table_reference is not None and isinstance(self.table_reference, 2190 TableReference): 2191 tableSpec = '{}.{}'.format( 2192 self.table_reference.datasetId, self.table_reference.tableId) 2193 if self.table_reference.projectId is not None: 2194 tableSpec = '{}:{}'.format(self.table_reference.projectId, tableSpec) 2195 res['table'] = DisplayDataItem(tableSpec, label='Table') 2196 2197 res['validation'] = DisplayDataItem( 2198 self._validate, label="Validation Enabled") 2199 return res 2200 2201 def to_runner_api_parameter(self, context): 2202 from apache_beam.internal import pickler 2203 2204 # It'd be nice to name these according to their actual 2205 # names/positions in the original argument list, but such a 2206 # transformation is currently irreversible given how 2207 # remove_objects_from_args and insert_values_in_args 2208 # are currently implemented.
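# Consequently, each tuple of side inputs is serialized below into a dict
# keyed by SIDE_INPUT_PREFIX plus the side input's position (for example
# SIDE_INPUT_PREFIX + '0', SIDE_INPUT_PREFIX + '1', ...); from_runner_api
# later sorts the entries back by that index to recover the original order.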
2209 def serialize(side_inputs): 2210 return {(SIDE_INPUT_PREFIX + '%s') % ix: 2211 si.to_runner_api(context).SerializeToString() 2212 for ix, 2213 si in enumerate(side_inputs)} 2214 2215 table_side_inputs = serialize(self.table_side_inputs) 2216 schema_side_inputs = serialize(self.schema_side_inputs) 2217 2218 config = { 2219 'table': self._table, 2220 'dataset': self._dataset, 2221 'project': self._project, 2222 'schema': self.schema, 2223 'create_disposition': self.create_disposition, 2224 'write_disposition': self.write_disposition, 2225 'kms_key': self.kms_key, 2226 'batch_size': self.batch_size, 2227 'max_file_size': self.max_file_size, 2228 'max_files_per_bundle': self.max_files_per_bundle, 2229 'custom_gcs_temp_location': self.custom_gcs_temp_location, 2230 'method': self.method, 2231 'insert_retry_strategy': self.insert_retry_strategy, 2232 'additional_bq_parameters': self.additional_bq_parameters, 2233 'table_side_inputs': table_side_inputs, 2234 'schema_side_inputs': schema_side_inputs, 2235 'triggering_frequency': self.triggering_frequency, 2236 'validate': self._validate, 2237 'temp_file_format': self._temp_file_format, 2238 'ignore_insert_ids': self._ignore_insert_ids, 2239 'with_auto_sharding': self.with_auto_sharding, 2240 } 2241 return 'beam:transform:write_to_big_query:v0', pickler.dumps(config) 2242 2243 @PTransform.register_urn('beam:transform:write_to_big_query:v0', bytes) 2244 def from_runner_api(unused_ptransform, payload, context): 2245 from apache_beam.internal import pickler 2246 from apache_beam.portability.api import beam_runner_api_pb2 2247 2248 config = pickler.loads(payload) 2249 2250 def deserialize(side_inputs): 2251 deserialized_side_inputs = {} 2252 for k, v in side_inputs.items(): 2253 side_input = beam_runner_api_pb2.SideInput() 2254 side_input.ParseFromString(v) 2255 deserialized_side_inputs[k] = side_input 2256 2257 # This is an ordered list stored as a dict (see the comments in 2258 # to_runner_api_parameter above). 2259 indexed_side_inputs = [( 2260 get_sideinput_index(tag), 2261 pvalue.AsSideInput.from_runner_api(si, context)) for tag, 2262 si in deserialized_side_inputs.items()] 2263 return [si for _, si in sorted(indexed_side_inputs)] 2264 2265 config['table_side_inputs'] = deserialize(config['table_side_inputs']) 2266 config['schema_side_inputs'] = deserialize(config['schema_side_inputs']) 2267 2268 return WriteToBigQuery(**config) 2269 2270 2271 class WriteResult: 2272 """The result of a WriteToBigQuery transform. 2273 """ 2274 def __init__( 2275 self, 2276 method: WriteToBigQuery.Method = None, 2277 destination_load_jobid_pairs: PCollection[Tuple[str, 2278 JobReference]] = None, 2279 destination_file_pairs: PCollection[Tuple[str, Tuple[str, int]]] = None, 2280 destination_copy_jobid_pairs: PCollection[Tuple[str, 2281 JobReference]] = None, 2282 failed_rows: PCollection[Tuple[str, dict]] = None, 2283 failed_rows_with_errors: PCollection[Tuple[str, dict, list]] = None): 2284 2285 self._method = method 2286 self._destination_load_jobid_pairs = destination_load_jobid_pairs 2287 self._destination_file_pairs = destination_file_pairs 2288 self._destination_copy_jobid_pairs = destination_copy_jobid_pairs 2289 self._failed_rows = failed_rows 2290 self._failed_rows_with_errors = failed_rows_with_errors 2291 2292 from apache_beam.io.gcp.bigquery_file_loads import BigQueryBatchFileLoads 2293 self.attributes = { 2294 BigQueryWriteFn.FAILED_ROWS: WriteResult.failed_rows, 2295 BigQueryWriteFn.FAILED_ROWS_WITH_ERRORS: WriteResult. 
2296 failed_rows_with_errors, 2297 BigQueryBatchFileLoads.DESTINATION_JOBID_PAIRS: WriteResult. 2298 destination_load_jobid_pairs, 2299 BigQueryBatchFileLoads.DESTINATION_FILE_PAIRS: WriteResult. 2300 destination_file_pairs, 2301 BigQueryBatchFileLoads.DESTINATION_COPY_JOBID_PAIRS: WriteResult. 2302 destination_copy_jobid_pairs, 2303 } 2304 2305 def validate(self, valid_methods, attribute): 2306 if self._method not in valid_methods: 2307 raise AttributeError( 2308 f'Cannot get {attribute} because it is not produced ' 2309 f'by the {self._method} write method. Note: only ' 2310 f'{valid_methods} produces this attribute.') 2311 2312 @property 2313 def destination_load_jobid_pairs( 2314 self) -> PCollection[Tuple[str, JobReference]]: 2315 """A ``FILE_LOADS`` method attribute 2316 2317 Returns: A PCollection of the table destinations that were successfully 2318 loaded to using the batch load API, along with the load job IDs. 2319 2320 Raises: AttributeError: if accessed with a write method 2321 besides ``FILE_LOADS``.""" 2322 self.validate([WriteToBigQuery.Method.FILE_LOADS], 2323 'DESTINATION_JOBID_PAIRS') 2324 2325 return self._destination_load_jobid_pairs 2326 2327 @property 2328 def destination_file_pairs(self) -> PCollection[Tuple[str, Tuple[str, int]]]: 2329 """A ``FILE_LOADS`` method attribute 2330 2331 Returns: A PCollection of the table destinations along with the 2332 temp files used as sources to load from. 2333 2334 Raises: AttributeError: if accessed with a write method 2335 besides ``FILE_LOADS``.""" 2336 self.validate([WriteToBigQuery.Method.FILE_LOADS], 'DESTINATION_FILE_PAIRS') 2337 2338 return self._destination_file_pairs 2339 2340 @property 2341 def destination_copy_jobid_pairs( 2342 self) -> PCollection[Tuple[str, JobReference]]: 2343 """A ``FILE_LOADS`` method attribute 2344 2345 Returns: A PCollection of the table destinations that were successfully 2346 copied to, along with the copy job ID. 2347 2348 Raises: AttributeError: if accessed with a write method 2349 besides ``FILE_LOADS``.""" 2350 self.validate([WriteToBigQuery.Method.FILE_LOADS], 2351 'DESTINATION_COPY_JOBID_PAIRS') 2352 2353 return self._destination_copy_jobid_pairs 2354 2355 @property 2356 def failed_rows(self) -> PCollection[Tuple[str, dict]]: 2357 """A ``[STREAMING_INSERTS, STORAGE_WRITE_API]`` method attribute 2358 2359 Returns: A PCollection of rows that failed when inserting to BigQuery. 2360 2361 Raises: AttributeError: if accessed with a write method 2362 besides ``[STREAMING_INSERTS, STORAGE_WRITE_API]``.""" 2363 self.validate([ 2364 WriteToBigQuery.Method.STREAMING_INSERTS, 2365 WriteToBigQuery.Method.STORAGE_WRITE_API 2366 ], 2367 'FAILED_ROWS') 2368 2369 return self._failed_rows 2370 2371 @property 2372 def failed_rows_with_errors(self) -> PCollection[Tuple[str, dict, list]]: 2373 """A ``[STREAMING_INSERTS, STORAGE_WRITE_API]`` method attribute 2374 2375 Returns: 2376 A PCollection of rows that failed when inserting to BigQuery, 2377 along with their errors. 2378 2379 Raises: 2380 AttributeError: if accessed with a write method 2381 besides ``[STREAMING_INSERTS, STORAGE_WRITE_API]``.""" 2382 self.validate([ 2383 WriteToBigQuery.Method.STREAMING_INSERTS, 2384 WriteToBigQuery.Method.STORAGE_WRITE_API 2385 ], 2386 'FAILED_ROWS_WITH_ERRORS') 2387 2388 return self._failed_rows_with_errors 2389 2390 def __getitem__(self, key): 2391 if key not in self.attributes: 2392 raise AttributeError( 2393 f'Error trying to access nonexistent attribute `{key}` in write ' 2394 'result. 
Please see __documentation__ for available attributes.') 2395 2396 return self.attributes[key].__get__(self, WriteResult) 2397 2398 2399 def _default_io_expansion_service(append_args=None): 2400 return BeamJarExpansionService( 2401 'sdks:java:io:google-cloud-platform:expansion-service:build', 2402 append_args=append_args) 2403 2404 2405 class StorageWriteToBigQuery(PTransform): 2406 """Writes data to BigQuery using Storage API. 2407 2408 Experimental; no backwards compatibility guarantees. 2409 """ 2410 URN = "beam:schematransform:org.apache.beam:bigquery_storage_write:v1" 2411 FAILED_ROWS = "FailedRows" 2412 FAILED_ROWS_WITH_ERRORS = "FailedRowsWithErrors" 2413 2414 def __init__( 2415 self, 2416 table, 2417 create_disposition=BigQueryDisposition.CREATE_IF_NEEDED, 2418 write_disposition=BigQueryDisposition.WRITE_APPEND, 2419 triggering_frequency=0, 2420 use_at_least_once=False, 2421 with_auto_sharding=False, 2422 expansion_service=None): 2423 """Initialize a StorageWriteToBigQuery transform. 2424 2425 :param table: 2426 Fully-qualified table ID specified as ``'PROJECT:DATASET.TABLE'``. 2427 :param create_disposition: 2428 String specifying the strategy to take when the table doesn't 2429 exist. Possible values are: 2430 * ``'CREATE_IF_NEEDED'``: create if does not exist. 2431 * ``'CREATE_NEVER'``: fail the write if does not exist. 2432 :param write_disposition: 2433 String specifying the strategy to take when the table already 2434 contains data. Possible values are: 2435 * ``'WRITE_TRUNCATE'``: delete existing rows. 2436 * ``'WRITE_APPEND'``: add to existing rows. 2437 * ``'WRITE_EMPTY'``: fail the write if table not empty. 2438 :param triggering_frequency: 2439 The time in seconds between write commits. Should only be specified 2440 for streaming pipelines. Defaults to 5 seconds. 2441 :param use_at_least_once: 2442 Use at-least-once semantics. Is cheaper and provides lower latency, 2443 but will potentially duplicate records. 2444 :param with_auto_sharding: 2445 Experimental. If true, enables using a dynamically determined number of 2446 shards to write to BigQuery. Only applicable to unbounded input. 2447 :param expansion_service: 2448 The address (host:port) of the expansion service. If no expansion 2449 service is provided, will attempt to run the default GCP expansion 2450 service. 
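    A minimal usage sketch (the table name is a placeholder, and ``rows`` is
    assumed to be a PCollection of schema'd Beam Rows; most pipelines reach
    this transform indirectly through ``WriteToBigQuery`` with
    ``method='STORAGE_WRITE_API'`` rather than applying it directly)::

      result = rows | StorageWriteToBigQuery(
          table='my_project:my_dataset.my_table',
          triggering_frequency=5,
          use_at_least_once=False)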
2451 """ 2452 super().__init__() 2453 self._table = table 2454 self._create_disposition = create_disposition 2455 self._write_disposition = write_disposition 2456 self._triggering_frequency = triggering_frequency 2457 self._use_at_least_once = use_at_least_once 2458 self._with_auto_sharding = with_auto_sharding 2459 self._expansion_service = ( 2460 expansion_service or _default_io_expansion_service()) 2461 self.schematransform_config = SchemaAwareExternalTransform.discover_config( 2462 self._expansion_service, self.URN) 2463 2464 def expand(self, input): 2465 external_storage_write = SchemaAwareExternalTransform( 2466 identifier=self.schematransform_config.identifier, 2467 expansion_service=self._expansion_service, 2468 rearrange_based_on_discovery=True, 2469 autoSharding=self._with_auto_sharding, 2470 createDisposition=self._create_disposition, 2471 table=self._table, 2472 triggeringFrequencySeconds=self._triggering_frequency, 2473 useAtLeastOnceSemantics=self._use_at_least_once, 2474 writeDisposition=self._write_disposition, 2475 ) 2476 2477 input_tag = self.schematransform_config.inputs[0] 2478 2479 return {input_tag: input} | external_storage_write 2480 2481 2482 class ReadFromBigQuery(PTransform): 2483 """Read data from BigQuery. 2484 2485 This PTransform uses a BigQuery export job to take a snapshot of the table 2486 on GCS, and then reads from each produced file. File format is Avro by 2487 default. 2488 2489 Args: 2490 method: The method to use to read from BigQuery. It may be EXPORT or 2491 DIRECT_READ. EXPORT invokes a BigQuery export request 2492 (https://cloud.google.com/bigquery/docs/exporting-data). DIRECT_READ reads 2493 directly from BigQuery storage using the BigQuery Read API 2494 (https://cloud.google.com/bigquery/docs/reference/storage). If 2495 unspecified, the default is currently EXPORT. 2496 use_native_datetime (bool): By default this transform exports BigQuery 2497 DATETIME fields as formatted strings (for example: 2498 2021-01-01T12:59:59). If :data:`True`, BigQuery DATETIME fields will 2499 be returned as native Python datetime objects. This can only be used when 2500 'method' is 'DIRECT_READ'. 2501 table (str, callable, ValueProvider): The ID of the table, or a callable 2502 that returns it. If dataset argument is :data:`None` then the table 2503 argument must contain the entire table reference specified as: 2504 ``'DATASET.TABLE'`` or ``'PROJECT:DATASET.TABLE'``. If it's a callable, 2505 it must receive one argument representing an element to be written to 2506 BigQuery, and return a TableReference, or a string table name as specified 2507 above. 2508 dataset (str): The ID of the dataset containing this table or 2509 :data:`None` if the table reference is specified entirely by the table 2510 argument. 2511 project (str): The ID of the project containing this table. 2512 query (str, ValueProvider): A query to be used instead of arguments 2513 table, dataset, and project. 2514 validate (bool): If :data:`True`, various checks will be done when source 2515 gets initialized (e.g., is table present?). This should be 2516 :data:`True` for most scenarios in order to catch errors as early as 2517 possible (pipeline construction instead of pipeline execution). It 2518 should be :data:`False` if the table is created during pipeline 2519 execution by a previous step. 2520 coder (~apache_beam.coders.coders.Coder): The coder for the table 2521 rows. 
If :data:`None`, then the default coder is 2522 _JsonToDictCoder, which will interpret every row as a JSON 2523 serialized dictionary. 2524 use_standard_sql (bool): Specifies whether to use BigQuery's standard SQL 2525 dialect for this query. The default value is :data:`False`. 2526 If set to :data:`True`, the query will use BigQuery's updated SQL 2527 dialect with improved standards compliance. 2528 This parameter is ignored for table inputs. 2529 flatten_results (bool): Flattens all nested and repeated fields in the 2530 query results. The default value is :data:`True`. 2531 kms_key (str): Optional Cloud KMS key name for use when creating new 2532 temporary tables. 2533 gcs_location (str, ValueProvider): The name of the Google Cloud Storage 2534 bucket where the extracted table should be written, given as a string or 2535 a :class:`~apache_beam.options.value_provider.ValueProvider`. If 2536 :data:`None`, then the temp_location parameter is used. 2537 bigquery_job_labels (dict): A dictionary with string labels to be passed 2538 to BigQuery export and query jobs created by this transform. See: 2539 https://cloud.google.com/bigquery/docs/reference/rest/v2/\ 2540 Job#JobConfiguration 2541 use_json_exports (bool): By default, this transform works by exporting 2542 BigQuery data into Avro files, and reading those files. With this 2543 parameter, the transform will instead export to JSON files. JSON files 2544 are slower to read due to their larger size. 2545 When using JSON exports, the BigQuery types for DATE, DATETIME, TIME, and 2546 TIMESTAMP will be exported as strings. This behavior is consistent with 2547 BigQuerySource. 2548 When using Avro exports, these fields will be exported as native Python 2549 types (datetime.date, datetime.datetime, datetime.time, 2550 and datetime.datetime respectively). Avro exports are recommended. 2551 To learn more about BigQuery types, and Time-related type 2552 representations, see: https://cloud.google.com/bigquery/docs/reference/\ 2553 standard-sql/data-types 2554 To learn more about type conversions between BigQuery and Avro, see: 2555 https://cloud.google.com/bigquery/docs/loading-data-cloud-storage-avro\ 2556 #avro_conversions 2557 temp_dataset (``apache_beam.io.gcp.internal.clients.bigquery.\ 2558 DatasetReference``): 2559 Temporary dataset reference to use when reading from BigQuery using a 2560 query. When reading using a query, BigQuery source will create a 2561 temporary dataset and a temporary table to store the results of the 2562 query. With this option, you can set an existing dataset to create the 2563 temporary table in. BigQuery source will create a temporary table in 2564 that dataset, and will remove it once it is not needed. The job needs access 2565 to create and delete tables within the given dataset. The dataset name 2566 should *not* start with the reserved prefix `beam_temp_dataset_`. 2567 query_priority (BigQueryQueryPriority): By default, this transform runs 2568 queries with BATCH priority. Use :attr:`BigQueryQueryPriority.INTERACTIVE` 2569 to run queries with INTERACTIVE priority. This option is ignored when 2570 reading from a table rather than a query. To learn more about query 2571 priority, see: https://cloud.google.com/bigquery/docs/running-queries 2572 output_type (str): By default, this source yields Python dictionaries 2573 (`PYTHON_DICT`). There is experimental support for producing a 2574 PCollection with a schema and yielding Beam Rows via the option 2575 `BEAM_ROW`.
For more information on schemas, see 2576 https://beam.apache.org/documentation/programming-guide/\ 2577 #what-is-a-schema 2578 """ 2579 class Method(object): 2580 EXPORT = 'EXPORT' # This is currently the default. 2581 DIRECT_READ = 'DIRECT_READ' 2582 2583 COUNTER = 0 2584 2585 def __init__( 2586 self, 2587 gcs_location=None, 2588 method=None, 2589 use_native_datetime=False, 2590 output_type=None, 2591 *args, 2592 **kwargs): 2593 self.method = method or ReadFromBigQuery.Method.EXPORT 2594 self.use_native_datetime = use_native_datetime 2595 self.output_type = output_type 2596 self._args = args 2597 self._kwargs = kwargs 2598 2599 if self.method is ReadFromBigQuery.Method.EXPORT \ 2600 and self.use_native_datetime is True: 2601 raise TypeError( 2602 'The "use_native_datetime" parameter cannot be True for EXPORT.' 2603 ' Please set the "use_native_datetime" parameter to False *OR*' 2604 ' set the "method" parameter to ReadFromBigQuery.Method.DIRECT_READ.') 2605 2606 if gcs_location and self.method is ReadFromBigQuery.Method.EXPORT: 2607 if not isinstance(gcs_location, (str, ValueProvider)): 2608 raise TypeError( 2609 '%s: gcs_location must be of type string' 2610 ' or ValueProvider; got %r instead' % 2611 (self.__class__.__name__, type(gcs_location))) 2612 if isinstance(gcs_location, str): 2613 gcs_location = StaticValueProvider(str, gcs_location) 2614 2615 if self.output_type == 'BEAM_ROW' and self._kwargs.get('query', 2616 None) is not None: 2617 raise ValueError( 2618 "Both a query and an output type of 'BEAM_ROW' were specified. " 2619 "'BEAM_ROW' is not currently supported with queries.") 2620 2621 self.gcs_location = gcs_location 2622 self.bigquery_dataset_labels = { 2623 'type': 'bq_direct_read_' + str(uuid.uuid4())[0:10] 2624 } 2625 2626 def expand(self, pcoll): 2627 if self.method is ReadFromBigQuery.Method.EXPORT: 2628 output_pcollection = self._expand_export(pcoll) 2629 elif self.method is ReadFromBigQuery.Method.DIRECT_READ: 2630 output_pcollection = self._expand_direct_read(pcoll) 2631 2632 else: 2633 raise ValueError( 2634 'The method to read from BigQuery must be either EXPORT ' 2635 'or DIRECT_READ.') 2636 return self._expand_output_type(output_pcollection) 2637 2638 def _expand_output_type(self, output_pcollection): 2639 if self.output_type == 'PYTHON_DICT' or self.output_type is None: 2640 return output_pcollection 2641 elif self.output_type == 'BEAM_ROW': 2642 table_details = bigquery_tools.parse_table_reference( 2643 table=self._kwargs.get("table", None), 2644 dataset=self._kwargs.get("dataset", None), 2645 project=self._kwargs.get("project", None)) 2646 if isinstance(self._kwargs['table'], ValueProvider): 2647 raise TypeError( 2648 '%s: table must be of type string' 2649 '; got ValueProvider instead' % self.__class__.__name__) 2650 elif callable(self._kwargs['table']): 2651 raise TypeError( 2652 '%s: table must be of type string' 2653 '; got a callable instead' % self.__class__.__name__) 2654 return output_pcollection | bigquery_schema_tools.\ 2655 convert_to_usertype( 2656 bigquery_tools.BigQueryWrapper().get_table( 2657 project_id=table_details.projectId, 2658 dataset_id=table_details.datasetId, 2659 table_id=table_details.tableId).schema) 2660 2661 def _expand_export(self, pcoll): 2662 # TODO(https://github.com/apache/beam/issues/20683): Make ReadFromBQ rely 2663 # on ReadAllFromBQ implementation.
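# Roughly, the export path below (1) builds a unique GCS directory for this
# read, (2) reads the exported files through _CustomBigQuerySource, and
# (3) removes the exported files afterwards via _PassThroughThenCleanup.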
2664 temp_location = pcoll.pipeline.options.view_as( 2665 GoogleCloudOptions).temp_location 2666 job_name = pcoll.pipeline.options.view_as(GoogleCloudOptions).job_name 2667 gcs_location_vp = self.gcs_location 2668 unique_id = str(uuid.uuid4())[0:10] 2669 2670 def file_path_to_remove(unused_elm): 2671 gcs_location = bigquery_export_destination_uri( 2672 gcs_location_vp, temp_location, unique_id, True) 2673 return gcs_location + '/' 2674 2675 files_to_remove_pcoll = beam.pvalue.AsList( 2676 pcoll.pipeline 2677 | 'FilesToRemoveImpulse' >> beam.Create([None]) 2678 | 'MapFilesToRemove' >> beam.Map(file_path_to_remove)) 2679 2680 try: 2681 step_name = self.label 2682 except AttributeError: 2683 step_name = 'ReadFromBigQuery_%d' % ReadFromBigQuery.COUNTER 2684 ReadFromBigQuery.COUNTER += 1 2685 return ( 2686 pcoll 2687 | beam.io.Read( 2688 _CustomBigQuerySource( 2689 gcs_location=self.gcs_location, 2690 pipeline_options=pcoll.pipeline.options, 2691 method=self.method, 2692 job_name=job_name, 2693 step_name=step_name, 2694 unique_id=unique_id, 2695 *self._args, 2696 **self._kwargs)) 2697 | _PassThroughThenCleanup(files_to_remove_pcoll)) 2698 2699 def _expand_direct_read(self, pcoll): 2700 project_id = None 2701 temp_table_ref = None 2702 if 'temp_dataset' in self._kwargs: 2703 temp_table_ref = bigquery.TableReference( 2704 projectId=self._kwargs['temp_dataset'].projectId, 2705 datasetId=self._kwargs['temp_dataset'].datasetId, 2706 tableId='beam_temp_table_' + uuid.uuid4().hex) 2707 else: 2708 project_id = pcoll.pipeline.options.view_as(GoogleCloudOptions).project 2709 2710 def _get_pipeline_details(unused_elm): 2711 pipeline_details = {} 2712 if temp_table_ref is not None: 2713 pipeline_details['temp_table_ref'] = temp_table_ref 2714 elif project_id is not None: 2715 pipeline_details['project_id'] = project_id 2716 pipeline_details[ 2717 'bigquery_dataset_labels'] = self.bigquery_dataset_labels 2718 return pipeline_details 2719 2720 project_to_cleanup_pcoll = beam.pvalue.AsList( 2721 pcoll.pipeline 2722 | 'ProjectToCleanupImpulse' >> beam.Create([None]) 2723 | 'MapProjectToCleanup' >> beam.Map(_get_pipeline_details)) 2724 2725 return ( 2726 pcoll 2727 | beam.io.Read( 2728 _CustomBigQueryStorageSource( 2729 pipeline_options=pcoll.pipeline.options, 2730 method=self.method, 2731 use_native_datetime=self.use_native_datetime, 2732 temp_table=temp_table_ref, 2733 bigquery_dataset_labels=self.bigquery_dataset_labels, 2734 *self._args, 2735 **self._kwargs)) 2736 | _PassThroughThenCleanupTempDatasets(project_to_cleanup_pcoll)) 2737 2738 2739 class ReadFromBigQueryRequest: 2740 """ 2741 Class that defines data to read from BQ. 2742 """ 2743 def __init__( 2744 self, 2745 query: str = None, 2746 use_standard_sql: bool = True, 2747 table: Union[str, TableReference] = None, 2748 flatten_results: bool = False): 2749 """ 2750 Only one of query or table should be specified. 2751 2752 :param query: SQL query to fetch data. 2753 :param use_standard_sql: 2754 Specifies whether to use BigQuery's standard SQL dialect for this query. 2755 The default value is :data:`True`. If set to :data:`False`, 2756 the query will use BigQuery's legacy SQL dialect. 2757 This parameter is ignored for table inputs. 2758 :param table: 2759 The ID of the table to read. Table should define project and dataset 2760 (ex.: ``'PROJECT:DATASET.TABLE'``). 2761 :param flatten_results: 2762 Flattens all nested and repeated fields in the query results. 2763 The default value is :data:`False`. 
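    For example, a single request configured with the parameters above
    (the query text is a placeholder)::

      request = ReadFromBigQueryRequest(
          query='SELECT country, COUNT(*) AS n FROM my_dataset.events GROUP BY country',
          use_standard_sql=True,
          flatten_results=False)

    Passing both ``table`` and ``query``, or neither, raises a ``ValueError``
    from :meth:`validate` at construction time.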
2764 """ 2765 self.flatten_results = flatten_results 2766 self.query = query 2767 self.use_standard_sql = use_standard_sql 2768 self.table = table 2769 self.validate() 2770 2771 # We use this internal object ID to generate BigQuery export directories. 2772 self.obj_id = random.randint(0, 100000) 2773 2774 def validate(self): 2775 if self.table is not None and self.query is not None: 2776 raise ValueError( 2777 'Both a BigQuery table and a query were specified.' 2778 ' Please specify only one of these.') 2779 elif self.table is None and self.query is None: 2780 raise ValueError('A BigQuery table or a query must be specified') 2781 if self.table is not None: 2782 if isinstance(self.table, str): 2783 assert self.table.find('.') != -1, ( 2784 'Expected a table reference ' 2785 '(PROJECT:DATASET.TABLE or DATASET.TABLE) instead of %s' 2786 % self.table) 2787 2788 2789 class ReadAllFromBigQuery(PTransform): 2790 """Read data from BigQuery. 2791 2792 PTransform:ReadFromBigQueryRequest->Rows 2793 2794 This PTransform uses a BigQuery export job to take a snapshot of the table 2795 on GCS, and then reads from each produced file. Data is exported into 2796 a new subdirectory for each export using UUIDs generated in 2797 `ReadFromBigQueryRequest` objects. 2798 2799 It is recommended not to use this PTransform for streaming jobs on 2800 the GlobalWindow, since it will not be able to clean up snapshots. 2801 2802 Args: 2803 gcs_location (str): The name of the Google Cloud Storage 2804 bucket where the extracted table should be written, given as a string. If 2805 :data:`None`, then the temp_location parameter is used. 2806 validate (bool): If :data:`True`, various checks will be done when source 2807 gets initialized (e.g., is table present?). 2808 kms_key (str): Experimental. Optional Cloud KMS key name for use when 2809 creating new temporary tables.
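  A minimal sketch of applying this transform (the bucket name is a
  placeholder, and ``requests`` is assumed to be a PCollection of
  `ReadFromBigQueryRequest` objects built upstream)::

    rows = (
        requests
        | ReadAllFromBigQuery(
            gcs_location='gs://my-bucket/tmp',
            bigquery_job_labels={'team': 'analytics'},
            query_priority=BigQueryQueryPriority.BATCH))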
2810 """ 2811 COUNTER = 0 2812 2813 def __init__( 2814 self, 2815 gcs_location: Union[str, ValueProvider] = None, 2816 validate: bool = False, 2817 kms_key: str = None, 2818 temp_dataset: Union[str, DatasetReference] = None, 2819 bigquery_job_labels: Dict[str, str] = None, 2820 query_priority: str = BigQueryQueryPriority.BATCH): 2821 if gcs_location: 2822 if not isinstance(gcs_location, (str, ValueProvider)): 2823 raise TypeError( 2824 '%s: gcs_location must be of type string' 2825 ' or ValueProvider; got %r instead' % 2826 (self.__class__.__name__, type(gcs_location))) 2827 2828 self.gcs_location = gcs_location 2829 self.validate = validate 2830 self.kms_key = kms_key 2831 self.bigquery_job_labels = bigquery_job_labels 2832 self.temp_dataset = temp_dataset 2833 self.query_priority = query_priority 2834 2835 def expand(self, pcoll): 2836 job_name = pcoll.pipeline.options.view_as(GoogleCloudOptions).job_name 2837 project = pcoll.pipeline.options.view_as(GoogleCloudOptions).project 2838 unique_id = str(uuid.uuid4())[0:10] 2839 2840 try: 2841 step_name = self.label 2842 except AttributeError: 2843 step_name = 'ReadAllFromBigQuery_%d' % ReadAllFromBigQuery.COUNTER 2844 ReadAllFromBigQuery.COUNTER += 1 2845 2846 sources_to_read, cleanup_locations = ( 2847 pcoll 2848 | beam.ParDo( 2849 _BigQueryReadSplit( 2850 options=pcoll.pipeline.options, 2851 gcs_location=self.gcs_location, 2852 bigquery_job_labels=self.bigquery_job_labels, 2853 job_name=job_name, 2854 step_name=step_name, 2855 unique_id=unique_id, 2856 kms_key=self.kms_key, 2857 project=project, 2858 temp_dataset=self.temp_dataset, 2859 query_priority=self.query_priority)).with_outputs( 2860 "location_to_cleanup", main="files_to_read") 2861 ) 2862 2863 return ( 2864 sources_to_read 2865 | SDFBoundedSourceReader(data_to_display=self.display_data()) 2866 | _PassThroughThenCleanup(beam.pvalue.AsIter(cleanup_locations)))