<!--
    Licensed to the Apache Software Foundation (ASF) under one
    or more contributor license agreements.  See the NOTICE file
    distributed with this work for additional information
    regarding copyright ownership.  The ASF licenses this file
    to you under the Apache License, Version 2.0 (the
    "License"); you may not use this file except in compliance
    with the License.  You may obtain a copy of the License at

      http://www.apache.org/licenses/LICENSE-2.0

    Unless required by applicable law or agreed to in writing,
    software distributed under the License is distributed on an
    "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
    KIND, either express or implied.  See the License for the
    specific language governing permissions and limitations
    under the License.
-->

# Example DataFrame API Pipelines

This module contains example pipelines that use the [Beam DataFrame
API](https://beam.apache.org/documentation/dsls/dataframes/overview/).

## Pre-requisites

You must have `apache-beam>=2.30.0` installed in order to run these pipelines,
because the `apache_beam.examples.dataframe` module was added in that release.
Using the DataFrame API also requires a compatible pandas version to be
installed; see the
[documentation](https://beam.apache.org/documentation/dsls/dataframes/overview/#pre-requisites)
for details.

## Wordcount Pipeline

Wordcount is the "Hello World" of data analytics systems, so of course we
had to implement it for the Beam DataFrame API! See [`wordcount.py`](./wordcount.py) for the
implementation. Note that it demonstrates how to integrate the DataFrame API with
a larger Beam pipeline by using [Beam
Schemas](https://beam.apache.org/documentation/programming-guide/#what-is-a-schema)
in conjunction with
[to_dataframe](https://beam.apache.org/releases/pydoc/current/apache_beam.dataframe.convert.html#apache_beam.dataframe.convert.to_dataframe)
and
[to_pcollection](https://beam.apache.org/releases/pydoc/current/apache_beam.dataframe.convert.html#apache_beam.dataframe.convert.to_pcollection).
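
To give a sense of that integration, here is a minimal, self-contained sketch
(toy data rather than the actual `wordcount.py` code): a schema'd PCollection
of `beam.Row` elements is converted to a deferred DataFrame, aggregated with
pandas idioms, and converted back for further Beam transforms.

```python
import apache_beam as beam
from apache_beam.dataframe.convert import to_dataframe, to_pcollection

with beam.Pipeline() as p:
  words = (
      p
      | beam.Create(['king', 'lear', 'king'])
      # beam.Row attaches a schema, which to_dataframe uses to infer
      # the DataFrame's columns.
      | beam.Map(lambda word: beam.Row(word=word, count=1)))

  df = to_dataframe(words)           # a deferred, pandas-like DataFrame
  totals = df.groupby('word').sum()  # familiar pandas operations

  # Back to a PCollection (of Rows again) to mix in Beam transforms.
  _ = (
      to_pcollection(totals, include_indexes=True)
      | beam.Map(print))
```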

### Running the pipeline

To run the pipeline locally:

```sh
python -m apache_beam.examples.dataframe.wordcount \
  --input gs://dataflow-samples/shakespeare/kinglear.txt \
  --output counts
```

This will produce files like `counts-XXXXX-of-YYYYY` with contents like:
```
KING: 243
LEAR: 236
DRAMATIS: 1
PERSONAE: 1
king: 65
of: 447
Britain: 2
OF: 15
FRANCE: 10
DUKE: 3
...
```

## Taxi Ride Example Pipelines

[`taxiride.py`](./taxiride.py) contains implementations of two DataFrame pipelines that
process the well-known [NYC Taxi
dataset](https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page). These
pipelines don't use any Beam primitives. Instead, they build end-to-end pipelines
using the DataFrame API by leveraging [DataFrame
IOs](https://beam.apache.org/releases/pydoc/current/apache_beam.dataframe.io.html).

The module defines two pipelines. The `location_id_agg` pipeline does a grouped
aggregation on the drop-off location ID. The `borough_enrich` pipeline extends
this example by joining against the zone lookup table to find the borough where
each drop-off occurred and aggregating per borough.
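
At its core, the aggregation pipeline reads the ride data with `read_csv`,
groups by drop-off location, and writes the summed passenger counts back out,
roughly like this simplified sketch (paths are illustrative; the column names
come from the dataset):

```python
import apache_beam as beam
from apache_beam.dataframe.io import read_csv

with beam.Pipeline() as p:
  # read_csv yields a deferred DataFrame directly; no to_dataframe needed.
  rides = p | read_csv('gs://apache-beam-samples/nyc_taxi/misc/sample.csv')

  # Sum passenger counts per drop-off location and write the result
  # using the DataFrame to_csv IO.
  agg = rides.groupby('DOLocationID').passenger_count.sum()
  agg.to_csv('aggregation.csv')
```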

### Data
Some snapshots of NYC taxi data have been staged in
`gs://apache-beam-samples` for use with these example pipelines:

- `gs://apache-beam-samples/nyc_taxi/2017/yellow_tripdata_2017-*.csv`: CSV files
  containing taxi ride data for each month of 2017 (similar directories exist
  for 2018 and 2019).
- `gs://apache-beam-samples/nyc_taxi/misc/sample.csv`: A sample of 1 million
  records from the beginning of 2019. At ~85 MiB this is a manageable size for
  processing locally.
- `gs://apache-beam-samples/nyc_taxi/misc/taxi+_zone_lookup.csv`: Lookup table
  with information about Zone IDs. Used by the `borough_enrich` pipeline.

### Running `location_id_agg`
To run the aggregation pipeline locally, use the following command:
```sh
python -m apache_beam.examples.dataframe.taxiride \
  --pipeline location_id_agg \
  --input gs://apache-beam-samples/nyc_taxi/misc/sample.csv \
  --output aggregation.csv
```

This will write the output to files like `aggregation.csv-XXXXX-of-YYYYY` with
contents like:
```
DOLocationID,passenger_count
1,3852
3,130
4,7725
5,24
6,37
7,7429
8,24
9,180
10,938
...
```

### Running `borough_enrich`
To run the enrichment pipeline locally, use the following command:
```sh
python -m apache_beam.examples.dataframe.taxiride \
  --pipeline borough_enrich \
  --input gs://apache-beam-samples/nyc_taxi/misc/sample.csv \
  --output enrich.csv
```

This will write the output to files like `enrich.csv-XXXXX-of-YYYYY` with
contents like:
```
Borough,passenger_count
Bronx,13645
Brooklyn,70654
EWR,3852
Manhattan,1417124
Queens,81138
Staten Island,531
Unknown,28527
```
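
Under the hood, the enrichment is essentially a pandas-style `merge` against
the zone lookup table followed by the same kind of grouped aggregation, along
the lines of this simplified sketch (the `LocationID` and `Borough` columns
come from the lookup table; paths are illustrative, and the exact join in
`taxiride.py` may differ):

```python
import apache_beam as beam
from apache_beam.dataframe.io import read_csv

with beam.Pipeline() as p:
  rides = p | 'ReadRides' >> read_csv(
      'gs://apache-beam-samples/nyc_taxi/misc/sample.csv')
  zones = p | 'ReadZones' >> read_csv(
      'gs://apache-beam-samples/nyc_taxi/misc/taxi+_zone_lookup.csv')

  # Join each ride to its drop-off zone to recover the borough.
  rides = rides.merge(
      zones[['LocationID', 'Borough']].rename(
          columns={'LocationID': 'DOLocationID'}),
      on='DOLocationID')

  # Aggregate passenger counts per borough and write the result.
  agg = rides.groupby('Borough').passenger_count.sum()
  agg.to_csv('enrich.csv')
```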

## Flight Delay pipeline (added in 2.31.0)
[`flight_delays.py`](./flight_delays.py) contains an implementation of
a pipeline that processes the flight on-time data from
`bigquery-samples.airline_ontime_data.flights`. It uses a conventional Beam
pipeline to read from BigQuery, apply a 24-hour rolling window, and define a
Beam schema for the data. It then converts to DataFrames in order to perform
a complex aggregation using `GroupBy.apply`, and writes the result out with
`to_csv`. Note that the DataFrame computation respects the 24-hour window
applied above, and results are partitioned into separate files per day.
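
The DataFrame portion of the pipeline has roughly the following shape. This is
a simplified sketch with toy in-memory data; the real example reads windowed
rows from BigQuery and computes a more elaborate per-airline aggregation.

```python
import apache_beam as beam
from apache_beam.dataframe.convert import to_dataframe

with beam.Pipeline() as p:
  # Stand-in for the windowed, schema'd PCollection read from BigQuery.
  flights = p | beam.Create([
      beam.Row(airline='EV', departure_delay=10.0, arrival_delay=4.4),
      beam.Row(airline='EV', departure_delay=9.0, arrival_delay=4.5),
      beam.Row(airline='HA', departure_delay=-1.1, arrival_delay=0.0),
  ])

  df = to_dataframe(flights)

  def mean_delays(group):
    # Runs as real pandas code on each airline's rows.
    return group[['departure_delay', 'arrival_delay']].mean()

  # GroupBy.apply runs an arbitrary pandas function per group; with
  # windowed input it runs per group, per window, so each day's window
  # produces its own output files.
  result = df.groupby('airline').apply(mean_delays)
  result.to_csv('delays.csv')
```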

### Running the pipeline
To run the pipeline locally:

```sh
python -m apache_beam.examples.dataframe.flight_delays \
  --start_date 2012-12-24 \
  --end_date 2012-12-25 \
  --output gs://<bucket>/<dir>/delays.csv \
  --project <gcp-project> \
  --temp_location gs://<bucket>/<dir>
```

Note that a GCP `project` and `temp_location` are required for reading from BigQuery.

This will produce files like
`gs://<bucket>/<dir>/delays.csv-2012-12-24T00:00:00-2012-12-25T00:00:00-XXXXX-of-YYYYY`
with contents tracking average delays per airline on that day, for example:
```
airline,departure_delay,arrival_delay
EV,10.01901901901902,4.431431431431432
HA,-1.0829015544041452,0.010362694300518135
UA,19.142555438225976,11.07180570221753
VX,62.755102040816325,62.61224489795919
WN,12.074298711144806,6.717968157695224
...
```