<!--
    Licensed to the Apache Software Foundation (ASF) under one
    or more contributor license agreements.  See the NOTICE file
    distributed with this work for additional information
    regarding copyright ownership.  The ASF licenses this file
    to you under the Apache License, Version 2.0 (the
    "License"); you may not use this file except in compliance
    with the License.  You may obtain a copy of the License at

      http://www.apache.org/licenses/LICENSE-2.0

    Unless required by applicable law or agreed to in writing,
    software distributed under the License is distributed on an
    "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
    KIND, either express or implied.  See the License for the
    specific language governing permissions and limitations
    under the License.
-->

# Example DataFrame API Pipelines

This module contains example pipelines that use the [Beam DataFrame
API](https://beam.apache.org/documentation/dsls/dataframes/overview/).

## Pre-requisites

You must have `apache-beam>=2.30.0` installed in order to run these pipelines,
because the `apache_beam.examples.dataframe` module was added in that release.
Using the DataFrame API also requires a compatible pandas version to be
installed; see the
[documentation](https://beam.apache.org/documentation/dsls/dataframes/overview/#pre-requisites)
for details.

## Wordcount Pipeline

Wordcount is the "Hello World" of data analytic systems, so of course we
had to implement it for the Beam DataFrame API! See [`wordcount.py`](./wordcount.py) for the
implementation. Note that it demonstrates how to integrate the DataFrame API with
a larger Beam pipeline by using [Beam
Schemas](https://beam.apache.org/documentation/programming-guide/#what-is-a-schema)
in conjunction with
[to_dataframe](https://beam.apache.org/releases/pydoc/current/apache_beam.dataframe.convert.html#apache_beam.dataframe.convert.to_dataframe)
and
[to_pcollection](https://beam.apache.org/releases/pydoc/current/apache_beam.dataframe.convert.html#apache_beam.dataframe.convert.to_pcollection).

### Running the pipeline

To run the pipeline locally:

```sh
python -m apache_beam.examples.dataframe.wordcount \
  --input gs://dataflow-samples/shakespeare/kinglear.txt \
  --output counts
```

This will produce files like `counts-XXXXX-of-YYYYY` with contents like:
```
KING: 243
LEAR: 236
DRAMATIS: 1
PERSONAE: 1
king: 65
of: 447
Britain: 2
OF: 15
FRANCE: 10
DUKE: 3
...
```

## Taxi Ride Example Pipelines

[`taxiride.py`](./taxiride.py) contains implementations for two DataFrame pipelines that
process the well-known [NYC Taxi
dataset](https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page). These
pipelines don't use any Beam primitives. Instead they build end-to-end pipelines
using the DataFrame API, by leveraging [DataFrame
IOs](https://beam.apache.org/releases/pydoc/current/apache_beam.dataframe.io.html).

The module defines two pipelines. The `location_id_agg` pipeline does a grouped
aggregation on the drop-off location ID. The `borough_enrich` pipeline extends
this example by joining the zone lookup table to find the borough where each
drop-off occurred, and aggregating per borough.
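For orientation, here is a minimal sketch of the `location_id_agg` approach written
purely with the DataFrame API. The input path and column names follow the staged
sample data described in the next section; the real `taxiride.py` additionally
handles argument parsing and the `borough_enrich` variant.

```python
import apache_beam as beam
from apache_beam.dataframe.io import read_csv

# Illustrative only: a condensed location_id_agg-style pipeline.
with beam.Pipeline() as p:
  # read_csv yields a deferred DataFrame backed by a PCollection.
  rides = p | read_csv('gs://apache-beam-samples/nyc_taxi/misc/sample.csv')

  # Familiar pandas-style aggregation, executed lazily by Beam:
  # total passengers per drop-off location.
  agg = rides.groupby('DOLocationID').passenger_count.sum()

  # Writes sharded files such as aggregation.csv-XXXXX-of-YYYYY.
  agg.to_csv('aggregation.csv')
```

The `borough_enrich` pipeline follows the same shape, except that it also reads
`taxi+_zone_lookup.csv`, merges it with the ride data on the drop-off location ID,
and groups by `Borough` instead.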
### Data
Some snapshots of NYC taxi data have been staged in
`gs://apache-beam-samples` for use with these example pipelines:

- `gs://apache-beam-samples/nyc_taxi/2017/yellow_tripdata_2017-*.csv`: CSV files
  containing taxi ride data for each month of 2017 (similar directories exist
  for 2018 and 2019).
- `gs://apache-beam-samples/nyc_taxi/misc/sample.csv`: A sample of 1 million
  records from the beginning of 2019. At ~85 MiB this is a manageable size for
  processing locally.
- `gs://apache-beam-samples/nyc_taxi/misc/taxi+_zone_lookup.csv`: Lookup table
  with information about Zone IDs. Used by the `borough_enrich` pipeline.

### Running `location_id_agg`
To run the aggregation pipeline locally, use the following command:
```sh
python -m apache_beam.examples.dataframe.taxiride \
  --pipeline location_id_agg \
  --input gs://apache-beam-samples/nyc_taxi/misc/sample.csv \
  --output aggregation.csv
```

This will write the output to files like `aggregation.csv-XXXXX-of-YYYYY` with
contents like:
```
DOLocationID,passenger_count
1,3852
3,130
4,7725
5,24
6,37
7,7429
8,24
9,180
10,938
...
```

### Running `borough_enrich`
To run the enrichment pipeline locally, use the following command:
```sh
python -m apache_beam.examples.dataframe.taxiride \
  --pipeline borough_enrich \
  --input gs://apache-beam-samples/nyc_taxi/misc/sample.csv \
  --output enrich.csv
```

This will write the output to files like `enrich.csv-XXXXX-of-YYYYY` with
contents like:
```
Borough,passenger_count
Bronx,13645
Brooklyn,70654
EWR,3852
Manhattan,1417124
Queens,81138
Staten Island,531
Unknown,28527
```

## Flight Delay pipeline (added in 2.31.0)
[`flight_delays.py`](./flight_delays.py) contains an implementation of
a pipeline that processes the flight on-time data from
`bigquery-samples.airline_ontime_data.flights`. It uses a conventional Beam
pipeline to read from BigQuery, apply a 24-hour rolling window, and define a
Beam schema for the data. Then it converts to DataFrames in order to perform
a complex aggregation using `GroupBy.apply`, and writes the result out with
`to_csv`. Note that the DataFrame computation respects the 24-hour window
applied above, and results are partitioned into separate files per day.

### Running the pipeline
To run the pipeline locally:

```sh
python -m apache_beam.examples.dataframe.flight_delays \
  --start_date 2012-12-24 \
  --end_date 2012-12-25 \
  --output gs://<bucket>/<dir>/delays.csv \
  --project <gcp-project> \
  --temp_location gs://<bucket>/<dir>
```

Note that a GCP `project` and `temp_location` are required for reading from BigQuery.

This will produce files like
`gs://<bucket>/<dir>/delays.csv-2012-12-24T00:00:00-2012-12-25T00:00:00-XXXXX-of-YYYYY`
with contents tracking average delays per airline on that day, for example:
```
airline,departure_delay,arrival_delay
EV,10.01901901901902,4.431431431431432
HA,-1.0829015544041452,0.010362694300518135
UA,19.142555438225976,11.07180570221753
VX,62.755102040816325,62.61224489795919
WN,12.074298711144806,6.717968157695224
...
```
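To make the `to_dataframe` + `GroupBy.apply` pattern concrete, here is a small
self-contained sketch of the same idea. It substitutes a tiny in-memory
PCollection of `beam.Row`s for the BigQuery read and omits the windowing, so the
column values and file name here are purely illustrative, not the actual
`flight_delays.py` code.

```python
import apache_beam as beam
from apache_beam.dataframe.convert import to_dataframe


def get_mean_delays(group):
  # Runs as ordinary pandas on each per-airline group.
  return group[['departure_delay', 'arrival_delay']].mean()


with beam.Pipeline() as p:
  # Stand-in for the BigQuery read; beam.Row gives the PCollection a schema.
  flights = p | beam.Create([
      beam.Row(airline='EV', departure_delay=10.0, arrival_delay=4.4),
      beam.Row(airline='EV', departure_delay=12.0, arrival_delay=5.2),
      beam.Row(airline='HA', departure_delay=-1.0, arrival_delay=0.0),
  ])

  # Convert the schema'd PCollection to a deferred DataFrame, then run a
  # pandas-style grouped aggregation and write the result as CSV shards.
  df = to_dataframe(flights)
  means = df.groupby('airline').apply(get_mean_delays)
  means.to_csv('delays.csv')
```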