
# Splitting Data for Distributed Processing

Before you read this section, make sure that you understand
the concepts described in
[Distributed Computing](../../concepts/advanced-concepts/distributed_computing.md).

Pachyderm enables you to parallelize computations over data as long as
that data can be split into multiple *datums*. However, in many
cases, you might have a dataset that you want or need to commit
into Pachyderm as a single file rather than as a collection of smaller
files that map easily to datums, such as one file per record.
For such cases, Pachyderm provides an easy way to prepare your dataset
for subsequent distributed computing by splitting it when you upload it
to a Pachyderm repository.
    15  
In this example, you have a dataset that consists of information about your
users and a repository called `users`.
This data is in `CSV` format in a single file called `user_data.csv`
with one record per line:

```
head user_data.csv
1,cyukhtin0@stumbleupon.com,144.155.176.12
2,csisneros1@over-blog.com,26.119.26.5
3,jeye2@instagram.com,13.165.230.106
4,rnollet3@hexun.com,58.52.147.83
5,bposkitt4@irs.gov,51.247.120.167
6,vvenmore5@hubpages.com,161.189.245.212
7,lcoyte6@ask.com,56.13.147.134
8,atuke7@psu.edu,78.178.247.163
9,nmorrell8@howstuffworks.com,28.172.10.170
10,afynn9@google.com.au,166.14.112.65
```
    34  
If you put this data into Pachyderm as a single
file, Pachyderm processes it as a single datum.
It cannot process each of
these user records in parallel as separate datums.
You could manually separate
these user records into standalone files before you
commit them into the `users` repository, or split them in
a pipeline stage dedicated to this task.
However, Pachyderm provides an optimized way of completing
this task.
    45  
The `put file` API includes an option for splitting
the file into separate datums automatically. You can use
the `--split` flag with the `put file` command.

To complete this example, follow the steps below:
    51  
1. Create a `users` repository by running:

   ```shell
   pachctl create repo users
   ```

1. Create a file called `user_data.csv` with the
   contents listed above.

1. Put your `user_data.csv` file into Pachyderm and
   automatically split it into separate datums for each line:

   ```shell
   pachctl put file users@master -f user_data.csv --split line --target-file-datums 1
   ```

   The `--split line` argument specifies that Pachyderm
   splits this file into lines, and the `--target-file-datums 1`
   argument specifies that each resulting file must include
   at most one datum, that is, one line.

1. View the list of files in the master branch of the `users`
   repository:

   ```shell
   pachctl list file users@master
   ```

   **System Response:**

   ```shell
   NAME            TYPE SIZE
   user_data.csv   dir  5.346 KiB
   ```

   If you run the `pachctl list file` command for the master branch
   in the `users` repository, Pachyderm still shows `user_data.csv`
   as a single entity in the repo.
   However, this entity is now a directory that contains all
   of the split records.

1. To view detailed information about
   the `user_data.csv` file, run the command with the file name
   specified after a colon:

   ```shell
   pachctl list file users@master:user_data.csv
   ```

   **System Response:**

   ```shell
   NAME                             TYPE SIZE
   user_data.csv/0000000000000000   file 43 B
   user_data.csv/0000000000000001   file 39 B
   user_data.csv/0000000000000002   file 37 B
   user_data.csv/0000000000000003   file 34 B
   user_data.csv/0000000000000004   file 35 B
   user_data.csv/0000000000000005   file 41 B
   user_data.csv/0000000000000006   file 32 B
   etc...
   ```

   A pipeline that takes the repo `users` as input
   with a glob pattern of `/user_data.csv/*` can then process each
   user record, that is, each line in the CSV file, in parallel.
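As a sketch of how that glob fits into a pipeline, the following writes a minimal 1.x-style pipeline specification to a file. The pipeline name `user-mailer`, the `alpine` image, and the command are hypothetical placeholders; only the repo name and the `glob` value come from this example:

```shell
# Minimal pipeline spec sketch (hypothetical name, image, and command);
# the glob /user_data.csv/* turns each split line into its own datum.
cat > user-mailer-pipeline.json <<'EOF'
{
  "pipeline": {
    "name": "user-mailer"
  },
  "input": {
    "pfs": {
      "repo": "users",
      "glob": "/user_data.csv/*"
    }
  },
  "transform": {
    "image": "alpine",
    "cmd": ["sh", "-c", "cp /pfs/users/user_data.csv/* /pfs/out/"]
  }
}
EOF
```

You would then create the pipeline with `pachctl create pipeline -f user-mailer-pipeline.json`.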
   119  
### JSON and Text File Splitting Examples

Pachyderm supports this type of splitting for lines and
JSON blobs as well. See the examples below.

* Split a `json` file on `json` blobs, placing each `json`
  blob into a separate file:

  ```shell
  pachctl put file users@master -f user_data.json --split json --target-file-datums 1
  ```

* Split a `json` file on `json` blobs, placing three `json`
  blobs into each split file:

  ```shell
  pachctl put file users@master -f user_data.json --split json --target-file-datums 3
  ```

* Split a file on lines, placing each 100-byte chunk of lines into
  a separate split file:

  ```shell
  pachctl put file users@master -f user_data.txt --split line --target-file-bytes 100
  ```
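To get a feel for what line splitting produces without a cluster, the following plain-shell sketch imitates `--split line --target-file-datums 1` locally. The zero-padded names mirror the ones Pachyderm generates, but this is an illustration, not Pachyderm code:

```shell
# Imitate --split line --target-file-datums 1: write each input line
# to its own file, named with the zero-padded scheme Pachyderm uses.
mkdir -p split_demo
printf '1,cyukhtin0@stumbleupon.com,144.155.176.12\n2,csisneros1@over-blog.com,26.119.26.5\n' \
  > split_demo/user_data.csv
i=0
while IFS= read -r line; do
  printf '%s\n' "$line" > "split_demo/$(printf '%016d' "$i")"
  i=$((i+1))
done < split_demo/user_data.csv
ls split_demo   # 0000000000000000  0000000000000001  user_data.csv
```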
   145  
## Specifying a Header

If your data has a common header, you can specify it
manually by using `pachctl put file` with the `--header-records` flag.
You can use this functionality with JSON and CSV data.

To specify a header, complete the following steps:

1. Create a new data file or use an existing one, for example the `user_data.csv`
   from the section above, with the following header:
   156  
   ```shell
   NUMBER,EMAIL,IP_ADDRESS
   ```

1. Create a new repository or use an existing one:

   ```shell
   pachctl create repo users
   ```

1. Put your file into the repository, separating the header from
   the other lines:

   ```shell
   pachctl put file users@master -f user_data.csv --split=csv --header-records=1 --target-file-datums=1
   ```

1. Verify that the file was added and split:

   ```shell
   pachctl list file users@master:/user_data.csv
   ```

   **Example:**

   ```shell
   NAME                            TYPE SIZE
   /user_data.csv/0000000000000000 file 70B
   /user_data.csv/0000000000000001 file 66B
   /user_data.csv/0000000000000002 file 64B
   /user_data.csv/0000000000000003 file 61B
   /user_data.csv/0000000000000004 file 62B
   /user_data.csv/0000000000000005 file 68B
   /user_data.csv/0000000000000006 file 59B
   /user_data.csv/0000000000000007 file 59B
   /user_data.csv/0000000000000008 file 71B
   /user_data.csv/0000000000000009 file 65B
   ```

1. Get the first file from the repository:

   ```shell
   pachctl get file users@master:/user_data.csv/0000000000000000
   ```

   **System Response:**

   ```shell
   NUMBER,EMAIL,IP_ADDRESS
   1,cyukhtin0@stumbleupon.com,144.155.176.12
   ```

1. Get all files:

   ```shell
   pachctl get file users@master:/user_data.csv/*
   ```

   **System Response:**

   ```csv
   NUMBER,EMAIL,IP_ADDRESS
   1,cyukhtin0@stumbleupon.com,144.155.176.12
   2,csisneros1@over-blog.com,26.119.26.5
   3,jeye2@instagram.com,13.165.230.106
   4,rnollet3@hexun.com,58.52.147.83
   5,bposkitt4@irs.gov,51.247.120.167
   6,vvenmore5@hubpages.com,161.189.245.212
   7,lcoyte6@ask.com,56.13.147.134
   8,atuke7@psu.edu,78.178.247.163
   9,nmorrell8@howstuffworks.com,28.172.10.170
   10,afynn9@google.com.au,166.14.112.65
   ```

For more information, type `pachctl put file --help`.
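Conceptually, the header is stored once next to the split files and prepended whenever you get one of them. The following plain-shell sketch models that behavior; it is an illustration of the semantics, not of Pachyderm's internals:

```shell
# Model of header handling: the header is stored once, and a "get"
# of any split file returns the header followed by that file's rows.
mkdir -p header_demo
printf 'NUMBER,EMAIL,IP_ADDRESS\n' > header_demo/header
printf '1,cyukhtin0@stumbleupon.com,144.155.176.12\n' > header_demo/0000000000000000
printf '2,csisneros1@over-blog.com,26.119.26.5\n' > header_demo/0000000000000001

# Emulated "get file": header plus the requested split file.
get_file() { cat header_demo/header "header_demo/$1"; }
get_file 0000000000000000
```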
   231  
## Ingesting PostgreSQL Data

Pachyderm supports direct data ingestion from PostgreSQL.
You first need to extract your database into a script file
by using `pg_dump` and then add the data from the file
into Pachyderm by running `pachctl put file` with the
`--split` flag.

When you use `pachctl put file --split sql ...`, Pachyderm
splits your `pgdump` file into three parts: the header, the rows,
and the footer. The header contains all the SQL statements
in the `pgdump` file that set up the schema and tables.
The rows are split into individual files, or, if you specify
`--target-file-datums` or `--target-file-bytes`, into multiple
rows per file. The footer contains the remaining
SQL statements for setting up the tables.

The header and footer are stored in the directory that contains
the rows. If you request a `get file` on that directory, you
get just the header and footer. If you request an individual
file, you get the header, the row or rows, and the footer.
If you request all the files with a glob pattern, for example
`/directoryname/*`, you receive the header, all the rows, and
the footer, recreating the full `pgdump`. Therefore, you can
construct full or partial `pgdump` files so that you can
load full or partial datasets.
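This layout can be modeled locally with plain shell. In the following sketch, the header and footer sit next to the row files, and concatenating header, rows, and footer recreates a loadable dump section; file contents are illustrative placeholders:

```shell
# Model of the --split sql layout: a header (schema and COPY statement),
# one file per row, and a footer (the COPY terminator). Contents here
# are illustrative placeholders.
mkdir -p sql_demo
printf 'COPY public.users (id, name, saying) FROM stdin;\n' > sql_demo/header
printf '0\twile E Coyote\t...\n' > sql_demo/0000000000000000
printf '1\troad runner\tmeep meep\n' > sql_demo/0000000000000001
printf '\\.\n' > sql_demo/footer

# Glob-style reassembly: header + all rows + footer is a full dump section.
cat sql_demo/header sql_demo/0000000000000000 sql_demo/0000000000000001 sql_demo/footer \
  > sql_demo/full.sql
```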
   258  
To put your PostgreSQL data into Pachyderm, complete the following
steps:

1. Generate a `pgdump` file:

   **Example:**

   ```shell
   pg_dump -t users -f users.pgdump
   ```
   269  
2. View the `pgdump` file:

   ```shell
   cat users.pgdump
   ```

   **System Response:**

   ```shell
   --
   -- PostgreSQL database dump
   --

   -- Dumped from database version 9.5.12
   -- Dumped by pg_dump version 9.5.12

   SET statement_timeout = 0;
   SET lock_timeout = 0;
   SET client_encoding = 'UTF8';
   SET standard_conforming_strings = on;
   SELECT pg_catalog.set_config('search_path', '', false);
   SET check_function_bodies = false;
   SET client_min_messages = warning;
   SET row_security = off;

   SET default_tablespace = '';

   SET default_with_oids = false;

   --
   -- Name: users; Type: TABLE; Schema: public; Owner: postgres
   --

   CREATE TABLE public.users (
       id integer NOT NULL,
       name text NOT NULL,
       saying text NOT NULL
   );


   ALTER TABLE public.users OWNER TO postgres;

   --
   -- Data for Name: users; Type: TABLE DATA; Schema: public; Owner: postgres
   --

   COPY public.users (id, name, saying) FROM stdin;
   0	wile E Coyote	...
   1	road runner	\\.
   \.


   --
   -- PostgreSQL database dump complete
   --
   ```

3. Ingest the SQL data by using the `pachctl put file` command
   with the `--split` flag:

   ```shell
   pachctl put file data@master -f users.pgdump --split sql
   ```

   To store the split data under a specific path, for example `users`
   as the following steps assume, specify the target path after a colon:

   ```shell
   pachctl put file data@master:users --split sql -f users.pgdump
   ```
   337  
4. View the information about your repository:

   ```shell
   pachctl list file data@master
   ```

   **System Response:**

   ```shell
   NAME         TYPE SIZE
   users        dir  914B
   ```

   The split `pgdump` data is added as the `users` directory on the
   master branch of the `data` repository.

5. View the information about the `users` directory:

   ```shell
   pachctl list file data@master:users
   ```

   **System Response:**

   ```shell
   NAME                           TYPE SIZE
   /users/0000000000000000        file 20B
   /users/0000000000000001        file 18B
   ```
   368  
6. In your pipeline, where you have started and forked PostgreSQL,
   you can load the data by running the following or a similar script:

   ```shell
   cat /pfs/data/users/* | sudo -u postgres psql
   ```

   By using the glob pattern `/*`, this script loads each raw PostgreSQL
   chunk into your PostgreSQL instance for processing by your pipeline.

!!! tip
    For this use case, you might want to use `--target-file-datums` or
    `--target-file-bytes` because these flags enable your queries to run
    against many rows at a time.
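As a sketch of how the loading step above might be wired into a pipeline, the following writes a hypothetical 1.x-style pipeline spec. The name `load-users` and the `postgres:9.5` image are placeholders, and a real pipeline needs a running PostgreSQL instance in the worker container:

```shell
# Hypothetical spec for a pipeline that loads each pgdump chunk;
# the glob /users/* makes every split row file a separate datum,
# and the transform pipes it into psql as in the step above.
cat > load-users-pipeline.json <<'EOF'
{
  "pipeline": {
    "name": "load-users"
  },
  "input": {
    "pfs": {
      "repo": "data",
      "glob": "/users/*"
    }
  },
  "transform": {
    "image": "postgres:9.5",
    "cmd": ["sh", "-c", "cat /pfs/data/users/* | sudo -u postgres psql"]
  }
}
EOF
```

You would then create the pipeline with `pachctl create pipeline -f load-users-pipeline.json`.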