github.com/pachyderm/pachyderm@v1.13.4/doc/docs/1.9.x/how-tos/splitting-data/splitting.md (about)

# Splitting Data for Distributed Processing

Before you read this section, make sure that you understand
the concepts described in
[Distributed Computing](../distributed_computing.md).

Pachyderm enables you to parallelize computations over data as long as
that data can be split up into multiple *datums*. However, in many
cases, you might have a dataset that you want or need to commit
into Pachyderm as a single file rather than a bunch of smaller
files that are easily mapped to datums, such as one file per record.
For such cases, Pachyderm provides an easy way to prepare your dataset
for subsequent distributed computing by splitting it upon uploading
to a Pachyderm repository.

In this example, you have a dataset that consists of information about your
users and a repository called `users`.
This data is in CSV format in a single file called `user_data.csv`
with one record per line:

```
$ head user_data.csv
1,cyukhtin0@stumbleupon.com,144.155.176.12
2,csisneros1@over-blog.com,26.119.26.5
3,jeye2@instagram.com,13.165.230.106
4,rnollet3@hexun.com,58.52.147.83
5,bposkitt4@irs.gov,51.247.120.167
6,vvenmore5@hubpages.com,161.189.245.212
7,lcoyte6@ask.com,56.13.147.134
8,atuke7@psu.edu,78.178.247.163
9,nmorrell8@howstuffworks.com,28.172.10.170
10,afynn9@google.com.au,166.14.112.65
```

If you put this data into Pachyderm as a single
file, Pachyderm processes it as a single datum.
It cannot process each of
these user records in parallel as separate datums.
You could manually separate
these user records into standalone files before you
commit them into the `users` repository, or do so through
a pipeline stage dedicated to this splitting task.
However, Pachyderm provides an optimized way of completing
this task.

The `put file` API includes an option for splitting
the file into separate datums automatically. You can use
the `--split` flag with the `put file` command.

To complete this example, follow the steps below:

1. Create a `users` repository by running:

   ```shell
   $ pachctl create repo users
   ```

1. Create a file called `user_data.csv` with the
contents listed above.

1. Put your `user_data.csv` file into Pachyderm and
automatically split it into separate datums for each line:

   ```shell
   $ pachctl put file users@master -f user_data.csv --split line --target-file-datums 1
   ```

   The `--split line` argument specifies that Pachyderm
   splits this file into lines, and the `--target-file-datums 1`
   argument specifies that each resulting file must include
   at most one datum, that is, one line.

1. View the list of files in the master branch of the `users`
repository:

   ```shell
   $ pachctl list file users@master
   NAME            TYPE SIZE
   user_data.csv   dir  5.346 KiB
   ```

   If you run the `pachctl list file` command for the master branch
   in the `users` repository, Pachyderm
   still shows `user_data.csv` as a single
   entity in the repo.
   However, this entity is now a directory that contains all
   of the split records.

1. To view the detailed information about
the `user_data.csv` file, run the command with the file name
specified after a colon:

   ```shell
   $ pachctl list file users@master:user_data.csv
   NAME                             TYPE SIZE
   user_data.csv/0000000000000000   file 43 B
   user_data.csv/0000000000000001   file 39 B
   user_data.csv/0000000000000002   file 37 B
   user_data.csv/0000000000000003   file 34 B
   user_data.csv/0000000000000004   file 35 B
   user_data.csv/0000000000000005   file 41 B
   user_data.csv/0000000000000006   file 32 B
   etc...
   ```

   Then, a pipeline that takes the repo `users` as input
   with a glob pattern of `/user_data.csv/*` processes each
   user record, that is, each line in the CSV file, in parallel.
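
The splitting behavior above can be sketched in a few lines. The following
Python snippet is an illustrative model only, not Pachyderm's implementation:
it groups lines into files of at most `target_datums` lines each, with
zero-padded names that approximate the split-file names shown in the listing
(the exact naming scheme is an assumption here).

```python
def split_lines(text, target_datums=1):
    """Model of `--split line --target-file-datums N`: group lines
    into files of at most N lines, with zero-padded names."""
    lines = text.splitlines(keepends=True)
    return {
        # one entry per resulting split file
        f"{i // target_datums:016x}": "".join(lines[i:i + target_datums])
        for i in range(0, len(lines), target_datums)
    }

csv = (
    "1,cyukhtin0@stumbleupon.com,144.155.176.12\n"
    "2,csisneros1@over-blog.com,26.119.26.5\n"
)
parts = split_lines(csv)
# parts["0000000000000000"] holds the first record;
# a pipeline datum then maps to exactly one such file.
```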

### JSON and Text File Splitting Examples

Pachyderm supports this type of splitting for plain-text lines and
JSON blobs as well. See the examples below.

* Split a JSON file on JSON blobs by putting each JSON
blob into a separate file.

  ```shell
  $ pachctl put file users@master -f user_data.json --split json --target-file-datums 1
  ```

* Split a JSON file on JSON blobs by putting three JSON
blobs into each split file.

  ```shell
  $ pachctl put file users@master -f user_data.json --split json --target-file-datums 3
  ```

* Split a file on lines, putting each approximately 100-byte chunk of
lines into a separate split file.

  ```shell
  $ pachctl put file users@master -f user_data.txt --split line --target-file-bytes 100
  ```
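
As a rough sketch of the `--target-file-bytes` behavior (an illustrative
model, not Pachyderm's code), lines are accumulated into a chunk until the
byte target is reached, so each split file holds whole lines of roughly the
requested size:

```python
def split_by_bytes(text, target_bytes=100):
    """Model of `--split line --target-file-bytes N`: pack whole lines
    into chunks, closing a chunk once it reaches N bytes."""
    chunks, current = [], ""
    for line in text.splitlines(keepends=True):
        current += line
        if len(current.encode("utf-8")) >= target_bytes:
            chunks.append(current)
            current = ""
    if current:  # trailing chunk below the byte target
        chunks.append(current)
    return chunks
```

Note that no line is ever split across two chunks, which is what keeps each
split file a valid set of whole records.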

## Specifying a Header

If your data has a common header, you can specify it
manually by using `pachctl put file` with the `--header-records` flag.
You can use this functionality with JSON and CSV data.

To specify a header, complete the following steps:

1. Create a new data file or use an existing one. For example, use the
`user_data.csv` file from the section above with the following header:

   ```shell
   NUMBER,EMAIL,IP_ADDRESS
   ```

1. Create a new repository or use an existing one:

   ```shell
   $ pachctl create repo users
   ```

1. Put your file into the repository, separating the header from
the other lines:

   ```shell
   $ pachctl put file users@master -f user_data.csv --split=csv --header-records=1 --target-file-datums=1
   ```

1. Verify that the file was added and split:

   ```shell
   $ pachctl list file users@master:/user_data.csv
   ```

   **Example:**

   ```shell
   NAME                            TYPE SIZE
   /user_data.csv/0000000000000000 file 70B
   /user_data.csv/0000000000000001 file 66B
   /user_data.csv/0000000000000002 file 64B
   /user_data.csv/0000000000000003 file 61B
   /user_data.csv/0000000000000004 file 62B
   /user_data.csv/0000000000000005 file 68B
   /user_data.csv/0000000000000006 file 59B
   /user_data.csv/0000000000000007 file 59B
   /user_data.csv/0000000000000008 file 71B
   /user_data.csv/0000000000000009 file 65B
   ```

1. Get the first file from the repository:

   ```shell
   $ pachctl get file users@master:/user_data.csv/0000000000000000
   NUMBER,EMAIL,IP_ADDRESS
   1,cyukhtin0@stumbleupon.com,144.155.176.12
   ```

1. Get all files:

   ```shell
   $ pachctl get file users@master:/user_data.csv/*
   NUMBER,EMAIL,IP_ADDRESS
   1,cyukhtin0@stumbleupon.com,144.155.176.12
   2,csisneros1@over-blog.com,26.119.26.5
   3,jeye2@instagram.com,13.165.230.106
   4,rnollet3@hexun.com,58.52.147.83
   5,bposkitt4@irs.gov,51.247.120.167
   6,vvenmore5@hubpages.com,161.189.245.212
   7,lcoyte6@ask.com,56.13.147.134
   8,atuke7@psu.edu,78.178.247.163
   9,nmorrell8@howstuffworks.com,28.172.10.170
   10,afynn9@google.com.au,166.14.112.65
   ```
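
The header behavior in the steps above can be modeled with a short sketch
(illustrative only; conceptually, Pachyderm stores the header once and serves
it with every split file that is read):

```python
def split_with_header(text, header_records=1):
    """Model of `--split csv --header-records N --target-file-datums 1`:
    every split file is served with the shared header prepended."""
    lines = text.splitlines(keepends=True)
    header = "".join(lines[:header_records])
    return {
        # one split file per record, each carrying the header
        f"{i:016x}": header + line
        for i, line in enumerate(lines[header_records:])
    }

data = (
    "NUMBER,EMAIL,IP_ADDRESS\n"
    "1,cyukhtin0@stumbleupon.com,144.155.176.12\n"
)
parts = split_with_header(data)
# Each retrieved file begins with the shared header line.
```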

For more information, type `pachctl put file --help`.

## Ingesting PostgreSQL data

Pachyderm supports direct data ingestion from PostgreSQL.
You first need to extract your database into a script file
by using `pg_dump` and then add the data from that file
into Pachyderm by running `pachctl put file` with the
`--split` flag.

When you use `pachctl put file --split sql ...`, Pachyderm
splits your `pgdump` file into three parts - the header, rows,
and the footer. The header contains all the SQL statements
in the `pgdump` file that set up the schema and tables.
The rows are split into individual files, or, if you specify
the `--target-file-datums` or `--target-file-bytes` flag, multiple
rows per file. The footer contains the remaining
SQL statements for setting up the tables.

The header and footer are stored in the directory that contains
the rows. If you request a `get file` on that directory, you
get just the header and footer. If you request an individual
file, you see the header, the row or rows, and the footer.
If you request all the files with a glob pattern, for example,
`/directoryname/*`, you receive the header, all the rows, and
the footer, recreating the full `pgdump`. Therefore, you can
construct full or partial `pgdump` files so that you can
load full or partial datasets.
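
These read semantics can be summarized with a small model (illustrative
only, not Pachyderm's implementation; the path forms are simplified):

```python
def read_split_sql(header, rows, footer, path):
    """Model of `get file` on a `--split sql` directory: the directory
    yields header + footer, a single split file yields header + its
    rows + footer, and a glob reconstructs the full dump."""
    if path == "/":        # the directory itself
        return header + footer
    if path == "/*":       # glob over all split files
        return header + "".join(rows) + footer
    index = int(path.lstrip("/"), 16)   # e.g. "/0000000000000001"
    return header + rows[index] + footer
```

Either way, every result is itself a loadable SQL script, which is what makes
partial dataset loads possible.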

To put your PostgreSQL data into Pachyderm, complete the following
steps:

1. Generate a `pgdump` file:

   **Example:**

   ```shell
   $ pg_dump -t users -f users.pgdump
   ```

1. View the `pgdump` file:

???+ note "Example"

    ```shell
    $ cat users.pgdump
    --
    -- PostgreSQL database dump
    --

    -- Dumped from database version 9.5.12
    -- Dumped by pg_dump version 9.5.12

    SET statement_timeout = 0;
    SET lock_timeout = 0;
    SET client_encoding = 'UTF8';
    SET standard_conforming_strings = on;
    SELECT pg_catalog.set_config('search_path', '', false);
    SET check_function_bodies = false;
    SET client_min_messages = warning;
    SET row_security = off;

    SET default_tablespace = '';

    SET default_with_oids = false;

    --
    -- Name: users; Type: TABLE; Schema: public; Owner: postgres
    --

    CREATE TABLE public.users (
        id integer NOT NULL,
        name text NOT NULL,
        saying text NOT NULL
    );


    ALTER TABLE public.users OWNER TO postgres;

    --
    -- Data for Name: users; Type: TABLE DATA; Schema: public; Owner: postgres
    --

    COPY public.users (id, name, saying) FROM stdin;
    0	wile E Coyote	...
    1	road runner	\\.
    \.


    --
    -- PostgreSQL database dump complete
    --
    ```

3.  Ingest the SQL data by using the `pachctl put file` command
    with the `--split` flag:

    ```shell
    $ pachctl put file data@master -f users.pgdump --split sql
    $ pachctl put file data@master:users --split sql -f users.pgdump
    ```

    The second form writes the split data under the `users` path,
    which the remaining steps assume.

4. View the information about your repository:

   ```shell
   $ pachctl list file data@master
   NAME         TYPE SIZE
   users        dir  914B
   ```

   The contents of `users.pgdump` are added under the `users` path
   on the master branch of the `data` repository.

5. View the information about the split `users` files:

   ```shell
   $ pachctl list file data@master:users
   NAME                           TYPE SIZE
   /users/0000000000000000        file 20B
   /users/0000000000000001        file 18B
   ```

6. In your pipeline, where you have started and forked PostgreSQL,
you can load the data by running the following or a similar script:

   ```shell
   $ cat /pfs/data/users/* | sudo -u postgres psql
   ```

   By using the glob pattern `/*`, this command loads each raw PostgreSQL chunk
   into your PostgreSQL instance for processing by your pipeline.

!!! tip
    For this use case, you might want to use `--target-file-datums` or
    `--target-file-bytes` because these flags enable your queries to run
    against many rows at a time.