github.com/pachyderm/pachyderm@v1.13.4/doc/docs/master/how-tos/splitting-data/adjusting_data_processing_w_split.md

# Adjusting Data Processing by Splitting Data

!!! info
    Before you read this section, make sure that you understand
    the concepts described in [File](../../concepts/data-concepts/file.md),
    [Glob Pattern](../../concepts/pipeline-concepts/datum/glob-pattern.md),
    [Pipeline Specification](../../reference/pipeline_spec.md), and
    [Developer Workflow](../developer-workflow/).

Unlike source code version-control systems, such as Git, that mostly
store and version text files, Pachyderm does not perform intra-file
diffing. This means that if you have a 10,000-line CSV file and
change a single line in that file, a pipeline that is subscribed
to the repository where that file is located processes the whole file.
You can adjust this behavior by splitting your file into chunks
when you load it.

Pachyderm applies diffing at the per-file level.
Therefore, if one bit of a file changes,
Pachyderm identifies that change as a new file.
Similarly, Pachyderm can only distribute computation
at the level of a single file. If your data is stored in
one large file, it can only be processed by a single worker, which
might affect performance.
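
The effect of per-file diffing can be illustrated with a small local model. The sketch below does not call Pachyderm; it assumes SHA-256 as a stand-in for whatever content hash the system uses, and the CSV rows are made up for illustration. It compares hashing a whole file with hashing each line separately after a one-line edit.

```python
import hashlib


def digest(data: bytes) -> str:
    """Content hash used as a stand-in for change detection (assumption: SHA-256)."""
    return hashlib.sha256(data).hexdigest()


# Hypothetical 10,000-line CSV, before and after editing a single row.
rows = [f"{i},value-{i}" for i in range(10_000)]
edited = list(rows)
edited[4_999] = "4999,changed"

# Treated as one file: any edit changes the hash of the whole file.
file_changed = digest("\n".join(rows).encode()) != digest("\n".join(edited).encode())

# Treated as one chunk per line: only the edited line hashes differently.
changed_lines = sum(
    digest(old.encode()) != digest(new.encode()) for old, new in zip(rows, edited)
)

print(file_changed, changed_lines)  # True 1
```

In this model, the single-file layout forces all 10,000 rows to be reprocessed, while the per-line layout isolates the change to one chunk.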

To optimize performance and processing time, you might want to
break up large files into smaller chunks while Pachyderm uploads
data to an input repository. For simple data types, you can
run the `pachctl put file` command with the `--split` flag. For
more complex splitting patterns, such as when you work with `.avro`
or other binary formats, you need to manually split your data
either at ingest or by configuring splitting in a Pachyderm
pipeline.

## Using Split and Target-File Flags

For common file types that are often used in data science, such as CSV,
line-delimited text files, and JavaScript Object Notation (JSON) files,
Pachyderm includes the `--split`, `--target-file-bytes`, and
`--target-file-datums` flags.

!!! note
    In this section, a chunk of data is called a *split-file*.

| Flag                   | Description                                           |
| ---------------------- | ----------------------------------------------------- |
| `--split`              | Divides a file into chunks based on a *record*, such as a line in a line-delimited file or a JSON object in a JSON file. The `--split` flag takes one of the following arguments: `line`, `json`, or `sql`. For example, `--split line` ensures that Pachyderm breaks up a file only on newline boundaries and not in the middle of a line. |
| `--target-file-bytes`  | This flag must be used with the `--split` flag. The `--target-file-bytes` flag fills each split-file with data up to the specified number of bytes, splitting on the nearest record boundary. For example, you have a line-delimited file of 50 lines, with each line having about 20 bytes. If you run `pachctl put file` with `--split line --target-file-bytes 100`, the input file is split into about 10 files of about 5 lines each. Each split-file's size stays at or slightly above the target value of 100 bytes; only the last split-file might be smaller than 100 bytes. |
| `--target-file-datums` | This flag must be used with the `--split` flag. The `--target-file-datums` flag attempts to fill each split-file with the specified number of datums. If you run `pachctl put file` with `--split line --target-file-datums 2` on a line-delimited file of 100 lines, the file is split into 50 split-files of 2 lines each. |

If you specify both the `--target-file-datums` and `--target-file-bytes` flags,
Pachyderm fills each split-file until it hits whichever constraint
comes first.
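
The interaction of the two target flags can be sketched as a small local model. This is not Pachyderm's implementation: the function `split_records` and its parameters are hypothetical names, and the model assumes that a split-file is closed as soon as it reaches either limit and that, with no limit set, every record becomes its own split-file, as in the examples in this section.

```python
def split_records(records, target_bytes=0, target_datums=0):
    """Group records into split-files; 0 means the limit is not set."""
    if not target_bytes and not target_datums:
        # Plain --split: one record per split-file.
        return [[record] for record in records]

    files, current, size = [], [], 0
    for record in records:
        current.append(record)
        size += len(record)
        hit_bytes = target_bytes > 0 and size >= target_bytes
        hit_datums = target_datums > 0 and len(current) >= target_datums
        if hit_bytes or hit_datums:
            # Close the current split-file on a record boundary.
            files.append(current)
            current, size = [], 0
    if current:
        # The last split-file may be smaller than the targets.
        files.append(current)
    return files


fifty = [b"x" * 19 + b"\n" for _ in range(50)]     # 50 lines of 20 bytes each
hundred = [b"x" * 19 + b"\n" for _ in range(100)]  # 100 lines of 20 bytes each

by_bytes = split_records(fifty, target_bytes=100)
by_datums = split_records(hundred, target_datums=2)
by_both = split_records(fifty, target_bytes=100, target_datums=2)

print(len(by_bytes), len(by_datums), len(by_both))  # 10 50 25
```

With both limits set, the two-datum limit is reached before the 100-byte limit for 20-byte lines, so the 50-line input ends up in 25 split-files in this model.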

!!! note "See also:"
    [Splitting Data for Distributed Processing](../splitting/#ingesting-postgressql-data)

### Example: Splitting a File

In this example, you have a 50-line file called `my-data.txt`.
You create a repository named `line-data` and load
`my-data.txt` into that repository. Then, you can analyze
how the data is split in the repository.

To complete this example, perform the following steps:

1. Create a file with fifty lines named `my-data.txt`. You can
   add random lines, such as numbers from one to fifty, or US states,
   or anything else.

   **Example:**

   ```shell
   One
   Two
   Three
   ...
   Fifty
   ```

1. Create a repository called `line-data` and load `my-data.txt`
   into it, splitting the file by line:

   ```shell
   pachctl create repo line-data
   pachctl put file line-data@master -f my-data.txt --split line
   ```

1. List the filesystem objects in the repository:

   ```shell
   pachctl list file line-data@master
   ```

   **System Response:**

   ```shell
   NAME         TYPE  SIZE
   /my-data.txt dir   1.071KiB
   ```

   The `pachctl list file` command shows
   that the line-oriented file `my-data.txt`
   has been transformed into a
   directory that contains the chunks of the original
   file. Each chunk is stored as a split-file
   with a 16-character filename, left-padded with zeros.
   Pachyderm numbers the filenames sequentially in hexadecimal.
   Listing the contents of the `my-data.txt` directory
   reveals the naming structure.

1. List the files in the `my-data.txt` directory:

   ```shell
   pachctl list file line-data@master:my-data.txt
   ```

   **System Response:**

   ```shell
   NAME                          TYPE  SIZE
   /my-data.txt/0000000000000000 file  21B
   /my-data.txt/0000000000000001 file  22B
   /my-data.txt/0000000000000002 file  24B
   /my-data.txt/0000000000000003 file  21B
   ...
   /my-data.txt/0000000000000031 file  22B
   ```
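
The naming scheme described in this example (16 characters, zero-padded, sequential hexadecimal) can be reproduced locally. The helper name `split_file_name` is hypothetical and the snippet only models the names; it does not query a repository.

```python
def split_file_name(index: int) -> str:
    """Format a zero-based chunk index as a 16-character, zero-padded hex name."""
    return f"{index:016x}"


# Names for the 50 split-files of my-data.txt, indexed 0 through 49.
names = [f"/my-data.txt/{split_file_name(i)}" for i in range(50)]

print(names[0])   # /my-data.txt/0000000000000000
print(names[10])  # /my-data.txt/000000000000000a
print(names[-1])  # /my-data.txt/0000000000000031
```

Because the numbering is hexadecimal, the fiftieth split-file (index 49) ends in `31`, which matches the last entry in the listing above.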

## Example: Appending to Files with `--split`

Combining `--split` with the default *append* behavior of
`pachctl put file` enables flexible and scalable processing of
record-oriented file data from external, legacy systems.

Pachyderm ensures that only the newly added data gets processed when
you append to an existing file by using `--split`. Pachyderm also
optimizes storage utilization by automatically deduplicating each
split-file. As a result, a large file with many duplicate lines,
or objects with identical hashes, might use less space in PFS
after splitting than it does as a single file outside of PFS.
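
The storage effect of deduplication can be approximated with a content-addressed dictionary. This is a simplified model, assuming that each distinct chunk is stored once under its hash; SHA-256 and the sample lines are illustrative choices, not details taken from PFS.

```python
import hashlib

# Hypothetical line-delimited file with many repeated records.
lines = ["ok\n", "ok\n", "error\n", "ok\n", "error\n"]

# Content-addressed store: identical chunks map to the same key.
objects = {}
for line in lines:
    key = hashlib.sha256(line.encode()).hexdigest()
    objects.setdefault(key, line)

raw_bytes = sum(len(line) for line in lines)
stored_bytes = sum(len(chunk) for chunk in objects.values())

print(raw_bytes, stored_bytes)  # 21 9
```

In this model, five records occupy 21 bytes as a single file but only 9 bytes once the duplicate chunks are stored a single time.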

To complete this example, follow these steps:

1. Create a file `count.txt` with the following lines:

   ```shell
   Zero
   One
   Two
   Three
   Four
   ```

1. Create a Pachyderm repository called `raw_data` and put the
   `count.txt` file into it:

   ```shell
   pachctl create repo raw_data
   pachctl put file -f count.txt raw_data@master --split line
   ```

   The `put file` command splits the `count.txt` file by line and creates
   a separate file for each line. Also, this operation
   creates five datums that are processed by the
   pipelines that use this repository as input.

1. View the repository contents:

   ```shell
   pachctl list file raw_data@master
   ```

   **System Response:**

   ```shell
   NAME       TYPE SIZE
   /count.txt dir  24B
   ```

   Pachyderm created a directory called `count.txt`.

1. View the contents of the `count.txt` directory:

   ```shell
   pachctl list file raw_data@master:count.txt
   ```

   **System Response:**

   ```shell
   NAME                        TYPE SIZE
   /count.txt/0000000000000000 file 5B
   /count.txt/0000000000000001 file 4B
   /count.txt/0000000000000002 file 4B
   /count.txt/0000000000000003 file 6B
   /count.txt/0000000000000004 file 5B
   ```

   In the output above, you can see that Pachyderm created five split-files
   from the original `count.txt` file. Each file has one line from the
   original `count.txt`. You can check the contents of each file by
   running the `pachctl get file` command. For example, to get
   the contents of `count.txt/0000000000000000`, run the following
   command:

   ```shell
   pachctl get file raw_data@master:count.txt/0000000000000000
   ```

   **System Response:**

   ```shell
   Zero
   ```

1. Create a one-line file called `more-count.txt` with the
   following content:

   ```shell
   Five
   ```

1. Load this file into Pachyderm by appending it to the `count.txt` file:

   ```shell
   pachctl put file raw_data@master:count.txt -f more-count.txt --split line
   ```

   * If you do not specify the `--split` flag while appending to
     a file that was previously split, Pachyderm displays the following
     error message:

     ```shell
     could not put file at "/count.txt"; a file of type directory is already there
     ```

1. Verify that another file was added:

   ```shell
   pachctl list file raw_data@master:count.txt
   ```

   **System Response:**

   ```shell
   NAME                        TYPE SIZE
   /count.txt/0000000000000000 file 5B
   /count.txt/0000000000000001 file 4B
   /count.txt/0000000000000002 file 4B
   /count.txt/0000000000000003 file 6B
   /count.txt/0000000000000004 file 5B
   /count.txt/0000000000000005 file 5B
   ```

   The `/count.txt/0000000000000005` file was added to the input
   repository. Pachyderm treats
   this new file as a separate datum. Therefore, pipelines process
   only that datum instead of all the chunks of `count.txt`.

1. Get the contents of the `/count.txt/0000000000000005` file:

   ```shell
   pachctl get file raw_data@master:count.txt/0000000000000005
   ```

   **System Response:**

   ```shell
   Five
   ```
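
The append behavior walked through above can be modeled locally by tracking each split-file as a name and a content hash. The function `split_by_line` is a hypothetical helper, SHA-256 is an assumed stand-in for the real hash, and the sketch assumes that appended records continue the existing numbering, as the listing above shows.

```python
import hashlib


def split_by_line(text, start=0):
    """Map split-file name -> content hash, one line per split-file."""
    return {
        f"/count.txt/{i:016x}": hashlib.sha256(line.encode()).hexdigest()
        for i, line in enumerate(text.splitlines(keepends=True), start=start)
    }


before = split_by_line("Zero\nOne\nTwo\nThree\nFour\n")

# Appending continues the numbering after the existing split-files.
after = {**before, **split_by_line("Five\n", start=len(before))}

# Only names that are new or whose hash changed need processing.
to_process = sorted(name for name, h in after.items() if before.get(name) != h)

print(to_process)  # ['/count.txt/0000000000000005']
```

The five original entries keep both their names and their hashes, so only the appended split-file shows up as new work in this model.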

## Example: Overwriting Files with `--split`

The behavior of Pachyderm when a file loaded with `--split` is
overwritten is simple to explain but subtle in its implications.
Most importantly, it matters when new rows
are inserted within the file as opposed to being appended to the end.
The loaded file is split into sequentially named files,
as shown above. If any resulting
split-file hashes differently than the one it replaces,
the Pachyderm Pipeline System processes that data again.
This can have significant consequences for downstream processing.

To complete this example, follow these steps:

1. Create a file `count.txt` with the following lines:

   ```shell
   One
   Two
   Three
   Four
   Five
   ```

1. Put the file into a Pachyderm repository called `raw_data`:

   ```shell
   pachctl put file -f count.txt raw_data@master --split line
   ```

   This command splits the `count.txt` file by line and creates
   a separate file for each line. Also, this operation
   creates five datums that are processed by the
   pipelines that use this repository as input.

1. View the repository contents:

   ```shell
   pachctl list file raw_data@master
   ```

   **System Response:**

   ```shell
   NAME       TYPE SIZE
   /count.txt dir  24B
   ```

   Pachyderm created a directory called `count.txt`.

1. View the contents of the `count.txt` directory:

   ```shell
   pachctl list file raw_data@master:count.txt
   ```

   **System Response:**

   ```shell
   NAME                        TYPE SIZE
   /count.txt/0000000000000000 file 4B
   /count.txt/0000000000000001 file 4B
   /count.txt/0000000000000002 file 6B
   /count.txt/0000000000000003 file 5B
   /count.txt/0000000000000004 file 5B
   ```

   In the output above, you can see that Pachyderm created five split-files
   from the original `count.txt` file. Each file has one line from the
   original `count.txt` file. You can check the contents of each file by
   running the `pachctl get file` command. For example, to get
   the contents of `count.txt/0000000000000000`, run the following
   command:

   ```shell
   pachctl get file raw_data@master:count.txt/0000000000000000
   ```

   **System Response:**

   ```shell
   One
   ```

1. In your local directory, modify the original `count.txt` file by
   inserting the word *Zero* on the first line:

   ```shell
   Zero
   One
   Two
   Three
   Four
   Five
   ```

1. Upload the updated `count.txt` file into the `raw_data` repository
   by using the `--split` and `--overwrite` flags:

   ```shell
   pachctl put file -f count.txt raw_data@master:count.txt --split line --overwrite
   ```

   Inserting a line at the top shifts every record into a
   split-file with a different name. Because Pachyderm takes the
   file name into account when hashing
   data for a pipeline, it considers every single split-file as new,
   and the pipelines that use this repository as input process all
   six datums.

1. List the files in the directory:

   ```shell
   pachctl list file raw_data@master:count.txt
   ```

   **System Response:**

   ```shell
   NAME                        TYPE SIZE
   /count.txt/0000000000000000 file 5B
   /count.txt/0000000000000001 file 4B
   /count.txt/0000000000000002 file 4B
   /count.txt/0000000000000003 file 6B
   /count.txt/0000000000000004 file 5B
   /count.txt/0000000000000005 file 5B
   ```

   The `/count.txt/0000000000000000` file now has the newly added `Zero` line.
   To verify the contents of the file, run:

   ```shell
   pachctl get file raw_data@master:count.txt/0000000000000000
   ```

   **System Response:**

   ```shell
   Zero
   ```
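
The overwrite case can be checked with a similar local model. The helper `split_files` is hypothetical and compares split-files by name and content only; it is a sketch of the reasoning above, not of Pachyderm's internals.

```python
def split_files(text):
    """Map split-file name -> content, one line per split-file."""
    return {
        f"/count.txt/{i:016x}": line
        for i, line in enumerate(text.splitlines(keepends=True))
    }


before = split_files("One\nTwo\nThree\nFour\nFive\n")
after = split_files("Zero\nOne\nTwo\nThree\nFour\nFive\n")

# A split-file counts as changed if its name is new or its content differs.
changed = sorted(name for name, content in after.items() if before.get(name) != content)

# Lines that still exist after the overwrite, but under a different name.
moved = set(before.values()) & set(after.values())

print(len(changed), len(after), len(moved))  # 6 6 5
```

All five original lines are still present, yet each one now sits under a different name, so in this model all six split-files count as changed. Appending new rows at the end, as in the previous example, avoids this shift.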

!!! note "See also:"
    [Splitting Data](splitting.md)