github.com/pachyderm/pachyderm@v1.13.4/doc/docs/1.9.x/how-tos/splitting-data/adjusting_data_processing_w_split.md

# Adjusting Data Processing by Splitting Data

!!! info
    Before you read this section, make sure that you understand
    the concepts described in [File](../../concepts/data-concepts/file.md),
    [Glob Pattern](../../concepts/pipeline-concepts/datum/glob-pattern.md),
    [Pipeline Specification](../../reference/pipeline_spec.md), and
    [Individual Developer Workflow](../individual-developer-workflow.md).

Unlike source code version-control systems, such as Git, that mostly
store and version text files, Pachyderm does not perform intra-file
diffing. This means that if you have a 10,000-line CSV file and
change a single line in that file, a pipeline that is subscribed
to the repository where that file is located processes the whole
file again. You can adjust this behavior by splitting your file into
chunks when you load it.

Pachyderm applies diffing at the per-file level.
Therefore, if a single bit of a file changes,
Pachyderm treats the whole file as new.
Similarly, Pachyderm can only distribute computation
at the level of a single file. If your data is stored in
one large file, it can only be processed by a single worker, which
might affect performance.
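
The effect of per-file diffing can be approximated locally with standard
tools. The following sketch does not call Pachyderm. It assumes a scratch
directory, hypothetical file names (`original.csv`, `edited.csv`), and an
arbitrary chunk size of 1,000 lines, and it uses `cmp` as a stand-in for
content hashing.

```shell
# Work in a scratch directory so that no other files are touched.
workdir=$(mktemp -d)
cd "$workdir"

# A 10,000-line file and a copy in which exactly one line is changed.
seq 1 10000 > original.csv
sed '5000s/.*/changed/' original.csv > edited.csv

# Whole-file comparison: the two versions differ, so a per-file
# diff has to treat the entire file as new.
if cmp -s original.csv edited.csv; then
  whole_file=same
else
  whole_file=changed
fi

# Cut both versions into 1,000-line chunks and compare chunk by chunk.
split -l 1000 original.csv orig_
split -l 1000 edited.csv edit_
changed_chunks=0
for f in orig_*; do
  suffix=${f#orig_}
  cmp -s "$f" "edit_$suffix" || changed_chunks=$((changed_chunks + 1))
done

echo "whole file: $whole_file, changed chunks: $changed_chunks of 10"
```

In this model, only one of the ten chunks differs after the edit, which is
the situation that splitting at load time aims for.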

To optimize performance and processing time, you might want to
break up large files into smaller chunks while Pachyderm uploads
data to an input repository. For simple data types, you can
run the `pachctl put file` command with the `--split` flag. For
more complex splitting patterns, such as when you work with `.avro`
or other binary formats, you need to split your data manually,
either at ingest or by configuring splitting in a Pachyderm
pipeline.

## Using Split and Target-File Flags

For common file types that are often used in data science, such as CSV,
line-delimited text files, and JavaScript Object Notation (JSON) files,
Pachyderm includes the `--split`, `--target-file-bytes`, and
`--target-file-datums` flags.

!!! note
    In this section, a chunk of data is called a *split-file*.

| Flag                   | Description                                           |
| ---------------------- | ----------------------------------------------------- |
| `--split`              | Divides a file into chunks based on a *record*, such as newlines <br> in a line-delimited file or JSON objects in a JSON file. <br> The `--split` flag takes one of the following arguments: `line`, <br> `json`, or `sql`. For example, `--split line` ensures that Pachyderm <br> only breaks up a file on a newline boundary and not in the <br> middle of a line. |
| `--target-file-bytes`  | This flag must be used with the `--split` <br> flag. The `--target-file-bytes` flag fills each split-file with data up to <br> the specified number of bytes, splitting on the nearest <br> record boundary. For example, you have a line-delimited file <br> of 50 lines, with each line having about 20 bytes. If you run <br> `pachctl put file` with `--split line --target-file-bytes 100`, you see the input <br> file split into about 10 files, and each file has about 5 lines. The size of each <br> split-file hovers at or above the target value of 100 bytes, <br> except for the last split-file, which might <br> be less than 100 bytes. |
| `--target-file-datums` | This flag must be used with the `--split` <br> flag. The `--target-file-datums` flag attempts to fill each split-file <br> with the specified number of datums. If you run `pachctl put file` with `--split line --target-file-datums 2` on a line-delimited 100-line file, you see the file split into 50 split-files, and each file has 2 lines. |


If you specify both the `--target-file-datums` and `--target-file-bytes`
flags, Pachyderm fills each split-file until it hits one of the
constraints.
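
The grouping described in the table can be modeled locally. The sketch below
is an approximation built from the description above, not Pachyderm's
implementation: `plan_split` is a hypothetical helper, a value of `0` is
assumed to mean "flag not set", and a split-file is assumed to be closed as
soon as it reaches either limit. It only prints a plan and writes no
split-files.

```shell
workdir=$(mktemp -d)
cd "$workdir"

# Build the file from the table: 50 lines of exactly 20 bytes each
# (19 characters plus the newline).
i=1
while [ "$i" -le 50 ]; do
  printf 'record-%012d\n' "$i"
  i=$((i + 1))
done > lines.txt

# plan_split FILE TARGET_BYTES TARGET_DATUMS (hypothetical helper)
# Prints one line per simulated split-file: name, record count, size.
plan_split() {
  awk -v bytes="$2" -v datums="$3" '
    {
      size += length($0) + 1   # record plus its trailing newline
      count++
    }
    (bytes > 0 && size >= bytes) || (datums > 0 && count >= datums) {
      printf "%016x %d records %d bytes\n", n++, count, size
      size = 0
      count = 0
    }
    END {
      if (count > 0) printf "%016x %d records %d bytes\n", n++, count, size
    }
  ' "$1"
}

by_bytes=$(plan_split lines.txt 100 0)    # models --target-file-bytes 100
by_datums=$(plan_split lines.txt 0 2)     # models --target-file-datums 2

printf '%s\n' "$by_bytes" | head -n 2
# 0000000000000000 5 records 100 bytes
# 0000000000000001 5 records 100 bytes
```

Under these assumptions, the 50-line file yields 10 split-files with a
100-byte target and 25 split-files with a two-datum target.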

!!! note "See also:"
    [Splitting Data for Distributed Processing](../splitting/#ingesting-postgressql-data)

### Example: Splitting a File

In this example, you have a 50-line file called `my-data.txt`.
You create a repository named `line-data` and load
`my-data.txt` into that repository. Then, you analyze
how the data is split in the repository.

To complete this example, perform the following steps:

1. Create a file named `my-data.txt` that contains fifty lines. The
   lines can hold anything, such as the numbers from one to fifty or
   the names of US states.

   **Example:**

   ```shell
   One
   Two
   Three
   ...
   Fifty
   ```

1. Create a repository called `line-data` and load `my-data.txt` into it,
   splitting the file by line:

   ```shell
   $ pachctl create repo line-data
   $ pachctl put file line-data@master -f my-data.txt --split line
   ```

1. List the filesystem objects in the repository:

   ```shell
   $ pachctl list file line-data@master
   NAME         TYPE  SIZE
   /my-data.txt dir   1.071KiB
   ```

   The `pachctl list file` command shows that the line-oriented file
   `my-data.txt` has been transformed into a directory that contains
   the chunks of the original file. Each chunk is stored in a split-file
   with a 16-character filename that is left-padded with zeros.
   Pachyderm numbers the filenames sequentially in hexadecimal. Listing
   the contents of the `my-data.txt` directory reveals this naming
   structure.

1. List the files in the `my-data.txt` directory:

   ```shell
   $ pachctl list file line-data@master:my-data.txt
   NAME                          TYPE  SIZE
   /my-data.txt/0000000000000000 file  21B
   /my-data.txt/0000000000000001 file  22B
   /my-data.txt/0000000000000002 file  24B
   /my-data.txt/0000000000000003 file  21B
   ...
   /my-data.txt/0000000000000031 file  22B
   ```
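
The last filename in the listing follows from the numbering scheme: fifty
split-files are numbered from 0 to 49, and 49 is `31` in hexadecimal. The
following lines reproduce the names with `printf`. The lowercase hexadecimal
digits are an assumption of this sketch, because the listing above contains
no letters.

```shell
# Name of the last split-file when a 50-line file is split by line.
# Indices start at zero, so the last index is 49.
line_count=50
last_name=$(printf '%016x' $((line_count - 1)))
echo "/my-data.txt/$last_name"   # prints /my-data.txt/0000000000000031

# The tenth to twelfth names show where hexadecimal letters begin.
for i in 9 10 11; do
  printf '%016x\n' "$i"
done
```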

## Example: Appending to Files with `--split`

Combining `--split` with the default *append* behavior of
`pachctl put file` enables flexible and scalable processing of
record-oriented file data from external, legacy systems.

When you append to an existing file by using `--split`, Pachyderm
ensures that only the newly added data gets processed. Pachyderm
also optimizes storage utilization by automatically deduplicating
each split-file. As a result, a large file that contains many
duplicate lines, or objects with identical hashes, might use less
space in PFS after it is split than it does as a single file
outside of PFS.
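
The storage effect of deduplication can be estimated with a rough local
model. This sketch assumes that identical records are stored only once and
ignores any metadata overhead in PFS. The file `colors.txt` and its contents
are hypothetical.

```shell
workdir=$(mktemp -d)
cd "$workdir"

# Hypothetical input: six records, only three of which are distinct.
printf 'red\ngreen\nred\nred\ngreen\nblue\n' > colors.txt

total_records=$(wc -l < colors.txt | tr -d ' ')
total_bytes=$(wc -c < colors.txt | tr -d ' ')

# A store that deduplicates by content keeps each distinct record once.
unique_records=$(sort -u colors.txt | wc -l | tr -d ' ')
unique_bytes=$(sort -u colors.txt | wc -c | tr -d ' ')

echo "records: $total_records total, $unique_records stored"
echo "bytes:   $total_bytes total, $unique_bytes stored"
```

For this input, the model stores 3 of the 6 records and 15 of the 29 bytes.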

To complete this example, follow these steps:

1. Create a file named `count.txt` with the following lines:

   ```shell
   Zero
   One
   Two
   Three
   Four
   ```

1. Put the `count.txt` file into a Pachyderm repository called `raw_data`:

   ```shell
   $ pachctl put file -f count.txt raw_data@master --split line
   ```

   This command splits the `count.txt` file by line and creates
   a separate split-file for each line. This operation also
   creates five datums that are processed by the
   pipelines that use this repository as input.

1. View the repository contents:

   ```shell
   $ pachctl list file raw_data@master
   NAME       TYPE SIZE
   /count.txt dir  24B
   ```

   Pachyderm created a directory called `count.txt`.

1. View the contents of the `count.txt` directory:

   ```shell
   $ pachctl list file raw_data@master:count.txt
   NAME                        TYPE SIZE
   /count.txt/0000000000000000 file 5B
   /count.txt/0000000000000001 file 4B
   /count.txt/0000000000000002 file 4B
   /count.txt/0000000000000003 file 6B
   /count.txt/0000000000000004 file 5B
   ```

   In the output above, you can see that Pachyderm created five split-files
   from the original `count.txt` file. Each file has one line from the
   original `count.txt`. You can check the contents of each file by
   running the `pachctl get file` command. For example, to get
   the contents of `count.txt/0000000000000000`, run the following
   command:

   ```shell
   $ pachctl get file raw_data@master:count.txt/0000000000000000
   Zero
   ```

1. Create a one-line file called `more-count.txt` with the
   following content:

   ```shell
   Five
   ```

1. Load this file into Pachyderm by appending it to the `count.txt` file:

   ```shell
   $ pachctl put file raw_data@master:count.txt -f more-count.txt --split line
   ```

   * If you do not specify the `--split` flag while appending to
     a file that was previously split, Pachyderm displays the following
     error message:

     ```shell
     could not put file at "/count.txt"; a file of type directory is already there
     ```

1. Verify that another file was added:

   ```shell
   $ pachctl list file raw_data@master:count.txt
   NAME                        TYPE SIZE
   /count.txt/0000000000000000 file 5B
   /count.txt/0000000000000001 file 4B
   /count.txt/0000000000000002 file 4B
   /count.txt/0000000000000003 file 6B
   /count.txt/0000000000000004 file 5B
   /count.txt/0000000000000005 file 4B
   ```

   The `/count.txt/0000000000000005` file was added to the input
   repository. Pachyderm considers this new file a separate datum.
   Therefore, pipelines process only that datum instead of all the
   chunks of `count.txt`.

1. Get the contents of the `/count.txt/0000000000000005` file:

   ```shell
   $ pachctl get file raw_data@master:count.txt/0000000000000005
   Five
   ```
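
The append behavior in the steps above can be imitated on a local
filesystem. The sketch below is a model rather than Pachyderm itself:
`append_split` is a hypothetical helper that continues the hexadecimal
numbering found in a directory, and the `split/` directory stands in for the
repository.

```shell
workdir=$(mktemp -d)
cd "$workdir"

# append_split SRC DIR (hypothetical helper): write each line of SRC to its
# own file in DIR, continuing the hexadecimal numbering already used there.
append_split() {
  mkdir -p "$2"
  next=$(( $(find "$2" -type f | wc -l) ))
  while IFS= read -r line; do
    printf '%s\n' "$line" > "$2/$(printf '%016x' "$next")"
    next=$((next + 1))
  done < "$1"
}

printf 'Zero\nOne\nTwo\nThree\nFour\n' > count.txt
printf 'Five\n' > more-count.txt

# First load: five split-files.
append_split count.txt split/count.txt
ls split/count.txt > before.txt

# Append: existing split-files are untouched, one new file is added.
append_split more-count.txt split/count.txt
ls split/count.txt > after.txt

# Names that exist after the append but did not exist before it.
new_files=$(comm -13 before.txt after.txt)
echo "new split-files: $new_files"
```

Only `0000000000000005` is new in this model, which mirrors why a pipeline
has a single new datum to process after the append.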

## Example: Overwriting Files with `--split`

The behavior of Pachyderm when a file loaded with `--split` is
overwritten is simple to explain but subtle in its implications.
Most importantly, it matters when new rows are inserted within
the file as opposed to being appended to the end.
The loaded file is split into sequentially named split-files,
as shown above. If any of the resulting
split-files hashes differently than the one it is replacing,
the Pachyderm Pipeline System processes that data again.
This can have significant consequences for downstream processing.

To complete this example, follow these steps:

1. Create a file named `count.txt` with the following lines:

   ```shell
   One
   Two
   Three
   Four
   Five
   ```

1. Put the file into a Pachyderm repository called `raw_data`:

   ```shell
   $ pachctl put file -f count.txt raw_data@master --split line
   ```

   This command splits the `count.txt` file by line and creates
   a separate split-file for each line. This operation also
   creates five datums that are processed by the
   pipelines that use this repository as input.

1. View the repository contents:

   ```shell
   $ pachctl list file raw_data@master
   NAME       TYPE SIZE
   /count.txt dir  24B
   ```

   Pachyderm created a directory called `count.txt`.

1. View the contents of the `count.txt` directory:

   ```shell
   $ pachctl list file raw_data@master:count.txt
   NAME                        TYPE SIZE
   /count.txt/0000000000000000 file 4B
   /count.txt/0000000000000001 file 4B
   /count.txt/0000000000000002 file 6B
   /count.txt/0000000000000003 file 5B
   /count.txt/0000000000000004 file 5B
   ```

   In the output above, you can see that Pachyderm created five split-files
   from the original `count.txt` file. Each file has one line from the
   original `count.txt` file. You can check the contents of each file by
   running the `pachctl get file` command. For example, to get
   the contents of `count.txt/0000000000000000`, run the following
   command:

   ```shell
   $ pachctl get file raw_data@master:count.txt/0000000000000000
   One
   ```

1. In your local directory, modify the original `count.txt` file by
   inserting the word *Zero* on the first line:

   ```shell
   Zero
   One
   Two
   Three
   Four
   Five
   ```

1. Upload the updated `count.txt` file into the `raw_data` repository
   by using the `--split` and `--overwrite` flags:

   ```shell
   $ pachctl put file -f count.txt raw_data@master:count.txt --split line --overwrite
   ```

   Inserting a line at the top shifts every following line into a
   split-file with a different name. Because Pachyderm takes the file
   name into account when hashing data for a pipeline, it considers
   every split-file new, and the pipelines that use this repository
   as input process all six datums.

1. List the files in the directory:

   ```shell
   $ pachctl list file raw_data@master:count.txt
   NAME                        TYPE SIZE
   /count.txt/0000000000000000 file 5B
   /count.txt/0000000000000001 file 4B
   /count.txt/0000000000000002 file 4B
   /count.txt/0000000000000003 file 6B
   /count.txt/0000000000000004 file 5B
   /count.txt/0000000000000005 file 5B
   ```

   The `/count.txt/0000000000000000` file now has the newly added `Zero` line.
   To verify the contents of the file, run:

   ```shell
   $ pachctl get file raw_data@master:count.txt/0000000000000000
   Zero
   ```
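
The reason that all six datums are reprocessed can be checked with a local
model of the overwrite. As a sketch under stated assumptions, `split_lines`
is a hypothetical helper that numbers split-files from zero, the `before/`
and `after/` directories stand in for the two commits, and a split-file
counts as changed when no file with the same name and the same content
existed before.

```shell
workdir=$(mktemp -d)
cd "$workdir"

# split_lines SRC DIR (hypothetical helper): write each line of SRC to its
# own file in DIR, numbered from zero in 16-digit hexadecimal.
split_lines() {
  mkdir -p "$2"
  n=0
  while IFS= read -r line; do
    printf '%s\n' "$line" > "$2/$(printf '%016x' "$n")"
    n=$((n + 1))
  done < "$1"
}

# State before the overwrite.
printf 'One\nTwo\nThree\nFour\nFive\n' > count.txt
split_lines count.txt before

# State after inserting Zero on the first line and overwriting.
printf 'Zero\nOne\nTwo\nThree\nFour\nFive\n' > count.txt
split_lines count.txt after

# Count split-files whose name and content pair did not exist before.
changed=0
for f in after/*; do
  name=${f#after/}
  cmp -s "$f" "before/$name" 2>/dev/null || changed=$((changed + 1))
done

echo "split-files treated as new: $changed of $n"
```

Every name now holds different content, so the model reports 6 of 6. In
contrast, appending a line leaves the earlier names and contents unchanged.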

!!! note "See also:"
    [Splitting Data](splitting.md)