github.com/square/finch@v0.0.0-20240412205204-6530c03e2b96/docs/content/data/scope.md (about)

---
weight: 2
---

Data keys have a scope corresponding to [run levels]({{< relref "intro/concepts#run-levels" >}}) plus two special low-level scopes:

```
global
└──stage
   └──workload
      └──exec-group
         └──client-group
            └──client
               └──iter
                  └──trx
                     └──statement (default)
                        └──row
                           └──value
```

<mark><b>A data key is unique within its scope, and its data generator is called once per scope iteration.</b></mark>


|Data Scope|Scope Iteration (Data Generator Called When)|Class|
|----------|--------------------------------------------|-----|
|[global](#global)|Finch runs|[One time](#one-time)|
|[stage](#stage)|Stage runs|[One time](#one-time)|
|[workload](#workload)|Each client starts a new iter|[Multi client](#multi-client)|
|[exec-group](#exec-group)|Each client starts a new iter|[Multi client](#multi-client)|
|[client-group](#client-group)|Each client starts a new iter|[Multi client](#multi-client)|
|[client](#client)|Client connects to MySQL or recoverable query error|[Single client](#single-client)|
|[iter](#iter)|Iter count increases|[Single client](#single-client)|
|[trx](#trx)|Trx count increases|[Single client](#single-client)|
|[statement](#statement)|Statement executes|[Single client](#single-client)|
|[row](#row)|Each @d per row when statement executes|[Special](#special)|
|[value](#value)|Each @d when statement executes|[Special](#special)|

For example, @d with statement scope (the default) is called once per statement execution.
Or, @d with iter scope is called once per client iter (the start of executing all trx assigned to the client).

As the rest of this page will show, data scope makes it possible to craft both simple and elaborate workloads.

## Configure

Specify [`stage.trx[].data.d.scope`]({{< relref "syntax/stage-file#dscope" >}}) for a data key:

```yaml
stage:
  trx:
    - file:
      data:
        d:
          generator: int  # Required data generator name
          scope: trx      # Optional data scope
```

If not specified, [statement](#statement) is the default scope.
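
Each data key can have its own scope. For instance, a stage might combine a trx scoped key with a default (statement) scoped key — a sketch, where the key names `a` and `R` and the trx file name are illustrative:

```yaml
stage:
  trx:
    - file: trx-a.sql        # hypothetical trx file using @a and @R
      data:
        a:
          generator: auto-inc  # stateful: returns 1, 2, 3, ...
          scope: trx           # called once per trx
        R:
          generator: int       # random integer
                               # no scope key: defaults to statement
```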

## Explicit Call

|Implicit Call|Explicit Call|
|-------------|-------------|
|`@d`|`@d()`|

By default, each data key is called once per scope iteration because `@d` is an _implicit call_: Finch calls the data generator when the scope iteration changes.
Implicit calls are the default and canonical case because they just work for the vast majority of benchmarks.

An _explicit call_, like `@d()`, calls the data generator regardless of the scope iteration but always within the data scope.
For example, presume `@d` has statement scope and uses the [`auto-inc` generator]({{< relref "data/generators#auto-inc" >}}):

{{< columns >}}
```sql
SELECT @d @d
-- SELECT 1 1
```
<---> <!-- magic separator, between columns -->
```sql
SELECT @d @d()
-- SELECT 1 2
```

{{< /columns >}}

The left query returns `1 1` because the second `@d` is an implicit call and the scope iteration (statement) has not changed.
The right query returns `1 2` because the second `@d()` is an explicit call, so Finch calls the data generator again even though the scope iteration hasn't changed.

[Row scope](#row) is a good example of why explicit calls are sometimes required.
Until then, the rest of this page will continue to default to the canonical case: implicit calls, `@d`.

## Single Client

To reason about and explain single client data scopes, let's use a <mark>canonical example</mark>:

{{< columns >}}
_Client 1_
```
-- Iter 1

-- Trx A n=1
INSERT @a, @R
UPDATE @a, @R

-- Trx B n=2
DELETE @a, @R
```

```
-- Iter 2

-- Trx A n=3
INSERT @a, @R
UPDATE @a, @R

-- Trx B n=4
DELETE @a, @R
```
<--->
&nbsp;
{{< /columns >}}

The client executes two trx: trx A is an `INSERT` and an `UPDATE`; trx B is a `DELETE`.
There's a trx counter: `n=1`, `n=2`, and so forth.
Remember that the statements shown in iter 1 are the same as in iter 2: it's one `INSERT` executed twice; one `UPDATE` executed twice; one `DELETE` executed twice.

There are two data keys:
* @a uses the [`auto-inc` generator]({{< relref "data/generators#auto-inc" >}}) so we know what values it returns when called: 1, 2, 3, and so on.
* @R uses the [random `int` generator]({{< relref "data/generators#int" >}}) so we don't know what value it returns when called.

These different data keys will help highlight how data scope works and why it's important.
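
A stage file for this canonical example might look like the following sketch. The file names are illustrative: presume `trx-a.sql` contains the `INSERT` and `UPDATE`, and `trx-b.sql` contains the `DELETE`:

```yaml
stage:
  trx:
    - file: trx-a.sql        # hypothetical: INSERT @a, @R / UPDATE @a, @R
      data:
        a:
          generator: auto-inc  # deterministic: 1, 2, 3, ...
        R:
          generator: int       # random integer
    - file: trx-b.sql        # hypothetical: DELETE @a, @R
```

The following sections show what @a and @R return under each single client scope.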

### Statement

Using the canonical example described above, the default data scope (statement) returns:

{{< columns >}}
_Client 1_
```
-- Iter 1

-- Trx A n=1
INSERT @a=1, @R=101
UPDATE @a=1, @R=492

-- Trx B n=2
DELETE @a=1, @R=239
```

```
-- Iter 2

-- Trx A n=3
INSERT @a=2, @R=934
UPDATE @a=2, @R=111

-- Trx B n=4
DELETE @a=2, @R=202
```
<--->
&nbsp;
{{< /columns >}}

First notice that `@a=1` for all three statements in iter 1.
This is because @a is statement scoped: each @a in each statement is unique, so there are three data generators and they each return the first value: 1.
Another way to look at it: statement scoped @a in `INSERT @a` is only visible and accessible within that statement; none of the other statements can see or access it.
The values (in iter 1) are all 1 only because each statement scoped @a happens to return that as the first value.

@R demonstrates why statement scope is the default.
Typically, benchmarks use random values and expect different (random) values for each query.
Since @R is statement scoped, the typical case is generated by default.
A better example for @R is something like:

```sql
SELECT c FROM t WHERE id = @R
```

That might benchmark random lookups on column `id`.

### Trx

Using the canonical example described above, trx scope returns:

{{< columns >}}
_Client 1_
```
-- Iter 1

-- Trx A n=1
INSERT @a=1, @R=505
UPDATE @a=1, @R=505

-- Trx B n=2
DELETE @a=1, @R=293
```

```
-- Iter 2

-- Trx A n=3
INSERT @a=2, @R=821
UPDATE @a=2, @R=821

-- Trx B n=4
DELETE @a=2, @R=410
```
<--->
&nbsp;
{{< /columns >}}

{{< hint type=note >}}
Remember that "trx" in these docs refers to a [Finch trx]({{< relref "benchmark/trx" >}}), not a MySQL transaction, although the two are closely related.
{{< /hint >}}

Again, first notice that `@a=1` in both the first trx (A) and the second trx (B).
This is because @a is trx scoped, so it's unique to each trx (A and B) and called once per trx (when the trx count `n` increases).
But since each trx scoped @a returns the same initial value, the output looks the same as statement scope, so @R demonstrates trx scope better.

Although @R generates random integers, when it's trx scoped it's called only once per trx, so every @R within a trx has the same value, like `@R=505` in client 1 iter 1 trx n=1.
This could be used, for example, to benchmark inserting, updating, and deleting random rows, like:

```sql
INSERT INTO t (id, ...) VALUES (@R, ...)
UPDATE t SET ... WHERE id=@R
DELETE FROM t WHERE id=@R
```

Fun fact: the classic [sysbench write-only benchmark]({{< relref "benchmark/examples#sysbench" >}}) does this: it deletes a random row then re-inserts it.
See `@del_id` in its config.
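
Configuring that pattern takes one `scope` line in the stage file — a sketch, where the trx file name is illustrative and presumes it contains the three statements above:

```yaml
stage:
  trx:
    - file: write-rows.sql  # hypothetical: INSERT, UPDATE, DELETE using @R
      data:
        R:
          generator: int
          scope: trx        # one random id shared by all three statements per trx
```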

### Iter

Using the canonical example described above, iter scope returns:

{{< columns >}}
_Client 1_
```
-- Iter 1

-- Trx A n=1
INSERT @a=1, @R=505
UPDATE @a=1, @R=505

-- Trx B n=2
DELETE @a=1, @R=505
```

```
-- Iter 2

-- Trx A n=3
INSERT @a=2, @R=821
UPDATE @a=2, @R=821

-- Trx B n=4
DELETE @a=2, @R=821
```
<--->
&nbsp;
{{< /columns >}}

An _iteration (iter)_ is one execution of all trx assigned to a client.
In this example, @a and @R are iter scoped, so they're called once per iter (per client).
Values are the same across trx and only change with each new iter.

Iter scope is useful to "share" values across multiple trx.
This is equivalent to combining multiple trx into one and using trx scope.
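
For example, iter scope could share one random id across two separate trx files — a sketch, with illustrative file names, presuming (as the scope hierarchy implies) that a data key of the same name above trx scope resolves to one generator across both files:

```yaml
stage:
  trx:
    - file: insert.sql   # hypothetical: INSERT INTO t (id) VALUES (@R)
      data:
        R:
          generator: int
          scope: iter    # same value seen by delete.sql in the same iter
    - file: delete.sql   # hypothetical: DELETE FROM t WHERE id=@R
```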

### Client

Client scope covers the same range as [iter](#iter) (one client), but its scope iteration is different: it changes when the client connects to MySQL or recovers from a query error.
The scope iteration occurs at least once: when the client first connects to MySQL.
Further iterations occur when the client reconnects to MySQL _or_ starts a new iter to recover from certain errors (see [Benchmark / Error Handling]({{< relref "benchmark/error-handling" >}})).

Is this scope useful?
Maybe.
For example, perhaps there's a use case for a client scoped @d to handle duplicate key errors or deadlocks&mdash;recoverable errors that start a new iter.
A client scoped @d would know it's a recoverable error and not just the next iteration, whereas an iter scoped data key couldn't know this.

{{< hint type=tip title=iter-on-error >}}
It might be helpful to think of client scope as "iter-on-error".
{{< /hint >}}

## Multi Client

To reason about and explain multi client data scopes, let's add a second client to the [canonical example](#single-client):

{{< columns >}}
_Client 1_
```
-- Iter 1

-- Trx A n=1
INSERT @a, @R
UPDATE @a, @R

-- Trx B n=2
DELETE @a, @R
```

```
-- Iter 2

-- Trx A n=3
INSERT @a, @R
UPDATE @a, @R

-- Trx B n=4
DELETE @a, @R
```
<--->
_Client 2_
```
-- Iter 1

-- Trx A n=1
INSERT @a, @R
UPDATE @a, @R

-- Trx B n=2
DELETE @a, @R
```
{{< /columns >}}

### Client Group

Using the canonical example described immediately above, client-group scope returns:

{{< columns >}}
_Client 1_
```
-- Iter 1

-- Trx A n=1
INSERT @a=1, @R=505
UPDATE @a=1, @R=505

-- Trx B n=2
DELETE @a=1, @R=505
```

```
-- Iter 2

-- Trx A n=3
INSERT @a=3, @R=821
UPDATE @a=3, @R=821

-- Trx B n=4
DELETE @a=3, @R=821
```
<--->
_Client 2_
```
-- Iter 1

-- Trx A n=1
INSERT @a=2, @R=743
UPDATE @a=2, @R=743

-- Trx B n=2
DELETE @a=2, @R=743
```
{{< /columns >}}

With client group scope, @a and @R are unique to the client group, which means their data generators are shared by all clients in the group.
And the client group scope iteration is "each client starts a new iter", which means the data generators are called whenever any client in the group starts a new iter.

Presume this call order:

1. Client 1 iter 1
2. Client 2 iter 1
3. Client 1 iter 2

Since @a is shared by all clients in the group and called when each client starts a new iter, that call order explains the values: `@a=1` for client 1 iter 1, `@a=2` for client 2 iter 1, and `@a=3` for client 1 iter 2.

Client group scoped data keys are useful with (pseudo) stateful data generators like @a where the call order matters.
This scope allows coordination across multiple clients.
For example, it's what makes it possible for multiple clients to insert values 1..N without duplicates.

With random value generators like @R, client group scope is equivalent to iter scope, presuming no [explicit calls](#explicit-call).
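
That coordinated-insert case might be configured like this sketch (table and file names are illustrative), with however many clients the workload assigns to the group drawing from one shared sequence:

```yaml
stage:
  trx:
    - file: insert.sql         # hypothetical: INSERT INTO t (id) VALUES (@a)
      data:
        a:
          generator: auto-inc
          scope: client-group  # one generator shared by every client in the group,
                               # so ids 1..N are inserted without duplicates
```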

### Exec Group

Exec group scope works the same as [client group scope](#client-group) but is unique to all client groups in the exec group.
But be careful: since different client groups can execute different trx, make sure any data keys shared across client groups make sense.

### Workload

Workload scope works the same as [client group scope](#client-group) but is unique to all exec groups, which means all clients in the stage.
But be careful: since different exec groups can execute different trx, make sure any data keys shared across exec and client groups make sense.

## One Time

### Stage

Stage data scope applies to the entire stage, but stage scoped data keys are called only once, when the stage starts.
This might be useful for static or one-time values reused across different exec or client groups.
Or, stage scoped data keys can be [called explicitly](#explicit-call).

### Global

Global data scope applies to all stages in a single Finch run.
For example, if Finch is run like `finch stage1.yaml stage2.yaml`, global scope applies to both stages.
Global scoped data keys are called only once (when the first query of the first client of the first stage executes).
Or, global scoped data keys can be [called explicitly](#explicit-call).

{{< hint type=warning >}}
Global data scope does _not_ span [compute instances]({{< relref "operate/client-server" >}}).
It's an interesting idea that could work, but is there a use case for sharing a data value across compute instances?
{{< /hint >}}

<br>

## Special

### Row

Row scope is intended for use with [CSV substitution]({{< relref "syntax/trx-file#csv" >}}) to produce multi-row `INSERT` statements:

{{< columns >}}
_Trx File_ &rarr;
```sql
INSERT INTO t VALUES
/*!csv 2 (@a, ...)*/
```
<---> <!-- magic separator, between columns -->
_Automatic Transformation_ &rarr;
```sql
INSERT INTO t VALUES
   (@a(), ...)
  ,(@a(), ...)
```
<--->
_Resulting Values_
```sql
INSERT INTO t VALUES
   (1, ...) -- row 1
  ,(2, ...) -- row 2
```
{{< /columns >}}

With row scope, the [`auto-inc`]({{< relref "data/generators#auto-inc" >}}) data key, @a, is unique to the _statement_ but called for each _row_.

As shown above, when used with [CSV substitution]({{< relref "syntax/trx-file#csv" >}}), Finch automatically transforms @a to an [explicit call](#explicit-call) in each row.
Moreover, it transforms only the first occurrence of each unique data key in the row.
For example, `/*!csv N (@a, @a, @a)*/` produces `(1, 1, 1)` for the first row: the first @a is an explicit call, and the latter two @a are copies of the first.

To achieve the same results without [CSV substitution]({{< relref "syntax/trx-file#csv" >}}), use statement scope and explicit calls&mdash;manually write the query like the automatic transformation shown above&mdash;or use [value scope](#value).

### Value

Value scope means every @d has its own unique data generator and is called every time the statement is executed.
This sounds like statement scope with [explicit calls](#explicit-call), but there's a difference:

{{< columns >}}
_Value Scope_
```sql
       @d    @d
        │     │
SELECT @d,   @d
```
<---> <!-- magic separator, between columns -->
_Statement Scope with Explicit Calls_
```sql
        ┌─ @d ─┐
        │      │
SELECT @d(),  @d()
```
{{< /columns >}}

With value scope, each @d has its own data generator.
With statement scope and explicit calls, all @d in the statement share/call the same data generator.

Whether or not this makes a difference depends on the data generator.
For (pseudo) stateful generators like [`auto-inc`]({{< relref "data/generators#auto-inc" >}}), it makes a difference: value scope yields 1 and 1; statement scope with explicit calls yields 1 and 2.
For random value generators, it might not make a difference, especially since @d can have only one configuration.
If, for example, you want two random numbers with the same generator but configured differently, then you must use two different data keys, one for each configuration.
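
For instance, two data keys using the same generator but configured differently might look like this sketch. The key names, trx file name, and `params` settings are illustrative&mdash;check the [generator docs]({{< relref "data/generators" >}}) for the exact parameter names each generator accepts:

```yaml
stage:
  trx:
    - file: lookups.sql  # hypothetical: SELECT c FROM t WHERE id1=@R1 AND id2=@R2
      data:
        R1:
          generator: int
          params:
            max: "1000"     # assumed parameter: small id range
        R2:
          generator: int
          params:
            max: "1000000"  # assumed parameter: large id range
```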