---
weight: 2
---

The following concepts are important for understanding how Finch works, and many of them are unique to Finch.

{{< hint type=tip >}}
If you're new to database benchmarking, read <a target="_new" href="https://hackmysql.com/benchmarking/">Fundamentals of Database Benchmarking</a>.
{{< /hint >}}

{{< toc >}}

## Run Levels

Finch is designed on a hierarchy of run levels:

```
global
└──stage
   └──workload
      └──exec group
         └──client group
            └──client
               └──iter
                  └──trx
                     └──statement
```

These are called _run levels_ because they determine if, when, and how various aspects of Finch run.
By understanding the run levels, you'll understand all the core concepts of Finch and how they work.

Let's start at the bottom because it's the most direct and concrete concept: SQL statements.
Moving up the hierarchy, the concepts become increasingly abstract and specific to Finch.

### Statements

A _statement_ is any SQL statement that Finch can execute on MySQL: `SELECT`, `INSERT`, `UPDATE`, `DELETE`, and so forth.

This is usually synonymous with _query_, but statement is more general because, for example, would you call `BEGIN` a query?
Maybe, but what about `CREATE` or `SET`?

Statements are the lowest&mdash;but most important&mdash;level of running Finch: executing SQL statements on MySQL.
This shouldn't be surprising since Finch is a benchmarking tool, but what might be surprising is that you write Finch benchmarks using real SQL statements; they are not wrapped in a scripting language.
You write one or more SQL statements in a Finch transaction file.

### Transactions

Finch _transactions (trx)_ are plain text files that contain one or more SQL statements to execute.
Suppose that you want to benchmark these statements:

```sql
SELECT c FROM t WHERE id = @d

BEGIN

SELECT id FROM t WHERE user = @d

UPDATE users SET n = n + 1 WHERE id = @d

COMMIT
```

You can put those statements in one file and it will work, but it's better to separate the transactions:

{{< columns >}}
_trx-1.sql_
```sql
SELECT c FROM t WHERE id = @d
```
<---> <!-- magic separator, between columns -->
_trx-2.sql_
```sql
BEGIN

SELECT id FROM t WHERE user = @d

UPDATE users SET n = n + 1 WHERE id = @d

COMMIT
```
{{< /columns >}}

{{< hint type=note >}}
Ignore the `@d` placeholders for now; they are Finch [data keys](#data-keys) described later on this page.
{{< /hint >}}

With separate Finch trx, you can craft more advanced and interesting benchmarks, like:

|Trx|Clients|TPS|
|---|-------|---|
|trx-1.sql|100|(Unlimited)|
|trx-2.sql|20|50|

Finch will run the first trx (the single `SELECT`) on 100 clients as fast as possible, and it will run the second trx (the explicit, multi-statement SQL transaction) on only 20 clients limited to 50 transactions per second (TPS).

{{< hint type=tip >}}
Remember the example in the table above because it's reused in the following sections.
{{< /hint >}}

Finch also collects and reports statistics per-trx.
(The built-in stats reports combine all trx stats, but a [stats plugin]({{< relref "api/stats" >}}) receives per-trx stats.)

You can write a benchmark as one large pseudo-transaction (all SQL statements in one file), but for various reasons that will become clear later, it's better to write each SQL transaction in a separate Finch trx file.

### Client

A _client_ is one MySQL client (connection) that executes all SQL statements in all trx assigned to it.
Each pass through all assigned trx is one _iter_ (iteration).
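
Iterations can also be limited. The snippet below is a hedged sketch that assumes the `iter` workload option (a per-client iteration limit); with it, each of the 10 clients would execute its assigned trx once, then stop:

```yaml
workload:
  - clients: 10
    trx: [trx-1.sql]
    iter: 1  # assumed option: each client runs one iteration, then stops
```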

You can configure Finch to run any number of clients.
This is typical for benchmarking tools, but Finch has two unique features:

* Clients can be grouped (described next)
* Clients can be assigned different trx

The second point makes Finch especially powerful and flexible.
For example, as shown in the table above, you can run 100 clients executing certain transactions, and 20 clients executing other transactions.
Having different clients executing different trx is why Finch needs client and execution groups.

### Client and Execution Groups

A _client group_ is a logical group of clients with the same trx and limits.
From the example in the table above, this snippet of a Finch stage file creates two client groups:

```yaml
workload:
  - clients: 100      # Client group 1
    trx: [trx-1.sql]  #

  - clients: 20       # Client group 2
    trx: [trx-2.sql]  #
    tps-clients: 50   #
```

The first client group will run 100 independent clients, all executing the statements in trx `trx-1.sql`.
The second client group will run 20 independent clients, all executing the statements in trx `trx-2.sql`, and limited to 50 TPS across all 20 clients.
Together, this will create 120 MySQL clients (connections).

An _execution group_ is one or more client groups with the same group name.

Let's pretend you want to execute another client group _after_ client groups 1 and 2 (above) are done.
That requires putting client groups 1 and 2 in a named execution group&mdash;let's call it `FOO`&mdash;and the third client group in another named execution group&mdash;let's call it `BAR`.
The following snippet of a Finch stage file creates these two execution groups.

```yaml
workload:
                      # FOO ----------
  - clients: 100      # Client group 1
    trx: [trx-1.sql]  #
    group: FOO        #
  - clients: 20       # Client group 2
    trx: [trx-2.sql]  #
    tps-clients: 50   #
    group: FOO        #
                      # BAR ----------
  - clients: 1        # Client group 3
    trx: [trx-3.sql]  #
    group: BAR        #
```

Finch processes the `workload` list top to bottom, so order matters when creating execution groups.
All client groups in an execution group run concurrently.
And execution groups are run top to bottom in workload order.

Visually, the workload processing (Y axis) and the time when the execution groups run (X axis) can be depicted like:

![Finch Execution Group](/finch/img/finch_exec_group.svg)

Both client groups in execution group `FOO` (blue) run first.
When they're done, the client group in execution group `BAR` runs next.

This is advanced workload orchestration.
Sometimes it's necessary (like with DDL statements), but it's optional for most simple benchmarks that just want to run everything at once.
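
For example, a setup-style stage might use execution groups to guarantee that tables exist before rows are inserted. The sketch below uses illustrative trx file names; the `ddl` group runs alone and finishes before the `rows` group starts:

```yaml
workload:
  - clients: 1                  # Runs first, alone
    trx: [trx/schema.sql]       # CREATE TABLE statements
    group: ddl
  - clients: 16                 # Starts only after ddl is done
    trx: [trx/insert-rows.sql]  # INSERT statements
    group: rows
```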

### Workload

A workload is a combination of queries, data, and access patterns.
In Finch, the previous (lower) run levels combine to create a workload that is defined in the aptly named `workload` section of a stage file:

```yaml
stage:
  workload:
    - clients: 100
      trx: [trx-1.sql]
    - clients: 20
      trx: [trx-2.sql]
      tps-clients: 50 # all 20 clients
  trx:
    - file: trx-1.sql
    - file: trx-2.sql
```

That defines the workload for the running example (the table in [Transactions](#transactions)): 100 clients to execute the single `SELECT` in `trx-1.sql` and, concurrently, 20 clients to execute the transaction in `trx-2.sql` limited to 50 TPS (across all 20 clients).

In short, a Finch workload defines which clients run which transactions, plus a few features like QPS and TPS limiters.

### Stages

A stage executes a workload.
As shown immediately above, a Finch stage is configured in a YAML file.
Here's a real and more complete example:

```yaml
stage:
  name: read-only
  runtime: 60s
  mysql:
    db: finch
  workload:
    - clients: 1
  trx:
    - file: trx/read-only.sql
      data:
        id:
          generator: "rand-int"
          scope: client
          params:
            max: $params.rows
            dist: normal
        id_100:
          generator: "int-range"
          scope: client
          params:
            range: 100
            max: $params.rows
```

Every stage has a name: either set by `stage.name` in the file, or defaulting to the base file name.
Every stage needs at least one trx, defined by the `stage.trx` list.
Every stage should have an explicitly defined workload (`stage.workload`), but in some cases Finch can auto-assign the workload.
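
Taken together, a minimal stage file can be quite short. The sketch below leans on the defaults just described: the stage name defaults to `minimal` (the base file name), and the workload is presumed to be auto-assigned:

```yaml
# minimal.yaml -- name defaults to "minimal"; workload presumed auto-assigned
stage:
  trx:
    - file: trx/read-only.sql
```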

The rest is various configuration settings for the stage, workload, data keys, and so forth.

Finch runs one or more stages _sequentially_ as specified on the command line:

```bash
finch STAGE_1 [... STAGE_N]
```

This allows you to do a full setup, benchmark, and cleanup in a single run by writing and specifying those three stages.
But you don't have to; it's common to set up only once but run a benchmark repeatedly, like:

```bash
finch setup.yaml
finch benchmark.yaml
finch benchmark.yaml
finch benchmark.yaml
```

### Compute

The global run level represents one running instance of Finch (all stages) on a compute instance.
What's unique about Finch is that one instance (a server) can coordinate running the stages on other instances (clients):

![Finch Distributed Compute](/finch/img/finch_compute.svg)

This is optional, but it's easy to enable and useful for creating a massive load on a MySQL instance with several smaller and cheaper compute instances rather than one huge benchmark server.

To enable, configure a stage with:

```yaml
compute:
  instances: 2  # 1 server + 1 client
```

Then start a Finch server and one client like:

{{< columns >}}
_Server_
```sh
finch benchmark.yaml
```
<---> <!-- magic separator, between columns -->
_Client_
```sh
finch --server 10.0.0.1
```
{{< /columns >}}

The server sends a copy of the stage to the client, so there's nothing to configure on the client.
The client sends its statistics to the server, so you don't have to aggregate them&mdash;Finch does it automatically.

## Data Keys

A _data generator_ is a plugin that generates data for [statements](#statements).
Data generators are referenced by user-defined _data keys_ like `@d`.

"Data key" and "data generator" are synonymous because they're two sides of the same coin.
For brevity, "data key" (or just "`@d`") is used the most.

### Name and Configuration

Data keys are used in statements and configured with a specific data generator in the stage file:

{{< columns >}}
_Statement_
```sql
SELECT c FROM t WHERE id = @d
```
<---> <!-- magic separator, between columns -->
_Stage File_
```yaml
stage:
  trx:
    - file: read.sql
      data:
        d: # @d
          generator: "int"
          params:
            max: 25,000
            dist: uniform
```
{{< /columns >}}

{{< hint type=note >}}
`@d` in a statement, but `d` (no `@` prefix) in a stage file because `@` is a reserved character in YAML.
{{< /hint >}}

In the stage file, `stage.trx[].data.d.generator` configures the type of data generator: data key `@d` uses the `int` (random integer) data generator.
And `stage.trx[].data.d.params` provides generator-specific configuration; in this case, configuring the `int` generator to return a uniform random distribution of numbers between 1 and 25,000.
When Finch clients execute the `SELECT` statement above, data key `@d` will be replaced with random values from this `int` data generator.
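
To illustrate, successive executions of that `SELECT` might look like the following. (The values are made up; the point is that each execution gets a fresh value from the generator.)

```sql
SELECT c FROM t WHERE id = 18276
SELECT c FROM t WHERE id = 4009
SELECT c FROM t WHERE id = 23511
```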

Finch ships with [built-in data generators]({{< relref "data/generators" >}}), and it's easy to [build your own]({{< relref "api/data" >}}) data generator.

### Scope

Data keys are scoped to a [run level](#run-levels).
To see why data scope is important, consider this transaction:

```sql
BEGIN

SELECT c FROM t WHERE id = @d

UPDATE t SET c=c+1 WHERE id = @d

COMMIT
```

Is `@d` the same or different in those two statements?
The answer depends on the configured scope of `@d`:

|`@d` Scope|Result|
|----------|------|
|statement|Different `@d`: one generator and value for `SELECT @d`, and another generator and value for `UPDATE @d`|
|trx|Same `@d`: one generator and one value for both|

Since the statements are in the same MySQL transaction, the intent is probably that `@d` is trx-scoped so that the value in both statements is the same.
But Finch supports very complex workloads, so [Data / Scope]({{< relref "data/scope" >}}) ranges from simple to complex.
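
For the transaction above, scope is configured per data key in the stage file. A sketch, assuming the trx is saved in a hypothetical file `transfer.sql`:

```yaml
stage:
  trx:
    - file: transfer.sql
      data:
        d: # @d
          generator: "rand-int"
          scope: trx  # one value per trx: both statements see the same @d
```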

### Input/Output

From the statement's point of view, most data keys provide input _to a statement_.
But data keys can also be outputs: a statement provides a value _to a data key_.
And a data key can be both.
For example, a data key can save the insert ID from one statement (output) and provide it to another statement (input):

```sql
-- save-insert-id: @d
INSERT INTO money_transfer VALUES (...)

INSERT INTO audit_log VALUES (@d, ...)
```

Finch saves the insert ID returned from MySQL into `@d`.
Then, in the second statement, `@d` returns that insert ID value.

## Benchmark Directory

By convention, stages related to the same benchmark are put in the same directory, and [trx files](#transactions) are put in a `trx/` subdirectory.
Here's an example from the Finch repo:

```bash
finch/benchmarks/sysbench % tree .
.
├── _all.yaml
├── read-only.yaml
├── setup.yaml
├── trx
│   ├── insert-rows.sql
│   ├── read-only.sql
│   ├── schema.sql
│   ├── secondary-index.sql
│   └── write-only.sql
└── write-only.yaml
```

That is a recreation of the venerable [sysbench](https://github.com/akopytov/sysbench) benchmarks:

* Stage `setup.yaml` creates the sysbench schema and inserts rows
* Stage `read-only.yaml` is the sysbench read-only benchmark
* Stage `write-only.yaml` is the sysbench write-only benchmark

All trx files are kept under `trx/`.
These are referenced in the stage files with relative paths, like `trx/insert-rows.sql`.

`_all.yaml` is a special meta file with parameters for all stages.

These conventions and relative paths work because Finch changes its working directory to the directory of the stage file.
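
For example, you can run a stage from anywhere and the relative trx paths still resolve, because Finch first changes to the stage file's directory. (This sketch assumes the repo is checked out under your home directory.)

```bash
cd ~
finch finch/benchmarks/sysbench/read-only.yaml  # trx/read-only.sql resolves
```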

## Parameters

Parameters let you define key-value pairs once and reference them in separate places&mdash;the DRY principle: don't repeat yourself.
The simplest example is a single stage and trx file:

{{< columns >}}
_Stage_
```yaml
stage:
  params:
    rows: "10,000"
  trx:
    - file: trx.sql
```
<---> <!-- magic separator, between columns -->
_Trx_
```sql
-- rows: $params.rows
INSERT INTO t (...)
```
{{< /columns >}}

`$params.rows` in the trx is replaced by the `stage.params.rows` value (10,000).

You can also override the value on the command line:

```bash
finch benchmark.yaml --param rows=500000
```

Parameters are most useful for multi-stage benchmarks because the shared parameters can be defined in `_all.yaml` in the same directory as the stages.
When `_all.yaml` exists, Finch automatically applies its parameters to all stages.

For example, in the [Benchmark Directory](#benchmark-directory) above, parameters in `_all.yaml` are automatically applied to all stages (`setup.yaml`, `read-only.yaml`, and `write-only.yaml`).
When the setup stage runs, it inserts `params.rows` rows into the table.
When the other stages run, they access random rows between 1 and `params.rows`.
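
A sketch of what that shared `_all.yaml` might contain, assuming it uses the same `stage.params` structure as a regular stage file (the value is made up):

```yaml
stage:
  params:
    rows: "100,000"  # shared by setup.yaml, read-only.yaml, and write-only.yaml
```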