---
weight: 2
---

The following concepts are important for understanding how Finch works, and many of them are unique to Finch.

{{< hint type=tip >}}
If you're new to database benchmarking, read <a target="_new" href="https://hackmysql.com/benchmarking/">Fundamentals of Database Benchmarking</a>.
{{< /hint >}}

{{< toc >}}

## Run Levels

Finch is designed on a hierarchy of run levels:

```
global
└──stage
   └──workload
      └──exec group
         └──client group
            └──client
               └──iter
                  └──trx
                     └──statement
```

These are called _run levels_ because they determine if, when, and how various aspects of Finch run.
By understanding the run levels, you'll understand all the core concepts of Finch and how they work.

Let's start at the bottom because it's the most direct and concrete concept: SQL statements.
Moving up the hierarchy, the concepts become increasingly abstract and specific to Finch.

### Statements

A _statement_ is any SQL statement that Finch can execute on MySQL: `SELECT`, `INSERT`, `UPDATE`, `DELETE`, and so forth.

This is usually synonymous with _query_, but statement is more general because, for example, would you call `BEGIN` a query?
Maybe, but what about `CREATE` or `SET`?

Statements are the lowest level—but most important part—of running Finch: executing SQL statements on MySQL.
This shouldn't be surprising since Finch is a benchmarking tool, but what might be surprising is that you write Finch benchmarks using real SQL statements; they are not wrapped in a scripting language.
You write one or more SQL statements in a Finch transaction file.

### Transactions

Finch _transactions (trx)_ are plain text files that contain one or more SQL statements to execute.
Suppose that you want to benchmark these statements:

```sql
SELECT c FROM t WHERE id = @d

BEGIN

SELECT id FROM t WHERE user = @d

UPDATE users SET n = n + 1 WHERE id = @d

COMMIT
```

You can put those statements in one file and it will work, but it's better to separate the transactions:

{{< columns >}}
_trx-1.sql_
```sql
SELECT c FROM t WHERE id = @d
```
<---> <!-- magic separator, between columns -->
_trx-2.sql_
```sql
BEGIN

SELECT id FROM t WHERE user = @d

UPDATE users SET n = n + 1 WHERE id = @d

COMMIT
```
{{< /columns >}}

{{< hint type=note >}}
Ignore the `@d` placeholders for now; they are Finch [data keys](#data-keys) described later on this page.
{{< /hint >}}

With separate Finch trx, you can craft more advanced and interesting benchmarks, like:

|Trx|Clients|TPS|
|---|-------|---|
|trx-1.sql|100|(Unlimited)|
|trx-2.sql|20|50|

Finch will run the first trx (the single `SELECT`) on 100 clients as fast as possible, and it will run the second trx (the explicit, multi-statement SQL transaction) on only 20 clients limited to 50 transactions per second (TPS).

{{< hint type=tip >}}
Remember the example in the table above because it's reused in the following sections.
{{< /hint >}}

Finch also collects and reports statistics per-trx.
(The built-in stats reports combine all trx stats, but a [stats plugin]({{< relref "api/stats" >}}) receives per-trx stats.)

You can write a benchmark as one large pseudo-transaction (all SQL statements in one file), but for various reasons that will become clear later, it's better to write each SQL transaction in a separate Finch trx file.
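As a preview of how trx files plug into a benchmark, a stage file (covered in the sections below) lists them by path. A minimal sketch using the two files above:

```yaml
stage:
  trx:
    - file: trx-1.sql
    - file: trx-2.sql
```

The [Workload](#workload) section shows how these trx are then assigned to clients.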
### Client

A _client_ is one MySQL client (connection) that executes all SQL statements in all trx assigned to it.
Each pass through all assigned trx is one _iter_ (iteration).

You can configure Finch to run any number of clients.
This is typical for benchmarking tools, but Finch has two unique features:

* Clients can be grouped (described next)
* Clients can be assigned different trx

The second point makes Finch especially powerful and flexible.
For example, as shown in the table above, you can run 100 clients executing certain transactions, and 20 clients executing other transactions.
Having different clients execute different trx is why Finch needs client and execution groups.

### Client and Execution Groups

A _client group_ is a logical group of clients with the same trx and limits.
From the example in the table above, this snippet of a Finch stage file creates two client groups:

```yaml
workload:
  - clients: 100      # Client group 1
    trx: [trx-1.sql]  #

  - clients: 20       # Client group 2
    trx: [trx-2.sql]  #
    tps-clients: 50   #
```

The first client group will run 100 independent clients, all executing the statements in trx `trx-1.sql`.
The second client group will run 20 independent clients, all executing the statements in trx `trx-2.sql`, limited to 50 TPS across all 20 clients.
Together, these will create 120 MySQL clients (connections).

An _execution group_ is one or more client groups with the same group name.

Let's pretend you want to execute another client group _after_ client groups 1 and 2 (above) are done.
That requires putting client groups 1 and 2 in a named execution group—let's call it `FOO`—and the third client group in another named execution group—let's call it `BAR`.
The stage file snippet below creates these two execution groups.

```yaml
workload:
  # FOO ----------
  - clients: 100      # Client group 1
    trx: [trx-1.sql]  #
    group: FOO        #
  - clients: 20       # Client group 2
    trx: [trx-2.sql]  #
    tps-clients: 50   #
    group: FOO        #
  # BAR ----------
  - clients: 1        # Client group 3
    trx: [trx-3.sql]  #
    group: BAR        #
```

Finch processes the `workload` list top to bottom, so order matters when creating execution groups.
All client groups in an execution group run concurrently, and execution groups run top to bottom in workload order.

Visually, the workload processing (Y axis) and the time when the execution groups run (X axis) can be depicted like:



Both client groups in execution group `FOO` (blue) run first.
When they're done, the client group in execution group `BAR` runs next.

This is advanced workload orchestration.
Sometimes it's necessary (like with DDL statements, as sketched below), but it's optional for most simple benchmarks that just want to run everything all at once.
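To illustrate, here is a hypothetical workload that runs DDL to completion on a single client before query traffic starts; the `trx-ddl.sql` file and the group names are illustrative:

```yaml
workload:
  # Execution group DDL runs first and must finish
  # before the next execution group starts
  - clients: 1
    trx: [trx-ddl.sql]  # hypothetical trx file with DDL statements
    group: DDL
  # Execution group QUERIES runs only after DDL completes
  - clients: 100
    trx: [trx-1.sql]
    group: QUERIES
```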
### Workload

A workload is a combination of queries, data, and access patterns.
In Finch, the previous (lower) run levels combine to create a workload that is defined in the aptly named `workload` section of a stage file:

```yaml
stage:
  workload:
    - clients: 100
      trx: [trx-1.sql]
    - clients: 20
      trx: [trx-2.sql]
      tps-clients: 50  # all 20 clients
  trx:
    - file: trx-1.sql
    - file: trx-2.sql
```

That defines the workload for the running example (the table in [Transactions](#transactions)): 100 clients to execute the single `SELECT` in `trx-1.sql` and, concurrently, 20 clients to execute the transaction in `trx-2.sql` limited to 50 TPS (across all 20 clients).

In short, a Finch workload defines which clients run which transactions, plus a few features like QPS and TPS limiters.

### Stages

A stage executes a workload.
As shown immediately above, a Finch stage is configured in a YAML file.
Here's a real and more complete example:

```yaml
stage:
  name: read-only
  runtime: 60s
  mysql:
    db: finch
  workload:
    - clients: 1
  trx:
    - file: trx/read-only.sql
      data:
        id:
          generator: "rand-int"
          scope: client
          params:
            max: $params.rows
            dist: normal
        id_100:
          generator: "int-range"
          scope: client
          params:
            range: 100
            max: $params.rows
```

Every stage has a name: either set by `stage.name` in the file, or defaulting to the base file name.
Every stage needs at least one trx, defined by the `stage.trx` list.
Every stage should have an explicitly defined workload (`stage.workload`), but in some cases Finch can auto-assign the workload.

The rest is various configuration settings for the stage, workload, data keys, and so forth.

Finch runs one or more stages _sequentially_ as specified on the command line:

```bash
finch STAGE_1 [... STAGE_N]
```

This allows you to do a full setup, benchmark, and cleanup in a single run by writing and specifying those three stages.
But you don't have to; it's common to set up only once but run a benchmark repeatedly, like:

```bash
finch setup.yaml
finch benchmark.yaml
finch benchmark.yaml
finch benchmark.yaml
```

### Compute

The global run level represents one running instance of Finch (all stages) on a compute instance.
What's unique about Finch is that one instance (a server) can coordinate running the stages on other instances (clients):



This is optional, but it's easy to enable and useful for creating a massive load on a MySQL instance with several smaller and cheaper compute instances rather than one huge benchmark server.

To enable, configure a stage with:

```yaml
compute:
  instances: 2  # 1 server + 1 client
```

Then start a Finch server and one client like:

{{< columns >}}
_Server_
```sh
finch benchmark.yaml
```
<---> <!-- magic separator, between columns -->
_Client_
```sh
finch --server 10.0.0.1
```
{{< /columns >}}

The server sends a copy of the stage to the client, so there's nothing to configure on the client.
The client sends its statistics to the server, so you don't have to aggregate them—Finch does it automatically.
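The same pattern presumably scales out to a larger fleet; a sketch with three compute instances (one server plus two clients), reusing the server address from above:

```yaml
compute:
  instances: 3  # 1 server + 2 clients
```

Each of the two client machines then runs the same `finch --server 10.0.0.1` command shown above, and the server aggregates statistics from both, as before.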
## Data Keys

A _data generator_ is a plugin that generates data for [statements](#statements).
Data generators are referenced by user-defined _data keys_ like `@d`.

"Data key" and "data generator" are synonymous because they're two sides of the same coin.
For brevity, "data key" (or just `@d`) is used the most.

### Name and Configuration

Data keys are used in statements and configured with a specific data generator in the stage file:

{{< columns >}}
_Statement_
```sql
SELECT c FROM t WHERE id = @d
```
<---> <!-- magic separator, between columns -->
_Stage File_
```yaml
stage:
  trx:
    - file: read.sql
      data:
        d:  # @d
          generator: "int"
          params:
            max: 25,000
            dist: uniform
```
{{< /columns >}}

{{< hint type=note >}}
`@d` in a statement, but `d` (no `@` prefix) in a stage file because `@` is a reserved character in YAML.
{{< /hint >}}

In the stage file, `stage.trx[].data.d.generator` configures the type of data generator: data key `@d` uses the `int` (random integer) data generator.
And `stage.trx[].data.d.params` provides generator-specific configuration; in this case, configuring the `int` generator to return a uniform random distribution of numbers between 1 and 25,000.
When Finch clients execute the `SELECT` statement above, data key `@d` will be replaced with random values from this `int` data generator.

Finch ships with [built-in data generators]({{< relref "data/generators" >}}), and it's easy to [build your own]({{< relref "api/data" >}}) data generator.

### Scope

Data keys are scoped to a [run level](#run-levels).
To see why data scope is important, consider this transaction:

```sql
BEGIN

SELECT c FROM t WHERE id = @d

UPDATE t SET c=c+1 WHERE id = @d

COMMIT
```

Is `@d` the same or different in those two statements?
The answer depends on the configured scope of `@d`:

|`@d` Scope|Result|
|----------|------|
|statement|Different `@d`: one generator and value for `SELECT @d`, and another generator and value for `UPDATE @d`|
|trx|Same `@d`: one generator and one value for both|

Since the statements are in the same MySQL transaction, the intent is probably that `@d` is trx-scoped so that the value is the same in both statements.
But Finch supports very complex workloads, so [Data / Scope]({{< relref "data/scope" >}}) ranges from simple to complex.
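To make the trx-scoped case concrete, here is a minimal sketch of the stage-file configuration for the transaction above; the trx file name and the generator choice are illustrative:

```yaml
stage:
  trx:
    - file: update-row.sql  # hypothetical file containing the trx above
      data:
        d:                  # @d
          generator: "int"
          scope: trx        # same @d value for both statements in one trx
```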
### Input/Output

From the statement point of view, most data keys provide input _to a statement_.
But data keys can also be outputs: a statement provides a value _to a data key_.
And a data key can be both.
For example, a data key can save the insert ID from one statement (output) and provide it to another statement (input):

```sql
-- save-insert-id: @d
INSERT INTO money_transfer VALUES (...)

INSERT INTO audit_log VALUES (@d, ...)
```

Finch saves the insert ID returned from MySQL into `@d`.
Then, in the second query, `@d` returns that insert ID value.

## Benchmark Directory

By convention, stages related to the same benchmark are put in the same directory, and [trx files](#transactions) are put in a `trx/` subdirectory.
Here's an example from the Finch repo:

```bash
finch/benchmarks/sysbench % tree .
.
├── _all.yaml
├── read-only.yaml
├── setup.yaml
├── trx
│   ├── insert-rows.sql
│   ├── read-only.sql
│   ├── schema.sql
│   ├── secondary-index.sql
│   └── write-only.sql
└── write-only.yaml
```

That is a recreation of the venerable [sysbench](https://github.com/akopytov/sysbench) benchmarks:

* Stage `setup.yaml` creates the sysbench schema and inserts rows
* Stage `read-only.yaml` is the sysbench read-only benchmark
* Stage `write-only.yaml` is the sysbench write-only benchmark

All trx files are kept under `trx/`.
These are referenced in the stage files with relative paths, like `trx/insert-rows.sql`.

`_all.yaml` is a special meta file with parameters for all stages.

These conventions and relative paths work because Finch changes its working directory to the directory of the stage file.

## Parameters

Parameters let you define key-value pairs once and reference them in separate places—the DRY principle: don't repeat yourself.
The simplest example is a single stage and trx file:

{{< columns >}}
_Stage_
```yaml
stage:
  params:
    rows: "10,000"
  trx:
    - file: trx.sql
```
<---> <!-- magic separator, between columns -->
_Trx_
```sql
-- rows: $params.rows
INSERT INTO t (...)
```
{{< /columns >}}

`$params.rows` in the trx is replaced by the `stage.params.rows` value (10,000).

You can also override the value on the command line:

```bash
finch benchmark.yaml --param rows=500000
```

Parameters are most useful for multi-stage benchmarks because the shared parameters can be defined in `_all.yaml` in the same directory as the stages.
When `_all.yaml` exists, Finch automatically applies its parameters to all stages.

For example, in the [Benchmark Directory](#benchmark-directory) above, parameters in `_all.yaml` are automatically applied to all stages (`setup.yaml`, `read-only.yaml`, and `write-only.yaml`).
When the setup stage is run, `params.rows` rows are inserted into the table.
When the other stages are run, random rows between 1 and `params.rows` are accessed.
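For illustration, the `_all.yaml` in the sysbench directory above might define the shared row count once. This is a hypothetical sketch: the key layout is an assumption modeled on the stage-file `params` section, so check the Finch configuration docs for the exact format:

```yaml
# _all.yaml: hypothetical sketch; the key layout is assumed
# to mirror the stage-file params section
stage:
  params:
    rows: "100,000"
```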