This document explains the implementation theory behind the `MatchNode` structure in `expr_parser_node.go`.
The intent is to aid future development by giving a high-level overview of how the `dolt_branch_control` table works in relation to the `MatchNode`.

I'll first explain how it works by using an example.
The branch control tables operate over 4 columns: `database`, `branch`, `user`, and `host`.
For this example, I'll use only the first two columns to keep things simple.
Also, it's worth mentioning that the actual implementation works on sort orders[^1]; however, I'll use the original characters in this example, as well as the character stand-ins for the `singleMatch`[^2] and `anyMatch`[^3] characters.

## Insertion

The initial root `MatchNode` looks like the following:
```
Node("|", nil)
```
Our root node contains a single `columnMarker` (represented using `|`), which marks the beginning of a column.
In addition, it does not represent a destination node[^4], therefore it does not contain any data (represented with `nil` for no data, and `dataX` for data).
Let's say we want to add the database `"databomb"` and branch `"branchoo"`.
We concatenate the strings together, and insert `columnMarker`s before each column, which now gives us `"|databomb|branchoo"`.
Adding this new string to the `MatchNode` results in the following:
```
Node("|", nil)
 └─ Node("databomb|branchoo", data1)
```
You'll notice that the initial `columnMarker` was "consumed"[^5] by the root node, leaving the child with the same string sans the `columnMarker`.
Now, let's add the branch `"branchah"` with the same database:
```
Node("|", nil)
 └─ Node("databomb|branch", nil)
     ├─ Node("oo", data1)
     └─ Node("ah", data2)
```
The difference between `"|databomb|branchoo"` and `"|databomb|branchah"` begins after `"|branch"`, so we split the destination node into two children.
The first child contains the remainder of the original destination node (which is just `"oo"`) while the second child contains the remainder of the new branch (`"ah"`).
We've also moved our data (`data1`) to the new child, making the original node a parental[^6] node.
For our next addition, we'll add `"|databoost|branchius"`:
```
Node("|", nil)
 └─ Node("databo", nil)
     ├─ Node("mb|branch", nil)
     │   ├─ Node("oo", data1)
     │   └─ Node("ah", data2)
     └─ Node("ost|branchius", data3)
```
This splits the `"databomb|branch"` node, similar to when we added `"branchah"`.
`"databo"` is the common portion of `"databomb"` and `"databoost"`, so it becomes a parental node.
We still move the "data", but as it wasn't a destination node, we just moved `nil`.
Lastly, let's add `"|databomb|branchahee"`, which appends `"ee"` to our pre-existing branch `"branchah"`:
```
Node("|", nil)
 └─ Node("databo", nil)
     ├─ Node("mb|branch", nil)
     │   ├─ Node("oo", data1)
     │   └─ Node("ah", data2)
     │       └─ Node("ee", data4)
     └─ Node("ost|branchius", data3)
```
As `"branchah"` was a destination node, we simply add the `"ee"` portion as a child, while `"branchah"` remains a destination node.
As both are destination nodes, this means we can match both `"branchah"` and `"branchahee"`.
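
The insertion walkthrough above can be sketched in Go as follows. This is only a minimal illustration, not the actual `MatchNode` code: the `node` type, `newNode`, `insert`, and `dump` are hypothetical names, raw bytes are used instead of sort orders, data is modeled as an `*int` (with `nil` meaning "not a destination node"), and children are kept in a map keyed by the first byte of each child's string.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// node is a hypothetical, simplified stand-in for MatchNode.
type node struct {
	text     string         // portion of the expression held by this node
	data     *int           // nil means this is not a destination node
	children map[byte]*node // keyed by the first byte of the child's text
}

func newNode(text string, data *int) *node {
	return &node{text: text, data: data, children: map[byte]*node{}}
}

// commonPrefixLen returns the length of the shared prefix of a and b.
func commonPrefixLen(a, b string) int {
	n := 0
	for n < len(a) && n < len(b) && a[n] == b[n] {
		n++
	}
	return n
}

// insert adds expr at or beneath n. The caller must ensure that expr starts
// with the same byte as n.text (on the root, that byte is the columnMarker).
func (n *node) insert(expr string, data int) {
	p := commonPrefixLen(n.text, expr)
	if p < len(n.text) {
		// Split: the remainder of this node, along with its data and its
		// children, moves into a new child. This node becomes parental.
		rest := &node{text: n.text[p:], data: n.data, children: n.children}
		n.text = n.text[:p]
		n.data = nil
		n.children = map[byte]*node{rest.text[0]: rest}
	}
	remaining := expr[p:]
	if remaining == "" {
		// The expression ends exactly here, so this is a destination node.
		n.data = &data
		return
	}
	if child, ok := n.children[remaining[0]]; ok {
		child.insert(remaining, data)
		return
	}
	n.children[remaining[0]] = newNode(remaining, &data)
}

// dump renders the tree with two spaces of indentation per level.
func (n *node) dump(depth int) string {
	label := "nil"
	if n.data != nil {
		label = fmt.Sprintf("data%d", *n.data)
	}
	out := fmt.Sprintf("%sNode(%q, %s)\n", strings.Repeat("  ", depth), n.text, label)
	keys := make([]int, 0, len(n.children))
	for k := range n.children {
		keys = append(keys, int(k))
	}
	sort.Ints(keys) // map iteration order is random, so sort for stable output
	for _, k := range keys {
		out += n.children[byte(k)].dump(depth + 1)
	}
	return out
}

func main() {
	root := newNode("|", nil)
	root.insert("|databomb|branchoo", 1)
	root.insert("|databomb|branchah", 2)
	root.insert("|databoost|branchius", 3)
	root.insert("|databomb|branchahee", 4)
	fmt.Print(root.dump(0))
}
```

Running the four insertions from this section should reproduce the final tree shown above, although `dump` prints children in byte order rather than insertion order.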

## Deletion

Let's use the tree from the last section:
```
Node("|", nil)
 └─ Node("databo", nil)
     ├─ Node("mb|branch", nil)
     │   ├─ Node("oo", data1)
     │   └─ Node("ah", data2)
     │       └─ Node("ee", data4)
     └─ Node("ost|branchius", data3)
```
First, we'll delete the first string we inserted (`"|databomb|branchoo"`):
```
Node("|", nil)
 └─ Node("databo", nil)
     ├─ Node("mb|branchah", data2)
     │   └─ Node("ee", data4)
     └─ Node("ost|branchius", data3)
```
Multiple things occurred here.
First, removing `"|databomb|branchoo"` only removed the `("oo", data1)` node, as the remainder of the string is still used by other strings.
This gave us the intermediate subtree:
```
 └─ Node("databo", nil)
     └─ Node("mb|branch", nil)
         └─ Node("ah", data2)
             └─ Node("ee", data4)
```
Since `Node("mb|branch", nil)` is _not_ a destination node, and it now contains only a single child, we can merge that child with the node.
This just appends the child's string to the node, moves its data, and moves its children.
It is worth noting that attempting to delete a node that is not a destination node results in a no-op[^7], as it must have multiple children, and its parent either has multiple children or is a destination node.
The only exception will be the root node which will actually merge its child onto itself.

## Matching

Using the tree from the last section, we'll illustrate how matching works.
For reference, this is our tree:
```
Node("|", nil)
 └─ Node("databo", nil)
     ├─ Node("mb|branchah", data2)
     │   └─ Node("ee", data4)
     └─ Node("ost|branchius", data3)
```
Let's attempt to match `"|databomb|branch"`.
First we'll match `|` on the root, leaving us with `"databomb|branch"`.
As children are implemented using a map, we do three checks: one for `%` (`anyMatch`), one for `_` (`singleMatch`), and another for the character itself.
These three checks are what allow us to use the `LIKE` expression syntax.
In our case, `%` and `_` do not return nodes, but `d` returns `Node("databo", nil)`, so we continue.
If we had multiple nodes then we would iterate over each node in turn, but as we only have one we fully match that node.
That leaves us with `"mb|branch"`.
Again we do three checks, and match `m` for `Node("mb|branchah", data2)`.
As we do not match the full node, we end up returning `false` for the match.

In order to match, you must fully match a destination node.
There is an exception, and that deals with nodes ending with an `anyMatch`.
To explain how these match, I'll skip the nodes and just use a single string, and I'll also use a table to track the match progress.
Let's say we have the expression `"abc%fg"` and we want to match the string `"abcdeffg"`.

| expressions | string   |
|-------------|----------|
| abc%fg      | abcdeffg |

`"abc"` will exactly match with the expression, so we'll remove it from both sides.

| expressions | string |
|-------------|--------|
| %fg         | deffg  |

`d` matches `%`, which consumes it on our string, but does _not_ consume it for our expression, leaving us with `"effg"` for our string.

| expressions | string |
|-------------|--------|
| %fg         | effg   |

Similarly, `e` is consumed on the string, leaving us with `"ffg"`.

| expressions | string |
|-------------|--------|
| %fg         | ffg    |

Now for the fun part, `f` still matches against `%`, but it also matches against the `f` after the `anyMatch`, so we create a new expression.
This means that we now match against two expressions (`string` is duplicated in the table, but we only ever compare against a single string).

| expressions | string |
|-------------|--------|
| %fg         | fg     |
| g           | fg     |

Whenever the character after an `anyMatch` would match the current character, we create a new expression.
Since we had two `f`s in our string, we end up matching against `%` again, along with the following `f` again, creating a third match.
However, the expression `g` does not match our second `f`, so it is dropped.
In this case though, the third expression that we just created is another `"g"`, so I'll cross out the previous `"g"` expression to make it clear that we added a new expression.

| expressions | string |
|-------------|--------|
| %fg         | g      |
| ~~g~~       | ~~g~~  |
| g           | g      |

Finally, our first and third expressions end up matching the final character in our string.
`"%fg"` matches because `anyMatch` still matches everything, and `"g"` is an exact match.
Now we've completely matched our string, and we've got two expressions (nodes in the actual implementation).
We throw out all expressions that still have unmatched characters, except if there is only one character in the expression **and** it is the `anyMatch` character.
`"%fg"` has more than one character and does not end with `anyMatch`, so it is thrown out, leaving us with only the third expression.

There are two additional points worth mentioning.
The first is that we fold all expressions using `FoldExpression()`, which reduces an expression down to its smallest form, and also guarantees the uniqueness of the expression.
This lets us catch all duplicate expressions.
The second is that `singleMatch` and `anyMatch` do not match `columnMarker`, and `singleMatch` does not match `anyMatch`.
This ensures that column boundaries are respected, and it also gives us the ability to match expressions against other expressions to find subsets.
For example, `"abc%"` will match all strings that `"abc__"` would match, meaning `"abc__"` will only match a subset of `"abc%"`, which allows us to also block the insertion of subsets.
We cannot efficiently find supersets to remove existing nodes, so subsets may still appear in the tree depending on the order of insertion.

[^1]: A sort order is the comparing value of a character, digit, or symbol. This allows for things such as case-insensitive comparisons, as an uppercase and lowercase character will have the same sort order. They're ignored in this document as they're reliant on collations, and they'd be hard to visually parse.

[^2]: The `singleMatch` character matches any single character, digit, or symbol, except for the `anyMatch` and `columnMarker` characters.

[^3]: The `anyMatch` character will match zero or more characters, digits, or symbols. It does not match `columnMarker`s.

[^4]: Destination nodes are nodes that contain data, and represent a full expression when all parents are concatenated together along with this node's expression. Matches may only be done on destination nodes, and destination nodes may still have children. We do not have a concept of a "leaf node", hence the term "destination node".

[^5]: The code assumes that all terminal nodes are either destination or parental nodes. This is true for all nodes under the root, but will not be true for the initial root node. To make the code simpler, we don't special case the root node since it's not that important or impactful to performance. The root node can still end up as a destination node by adding two children and then deleting one of them.

[^6]: Parental nodes are nodes that contain only children (with no data), and must contain at least two children (except the root node).

[^7]: This includes the root node, as it will be the only node without a parent, and removing a node from the tree requires removing it from its parent.
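
The expression-tracking walk from the Matching section (the `"abc%fg"` tables) can be sketched in Go as follows. This is a simplified illustration under several assumptions, not the real implementation: `step` and `matches` are hypothetical names, plain expression strings stand in for nodes, raw bytes stand in for sort orders, the expression is assumed to already be folded, and a Go map used as a set stands in for the duplicate detection that folding provides.

```go
package main

import "fmt"

// Character stand-ins, as used throughout this document.
const (
	columnMarker byte = '|'
	singleMatch  byte = '_'
	anyMatch     byte = '%'
)

// step consumes one input byte c and returns every expression that is still
// viable afterwards. An empty result means the expression was dropped.
func step(expr string, c byte) []string {
	if expr == "" {
		return nil
	}
	switch expr[0] {
	case anyMatch:
		// anyMatch may match zero characters, so the rest of the expression
		// also gets a chance at c. This is what creates new expressions.
		next := step(expr[1:], c)
		// anyMatch matches c without being consumed, except on a columnMarker.
		if c != columnMarker {
			next = append(next, expr)
		}
		return next
	case singleMatch:
		// singleMatch matches neither columnMarker nor anyMatch.
		if c != columnMarker && c != anyMatch {
			return []string{expr[1:]}
		}
		return nil
	default:
		if expr[0] == c {
			return []string{expr[1:]}
		}
		return nil
	}
}

// matches reports whether expr matches the whole of s.
func matches(expr, s string) bool {
	active := map[string]struct{}{expr: {}}
	for i := 0; i < len(s); i++ {
		next := map[string]struct{}{}
		for e := range active {
			for _, r := range step(e, s[i]) {
				next[r] = struct{}{} // the set drops duplicate expressions
			}
		}
		active = next
	}
	// Keep only expressions that are fully consumed, or that consist of a
	// single trailing anyMatch.
	for e := range active {
		if e == "" || (len(e) == 1 && e[0] == anyMatch) {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(matches("abc%fg", "abcdeffg")) // the example traced in the tables
	fmt.Println(matches("abc%", "abc__"))      // "abc__" is a subset of "abc%"
	fmt.Println(matches("abc__", "abc%"))      // singleMatch does not match anyMatch
}
```

Because `singleMatch` refuses to match `anyMatch`, the same routine can compare one expression against another, which is the subset check described above.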