github.com/dolthub/dolt/go@v0.40.5-0.20240520175717-68db7794bea6/libraries/doltcore/branch_control/expr_parser_node.md

This document explains the implementation theory behind the `MatchNode` structure in `expr_parser_node.go`.
The intent is to aid future development by giving a high-level overview of how the `dolt_branch_control` table works in relation to the `MatchNode`.

I'll first explain how it works by using an example.
The branch control tables operate over 4 columns: `database`, `branch`, `user`, and `host`.
For this example, I'll use only the first two columns to keep it easier to follow.
Also, it's worth mentioning that the actual implementation works on sort orders[^1]. However, I'll use the original characters in this example, as well as the character stand-ins for the `singleMatch`[^2] and `anyMatch`[^3] characters.

## Insertion

The initial root `MatchNode` looks like the following:
```
Node("|", nil)
```
Our root node contains a single `columnMarker` (represented using `|`), which marks the beginning of a column.
In addition, it does not represent a destination node[^4], so it does not contain any data (represented with `nil` for no data, and `dataX` for data).
Let's say we want to add the database `"databomb"` and branch `"branchoo"`.
We concatenate the strings together and insert a `columnMarker` before each column, which gives us `"|databomb|branchoo"`.
Adding this new string to the `MatchNode` results in the following:
```
Node("|", nil)
 └─ Node("databomb|branchoo", data1)
```
You'll notice that the initial `columnMarker` was "consumed"[^5] by the root node, leaving the child with the same string sans the `columnMarker`.
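
The string construction described above can be sketched in a few lines of Go. This is only an illustration: `buildKey` is a hypothetical helper, and the `|` constant is the stand-in this document uses for `columnMarker`, whereas the real implementation works on sort orders.

```go
package main

import (
	"fmt"
	"strings"
)

// columnMarker is drawn as "|" in this document; the real marker is assumed to differ.
const columnMarker = "|"

// buildKey is a hypothetical helper: it writes a columnMarker before each
// column value and concatenates the results into a single string.
func buildKey(columns ...string) string {
	var sb strings.Builder
	for _, col := range columns {
		sb.WriteString(columnMarker)
		sb.WriteString(col)
	}
	return sb.String()
}

func main() {
	fmt.Println(buildKey("databomb", "branchoo")) // prints |databomb|branchoo
}
```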
Now, let's add the branch `"branchah"` with the same database:
```
Node("|", nil)
 └─ Node("databomb|branch", nil)
     ├─ Node("oo", data1)
     └─ Node("ah", data2)
```
The difference between `"|databomb|branchoo"` and `"|databomb|branchah"` begins after `"|branch"`, so we split the destination node into two children.
The first child contains the remainder of the original destination node (which is just `"oo"`), while the second child contains the remainder of the new branch (`"ah"`).
We've also moved our data (`data1`) to the first child, making the original node a parental[^6] node.
For our next addition, we'll add `"|databoost|branchius"`:
```
Node("|", nil)
 └─ Node("databo", nil)
     ├─ Node("mb|branch", nil)
     │   ├─ Node("oo", data1)
     │   └─ Node("ah", data2)
     └─ Node("ost|branchius", data3)
```
This splits the `"databomb|branch"` node, similar to when we added `"branchah"`.
`"databo"` is the common portion of `"databomb"` and `"databoost"`, so it becomes a parental node.
We still move the "data", but as the split node wasn't a destination node, we just move `nil`.
Lastly, let's add `"|databomb|branchahee"`, which appends `"ee"` to our pre-existing branch `"branchah"`:
```
Node("|", nil)
 └─ Node("databo", nil)
     ├─ Node("mb|branch", nil)
     │   ├─ Node("oo", data1)
     │   └─ Node("ah", data2)
     │       └─ Node("ee", data4)
     └─ Node("ost|branchius", data3)
```
As `"branchah"` was already a destination node, we simply add the `"ee"` portion as a child, and `"branchah"` remains a destination node.
Since both are destination nodes, we can match both `"branchah"` and `"branchahee"`.
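
The four insertions above can be reproduced with a small prefix-splitting tree. The following is a minimal sketch, not the actual `MatchNode` code: the `node` type, `insert`, `dump`, and `buildExample` are hypothetical names, plain bytes are used instead of sort orders, and an empty string plays the role of `nil` data. It also assumes every key starts with the column marker, so a key always shares at least the root's text.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// node is a simplified, hypothetical stand-in for MatchNode.
type node struct {
	text     string         // portion of the key held by this node
	data     string         // "" plays the role of nil (no data)
	children map[byte]*node // keyed by the first byte of each child's text
}

func newNode(text, data string) *node {
	return &node{text: text, data: data, children: map[byte]*node{}}
}

// insert adds key below n. It assumes key and n.text start with the same
// byte, which holds when every key begins with the column marker.
func insert(n *node, key, data string) {
	p := 0
	for p < len(n.text) && p < len(key) && n.text[p] == key[p] {
		p++
	}
	if p < len(n.text) {
		// Split: the unmatched tail, the data, and the children move to a new child.
		tail := &node{text: n.text[p:], data: n.data, children: n.children}
		n.text, n.data = n.text[:p], ""
		n.children = map[byte]*node{tail.text[0]: tail}
	}
	rest := key[p:]
	if rest == "" {
		n.data = data // the key ends here, so n becomes a destination node
		return
	}
	if child, ok := n.children[rest[0]]; ok {
		insert(child, rest, data)
		return
	}
	n.children[rest[0]] = newNode(rest, data)
}

// dump renders the tree with children in byte order (not insertion order).
func dump(n *node, depth int, sb *strings.Builder) {
	fmt.Fprintf(sb, "%s%q %q\n", strings.Repeat("  ", depth), n.text, n.data)
	keys := make([]int, 0, len(n.children))
	for k := range n.children {
		keys = append(keys, int(k))
	}
	sort.Ints(keys)
	for _, k := range keys {
		dump(n.children[byte(k)], depth+1, sb)
	}
}

// buildExample replays the insertions from this section.
func buildExample() *node {
	root := newNode("|", "")
	insert(root, "|databomb|branchoo", "data1")
	insert(root, "|databomb|branchah", "data2")
	insert(root, "|databoost|branchius", "data3")
	insert(root, "|databomb|branchahee", "data4")
	return root
}

func main() {
	var sb strings.Builder
	dump(buildExample(), 0, &sb)
	fmt.Print(sb.String())
}
```

Because the sketch stores children in a map and prints them in byte order, siblings may appear in a different order than in the diagrams, but the node texts and data are the same.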

## Deletion

Let's use the tree from the last section:
```
Node("|", nil)
 └─ Node("databo", nil)
     ├─ Node("mb|branch", nil)
     │   ├─ Node("oo", data1)
     │   └─ Node("ah", data2)
     │       └─ Node("ee", data4)
     └─ Node("ost|branchius", data3)
```
First, we'll delete the first string we inserted (`"|databomb|branchoo"`):
```
Node("|", nil)
 └─ Node("databo", nil)
     ├─ Node("mb|branchah", data2)
     │   └─ Node("ee", data4)
     └─ Node("ost|branchius", data3)
```
Multiple things occurred here.
First, removing `"|databomb|branchoo"` only removed the `("oo", data1)` node, as the remainder of the string is still used by other strings.
This gave us the intermediate subtree:
```
 └─ Node("databo", nil)
     └─ Node("mb|branch", nil)
         └─ Node("ah", data2)
             └─ Node("ee", data4)
```
Since `Node("mb|branch", nil)` is _not_ a destination node, and it now contains only a single child, we can merge that child with the node.
Merging appends the child's string to the node, then moves the child's data and children onto the node.
It is worth noting that attempting to delete a node that is not a destination node results in a no-op[^7], as it must have multiple children, and its parent either has multiple children or is a destination node.
The only exception is the root node, which will actually merge its child onto itself.
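
The removal and merge steps can be sketched as follows. This is an illustration under assumptions rather than the actual `Remove` logic: `node`, `mk`, `mergeOnlyChild`, `remove`, and `exampleTree` are hypothetical names, an empty string stands for `nil` data, and the root's special-case merge is deliberately left out, so the caller strips the root's `"|"` before descending.

```go
package main

import (
	"fmt"
	"strings"
)

// node is a simplified, hypothetical stand-in for MatchNode.
type node struct {
	text     string
	data     string // "" plays the role of nil (no data)
	children map[byte]*node
}

// mk builds a node with the given children (a helper for this example only).
func mk(text, data string, kids ...*node) *node {
	n := &node{text: text, data: data, children: map[byte]*node{}}
	for _, k := range kids {
		n.children[k.text[0]] = k
	}
	return n
}

// mergeOnlyChild appends the single child's text to n and takes over its data and children.
func mergeOnlyChild(n *node) {
	for _, only := range n.children {
		n.text += only.text
		n.data, n.children = only.data, only.children
		return
	}
}

// remove deletes the expression rest from below n and reports whether anything changed.
func remove(n *node, rest string) bool {
	if rest == "" {
		return false
	}
	child, ok := n.children[rest[0]]
	if !ok || !strings.HasPrefix(rest, child.text) {
		return false // no such expression in the tree
	}
	if rest = rest[len(child.text):]; rest != "" {
		if !remove(child, rest) {
			return false
		}
	} else if child.data == "" {
		return false // not a destination node: no-op
	} else {
		child.data = ""
	}
	// A node left without data is dropped if childless, or merged with an only child.
	if child.data == "" {
		switch len(child.children) {
		case 0:
			delete(n.children, child.text[0])
		case 1:
			mergeOnlyChild(child)
		}
	}
	return true
}

// exampleTree rebuilds the tree shown at the start of this section.
func exampleTree() *node {
	return mk("|", "",
		mk("databo", "",
			mk("mb|branch", "",
				mk("oo", "data1"),
				mk("ah", "data2", mk("ee", "data4"))),
			mk("ost|branchius", "data3")))
}

func main() {
	root := exampleTree()
	// Strip the root's own text before descending.
	fmt.Println(remove(root, strings.TrimPrefix("|databomb|branchoo", root.text)))
	mb := root.children['d'].children['m']
	fmt.Println(mb.text, mb.data)
}
```

Merging never changes the first byte of a node's text, so the parent's map entry for the merged node stays valid.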

## Matching

Using the tree from the last section, we'll illustrate how matching works.
For reference, this is our tree:
```
Node("|", nil)
 └─ Node("databo", nil)
     ├─ Node("mb|branchah", data2)
     │   └─ Node("ee", data4)
     └─ Node("ost|branchius", data3)
```
Let's attempt to match `"|databomb|branch"`.
First we'll match `|` on the root, leaving us with `"databomb|branch"`.
As children are implemented using a map, we do three checks: one for `%` (`anyMatch`), one for `_` (`singleMatch`), and another for the character itself.
These three checks are what allow us to use the `LIKE` expression syntax.
In our case, `%` and `_` do not return nodes, but `d` returns `Node("databo", nil)`, so we continue.
If we had multiple nodes, then we would iterate over each node in turn, but as we only have one, we fully match that node.
That leaves us with `"mb|branch"`.
Again we do three checks, and `m` returns `Node("mb|branchah", data2)`.
As the remaining `"mb|branch"` does not match the full node, we end up returning `false` for the match.
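
The walk above can be sketched for the literal-character case. This is a deliberately reduced illustration: `node`, `mk`, `exampleTree`, and `match` are hypothetical names, an empty string stands for `nil` data, and the extra child lookups for `%` and `_` are omitted, so only exact characters are followed.

```go
package main

import (
	"fmt"
	"strings"
)

// node is a simplified, hypothetical stand-in for MatchNode.
type node struct {
	text     string
	data     string // "" plays the role of nil (no data)
	children map[byte]*node
}

// mk builds a node with the given children (a helper for this example only).
func mk(text, data string, kids ...*node) *node {
	n := &node{text: text, data: data, children: map[byte]*node{}}
	for _, k := range kids {
		n.children[k.text[0]] = k
	}
	return n
}

// exampleTree rebuilds the tree shown in this section.
func exampleTree() *node {
	return mk("|", "",
		mk("databo", "",
			mk("mb|branchah", "data2", mk("ee", "data4")),
			mk("ost|branchius", "data3")))
}

// match walks the tree using literal characters only (no wildcard children).
func match(n *node, s string) (string, bool) {
	if !strings.HasPrefix(s, n.text) {
		return "", false // the node's text must be matched in full
	}
	rest := s[len(n.text):]
	if rest == "" {
		return n.data, n.data != "" // only destination nodes count as matches
	}
	child, ok := n.children[rest[0]]
	if !ok {
		return "", false
	}
	return match(child, rest)
}

func main() {
	root := exampleTree()
	for _, s := range []string{"|databomb|branch", "|databomb|branchah", "|databomb|branchahee"} {
		data, ok := match(root, s)
		fmt.Printf("%q -> %q %v\n", s, data, ok)
	}
}
```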

In order to match, you must fully match a destination node.
There is an exception, and that deals with nodes ending with an `anyMatch`.
To explain how these match, I'll skip the nodes and just use a single string, and I'll also use a table to track the match progress.
Let's say we have the expression `"abc%fg"` and we want to match the string `"abcdeffg"`.

| expressions | string   |
|-------------|----------|
| abc%fg      | abcdeffg |

`"abc"` will exactly match with the expression, so we'll remove it from both sides.

| expressions | string |
|-------------|--------|
| %fg         | deffg  |

`d` matches `%`, which consumes it on our string, but does _not_ consume it for our expression, leaving us with `"effg"` for our string.

| expressions | string |
|-------------|--------|
| %fg         | effg   |

Similarly, `e` is consumed on the string, leaving us with `"ffg"`.

| expressions | string |
|-------------|--------|
| %fg         | ffg    |

Now for the fun part: `f` still matches against `%`, but it also matches against the `f` after the `anyMatch`, so we create a new expression.
This means that we now match against two expressions (`string` is duplicated in the table, but we only ever compare against a single string).

| expressions | string |
|-------------|--------|
| %fg         | fg     |
| g           | fg     |

Whenever the character after an `anyMatch` would match the current character, we create a new expression.
Since we had two `f`s in our string, we end up matching against `%` again, along with the following `f` again, creating a third expression.
However, the expression `g` does not match our second `f`, so it is dropped.
In this case though, the third expression that we just created is another `"g"`, so I'll cross out the previous `"g"` expression to make it clear that we added a new expression.

| expressions | string |
|-------------|--------|
| %fg         | g      |
| ~~g~~       | ~~g~~  |
| g           | g      |

Finally, our first and third expressions end up matching the final character in our string.
`"%fg"` matches because `anyMatch` still matches everything, and `"g"` is an exact match.
Now we've completely matched our string, and we've got two expressions (nodes in the actual implementation).
We throw out all expressions that still have unmatched characters, except if there is only one character in the expression **and** it is the `anyMatch` character.
`"%fg"` has more than one character and does not end with `anyMatch`, so it is thrown out, leaving us with only the third expression.

There are two additional points worth mentioning.
The first is that we fold all expressions using `FoldExpression()`, which reduces an expression down to its smallest form, and also guarantees the uniqueness of the expression.
This lets us catch all duplicate expressions.
The second is that `singleMatch` and `anyMatch` do not match `columnMarker`, and `singleMatch` does not match `anyMatch`.
This ensures that column boundaries are respected, and it also gives us the ability to match expressions against other expressions to find subsets.
For example, `"abc%"` matches every string that `"abc__"` would match, so the strings matched by `"abc__"` are a subset of those matched by `"abc%"`. This allows us to block the insertion of subsets.
We cannot efficiently find supersets to remove existing nodes, so subsets may still appear in the tree depending on the order of insertion.
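
The expression-tracking walk-through and the wildcard rules in this section can be sketched on a single expression instead of a node tree. This is a rough illustration, not the actual implementation: `step`, `matchesOne`, `remaining`, and `matchExpr` are hypothetical names, the byte constants are this document's stand-ins rather than sort orders, and the expression is assumed to be already folded so that an `anyMatch` is never directly followed by another wildcard. Each active position `i` represents the remaining expression `expr[i:]`, and keeping positions in a slice of flags means duplicates collapse automatically.

```go
package main

import "fmt"

// Character stand-ins used throughout this document.
const (
	columnMarker = '|'
	singleMatch  = '_'
	anyMatch     = '%'
)

// matchesOne reports whether the non-anyMatch expression element e matches c.
func matchesOne(e, c byte) bool {
	if e == singleMatch {
		return c != columnMarker && c != anyMatch
	}
	return e == c
}

// step advances every active expression over the byte c.
// active[i] means "expr[i:] still has to be matched".
func step(expr string, active []bool, c byte) []bool {
	next := make([]bool, len(expr)+1)
	for pos := 0; pos < len(expr); pos++ {
		if !active[pos] {
			continue
		}
		switch e := expr[pos]; {
		case e == anyMatch:
			if c != columnMarker {
				next[pos] = true // anyMatch consumes c but stays in the expression
			}
			// If the element after anyMatch also matches c, fork a new expression.
			if pos+1 < len(expr) && matchesOne(expr[pos+1], c) {
				next[pos+2] = true
			}
		case matchesOne(e, c):
			next[pos+1] = true
		}
	}
	return next
}

// remaining lists the expressions that are still active.
func remaining(expr string, active []bool) []string {
	var out []string
	for pos, on := range active {
		if on {
			out = append(out, expr[pos:])
		}
	}
	return out
}

// matchExpr reports whether expr matches all of s.
func matchExpr(expr, s string) bool {
	active := make([]bool, len(expr)+1)
	active[0] = true
	for i := 0; i < len(s); i++ {
		active = step(expr, active, s[i])
	}
	// Keep fully consumed expressions, or ones reduced to a lone anyMatch.
	if active[len(expr)] {
		return true
	}
	return len(expr) > 0 && active[len(expr)-1] && expr[len(expr)-1] == anyMatch
}

func main() {
	expr, s := "abc%fg", "abcdeffg"
	active := make([]bool, len(expr)+1)
	active[0] = true
	for i := 0; i < len(s); i++ {
		active = step(expr, active, s[i])
		fmt.Printf("after %q: %q\n", s[i], remaining(expr, active))
	}
	fmt.Println(matchExpr(expr, s))
	// Matching an expression against another expression finds subsets.
	fmt.Println(matchExpr("abc%", "abc__"), matchExpr("abc__", "abc%"))
}
```

Feeding one expression to another as the "string" is how the subset relation shows up in this sketch: `"abc%"` accepts `"abc__"`, while `"abc__"` rejects `"abc%"` because `singleMatch` does not match `anyMatch`.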

[^1]: A sort order is the comparison value of a character, digit, or symbol. This allows for things such as case-insensitive comparisons, as an uppercase and lowercase character will have the same sort order. They're ignored in this document as they're reliant on collations, and they'd be hard to visually parse.

[^2]: The `singleMatch` character matches any single character, digit, or symbol, except for the `anyMatch` and `columnMarker` characters.

[^3]: The `anyMatch` character will match zero or more characters, digits, or symbols. It does not match `columnMarker`s.

[^4]: Destination nodes are nodes that contain data, and represent a full expression when all parents are concatenated together along with this node's expression. Matches may only be done on destination nodes, and destination nodes may still have children. We do not have a concept of a "leaf node", hence the term "destination node".

[^5]: The code assumes that all terminal nodes are either destination or parental nodes. This is true for all nodes under the root, but will not be true for the initial root node. To make the code simpler, we don't special case the root node since it's not that important or impactful to performance. The root node can still end up as a destination node by adding two children and then deleting one of them.

[^6]: Parental nodes are nodes that contain only children (with no data), and must contain at least two children (except the root node).

[^7]: This includes the root node, as it will be the only node without a parent, and removing a node from the tree requires removing it from its parent.