github.com/cockroachdb/cockroachdb-parser@v0.23.3-0.20240213214944-911057d40c9a/pkg/sql/inverted/expression.go

// Copyright 2020 The Cockroach Authors.
//
// Use of this software is governed by the Business Source License
// included in the file licenses/BSL.txt.
//
// As of the Change Date specified in that file, in accordance with
// the Business Source License, use of this software will be governed
// by the Apache License, Version 2.0, included in the file
// licenses/APL.txt.

package inverted

import (
	"bytes"
	"fmt"
	"strconv"

	"github.com/cockroachdb/cockroachdb-parser/pkg/keysbase"
	"github.com/cockroachdb/cockroachdb-parser/pkg/util/treeprinter"
	"github.com/cockroachdb/errors"
	"github.com/cockroachdb/redact"
)

// EncVal is the encoded form of a value in the inverted column. This library
// does not care about how the value is encoded. The following encoding comment
// is only relevant for integration purposes, and to justify the use of an
// encoded form.
//
// If the inverted column stores an encoded datum, the encoding is
// DatumEncoding_ASCENDING_KEY, and is performed using
// keyside.Encode(nil /* prefix */, val tree.Datum, encoding.Ascending).
// It is used to represent spans of the inverted column.
//
// It would be ideal if the inverted column only contained Datums, since we
// could then work with a Datum here. However, JSON breaks that approach:
//   - JSON inverted columns use a custom encoding that uses a special byte
//     jsonInvertedIndex, followed by the bytes produced by the various
//     implementations of the encodeInvertedIndexKey() method in the JSON
//     interface. This could be worked around by using a JSON datum that
//     represents a single path as the start key of the span, and representing
//     [start, start] spans.
//     We would special case the encoding logic to
//     recognize that it is dealing with JSON (we have similar special path code
//     for JSON elsewhere). But this is insufficient (next bullet).
//   - Expressions like x ? 'b' don't have operands that are JSON, but can be
//     represented using a span on the inverted column.
//
// So we make it the job of the caller of this library to encode the inverted
// column. Note that the second bullet above has some similarities with the
// behavior in makeStringPrefixSpan(), except there we can represent the start
// and end keys using the string type.
type EncVal []byte

// High-level context:
//
// 1. Semantics of inverted index spans and effect on union and intersection
//
// Unlike spans of a normal index (e.g. the spans in the constraints package),
// the spans of the inverted index cannot be immediately "evaluated" since
// they represent sets of primary keys that we won't know about until we do
// the scan. Using a simple example: [a, d) \intersection [c, f) is not [c, d)
// since the same primary key K could be found under a and f and be part of
// the result. More precisely, the above expression can be simplified to: [c,
// d) \union ([a, c) \intersection [d, f))
//
// For regular indexes, since each primary key is indexed in one row of the
// index, we can be sure that the same primary key will not appear in both of
// the non-overlapping spans [a, c) and [d, f), so we can immediately throw
// that part away knowing that it is the empty set. This discarding is not
// possible with inverted indexes, though the factoring can be useful for
// speed of execution (it does not limit what we need to scan) and for
// selectivity estimation when making optimizer choices.
//
// One could try to construct a general library that handles both the
// cases handled in the constraints package and here, but the complexity seems
// high.
// Instead, this package is more general than constraints in a few ways
// but simplifies most other things (so overall much simpler):
//   - All the inverted spans are [start, end).
//   - It handles spans only on the inverted column, with a way to plug in spans
//     generated for the PK columns. For more discussion on multi-column
//     constraints for inverted indexes, see the long comment at the end of the
//     file.
//
// 2. Representing a canonical "inverted expression"
//
// This package represents a canonical form for all inverted expressions -- it
// is more than the description of a scan. The evaluation machinery will
// evaluate this expression over an inverted index. The support to build that
// canonical form expression is independent of how the original expression is
// represented: instead of taking an opt.Expr parameter and traversing it
// itself, this library assumes the caller is doing a traversal. This is
// partly because the representation of the original expression for the single
// table scan case and the invertedJoiner case are not the same: the latter
// starts with an expression with two unspecified rows, and after the left
// side row is bound (partial application), this library needs to be used to
// construct the Expression.

// Span is a span of the inverted index. Represents [start, end).
type Span struct {
	Start, End EncVal
}

// MakeSingleValSpan constructs a span equivalent to [val, val].
func MakeSingleValSpan(val EncVal) Span {
	end := EncVal(keysbase.PrefixEnd(val))
	return Span{Start: val, End: end}
}

// IsSingleVal returns true iff the span is equivalent to [val, val].
func (s Span) IsSingleVal() bool {
	return bytes.Equal(keysbase.PrefixEnd(s.Start), s.End)
}

// Equals returns true if this span has the same start and end as the given
// span.
func (s Span) Equals(other Span) bool {
	if !bytes.Equal(s.Start, other.Start) {
		return false
	}
	return bytes.Equal(s.End, other.End)
}

// Spans is a slice of Span objects.
type Spans []Span

// ContainsKey returns whether the span contains the given key.
func (s Span) ContainsKey(key EncVal) bool {
	return bytes.Compare(key, s.Start) >= 0 && bytes.Compare(key, s.End) < 0
}

// Equals returns true if this Spans has the same spans as the given
// Spans, in the same order.
func (is Spans) Equals(other Spans) bool {
	if len(is) != len(other) {
		return false
	}
	for i := range is {
		if !is[i].Equals(other[i]) {
			return false
		}
	}
	return true
}

// Format pretty-prints the spans.
func (is Spans) Format(tp treeprinter.Node, label string, redactable bool) {
	if len(is) == 0 {
		tp.Childf("%s: empty", label)
		return
	}
	if len(is) == 1 {
		tp.Childf("%s: %s", label, formatSpan(is[0], redactable))
		return
	}
	n := tp.Child(label)
	for i := 0; i < len(is); i++ {
		n.Child(formatSpan(is[i], redactable))
	}
}

func formatSpan(span Span, redactable bool) string {
	end := span.End
	spanEndOpenOrClosed := ')'
	if span.IsSingleVal() {
		end = span.Start
		spanEndOpenOrClosed = ']'
	}
	output := fmt.Sprintf("[%s, %s%c", strconv.Quote(string(span.Start)),
		strconv.Quote(string(end)), spanEndOpenOrClosed)
	if redactable {
		output = string(redact.Sprintf("%s", redact.Unsafe(output)))
	}
	return output
}

// Len implements sort.Interface.
func (is Spans) Len() int { return len(is) }

// Less implements sort.Interface, when Spans is known to contain
// non-overlapping spans.
func (is Spans) Less(i, j int) bool {
	return bytes.Compare(is[i].Start, is[j].Start) < 0
}

// Swap implements the sort.Interface.
func (is Spans) Swap(i, j int) {
	is[i], is[j] = is[j], is[i]
}

// Start implements the span.KeyableInvertedSpans interface.
func (is Spans) Start(i int) []byte {
	return is[i].Start
}

// End implements the span.KeyableInvertedSpans interface.
func (is Spans) End(i int) []byte {
	return is[i].End
}

// Expression is the interface representing an expression or sub-expression
// to be evaluated on the inverted index. Any implementation can be used in the
// builder functions And() and Or(), but in practice there are two useful
// implementations provided here:
//   - SpanExpression: this is the normal expression representing unions and
//     intersections over spans of the inverted index. A SpanExpression is the
//     root of an expression tree containing other SpanExpressions (there is one
//     exception when a SpanExpression tree can contain non-SpanExpressions,
//     discussed below for Joins).
//   - NonInvertedColExpression: this is a marker expression representing the universal
//     span, due to it being an expression on the non-inverted column. This only appears in
//     expression trees with a single node, since Anding with such an expression simply
//     changes the tightness to false and Oring with this expression replaces the
//     other expression with a NonInvertedColExpression.
//
// # Optimizer cost estimation
//
// There are two cases:
//
//   - Single table expression: after generating the Expression, the
//     optimizer will check that it is a *SpanExpression -- if not, it is a
//     NonInvertedColExpression, which implies a full inverted index scan, and
//     it is definitely not worth using the inverted index. There are two costs for
//     using the inverted index:
//
//     - The scan cost: this should be estimated by using SpanExpression.SpansToRead.
//
//     - The cardinality of the output set after evaluating the expression: this
//       requires a traversal of the expression to assign cardinality to the
//       spans in each FactoredUnionSpans (this could be done using a mean,
//       or using histograms). The cardinality of a SpanExpression is the
//       cardinality of the union of its FactoredUnionSpans and the intersection
//       of its left and right expressions. If the cardinality of the original
//       table is C (i.e., the number of primary keys), and we have two subsets
//       of cardinality C1 and C2, we can assume that each set itself is a
//       drawing without replacement from the original table. This can be
//       used to derive the expected cardinality of the union of the two sets
//       and the intersection of the two sets.
//
//   - Join expression: Assigning a cost is hard since there are two
//     parameters, corresponding to the left and right columns. In some cases,
//     like Geospatial, the expression that could be generated is a black-box to
//     the optimizer since the quad-tree traversal is unknown until partial
//     application (when one of the parameters is known). Minimally, we do need to
//     know whether the user expression is going to cause a full inverted index
//     scan due to parts of the expression referring to non-inverted columns.
//     The optimizer will provide its own placeholder implementation of
//     Expression into which it can embed whatever information it wants.
//     Let's call this the UnknownExpression -- it will only exist at the
//     leaves of the expression tree. It will use this UnknownExpression
//     whenever there is an expression involving both the inverted columns. If
//     the final expression is a NonInvertedColExpression, it is definitely not
//     worth using the inverted index.
//     If the final expression is an
//     UnknownExpression (the tree must be a single node) or a *SpanExpression,
//     the optimizer could either conjure up some magic cost number or try to
//     compose one using costs assigned to each span (as described in the
//     previous bullet) and to each leaf-level UnknownExpression.
//
// # Query evaluation
//
// There are two cases:
//   - Single table expression: The optimizer will convert the *SpanExpression
//     into a form that is passed to the evaluation machinery, which can recreate
//     the *SpanExpression and evaluate it. The optimizer will have constructed
//     the spans for the evaluation using SpanExpression.SpansToRead, so the
//     expression evaluating code does not need to concern itself with the spans
//     to be read.
//     e.g. the query was of the form ... WHERE x <@ '{"a":1, "b":2}'::json
//     The optimizer constructs a *SpanExpression, and
//     - uses the serialization of the *SpanExpression as the spec for a processor
//       that will evaluate the expression.
//     - uses the SpanExpression.SpansToRead to specify the inverted index
//       spans that must be read and fed to the processor.
//   - Join expression: The optimizer had an expression tree with the root as
//     a *SpanExpression or an UnknownExpression. Therefore it knows that after
//     partial application the expression will be a *SpanExpression. It passes the
//     inverted expression with two unknowns, as a string, to the join execution
//     machinery. The optimizer provides a way to do partial application for each
//     input row, and returns a *SpanExpression, which is evaluated on the
//     inverted index.
//     e.g. the join query was of the form
//     ... ON t1.x <@ t2.y OR (t1.x @> t2.y AND t2.y @> '{"a":1, "b":2}'::json)
//     and the optimizer decides to use the inverted index on t2.y.
//     The optimizer
//     passes an expression string with two unknowns in the InvertedJoinerSpec,
//     where @1 represents t1.x and @2 represents t2.y. For each input row of
//     t1 the inverted join processor asks the optimizer to apply the value of @1
//     and return a *SpanExpression, which the join processor will evaluate on
//     the inverted index.
type Expression interface {
	// IsTight returns whether the inverted expression is tight, i.e., will the
	// original expression not need to be reevaluated on each row output by the
	// query evaluation over the inverted index.
	IsTight() bool
	// SetNotTight sets tight to false.
	SetNotTight()
	// Copy makes a copy of the inverted expression.
	Copy() Expression
}

// SpanExpression is an implementation of Expression.
//
// TODO(sumeer): after integration and experimentation with optimizer costing,
// decide if we can eliminate the generality of the Expression
// interface. If we don't need that generality, we can merge SpanExpression
// and SpanExpressionProto.
type SpanExpression struct {
	// Tight mirrors the definition of IsTight().
	Tight bool

	// Unique is true if the spans in FactoredUnionSpans are guaranteed not to
	// produce duplicate primary keys. Otherwise, Unique is false. Unique may
	// be true for certain JSON or Array SpanExpressions, and it holds when
	// unique SpanExpressions are combined with And. It does not hold when
	// these SpanExpressions are combined with Or.
	//
	// Once a SpanExpression is built, this field is relevant if the root
	// SpanExpression has no children (i.e., Operator is None). In this case,
	// Unique is used to determine whether an invertedFilter is needed on top
	// of the inverted index scan to deduplicate keys (an invertedFilter is
	// always necessary if Operator is not None).
	Unique bool

	// SpansToRead are the spans to read from the inverted index
	// to evaluate this SpanExpression. These are non-overlapping
	// and sorted. If left or right contains a non-SpanExpression,
	// it is not included in the spanning union.
	// To illustrate, consider a made up example:
	// [2, 10) \intersection [6, 14)
	// is factored into:
	// [6, 10) \union ([2, 6) \intersection [10, 14))
	// The root expression has a spanning union of [2, 14).
	SpansToRead Spans

	// FactoredUnionSpans are the spans to be unioned. These are
	// non-overlapping and sorted. As mentioned earlier, factoring
	// can result in faster evaluation and can be useful for
	// optimizer cost estimation.
	//
	// Using the same example, the FactoredUnionSpans will be
	// [6, 10). Now let's extend the above example and say that
	// it was just a sub-expression in a bigger expression, and
	// the full expression involved an intersection of that
	// sub-expression and [5, 8). After factoring, we would get
	// [6, 8) \union ([5, 6) \intersection ([8, 10) \union ([2, 6) \intersection [10, 14))))
	// The top-level expression has FactoredUnionSpans [6, 8), and the left and
	// right children have factoredUnionSpans [5, 6) and [8, 10) respectively.
	// The SpansToRead of this top-level expression is still [2, 14) since the
	// intersection with [5, 8) did not add anything to the spans to read. Also
	// note that, despite factoring, there are overlapping spans in this
	// expression, specifically [2, 6) and [5, 6).
	FactoredUnionSpans Spans

	// Operator is the set operation to apply to Left and Right.
	// When this is union or intersection, both Left and Right are non-nil,
	// else both are nil.
	Operator SetOperator
	Left     Expression
	Right    Expression
}

var _ Expression = (*SpanExpression)(nil)

// IsTight implements the Expression interface.
func (s *SpanExpression) IsTight() bool {
	return s.Tight
}

// SetNotTight implements the Expression interface.
func (s *SpanExpression) SetNotTight() {
	s.Tight = false
}

// Copy implements the Expression interface.
//
// Copy makes a copy of the SpanExpression and returns it. Copy recurses into
// the children and makes copies of them as well, so the new struct is
// independent from the old. It does *not* perform a deep copy of the
// SpansToRead or FactoredUnionSpans slices, however, because those slices are
// never modified in place and therefore are safe to reuse.
func (s *SpanExpression) Copy() Expression {
	res := &SpanExpression{
		Tight:              s.Tight,
		Unique:             s.Unique,
		SpansToRead:        s.SpansToRead,
		FactoredUnionSpans: s.FactoredUnionSpans,
		Operator:           s.Operator,
	}
	if s.Left != nil {
		res.Left = s.Left.Copy()
	}
	if s.Right != nil {
		res.Right = s.Right.Copy()
	}
	return res
}

func (s *SpanExpression) String() string {
	tp := treeprinter.New()
	n := tp.Child("span expression")
	s.Format(n, true /* includeSpansToRead */, false /* redactable */)
	return tp.String()
}

// Format pretty-prints the SpanExpression.
func (s *SpanExpression) Format(tp treeprinter.Node, includeSpansToRead, redactable bool) {
	tp.Childf("tight: %t, unique: %t", s.Tight, s.Unique)
	if includeSpansToRead {
		s.SpansToRead.Format(tp, "to read", redactable)
	}
	s.FactoredUnionSpans.Format(tp, "union spans", redactable)
	if s.Operator == None {
		return
	}
	switch s.Operator {
	case SetUnion:
		tp = tp.Child("UNION")
	case SetIntersection:
		tp = tp.Child("INTERSECTION")
	}
	formatExpression(tp, s.Left, includeSpansToRead, redactable)
	formatExpression(tp, s.Right, includeSpansToRead, redactable)
}

func formatExpression(tp treeprinter.Node, expr Expression, includeSpansToRead, redactable bool) {
	switch e := expr.(type) {
	case *SpanExpression:
		n := tp.Child("span expression")
		e.Format(n, includeSpansToRead, redactable)
	default:
		tp.Child(fmt.Sprintf("%v", e))
	}
}

// ToProto constructs a SpanExpressionProto for execution. It should
// be called on an expression tree that contains only *SpanExpressions.
func (s *SpanExpression) ToProto() *SpanExpressionProto {
	if s == nil {
		return nil
	}
	proto := &SpanExpressionProto{
		SpansToRead: getProtoSpans(s.SpansToRead),
		Node:        *s.getProtoNode(),
	}
	return proto
}

func getProtoSpans(spans []Span) []SpanExpressionProto_Span {
	out := make([]SpanExpressionProto_Span, len(spans))
	for i := range spans {
		out[i] = SpanExpressionProto_Span{
			Start: spans[i].Start,
			End:   spans[i].End,
		}
	}
	return out
}

func (s *SpanExpression) getProtoNode() *SpanExpressionProto_Node {
	node := &SpanExpressionProto_Node{
		FactoredUnionSpans: getProtoSpans(s.FactoredUnionSpans),
		Operator:           s.Operator,
	}
	if node.Operator != None {
		node.Left = s.Left.(*SpanExpression).getProtoNode()
		node.Right = s.Right.(*SpanExpression).getProtoNode()
	}
	return node
}

// NonInvertedColExpression is an expression to use for parts of the
// user expression that do not involve the inverted index.
type NonInvertedColExpression struct{}

var _ Expression = NonInvertedColExpression{}

// IsTight implements the Expression interface.
func (n NonInvertedColExpression) IsTight() bool {
	return false
}

// SetNotTight implements the Expression interface.
func (n NonInvertedColExpression) SetNotTight() {}

// Copy implements the Expression interface.
func (n NonInvertedColExpression) Copy() Expression {
	return NonInvertedColExpression{}
}

// SpanExpressionProtoSpans is a slice of SpanExpressionProto_Span.
type SpanExpressionProtoSpans []SpanExpressionProto_Span

// Len implements the span.InvertedSpans interface.
func (s SpanExpressionProtoSpans) Len() int {
	return len(s)
}

// Start implements the span.InvertedSpans interface.
func (s SpanExpressionProtoSpans) Start(i int) []byte {
	return s[i].Start
}

// End implements the span.InvertedSpans interface.
func (s SpanExpressionProtoSpans) End(i int) []byte {
	return s[i].End
}

// ExprForSpan constructs a leaf-level SpanExpression for an inverted
// expression. Note that these leaf-level expressions may also have
// tight = false. Geospatial functions are all non-tight.
//
// For JSON, expressions like x <@ '{"a":1, "b":2}'::json will have
// tight = false (say SpanA, SpanB correspond to "a":1 and "b":2
// respectively). A tight expression would require the following set
// evaluation:
// Set(SpanA) \union Set(SpanB) - Set(ComplementSpan(SpanA \spanunion SpanB))
// where ComplementSpan(X) is everything in the inverted index
// except for X.
// Since ComplementSpan(SpanA \spanunion SpanB) is likely to
// be very wide when SpanA and SpanB are narrow, or vice versa,
// this tight expression would be very costly to evaluate.
func ExprForSpan(span Span, tight bool) *SpanExpression {
	return &SpanExpression{
		Tight:              tight,
		SpansToRead:        []Span{span},
		FactoredUnionSpans: []Span{span},
	}
}

// ContainsKeys traverses the SpanExpression to determine whether the span
// expression contains the given keys. It is primarily used for testing.
func (s *SpanExpression) ContainsKeys(keys [][]byte) (bool, error) {
	if s.Operator == None && len(s.FactoredUnionSpans) == 0 {
		return false, nil
	}

	// FactoredUnionSpans represents a union over the spans, so any span in the slice
	// can contain any of the keys.
	if len(s.FactoredUnionSpans) > 0 {
		for _, span := range s.FactoredUnionSpans {
			for _, key := range keys {
				if span.ContainsKey(key) {
					return true, nil
				}
			}
		}
	}

	if s.Operator == None {
		return false, nil
	}

	// This is either a UNION or INTERSECTION.
	leftRes, err := s.Left.(*SpanExpression).ContainsKeys(keys)
	if err != nil {
		return false, err
	}
	if leftRes && s.Operator == SetUnion {
		return true, nil
	}

	rightRes, err := s.Right.(*SpanExpression).ContainsKeys(keys)
	if err != nil {
		return false, err
	}
	switch s.Operator {
	case SetIntersection:
		return leftRes && rightRes, nil
	case SetUnion:
		return leftRes || rightRes, nil
	default:
		return false, errors.AssertionFailedf("invalid operator %v", s.Operator)
	}
}

// And of two boolean expressions. This function may modify both the left and
// right Expressions.
func And(left, right Expression) Expression {
	switch l := left.(type) {
	case *SpanExpression:
		switch r := right.(type) {
		case *SpanExpression:
			return intersectSpanExpressions(l, r)
		case NonInvertedColExpression:
			left.SetNotTight()
			return left
		default:
			return opSpanExpressionAndDefault(l, right, SetIntersection)
		}
	case NonInvertedColExpression:
		right.SetNotTight()
		return right
	default:
		switch r := right.(type) {
		case *SpanExpression:
			return opSpanExpressionAndDefault(r, left, SetIntersection)
		case NonInvertedColExpression:
			left.SetNotTight()
			return left
		default:
			return &SpanExpression{
				Tight:    left.IsTight() && right.IsTight(),
				Operator: SetIntersection,
				Left:     left,
				Right:    right,
			}
		}
	}
}

// Or of two boolean expressions. This function may modify both the left and
// right Expressions.
func Or(left, right Expression) Expression {
	switch l := left.(type) {
	case *SpanExpression:
		switch r := right.(type) {
		case *SpanExpression:
			return unionSpanExpressions(l, r)
		case NonInvertedColExpression:
			return r
		default:
			return opSpanExpressionAndDefault(l, right, SetUnion)
		}
	case NonInvertedColExpression:
		return left
	default:
		switch r := right.(type) {
		case *SpanExpression:
			return opSpanExpressionAndDefault(r, left, SetUnion)
		case NonInvertedColExpression:
			return right
		default:
			return &SpanExpression{
				Tight:    left.IsTight() && right.IsTight(),
				Operator: SetUnion,
				Left:     left,
				Right:    right,
			}
		}
	}
}

// Helper that applies op to a left-side that is a *SpanExpression and
// a right-side that is an unknown implementation of Expression.
func opSpanExpressionAndDefault(
	left *SpanExpression, right Expression, op SetOperator,
) *SpanExpression {
	expr := &SpanExpression{
		Tight: left.IsTight() && right.IsTight(),
		// The SpansToRead is a lower-bound in this case. Note that
		// such an expression is only used for Join costing.
		SpansToRead: left.SpansToRead,
		Operator:    op,
		Left:        left,
		Right:       right,
	}
	if op == SetUnion {
		// Promote the left-side union spans. We don't know anything
		// about the right-side.
		expr.FactoredUnionSpans = left.FactoredUnionSpans
		left.FactoredUnionSpans = nil
	}
	// Else SetIntersection -- we can't factor anything if one side is
	// unknown.
	return expr
}

// Intersects two SpanExpressions.
func intersectSpanExpressions(left, right *SpanExpression) *SpanExpression {
	expr := &SpanExpression{
		Tight:  left.Tight && right.Tight,
		Unique: left.Unique && right.Unique,

		// We calculate SpansToRead as the union of the left and right sides as a
		// first approximation, but this may result in too many spans if either of
		// the children are pruned below. SpansToRead will be recomputed in
		// tryPruneChildren if needed. (It is important that SpansToRead be exactly
		// what would be computed if a caller traversed the tree and explicitly
		// unioned all the FactoredUnionSpans, and no looser, since the execution
		// code path relies on this property.)
		SpansToRead:        unionSpans(left.SpansToRead, right.SpansToRead),
		FactoredUnionSpans: intersectSpans(left.FactoredUnionSpans, right.FactoredUnionSpans),
		Operator:           SetIntersection,
		Left:               left,
		Right:              right,
	}
	if expr.FactoredUnionSpans != nil {
		left.FactoredUnionSpans = subtractSpans(left.FactoredUnionSpans, expr.FactoredUnionSpans)
		right.FactoredUnionSpans = subtractSpans(right.FactoredUnionSpans, expr.FactoredUnionSpans)
	}
	tryPruneChildren(expr, SetIntersection)
	return expr
}

// Unions two SpanExpressions.
func unionSpanExpressions(left, right *SpanExpression) *SpanExpression {
	expr := &SpanExpression{
		Tight:              left.Tight && right.Tight,
		SpansToRead:        unionSpans(left.SpansToRead, right.SpansToRead),
		FactoredUnionSpans: unionSpans(left.FactoredUnionSpans, right.FactoredUnionSpans),
		Operator:           SetUnion,
		Left:               left,
		Right:              right,
	}
	left.FactoredUnionSpans = nil
	right.FactoredUnionSpans = nil
	tryPruneChildren(expr, SetUnion)
	return expr
}

// tryPruneChildren takes an expr with two child *SpanExpression and removes
// children when safe to do so.
func tryPruneChildren(expr *SpanExpression, op SetOperator) {
	isEmptyExpr := func(e *SpanExpression) bool {
		return len(e.FactoredUnionSpans) == 0 && e.Left == nil && e.Right == nil
	}
	if isEmptyExpr(expr.Left.(*SpanExpression)) {
		expr.Left = nil
	}
	if isEmptyExpr(expr.Right.(*SpanExpression)) {
		expr.Right = nil
	}
	if expr.Operator == SetUnion {
		// Promotes the left and right sub-expressions of child to the parent
		// expr, when the other child is empty.
		promoteChild := func(child *SpanExpression) {
			// For SetUnion, the FactoredUnionSpans for the child is already nil
			// since it has been unioned into expr. Therefore, we don't need to
			// update expr.FactoredUnionSpans.
			expr.Operator = child.Operator
			expr.Left = child.Left
			expr.Right = child.Right

			// If child.FactoredUnionSpans is non-empty, we need to recalculate
			// SpansToRead since it may have contained some spans that were
			// removed by discarding child.FactoredUnionSpans.
			if child.FactoredUnionSpans != nil {
				expr.SpansToRead = expr.FactoredUnionSpans
				if expr.Left != nil {
					expr.SpansToRead = unionSpans(expr.SpansToRead, expr.Left.(*SpanExpression).SpansToRead)
				}
				if expr.Right != nil {
					expr.SpansToRead = unionSpans(expr.SpansToRead, expr.Right.(*SpanExpression).SpansToRead)
				}
			}
		}
		promoteLeft := expr.Left != nil && expr.Right == nil
		promoteRight := expr.Left == nil && expr.Right != nil
		if promoteLeft {
			promoteChild(expr.Left.(*SpanExpression))
		}
		if promoteRight {
			promoteChild(expr.Right.(*SpanExpression))
		}
	} else if expr.Operator == SetIntersection {
		// The result of intersecting with the empty set is the empty set. In
		// this case, we can discard the non-empty child.
		if expr.Left == nil {
			expr.Right = nil
		} else if expr.Right == nil {
			expr.Left = nil
		}
	}
	if expr.Left == nil && expr.Right == nil {
		expr.Operator = None
		expr.SpansToRead = expr.FactoredUnionSpans
	}
}

func unionSpans(left []Span, right []Span) []Span {
	if len(left) == 0 {
		return right
	}
	if len(right) == 0 {
		return left
	}
	// Both left and right are non-empty.

	// The output spans.
	var spans []Span
	// Contains the current span being merged into.
	var mergeSpan Span
	// Indexes into left and right.
	var i, j int

	swapLeftRight := func() {
		i, j = j, i
		left, right = right, left
	}

	// makeMergeSpan is used to initialize mergeSpan. It uses the span from
	// left or right that has an earlier start. Additionally, it swaps left
	// and right if the mergeSpan was initialized using right, so the mergeSpan
	// is coming from the left.
	// REQUIRES: i < len(left) || j < len(right).
	makeMergeSpan := func() {
		if i >= len(left) || (j < len(right) && bytes.Compare(left[i].Start, right[j].Start) > 0) {
			swapLeftRight()
		}
		mergeSpan = left[i]
		i++
	}
	makeMergeSpan()
	// We only need to merge spans into mergeSpan while we have more
	// spans from the right. Once the right is exhausted we know that
	// the remaining spans from the left (including mergeSpan) can be
	// appended to the output unchanged.
	for j < len(right) {
		cmpEndStart := cmpExcEndWithIncStart(mergeSpan, right[j])
		if cmpEndStart >= 0 {
			if extendSpanEnd(&mergeSpan, right[j], cmpEndStart) {
				// The right side extended the span, so now it plays the
				// role of the left.
				j++
				swapLeftRight()
			} else {
				j++
			}
			continue
		}
		// Cannot extend mergeSpan.
		spans = append(spans, mergeSpan)
		makeMergeSpan()
	}
	spans = append(spans, mergeSpan)
	spans = append(spans, left[i:]...)
	return spans
}

func intersectSpans(left []Span, right []Span) []Span {
	if len(left) == 0 || len(right) == 0 {
		return nil
	}

	// Both left and right are non-empty

	// The output spans.
	var spans []Span
	// Indexes into left and right.
	var i, j int
	// Contains the current span being intersected.
	var mergeSpan Span
	var mergeSpanInitialized bool
	swapLeftRight := func() {
		i, j = j, i
		left, right = right, left
	}
	// Initializes mergeSpan. Additionally, arranges it such that the span has
	// come from left. i continues to refer to the index used to initialize
	// mergeSpan.
	// REQUIRES: i < len(left) && j < len(right)
	makeMergeSpan := func() {
		if bytes.Compare(left[i].Start, right[j].Start) > 0 {
			swapLeftRight()
		}
		mergeSpan = left[i]
		mergeSpanInitialized = true
	}

	for i < len(left) && j < len(right) {
		if !mergeSpanInitialized {
			makeMergeSpan()
		}
		cmpEndStart := cmpExcEndWithIncStart(mergeSpan, right[j])
		if cmpEndStart > 0 {
			// The intersection of these spans is non-empty.
			mergeSpan.Start = right[j].Start
			mergeSpanEnd := mergeSpan.End
			cmpEnds := cmpEnds(mergeSpan, right[j])
			if cmpEnds > 0 {
				// The right span constrains the end of the intersection.
				mergeSpan.End = right[j].End
			}
			// Else the mergeSpan is not constrained by the right span,
			// so it is already ready to be appended to the output.

			// Append to the spans that will be output.
			spans = append(spans, mergeSpan)

			// Now decide whether we should continue intersecting with what
			// is left of the original mergeSpan.
			if cmpEnds < 0 {
				// The mergeSpan constrained the end of the intersection.
				// So nothing left of the original mergeSpan. The rightSpan
				// should become the new mergeSpan since it is guaranteed to
				// have a start <= the next span from the left and it has
				// something leftover.
				i++
				mergeSpan.Start = mergeSpan.End
				mergeSpan.End = right[j].End
				swapLeftRight()
			} else if cmpEnds == 0 {
				// Both spans end at the same key, so both are consumed.
				i++
				j++
				mergeSpanInitialized = false
			} else {
				// The right span constrained the end of the intersection.
				// So there is something left of the original mergeSpan.
				j++
				mergeSpan.Start = mergeSpan.End
				mergeSpan.End = mergeSpanEnd
			}
		} else {
			// Intersection is empty
			i++
			mergeSpanInitialized = false
		}
	}
	return spans
}

// subtractSpans subtracts right from left, under the assumption that right is a
// subset of left.
func subtractSpans(left []Span, right []Span) []Span {
	if len(right) == 0 {
		return left
	}
	// Both left and right are non-empty

	// The output spans.
	var out []Span

	// Contains the current span being subtracted.
	var mergeSpan Span
	var mergeSpanInitialized bool
	// Indexes into left and right.
	var i, j int
	for j < len(right) {
		if !mergeSpanInitialized {
			mergeSpan = left[i]
			mergeSpanInitialized = true
		}
		cmpEndStart := cmpExcEndWithIncStart(mergeSpan, right[j])
		if cmpEndStart > 0 {
			// mergeSpan will have some part subtracted by the right span.
			cmpStart := bytes.Compare(mergeSpan.Start, right[j].Start)
			if cmpStart < 0 {
				// There is some part of mergeSpan before the right span starts. Add it
				// to the output.
				out = append(out, Span{Start: mergeSpan.Start, End: right[j].Start})
				mergeSpan.Start = right[j].Start
			}
			// Else cmpStart == 0, since the right side is a subset of the left.

			// Invariant: mergeSpan.start == right[j].start
			cmpEnd := cmpEnds(mergeSpan, right[j])
			if cmpEnd == 0 {
				// Both spans end at the same key, so both are consumed.
				i++
				j++
				mergeSpanInitialized = false
				continue
			}

			// Invariant: cmpEnd > 0, since the right side is a subset of the left.
			mergeSpan.Start = right[j].End
			j++
		} else {
			// Right span starts after mergeSpan ends.
			out = append(out, mergeSpan)
			i++
			mergeSpanInitialized = false
		}
	}
	if mergeSpanInitialized {
		out = append(out, mergeSpan)
		i++
	}
	out = append(out, left[i:]...)
	return out
}

// Compares the exclusive end key of left with the inclusive start key of
// right.
// Examples:
//   [a, b), [b, c) == 0
//   [a, a\x00), [a, c) == +1
//   [a, c), [d, e) == -1
func cmpExcEndWithIncStart(left, right Span) int {
	return bytes.Compare(left.End, right.Start)
}

// Extends the left span using the right span. Will return true iff
// left was extended, i.e., the left.end < right.end, and
// false otherwise.
func extendSpanEnd(left *Span, right Span, cmpExcEndIncStart int) bool {
	if cmpExcEndIncStart == 0 {
		// Definitely extends.
		left.End = right.End
		return true
	}
	// cmpExcEndIncStart > 0, so left covers at least right.start. But may not
	// cover right.end.
	if bytes.Compare(left.End, right.End) < 0 {
		left.End = right.End
		return true
	}
	return false
}

// Compares the end keys of left and right.
func cmpEnds(left, right Span) int {
	return bytes.Compare(left.End, right.End)
}

// Representing multi-column constraints
//
// Building multi-column constraints is complicated even for the regular
// index case (see idxconstraint and constraints packages). Because the
// constraints code is not generating a full expression and it can immediately
// evaluate intersections, it takes an approach of traversing the expression
// at monotonically increasing column offsets (e.g. makeSpansForAnd() and the
// offset+delta logic).
// This allows it to build up Key constraints in increasing
// order of the index column (say labeled @1, @2, ...), instead of needing to
// handle an arbitrary order, and then combine them using Constraint.Combine().
// This repeated traversal at different offsets is a simplification and can
// result in spans that are wider than optimal.
//
// Example 1:
//   index-constraints vars=(int, int, int) index=(@1 not null, @2 not null, @3 not null)
//   ((@1 = 1 AND @3 = 5) OR (@1 = 3 AND @3 = 10)) AND (@2 = 76)
//   ----
//   [/1/76/5 - /1/76/5]
//   [/1/76/10 - /1/76/10]
//   [/3/76/5 - /3/76/5]
//   [/3/76/10 - /3/76/10]
//   Remaining filter: ((@1 = 1) AND (@3 = 5)) OR ((@1 = 3) AND (@3 = 10))
//
// Note that in example 1 we produce the spans with the single key /1/76/10
// and /3/76/5 which are not possible -- this is because the application of
// the @3 constraint happened at the higher level after the @2 constraint had
// been applied, and at that higher level the @3 constraint was now the set
// {5, 10}, so it needed to be applied to both the /1/76 and /3/76 span.
//
// In contrast example 2 is able to apply the @2 constraint inside each of the
// sub-expressions and results in a tight span.
//
// Example 2:
//   index-constraints vars=(int, int, int) index=(@1 not null, @2 not null, @3 not null)
//   ((@1 = 1 AND @2 = 5) OR (@1 = 3 AND @2 = 10)) AND (@3 = 76)
//   ----
//   [/1/5/76 - /1/5/76]
//   [/3/10/76 - /3/10/76]
//
// We note that:
// - Working with spans of only the inverted column is much easier for factoring.
// - It is not yet clear how important multi-column constraints are for inverted
//   index performance.
// - We cannot adopt the approach of traversing at monotonically increasing
//   column offsets since we are trying to build an expression.
//   We want to
//   traverse once, to build up the expression tree. One possibility would be
//   to incrementally build the expression tree with the caller traversing once
//   but additionally keep track of the span constraints for each PK column at
//   each node in the already built expression tree. To illustrate, consider
//   an example 1' akin to example 1 where @1 is an inverted column:
//     ((f(@1, 1) AND @3 = 5) OR (f(@1, 3) AND @3 = 10)) AND (@2 = 76)
//   and the functions f(@1, 1) and f(@1, 3) each give a single value for the
//   inverted column (this could be something like f @> '{"a":1}'::json).
//   Say we already have the expression tree built for:
//     ((f(@1, 1) AND @3 = 5) OR (f(@1, 3) AND @3 = 10))
//   When the constraint for (@2 = 76) is anded we traverse this built tree
//   and add this constraint to each node. Note that we are delaying building
//   something akin to a constraint.Key since we are encountering the constraints
//   in arbitrary column order. Then after the full expression tree is built,
//   one traverses and builds the inverted spans and primary key spans (the
//   latter could reuse constraint.Span for each node).
// - The previous bullet is doable but complicated, and especially increases the
//   complexity of factoring spans when unioning and intersecting while building
//   up sub-expressions. One needs to either factor taking into account the
//   current per-column PK constraints or delay it until the end (I gave up
//   half-way through writing the code, as it doesn't seem worth the complexity).
//
// In the following we adopt a much simpler approach. The caller generates the
// inverted index expression and the PK spans separately.
//
// - Generating the inverted index expression: The caller does a single
//   traversal and calls the methods in this package.
//   For every
//   leaf-sub-expression on the non-inverted columns it uses a marker
//   NonInvertedColExpression. Anding a NonInvertedColExpression results in a
//   non-tight inverted expression and Oring a NonInvertedColExpression
//   results in discarding the inverted expression built so far. This package
//   does factoring for ands and ors involving inverted expressions
//   incrementally, and this factoring is straightforward since it involves a
//   single column.
// - Generating the PK spans (optional): The caller can use something like
//   idxconstraint, pretending that the PK columns of the inverted index
//   are the index columns. Every leaf inverted sub-expression is replaced
//   with true. This is because when not representing the inverted column
//   constraint we need the weakest possible constraint on the PK columns.
//   Using example 1' again,
//     ((f(@1, 1) AND @3 = 5) OR (f(@1, 3) AND @3 = 10)) AND (@2 = 76)
//   when generating the PK constraints we would use
//     (@3 = 5 OR @3 = 10) AND (@2 = 76)
//   So the PK spans will be:
//     [/76/5, /76/5], [/76/10, /76/10]
// - The spans in the inverted index expression can be composed with the
//   spans of the PK columns to narrow wherever possible.
//   Continuing with example 1', the inverted index expression will be
//   v11 \union v13, corresponding to f(@1, 1) and f(@1, 3), where each
//   of v11 and v13 are single value spans. And this expression is not tight
//   (because of the anding with NonInvertedColExpression).
//   The PK spans, [/76/5, /76/5], [/76/10, /76/10], are also single key spans.
//   This is a favorable example in that we can compose all these singleton
//   spans to get single inverted index rows:
//     /v11/76/5, /v11/76/10, /v13/76/5, /v13/76/10
//   (this is also a contrived example since with such narrow constraints
//   on the PK, we would possibly not use the inverted index).
//
//   If one constructs example 2' (derived from example 2 in the same way
//   we derived example 1'), we would have
//     ((f(@1, 1) AND @2 = 5) OR (f(@1, 3) AND @2 = 10)) AND (@3 = 76)
//   and the inverted index expression would be:
//     v11 \union v13
//   and the PK spans:
//     [/5/76, /5/76], [/10/76, /10/76]
//   And so the inverted index rows would be:
//     /v11/5/76, /v11/10/76, /v13/5/76, /v13/10/76
//   This is worse than example 2 (and resembles example 1 and 1') since
//   we are taking the cross-product.
//
// TODO(sumeer): write this composition code.
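
The composition described for example 1' can be illustrated with a minimal, standalone sketch. It is not the composition code the TODO refers to and it is not part of this package: the helper name composeSingletons is hypothetical, and the sketch assumes the favorable case from the comment, where every inverted span and every PK span is a single key, so composing them reduces to a cross-product of key prefixes. Readable strings stand in for the opaque encoded keys.

```go
package main

import "fmt"

// composeSingletons is a hypothetical helper, not part of the inverted
// package. It assumes that every inverted span and every PK span covers
// exactly one key, and returns one candidate inverted index row per
// (inverted value, PK key) pair by concatenating the two.
func composeSingletons(invertedVals, pkKeys [][]byte) [][]byte {
	rows := make([][]byte, 0, len(invertedVals)*len(pkKeys))
	for _, v := range invertedVals {
		for _, pk := range pkKeys {
			// Copy into a fresh slice so rows never alias the inputs.
			row := make([]byte, 0, len(v)+len(pk))
			row = append(row, v...)
			row = append(row, pk...)
			rows = append(rows, row)
		}
	}
	return rows
}

func main() {
	// Stand-ins for the encoded values v11 and v13 from example 1'.
	invertedVals := [][]byte{[]byte("/v11"), []byte("/v13")}
	// Stand-ins for the singleton PK spans [/76/5, /76/5] and [/76/10, /76/10].
	pkKeys := [][]byte{[]byte("/76/5"), []byte("/76/10")}

	for _, row := range composeSingletons(invertedVals, pkKeys) {
		fmt.Println(string(row))
	}
}
```

With the inputs above, the sketch enumerates the four rows listed for example 1'. Because it takes the full cross-product, it also shows the loss of precision noted for example 2': pairs such as v11 with /10/76 are produced even though the original filter excludes them.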