// Copyright 2013 The Go Authors. All rights reserved.
// Use of this source code is governed by a BSD-style
// license that can be found in the LICENSE file.

/*

Package pointer implements Andersen's analysis, an inclusion-based
pointer analysis algorithm first described in (Andersen, 1994).

A pointer analysis relates every pointer expression in a whole program
to the set of memory locations to which it might point. This
information can be used to construct a call graph of the program that
precisely represents the destinations of dynamic function and method
calls. It can also be used to determine, for example, which pairs of
channel operations operate on the same channel.

The package allows the client to request a set of expressions of
interest for which the points-to information will be returned once the
analysis is complete. In addition, the client may request that a
callgraph is constructed. The example program in example_test.go
demonstrates both of these features. Clients should not request more
information than they need since it may increase the cost of the
analysis significantly.


CLASSIFICATION

Our algorithm is INCLUSION-BASED: the points-to sets for x and y will
be related by pts(y) ⊇ pts(x) if the program contains the statement
y = x.

It is FLOW-INSENSITIVE: it ignores all control flow constructs and the
order of statements in a program. It is therefore a "MAY ALIAS"
analysis: its facts are of the form "P may/may not point to L",
not "P must point to L".

It is FIELD-SENSITIVE: it builds separate points-to sets for distinct
fields, such as x and y in struct { x, y *int }.

It is mostly CONTEXT-INSENSITIVE: most functions are analyzed once,
so values can flow in at one call to the function and return out at
another.
Only some smaller functions are analyzed with consideration
of their calling context.

It has a CONTEXT-SENSITIVE HEAP: objects are named by both allocation
site and context, so the objects returned by two distinct calls to f:
    func f() *T { return new(T) }
are distinguished up to the limits of the calling context.

It is a WHOLE PROGRAM analysis: it requires SSA-form IR for the
complete Go program and summaries for native code.

See the (Hind, PASTE'01) survey paper for an explanation of these terms.


SOUNDNESS

The analysis is fully sound when invoked on pure Go programs that do not
use reflection or unsafe.Pointer conversions. In other words, if there
is any possible execution of the program in which pointer P may point to
object O, the analysis will report that fact.


REFLECTION

By default, the "reflect" library is ignored by the analysis, as if all
its functions were no-ops, but if the client enables the Reflection flag,
the analysis will make a reasonable attempt to model the effects of
calls into this library. However, this comes at a significant
performance cost, and not all features of that library are yet
implemented. In addition, some simplifying approximations must be made
to ensure that the analysis terminates; for example, reflection can be
used to construct an infinite set of types and values of those types,
but the analysis arbitrarily bounds the depth of such types.

Most but not all reflection operations are supported.
In particular, addressable reflect.Values are not yet implemented, so
operations such as (reflect.Value).Set have no analytic effect.
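
As a rough illustration of the INCLUSION-BASED and FLOW-INSENSITIVE
properties listed under CLASSIFICATION, the following sketch solves a
handful of constraints by naive iteration to a fixed point. It is not
this package's solver: the constraint type, the string-named variables
and the solve function are hypothetical, and objects are simply named
after the variables whose address is taken.

```go
package main

import (
	"fmt"
	"sort"
)

// constraint is a hypothetical encoding of one pointer statement.
// kind is one of:
//
//	"addr"   dst = &src
//	"copy"   dst = src
//	"load"   dst = *src
//	"store"  *dst = src
type constraint struct{ kind, dst, src string }

// solve applies every constraint repeatedly until no points-to set grows.
// The order of cs never affects the result (flow insensitivity).
func solve(cs []constraint) map[string]map[string]bool {
	pts := map[string]map[string]bool{}
	add := func(v, obj string) bool {
		if pts[v] == nil {
			pts[v] = map[string]bool{}
		}
		if pts[v][obj] {
			return false
		}
		pts[v][obj] = true
		return true
	}
	// include enforces pts(dst) ⊇ pts(src) and reports whether pts(dst) grew.
	include := func(dst, src string) bool {
		grew := false
		for obj := range pts[src] {
			if add(dst, obj) {
				grew = true
			}
		}
		return grew
	}
	for changed := true; changed; {
		changed = false
		for _, c := range cs {
			switch c.kind {
			case "addr":
				changed = add(c.dst, c.src) || changed
			case "copy":
				changed = include(c.dst, c.src) || changed
			case "load":
				for obj := range pts[c.src] {
					changed = include(c.dst, obj) || changed
				}
			case "store":
				for obj := range pts[c.dst] {
					changed = include(obj, c.src) || changed
				}
			}
		}
	}
	return pts
}

// names returns the members of a points-to set in sorted order.
func names(set map[string]bool) []string {
	var out []string
	for obj := range set {
		out = append(out, obj)
	}
	sort.Strings(out)
	return out
}

func main() {
	pts := solve([]constraint{
		{"copy", "y", "x"}, // y = x, listed before x is assigned
		{"addr", "x", "a"}, // x = &a
		{"addr", "x", "b"}, // x = &b
	})
	fmt.Println(names(pts["y"])) // prints [a b]
}
```

Because statement order is ignored, y may point to both a and b even
though the copy is listed first; this is the "may alias" behaviour
described under CLASSIFICATION.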


UNSAFE POINTER CONVERSIONS

The pointer analysis makes no attempt to understand aliasing between the
operand x and result y of an unsafe.Pointer conversion:
    y = (*T)(unsafe.Pointer(x))
It is as if the conversion allocated an entirely new object:
    y = new(T)


NATIVE CODE

The analysis cannot model the aliasing effects of functions written in
languages other than Go, such as runtime intrinsics in C or assembly, or
code accessed via cgo. The result is as if such functions are no-ops.
However, various important intrinsics are understood by the analysis,
along with built-ins such as append.

The analysis currently provides no way for users to specify the aliasing
effects of native code.

------------------------------------------------------------------------

IMPLEMENTATION

The remaining documentation is intended for package maintainers and
pointer analysis specialists. Maintainers should have a solid
understanding of the referenced papers (especially those by H&L and PKH)
before making significant changes.

The implementation is similar to that described in (Pearce et al,
PASTE'04). Unlike many algorithms which interleave constraint
generation and solving, constructing the callgraph as they go, this
implementation for the most part observes a phase ordering (generation
before solving), with only simple (copy) constraints being generated
during solving. (The exception is reflection, which creates various
constraints during solving as new types flow to reflect.Value
operations.) This improves the traction of presolver optimisations,
but imposes certain restrictions, e.g. potential context sensitivity
is limited since all variants must be created a priori.


TERMINOLOGY

A type is said to be "pointer-like" if it is a reference to an object.
Pointer-like types include pointers and also interfaces, maps, channels,
functions and slices.

We occasionally use C's x->f notation to distinguish the case where x
is a struct pointer from x.f where x is a struct value.

Pointer analysis literature (and our comments) often uses the notation
dst=*src+offset to mean something different than what it means in Go.
It means: for each node index p in pts(src), pts(dst) includes the
points-to set of the node p+offset. Similarly *dst+offset=src is used
for store constraints, and dst=src+offset for offset-address
constraints, in which the node index p+offset itself is added to
pts(dst).


NODES

Nodes are the key data structure of the analysis, and have a dual role:
they represent both constraint variables (equivalence classes of
pointers) and members of points-to sets (things that can be pointed
at, i.e. "labels").

Nodes are naturally numbered. The numbering enables compact
representations of sets of nodes such as bitvectors (or BDDs); and the
ordering enables a very cheap way to group related nodes together. For
example, passing n parameters consists of generating n parallel
constraints from caller+i to callee+i for 0<=i<n.

The zero nodeid means "not a pointer". For simplicity, we generate flow
constraints even for non-pointer types such as int. The pointer
equivalence (PE) presolver optimization detects which variables cannot
point to anything; this includes not only all variables of non-pointer
types (such as int) but also variables of pointer-like types if they are
always nil, or are parameters to a function that is never called.

Each node represents a scalar part of a value or object.
Aggregate types (structs, tuples, arrays) are recursively flattened
out into a sequential list of scalar component types, and all the
elements of an array are represented by a single node. (The
flattening of a basic type is a list containing a single node.)
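
The flattening described above can be sketched with a toy type
representation. This is a minimal illustration, not the package's
implementation: Type, Basic, Pointer, Array, Field, Struct and flatten
are hypothetical names, and the sketch omits the identity nodes that
the analysis adds for structs and arrays.

```go
package main

import "fmt"

// Type is a hypothetical, much simplified stand-in for go/types.Type.
type Type interface{ isType() }

type Basic struct{ Name string } // e.g. int, string
type Pointer struct{ Elem Type } // *Elem
type Array struct {
	Len  int
	Elem Type
}
type Field struct {
	Name string
	Type Type
}
type Struct struct{ Fields []Field }

func (Basic) isType()   {}
func (Pointer) isType() {}
func (Array) isType()   {}
func (Struct) isType()  {}

// flatten lists the scalar components of t in order, one per node.
// path names the component being described, for display only.
func flatten(t Type, path string) []string {
	switch t := t.(type) {
	case Struct:
		var out []string
		for _, f := range t.Fields {
			out = append(out, flatten(f.Type, path+"."+f.Name)...)
		}
		return out
	case Array:
		// Every element shares the same nodes, so Len is ignored.
		return flatten(t.Elem, path+"[*]")
	default:
		// Basic and pointer types are scalars: a single node each.
		return []string{path}
	}
}

func main() {
	t := Struct{Fields: []Field{
		{Name: "n", Type: Basic{Name: "int"}},
		{Name: "a", Type: Array{Len: 3, Elem: Struct{Fields: []Field{
			{Name: "p", Type: Pointer{Elem: Basic{Name: "int"}}},
			{Name: "s", Type: Basic{Name: "string"}},
		}}}},
	}}
	for i, c := range flatten(t, "x") {
		fmt.Println(i, c)
	}
}
```

The nested struct inside the array contributes two nodes in total,
however long the array is, because all elements share one flattening.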

Nodes are connected into a graph with various kinds of labelled edges:
simple edges (or copy constraints) represent value flow. Complex
edges (load, store, etc) trigger the creation of new simple edges
during the solving phase.


OBJECTS

Conceptually, an "object" is a contiguous sequence of nodes denoting
an addressable location: something that a pointer can point to. The
first node of an object has a non-nil obj field containing information
about the allocation: its size, context, and ssa.Value.

Objects include:
  - functions and globals;
  - variable allocations in the stack frame or heap;
  - maps, channels and slices created by calls to make();
  - allocations to construct an interface;
  - allocations caused by conversions, e.g. []byte(str);
  - arrays allocated by calls to append().

Many objects have no Go types. For example, the func, map and chan type
kinds in Go are all varieties of pointers, but their respective objects
are actual functions (executable code), maps (hash tables), and channels
(synchronized queues). Given the way we model interfaces, they too are
pointers to "tagged" objects with no Go type. And an *ssa.Global denotes
the address of a global variable, but the object for a Global is the
actual data. So, the type of an ssa.Value that creates an object is
"off by one indirection": a pointer to the object.

The individual nodes of an object are sometimes referred to as "labels".

For uniformity, all objects have a non-zero number of fields, even those
of the empty type struct{}. (All arrays are treated as if of length 1,
so there are no empty arrays. The empty tuple is never address-taken,
so is never an object.)


TAGGED OBJECTS

A tagged object has the following layout:

    T          -- obj.flags ⊇ {otTagged}
    v
    ...

The T node's typ field is the dynamic type of the "payload": the value
v which follows, flattened out. The T node's obj has the otTagged
flag.

Tagged objects are needed when generalizing across types: interfaces,
reflect.Values, reflect.Types. Each of these three types is modelled
as a pointer that exclusively points to tagged objects.

Tagged objects may be indirect (obj.flags ⊇ {otIndirect}) meaning that
the value v is not of type T but *T; this is used only for
reflect.Values that represent lvalues. (These are not implemented yet.)


ANALYSIS ABSTRACTION OF EACH TYPE

Variables of the following "scalar" types may be represented by a
single node: basic types, pointers, channels, maps, slices, 'func'
pointers, interfaces.

Pointers
  Nothing to say here, oddly.

Basic types (bool, string, numbers, unsafe.Pointer)
  Currently all fields in the flattening of a type, including
  non-pointer basic types such as int, are represented in objects and
  values. Though non-pointer nodes within values are uninteresting,
  non-pointer nodes in objects may be useful (if address-taken)
  because they permit the analysis to deduce, in this example,

      var s struct{ ...; x int; ... }
      p := &s.x

  that p points to s.x. If we ignored such object fields, we could only
  say that p points somewhere within s.

  All other basic types are ignored. Expressions of these types have
  zero nodeid, and fields of these types within other aggregate types
  are omitted.

  unsafe.Pointers are not modelled as pointers, so a conversion of an
  unsafe.Pointer to *T is (unsoundly) treated as equivalent to new(T).

Channels
  An expression of type 'chan T' is a kind of pointer that points
  exclusively to channel objects, i.e. objects created by MakeChan (or
  reflection).

  'chan T' is treated like *T.
  *ssa.MakeChan is treated as equivalent to new(T).
  *ssa.Send and receive (*ssa.UnOp(ARROW)) are equivalent to store
  and load.

Maps
  An expression of type 'map[K]V' is a kind of pointer that points
  exclusively to map objects, i.e. objects created by MakeMap (or
  reflection).

  map[K]V is treated like *M where M = struct{k K; v V}.
  *ssa.MakeMap is equivalent to new(M).
  *ssa.MapUpdate is equivalent to *y=x where *y and x have type M.
  *ssa.Lookup is equivalent to y=x.v where x has type *M.

Slices
  A slice []T, which dynamically resembles a struct{array *T, len, cap int},
  is treated as if it were just a *T pointer; the len and cap fields are
  ignored.

  *ssa.MakeSlice is treated like new([1]T): an allocation of a
  singleton array.
  *ssa.Index on a slice is equivalent to a load.
  *ssa.IndexAddr on a slice returns the address of the sole element of the
  slice, i.e. the same address.
  *ssa.Slice is treated as a simple copy.

Functions
  An expression of type 'func...' is a kind of pointer that points
  exclusively to function objects.

  A function object has the following layout:

      identity         -- typ:*types.Signature; obj.flags ⊇ {otFunction}
      params_0         -- (the receiver, if a method)
      ...
      params_n-1
      results_0
      ...
      results_m-1

  There may be multiple function objects for the same *ssa.Function
  due to context-sensitive treatment of some functions.

  The first node is the function's identity node.
  Associated with every callsite is a special "targets" variable,
  whose pts() contains the identity node of each function to which
  the call may dispatch. Identity nodes are not otherwise used during
  the analysis, but we construct the call graph from the pts()
  solution for such nodes.

  The following block of contiguous nodes represents the flattened-out
  types of the parameters ("P-block") and results ("R-block") of the
  function object.

  The treatment of free variables of closures (*ssa.FreeVar) is like
  that of global variables; it is not context-sensitive.
  *ssa.MakeClosure instructions create copy edges to Captures.

  A Go value of type 'func' (i.e. a pointer to one or more functions)
  is a pointer whose pts() contains function objects. The valueNode()
  for an *ssa.Function returns a singleton for that function.

Interfaces
  An expression of type 'interface{...}' is a kind of pointer that
  points exclusively to tagged objects. All tagged objects pointed to
  by an interface are direct (the otIndirect flag is clear) and
  concrete (the tag type T is not itself an interface type). The
  associated ssa.Value for an interface's tagged objects may be an
  *ssa.MakeInterface instruction, or nil if the tagged object was
  created by an intrinsic (e.g. reflection).

  Constructing an interface value causes generation of constraints for
  all of the concrete type's methods; we can't tell a priori which
  ones may be called.

  TypeAssert y = x.(T) is implemented by a dynamic constraint
  triggered by each tagged object O added to pts(x): a typeFilter
  constraint if T is an interface type, or an untag constraint if T is
  a concrete type. A typeFilter tests whether O.typ implements T; if
  so, O is added to pts(y). An untag constraint tests whether O.typ is
  assignable to T, and if so, a copy edge O.v -> y is added.

  ChangeInterface is a simple copy because the representation of
  tagged objects is independent of the interface type (in contrast
  to the "method tables" approach used by the gc runtime).

  y := Invoke x.m(...)
  is implemented by allocating contiguous P/R
  blocks for the callsite and adding a dynamic rule triggered by each
  tagged object added to pts(x). The rule adds param/results copy
  edges to/from each discovered concrete method.

  (Q. Why do we model an interface as a pointer to a pair of type and
  value, rather than as a pair of a pointer to type and a pointer to
  value?
  A. Control-flow joins would merge interfaces ({T1}, {V1}) and ({T2},
  {V2}) to make ({T1,T2}, {V1,V2}), leading to the infeasible and
  type-unsafe combination (T1,V2). Treating the value and its concrete
  type as inseparable makes the analysis type-safe.)

reflect.Value
  A reflect.Value is modelled very similarly to an interface{}, i.e. as
  a pointer exclusively to tagged objects, but with two generalizations.

  1) a reflect.Value that represents an lvalue points to an indirect
     (obj.flags ⊇ {otIndirect}) tagged object, which has a similar
     layout to a tagged object except that the value is a pointer to
     the dynamic type. Indirect tagged objects preserve the correct
     aliasing so that mutations made by (reflect.Value).Set can be
     observed.

     Indirect objects only arise when an lvalue is derived from an
     rvalue by indirection, e.g. the following code:

        type S struct { X T }
        var s S
        var i interface{} = &s    // i points to a *S-tagged object (from MakeInterface)
        v1 := reflect.ValueOf(i)  // v1 points to same *S-tagged object as i
        v2 := v1.Elem()           // v2 points to an indirect S-tagged object, pointing to s
        v3 := v2.FieldByName("X") // v3 points to an indirect T-tagged object, pointing to s.X
        v3.Set(y)                 // pts(s.X) ⊇ pts(y)

     Whether indirect or not, the concrete type of the tagged object
     corresponds to the user-visible dynamic type, and the existence
     of a pointer is an implementation detail.

     (NB: indirect tagged objects are not yet implemented)

  2) The dynamic type tag of a tagged object pointed to by a
     reflect.Value may be an interface type; it need not be concrete.

     This arises in code such as this:
        tEface := reflect.TypeOf(new(interface{})).Elem() // interface{}
        eface := reflect.Zero(tEface)
     pts(eface) is a singleton containing an interface{}-tagged
     object. That tagged object's payload is an interface{} value,
     i.e. the pts of the payload contains only concrete-tagged
     objects, although in this example it's the zero interface{} value,
     so its pts is empty.

reflect.Type
  Just as in the real "reflect" library, we represent a reflect.Type
  as an interface whose sole implementation is the concrete type,
  *reflect.rtype. (This choice is forced on us by go/types: clients
  cannot fabricate types with arbitrary method sets.)

  rtype instances are canonical: there is at most one per dynamic
  type. (rtypes are in fact large structs but since identity is all
  that matters, we represent them by a single node.)

  The payload of each *rtype-tagged object is an *rtype pointer that
  points to exactly one such canonical rtype object. We exploit this
  by setting the node.typ of the payload to the dynamic type, not
  '*rtype'. This saves us an indirection in each resolution rule. As
  an optimisation, *rtype-tagged objects are canonicalized too.


Aggregate types:

Aggregate types are treated as if all directly contained
aggregates are recursively flattened out.

Structs
  *ssa.Field y = x.f creates a simple edge to y from x's node at f's offset.

  *ssa.FieldAddr y = &x->f requires a dynamic closure rule to create
  simple edges for each struct discovered in pts(x).
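
The dynamic rule for FieldAddr, and the load that typically follows it,
can be sketched over numbered nodes. This is an illustration under
assumptions, not the package's code: nodeID, nodeSet, fieldAddr and
loadField are hypothetical names, offsets are logical field offsets
plus one for the struct's identity node, and each function computes a
single solving step rather than iterating to a fixed point.

```go
package main

import "fmt"

type nodeID int
type nodeSet map[nodeID]bool

// fieldAddr computes the contribution of y = &x->f to pts(y):
// the node p+offset for every struct object p in pts(x).
func fieldAddr(ptsX nodeSet, offset nodeID) nodeSet {
	out := nodeSet{}
	for p := range ptsX {
		out[p+offset] = true
	}
	return out
}

// loadField computes the contribution of y = x->f to pts(y):
// the union of pts(p+offset) for every struct object p in pts(x).
func loadField(pts map[nodeID]nodeSet, ptsX nodeSet, offset nodeID) nodeSet {
	out := nodeSet{}
	for p := range ptsX {
		for obj := range pts[p+offset] {
			out[obj] = true
		}
	}
	return out
}

func main() {
	// Hypothetical layout of one object of type struct{ a, b *int }:
	// node 10 is its identity node, 11 is field a, 12 is field b.
	const s, offsetB = nodeID(10), nodeID(1 + 1) // logical index 1, plus identity
	pts := map[nodeID]nodeSet{
		12: {20: true}, // s.b points to the object at node 20
	}
	ptsX := nodeSet{s: true} // x points to s

	fmt.Println(fieldAddr(ptsX, offsetB))      // contribution to pts(&x->b)
	fmt.Println(loadField(pts, ptsX, offsetB)) // contribution to pts(x->b)
}
```

Because the offset is added to each member of pts(x) as it is
discovered, the rule must be re-run whenever pts(x) grows, which is why
it cannot be resolved during constraint generation.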

  The nodes of a struct consist of a special 'identity' node (whose
  type is that of the struct itself), followed by the nodes for all
  the struct's fields, recursively flattened out. A pointer to the
  struct is a pointer to its identity node. That node allows us to
  distinguish a pointer to a struct from a pointer to its first field.

  Field offsets are logical field offsets (plus one for the identity
  node), so the sizes of the fields can be ignored by the analysis.

  (The identity node is non-traditional but enables the distinction
  described above, which is valuable for code comprehension tools.
  Typical pointer analyses for C, whose purpose is compiler
  optimization, must soundly model unsafe.Pointer (void*) conversions,
  and this requires fidelity to the actual memory layout using physical
  field offsets.)

Arrays
  We model an array by an identity node (whose type is that of the
  array itself) followed by a node representing all the elements of
  the array; the analysis does not distinguish elements with different
  indices. Effectively, an array is treated like struct{elem T}, a
  load y=x[i] like y=x.elem, and a store x[i]=y like x.elem=y; the
  index i is ignored.

  A pointer to an array is a pointer to its identity node. (A slice is
  also a pointer to an array's identity node.) The identity node
  allows us to distinguish a pointer to an array from a pointer to one
  of its elements, but it is rather costly because it introduces more
  offset constraints into the system. Furthermore, sound treatment of
  unsafe.Pointer would require us to dispense with this node.

  Arrays may be allocated by Alloc, by make([]T), by calls to append,
  and via reflection.
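
One consequence of ignoring the index can be shown with a toy model in
which an array object has a single element slot. The absArray type and
its methods are hypothetical and exist only for this illustration; the
identity node is left out.

```go
package main

import (
	"fmt"
	"sort"
)

// absArray is a hypothetical abstraction of an array of pointers:
// one points-to set stands for every element.
type absArray struct{ elem map[string]bool }

// store models x[i] = &target; the index is deliberately unused.
func (a *absArray) store(_ int, target string) {
	if a.elem == nil {
		a.elem = map[string]bool{}
	}
	a.elem[target] = true
}

// load models y = x[i] and returns everything any element may point to.
func (a *absArray) load(_ int) []string {
	var out []string
	for t := range a.elem {
		out = append(out, t)
	}
	sort.Strings(out)
	return out
}

func main() {
	var x absArray
	x.store(0, "a")
	x.store(7, "b")
	// Element 3 was never written, yet the model says it may point to a or b.
	fmt.Println(x.load(3)) // prints [a b]
}
```

The result is imprecise but safe for a may-alias analysis: every
target stored at any index is reported for every load.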

Tuples (T, ...)
  Tuples are treated like structs with naturally numbered fields.
  *ssa.Extract is analogous to *ssa.Field.

  However, tuples have no identity field since by construction, they
  cannot be address-taken.


FUNCTION CALLS

There are three kinds of function call:
  (1) static "call"-mode calls of functions.
  (2) dynamic "call"-mode calls of functions.
  (3) dynamic "invoke"-mode calls of interface methods.
Cases 1 and 2 apply equally to methods and standalone functions.

Static calls
  A static call consists of three steps:
  - finding the function object of the callee;
  - creating copy edges from the actual parameter value nodes to the
    P-block in the function object (this includes the receiver if
    the callee is a method);
  - creating copy edges from the R-block in the function object to
    the value nodes for the result of the call.

  A static function call is little more than two struct value copies
  between the P/R blocks of caller and callee:

     callee.P = caller.P
     caller.R = callee.R

  Context sensitivity

  Static calls (alone) may be treated context sensitively,
  i.e. each callsite may cause a distinct re-analysis of the
  callee, improving precision. Our current context-sensitivity
  policy treats all intrinsics and getter/setter methods in this
  manner since such functions are small and seem like an obvious
  source of spurious confluences, though this has not yet been
  evaluated.

Dynamic function calls

  Dynamic calls work in a similar manner except that the creation of
  copy edges occurs dynamically, in a similar fashion to a pair of
  struct copies in which the callee is indirect:

     callee->P = caller.P
     caller.R = callee->R

  (Recall that the function object's P- and R-blocks are contiguous.)
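
The copy edges generated for static and dynamic calls can be sketched
as follows. This is an illustration under assumptions, not the
package's code: nodeID, edge, funcObj, staticCall and dynamicCall are
hypothetical names, and every parameter and result is assumed to
flatten to a single node.

```go
package main

import "fmt"

type nodeID int

// edge is a copy edge: pts(dst) ⊇ pts(src).
type edge struct{ src, dst nodeID }

// funcObj describes a function object: an identity node followed by
// contiguous P- and R-blocks.
type funcObj struct {
	identity nodeID
	nparams  int
	nresults int
}

func (f funcObj) pblock() nodeID { return f.identity + 1 }
func (f funcObj) rblock() nodeID { return f.pblock() + nodeID(f.nparams) }

// staticCall wires a callsite whose argument nodes start at args and
// whose result nodes start at results to the known callee f.
func staticCall(f funcObj, args, results nodeID) []edge {
	var es []edge
	for i := 0; i < f.nparams; i++ { // callee.P = caller.P
		es = append(es, edge{src: args + nodeID(i), dst: f.pblock() + nodeID(i)})
	}
	for i := 0; i < f.nresults; i++ { // caller.R = callee.R
		es = append(es, edge{src: f.rblock() + nodeID(i), dst: results + nodeID(i)})
	}
	return es
}

// dynamicCall performs the same wiring for every candidate callee, as a
// solver would each time a function object is added to pts() of the
// function value being called.
func dynamicCall(targets []funcObj, args, results nodeID) []edge {
	var es []edge
	for _, f := range targets {
		es = append(es, staticCall(f, args, results)...)
	}
	return es
}

func main() {
	f := funcObj{identity: 10, nparams: 2, nresults: 1} // nodes 10..13
	g := funcObj{identity: 20, nparams: 2, nresults: 1} // nodes 20..23
	for _, e := range dynamicCall([]funcObj{f, g}, 100, 200) {
		fmt.Printf("n%d -> n%d\n", e.src, e.dst)
	}
}
```

Because the P- and R-blocks sit at fixed offsets from the identity
node, the wiring needs only that one node index per callee.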

Interface method invocation

  For invoke-mode calls, we create a params/results block for the
  callsite and attach a dynamic closure rule to the interface. For
  each new tagged object that flows to the interface, we look up
  the concrete method, find its function object, and connect its P/R
  blocks to the callsite's P/R blocks, adding copy edges to the graph
  during solving.

Recording call targets

  The analysis notifies its clients of each callsite it encounters,
  passing a CallSite interface. Among other things, the CallSite
  contains a synthetic constraint variable ("targets") whose
  points-to solution includes the set of all function objects to
  which the call may dispatch.

  It is via this mechanism that the callgraph is made available.
  Clients may also elect to be notified of callgraph edges directly;
  internally this just iterates all "targets" variables' pts(·)s.


PRESOLVER

We implement Hash-Value Numbering (HVN), a pre-solver constraint
optimization described in Hardekopf & Lin, SAS'07. This is documented
in more detail in hvn.go. We intend to add its cousins HR and HU in
future.


SOLVER

The solver is currently a naive Andersen-style implementation; it does
not perform online cycle detection, though we plan to add solver
optimisations such as Hybrid- and Lazy- Cycle Detection from (Hardekopf
& Lin, PLDI'07).

It uses difference propagation (Pearce et al, SQC'04) to avoid
redundant re-triggering of closure rules for values already seen.

Points-to sets are represented using sparse bit vectors (similar to
those used in LLVM and gcc), which are more space- and time-efficient
than sets based on Go's built-in map type or dense bit vectors.

Nodes are permuted prior to solving so that object nodes (which may
appear in points-to sets) are lower numbered than non-object (var)
nodes.
This improves the density of the set over which the points-to sets
range, and thus the efficiency of the representation.

Partly thanks to avoiding map iteration, the execution of the solver is
100% deterministic, a great help during debugging.


FURTHER READING

Andersen, L. O. 1994. Program analysis and specialization for the C
programming language. Ph.D. dissertation. DIKU, University of
Copenhagen.

David J. Pearce, Paul H. J. Kelly, and Chris Hankin. 2004. Efficient
field-sensitive pointer analysis for C. In Proceedings of the 5th ACM
SIGPLAN-SIGSOFT workshop on Program analysis for software tools and
engineering (PASTE '04). ACM, New York, NY, USA, 37-42.
http://doi.acm.org/10.1145/996821.996835

David J. Pearce, Paul H. J. Kelly, and Chris Hankin. 2004. Online
Cycle Detection and Difference Propagation: Applications to Pointer
Analysis. Software Quality Control 12, 4 (December 2004), 311-337.
http://dx.doi.org/10.1023/B:SQJO.0000039791.93071.a2

David Grove and Craig Chambers. 2001. A framework for call graph
construction algorithms. ACM Trans. Program. Lang. Syst. 23, 6
(November 2001), 685-746.
http://doi.acm.org/10.1145/506315.506316

Ben Hardekopf and Calvin Lin. 2007. The ant and the grasshopper: fast
and accurate pointer analysis for millions of lines of code. In
Proceedings of the 2007 ACM SIGPLAN conference on Programming language
design and implementation (PLDI '07). ACM, New York, NY, USA, 290-299.
http://doi.acm.org/10.1145/1250734.1250767

Ben Hardekopf and Calvin Lin. 2007. Exploiting pointer and location
equivalence to optimize pointer analysis. In Proceedings of the 14th
international conference on Static Analysis (SAS'07), Hanne Riis
Nielson and Gilberto Filé (Eds.). Springer-Verlag, Berlin, Heidelberg,
265-280.

Atanas Rountev and Satish Chandra. 2000.
Off-line variable substitution
for scaling points-to analysis. In Proceedings of the ACM SIGPLAN 2000
conference on Programming language design and implementation (PLDI '00).
ACM, New York, NY, USA, 47-56. DOI=10.1145/349299.349310
http://doi.acm.org/10.1145/349299.349310

*/
package pointer // import "github.com/powerman/golang-tools/go/pointer"