How to Do the Alias Analysis This document and some examples can be found in the directory notes/alias-analysis in the source tree of titaniumc IF.50 and up. Last updated: Jan. 3, 1999, by Glen The basics: A value of an "object-valued" expression in Java is actually a pointer (reference) to a location in the heap. We can think of these locations as abstract values represented by positive integers, so that object-valued expressions may be thought of as having these integers as their values. We associate a unique abstract value with each expression in a method whose value may be a reference to an object (including arrays) which is not visible to that method at any point before the occurence of that expression. This abstract value represents the memory location which that expression may refer to. For example, an allocation ("new int[3]") or a string literal would have associated abstract values. Method calls have them as well because it may return such a reference. The mapping of abstract values to memory locations is one-to-one, with exceptions to be discussed. Our goal is to associate a set of abstract values to every textual occurrence of an object-valued expression in the method, representing all the locations which that reference may point to at run-time. Let's do an example: void foo () { int[] c; int[] b = new int[4]; if (true) c = b; else c = null; System.out.println(c + "hello"); } Here the expressions null, new int[4], and "hello" would have their own abstract values. Let's say they are 1, 2, and 3 respectively. Then the value-sets of these expressions would be singleton sets containing these values. The value-set of the first occurence of c will be empty. After the assignment into b, b will have the same value-set as new int[4], namely {2}. After the if-else expression, c will have the value-set {1, 2}, because it could have gotten either one. As shown here, just because a value is in a value set, doesn't mean that the expression will have that value in any run of the program. However, we do garantee that it must have SOME value that's in its value set. What about array accesses and fields? An object may have references to other objects, so we represent this by associating with each abstract value: 1. A set of abstract values which an array access on that object may yield. 2. A mapping of field names to value sets. We will call values occuring in these sets "sub-values" of the given value. It is very important to be clear why we're associating these with values. An array access a[i] should not be thought of as retrieving the element i of the array a, but rather the element i of the object pointed to by a. For example: a = d; a[i] = new Object(); c = a[i]; a = b; After the above 4 statements, it is not true that c and a[i] are "aliased", although it is still true that c is aliased to the i-th element of the array which was pointed to by a before the last assignment. From the above definitions, it should be obvious how to reach our goal of assignment a value set to every expression, so we'll leave out most of the details. It's basically a data-flow calculation in the traditional way, using fixed-point iterations on the CFG. For example, in a[i] = b[i], we get the value set of the expression a, and union the value set of b[i] with the array-access-value-set of each of the values in the value set of a. It's not obvious that this will terminate, but the reason is that we only do copys and unions. See below. Special values: What about values that are not allocated within the method? We give a unique abstract value to the expressions 'this', 'null', and every formal object-valued parameter. Actually, 'this' can be thought of as parameter 0. For anything else (e.g. class variables, unassigned-to object fields), we give a special value called the external value. We said earlier that distinct abstract values represent distinct locations in memory. Here we give the exceptions. Any special value may represent any location visible to the outside world. In particular, two special values may represent the same location. You might wonder why we do it this way. Read on! Some definitions: The dirty and leaked sets: Define a value to be dirty at textual point x in our method iff another method f() can reference that value if f() were called at point x. For example, all special values are always dirty. Define a value to be leaked at a point to be the set of all values which would be dirty at that point under the assumption that special values are not initially dirty. (It's a value that f "makes" dirty.) Thus the dirty set is just the leaked set plus special values. Notice that an assignment into a field of a dirty value makes the right-hand-side (RHS) and all values reachable from the RHS leaked (and thus dirty). This definition will help us during method calls. More definitions: Let us define 3 sets of values associated with each method f. It is very important to note that these definitions are made UNDER THE ASSUMPTION that all values in f are distinct memory locations and that the arguments are not visible to anyone but the caller and callee. 1. Leaks-set: The set of all values in f that may be leaked at any point of f. 2. Modifies-set: The set of all values which may be modified by f. 3. Returns-set: The set of all values which may be returned by f. Let's first see how to compute these sets for assignments into object fields and arrays, which are the only statements that affect the first 2 sets (the returns-set is obvious). In an assignment to a.x or a[i], what gets modified? Just the value set of a. But what if a contains the external value??? Doesn't it modify the dirty values too??? It might in actuality, but remember our assumption in computing these sets that all values are distinct. Now what becomes dirty? If and only if the LHS-object contains a dirty value, every value reachable from the RHS becomes dirty. Method calls: For our purposes, we may treat a method call as an arbitrarily long sequence of arbitrary assignments existing in our method at the point of the method call. But this would mean that any method call destroys our entire analysis, so we use the sets defined above to narrow down what assignments the called method may perform. Suppose we see see the statement a = obj.f(arg); where a, obj, and arg are object-valued. First we discuss how to modify our own method's sets accordingly. We know that, out of the values that WE can see in our method, the called method f may only see three kinds of values: (1) Any value reachable from arg. This includes the value-set of arg, subvalues of these values, their subvalues, etc. etc. (2) Any value reachable from obj. (3) Any dirty value at the point just before the call. Our first restriction is that f may only modify (assign to) these values and no more. But we can do better by checking f's modifies-set. Values of kind (1) appear to f as arguments. Values of kind (2) appear to f as 'this'. Values of kind (3) appear to f as external values. Thus, we can check the value set of f to see which of these f modifies. This information tells us which values our own method modifies, so that we can update our own method's modifies-set. Notice how our assumption of distinct values in our definition of the modifies-set helps us to determine which of these values f modifies. Also remember that in computing our OWN modifies-set, we can assume that arguments are not dirty, so for (3) we only have to use the leaks set. Similarly, we can use f's leaks-set to update our own. We only care about the part of f's leaks-set that concerns the arguments (for now). If it has been leaked by f, we assume it has been leaked by us as well. Of course, it may be the case that f simply leaked it to one of our modified-but-not-dirty-to-us arguments, but for simplicity we'll ignore that case. (It's very easy to put in actually.) Something that is leaked also becomes dirty. Now we know how to modify our method's 3 sets upon a method call. But what about our dataflow calculations? Since some of the values may be modified, we have to update their array access/field value sets. First a little fact: In a call of f, the set of values that f may assign to object fields and that are visible to the caller are a subset of f's leaks-set. To see this, just notice that in the above, (1)-(3) are all dirty to f, so assignment into them necessarily leaks the right-hand-side. Our resulting data-flow actions are summarized in one line: Clobber every value reachable from a modified value with every value reachable from a leaked value. A rather subtle detail is that although in modifying our expression value sets, we do not have the assumption that argument values are "clean", we nonetheless do not have to clobber them, because the leaked values are dirty, and it is already true that subvalues of arguments may be any dirty value. Finally, we can use the returns-values set to determine which of our values may be returned so we know what to assign to a. If the method returns the external value, we assign to a the external value plus the leaks set (we don't need the dirty set since we have the external value). It also follows that a method is side-effect-free iff its modifies-set does not include the external value or any argument values. However, a method call is SEF only if the called method and all its overriders are SEF. Finally, what about our original motivation of alias analysis? Two textual expressions are not "aliased" iff their value-sets are disjoint even when 1. All special values are considered as being equal (recall that special values may refer to the same location in memory) 2. All dirty values are considered as equal to the external value. Our special casing makes aliasing information more complicated but more precise. Notice that abstract values themselves, being heap memory locations, are unchanging throughout the method, so it always makes sense to compare value sets. Some odds and ends: * For object allocations expressions, we need to make the value associated with the allocation dirty if the constructor leaks the 'this' value. See how useful this value is? * We have been speaking of "arguments and their subvalues" as if they were one thing: "arguments". This is simulated in the implementation by assigning a value v to an argument, and then setting v as a possible value of an array access on the value v. With more work, we may be able to make this more precise. * The null value should be eliminated altogether since for all analysis purposes (?) we don't care about references that are null. * For recursive functions, we iterate to a fixed point over the call graph. This is done implicitly by using a call-back list for each method. It always terminates because things can only get worse. Sort the methods topologically so that in the abscence of recursion there will be no iteration. * This information can be used for other things, checking if arguments in particular are modified, checking if things can be allocated on the stack instead of using gargabe collection, etc.. * To make this information useful for a later stage, we make a final pass to attached "canonicalized" value sets to AST nodes which are ready for direct comparison. For example, a value set of {ext, 3} would be changed to {ext, 3} U dirty-set . --------- Types and overriding: Consider this: public void foo(PrintStream ps) { PrintStream myPS = new PrintStream(); // (1) myPS.println(); // (2) ps.println(); // (3) } The call at (2) calls the println() method in the PrintStream class, which has an empty body, and is side-effect free. What about the call at (3)? We don't know exactly which println() method it will call at run-time because of overriding, so we have to check to make sure that all methods which may override println() are side-effect free. In general, we have to merge the 3 sets we defined above for all methods that override the called method, if we don't know anything about its actual type. For locally-allocated variables we can tell directly the actual type of the object, as in (1). For parameters and method return values, we can look at all the places where a method is called and all a method's return statements to narrow down the possibilities. The former requires that we propogate information from caller to callee, which is in the direction opposite to what we have been doing. This should probably be left to a final pass (not implemented yet). There's more we can do in the other direction too. --------- More work: It would be very easy to propogate interprocedural data the other way: we don't have to assume that arguments are special values if we know that they never are. For example, if method m's formal x is always bound to a locally-allocated clean object of the caller's, then m can treat x as its own local, clean value, since for the active lifetime of m nobody else can see x. This won't cause much more trouble to implement, and we can actually do this in one pass without iteraction by first completing the above analysis, and then going through to canonicalize each method's sets using this information, since we have already assumed in our analysis that arguments are not special values, and it's only in the canonicalization that we assume that they are again. Meanwhile, after (optimizing) transformations to the code, all analyses need to be redone. We should try to limit this so we won't have to do everything over again. -------------------- Termination proof: First we have to see more precisely what our algorithm does. At any point in the method, the set of all mappings from variables to values and values to subvalues, plus the dirty set, give us complete dataflow information at that point. We package this information in an "AliasInfos" in the code. At a merge point (a node with multiple predecessors), we take the respective unions of these sets. We can define a partial ordering <= on AliasInfos. Say that ai1 <= ai2 if for every variable and every value in ai1, the associated set is a subset of the corresponding one in ai2. Each statement in the method can be thought of as a function taking AliasInfos to AliasInfos. Fix a method m. Let AI be the set of all possible AliasInfos for the method m. Let each statement in the method be represented as a function s: AI -> AI. Let X be the set of textual program points. Then a function f: X -> AI is one "state". Our analysis tries to reach a solution by going through a sequence of these states, starting from the empty state. Say that for such states, f1 <= f2 iff for every x in X, f1(x) <= f2(x). Lemma 1. For any AliasInfos a1, a2, and statement s, if a1 <= a2, then s(a1) <= s(a2). This can be seen by considering all the possible statements, but it should be obvious because we only do copys and unions. This is enough to garantee convergence since our state space is finite: Let prec(s) be the point just before statement s, and succ(s) the one after. Each step of our analysis sets f(succ(s)) to be s(f(prec(s))), for some statement s. For cases of multiple descendents, f(prec(s)) should be just the union over all q(f(prec(q))), for q any statement just before s. Lemma 2. If f0, f1, ..., fn is a sequence of our analysis, then f0 <= f1 <= ... <= fn. The proof is by induction: Case 0: nothing to check. Case n: Let f0, f1, ..., fn-1, fn be our sequence. Suppose our analysis takes fn-1 to fn by setting fn(succ(s)) to s(fn-1(prec(s))) (for multiple predecessors it's similar but messier). If fn-1(succ(s)) was never "set" at some step, then it is the empty AliasInfos, and fn-1(succ(s)) <= fn(succ(s)). If it was set, then fn-1(succ(s)) = s(fj(prec(s))) for some 0 <= j < n-1. By the induction hypothesis, fj(prec(s)) <= fn-1(prec(s)), so by Lemma 1, fn-1(succ(s)) = s(fj(prec(s))) <= s(fn-1(prec(s))) = fn(succ(s)). This proves Lemma 2. Since AI is a finite set, and the sequence is non-decreasing, it must be constant at some point. We detect this and terminate.