How to Do the Alias Analysis 

This document and some examples can be found in the directory
notes/alias-analysis in the source tree of titaniumc IF.50 and up.

Last updated: Jan. 3, 1999, by Glen


The basics: 

A value of an "object-valued" expression in Java is actually a pointer
(reference) to a location in the heap. We can think of these locations as
abstract values represented by positive integers, so that object-valued
expressions may be thought of as having these integers as their values. We
associate a unique abstract value with each expression in a method whose
value may be a reference to an object (including arrays) which is not
visible to that method at any point before the occurence of that expression.
This abstract value represents the memory location which that expression may
refer to. For example, an allocation ("new int[3]") or a string literal
would have associated abstract values. Method calls have them as well
because it may return such a reference. The mapping of abstract values to
memory locations is one-to-one, with exceptions to be discussed.

Our goal is to associate a set of abstract values to every textual
occurrence of an object-valued expression in the method, representing all
the locations which that reference may point to at run-time. Let's do an
example:

  void foo () {
    int[] c;
    int[] b = new int[4];
    if (true)
      c = b;
    else
      c = null;
    System.out.println(c + "hello");
  }


Here the expressions null, new int[4], and "hello" would have their own
abstract values. Let's say they are 1, 2, and 3 respectively. Then the
value-sets of these expressions would be singleton sets containing these
values. The value-set of the first occurence of c will be empty. After the
assignment into b, b will have the same value-set as new int[4], namely {2}.
After the if-else expression, c will have the value-set {1, 2}, because it
could have gotten either one. As shown here, just because a value is in a
value set, doesn't mean that the expression will have that value in any run
of the program. However, we do garantee that it must have SOME value that's
in its value set.

What about array accesses and fields? An object may have references to other
objects, so we represent this by associating with each abstract value: 

1. A set of abstract values which an array access on that object may yield.
2. A mapping of field names to value sets. 

We will call values occuring in these sets "sub-values" of the given value.
It is very important to be clear why we're associating these with values. An
array access a[i] should not be thought of as retrieving the element i of
the array a, but rather the element i of the object pointed to by a. For
example:

a = d;
a[i] = new Object();
c = a[i];
a = b;

After the above 4 statements, it is not true that c and a[i] are "aliased",
although it is still true that c is aliased to the i-th element of the array
which was pointed to by a before the last assignment.

From the above definitions, it should be obvious how to reach our goal of
assignment a value set to every expression, so we'll leave out most of the
details. It's basically a data-flow calculation in the traditional way,
using fixed-point iterations on the CFG. For example, in a[i] = b[i], we get
the value set of the expression a, and union the value set of b[i] with the
array-access-value-set of each of the values in the value set of a. It's not
obvious that this will terminate, but the reason is that we only do copys
and unions. See below.


Special values:

What about values that are not allocated within the method? We give a unique
abstract value to the expressions 'this', 'null', and every formal
object-valued parameter. Actually, 'this' can be thought of as parameter 0.
For anything else (e.g. class variables, unassigned-to object fields), we
give a special value called the external value. We said earlier that
distinct abstract values represent distinct locations in memory. Here we
give the exceptions. Any special value may represent any location visible to
the outside world. In particular, two special values may represent the same
location. You might wonder why we do it this way. Read on!


Some definitions: 

The dirty and leaked sets: 

Define a value to be dirty at textual point x in our method iff another
method f() can reference that value if f() were called at point x. For
example, all special values are always dirty. Define a value to be leaked at
a point to be the set of all values which would be dirty at that point under
the assumption that special values are not initially dirty. (It's a value
that f "makes" dirty.) Thus the dirty set is just the leaked set plus
special values. Notice that an assignment into a field of a dirty value
makes the right-hand-side (RHS) and all values reachable from the RHS leaked
(and thus dirty). This definition will help us during method calls.


More definitions: 

Let us define 3 sets of values associated with each method f. It is very
important to note that these definitions are made UNDER THE ASSUMPTION that
all values in f are distinct memory locations and that the arguments are not
visible to anyone but the caller and callee.

1. Leaks-set: The set of all values in f that may be leaked at any point of
   f. 

2. Modifies-set: The set of all values which may be modified by f.

3. Returns-set: The set of all values which may be returned by f.

Let's first see how to compute these sets for assignments into object fields
and arrays, which are the only statements that affect the first 2 sets (the
returns-set is obvious). In an assignment to a.x or a[i], what gets
modified? Just the value set of a. But what if a contains the external
value??? Doesn't it modify the dirty values too??? It might in actuality,
but remember our assumption in computing these sets that all values are
distinct. Now what becomes dirty? If and only if the LHS-object contains a
dirty value, every value reachable from the RHS becomes dirty.


Method calls:

For our purposes, we may treat a method call as an arbitrarily long sequence
of arbitrary assignments existing in our method at the point of the method
call. But this would mean that any method call destroys our entire analysis,
so we use the sets defined above to narrow down what assignments the called
method may perform. Suppose we see see the statement

a = obj.f(arg);

where a, obj, and arg are object-valued. First we discuss how to modify our
own method's sets accordingly. We know that, out of the values that WE can
see in our method, the called method f may only see three kinds of values:

(1) Any value reachable from arg. This includes the value-set of arg,
    subvalues of these values, their subvalues, etc. etc.

(2) Any value reachable from obj. 

(3) Any dirty value at the point just before the call. 

Our first restriction is that f may only modify (assign to) these values and
no more. But we can do better by checking f's modifies-set. Values of kind
(1) appear to f as arguments. Values of kind (2) appear to f as 'this'.
Values of kind (3) appear to f as external values. Thus, we can check the
value set of f to see which of these f modifies. This information tells us
which values our own method modifies, so that we can update our own method's
modifies-set. Notice how our assumption of distinct values in our definition
of the modifies-set helps us to determine which of these values f modifies.
Also remember that in computing our OWN modifies-set, we can assume that
arguments are not dirty, so for (3) we only have to use the leaks set.


Similarly, we can use f's leaks-set to update our own. We only care about
the part of f's leaks-set that concerns the arguments (for now). If it has
been leaked by f, we assume it has been leaked by us as well. Of course, it
may be the case that f simply leaked it to one of our
modified-but-not-dirty-to-us arguments, but for simplicity we'll ignore that
case. (It's very easy to put in actually.) Something that is leaked also
becomes dirty.


Now we know how to modify our method's 3 sets upon a method call. But what
about our dataflow calculations? Since some of the values may be modified,
we have to update their array access/field value sets. First a little fact:
In a call of f, the set of values that f may assign to object fields and
that are visible to the caller are a subset of f's leaks-set. To see this,
just notice that in the above, (1)-(3) are all dirty to f, so assignment
into them necessarily leaks the right-hand-side.

Our resulting data-flow actions are summarized in one line: Clobber every
value reachable from a modified value with every value reachable from a
leaked value. A rather subtle detail is that although in modifying our
expression value sets, we do not have the assumption that argument values
are "clean", we nonetheless do not have to clobber them, because the leaked
values are dirty, and it is already true that subvalues of arguments may be
any dirty value.

Finally, we can use the returns-values set to determine which of our values
may be returned so we know what to assign to a. If the method returns the
external value, we assign to a the external value plus the leaks set (we
don't need the dirty set since we have the external value).


It also follows that a method is side-effect-free iff its modifies-set does
not include the external value or any argument values. However, a method
call is SEF only if the called method and all its overriders are SEF.


Finally, what about our original motivation of alias analysis? Two textual
expressions are not "aliased" iff their value-sets are disjoint even when

1. All special values are considered as being equal (recall that special
   values may refer to the same location in memory)
2. All dirty values are considered as equal to the external value. 

Our special casing makes aliasing information more complicated but more
precise. Notice that abstract values themselves, being heap memory
locations, are unchanging throughout the method, so it always makes sense to
compare value sets.


Some odds and ends: 

* For object allocations expressions, we need to make the value associated
  with the allocation dirty if the constructor leaks the 'this' value. See
  how useful this value is?

* We have been speaking of "arguments and their subvalues" as if they were
  one thing: "arguments". This is simulated in the implementation by
  assigning a value v to an argument, and then setting v as a possible value
  of an array access on the value v. With more work, we may be able to make
  this more precise.

* The null value should be eliminated altogether since for all analysis
  purposes (?) we don't care about references that are null.

* For recursive functions, we iterate to a fixed point over the call graph.
  This is done implicitly by using a call-back list for each method. It
  always terminates because things can only get worse. Sort the
  methods topologically so that in the abscence of recursion there will be
  no iteration. 

* This information can be used for other things, checking if arguments in
  particular are modified, checking if things can be allocated on the stack
  instead of using gargabe collection, etc..

* To make this information useful for a later stage, we make a final pass to
  attached "canonicalized" value sets to AST nodes which are ready for
  direct comparison. For example, a value set of {ext, 3} would be changed
  to  {ext, 3} U dirty-set .

---------
Types and overriding: 

Consider this: 

public void foo(PrintStream ps) 
{
  PrintStream myPS = new PrintStream(); // (1)
  myPS.println();  // (2)
  ps.println();  // (3)
}

The call at (2) calls the println() method in the PrintStream class, which
has an empty body, and is side-effect free. What about the call at (3)? We
don't know exactly which println() method it will call at run-time because
of overriding, so we have to check to make sure that all methods which may
override println() are side-effect free. In general, we have to merge the 3
sets we defined above for all methods that override the called method, if we
don't know anything about its actual type.

For locally-allocated variables we can tell directly the actual type of the
object, as in (1). For parameters and method return values, we can look at
all the places where a method is called and all a method's return statements
to narrow down the possibilities. The former requires that we propogate
information from caller to callee, which is in the direction opposite to
what we have been doing. This should probably be left to a final pass (not
implemented yet). There's more we can do in the other direction too.

--------- 

More work: 

It would be very easy to propogate interprocedural data the other way: we
don't have to assume that arguments are special values if we know that they
never are. For example, if method m's formal x is always bound to a
locally-allocated clean object of the caller's, then m can treat x as its
own local, clean value, since for the active lifetime of m nobody else can
see x. This won't cause much more trouble to implement, and we can actually
do this in one pass without iteraction by first completing the above
analysis, and then going through to canonicalize each method's sets using
this information, since we have already assumed in our analysis that
arguments are not special values, and it's only in the canonicalization that
we assume that they are again.

Meanwhile, after (optimizing) transformations to the code, all analyses need
to be redone. We should try to limit this so we won't have to do everything
over again.

--------------------

Termination proof:

First we have to see more precisely what our algorithm does. At any point in
the method, the set of all mappings from variables to values and values to
subvalues, plus the dirty set, give us complete dataflow information at that
point. We package this information in an "AliasInfos" in the code. At a
merge point (a node with multiple predecessors), we take the respective
unions of these sets. We can define a partial ordering <= on AliasInfos. Say
that ai1 <= ai2 if for every variable and every value in ai1, the associated
set is a subset of the corresponding one in ai2. Each statement in the
method can be thought of as a function taking AliasInfos to AliasInfos.

Fix a method m. Let AI be the set of all possible AliasInfos for the method
m. Let each statement in the method be represented as a function s: AI ->
AI. Let X be the set of textual program points. Then a function f: X -> AI
is one "state". Our analysis tries to reach a solution by going through a
sequence of these states, starting from the empty state. Say that for such
states, f1 <= f2 iff for every x in X, f1(x) <= f2(x).

Lemma 1. For any AliasInfos a1, a2, and statement s, if a1 <= a2, then s(a1)
<= s(a2). This can be seen by considering all the possible statements, but
it should be obvious because we only do copys and unions.

This is enough to garantee convergence since our state space is finite:

Let prec(s) be the point just before statement s, and succ(s) the one after.
Each step of our analysis sets f(succ(s)) to be s(f(prec(s))), for some
statement s. For cases of multiple descendents, f(prec(s)) should be just
the union over all q(f(prec(q))), for q any statement just before s.

Lemma 2. If f0, f1, ..., fn is a sequence of our analysis, then f0 <= f1 <=
... <= fn. The proof is by induction:

Case 0: nothing to check. Case n: Let f0, f1, ..., fn-1, fn be our sequence.
Suppose our analysis takes fn-1 to fn by setting fn(succ(s)) to
s(fn-1(prec(s))) (for multiple predecessors it's similar but messier). If
fn-1(succ(s)) was never "set" at some step, then it is the empty AliasInfos,
and fn-1(succ(s)) <= fn(succ(s)). If it was set, then fn-1(succ(s)) =
s(fj(prec(s))) for some 0 <= j < n-1. By the induction hypothesis,
fj(prec(s)) <= fn-1(prec(s)), so by Lemma 1, fn-1(succ(s)) = s(fj(prec(s)))
<= s(fn-1(prec(s))) = fn(succ(s)). This proves Lemma 2.

Since AI is a finite set, and the sequence is non-decreasing, it must be
constant at some point. We detect this and terminate.