Common Subexpression Elimination in Titanium

Class Project for CS 265, Susan Graham
Johnathon Jamison (jjamison@cs.berkeley.edu)

Introduction

Titanium [1] is a language designed for parallel programming. It is almost a perfect superset of Java 1.0. In this work, the specific extensions to Java are irrelevant, for the optimizations implemented in the Titanium compiler framework are not specific to Titanium. Titanium does not have an intermediate language independent of the original abstract syntax. Instead, it restricts which constructs are allowed in the syntax tree to a subset that is easily generated. The various optimizations manipulate this lowered form. After all transformations on the tree are completed, C is generated from the tree. That C code is then compiled by the local C compiler, e.g. gcc.

The Titanium compiler has a def/use analysis available for use by the internals of the compiler. This analysis collects information regarding the possible definitions of any given use of a variable (and vice versa). It maintains, for every use of a variable, all locations where that variable could be defined such that the definition reaches the use. Similarly, every definition maintains a list of all places where that definition could possibly be used.

Common subexpression elimination needs related information, namely whether the variables involved in an expression are possibly overwritten before the expression is reused: the variables must have no definitions between the two evaluations of the expression. This information can be derived from def/use information; unfortunately, it cannot be read off directly. Thus, extra assignments are inserted to generate the information needed for common subexpression elimination. In this way, the def/use analysis already supplied can be leveraged to provide the needed data, rather than writing an additional analysis.

Background

One may wonder why I started down this path to begin with. There is a two-part answer.

I originally started on the project for two reasons. One was that I wished to work on something related to Titanium, for that is the basis of my current work. The other was that I had never actually worked on a real compiler, and this experience would teach me something new.

Why I used this different approach to common subexpression elimination is a different story. I originally set out to implement the local elimination method described by Muchnick [2]. I implemented the algorithm up to the point of deciding which expressions were killed by a given statement. To do this, I was using def/use information to see what a particular assignment could define. When I talked to my contact in the Titanium compiler group, Dan Bonachea (bonachea@cs.berkeley.edu), he mentioned some potential pitfalls and suggested that I try a technique similar to what David Marin (dmarin@cs.berkeley.edu) was using for global copy propagation. Not only did this technique solve my problem, but it gave me global common subexpression elimination for free, without needing to write an additional analysis, which would have been difficult.

Def/use analysis

A definition is an assignment of a value to a variable. A use of a definition is where the value assigned by a definition is subsequently used. A particular definition of a variable is said to reach a given point in a procedure if there is an execution path from the definition to that point such that the variable may still have the value assigned by the definition. Def/use analysis determines which particular definitions may, by some control flow path, reach any particular use.

In general, it is recursively undecidable whether a definition actually reaches a particular point. Thus, we assume that a definition reaches a use if we cannot determine absolutely that it does not. Since optimizations based on reaching definitions depend on what may reach a point, we keep to the side of conservatism, putting the preservation of correctness ahead of aggressiveness of optimization.

If an instruction redefines a variable, then it is said to kill the definition; otherwise it preserves it. Some definitions of a variable are unambiguous, such as explicit assignments, while others, such as assignments through pointers and procedure calls, are ambiguous, for we may not be able to determine whether they affect a particular variable.

The Titanium compiler has a def/use analysis already available for use. It maintains def/use and use/def chains for all variables that appear in the program. A def/use chain for a variable connects a definition of that variable to everything that may be a use, while a use/def chain connects a use to everything that may be a definition. The Titanium compiler handles method calls by killing every variable, unless it determines the method has no side effects, and handles pointers by killing all variables of the indirected type.
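
As a rough picture of what the analysis provides, each definition and each use can be thought of as a node cross-linked to the other side. The following Java sketch is an illustration only; the Def and Use classes are hypothetical stand-ins, not the Titanium compiler's actual representation.

import java.util.ArrayList;
import java.util.List;

// Hypothetical def/use chain nodes; for illustration only.
class Def {
    final String variable;                    // the variable being assigned
    final List<Use> uses = new ArrayList<>(); // everything that may be a use
    Def(String variable) { this.variable = variable; }
}

class Use {
    final String variable;                    // the variable being read
    final List<Def> defs = new ArrayList<>(); // every definition that may reach here
    Use(String variable) { this.variable = variable; }
}

An ambiguous definition, such as a method call that may have side effects, simply appears in the defs list of every use it might kill.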

Common Subexpression Elimination

Suppose that in some code, the following portion exists:
a = f * i
...
b = f * i
If the value of the expression f * i has not changed between the first and second assignment, then the second computation is unnecessary. In this case, the expression f * i is called a common subexpression. We can eliminate common subexpressions to potentially reduce the running time of the program. That is, we transform the code into this:
a = f * i
temp = a
...
b = temp
where temp is a fresh variable.

Common subexpression elimination depends on what are called available expressions. At any particular point in the program, there is some set of expressions that have been computed and whose values have not subsequently changed. For example, after the execution of a = f * i above, the expression f * i is available. As the value of f * i did not change before the execution of b = f * i, it was still available there. Because it was available, we could perform the transformation. Common subexpression elimination needs to know which expressions are available. The available expressions data-flow analysis determines, for every program point, which expressions are available: there must be an evaluation of the expression on every path from the entry to the point, and none of the variables occurring in the expression may be defined between the last such evaluation on a path and the point.
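
In the usual data-flow formulation (cf. Muchnick [2]), available expressions is a forward problem over the control flow graph:

\[
\begin{aligned}
\mathrm{AVAIL}_{\mathrm{in}}(b)  &= \bigcap_{p \in \mathrm{pred}(b)} \mathrm{AVAIL}_{\mathrm{out}}(p) \\
\mathrm{AVAIL}_{\mathrm{out}}(b) &= \mathrm{GEN}(b) \cup \bigl( \mathrm{AVAIL}_{\mathrm{in}}(b) \setminus \mathrm{KILL}(b) \bigr)
\end{aligned}
\]

where GEN(b) is the set of expressions evaluated in block b whose operands are not subsequently redefined in b, KILL(b) is the set of expressions with an operand assigned in b, and AVAIL_in of the entry block is empty. The approach described below sidesteps solving these equations by reusing the def/use analysis instead.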

An occurrence of an expression in a program is a common subexpression if the expression is available. Common subexpression elimination is a transformation that removes the recomputations of common subexpressions and replaces them with uses of the saved values of those common subexpressions.

Originally, only local common subexpression elimination was being coded for addition to the Titanium compiler. The method being used was the standard algorithm from Muchnick's book. When analyzing each node, one must determine which variables are killed, and consequently which subexpressions are no longer available. Rather than attempting to write my own analysis of which variables were killed, I took every assignment and determined which variables it defined by looking at who used each definition, utilizing the def/use information previously computed. This, however, was skating the thin edge of the analysis, and Dan suggested that the def/use analysis could be used in a more direct way, which would then also provide global common subexpression elimination for free.

Given the code:
a = f * i
...
b = f * i
One wishes to determine whether the expression f * i, computed at the first statement, is available at the second statement. We can conclude this only if the value of f * i has not changed, that is, only if neither f nor i has changed between the two assignment statements. If f or i has a definition that kills the respective variable between the two assignments, then the expression is not available. However, as it stands, the def/use information is not amenable to determining whether a variable x is killed between point a and point b. One would like to ask: is the definition of f * i in the first statement the only definition reaching the use of f * i in the second? But expressions do not have definitions. So, modify the code above by inserting assignments to produce:
f = f
i = i
a = f * i
...
b = f * i
At this point, we can use the created definitions of f and i as the definition of the expression f * i. Now the question becomes: are the definitions of f and i reaching the second statement the same as the definitions of f and i at the first?
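
For example, if f were redefined between the two evaluations:

f = f
i = i
a = f * i
...
f = g + 1
...
b = f * i

then the use of f in the last statement is reached by the definition f = g + 1 (g being some other variable) rather than by the inserted f = f, so the answer to the question is no, and the expression is not available.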

The above test is sufficient for straight-line code, but not for code where more than one computation of an expression can take place. For example:

(on one path)
a = f * i
...

(on another path)
b = f * i
...

(after the paths merge)
...
c = f * i

In this case, one wants to know whether every definition of f and i comes from the generated assignments associated with some computation of the expression. If there is a definition of f or i that is not from an inserted assignment, then there exists a path on which f or i is killed and f * i is not recomputed, i.e. the expression f * i is not available. The computations whose inserted assignments produce definitions that reach the computation under examination are exactly the computations involved in the common subexpression. This winnowing is necessary, for not all computations of f * i reach the statement under examination.
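
To illustrate the winnowing, suppose a further computation of the expression appears on a path that cannot reach the statement under examination:

f = f
i = i
d = f * i

Its inserted definitions of f and i never reach the uses in c = f * i, so the search described below never saves its location, and d = f * i plays no part in the elimination.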

To find the subexpressions that are candidates for elimination, the control flow graph is searched. All assignments whose right-hand sides might be eliminable are stored in a hash table. Any expression stored in the table more than once is then instrumented for the analysis that determines whether it can be eliminated. As expressions of no more than two operands are considered, the code increase is linear in the number of assignments instrumented.
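
The collection of candidates can be sketched in Java as follows, assuming a hypothetical Assignment node with a canonical key for its right-hand side; the real compiler walks its own tree nodes.

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a syntax tree assignment node.
interface Assignment {
    boolean rhsIsCandidate(); // right-hand side is a selected unary or binary operation
    String rhsKey();          // canonical key for the right-hand side, e.g. "f*i"
}

class CandidateFinder {
    // Keep only expressions whose right-hand side occurs more than once;
    // only those are instrumented with the extra assignments.
    static Map<String, List<Assignment>> find(List<Assignment> assignments) {
        Map<String, List<Assignment>> byRhs = new HashMap<>();
        for (Assignment a : assignments) {
            if (a.rhsIsCandidate()) {
                byRhs.computeIfAbsent(a.rhsKey(), k -> new ArrayList<>()).add(a);
            }
        }
        byRhs.values().removeIf(list -> list.size() < 2); // seen only once: skip
        return byRhs;
    }
}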

Running the def/use analysis is quite expensive. In particular, one does not want to run it once for every expression under examination. Thus, all of the inserted assignments are added at one time, the def/use analysis is run once, and then each expression is examined. However, this adds one additional wrinkle: all inserted assignments that are not for the particular subexpression under examination must be ignored.

Algorithm:
For every location where an elimination is to be attempted:
1) For each variable in the expression:
1)a) search its definitions;
1)b) if a definition is a real definition, we cannot eliminate;
1)c) if a definition is associated with an evaluation of the expression, save the location for later;
1)d) if a definition is a different inserted assignment, recursively look at its definitions.
2) If we never found any real definitions, then this expression can be eliminated. Choose a fresh variable temp. At every saved location, add an instruction temp = lhs after the location, where lhs is the left hand side of the assignment at the saved location.
3) Replace the right hand side of the instruction at the location of elimination with a use of the new temporary variable temp.
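
The definition-walking of step 1 can be pictured with the following Java sketch. It is a minimal model under assumed structures, not the Titanium compiler's code: the Def class, its forExpr and location fields, and the rhsDefs sets are hypothetical stand-ins for the real def/use information.

import java.util.HashSet;
import java.util.Set;

class CseCheck {
    // Hypothetical definition node. Each use of a variable is assumed to
    // carry the set of definitions that may reach it.
    static class Def {
        final boolean inserted; // an artificial "x = x" assignment?
        final String forExpr;   // expression it instruments, e.g. "f*i"
        final int location;     // location of the instrumented evaluation
        final Set<Def> rhsDefs = new HashSet<>(); // defs reaching the x on its right-hand side
        Def(boolean inserted, String forExpr, int location) {
            this.inserted = inserted; this.forExpr = forExpr; this.location = location;
        }
    }

    // Walk the definitions reaching one operand use (step 1). Returns the
    // saved evaluation locations for expr, or null if a real definition is
    // found, in which case no elimination is possible.
    static Set<Integer> collectLocations(Set<Def> reaching, String expr, Set<Def> visited) {
        Set<Integer> saved = new HashSet<>();
        for (Def d : reaching) {
            if (!visited.add(d)) continue;    // already examined this definition
            if (!d.inserted) return null;     // 1)b) a real definition: give up
            if (expr.equals(d.forExpr)) {
                saved.add(d.location);        // 1)c) inserted for this expression
            } else {                          // 1)d) inserted for another expression;
                                              // since it is x = x, look through it
                Set<Integer> more = collectLocations(d.rhsDefs, expr, visited);
                if (more == null) return null;
                saved.addAll(more);
            }
        }
        return saved;
    }
}

Steps 2 and 3 are then straightforward tree rewrites: temp = lhs is appended after each saved location, and the right-hand side at the elimination point is replaced by temp.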

Using the example above, if there is a definition of f after the assignment to b, then the algorithm terminates at step 1)b) and does not eliminate. If there were another elimination candidate in that position, then a definition would be encountered, but it would be recognized as a fake definition and treated as if it were not there, by simply using the definitions reaching the inserted assignment. If there were an additional path on which the expression were not available, then there would be some other definition of the variables in the subexpression (after possibly having to look through other inserted assignments), for we are coming from a Java-like language, and all variables must be definitely assigned before use.

After this transformation is complete, all inserted assignments are removed; we do not want them to cause problems for later phases. Of course, after completion, the def/use analysis needs to be rerun for the benefit of other optimizations.

The question does come up as to why the def/use analysis was (ab)used in this way rather than writing a whole new data-flow analysis. According to Dan, the def/use analysis was a complicated piece of code to write, taking many man-months, with many subtle points. It is probable that writing an available expressions analysis would be similar in difficulty and just as time consuming. In consequence, Dan recommended using the def/use analysis rather than writing a new analysis, for my time was limited. Further, while this method is somewhat less efficient, since the analysis was not built for common subexpression elimination, it is not very slow in practice.

Interaction with other optimizations

Common subexpression elimination inserts a fresh compiler temporary for every subexpression it eliminates. This can cause problems, such as increased register pressure, especially if the number of inserted temporaries is high. However, global copy propagation is performed after common subexpression elimination. This removes the uses of many of the compiler temporaries generated by common subexpression elimination. For example:
a = f * i
...
b = f * i
...
c = f * i
generates:
a = f * i
temp_1 = a
...
b = temp_1
temp_2 = b
...
c = temp_2
At least one and possibly both temporaries in this case could be removed, depending on whether a or b is killed after being assigned. There is also the possibility of revealing additional common subexpressions that common subexpression elimination cannot detect if run in isolation.
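
Assuming neither a nor b is killed along the elided paths, copy propagation rewrites the uses of the temporaries, after which temp_1 and temp_2 are dead and can be removed:

a = f * i
...
b = a
...
c = a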

To remove subexpressions with more than two operands, this algorithm depends on copy propagation followed by another run of this transformation. Thus, if f * i + 3 is computed twice, the first pass will remove f * i, and the second will remove temp + 3, where temp is the generated variable. However, this only works for limited instances of larger subexpressions.
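
Roughly, and assuming the lowered tree already splits each computation of f * i + 3 into two-operand steps with front-end temporaries t1 and t2:

t1 = f * i
a = t1 + 3
...
t2 = f * i
b = t2 + 3

After the first pass and copy propagation (nothing being killed in between), both computations share one temporary:

temp = f * i
a = temp + 3
...
b = temp + 3

The second pass then sees temp + 3 computed twice and eliminates it in the same way.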

There is the potential to use the same algorithm directly on arbitrarily sized expressions. Here is a branching example with the inserted assignments shown:
(on one path)
a = a
d = d
x_2 = d + a
a = a
b = b
d = d
e = x_2 + b
...

(on another path)
a = a
b = b
x_1 = a + b
a = a
b = b
d = d
c = x_1 + d
...

(after the paths merge)
...
b = b
d = d
x_3 = b + d
a = a
b = b
d = d
f = x_3 + a

While this would detect arbitrarily large subexpressions, the program expansion is now quadratic. This does not seem like a win.

Timings

Preliminary timing results are mixed. Timings were done with gcc at -O0 and -O3. The test program was matrix multiplication.

With -O0, common subexpression elimination alone seems to have little or no effect. However, when run in conjunction with global copy propagation, it seems to remove a couple percent from the running time. Thus the reliance on global copy propagation that was built into the design does seem to be required.

With -O3, the benefits gained by the optimization were not measurable. It could be that the gain simply fell below the noise level of the clock, and more and better benchmarks should be run to detect it. Another possibility is that other subexpression classes should be considered for possible elimination.

The tests ran on the order of seconds, and the clock granularity was hundredths of a second, so differences were near the limit of what the clock could measure. Variations in the running times were also in the few percent range. Thus, longer running programs need to be used for timing.

In future timings, a wider range of programs will be used; matrix multiplication is not a good choice for seeing the effectiveness of common subexpression elimination. In particular, we need to try tests that we believe will have favorable interactions between the various optimizations within the Titanium compiler.

Future work

The subexpressions considered for elimination include only selected unary and binary operations; not all operations are currently eliminated. Other simple subexpressions, like casts, should be considered for elimination. Additional expressions identified would presumably speed up the code more. In particular, it seems that efforts should be made to enable Titanium-specific expression elimination, for example on points or domains, or to remove expressions using special information that is available, like methods with no side effects. This class of optimizations seems more likely to be useful, as it takes advantage of information that the Titanium compiler has and gcc does not. The reason none of these were considered in this work is that I believed it more important to implement the basic algorithm than to make it work for every possible subexpression before finishing.

The subexpressions identified for elimination include only single operations. Larger subexpressions could be analyzed to broaden the range of subexpressions that can be eliminated. Note, however, that to identify larger expressions beyond what multiple passes of the current code find, expressions need to be compared in a non-naive fashion, taking into account commutativity, associativity, and possibly other algebraic identities.
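
Commutativity alone could be handled by canonicalizing operand order before comparison. A small sketch in Java, with string operands standing in for real expression nodes:

// Canonical key for a binary expression: order the operands of
// commutative operators so that a + b and b + a compare equal.
class CanonicalKey {
    static String of(String op, String left, String right) {
        boolean commutative = op.equals("+") || op.equals("*");
        if (commutative && left.compareTo(right) > 0) {
            String tmp = left; left = right; right = tmp; // swap into canonical order
        }
        return left + " " + op + " " + right;
    }
}

With this, CanonicalKey.of("+", "b", "a") and CanonicalKey.of("+", "a", "b") produce the same hash key; associativity would further require flattening nested operands into ordered lists.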

As the live ranges of common subexpressions are not analyzed, all declarations of compiler-generated temporaries are placed at the top level of the method in which they are used. This could cause performance degradation if the C compiler is not sufficiently intelligent about the live ranges of these variables. In particular, eliminations within basic blocks could easily be analyzed for the live ranges of their temporaries, and the declarations of those temporaries placed accordingly.
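
For illustration, the difference between the two placements (written in Java form, though the compiler actually emits C):

class Placement {
    static int topLevel(int a, boolean flag) {
        int temp = 0;      // current scheme: declared at the top of the method,
        int result = 0;    // even though temp is live only inside one block
        if (flag) {
            temp = a;
            result = temp;
        }
        return result;
    }

    static int blockScoped(int a, boolean flag) {
        int result = 0;
        if (flag) {
            int temp = a;  // possible refinement: declaration confined to the
            result = temp; // block in which the temporary is live
        }
        return result;
    }
}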

Acknowledgments

Many thanks to Dan Bonachea, who was my contact in the Titanium compiler group, and to David Marin, who relayed the main ideas and whose code accessing the def/use analysis provided a guideline for my own.

References

[1] Titanium: http://titanium.cs.berkeley.edu/

[2] Muchnick, Steven S. Advanced Compiler Design and Implementation. Morgan Kaufmann Publishers, San Francisco, CA, 1997.