The Titanium compiler has a def/use analysis available for use by the internals
of the compiler. This analysis collects information about the possible
definitions of any given use of a variable, and vice versa. For every use of
a variable, it maintains all locations where that variable could have been
defined such that the definition reaches the use; similarly, for every
definition, it maintains a list of all places where that definition could
possibly be used.
Common subexpression elimination needs related information, namely whether the variables involved in an expression may be overwritten before the expression is reused; the variables must have no definitions between the two evaluations of the expression. This information can be derived from def/use information. Unfortunately, that information can not be used directly; instead, extra assignments are inserted to generate the information that common subexpression elimination needs. In this way, the def/use analysis already supplied can be leveraged to provide the needed data, rather than writing an additional analysis.
A definition is an assignment of a value to a variable. A use of a definition
is where the value assigned by a definition is subsequently used. A particular
definition of a variable is said to reach a given point in a procedure if
there is an execution path from the definition to that point such that the
variable has the value assigned by the definition. Def/use analysis determines
which particular definitions may, by some control flow path, reach any particular
use.
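For instance, in the hypothetical fragment below, both definitions may reach the final use:

    x = 1;          // d1: a definition of x
    if (p) {
        x = 2;      // d2: kills d1 along this path
    }
    y = x;          // both d1 and d2 may reach this use, via different paths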
In general, it is recursively undecidable whether a definition actually reaches a particular point. Thus, we assume that a definition reaches a use if we can not determine absolutely whether it does or not. Since optimizations based on reaching definitions depend on what may reach, we stay on the side of conservatism, putting the preservation of correctness ahead of aggressiveness of optimization.
If an instruction redefines a variable, then it is said to kill the definition; otherwise it preserves it. Some definitions of a variable are unambiguous, such as explicit assignments, while others, such as assignments through pointers and procedure calls, are ambiguous, for we may not be able to determine if they affect a particular variable.
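To illustrate the distinction (a hypothetical Java-like fragment, treated as conservatively as described below):

    x = 1;        // unambiguous: explicitly defines x, killing earlier definitions of x
    o.f = 2;      // ambiguous: may define the f field of any object that o aliases
    m();          // ambiguous: unless m is known side-effect free, it may define anything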
The Titanium compiler has such a def/use analysis already available for use.
It maintains def/use and use/def chains for all variables that appear in the
program. A def/use chain for a variable connects a definition of that
variable to everything that may be a use, while a use/def chain connects
a use to everything that may be a definition. The Titanium compiler handles
method calls by killing every variable, unless it determines the method has
no side effects, and handles pointers by killing all variables of the indirected
type.
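Conceptually, the chains can be pictured with the following sketch; the class names are illustrative and do not reflect the Titanium compiler's internal representation:

    import java.util.ArrayList;
    import java.util.List;

    // Illustrative shape of def/use and use/def chains.
    class Def {
        String var;                                // the variable assigned
        final List<Use> uses = new ArrayList<>();  // every use this definition may reach
    }

    class Use {
        String var;                                // the variable read
        final List<Def> defs = new ArrayList<>();  // every definition that may reach this use
    }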
Suppose the following fragment exists in some code:
a = f * i
...
b = f * i
If the value of the expression f * i has not changed between the first and
second assignment, then the second computation is unnecessary extra work.
In this case, the expression f * i is called a common subexpression. We can
eliminate common subexpressions to potentially reduce the running time of
the program. That is, we transform the code into this:
a = f * i
temp = a
...
b = temp
where temp is a fresh variable.
Common subexpression elimination depends on what are called available expressions. At any particular point in the program, there is some set of expressions that have been computed and whose values have not subsequently changed. For example, after the execution of a = f * i above, the expression f * i is available. Since the value of f * i did not change before the execution of b = f * i, it was still available there; because it was available, we could perform the transformation. Common subexpression elimination needs to know which expressions are available. The available expressions data-flow analysis determines, for every program point, which expressions are available: there must be an evaluation of the expression on every path from the entry to that point, and none of the variables occurring in the expression may be defined between the last such evaluation on a path and the point.
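For reference, the textbook formulation of this analysis (which, as discussed below, is not what was implemented here) computes for each basic block b, in LaTeX notation:

    \mathrm{AvailIn}(b)  = \bigcap_{p \in \mathrm{pred}(b)} \mathrm{AvailOut}(p)
    \mathrm{AvailOut}(b) = \mathrm{Gen}(b) \cup (\mathrm{AvailIn}(b) \setminus \mathrm{Kill}(b))

where Gen(b) is the set of expressions evaluated in b whose operands are not later redefined in b, and Kill(b) is the set of expressions whose operands b may redefine.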
An occurrence of an expression in a program is a common subexpression if the expression is available. Common subexpression elimination is a transformation that removes the recomputations of common subexpressions and replaces them with uses of the saved values of those common subexpressions.
Originally, only local common subexpression elimination was being coded for addition to the Titanium compiler, using the standard algorithm from Muchnick's book. When analyzing each node, one must determine which variables are killed, and consequently which subexpressions are no longer available. Rather than writing my own analysis of which variables were killed, I took every assignment and determined which variables it defined by looking at who used that definition, utilizing the def/use information previously computed. This, however, was skating the thin edge of the analysis, and Dan suggested that the def/use analysis could be used in a more direct way, which would also provide global common subexpression elimination for free.
Given the code:
a = f * i
...
b = f * i
One wishes to determine if the expression f * i, computed at the first statement,
is available at the second statement. It is available only if the value of
f * i has not changed, which we can conclude from whether f or i has changed
between the two assignment statements. That is, if f or i has a definition
that kills the respective variable between the two assignments, then the
expression is not available. However, as it stands, the def/use information
available is not directly usable to determine whether a variable x is killed
between point a and point b. One would like to ask: is the definition of
f * i in the first statement the only definition reaching the use of f * i
in the second? But expressions do not have definitions. So, modify the code
above by inserting assignments to produce:
f = f
i = i
a = f * i
...
b = f * i
At this point, we can use the created definitions of f and i as the
definition of the expression f * i. The question now becomes: are the
definitions of f and i reaching the second statement the same as the
definitions of f and i reaching the first?
The above test is sufficient for straight-line code, but not for code
where more than one computation of an expression can take place. For example:
    branch 1:          branch 2:
        a = f * i          b = f * i
        ...                ...

    after the branches join:
        ...
        c = f * i
In this case, one wants to know whether every definition of f and i comes from the generated assignments associated with some computation of the expression. If there is a definition of f or i that is not from an inserted assignment, then there exists a path on which f or i is killed and f * i is not recomputed, i.e. the expression f * i is not available. The computations whose inserted assignments have definitions that reach the statement under examination are exactly the assignments involved in the common subexpression. This winnowing is necessary, for not all computations of f * i reach the statement under examination.
To find the subexpressions that are candidates for elimination, the control
flow graph is searched. All assignments whose right hand sides might be
eliminable are stored in a hash table. Any expression stored in the table
more than once is then instrumented for the analysis to determine if it
can be eliminated. As only expressions of no more than two operands are
considered, the code increase is linear in the number of assignments
instrumented.
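As a sketch of this scan (class and method names here are hypothetical stand-ins for the compiler's IR, not its actual API):

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    // Hypothetical sketch of the candidate scan.
    class CandidateScan {
        interface Assignment {
            boolean hasSimpleRhs();  // right hand side is one of the considered forms
            String rhsKey();         // canonical text of the right hand side
        }

        static Map<String, List<Assignment>> scan(List<Assignment> assignments) {
            Map<String, List<Assignment>> table = new HashMap<>();
            for (Assignment a : assignments) {
                if (a.hasSimpleRhs()) {
                    table.computeIfAbsent(a.rhsKey(), k -> new ArrayList<>()).add(a);
                }
            }
            // Only expressions stored more than once are worth instrumenting;
            // each surviving site then gets "f = f; i = i;" inserted before it.
            table.values().removeIf(sites -> sites.size() < 2);
            return table;
        }
    }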
Running the def/use analysis is quite expensive. In particular, one does
not want to run it once for every expression under examination. Thus, all
of the trivial assignments are inserted at one time, the def/use analysis
is run once, and then each expression is examined. However, this adds one
additional wrinkle: all inserted assignments that are not for the particular
subexpression under examination must be ignored.
Algorithm:
For every location where an elimination is to be attempted:
1) For each variable in the expression:
   a) search its definitions
   b) if a definition is a real definition, we can not eliminate
   c) if a definition is associated with an evaluation of the expression,
      save that location for later
   d) if a definition is a different inserted assignment, recursively look
      at its definitions
2) If we never found any real definitions, then this expression can be
   eliminated. Choose a fresh variable temp. At every saved location, add
   an instruction temp = lhs after the location, where lhs is the left hand
   side of the assignment at the saved location.
3) Replace the right hand side of the instruction at the location of
   elimination with a use of the new temporary variable temp.
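A minimal Java-flavored sketch of steps 1) and 2) follows; every class and field name is a hypothetical stand-in, not the Titanium compiler's actual interface. The tag check is also what handles the wrinkle noted earlier: inserted assignments belonging to other candidate expressions are looked through rather than treated as kills.

    import java.util.HashSet;
    import java.util.List;
    import java.util.Set;

    // Sketch of the elimination test over use/def chains.
    class EliminationCheck {
        static class Stmt {}

        static class Definition {
            boolean inserted;               // one of the trivial "x = x" assignments?
            Object tag;                     // the candidate expression it was inserted for
            Stmt evaluationSite;            // the evaluation (e.g. "a = f * i") it precedes
            List<Definition> reachingDefs;  // use/def chain at the right hand side of "x = x"
        }

        // Walks the use/def chains for one variable of the candidate expression.
        // Returns the evaluation sites whose inserted definitions reach the use,
        // or null as soon as a real (non-inserted) definition is found.
        static Set<Stmt> walk(List<Definition> defs, Object expr, Set<Definition> visited) {
            Set<Stmt> saved = new HashSet<>();
            for (Definition d : defs) {
                if (!visited.add(d)) continue;    // guard against cycles through loops
                if (!d.inserted) return null;     // step 1)b): a real definition; give up
                if (d.tag == expr) {
                    saved.add(d.evaluationSite);  // step 1)c): save the location
                } else {
                    // Step 1)d): an inserted assignment for a different expression;
                    // ignore it and recursively examine the definitions it sees.
                    Set<Stmt> inner = walk(d.reachingDefs, expr, visited);
                    if (inner == null) return null;
                    saved.addAll(inner);
                }
            }
            return saved;
        }
    }

If walk returns a non-null set for every variable of the expression, step 2) inserts temp = lhs after each saved evaluation site.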
Using the example above, if there is a definition of f after the assignment
to b, then the algorithm would terminate at step 1)b) and not eliminate.
If another elimination candidate expression occupied that position, then a
definition would be encountered, but it would be recognized as a fake
definition and treated as if it were not there, by using the definitions
reaching the inserted assignment instead. If there were an additional path
on which the expression were not available, then there would be some other
definition of the variables in the subexpression (after possibly having to
pass through other inserted assignments), for we are coming from a Java-like
language, and all variables must be definitely assigned before use.
After this transformation is complete, all inserted assignments are removed;
we do not want them to cause problems for later phases. Of course, after
completion, the def/use analysis needs to be rerun for the benefit of other
optimizations.
The question does come up as to why the def/use analysis was (ab)used in this way rather than writing a whole new data-flow analysis. According to Dan, the def/use analysis was a complicated piece of code to write, taking many man-months, with many subtle points. Thus, writing an available expressions analysis would probably be similar in difficulty and just as time consuming. In consequence, Dan recommended using the def/use analysis rather than writing a new analysis, for my time was limited. Further, while this method is somewhat inefficient, as the def/use analysis was not built for common subexpression elimination, the resulting implementation is not very slow.
Common subexpression elimination inserts a fresh compiler temporary for
every subexpression it eliminates. This can cause problems, like register
pressure, especially if the number of inserted temporaries is high. However,
global copy propagation is performed subsequent to common subexpression elimination.
This removes the uses of many of the compiler temporaries generated by common
subexpression elimination. For example:
a = f * i
...
b = f * i
...
c = f * i
generates:
a = f * i
temp_1 = a
...
b = temp_1
temp_2 = b
...
c = temp_2
At least one and possibly both temporaries in this case could be removed,
depending on whether a or b is killed after being assigned to. There is also
the possibility of revealing additional common subexpressions that common
subexpression elimination can not detect if run in isolation. For example:
    branch 1:          branch 2:
        a = a              a = a
        d = d              b = b
        x_2 = d + a        x_1 = a + b
        a = a              a = a
        b = b              b = b
        d = d              d = d
        e = x_2 + b        c = x_1 + d
        ...                ...

    after the branches join:
        ...
        b = b
        d = d
        x_3 = b + d
        a = a
        b = b
        d = d
        f = x_3 + a
Preliminary timing results are acceptable. Timings were done with gcc at
-O0 and -O3 optimization. The test program was matrix multiplication.
With -O0, common subexpression elimination alone seems to have little
or no effect. However, when run in conjunction with global copy propagation,
it seems to remove a couple percent from the running time. Thus the reliance
on global copy propagation that was built into the design does seem to be
required.
With -O3, the benefits gained by the optimization were not measurable.
It could be that the gain simply fell below the noise level of the clock,
and more and better benchmarks should be run to detect it. Another possibility
is that other subexpression classes should be considered for elimination.
The tests ran on the order of seconds, and the clock granularity was
hundredths of a second, so differences were near the limit of what the clock
could measure. Variations in the running times were also in the few percent
range. Thus longer running programs need to be used for timing.
In future timings, a wider range of programs will be used; matrix
multiplication is not a good choice for seeing the effectiveness of common
subexpression elimination. In particular, we need to try tests that we
believe will have favorable interactions among the various optimizations
within the Titanium compiler.
The subexpressions considered for elimination only include selected unary and binary operations; not all operations are currently eliminated. Other simple subexpressions, like casts, should be considered for elimination, and additional identified expressions would presumably speed up the code further. In particular, efforts should be made to enable Titanium-specific expression elimination, for example on points or domains, or to eliminate expressions using special information that is available, such as knowledge that methods have no side effects. This class of optimizations seems more likely to be useful, as it takes advantage of information that the Titanium compiler has and gcc does not. None of these were considered in this work because I believed it was more important to implement the basic algorithm than to make it work for every possible subexpression before finishing.
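The current filter is, in effect, a predicate of roughly the following shape (the classes and the operator set are illustrative assumptions, not the compiler's actual representation):

    import java.util.Set;

    // Hypothetical shape of the expression filter.
    class ConsideredFilter {
        interface Expr {}
        static class UnaryOp implements Expr  { String op; }
        static class BinaryOp implements Expr { String op; }

        static final Set<String> SELECTED = Set.of("+", "-", "*", "/"); // assumed set

        static boolean considered(Expr e) {
            // Only selected unary and binary operations are candidates today;
            // casts and Titanium-specific expressions are not yet examined.
            if (e instanceof UnaryOp)  return SELECTED.contains(((UnaryOp) e).op);
            if (e instanceof BinaryOp) return SELECTED.contains(((BinaryOp) e).op);
            return false;
        }
    }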
The subexpressions identified for elimination only include single operations. Larger subexpressions could be analyzed to give a broader range of eliminable subexpressions. Note, however, that to identify larger expressions beyond what multiple passes of the current code find, expressions need to be compared in a non-naive fashion, taking into account commutativity, associativity, and possibly other algebraic identities.
As the live ranges of common subexpressions are unanalyzed, all declarations of compiler generated temporaries are placed at the top level of the method in which they are used. This could potentially cause performance degradation if the C compiler is not sufficiently intelligent about the live ranges of these variables. In particular, eliminations within basic blocks should be easy to analyze for the live ranges of their temporaries, and the declarations of those temporaries placed accordingly.