\documentclass[]{sigplanconf}
\usepackage{amsmath}
\usepackage{xspace}
\usepackage{url}
\usepackage{tikz}
\usepackage{listings}
\lstset{language=R,escapechar=@}
\usetikzlibrary{shapes,arrows,positioning}
\newcommand{\Jimple}{\emph{Jimple}\xspace}
\newcommand{\Baf}{\emph{Baf}\xspace}
\newcommand{\Dava}{\emph{Dava}\xspace}
\newcommand{\Grimp}{\emph{Grimp}\xspace}
\newcommand{\Soot}{\texttt{Soot}\xspace}
\newcommand{\ASM}{\texttt{ASM}\xspace}
\newcommand{\APRON}{\texttt{APRON}\xspace}
\newcommand{\PPL}{\texttt{PPL}\xspace}
\newcommand{\Jandom}{\texttt{Jandom}\xspace}
\newcommand{\Random}{\texttt{Random}\xspace}
\newcommand{\wrt}{w.r.t.\xspace}
\newcommand{\Z}{[-\infty,\infty]}
\begin{document}
\conferenceinfo{SOAP'13}{June 20, 2013, Seattle, Washington, USA}
\copyrightyear{2013}
\copyrightdata{ISBN 978-1-4503-2201-0/13/06}
%\titlebanner{banner above paper title} % These are ignored unless
%\preprintfooter{short description of paper} % 'preprint' option specified.
\title{Numerical static analysis with Soot}
%\subtitle{Subtitle Text, if any}
\authorinfo
{Gianluca Amato \and Simone Di Nardo Di Maio \and Francesca Scozzari}
{Universit\`a di Chieti-Pescara - Italy}
{\{gamato, simone.dinardo, fscozzari\}@unich.it}
\maketitle
\tikzstyle{flowchart} = [node distance=0.5cm, inner sep=3pt, draw]
\tikzstyle{joinnode} = [flowchart, circle, minimum size=1em]
\tikzstyle{decision} = [flowchart, diamond, minimum size=2em, inner sep=0pt]
\tikzstyle{assignment} = [flowchart, rectangle, align=center,
rounded corners, minimum size=2em]
\tikzstyle{depnode} = [draw, circle]
\tikzstyle{line} = [draw, inner sep=2pt, -latex']
\begin{abstract}
Numerical static analysis computes an
approximation of all the possible values
that a numeric variable may assume in any execution of the program. Many
numerical static analyses have been proposed exploiting the theory of abstract
interpretation, which is a general framework for designing provably correct
program analyses. The two main problems in analyzing numerical properties are
choosing the right level of abstraction (the abstract domain) and developing
an efficient iteration strategy that computes the analysis result
while guaranteeing termination and soundness.
In this paper, we report on our prototype implementation of a Java bytecode
static analyzer for numerical properties. It has been developed exploiting
\Soot bytecode abstractions, existing libraries for numerical abstract
domains, and the iteration strategies commonly used in the abstract
interpretation community. We show pros
and cons of using \Soot, and discuss the main differences between our analyzer
and the \Soot static analysis framework.
\end{abstract}
\category{F.3.2}{Semantics of Programming Languages}{Program analysis}
\terms
Design, implementation, static analysis.
\keywords
Abstract interpretation, bytecode, numerical domains.
\section{Introduction}
Static analysis determines, at compile-time, properties about the run-time
behavior of programs, in order to verify, debug and optimize the code.
Abstract interpretation~\cite{CousotC77,CousotC79} is a general
theory for defining static analyses starting from the property of interest
(the so-called \emph{abstract domain}), and for formally proving their
correctness.
The basic idea of abstract interpretation is that a static analysis
can be derived from the (concrete) semantics of a program. We assume that the
semantics of a program $P$ can be computed as the least fixed
point of a semantic function $f: C \to C$, where $C$ is a concrete domain. In
general, the semantic function is given compositionally starting from a set
of basic operators which depends on the kind of programming language under
analysis. Typical operators for imperative programs include assignment, test
and merge, which are used in the semantics to handle the basic constructs:
variable assignments, conditionals and loops.
In the abstract interpretation theory, a static analysis is viewed as an
abstract semantics, which can be directly obtained from the concrete one by
substituting the concrete domain $C$ with an abstract domain $A$,
representing the program properties we want to analyze, and all the concrete
operators with corresponding abstract operators.
%The abstract domain must
%generally be a partially ordered set, and concrete and abstract
%domains are related by a concretization function $\gamma: A \to C$ which maps
%abstract objects to concrete objects.
In practice, at the end of this formalization process, we get a system of
equations not dissimilar from any other data-flow analysis. Each variable in
the equations corresponds to a program point, and each equation describes the
effect of an instruction or a basic block on the program properties. The
least fixpoint of this equation system is a safe approximation of the concrete
semantics of the program.
%
% a program $P$ is transformed in a system of equations,
%where equation describes the behavior in a program point. The concrete
%semantics of a program is the least solution of this system of equations
%computed over the
%concrete domain $C$.
In this work we are mainly interested in numerical
properties. A numerical property on a program variable $x$ gives an
over-approximation of the possible values that the variable $x$ may assume at a
specific program point, during any execution of the program. Numerical
properties are typically expressed by means of geometrical shapes.
For instance, in the interval abstract
domain \cite{CousotC76}, each abstract object maps a program variable to a
(possibly unbounded) interval, such as $x \in [3,7]$.
%, which represent the
%set of integers $\{3,4,5,6,7\}$.
%The interval abstract domain is defined as the collection
%of all the
%(possibly unbounded) intervals over integers.
%
\begin{figure}[t]
%\begin{minipage}[c]{3cm}
%\begin{lstlisting}
% i = 0
%@ \ding{172}@ while @ \ding{173}@ (i<10)
%@ \ding{174}@ i = i+1 @ \ding{175}@
%@ \ding{176}@
%\end{lstlisting}
%\end{minipage}
%\hfill
\begin{minipage}[c]{3.5cm}
\hfill
\begin{tikzpicture}[auto, scale=1, transform shape]
\node [assignment] (init) {$x=0$};
\node [joinnode, below=of init] (join1) {};
\node [decision, below=of join1] (if) {$x<10$};
\node [assignment, below=of if] (incr1) {$x=x+1$};
\path [line] (if) -- node[near start,swap]{\tiny true} node{3}
(incr1);
\path [line] (init) -- node{1}(join1);
\path [line] (join1) -- node{2}(if);
% \path [line] (if2) -- node[near start,swap]{\tiny true} node{7}
% (incr1);
% \path [line] (if2) -- node[swap]{\tiny false} ++(-1,0) |-
% node[near
% start,swap](e8){8} (incr2);
\path [line] (incr1) -- ++(1,0) |- node[near start,swap](e4){4}
(join1);
\path [line] (if) -- node[near start,swap]{\tiny false} ++(-1,0) --
node[near start]{5} ++(0,-2);
% \path [line] (incr2) -- ++(2,0) |- node[pos=0.4]{10} (join1);
% \node [rectangle,draw,dashed,fit=(join2) (if2) (incr1) (e8)(e9) ]
% {};
\end{tikzpicture}
\end{minipage}
\hfill
\begin{minipage}[l]{4.5cm}
\begin{align*}
y_1 &= [0,0] \\
y_2 &= y_1 \vee y_{4}\\
y_3 &= y_2 \wedge [-\infty,9]\\
y_4 &= y_3 + [1,1]\\
y_5 &= y_2 \wedge [10,\infty]
\end{align*}
\end{minipage}
\caption{\label{fig:nested}A control flow graph and the corresponding system
of
equations in the interval abstract domain.}
\end{figure}
%
Consider, for instance, the simple program:
\begin{quote}
\begin{lstlisting}
x = 0
while (x<10)
x = x + 1
\end{lstlisting}
\end{quote}
whose corresponding control flow graph and system of equations are depicted in
Figure~\ref{fig:nested}. Each variable $y_i$ in the system corresponds to the
edge in the graph labeled by $i$, which in turn is a relevant program
point.
The abstract semantics of the example program is simply obtained as the least
solution of the corresponding system of equations solved over the abstract
domain of intervals, where the operators $\vee$ and $\wedge$ represent the
convex hull (i.e., the least interval containing both arguments) and the
intersection of intervals, while $+$ is the (pointwise) sum.
Analyses over the abstract domain of intervals are quite simple and efficient,
but not very precise. Other numerical domains, encoding more expressive
properties, are the polyhedra~\cite{CousotH78}, octagon \cite{Mine06} and
parallelotope~\cite{AmatoS12-entcs} domains. Of course, more expressive
abstract domains lead to higher complexity from the
computational point of view.
In the polyhedra domain, each abstract object is a polyhedron described by a
system of
linear inequalities $l \leq Ax \leq u$, where $x$ is the vector of program
variables, $A$ is the coefficient matrix and $l, u$ are vectors of bounds. A
parallelotope is a polyhedron whose matrix $A$ is invertible.
The octagon domain is very similar to the polyhedra domain, but the coefficient
matrix $A$ is fixed: the inequalities are of the form $\pm x_1 \pm x_2
\leq c$, where $x_1$ and $x_2$ are program variables and $c$ is a constant.
%
%In the parallelotope domain, instead, each abstract object is a parallelotope
%represented by a system
%of linear equations $l \leq Ax \leq u$ where $A$ is invertible.
All numerical abstract domains should implement a common set of operations
such
as assignment, convex hull (which corresponds to the merge operation in
\Soot),
intersection, projection, and so on.
Usually the equations are solved iteratively, through successive
approximations to determine a fixed point. Since most numerical abstract
domains contain infinite ascending
chains, we need to introduce an approximation to be able to compute a fixed
point.
The most common technique is to use a \emph{widening}~\cite{CousotC76}: an
operator that extrapolates the fixed point from the
sequence of approximations computed in the previous iterations of the analysis.
It guarantees termination of the analysis, but may introduce a loss of
precision.
In order to partially recover precision, it is almost mandatory to perform a
two-phase analysis: an ascending phase, using widening, which computes a
rough
over-approximation, and a descending phase, where widening is replaced by a
new operator called \emph{narrowing}, which refines the previous result.
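To make the two phases concrete, the following self-contained Java sketch (an illustration only, not \Jandom's actual code, which is written in Scala) solves the equation system of Figure~\ref{fig:nested} over intervals, widening and then narrowing at the loop head $y_2$:

```java
// Minimal interval domain and two-phase solver for the equations of Figure 1.
final class Interval {
    static final long NINF = Long.MIN_VALUE, PINF = Long.MAX_VALUE;
    static final Interval EMPTY = new Interval(1, 0);  // lo > hi encodes bottom
    final long lo, hi;
    Interval(long lo, long hi) { this.lo = lo; this.hi = hi; }
    boolean isEmpty() { return lo > hi; }
    Interval join(Interval o) {        // least upper bound: the interval hull
        if (isEmpty()) return o;
        if (o.isEmpty()) return this;
        return new Interval(Math.min(lo, o.lo), Math.max(hi, o.hi));
    }
    Interval meet(Interval o) {        // intersection
        long l = Math.max(lo, o.lo), h = Math.min(hi, o.hi);
        return l <= h ? new Interval(l, h) : EMPTY;
    }
    Interval add(long c) {             // pointwise sum with the constant [c,c]
        if (isEmpty()) return EMPTY;
        return new Interval(lo == NINF ? NINF : lo + c, hi == PINF ? PINF : hi + c);
    }
    Interval widen(Interval o) {       // unstable bounds jump to infinity
        if (isEmpty()) return o;
        if (o.isEmpty()) return this;
        return new Interval(o.lo < lo ? NINF : lo, o.hi > hi ? PINF : hi);
    }
    Interval narrow(Interval o) {      // infinite bounds are refined back
        if (isEmpty() || o.isEmpty()) return EMPTY;
        return new Interval(lo == NINF ? o.lo : lo, hi == PINF ? o.hi : hi);
    }
    boolean same(Interval o) { return lo == o.lo && hi == o.hi; }
}

final class LoopAnalysis {
    // Solves y1..y5 of Figure 1; returns {y1, y2, y3, y4, y5}.
    static Interval[] analyze() {
        Interval y1 = new Interval(0, 0);
        Interval y2 = Interval.EMPTY, y3 = Interval.EMPTY, y4 = Interval.EMPTY;
        boolean stable = false;
        while (!stable) {              // ascending phase: widening at y2
            Interval next = y2.widen(y1.join(y4));
            stable = next.same(y2);
            y2 = next;
            y3 = y2.meet(new Interval(Interval.NINF, 9));
            y4 = y3.add(1);
        }
        stable = false;
        while (!stable) {              // descending phase: narrowing at y2
            Interval next = y2.narrow(y1.join(y4));
            stable = next.same(y2);
            y2 = next;
            y3 = y2.meet(new Interval(Interval.NINF, 9));
            y4 = y3.add(1);
        }
        Interval y5 = y2.meet(new Interval(10, Interval.PINF));
        return new Interval[]{y1, y2, y3, y4, y5};
    }
}
```

The ascending phase stabilizes at $y_2 = [0,+\infty]$; the descending phase then refines it to $y_2 = [0,10]$ and yields $y_5 = [10,10]$.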
%Thanks to the introduction of widening operator it follows that static
%analysis done through the infinite abstract domains are more precise and as
%efficient as any static analysis done through the finite abstract domains.
%The fundamental requirement that this is the operator must ensure termination
%when aied to a increasing chain.
By following the abstract interpretation theory and exploiting \Soot
\cite{Soot}, we are
implementing analyses of numerical properties of Java bytecode inside our
analyzer \Jandom.
\section{\Jandom}
\Jandom is an abstract interpretation based static analyzer written in Scala,
derived from our former project \Random \cite{AmatoS12lpar,AmatoPS10-rv}, which implemented template parallelotopes \cite{AmatoPS11JSC,AmatoPS10-sas,AmatoS11fi,AmatoLM09}.
At the moment, \Jandom supports intra-procedural static analysis of numerical
properties for a simple imperative language with a C-like syntax. It has
preliminary support for symbolic transition systems of the kind used in the
FASTer\footnote{\url{http://tapas.labri.fr/trac/wiki/FASTer/}} model checker
\cite{FAST2008} and for Java bytecode. \Jandom is freely available
online\footnote{\url{https://github.com/jandom-devel/Jandom/}} on
GitHub. In the long
run, \Jandom aims at becoming a general framework for the abstract
interpretation community, easing
the implementation of new analysis strategies and the testing of new abstract
domains.
The support for the analysis of Java bytecode is at a preliminary stage. We
support a small set of instructions, which allows us to analyze only
very simple methods. Supporting the full bytecode would not be difficult, but
time consuming. Therefore, we prefer to explore different implementation
models and technologies before committing to a definitive solution.
For manipulating the Java bytecode, we have first tried
\ASM\footnote{\url{http://asm.ow2.org/}}, which is a fast
library with pretty good documentation. We also evaluated the use of
\texttt{BCEL}\footnote{\url{http://commons.apache.org/proper/commons-bcel/}},
but we preferred \ASM since it seems to be better maintained.
In our experience, \Soot is not very common in the abstract interpretation
community, so we came to it only later. However, \Soot has a much broader
scope than \ASM, and it has been extensively used for the static analysis of
Java. Therefore, we expected that using \Soot would give us important benefits
and accelerate development.
This paper is a report on our experience with the bytecode analyzer in
\Jandom, and the use of \Soot in its development. We will try to explain
what parts of \Soot we used, which ones we did not use, and which ones we plan
to use in the future.
%, and also make
%some comparisons with \ASM on the functionalities which are common between
%the
%two libraries.
It is important to note that our aim is to use \Soot as a
library for implementing numerical static analyses. We
do not consider here the problem of integrating these analyses in the \Soot
framework itself. The interested reader can find the code described here in
the branch \texttt{soap2013} of the GitHub repository of \Jandom.
\subsection{Architecture of \Jandom}
We have designed \Jandom with a strongly layered structure in mind, as
depicted in Figure~\ref{fig:structure}. Some of these layers are not
so cleanly separated in the real code as they are in this description.
%This holds, in particular, for the \emph{Flow
%graph analyzer} and the \emph{Basic block analyzer}, which are described
%below.
Nonetheless, this is the model at which we are aiming.
\subsection{Numerical abstract domains}
In the lowest layer we find the numerical abstract domains, which encode
properties of numerical variables. Although there are
a few numerical domains natively implemented in \Jandom, most of them
are part of the Parma Polyhedra Library (\PPL for short)
\cite{BagnaraHZ08SCP}. Another well known library for numerical domains is
\APRON \cite{JeannettM08}, which we plan to integrate in the future. To
accommodate the use of native, \PPL and \APRON based domains, we have
designed a suitable common interface, called \texttt{NumericalProperty}. All
the native domains directly implement this interface, while \PPL domains are
appropriately
wrapped. This interface allowed us to develop a truly parametric analyzer,
into which new numerical abstract domains can be easily plugged.
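To fix ideas, such an interface might look as follows in Java (the method names are illustrative and do not reproduce \Jandom's actual Scala API); the deliberately trivial two-point domain shows that client code can be written against the interface alone:

```java
// Hypothetical common interface for numerical abstract domains, in the spirit
// of Jandom's NumericalProperty; names are illustrative, not the real API.
interface NumericalProperty<P extends NumericalProperty<P>> {
    P union(P other);          // abstract join, e.g. the convex hull
    P intersection(P other);   // abstract meet
    P widening(P other);       // enforces termination on ascending chains
    P narrowing(P other);      // refines the result of the ascending phase
    boolean isEmpty();         // does the property denote no concrete state?
}

// A deliberately trivial instance with only two points, "empty" and "top":
// useless for precision, but enough to type-check a parametric analyzer.
final class TrivialDomain implements NumericalProperty<TrivialDomain> {
    static final TrivialDomain EMPTY = new TrivialDomain(true);
    static final TrivialDomain TOP = new TrivialDomain(false);
    private final boolean empty;
    private TrivialDomain(boolean empty) { this.empty = empty; }
    public TrivialDomain union(TrivialDomain o) { return empty && o.empty ? EMPTY : TOP; }
    public TrivialDomain intersection(TrivialDomain o) { return empty || o.empty ? EMPTY : TOP; }
    public TrivialDomain widening(TrivialDomain o) { return union(o); } // finite domain: join suffices
    public TrivialDomain narrowing(TrivialDomain o) { return o; }
    public boolean isEmpty() { return empty; }
}
```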
We believe that the ability to exploit existing libraries is fundamental when
designing a new analyzer, so we carefully studied how best to wrap
the \PPL abstract domains into our common interface, which was not an
easy task.
In fact, while all the \PPL domains have almost identical method
signatures, they do not implement any common Java interface, and directly
descend from the \texttt{Object} class. This is a legacy of the fact that
\PPL is developed in C++, where templates may be used to achieve generic
programming. The Java
bindings came later and, unfortunately, did not try to recover the
flexibility of
templates through inheritance. This makes it difficult to write a generic
wrapper for all the \PPL domains. At the
moment, we use three kinds of wrappers:
\begin{itemize}
\item ad-hoc wrappers for the most common numerical domains;
\item a generic wrapper based on reflection;
\item a generic wrapper based on Scala macros.
\end{itemize}
Ad-hoc wrappers are the simplest ones, but a different wrapper is required for
each domain. The reflection-based wrapper is quite convenient, but
suffers
from the performance penalty of using reflection. On the contrary, the
macro-based wrapper
has the same speed as the ad-hoc wrappers, but is generic over all the
domains:
the Scala compiler generates a different binary for each numerical domain at
compile time, similarly to C++ templates.
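The idea behind the reflection-based wrapper can be sketched as follows; the domain class \texttt{IntBox} and the method name \texttt{upperBoundAssign} are invented for the demonstration and do not reproduce \PPL's actual Java API, although the latter uses similarly named mutating operations:

```java
import java.lang.reflect.Method;

// Sketch of a reflection-based generic wrapper: the domain operation is looked
// up by name on the concrete class at run time, so no common interface is
// needed on the wrapped objects.
final class ReflectiveProperty {
    private final Object inner;
    ReflectiveProperty(Object inner) { this.inner = inner; }
    // Forwards "union" to the wrapped object's mutating join operation.
    ReflectiveProperty union(ReflectiveProperty other) {
        try {
            Method m = inner.getClass().getMethod("upperBoundAssign", inner.getClass());
            m.setAccessible(true);
            m.invoke(inner, other.inner);
            return this;
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("wrapped class lacks upperBoundAssign", e);
        }
    }
    Object get() { return inner; }
}

// A stand-in for a PPL domain: a one-variable interval with a mutating join.
final class IntBox {
    long lo, hi;
    IntBox(long lo, long hi) { this.lo = lo; this.hi = hi; }
    public void upperBoundAssign(IntBox o) {
        lo = Math.min(lo, o.lo);
        hi = Math.max(hi, o.hi);
    }
}
```

Each reflective call pays the cost of a method lookup and a boxed invocation, which is precisely the performance penalty mentioned above.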
Although the macro-based wrapper seems to be the best choice, it is
inconvenient during development: due to limitations of the Scala compiler
and build tools, classes
using macros must belong to a different project from the classes defining
them. For this reason, we use the reflection-based wrappers during daily
development.
%As a consequence, the macro based wrapper is outdated most of
%the time.
Another possibility we are exploring is to use the bytecode manipulation
abilities of \Soot or \ASM to generate ad-hoc wrappers at runtime. This would
bring all the advantages of macro-based wrappers, together with a reduction in
the size of the compiled code
%(since wrappers would be generated at runtime when
%needed)
and a simplification in the management of the project. On the other hand, the
implementation of this dynamic wrapper is more difficult,
especially because the correspondence between source code and bytecode is not
as direct in Scala as it
is in Java.
\begin{figure}
\centering
\tikzstyle{myclouds}=[cloud, fill=green!30, draw=black, text=black, cloud puffs=10, cloud puff arc=120, aspect=3.2, inner sep=2.3]
\begin{tikzpicture}
\filldraw[fill=yellow!30] (0,0) +(-2.75,-1.1) rectangle ++(2.75,0.3);
\draw (0,0) node{Numerical domains};
\node [myclouds] at (-1.75, -0.6) {native};
\node [myclouds] at (0, -0.6) {\PPL};
\node [myclouds] at (1.75, -0.6) {\APRON};
%\filldraw[fill=green!30] (-1.75,-0.6) ellipse (0.75 and 0.25);
%\filldraw (-1.75,-0.6) node{Native};
%\filldraw[fill=green!30] (0,-0.6) ellipse (0.75 and 0.25);
%\filldraw (0,-0.6) node{PPL};
%\filldraw[fill=green!30] (1.75,-0.6) ellipse (0.75 and 0.25);
%\filldraw (1.75,-0.6) node{APRON};
\draw[<->] (0,0.3) -- (0,0.8);
\filldraw[fill=yellow!30] (0,1.1) +(-2.75,-0.3) rectangle ++(2.75,0.3);
\draw (0,1.1) node{Abstract environments};
\draw[<->] (0,1.4) -- (0,1.9);
\filldraw[fill=yellow!30] (0,3) +(-2.75,-1.1) rectangle ++(2.75,0.3);
\draw (0,3) node{Basic block analyzer};
\node [myclouds] at (-1.25, 2.4) {\texttt{ASM}};
\node [myclouds] at (1.25, 2.4) {\texttt{Soot}};
%\filldraw[fill=green!30] (-1.25,1.3) ellipse (0.75 and 0.25);
%\filldraw (-1.25,1.3) node{\ASM};
%\filldraw[fill=green!30] (1.25,1.3) ellipse (0.75 and 0.25);
%\filldraw (1.25,1.3) node{\Soot};
\draw[<->] (0,3.3) -- (0,3.8);
\filldraw[fill=yellow!30] (0,4.1) +(-2.75,-0.3) rectangle ++(2.75,0.3);
\draw (0,4.1) node{Flow graph analyzer};
\end{tikzpicture}
\caption{\label{fig:structure}Layered architecture of \Jandom}
\end{figure}
Finally, this layer also contains some domain combinators, i.e., methods to
get more precise domains from the basic ones. Thanks to the powerful type
system of Scala, the numerical domain API is completely type safe.
It is worth noting that, at the moment, numerical domains do not take into
consideration overflow and underflow of machine integers and floats. There are
standard methods to handle them \cite{mine:esop04}, which we plan to integrate
in the near future.
\subsection{Abstract environments}
Numerical domains only handle numerical variables and their
relationships, but
the program state in the JVM is much more complex: for instance, there are
objects in the heap and references to objects. In addition, in the \Baf
representation we also have a stack to take into consideration.
The \emph{Abstract environments} layer allows us to implement (different)
abstractions of the full program state. It is parametric \wrt a numerical
domain, which is
used for the abstraction of the numerical variables in the stack, frame and heap.
We have implemented two abstract environments: one is used for the analysis
with \Jimple, the other for the analyses with \Baf and \ASM. At the
moment, they
both ignore the heap and everything that is not numeric. In the future, we
plan to introduce new domain hierarchies for the analysis of the heap and
objects, and to make them additional parameters of the abstract program
environment.
As an example, consider a generic
abstract state for \Baf, as implemented in the class \texttt{JVMEnvDynFrame}.
It is a triple $\langle f, s, p \rangle$ where $p$ is
a numerical property (such as $v_0 + v_1 = 1$), $f$ is an array which maps
frame positions to variables in $p$, and $s$ is a stack of integers which
has the same goal as $f$ but for the stack. For example, the state $\langle
[0,-1], \langle 1,-1,-1\rangle, v_0+v_1=1 \rangle$ means that:
\begin{itemize}
\item the frame has two positions. The first position corresponds to the
variable $v_0$ in $p$, while the value $-1$ means that the second position is
unused or contains a non-numerical value;
\item the stack currently has three elements. The top element corresponds to
the variable $v_1$ in $p$, while the other two elements contain non-numerical
values;
\item the first frame position and the top position of the stack are subject to
the condition $v_0+v_1=1$.
\end{itemize}
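A state of this shape can be sketched in a few lines of Java (an illustration only: the numerical property $p$ is kept opaque and stands in as a string, and the stack is listed top-first as in the text):

```java
import java.util.List;

// Sketch of the <f, s, p> abstract state of JVMEnvDynFrame. The value -1
// marks frame slots and stack cells that are unused or non-numerical.
final class JVMEnvSketch {
    final int[] frame;          // frame position -> variable index in p, or -1
    final List<Integer> stack;  // stack cells, top-first -> variable index in p, or -1
    final String property;      // opaque stand-in for a NumericalProperty
    JVMEnvSketch(int[] frame, List<Integer> stack, String property) {
        this.frame = frame; this.stack = stack; this.property = property;
    }
    int top() { return stack.get(0); }   // variable tracked at the stack top
    int dimension() {                    // number of variables of p actually in use
        int d = 0;
        for (int v : frame) if (v >= 0) d++;
        for (int v : stack) if (v >= 0) d++;
        return d;
    }
}
```

For the example state above, the dimension of $p$ is 2, even though the frame and the stack together hold five values.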
This choice of tracking frame and stack variables separately from the numerical property $p$ has a major advantage:
%An alternative choice would have been to only use $p$ (without the $f$ and $p$
%component) and a fixed mapping from numeric dimensions to positions in the
%stack and frame. For example, the same property above could be represented gas
%$v_0 + v_4 = 1$, where $v_0$ and $v_1$ are the variables associated to the
%frame and $v_2, \ldots, v_4$ those associated to the stack. However, this
%solutions has a performance drawback: it forces $p$ to
%have dimension $n+m$ where $n$ is the size of the frame and $m$ is the maximum
%size of the stack. In our chosen solution,
the dimension of $p$ (the number of variables in the associated space) varies
dynamically and is generally much lower than the total number of positions in
the frame and stack.
%$n+m$.
Since most numerical domains have cubic (or
worse, even super-exponential) complexity in the dimension of $p$, we want
to keep this value as low as possible.
An alternative choice is to only use $p$ (without the $f$ and
$s$ components) and a fixed mapping from numeric dimensions to positions in
the stack and frame. For example, the same property above could be represented
as $v_0 + v_4 = 1$, where $v_0$ and $v_1$ are the variables associated with the
frame and $v_2, \ldots, v_4$ those associated with the stack. With this
solution, implemented in the \texttt{JVMEnvFixedFrame} class, $p$ has
dimension $n+m$, where $n$ is the number of locals and $m$ is the maximum size
of the stack. This solution has the advantage of being simpler than the
previous one, but we expect it to be slower for methods with many
non-numerical variables.
The abstract environment for \Jimple is much simpler since there is no
stack involved and the correspondence between the local variables and
the variables in the numerical properties is fixed for the entire method.
%It should be possible to combine these abstraction in
%different way, but more work should be done tu understand how it is possible.
\subsection{Basic block analyzer}
\label{sec:bb}
Our definition of basic block is somewhat different from the standard one: it
is a maximal sequence of instructions such that none but the first one may be
the target of a jump. Therefore, both single instructions and standard basic
blocks are instances of our definition, but we do not force the creation of
new blocks after every jump instruction. Bigger blocks allow us to reduce the
number of intermediate program states we need to record during the
analysis.
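With this definition, the leader computation becomes even simpler than the standard one: only the method entry and the targets of jumps begin a block. The following sketch (with an invented instruction encoding, not \Soot's API) illustrates the idea:

```java
// Sketch of the modified leader computation. jumpTargets[i] lists the indices
// instruction i may jump to (empty if i is not a jump).
final class ExtendedBlocks {
    static boolean[] leaders(int n, int[][] jumpTargets) {
        boolean[] leader = new boolean[n];
        leader[0] = true;                         // the entry always starts a block
        for (int[] targets : jumpTargets)
            for (int t : targets) leader[t] = true;
        // Unlike the standard construction, the instruction following a jump
        // is NOT forced to start a new block.
        return leader;
    }
}
```

For a loop of eight instructions with a conditional jump at index 4 targeting 7 and a goto at index 6 targeting 2, the leaders are 0, 2 and 7: instruction 5, which merely follows a jump, opens no new block.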
It turns out that a basic block may have many outgoing edges: one
fall-through edge, which is followed when the execution reaches the end of the
block, and many edges corresponding to the targets of jumps. The result of the
analysis of a block is a sequence of pairs $\langle \textit{outgoing edge},
\textit{abstract env}\rangle$.
The basic block analyzer is strictly connected to the (abstract) language we
want to analyze. Therefore, we have different basic block analyzers for \ASM,
\Baf and \Jimple.
\subsection{Flow graph analyzer}
The flow graph analyzer builds a full intra-procedural analysis from
a directed graph of basic blocks. At the moment, it implements a worklist
based strategy similar to the one provided by
\texttt{ForwardBranchedFlowAnalysis},
but it directly supports ascending and descending phases and some advanced
widenings.
A crucial point in abstract interpretation based analyses is the ability to
determine an admissible set of
widening points, i.e., a set of program points where widening should be used
instead of merge to ensure the termination of the analysis. Fundamentally, a
set
of widening points is admissible if every cycle in the control flow graph
passes through at least one widening point.
%
Determining a good set of widening points is easy with \Soot: we can use
the \texttt{(Slow)PseudoTopologicalOrderer} and take as widening points those
program points which are the targets of retreating edges.
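As an illustration, the following sketch computes such a set with a plain depth-first search (a stand-in for \Soot's orderer): it selects the targets of the edges that close a cycle along the current DFS path, and every cycle in the graph necessarily contains one such edge, so the resulting set is admissible:

```java
import java.util.*;

// Sketch: widening points as the targets of edges closing a cycle on the DFS
// path (for reducible control flow graphs, these are the retreating edges).
final class WideningPoints {
    static Set<Integer> of(List<List<Integer>> succ, int entry) {
        int[] state = new int[succ.size()];  // 0 unvisited, 1 on DFS path, 2 done
        Set<Integer> points = new TreeSet<>();
        dfs(entry, succ, state, points);
        return points;
    }
    private static void dfs(int u, List<List<Integer>> succ, int[] state, Set<Integer> points) {
        state[u] = 1;
        for (int v : succ.get(u)) {
            if (state[v] == 1) points.add(v);       // back edge: v is on the path
            else if (state[v] == 0) dfs(v, succ, state, points);
        }
        state[u] = 2;
    }
}
```

On the control flow graph of Figure~\ref{fig:nested}, the only widening point found is the join node before the test, as expected.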
Figure~\ref{fig:result} shows the result of the \Baf analyzer for the
flow graph in Figure~\ref{fig:nested}.
\begin{figure}
\small
\begin{verbatim}
static void loop()
{
word i0;
/* Frame: <-1> Stack: <> Property: [ ] */
push 0;
store.i i0;
label0:
/* Frame: <0> Stack: <> Property: [ 0 <= v0 <= 10 ] */
load.i i0;
push 10;
ifcmpge.i label1;
inc.i i0 1;
goto label0;
label1:
/* Frame: <0> Stack: <> Property: [ v0 = 10 ] */
return;
}
\end{verbatim}
\caption{\label{fig:result}Result of the \Baf analysis for the flow graph
in Figure~\ref{fig:nested}.}
\end{figure}
%\subsection{The frontend reporter}
%
%Once the \Baf or \Jimple code has been analyzed, we need to report the
%results
%at the source code level (if we have the source code available). This is the
%aim of the top layer.
\section{\Jandom and \Soot}
Now that we have outlined the architecture of \Jandom, we go into more depth
on
the relationship between \Jandom and \Soot: we discuss some of our
implementation choices and outline plans for future work related to \Soot.
\subsection{The \Soot analysis framework}
Before implementing our \emph{Flow graph analyzer}, we evaluated whether to
use the \Soot
analysis
framework directly, but we decided against it. The main reason is that we
are using \Soot as a library for a generic abstract interpretation based
analyzer, which we would like to use to test different iteration strategies for several target languages, such as imperative, object-oriented and transition systems.
With this in mind, and given the amount of work already done in \Jandom to accomplish this goal, we concluded that using the \Soot framework was not going to give us any real
benefit. We discuss here in detail why this is the case.
First of all, the only viable base class for our analyzer is
\texttt{ForwardBranchedFlowAnalysis}, since we need to keep separate
numerical properties at different branches of a conditional instruction. This
class implements a highly optimized but straightforward worklist based
analysis. However, an abstract interpretation based static analysis often
requires more complex iteration strategies to achieve adequate precision.
At least two phases are needed: an ascending phase where an over-approximation
of the required solution is built using widening, and a
descending
phase where the result of the first phase is improved. These two phases could
have
been implemented with two different \texttt{ForwardBranchedFlowAnalysis} in
cascade, but other more complex iteration strategies (with multiple
interleaving phases) would not have been possible without
overriding the \texttt{doAnalysis} method, which is tantamount to rewriting the
analyzer from scratch. This is the case, for example, of the basic recursive
\cite{Bourdoncle93} strategy or of the more advanced localized
\cite{AmatoS13sas} iteration strategies.
We believe that the analysis algorithm in \Soot could be generalized in order
to support different iteration strategies.
The best approach would be
to design a generic framework to solve fixpoint equations. In the abstract
interpretation community, the best known tool for this task is the
\emph{Fixpoint}
library\footnote{\url{http://pop-art.inrialpes.fr/~bjeannet/bjeannet-forge/fixpoint/}},
which is written in OCaml but could be ported to Java.
%Therefore, we decided to implement in \Jandom the JVM bytecode analysis engine. Our
%implementation is now ad-hoc: when we want to implement a different strategy,
%we hard code in the engine the relevant procedure.
Since we do not use the \Soot analysis framework, we have no reason to use the
\texttt{FlowSet} interface either. The latter is too limited to be used for
numerical properties, which need many primitive operations such as
assignment of linear expressions, projection over a subset of variables, and
intersection with a half-plane. In \Jandom we use the
\texttt{NumericalProperty} abstract class as the base for all the numerical
properties. We could have made
\texttt{NumericalProperty} descend from \texttt{FlowSet}, but at the moment
this is not possible since we have implemented numerical properties as
immutable objects.
%we decided
%against it. The main reason is that, at this moment, we do not want to tie
%\Jandom too strictly with \Soot. Moreover, at the moment our numerical
%properties are immutable types, and we were not ready to change this (although
%this should be definitively done for performance reasons).
\subsection{Basic blocks}
The \texttt{Block} class in \Soot is able to represent large basic blocks.
However, in order to actually generate blocks larger than the standard ones, we
had to provide our own subclasses of \texttt{BlockGraph}. Although
\texttt{BlockGraph} has a method \texttt{computeLeader} which should
be used for such a purpose, overriding it was not enough, since the
\texttt{buildBlocks} method assumes that every jump instruction is the tail
of a block. Therefore, we had to override both methods. We believe that \Soot
could be retrofitted with our new implementation of \texttt{buildBlocks}.
Some operations in \Jandom may be simplified by the assumption that
when a node (either a \texttt{Unit} or a \texttt{Block}) has more than one
successor, the first one is the fall-through node, if it exists. Although
this seems to hold in the current implementation of \Soot, it is not
documented anywhere. We think that this is a useful
property that should be made explicit.
Note that representing larger blocks is completely optional and just for the
sake of
optimization, since the upper layer of \Jandom also works with block graphs
built by the standard \Soot libraries.
%
%
%Since, as described in Section~\ref{sec:bb}, our basic blocks are larger than
%the standard ones, we do not use the
%\texttt{Block} class in \Soot but we partition instructions in blocks with
%our own code. Another reason behind the choice of not using the
%\texttt{Block}
%class is that its API is not convenient for our purposes. In particular, we
%could not find
%an easy way to distinguish the fall-through edge of a basic block, which is a
%necessary information for implementing some advanced widening strategies when
%solving an abstract system of equations.
%In fact, the method
%\texttt{getSuccs} returns a list of successors, but the documentation does not
%state in which order. We could access the last \texttt{Unit} of the block and
%use it to determine its fall-through unit, but the association between leader
%nodes and blocks may only be obtained by subclassing the \texttt{BlockGraph}
%class. None of these obstacles is really unsurmountable, but since we had to
%keep our code for finding blocks for the \ASM backend, we did the same for
%\Soot.
\subsection{\Baf vs \Jimple vs \Grimp}
We have investigated both \Baf and \Jimple intermediate
representations. The common expectation is that \Jimple is easier to
analyze, since it is higher-level and has fewer kinds of statements. However,
we are not really sure that this makes a big difference for the kind of
analyses we are interested in.
Most of the complexity of the bytecode \wrt the 3-address code used in \Jimple
is due to the large number of arithmetic and conditional instructions. The
standard analyses in \Soot do not care about arithmetic properties; hence,
abstracting all these instructions into an \texttt{AssignStmt} unit may
actually
simplify the code. In our case, we need to explicitly handle arithmetic
instructions: using \Jimple instead of \Baf just means that we need to
carefully inspect the right-hand side of assignments. However, expressions
(i.e., objects of class
\texttt{Value}) have a more complex structure than bytecode instructions.
Another reason for the large number of bytecode instructions is that some of
them come in several variants. For example, the \emph{aload} instruction has
variants \texttt{aload\_0}, \texttt{aload\_1}, \texttt{aload\_2},
\texttt{aload\_3} and \texttt{aload}. However, \Baf (and also \ASM) abstracts
away from these differences: all the load instructions above are collapsed
into the single \texttt{LoadInst} instruction.
On the other hand, \Jimple still has some advantages over \Baf, even for our
analyses: it abstracts away from the frame and stack of the JVM, so we only
need to deal with variables (objects implementing \texttt{Local}).
The \Grimp intermediate representation is not generally used for static
analysis. It is similar to \Jimple, but expressions are not linearized and may
be quite complex. However, for the analysis of numerical properties, this
representation may help in improving precision: in many cases, we may analyze
the effect of a complex assignment with greater precision when it is
considered all at once.
%, w.r.t.~what is possible if it is broken in multiple
%simpler assignments.
Consider for example the assignment $\texttt{z = z + x + y}$. Given the
precondition $z=w \wedge x+y=0$, with the octagon abstract domain the
analyzer infers that, after the assignment, $z=w \wedge x+y=0$ still holds.
But if the assignment is decomposed into $\texttt{z = z + x}$ and
$\texttt{z = z + y}$, then after $\texttt{z = z + x}$ any information
regarding $z$ is lost, since the octagon domain cannot represent the correct
invariant, which is $w = z-x$.
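The loss can be seen at the level of octagon constraints, which are
conjunctions of inequalities of the form $\pm u \pm v \le c$ over pairs of
variables. The precondition corresponds to the constraint set
\[
\{\, z - w \le 0,\; w - z \le 0,\; x + y \le 0,\; -x - y \le 0 \,\},
\]
which the one-shot transfer function for $\texttt{z = z + x + y}$ preserves.
After $\texttt{z = z + x}$ alone, the exact invariant would be
$z - x - w = 0$, which relates three variables and therefore falls outside
the octagon constraint language.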
In conclusion, we think that performing analyses on \Grimp may be more precise
than on \Jimple, and not terribly more difficult. As observed by one of the
referees, for some numerical domains \Grimp could also improve performance,
since fewer abstract operations are performed; this is particularly true for
domains implemented in \APRON or \PPL, due to the overhead of calling native
methods. Although we have not yet tried \Grimp in \Jandom, we plan to
transform the \Jimple analyzer into a \Grimp analyzer, while keeping the
\Baf analyzer.
\subsection{The tag system}
The result of our analyzer is a map, which we call an annotation, from
program points (i.e., \texttt{Unit}s) to abstract environments. At the
moment, the map is implemented through a Scala \texttt{HashMap}. This means
that we need to access hash maps and compute hash functions each time we
read or write annotations, which the analysis engine does continuously.
A better solution would be to link annotations directly to the corresponding
program point. To this aim, the \Soot tag system might be used. However, it
requires a linear search for the tag name each time a tag is accessed, so it
is not going to improve performance very much.
It would be convenient to extend the tag system with the possibility of
accessing tags in an indexed manner, using an \texttt{ArrayList} as a backend
instead of a \texttt{List}. This would allow constant-time access to an
annotation when its index is known.
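The difference can be illustrated with a toy model. This is a self-contained
hypothetical sketch: the \texttt{Tag} and \texttt{Host} names only mimic the
\Soot tag system, and the code below is not the actual \Soot API.

```java
// Hypothetical sketch contrasting a Soot-style linear tag lookup
// with an indexed ArrayList backend giving constant-time access.
import java.util.ArrayList;
import java.util.List;

interface Tag { String getName(); }

class Annotation implements Tag {
    final String name;
    final String value; // stands in for an abstract environment
    Annotation(String name, String value) { this.name = name; this.value = value; }
    public String getName() { return name; }
}

class Host {
    private final List<Tag> tags = new ArrayList<>();
    void addTag(Tag t) { tags.add(t); }

    // Soot-style access: linear search by name on every lookup.
    Tag getTag(String name) {
        for (Tag t : tags)
            if (t.getName().equals(name)) return t;
        return null;
    }

    // Proposed access: O(1) lookup when the tag index is known.
    Tag getTag(int index) { return tags.get(index); }
}

public class TagDemo {
    public static void main(String[] args) {
        Host unit = new Host();
        unit.addTag(new Annotation("LineNumberTag", "42"));
        // Record the index of our annotation once, at insertion time.
        int annIndex = 1;
        unit.addTag(new Annotation("AbstractEnvTag", "z = w, x + y = 0"));
        Annotation byName = (Annotation) unit.getTag("AbstractEnvTag");
        Annotation byIndex = (Annotation) unit.getTag(annIndex);
        System.out.println(byName.value.equals(byIndex.value)); // true
    }
}
```

Since the analysis engine reads and writes the same annotation at every
iteration of the fixpoint computation, recording the index once and then
using indexed access would remove both the hashing and the linear search
from the hot path.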
\subsection{Planned use of \Soot in the future}
Other parts of \Soot that we plan to explore in the future are:
\begin{itemize}
\item \Dava: although \Dava is presented as a decompiler for Java,
we think that performing a structured analysis on the AST of a Java program
may sometimes be more convenient than analyzing the unstructured bytecode.
See \cite{LogozzoF08} for a discussion of the benefits of analyzing bytecode
versus program source code. However, to the best of our knowledge, it is not
possible to generate a \Dava AST directly from the Java source code, and this
limits its usefulness for us. We could use one of the many available Java
parsers, but it would be great to have this integrated in \Soot.
\item Eclipse plugin: using the \Soot tag system, we plan to interface \Jandom
with the Eclipse plugin.
\end{itemize}
\subsection{Documentation}
Not everything in our experience with \Soot was pleasant. One aspect which
can be improved is documentation. One of the main drawbacks of the current
documentation is the lack of a real user manual. There are many tutorials
available online, but nothing with a thorough treatment. By contrast, \ASM
has a very detailed guide \cite{asm4} which allows the reader to become
competent with the API quite easily. It is also true that \Soot is much more
complex and powerful than \ASM, and therefore more difficult to describe.
The Javadoc could also be improved. For example, consider the class
\texttt{PseudoTopologicalOrderer}. There is no reference to what a pseudo
topological order is. It turns out that the algorithm essentially computes a
depth-first visit of a graph and reports the order of the visit. It is the
same algorithm suggested by Bourdoncle in \cite{Bourdoncle93} to efficiently
find a \emph{weak topological order}.
%\footnote{We also found an inquiry on the \Soot
%mailing list about this point.}
Although this can be inferred from the source code, the user cannot be sure
that this is the intended behavior, rather than an artifact of the current
implementation that may change in the future.
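One common realization of such an ordering is a depth-first visit that emits
nodes in reverse post-order; on an acyclic graph this is a topological order,
while back edges make it only ``pseudo'' topological. The following is our
own minimal reconstruction, not \Soot code:

```java
// Minimal sketch of a pseudo-topological order: a depth-first visit
// emitting nodes in reverse post-order. On an acyclic graph this is a
// topological order; back edges of a cyclic CFG cannot be respected.
import java.util.*;

public class PseudoTopo {
    static List<Integer> order(Map<Integer, List<Integer>> succs, int entry) {
        List<Integer> postorder = new ArrayList<>();
        dfs(entry, succs, new HashSet<>(), postorder);
        Collections.reverse(postorder); // reverse post-order
        return postorder;
    }

    static void dfs(int n, Map<Integer, List<Integer>> succs,
                    Set<Integer> visited, List<Integer> out) {
        if (!visited.add(n)) return;
        for (int s : succs.getOrDefault(n, List.of()))
            dfs(s, succs, visited, out);
        out.add(n); // post-order: node emitted after its successors
    }

    public static void main(String[] args) {
        // CFG with a loop: 0 -> 1, 1 -> 2, 2 -> 1 (back edge), 2 -> 3
        Map<Integer, List<Integer>> cfg = Map.of(
            0, List.of(1), 1, List.of(2), 2, List.of(1, 3));
        System.out.println(order(cfg, 0)); // [0, 1, 2, 3]
    }
}
```

A Javadoc comment stating as little as this, i.e.\ that the order is the
(reverse) visit order of a depth-first traversal, would already let users
rely on the property safely.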
\section{Related work}
Although \Soot is not commonly used for the analysis of numerical
properties, \cite{QianHV02} describes an analysis to remove bound checks in
Java which is particularly relevant to our aims. Bound check elimination is
based on an intra-procedural numerical analysis called
\emph{variable constraint analysis} (VCA), coupled with auxiliary
inter-procedural analyses to improve precision. VCA is not dissimilar from a
classic abstract interpretation-based analysis. \emph{Variable constraint
graphs}, which are used to represent linear constraints among variables, are
an alternative representation of an abstract domain known in the abstract
interpretation literature as \emph{difference bound matrices} \cite{Mine01},
a precursor of the Octagon domain.
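To make the connection concrete, a difference bound matrix stores, for each
pair of variables, an upper bound on their difference, and is normalized by
a shortest-path closure. The following self-contained sketch is ours and is
unrelated to the VCA implementation of \cite{QianHV02}:

```java
// Minimal sketch of a difference bound matrix (DBM): entry m[i][j]
// encodes the constraint v_j - v_i <= m[i][j]. Closure tightens all
// bounds via Floyd-Warshall shortest paths; a negative diagonal
// after closure would signal an unsatisfiable constraint system.
public class Dbm {
    static final int INF = Integer.MAX_VALUE / 2; // avoid overflow on +

    static void close(int[][] m) {
        int n = m.length;
        for (int k = 0; k < n; k++)
            for (int i = 0; i < n; i++)
                for (int j = 0; j < n; j++)
                    m[i][j] = Math.min(m[i][j], m[i][k] + m[k][j]);
    }

    public static void main(String[] args) {
        // Variables v0, v1, v2 with v1 - v0 <= 2 and v2 - v1 <= 3.
        int[][] m = {
            { 0,   2,   INF },
            { INF, 0,   3   },
            { INF, INF, 0   },
        };
        close(m);
        System.out.println(m[0][2]); // derived bound: v2 - v0 <= 5
    }
}
```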
It seems that the authors of \cite{QianHV02} had to solve some problems
similar to the ones we found in \Jandom. For example, they do not use the
standard \Soot analysis engine, since they want more control over the order
in which semantic equations are solved. Although their aim is different from
ours and their analysis is optimized for a particular purpose, some of their
ideas might be implemented in \Jandom, since they appear to be of general
usefulness; one example is the strategy to distinguish between loop body and
loop exit.
\section{Conclusions}
Is there any benefit to using \Soot in \Jandom instead of a simple
bytecode library such as \ASM? The immediate answer to this question is: yes,
but not as much as we would like. The point is that \Soot is a complex
framework, and to get all the benefits we should embrace it completely. This
is not possible at the moment, since \Jandom also supports \ASM and languages
other than Java bytecode. However, we should evaluate whether to ditch the
other targets (or compile them to bytecode) and make \Jandom a pure Java
bytecode analyzer.
We also expect \Soot to be much more useful once we implement inter-procedural
analysis in \Jandom. For example, the ability to browse all the classes of
the \texttt{Scene} and to compute call graphs will be of great help.
%We wonder whether at least part of the functionalities of \Jandom
%could be moved to \Soot, to the benefits of a broader community.
Nonetheless, there are some improvements to \Soot which could greatly help in
spreading the use of this library in the abstract interpretation community.
The most important one is probably, as we said before, integrating into
\Soot a more sophisticated data-flow equation solver such as \emph{Fixpoint}.
%\bibliography{asbiblio}
%\bibliographystyle{abbrvnat}
\begin{thebibliography}{22}
\providecommand{\natexlab}[1]{#1}
\providecommand{\url}[1]{\texttt{#1}}
\expandafter\ifx\csname urlstyle\endcsname\relax
\providecommand{\doi}[1]{doi: #1}\else
\providecommand{\doi}{doi: \begingroup \urlstyle{rm}\Url}\fi
\bibitem[Amato and Scozzari(2011)]{AmatoS11fi}
G.~Amato and F.~Scozzari.
\newblock Observational completeness on abstract interpretation.
\newblock \emph{Fundamenta Informaticae}, 106\penalty0
(2--4):\penalty0
149--173, 2011.
\newblock \doi{10.3233/FI-2011-381}.
\bibitem[Amato and Scozzari(2012{\natexlab{a}})]{AmatoS12-entcs}
G.~Amato and F.~Scozzari.
\newblock The abstract domain of parallelotopes.
\newblock In J.~Midtgaard and M.~Might, editors, \emph{Proceedings of
the
Fourth International Workshop on Numerical and Symbolic Abstract
Domains,
NSAD 2012}, volume 287 of \emph{Electronic Notes in Theoretical
Computer
Science}, pages 17--28. Elsevier, 2012{\natexlab{a}}.
\newblock \doi{10.1016/j.entcs.2012.09.003}.
\bibitem[Amato and Scozzari(2012{\natexlab{b}})]{AmatoS12lpar}
G.~Amato and F.~Scozzari.
\newblock Random: {R}-based analyzer for numerical domains.
\newblock In N.~Bjørner and A.~Voronkov, editors, \emph{Logic for
Programming, Artificial Intelligence, and Reasoning},
volume 7180 of
\emph{Lecture Notes in Computer Science}, pages 375--382. Springer,
2012{\natexlab{b}}.
\newblock \doi{10.1007/978-3-642-28717-6_29}.
\bibitem[Amato and Scozzari(2013)]{AmatoS13sas}
G.~Amato and F.~Scozzari.
\newblock Localizing widening and narrowing.
\newblock In F.~Logozzo and M.~F{\"a}hndrich, editors, \emph{Static
Analysis}, volume 7935 of \emph{Lecture Notes in Computer
Science}, pages
25--42. Springer, 2013.
\bibitem[Amato et~al.(2009)Amato, Lipton, and McGrail]{AmatoLM09}
G.~Amato, J.~Lipton, and R.~McGrail.
\newblock On the algebraic structure of declarative programming
languages.
\newblock \emph{Theoretical Computer Science}, 410\penalty0
(46):\penalty0
4626--4671, 2009.
\newblock \doi{10.1016/j.tcs.2009.07.038}.
\bibitem[Amato et~al.(2010{\natexlab{a}})Amato, Parton, and
Scozzari]{AmatoPS10-rv}
G.~Amato, M.~Parton, and F.~Scozzari.
\newblock A tool which mines partial execution traces to improve
static
analysis.
\newblock In H.~Barringer and \emph{et al.}, editors, \emph{Runtime
Verification},
volume 6418 of \emph{Lecture Notes in Computer Science}, pages
475--479.
Springer, 2010{\natexlab{a}}.
\newblock \doi{10.1007/978-3-642-16612-9_37}.
\bibitem[Amato et~al.(2010{\natexlab{b}})Amato, Parton, and
Scozzari]{AmatoPS10-sas}
G.~Amato, M.~Parton, and F.~Scozzari.
\newblock Deriving numerical abstract domains via principal component
analysis.
\newblock In R.~Cousot and M.~Martel, editors, \emph{Static Analysis},
volume 6337 of \emph{Lecture Notes in Computer Science}, pages
134--150.
Springer, 2010{\natexlab{b}}.
\newblock \doi{10.1007/978-3-642-15769-1_9}.
\bibitem[Amato et~al.(2012)Amato, Parton, and Scozzari]{AmatoPS11JSC}
G.~Amato, M.~Parton, and F.~Scozzari.
\newblock Discovering invariants via simple component analysis.
\newblock \emph{Journal of Symbolic Computation}, 47\penalty0
(12):\penalty0
1533--1560, 2012.
\newblock \doi{10.1016/j.jsc.2011.12.052}.
\bibitem[Bagnara et~al.(2008)Bagnara, Hill, and
Zaffanella]{BagnaraHZ08SCP}
R.~Bagnara, P.~M. Hill, and E.~Zaffanella.
\newblock The {Parma Polyhedra Library}: Toward a complete set of
numerical
abstractions for the analysis and verification of hardware and
software
systems.
\newblock \emph{Science of Computer Programming}, 72\penalty0
(1--2):\penalty0
3--21, 2008.
\newblock \doi{10.1016/j.scico.2007.08.001}.
\bibitem[Bardin et~al.(2008)Bardin, Finkel, Leroux, and
Petrucci]{FAST2008}
S.~Bardin, A.~Finkel, J.~Leroux, and L.~Petrucci.
\newblock Fast: acceleration from theory to practice.
\newblock \emph{International Journal on Software Tools for Technology
Transfer}, 10\penalty0 (5):\penalty0 401--424, 2008.
\newblock \doi{10.1007/s10009-008-0064-3}.
\bibitem[Bourdoncle(1993)]{Bourdoncle93}
F.~Bourdoncle.
\newblock Efficient chaotic iteration strategies with widenings.
\newblock In D.~Bj{\o}rner, M.~Broy, and I.~V. Pottosin, editors,
\emph{Formal
Methods in Programming and Their Applications},
volume 735 of \emph{Lecture Notes in Computer Science}, pages
128--141.
Springer, 1993.
\newblock \doi{10.1007/BFb0039704}.
\bibitem[Bruneton(2011)]{asm4}
E.~Bruneton.
\newblock \emph{ASM 4.0 -- A Java bytecode engineering library}, 2011.
\newblock URL
\url{http://download.forge.objectweb.org/asm/asm4-guide.pdf}.
\newblock Last accessed 2013/05/18.
\bibitem[Cousot and Cousot(1976)]{CousotC76}
P.~Cousot and R.~Cousot.
\newblock Static determination of dynamic properties of programs.
\newblock In \emph{Proceedings of the Second International Symposium
on
Programming}, pages 106--130, Paris, France, 1976. Dunod.
\bibitem[Cousot and Cousot(1977)]{CousotC77}
P.~Cousot and R.~Cousot.
\newblock Abstract interpretation: A unified lattice model for static
analysis
of programs by construction or approximation of fixpoints.
\newblock In \emph{POPL '77: Proceedings of the 4th ACM SIGACT-SIGPLAN
symposium on Principles of programming languages}, pages 238--252.
ACM Press, 1977.
\newblock \doi{10.1145/512950.512973}.
\bibitem[Cousot and Cousot(1979)]{CousotC79}
P.~Cousot and R.~Cousot.
\newblock Systematic design of program analysis frameworks.
\newblock In \emph{POPL '79: Proceedings of the 6th ACM SIGACT-SIGPLAN
symposium on Principles of programming languages}, pages 269--282.
ACM Press, 1979.
\newblock \doi{10.1145/567752.567778}.
\bibitem[Cousot and Halbwachs(1978)]{CousotH78}
P.~Cousot and N.~Halbwachs.
\newblock Automatic discovery of linear restraints among variables of
a
program.
\newblock In \emph{POPL '78: Proceedings of the 5th ACM SIGACT-SIGPLAN
symposium on Principles of programming languages}, pages 84--97.
ACM Press, 1978.
\newblock \doi{10.1145/512760.512770}.
\bibitem[Jeannet and Min{\'e}(2009)]{JeannettM08}
B.~Jeannet and A.~Min{\'e}.
\newblock {APRON}: A library of numerical abstract domains for static
analysis.
\newblock In A.~Bouajjani and O.~Maler, editors, \emph{Computer Aided
Verification}, volume 5643 of \emph{Lecture
Notes in
Computer Science}, pages 661--667. Springer,
2009.
\newblock \doi{10.1007/978-3-642-02658-4_52}.
\bibitem[Logozzo and F{\"a}hndrich(2008)]{LogozzoF08}
F.~Logozzo and M.~F{\"a}hndrich.
\newblock On the relative completeness of bytecode analysis versus
source code
analysis.
\newblock In L.~J. Hendren, editor, \emph{Compiler Construction},
volume 4959 of
\emph{Lecture
Notes in Computer Science}, pages 197--212. Springer,
2008.
\newblock \doi{10.1007/978-3-540-78791-4_14}.
\bibitem[Min{\'e}(2001)]{Mine01}
A.~Min{\'e}.
\newblock A new numerical abstract domain based on difference-bound
matrices.
\newblock In O.~Danvy and A.~Filinski, editors, \emph{Programs as
Data Objects},
volume 2053 of \emph{Lecture Notes in Computer Science}, pages
155--172.
Springer, 2001.
\newblock \doi{10.1007/3-540-44978-7_10}.
\bibitem[Min{\'e}(2004)]{mine:esop04}
A.~Min{\'e}.
\newblock Relational abstract domains for the detection of
floating-point
run-time errors.
\newblock In D.~Schmidt, editor, \emph{Programming Languages and
Systems}, volume 2986 of
\emph{Lecture Notes in Computer Science}, pages 3--17. Springer,
2004.
\newblock \doi{10.1007/978-3-540-24725-8_2}.
\bibitem[Min{\'e}(2006)]{Mine06}
A.~Min{\'e}.
\newblock The octagon abstract domain.
\newblock \emph{Higher-Order and Symbolic Computation}, 19\penalty0
(1):\penalty0 31--100, 2006.
\newblock \doi{10.1007/s10990-006-8609-1}.
\bibitem[Qian et~al.(2002)Qian, Hendren, and Verbrugge]{QianHV02}
F.~Qian, L.~Hendren, and C.~Verbrugge.
\newblock A comprehensive approach to array bounds check elimination
for Java.
\newblock In R.~N. Horspool, editor, \emph{Compiler Construction},
pages 325--341. Springer,
2002.
\newblock \doi{10.1007/3-540-45937-5_23}.
\bibitem[Vall{\'e}e-Rai et~al.(1999)Vall{\'e}e-Rai, Co, Gagnon, Hendren, Lam,
and Sundaresan]{Soot}
R.~Vall{\'e}e-Rai, P.~Co, E.~Gagnon, L.~Hendren, P.~Lam, and V.~Sundaresan.
\newblock Soot -- a {J}ava bytecode optimization framework.
\newblock In \emph{Proceedings of the 1999 conference of the Centre for
Advanced Studies on Collaborative research}, CASCON '99. IBM
Press, 1999.
\end{thebibliography}
\end{document}