Latest update Sep. 6, 2019
This is an account for the most significant parts of my research. I have left out some supervised work where my contribution has been minor.
This is what I did for my PhD. Systolic arrays is a concept invented by H.T. Kung shortly before 1980. The concept was born out of two observations: that the laws of nature eventually would limit the speed of single processors, leaving parallelisation as the only way to increase hardware performance, and that the cost of communication between parallel processors would depend strongly on the physical distance between them.
Systolic arrays are thus highly regular, special purpose processor arrays where the simple processors are connected to their nearest neighbours only. This gives rise to various mesh-connected structures, not unlike modern mesh-connected many-core processors. Each processor is specialised to perform some simple operation, like the SAXPY (multiply-and-add) operation. The basic systolic array is synchronous, where at each clock cycle each processor will read its input, perform its operation, and write the result(s) to its outputs. Thus the term "systolic" - where data is being "pumped" through the array in analogy with the heart pumping blood through the body. There is no control except the timing and location of the appearance of the input data. Thus, systolic arrays are apt to process highly predictable and regular parallel algorithms. Such algorithms are found especially within linear algebra, but systolic algorithms have been proposed also for tasks like signal processing, sorting, and computation of transitive closure.
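As a toy illustration (my own sketch, not from the original work), the classic one-dimensional systolic array for matrix-vector multiplication can be simulated in a few lines of Python: each cell holds one vector element, performs a multiply-and-add per clock tick, and passes its partial sum to its right neighbour.

```python
def systolic_matvec(A, x):
    """Simulate a 1-D systolic array computing y = A @ x.

    Cell j holds x[j] stationary. The partial sum for row i enters
    cell 0 at clock tick i and moves one cell to the right per tick;
    matrix element A[i][j] is fed to cell j at tick i + j (the input
    skew typical of systolic designs)."""
    n = len(A)
    cells = [0] * n                     # each cell's pipeline register
    y = []
    for t in range(2 * n - 1):          # total number of clock ticks
        # Update right-to-left so each cell reads its left neighbour's
        # value from the previous tick.
        for j in reversed(range(n)):
            i = t - j                   # row whose partial sum is at cell j now
            if 0 <= i < n:
                incoming = cells[j - 1] if j > 0 else 0
                cells[j] = incoming + A[i][j] * x[j]   # multiply-and-add
                if j == n - 1:
                    y.append(cells[j])  # y[i] leaves the array at the last cell
    return y

print(systolic_matvec([[1, 2], [3, 4]], [5, 6]))   # [17, 39]
```

The skewed feeding of the matrix elements, with A[i][j] arriving at cell j at tick i + j, reflects the point above: the only "control" is the timing and placement of the input data.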
How can systolic algorithms be systematically designed? And how can a systolic array be designed that implements a particular systolic algorithm? An approach to this is to consider the problem as a DAG scheduling problem, where the nodes are computations and the edges represent data dependencies where some output from a computation is used as input to another. Each node will then be assigned a time when it is executed, and a location (processor) where this takes place: these assignments are called space-time mappings. The assignment is done subject to causality constraints (data must be produced before being used), communication constraints (communicated data must have enough time to propagate through the network from its production to its use(s)), and possibly other constraints. The non-causal constraints can either arise from an existing systolic array, or provide design constraints on a systolic array to be synthesised: in the latter case, an array supporting the space-time mapping can be obtained through a projection along the time axis.
An explicit representation of DAG and space-time mapping would be prohibitively large for the kinds of massively parallel computations typically considered for systolic arrays. Fortunately these algorithms and mappings tend to have a regular structure, which makes it possible to have compact symbolic representations. A very common approach is to describe the steps in the computation by some recursion equation where the recursion argument is an integer vector ranging over an index set that is represented by a polyhedron. As the dimensionality of the index set often is low, like 2 or 3, these polyhedra provide a very compact representation.
Also the space-time formed by time and processor coordinates will typically be an integer vector space in two or three dimensions. If we restrict the space-time mappings to linear mappings from index sets to space-time, then the design problem amounts to finding a matrix with a low number of elements that represents a mapping that satisfies the constraints. This is typically a tractable problem, and one can also consider searching for optimal solutions with respect to some optimality criterion, like total execution time or processor utilization.
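A small sketch (my own notation, not the thesis's) of what checking such a linear space-time mapping amounts to, for the 2-D recurrence y(i,j) = y(i,j-1) + a(i,j)*x(j) over a square index set:

```python
import itertools

def check_spacetime_mapping(T, deps, N):
    """Check a linear space-time mapping for a 2-D index set.

    T is a 2x2 integer matrix: (t, p) = T @ (i, j), where the first
    row (lambda) is the timing schedule and the second (sigma) the
    processor allocation. deps is a list of dependence vectors d: an
    index point q depends on q - d. Checks, for every dependence:
      * causality:     lambda . d >= 1  (produce before use)
      * communication: |sigma . d| <= lambda . d  (one hop per cycle)
    and that no two index points map to the same (time, processor).
    """
    lam, sig = T
    dot = lambda v, w: v[0] * w[0] + v[1] * w[1]
    for d in deps:
        if dot(lam, d) < 1:
            return False                 # result used before produced
        if abs(dot(sig, d)) > dot(lam, d):
            return False                 # data cannot reach its user in time
    points = list(itertools.product(range(N), repeat=2))
    images = {(dot(lam, q), dot(sig, q)) for q in points}
    return len(images) == len(points)    # no processor does two things at once

# Dependences: (0,1) for the running sum y, (1,0) for reusing x on the next row.
deps = [(0, 1), (1, 0)]
print(check_spacetime_mapping([(1, 1), (1, 0)], deps, 4))  # True:  t=i+j, p=i
print(check_spacetime_mapping([(0, 1), (1, 0)], deps, 4))  # False: t=j breaks causality
```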
My contributions were in a formalization of the approach, giving a formal semantics to computations as DAGs where the nodes are substitutions, and edges represent data dependencies defined by variables being assigned by some substitution and used by others. Each substitution can be seen as a single-assignment (corresponding to a "let" declaration in a functional language). I also considered the problem of finding a suitable space-time mapping as a constraint solving problem, formulating various constraints as arithmetic constraints on the coefficients in the mapping. The problem of finding a time-optimal systolic array then turns into finding an optimal space-time mapping by solving a set of ILP problems. I also considered extensions to "almost-systolic" processor arrays defined by non-linear space-time mappings, including efficient designs for FFT, and I investigated another extension to the approach with more detailed constraints, allowing good utilization of processing elements with pipelined functional units.
As a spinoff to my work formalizing the computation model for systolic algorithms, I developed a single-assignment semantics for imperative programs. In this semantics, each execution of an ordinary assignment generates a new single-assignment. The result of a program execution is a set of single-assignments (i.e., a DAG where edges represent data dependencies). This kind of semantics could be used to support parallelization, and I showed that the semantics is consistent with the traditional Floyd-Hoare semantics for a simple While language. There are also some relations to the SSA form used by some optimizing compilers: however my semantics generates single-assignments in a dynamic, possibly data-dependent manner whereas the SSA form is static.
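The flavour of this semantics can be conveyed with a small sketch (names and details are mine): interpreting a tiny While program while emitting one fresh single-assignment per executed assignment.

```python
def single_assignment_trace(n):
    """Run  i = 0; s = 0; while i < n: s = s + i; i = i + 1
    and emit one single-assignment per *executed* assignment. Each
    emitted record defines a fresh variable version exactly once and
    lists the versions it reads: together they form the dependence
    DAG. Note that the size of the DAG is data-dependent (it grows
    with n), unlike the static SSA form."""
    version, values, trace = {}, {"n": n}, []

    def emit(var, used, value):
        uses = [f"{v}_{version.get(v, 0)}" for v in used]   # versions read
        version[var] = version.get(var, 0) + 1              # fresh version
        values[var] = value
        trace.append((f"{var}_{version[var]}", uses, value))

    emit("i", [], 0)
    emit("s", [], 0)
    while values["i"] < values["n"]:
        emit("s", ["s", "i"], values["s"] + values["i"])
        emit("i", ["i"], values["i"] + 1)
    return trace

print(single_assignment_trace(2))
# e.g. the fifth record is ('s_3', ['s_2', 'i_2'], 1)
```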
In some unpublished work (techreport) I also considered how to go in the other direction: to synthesize an imperative straight-line program, with concurrent assignments, from a space-time mapping of a static set of single-assignments. The synthesis includes the selection of imperative program variables and the allocation of single-assignment variables, representing values, to program variables representing memory locations. This is a kind of static memory allocation, similar to register allocation. In this work I proved that the synthesized imperative program is semantically consistent with the set of single-assignments, and I also showed how to select program variables and assign single-assignment variables to them in a way that minimizes the memory requirements.
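The allocation step can be sketched as follows (a simplification of mine, assuming each value's lifetime under the schedule is given as a closed time interval): greedy allocation over lifetimes sorted by definition time, which for intervals uses the minimal number of program variables.

```python
def allocate(lifetimes):
    """Map single-assignment values to program variables, given each
    value's (definition_time, last_use_time) under a fixed schedule.
    Two values may share a program variable iff their lifetimes do
    not overlap: greedy allocation over values sorted by definition
    time (interval-graph colouring, optimal for intervals)."""
    free, alloc, active, n_vars = [], {}, [], 0
    for name, (start, end) in sorted(lifetimes.items(), key=lambda kv: kv[1][0]):
        # Release program variables whose value died before this definition.
        for last_use, var in list(active):
            if last_use < start:
                active.remove((last_use, var))
                free.append(var)
        if free:
            var = free.pop()             # reuse a dead variable
        else:
            var = f"m{n_vars}"           # allocate a fresh one
            n_vars += 1
        alloc[name] = var
        active.append((end, var))
    return alloc

print(allocate({"a": (0, 2), "b": (1, 3), "c": (3, 4)}))
# "a" and "c" can share a memory location; "b" overlaps both
```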
Unfold-fold program transformations can be used for many optimizations of functional programs. My interest was sparked by the possibility to use such transformations to simplify the control structure of a recursive program, to yield a program that is easier to implement in hardware. I thus first considered a technique called "total unfolding" where a recursive program, with partially known input data, is unfolded until termination, yielding a large static expression. This static expression could then be seen as a DAG similar to the single-assignment DAGs that I had considered earlier. This would allow the use of succinct, parameterized recursive function definitions that could, with enough information about the arguments given, be unfolded into a static expression that would allow an efficient hardware implementation. To support this I proved a theorem about termination of the symbolic unfolding relative to termination of executions of the original recursive program.
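A toy example of total unfolding (mine, not from the paper): with the recursion depth known but the array contents symbolic, the recursion unfolds completely into a static expression.

```python
def unfold_sum(n, x):
    """Totally unfold  sum(n) = if n == 0 then 0 else sum(n-1) + x[n-1]
    for a *known* n but a *symbolic* array x: the recursion disappears,
    leaving a static expression (a string here; a DAG in general).
    The unfolding terminates precisely because n is fully known."""
    if n == 0:
        return "0"
    return f"({unfold_sum(n - 1, x)} + {x}[{n - 1}])"

print(unfold_sum(3, "x"))   # (((0 + x[0]) + x[1]) + x[2])
```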
Later I became interested in the correctness of unfold-fold transformations. Basically all previous results in this area concern deterministic programs. I considered recursive, possibly nondeterministic programs where I formalized the programming model using rewrite systems (the "Recursive Applicative Program Scheme" approach). In order to cover languages with variable binding constructs, like the lambda calculus, I considered higher order rewrite systems, and to capture possibly infinite computations I based the semantics on infinitary term rewriting. In this setting I was able to prove a theorem that gives some conditions guaranteeing that an unfold-fold transformation will produce a semantically equivalent, possibly nondeterministic, recursive program.
Systolic array synthesis from high-level recursive algorithm specifications, which I studied for my PhD, works only for a quite limited class of algorithms. A natural next step was to consider more general high-level parallel programming models. The data parallel model soon caught my interest: it expresses parallelism implicitly, through a number of primitives that work in parallel over data structures, rather than explicitly through some kind of parallel threads or processes. If these primitives are designed correctly then the resulting data parallel language will be free from race conditions and deadlocks. Another interesting property is that the judicious use of data parallel primitives can yield very succinct and clear code, similar to the use of higher-order functions like map and fold in functional programming (and, indeed, the data parallel primitives are closely related to these).
My interest quickly turned to the latter aspect. Existing data-parallel and array languages often had flaws, or undue restrictions, that hampered their use for writing high-level parallel algorithm specifications. Could we design a simple, generic, yet expressive data parallel model from first principles? Data parallel data structures are often indexed, such as arrays. My idea was to consider such structures as partial functions from some index set, and define data parallel operations as operations on functions. This led to the data field model, where the data structures, or data fields, are pairs of functions and bounds. The functions map from some index type (often integers, or tuples of integers), and the bounds are representations of index sets enclosing the domains of the data fields. For instance, ordinary arrays can be seen as data fields where the function maps from integers, and the bound consists of a pair of integers representing an interval. However, data fields are not restricted to regular arrays. Instead the data field model requires index types and bounds to have certain properties: these properties are needed for making the data parallel operations on data fields well-defined. They can be seen as an abstract interface, making the data field model highly generic. The data field model also includes a variable-binding construct called "forall-abstraction", forall x -> e, which parallels lambda-abstraction for functions but defines a data field rather than a function.
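A minimal rendering of the idea in Python (my sketch; the real model is considerably richer, and bounds are symbolic index set representations rather than explicit finite sets):

```python
class DataField:
    """A data field is a pair of a function and a bound: the function
    maps indices to values, the bound is a superset of the domain --
    here simply an explicit finite set of valid indices."""
    def __init__(self, fn, bound):
        self.fn, self.bound = fn, set(bound)

    def __getitem__(self, ix):
        assert ix in self.bound, "index outside the data field's bound"
        return self.fn(ix)

    def to_list(self):
        return [self.fn(ix) for ix in sorted(self.bound)]

def forall(bound, body):
    """forall x -> e : like lambda-abstraction, but yields a data
    field over the given bound rather than a bare function."""
    return DataField(body, bound)

def zip_with(op, d1, d2):
    """Elementwise operations are operations on the functions; the
    bound of the result is derived from the argument bounds (here:
    their intersection)."""
    return DataField(lambda i: op(d1.fn(i), d2.fn(i)), d1.bound & d2.bound)

# An ordinary "array" is just a data field over an integer interval:
a = forall(range(5), lambda i: i * i)
b = zip_with(lambda x, y: x + y, a, forall(range(3, 8), lambda i: 1))
print(b.to_list())   # bound {3, 4}: [10, 17]
```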
The functional nature of data fields makes them suitable for inclusion in functional languages. A further advantage is that pure functional languages disallow side-effects in functions, which the data field model requires. To demonstrate this we made a proof-of-concept implementation of a dialect of Haskell, "Data Field Haskell", where the nhc13 compiler for Haskell 1.3 was extended with an instance of data fields consisting of sparse and dense multi-dimensional arrays. The implementation also included forall-abstraction, with rules for how to derive the bounds of the resulting data field.
The Worst-Case Execution Time, or WCET, of a program is the longest time it takes to run the program without interruptions or other disturbances from the environment. WCET analysis aims to estimate the WCET. Very often the WCET estimate is required to be safe, which means that it must never underestimate the real WCET. Provably safe WCET analyses are invariably static analyses formulated in some formal framework, like abstract interpretation.
In order to obtain a reasonably scalable safe analysis, the analysis is often broken down into three phases: a flow analysis, which bounds the possible program flows (such as the number of loop iterations); a low-level microarchitectural analysis, which bounds the execution times of small code parts on the target hardware; and a final calculation, which combines the results of the two into a WCET estimate.
When I picked up my position at MDH, I switched to the research area of WCET analysis. Unlike my previous research this was more of a team effort, with people on the team like Andreas Ermedahl, Jan Gustafsson, Christer Sandberg, Linus Källberg, Stefan Bygde, and Andreas Gustavsson. We have mainly targeted the flow analysis, which has often been overlooked, and we have developed a number of approaches to automated flow analysis, mainly based on abstract interpretation.
In addition to flow analysis, we have also developed some methods for approximate hybrid analysis that combine timing measurements with static analysis techniques. Such methods are not safe in general, but can sometimes still be useful in less safety-critical applications. Both our methods aim to replace the microarchitectural analysis with timing measurements from which simple timing models can be built.
Much of our work in WCET analysis is implemented in our tool SWEET (SWEdish Execution Time tool). SWEET is mainly a tool for flow analysis, and it implements both Gustafsson's abstract execution (AE) and the more approximate methods. It has a rich "flow fact" language for expressing program flow constraints. The computed flow facts can be context-sensitive, and they can express a multitude of flow constraints such as upper and lower loop iteration bounds, infeasible path constraints, etc. SWEET also implements several other supporting analyses, such as dataflow analysis (Reaching Definitions), static backwards program slicing, and a conventional value analysis by abstract interpretation. It also computes call graphs and control flow graphs, and it can compute approximate (possibly unsafe) WCET estimates using the simple timing models mentioned above.
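To give the flavour of flow analysis by abstract interpretation, here is a drastically simplified sketch of mine (not SWEET's actual algorithm): deriving an upper loop iteration bound by abstractly executing a counting loop over intervals.

```python
def loop_bound(lo, hi, limit):
    """Derive a safe upper iteration bound for
        i = <some value in [lo, hi]>; while i < limit: i = i + 1
    by abstract execution over intervals: each abstract iteration
    applies the abstract effect of the loop body to the interval for
    i, counting until no concrete run can still enter the loop."""
    i = (lo, hi)
    iterations = 0
    while i[0] < limit:                        # some run may still enter
        # Abstract effect of i = i + 1, knowing the loop was entered
        # (so i < limit held before the increment):
        i = (i[0] + 1, min(i[1] + 1, limit))
        iterations += 1
    return iterations                          # safe upper bound

print(loop_bound(0, 3, 10))   # 10: the run starting at i = 0 is worst
```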
SWEET analyses the "ALF" format. ALF is an intermediate format designed to represent both source code (such as C) and low-level code. Programs can be analysed by translating them into ALF and then translating the analysis results back to the original program. A translator for C exists, as well as a more experimental translator for PowerPC binaries.
A spinoff of our WCET analysis research is the "Mälardalen Benchmarks", a benchmark suite that has become widely used for experimental evaluations in WCET analysis research as well as in more general embedded systems research. This is a collection of simple C programs that we gathered, organised, documented, and made easily available on the web. We still keep the benchmarks online, although we don't actively maintain the benchmark suite anymore. A successor to our benchmark suite is the TACLeBench benchmark suite, which is still actively maintained.
This work was a spinoff from our work regarding parametric WCET analysis. An important requirement on numeric abstract domains, when used for analysis of safety-critical programs, is that the analysis is sound also in the presence of numbers of bounded size. For instance, wrap-arounds must be handled correctly. For relational domains, like the polyhedral domain used in our symbolic flow analysis, this is non-trivial. Our contribution was to significantly improve the precision, by quite simple means, of a polyhedral analysis by Simon and King that is sound for numbers of bounded size.
Related to WCET analysis, my PhD student Xavier Vera (main driver), Jingling Xue, and I investigated how to increase the predictability of the data cache contents for real-time tasks without sacrificing too much performance. The idea was to use dynamic cache locking, keeping the caches unlocked for code sections with single-path code and input-independent memory accesses, and otherwise lock the cache. With this discipline the cache contents will be known at all stages of computation, which, for instance, simplifies WCET analysis a lot. We also considered optimizations such as preloading the cache, cache partitioning to make it possible to use the technique for preemptively scheduled tasks, and compiler optimizations like loop tiling to reduce the memory footprint in systems with cache partitioning. Overall the experimental results were very good, with only small degradations in performance.
Related to the above, in collaboration with Raimund Kirner and Peter Puschner from TU Vienna, we developed a simple analysis to safely identify code parts that are single-path, i.e., code where the conditions that control the execution of the code are independent of input data. The analysis is formulated in the framework of abstract interpretation, and we provided a proof of soundness.
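The core idea can be sketched as a simple dependence propagation (a simplification of mine; the published analysis is a proper abstract interpretation with a soundness proof):

```python
def single_path_regions(stmts, input_vars):
    """Classify each statement of a tiny branch-annotated program.
    A statement ('assign', x, used) makes x input-dependent iff it
    reads an input-dependent variable; a statement ('branch', used)
    is single-path iff no variable its condition reads is
    input-dependent."""
    tainted = set(input_vars)            # input-dependent variables
    result = []
    for stmt in stmts:
        if stmt[0] == "assign":
            _, x, used = stmt
            if tainted & set(used):
                tainted.add(x)
            else:
                tainted.discard(x)       # overwritten with input-independent value
            result.append(True)          # assignments are always single-path
        else:                            # ('branch', used)
            result.append(not (tainted & set(stmt[1])))
    return result

prog = [("assign", "k", []),        # k = 0          : input-independent
        ("assign", "t", ["inp"]),   # t = inp + 1    : input-dependent
        ("branch", ["k"]),          # if k < 10: ... : single-path
        ("branch", ["t"])]          # if t > 0:  ... : not single-path
print(single_path_regions(prog, {"inp"}))   # [True, True, True, False]
```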
Originally my interest in program slicing was to use it to speed up our WCET flow analysis, by excluding the analysis of code that surely cannot influence the program flow. This led to an interest in the technique itself, including joint work with my PhD student Husni Khanfar regarding demand-driven approaches to slicing, and joint theoretical work with my colleague Abu Naser Masud on the correctness of inter-procedural slicing.
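In its simplest form, backward slicing is a reachability computation over a dependence graph (a sketch of mine; the interprocedural case treated in the papers is considerably harder):

```python
from collections import deque

def backward_slice(deps, criterion):
    """Compute a static backward slice: all statements that the
    slicing criterion (transitively) depends on. deps maps each
    statement to the statements it depends on, via data or control
    dependences."""
    sliced, work = set(), deque([criterion])
    while work:
        s = work.popleft()
        if s not in sliced:
            sliced.add(s)
            work.extend(deps.get(s, []))
    return sliced

# 1: a = read(); 2: b = read(); 3: c = a + 1; 4: d = b * 2; 5: print(c)
deps = {3: [1], 4: [2], 5: [3]}
print(sorted(backward_slice(deps, 5)))   # [1, 3, 5]: statements 2 and 4 are irrelevant
```

In the WCET setting, the criterion would be the set of statements that can influence the program flow, and everything outside the slice can safely be excluded from the flow analysis.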
This work is still under development. I am interested in how slicing can be used to make software testing more efficient by enabling a better selection of test cases, or to reduce the search space in search-based testing. I also collaborate with Mehrdad Saadatmand and Pasqualina Potena at RISE/SICS, and Birgitta Lindström at HiS, on the use of slicing to reduce the number of equivalent mutants in mutation testing.
TADL2 is a language for expressing timing constraints. It has been developed in the automotive domain, and an early version provides the basis for the AUTOSAR Timing Extensions. TADL2 is based on events, which basically are increasing sequences of times (called event occurrences). TADL2 constraints can concern single events, defining what makes an event periodic, sporadic, etc., or involve several events, expressing timing relations between them. An example is a delay constraint that says "for every occurrence of event s, an occurrence of event r must appear within a certain time". In all, TADL2 specifies 17 constraints of these kinds. These constraints can be used to specify various requirements on the timing.
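Events and the quoted delay constraint are easy to illustrate over finite traces (my sketch; TADL2 itself is a specification language, not executable code):

```python
def delay_constraint(stimulus, response, upper):
    """Check the delay constraint 'for every occurrence of event s,
    an occurrence of event r must appear within `upper` time units'
    on two finite event traces, each an increasing sequence of
    occurrence times."""
    return all(any(s <= r <= s + upper for r in response)
               for s in stimulus)

s = [0, 10, 20, 30]                    # occurrences of the stimulus event
r = [2, 12, 29]                        # occurrences of the response event
print(delay_constraint(s, r, 5))       # False: no response in [20, 25] or [30, 35]
print(delay_constraint(s[:2], r, 5))   # True
```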
My first contribution was to define a formal semantics for TADL2, together with Johan Nordlander. The semantics is based on a logic that basically is a first-order logic with relational constraints on arithmetical expressions, extended with sets of times and some operations on these. Each TADL2 constraint is given a precise definition in this logic.
I then considered whether some combinations of TADL2 constraints could possibly be verified automatically. In the automotive domain it is not unusual to implement time-critical software through periodically triggered real-time tasks, which naturally gives rise to periodic events describing the timing of the actions of these tasks. A type of formula that then becomes interesting is "if events e, e', ... are periodic, will they then also fulfil some relational constraint R(e, e', ...)?" I came up with a transformation on the formulas specifying this implication, which replaces the event occurrences in the constraint R with arithmetic variables ranging over the time windows specified by the respective periodicity constraints. In many cases this transformation results in a Presburger formula. I then proved that the transformation is applicable and correct for a majority of the TADL2 constraints. As Presburger logic is decidable, this gives a method to automatically find out whether a set of periodic events satisfies other timing constraints. Interestingly, there are methods for solving Presburger formulas that are symbolic in the sense that they can return a formula with free variables, rather than just deciding whether a closed formula is true or false. This makes it possible to, for instance, leave certain parameters open in a TADL2 constraint and derive a formula expressing for which values of the parameters the constraint is satisfied. I made some experiments with the iscc Presburger calculator, which uses such a method, and they showed that the method really works, at least for moderately small examples.
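A hand-worked instance of the kind of formula the transformation produces (my own example, with assumed jitter parameters Js and Jr, offset O, and delay bound U): take two events with equal period P, where stimulus occurrence n lies in the window [nP, nP + Js] and response occurrence n in [nP + O, nP + O + Jr]. If the matching response for stimulus occurrence n is response occurrence n, eliminating the quantifiers over the windows leaves a condition that is independent of both n and P:

```python
def delay_holds_for_periodic(Js, O, Jr, U):
    """Decide the delay constraint for two periodic events with equal
    period, occurrence windows [n*P, n*P + Js] (stimulus) and
    [n*P + O, n*P + O + Jr] (response).  Quantifier elimination over
    the windows:
        for all choices in the windows:  s_n <= r_n <= s_n + U
        <=>  O >= Js  and  O + Jr <= U
    (worst case for s_n <= r_n pairs the latest stimulus n*P + Js
    with the earliest response n*P + O; worst case for the upper
    bound pairs the latest response with the earliest stimulus)."""
    return O >= Js and O + Jr <= U

print(delay_holds_for_periodic(1, 2, 1, 4))    # True
print(delay_holds_for_periodic(3, 2, 1, 10))   # False: response may precede stimulus
```

This kind of n-free, P-free linear condition is exactly what a symbolic Presburger solver can return when parameters are left open.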
Parallel software may suffer from race conditions. Such data races can cause bugs that are extremely hard to find and reproduce, since their appearance may depend on the timing in the system in complex ways. Analyses that can pinpoint possible data races and warn the developer are therefore of interest. A first step in this kind of analysis is a so-called "may happen in parallel" (MHP) analysis, which determines which activities in the system may possibly take place at the same time.
In cooperation with Ericsson and Evidente East, Abu Naser Masud (main driver) and I have developed an MHP analysis for an actor-based parallel task graph model that is used for writing parallel software for telecom applications. The analysis identifies tasks that may run in parallel and thus possibly give rise to race conditions. Experiments show that the analysis scales to industrial software with millions of lines of code.
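For a plain task precedence DAG, the core MHP question can be sketched as follows (my simplification; the actual analysis handles a richer actor-based task graph model): two tasks may happen in parallel iff neither must precede the other.

```python
def mhp_pairs(succ):
    """May-happen-in-parallel for a task precedence DAG: two tasks
    may run in parallel iff there is no directed path between them.
    succ maps each task to its successor tasks."""
    tasks = set(succ) | {t for ts in succ.values() for t in ts}
    reach = {t: set() for t in tasks}    # transitive successors

    def dfs(root, t):
        for u in succ.get(t, []):
            if u not in reach[root]:
                reach[root].add(u)
                dfs(root, u)

    for t in tasks:
        dfs(t, t)
    return {(a, b) for a in tasks for b in tasks
            if a < b and b not in reach[a] and a not in reach[b]}

# fork: t1 spawns t2 and t3, which both join before t4
succ = {"t1": ["t2", "t3"], "t2": ["t4"], "t3": ["t4"]}
print(mhp_pairs(succ))   # {('t2', 't3')}: only the two forked tasks can race
```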
PLEX is a programming language that was used to program the AXE telephone exchanges from Ericsson. PLEX is event-driven, where signals can spark off new jobs to execute, and its code and data are organized into a kind of primitive objects called blocks. PLEX is also sequential, with jobs executed non-preemptively, one at a time.
There is still a large aftermarket for the AXE exchange. Ericsson has therefore been interested in parallelizing the software in order to leverage modern parallel multi-core architectures. Indeed, the tasks carried out by PLEX programs are typically inherently very parallel, and one would therefore like to execute PLEX jobs in parallel. However, unconstrained parallel execution of PLEX jobs may trigger data races that will not appear in sequential execution. Ericsson therefore sponsored a PhD student, Johan Lindhult, to work with me on the topic of safe parallelization of PLEX programs.
At that time Ericsson worked with a prototype parallel central processor for AXE that used a conservative form of parallel job execution, basically treating blocks as critical sections, forbidding any intra-block parallel execution. This, however, gave insufficient parallelization, and we investigated how it could be improved. In order to have firm formal underpinnings, and to document the quite nonstandard PLEX language, we first developed two formal semantics for PLEX: one modeling the sequential version, and one modeling PLEX executing on the parallel prototype system. We then considered a simple memory access analysis that could detect cases where code executing in the same block still could be safely executed in parallel without data races. Johan performed a case study on some production code, documented in his Licentiate thesis, which showed that in some cases the analysis could help improve the parallelization significantly.
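The flavour of such a memory access analysis can be sketched with Bernstein's conditions (my sketch, not the actual PLEX analysis): two jobs in the same block can safely run in parallel if neither writes a variable the other reads or writes.

```python
def race_free(job1, job2):
    """Can two jobs run in parallel without data races? Each job is a
    (reads, writes) pair of variable sets; parallel execution is safe
    iff neither job writes anything the other reads or writes
    (Bernstein's conditions)."""
    r1, w1 = job1
    r2, w2 = job2
    return not (w1 & (r2 | w2)) and not (w2 & r1)

print(race_free(({"a"}, {"b"}), ({"a"}, {"c"})))  # True: both only read "a"
print(race_free(({"a"}, {"b"}), ({"b"}, {"c"})))  # False: job2 reads job1's write
```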
As a by-product of the formalization we discovered a case where the semantics for the parallel PLEX implementation differed from the sequential semantics, indicating a bug in the parallel implementation. This bug was later confirmed by the Ericsson engineers.