Yosi Ben Asher


I am a researcher in:

  • compilers,
  • highlevel synthesis (HLS),
  • parallel programming.
I am motivated by two observations:
  • I believe compilers are basically tools to express tradeoffs between scheduling and resources (similar to the time-space pebble-games studied in theoretical computer science). Thus the target architecture is of secondary importance, the compilation of a given program should be done by obtaining an optimized point of the (schedulin_time X hardware_configuration) space. I believe compilers should be implemented in two phases: 1) a source level tradeoff-explorer targeting a virtual tunable hardware architecture (HA); 2) followed by a separate "normal" compiler that schedules and optimizes the code generated by the tradeoff-explorer targeting some real CPU or a circuit. I am still far from reaching this goal, though I have developed few systems to explore such tradeoffs in HLS and few systems that handle source level compilation (e.g., a system that merges independent programs in source level). In particular I need to develop a retargeble compiler whose code generation stage generates assembly instructions on-the-fly thus determining the best HA suitable for the source code it is compiling. Note that this direction differs than Application Specific Instruction Set Processors that have a configurable instruction set since we actually intend to scan all possible HAs. Though retargeble compilers exists, the propose compiler will really explore all possible hardware configurations which is differ than allowing the user to modify the HA changing few parameters. Another attempt I have made in this direction is to show that many of the optimization paths of a compiler can be expressed as pebble-games, hoping to find a unified algorithmic specification for compilers.
  • The ability to compute via re-configurations (directing signals through different edges in a graph) has always fascinated me. A reconfigurable unit basically computes a boolean function in one parallel step: set the re-direction of incoming signals at each node and then propagate a signal through the graph. It took sometime for me to understand that unlike boolean circuits reconfigurable units are not compositional, i.e., the outputs of a reconfigurable unit can not be connected to the input of another unit without violating the one-parallel-step assumption. I believe that a compiler targeting reconfigurable units can overcome this problem. I would like to study HLS systems that target branching programs instead of boolean circuits. I believe that (using novel switching technologies) by doing so we can overcome the ever-growing delays involved with the wires and MUXes that connect the boolean gates in regular circuits. The trick is that via a HLS compiler it will be possible to synthesize such large size branching programs bypassing the need to explicitly design their structure. This goal is closer to be realized I have some breakthroughs in developing necessary compilation techniques.

  • Email: yosi@cs.haifa.ac.il
  • Phone: 972-4-8240338

  • I am giving consulting services to industrial companies in the area of compiler optimizations and highlevel synthesis.



    These are my current research efforts that are in an advanced stage and will be added to SystemRacer. These are interesting technologies that are part of the quest to make SystemRacer very powerfull HLS system actually targeting complete unrestricted C programs.


    My current research projects include:

    Compilation techniques from C to Verilog targeting FPGAs:

    Compilation techniques from C to branching programs

    as opposed to generating circuits that are based on boolean gates

    HLS of multi-threaded code

    is an attempt to minimize the use of shared memories in the circuits resulting from HLS of multi-threaded code. In here the focus is on the tradeoff between optimized scheduling and minimal use of shared memory modules.

    Efficient compilation technique for Multicore machines.

    ParC is a relatively old parallel dialect of C that was developed originally by Myself and My advisor Prof. Marc Snir many years ago. At a first glance it looks similar to OpenMP but this is somewhat deceiving ParC has several delicate features that makes it (to my taste) different and more efficient than OpenMP. In this effort I am mainly considering the following aspects:


    Current Compiler Systems

    The HTINY system for source level optimizations and source level HLS

    SystemRacer an LLVM based HLS compiler targeting variable pipeline operations and multithreaded code

    ParC compiler+simulator for parallel programming of multicore machines


    Papers, Courses and Past Projects

     

     

    The Htiny System

    The HTINY system (Haifa's Tiny) is an extension of Wolfe's Auto parallelizing system. The main goal is to transform it to a full source-level compiler that targets both regular CPUs and synthesized circuits. By a source-level compiler we mean a compiler that does not starts by applying code-generation and whose intermediate representation is the abstract-syntax-tree (AST) rather than RTL instructions. We belive that source-level compilation is the key technology to obtain better compilers that can be used to explore different tradeoffs involved with code/circuit generation. The current system contains the following parts added to the original Tiny:

    HTINY accepts inputs in a Fortran style but can be applied to C-code using a conversion tool. Pointers are not supported and function calls are supported by the loop transformations part but not by the two synthesis parts.. Htiny contains a back-end to x86 machine code.

    Some screen-shots of the current system with Xilinx's ISE synthesis tool:

    Publications:

    back to the main level

    SystemRacer

    Traditionally, scheduling algorithms for software instructions concentrated on problems such as optimal register usage, branch elimination and instruction level parallelism. However, in the context of hardware synthesis, scheduling algorithms optimize entirely different set of constraints. In hardware synthesis, design frequency, size and power consumption are affected by parameters such as reuse of modules, wire length, longest gate chain, etc. Usually, in high level hardware synthesis, all functional units of the same type have a fixed known ``length'' (number of stages) and the scheduler mainly determines when each unit is activated. We focus on scheduling techniques for the high-level synthesis of pipelined functional units where the number of stages of these operations is a free parameter of the synthesis. This problem is motivated by the ability to create pipelined functional units, such as multipliers, with different pipe lengths. These units have different characteristics in terms of parallelism level, frequency, latency, etc. This work presents the variable pipeline scheduler (VPS). The ability to synthesize variable pipelined units expands the known scheduling problem of high-level synthesis to include a 2D search for a minimal number of hardware units (operations) and their desired number of stages. The proposed search procedure is based on algorithms that find a local minima in a $d$-dimensional grid, thus avoiding the need to evaluate all possible points in the space. We have implemented a C language compiler for VPS. Our results demonstrate that using variable pipeline units can reduce the overall resource usage and improve the execution time.

    The system we implemented is based on the LLVM compiler toolkit and is called SystemRacer. The system is comprised of two parts. First, a compiler front-end parses $C$ into the internal compiler representation. Secondly, a synthesis backend creates the hardware description language. In our case, Verilog. The synthesis unit contains the search algorithm which finds the best resources for the given high-level code using the above LS algorithm. We have extended the LLVM optimizations passes in order to expose parallelism. For example, loop-code which is executed on a CPU, may calculate the address of an array before reading from it. In hardware, we are able to calculate this address for the next iteration, in parallel to the current read operation, since they are not dependent. Another example is a pass which takes an arithmetic expression and balances commutative expressions so that more operations can be executed in parallel.

    Each type of the LLVM bytecode has a resource reservation table allowing the scheduler to place a given instruction in the 2D table. The reservation tables are generated dynamically based on the current hardware configuration (HC) point that is evaluated. For example, if the current HC contains $pm=4$ (pipeline length of multiplication), then the reservation table of $multiplier_i$ will be $RT_i[mull\_in,0]='X'$ and $RT[mull\_out,4]='X'$ and all other entries in $RT_i[,]$ are empty. The search starts by determining the range of the eight dimensional optimized hardware configuration (OHC) space including the number of arithmetic operations ($m$), memory ports ($r$), and delay factors ($mdf$). For example some point < m=5,pm=1,s=5,ps=1,d=5,pd=1,r=6,mdf=1 > may form the ``lower'' point in the range as it represents maximal resources and minimal delay factors imposing no restrictions on the list scheduler used by the system. The opposit ``upper'' point in the range is < m=1,pm=16,s=1,ps=10,d=1,pd=16,r=1,mdf=5 > is the most desired point as it minimizes the resources and maximizes the delay factors. The LS is initially applied to determine an actual upper bound to the maximal number of resources that can be used for each basic block. This is done by running the list scheduler with infinite resources and recording how many resources were actually used. This helps in reducing the search space for some of the parameters. A random search procedure is activated to find a local minima testing up to $10^4$ OHC points.

    A web-site of the current version of SystemRacer (Towards a commercialize version):

    Current state:

    In conclusion currently SystemRacer supports:

  • Synthesis of functions including loops and general control-statements.
  • Automatic search for optimized hardware configurations (OHCs) forming good compromises between conflicting requirements of loops. An OHC includes number of adders, multipliers, memory delay factors and so forth.
  • Using arithmetic operations with variable number of pipeline stages.
  • Full set of loop transformations (currently works as a separate tool).
  • Very powerful modulo scheduling capabilities (currently working as a separate tool).
  • Simple re-use phase that save resources and increase execution times.
  • back to the main level

    ParC

    ParC is a parallel dialect of C similar to OpenMP. Following the emergence of multicore machines I have started to work on developing efficient compiler for ParC programs targeting multicore machines. The main focus of this project is to develop efficient compilation techniques to overcome the large overhead caused by the busRD transactions on the bus of the multicore machine every-time a cache miss occurs. Simple experiments shows that the overhead of using the MESI/MOESI protocols to simulate shared memory can be really high. A simple experiment with two threads updating a common shared array showed that the parallel version is slower by 40% than the sequential execution (see results ). Though ParC is similar to OpenMP it has some unique features that makes it attractive for multicore machines:

    Current status of the system is as follows:

  • We are starting a new project called the "ParC chip". This chip will include tousands of small cpu cores (like Xilinx's microblaze) connected by a powerfull shared memory module SMM. We are close to complete the shared memory module (actualy a lab project) which will work at 500Ghz requireing however about 2*log N clock cycles where N is the number of parallel ports used by the SMM. This SMM is fully pipelinable. Hence inorder to exploit this pipelined acesses each core must executes (2*log N)^2 threads so that a pipeline with 2*log N stages can be used with a speedup of log N. For N=1024 we need each core to have 100-400 threads, to obtain suffcient pipeline utilization. Overall we need to have 102400 ParC threads active and running. Thus the ParC chip calls for really highly parallel applications. I am not sure about the rest of the architecture but plan to go ahead and handel problems as they come (e.g., how do we distribute a program to 1024 cores). This project is a bit too fantastic (designing a complete Kilocore system using BSc projects) so I suspect something fondumental will not work but it is interesting to try.

    back to the main level