Fig. 1: a) Simplified single-clock design. b) Applying the CSR technique.
Fig. 1a shows the basic structure of a sequential circuit with its combinatorial logic (CL) and original design registers (DR). Inputs and outputs are not shown for simplicity. The sequential circuit handles one thread, T(1). Fig. 1b shows the CSR technique. The original logic is sliced into C (here C = 3) sections, and each original path now has C-1 additional registers. This results in C functionally independent design copies T(1..3) which use the logic in a time-sliced fashion.
Each thread has its own thread index. For each design copy it now takes C micro-cycles to achieve the same result as one cycle of the original design (macro-cycle). The implemented register sets are called "CSR Registers" (CRs). They are placed at different C-levels (CR0, CR1, ...).
Sequence (1) shows how the complete design states (threads) traverse through the logic each micro-cycle. There is no interaction between the threads, and each thread uses the complete design in a time-sliced fashion.
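The following Python sketch models this time slicing for C = 3, assuming a toy design whose macro-cycle step computes next_state = 2*x + 1; the slice functions, register names and initial values are illustrative and not taken from the original design.

```python
# Minimal sketch of CSR time slicing, assuming C = 3 and a toy design whose
# combinational step computes next = 2*x + 1 (illustrative assumption).

C = 3

# The original combinational logic, sliced into C sections; composed in order
# they compute one macro-cycle step: slice2(slice1(slice0(x))) = 2*x + 1.
def slice0(x): return 2 * x      # first part of the original path
def slice1(x): return x + 1      # second part
def slice2(x): return x          # third part (pass-through in this toy example)

slices = [slice0, slice1, slice2]

# CR levels CR0..CR2 hold the per-thread intermediate values; CR2 plays the
# role of the original DR. Threads T(1..3) occupy one level each.
cr = [4, 7, 30]

for micro_cycle in range(2 * C):                     # run two macro-cycles
    # Each micro-cycle every value advances through exactly one logic slice
    # and moves to the next CR level; the threads never collide.
    cr = [slices[k](cr[(k - 1) % C]) for k in range(C)]
    print(f"micro-cycle {micro_cycle + 1}: CR = {cr}")

# The value that started in CR2 (here 30) returns to CR2 after C micro-cycles
# as 2*30 + 1 = 61, i.e. one macro-cycle of the original design has completed.
```

After C micro-cycles each value has passed through all three slices once and is back at its original register level, which corresponds to one macro-cycle for that thread.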
Fig. 2: a) SHP-ed design with thread controller (TC) and memory (Mem). b) Improved SHP-ed design.
The DRs are now replaced by memories (Mem). The incoming threads are stored at the relevant address (write pointer) based on the thread index. D is the number of threads which the memory can hold (memory depth). The outgoing thread can now be freely selected from the D available threads (read pointer), except for threads already passing through the design logic. Equation (2) shows that an SHP-ed design can run any thread (T <= D) in any possible order. The same thread must not be executed more than once at the same time in this initial version.
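A minimal Python sketch of this thread memory and pointer handling is shown below, assuming D = 8 stored threads and a C = 3 stage logic pipeline; the ThreadMem class, the pick_next selection policy and the "+1" toy logic are illustrative assumptions.

```python
# Minimal sketch of the SHP thread memory with write/read pointers,
# assuming D = 8 threads and a C = 3 stage logic pipeline.

D, C = 8, 3

class ThreadMem:
    """Replaces the DRs: holds the state of up to D threads, addressed by thread index."""
    def __init__(self, depth):
        self.state = [0] * depth
    def write(self, thread_idx, value):   # write pointer = index of the returning thread
        self.state[thread_idx] = value
    def read(self, thread_idx):           # read pointer = freely selectable thread
        return self.state[thread_idx]

def pick_next(candidates, in_flight):
    """Any stored thread may be issued, except those already passing through the logic."""
    for t in candidates:
        if t not in in_flight:
            return t
    return None                           # no eligible thread: insert a bubble

mem = ThreadMem(D)
pipeline = [None] * C                     # (thread_idx, value) slots inside the sliced logic

for cycle in range(10):
    in_flight = {slot[0] for slot in pipeline if slot is not None}
    t_next = pick_next(range(D), in_flight)              # read pointer selection
    issued = (t_next, mem.read(t_next)) if t_next is not None else None
    retiring = pipeline[-1]                               # thread leaving the logic
    pipeline = [issued] + pipeline[:-1]                   # advance one micro-cycle
    if retiring is not None:
        t_done, value = retiring
        mem.write(t_done, value + 1)                      # toy "logic": state := state + 1
    print(f"cycle {cycle}: issued {t_next}, retired {retiring[0] if retiring else None}")
```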
A CSR-ed design usually has many shift registers: each DR is followed by a series of CSR registers. In the SHP-ed version, many memory data outputs are connected directly to CRs. In this case, the shift register chain at the outputs can be replaced by a shift register chain at the read address inputs of the memories. Fig. 2b shows this improved SHP version. The memory is sliced into individual sections (Mem0, Mem1, Mem2, ...), and each section reads the thread with its own delay. The outputs can now be connected directly to the relevant combinatorial logic, and the shift registers can be removed. Additionally, triple read port memories can be used to further reduce the CR count.
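The sketch below illustrates this read-side optimization, assuming three memory sections with toy contents; only the narrow read address is delayed, so each section's output is already aligned with the CR level that consumes it. All names and values are illustrative assumptions.

```python
# Minimal sketch of the improved read path of Fig. 2b: the data-side shift
# registers are replaced by a delay line on the read address.

C, D = 3, 8

# Toy per-section contents: entry for thread t in section s is 10*t + s.
mem_slices = [[10 * t + s for t in range(D)] for s in range(C)]

addr_delay = [0] * C                      # shift register on the read address only

for cycle, t in enumerate([3, 5, 1, 6]):  # thread indices selected by the read pointer
    addr_delay = [t] + addr_delay[:-1]    # Mem_k sees the address k micro-cycles late
    # Each section is read with its own delayed address copy, so its output is
    # already aligned with the CR level that consumes it.
    outs = [mem_slices[k][addr_delay[k]] for k in range(C)]
    print(f"micro-cycle {cycle}: read T{t}, section outputs = {outs}")
```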
The same method can be applied at the inputs of the memories to further reduce the register count. CRs which are directly connected to the data inputs of the Mem can be merged into the Mem. This can be achieved by splitting the individual sections (Mem0, Mem1, Mem2, ...) again into individual subsections (Mem0.0, Mem0.1, ...) which are now controlled by an early write address.
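One possible reading of this write-side optimization is sketched below, assuming a single section split into two subsections: data that is available early is written immediately using a less delayed (early) copy of the write address, so the CRs in front of the data inputs disappear. The subsection names, the two-way split and the toy data are assumptions.

```python
# Minimal sketch of merging data-input CRs into the memory by using subsections
# (Mem0.0, Mem0.1) controlled by differently delayed copies of the write address.

D = 8
mem0_0 = [0] * D             # subsection written with the early write address
mem0_1 = [0] * D             # subsection written one micro-cycle later

wr_addr = [0, 0]             # write-address delay line: [early copy, delayed copy]

for cycle, t in enumerate([2, 4, 7]):     # index of the thread entering the logic
    wr_addr = [t] + wr_addr[:-1]
    # Data produced at an early logic level belongs to the thread that just
    # entered and is written at once with the early address copy; data produced
    # one level later uses the delayed copy. The CRs that previously delayed
    # the data in front of the memory are no longer needed.
    mem0_0[wr_addr[0]] = cycle            # toy data from the early level
    mem0_1[wr_addr[1]] = 100 + cycle      # toy data from the later level

print(mem0_0)
print(mem0_1)
```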
Adding CRs into each path so that the logic is used in a time-sliced fashion also implies that C registers are added to each feedback loop, which results in the high shift register count of a CSR-ed design. In the original design, feedback loops with multiplexers are sometimes replaced by a register write enable signal; this cannot be done in a CSR-ed design. The replacement of DRs with Mems in the SHP-ed version allows the use of a write enable signal, and the feedback loop becomes obsolete.
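The sketch below contrasts the two write-back styles, assuming a per-thread write enable on the thread memory; the writeback helper and the toy values are illustrative assumptions.

```python
# Minimal sketch: a per-thread write enable on the thread memory removes the
# need for a feedback multiplexer with C registers in the loop.

D = 8
mem = [0] * D                                  # thread state memory (replaces the DRs)

def writeback(thread_idx, new_value, write_en):
    """SHP-ed version: when write_en is low the memory simply keeps the old value
    for this thread, so no feedback multiplexer and no C loop registers are needed."""
    if write_en:
        mem[thread_idx] = new_value

writeback(3, new_value=42, write_en=True)      # thread 3 updates its state
writeback(5, new_value=99, write_en=False)     # thread 5 holds its state
print(mem[3], mem[5])                          # -> 42 0
```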
Load balancing
Fig. 3: Histogram of different scenarios running CSR and/or SHP.
Fig. 3 shows the advantages of CSR and SHP compared to the original design. The x-axis of the histogram shows different scenarios/solutions, the y-axis the system performance.
Assume a thread (T0) on a single CPU runs at, e.g., 60 MHz on an FPGA (Fig. 3a). It can be seen how CSR improves the system performance of the original system implementation (Fig. 3b). When using CSR, the system performance is not necessarily limited by the critical path of the original design but, for instance, by the switching limit of the FPGA (e.g. 250 MHz) or by external memory accesses. All threads run at the same relative speed (fixed).
For executing multiple programs on multiple CPUs (symmetric multiprocessing), SHP allows a more efficient usage of the system resources (Fig. 3b to 3e). It adds the possibility to distribute the system performance over a minimum set of threads (C, Fig. 3b) or a maximum set of threads (D, Fig. 3c), and any solution in between can be realized. Fig. 3d shows a random example. This load balancing is handled by a thread controller (TC) and can be modified dynamically at runtime. Threads can be inserted, stalled and killed on a cycle-by-cycle basis, and a flexible priority scheme takes care of individual load balancing (Fig. 3d).
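A minimal Python sketch of such a thread controller is shown below; the credit-based priority policy, the class names and the example priorities are illustrative assumptions rather than the actual TC implementation.

```python
# Minimal sketch of a thread controller (TC) doing priority-based load balancing;
# insert/stall/kill can change the active thread set in any cycle.

from dataclasses import dataclass

@dataclass
class Thread:
    idx: int
    priority: int = 1      # relative share of micro-cycles
    credit: int = 0        # accumulated priority, reset when the thread is issued
    stalled: bool = False

class ThreadController:
    def __init__(self):
        self.threads = {}

    # Threads can be inserted, stalled and killed on a cycle-by-cycle basis.
    def insert(self, idx, priority=1):
        self.threads[idx] = Thread(idx, priority)

    def stall(self, idx, flag=True):
        self.threads[idx].stalled = flag

    def kill(self, idx):
        self.threads.pop(idx, None)

    def schedule(self, in_flight):
        """Issue the runnable thread with the highest accumulated credit that is
        not already inside the logic; one way to realize the priority scheme."""
        runnable = [t for t in self.threads.values()
                    if not t.stalled and t.idx not in in_flight]
        for t in runnable:
            t.credit += t.priority
        if not runnable:
            return None                    # bubble: no thread issued this micro-cycle
        best = max(runnable, key=lambda t: t.credit)
        best.credit = 0
        return best.idx

tc = ThreadController()
tc.insert(0, priority=3)                   # T0 gets roughly 3x the micro-cycles of T1
tc.insert(1, priority=1)
print([tc.schedule(in_flight=set()) for _ in range(8)])   # [0, 0, 0, 1, 0, 0, 0, 1]
```

In this sketch a thread's priority determines how often it is issued relative to the others, while insert, stall and kill can reshape the active thread set from one cycle to the next.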
Acceleration techniques enable the speed-up of at least one thread beyond the speed of the thread running on the original design (Fig. 3e).
Video introduction to SHP
My presentation on SHP at the 4th RISC-V workshop, MIT, Boston, 2016:
Advantages
There are several advantages to using SHP over standard approaches: an increase in the performance per area factor, a performance increase for a single thread, and system level performance improvements, especially in the multi-core domain.
Latest work on system level improvements: virtual peripherals
System level performance improvement is possible by dynamically varying the number of active threads. This enables a much more flexible multithreading approach, which can be used for running multiple virtual peripherals:
T. Strauch, "Connecting Things to the IoT by Using Virtual Peripherals
on a Dynamically Multithreaded Cortex M3'', IEEE Trans. on Circuits and
Systems I: Regular Papers, vol. 64, issue 9, Sep. 2017, pp. 2462 -
2469. http://ieeexplore.ieee.org/document/7935353/
The paper is recommended by the Associate Editor, A. Sangiovanni
Vincentelli.
Initial work: performance per area increase
This paper discusses the increase of the classical performance per area factor when SHP is used:
T. Strauch, "The Effects of System Hyper Pipelining on Three Computational Benchmarks Using FPGAs", 11th International Symposium on Applied Reconfigurable Computing, ARC 2015, 13-17 April 2015, Bochum, Germany, pp. 1-12.
Acceleration techniques
One benefit of SHP is that performance can be balanced more flexibly among individual threads. This paper shows how individual threads can be accelerated even further (Fig. 3e):
T. Strauch, "Acceleration Techniques for System-Hyper-Pipelined
Soft-Processors on FPGAs", IEEE Euromicro DSD 2017, 30th Aug.
-
1st Sep., Vienna, Austria, pp. 129-138. http://ieeexplore.ieee.org/document/8049775/
Performance per area and CGRA
SHP can be used on coarse-grained reconfigurable arrays (CGRA) as well:
T. Strauch, "Using System Hyper Pipelining (SHP) to Improve the Performance of a Coarse-Grained Reconfigurable Architecture (CGRA) Mapped on an FPGA", 2nd International Workshop on FPGAs for Software Programmers, FSP 2015, 1st September 2015, London, UK, pp. 1-6.
System level improvements: more to come
I'm currently working on further system level improvements that can be achieved when using SHP.