This is the home page of the CPC research group of Tampere University. The group's name in Finnish is Räätälöity rinnakkaislaskenta. CPC's main research focus is on design and programming methodologies for customized parallel computing platforms and real-time implementations of challenging algorithms.
In addition to the publications and theses listed here as academic contributions, CPC has made major open source contributions in the field of portable and customized heterogeneous computing. The group has created OpenASIP and Portable Computing Language (PoCL), which are widely used as research platforms and even in products. CPC also created the prototype HIPCL tool, which evolved into chipStar, a portable CUDA/HIP implementation using open standards.
Real-time ray tracing is an algorithm domain with extreme computational demands that CPC has been very interested in over the past years. In 2015, a separate focus group was formed to find algorithmic, parallel/heterogeneous implementation and custom hardware solutions for its challenges. The group's web pages are here.
Barry de Bruin from the Technical University of Eindhoven defended his doctoral thesis on energy-efficient coarse-grained reconfigurable arrays (CGRA). In the thesis, titled "Design of Energy-Efficient CGRA-based Systems", Barry leveraged OpenASIP's flexible framework and used its retargetable compiler backend to compile C programs for the designed CGRAs. This allowed more flexibility and ease of programming compared to other CGRA implementations. As a doctoral student, Barry also visited and worked as a member of the CPC group. The group's leader Pekka Jääskeläinen was a copromotor (co-supervisor) of the thesis. Congratulations Barry!
The CPC group published two papers at this year's IEEE Nordic Circuits and Systems Conference (NorCAS). Kari, the first author of both papers, participated in the conference in Lund and gave a presentation on each topic. The first paper builds on the recent interest in using AI-based methods for processor design space exploration. In this field, methods to evaluate design points quickly are key for fast exploration. The paper, done in collaboration with the Robot Learning team at Aalto University, describes a machine learning based method to estimate cycle counts in application-specific, static multi-issue architectures. The paper is titled "Cycle Count Estimation of VLIW Processors Using Machine Learning".
The second paper, "Fully Automatic Compiler Retargeting and CV-X-IF Hardware Interface Generation for RISC-V Custom Instructions", concerns CPC's efforts in the TRISTAN project. Since developing, verifying, and possibly certifying processor IPs is time-consuming and expensive, there are ongoing efforts in the RISC-V community to specify and implement standardized coprocessor/accelerator interfaces to existing processors. Once the interface is in place, the processor IP can be instantiated with different coprocessor/accelerator IPs. In this work, we leveraged OpenASIP's hardware generation capabilities to automatically generate CV-X-IF-based coprocessors. Operations of the coprocessor can be defined with OpenASIP's processor designer (ProDE). In this work, we also describe the improvements to the RISC-V support in OpenASIP. Operations from C code can now automatically be mapped to (suitable) custom operations in the coprocessor.
Instruction compression has been used in a variety of ways to mitigate the overheads of programmability in processors. We proposed a programmable instruction dictionary compression with the goal of improving dynamic compression ratio and energy-efficiency, and compared our approach to "traditional" instruction stream components. The article titled "Energy-efficient instruction compression with programmable dictionaries" was published in Springer Design Automation for Embedded Systems (DAES).
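To give a feel for the general idea of dictionary-based instruction compression, here is a toy Python sketch. It is not the scheme from the article: the one-bit hit/miss flag, the function names and the example instruction words are all assumptions made for illustration. Frequently executed instruction words are stored in a small dictionary and replaced in the instruction stream by short indices, and the ratio is computed over an execution trace, which roughly corresponds to a dynamic compression ratio.

```python
from collections import Counter

def build_dictionary(trace, size):
    """Pick the `size` most frequently executed instruction words (hypothetical helper)."""
    counts = Counter(trace)
    return [word for word, _ in counts.most_common(size)]

def compressed_bits(trace, dictionary, word_bits=32):
    """Bits fetched when dictionary hits are replaced by short indices.

    Assumed encoding: one flag bit, followed by either a dictionary
    index (hit) or the full instruction word (miss).
    """
    index_bits = max(1, (len(dictionary) - 1).bit_length())
    entries = set(dictionary)
    total = 0
    for word in trace:
        total += 1 + (index_bits if word in entries else word_bits)
    return total

# A made-up execution trace of 32-bit instruction words; the loop body repeats.
trace = [0x00A00513, 0x00150513, 0x00150513, 0xFE051EE3,
         0x00150513, 0xFE051EE3, 0x00008067]

dictionary = build_dictionary(trace, size=2)
ratio = compressed_bits(trace, dictionary) / (len(trace) * 32)
print(f"dynamic compression ratio: {ratio:.2f}")
```

In this toy model, making the dictionary programmable would simply mean calling build_dictionary again with a different trace when the workload changes.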
Implementing applications efficiently on FPGAs requires knowledge not only of the algorithms used in the application, but also of RTL description and FPGA EDA tools. In order to separate the tasks of the SW designer from those of the HW designer, Topi Leppänen proposes to use pre-generated bitstream databases together with partial FPGA reconfiguration. The SW designer can implement an application by picking from kernels in the database and is not required to have expertise in RTL or FPGA design. Our proposed tool, AFOCL, handles downloading the bitstreams and reconfiguring the FPGA automatically. The article "Bitstream Database-Driven FPGA Programming Flow Based on Standard OpenCL" was published in IEEE Transactions on Very Large Scale Integration (VLSI) Systems. The code is released as open source and is available here.
Following a successful tapeout of the Headsail SoC, Beaivi DSP is up and running after initial testing! Read the detailed news item here.
A delegation of three CPC members (Pekka, Kari and Joonas) participated in the RISC-V Summit Europe 2024 in Munich. The hero of the pack was Kari, who delivered both a poster and an excellent talk about OpenASIP's RISC-V support. Check the slides here.
The 3rd year demonstrator of the AISA project was presented last Friday live at Paidia in Tampere, Finland. The demonstrator features adaptable AI compute offloading from a nanodrone to remote servers. The Crazyflie nanodrone offloads an object detection algorithm to a remote server via PoCL-R and adapts to the network quality by adjusting the compression rate of the images sent to the server on the fly. You can watch the demonstrator videos on YouTube.
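The adaptation described above can be pictured as a simple feedback loop. The Python sketch below is an illustration only and does not reflect the actual demonstrator or the PoCL-R API: the function name, the latency target, the step size and the quality bounds are all hypothetical. It lowers an image quality setting when the measured round-trip time exceeds a target and raises it again when there is clear headroom.

```python
def adapt_quality(quality, rtt_ms, target_ms=100.0, step=5, lo=10, hi=95):
    """Return the next image quality setting given the last round-trip time.

    All parameters are illustrative assumptions, not values from the demonstrator.
    """
    if rtt_ms > target_ms:
        quality -= step      # link is struggling: compress harder
    elif rtt_ms < 0.5 * target_ms:
        quality += step      # plenty of headroom: send better images
    return max(lo, min(hi, quality))

quality = 80
history = []
for rtt in [40, 45, 150, 180, 160, 90, 30]:   # made-up round-trip times in ms
    quality = adapt_quality(quality, rtt)
    history.append(quality)

print(history)
```

The dead band between half the target and the target keeps the setting from oscillating when the latency hovers near the target.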
When offloading computer vision (CV) computation from a small device, such as a drone, to a remote server, a stream of images needs to be sent over a wireless network channel. Traditional entropy-coded bitstreams, such as JPEG, transmitted via a digital channel are prone to a so-called “digital cliff”: a sudden drop in the reconstructed image quality due to data corruption caused by channel noise and lost packets. To circumvent the digital cliff, Linear Coding and Transmission (LCT) schemes were pioneered by SoftCast in 2010; in these schemes, the reconstructed image quality degrades smoothly as channel impairments increase. So far, however, the impact of LCT and channel impairments on CV accuracy has been studied only minimally. Jakub Žádník recently presented the paper “Performance of Linear Coding and Transmission in Low-Latency Computer Vision Offloading” at the WCNC 2024 conference in Dubai (UAE), in which he studies the impact of LCT processing, wireless channel noise and packet losses on the accuracy of semantic segmentation and object detection tasks. The absence of the digital cliff in the task accuracy was confirmed via a thorough evaluation over a wide range of LCT configurations. The findings were further strengthened by a realistic 5G channel simulation and by retraining the CV tasks to account for the distortions caused by LCT and a noisy channel.
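The smooth degradation can be illustrated with a heavily simplified Python sketch. It is not SoftCast and not the setup evaluated in the paper: there is no transform, power allocation or packet loss, and the function names, pixel values and noise levels are assumptions. Pixel values are sent directly as scaled amplitudes over an additive noise channel, so noise perturbs each value slightly instead of corrupting a bitstream, and the reconstruction error grows gradually with the noise level.

```python
import random

def transmit_linear(pixels, noise_std, gain=1.0, seed=0):
    """Send values as scaled amplitudes over an additive Gaussian noise channel."""
    rng = random.Random(seed)   # fixed seed: same noise pattern at every noise level
    received = [gain * p + rng.gauss(0.0, noise_std) for p in pixels]
    return [r / gain for r in received]   # receiver simply undoes the scaling

def mse(a, b):
    """Mean squared error between two equally long sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

pixels = [12.0, 40.0, 200.0, 128.0, 64.0, 90.0, 255.0, 0.0]   # made-up pixel values
noise_levels = (0.0, 1.0, 2.0, 4.0)
errors = [mse(pixels, transmit_linear(pixels, s)) for s in noise_levels]

for s, e in zip(noise_levels, errors):
    print(f"noise std {s}: MSE {e:.3f}")
```

In this model the error scales with the square of the noise standard deviation, with no threshold beyond which the reconstruction suddenly collapses.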
The OpenCL pipe is a memory object used for passing data between kernels. It is useful in streaming-style applications, where data is forwarded from one task to another. Since the pipe can be implemented in multiple ways, and OpenCL is intended as a programming model for heterogeneous platforms, the performance of pipe implementations can vary widely. The PhD thesis work of Topi Leppänen has resulted in insights on how the pipe specification could be improved, especially in the context of FPGAs. These findings, along with suggestions for the OpenCL specification, were presented at IWOCL 2024 by Topi. Read the publication here.
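The streaming pattern that pipes enable can be modeled in a few lines of Python. This is only a conceptual model, not OpenCL code: the class, its methods and the two "kernel" functions are hypothetical, and the only property borrowed from OpenCL is that a pipe read or write reports success with 0 and failure with a negative value.

```python
from collections import deque

class PipeModel:
    """Bounded FIFO that mimics non-blocking pipe reads and writes (illustrative only)."""
    def __init__(self, max_packets):
        self.max_packets = max_packets
        self.packets = deque()

    def write(self, packet):
        if len(self.packets) >= self.max_packets:
            return -1                      # pipe full, caller must retry later
        self.packets.append(packet)
        return 0

    def read(self):
        if not self.packets:
            return -1, None                # pipe empty
        return 0, self.packets.popleft()

def producer(pipe, data):
    """First 'kernel': doubles each item and pushes it; returns items that did not fit."""
    pending = deque(data)
    while pending and pipe.write(pending[0] * 2) == 0:
        pending.popleft()
    return list(pending)

def consumer(pipe):
    """Second 'kernel': drains the pipe and adds one to each packet."""
    out = []
    while True:
        status, packet = pipe.read()
        if status != 0:
            return out
        out.append(packet + 1)

pipe = PipeModel(max_packets=4)
leftover = producer(pipe, [1, 2, 3, 4, 5, 6])
result = consumer(pipe)
print(result, leftover)
```

How the FIFO is realized (for example, in on-chip memory or as a buffer in global memory) is exactly the kind of implementation freedom that makes pipe performance vary between platforms.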
The modern computing landscape includes a variety of platforms. In addition to general-purpose devices, specialized processors are used to increase efficiency in various application domains and use cases. The OpenCL standard presents a unified way to program these heterogeneous devices, and the CPC group's PoCL is a vendor-independent, open-source implementation of the standard. In his MSc thesis "Adding fault tolerance to OpenCL" (2023) Robin Bijl added a mechanism to achieve robust computation with PoCL. This allows fault tolerance and reliable computing even in the context of heterogeneous platforms. Read the thesis here.
The Internet of Things (IoT) consists of an enormous number of devices whose sizes vary from large to extremely tiny. While it may be desirable to have complex functionalities in even the tiniest devices, this is often not feasible simply due to the lack of available resources. However, offloading the computation to a (nearby) server or a larger device enables sharing of the resources and allows even small devices to seemingly perform demanding computations. In his MSc thesis "Offloading Computation with a Minimized OpenCL Runtime from a Nano Drone" (2022), Jyry Uitto created a proof-of-concept implementation of a nano drone that can offload OpenCL kernel execution onto an edge server. Read the thesis here.
Static multi-issue processors exploit instruction level parallelism efficiently thanks to the lack of dynamic hardware that schedules instructions during run time. However, their instruction stream energy consumption is significantly higher than that of their dynamic multi- or single-issue counterparts. Processor designers must choose between the benefits of static multi-issue capabilities and higher code density, but is it too much to ask for both? In our latest article, we introduce an energy-efficient dual-mode (RISC-V single-issue and an exposed datapath VLIW) architecture for leveraging instruction level parallelism statically when available in the program, without suffering from VLIW’s poor code density when there’s a lack of it. The flexibility of the architecture is utilized by a novel compilation method that can generate code for both instruction sets with fine-grained mode switching. Read more in the article.
Our Dutch colleague Maarten Molendijk from TU Eindhoven presented a co-authored paper "BrainTTA: A 28.6 TOPS/W Compiler Programmable Transport-Triggered NN SoC" at IEEE ICCD 2023. The publication was a result of successful collaboration between our CPC group and PARSE/TUE, in which a programmable TTA/SIMD-based accelerator was designed for ultra-low-power AI inference in low-precision use cases. The design was done using the OpenASIP tools, with the design work conducted by Molendijk et al. Read more about it in the preprint. The presentation slides are available here.
Our doctoral researcher Topi Leppänen presented the paper "AFOCL: Portable OpenCL Programming of FPGAs via Automated Built-in Kernel Management" at NorCAS 2023. AFOCL allows FPGA device users to avoid vendor lock-in and separates the roles of software and FPGA engineers. Behind the curtain, the OpenCL implementation automatically selects IPs from a precompiled bitstream database and handles FPGA reconfiguration. Details in the paper.
Check out the video below of the final demonstrator for the CPSoSAware EU project. The work was a collaboration with the University of Peloponnese. The demonstrator features a nanodrone, which offloads processing to edge resources wirelessly using PoCL-R.
Social Media
Follow the CPC group on Twitter/X: https://twitter.com/CustomParComp