EECS 570: Parallel Computer Architecture

Project Handout

24 January, 2020

1 Important Dates

  • Proposal: 11:59 PM, Wed, 5th Feb., via Canvas; kick-off meetings: Thu, 6th - Fri, 7th Feb.
  • Milestone 1: Report due: 11:59 PM, Mon, 24th Feb., via Canvas; meetings: Fri, 28th Feb.
  • Milestone 2: Meetings: Fri, 27th Mar.
  • Final report: Wed, 22nd Apr. (hand over hard copy at the poster session)
  • Poster session: Wed, 22nd Apr., 1:30 PM - 3:30 PM, Tishman Hall, BBB

2 Introduction

In this project, you will innovate new designs or evaluate extensions to designs presented in class or a recent architecture/systems conference. The project must be broadly related to multiprocessor systems. The purpose of this project is to conduct and generate publication-quality research and to improve the state of the art. The project report will be in the form of a 6-page research article similar to those we have been discussing in the class. Project groups will include 3-4 students. Groups will present their work in a final poster session.

The project will account for 25% of your final course grade. It will be graded out of 100 points.

  • Proposal and milestone reports (5 points)
  • Final report (75 points)
    • Problem definition and motivation (15 points)
    • Survey of previous related work (15 points)
    • Description of design (15 points)
    • Experimentation methodology (15 points)
    • Analysis of results (15 points)
    • Statement of contribution of each team member
  • Poster presentation (20 points)

3 Proposal

You are encouraged to meet with me prior to submitting the proposal if you have any questions. When in doubt, meet with me. Please make an appointment or use the office hours. The proposal is a written two-page document including:

  1. A problem definition and motivation.
  2. A brief survey of related work with at least four papers you have found and read directly related to the topic (ask me or the GSI for pointers or post on Piazza). Conferences that cover multiprocessor system research include ISCA, ASPLOS, MICRO, HPCA, PPoPP, PACT, SC, ISPASS, SOSP, OSDI, SIGMETRICS. You can search the online pages and/or contact me for pointers. You will also find the World Wide Computer Architecture web page www.cs.wisc.edu/~arch/www a good source of information.
  3. A detailed description of your experimental setup including the infrastructure you will use and any significant development effort you must invest to build new infrastructure.
  4. Project milestones and schedule. What do you expect to see by each milestone? Where do you plan to go based on your observations? You can draw a flow chart to clarify.
  5. Division of labor. How is each group member going to contribute to the project?

4 Milestone Reports

  1. (Milestone 1) You will hand in a write-up of up to two pages describing your progress and initial results. These results form the basis for the final outcome of your research. Based on these results, explain your plans for the second milestone. If there are any changes to plans, you should bring them up in this report.
  2. (Milestone 2) Milestone 2 does not require a written report. However, you must prepare graphs or figures of your experimental results to date and discuss them with me in the status meeting. You should paste each graph/chart into a PowerPoint slide with 1-2 bullets summarizing the main conclusion of the experiment. These graphs should form the basis of your final report and poster.
  3. For each milestone, you will make an appointment to meet with me to go over the project status. The appointments are made by filling out an appointment sheet in class prior to the meetings.

5 Final Report

You will submit a 6-page final report in the format of conference papers like those we are reviewing (see ISLPED for an example of what 6-page 2-column papers typically look like). The report should include an abstract, an introduction, followed by a detailed description of the design, a methodology section, a results section, and a conclusion section. You may choose to write a related work section preceding the conclusions, or a background section following the introduction. Finally, you must include a brief statement of each team member’s specific contribution to the project. You will need all the relevant citations in a reference section at the end. (Other outlines for your report may be appropriate, but check with me to make sure you are doing something sensible first).

6 Posters

During the final exam week, we will hold a poster session, open to the campus community, in which teams will get to present their results. I will review each of your posters and pose questions related to your project to each team member.

7 Best Projects

Reports should be sufficiently polished so that they could be submitted to a high-quality workshop without further modification. I anticipate that the best projects will be submitted to computer system

[2] Darwin: A Genomics Co-processor. Turakhia et al. ASPLOS 2018.

9.3 Secure Parallel Architectures

Significant security vulnerabilities have recently been exposed in modern hardware, such as cold boot attacks, Rowhammer, Spectre, and Meltdown. In addition to these attacks, a multiprocessor system can concurrently execute different applications, which share hardware resources (e.g., shared caches, interconnect, memory, etc.). Such sharing can lead to side-channels that compromise isolation. You can try to replicate known side-channels, study new vulnerabilities, or design and evaluate defenses. You may want to think about secure GPU architectures, which are relatively under-explored.

A related but different problem is to guarantee privacy for computation executing on a third-party system (e.g., public cloud) where you cannot trust even the system software (e.g., cloud OS). This is important for application developers that would like to use the cloud (banks, hospitals, defense, etc.) but are wary of computing on sensitive private data there. Modern processors provide specialized hardware support in the form of trusted hardware execution environments (e.g., Intel SGX). You can identify a parallel application that operates on sensitive private data (e.g., analyzing genome sequencing data) and use Intel SGX to off-load sensitive parts of the computation to the secure hardware. Alternatively, you can study the challenges and solutions in designing secure hardware such as Intel SGX.

References:

[1] Chapter 7 on “Multiprocessor and Many-Core Protections” in “Principles of Secure Processor Architecture Design”. J. Szefer. Morgan and Claypool Synthesis Lecture.

[2] Spectre Attacks: Exploiting Speculative Execution. Kocher et al. Oakland 2019.

[3] InvisiPage: Oblivious Demand Paging for Secure Enclaves. Aga and Narayanasamy. ISCA 2019.

[4] InvisiMem: Smart Memory Defenses for Memory Bus Side Channel. Aga and Narayanasamy. ISCA 2017.

[5] Foreshadow: Extracting the Keys to the Intel SGX Kingdom with Transient Out-of-Order Execution. Van Bulck et al. Usenix Security 2018.

9.4 Optimizing Parallelization of 3-D Ultrasound Beamforming

Ultrasound beamforming is a highly parallel, memory-dominated workload. One of the key challenges in speeding up this workload for modern hardware is determining how to parallelize while maximizing data locality. For this project you will be given a serial beamforming implementation with the task of exploring multiple parallelization strategies such as multi-threading, SIMD, GPU, etc. Additionally, you will need to determine the best method of dividing the inherent parallelism in the algorithm, such as across transducer channels or across image scanlines, to maximize efficiency. You may also want to examine how your solution changes across hardware configurations to develop a more robust solution that is not tailored to a specific memory system size or core count.

http://ieeexplore.ieee.org/document/6522329/
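The two ways of dividing the work mentioned above can be sketched as follows. This is only an illustration: it assumes a simple delay-and-sum formulation, which the handout does not specify. The names `rf`, `delays`, and the three functions are hypothetical, and the provided serial implementation may be organized quite differently. Python threads are used here only to show the decomposition; in standard CPython they will not speed up CPU-bound loops.

```python
from concurrent.futures import ThreadPoolExecutor

# Assumed data layout (hypothetical):
#   rf[ch][t]            sample t recorded by transducer channel ch
#   delays[line][ch][d]  sample index channel ch contributes to depth d of a scanline


def beamform_scanline(rf, delays, line):
    """Delay-and-sum for one scanline: sum the delayed sample of every channel."""
    depths = len(delays[line][0])
    out = [0.0] * depths
    for ch, samples in enumerate(rf):
        for d in range(depths):
            out[d] += samples[delays[line][ch][d]]
    return out


def beamform_by_scanline(rf, delays, workers=4):
    """Each task owns a whole scanline, so no reduction step is needed."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda line: beamform_scanline(rf, delays, line),
                             range(len(delays))))


def beamform_by_channel(rf, delays, workers=4):
    """Each task owns one channel and produces a partial image; the partial
    images are summed afterwards (a reduction across channels)."""
    n_lines, depths = len(delays), len(delays[0][0])

    def partial(ch):
        return [[rf[ch][delays[line][ch][d]] for d in range(depths)]
                for line in range(n_lines)]

    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(partial, range(len(rf))))
    return [[sum(p[line][d] for p in partials) for d in range(depths)]
            for line in range(n_lines)]
```

Splitting by scanline keeps each output private to one worker but makes every worker read all channels. Splitting by channel keeps each worker's input local but adds a reduction. Which wins is likely to depend on the memory system, which is the trade-off this project asks you to explore.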

9.5 Low-Overhead Memory Access Profiling

Memory access patterns offer important insights into program behavior. For example, one can infer memory regions that are hot and cold from such access patterns, which can then even be correlated to data structures to gauge which data structures are frequently accessed. However, profiling programs to obtain their memory access patterns is expensive. Tools like Intel’s Pin are still too expensive (usually ∼0.5-2× slower) for them to be employed in a production environment. The aim of this project is to build a sampling memory access profiler that outputs a suitable sample of memory accesses from which program properties can be inferred. The profiler should be able to run with very low overhead (ideally <1%). As a second step, you can try to design a scheme that automatically tunes the sampling frequency to be under a given maximum performance overhead.

  • Generating Miss Rate Curves with Low Overhead Using Existing Hardware https://pdfs.semanticscholar.org/3987/7cd79f45036d507ec2ad221c8306c2c0611b.pdf
  • Computer Performance Microscopy with SHIM https://www.microsoft.com/en-us/research/publication/computer-performance-microscopy-with-shim/
  • Advanced Hardware Profiling and Sampling (PEBS, IBS, etc.): Creating a New PAPI Sampling Interface http://web.eece.maine.edu/~vweaver/projects/perf_events/sampling/pebs_ibs_sampling.pdf
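The second step of this project, automatically tuning the sampling frequency to stay under an overhead budget, could start from a feedback loop like the one sketched below. Everything here is an assumption made for illustration: the function names are hypothetical, the controller assumes overhead is roughly proportional to the sampling rate, and `simulate` is a toy cost model rather than a measurement of any real profiler.

```python
import math


def tune_sampling_period(period, measured_overhead, max_overhead,
                         min_period=1_000, max_period=100_000_000):
    """Return a new sampling period (memory accesses between two samples).

    Assumption: overhead scales with the sampling rate (1 / period), so
    multiplying the period by measured/budget moves the overhead to the budget.
    """
    if measured_overhead <= 0.0:
        # No measurable cost yet: sample twice as often.
        new_period = period // 2
    else:
        # Round up so the resulting overhead lands at or below the budget.
        new_period = math.ceil(period * measured_overhead / max_overhead)
    return max(min_period, min(max_period, new_period))


def simulate(cost_per_sample, work_per_access, max_overhead, steps=20):
    """Toy model: overhead = cost_per_sample / (period * work_per_access)."""
    period = 10_000
    overhead = 0.0
    for _ in range(steps):
        overhead = cost_per_sample / (period * work_per_access)
        period = tune_sampling_period(period, overhead, max_overhead)
    return period, overhead
```

For example, `tune_sampling_period(10_000, 0.20, 0.01)` lengthens the period because a measured 20% overhead is far above a 1% budget. A real scheme would also need to handle noisy overhead measurements, which this sketch ignores.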

9.6 Designing Scalable Synchronization Primitives for Manycore Systems

Modern libraries provide efficient synchronization for different situations. For example, MCS locks provide maximum throughput to contended critical sections, while pthreads/Linux futexes ensure that threads never wait on mutexes held by a sleeping thread (rather, the sleeping thread is immediately scheduled so the critical section makes forward progress). At the same time, MCS locks do not scale once the number of threads exceeds the number of CPUs, and pthreads performs poorly under contention. The need to choose the correct synchronization implementation complicates program design. Recent work (“Decoupling Contention Management from Scheduling” by Johnson) proposes a new lock primitive, based on preemptable MCS locks, that provides the best of both worlds: critical section throughput is maximized for various numbers of threads both with and without contention.

Design new lock and synchronization primitives that perform well even with more threads than CPUs. Consider primitives besides mutexes (pthreads condition variables are particularly bad), and primitives that can be implemented entirely in user-space (e.g., by leveraging pthreads inside your synchronization primitives).

9.7 Reconfigurable Many-Core Architecture for Specialized Workloads

The demise of Moore’s Law has led to renewed interest in accelerators to improve performance and reduce energy and cost. In order to address these issues for applications involving sparse matrix computations, we propose a custom accelerator, OuterSPACE, that consists of asynchronous Single Program, Multiple Data (SPMD)-style processing units with dynamically-reconfigurable non-coherent caches and crossbars. OuterSPACE is designed to work with the unconventional outer product based matrix multiplication approach, which involves multiplying the i-th column of the first matrix (A) with the i-th row of the second matrix (B), for all i. An important next step would be to extend this architecture to support dense matrix multiplication, stochastic gradient descent, graph-search algorithms, recurrent neural networks, long short-term memories, etc.

https://drive.google.com/file/d/1-E72T4rk1qaHKb664PDIeZPdrn6Q-OBX/view?usp=sharing
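The outer product formulation described above can be written down in a few lines. The sketch below is a plain software illustration of the arithmetic only, using dense lists of lists for readability. The function name is hypothetical, and it does not model how OuterSPACE stores sparse matrices or schedules work on its processing units.

```python
def outer_product_matmul(A, B):
    """Multiply A (n x k) by B (k x m) as a sum of k outer products.

    For each i, column i of A is multiplied with row i of B, and the
    resulting n x m outer product is accumulated into C.
    """
    n, k, m = len(A), len(B), len(B[0])
    C = [[0.0] * m for _ in range(n)]
    for i in range(k):
        # Keep only the nonzero entries of column i of A and row i of B.
        # With sparse inputs, this is where work is saved.
        col = [(r, A[r][i]) for r in range(n) if A[r][i] != 0]
        row = [(c, B[i][c]) for c in range(m) if B[i][c] != 0]
        for r, a in col:
            for c, b in row:
                C[r][c] += a * b
    return C
```

Unlike the usual inner product formulation, each nonzero of A and B is read once per outer product and never multiplied against a zero, but the partial products for one output element are produced at different times and must be merged.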

9.8 Formal Verification of CPU or GPU Coherence Protocols

In programming assignment 2, you will be verifying a standard MSI protocol using a formal verification tool called Murphi. The involvement of transient states in addition to the stable states makes the coherence state machine complicated. This directly translates to more rigorous verification effort to ensure correctness under all conditions. The need for high-performance and scalable machines has made cache coherence protocols much more complex than simple MSI. As faster, larger systems are

9.14 Hardware Acceleration of Blockchain Applications

Recent years have seen the emergence of a new class of currencies, called cryptocurrencies. These currencies use cryptography to provide security and peer-to-peer networking to provide a decentralized system. Bitcoin is the most popular of these currencies. It uses a two-pass SHA-256 hash at its core. Producing new bitcoins is done through a process referred to as “mining”, which involves a brute-force search for a hash with a specific value. This process requires large amounts of computing power. The task is to implement an accelerator using FPGAs to accelerate the underlying blockchain application used to mine bitcoins. You may use simulation tools or Amazon EC2 F1 FPGA instances for this.
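As a software reference point for an accelerator, the brute-force search can be sketched as below. It is a toy: real Bitcoin mining hashes an 80-byte block header with specific fields, whereas this sketch appends a 4-byte nonce to an arbitrary byte string. It also assumes the usual formulation in which a hash "with a specific value" means a hash numerically below a target. The function names and the header contents are hypothetical.

```python
import hashlib


def double_sha256(data: bytes) -> bytes:
    """Two passes of SHA-256: hash the data, then hash the 32-byte digest."""
    return hashlib.sha256(hashlib.sha256(data).digest()).digest()


def mine(header: bytes, target: int, max_nonce: int = 1_000_000):
    """Try nonces in order until the double hash, read as a little-endian
    256-bit integer, falls below the target. Returns (nonce, digest) or None."""
    for nonce in range(max_nonce):
        candidate = header + nonce.to_bytes(4, "little")
        digest = double_sha256(candidate)
        if int.from_bytes(digest, "little") < target:
            return nonce, digest
    return None
```

Each nonce is independent of the others, which is why the search maps well onto many parallel hash pipelines. A lower `target` makes the search exponentially longer.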

9.15 Characterization on Contemporary Architectures

  • Part 1: Characterize Workloads. New applications and programming paradigms continue to emerge rapidly as the diversity and performance of computers increase. Whether they are in smartphones and deeply embedded systems at the low end or in massively parallel systems at the high end, the design of future computing machines can be significantly improved if we understand the characteristics of the workloads that are expected to run on them. In this part of the project, you are expected to select one or more applications and profile them on contemporary parallel architectures such as CPUs and GPUs.
  • Part 2: Measure and Analyze the Power/Energy Consumption of a Mobile Device. Limiting power and energy consumption is a critical issue in computing, particularly in portable and mobile platforms such as laptop computers and cell phones. Energy drain is observed to increase unevenly when mobile devices run applications that require intensive computation. For this part, you are to profile and analyze power/energy consumption patterns for varying applications run on a device with parallel processors.

9.16 Investigate Parallel Implementations of malloc

Many modern programs frequently use dynamic memory allocation. However, modern programs are increasingly multithreaded and parallel to take advantage of increasingly parallel processors. Unfortunately, this trend conflicts with the fact that there is a single heap in most current programs. Consequently, research into parallel memory allocators is topical and important. You are tasked with innovating over an existing malloc implementation and accelerating it on a CPU or an FPGA.

9.17 Efficient false sharing detection and repair

False sharing occurs when multiple threads write to independent variables belonging to the same cache line, resulting in cache-line ping-ponging. This issue is regarded as one of the silent performance killers of multi-threaded applications and is expected to become increasingly important with future multi-core architectures employing large (e.g., 128 byte) cache lines. False sharing is difficult to detect statically as the memory layout for code and data depends on many decisions made by the memory allocator. Runtime profiling tools, e.g. Intel VTune, can detect false sharing, but they have a high false-positive rate and performance overhead, and provide little insight into how to repair false sharing. Recently, hybrid techniques to detect and repair false sharing have been proposed, which can further be improved using machine learning techniques. This project will investigate techniques to improve false sharing detection accuracy and coverage, as well as propose memory layout transformations for false sharing repair.

[1] Tanvir Ahmed Khan, et al. “Huron: hybrid false sharing detection and repair.” Proceedings of the 40th ACM SIGPLAN Conference on Programming Language Design and Implementation. 2019.

[2] Sanath Jayasena, Saman Amarasinghe, Asanka Abeyweera, Gayashan Amarasinghe, Himeshi De Silva, Sunimal Rathnayake, Xiaoqiao Meng, and Yanbin Liu. 2013. Detection of false sharing using machine learning. In Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis. ACM, 30.
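To make the definition of false sharing and the idea of a memory layout transformation concrete, the sketch below checks a layout for fields that are written by different threads yet share a cache line, then repairs it by padding. It is a deliberately naive illustration and not the method of either cited paper. The 64-byte line size, the `(name, offset, size, writer_thread)` tuple format, and both function names are assumptions.

```python
CACHE_LINE = 64  # bytes; an assumption (the text mentions lines of up to 128 bytes)


def find_false_sharing(fields, line_size=CACHE_LINE):
    """fields: list of (name, offset, size, writer_thread) tuples.

    Returns pairs of field names that are written by different threads
    but occupy at least one common cache line.
    """
    def lines(offset, size):
        first = offset // line_size
        last = (offset + size - 1) // line_size
        return set(range(first, last + 1))

    pairs = []
    for i, (n1, o1, s1, t1) in enumerate(fields):
        for n2, o2, s2, t2 in fields[i + 1:]:
            if t1 != t2 and lines(o1, s1) & lines(o2, s2):
                pairs.append((n1, n2))
    return pairs


def pad_layout(fields, line_size=CACHE_LINE):
    """Naive repair: give every field its own cache-line-aligned slot."""
    padded, offset = [], 0
    for name, _, size, thread in fields:
        padded.append((name, offset, size, thread))
        # Advance to the next line boundary past this field.
        offset += -(-size // line_size) * line_size
    return padded
```

For a layout such as `[("a", 0, 8, 0), ("b", 8, 8, 1)]`, the check reports the pair `("a", "b")`, and the padded layout moves `b` to offset 64. Padding every field wastes memory and cache capacity, so a practical repair would group fields by writer thread instead of isolating each one.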