




Proposal: 11:59 PM, Wed, 5th Feb., via Canvas; kick-off meetings: Thu, 6th - Fri, 7th Feb.
Milestone 1: Report due 11:59 PM, Mon, 24th Feb., via Canvas; meetings: Fri, 28th Feb.
Milestone 2: Meetings: Fri, 27th Mar.
Final report: Wed, 22nd Apr. (hand over a hard copy at the poster session)
Poster session: Wed, 22nd Apr., 1:30 PM - 3:30 PM, Tishman Hall, BBB
In this project, you will innovate new designs or evaluate extensions to designs presented in class or at a recent architecture/systems conference. The project must be broadly related to multiprocessor systems. The purpose of this project is to conduct publication-quality research and to improve the state of the art. The project report will be in the form of a 6-page research article similar to those we have been discussing in class. Project groups will include 3-4 students. Groups will present their work in a final poster session.
The project will account for 25% of your final course grade. It will be graded out of 100 points.
You are encouraged to meet with me prior to submitting the proposal if you have any questions. When in doubt, meet with me. Please make an appointment or use the office hours. The proposal is a written two-page document including:
You will submit a 6-page final report in the format of conference papers like those we are reviewing (see ISLPED for an example of what 6-page 2-column papers typically look like). The report should include an abstract, an introduction, followed by a detailed description of the design, a methodology section, a results section, and a conclusion section. You may choose to write a related work section preceding the conclusions, or a background section following the introduction. Finally, you must include a brief statement of each team member’s specific contribution to the project. You will need all the relevant citations in a reference section at the end. (Other outlines for your report may be appropriate, but check with me to make sure you are doing something sensible first).
During the final exam week, we will hold a poster session, open to the campus community, in which teams will get to present their results. I will review each of your posters and pose questions related to your project to each team member.
Reports should be sufficiently polished so that they could be submitted to a high-quality workshop without further modification. I anticipate that the best projects will be submitted to computer systems conferences or workshops.
Significant security vulnerabilities have recently been exposed in modern hardware, such as cold boot attacks, Rowhammer, Spectre, and Meltdown. In addition to these attacks, a multiprocessor system can concurrently execute different applications that share hardware resources (e.g., shared caches, interconnect, memory, etc.). Such sharing can lead to side channels that compromise isolation. You can try to replicate known side channels, study new vulnerabilities, or design and evaluate defenses. You may want to think about secure GPU architectures, which are relatively under-explored.
A related but different problem is to guarantee privacy for computation executing on a third-party system (e.g., a public cloud) where you cannot trust even the system software (e.g., the cloud OS). This is important for application developers (banks, hospitals, defense, etc.) that would like to use the cloud but are wary of computing on sensitive private data there. Modern processors provide specialized hardware support in the form of trusted execution environments (e.g., Intel SGX). You can identify a parallel application that operates on sensitive private data (e.g., analyzing genome sequencing data) and use Intel SGX to off-load the sensitive parts of the computation to the secure hardware. Alternatively, you can study the challenges and solutions in designing secure hardware such as Intel SGX. A minimal cache-timing sketch follows the references below. References:
[1] Chapter 7, "Multiprocessor and Many-Core Protections," in Principles of Secure Processor Architecture Design. J. Szefer. Morgan & Claypool Synthesis Lectures.
[2] Spectre Attacks: Exploiting Speculative Execution. Kocher et al. IEEE S&P (Oakland) 2019.
[3] InvisiPage: Oblivious Demand Paging for Secure Enclaves. Aga and Narayanasamy. ISCA 2019.
[4] InvisiMem: Smart Memory Defenses for Memory Bus Side Channel. Aga and Narayanasamy. ISCA 2017.
[5] Foreshadow: Extracting the Keys to the Intel SGX Kingdom with Transient Out-of-Order Execution. Van Bulck et al. USENIX Security 2018.
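To make the side-channel idea concrete, the following is a minimal sketch of the timing primitive behind Flush+Reload-style cache attacks, written with x86 intrinsics. The probe buffer, the toy hit/miss comparison, and the printed cycle counts are illustrative assumptions only; a real attack would monitor memory shared with a victim and calibrate a threshold.

// Minimal sketch of the timing primitive behind Flush+Reload-style cache
// side channels (illustrative only; the probe buffer is an assumption).
#include <cstdint>
#include <cstdio>
#include <x86intrin.h>   // _mm_clflush, __rdtscp, _mm_mfence, _mm_lfence

static inline uint64_t timed_access(volatile uint8_t *p) {
    unsigned aux;
    _mm_mfence();
    uint64_t start = __rdtscp(&aux);
    (void)*p;                          // load the probed cache line
    uint64_t end = __rdtscp(&aux);
    _mm_lfence();
    return end - start;
}

int main() {
    alignas(64) static uint8_t probe[64];   // one cache line to monitor

    // Flush: evict the line from all cache levels.
    _mm_clflush((void *)probe);
    _mm_mfence();
    uint64_t miss_cycles = timed_access(probe);   // slow: served from DRAM

    // Reload: the line is now cached, so a second access is fast.
    uint64_t hit_cycles = timed_access(probe);

    // A victim touching `probe` between flush and reload would also make the
    // reload fast; that timing difference is the side channel.
    std::printf("miss: %llu cycles, hit: %llu cycles\n",
                (unsigned long long)miss_cycles, (unsigned long long)hit_cycles);
    return 0;
}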
Ultrasound beamforming is a highly parallel, memory-dominated workload. One of the key challenges in speeding up this workload on modern hardware is determining how to parallelize while maximizing data locality. For this project, you will be given a serial beamforming implementation and tasked with exploring multiple parallelization strategies such as multi-threading, SIMD, and GPU offload. Additionally, you will need to determine the best way to divide the algorithm's inherent parallelism, such as across transducer channels or across image scanlines, to maximize efficiency. You may also want to examine how your solution changes across hardware configurations, so that it is not tailored to a specific memory-system size or core count. http://ieeexplore.ieee.org/document/6522329/
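As a starting point, here is a minimal sketch of a delay-and-sum beamforming kernel parallelized across scanlines with OpenMP. The array names, dimensions, and the precomputed per-channel delay table are placeholder assumptions and do not reflect the serial implementation you will be given.

// Minimal sketch: delay-and-sum beamforming parallelized across scanlines
// with OpenMP. Array names, dimensions, and the delay model are placeholders.
#include <cstddef>
#include <vector>

void beamform(const std::vector<std::vector<float>> &rf,   // [channel][sample]
              const std::vector<std::vector<int>> &delay,  // [scanline][channel]
              std::vector<std::vector<float>> &image,      // [scanline][depth]
              std::size_t num_depths) {
    const std::size_t num_scanlines = delay.size();
    const std::size_t num_channels  = rf.size();

    // Scanlines are independent, so this outer loop is embarrassingly
    // parallel. Parallelizing across channels instead would require a
    // reduction into each output sample and changes the locality pattern.
    #pragma omp parallel for schedule(static)
    for (std::size_t s = 0; s < num_scanlines; ++s) {
        for (std::size_t d = 0; d < num_depths; ++d) {
            float sum = 0.0f;
            for (std::size_t c = 0; c < num_channels; ++c) {
                std::size_t idx = d + static_cast<std::size_t>(delay[s][c]);
                if (idx < rf[c].size())
                    sum += rf[c][idx];    // apply per-channel focusing delay
            }
            image[s][d] = sum;
        }
    }
}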
Memory access patterns offer important insights into program behavior. For example, one can infer which memory regions are hot and which are cold from such access patterns, and even correlate them with data structures to gauge which data structures are frequently accessed. However, profiling programs to obtain their memory access patterns is expensive. Tools like Intel's Pin are still too costly (usually ∼0.5-2× slowdown) to be employed in a production environment. The aim of this project is to build a sampling memory access profiler that outputs a suitable sample of memory accesses from which program properties can be inferred. The profiler should be able to run with very low overhead (ideally <1%). As a second step, you can try to design a scheme that automatically tunes the sampling frequency to stay under a given maximum performance overhead.
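For the second step, one plausible approach (an assumption, not a required design) is a feedback controller that widens or narrows the sampling period based on measured overhead. A minimal sketch:

// Minimal sketch of an overhead-budgeted sampling controller: the sampling
// period is adjusted so that measured profiling overhead stays under a
// target budget (e.g., 1%). The control policy and names are assumptions.
#include <chrono>
#include <cstdint>

class SamplingController {
public:
    SamplingController(double max_overhead, uint64_t initial_period)
        : max_overhead_(max_overhead), period_(initial_period) {}

    // Called on every memory-access hook; returns true when this access
    // should actually be recorded.
    bool should_sample() {
        return (++access_count_ % period_) == 0;
    }

    // Periodically fed with the time spent inside the profiler and the total
    // elapsed time; backs off or speeds up sampling to track the budget.
    void update(std::chrono::nanoseconds profiler_time,
                std::chrono::nanoseconds total_time) {
        double overhead =
            static_cast<double>(profiler_time.count()) / total_time.count();
        if (overhead > max_overhead_)
            period_ *= 2;                        // over budget: sample less often
        else if (overhead < 0.5 * max_overhead_ && period_ > 1)
            period_ /= 2;                        // budget is slack: sample more
    }

private:
    double max_overhead_;      // e.g., 0.01 for a 1% budget
    uint64_t period_;          // record every `period_`-th access
    uint64_t access_count_ = 0;
};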
Modern libraries provide efficient synchronization for different situations. For example, MCS locks provide maximum throughput for contended critical sections, while pthreads/Linux futexes ensure that threads never wait on mutexes held by a sleeping thread (rather, the sleeping thread is immediately scheduled so that the critical section makes forward progress). At the same time, MCS locks do not scale when the number of threads exceeds the number of CPUs, and pthread mutexes perform poorly under contention. The need to choose the correct synchronization implementation complicates program design. Recent work ("Decoupling Contention Management from Scheduling" by Johnson et al.) proposes a new lock primitive, based on preemptable MCS locks, that provides the best of both worlds: critical-section throughput is maximized for various numbers of threads, both with and without contention. Design new lock and synchronization primitives that perform well even with more threads than CPUs. Consider primitives besides mutexes (pthreads condition variables are particularly bad), and primitives that can be implemented entirely in user space (e.g., by leveraging pthreads inside your synchronization primitives).
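For reference, here is a minimal sketch of the classic MCS queue lock using C++11 atomics; it is a baseline to innovate over, not the preemptable variant proposed by Johnson et al.

// Minimal MCS queue-lock sketch using C++11 atomics. Each thread spins on
// its own node, so waiters do not thrash a shared cache line.
#include <atomic>

struct MCSNode {
    std::atomic<MCSNode *> next{nullptr};
    std::atomic<bool> locked{false};
};

class MCSLock {
public:
    void lock(MCSNode &me) {
        me.next.store(nullptr, std::memory_order_relaxed);
        me.locked.store(true, std::memory_order_relaxed);
        // Append ourselves to the tail of the waiter queue.
        MCSNode *prev = tail_.exchange(&me, std::memory_order_acq_rel);
        if (prev != nullptr) {
            prev->next.store(&me, std::memory_order_release);
            // Spin locally until our predecessor hands the lock off.
            while (me.locked.load(std::memory_order_acquire))
                ;
        }
    }

    void unlock(MCSNode &me) {
        MCSNode *succ = me.next.load(std::memory_order_acquire);
        if (succ == nullptr) {
            // No known successor: try to swing the tail back to empty.
            MCSNode *expected = &me;
            if (tail_.compare_exchange_strong(expected, nullptr,
                                              std::memory_order_acq_rel))
                return;
            // A successor is mid-enqueue; wait for it to link itself.
            while ((succ = me.next.load(std::memory_order_acquire)) == nullptr)
                ;
        }
        succ->locked.store(false, std::memory_order_release);   // hand off
    }

private:
    std::atomic<MCSNode *> tail_{nullptr};
};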
The demise of Moore's Law has led to renewed interest in accelerators to improve performance and reduce energy and cost. In order to address these issues for applications involving sparse matrix computations, we propose a custom accelerator, OuterSPACE, that consists of asynchronous Single Program, Multiple Data (SPMD)-style processing units with dynamically-reconfigurable non-coherent caches and crossbars. OuterSPACE is designed to work with the unconventional outer-product-based matrix multiplication approach, which involves multiplying the i-th column of the first matrix (A) with the i-th row of the second matrix (B), for all i. An important next step would be to extend this architecture to support dense matrix multiplication, stochastic gradient descent, graph-search algorithms, recurrent neural networks, long short-term memories, etc. https://drive.google.com/file/d/1-E72T4rk1qaHKb664PDIeZPdrn6Q-OBX/view?usp=sharing
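To make the computation pattern concrete, here is a minimal dense sketch of the outer-product formulation; the names and the dense representation are illustrative (OuterSPACE targets the sparse case, where each outer product touches only the nonzeros of A[:,i] and B[i,:]).

// Minimal sketch of the outer-product formulation of matrix multiplication:
// C = sum over i of (column i of A) x (row i of B). Each iteration of i
// performs one rank-1 update of the entire output matrix.
#include <cstddef>
#include <vector>

using Matrix = std::vector<std::vector<float>>;   // row-major [row][col]

Matrix outer_product_matmul(const Matrix &A, const Matrix &B) {
    const std::size_t n = A.size();        // rows of A (and C)
    const std::size_t k = B.size();        // inner dimension
    const std::size_t m = B[0].size();     // cols of B (and C)
    Matrix C(n, std::vector<float>(m, 0.0f));

    for (std::size_t i = 0; i < k; ++i)            // one outer product per i
        for (std::size_t r = 0; r < n; ++r)
            for (std::size_t c = 0; c < m; ++c)
                C[r][c] += A[r][i] * B[i][c];      // A[:,i] outer B[i,:]
    return C;
}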
In programming assignment 2, you will be verifying a standard MSI protocol using a formal verification tool called Murphi. The involvement of transient states in addition to the stable states makes the coherence state machine complicated. This directly translates to more rigorous verification effort to ensure correctness under all conditions. The need for high-performance and scalable machines has made cache coherence protocols much more complex than simple MSI. As faster, larger systems are
Recent years have seen the emergence of a new class of currencies, called cryptocurrencies. These currencies use cryptography to provide security and peer-to-peer networking to provide a decentralized system. Bitcoin is the most popular of these currencies. It uses a two-pass SHA-256 hash at its core. Producing new bitcoins is done through a process referred to as "mining", which involves a brute-force search for a nonce whose hash falls below a specific target value. This process requires large amounts of computing power. The task is to implement an FPGA accelerator for the hashing workload at the core of bitcoin mining. You may use simulation tools or Amazon EC2 F1 FPGA instances for this.
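As a software reference model, here is a minimal sketch of the proof-of-work loop using OpenSSL's SHA256. The 80-byte header contents, nonce placement, and leading-zero-byte difficulty check are simplifications; the real protocol compares the double hash, interpreted as a 256-bit little-endian integer, against the network target.

// Minimal software sketch of bitcoin-style proof of work: double SHA-256
// over a candidate block header while incrementing a nonce, until the digest
// has enough leading zero bytes. Header contents and difficulty are toys.
// Build with -lcrypto (OpenSSL).
#include <openssl/sha.h>
#include <cstdint>
#include <cstdio>
#include <cstring>

static bool meets_difficulty(const unsigned char digest[SHA256_DIGEST_LENGTH],
                             int zero_bytes) {
    for (int i = 0; i < zero_bytes; ++i)
        if (digest[i] != 0) return false;
    return true;
}

int main() {
    unsigned char header[80] = {0};      // placeholder 80-byte block header
    unsigned char d1[SHA256_DIGEST_LENGTH], d2[SHA256_DIGEST_LENGTH];

    for (uint32_t nonce = 0; ; ++nonce) {
        std::memcpy(header + 76, &nonce, sizeof(nonce));  // nonce field
        SHA256(header, sizeof(header), d1);               // first pass
        SHA256(d1, sizeof(d1), d2);                       // second pass
        if (meets_difficulty(d2, 2)) {                    // toy difficulty
            std::printf("found nonce %u\n", nonce);
            break;
        }
    }
    return 0;
}

An FPGA implementation would typically unroll and pipeline the SHA-256 rounds so that a new nonce can be evaluated every cycle.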
Many modern programs make frequent use of dynamic memory allocation. However, programs are increasingly multithreaded and parallel to take advantage of increasingly parallel processors. Unfortunately, this trend conflicts with the fact that most current programs have a single heap. Consequently, research into parallel memory allocators is topical and important. You are tasked with innovating over an existing malloc implementation and accelerating it on a CPU or an FPGA.
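One common direction (an assumption about approach, not a prescribed design) is to give each thread its own size-class free lists so that the fast path avoids any global lock. A minimal sketch:

// Minimal sketch of a thread-local free-list allocator: the common case pops
// from or pushes onto per-thread lists with no locking; large requests and
// refills fall back to the system heap. Names and size classes are illustrative.
#include <cstdlib>

namespace {

constexpr std::size_t kNumClasses = 8;                    // 16, 32, ..., 2048 bytes
constexpr std::size_t class_size(std::size_t c) { return std::size_t(16) << c; }

struct FreeBlock { FreeBlock *next; };

// One set of free lists per thread, so the fast path needs no synchronization.
thread_local FreeBlock *free_lists[kNumClasses] = {nullptr};

std::size_t size_class(std::size_t n) {
    for (std::size_t c = 0; c < kNumClasses; ++c)
        if (n <= class_size(c)) return c;
    return kNumClasses;                                   // too big: fall back
}

}  // namespace

void *my_malloc(std::size_t n) {
    std::size_t c = size_class(n);
    if (c == kNumClasses) return std::malloc(n);          // large objects: system heap
    if (FreeBlock *b = free_lists[c]) {                   // fast path: pop local list
        free_lists[c] = b->next;
        return b;
    }
    return std::malloc(class_size(c));                    // slow path: refill
}

void my_free(void *p, std::size_t n) {
    std::size_t c = size_class(n);
    if (c == kNumClasses) { std::free(p); return; }
    auto *b = static_cast<FreeBlock *>(p);                // push onto local list
    b->next = free_lists[c];
    free_lists[c] = b;
}

Note that this sketch requires the caller to pass the allocation size to my_free; a real allocator would recover the size class from per-block or per-page metadata, and would also return memory across threads and back to the OS.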
False sharing occurs when multiple threads write to independent variables that happen to reside on the same cache line, resulting in cache-line ping-ponging. This issue is regarded as one of the silent performance killers of multi-threaded applications and is expected to become increasingly important as future multi-core architectures employ large (e.g., 128-byte) cache lines. False sharing is difficult to detect statically because the memory layout of code and data depends on many decisions made by the memory allocator. Runtime profiling tools, e.g., Intel VTune, can detect false sharing, but they have high false-positive rates and performance overhead, and they provide little insight into how to repair it. Recently, hybrid techniques to detect and repair false sharing have been proposed [1], which can further be improved using machine learning techniques [2]. This project will investigate techniques to improve false sharing detection accuracy and coverage, as well as propose memory layout transformations for false sharing repair. A small padding/alignment example follows the references below.
[1] Huron: Hybrid False Sharing Detection and Repair. Khan et al. PLDI 2019.
[2] Detection of False Sharing Using Machine Learning. Jayasena et al. SC 2013.
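To illustrate both the problem and the standard repair, here is a minimal sketch in which two threads update adjacent counters; the padded variant places each counter on its own (assumed 64-byte) cache line. Timing the two runs, e.g., with std::chrono, is left to the reader.

// Minimal sketch of false sharing and the standard layout repair: two
// counters updated by different threads share one cache line unless each is
// aligned to the line size. Iteration count and layout are illustrative.
#include <atomic>
#include <cstdio>
#include <thread>

struct SharedCounters {                 // both fields land on one cache line
    std::atomic<long> a{0};
    std::atomic<long> b{0};
};

struct PaddedCounters {                 // repair: each field gets its own line
    alignas(64) std::atomic<long> a{0};
    alignas(64) std::atomic<long> b{0};
};

template <typename T>
void run(T &c, long iters) {
    std::thread t1([&] { for (long i = 0; i < iters; ++i) c.a.fetch_add(1); });
    std::thread t2([&] { for (long i = 0; i < iters; ++i) c.b.fetch_add(1); });
    t1.join();
    t2.join();
}

int main() {
    const long iters = 10000000;
    SharedCounters shared;              // expect cache-line ping-ponging
    PaddedCounters padded;              // expect near-independent scaling
    run(shared, iters);                 // time these two calls separately to
    run(padded, iters);                 // observe the false-sharing penalty
    std::printf("%ld %ld\n", shared.a.load() + shared.b.load(),
                             padded.a.load() + padded.b.load());
    return 0;
}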