EE587 Programming Massively Parallel Processors


The architecture of Graphics Processing Units (GPUs) has evolved over the years from fixed-function graphics pipelines to arrays of unified programmable processors. This evolution has made GPUs suitable for scientific computing. Equipped with hundreds or even thousands of cores, GPUs qualify as massively parallel processors and provide significant performance improvements for parallel applications compared to common single-core or multi-core processors (CPUs). In this course, students will develop a thorough understanding of the architecture of recent GPUs and will learn how to program these processors efficiently by exploiting data-level parallelism through high-level programming platforms such as CUDA. Topics covered include the history of GPUs, GPU architecture, principles of parallel programming, data-level parallelism, the memory hierarchy, performance considerations, numerical considerations, and parallel patterns such as map, reduction, scan, sort, histogram, and matrix operations. Students completing this course will have a thorough understanding of the GPU programming model and will be able to design efficient parallel algorithms on GPUs.
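As a taste of the programming model the course revolves around, the classic vector-addition example illustrates data-level parallelism in CUDA: one thread per output element. This is an illustrative sketch only, not official course material; the kernel name vecAdd and the launch parameters are arbitrary choices.

```cuda
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

// Each thread computes exactly one element of the output vector.
__global__ void vecAdd(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n)                                      // guard: grid may overshoot n
        c[i] = a[i] + b[i];
}

int main(void) {
    const int n = 1 << 20;
    const size_t bytes = n * sizeof(float);

    // Allocate and initialize host data.
    float *h_a = (float *)malloc(bytes);
    float *h_b = (float *)malloc(bytes);
    float *h_c = (float *)malloc(bytes);
    for (int i = 0; i < n; ++i) { h_a[i] = 1.0f; h_b[i] = 2.0f; }

    // Allocate device memory and copy the inputs over.
    float *d_a, *d_b, *d_c;
    cudaMalloc(&d_a, bytes); cudaMalloc(&d_b, bytes); cudaMalloc(&d_c, bytes);
    cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);

    // Launch enough 256-thread blocks to cover all n elements.
    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    vecAdd<<<blocks, threads>>>(d_a, d_b, d_c, n);

    // Copy the result back and spot-check one element.
    cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);
    printf("c[0] = %f\n", h_c[0]);

    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    free(h_a); free(h_b); free(h_c);
    return 0;
}
```

This is an instance of the map pattern mentioned above; the laboratories build from kernels of this shape toward patterns such as reduction and scan that require inter-thread cooperation.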

Course Goals

By taking this course, the students will:

  • Develop a thorough understanding of the GPU architecture and programming model;
  • Complete four hands-on laboratories that require significant programming in CUDA C;
  • Complete a course project involving the parallel implementation of a relevant algorithm on a GPU in order to achieve a shorter execution time; and
  • Review a state-of-the-art paper on GPU programming.


Deliverables

  • Four laboratory reports;
  • A project report;
  • A project presentation; and
  • A written or oral critique of a research paper on GPU programming.

Mandatory Textbooks

D. Kirk and W. Hwu, “Programming Massively Parallel Processors: A Hands-on Approach”, 3rd edition, Morgan Kaufmann, 2016, 576 pp.

Lesson Plan

The course will be organized in three components as follows:

Component 1 will consist of a series of online lectures supplemented by reading assignments, tutorials, and laboratory work. The instructional material comes from the CS193 course taught at Stanford University, which has been made freely available online for others to use in their curricula. The lectures can be downloaded through iTunes. The tutorials and lab instructions are available on the CS193 GitHub repository; the links are given below. The lab instructions are embedded in the starting code for each lab. The reading assignments are from the mandatory textbook (D. Kirk and W. Hwu, 2016); the readings supplement the online lectures and prepare students for the laboratory work.

Component 2 will consist of completing a course project using parallel programming on GPUs. The project can be completed individually or in teams of two and must include a significant hands-on component. Typical projects consist of parallelizing a known algorithm on a GPU and measuring the speedup achieved. Suggested topics include the implementation of parallel metaheuristics on GPUs. Students will be required to submit a project proposal midway through the term, submit a project report at the end of the term, and give an oral presentation during the last two weeks of the term.

Component 3 will consist of a review of a state-of-the-art paper on GPU programming. The review will be a single-spaced, one-page document that summarizes the content of the paper, critiques or comments on the work presented, and suggests future work on the topic.

Delivery Format

With the exception of the first week, when the instructor introduces the course, and the last week, when the students present their projects, there will be no formal lectures given by the instructor. All the material is available online and in the mandatory textbook. Students are expected to work autonomously and to submit their work on time throughout the semester. The instructor will identify a period each week when he is available to answer questions.

Lab reports

For each laboratory, a full lab report must be submitted. This report must include an introduction, a high-level description of the implementation, a self-explanatory description of the tests and results, a discussion, and a conclusion. The lab report must contain sufficient detail to convince the instructor that the work was done successfully without requiring inspection of the code.


Assessment

Marks will be weighted as follows:

  • Labs – 40%
  • Project report – 20%
  • Project presentation – 20%
  • Paper review – 20%


Schedule

The course will follow the schedule shown below. Component 1 will take place during weeks 1 to 7, while components 2 and 3 will cover weeks 8 to 13.

8-12 Jan: Introduction to Massively Parallel Computing; Introduction to CUDA
  Reading: Chap. 1-2
  Tasks: Download all lectures (slides in PDF, iTunes videos); connect to the CUDA server (instructions, videos, commands, and example code); complete the CUDA tutorial (read the entire tutorial and refer back to it as needed throughout the course)

15-19 Jan: CUDA Threads & Atomic Operations; CUDA Memories; Performance Considerations
  Reading: Chap. 3-6
  Lab 1

22-26 Jan: Parallel Patterns I; Parallel Patterns II
  Reading: Chap. 7-9

29 Jan - 2 Feb: Introduction to Thrust; Sparse Matrix-Vector Operations
  Reading: Chap. 10; NVIDIA CUB example
  Lab 2

5-9 Feb: Solving Partial Differential Equations with CUDA; The Fermi Architecture
  Reading: Chap. 13

12-16 Feb: NVIDIA OptiX: Ray Tracing on the GPU; Future of Throughput
  Reading: Chap. 17
  Lab 3

19-23 Feb: Reading Week

26 Feb - 2 Mar: Path Planning System on the GPU; Optimizing Parallel GPU Performance; Parallel Sorting
  Reading: Chap. 11-12
  Project proposal due
  Lab 4

5-9 Mar: Project work

12-16 Mar: Project work
  All lab reports due

19-23 Mar: Project work

26-30 Mar: Project work

2-6 Apr: Project presentations
  Paper critique due (1 page)
  Project report due

9-13 Apr: Exam week (no final exam)