CDA 6938 Multi-core/Many-core Architecture and Programming Homework

ST: CDA 6938 Multi-core/many-core architectures and programming

Assignments

Homework #0 (No need to turn it in)

a. Write a multithreaded program using the Brook+ streaming programming model and run it in the emulator to print “Hello world!” multiple times.

b. Write a multithreaded program using the CUDA programming model and run it in the emulator to print “Hello world!” multiple times.

Tips:

Brook+:

Download link: http://ati.amd.com/technology/streamcomputing/sdkdwnld.html

You need to download the ATI Stream SDK to install Brook+ and CAL. After that, set the BRT_RUNTIME environment variable for your project: set it to cpu to run in emulation mode, or to cal if you have an ATI card. To test the installation, use Visual Studio to open a project in C:\Program Files\AMD\AMD Brook+ 1.2.1_beta\samples\tests\.

Note that you may need additional DLLs for CAL to run properly. You may either install the drivers from AMD’s website or download the following zip file, unzip it, and copy the DLL files to your C:\windows\system32 directory.

CUDA:

Download link: http://www.nvidia.com/object/cuda_get.html

You need to download (1) the CUDA driver (not necessary for emulation mode), (2) the CUDA toolkit (required), and (3) the CUDA SDK (required).

After installation, you may use Visual Studio to open a project in "C:\Program Files\NVIDIA Corporation\NVIDIA CUDA SDK\projects\deviceQuery\". Select emurelease mode to build the project and run.

The latest versions of CAL and Brook+ as well as CUDA have been set up on the lab machines. If you choose to use them, the username is CDA6938 and the password is hec242.

On the desktop, there is a folder "ATI_BROOK". Copy the "template" folder inside it to your own folder to start your work. Another folder, "NVIDIA_CUDA", contains a template for your CUDA projects.

Homework #1 2-D Convolution in Brook+

Send your source code, along with a brief explanation and the performance results in a text file, to zhou@eecs.ucf.edu.

a. Write a 2D convolution function using the streaming computing model and Brook+. Debug your program using the CPU-based emulator. (Due date: 2/05/09)

Brief explanation of 2D convolution (or image convolution): given a matrix a[M, N] and a matrix h[J, K], the convolution is defined as follows:

For simplicity, you may assume out-of-bound elements of a (e.g., a[-1,-1]) are zero.
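A CPU reference implementation is useful for checking emulator and GPU results. The sketch below is not part of the assignment handout. It assumes the standard definition c[m, n] = sum over j in [0, J) and k in [0, K) of h[j, k] * a[m - j, n - k], which matches the a[-1,-1] example above but should be checked against the definition given in class. The function name conv2d_ref, the row-major layout, and the choice to give c the same M x N shape as a are all assumptions made for illustration.

```c
#include <assert.h>

/* Reference 2D convolution (assumed definition):
 *   c[m][n] = sum_{j=0..J-1} sum_{k=0..K-1} h[j][k] * a[m-j][n-k]
 * Elements of a outside [0,M) x [0,N) are treated as zero.
 * All matrices are row-major; c is M x N (an assumption, same shape as a). */
void conv2d_ref(const float *a, int M, int N,
                const float *h, int J, int K,
                float *c)
{
    assert(a && h && c);
    assert(M > 0 && N > 0 && J > 0 && K > 0);

    for (int m = 0; m < M; ++m) {
        for (int n = 0; n < N; ++n) {
            float sum = 0.0f;
            for (int j = 0; j < J; ++j) {
                int row = m - j;
                if (row < 0)
                    break;      /* row only decreases, so the rest is zero too */
                for (int k = 0; k < K; ++k) {
                    int col = n - k;
                    if (col < 0)
                        break;  /* same argument for columns */
                    sum += h[j * K + k] * a[row * N + col];
                }
            }
            c[m * N + n] = sum;
        }
    }
}
```

For example, with a = {1, 2; 3, 4} and a 2x2 h of all ones, this produces c = {1, 3; 4, 10}. Because row = m - j and col = n - k never exceed the upper bounds of a, only the negative side needs a check.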

b. Estimate the theoretical performance limit that you may achieve based on your kernel function for part (a) and analyze the performance bottleneck of your kernel function. Test your program on an ATI HD 4870 graphics processor to measure the actual performance (on convolution of a 2k x 2k matrix with a 7x7 kernel). (Due date: 2/05/09)
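One possible way to frame the theoretical limit is a simple roofline-style bound: the kernel cannot finish faster than the time needed for its arithmetic at peak rate, nor faster than the time needed to move its data at peak bandwidth. The sketch below is only an illustration under stated assumptions. It counts two floating-point operations per multiply-add, assumes 4-byte floats, and assumes ideal reuse, so that a and h are each read once and c is written once. A kernel that re-reads a from memory for every tap moves far more data than this. The names conv_bound and conv2d_bound are hypothetical, and peak_flops and peak_bw are inputs you should take from the card's specification.

```c
#include <assert.h>

/* Roofline-style lower bound on kernel time. All fields are estimates. */
typedef struct {
    double flops;      /* 2 operations (multiply + add) per kernel tap       */
    double bytes;      /* minimum traffic: read a and h once, write c once   */
    double t_compute;  /* seconds at peak arithmetic rate                    */
    double t_memory;   /* seconds at peak memory bandwidth                   */
    double t_bound;    /* the larger of the two: no kernel can beat this     */
} conv_bound;

/* peak_flops in FLOP/s and peak_bw in bytes/s are assumptions supplied by
 * the caller; they are not measured or looked up here. */
conv_bound conv2d_bound(int M, int N, int J, int K,
                        double peak_flops, double peak_bw)
{
    assert(M > 0 && N > 0 && J > 0 && K > 0);
    assert(peak_flops > 0.0 && peak_bw > 0.0);

    conv_bound b;
    b.flops     = 2.0 * M * N * J * K;
    b.bytes     = 4.0 * (2.0 * M * N + (double)J * K);
    b.t_compute = b.flops / peak_flops;
    b.t_memory  = b.bytes / peak_bw;
    b.t_bound   = (b.t_compute > b.t_memory) ? b.t_compute : b.t_memory;
    return b;
}
```

Taking 2k to mean 2048, the 2048 x 2048 case with a 7x7 kernel gives 411,041,792 floating-point operations and at least 33,554,628 bytes of traffic. Comparing t_compute with t_memory indicates whether the ideal kernel would be compute bound or bandwidth bound. The gap between t_bound and your measured time is what the bottleneck analysis should explain.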

c. Improve the performance by computing multiple elements of c in one kernel function. Repeat your performance analysis and report how much speedup (compared to computing one element in the kernel function) you get on an ATI HD 4870 graphics processor (on convolution of a 2k x 2k matrix with a 7x7 kernel). (Due date: 2/10/09)

Homework #2 2-D Convolution in CUDA

a. Write a 2D convolution function using CUDA. Debug your program using the CPU-based emulator and test it on lab machines.

b. Write an optimized version of single-precision floating-point 2D convolution for the NVIDIA GeForce 8800 GTX GPU (using CUDA). Use random numbers to initialize both matrices. Report the following results: (1) The number of lines of code in the kernel function(s). (2) The execution time (including data transfer time from CPU to GPU and from GPU to CPU) for matrix sizes 256x256, 512x512, 1024x1024, 2048x2048, and 4096x4096, with a convolution kernel size of 7x7.
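Two small host-side helpers can support part (b): one to fill both matrices with random single-precision values, and one to compare the GPU result against a CPU reference. This is a minimal sketch, not a required interface. The names fill_random and max_abs_diff, the [0, 1] value range, and the fixed seed are assumptions chosen so that runs are repeatable.

```c
#include <assert.h>
#include <stdlib.h>

/* Fill x[0..count) with pseudo-random floats in [0, 1].
 * A fixed seed makes runs repeatable, which helps when comparing versions. */
void fill_random(float *x, size_t count, unsigned seed)
{
    assert(x);
    srand(seed);
    for (size_t i = 0; i < count; ++i)
        x[i] = (float)rand() / (float)RAND_MAX;
}

/* Largest absolute element-wise difference between two arrays,
 * e.g. a GPU result copied back to the host and a CPU reference. */
float max_abs_diff(const float *x, const float *y, size_t count)
{
    assert(x && y);
    float worst = 0.0f;
    for (size_t i = 0; i < count; ++i) {
        float d = x[i] - y[i];
        if (d < 0.0f)
            d = -d;
        if (d > worst)
            worst = d;
    }
    return worst;
}
```

Since the GPU and the CPU may accumulate the 49 products in a different order, compare max_abs_diff against a small tolerance rather than expecting exact equality. Keep the initialization and the check outside the timed region, so that the reported time covers only the transfers and the kernel.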

Sample solutions

Homework #3 Cell Programming

Run the “Hello World” program in both the simulation environment and on a PlayStation 3 (PS3). (Due date: 3/19/09)

Tips for installing the Cell SDK 3.0 on an x86 machine

Write a 2D convolution algorithm for Cell processors. Report the following results: (1) The number of lines of code in the PPE code and the number of lines of code in the SPE code. (2) The execution time for matrix sizes 256x256, 512x512, 1024x1024, 2048x2048, and 4096x4096, with a convolution kernel size of 7x7. (Due date: 4/07/09)

Sample code on using Mailbox and DMA

IBM’s DMA troubleshooting guide