NVIDIA CUDA C SDK Code Samples

The GPU Computing SDK provides examples with source code, utilities, and white papers to help you get started writing GPU Computing software. The full SDK includes dozens of code samples covering a wide range of applications including:

  • Simple techniques such as C++ code integration and efficient loading of custom datatypes
  • How-To examples covering CUDA BLAS and FFT libraries, texture fetching in CUDA, and CUDA interoperation with the OpenGL and Direct3D graphics APIs
  • Linear algebra primitives such as matrix transpose and matrix-matrix multiplication
  • Data-parallel algorithms such as parallel prefix sum of large arrays
  • Performance: profiling using timers and bandwidth tests
  • Advanced application examples such as image convolution, Black-Scholes options pricing and binomial options pricing

The NVIDIA CUDA Toolkit is required to compile code samples. Please obtain the CUDA Toolkit from CUDA Zone.

Vector Addition

This CUDA Runtime API sample is a very basic sample that implements element by element vector addition. It is the same as the sample illustrating Chapter 3 of the programming guide with some additions like error checking.
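As a rough illustration of the element-by-element operation described above, the following is a host-side C++ sketch, not the SDK kernel. The function name `vecAdd` is hypothetical; in a CUDA kernel the loop body would run once per thread, with the index derived from the block and thread IDs and guarded by the same bounds check.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Host-side reference for element-by-element addition: c[i] = a[i] + b[i].
std::vector<float> vecAdd(const std::vector<float>& a,
                          const std::vector<float>& b) {
    assert(a.size() == b.size());  // inputs must have equal length
    std::vector<float> c(a.size());
    for (std::size_t i = 0; i < a.size(); ++i) {
        c[i] = a[i] + b[i];  // one independent addition per element
    }
    return c;
}
```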


Vector Addition Driver API

This CUDA Driver API sample is a very basic sample that implements element by element vector addition. It is the same as the sample illustrating Chapter 3 of the programming guide with some additions like error checking.


Device Query

This sample enumerates the properties of the CUDA devices present in the system.


Device Query Driver API

This sample enumerates the properties of the CUDA devices present using CUDA Driver API calls.


Template

A trivial template project that can be used as a starting point to create new CUDA projects.


C++ Integration

This example demonstrates how to integrate CUDA into an existing C++ application: the CUDA entry point on the host side is a single function called from C++ code, and only the file containing that function is compiled with nvcc. It also demonstrates that vector types can be used from C++ source (.cpp) files.


Bandwidth Test

This is a simple test program to measure the memcopy bandwidth of the GPU. It currently is capable of measuring device to device copy bandwidth, host to device copy bandwidth for pageable and page-locked memory, and device to host copy bandwidth for pageable and page-locked memory.


asyncAPI

This sample uses CUDA streams and events to overlap execution on CPU and GPU.


Clock

This example shows how to use the clock function to accurately measure the performance of a kernel.


Simple Atomic Intrinsics

A simple demonstration of global memory atomic instructions. Requires Compute Capability 1.1 or higher.


Pitch Linear Texture

This sample demonstrates the use of pitch linear textures.


simpleStreams

This sample uses CUDA streams to overlap kernel executions with memcopies between the device and the host. Requires Compute Capability 1.1 or higher.


Simple Templates

This sample is a templatized version of the template project. It also shows how to correctly templatize dynamically allocated shared memory arrays.


CUDA C 3D FDTD

This sample applies a finite-difference time-domain (FDTD) progression stencil on a 3D surface.


Simple Texture

Simple example that demonstrates use of textures in CUDA.


Simple Texture (Driver Version)

Simple example that demonstrates use of textures in CUDA using the driver API.


Simple Vote Intrinsics

Simple program which demonstrates how to use the Vote (any, all) intrinsic instruction in a CUDA kernel. Requires Compute Capability 1.2 or higher.


simpleZeroCopy

This sample illustrates how to use Zero MemCopy, in which kernels read and write directly to pinned system memory. This sample requires GPUs that support this feature (MCP79 and GT200).
Whitepaper


Matrix Transpose

Efficient matrix transpose.


CUDA Context Thread Management

Simple program illustrating how to use the CUDA Context Management API. CUDA contexts can be created separately and attached independently to different threads.


Simple CUBLAS

Example of using CUBLAS.


Simple CUFFT

Example of using CUFFT. In this example, CUFFT is used to compute the 1D-convolution of some signal with some filter by transforming both into frequency domain, multiplying them together, and transforming the signal back to time domain.


Simple OpenGL

Simple program which demonstrates interoperability between CUDA and OpenGL. The program modifies vertex positions with CUDA and uses OpenGL to render the geometry.


Simple Texture 3D

Simple example that demonstrates use of 3D textures in CUDA.


Matrix Multiplication

This sample implements matrix multiplication and is exactly the same as Chapter 6 of the programming guide. It has been written for clarity of exposition to illustrate various CUDA programming principles, not with the goal of providing the most performant generic kernel for matrix multiplication. CUBLAS provides high-performance matrix multiplication.


Matrix Multiplication (Dynamic Linking Version)

This sample revisits matrix multiplication using the CUDA driver API. It demonstrates how to link to CUDA driver at runtime and how to use JIT (just-in-time) compilation from PTX code. It has been written for clarity of exposition to illustrate various CUDA programming principles, not with the goal of providing the most performant generic kernel for matrix multiplication. CUBLAS provides high-performance matrix multiplication.


Scalar Product

This sample calculates scalar products of a given set of input vector pairs.


Concurrent Kernels

This sample demonstrates the use of CUDA streams for concurrent execution of several kernels on devices of compute capability 2.0 or higher. Devices of compute capability 1.x will run the kernels sequentially.


Aligned Types

A simple test showing the large gap in access speed between aligned and misaligned structures.


PTX Just-in-Time compilation

This sample illustrates how to use JIT compilation for PTX code.


DCT8x8

This sample demonstrates how Discrete Cosine Transform (DCT) for blocks of 8 by 8 pixels can be performed using CUDA: a naive implementation by definition and a more traditional approach used in many libraries. As opposed to implementing DCT in a fragment shader, CUDA allows for an easier and more efficient implementation.
Whitepaper


1D Discrete Haar Wavelet Decomposition

Discrete Haar wavelet decomposition for 1D signals with a length which is a power of 2.
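The decomposition can be sketched on the host as repeated averaging and differencing of neighbouring pairs. This is an illustrative C++ reference, not the SDK kernel; the function name `haarDecompose`, the orthonormal 1/sqrt(2) scaling, and the output layout (final approximation first, then details from coarsest to finest) are assumptions made for the example.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Full 1D Haar decomposition. out[0] is the final approximation coefficient,
// followed by the detail coefficients from the coarsest level to the finest.
std::vector<double> haarDecompose(std::vector<double> signal) {
    const std::size_t n = signal.size();
    assert(n > 0 && (n & (n - 1)) == 0);  // length must be a power of two
    const double s = 1.0 / std::sqrt(2.0);  // assumed orthonormal scaling
    std::vector<double> tmp(n);
    for (std::size_t len = n; len > 1; len /= 2) {
        const std::size_t half = len / 2;
        for (std::size_t i = 0; i < half; ++i) {
            tmp[i]        = (signal[2 * i] + signal[2 * i + 1]) * s;  // approximation
            tmp[half + i] = (signal[2 * i] - signal[2 * i + 1]) * s;  // detail
        }
        // Only the first len entries change; finer details stay in place.
        for (std::size_t i = 0; i < len; ++i) signal[i] = tmp[i];
    }
    return signal;
}
```

Each level halves the number of approximation coefficients, so a signal of length 2^k needs k levels; within a level every pair is independent, which is what makes the transform data-parallel.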


Eigenvalues

The computation of all or a subset of all eigenvalues is an important problem in linear algebra, statistics, physics, and many other fields. This sample demonstrates a parallel implementation of a bisection algorithm for the computation of all eigenvalues of a tridiagonal symmetric matrix of arbitrary size with CUDA.
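A serial sketch of the bisection idea may help: for a symmetric tridiagonal matrix, the number of eigenvalues below a value x can be counted from the signs of a simple pivot recurrence, and bisecting on that count isolates each eigenvalue. The code below is an illustrative C++ reference under those textbook assumptions, not the SDK's parallel implementation; `countBelow` and `kthEigenvalue` are hypothetical names, and the zero-pivot handling and tolerance are choices made for the example.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// Counts eigenvalues below x for the symmetric tridiagonal matrix with
// diagonal d (n entries) and off-diagonal e (n - 1 entries), as the number
// of negative pivots of (T - x I). A zero pivot is treated as negative.
std::size_t countBelow(const std::vector<double>& d,
                       const std::vector<double>& e, double x) {
    assert(e.size() + 1 == d.size());
    std::size_t count = 0;
    double q = 1.0;
    for (std::size_t i = 0; i < d.size(); ++i) {
        const double off = (i == 0) ? 0.0 : e[i - 1] * e[i - 1];
        q = d[i] - x - off / q;
        if (q == 0.0) q = -std::numeric_limits<double>::min();  // avoid dividing by 0
        if (q < 0.0) ++count;
    }
    return count;
}

// k-th smallest eigenvalue (k = 0 .. n-1) by bisection on a Gershgorin interval.
double kthEigenvalue(const std::vector<double>& d,
                     const std::vector<double>& e, std::size_t k) {
    double lo = d[0], hi = d[0];
    for (std::size_t i = 0; i < d.size(); ++i) {
        const double r = (i > 0 ? std::fabs(e[i - 1]) : 0.0) +
                         (i + 1 < d.size() ? std::fabs(e[i]) : 0.0);
        lo = std::min(lo, d[i] - r);
        hi = std::max(hi, d[i] + r);
    }
    for (int iter = 0;
         iter < 200 && hi - lo > 1e-14 * std::max(1.0, std::fabs(hi)); ++iter) {
        const double mid = 0.5 * (lo + hi);
        if (countBelow(d, e, mid) > k) hi = mid;  // eigenvalue k lies below mid
        else lo = mid;
    }
    return 0.5 * (lo + hi);
}
```

Because each interval can be refined independently of the others, the search for different eigenvalues parallelizes naturally.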
Whitepaper


Fast Walsh Transform

Naturally (Hadamard) ordered Fast Walsh Transform for batched vectors of arbitrary eligible (power-of-two) lengths.
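For reference, the naturally ordered transform reduces to a butterfly network of additions and subtractions. The following host-side C++ sketch handles a single vector rather than a batch and is not taken from the SDK; the function name `fastWalshTransform` is hypothetical and no normalization is applied.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// In-place, unnormalized Fast Walsh-Hadamard Transform in natural (Hadamard)
// order. Applying it twice yields the input scaled by the vector length.
void fastWalshTransform(std::vector<double>& a) {
    const std::size_t n = a.size();
    assert(n > 0 && (n & (n - 1)) == 0);  // length must be a power of two
    for (std::size_t stride = 1; stride < n; stride *= 2) {
        for (std::size_t base = 0; base < n; base += 2 * stride) {
            for (std::size_t j = base; j < base + stride; ++j) {
                const double x = a[j];
                const double y = a[j + stride];
                a[j] = x + y;           // butterfly: sum
                a[j + stride] = x - y;  // butterfly: difference
            }
        }
    }
}
```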


Histogram

This sample demonstrates efficient implementations of 64-bin and 256-bin histograms.
Whitepaper


Line of Sight

This sample is an implementation of a simple line-of-sight algorithm: Given a height map and a ray originating at some observation point, it computes all the points along the ray that are visible from the observation point. The implementation is based on the parallel scan primitive provided by the CUDPP library (http://www.gpgpu.org/developer/cudpp/).
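One common way to phrase this problem in terms of scan is to compute the slope from the observer to each sample and mark a sample visible when its slope is at least the running maximum of all nearer slopes. The sketch below is a serial C++ illustration of that formulation and may differ from the SDK code in its details; `lineOfSight`, `eyeHeight`, and `spacing` are hypothetical names, and the running maximum stands in for the parallel max-scan.

```cpp
#include <cassert>
#include <cstddef>
#include <limits>
#include <vector>

// heights[0] is the terrain height under the observer, who stands eyeHeight
// above it; samples are evenly spaced along the ray. Returns visible[i] for
// each sample (sample 0 is trivially visible).
std::vector<bool> lineOfSight(const std::vector<double>& heights,
                              double eyeHeight, double spacing) {
    assert(spacing > 0.0);
    std::vector<bool> visible(heights.size(), false);
    if (heights.empty()) return visible;
    visible[0] = true;
    const double eye = heights[0] + eyeHeight;
    double maxSlope = -std::numeric_limits<double>::infinity();
    for (std::size_t i = 1; i < heights.size(); ++i) {
        const double slope = (heights[i] - eye) / (static_cast<double>(i) * spacing);
        visible[i] = slope >= maxSlope;  // no nearer sample rises above this sight line
        if (slope > maxSlope) maxSlope = slope;  // running max = serial max-scan
    }
    return visible;
}
```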


New Matrix Transpose

High Performance matrix transpose.
Whitepaper


Box Filter

Fast image box filter using CUDA with OpenGL rendering.


Post-Process in OpenGL

This sample shows how to post-process an image rendered in OpenGL using CUDA.


Parallel Reduction

A parallel sum reduction that computes the sum of large arrays of values. This sample demonstrates several important optimization strategies for parallel algorithms like reduction.
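The access pattern behind a tree-based sum reduction can be emulated on the host. This C++ sketch is only an illustration of the general technique with sequential addressing, not one of the SDK's optimized kernels; the function name `treeReduce` and the zero padding to a power-of-two length are assumptions made for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Emulates a tree-based sum reduction: at each step the first `stride`
// elements each add in a partner `stride` positions away, halving the
// active range until one value remains.
float treeReduce(std::vector<float> data) {
    assert(!data.empty());
    std::size_t n = 1;
    while (n < data.size()) n *= 2;  // pad to a power of two with zeros
    data.resize(n, 0.0f);
    for (std::size_t stride = n / 2; stride > 0; stride /= 2) {
        for (std::size_t i = 0; i < stride; ++i) {  // each i would be one thread
            data[i] += data[i + stride];
        }
        // In a kernel, a barrier such as __syncthreads() would sit here.
    }
    return data[0];
}
```

The number of steps grows with the logarithm of the input length, and all additions within a step are independent of one another.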
Whitepaper


DirectX Texture Compressor (DXTC)

High Quality DXT Compression using CUDA. This example shows how to implement an existing computationally-intensive CPU compression algorithm in parallel on the GPU, and obtain an order of magnitude performance improvement.
Whitepaper


Image denoising

This sample demonstrates two adaptive image denoising techniques, KNN and NLM, based on computation of both geometric and color distance between texels. Both techniques are also implemented in the DirectX SDK using shaders; in addition to counterparts of those DirectX implementations, this sample includes a much faster variation of NLM that takes advantage of shared memory.
Whitepaper


Sobel Filter

This sample implements the Sobel edge detection filter for 8-bit monochrome images.


Recursive Gaussian Filter

This sample implements a Gaussian blur using Deriche's recursive method. The advantage of this method is that the execution time is independent of the filter width.


CUDA Video Decoder GL API

This sample demonstrates how to efficiently use the CUDA Video Decoder API to decode MPEG-2 or H.264 sources, perform YUV2RGB conversion of the decoded surface with a CUDA kernel, and output the result to an OpenGL surface. An OpenGL window with current frame and fps is opened, but the decoded video is not displayed on the screen. Requires Compute Capability 1.1 or higher.
Whitepaper


Bicubic Texture Filtering

This sample demonstrates how to efficiently implement bicubic texture filtering in CUDA.


Fluids (OpenGL Version)

An example of fluid simulation using CUDA and CUFFT, with OpenGL rendering.


FFT Ocean Simulation

This sample simulates an Ocean heightfield using CUFFT and renders the result using OpenGL.


FFT-Based 2D Convolution

This sample demonstrates how 2D convolutions with very large kernel sizes can be efficiently implemented using FFT transformations.


Separable Convolution

This sample implements a separable convolution filter of a 2D signal with a Gaussian kernel.
Whitepaper


Texture-based Separable Convolution

Texture-based implementation of a separable 2D convolution with a Gaussian kernel. Used for performance comparison against convolutionSeparable.


threadFenceReduction

This sample shows how to perform a reduction operation on an array of values using the __threadfence() intrinsic to produce a single value in a single kernel (as opposed to two or more kernel calls as shown in the "reduction" SDK sample). Single-pass reduction requires global atomic instructions (Compute Capability 1.1 or later) and the __threadfence() intrinsic (CUDA 2.2 or later).


Radix Sort

This sample demonstrates a very fast and efficient parallel radix sort implemented in C for CUDA. The included RadixSort class can sort either key-value pairs (with float or unsigned integer keys) or keys only.
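To make the key-value sorting concrete, here is a serial least-significant-digit radix sort in C++. It is a sketch of the general technique only and does not reproduce the SDK's RadixSort class; the function name `radixSortPairs`, the 8-bit digit size, and the restriction to unsigned integer keys are assumptions made for the example.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Stable LSD radix sort of (key, value) pairs on 32-bit unsigned keys,
// processing 8 bits per pass: count digits, prefix-sum the counts into
// start offsets, then scatter each pair to its slot.
void radixSortPairs(std::vector<std::uint32_t>& keys,
                    std::vector<std::uint32_t>& values) {
    assert(keys.size() == values.size());
    const std::size_t n = keys.size();
    std::vector<std::uint32_t> keysOut(n), valuesOut(n);
    for (int shift = 0; shift < 32; shift += 8) {
        std::array<std::size_t, 257> offset{};  // offset[d + 1] counts digit d
        for (std::size_t i = 0; i < n; ++i) {
            ++offset[((keys[i] >> shift) & 0xFFu) + 1];
        }
        for (std::size_t d = 0; d < 256; ++d) {
            offset[d + 1] += offset[d];  // now offset[d] is the start of digit d
        }
        for (std::size_t i = 0; i < n; ++i) {
            const std::size_t pos = offset[(keys[i] >> shift) & 0xFFu]++;
            keysOut[pos] = keys[i];
            valuesOut[pos] = values[i];
        }
        keys.swap(keysOut);  // four passes in total, so the result ends in keys
        values.swap(valuesOut);
    }
}
```

The counting and prefix-sum steps are the parts that map onto parallel histogram and scan primitives on a GPU.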
Whitepaper


Sorting Networks

This sample implements bitonic sort and odd-even merge sort (also known as Batcher's sort), algorithms belonging to the class of sorting networks. While generally less efficient on large sequences than algorithms with better asymptotic complexity (e.g. merge sort or radix sort), they may be the algorithms of choice for sorting batches of short- to mid-sized (key, value) array pairs. Refer to the excellent tutorial by H. W. Lang: http://www.iti.fh-flensburg.de/lang/algorithmen/sortieren/networks/indexen.htm
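The fixed compare-and-swap pattern of a sorting network is easiest to see in bitonic sort. The following is a serial C++ sketch of the standard network for keys only, not the SDK kernel; the function name `bitonicSort` is hypothetical and the input length is assumed to be a power of two.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Bitonic sorting network, ascending order. The sequence of comparisons
// depends only on the length, never on the data, so every comparison in
// the innermost loop could run in parallel.
void bitonicSort(std::vector<int>& a) {
    const std::size_t n = a.size();
    assert(n > 0 && (n & (n - 1)) == 0);  // length must be a power of two
    for (std::size_t k = 2; k <= n; k *= 2) {         // size of merged runs
        for (std::size_t j = k / 2; j > 0; j /= 2) {  // compare distance
            for (std::size_t i = 0; i < n; ++i) {
                const std::size_t partner = i ^ j;
                if (partner > i) {
                    const bool ascending = (i & k) == 0;
                    if ((a[i] > a[partner]) == ascending) {
                        std::swap(a[i], a[partner]);
                    }
                }
            }
        }
    }
}
```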


Binomial Option Pricing

This sample evaluates the fair call price for a given set of European options under the binomial model. This sample will also take advantage of double precision if a GTX 200 class GPU is present.
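For a single option, the binomial model can be sketched as a backward induction over a recombining price tree. The C++ below uses the textbook Cox-Ross-Rubinstein parameterization, which is an assumption here and may not match the SDK sample exactly; the function name `binomialCall` and its parameter names are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <vector>

// European call price on a Cox-Ross-Rubinstein tree.
// S: spot, X: strike, T: years to expiry, r: risk-free rate, v: volatility.
double binomialCall(double S, double X, double T, double r, double v, int steps) {
    assert(steps > 0 && T > 0.0 && v > 0.0);
    const double dt = T / steps;
    const double u = std::exp(v * std::sqrt(dt));  // up factor
    const double d = 1.0 / u;                      // down factor
    const double growth = std::exp(r * dt);
    const double p = (growth - d) / (u - d);       // risk-neutral up probability
    const double disc = 1.0 / growth;
    std::vector<double> value(steps + 1);
    for (int i = 0; i <= steps; ++i) {             // payoffs at expiry
        const double price = S * std::pow(u, i) * std::pow(d, steps - i);
        value[i] = std::max(price - X, 0.0);
    }
    for (int n = steps; n > 0; --n) {              // walk back to the root
        for (int i = 0; i < n; ++i) {
            value[i] = disc * (p * value[i + 1] + (1.0 - p) * value[i]);
        }
    }
    return value[0];
}
```

Pricing a set of options is then a matter of running this independently per option, which is where the data parallelism comes from.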
Whitepaper


Black-Scholes Option Pricing

This sample evaluates fair call and put prices for a given set of European options using the Black-Scholes formula.
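The closed-form prices for one option can be written in a few lines. This C++ sketch evaluates the standard Black-Scholes formula using `std::erfc` for the normal CDF, which is a substitution made for brevity and not necessarily how the SDK sample computes it; `blackScholes`, `OptionPrices`, and the parameter names are hypothetical.

```cpp
#include <cassert>
#include <cmath>

struct OptionPrices {
    double call;
    double put;
};

// Cumulative distribution function of the standard normal distribution.
static double normCdf(double x) {
    return 0.5 * std::erfc(-x / std::sqrt(2.0));
}

// S: spot, X: strike, T: years to expiry, r: risk-free rate, v: volatility.
OptionPrices blackScholes(double S, double X, double T, double r, double v) {
    assert(S > 0.0 && X > 0.0 && T > 0.0 && v > 0.0);
    const double sqrtT = std::sqrt(T);
    const double d1 = (std::log(S / X) + (r + 0.5 * v * v) * T) / (v * sqrtT);
    const double d2 = d1 - v * sqrtT;
    const double disc = std::exp(-r * T);  // discount factor
    OptionPrices out;
    out.call = S * normCdf(d1) - X * disc * normCdf(d2);
    out.put = X * disc * normCdf(-d2) - S * normCdf(-d1);
    return out;
}
```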
Whitepaper


Niederreiter Quasirandom Sequence Generator

This sample implements a Niederreiter quasirandom sequence generator and the inverse cumulative normal distribution function for generating the standard normal distribution.


Monte Carlo Option Pricing

This sample evaluates the fair call price for a given set of European options using the Monte Carlo approach. This sample uses double precision hardware if a GTX 200 class GPU is present.
Whitepaper


Monte Carlo Option Pricing with multi-GPU support

This sample evaluates the fair call price for a given set of European options using the Monte Carlo approach, taking advantage of all CUDA-capable GPUs installed in the system. This sample uses double precision hardware if a GTX 200 class GPU is present.
Whitepaper


MersenneTwister

This sample implements the Mersenne Twister random number generator and the Cartesian Box-Muller transformation on the GPU.
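The combination of a uniform generator and the Cartesian Box-Muller transformation can be illustrated on the host with the C++ standard library. This sketch uses `std::mt19937` as the Mersenne Twister source and is not the SDK's GPU implementation; `boxMuller` and `normalSamples` are hypothetical names.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <random>
#include <utility>
#include <vector>

// Cartesian Box-Muller: maps two uniforms, u1 in (0, 1] and u2 in [0, 1),
// to two independent standard normal values.
std::pair<double, double> boxMuller(double u1, double u2) {
    assert(u1 > 0.0 && u1 <= 1.0);
    const double twoPi = 6.283185307179586;
    const double radius = std::sqrt(-2.0 * std::log(u1));
    return {radius * std::cos(twoPi * u2), radius * std::sin(twoPi * u2)};
}

// Draws 2 * pairs normal samples, using std::mt19937 as the uniform source.
std::vector<double> normalSamples(std::size_t pairs, unsigned seed) {
    std::mt19937 gen(seed);
    std::uniform_real_distribution<double> uniform(0.0, 1.0);  // [0, 1)
    std::vector<double> out;
    out.reserve(2 * pairs);
    for (std::size_t i = 0; i < pairs; ++i) {
        const double u1 = 1.0 - uniform(gen);  // shift to (0, 1] so log() is finite
        const double u2 = uniform(gen);
        const std::pair<double, double> z = boxMuller(u1, u2);
        out.push_back(z.first);
        out.push_back(z.second);
    }
    return out;
}
```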
Whitepaper


Mandelbrot

This sample uses CUDA to compute and display the Mandelbrot or Julia sets interactively. It also illustrates the use of "double single" arithmetic to improve precision when zooming a long way into the pattern. This sample uses double precision hardware if a GT200 class GPU is present. Thanks to Mark Granger of NewTek who submitted this sample to the SDK!
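The per-pixel work behind a Mandelbrot image is the escape-time iteration, sketched here in plain C++ with ordinary double precision. It does not show the sample's "double single" arithmetic or its display code; the function name `mandelbrotIterations` and the escape radius of 2 are the usual textbook choices, assumed for the example.

```cpp
#include <cassert>

// Iterates z = z * z + c from z = 0 for the point c = (cx, cy) and returns
// the number of iterations completed before |z| exceeds 2, capped at maxIter.
// Points that reach maxIter are treated as belonging to the set.
int mandelbrotIterations(double cx, double cy, int maxIter) {
    assert(maxIter >= 0);
    double x = 0.0;
    double y = 0.0;
    int i = 0;
    while (i < maxIter && x * x + y * y <= 4.0) {
        const double xNext = x * x - y * y + cx;  // real part of z * z + c
        y = 2.0 * x * y + cy;                     // imaginary part
        x = xNext;
        ++i;
    }
    return i;
}
```

Every pixel is computed independently of its neighbours, so one thread per pixel is a natural mapping.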


Particles

This sample uses CUDA to simulate and visualize a large set of particles and their physical interaction. It implements a uniform grid data structure using either a fast radix sort or atomic operations.
Whitepaper


Marching Cubes Isosurfaces

This sample extracts a geometric isosurface from a volume dataset using the marching cubes algorithm. It uses the scan (prefix sum) function from the CUDPP library to perform stream compaction.


Volume rendering

This sample demonstrates basic volume rendering using 3D textures.


N-Body Simulation

This sample demonstrates an efficient all-pairs gravitational n-body simulation in CUDA. This sample accompanies the GPU Gems 3 chapter "Fast N-Body Simulation with CUDA".
Whitepaper


Smoke Particles

Smoke simulation with volumetric shadows using half-angle slicing technique. Uses CUDA for procedural simulation and sorting and OpenGL for rendering.
Whitepaper


Sobol Quasirandom Number Generator

This sample implements a Sobol quasirandom sequence generator.


Matrix Multiplication (Driver Version)

This sample implements matrix multiplication using the CUDA driver API. It has been written for clarity of exposition to illustrate various CUDA programming principles, not with the goal of providing the most performant generic kernel for matrix multiplication. CUBLAS provides high-performance matrix multiplication.


simpleMPI

Simple example demonstrating how to use MPI in combination with CUDA.

Last Update: 2/8/2010