João Freitas

The following article covers the basics of GPU programming and explains how programs made to run in a CPU are completely different than those made to run in a GPU, both in the execution and programming model.

In today’s AI age, the majority of developers train in the CPU way. This knowledge has been part of our academics as well, so it’s obvious to think and problem-solve in a CPU-oriented way.

However, the problem with CPUs is that they rely on a sequential architecture. In today’s world, where we are dependent on numerous parallel tasks, CPUs are unable to work well in these scenarios.

Some problems faced by developers include:

Executing Parallel Tasks

CPUs traditionally operate linearly, executing one instruction at a time. This limitation stems from the fact that CPUs typically feature a few powerful cores optimized for single-threaded performance.

When faced with multiple tasks, a CPU allocates its resources to address each task one after the other, leading to a sequential execution of instructions. This approach becomes inefficient in scenarios where numerous tasks need simultaneous attention.

While we make efforts to enhance CPU performance through techniques like multi-threading, the fundamental design philosophy of CPUs prioritizes sequential execution.

Running AI Models Efficiently

AI models, employing advanced architectures like transformers, leverage parallel processing to enhance performance. Unlike older recurrent neural networks (RNNs) that operate sequentially, modern transformers such as GPT can concurrently process multiple words, increasing efficiency and capability in training. Because when we train in parallel, it will result in bigger models, and bigger models will yield better outputs.

The concept of parallelism extends beyond natural language processing to other domains like image recognition. For instance, AlexNet, an architecture in image recognition, demonstrates the power of parallel processing by processing different parts of an image simultaneously, allowing for accurate pattern identification.

However, CPUs, designed with a focus on single-threaded performance, struggle to fully exploit parallel processing potential. They face difficulties efficiently distributing and executing the numerous parallel computations required for intricate AI models.

As a result, the development of GPUs has become prevalent to address the specific needs of parallel processing in AI applications, unlocking higher efficiency and faster computation.

How GPU Driven Development Solves These Issues

Massive Parallelism With GPU Cores

Engineers design GPUs with smaller, highly specialized cores compared to the larger, more powerful cores found in CPUs. This architecture allows GPUs to execute a multitude of parallel tasks simultaneously.

The high number of cores in a GPU are well-suited for workloads depending on parallelism, such as graphics rendering and complex mathematical computations.

We will soon demonstrate how using GPU parallelism can reduce the time taken for complex tasks.



Parallelism Used In AI Models

AI models, particularly those built on deep learning frameworks like TensorFlow, exhibit a high degree of parallelism. Neural network training involves numerous matrix operations, and GPUs, with their expansive core count, excel in parallelizing these operations. TensorFlow, along with other popular deep learning frameworks, optimizes to leverage GPU power for accelerating model training and inference.

We will show a demo soon how to train a neural network using the power of the GPU.



CPUs Vs GPUs: What’s the Difference?


Sequential Architecture

Central Processing Units (CPUs) are designed with a focus on sequential processing. They excel at executing a single set of instructions linearly.

CPUs are optimized for tasks that require high single-threaded performance, such as

Limited Cores For Parallel Tasks

CPUs feature a smaller number of cores, often in the range of 2-16 cores in consumer-grade processors. Each core is capable of handling its own set of instructions independently.


Parallelized Architecture

Graphics Processing Units (GPUs) are designed with a parallel architecture, making them highly efficient for parallel processing tasks.

This is beneficial for

GPUs handle multiple tasks simultaneously by breaking them into smaller, parallel sub-tasks.

Thousands Of Cores For Parallel Tasks

Unlike CPUs, GPUs boast a significantly larger number of cores, often numbering in the thousands. These cores are organized into streaming multiprocessors (SMs) or similar structures.

The abundance of cores allows GPUs to process a massive amount of data concurrently, making them well-suited for parallelisable tasks, such as image and video processing, deep learning, and scientific simulations.

AWS GPU Instances: A Beginner’s Guide

Amazon Web Services (AWS) offers a variety of GPU instances used for things like machine learning.

Here are the different types of AWS GPU instances and their use cases:

General-Purpose Gpu Instances

Inference-optimized GPU instances

Graphics-optimized GPU instances

Managed GPU Instances

Using Nvidia’s CUDA for GPU-Driven Development

What Is Cuda?

CUDA is a parallel computing platform and programming model developed by NVIDIA, enabling developers to accelerate their applications by harnessing the power of GPU accelerators.

The Practical examples in the demo will use CUDA.

How to Setup Cuda on Your Machine

To setup CUDA on your machine you can follow these steps.

Basic Commands to Use

Once you have CUDA installed, here are some helpful commands.

lspci | grep VGA

The purpose of this command is to identify and list the GPUs in your system.


It stands for “NVIDIA System Management Interface” It provides detailed information about the NVIDIA GPUs in your system, including utilization, temperature, memory usage and more.

sudo lshw -C display

The purpose is to provide detailed information about the display controllers in your system, including graphics cards.

inxi -G

This command provides information about the graphics subsystem, including details about the GPU and the display.

sudo hwinfo --gfxcard

Its purpose is to obtain detailed information about the graphics cards in your system.

Get Started with the Cuda Framework

As we have installed the CUDA Framework, let’s start executing operations that showcases its functionality.

Array Addition Problem

A suitable problem to demonstrate the parallelization of GPUs is the Array addition problem.

Consider the following arrays:

The previous method involves traversing the array elements one by one and performing the additions sequentially. However, when dealing with a substantial volume of numbers, this approach becomes sluggish due to its sequential nature.

To address this limitation, GPUs offer a solution by parallelizing the addition process. Unlike CPUs, which execute operations one after the other, GPUs can concurrently perform multiple additions.

For instance, the operations 1+7, 2+8, 3+9, 4+10, 5+11 and 6+12 can be executed simultaneously through parallelization with the assistance of a GPU.

Utilizing CUDA, the code to achieve this parallelized addition is as follows:

We will use a kernel file (.cu) for the demonstration.

Let’s go through the code one by one.

__global__ void vectorAdd(int* a, int* b, int* c)
    int i = threadIdx.x;
    c[i] = a[i] + b[i];

Now lets go through the main function.

Pointers cudaA, cudaB and cudaC are created to point to memory on the GPU.

// Uses CUDA to use functions that parallelly calculates the addition
int main(){
    int a[] = {1,2,3};
    int b[] = {4,5,6};
    int c[sizeof(a) / sizeof(int)] = {0};
    // Create pointers into the GPU
    int* cudaA = 0;
    int* cudaB = 0;
    int* cudaC = 0;

Using `cudaMalloc`, memory is allocated on the GPU for the vectors cudaA, cudaB, and cudaC.

// Allocate memory in the GPU

The content of vectors a and b is copied from the host to the GPU using `cudaMemcpy`.

// Copy the vectors into the gpu
cudaMemcpy(cudaA, a, sizeof(a), cudaMemcpyHostToDevice);
cudaMemcpy(cudaB, b, sizeof(b), cudaMemcpyHostToDevice);

The kernel function `vectorAdd` is launched with one block and a number of threads equal to the size of the vectors.

// Launch the kernel with one block and a number of threads equal to the size of the vectors
vectorAdd <<<1, sizeof(a) / sizeof(a[0])>>> (cudaA, cudaB, cudaC);

The result vector `cudaC` is copied from the GPU back to the host.

// Copy the result vector back to the host
cudaMemcpy(c, cudaC, sizeof(c), cudaMemcpyDeviceToHost);

We can then print the results as usual

// Print the result
for (int i = 0; i < sizeof(c) / sizeof(int); i++){
    printf("c[%d] = %d", i, c[i]);

    return 0;

For executing this code, we will use nvcc command.

We will get the output as

GPU Output

GPU Output

Here’s the full code for your reference.

Optimize Image Generation in Python Using the GPU

Import necessary libraries

from matplotlib import pyplot as plt
import numpy as np
from pylab import imshow, show
from timeit import default_timer as timer

Function to calculate the Mandelbrot set for a given point (x, y)

def mandel(x, y, max_iters):
    c = complex(x, y)
    z = 0.0j
    # Iterate to check if the point is in the Mandelbrot set
    for i in range(max_iters):
        z = z*z + c
        if (z.real*z.real + z.imag*z.imag) >= 4:
            return i
    # If within the maximum iterations, consider it part of the set
    return max_iters

Function to create the Mandelbrot fractal within a specified region

def create_fractal(min_x, max_x, min_y, max_y, image, iters):
    height = image.shape[0]
    width = image.shape[1]

    # Calculate pixel sizes based on the specified region
    pixel_size_x = (max_x - min_x) / width
    pixel_size_y = (max_y - min_y) / height

    # Iterate over each pixel in the image and compute the Mandelbrot value
    for x in range(width):
        real = min_x + x * pixel_size_x
        for y in range(height):
            imag = min_y + y * pixel_size_y
            color = mandel(real, imag, iters)
            image[y, x] = color

Create a blank image array for the Mandelbrot set

image = np.zeros((1024, 1536), dtype=np.uint8)

Record the start time for performance measurement

start = timer()

Generate the Mandelbrot set within the specified region and iterations

create_fractal(-2.0, 1.0, -1.0, 1.0, image, 20)

Calculate the time taken to create the Mandelbrot set

dt = timer() - start

Print the time taken to generate the Mandelbrot set

print("Mandelbrot created in %f s" % dt)

Display the Mandelbrot set using matplotlib


The above code produces the output in 4.07 seconds.

Mandelbrot without GPU

Mandelbrot without GPU

The above code gets executed in 0.0046 seconds. Which is a lot faster the CPU Based code we had earlier.

Mandelbrot with GPU

Mandelbrot with GPU

Here’s the full code for your reference.

Training a Cat VS Dog Neural Network Using the GPU

One of the hot topics we see nowadays is how GPUs are getting used in AI, So to demonstrate that we will be creating a neural network to differentiate between cats and dogs.


CNN File Structure

CNN File Structure

This is the code we will use for training and using the Cat vs Dog Model.

The below code uses a convolutional neural network, you can read more details about it

Importing Libraries

Initializing the Convolutional Neural Network

classifier = Sequential()

Loading the data for training

train_datagen = ImageDataGenerator(
test_datagen = ImageDataGenerator(rescale=1./255)

training_set = train_datagen.flow_from_directory(
    target_size=(64, 64),

test_set = test_datagen.flow_from_directory(
    target_size=(64, 64),

Building the CNN Architecture

classifier.add(Convolution2D(32, 3, 3, input_shape=(64, 64, 3), activation='relu'))
classifier.add(MaxPooling2D(pool_size=(2, 2)))
classifier.add(Dense(units=128, activation='relu'))
classifier.add(Dense(units=1, activation='sigmoid'))

Compiling the model

classifier.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

Training the model, epochs=25, validation_data=test_set, validation_steps=2000)'trained_model.h5')

Once we have trained the model, The model is stored in a .h5 file using

In the below code, we will use this trained_model.h5 file to recognize cats and dogs.

import numpy as np
from keras.models import load_model
import keras.utils as image

def predict_image(imagepath, classifier):
    predict = image.load_img(imagepath, target_size=(64, 64))
    predict_modified = image.img_to_array(predict)
    predict_modified = predict_modified / 255
    predict_modified = np.expand_dims(predict_modified, axis=0)
    result = classifier.predict(predict_modified)

    if result[0][0] >= 0.5:
        prediction = 'dog'
        probability = result[0][0]
        print("Probability = " + str(probability))
        print("Prediction = " + prediction)
        prediction = 'cat'
        probability = 1 - result[0][0]
        print("Probability = " + str(probability))
        print("Prediction = " + prediction)

Load the trained model

loaded_classifier = load_model('trained_model.h5')

Example usage

dog_image = "dog.jpg"
predict_image(dog_image, loaded_classifier)

cat_image = "cat.jpg"
predict_image(cat_image, loaded_classifier)

Let’s see the output

Here’s the full code for your reference


In the upcoming AI age, GPUs are not a thing to be ignored, We should be more aware of its capabilities.

As we transition from traditional sequential algorithms to increasingly prevalent parallelized algorithms, GPUs emerge as indispensable tools that empower the acceleration of complex computations. The parallel processing prowess of GPUs is particularly advantageous in handling the massive datasets and intricate neural network architectures inherent to artificial intelligence and machine learning tasks.

Furthermore, the role of GPUs extends beyond traditional machine learning domains, finding applications in scientific research, simulations, and data-intensive tasks. The parallel processing capabilities of GPUs have proven instrumental in addressing challenges across diverse fields, ranging from drug discovery and climate modelling to financial simulations.


#reads #rijul rajesh #gpu #programming #ai