
Introduction to NVIDIA RAPIDS cuDF Pandas

Hello everyone, my name is Krish Naik. Welcome to my YouTube channel. If you have been following my channel, you'll know that I have previously made a couple of videos about the cuDF library. For those unfamiliar with cuDF, it is a Python GPU DataFrame library built on the Apache Arrow columnar memory format, designed for loading, joining, aggregating, filtering, and otherwise manipulating tabular data.

A common problem with the pandas library is its inefficiency on large datasets: data preprocessing steps can become quite slow when the data is huge. NVIDIA's cuDF pandas addresses this problem by leveraging the power of GPUs for data preprocessing, provided a GPU is available on your system or through platforms like Google Colab.

The New Update in RAPIDS

I am making this specific video to talk about a recent update in RAPIDS, version 24.08. This release introduces a new pandas accelerator mode, which brings accelerated computing to your pandas workflow without requiring any code changes. The update is particularly useful for preprocessing large string data.

In this post, we'll demonstrate how the cuDF pandas accelerator mode helps with processing a dataset that has large string fields, specifically one that occupies more than 4 GB in memory. We'll start by loading the cuDF library with a simple command and then proceed step by step through the implementation.

Preprocessing Large Data with Pandas

To illustrate the benefits of cuDF pandas, let's start by using traditional pandas. First, we will execute the following command to display information about the GPU in our system:

!nvidia-smi

For this example, we are using the Tesla T4 GPU available on Google Colab's paid plan. Now, we will download the dataset: the LinkedIn Jobs and Skills dataset for 2024, which requires more than 8 GB of memory to load.

!wget [download_link] -O linkedin_jobs.zip && unzip linkedin_jobs.zip

Once the files are unzipped, you'll have:

  • job_skills.csv
  • job_summary.csv
  • linkedin_job_postings.csv

After unzipping the files, let's start by importing pandas and reading the data:

import time
import pandas as pd

start_time = time.time()  # start the clock so we can measure the CSV load time
dataframe = pd.read_csv('job_summary.csv')
print(dataframe.head())

Loading Data with Pandas

Initially, we will measure the memory usage of the dataset. The job summary column is particularly large.

memory_used = dataframe.memory_usage(deep=True).sum() / (1024 ** 3)  # deep=True counts the actual string bytes
cpu_time = time.time() - start_time
print(f"Data size: {memory_used:.2f} GB")
print(f"CPU Time: {cpu_time:.2f} seconds")

The CPU time to load an 8.19 GB dataset is approximately 59.8 seconds.

Exploratory Data Analysis (EDA)

We'll conduct some exploratory data analysis to identify job roles and companies with the longest job summaries. This involves various operations such as:

  • Reading the CSV files
  • Calculating the memory usage of each column (see the sketch just after this list)
  • Merging datasets
  • Grouping and aggregating data
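
For instance, a quick way to see which columns dominate memory is to inspect per-column usage on the DataFrame loaded above. A minimal sketch, assuming the job summary DataFrame from earlier:

per_column_gb = dataframe.memory_usage(deep=True) / (1024 ** 3)  # deep=True measures actual string bytes
print(per_column_gb.sort_values(ascending=False))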

Here is an example of merging datasets (assuming the job postings and job summaries CSVs have already been read into job_posting_df and job_summary_df):

merged_df = pd.merge(job_posting_df, job_summary_df, on='Job Link', how='left')

This takes 3.2 seconds.
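
To identify the job roles and companies with the longest job summaries, we can add a summary-length column and sort by it. Here is a minimal sketch, assuming the merged DataFrame from above and the column names used in this post (the names in the raw CSVs may differ):

# 'Job Summary' is the merged text column (assumed name); str.len() gives its length in characters
merged_df['Job Summary Length'] = merged_df['Job Summary'].str.len()

# Show the ten longest summaries along with company and job title
longest = merged_df.sort_values('Job Summary Length', ascending=False)
print(longest[['Company', 'Job Title', 'Job Summary Length']].head(10))

Because these are string-heavy operations on a very large dataset, this is exactly the kind of workload the GPU accelerator is meant for.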

Accelerating Data Processing with cuDF Pandas

Now, for the most exciting part: transforming our workflow using cuDF pandas. First, install the library (the cuDF packages are hosted on NVIDIA's package index):

!pip install "cudf-cu11==24.8.*" --extra-index-url=https://pypi.nvidia.com

After installing the library, load the cuDF pandas extension:

%load_ext cudf.pandas
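
The %load_ext magic is the notebook way of enabling the accelerator (Colab, Jupyter). If you are running a plain Python script instead, the same accelerator can be enabled from the command line or programmatically; a minimal sketch (the script name is just a placeholder):

# Option 1: run an unchanged pandas script under the accelerator from the shell
#   python -m cudf.pandas my_script.py

# Option 2: enable it inside the script, before pandas is imported
import cudf.pandas
cudf.pandas.install()

import pandas as pd  # now transparently backed by cuDF where possible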

Now, let's read the same job summary CSV file again and see the difference. Since the accelerator extension is loaded, we keep exactly the same pandas code as before; no changes are required:

import pandas as pd
dataframe = pd.read_csv('job_summary.csv')
print(dataframe.head())

The GPU takes just 2.27 seconds to load the dataset, a significant improvement over the roughly one minute pandas needed on the CPU.
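
If you want to confirm which operations actually ran on the GPU and which fell back to the CPU, recent cuDF releases include a profiler cell magic. A minimal sketch of how it could be used on a small piece of the workflow (the column name is assumed):

%%cudf.pandas.profile
# The profiler reports, per operation, whether it ran on the GPU (cuDF) or fell back to the CPU (pandas)
dataframe['Job Summary'].str.len().mean()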

Comparing Performance

Other tasks that showed significant speed improvements include:

  • Loading job skills and job posting datasets
  • Merging datasets
  • Grouping and aggregating data

Here's an example of performing a group-by operation with cuDF pandas, using the Job Summary Length column computed earlier on the merged DataFrame:

grouped_df = merged_df.groupby(['Company', 'Job Title']).agg({'Job Summary Length': ['mean']})
print(grouped_df)

This operation takes only 460 milliseconds.
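
The timings quoted in this section can be reproduced directly in Colab with the %%time cell magic. For example, to time the merge step on the GPU (a minimal sketch, reusing the DataFrames from earlier):

%%time
merged_df = pd.merge(job_posting_df, job_summary_df, on='Job Link', how='left')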

Conclusion

In summary, cuDF pandas allows you to continue using pandas as your primary DataFrame library but accelerates tasks using GPU when needed. This is an enormous advantage when dealing with large datasets, drastically reducing preprocessing time from minutes to mere seconds.

For more information, check out the RAPIDS cuDF website.

If you want to dive deeper and see the implementation in action, I highly recommend watching the YouTube video.

Thank you for reading! Have a great day!