Hello everyone, my name is Krish Naik. Welcome to my YouTube channel. If you have been following my channel, you'll know that I have previously made a couple of videos related to the cuDF library. For those unfamiliar with cuDF, it is a Python GPU DataFrame library built on the Apache Arrow columnar memory format. It's designed for loading, joining, aggregating, filtering, and otherwise manipulating tabular data.
One common problem with the pandas library is its inefficiency with large datasets: preprocessing can become painfully slow once the data grows to several gigabytes. NVIDIA's cuDF addresses this by running DataFrame operations on the GPU, provided a GPU is available on your system or through platforms like Google Colab.
I am making this specific video to talk about a recent update in RAPIDS, version 24.08. This release ships the pandas accelerator mode, which lets you bring accelerated computing to your pandas workflows without requiring any code changes, and it is particularly useful for preprocessing large string data.
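To give a sense of what "no code changes" means in practice, here is a quick sketch of the two documented ways to enable the mode (your_script.py is just a placeholder for any existing pandas script; we'll use the notebook form later in this post):

# In a Jupyter/Colab notebook, load the extension before importing pandas:
%load_ext cudf.pandas
import pandas as pd  # existing pandas code then runs unchanged, accelerated on the GPU

# For a standalone script, the documented module entry point works the same way:
#   python -m cudf.pandas your_script.py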
In this post, we'll demonstrate how the cuDF pandas accelerator mode helps when processing a dataset with large string fields, specifically one that takes up more than 8 GB in memory. We'll start by loading the cuDF libraries with a simple command and proceed step by step through the implementation.
To illustrate the benefits of cuDF pandas, let's start by using traditional pandas. First, we will execute the following command to display information about the GPU in our system:
!nvidia-smi
For this example, we are using the Tesla T4 GPU available on Google Colab's paid plan. Next, we'll download the LinkedIn Jobs and Skills dataset for 2024, which requires more than 8 GB of memory to load.
!wget [download_link] -O linkedin_jobs.zip && unzip linkedin_jobs.zip
Once the files are unzipped, let's import pandas, start a timer so we can measure the load, and read the data:
import time
import pandas as pd

start_time = time.time()  # record the wall-clock time before loading
dataframe = pd.read_csv('jobscore_summary.csv')
print(dataframe.head())
Next, we'll measure how long the load took and how much memory the dataset occupies; the job summary column is particularly large.
# deep=True is needed to count the memory actually held by string (object) columns
memory_used = dataframe.memory_usage(deep=True).sum() / (1024 ** 3)
cpu_time = time.time() - start_time
print(f"Data size: {memory_used:.2f} GB")
print(f"CPU time to load: {cpu_time:.2f} seconds")
The CPU time to load an 8.19 GB dataset is approximately 59.8 seconds.
We'll conduct some exploratory data analysis to identify the job roles and companies with the longest job summaries. This involves operations such as computing the length of each summary (sketched after the merge example below), merging the postings with their summaries, and group-by aggregations.
Here is an example of merging the job postings with their summaries (job_posting_df and job_summary_df are read from the CSV files included in the download):
merged_df = pd.merge(job_posting_df, job_summary_df, on='Job Link', how='left')
This takes 3.2 seconds.
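For the analysis itself we also need a column holding the length of each job summary, which the later group-by aggregates. A minimal sketch, assuming the merged DataFrame has a 'Job Summary' text column (the column names here are illustrative and may differ in your copy of the dataset):

# Character length of each job summary; fillna('') guards against missing summaries
merged_df['Job Summary Length'] = merged_df['Job Summary'].fillna('').str.len()

# Quick look at the rows with the longest summaries
print(merged_df.nlargest(5, 'Job Summary Length')[['Company', 'Job Title', 'Job Summary Length']])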
Now, for the most exciting part: transforming our workflow using cuDF pandas.
First, install cuDF (the cu12 package matches the CUDA 12 runtime on current Colab instances):

!pip install cudf-cu12==24.8.*
After installing the library, load the cuDF pandas extension:
%load_ext cudf.pandas
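To confirm that the accelerator is active, you can re-import pandas and inspect the module object; with cudf.pandas loaded, pandas is replaced by a proxy that dispatches work to cuDF on the GPU (the exact repr text varies by version):

import pandas as pd
# With the extension loaded, this prints the cudf.pandas proxy module rather than plain pandas
print(pd)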
Now, let's read the same job summary CSV file again, using exactly the same pandas code, and see the difference:
# Same pandas call as before; with cudf.pandas loaded it now executes on the GPU
dataframe = pd.read_csv('jobscore_summary.csv')
print(dataframe.head())
The GPU takes just 2.27 seconds to load the dataset, a significant improvement over the roughly one minute pandas needed on the CPU.
After repeating the merge and the summary-length computation with the accelerator loaded, other operations also show significant speed improvements. Here's the same group-by aggregation, now running on the GPU:
# Average job summary length per company and job title
grouped_df = merged_df.groupby(['Company', 'Job Title']).agg({'Job Summary Length': ['mean']})
print(grouped_df)
This operation takes only 460 milliseconds.
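To answer the original question of which companies and roles have the longest summaries, you can sort the aggregated result. A small sketch, assuming the multi-level column produced by the agg call above:

# The agg call above produces a ('Job Summary Length', 'mean') column; sort by it in descending order
longest = grouped_df.sort_values(('Job Summary Length', 'mean'), ascending=False)
print(longest.head(10))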
In summary, cuDF pandas allows you to continue using pandas as your primary DataFrame library but accelerates tasks using GPU when needed. This is an enormous advantage when dealing with large datasets, drastically reducing preprocessing time from minutes to mere seconds.
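If you're curious which parts of a workflow actually ran on the GPU and which fell back to the CPU, the cudf.pandas documentation describes a profiling cell magic. A minimal sketch, assuming a notebook with the extension already loaded (the report format may differ between versions):

%%cudf.pandas.profile
# The profiler report shows, per operation, whether it executed on the GPU or fell back to CPU pandas
merged_df.groupby('Company')['Job Summary Length'].mean()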
For more information, check out the RAPIDS cuDF website.
If you want to dive deeper and see the implementation in action, I highly recommend watching the YouTube video.
Thank you for reading! Have a great day!