ChatGPT: Code Interpreter Plugin – A Data Analyst’s best friend

Lukas Görög

On July 6th, 2023, OpenAI made an official announcement on Twitter regarding the availability of Code Interpreter to users with ChatGPT Plus subscriptions. This new addition holds great potential in expanding the capabilities and functionalities of ChatGPT. We’re going to dive into some interesting tasks we could accomplish using ChatGPT’s code interpreter.

Code Interpreter lets users upload data and execute code, which makes it useful for tasks such as data cleaning, analysis, visualization, and numerous other applications.

Let us explore its practical application through specific tasks. Previously, tasks like these would have required several hours to complete, or even longer, depending on their complexity and on an individual’s familiarity with Python libraries.

With the assistance of ChatGPT’s code interpreter, completing these tasks was considerably streamlined.

In the subsequent sections, I will guide you through a step-by-step breakdown of each task, highlighting the remarkable capabilities of Code Interpreter in various domains, including data comprehension, cleaning and preprocessing, and the creation of data visualizations.

Plotting Data Maps

Our first task was to create a visualization with a map of the USA, marking cities with dots proportional to the number of employees from each city. The data was sourced from 2 CSV files named “Employee_Sample_Data.csv” and “uscities.csv”.

I fed the following prompt to ChatGPT along with the relevant data files.

“Write a Python code that creates a visualization with the USA map and mark cities with dots whose size is proportional to the number of employees from each city. The employees’ data is given in the “Employee_Sample_Data.csv” file. And the USA city coordinates are given in the “uscities.csv” file.”

Once you upload the CSV files one by one using the plus button on the left side of the message area, the code interpreter reads each file and reports the information it contains, such as the employee’s ID, full name, job title, department, business unit, gender, ethnicity, age, hire date, annual salary, bonus percentage, country, city, and exit date.

If you click on the drop-down arrow on the “Show work” tab, you will see the Python code used to read the data from the CSV file, along with its actual results. The nice thing is that the Code Interpreter also corrects the data formats if it finds any issues in them.

import pandas as pd

# Reload the data
df = pd.read_csv('/mnt/data/Employee_Sample_Data.csv', encoding='ISO-8859-1')

# Display the first few rows of the DataFrame
df.head()

Although we encountered some challenges with geocoding the cities due to restrictions in the environment and the geocoding services, manually providing the “uscities.csv” file, which contains the longitude and latitude information, solved the issue. The Code Interpreter then suggested Python code that ran successfully in a Google Colab environment.

The Code Interpreter explains what it does with the code:

Loads the data from the CSV files.
Filters the employee data to only include rows where the country is the United States.
Groups the employee data by city and counts the number of employees in each city.
Merges the employee data with the city coordinates data on the ‘City’/‘city’ columns.
Creates a scatter plot on a map of the USA using Plotly, with the size of each point corresponding to the number of employees in each city.

pip install pandas plotly numpy matplotlib

import pandas as pd
import plotly.express as px
import numpy as np

# Load the data
df = pd.read_csv('Employee_Sample_Data.csv', encoding='ISO-8859-1')
df_cities = pd.read_csv('uscities.csv')

# Filter data for USA only
df_usa = df[df['Country'] == 'United States']

# Group data by city and count the number of employees in each city
df_usa = df_usa.groupby('City').size().reset_index(name='Employees')

# Merge the employee data with the city coordinates data
df_usa = pd.merge(df_usa, df_cities, left_on='City', right_on='city')

# Generate a column of random colors
np.random.seed(0)  # For reproducibility
df_usa['Color'] = np.random.randint(0, 100, df_usa.shape[0])  # Random integers between 0 and 100

# Create the plot
fig = px.scatter_geo(df_usa, lat="lat", lon="lng", size="Employees", color="Color",
                     color_continuous_scale=px.colors.sequential.Plasma, size_max=15, scope='usa')

fig.show()

This is the result:
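A caveat on the merge step above: joining on the city name alone can produce several rows for one city if the coordinates file lists the same name in more than one state. The sketch below shows one possible way to guard against that. It assumes the coordinates table has a "population" column and uses tiny made-up tables rather than the article's data; the function name count_and_locate is hypothetical.

```python
import pandas as pd


def count_and_locate(employees: pd.DataFrame, cities: pd.DataFrame) -> pd.DataFrame:
    """Count US employees per city and attach one coordinate row per city name."""
    usa = employees[employees["Country"] == "United States"]
    counts = usa.groupby("City").size().reset_index(name="Employees")

    # Assumption: the cities table has a "population" column. Keep only the most
    # populous city for each name so the merge cannot multiply rows.
    unique_cities = (
        cities.sort_values("population", ascending=False)
        .drop_duplicates(subset="city")
    )

    # A left join keeps cities that have no coordinates, so gaps stay visible.
    return counts.merge(unique_cities, left_on="City", right_on="city", how="left")


# Tiny made-up tables for illustration only (not the article's data).
employees = pd.DataFrame({
    "City": ["Columbus", "Columbus", "Austin", "Shanghai"],
    "Country": ["United States", "United States", "United States", "China"],
})
cities = pd.DataFrame({
    "city": ["Columbus", "Columbus", "Austin"],
    "state_id": ["OH", "GA", "TX"],
    "lat": [39.96, 32.51, 30.27],
    "lng": [-83.00, -84.87, -97.74],
    "population": [900000, 200000, 950000],
})

located = count_and_locate(employees, cities)
print(located[["City", "Employees", "state_id", "lat", "lng"]])
```

Picking the most populous match is only a heuristic; if the employee data also carried a state column, merging on both city and state would be the safer choice.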

Visualizing Trends in the Data

Best of all, if you have no idea what trends to look for in the data, you can ask Code Interpreter to suggest some.

Create a visual representation of the trends in the data in “Employee_Sample_Data”

ChatGPT Code Interpreter will suggest a list of trends that make sense.

Before we can visualize the trends in this data, we need to understand what specific trends or relationships you’re interested in. Here are a few possible visualizations we could create:

The distribution of employees by department or job title.
The distribution of employees by gender or ethnicity.
The age distribution of employees.
The distribution of annual salaries (perhaps split by department, job title, gender, or ethnicity).
The distribution of bonuses (again, perhaps split by department, job title, gender, or ethnicity).
The number of employees hired each year (i.e., the growth of the company over time).
The number of employees who left the company each year (i.e., turnover over time).

Could you please specify what kind of trends or relationships you’re interested in exploring?

We first focused on the number of employees who left the company each year, a metric also known as turnover. After converting the ‘Exit Date’ to a year format, we created a line plot displaying the turnover trend over the years. This revealed an increase in turnover, particularly noticeable in 2021.

We then turned our attention to the growth of the company, represented by the number of employees hired each year. Again, we converted the ‘Hire Date’ into a year format and plotted the result. This showed that the company had been hiring more employees each year, indicating growth.

Further, we visualized the distribution of employees by department using a pie chart. This enabled us to see at a glance that the IT department had the highest number of employees, followed by Engineering.
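The three summaries described above can be sketched roughly as follows. This is not the code Code Interpreter generated, only a minimal illustration under stated assumptions: the column names ‘Hire Date’, ‘Exit Date’ and ‘Department’ come from the article, while the helper names and the four-row sample table are made up for the example.

```python
import pandas as pd


def yearly_counts(df: pd.DataFrame, date_column: str) -> pd.Series:
    """Return the number of rows per calendar year of date_column, oldest first."""
    # Unparseable or missing dates become NaT and are dropped below.
    dates = pd.to_datetime(df[date_column], errors="coerce")
    years = dates.dt.year.dropna().astype(int)
    return years.value_counts().sort_index()


def plot_trend(counts: pd.Series, title: str) -> None:
    """Draw a simple line plot of yearly counts (requires matplotlib)."""
    import matplotlib.pyplot as plt

    ax = counts.plot(kind="line", marker="o", title=title)
    ax.set_xlabel("Year")
    ax.set_ylabel("Employees")
    plt.show()


# Made-up sample rows for illustration only.
sample = pd.DataFrame({
    "Hire Date": ["2018-05-01", "2019-02-11", "2019-09-30", "2020-01-15"],
    "Exit Date": ["2019-01-10", None, "2021-03-01", "2021-07-15"],
    "Department": ["IT", "IT", "Engineering", "Sales"],
})

turnover = yearly_counts(sample, "Exit Date")      # employees leaving per year
hires = yearly_counts(sample, "Hire Date")         # employees hired per year
by_department = sample["Department"].value_counts()  # input for a pie chart

print(turnover)
print(hires)
print(by_department)
```

Calling plot_trend(turnover, "Turnover per year") would draw the line plot, and by_department.plot(kind="pie") would give a pie chart of the department split.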

You can also ask for the charts as downloadable files, and for a dashboard that brings together all the factors you need to pay attention to.

Word Cloud Generation and Image Manipulation

Another exciting task we embarked on was the generation of a word cloud. We used a text file named “Product_reviews.txt” and asked for an elliptical, colorful, dark-themed word cloud. A word cloud represents the most frequently occurring words in the text, with the size of each word indicating its frequency.

Create an elliptical colorful dark theme word cloud for the uploaded text content “Product_reviews.txt”.

This is the Python code that the Code Interpreter suggests.

from wordcloud import WordCloud, STOPWORDS
from matplotlib import pyplot as plt

# Load the text data
with open('Product_reviews.txt', 'r') as file:
    text = file.read()

# Define the stopwords
stopwords = set(STOPWORDS)

# Create a word cloud
wc = WordCloud(background_color='white',
              max_words=2000,
              contour_width=3,
              contour_color='steelblue',
              collocations=False,
              stopwords=stopwords)

# Generate the word cloud
wc.generate(text)

# Display the word cloud
plt.imshow(wc, interpolation='bilinear')
plt.axis('off')
plt.show()

The code interpreter created a word cloud in the default shape instead, apologized for the limitations of its environment, and suggested trying the code on a local machine.
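On a local machine, one possible way to get closer to the requested elliptical, dark-themed result is to pass a mask to WordCloud, which treats pure white mask pixels as off-limits. The sketch below is an assumption about how this could be done, not the code Code Interpreter produced; the helper names, the image size, and the choice of the "rainbow" colormap are all illustrative.

```python
import numpy as np


def elliptical_mask(width: int = 800, height: int = 500) -> np.ndarray:
    """Return a 2-D uint8 array: 0 inside the ellipse, 255 (masked out) outside."""
    y, x = np.ogrid[:height, :width]
    cx, cy = (width - 1) / 2, (height - 1) / 2
    inside = ((x - cx) / (width / 2)) ** 2 + ((y - cy) / (height / 2)) ** 2 <= 1
    mask = np.full((height, width), 255, dtype=np.uint8)
    mask[inside] = 0
    return mask


def make_elliptical_cloud(text: str, width: int = 800, height: int = 500):
    """Build a dark-background word cloud confined to an ellipse."""
    # Imported here so the mask helper works without the wordcloud package.
    from wordcloud import WordCloud, STOPWORDS

    cloud = WordCloud(
        background_color="black",   # dark theme
        colormap="rainbow",         # colorful words (any matplotlib colormap)
        mask=elliptical_mask(width, height),
        max_words=2000,
        collocations=False,
        stopwords=set(STOPWORDS),
    )
    return cloud.generate(text)


mask = elliptical_mask(400, 200)
print(mask.shape)
```

With the text loaded as in the listing above, make_elliptical_cloud(text).to_file("cloud.png") would write the image to disk.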

Image Manipulation

In addition to creating word clouds, we demonstrated the ability to manipulate images. We uploaded an original image created with Midjourney to the Code Interpreter, cropped it to remove the elements on the sides while keeping the middle part intact, and converted it to black and white while keeping a single colour unchanged. This demonstrated how one can use ChatGPT as a tool for image manipulation.

The results were effective, but these are only the simplest edits you can make to an image; try it yourself with more manipulations.
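The article does not show the code behind these edits, so the following is only a sketch of how the two operations could be done with NumPy and Pillow. The function names, the tolerance value, and the approach of measuring distance to a target colour in RGB space are assumptions chosen for illustration.

```python
import numpy as np


def crop_center(rgb: np.ndarray, keep_fraction: float = 0.5) -> np.ndarray:
    """Keep the central vertical band of an (H, W, 3) image, dropping both sides."""
    width = rgb.shape[1]
    keep = max(1, int(round(width * keep_fraction)))
    left = (width - keep) // 2
    return rgb[:, left:left + keep]


def keep_single_color(rgb: np.ndarray, target, tolerance: float = 60.0) -> np.ndarray:
    """Turn the image grayscale except for pixels close to the target RGB colour."""
    rgb_f = rgb.astype(np.float64)
    # Standard luma weights for converting RGB to gray.
    gray = rgb_f @ np.array([0.299, 0.587, 0.114])
    gray_rgb = np.repeat(gray[..., None], 3, axis=2)
    distance = np.linalg.norm(rgb_f - np.array(target, dtype=np.float64), axis=2)
    keep = distance <= tolerance
    out = np.where(keep[..., None], rgb_f, gray_rgb)
    return out.round().astype(np.uint8)


def process_file(src: str, dst: str, target, keep_fraction: float = 0.5) -> None:
    """Load an image, apply both edits, and save it (requires Pillow)."""
    from PIL import Image

    rgb = np.asarray(Image.open(src).convert("RGB"))
    result = keep_single_color(crop_center(rgb, keep_fraction), target)
    Image.fromarray(result).save(dst)


# A 1x4 test image: red, blue, red, blue.
pixels = np.array([[[255, 0, 0], [0, 0, 255], [255, 0, 0], [0, 0, 255]]], dtype=np.uint8)
result = keep_single_color(crop_center(pixels, 0.5), target=(255, 0, 0))
print(result.tolist())
```

A fixed RGB distance is a crude way to pick out one colour; for real photos, matching on hue in HSV space tends to be more forgiving of lighting differences.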

Creating QR Codes

Another cool thing that you can do with ChatGPT Code Interpreter is create QR codes. Here we created a QR code that directs you to the “Predictea” website.

Create a QR code to https://www.predictea.com.
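The generated code is not reproduced in the article. For readers who want to do the same thing locally, here is a minimal sketch using the third-party qrcode package (installable with pip install "qrcode[pil]"); whether Code Interpreter used this package is an assumption, and the helper names and the output file naming scheme are made up for the example.

```python
from urllib.parse import urlparse


def output_name(url: str) -> str:
    """Derive a file name such as 'www_predictea_com_qr.png' from a URL."""
    host = urlparse(url).netloc
    return f"{host.replace('.', '_')}_qr.png"


def save_qr_code(url: str, path: str = "") -> str:
    """Encode url as a QR code image and save it; returns the file path."""
    # Imported here so output_name works without the qrcode package installed.
    import qrcode

    qr = qrcode.QRCode(
        error_correction=qrcode.constants.ERROR_CORRECT_M,
        box_size=10,  # pixels per QR module
        border=4,     # quiet zone width in modules
    )
    qr.add_data(url)
    qr.make(fit=True)  # pick the smallest QR version that fits the data
    image = qr.make_image(fill_color="black", back_color="white")

    path = path or output_name(url)
    image.save(path)
    return path


name = output_name("https://www.predictea.com")
print(name)
```

Calling save_qr_code("https://www.predictea.com") would then write the image under that name in the current directory.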

Conclusions

Through these tasks, we’ve seen that OpenAI’s ChatGPT Code Interpreter plugin is more than just a text generator and code creator. It’s a powerful tool that can execute Python code. Most importantly, you can upload code and ask it to explain that code, recommend improvements for better efficiency, and correct errors, which makes it incredibly useful for tasks such as data analysis, visualization, and more. It has potential applications in various fields, from business analytics to academic research and beyond. And with its continuous improvements and updates, we’re excited to see what ChatGPT will be capable of in the future.