ChatGPT for Data Science and Data Analysis

admin2

Welcome to the ultimate ChatGPT for data science tutorial. In this tutorial, you are going to learn about ChatGPT, how it works, and most importantly, how you can use it to make data science easier, faster, and more efficient. We are going to make ChatGPT write SQL queries, analyze data using Python, and even train machine learning models.

So what is ChatGPT? ChatGPT is an advanced language model that can understand and generate text. You can use it to create content, write articles and emails, and even write and explain code. We can use it to generate data, write unit tests, and train machine learning models.

Now let's move on and see ChatGPT in action. Head over to chat.openai.com, and if you don't have an account, sign up; it won't take a minute. Once you've logged in, you're going to see the main screen with an input box to talk to ChatGPT. Here's my first prompt: "List the top 10 free courses for machine learning." As you can see, it listed some of the best machine learning courses out there.

We can also ask it questions about the answers it produced previously. Let's ask it, "What were the key takeaways from the third course you listed above?" It was able to note that the third course mentioned above was Machine Learning by Andrew Ng. Not only that, it listed the key takeaways correctly. The takeaways it mentioned are exactly what you can expect from this course.

Now let's use ChatGPT for data science. We start by loading a dataset into ChatGPT. We're going to get our dataset from W3Schools, so head over to w3schools.com/sql. Click on the "Try it Yourself" button. It'll bring you to a page where you can write SQL queries. Change the Customers table to the Products table and run the SQL statement.

You will get the query results below. Just copy the first few rows, including the headers. Then go to ChatGPT, tell it that this is the Products table, and paste what you've copied. It was able to understand that this is a list of products, and it was able to detect the column names as well. It detected the product ID, product name, supplier ID, category ID, unit, and price. The columns are self-explanatory, and we'll be using them to write SQL queries. Let's tell it to convert the data above to a tabular format so it's easier to read. ChatGPT generated a beautiful-looking table for us. Now we can ask it questions about the dataset.

Let's ask it, "What is the product that has the highest price?" The answer is Mishi Kobe Niku, with a price of 97. Let's see if that's correct. Indeed, it's correct. Let's also ask it, "What is the product that has the lowest price?" It's Aniseed Syrup, with a price of 10. Let's tell it to calculate the average price for the products above. Not only did it show the average, it showed how to calculate it as well. If you ask it again, it might even show you an SQL query.
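Answers like these are easy to double-check locally. The following is a minimal sketch, assuming the copied rows are tab-separated with a header line; the four rows are illustrative sample data in the shape of the Products table, not a verified copy of what was pasted in the video.

```python
import csv
import io
from statistics import mean

# Illustrative rows shaped like the pasted Products table (tab-separated,
# header first). Sample data only; the real copy would have more rows.
pasted = (
    "ProductID\tProductName\tSupplierID\tCategoryID\tUnit\tPrice\n"
    "1\tChais\t1\t1\t10 boxes x 20 bags\t18\n"
    "2\tChang\t1\t1\t24 - 12 oz bottles\t19\n"
    "3\tAniseed Syrup\t1\t2\t12 - 550 ml bottles\t10\n"
    "9\tMishi Kobe Niku\t4\t6\t18 - 500 g pkgs.\t97\n"
)

# Parse the text into a list of dicts keyed by the header names.
rows = list(csv.DictReader(io.StringIO(pasted), delimiter="\t"))
for row in rows:
    row["Price"] = float(row["Price"])

most_expensive = max(rows, key=lambda r: r["Price"])
cheapest = min(rows, key=lambda r: r["Price"])
average_price = mean(r["Price"] for r in rows)

print(most_expensive["ProductName"], most_expensive["Price"])
print(cheapest["ProductName"], cheapest["Price"])
print(average_price)
```

The average here is simply the sum of the prices divided by the number of rows, which is the same calculation ChatGPT spelled out.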

Speaking of SQL queries, let's make it write some. Let's ask it to write a query that gets the product with the highest price. The query looks good. It orders the products by price in descending order and then limits the results to one, thus effectively getting the product with the highest price. Let's also tell it to get the product with the lowest price. Like the query above, it did the same thing, but it ordered the price in ascending order.
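The two queries described above might look roughly like the sketch below. It runs them against an in-memory SQLite database through Python's built-in `sqlite3` module; the table contents are illustrative, and the exact SQL that ChatGPT produced in the video is not shown in the transcript, so treat the query text as an assumption.

```python
import sqlite3

# In-memory database with a few illustrative product rows.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Products (ProductID INTEGER, ProductName TEXT, Price REAL)"
)
conn.executemany(
    "INSERT INTO Products VALUES (?, ?, ?)",
    [
        (1, "Chais", 18),
        (2, "Chang", 19),
        (3, "Aniseed Syrup", 10),
        (9, "Mishi Kobe Niku", 97),
    ],
)

# Highest price: sort descending and keep only the first row.
highest = conn.execute(
    "SELECT ProductName, Price FROM Products ORDER BY Price DESC LIMIT 1"
).fetchone()

# Lowest price: same idea, sorted ascending.
lowest = conn.execute(
    "SELECT ProductName, Price FROM Products ORDER BY Price ASC LIMIT 1"
).fetchone()

print(highest)
print(lowest)
```

Note that `LIMIT 1` returns a single row even if several products tie for the highest or lowest price.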

Let's also tell it to calculate the average product price. The query is straightforward, and it's using the AVG aggregation function. Now let's make things a little bit harder for ChatGPT. Let's import a couple more tables and ask it questions that are only possible to answer by joining tables together. Let's head back to W3Schools and get the OrderDetails table, and let's import it into ChatGPT like we did with the Products table. Now let's do that for a couple more tables. Let's do it for the Orders table. The last table is going to be the Suppliers table.

Now let's ask it to calculate the average product price per supplier. This will require ChatGPT to join the Products table and the Suppliers table together. Let's see how it does. It was able to join the Products table and the Suppliers table on the supplier ID column, and it was able to use the AVG aggregation function with a GROUP BY on the supplier name. This query looks correct, but let's copy it and make sure that it runs correctly.
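As a rough illustration of the join just described, here is a sketch using Python's `sqlite3` module. The supplier and product names are hypothetical placeholders, and the query is one plausible shape for "average price per supplier", not necessarily the exact statement ChatGPT generated.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Suppliers (SupplierID INTEGER PRIMARY KEY, SupplierName TEXT);
CREATE TABLE Products (ProductID INTEGER PRIMARY KEY, ProductName TEXT,
                       SupplierID INTEGER, Price REAL);
-- Hypothetical sample rows, for illustration only.
INSERT INTO Suppliers VALUES (1, 'Supplier A'), (2, 'Supplier B');
INSERT INTO Products VALUES
    (1, 'Product 1', 1, 10.0),
    (2, 'Product 2', 1, 20.0),
    (3, 'Product 3', 2, 40.0);
""")

# Join each product to its supplier, then average the prices per supplier.
query = """
SELECT s.SupplierName, AVG(p.Price) AS AveragePrice
FROM Products AS p
JOIN Suppliers AS s ON p.SupplierID = s.SupplierID
GROUP BY s.SupplierName
ORDER BY s.SupplierName
"""
avg_by_supplier = conn.execute(query).fetchall()
print(avg_by_supplier)
```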

Just copy the SQL code, then head over to W3Schools again and paste the query. Not only did it run without any errors, the results look correct as well. Now let's make ChatGPT do a small calculation. We can ask it to write an SQL statement that gets the product that achieved the highest revenue. This would require it to join the Products table and the OrderDetails table, and then understand that revenue is price times quantity. You can see that it understood that revenue equals price times quantity, although I never explicitly mentioned that.
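The revenue calculation can be sketched as follows, again with `sqlite3` and hypothetical sample rows. Revenue per product is taken as the sum of price times quantity over all of its order lines; the table and column names follow the ones mentioned in the tutorial, but the query itself is an assumption about what such a statement could look like.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Products (ProductID INTEGER PRIMARY KEY, ProductName TEXT,
                       Price REAL);
CREATE TABLE OrderDetails (OrderDetailID INTEGER PRIMARY KEY, OrderID INTEGER,
                           ProductID INTEGER, Quantity INTEGER);
-- Hypothetical sample rows, for illustration only.
INSERT INTO Products VALUES (1, 'Product 1', 10.0), (2, 'Product 2', 50.0);
INSERT INTO OrderDetails VALUES
    (1, 100, 1, 12),
    (2, 100, 2, 2),
    (3, 101, 1, 3);
""")

# Revenue = price * quantity, summed per product; keep the top product.
query = """
SELECT p.ProductName, SUM(p.Price * od.Quantity) AS Revenue
FROM OrderDetails AS od
JOIN Products AS p ON p.ProductID = od.ProductID
GROUP BY p.ProductID, p.ProductName
ORDER BY Revenue DESC
LIMIT 1
"""
top_product = conn.execute(query).fetchone()
print(top_product)
```

In this sample, Product 1 sells 15 units at 10 each (150) and Product 2 sells 2 units at 50 each (100), so the cheaper product earns more revenue.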

Also, it was able to join the Products table and the OrderDetails table together correctly. Now let's make ChatGPT write an SQL statement that joins three tables. We can ask it to get the employee that made the highest sales from the tables above. This will require it to join the Orders table, the OrderDetails table, and the Products table together. It was able to join all three. And not only that, the SQL statement looks syntactically correct, but let's see if it runs. It ran without any errors, and it actually got the ID of the employee with the highest sales. Let's make ChatGPT use window functions and subqueries. We can simply ask it a variation of the question above: get the employee that made the second-highest sales.
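Both employee questions can be sketched with one shared building block. The example below is an assumption about how such queries could be written, using a common table expression and the `RANK()` window function; it uses hypothetical sample rows and requires an SQLite build with window function support (version 3.25 or newer).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Products (ProductID INTEGER PRIMARY KEY, Price REAL);
CREATE TABLE Orders (OrderID INTEGER PRIMARY KEY, EmployeeID INTEGER);
CREATE TABLE OrderDetails (OrderDetailID INTEGER PRIMARY KEY, OrderID INTEGER,
                           ProductID INTEGER, Quantity INTEGER);
-- Hypothetical sample rows, for illustration only.
INSERT INTO Products VALUES (1, 10.0), (2, 50.0);
INSERT INTO Orders VALUES (100, 1), (101, 2), (102, 3);
INSERT INTO OrderDetails VALUES
    (1, 100, 1, 5), (2, 100, 2, 1), (3, 101, 2, 4), (4, 102, 1, 3);
""")

# Shared CTE: total sales per employee across the three joined tables.
sales_cte = """
WITH EmployeeSales AS (
    SELECT o.EmployeeID, SUM(p.Price * od.Quantity) AS TotalSales
    FROM Orders AS o
    JOIN OrderDetails AS od ON od.OrderID = o.OrderID
    JOIN Products AS p ON p.ProductID = od.ProductID
    GROUP BY o.EmployeeID
)
"""

# Highest sales: sort the totals and keep the first row.
top_employee = conn.execute(
    sales_cte
    + "SELECT EmployeeID, TotalSales FROM EmployeeSales "
      "ORDER BY TotalSales DESC LIMIT 1"
).fetchone()

# Second-highest sales: rank the totals, then filter on rank 2.
second_employee = conn.execute(
    sales_cte
    + """
    , Ranked AS (
        SELECT EmployeeID, TotalSales,
               RANK() OVER (ORDER BY TotalSales DESC) AS SalesRank
        FROM EmployeeSales
    )
    SELECT EmployeeID, TotalSales FROM Ranked WHERE SalesRank = 2
    """
).fetchone()

print(top_employee, second_employee)
```

One design caveat: if two employees tie for first place, `RANK()` assigns them both rank 1 and skips rank 2, so the second query would return nothing. `DENSE_RANK()` avoids the gap if that behavior is preferred.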

Wow. Not only was it able to use the RANK window function, it was also able to put the subquery into a common table expression so that the query looks neat.

Now let's use ChatGPT to analyze some data in Python. For this tutorial, let's use Kaggle's Heart Attack Analysis and Prediction dataset. I'll leave the dataset link in the description below. Let's download the dataset by clicking the button in the upper right corner. You'll get a file called heart.csv. Now let's copy the first few rows to import them into ChatGPT. We can give it a short sentence like "This is a heart attack dataset" and then paste in the data. ChatGPT was able to understand that this is a heart attack dataset, and it was also able to list a couple of the columns. It also understood that the output column holds the information of whether or not a patient had a heart attack.

Now let's ask ChatGPT to write a Python program that reads the dataset, gets the data type of each column, gets the summary statistics for the dataset, and drops any duplicate rows. We can see that ChatGPT used pandas to read the dataset and to do the rest of the tasks. This script tackles all the points that we mentioned above. Also, it looks syntactically correct. Let's copy the code and paste it into a Python notebook to see if it works.
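A program along those lines could look like the minimal sketch below. To keep it self-contained, it reads a tiny inline CSV instead of the downloaded file; the column names and values are made-up stand-ins (assumptions), with one row deliberately duplicated so the last step has something to remove.

```python
import io

import pandas as pd

# Tiny inline stand-in for the downloaded CSV. Column names and values are
# assumptions for illustration; the last two rows are intentional duplicates.
csv_text = """age,sex,cp,trtbps,output
63,1,3,145,1
37,1,2,130,1
41,0,1,130,1
57,0,0,120,0
57,0,0,120,0
"""

# With the real file this would be pd.read_csv("heart.csv").
df = pd.read_csv(io.StringIO(csv_text))

print(df.dtypes)      # data type of each column
print(df.describe())  # count, mean, std, min, percentiles, max

shape_before = df.shape
deduped = df.drop_duplicates()  # drop rows that repeat an earlier row exactly
print(shape_before, deduped.shape)
```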

We'll import pandas, then read in the dataset. We just need to change the CSV name to read the correct file. ChatGPT used dtypes to show the data types of the columns. Then it used the describe function to print out summary statistics of the dataset. The describe function presents all sorts of statistics: the count, the mean, the standard deviation, the minimum, and the maximum, as well as the percentiles for each column. This will give you an idea of how each column is distributed. Let's print out the shape of the data frame. You see that it has 303 rows and 14 columns. Then let's drop all the duplicates and print out the shape again. Now it has 302 rows. That means one duplicate row was removed. ChatGPT is doing well so far, but let's see how it performs when we ask it to create some univariate analysis.

Let's ask it to create a visualization with the proportions for all categorical columns. Let's copy and paste the prompt we used above and change the second and third points. Let's ask it to put all categorical columns in a list manually. This is because I want to choose the categorical columns myself and not depend on the data types, since they are all numerical. Then I'll tell it to plot the proportions of the different values for every categorical column we have. We can see that it generated a list with the correct categorical column names. Then it looped over each column name in the list, and it was able to use the value_counts function and the plot function to plot the proportions.
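The loop described above can be sketched like this. It computes the numbers behind each plot with `value_counts(normalize=True)`; the data frame is a made-up stand-in and the column names are assumptions. Calling `.plot(kind="bar")` on each result would draw it, provided matplotlib is installed.

```python
import pandas as pd

# Made-up stand-in data; the column names are assumptions for illustration.
df = pd.DataFrame({
    "sex": [1, 1, 1, 0],
    "cp": [0, 0, 2, 3],
    "output": [1, 0, 1, 0],
})

# Categorical columns are listed by hand because every column is numeric,
# so the dtypes alone cannot tell categories from measurements.
categorical_columns = ["sex", "cp", "output"]

proportions = {}
for column in categorical_columns:
    # normalize=True returns shares (summing to 1) instead of raw counts.
    proportions[column] = df[column].value_counts(normalize=True)
    print(proportions[column])
    # proportions[column].plot(kind="bar") would draw the chart.
```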

Let's copy the code and paste it into the notebook. Let me import matplotlib at the beginning of the file and then run the code. We can see that 70% of our dataset has a sex value of one. We can also scroll down to see the proportions for each of the other columns in our dataset. The output column is important to look at.

It'll determine whether your dataset is skewed or not. This is not a skewed dataset, and this makes training a machine learning model a little bit easier. Next, let's ask it for the distributions of all numerical columns. Let's copy and paste the prompt above and change the categorical columns to numerical columns and the proportions to distributions. You can see that ChatGPT also chose the numerical columns correctly and was able to use the plot function with the kind set to hist to produce frequency plots. For example, age looks more or less normally distributed, with an average age of 55.

Then we can see that the resting blood pressure is skewed to the right a little bit. You can take a look at the other columns as well. Instead of histograms, let's say I want box plots, since box plots can be easier to interpret. All I have to do is copy and paste the prompt above and change distributions to box plots. Let's copy the code and paste it into the notebook. Now we can see the box plots for each column.
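For reference, a histogram and a box plot for each numerical column could be produced along these lines. This is a sketch under assumptions: the data is made up, the column names are placeholders, and it needs pandas and matplotlib installed. The off-screen backend is only there so the snippet runs without a display; in a notebook the figures appear inline.

```python
import matplotlib

matplotlib.use("Agg")  # render off-screen so the sketch runs without a display

import matplotlib.pyplot as plt
import pandas as pd

# Made-up stand-in data; the column names are assumptions for illustration.
df = pd.DataFrame({
    "age": [29, 41, 45, 52, 54, 57, 58, 63, 67, 77],
    "trtbps": [110, 120, 120, 125, 130, 130, 138, 140, 150, 180],
})
numerical_columns = ["age", "trtbps"]

# One row of plots per column: histogram on the left, box plot on the right.
fig, axes = plt.subplots(len(numerical_columns), 2, figsize=(8, 6))
for row, column in enumerate(numerical_columns):
    df[column].plot(kind="hist", ax=axes[row, 0], title=f"{column} histogram")
    df[column].plot(kind="box", ax=axes[row, 1], title=f"{column} box plot")
fig.tight_layout()
```

Switching between the two chart types is just a matter of changing the `kind` argument, which mirrors changing one word in the prompt.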

Now let's try bivariate analysis. Let's start by telling it to generate a heat map. Let's copy and paste the prompt above and change the box plot to a heat map. We can see that it used seaborn for this visualization, so let's copy the code and paste it into the notebook. As a rule of thumb, the absolute value of a weak correlation lies between 0 and 0.3, a medium correlation is between 0.3 and 0.7, and a strong correlation is above 0.7. As you can see here, we have no strong correlations between columns.

Let's tell it to generate proportions with regard to the output column. So I want to have the proportions of a column when the output is zero, and another set of proportions when the output is one. When the proportions are different, this might indicate that this is a valuable feature for predicting the output column. We can copy and paste one of the prompts above and change the second point to listing all categorical columns except the output column. Then we want to plot the proportions of all categorical columns for each value in the output column.

It used the same code that it used for the proportion graphs, but inside a double for loop, where the second for loop goes over the values of the output column. Let's copy and paste the code into the notebook. You can see that the first graph has the proportions of sex when the output is zero, and the second graph shows the proportions of sex when the output is one. But going back and forth to compare the two graphs is quite cumbersome, so let's ask it to merge those two graphs into one. Effectively, we're going to have one graph per column. If we copy and paste the prompt above and add that each categorical column should be in one graph, then it should produce the correct results. Now it generated slightly more complicated code, but the output is worth it. I don't like the stacked bar graph, so I'm just going to change the stacked attribute to False. There we go. This is way better than what we had before.
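One compact way to get the same "one graph per column" view is a normalized cross-tabulation. This is a swapped-in technique, not the code ChatGPT generated in the video, and the data and column names below are made up for illustration.

```python
import pandas as pd

# Made-up stand-in data; the column names are assumptions for illustration.
df = pd.DataFrame({
    "cp": [0, 0, 0, 1, 1, 2, 2, 2],
    "output": [0, 0, 0, 0, 1, 1, 1, 1],
})

# Rows are the values of cp, columns are the values of output.
# normalize="columns" makes each output column sum to 1, so every cell is
# the share of that cp value among patients with that output.
proportions = pd.crosstab(df["cp"], df["output"], normalize="columns")
print(proportions)

# proportions.plot(kind="bar", stacked=False) would draw side-by-side bars
# (requires matplotlib), matching the unstacked chart described above.
```

Reading across a row shows at a glance whether a category is more common in one output group than the other.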

Now we can compare proportions without scrolling back and forth. We can see that when chest pain is zero, the output is most probably going to be zero; thus, the patient will most likely not have a heart attack. We can do the same analysis for the rest of the columns.

Now let's create distributions with regard to the output column. It's going to be the same as the proportions, but rather than plotting the proportions, we're going to be plotting the distributions. Copy and paste the prompt above and change categorical to numerical and proportions to distributions. Everything else should stay the same. You can see that the generated code looks very familiar. We can see that patients with a heart attack have a higher resting blood pressure than people who didn't have a heart attack. You can check the other columns as well. Now let's generate box plots with regard to the output column.

We'll just copy and paste the prompt above and change distributions to box plots. Then we copy the code and paste it into the notebook. With box plots, you can compare the two distributions more clearly. Let's also create a pair plot for all numerical columns using seaborn. The code looks easy and straightforward, so just copy it and paste it into the notebook. The pair plot is an efficient way to see scatter plots for all combinations of your columns. Lastly, let's train a heart attack prediction model.

Let's tell ChatGPT to write a Python program that reads the dataset, trains a model that predicts whether a patient had a heart attack, and evaluates the model using scikit-learn's classification report. The code seems on point. It first determined the feature columns, then it separated the inputs and the outputs of the model. Then it split the data into training and testing sets using the train_test_split function. Then it trained scikit-learn's logistic regression model, and then it evaluated it on the testing set using the classification report. Let's copy the code and paste it into the notebook. Let's run the imports first. Then let's define the X and y variables. Let's then split our data into training and testing sets.
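The steps just listed map onto scikit-learn roughly as in the sketch below. To stay self-contained it trains on synthetic data from `make_classification` rather than the heart attack file, so the scores it prints say nothing about the real dataset; the 80/20 split, the seeds, and `max_iter=1000` are assumptions, not values taken from the video.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 300 rows, 13 feature columns, binary target.
# With the real data, X would be the feature columns and y the output column.
X, y = make_classification(n_samples=300, n_features=13, random_state=0)

# Hold out 20% of the rows for testing (split size is an assumption).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Fit a logistic regression classifier on the training rows.
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Predict on the held-out rows and summarize precision, recall, and F1.
y_pred = model.predict(X_test)
report = classification_report(y_test, y_pred)
print(report)
```

Evaluating on rows the model never saw during training is what makes the reported F1 score a fair estimate rather than a measure of memorization.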

Then let's fit our model and evaluate it. Using the classification report, we see an F1 score of 83%, which is not bad at all. This is actually impressive, considering that we didn't code anything. So that's it, guys. There are endless ways to get creative and use ChatGPT. I hope you found this video helpful. If you enjoyed it, please give it a like and subscribe for more videos like this.
