Pandas dataframe.corr() stripping columns from input - python

I'm learning how to use the pandas library in python3 and I've run into an issue with dataframe.corr()
Here's an example of my dataset
Date,Gender,Age at Booking,Current Age
2015-12-23,M,21,22
2015-12-23,M,25,25
2015-12-23,M,37,37
2015-12-23,F,39,40
2015-12-23,M,24,24
And here is how I attempt to load it/transform it
crime_data = pd.read_csv(crime_data_s)
print(crime_data.head())
print(crime_data['Date'])
correlated_data = crime_data.corr()
print(correlated_data)
Printing the head of crime_data shows the 4 columns with their data, and accessing the 'Date' column and printing its values works just as expected. However, when crime_data.corr() is called and I print the result, it has stripped out everything except "Age at Booking" and "Current Age", making it shape 2x2.
Calling the dataframe.info() method, I can see that the Date and Gender columns are labeled as object rather than a numeric dtype. What can be done to fix this so that I can run a correlation on the full data?

# encode the non-numeric columns as integer category codes
data['Gender'] = data['Gender'].astype('category').cat.codes
data['Date'] = data['Date'].astype('category').cat.codes
data.corr()
Output
        Date    Gender       Age    curage
Date     NaN       NaN       NaN       NaN
Gender   NaN  1.000000 -0.162804 -0.703474
Age      NaN -0.162804  1.000000  0.814425
curage   NaN -0.703474  0.814425  1.000000

It is because .corr() works only with numeric columns. You need to replace the values M and F with numbers, for instance:
crime_data['Gender'] = crime_data['Gender'].replace({'M': 1, 'F': 0})
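The Date row above is all NaN because, in the sample, every row shares the same date: after cat.codes the column is constant, and a zero-variance column has no defined correlation. A fuller sketch of the conversion (my own suggestion, assuming the sample above is saved as crime.csv): encode Gender as 0/1 and turn Date into a real datetime, then into an integer ordinal, so corr() can use it on a dataset where the dates actually vary.
import pandas as pd

crime_data = pd.read_csv('crime.csv')  # assumed path to the sample data above

# map the two gender labels to numbers
crime_data['Gender'] = crime_data['Gender'].map({'F': 0, 'M': 1})
# parse the dates and convert each to its proleptic Gregorian ordinal
crime_data['Date'] = pd.to_datetime(crime_data['Date']).map(pd.Timestamp.toordinal)

print(crime_data.corr())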

Related

Problem with categorization in pandas/OrdinalEncoder

I'm having some problems categorizing a column using pandas and OrdinalEncoder.
What I have to do is basically convert a column to a categorical dtype (so that I can use OrdinalEncoder afterwards), but everything I try either doesn't work or returns NaN.
What I tried is the following:
df['Education'] is a column containing the degrees obtained by the people in the sample.
from pandas.api.types import CategoricalDtype
ordcat = CategoricalDtype(
    categories=['Preschool', '1st-4th', '5th-6th', '7th-8th', '9th', '10th',
                '11th', '12th', 'HS-grad', 'Prof-school', 'Assoc-acdm',
                'Assoc-voc', 'Some-college', 'Bachelors', 'Masters',
                'Doctorate'],
    ordered=True)
df['Education'] = df['Education'].astype(ordcat)
print(df['Education'])
The output is the following:
0        NaN
1        NaN
2        NaN
3        NaN
4        NaN
        ...
32556    NaN
32557    NaN
32558    NaN
32559    NaN
32560    NaN
Name: Education, Length: 32561, dtype: category
Which isn't what I need.
I also tried a few other approaches, but they didn't give me anything other than errors about Series being mutable, or NaNs again.
It's been about 4 days and I can't figure this out; do you have any idea what I'm missing?
Thanks in advance for your help.
Edit_0: The dataset I'm using is from the UCI Machine Learning Repository (http://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data), which I already stripped, putting np.nan in place of ?.
How do you declare df? Maybe you can try creating the Series first and naming it Education, then executing df['Education'] = df['Education'].astype(ordcat).
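A debugging sketch of my own (not part of the answer above): astype with a CategoricalDtype turns any value that is not an exact string match to a declared category into NaN, and the raw adult.data values carry a leading space (' Bachelors') unless they have been stripped. Listing the values that fall outside the declared categories usually exposes the mismatch; the toy rows below are made up to mimic that situation.
import pandas as pd
from pandas.api.types import CategoricalDtype

# toy frame mimicking unstripped adult.data values
df = pd.DataFrame({'Education': [' Bachelors', ' HS-grad', ' Masters']})

ordcat = CategoricalDtype(
    categories=['Preschool', '1st-4th', '5th-6th', '7th-8th', '9th', '10th',
                '11th', '12th', 'HS-grad', 'Prof-school', 'Assoc-acdm',
                'Assoc-voc', 'Some-college', 'Bachelors', 'Masters',
                'Doctorate'],
    ordered=True)

# show whatever is in the column but not in the category list
print(set(df['Education'].dropna().unique()) - set(ordcat.categories))

# strip the stray whitespace, then the conversion sticks
df['Education'] = df['Education'].str.strip().astype(ordcat)
print(df['Education'])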

Python pandas .join() returns NaN in the column I joined

First of all I'll introduce what the post is all about.
I am learning NN from the book "Machine Learning with SciKit and TensorFlow" (a loose translation from my native language).
In the second chapter the author presents a NN that predicts housing prices from various inputs.
After completing this chapter I decided I wanted to see the results on a chart and compare them to the training data. To do this I needed to convert the numpy predictions array to a pandas dataframe and then join the predictions with the testing data.
But for some reason, when I use plot_test_data = plot_test_data.join(predicted_data_frame)
on the converted numpy matrix, the resulting column in the pandas dataframe object consists of NaNs.
The testing data consists of thousands of samples, from which 5 are selected randomly because of shuffling.
Dtypes of the test data:
test_data.dtypes:
longitiude float64
latitiude float64
housing median age float64
count of rooms float64
count of bedrooms float64
population float64
families float64
median earnings float64
distance to ocean object
dtype: object
From this test data the predictions are made using a SciKit linear regression model; the resulting array is a numpy array.
predictions:
[ 85657.90192014 305492.60737488 152056.46122456 186095.70946094
244550.67966089]
And now I am converting this array using pandas .DataFrame() function like this:
# since 'distance to ocean' is not a numerical value I drop it for plotting purposes
plot_test_data = test_data.drop('distance to ocean', axis=1)
predicted_data_frame = pd.DataFrame(predictions.T, columns=['housing median'])
# and then I join predicted_data_frame and plot_test_data:
plot_test_data = plot_test_data.join(predicted_data_frame)
But the resulting column in the merged dataframe consists of NaNs, even though the housing median column shows as float64:
plot_test_data:
longitiude float64
latitiude float64
housing median age float64
count of rooms float64
count of bedrooms float64
population float64
families float64
median earnings float64
housing median float64
dtype: object
I don't know how to fix this, and honestly I think it's important for me to know how to visualise the predictions of my future models so they can prove useful in whatever they'll do. I've searched Google for similar problems, but I feel like I didn't find the answer (or didn't understand it).
So I'd really appreciate your help.
Thank you in advance :)
I tried my best to describe what the problem is about; I hope it's understandable.
Edit:
Okay, the join() function needs a 'key' so I did this:
plot_test_data = test_data.drop('distance to ocean', axis=1)
predicted_data_frame = pd.DataFrame(predictions.T, columns=['housing median'])
# new object meant to hold the unique key for the merge/join
add_to_plot_test_data = test_data[['longitiude']].copy()
# since it does not have the 'key' needed for join or merge, I used concat()
add_to_plot_test_data = pd.concat([add_to_plot_test_data, predicted_data_frame], axis=1)
# merging the two dataframes to get the 'housing median' column
plot_test_data = plot_test_data.merge(add_to_plot_test_data, on='longitiude', how='outer')
But unfortunately it didn't work at all; the result was:
#######
add_to_plot_test_data
#######
       longitiude  housing median
0             NaN    85657.901920
1             NaN   305492.607375
2             NaN   152056.461225
3             NaN   186095.709461
4             NaN   244550.679661
2908      -119.04             NaN
12655     -121.46             NaN
14053     -117.13             NaN
15502     -117.23             NaN
20496     -118.70             NaN
The problem is I don't know how to "join" columns to these rows.
My original suggestion of just adding predictions as a column will work just fine:
import pandas as pd
x = [2908, 12655, 14053, 15502, 20496]
y = [-119.04, -121.46, -117.13, -117.23, -118.70]
z = [85658.9, 305492.6, 152056.4, 186095.7, 244550.7]
# build the frame with the test rows' labels as the index,
# then assign the predictions directly as a new column
df1 = pd.DataFrame(y, columns=['longitude'], index=x)
print(df1)
df1['housing median'] = z
print(df1)
Output:
       longitude
2908    -119.040
12655   -121.460
14053   -117.130
15502   -117.230
20496   -118.700

       longitude  housing median
2908    -119.040         85658.9
12655   -121.460        305492.6
14053   -117.130        152056.4
15502   -117.230        186095.7
20496   -118.700        244550.7
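For completeness, my reading of why the original join produced NaNs (an assumption, not part of the answer above): DataFrame.join aligns rows by index label, and the shuffled test rows keep labels like 2908 and 12655 while pd.DataFrame(predictions) gets a fresh 0..4 index, so nothing lines up. Building the predictions frame with the test set's own index makes the join work. The values below are toy stand-ins for the book's variables:
import numpy as np
import pandas as pd

test_data = pd.DataFrame(
    {'longitiude': [-119.04, -121.46, -117.13, -117.23, -118.70]},
    index=[2908, 12655, 14053, 15502, 20496])
predictions = np.array([85657.90, 305492.61, 152056.46, 186095.71, 244550.68])

# give the predictions the SAME row labels as the test data,
# so join can align them instead of producing NaNs
predicted_data_frame = pd.DataFrame(
    predictions, index=test_data.index, columns=['housing median'])
plot_test_data = test_data.join(predicted_data_frame)
print(plot_test_data)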

fillna() is not replacing NaN values even after using inplace=True

I am trying to replace the NaN values in the 'price' column of my dataset. I tried using:
avg_price = car.groupby('make')['price'].agg(np.mean) # calculating average value of the price of each car company model
new_price= car['price'].fillna(avg_price,inplace=True)
car['price']=new_price
The code runs without any error, but on checking, I can still see the NaN values in the dataset. A dataset snapshot is attached below:
Are you trying to fill the NaN with a grouped (by make) average? Will this work?
df.loc[df.price.isnull(), 'price'] = df.groupby('make').price.transform('mean')
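Two things went wrong in the original attempt, as far as I can tell: fillna(..., inplace=True) returns None, so new_price was None; and avg_price is indexed by make while car['price'] is indexed by row number, so the two never align. groupby(...).transform('mean') returns a Series aligned to the original rows, which is exactly what fillna needs. A runnable toy sketch (the data here is made up):
import numpy as np
import pandas as pd

car = pd.DataFrame({'make': ['audi', 'audi', 'bmw', 'bmw'],
                    'price': [100.0, np.nan, 200.0, np.nan]})

# transform('mean') broadcasts each make's mean back onto its own rows,
# so the result aligns with car['price'] index-for-index
car['price'] = car['price'].fillna(car.groupby('make')['price'].transform('mean'))
print(car)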

How to enter parameters into a function when using pandas apply

First time posting here - I've decided to try to learn Python whilst on Covid-19 forced holidays.
I'm trying to summarise some data from a pretty simple database and have been using the value_counts function.
Rather than running it on every column individually, I'd like to loop over each one and return a summary table. I can do this using df.apply(pd.value_counts), but can't work out how to pass parameters to value_counts, as I want dropna=False.
Basic example of data I have:
# Import libraries
import pandas as pd
import numpy as np
# create list of winners and runnerup
data = [['john', 'barry'], ['john','barry'], [np.nan,'barry'], ['barry','john'],['john',np.nan],['linda','frank']]
# Create the pandas DataFrame
df = pd.DataFrame(data, columns = ['winner', 'runnerup'])
# print dataframe.
df
How I was doing the value counts for each column:
#Who won the most?
df['winner'].value_counts(dropna=False)
Output:
john 3
linda 1
barry 1
NaN 1
Name: winner, dtype: int64
How can I pass dropna=False when using the apply function? I like the table it outputs below, but I want the NaN to appear in the list.
#value counts table
df.apply(pd.value_counts)
winner runnerup
barry 1.0 3.0
frank NaN 1.0
john 3.0 1.0
linda 1.0 NaN
#value that is missing from list
#NaN 1.0 1.0
Any help would be appreciated!!
You can use df.apply, like this:
df.apply(pd.value_counts, dropna=False)
In pandas apply, if the function takes a single argument, you simply do:
.apply(func_name)
For DataFrame.apply the argument passed in is a whole column (a Series); for Series.apply it is the individual cell value.
This works exactly the same way for pandas built-in functions as for user defined functions (UDFs).
For a UDF that takes more than one parameter, extra positional arguments go in args (keyword arguments, like dropna=False above, are forwarded directly):
.apply(func_name, args=(arg1, arg2, arg3, ...))
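A runnable demonstration using the question's own data (the deprecation note is an assumption about newer pandas versions):
import numpy as np
import pandas as pd

data = [['john', 'barry'], ['john', 'barry'], [np.nan, 'barry'],
        ['barry', 'john'], ['john', np.nan], ['linda', 'frank']]
df = pd.DataFrame(data, columns=['winner', 'runnerup'])

# keyword arguments after the function are forwarded to it for every column
print(df.apply(pd.value_counts, dropna=False))

# on newer pandas, where top-level pd.value_counts is deprecated,
# the Series method gives the same result:
print(df.apply(lambda s: s.value_counts(dropna=False)))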

Print table with NaN values from given dataset and later print with Predicted values using Pandas or Recsys

I'm working with the MovieLens 100k dataset. I need to print the entire table of u.data, once with NaN values and once again with predicted values. Pandas or Recsys would suit; other libraries are welcome too.
data = pd.read_csv('ml-100k/u.data', sep='\t')
print data
The above code doesn't produce the output I need, since it prints only the first and last 30 records. Moreover, I need it in the following format:
UserID  <MovieID>1  <MovieID>2  <MovieID>3
1       <Rating>5   NaN         3
2       NaN         2           1
I've already been through:
1. a similar SF question
2. an example from AnalyticsVidhya
I am not sure if this is what you were asking but:
To print column names and have UserID as the index just use:
data = pd.read_csv('ml-100k/u.data', sep='\t',
                   names=['UserID', 'MovieID_1', 'MovieID_2', 'MovieID_3']).set_index('UserID')
For printing the whole dataframe, a similar question suggested using option_context from pandas:
with pd.option_context('display.max_rows', None, 'display.max_columns', None):
    print(data)
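If the goal is the UserID-by-MovieID ratings matrix shown in the question, a pivot is one way to get there. A minimal sketch, assuming the documented four-column layout of u.data (user id, item id, rating, timestamp):
import pandas as pd

# u.data is tab-separated with no header row
ratings = pd.read_csv('ml-100k/u.data', sep='\t',
                      names=['UserID', 'MovieID', 'Rating', 'Timestamp'])

# reshape to one row per user and one column per movie;
# movies a user has not rated come out as NaN
table = ratings.pivot(index='UserID', columns='MovieID', values='Rating')

with pd.option_context('display.max_rows', None, 'display.max_columns', None):
    print(table)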
