ValueError: Input contains NaN
i have run
from sklearn.preprocessing import OrdinalEncoder
data_.iloc[:,1:-1] = OrdinalEncoder().fit_transform(data_.iloc[:,1:-1])
here is data_
Age Sex Embarked Survived
0 22.0 male S 0
1 38.0 female C 2
2 26.0 female S 2
3 35.0 female S 2
4 35.0 male S 0
Before doing any processing, you always have to preprocess the data and get a summary of what it looks like. Concretely, the error you obtained is telling you that you have NaN values. To check it, try this command:
df.isnull().any().any()
If the output is True, you have NaN values. You can run the next command if you want to know where these NaN values are:
df.isnull().any()
Then you will know which columns contain NaN values.
Once you know you have NaN values, you have to handle them (drop, fill, ... whatever you believe is the best option). The link gtomer commented is a nice resource.
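For instance, a minimal sketch of both options before encoding, assuming data_ is the frame shown in the question (whether to drop or fill is a judgment call, not something the error dictates):
from sklearn.preprocessing import OrdinalEncoder

# Option 1: drop rows that contain NaN in the columns to be encoded
data_ = data_.dropna(subset=data_.columns[1:-1])

# Option 2 (alternative): fill NaNs with a placeholder category instead of dropping rows
# data_.iloc[:, 1:-1] = data_.iloc[:, 1:-1].fillna("missing")

data_.iloc[:, 1:-1] = OrdinalEncoder().fit_transform(data_.iloc[:, 1:-1])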
I am trying to use the KNNImputer Package to impute missing values into my dataframe.
Here is my dataframe
pd.DataFrame(numeric_data)
age bmi children charges
0 19 NaN 0.0 16884.9240
1 18 33.770 1.0 NaN
2 28 33.000 3.0 4449.4620
3 33 22.705 0.0 NaN
Here is what happens when I pass the data to the imputer and convert the output back to a dataframe.
imputer = KNNImputer(n_neighbors=2, weights="uniform")
impute_data = pd.DataFrame(imputer.fit_transform(numeric_data))
This gives:
0 1 2 3
0 19.0 34.0850 0.0 16884.924000
1 18.0 33.7700 1.0 6309.517125
2 28.0 33.0000 3.0 4449.462000
3 33.0 22.7050 0.0 4610.464925
How do I do the same without losing my column names? Can I store the column names somewhere else and append them later, or can I impute in a way that keeps the column names intact?
I have tried to exclude the column but I get the following error:
ValueError: could not convert string to float: 'age'
This should give you the desired result:
from sklearn.impute import KNNImputer

imputer = KNNImputer(n_neighbors=2, weights="uniform")
impute_data = pd.DataFrame(imputer.fit_transform(numeric_data), columns=numeric_data.columns)
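A self-contained version of the same idea, rebuilding the sample rows from the question (the index is also passed through so downstream joins still line up):
import pandas as pd
from sklearn.impute import KNNImputer

numeric_data = pd.DataFrame({
    "age": [19, 18, 28, 33],
    "bmi": [None, 33.770, 33.000, 22.705],
    "children": [0.0, 1.0, 3.0, 0.0],
    "charges": [16884.9240, None, 4449.4620, None],
})

imputer = KNNImputer(n_neighbors=2, weights="uniform")
impute_data = pd.DataFrame(
    imputer.fit_transform(numeric_data),
    columns=numeric_data.columns,   # keep the original column names
    index=numeric_data.index,       # keep the original row index
)
print(impute_data)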
When I run the code below I get the error:
TypeError: 'NoneType' object has no attribute '__getitem__'
import pyarrow
import pandas
import pyarrow.parquet as pq
df = pq.read_table("file.parquet").to_pandas()
df = df.iloc[1:,:]
df = df.dropna (how="any", inplace = True) # modifies it in place, creates new dataset without NAN
average_age = df["_c2"].mean()
print average_age
The dataframe looks like this:
_c0 _c1 _c2
0 RecId Class Age
1 1 1st 29
2 2 1st NA
3 3 1st 30
If I print the df after calling the dropna method, I get 'None'.
Shouldn't it be creating a new dataframe without the 'NA' in it, which would then allow me to get the average age without throwing an error?
As per OP's comment, the NA is a string rather than NaN, so dropna() is no good here. (Also note that df = df.dropna(how="any", inplace=True) assigns None to df, because dropna returns None when inplace=True; either drop the inplace=True or drop the assignment, which is what produces the NoneType error.) One of many possible options for filtering out the string value 'NA' is:
df = df[df["_c2"] != "NA"]
A better option, to catch inexact matches (e.g. with trailing spaces), as suggested by @DJK in the comments:
df = df[~df["_c2"].str.contains('NA')]
This one should remove any strings rather than only 'NA':
df = df[df["_c2"].apply(lambda x: x.isnumeric())]
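Whichever filter you use, note that the column is still stored as strings, so cast it before taking the mean from the original snippet, e.g.:
df = df[df["_c2"] != "NA"]                    # or one of the other filters above
average_age = df["_c2"].astype(float).mean()
print(average_age)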
This will also work. If the NA in your df were NaN (np.nan), it would not affect getting the mean of the column; the problem only arises because your NA is the string 'NA'.
(df.apply(pd.to_numeric,errors ='coerce',axis=1)).describe()
Out[9]:
_c0 _c1 _c2
count 3.0 0.0 2.000000
mean 2.0 NaN 29.500000
std 1.0 NaN 0.707107
min 1.0 NaN 29.000000
25% 1.5 NaN 29.250000
50% 2.0 NaN 29.500000
75% 2.5 NaN 29.750000
max 3.0 NaN 30.000000
More info
df.apply(pd.to_numeric, errors='coerce', axis=1)  # all non-numeric objects become NaN and will not affect getting the mean
Out[10]:
_c0 _c1 _c2
0 NaN NaN NaN
1 1.0 NaN 29.0
2 2.0 NaN NaN
3 3.0 NaN 30.0
I have a df structured as so:
UnitNo Time Sensor
0 1.0 2016-07-20 18:34:44 19.0
1 1.0 2016-07-20 19:27:39 19.0
2 3.0 2016-07-20 20:45:39 17.0
3 3.0 2016-07-20 23:05:29 17.0
4 3.0 2016-07-21 01:23:30 11.0
5 2.0 2016-07-21 04:23:59 11.0
6 2.0 2016-07-21 17:33:29 2.0
7 2.0 2016-07-21 18:55:04 2.0
I want to create a time-series plot where each UnitNo has its own line (color) and the y-axis values correspond to Sensor and the x-axis is Time. I want to do this in ggplot, but I am having trouble figuring out how to do this efficiently. I have looked at previous examples but they all have regular time series, i.e., observations for each variable occur at the same times which makes it easy to create a time index. I imagine I can loop through and add data to plot(?), but I was wondering if there was a more efficient/elegant way forward.
df.set_index('Time').groupby('UnitNo').Sensor.plot();
I think you need pivot or set_index and unstack with DataFrame.plot:
df.pivot(index='Time', columns='UnitNo', values='Sensor').plot()
Or:
df.set_index(['Time', 'UnitNo'])['Sensor'].unstack().plot()
If there are duplicate Time/UnitNo pairs, aggregate first:
df.groupby(['Time', 'UnitNo'])['Sensor'].mean().unstack().plot()
df.pivot_table(index='Time', columns='UnitNo', values='Sensor', aggfunc='mean').plot()
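For completeness, a self-contained sketch of the pivot-and-plot approach built from the rows in the question (this uses pandas/matplotlib rather than ggplot, as in the answers above):
import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({
    "UnitNo": [1.0, 1.0, 3.0, 3.0, 3.0, 2.0, 2.0, 2.0],
    "Time": pd.to_datetime([
        "2016-07-20 18:34:44", "2016-07-20 19:27:39", "2016-07-20 20:45:39",
        "2016-07-20 23:05:29", "2016-07-21 01:23:30", "2016-07-21 04:23:59",
        "2016-07-21 17:33:29", "2016-07-21 18:55:04",
    ]),
    "Sensor": [19.0, 19.0, 17.0, 17.0, 11.0, 11.0, 2.0, 2.0],
})

# One line per UnitNo, Time on the x-axis, Sensor on the y-axis
df.pivot(index="Time", columns="UnitNo", values="Sensor").plot()
plt.show()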
I am trying to learn Python, coming from a SAS background.
I have imported a SAS dataset, and one thing I noticed was that I have multiple date columns that are coming through as SAS dates (I believe).
In looking around, I found a link which explained how to perform this (here):
The code is as follows:
alldata['DateFirstOnsite'] = pd.to_timedelta(alldata.DateFirstOnsite, unit='s') + pd.datetime(1960, 1, 1)
However, I'm wondering how to do this for multiple columns. If I have multiple date fields, rather than repeating this line of code multiple times, can I create a list of fields I have, and then run this code on that list of fields? How is that done?
Thanks in advance
Yes, it's possible to create a list and iterate through it to convert the SAS date fields to pandas datetimes. However, I'm not sure why you're using the to_timedelta method, unless the SAS date fields are represented as seconds after 1960/01/01. If you plan on using the to_timedelta method, then it's simply a case of creating a function that takes your df and your field, and passing those two into your function:
def convert_SAS_to_datetime(df, field):
df[field] = pd.to_timedelta(df[field], unit='s') + pd.datetime(1960, 1, 1)
return df
Now, let's suppose you have your list of fields that you know should be converted to a datetime field (along with your df):
my_list = ['field1','field2','field3','field4','field5']
my_df = pd.read_sas('mySASfile.sas7bdat') # your SAS data that's converted to a pandas DF
You can now iterate through your list with a for loop while passing those fields and your df to the function:
for field in my_list:
my_df = convert_SAS_to_datetime(my_df, field)
Now, the other method I would recommend is the to_datetime method, but this assumes that you know what the SAS format of your date fields is.
e.g. 01Jan2016 # date9 format
This is when you might have to look through the documentation here to determine the directive for converting the date. In the case of a date9 format, you can use:
df[field] = pd.to_datetime(df[date9field], format="%d%b%Y")
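If you go the to_datetime route, it can also be applied to several columns in one pass, e.g. reusing my_list from above (this assumes every column in the list really holds date9-style strings):
my_df[my_list] = my_df[my_list].apply(pd.to_datetime, format="%d%b%Y")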
If I read your question correctly, you want to apply your code to multiple columns? To do that, simply do this:
alldata[['col1','col2','col3']] = 'your_code_here'
Example:
import pandas as pd
import numpy as np
df = pd.DataFrame({'A' : [np.NaN,np.NaN,3,4,5,5,3,1,5,np.NaN],
'B' : [1,0,3,5,0,0,np.NaN,9,0,0],
'C' : ['Pharmacy of IDAHO','Access medicare arkansas','NJ Pharmacy','Idaho Rx','CA Herbals','Florida Pharma','AK RX','Ohio Drugs','PA Rx','USA Pharma'],
'D' : [123456,123456,1234567,12345678,12345,12345,12345678,123456789,1234567,np.NaN],
'E' : ['Assign','Unassign','Assign','Ugly','Appreciate','Undo','Assign','Unicycle','Assign','Unicorn',]})
df[['E', 'D']] = 1 # <---- notice double brackets
print(df)
A B C D E
0 NaN 1.0 Pharmacy of IDAHO 1 1
1 NaN 0.0 Access medicare arkansas 1 1
2 3.0 3.0 NJ Pharmacy 1 1
3 4.0 5.0 Idaho Rx 1 1
4 5.0 0.0 CA Herbals 1 1
5 5.0 0.0 Florida Pharma 1 1
6 3.0 NaN AK RX 1 1
7 1.0 9.0 Ohio Drugs 1 1
8 5.0 0.0 PA Rx 1 1
9 NaN 0.0 USA Pharma 1 1
Notice the double brackets in the beginning. Hope this helps!
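Tying the double-bracket idea back to the original SAS date conversion, a minimal sketch assuming those columns hold numeric seconds since 1960-01-01 (the second column name is made up for illustration):
date_cols = ['DateFirstOnsite', 'DateSecondOnsite']  # second name is hypothetical
alldata[date_cols] = alldata[date_cols].apply(
    lambda s: pd.to_timedelta(s, unit='s') + pd.Timestamp(1960, 1, 1)
)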