Not possible to give column names to concatenated Pandas Series - python

From a pandas DataFrame I calculate mean(), std() and max() for all variables with pandas' built-in functions. I get back three pandas Series.
import pandas as pd
df_FALKO_R_scores_mean = df_FALKO_R_scores_only.mean()
df_FALKO_R_scores_sd = df_FALKO_R_scores_only.std()
df_FALKO_R_scores_max = df_FALKO_R_scores_only.max()
Then I concatenate the three Series to get an output of mean, sd and max for every variable.
The problem is, as you can see below, that although I pass "names" to the concat() function, the columns are labelled 0, 1 and 2. This is not readable, especially if I want to plot those numbers. How can I get a DataFrame with the column labels ['mean', 'sd', 'max']? I also tried setting "ignore_index" to both True and False.
df_FALKO_R_scores_mean_sd_max = pd.concat([df_FALKO_R_scores_mean, df_FALKO_R_scores_sd, df_FALKO_R_scores_max], names=['mean', 'sd', 'max'], axis=1, ignore_index=True)
print(df_FALKO_R_scores_mean_sd_max)
Output:
0 1 2
R_fd_s_01a_s 1.026490 0.631897 2.0
R_fd_e_01b_s 0.794702 0.802645 2.0
R_fd_e_01c_s 1.039735 1.124757 4.0
R_fd_p_02a_s 1.390728 0.848320 3.0
R_fd_p_02b_s 0.880795 0.552897 2.0
R_fd_p_03_s 1.132450 1.004493 3.0
R_fd_s_04_s 0.834437 0.769679 2.0
R_fd_e_05_s 0.403974 0.694539 2.0
R_fd_p_06a_s 1.105960 0.644488 2.0
R_fd_e_06b_s 1.337748 0.979030 3.0
R_fd_e_07_s 1.192053 1.320178 4.0
R_fd_e_08a_s 0.748344 0.741337 2.0
R_fd_e_08b_s 0.529801 0.737635 2.0
R_fd_p_09a_s 1.688742 1.312430 4.0
R_fd_p_09b_s 0.701987 0.839005 3.0
R_fw_01_s 0.774834 0.731867 2.0
R_fw_02_s 0.761589 0.797568 2.0
R_fw_03_s 0.841060 0.857070 2.0
R_fw_04_s 0.589404 0.675983 2.0
R_fw_05_s 0.403974 0.655020 2.0
R_fw_06_s 0.211921 0.441351 2.0
R_fw_07_s 0.536424 0.789724 2.0
R_fw_08_s 0.927152 0.566855 2.0
R_fw_09a_s 1.317881 0.843571 2.0
Thanks for any help!

Why don't you use agg() instead of creating three different calculations and concatenating the results?
df_FALKO_R_scores_only.agg(['mean', 'std', 'max']).T
It will give you one row per variable with the statistic names as proper column labels (the .T transposes the result into the same layout as your concat output).
You didn't share any input data, but I believe it should work in this case.
EDIT:
If you want to use pd.concat, you can name each Series (and drop ignore_index=True, which is what discards those names), for example:
df_FALKO_R_scores_mean.name = 'mean'
Or you can simply rename the output columns afterwards with a list.
df_FALKO_R_scores_mean_sd_max.columns = ['mean', 'std', 'max']
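If you prefer to keep the three separate calculations, here is a minimal sketch of the concat call itself, assuming the Series defined above: pass the labels via keys= rather than names= (names= only labels index levels) and leave out ignore_index=True, which is what resets the labels to 0, 1, 2.
df_FALKO_R_scores_mean_sd_max = pd.concat(
    [df_FALKO_R_scores_mean, df_FALKO_R_scores_sd, df_FALKO_R_scores_max],
    axis=1,
    keys=['mean', 'sd', 'max'])  # keys= become the column labels along axis=1
print(df_FALKO_R_scores_mean_sd_max)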

Related

access to the row of a dataframe using the conditions and values for unnamed columns

I have a dataframe:
params = pd.DataFrame({'dE':  {'3.0': 20.0, '4.0': 15.0, '-4.0': 15.0},
                       'Gg':  {'3.0': 80.0, '4.0': 55.0, '-4.0': 55.0},
                       'gn2': {'3.0': 50.0, '4.0': 10.0, '-4.0': 10.0}})
The data inside:
dE Gg gn2
3.0 20.0 80.0 50.0
4.0 15.0 55.0 10.0
-4.0 15.0 55.0 10.0
How do I access the row of a dataframe where the first unnamed column has the value 4.0?
And how do I actually create a subset using that unnamed column?
That unnamed column is the row index.
To get the row with index '4.0', use params.loc['4.0'].
In pandas (here version 1.5.2) the first "column" that you think is unnamed is actually the index.
Then, to locate the row, use: params.loc['4.0']
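To illustrate both lookups, a short sketch using the params frame defined above (the index labels are strings here, because the dict keys were strings):
row = params.loc['4.0']                  # the single row whose index label is '4.0'
subset = params[params.index == '4.0']   # a one-row DataFrame subset selected via the index
print(row)
print(subset)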

How to populate a column in one dataframe by comparing it to another dataframe

I have a dataframe called res_df:
In [54]: res_df.head()
Out[54]:
Bldg_Sq_Ft GEOID CensusPop HU_Pop Pop_By_Area
0 753.026123 240010013002022 11.0 7.0 NaN
7 95.890495 240430003022003 17.0 8.0 NaN
8 1940.862793 240430003022021 86.0 33.0 NaN
24 2254.519775 245102801012021 27.0 13.0 NaN
25 11685.613281 245101503002000 152.0 74.0 NaN
I have a second dataframe made from the summarized information in res_df. It's grouped by the GEOID column and then summarized using aggregations to get the sum of the Bldg_Sq_Ft and the mean of the CensusPop columns for each unique GEOID. Let's call it geoid_sum:
In [55]: geoid_sum = res_df.groupby('GEOID').agg({'GEOID': 'count', 'Bldg_Sq_Ft': 'sum', 'CensusPop': 'mean'})
In [56]: geoid_sum.head()
Out[56]:
GEOID Bldg_Sq_Ft CensusPop
GEOID
100010431001011 1 1154.915527 0.0
100030144041044 1 5443.207520 26.0
100050519001066 1 1164.390503 4.0
240010001001001 15 30923.517090 41.0
240010001001007 3 6651.656677 0.0
My goal is to find the GEOIDs in res_df that match the GEOID's in geoid_sum. I want to populate the value in Pop_By_Area for that row using an equation:
Pop_By_Area = (geoid_sum['CensusPop'] * res_df['Bldg_Sq_Ft']) / geoid_sum['Bldg_Sq_Ft']
I've created a simple function that takes those parameters, but I am unsure how to iterate through the dataframes and apply the function.
def popByArea(census_pop_mean, bldg_sqft, bldg_sqft_sum):
    x = (census_pop_mean * bldg_sqft) / bldg_sqft_sum
    return x
I've tried creating a series based on the GEOID matches: s = res_df.GEOID.isin(geoid_sum.GEOID.values) but that didn't seem to work (produced all false boolean values). How can I find the matches and apply my function to populate the Pop_By_Area column?
I think you need reindex:
geoid_sum = res_df.groupby('GEOID').\
    agg({'GEOID': 'count', 'Bldg_Sq_Ft': 'sum', 'CensusPop': 'mean'}).\
    reindex(res_df['GEOID'])
res_df['Pop_By_Area'] = (geoid_sum['CensusPop'].values * res_df['Bldg_Sq_Ft']) / geoid_sum['Bldg_Sq_Ft'].values
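An alternative sketch, not from the answer above, that avoids reindex by mapping the per-GEOID aggregates back onto res_df (the named-aggregation syntax needs pandas 0.25 or newer):
stats = res_df.groupby('GEOID').agg(
    Bldg_Sq_Ft_sum=('Bldg_Sq_Ft', 'sum'),    # total square footage per GEOID
    CensusPop_mean=('CensusPop', 'mean'))    # mean census population per GEOID
res_df['Pop_By_Area'] = (res_df['GEOID'].map(stats['CensusPop_mean'])
                         * res_df['Bldg_Sq_Ft']
                         / res_df['GEOID'].map(stats['Bldg_Sq_Ft_sum']))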

pandas dropna not working as expected on finding mean

When I run the code below I get the error:
TypeError: 'NoneType' object has no attribute '__getitem__'
import pyarrow
import pandas
import pyarrow.parquet as pq
df = pq.read_table("file.parquet").to_pandas()
df = df.iloc[1:,:]
df = df.dropna (how="any", inplace = True) # modifies it in place, creates new dataset without NAN
average_age = df["_c2"].mean()
print average_age
The dataframe looks like this:
_c0 _c1 _c2
0 RecId Class Age
1 1 1st 29
2 2 1st NA
3 3 1st 30
If I print the df after calling the dropna method, I get 'None'.
Shouldn't it be creating a new dataframe without the 'NA' in it, which would then allow me to get the average age without throwing an error?
As per OP’s comment, the NA is a string rather than NaN. So dropna() is no good here. One of many possible options for filtering out the string value ‘NA’ is:
df = df[df["_c2"] != "NA"]
A better option to catch inexact matches (e.g. with trailing spaces) as suggested by #DJK in the comments:
df = df[~df["_c2"].str.contains('NA')]
This one should remove any strings rather than only ‘NA’:
df = df[df["_c2"].apply(lambda x: x.isnumeric())]
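As a side note on the 'None' the question mentions: dropna(..., inplace=True) returns None, so assigning the result back to df is what wipes the DataFrame out. A minimal sketch of the two valid forms:
df = df.dropna(how="any")              # returns a new DataFrame without NaN rows
# or, equivalently, without reassignment:
# df.dropna(how="any", inplace=True)   # modifies df in place and returns None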
This will also work. If the NA in your df were NaN (np.nan), it would not stop you from getting the mean of the column; the problem only arises because your NA is the string 'NA'.
df.apply(pd.to_numeric, errors='coerce', axis=1).describe()
Out[9]:
_c0 _c1 _c2
count 3.0 0.0 2.000000
mean 2.0 NaN 29.500000
std 1.0 NaN 0.707107
min 1.0 NaN 29.000000
25% 1.5 NaN 29.250000
50% 2.0 NaN 29.500000
75% 2.5 NaN 29.750000
max 3.0 NaN 30.000000
More info
df.apply(pd.to_numeric, errors='coerce', axis=1)  # all non-numeric values become NaN and do not affect the mean
Out[10]:
_c0 _c1 _c2
0 NaN NaN NaN
1 1.0 NaN 29.0
2 2.0 NaN NaN
3 3.0 NaN 30.0
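Building on that, a sketch for getting just the average age: coerce the header row and the 'NA' strings in _c2 to NaN, then take the mean (NaN values are skipped by default).
average_age = pd.to_numeric(df["_c2"], errors="coerce").mean()
print(average_age)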

Plotting irregular time-series (multiple) from dataframe using ggplot

I have a df structured as so:
UnitNo Time Sensor
0 1.0 2016-07-20 18:34:44 19.0
1 1.0 2016-07-20 19:27:39 19.0
2 3.0 2016-07-20 20:45:39 17.0
3 3.0 2016-07-20 23:05:29 17.0
4 3.0 2016-07-21 01:23:30 11.0
5 2.0 2016-07-21 04:23:59 11.0
6 2.0 2016-07-21 17:33:29 2.0
7 2.0 2016-07-21 18:55:04 2.0
I want to create a time-series plot where each UnitNo has its own line (color) and the y-axis values correspond to Sensor and the x-axis is Time. I want to do this in ggplot, but I am having trouble figuring out how to do this efficiently. I have looked at previous examples but they all have regular time series, i.e., observations for each variable occur at the same times which makes it easy to create a time index. I imagine I can loop through and add data to plot(?), but I was wondering if there was a more efficient/elegant way forward.
df.set_index('Time').groupby('UnitNo').Sensor.plot();
I think you need pivot or set_index and unstack with DataFrame.plot:
df.pivot(index='Time', columns='UnitNo', values='Sensor').plot()
Or:
df.set_index(['Time', 'UnitNo'])['Sensor'].unstack().plot()
If some duplicates:
df.groupby(['Time', 'UnitNo'])['Sensor'].mean().unstack().plot()
df.pivot_table(index='Time', columns='UnitNo', values='Sensor', aggfunc='mean').plot()
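If you specifically want ggplot-style syntax in Python, here is a hedged sketch with plotnine (assuming it is installed); no pivoting is needed because the colour aesthetic handles the grouping:
from plotnine import ggplot, aes, geom_line

(ggplot(df, aes(x='Time', y='Sensor', color='factor(UnitNo)'))
 + geom_line())  # one coloured line per UnitNo; irregular timestamps are plotted as-is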

convert a SAS datetime in Pandas - Multiple Columns

I am trying to learn Python, coming from a SAS background.
I have imported a SAS dataset, and one thing I noticed was that I have multiple date columns that are coming through as SAS dates (I believe).
In looking around, I found a link which explained how to perform this (here):
The code is as follows:
alldata['DateFirstOnsite'] = pd.to_timedelta(alldata.DateFirstOnsite, unit='s') + pd.datetime(1960, 1, 1)
However, I'm wondering how to do this for multiple columns. If I have multiple date fields, rather than repeating this line of code multiple times, can I create a list of fields I have, and then run this code on that list of fields? How is that done?
Thanks in advance
Yes, it's possible to create a list and iterate through that list to convert the SAS date fields to pandas datetimes. However, I'm not sure why you're using the to_timedelta method, unless the SAS date fields are represented as seconds after 1960/01/01. If you plan on using the to_timedelta method, then it's simply a case of creating a function that takes your df and your field and passing those two into your function:
def convert_SAS_to_datetime(df, field):
    # SAS datetimes are seconds since the SAS epoch, 1960-01-01
    df[field] = pd.to_timedelta(df[field], unit='s') + pd.Timestamp('1960-01-01')
    return df
Now, let's suppose you have your list of fields that you know should be converted to a datetime field (along with your df):
my_list = ['field1','field2','field3','field4','field5']
my_df = pd.read_sas('mySASfile.sas7bdat') # your SAS data that's converted to a pandas DF
You can now iterate through your list with a for loop while passing those fields and your df to the function:
for field in my_list:
    my_df = convert_SAS_to_datetime(my_df, field)
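If every field in the list really is seconds since the SAS epoch, the loop can also be collapsed into a single apply over those columns; a minimal sketch under that assumption:
my_df[my_list] = my_df[my_list].apply(
    lambda s: pd.to_timedelta(s, unit='s') + pd.Timestamp('1960-01-01'))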
Now, the other method I would recommend is using the to_datetime method, but this assumes that you know what the SAS format of your date fields is.
e.g. 01Jan2016 # date9 format
This is when you might have to look through the documentation here to determine the directive for converting the date. In the case of a date9 format, you can use:
df[field] = pd.to_datetime(df[field], format="%d%b%Y")
If I read your question correctly, you want to apply your code to multiple columns? To do that, simply do this:
alldata[['col1','col2','col3']] = 'your_code_here'
Example:
import pandas as pd
import numpy as np
df = pd.DataFrame({'A': [np.nan, np.nan, 3, 4, 5, 5, 3, 1, 5, np.nan],
                   'B': [1, 0, 3, 5, 0, 0, np.nan, 9, 0, 0],
                   'C': ['Pharmacy of IDAHO', 'Access medicare arkansas', 'NJ Pharmacy', 'Idaho Rx', 'CA Herbals',
                         'Florida Pharma', 'AK RX', 'Ohio Drugs', 'PA Rx', 'USA Pharma'],
                   'D': [123456, 123456, 1234567, 12345678, 12345, 12345, 12345678, 123456789, 1234567, np.nan],
                   'E': ['Assign', 'Unassign', 'Assign', 'Ugly', 'Appreciate', 'Undo', 'Assign', 'Unicycle', 'Assign', 'Unicorn']})
df[['E', 'D']] = 1 # <---- notice double brackets
print(df)
A B C D E
0 NaN 1.0 Pharmacy of IDAHO 1 1
1 NaN 0.0 Access medicare arkansas 1 1
2 3.0 3.0 NJ Pharmacy 1 1
3 4.0 5.0 Idaho Rx 1 1
4 5.0 0.0 CA Herbals 1 1
5 5.0 0.0 Florida Pharma 1 1
6 3.0 NaN AK RX 1 1
7 1.0 9.0 Ohio Drugs 1 1
8 5.0 0.0 PA Rx 1 1
9 NaN 0.0 USA Pharma 1 1
Notice the double brackets in the beginning. Hope this helps!
