pandas dropna not working as expected on finding mean - python

When I run the code below I get the error:
TypeError: 'NoneType' object has no attribute '__getitem__'
import pyarrow
import pandas
import pyarrow.parquet as pq
df = pq.read_table("file.parquet").to_pandas()
df = df.iloc[1:,:]
df = df.dropna(how="any", inplace=True) # modifies it in place, creates new dataset without NaN
average_age = df["_c2"].mean()
print average_age
The dataframe looks like this:
_c0 _c1 _c2
0 RecId Class Age
1 1 1st 29
2 2 1st NA
3 3 1st 30
If I print the df after calling the dropna method, I get 'None'.
Shouldn't it be creating a new dataframe without the 'NA' in it, which would then allow me to get the average age without throwing an error?

As per the OP's comment, the NA is the string 'NA' rather than NaN, so dropna() is no help here. One of many possible options for filtering out the string value 'NA' is:
df = df[df["_c2"] != "NA"]
A better option that also catches inexact matches (e.g. values with trailing spaces), as suggested by @DJK in the comments:
df = df[~df["_c2"].str.contains('NA')]
This one should remove any strings rather than only 'NA':
df = df[df["_c2"].apply(lambda x: x.isnumeric())]
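Putting it together, a minimal sketch of the whole flow on a toy frame standing in for the question's data (filter out the string 'NA', convert the remaining strings to numbers, take the mean):
import pandas as pd

df = pd.DataFrame({"_c2": ["29", "NA", "30"]})    # toy stand-in for the question's column
clean = df[df["_c2"] != "NA"]                     # drop rows holding the literal string "NA"
average_age = pd.to_numeric(clean["_c2"]).mean()  # "29", "30" -> 29.0, 30.0
print(average_age)                                # 29.5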

This will work whether the NA in your df is a real NaN (np.nan) or the string 'NA'. A real NaN would not affect taking the mean of the column; the problem only arises when the NA is the string 'NA'.
df.apply(pd.to_numeric, errors='coerce', axis=1).describe()
Out[9]:
_c0 _c1 _c2
count 3.0 0.0 2.000000
mean 2.0 NaN 29.500000
std 1.0 NaN 0.707107
min 1.0 NaN 29.000000
25% 1.5 NaN 29.250000
50% 2.0 NaN 29.500000
75% 2.5 NaN 29.750000
max 3.0 NaN 30.000000
More info:
df.apply(pd.to_numeric, errors='coerce', axis=1)  # all non-numeric values become NaN and will not affect the mean
Out[10]:
_c0 _c1 _c2
0 NaN NaN NaN
1 1.0 NaN 29.0
2 2.0 NaN NaN
3 3.0 NaN 30.0
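If all you need is the average age from the coerced frame, a short sketch along the same lines (assuming the column of interest is _c2, as in the question):
numeric = df.apply(pd.to_numeric, errors='coerce', axis=1)  # non-numeric cells become NaN
average_age = numeric['_c2'].mean()                         # mean() skips NaN by default -> 29.5
print(average_age)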

Related

Pandas, how to calculate delta between one cell and another in different rows

I have the following frame:
USERID, EVENT1TIME, EVENT2TIME, MISC1, MISC2
123,45,,,
123,,46,,
123,,47,,
123,,48,,
123,,49,,
123,,51,,
124,45,,,
124,,46,,
124,,47,,
124,,48,,
124,,49,,
124,,51,,
I'd like to add a DELTA column that is (EVENT2TIME - EVENT1TIME):
USERID, EVENT1TIME, EVENT2TIME, MISC1, MISC2, DELTA
123,45,,,,
123,,46,,,1
123,,47,,,2
123,,48,,,3
123,,49,,,4
123,,51,,,6
124,45,,,,
124,,46,,,1
124,,47,,,2
124,,48,,,3
124,,49,,,4
124,,51,,,6
I think the first thing to do is to copy the value from the row where EVENT1TIME is populated into the other instances of that USERID. But I suspect there may be a better way.
I am making some assumptions:
You want to calculate the difference between the EVENT2TIME column and the EVENT1TIME value in the first row.
You want to store the results in DELTA.
You can do this as follows:
import pandas as pd
df = pd.read_csv('abc.txt')
print (df)
df['DELTA'] = df.iloc[:,2] - df.iloc[0,1]
print (df)
The output of this will be:
USERID EVENT1TIME EVENT2TIME MISC1 MISC2 DELTA
0 123 45.0 NaN NaN NaN NaN
1 123 NaN 46.0 NaN NaN 1.0
2 123 NaN 47.0 NaN NaN 2.0
3 123 NaN 48.0 NaN NaN 3.0
4 123 NaN 49.0 NaN NaN 4.0
5 123 NaN 51.0 NaN NaN 6.0
If you know EVENT1TIME is always and only in the first row, just store it as a variable and subtract it.
val = df.EVENT1TIME[0]
df['DELTA'] = df.EVENT2TIME - val
If you have multiple values every so often in EVENT1TIME, use some logic to back or forward fill all the empty rows for EVENT1TIME. This fill is not stored in the final output df.
df['DELTA'] = df.EVENT2TIME - df.EVENT1TIME.ffill() # forward fill (down) all nan values
# OR
df['DELTA'] = df.EVENT2TIME - df.EVENT1TIME.bfill() # back fill (up) all nan values
EDIT: Keeping this for continuity despite how hacky it is.
import numpy as np

locations = list(df[~np.isnan(df.EVENT1TIME)].index)
vals = df.EVENT1TIME.loc[locations]            # all populated EVENT1TIME values
locations.append(df.EVENT1TIME.index[-1] + 1)  # last row index + 1, to close the final group
last_loc = locations[0]
for next_loc in locations[1:]:
    temp = df.loc[last_loc:next_loc - 1]
    df.loc[last_loc:next_loc - 1, 'DELTA'] = temp.EVENT2TIME - vals[last_loc]
    last_loc = next_loc
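Since the sample data actually contains several USERIDs, a simpler alternative to the loop above (a sketch on my part, assuming each user's EVENT1TIME sits on that user's first row) is a per-user groupby transform:
first_event = df.groupby('USERID')['EVENT1TIME'].transform('first')  # first non-null EVENT1TIME per user
df['DELTA'] = df['EVENT2TIME'] - first_event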

pd.json_normalize() gives "str object has no attribute 'values'"

I manually create a DataFrame:
import pandas as pd
df_articles1 = pd.DataFrame({'Id': [4, 5, 8, 9],
                             'Class': [
                                 {'encourage': 1, 'contacting': 1},
                                 {'cardinality': 16, 'subClassOf': 3},
                                 {'get-13.5.1': 1},
                                 {'cardinality': 12, 'encourage': 1}
                             ]})
I export it to a csv file to import after splitting it:
df_articles1.to_csv(f"""{path}articles_split.csv""", index = False, sep=";")
I can split it with pd.json_normalize():
df_articles1 = pd.json_normalize(df_articles1['Class'])
I import its csv file to a DataFrame:
df_articles2 = pd.read_csv(f"""{path}articles_split.csv""", sep=";")
But this fails:
pd.json_normalize(df_articles2['Class'])
AttributeError: 'str' object has no attribute 'values'
That is because when you save with to_csv(), the data in your 'Class' column is stored as a string, not as a dictionary/JSON. So after loading the saved data:
df_articles2 = pd.read_csv(f"""{path}articles_split.csv""", sep=";")
convert it back to its original form using eval() with apply():
df_articles2['Class'] = df_articles2['Class'].apply(lambda x: eval(x))
Finally:
resultdf = pd.json_normalize(df_articles2['Class'])
Now, if you print resultdf, you will get the desired output.
While the accepted answer works, using eval is bad practice.
To parse a string column that looks like JSON/dict, use one of the following options (last one is best, if possible).
ast.literal_eval (better)
import ast
objects = df2['Class'].apply(ast.literal_eval)
normed = pd.json_normalize(objects)
df2[['Id']].join(normed)
# Id encourage contacting cardinality subClassOf get-13.5.1
# 0 4 1.0 1.0 NaN NaN NaN
# 1 5 NaN NaN 16.0 3.0 NaN
# 2 8 NaN NaN NaN NaN 1.0
# 3 9 1.0 NaN 12.0 NaN NaN
json.loads (even better)
import json
objects = df2['Class'].apply(json.loads)
normed = pd.json_normalize(objects)
df2[['Id']].join(normed)
# Id encourage contacting cardinality subClassOf get-13.5.1
# 0 4 1.0 1.0 NaN NaN NaN
# 1 5 NaN NaN 16.0 3.0 NaN
# 2 8 NaN NaN NaN NaN 1.0
# 3 9 1.0 NaN 12.0 NaN NaN
If the strings are single quoted, use str.replace to convert them to double quotes (and thus valid JSON) before applying json.loads:
objects = df2['Class'].str.replace("'", '"').apply(json.loads)
normed = pd.json_normalize(objects)
df2[['Id']].join(normed)
pd.json_normalize before pd.to_csv (recommended)
If possible, when you originally save to CSV, just save the normalized JSON (not raw JSON objects):
df1 = df1[['Id']].join(pd.json_normalize(df1['Class']))
df1.to_csv('df1_normalized.csv', index=False, sep=';')
# Id;encourage;contacting;cardinality;subClassOf;get-13.5.1
# 4;1.0;1.0;;;
# 5;;;16.0;3.0;
# 8;;;;;1.0
# 9;1.0;;12.0;;
This is a more natural CSV workflow (rather than storing/loading object blobs):
df2 = pd.read_csv('df1_normalized.csv', sep=';')
# Id encourage contacting cardinality subClassOf get-13.5.1
# 0 4 1.0 1.0 NaN NaN NaN
# 1 5 NaN NaN 16.0 3.0 NaN
# 2 8 NaN NaN NaN NaN 1.0
# 3 9 1.0 NaN 12.0 NaN NaN
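As a side note (not part of the answer above), pd.read_csv can also parse the stored strings at load time via its converters argument; since to_csv wrote the dicts with single quotes, ast.literal_eval is the right parser here rather than json.loads:
import ast
import pandas as pd

df2 = pd.read_csv('articles_split.csv', sep=';',          # the file saved by to_csv earlier
                  converters={'Class': ast.literal_eval})  # parse each 'Class' cell back into a dict
df2[['Id']].join(pd.json_normalize(df2['Class']))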

python pandas reshape - time series table

I need to reshape the time series table below, from form A to form B.
A
no,A,B,B_sub
1,start,val_s,val_s_sub
2,study,val_st,val_st_sub
3,work,val_w,val_w_sub
4,end,val_e,val_e_sub
5,start,val_s1,val_s1_sub
6,end,val_e1,val_e1_sub
7,start,val_s2,val_s2_sub
8,work,val_w1,val_w1_sub
9,end,val_e2,val_e2_sub
B
,start,,study,,work,,end,
,B,B_sub,B,B_sub,B,B_sub,B,B_sub
4-1,val_s,val_s_sub,val_st,val_st_sub,val_w,val_w_sub,val_e,val_e_sub
6-5,val_s1,val_s1_sub,,,,,val_e1,val_e1_sub
9-7,val_s2,val_s2_sub,,,val_w1,val_w1_sub,val_e2,val_e2_sub
I tried to use the pivot table function of the pandas library, but there is no common string to use as an index in my table.
Can I get a hint? I'm lost, please help.
Does this get you close enough?
df_a['grp'] = (df_a['A'] == 'start').cumsum()  # each 'start' row begins a new group
df_a.set_index(['grp', 'A']).unstack('A')
Output:
no B B_sub
A end start study work end start study work end start study work
grp
1 4.0 1.0 2.0 3.0 val_e val_s val_st val_w val_e_sub val_s_sub val_st_sub val_w_sub
2 6.0 5.0 NaN NaN val_e1 val_s1 NaN NaN val_e1_sub val_s1_sub NaN NaN
3 9.0 7.0 NaN 8.0 val_e2 val_s2 NaN val_w1 val_e2_sub val_s2_sub NaN val_w1_sub
Going a little further with reshaping and renaming:
df_r = df_a.set_index(['grp','A']).unstack('A')
steps = df_r[('no', 'end')].astype(int).astype(str).str.cat(df_r[('no', 'start')].astype(int).astype(str), sep='-')
df_r.set_index(steps)[['B', 'B_sub']].swaplevel(0,1, axis=1).sort_index(level=0, axis=1)
Output:
A end start study work
B B_sub B B_sub B B_sub B B_sub
(no, end)
4-1 val_e val_e_sub val_s val_s_sub val_st val_st_sub val_w val_w_sub
6-5 val_e1 val_e1_sub val_s1 val_s1_sub NaN NaN NaN NaN
9-7 val_e2 val_e2_sub val_s2 val_s2_sub NaN NaN val_w1 val_w1_sub

Ignore column name while imputation pandas

I am trying to use scikit-learn's KNNImputer to impute missing values in my dataframe.
Here is my dataframe
pd.DataFrame(numeric_data)
age bmi children charges
0 19 NaN 0.0 16884.9240
1 18 33.770 1.0 NaN
2 28 33.000 3.0 4449.4620
3 33 22.705 0.0 NaN
Here is what happens when I apply the imputer and build the output dataframe.
imputer = KNNImputer(n_neighbors=2, weights="uniform")
impute_data = pd.DataFrame(imputer.fit_transform(numeric_data))
This gives:
0 1 2 3
0 19.0 34.0850 0.0 16884.924000
1 18.0 33.7700 1.0 6309.517125
2 28.0 33.0000 3.0 4449.462000
3 33.0 22.7050 0.0 4610.464925
How do I do the same without losing my column names? Can I store the column names somewhere and add them back later, or can I impute without losing the column names in the first place?
I have tried to exclude the column but I get the following error:
ValueError: could not convert string to float: 'age'
This should give you the desired result:
from sklearn.impute import KNNImputer

imputer = KNNImputer(n_neighbors=2, weights="uniform")
impute_data = pd.DataFrame(imputer.fit_transform(numeric_data), columns=numeric_data.columns)
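If you also want to keep the original row labels (a minor variation, not something the question asked for), pass the index through in the same way:
impute_data = pd.DataFrame(imputer.fit_transform(numeric_data),
                           columns=numeric_data.columns,
                           index=numeric_data.index)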

How to populate a column in one dataframe by comparing it to another dataframe

I have a dataframe called res_df:
In [54]: res_df.head()
Out[54]:
Bldg_Sq_Ft GEOID CensusPop HU_Pop Pop_By_Area
0 753.026123 240010013002022 11.0 7.0 NaN
7 95.890495 240430003022003 17.0 8.0 NaN
8 1940.862793 240430003022021 86.0 33.0 NaN
24 2254.519775 245102801012021 27.0 13.0 NaN
25 11685.613281 245101503002000 152.0 74.0 NaN
I have a second dataframe made from the summarized information in res_df. It's grouped by the GEOID column and then summarized using aggregations to get the sum of the Bldg_Sq_Ft and the mean of the CensusPop columns for each unique GEOID. Let's call it geoid_sum:
In [55]: geoid_sum = res_df.groupby('GEOID').agg({'GEOID': 'count', 'Bldg_Sq_Ft': 'sum', 'CensusPop': 'mean'})
In [56]: geoid_sum.head()
Out[56]:
GEOID Bldg_Sq_Ft CensusPop
GEOID
100010431001011 1 1154.915527 0.0
100030144041044 1 5443.207520 26.0
100050519001066 1 1164.390503 4.0
240010001001001 15 30923.517090 41.0
240010001001007 3 6651.656677 0.0
My goal is to find the GEOIDs in res_df that match the GEOID's in geoid_sum. I want to populate the value in Pop_By_Area for that row using an equation:
Pop_By_Area = (geoid_sum['CensusPop'] * res_df['Bldg_Sq_Ft']) / geoid_sum['Bldg_Sq_Ft']
I've created a simple function that takes those parameters, but I am unsure how to iterate through the dataframes and apply the function.
def popByArea(census_pop_mean, bldg_sqft, bldg_sqft_sum):
x = float()
x = (census_pop_mean * bldg_sqft)/bldg_sqft_sum
return x
I've tried creating a series based on the GEOID matches: s = res_df.GEOID.isin(geoid_sum.GEOID.values) but that didn't seem to work (produced all false boolean values). How can I find the matches and apply my function to populate the Pop_By_Area column?
I think you need reindex:
geoid_sum = res_df.groupby('GEOID').\
    agg({'GEOID': 'count', 'Bldg_Sq_Ft': 'sum', 'CensusPop': 'mean'}).\
    reindex(res_df['GEOID'])
res_df['Pop_By_Area'] = (geoid_sum['CensusPop'].values * res_df['Bldg_Sq_Ft']) / geoid_sum['Bldg_Sq_Ft'].values
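Alternatively (a sketch of my own rather than the answer above), you can map the per-GEOID aggregates back onto res_df and apply the formula directly, which avoids relying on positional .values alignment:
agg = res_df.groupby('GEOID').agg(sqft_sum=('Bldg_Sq_Ft', 'sum'),
                                  census_mean=('CensusPop', 'mean'))
res_df['Pop_By_Area'] = (res_df['GEOID'].map(agg['census_mean'])
                         * res_df['Bldg_Sq_Ft']
                         / res_df['GEOID'].map(agg['sqft_sum']))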
