Avoiding KeyError in dataframe - python

I am validating my dataframe with the code below:
df = df[(df[['name', 'issuer_id', 'service_area_id']].notnull().all(axis=1)) &
        ((df['plan_year'].notnull()) & (df['plan_year'].astype(str).str.isdigit()) & (df['plan_year'].astype(str).str.len() == 4)) &
        (df[['network_url', 'formulary_url', 'sbc_download_url', 'treatment_cost_calculator_url']].astype(str).apply(lambda x: (x.str.contains(r'\A(https?:\/\/)([a-zA-Z0-9\-_])*(\.)*([a-zA-Z0-9\-]+)\.([a-zA-Z\.]{2,5})(\.*.*)?\Z')) | x.isin(['nan'])).all(axis=1)) &
        (df[['promotional_label']].astype(str).apply(lambda x: (x.str.len() <= 65) | x.isin(['nan'])).all(axis=1)) &
        # (df[['sort_rank_override']].astype(str).apply(lambda x: (x.str.isdigit()) | x.isin(['nan'])).all(axis=1)) &
        ((df['hios_plan_identifier'].notnull()) & (df['hios_plan_identifier'].str.len() >= 10) & (df['hios_plan_identifier'].str.contains(r'\A(\d{5}[A-Z]{2}[a-zA-Z0-9]{3,7}-TMP|\d{5}[A-Z]{2}\d{3,7}(\-?\d{2})*)\Z'))) &
        (df['type'].isin(['MetalPlan', 'MedicarePlan', 'BasicHealthPlan', 'DualPlan', 'MedicaidPlan', 'ChipPlan'])) &
        (df['price_period'].isin(['Monthly', 'Yearly'])) &
        (df['is_age_29_plan'].astype(str).isin(['True', 'False', 'nan']))]
# (df[['composite_rating']].astype(str).apply(lambda x: (x.str.isin(['True', 'False']) & x.isnotin(['nan'])).all(axis=1)))]
This throws
KeyError: "['name'] not in index"
when a column is not present in my dataframe. I need to handle this for all columns. How can I efficiently add a check to the above code so that each validation runs only when its column is present?

You can use Index.intersection to select only the columns that exist:
L = ['name', 'issuer_id', 'service_area_id']
cols = df.columns.intersection(L)
(df[cols].notnull().all(axis=1))
EDIT:
df = pd.DataFrame({
    'name': list('abcdef'),
    'plan_year': [2015, 2015, 2015, 5, 5, 4],
})
print (df)
name plan_year
0 a 2015
1 b 2015
2 c 2015
3 d 5
4 e 5
5 f 4
The idea is to first create a dictionary of valid values for each column:
valid = {'name': 'a',
         'issuer_id': 'a',
         'service_area_id': 'a',
         'plan_year': 2015,
         ...}
Then filter the dictionary down to the columns missing from the DataFrame, assign those as new columns on the original, and work with the resulting DataFrame:
d1 = {k: v for k, v in valid.items() if k not in df.columns}
print (d1)
{'issuer_id': 'a', 'service_area_id': 'a'}
df1 = df.assign(**d1)
print (df1)
name plan_year issuer_id service_area_id
0 a 2015 a a
1 b 2015 a a
2 c 2015 a a
3 d 5 a a
4 e 5 a a
5 f 4 a a
Then apply the filters:
m1 = (df1[['name', 'issuer_id', 'service_area_id']].notnull().all(axis=1))
m2 = ((df1['plan_year'].notnull()) &
      (df1['plan_year'].astype(str).str.isdigit()) &
      (df1['plan_year'].astype(str).str.len() == 4))
df1 = df1[m1 & m2]
print (df1)
name plan_year issuer_id service_area_id
0 a 2015 a a
1 b 2015 a a
2 c 2015 a a
Finally, you can remove the helper columns:
df1 = df1[m1 & m2].drop(d1.keys(), axis=1)
print (df1)
name plan_year
0 a 2015
1 b 2015
2 c 2015

Add another variable, columns, and filter it down to the ones that exist in df:
columns = ['name', 'issuer_id', 'service_area_id']
existing = [i for i in columns if i in df.columns]
df = df[(df[existing]...
EDIT
You could also assign each condition to a variable and use it later like this:
cond1 = df['is_age_29_plan'].astype(str).isin(['True', 'False', 'nan']) if 'is_age_29_plan' in df.columns else True
Then use cond1 in your filtering statement.
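Putting both ideas together, here is a minimal sketch of the whole pattern; mask_if_present is a hypothetical helper (not from either answer), and only two of the original checks are shown:
import pandas as pd

def mask_if_present(df, cols, builder):
    # Build the mask only when every column exists; otherwise return an
    # all-True mask so the check is skipped for missing columns.
    if set(cols).issubset(df.columns):
        return builder(df[cols])
    return pd.Series(True, index=df.index)

m1 = mask_if_present(df, ['name', 'issuer_id', 'service_area_id'],
                     lambda d: d.notnull().all(axis=1))
m2 = mask_if_present(df, ['plan_year'],
                     lambda d: d['plan_year'].notnull()
                               & d['plan_year'].astype(str).str.isdigit()
                               & (d['plan_year'].astype(str).str.len() == 4))
df = df[m1 & m2]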

Related

Aggregate and concatenate multiple columns

I want to groupby my dataframe and concatenate the values/strings from the other columns together.
Year Letter Number Note Text
0 2022 a 1 8 hi
1 2022 b 1 7 hello
2 2022 a 1 6 bye
3 2022 b 3 5 joe
To this:
Column
Year Letter
2022 a 1|8|hi; 1|6|bye
b 1|7|hello; 3|5|joe
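For reference, the example frame shown above can presumably be reconstructed like this (a sketch, not part of the original question):
import pandas as pd

df = pd.DataFrame({'Year': [2022, 2022, 2022, 2022],
                   'Letter': ['a', 'b', 'a', 'b'],
                   'Number': [1, 1, 1, 3],
                   'Note': [8, 7, 6, 5],
                   'Text': ['hi', 'hello', 'bye', 'joe']})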
I tried some things with groupby, apply() and agg() but I can't get it to work:
df.groupby(['Year', 'Letter']).agg(lambda x: '|'.join(x))
Output:
Text
Year Letter
2022 a hi|bye
b hello|joe
You can first join each row's values (converted to strings via DataFrame.astype) with DataFrame.agg, and then aggregate the joined strings in GroupBy.agg:
df1 = (df.assign(Text=df[['Number','Note','Text']].astype(str).agg('|'.join, axis=1))
         .groupby(['Year', 'Letter'])['Text']
         .agg('; '.join)
         .to_frame())
print (df1)
Text
Year Letter
2022 a 1|8|hi; 1|6|bye
b 1|7|hello; 3|5|joe
Or create a custom lambda function for GroupBy.apply:
f = lambda x: '; '.join('|'.join(y) for y in x.astype(str).to_numpy())
df1 = (df.groupby(['Year', 'Letter'])[['Number','Note','Text']].apply(f)
         .to_frame(name='Text'))
print (df1)
Text
Year Letter
2022 a 1|8|hi; 1|6|bye
b 1|7|hello; 3|5|joe
If you need to join all columns except the grouping columns:
grouped = ['Year','Letter']
df1 = (df.assign(Text=df[df.columns.difference(grouped, sort=False)]
                       .astype(str).agg('|'.join, axis=1))
         .groupby(['Year', 'Letter'])['Text']
         .agg('; '.join)
         .to_frame())
Or the same with the lambda approach:
grouped = ['Year','Letter']
f = lambda x: '; '.join('|'.join(y) for y in x.astype(str).to_numpy())
df1 = (df.groupby(grouped)[df.columns.difference(grouped, sort=False)].apply(f)
         .to_frame(name='Text'))
Using groupby.apply:
cols = ['Year', 'Letter']
(df.groupby(cols)
   .apply(lambda d: '; '.join(d.drop(columns=cols)  # or slice the columns here
                               .astype(str)
                               .agg('|'.join, axis=1)))
   .to_frame(name='Column')
)
Output:
Column
Year Letter
2022 a 1|8|hi; 1|6|bye
b 1|7|hello; 3|5|joe

Pandas groupby and get nunique of multiple columns in a dataframe

I have a dataframe as below
stu_id,Mat_grade,sci_grade,eng_grade
1,A,C,A
1,A,C,A
1,B,C,A
1,C,C,A
2,D,B,B
2,D,C,B
2,D,D,C
2,D,A,C
tf = pd.read_clipboard(sep=',')
My objective is to:
a) find out how many different unique grades a student got under Mat_grade, sci_grade and eng_grade
So, I tried the below:
tf['mat_cnt'] = tf.groupby(['stu_id'])['Mat_grade'].nunique()
tf['sci_cnt'] = tf.groupby(['stu_id'])['sci_grade'].nunique()
tf['eng_cnt'] = tf.groupby(['stu_id'])['eng_grade'].nunique()
But this doesn't produce the expected output. Since I have more than 100K unique ids, an efficient and elegant solution would be really helpful.
I expect my output to be as below.
You can specify the column names in a list cols, call DataFrameGroupBy.nunique on those columns, and then rename:
cols = ['Mat_grade','sci_grade', 'eng_grade']
new = ['mat_cnt','sci_cnt','eng_cnt']
d = dict(zip(cols, new))
df = tf.groupby(['stu_id'], as_index=False)[cols].nunique().rename(columns=d)
print (df)
stu_id mat_cnt sci_cnt eng_cnt
0 1 3 1 1
1 2 1 4 2
Another idea is to use named aggregation:
cols = ['Mat_grade','sci_grade', 'eng_grade']
new = ['mat_cnt','sci_cnt','eng_cnt']
d = {v: (k,'nunique') for k, v in zip(cols, new)}
print (d)
{'mat_cnt': ('Mat_grade', 'nunique'),
'sci_cnt': ('sci_grade', 'nunique'),
'eng_cnt': ('eng_grade', 'nunique')}
df = tf.groupby(['stu_id'], as_index=False).agg(**d)
print (df)
stu_id mat_cnt sci_cnt eng_cnt
0 1 3 1 1
1 2 1 4 2
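For completeness: the original attempt failed because groupby(...).nunique() returns a result indexed by stu_id, which does not align with tf's row index when assigned back. If per-row counts were actually intended, GroupBy.transform keeps the original index; a sketch reusing cols and new from above:
for col, new_col in zip(cols, new):
    # transform broadcasts the per-student count back onto every row
    tf[new_col] = tf.groupby('stu_id')[col].transform('nunique')
print(tf)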

How to merge many DataFrames by index combining values where columns overlap?

I have many DataFrames that I need to merge.
Let's say:
base: id constraint
1 'a'
2 'b'
3 'c'
df_1: id value constraint
1 1 'a'
2 2 'a'
3 3 'a'
df_2: id value constraint
1 1 'b'
2 2 'b'
3 3 'b'
df_3: id value constraint
1 1 'c'
2 2 'c'
3 3 'c'
If I try and merge all of them (it'll be in a loop), I get:
a = pd.merge(base, df_1, on=['id', 'constraint'], how='left')
b = pd.merge(a, df_2, on=['id', 'constraint'], how='left')
c = pd.merge(b, df_3, on=['id', 'constraint'], how='left')
id constraint value value_x value_y
1 'a' 1 NaN NaN
2 'b' NaN 2 NaN
3 'c' NaN NaN 3
The desired output would be:
id constraint value
1 'a' 1
2 'b' 2
3 'c' 3
I know about combine_first and it works, but I can't use this approach because it is thousands of times slower.
Is there a merge that can replace values in case of columns overlap?
It's somewhat similar to this question, with no answers.
Given your MCVE:
import pandas as pd
base = pd.DataFrame([1,2,3], columns=['id'])
df1 = pd.DataFrame([[1,1]], columns=['id', 'value'])
df2 = pd.DataFrame([[2,2]], columns=['id', 'value'])
df3 = pd.DataFrame([[3,3]], columns=['id', 'value'])
I would suggest first concatenating your dataframes (using a loop if needed):
df = pd.concat([df1, df2, df3])
And then merge:
pd.merge(base, df, on='id')
It yields:
id value
0 1 1
1 2 2
2 3 3
Update
Running the code with the new version of your question and the input provided by @Celius Stingher:
a = {'id':[1,2,3],'constrains':['a','b','c']}
b = {'id':[1,2,3],'value':[1,2,3],'constrains':['a','a','a']}
c = {'id':[1,2,3],'value':[1,2,3],'constrains':['b','b','b']}
d = {'id':[1,2,3],'value':[1,2,3],'constrains':['c','c','c']}
base = pd.DataFrame(a)
df1 = pd.DataFrame(b)
df2 = pd.DataFrame(c)
df3 = pd.DataFrame(d)
We get:
id constrains value
0 1 a 1
1 2 b 2
2 3 c 3
Which seems to be compliant with your expected output.
You can use ffill() for the purpose:
df_1 = pd.DataFrame({'val':[1]}, index=[1])
df_2 = pd.DataFrame({'val':[2]}, index=[2])
df_3 = pd.DataFrame({'val':[3]}, index=[3])
(pd.concat((df_1, df_2, df_3), axis=1)
   .ffill(axis=1)
   .iloc[:, -1]
)
Output:
1 1.0
2 2.0
3 3.0
Name: val, dtype: float64
For your new data:
base.merge(pd.concat((df1, df2, df3)),
           on=['id','constraint'],
           how='left')
output:
id constraint value
0 1 'a' 1
1 2 'b' 2
2 3 'c' 3
Conclusion: you are actually looking for the option how='left' in merge
If you only need to merge all dataframes with base (based on the edit):
import pandas as pd
a = {'id':[1,2,3],'constrains':['a','b','c']}
b = {'id':[1,2,3],'value':[1,2,3],'constrains':['a','a','a']}
c = {'id':[1,2,3],'value':[1,2,3],'constrains':['b','b','b']}
d = {'id':[1,2,3],'value':[1,2,3],'constrains':['c','c','c']}
base = pd.DataFrame(a)
df_1 = pd.DataFrame(b)
df_2 = pd.DataFrame(c)
df_3 = pd.DataFrame(d)
dataframes = [df_1, df_2, df_3]
for i in dataframes:
    base = base.merge(i, how='left', on=['id','constrains'])
summation = [col for col in base if col.startswith('value')]
base['value'] = base[summation].sum(axis=1)
base = base.dropna(how='any', axis=1)
print(base)
Output:
id constrains value
0 1 a 1.0
1 2 b 2.0
2 3 c 3.0
Those who simply want to do a merge, overriding the values (which is my case), can achieve that using this method, which is really similar to Celius Stingher's answer.
Documented version is on the original gist.
import pandas as pa

def rmerge(left, right, **kwargs):
    # Function to flatten lists from http://rosettacode.org/wiki/Flatten_a_list#Python
    def flatten(lst):
        return sum(([x] if not isinstance(x, list) else flatten(x) for x in lst), [])

    # Set default for removing overlapping columns in "left" to be true
    myargs = {'replace': 'left'}
    myargs.update(kwargs)

    # Remove the replace key from the argument dict to be sent to
    # the pandas merge command
    kwargs = {k: v for k, v in myargs.items() if k != 'replace'}

    if myargs['replace'] is not None:
        # Generate a list of overlapping column names not associated with the join
        skipcols = set(flatten([v for k, v in myargs.items() if k in ['on', 'left_on', 'right_on']]))
        leftcols = set(left.columns)
        rightcols = set(right.columns)
        dropcols = list((leftcols & rightcols).difference(skipcols))
        # Remove the overlapping column names from the appropriate DataFrame
        if myargs['replace'].lower() == 'left':
            left = left.copy().drop(dropcols, axis=1)
        elif myargs['replace'].lower() == 'right':
            right = right.copy().drop(dropcols, axis=1)

    df = pa.merge(left, right, **kwargs)
    return df
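A minimal usage sketch with hypothetical frames (not from the gist); with the default replace='left', the overlapping non-join column is taken from right:
left = pa.DataFrame({'id': [1, 2, 3], 'value': [10, 20, 30]})
right = pa.DataFrame({'id': [1, 2, 3], 'value': [1, 2, 3]})
print(rmerge(left, right, on='id'))  # 'value' ends up as 1, 2, 3 (from right)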

I want to remove all rows which have the result "unknown" in a certain column

I'm new to programming and to Python3/Pandas.
I have written a csv file to a dF and am using pandas and numpy. The dF contains a series of columns, A, B, C, etc. and several thousand rows of data for them (not all numerical). I want to remove all instances of "unknown" from the data frame.
I have tried:
dF = dF[dF['A' != 'unknown']]
but it gives me an error message.
You mean this?
df = df[df['A'] != 'unknown']
Or you can use query():
df = df.query('A != "unknown"')
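As a side note (my addition, not in the original answer): query() can also reference Python variables with the @ prefix, which avoids quoting issues:
bad = 'unknown'
df = df.query('A != @bad')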
You need to filter the data by boolean indexing:
df = pd.DataFrame({'A': ['a','unknown','b'],
                   'B': pd.date_range('2017-01-01', periods=3),
                   'C': [7,8,9],
                   'D': [1,3,5]})
print (df)
A B C D
0 a 2017-01-01 7 1
1 unknown 2017-01-02 8 3
2 b 2017-01-03 9 5
You need to enclose multiple conditions in parentheses because of operator precedence, and join them with the bitwise and (&) and or (|) operators:
df1 = df[(df['A'] != 'unknown') & (df['B'] > '2017-01-02')]
print (df1)
A B C D
2 b 2017-01-03 9 5
But if you need to process the data later:
df1['C'] = df1['C'] + 1
print (df1)
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
The problem is that if you modify values in df1 later, the modifications do not propagate back to the original data (df), and pandas warns you with a SettingWithCopyWarning.
The solution is to copy:
df1 = df[(df['A'] != 'unknown') & (df['B'] > '2017-01-02')].copy()
print (df1)
A B C D
2 b 2017-01-03 9 5
df1['C'] = df1['C'] + 1
print (df1)
A B C D
2 b 2017-01-03 10 5
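Since the question mentions removing all instances of "unknown" from the data frame, here is a sketch (my reading of the intent) that drops every row where any string column equals 'unknown':
df = df[~df.select_dtypes(include=object).eq('unknown').any(axis=1)]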

How to do an update on only part of a DataFrame

Let's say one has a DataFrame df1 with INDEX, Column1, Column2 and another df2 with INDEX, Column1, Column3.
Both INDEXes have similar values, so I want to use that to merge the information of one table onto the other.
I have been told to do as follows by other users:
df1.update(df2, join='left', overwrite=True)
This works if both INDEXes have similar values. The result is that df1 will now have INDEX, Column1 (from df2) and Column2 (original from df1). Column3 is not added to df1 (this behaviour is wanted vs. the "merge" command that adds everything).
Now, I would like to update df1 only on a few cases and based on Column2. I thought this would work:
df1[df1['Column2'] == 'Cond'].update(df2, join='left', overwrite=True)
But it doesn't; sometimes I get an error, other times the command works but ALL df1 values have been modified.
Any idea on how to do this?
PS: Using .loc won't work as that requires that whatever INDEX you search for exists and this is not the case.
EDIT: Additional example
In [37]: df1 = pd.DataFrame([['USA',1],['USA',2],['USA',3],['FRA',1],['FRA',2]], columns = ['country', 'value'])
In [38]: df2 = pd.DataFrame([['USA',10],['FRA',20]], columns = ['country', 'value'])
In [39]: df1 = df1.set_index('country')
In [40]: df2 = df2.set_index('country')
In [41]: mask = df1['value'] >= 2
In [42]: idx = df1.index[mask]
In [43]: idx = idx.unique()
In [44]: df1
Out[44]:
value
country
USA 1
USA 2
USA 3
FRA 1
FRA 2
In [45]: df2
Out[45]:
value
country
USA 10
FRA 20
In [46]: idx
Out[46]: array(['USA', 'FRA'], dtype=object)
In [47]: df1.update(df2.loc[idx])
In [48]: df1
Out[48]:
value
country
USA 10
USA 10
USA 10
FRA 20
FRA 20
Define the boolean mask:
mask = (df1['Column2'] == 'Cond')
If df1.index is identical to df2.index, then mask can be used to select
rows from df2 -- i.e., df2.loc[mask]. But if they are not identical, then
df2.loc[mask] may raise an error (if len(df1) != len(df2)), or worse, silently select the wrong rows
because the boolean mask is not aligning index values between df1 and df2.
So in the more general case when the indexes are not identical, the trick is to
convert the boolean mask into an Index that can be used to restrict
df2.
If df1.index is unique then call df1.update on the restricted df2:
idx = df1.index[mask]
df1.update(df2.loc[idx])
For example,
import pandas as pd
df1 = pd.DataFrame({'Column1':[1,2,3], 'Column2':['Cond',5,'Cond']}, index=['A','B','C'])
# Column1 Column2
# A 1 Cond
# B 2 5
# C 3 Cond
df2 = pd.DataFrame({'Column1':[10,20,30], 'Column3':[40,50,60]}, index=['D','B','C'])
# Column1 Column3
# D 10 40
# B 20 50
# C 30 60
mask = df1['Column2'] == 'Cond'
idx = df1.index[mask]
df1.update(df2.loc[idx])
print(df1)
prints
Column1 Column2
A 1 Cond
B 2 5
C 30 Cond
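One caveat: if some labels in idx happen to be missing from df2, df2.loc[idx] raises a KeyError in recent pandas versions. Intersecting the indexes first avoids that (and also addresses the PS about .loc requiring the labels to exist):
df1.update(df2.loc[df2.index.intersection(idx)])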
If df1.index is not unique, then make the index unique by adding mask to it:
df1['mask'] = df1['value'] >= 2
df2['mask'] = True
df1 = df1.set_index('mask', append=True)
df2 = df2.set_index('mask', append=True)
Then calling df1.update(df2) produces the desired result because update aligns indices.
For example,
import pandas as pd
df1 = pd.DataFrame([['USA',1],['USA',2],['USA',3],['FRA',1],['FRA',2]],
                   columns=['country', 'value'])
df2 = pd.DataFrame([['USA',10],['FRA',20]], columns = ['country', 'value'])
df1 = df1.set_index('country')
# value
# country
# USA 1
# USA 2
# USA 3
# FRA 1
# FRA 2
df2 = df2.set_index('country')
# value
# country
# USA 10
# FRA 20
df1['mask'] = df1['value'] >= 2
df2['mask'] = True
df1 = df1.set_index('mask', append=True)
# value
# country mask
# USA False 1
# True 2
# True 3
# FRA False 1
# True 2
df2 = df2.set_index('mask', append=True)
# value
# country mask
# USA True 10
# FRA True 20
df1.update(df2)
df1.index = df1.index.droplevel('mask')
print(df1)
yields
value
country
USA 1
USA 10
USA 10
FRA 1
FRA 20
