How to do an update on only part of a DataFrame - python

Let's say one has a DataFrame df1 with INDEX, Column1, Column2 and another DataFrame df2 with INDEX, Column1, Column3.
Both indexes share values, so I want to use the index to merge the information of one table into the other.
Other users have told me to do the following:
df1.update(df2, join='left', overwrite=True)
This works when both indexes share values. The result is that df1 now has INDEX, Column1 (updated from df2) and Column2 (original to df1). Column3 is not added to df1 (this behaviour is wanted, as opposed to "merge", which adds everything).
Now, I would like to update df1 only in a few cases, based on Column2. I thought this would work:
df1[df1['Column2'] == 'Cond'].update(df2, join='left', overwrite=True)
But it doesn't; sometimes I get an error, other times the command runs but ALL of df1's values get modified.
Any idea on how to do this?
PS: Using .loc won't work, as it requires that every INDEX value you search for exists, and that is not the case here.
EDIT: Additional example
In [37]: df1 = pd.DataFrame([['USA',1],['USA',2],['USA',3],['FRA',1],['FRA',2]], columns = ['country', 'value'])
In [38]: df2 = pd.DataFrame([['USA',10],['FRA',20]], columns = ['country', 'value'])
In [39]: df1 = df1.set_index('country')
In [40]: df2 = df2.set_index('country')
In [41]: mask = df1['value'] >= 2
In [42]: idx = df1.index[mask]
In [43]: idx = idx.unique()
In [44]: df1
Out[44]:
         value
country
USA          1
USA          2
USA          3
FRA          1
FRA          2
In [45]: df2
Out[45]:
         value
country
USA         10
FRA         20
In [46]: idx
Out[46]: array(['USA', 'FRA'], dtype=object)
In [47]: df1.update(df2.loc[idx])
In [48]: df1
Out[48]:
         value
country
USA         10
USA         10
USA         10
FRA         20
FRA         20

Define the boolean mask
mask = (df1['Column2'] == 'Cond')
If df1.index is identical to df2.index, then mask can be used directly to select
rows from df2 -- i.e., df2.loc[mask]. But if they are not identical, then
df2.loc[mask] may raise an error (when len(df1) != len(df2)) or, worse, silently select the wrong rows,
because the boolean mask does not align index values between df1 and df2.
So in the more general case when the indexes are not identical, the trick is to
convert the boolean mask into an Index that can be used to restrict
df2.
If df1.index is unique then call df1.update on the restricted df2:
idx = df1.index[mask]
df1.update(df2.loc[idx])
For example,
import pandas as pd
df1 = pd.DataFrame({'Column1':[1,2,3], 'Column2':['Cond',5,'Cond']}, index=['A','B','C'])
# Column1 Column2
# A 1 Cond
# B 2 5
# C 3 Cond
df2 = pd.DataFrame({'Column1':[10,20,30], 'Column3':[40,50,60]}, index=['D','B','C'])
# Column1 Column3
# D 10 40
# B 20 50
# C 30 60
mask = df1['Column2'] == 'Cond'
idx = df1.index[mask]
df1.update(df2.loc[idx])
print(df1)
prints
   Column1 Column2
A        1    Cond
B        2       5
C       30    Cond
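One caveat with this recipe: on recent pandas versions, df2.loc[idx] raises a KeyError when some of the masked labels (here 'A') are missing from df2; older versions returned NaN rows, which update simply ignored. A small guard, assuming rows with no counterpart in df2 should just be skipped:
# Keep only the masked labels that actually exist in df2
safe_idx = idx.intersection(df2.index)
df1.update(df2.loc[safe_idx])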
If df1.index is not unique, then make the index unique by adding mask to it:
df1['mask'] = df1['value'] >= 2
df2['mask'] = True
df1 = df1.set_index('mask', append=True)
df2 = df2.set_index('mask', append=True)
Then calling df1.update(df2) produces the desired result because update aligns indices.
For example,
import pandas as pd
df1 = pd.DataFrame([['USA',1],['USA',2],['USA',3],['FRA',1],['FRA',2]],
columns = ['country', 'value'])
df2 = pd.DataFrame([['USA',10],['FRA',20]], columns = ['country', 'value'])
df1 = df1.set_index('country')
#          value
# country
# USA          1
# USA          2
# USA          3
# FRA          1
# FRA          2
df2 = df2.set_index('country')
#          value
# country
# USA         10
# FRA         20
df1['mask'] = df1['value'] >= 2
df2['mask'] = True
df1 = df1.set_index('mask', append=True)
#                value
# country mask
# USA     False      1
#         True       2
#         True       3
# FRA     False      1
#         True       2
df2 = df2.set_index('mask', append=True)
#                value
# country mask
# USA     True      10
# FRA     True      20
df1.update(df2)
df1.index = df1.index.droplevel('mask')
print(df1)
yields
         value
country
USA          1
USA         10
USA         10
FRA          1
FRA         20
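The two recipes can be folded into a single helper. The sketch below is a convenience only (the name update_where is made up here); it assumes mask is a boolean Series aligned with df1, and always takes the non-unique-safe route of appending the mask as an extra index level:
import pandas as pd

def update_where(df1, df2, mask):
    # Update df1 from df2, but only on the rows of df1 where mask is True.
    left = df1.set_index(pd.Index(mask, name='__mask__'), append=True)
    right = df2.set_index(pd.Index([True] * len(df2), name='__mask__'), append=True)
    left.update(right)  # update aligns on (original label, __mask__)
    return left.droplevel('__mask__')
For the example above, df1 = update_where(df1, df2, df1['value'] >= 2) reproduces the same output.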

Related

Merging df in python

Say I have two DataFrames
df1 = pd.DataFrame({'A':[1,2], 'B':[3,4]}, index = [0,1])
df2 = pd.DataFrame({'B':[8,9], 'C':[10,11]}, index = [1,2])
I want to merge so that any value in df1 is overwritten if there is a value in df2 at that location, and any new values in df2 are added, including new rows and columns.
The result should be:
A B C
0 1 3 nan
1 2 8 10
2 nan 9 11
I've tried combine_first, but that only overwrites NaN values.
update has the issue where new rows are created rather than overwritten.
merge has many issues.
I've tried writing my own function
def take_right(df1, df2, j, i):
    print(df1)
    print(df2)
    try:
        s1 = df1[j][i]
    except:
        s1 = np.NaN
    try:
        s2 = df2[j][i]
    except:
        s2 = np.NaN
    if math.isnan(s2):
        #print(s1)
        return s1
    else:
        # print(s2)
        return s2

def combine_df(df1, df2):
    rows = (set(df1.index.values.tolist()) | set(df2.index.values.tolist()))
    #print(rows)
    columns = (set(df1.columns.values.tolist()) | set(df2.columns.values.tolist()))
    #print(columns)
    df = pd.DataFrame()
    #df.columns = columns
    for i in rows:
        #df[:][i]=[]
        for j in columns:
            df = df.insert(int(i), j, take_right(df1, df2, j, i), allow_duplicates=False)
            # print(df)
    return df
This won't add new columns or rows to an empty DataFrame.
Thank you!!
One approach is to create an empty output DataFrame with the union of the columns and indices from df1 and df2, and then use the df.update method to assign their values into out_df:
import pandas as pd
df1 = pd.DataFrame({'A':[1,2], 'B':[3,4]}, index = [0,1])
df2 = pd.DataFrame({'B':[8,9], 'C':[10,11]}, index = [1,2])
out_df = pd.DataFrame(
    columns=df1.columns.union(df2.columns),
    index=df1.index.union(df2.index),
)
out_df.update(df1)
out_df.update(df2)
out_df
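One caveat: since out_df is created empty, its columns start out with object dtype, and update does not change that. If numeric dtypes matter downstream, infer_objects() can restore them afterwards:
# The empty frame starts as object dtype; convert back after the updates
out_df = out_df.infer_objects()
print(out_df.dtypes)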
Why does combine_first not work?
df = df2.combine_first(df1)
print(df)
Output:
     A  B     C
0  1.0  3   NaN
1  2.0  8  10.0
2  NaN  9  11.0
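Note that the call order decides priority: combine_first keeps the caller's non-null values and only fills the gaps from the argument. For contrast, reversing the call keeps df1's values wherever both frames have data:
print(df1.combine_first(df2))
#      A  B     C
# 0  1.0  3   NaN
# 1  2.0  4  10.0
# 2  NaN  9  11.0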

How to merge many DataFrames by index combining values where columns overlap?

I have many DataFrames that I need to merge.
Let's say:
base:  id constraint
        1 'a'
        2 'b'
        3 'c'

df_1:  id value constraint
        1     1 'a'
        2     2 'a'
        3     3 'a'

df_2:  id value constraint
        1     1 'b'
        2     2 'b'
        3     3 'b'

df_3:  id value constraint
        1     1 'c'
        2     2 'c'
        3     3 'c'
If I try and merge all of them (it'll be in a loop), I get:
a = pd.merge(base, df_1, on=['id', 'constraint'], how='left')
b = pd.merge(a, df_2, on=['id', 'constraint'], how='left')
c = pd.merge(b, df_3, on=['id', 'constraint'], how='left')
id constraint value value_x value_y
 1        'a'     1     NaN     NaN
 2        'b'   NaN       2     NaN
 3        'c'   NaN     NaN       3
The desired output would be:
id constraint value
 1        'a'     1
 2        'b'     2
 3        'c'     3
I know about combine_first and it works, but I can't use that approach because it is thousands of times slower.
Is there a merge that can replace values in case of columns overlap?
It's somewhat similar to this question, with no answers.
Given your MCVE:
import pandas as pd
base = pd.DataFrame([1,2,3], columns=['id'])
df1 = pd.DataFrame([[1,1]], columns=['id', 'value'])
df2 = pd.DataFrame([[2,2]], columns=['id', 'value'])
df3 = pd.DataFrame([[3,3]], columns=['id', 'value'])
I would suggest concatenating your dataframes first (using a loop if needed):
df = pd.concat([df1, df2, df3])
And then merge:
pd.merge(base, df, on='id')
It yields:
   id  value
0   1      1
1   2      2
2   3      3
Update
Running the code with the new version of your question and the input provided by @Celius Stingher:
a = {'id':[1,2,3],'constrains':['a','b','c']}
b = {'id':[1,2,3],'value':[1,2,3],'constrains':['a','a','a']}
c = {'id':[1,2,3],'value':[1,2,3],'constrains':['b','b','b']}
d = {'id':[1,2,3],'value':[1,2,3],'constrains':['c','c','c']}
base = pd.DataFrame(a)
df1 = pd.DataFrame(b)
df2 = pd.DataFrame(c)
df3 = pd.DataFrame(d)
We get:
   id constrains  value
0   1          a      1
1   2          b      2
2   3          c      3
Which seems to be compliant with your expected output.
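For completeness, a sketch of the calls presumably behind this run (the same concat-then-merge recipe; note that these frames spell the key column 'constrains'):
df = pd.concat([df1, df2, df3])
print(pd.merge(base, df, on=['id', 'constrains'], how='left'))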
You can use ffill() for the purpose:
df_1 = pd.DataFrame({'val':[1]}, index=[1])
df_2 = pd.DataFrame({'val':[2]}, index=[2])
df_3 = pd.DataFrame({'val':[3]}, index=[3])
(pd.concat((df_1, df_2, df_3), axis=1)
   .ffill(axis=1)
   .iloc[:, -1]
)
Output:
1 1.0
2 2.0
3 3.0
Name: val, dtype: float64
For your new data:
base.merge(pd.concat((df1, df2, df3)),
           on=['id', 'constraint'],
           how='left')
output:
   id constraint  value
0   1        'a'      1
1   2        'b'      2
2   3        'c'      3
Conclusion: you are actually looking for merge with the how='left' option.
If you only need to merge all dataframes with base (based on the edit):
import pandas as pd

a = {'id':[1,2,3],'constrains':['a','b','c']}
b = {'id':[1,2,3],'value':[1,2,3],'constrains':['a','a','a']}
c = {'id':[1,2,3],'value':[1,2,3],'constrains':['b','b','b']}
d = {'id':[1,2,3],'value':[1,2,3],'constrains':['c','c','c']}

base = pd.DataFrame(a)
df_1 = pd.DataFrame(b)
df_2 = pd.DataFrame(c)
df_3 = pd.DataFrame(d)

dataframes = [df_1, df_2, df_3]
for i in dataframes:
    base = base.merge(i, how='left', on=['id', 'constrains'])

summation = [col for col in base if col.startswith('value')]
base['value'] = base[summation].sum(axis=1)
base = base.dropna(how='any', axis=1)
print(base)
Output:
   id constrains  value
0   1          a    1.0
1   2          b    2.0
2   3          c    3.0
Those who simply want to do a merge that overrides the values (my case) can achieve that with the method below, which is really similar to Celius Stingher's answer.
A documented version is in the original gist.
import pandas as pa

def rmerge(left, right, **kwargs):
    # Function to flatten lists from http://rosettacode.org/wiki/Flatten_a_list#Python
    def flatten(lst):
        return sum(([x] if not isinstance(x, list) else flatten(x) for x in lst), [])

    # Set default for removing overlapping columns in "left" to be true
    myargs = {'replace': 'left'}
    myargs.update(kwargs)

    # Remove the replace key from the argument dict to be sent to
    # pandas merge command
    kwargs = {k: v for k, v in myargs.items() if k != 'replace'}

    if myargs['replace'] is not None:
        # Generate a list of overlapping column names not associated with the join
        skipcols = set(flatten([v for k, v in myargs.items() if k in ['on', 'left_on', 'right_on']]))
        leftcols = set(left.columns)
        rightcols = set(right.columns)
        dropcols = list((leftcols & rightcols).difference(skipcols))
        # Remove the overlapping column names from the appropriate DataFrame
        if myargs['replace'].lower() == 'left':
            left = left.copy().drop(dropcols, axis=1)
        elif myargs['replace'].lower() == 'right':
            right = right.copy().drop(dropcols, axis=1)

    df = pa.merge(left, right, **kwargs)
    return df
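A hypothetical usage sketch (the two frames below are invented for illustration): with the default replace='left', overlapping non-key columns are dropped from left, so the merged result keeps right's values:
left = pa.DataFrame({'id': [1, 2], 'value': [10, 20]})
right = pa.DataFrame({'id': [1, 2], 'value': [100, 200]})

# 'value' overlaps and is not a join key, so it is dropped from `left`
print(rmerge(left, right, on='id'))
#    id  value
# 0   1    100
# 1   2    200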

Avoiding KeyError in dataframe

I am validating my dataframe with the code below:
df = df[(df[['name', 'issuer_id', 'service_area_id']].notnull().all(axis=1)) &
        ((df['plan_year'].notnull()) & (df['plan_year'].astype(str).str.isdigit()) & (df['plan_year'].astype(str).str.len() == 4)) &
        (df[['network_url', 'formulary_url', 'sbc_download_url', 'treatment_cost_calculator_url']].astype(str).apply(lambda x: (x.str.contains('\A(https?:\/\/)([a-zA-Z0-9\-_])*(\.)*([a-zA-Z0-9\-]+)\.([a-zA-Z\.]{2,5})(\.*.*)?\Z')) | x.isin(['nan'])).all(axis=1)) &
        (df[['promotional_label']].astype(str).apply(lambda x: (x.str.len() <= 65) | x.isin(['nan'])).all(axis=1)) &
        # (df[['sort_rank_override']].astype(str).apply(lambda x: (x.str.isdigit()) | x.isin(['nan'])).all(axis=1)) &
        ((df['hios_plan_identifier'].notnull()) & (df['hios_plan_identifier'].str.len() >= 10) & (df['hios_plan_identifier'].str.contains('\A(\d{5}[A-Z]{2}[a-zA-Z0-9]{3,7}-TMP|\d{5}[A-Z]{2}\d{3,7}(\-?\d{2})*)\Z'))) &
        (df['type'].isin(['MetalPlan', 'MedicarePlan', 'BasicHealthPlan', 'DualPlan', 'MedicaidPlan', 'ChipPlan'])) &
        (df['price_period'].isin(['Monthly', 'Yearly'])) &
        (df['is_age_29_plan'].astype(str).isin(['True', 'False', 'nan']))]
        # (df[['composite_rating']].astype(str).apply(lambda x: (x.str.isin(['True', 'False']) & x.isnotin(['nan'])).all(axis=1)))]
This throws
KeyError: "['name'] not in index"
when a column is not present in my dataframe. I need to handle this for all columns. How can I efficiently add a check to the code above so that a validation only runs when its column is present?
You can use intersection:
L = ['name', 'issuer_id', 'service_area_id']
cols = df.columns.intersection(L)
(df[cols].notnull().all(axis=1))
EDIT:
df = pd.DataFrame({
    'name': list('abcdef'),
    'plan_year': [2015, 2015, 2015, 5, 5, 4],
})
print (df)
  name  plan_year
0    a       2015
1    b       2015
2    c       2015
3    d          5
4    e          5
5    f          4
The idea is to first create a dictionary of valid values for each column:
valid = {'name':'a',
'issuer_id':'a',
'service_area_id':'a',
'plan_year':2015,
...}
Then filter that dictionary down to the missing columns, assign them to the original DataFrame, and create a new DataFrame:
d1 = {k: v for k, v in valid.items() if k in set(valid.keys()) - set(df.columns)}
print (d1)
{'issuer_id': 'a', 'service_area_id': 'a'}
df1 = df.assign(**d1)
print (df1)
  name  plan_year issuer_id service_area_id
0    a       2015         a               a
1    b       2015         a               a
2    c       2015         a               a
3    d          5         a               a
4    e          5         a               a
5    f          4         a               a
Last filter:
m1 = (df1[['name', 'issuer_id', 'service_area_id']].notnull().all(axis=1))
m2 = ((df1['plan_year'].notnull()) &
      (df1['plan_year'].astype(str).str.isdigit()) &
      (df1['plan_year'].astype(str).str.len() == 4))
df1 = df1[m1 & m2]
print (df1)
  name  plan_year issuer_id service_area_id
0    a       2015         a               a
1    b       2015         a               a
2    c       2015         a               a
Finally, you can remove the helper columns:
df1 = df1[m1 & m2].drop(d1.keys(), axis=1)
print (df1)
  name  plan_year
0    a       2015
1    b       2015
2    c       2015
Add another variable called columns and filter it down to the ones that exist in df:
columns = ['name', 'issuer_id', 'service_area_id']
existing = [i for i in columns if i in df.columns]
df = df[(df[existing]...
EDIT
You could also assign each condition to a variable and use it later like this:
cond1 = (df['is_age_29_plan'].astype(str).isin(['True', 'False', 'nan'])
         if 'is_age_29_plan' in df.columns else True)
Then, use the cond1 in your filtering statement.
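Combining several such guarded conditions then looks like the sketch below (column names taken from the question, pandas assumed imported as pd). Using an all-True Series instead of the scalar True as the fallback keeps the final boolean indexing valid even if every guarded column happens to be missing:
default = pd.Series(True, index=df.index)  # no-op condition for absent columns

cond1 = (df['is_age_29_plan'].astype(str).isin(['True', 'False', 'nan'])
         if 'is_age_29_plan' in df.columns else default)
cond2 = (df['price_period'].isin(['Monthly', 'Yearly'])
         if 'price_period' in df.columns else default)

df = df[cond1 & cond2]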

pandas convert grouped rows into columns

I have a dataframe such as:
label  column1
    a        1
    a        2
    b        6
    b        4
I would like to make a dataframe with a new column, with the opposite value from column1 where the labels match. Such as:
label  column1  column2
    a        1        2
    a        2        1
    b        6        4
    b        4        6
I know this is probably very simple to do with a groupby command but I've been searching and can't find anything.
The following uses groupby and apply and seems to work okay:
import numpy as np
import pandas as pd

x = pd.DataFrame({'label': ['a', 'a', 'b', 'b'],
                  'column1': [1, 2, 6, 4]})
y = x.groupby('label').apply(
    lambda g: g.assign(column2=np.asarray(g.column1[::-1])))
y = y.reset_index(drop=True)  # optional: drop weird index
print(y)
You can try the code block below:
# Create the DataFrame
df = pd.DataFrame({'label': ['a', 'a', 'b', 'b'],
                   'column1': [1, 2, 6, 4]})

# Group by label
a = df.groupby('label').first().reset_index()
b = df.groupby('label').last().reset_index()

# Concat those groups to create column2
df2 = (pd.concat([b, a])
         .sort_values(by='label')
         .rename(columns={'column1': 'column2'})
         .reset_index()
         .drop('index', axis=1))

# Merge with the original DataFrame
df = df.merge(df2, left_index=True, right_index=True, on='label')[['label', 'column1', 'column2']]
Hope this helps
Assuming there are only pairs of labels, you could use the following as well:
# Create dataframe
df = pd.DataFrame(data={'label': ['a', 'a', 'b', 'b'],
                        'column1': [1, 2, 6, 4]})

# Iterate over the dataframe, identify the matching label and opposite value
for index, row in df.iterrows():
    newvalue = int(df[(df.label == row.label) & (df.column1 != row.column1)].column1.values[0])
    # Set the value on the new column (set_value has been removed from pandas; .at is the modern equivalent)
    df.at[index, 'column2'] = newvalue
df.head()
You can use groupby with apply, creating a new Series in reverse order for each group:
df['column2'] = df.groupby('label')["column1"] \
                  .apply(lambda x: pd.Series(x[::-1].values)).reset_index(drop=True)
print (df)
   column1 label  column2
0        1     a        2
1        2     a        1
2        6     b        4
3        4     b        6
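For what it's worth, a more concise variant of the same reversal (a sketch, assuming each label group should simply be read back-to-front) uses transform, which realigns the returned array positionally:
# Reverse column1 within each label group
df['column2'] = df.groupby('label')['column1'].transform(lambda s: s.values[::-1])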

How to assign a value_count output to a dataframe

I am trying to assign the output from a value_count to a new df. My code follows.
import pandas as pd
import glob
df = pd.concat((pd.read_csv(f, names=['date','bill_id','sponsor_id']) for f in glob.glob('/home/jayaramdas/anaconda3/df/s11?_s_b')))
column_list = ['date', 'bill_id']
df = df.set_index(column_list, drop = True)
df = df['sponsor_id'].value_counts()
df.columns=['sponsor', 'num_bills']
print (df)
The value counts are not being assigned the column headers 'sponsor' and 'num_bills' that I specified. I'm getting the following output from print:
1036    426
791     408
1332    401
1828    388
136     335
Name: sponsor_id, dtype: int64
Your column lengths don't match: you read 3 columns from the csv and then set the index to 2 of them. value_counts produces a Series whose index holds the column values and whose values hold the counts, so you need to reset_index and then overwrite the column names:
df = df.reset_index()
df.columns=['sponsor', 'num_bills']
Example:
In [276]:
df = pd.DataFrame({'col_name':['a','a','a','b','b']})
df
Out[276]:
  col_name
0        a
1        a
2        a
3        b
4        b
In [277]:
df['col_name'].value_counts()
Out[277]:
a    3
b    2
Name: col_name, dtype: int64
In [278]:
type(df['col_name'].value_counts())
Out[278]:
pandas.core.series.Series
In [279]:
df = df['col_name'].value_counts().reset_index()
df.columns = ['col_name', 'count']
df
Out[279]:
  col_name  count
0        a      3
1        b      2
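On newer pandas versions the same result can be written without reassigning df.columns, e.g. with rename_axis plus the name argument of reset_index (a sketch of the same value_counts idiom):
out = df['col_name'].value_counts().rename_axis('col_name').reset_index(name='count')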
Appending value_counts() to multi-column dataframe:
df = pd.DataFrame({'C1':['A','B','A'],'C2':['A','B','A']})
vc_df = df.value_counts().to_frame('Count').reset_index()
display(df, vc_df)
  C1 C2
0  A  A
1  B  B
2  A  A

  C1 C2  Count
0  A  A      2
1  B  B      1
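If only some columns should define the groups, DataFrame.value_counts also accepts a subset argument (available since pandas 1.1):
# Count combinations of C1 only
print(df.value_counts(subset=['C1']).to_frame('Count').reset_index())
#   C1  Count
# 0  A      2
# 1  B      1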
