I have several data frames and need to do the same thing to all of them.
I'm currently doing this:
df1=df1.reindex(newindex)
df2=df2.reindex(newindex)
df3=df3.reindex(newindex)
df4=df4.reindex(newindex)
Is there a neater way of doing this?
Maybe something like
df=[df1,df2,df3,df4]
for d in df:
d=d.reindex(newindex)
Yes, your idea is close, but rebinding the loop variable d does not change the original DataFrames; assign the results to a new list of DataFrames instead, for example with a list comprehension:
dfs = [df1,df2,df3,df4]
dfs_new = [d.reindex(newindex) for d in dfs]
A nice variant is to unpack the list back into the original names, as suggested by @Joe Halliwell, thank you:
df1, df2, df3, df4 = [d.reindex(newindex) for d in dfs]
Or, as suggested by @roganjosh, you can create a dictionary of DataFrames:
dfs = [df1,df2,df3,df4]
names = ['a','b','c','d']
dfs_new_dict = {name: d.reindex(newindex) for name, d in zip(names, dfs)}
And then select each DataFrame by key:
print (dfs_new_dict['a'])
Sample:
df = pd.DataFrame({'a':[4,5,6]})
df1 = df * 10
df2 = df + 10
df3 = df - 10
df4 = df / 10
dfs = [df1,df2,df3,df4]
print (dfs)
[ a
0 40
1 50
2 60, a
0 14
1 15
2 16, a
0 -6
1 -5
2 -4, a
0 0.4
1 0.5
2 0.6]
newindex = [2,1,0]
df1, df2, df3, df4 = [d.reindex(newindex) for d in dfs]
print (df1)
print (df2)
print (df3)
print (df4)
a
2 60
1 50
0 40
a
2 16
1 15
0 14
a
2 -4
1 -5
0 -6
a
2 0.6
1 0.5
0 0.4
Or:
newindex = [2,1,0]
names = ['a','b','c','d']
dfs_new_dict = {name: d.reindex(newindex) for name, d in zip(names, dfs)}
print (dfs_new_dict['a'])
print (dfs_new_dict['b'])
print (dfs_new_dict['c'])
print (dfs_new_dict['d'])
If you have a lot of large dataframes, you can use multiple threads. I suggest the pathos module (it can be installed with pip install pathos):
from pathos.multiprocessing import ThreadPool
# create a thread pool with the default (maximum) number of threads
tPool = ThreadPool()
# apply the same function to every DataFrame in your list
newDFs = tPool.map(lambda df: df.reindex(newindex), dfs)
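If you would rather avoid a third-party dependency, a similar sketch works with the standard library's concurrent.futures (assuming dfs and newindex are defined as above; this is an alternative, not part of the original suggestion):
from concurrent.futures import ThreadPoolExecutor
# executor.map preserves the input order, so newDFs lines up with dfs
with ThreadPoolExecutor() as executor:
    newDFs = list(executor.map(lambda df: df.reindex(newindex), dfs))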
Say I have two DataFrames
df1 = pd.DataFrame({'A':[1,2], 'B':[3,4]}, index = [0,1])
df2 = pd.DataFrame({'B':[8,9], 'C':[10,11]}, index = [1,2])
I want to merge so that any value in df1 is overwritten if there is a value in df2 at that location, and any new values in df2 are added, including new rows and columns.
The result should be:
A B C
0 1 3 nan
1 2 8 10
2 nan 9 11
I've tried combine_first, but that causes only NaN values to be overwritten.
update has the issue where new rows are created rather than overwritten.
merge has many issues.
I've tried writing my own function
def take_right(df1, df2, j, i):
    print(df1)
    print(df2)
    try:
        s1 = df1[j][i]
    except:
        s1 = np.NaN
    try:
        s2 = df2[j][i]
    except:
        s2 = np.NaN
    if math.isnan(s2):
        #print(s1)
        return s1
    else:
        #print(s2)
        return s2
def combine_df(df1, df2):
    rows = set(df1.index.values.tolist()) | set(df2.index.values.tolist())
    #print(rows)
    columns = set(df1.columns.values.tolist()) | set(df2.columns.values.tolist())
    #print(columns)
    df = pd.DataFrame()
    #df.columns = columns
    for i in rows:
        #df[:][i]=[]
        for j in columns:
            df = df.insert(int(i), j, take_right(df1, df2, j, i), allow_duplicates=False)
            #print(df)
    return df
This won't add new columns or rows to an empty DataFrame.
Thank you!!
One approach is to create an empty output dataframe with the union of columns and indices from df1 and df2, and then use the DataFrame.update method to assign their values into out_df:
import pandas as pd
df1 = pd.DataFrame({'A':[1,2], 'B':[3,4]}, index = [0,1])
df2 = pd.DataFrame({'B':[8,9], 'C':[10,11]}, index = [1,2])
out_df = pd.DataFrame(
    columns=df1.columns.union(df2.columns),
    index=df1.index.union(df2.index),
)
out_df.update(df1)
out_df.update(df2)
out_df
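One caveat (an assumption about your goal, not part of the original answer): because out_df starts as all-NaN columns of object dtype, the values stay object-typed after the updates. If you want numeric dtypes back, pandas can re-infer them:
# re-infer column dtypes after the updates (the empty frame starts as object dtype)
out_df = out_df.infer_objects()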
Why does combine_first not work? Calling it on df2, so that df2's values take priority and df1 only fills the gaps, gives exactly the result you want:
df = df2.combine_first(df1)
print(df)
Output:
A B C
0 1.0 3 NaN
1 2.0 8 10.0
2 NaN 9 11.0
I have a dataframe like the one below
stu_id,Mat_grade,sci_grade,eng_grade
1,A,C,A
1,A,C,A
1,B,C,A
1,C,C,A
2,D,B,B
2,D,C,B
2,D,D,C
2,D,A,C
tf = pd.read_clipboard(sep=',')
My objective is to find out how many different unique grades each student got under Mat_grade, sci_grade and eng_grade.
So, I tried the below
tf['mat_cnt'] = tf.groupby(['stu_id'])['Mat_grade'].nunique()
tf['sci_cnt'] = tf.groupby(['stu_id'])['sci_grade'].nunique()
tf['eng_cnt'] = tf.groupby(['stu_id'])['eng_grade'].nunique()
But this doesn't produce the expected output. Since I have more than 100K unique ids, an efficient and elegant solution would be really helpful.
I expect my output to be as below:
You can specify the column names in a list cols, call DataFrameGroupBy.nunique on those columns, and rename the result:
cols = ['Mat_grade','sci_grade', 'eng_grade']
new = ['mat_cnt','sci_cnt','eng_cnt']
d = dict(zip(cols, new))
df = tf.groupby(['stu_id'], as_index=False)[cols].nunique().rename(columns=d)
print (df)
stu_id mat_cnt sci_cnt eng_cnt
0 1 3 1 1
1 2 1 4 2
Another idea is to use named aggregation:
cols = ['Mat_grade','sci_grade', 'eng_grade']
new = ['mat_cnt','sci_cnt','eng_cnt']
d = {v: (k,'nunique') for k, v in zip(cols, new)}
print (d)
{'mat_cnt': ('Mat_grade', 'nunique'),
'sci_cnt': ('sci_grade', 'nunique'),
'eng_cnt': ('eng_grade', 'nunique')}
df = tf.groupby(['stu_id'], as_index=False).agg(**d)
print (df)
stu_id mat_cnt sci_cnt eng_cnt
0 1 3 1 1
1 2 1 4 2
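If you instead wanted the counts repeated on every row of tf (which is what the original assignment was attempting), a hedged sketch using GroupBy.transform, which aligns its result back to the original index so plain column assignment works:
cols = ['Mat_grade', 'sci_grade', 'eng_grade']
new = ['mat_cnt', 'sci_cnt', 'eng_cnt']
# transform('nunique') returns one value per original row, unlike
# groupby(...).nunique(), which is indexed by stu_id
for col, name in zip(cols, new):
    tf[name] = tf.groupby('stu_id')[col].transform('nunique')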
I have 3 dataframes, as shown below
ID,col1,col2
1,X,35
2,X,37
3,nan,32
4,nan,34
5,X,21
df1 = pd.read_clipboard(sep=',',skipinitialspace=True)
ID,col1,col2
1,nan,305
2,X,307
3,X,302
4,nan,304
5,X,201
df2 = pd.read_clipboard(sep=',',skipinitialspace=True)
ID,col1,col2
1,X,315
2,nan,317
3,X,312
4,nan,314
5,X,21
df3 = pd.read_clipboard(sep=',',skipinitialspace=True)
Now I want to identify the IDs where col1 is NA in all 3 input dataframes.
So, I tried the below
L1=df1[df1['col1'].isna()]['ID'].tolist()
L2=df2[df2['col1'].isna()]['ID'].tolist()
L3=df3[df3['col1'].isna()]['ID'].tolist()
common_ids_all = list(set.intersection(*map(set, [L1,L2,L3])))
final_df = pd.concat([df1,df2,df3],ignore_index=True)
final_df[final_df['ID'].isin(common_ids_all)]
While the above works, is there a more efficient and elegant approach?
As you can see, I am repeating the same statement thrice (once per dataframe).
However, in my real data I have 12 dataframes, and I have to get the IDs where col1 is NA in all 12 of them.
Update: my current read operation looks like the below
fnames = ['file1.xlsx','file2.xlsx', 'file3.xlsx']
dfs=[]
NA_list=[]
def preprocessing(fname):
    df = pd.read_excel(fname, sheet_name="Sheet1")
    df.columns = df.iloc[7]
    df = df.iloc[8:, :]
    NA_list.append(df[df['col1'].isna()]['ID'])
    dfs.append(df)
[preprocessing(fname) for fname in fnames]
final_df = pd.concat(dfs, ignore_index=True)
L1 = NA_list[0]
L2 = NA_list[1]
L3 = NA_list[2]
final_list = (list(set.intersection(*map(set, [L1,L2,L3]))))
final_df[final_df['ID'].isin(final_list)]
You can use:
dfs = [df1, df2, df3]
final_df = pd.concat(dfs).query('col1.isna()')
final_df = final_df[final_df.groupby('ID')['ID'].transform('size') == len(dfs)]
print(final_df)
# Output
ID col1 col2
3 4 NaN 34
3 4 NaN 304
3 4 NaN 314
Full code:
fnames = ['file1.xlsx','file2.xlsx', 'file3.xlsx']
def preprocessing(fname):
    return pd.read_excel(fname, sheet_name='Sheet1', skiprows=6)
dfs = [preprocessing(fname) for fname in fnames]
final_df = pd.concat([df[df['col1'].isna()] for df in dfs])
final_df = final_df[final_df.groupby('ID')['ID'].transform('size') == len(dfs)]
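If you prefer to keep your original set-intersection idea but avoid repeating it once per frame (you mention 12 files), a minimal sketch, assuming the frames are already collected in a list dfs:
from functools import reduce
# IDs whose col1 is NaN, computed per frame, then intersected across all frames
na_ids = [set(df.loc[df['col1'].isna(), 'ID']) for df in dfs]
common_ids = reduce(set.intersection, na_ids)
final_df = pd.concat(dfs, ignore_index=True)
result = final_df[final_df['ID'].isin(common_ids)]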
There are times when a function gets you sorted. If the dataframe list will keep changing, I would wrap this in a function. If I got you right, the following will do:
def CombinedNaNs(lst):
    newdflist = []
    for d in lst:
        newdflist.append(d[d['col1'].isna()])
    s = pd.concat(newdflist)
    return s[s.duplicated(subset=['ID'], keep=False)].drop_duplicates()
dflist = [df1, df2, df3]  # list of dfs
CombinedNaNs(dflist)  # apply the function
ID col1 col2
3 4 NaN 34
3 4 NaN 304
3 4 NaN 314
How can I efficiently filter a df by multiple dictionary sets? An example is as follows:
df = pd.DataFrame({'A':[10,20,20,10,20], 'B':[0,1,0,1,1], 'C':['up','down','up','down','down'],'D':[100,200,200,100,100]})
filter_sets = [{'A':10, 'B':0, 'C':'up'}, {'A':20, 'B':1, 'C':'down'}]
I only know how to filter df by a single dictionary:
df.loc[(df[list(filter_set)] == pd.Series(filter_set)).all(axis=1)]
But is it possible to apply several dict masks at once?
(The format of filter_sets does not have to be exactly as above; anything that can filter on multiple columns is fine.)
Use np.logical_or.reduce with a list comprehension:
mask = np.logical_or.reduce([(df[list(x)]==pd.Series(x)).all(axis=1) for x in filter_sets])
#alternative solution
mask = (pd.concat([(df[list(x)]==pd.Series(x)).all(axis=1) for x in filter_sets], axis=1)
.any(axis=1))
df2 = df[mask]
print (df2)
A B C D
0 10 0 up 100
1 20 1 down 200
4 20 1 down 100
Or, if all dictionaries share the same keys, you can create a helper DataFrame and merge (by default merge joins on all common columns with an inner join):
df2 = pd.DataFrame(filter_sets).merge(df)
print (df2)
A B C D
0 10 0 up 100
1 20 1 down 200
2 20 1 down 100
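Note that the merge-based version returns a fresh 0..n-1 index rather than the original row labels of df. If you need to keep them, one possible sketch (the column name 'index' comes from reset_index and is not part of the original answer) carries the index through the merge as a column:
# keep the original row labels by carrying them through the merge
df2 = (df.reset_index()
         .merge(pd.DataFrame(filter_sets))
         .set_index('index')
         .rename_axis(None))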
I have many DataFrames that I need to merge.
Let's say:
base: id constraint
1 'a'
2 'b'
3 'c'
df_1: id value constraint
1 1 'a'
2 2 'a'
3 3 'a'
df_2: id value constraint
1 1 'b'
2 2 'b'
3 3 'b'
df_3: id value constraint
1 1 'c'
2 2 'c'
3 3 'c'
If I try and merge all of them (it'll be in a loop), I get:
a = pd.merge(base, df_1, on=['id', 'constraint'], how='left')
b = pd.merge(a, df_2, on=['id', 'constraint'], how='left')
c = pd.merge(b, df_3, on=['id', 'constraint'], how='left')
id constraint value value_x value_y
1 'a' 1 NaN NaN
2 'b' NaN 2 NaN
3 'c' NaN NaN 3
The desired output would be:
id constraint value
1 'a' 1
2 'b' 2
3 'c' 3
I know about combine_first and it works, but I can't use this approach because it is thousands of times slower.
Is there a merge that can replace values in case of columns overlap?
It's somewhat similar to this question, with no answers.
Given your MCVE:
import pandas as pd
base = pd.DataFrame([1,2,3], columns=['id'])
df1 = pd.DataFrame([[1,1]], columns=['id', 'value'])
df2 = pd.DataFrame([[2,2]], columns=['id', 'value'])
df3 = pd.DataFrame([[3,3]], columns=['id', 'value'])
I would suggest concatenating your dataframes first (building the list in a loop if needed):
df = pd.concat([df1, df2, df3])
And then merge:
pd.merge(base, df, on='id')
It yields:
id value
0 1 1
1 2 2
2 3 3
Update
Running the code with the new version of your question and the input provided by @Celius Stingher:
a = {'id':[1,2,3],'constrains':['a','b','c']}
b = {'id':[1,2,3],'value':[1,2,3],'constrains':['a','a','a']}
c = {'id':[1,2,3],'value':[1,2,3],'constrains':['b','b','b']}
d = {'id':[1,2,3],'value':[1,2,3],'constrains':['c','c','c']}
base = pd.DataFrame(a)
df1 = pd.DataFrame(b)
df2 = pd.DataFrame(c)
df3 = pd.DataFrame(d)
We get:
id constrains value
0 1 a 1
1 2 b 2
2 3 c 3
Which seems to be compliant with your expected output.
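The merge itself is not repeated in the update; presumably it now keys on both id and the constraint column. A minimal sketch under that assumption:
# concatenate the value frames, then left-merge onto base using both keys
df = pd.concat([df1, df2, df3])
result = pd.merge(base, df, on=['id', 'constrains'], how='left')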
You can use ffill() for the purpose:
df_1 = pd.DataFrame({'val':[1]}, index=[1])
df_2 = pd.DataFrame({'val':[2]}, index=[2])
df_3 = pd.DataFrame({'val':[3]}, index=[3])
(pd.concat((df_1, df_2, df_3), axis=1)
   .ffill(axis=1)
   .iloc[:, -1]
)
Output:
1 1.0
2 2.0
3 3.0
Name: val, dtype: float64
For your new data:
base.merge(pd.concat((df1,df2,df3)),
on=['id','constraint'],
how='left')
output:
id constraint value
0 1 'a' 1
1 2 'b' 2
2 3 'c' 3
Conclusion: you are actually looking for the option how='left' in merge
If you only need to merge all the dataframes with base (based on the edit):
import pandas as pd
a = {'id':[1,2,3],'constrains':['a','b','c']}
b = {'id':[1,2,3],'value':[1,2,3],'constrains':['a','a','a']}
c = {'id':[1,2,3],'value':[1,2,3],'constrains':['b','b','b']}
d = {'id':[1,2,3],'value':[1,2,3],'constrains':['c','c','c']}
base = pd.DataFrame(a)
df_1 = pd.DataFrame(b)
df_2 = pd.DataFrame(c)
df_3 = pd.DataFrame(d)
dataframes = [df_1,df_2,df_3]
for i in dataframes:
    base = base.merge(i, how='left', on=['id', 'constrains'])
summation = [col for col in base if col.startswith('value')]
base['value'] = base[summation].sum(axis=1)
base = base.dropna(how='any',axis=1)
print(base)
Output:
id constrains value
0 1 a 1.0
1 2 b 2.0
2 3 c 3.0
Those who simply want to do a merge, overriding the values (which is my case), can achieve that using this method, which is really similar to Celius Stingher's answer.
Documented version is on the original gist.
import pandas as pa
def rmerge(left, right, **kwargs):

    # Function to flatten lists from http://rosettacode.org/wiki/Flatten_a_list#Python
    def flatten(lst):
        return sum(([x] if not isinstance(x, list) else flatten(x) for x in lst), [])

    # Set default for removing overlapping columns in "left" to be true
    myargs = {'replace': 'left'}
    myargs.update(kwargs)

    # Remove the replace key from the argument dict to be sent to
    # the pandas merge command
    kwargs = {k: v for k, v in myargs.items() if k != 'replace'}

    if myargs['replace'] is not None:
        # Generate a list of overlapping column names not associated with the join
        skipcols = set(flatten([v for k, v in myargs.items() if k in ['on', 'left_on', 'right_on']]))
        leftcols = set(left.columns)
        rightcols = set(right.columns)
        dropcols = list((leftcols & rightcols).difference(skipcols))

        # Remove the overlapping column names from the appropriate DataFrame
        if myargs['replace'].lower() == 'left':
            left = left.copy().drop(dropcols, axis=1)
        elif myargs['replace'].lower() == 'right':
            right = right.copy().drop(dropcols, axis=1)

    df = pa.merge(left, right, **kwargs)
    return df
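A hypothetical usage sketch (the frames here are made up for illustration, not taken from the question), reusing the rmerge definition and the pa alias from above: both inputs carry a value column, and the default replace='left' drops the left frame's copy before merging, so the right frame's values win where the ids match.
old = pa.DataFrame({'id': [1, 2, 3], 'value': [10, 20, 30]})
new = pa.DataFrame({'id': [2, 3], 'value': [200, 300]})
# 'value' overlaps and is not a join key, so old['value'] is dropped first
merged = rmerge(old, new, on='id', how='left')
print(merged)
#    id  value
# 0   1    NaN
# 1   2  200.0
# 2   3  300.0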