Pandas find common NA records across multiple large dataframes - python

I have 3 dataframes, as shown below:
ID,col1,col2
1,X,35
2,X,37
3,nan,32
4,nan,34
5,X,21
df1 = pd.read_clipboard(sep=',',skipinitialspace=True)
ID,col1,col2
1,nan,305
2,X,307
3,X,302
4,nan,304
5,X,201
df2 = pd.read_clipboard(sep=',',skipinitialspace=True)
ID,col1,col2
1,X,315
2,nan,317
3,X,312
4,nan,314
5,X,21
df3 = pd.read_clipboard(sep=',',skipinitialspace=True)
Now I want to identify the IDs where col1 is NA in all 3 input dataframes.
So I tried the following:
L1=df1[df1['col1'].isna()]['ID'].tolist()
L2=df2[df2['col1'].isna()]['ID'].tolist()
L3=df3[df3['col1'].isna()]['ID'].tolist()
common_ids_all = list(set.intersection(*map(set, [L1,L2,L3])))
final_df = pd.concat([df1,df2,df3],ignore_index=True)
final_df[final_df['ID'].isin(common_ids_all)]
While the above works, is there a more efficient and elegant approach to do the same?
As you can see, I am repeating the same statement three times (once per dataframe).
However, in my real data I have 12 dataframes, and I have to get the IDs where col1 is NA in all 12 of them.
Update: my current read operation looks like the below:
fnames = ['file1.xlsx','file2.xlsx', 'file3.xlsx']
dfs=[]
NA_list=[]
def preprocessing(fname):
    df = pd.read_excel(fname, sheet_name="Sheet1")
    df.columns = df.iloc[7]
    df = df.iloc[8:, :]
    NA_list.append(df[df['col1'].isna()]['ID'])
    dfs.append(df)

[preprocessing(fname) for fname in fnames]
final_df = pd.concat(dfs, ignore_index=True)
L1 = NA_list[0]
L2 = NA_list[1]
L3 = NA_list[2]
final_list = list(set.intersection(*map(set, [L1, L2, L3])))
final_df[final_df['ID'].isin(final_list)]

You can use:
dfs = [df1, df2, df3]
final_df = pd.concat(dfs).query('col1.isna()')
final_df = final_df[final_df.groupby('ID')['ID'].transform('size') == len(dfs)]
print(final_df)
# Output
ID col1 col2
3 4 NaN 34
3 4 NaN 304
3 4 NaN 314
Full code:
fnames = ['file1.xlsx','file2.xlsx', 'file3.xlsx']
def preprocessing(fname):
    return pd.read_excel(fname, sheet_name='Sheet1', skiprows=6)

dfs = [preprocessing(fname) for fname in fnames]
final_df = pd.concat([df[df['col1'].isna()] for df in dfs])
final_df = final_df[final_df.groupby('ID')['ID'].transform('size') == len(dfs)]
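If you prefer your original set-intersection idea, it also scales to any number of dataframes without repeating statements. A minimal sketch, assuming the dfs list built above and the same ID/col1 columns:

import pandas as pd

# intersect the sets of NA IDs from every dataframe in one expression
na_ids = set.intersection(*(set(d.loc[d['col1'].isna(), 'ID']) for d in dfs))
final_df = pd.concat(dfs, ignore_index=True)
final_df = final_df[final_df['ID'].isin(na_ids)]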

These are the times when a function gets you sorted. If the list of dataframes will keep changing, I would wrap the logic in a function. If I understood you correctly, the following will do:
def CombinedNaNs(dflist):
    newdflist = []
    for d in dflist:
        newdflist.append(d[d['col1'].isna()])
    s = pd.concat(newdflist)
    return s[s.duplicated(subset=['ID'], keep=False)].drop_duplicates()

dflist = [df1, df2, df3]  # list of dfs
CombinedNaNs(dflist)  # apply the function
ID col1 col2
3 4 NaN 34
3 4 NaN 304
3 4 NaN 314
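One caveat: duplicated(keep=False) keeps any ID whose col1 is NA in at least two of the dataframes, not necessarily in all of them (with the sample data only ID 4 qualifies either way). A variant that requires NA in every dataframe, assuming each ID appears at most once per frame, could count group sizes instead:

def CombinedNaNsAll(dflist):
    # keep only IDs whose col1 is NA in every dataframe in the list
    s = pd.concat([d[d['col1'].isna()] for d in dflist])
    return s[s.groupby('ID')['ID'].transform('size') == len(dflist)]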

Related

Merging df in python

Say I have two DataFrames
df1 = pd.DataFrame({'A':[1,2], 'B':[3,4]}, index = [0,1])
df2 = pd.DataFrame({'B':[8,9], 'C':[10,11]}, index = [1,2])
I want to merge so that any value in df1 is overwritten if there is a value in df2 at that location, and any new values in df2 are added, including the new rows and columns.
The result should be:
A B C
0 1 3 nan
1 2 8 10
2 nan 9 11
I've tried combine_first, but that causes only NaN values to be overwritten;
update has the issue where new rows are created rather than overwritten;
merge has many issues.
I've tried writing my own function:
import math

import numpy as np
import pandas as pd

def take_right(df1, df2, j, i):
    try:
        s1 = df1[j][i]
    except KeyError:
        s1 = np.nan
    try:
        s2 = df2[j][i]
    except KeyError:
        s2 = np.nan
    if math.isnan(s2):
        return s1
    else:
        return s2

def combine_df(df1, df2):
    rows = set(df1.index.values.tolist()) | set(df2.index.values.tolist())
    columns = set(df1.columns.values.tolist()) | set(df2.columns.values.tolist())
    df = pd.DataFrame()
    for i in rows:
        for j in columns:
            # note: DataFrame.insert operates in place and returns None
            df = df.insert(int(i), j, take_right(df1, df2, j, i), allow_duplicates=False)
    return df
This won't add new columns or rows to an empty DataFrame.
Thank you!!
One approach is to create an empty output dataframe with the union of columns and indices from df1 and df2, and then use the df.update method to assign their values into out_df:
import pandas as pd
df1 = pd.DataFrame({'A':[1,2], 'B':[3,4]}, index = [0,1])
df2 = pd.DataFrame({'B':[8,9], 'C':[10,11]}, index = [1,2])
out_df = pd.DataFrame(
    columns=df1.columns.union(df2.columns),
    index=df1.index.union(df2.index),
)
out_df.update(df1)
out_df.update(df2)
out_df
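One caveat worth noting: a DataFrame built from only columns and an index starts out with object dtype, and update preserves the existing dtypes, so the numbers above are stored as objects. If that matters, a one-liner restores proper dtypes:

out_df = out_df.infer_objects()  # re-infer numeric dtypes from the object columns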
Why does combine_first not work? With the arguments swapped so that df2 takes precedence, it does:
df = df2.combine_first(df1)
print(df)
Output:
A B C
0 1.0 3 NaN
1 2.0 8 10.0
2 NaN 9 11.0

combining two dataframes into one new dataframe in a zig zag/zipper way

I have df1 and df2, and I want to create a new dataframe df3 such that the first record of df3 is the first record from df1, the second record of df3 is the first record of df2, and it continues in the same alternating manner.
I tried many methods with pandas but didn't get the answer.
Is there any way to achieve it?
You can create a column with an incremental id (one dataframe gets even numbers, the other odd):
import numpy as np
df1['unique_id'] = np.arange(0, df1.shape[0]*2,2)
df2['unique_id'] = np.arange(1, df2.shape[0]*2,2)
and then concatenate them and sort by this column:
df3 = pd.concat([df1, df2])  # DataFrame.append was removed in pandas 2.0
df3 = df3.sort_values(by=['unique_id'])
after which you can drop the column you created:
df3 = df3.drop(columns=['unique_id'])
You could do it in one line: concatenate the two frames and stable-sort by the original row index, so that for equal indices the df1 row stays before the df2 row:
import pandas as pd
df1 = pd.DataFrame({'A':[3,3,4,6], 'B':['a1','b1','c1','d1']})
df2 = pd.DataFrame({'A':[5,4,6,1], 'B':['a2','b2','c2','d2']})
print(pd.concat([df1, df2]).sort_index(kind='stable'))
Which gives
A B
0 3 a1
0 5 a2
1 3 b1
1 4 b2
2 4 c1
2 6 c2
3 6 d1
3 1 d2

assignment with df.iloc() returns nan

I created a dataframe df = pd.DataFrame({'col':[1,2,3,4,5,6]}) and I would like to take some of its values and put them in another dataframe df2 = pd.DataFrame({'A':[0,0]}) by creating new columns.
I created a new column 'B' with df2['B'] = df.iloc[0:2,0] and everything was fine, but then I created another column 'C' with df2['C'] = df.iloc[2:4,0] and it contained only NaN values. I don't know why; if I print print(df.iloc[2:4]), everything looks normal.
full code:
import pandas as pd
df = pd.DataFrame({'col':[1,2,3,4,5,6]})
df2 = pd.DataFrame({'A':[0,0]})
df2['B'] = df.iloc[0:2,0]
df2['C'] = df.iloc[2:4,0]
print(df2)
print('\n',df.iloc[2:4])
output:
A B C
0 0 1 NaN
1 0 2 NaN
col
2 3
3 4
The assignment df2['C'] = df.iloc[2:4,0] does not work as expected because pandas aligns on the index: the slice carries index [2, 3] while df2 has index [0, 1], so nothing matches and the result is NaN. You can bypass the alignment by using the .values attribute:
import pandas as pd
df = pd.DataFrame({'col':[1,2,3,4,5,6]})
df2 = pd.DataFrame({'A':[0,0]})
df2['B'] = df.iloc[0:2,0]
df2['C'] = df.iloc[2:4,0].values
print(df2)
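For what it's worth, two equivalent spellings (assuming the same df and df2): realign the slice's index before assigning, or use .to_numpy(), the modern replacement for .values:

# realign the slice's index to match df2 before assigning
df2['C'] = df.iloc[2:4, 0].reset_index(drop=True)
# or hand over a bare array, as .values does
df2['C'] = df.iloc[2:4, 0].to_numpy()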

Use Dataframe names in loop Pandas

I have several data frames and need to do the same thing to all of them.
I'm currently doing this:
df1=df1.reindex(newindex)
df2=df2.reindex(newindex)
df3=df3.reindex(newindex)
df4=df4.reindex(newindex)
Is there a neater way of doing this?
Maybe something like
df=[df1,df2,df3,df4]
for d in df:
    d = d.reindex(newindex)
Yes, your idea is close, but the loop only rebinds the name d without changing the list, so you need to assign the results to a new list of DataFrames, e.g. by list comprehension:
dfs = [df1,df2,df3,df4]
dfs_new = [d.reindex(newindex) for d in dfs]
A nice solution with unpacking, as suggested by @Joe Halliwell, thank you:
df1, df2, df3, df4 = [d.reindex(newindex) for d in dfs]
Or, as suggested by @roganjosh, it is possible to create a dictionary of DataFrames:
dfs = [df1,df2,df3,df4]
names = ['a','b','c','d']
dfs_new_dict = {name: d.reindex(newindex) for name, d in zip(names, dfs)}
And then select each DataFrame by key:
print (dfs_new_dict['a'])
Sample:
df = pd.DataFrame({'a':[4,5,6]})
df1 = df * 10
df2 = df + 10
df3 = df - 10
df4 = df / 10
dfs = [df1,df2,df3,df4]
print (dfs)
[ a
0 40
1 50
2 60, a
0 14
1 15
2 16, a
0 -6
1 -5
2 -4, a
0 0.4
1 0.5
2 0.6]
newindex = [2,1,0]
df1, df2, df3, df4 = [d.reindex(newindex) for d in dfs]
print (df1)
print (df2)
print (df3)
print (df4)
a
2 60
1 50
0 40
a
2 16
1 15
0 14
a
2 -4
1 -5
0 -6
a
2 0.6
1 0.5
0 0.4
Or:
newindex = [2,1,0]
names = ['a','b','c','d']
dfs_new_dict = {name: d.reindex(newindex) for name, d in zip(names, dfs)}
print (dfs_new_dict['a'])
print (dfs_new_dict['b'])
print (dfs_new_dict['c'])
print (dfs_new_dict['d'])
If you have a lot of large dataframes, you can use multiple threads. I suggest using the pathos module (can be installed using pip install pathos):
from pathos.multiprocessing import ThreadPool

# create a thread pool with the max number of threads
tPool = ThreadPool()

# apply the same function to each df in your list of dataframes
newDFs = tPool.map(lambda df: df.reindex(newindex), dfs)
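If you'd rather avoid the extra dependency, the standard library's thread pool does the same job; a minimal sketch assuming the same dfs list and newindex:

from concurrent.futures import ThreadPoolExecutor

with ThreadPoolExecutor() as pool:
    # map the same reindex over every dataframe, preserving input order
    newDFs = list(pool.map(lambda df: df.reindex(newindex), dfs))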

How to combine summary statistics from 100s of *csv files into one *csv with pandas?

I have several hundreds *csv files, which when imported into a pandas data frame look as follows:
import pandas as pd
df = pd.read_csv("filename1.csv")
df
column1 column2 column3 column4
0 10 A 1 ID1
1 15 A 1 ID1
2 19 B 1 ID1
3 5071 B 0 ID1
4 5891 B 0 ID1
5 3210 B 0 ID1
6 12 B 2 ID1
7 13 C 2 ID1
8 20 C 0 ID1
9 5 C 3 ID1
10 9 C 3 ID1
Each *csv file has a unique ID for column4 (whereby each row has the same element).
I would like to create a new csv file, whereby each filename is a row, keeping the ID/value from column4 and the max values of column1 and column3. What is the best pandas way to do this?
ID1 5891 3
....
My idea would be:
import glob

import numpy as np
import pandas as pd

files = glob.glob("*.csv")  # within the correct subdirectory
newdf1 = pd.DataFrame()
for file in files:
    df = pd.read_csv(file)
    df["ID"] = df.column4.unique()
    df["max_column1"] = df.column1.max()
    df["max_column3"] = df.column3.max()
    newdf1 = newdf1.append(df, ignore_index=True)

newdf1.to_csv("totalfile.csv")
However, (1) I don't know if this is efficient, and (2) I don't know if the dimensions of the final csv are correct. Also, how would one deal with a *csv that is missing column1 or column3? That is, it should "pass" those values.
What is the correct way to do this?
I think you can loop over the files, get the first column4 value with iat plus the column maxima, and append them to a list.
Then use the DataFrame constructor and write to a file:
files = glob.glob("*.csv")  # within the correct subdirectory
L = []
for file in files:
    df = pd.read_csv(file)
    u = df.column4.iat[0]
    m1 = df.column1.max()
    m2 = df.column3.max()
    L.append({'ID': u, 'max_column1': m1, 'max_column3': m2})

newdf1 = pd.DataFrame(L)
newdf1.to_csv("totalfile.csv")
EDIT: if a file may be missing column1 or column3:
L = []
for file in files:
    print(file)
    df = pd.read_csv(file)
    m1, m2 = np.nan, np.nan
    if df.columns.str.contains('column1').any():
        m1 = df.column1.max()
    if df.columns.str.contains('column3').any():
        m2 = df.column3.max()
    u = df.column4.iat[0]
    L.append({'ID': u, 'max_column1': m1, 'max_column3': m2})

newdf1 = pd.DataFrame(L)
Repeated appending to a pandas DataFrame is highly inefficient, as each append copies the whole DataFrame.
Instead, you could write the max values straight to the result file:
files = glob.glob("*.csv")
with open("totalfile.csv", "w") as fout:
    for f in files:
        df = pd.read_csv(f)
        result = df.reindex(columns=['column4', 'column1', 'column3']).max().fillna('pass').to_dict()
        fout.write("{column4},{column1},{column3}\n".format(**result))
df.reindex(columns=[...]) returns NaN-filled columns for any missing column (selecting missing labels with .loc raises a KeyError in modern pandas), so the write only fails if all three columns are absent.
fillna('pass') substitutes 'pass' for the missing maxima.
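If the IDs might ever contain commas or quotes, the standard csv module handles the quoting for you; a small sketch under the same assumptions:

import csv

with open("totalfile.csv", "w", newline="") as fout:
    writer = csv.writer(fout)
    for f in files:
        row = pd.read_csv(f).reindex(columns=['column4', 'column1', 'column3']).max().fillna('pass')
        writer.writerow([row['column4'], row['column1'], row['column3']])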
