Merge excel files with multiple sheets into one dataframe - python

I'm new to pandas and I'm trying to combine a lot of Excel files from a folder (each file contains two sheets) and then add only certain columns from those sheets to a new dataframe. Each file has the same number of columns and the same sheet names, but sometimes a different number of rows.
I'll show you what I did with an example with two files. Screens of the sheets:
First sheet
Second sheet
Sheets from the second file have the same structure, but with different data in it.
Code:
import pandas as pd
import os
folder = [file for file in os.listdir('./test_folder/')]
consolidated = pd.DataFrame()
for file in folder:
    first = pd.concat(pd.read_excel('./test_folder/'+file, sheet_name=['first']))
    second = pd.concat(pd.read_excel('./test_folder/'+file, sheet_name=['second']))
    first_new = first.drop(['Col_K', 'Col_L', 'Col_M'], axis=1) #dropping unnecessary columns
    second_new = second.drop(['Col_DD', 'Col_EE', 'Col_FF','Col_GG','Col_HH', 'Col_II', 'Col_JJ', 'Col_KK', 'Col_LL', 'Col_MM', 'Col_NN', 'Col_OO', 'Col_PP', 'Col_QQ', 'Col_RR', 'Col_SS', 'Col_TT'], axis=1) #dropping unnecessary columns
    frames = [consolidated, second_new, first_new]
    consolidated = pd.concat(frames, axis=0)
consolidated.to_excel('all.xlsx', index=True)
So here is a result
And here's my desired result
So basically, I do not know how to ignore these empty cells and align these two data frames with each other. Most likely there's some problem with DFs indexes(first_new, second_new), but I don't know how to resolve it

pd.concat() has an ignore_index parameter. With axis=1 it discards the column labels and numbers the columns 0..n-1, while the rows are still aligned on their index. If the frames have a common row index (like in my example), you do not need ignore_index and can keep the column names.
Try:
pd.concat(frames, axis=1, ignore_index=True)
In [5]: df1 = pd.DataFrame({"A":2, "B":3}, index=[0, 1])
In [6]: df1
Out[6]:
A B
0 2 3
1 2 3
In [7]: df2 = pd.DataFrame({"AAA":22, "BBB":33}, index=[0, 1])
In [10]: df = pd.concat([df1, df2], axis=1, ignore_index=True)
In [11]: df
Out[11]:
0 1 2 3
0 2 3 22 33
1 2 3 22 33
In [12]: df = pd.concat([df1, df2], axis=1, ignore_index=False)
In [13]: df
Out[13]:
A B AAA BBB
0 2 3 22 33
1 2 3 22 33
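If the frames do not share a row index, one option is to reset each index before concatenating on axis=1. The following is a minimal sketch, not the asker's actual data: the two small frames and their column names are hypothetical stand-ins for first_new and second_new from a single file.

```python
import pandas as pd

# Hypothetical stand-ins for first_new and second_new from one file,
# with row indices that do not match each other.
first_new = pd.DataFrame({"Col_A": [1, 2]}, index=[10, 11])
second_new = pd.DataFrame({"Col_AA": [3, 4]}, index=[20, 21])

# Without resetting, axis=1 aligns on the index and leaves NaN gaps.
misaligned = pd.concat([first_new, second_new], axis=1)

# Resetting both indices lines the rows up by position instead.
aligned = pd.concat(
    [first_new.reset_index(drop=True), second_new.reset_index(drop=True)],
    axis=1,
)
print(aligned)
```

The aligned frame for each file could then be stacked with the others using pd.concat(..., axis=0, ignore_index=True).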

Related

Pandas find common NA records across multiple large dataframes

I have 3 dataframes like as shown below
ID,col1,col2
1,X,35
2,X,37
3,nan,32
4,nan,34
5,X,21
df1 = pd.read_clipboard(sep=',',skipinitialspace=True)
ID,col1,col2
1,nan,305
2,X,307
3,X,302
4,nan,304
5,X,201
df2 = pd.read_clipboard(sep=',',skipinitialspace=True)
ID,col1,col2
1,X,315
2,nan,317
3,X,312
4,nan,314
5,X,21
df3 = pd.read_clipboard(sep=',',skipinitialspace=True)
Now I want to identify the IDs where col1 is NA in all 3 input dataframes.
So, I tried the below
L1=df1[df1['col1'].isna()]['ID'].tolist()
L2=df2[df2['col1'].isna()]['ID'].tolist()
L3=df3[df3['col1'].isna()]['ID'].tolist()
common_ids_all = list(set.intersection(*map(set, [L1,L2,L3])))
final_df = pd.concat([df1,df2,df3],ignore_index=True)
final_df[final_df['ID'].isin(common_ids_all)]
While the above works, is there a more efficient and elegant approach?
As you can see, I am repeating the same statement three times (once per dataframe).
However, in my real data I have 12 dataframes, and I need the IDs where col1 is NA in all 12 of them.
update - my current read operation looks like below
fnames = ['file1.xlsx','file2.xlsx', 'file3.xlsx']
dfs=[]
NA_list=[]
def preprocessing(fname):
    df = pd.read_excel(fname, sheet_name="Sheet1")
    df.columns = df.iloc[7]
    df = df.iloc[8:, :]
    NA_list.append(df[df['col1'].isna()]['ID'])
    dfs.append(df)
[preprocessing(fname) for fname in fnames]
final_df = pd.concat(dfs, ignore_index=True)
L1 = NA_list[0]
L2 = NA_list[1]
L3 = NA_list[2]
final_list = (list(set.intersection(*map(set, [L1,L2,L3]))))
final_df[final_df['ID'].isin(final_list)]
You can use:
dfs = [df1, df2, df3]
final_df = pd.concat(dfs).query('col1.isna()')
final_df = final_df[final_df.groupby('ID')['ID'].transform('size') == len(dfs)]
print(final_df)
# Output
ID col1 col2
3 4 NaN 34
3 4 NaN 304
3 4 NaN 314
Full code:
fnames = ['file1.xlsx','file2.xlsx', 'file3.xlsx']
def preprocessing(fname):
    return pd.read_excel(fname, sheet_name='Sheet1', skiprows=6)
dfs = [preprocessing(fname) for fname in fnames]
final_df = pd.concat([df[df['col1'].isna()] for df in dfs])
final_df = final_df[final_df.groupby('ID')['ID'].transform('size') == len(dfs)]
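If only the list of common IDs is needed, the asker's set-intersection idea can be generalised to any number of frames. This is a rough sketch: the helper name ids_na_in_all and its parameters are hypothetical, and the three frames below are rebuilt from the sample data in the question.

```python
import pandas as pd

def ids_na_in_all(dfs, id_col="ID", value_col="col1"):
    """Return the sorted IDs whose value_col is NA in every frame of dfs.

    Hypothetical helper; expects at least one frame in dfs.
    """
    na_id_sets = [set(df.loc[df[value_col].isna(), id_col]) for df in dfs]
    return sorted(set.intersection(*na_id_sets))

# Sample data rebuilt from the question.
df1 = pd.DataFrame({"ID": [1, 2, 3, 4, 5], "col1": ["X", "X", None, None, "X"], "col2": [35, 37, 32, 34, 21]})
df2 = pd.DataFrame({"ID": [1, 2, 3, 4, 5], "col1": [None, "X", "X", None, "X"], "col2": [305, 307, 302, 304, 201]})
df3 = pd.DataFrame({"ID": [1, 2, 3, 4, 5], "col1": ["X", None, "X", None, "X"], "col2": [315, 317, 312, 314, 21]})

dfs = [df1, df2, df3]
common_ids = ids_na_in_all(dfs)

# Keep only the rows for those IDs, as in the question.
final_df = pd.concat(dfs, ignore_index=True)
final_df = final_df[final_df["ID"].isin(common_ids)]
print(final_df)
```

The same call works unchanged for 12 frames, since the list length is never hard-coded.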
There are times when a small function sorts this out. If the list of dataframes keeps changing, I would wrap the logic in a function. If I understood you correctly, the following will do:
def CombinedNaNs(lst):
    newdflist = []
    for d in lst:
        # keep only the rows where col1 is NA
        newdflist.append(d[d['col1'].isna()])
    s = pd.concat(newdflist)
    # keep IDs that occur more than once, i.e. NA in at least two dataframes
    return s[s.duplicated(subset=['ID'], keep=False)].drop_duplicates()
dflist=[df1,df2,df3]#List of dfs
CombinedNaNs(dflist)#apply function
ID col1 col2
3 4 NaN 34
3 4 NaN 304
3 4 NaN 314

Pandas Apply returns a Series instead of a dataframe

The goal of following code is to go through each row in df_label, extract app1 and app2 names, filter df_all using those two names, concatenate the result and return it as a dataframe. Here is the code:
def create_dataset(se):
    # extracting the names of applications
    app1 = se.app1
    app2 = se.app2
    # extracting each application from df_all
    df1 = df_all[df_all.workload == app1]
    df1.columns = df1.columns + '_0'
    df2 = df_all[df_all.workload == app2]
    df2.columns = df2.columns + '_1'
    # combining workloads to create the pairs dataframe
    df3 = pd.concat([df1, df2], axis=1)
    display(df3)
    return df3
df_pairs = pd.DataFrame()
df_label.apply(create_dataset, axis=1)
#df_pairs = df_pairs.append(df_label.apply(create_dataset, axis=1))
I would like to append all dataframes returned from apply. However, while display(df3) shows the correct dataframe, when returned from function, it's not a dataframe anymore and it's a series. A series with one element and that element seems to be the whole dataframe. Any ideas what I am doing wrong?
When you select a single column, you get a Series instead of a DataFrame, so df1 and df2 would both be Series.
However, concatenating them on axis=1 should produce a DataFrame (whereas combining them on axis=0 would produce a Series). For example:
df = pd.DataFrame({'a':[1,2],'b':[3,4]})
df1 = df['a']
df2 = df['b']
>>> pd.concat([df1,df2],axis=1)
a b
0 1 3
1 2 4
>>> pd.concat([df1,df2],axis=0)
0 1
1 2
0 3
1 4
dtype: int64
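One way to end up with a single DataFrame, sketched here under assumptions, is to skip apply, build one frame per row of df_label in a list, and concatenate the list once. The df_all and df_label frames below are hypothetical stand-ins, and create_dataset is changed to take the two names directly rather than a row.

```python
import pandas as pd

# Hypothetical stand-ins for df_all and df_label from the question.
df_all = pd.DataFrame({
    "workload": ["a", "a", "b", "b", "c", "c"],
    "metric": [1, 2, 3, 4, 5, 6],
})
df_label = pd.DataFrame({"app1": ["a", "b"], "app2": ["b", "c"]})

def create_dataset(app1, app2):
    # reset_index so the two halves line up by position when placed side by side
    df1 = df_all[df_all.workload == app1].add_suffix("_0").reset_index(drop=True)
    df2 = df_all[df_all.workload == app2].add_suffix("_1").reset_index(drop=True)
    return pd.concat([df1, df2], axis=1)

# One DataFrame per row of df_label, then a single concat at the end.
pieces = [create_dataset(row.app1, row.app2) for row in df_label.itertuples(index=False)]
df_pairs = pd.concat(pieces, ignore_index=True)
print(df_pairs)
```

Concatenating once at the end also avoids growing df_pairs inside a loop.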

rename column name index according to list placement with multiple duplicate names

I just asked a similar question, "rename columns according to list", which has a correct answer for how to add suffixes to column names. But I have a new issue: I want to rename the actual index name of the columns (df.columns.name) for each dataframe. I have three lists of dataframes. Some of the dataframes have duplicate column index names (and duplicate dataframe names as well, but that's not the issue; the issue is the duplicated original columns.name). I simply want to append a suffix to each dataframe's columns.name within each list, using the entry in the suffix list that matches the list's position.
here is an example of the data and the output i would like:
import numpy as np
import pandas as pd
# add string to end of x in list of dfs
df1, df2, df3, df4 = (pd.DataFrame(np.random.randint(0,10,size=(10, 2)), columns=('a', 'b')),
                      pd.DataFrame(np.random.randint(0,10,size=(10, 2)), columns=('c', 'd')),
                      pd.DataFrame(np.random.randint(0,10,size=(10, 2)), columns=('e', 'f')),
                      pd.DataFrame(np.random.randint(0,10,size=(10, 2)), columns=('g', 'h')))
df1.columns.name = 'abc'
df2.columns.name = 'abc'
df3.columns.name = 'efg'
df4.columns.name = 'abc'
cat_a = [df2, df1]
cat_b = [df3, df2, df1]
cat_c = [df1]
dfs = [cat_a, cat_b, cat_c]
suffix = ['group1', 'group2', 'group3']
# expected output =
#for df in cat_a: df.columns.name = df.columns.name + 'group1'
#for df in cat_b: df.columns.name = df.columns.name + 'group2'
#for df in cat_c: df.columns.name = df.columns.name + 'group3'
And here is some code that I have written that doesn't work: where df.columns.name is duplicated across dataframes, multiple suffixes get appended.
for x, df in enumerate(dfs):
    for i in df:
        n = ([(i.columns.name + '_' + str(suffix[x])) for out in i.columns.name])
        i.columns.name=n[x]
Thank you for looking, I really appreciate it.
Your current code is not working as you have multiple references to the same df in your lists, so only the last change matters. You need to make copies.
Assuming you want to change the columns index name for each df in dfs, you can use a list comprehension:
dfs = [[d.rename_axis(suffix[i], axis=1) for d in group]
       for i, group in enumerate(dfs)]
output:
>>> dfs[0][0]
group1 c d
0 5 0
1 9 3
2 3 9
3 4 2
4 1 0
5 7 6
6 5 2
7 8 0
8 1 2
9 7 2
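The comprehension above replaces the columns name with the suffix. If the goal is to append the suffix to the existing name, as in the question's expected output, a variation along these lines might work. It is only a sketch: the small frames are simplified stand-ins for the question's data, and the underscore separator is taken from the asker's own attempt.

```python
import numpy as np
import pandas as pd

# Simplified stand-ins for the question's frames.
df1 = pd.DataFrame(np.zeros((2, 2)), columns=["a", "b"])
df2 = pd.DataFrame(np.zeros((2, 2)), columns=["c", "d"])
df3 = pd.DataFrame(np.zeros((2, 2)), columns=["e", "f"])
df1.columns.name = "abc"
df2.columns.name = "abc"
df3.columns.name = "efg"

dfs = [[df2, df1], [df3, df2, df1], [df1]]
suffix = ["group1", "group2", "group3"]

# rename_axis returns a new DataFrame, so a frame that appears in
# several lists gets an independent name in each of them.
renamed = [
    [d.rename_axis(f"{d.columns.name}_{suffix[i]}", axis=1) for d in group]
    for i, group in enumerate(dfs)
]
print(renamed[1][0].columns.name)
```

Because each renamed frame is a copy, the original df1, df2 and df3 keep their names.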

combining two dataframes into one new dataframe in a zig zag/zipper way

I have df1 and df2, i want to create new data frame df3, such that the first record of df3 should be first record from df1, second record of df3 should be first record of df2. and it continues in the similar manner.
I tried many methods with pandas, but didn't get answer.
Is there any ways to achieve it.
You can create a column with an incremental id (even numbers for one dataframe and odd numbers for the other):
import numpy as np
df1['unique_id'] = np.arange(0, df1.shape[0]*2,2)
df2['unique_id'] = np.arange(1, df2.shape[0]*2,2)
and then concatenate them (DataFrame.append was removed in pandas 2.0) and sort by this column:
df3 = pd.concat([df1, df2])
df3 = df3.sort_values(by=['unique_id'])
after which you can drop the column you created:
df3 = df3.drop(columns=['unique_id'])
You could do it this way:
import pandas as pd
df1 = pd.DataFrame({'A':[3,3,4,6], 'B':['a1','b1','c1','d1']})
df2 = pd.DataFrame({'A':[5,4,6,1], 'B':['a2','b2','c2','d2']})
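dfff = pd.DataFrame()
# Row-by-row attempt; dfff is not used for the output below.
for i in range(0, 4):
    dfx = pd.concat([df1.iloc[i].T, df2.iloc[i].T])
    dfff = pd.concat([dfff, dfx])
# A stable sort on the index keeps each df1 row ahead of the matching df2 row.
print(pd.concat([df1, df2]).sort_index(kind='mergesort'))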
Which gives
A B
0 3 a1
0 5 a2
1 3 b1
1 4 b2
2 4 c1
2 6 c2
3 6 d1
3 1 d2
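The stable-sort idea can be wrapped in a small function that leaves the inputs untouched and returns a fresh 0..n-1 index. This is a minimal sketch; the name interleave is hypothetical.

```python
import pandas as pd

def interleave(df1, df2):
    """Alternate rows of df1 and df2 by position (hypothetical helper)."""
    stacked = pd.concat([df1.reset_index(drop=True), df2.reset_index(drop=True)])
    # A stable sort keeps df1's row ahead of df2's row for equal positions.
    return stacked.sort_index(kind="mergesort").reset_index(drop=True)

df1 = pd.DataFrame({"A": [3, 3, 4, 6], "B": ["a1", "b1", "c1", "d1"]})
df2 = pd.DataFrame({"A": [5, 4, 6, 1], "B": ["a2", "b2", "c2", "d2"]})
df3 = interleave(df1, df2)
print(df3)
```

If the two frames differ in length, the surplus rows of the longer one simply end up at the bottom.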

Python : how to 'concatenate' 2 pd.DataFrame columns ? two columns into one

I'm having trouble with some pandas dataframes.
It's very simple: I have 4 columns, and I want to reshape them into 2.
For 'practical' reasons, I don't want to use the header names; I need to refer to the columns by position (index) instead.
I have :
df = pd.DataFrame({'a': [1,2,3],'b': [4,5,6],'c': [7,8,9],'d':[10,11,12]})
I want as a result :
df_res = pd.DataFrame({'NewName1': [1,2,3,4,5,6],'NewName2': [7,8,9,10,11,12]})
(in fact NewName1 doesn't matter, it can stay a or whatever the name...)
I tried for loops, append, and concat, but couldn't figure it out.
Any suggestions?
Thanks for your help!
Bina
You can extract the desired columns and create a new pandas.DataFrame like so:
import numpy as np
import pandas as pd
df = pd.DataFrame({'a': [1,2,3],'b': [4,5,6],'c': [7,8,9],'d':[10,11,12]})
first_col = np.concatenate((df.a.to_numpy(), df.b.to_numpy()))
second_col = np.concatenate((df.c.to_numpy(), df.d.to_numpy()))
df2 = pd.DataFrame({"NewName1": first_col, "NewName2": second_col})
>>> df2
NewName1 NewName2
0 1 7
1 2 8
2 3 9
3 4 10
4 5 11
5 6 12
This is probably not the most elegant solution, but I would isolate the two dataframes and then concatenate them. I needed to rename the column axis so that the four columns could be aligned correctly.
import pandas as pd
df = pd.DataFrame({'a': [1,2,3],'b': [4,5,6],'c': [7,8,9],'d':[10,11,12]})
af = df[['a', 'c']]
bf = df[['b', 'd']]
frames = (
    af.rename({'a': 'NewName1', 'c': 'NewName2'}, axis=1),
    bf.rename({'b': 'NewName1', 'd': 'NewName2'}, axis=1)
)
out = pd.concat(frames)
[EDIT] Replying to the comment.
I'm not that familiar with indexing, but this might be one solution. You could avoid column names by using .iloc: replace the af and bf frames above with these lines.
af = df.iloc[:, ::2]
bf = df.iloc[:, 1::2]
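Since the rename calls above still refer to the original column names, a fully positional version would also need to relabel the slices without naming them. One possible sketch uses set_axis for that; the new_names list is just the target names from the question.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6], "c": [7, 8, 9], "d": [10, 11, 12]})
new_names = ["NewName1", "NewName2"]

# Columns at even positions (a, c) and at odd positions (b, d),
# relabelled by position so the old header names are never used.
af = df.iloc[:, ::2].set_axis(new_names, axis=1)
bf = df.iloc[:, 1::2].set_axis(new_names, axis=1)

# ignore_index gives the stacked result a clean 0..5 row index.
out = pd.concat([af, bf], ignore_index=True)
print(out)
```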
