How to combine multiple dataframes using for loop? - python

I am trying to merge multiple columns so that each column starts at a specific index, right after the previous one ends. For example, as you can see in the code below, I have 15 sets of data, from df20 to df90. As seen in the code, I merge each dataframe with the next one, which should start from index = 1,000.
So I want my output to be df20, followed by df25 starting at index=1000, then df30 starting at index=2000, then df35 at index=3000, and so on. I wanted to see all 15 columns, but I only get one dataframe's column in my output.
I have tried the code below, but it doesn't seem to work. Please help.
dframe = [df20, df25, df30, df35, df40, df45, df50, df55, df60, df65, df70, df75, df80, df85, df90]
for i in dframe:
    a = i.merge(i.set_index(i.index + 1000), how='outer', left_index=True, right_index=True)
print(a)
Output:
df90_x df90_y
0 0.000757 NaN
1 0.001435 NaN
2 0.002011 NaN
3 0.002497 NaN
4 0.001723 NaN
... ... ...
10995 NaN 1.223000e-12
10996 NaN 1.305000e-12
10997 NaN 1.809000e-12
10998 NaN 2.075000e-12
10999 NaN 2.668000e-12
[11000 rows x 2 columns]
Expected Output:
           df20          df25    df30
0      0.000757             0       0
1      0.001435             0       0
2      0.002011             0       0
3      0.002497             0       0
4      0.001723             0       0
...         ...           ...     ...
1000          0  1.223000e-12       0
1001          0  1.305000e-12       0
1002          0  1.809000e-12       0
1003          0  2.668000e-12       0
...         ...           ...     ...
2000          0             0  0.1234
2001          0             0  0.4567
2002          0             0  0.8901
2003          0             0  0.2345

You can try this code if you want variables for num_dataframe and len_dataframe:
import pandas as pd
import random

dframe = list()
num_dataframe = 3
len_dataframe = 5
for i in range(num_dataframe):
    # each dataframe gets its own index range, so a later concat lines them up diagonally
    dframe.append(pd.DataFrame({i: [random.randrange(1, 50, 1) for _ in range(len_dataframe)]},
                               index=range(i*len_dataframe, (i+1)*len_dataframe)))
result = pd.concat([dframe[i] for i in range(num_dataframe)], axis=1)
result = result.fillna(0)
And for your question, where you want 20 dataframes of length 1,000 each, you can try this:
import numpy as np
import pandas as pd

dframe = list()
num_dataframe = 20
len_dataframe = 1000
for i in range(num_dataframe):
    dframe.append(pd.DataFrame({i: [np.random.random() for _ in range(len_dataframe)]},
                               index=range(i*len_dataframe, (i+1)*len_dataframe)))
result = pd.concat([dframe[i] for i in range(num_dataframe)], axis=1)
result = result.fillna(0)
As you mentioned in the comment, I have edited the post to add this code:
dframe = [df20, df25, df30, df35, df40, df45, df50, df55, df60, df65, df70, df75, df80, df85, df90]
result = pd.concat(dframe, axis=0)
result = result.fillna(0)
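If the goal is the staircase layout from the expected output above, another option is to shift each dataframe's index before a side-by-side concat. A minimal sketch with synthetic stand-ins for df20 ... df90 (the column names and the 1,000-row length are assumptions):

import numpy as np
import pandas as pd

# Synthetic stand-ins for the question's df20 ... df90 (hypothetical data).
dframe = [pd.DataFrame({f'df{20 + 5*i}': np.random.random(1000)})
          for i in range(15)]

# Shift each dataframe's index by i * 1000 so the blocks stack diagonally,
# then concatenate side by side; cells outside each block become 0.
shifted = [df.set_index(df.index + i * 1000) for i, df in enumerate(dframe)]
result = pd.concat(shifted, axis=1).fillna(0)
print(result.shape)  # (15000, 15)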

Please refer to the official pandas documentation on concatenation.
Concat multiple dataframes
import pandas as pd

df1 = pd.DataFrame(
    {"A": ["A0", "A1", "A2", "A3"]},
    index=[0, 1, 2, 3]
)
df2 = pd.DataFrame(
    {"B": ["B4", "B5"]},
    index=[4, 5]
)
df3 = pd.DataFrame(
    {"C": ["C6", "C7", "C8", "C9", "C10"]},
    index=[6, 7, 8, 9, 10]
)
result = pd.concat([df1, df2, df3], axis=1)
display(result)
Output:
A B C
0 A0 NaN NaN
1 A1 NaN NaN
2 A2 NaN NaN
3 A3 NaN NaN
4 NaN B4 NaN
5 NaN B5 NaN
6 NaN NaN C6
7 NaN NaN C7
8 NaN NaN C8
9 NaN NaN C9
10 NaN NaN C10
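To get the zeros from the question's expected output instead of NaN, the same concat can be chained with fillna (a usage note, not part of the original output above):

result = pd.concat([df1, df2, df3], axis=1).fillna(0)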
Import files into a list via looping
method 1:
You can put all the filenames into a list and read them in a loop:
filenames = ['sample_20.csv', 'sample_25.csv', 'sample_30.csv', ...]
dataframes = [pd.read_csv(f) for f in filenames]
method 1-1:
If you have lots of files, you need a faster way to create the name list (note the stop value of 95 so that sample_90.csv is included):
filenames = ['sample_{}.csv'.format(i) for i in range(20, 95, 5)]
dataframes = [pd.read_csv(f) for f in filenames]
method 2:
from glob import glob
filenames = sorted(glob('sample*.csv'))  # sort, since glob returns files in arbitrary order
dataframes = [pd.read_csv(f) for f in filenames]
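Putting it together with the question's layout, a sketch (the sample_*.csv filenames and the one-column-per-file layout are assumptions):

import pandas as pd
from glob import glob

# Read every matching file (hypothetical names), sorted for a stable order.
filenames = sorted(glob('sample_*.csv'))
dataframes = [pd.read_csv(f) for f in filenames]

# Offset each dataframe's index by the cumulative row count so a
# side-by-side concat produces the staircase layout from the question.
offset = 0
shifted = []
for df in dataframes:
    shifted.append(df.set_index(df.index + offset))
    offset += len(df)
result = pd.concat(shifted, axis=1).fillna(0)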

Related

How to merge two dataframes without generating extra rows in result?

I am doing the following with two dataframes, but it generates duplicates and the result is not sorted like the first dataframe.
import pandas as pd

dict1 = {
    "time": ["15:09.123", "15:09.234", "15:10.123", "15:11.123", "15:12.123", "15:12.987"],
    "value": [10, 20, 30, 40, 50, 60]
}
dict2 = {
    "time": ["15:09", "15:09", "15:10"],
    "counts": ["fg", "mn", "gl"],
    "growth": [1, 3, 6]
}
df1 = pd.DataFrame(dict1)
df2 = pd.DataFrame(dict2)
df1["time"] = df1["time"].str[:-4]
result = pd.merge(df1, df2, on="time", how="left")
This generates a result of 8 rows! I am stripping the last 4 characters (the millisecond part) from the time column in df1 so it matches the times in df2.
time value counts growth
0 15:09 10 fg 1.0
1 15:09 10 mn 3.0
2 15:09 20 fg 1.0
3 15:09 20 mn 3.0
4 15:10 30 gl 6.0
5 15:11 40 NaN NaN
6 15:12 50 NaN NaN
7 15:12 60 NaN NaN
There are duplicated rows due to the join.
Is it possible to join the dataframes based on the time column in df1 so that events stay well sorted with their finer time granularity? Is there a way to partially match the time column values of the two dataframes and merge? The ideal result would look like the following:
time value counts growth
0 15:09.123 10 fg 1.0
1 15:09.234 20 mn 3.0
2 15:10.123 30 gl 6.0
3 15:11.123 40 NaN NaN
4 15:12.123 50 NaN NaN
5 15:12.987 60 NaN NaN
Here is one way to do it.
Assumption: the number of rows for a given time (without the millisecond part) in df1 and df2 will be the same.
# create time without the millisecond part
df1['time2'] = df1['time'].str[:-4]
# add a sequence when there are multiple rows for any time
df1['seq'] = df1.groupby('time2')['time2'].cumcount()
# add a sequence when there are multiple rows for any time
df2['seq'] = df2.groupby('time').cumcount()
# do a merge on the stripped time plus the sequence
pd.merge(df1,
         df2,
         left_on=['time2', 'seq'],
         right_on=['time', 'seq'],
         how='left',
         suffixes=(None, '_y')).drop(columns=['time2', 'seq'])
time value time_y counts growth
0 15:09.123 10 15:09 fg 1.0
1 15:09.234 20 15:09 mn 3.0
2 15:10.123 30 15:10 gl 6.0
3 15:11.123 40 NaN NaN NaN
4 15:12.123 50 NaN NaN NaN
5 15:12.987 60 NaN NaN NaN
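A quick sanity check for that assumption might look like this (a sketch, not part of the answer; it assumes df1['time'] still holds the full mm:ss.fff strings):

# Compare per-time row counts: the cumcount merge leaves df2 rows
# unmatched whenever df2 has more rows for a given time than df1.
counts1 = df1['time'].str[:-4].value_counts()
counts2 = df2['time'].value_counts()
extra = counts2.sub(counts1, fill_value=0)
print(extra[extra > 0])  # times whose df2 rows would be dropped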
Merge on column 'time' with preserved order
Assumption: Data from df1 and df2 are in order of occurrence
import pandas as pd

dict1 = {
    "time": ["15:09.123", "15:09.234", "15:10.123", "15:11.123", "15:12.123", "15:12.987"],
    "value": [10, 20, 30, 40, 50, 60]
}
dict2 = {
    "time": ["15:09", "15:09", "15:11"],
    "counts": ["fg", "mn", "gl"],
    "growth": [1, 3, 6]
}
df1 = pd.DataFrame(dict1)
df2 = pd.DataFrame(dict2)
df1["time"] = df1["time"].str[:-4]

df1_keys = df1["time"].unique()
df_list = list()
for key in df1_keys:
    # align the rows for each time key positionally, then merge on index
    tmp_df1 = df1[df1["time"] == key].reset_index(drop=True)
    tmp_df2 = df2[df2["time"] == key].reset_index(drop=True)
    df_list.append(pd.merge(tmp_df1, tmp_df2, left_index=True, right_index=True, how="left"))
print(pd.concat(df_list, axis=0))

align two pandas dataframes on values in one column, otherwise insert NA to match row number

I have two pandas DataFrames (df1, df2) with different numbers of rows and columns and some matching values in a specific column in each df, with the caveats that (1) there are some unique values in each df, and (2) there are different numbers of matching values across the DataFrames.
Baby example:
df1 = pd.DataFrame({'id1': [1, 1, 1, 2, 2, 3, 3, 3, 3, 4, 6, 6]})
df2 = pd.DataFrame({'id2': [1, 1, 2, 2, 2, 2, 3, 4, 5],
                    'var1': ['B', 'B', 'W', 'W', 'W', 'W', 'H', 'B', 'A']})
What I am seeking to do is create df3 where df2['id2'] is aligned/indexed to df1['id1'], such that:
- NaN is added to df3[id2] when df2[id2] has fewer (or missing) matches to df1[id1]
- NaN is added to df3[id2] & df3[var1] if df1[id1] exists but has no match to df2[id2]
- var1 is filled in for all cases of df3[var1] where df1[id1] and df2[id2] match
- rows are dropped when df2[id2] has more matching values than df1[id1] (or no matches at all)
The resulting DataFrame (df3) should look as follows (Notice id2 = 5 and var1 = A are gone):
id1  id2 var1
  1    1    B
  1    1    B
  1  NaN    B
  2    2    W
  2    2    W
  3    3    H
  3  NaN    H
  3  NaN    H
  3  NaN    H
  4    4    B
  6  NaN  NaN
  6  NaN  NaN
I cannot find a combination of merge/join/concatenate/align that correctly solves this problem. Currently, everything I have tried stacks the rows in sequence without adding NaN in the proper cells/rows and instead adds all the NaN values at the bottom of df3 (so id1 and id2 never align). Any help is greatly appreciated!
You can first assign a helper column for id1 and id2 based on groupby.cumcount, then merge. Finally, fill the values of var1 by mapping from id1:
def helper(data, col):
    return data.groupby(col).cumcount()

out = df1.assign(k=helper(df1, ['id1'])).merge(df2.assign(k=helper(df2, ['id2'])),
                                               left_on=['id1', 'k'],
                                               right_on=['id2', 'k'],
                                               how='left').drop(columns='k')
out['var1'] = out['id1'].map(dict(df2[['id2', 'var1']].drop_duplicates().to_numpy()))
Or, similarly but without assign, as HenryEcker suggests:
out = df1.merge(df2, left_on=['id1', helper(df1, ['id1'])],
                right_on=['id2', helper(df2, ['id2'])],
                how='left').drop(columns='key_1')
out['var1'] = out['id1'].map(dict(df2[['id2', 'var1']].drop_duplicates().to_numpy()))
print(out)
id1 id2 var1
0 1 1.0 B
1 1 1.0 B
2 1 NaN B
3 2 2.0 W
4 2 2.0 W
5 3 3.0 H
6 3 NaN H
7 3 NaN H
8 3 NaN H
9 4 4.0 B
10 6 NaN NaN
11 6 NaN NaN
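The map-based fill assumes each id2 carries exactly one var1 value; a quick check might be (a sketch, not from the answer):

# Each id2 should map to a single var1 for the dict lookup to be unambiguous.
assert (df2.groupby('id2')['var1'].nunique() == 1).all()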

Find different rows between 2 dataframes of different size with Pandas

I have 2 dataframes df1 and df2 of different size.
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'A': [np.nan, np.nan, np.nan, 'AAA', 'SSS', 'DDD'],
                    'B': [np.nan, np.nan, 'ciao', np.nan, np.nan, np.nan]})
df2 = pd.DataFrame({'C': [np.nan, np.nan, np.nan, 'SSS', 'FFF', 'KKK', 'AAA'],
                    'D': [np.nan, np.nan, np.nan, 1, np.nan, np.nan, np.nan]})
My goal is to identify the elements of df1 which do not appear in df2.
I was able to achieve my goal using the following lines of code.
df = pd.DataFrame({})
for i, row1 in df1.iterrows():
    found = False
    for j, row2 in df2.iterrows():
        if row1['A'] == row2['C']:
            found = True
            print(row1.to_frame().T)
    if found == False and pd.isnull(row1['A']) == False:
        df = pd.concat([df, row1.to_frame().T], axis=0)
df = df.reset_index(drop=True)
Is there a more elegant and efficient way to achieve my goal?
Note: the solution is
A B
0 DDD NaN
I believe you need isin with boolean indexing.
To also omit the NaN rows, chain a new condition:
# changed df2 with no NaN in column C
df2 = pd.DataFrame({'C': [4, 5, 5, 'SSS', 'FFF', 'KKK', 'AAA'],
                    'D': [np.nan, np.nan, np.nan, 1, np.nan, np.nan, np.nan]})
print(df2)
C D
0 4 NaN
1 5 NaN
2 5 NaN
3 SSS 1.0
4 FFF NaN
5 KKK NaN
6 AAA NaN
df = df1[~(df1['A'].isin(df2['C']) | (df1['A'].isnull()))]
print (df)
A B
5 DDD NaN
If omitting NaNs is not necessary, or they do not exist in column C:
df = df1[~df1['A'].isin(df2['C'])]
print (df)
A B
0 NaN NaN
1 NaN NaN
2 NaN ciao
5 DDD NaN
If NaNs exist in both columns, use the second solution:
(the input DataFrames are from the question)
df = df1[~df1['A'].isin(df2['C'])]
print (df)
A B
5 DDD NaN
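An alternative sketch using merge with indicator=True (not from the answer; it assumes the question's original df1/df2) keeps the rows of df1 whose A value never appears in C:

# Left-merge df1 against the distinct non-null values of C; rows marked
# 'left_only' have no match in df2.
candidates = df2[['C']].dropna().drop_duplicates()
merged = df1.merge(candidates, left_on='A', right_on='C',
                   how='left', indicator=True)
df = merged.loc[merged['_merge'] == 'left_only', ['A', 'B']].dropna(subset=['A'])
print(df)

     A    B
5  DDD  NaN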

Why does pd.DataFrame with pd.isnull fail?

tt = pd.DataFrame({'a': [1, 2, None, 3], 'b': [None, 3, 4, 5]})
bb = pd.DataFrame(pd.isnull(tt).astype(int), index=tt.index,
                  columns=map(lambda x: x + '_' + 'NA', tt.columns))
bb
I want to create this dataframe with pd.isnull(tt), with column names that contain NA, but why does this fail?
Using values
tt = pd.DataFrame({'a': [1, 2, None, 3], 'b': [None, 3, 4, 5]})
bb = pd.DataFrame(data=pd.isnull(tt).astype(int).values, index=tt.index,
                  columns=list(map(lambda x: x + '_' + 'NA', tt.columns)))
The reason why:
When you pass a DataFrame to the pd.DataFrame constructor, pandas aligns on the existing columns and index, and pd.isnull(tt).astype(int) already has a and b as its column names, so the requested a_NA and b_NA columns have nothing to align with.
More information
bb = pd.DataFrame(data=pd.isnull(tt).astype(int), index=tt.index,
                  columns=['a', 'b', 'a_NA', 'b_NA'])
bb
Out[399]:
a b a_NA b_NA
0 0 1 NaN NaN
1 0 0 NaN NaN
2 1 0 NaN NaN
3 0 0 NaN NaN
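A shorter alternative (a sketch, not from the answer) sidesteps the alignment entirely with DataFrame.add_suffix:

# Build the indicator frame first, then rename its columns, so no
# constructor-level alignment is involved.
bb = tt.isnull().astype(int).add_suffix('_NA')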

Can't create jagged dataframe in pandas?

I have a simple dataframe with 2 columns and 2 rows.
I also have a list of 4 numbers.
I want to concatenate this list to the FIRST column of the dataframe, and only the first, so the dataframe will have 6 rows in the first column and 2 in the second.
I wrote this code:
df1 = pd.DataFrame([[1, 2], [3, 4]], columns=list('AB'))
numbers = [5, 6, 7, 8]
for i in range(0, 4):
    df1['A'].loc[i + 2] = numbers[i]
print(df1)
It prints the original dataframe, oddly enough. But when I debug and evaluate the expression df1['A'], it does show the new numbers. What's going on here?
It's not just that it prints the original df; it also writes the original df to csv when I use the to_csv method.
It seems you need:
for i in range(0, 4):
    df1.loc[0, i] = numbers[i]
print(df1)

   A  B    0    1    2    3
0  1  2  5.0  6.0  7.0  8.0
1  3  4  NaN  NaN  NaN  NaN
df1 = pd.concat([df1, pd.DataFrame([numbers], index=[0])], axis=1)
print(df1)

   A  B    0    1    2    3
0  1  2  5.0  6.0  7.0  8.0
1  3  4  NaN  NaN  NaN  NaN
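If the goal really is the jagged shape from the question (6 values in column A, 2 in B), here is a sketch using reindex (the row labels 2 through 5 are an assumption):

import pandas as pd

df1 = pd.DataFrame([[1, 2], [3, 4]], columns=list('AB'))
numbers = [5, 6, 7, 8]

# Add empty rows with reindex, then fill only column 'A'; column 'B'
# keeps NaN in the new rows, giving the jagged effect.
extended = df1.reindex(range(len(df1) + len(numbers)))
extended.loc[2:, 'A'] = numbers
print(extended)

     A    B
0  1.0  2.0
1  3.0  4.0
2  5.0  NaN
3  6.0  NaN
4  7.0  NaN
5  8.0  NaN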
