I have 3 dataframes that I would like to combine. The first 3 columns (ID, A, B) hold the same kind of data where a row exists in more than one dataframe, while the columns after that are new in each dataframe; for example, df1.columns[3:] is different from df2.columns[3:].
I would like to merge rows that share the same unique identifier, and otherwise concat.
df1
ID A B 2009 2010
1 A B 2 3
2 A C 2 2
3 A B 3 3
df2
ID A B 2011 2012
2 A C 2 2
3 A C 3 4
5 A B 8 9
df3
ID A B 2013 2014
2 A C 2 3
4 A E 3 4
5 A B 8 9
result df
ID A B 2009 2010 2011 2012 2013 2014
1  A B    2    3  NaN  NaN  NaN  NaN
2  A C    2    2    2    2    2    3
3  A C    3    3    3    4  NaN  NaN
4  A E  NaN  NaN  NaN  NaN    3    4
5  A B  NaN  NaN    8    9    8    9
Edit: fixed the df data. Secondly, one issue I notice is that when I merge, my A and B columns get duplicated with suffixes: A_x, A_y, B_x, B_y, and so on.
Thank you in advance.
Try pd.concat([df.set_index('ID') for df in [df1, df2, df3]], axis=1).reset_index()
The list comprehension sets ID as the index of each dataframe. Then we concatenate horizontally. Horizontal concatenation tries to match up the indexes where possible, otherwise it adds rows. Finally, we reset the index.
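If A and B should also be treated as identifiers (which avoids the duplicated A/B columns mentioned in the edit), the same idea works with a wider index; a minimal sketch, assuming A and B agree across dataframes for a given ID:
pd.concat([df.set_index(['ID', 'A', 'B']) for df in [df1, df2, df3]], axis=1).reset_index()
Rows whose (ID, A, B) triple differs between dataframes will stay as separate rows, just like merging on ['ID', 'A', 'B'].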
There is something wrong with the result of the plain ID-index concat for this data. The code for merging would instead be like this:
from functools import reduce
import pandas as pd

dfs = [df1, df2, df3]
# chain outer merges pairwise: ((df1 merge df2) merge df3)
df_merged = reduce(lambda left, right: pd.merge(left, right, on=['ID'], how='outer'), dfs)
df_merged (year columns shown; merging on ID alone carries A and B through with _x/_y suffixes, see the edit below):
ID 2009 2010 2011 2012 2013 2014
0 1 2.0 3.0 2.0 3.0 NaN NaN
1 2 3.0 4.0 3.0 4.0 2.0 3.0
2 3 4.0 5.0 4.0 5.0 NaN NaN
3 4 NaN NaN NaN NaN 3.0 4.0
4 5 NaN NaN NaN NaN 8.0 9.0
Edit:
Just use on=['ID', 'A', 'B'] so that A and B act as join keys instead of being duplicated with suffixes.
output:
ID A B 2009 2010 2011 2012 2013 2014
0 1 A B 2.0 3.0 NaN NaN NaN NaN
1 2 A C 2.0 2.0 2.0 2.0 2.0 3.0
2 3 A B 3.0 3.0 NaN NaN NaN NaN
3 3 A C NaN NaN 3.0 4.0 NaN NaN
4 5 A B NaN NaN 8.0 9.0 8.0 9.0
5 4 A E NaN NaN NaN NaN 3.0 4.0
I hope I can describe well what I need. I have a data frame with value columns X and Y, plus another column (ID) that works as an index. The data frame looks as follows:
df = pd.DataFrame({'ID':[1,1,1,1,1,2,2,2,3,3,3,3],'X':[1,2,3,4,5,2,3,4,1,3,4,5],'Y':[1,2,3,4,5,2,3,4,5,4,3,2]})
df
Out[21]:
ID X Y
0 1 1 1
1 1 2 2
2 1 3 3
3 1 4 4
4 1 5 5
5 2 2 2
6 2 3 3
7 2 4 4
8 3 1 5
9 3 3 4
10 3 4 3
11 3 5 2
My intention is to keep X as an index or as a column (it doesn't matter) and append the Y values of each 'ID' as a separate column, so that the result has one Y column per ID.
You can try renaming Y to a per-group name, concatenating the groups, and then stripping the trailing digits so every column is labelled Y again:
out = pd.concat([group.rename(columns={'Y': f'Y{name}'}) for name, group in df.groupby('ID')])
out.columns = out.columns.str.replace(r'\d+$', '', regex=True)
print(out)
ID X Y Y Y
0 1 1 1.0 NaN NaN
1 1 2 2.0 NaN NaN
2 1 3 3.0 NaN NaN
3 1 4 4.0 NaN NaN
4 1 5 5.0 NaN NaN
5 2 2 NaN 2.0 NaN
6 2 3 NaN 3.0 NaN
7 2 4 NaN 4.0 NaN
8 3 1 NaN NaN 5.0
9 3 3 NaN NaN 4.0
10 3 4 NaN NaN 3.0
11 3 5 NaN NaN 2.0
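Note that after the str.replace step all three new columns share the label Y, so out['Y'] will return a DataFrame containing all of them; keep the distinct Y1/Y2/Y3 names if you need to select them individually.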
Here's another way to do it:
df_org = pd.DataFrame({'ID':[1,1,1,1,1,2,2,2,3,3,3,3],
                       'X':[1,2,3,4,5,2,3,4,1,3,4,5],
                       'Y':[1,2,3,4,5,2,3,4,5,4,3,2]})
df = df_org[['ID', 'X']].copy()
for i in set(df_org['ID']):
    col = 'Y' + str(i)
    # take this group's Y values under a per-group column name
    df1 = df_org.loc[df_org['ID'] == i, ['Y']].rename(columns={'Y': col})
    df = pd.concat([df, df1], axis=1)
df.columns = df.columns.str.replace(r'\d+$', '', regex=True)
print(df)
Output:
ID X Y Y Y
0 1 1 1.0 NaN NaN
1 1 2 2.0 NaN NaN
2 1 3 3.0 NaN NaN
3 1 4 4.0 NaN NaN
4 1 5 5.0 NaN NaN
5 2 2 NaN 2.0 NaN
6 2 3 NaN 3.0 NaN
7 2 4 NaN 4.0 NaN
8 3 1 NaN NaN 5.0
9 3 3 NaN NaN 4.0
10 3 4 NaN NaN 3.0
11 3 5 NaN NaN 2.0
Another solution could be as follows.
Get the unique values of column ID (stored in array s).
Use np.transpose to repeat column ID n times (n == len(s)) and compare the resulting array with s.
Use np.where to replace True with values from df.Y and False with NaN.
Finally, drop the original df.Y and rename the new columns as required.
import pandas as pd
import numpy as np
df = pd.DataFrame({'ID':[1,1,1,1,1,2,2,2,3,3,3,3],
'X':[1,2,3,4,5,2,3,4,1,3,4,5],
'Y':[1,2,3,4,5,2,3,4,5,4,3,2]})
s = df.ID.unique()
df[s] = np.where((np.transpose([df.ID]*len(s))==s),
np.transpose([df.Y]*len(s)),
np.nan)
df.drop('Y', axis=1, inplace=True)
df.rename(columns={k:'Y' for k in s}, inplace=True)
print(df)
ID X Y Y Y
0 1 1 1.0 NaN NaN
1 1 2 2.0 NaN NaN
2 1 3 3.0 NaN NaN
3 1 4 4.0 NaN NaN
4 1 5 5.0 NaN NaN
5 2 2 NaN 2.0 NaN
6 2 3 NaN 3.0 NaN
7 2 4 NaN 4.0 NaN
8 3 1 NaN NaN 5.0
9 3 3 NaN NaN 4.0
10 3 4 NaN NaN 3.0
11 3 5 NaN NaN 2.0
If performance is an issue, this vectorised method should be faster than the loop-based answer above, especially when the number of unique ID values increases.
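For comparison, a similar vectorised idea can be written in pandas alone; this is my own sketch using pd.get_dummies, not from the original thread:
import pandas as pd

df = pd.DataFrame({'ID':[1,1,1,1,1,2,2,2,3,3,3,3],
                   'X':[1,2,3,4,5,2,3,4,1,3,4,5],
                   'Y':[1,2,3,4,5,2,3,4,5,4,3,2]})

mask = pd.get_dummies(df['ID']).astype(bool)   # one indicator column per unique ID
wide = mask.mul(df['Y'], axis=0).where(mask)   # Y where the indicator is set, NaN elsewhere
out = pd.concat([df[['ID', 'X']], wide.set_axis(['Y'] * wide.shape[1], axis=1)], axis=1)
print(out)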
I am relatively new to Python and I am wondering how I can merge these two tables while preserving the values of both.
Consider these two tables:
df = pd.DataFrame([[1, 3], [2, 4],[2.5,1],[5,6],[7,8]], columns=['A', 'B'])
A B
1 3
2 4
2.5 1
5 6
7 8
df2 = pd.DataFrame([[1],[2],[3],[4],[5],[6],[7],[8]], columns=['A'])
A
1
2
...
8
I want to obtain the following result:
A B
1 3
2 4
2.5 1
3 NaN
4 NaN
5 6
6 NaN
7 8
8 NaN
You can see that column A includes all values from both the first and second dataframe in an ordered manner.
I have attempted:
pd.merge(df,df2,how='outer')
pd.merge(df,df2,how='right')
But the former does not result in an ordered dataframe and the latter does not include rows that are unique to df.
Let us do concat then drop_duplicates, keeping the last occurrence so the rows from df (which carry the B values) win over the bare rows from df2:
out = pd.concat([df2,df]).drop_duplicates('A',keep='last').sort_values('A')
Out[96]:
A B
0 1.0 3.0
1 2.0 4.0
2 2.5 1.0
2 3.0 NaN
3 4.0 NaN
3 5.0 6.0
5 6.0 NaN
4 7.0 8.0
7 8.0 NaN
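For what it's worth, the outer merge from the question also works once you sort the result; a minimal sketch (not from the original answers):
pd.merge(df, df2, how='outer').sort_values('A').reset_index(drop=True)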
df1 = pd.DataFrame([(1,5),(2,10),(3,15)],columns=["2009","2008"],index=["C","A","B"])
2009 2008
C 1 5
A 2 10
B 3 15
df2 = pd.DataFrame([(5,7),(11,14),(14,15)],columns=["2008","2007"],index=["D","B","C"])
2008 2007
D 5 7
B 11 14
C 14 15
desired_output =
2009 2008 2007
C 1 5 15
A 2 10 na
B 3 15 14
D na 5 7
I know there are four main ways to combine two dataframes (join, merge, append, concat) and I have experimented with several of them, but I cannot seem to succeed.
df1.merge(df2,how="outer",left_index=True,right_index=True,on="2008")
2009 2008 2007
A 2.0 10 NaN
B 3.0 15 14.0
C 1.0 5 15.0
D NaN 5 7.0
is the closest I could get, but the rows get re-sorted. I want all intersecting indices to come first in the original order of df1, then any non-intersecting indices to be appended (ideally also in the order of df2).
Any help would be appreciated.
You can try this using pd.Index.difference with DataFrame.append to maintain both the index and column order.
idx = df2.index.difference(df1.index)
df1.append(df2.loc[idx]).fillna(df2)
2009 2008 2007
C 1.0 5 15.0
A 2.0 10 NaN
B 3.0 15 14.0
D NaN 5 7.0
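Note that DataFrame.append was deprecated and then removed in pandas 2.0; the equivalent with pd.concat would be:
pd.concat([df1, df2.loc[idx]]).fillna(df2)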
Try combine_first, then reindex the columns with the union of both column indexes (sort=False keeps the original column order):
df1.combine_first(df2).reindex(df1.columns.union(df2.columns, sort=False), axis=1)
Output:
2009 2008 2007
A 2.0 10.0 NaN
B 3.0 15.0 14.0
C 1.0 5.0 15.0
D NaN 5.0 7.0
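combine_first also sorts the row index (A, B, C, D above). If the original row order matters, the same sort=False union trick works for the index; a sketch along the same lines (my addition, not from the original answer):
df1.combine_first(df2).reindex(index=df1.index.union(df2.index, sort=False),
                               columns=df1.columns.union(df2.columns, sort=False))
which returns the rows in the desired C, A, B, D order.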
I have a dataframe with ones and NaN values and would like to set the two rows following each 1 to 2 and 3.
import pandas as pd
df=pd.DataFrame({"b" : [1,None,None,None,None,1,None,None,None]})
print(df)
b
0 1.0
1 NaN
2 NaN
3 NaN
4 NaN
5 1.0
6 NaN
7 NaN
8 NaN
Like this:
b
0 1.0
1 2.0
2 3.0
3 NaN
4 NaN
5 1.0
6 2.0
7 3.0
8 NaN
I know I can use df.loc[df['b']==1] to retrieve the ones but I don't know how to set the two rows below.
You can create a group variable where each 1 in b starts a new group, then forward fill 2 rows for each group, and do a cumsum:
g = (df.b == 1).cumsum()
df.b.groupby(g).apply(lambda s: s.ffill(limit=2).cumsum())
#0 1.0
#1 2.0
#2 3.0
#3 NaN
#4 NaN
#5 1.0
#6 2.0
#7 3.0
#8 NaN
#Name: b, dtype: float64
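A cumcount-based variant of the same grouping idea (my sketch; it assumes the series starts with a 1, so every row belongs to a group):
g = (df.b == 1).cumsum()               # a new group starts at every 1
c = df.b.groupby(g).cumcount() + 1     # position within each group: 1, 2, 3, ...
out = c.where(c <= 3)                  # keep only the first three positions per group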
One without groupby:
temp = df.ffill(limit=2).cumsum()                   # running total over values filled two rows down
temp - temp.mask(df.b.isnull()).ffill(limit=2) + 1  # subtract the running total as of the last 1, then add 1
Out[91]:
b
0 1.0
1 2.0
2 3.0
3 NaN
4 NaN
5 1.0
6 2.0
7 3.0
8 NaN
Using your current line of thinking, you simply need the positional indices of the rows after the 1s and to set them to the appropriate values:
import numpy as np

ones = np.where(df['b'] == 1)[0]   # positions of the 1s (assumes a default RangeIndex)
df.loc[ones + 1, 'b'] = 2
df.loc[ones + 2, 'b'] = 3
I'm new to Python and Pandas so there might be a simple solution which I don't see.
I have a number of discontinuous datasets which look like this:
ind A B C
0 0.0 1 3
1 0.5 4 2
2 1.0 6 1
3 3.5 2 0
4 4.0 4 5
5 4.5 3 3
I now look for a solution to get the following:
ind A B C
0 0.0 1 3
1 0.5 4 2
2 1.0 6 1
3 1.5 NaN NaN
4 2.0 NaN NaN
5 2.5 NaN NaN
6 3.0 NaN NaN
7 3.5 2 0
8 4.0 4 5
9 4.5 3 3
The problem is that the gap in A varies from dataset to dataset in both position and length...
set_index and reset_index are your friends.
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": [0, 0.5, 1.0, 3.5, 4.0, 4.5], "B": [1, 4, 6, 2, 4, 3], "C": [3, 2, 1, 0, 5, 3]})
First move column A to the index:
In [64]: df.set_index("A")
Out[64]:
B C
A
0.0 1 3
0.5 4 2
1.0 6 1
3.5 2 0
4.0 4 5
4.5 3 3
Then reindex with a new index; here the missing data is filled in with NaNs. We use the Index object since we can name it; the name will be used in the next step.
In [66]: new_index = pd.Index(np.arange(0, 5, 0.5), name="A")
In [67]: df.set_index("A").reindex(new_index)
Out[67]:
B C
0.0 1 3
0.5 4 2
1.0 6 1
1.5 NaN NaN
2.0 NaN NaN
2.5 NaN NaN
3.0 NaN NaN
3.5 2 0
4.0 4 5
4.5 3 3
Finally move the index back to the columns with reset_index. Since we named the index, it all works magically:
In [69]: df.set_index("A").reindex(new_index).reset_index()
Out[69]:
A B C
0 0.0 1 3
1 0.5 4 2
2 1.0 6 1
3 1.5 NaN NaN
4 2.0 NaN NaN
5 2.5 NaN NaN
6 3.0 NaN NaN
7 3.5 2 0
8 4.0 4 5
9 4.5 3 3
Using the answer by EdChum below, I created the following function:
def fill_missing_range(df, field, range_from, range_to, range_step=1, fill_with=0):
    # range_to is exclusive, as in np.arange
    new_rows = pd.DataFrame({field: np.arange(range_from, range_to, range_step)})
    return (df.merge(new_rows, how='right', on=field)
              .sort_values(by=field)
              .reset_index(drop=True)
              .fillna(fill_with))
Example usage (the end value must lie past the last desired point, since np.arange excludes it):
fill_missing_range(df, 'A', 0.0, 5.0, 0.5, np.nan)
In this case I am overwriting your A column with a newly generated dataframe and merging this to your original df, I then resort it:
In [177]:
df.merge(pd.DataFrame({'A': np.arange(df.iloc[0]['A'], df.iloc[-1]['A'] + 0.5, 0.5)}), how='right', on='A').sort_values(by='A').reset_index(drop=True)
Out[177]:
A B C
0 0.0 1 3
1 0.5 4 2
2 1.0 6 1
3 1.5 NaN NaN
4 2.0 NaN NaN
5 2.5 NaN NaN
6 3.0 NaN NaN
7 3.5 2 0
8 4.0 4 5
9 4.5 3 3
So in the general case you can adjust the arange call, which takes start, end, and step values; note I added 0.5 to the end because the range is half-open (the end value is excluded).
A more general method could be like this:
In [197]:
df = df.set_index(keys='A', drop=False).reindex(np.arange(df.iloc[0]['A'], df.iloc[-1]['A'] + 0.5, 0.5))
df['A'] = df.index   # restore the A values for the newly inserted rows
df = df.reset_index(drop=True)
df
Out[197]:
A B C
0 0.0 1 3
1 0.5 4 2
2 1.0 6 1
3 1.5 NaN NaN
4 2.0 NaN NaN
5 2.5 NaN NaN
6 3.0 NaN NaN
7 3.5 2 0
8 4.0 4 5
9 4.5 3 3
Here we set the index to column A (without dropping the column), reindex the df using the arange function, and then restore A from the index for the newly added rows.
This question was asked a long time ago, but I have a simple point that's worth mentioning. The missing entries are just NumPy NaNs, so you can assign them directly. For instance:
import numpy as np

df.loc[i, j] = np.nan   # set the cell at row label i, column j
will do the trick.