df1 = pd.DataFrame([(1,5),(2,10),(3,15)],columns=["2009","2008"],index=["C","A","B"])
2009 2008
C 1 5
A 2 10
B 3 15
df2 = pd.DataFrame([(5,7),(11,14),(14,15)],columns=["2008","2007"],index=["D","B","C"])
2008 2007
D 5 7
B 11 14
C 14 15
desired_output =
2009 2008 2007
C 1 5 15
A 2 10 na
B 3 15 14
D na 5 7
I know there are four main ways to combine two dataframes: join, merge, append, and concat. I have experimented with a number of ways of doing this, but I cannot seem to succeed.
df1.merge(df2,how="outer",left_index=True,right_index=True,on="2008")
2009 2008 2007
A 2.0 10 NaN
B 3.0 15 14.0
C 1.0 5 15.0
D NaN 5 7.0
is the closest I could find, but the rows get re-sorted. I want all intersecting indices to come first, in the original order of df1, and then any non-intersecting indices to be appended (ideally also in the order of df2).
Any help would be appreciated.
You can try this, using pd.Index.difference with DataFrame.append, to maintain both the index order and the column order.
idx = df2.index.difference(df1.index)
df1.append(df2.loc[idx]).fillna(df2)
2009 2008 2007
C 1.0 5 15.0
A 2.0 10 NaN
B 3.0 15 14.0
D NaN 5 7.0
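Note that DataFrame.append was deprecated in pandas 1.4 and removed in 2.0, so on newer versions the same idea can be expressed with pd.concat; a minimal equivalent sketch (not part of the original answer):
idx = df2.index.difference(df1.index)
pd.concat([df1, df2.loc[idx]]).fillna(df2)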
Try combine_first with reindex, taking the union of the column indexes with sort=False:
df1.combine_first(df2).reindex(df1.columns.union(df2.columns, sort=False), axis=1)
Output:
2009 2008 2007
A 2.0 10.0 NaN
B 3.0 15.0 14.0
C 1.0 5.0 15.0
D NaN 5.0 7.0
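This restores the column order but leaves the rows sorted (A, B, C, D). If the row order of df1 followed by df2's extra rows is also wanted, the rows could be reindexed the same way; a sketch under that assumption:
(df1.combine_first(df2)
    .reindex(df1.index.union(df2.index, sort=False))
    .reindex(df1.columns.union(df2.columns, sort=False), axis=1))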
I work with panel data. Typically my panel data is not balanced, i.e., there are some missing years. The general look of panel data is as follows:
df = pd.DataFrame({'name': ['a']*4+['b']*3+['c']*4,
'year':[2001,2002,2004,2005]+[2000,2002,2003]+[2001,2002,2003,2005],
'val1':[1,2,3,4,5,6,7,8,9,10,11],
'val2':[2,5,7,11,13,17,19,23,29,31,37]})
name year val1 val2
0 a 2001 1 2
1 a 2002 2 5
2 a 2004 3 7
3 a 2005 4 11
4 b 2000 5 13
5 b 2002 6 17
6 b 2003 7 19
7 c 2001 8 23
8 c 2002 9 29
9 c 2003 10 31
10 c 2005 11 37
Now I want to create lead and lag variables that are grouped by name. Using:
df['val1_lag'] = df.groupby('name')['val1'].shift(1)
df['val1_lead'] = df.groupby('name')['val1'].shift(-1)
This simply shifts one row up/down, which is not what I want; I want to shift relative to year. My expected output:
name year val1 val2 val1_lag val1_lead
0 a 2001 1 2 NaN 2.0
1 a 2002 2 5 1.0 NaN
2 a 2004 3 7 NaN 4.0
3 a 2005 4 11 3.0 NaN
4 b 2000 5 13 NaN NaN
5 b 2002 6 17 NaN 7.0
6 b 2003 7 19 6.0 NaN
7 c 2001 8 23 NaN 9.0
8 c 2002 9 29 8.0 10.0
9 c 2003 10 31 9.0 NaN
10 c 2005 11 37 NaN NaN
My current workaround is to fill in the missing years with:
df.set_index(['name', 'year'], inplace=True)
mux = pd.MultiIndex.from_product([df.index.levels[0], df.index.levels[1]], names=['name', 'year'])
df = df.reindex(mux).reset_index()
Then I use a normal shift. However, my data is quite large, and this approach often triples its size, which is not very efficient here.
I am looking for a better approach for this scenario.
One solution is to create check columns indicating whether the year is contiguous for the lag and the lead. Set the check column to 1.0 or NaN, then multiply it with your normal groupby shift:
df['yearlag'] = (df['year'] == 1 + df.groupby('name')['year'].shift(1))*1.0
df.loc[df['yearlag']==0.0, 'yearlag'] = None
df['yearlead'] = (df['year'] == -1 + df.groupby('name')['year'].shift(-1))*1.0
df.loc[df['yearlead']==0.0, 'yearlead'] = None
To create the lag/lead variables:
%timeit df['val1_lag'] = df.groupby('name')['val1'].shift(1)*df['yearlag']
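The lead variable presumably follows the same pattern with the yearlead column (this line is an addition, not shown in the original answer):
df['val1_lead'] = df.groupby('name')['val1'].shift(-1)*df['yearlead']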
You can check this against the merge method below; it is much more efficient:
%timeit df['val1_lag'] = df[['name', 'year']].merge(df.eval('year=year+1'), how='left')['val1']
Don't use shift; instead, merge on the year ± 1:
df['val1_lag'] = df[['name', 'year']].merge(df.eval('year = year+1'), how='left')['val1']
df['val1_lead'] = df[['name', 'year']].merge(df.eval('year = year-1'), how='left')['val1']
Output:
name year val1 val2 val1_lag val1_lead
0 a 2001 1 2 NaN 2.0
1 a 2002 2 5 1.0 NaN
2 a 2004 3 7 NaN 4.0
3 a 2005 4 11 3.0 NaN
4 b 2000 5 13 NaN NaN
5 b 2002 6 17 NaN 7.0
6 b 2003 7 19 6.0 NaN
7 c 2001 8 23 NaN 9.0
8 c 2002 9 29 8.0 10.0
9 c 2003 10 31 9.0 NaN
10 c 2005 11 37 NaN NaN
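A brief note on why this works: df.eval('year = year+1') returns a copy of df with year shifted forward by one, so the left merge on the common columns name and year pairs each row with the row whose original year is one less; where no such year exists, the merge produces NaN automatically.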
I am relatively new to Python and I am wondering how I can merge these two tables while preserving both their values.
Consider these two tables:
df = pd.DataFrame([[1, 3], [2, 4],[2.5,1],[5,6],[7,8]], columns=['A', 'B'])
A B
1 3
2 4
2.5 1
5 6
7 8
df2 = pd.DataFrame([[1],[2],[3],[4],[5],[6],[7],[8]], columns=['A'])
A
1
2
...
8
I want to obtain the following result:
A B
1 3
2 4
2.5 1
3 NaN
4 NaN
5 6
6 NaN
7 8
8 NaN
You can see that column A includes all values from both the first and second dataframe in an ordered manner.
I have attempted:
pd.merge(df,df2,how='outer')
pd.merge(df,df2,how='right')
But the former does not result in an ordered dataframe and the latter does not include rows that are unique to df.
Let us do concat then drop_duplicates
out = pd.concat([df2,df]).drop_duplicates('A',keep='last').sort_values('A')
Out[96]:
A B
0 1.0 3.0
1 2.0 4.0
2 2.5 1.0
2 3.0 NaN
3 4.0 NaN
3 5.0 6.0
5 6.0 NaN
4 7.0 8.0
7 8.0 NaN
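As a side note, the outer merge you already tried can also be made to work by sorting afterwards; a small variation on the same data:
pd.merge(df, df2, how='outer', on='A').sort_values('A', ignore_index=True)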
I am having a hard time merging and updating Pandas dataframes right now. I have a bunch of CSV files that I'm parsing with pandas (which is not a problem). In very few cases I have multiple files that contain some columns present in both files.
So, for example, let's say I have:
import pandas as pd
a = pd.DataFrame({"A": [0, 1, 2, 3], "B": [4, 5, 6, 7]}, index=[0,1,2,3])
b = pd.DataFrame({"A": [11, 12, 13, 14]}, index=[41,51,61,71])
c = pd.DataFrame({"A": [110, 111, 113]}, index=[0,1,3])
What I want is this dataframe:
A B
0 110 4
1 111 5
2 2 6
3 113 7
41 11 NaN
51 12 NaN
61 13 NaN
71 14 NaN
Pandas has this nice guide: Merge, join, concatenate and compare. But I
fail to find a solution to what I want to achieve.
For example a.join(b, how="outer") raises ValueError: columns overlap but no suffix specified: Index(['A'], dtype='object'). Passing rsuffix="R"
is not an option, because the end result is:
A B AR
0 0.0 4.0 NaN
1 1.0 5.0 NaN
2 2.0 6.0 NaN
3 3.0 7.0 NaN
41 NaN NaN 11.0
51 NaN NaN 12.0
61 NaN NaN 13.0
71 NaN NaN 14.0
Not quite what I want.
pd.merge(a, b, how="outer") looks promising, but it is not quite right either,
because the indices are ignored:
A B
0 0 4.0
1 1 5.0
2 2 6.0
3 3 7.0
4 11 NaN
5 12 NaN
6 13 NaN
7 14 NaN
Passing left_index=True and right_index=True yields a dataframe similar to
.join(..., rsuffix="_x", lsuffix="_y"), so not what I want.
Using update is almost what I want; a.update(c) would modify a to
A B
0 110.0 4
1 111.0 5
2 2.0 6
3 113.0 7
but a.update(b) does nothing (I assume because the indices of a and b are disjoint).
So, is what I want even possible with a single line of code?
EDIT
I came up with this one:
lll = pd.concat([a, b, c]).sort_index()
lll.drop_duplicates().groupby(level=0).last()
A B
0 110 4.0
1 111 5.0
2 2 6.0
3 113 7.0
41 11 NaN
51 12 NaN
61 13 NaN
71 14 NaN
This is what I want; the question is: is this correct, or is it just a coincidence that it yields the result I wanted?
How are you determining which 'A' column has priority?
In the order I'm reading the files. The files are generated by a device (which is kind of a "black box" to me) and have a date in them. So I do:
tasks = [parse_csv_file(fn) for fn in sorted(glob.glob("*.csv"))]
results = await asyncio.gather(*tasks)
And I would like to do (no error checking as this is an example):
results = iter(results)
merged_df = next(results)
for df in results:
    merged_df = the_magic_function_Im_looking_for(merged_df, df)
Reducing with combine_first:
from functools import reduce
to_merge = [c, b, a]
result = reduce(pd.DataFrame.combine_first, to_merge)
which successively applies combine_first to the entries of the list, ending up with the all-combined, i.e., reduced, dataframe at the end
(we can pass reversed(to_merge) to reduce if to_merge comes in the reverse order),
to get
>>> result
A B
0 110.0 4.0
1 111.0 5.0
2 2.0 6.0
3 113.0 7.0
41 11.0 NaN
51 12.0 NaN
61 13.0 NaN
71 14.0 NaN
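In terms of the original file loop, assuming the list returned by asyncio.gather is in read order and later files should take priority (as c overrides a in the example), the loop could collapse to a single reduce; a sketch reusing the results name from the question:
merged_df = reduce(pd.DataFrame.combine_first, reversed(results))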
Try concat + groupby last on axis=1 to merge the DataFrames, then take the "last" valid value per column group:
df = pd.concat([a, b, c], axis=1).groupby(level=0, axis=1).last()
df:
A B
0 110.0 4.0
1 111.0 5.0
2 2.0 6.0
3 113.0 7.0
41 11.0 NaN
51 12.0 NaN
61 13.0 NaN
71 14.0 NaN
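(Note: DataFrame.groupby with axis=1 is deprecated in recent pandas releases; an equivalent on the same frames would be pd.concat([a, b, c], axis=1).T.groupby(level=0).last().T.)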
Or, concatenating long and getting the last valid value per row index, thanks to @anky:
df = pd.concat([a, b, c]).groupby(level=0).last()
df:
A B
0 110 4.0
1 111 5.0
2 2 6.0
3 113 7.0
41 11 NaN
51 12 NaN
61 13 NaN
71 14 NaN
I have 3 dataframes that I would like to combine. The first 3 columns hold the same data where it exists, and the columns after that are new in each dataframe; for example, df1's columns from the fourth onward differ from df2's.
I would like to merge rows that share the same unique identifier, and otherwise concatenate.
df1
ID A B 2009 2010
1 A B 2 3
2 A C 2 2
3 A B 3 3
df2
ID A B 2011 2012
2 A C 2 2
3 A C 3 4
5 A B 8 9
df3
ID A B 2013 2014
2 A C 2 3
4 A E 3 4
5 A B 8 9
result df
ID A B 2009 2010 2011 2012 2013 2014
1 A B 2 3. 2. 3.
2 A C 2 2. 2. 2. 2. 3
3 A C 3 3. 3. 4.
4 A E 3. 4
5 A B 8 9 8. 9
Edit: fixed the df data. Secondly, one issue I notice is that when I merge, my columns A and B are duplicated as A_X, A_Y, A_Z, B_X, B_Y, B_Z.
Thank you in advance.
Try pd.concat([df.set_index('ID') for df in [df1, df2, df3]], axis=1).reset_index()
The list comprehension sets ID as the index of each dataframe. Then we concatenate horizontally. Horizontal concatenation tries to match up the indexes where possible, otherwise it adds rows. Finally, we reset the index.
There is something wrong with the expected result, but the code for the merge would look like this:
from functools import reduce
import pandas as pd
dfs = [df1, df2, df3]
df_merged = reduce(lambda left, right: pd.merge(left, right, on=['ID'], how='outer'), dfs)
df_merged:
ID 2009 2010 2011 2012 2013 2014
0 1 2.0 3.0 2.0 3.0 NaN NaN
1 2 3.0 4.0 3.0 4.0 2.0 3.0
2 3 4.0 5.0 4.0 5.0 NaN NaN
3 4 NaN NaN NaN NaN 3.0 4.0
4 5 NaN NaN NaN NaN 8.0 9.0
Edit:
Just use on=['ID', 'A', 'B']
Output:
ID A B 2009 2010 2011 2012 2013 2014
0 1 A B 2.0 3.0 NaN NaN NaN NaN
1 2 A C 2.0 2.0 2.0 2.0 2.0 3.0
2 3 A B 3.0 3.0 NaN NaN NaN NaN
3 3 A C NaN NaN 3.0 4.0 NaN NaN
4 5 A B NaN NaN 8.0 9.0 8.0 9.0
5 4 A E NaN NaN NaN NaN 3.0 4.0
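If the rows should also come out in ID order, as in the expected result, a sort can be appended (a small addition on top of the answer):
df_merged.sort_values('ID', ignore_index=True)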
I have two dataframes:
Df_1:
A B C D
1 10 nan 20 30
2 20 30 20 10
Df_2:
A B
1 10 40
2 30 70
I want to merge them and have this final dataframe.
A B C D
1 10 40 20 30
2 20 30 20 10
3 30 70 nan nan
How do I do that?
Looking at the expected result, I think the index of the second row of Df_2 should be 3 (instead of 2).
Run Df_1.combine_first(Df_2).
The result is:
A B C D
1 10.0 40.0 20.0 30.0
2 20.0 30.0 20.0 10.0
3 30.0 70.0 NaN NaN
i.e., due to the possible NaN values, the column types are coerced to float. But if you want, you can revert this where possible by applying to_numeric:
Df_1.combine_first(Df_2).apply(pd.to_numeric, downcast='integer')
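With the downcast, columns A and B (which contain no NaN) become integers again, while C and D stay float, since NaN cannot be represented in a plain NumPy integer column.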