I am trying to convert a SQL query to Python. The SQL statement is as follows:
select * from table1
union
select * from table2
union
select * from table3
union
select * from table4
Now I have those tables in 4 DataFrames (df1, df2, df3, df4) and I would like to union them so that the result matches the SQL query.
I am confused about which pandas operation is equivalent to SQL UNION.
Thanks in advance!!
Note:
The column names for all the DataFrames are the same.
If I understand the issue correctly, you are looking for the concat function.
pandas.concat([df1, df2, df3, df4]) should work correctly if the column names are the same across all the DataFrames. Note that SQL UNION (unlike UNION ALL) also removes duplicate rows, so add .drop_duplicates() if you need that behavior, as in the sketch below.
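A minimal sketch of the full equivalent, assuming df1 through df4 already exist and share the same column names:
import pandas as pd
# stack the four DataFrames vertically (UNION ALL), then drop duplicate rows to mimic SQL UNION
result = pd.concat([df1, df2, df3, df4], ignore_index=True).drop_duplicates().reset_index(drop=True)
print(result)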
IIUC you can use merge and join on the column matching_col of all the DataFrames:
import pandas as pd
# Merge multiple dataframes
df1 = pd.DataFrame({"matching_col": pd.Series({1: 4, 2: 5, 3: 7}),
"a": pd.Series({1: 52, 2: 42, 3:7})}, columns=['matching_col','a'])
print(df1)
matching_col a
1 4 52
2 5 42
3 7 7
df2 = pd.DataFrame({"matching_col": pd.Series({1: 2, 2: 7, 3: 8}),
"a": pd.Series({1: 62, 2: 28, 3:9})}, columns=['matching_col','a'])
print(df2)
matching_col a
1 2 62
2 7 28
3 8 9
df3 = pd.DataFrame({"matching_col": pd.Series({1: 1, 2: 0, 3: 7}),
"a": pd.Series({1: 28, 2: 52, 3:3})}, columns=['matching_col','a'])
print(df3)
matching_col a
1 1 28
2 0 52
3 7 3
df4 = pd.DataFrame({"matching_col": pd.Series({1: 4, 2: 9, 3: 7}),
"a": pd.Series({1: 27, 2: 24, 3:7})}, columns=['matching_col','a'])
print(df4)
matching_col a
1 4 27
2 9 24
3 7 7
Solution 1:
df = pd.merge(pd.merge(pd.merge(df1, df2, on='matching_col'), df3, on='matching_col'), df4, on='matching_col')
# set column names
df.columns = ['matching_col', 'a1', 'a2', 'a3', 'a4']
print(df)
matching_col a1 a2 a3 a4
0 7 7 28 3 7
Solution 2:
from functools import reduce
dfs = [df1, df2, df3, df4]
# use reduce to chain the merges across all DataFrames
df = reduce(lambda left, right: pd.merge(left, right, on='matching_col'), dfs)
# set column names
df.columns = ['matching_col', 'a1', 'a2', 'a3', 'a4']
print(df)
matching_col a1 a2 a3 a4
0 7 7 28 3 7
But if you only need to concatenate the DataFrames, use concat and reset the index with the parameter ignore_index=True:
print(pd.concat([df1, df2, df3, df4], ignore_index=True))
matching_col a
0 4 52
1 5 42
2 7 7
3 2 62
4 7 28
5 8 9
6 1 28
7 0 52
8 7 3
9 4 27
10 9 24
11 7 7
This should be a comment on jezrael's answer (+1'd for merge over concat), but I don't have sufficient reputation.
The OP asked how to union the DataFrames, but merge returns the intersection (an inner join) by default:
http://pandas.pydata.org/pandas-docs/stable/generated/pandas.merge.html#pandas.merge
To get unions, add how='outer' to the merge calls.
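For example, a minimal sketch on the first two frames above (the same flag can be added to each merge in the reduce-based solution as well):
# how='outer' keeps matching_col values present in either frame and fills the missing side with NaN, mirroring a SQL full outer join
print(pd.merge(df1, df2, on='matching_col', how='outer'))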
Following this answer:
https://stackoverflow.com/a/47107164/11462274
I am trying to create a DataFrame containing only the rows not found in another DataFrame, comparing not all columns but only some specific ones, so I tried to do it this way:
import pandas as pd
df1 = pd.DataFrame(data = {'col1' : [1, 2, 3, 4, 5, 3],
'col2' : [10, 11, 12, 13, 14, 10],
'col3' : [1,5,7,9,6,7]})
df2 = pd.DataFrame(data = {'col1' : [1, 2, 3],
'col2' : [10, 11, 12],
'col3' : [1,5,8]})
df_merge = df1.merge(df2.drop_duplicates(), on=['col1','col3'],
how='left', indicator=True)
df_merge = df_merge.query("_merge == 'left_only'")[df1.columns]
print(df_merge)
But note that when not merging on all the columns, the columns that are not part of the key get renamed, e.g. col2 becomes col2_x:
col1 col2_x col3 col2_y _merge
0 1 10 1 10.0 both
1 2 11 5 11.0 both
2 3 12 7 NaN left_only
3 4 13 9 NaN left_only
4 5 14 6 NaN left_only
5 3 10 7 NaN left_only
So when I try to create the final DataFrame without the unnecessary columns, the original column names are no longer found and the desired filter fails:
KeyError(f"{not_found} not in index")
KeyError: "['col2'] not in index"
You can use the suffixes parameter of pandas.DataFrame.merge:
df_merge = df1.merge(df2.drop_duplicates(), on=['col1','col3'],
how='left', indicator=True, suffixes=("", "_"))
df_merge = df_merge.query("_merge == 'left_only'")[df1.columns]
Output :
print(df_merge)
col1 col2 col3
2 3 12 7
3 4 13 9
4 5 14 6
5 3 10 7
Another option: since it is a left join, you can simply drop the overlapping columns from the other DataFrame before merging (thereby producing a smaller merge result):
df_merge = df1.merge(df2.drop_duplicates().drop(columns=['col2']),
on=['col1','col3'], how='left', indicator=True)
df_merge = df_merge.query("_merge == 'left_only'")[df1.columns]
print(df_merge)
col1 col2 col3
2 3 12 7
3 4 13 9
4 5 14 6
5 3 10 7
I am curious what the best practice is for the following:
Let's say I have 2 dataframes:
df1:
A B C D
0 1 2 3 4
1 1 3 5 5
2 1 2 3 4
3 3 5 6 7
4 9 7 6 5
df2:
A B C
0 1 2 3
1 9 7 6
I want to filter df1 on columns A, B, C to show only the records that are present in df2's A, B, C columns.
The result I want to see:
A B C D
0 1 2 3 4
1 1 2 3 4
2 9 7 6 5
As you can see, I only need records from df1 where the combination of the first 3 columns is either 1,2,3 or 9,7,6.
What I tried is a bit overkill in my opinion:
merged_df = df1.merge(df2, how="left", on=["A", "B", "C"], indicator=True)
mask = merged_df["_merge"] == "both"
result = merged_df[mask].drop(["_merge"], axis=1)
Is there any better way to do this?
Merge with how='inner':
import pandas as pd
df1 = pd.DataFrame(
{
"A": [1, 1, 1,3,9],
"B": [2,3,2,5,7],
"C": [3,5,3,6,6],
"D": [4,5,4,7,5]
}
)
df2 = pd.DataFrame(
{
"A": [1, 9],
"B": [2,7],
"C": [3,6],
}
)
#print(df1)
df = pd.merge(df1, df2, on=['A', 'B', 'C'], how='inner')
print(df)
Output:
A B C D
0 1 2 3 4
1 1 2 3 4
2 9 7 6 5
I have 3 tables of the following form:
import pandas as pd
df1 = pd.DataFrame({'ISIN': [1, 4, 7, 10],
'Value1': [2012, 2014, 2013, 2014],
'Value2': [55, 40, 84, 31]})
df1 = df1.set_index("ISIN")
df2 = pd.DataFrame({'ISIN': [1, 4, 7, 10],
'Symbol': ['a', 'b', 'c', 'd']})
df2 = df2.set_index("ISIN")
df3 = pd.DataFrame({'Symbol': ['a', 'b', 'c', 'd'],
'01.01.2020': [1, 2, 3, 4],
'01.01.2021': [3,2,3,2]})
df3 = df3.set_index("Symbol")
My aim now is to merge all 3 tables together. I would go about it the following way:
Step 1 (merge df1 and df2):
result1 = pd.merge(df1, df2, on=["ISIN"])
print(result1)
The result is ok and gives me the table:
Value1 Value2 Symbol
ISIN
1 2012 55 a
4 2014 40 b
7 2013 84 c
10 2014 31 d
In the next step I want to merge it with df3, so as an intermediate step I merged df2 and df3:
print(result1)
result2 = pd.merge(df2, df3, on=["Symbol"])
print(result2)
My problem now, the output is:
Symbol 01.01.2020 01.01.2021
0 a 1 3
1 b 2 2
2 c 3 3
3 d 4 2
The column ISIN is lost here. And the step
result = pd.merge(result, result2, on=["ISIN"])
result.set_index("ISIN")
produces an error.
Is there an elegant way to merge these 3 tables together (with key column ISIN), and why is the key column lost in the second merge?
Just chain the merge operations:
result = df1.merge(df2.reset_index(), on='ISIN').merge(df3, on='Symbol')
Or using your syntax, use result1 as source for the second merge:
result1 = pd.merge(df1, df2.reset_index(), on=["ISIN"])
result2 = pd.merge(result1, df3, on=["Symbol"])
output:
ISIN Value1 Value2 Symbol 01.01.2020 01.01.2021
0 1 2012 55 a 1 3
1 4 2014 40 b 2 2
2 7 2013 84 c 3 3
3 10 2014 31 d 4 2
You should not set the index prior to joining if you wish to keep it as part of the data in your dataframe. I suggest first merging, then setting the index to your desired value. In a single line:
output = df1.merge(df2,on='ISIN').merge(df3,on='Symbol')
Outputs:
ISIN Value1 Value2 Symbol 01.01.2020 01.01.2021
0 1 2012 55 a 1 3
1 4 2014 40 b 2 2
2 7 2013 84 c 3 3
3 10 2014 31 d 4 2
You can now set the index to ISIN by adding .set_index('ISIN') to output:
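Explicitly, that is (a one-line sketch reusing the output variable from above):
output = output.set_index('ISIN')
print(output)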
Value1 Value2 Symbol 01.01.2020 01.01.2021
ISIN
1 2012 55 a 1 3
4 2014 40 b 2 2
7 2013 84 c 3 3
10 2014 31 d 4 2
I have two dataframes:
df1 = pd.DataFrame({
'Name' : ['A', 'A', 'A', 'A', 'B', 'B'],
'Value': [10, 9, 8, 10, 99 , 88],
'Day' : [1,2,3,4,1,2]
})
df2 = pd.DataFrame({
'Name' : ['C', 'C', 'C', 'C'],
'Value': [1,2,3,4],
'Day' : [1,2,3,4]
})
I would like to subtract the values in df2 from the values in df1, matched on the day, and create a new DataFrame called delta_values. If there are no entries for the day then no action should occur.
To explain further: B in the Name column only has values for days 1 and 2. df2's values for days 1 and 2 should be subtracted from B's values for days 1 and 2, but since B has no values for days 3 and 4, no arithmetic should occur there. I am having trouble with this part.
The output I am looking for is:
Name Value Day
0 A 9 1
1 A 7 2
2 A 5 3
3 A 6 4
4 B 98 1
5 B 86 2
If nothing better comes to somebody's mind, here's a correct but not very elegant solution:
result = df1.set_index(['Day', 'Name']).unstack()['Value']\
    .subtract(df2.set_index('Day')['Value'], axis=0)\
    .stack().reset_index()
Make the result look like the expected output:
result.columns = 'Day', 'Name', 'Value'
result.Value = result.Value.astype(int)
result.sort_values(['Name', 'Day'], inplace=True)
result = result[['Name', 'Value', 'Day']]
We can merge the two DataFrames on the Day column and then subtract from there.
merged = df1.merge(df2, how='inner', on='Day', suffixes=('', '_y'))
print(merged)
Name Value Day Name_y Value_y
0 A 10 1 C 1
1 A 9 2 C 2
2 A 8 3 C 3
3 A 10 4 C 4
4 B 99 1 C 1
5 B 88 2 C 2
delta_values = df1.copy()
delta_values['Value'] = merged['Value'] - merged['Value_y']
print(delta_values)
Name Value Day
0 A 9 1
1 A 7 2
2 A 5 3
3 A 6 4
4 B 98 1
5 B 86 2
You can do this with either map or merge. Here's a map solution; the fillna(0) leaves any day with no match in df2 unchanged, which covers your "no action should occur" requirement:
delta_values = df1.copy()
delta_values['Value'] -= delta_values['Day'].map(df2.set_index('Day')['Value']).fillna(0)
Output:
Name Value Day
0 A 9 1
1 A 7 2
2 A 5 3
3 A 6 4
4 B 98 1
5 B 86 2
Assume I have two dataframes of this format (call them df1 and df2):
+------------------------+------------------------+--------+
| user_id | business_id | rating |
+------------------------+------------------------+--------+
| rLtl8ZkDX5vH5nAx9C3q5Q | eIxSLxzIlfExI6vgAbn2JA | 4 |
| C6IOtaaYdLIT5fWd7ZYIuA | eIxSLxzIlfExI6vgAbn2JA | 5 |
| mlBC3pN9GXlUUfQi1qBBZA | KoIRdcIfh3XWxiCeV1BDmA | 3 |
+------------------------+------------------------+--------+
I'm looking to get a DataFrame of all the rows that have a common user_id in df1 and df2 (i.e., if a user_id is in both df1 and df2, include its rows from both in the output DataFrame).
I can think of many ways to approach this, but they all strike me as clunky. For example, we could find all the unique user_ids in each dataframe, create a set of each, find their intersection, filter the two dataframes with the resulting set and concatenate the two filtered dataframes.
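In code, that clunky approach would look something like this (a sketch using the column names above):
import pandas as pd
# user_ids present in both frames
common = set(df1['user_id']) & set(df2['user_id'])
# filter each frame on the shared ids, then stack the two filtered frames
result = pd.concat([df1[df1['user_id'].isin(common)], df2[df2['user_id'].isin(common)]], ignore_index=True)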
Maybe that's the best approach, but I know Pandas is clever. Is there a simpler way to do this? I've looked at merge but I don't think that's what I need.
My understanding is that this question is better answered over in this post.
But briefly, the answer to the OP with this method is simply:
s1 = pd.merge(df1, df2, how='inner', on=['user_id'])
Which gives s1 with 5 columns: user_id and the other two columns from each of df1 and df2.
If I understand you correctly, you can use a combination of Series.isin() and DataFrame.append():
In [80]: df1
Out[80]:
rating user_id
0 2 0x21abL
1 1 0x21abL
2 1 0xdafL
3 0 0x21abL
4 4 0x1d14L
5 2 0x21abL
6 1 0x21abL
7 0 0xdafL
8 4 0x1d14L
9 1 0x21abL
In [81]: df2
Out[81]:
rating user_id
0 2 0x1d14L
1 1 0xdbdcad7
2 1 0x21abL
3 3 0x21abL
4 3 0x21abL
5 1 0x5734a81e2
6 2 0x1d14L
7 0 0xdafL
8 0 0x1d14L
9 4 0x5734a81e2
In [82]: ind = df2.user_id.isin(df1.user_id) & df1.user_id.isin(df2.user_id)
In [83]: ind
Out[83]:
0 True
1 False
2 True
3 True
4 True
5 False
6 True
7 True
8 True
9 False
Name: user_id, dtype: bool
In [84]: df1[ind].append(df2[ind])
Out[84]:
rating user_id
0 2 0x21abL
2 1 0xdafL
3 0 0x21abL
4 4 0x1d14L
6 1 0x21abL
7 0 0xdafL
8 4 0x1d14L
0 2 0x1d14L
2 1 0x21abL
3 3 0x21abL
4 3 0x21abL
6 2 0x1d14L
7 0 0xdafL
8 0 0x1d14L
This is essentially the algorithm you described as "clunky", using idiomatic pandas methods. Note the duplicate row indices. Also, note that this won't give you the expected output if df1 and df2 have no overlapping row indices, i.e., if
In [93]: df1.index & df2.index
Out[93]: Int64Index([], dtype='int64')
In fact, it won't give the expected output if their row indices are not equal.
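As a side note, DataFrame.append() was deprecated and then removed in pandas 2.0; a sketch of the same idea with pd.concat, which also sidesteps the row-index alignment caveat above (assuming the same df1 and df2):
import pandas as pd
# build one mask per frame, so nothing depends on the two frames sharing row labels
mask1 = df1.user_id.isin(df2.user_id)
mask2 = df2.user_id.isin(df1.user_id)
print(pd.concat([df1[mask1], df2[mask2]]))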
In SQL, this problem could be solved by several methods:
select * from df1 where exists (select * from df2 where df2.user_id = df1.user_id)
union all
select * from df2 where exists (select * from df1 where df1.user_id = df2.user_id)
or join and then unpivot (possible in SQL Server):
select
df1.user_id,
c.rating
from df1
inner join df2 on df2.user_id = df1.user_id
outer apply (
select df1.rating union all
select df2.rating
) as c
The second one could be written in pandas with something like:
>>> df1 = pd.DataFrame({"user_id":[1,2,3], "rating":[10, 15, 20]})
>>> df2 = pd.DataFrame({"user_id":[3,4,5], "rating":[30, 35, 40]})
>>>
>>> df = pd.merge(df1, df2, on='user_id', suffixes=['_1', '_2'])
>>> df3 = df[['user_id', 'rating_1']].rename(columns={'rating_1':'rating'})
>>> df4 = df[['user_id', 'rating_2']].rename(columns={'rating_2':'rating'})
>>> pd.concat([df3, df4], axis=0)
user_id rating
0 3 20
0 3 30
This is a simple solution:
df1[df1 == df2].dropna()
Note that this elementwise comparison requires df1 and df2 to be identically labeled (same shape, index and columns), so it only covers the narrow case where the two frames line up row by row.
You can do this for n DataFrames and k columns by using pd.Index.intersection:
import pandas as pd
from functools import reduce
from typing import Union
def dataframe_intersection(
    dataframes: list[pd.DataFrame], by: Union[list, str]
) -> list[pd.DataFrame]:
    set_index = [d.set_index(by) for d in dataframes]
    index_intersection = reduce(pd.Index.intersection, [d.index for d in set_index])
    intersected = [df.loc[index_intersection].reset_index() for df in set_index]
    return intersected
df1 = pd.DataFrame({"user_id":[1,2,3], "business_id": ['a', 'b', 'c'], "rating":[10, 15, 20]})
df2 = pd.DataFrame({"user_id":[3,4,5], "business_id": ['c', 'd', 'e'], "rating":[30, 35, 40]})
df3 = pd.DataFrame({"user_id":[3,3,3], "business_id": ['f', 'c', 'f'], "rating":[50, 70, 80]})
df_list = [df1, df2, df3]
This gives
>>> pd.concat(dataframe_intersection(df_list, by='user_id'))
user_id business_id rating
0 3 c 20
0 3 c 30
0 3 f 50
1 3 c 70
2 3 f 80
And
>>> pd.concat(dataframe_intersection(df_list, by=['user_id', 'business_id']))
user_id business_id rating
0 3 c 20
0 3 c 30
0 3 c 70