Pandas left join returning multiple rows - python

I am using python to merge two dataframes:
join = pd.merge(df1, df2, on=["A","B"], how="left")
Table 1:
A B
a 1
b 2
c 3
Table 2:
A B Flag C
a 1 0 20
b 2 1 40
c 3 0 60
a 1 1 80
b 2 0 10
The result that I get after left join is:
A B Flag C
a 1 0 20
a 1 1 80
b 2 1 40
b 2 0 10
c 3 0 60
Here we see that rows 1 and 2 appear twice because of table 2. I want to keep just one row per key, chosen by the Flag column: of the two duplicate rows, keep the one whose Flag value is 1.
So Final Expected output is:
A B Flag C
a 1 1 80
b 2 1 40
c 3 0 60
Is there any pythonic way to do it?

# raise preferred lines to the top
df2 = df2.sort_values(by='Flag', ascending=False)
# deduplicate
df2 = df2.drop_duplicates(subset=['A','B'], keep='first')
# merge
pd.merge(df1, df2, on=['A','B'])
A B Flag C
0 a 1 1 80
1 b 2 1 40
2 c 3 0 60
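Sorting by Flag in descending order first guarantees that drop_duplicates(keep='first') retains the Flag == 1 row whenever a key has one; for a key with no Flag == 1 row, such as (c, 3), the single remaining row survives unchanged.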

The concept is similar to what you would do in SQL: build a separate table holding the selection criteria (in this case the per-key maximum of Flag), keeping enough columns to match each observation in the joined table.
join = pd.merge(df1, df2, on=['A','B'], how='left')
# per-key maximum Flag is the selection criterion
maximums = join.groupby('A', as_index=False)['Flag'].max()
join = pd.merge(join, maximums, on=['A','Flag'])
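If Flag is numeric, a shorter equivalent (my sketch, not from the answer above) keeps each group's max-Flag row with idxmax:
# for every (A, B) pair, select the row whose Flag is largest;
# assumes every key found a match, so Flag contains no NaN
join = pd.merge(df1, df2, on=['A','B'], how='left')
result = join.loc[join.groupby(['A','B'])['Flag'].idxmax()]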

Try merging on the index instead (note that on cannot be combined with left_index/right_index):
join = pd.merge(df1, df2, how="left", left_index=True, right_index=True)
print(join)

Related

How to get the un-filtered rows in python?

I have a question about filtering. I have a dataset and a filter condition; I can get the filtered records, but I also want the rows that did not match the filter. How do I get them?
Code:
df =
a b c
0 1 2 3
1 21 02 1
2 18 03 2
3 20 1 1
4 1 1 18
5 0 0 1
filter = 'a>1 and c<3'
df.query(filter)
output for this:
a b c
1 21 02 1
2 18 03 2
3 20 1 1
Now I want the other records as well. How do we get the unfiltered records?
Expected output:
a b c
0 1 2 3
4 1 1 18
5 0 0 1
First I recreate your dataframe and create the desired subset dataframe based on your condition:
import pandas as pd

df = pd.DataFrame({'a':[1,21,18,20,1,0],
                   'b':[2,2,3,1,1,0],
                   'c':[3,1,2,1,18,1]})
filter = 'a>1 and c<3'
df_subset = df.query(filter)
Then we merge the subset back onto the original with an indicator, keep the rows that appear only on the left, and drop the _merge column:
df_1 = df.merge(df_subset, how='left', indicator=True)
df_1 = df_1[df_1['_merge'] == 'left_only']
df_1.drop(['_merge'],axis=1, inplace = True)
This gives you your desired output, represented by dataframe df_1.
Or you can simply filter post merge and not drop the column, i.e.
df_1 = df.merge(df_subset, how='left', indicator=True)
df_1[df_1['_merge'] == 'left_only'][['a','b','c']]
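As an aside (not part of the original answer), if df has a unique index you can skip the merge entirely and invert a membership test on the subset's index:
# rows of df whose index is absent from the query result; assumes a unique index
df_rest = df[~df.index.isin(df.query(filter).index)]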

Duplicate row of low occurrence in pandas dataframe

In the following dataset, what's the best way to duplicate rows so that every groupby(['Type']) count below 3 is raised to 3? df is the input and df1 is my desired outcome; you can see row 3 of df was duplicated twice at the end. This is only an example deck: the real data has approximately 20 million lines and 400K unique Types, so an efficient method is desired.
>>> df
Type Val
0 a 1
1 a 2
2 a 3
3 b 1
4 c 3
5 c 2
6 c 1
>>> df1
Type Val
0 a 1
1 a 2
2 a 3
3 b 1
4 c 3
5 c 2
6 c 1
7 b 1
8 b 1
I thought about using something like the following, but I do not know the best way to write func.
df.groupby('Type').apply(func)
Thank you in advance.
Use value_counts with map and repeat:
counts = df.Type.value_counts()
repeat_map = 3 - counts[counts < 3]
df['repeat_num'] = df.Type.map(repeat_map).fillna(0,downcast='infer')
df = df.append(df.set_index('Type')['Val'].repeat(df['repeat_num']).reset_index(),
               sort=False, ignore_index=True)[['Type','Val']]
print(df)
Type Val
0 a 1
1 a 2
2 a 3
3 b 1
4 c 3
5 c 2
6 c 1
7 b 1
8 b 1
Note: sort=False for append requires pandas >= 0.23.0; remove it if using an older version.
EDIT: If the data contains multiple value columns, set all columns except one as the index, repeat, and then reset_index:
df = df.append(df.set_index(['Type','Val_1','Val_2'])['Val'].repeat(df['repeat_num']).reset_index(),
               sort=False, ignore_index=True)
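Separately, note that DataFrame.append was removed in pandas 2.0; a sketch of the original single-column snippet rewritten with pd.concat (same counts and repeat_num setup as above) would be:
# build the repeated rows, then stack them under the original frame
extra = df.set_index('Type')['Val'].repeat(df['repeat_num']).reset_index()
df = pd.concat([df, extra], ignore_index=True)[['Type','Val']]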

Comparing dataframes in pandas

I have two separate pandas dataframes (df1 and df2) which have multiple columns with some common columns.
I would like to find every row in df2 that does not have a match in df1. A match between df1 and df2 is defined as a row having the same values in columns A and B.
df1
A B C text
45 2 1 score
33 5 2 miss
20 1 3 score
df2
A B D text
45 3 1 shot
33 5 2 shot
10 2 3 miss
20 1 4 miss
Result df (only rows 1 and 3 are returned, since for rows 2 and 4 the values of A and B in df2 have a match in the same row of df1)
A B D text
45 3 1 shot
10 2 3 miss
Is it possible to use the isin method in this scenario?
This works:
# set index (as selecting columns)
df1 = df1.set_index(['A','B'])
df2 = df2.set_index(['A','B'])
# now .isin will work
df2[~df2.index.isin(df1.index)].reset_index()
A B D text
0 45 3 1 shot
1 10 2 3 miss
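The same anti-join can also be written with a merge indicator (my sketch, mirroring the indicator pattern from the filtering answer above):
# keep df2 rows whose (A, B) pair never appears in df1
out = df2.merge(df1[['A','B']], on=['A','B'], how='left', indicator=True)
out = out[out['_merge'] == 'left_only'].drop(columns='_merge')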

Is it possible to split a Pandas dataframe using groupby and merge each group with separate dataframes

I have a Pandas dataframe that contains a grouping variable. I would like to merge each group with other dataframes based on the contents of one of the columns. So, for example, I have a dataframe, dfA, which can be defined as:
dfA = pd.DataFrame({'a':[1,2,3,4,5,6],
                    'b':[0,1,0,0,1,1],
                    'c':['a','b','c','d','e','f']})
a b c
0 1 0 a
1 2 1 b
2 3 0 c
3 4 0 d
4 5 1 e
5 6 1 f
Two other dataframes, dfB and dfC, contain a common column ('a') and an extra column ('d') and can be defined as:
dfB = pd.DataFrame({'a':[1,2,3],
                    'd':[11,12,13]})
a d
0 1 11
1 2 12
2 3 13
dfC = pd.DataFrame({'a':[4,5,6],
                    'd':[21,22,23]})
a d
0 4 21
1 5 22
2 6 23
I would like to be able to split dfA based on column 'b' and merge one of the groups with dfB and the other group with dfC to produce an output that looks like:
a b c d
0 1 0 a 11
1 2 1 b 12
2 3 0 c 13
3 4 0 d 21
4 5 1 e 22
5 6 1 f 23
In this simplified version, I could concatenate dfB and dfC and merge with dfA without splitting into groups as shown below:
dfX = pd.concat([dfB,dfC])
dfA = dfA.merge(dfX,on='a',how='left')
print(dfA)
a b c d
0 1 0 a 11
1 2 1 b 12
2 3 0 c 13
3 4 0 d 21
4 5 1 e 22
5 6 1 f 23
However, in the real-world situation the smaller dataframes will be generated from multiple, complex sources, and combining them into a single dataframe beforehand may not be feasible: there may be overlapping data in the column used for merging (which is avoided if the dataframe is split on the grouping variable first). Is it possible to use Pandas' groupby() method to do this instead? I was thinking of something like the following, which doesn't work, perhaps because I'm not combining the groups into a new dataframe correctly:
grouped = dfA.groupby('b')
for name, group in grouped:
    if name == 0:
        group = group.merge(dfB, on='a', how='left')
    elif name == 1:
        group = group.merge(dfC, on='a', how='left')
Any thoughts would be appreciated.
This will fix your code: collect each merged group in a list and concatenate at the end.
l = []
grouped = dfA.groupby('b')
for name, group in grouped:
    if name == 0:
        group = group.merge(dfB, on='a', how='left')
    elif name == 1:
        group = group.merge(dfC, on='a', how='left')
    l.append(group)
pd.concat(l)
Out[215]:
a b c d
0 1 0 a 11.0
1 3 0 c 13.0
2 4 0 d NaN
0 2 1 b NaN
1 5 1 e 22.0
2 6 1 f 23.0
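A more compact variant (my sketch, not from the answer above) maps each group key to its lookup dataframe, which also scales beyond two groups:
# hypothetical mapping from the value of 'b' to the dataframe to merge with
merge_map = {0: dfB, 1: dfC}
result = pd.concat(group.merge(merge_map[name], on='a', how='left')
                   for name, group in dfA.groupby('b'))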

merge two dataframes without repeats pandas

I am trying to merge two dataframes, one with the columns customerId, full name, and emails, and the other with the columns customerId, amount, and date. I want the first dataframe to be the main one, with the second dataframe's information included only where the customerIds match up. I tried:
merge = pd.merge(df, df2, on='customerId', how='left')
but the dataframe that is produced contains a lot of repeats and looks wrong:
customerId full name emails amount date
0 002963338 Star shine star.shine#cdw.com $2,910.94 2016-06-14
1 002963338 Star shine star.shine#cdw.com $9,067.70 2016-05-27
2 002963338 Star shine star.shine#cdw.com $6,507.24 2016-04-12
3 002963338 Star shine star.shine#cdw.com $1,457.99 2016-02-24
4 986423367 palm tree tree.palm#snapchat.com,tree#.com $4,604.83 2016-07-16
This can't be right, please help!
The problem is that you have duplicates in the customerId column. The solution is to remove them, e.g. with drop_duplicates:
df2 = df2.drop_duplicates('customerId')
Sample:
df = pd.DataFrame({'customerId':[1,2,1,1,2], 'full name':list('abcde')})
print (df)
customerId full name
0 1 a
1 2 b
2 1 c
3 1 d
4 2 e
df2 = pd.DataFrame({'customerId':[1,2,1,2,1,1], 'full name':list('ABCDEF')})
print (df2)
customerId full name
0 1 A
1 2 B
2 1 C
3 2 D
4 1 E
5 1 F
merge = pd.merge(df, df2, on='customerId', how='left')
print (merge)
customerId full name_x full name_y
0 1 a A
1 1 a C
2 1 a E
3 1 a F
4 2 b B
5 2 b D
6 1 c A
7 1 c C
8 1 c E
9 1 c F
10 1 d A
11 1 d C
12 1 d E
13 1 d F
14 2 e B
15 2 e D
df2 = df2.drop_duplicates('customerId')
merge = pd.merge(df, df2, on='customerId', how='left')
print (merge)
customerId full name_x full name_y
0 1 a A
1 2 b B
2 1 c A
3 1 d A
4 2 e B
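If the duplicate customerId rows carry real information (e.g. several transactions per customer), an alternative to dropping them is to aggregate before merging; a sketch, assuming the amount column has already been parsed to a numeric type:
# hypothetical: sum each customer's transactions, then merge one row per customer
agg = df2.groupby('customerId', as_index=False)['amount'].sum()
merge = pd.merge(df, agg, on='customerId', how='left')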
I do not see repeats as whole rows, but there are repetitions in customerId. You could remove them using:
df.drop_duplicates('customerId', inplace=True)
where df could be the dataframe corresponding to amount, or the one obtained post merge. In case you want fewer rows (say n) per customer, you could use:
df.groupby('customerId').head(n)
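For instance (my sketch, assuming df2 is the transactions frame), capping it at one row per customer before the merge reproduces the deduplicated result:
# keep only each customer's first transaction, then merge (hypothetical n = 1)
df2_top = df2.groupby('customerId').head(1)
merge = pd.merge(df, df2_top, on='customerId', how='left')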
