Pandas merge on a customerEmail column that contains duplicates [duplicate] - python

This question already has answers here:
Pandas Merging 101
(8 answers)
Closed 2 years ago.
The aim is to detect fraud in this dataset.
I have two dataframes with columns as:
DF1[customerEmail, customerphone, customerdevice,customeripadd,NoOftransactions,Fraud] etc (168,11)
DF2[customerEmail,transactionid, payment methods,orderstatus] etc (623,11)
The customerEmail column is common in both the dataframes so it makes sense to merge tables on customerEmail.
The problem is that DF2 has repeated customerEmail values with no counterpart in DF1. So when I merge using:
DF3 = pd.merge(DF1, DF2, on='customerEmail')
the result has shape (819, 18), and the repeated email IDs carry misleading data.
I want the merge to match on customerEmail from DF1, so my final dataframe DF3 should have roughly as many rows as DF1.
Here's a link to the data for you to look at. Cheers
https://www.kaggle.com/aryanrastogi7767/ecommerce-fraud-data

Try changing the how parameter to 'left'.
For example:
DF3 = DF1.merge(DF2, how='left', on='customerEmail')
Failing this, we probably need some more information.
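A minimal sketch of what how='left' does when DF2 has repeated emails, using made-up toy data rather than the Kaggle file. The column transactionId and the aggregated column n_transactions are assumptions for illustration only. A left merge keeps every DF1 email and drops DF2 emails that are absent from DF1, but repeated emails in DF2 still produce one output row per match; collapsing DF2 to one row per email first is one way to keep DF3 the same length as DF1.

```python
import pandas as pd

# Toy stand-ins: DF1 has one row per customer, DF2 one row per transaction.
DF1 = pd.DataFrame({"customerEmail": ["a@x.com", "b@x.com"],
                    "Fraud": [False, True]})
DF2 = pd.DataFrame({"customerEmail": ["a@x.com", "a@x.com", "c@x.com"],
                    "transactionId": ["t1", "t2", "t3"]})

# Left merge: every DF1 row survives, c@x.com (not in DF1) is dropped,
# but a@x.com appears twice because it matches two DF2 rows.
left = DF1.merge(DF2, how="left", on="customerEmail")

# Collapse DF2 to one row per email, then merge. validate="one_to_one"
# raises if either side still has duplicate keys.
per_customer = (DF2.groupby("customerEmail", as_index=False)
                   .agg(n_transactions=("transactionId", "count")))
DF3 = DF1.merge(per_customer, how="left", on="customerEmail",
                validate="one_to_one")
```

Which aggregation is appropriate (count, last order status, and so on) depends on what the fraud model needs; the count here is only a placeholder.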

Maybe you should consider a different value for the "how" option. The default is "inner", which drops every row that has no match in the other dataframe.
The "right" option might help: DF2 then becomes the reference and DF1 is joined onto it.

Related

Combining two dataframes with different rows, keeping contents of first dataframe on all rows

Good day All,
I have two dataframes that need to be merged. My case is a little different from the ones I have found so far, and I could not get it working. I am fairly sure that what I am currently getting is down to the index, as dataframe 1 has only one record. I need to copy the contents of dataframe 1 into new columns of dataframe 2, for all rows.
Current problem highlighted in red
I have tried merge, append, reset index etc...
DF 1:
Dataframe 1
DF 2:
Dataframe 2
Output Requirement:
Required Output
Any suggestions would be highly appreciated
Update:
I got it to work using the below statements, is there a more dynamic way than specifying the column names?
mod_df['Type'] = mod_df['Type'].fillna(method="ffill")
mod_df['Date'] = mod_df['Date'].fillna(method="ffill")
mod_df['Version'] = mod_df['Version'].fillna(method="ffill")
Assuming you have a single row in df1, use a cross merge:
out = df2.merge(df1, how='cross')
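As a rough illustration of the cross merge above, here is a sketch with invented data; the column names Item and Value are assumptions, while Type, Date and Version come from the question. It also shows an alternative that broadcasts the single row with assign, which does not need how='cross' (available from pandas 1.2).

```python
import pandas as pd

# df1 has exactly one row; df2 has several rows (toy data).
df1 = pd.DataFrame({"Type": ["A"], "Date": ["2022-01-01"], "Version": [3]})
df2 = pd.DataFrame({"Item": ["x", "y", "z"], "Value": [1, 2, 3]})

# Cross merge: each df2 row is paired with the single df1 row.
out = df2.merge(df1, how="cross")

# Alternative: turn the single row into {column: value} and broadcast it.
alt = df2.assign(**df1.iloc[0].to_dict())
```

Neither variant needs the column names spelled out, so they keep working if df1 gains or loses columns.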

Looking for a way to do in pandas version (version<1.2) merge, how=cross [duplicate]

This question already has answers here:
Performant cartesian product (CROSS JOIN) with pandas
(5 answers)
Closed 4 years ago.
I can't find anything about a cross join in merge/join or anywhere else.
I need to process two dataframes with my own function, myfunc.
the equivalent of :
{
for itemA in df1.iterrows():
    for itemB in df2.iterrows():
        t["A"] = myfunc(itemA[1]["A"], itemB[1]["A"])
}
the equivalent of :
{
select myfunc(df1.A,df2.A),df1.A,df2.A from df1,df2;
}
But I need a more efficient solution.
If apply is the way to go, how would I implement it? Thanks ^^
Create a common 'key' to cross join the two:
df1['key'] = 0
df2['key'] = 0
df1.merge(df2, on='key', how='outer')
For the cross product, see this question.
Essentially, you have to do a normal merge but give every row the same key to join on, so that every row is joined to each other across the frames.
You can then add a column to the new frame by applying your function:
new_df = pd.merge(df1, df2, on='key')
new_df['new_col'] = new_df.apply(lambda row: myfunc(row['A_x'], row['A_y']), axis=1)
axis=1 forces .apply to work across the rows. 'A_x' and 'A_y' will be the default column names in the resulting frame if the merged frames share a column like in your example above.
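Putting the two answers together, a self-contained sketch for pandas versions without how='cross' might look like the following. The body of myfunc is a hypothetical placeholder (simple addition), and the toy frames are invented.

```python
import pandas as pd

def myfunc(a, b):
    # Hypothetical stand-in for the asker's function.
    return a + b

df1 = pd.DataFrame({"A": [1, 2]})
df2 = pd.DataFrame({"A": [10, 20, 30]})

# Give every row the same temporary key, merge on it, then drop the key.
# assign() returns copies, so df1 and df2 are left unmodified.
crossed = (df1.assign(key=0)
              .merge(df2.assign(key=0), on="key")
              .drop(columns="key"))

# Row-wise apply works for any Python function but is slow on large frames.
crossed["result"] = crossed.apply(
    lambda row: myfunc(row["A_x"], row["A_y"]), axis=1)

# If myfunc is plain arithmetic, operating on whole columns is much faster.
crossed["result_vec"] = crossed["A_x"] + crossed["A_y"]
```

The cross product has len(df1) * len(df2) rows, so memory use grows quickly with the input sizes.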

changing row values in a dataframe by looking into another dataframe [duplicate]

This question already has answers here:
Pandas Merging 101
(8 answers)
Closed last year.
I have a lookup table as a dataframe (1000 rows) consisting of codes and labels. I have another dataframe (200,000 rows) consisting of codes and geometries.
I need to get label names for each corresponding code by looking in the look up dataframe.
Output should be dataframe.
I tried it as follows.
df = pd.read_csv(filepath)
codes = df['codes'].values
labels = df['labels'].values
df2 = pd.read_csv(filepath)
print (df2.shape)
for ix in df2.index:
    code = df2.loc[ix, 'code']
    df2.loc[ix, 'label'] = labels[codes==code][0]
print (df2)
The result is correct, but looping over the rows is very slow.
Can you help me?
You should use the merge method of DataFrames (https://pandas.pydata.org/docs/reference/api/pandas.merge.html). It lets you join two dataframes on a common column. Your code should look like this:
df2 = df2.merge(df, left_on="code", right_on="codes", how="left")
# Check labels using df2["labels"]
The key columns are specified with the parameters left_on and right_on, since the column is named differently in each dataframe. The parameter how='left' means all the rows from df2 are preserved, even if a row's code has no match in the lookup table.
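For a plain code-to-label lookup, Series.map is a common alternative to merge. The sketch below uses invented codes and labels, and assumes the codes in the lookup table are unique.

```python
import pandas as pd

# Toy lookup table and data frame (values are made up).
lookup = pd.DataFrame({"codes": [1, 2, 3],
                       "labels": ["road", "river", "forest"]})
df2 = pd.DataFrame({"code": [2, 2, 3, 9]})

# Build a Series indexed by code, then map every code to its label.
# Codes missing from the lookup (9 here) become NaN instead of raising.
code_to_label = lookup.set_index("codes")["labels"]
df2["label"] = df2["code"].map(code_to_label)
```

Unlike the merge, this adds only the label column and leaves no extra codes column behind.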

How to join two dataframes in Pandas? [duplicate]

This question already has answers here:
Pandas: how to merge two dataframes on a column by keeping the information of the first one?
(4 answers)
Closed 3 years ago.
I have two dataframes.
The first dataframe is A.
And the second dataframe is B.
Both dataframes have an AdId field. The first dataframe has a unique AdId per row, but the second dataframe has multiple rows for a single AdId. I want to bring all the information for each AdId into the second dataframe.
I am expecting the output as follows
I have tried the following code
B.join(A, on='AdId', how='left', lsuffix='_caller')
But this does not give the expected output.
Use pandas concat:
result = pd.concat([df1, df4], axis=1, sort=False)
More on merging with pandas: https://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html#set-logic-on-the-other-axes
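If the goal is to attach A's columns to every matching row of B, a left merge on AdId is another option worth trying. This is a sketch with invented columns (Campaign, Clicks), not the asker's data. It also shows why the original join call may misbehave: DataFrame.join with on= matches B's column against A's index, so A needs AdId as its index first.

```python
import pandas as pd

# A has one row per AdId; B has several rows per AdId (toy data).
A = pd.DataFrame({"AdId": [1, 2], "Campaign": ["spring", "summer"]})
B = pd.DataFrame({"AdId": [1, 1, 2], "Clicks": [5, 7, 3]})

# Left merge keeps every row of B and copies A's columns onto each match.
merged = B.merge(A, on="AdId", how="left")

# Equivalent join: A must be indexed by AdId for on="AdId" to line up.
joined = B.join(A.set_index("AdId"), on="AdId")
```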

how to remove duplicates when using pandas concat to combine two dataframe [duplicate]

This question already has answers here:
Pandas Merging 101
(8 answers)
Closed 3 years ago.
I have two dataframes.
df1 with columns: id,x1,x2,x3,x4,....xn
df2 with columns: id,y.
df3 =pd.concat([df1,df2],axis=1)
When I use pandas concat to combine them, the columns become
id,y,id,x1,x2,x3...xn.
There are two id columns here. How can I get rid of one?
I have tried:
df3 = pd.concat([df1, df2], axis=1).drop_duplicates().reset_index(drop=True)
but it does not work.
DataFrames are concatenated on the index. Make sure that id is the index before concatenating:
df3 = pd.concat([df1.set_index('id'),
                 df2.set_index('id')], axis=1).reset_index()
Or, better yet, use join:
df3 = df1.join(df2.set_index('id'), on='id')
drop_duplicates() only removes rows that are completely identical.
What you're looking for is pd.merge():
pd.merge(df1, df2, on='id')
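To make the difference concrete, here is a small sketch with invented values. concat with axis=1 aligns on the row index and keeps both id columns, whereas merge aligns on the id values and keeps a single id column.

```python
import pandas as pd

# Same ids in both frames, but in a different row order (toy data).
df1 = pd.DataFrame({"id": [1, 2, 3], "x1": [0.1, 0.2, 0.3]})
df2 = pd.DataFrame({"id": [3, 1, 2], "y": [30, 10, 20]})

# Side-by-side concat: rows are paired by position, id appears twice.
side_by_side = pd.concat([df1, df2], axis=1)

# Merge: rows are paired by id value, id appears once.
merged = pd.merge(df1, df2, on="id")
```

Note that in side_by_side the first row pairs id 1 with y = 30, which belongs to id 3, so dropping one of the id columns afterwards would silently misalign the data.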
