Suppose I have 2 pandas data frames, both sharing the same column names, like this:
name:          dob:        role:
James Franco   1-1-1980    Actor
Cameron Diaz   4-2-1976    Actor
Jim Carey      12-1-1968   Actor
Miley Cyrus    5-23-1987   Actor

name:          dob:        role:
50 cent        4-6-1984    Singer
lil baby       12-1-1990   Singer
ghostmane      8-10-1989   Singer
Miley Cyrus    5-23-1987   Singer
And say I wanted to identify individuals who share the same name and dob, and exist in both dataframes (and thus, have 2 different roles).
How can I do this?
Similar to if everything existed in one dataframe and I did a df.groupby(["name", "dob"]).count().
I would like to be able to identify these individuals, print them, and count the number of occurrences.
Thank you
df2 = pd.concat([df, df1])  # append the two dfs (DataFrame.append is deprecated)
dfnew = df2[df2.duplicated(subset=['name:', 'dob:'], keep=False)]  # keep all rows duplicated on the columns you wish to check
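To also print the matches and count occurrences per person, as the question asks, you can group the duplicated rows (a small sketch, reusing the dfnew from above):
print(dfnew)
dfnew.groupby(['name:', 'dob:']).size()  # number of occurrences per (name, dob) pair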
Well, this will give you just the matches:
df1.merge(df2, on=["name:", "dob:"])
Output:
  name:        dob:       role:_x  role:_y
0 Miley Cyrus  5-23-1987  Actor    Singer
You can use an outer join to get all the results and filter them as you see fit:
df1.merge(df2, how="outer", on=["name:", "dob:"])
Output:
  name:         dob:       role:_x  role:_y
0 James Franco  1-1-1980   Actor    NaN
1 Cameron Diaz  4-2-1976   Actor    NaN
2 Jim Carey     12-1-1968  Actor    NaN
3 Miley Cyrus   5-23-1987  Actor    Singer
4 50 cent       4-6-1984   NaN      Singer
5 lil baby      12-1-1990  NaN      Singer
6 ghostmane     8-10-1989  NaN      Singer
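If you also need the count the question asks for, the length of the inner merge is the number of individuals present in both frames (a minimal sketch, reusing df1 and df2 from above):
matches = df1.merge(df2, on=['name:', 'dob:'])
print(matches)
print(len(matches))  # number of individuals appearing in both dataframes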
I am trying to find the number of unique values across a combination of two fields in a data frame; a typical example would be last name and first name.
When I do the following, I just get the number of unique values for each column separately, in this case Last and First, not for the composite.
df[['Last Name','First Name']].nunique()
Thanks!
Group by both columns first, and then count the groups with ngroups:
>>> df.groupby(['First Name', 'Last Name']).ngroups
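For instance, a quick sketch with made-up data containing three distinct name pairs:
>>> import pandas as pd
>>> df = pd.DataFrame({'Last Name': ['Smith', 'Johnson', 'Smith'],
...                    'First Name': ['Bill', 'Bill', 'John']})
>>> df.groupby(['First Name', 'Last Name']).ngroups
3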
IIUC, you could use value_counts() for that:
df[['Last Name','First Name']].value_counts().size
3
For another example, if you start with this extended data frame that contains some dups:
Last Name First Name
0 Smith Bill
1 Johnson Bill
2 Smith John
3 Curtis Tony
4 Taylor Elizabeth
5 Smith Bill
6 Johnson Bill
7 Smith Bill
Then value_counts() gives you the counts by unique composite last-first name:
df[['Last Name','First Name']].value_counts()
Last Name First Name
Smith Bill 3
Johnson Bill 2
Curtis Tony 1
Smith John 1
Taylor Elizabeth 1
Then the length of that frame will give you the number of unique composite last-first names:
df[['Last Name','First Name']].value_counts().size
5
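Equivalently, dropping duplicate name pairs and taking the length gives the same number:
len(df[['Last Name', 'First Name']].drop_duplicates())  # 5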
Sorry for the confusing title. I am practicing manipulating dataframes in Python with pandas. How do I turn this kind of table:
  id  role      name
0 11  ACTOR     Luna Wedler, Jannis Niewöhner, Milan Peschel, ...
1 11  DIRECTOR  Christian Schwochow
2 22  ACTOR     Guy Pearce, Matilda Anna Ingrid Lutz, Travis F...
3 22  DIRECTOR  Andrew Baird
4 33  ACTOR     Glenn Fredly, Marcello Tahitoe, Andien Aisyah, ...
5 33  DIRECTOR  Saron Sakina
Into this kind:
  id  director             actors name
0 11  Christian Schwochow  Luna Wedler, Jannis Niewöhner, Milan Peschel, ...
1 22  Andrew Baird         Guy Pearce, Matilda Anna Ingrid Lutz, Travis F...
2 33  Saron Sakina         Glenn Fredly, Marcello Tahitoe, Andien Aisyah, ...
Try it this way:
df.pivot(index='id', columns='role', values='name')
In addition to @Tejas's answer, you can do:
df = (df.pivot(index='id', columns='role', values='name')
        .reset_index()
        .rename_axis('', axis=1)
        .rename(columns={'ACTOR': 'actors name', 'DIRECTOR': 'director'}))
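For reference, a minimal end-to-end sketch (data recreated from the question, with the cast lists abbreviated):
import pandas as pd

df = pd.DataFrame({
    'id':   [11, 11, 22, 22, 33, 33],
    'role': ['ACTOR', 'DIRECTOR'] * 3,
    'name': ['Luna Wedler, Jannis Niewöhner, ...', 'Christian Schwochow',
             'Guy Pearce, Matilda Anna Ingrid Lutz, ...', 'Andrew Baird',
             'Glenn Fredly, Marcello Tahitoe, ...', 'Saron Sakina'],
})

df = (df.pivot(index='id', columns='role', values='name')
        .reset_index()
        .rename_axis('', axis=1)
        .rename(columns={'ACTOR': 'actors name', 'DIRECTOR': 'director'}))
print(df)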
I've three datasets:
dataset 1
Customer1       Customer2         Exposures + other columns
Nick McKenzie   Christopher Mill  23450
Nick McKenzie   Stephen Green     23450
Johnny Craston  Mary Shane        12
Johnny Craston  Stephen Green     12
Molly John      Casey Step        1000021
dataset2 (unique Customers: Customer 1 + Customer 2)
Customer          Age
Nick McKenzie     53
Johnny Craston    75
Molly John        34
Christopher Mill  63
Stephen Green     65
Mary Shane        54
Casey Step        34
Mick Sale
dataset 3
Customer1  Customer2       Exposures + other columns
Mick Sale  Johnny Craston
Mick Sale  Stephen Green
Exposures refers to Customer 1 only.
There are other columns omitted for brevity. Dataset 2 is built from the unique values of Customer 1 and Customer 2: there are no duplicates in that dataset. Dataset 3 has the same columns as dataset 1.
I'd like to add the information from dataset 1 into dataset 2 to have
Final dataset
Customer          Age  Exposures + other columns
Nick McKenzie     53   23450
Johnny Craston    75   12
Molly John        34   1000021
Christopher Mill  63
Stephen Green     65
Mary Shane        54
Casey Step        34
Mick Sale
The final dataset should have all Customer1 and Customer 2 from both dataset 1 and dataset 3, with no duplicates.
I have tried to combine them as follows
result = pd.concat([df2,df1,df3], axis=1)
but the result is not the one I'd expect.
Something is wrong in my way of concatenating the datasets, and I'd appreciate it if you could tell me what it is.
After concatenating the dataframes df1 and df3 (assuming they have the same columns), we can remove the duplicates using df1.drop_duplicates(subset=['Customer1']) and then join with df2 like this:
df1.set_index('Customer1').join(df2.set_index('Customer'))
In case df1 and df2 have different columns based on the primary key, we can join using the above command and then join again with the age table.
This gives the result: concatenate dataset 1 and dataset 3 (they have the same columns), then run this operation, joining on the respective keys.
Note: though not strictly part of the question, for the concatenation one can use pd.concat([df1, df3], ignore_index=True) (here we ignore the index column).
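Putting the steps together, a minimal sketch (assuming df1, df2, df3 hold datasets 1, 2, 3, and keeping df2 on the left so every customer is retained):
import pandas as pd

exposures = (pd.concat([df1, df3], ignore_index=True)
               .drop_duplicates(subset=['Customer1']))
final = (df2.set_index('Customer')
            .join(exposures.set_index('Customer1'))
            .reset_index())
print(final)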
I would like to use the groupby function based on 3 columns. The first column has the family surname, the second has the name of each individual in that family, and the third has the animal each individual owns. I want to find each person, by name and surname, who has a cat or a dog, and how many cats or dogs that individual has.
My data looks like
Family   SubFamily  Animal
Smith    Karen      Cat
Smith    Karen      Cow
Smith    Karen      Dog
Jackson  Jason      Dog
I tried
merged_family.groupby(["Family","Animal","SubFamily"]).size().loc[:,'Cat'].loc[:,'Dog']
The result I want might look like:
Family  SubFamily  Cat  Dog
Smith   Karen      1    1
or something similar
It did not work. Could you help me?
I think this is a better task for pivot_table:
(df_merged.query("Animal.isin(['Cat', 'Dog'])")
          .pivot_table(columns='Animal', index=['Family', 'SubFamily'], aggfunc='size')
          .fillna(0)
          .reset_index()
          .rename_axis(None, axis=1))
# Family SubFamily Cat Dog
# 0 Jackson Jason 0.0 1.0
# 1 Smith Karen 1.0 1.0
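If you would rather have integer counts than the floats that fillna(0) produces, one small variation is to cast before resetting the index:
(df_merged.query("Animal.isin(['Cat', 'Dog'])")
          .pivot_table(columns='Animal', index=['Family', 'SubFamily'], aggfunc='size')
          .fillna(0)
          .astype(int)
          .reset_index()
          .rename_axis(None, axis=1))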
I have a fairly large dataset that I would like to split into separate excel files based on the names in column A ("Agent" column in the example provided below). I've provided a rough example of what this data-set looks like in Ex1 below.
Using pandas, what is the most efficient way to create a new excel file for each of the names in column A, or the Agent column in this example, preferably with the name found in column A used in the file title?
For example, in the given example, I would like separate files for John Doe, Jane Doe, and Steve Smith containing the information that follows their names (Business Name, Business ID, etc.).
Ex1
Agent        Business Name    Business ID  Revenue
John Doe     Bobs Ice Cream   12234        $400
John Doe     Car Repair       445848       $2331
John Doe     Corner Store     243123       $213
John Doe     Cool Taco Stand  2141244      $8912
Jane Doe     Fresh Ice Cream  9271499      $2143
Jane Doe     Breezy Air       0123801      $3412
Steve Smith  Big Golf Range   12938192     $9912
Steve Smith  Iron Gyms        1231233      $4133
Steve Smith  Tims Tires       82489233     $781
I believe python / pandas would be an efficient tool for this, but I'm still fairly new to pandas, so I'm having trouble getting started.
I would loop over the groups of names, then save each group to its own Excel file:
s = df.groupby('Agent')
for name, group in s:
    group.to_excel(f"{name}.xlsx")
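Note that to_excel needs an Excel writer engine installed for .xlsx files, such as openpyxl or xlsxwriter.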
Use a list comprehension with groupby on the Agent column:
dfs = [d for _, d in df.groupby('Agent')]
for df in dfs:
    print(df, '\n')
Output
  Agent     Business Name    Business ID  Revenue
4 Jane Doe  Fresh Ice Cream  9271499      $2143
5 Jane Doe  Breezy Air       123801       $3412

  Agent     Business Name    Business ID  Revenue
0 John Doe  Bobs Ice Cream   12234        $400
1 John Doe  Car Repair       445848       $2331
2 John Doe  Corner Store     243123       $213
3 John Doe  Cool Taco Stand  2141244      $8912

  Agent        Business Name   Business ID  Revenue
6 Steve Smith  Big Golf Range  12938192     $9912
7 Steve Smith  Iron Gyms       1231233      $4133
8 Steve Smith  Tims Tires      82489233     $781
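To then write each piece to its own workbook, as the question asks, you could follow up with something like this (a sketch, taking the file name from the Agent value in each chunk):
for d in dfs:
    d.to_excel(f"{d['Agent'].iloc[0]}.xlsx", index=False)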
Grouping is what you are looking for here. You can iterate over the groups, which gives you the grouping attributes and the data associated with that group. In your case, the Agent name and the associated business columns.
Code:
import pandas as pd
# make up some data
ex1 = pd.DataFrame([['A',1],['A',2],['B',3],['B',4]], columns = ['letter','number'])
# iterate over the grouped data and export the data frames to excel workbooks
for group_name, data in ex1.groupby('letter'):
    # you probably have more complicated naming logic
    # use index=False if you have not set an index on the dataframe to avoid an extra column of indices
    data.to_excel(group_name + '.xlsx', index=False)
Use the unique values in the column to subset the data and write it to csv using the name:
import pandas as pd
for unique_val in df['Agent'].unique():
    df[df['Agent'] == unique_val].to_csv(f"{unique_val}.csv")
If you need Excel:
import pandas as pd
for unique_val in df['Agent'].unique():
    df[df['Agent'] == unique_val].to_excel(f"{unique_val}.xlsx")