Combining three datasets removing duplicates

Combining three datasets removing duplicates - python

I've three datasets:
dataset 1
Customer1 Customer2 Exposures + other columns
Nick McKenzie Christopher Mill 23450
Nick McKenzie Stephen Green 23450
Johnny Craston Mary Shane 12
Johnny Craston Stephen Green 12
Molly John Casey Step 1000021
dataset2 (unique Customers: Customer 1 + Customer 2)
Customer Age
Nick McKenzie 53
Johnny Craston 75
Molly John 34
Christopher Mill 63
Stephen Green 65
Mary Shane 54
Casey Step 34
Mick Sale
dataset 3
Customer1 Customer2 Exposures + other columns
Mick Sale Johnny Craston
Mick Sale Stephen Green
Exposures refers to Customer 1 only.
There are other columns omitted for brevity. Dataset 2 is built by getting unique customer 1 and unique customer 2: no duplicates are in that dataset. Dataset 3 has the same column of dataset 1.
I'd like to add the information from dataset 1 into dataset 2 to have
Final dataset
Customer Age Exposures + other columns
Nick McKenzie 53 23450
Johnny Craston 75 12
Molly John 34 1000021
Christopher Mill 63
Stephen Green 65
Mary Shane 54
Casey Step 34
Mick Sale
The final dataset should have all Customer1 and Customer 2 from both dataset 1 and dataset 3, with no duplicates.
I have tried to combine them as follows
result = pd.concat([df2,df1,df3], axis=1)
but the result is not that one I'd expect.
Something wrong is in my way of concatenating the datasets and I'd appreciate it if you can let me know what is wrong.

After concatenating the dataframe df1 and df2 (assuming they have same columns), we can remove the duplicates using df1.drop_duplicates(subset=['customer1']) and then we can join with df2 like this
df1.set_index('Customer1').join(df2.set_index('Customer'))
In case df1 and df2 has different columns based on the primary key we can join using the above command and then again join with the age table.
This would give the result. You can concatenate dataset 1 and datatset 3 because they have same columns. And then run this operation to get the desired result. I am joining specifying the respective keys.
Note: Though not related to the question but for the concatenation one can use this code pd.concat([df1, df3],ignore_index=True) (Here we are ignoring the index column)

Related

Python pandas merge map with multiple values xlookup

I have a dataframe of actor names:
df1
actor_id actor_name
1 Brad Pitt
2 Nicole Kidman
3 Matthew Goode
4 Uma Thurman
5 Ethan Hawke
And another dataframe of movies that the actors were in:
df2
actor_id actor_movie movie_revenue_m
1 Once Upon a Time in Hollywood 150
2 The Others 50
2 Moulin Rouge 200
3 Stoker 75
4 Kill Bill 125
5 Gattaca 85
I want to merge the two dataframes together to show the actors with their movie names and movie revenues, so I use the merge function:
df3 = df1.merge(df2, on = 'actor_id', how = 'left')
df3
actor_id actor_name actor_movie movie_revenue
1 Brad Pitt Once Upon a Time in Hollywood 150
2 Nicole Kidman Moulin Rouge 50
2 Nicole Kidman The Others 200
3 Matthew Goode Stoker 75
4 Uma Thurman Kill Bill 125
5 Ethan Hawke Gattaca 85
But this pulls in all movies, so Nicole Kidman gets duplicated, and I only want to show one movie per actor. How can I merge the dataframes without "duplicating" my list of actors?
How would I merge the movie title that is alphabetically first?
How would I merge the movie title with the highest revenue?
Thank you!

One way is to continue with the merge and then filter the result set
movie title that is alphabetically first
# sort by name, movie and then pick the first while grouping by actor
df.sort_values(['actor_name','actor_movie'] ).groupby('actor_id', as_index=False).first()
actor_id actor_name actor_movie movie_revenue
0 1 Brad Pitt Once Upon a Time in Hollywood 150
1 2 Nicole Kidman Moulin Rouge 50
2 3 Matthew Goode Stoker 75
3 4 Uma Thurman Kill Bill 125
4 5 Ethan Hawke Gattaca 85
movie title with the highest revenue
# sort by name, and review (descending), groupby actor and pick first
df.sort_values(['actor_name','movie_revenue'], ascending=[1,0] ).groupby('actor_id', as_index=False).first()
actor_id actor_name actor_movie movie_revenue
0 1 Brad Pitt Once Upon a Time in Hollywood 150
1 2 Nicole Kidman The Others 200
2 3 Matthew Goode Stoker 75
3 4 Uma Thurman Kill Bill 125
4 5 Ethan Hawke Gattaca 85

How do you create new column from two distinct categorical column values in a dataframe by same column ID in pandas?

Sorry for the confusing title. I am practicing how to manipulate dataframes in Python through pandas. How do I make this kind of table:
id role name
0 11 ACTOR Luna Wedler, Jannis Niewöhner, Milan Peschel, ...
1 11 DIRECTOR Christian Schwochow
2 22 ACTOR Guy Pearce, Matilda Anna Ingrid Lutz, Travis F...
3 22 DIRECTOR Andrew Baird
4 33 ACTOR Glenn Fredly, Marcello Tahitoe, Andien Aisyah,...
5 33 DIRECTOR Saron Sakina
Into this kind:
id director actors name
0 11 Christian Schwochow Luna Wedler, Jannis Niewöhner, Milan Peschel, ...
1 22 Andrew Baird Guy Pearce, Matilda Anna Ingrid Lutz, Travis F...d
2 33 Saron Sakina Glenn Fredly, Marcello Tahitoe, Andien Aisyah,...

Try this way
df.pivot(index='id', columns='role', values='name')

You can do in addition to #Tejas's answer:
df = (df.pivot(index='id', columns='role', values='name').
reset_index().
rename_axis('',axis=1).
rename(columns={'ACTOR':'actors name','DIRECTOR':'director'}))

Splitting DataFrame and maintaining DataFrame group integrity

To whom it may concern,
I have a very large dataframe (MasterDataFrame) that contains ~180K groups that I would like to split into 5 smaller DataFrames and process each smaller DataFrame separately. Does anyone know of any way that I could achieve this split into 5 smaller DataFrames without accidentally splitting/jeopardizing the integrity of any of the groups from the MasterDataFrame? In other words, I would like for the 5 smaller DataFrames to not have overlapping groups.
Thanks in advance,
Christos
This is what my dataset looks like:
|======MasterDataset======|
Name Age Employer
Tom 12 Walmart
Nick 15 Disney
Chris 18 Walmart
Darren 19 KMart
Nate 43 ESPN
Harry 23 Walmart
Uriel 24 KMart
Matt 23 Disney
. . .
. . .
. . .
I need to be able to split my dataset such that the groups shown in the MasterDataset above are preserved. The smaller groups into which my MasterDataset will be split need to look like this:
|======SubDataset1======|
Name Age Employer
Tom 12 Walmart
Chris 18 Walmart
Harry 23 Walmart
Darren 19 KMart
Uriel 24 KMart
|======SubDataset2======|
Name Age Employer
Nick 15 Disney
Matt 23 Disney

I assume that you mean the number of lines with "groups"
For that .iloc should be perfect.
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.iloc.html
df_1 = df.iloc[0:100000,:]
df_2 = df.iloc[100001:200000,:]
....

How to run hypothesis test with pandas data frame and specific conditions?

I am trying to run a hypothesis test using model ols. I am trying to do this model Ols for tweet count based on four groups that I have in my data frame. The four groups are Athletes, CEOs, Politicians, and Celebrities. I have the four groups each labeled for each name in one column as a group.
frames = [CEO_df, athletes_df, Celebrity_df, politicians_df]
final_df = pd.concat(frames)
final_df=final_df.reindex(columns=["name","group","tweet_count","retweet_count","favorite_count"])
final_df
model=ols("tweet_count ~ C(group)", data=final_df).fit()
table=sm.stats.anova_lm(model, typ=2)
print(table)
I want to do something along the lines of:
model=ols("tweet_count ~ C(Athlete) + C(Celebrity) + C(CEO) + C(Politicians)", data=final_df).fit()
table=sm.stats.anova_lm(model, typ=2)
print(table)
Is that even possible? How else will I be able to run a hypothesis test with those conditions?
Here is my printed final_df:
name group tweet_count retweet_count favorite_count
0 #aws_cloud # #ReInvent R “Ray” Wang 王瑞光 #1A CEO 6 6 0
1 Aaron Levie CEO 48 1140 18624
2 Andrew Mason CEO 24 0 0
3 Bill Gates CEO 114 78204 439020
4 Bill Gross CEO 36 486 1668
... ... ... ... ... ...
56 Tim Kaine Politician 48 8346 50898
57 Tim O'Reilly Politician 14 28 0
58 Trey Gowdy Politician 12 1314 6780
59 Vice President Mike Pence Politician 84 1146408 0
60 klay thompson Politician 48 41676 309924

How to perform groupby and mean on categorical columns in Pandas

I'm working on a dataset called gradedata.csv in Python Pandas where I've created a new binned column called 'Status' as 'Pass' if grade > 70 and 'Fail' if grade <= 70. Here is the listing of first five rows of the dataset:
fname lname gender age exercise hours grade \
0 Marcia Pugh female 17 3 10 82.4
1 Kadeem Morrison male 18 4 4 78.2
2 Nash Powell male 18 5 9 79.3
3 Noelani Wagner female 14 2 7 83.2
4 Noelani Cherry female 18 4 15 87.4
address status
0 9253 Richardson Road, Matawan, NJ 07747 Pass
1 33 Spring Dr., Taunton, MA 02780 Pass
2 41 Hill Avenue, Mentor, OH 44060 Pass
3 8839 Marshall St., Miami, FL 33125 Pass
4 8304 Charles Rd., Lewis Center, OH 43035 Pass
Now, how do i compute the mean hours of exercise of female students with a 'status' of passing...?
I've used the below code, but it isn't working.
print(df.groupby('gender', 'status')['exercise'].mean())
I'm new to Pandas. Anyone please help me in solving this.

You are very close. Note that your groupby key must be one of mapping, function, label, or list of labels. In this case, you want a list of labels. For example:
res = df.groupby(['gender', 'status'])['exercise'].mean()
You can then extract your desired result via pd.Series.get:
query = res.get(('female', 'Pass'))

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Combining three datasets removing duplicates - python

Related

Python pandas merge map with multiple values xlookup

How do you create new column from two distinct categorical column values in a dataframe by same column ID in pandas?

Splitting DataFrame and maintaining DataFrame group integrity

How to run hypothesis test with pandas data frame and specific conditions?

How to perform groupby and mean on categorical columns in Pandas

Categories

Resources