I have the following dataframe:
Name Activities
Eric Soccer,Baseball,Swimming
Natasha Soccer
Mike Basketball,Baseball
I need to transform it into the following dataframe:
Activities Name
Soccer Eric,Natasha
Swimming Eric
Baseball Eric,Mike
Basketball Mike
How should I do it?

Using pd.get_dummies
First, use get_dummies:
tmp = df.set_index('Name').Activities.str.get_dummies(sep=',')
Now using stack and agg:
tmp.mask(tmp.eq(0)).stack().reset_index('Name').groupby(level=0).Name.agg(', '.join)
Baseball Eric, Mike
Basketball Mike
Soccer Eric, Natasha
Swimming Eric
Using str.split and melt
(df.set_index('Name').Activities.str.split(',', expand=True)
.reset_index().melt(id_vars='Name').groupby('value').Name.agg(', '.join))
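For anyone who wants to run the split-and-melt chain end to end, here is a minimal sketch. It rebuilds the sample frame from the question; the intermediate names wide and long are only illustrative.

```python
import pandas as pd

# Sample frame rebuilt from the question.
df = pd.DataFrame({
    "Name": ["Eric", "Natasha", "Mike"],
    "Activities": ["Soccer,Baseball,Swimming", "Soccer", "Basketball,Baseball"],
})

# One column per activity position (0, 1, 2), indexed by Name.
wide = df.set_index("Name")["Activities"].str.split(",", expand=True)

# Wide to long: one row per (Name, activity); missing slots become NaN/None.
long = wide.reset_index().melt(id_vars="Name")

# groupby drops the missing keys, then the names of each activity are joined.
result = long.groupby("value")["Name"].agg(", ".join)
print(result)
```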

You can separate the Activities by performing a split and then converting the resulting list to a Series.
Then melt from wide to long format, and groupby the resulting value column (which is Activities).
In your grouped data frame, join the Name fields associated with each Activity.
Like this:
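(df.Activities.str.split(',').apply(pd.Series)
 .merge(df, right_index=True, left_index=True)
 .melt(id_vars="Name", value_vars=[0, 1, 2])
 .groupby("value").agg({'Name': lambda x: ','.join(x)})
 .reset_index().rename(columns={"value": "Activities"}))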
Activities Name
0 Baseball Eric,Mike
1 Basketball Mike
2 Soccer Eric,Natasha
3 Swimming Eric
Note: The reset_index() and rename() methods at the end of the chain are just cosmetic; the main operations are complete after the groupby aggregation.
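As an alternative to melt (not what the answers above use), DataFrame.explode can turn the split lists into one row per activity directly. This is a sketch against the question's sample data and assumes a pandas version that provides explode (0.25 or later).

```python
import pandas as pd

# Sample frame rebuilt from the question.
df = pd.DataFrame({
    "Name": ["Eric", "Natasha", "Mike"],
    "Activities": ["Soccer,Baseball,Swimming", "Soccer", "Basketball,Baseball"],
})

result = (
    df.assign(Activities=df["Activities"].str.split(","))  # strings -> lists
      .explode("Activities")                               # one row per (Name, activity)
      .groupby("Activities")["Name"].agg(",".join)         # join names per activity
      .reset_index()
)
print(result)
```

This avoids hard-coding value_vars=[0, 1, 2], so it does not depend on the maximum number of activities per person.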


How to collapse all rows in pandas dataframe across all columns

I am trying to collapse all the rows of a dataframe into one single row across all columns.
My data frame looks like the following:
I am trying to get the following output:
bob jack
business dentist
I am trying to group across all columns, and I do not care if the value column is converted to dtype object (string).
I'm just trying to collapse all the rows across all columns.
I've tried groupby(index=0) but did not get good results.
You could apply join:
out = df.apply(lambda x: ' '.join(x.dropna().astype(str))).to_frame().T
name job value
0 bob jack business dentist 100.0
Try this:
new_df = df.agg(lambda x: x.dropna().astype(str).tolist()).str.join(' ').to_frame().T
>>> new_df
name job value
0 bob jack business dentist 100.0
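The input frame is not shown in the question, so the sketch below uses assumed sample data (values scattered across rows with NaN gaps) chosen to reproduce the output above. It only illustrates the apply/join approach; the real data may look different.

```python
import numpy as np
import pandas as pd

# Assumed input: the question's original frame is not shown.
df = pd.DataFrame({
    "name": ["bob", np.nan, "jack", np.nan],
    "job": [np.nan, "business", np.nan, "dentist"],
    "value": [100.0, np.nan, np.nan, np.nan],
})

# Per column: drop missing values, cast to str, join with spaces.
# apply returns a Series keyed by column name; to_frame().T makes it one row.
out = df.apply(lambda col: " ".join(col.dropna().astype(str))).to_frame().T
print(out)
```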

Compare two data-frames with different column names and update first data-frame with the column from second data-frame

I am working on two data-frames which have different column names and dimensions.
The first data-frame, "df1", contains a single column, "name", holding the names that need to be located in the second data-frame. If a name is matched, the value from the first column of df2, df2[0], needs to be returned and added to result_df.
The second data-frame, "df2", has multiple columns and no header. It contains all the possible diminutive names and full names. Any of its columns can contain the "name" that needs to be matched.
Goal: Locate each name from "df1" in "df2" and, if it is matched, return the value from the first column of df2 and add it to the respective row of df1.
The code I have written so far raises an error. It also needs to be efficient, as it will check millions of entries in df1 against df2:
result_df = process_name(df1, df2)
def process_name(df1, df2):
    for elem in df2.values:
        if elem in df1['name']:
            df1["matched_name"] = df2[0]
Try the concat(), merge(), drop(), rename() and reset_index() methods:
df = (pd.concat((df1.merge(df2, left_on='name', right_on=x) for x in df2.columns))
      .drop(columns=df2.columns[1:]).rename(columns={0: 'matched_name'}).reset_index(drop=True))
Output of df:
name matched_name
0 robert robert
1 ab abram
2 alex alexander
3 bill william
4 bob robert
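Since the question mentions millions of rows, one possible alternative to merging once per column is to flatten df2 into a single lookup Series and use Series.map. This is only a sketch: the contents of df2 below are hypothetical (the question does not show them) and were picked to reproduce the output above.

```python
import pandas as pd

df1 = pd.DataFrame({"name": ["robert", "ab", "alex", "bill", "bob"]})

# Hypothetical df2: first column is the full name, the rest are diminutives.
df2 = pd.DataFrame([
    ["robert", "bob", "rob"],
    ["abram", "ab", "abe"],
    ["alexander", "alex", "al"],
    ["william", "bill", "will"],
])

# Flatten df2 row by row and pair every cell with its row's first-column value.
canonical = df2[0].to_numpy().repeat(df2.shape[1])
variants = df2.to_numpy().ravel()
lookup = pd.Series(canonical, index=variants)

# Drop empty cells and keep the first occurrence of each variant.
lookup = lookup[lookup.index.notna() & ~lookup.index.duplicated()]

# A single vectorized lookup; unmatched names become NaN.
df1["matched_name"] = df1["name"].map(lookup)
print(df1)
```

The lookup is built once, so the cost of matching grows with the length of df1 rather than with the number of columns in df2.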

When merging DataFrames on a common column like ID (primary key), how do you handle data that appears more than once for a single ID in the second df?

So I have two dfs.
Superhero ID Superhero City
212121 Spiderman New york
364331 Ironman New york
678523 Batman Gotham
432432 Dr Strange New york
665544 Thor Asgard
123456 Superman Metropolis
555555 Nightwing Gotham
666666 Loki Asgard
SID Mission End date
665544 10/10/2020
665544 03/03/2021
212121 02/02/2021
665544 05/12/2020
212121 15/07/2021
123456 03/06/2021
666666 12/10/2021
I need to create a new df that summarizes how many heroes are in each city and in which quarter will their missions be complete. I'll be able to match the superhero (and their city) in df1 to the mission end date via their Superhero ID or SID in Df2 ('Superhero Id'=='SID'). Superhero IDs appear only once in Df1 but can appear multiple times in DF2.
Ultimately I need a count for the total no. of heroes in the different cities (which I can do - see below) as well as how many heroes will be free per quarter.
These are the thresholds for the quarters
Quarter 1 – Apr, May, Jun
Quarter 2 – Jul, Aug, Sept
Quarter 3 – Oct, Nov, Dec
Quarter 4 – Jan, Feb, Mar
The following code tells me how many heroes are in each city:
df_Count = pd.DataFrame(df1.City.value_counts().reset_index())
Which produces:
City Count
New york 3
Gotham 2
Asgard 2
Metropolis 1
I can also convert the dates into datetime format via the following operation:
#Convert to datetime series
df2['Mission End date'] = pd.to_datetime(df2['Mission End date'])
Ultimately I need a new df that looks like this
City Total Count No. of heroes free in Q3 No. of heroes free in Q4 Free in Q1 2021+
New york 3 2 0 1
Gotham 2 2 2 0
Asgard 2 1 2 0
Metropolis 1 0 0 1
If anyone can help me create the appropriate quarters and sort them into the appropriate columns, I'd be extremely grateful. I'd also like a way to handle heroes having multiple mission end dates. I can't ignore them; I still need to count them. I suspect I'll need to create a custom function which I can then apply to each row via the apply() method and a lambda expression. This issue has been a pain for a while now, so I'd appreciate all the help I can get. Thank you very much :)
After merging your dataframes with
df = df1.merge(df2, left_on='Superhero ID', right_on='SID')
and converting your date column to datetime format (the dates in the question are day-first)
df = df.assign(mission_end_date=lambda x: pd.to_datetime(x['Mission End date'], dayfirst=True))
you can create two columns: one to extract the quarter and one to extract the year of the newly created datetime column
df = (df.assign(quarter_end_date=lambda x: x.mission_end_date.dt.quarter)
        .assign(year_end_date=lambda x: x.mission_end_date.dt.year))
and combine them into a column that shows the quarter in the format Qx, yyyy
df = df.assign(quarter_year_end=lambda x: 'Q' + x.quarter_end_date.astype(str) + ', ' + x.year_end_date.astype(str))
Finally, group by city and quarter, count the distinct superheroes, and pivot the dataframe to get your desired result. Note that dt.quarter returns calendar quarters (Q1 is Jan to Mar), which differ from the April-based quarters in the question.
(df.groupby(['City', 'quarter_year_end'])['Superhero'].nunique().reset_index()
   .pivot(index='City', columns='quarter_year_end', values='Superhero'))
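The question defines quarters starting in April, so here is a hedged sketch that derives those quarters arithmetically from the month instead of using dt.quarter. It assumes the dates are day-first (dd/mm/yyyy), and the FYyyyy-Qn labels and the column name period are illustrative choices, not something from the question. A hero with several missions is counted once per quarter in which they have at least one mission ending.

```python
import pandas as pd

df1 = pd.DataFrame({
    "Superhero ID": [212121, 364331, 678523, 432432, 665544, 123456, 555555, 666666],
    "Superhero": ["Spiderman", "Ironman", "Batman", "Dr Strange",
                  "Thor", "Superman", "Nightwing", "Loki"],
    "City": ["New york", "New york", "Gotham", "New york",
             "Asgard", "Metropolis", "Gotham", "Asgard"],
})
df2 = pd.DataFrame({
    "SID": [665544, 665544, 212121, 665544, 212121, 123456, 666666],
    "Mission End date": ["10/10/2020", "03/03/2021", "02/02/2021", "05/12/2020",
                         "15/07/2021", "03/06/2021", "12/10/2021"],
})

merged = df1.merge(df2, left_on="Superhero ID", right_on="SID")
end = pd.to_datetime(merged["Mission End date"], format="%d/%m/%Y")  # assumed day-first

# April-based quarters: Apr-Jun -> 1, Jul-Sep -> 2, Oct-Dec -> 3, Jan-Mar -> 4.
fiscal_quarter = ((end.dt.month - 4) % 12) // 3 + 1
# Jan-Mar belongs to the fiscal year that started the previous April.
fiscal_year = end.dt.year.where(end.dt.month >= 4, end.dt.year - 1)
merged["period"] = "FY" + fiscal_year.astype(str) + "-Q" + fiscal_quarter.astype(str)

# Distinct heroes per city and period, one column per period.
free = merged.groupby(["City", "period"])["Superhero"].nunique().unstack(fill_value=0)

# Attach the per-city totals; cities without missions get zeros.
totals = df1["City"].value_counts().rename("Total Count")
summary = totals.to_frame().join(free).fillna(0).astype(int)
print(summary)
```

Columns can then be renamed or combined (for example, everything from a given period onward) to match the exact layout wanted in the question.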

pyspark using agg to concat string after groupBy

In pandas, I am able to do
df2 = df.groupby('name').agg({'id': 'first', 'grocery': ','.join})
name id grocery
Mike 01 Apple
Mike 01 Orange
Kate 99 Beef
Kate 99 Wine
name id grocery
Mike 01 Apple,Orange
Kate 99 Beef,Wine
Since id is the same across multiple rows for the same person, I just took the first one for each person and concatenated the groceries.
I can't seem to make this work in pyspark. How can I do the same thing in pyspark? I want the grocery to be string instead of list
Use collect_list to collect elements into a list and then join the list as string with concat_ws:
import pyspark.sql.functions as f
df.groupBy("name").agg(
    f.first("id").alias("id"),
    f.concat_ws(",", f.collect_list("grocery")).alias("grocery")).show()
#|name| id| grocery|
#|Kate| 99| Beef,Wine|
#|Mike| 01|Apple,Orange|

How to link data from one Dataframe to another? [duplicate]

How can I get a merged data frame from two data frames that share a column, such that only the rows with a common value in that column appear in the result?
I have 5000 rows in df1, in this format:
director_name actor_1_name actor_2_name actor_3_name movie_title
0 James Cameron CCH Pounder Joel David Moore Wes Studi Avatar
1 Gore Verbinski Johnny Depp Orlando Bloom Jack Davenport Pirates of the Caribbean: At World's End
2 Sam Mendes Christoph Waltz Rory Kinnear Stephanie Sigman Spectre
and 10000 rows of df2 as
movieId genres movie_title
1 Adventure|Animation|Children|Comedy|Fantasy Toy Story
2 Adventure|Children|Fantasy Jumanji
3 Comedy|Romance Grumpier Old Men
4 Comedy|Drama|Romance Waiting to Exhale
The common column 'movie_title' has values that appear in both data frames, and based on them I want to get all rows where 'movie_title' is the same. The other rows should be deleted.
Any help/suggestion would be appreciated.
Note: I already tried
pd.merge(dfinal, df1, on='movie_title')
and output comes like one row
director_name actor_1_name actor_2_name actor_3_name movie_title movieId title genres
I also tried how="outer", "left" and "right", and didn't get any rows after dropping NaN, although many common column values do exist.
You can use pd.merge:
import pandas as pd
pd.merge(df1, df2, on="movie_title")
Only rows whose keys are found in both data frames are kept. In case you want to keep all rows from the left data frame and only add values from df2 where a matching key is available, you can use how="left":
pd.merge(df1, df2, on="movie_title", how="left")
We can merge two data frames in several ways. The most common way in Python is the merge operation in pandas.
import pandas
dfinal = df1.merge(df2, on="movie_title", how = 'inner')
To merge on columns whose names differ between the two data frames, specify the left and right column names explicitly, for example 'movie_title' in df1 and 'movie_name' in df2.
dfinal = df1.merge(df2, how='inner', left_on='movie_title', right_on='movie_name')
If you want to be even more specific, you may read the documentation of pandas merge operation.
If you want a merged DataFrame in which only the values common to both data frames appear, do an inner merge.
import pandas as pd
merged_Frame = pd.merge(df1, df2, on='movie_title', how='inner')
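If an inner merge returns far fewer rows than expected even though the titles look identical, one possible cause (an assumption here, the question does not confirm it) is stray whitespace in the key column. The sketch below uses small hypothetical frames, with a trailing space planted in one title, to show how stripping the key and using indicator=True can help diagnose this.

```python
import pandas as pd

# Hypothetical miniature versions of df1 and df2; the trailing space in
# "Avatar " is planted to imitate a key that looks equal but is not.
df1 = pd.DataFrame({
    "director_name": ["James Cameron", "Sam Mendes"],
    "movie_title": ["Avatar ", "Spectre"],
})
df2 = pd.DataFrame({
    "movieId": [1, 2],
    "genres": ["Action|Adventure", "Action|Thriller"],  # made-up values
    "movie_title": ["Avatar", "Spectre"],
})

# Without cleaning, only the exactly matching title survives the inner merge.
raw = pd.merge(df1, df2, on="movie_title")

# Normalize the key on both sides, then merge again.
for frame in (df1, df2):
    frame["movie_title"] = frame["movie_title"].str.strip()
clean = pd.merge(df1, df2, on="movie_title", how="inner")

# indicator=True adds a _merge column showing where each row came from.
check = pd.merge(df1, df2, on="movie_title", how="outer", indicator=True)
print(len(raw), len(clean))
print(check[["movie_title", "_merge"]])
```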

