Pandas dataframe: groupby one column, but concatenate and aggregate by others [duplicate]

This question already has answers here:
Concatenate strings from several rows using Pandas groupby
(8 answers)
Pandas groupby: How to get a union of strings
(8 answers)
Closed 4 years ago.
How do I turn the below input data (Pandas dataframe fed from Excel file):
ID Category Speaker Price
334014 Real Estate Perspectives Tom Smith 100
334014 E&E Tom Smith 200
334014 Real Estate Perspectives Janet Brown 100
334014 E&E Janet Brown 200
into this:
ID Category Speaker Price
334014 Real Estate Perspectives Tom Smith, Janet Brown 100
334014 E&E Tom Smith, Janet Brown 200
So basically I want to group by Category and concatenate the Speakers, but not aggregate Price.
I tried different approaches with Pandas dataframe.groupby() and .agg(), but to no avail. Maybe there is a simpler pure Python solution?

There are 2 possible solutions. Either aggregate by multiple columns and join:
dataframe.groupby(['ID','Category','Price'])['Speaker'].apply(', '.join).reset_index()
Or, if you want to group by Price only, aggregate the remaining columns with first or last (Price is the group key here, so aggregate ID and Category instead):
dataframe.groupby('Price', as_index=False).agg({'Speaker': ', '.join, 'ID': 'first', 'Category': 'first'})
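For context, here is a minimal runnable sketch of the first approach, assuming the sample data from the question (as_index=False keeps the group keys as regular columns):
import pandas as pd

dataframe = pd.DataFrame({
    'ID': [334014, 334014, 334014, 334014],
    'Category': ['Real Estate Perspectives', 'E&E',
                 'Real Estate Perspectives', 'E&E'],
    'Speaker': ['Tom Smith', 'Tom Smith', 'Janet Brown', 'Janet Brown'],
    'Price': [100, 200, 100, 200],
})

# Group on every column that should stay unique, then join the speakers
out = (dataframe.groupby(['ID', 'Category', 'Price'], as_index=False)['Speaker']
                .agg(', '.join))
print(out)
#        ID                  Category  Price                 Speaker
# 0  334014                       E&E    200  Tom Smith, Janet Brown
# 1  334014  Real Estate Perspectives    100  Tom Smith, Janet Brown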

Try this:
df.groupby(['ID','Category'], as_index=False).agg(lambda x: x.iloc[0] if x.dtype == 'int64' else ', '.join(x))
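If you'd rather not sniff dtypes, named aggregation (available since pandas 0.25) states the intent explicitly; this sketch assumes Price is constant within each ID/Category group:
out = df.groupby(['ID', 'Category'], as_index=False).agg(
    Speaker=('Speaker', ', '.join),  # concatenate speakers
    Price=('Price', 'first'),        # Price is identical within a group
)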


In pandas, how to groupby and apply/transform on each whole group (NOT aggregation)?

I've looked into agg/apply/transform after groupby, but none of them seem to meet my need.
Here is an example DF:
import pandas as pd

df_seq = pd.DataFrame({
    'person': ['Tom', 'Tom', 'Tom', 'Lucy', 'Lucy', 'Lucy'],
    'day': [1, 2, 3, 1, 4, 6],
    'food': ['beef', 'lamb', 'chicken', 'fish', 'pork', 'venison']
})
person,day,food
Tom,1,beef
Tom,2,lamb
Tom,3,chicken
Lucy,1,fish
Lucy,4,pork
Lucy,6,venison
The day column shows that, for each person, he/she consumes food in sequential order.
Now I would like to group by the person column and create a DataFrame which contains food pairs for two neighboring days (as shown below).
Note the day column is only for example purposes here, so its values should not be used directly. It only indicates that the food column is in sequential order. In my real data, it's a datetime column.
person,day,food,food_next
Tom,1,beef,lamb
Tom,2,lamb,chicken
Lucy,1,fish,pork
Lucy,4,pork,venison
At the moment, I can only do this with a for-loop to iterate through all users. It's very slow.
Is it possible to use a groupby and apply/transform to achieve this, or any vectorized operations?
Create the new column with DataFrameGroupBy.shift, then remove rows with missing values in food_next with DataFrame.dropna:
df = (df_seq.assign(food_next = df_seq.groupby('person')['food'].shift(-1))
            .dropna(subset=['food_next']))
print (df)
person day food food_next
0 Tom 1 beef lamb
1 Tom 2 lamb chicken
3 Lucy 1 fish pork
4 Lucy 4 pork venison
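Since the real data uses a datetime column, one caveat worth adding: shift(-1) pairs physically adjacent rows, so sort first. A sketch, assuming the ordering column is still called day:
df = (df_seq.sort_values(['person', 'day'])
            .assign(food_next=lambda d: d.groupby('person')['food'].shift(-1))
            .dropna(subset=['food_next']))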
This might be a slightly patchy answer, and it doesn't perform an aggregation in the standard sense.
First, a small querying function that, given a name and a day, will return the first result (assuming the data is pre-sorted) that matches the parameters, and failing that, returns some default value:
def get_next_food(df, person, day):
    results = df.query(f"`person`=='{person}' and `day`>{day}")
    if len(results) > 0:
        return results.iloc[0]['food']
    else:
        return "Mystery"
You can use this as follows:
get_next_food(df_seq, "Tom", 1)
> 'lamb'
Now, we can use this in an apply statement, to populate a new column with the results of this function applied row-wise:
df_seq['next_food'] = df_seq.apply(lambda x: get_next_food(df_seq, x['person'], x['day']), axis=1)
>
person day food next_food
0 Tom 1 beef lamb
1 Tom 2 lamb chicken
2 Tom 3 chicken Mystery
3 Lucy 1 fish pork
4 Lucy 4 pork venison
5 Lucy 6 venison Mystery
Give it a try. I'm not convinced you'll see a vast performance improvement, since the query rescans the whole frame for every row, but it'd be interesting to find out.

Creating a DataFrame from 2 other dataframes based on conditions in Python [duplicate]

This question already has answers here:
Pandas Merging 101
(8 answers)
Closed 1 year ago.
I have 2 CSV files. They have one common column, which is ID. What I want to do is extract the common rows and build another dataframe. First, I want to select certain columns (like Job), and then, since they share the ID column, find the rows whose IDs match. Visually, the dataframes look like this:
Let the first DataFrame be:
ID  Gender  Job           Shift  Wage
1   Male    Engineer      Night  8000
2   Male    Engineer      Night  7865
3   Female  Worker        Day    5870
4   Male    Accountant    Day    5870
5   Female  Architecture  Day    4900
And let the second one be:
ID  Department
1   IT
2   Quality Control
5   Construction
7   Construction
8   Human Resources
And the new DataFrame should look like this:
ID  Department       Job           Wage
1   IT               Engineer      8000
2   Quality Control  Engineer      7865
5   Construction     Architecture  4900
You can use:
df_result = df1.merge(df2, on = 'ID', how = 'inner')
If you want to select only certain columns from a certain df use:
df_result = df1[['ID','Job', 'Wage']].merge(df2[['ID', 'Department']], on='ID', how='inner')
Use:
df = df2.merge(df1[['ID','Job', 'Wage']], on='ID')
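As a check, here is the whole thing end to end, a minimal sketch with the column values typed out from the tables above:
import pandas as pd

df1 = pd.DataFrame({
    'ID': [1, 2, 3, 4, 5],
    'Gender': ['Male', 'Male', 'Female', 'Male', 'Female'],
    'Job': ['Engineer', 'Engineer', 'Worker', 'Accountant', 'Architecture'],
    'Shift': ['Night', 'Night', 'Day', 'Day', 'Day'],
    'Wage': [8000, 7865, 5870, 5870, 4900],
})
df2 = pd.DataFrame({
    'ID': [1, 2, 5, 7, 8],
    'Department': ['IT', 'Quality Control', 'Construction',
                   'Construction', 'Human Resources'],
})

# Inner merge (the default) keeps only the IDs present in both frames
print(df2.merge(df1[['ID', 'Job', 'Wage']], on='ID'))
#    ID       Department           Job  Wage
# 0   1               IT      Engineer  8000
# 1   2  Quality Control      Engineer  7865
# 2   5     Construction  Architecture  4900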

How to pivot columns correctly [duplicate]

This question already has answers here:
How can I pivot a dataframe?
(5 answers)
Closed 1 year ago.
I have a dataframe like this:
NUMBER NAME
1 000231 John Stockton
2 009456 Karl Malone
3 100000901 John Stockton
4 100008496 Karl Malone
I want to obtain a new dataframe with:
NAME VALUE1 VALUE2
1 John Stockton 000231 100000901
2 Karl Malone 009456 100008496
I think I should use groupby(), but I have no function to pass as an aggregator (I don't need to compute any mean(), min(), or max() value). If I just use groupby() without any aggregator, I get:
In[1]: pd.DataFrame(df.groupby(['NAME']))
Out[1]:
0 1
0 John Stockton NAME NUMBER 000231 100000901
1 Karl Malone NAME NUMBER 009456 100008496
What am I doing wrong? Do I need to pivot the dataframe?
Actually, you need a slightly more complicated pipeline:
(df.assign(group=df.groupby('NAME').cumcount().add(1))
   .pivot(index='NAME', columns='group', values='NUMBER')
   .rename_axis(None, axis=1)
   .add_prefix('VALUE')
   .reset_index()
)
output:
NAME VALUE1 VALUE2
0 John Stockton 231 100000901
1 Karl Malone 9456 100008496
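Note the leading zeros from the expected output (000231) are lost above because NUMBER was parsed as integers. If they matter, keep the column as strings when loading the data; a sketch, assuming the frame comes from a CSV file (the filename here is hypothetical):
df = pd.read_csv('players.csv', dtype={'NUMBER': str})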

Drop 50% of rows containing a keyword [duplicate]

This question already has answers here:
"Drop random rows" from pandas dataframe
(2 answers)
Closed 2 years ago.
I have a dataset which contains a column 'location' with countries.
id location
0 001 United State
1 002 United State
2 003 Germany
3 004 Brazil
4 005 China
Now I only want the rows with specific countries.
I did this like this:
df2 = df[(df['location'].str.contains('United States')) | (df['location'].str.contains('Germany'))]
That works.
Now I want only half of the rows with 'United States'.
(The reason is that I have a really large dataset and most of the rows are 'United States'. For the sake of performance in further operations, I want to cut half of it, or really any %.)
Can anyone help me do that in a fast and clean way? I'm struggling.
TY <3
You can use sample for that, together with drop:
df = df.drop(df[df['location'] == 'United State'].sample(frac=.5).index)
The filter inside returns ALL the rows whose location equals 'United State'. sample then randomly takes 50% of those, and .index returns their index labels, which drop then uses to remove those rows.
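A slightly fuller sketch with the question's names: sample accepts any fraction via frac, and passing random_state makes the drop reproducible:
# Index labels of a random 50% of the 'United State' rows
drop_idx = df[df['location'] == 'United State'].sample(frac=0.5, random_state=42).index
df2 = df.drop(drop_idx)  # keep df intact; df2 holds the thinned-out copy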

Add data from multiple dataframe by its index using pandas [duplicate]

This question already has answers here:
replace column values in one dataframe by values of another dataframe
(5 answers)
Closed 5 years ago.
So I've got my dataframe (df1):
Number Name Gender Hobby
122 John Male -
123 Patrick Male -
124 Rudy Male -
I want to add data to Hobby based on the Number column, assuming I've got hobbies keyed by Number in different dataframes. For example (df2):
Number Hobby
124 Soccer
... ...
... ...
and df3 :
Number Hobby
122 Basketball
... ...
... ...
How can I achieve this dataframe:
Number Name Gender Hobby
122 John Male Basketball
123 Patrick Male -
124 Rudy Male Soccer
So far I've already tried the following solution:
Select rows from a DataFrame based on values in a column in pandas
but it only selects some data. How can I update the 'Hobby' column?
Thanks in advance.
You can use map; merge and join will also achieve it. Using the question's names, with df2 as the lookup:
df1['Hobby'] = df1.Number.map(df2.set_index('Number').Hobby)
df1
Out[155]:
Number Name Gender Hobby
0 122 John Male NaN
1 123 Patrick Male NaN
2 124 Rudy Male Soccer
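To fill Hobby from both lookup frames at once, one option is to concatenate df2 and df3 into a single mapping first. A sketch, assuming the Number values don't conflict between the two frames:
lookup = pd.concat([df2, df3]).set_index('Number')['Hobby']
# map() leaves unmatched Numbers as NaN; fillna restores the original '-' placeholder
df1['Hobby'] = df1['Number'].map(lookup).fillna(df1['Hobby'])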
