I've looked into agg/apply/transform after groupby, but none of them seem to meet my needs.
Here is an example DF:
import pandas as pd

df_seq = pd.DataFrame({
    'person': ['Tom', 'Tom', 'Tom', 'Lucy', 'Lucy', 'Lucy'],
    'day': [1, 2, 3, 1, 4, 6],
    'food': ['beef', 'lamb', 'chicken', 'fish', 'pork', 'venison']
})
person,day,food
Tom,1,beef
Tom,2,lamb
Tom,3,chicken
Lucy,1,fish
Lucy,4,pork
Lucy,6,venison
The day column shows that each person consumes food in sequential order.
Now I would like to group by the person column and create a DataFrame that contains food pairs for neighboring days (as shown below).
Note that the day column is only here for illustration, so its values should not be used directly; it only indicates that the food column is in sequential order. In my real data it's a datetime column.
person,day,food,food_next
Tom,1,beef,lamb
Tom,2,lamb,chicken
Lucy,1,fish,pork
Lucy,4,pork,venison
At the moment I can only do this with a for-loop that iterates over all users, and it's very slow.
Is it possible to achieve this with groupby and apply/transform, or with some other vectorized operation?
Create the new column with DataFrameGroupBy.shift, then remove the rows with missing values in food_next with DataFrame.dropna:
df = (df_seq.assign(food_next=df_seq.groupby('person')['food'].shift(-1))
            .dropna(subset=['food_next']))
print(df)
  person  day  food food_next
0    Tom    1  beef      lamb
1    Tom    2  lamb   chicken
3   Lucy    1  fish      pork
4   Lucy    4  pork   venison
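Because the real data has a datetime column rather than an integer day, make sure each group is in chronological order before shifting; shift(-1) only pairs rows that are physically adjacent within a group. A minimal sketch, with hypothetical timestamps standing in for the integer days:

import pandas as pd

# Hypothetical datetime version of the example data (unsorted on purpose).
df_seq = pd.DataFrame({
    'person': ['Tom', 'Lucy', 'Tom', 'Lucy', 'Tom', 'Lucy'],
    'day': pd.to_datetime(['2023-01-01', '2023-01-01', '2023-01-03',
                           '2023-01-04', '2023-01-02', '2023-01-06']),
    'food': ['beef', 'fish', 'chicken', 'pork', 'lamb', 'venison']
})

# Sort by person and timestamp so shift(-1) pairs consecutive meals.
df = (df_seq.sort_values(['person', 'day'])
            .assign(food_next=lambda d: d.groupby('person')['food'].shift(-1))
            .dropna(subset=['food_next']))
print(df)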
This might be a slightly patchy answer, and it doesn't perform an aggregation in the standard sense.
First, a small querying function that, given a person and a day, returns the first result that matches the parameters (assuming the data is pre-sorted), and, failing that, returns some default value:
def get_next_food(df, person, day):
    results = df.query(f"`person`=='{person}' and `day`>{day}")
    if len(results) > 0:
        return results.iloc[0]['food']
    else:
        return "Mystery"
You can use this as follows:
get_next_food(df_seq, "Tom", 1)
> 'lamb'
Now, we can use this in an apply statement, to populate a new column with the results of this function applied row-wise:
df_seq['next_food'] = df_seq.apply(lambda x: get_next_food(df_seq, x['person'], x['day']), axis=1)
>
  person  day     food next_food
0    Tom    1     beef      lamb
1    Tom    2     lamb   chicken
2    Tom    3  chicken   Mystery
3   Lucy    1     fish      pork
4   Lucy    4     pork   venison
5   Lucy    6  venison   Mystery
Give it a try; I'm not convinced you'll see a vast performance improvement, but it'd be interesting to find out.
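If you do want to find out, here is a quick benchmark sketch comparing this row-wise apply against the groupby/shift approach from the other answer (the scaled-up data and the timing harness are illustrative, not from the original question):

import time

import pandas as pd

# Scale the toy data up so the timings are measurable.
big = pd.concat([df_seq.assign(person=df_seq['person'] + str(i))
                 for i in range(1000)], ignore_index=True)

start = time.perf_counter()
big.apply(lambda x: get_next_food(big, x['person'], x['day']), axis=1)
print('row-wise apply:', time.perf_counter() - start)

start = time.perf_counter()
big.groupby('person')['food'].shift(-1)
print('groupby shift: ', time.perf_counter() - start)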
I have 2 CSV files. They have one common column, ID. What I want to do is extract the common rows and build another dataframe. First I want to select the job, and after that, since the files share the ID column, I want to find the rows whose IDs are the same. Visually, the dataframes should look like this:
Let the first DataFrame be:
ID  Gender  Job           Shift  Wage
1   Male    Engineer      Night  8000
2   Male    Engineer      Night  7865
3   Female  Worker        Day    5870
4   Male    Accountant    Day    5870
5   Female  Architecture  Day    4900
Let the second one be:
ID  Department
1   IT
2   Quality Control
5   Construction
7   Construction
8   Human Resources
And the new DataFrame should look like this:
ID  Department       Job           Wage
1   IT               Engineer      8000
2   Quality Control  Engineer      7865
5   Construction     Architecture  4900
You can use:
df_result = df1.merge(df2, on='ID', how='inner')
If you want to select only certain columns from each dataframe, use:
df_result = df1[['ID', 'Job', 'Wage']].merge(df2[['ID', 'Department']], on='ID', how='inner')
Use:
df = df2.merge(df1[['ID', 'Job', 'Wage']], on='ID')
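A runnable sketch with the sample data from the question, showing that the inner merge (the default) keeps only the IDs present in both frames, namely 1, 2 and 5:

import pandas as pd

df1 = pd.DataFrame({
    'ID': [1, 2, 3, 4, 5],
    'Gender': ['Male', 'Male', 'Female', 'Male', 'Female'],
    'Job': ['Engineer', 'Engineer', 'Worker', 'Accountant', 'Architecture'],
    'Shift': ['Night', 'Night', 'Day', 'Day', 'Day'],
    'Wage': [8000, 7865, 5870, 5870, 4900],
})
df2 = pd.DataFrame({
    'ID': [1, 2, 5, 7, 8],
    'Department': ['IT', 'Quality Control', 'Construction',
                   'Construction', 'Human Resources'],
})

# how='inner' is the default, so only matching IDs survive.
df = df2.merge(df1[['ID', 'Job', 'Wage']], on='ID')
print(df)
#    ID       Department           Job  Wage
# 0   1               IT      Engineer  8000
# 1   2  Quality Control      Engineer  7865
# 2   5     Construction  Architecture  4900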
I have a dataframe like this:
NUMBER NAME
1 000231 John Stockton
2 009456 Karl Malone
3 100000901 John Stockton
4 100008496 Karl Malone
I want to obtain a new dataframe with:
NAME VALUE1 VALUE2
1 John Stockton 000231 100000901
2 Karl Malone 009456 100008496
I think I should use df.groupby(), but I have no function to pass as an aggregator (I don't need to compute any mean(), min(), or max() value). If I just use groupby() without any aggregator, I get:
In[1]: pd.DataFrame(df.groupby(['NAME']))
Out[1]:
0 1
0 John Stockton NAME NUMBER 000231 100000901
1 Karl Malone NAME NUMBER 009456 100008496
What am I doing wrong? Do I need to pivot the dataframe?
Actually, you need a slightly more complicated pipeline:
(df.assign(group=df.groupby('NAME').cumcount().add(1))
   .pivot(index='NAME', columns='group', values='NUMBER')
   .rename_axis(None, axis=1)
   .add_prefix('VALUE')
   .reset_index()
)
output:
NAME VALUE1 VALUE2
0 John Stockton 231 100000901
1 Karl Malone 9456 100008496
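Note that the leading zeros in the output above were lost because NUMBER was parsed as a number; if the column is kept as strings (e.g. dtype=str in read_csv), they survive. A runnable sketch of the same pipeline:

import pandas as pd

# NUMBER kept as strings so the leading zeros are preserved.
df = pd.DataFrame({
    'NUMBER': ['000231', '009456', '100000901', '100008496'],
    'NAME': ['John Stockton', 'Karl Malone', 'John Stockton', 'Karl Malone'],
})

# Number the occurrences within each name, then pivot them into columns.
out = (df.assign(group=df.groupby('NAME').cumcount().add(1))
         .pivot(index='NAME', columns='group', values='NUMBER')
         .rename_axis(None, axis=1)
         .add_prefix('VALUE')
         .reset_index())
print(out)
#             NAME  VALUE1     VALUE2
# 0  John Stockton  000231  100000901
# 1    Karl Malone  009456  100008496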
I have a dataset which contains a column 'location' with countries.
    id       location
0  001  United States
1  002  United States
2  003        Germany
3  004         Brazil
4  005          China
Now I only want the rows with specific countries.
I did this like this:
df2 = df[(df['location'].str.contains('United States')) | (df['location'].str.contains('Germany'))]
That works.
Now I want only half of the rows with 'United States'.
(The reason is that I have a really large dataset and most of the rows are 'United States'. For the sake of performance in further operations I want to cut half of it, or really any percentage.)
Can anyone help me do that in a fast and clean way? I'm struggling.
TY <3
You can use sample for that, together with drop:
df.drop(df[df['location'] == 'United States'].sample(frac=.5).index)
The filter inside returns all the rows whose location equals 'United States'. sample then randomly takes 50% of those, and .index returns their index labels, which drop uses to remove those rows.
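A self-contained sketch with the sample data, with random_state pinned so the result is reproducible (frac controls the fraction to drop):

import pandas as pd

df = pd.DataFrame({
    'id': ['001', '002', '003', '004', '005'],
    'location': ['United States', 'United States', 'Germany',
                 'Brazil', 'China'],
})

# Randomly pick half of the 'United States' rows and drop them.
us_rows = df[df['location'] == 'United States']
df2 = df.drop(us_rows.sample(frac=.5, random_state=0).index)
print(df2)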
So I've got my dataframe (df1):
Number Name Gender Hobby
122 John Male -
123 Patrick Male -
124 Rudy Male -
I want to add data to Hobby based on the Number column, assuming I've got lists of hobbies keyed by number in other dataframes, for example (df2):
Number Hobby
124 Soccer
... ...
... ...
and df3 :
Number Hobby
122 Basketball
... ...
... ...
How can I achieve this dataframe:
Number Name Gender Hobby
122 John Male Basketball
123 Patrick Male -
124 Rudy Male Soccer
So far I've already tried the following solution:
Select rows from a DataFrame based on values in a column in pandas
but it only selects some data. How can I update the 'Hobby' column?
Thanks in advance.
You can use map; merge and join would also achieve it. With the question's naming, where df1 is the main frame and df2 holds the hobby lookup:
df1['Hobby'] = df1.Number.map(df2.set_index('Number').Hobby)
df1
Out[155]:
   Number     Name Gender   Hobby
0     122     John   Male     NaN
1     123  Patrick   Male     NaN
2     124     Rudy   Male  Soccer
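To fold in df3 as well and keep the '-' placeholder where no hobby is found, one possible sketch is to stack the lookup tables first and fill the leftover gaps:

import pandas as pd

df1 = pd.DataFrame({'Number': [122, 123, 124],
                    'Name': ['John', 'Patrick', 'Rudy'],
                    'Gender': ['Male', 'Male', 'Male'],
                    'Hobby': ['-', '-', '-']})
df2 = pd.DataFrame({'Number': [124], 'Hobby': ['Soccer']})
df3 = pd.DataFrame({'Number': [122], 'Hobby': ['Basketball']})

# Stack the lookup tables, then map Number -> Hobby; keep '-' where unmatched.
lookup = pd.concat([df2, df3]).set_index('Number').Hobby
df1['Hobby'] = df1.Number.map(lookup).fillna('-')
print(df1)
#    Number     Name Gender       Hobby
# 0     122     John   Male  Basketball
# 1     123  Patrick   Male           -
# 2     124     Rudy   Male      Soccer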