My data is stored in df. I have multiple users per group. I want to group df by group and apply different functions to different columns. The twist is that I would like to assign custom names to the new columns during this process.
import numpy as np
import pandas as pd

np.random.seed(123)
df = pd.DataFrame({"user": range(4), "group": [1, 1, 2, 2],
                   "crop": ["2018-01-01", "2018-01-01", "2018-03-01", "2018-03-01"],
                   "score": np.random.randint(400, 1000, 4)})
df["crop"] = pd.to_datetime(df["crop"])
print(df)
   user  group       crop  score
0     0      1 2018-01-01    910
1     1      1 2018-01-01    765
2     2      2 2018-03-01    782
3     3      2 2018-03-01    722
I want to get the mean of score, and the min and max values of crop grouped by group and assign custom names to each new column. The desired output should look like this:
   group  mean_score   min_crop   max_crop
0      1       837.5 2018-01-01 2018-01-01
1      2       752.0 2018-03-01 2018-03-01
I don't know how to do this in a one-liner in Python. In R, I would use data.table and get the following:
df[, list(mean_score = mean(score),
          max_crop = max(crop),
          min_crop = min(crop)), by = group]
I know I could group the data and use .agg combined with a dictionary. Is there an alternative way where I can custom-name each column in this process?
Try creating a function with the required operations using groupby().apply():
def f(x):
    d = {}
    d['mean_score'] = x['score'].mean()
    d['min_crop'] = x['crop'].min()
    d['max_crop'] = x['crop'].max()
    return pd.Series(d, index=['mean_score', 'min_crop', 'max_crop'])

data = df.groupby('group').apply(f)
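If you are on pandas 0.25 or newer, named aggregation gets you the custom column names in one step. A minimal sketch of that alternative (not part of the answer above; reset_index is added only to flatten the group index):
data = (df.groupby('group')
          .agg(mean_score=('score', 'mean'),
               min_crop=('crop', 'min'),
               max_crop=('crop', 'max'))
          .reset_index())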
I want to do a groupby aggregation, but for each group I want to use a function chosen from a special column that stores which function should be used. It is easier to show with an example:
id  group  val  func
0   0      0    "avg"
1   0      2    "avg"
2   0      2    "avg"
3   1      0    "med"
4   1      2    "med"
So in that example the expected behaviour would be an "avg" aggregation for group 0 and "median" for group 1. How can I make agg choose the function based on the "func" column values? I know that I could calculate every aggregation for every group and then use func as a mask to pick the right values, but that isn't great since it does a lot of unneeded calculations; there should be a better approach...
P.S. It's guaranteed that func is the same within each group, so I don't have to worry about that.
I've written my own solution for my specific case and added it to the question, but the answer below is fine too.
So, my approach was:
1. Use a dict to translate the names provided in the table to the proper pandas function names, as suggested in the answer:
func_dict = {"avg": "mean", "med": "median", "min": "min","max": "max", "rnk": "first"}
2. Write a custom function to pass to apply later:
def pick_price(subframe: pd.DataFrame) -> float:
    # take the function name from the first row of the subframe
    # and look up the real pandas name in the dict
    func_name = subframe["agg"].iloc[0]
    func_name = func_dict[func_name]
    # the "if" block below applies it to the subframe
    if func_name != "first":
        ans = subframe["comp_price"].agg(func_name)
        return 1.0 * ans
    else:
        idx = subframe["rank"].idxmin()
        return 1.0 * subframe["comp_price"].loc[idx]
That function takes a subframe for a group that shares a single function to apply and, well, applies it. (The column names here come from my actual data, not from the toy example above.)
3. Finally, use that function: group by the column that defines the groups, and apply it with the apply() method:
grouped = X.groupby("sku")
grouped.apply(pick_price)
I would use a dictionary of group: function:
f = {0: 'mean', 1: 'median'}
df['out'] = df.groupby('group')['val'].transform(lambda s: s.agg(f.get(s.name)))
Output:
id group val out
0 0 0 0 1.333333
1 1 0 2 1.333333
2 2 0 2 1.333333
3 3 1 0 1.000000
4 4 1 2 1.000000
Variant using a column as the source
NB: it's a bit hacky, I prefer the dictionary. It extracts the function name from the first row of the group. The names must be valid pandas names, like mean/median, not avg/med.
df['out'] = (df.groupby('group')['val']
               .transform(lambda s: s.agg(df.loc[s.index[0], 'func']))
             )
Output:
id group val func out
0 0 0 0 mean 1.333333
1 1 0 2 mean 1.333333
2 2 0 2 mean 1.333333
3 3 1 0 median 1.000000
4 4 1 2 median 1.000000
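If you want one row per group rather than the result broadcast back to every row, the same group-to-function mapping also works with apply on the grouped frame; a sketch under the same assumptions (inside apply, g.name is the group key):
f = {0: 'mean', 1: 'median'}
out = (df.groupby('group')
         .apply(lambda g: g['val'].agg(f.get(g.name)))
         .rename('out')
         .reset_index())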
I have a dataframe like this:
id  name  emails
1   a     a#e.com,b#e.com,c#e.com,d#e.com
2   f     f#gmail.com
And I need to iterate over the emails: if there is more than one, create additional rows in the dataframe for the extra emails, which do not correspond to a name. It should look like this:
id  name  emails
1   a     a#e.com
2   f     f#gmail.com
3   NaN   b#e.com
4   NaN   c#e.com
5   NaN   d#e.com
What is the best way to do this, apart from iterrows with append or concat? Is it OK to modify the dataframe while iterating over it?
Thanks.
Use DataFrame.explode on the values split by Series.str.split first, then compare the values before # and set a missing value where there is no match, and finally sort so that missing values end up at the end of the DataFrame and assign a new range to the id column:
# split the comma-separated emails into lists and expand to one row per email
df = df.assign(emails=df['emails'].str.split(',')).explode('emails')
# keep the name only where it matches the part of the email before '#'
mask = df['name'].eq(df['emails'].str.split('#').str[0])
df['name'] = np.where(mask, df['name'], np.nan)
# push rows with a missing name to the end and renumber id
df = df.sort_values('name', key=lambda x: x.isna(), ignore_index=True)
df['id'] = range(1, len(df) + 1)
print(df)
id name emails
0 1 a a#e.com
1 2 f f#gmail.com
2 3 NaN b#e.com
3 4 NaN c#e.com
4 5 NaN d#e.com
I don't know how to create a dataframe based on another dataframe using a groupby condition. For example, I have a dataframe where, if I apply the function:
flights_df.groupby(by='DepHour')['Cancelled'].value_counts()
I obtain something like this
DepHour  Cancelled
0.0      0             20361
         1                 7
1.0      0              5857
         1                 4
2.0      0              1850
         1                 1
3.0      0               833
4.0      0              3389
         1                 1
5.0      0            148143
         1                24
As can be seen, for DepHour == 3.0 there are no cancelled flights.
Using the same dataframe that I used to generate this output, I want to create another dataframe containing only the DepHour values for which there are no cancelled flights. In this case, the output would be a dataframe containing only rows where DepHour == 3.0.
I know that I can use a mask, but that only filters rows where Cancelled == 0 (i.e. rows from every other DepHour that happen to have Cancelled == 0 are included as well).
Thanks, and sorry for my bad English!
There might be a cleaner way (probably without using groupby twice) but this should work:
flights_df.groupby('DepHour') \
    .filter(lambda x: (x['Cancelled'].unique() == [0]).all()) \
    .groupby('DepHour')['Cancelled'].value_counts()
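A variant that avoids grouping twice is to build a boolean mask with transform; a sketch, assuming Cancelled is 0/1 as in the question:
# keep only rows from departure hours where no flight was cancelled
no_cancel = flights_df.groupby('DepHour')['Cancelled'].transform('max').eq(0)
hh = flights_df[no_cancel]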
I have data that looks like
Name,Report_ID,Amount,Flag,Actions
Fizz,123,5,,A
Fizz,123,10,Y,A
Buzz,456,10,,B
Buzz,456,40,,C
Buzz,456,70,,D
Bazz,678,100,Y,F
From these individual entries, I'd like to create a new dataframe that captures various statistics per Report_ID: mostly sums, counts of items, and counts of unique entries. I'd like the output dataframe to look like the following:
Report_ID,Number of Flags,Number of Entries, Total,Unique Actions
123,1,2,15,1
456,0,3,120,3
678,1,1,100,1
I've tried using groupby, but I cannot merge all of the individual groupby results back together correctly. So far I've tried:
totals = raw_data.groupby('Report_ID')['Amount'].sum()
event_count = raw_data.groupby('Report_ID').size()
num_actions = raw_data.groupby('Report_ID').Actions.nunique()
output = pd.concat([totals,event_count,num_actions])
When I try this I get TypeError: cannot concatenate a non-NDFrame object. Any help would be appreciated!
You can use agg on the groupby
f = dict(Flag=['count', 'size'], Amount='sum', Actions='nunique')
df.groupby('Report_ID').agg(f)
Flag Amount Actions
count size sum nunique
Report_ID
123 1 2 15 1
456 0 3 120 3
678 1 1 100 1
You just need to specify axis=1 when concatenating:
event_count.name = 'Event Count' # Name the Series, as you did not group on one.
>>> pd.concat([totals, event_count, num_actions], axis=1)
Amount Event Count Actions
Report_ID
123 15 2 1
456 120 3 3
678 100 1 1
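If you want column names close to the ones in the question in a single step and are on pandas 0.25 or newer, named aggregation is another option; a sketch, not part of either answer above (the result column names here are illustrative):
output = (raw_data.groupby('Report_ID')
          .agg(number_of_flags=('Flag', 'count'),
               number_of_entries=('Flag', 'size'),
               total=('Amount', 'sum'),
               unique_actions=('Actions', 'nunique'))
          .reset_index())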
First things first, I have a data frame that has these columns:
issue_date | issue | special | group
Multiple rows can belong to the same group. For each group, I want to get its maximum date:
date_current = history.groupby('group').agg({'issue_date' : [np.min, np.max]})
date_current = date_current.issue_date.amax
After that, I want to filter each group by its max date minus n months:
date_before = date_current.values - pd.Timedelta(weeks=4*n)
I.e., for each group, I want to discard rows where the column issue_date < date_before:
hh = history[history['issue_date'] > date_before]
ValueError: Lengths must match to compare
This last line doesn't work, though, since the lengths don't match. This is expected because I have x lines in my data frame, but date_before has length equal to the number of groups in my data frame.
Given this, I'm wondering how I can perform this subtraction, or filtering, by group. Do I have to iterate over the data frame somehow?
You can solve this in a similar manner as you attempted it.
I've created my own sample data as follows:
history
issue_date group
0 2014-01-02 1
1 2014-01-02 2
2 2016-02-04 3
3 2016-03-05 2
You can use groupby and apply to do what you were attempting. First you define the function you want to apply; then groupby.apply will apply it to every group. In this case I've used n=1 to demonstrate the point:
def date_compare(df):
    date_current = df.issue_date.max()
    date_before = date_current - pd.Timedelta(weeks=4*1)
    hh = df[df['issue_date'] > date_before]
    return hh

hh = history.groupby('group').apply(date_compare)
issue_date group
group
1 0 2014-01-02 1
2 3 2016-03-05 2
3 2 2016-02-04 3
So the earlier date in group 2 has not survived the cut.
Hope that's helpful and that it follows the same logic you were going for.
I think your best option is to merge your original df with date_current, but this will only work if you change your calculation of date_before so that the group information isn't lost:
date_before = date_current - pd.Timedelta(weeks=4*n)
Then you can merge, left on group and right on the index (since you grouped on that before):
history = pd.merge(history, date_before.to_frame(), left_on='group', right_index=True)
Then your filter should work. The call to to_frame is necessary because you can't merge a dataframe and a series.
Hope that helps.
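Put together, the merge-then-filter approach could look roughly like this (a sketch following the answer above; column names come from the question and n is assumed to be defined):
date_current = history.groupby('group')['issue_date'].max()
date_before = (date_current - pd.Timedelta(weeks=4*n)).rename('date_before')
# bring each group's cutoff onto its rows, then filter
merged = pd.merge(history, date_before.to_frame(), left_on='group', right_index=True)
hh = merged[merged['issue_date'] > merged['date_before']]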