sum of values larger than median of each row in pandas dataframes - python

Is there an efficient way to find the sum of values whose absolute value is larger than the median of the row in a pandas data frame?
For example:
Monday Tuesday Wednesday Thursday Friday Saturday
0 2.2 4.4 0.5 9 4 3
1 2 4 1 8 4 5
2 1.8 4.5 0.9 8 1 15
3 4 1 5 10 4 5
…
How do I compute, for each row, the sum of the values that are larger than that row's median? And what about the 25th or 75th percentile?

I think you want this:
In [19]:
df[df.gt(df.median(axis=1), axis=0)]
Out[19]:
Monday Tuesday Wednesday Thursday Friday Saturday
0 NaN 4.4 NaN 9 4 NaN
1 NaN NaN NaN 8 NaN 5
2 NaN 4.5 NaN 8 NaN 15
3 NaN NaN 5 10 NaN 5
This uses .gt (greater than), comparing each value against the row-wise median, which is computed by passing axis=1 to median; axis=0 in gt aligns that per-row median with the rows.
You can then call sum on this (shown column-wise here; pass axis=1 for the per-row sums the question asks for):
In [20]:
df[df.gt(df.median(axis=1), axis=0)].sum()
Out[20]:
Monday NaN
Tuesday 8.9
Wednesday 5.0
Thursday 35.0
Friday 4.0
Saturday 25.0
dtype: float64

And to extend @EdChum's answer to arbitrary quantiles:
quantile = 0.75 # 0.25, 0.5, 0.75, etc.
df[df.gt(df.quantile(q=quantile, axis=1), axis=0)].sum(axis=1)
Given that there are only seven days in a week, I'm not sure this will do what's intended unless you have more columns than shown. Do you want the quantiles by column instead of by row?
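If column-wise quantiles are what's wanted instead, the same pattern works with the axes swapped; a minimal sketch (assuming the df above):
# compare each value to its own column's 75th percentile,
# then sum the surviving values per column
df[df.gt(df.quantile(q=0.75, axis=0), axis=1)].sum()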

Since you want to sum the values in each row that are greater than the row's median, and you want to preserve the day columns, the approach below works fine:
import numpy as np

def func(row):
    # sum of the row's values strictly above its 50th percentile (the median)
    return row[row > np.percentile(row, 50)].sum()

func is then applied row-wise to df:
In [67]: df['rule'] = df.apply(func, axis=1)
In [68]: df
Out[68]:
Monday Tuesday Wednesday Thursday Friday Saturday rule
0 2.2 4.4 0.5 9 4 3 17.4
1 2.0 4.0 1.0 8 4 5 13.0
2 1.8 4.5 0.9 8 1 15 27.5
3 4.0 1.0 5.0 10 4 5 20.0
And for different quantiles, you can use 25, 50, or 75 as the second argument of np.percentile(row, x).
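Putting it together, a parameterized sketch (assuming the original numeric df; sum_above and the loop are illustrative names, not part of the answer above):
import numpy as np

def sum_above(row, q=50):
    # sum of the row's values strictly above its q-th percentile
    return row[row > np.percentile(row, q)].sum()

days = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday']
for q in [25, 50, 75]:
    # apply over the day columns only, so result columns added in
    # earlier iterations don't leak into later percentile computations
    df[f'sum_above_{q}'] = df[days].apply(sum_above, axis=1, q=q)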

Related

Using pandas fillna with bfill method for particular cells

We have data representing each user's penalty count over time; it contains NaNs, and the value only ever goes up. Below is a subset of the data:
import pandas as pd
import numpy as np
d = {'day': ['Monday','Monday','Monday','Tuesday','Tuesday','Tuesday','Wednesday','Thursday','Thursday','Friday'],
     'user_id': [1, 4, 2, 4, 4, 2, 2, 1, 2, 1],
     'penalties_count': [1, 3, 2, np.nan, 4, 2, np.nan, 2, 3, 3]}
df = pd.DataFrame(data=d)
display(df)
day user_id penalties_count
0 Monday 1 1.0
1 Monday 4 3.0
2 Monday 2 2.0
3 Tuesday 4 NaN
4 Tuesday 4 4.0
5 Tuesday 2 2.0
6 Wednesday 2 NaN
7 Thursday 1 2.0
8 Thursday 2 3.0
9 Friday 1 3.0
The goal is to fill each NaN cell with the previous value for the same user_id. The result should be:
day user_id penalties_count
0 Monday 1 1.0
1 Monday 4 3.0
2 Monday 2 2.0
3 Tuesday 4 3.0
4 Tuesday 4 4.0
5 Tuesday 2 2.0
6 Wednesday 2 2.0
7 Thursday 1 2.0
8 Thursday 2 3.0
9 Friday 1 3.0
But when I use
df.fillna(method='bfill')
The result is incorrect in the row with index 3 for user_id=4 (we should see 3 there, not 4):
day user_id penalties_count
0 Monday 1 1.0
1 Monday 4 3.0
2 Monday 2 2.0
3 Tuesday 4 4.0
4 Tuesday 4 4.0
5 Tuesday 2 2.0
6 Wednesday 2 2.0
7 Thursday 1 2.0
8 Thursday 2 3.0
9 Friday 1 3.0
What can fix the issue?
If you want to fill NAs by group, you need to use groupby before filling. Also, it seems you need ffill, not bfill, like df.groupby("user_id")["penalties_count"].ffill()
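A minimal end-to-end sketch of that answer, writing the result back (assuming the df built above):
# forward-fill penalties_count within each user_id group; row order
# is preserved, so the result aligns with the original frame
df['penalties_count'] = df.groupby('user_id')['penalties_count'].ffill()
print(df)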

Converting a data frame of events into a timetable format

I am working on converting a list of online classes into a heat map using Python & Pandas and I've come to a dead end. Right now, I have a data frame 'data' with some events containing a day of the week listed as 'DAY' and the time of the event in hours listed as 'TIME'. The dataset is displayed as follows:
ID TIME DAY
108 15 Saturday
110 15 Sunday
112 16 Wednesday
114 16 Friday
116 15 Monday
.. ... ...
639 12 Wednesday
640 12 Saturday
641 18 Saturday
642 16 Thursday
643 15 Friday
I'm looking for a way to sum repetitions of every 'TIME' value for every 'DAY' and then present these sums in a new table 'event_count'. I need to turn the linear data in my 'data' table into a more timetable-like form that can later be converted into a visual heatmap.
Sounds like a difficult transformation, but I feel like I'm missing something very obvious. The new 'event_count' table should look something like this:
TIME Monday Tuesday Wednesday Thursday Friday Saturday Sunday
10 5 2 4 6 1 0 2
11 4 2 4 6 1 0 2
12 6 2 4 6 1 0 2
13 3 2 4 6 1 0 2
14 7 2 4 6 1 0 2
I tried to achieve this through pivot_table and stack, however, the best I got was a list of all days of the week with mean averages for time. Could you advise me which direction should I look into and how can I approach solving this?
IIUC you can do something like this:
df is from your given example data.
import pandas as pd
df = pd.DataFrame({
    'ID': [108, 110, 112, 114, 116, 639, 640, 641, 642, 643],
    'TIME': [15, 15, 16, 16, 15, 12, 12, 18, 16, 15],
    'DAY': ['Saturday', 'Sunday', 'Wednesday', 'Friday', 'Monday', 'Wednesday', 'Saturday', 'Saturday', 'Thursday', 'Friday']
})
weekdays = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
out = (pd.crosstab(index=df['TIME'], columns=df['DAY'], values=df['TIME'], aggfunc='count')
         .sort_index(axis=0)         # sort by the index 'TIME'
         .reindex(weekdays, axis=1)  # order columns by weekday
         .rename_axis(None, axis=1)  # drop the 'DAY' name on the column axis
         .reset_index()              # move 'TIME' from index to column
      )
print(out)
TIME Monday Tuesday Wednesday Thursday Friday Saturday Sunday
0 12 NaN NaN 1.0 NaN NaN 1.0 NaN
1 15 1.0 NaN NaN NaN 1.0 1.0 1.0
2 16 NaN NaN 1.0 1.0 1.0 NaN NaN
3 18 NaN NaN NaN NaN NaN 1.0 NaN
You were also on the right path with pivot_table. I'm not sure what was missing to get you the right result, but here is one approach with it. I added margins=True; maybe it is also interesting for you to get the total count of each index/column.
weekdays = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday', 'Total']
out2 = (pd.pivot_table(data=df, index='TIME', columns='DAY', aggfunc='count', margins=True, margins_name='Total')
          .droplevel(0, axis=1)
          .reindex(weekdays, axis=1)
       )
print(out2)
DAY Monday Tuesday Wednesday Thursday Friday Saturday Sunday Total
TIME
12 NaN NaN 1.0 NaN NaN 1.0 NaN 2
15 1.0 NaN NaN NaN 1.0 1.0 1.0 4
16 NaN NaN 1.0 1.0 1.0 NaN NaN 3
18 NaN NaN NaN NaN NaN 1.0 NaN 1
Total 1.0 NaN 2.0 1.0 2.0 3.0 1.0 10
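Either table can then be fed straight into a visual heatmap; here is a sketch under the assumption that seaborn is available (any heatmap tool works), using out from the first approach and treating NaN as a count of zero:
import seaborn as sns
import matplotlib.pyplot as plt

heat = out.set_index('TIME').fillna(0)  # NaN here just means zero events
sns.heatmap(heat, annot=True, cmap='viridis')
plt.show()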

maximum sum of consecutive n-days using pandas

I've seen solutions in other languages (e.g. SQL, Fortran, or C++) which mainly rely on for loops.
I am hoping that someone can help me solve this task using pandas instead.
If I have a data frame that looks like this.
date        pcp   sum_count  sumcum
7/13/2013   0.1   3.0        48.7
7/14/2013   48.5
7/15/2013   0.1
7/16/2013
8/1/2013    1.5   1.0        1.5
8/2/2013
8/3/2013
8/4/2013    0.1   2.0        3.6
8/5/2013    3.5
9/22/2013   0.3   3.0        26.3
9/23/2013   14.0
9/24/2013   12.0
9/25/2013
9/26/2013
10/1/2014   0.1   11.0
10/2/2014   96.0             135.5
10/3/2014   2.5
10/4/2014   37.0
10/5/2014   9.5
10/6/2014   26.5
10/7/2014   0.5
10/8/2014   25.5
10/9/2014   2.0
10/10/2014  5.5
10/11/2014  5.5
And I was hoping I could do the following:
STEP 1: create the sum_count column as the length of each run of consecutive non-zero (i.e. non-missing) values in the 'pcp' column.
STEP 2: create the sumcum column as the sum over each run of consecutive 'pcp' values.
STEP 3: create a pivot table that will look like this:
year max_sum_count
2013 48.7
2014 135.5
BUT!! the max_sum_count is based on the condition when sum_count = 3
I'd appreciate any help! thank you!
UPDATED QUESTION:
I previously emphasized that the result should only return the maximum over 3 consecutive pcps, but I mistakenly gave the wrong data frame and had to edit it. Sorry.
The sumcum of 135.5 came from 96.0 + 2.5 + 37.0. It is the maximum consecutive 3 pcps within the sum_count 11.
Thank you
Use:
#filtering + rolling by days
N = 3
df['date'] = pd.to_datetime(df['date'])
df = df.set_index('date')
#mask of missing pcp values
m = df['pcp'].isna()
#label each run of consecutive non-NaN rows
df['g'] = m.cumsum()[~m]
#extract years
df['year'] = df.index.year
#drop the NaN rows
df = df[~m].copy()
#keep only runs of at least N rows
df['sum_count1'] = df.groupby(['g', 'year'])['g'].transform('size')
df = df[df['sum_count1'].ge(N)].copy()
#rolling N-day sum within each run
df['sumcum1'] = (df.groupby(['g', 'year'])
                   .rolling(f'{N}D')['pcp']
                   .sum()
                   .reset_index(level=[0, 1], drop=True))
#take the yearly maximum, adding back any missing years
r = range(df['year'].min(), df['year'].max() + 1)
df1 = df.groupby('year')['sumcum1'].max().reindex(r).reset_index(name='max_sum_count')
print(df1)
year max_sum_count
0 2013 48.7
1 2014 135.5
First, convert date to a real datetime dtype and create a boolean mask that keeps the rows where pcp is not null. Then you can create groups and compute your variables:
Input data:
>>> df
date pcp
0 7/13/2013 0.1
1 7/14/2013 48.5
2 7/15/2013 0.1
3 7/16/2013 NaN
4 8/1/2013 1.5
5 8/2/2013 NaN
6 8/3/2013 NaN
7 8/4/2013 0.1
8 8/5/2013 3.5
9 9/22/2013 0.3
10 9/23/2013 14.0
11 9/24/2013 12.0
12 9/25/2013 NaN
13 9/26/2013 NaN
14 10/1/2014 0.1
15 10/2/2014 96.0
16 10/3/2014 2.5
17 10/4/2014 37.0
18 10/5/2014 9.5
19 10/6/2014 26.5
20 10/7/2014 0.5
21 10/8/2014 25.5
22 10/9/2014 2.0
23 10/10/2014 5.5
24 10/11/2014 5.5
Code:
df['date'] = pd.to_datetime(df['date'])
mask = df['pcp'].notna()
grp = df.loc[mask, 'date'] \
        .ne(df.loc[mask, 'date'].shift().add(pd.Timedelta(days=1))) \
        .cumsum()
df = df.join(df.reset_index()
               .groupby(grp)
               .agg(index=('index', 'first'),
                    sum_count=('pcp', 'size'),
                    sumcum=('pcp', 'sum'))
               .set_index('index'))
pivot = df.groupby(df['date'].dt.year)['sumcum'].max() \
          .rename('max_sum_count').reset_index()
Output results:
>>> df
date pcp sum_count sumcum
0 2013-07-13 0.1 3.0 48.7
1 2013-07-14 48.5 NaN NaN
2 2013-07-15 0.1 NaN NaN
3 2013-07-16 NaN NaN NaN
4 2013-08-01 1.5 1.0 1.5
5 2013-08-02 NaN NaN NaN
6 2013-08-03 NaN NaN NaN
7 2013-08-04 0.1 2.0 3.6
8 2013-08-05 3.5 NaN NaN
9 2013-09-22 0.3 3.0 26.3
10 2013-09-23 14.0 NaN NaN
11 2013-09-24 12.0 NaN NaN
12 2013-09-25 NaN NaN NaN
13 2013-09-26 NaN NaN NaN
14 2014-10-01 0.1 11.0 210.6
15 2014-10-02 96.0 NaN NaN
16 2014-10-03 2.5 NaN NaN
17 2014-10-04 37.0 NaN NaN
18 2014-10-05 9.5 NaN NaN
19 2014-10-06 26.5 NaN NaN
20 2014-10-07 0.5 NaN NaN
21 2014-10-08 25.5 NaN NaN
22 2014-10-09 2.0 NaN NaN
23 2014-10-10 5.5 NaN NaN
24 2014-10-11 5.5 NaN NaN
>>> pivot
date max_sum_count
0 2013 48.7
1 2014 210.6
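Note that this version sums whole runs, which is why 2014 shows 210.6 rather than the 135.5 the updated question asks for (the best sum over 3 consecutive days). A sketch of one way to add that condition on top, reusing mask and grp from above (roll3 and pivot3 are illustrative names):
# best 3-day sum inside each run; rolling(3) yields NaN for runs
# shorter than 3 days, so those runs drop out of the max automatically
roll3 = (df.loc[mask, 'pcp'].groupby(grp).rolling(3).sum()
           .reset_index(level=0, drop=True))
pivot3 = (roll3.groupby(df.loc[mask, 'date'].dt.year).max()
               .rename('max_sum_count').reset_index())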

How to apply a function/impute on an interval in Pandas

I have a Pandas dataset with a monthly Date-time index and a column of outstanding orders (like below):
Date        orders
1991-01-01  nan
1991-02-01  nan
1991-03-01  24
1991-04-01  nan
1991-05-01  nan
1991-06-01  nan
1991-07-01  nan
1991-08-01  34
1991-09-01  nan
1991-10-01  nan
1991-11-01  22
1991-12-01  nan
I want to linearly interpolate the values to fill the NaNs. However, it has to be applied within 6-month blocks (non-rolling). So, for example, one 6-month block would be all the rows between 1991-01-01 and 1991-06-01; within it we would do forward and backward linear imputation, anchoring the ends of the block at 0, so leading or trailing NaNs descend linearly toward a final value of 0. For the same dataset above, here is how I would like the end result to look:
Date        orders
1991-01-01  8
1991-02-01  16
1991-03-01  24
1991-04-01  18
1991-05-01  12
1991-06-01  6
1991-07-01  17
1991-08-01  34
1991-09-01  30
1991-10-01  26
1991-11-01  22
1991-12-01  11
I am lost on how to do this in Pandas however. Any ideas?
The idea is to group per 6 months, prepend and append a 0 value to each group, interpolate, and then drop those first and last 0 values per group:
df['Date'] = pd.to_datetime(df['Date'])
# pad each group with a leading and trailing 0, interpolate, then strip the pads
f = lambda x: pd.Series([0] + x.tolist() + [0]).interpolate().iloc[1:-1]
df['orders'] = (df.groupby(pd.Grouper(freq='6MS', key='Date'))['orders']
                  .transform(f))
print (df)
Date orders
0 1991-01-01 8.0
1 1991-02-01 16.0
2 1991-03-01 24.0
3 1991-04-01 18.0
4 1991-05-01 12.0
5 1991-06-01 6.0
6 1991-07-01 17.0
7 1991-08-01 34.0
8 1991-09-01 30.0
9 1991-10-01 26.0
10 1991-11-01 22.0
11 1991-12-01 11.0
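To see which rows fall into which 6-month bin, a quick inspection sketch (the 'block' column is just an illustrative name, not part of the answer):
# ngroup() numbers the bins produced by the Grouper, in order
df['block'] = df.groupby(pd.Grouper(freq='6MS', key='Date')).ngroup()
print(df[['Date', 'orders', 'block']])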

How to plot values of pandas dataframe with reference to a list (problems with indexing)?

I am looking for a clever way to produce a plot styled like a rather childish example I sketched (image not shown here), with source data like this:
days = ['Monday','Tuesday','Wednesday','Thursday','Friday']
Feature Values observed on
0 1 [5.5, 14.3, 12.0, 11.8] [Tuesday, Wednesday, Thursday, Friday]
1 2 [6.1, 14.6, 12.7] [Monday, Tuesday, Wednesday]
2 3 [15.2, 13.3] [Tuesday, Friday]
3 4 [14.9, 14.3, 17.0] [Monday, Thursday, Friday]
4 5 [13.0, 13.1, 13.5, 10.3] [Monday, Tuesday, Thursday, Friday]
5 6 [12.5, 7.0] [Wednesday, Friday]
In other words, for each line of this dataframe, I want to plot/connect the values for the "days" on which they were acquired. (Please note the days are here just to illustrate my problem, using datetime is not a solution.)
But I got lost in indexing.
This is how I prepared the figure (i.e. having vertical black lines for each day)
for count, log in enumerate(days):
    plt.plot(np.ones(len(allvalues)) * count, np.array(allvalues), 'k', linestyle='-', linewidth=1.)
plt.xticks(np.arange(0, 5, 1), ['M', 'T', 'W', 'T', 'F'])
and this works: I get my vertical lines and the labels. (Later I may want to plot other datasets instead of those vertical lines, but for now, the vertical lines are more illustrative.)
But now, how can I plot the values for each day?
for index, group in observations.iterrows():
    whichdays = group['observed on']
    values = group['Values']
    for d in whichdays:
        plt.plot(days[np.where(days == d)], values)
but this produces TypeError: list indices must be integers, not tuple
One possible solution is flattening the values out of the lists, pivoting, and then plotting:
import numpy as np
import pandas as pd
from itertools import chain

df2 = pd.DataFrame({
    "Feature": np.repeat(df.Feature.values, df.Values.str.len()),
    "Values": list(chain.from_iterable(df.Values)),
    "observed on": list(chain.from_iterable(df['observed on']))})
print(df2)
Feature Values observed on
0 1 5.5 Tuesday
1 1 14.3 Wednesday
2 1 12.0 Thursday
3 1 11.8 Friday
4 2 6.1 Monday
5 2 14.6 Tuesday
6 2 12.7 Wednesday
7 3 15.2 Tuesday
8 3 13.3 Friday
9 4 14.9 Monday
10 4 14.3 Thursday
11 4 17.0 Friday
12 5 13.0 Monday
13 5 13.1 Tuesday
14 5 13.5 Thursday
15 5 10.3 Friday
16 6 12.5 Wednesday
17 6 7.0 Friday
df = df2.pivot(index='observed on', columns='Feature', values='Values')
df.index.name = None
df.columns.name = None
print (df)
1 2 3 4 5 6
Friday 11.8 NaN 13.3 17.0 10.3 7.0
Monday NaN 6.1 NaN 14.9 13.0 NaN
Thursday 12.0 NaN NaN 14.3 13.5 NaN
Tuesday 5.5 14.6 15.2 NaN 13.1 NaN
Wednesday 14.3 12.7 NaN NaN NaN 12.5
df.plot(linestyle='-',linewidth=1.)
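One caveat: pivot sorts the index alphabetically (Friday first, as printed above), so the x-axis would be out of order. Reindexing with the days list from the question before plotting fixes that:
# restore weekday order on the index prior to plotting
df = df.reindex(days)
df.plot(linestyle='-', linewidth=1.)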
