I am working on converting a list of online classes into a heat map using Python & Pandas and I've come to a dead end. Right now, I have a data frame 'data' with some events containing a day of the week listed as 'DAY' and the time of the event in hours listed as 'TIME'. The dataset is displayed as follows:
ID TIME DAY
108 15 Saturday
110 15 Sunday
112 16 Wednesday
114 16 Friday
116 15 Monday
.. ... ...
639 12 Wednesday
640 12 Saturday
641 18 Saturday
642 16 Thursday
643 15 Friday
I'm looking for a way to sum repetitions of every 'TIME' value for every 'DAY' and then present these sums in a new table 'event_count'. I need to turn the linear data in my 'data' table into a more timetable-like form that can later be converted into a visual heatmap.
Sounds like a difficult transformation, but I feel like I'm missing something very obvious.
TIME Monday Tuesday Wednesday Thursday Friday Saturday Sunday
10 5 2 4 6 1 0 2
11 4 2 4 6 1 0 2
12 6 2 4 6 1 0 2
13 3 2 4 6 1 0 2
14 7 2 4 6 1 0 2
I tried to achieve this through pivot_table and stack; however, the best I got was a list of all days of the week with the mean of the time values. Could you advise me which direction I should look into and how I can approach solving this?
IIUC you can do something like this:
df is from your given example data.
import pandas as pd
df = pd.DataFrame({
'ID': [108, 110, 112, 114, 116, 639, 640, 641, 642, 643],
'TIME': [15, 15, 16, 16, 15, 12, 12, 18, 16, 15],
'DAY': ['Saturday','Sunday','Wednesday','Friday','Monday','Wednesday','Saturday','Saturday','Thursday','Friday']
})
weekdays = [ 'Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
out = (pd.crosstab(index=df['TIME'], columns=df['DAY'], values=df['TIME'], aggfunc='count')
       .sort_index(axis=0)          # sort rows by the index 'TIME'
       .reindex(weekdays, axis=1)   # put the columns in weekday order
       .rename_axis(None, axis=1)   # remove the columns-axis name 'DAY'
       .reset_index()               # move 'TIME' from the index to a column
      )
print(out)
TIME Monday Tuesday Wednesday Thursday Friday Saturday Sunday
0 12 NaN NaN 1.0 NaN NaN 1.0 NaN
1 15 1.0 NaN NaN NaN 1.0 1.0 1.0
2 16 NaN NaN 1.0 1.0 1.0 NaN NaN
3 18 NaN NaN NaN NaN NaN 1.0 NaN
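If plain counts are all that is needed, crosstab can also be called without values and aggfunc: by default it counts occurrences and puts 0 (not NaN) where a TIME/DAY combination never occurs, which is usually more convenient for a heatmap. A minimal sketch on the same ten sample rows; the name event_count is taken from the question, and the fill_value=0 for the missing Tuesday column is my assumption about what is wanted:

```python
import pandas as pd

df = pd.DataFrame({
    'ID': [108, 110, 112, 114, 116, 639, 640, 641, 642, 643],
    'TIME': [15, 15, 16, 16, 15, 12, 12, 18, 16, 15],
    'DAY': ['Saturday', 'Sunday', 'Wednesday', 'Friday', 'Monday',
            'Wednesday', 'Saturday', 'Saturday', 'Thursday', 'Friday'],
})
weekdays = ['Monday', 'Tuesday', 'Wednesday', 'Thursday',
            'Friday', 'Saturday', 'Sunday']

# With no values/aggfunc, crosstab returns integer frequency counts.
event_count = (pd.crosstab(df['TIME'], df['DAY'])
               # weekday order; days absent from the data become a column of zeros
               .reindex(columns=weekdays, fill_value=0))
print(event_count)
```

The resulting integer table can then be handed to a plotting function such as seaborn.heatmap.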
You were also on the right path with pivot_table. I'm not sure what was missing to get you the right result, but here is one approach with it. I added margins=True, since the totals for each row and column may also be useful to you.
weekdays = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday', 'Total']
out2 = (pd.pivot_table(data=df, index='TIME', columns='DAY', aggfunc='count',
                       margins=True, margins_name='Total')
        .droplevel(0, axis=1)
        .reindex(weekdays, axis=1)
       )
print(out2)
DAY Monday Tuesday Wednesday Thursday Friday Saturday Sunday Total
TIME
12 NaN NaN 1.0 NaN NaN 1.0 NaN 2
15 1.0 NaN NaN NaN 1.0 1.0 1.0 4
16 NaN NaN 1.0 1.0 1.0 NaN NaN 3
18 NaN NaN NaN NaN NaN 1.0 NaN 1
Total 1.0 NaN 2.0 1.0 2.0 3.0 1.0 10
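The NaN cells above are simply combinations that never occur. If zeros are preferred, pivot_table accepts fill_value. The sketch below is one possible variant, not the only way to do it: naming values='ID' explicitly is my choice (it keeps the columns single-level, so no droplevel is needed), and the sample frame is rebuilt so the snippet runs on its own.

```python
import pandas as pd

df = pd.DataFrame({
    'ID': [108, 110, 112, 114, 116, 639, 640, 641, 642, 643],
    'TIME': [15, 15, 16, 16, 15, 12, 12, 18, 16, 15],
    'DAY': ['Saturday', 'Sunday', 'Wednesday', 'Friday', 'Monday',
            'Wednesday', 'Saturday', 'Saturday', 'Thursday', 'Friday'],
})
columns_order = ['Monday', 'Tuesday', 'Wednesday', 'Thursday',
                 'Friday', 'Saturday', 'Sunday', 'Total']

counts_with_totals = (
    pd.pivot_table(df, values='ID', index='TIME', columns='DAY',
                   aggfunc='count', fill_value=0,
                   margins=True, margins_name='Total')
    # weekday order plus the margin; absent days become zero columns
    .reindex(columns=columns_order, fill_value=0)
)
print(counts_with_totals)
```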
We have data representing users' penalty counts over time, with some NaNs (a user's count only ever goes up). Below is a subset of the data:
import pandas as pd
import numpy as np
d = {'day':['Monday','Monday','Monday','Tuesday','Tuesday','Tuesday','Wednesday','Thursday','Thursday','Friday'],
'user_id': [1, 4,2,4,4,2,2,1,2,1], 'penalties_count': [1, 3,2,np.nan,4,2,np.nan,2,3,3]}
df = pd.DataFrame(data=d)
display(df)
day user_id penalties_count
0 Monday 1 1.0
1 Monday 4 3.0
2 Monday 2 2.0
3 Tuesday 4 NaN
4 Tuesday 4 4.0
5 Tuesday 2 2.0
6 Wednesday 2 NaN
7 Thursday 1 2.0
8 Thursday 2 3.0
9 Friday 1 3.0
The goal is to fill each NaN cell with the previous value, but only from rows with the same user_id. The result should be:
day user_id penalties_count
0 Monday 1 1.0
1 Monday 4 3.0
2 Monday 2 2.0
3 Tuesday 4 3.0
4 Tuesday 4 4.0
5 Tuesday 2 2.0
6 Wednesday 2 2.0
7 Thursday 1 2.0
8 Thursday 2 3.0
9 Friday 1 3.0
But when I use
df.fillna(method='bfill')
The result is incorrect in line 4 for user_id=4 (we should see 3 here, not 4):
day user_id penalties_count
0 Monday 1 1.0
1 Monday 4 3.0
2 Monday 2 2.0
3 Tuesday 4 4.0
4 Tuesday 4 4.0
5 Tuesday 2 2.0
6 Wednesday 2 2.0
7 Thursday 1 2.0
8 Thursday 2 3.0
9 Friday 1 3.0
What can fix the issue?
If you want to fill NaN per group, you need to group with groupby before filling. It also looks like you need ffill rather than bfill, for example: df.groupby("user_id")["penalties_count"].ffill()
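Spelled out on the sample data from the question, a minimal sketch looks like this. It assumes the rows are already in chronological order, since ffill only looks at row order within each user_id:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'day': ['Monday', 'Monday', 'Monday', 'Tuesday', 'Tuesday',
            'Tuesday', 'Wednesday', 'Thursday', 'Thursday', 'Friday'],
    'user_id': [1, 4, 2, 4, 4, 2, 2, 1, 2, 1],
    'penalties_count': [1, 3, 2, np.nan, 4, 2, np.nan, 2, 3, 3],
})

# Forward-fill within each user only; the result keeps the original row
# index, so it can be assigned straight back to the column.
df['penalties_count'] = df.groupby('user_id')['penalties_count'].ffill()
print(df)
```

A user whose first record is NaN would stay NaN, because there is no earlier value of that user to carry forward.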
I have a dataframe, df that looks like this
Date Value
10/1/2019 5
10/2/2019 10
10/3/2019 15
10/4/2019 20
10/5/2019 25
10/6/2019 30
10/7/2019 35
I would like to calculate the delta for a period of 7 days
Desired output:
Date Delta
10/1/2019 30
This is what I am doing (another user helped me with a variation of the code below):
df['Delta']=df.iloc[0:,1].sub(df.iloc[6:,1]), Date=pd.Series
(pd.date_range(pd.Timestamp('2019-10-01'),
periods=7, freq='7d'))[['Delta','Date']]
Any suggestions are appreciated.
Let us try shift
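df['Date'] = pd.to_datetime(df['Date'])  # shift(freq=...) needs real datetimes, not strings
s = df.set_index('Date')['Value']
# move every value 6 days earlier, so each date lines up with the value seen 6 days later
df['New'] = s.shift(freq = '-6 D').reindex(s.index).values
df['DIFF'] = df['New'] - df['Value']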
df
Out[39]:
Date Value New DIFF
0 2019-10-01 5 35.0 30.0
1 2019-10-02 10 NaN NaN
2 2019-10-03 15 NaN NaN
3 2019-10-04 20 NaN NaN
4 2019-10-05 25 NaN NaN
5 2019-10-06 30 NaN NaN
6 2019-10-07 35 NaN NaN
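If the frame is known to have exactly one row per day with no gaps, the same delta can be obtained positionally, without touching the index. This is only a sketch under that assumption; with missing dates the freq-based shift above is the safer choice. The column name Delta follows the desired output in the question.

```python
import pandas as pd

df = pd.DataFrame({
    'Date': pd.date_range('2019-10-01', periods=7, freq='D'),
    'Value': [5, 10, 15, 20, 25, 30, 35],
})

# shift(-6) pulls the value from 6 rows (here: 6 days) later up to the
# current row; rows without a partner 6 rows ahead become NaN.
df['Delta'] = df['Value'].shift(-6) - df['Value']
print(df.dropna(subset=['Delta'])[['Date', 'Delta']])
```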
I have the following df
import pandas as pd
import datetime as dt
start='2020-01-01'
end='2021-12-31'
df = pd.DataFrame({"Date": pd.date_range(start, end)})
df['Day'] = df['Date'].dt.day
df['Day_name'] = df[['Date']].apply(lambda x: dt.datetime.strftime(x['Date'], '%A'), axis=1)
I want to add another column, df['wk'], that loops through the dates and assigns a custom week number starting from a specific date.
For example, wk 1 will start on 2020-01-03 and run 7 days until 2020-01-09, wk 2 will run from 2020-01-10 until 2020-01-16, and so on, always moving 7 days.
How can I do this in Python?
I am thinking it should be something like this:
for i, row in df.iterrows():
    df.loc[i, 'wk'] = row['Date'] + dt.timedelta(days=7)
But this just adds 7 days to the current date; it does not store the week number. I need a little guidance on how to do this.
You can try something like this:
#You can create the Day_name column without apply
df['Day_name'] = df['Date'].dt.day_name()
#Solution starts here
week_start = '2020-01-03' # input date
i = df.loc[df['Date'] == week_start,'Date'].dt.day_name().iloc[0]
df['wk'] = df['Day_name'].eq(i).cumsum()
print(df.head(12))
Date Day Day_name wk
0 2020-01-01 1 Wednesday 0.0
1 2020-01-02 2 Thursday 0.0
2 2020-01-03 3 Friday 1.0
3 2020-01-04 4 Saturday 1.0
4 2020-01-05 5 Sunday 1.0
5 2020-01-06 6 Monday 1.0
6 2020-01-07 7 Tuesday 1.0
7 2020-01-08 8 Wednesday 1.0
8 2020-01-09 9 Thursday 1.0
9 2020-01-10 10 Friday 2.0
10 2020-01-11 11 Saturday 2.0
11 2020-01-12 12 Sunday 2.0
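The cumsum approach counts occurrences of the starting weekday, so it relies on every date being present in the frame. An alternative sketch that computes the week number arithmetically from the number of days elapsed since the start date is shown below. Mapping all dates before the start to week 0 is my assumption, chosen to match the output above.

```python
import pandas as pd

df = pd.DataFrame({'Date': pd.date_range('2020-01-01', '2021-12-31')})
week_start = pd.Timestamp('2020-01-03')  # first day of wk 1

# Whole days elapsed since the start date (negative before it).
days_since_start = (df['Date'] - week_start).dt.days

# Days 0..6 -> wk 1, days 7..13 -> wk 2, and so on.
# Floor division sends earlier dates to 0 or below; clip them all to 0.
df['wk'] = (days_since_start // 7 + 1).clip(lower=0)
print(df.head(12))
```

Because each row is computed independently, this still gives the right week when some dates are missing or the rows are unsorted.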
I am looking for a clever way to produce a plot styled like this rather childish example:
with source data like this:
days = ['Monday','Tuesday','Wednesday','Thursday','Friday']
Feature Values observed on
0 1 [5.5, 14.3, 12.0, 11.8] [Tuesday, Wednesday, Thursday, Friday]
1 2 [6.1, 14.6, 12.7] [Monday, Tuesday, Wednesday]
2 3 [15.2, 13.3] [Tuesday, Friday]
3 4 [14.9, 14.3, 17.0] [Monday, Thursday, Friday]
4 5 [13.0, 13.1, 13.5, 10.3] [Monday, Tuesday, Thursday, Friday]
5 6 [12.5, 7.0] [Wednesday, Friday]
In other words, for each line of this dataframe, I want to plot/connect the values for the "days" on which they were acquired. (Please note the days are here just to illustrate my problem, using datetime is not a solution.)
But I got lost in indexing.
This is how I prepared the figure (i.e. having vertical black lines for each day)
for count, log in enumerate(days):
    plt.plot(np.ones(len(allvalues))*count,np.array(allvalues),'k',linestyle='-',linewidth=1.)
plt.xticks(np.arange(0,5,1),['M','T','W','T','F'])
and this works, I get my vertical lines and the labels. (later I may want to plot other datasets instead of those vertical lines, but for now, the vertical lines are more illustrative)
But now, how can I plot the values for each day?
for index, group in observations.iterrows():
    whichdays = group['observed on']
    values = group['Values']
    for d in whichdays:
        plt.plot(days[np.where(days==d)],values)
but this produces TypeError: list indices must be integers, not tuple
One possible solution is to flatten the values from the lists, pivot, and then plot:
from itertools import chain
df2 = pd.DataFrame({
"Feature": np.repeat(df.Feature.values, df.Values.str.len()),
"Values": list(chain.from_iterable(df.Values)),
"observed on": list(chain.from_iterable(df['observed on']))})
print (df2)
Feature Values observed on
0 1 5.5 Tuesday
1 1 14.3 Wednesday
2 1 12.0 Thursday
3 1 11.8 Friday
4 2 6.1 Monday
5 2 14.6 Tuesday
6 2 12.7 Wednesday
7 3 15.2 Tuesday
8 3 13.3 Friday
9 4 14.9 Monday
10 4 14.3 Thursday
11 4 17.0 Friday
12 5 13.0 Monday
13 5 13.1 Tuesday
14 5 13.5 Thursday
15 5 10.3 Friday
16 6 12.5 Wednesday
17 6 7.0 Friday
df = df2.pivot(index='observed on', columns='Feature', values='Values')
df.index.name = None
df.columns.name = None
print (df)
1 2 3 4 5 6
Friday 11.8 NaN 13.3 17.0 10.3 7.0
Monday NaN 6.1 NaN 14.9 13.0 NaN
Thursday 12.0 NaN NaN 14.3 13.5 NaN
Tuesday 5.5 14.6 15.2 NaN 13.1 NaN
Wednesday 14.3 12.7 NaN NaN NaN 12.5
df.plot(linestyle='-',linewidth=1.)
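Two details of the result above are worth a note: pivot sorts the index alphabetically (Friday, Monday, ...), so the x-axis is not in weekday order, and newer pandas versions can do the flattening directly. The sketch below assumes pandas 1.3 or later (for multi-column DataFrame.explode) and rebuilds the sample frame under the name observations used in the question; the names long and wide are mine.

```python
import pandas as pd

days = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday']
observations = pd.DataFrame({
    'Feature': [1, 2, 3, 4, 5, 6],
    'Values': [[5.5, 14.3, 12.0, 11.8], [6.1, 14.6, 12.7], [15.2, 13.3],
               [14.9, 14.3, 17.0], [13.0, 13.1, 13.5, 10.3], [12.5, 7.0]],
    'observed on': [['Tuesday', 'Wednesday', 'Thursday', 'Friday'],
                    ['Monday', 'Tuesday', 'Wednesday'],
                    ['Tuesday', 'Friday'],
                    ['Monday', 'Thursday', 'Friday'],
                    ['Monday', 'Tuesday', 'Thursday', 'Friday'],
                    ['Wednesday', 'Friday']],
})

# One row per (Feature, day, value); both list columns are exploded together.
long = observations.explode(['Values', 'observed on'])
long['Values'] = long['Values'].astype(float)  # explode leaves object dtype

# Days as rows, features as columns, rows put back into weekday order.
wide = (long.pivot(index='observed on', columns='Feature', values='Values')
            .reindex(days))
print(wide)
```

Calling wide.plot(marker='o') then draws one line per feature. Matplotlib does not draw a segment across a NaN, so a feature observed on non-adjacent days (Feature 3, for instance) shows up as separate markers unless its column is plotted with the NaNs dropped.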
Is there an efficient way to find the sum of values whose absolute value is larger than the median of the row in a pandas data frame?
For example:
Monday Tuesday Wednesday Thursday Friday Saturday
0 2.2 4.4 0.5 9 4 3
1 2 4 1 8 4 5
2 1.8 4.5 0.9 8 1 15
3 4 1 5 10 4 5
…
How to generate the sum of numbers in each row which are larger than the median of the corresponding row? What about 25 percentile or 75 percentile?
I think you want this:
In [19]:
df[df.gt(df.median(axis=1), axis=0)]
Out[19]:
Monday Tuesday Wednesday Thursday Friday Saturday
0 NaN 4.4 NaN 9 4 NaN
1 NaN NaN NaN 8 NaN 5
2 NaN 4.5 NaN 8 NaN 15
3 NaN NaN 5 10 NaN 5
This uses .gt (greater than) to compare every value with the median of its row: df.median(axis=1) computes the medians row-wise, and axis=0 in .gt aligns them with the row index.
You can then call sum on this:
In [20]:
df[df.gt(df.median(axis=1), axis=0)].sum()
Out[20]:
Monday NaN
Tuesday 8.9
Wednesday 5.0
Thursday 35.0
Friday 4.0
Saturday 25.0
dtype: float64
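Note that sum() with no arguments adds down each column, which is what the output above shows. For the per-row totals the question asks about, passing axis=1 is probably what is wanted. A minimal sketch on the four sample rows (df.where is used here as an equivalent of the boolean-mask indexing above; the names above_median and row_sums are mine):

```python
import pandas as pd

df = pd.DataFrame({
    'Monday':    [2.2, 2, 1.8, 4],
    'Tuesday':   [4.4, 4, 4.5, 1],
    'Wednesday': [0.5, 1, 0.9, 5],
    'Thursday':  [9, 8, 8, 10],
    'Friday':    [4, 4, 1, 4],
    'Saturday':  [3, 5, 15, 5],
})

# Keep only the values strictly above their own row's median; others become NaN.
above_median = df.where(df.gt(df.median(axis=1), axis=0))

# axis=1 sums across the columns of each row; NaN cells are skipped.
row_sums = above_median.sum(axis=1)
print(row_sums)
```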
And to extend @EdChum's answer to get the quantiles:
quantile = 0.75 # 0.25, 0.5, 0.75, etc.
df[df.gt(df.quantile(q=quantile, axis=1), axis=0)].sum(axis=1)
Given that there are only seven days in a week, I'm not sure if this will do as intended unless you have more columns than shown. Do you want the quantiles by column instead of row?
Since you want to sum the values in each row that are greater than the row's median while keeping the day columns, the approach below works fine:
import numpy as np
def func(row):
    # sum of the entries strictly above the row's 50th percentile (the median)
    return row[row > np.percentile(row, 50)].sum()
func is then applied to each row of df:
In [67]: df['rule'] = df.apply(func, axis=1)
In [68]: df
Out[68]:
Monday Tuesday Wednesday Thursday Friday Saturday rule
0 2.2 4.4 0.5 9 4 3 17.4
1 2.0 4.0 1.0 8 4 5 13.0
2 1.8 4.5 0.9 8 1 15 27.5
3 4.0 1.0 5.0 10 4 5 20.0
And, for different quantiles, you could use [25, 50, 75] in np.percentile(row, x)
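As a sketch of that last remark, the percentile can be made a parameter and passed through apply with args. The function name sum_above_percentile and the above_pXX column names are made up for illustration; np.percentile is used with its default linear interpolation.

```python
import numpy as np
import pandas as pd

def sum_above_percentile(row, q):
    """Sum the entries of row that are strictly above its q-th percentile."""
    return row[row > np.percentile(row, q)].sum()

df = pd.DataFrame({
    'Monday':    [2.2, 2, 1.8, 4],
    'Tuesday':   [4.4, 4, 4.5, 1],
    'Wednesday': [0.5, 1, 0.9, 5],
    'Thursday':  [9, 8, 8, 10],
    'Friday':    [4, 4, 1, 4],
    'Saturday':  [3, 5, 15, 5],
})
day_cols = list(df.columns)  # fix the day columns before adding result columns

for q in (25, 50, 75):
    # restrict to day_cols so earlier results do not leak into later rows
    df[f'above_p{q}'] = df[day_cols].apply(sum_above_percentile, axis=1, args=(q,))
print(df)
```

Selecting df[day_cols] inside the loop matters: applying the function to the whole frame after the first result column has been added would include that column in the next percentile.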