Using pandas fillna with bfill method for particular cells - python

We have data representing each user's penalty count, which contains NaNs and changes over time (the value only goes up). Below is a subset of the data:
import pandas as pd
import numpy as np
d = {'day': ['Monday', 'Monday', 'Monday', 'Tuesday', 'Tuesday', 'Tuesday', 'Wednesday', 'Thursday', 'Thursday', 'Friday'],
     'user_id': [1, 4, 2, 4, 4, 2, 2, 1, 2, 1],
     'penalties_count': [1, 3, 2, np.nan, 4, 2, np.nan, 2, 3, 3]}
df = pd.DataFrame(data=d)
display(df)
day user_id penalties_count
0 Monday 1 1.0
1 Monday 4 3.0
2 Monday 2 2.0
3 Tuesday 4 NaN
4 Tuesday 4 4.0
5 Tuesday 2 2.0
6 Wednesday 2 NaN
7 Thursday 1 2.0
8 Thursday 2 3.0
9 Friday 1 3.0
The goal is to fill each NaN cell with the previous value for that particular user_id. The result should be:
day user_id penalties_count
0 Monday 1 1.0
1 Monday 4 3.0
2 Monday 2 2.0
3 Tuesday 4 3.0
4 Tuesday 4 4.0
5 Tuesday 2 2.0
6 Wednesday 2 2.0
7 Thursday 1 2.0
8 Thursday 2 3.0
9 Friday 1 3.0
But when I use
df.fillna(method='bfill')
the result is wrong in row 3 for user_id=4 (we should see 3.0 there, not 4.0):
day user_id penalties_count
0 Monday 1 1.0
1 Monday 4 3.0
2 Monday 2 2.0
3 Tuesday 4 4.0
4 Tuesday 4 4.0
5 Tuesday 2 2.0
6 Wednesday 2 2.0
7 Thursday 1 2.0
8 Thursday 2 3.0
9 Friday 1 3.0
How can I fix this?

If you want to fill NA by group, you need to groupby first and then fill. It also looks like you need ffill (forward fill from the previous value), not bfill: df.groupby("user_id")["penalties_count"].ffill()
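For example, assigning the result back to the column (a minimal sketch using the df defined above; note that in recent pandas versions fillna(method=...) is deprecated in favor of the dedicated .ffill()/.bfill() methods):
df['penalties_count'] = df.groupby('user_id')['penalties_count'].ffill()
display(df)
This reproduces the expected table: row 3 is filled with 3.0 from user 4's Monday value, and row 6 with 2.0 from user 2's Tuesday value.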

Related

Converting a data frame of events into a timetable format

I am working on converting a list of online classes into a heat map using Python and pandas, and I've come to a dead end. Right now I have a data frame 'data' with some events, containing a day of the week listed as 'DAY' and the time of the event in hours listed as 'TIME'. The dataset is displayed as follows:
ID TIME DAY
108 15 Saturday
110 15 Sunday
112 16 Wednesday
114 16 Friday
116 15 Monday
.. ... ...
639 12 Wednesday
640 12 Saturday
641 18 Saturday
642 16 Thursday
643 15 Friday
I'm looking for a way to count the repetitions of every 'TIME' value for every 'DAY' and then present these counts in a new table 'event_count'. I need to turn the linear data in my 'data' table into a more timetable-like form that can later be converted into a visual heatmap.
It sounds like a difficult transformation, but I feel like I'm missing something very obvious. The result should look something like:
TIME Monday Tuesday Wednesday Thursday Friday Saturday Sunday
10 5 2 4 6 1 0 2
11 4 2 4 6 1 0 2
12 6 2 4 6 1 0 2
13 3 2 4 6 1 0 2
14 7 2 4 6 1 0 2
I tried to achieve this through pivot_table and stack; however, the best I got was a list of all days of the week with mean averages for time. Could you advise me on which direction I should look into and how I can approach solving this?
IIUC you can do something like this:
Here df is built from your example data:
import pandas as pd
df = pd.DataFrame({
    'ID': [108, 110, 112, 114, 116, 639, 640, 641, 642, 643],
    'TIME': [15, 15, 16, 16, 15, 12, 12, 18, 16, 15],
    'DAY': ['Saturday', 'Sunday', 'Wednesday', 'Friday', 'Monday', 'Wednesday', 'Saturday', 'Saturday', 'Thursday', 'Friday']
})
weekdays = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
out = (pd.crosstab(index=df['TIME'], columns=df['DAY'], values=df['TIME'], aggfunc='count')
         .sort_index(axis=0)         # sort by the index 'TIME'
         .reindex(weekdays, axis=1)  # order the columns by weekday
         .rename_axis(None, axis=1)  # drop the columns-axis name 'DAY'
         .reset_index()              # move 'TIME' from the index to a column
      )
print(out)
TIME Monday Tuesday Wednesday Thursday Friday Saturday Sunday
0 12 NaN NaN 1.0 NaN NaN 1.0 NaN
1 15 1.0 NaN NaN NaN 1.0 1.0 1.0
2 16 NaN NaN 1.0 1.0 1.0 NaN NaN
3 18 NaN NaN NaN NaN NaN 1.0 NaN
You were also on the right track with pivot_table. I'm not sure what was missing to get you the right result, but here is one approach with it. I added margins=True, in case the total count of each row/column is also interesting for you.
weekdays = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday', 'Total']
out2 = (pd.pivot_table(data=df, index='TIME', columns='DAY', aggfunc='count', margins=True, margins_name='Total')
          .droplevel(0, axis=1)       # drop the 'ID' level produced by counting
          .reindex(weekdays, axis=1)  # order the columns by weekday, 'Total' last
       )
print(out2)
DAY Monday Tuesday Wednesday Thursday Friday Saturday Sunday Total
TIME
12 NaN NaN 1.0 NaN NaN 1.0 NaN 2
15 1.0 NaN NaN NaN 1.0 1.0 1.0 4
16 NaN NaN 1.0 1.0 1.0 NaN NaN 3
18 NaN NaN NaN NaN NaN 1.0 NaN 1
Total 1.0 NaN 2.0 1.0 2.0 3.0 1.0 10
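Since the end goal is a heatmap: a NaN here just means a count of zero, so a plausible final step (an assumption about the intended plot input; heat is a name introduced here) is:
heat = out.set_index('TIME').fillna(0).astype(int)  # a missing TIME/DAY combination is a count of 0
print(heat)
The resulting integer frame can be passed directly to a plotting function such as seaborn.heatmap.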

Sort dataframe per column reassigning indexes

Given this dataframe:
print(df)
0 1 2
0 354.7 April 4.0
1 55.4 August 8.0
2 176.5 December 12.0
3 95.5 February 2.0
4 85.6 January 1.0
5 152 July 7.0
6 238.7 June 6.0
7 104.8 March 3.0
8 283.5 May 5.0
9 278.8 November 11.0
10 249.6 October 10.0
11 212.7 September 9.0
If I sort by column 2 using df.sort_values('2'), I get:
0 1 2
4 85.6 January 1.0
3 95.5 February 2.0
7 104.8 March 3.0
0 354.7 April 4.0
8 283.5 May 5.0
6 238.7 June 6.0
5 152.0 July 7.0
1 55.4 August 8.0
11 212.7 September 9.0
10 249.6 October 10.0
9 278.8 November 11.0
2 176.5 December 12.0
Is there a smart way to re-define the index column (from 0 to 11) preserving the new order I got?
Use reset_index:
df.sort_values('2').reset_index(drop=True)
Alternatively (this replaces the contents of the original dataframe in place):
df[:] = df.sort_values('2').values
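If you are on pandas 1.0 or newer, sort_values can also reset the index in the same call via its ignore_index parameter:
df.sort_values('2', ignore_index=True)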

Python - Add rows for missing years in a pandas dataframe

I have a dataframe like this:
import pandas as pd
import numpy as np
index = [0, 1, 2, 3, 4, 5]
s = pd.Series([1, 1, 1, 2, 2, 2], index=index)
t = pd.Series([2007, 2008, 2011, 2006, 2007, 2009], index=index)
f = pd.Series([2, 4, 6, 8, 10, 12], index=index)
pp = pd.DataFrame(np.c_[s, t, f], columns=['group', 'year', 'amount'])
pp
group year amount
0 1 2007 2
1 1 2008 4
2 1 2011 6
3 2 2006 8
4 2 2007 10
5 2 2009 12
I want to add rows for the missing years in each group. My desired dataframe looks like this:
group year amount
0 1.0 2007 2.0
1 1.0 2008 4.0
2 1.0 2009 NaN
3 1.0 2010 NaN
4 1.0 2011 6.0
5 2.0 2006 8.0
6 2.0 2007 10.0
7 2.0 2008 NaN
8 2.0 2009 12.0
Is there any way to do it for a large dataframe?
First convert year to datetime (using df for the pp frame built above):
df = pp.copy()
df.year = pd.to_datetime(df.year, format='%Y')
Then set_index with groupby and resample('Y'), which inserts a row for every missing year-end:
df.set_index('year').groupby('group').amount.resample('Y').mean().reset_index()
group year amount
0 1 2007-12-31 2.0
1 1 2008-12-31 4.0
2 1 2009-12-31 NaN
3 1 2010-12-31 NaN
4 1 2011-12-31 6.0
5 2 2006-12-31 8.0
6 2 2007-12-31 10.0
7 2 2008-12-31 NaN
8 2 2009-12-31 12.0
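If you prefer plain integer years rather than year-end timestamps, one extra step converts the column back (a small sketch; out is a name introduced here):
out = df.set_index('year').groupby('group').amount.resample('Y').mean().reset_index()
out['year'] = out['year'].dt.year  # 2007-12-31 -> 2007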

Sum of values larger than median of each row in pandas dataframes

Is there an efficient way to find the sum of values whose absolute value is larger than the median of the row in a pandas data frame?
For example:
Monday Tuesday Wednesday Thursday Friday Saturday
0 2.2 4.4 0.5 9 4 3
1 2 4 1 8 4 5
2 1.8 4.5 0.9 8 1 15
3 4 1 5 10 4 5
…
How do I generate, for each row, the sum of the numbers that are larger than that row's median? What about the 25th or 75th percentile?
I think you want this:
In [19]:
df[df.gt(df.median(axis=1), axis=0)]
Out[19]:
Monday Tuesday Wednesday Thursday Friday Saturday
0 NaN 4.4 NaN 9 4 NaN
1 NaN NaN NaN 8 NaN 5
2 NaN 4.5 NaN 8 NaN 15
3 NaN NaN 5 10 NaN 5
This uses .gt (greater than) to compare each value against its own row's median: median(axis=1) computes the median row-wise, and axis=0 in .gt aligns that series of medians with the row index.
You can then call sum on this (note that a bare sum() sums down each column; pass axis=1 for per-row sums, as in the extension below):
In [20]:
df[df.gt(df.median(axis=1), axis=0)].sum()
Out[20]:
Monday NaN
Tuesday 8.9
Wednesday 5.0
Thursday 35.0
Friday 4.0
Saturday 25.0
dtype: float64
And to extend @EdChum's answer to arbitrary quantiles, summed per row with axis=1:
quantile = 0.75 # 0.25, 0.5, 0.75, etc.
df[df.gt(df.quantile(q=quantile, axis=1), axis=0)].sum(axis=1)
Given that there are only seven days in a week, I'm not sure if this will do as intended unless you have more columns than shown. Do you want the quantiles by column instead of row?
Since you want to sum the values in each row that are greater than the row's median, and if you want to preserve the day columns, the approach below works fine:
import numpy as np

def func(row):
    return row[row > np.percentile(row, 50)].sum()
The func function is then applied row-wise to df:
In [67]: df['rule'] = df.apply(func, axis=1)
In [68]: df
Out[68]:
Monday Tuesday Wednesday Thursday Friday Saturday rule
0 2.2 4.4 0.5 9 4 3 17.4
1 2.0 4.0 1.0 8 4 5 13.0
2 1.8 4.5 0.9 8 1 15 27.5
3 4.0 1.0 5.0 10 4 5 20.0
And for different quantiles, you can pass 25, 50, or 75 as the second argument of np.percentile(row, q).
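Putting the two ideas together, a parametrized sketch (sum_above and q are names introduced here; it assumes df is the original six-column frame, before the rule column was added):
import numpy as np

def sum_above(row, q=50):
    # sum the values in this row that exceed the row's q-th percentile
    return row[row > np.percentile(row, q)].sum()

result_50 = df.apply(sum_above, axis=1)                         # default q=50, the median
result_75 = df.apply(lambda row: sum_above(row, q=75), axis=1)  # 75th percentile
Computing the results into separate series, rather than assigning new columns one by one, avoids the pitfall of a freshly added column being included in the next row-wise percentile.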

How to merge rows in DataFrame according to unique elements and get averages?

I'm struggling to figure out how to achieve this. I'm trying to get the average price for each day and hour combination. So a DataFrame like
day hour price booked
0 monday 7 12.0 True
1 monday 8 12.0 False
2 tuesday 7 13.0 True
3 tuesday 8 13.0 False
4 monday 7 15.0 True
5 monday 8 13.0 False
6 tuesday 7 13.0 True
7 tuesday 8 15.0 False
should give something like:
day hour avg. price
0 monday 7 13
1 monday 8 12.5
2 tuesday 7 13.0
3 tuesday 8 14.0
I would like this to generalize to larger data sets.
You can groupby the day and hour columns and then call mean on the price column:
In [46]:
df.groupby(['day','hour'])['price'].mean()
Out[46]:
day hour
monday 7 13.5
8 12.5
tuesday 7 13.0
8 14.0
Name: price, dtype: float64
To restore the day and hour back as columns you can call reset_index:
In [47]:
df.groupby(['day','hour'])['price'].mean().reset_index()
Out[47]:
day hour price
0 monday 7 13.5
1 monday 8 12.5
2 tuesday 7 13.0
3 tuesday 8 14.0
You can also rename the column if you desire:
In [48]:
avg = df.groupby(['day','hour'])['price'].mean().reset_index()
avg.rename(columns={'price':'avg_price'},inplace=True)
avg
Out[48]:
day hour avg_price
0 monday 7 13.5
1 monday 8 12.5
2 tuesday 7 13.0
3 tuesday 8 14.0
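If you are on pandas 0.25 or newer, named aggregation collapses the mean, reset_index, and rename steps into a single call (a stylistic alternative that produces the same table):
avg = df.groupby(['day', 'hour'], as_index=False).agg(avg_price=('price', 'mean'))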
