Applying a function to a multi-index dataframe - python

I want to apply an operation to the following data frame:
index        date  username  count
0      2015-11-01         1     16
1      2015-11-01         2      1
2      2015-11-01         3      1
3      2015-10-01         1      2
4      2015-10-01         4     29
5      2015-10-01         5      1
6      2014-09-01         1      3
7      2014-09-01         3      1
8      2014-09-01         4      1
And apply an operation that will get it to this:
index        date  mean
0      2015-11-01     6
1      2015-10-01  10.7
2      2014-09-01   1.7
The calculation takes the sum of all counts for a given date (e.g. for 2015-11-01 it is 16+1+1=18) and divides it by the number of unique usernames for that date (e.g. for 2015-10-01 there are 3). A new column is created to record the calculation; in this case we have called it mean.
I have been trying to use the 'apply' method from DataFrame, but without success so far. Help would be very much appreciated. Thanks

You can use GroupBy + sum divided by GroupBy + nunique:

g = df.groupby('date')

res = g['count'].sum().div(g['username'].nunique()) \
                      .rename('mean').reset_index()

print(res)

         date       mean
0  2014-09-01   1.666667
1  2015-10-01  10.666667
2  2015-11-01   6.000000
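Equivalently, both aggregates can be computed in a single groupby pass with named aggregation (a minimal sketch; named aggregation requires pandas >= 0.25):

# one pass over the groups, then divide the two aggregate columns
agg = df.groupby('date').agg(total=('count', 'sum'),
                             users=('username', 'nunique'))
res = (agg['total'] / agg['users']).rename('mean').reset_index()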

Python - Tell if there is a non consecutive date in pandas dataframe

I have a pandas data frame with dates. I need to know if every other date pair is consecutive.
2 1988-01-01
3 2015-01-31
4 2015-02-01
5 2015-05-31
6 2015-06-01
7 2021-11-16
11 2021-11-17
12 2022-10-05
8 2022-10-06
9 2022-10-12
10 2022-10-13
# How to build this example dataframe
df = pd.DataFrame({'date': pd.to_datetime(
    ['1988-01-01', '2015-01-31', '2015-02-01', '2015-05-31', '2015-06-01',
     '2021-11-16', '2021-11-17', '2022-10-05', '2022-10-06', '2022-10-12',
     '2022-10-13'])})
Each pair should be consecutive. I have tried different sorting but everything I see relates to the entire series being consecutive. I need to compare each pair of dates after the first date.
cb_gap = cb_sorted.sort_values('dates').groupby('dates').diff() > pd.to_timedelta('1 day')
What I need to see is this...
2 1988-01-01 <- Ignore the start date
3 2015-01-31 <- these dates have no gap
4 2015-02-01
5 2015-05-31 <- these dates have no gap
6 2015-06-01
7 2021-11-16 <- these have a gap!!!!
11 2021-11-18
12 2022-10-05 <- these have no gap
8 2022-10-06
9 2022-10-12
One way is to use shift and compute differences.
pd.DataFrame({'date': df.date, 'diff': df.date.shift(-1) - df.date})[1::2]
returns
        date   diff
1 2015-01-31 1 days
3 2015-05-31 1 days
5 2021-11-16 1 days
7 2022-10-05 1 days
9 2022-10-12 1 days
It is also faster:

Method      Timeit
Naveed's    4.23 ms
This one    0.93 ms
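If you want an explicit flag rather than eyeballing the diff column, a small follow-up sketch (same df as above; has_gap is a hypothetical column name) marks the pairs whose gap exceeds one day:

# pair up alternating rows and flag pairs more than one day apart
pairs = pd.DataFrame({'date': df.date, 'diff': df.date.shift(-1) - df.date})[1::2]
pairs['has_gap'] = pairs['diff'] > pd.Timedelta(days=1)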
Here is one way to do it.
By the way, what is your expected output? This answer gets you the difference between consecutive dates, skipping the first row, and populates a diff column.
# make date into datetime
df['date'] = pd.to_datetime(df['date'])

# create two intermediate DFs, skipping the first row and taking alternate values,
# and concat them along the column axis
# (assumes the frame also has an 'id' column, as in the printed output below)
df2 = pd.concat([df.iloc[1:].iloc[::2].reset_index()[['id', 'date']],
                 df.iloc[2:].iloc[::2].reset_index()[['id', 'date']]],
                axis=1)

# take the difference of the second date from the first one
df2['diff'] = df2.iloc[:, 3] - df2.iloc[:, 1]
df2
   id       date  id       date   diff
0   3 2015-01-31   4 2015-02-01 1 days
1   5 2015-05-31   6 2015-06-01 1 days
2   7 2021-11-16  11 2021-11-17 1 days
3  12 2022-10-05   8 2022-10-06 1 days
4   9 2022-10-12  10 2022-10-13 1 days

Number of active IDs in each period

I have a dataframe that looks like this
ID |   START    |    END
 1 | 2016-12-31 | 2017-02-30
 2 | 2017-01-30 | 2017-10-30
 3 | 2016-12-21 | 2018-12-30
I want to know the number of active IDs on each possible day, i.e. count the number of overlapping time periods.
What I did to calculate this was create a new data frame c_df with the columns date and count. The date column was populated using a range:
all_dates = pd.date_range(start=min(df['START']), end=max(df['END']))
Then, for every line in my original data frame, I calculated a separate range between the start and end dates:
id_dates = pd.date_range(start=min(user['START']), end=max(user['END']))
I then used this range of dates to increment the corresponding count cell in c_df by one.
All these loops, though, are not very efficient for big data sets and look ugly. Is there a more efficient way of doing this?
If your dataframe is small enough so that performance is not a concern, create a date range for each row, then explode them and count how many times each date exists in the exploded series.
Requires pandas >= 0.25:
df.apply(lambda row: pd.date_range(row['START'], row['END']), axis=1) \
  .explode() \
  .value_counts() \
  .sort_index()
If your dataframe is large, take advantage of numpy broadcasting to improve performance.
Works with any version of pandas:

dates = pd.date_range(df['START'].min(), df['END'].max()).values
start = df['START'].values[:, None]
end = df['END'].values[:, None]

mask = (start <= dates) & (dates <= end)

result = pd.DataFrame({
    'Date': dates,
    'Count': mask.sum(axis=0)
})
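If the overall date span is very long, the broadcast mask can use a lot of memory. A sweep-line variant (a sketch, assuming START and END are datetime columns) adds +1 on each start date, -1 the day after each end date, and takes a cumulative sum:

events = pd.concat([pd.Series(1, index=df['START']),
                    pd.Series(-1, index=df['END'] + pd.Timedelta(days=1))])
full_range = pd.date_range(df['START'].min(), df['END'].max())
counts = (events.groupby(level=0).sum()      # net change in active IDs per day
                .reindex(full_range, fill_value=0)
                .cumsum())                   # running count of active IDs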
Create an IntervalIndex and use a generator expression or list comprehension with contains to check each date against each interval (note: I made a smaller sample to test this solution on).
Sample `df`
Out[56]:
   ID      START        END
0   1 2016-12-31 2017-01-20
1   2 2017-01-20 2017-01-30
2   3 2016-12-28 2017-02-03
3   4 2017-01-20 2017-01-25

iix = pd.IntervalIndex.from_arrays(df.START, df.END, closed='both')
all_dates = pd.date_range(start=min(df['START']), end=max(df['END']))

df_final = pd.DataFrame({'dates': all_dates,
                         'date_counts': (iix.contains(dt).sum() for dt in all_dates)})
In [58]: df_final
Out[58]:
dates date_counts
0 2016-12-28 1
1 2016-12-29 1
2 2016-12-30 1
3 2016-12-31 2
4 2017-01-01 2
5 2017-01-02 2
6 2017-01-03 2
7 2017-01-04 2
8 2017-01-05 2
9 2017-01-06 2
10 2017-01-07 2
11 2017-01-08 2
12 2017-01-09 2
13 2017-01-10 2
14 2017-01-11 2
15 2017-01-12 2
16 2017-01-13 2
17 2017-01-14 2
18 2017-01-15 2
19 2017-01-16 2
20 2017-01-17 2
21 2017-01-18 2
22 2017-01-19 2
23 2017-01-20 4
24 2017-01-21 3
25 2017-01-22 3
26 2017-01-23 3
27 2017-01-24 3
28 2017-01-25 3
29 2017-01-26 2
30 2017-01-27 2
31 2017-01-28 2
32 2017-01-29 2
33 2017-01-30 2
34 2017-01-31 1
35 2017-02-01 1
36 2017-02-02 1
37 2017-02-03 1

python and dataframe: group by week and calculate the sum and difference

I have a dataframe with the following columns:
      DATE  ALFA  BETA
2016-04-26     1     3
2016-04-27     3     0
2016-04-28     0     8
2016-04-29     4     2
2016-04-30     3     1
2016-05-01    -2    -5
2016-05-02     3     0
2016-05-03     3     3
2016-05-08     1     7
2016-05-11     3     1
2016-05-12    10     1
2016-05-13     4     2
I would like to group the data into weekly ranges but treat the ALFA and BETA columns differently. For each week I would like to calculate the sum of the numbers in the ALFA column, while for the BETA column I would like the difference between its value at the beginning and at the end of the week. Here is an example of the expected result:
      DATE  sum_ALFA  diff_BETA
2016-04-26        12          3
2016-05-03         4          4
2016-05-11        17          1
I have tried this code, but it calculates the sum for each column:
df = df.resample('W', on='DATE').sum().reset_index().sort_values(by='DATE')
this is my dataset https://drive.google.com/uc?export=download&id=1fEqjINx9R5io7t_YxA9qShvNDxWRCUke
I'd guess I have a different locale here (hence my week boundaries differ); you can do:

df.resample("W", on="DATE", closed="left", label="left") \
  .agg({"ALFA": "sum", "BETA": lambda g: g.iloc[0] - g.iloc[-1]})

            ALFA  BETA
DATE
2016-04-24    11     2
2016-05-01     4    -8
2016-05-08    18     5
I think there is a solution for your data with my approach. Define

def get_series_first_minus_last(s):
    try:
        return s.iloc[0] - s.iloc[-1]
    except IndexError:
        return 0

and replace the lambda call just by the function call, i.e.

df.resample("W", on="DATE", closed="left", label="left") \
  .agg({"ALFA": "sum", "BETA": get_series_first_minus_last})
Note that in the newly defined function, you could also return nan if you'd prefer that.
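For instance, a variant of that sketch returning NaN for empty weeks instead of 0:

import numpy as np

def get_series_first_minus_last(s):
    try:
        return s.iloc[0] - s.iloc[-1]
    except IndexError:
        return np.nan  # empty resample bin: no defined difference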

Pandas Difference Between Dates in Months

I have a dataframe date column with the values below:
2015-01-01
2015-02-01
2015-03-01
2015-07-01
2015-08-01
2015-10-01
2015-11-01
2016-02-01
I want to find the difference between these values in months, as below:
   date_dt  diff_mnts
2015-01-01          0
2015-02-01          1
2015-03-01          1
2015-07-01          4
2015-08-01          1
2015-10-01          2
2015-11-01          1
2016-02-01          3
I tried to use the diff() method to calculate the difference in days and then convert with astype('timedelta64[M]'), but in those cases where the gap is less than 30 days it shows a month difference of 0. Please let me know if there is any easy built-in function I can try in this case.
Option 1
Convert to a monthly period and call diff.

df

        Date
0 2015-01-01
1 2015-02-01
2 2015-03-01
3 2015-07-01
4 2015-08-01
5 2015-10-01
6 2015-11-01
7 2016-02-01

df.Date.dtype
dtype('<M8[ns]')

df.Date.dt.to_period('M').diff().fillna(0)

0    0
1    1
2    1
3    4
4    1
5    2
6    1
7    3
Name: Date, dtype: int64
Option 2
Alternatively, call diff on dt.month, but you'll need to account for gaps of over a year (solution improved thanks to @galaxyan!):

i = df.Date.dt.year.diff() * 12
j = df.Date.dt.month.diff()

(i + j).fillna(0).astype(int)

0    0
1    1
2    1
3    4
4    1
5    2
6    1
7    3
Name: Date, dtype: int64

The caveat (thanks to @galaxyan for spotting it) is that the month diff alone wouldn't work for gaps of over a year, which is why the year term is needed.
Try the following steps:
Cast the column into datetime format.
Use the .dt.month accessor to get the month number.
Use the shift() method in pandas to calculate the difference.
Example code will look something like this:

df['date_dt'] = pd.to_datetime(df['date_dt'])
df['diff_mnts'] = df['date_dt'].dt.month - df['date_dt'].shift().dt.month
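Note that plain month subtraction goes negative across year boundaries (for 2015-11-01 to 2016-02-01 it gives 2 - 11 = -9), so a sketch combining years and months, as in Option 2 above, is safer:

# months elapsed = 12 * year difference + month difference
d = pd.to_datetime(df['date_dt'])
df['diff_mnts'] = (d.dt.year.diff() * 12 + d.dt.month.diff()).fillna(0).astype(int)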

Last-N days Pandas DataFrame TimeGrouper

I have a DataFrame with dates as the index, and I would like to perform the operation "get the sum of the latest 2 days" for each day:
A
2015-11-01 1
2015-11-02 3
2015-11-03 2
2015-11-04 4
2015-11-05 1
2015-11-06 2
The aim is:
Latest_2_days_A
2015-11-01 1
2015-11-02 4
2015-11-03 5
2015-11-04 6
2015-11-05 5
2015-11-06 3
I thought TimeGrouper might help. However, when I use TimeGrouper and set freq to "2D":
import numpy as np
import pandas as pd

rng = pd.date_range('2015-11-01', periods=6)
df = pd.DataFrame(np.random.randn(6, 1), index=rng,
                  columns=["A"]).applymap(lambda x: int(x))

df.groupby(pd.TimeGrouper(freq="2D", closed='right')).sum()
The result would be:
             A
2015-10-30   1
2015-11-01   5
2015-11-03   5
2015-11-05   2
It is obvious that with TimeGrouper there is no overlap between the index entries in the result, while what I need is to perform the latest-N-days sum for each day. Is it possible to do this operation? Any suggestions will be very much appreciated!
For a simple case like this, shift will suffice:

In [6]: print(df)
            A
2015-11-01  1
2015-11-02  3
2015-11-03  2
2015-11-04  4
2015-11-05  1
2015-11-06  2

In [7]: print(df + df.shift(1).fillna(0))
            A
2015-11-01  1
2015-11-02  4
2015-11-03  5
2015-11-04  6
2015-11-05  5
2015-11-06  3
More generally, this is a case for a rolling sum; min_periods controls the minimum number of observations for a window to be considered valid. Skipping it here would leave NaN in the first cell:

In [8]: print(df.rolling(window=2, min_periods=1).sum())
            A
2015-11-01  1
2015-11-02  4
2015-11-03  5
2015-11-04  6
2015-11-05  5
2015-11-06  3
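If the index had missing days, a fixed two-row window would no longer mean "latest 2 days". A time-based rolling window (a sketch; offset windows require a sorted datetime index) handles gaps:

# '2D' sums every row whose timestamp falls within the last 2 calendar
# days of each row, even when some days are missing from the index
df.rolling('2D').sum()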
