Weird behavior with pandas Grouper method with datetime objects - python

I am trying to make groups of x days within groups of another column. For some reason the grouping behavior changes when I add another level of grouping.
See toy example below:
Create a random dataframe with 40 consecutive dates, an ID column and random values:
import numpy as np
import pandas as pd
df = pd.DataFrame(
    {'dates': pd.date_range('2018-1-1', periods=40, freq='D'),
     'id': np.concatenate((np.repeat(1, 10), np.repeat(2, 30))),
     'amount': np.random.random(40)}
)
I want to group by id first and then make groups of, let's say, 7 consecutive days within these groups. I do:
(df
 .groupby(['id', pd.Grouper(key='dates', freq='7D')])
 .amount
 .agg(['mean', 'count'])
)
And the output is:
mean count
id dates
1 2018-01-01 0.591755 7
2018-01-08 0.701657 3
2 2018-01-08 0.235837 4
2018-01-15 0.650085 7
2018-01-22 0.463854 7
2018-01-29 0.643556 7
2018-02-05 0.459864 5
There is something weird going on in the second group! I would expect to see 4 groups of 7 and then a last group of 2. When I run the same code on a dataframe containing only id=2, I do get what I expect:
df2 = df[df.id == 2]
(df2
 .groupby(['id', pd.Grouper(key='dates', freq='7D')])
 .amount
 .agg(['mean', 'count'])
)
Output
mean count
id dates
2 2018-01-11 0.389343 7
2018-01-18 0.672550 7
2018-01-25 0.486620 7
2018-02-01 0.520816 7
2018-02-08 0.529915 2
What is going on here? Is it creating a first group of 4 within id=2 because the last group within id=1 had only 3 rows? That is not what I want!

When you group by both id and the 7-day Grouper, the 7-day bins are laid out across the whole dates column (anchored at the overall first date, 2018-01-01) and then applied within each id. Because the last week of id=1 is incomplete, it spills over: id=2's first dates land in a bin that started back in id=1's range, so that bin only picks up 4 of id=2's days. This is obvious when you look at the first date per group:
"2018-01-08" in the first case vs. "2018-01-11" in the second.
The workaround is to perform a groupby on id and then apply a resampling operation:
df.groupby('id').apply(
    lambda x: x.set_index('dates').amount.resample('7D').count()
)
id dates
1 2018-01-01 7
2018-01-08 3
2 2018-01-11 7
2018-01-18 7
2018-01-25 7
2018-02-01 7
2018-02-08 2
Name: amount, dtype: int64
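If you also want the mean alongside the count, as in the original attempt, the same per-id resample can be combined with agg. A sketch along those lines, using the toy df from above:
(df
 .groupby('id')
 .apply(lambda x: x.set_index('dates')['amount']
                   .resample('7D')
                   .agg(['mean', 'count']))
)
Because the resample is done separately per id, each id's 7-day bins start at that id's own first date.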

Related

How to see if one value has 2 matches in 1 column in pandas

I have results from an A/B test that I need to evaluate, but while checking the data I noticed that some users were in both test groups, and I need to drop them so they do not hurt the test. My data looks something like this:
transactionId visitorId date revenue group
0 906125958 0 2019-08-16 10.8 B
1 1832336629 1 2019-08-04 25.9 B
2 3698129301 2 2019-08-01 165.7 B
3 4214855558 2 2019-08-07 30.5 A
4 797272108 3 2019-08-23 100.4 A
What I need to do is remove every user that was in both A and B groups while leaving the rest intact. So from the example data I need this output:
transactionId visitorId date revenue group
0 906125958 0 2019-08-16 10.8 B
1 1832336629 1 2019-08-04 25.9 B
4 797272108 3 2019-08-23 100.4 A
I tried to do it in various ways and I can't seem to figure it out, and I couldn't find an answer for it anywhere. I would really appreciate some help here;
thanks in advance.
You can get a list of users that are in just one group like this:
group_counts = df.groupby('visitorId').agg({'group': 'nunique'})  # number of distinct groups per visitor
to_include = group_counts[group_counts['group'] == 1]             # keep visitors seen in exactly 1 group
And then filter your original data according to which visitors are in that list:
df = df[df['visitorId'].isin(to_include.index)]
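An equivalent filter in a single step uses transform, so no intermediate frame is needed; a sketch built on the same column names:
# keep only visitors whose rows all belong to a single group
df = df[df.groupby('visitorId')['group'].transform('nunique') == 1]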

sort_values function in pandas dataframe not working properly

I have a dataset of 1,281,695 rows and 4 columns containing 6 years of monthly data from 2013 to 2019, so dates are naturally repeated. I want to arrange the data by date in ascending order: Jan 2013, Feb 2013, ... Dec 2013, Jan 2014, ... Dec 2019. I want ascending order across the whole dataset, but the result is in ascending order for some of the data and in random order for the rest.
I tried sort_values of pandas library.
I tried something like this :
data = df.sort_values(['SKU', 'Region', 'FMonth'], axis=0, ascending=[False, True, True]).reset_index()
where SKU, Region, FMonth are my independent variables. FMonth is the date variable.
The code sorts the beginning of the data but not the end. For example, when I try:
data.head()
result:
index SKU Region FMonth sh
0 8264 855019.133127 3975.495636 2013-01-01 67640.0
1 20022 855019.133127 3975.495636 2013-02-01 73320.0
2 31972 855019.133127 3975.495636 2013-03-01 86320.0
3 43897 855019.133127 3975.495636 2013-04-01 98040.0
4 55642 855019.133127 3975.495636 2013-05-01 73240.0
And,
data.tail()
result:
index SKU Region FMonth sh
1281690 766746 0.000087 7187.170501 2017-03-01 0.0
1281691 881816 0.000087 7187.170501 2017-09-01 0.0
1281692 980113 0.000087 7187.170501 2018-02-01 0.0
1281693 1020502 0.000087 7187.170501 2018-04-01 0.0
1281694 1249130 0.000087 7187.170501 2019-03-01 0.0
where 'sh' is my dependent variable.
The data is not pretty, but please focus on the FMonth (date) column only.
As you can see, the last rows are not arranged in ascending order even though the first rows are. And if I flip the ascending flag for the FMonth column in the code above, i.e. sort it in descending order, the first rows show the requested order but the last rows again do not.
What am I doing wrong? How do I get ascending order across the whole dataset? And what is actually happening, and why?
Do you just need to prioritize Month?
z = pd.read_clipboard()
z.columns = [i.strip() for i in z.columns]
z.sort_values(['FMonth', 'Region', 'SKU'], axis=0, ascending=[True, True, True])
Out[23]:
index SKU Region FMonth sh
1 20022 8 52 1/1/2013 73320
0 8264 1 67 1/1/2013 67640
3 43897 5 34 3/1/2013 98040
2 31972 3 99 3/1/2013 86320
4 55642 4 98 5/1/2013 73240
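Applied to the original dataframe, that means putting FMonth first among the sort keys (and converting it to real datetimes first, in case it is stored as text). A sketch using the column names from the question:
df['FMonth'] = pd.to_datetime(df['FMonth'])   # make sure dates sort chronologically, not as strings
data = (df.sort_values(['FMonth', 'SKU', 'Region'],
                       ascending=[True, False, True])
          .reset_index(drop=True))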

Python Pandas: groupby date and count new records for each period

I'm trying to use Python Pandas to count, for each day over a time period, how many visitors to my website are new (appearing for the first time).
Example data:
df1 = pd.DataFrame({'user_id':[1,2,3,1,3], 'date':['2012-09-29','2012-09-30','2012-09-30','2012-10-01','2012-10-01']})
print(df1)
date user_id
0 2012-09-29 1
1 2012-09-30 2
2 2012-09-30 3
3 2012-10-01 1
4 2012-10-01 3
What I'd like to have as final result:
df1_result = pd.DataFrame({'count_new':[1,2,0], 'date':['2012-09-29','2012-09-30','2012-10-01']})
print(df1_result)
count_new date
0 1 2012-09-29
1 2 2012-09-30
2 0 2012-10-01
In the first day there is 1 new user because user 1 appears for the first time.
In the second day there are 2 new users: user 2 and user 3 both appear for the first time.
Finally in the third day there are 0 new users: user 1 and user 3 have both already appeared in previous periods.
So far I have been looking into merging two copies of the same dataframe and shifting one by a date, but without success:
pd.merge(df1, df1.user_id.shift(-date), on = 'date').groupby('date')['user_id_y'].nunique()
Any help would be much appreciated,
Thanks
>>> (df1
     .groupby(['user_id'], as_index=False)['date']  # Group by `user_id` and get first date.
     .first()
     .groupby(['date'])                             # Group result on `date` and take counts.
     .count()
     .reindex(df1['date'].unique())                 # Reindex on original dates.
     .fillna(0))                                    # Fill null values with zero.
user_id
date
2012-09-29 1
2012-09-30 2
2012-10-01 0
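If you want the output in the exact shape of df1_result (an integer count_new column with date as a regular column), a small variation of the same chain should get there; a sketch:
result = (df1.groupby('user_id', as_index=False)['date'].first()   # first date per user
             .groupby('date')['user_id'].count()                   # new users per date
             .reindex(df1['date'].unique(), fill_value=0)          # include days with no new users
             .rename('count_new')
             .rename_axis('date')
             .reset_index())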
It is better to add a new column, Isreturning, in case you need to analyze returning customers in the future:
df['Isreturning']=df.groupby('user_id').cumcount()
Only show new customers:
df.loc[df.Isreturning==0,:].groupby('date')['user_id'].count()
Out[840]:
date
2012-09-29 1
2012-09-30 2
Name: user_id, dtype: int64
Or you can :
df.groupby('date')['Isreturning'].apply(lambda x : len(x[x==0]))
Out[843]:
date
2012-09-29 1
2012-09-30 2
2012-10-01 0
Name: Isreturning, dtype: int64
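Another compact route uses drop_duplicates to keep each user's first row; a sketch using the same df1 (which is already in date order, so the first occurrence is the first visit):
first_visits = df1.drop_duplicates('user_id')    # first row per user
new_per_day = (first_visits.groupby('date')['user_id'].size()
               .reindex(df1['date'].unique(), fill_value=0))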

Python function to add values in a Pandas Dataframe using values from another Dataframe

I am a newbie in Python and I am struggling to code things that seem simple in PHP/SQL, so I hope you can help me.
I have 2 Pandas Dataframes that I have simplified for a better understanding.
In the first Dataframe, df2015, I have the Sales for 2015.
! Notice that unfortunately, we do not have ALL the values for each store !
>>> df2015
Store Date Sales
0 1 2015-01-15 6553
1 3 2015-01-15 7016
2 6 2015-01-15 8840
3 8 2015-01-15 10441
4 9 2015-01-15 7952
And another Dataframe named df2016 for the Sales Forecast in 2016, which lists ALL the stores. (As you can guess, SalesForecast is the column to fill.)
>>> df2016
Store Date SalesForecast
0 1 2016-01-15
1 2 2016-01-15
2 3 2016-01-15
3 4 2016-01-15
4 5 2016-01-15
I want to create a function that, for each row in df2016, retrieves the Sales value from df2015, increases it by 5% (for example), and writes the new value into the SalesForecast column of df2016.
Let's say forecast is the function I have created and want to apply:
def forecast(store_id, date):
    sales2015 = df2015['Sales'].loc[(df2015['Store'].values == store_id) &
                                    (df2015['Date'].values == date)].values
    forecast2016 = sales2015 * 1.05
    return forecast2016
I have tested this function with hardcoded values as below and it works:
>>> forecast(1,'2015-01-15')
array([ 6880.65])
But here is where my problem lies... How can I apply this function to the dataframes?
It would be very easy to do in PHP by looping over each row of df2016 and retrieving the values (if they exist) from df2015 with a SELECT ... WHERE Store = store_id AND Date = date, but it seems the logic is not the same with Pandas DataFrames and Python.
I have tried the apply function as follows :
df2016['SalesForecast'] = df2016.apply(df2016['Store'],df2016['Date'])
but I am unable to get the arguments right, or there is something else I am doing wrong.
I think I do not have the right method, or maybe my approach is not suitable for Pandas and Python at all?
I believe you are almost there! What's missing is the function itself; you've only passed in the arguments.
The apply function takes in a function and its arguments. The documentation is here.
Without having tried this on my own system, I would suggest applying it row-wise:
df2016['SalesForecast'] = df2016.apply(lambda row: forecast(row['Store'], row['Date']), axis=1)
One of the nice things about Pandas is that it handles missing data well. The trick is to use a common index on both dataframes. For instance, if we set the index of both dataframes to be the 'Store' column:
df2015.set_index('Store', inplace=True)
df2016.set_index('Store', inplace=True)
Then doing what you'd like is as simple as:
df2016['SalesForecast'] = df2015['Sales'] * 1.05
resulting in:
Date SalesForecast
Store
1 2016-01-15 6880.65
2 2016-01-15 NaN
3 2016-01-15 7366.80
4 2016-01-15 NaN
5 2016-01-15 NaN
That the SalesForecast for store 2 is NaN reflects the fact that store 2 doesn't exist in the df2015 dataframe.
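A mapping-based variant of the same idea avoids re-indexing df2016 altogether. A sketch that starts again from the original, un-indexed frames (i.e. before the set_index calls above), assuming Store uniquely identifies a row in df2015:
forecast_by_store = df2015.set_index('Store')['Sales'] * 1.05      # Series: Store -> 2015 sales * 1.05
df2016['SalesForecast'] = df2016['Store'].map(forecast_by_store)   # stores missing from 2015 become NaN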
Notice that if you just need to multiply the Sales column from df2015 by 1.05, you can just do so, all in df2015:
In [18]: df2015['Forecast'] = df2015['Sales'] * 1.05
In [19]: df2015
Out[19]:
Store Date Sales Forecast
0 1 2015-01-15 6553 6880.65
1 3 2015-01-15 7016 7366.80
2 6 2015-01-15 8840 9282.00
3 8 2015-01-15 10441 10963.05
4 9 2015-01-15 7952 8349.60
At this point, you can join that result onto df2016 if you need this to appear in the df2016 data set:
In [20]: pandas.merge(df2016,                        # left side of join
                      df2015,                        # right side of join
                      on='Store',                    # similar to SQL 'on' for 'join'
                      how='outer',                   # same as SQL, outer join
                      suffixes=('_2016', '_2015'))   # rename same-named columns with a suffix
Out[20]:
Store Date_2016 Date_2015 Sales Forecast
0 1 2016-01-15 2015-01-15 6553 6880.65
1 2 2016-01-15 NaN NaN NaN
2 3 2016-01-15 2015-01-15 7016 7366.80
3 4 2016-01-15 NaN NaN NaN
4 5 2016-01-15 NaN NaN NaN
5 6 2016-01-15 2015-01-15 8840 9282.00
6 7 2016-01-15 NaN NaN NaN
7 8 2016-01-15 2015-01-15 10441 10963.05
8 9 2016-01-15 2015-01-15 7952 8349.60
If the two DataFrames happen to have compatible indexes already, you can simply write the result column into df2016 directly, even if it's a computation on another DataFrame like df2015. In general, though, you need to be careful about this, and it can be more robust to perform the join explicitly (as I did above using the merge function). Which way is best will depend on your application and your knowledge of the index columns.
For more general function application to a column, a whole DataFrame, or groups of sub-frames, refer to the documentation for this type of operation in Pandas.
There are also links with some cookbook examples and comparisons with the way you might express similar operations in SQL.
Note that I created data to replicate your example data with these commands:
import datetime
import pandas
from itertools import zip_longest   # izip_longest on Python 2

df2015 = pandas.DataFrame([[1, datetime.date(2015, 1, 15), 6553],
                           [3, datetime.date(2015, 1, 15), 7016],
                           [6, datetime.date(2015, 1, 15), 8840],
                           [8, datetime.date(2015, 1, 15), 10441],
                           [9, datetime.date(2015, 1, 15), 7952]],
                          columns=['Store', 'Date', 'Sales'])

df2016 = pandas.DataFrame(
    list(zip_longest(range(1, 10),
                     [datetime.date(2016, 1, 15)],
                     fillvalue=datetime.date(2016, 1, 15))),
    columns=['Store', 'Date']
)

Grouping records with close DateTimes in Python pandas DataFrame

I have been spinning my wheels with this problem and was wondering if anyone has any insight on how best to approach it. I have a pandas DataFrame with a number of columns, including one datetime64[ns]. I would like to find some way to 'group' records together which have datetimes which are very close to one another. For example, I might be interested in grouping the following transactions together if they occur within two seconds of each other by assigning a common ID called Grouped ID:
Transaction ID Time Grouped ID
1 08:10:02 1
2 08:10:03 1
3 08:10:50
4 08:10:55
5 08:11:00 2
6 08:11:01 2
7 08:11:02 2
8 08:11:03 3
9 08:11:04 3
10 08:15:00
Note that I am not looking to have the time window expand ad infinitum if transactions continue to occur at quick intervals; once a full 2-second window has passed, a new window would begin with the next transaction (as shown in transactions 5-9). Additionally, I will ultimately be performing this analysis at the millisecond level (i.e. combining transactions within 50 ms), but I stuck with seconds for ease of presentation above.
Thanks very much for any insight you can offer!
The solution I suggest requires you to reindex your data with your Time data.
You can build a list of datetimes at the desired frequency, use searchsorted to find the nearest positions in your index, and then use those for slicing (as suggested in the questions python pandas dataframe slicing by date conditions and Python pandas, how to truncate DatetimeIndex and fill missing data only in certain interval).
I'm using pandas 0.14.1 and the DateOffset object (http://pandas.pydata.org/pandas-docs/dev/timeseries.html?highlight=dateoffset). I didn't check with datetime64, but I guess you can adapt the code. DateOffset goes down to the microsecond level.
Using the following code,
import pandas as pd
import pandas.tseries.offsets as pto
import numpy as np
# Create some test data
d_size = 15
df = pd.DataFrame({"value": np.arange(d_size)},
                  index=pd.date_range("2014/11/03", periods=d_size, freq=pto.Milli()))
# Define periods to define groups (ticks)
ticks = pd.date_range("2014/11/03", periods=d_size // 3, freq=5 * pto.Milli())
# find nearest positions in the index matching the ticks
index_ticks = np.unique(df.index.searchsorted(ticks))
# make a dataframe with the group ids
dgroups = pd.DataFrame(index=df.index, columns=['Group id'])
# set the group ids (searchsorted returns positions, so slice positionally)
for i, (mini, maxi) in enumerate(zip(index_ticks[:-1], index_ticks[1:])):
    dgroups.iloc[mini:maxi] = i
# update original dataframe
df['Group id'] = dgroups['Group id']
I was able to obtain this kind of dataframe:
value Group id
2014-11-03 00:00:00 0 0
2014-11-03 00:00:00.001000 1 0
2014-11-03 00:00:00.002000 2 0
2014-11-03 00:00:00.003000 3 0
2014-11-03 00:00:00.004000 4 0
2014-11-03 00:00:00.005000 5 1
2014-11-03 00:00:00.006000 6 1
2014-11-03 00:00:00.007000 7 1
2014-11-03 00:00:00.008000 8 1
2014-11-03 00:00:00.009000 9 1
2014-11-03 00:00:00.010000 10 2
2014-11-03 00:00:00.011000 11 2
2014-11-03 00:00:00.012000 12 2
2014-11-03 00:00:00.013000 13 2
2014-11-03 00:00:00.014000 14 2
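If you want to follow the rule from the question literally (a new window opens with the first transaction that falls outside the previous window's 2-second span), a simple loop over the time-sorted rows is probably the clearest way. The sketch below uses made-up data mirroring the question's table; note that it labels every window, including single-transaction ones, so you would still need to blank out or renumber singleton groups to match the example output exactly:
import pandas as pd

# Hypothetical data shaped like the question's table
df = pd.DataFrame({
    'Transaction ID': range(1, 11),
    'Time': pd.to_datetime(['08:10:02', '08:10:03', '08:10:50', '08:10:55',
                            '08:11:00', '08:11:01', '08:11:02', '08:11:03',
                            '08:11:04', '08:15:00'])
})

window = pd.Timedelta(seconds=2)   # use pd.Timedelta(milliseconds=50) for the real data
group_ids = []
window_start = None
gid = 0
for t in df['Time']:               # assumes df is already sorted by Time
    if window_start is None or t - window_start > window:
        gid += 1                   # this transaction opens a new window
        window_start = t
    group_ids.append(gid)
df['Grouped ID'] = group_ids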
