Pandas DataFrame Change Values Based on Values in Different Rows - python

I have a DataFrame of store sales for 1115 stores with dates over about 2.5 years. The StateHoliday column is a categorical variable indicating the type of holiday it is. See the piece of the df below. As can be seen, b is the code for Easter. There are other codes for other holidays.
Piece of DF
My objective is to analyze sales before and during a holiday. The way I seek to do this is to change the value of the StateHoliday column to something unique for the few days before a particular holiday. For example, b is the code for Easter, so I could change the value to b- indicating that the day is shortly before Easter. The only way I can think to do this is to go through and manually change these values for certain dates. There aren't THAT many holidays, so it wouldn't be that hard to do. But still very annoying!

Tom, see if this works for you, if not please provide additional information:
In the file I have the following data:
Store,Sales,Date,StateHoliday
1,6729,2013-03-25,0
1,6686,2013-03-26,0
1,6660,2013-03-27,0
1,7285,2013-03-28,0
1,6729,2013-03-29,b
1115,10712,2015-07-01,0
1115,11110,2015-07-02,0
1115,10500,2015-07-03,0
1115,12000,2015-07-04,c
import pandas as pd
fname = r"D:\workspace\projects\misc\data\holiday_sales.csv"
df = pd.read_csv(fname)
df["Date"] = pd.to_datetime(df["Date"])
holidays = df[df["StateHoliday"]!="0"].copy(deep=True) # taking only holidays
dictDate2Holiday = dict(zip(holidays["Date"].tolist(), holidays["StateHoliday"].tolist()))
look_back = 2 # how many days back you want to go
holiday_look_back = []
# building a list of pairs (prev days, holiday code)
for dt, h in dictDate2Holiday.items():
    prev = dt
    holiday_look_back.append((prev, h))
    for i in range(1, look_back+1):
        prev = prev - pd.Timedelta(days=1)
        holiday_look_back.append((prev, h))
dfHolidayLookBack = pd.DataFrame(holiday_look_back, columns=["Date", "StateHolidayNew"])
df = df.merge(dfHolidayLookBack, how="left", on="Date")
df["StateHolidayNew"] = df["StateHolidayNew"].fillna("0")  # assign back; in-place fillna on a selected column is unreliable in recent pandas
print(df)
The StateHolidayNew column should have the info you need to start analyzing your data.
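The question asks for a distinct code such as b- on the days before a holiday, while the code above copies the holiday code itself onto those days. A minimal sketch of the b- variant follows, assuming StateHoliday holds strings and "0" means no holiday; the names pre, PreHoliday and out are made up for this illustration:

```python
import pandas as pd

# Small stand-in for the sales data (one store, Easter coded as "b").
df = pd.DataFrame({
    "Store": [1, 1, 1, 1, 1],
    "Date": pd.to_datetime(["2013-03-25", "2013-03-26", "2013-03-27",
                            "2013-03-28", "2013-03-29"]),
    "StateHoliday": ["0", "0", "0", "0", "b"],
})
look_back = 2  # how many days before the holiday to mark

# One row per distinct holiday date.
holidays = df.loc[df["StateHoliday"] != "0", ["Date", "StateHoliday"]].drop_duplicates()

# Shift each holiday back by 1..look_back days and tag it with a trailing "-".
frames = []
for offset in range(1, look_back + 1):
    shifted = holidays.copy()
    shifted["Date"] = shifted["Date"] - pd.Timedelta(days=offset)
    shifted["PreHoliday"] = shifted["StateHoliday"] + "-"
    frames.append(shifted[["Date", "PreHoliday"]])
pre = pd.concat(frames).drop_duplicates("Date")

# Keep real holiday codes; otherwise use the pre-holiday tag, else "0".
out = df.merge(pre, on="Date", how="left")
out["StateHolidayNew"] = out["StateHoliday"].where(
    out["StateHoliday"] != "0", out["PreHoliday"].fillna("0"))
out = out.drop(columns="PreHoliday")
print(out)
```

If two holidays fall within look_back days of each other, drop_duplicates keeps only one tag per date, so that case would need its own rule.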

Assuming you have a dataframe like this:
Store Sales Date StateHoliday
0 2 4205 2016-11-15 0
1 1 684 2016-07-13 0
2 2 8946 2017-04-15 0
3 1 6929 2017-02-02 0
4 2 8296 2017-10-30 b
5 1 8261 2015-10-05 0
6 2 3904 2016-08-22 0
7 1 2613 2017-12-30 0
8 2 1324 2016-08-23 0
9 1 6961 2015-11-11 0
10 2 15 2016-12-06 a
11 1 9107 2016-07-05 0
12 2 1138 2015-03-29 0
13 1 7590 2015-06-24 0
14 2 5172 2017-04-29 0
15 1 660 2016-06-21 0
16 2 2539 2017-04-25 0
What you can do is group the rows that fall between the letter codes representing holidays, and then use groupby to total the sales for each group. An improvement would be to backfill the holiday code onto the group numbers before it, e.g., groups=0.0 would become b_0, which would make it easier to see which holiday each group belongs to, but I am not sure how to do that.
import numpy as np
df['StateHolidayBool'] = df['StateHoliday'].str.isalpha().fillna(False).replace({False: 0, True: 1})
df = df.assign(group = (df[~df['StateHolidayBool'].between(1,1)].index.to_series().diff() > 1).cumsum())
df = df.assign(groups = np.where(df.group.notna(), df.group, df.StateHoliday)).drop(['StateHolidayBool', 'group'], axis=1)
df[~df['groups'].str.isalpha().fillna(False)].groupby('groups').sum()
Output:
Store Sales
groups
0.0 6 20764
1.0 7 23063
2.0 9 26206
Final DataFrame:
Store Sales Date StateHoliday groups
0 2 4205 2016-11-15 0 0.0
1 1 684 2016-07-13 0 0.0
2 2 8946 2017-04-15 0 0.0
3 1 6929 2017-02-02 0 0.0
4 2 8296 2017-10-30 b b
5 1 8261 2015-10-05 0 1.0
6 2 3904 2016-08-22 0 1.0
7 1 2613 2017-12-30 0 1.0
8 2 1324 2016-08-23 0 1.0
9 1 6961 2015-11-11 0 1.0
10 2 15 2016-12-06 a a
11 1 9107 2016-07-05 0 2.0
12 2 1138 2015-03-29 0 2.0
13 1 7590 2015-06-24 0 2.0
14 2 5172 2017-04-29 0 2.0
15 1 660 2016-06-21 0 2.0
16 2 2539 2017-04-25 0 2.0
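One possible way to get holiday-aware group labels is to backfill the next holiday code onto the rows before it. This is only a sketch, assuming the rows are already in the order you want to group by, that StateHoliday holds strings, and that "0" means no holiday; the "_pre" suffix and the "none" label are arbitrary choices for this illustration:

```python
import pandas as pd

df = pd.DataFrame({
    "Sales": [10, 20, 30, 40, 50, 60],
    "StateHoliday": ["0", "0", "b", "0", "a", "0"],
})

is_holiday = df["StateHoliday"] != "0"

# Holiday code on holiday rows, NaN elsewhere, then pull each code backwards
# so every non-holiday row sees the next holiday that follows it.
next_code = df["StateHoliday"].where(is_holiday).bfill()

# Non-holiday rows get "<code>_pre"; holiday rows keep their own code;
# rows after the last holiday have nothing to backfill and become "none".
df["groups"] = (next_code + "_pre").where(~is_holiday, df["StateHoliday"]).fillna("none")

totals = df.groupby("groups")["Sales"].sum()
print(totals)
```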

Related

Count number of columns above a date

I have a pandas dataframe with several columns and I would like to know the number of columns above the date 2016-12-31 . Here is an example:
ID  Bill  Date 1      Date 2      Date 3      Date 4      Bill 2
4   6     2000-10-04  2000-11-05  1999-12-05  2001-05-04  8
6   8     2016-05-03  2017-08-09  2018-07-14  2015-09-12  17
12  14    2016-11-16  2017-05-04  2017-07-04  2018-07-04  35
And I would like to get this column
Count
0
2
3
Just create the mask and call sum on axis=1
date = pd.to_datetime('2016-12-31')
(df[['Date 1','Date 2','Date 3','Date 4']]>date).sum(1)
OUTPUT:
0 0
1 2
2 3
dtype: int64
If needed, call .to_frame('Count') to create a DataFrame with Count as the column name:
(df[['Date 1','Date 2','Date 3','Date 4']]>date).sum(1).to_frame('Count')
Count
0 0
1 2
2 3
Use df.filter to filter the Date* columns + .sum(axis=1)
(df.filter(like='Date') > '2016-12-31').sum(axis=1).to_frame(name='Count')
Result:
Count
0 0
1 2
2 3
You can do:
df['Count'] = (df.loc[:, [x for x in df.columns if 'Date' in x]] > '2016-12-31').sum(axis=1)
Output:
ID Bill Date 1 Date 2 Date 3 Date 4 Bill 2 Count
0 4 6 2000-10-04 2000-11-05 1999-12-05 2001-05-04 8 0
1 6 8 2016-05-03 2017-08-09 2018-07-14 2015-09-12 17 2
2 12 14 2016-11-16 2017-05-04 2017-07-04 2018-07-04 35 3
We select the columns with 'Date' in the name. This is convenient when there are many such columns and you don't want to list them one by one. Then we compare them with the lookup date and sum the True values.
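The comparisons above behave as date comparisons only if the Date columns hold datetimes. A minimal sketch, assuming the dates were read in as strings, converts them first; the names dates and cutoff are illustrative:

```python
import pandas as pd

df = pd.DataFrame({
    "ID": [4, 6, 12],
    "Date 1": ["2000-10-04", "2016-05-03", "2016-11-16"],
    "Date 2": ["2000-11-05", "2017-08-09", "2017-05-04"],
    "Date 3": ["1999-12-05", "2018-07-14", "2017-07-04"],
    "Date 4": ["2001-05-04", "2015-09-12", "2018-07-04"],
})

# Convert every column whose name contains "Date" to datetime64.
dates = df.filter(like="Date").apply(pd.to_datetime)

cutoff = pd.Timestamp("2016-12-31")

# Boolean mask per cell, summed across each row.
df["Count"] = (dates > cutoff).sum(axis=1)
print(df[["ID", "Count"]])
```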

Return duration for each id

I have a large list of events being tracked with a timestamp appended to each:
I currently have the following table:
ID Time_Stamp Event
1 2/20/2019 18:21 0
1 2/20/2019 19:46 0
1 2/21/2019 18:35 0
1 2/22/2019 11:39 1
1 2/22/2019 16:46 0
1 2/23/2019 7:40 0
2 6/5/2019 0:10 0
3 7/31/2019 10:18 0
3 8/23/2019 16:33 0
4 6/26/2019 20:49 0
What I want is the following [but not sure if it's possible]:
ID Time_Stamp Conversion Total_Duration_Days Conversion_Duration
1 2/20/2019 18:21 0 2.555 1.721
1 2/20/2019 19:46 0 2.555 1.721
1 2/21/2019 18:35 0 2.555 1.721
1 2/22/2019 11:39 1 2.555 1.721
1 2/22/2019 16:46 1 2.555 1.934
1 2/23/2019 7:40 0 2.555 1.934
2 6/5/2019 0:10 0 1.00 0.000
3 7/31/2019 10:18 0 23.260 0.000
3 8/23/2019 16:33 0 23.260 0.000
4 6/26/2019 20:49 0 1.00 0.000
For #1 Total Duration = Max Date - Min Date [2.555 Days]
For #2 Conversion Duration = Conversion Date - Min Date [1.721 Days] - following actions post the conversion can remain at the calculated duration
I have attempted the following:
df.reset_index(inplace=True)
df.groupby(['ID'])['Time_Stamp'].diff().fillna(0)
This kind of does what I want, but it's showing the difference between each event, not the min time stamp to the max time stamp
conv_test = df.reset_index()  # reset_index(inplace=True) returns None, so assign the returned copy instead
min_df = conv_test.groupby(['ID'])['visitStartTime_aest'].agg('min').to_frame('MinTime')
max_df = conv_test.groupby(['ID'])['visitStartTime_aest'].agg('max').to_frame('MaxTime')
conv_test = conv_test.set_index('ID').merge(min_df, left_index=True, right_index=True)
conv_test = conv_test.merge(max_df, left_index=True, right_index=True)
conv_test['Duration'] = conv_test['MaxTime'] - conv_test['MinTime']
This gives me Total_Duration_Days, which is great [feel free to offer a more elegant solution].
Any ideas on how I can get Conversion_Duration?
You can use GroupBy.transform with min and max to get Series of the same size as the original, so they can be subtracted to give Total_Duration_Days. Then filter only the rows where Event is 1, build a Series with DataFrame.set_index and convert it to a dict, and use Series.map to create a new Series of conversion timestamps, from which the minimal value per group can be subtracted:
import numpy as np
import pandas as pd
df['Time_Stamp'] = pd.to_datetime(df['Time_Stamp'])
min1 = df.groupby('ID')['Time_Stamp'].transform('min')
max1 = df.groupby('ID')['Time_Stamp'].transform('max')
df['Total_Duration_Days'] = max1.sub(min1).dt.total_seconds() / (3600 * 24)
d = df.loc[df['Event'] == 1].set_index('ID')['Time_Stamp'].to_dict()
new1 = df['ID'].map(d)
Because a group may contain more than one 1, an extra step is added only for those groups: test with a mask whether a group has more than one 1, build the Series new2 for those rows, and then use Series.combine_first with the mapped Series new1.
The reason is performance, because handling multiple 1 values is somewhat more complicated.
mask = df['Event'].eq(1).groupby(df['ID']).transform('sum').gt(1)
g = df[mask].groupby('ID')['Event'].cumsum().replace({0:np.nan})
new2 = (df[mask].groupby(['ID', g])['Time_Stamp']
                .transform('first')
                .groupby(df['ID'])
                .bfill())
df['Conversion_Duration'] = (new2.combine_first(new1)
                                 .sub(min1)
                                 .dt.total_seconds().fillna(0) / (3600 * 24))
print (df)
ID Time_Stamp Event Total_Duration_Days Conversion_Duration
0 1 2019-02-20 18:21:00 0 2.554861 1.720833
1 1 2019-02-20 19:46:00 0 2.554861 1.720833
2 1 2019-02-21 18:35:00 0 2.554861 1.720833
3 1 2019-02-22 11:39:00 1 2.554861 1.720833
4 1 2019-02-22 16:46:00 1 2.554861 1.934028
5 1 2019-02-23 07:40:00 0 2.554861 1.934028
6 2 2019-06-05 00:10:00 0 0.000000 0.000000
7 3 2019-07-31 10:18:00 0 23.260417 0.000000
8 3 2019-08-23 16:33:00 0 23.260417 0.000000
9 4 2019-06-26 20:49:00 0 0.000000 0.000000
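If only the first conversion per ID matters, a shorter variant may be enough. This is a sketch under that assumption and does not reproduce the per-conversion handling above; first_conv, start and end are illustrative names:

```python
import pandas as pd

df = pd.DataFrame({
    "ID": [1, 1, 1, 2],
    "Time_Stamp": pd.to_datetime(["2019-02-20 18:21", "2019-02-21 18:35",
                                  "2019-02-22 11:39", "2019-06-05 00:10"]),
    "Event": [0, 0, 1, 0],
})

# First and last timestamp per ID, broadcast back to every row.
start = df.groupby("ID")["Time_Stamp"].transform("min")
end = df.groupby("ID")["Time_Stamp"].transform("max")

# Dividing a timedelta by one day gives fractional days.
df["Total_Duration_Days"] = (end - start) / pd.Timedelta(days=1)

# Earliest conversion timestamp per ID; IDs without a conversion map to NaT.
first_conv = df.loc[df["Event"] == 1].groupby("ID")["Time_Stamp"].min()
df["Conversion_Duration"] = (
    (df["ID"].map(first_conv) - start) / pd.Timedelta(days=1)
).fillna(0)
print(df)
```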

Calculate average of every 7 instances in a dataframe column

I have this pandas dataframe with daily asset prices:
Picture of head of Dataframe
I would like to create a pandas series (it could also be an additional column in the dataframe or some other data structure) with the weekly average asset prices. This means I need to calculate the average of every 7 consecutive instances in the column and save it into a series.
Picture of how result should look like
As I am a complete newbie to python (and programming in general, for that matter), I really have no idea how to start.
I am very grateful for every tip!
I believe you need GroupBy.transform, grouping by the integer (floor) division of a NumPy array created by numpy.arange. This is a general solution that works with any index (e.g. a DatetimeIndex):
import numpy as np
import pandas as pd
np.random.seed(2018)
rng = pd.date_range('2018-04-19', periods=20)
df = pd.DataFrame({'Date': rng[::-1],
                   'ClosingPrice': np.random.randint(4, size=20)})
#print (df)
df['weekly'] = df['ClosingPrice'].groupby(np.arange(len(df)) // 7).transform('mean')
print (df)
ClosingPrice Date weekly
0 2 2018-05-08 1.142857
1 2 2018-05-07 1.142857
2 2 2018-05-06 1.142857
3 1 2018-05-05 1.142857
4 1 2018-05-04 1.142857
5 0 2018-05-03 1.142857
6 0 2018-05-02 1.142857
7 2 2018-05-01 2.285714
8 1 2018-04-30 2.285714
9 1 2018-04-29 2.285714
10 3 2018-04-28 2.285714
11 3 2018-04-27 2.285714
12 3 2018-04-26 2.285714
13 3 2018-04-25 2.285714
14 1 2018-04-24 1.666667
15 0 2018-04-23 1.666667
16 3 2018-04-22 1.666667
17 2 2018-04-21 1.666667
18 2 2018-04-20 1.666667
19 2 2018-04-19 1.666667
Detail:
print (np.arange(len(df)) // 7)
[0 0 0 0 0 0 0 1 1 1 1 1 1 1 2 2 2 2 2 2]
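If you want one value per block of 7 rows rather than the average repeated on every row, the same grouping key can be aggregated instead of transformed. A minimal sketch with made-up prices:

```python
import numpy as np
import pandas as pd

# 14 made-up prices: 1.0, 2.0, ..., 14.0
prices = pd.Series(np.arange(1, 15, dtype=float))

# Rows 0-6 get key 0, rows 7-13 get key 1, and so on.
block = np.arange(len(prices)) // 7

# mean() returns one row per block; transform('mean') would repeat it per row.
weekly = prices.groupby(block).mean()
print(weekly)
```

A final block with fewer than 7 rows is averaged over the rows it has.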

Use Pandas dataframe to add lag feature from MultiIndex Series

I have a MultiIndex Series (3 indices) that looks like this:
Week ID_1 ID_2
3 26 1182 39.0
4767 42.0
31393 20.0
31690 42.0
32962 3.0
....................................
I also have a dataframe df which contains all the columns (and more) used for indices in the Series above, and I want to create a new column in my dataframe df that contains the value matching the ID_1 and ID_2 and the Week - 2 from the Series.
For example, for the row in dataframe that has ID_1 = 26, ID_2 = 1182 and Week = 3, I want to match the value in the Series indexed by ID_1 = 26, ID_2 = 1182 and Week = 1 (3-2) and put it on that row in a new column. Further, my Series might not necessarily have the value required by the dataframe, in which case I'd like to just have 0.
Right now, I am trying to do this by using:
[multiindex_series.get((x[1].get('week', 2) - 2, x[1].get('ID_1', 0), x[1].get('ID_2', 0))) for x in df.iterrows()]
This however is very slow and memory hungry and I was wondering what are some better ways to do this.
FWIW, the Series was created using
saved_groupby = df.groupby(['Week', 'ID_1', 'ID_2'])['Target'].median()
and I'm willing to do it a different way if better paths exist to create what I'm looking for.
Increase the Week by 2:
saved_groupby = df.groupby(['Week', 'ID_1', 'ID_2'])['Target'].median()
saved_groupby = saved_groupby.reset_index()
saved_groupby['Week'] = saved_groupby['Week'] + 2
and then merge df with saved_groupby:
result = pd.merge(df, saved_groupby, on=['Week', 'ID_1', 'ID_2'], how='left')
This will augment df with the target median from 2 weeks ago.
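To make the median column from saved_groupby 0 when there is no match, use fillna to change NaNs to 0 (this assumes the Target column of saved_groupby has been renamed to Median, as in the full example below):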
result['Median'] = result['Median'].fillna(0)
For example,
import numpy as np
import pandas as pd
np.random.seed(2016)
df = pd.DataFrame(np.random.randint(5, size=(20,5)),
columns=['Week', 'ID_1', 'ID_2', 'Target', 'Foo'])
saved_groupby = df.groupby(['Week', 'ID_1', 'ID_2'])['Target'].median()
saved_groupby = saved_groupby.reset_index()
saved_groupby['Week'] = saved_groupby['Week'] + 2
saved_groupby = saved_groupby.rename(columns={'Target':'Median'})
result = pd.merge(df, saved_groupby, on=['Week', 'ID_1', 'ID_2'], how='left')
result['Median'] = result['Median'].fillna(0)
print(result)
yields
Week ID_1 ID_2 Target Foo Median
0 3 2 3 4 2 0.0
1 3 3 0 3 4 0.0
2 4 3 0 1 2 0.0
3 3 4 1 1 1 0.0
4 2 4 2 0 3 2.0
5 1 0 1 4 4 0.0
6 2 3 4 0 0 0.0
7 4 0 0 2 3 0.0
8 3 4 3 2 2 0.0
9 2 2 4 0 1 0.0
10 2 0 4 4 2 0.0
11 1 1 3 0 0 0.0
12 0 1 0 2 0 0.0
13 4 0 4 0 3 4.0
14 1 2 1 3 1 0.0
15 3 0 1 3 4 2.0
16 0 4 2 2 4 0.0
17 1 1 4 4 2 0.0
18 4 1 0 3 0 0.0
19 1 0 1 0 0 0.0
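An alternative that keeps the MultiIndex Series as it is would be to build the lookup keys (Week - 2, ID_1, ID_2) and reindex the Series with them. This is a sketch on made-up data, not a benchmarked replacement for the merge; keys and Median_lag2 are illustrative names:

```python
import pandas as pd

df = pd.DataFrame({
    "Week": [1, 3, 3],
    "ID_1": [26, 26, 26],
    "ID_2": [1182, 1182, 4767],
    "Target": [39.0, 10.0, 42.0],
})

# Same Series as in the question: median Target per (Week, ID_1, ID_2).
medians = df.groupby(["Week", "ID_1", "ID_2"])["Target"].median()

# For each row, the key it should look up: two weeks earlier, same IDs.
keys = pd.MultiIndex.from_arrays([df["Week"] - 2, df["ID_1"], df["ID_2"]])

# reindex returns one value per key and fills missing keys with 0.
df["Median_lag2"] = medians.reindex(keys, fill_value=0).to_numpy()
print(df)
```

This relies on the Series index being unique, which holds for a groupby result.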

Pandas groupby date

I have a DataFrame with events. One or more events can occur at a date (so the date can't be an index). The date range is several years. I want to groupby years and months and have a count of the Category values. Thnx
In [12]: df = pd.read_excel('Pandas_Test.xls', 'sheet1')
In [13]: df
Out[13]:
EventRefNr DateOccurence Type Category
0 86596 2010-01-02 00:00:00 3 Small
1 86779 2010-01-09 00:00:00 13 Medium
2 86780 2010-02-10 00:00:00 6 Small
3 86781 2010-02-09 00:00:00 17 Small
4 86898 2010-02-10 00:00:00 6 Small
5 86898 2010-02-11 00:00:00 6 Small
6 86902 2010-02-17 00:00:00 9 Small
7 86908 2010-02-19 00:00:00 3 Medium
8 86908 2010-03-05 00:00:00 3 Medium
9 86909 2010-03-06 00:00:00 8 Small
10 86930 2010-03-12 00:00:00 29 Small
11 86934 2010-03-16 00:00:00 9 Small
12 86940 2010-04-08 00:00:00 9 High
13 86941 2010-04-09 00:00:00 17 Small
14 86946 2010-04-14 00:00:00 10 Small
15 86950 2011-01-19 00:00:00 12 Small
16 86956 2011-01-24 00:00:00 13 Small
17 86959 2011-01-27 00:00:00 17 Small
I tried:
df.groupby(df['DateOccurence'])
For the month and year break out I often add additional columns to the data frame that break out the dates into each piece:
df['year'] = [t.year for t in df.DateOccurence]
df['month'] = [t.month for t in df.DateOccurence]
df['day'] = [t.day for t in df.DateOccurence]
It costs extra memory (the added columns), but the groupby then has less work to do than with a datetime index; it's really up to you. A datetime index is the more idiomatic pandas way to do things.
After breaking out by year, month, day you can do any groupby you need.
df.groupby(['year','month']).Category.apply(pd.value_counts)
To get months across multiple years:
df.groupby(['month']).Category.apply(pd.value_counts)
Or, using Andy Hayden's datetime index:
df.groupby(di.month).Category.apply(pd.value_counts)
You can simply pick which method fits your needs better.
You can apply value_counts to the SeriesGroupby (for the column):
In [11]: g = df.groupby('DateOccurence')
In [12]: g.Category.apply(pd.value_counts)
Out[12]:
DateOccurence
2010-01-02 Small 1
2010-01-09 Medium 1
2010-02-09 Small 1
2010-02-10 Small 2
2010-02-11 Small 1
2010-02-17 Small 1
2010-02-19 Medium 1
2010-03-05 Medium 1
2010-03-06 Small 1
2010-03-12 Small 1
2010-03-16 Small 1
2010-04-08 High 1
2010-04-09 Small 1
2010-04-14 Small 1
2011-01-19 Small 1
2011-01-24 Small 1
2011-01-27 Small 1
dtype: int64
I had actually hoped this would return the following DataFrame, but you need to unstack it:
In [13]: g.Category.apply(pd.value_counts).unstack(-1).fillna(0)
Out[13]:
High Medium Small
DateOccurence
2010-01-02 0 0 1
2010-01-09 0 1 0
2010-02-09 0 0 1
2010-02-10 0 0 2
2010-02-11 0 0 1
2010-02-17 0 0 1
2010-02-19 0 1 0
2010-03-05 0 1 0
2010-03-06 0 0 1
2010-03-12 0 0 1
2010-03-16 0 0 1
2010-04-08 1 0 0
2010-04-09 0 0 1
2010-04-14 0 0 1
2011-01-19 0 0 1
2011-01-24 0 0 1
2011-01-27 0 0 1
If there were multiple different Categories with the same Date they would be on the same row...
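To count Category values per year and month without adding helper columns, one option is to group directly on the .dt accessors. A minimal sketch on a few made-up rows, assuming DateOccurence is already a datetime column:

```python
import pandas as pd

df = pd.DataFrame({
    "DateOccurence": pd.to_datetime(["2010-01-02", "2010-01-09", "2010-02-10",
                                     "2010-02-10", "2011-01-19"]),
    "Category": ["Small", "Medium", "Small", "Small", "Small"],
})

# Group by year, month and Category, count rows, then pivot Category to columns.
counts = (
    df.groupby([df["DateOccurence"].dt.year.rename("year"),
                df["DateOccurence"].dt.month.rename("month"),
                "Category"])
      .size()
      .unstack(fill_value=0)
)
print(counts)
```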
