I have a data set (sample) like below
Date Value
2019-05-01 0
2019-05-02 0
2019-05-03 0
2019-05-04 0
2019-05-05 0
2019-05-06 0
2019-05-07 0
2019-05-08 1
2019-05-09 0
I want to transform it such that, if I encounter Value=1, then I take the 3 values from 2 days before and fill it as 1. Also set the current value to be 0.
In other words, the transformed data set should look like this
Date Value
2019-05-01 0
2019-05-02 0
2019-05-03 1
2019-05-04 1
2019-05-05 1
2019-05-06 0
2019-05-07 0
2019-05-08 0
2019-05-09 0
Notice that in the example above, 2019-05-08 was set to 0 after the transformation, and 2019-05-03 through 2019-05-05 were set to 1: there is a 2-day gap (2019-05-06 and 2019-05-07) before 2019-05-08, and the 3 days ending on 2019-05-05 are set to 1.
If two consecutive values show up as 1, we start the date calculation from the last value that shows up as 1.
I think I can do this via for loops, but was looking to see if any inbuilt functions can help me with this.
Thanks!
There may be more elegant ways to solve this. The approach I came up with uses the index positions i where Value == 1, takes the three preceding positions i-3, i-4 and i-5 (a 2-day gap, then three rows), and sets Value to 1 there. Finally, it sets Value back to 0 at the positions where Value == 1 was originally found.
In [53]: df = pd.DataFrame({'Date':['2019-05-01','2019-05-02', '2019-05-03','2019-05-04','2019-05-05', '2019-05-06',
    ...:                    '2019-05-07','2019-05-08','2019-05-09'], 'Value':[0,0,0,0,0,0,0,1,0]})
In [54]: val_1_index = df.loc[df.Value == 1].index.tolist()
In [55]: val_1_index_decr = [(i-3, i-4, i-5) for i in val_1_index]
In [56]: df.loc[df['Value'].index.isin([i for i in val_1_index_decr[0]]), 'Value'] = 1
In [57]: df.loc[df['Value'].index.isin(val_1_index), 'Value'] = 0
In [58]: df
Out[58]:
Date Value
0 2019-05-01 0
1 2019-05-02 0
2 2019-05-03 1
3 2019-05-04 1
4 2019-05-05 1
5 2019-05-06 0
6 2019-05-07 0
7 2019-05-08 0
8 2019-05-09 0
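The steps above only use the first matching index (`val_1_index_decr[0]`). A minimal sketch of the same index arithmetic for several 1s follows. It assumes one row per day and that, for a run of consecutive 1s, only the last one counts, as the question asks. The names `fill_before`, `gap` and `width` are hypothetical, not part of the original answer.

```python
import numpy as np
import pandas as pd

def fill_before(values, gap=3, width=3):
    """Return a 0/1 Series: for each trigger (the last 1 of a run), set the
    `width` rows ending `gap` rows before it to 1; every other row becomes 0."""
    v = values.to_numpy()
    # a trigger is a 1 that is not followed by another 1
    nxt = np.append(v[1:], 0)
    triggers = np.flatnonzero((v == 1) & (nxt != 1))
    # positions gap, gap+1, ..., gap+width-1 rows before each trigger
    targets = (triggers[:, None] - np.arange(gap, gap + width)).ravel()
    targets = targets[targets >= 0]  # drop positions before the first row
    out = np.zeros(len(v), dtype=int)
    out[targets] = 1
    return pd.Series(out, index=values.index)

df = pd.DataFrame({
    "Date": pd.date_range("2019-05-01", periods=9).strftime("%Y-%m-%d"),
    "Value": [0, 0, 0, 0, 0, 0, 0, 1, 0],
})
df["Value"] = fill_before(df["Value"])
```

Filtering out negative positions matters here: without it, a 1 near the top of the frame would wrap around and mark rows at the end.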
A one line solution, assuming that df is your original dataframe:
df['Value'] = pd.Series([1 if 1 in df['Value'].iloc[i+3:i+6].values else 0 for i in df.index])
Here I work on index rather than dates, so I assume that you have one day per row and days are consecutive as shown in your example.
To fit also for this request:
If two consecutive values show up as 1, we start the date calculation from the last value that shows up as 1.
I can propose a two line solution:
validones = [df['Value'].iloc[i] == 1 and (i + 1 == len(df) or df['Value'].iloc[i+1] == 0) for i in range(len(df))]
df['Value'] = pd.Series([1 if any(validones[i+3:i+6]) else 0 for i in range(len(validones))])
Basically, I first build a list of booleans that marks each 1 in df['Value'] that is not followed by another 1, and then use this list to perform the substitutions.
I'm not sure about the efficiency of this solution, because it builds three shifted copies of the column, but it also works:
df['shiftedValues'] = \
df['Value'].shift(-3, fill_value=0) + \
df['Value'].shift(-4, fill_value=0) + \
df['Value'].shift(-5, fill_value=0)
Note that the shift is done by row and not by day.
To shift by actual days I would first index by dates
df['Date'] = pd.to_datetime(df['Date'])
df = df.set_index('Date')
# fill_value has no effect when freq is given, so it is omitted here
df['shiftedValues'] = \
df['Value'].shift(-3, freq='1D').asof(df.index) + \
df['Value'].shift(-4, freq='1D').asof(df.index) + \
df['Value'].shift(-5, freq='1D').asof(df.index)
# Out:
# Value shiftedValues
# Date
# 2019-05-01 0 0.0
# 2019-05-02 0 0.0
# 2019-05-03 0 1.0
# 2019-05-04 0 1.0
# 2019-05-05 0 1.0
# 2019-05-06 0 0.0
# 2019-05-07 0 0.0
# 2019-05-08 1 0.0
# 2019-05-09 0 0.0
Now this works correctly for dates, for instance if df is (note the missing and repeated days)
Date Value
0 2019-05-01 0
1 2019-05-02 0
2 2019-05-03 0
3 2019-05-04 0
4 2019-05-05 0
5 2019-05-05 0
6 2019-05-07 0
7 2019-05-08 1
8 2019-05-09 0
then you get
Value shiftedValues
Date
2019-05-01 0 0.0
2019-05-02 0 0.0
2019-05-03 0 1.0
2019-05-04 0 1.0
2019-05-05 0 1.0
2019-05-05 0 1.0
2019-05-07 0 0.0
2019-05-08 1 0.0
2019-05-09 0 0.0
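The same date-based result can be sketched with plain date arithmetic instead of shifting and `asof`. This is only an illustration under stated assumptions: `Date` is already parsed as datetime, a trigger is any row with Value == 1, and the names `trigger_dates`, `target_dates` and `NewValue` are made up for the example.

```python
import pandas as pd

# same sample as above: 2019-05-06 is missing and 2019-05-05 is repeated
df = pd.DataFrame({
    "Date": pd.to_datetime([
        "2019-05-01", "2019-05-02", "2019-05-03", "2019-05-04", "2019-05-05",
        "2019-05-05", "2019-05-07", "2019-05-08", "2019-05-09",
    ]),
    "Value": [0, 0, 0, 0, 0, 0, 0, 1, 0],
})

trigger_dates = df.loc[df["Value"] == 1, "Date"]
# calendar dates 3, 4 and 5 days before every trigger date
target_dates = pd.concat(
    [trigger_dates - pd.Timedelta(days=k) for k in (3, 4, 5)]
)
# a row is marked when its date is one of the target dates
df["NewValue"] = df["Date"].isin(target_dates).astype(int)
```

Because the match is on the date itself, missing days are simply skipped and repeated days are all marked.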
Related
I have a DataFrame of store sales for 1115 stores with dates over about 2.5 years. The StateHoliday column is a categorical variable indicating the type of holiday it is. See the piece of the df below. As can be seen, b is the code for Easter. There are other codes for other holidays.
[Image: piece of the DataFrame]
My objective is to analyze sales before and during a holiday. The way I seek to do this is to change the value of the StateHoliday column to something unique for the few days before a particular holiday. For example, b is the code for Easter, so I could change the value to b- indicating that the day is shortly before Easter. The only way I can think to do this is to go through and manually change these values for certain dates. There aren't THAT many holidays, so it wouldn't be that hard to do. But still very annoying!
Tom, see if this works for you, if not please provide additional information:
In the file I have the following data:
Store,Sales,Date,StateHoliday
1,6729,2013-03-25,0
1,6686,2013-03-26,0
1,6660,2013-03-27,0
1,7285,2013-03-28,0
1,6729,2013-03-29,b
1115,10712,2015-07-01,0
1115,11110,2015-07-02,0
1115,10500,2015-07-03,0
1115,12000,2015-07-04,c
import pandas as pd
fname = r"D:\workspace\projects\misc\data\holiday_sales.csv"
df = pd.read_csv(fname)
df["Date"] = pd.to_datetime(df["Date"])
holidays = df[df["StateHoliday"]!="0"].copy(deep=True) # taking only holidays
dictDate2Holiday = dict(zip(holidays["Date"].tolist(), holidays["StateHoliday"].tolist()))
look_back = 2 # how many days back you want to go
holiday_look_back = []
# building a list of pairs (prev days, holiday code)
for dt, h in dictDate2Holiday.items():
    prev = dt
    holiday_look_back.append((prev, h))
    for i in range(1, look_back+1):
        prev = prev - pd.Timedelta(days=1)
        holiday_look_back.append((prev, h))
dfHolidayLookBack = pd.DataFrame(holiday_look_back, columns=["Date", "StateHolidayNew"])
df = df.merge(dfHolidayLookBack, how="left", on="Date")
df["StateHolidayNew"] = df["StateHolidayNew"].fillna("0")  # assign back instead of inplace on a column selection
print(df)
The column StateHolidayNew should now have the information you need to start analyzing your data.
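The question asked for a distinct marker such as b- on the days before a holiday, whereas the merge above reuses the holiday's own code for those days. A possible variation is sketched below. It assumes holidays fall on the same date for every store, so a lookup by date alone is enough, and that when two look-back windows overlap the first one wins. The names `before`, `lookup` and `marked` are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({
    "Store": [1, 1, 1, 1, 1],
    "Sales": [6729, 6686, 6660, 7285, 6729],
    "Date": pd.to_datetime(
        ["2013-03-25", "2013-03-26", "2013-03-27", "2013-03-28", "2013-03-29"]
    ),
    "StateHoliday": ["0", "0", "0", "0", "b"],
})

look_back = 2  # assumed window: how many days before a holiday to mark

holidays = df.loc[df["StateHoliday"] != "0", ["Date", "StateHoliday"]].drop_duplicates()
# one row per (holiday, offset): the date k days earlier, labelled e.g. "b-"
before = pd.concat(
    [holidays.assign(Date=holidays["Date"] - pd.Timedelta(days=k))
     for k in range(1, look_back + 1)]
)
before["StateHoliday"] = before["StateHoliday"] + "-"
lookup = before.drop_duplicates("Date").set_index("Date")["StateHoliday"]

# real holiday codes are kept; only plain days ("0") can become "b-"
marked = df["Date"].map(lookup).fillna("0")
df["StateHolidayNew"] = df["StateHoliday"].where(df["StateHoliday"] != "0", marked)
```

Grouping by `StateHolidayNew` then separates sales on the holiday (`b`) from sales shortly before it (`b-`).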
Assuming you have a dataframe like this:
Store Sales Date StateHoliday
0 2 4205 2016-11-15 0
1 1 684 2016-07-13 0
2 2 8946 2017-04-15 0
3 1 6929 2017-02-02 0
4 2 8296 2017-10-30 b
5 1 8261 2015-10-05 0
6 2 3904 2016-08-22 0
7 1 2613 2017-12-30 0
8 2 1324 2016-08-23 0
9 1 6961 2015-11-11 0
10 2 15 2016-12-06 a
11 1 9107 2016-07-05 0
12 2 1138 2015-03-29 0
13 1 7590 2015-06-24 0
14 2 5172 2017-04-29 0
15 1 660 2016-06-21 0
16 2 2539 2017-04-25 0
What you can do is group the rows that lie between the letter codes representing the holidays, and then use groupby to find the sales for each group. An improvement would be to backfill the holiday code onto the rows before it, e.g. groups=0.0 would become b_0, which would make it easier to see which holiday each group belongs to, but I am not sure how to do that.
import numpy as np
df['StateHolidayBool'] = df['StateHoliday'].str.isalpha().fillna(False).replace({False: 0, True: 1})
df = df.assign(group = (df[~df['StateHolidayBool'].between(1,1)].index.to_series().diff() > 1).cumsum())
df = df.assign(groups = np.where(df.group.notna(), df.group, df.StateHoliday)).drop(['StateHolidayBool', 'group'], axis=1)
df[~df['groups'].str.isalpha().fillna(False)].groupby('groups').sum()
Output:
Store Sales
groups
0.0 6 20764
1.0 7 23063
2.0 9 26206
Final DataFrame:
Store Sales Date StateHoliday groups
0 2 4205 2016-11-15 0 0.0
1 1 684 2016-07-13 0 0.0
2 2 8946 2017-04-15 0 0.0
3 1 6929 2017-02-02 0 0.0
4 2 8296 2017-10-30 b b
5 1 8261 2015-10-05 0 1.0
6 2 3904 2016-08-22 0 1.0
7 1 2613 2017-12-30 0 1.0
8 2 1324 2016-08-23 0 1.0
9 1 6961 2015-11-11 0 1.0
10 2 15 2016-12-06 a a
11 1 9107 2016-07-05 0 2.0
12 2 1138 2015-03-29 0 2.0
13 1 7590 2015-06-24 0 2.0
14 2 5172 2017-04-29 0 2.0
15 1 660 2016-06-21 0 2.0
16 2 2539 2017-04-25 0 2.0
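The backfilling idea mentioned in this answer could look roughly like the sketch below. It assumes the rows are already in the order you want to group by (in practice you would probably sort by Date within each store first) and that StateHoliday holds strings with "0" for ordinary days. The labels `before_<code>` and `after_last` are invented for the example.

```python
import pandas as pd

df = pd.DataFrame({
    "Sales": [4205, 684, 8296, 8261, 3904, 15, 9107],
    "StateHoliday": ["0", "0", "b", "0", "0", "a", "0"],
})

is_holiday = df["StateHoliday"] != "0"
# code of the next holiday at or after each row (NaN after the last holiday)
next_code = df["StateHoliday"].where(is_holiday).bfill()
# label for ordinary rows: "before_b", "before_a", or "after_last"
before_label = ("before_" + next_code.fillna("")).where(next_code.notna(), "after_last")
# holiday rows keep their own code
df["groups"] = df["StateHoliday"].where(is_holiday, before_label)

sales_by_group = df.groupby("groups")["Sales"].sum()
```

Each group name now says which holiday it precedes, so no numeric group ids need to be decoded.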
I have a df that looks like this:
And I'm trying to turn it into this:
The following code gets me a list of lists that I can convert to a df. It includes the first 3 columns of the expected output, but I'm not sure how to get the number columns I need (note: I have way more than 3 number columns, but I'm using this as a simple illustration).
x=[['ID','Start','End','Number1','Number2','Number3']]
for i in range(len(df)):
    if not(df.iloc[i-1]['DateSpellIndicator']):
        ID= df.iloc[i]['ID']
        start = df.iloc[i]['Date']
    if not(df.iloc[i]['DateSpellIndicator']):
        newrow = [ID, start,df.iloc[i]['Date'],...]
        x.append(newrow)
Here's one way to do it by making use of pandas groupby.
Input Dataframe:
ID DATE NUM TORF
0 1 2020-01-01 40 True
1 1 2020-02-01 50 True
2 1 2020-03-01 60 False
3 1 2020-06-01 70 True
4 2 2020-07-01 20 True
5 2 2020-08-01 30 False
Output Dataframe:
END ID Number1 Number2 Number3 START
0 2020-08-01 2 20 30.0 NaN 2020-07-01
1 2020-06-01 1 70 NaN NaN 2020-06-01
2 2020-03-01 1 40 50.0 60.0 2020-01-01
Code:
import numpy as np
import pandas as pd
new_df=pd.DataFrame()
#create groups based on ID
for index, row in df.groupby('ID'):
    #Within each group split at the occurrence of False
    dfnew=np.split(row, np.where(row.TORF == False)[0] + 1)
    for sub_df in dfnew:
        #within each subgroup
        if sub_df.empty==False:
            dfmod=pd.DataFrame({'ID':sub_df['ID'].iloc[0],'START':sub_df['DATE'].iloc[0],'END':sub_df['DATE'].iloc[-1]},index=[0])
            j=0
            for nindex, srow in sub_df.iterrows():
                dfmod['Number{}'.format(j+1)]=srow['NUM']
                j=j+1
            #concatenate the existing and modified dataframes
            new_df=pd.concat([dfmod, new_df], axis=0)
new_df.reset_index(drop=True)
Some of the steps could be reduced to get the same output.
I used cumsum to get the first and last date of each spell, and list to collect the numbers into columns the way you want. Please note that the output has different column names than your example; I assume you can rename them as needed.
df['new1'] = ~df['datespell']
df['new2'] = df['new1'].cumsum()-df['new1']
# named aggregation; the nested-dict form of agg is not supported by recent pandas versions
check = df.groupby(['id', 'new2']).agg(start=('date', 'first'), end=('date', 'last'), cols=('number', lambda x: list(x)))
check = check.reset_index()
pd.concat([check,check['cols'].apply(pd.Series)], axis=1).drop(['cols'], axis=1)
id new2 start end 0 1 2
0 1 0 2020-01-01 2020-03-01 40.0 50.0 60.0
1 1 1 2020-06-01 2020-06-01 70.0 NaN NaN
2 2 1 2020-07-01 2020-08-01 20.0 30.0 NaN
Here is the dataframe I used.
id date number datespell new1 new2
0 1 2020-01-01 40 True False 0
1 1 2020-02-01 50 True False 0
2 1 2020-03-01 60 False True 0
3 1 2020-06-01 70 True False 1
4 2 2020-07-01 20 True False 1
5 2 2020-08-01 30 False True 1
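As an alternative to collecting lists and expanding them, the position of each row inside its spell can be computed with `cumcount` and then unstacked into columns. The sketch below assumes the column names of the dataframe just shown. Unlike the code above, it numbers spells per id rather than globally, and the names `spell`, `pos`, `bounds` and `numbers` are invented for the example.

```python
import pandas as pd

df = pd.DataFrame({
    "id": [1, 1, 1, 1, 2, 2],
    "date": pd.to_datetime([
        "2020-01-01", "2020-02-01", "2020-03-01",
        "2020-06-01", "2020-07-01", "2020-08-01",
    ]),
    "number": [40, 50, 60, 70, 20, 30],
    "datespell": [True, True, False, True, True, False],
})

# a spell ends on a False row, so a new spell starts on the row after it
ends = (~df["datespell"]).astype(int)
df["spell"] = ends.groupby(df["id"]).cumsum() - ends
# position of each row inside its spell: 1, 2, 3, ...
df["pos"] = df.groupby(["id", "spell"]).cumcount() + 1

keys = ["id", "spell"]
bounds = df.groupby(keys)["date"].agg(start="first", end="last")
# one column per position: Number1, Number2, ...
numbers = df.set_index(keys + ["pos"])["number"].unstack("pos").add_prefix("Number")
result = bounds.join(numbers).reset_index()
```

This scales to any number of Number columns, since the column count is driven by the longest spell.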
I have a Dataframe of the form
date_time uids
2018-10-16 23:00:00 1000,1321,7654,1321
2018-10-16 23:10:00 7654
2018-10-16 23:20:00 NaN
2018-10-16 23:30:00 7654,1000,7654,1321,1000
2018-10-16 23:40:00 691,3974,3974,323
2018-10-16 23:50:00 NaN
2018-10-17 00:00:00 NaN
2018-10-17 00:10:00 NaN
2018-10-17 00:20:00 27,33,3974,3974,7665,27
This is a very big DataFrame containing 10-minute time intervals and the ids that appear during each interval.
I want to iterate over this DataFrame 6 rows at a time (corresponding to 1 hour) and create a DataFrame containing each ID and the number of times it appears in each interval of that hour.
Expected output is one dataframe per hour information. For example, in the above case dataframe for the hour 23 - 00 will have this form
uid 1 2 3 4 5 6
1000 1 0 0 2 0 0
1321 2 0 0 1 0 0
and so on
How can I do this efficiently?
I don't have an exact solution but you could create a pivot table: ids on the index and datetimes on the columns. Then you just have to select the columns you want.
import pandas as pd
import numpy as np
df = pd.DataFrame(
{
"date_time": [
"2018-10-16 23:00:00",
"2018-10-16 23:10:00",
"2018-10-16 23:20:00",
"2018-10-16 23:30:00",
"2018-10-16 23:40:00",
"2018-10-16 23:50:00",
"2018-10-17 00:00:00",
"2018-10-17 00:10:00",
"2018-10-17 00:20:00",
],
"uids": [
"1000,1321,7654,1321",
"7654",
np.nan,
"7654,1000,7654,1321,1000",
"691,3974,3974,323",
np.nan,
np.nan,
np.nan,
"27,33,3974,3974,7665,27",
],
}
)
df["date_time"] = pd.to_datetime(df["date_time"])
df = (
df.set_index("date_time") #do not use set_index if date_time is current index
.loc[:, "uids"]
.str.extractall(r"(?P<uids>\d+)")
.droplevel(level=1)
) # separate all the ids
df["number"] = df.index.minute.astype(float) / 10 + 1 # get the number 1 to 6 depending on the minutes
df_pivot = df.pivot_table(
values="number",
index="uids",
columns=["date_time"],
) #dataframe with all the uids on the index and all the datetimes in columns.
You can apply this to the whole dataframe or just a subset containing 6 rows. Then you rename your columns.
You can use the function crosstab:
df['uids'] = df['uids'].str.split(',')
df = df.explode('uids')
df['date_time'] = df['date_time'].dt.minute.floordiv(10).add(1)
pd.crosstab(df['uids'], df['date_time'], dropna=False)
Output:
date_time 1 2 3 4 5 6
uids
1000 1 0 0 2 0 0
1321 2 0 0 1 0 0
27 0 0 2 0 0 0
323 0 0 0 0 1 0
33 0 0 1 0 0 0
3974 0 0 2 0 2 0
691 0 0 0 0 1 0
7654 1 1 0 2 0 0
7665 0 0 1 0 0 0
We can achieve this by extracting the minutes from your datetime column and then using pivot_table to get your wide format:
df['date_time'] = pd.to_datetime(df['date_time'])
df['minute'] = df['date_time'].dt.minute // 10
piv = (df.assign(uids=df['uids'].str.split(','))
.explode('uids')
.pivot_table(index='uids', columns='minute', values='minute', aggfunc='size')
)
minute 0 1 2 3 4
uids
1000 1.0 NaN NaN 2.0 NaN
1321 2.0 NaN NaN 1.0 NaN
27 NaN NaN 2.0 NaN NaN
323 NaN NaN NaN NaN 1.0
33 NaN NaN 1.0 NaN NaN
3974 NaN NaN 2.0 NaN 2.0
691 NaN NaN NaN NaN 1.0
7654 1.0 1.0 NaN 2.0 NaN
7665 NaN NaN 1.0 NaN NaN
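The answers above collapse all hours into a single table, while the question asks for one table per hour. A rough sketch of that extra step: group the exploded rows by hour and build a crosstab per group. The names `long_df`, `hour`, `slot` and `tables` are invented here, and the hour key is simply a formatted string.

```python
import pandas as pd

df = pd.DataFrame({
    "date_time": pd.to_datetime([
        "2018-10-16 23:00:00", "2018-10-16 23:10:00", "2018-10-16 23:20:00",
        "2018-10-16 23:30:00", "2018-10-16 23:40:00", "2018-10-16 23:50:00",
        "2018-10-17 00:00:00", "2018-10-17 00:10:00", "2018-10-17 00:20:00",
    ]),
    "uids": [
        "1000,1321,7654,1321", "7654", None, "7654,1000,7654,1321,1000",
        "691,3974,3974,323", None, None, None, "27,33,3974,3974,7665,27",
    ],
})

# one row per (timestamp, uid); intervals without any uid are dropped
long_df = (
    df.assign(uid=df["uids"].str.split(","))
      .explode("uid")
      .dropna(subset=["uid"])
)
long_df["hour"] = long_df["date_time"].dt.strftime("%Y-%m-%d %H:00")
long_df["slot"] = long_df["date_time"].dt.minute // 10 + 1  # 1..6 within the hour

# one count table per hour, always with the six slot columns 1..6
tables = {
    hour: pd.crosstab(grp["uid"], grp["slot"]).reindex(columns=range(1, 7), fill_value=0)
    for hour, grp in long_df.groupby("hour")
}
```

`tables["2018-10-16 23:00"]` then holds the 23:00 to 00:00 table from the question, and hours without any ids simply have no entry.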
I have a dataframe with a column that consists of datetime values, one that consists of speed values, and one that consists of timedelta values between the rows.
I would want to get the cumulative sum of timedeltas whenever the speed is below 2 knots. When the speed rises above 2 knots, I would like this cumulative sum to reset to 0, and then to start summing at the next instance of speed observations below 2 knots.
I have started by flagging all observations of speed values < 2. I only manage to get the cumulative sum for all of the observations with speed < 2, but not a cumulative sum separated for each instance.
The dataframe looks like this, and cum_sum is the desired output:
datetime speed timedelta cum_sum flag
1-1-2019 19:30:00 0.5 0 0 1
1-1-2019 19:32:00 0.7 2 2 1
1-1-2019 19:34:00 0.1 2 4 1
1-1-2019 19:36:00 5.0 2 0 0
1-1-2019 19:38:00 25.0 2 0 0
1-1-2019 19:42:00 0.1 4 4 1
1-1-2019 19:49:00 0.1 7 11 1
You can use the method from "How to groupby consecutive values in pandas DataFrame" to get the groups where flag is either 1 or 0, and then you will just need to apply the cumsum on the timedelta column, and set those values where flag == 0 to 0:
gb = df.groupby((df['flag'] != df['flag'].shift()).cumsum())
df['cum_sum'] = gb['timedelta'].cumsum()
df.loc[df['flag'] == 0, 'cum_sum'] = 0
print(df)
will give
datetime speed timedelta flag cum_sum
0 1-1-2019 19:30:00 0.5 0 1 0
1 1-1-2019 19:32:00 0.7 2 1 2
2 1-1-2019 19:34:00 0.1 2 1 4
3 1-1-2019 19:36:00 5.0 2 0 0
4 1-1-2019 19:38:00 25.0 2 0 0
5 1-1-2019 19:42:00 0.1 4 1 4
6 1-1-2019 19:49:00 0.1 7 1 11
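The grouping above relies on a precomputed flag column. A compact variant that derives everything from speed is sketched below, assuming "below 2 knots" means speed < 2 and that rows are in time order. The names `slow` and `run_id` are made up for the example.

```python
import pandas as pd

df = pd.DataFrame({
    "speed": [0.5, 0.7, 0.1, 5.0, 25.0, 0.1, 0.1],
    "timedelta": [0, 2, 2, 2, 2, 4, 7],
})

slow = df["speed"] < 2
# every fast row bumps the counter, so each slow run gets its own id
run_id = (~slow).cumsum()
# fast rows contribute 0, so each run's running total starts from 0
df["cum_sum"] = df["timedelta"].where(slow, 0).groupby(run_id).cumsum()
```

A fast row shares its run id with the slow rows that follow it, but since its contribution is replaced by 0 it does not affect their totals.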
Note: Uses global variable
c = 0
def fun(x):
    global c
    if x['speed'] > 2.0:
        c = 0
    else:
        c = x['timedelta']+c
    return c
df = pd.DataFrame( {'datetime': ['1-1-2019 19:30:00']*7,
                    'speed': [0.5,.7,0.1,5.0,25.0,0.1,0.1], 'timedelta': [0,2,2,2,2,4,7]})
df['cum_sum']=df.apply(fun, axis=1)
datetime speed timedelta cum_sum
0 1-1-2019 19:30:00 0.5 0 0
1 1-1-2019 19:30:00 0.7 2 2
2 1-1-2019 19:30:00 0.1 2 4
3 1-1-2019 19:30:00 5.0 2 0
4 1-1-2019 19:30:00 25.0 2 0
5 1-1-2019 19:30:00 0.1 4 4
6 1-1-2019 19:30:00 0.1 7 11
I am working with data like the following. The dataframe is sorted by the date:
category value Date
0 1 24/5/2019
1 NaN 24/5/2019
1 1 26/5/2019
2 2 1/6/2019
1 2 23/7/2019
2 NaN 18/8/2019
2 3 20/8/2019
7 3 1/9/2019
1 NaN 12/9/2019
2 NaN 13/9/2019
I would like to replace the "NaN" values with the previous mean for that specific category.
What is the best way to do this in pandas?
Some approaches I considered:
1) This little riff:
df['mean'] = df.groupby('category')['time'].apply(lambda x: x.shift().expanding().mean())
source
This gets me the correct means, but in another column, and it does not replace the NaNs.
2) This riff replaces the NaNs with the average of the columns:
df = df.groupby(df.columns, axis = 1).transform(lambda x: x.fillna(x.mean()))
Source 2
Both of these do not exactly give what I want. If someone could guide me on this it would be much appreciated!
You can replace the NaNs in value with a new Series built from shift + expanding + mean. The first value of category 1 is not replaced, because no previous non-NaN values exist for it:
df['Date'] = pd.to_datetime(df['Date'], dayfirst=True)  # the dates are day/month/year
# group_keys=False keeps the original index so that fillna can align on it
s = df.groupby('category', group_keys=False)['value'].apply(lambda x: x.shift().expanding().mean())
df['value'] = df['value'].fillna(s)
print (df)
category value Date
0 0 1.0 2019-05-24
1 1 NaN 2019-05-24
2 1 1.0 2019-05-26
3 2 2.0 2019-06-01
4 1 2.0 2019-07-23
5 2 2.0 2019-08-18
6 2 3.0 2019-08-20
7 7 3.0 2019-09-01
8 1 1.5 2019-09-12
9 2 2.5 2019-09-13
You can use pandas.Series.fillna to replace NaN values:
df['value']=df['value'].fillna(df.groupby('category')['value'].transform(lambda x: x.shift().expanding().mean()))
print(df)
category value Date
0 0 1.0 24/5/2019
1 1 NaN 24/5/2019
2 1 1.0 26/5/2019
3 2 2.0 1/6/2019
4 1 2.0 23/7/2019
5 2 2.0 18/8/2019
6 2 3.0 20/8/2019
7 7 3.0 1/9/2019
8 1 1.5 12/9/2019
9 2 2.5 13/9/2019
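To see what shift + expanding + mean does inside one group, here is a small sketch on the category 1 values from the question. The variable names are invented for the example.

```python
import numpy as np
import pandas as pd

# category 1 from the question, in date order
x = pd.Series([np.nan, 1.0, 2.0, np.nan])

shifted = x.shift()                    # each row now sees only earlier values
prev_mean = shifted.expanding().mean() # running mean of those earlier values
filled = x.fillna(prev_mean)           # only the NaN rows take the running mean

# expanding().mean() skips NaN, so on NaN rows the unshifted mean is the same
alt = x.fillna(x.expanding().mean())
```

The first row stays NaN because there is nothing before it to average, and the last row becomes 1.5, the mean of 1.0 and 2.0. On this sample `alt` fills the same values as `filled`. The shift matters when you want the previous mean on every row (as in the separate mean column), not only on the NaN rows.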