I have this:
x[l[0]] = pd.to_datetime(x[l[0]], format="%Y-%m-%d %H:%M:%S")
where l = list(x).
How can I get the difference between these objects in seconds? If I do
x[l[0]][1] - x[l[0]][2]
it returns a timedelta object.
print (x[:5])
LogDate Query_BoxID_ID Query_Function_ID SC_Win32_Status
0 2017-06-15 09:50:14 12 24 0
1 2017-06-15 09:50:14 12 26 0
2 2017-06-15 09:50:14 12 26 0
3 2017-06-15 09:50:14 12 30 0
4 2017-06-15 09:50:32 12 19 0
Use diff to get the timedeltas, then convert them with total_seconds:
# convert column to datetime
x['LogDate'] = pd.to_datetime(x['LogDate'], format="%Y-%m-%d %H:%M:%S")
# the first value is always NaN, so replace it with 0 via fillna and cast to int
a = x['LogDate'].diff().dt.total_seconds().fillna(0).astype(int)
print (a)
0 0
1 0
2 0
3 0
4 18
Name: LogDate, dtype: int32
# difference between two specific rows as an int (rows 0 and 1 share the same timestamp)
b = int((x.loc[1, 'LogDate'] - x.loc[0, 'LogDate']).total_seconds())
print (b)
0
I can just do
(x[l[0]][1]-x[l[0]][2]).total_seconds()
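As a minimal self-contained sketch of that (the Series s is illustrative, built from the sample timestamps above):
import pandas as pd

s = pd.to_datetime(pd.Series(['2017-06-15 09:50:14', '2017-06-15 09:50:32']))
print((s[1] - s[0]).total_seconds())  # 18.0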
I have a DataFrame of store sales for 1115 stores with dates over about 2.5 years. The StateHoliday column is a categorical variable indicating the type of holiday it is. See the piece of the df below. As can be seen, b is the code for Easter. There are other codes for other holidays.
[Image: piece of the DataFrame]
My objective is to analyze sales before and during a holiday. The way I seek to do this is to change the value of the StateHoliday column to something unique for the few days before a particular holiday. For example, b is the code for Easter, so I could change the value to b- indicating that the day is shortly before Easter. The only way I can think to do this is to go through and manually change these values for certain dates. There aren't THAT many holidays, so it wouldn't be that hard to do. But still very annoying!
Tom, see if this works for you; if not, please provide additional information:
In the file I have the following data:
Store,Sales,Date,StateHoliday
1,6729,2013-03-25,0
1,6686,2013-03-26,0
1,6660,2013-03-27,0
1,7285,2013-03-28,0
1,6729,2013-03-29,b
1115,10712,2015-07-01,0
1115,11110,2015-07-02,0
1115,10500,2015-07-03,0
1115,12000,2015-07-04,c
import pandas as pd
fname = r"D:\workspace\projects\misc\data\holiday_sales.csv"
df = pd.read_csv(fname)
df["Date"] = pd.to_datetime(df["Date"])
holidays = df[df["StateHoliday"]!="0"].copy(deep=True) # taking only holidays
dictDate2Holiday = dict(zip(holidays["Date"].tolist(), holidays["StateHoliday"].tolist()))
look_back = 2 # how many days back you want to go
holiday_look_back = []
# building a list of pairs (prev days, holiday code)
for dt, h in dictDate2Holiday.items():
    prev = dt
    holiday_look_back.append((prev, h))
    for i in range(1, look_back + 1):
        prev = prev - pd.Timedelta(days=1)
        holiday_look_back.append((prev, h))
dfHolidayLookBack = pd.DataFrame(holiday_look_back, columns=["Date", "StateHolidayNew"])
df = df.merge(dfHolidayLookBack, how="left", on="Date")
df["StateHolidayNew"].fillna("0", inplace=True)
print(df)
The column StateHolidayNew should have the info you need to start analyzing your data.
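For the sample CSV above, the merged frame should come out like this (a sketch of the expected result, not output from the original post):
   Store  Sales       Date StateHoliday StateHolidayNew
0      1   6729 2013-03-25            0               0
1      1   6686 2013-03-26            0               0
2      1   6660 2013-03-27            0               b
3      1   7285 2013-03-28            0               b
4      1   6729 2013-03-29            b               b
5   1115  10712 2015-07-01            0               0
6   1115  11110 2015-07-02            0               c
7   1115  10500 2015-07-03            0               c
8   1115  12000 2015-07-04            c               c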
Assuming you have a dataframe like this:
Store Sales Date StateHoliday
0 2 4205 2016-11-15 0
1 1 684 2016-07-13 0
2 2 8946 2017-04-15 0
3 1 6929 2017-02-02 0
4 2 8296 2017-10-30 b
5 1 8261 2015-10-05 0
6 2 3904 2016-08-22 0
7 1 2613 2017-12-30 0
8 2 1324 2016-08-23 0
9 1 6961 2015-11-11 0
10 2 15 2016-12-06 a
11 1 9107 2016-07-05 0
12 2 1138 2015-03-29 0
13 1 7590 2015-06-24 0
14 2 5172 2017-04-29 0
15 1 660 2016-06-21 0
16 2 2539 2017-04-25 0
What you can do is split the rows into groups between the letter codes that represent the holidays, and then use groupby to find the sales for each group. An improvement would be to relabel the numbered groups with the holiday that follows them, e.g. group 0.0 would become b_0, which would make it easier to see which holiday each group precedes, but I am not sure how to do that (see the sketch after the output below).
import numpy as np

# flag holiday rows (letter codes) as 1, ordinary days ('0') as 0
df['StateHolidayBool'] = df['StateHoliday'].str.isalpha().fillna(False).replace({False: 0, True: 1})
# number the stretches of ordinary days between holidays
df = df.assign(group = (df[~df['StateHolidayBool'].between(1,1)].index.to_series().diff() > 1).cumsum())
df = df.assign(groups = np.where(df.group.notna(), df.group, df.StateHoliday)).drop(['StateHolidayBool', 'group'], axis=1)
df[~df['groups'].str.isalpha().fillna(False)].groupby('groups').sum()
Output:
Store Sales
groups
0.0 6 20764
1.0 7 23063
2.0 9 26206
Final DataFrame:
Store Sales Date StateHoliday groups
0 2 4205 2016-11-15 0 0.0
1 1 684 2016-07-13 0 0.0
2 2 8946 2017-04-15 0 0.0
3 1 6929 2017-02-02 0 0.0
4 2 8296 2017-10-30 b b
5 1 8261 2015-10-05 0 1.0
6 2 3904 2016-08-22 0 1.0
7 1 2613 2017-12-30 0 1.0
8 2 1324 2016-08-23 0 1.0
9 1 6961 2015-11-11 0 1.0
10 2 15 2016-12-06 a a
11 1 9107 2016-07-05 0 2.0
12 2 1138 2015-03-29 0 2.0
13 1 7590 2015-06-24 0 2.0
14 2 5172 2017-04-29 0 2.0
15 1 660 2016-06-21 0 2.0
16 2 2539 2017-04-25 0 2.0
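As a hedged sketch of the relabeling improvement mentioned above (not part of the original answer; the next_holiday column and the _0 suffix are illustrative choices): backfill the holiday codes so each ordinary day is tagged with the holiday that follows it.
import numpy as np

# code of the next holiday, carried backwards across the ordinary days
codes = df['StateHoliday'].where(df['StateHoliday'] != '0')
df['next_holiday'] = codes.bfill()
# ordinary days become e.g. 'b_0'; holiday rows keep their own code;
# days after the last holiday have no following holiday and stay '0'
df['groups'] = np.where(df['StateHoliday'] == '0',
                        (df['next_holiday'] + '_0').fillna('0'),
                        df['StateHoliday'])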
I have a dataframe like this:
time
2018-06-25 20:42:00
2016-06-26 23:51:00
2017-05-24 12:29:00
2016-03-11 10:14:00
Now I created a column like this
df['isEIDRange'] = 0
Let's say the EID festival is on 15 June 2018.
I want to fill the isEIDRange column with 1 if the date is between 10 June 2018 and 20 June 2018 (5 days before and 5 days after EID).
How can I do it?
Something like?
df.loc[ (df.time > 15 June - 5 days) & (df.time < 15 June + 5 days), 'isEIDRange' ] = 1
Use the Series.between function to test the values, then cast the mask to integers:
df['isEIDRange'] = df['time'].between('2018-06-10', '2018-06-20').astype(int)
If you want a dynamic solution:
df = pd.DataFrame({"time": pd.date_range("2018-06-08", "2018-06-22")})
#print (df)
date = '15 June 2018'
d = pd.to_datetime(date)
diff = pd.Timedelta(5, unit='d')
df['isEIDRange1'] = df['time'].between(d - diff, d + diff).astype(int)
df['isEIDRange2'] = df['time'].between(d - diff, d + diff, inclusive=False).astype(int) # in pandas >= 1.3, use inclusive='neither'
print (df)
time isEIDRange1 isEIDRange2
0 2018-06-08 0 0
1 2018-06-09 0 0
2 2018-06-10 1 0
3 2018-06-11 1 1
4 2018-06-12 1 1
5 2018-06-13 1 1
6 2018-06-14 1 1
7 2018-06-15 1 1
8 2018-06-16 1 1
9 2018-06-17 1 1
10 2018-06-18 1 1
11 2018-06-19 1 1
12 2018-06-20 1 0
13 2018-06-21 0 0
14 2018-06-22 0 0
Or set the values with numpy.where:
df['isEIDRange'] = np.where(df['time'].between(d - diff, d + diff), 1, 0)
You can use loc or np.where:
import numpy as np
df['isEIDRange'] = np.where((df['time'] > '2018-06-10') & (df['time'] < '2018-06-20'), 1, df['isEIDRange'])
This means that when the column time is between 2018-06-10 and 2018-06-20, the column isEIDRange will be set to 1; otherwise it will retain its original value (0).
You can use pandas date_range for this:
eid = pd.date_range("15/10/2019", "20/10/2019")
df = pd.DataFrame({"dates": pd.date_range("13/10/2019", "20/10/2019")})
df["eid"] = 0
df.loc[df["dates"].isin(eid), "eid"] = 1
and output:
dates eid
0 2019-10-13 0
1 2019-10-14 0
2 2019-10-15 1
3 2019-10-16 1
4 2019-10-17 1
5 2019-10-18 1
6 2019-10-19 1
7 2019-10-20 1
I have a df with the following structure:
my_df
date hour product
2019-06-06 17 laptopt
2019-06-06 15 printer
2019-06-07 14 laptopt
2019-06-07 17 desktop
How can I get a df like this:
hour laptop printer desktop
14 1 0 0
15 0 1 0
16 0 0 0
17 1 0 1
So far I've tried my_df.groupby(["product","hour"]).count().unstack(level=0), which gives
date
product desktop laptop printer
hour
14 NaN 1.0 NaN
15 NaN NaN 1.0
17 1.0 1.0 NaN
and I'm stuck there.
Thanks.
Call the result you already have unstacked, and do this:
index = pd.RangeIndex(df.hour.min(),df.hour.max() + 1)
unstacked.reindex(index).fillna(0).astype(int)
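A self-contained version of the same approach (my_df rebuilt from the question's sample; unstacked is the groupby result the question already computed):
import pandas as pd

my_df = pd.DataFrame({
    'date': ['2019-06-06', '2019-06-06', '2019-06-07', '2019-06-07'],
    'hour': [17, 15, 14, 17],
    'product': ['laptopt', 'printer', 'laptopt', 'desktop'],
})
unstacked = my_df.groupby(['product', 'hour']).count().unstack(level=0)['date']
index = pd.RangeIndex(my_df.hour.min(), my_df.hour.max() + 1)
print(unstacked.reindex(index).fillna(0).astype(int))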
You can use pd.crosstab and reindex:
(pd.crosstab(df['hour'], df['product'])
.reindex(pd.RangeIndex(df['hour'].min(), df['hour'].max()+1), fill_value=0))
product desktop laptopt printer
14 0 1 0
15 0 0 1
16 0 0 0
17 1 1 0
IIUC
# note: in pandas >= 2.0, .sum(level=0) is removed; use .groupby(level=0).sum() instead
df.set_index('hour')['product'].str.get_dummies().sum(level=0).reindex(range(df.hour.min(), df.hour.max()+1), fill_value=0)
Out[15]:
desktop laptopt printer
hour
14 0 1 0
15 0 0 1
16 0 0 0
17 1 1 0
I have loaded a pandas dataframe from a .csv file that contains a column having datetime values.
df = pd.read_csv('data.csv')
The name of the column holding the datetime values is pickup_datetime. Here's what I get if I do df['pickup_datetime'].head():
0 2009-06-15 17:26:00+00:00
1 2010-01-05 16:52:00+00:00
2 2011-08-18 00:35:00+00:00
3 2012-04-21 04:30:00+00:00
4 2010-03-09 07:51:00+00:00
Name: pickup_datetime, dtype: datetime64[ns, UTC]
How do I convert this column into a numpy array holding only the day values of the datetimes? For example: 15 from 2009-06-15 17:26:00+00:00, 5 from 2010-01-05 16:52:00+00:00, etc.
df['pickup_datetime'] = pd.to_datetime(df['pickup_datetime'], errors='coerce')
df['pickup_datetime'].dt.day.values
# array([15, 5, 18, 21, 9])
Just adding another variant, although coldspeed already provided the concise answer as a Christmas and New Year bonus :-) :
>>> df
pickup_datetime
0 2009-06-15 17:26:00+00:00
1 2010-01-05 16:52:00+00:00
2 2011-08-18 00:35:00+00:00
3 2012-04-21 04:30:00+00:00
4 2010-03-09 07:51:00+00:00
Convert the strings to timestamps by inferring their format:
>>> df['pickup_datetime'] = pd.to_datetime(df['pickup_datetime'])
>>> df
pickup_datetime
0 2009-06-15 17:26:00
1 2010-01-05 16:52:00
2 2011-08-18 00:35:00
3 2012-04-21 04:30:00
4 2010-03-09 07:51:00
You can pick just the day from pickup_datetime:
>>> df['pickup_datetime'].dt.day
0 15
1 5
2 18
3 21
4 9
Name: pickup_datetime, dtype: int64
You can pick just the month from pickup_datetime:
>>> df['pickup_datetime'].dt.month
0 6
1 1
2 8
3 4
4 3
You can pick just the year from pickup_datetime:
>>> df['pickup_datetime'].dt.year
0 2009
1 2010
2 2011
3 2012
4 2010
I feel like this should be done very easily, yet I can't figure out how. I have a pandas DataFrame with column date:
0 2012-08-21
1 2013-02-17
2 2013-02-18
3 2013-03-03
4 2013-03-04
Name: date, dtype: datetime64[ns]
I want to have a column of durations, something like:
0 0
1 80 days
2 1 day
3 15 days
4 1 day
Name: date, dtype: timedelta64[ns]
My attempt yields a bunch of 0 days and NaT instead:
>>> df.date[1:] - df.date[:-1]
0 NaT
1 0 days
2 0 days
...
Any ideas?
Timedeltas are useful here: (see docs)
Starting in v0.15.0, we introduce a new scalar type Timedelta, which is a subclass of datetime.timedelta, and behaves in a similar manner, but allows compatibility with np.timedelta64 types as well as a host of custom representation, parsing, and attributes.
Timedeltas are differences in times, expressed in difference units, e.g. days, hours, minutes, seconds. They can be both positive and negative.
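For instance, a quick illustration of that compatibility (not from the original answer):
import numpy as np
import pandas as pd

td = pd.Timedelta(days=1, hours=6)
print(td + np.timedelta64(2, 'h'))  # 1 days 08:00:00
print(td.total_seconds())           # 108000.0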
df
0
0 2012-08-21
1 2013-02-17
2 2013-02-18
3 2013-03-03
4 2013-03-04
You could, for example, express the row-to-row differences directly as integer days:
df[0].diff().dt.days.fillna(0).astype(int)
0      0
1    180
2      1
3     13
4      1
Name: 0, dtype: int64
(pd.to_timedelta is the general constructor here, e.g. pd.to_timedelta(['0 days']) returns TimedeltaIndex(['0 days'], dtype='timedelta64[ns]', freq=None).)
Alternatively, you can calculate the difference between points in time using .shift() (or .diff(), as illustrated by @Andy Hayden):
res = df-df.shift()
to get:
res.fillna(0)
0
0 0 days
1 180 days
2 1 days
3 13 days
4 1 days
You can convert these from timedelta64 dtype to integer using:
res.fillna(0).squeeze().dt.days
0 0
1 180
2 1
3 13
4 1
You can use diff:
In [11]: s
Out[11]:
0 2012-08-21
1 2013-02-17
2 2013-02-18
3 2013-03-03
4 2013-03-04
Name: date, dtype: datetime64[ns]
In [12]: s.diff()
Out[12]:
0 NaT
1 180 days
2 1 days
3 13 days
4 1 days
Name: date, dtype: timedelta64[ns]
In [13]: s.diff().fillna(0)
Out[13]:
0 0 days
1 180 days
2 1 days
3 13 days
4 1 days
Name: date, dtype: timedelta64[ns]
df.date[1:] - df.date[:-1] doesn't do what you think it does. Subtraction aligns the elements by series/dataframe index, not by position in the series.
Calculating df.date[1:] - df.date[:-1] does:
df.date[1:]        df.date[:-1]
               -   0 2012-08-21   =  NaT
1 2013-02-17   -   1 2013-02-17   =  0 days
2 2013-02-18   -   2 2013-02-18   =  0 days
3 2013-03-03   -   3 2013-03-03   =  0 days
4 2013-03-04   -                  =  NaT
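If you do want positional subtraction, strip the index first, e.g. (a sketch built from the dates above):
import pandas as pd

s = pd.to_datetime(pd.Series(['2012-08-21', '2013-02-17', '2013-02-18',
                              '2013-03-03', '2013-03-04']))
print(pd.Series(s.values[1:] - s.values[:-1]))
# 0   180 days
# 1     1 days
# 2    13 days
# 3     1 days
# dtype: timedelta64[ns]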