Get weekday/day-of-week for Datetime column of DataFrame - python

I have a DataFrame df like the following (excerpt; 'Timestamp' is the index):
Timestamp Value
2012-06-01 00:00:00 100
2012-06-01 00:15:00 150
2012-06-01 00:30:00 120
2012-06-01 01:00:00 220
2012-06-01 01:15:00 80
...and so on.
I need a new column df['weekday'] with the respective weekday/day-of-week of the timestamps.
How can I get this?

Use the new dt.dayofweek property:
In [2]:
df['weekday'] = df['Timestamp'].dt.dayofweek
df
Out[2]:
Timestamp Value weekday
0 2012-06-01 00:00:00 100 4
1 2012-06-01 00:15:00 150 4
2 2012-06-01 00:30:00 120 4
3 2012-06-01 01:00:00 220 4
4 2012-06-01 01:15:00 80 4
In the situation where the Timestamp is your index you need to reset the index and then call the dt.dayofweek property:
In [14]:
df = df.reset_index()
df['weekday'] = df['Timestamp'].dt.dayofweek
df
Out[14]:
Timestamp Value weekday
0 2012-06-01 00:00:00 100 4
1 2012-06-01 00:15:00 150 4
2 2012-06-01 00:30:00 120 4
3 2012-06-01 01:00:00 220 4
4 2012-06-01 01:15:00 80 4
Strangely, if you try to create a Series from the index (so as not to reset it), you get NaN values. The same happens if you call the dt.dayofweek property on the result of reset_index without assigning that result back to the original df; in both cases the freshly created object has a RangeIndex that does not align with the original DatetimeIndex:
In [16]:
df['weekday'] = pd.Series(df.index).dt.dayofweek
df
Out[16]:
Value weekday
Timestamp
2012-06-01 00:00:00 100 NaN
2012-06-01 00:15:00 150 NaN
2012-06-01 00:30:00 120 NaN
2012-06-01 01:00:00 220 NaN
2012-06-01 01:15:00 80 NaN
In [17]:
df['weekday'] = df.reset_index()['Timestamp'].dt.dayofweek
df
Out[17]:
Value weekday
Timestamp
2012-06-01 00:00:00 100 NaN
2012-06-01 00:15:00 150 NaN
2012-06-01 00:30:00 120 NaN
2012-06-01 01:00:00 220 NaN
2012-06-01 01:15:00 80 NaN
EDIT
As pointed out to me by user @joris, you can just access the weekday attribute of the index, so the following works and is more compact:
df['Weekday'] = df.index.weekday
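For instance, a minimal self-contained sketch of that index-based approach (the sample values are illustrative):
import pandas as pd

# build a small frame with a DatetimeIndex, mirroring the question's layout
idx = pd.date_range('2012-06-01 00:00', periods=5, freq='15min')
df = pd.DataFrame({'Value': [100, 150, 120, 220, 80]}, index=idx)
df['Weekday'] = df.index.weekday         # Monday=0 ... Sunday=6; Friday -> 4
df['WeekdayName'] = df.index.day_name()  # 'Friday', ...
print(df)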

If the Timestamp column is a datetime value, then you can just use:
df['weekday'] = df['Timestamp'].apply(lambda x: x.weekday())
or
df['weekday'] = pd.to_datetime(df['Timestamp']).apply(lambda x: x.weekday())

You can get the weekday name this way (note the .values, without which the fresh RangeIndex of the new Series fails to align with the DatetimeIndex and you get the NaN issue shown above):
df['weekday'] = pd.Series(df.index).dt.day_name().values

In case somebody else has the same issue with a multi-indexed DataFrame, here is what solved it for me, based on @joris's solution:
df['Weekday'] = df.index.get_level_values(1).weekday
In my case the date was at get_level_values(1); use get_level_values(0) if the dates are in the outer index level.

Since pandas 1.1.0 there is also dt.isocalendar(), so instead of:
df['weekday'] = df['Timestamp'].dt.dayofweek
from @EdChum and @Artyom Krivolapov
you can now use:
df['weekday'] = df['Timestamp'].dt.isocalendar().day
Note that the two differ: isocalendar().day uses ISO numbering (Monday=1 through Sunday=7), while dt.dayofweek, which is still supported, is zero-based (Monday=0 through Sunday=6).
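A quick sketch of the difference between the two numbering schemes:
import pandas as pd

ts = pd.Series(pd.to_datetime(['2012-06-01']))  # a Friday
print(ts.dt.dayofweek.iloc[0])          # 4 (zero-based, Monday=0)
print(ts.dt.isocalendar().day.iloc[0])  # 5 (ISO, Monday=1)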

Related

Pandas: set row value based on another column's value, but do nothing otherwise

This is my DataFrame:
It has 455 rows covering a period of days, in steps of 4 hours per row.
I need to replace each 'demand' value with 0 if the timestamp's hour is 23,
so I wrote this:
datadf['value']=datadf['timestamp'].apply(lambda x, y=datadf['value']: 0 if x.hour==23 else y)
I know the y value is wrong, but I couldn't find a way to refer to the same row's 'demand' value inside the lambda.
How can I refer to that demand value? Is there an alternative where my else does nothing?
import pandas as pd
import numpy as np
#data preparation
df = pd.DataFrame()
df['date'] = pd.date_range(start='2022-06-01',periods=7,freq='4h') + pd.Timedelta('3H')
df['val'] = np.random.rand(7)
print(df)
>>
date val
0 2022-06-01 03:00:00 0.601889
1 2022-06-01 07:00:00 0.017787
2 2022-06-01 11:00:00 0.290662
3 2022-06-01 15:00:00 0.179150
4 2022-06-01 19:00:00 0.763534
5 2022-06-01 23:00:00 0.680892
6 2022-06-02 03:00:00 0.585380
#if your dates are not in datetime format yet, you must convert them first
df['date'] = pd.to_datetime(df['date'])
df.loc[df['date'].dt.hour == 23, 'val'] = 0
#if you don't want to change the data in the "demand" column, you can copy it
#df['val_2'] = df['val']
#df.loc[df['date'].dt.hour == 23, 'val_2'] = 0
print(df)
>>
date val
0 2022-06-01 03:00:00 0.601889
1 2022-06-01 07:00:00 0.017787
2 2022-06-01 11:00:00 0.290662
3 2022-06-01 15:00:00 0.179150
4 2022-06-01 19:00:00 0.763534
5 2022-06-01 23:00:00 0.000000
6 2022-06-02 03:00:00 0.585380
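For the record, a sketch of an equivalent one-liner is Series.mask, which keeps every value where the condition is False and only replaces the matches (column names as in the example above):
df['val'] = df['val'].mask(df['date'].dt.hour == 23, 0)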

How to select an item by its ID and not by its index position [duplicate]

I have a pandas dataframe:
import pandas as pnd
d = pnd.Timestamp('2013-01-01 16:00')
dates = pnd.bdate_range(start=d, end = d+pnd.DateOffset(days=10), normalize = False)
df = pnd.DataFrame(index=dates, columns=['a'])
df['a'] = 6
print(df)
a
2013-01-01 16:00:00 6
2013-01-02 16:00:00 6
2013-01-03 16:00:00 6
2013-01-04 16:00:00 6
2013-01-07 16:00:00 6
2013-01-08 16:00:00 6
2013-01-09 16:00:00 6
2013-01-10 16:00:00 6
2013-01-11 16:00:00 6
I am interested in finding the integer location of one of the labels, say,
ds = pnd.Timestamp('2013-01-02 16:00')
Looking at the index values, I know that the integer location of this label is 1. How can I get pandas to tell me what the integer location of this label is?
You're looking for the index method get_loc:
In [11]: df.index.get_loc(ds)
Out[11]: 1
Get dataframe integer index given a date key:
>>> import pandas as pd
>>> df = pd.DataFrame(
...     index=pd.date_range("2008-01-01", "2008-01-05"),
...     columns=("foo", "bar"))
>>> df["foo"] = [10,20,40,15,10]
>>> df["bar"] = [100,200,40,-50,-38]
>>> df
foo bar
2008-01-01 10 100
2008-01-02 20 200
2008-01-03 40 40
2008-01-04 15 -50
2008-01-05 10 -38
>>> df.index.get_loc(df["bar"].idxmax())
1
>>> df.index.get_loc(df["foo"].idxmax())
2
In column bar, the integer location of the maximum value is 1; in column foo, it is 2. (idxmax returns the label of the maximum, which get_loc then converts to a position.)
http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Index.get_loc.html
get_loc can be used for both rows and columns, as follows:
import pandas as pnd
d = pnd.Timestamp('2013-01-01 16:00')
dates = pnd.bdate_range(start=d, end = d+pnd.DateOffset(days=10), normalize = False)
df = pnd.DataFrame(index=dates)
df['a'] = 5
df['b'] = 6
print(df.head())
a b
2013-01-01 16:00:00 5 6
2013-01-02 16:00:00 5 6
2013-01-03 16:00:00 5 6
2013-01-04 16:00:00 5 6
2013-01-07 16:00:00 5 6
#for rows
print(df.index.get_loc('2013-01-01 16:00:00'))
0
#for columns
print(df.columns.get_loc('b'))
1
Because get_loc returns a boolean mask rather than a single integer location when there are multiple instances of the key in the index, I was toying with an answer using reset_index():
# Add a duplicate!!!
dup = pd.Timestamp('2013-01-07 16:00')
df = pd.concat([df, pd.DataFrame({'a': [7]}, index=[dup])])
df
a
2013-01-01 16:00:00 6
2013-01-02 16:00:00 6
2013-01-03 16:00:00 6
2013-01-04 16:00:00 6
2013-01-07 16:00:00 6
2013-01-08 16:00:00 6
2013-01-09 16:00:00 6
2013-01-10 16:00:00 6
2013-01-11 16:00:00 6
2013-01-07 16:00:00 7
# Only use this method if the key has duplicates
if df.loc[dup].index.has_duplicates:
    df.reset_index().loc[df.index.get_loc(dup)].index.to_list()
[4, 9]
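A more direct route, for what it's worth, is Index.get_indexer_for, which returns the integer positions whether or not the key is duplicated:
df.index.get_indexer_for([dup])
array([4, 9])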

Splitting a datetime, python, pandas

Sorry, I am new to asking questions on Stack Overflow, so I don't understand how to format properly.
I'm given a pandas DataFrame with a datetime column (containing the date and the time) and an associated column that contains some sort of value. The given timestamps are incremented by the hour. I would like to manipulate the DataFrame so that they increment every 15 minutes but retain the same value. How would I do that? Thanks!
I have tried :
df = df.asfreq('15Min', method='ffill')
But I get a error:
"TypeError: Cannot compare type 'Timestamp' with type 'long'"
current dataframe:
datetime value
00:00:00 1
01:00:00 2
new dataframe:
datetime value
00:00:00 1
00:15:00 1
00:30:00 1
00:45:00 1
01:00:00 2
01:15:00 2
01:30:00 2
01:45:00 2
Update:
The accepted answer below works, but so does the code I initially tried above,
df = df.asfreq('15Min', method='ffill')
I was messing around with other DataFrames and seemed to be having trouble with some null values; I took care of that with fillna statements and everything worked.
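For reference, a minimal sketch of that asfreq approach; it assumes the 'datetime' column has been parsed with pd.to_datetime and set as the index (the TypeError in the question typically indicates the index is not a DatetimeIndex):
import pandas as pd

df = pd.DataFrame({'datetime': pd.to_datetime(['2018-01-01 00:00', '2018-01-01 01:00']),
                   'value': [1, 2]})
df = df.set_index('datetime').asfreq('15Min', method='ffill')
# note: asfreq only fills up to the last existing timestamp (01:00); the trailing
# 01:15-01:45 rows still have to be appended manually, as the answer below shows
print(df)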
You can use a TimedeltaIndex, but it is necessary to manually add the last value for the reindex to be correct:
df['datetime'] = pd.to_timedelta(df['datetime'])
df = df.set_index('datetime')
tr = pd.timedelta_range(df.index.min(),
                        df.index.max() + pd.Timedelta(45*60, unit='s'),
                        freq='15Min')
df = df.reindex(tr, method='ffill')
print (df)
value
00:00:00 1
00:15:00 1
00:30:00 1
00:45:00 1
01:00:00 2
01:15:00 2
01:30:00 2
01:45:00 2
Another solution uses resample and has the same problem - you need to append a new value so that the last values are filled correctly:
df['datetime'] = pd.to_timedelta(df['datetime'])
df = df.set_index('datetime')
df.loc[df.index.max() + pd.Timedelta(1, unit='h')] = 1
df = df.resample('15Min').ffill().iloc[:-1]
print (df)
value
datetime
00:00:00 1
00:15:00 1
00:30:00 1
00:45:00 1
01:00:00 2
01:15:00 2
01:30:00 2
01:45:00 2
But if the values are full datetimes, the same two approaches work with a DatetimeIndex:
print (df)
datetime value
0 2018-01-01 00:00:00 1
1 2018-01-01 01:00:00 2
#reindex solution
df['datetime'] = pd.to_datetime(df['datetime'])
df = df.set_index('datetime')
tr = pd.date_range(df.index.min(),
                   df.index.max() + pd.Timedelta(45*60, unit='s'),
                   freq='15Min')
df = df.reindex(tr, method='ffill')

#resample solution
df['datetime'] = pd.to_datetime(df['datetime'])
df = df.set_index('datetime')
df.loc[df.index.max() + pd.Timedelta(1, unit='h')] = 1
df = df.resample('15Min').ffill().iloc[:-1]
print (df)
value
datetime
2018-01-01 00:00:00 1
2018-01-01 00:15:00 1
2018-01-01 00:30:00 1
2018-01-01 00:45:00 1
2018-01-01 01:00:00 2
2018-01-01 01:15:00 2
2018-01-01 01:30:00 2
2018-01-01 01:45:00 2
You can use pandas.date_range:
pd.date_range('00:00:00', '01:00:00', freq='15T')
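To actually apply it, the generated range can feed the reindex pattern shown above; a sketch, assuming the frame is already indexed by its datetime column:
rng = pd.date_range(df.index.min(), df.index.max(), freq='15T')
df = df.reindex(rng, method='ffill')
# as above, extend the end of the range if you also need the trailing 45 minutes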

Pandas: Set first 2 hours of every group to NaN

I am trying to clean my data by setting 'value' to NaN for the first 2 hours of every 'state' group.
My dataframe looks like this:
>>> import pandas as pd
>>> import numpy as np
>>>
>>> rng = pd.date_range('1/1/2016', periods=6, freq='H')
>>>
>>> data = {'value': np.random.rand(len(rng)),
... 'state': ['State 1']*3 + ['State 2']*3}
>>> df = pd.DataFrame(data, index=rng)
>>>
>>> df
state value
2016-01-01 00:00:00 State 1 0.800798
2016-01-01 01:00:00 State 1 0.130290
2016-01-01 02:00:00 State 1 0.464372
2016-01-01 03:00:00 State 2 0.925445
2016-01-01 04:00:00 State 2 0.732331
2016-01-01 05:00:00 State 2 0.811541
I've come up with three ways of doing this, and none of them is satisfactory:
1) The first attempt, using .loc and/or .ix, results in no change:
>>> df.loc[df.state=='State 2'].first('2H').value = np.nan
>>> df.ix[df.state=='State 2'].first('2H').value = np.nan
>>> df
state value
2016-01-01 00:00:00 State 1 0.800798
2016-01-01 01:00:00 State 1 0.130290
2016-01-01 02:00:00 State 1 0.464372
2016-01-01 03:00:00 State 2 0.925445
2016-01-01 04:00:00 State 2 0.732331
2016-01-01 05:00:00 State 2 0.811541
2) Second attempt results in an error:
>>> df.loc[df.state=='State 2', 'value'].first('2H') = np.nan
File "<stdin>", line 1
SyntaxError: can't assign to function call
3) This is a hackish attempt that worked, but is apparently discouraged:
>>> temp = df.loc[df.state=='State 2']
>>> temp.first('2H').value = np.nan
/home/user/anaconda3/lib/python3.5/site-packages/pandas/core/generic.py:2698: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
self[name] = value
>>> df.loc[df.state=='State 2'] = temp
>>> df
state value
2016-01-01 00:00:00 State 1 0.800798
2016-01-01 01:00:00 State 1 0.130290
2016-01-01 02:00:00 State 1 0.464372
2016-01-01 03:00:00 State 2 NaN
2016-01-01 04:00:00 State 2 NaN
2016-01-01 05:00:00 State 2 0.811541
Ideally, I want to determine an easy way to loop over each group and clean the beginning and end of their respective data groups. I was under the impression that .first and .last would be great due to their simple time string formats.
Using .loc doesn't take into account these time string formats, but I'm probably missing something.
What's the true way of doing this in pandas?
Find all indexes of the first 2H per group, then change the index to a MultiIndex, swaplevel so the levels line up for indexing, and finally reset_index:
idx = df.groupby('state')['value'].apply(lambda x: x.first('2H')).index
df.set_index('state', append=True, inplace=True)
df = df.swaplevel(0,1)
df.loc[idx,'value'] = np.nan
print (df.reset_index(level=0))
state value
2016-01-01 00:00:00 State 1 NaN
2016-01-01 01:00:00 State 1 NaN
2016-01-01 02:00:00 State 1 0.406512
2016-01-01 03:00:00 State 2 NaN
2016-01-01 04:00:00 State 2 NaN
2016-01-01 05:00:00 State 2 0.226350
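A sketch of an alternative that skips the MultiIndex dance, continuing the session from the question: transform broadcasts each group's first timestamp back to every row, after which a plain boolean mask does the rest.
# broadcast each group's start time, then null out rows within 2 hours of it
group_start = df.groupby('state')['value'].transform(lambda s: s.index.min())
mask = df.index < (group_start + pd.Timedelta('2H')).values
df.loc[mask, 'value'] = np.nan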

Pandas - Event separation - .iloc iteritem()?

I have a sample_data.txt with the following structure:
Precision= Waterdrops
2009-11-17 14:00:00,4.9,
2009-11-17 14:30:00,6.1,
2009-11-17 15:00:00,5.3,
2009-11-17 15:30:00,3.3,
2009-11-17 16:00:00,4.9,
I need to separate my data into the values bigger than zero and identify each change (event) where the timespan is bigger than 2 h. So far I have written:
file_path = 'sample_data.txt'
df = pd.read_csv(file_path,
                 skiprows = [num for (num,line) in enumerate(open(file_path),2) if 'Precision=' in line][0],
                 parse_dates = True, index_col = 0, header = None, sep = ',',
                 names = ['meteo', 'empty'])
df['date'] = df.index
df = df.drop(['empty'], axis=1)
df = df[df.meteo>20]
df['diff'] = df.date-df.date.shift(1)
df['sections'] = (df['diff'] > np.timedelta64(2, "h")).astype(int).cumsum()
From the above code i get:
meteo date diff sections
2009-12-15 12:00:00 23.8 2009-12-15 12:00:00 NaT 0
2009-12-15 13:00:00 23.0 2009-12-15 13:00:00 01:00:00 0
If i use:
df.date.iloc[[0, -1]].reset_index(drop=True)
I get:
0 2009-12-15 12:00:00
1 2012-12-05 16:00:00
Name: date, dtype: datetime64[ns]
These are the start and end dates of my sample_data.txt.
How can I get .iloc[[0, -1]].reset_index(drop=True) for each df['sections'] category?
I tried with .apply:
def f(s):
    return s.iloc[[0, -1]].reset_index(drop=True)
df.groupby(df['sections']).apply(f)
and I get: IndexError: positional indexers are out-of-bounds
I don't know why you need the reset_index(drop=True) shenanigans. My somewhat more straightforward process would be, starting with
df
sections meteo date diff
0 0 2009-12-15 12:00:00 NaT
1 0 2009-12-15 13:00:00 01:00:00
0 1 2009-12-15 12:00:00 NaT
1 1 2009-12-15 13:00:00 01:00:00
to do the following (after you ensure with sort_values(['sections', 'date']) that iloc[[0, -1]] actually gives the start and end; otherwise just use min() and max()):
def f(s):
    return s.iloc[[0, -1]]['date']
df.groupby('sections').apply(f)
date 0 1
sections
0 12:00:00 13:00:00
1 12:00:00 13:00:00
Or, as a more streamlined approach
df.groupby('sections')['date'].agg([np.max, np.min])
amax amin
sections
0 13:00:00 12:00:00
1 13:00:00 12:00:00
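For what the question literally asks (the first and last date per section), groupby's built-in first/last aggregations are perhaps the simplest sketch:
df.groupby('sections')['date'].agg(['first', 'last'])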
