Pandas time series indexing -- re - python

I have a pandas dataframe indexed by time:
>>> dframe.head()
aw_FATFREEMASS raw aw_FATFREEMASS sym
TIMESTAMP
2011-12-08 23:13:23 139.3 H
2011-12-08 23:12:18 139.2 H
2011-12-08 22:31:53 139.2 H
2011-12-09 07:08:50 138.2 H
2011-12-10 21:36:20 137.6 H
[5 rows x 2 columns]
>>> type(dframe.index)
<class 'pandas.tseries.index.DatetimeIndex'>
I'm trying to do a simple time series query similar to this SQL:
SELECT * FROM dframe WHERE tstart <= TIMESTAMP <= tend
where tstart and tend are appropriately represented timestamps. With pandas I'm getting behavior I just don't understand.
This does what I expect:
>>> dframe['2011-11-01' : '2011-11-20']
Empty DataFrame
Columns: [aw_FATFREEMASS raw, aw_FATFREEMASS sym]
Index: []
[0 rows x 2 columns]
This does the same thing:
dframe['2011-11-01 00:00:00' : '2011-11-20 00:00:00']
However:
>>> from dateutil.parser import parse
>>> dframe[parse('2011-11-01 00:00:00') : '2011-11-20 00:00:00']
*** TypeError: 'datetime.datetime' object is not iterable
>>> dframe[parse('2011-11-01') : '2011-11-20 00:00:00']
*** TypeError: 'datetime.datetime' object is not iterable
>>> dframe[parse('2011-11-01') : parse('2011-11-01')]
*** KeyError: Timestamp('2011-11-01 00:00:00', tz=None)
When I provide a time represented as a pandas Timestamp I get slice behavior I don't understand. Can someone explain this behavior and/or tell me how I can achieve the SQL query above?

docs are here
This is called partial string indexing. In a nutshell, providing a string will get you results that 'match', e.g. they are included in the specified interval, while if you specify a Timestamp/datetime then its exact; it HAS to be in the index.
Can you show how you constructed the DatetimeIndex?
what version pandas?
In [4]: df = DataFrame(np.random.randn(20,2),index=date_range('20130101',periods=20,freq='H'))
In [5]: df
Out[5]:
0 1
2013-01-01 00:00:00 -0.339751 1.223660
2013-01-01 01:00:00 0.525203 -0.987815
2013-01-01 02:00:00 1.724239 0.213446
2013-01-01 03:00:00 -0.074797 -1.658876
2013-01-01 04:00:00 0.483425 -2.112314
2013-01-01 05:00:00 0.094140 0.327681
2013-01-01 06:00:00 -1.265337 -0.858521
2013-01-01 07:00:00 -1.470041 0.168871
2013-01-01 08:00:00 -0.609185 0.829035
2013-01-01 09:00:00 0.047774 0.221399
2013-01-01 10:00:00 0.814162 -1.415824
2013-01-01 11:00:00 1.070209 0.720150
2013-01-01 12:00:00 0.887571 -0.611207
2013-01-01 13:00:00 1.669451 -0.022434
2013-01-01 14:00:00 -1.796565 -1.186899
2013-01-01 15:00:00 0.417758 0.082021
2013-01-01 16:00:00 -1.064019 -0.377208
2013-01-01 17:00:00 0.939902 0.430784
2013-01-01 18:00:00 -0.645667 1.611992
2013-01-01 19:00:00 -0.172148 -1.725041
[20 rows x 2 columns]
In [6]: df['20130101 7:00:01':'20130101 10:00:00']
Out[6]:
0 1
2013-01-01 08:00:00 -0.609185 0.829035
2013-01-01 09:00:00 0.047774 0.221399
2013-01-01 10:00:00 0.814162 -1.415824
[3 rows x 2 columns]
In [7]: df.index
Out[7]:
<class 'pandas.tseries.index.DatetimeIndex'>
[2013-01-01 00:00:00, ..., 2013-01-01 19:00:00]
Length: 20, Freq: H, Timezone: None
If you already have Timestamps/datetimes, then just construct a boolean expression
df[(df.index > Timestamp('20130101 10:00:00')) & (df.index < Timestamp('201301010 17:00:00')])

Related

Pandas computer hourly average and set at middle of interval

I want to compute the hourly mean for a time series of wind speed and direction, but I want to set the time at the half hour. So, the average for values from 14:00 to 15:00 will be at 14:30. Right now, I can only seem to get it on left or right of the interval. Here is what I currently have:
ts_g=[item.replace(second=0, microsecond=0) for item in dates_g]
dg = {'ws': data_g.ws, 'wdir': data_g.wdir}
df_g = pandas.DataFrame(data=dg, index=ts_g, columns=['ws','wdir'])
grouped_g = df_g.groupby(pandas.TimeGrouper('H'))
hourly_ws_g = grouped_g['ws'].mean()
hourly_wdir_g = grouped_g['wdir'].mean()
the output for this looks like:
2016-04-08 06:00:00+00:00 46.980000
2016-04-08 07:00:00+00:00 64.313333
2016-04-08 08:00:00+00:00 75.678333
2016-04-08 09:00:00+00:00 127.383333
2016-04-08 10:00:00+00:00 145.950000
2016-04-08 11:00:00+00:00 184.166667
....
but I would like it to be like:
2016-04-08 06:30:00+00:00 54.556
2016-04-08 07:30:00+00:00 78.001
....
Thanks for your help!
So the easiest way is to resample and then use linear interpolation:
In [21]: rng = pd.date_range('1/1/2011', periods=72, freq='H')
In [22]: ts = pd.Series(np.random.randn(len(rng)), index=rng)
...:
In [23]: ts.head()
Out[23]:
2011-01-01 00:00:00 0.796704
2011-01-01 01:00:00 -1.153179
2011-01-01 02:00:00 -1.919475
2011-01-01 03:00:00 0.082413
2011-01-01 04:00:00 -0.397434
Freq: H, dtype: float64
In [24]: ts2 = ts.resample('30T').interpolate()
In [25]: ts2.head()
Out[25]:
2011-01-01 00:00:00 0.796704
2011-01-01 00:30:00 -0.178237
2011-01-01 01:00:00 -1.153179
2011-01-01 01:30:00 -1.536327
2011-01-01 02:00:00 -1.919475
Freq: 30T, dtype: float64
In [26]:
I believe this is what you need.
Edit to add clarifying example
Perhaps it's easier to see what's going on without random Data:
In [29]: ts.head()
Out[29]:
2011-01-01 00:00:00 0
2011-01-01 01:00:00 1
2011-01-01 02:00:00 2
2011-01-01 03:00:00 3
2011-01-01 04:00:00 4
Freq: H, dtype: int64
In [30]: ts2 = ts.resample('30T').interpolate()
In [31]: ts2.head()
Out[31]:
2011-01-01 00:00:00 0.0
2011-01-01 00:30:00 0.5
2011-01-01 01:00:00 1.0
2011-01-01 01:30:00 1.5
2011-01-01 02:00:00 2.0
Freq: 30T, dtype: float64
This post is already several years old and uses the API that has long been deprecated. Modern Pandas already provides the resample method that is easier to use than pandas.TimeGrouper. Yet it allows only left and right labelled intervals but getting the intervals centered at the middle of the interval is not readily available.
Yet this is not hard to do.
First we fill in the data that we want to resample:
ts_g=[datetime.datetime.fromisoformat('2019-11-20') +
datetime.timedelta(minutes=10*x) for x in range(0,100)]
dg = {'ws': range(0,100), 'wdir': range(0,100)}
df_g = pd.DataFrame(data=dg, index=ts_g, columns=['ws','wdir'])
df_g.head()
The output would be:
ws wdir
2019-11-20 00:00:00 0 0
2019-11-20 00:10:00 1 1
2019-11-20 00:20:00 2 2
2019-11-20 00:30:00 3 3
2019-11-20 00:40:00 4 4
Now we first resample to 30 minute intervals
grouped_g = df_g.resample('30min')
halfhourly_ws_g = grouped_g['ws'].mean()
halfhourly_ws_g.head()
The output would be:
2019-11-20 00:00:00 1
2019-11-20 00:30:00 4
2019-11-20 01:00:00 7
2019-11-20 01:30:00 10
2019-11-20 02:00:00 13
Freq: 30T, Name: ws, dtype: int64
Finally the trick to get the centered intervals:
hourly_ws_g = halfhourly_ws_g.add(halfhourly_ws_g.shift(1)).div(2)\
.loc[halfhourly_ws_g.index.minute % 60 == 30]
hourly_ws_g.head()
This would produce the expected output:
2019-11-20 00:30:00 2.5
2019-11-20 01:30:00 8.5
2019-11-20 02:30:00 14.5
2019-11-20 03:30:00 20.5
2019-11-20 04:30:00 26.5
Freq: 60T, Name: ws, dtype: float64

Matching two similar dates with different times

I have two dates in pandas dataframes (df1.a_date & df2.another_date) read from CSV files. They match at the date level (YYYY-MM-DD) but not at the time (HH:MM:SS). Both are read in as dtype: object.
I need to merge the two dataframes on the dates, but since they aren't exact, i probably need to convert them first. Any ideas?
edit:
I've tried using diatomite.date to construct a new date from the pandas.datetime, but that doesn't seem to work.
datetime.date(df.a_date.year, df.a_date.month, df.a_date.day)
pandas datetime objects don't have year, month, day accessors, though.
You can normalize a date column/DatetimeIndex index:
Note: At the moment normalize isn't exported to the dt accessor so we need to wrap with DatetimeIndex.
In [11]: df = pd.DataFrame(pd.date_range('2015-01-01 05:00', periods=3), columns=['datetime'])
In [12]: df
Out[12]:
datetime
0 2015-01-01 05:00:00
1 2015-01-02 05:00:00
2 2015-01-03 05:00:00
In [13]: df["date"] = pd.DatetimeIndex(df["datetime"]).normalize()
In [14]: df
Out[14]:
datetime date
0 2015-01-01 05:00:00 2015-01-01
1 2015-01-02 05:00:00 2015-01-02
2 2015-01-03 05:00:00 2015-01-03
This works if it's a DatetimeIndex too, use df.index rather than df[col_name].
Format the datetime to only include YYYY-MM-DD:
assuming df is your dataframe:
'{:%Y-%m-%d}'.format(d)
Assume, dft is your dataframe and 'index' column contains datetime:
In [1804]: dft.head()
Out[1804]:
index A
0 2013-01-01 00:00:00 1.193366
1 2013-01-01 00:01:00 1.013425
2 2013-01-01 00:02:00 1.281902
3 2013-01-01 00:03:00 -0.043788
4 2013-01-01 00:04:00 -1.610164
You could convert the column to contain just the date and save it in a different column, if you want. And operate on that:
In [1805]: dft['index'].apply(lambda v:v.date()).head()
Out[1805]:
0 2013-01-01
1 2013-01-01
2 2013-01-01
3 2013-01-01
4 2013-01-01
Name: index, dtype: object

Get weekday/day-of-week for Datetime column of DataFrame

I have a DataFrame df like the following (excerpt, 'Timestamp' are the index):
Timestamp Value
2012-06-01 00:00:00 100
2012-06-01 00:15:00 150
2012-06-01 00:30:00 120
2012-06-01 01:00:00 220
2012-06-01 01:15:00 80
...and so on.
I need a new column df['weekday'] with the respective weekday/day-of-week of the timestamps.
How can I get this?
Use the new dt.dayofweek property:
In [2]:
df['weekday'] = df['Timestamp'].dt.dayofweek
df
Out[2]:
Timestamp Value weekday
0 2012-06-01 00:00:00 100 4
1 2012-06-01 00:15:00 150 4
2 2012-06-01 00:30:00 120 4
3 2012-06-01 01:00:00 220 4
4 2012-06-01 01:15:00 80 4
In the situation where the Timestamp is your index you need to reset the index and then call the dt.dayofweek property:
In [14]:
df = df.reset_index()
df['weekday'] = df['Timestamp'].dt.dayofweek
df
Out[14]:
Timestamp Value weekday
0 2012-06-01 00:00:00 100 4
1 2012-06-01 00:15:00 150 4
2 2012-06-01 00:30:00 120 4
3 2012-06-01 01:00:00 220 4
4 2012-06-01 01:15:00 80 4
Strangely if you try to create a series from the index in order to not reset the index you get NaN values as does using the result of reset_index to call the dt.dayofweek property without assigning the result of reset_index back to the original df:
In [16]:
df['weekday'] = pd.Series(df.index).dt.dayofweek
df
Out[16]:
Value weekday
Timestamp
2012-06-01 00:00:00 100 NaN
2012-06-01 00:15:00 150 NaN
2012-06-01 00:30:00 120 NaN
2012-06-01 01:00:00 220 NaN
2012-06-01 01:15:00 80 NaN
In [17]:
df['weekday'] = df.reset_index()['Timestamp'].dt.dayofweek
df
Out[17]:
Value weekday
Timestamp
2012-06-01 00:00:00 100 NaN
2012-06-01 00:15:00 150 NaN
2012-06-01 00:30:00 120 NaN
2012-06-01 01:00:00 220 NaN
2012-06-01 01:15:00 80 NaN
EDIT
As pointed out to me by user #joris you can just access the weekday attribute of the index so the following will work and is more compact:
df['Weekday'] = df.index.weekday
If the Timestamp column is a datetime value, then you can just use:
df['weekday'] = df['Timestamp'].apply(lambda x: x.weekday())
or
df['weekday'] = pd.to_datetime(df['Timestamp']).apply(lambda x: x.weekday())
You can get with this way:
import datetime
df['weekday'] = pd.Series(df.index).dt.day_name()
In case somebody else has the same issue with a multiindexed dataframe, here is what solved it for me, based on #joris solution:
df['Weekday'] = df.index.get_level_values(1).weekday
for me date was the get_level_values(1) instead of get_level_values(0), which would work for the outer index.
As of pandas 1.1.0 dt.dayofweek is deprecated, so instead of:
df['weekday'] = df['Timestamp'].dt.dayofweek
from #EdChum and #Artyom Krivolapov
you can now use:
df['weekday'] = df['Timestamp'].dt.isocalendar().day

Pandas fillna on datetime object

I'm trying to run fillna on a column of type datetime64[ns]. When I run something like:
df['date'].fillna(datetime("2000-01-01"))
I get:
TypeError: an integer is required
Any way around this?
This should work in 0.12 and 0.13 (just released).
#DSM points out that datetimes are constructed like: datetime.datetime(2012,1,1)
SO the error is from failing to construct the value the you are passing to fillna.
Note that using a Timestamp WILL parse the string.
In [3]: s = Series(date_range('20130101',periods=10))
In [4]: s.iloc[3] = pd.NaT
In [5]: s.iloc[7] = pd.NaT
In [6]: s
Out[6]:
0 2013-01-01 00:00:00
1 2013-01-02 00:00:00
2 2013-01-03 00:00:00
3 NaT
4 2013-01-05 00:00:00
5 2013-01-06 00:00:00
6 2013-01-07 00:00:00
7 NaT
8 2013-01-09 00:00:00
9 2013-01-10 00:00:00
dtype: datetime64[ns]
datetime.datetime will work as well
In [7]: s.fillna(Timestamp('20120101'))
Out[7]:
0 2013-01-01 00:00:00
1 2013-01-02 00:00:00
2 2013-01-03 00:00:00
3 2012-01-01 00:00:00
4 2013-01-05 00:00:00
5 2013-01-06 00:00:00
6 2013-01-07 00:00:00
7 2012-01-01 00:00:00
8 2013-01-09 00:00:00
9 2013-01-10 00:00:00
dtype: datetime64[ns]
Right now, df['date'].fillna(pd.Timestamp("20210730")) works in pandas 1.3.1
This example is works with dynamic data if you want to replace NaT data in rows with data from another DateTime data.
df['column_with_NaT'].fillna(df['dt_column_with_thesame_index'], inplace=True)
It's works for me when I was updated some rows in DateTime column and not updated rows had NaT value, and I've been needed to inherit old series data. And this code above resolve my problem. Sry for the not perfect English )

python pandas date_range when importing with read_csv

I have a .csv file with some data in the following format:
1.69511909, 0.57561167, 0.31437427, 0.35458831, 0.15841189, 0.28239582, -0.18180907, 1.34761404, -1.5059083, 1.29246638
-1.66764664, 0.1488095, 1.03832221, -0.35229205, 1.35705861, -1.56747104, -0.36783851, -0.57636948, 0.9854391, 1.63031066
0.87763775, 0.60757153, 0.64908314, -0.68357724, 0.33499838, -0.08557089, 1.71855596, -0.61235066, -0.32520105, 1.54162629
Every line corresponds to a specific day, and every record in a line corresponds to a specific hour in that day.
Is there is a convenient way of importing the data with read_csv such that everything would be correctly indexed, i.e. the importing function would discriminate different days (lines), and hours within days (separate records in lines).
Something like this. Note that I couldn't copy your string for some reason, so my dataset is cutoff....
Read in the string (it reads as a dataframe because mine had newlines in it)....but need to coerce to a Series.
In [23]: s = pd.read_csv(StringIO(data)).values
In [24]: s
Out[24]:
array([[-1.66764664, 0.1488095 , 1.03832221, -0.35229205, 1.35705861,
-1.56747104, -0.36783851, -0.57636948, 0.9854391 , 1.63031066],
[ 0.87763775, 0.60757153, 0.64908314, -0.68357724, 0.33499838,
-0.08557089, 1.71855596, nan, nan, nan]])
In [25]: s = Series(pd.read_csv(StringIO(data)).values.ravel())
In [26]: s
Out[26]:
0 -1.667647
1 0.148810
2 1.038322
3 -0.352292
4 1.357059
5 -1.567471
6 -0.367839
7 -0.576369
8 0.985439
9 1.630311
10 0.877638
11 0.607572
12 0.649083
13 -0.683577
14 0.334998
15 -0.085571
16 1.718556
17 NaN
18 NaN
19 NaN
dtype: float64
Just set the index directly....Note that you are solely responsible for alignment, this is VERY
easy to be say off by 1
In [27]: s.index = pd.date_range('20130101',freq='H',periods=len(s))
In [28]: s
Out[28]:
2013-01-01 00:00:00 -1.667647
2013-01-01 01:00:00 0.148810
2013-01-01 02:00:00 1.038322
2013-01-01 03:00:00 -0.352292
2013-01-01 04:00:00 1.357059
2013-01-01 05:00:00 -1.567471
2013-01-01 06:00:00 -0.367839
2013-01-01 07:00:00 -0.576369
2013-01-01 08:00:00 0.985439
2013-01-01 09:00:00 1.630311
2013-01-01 10:00:00 0.877638
2013-01-01 11:00:00 0.607572
2013-01-01 12:00:00 0.649083
2013-01-01 13:00:00 -0.683577
2013-01-01 14:00:00 0.334998
2013-01-01 15:00:00 -0.085571
2013-01-01 16:00:00 1.718556
2013-01-01 17:00:00 NaN
2013-01-01 18:00:00 NaN
2013-01-01 19:00:00 NaN
Freq: H, dtype: float64
First just read in the DataFrame:
df = pd.read_csv(file_name, sep=',\s+', header=None)
Then set the index to be the dates and the columns to be the hours*
df.index = pd.date_range('2012-01-01', freq='D', periods=len(df))
from pandas.tseries.offsets import Hour
df.columns = [Hour(7+t) for t in df.columns]
In [5]: df
Out[5]:
<7 Hours> <8 Hours> <9 Hours> <10 Hours> <11 Hours> <12 Hours> <13 Hours> <14 Hours> <15 Hours> <16 Hours>
2012-01-01 1.695119 0.575612 0.314374 0.354588 0.158412 0.282396 -0.181809 1.347614 -1.505908 1.292466
2012-01-02 -1.667647 0.148810 1.038322 -0.352292 1.357059 -1.567471 -0.367839 -0.576369 0.985439 1.630311
2012-01-03 0.877638 0.607572 0.649083 -0.683577 0.334998 -0.085571 1.718556 -0.612351 -0.325201 1.541626
Then stack it and add the Date and the Hour levels of the MultiIndex:
s = df.stack()
s.index = [x[0]+x[1] for x in s.index]
In [8]: s
Out[8]:
2012-01-01 07:00:00 1.695119
2012-01-01 08:00:00 0.575612
2012-01-01 09:00:00 0.314374
2012-01-01 10:00:00 0.354588
2012-01-01 11:00:00 0.158412
2012-01-01 12:00:00 0.282396
2012-01-01 13:00:00 -0.181809
2012-01-01 14:00:00 1.347614
2012-01-01 15:00:00 -1.505908
2012-01-01 16:00:00 1.292466
2012-01-02 07:00:00 -1.667647
2012-01-02 08:00:00 0.148810
...
* You can use different offsets, see here, e.g. Minute, Second.

Categories

Resources