Map a pandas DataFrame index - python

Suppose I have a dataframe indexed by datetime:
> df.head()
                        value
2013-01-01 00:00:00 -0.014844
2013-01-01 01:00:00  0.243548
2013-01-01 02:00:00  0.463755
2013-01-01 03:00:00  0.695867
2013-01-01 04:00:00  0.845290
(...)
If I wanted to plot all values by date, I could do:
times = map(lambda x : x.date(), df.index)
values = df.value
plot(values, times)
Is there a more "pandas idiomatic" way to do it? I tried the .rename method, but I got an assertion error:
df.rename(lambda x : x.time())
What I really wanted was to do something like a boxplot:
df.boxplot(by = lambda x : x.time())
but without the standard deviation boxes (which will be substituted by estimated confidence bands). Is there a way to do this with a simple pandas command?
I don't know if I was clear about what the problem was. The problem is that I have a datetime field as the index of the dataframe, and I need to extract only the time part and plot the values by time. This will give me lots of points with the same x-axis value, which is fine, but the rename method seems to expect that each value in the resulting index is unique.

You can plot natively with the DataFrame plot method, for example:
df.plot()
df.plot(kind='bar')
...
This method gives you a lot of flexibility (with all the power of matplotlib).
The visualisation section of the docs goes into a lot of detail, and has plenty of examples.
In 0.12+ there's a time method/attribute on a DatetimeIndex (IIRC due to this question):
df.index.time # equivalent to df.index.map(lambda ts: ts.time())
To plot only the times, you could use:
plot(df.index.time, df.value)
However, this seems only slightly better than your solution, if at all. Perhaps the DatetimeIndex ought to offer a time method, similar to how it offers hour (I vaguely recall a similar question...):
plot(df.index.hour, df.value)
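For the boxplot part of the question, one possible sketch (not from the original answer; it assumes the df.index.time attribute mentioned above) is to move the time-of-day into a column and group on it:
# a minimal sketch, assuming df is the hourly-indexed frame from the question
tmp = df.assign(time_of_day=df.index.time)      # time-of-day as a regular column
tmp.boxplot(column='value', by='time_of_day')   # one box per time of day
Swapping the boxes for estimated confidence bands, as the question asks, would still need custom matplotlib work.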

Here is my solution:
Create the data:
import pandas as pd
import numpy as np

rng = pd.date_range('1/1/2011', periods=72, freq='H')
ts = pd.Series(np.random.randn(72), index=rng)
Plot date-value:
ts.to_period("D").plot(style="o")
Plot time-value:
pd.Series(ts.values,
          index=pd.DatetimeIndex(ts.index.values - ts.index.to_period("D").to_timestamp().values)).plot(style="o")
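A possibly simpler variant of the same idea (a sketch, not part of the original answer, assuming a pandas version that can plot a Series with a TimedeltaIndex): subtract the normalized index so the x-axis becomes the offset from midnight:
# a minimal sketch, assuming ts is the hourly series created above
time_of_day = ts.index - ts.index.normalize()   # TimedeltaIndex of offsets from midnight
pd.Series(ts.values, index=time_of_day).plot(style="o")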

If you want the time values, then this is fairly fast.
from datetime import time
import numpy as np

def dt_time(ind):
    return np.array([time(*time_tuple) for time_tuple in zip(ind.hour, ind.minute, ind.second)])
Calling map will be magnitudes slower.
In [29]: %timeit dt_time(dt)
1000 loops, best of 3: 511 µs per loop
In [30]: %timeit dt_map(dt)
10 loops, best of 3: 96.3 ms per loop
for a DatetimeIndex of length 100.
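(dt_map is not defined in the answer; presumably it is the map-based version, something like the hypothetical sketch below.)
# hypothetical definition of dt_map, assumed for the benchmark above
def dt_map(ind):
    return ind.map(lambda ts: ts.time())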

Related

Pandas: easier way to sample interpolated time series data at given times (e.g. every full day)

Regularly I run into the problem that I have time series data that I want to interpolate and resample at given times. I have a solution, but it feels like "too labor intensive", e.g. I guess there should be a simpler way. Have a look for how I currently do it here: https://gist.github.com/cs224/012f393d5ced6931ae223e6ddc4fe6b2 (or the nicer version via nbviewer here: https://nbviewer.org/gist/cs224/012f393d5ced6931ae223e6ddc4fe6b2)
Perhaps a motivating example: I fill up my car about every two weeks. I have the cost data of every refill. Now I would like to know the cumulative sum on a daily basis, where the day values are at midnight and interpolated.
Currently I create a new empty data frame that contains the time points at which I want to have my resampled values:
df_sampling = pd.DataFrame(index=pd.date_range(start, end, freq=freq))
And then either use pd.merge:
ldf = pd.merge(df_in, df_sampling, left_index=True, right_index=True, how='outer')
or pd.concat:
ldf = pd.concat([df_in, df_sampling], axis=1)
to create a combined time series that has the additional time points in the index. Based on that I can then use pd.interpolate and then sub-select all index values given by df_sampling. See the gist for details.
All this feels too cumbersome, and I guess there should be a better way to do it.
Instead of using either merge or concat inside your function generate_interpolated_time_series, I would rely on df.reindex. Something like this:
def f(df_in, freq='T', start=None):
    if start is None:
        start = df_in.index[0].floor('T')
        # refactored: df_in.index[0].replace(second=0, microsecond=0, nanosecond=0)
    end = df_in.index[-1]
    idx = pd.date_range(start=start, end=end, freq=freq)
    ldf = df_in.reindex(df_in.index.union(idx)).interpolate().bfill()
    ldf = ldf[~ldf.index.isin(df_in.index.difference(idx))]
    return ldf
Test sample:
from pandas import Timestamp
d = {Timestamp('2022-10-07 11:06:09.957000'): 21.9,
     Timestamp('2022-11-19 04:53:18.532000'): 47.5,
     Timestamp('2022-11-19 16:30:04.564000'): 66.9,
     Timestamp('2022-11-21 04:17:57.832000'): 96.9,
     Timestamp('2022-12-05 22:26:48.354000'): 118.6}
df = pd.DataFrame.from_dict(d, orient='index', columns=['values'])
print(df)
                         values
2022-10-07 11:06:09.957    21.9
2022-11-19 04:53:18.532    47.5
2022-11-19 16:30:04.564    66.9
2022-11-21 04:17:57.832    96.9
2022-12-05 22:26:48.354   118.6
Check for equality:
merge = generate_interpolated_time_series(df, freq='D', method='merge')
concat = generate_interpolated_time_series(df, freq='D', method='concat')
reindex = f(df, freq='D')
print(all([merge.equals(concat),merge.equals(reindex)]))
# True
An added bonus is some performance gain: comparing the three methods with %timeit across the frequencies ['D', 'H', 'T', 'S'], reindex is the fastest for each.
Aside: in your function, raise Exception('Method unknown: ' + metnhod) contains a typo; should be method.
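For reference, a minimal sketch of how such a timing comparison could be run (generate_interpolated_time_series is the function from the linked gist; the exact benchmark setup is an assumption):
import timeit

for freq in ['D', 'H', 'T', 'S']:
    for label, call in [('merge',   lambda: generate_interpolated_time_series(df, freq=freq, method='merge')),
                        ('concat',  lambda: generate_interpolated_time_series(df, freq=freq, method='concat')),
                        ('reindex', lambda: f(df, freq=freq))]:
        # time each method a few times and print the total
        print(freq, label, timeit.timeit(call, number=3))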

Pandas to_datetime () function performance issues

Have a df like that:
Dat
10/01/2016
11/01/2014
12/02/2013
The column 'Dat' has object type, so I'm trying to switch it to datetime using the pandas to_datetime() function this way:
from functools import partial

to_datetime_rand = partial(pd.to_datetime, format='%m/%d/%Y')
df['DAT'] = df['DAT'].apply(to_datetime_rand)
Everything works well, but I have performance issues when my df has more than 2 billion rows. In that case this method gets stuck and does not work well.
Does the pandas to_datetime() function have the ability to do the conversion in chunks, or maybe iteratively by looping?
Thanks.
If performance is a concern I would advise to use the following function to convert those columns to date_time:
def lookup(s):
    """
    This is an extremely fast approach to datetime parsing.
    For large data, the same dates are often repeated. Rather than
    re-parse these, we store all unique dates, parse them, and
    use a lookup to convert all dates.
    """
    dates = {date: pd.to_datetime(date) for date in s.unique()}
    return s.apply(lambda v: dates[v])
to_datetime: 5799 ms
dateutil: 5162 ms
strptime: 1651 ms
manual: 242 ms
lookup: 32 ms
UPDATE: This enhancement has been incorporated into pandas 0.23.0
cache : boolean, default False
If True, use a cache of unique, converted dates to apply the datetime
conversion. May produce significant speed-up when parsing duplicate
date strings, especially ones with timezone offsets.
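So on pandas 0.23.0+ the lookup trick above is roughly equivalent to passing cache=True (a sketch; the format string follows the question):
# pandas >= 0.23: to_datetime caches unique parsed dates itself
df['DAT'] = pd.to_datetime(df['DAT'], format='%m/%d/%Y', cache=True)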
You could split your huge dataframe into smaller chunks; for example, this method can do it, and you decide the chunk size:
def splitDataFrameIntoSmaller(df, chunkSize=10000):
    listOfDf = list()
    numberChunks = len(df) // chunkSize + 1
    for i in range(numberChunks):
        listOfDf.append(df[i*chunkSize:(i+1)*chunkSize])
    return listOfDf
After you have chunks, you can apply the datetime function on each chunk separately.
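For example, a sketch building on the helper above (the chunk size and format string are assumptions):
# convert each chunk separately, then stitch the results back together
chunks = splitDataFrameIntoSmaller(df, chunkSize=1000000)
converted = [pd.to_datetime(chunk['DAT'], format='%m/%d/%Y') for chunk in chunks]
df['DAT'] = pd.concat(converted)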
I just came across this same issue myself. Thanks to SerialDev for the excellent answer. To build on that, I tried using datetime.strptime instead of pd.to_datetime:
from datetime import datetime as dt
dates = {date : dt.strptime(date, '%m/%d/%Y') for date in df['DAT'].unique()}
df['DAT'] = df['DAT'].apply(lambda v: dates[v])
The strptime method was 6.5x faster than the to_datetime method for me.
Inspired by the previous answers, in the case of having both performance problems and multiple date formats, I suggest the following solution.
from datetime import datetime

dates = {}
for date in df['DAT'].unique():
    for ft in ['%Y/%m/%d', '%Y']:
        try:
            dates[date] = datetime.strptime(date, ft) if date else None
        except ValueError:
            continue
df['DAT'] = df['DAT'].apply(lambda v: dates[v])

pandas out of bounds nanosecond timestamp after offset rollforward plus adding a month offset

I am confused how pandas blew out of bounds for datetime objects with these lines:
import pandas as pd
BOMoffset = pd.tseries.offsets.MonthBegin()
# here some code sets the all_treatments dataframe and the newrowix, micolix, mocolix counters
all_treatments.iloc[newrowix,micolix] = BOMoffset.rollforward(all_treatments.iloc[i,micolix] + pd.tseries.offsets.DateOffset(months = x))
all_treatments.iloc[newrowix,mocolix] = BOMoffset.rollforward(all_treatments.iloc[newrowix,micolix]+ pd.tseries.offsets.DateOffset(months = 1))
Here all_treatments.iloc[i,micolix] is a datetime set by pd.to_datetime(all_treatments['INDATUMA'], errors='coerce',format='%Y%m%d'), and INDATUMA is date information in the format 20070125.
This logic seems to work on mock data (no errors, dates make sense), so at the moment I cannot reproduce why it fails on my full data with the following error:
pandas.tslib.OutOfBoundsDatetime: Out of bounds nanosecond timestamp: 2262-05-01 00:00:00
Since pandas represents timestamps in nanosecond resolution, the timespan that can be represented using a 64-bit integer is limited to approximately 584 years
In [54]: pd.Timestamp.min
Out[54]: Timestamp('1677-09-22 00:12:43.145225')
In [55]: pd.Timestamp.max
Out[55]: Timestamp('2262-04-11 23:47:16.854775807')
And your value, 2262-05-01 00:00:00, is outside this range, hence the out-of-bounds error.
Straight out of: https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#timestamp-limitations
Workaround:
This will force the dates which are outside the bounds to NaT
pd.to_datetime(date_col_to_force, errors = 'coerce')
Setting the errors parameter in pd.to_datetime to 'coerce' causes replacement of out of bounds values with NaT. Quoting the docs:
If ‘coerce’, then invalid parsing will be set as NaT
E.g.:
datetime_variable = pd.to_datetime(datetime_variable, errors = 'coerce')
This does not fix the data (obviously), but still allows processing the non-NaT data points.
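A small sketch of what processing only the non-NaT points might look like afterwards (variable names follow the example above):
datetime_variable = pd.to_datetime(datetime_variable, errors='coerce')
n_coerced = datetime_variable.isna().sum()   # how many values were forced to NaT
in_range = datetime_variable.dropna()        # keep only timestamps inside the ns range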
The reason you are seeing the error message
"OutOfBoundsDatetime: Out of bounds nanosecond timestamp: 3000-12-23 00:00:00" is that the pandas timestamp data type stores dates in nanosecond resolution (from the docs).
This means the date values have to be in the range between
pd.Timestamp.min (1677-09-21 00:12:43.145225) and
pd.Timestamp.max (2262-04-11 23:47:16.854775807).
Even if you only want the date with a resolution of seconds or microseconds, pandas will still store it internally in nanoseconds. There is no option in pandas to store a timestamp outside of the above-mentioned range.
This is surprising because databases like SQL Server and libraries like NumPy allow storing dates beyond this range. Also, a maximum of 64 bits is used in most cases to store the date.
But here is the difference.
SQL Server stores dates in nanosecond resolution but only up to an accuracy of 100 ns (as opposed to 1 ns in pandas). Since the space is limited (64 bits), it's a matter of range vs. accuracy. With the pandas timestamp we have higher accuracy but a smaller date range.
In the case of the NumPy datetime64 data type (pandas is built on top of NumPy): if the date falls in the above-mentioned range, you can store it in nanoseconds, which is similar to pandas. Or you can give up the nanosecond resolution and go with microseconds, which will give you a much larger range. This is something that is missing in the pandas timestamp type.
However, if you choose to store in nanoseconds and the date is outside the range, then NumPy will automatically wrap around this date and you might get unexpected results (referenced below in the 4th solution).
np.datetime64("3000-06-19T08:17:14.073456178", "ns")
> numpy.datetime64('1831-05-11T09:08:06.654352946')
Now with pandas we have the options below.
import pandas as pd
data = {'Name': ['John', 'Sam'], 'dob': ['3000-06-19T08:17:14', '2000-06-19T21:17:14']}
my_df = pd.DataFrame(data)
1) If you are OK with losing the data which is out of range, then simply use the param below to convert out-of-range dates to NaT (not a time).
my_df['dob'] = pd.to_datetime(my_df['dob'], errors = 'coerce')
2) If you don't want to lose the data, then you can convert the values into a Python datetime type. Here the column "dob" is of type pandas object, but the individual values will be of type Python datetime. However, doing this we lose the benefit of vectorized functions.
import datetime as dt
my_df['dob'] = my_df['dob'].apply(lambda x: dt.datetime.strptime(x,'%Y-%m-%dT%H:%M:%S') if type(x)==str else pd.NaT)
print(type(my_df.iloc[0][1]))
> <class 'datetime.datetime'>
3) Another option is to use NumPy instead of a pandas Series if possible. In the case of a pandas dataframe, you can convert a Series (or column in a df) to a NumPy array, process the data separately, and then join it back to the dataframe.
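A sketch of that (the column name follows the dataframe above; the microsecond unit is one possible choice):
import numpy as np
# parse the ISO strings at microsecond resolution, which covers dates far beyond 2262
dob_us = np.array(my_df['dob'].tolist(), dtype='datetime64[us]')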
4) We can also use pandas timespans as suggested in the docs. Do check out the difference between timestamp and period before using this data type. Date range and frequency here work similarly to NumPy (mentioned above in the NumPy section).
my_df['dob'] = my_df['dob'].apply(lambda x: pd.Period(x, freq='ms'))
You can try strptime() from the datetime library along with a lambda expression to convert text to date values in a Series object:
Example:
df['F'].apply(lambda x: datetime.datetime.strptime(x, '%m/%d/%Y %I:%M:%S') if type(x)==str else np.NaN)
None of the above is ideal, because they delete your data. But you can keep your data and just adjust the conversion:
# converting from epoch to datetime, maintaining the nanosecond timestamps
xbarout= pd.to_datetime(xbarout.iloc[:,0],unit='ns')

Pandas number of business days between a DatetimeIndex and a Timestamp

This is quite similar to the question here but I'm wondering if there is a clean way in pandas to make a business day aware TimedeltaIndex? Ultimately I am trying to get the number of business days (no holiday calendar) between a DatetimeIndex and a Timestamp. As per the referenced question, something like this works
import pandas as pd
import numpy as np
drg = pd.date_range('2015-07-31', '2015-08-05', freq='B')
A = [d.date() for d in drg]
B = pd.Timestamp('2015-08-05', 'B').date()
np.busday_count(A, B)
which gives
array([3, 2, 1, 0], dtype=int64)
but this seems a bit kludgy. If I try something like
drg - pd.Timestamp('2015-08-05', 'B')
I get a TimedeltaIndex but the business day frequency is dropped
TimedeltaIndex(['-5 days', '-2 days', '-1 days', '0 days'], dtype='timedelta64[ns]', freq=None)
Just wondering if there is a more elegant way to go about this.
TimedeltaIndexes represent fixed spans of time. They can be added to Pandas Timestamps to increment them by fixed amounts. Their behavior is never dependent on whether or not the Timestamp is a business day.
The TimedeltaIndex itself is never business-day aware.
Since the ultimate goal is to count the number of days between a DatetimeIndex and a Timestamp, I would look in another direction than conversion to TimedeltaIndex.
Unfortunately, date calculations are rather complicated, and a number of data structures have sprung up to deal with them -- Python datetime.dates, datetime.datetimes, Pandas Timestamps, NumPy datetime64s.
They each have their strengths, but no one of them is good for all purposes. To take advantage of their strengths, it is sometimes necessary to convert between these types.
To use np.busday_count you need to convert the DatetimeIndex and Timestamp to some type np.busday_count understands. What you call kludginess is the code required to convert types. There is no way around that, assuming we want to use np.busday_count -- and I know of no better tool for this job than np.busday_count.
So, although I don't think there is a more succinct way to count business days than the method you propose, there is a far more performant way:
Convert to datetime64[D]'s instead of Python datetime.date objects:
import pandas as pd
import numpy as np
drg = pd.date_range('2000-07-31', '2015-08-05', freq='B')
timestamp = pd.Timestamp('2015-08-05', 'B')
def using_astype(drg, timestamp):
    A = drg.values.astype('<M8[D]')
    B = timestamp.asm8.astype('<M8[D]')
    return np.busday_count(A, B)

def using_datetimes(drg, timestamp):
    A = [d.date() for d in drg]
    B = pd.Timestamp('2015-08-05', 'B').date()
    return np.busday_count(A, B)
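Both functions should return the same counts; a quick sanity check (a sketch):
# the two conversion routes should agree element-wise
assert (using_astype(drg, timestamp) == using_datetimes(drg, timestamp)).all()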
This is over 100x faster for the example above (where len(drg) is close to 4000):
In [88]: %timeit using_astype(drg, timestamp)
10000 loops, best of 3: 95.4 µs per loop
In [89]: %timeit using_datetimes(drg, timestamp)
100 loops, best of 3: 10.3 ms per loop
np.busday_count converts its input to datetime64[D]s anyway, so avoiding this extra conversion to and from datetime.dates is far more efficient.

Time Series plot of timestamps in monthly buckets using Python/Pandas [duplicate]

I'm trying to calculate daily sums of values using pandas. Here's the test file - http://pastebin.com/uSDfVkTS
This is the code I came up so far:
import numpy as np
import datetime as dt
import pandas as pd
f = np.genfromtxt('test', dtype=[('datetime', '|S16'), ('data', '<i4')], delimiter=',')
dates = [dt.datetime.strptime(i, '%Y-%m-%d %H:%M') for i in f['datetime']]
s = pd.Series(f['data'], index = dates)
d = s.resample('D', how='sum')
Using the given test file this produces:
2012-01-02 1128
Freq: D
First problem is that calculated sum corresponds to the next day. I've been able to solve that by using parameter loffset='-1d'.
Now the actual problem is that the data may start not from 00:30 of a day but at any time of a day. Also the data has gaps filled with 'nan' values.
That said, is it possible to set a lower threshold of number of values that are necessary to calculate daily sums? (e.g. if there're less than 40 values in a single day, then put NaN instead of a sum)
I believe that it is possible to define a custom function to do that and refer to it in 'how' parameter, but I have no clue how to code the function itself.
You can do it directly in Pandas:
s = pd.read_csv('test', header=None, index_col=0, parse_dates=True)
d = s.groupby(lambda x: x.date()).aggregate(lambda x: sum(x) if len(x) >= 40 else np.nan)
X.2
2012-01-01 1128
A much easier way is to use pd.Grouper:
d = s.groupby(pd.Grouper(freq='1D')).sum()
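If you also need the threshold from the question (at least 40 values per day), a sketch combining pd.Grouper with a count filter:
g = s.groupby(pd.Grouper(freq='1D'))
d = g.sum().where(g.count() >= 40)   # days with fewer than 40 readings become NaN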
