I have a pandas dataframe with a column of Timedelta type. I used groupby with a separate month column to create groups of these Timedelta values by month. I then tried to use agg with min, max, and mean on the Timedelta column, which triggered DataError: No numeric types to aggregate.
As a workaround I tried using total_seconds() together with apply() to get a numeric representation of the column, but the behaviour seems strange to me: the NaT values in my Timedelta column were turned into -9.223372e+09, whereas total_seconds() returns NaN when called on a scalar NaT without apply().
A minimal example:
import numpy as np
import pandas as pd

test = pd.Series([np.datetime64('nat'), np.datetime64('nat')])
res = test.apply(pd.Timedelta.total_seconds)
print(res)
which produces:
0 -9.223372e+09
1 -9.223372e+09
dtype: float64
whereas:
res = test.iloc[0].total_seconds()
print(res)
yields:
nan
The behaviour of the second example is the desired one, as I wish to perform aggregations etc. and propagate missing/invalid values. Is this a bug?
You should use the .dt.total_seconds() method instead of applying the pd.Timedelta.total_seconds function to a datetime64[ns] dtype column:
In [232]: test
Out[232]:
0 NaT
1 NaT
dtype: datetime64[ns] # <----
In [233]: pd.to_timedelta(test)
Out[233]:
0 NaT
1 NaT
dtype: timedelta64[ns] # <----
In [234]: pd.to_timedelta(test).dt.total_seconds()
Out[234]:
0 NaN
1 NaN
dtype: float64
Another demo:
In [228]: s = pd.Series(pd.to_timedelta(['03:33:33','1 day','aaa'], errors='coerce'))
In [229]: s
Out[229]:
0 0 days 03:33:33
1 1 days 00:00:00
2 NaT
dtype: timedelta64[ns]
In [230]: s.dt.total_seconds()
Out[230]:
0 12813.0
1 86400.0
2 NaN
dtype: float64
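Tying this back to the original groupby question, here is a minimal sketch of converting to seconds before aggregating; the 'month' and 'duration' column names are only placeholders, not from the original data:

import pandas as pd

# Hypothetical frame: a month key plus a timedelta64[ns] duration column
df = pd.DataFrame({
    'month': [1, 1, 2, 2],
    'duration': pd.to_timedelta(['01:00:00', '03:00:00', None, '02:00:00']),
})

# Convert to seconds first; NaT becomes NaN and is skipped by the aggregations
# instead of turning into a huge negative number
out = (df.assign(seconds=df['duration'].dt.total_seconds())
         .groupby('month')['seconds']
         .agg(['min', 'max', 'mean']))
print(out)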
I'm having problems interpolating over time in Pandas, so I've taken it back to a very basic example and I still see the same problem.
c is the dataframe, a is the index (a datetime64 array), and b is the data (a float array).
In [104]: c
Out[104]:
b
a
2009-04-01 386.928680
2009-06-01 386.502686
In [105]: a
Out[105]:
0 2009-04-01
1 2009-06-01
dtype: datetime64[ns]
In [106]: b
Out[106]:
0 386.928680
1 386.502686
dtype: float64
upsampled = c.resample('M')
interpolated = upsampled.interpolate(method='linear')
In [107]: interpolated
Out[107]:
b
a
2009-04-30 NaN
2009-05-31 NaN
2009-06-30 NaN
I've tried changing the interpolation method and setting the limit keyword but nothing seems to help and I just get all NaNs.
You need to change your resample frequency to 'MS' (month start) so the resampled bins line up with the original dates and keep the original values.
c.resample('MS').asfreq().interpolate(method='linear')
Output:
b
a
2009-04-01 386.928680
2009-05-01 386.715683
2009-06-01 386.502686
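For completeness, a self-contained sketch of the same fix, rebuilding the two-row frame from the question. With 'MS' the labels fall on the original dates, so asfreq() keeps both values and interpolate() has endpoints to work between:

import pandas as pd

c = pd.DataFrame({'b': [386.928680, 386.502686]},
                 index=pd.to_datetime(['2009-04-01', '2009-06-01']))
c.index.name = 'a'

# Month-start bins line up with the existing index, so both values survive asfreq()
interpolated = c.resample('MS').asfreq().interpolate(method='linear')
print(interpolated)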
I'm trying to divide a Pandas DataFrame column by a lagged value, which is 1 in this example.
Create the dataframe. This example only has 1 column, even though my real data has dozens
import pandas as pd

dTest = pd.DataFrame(data={'Open': [0.99355, 0.99398, 0.99534, 0.99419]})
When I try this vector division (I'm a Python newbie coming from R):
dTest.ix[range(1,4),'Open'] / dTest.ix[range(0,3),'Open']
I get this output:
NaN 1 1 NaN
But I'm expecting:
1.0004327915052085
1.0013682367854484
0.9988446159101413
There's clearly something that I don't understand about the data structure. I'm expecting 3 values but it's outputting 4. What am I missing?
What you tried fails because the sliced ranges only overlap on the middle 2 rows, and pandas aligns on the index when dividing. You should use shift to lag the column instead:
In [166]:
dTest['Open'] / dTest['Open'].shift()
Out[166]:
0 NaN
1 1.000433
2 1.001368
3 0.998845
Name: Open, dtype: float64
you can also use div:
In [159]:
dTest['Open'].div(dTest['Open'].shift(), axis=0)
Out[159]:
0 NaN
1 1.000433
2 1.001368
3 0.998845
Name: Open, dtype: float64
You can see that the indices differ when you slice, so when using / only the common index labels produce values:
In [164]:
dTest.ix[range(0,3),'Open']
Out[164]:
0 0.99355
1 0.99398
2 0.99534
Name: Open, dtype: float64
In [165]:
dTest.ix[range(1,4),'Open']
Out[165]:
1 0.99398
2 0.99534
3 0.99419
Name: Open, dtype: float64
here:
In [168]:
dTest.ix[range(0,3),'Open'].index.intersection(dTest.ix[range(1,4),'Open'].index)
Out[168]:
Int64Index([1, 2], dtype='int64')
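As a side note, .ix has since been deprecated and removed from pandas, so here is a sketch of the same lagged ratio in current pandas, with pct_change as a related shortcut (it computes x / x.shift() - 1, so adding 1 back gives the same ratios):

import pandas as pd

dTest = pd.DataFrame(data={'Open': [0.99355, 0.99398, 0.99534, 0.99419]})

ratio = dTest['Open'] / dTest['Open'].shift()   # NaN in the first row
ratio_alt = dTest['Open'].pct_change() + 1      # essentially the same values
print(ratio)
print(ratio_alt)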
I have a pandas data frame with a 'date_of_birth' column. Values take the form 1977-10-24T00:00:00.000Z for example.
I want to grab the year, so I tried the following:
X['date_of_birth'] = X['date_of_birth'].apply(lambda x: int(str(x)[:4]))
This works if I am guaranteed that the first 4 letters are always integers, but it fails on my data set as some dates are messed up or garbage. Is there a way I can adjust my lambda without using regex? If not, how could I write this in regex?
I think it would be better to just use to_datetime to convert to datetime dtype; you can drop the invalid rows using dropna and access just the year attribute using dt.year:
In [58]:
df = pd.DataFrame({'date':['1977-10-24T00:00:00.000Z', 'duff', '200', '2016-01-01']})
df['mod_dates'] = pd.to_datetime(df['date'], errors='coerce')
df
Out[58]:
date mod_dates
0 1977-10-24T00:00:00.000Z 1977-10-24
1 duff NaT
2 200 NaT
3 2016-01-01 2016-01-01
In [59]:
df.dropna()
Out[59]:
date mod_dates
0 1977-10-24T00:00:00.000Z 1977-10-24
3 2016-01-01 2016-01-01
In [60]:
df['mod_dates'].dt.year
Out[60]:
0 1977.0
1 NaN
2 NaN
3 2016.0
Name: mod_dates, dtype: float64
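If integer years are needed (dt.year comes back as float here only because of the NaT rows), a small follow-on sketch reusing the df from above:

# Drop the NaT rows first, then the year accessor yields plain integers
years = df.dropna(subset=['mod_dates'])['mod_dates'].dt.year.astype(int)
print(years)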
I have data like the following:
import pandas as pd
from datetime import datetime
x = pd.Series([1, 2, 4], [datetime(2013,11,1), datetime(2013,11, 2), datetime(2013, 11, 4)])
The missing index at November 3rd corresponds to a zero value, and I want it to look like this:
y = pd.Series([1,2,0,4], pd.date_range('2013-11-01', periods = 4))
What's the best way to convert x to y? I've tried
y = pd.Series(x, pd.date_range('2013-11-1', periods = 4)).fillna(0)
This sometimes throws an index error which I can't interpret ("Index length did not match values"), even though the index and the data have the same length. Is there a better way to do this?
You can use pandas.Series.resample() for this:
>>> x.resample('D').fillna(0)
2013-11-01 1
2013-11-02 2
2013-11-03 0
2013-11-04 4
There's a fill_method parameter in the resample() function, but I don't know if it can be used to replace NaN during resampling. It looks like you can use the how argument to take care of it, like:
>>> x.resample('D', how=lambda x: x.mean() if len(x) > 0 else 0)
2013-11-01 1
2013-11-02 2
2013-11-03 0
2013-11-04 4
I don't know which method is the preferred one. Please also take a look at @AndyHayden's answer - probably reindex() with fill_value=0 would be the most efficient way to do this, but you have to run your own tests.
I think I would use a resample (note if there are dupes it takes the mean by default):
In [11]: x.resample('D') # you could use how='first'
Out[11]:
2013-11-01 1
2013-11-02 2
2013-11-03 NaN
2013-11-04 4
Freq: D, dtype: float64
In [12]: x.resample('D').fillna(0)
Out[12]:
2013-11-01 1
2013-11-02 2
2013-11-03 0
2013-11-04 4
Freq: D, dtype: float64
If you'd prefer duplicates to raise an error instead, then use reindex:
In [13]: x.reindex(pd.date_range('2013-11-1', periods=4), fill_value=0)
Out[13]:
2013-11-01 1
2013-11-02 2
2013-11-03 0
2013-11-04 4
Freq: D, dtype: float64
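As a version note, in pandas 0.18 and later resample() returns a Resampler object rather than the resampled data itself, so a rough modern equivalent of the snippets above (a sketch, not the original answers' code) is:

import pandas as pd
from datetime import datetime

x = pd.Series([1, 2, 4],
              [datetime(2013, 11, 1), datetime(2013, 11, 2), datetime(2013, 11, 4)])

y = x.resample('D').asfreq().fillna(0)   # fill the missing day with 0
# or, since the missing day simply has no observations:
y = x.resample('D').sum()                # empty bins sum to 0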
I have a massive dataframe with four columns, two of which are 'date' (in datetime format) and 'page' (a location saved as a string). I have grouped the dataframe by 'page' and called it pagegroup, and want to know the range of time over which each page is accessed (e.g. the first access was on 1-1-13, the last on 1-5-13, so the max-min is 5 days).
I know in pandas I can use date_range to compare two datetimes, but trying something like:
pagegroup['date'].agg(np.date_range)
returns
AttributeError: 'module' object has no attribute 'date_range'
while trying the simple (non date-specific) numpy function ptp gives me an integer answer:
daterange = pagegroup['date'].agg([np.ptp])
daterange.head()
ptp
page
%2F 0
/ 13325984000000000
/-509606456 297697000000000
/-511484155 0
/-511616154 0
Can anyone think of a way to calculate the range of dates and have it return in a recognizable date format?
Thank you
Assuming you have indexed by datetime, you can use groupby apply:
In [11]: df = pd.DataFrame([[1, 2], [1, 3], [2, 4]],
                           columns=list('ab'),
                           index=pd.date_range('2013-08-22', freq='H', periods=3))
In [12]: df
Out[12]:
a b
2013-08-22 00:00:00 1 2
2013-08-22 01:00:00 1 3
2013-08-22 02:00:00 2 4
In [13]: g = df.groupby('a')
In [14]: g.apply(lambda x: x.iloc[-1].name - x.iloc[0].name)
Out[14]:
a
1 01:00:00
2 00:00:00
dtype: timedelta64[ns]
Here iloc[-1] grabs the last row in the group and iloc[0] gets the first. The name attribute is the index of the row.
@Elyase points out that this only works if the original DatetimeIndex was in order; if not, you can use max/min (which actually reads better, but may be less efficient):
In [15]: g.apply(lambda x: x.index.max() - x.index.min())
Out[15]:
a
1 01:00:00
2 00:00:00
dtype: timedelta64[ns]
Note: to get the timedelta between two Timestamps we have just subtracted (-).
If date is a column rather than an index, then use the column name:
g.apply(lambda x: x['date'].iloc[-1] - x['date'].iloc[0])
g.apply(lambda x: x['date'].max() - x['date'].min())
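Tying this back to the original question, a hedged sketch assuming a frame with 'page' and 'date' columns (the data here is made up):

import pandas as pd

df = pd.DataFrame({
    'page': ['/', '/', '/-509606456', '/-509606456'],
    'date': pd.to_datetime(['2013-01-01', '2013-01-05',
                            '2013-02-01', '2013-02-04']),
})

# max - min per page comes back as timedelta64[ns], not raw nanoseconds
daterange = df.groupby('page')['date'].agg(lambda s: s.max() - s.min())
print(daterange)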