map behaves strangely on DatetimeIndex - python

I'm seeing strange behaviour from the map function when it is applied to a DatetimeIndex: the first thing passed to the mapped function appears to be the whole index, and after that each element is processed individually (as expected).
Here's a way to reproduce the issue (tried on pandas 0.22.0, 0.23.0 and 0.24.0):
df = pd.DataFrame(data=np.random.randn(3, 1),
                  index=pd.DatetimeIndex(start='2018-05-03',
                                         periods=3,
                                         freq='D'))
df.index.map(lambda x: print(x))
yields:
DatetimeIndex(['2018-05-03', '2018-05-04', '2018-05-05'], dtype='datetime64[ns]', freq='D')
2018-05-03 00:00:00
2018-05-04 00:00:00
2018-05-05 00:00:00
Index([None, None, None], dtype='object')
EDIT: The very first line that the print produces is what I find odd. If I use a RangeIndex, this doesn't happen.

Surprising print behaviour
This unusual behaviour only affects a DatetimeIndex and not a Series. So to fix the bug, wrap your index in pd.Series() before mapping the lambda function:
pd.Series(df.index).map(lambda x: print(x))
Alternatively you can use the .to_series() method:
df.index.to_series().map(lambda x: print(x))
Note the return values of the pd.Series() version will be numerically indexed, while the return values of the .to_series() version will be datetime indexed.
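For example, here is a small sketch (using the df defined above) that maps a trivial function just to show the difference in the index of the returned Series:
s1 = pd.Series(df.index).map(lambda x: x.day)   # values 3, 4, 5
s2 = df.index.to_series().map(lambda x: x.day)  # same values
print(s1.index)
# RangeIndex(start=0, stop=3, step=1)
print(s2.index)
# DatetimeIndex(['2018-05-03', '2018-05-04', '2018-05-05'], dtype='datetime64[ns]', freq='D')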
Is this a bug?
Index.map(), like Series.map(), returns the values produced by your lambda function - here as an Index, since you called it on an Index.
print() just returns None, so you are correctly getting an Index of None values. The print behaviour is inconsistent with other types of pandas Indexes and with Series, but this is an unusual use of map.
import pandas as pd
import numpy as np
df = pd.DataFrame(data=np.random.randn(3, 1),
                  index=pd.DatetimeIndex(start='2018-05-03',
                                         periods=3,
                                         freq='D'))
example = df.index.map(lambda x: print(x))
# DatetimeIndex(['2018-05-03', '2018-05-04', '2018-05-05'], dtype='datetime64[ns]', freq='D')
# 2018-05-03 00:00:00
# 2018-05-04 00:00:00
# 2018-05-05 00:00:00
print(example)
# Index([None, None, None], dtype='object')
As you can see, there's nothing wrong with the return value. For a clearer example, here is one where we add one day to each item:
example2 = df.index.map(lambda x: x + 1)
print(example2)
# DatetimeIndex(['2018-05-04', '2018-05-05', '2018-05-06'], dtype='datetime64[ns]', freq='D')
So the print behaviour is inconsistent with similar classes in pandas, but the return values are correct.
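And if all you want is the side effect of printing each timestamp (rather than a mapped result), a plain loop over the index sidesteps the quirk entirely; a trivial sketch:
for ts in df.index:
    print(ts)
# 2018-05-03 00:00:00
# 2018-05-04 00:00:00
# 2018-05-05 00:00:00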

Related

Panda Datetimes: get datetime ranges from lists of datetimes

Not sure if relevant, but the dates are in a DatetimeIndex (a list?) in pandas, Python 3.6.
I'm trying to get all the date ranges of consecutive days, outputting the minimum and maximum of each such range.
Output is preferred as a list, but it seems like a DataFrame is essentially a list I can index into, I think?
I would later output these date ranges to an Excel sheet.
Sample input:
'1990-10-01', '1990-10-02', '1990-10-03', '1990-10-05', '2002-10-05', '2002-10-06'
Expected output:
1990-10-01, 1990-10-03
1990-10-05
2002-10-05, 2002-10-06
I know a naive method would be a for loop that checks whether the next/previous date is off by one, comparing the day, month, and year. But what's a better way to do this?
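Something like this minimal sketch is what I mean by the naive method (assuming the dates are already sorted datetime.date objects):
from datetime import date, timedelta

dates = [date(1990, 10, 1), date(1990, 10, 2), date(1990, 10, 3),
         date(1990, 10, 5), date(2002, 10, 5), date(2002, 10, 6)]

ranges = []
start = prev = dates[0]
for d in dates[1:]:
    if d - prev != timedelta(days=1):   # gap found, close the current range
        ranges.append((start, prev))
        start = d
    prev = d
ranges.append((start, prev))
# ranges -> (1990-10-01, 1990-10-03), (1990-10-05, 1990-10-05), (2002-10-05, 2002-10-06)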
Thanks
Edited to clarify
Setup:
import numpy as np
import pandas as pd

df = pd.DataFrame()
df['Date'] = pd.to_datetime(['1990-10-01', '1990-10-02', '1990-10-03', '1990-10-05', '2002-10-05', '2002-10-06'])
Solution:
First calculate a running diff, create a flag to indicate whether consecutive dates belong to the same group, then group by that flag and take the start and end date of each group. A set is used to drop the end date when it equals the start.
(
    df.assign(DateDiff=(df.Date - df.Date.shift(1)).dt.days.fillna(0))
      .assign(Flag=lambda x: np.where(x.DateDiff == 1, np.nan, range(len(x))))
      .assign(Flag=lambda x: x.Flag.ffill())
      .groupby(by='Flag').Date
      .apply(lambda x: set([x.iloc[0].date(), x.iloc[-1].date()]))
)
Flag
0.0 {1990-10-01, 1990-10-03}
3.0 {1990-10-05}
4.0 {2002-10-05, 2002-10-06}
Name: Date, dtype: object
Let's create the example:
Input:
l = ['1990-10-01', '1990-10-02', '1990-10-03', '1990-10-05', '2002-10-05', '2002-10-06']
idx = pd.DatetimeIndex(l)
DatetimeIndex(['1990-10-01', '1990-10-02', '1990-10-03', '1990-10-05',
               '2002-10-05', '2002-10-06'],
              dtype='datetime64[ns]', freq=None)
Solution:
Create a helper series that calculates the difference between consecutive dates and starts a new group wherever the difference is not 1, then loop over the groups and take the first and last item of each group.
g = idx.to_series().diff().fillna(pd.Timedelta(days=1)).dt.days.ne(1).cumsum()

final = [pd.DatetimeIndex(map(grp.index.__getitem__, (0, -1)))
         if len(grp.index) > 1 else grp.index
         for _, grp in g.groupby(g)]
Output:
[DatetimeIndex(['1990-10-01', '1990-10-03'], dtype='datetime64[ns]', freq=None),
DatetimeIndex(['1990-10-05'], dtype='datetime64[ns]', freq=None),
DatetimeIndex(['2002-10-05', '2002-10-06'], dtype='datetime64[ns]', freq=None)]
If you want a dataframe so you can call df.to_excel(...), just create one from the final list:
df = pd.DataFrame(final,columns = ['start','end'])
print(df)
start end
0 1990-10-01 1990-10-03
1 1990-10-05 NaT
2 2002-10-05 2002-10-06
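And the Excel output mentioned in the question is then just a call to to_excel on that dataframe (assuming an engine such as openpyxl or xlsxwriter is installed):
df.to_excel('date_ranges.xlsx', index=False)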

Indexing datetime column in pandas

I imported a csv file in python. Then, I changed the first column to datetime format.
datetime Bid32 Ask32
2019-01-01 22:06:11.699 1.14587 1.14727
2019-01-01 22:06:12.634 1.14567 1.14707
2019-01-01 22:06:13.091 1.14507 1.14647
I saw three ways to index on the first column.
df.index = df.datetime
del df['datetime']
or
df.set_index('datetime', inplace=True)
and
df.set_index(pd.DatetimeIndex(df['datetime']), inplace=True)
My question is about the second and third ways. Why do some sources use pd.DatetimeIndex() together with df.set_index() (like the third snippet) when the second one seems to be enough?
In case you are not changing the 'datetime' column with to_datetime():
df = pd.DataFrame(columns=['datetime', 'Bid32', 'Ask32'])
df.loc[0] = ['2019-01-01 22:06:11.699', '1.14587', '1.14727']
df.set_index('datetime', inplace=True) # option 2
print(type(df.index))
Result:
pandas.core.indexes.base.Index
vs.
df = pd.DataFrame(columns=['datetime', 'Bid32', 'Ask32'])
df.loc[0] = ['2019-01-01 22:06:11.699', '1.14587', '1.14727']
df.set_index(pd.DatetimeIndex(df['datetime']), inplace=True) # option 3
print(type(df.index))
Result:
pandas.core.indexes.datetimes.DatetimeIndex
So the third one with pd.DatetimeIndex() makes it an actual datetime index, which is what you want.
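Conversely, if you do convert the column with pd.to_datetime() first, option 2 on its own is enough; a small sketch to illustrate:
df = pd.DataFrame(columns=['datetime', 'Bid32', 'Ask32'])
df.loc[0] = ['2019-01-01 22:06:11.699', '1.14587', '1.14727']
df['datetime'] = pd.to_datetime(df['datetime'])
df.set_index('datetime', inplace=True)  # option 2, after to_datetime()
print(type(df.index))
# pandas.core.indexes.datetimes.DatetimeIndex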
Documentation:
pandas.Index
pandas.DatetimeIndex

Pandas Groupby Agg Function Does Not Reduce

I am using an aggregation function that I have used in my work for a long time now. The idea is that if the Series passed to the function is of length 1 (i.e. the group has only one observation) then that observation is returned. If the length of the Series passed is greater than one, then the observations are returned in a list.
This may seem odd to some, but this is not an X-Y problem; I have a good reason for wanting to do this that is not relevant to this question.
This is the function that I have been using:
def MakeList(x):
    """This function is used to aggregate data that needs to be kept distinct within multi-day
    observations for later use and transformation. It makes a list of the data and if the list is of length 1
    then there is only one line/day observation in that group so the single element of the list is returned.
    If the list is longer than one then there are multiple line/day observations and the list itself is
    returned."""
    L = x.tolist()
    if len(L) > 1:
        return L
    else:
        return L[0]
Now for some reason, with the current data set I am working on I get a ValueError stating that the function does not reduce. Here is some test data and the remaining steps I am using:
import pandas as pd

DF = pd.DataFrame({'date': ['2013-04-02'] * 10,
                   'line_code': ['401101', '401101', '401102', '401103', '401104',
                                 '401105', '401105', '401106', '401106', '401107'],
                   's.m.v.': [7.760, 25.564, 25.564, 9.550, 4.870,
                              7.760, 25.564, 5.282, 25.564, 5.282]})
DFGrouped = DF.groupby(['date', 'line_code'], as_index = False)
DF_Agg = DFGrouped.agg({'s.m.v.' : MakeList})
In trying to debug this, I put print statements to the effect of print L and print x.index into the function, and the output was as follows:
[7.7599999999999998, 25.564]
Int64Index([0, 1], dtype='int64')
[7.7599999999999998, 25.564]
Int64Index([0, 1], dtype='int64')
For some reason it appears that agg is passing the Series twice to the function. This as far as I know is not normal at all, and is presumably the reason why my function is not reducing.
For example if I write a function like this:
def test_func(x):
    print x.index
    return x.iloc[0]
This runs without problem and the print statements are:
DF_Agg = DFGrouped.agg({'s.m.v.' : test_func})
Int64Index([0, 1], dtype='int64')
Int64Index([2], dtype='int64')
Int64Index([3], dtype='int64')
Int64Index([4], dtype='int64')
Int64Index([5, 6], dtype='int64')
Int64Index([7, 8], dtype='int64')
Int64Index([9], dtype='int64')
Which indicates that each group is only being passed once as a Series to the function.
Can anyone help me understand why this is failing? I have used this function with success in many many data sets I work with....
Thanks
I can't really explain why, but in my experience lists in a pandas.DataFrame don't work all that well.
I usually use tuple instead.
That will work:
def MakeList(x):
    T = tuple(x)
    if len(T) > 1:
        return T
    else:
        return T[0]
DF_Agg = DFGrouped.agg({'s.m.v.' : MakeList})
date line_code s.m.v.
0 2013-04-02 401101 (7.76, 25.564)
1 2013-04-02 401102 25.564
2 2013-04-02 401103 9.55
3 2013-04-02 401104 4.87
4 2013-04-02 401105 (7.76, 25.564)
5 2013-04-02 401106 (5.282, 25.564)
6 2013-04-02 401107 5.282
This is a misfeature in pandas' groupby aggregation. If the aggregator returns a list for the first group, it will fail with the error you mention; if it returns a non-list (non-Series) for the first group, it will work fine. The broken code is in groupby.py:
def _aggregate_series_pure_python(self, obj, func):
    group_index, _, ngroups = self.group_info
    counts = np.zeros(ngroups, dtype=int)
    result = None
    splitter = get_splitter(obj, group_index, ngroups, axis=self.axis)

    for label, group in splitter:
        res = func(group)
        if result is None:
            if (isinstance(res, (Series, Index, np.ndarray)) or
                    isinstance(res, list)):
                raise ValueError('Function does not reduce')
            result = np.empty(ngroups, dtype='O')

        counts[label] = group.shape[0]
        result[label] = res
Notice the combination of the "if result is None" guard and the "isinstance(res, list)" check: the "does not reduce" test is only applied to the result of the first group.
Your options are:
Fake out groupby().agg(), so it doesn't see a list for the first group, or
Do the aggregation yourself, using code like that above but without the erroneous test.
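For the second option, a minimal sketch (reusing DF and MakeList from the question) is to go through apply() instead of agg(), since apply() doesn't perform the "does not reduce" check:
DF_Agg = (DF.groupby(['date', 'line_code'])['s.m.v.']
            .apply(MakeList)
            .reset_index())
# one row per group, with a scalar or a list in 's.m.v.', e.g.
#          date line_code          s.m.v.
# 0  2013-04-02    401101  [7.76, 25.564]
# 1  2013-04-02    401102          25.564
# ...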

Pass a function with parameters specified for resample() method on a pandas dataframe

I want to pass a function to resample() on a pandas dataframe with certain parameters specified when it is passed (as opposed to defining several separate functions).
This is the function
import itertools
import numpy as np

def spell(X, kind='wet', how='mean', threshold=0.5):
    if kind == 'wet':
        condition = X > threshold
    else:
        condition = X <= threshold
    length = [sum(1 if x == True else np.nan for x in group)
              for key, group in itertools.groupby(condition)]
    if not length:
        res = 0
    elif how == 'mean':
        res = np.mean(length)
    else:
        res = np.max(length)
    return res
here is a dataframe
idx = pd.DatetimeIndex(start='1960-01-01', periods=100, freq='d')
values = np.random.random(100)
df = pd.DataFrame(values, index=idx)
And heres sort of what I want to do with it
df.resample('M', how=spell(kind='dry',how='max',threshold=0.7))
But I get the error TypeError: spell() takes at least 1 argument (3 given). I want to be able to pass this function with these parameters specified except for the input array. Is there a way to do this?
EDIT:
X is the input array that is passed to the function when calling the resample method on a dataframe object like so df.resample('M', how=my_func) for a monthly resampling interval.
If I try df.resample('M', how=spell) I get:
0
1960-01-31 1.875000
1960-02-29 1.500000
1960-03-31 1.888889
1960-04-30 3.000000
which is exactly what I want for the default parameters but I want to be able to specify the input parameters to the function before passing it. This might include storing the definition in another variable but I'm not sure how to do this with the default parameters changed.
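Something like functools.partial (or a lambda) might be what I'm after; a rough, untested sketch:
import functools

spell_dry_max = functools.partial(spell, kind='dry', how='max', threshold=0.7)
df.resample('M', how=spell_dry_max)

# or, on newer pandas where the how= argument has been removed:
# df.resample('M').apply(lambda x: spell(x, kind='dry', how='max', threshold=0.7))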
I think this may be what you're looking for, though it's a little hard to tell. Let me know if this helps. First, the example dataframe:
idx = pd.DatetimeIndex(start='1960-01-01', periods=100, freq='d')
values = np.random.random(100)
df = pd.DataFrame(values, index=idx)
EDIT: I originally had a greater-than instead of a less-than-or-equal-to...
Next, the function:
def spell(df, column='', kind='wet', rule='M', how='mean', threshold=0.5):
    if kind == 'wet':
        df = df[df[column] > threshold]
    else:
        df = df[df[column] <= threshold]
    df = df.resample(rule=rule, how=how)
    return df
So, you would call it by:
spell(df, 0)
To get:
0
1960-01-31 0.721519
1960-02-29 0.754054
1960-03-31 0.746341
1960-04-30 0.654872
You can change around the parameters as well:
spell(df, 0, kind='something else', rule='W', how='max', threshold=0.7)
0
1960-01-03 0.570638
1960-01-10 0.529357
1960-01-17 0.565959
1960-01-24 0.682973
1960-01-31 0.676349
1960-02-07 0.379397
1960-02-14 0.680303
1960-02-21 0.654014
1960-02-28 0.546587
1960-03-06 0.699459
1960-03-13 0.626460
1960-03-20 0.611464
1960-03-27 0.685950
1960-04-03 0.688385
1960-04-10 0.697602

How to resample large dataframe with different functions, using a key?

I have a large time-series set of data with over 200 recorded values (columns). Some values need to be averaged and some need to be summed, and I have a list that determines which is which. I need help figuring out how to feed that list into the how= argument of resample.
Example data:
"Timestamp","TZ","TAO (degF)","RHO (%)","WS (mph)","WD (deg)","RAIN (mm)","OAP (hPa)","INSOL (W/m2)","HAIL (hits/cm2)"......."
2014/04/01 01:01:01.005,n,45.3,88.2,0,0.6,0.339,1.0108,-0.270342,0,68.147808,40.91662,68.15884,40.672356,66.55452,......
2014/04/01 01:02:01.027,n,45.3,88,0,3.4,0.339,1.0108,-0.124948,0,68.216736,40.929836,68.15884,40.656932,66.560072,.......
2014/04/01 01:03:01.050,n,45.3,88,0,1.7,0.34,1.0108,-0.145394,0,68.156064,40.890184,68.103736,40.68332,66.557296,......
The best I can come up with is concatenating the list into strings to pass to how=, but passing those concatenated strings makes resample error out on SeriesGroupBy.
df = pandas.read_csv(parsedatafile, parse_dates=True,
                     date_parser=lambda x: datetime.datetime.strptime(x, '%Y/%m/%d %H:%M:%S.%f'),
                     index_col=0)

while i < len(recordname):
    if recordhow[i] == "Y":
        # parseavgsum[i] = "sum"
        recordhow[i] = str(recordname[i]) + str(": sum")
    else:
        recordhow[i] = str(recordname[i]) + str(": mean")
        # parseavgsum[i] = "mean"
    i += 1

df2 = df.resample('60Min', how=recordhow)
I would pass how a dictionary:
>>> df
WD (deg) RAIN (mm)
Timestamp
2014-04-01 01:01:01.005000 40.916620 68.158840
2014-04-01 01:02:01.027000 40.929836 68.158840
2014-04-01 01:03:01.050000 40.890184 68.103736
[3 rows x 2 columns]
>>> what_to_do = {"WD (deg)": "mean", "RAIN (mm)": "sum"}
>>> df.resample("60Min", how=what_to_do)
RAIN (mm) WD (deg)
Timestamp
2014-04-01 01:00:00 204.421416 40.912213
[1 rows x 2 columns]
I think using a recordhow list like you're doing is a little dangerous, because it's very easy for columns to get shuffled accidentally in which case your means and sums would be off. It's much safer to work with column names. But if you have recordhow, you could do something like:
>>> recordhow = ["N", "Y"]
>>> how_map = {"Y": "sum", "N": "mean"}
>>> what_to_do = dict(zip(df.columns, [how_map[x] for x in recordhow]))
>>> what_to_do
{'RAIN (mm)': 'sum', 'WD (deg)': 'mean'}
but again, I recommend moving away from a bare list that doesn't know what maps to what as quickly as possible.
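For example, a small sketch along those lines (the column names to sum are just assumed here for illustration) keys the mapping by name; on newer pandas the same dict is passed to .agg() instead of how=:
sum_cols = {"RAIN (mm)", "HAIL (hits/cm2)"}   # columns to sum (assumed)
what_to_do = {col: ("sum" if col in sum_cols else "mean")
              for col in df.columns if col != "TZ"}

df2 = df.resample("60Min").agg(what_to_do)    # same as how=what_to_do on older pandas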
