Pandas DataFrame casting to timedelta fails with loc - python

I've got a little bit of a weird situation, and I don't understand why it works in one situation and not the other.
I'm trying to cast a column under a column MultiIndex from timedelta64[ns] to timedelta64[s], and I also have a MultiIndex on the rows.
If tuple is the column I want, i.e. (level_0, level_1):
it works with df[tuple] = df[tuple].astype('timedelta64[s]')
it doesn't work with df.loc[:, tuple] = df.loc[:, tuple].astype('timedelta64[s]')
Here is some sample data (csv):
Level_0,,,Respondent,Respondent,Respondent,OtherCat,OtherCat
Level_1,,,Something,StartDate,EndDate,Yes/No,SomethingElse
Region,Site,RespondentID,,,,,
Region_1,Site_1,3987227376,A,5/25/2015 10:59,5/25/2015 11:22,Yes,
Region_1,Site_1,3980680971,A,5/21/2015 9:40,5/21/2015 9:52,Yes,Yes
Region_1,Site_2,3977723249,A,5/20/2015 8:27,5/20/2015 8:41,Yes,
Region_1,Site_2,3977723089,A,5/20/2015 8:33,5/20/2015 9:09,Yes,No
Load it with:
In [1]: df = pd.read_csv('sample.csv', header=[0,1], index_col=[0,1,2])  # 'sample.csv' holds the csv above
df
Out[1]:
I want to create a column "Duration" (and then one called "DurationMinutes" dividing Duration by 60).
I start by casting the dates to datetime:
In [2]:
df.loc[:,('Respondent','StartDate')] = pd.to_datetime(df.loc[:,('Respondent','StartDate')])
df.loc[:,('Respondent','EndDate')] = pd.to_datetime(df.loc[:,('Respondent','EndDate')])
df.loc[:,('Respondent','Duration')] = df.loc[:,('Respondent','EndDate')] - df.loc[:,('Respondent','StartDate')]
This is where I no longer understand what's going on. I want to convert it to timedelta64[s] because I need that.
If I simply display the result of astype('timedelta64[s]'), it works like a charm:
In [3]: df.loc[:,('Respondent','Duration')].astype('timedelta64[s]')
Out[3]:
Region Site RespondentID
Region_1 Site_1 3987227376 1380
3980680971 720
Site_2 3977723249 840
3977723089 2160
Name: (Respondent, Duration), dtype: float64
But if I assign, then show the column, it fails:
In [4]: df.loc[:,('Respondent','Duration')] = df.loc[:,('Respondent','Duration')].astype('timedelta64[s]')
df.loc[:,('Respondent','Duration')]
Out[4]:
Region Site RespondentID
Region_1 Site_1 3987227376 00:00:00.000001
3980680971 00:00:00.000000
Site_2 3977723249 00:00:00.000000
3977723089 00:00:00.000002
Name: (Respondent, Duration), dtype: timedelta64[ns]
Weirdly enough, if I do this, it works:
In [5]: df[('Respondent','Duration')] = df[('Respondent','Duration')].astype('timedelta64[s]')
df.loc[:,('Respondent','Duration')]
Out[5]:
Region Site RespondentID
Region_1 Site_1 3987227376 1380
3980680971 720
Site_2 3977723249 840
3977723089 2160
Name: (Respondent, Duration), dtype: float64
Another strange thing: if I filter for one site and drop the Region so that I end up with a single-level index, it works:
In [6]:
Survey = 'Site_1'
df = df.xs(Survey, level='Site').copy()

# Drop the 'Region' from index
df.index = df.index.droplevel(level='Region')
df.loc[:,('Respondent','StartDate')] = pd.to_datetime(df.loc[:,('Respondent','StartDate')])
df.loc[:,('Respondent','EndDate')] = pd.to_datetime(df.loc[:,('Respondent','EndDate')])
df.loc[:,('Respondent','Duration')] = df.loc[:,('Respondent','EndDate')] - df.loc[:,('Respondent','StartDate')]

# This works fine
df.loc[:,('Respondent','Duration')] = df.loc[:,('Respondent','Duration')].astype('timedelta64[s]')

# Display
df.loc[:,('Respondent','Duration')]
Out[6]:
RespondentID
3987227376 1380
3980680971 720
Name: (Respondent, Duration), dtype: float64
Clearly I'm missing something as to why df.loc[:,tuple] is different than df[tuple].
Can someone shed some light please?
Python 2.7.9, pandas 0.16.2

This was a bug, I just fixed it here, will be in 0.17.0.
The gist is this. When you do something like df.loc[:,column] = value, this is treated exactly the same as df[[column]] = value. This means that type coercion is independent of what the column WAS. Contrast this with df.loc[indexer,column] = value, i.e. partially setting a column; here both the new value AND the existing dtype of the column matter.
The bug was that when the frame has a MultiIndex, even though the indexer was a full index (i.e. it encompassed the full length of values in the frame), it wasn't taking the correct path.
So the bottom line is that these cases should be (and will be) the same.
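Until that fix lands, here is a workaround sketch (assuming the sample frame above; 'sample.csv' is a placeholder for the csv shown in the question). Assigning through df[...] takes the "replace the whole column" path, and storing float seconds via total_seconds() sidesteps the dtype coercion entirely:
import pandas as pd

df = pd.read_csv('sample.csv', header=[0, 1], index_col=[0, 1, 2])  # the csv sample above

start = pd.to_datetime(df[('Respondent', 'StartDate')])
end = pd.to_datetime(df[('Respondent', 'EndDate')])

# Whole-column assignment keeps the dtype produced by astype
df[('Respondent', 'Duration')] = (end - start).astype('timedelta64[s]')

# Or store plain float seconds / minutes and avoid timedelta coercion altogether
df[('Respondent', 'DurationMinutes')] = (end - start).dt.total_seconds() / 60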

Related

Pandas: populating column values with date and time string based on conditions

I have a Pandas dataframe df that looks as follows:
created_time action_time
2021-03-05T07:18:12.281-0600 2021-03-05T08:32:19.153-0600
2021-03-04T15:34:23.373-0600 2021-03-04T15:37:32.360-0600
2021-03-01T04:57:47.848-0600 2021-03-01T08:37:39.083-0600
import pandas as pd
df = pd.DataFrame({'created_time':['2021-03-05T07:18:12.281-0600', '2021-03-04T15:34:23.373-0600', '2021-03-01T04:57:47.848-0600'],
'action_time':['2021-03-05T08:32:19.153-0600', '2021-03-04T15:37:32.360-0600', '2021-03-01T08:37:39.083-0600']})
I then create another column which represents the difference in minutes between these two columns:
df['elapsed_time'] = (pd.to_datetime(df['action_time']) - pd.to_datetime(df['created_time'])).dt.total_seconds() / 60
df['elapsed_time']
elapsed_time
74.114533
3.149783
219.853917
We assume that "action" can only take place during business hours (which we assume to start 8:30am).
I would like to create another column named created_time_adjusted, which adjusts the created_time to 08:30am if the created_time is before 08:30am).
I can parse out the date and time string that I need, as follows:
df['elapsed_time'] = pd.to_datetime(df['created_time']).dt.date.astype(str) + 'T08:30:00.000-0600'
But, this doesn't deal with the conditional.
I'm aware of a few ways that I might be able to do this:
replace
clip
np.where
loc
What is the best (and least hacky) way to accomplish this?
Thanks!
First of all, I think your life would be easier if you convert the columns to datetime dtypes from the get-go. Then it's just a matter of running an apply op on the 'created_time' column.
df.created_time = pd.to_datetime(df.created_time)
df.action_time = pd.to_datetime(df.action_time)
df.elapsed_time = df.action_time-df.created_time
time_threshold = pd.to_datetime('08:30').time()
df['created_time_adjusted'] = df.created_time.apply(
    lambda x: x.replace(hour=8, minute=30, second=0) if x.time() < time_threshold else x)
Output:
>>> df
created_time action_time created_time_adjusted
0 2021-03-05 07:18:12.281000-06:00 2021-03-05 08:32:19.153000-06:00 2021-03-05 08:30:00.281000-06:00
1 2021-03-04 15:34:23.373000-06:00 2021-03-04 15:37:32.360000-06:00 2021-03-04 15:34:23.373000-06:00
2 2021-03-01 04:57:47.848000-06:00 2021-03-01 08:37:39.083000-06:00 2021-03-01 08:30:00.848000-06:00
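A vectorized alternative to the apply, as a sketch (note it also zeroes the sub-second part, unlike replace() above, so the adjusted times land exactly on 08:30:00):
ct = pd.to_datetime(df['created_time'])

# Build each row's own 08:30 floor from its date, then lift anything earlier up to it
floor_830 = ct.dt.normalize() + pd.Timedelta(hours=8, minutes=30)
df['created_time_adjusted'] = ct.where(ct >= floor_830, floor_830)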
from datetime import timedelta

df['created_time'] = pd.to_datetime(df['created_time'])  # Coerce to datetime
df1 = df.set_index(df['created_time']).between_time('00:00:00', '08:30:00', include_end=False)  # Isolate rows earlier than 8:30 into df1
df1['created_time'] = df1['created_time'].dt.normalize() + timedelta(hours=8, minutes=30, seconds=0)  # Adjust time
df2 = df1.append(df.set_index(df['created_time']).between_time('08:30:00', '00:00:00', include_end=False)).reset_index(drop=True)  # Knit the before-8:30 and after-8:30 rows back together
df2

Pandas Timedelta mean returns error "No numeric types to aggregate". Why?

I am trying to perform the following operation:
pd.concat([A,B], axis = 1).groupby("status_reason")["closing_time"].mean()
Where
A is a Series named "closing_time" (Timedelta values)
B is a Series named "status_reason" (categorical values)
Example:
In : A.head(5)
Out:
0 -1 days +11:35:00
1 -10 days +07:13:00
2 NaT
3 NaT
4 NaT
Name: closing_time, dtype: timedelta64[ns]
In : B.head(5)
Out:
0 Won
1 Canceled
2 In Progress
3 In Progress
4 In Progress
Name: status_reason, dtype: object
The following error occurs:
DataError: No numeric types to aggregate
Please note: I tried to compute the mean even after isolating every single category.
Now, I saw a few questions similar to mine online, so I tried this:
pd.to_timedelta(pd.concat([pd.to_numeric(A),B], axis = 1).groupby("status_reason")["closing_time"].mean())
which simply converts the Timedelta to int64 and vice versa. But the result was quite strange (the numbers were far too high).
In order to investigate the situation, I wrote the following code:
xxx = pd.concat([A,B], axis = 1)
xxx.closing_time.mean()
#xxx.groupby("status_reason")["closing_time"].mean()
The second row WORKS FINE, without converting the Timedelta to int64. The third row DOES NOT work, and again returns the DataError.
I'm so confused here! What am I missing?
I would like to see the mean of the "closing times" for each "status reason"!
EDIT
If I try to do this: (Isolate the rows with a specific status without grouping)
yyy = xxx[xxx["status_reason"] == "In Progress"]
yyy["closing_time"].mean()
The result is:
Timedelta('310 days 21:18:05.454545')
But if I do this: (Isolate the rows with a specific status grouping)
yyy = xxx[xxx["status_reason"] == "In Progress"]
yyy.groupby("status_reason")["closing_time"].mean()
The result is again:
DataError: No numeric types to aggregate
Lastly, if I do this: (converting and converting back; let's call this the Special Example)
yyy = xxx[xxx["status_reason"] == "In Progress"]
yyy.closing_time = pd.to_numeric (yyy.closing_time)
pd.to_timedelta(yyy.groupby("status_reason")["closing_time"].mean())
We go back to the first problem I noticed:
status_reason
In Progress -105558 days +10:08:05.605064
Name: closing_time, dtype: timedelta64[ns]
EDIT2
If I do this: (convert to seconds and convert back)
yyy = xxx[xxx["status_reason"] == "In Progress"]
yyy.closing_time = A.dt.seconds
pd.to_timedelta(yyy.groupby("status_reason")["closing_time"].mean(), unit="s" )
The result is
status_reason
In Progress 08:12:38.181818
Name: closing_time, dtype: timedelta64[ns]
The same result happens if I remove the NaNs, or if I fill them with 0:
yyy = xxx[xxx["status_reason"] == "In Progress"].dropna()
yyy.closing_time = A.dt.seconds
pd.to_timedelta(yyy.groupby("status_reason")["closing_time"].mean(), unit="s" )
BUT the numbers are very different from what we saw in the first edit! (Special Example)
-105558 days +10:08:05.605064
Also, let me run the same code (Special Example) with dropna():
310 days 21:18:05.454545
And again, let's run the same code (Special Example) with fillna(0):
3 days 11:14:22.819472
This is going nowhere. I should probably prepare an export of those data, and post them somewhere: Here we go
From reading the discussion of this issue on Github here, you can solve it by specifying numeric_only=False for the mean calculation, as follows:
pd.concat([A,B], axis = 1).groupby("status_reason")["closing_time"] \
.mean(numeric_only=False)
The problem might be that In Progress only has NaT values, which might not be allowed in groupby().mean(). Here's a test:
import numpy as np
import pandas as pd

df = pd.DataFrame({'closing_time': ['11:35:00', '07:13:00', np.nan, np.nan, np.nan],
                   'status_reason': ['Won', 'Canceled', 'In Progress', 'In Progress', 'In Progress']})
df.closing_time = pd.to_timedelta(df.closing_time)
df.groupby('status_reason').closing_time.mean()
gives the exact error. To overcome this, do:
def custom_mean(x):
    try:
        return x.mean()
    except Exception:
        return pd.to_timedelta([np.nan])

df.groupby('status_reason').closing_time.apply(custom_mean)
which gives:
status_reason
Canceled 07:13:00
In Progress NaT
Won 11:35:00
Name: closing_time, dtype: timedelta64[ns]
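As a design note, a slightly simpler sketch of the same fallback returns the scalar pd.NaT instead of a one-element TimedeltaIndex (assuming the same df as above):
def custom_mean(x):
    try:
        return x.mean()
    except Exception:
        # Groups whose mean cannot be computed just come back as missing
        return pd.NaT

df.groupby('status_reason').closing_time.apply(custom_mean)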
I cannot say why groupby's mean() method does not work, but the following slight modification of your code should work: First, convert timedelta column to seconds with total_seconds() method, then groupby and mean, then convert seconds to timedelta again:
pd.to_timedelta(pd.concat([ A.dt.total_seconds(), B], axis = 1).groupby("status_reason")["closing_time"].mean(), unit="s")
For the example dataframe below, the code
df = pd.DataFrame({'closing_time': ['2 days 11:35:00', '07:13:00', np.nan, np.nan, np.nan],
                   'status_reason': ['Won', 'Canceled', 'In Progress', 'In Progress', 'In Progress']})
df.loc[:, "closing_time"] = \
    pd.to_timedelta(df.closing_time).dt.days * 24 * 3600 \
    + pd.to_timedelta(df.closing_time).dt.seconds
# or alternatively use total_seconds() to get total seconds in timedelta as follows
# df.loc[:,"closing_time"] = pd.to_timedelta(df.closing_time).dt.total_seconds()
pd.to_timedelta(df.groupby("status_reason")["closing_time"].mean(), unit="s")
produces
status_reason
Canceled 0 days 07:13:00
In Progress NaT
Won 2 days 11:35:00
Name: closing_time, dtype: timedelta64[ns]
After some investigation, here is what I found:
Most of the confusion comes from the fact that in one case I was calling SeriesGroupBy.mean() and in the other case Series.mean().
These functions are actually different and have different behaviours; I had not realized that.
The second important point is that converting to numeric, or to seconds, leads to totally different behaviour when it comes to handling NaN values: a plain numeric cast appears to keep NaT as a huge negative int64 sentinel (which is likely why the Special Example produced means around -105558 days), while .dt.seconds keeps only the seconds component and silently drops the days.
To overcome this situation, the first thing you have to do is decide how to handle NaN values. The best approach depends on what we want to achieve. In my case, it's fine to have even a simple categorical result, so I can do something like this:
import datetime

def define_time(row):
    if pd.isnull(row["closing_time"]):
        return "Null"
    elif row["closing_time"] < datetime.timedelta(days=100):
        return "<100"
    elif row["closing_time"] > datetime.timedelta(days=100):
        return ">100"

time_results = pd.concat([A, B], axis=1).apply(lambda row: define_time(row), axis=1)
In the end the result is like this:
In :
time_results.value_counts()
Out :
>100 1452
<100 1091
Null 1000
dtype: int64

Calculating a max for every X number of lines, how to take leap year into account?

I am trying to take yearly max rainfall data for multiple years of data within one array. I understand how you would need to use a for loop if I wanted to take the max of a single range, and I saw there was a similar question to the problem I'm having. However, I need to take leap years into account!
So I have 14616 data points covering 1960-1965 (not including 1965), which span 2 leap years: 1960 and 1964. A leap year contains 2928 data points and every other year contains 2920 data points.
My first thought was to modify the solution from the similar question, which involved using a for loop as follows (just a straight copy-paste from theirs):
for i, d in enumerate(data_you_want):
    if (i % 600) == 0:
        avg_for_day = np.mean(data_you_want[i - 600:i])
        daily_averages.append(avg_for_day)
Theirs involved taking the average of every 600 lines in their data. I thought there might be a way to just modify this, but I couldn't figure out how to make it work. If modifying this won't work, is there another way to loop over the data with leap years taken into account, without completely cutting up the file manually?
Fake data:
import numpy as np
fake = np.random.randint(2, 30, size = 14616)
Use pandas to handle the leap year functionality.
Create timestamps for your data with pandas.date_range().
import pandas as pd
index = pd.date_range(start = '1960-1-1 00:00:00', end = '1964-12-31 23:59:59' , freq='3H')
Then create a DataFrame using the timestamps for the index.
df = pd.DataFrame(data = fake, index = index)
Aggregate by year, taking advantage of the DatetimeIndex flexibility.
>>> df['1960'].max()
0 29
dtype: int32
>>> df['1960'].mean()
0 15.501366
dtype: float64
>>>
>>> len(df['1960'])
2928
>>> len(df['1961'])
2920
>>> len(df['1964'])
2928
>>>
I just cobbled this together from the Time Series / Date functionality section of the docs. Given pandas' capabilities this looks a bit naive and can probably be improved upon.
For example, resampling (using the same DataFrame):
>>> df.resample('A').mean()
0
1960-12-31 15.501366
1961-12-31 15.170890
1962-12-31 15.412329
1963-12-31 15.538699
1964-12-31 15.382514
>>> df.resample('A').max()
0
1960-12-31 29
1961-12-31 29
1962-12-31 29
1963-12-31 29
1964-12-31 29
>>>
>>> r = df.resample('A')
>>> r.agg([np.sum, np.mean, np.std])
0
sum mean std
1960-12-31 45388 15.501366 8.211835
1961-12-31 44299 15.170890 8.117072
1962-12-31 45004 15.412329 8.257992
1963-12-31 45373 15.538699 7.986877
1964-12-31 45040 15.382514 8.178057
>>>
Food for thought:
Time-aware Rolling vs. Resampling
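A quick sketch of that distinction on the same df (assuming the 3-hourly DatetimeIndex built above):
# Resampling aggregates into non-overlapping calendar bins: one row per year
yearly_max = df.resample('A').max()

# Time-aware rolling computes the aggregate over a trailing window ending at
# every timestamp: one row per observation
trailing_year_max = df.rolling('365D').max()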

Reformat a column containing dates in Pandas

Python newbie here who's switching from R to Python for statistical modeling and analysis.
I am working with a Pandas data structure and am trying to restructure a column that contains 'date' values. In the data below, you'll notice that some values take the 'Mar-10' format while others take a '12/1/13' format. How can I restructure a column in a Pandas data structure that contains 'dates' (technically not a date structure) so that they are uniform, i.e. all follow the same structure? I'd prefer that they all follow the 'Mar-10' format. Can anyone help?
In [34]: dat["Date"].unique()
Out[34]:
array(['Jan-10', 'Feb-10', 'Mar-10', 'Apr-10', 'May-10', 'Jun-10',
'Jul-10', 'Aug-10', 'Sep-10', 'Oct-10', 'Nov-10', 'Dec-10',
'Jan-11', 'Feb-11', 'Mar-11', 'Apr-11', 'May-11', 'Jun-11',
'Jul-11', 'Aug-11', 'Sep-11', 'Oct-11', 'Nov-11', 'Dec-11',
'Jan-12', 'Feb-12', 'Mar-12', 'Apr-12', 'May-12', 'Jun-12',
'Jul-12', 'Aug-12', 'Sep-12', 'Oct-12', 'Nov-12', 'Dec-12',
'Jan-13', 'Feb-13', 'Mar-13', 'Apr-13', 'May-13', '6/1/13',
'7/1/13', '8/1/13', '9/1/13', '10/1/13', '11/1/13', '12/1/13',
'1/1/14', '2/1/14', '3/1/14', '4/1/14', '5/1/14', '6/1/14',
'7/1/14', '8/1/14'], dtype=object)
In [35]: isinstance(dat["Date"], basestring) # not a string?
Out[35]: False
In [36]: type(dat["Date"]).__name__
Out[36]: 'Series'
I think your dates are already strings, try:
import numpy as np
import pandas as pd
date = pd.Series(np.array(['Jan-10', 'Feb-10', 'Mar-10', 'Apr-10', 'May-10', 'Jun-10',
'Jul-10', 'Aug-10', 'Sep-10', 'Oct-10', 'Nov-10', 'Dec-10',
'Jan-11', 'Feb-11', 'Mar-11', 'Apr-11', 'May-11', 'Jun-11',
'Jul-11', 'Aug-11', 'Sep-11', 'Oct-11', 'Nov-11', 'Dec-11',
'Jan-12', 'Feb-12', 'Mar-12', 'Apr-12', 'May-12', 'Jun-12',
'Jul-12', 'Aug-12', 'Sep-12', 'Oct-12', 'Nov-12', 'Dec-12',
'Jan-13', 'Feb-13', 'Mar-13', 'Apr-13', 'May-13', '6/1/13',
'7/1/13', '8/1/13', '9/1/13', '10/1/13', '11/1/13', '12/1/13',
'1/1/14', '2/1/14', '3/1/14', '4/1/14', '5/1/14', '6/1/14',
'7/1/14', '8/1/14'], dtype=object))
date.map(type).value_counts()
# date contains 56 strings
# <type 'str'>    56
# dtype: int64
This shows the type of each individual element, rather than the type of the column they're contained in.
Your best bet for dealing sensibly with them is to convert them into pandas DateTime objects:
pd.to_datetime(date)
Out[18]:
0 2014-01-10
1 2014-02-10
2 2014-03-10
3 2014-04-10
4 2014-05-10
5 2014-06-10
6 2014-07-10
7 2014-08-10
8 2014-09-10
...
You may have to play around with the formats somewhat, e.g. creating two separate arrays for each format and then merging them back together:
# Convert the Aug-10 style strings
pd.to_datetime(date, format='%b-%y', coerce=True)
# Convert the 9/1/13 style strings
pd.to_datetime(date, format='%m/%d/%y', coerce=True)
I can never remember these time formatting codes off the top of my head but there's a good rundown of them here.
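A minimal sketch of that two-pass merge (assuming a recent pandas, where the keyword is errors='coerce' rather than the older coerce=True used above):
# Parse each format separately; entries that don't match become NaT
monthly = pd.to_datetime(date, format='%b-%y', errors='coerce')    # 'Mar-10' style
daily = pd.to_datetime(date, format='%m/%d/%y', errors='coerce')   # '12/1/13' style

# Take the monthly parse where it succeeded, falling back to the daily parse
parsed = monthly.fillna(daily)

# Optionally render everything back in the 'Mar-10' presentation
uniform = parsed.dt.strftime('%b-%y')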

Converting a column of strings to numbers in Pandas

How do I convert the Units column to numeric?
I have a Google spreadsheet that I am reading in; the date column gets converted fine, but I'm not having much luck getting the Unit Sales column to convert to numeric. I'm including all the code, which uses requests to get the data:
from StringIO import StringIO
import pandas as pd
import requests

act = requests.get('https://docs.google.com/spreadsheet/ccc?key=0Ak_wF7ZGeMmHdFZtQjI1a1hhUWR2UExCa2E4MFhiWWc&output=csv&gid=1')
dataact = act.content
actdf = pd.read_csv(StringIO(dataact), index_col=0, parse_dates=['date'])
actdf.rename(columns={'Unit Sales': 'Units'}, inplace=True)  # in case the space in the name is messing me up
The different methods I have tried to convert Units to numeric:
actdf=actdf['Units'].convert_objects(convert_numeric=True)
#actdf=actdf['Units'].astype('float32')
Then I want to resample, and I'm getting strange string concatenations since the numbers are still strings:
#actdfq=actdf.resample('Q',sum)
#actdfq.head()
actdf.head()
#actdf
So the df looks like this, with just Units and the date index:
date
2013-09-01 3,533
2013-08-01 4,226
2013-07-01 4,281
Name: Units, Length: 161, dtype: object
You have to specify the thousands separator:
actdf = pd.read_csv(StringIO(dataact), index_col=0, parse_dates=['date'], thousands=',')
This will work
In [13]: s
Out[13]:
0 4,223
1 3,123
dtype: object
In [14]: s.str.replace(',','').convert_objects(convert_numeric=True)
Out[14]:
0 4223
1 3123
dtype: int64
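A note for newer pandas versions (an assumption about your environment): convert_objects has since been removed, and pd.to_numeric is the replacement:
# Strip the thousands separator, then coerce to numbers (bad values become NaN)
actdf['Units'] = pd.to_numeric(actdf['Units'].str.replace(',', ''), errors='coerce')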
