I have been working for quite some time with Python and pandas to analyse a set of hourly data, and I find it quite nice (coming from Matlab).
Now I am kind of stuck. I created my DataFrame like this:
SamplingRateMinutes=60
index = DateRange(initialTime,finalTime, offset=datetools.Minute(SamplingRateMinutes))
ts=DataFrame(data, index=index)
What I want to do now is to select, for all days, the data for the hours 10 to 13 and 20 to 23, and use it for further calculations.
So far I sliced the data using
selectedData=ts[begin:end]
I could certainly write some kind of dirty loop to select the data I need, but there must be a more elegant way to index exactly what I want. I am sure this is a common problem, and in pseudocode the solution should look something like this:
myIndex=ts.index[10<=ts.index.hour<=13 or 20<=ts.index.hour<=23]
selectedData=ts[myIndex]
For the record, I am an engineer and not a programmer :) ... yet
In upcoming pandas 0.8.0, you'll be able to write
hour = ts.index.hour
selector = ((10 <= hour) & (hour <= 13)) | ((20 <= hour) & (hour <= 23))
data = ts[selector]
Here's an example that does what you want:
In [32]: from datetime import datetime as dt
In [33]: dr = p.DateRange(dt(2009,1,1),dt(2010,12,31), offset=p.datetools.Hour())
In [34]: hr = dr.map(lambda x: x.hour)
In [35]: dt = p.DataFrame(rand(len(dr),2), dr)
In [36]: dt
Out[36]:
<class 'pandas.core.frame.DataFrame'>
DateRange: 17497 entries, 2009-01-01 00:00:00 to 2010-12-31 00:00:00
offset: <1 Hour>
Data columns:
0 17497 non-null values
1 17497 non-null values
dtypes: float64(2)
In [37]: dt[(hr >= 10) & (hr <=16)]
Out[37]:
<class 'pandas.core.frame.DataFrame'>
Index: 5103 entries, 2009-01-01 10:00:00 to 2010-12-30 16:00:00
Data columns:
0 5103 non-null values
1 5103 non-null values
dtypes: float64(2)
Since it looks messy in my comment above, I decided to provide another answer: a syntax update of Marc's answer for pandas 0.10.0, combined with Wes' hint:
import pandas as pd
from datetime import datetime
from numpy.random import rand
dr = pd.date_range(datetime(2009, 1, 1), datetime(2010, 12, 31), freq='H')
dt = pd.DataFrame(rand(len(dr), 2), dr)
hour = dt.index.hour
selector = ((10 <= hour) & (hour <= 13)) | ((20 <= hour) & (hour <= 23))
data = dt[selector]
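On recent pandas versions an equivalent selector can also be built with Index.isin; a minimal sketch, assuming the same dt frame as above (and that .hour returns an Index, which it does on modern pandas):
# hours 10-13 and 20-23 in one expression
selector = dt.index.hour.isin(list(range(10, 14)) + list(range(20, 24)))
data = dt[selector]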
Pandas DataFrame has a built-in function
pandas.DataFrame.between_time
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randn(1000, 2),
                  index=pd.date_range(start='2017-01-01', freq='10min', periods=1000))
Create two data frames, one for each period of time:
df1 = df.between_time(start_time='10:00', end_time='13:00')
df2 = df.between_time(start_time='20:00', end_time='23:00')
The data frame you want is df1 and df2 merged and sorted:
pd.concat([df1, df2], axis=0).sort_index()
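Note that between_time works on the index, so it needs a DatetimeIndex; if the timestamps live in an ordinary column, one option (a sketch, assuming a hypothetical column named 'timestamp') is to set it as the index first:
df = df.set_index('timestamp')
df1 = df.between_time('10:00', '13:00')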
Related
I have to create a function that gives me the count of days between two dates, excluding weekends and the holidays that are stored in a dataframe.
My holidays df looks like this:
Data
0 2001-01-01
1 2001-02-26
2 2001-02-27
3 2001-04-13
4 2001-04-21
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 936 entries, 0 to 935
Data columns (total 1 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   Data    936 non-null    datetime64[ns]
dtypes: datetime64[ns](1)
memory usage: 7.4 KB
So it should look like:
def delta_days(date_initial, date_end, holidays):
    ...
What would be the best way?
Here you go:
import pandas as pd
from datetime import datetime
def delta_days(date_initial, date_end, holidays):
    # parse the ISO date strings
    date_initial = datetime.strptime(date_initial, '%Y-%m-%d')
    date_end = datetime.strptime(date_end, '%Y-%m-%d')
    # custom business-day range that skips weekends and the given holidays
    work_days = pd.bdate_range(start=date_initial, end=date_end, holidays=holidays, freq='C')
    return len(work_days)
Testing the code:
holidays = ['2021-01-01','2021-04-04','2021-04-21','2021-05-01','2021-09-07','2021-10-12','2021-11-02','2021-11-15','2021-12-25']
delta_days('2021-01-01','2021-12-31',holidays=holidays)
Output:
255
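If you only need the count, numpy's busday_count is another option; note that it counts over the half-open interval [start, end), so the end date itself is excluded, unlike the bdate_range version above (a sketch, not a drop-in replacement):
import numpy as np
# Mon-Fri days in [start, end), skipping the given holidays
np.busday_count('2021-01-01', '2021-12-31', holidays=holidays)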
Now, you can go one step further and automate the construction of the holidays list:
from workalendar.america import Brazil
cal = Brazil()
datetime_feriados = pd.to_datetime([d[0] for d in cal.holidays(2021)])
lista_feriados = [d.strftime('%Y-%m-%d') for d in datetime_feriados]
Output:
lista_feriados
['2021-01-01','2021-04-04','2021-04-21','2021-05-01','2021-09-07','2021-10-12','2021-11-02','2021-11-15','2021-12-25']
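The generated list can then be passed straight into the function above (assuming workalendar is installed):
delta_days('2021-01-01', '2021-12-31', holidays=lista_feriados)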
I have a Dataframe, df, with the following column:
df['ArrivalDate'] =
...
936 2012-12-31
938 2012-12-29
965 2012-12-31
966 2012-12-31
967 2012-12-31
968 2012-12-31
969 2012-12-31
970 2012-12-29
971 2012-12-31
972 2012-12-29
973 2012-12-29
...
The elements of the column are pandas.tslib.Timestamp.
I want to just include the year and month. I thought there would be a simple way to do it, but I can't figure it out.
Here's what I've tried:
df['ArrivalDate'].resample('M', how = 'mean')
I got the following error:
Only valid with DatetimeIndex or PeriodIndex
Then I tried:
df['ArrivalDate'].apply(lambda(x):x[:-2])
I got the following error:
'Timestamp' object has no attribute '__getitem__'
Any suggestions?
Edit: I sort of figured it out.
df.index = df['ArrivalDate']
Then, I can resample another column using the index.
But I'd still like a method for reconfiguring the entire column. Any ideas?
If you want new columns showing year and month separately you can do this:
df['year'] = pd.DatetimeIndex(df['ArrivalDate']).year
df['month'] = pd.DatetimeIndex(df['ArrivalDate']).month
or...
df['year'] = df['ArrivalDate'].dt.year
df['month'] = df['ArrivalDate'].dt.month
Then you can combine them or work with them just as they are.
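If you later want them back as a single label, a small sketch that combines the two columns created above into a 'YYYY-MM' string (the 'year_month' name is just an example):
df['year_month'] = df['year'].astype(str) + '-' + df['month'].astype(str).str.zfill(2)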
df['date_column'] has to be in datetime format.
df['month_year'] = df['date_column'].dt.to_period('M')
You could also use D for day, 2M for two months, etc. for different sampling intervals; and if you have time-series data with a timestamp, you can go for more granular sampling intervals, such as 45Min for 45-minute or 15Min for 15-minute sampling.
You can directly access the year and month attributes, or request a datetime.datetime:
In [15]: t = pandas.tslib.Timestamp.now()
In [16]: t
Out[16]: Timestamp('2014-08-05 14:49:39.643701', tz=None)
In [17]: t.to_pydatetime() #datetime method is deprecated
Out[17]: datetime.datetime(2014, 8, 5, 14, 49, 39, 643701)
In [18]: t.day
Out[18]: 5
In [19]: t.month
Out[19]: 8
In [20]: t.year
Out[20]: 2014
One way to combine year and month is to make an integer encoding them, such as: 201408 for August, 2014. Along a whole column, you could do this as:
df['YearMonth'] = df['ArrivalDate'].map(lambda x: 100*x.year + x.month)
or many variants thereof.
I'm not a big fan of doing this, though, since it makes date alignment and arithmetic painful later and especially painful for others who come upon your code or data without this same convention. A better way is to choose a day-of-month convention, such as final non-US-holiday weekday, or first day, etc., and leave the data in a date/time format with the chosen date convention.
The calendar module is useful for obtaining the number value of certain days such as the final weekday. Then you could do something like:
import calendar
import datetime
df['AdjustedDateToEndOfMonth'] = df['ArrivalDate'].map(
    lambda x: datetime.datetime(
        x.year,
        x.month,
        # last weekday (Mon-Fri) of the month
        max(calendar.monthcalendar(x.year, x.month)[-1][:5])
    )
)
If you are looking to solve the simpler problem of just formatting the datetime column into some string representation, you can make use of the strftime function from the datetime.datetime class, like this:
In [5]: df
Out[5]:
date_time
0 2014-10-17 22:00:03
In [6]: df.date_time
Out[6]:
0 2014-10-17 22:00:03
Name: date_time, dtype: datetime64[ns]
In [7]: df.date_time.map(lambda x: x.strftime('%Y-%m-%d'))
Out[7]:
0 2014-10-17
Name: date_time, dtype: object
If you want the unique month-year pair, using apply is pretty sleek.
df['mnth_yr'] = df['date_column'].apply(lambda x: x.strftime('%B-%Y'))
Outputs month-year in one column.
Don't forget to change the column to datetime format first; I generally forget.
df['date_column'] = pd.to_datetime(df['date_column'])
SINGLE LINE: adding a column with 'year-month' pairs:
(pd.to_datetime first changes the column dtype to datetime before the operation)
df['yyyy-mm'] = pd.to_datetime(df['ArrivalDate']).dt.strftime('%Y-%m')
Accordingly for an extra 'year' or 'month' column:
df['yyyy'] = pd.to_datetime(df['ArrivalDate']).dt.strftime('%Y')
df['mm'] = pd.to_datetime(df['ArrivalDate']).dt.strftime('%m')
Extracting the year, say, from ['2018-03-04']:
df['Year'] = pd.DatetimeIndex(df['date']).year
df['Year'] creates a new column. If you want to extract the month instead, just use .month.
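For example (same pattern, assuming the same 'date' column):
df['Month'] = pd.DatetimeIndex(df['date']).month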
You can first convert your date strings with pandas.to_datetime, which gives you access to all of the numpy datetime and timedelta facilities. For example:
df['ArrivalDate'] = pandas.to_datetime(df['ArrivalDate'])
df['Month'] = df['ArrivalDate'].values.astype('datetime64[M]')
@KieranPC's solution is the correct approach for pandas, but it is not easily extensible to arbitrary attributes. For this, you can use getattr within a generator expression and combine the results using pd.concat:
# input data
list_of_dates = ['2012-12-31', '2012-12-29', '2012-12-30']
df = pd.DataFrame({'ArrivalDate': pd.to_datetime(list_of_dates)})
# define list of attributes required
L = ['year', 'month', 'day', 'dayofweek', 'dayofyear', 'weekofyear', 'quarter']
# define generator expression of series, one for each attribute
date_gen = (getattr(df['ArrivalDate'].dt, i).rename(i) for i in L)
# concatenate results and join to original dataframe
df = df.join(pd.concat(date_gen, axis=1))
print(df)
ArrivalDate year month day dayofweek dayofyear weekofyear quarter
0 2012-12-31 2012 12 31 0 366 1 4
1 2012-12-29 2012 12 29 5 364 52 4
2 2012-12-30 2012 12 30 6 365 52 4
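An equivalent alternative on the same data (just a sketch) is DataFrame.assign with a dictionary comprehension, which avoids the explicit concat/join:
# same columns as above, built via assign
df = df.assign(**{attr: getattr(df['ArrivalDate'].dt, attr) for attr in L})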
Thanks to jaknap32, I wanted to aggregate the results according to Year and Month, so this worked:
df_join['YearMonth'] = df_join['timestamp'].apply(lambda x:x.strftime('%Y%m'))
Output was neat:
0 201108
1 201108
2 201108
There are two steps to extract the year for the whole dataframe without using apply.
Step 1
Convert the column to datetime:
df['ArrivalDate']=pd.to_datetime(df['ArrivalDate'], format='%Y-%m-%d')
Step 2
Extract the year or the month using the DatetimeIndex() method:
pd.DatetimeIndex(df['ArrivalDate']).year
df['Month_Year'] = df['Date'].dt.to_period('M')
Result :
Date Month_Year
0 2020-01-01 2020-01
1 2020-01-02 2020-01
2 2020-01-03 2020-01
3 2020-01-04 2020-01
4 2020-01-05 2020-01
df['year_month']=df.datetime_column.apply(lambda x: str(x)[:7])
This worked fine for me. I didn't think pandas would interpret the resulting string as a date, but when I did the plot it knew very well what I wanted, and the year_month strings were ordered properly... gotta love pandas!
Then I tried:
df['ArrivalDate'].apply(lambda(x):x[:-2])
I think the proper input here should be a string:
df['ArrivalDate'].astype(str).apply(lambda x: x[:-2])
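Note that [:-2] still leaves the trailing hyphen ('2012-12-'); if the goal is a clean 'YYYY-MM', a shorter variant (a sketch using the .str accessor) is:
df['year_month'] = df['ArrivalDate'].astype(str).str[:7]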
I'm using/learning Pandas to load a csv style dataset where I have a time column that can be used as index. The data is sampled roughly at 100Hz. Here is a simplified snippet of the data:
Time (sec) Col_A Col_B Col_C
0.0100 14.175 -29.97 -22.68
0.0200 13.905 -29.835 -22.68
0.0300 12.257 -29.32 -22.67
... ...
1259.98 -0.405 2.205 3.825
1259.99 -0.495 2.115 3.735
There are 20 min of data, resulting in about 120,000 rows at 100 Hz. My goal is to select those rows within a certain time range, say 100-200 sec.
Here is what I've figured out:
import pandas as pd
df = pd.DataFrame(my_data) # my_data is a numpy array
df.set_index(0, inplace=True)
df.columns = ['Col_A', 'Col_B', 'Col_C']
df.index = pd.to_datetime(df.index, unit='s', origin='1900-1-1') # the date in origin is just a placeholder
My dataset doesn't include the date. How to avoid setting a fake date like I did above? It feels wrong, and also is quite annoying when I plot the data against time.
I know there are ways to remove the date from the datetime object, like here.
But my goal is to select some rows that are in a certain time range, which means I need to use pd.date_range(). This function does not seem to work without date.
It's not the end of the world if I just use a fake date throughout my project. But I'd like to know if there are more elegant ways around it.
I don't see why you need to use datetime64 objects for this. Your time column is a number, so you can very easily select time intervals with inequalities. You can also plot the columns without issue.
Sample Data
import pandas as pd
import numpy as np
np.random.seed(123)
df = pd.DataFrame({'Time': np.arange(0, 1200, 0.01),
                   'Col_A': np.random.randint(1, 100, 120000),
                   'Col_B': np.random.randint(1, 10, 120000)})
Select Data between 100 and 200 seconds.
df[df.Time.between(100,200)]
Outputs:
Time Col_A Col_B
10000 100.00 75 9
10001 100.01 23 7
...
19999 199.99 39 7
20000 200.00 25 2
Plotting against time
#First 100 rows just for illustration
df[0:100].plot(x='Time')
Convert to timedelta64
If you really wanted to, you could convert the column to a timedelta64[ns]
df['Time'] = pd.to_datetime(df.Time, unit='s') - pd.to_datetime('1970-01-01')
print(df.head())
# Time Col_A Col_B
#0 00:00:00 67 6
#1 00:00:00.010000 93 1
#2 00:00:00.020000 99 3
#3 00:00:00.030000 18 2
#4 00:00:00.040000 84 3
df.dtypes
#Time timedelta64[ns]
#Col_A int32
#Col_B int32
#dtype: object
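With the column stored as timedelta64, the 100-200 second selection can then be written against pd.Timedelta values (a small sketch on the frame above):
mask = (df.Time >= pd.Timedelta(seconds=100)) & (df.Time <= pd.Timedelta(seconds=200))
df[mask]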
I have a pandas dataframe df which has one column of datetime64 values, e.g.
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1471 entries, 0 to 2940
Data columns (total 2 columns):
date 1471 non-null values
id 1471 non-null values
dtypes: datetime64[ns](1), int64(1)
I would like to sub-sample df using the hour of the day as the criterion (independently of the other information in date). E.g., in pseudocode:
df_sub = df[ (HOUR(df.date) > 8) & (HOUR(df.date) < 20) ]
for some function HOUR.
I guess the problem can be solved via a preliminary conversion from datetime64 to datetime. Can this be handled more efficiently?
Found a simple solution.
df['hour'] = df.date.apply(lambda x : x.hour)
df_sub = df[(df.hour > 8) & (df.hour < 20)]
EDIT:
There is a property dt specifically introduced to handle this problem. The query becomes:
df_sub = df[ (df.date.dt.hour > 8)
& (df.date.dt.hour < 20) ]
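If the datetime column is made the index, DataFrame.between_time (used in an earlier answer) covers the same case; note that it is inclusive of both endpoints by default, unlike the strict comparisons above (just a sketch):
df_sub = df.set_index('date').between_time('08:00', '20:00')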