Extract Day, Month and Hour from Timestamp string in Python [duplicate] - python

I have a Dataframe, df, with the following column:
df['ArrivalDate'] =
...
936 2012-12-31
938 2012-12-29
965 2012-12-31
966 2012-12-31
967 2012-12-31
968 2012-12-31
969 2012-12-31
970 2012-12-29
971 2012-12-31
972 2012-12-29
973 2012-12-29
...
The elements of the column are pandas.tslib.Timestamp.
I want to just include the year and month. I thought there would be simple way to do it, but I can't figure it out.
Here's what I've tried:
df['ArrivalDate'].resample('M', how = 'mean')
I got the following error:
Only valid with DatetimeIndex or PeriodIndex
Then I tried:
df['ArrivalDate'].apply(lambda(x):x[:-2])
I got the following error:
'Timestamp' object has no attribute '__getitem__'
Any suggestions?
Edit: I sort of figured it out.
df.index = df['ArrivalDate']
Then, I can resample another column using the index.
But I'd still like a method for reconfiguring the entire column. Any ideas?

If you want new columns showing year and month separately you can do this:
df['year'] = pd.DatetimeIndex(df['ArrivalDate']).year
df['month'] = pd.DatetimeIndex(df['ArrivalDate']).month
or...
df['year'] = df['ArrivalDate'].dt.year
df['month'] = df['ArrivalDate'].dt.month
Then you can combine them or work with them just as they are.

The df['date_column'] has to be in date time format.
df['month_year'] = df['date_column'].dt.to_period('M')
You could also use D for Day, 2M for 2 Months etc. for different sampling intervals, and in case one has time series data with time stamp, we can go for granular sampling intervals such as 45Min for 45 min, 15Min for 15 min sampling etc.

You can directly access the year and month attributes, or request a datetime.datetime:
In [15]: t = pandas.tslib.Timestamp.now()
In [16]: t
Out[16]: Timestamp('2014-08-05 14:49:39.643701', tz=None)
In [17]: t.to_pydatetime() #datetime method is deprecated
Out[17]: datetime.datetime(2014, 8, 5, 14, 49, 39, 643701)
In [18]: t.day
Out[18]: 5
In [19]: t.month
Out[19]: 8
In [20]: t.year
Out[20]: 2014
One way to combine year and month is to make an integer encoding them, such as: 201408 for August, 2014. Along a whole column, you could do this as:
df['YearMonth'] = df['ArrivalDate'].map(lambda x: 100*x.year + x.month)
or many variants thereof.
I'm not a big fan of doing this, though, since it makes date alignment and arithmetic painful later and especially painful for others who come upon your code or data without this same convention. A better way is to choose a day-of-month convention, such as final non-US-holiday weekday, or first day, etc., and leave the data in a date/time format with the chosen date convention.
The calendar module is useful for obtaining the number value of certain days such as the final weekday. Then you could do something like:
import calendar
import datetime
df['AdjustedDateToEndOfMonth'] = df['ArrivalDate'].map(
lambda x: datetime.datetime(
x.year,
x.month,
max(calendar.monthcalendar(x.year, x.month)[-1][:5])
)
)
If you happen to be looking for a way to solve the simpler problem of just formatting the datetime column into some stringified representation, for that you can just make use of the strftime function from the datetime.datetime class, like this:
In [5]: df
Out[5]:
date_time
0 2014-10-17 22:00:03
In [6]: df.date_time
Out[6]:
0 2014-10-17 22:00:03
Name: date_time, dtype: datetime64[ns]
In [7]: df.date_time.map(lambda x: x.strftime('%Y-%m-%d'))
Out[7]:
0 2014-10-17
Name: date_time, dtype: object

If you want the month year unique pair, using apply is pretty sleek.
df['mnth_yr'] = df['date_column'].apply(lambda x: x.strftime('%B-%Y'))
Outputs month-year in one column.
Don't forget to first change the format to date-time before, I generally forget.
df['date_column'] = pd.to_datetime(df['date_column'])

SINGLE LINE: Adding a column with 'year-month'-paires:
('pd.to_datetime' first changes the column dtype to date-time before the operation)
df['yyyy-mm'] = pd.to_datetime(df['ArrivalDate']).dt.strftime('%Y-%m')
Accordingly for an extra 'year' or 'month' column:
df['yyyy'] = pd.to_datetime(df['ArrivalDate']).dt.strftime('%Y')
df['mm'] = pd.to_datetime(df['ArrivalDate']).dt.strftime('%m')

Extracting the Year say from ['2018-03-04']
df['Year'] = pd.DatetimeIndex(df['date']).year
The df['Year'] creates a new column. While if you want to extract the month just use .month

You can first convert your date strings with pandas.to_datetime, which gives you access to all of the numpy datetime and timedelta facilities. For example:
df['ArrivalDate'] = pandas.to_datetime(df['ArrivalDate'])
df['Month'] = df['ArrivalDate'].values.astype('datetime64[M]')

#KieranPC's solution is the correct approach for Pandas, but is not easily extendible for arbitrary attributes. For this, you can use getattr within a generator comprehension and combine using pd.concat:
# input data
list_of_dates = ['2012-12-31', '2012-12-29', '2012-12-30']
df = pd.DataFrame({'ArrivalDate': pd.to_datetime(list_of_dates)})
# define list of attributes required
L = ['year', 'month', 'day', 'dayofweek', 'dayofyear', 'weekofyear', 'quarter']
# define generator expression of series, one for each attribute
date_gen = (getattr(df['ArrivalDate'].dt, i).rename(i) for i in L)
# concatenate results and join to original dataframe
df = df.join(pd.concat(date_gen, axis=1))
print(df)
ArrivalDate year month day dayofweek dayofyear weekofyear quarter
0 2012-12-31 2012 12 31 0 366 1 4
1 2012-12-29 2012 12 29 5 364 52 4
2 2012-12-30 2012 12 30 6 365 52 4

Thanks to jaknap32, I wanted to aggregate the results according to Year and Month, so this worked:
df_join['YearMonth'] = df_join['timestamp'].apply(lambda x:x.strftime('%Y%m'))
Output was neat:
0 201108
1 201108
2 201108

There is two steps to extract year for all the dataframe without using method apply.
Step1
convert the column to datetime :
df['ArrivalDate']=pd.to_datetime(df['ArrivalDate'], format='%Y-%m-%d')
Step2
extract the year or the month using DatetimeIndex() method
pd.DatetimeIndex(df['ArrivalDate']).year

df['Month_Year'] = df['Date'].dt.to_period('M')
Result :
Date Month_Year
0 2020-01-01 2020-01
1 2020-01-02 2020-01
2 2020-01-03 2020-01
3 2020-01-04 2020-01
4 2020-01-05 2020-01

df['year_month']=df.datetime_column.apply(lambda x: str(x)[:7])
This worked fine for me, didn't think pandas would interpret the resultant string date as date, but when i did the plot, it knew very well my agenda and the string year_month where ordered properly... gotta love pandas!

Then I tried:
df['ArrivalDate'].apply(lambda(x):x[:-2])
I think here the proper input should be string.
df['ArrivalDate'].astype(str).apply(lambda(x):x[:-2])

Related

Pandas: how to change only one column which is in a series contain same column name [duplicate]

I have a Dataframe, df, with the following column:
df['ArrivalDate'] =
...
936 2012-12-31
938 2012-12-29
965 2012-12-31
966 2012-12-31
967 2012-12-31
968 2012-12-31
969 2012-12-31
970 2012-12-29
971 2012-12-31
972 2012-12-29
973 2012-12-29
...
The elements of the column are pandas.tslib.Timestamp.
I want to just include the year and month. I thought there would be simple way to do it, but I can't figure it out.
Here's what I've tried:
df['ArrivalDate'].resample('M', how = 'mean')
I got the following error:
Only valid with DatetimeIndex or PeriodIndex
Then I tried:
df['ArrivalDate'].apply(lambda(x):x[:-2])
I got the following error:
'Timestamp' object has no attribute '__getitem__'
Any suggestions?
Edit: I sort of figured it out.
df.index = df['ArrivalDate']
Then, I can resample another column using the index.
But I'd still like a method for reconfiguring the entire column. Any ideas?
If you want new columns showing year and month separately you can do this:
df['year'] = pd.DatetimeIndex(df['ArrivalDate']).year
df['month'] = pd.DatetimeIndex(df['ArrivalDate']).month
or...
df['year'] = df['ArrivalDate'].dt.year
df['month'] = df['ArrivalDate'].dt.month
Then you can combine them or work with them just as they are.
The df['date_column'] has to be in date time format.
df['month_year'] = df['date_column'].dt.to_period('M')
You could also use D for Day, 2M for 2 Months etc. for different sampling intervals, and in case one has time series data with time stamp, we can go for granular sampling intervals such as 45Min for 45 min, 15Min for 15 min sampling etc.
You can directly access the year and month attributes, or request a datetime.datetime:
In [15]: t = pandas.tslib.Timestamp.now()
In [16]: t
Out[16]: Timestamp('2014-08-05 14:49:39.643701', tz=None)
In [17]: t.to_pydatetime() #datetime method is deprecated
Out[17]: datetime.datetime(2014, 8, 5, 14, 49, 39, 643701)
In [18]: t.day
Out[18]: 5
In [19]: t.month
Out[19]: 8
In [20]: t.year
Out[20]: 2014
One way to combine year and month is to make an integer encoding them, such as: 201408 for August, 2014. Along a whole column, you could do this as:
df['YearMonth'] = df['ArrivalDate'].map(lambda x: 100*x.year + x.month)
or many variants thereof.
I'm not a big fan of doing this, though, since it makes date alignment and arithmetic painful later and especially painful for others who come upon your code or data without this same convention. A better way is to choose a day-of-month convention, such as final non-US-holiday weekday, or first day, etc., and leave the data in a date/time format with the chosen date convention.
The calendar module is useful for obtaining the number value of certain days such as the final weekday. Then you could do something like:
import calendar
import datetime
df['AdjustedDateToEndOfMonth'] = df['ArrivalDate'].map(
lambda x: datetime.datetime(
x.year,
x.month,
max(calendar.monthcalendar(x.year, x.month)[-1][:5])
)
)
If you happen to be looking for a way to solve the simpler problem of just formatting the datetime column into some stringified representation, for that you can just make use of the strftime function from the datetime.datetime class, like this:
In [5]: df
Out[5]:
date_time
0 2014-10-17 22:00:03
In [6]: df.date_time
Out[6]:
0 2014-10-17 22:00:03
Name: date_time, dtype: datetime64[ns]
In [7]: df.date_time.map(lambda x: x.strftime('%Y-%m-%d'))
Out[7]:
0 2014-10-17
Name: date_time, dtype: object
If you want the month year unique pair, using apply is pretty sleek.
df['mnth_yr'] = df['date_column'].apply(lambda x: x.strftime('%B-%Y'))
Outputs month-year in one column.
Don't forget to first change the format to date-time before, I generally forget.
df['date_column'] = pd.to_datetime(df['date_column'])
SINGLE LINE: Adding a column with 'year-month'-paires:
('pd.to_datetime' first changes the column dtype to date-time before the operation)
df['yyyy-mm'] = pd.to_datetime(df['ArrivalDate']).dt.strftime('%Y-%m')
Accordingly for an extra 'year' or 'month' column:
df['yyyy'] = pd.to_datetime(df['ArrivalDate']).dt.strftime('%Y')
df['mm'] = pd.to_datetime(df['ArrivalDate']).dt.strftime('%m')
Extracting the Year say from ['2018-03-04']
df['Year'] = pd.DatetimeIndex(df['date']).year
The df['Year'] creates a new column. While if you want to extract the month just use .month
You can first convert your date strings with pandas.to_datetime, which gives you access to all of the numpy datetime and timedelta facilities. For example:
df['ArrivalDate'] = pandas.to_datetime(df['ArrivalDate'])
df['Month'] = df['ArrivalDate'].values.astype('datetime64[M]')
#KieranPC's solution is the correct approach for Pandas, but is not easily extendible for arbitrary attributes. For this, you can use getattr within a generator comprehension and combine using pd.concat:
# input data
list_of_dates = ['2012-12-31', '2012-12-29', '2012-12-30']
df = pd.DataFrame({'ArrivalDate': pd.to_datetime(list_of_dates)})
# define list of attributes required
L = ['year', 'month', 'day', 'dayofweek', 'dayofyear', 'weekofyear', 'quarter']
# define generator expression of series, one for each attribute
date_gen = (getattr(df['ArrivalDate'].dt, i).rename(i) for i in L)
# concatenate results and join to original dataframe
df = df.join(pd.concat(date_gen, axis=1))
print(df)
ArrivalDate year month day dayofweek dayofyear weekofyear quarter
0 2012-12-31 2012 12 31 0 366 1 4
1 2012-12-29 2012 12 29 5 364 52 4
2 2012-12-30 2012 12 30 6 365 52 4
Thanks to jaknap32, I wanted to aggregate the results according to Year and Month, so this worked:
df_join['YearMonth'] = df_join['timestamp'].apply(lambda x:x.strftime('%Y%m'))
Output was neat:
0 201108
1 201108
2 201108
There is two steps to extract year for all the dataframe without using method apply.
Step1
convert the column to datetime :
df['ArrivalDate']=pd.to_datetime(df['ArrivalDate'], format='%Y-%m-%d')
Step2
extract the year or the month using DatetimeIndex() method
pd.DatetimeIndex(df['ArrivalDate']).year
df['Month_Year'] = df['Date'].dt.to_period('M')
Result :
Date Month_Year
0 2020-01-01 2020-01
1 2020-01-02 2020-01
2 2020-01-03 2020-01
3 2020-01-04 2020-01
4 2020-01-05 2020-01
df['year_month']=df.datetime_column.apply(lambda x: str(x)[:7])
This worked fine for me, didn't think pandas would interpret the resultant string date as date, but when i did the plot, it knew very well my agenda and the string year_month where ordered properly... gotta love pandas!
Then I tried:
df['ArrivalDate'].apply(lambda(x):x[:-2])
I think here the proper input should be string.
df['ArrivalDate'].astype(str).apply(lambda(x):x[:-2])

Pandas: Change column of integers to datetime and add a timestamp

I have a dataframe with an id column, and a date column made up of an integer.
d = {'id': [1, 2], 'date': [20161031, 20170930]}
df = pd.DataFrame(data=d)
id date
0 1 20161031
1 2 20170930
I can convert the date column to an actual date like so.
df['date'] = df['date'].apply(lambda x: pd.to_datetime(str(x), format='%Y%m%d'))
id date
0 1 2016-10-31
1 2 2017-09-30
But I need to have this field as a timestamp with with hours, minutes, and seconds so that it is compatible with my database table. I don't care what the the values are, we can keep it easy by setting it to zeros.
2016-10-31 00:00:00
2017-09-30 00:00:00
What is the best way to change this field to a timestamp? I tried
df['date'] = df['date'].apply(lambda x: pd.to_datetime(str(x), format='%Y%m%d%H%M%S'))
but pandas didn't like that.
I think I could append six 0's to the end of every value in that field and then use the above statement, but I was wondering if there is a better way.
With pandas it is simpler and faster to convert entire columns. First you convert to string and then to time stamp
pandas.to_datatime(df['date'].apply(str))
PS there are few other conversion methods of varying performance https://datatofish.com/fastest-way-to-convert-integers-to-strings-in-pandas-dataframe/
The problem seems to be that pd.to_datetime doesn't accept dates in this integer format:
pd.to_datetime(20161031) gives Timestamp('1970-01-01 00:00:00.020161031')
It assumes the integers are nanoseconds since 1970-01-01.
You have to convert to a string first:
df['date'] = pd.to_datetime(df["date"].astype(str))
Output:
id date
0 1 2016-10-31
1 2 2017-09-30
Note that these are datetimes so they include a time component (which are all zero in this case) even though they are not shown in the data frame representation above.
print(df.loc[0,'date'])
Out:
Timestamp('2016-10-31 00:00:00')
You can use
df['date'] = pd.to_datetime(df["date"].dt.strftime('%Y%m%d%H%M%S'))

How to get the number of business days between two dates in pandas

I have the following column in a dataframe, I would like to add a column to the end of this dataframe, where the column has the business days from today (6/24) to the previous day.
Bday() function does not seem to have this capability.
Date
2019-6-21
2019-6-20
2019-6-14
I am looking for a result that looks like following:
Date Business days
2019-6-21 1
2019-6-20 2
2019-6-14 6
Is there an easy way to do this, other than doing individual manipulations or using datetime library
Use np.busday_count:
# df['Date'] = pd.to_datetime(df['Date']) # if needed
np.busday_count(df['Date'].dt.date, np.datetime64('today'))
# array([1, 2, 6])
df['bdays'] = np.busday_count(df['Date'].dt.date, np.datetime64('today'))
df
Date bdays
0 2019-06-21 1
1 2019-06-20 2
2 2019-06-14 6

Sorting data by day and month (ignoring year) python pandas

I found many questions similar to mine, but none of them answer it exactly (this one comes closest, but it focusses on ruby).
I have a pandas DataFrame like this:
import pandas as pd
import numpy as np
df = pd.DataFrame({'Date': pd.date_range('2014-10-03', '2015-10-02', freq='1D'), 'Variable': np.random.randn(365)})
df.head()
Out[272]:
Date Variable
0 2014-10-03 0.637167
1 2014-10-04 0.562135
2 2014-10-05 -1.069769
3 2014-10-06 0.556997
4 2014-10-07 0.253468
I want to sort the data from January 1st to December 31st, ignoring the year component of the Date column. The background is that I want to track changes in Variable over the year, but my period starts and ends in October.
I thought of creating a seperate column for month and year and then sorting by those. But I am unsure how to do this in a "correct" and concise way.
Expected output:
Date Variable
0 01-01 0.637167 # (Placeholder-values)
1 01-02 0.562135
2 01-03 -1.069769
3 01-04 0.556997
4 01-05 0.253468
On way from argsort
yourdf=df.loc[df.Date.dt.strftime('%m%d').astype(int).argsort()]
You can create the day and month columns by simply doing the following
df = pd.DataFrame(data=pd.date_range('2014-10-03', '2015-10-02', freq='1D'), columns=['date'])
df['day'] = df['date'].apply(lambda x: x.day)
df['month'] = df['date'].apply(lambda x: x.month)
You could make it even more compact. But quick analysis, you can use the above.

Calculate day of year in dataframe from a datetime column to another column in Python

I have a dataframe with a datetime column in it, like 2014-01-01, 2016-06-05, etc. Now I want to add a column in the dataframe calculating the day of year (for that given year).
On this forum I did find some hints for sure, but I'm struggling with the types and dataframe stuff.
So this works fine
from datetime import datetime
day_to_calc = today
day_of_year = day_to_calc.timetuple().tm_yday
day_of_year
But my day_to_calc is not today, but df['Date']. However, if I try this
df['DOY'] = df['Date'].timetuple().tm_yday
I get
AttributeError: 'Series' object has no attribute 'timetuple'
Ok, so I guess I need a map function perhaps?
So I'm trying something like ..
df['DOY'] = map (datetime.timetuple().tm_yday,df['Date'])
And surely you guys see how stupid that is ;-) (but I'm still learning Python)
TypeError: descriptor 'timetuple' of 'datetime.datetime' object needs an argument
So that makes sense sort of because I need to pass the date as parameter, sooo .. trying
df['DOY'] = datetime.timetuple(df['Date']).tm_yday
TypeError: descriptor 'timetuple' requires a 'datetime.datetime' object but received a 'Series'
There must be a simple way, but I just can't figure out the syntax :-(
Use dayofyear function:
import pandas as pd
# first convert date string to datetime with a proper format string
df = pd.DataFrame({'Date':pd.to_datetime(['2014-01-01', '2016-06-05'], format='%Y-%m-%d')})
# calculate day of year
df['DOY'] = df['Date'].dt.dayofyear
print(df)
Output:
Date DOY
0 2014-01-01 1
1 2016-06-05 157
I noticed the above answer does not go into great detail, so I've provided a more explanatory answer below.
Try the following:
import pandas as pd
# Create a pandas datetime range for the year 2022
passed_2022 = pd.date_range('2022-01-01', '2022-12-31')
# Convert the datetime range to a list of strings in the format 'YYYY-MM-DD'
passed_2022_list = [i.strftime('%Y-%m-%d') for i in passed_2022]
# Create a DataFrame
data = pd.DataFrame({'datetime': passed_2022_list})
# Filter the data DataFrame to only include dates in the passed_2022 list
data = data[data['datetime'].isin(passed_2022_list)]
# Count the number of rows in the filtered DataFrame
num_days_passed = len(data)
# Create a new DataFrame with 'datetime' and 'DAYS_OF_YEAR' columns
result = pd.DataFrame({'datetime': passed_2022_list,
'DAYS OF YEAR': range(1, num_days_passed+1)})
# Print the result of the DataFrame
print(result)
Output:
datetime DAYS OF YEAR
0 2022-01-01 1
1 2022-01-02 2
2 2022-01-03 3
3 2022-01-04 4
4 2022-01-05 5
.. ... ...
360 2022-12-27 361
361 2022-12-28 362
362 2022-12-29 363
363 2022-12-30 364
364 2022-12-31 365
[365 rows x 2 columns]
Process finished with exit code 0

Categories

Resources