Grouping dates together by year in Pandas

Grouping dates together by year in Pandas - python

I have a dataset of property prices and they are currently listed by 'DATE_SOLD'. I'd like to be able to count them by year. The dataset looks like this -
SALE_DATE COUNTY SALE_PRICE
0 2010-01-01 Dublin 343000.0
1 2010-01-03 Laois 185000.0
2 2010-01-04 Dublin 438500.0
3 2010-01-04 Meath 400000.0
4 2010-01-04 Kilkenny 160000.0
This is the code I've tried -
by_year = property_prices['SALE_DATE'] = pd.to_datetime(property_prices['SALE_DATE'])
print(by_year)
I think I'm close but as a biblical noob it's quite frustrating!
Thank you for any help you can provide; this site has been awesome so far in finding little tips and tricks to make my life easier

You are close. As you did, you can use pd.to_datetime to convert your sale_date to a datetime column. Then groupby the year, using dt.year which gets the year of the datetime, and use size() on that which computes the size of each group, which in this case is the year.
property_prices['SALE_DATE'] = pd.to_datetime(property_prices['SALE_DATE'])
property_prices.groupby(property_prices.SALE_DATE.dt.year).size()
Which prints:
SALE_DATE
2010 5
dtype: int64

import pandas as pd
sample_dict = {'Date':['2010-01-11', '2020-01-22', '2010-03-12'], 'Price':[1000,2000,3500]}
df = pd.DataFrame(sample_dict)
# Creating 'year' column using the Date column
df['year'] = df.apply(lambda row: row.Date.split('-')[0], axis=1)
# Groupby function
df1 = df.groupby('Year')
# Print the first value in each group
df1.first()
Output:
Date x
year
2010 2010-01-11 1
2020 2020-01-22 2

Related

How to get calendar years as column names and month and day as index for one timeseries

I have looked for solutions but seem to find none that point me in the right direction, hopefully, someone on here can help. I have a stock price data set, with a frequency of Month Start. I am trying to get an output where the calendar years are the column names, and the day and month will be the index (there will only be 12 rows since it is monthly data). The rows will be filled with the stock prices corresponding to the year and month. I, unfortunately, have no code since I have looked at for loops, groupby, etc but can't seem to figure this one out.

You might want to split the date into month and year and to apply a pivot:
s = pd.to_datetime(df.index)
out = (df
.assign(year=s.year, month=s.month)
.pivot_table(index='month', columns='year', values='Close', fill_value=0)
)
output:
year 2003 2004
month
1 0 2
2 0 3
3 0 4
12 1 0
Used input:
df = pd.DataFrame({'Close': [1,2,3,4]},
index=['2003-12-01', '2004-01-01', '2004-02-01', '2004-03-01'])

You need multiple steps to do that.
First split your column into the right format.
Then convert this column into two separate columns.
Then pivot the table accordingly.
import pandas as pd
# Test Dataframe
df = pd.DataFrame({'Date': ['2003-12-01', '2004-01-01', '2004-02-01', '2004-12-01'],
'Close': [6.661, 7.053, 6.625, 8.999]})
# Split datestring into list of form [year, month-day]
df = df.assign(Date=df.Date.str.split(pat='-', n=1))
# Separate date-list column into two columns
df = pd.DataFrame(df.Date.to_list(), columns=['Year', 'Date'], index=df.index).join(df.Close)
# Pivot the table
df = df.pivot(columns='Year', index='Date')
df
Output:
Close
Year 2003 2004
Date
01-01 NaN 7.053
02-01 NaN 6.625
12-01 6.661 8.999

Pandas: how to change only one column which is in a series contain same column name [duplicate]

I have a Dataframe, df, with the following column:
df['ArrivalDate'] =
...
936 2012-12-31
938 2012-12-29
965 2012-12-31
966 2012-12-31
967 2012-12-31
968 2012-12-31
969 2012-12-31
970 2012-12-29
971 2012-12-31
972 2012-12-29
973 2012-12-29
...
The elements of the column are pandas.tslib.Timestamp.
I want to just include the year and month. I thought there would be simple way to do it, but I can't figure it out.
Here's what I've tried:
df['ArrivalDate'].resample('M', how = 'mean')
I got the following error:
Only valid with DatetimeIndex or PeriodIndex
Then I tried:
df['ArrivalDate'].apply(lambda(x):x[:-2])
I got the following error:
'Timestamp' object has no attribute '__getitem__'
Any suggestions?
Edit: I sort of figured it out.
df.index = df['ArrivalDate']
Then, I can resample another column using the index.
But I'd still like a method for reconfiguring the entire column. Any ideas?

If you want new columns showing year and month separately you can do this:
df['year'] = pd.DatetimeIndex(df['ArrivalDate']).year
df['month'] = pd.DatetimeIndex(df['ArrivalDate']).month
or...
df['year'] = df['ArrivalDate'].dt.year
df['month'] = df['ArrivalDate'].dt.month
Then you can combine them or work with them just as they are.

The df['date_column'] has to be in date time format.
df['month_year'] = df['date_column'].dt.to_period('M')
You could also use D for Day, 2M for 2 Months etc. for different sampling intervals, and in case one has time series data with time stamp, we can go for granular sampling intervals such as 45Min for 45 min, 15Min for 15 min sampling etc.

You can directly access the year and month attributes, or request a datetime.datetime:
In [15]: t = pandas.tslib.Timestamp.now()
In [16]: t
Out[16]: Timestamp('2014-08-05 14:49:39.643701', tz=None)
In [17]: t.to_pydatetime() #datetime method is deprecated
Out[17]: datetime.datetime(2014, 8, 5, 14, 49, 39, 643701)
In [18]: t.day
Out[18]: 5
In [19]: t.month
Out[19]: 8
In [20]: t.year
Out[20]: 2014
One way to combine year and month is to make an integer encoding them, such as: 201408 for August, 2014. Along a whole column, you could do this as:
df['YearMonth'] = df['ArrivalDate'].map(lambda x: 100*x.year + x.month)
or many variants thereof.
I'm not a big fan of doing this, though, since it makes date alignment and arithmetic painful later and especially painful for others who come upon your code or data without this same convention. A better way is to choose a day-of-month convention, such as final non-US-holiday weekday, or first day, etc., and leave the data in a date/time format with the chosen date convention.
The calendar module is useful for obtaining the number value of certain days such as the final weekday. Then you could do something like:
import calendar
import datetime
df['AdjustedDateToEndOfMonth'] = df['ArrivalDate'].map(
lambda x: datetime.datetime(
x.year,
x.month,
max(calendar.monthcalendar(x.year, x.month)[-1][:5])
)
)
If you happen to be looking for a way to solve the simpler problem of just formatting the datetime column into some stringified representation, for that you can just make use of the strftime function from the datetime.datetime class, like this:
In [5]: df
Out[5]:
date_time
0 2014-10-17 22:00:03
In [6]: df.date_time
Out[6]:
0 2014-10-17 22:00:03
Name: date_time, dtype: datetime64[ns]
In [7]: df.date_time.map(lambda x: x.strftime('%Y-%m-%d'))
Out[7]:
0 2014-10-17
Name: date_time, dtype: object

If you want the month year unique pair, using apply is pretty sleek.
df['mnth_yr'] = df['date_column'].apply(lambda x: x.strftime('%B-%Y'))
Outputs month-year in one column.
Don't forget to first change the format to date-time before, I generally forget.
df['date_column'] = pd.to_datetime(df['date_column'])

SINGLE LINE: Adding a column with 'year-month'-paires:
('pd.to_datetime' first changes the column dtype to date-time before the operation)
df['yyyy-mm'] = pd.to_datetime(df['ArrivalDate']).dt.strftime('%Y-%m')
Accordingly for an extra 'year' or 'month' column:
df['yyyy'] = pd.to_datetime(df['ArrivalDate']).dt.strftime('%Y')
df['mm'] = pd.to_datetime(df['ArrivalDate']).dt.strftime('%m')

Extracting the Year say from ['2018-03-04']
df['Year'] = pd.DatetimeIndex(df['date']).year
The df['Year'] creates a new column. While if you want to extract the month just use .month

You can first convert your date strings with pandas.to_datetime, which gives you access to all of the numpy datetime and timedelta facilities. For example:
df['ArrivalDate'] = pandas.to_datetime(df['ArrivalDate'])
df['Month'] = df['ArrivalDate'].values.astype('datetime64[M]')

#KieranPC's solution is the correct approach for Pandas, but is not easily extendible for arbitrary attributes. For this, you can use getattr within a generator comprehension and combine using pd.concat:
# input data
list_of_dates = ['2012-12-31', '2012-12-29', '2012-12-30']
df = pd.DataFrame({'ArrivalDate': pd.to_datetime(list_of_dates)})
# define list of attributes required
L = ['year', 'month', 'day', 'dayofweek', 'dayofyear', 'weekofyear', 'quarter']
# define generator expression of series, one for each attribute
date_gen = (getattr(df['ArrivalDate'].dt, i).rename(i) for i in L)
# concatenate results and join to original dataframe
df = df.join(pd.concat(date_gen, axis=1))
print(df)
ArrivalDate year month day dayofweek dayofyear weekofyear quarter
0 2012-12-31 2012 12 31 0 366 1 4
1 2012-12-29 2012 12 29 5 364 52 4
2 2012-12-30 2012 12 30 6 365 52 4

Thanks to jaknap32, I wanted to aggregate the results according to Year and Month, so this worked:
df_join['YearMonth'] = df_join['timestamp'].apply(lambda x:x.strftime('%Y%m'))
Output was neat:
0 201108
1 201108
2 201108

There is two steps to extract year for all the dataframe without using method apply.
Step1
convert the column to datetime :
df['ArrivalDate']=pd.to_datetime(df['ArrivalDate'], format='%Y-%m-%d')
Step2
extract the year or the month using DatetimeIndex() method
pd.DatetimeIndex(df['ArrivalDate']).year

df['Month_Year'] = df['Date'].dt.to_period('M')
Result :
Date Month_Year
0 2020-01-01 2020-01
1 2020-01-02 2020-01
2 2020-01-03 2020-01
3 2020-01-04 2020-01
4 2020-01-05 2020-01

df['year_month']=df.datetime_column.apply(lambda x: str(x)[:7])
This worked fine for me, didn't think pandas would interpret the resultant string date as date, but when i did the plot, it knew very well my agenda and the string year_month where ordered properly... gotta love pandas!

Then I tried:
df['ArrivalDate'].apply(lambda(x):x[:-2])
I think here the proper input should be string.
df['ArrivalDate'].astype(str).apply(lambda(x):x[:-2])

Sorting data by day and month (ignoring year) python pandas

I found many questions similar to mine, but none of them answer it exactly (this one comes closest, but it focusses on ruby).
I have a pandas DataFrame like this:
import pandas as pd
import numpy as np
df = pd.DataFrame({'Date': pd.date_range('2014-10-03', '2015-10-02', freq='1D'), 'Variable': np.random.randn(365)})
df.head()
Out[272]:
Date Variable
0 2014-10-03 0.637167
1 2014-10-04 0.562135
2 2014-10-05 -1.069769
3 2014-10-06 0.556997
4 2014-10-07 0.253468
I want to sort the data from January 1st to December 31st, ignoring the year component of the Date column. The background is that I want to track changes in Variable over the year, but my period starts and ends in October.
I thought of creating a seperate column for month and year and then sorting by those. But I am unsure how to do this in a "correct" and concise way.
Expected output:
Date Variable
0 01-01 0.637167 # (Placeholder-values)
1 01-02 0.562135
2 01-03 -1.069769
3 01-04 0.556997
4 01-05 0.253468

On way from argsort
yourdf=df.loc[df.Date.dt.strftime('%m%d').astype(int).argsort()]

You can create the day and month columns by simply doing the following
df = pd.DataFrame(data=pd.date_range('2014-10-03', '2015-10-02', freq='1D'), columns=['date'])
df['day'] = df['date'].apply(lambda x: x.day)
df['month'] = df['date'].apply(lambda x: x.month)
You could make it even more compact. But quick analysis, you can use the above.

Pandas read_csv parse_dates from 2 columns

I have not found a post here that quite fits my situation. I have a csv file where the first column is year (2002), the second column is Month (January) and the third column is MonthCode (1 for January, etc). I would like to import into a Pandas dataframe to create a full date index. The following code gives an error, but should show you what I am trying to do.
The error is:
ValueError: time data '2002' does not match format '%Y%b%d'
Note: I do not have a Day of the month in the data so I have to use the first or last, unless there is a way to index on just Year and Month with no Day.
The data looks like this:
Year Month Month Code District Code District
2002 January 1 1 Albany
2002 January 1 2 Allegany
2002 January 1 3 Broome
2002 January 1 4 Cattaraugus
2002 January 1 5 Cayuga
The code that does not work:
file = 'C:/.../snap.csv'
parser = lambda date: pd.datetime.strptime(date, '%Y%b%d')
# create dataframe from csv file
snapdf = pd.read_csv(file, parse_dates = [0,1], date_parser = parser)
# NOTE: I also tried parse_dates = [0,2] but same error

I altered the data to make it more obvious how the dates gets parsed into the dataframe
Year,Month,Month Code,District Code,District
2002,January,1,1,Albany
2004,February,1,2,Allegany
2005,December,1,3,Broome
2007,August,1,4,Cattaraugus
2001,March,1,5,Cayuga
using parse_dates parameter with column 1-3:
>>>> with open('snap.csv') as f:
df = pd.read_csv(f, parse_dates={'Date': [0,1,2]}, index_col='Date')
>>>> df
District Code District
Date
2002-01-01 1 Albany
2004-02-01 2 Allegany
2005-12-01 3 Broome
2007-08-01 4 Cattaraugus
2001-03-01 5 Cayuga
>>>> df.District
Date
2002-01-01 Albany
2004-02-01 Allegany
2005-12-01 Broome
2007-08-01 Cattaraugus
2001-03-01 Cayuga
Name: District, dtype: object

I finally got this running and it was actually quite simple in the end.
snapdf["DateIndex"] = pd.to_datetime(snapdf['Year'].astype(str), format='%Y')
This takes the value from the Year column of a dataframe (stored as Int) and converts it to a datestring in a new column DateIndex. Because there are no month or day data, it automatically inserts 01/01 as the month and day.
So, 2017 in the Year column becomes 01/10/2017

Extract Day, Month and Hour from Timestamp string in Python [duplicate]

I have a Dataframe, df, with the following column:
df['ArrivalDate'] =
...
936 2012-12-31
938 2012-12-29
965 2012-12-31
966 2012-12-31
967 2012-12-31
968 2012-12-31
969 2012-12-31
970 2012-12-29
971 2012-12-31
972 2012-12-29
973 2012-12-29
...
The elements of the column are pandas.tslib.Timestamp.
I want to just include the year and month. I thought there would be simple way to do it, but I can't figure it out.
Here's what I've tried:
df['ArrivalDate'].resample('M', how = 'mean')
I got the following error:
Only valid with DatetimeIndex or PeriodIndex
Then I tried:
df['ArrivalDate'].apply(lambda(x):x[:-2])
I got the following error:
'Timestamp' object has no attribute '__getitem__'
Any suggestions?
Edit: I sort of figured it out.
df.index = df['ArrivalDate']
Then, I can resample another column using the index.
But I'd still like a method for reconfiguring the entire column. Any ideas?

If you want new columns showing year and month separately you can do this:
df['year'] = pd.DatetimeIndex(df['ArrivalDate']).year
df['month'] = pd.DatetimeIndex(df['ArrivalDate']).month
or...
df['year'] = df['ArrivalDate'].dt.year
df['month'] = df['ArrivalDate'].dt.month
Then you can combine them or work with them just as they are.

The df['date_column'] has to be in date time format.
df['month_year'] = df['date_column'].dt.to_period('M')
You could also use D for Day, 2M for 2 Months etc. for different sampling intervals, and in case one has time series data with time stamp, we can go for granular sampling intervals such as 45Min for 45 min, 15Min for 15 min sampling etc.

You can directly access the year and month attributes, or request a datetime.datetime:
In [15]: t = pandas.tslib.Timestamp.now()
In [16]: t
Out[16]: Timestamp('2014-08-05 14:49:39.643701', tz=None)
In [17]: t.to_pydatetime() #datetime method is deprecated
Out[17]: datetime.datetime(2014, 8, 5, 14, 49, 39, 643701)
In [18]: t.day
Out[18]: 5
In [19]: t.month
Out[19]: 8
In [20]: t.year
Out[20]: 2014
One way to combine year and month is to make an integer encoding them, such as: 201408 for August, 2014. Along a whole column, you could do this as:
df['YearMonth'] = df['ArrivalDate'].map(lambda x: 100*x.year + x.month)
or many variants thereof.
I'm not a big fan of doing this, though, since it makes date alignment and arithmetic painful later and especially painful for others who come upon your code or data without this same convention. A better way is to choose a day-of-month convention, such as final non-US-holiday weekday, or first day, etc., and leave the data in a date/time format with the chosen date convention.
The calendar module is useful for obtaining the number value of certain days such as the final weekday. Then you could do something like:
import calendar
import datetime
df['AdjustedDateToEndOfMonth'] = df['ArrivalDate'].map(
lambda x: datetime.datetime(
x.year,
x.month,
max(calendar.monthcalendar(x.year, x.month)[-1][:5])
)
)
If you happen to be looking for a way to solve the simpler problem of just formatting the datetime column into some stringified representation, for that you can just make use of the strftime function from the datetime.datetime class, like this:
In [5]: df
Out[5]:
date_time
0 2014-10-17 22:00:03
In [6]: df.date_time
Out[6]:
0 2014-10-17 22:00:03
Name: date_time, dtype: datetime64[ns]
In [7]: df.date_time.map(lambda x: x.strftime('%Y-%m-%d'))
Out[7]:
0 2014-10-17
Name: date_time, dtype: object

If you want the month year unique pair, using apply is pretty sleek.
df['mnth_yr'] = df['date_column'].apply(lambda x: x.strftime('%B-%Y'))
Outputs month-year in one column.
Don't forget to first change the format to date-time before, I generally forget.
df['date_column'] = pd.to_datetime(df['date_column'])

SINGLE LINE: Adding a column with 'year-month'-paires:
('pd.to_datetime' first changes the column dtype to date-time before the operation)
df['yyyy-mm'] = pd.to_datetime(df['ArrivalDate']).dt.strftime('%Y-%m')
Accordingly for an extra 'year' or 'month' column:
df['yyyy'] = pd.to_datetime(df['ArrivalDate']).dt.strftime('%Y')
df['mm'] = pd.to_datetime(df['ArrivalDate']).dt.strftime('%m')

Extracting the Year say from ['2018-03-04']
df['Year'] = pd.DatetimeIndex(df['date']).year
The df['Year'] creates a new column. While if you want to extract the month just use .month

You can first convert your date strings with pandas.to_datetime, which gives you access to all of the numpy datetime and timedelta facilities. For example:
df['ArrivalDate'] = pandas.to_datetime(df['ArrivalDate'])
df['Month'] = df['ArrivalDate'].values.astype('datetime64[M]')

#KieranPC's solution is the correct approach for Pandas, but is not easily extendible for arbitrary attributes. For this, you can use getattr within a generator comprehension and combine using pd.concat:
# input data
list_of_dates = ['2012-12-31', '2012-12-29', '2012-12-30']
df = pd.DataFrame({'ArrivalDate': pd.to_datetime(list_of_dates)})
# define list of attributes required
L = ['year', 'month', 'day', 'dayofweek', 'dayofyear', 'weekofyear', 'quarter']
# define generator expression of series, one for each attribute
date_gen = (getattr(df['ArrivalDate'].dt, i).rename(i) for i in L)
# concatenate results and join to original dataframe
df = df.join(pd.concat(date_gen, axis=1))
print(df)
ArrivalDate year month day dayofweek dayofyear weekofyear quarter
0 2012-12-31 2012 12 31 0 366 1 4
1 2012-12-29 2012 12 29 5 364 52 4
2 2012-12-30 2012 12 30 6 365 52 4

Thanks to jaknap32, I wanted to aggregate the results according to Year and Month, so this worked:
df_join['YearMonth'] = df_join['timestamp'].apply(lambda x:x.strftime('%Y%m'))
Output was neat:
0 201108
1 201108
2 201108

There is two steps to extract year for all the dataframe without using method apply.
Step1
convert the column to datetime :
df['ArrivalDate']=pd.to_datetime(df['ArrivalDate'], format='%Y-%m-%d')
Step2
extract the year or the month using DatetimeIndex() method
pd.DatetimeIndex(df['ArrivalDate']).year

df['Month_Year'] = df['Date'].dt.to_period('M')
Result :
Date Month_Year
0 2020-01-01 2020-01
1 2020-01-02 2020-01
2 2020-01-03 2020-01
3 2020-01-04 2020-01
4 2020-01-05 2020-01

df['year_month']=df.datetime_column.apply(lambda x: str(x)[:7])
This worked fine for me, didn't think pandas would interpret the resultant string date as date, but when i did the plot, it knew very well my agenda and the string year_month where ordered properly... gotta love pandas!

Then I tried:
df['ArrivalDate'].apply(lambda(x):x[:-2])
I think here the proper input should be string.
df['ArrivalDate'].astype(str).apply(lambda(x):x[:-2])

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Grouping dates together by year in Pandas - python

Related

How to get calendar years as column names and month and day as index for one timeseries

Pandas: how to change only one column which is in a series contain same column name [duplicate]

Sorting data by day and month (ignoring year) python pandas

Pandas read_csv parse_dates from 2 columns

Extract Day, Month and Hour from Timestamp string in Python [duplicate]

Categories

Resources