I'm trying to convert a column of Year values from int64 to datetime64 in pandas. The column currently looks like
Year
2003
2003
2003
2003
2003
...
2021
2021
2021
2021
2021
However, dataset['Year'].dtypes still reports int64, even after I ran pd.to_datetime(dataset.Year, format='%Y') to convert the column from int64 to datetime64. How do I get around this?
You have to assign the result of pd.to_datetime(df['Year'], format="%Y") back to a column, e.g. df['date']; pd.to_datetime returns a new Series rather than converting the column in place. Once you have done that, the conversion from integer works:
df = pd.DataFrame({'Year': [2000,2000,2000,2000,2000,2000]})
df['date'] = pd.to_datetime(df['Year'], format="%Y")
df
The output should be:
Year date
0 2000 2000-01-01
1 2000 2000-01-01
2 2000 2000-01-01
3 2000 2000-01-01
4 2000 2000-01-01
5 2000 2000-01-01
So essentially all your code is missing is the assignment df['date'] = pd.to_datetime(df['Year'], format="%Y"), and the conversion should work fine. Note that pd.to_datetime() returns a full timestamp, not just the year (which, as I understood your question, is what you wanted); see the pd.to_datetime() documentation for details on what it returns.
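If you only want the year back out of the new datetime column, the .dt accessor can extract it; a minimal sketch (the year_again column name is just illustrative):

import pandas as pd

df = pd.DataFrame({'Year': [2000, 2001, 2002]})
df['date'] = pd.to_datetime(df['Year'], format='%Y')  # full timestamps: 2000-01-01, ...
df['year_again'] = df['date'].dt.year                 # plain integers again: 2000, 2001, 2002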
I hope this helps.
You should be able to convert from an integer:
df = pd.DataFrame({'Year': [2003, 2022]})
df['datetime'] = pd.to_datetime(df['Year'], format='%Y')
print(df)
Output:
Year datetime
0 2003 2003-01-01
1 2022 2022-01-01
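To confirm the conversion took, checking the dtypes is a quick sanity test; a small sketch continuing the example above:

print(df.dtypes)
# Year                 int64
# datetime    datetime64[ns]
# dtype: object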
Related
Hello and thanks in advance for any help. I have a simple dataframe with two columns. I did not set an index explicitly, but I believe a dataframe gets an integer index that I see along the left side of the output. Question below:
df = pandas.DataFrame(res)
df.columns = ['date', 'pb']
df['date'] = pandas.to_datetime(df['date'])
df.dtypes
date datetime64[ns]
pb float64
dtype: object
date pb
0 2016-04-01 24199.933333
1 2016-03-01 23860.870968
2 2016-02-01 23862.275862
3 2016-01-01 25049.193548
4 2015-12-01 24882.419355
5 2015-11-01 24577.000000
I would like to pivot the dataframe so that I have years across the top (columns): 2016, 2015, etc
and a row for each month: 1 - 12.
Using the .dt accessor you can create columns for year and month and then pivot on those:
import numpy as np
import pandas as pd

df['Year'] = df['date'].dt.year
df['Month'] = df['date'].dt.month
pd.pivot_table(df, index='Month', columns='Year', values='pb', aggfunc=np.sum)
Alternately, if you don't want those extra columns you can do:
pd.pivot_table(df, index=df['date'].dt.month, columns=df['date'].dt.year,
               values='pb', aggfunc=np.sum)
With my dummy dataset that produces:
Year 2013 2014 2015 2016
date
1 92924.0 102072.0 134660.0 132464.0
2 79935.0 82145.0 118234.0 147523.0
3 86878.0 94959.0 130520.0 138325.0
4 80267.0 89394.0 120739.0 129002.0
5 79283.0 91205.0 118904.0 125878.0
6 77828.0 89884.0 112488.0 121953.0
7 78839.0 94407.0 113124.0 NaN
8 79885.0 97513.0 116771.0 NaN
9 79455.0 99555.0 114833.0 NaN
10 77616.0 98764.0 115872.0 NaN
11 75043.0 95756.0 107123.0 NaN
12 81996.0 102637.0 114952.0 NaN
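As a side note, pandas also accepts the aggregation function by name, which avoids the numpy dependency; a minimal variation of the same pivot, assuming the df above:

pd.pivot_table(df, index='Month', columns='Year', values='pb', aggfunc='sum')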
Using unstack instead of pivot
import numpy as np
import pandas as pd

df = pd.DataFrame(
    dict(date=pd.date_range('2013-01-01', periods=42, freq='M'),
         pb=np.random.rand(42)))
# Index by (month, year), then move the year level into the columns:
df.set_index([df.date.dt.month, df.date.dt.year]).pb.unstack()
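A hypothetical polish step (the rename calls are my addition, not part of the original answer): naming the index levels makes the unstacked axes self-describing:

df.set_index([df.date.dt.month.rename('month'),
              df.date.dt.year.rename('year')]).pb.unstack()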
What would be the correct way to show the average sales volume in Carlisle city for each year between 2010 and 2020?
Here is an abbreviated form of the large data frame showing only the columns and rows relevant to the question:
import pandas as pd
df = pd.DataFrame({'Date': ['01/09/2009','01/10/2009','01/11/2009','01/12/2009','01/01/2010','01/02/2010','01/03/2010','01/04/2010','01/05/2010','01/06/2010','01/07/2010','01/08/2010','01/09/2010','01/10/2010','01/11/2010','01/12/2010','01/01/2011','01/02/2011'],
'RegionName': ['Carlisle','Carlisle','Carlisle','Carlisle','Carlisle','Carlisle','Carlisle','Carlisle','Carlisle','Carlisle','Carlisle','Carlisle','Carlisle','Carlisle','Carlisle','Carlisle','Carlisle','Carlisle'],
'SalesVolume': [118,137,122,132,83,81,105,114,110,106,137,130,129,121,129,100,84,62]})
This is what I've tried:
import pandas as pd
from matplotlib import pyplot as plt
df = pd.read_csv ('C:/Users/user/AppData/Local/Programs/Python/Python39/Scripts/uk_hpi_dataset_2021_01.csv')
df.Date = pd.to_datetime(df.Date)
df['Year'] = pd.to_datetime(df['Date']).apply(lambda x:
'{year}'.format(year=x.year).zfill(2))
carlisle_vol = df[df['RegionName'].str.contains('Carlisle')]
sales_vol = carlisle_vol.groupby('Year')['SalesVolume'].mean()
print(sales_vol)
When I run this code, it doesn't filter the 'Date' column to calculate the average SalesVolume only for the years from '01/01/2010' to '01/12/2020'. For some reason, it also prints out every other column as well. Can anyone please help me to answer this question correctly?
This is the result I've got:
>>> df.loc[(df["Date"].dt.year.between(2010, 2020))
& (df["RegionName"] == "Carlisle")] \
.groupby([pd.Grouper(key="Date", freq="Y")])["SalesVolume"].mean()
Date
2010-01-01 112.083333
2011-01-01 73.000000
Freq: A-DEC, Name: SalesVolume, dtype: float64
Going further: the only difference from @nocibambi's answer is the groupby parameter, particularly the freq argument of pd.Grouper. Imagine your accounting year starts on the 1st of September.
Sales every 3 months:
>>> df
Date Sales
0 2010-09-01 1 # 1st group: mean=2.5
1 2010-12-01 2
2 2011-03-01 3
3 2011-06-01 4
4 2011-09-01 5 # 2nd group: mean=6.5
5 2011-12-01 6
6 2012-03-01 7
7 2012-06-01 8
>>> df.groupby(pd.Grouper(key="Date", freq="AS-SEP")).mean()
Sales
Date
2010-09-01 2.5
2011-09-01 6.5
Check the documentation for the full list of freq aliases and anchoring suffixes.
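For instance, a hypothetical fiscal year starting in April would use the AS-APR anchor; a minimal sketch with made-up numbers:

import pandas as pd

df = pd.DataFrame({
    'Date': pd.to_datetime(['2020-04-01', '2021-03-01', '2021-04-01']),
    'Sales': [1, 2, 3],
})
# Bins run April-to-April: [2020-04, 2021-04) and [2021-04, 2022-04)
print(df.groupby(pd.Grouper(key='Date', freq='AS-APR')).mean())
#             Sales
# Date
# 2020-04-01    1.5
# 2021-04-01    3.0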
You can access year with the datetime accessor:
df[
(df["RegionName"] == "Carlisle")
& (df["Date"].dt.year >= 2010)
& (df["Date"].dt.year <= 2020)
].groupby(df.Date.dt.year)["SalesVolume"].mean()
>>>
Date
2010 112.083333
2011 73.000000
Name: SalesVolume, dtype: float64
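As a side note (an equivalent rewrite, not part of the original answer), the two year comparisons collapse into a single Series.between call, which is inclusive on both ends:

df[
    (df["RegionName"] == "Carlisle")
    & df["Date"].dt.year.between(2010, 2020)
].groupby(df.Date.dt.year)["SalesVolume"].mean()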
I ran into this scenario and don't know how to solve it.
I have a data frame where I am adding "week_of_year" and "year" columns based on the "date" column, which works fine.
import pandas as pd
df = pd.DataFrame({'date': ['2018-12-31', '2019-01-01', '2019-12-31', '2020-01-01']})
df['date'] = pd.to_datetime(df['date'])
df['week_of_year'] = df['date'].apply(lambda x: x.weekofyear)
df['year'] = df['date'].apply(lambda x: x.year)
print(df)
Current Output
date week_of_year year
0 2018-12-31 1 2018
1 2019-01-01 1 2019
2 2019-12-31 1 2019
3 2020-01-01 1 2020
Expected Output
What I am expecting: for 2018 and 2019, the last date of the year falls in the first ISO week of the following year (2019 and 2020 respectively). So where the week is 1 but the date belongs to the previous calendar year, I want the year column to show that following year, as in the expected output.
date week_of_year year
0 2018-12-31 1 2019
1 2019-01-01 1 2019
2 2019-12-31 1 2020
3 2020-01-01 1 2020
Try:
df['date'] = pd.to_datetime(df['date'])
df['week_of_year'] = df['date'].dt.weekofyear
# Take the year of the Sunday that ends each ISO week:
df['year'] = (df['date'] + pd.to_timedelta(6 - df['date'].dt.weekday, unit='d')).dt.year
Outputs:
date week_of_year year
0 2018-12-31 1 2019
1 2019-01-01 1 2019
2 2019-12-31 1 2020
3 2020-01-01 1 2020
A few things: generally avoid .apply(...); for datetime columns you can interact with the dates directly through the df[col].dt accessor.
Then, to get the last day of the ISO week, add 6 - weekday days to each date, where weekday runs from 0 (Monday) to 6 (Sunday).
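For what it's worth, newer pandas (1.1+) exposes the ISO year directly via .dt.isocalendar(), which sidesteps the weekday arithmetic entirely; a minimal sketch, assuming the same data:

import pandas as pd

df = pd.DataFrame({'date': pd.to_datetime(['2018-12-31', '2019-01-01', '2019-12-31', '2020-01-01'])})
iso = df['date'].dt.isocalendar()  # DataFrame with columns 'year', 'week', 'day'
df['week_of_year'] = iso.week
df['year'] = iso.year              # ISO year: 2019, 2019, 2020, 2020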
TLDR CODE
To get the week number as a series
df['DATE'].dt.isocalendar().week
To store the week in a new column, use the same function and assign the returned Series:
df['WEEK'] = df['DATE'].dt.isocalendar().week
TLDR EXPLANATION
Use Series.dt.isocalendar().week to get the week for a given Series.
Note:
the "DATE" column must be stored as a datetime column
I have a dataframe with dates from 1970 to 2018, and I want to plot the frequency of occurrences from 2016 to 2017.
In[95]: df['last_payout'].dtypes
Out[95]: dtype('<M8[ns]')
The data is stored in this format:
In[96]: df['last_payout'].head()
Out[96]:
0   1970-01-01
1   1970-01-01
2   1970-01-01
3   1970-01-01
4   1970-01-01
Name: last_payout, dtype: datetime64[ns]
I plot this by year using groupby and count:
In[97]: df['last_payout'].groupby(df['last_payout'].dt.year).count().plot(kind="bar")
I want to restrict the plot to specific dates. I tried df['last_payout'].dt.year > 2016, but I got this:
How do I get the plot for a specific date range?
I think you need to filter with between and boolean indexing first:
rng = pd.date_range('2015-04-03', periods=10, freq='7M')
df = pd.DataFrame({'last_payout': rng})
print(df)
last_payout
0 2015-04-30
1 2015-11-30
2 2016-06-30
3 2017-01-31
4 2017-08-31
5 2018-03-31
6 2018-10-31
7 2019-05-31
8 2019-12-31
9 2020-07-31
(df.loc[df['last_payout'].dt.year.between(2016, 2017), 'last_payout']
.groupby(df['last_payout'].dt.year)
.count()
.plot(kind="bar")
)
Alternative solution (value_counts sorts by frequency, so sort_index restores chronological order):
(df.loc[df['last_payout'].dt.year.between(2016, 2017), 'last_payout']
.dt.year
.value_counts()
.sort_index()
.plot(kind="bar")
)
EDIT: For months with years convert datetimes to month period by to_period:
(df.loc[df['last_payout'].dt.year.between(2016, 2017), 'last_payout']
.dt.to_period('M')
.value_counts()
.sort_index()
.plot(kind="bar")
)
Note that
df['last_payout'].dt.year > 2016
just returns a boolean Series, so plotting it will show a bar chart of how many dates satisfy the condition and how many do not.
Try first creating a relevant df:
relevant_df = df[(df['last_payout'].dt.year >= 2016) & (df['last_payout'].dt.year <= 2017)]
(use strict or non-strict inequalities depending on what you want, of course.)
then performing the plot on it:
relevant_df['last_payout'].groupby(relevant_df['last_payout'].dt.year).count().plot(kind="bar")
I am currently trying to reproduce this: convert numeric sas date to datetime in Pandas, but I get the following error:
"Python int too large to convert to C long"
Here is an example of my dates:
0 1.416096e+09
1 1.427069e+09
2 1.433635e+09
3 1.428624e+09
4 1.433117e+09
Name: dates, dtype: float64
Any ideas?
Here is a little hacky solution. If the date column is called 'date', try
df['date'] = pd.to_datetime(df['date'] - 315619200, unit = 's')
Here 315619200 is the number of seconds between Jan 1 1960 and Jan 1 1970.
You get
0 2004-11-15 00:00:00
1 2005-03-22 00:03:20
2 2005-06-05 23:56:40
3 2005-04-09 00:00:00
4 2005-05-31 00:03:20
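The stray times like 00:03:20 suggest the underlying floats are slightly imprecise; if these values should be whole days (an assumption on my part, not stated in the question), one option is to round them:

df['date'] = df['date'].dt.round('D')  # snap each timestamp to the nearest day
# 0   2004-11-15
# 1   2005-03-22
# 2   2005-06-06
# 3   2005-04-09
# 4   2005-05-31

And if you'd rather not hardcode the magic number, the offset can be computed from the two epochs; a small sketch giving the same result:

import pandas as pd

# Seconds between the SAS epoch (1960-01-01) and the Unix epoch (1970-01-01)
sas_offset = (pd.Timestamp('1970-01-01') - pd.Timestamp('1960-01-01')).total_seconds()
print(sas_offset)  # 315619200.0
# Assumes df['date'] still holds the original SAS float values:
df['date'] = pd.to_datetime(df['date'] - sas_offset, unit='s')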