In Pandas, I am trying to format a date column from String to proper date so that I can export it to ElasticSearch. However, date and month are getting mixed up. I have given an example below.
df = pd.DataFrame({'Date':['12/03/2020 0:00', '11/02/2019 0:00', '10/01/2020 0:00'],
'Event':['Music', 'Poetry', 'Theatre'],
'Cost':[10000, 5000, 15000]})
Date is entered in dd/mm/YYYY format.
df['Date1'] = df['Date'].astype('datetime64[ns]')
df['Year'] = pd.DatetimeIndex(df['Date']).year
df['Month'] = pd.DatetimeIndex(df['Date1']).month
df['Day'] = pd.DatetimeIndex(df['Date1']).day
df
This results in the following data frame, where the day and month are interchanged. The year is extracted correctly.
Date Event Cost Date1 Year Month Day
0 12/03/2020 0:00 Music 10000 2020-12-03 2020 12 3
1 11/02/2019 0:00 Poetry 5000 2019-11-02 2019 11 2
2 10/01/2020 0:00 Theatre 15000 2020-10-01 2020 10 1
Can someone provide inputs on how to format the date column in an appropriate way? Thanks
You'll want to use pd.to_datetime() to convert the data to real datetimes first. Because your dates are day-first (dd/mm/YYYY), say so explicitly; by default pandas reads an ambiguous string such as 12/03/2020 month-first:
df['Date'] = pd.to_datetime(df['Date'], dayfirst=True)
With dayfirst=True, your example data parses as intended:
>>> df
Date Event Cost
0 2020-03-12 Music 10000
1 2019-02-11 Poetry 5000
2 2020-01-10 Theatre 15000
If you need the separate d/m/y columns, you can access the series' dt property instead of converting via a DatetimeIndex:
>>> df['Year'] = df['Date'].dt.year
>>> # ... etc ...
>>> df
Date Event Cost Year
0 2020-03-12 Music 10000 2020
1 2019-02-11 Poetry 5000 2019
2 2020-01-10 Theatre 15000 2020
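To see where the swap comes from, it can help to parse the same strings under both interpretations. This is a minimal sketch using the sample values from the question; the names raw, month_first and day_first are illustrative and not from the original post.

```python
import pandas as pd

raw = pd.Series(['12/03/2020 0:00', '11/02/2019 0:00', '10/01/2020 0:00'])

# Month-first reading (what the question's output shows): 12/03 -> 3 December
month_first = pd.to_datetime(raw, format='%m/%d/%Y %H:%M')

# Day-first reading (what the data actually means): 12/03 -> 12 March
day_first = pd.to_datetime(raw, format='%d/%m/%Y %H:%M')

print(month_first.dt.month.tolist())  # [12, 11, 10]
print(day_first.dt.month.tolist())    # [3, 2, 1]
```

Spelling out the format removes the guesswork, which is usually what you want before exporting the column elsewhere.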
Related
I'm trying to convert a column of Year values from int64 to datetime64 in pandas. The column currently looks like
Year
2003
2003
2003
2003
2003
...
2021
2021
2021
2021
2021
However the data type listed when I use dataset['Year'].dtypes is int64.
That's after I used pd.to_datetime(dataset.Year, format='%Y') to convert the column from int64 to datetime64. How do I get around this?
You have to assign the result of pd.to_datetime(df['Year'], format="%Y") to a column, for example df['date']. pd.to_datetime() returns a new Series and does not modify the original column. Once you assign it, the conversion from integer works.
df = pd.DataFrame({'Year': [2000,2000,2000,2000,2000,2000]})
df['date'] = pd.to_datetime(df['Year'], format="%Y")
df
The output should be:
Year date
0 2000 2000-01-01
1 2000 2000-01-01
2 2000 2000-01-01
3 2000 2000-01-01
4 2000 2000-01-01
5 2000 2000-01-01
So essentially all your code is missing is the assignment df['date'] = pd.to_datetime(df['Year'], format="%Y"); with that in place the conversion works fine.
Note that pd.to_datetime() does not return just the year (which, as far as I understood from your question, is what you wanted); it returns full timestamps. For more information on what pd.to_datetime() returns, see the documentation.
I hope this helps.
You should be able to convert from an integer:
df = pd.DataFrame({'Year': [2003, 2022]})
df['datetime'] = pd.to_datetime(df['Year'], format='%Y')
print(df)
Output:
Year datetime
0 2003 2003-01-01
1 2022 2022-01-01
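Both answers come down to the same point: pd.to_datetime() returns a new object rather than converting the column in place. Here is a small sketch of that behaviour, assuming a hypothetical frame named dataset with an integer Year column as in the question.

```python
import pandas as pd
from pandas.api.types import is_datetime64_any_dtype

dataset = pd.DataFrame({'Year': [2003, 2003, 2021, 2021]})

# Calling to_datetime without assigning leaves the column untouched
pd.to_datetime(dataset['Year'], format='%Y')
still_int = not is_datetime64_any_dtype(dataset['Year'])

# Assigning the result back is what changes the dtype
dataset['Year'] = pd.to_datetime(dataset['Year'], format='%Y')
now_datetime = is_datetime64_any_dtype(dataset['Year'])

print(still_int, now_datetime)           # True True
print(dataset['Year'].dt.year.tolist())  # [2003, 2003, 2021, 2021]
```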
My dataset has dates in the European format, and I'm struggling to convert it into the correct format before I pass it through a pd.to_datetime, so for all day < 12, my month and day switch.
Is there an easy solution to this?
import pandas as pd
import datetime as dt
df = pd.read_csv(loc,dayfirst=True)
df['Date']=pd.to_datetime(df['Date'])
Is there a way to force datetime to acknowledge that the input is formatted at dd/mm/yy?
Thanks for the help!
Edit, a sample from my dates:
renewal["Date"].head()
Out[235]:
0 31/03/2018
2 30/04/2018
3 28/02/2018
4 30/04/2018
5 31/03/2018
Name: Earliest renewal date, dtype: object
After running the following:
renewal['Date']=pd.to_datetime(renewal['Date'],dayfirst=True)
I get:
Out[241]:
0 2018-03-31 #Correct
2 2018-04-01 #<-- this number is wrong and should be 01-04 instead
3 2018-02-28 #Correct
Add format.
df['Date'] = pd.to_datetime(df['Date'], format='%d/%m/%Y')
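An explicit format also makes mismatched input fail loudly instead of being silently reinterpreted. This is a minimal sketch using sample dates from the question; the variable names are illustrative.

```python
import pandas as pd

dates = pd.Series(['31/03/2018', '30/04/2018', '28/02/2018'])

# Day-first format matches the data
parsed = pd.to_datetime(dates, format='%d/%m/%Y')
print(parsed.dt.strftime('%Y-%m-%d').tolist())
# ['2018-03-31', '2018-04-30', '2018-02-28']

# A month-first format cannot match '31/03/2018' (there is no month 31),
# so pandas raises instead of swapping day and month
try:
    pd.to_datetime(dates, format='%m/%d/%Y')
    raised = False
except ValueError:
    raised = True
print(raised)  # True
```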
You can control the date construction directly if you define separate columns for 'year', 'month' and 'day', like this:
import pandas as pd
df = pd.DataFrame(
{'Date': ['01/03/2018', '06/08/2018', '31/03/2018', '30/04/2018']}
)
date_parts = df['Date'].apply(lambda d: pd.Series(int(n) for n in d.split('/')))
date_parts.columns = ['day', 'month', 'year']
df['Date'] = pd.to_datetime(date_parts)
date_parts
# day month year
# 0 1 3 2018
# 1 6 8 2018
# 2 31 3 2018
# 3 30 4 2018
df
# Date
# 0 2018-03-01
# 1 2018-08-06
# 2 2018-03-31
# 3 2018-04-30
I have a data set with a timestamp in format dd/mm/yyyy hh:mm:ss. I would like to extract the month and the year for the whole column. So I used the following code:
Extracting the year
`df['Year'] = pd.DatetimeIndex(df['timestamp']).year`
Extracting the month
`df['month_num'] = pd.DatetimeIndex(df['timestamp']).month`
Converting number of month in name of month
`df['Month'] = df['month_num'].apply(lambda x: calendar.month_abbr[x])`
`df.drop(['month_num'], axis=1, inplace=True)`
However, the above returns the wrong month as sometimes it takes the month from the second pair of details (as if date format were in dd/mm/yyyy, which in fact it is), and sometimes it takes the month from the first pair of details (as if date format were in mm/dd/yyyy, which is not). So as you can see below, it returns 'Feb' for what should be 'Jan', although 'Dec' is correct.
`02/01/2020 12:07:00 EURUSD EUR 138,476.70 2020 Feb`
`02/01/2020 12:02:12 GBPHKD GBP 13,545.93 2020 Feb`
`31/12/2019 16:35:48 GBPUSD USD 537.60 2019 Dec`
`31/12/2019 16:29:34 GBPHKD HKD 279.17 2019 Dec`
I also tried changing the original timestamp format to yyyy-mm-dd, but after the change it keeps taking the month in a different order.
Any idea for this? Cheers!
First make sure your date column holds real datetimes, and tell pandas the input is day-first: df[0] = pd.to_datetime(df[0], dayfirst=True)
Then use strftime('%b') with assign, keeping the frame that assign returns:
df = df.assign(year = df[0].dt.year,
month = df[0].dt.strftime('%b'))
print(df)
0 1 2 3 4 year month
0 2020-01-02 12:07:00 EURUSD EUR 138,476.70 2020 Jan
1 2020-01-02 12:02:12 GBPHKD GBP 13,545.93 2020 Jan
2 2019-12-31 16:35:48 GBPUSD USD 537.60 2019 Dec
3 2019-12-31 16:29:34 GBPHKD HKD 279.17 2019 Dec
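For a self-contained version of this approach, the sketch below parses the question's sample timestamps with an explicit day-first format and then derives the year and abbreviated month. The column name timestamp follows the question; the frame itself is rebuilt here for illustration.

```python
import pandas as pd

df = pd.DataFrame({'timestamp': ['02/01/2020 12:07:00', '02/01/2020 12:02:12',
                                 '31/12/2019 16:35:48', '31/12/2019 16:29:34']})

# Parse once with the known day-first layout
df['timestamp'] = pd.to_datetime(df['timestamp'], format='%d/%m/%Y %H:%M:%S')

df['Year'] = df['timestamp'].dt.year
# %b gives the abbreviated month name in the current locale
df['Month'] = df['timestamp'].dt.strftime('%b')
print(df[['Year', 'Month']])
```

Once the column is a true datetime, no intermediate month number or calendar lookup is needed.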
I ran into this scenario and don't know how to solve it.
I have a data frame where I am adding "week_of_year" and "year" columns based on the "date" column, which is working fine.
import pandas as pd
df = pd.DataFrame({'date': ['2018-12-31', '2019-01-01', '2019-12-31', '2020-01-01']})
df['date'] = pd.to_datetime(df['date'])
df['week_of_year'] = df['date'].apply(lambda x: x.weekofyear)
df['year'] = df['date'].apply(lambda x: x.year)
print(df)
Current Output
date week_of_year year
0 2018-12-31 1 2018
1 2019-01-01 1 2019
2 2019-12-31 1 2019
3 2020-01-01 1 2020
Expected Output
For 2018 and 2019, the last date of the year falls in the first week of the following year (2019 and 2020 respectively). So when the week is 1 but the date belongs to the previous year, I want the year column to show the following year, as in the expected output:
date week_of_year year
0 2018-12-31 1 2019
1 2019-01-01 1 2019
2 2019-12-31 1 2020
3 2020-01-01 1 2020
Try:
df['date'] = pd.to_datetime(df['date'])
df['week_of_year'] = df['date'].dt.isocalendar().week
df['year']=(df['date']+pd.to_timedelta(6-df['date'].dt.weekday, unit='d')).dt.year
Outputs:
date week_of_year year
0 2018-12-31 1 2019
1 2019-01-01 1 2019
2 2019-12-31 1 2020
3 2020-01-01 1 2020
A few things: generally avoid .apply(..).
For datetime columns you can work with the dates directly through the df[col].dt accessor.
Then, to get the last day of the week, add 6 - weekday days to the date, where weekday runs from 0 (Monday) to 6 (Sunday). The year column is the year of that Sunday.
TLDR CODE
To get the week number as a series
df['DATE'].dt.isocalendar().week
To set a new column to the week use same function and set series returned to a column:
df['WEEK'] = df['DATE'].dt.isocalendar().week
TLDR EXPLANATION
Use pd.Series.dt.isocalendar().week to get the week number for a given datetime series.
Note:
column "DATE" must be stored as a datetime column
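The frame returned by isocalendar() also carries the ISO year alongside the week, which yields the week-based year directly. This is a sketch using the dates from the question above, assuming pandas 1.1 or later (where Series.dt.isocalendar() is available).

```python
import pandas as pd

df = pd.DataFrame({'date': pd.to_datetime(
    ['2018-12-31', '2019-01-01', '2019-12-31', '2020-01-01'])})

# isocalendar() returns a frame with 'year', 'week' and 'day' columns
iso = df['date'].dt.isocalendar()
df['week_of_year'] = iso['week']
df['year'] = iso['year']  # ISO year, so 2018-12-31 is counted under 2019
print(df)
```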
I have a dataframe of surface weather observations (fzraHrObs) organized by a station identifier code and date. fzraHrObs has several columns of weather data. The station code and date (datetime objects) look like:
usaf dat
716270 2014-11-23 12:00:00
2015-12-20 08:00:00
2015-12-20 09:00:00
2015-12-21 04:00:00
2015-12-28 03:00:00
716280 2015-12-19 08:00:00
2015-12-19 08:00:00
I would like to get a count of the number of unique dates (days) per year for each station - i.e. the number of days of obs per year at each station. In my example above this would give me:
usaf Year Count
716270 2014 1
2015 3
716280 2014 0
2015 1
I've tried using groupby and grouping by station, year, and date:
grouped = fzraHrObs['dat'].groupby([fzraHrObs['usaf'], fzraHrObs.dat.dt.year, fzraHrObs.dat.dt.date])
Count, size, nunique, etc. on this just gives me the number of obs on each date, not the number of dates themselves per year. Any suggestions on getting what I want here?
Could be something like this, group the date by usaf and year and then count the number of unique values:
import pandas as pd
df.dat.apply(lambda dt: dt.date()).groupby([df.usaf, df.dat.apply(lambda dt: dt.year)]).nunique()
# usaf dat
# 716270 2014 1
# 2015 3
# 716280 2015 1
# Name: dat, dtype: int64
The following should work:
df.groupby(['usaf', df.dat.dt.year])['dat'].apply(lambda s: s.dt.date.nunique())
What I did differently is group by two levels only, then use the nunique method of pandas series to count the number of unique dates in each group.
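Here is a self-contained sketch of the same two-level grouping, rebuilt from the sample rows in the question. The helper columns year and day and the reindex step are illustrative additions; the reindex is only needed if station/year pairs with no observations should appear with a count of 0, as in the question's expected output.

```python
import pandas as pd

df = pd.DataFrame({
    'usaf': [716270] * 5 + [716280] * 2,
    'dat': pd.to_datetime([
        '2014-11-23 12:00:00', '2015-12-20 08:00:00', '2015-12-20 09:00:00',
        '2015-12-21 04:00:00', '2015-12-28 03:00:00',
        '2015-12-19 08:00:00', '2015-12-19 08:00:00']),
})

# Truncate each timestamp to its day, then count distinct days per station and year
days_per_year = (
    df.assign(year=df['dat'].dt.year, day=df['dat'].dt.normalize())
      .groupby(['usaf', 'year'])['day']
      .nunique()
)

# Optional: add the missing station/year pairs with a count of 0
full_index = pd.MultiIndex.from_product(
    [df['usaf'].unique(), sorted(df['dat'].dt.year.unique())],
    names=['usaf', 'year'])
counts = days_per_year.reindex(full_index, fill_value=0)
print(counts)
```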