I have a pandas dataframe, where one column contains a string for the quarter and year in the following format: Q12019
My Question: How do I convert this into datetime format?
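You can use pandas PeriodIndex to accomplish this. First reformat your quarters column so that the year comes first (2019Q1 rather than Q12019); a regex replacement does the job. Pass regex=True explicitly, since newer pandas versions treat the pattern as a literal string by default:
reformatted_quarters = df['QuarterYear'].str.replace(r'(Q\d)(\d+)', r'\2\1', regex=True)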
print(reformatted_quarters)
This prints:
0 2019Q1
1 2018Q2
2 2019Q4
Name: QuarterYear, dtype: object
Then, feed this result to PeriodIndex to get the datetime format. Use 'Q' to specify a quarterly frequency:
datetimes = pd.PeriodIndex(reformatted_quarters, freq='Q').to_timestamp()
print(datetimes)
This prints:
DatetimeIndex(['2019-01-01', '2018-04-01', '2019-10-01'], dtype='datetime64[ns]', name='Quarter', freq=None)
Note: Pandas PeriodIndex functionality experienced a regression in behavior (documented here), so for Pandas versions greater than 0.23.4, you'll need to use reformatted_quarters.values instead:
datetimes = pd.PeriodIndex(reformatted_quarters.values, freq='Q').to_timestamp()
(quarter) => new Date(quarter.slice(-4), 3 * (quarter.slice(1, 2) - 1), 1)
This will give you the start of every quarter (e.g. q42019 will give 2019-10-01).
You should probably add some validation, since an out-of-range quarter silently rolls over into later months (e.g. q52019 = q12020 = 2020-01-01).
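For the pandas side of the question, the same start-of-quarter arithmetic with the suggested validation could look roughly like this. It is a minimal sketch: the helper name quarter_to_timestamp and the pattern QUARTER_RE are made up for illustration, and it assumes every label is a Q, one digit from 1 to 4, and a four-digit year.

```python
import re

import pandas as pd

# Hypothetical pattern: "Q" or "q", a quarter digit 1-4, then a 4-digit year.
QUARTER_RE = re.compile(r"^[Qq]([1-4])(\d{4})$")


def quarter_to_timestamp(label):
    """Return the first day of the quarter, or raise ValueError for bad input."""
    match = QUARTER_RE.match(label)
    if match is None:
        raise ValueError(f"not a valid quarter label: {label!r}")
    quarter, year = int(match.group(1)), int(match.group(2))
    # Quarter 1 starts in month 1, quarter 2 in month 4, and so on.
    return pd.Timestamp(year=year, month=3 * (quarter - 1) + 1, day=1)


quarters = pd.Series(["Q12019", "Q22018", "Q42019"], name="QuarterYear")
starts = quarters.map(quarter_to_timestamp)
```

Because the pattern only accepts quarters 1 to 4, a label such as Q52019 raises a ValueError instead of rolling over into the next year.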
When I read a date say '01/12/2020', which is in the format dd/mm/yyyy, with pd.to_datetime(), it detects the month as 01.
pd.to_datetime('01/12/2020').month
>> 1
But this behavior is not consistent.
When we create a dataframe with a column containing dates in this format, and convert using the same to_datetime function, it then detects 12 as the month.
tt.dt.month[0]
>> 12
What could be the reason?
pandas automagically tries to detect the date format, which can be very nice, or annoying in your case.
Be explicit, use the dayfirst parameter:
pd.to_datetime('01/12/2020', dayfirst=False).month
# 1
pd.to_datetime('01/12/2020', dayfirst=True).month
# 12
Example of ambiguous use:
tt = pd.to_datetime(pd.Series(['30/05/2020', '01/12/2020']))
tt.dt.month
UserWarning: Parsing dates in DD/MM/YYYY format when dayfirst=False (the default) was specified. This may lead to inconsistently parsed dates! Specify a format to ensure consistent parsing.
0 5
1 1
dtype: int64
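As the warning suggests, passing an explicit format removes the guesswork entirely. A minimal sketch, assuming every value in the column really is in dd/mm/yyyy form:

```python
import pandas as pd

dates = pd.Series(["30/05/2020", "01/12/2020"])

# An explicit format means no per-value guessing about day vs. month.
parsed = pd.to_datetime(dates, format="%d/%m/%Y")
months = parsed.dt.month.tolist()
```

With the format fixed, both rows are read day-first, so the second value is parsed as 1 December rather than 12 January.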
I have a large dataframe that, in its date column, has a mixture of date formats (only 2).
Most are in the correct format but there is some data that is in a different format.
i.e. most are 2013-11-07. Some are 20170510. Pandas throws an exception when I try to validate the data against a schema I have.
Is there a quick way to convert all dates to have the same format as the majority? Or do I have to do something more painful/manual?
i.e.
date \
0 2013-11-07 False
2 2013-11-07 False
... ... ... ... ... ...
3595037 20170510 NaN
3595038 20200701 NaN
Is there a quick way to convert all dates to have the same format as the majority?
Considering that you have only two formats, one represented by 2013-11-07 and the other by 20170510, it is enough to remove the - characters from the first to get a common format, i.e.
import pandas as pd
df = pd.DataFrame({'day':['2013-11-07','20170510']})
df['day'] = df['day'].str.replace('-','')
print(df)
output
day
0 20131107
1 20170510
pandas.to_datetime does understand it correctly
df['day'] = pd.to_datetime(df['day'])
print(df)
output
day
0 2013-11-07
1 2017-05-10
Disclaimer: I converted to format of minority not majority. It is possible to convert that to format of majority using regular expression, however if you are interested in datetime objects, this is unnecessary complication.
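If the majority text format is still needed after parsing, for example to satisfy a schema that expects strings, one option is to format the parsed values back with strftime. This is a sketch under the assumption that only these two formats occur; the column names day_parsed and day_text are invented for illustration.

```python
import pandas as pd

df = pd.DataFrame({"day": ["2013-11-07", "20170510"]})

# Strip the dashes so every value has the compact YYYYMMDD form.
compact = df["day"].str.replace("-", "", regex=False)

# Parse with an explicit format, then render back as YYYY-MM-DD strings.
df["day_parsed"] = pd.to_datetime(compact, format="%Y%m%d")
df["day_text"] = df["day_parsed"].dt.strftime("%Y-%m-%d")
```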
I have a DataFrame with one column storing the date.
However, some of these dates are properly formatted datetime values like '2018-12-24 17:00:00', while others are not and are stored like '20181225'.
When I tried to plot these using plotly, the improperly formatted values got turned into EPOCH dates, which is a problem.
Is there any way I can get a copy of the DataFrame with only those rows with properly formatted dates?
I tried using
clean_dict= dailySum_df.where(dailySum_df[isinstance(dailySum_df['time'],datetime.datetime)])
but it doesn't work, failing with the 'Array conditional must be same shape as self' error.
dailySum_df = pd.DataFrame(list(cursors['dailySum']))
trace = go.Scatter(
x=dailySum_df['time'],
y=dailySum_df['countMessageIn']
)
data = [trace]
py.plot(data, filename='basic-line')
Apply dateutil.parser, see also my answer here:
import dateutil.parser as dparser
import pandas as pd

def myparser(x):
    # Return a datetime, or None when the value cannot be parsed as a date
    try:
        return dparser.parse(x)
    except (ValueError, OverflowError, TypeError):
        return None

df = pd.DataFrame({'time': ['2018-12-24 17:00:00', '20181225', 'no date at all'], 'countMessageIn': [1, 2, 3]})
df.time = df.time.apply(myparser)
df = df[df.time.notnull()]
Input:
time countMessageIn
0 2018-12-24 17:00:00 1
1 20181225 2
2 no date at all 3
Output:
time countMessageIn
0 2018-12-24 17:00:00 1
1 2018-12-25 00:00:00 2
Unlike Gustavo's solution this can handle rows with no recognizable date at all and it filters out such rows as required by your question.
If your original time column may contain other text besides the dates themselves, include the fuzzy=True parameter as shown here.
Try parsing the dates column of your dataframe using dateutil.parser.parse and Pandas apply function.
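When the set of formats is known in advance, a pandas-only variant is also possible: parse the column once per format with errors='coerce' and combine the results. This is only a sketch, and it assumes the two formats shown in the question are the only valid ones; anything else ends up as NaT and is dropped.

```python
import pandas as pd

df = pd.DataFrame({
    "time": ["2018-12-24 17:00:00", "20181225", "no date at all"],
    "countMessageIn": [1, 2, 3],
})

# One pass per known format; values that do not match become NaT.
full = pd.to_datetime(df["time"], format="%Y-%m-%d %H:%M:%S", errors="coerce")
compact = pd.to_datetime(df["time"], format="%Y%m%d", errors="coerce")

# Prefer the full timestamp, fall back to the compact date.
df["time"] = full.fillna(compact)

# Keep only the rows where one of the two formats matched.
clean = df[df["time"].notna()]
```

Compared with dateutil, this is stricter: a string only survives if it matches one of the listed formats exactly.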
I am plotting the following pandas MultiIndex DataFrame:
print(log_returns_weekly.head())
AAPL MSFT TSLA FB GOOGL
Date Date
2016 1 -0.079078 0.005278 -0.155689 0.093245 0.002512
2 -0.001288 -0.072344 0.003811 -0.048291 -0.059711
3 0.119746 0.082036 0.179948 0.064994 0.061744
4 -0.150731 -0.102087 0.046722 0.030044 -0.074852
5 0.069314 0.067842 -0.075598 0.010407 0.056264
with the first sub-index representing the year, and the second one the week from that specific year.
This is simply achieved via the pandas plot() method; however, as seen below, the x-axis will not be in a (year, week) format i.e. (2016, 1), (2016, 2) etc. Instead, it simply shows 'Date,Date' - does anyone therefore know how I can overcome this issue?
log_returns_weekly.plot(figsize=(8,8))
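You need to convert your MultiIndex into a single datetime index. Assuming the second level holds ISO week numbers, map each (year, week) pair to the Monday of that week, so that (2016, 1) becomes 2016-01-04 (this needs Python 3.8 or later for fromisocalendar):
import datetime
log1 = log_returns_weekly.set_index(pd.to_datetime([datetime.date.fromisocalendar(int(y), int(w), 1) for y, w in log_returns_weekly.index]))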
log1.plot()
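A self-contained version of the index conversion, on a small stand-in for the frame in the question. It is a sketch that assumes the week level uses ISO week numbering (so weeks 1 to 53 are valid) and Python 3.8 or later; the level names year and week are invented here.

```python
import datetime

import pandas as pd

# Small stand-in for the (year, week) indexed frame from the question.
index = pd.MultiIndex.from_tuples(
    [(2016, 1), (2016, 2), (2016, 3)], names=["year", "week"]
)
weekly = pd.DataFrame({"AAPL": [-0.079078, -0.001288, 0.119746]}, index=index)

# Monday (ISO weekday 1) of each ISO week becomes the new index value.
mondays = [datetime.date.fromisocalendar(int(y), int(w), 1) for y, w in weekly.index]

flat = weekly.copy()
flat.index = pd.to_datetime(mondays)
```

Calling flat.plot() then gives a proper date axis, because the index is a DatetimeIndex instead of a pair of integers.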
Currently, I have a series of Datetime Values that display as so
0 Datetime
1 20041001
2 20041002
3 20041003
4 20041004
they are within a series named
d['Datetime']
They were originally something like
20041001ABCDEF
But I split the end off just to leave them with the remaining numbers. How do I go about putting them into the following format?
2004-10-01
You can do the following:
df['Datetime'] = pd.to_datetime(df['Datetime'], format='%Y%m%d')
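The listing in the question shows the literal text Datetime in row 0, which may be a header line that ended up in the data. If that is the case (an assumption based only on that listing), the strict conversion would raise on that row; a sketch that coerces such values to NaT and drops them:

```python
import pandas as pd

# Row 0 mimics a stray header value inside the data (assumed from the listing).
d = pd.DataFrame({"Datetime": ["Datetime", "20041001", "20041002", "20041003"]})

# Values that do not match YYYYMMDD become NaT instead of raising.
parsed = pd.to_datetime(d["Datetime"], format="%Y%m%d", errors="coerce")

# Replace the column and drop the rows that could not be parsed.
d = d.assign(Datetime=parsed).dropna(subset=["Datetime"])
```

The resulting column has dtype datetime64[ns] and displays as 2004-10-01, 2004-10-02, and so on.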