Pandas : 'to_datetime' function not consistent with dates

When I read a date, say '01/12/2020', which is in the format dd/mm/yyyy, with pd.to_datetime(), it detects the month as 01.
pd.to_datetime('01/12/2020').month
>> 1
But this behavior is not consistent.
When we create a dataframe with a column containing dates in this format, and convert using the same to_datetime function, it then detects 12 as the month.
tt.dt.month[0]
>> 12
What could be the reason?

pandas tries to detect the date format automatically, which is often convenient but, as in your case, can be surprising.
Be explicit, use the dayfirst parameter:
pd.to_datetime('01/12/2020', dayfirst=False).month
# 1
pd.to_datetime('01/12/2020', dayfirst=True).month
# 12
Example of ambiguous use:
tt = pd.to_datetime(pd.Series(['30/05/2020', '01/12/2020']))
UserWarning: Parsing dates in DD/MM/YYYY format when dayfirst=False (the default) was specified. This may lead to inconsistently parsed dates! Specify a format to ensure consistent parsing.
tt.dt.month
0 5
1 1
dtype: int64
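The warning suggests specifying a format. A minimal sketch of that, assuming every value in the column really is dd/mm/yyyy (the variable name raw is only illustrative):

```python
import pandas as pd

raw = pd.Series(['30/05/2020', '01/12/2020'])

# An explicit format removes the guesswork: every value is read as day/month/year.
tt = pd.to_datetime(raw, format='%d/%m/%Y')

print(tt.dt.month.tolist())  # [5, 12]
```

With an explicit format, a value that does not match raises an error instead of being silently parsed the other way round.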

Related

How to convert all data in column to datetime - pandas

I have a large dataframe that, in its date column, has a mixture of date formats (only 2).
Most are in the correct format but there is some data that is in a different format.
i.e. most are 2013-11-07. Some are 20170510. Pandas throws an exception when I try to validate the data against a schema I have.
Is there a quick way to convert all dates to have the same format as the majority? Or do I have to do something more painful/manual?
i.e.
date \
0 2013-11-07 False
2 2013-11-07 False
... ... ... ... ... ...
3595037 20170510 NaN
3595038 20200701 NaN

Is there a quick way to convert all dates to have the same format as the majority?
Since you have only two formats, one like 2013-11-07 and the other like 20170510, it is enough to remove the - from the first to get a common format, i.e.
import pandas as pd
df = pd.DataFrame({'day':['2013-11-07','20170510']})
df['day'] = df['day'].str.replace('-','')
print(df)
output
day
0 20131107
1 20170510
pandas.to_datetime parses this format correctly:
df['day'] = pd.to_datetime(df['day'])
print(df)
output
day
0 2013-11-07
1 2017-05-10
Disclaimer: I converted to the format of the minority, not the majority. It is possible to convert to the majority format using a regular expression; however, if you are interested in datetime objects, this is an unnecessary complication.
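If you would rather not edit the strings at all, another option is to parse the column once per format and combine the results. This is only a sketch under the assumption that the column holds exactly these two formats; parse_two_formats is a hypothetical helper name, not a pandas function:

```python
import pandas as pd

def parse_two_formats(values: pd.Series) -> pd.Series:
    """Parse a column that mixes 'YYYY-MM-DD' and 'YYYYMMDD' strings."""
    # errors='coerce' turns values that do not match the format into NaT.
    dashed = pd.to_datetime(values, format='%Y-%m-%d', errors='coerce')
    compact = pd.to_datetime(values, format='%Y%m%d', errors='coerce')
    # Prefer the dashed result and fill the remaining gaps from the compact one.
    return dashed.fillna(compact)

df = pd.DataFrame({'day': ['2013-11-07', '20170510', 'not a date']})
df['day'] = parse_two_formats(df['day'])
print(df)
```

Values that match neither format end up as NaT, so they are easy to find afterwards with df['day'].isna().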

Pandas reads date from CSV incorrectly

I am very new to Python, and finding it very frustrating.
I have a CSV that I am importing, but it's reading the date column incorrectly.
In the Month column, I have the 1st of each month - so it should read (yyyy-mm-dd):
2020-01-01
2020-02-01
2020-03-01
etc
however, it's reading it as (yyyy-dd-mm)
2020-01-01
2020-01-02
2020-01-03
etc
I've tried several conversion functions from stackoverflow as well as other websites, but they either just don't work, or do nothing.
My import is as follows:
try:
    collections_data = pd.read_csv('./monthly_collections.csv')
    print("Collections Data imported successfully.")
except Exception as e:
    print("Error importing Collections Data!")
I have tried the parse_dates parameter on the import, but it doesn't help.
If I then try this:
temp = pd.to_datetime(collections_data['Collections Month'], format='%m/%d/%Y')
temp
then I get
As you can see, it is reading the months as the days. In other words, it shows individual days of the month instead of the 1st day of each month.
I'd greatly appreciate some help to get these dates corrected, as I need to do some date calculations on them, and also join two tables based on this date - which is going to be my next problem.
Kind Regards

Inferring date format
Some dates are ambiguous, while others aren't. Consider these dates:
2020-27-01
2020-12-14
2020-01-02
11-10-12
In examples #1 and #2 we can easily infer the date format. In example #1, the first four digits have to be the year (there's no 2020th month or 2020th day of a month), the following two digits have to be the day of the month (there's no 27th month and we already have the year), and the last two digits are the month (we already have the year and the day of the month). We can use a similar approach for example #2.
For example #3, is that the first day of the second month, or the second day of the first month? It's impossible to tell without more information. If, for instance, we had the following sequence of dates: '2020-22-01', '2020-25-01', '2020-01-02', it would be reasonable to infer that '2020-01-02' refers to the first day of the second month; otherwise we would not be able to parse the previous two dates.
In example #4, it's impossible to infer the date format. Any pair of digits would make sense as a year, month or day. (pandas.read_csv() accepts a dayfirst kwarg, pandas.to_datetime() accepts dayfirst and yearfirst, or you can declare the date format explicitly with pandas.to_datetime(some_series, format=...).)
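The reasoning above can be sketched in code: try each candidate format against every date in the sequence and keep only the formats that parse all of them. This uses the standard library's datetime.strptime; consistent_formats and CANDIDATE_FORMATS are hypothetical names made up for this illustration, and the two candidates are an assumption about which layouts are plausible.

```python
from datetime import datetime

# Assumed candidates: year first, then either month-day or day-month.
CANDIDATE_FORMATS = ("%Y-%m-%d", "%Y-%d-%m")

def consistent_formats(values, candidates=CANDIDATE_FORMATS):
    """Return the candidate formats that successfully parse every value."""
    matches = []
    for fmt in candidates:
        try:
            for value in values:
                datetime.strptime(value, fmt)
        except ValueError:
            continue  # at least one value does not fit this format
        matches.append(fmt)
    return matches

print(consistent_formats(["2020-22-01", "2020-25-01", "2020-01-02"]))  # ['%Y-%d-%m']
print(consistent_formats(["2020-01-02"]))  # ['%Y-%m-%d', '%Y-%d-%m']
```

When more than one format survives, as in the second call, the data alone cannot settle the question and the format has to be stated explicitly.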
Your problem
Your dates are ambiguous: from what you've included in your question it is not possible to infer whether they are in a day-first format (dd-mm) or a month-first format (mm-dd). pandas defaults to dayfirst=False, so an ambiguous date such as 01/02/2020 is read as the second day of the first month unless you specify otherwise. See pandas.read_csv().
dayfirst : bool, default False
DD/MM format dates, international and European format.
This means that in order to parse 01/02 (DD/MM), 2020/02/01 (iso/international format) or 01/02/2020 (European format) as the first day of the second month, you will need to specify pandas.read_csv(somefile.csv, ... dayfirst=True).
I've tried several conversion functions from stackoverflow as well as other websites, but they either just don't work, or do nothing.
You haven't provided the code that you've used that didn't work, nor the code which you used which parsed your dates as month first. If you include an example of what you actually tried I can make a specific comment.
In your question you say that your date format is in (yyyy-mm-dd) but you passed format='%m/%d/%Y' and in your screenshots you have '/' and '-' as your separator in different places. So I'm not sure what your original dates look like.
What you passed to the format kwarg means the first two digits are a zero-padded month (e.g. 04), followed by a '/', then a zero-padded day, followed by a '/' and then the year as yyyy. If what you wrote at the beginning of your question is correct, you should have passed format='%Y-%m-%d' (see the strftime format codes).

Try https://towardsdatascience.com/4-tricks-you-should-know-to-parse-date-columns-with-pandas-read-csv-27355bb2ad0e
Essentially, try the dayfirst optional input for the read_csv function.
You would set it to True and have
collections_data = pd.read_csv('./monthly_collections.csv', dayfirst = True)
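Note that dayfirst only has an effect on columns that read_csv is asked to parse as dates, so it is usually combined with parse_dates. A small sketch, assuming the file stores dd/mm/yyyy dates in a column named 'Collections Month'; the in-memory CSV text and the 'Amount' column are made up so the example runs on its own:

```python
import io
import pandas as pd

# Hypothetical stand-in for monthly_collections.csv, with dd/mm/yyyy dates.
csv_text = "Collections Month,Amount\n01/02/2020,100\n01/03/2020,150\n"

collections_data = pd.read_csv(
    io.StringIO(csv_text),
    parse_dates=['Collections Month'],  # columns to convert to datetime
    dayfirst=True,                      # read 01/02/2020 as 1 February
)

print(collections_data['Collections Month'].dt.month.tolist())  # [2, 3]
```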

Convert "Q12019" object to datetime64

I have a pandas dataframe, where one column contains a string for the quarter and year in the following format: Q12019
My Question: How do I convert this into datetime format?

You can use Pandas PeriodIndex to accomplish this. Just reformat your quarters column to the expected format %Y-%q (with some help from regex, move the year to the front):
reformatted_quarters = df['QuarterYear'].str.replace(r'(Q\d)(\d+)', r'\2\1', regex=True)
print(reformatted_quarters)
This prints:
0 2019Q1
1 2018Q2
2 2019Q4
Name: QuarterYear, dtype: object
Then, feed this result to PeriodIndex to get the datetime format. Use 'Q' to specify a quarterly frequency:
datetimes = pd.PeriodIndex(reformatted_quarters, freq='Q').to_timestamp()
print(datetimes)
This prints:
DatetimeIndex(['2019-01-01', '2018-04-01', '2019-10-01'], dtype='datetime64[ns]', name='Quarter', freq=None)
Note: Pandas PeriodIndex functionality experienced a regression in behavior (documented here), so for Pandas versions greater than 0.23.4, you'll need to use reformatted_quarters.values instead:
datetimes = pd.PeriodIndex(reformatted_quarters.values, freq='Q').to_timestamp()

(quarter) => new Date(quarter.slice(-4), 3 * (quarter.slice(1, 2) - 1), 1)
This will give you the start of every quarter (e.g. q42019 will give 2019-10-01).
You should probably include some validation since it will just keep rolling over months (e.g. q52019 = q12020 = 2020-01-01)
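The same idea with the suggested validation might look like this in Python. It is only a sketch: quarter_start is a hypothetical helper, and it assumes the input is always a 'Q', one quarter digit and a four-digit year.

```python
import re
from datetime import date

def quarter_start(label: str) -> date:
    """Return the first day of the quarter for labels such as 'Q12019'."""
    match = re.fullmatch(r"[Qq]([1-4])(\d{4})", label)
    if match is None:
        # Rejects quarters outside 1-4 instead of rolling over into the next year.
        raise ValueError(f"not a valid quarter label: {label!r}")
    quarter, year = int(match.group(1)), int(match.group(2))
    return date(year, 3 * (quarter - 1) + 1, 1)

print(quarter_start("Q12019"))  # 2019-01-01
print(quarter_start("q42019"))  # 2019-10-01
```

In pandas this can be applied per row with df['QuarterYear'].apply(quarter_start).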

Filtering out improperly formatted datetime values in Python DataFrame

I have a DataFrame with one column storing the date.
However, some of these dates are properly formatted datetime objects like '2018-12-24 17:00:00' while others are not and are stored like '20181225'.
When I tried to plot these using plotly, the improperly formatted values got turned into EPOCH dates, which is a problem.
Is there any way I can get a copy of the DataFrame with only those rows with properly formatted dates?
I tried using
clean_dict= dailySum_df.where(dailySum_df[isinstance(dailySum_df['time'],datetime.datetime)])
but it doesn't work because of the 'Array conditional must be same shape as self' error.
dailySum_df = pd.DataFrame(list(cursors['dailySum']))
trace = go.Scatter(
    x=dailySum_df['time'],
    y=dailySum_df['countMessageIn']
)
data = [trace]
py.plot(data, filename='basic-line')

Apply dateutil.parser, see also my answer here:
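import dateutil.parser as dparser
import pandas as pd

def myparser(x):
    # Return a datetime for parseable strings, None for anything else.
    try:
        return dparser.parse(x)
    except (ValueError, OverflowError, TypeError):
        return None

df = pd.DataFrame({'time': ['2018-12-24 17:00:00', '20181225', 'no date at all'], 'countMessageIn': [1, 2, 3]})
df.time = df.time.apply(myparser)
# Keep only the rows whose time could be parsed.
df = df[df.time.notnull()]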
Input:
time countMessageIn
0 2018-12-24 17:00:00 1
1 20181225 2
2 no date at all 3
Output:
time countMessageIn
0 2018-12-24 17:00:00 1
1 2018-12-25 00:00:00 2
Unlike Gustavo's solution this can handle rows with no recognizable date at all and it filters out such rows as required by your question.
If your original time column may contain other text besides the dates themselves, include the fuzzy=True parameter as shown here.

Try parsing the dates column of your dataframe using dateutil.parser.parse and Pandas apply function.

how to change dd-mm-yyyy date format to yyyy-dd-mm in pandas

How to change dd-mm-yyyy date format to yyyy-dd-mm in pandas. I have a datefield which is already in dd-mm-yyyy format but when I try
df[('date')] = pd.to_datetime(df[('date')]).dt.strftime('%Y-%m-%d')
it gives the output as yyyy-dd-mm

I believe this is what you needed.
import pandas as pd
df = pd.read_csv("dates.csv")
df
id date
0 1 25/06/2018
1 2 14-11-2005
2 3 03/10/2010
3 4 13-08-2008
4 5 05-05-2005
Here there is no need to specify the format as you have tried.
df['date'] = pd.to_datetime(df['date'])
df
id date
0 1 2018-06-25
1 2 2005-11-14
2 3 2010-03-10
3 4 2008-08-13
4 5 2005-05-05

Pandas datetime series data do not have an inherent string format.
datetime values are stored internally as integers. For more details, see this answer. String representations are just that, representations. For example, when you use the print command, a specific string representation is used so that data is displayed in a human-readable way.
For most purposes, you should not worry about the representation. If you need a format different from the default representation, i.e. "YYYY-MM-DD", you can use pd.Series.dt.strftime and specify a string format. For this, Python's strftime directives are a useful resource.
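To illustrate the difference between the stored values and their string representation, here is a short sketch with made-up sample dates. It assumes the input strings are dd-mm-yyyy:

```python
import pandas as pd

dates = pd.to_datetime(pd.Series(['25-06-2018', '14-11-2005']), format='%d-%m-%Y')

# The values are datetimes, so date components are available directly.
print(dates.dt.year.tolist())  # [2018, 2005]

# strftime produces strings in whatever layout you ask for.
as_text = dates.dt.strftime('%d/%m/%Y')
print(as_text.tolist())  # ['25/06/2018', '14/11/2005']
```

The result of strftime is a column of strings, so date arithmetic and the .dt accessor are no longer available on it. Keep the datetime column for calculations and format only for display or export.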

Use this:
import pandas as pd
# Specify the input format '%d-%m-%Y' and the output format '%Y-%m-%d'; change the output as desired, e.g. '%d/%m/%Y' gives dd/mm/yyyy.
df['date'] = pd.to_datetime(df['date'], format='%d-%m-%Y').dt.strftime('%Y-%m-%d')
