How to convert all data in column to datetime - pandas - python

I have a large dataframe that, in its date column, has a mixture of date formats (only 2).
Most are in the correct format but there is some data that is in a different format.
For example, most are 2013-11-07, but some are 20170510. Pandas throws an exception when I try to validate the data against a schema I have.
Is there a quick way to convert all dates to have the same format as the majority? Or do I have to do something more painful/manual?
i.e.
date \
0 2013-11-07 False
2 2013-11-07 False
... ... ... ... ... ...
3595037 20170510 NaN
3595038 20200701 NaN

Is there a quick way to convert all dates to have the same format as the majority?
Since you have only two formats, one represented by 2013-11-07 and the other by 20170510, it is enough to remove the - characters from the first to get a common format, i.e.
import pandas as pd
df = pd.DataFrame({'day':['2013-11-07','20170510']})
df['day'] = df['day'].str.replace('-','')
print(df)
output
day
0 20131107
1 20170510
pandas.to_datetime does understand it correctly
df['day'] = pd.to_datetime(df['day'])
print(df)
output
day
0 2013-11-07
1 2017-05-10
Disclaimer: I converted to the format of the minority, not the majority. It is possible to convert to the format of the majority using a regular expression; however, if you are interested in datetime objects, this is an unnecessary complication.
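As an alternative to editing the strings, each known format can be parsed explicitly and the results combined. This is only a sketch: the helper name parse_two_formats and the sample values are made up for illustration, and it assumes the column really contains nothing but those two formats (anything else becomes NaT).

```python
import pandas as pd

def parse_two_formats(s):
    # Hypothetical helper: try each known format, coercing mismatches to NaT.
    iso = pd.to_datetime(s, format="%Y-%m-%d", errors="coerce")
    compact = pd.to_datetime(s, format="%Y%m%d", errors="coerce")
    # Take the first parse where it succeeded, the second one otherwise.
    return iso.fillna(compact)

dates = pd.Series(["2013-11-07", "20170510", "not a date"])
parsed = parse_two_formats(dates)

# If strings in the majority format are needed again, render them explicitly.
as_text = parsed.dt.strftime("%Y-%m-%d")
print(as_text)
```

Rows that match neither format end up as NaT, so they are easy to find with parsed.isna() before validating against the schema.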

Related

Pandas : 'to_datetime' function not consistent with dates

When I read a date say '01/12/2020', which is in the format dd/mm/yyyy, with pd.to_datetime(), it detects the month as 01.
pd.to_datetime('01/12/2020').month
>> 1
But this behavior is not consistent.
When we create a dataframe with a column containing dates in this format, and convert using the same to_datetime function, it then detects 12 as the month.
tt.dt.month[0]
>> 12
What could be the reason ?
pandas automagically tries to detect the date format, which can be very nice, or annoying in your case.
Be explicit, use the dayfirst parameter:
pd.to_datetime('01/12/2020', dayfirst=False).month
# 1
pd.to_datetime('01/12/2020', dayfirst=True).month
# 12
Example of ambiguous use:
tt = pd.to_datetime(pd.Series(['30/05/2020', '01/12/2020']))
UserWarning: Parsing dates in DD/MM/YYYY format when dayfirst=False (the default) was specified. This may lead to inconsistently parsed dates! Specify a format to ensure consistent parsing.
tt.dt.month
0 5
1 1
dtype: int64
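One way to remove the guessing entirely is to pass an explicit format string. A minimal sketch, assuming every value in the column is dd/mm/yyyy (the sample values are taken from the example above):

```python
import pandas as pd

raw = pd.Series(["30/05/2020", "01/12/2020"])

# With an explicit format, every row is read day-first; nothing is inferred.
parsed = pd.to_datetime(raw, format="%d/%m/%Y")

months = parsed.dt.month.tolist()
print(months)  # [5, 12]
```

The exact wording of the warning, and what gets inferred without a format, can differ between pandas versions, which is another reason to be explicit.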

how to convert date in dd/mm/yyyy to days in python csv

I would like to convert dates to days.
Example: 01/01/2001 should be Day 1, 02/01/2001 should be Day 2
I have tried
prices_df['01/01/2001'] = prices_df['days'].dt.days
Without using any external function or a complicated solution, you can simply take the first 2 chars of the string.
int('01/01/2001'[0:2])
Output:
1
If you have to do this in a pandas column:
import pandas as pd
pd.to_numeric(df['days'].str[0:2])
N.B. This works only if all dates are in the form day/month/year with a two-digit, zero-padded day.
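If the column is parsed to datetimes first, the day can be read without string slicing. The question can be read two ways, so this sketch shows both: the day of the month, and a running day number counted from the earliest date. The sample dates and variable names are assumptions for illustration.

```python
import pandas as pd

raw = pd.Series(["01/01/2001", "02/01/2001", "01/02/2001"])
dates = pd.to_datetime(raw, format="%d/%m/%Y")

# Reading 1: the day of the month.
day_of_month = dates.dt.day

# Reading 2: days elapsed since the earliest date, starting at Day 1.
day_number = (dates - dates.min()).dt.days + 1

print(day_of_month.tolist())  # [1, 2, 1]
print(day_number.tolist())    # [1, 2, 32]
```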

Filtering out improperly formatted datetime values in Python DataFrame

I have a DataFrame with one column storing the date.
However, some of these dates are properly formatted datetime objects like '2018-12-24 17:00:00', while others are not and are stored like '20181225'.
When I tried to plot these using plotly, the improperly formatted values got turned into EPOCH dates, which is a problem.
Is there any way I can get a copy of the DataFrame with only those rows with properly formatted dates?
I tried using
clean_dict= dailySum_df.where(dailySum_df[isinstance(dailySum_df['time'],datetime.datetime)])
but it doesn't work due to the 'Array conditional must be same shape as self' error.
dailySum_df = pd.DataFrame(list(cursors['dailySum']))
trace = go.Scatter(
x=dailySum_df['time'],
y=dailySum_df['countMessageIn']
)
data = [trace]
py.plot(data, filename='basic-line')
Apply dateutil.parser, see also my answer here:
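import pandas as pd
import dateutil.parser as dparser
def myparser(x):
    # Return a datetime for parseable values, None otherwise.
    try:
        return dparser.parse(x)
    except (ValueError, OverflowError, TypeError):
        return None
df = pd.DataFrame({'time': ['2018-12-24 17:00:00', '20181225', 'no date at all'], 'countMessageIn': [1, 2, 3]})
df.time = df.time.apply(myparser)
# Keep only the rows whose time could be parsed.
df = df[df.time.notnull()]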
Input:
time countMessageIn
0 2018-12-24 17:00:00 1
1 20181225 2
2 no date at all 3
Output:
time countMessageIn
0 2018-12-24 17:00:00 1
1 2018-12-25 00:00:00 2
Unlike Gustavo's solution this can handle rows with no recognizable date at all and it filters out such rows as required by your question.
If your original time column may contain other text besides the dates themselves, include the fuzzy=True parameter as shown here.
Try parsing the dates column of your dataframe using dateutil.parser.parse and Pandas apply function.
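When the set of formats is known in advance, the same filtering can be sketched with pandas alone, without calling a parser row by row. This assumes only the two formats from the question occur; the column names follow the example above and the remaining names are made up.

```python
import pandas as pd

df = pd.DataFrame({
    "time": ["2018-12-24 17:00:00", "20181225", "no date at all"],
    "countMessageIn": [1, 2, 3],
})

# Parse each known format separately; values that do not match become NaT.
full = pd.to_datetime(df["time"], format="%Y-%m-%d %H:%M:%S", errors="coerce")
compact = pd.to_datetime(df["time"], format="%Y%m%d", errors="coerce")
parsed = full.fillna(compact)

# Keep only the rows where one of the formats matched.
clean = df.assign(time=parsed).dropna(subset=["time"])
print(clean)
```

Unlike dateutil, this will not recognise formats that were not listed, which may or may not be what you want.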

Converting a series of DateTime values to the proper format

Currently, I have a series of Datetime Values that display as so
0 Datetime
1 20041001
2 20041002
3 20041003
4 20041004
they are within a series named
d['Datetime']
They were originally something like
20041001ABCDEF
But I split the end off just to leave them with the remaining numbers. How do I go about putting them into the following format?
2004-10-01
You can do the following,
df['Datetime'] = pd.to_datetime(df['Datetime'], format='%Y%m%d')
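The splitting step and the conversion can also be done together. A small sketch, assuming the original values always start with eight digits followed by other text, as in 20041001ABCDEF (the sample Series is invented):

```python
import pandas as pd

raw = pd.Series(["20041001ABCDEF", "20041002ABCDEF"])

# Pull out the leading eight digits, then parse them as YYYYMMDD.
digits = raw.str.extract(r"^(\d{8})", expand=False)
parsed = pd.to_datetime(digits, format="%Y%m%d")

print(parsed.dt.strftime("%Y-%m-%d").tolist())  # ['2004-10-01', '2004-10-02']
```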

how to change dd-mm-yyyy date format to yyyy-dd-mm in pandas

How to change dd-mm-yyyy date format to yyyy-dd-mm in pandas. I have a datefield which is already in dd-mm-yyyy format but when I try
df[('date')] = pd.to_datetime(df[('date')]).dt.strftime('%Y-%m-%d')
it gives the output as yyyy-dd-mm
I believe this is what you needed.
import pandas as pd
df = pd.read_csv("dates.csv")
df
id date
0 1 25/06/2018
1 2 14-11-2005
2 3 03/10/2010
3 4 13-08-2008
4 5 05-05-2005
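Here there is no need to specify the format as you have tried. Note that an ambiguous value such as 03/10/2010 is read month-first by default (2010-03-10 below); pass dayfirst=True if your data is day-first.
df['date'] = pd.to_datetime(df['date'])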
df
id date
0 1 2018-06-25
1 2 2005-11-14
2 3 2010-03-10
3 4 2008-08-13
4 5 2005-05-05
Pandas datetime series data do not have an inherent string format.
Datetime values are stored internally as integers. For more details, see this answer. String representations are just that, representations. For example, when you use the print command, a specific string representation is used so that data is displayed in a human-readable way.
For most purposes, you should not worry about the representation. If you need a format different from the default representation, i.e. "YYYY-MM-DD", you can use pd.Series.dt.strftime and specify a string format. Python's strftime directives are a useful reference for this.
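To make the distinction concrete, here is a short sketch with made-up sample values: the parsed column has a datetime dtype regardless of how it is displayed, and strftime returns plain strings in whatever layout is requested.

```python
import pandas as pd

s = pd.to_datetime(pd.Series(["25-06-2018", "14-11-2005"]), format="%d-%m-%Y")
print(s.dtype)  # a datetime64 dtype, not text

# The same values rendered in two different layouts; both results are strings.
slashes = s.dt.strftime("%d/%m/%Y")
iso = s.dt.strftime("%Y-%m-%d")

print(slashes.tolist())  # ['25/06/2018', '14/11/2005']
print(iso.tolist())      # ['2018-06-25', '2005-11-14']
```

Once strftime has been applied the column is text again, so it is usually best to keep the datetime column and format only for display or export.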
Use this:
import pandas as pd
# Specify the input format '%d-%m-%Y' and the output format '%Y-%m-%d';
# change the output format as desired, e.g. '%d/%m/%Y' gives dd/mm/yyyy.
df['date'] = pd.to_datetime(df['date'], format='%d-%m-%Y').dt.strftime('%Y-%m-%d')
