Convert SQL timestamp column to date format column of Python dataframe - python

I have data upload in MS Excel format.
This file has a column with dates in "dd.mm.yyyy 00:00:00" format.
Reading file with code:
df = pd.read_excel('data_from_db.xlsx')
I receive a frame where the dates column has "object" dtype. I then convert this column to datetime with:
df['Date_Column'] = pd.to_datetime(df['Date_Column'])
That gives me "datetime64[ns]" type.
But this command does not always work correctly. I see rows with muddled data:
some rows come out as "yyyy.mm.dd",
others as "yyyy.dd.mm".
How should I correctly convert an Excel column in "dd.mm.yyyy 00:00:00" format to a pandas dataframe column with date type and "dd.mm.yyyy" format?
P.S. I also noticed an oddity: some values in the raw date column have str type, others float. I can't wrap my head around it, because the raw table is an upload from a database.

Without specifying a format, pd.to_datetime has to guess from the data how each date string should be interpreted. With default parameters this fails for the second and third rows of your data:
In [5]: date_of_hire = pd.Series(['18.01.2018 0:00:00',
   ...:                           '01.02.2018 0:00:00',
   ...:                           '06.11.2018 0:00:00'])
In [6]: pd.to_datetime(date_of_hire)
Out[6]:
0 2018-01-18
1 2018-01-02
2 2018-06-11
dtype: datetime64[ns]
The quickest solution would be to pass dayfirst=True:
In [7]: pd.to_datetime(date_of_hire, dayfirst=True)
Out[7]:
0 2018-01-18
1 2018-02-01
2 2018-11-06
dtype: datetime64[ns]
If you know the complete format of your data, you can specify it directly. This only works if the format matches exactly: if a row lacks, e.g., the time component, the conversion will fail.
In [8]: pd.to_datetime(date_of_hire, format='%d.%m.%Y %H:%M:%S')
Out[8]:
0 2018-01-18
1 2018-02-01
2 2018-11-06
dtype: datetime64[ns]
In case you have little information about the date format except that it is consistent, pandas can infer the format from the data beforehand (note that infer_datetime_format is deprecated since pandas 2.0, where a consistent format is inferred by default):
In [9]: pd.to_datetime(date_of_hire, infer_datetime_format=True)
Out[9]:
0 2018-01-18
1 2018-02-01
2 2018-11-06
dtype: datetime64[ns]
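The question also asks for a "dd.mm.yyyy" display. A datetime64 column has no display format of its own, so if you need that exact rendering you can format back to strings with dt.strftime after parsing (a sketch; note the result is object dtype, so do this only for output, not for further date arithmetic):

```python
import pandas as pd

date_of_hire = pd.Series(['18.01.2018 0:00:00',
                          '01.02.2018 0:00:00',
                          '06.11.2018 0:00:00'])

# parse with the explicit day-first format, then render as dd.mm.yyyy strings
parsed = pd.to_datetime(date_of_hire, format='%d.%m.%Y %H:%M:%S')
formatted = parsed.dt.strftime('%d.%m.%Y')
print(formatted.tolist())  # ['18.01.2018', '01.02.2018', '06.11.2018']
```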

Related

how to manually enter a Python dataframe with daily dates in a correct format

I would like to (manually) create a dataframe in Python with daily dates (in column 'date'), as in the code below.
But the code does not produce the correct format for the daily dates and mangles them (the desired representation is below).
Could you please advise how I can correct the code so that the 'date' column has the desired format?
Thanks in advance!
------------------------------------------------------
desired format for date column
2021-03-22 3
2021-04-07 3
2021-04-18 3
2021-05-12 0
------------------------------------------------------
df1 = pd.DataFrame({"date": [2021-3-22, 2021-4-7, 2021-4-18, 2021-5-12],
"x": [3, 3, 3, 0 ]})
df1
date x
0 1996 3
1 2010 3
2 1999 3
3 2004 0
Python interprets the numbers in the sequence 2021-3-22 as a series of arithmetic operations: 2021 minus 3 minus 22.
If you want each item stored as a string that resembles a date, you need to mark them as string literals (str), as shown below, by wrapping them in quotes.
import pandas as pd
df1 = pd.DataFrame({"date": ['2021-3-22', '2021-4-7', '2021-4-18', '2021-5-12'],
"x": [3, 3, 3, 0 ]})
The results for the date column, shown here, indicate that it contains elements of the object datatype, which encompasses str in pandas. Notice that the strings were stored exactly as written (2021-3-22 instead of 2021-03-22).
0 2021-3-22
1 2021-4-7
2 2021-4-18
3 2021-5-12
Name: date, dtype: object
IF however, you actually want them stored as datetime objects so that you can do datetime manipulations on them (i.e. determine the number of days between two dates OR filter by a specific month OR year) then you need to convert the values to datetime objects.
This technique will do that:
df1['date'] = pd.to_datetime(df1['date'])
The results of this conversion are Pandas datetime objects which enable nanosecond precision (I differentiate this from Python datetime objects which are limited to microsecond precision).
0 2021-03-22
1 2021-04-07
2 2021-04-18
3 2021-05-12
Name: date, dtype: datetime64[ns]
Notice the displayed results are now formatted just as you would expect of datetimes (2021-03-22 instead of 2021-3-22).
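The datetime manipulations mentioned above then become one-liners. A small sketch, reusing the same df1 (the 51-day span and the April filter are just illustrative checks):

```python
import pandas as pd

df1 = pd.DataFrame({"date": pd.to_datetime(['2021-3-22', '2021-4-7',
                                            '2021-4-18', '2021-5-12']),
                    "x": [3, 3, 3, 0]})

# number of days between the first and last date
span = (df1['date'].iloc[-1] - df1['date'].iloc[0]).days
print(span)  # 51

# filter rows falling in a specific month
april = df1[df1['date'].dt.month == 4]
print(len(april))  # 2
```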
You would want to create the series as a datetime directly, using the format codes documented at pandas.to_datetime when building it from strings:
df1 = pd.DataFrame({"date": pd.to_datetime(["2021-3-22", "2021-4-7", "2021-4-18", "2021-5-12"]),
"x": [3, 3, 3, 0 ]})
FWIW, I often use pd.read_csv(io.StringIO(text)) to copy/paste tabular-looking data into a DataFrame (for example, from SO questions).
Example:
import io
import re
import pandas as pd
def df_read(txt, **kwargs):
    txt = '\n'.join([s.strip() for s in txt.splitlines()])
    return pd.read_csv(io.StringIO(re.sub(r' +', '\t', txt)), sep='\t', **kwargs)
txt = """
date value
2021-03-22 3
2021-04-07 3
2021-04-18 3
2021-05-12 0
"""
df = df_read(txt, parse_dates=['date'])
>>> df
date value
0 2021-03-22 3
1 2021-04-07 3
2 2021-04-18 3
3 2021-05-12 0
>>> df.dtypes
date datetime64[ns]
value int64
dtype: object

Converting pandas column to date, with many type of dates

How would you convert the below date column into a single formatted date column?
df = pd.DataFrame(data={'datecol': ["-",
"44198",
"2021/01/01",
"14.04.20",
"2021-13-03"]})
print(df.dropna()) should return the result below:
datecol
0 2021-01-02
1 2021-01-01
2 2020-04-14
3 2021-03-13
Convert all valid datetime formats using pd.to_datetime, specifying formats for unrecognised ones.
Convert all integer (Excel serial) dates.
Combine both with fillna.
parsed = pd.to_datetime(df["datecol"], errors="coerce").fillna(
    pd.to_datetime(df["datecol"], format="%Y-%d-%m", errors="coerce"))
ordinal = pd.to_numeric(df["datecol"], errors="coerce").apply(
    lambda x: pd.Timestamp("1899-12-30") + pd.Timedelta(x, unit="D"))
df["datecol"] = parsed.fillna(ordinal)
>>> df
datecol
0 NaT
1 2021-01-02
2 2021-01-01
3 2020-04-14
4 2021-03-13
If a column contains multiple formats, you're going to need to parse the column multiple times with the different formats and use combine_first to combine the resulting information. Because we specify errors='coerce', each parse attempt returns NaT where its format doesn't match, so every row should be matched by exactly one format.
The other small complication is that some of your formats only require the format argument, while others require the origin and unit parameters. We can take care of this by passing a dict of kwargs to pd.to_datetime.
Note any numeric values will work with origin and unit so you can't use this method if your date column had values that represented different units with different offsets in the same column. You would need to provide other logic to indicate which units and offsets are pertinent to which rows in that case.
import pandas as pd
from functools import reduce
kwl = [{'format': '%Y/%m/%d'},
       {'format': '%d.%m.%y'},
       {'format': '%Y-%d-%m'},
       {'unit': 'd', 'origin': '1899-12-30'}]
l = []
for kwargs in kwl:
if 'unit' in kwargs.keys():
s = pd.to_numeric(df['datecol'], errors='coerce')
else:
s = df['datecol']
l.append(pd.to_datetime(s, errors='coerce', **kwargs))
result = reduce(lambda l,r: l.combine_first(r), l)
print(result)
#0 NaT
#1 2021-01-02
#2 2021-01-01
#3 2020-04-14
#4 2021-03-13
Name: datecol, dtype: datetime64[ns]
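The Excel-serial branch can be checked in isolation (a sketch; 1899-12-30 is the usual Windows-Excel epoch, and non-numeric entries such as "-" coerce to NaT):

```python
import pandas as pd

# "-" becomes NaN under to_numeric, then NaT under to_datetime
serial = pd.to_numeric(pd.Series(["-", "44198"]), errors="coerce")
dates = pd.to_datetime(serial, unit="D", origin="1899-12-30")
print(dates.tolist())  # [NaT, Timestamp('2021-01-02 00:00:00')]
```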

Pandas: Change column of integers to datetime and add a timestamp

I have a dataframe with an id column, and a date column made up of an integer.
d = {'id': [1, 2], 'date': [20161031, 20170930]}
df = pd.DataFrame(data=d)
id date
0 1 20161031
1 2 20170930
I can convert the date column to an actual date like so.
df['date'] = df['date'].apply(lambda x: pd.to_datetime(str(x), format='%Y%m%d'))
id date
0 1 2016-10-31
1 2 2017-09-30
But I need this field as a timestamp with hours, minutes, and seconds so that it is compatible with my database table. I don't care what the values are; we can keep it easy by setting them to zeros.
2016-10-31 00:00:00
2017-09-30 00:00:00
What is the best way to change this field to a timestamp? I tried
df['date'] = df['date'].apply(lambda x: pd.to_datetime(str(x), format='%Y%m%d%H%M%S'))
but pandas didn't like that.
I think I could append six 0's to the end of every value in that field and then use the above statement, but I was wondering if there is a better way.
With pandas it is simpler and faster to convert entire columns. First convert to string, then to timestamp:
pd.to_datetime(df['date'].apply(str))
P.S. There are a few other conversion methods of varying performance: https://datatofish.com/fastest-way-to-convert-integers-to-strings-in-pandas-dataframe/
The problem seems to be that pd.to_datetime doesn't accept dates in this integer format:
pd.to_datetime(20161031) gives Timestamp('1970-01-01 00:00:00.020161031')
It assumes the integers are nanoseconds since 1970-01-01.
You have to convert to a string first:
df['date'] = pd.to_datetime(df["date"].astype(str))
Output:
id date
0 1 2016-10-31
1 2 2017-09-30
Note that these are datetimes so they include a time component (which are all zero in this case) even though they are not shown in the data frame representation above.
print(df.loc[0,'date'])
Out:
Timestamp('2016-10-31 00:00:00')
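If the database layer needs the literal zero-padded text rather than a datetime value, you can render it with dt.strftime after parsing (a sketch; most drivers accept the datetime objects directly, so this extra step is often unnecessary):

```python
import pandas as pd

df = pd.DataFrame({'id': [1, 2], 'date': [20161031, 20170930]})
df['date'] = pd.to_datetime(df['date'].astype(str), format='%Y%m%d')

# render as zero-padded 'YYYY-MM-DD HH:MM:SS' strings
as_text = df['date'].dt.strftime('%Y-%m-%d %H:%M:%S')
print(as_text.tolist())  # ['2016-10-31 00:00:00', '2017-09-30 00:00:00']
```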
You can also pad the integers out to full 14-digit timestamps and parse with the long format (the "append six zeros" idea from the question):
df['date'] = pd.to_datetime(df['date'].astype(str) + '000000', format='%Y%m%d%H%M%S')

pandas date-time format VS google sheets date format

I have a column in my dataset that looks like this:
date
41245.0
41701.0
36361.0
I need to convert it into a date-format. When I try it in Python using this:
df = pd.to_datetime(df['date'])
My results are like this:
1 1970-01-01 00:00:00.000041701
4 1970-01-01 00:00:00.000042226
5 1970-01-01 00:00:00.000039031
These years seem quite odd. However, when I open my dataset (as an Excel sheet) in Google Drive/Sheets, select the column, and format it using the "date" or "date-time" format, the results are quite different.
12/2/2012
3/3/2014
7/20/1999
My results should be something like this, but currently I am getting weird values. The results in Microsoft Excel were also slightly different. Why are the dates different? What am I doing wrong?
Those are day counts, but the origin is the 1899-12-30 serial-date epoch used by Excel and Google Sheets, not the default 1970-01-01:
pd.to_datetime(df.date, unit='d', origin='1899-12-30')
Out[205]:
0   2012-12-02
1   2014-03-03
2   1999-07-20
Name: date, dtype: datetime64[ns]
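You can sanity-check the epoch against a value Sheets has already decoded for you: 41245.0 displays as 12/2/2012, and the day count round-trips only with the 1899-12-30 origin (a quick check, not production code):

```python
import pandas as pd

epoch = pd.Timestamp('1899-12-30')  # serial-date epoch used by Excel and Google Sheets

# 41245 days after the epoch lands on the date Sheets displays, 12/2/2012
assert epoch + pd.Timedelta(days=41245) == pd.Timestamp('2012-12-02')

# and the round trip recovers the serial number
serial = (pd.Timestamp('2012-12-02') - epoch).days
print(serial)  # 41245
```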

how to convert strings of different date formats into a single date format in python?

I have some of my dates as 26-07-10 and others as 4/8/2010, as string type, in a CSV. I want them all in a single format like 4/8/2010 so that I can parse them and group them by year. Is there a function in Python or pandas that can help me?
You can parse these date forms using the parse_dates param of read_csv. Note, however, that it may fail for ambiguous forms, for instance if you mixed month-first forms with day-first ones:
In [7]:
t="""date
26-07-10
4/8/2010"""
df = pd.read_csv(io.StringIO(t), parse_dates=[0])
df
Out[7]:
date
0 2010-07-26
1 2010-04-08
You can alter the displayed format by changing the string format using dt.strftime:
In [10]:
df['date'].dt.strftime('%d/%m/%Y')
Out[10]:
0 26/07/2010
1 08/04/2010
Name: date, dtype: object
Really though it's better to keep the column as a datetime you can then groupby on year:
In [11]:
t="""date,val
26-07-10,23
4/8/2010,5567"""
df = pd.read_csv(io.StringIO(t), parse_dates=[0])
df
Out[11]:
date val
0 2010-07-26 23
1 2010-04-08 5567
In [12]:
df.groupby(df['date'].dt.year).mean()
Out[12]:
val
date
2010 2795
You can try using the parse_dates parameter of pd.read_csv() as mentioned by @EdChum.
Alternatively, you could typecast them to a standard format, like that of datetime.date as follows:
import io
import datetime
t=u"""date
26-07-10
4/8/2010"""
df = pd.read_csv(io.StringIO(t), parse_dates=[0])
df['date'] = df['date'].dt.date
df
out:
date
0 2010-07-26
1 2010-04-08
