I have a column in my dataset that looks like this:
date
41245.0
41701.0
36361.0
I need to convert it into a date-format. When I try it in Python using this:
df = pd.to_datetime(df['date'])
My results are like this:
1 1970-01-01 00:00:00.000041701
4 1970-01-01 00:00:00.000042226
5 1970-01-01 00:00:00.000039031
These years seem quite odd. However, when I open my dataset (as an Excel sheet) in Google Drive/Sheets, select the column, and format it using the "date" or "date-time" format, the results are quite different:
12/2/2012
3/3/2014
7/20/1999
My results should be something like this, but instead I am getting the odd values above. The results in Microsoft Excel were also slightly different. Why are the dates different? What am I doing wrong?
Those numbers are day counts, but the origin is Excel's 1900-01-01 epoch, not pandas' default of 1970-01-01. (Without unit='d', pandas also reads the values as nanoseconds since 1970, which is why every result landed on 1970-01-01.)
pd.to_datetime(df.date,unit='d',origin='1900-01-01')
Out[205]:
0 2012-12-04
1 2014-03-05
2 1999-07-22
Name: date, dtype: datetime64[ns]
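Note that this output is still two days off from what Sheets shows (2012-12-04 vs. 12/2/2012). Excel's serial-day system effectively counts from 1899-12-30 (serial 1 is 1900-01-01, and Excel also includes a fictitious 1900-02-29), so using that origin reproduces the spreadsheet dates exactly:

```python
import pandas as pd

# Excel serial day numbers count from an effective epoch of 1899-12-30,
# so this origin matches what Sheets/Excel display for the same column.
df = pd.DataFrame({"date": [41245.0, 41701.0, 36361.0]})
converted = pd.to_datetime(df["date"], unit="D", origin="1899-12-30")
print(converted)
# 0   2012-12-02
# 1   2014-03-03
# 2   1999-07-20
```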
Related
I am reading a dataframe:
df = pd.read_csv("file_path")
df['time'] = pd.to_datetime(df['time'], format="%Y-%m-%d")
When I want to access df['time'].dt in order to get year, month, etc. I get the error that:
'Timestamp' object has no attribute 'dt'.
There are many posts on Stack Overflow related to my question. I tried the suggested solution of using to_pydatetime(), but it did not work. I appreciate any help with this.
To get/extract an attribute like the hour or year of a Timestamp in pandas, you can access the attribute by name directly (this works when df['time'] is a single Timestamp, as your error message suggests):
Example:
df['time'].year
df['time'].month
df['time'].day
df['time'].hour
ref: https://pandas.pydata.org/docs/reference/api/pandas.Timestamp.hour.html
https://pandas.pydata.org/docs/reference/api/pandas.Timestamp.year.html
OK then, I'm confused about what code you actually have. The error message you cite says that you have a Timestamp object, not a Series. If you have a Series, then you need dt. If you have an individual element, then you do not need dt. Here is a snippet based on your code that works perfectly. If this is not what you have, then you need to edit the question.
import pandas as pd
data = [
'2019-06-06',
'2020-04-01',
'2021-01-01'
]
df = pd.DataFrame(data, columns=['time'] )
print(df)
df['time'] = pd.to_datetime(df['time'], format="%Y-%m-%d")
print(df)
print(df['time'].dt.year)
Output:
time
0 2019-06-06
1 2020-04-01
2 2021-01-01
time
0 2019-06-06
1 2020-04-01
2 2021-01-01
0 2019
1 2020
2 2021
Name: time, dtype: int64
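The distinction above fits in a few lines: the .dt accessor exists on a datetime Series, while an individual element pulled out of it is a Timestamp that carries the attributes directly:

```python
import pandas as pd

s = pd.to_datetime(pd.Series(['2019-06-06', '2020-04-01']))
print(s.dt.year.tolist())  # a Series needs the .dt accessor
ts = s.iloc[0]             # an element is a Timestamp: no .dt
print(ts.year, ts.month)
```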
Sorry if this question has been asked before, but I can't seem to find one that describes my current issue.
Basically, I have a large climate dataset that is not bound to "real" dates. The dataset starts at "year one" and goes to "year 9999". These dates are stored as strings such as Jan-01, Feb-01, Mar-01, etc., where the number indicates the year. When trying to convert this column to datetime objects, I get an out-of-range error. (My reading into this suggests it is due to the 64-bit limit on the nanosecond timestamps pandas can represent.)
What is a good way to work around this problem and process the date information so I can effectively plot the associated data vs. these dates over this ~10,000-year period?
Thanks
the cftime library was created specifically for this purpose, and xarray has a convenient xr.cftime_range function that makes creating such a range easy:
In [3]: import xarray as xr, pandas as pd
In [4]: date_range = xr.cftime_range('0001-01-01', '9999-01-01', freq='D')
In [5]: type(date_range)
Out[5]: xarray.coding.cftimeindex.CFTimeIndex
This creates a CFTimeIndex object which plays nicely with pandas:
In [8]: df = pd.DataFrame({"date": date_range, "vals": range(len(date_range))})
In [9]: df
Out[9]:
date vals
0 0001-01-01 00:00:00 0
1 0001-01-02 00:00:00 1
2 0001-01-03 00:00:00 2
3 0001-01-04 00:00:00 3
4 0001-01-05 00:00:00 4
... ... ...
3651692 9998-12-28 00:00:00 3651692
3651693 9998-12-29 00:00:00 3651693
3651694 9998-12-30 00:00:00 3651694
3651695 9998-12-31 00:00:00 3651695
3651696 9999-01-01 00:00:00 3651696
[3651697 rows x 2 columns]
pandas.read_csv has (warn, error) bad-lines handling. I can't see any equivalent for pandas.read_excel. Is there a reason? For example, suppose I want to read an Excel file where a column is supposed to be a datetime and the pandas.read_excel function encounters an int or str in one or a few of the rows. Do I need to handle this myself?
In short, no, I do not believe there is a way to do this automatically with a parameter you pass to read_excel(). This is how to solve your problem, though:
Let's say that when you read in your dataframe it looks like this:
df = pd.read_excel('Desktop/Book1.xlsx')
df
Date
0 2020-09-13 00:00:00
1 2
2 abc
3 2020-09-14 00:00:00
You can pass errors='coerce' to pd.to_datetime():
df['Date'] = pd.to_datetime(df['Date'], errors='coerce')
df
Date
0 2020-09-13
1 NaT
2 NaT
3 2020-09-14
Finally, you can drop those rows with:
df = df[df['Date'].notnull()]
df
Date
0 2020-09-13
3 2020-09-14
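If you also want something like read_csv's warning behaviour, one sketch (using a stand-in frame with a hypothetical Date column) is to coerce, report which rows failed to parse, and then drop them:

```python
import pandas as pd

# Stand-in for read_excel output containing a couple of bad values.
df = pd.DataFrame({"Date": ["2020-09-13", "not a date", "abc", "2020-09-14"]})

parsed = pd.to_datetime(df["Date"], errors="coerce")
bad = df.loc[parsed.isna(), "Date"]
print(f"dropping {len(bad)} unparseable row(s): {bad.tolist()}")

df["Date"] = parsed
df = df[df["Date"].notna()]
```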
I am trying to extract dates from an Excel sheet using the pandas library.
data = pd.read_excel (import_file_path)
df = pd.DataFrame(data,columns = ['birthday'])
This works but I don't really know how to work with DataFrames and I just need a list/array of the ages, so I tried to convert it into a numpy array:
array = df.to_numpy()
This works fine as well but elements of the array look like:
[datetime.datetime(1983, 6, 4, 0, 0)]
But I can't use the methods provided by datetime to convert the dates.
What would be the best approach to get a list/array of ages eventually?
Update:
Birthday
1 2002-03-15 00:00:00
2 1999-04-17 00:00:00
3 1993-06-04 00:00:00
4 1997-07-04 00:00:00
5 1983-08-09 00:00:00
6 2000-01-10 00:00:00
7 1996-08-20 00:00:00
8 2003-11-06 00:00:00
Assuming your dates column is called Birthday, then something like the following:
import pandas as pd
import numpy as np

df = pd.DataFrame({'Birthday' : pd.date_range(start='01/01/88',end='02/02/95',freq='M')})
df['Today'] = pd.Timestamp(2019, 6, 13)  # pd.datetime is removed in recent pandas; use pd.Timestamp or the datetime module
df['Years'] = (df['Today'] - df['Birthday']) / np.timedelta64(1, 'Y')
print(df.head(5))
Birthday Today Years
0 1988-01-31 2019-06-13 31.365463
1 1988-02-29 2019-06-13 31.286063
2 1988-03-31 2019-06-13 31.201188
3 1988-04-30 2019-06-13 31.119051
4 1988-05-31 2019-06-13 31.034176
Then simply cast the col to a np.array
a = np.array(df['Years'])
print(a)
array([31.36546267, 31.28606337, 31.20118825, 31.11905104, 31.03417592,
30.95203871, 30.8671636 , 30.78228848, 30.70015127, 30.61527615,
30.53313894, 30.44826382, 30.36338871, 30.28672731, 30.20185219,
30.11971498, 30.03483987, 29.95270266, 29.86782754, 29.78295242])
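One caveat: np.timedelta64(1, 'Y') divides by the average year length (~365.2425 days), so the values above are fractional approximations. For exact whole-year ages you can compare calendar components instead; a sketch with two example birthdays:

```python
import pandas as pd

birthdays = pd.to_datetime(pd.Series(['2002-03-15', '1983-08-09']))
today = pd.Timestamp(2019, 6, 13)

# Whole years elapsed, minus one where this year's birthday hasn't happened yet.
before_birthday = (birthdays.dt.month > today.month) | (
    (birthdays.dt.month == today.month) & (birthdays.dt.day > today.day))
age = today.year - birthdays.dt.year - before_birthday.astype(int)
print(age.tolist())  # [17, 35]
```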
OK, there was a row with irregular data in it, which messed up the conversion.
Handling of the types works fine now, thanks!
I have a data upload in MS Excel format.
This file has a column with dates in "dd.mm.yyyy 00:00:00" format.
Reading file with code:
df = pd.read_excel('data_from_db.xlsx')
I receive a frame where the dates column has "object" type. I then convert this column to date format with the command:
df['Date_Column'] = pd.to_datetime(df['Date_Column'])
That gives me "datetime64[ns]" type.
But this command does not work correctly every time. I see rows with muddled data:
somewhere rows have format "yyyy.mm.dd",
somewhere "yyyy.dd.mm".
How do I correctly convert an Excel column in "dd.mm.yyyy 00:00:00" format to a pandas dataframe column with date type and "dd.mm.yyyy" format?
P.S. I also noticed this oddity: some values in the raw date column have str type, others float. I can't wrap my head around it, because the raw table is an upload from a database.
Without specifying a format, pd.to_datetime has to guess from the data how a date string is to be interpreted. With default parameters this fails for the second and third row of your data:
In [5]: date_of_hire = pd.Series(['18.01.2018 0:00:00',
'01.02.2018 0:00:00',
'06.11.2018 0:00:00'])
In [6]: pd.to_datetime(date_of_hire)
Out[6]:
0 2018-01-18
1 2018-01-02
2 2018-06-11
dtype: datetime64[ns]
The quickest solution would be to pass dayfirst=True:
In [7]: pd.to_datetime(date_of_hire, dayfirst=True)
Out[7]:
0 2018-01-18
1 2018-02-01
2 2018-11-06
dtype: datetime64[ns]
If you know the complete format of your data, you can specify it directly. This only works if the format matches exactly; if a row should, e.g., lack the time component, the conversion will fail.
In [8]: pd.to_datetime(date_of_hire, format='%d.%m.%Y %H:%M:%S')
Out[8]:
0 2018-01-18
1 2018-02-01
2 2018-11-06
dtype: datetime64[ns]
In case you have little information about the date format except that it is consistent, pandas can infer the format from the data beforehand (note that infer_datetime_format is deprecated in pandas 2.0 and later, where strict format inference is the default behavior):
In [9]: pd.to_datetime(date_of_hire, infer_datetime_format=True)
Out[9]:
0 2018-01-18
1 2018-02-01
2 2018-11-06
dtype: datetime64[ns]
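As for the "dd.mm.yyyy" display part of the question: a datetime64 column has no display format of its own, so to show dates that way you have to format them out to strings with dt.strftime (which returns object dtype, not datetimes):

```python
import pandas as pd

s = pd.to_datetime(pd.Series(['18.01.2018 0:00:00', '06.11.2018 0:00:00']),
                   dayfirst=True)
formatted = s.dt.strftime('%d.%m.%Y')
print(formatted.tolist())  # ['18.01.2018', '06.11.2018']
```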