I have a column in my dataset that looks like this:
date
41245.0
41701.0
36361.0
I need to convert it into a date-format. When I try it in Python using this:
df = pd.to_datetime(df['date'])
My results are like this:
1 1970-01-01 00:00:00.000041701
4 1970-01-01 00:00:00.000042226
5 1970-01-01 00:00:00.000039031
These years seem quite odd. However, when I open my dataset (as an Excel sheet) in Google Drive/Sheets, select the column, and format it using the "date" or "date-time" format, the results are quite different:
12/2/2012
3/3/2014
7/20/1999
My results should be something like this, but instead I am getting the odd values above. The results in Microsoft Excel were also slightly different. Why are the dates different? What am I doing wrong?
Those numbers are day counts, but the origin is Excel's 1900-01-01 epoch, not pandas' default of 1970-01-01. (Without unit='d', pandas also reads the values as nanoseconds since 1970, which is why every result landed on 1970-01-01.)
pd.to_datetime(df.date,unit='d',origin='1900-01-01')
Out[205]:
0 2012-12-04
1 2014-03-05
2 1999-07-22
Name: date, dtype: datetime64[ns]
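Note that this output is still two days off from what Sheets shows (2012-12-04 vs. 12/2/2012). Excel's serial-day system effectively counts from 1899-12-30 (serial 1 is 1900-01-01, and Excel also includes a fictitious 1900-02-29), so using that origin reproduces the spreadsheet dates exactly:

```python
import pandas as pd

# Excel serial day numbers count from an effective epoch of 1899-12-30,
# so this origin matches what Sheets/Excel display for the same column.
df = pd.DataFrame({"date": [41245.0, 41701.0, 36361.0]})
converted = pd.to_datetime(df["date"], unit="D", origin="1899-12-30")
print(converted)
# 0   2012-12-02
# 1   2014-03-03
# 2   1999-07-20
```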
Related
I am reading a dataframe:
df = pd.read_csv("file_path")
df['time'] = pd.to_datetime(df['time'], format="%Y-%m-%d")
When I want to access df['time'].dt in order to get year, month, etc. I get the error that:
'Timestamp' object has no attribute 'dt'.
There are many posts on Stack Overflow related to my question. I tried the suggested solution of using to_pydatetime(), but it did not work. I appreciate any help with this.
To get/extract an attribute like the hour or year of a Timestamp in pandas, you can access the attribute by name directly (this works when df['time'] is a single Timestamp, as your error message suggests):
Example:
df['time'].year
df['time'].month
df['time'].day
df['time'].hour
ref: https://pandas.pydata.org/docs/reference/api/pandas.Timestamp.hour.html
https://pandas.pydata.org/docs/reference/api/pandas.Timestamp.year.html
OK then, I'm confused about what code you actually have. The error message you cite says that you have a Timestamp object, not a Series. If you have a Series, then you need dt. If you have an individual element, then you do not need dt. Here is a snippet based on your code that works perfectly. If this is not what you have, then you need to edit the question.
import pandas as pd
data = [
'2019-06-06',
'2020-04-01',
'2021-01-01'
]
df = pd.DataFrame(data, columns=['time'] )
print(df)
df['time'] = pd.to_datetime(df['time'], format="%Y-%m-%d")
print(df)
print(df['time'].dt.year)
Output:
time
0 2019-06-06
1 2020-04-01
2 2021-01-01
time
0 2019-06-06
1 2020-04-01
2 2021-01-01
0 2019
1 2020
2 2021
Name: time, dtype: int64
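The distinction above fits in a few lines: the .dt accessor exists on a datetime Series, while an individual element pulled out of it is a Timestamp that carries the attributes directly:

```python
import pandas as pd

s = pd.to_datetime(pd.Series(['2019-06-06', '2020-04-01']))
print(s.dt.year.tolist())  # a Series needs the .dt accessor
ts = s.iloc[0]             # an element is a Timestamp: no .dt
print(ts.year, ts.month)
```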
Sorry if this question has been asked before, but I can't seem to find one that describes my current issue.
Basically, I have a large climate dataset that is not bound to "real" dates. The dataset starts at "year one" and goes to "year 9999". These dates are stored as strings such as Jan-01, Feb-01, Mar-01, etc., where the number indicates the year. When trying to convert this column to datetime objects, I get an out-of-range error. (My reading into this suggests it is due to the 64-bit limit on the nanosecond timestamps pandas can represent.)
What is a good way to work around this problem and process the date information so I can effectively plot the associated data vs. these dates over this ~10,000-year period?
Thanks
the cftime library was created specifically for this purpose, and xarray has a convenient xr.cftime_range function that makes creating such a range easy:
In [3]: import xarray as xr, pandas as pd
In [4]: date_range = xr.cftime_range('0001-01-01', '9999-01-01', freq='D')
In [5]: type(date_range)
Out[5]: xarray.coding.cftimeindex.CFTimeIndex
This creates a CFTimeIndex object which plays nicely with pandas:
In [8]: df = pd.DataFrame({"date": date_range, "vals": range(len(date_range))})
In [9]: df
Out[9]:
date vals
0 0001-01-01 00:00:00 0
1 0001-01-02 00:00:00 1
2 0001-01-03 00:00:00 2
3 0001-01-04 00:00:00 3
4 0001-01-05 00:00:00 4
... ... ...
3651692 9998-12-28 00:00:00 3651692
3651693 9998-12-29 00:00:00 3651693
3651694 9998-12-30 00:00:00 3651694
3651695 9998-12-31 00:00:00 3651695
3651696 9999-01-01 00:00:00 3651696
[3651697 rows x 2 columns]
pandas.read_csv has (warn, error) bad-lines handling. I can't see any equivalent for pandas.read_excel. Is there a reason? For example, suppose I want to read an Excel file where a column is supposed to be a datetime and the pandas.read_excel function encounters an int or str in one or a few of the rows. Do I need to handle this myself?
In short, no, I do not believe there is a way to do this automatically with a parameter you pass to read_excel(). This is how to solve your problem, though:
Let's say that when you read in your dataframe it looks like this:
df = pd.read_excel('Desktop/Book1.xlsx')
df
Date
0 2020-09-13 00:00:00
1 2
2 abc
3 2020-09-14 00:00:00
You can pass errors='coerce' to pd.to_datetime():
df['Date'] = pd.to_datetime(df['Date'], errors='coerce')
df
Date
0 2020-09-13
1 NaT
2 NaT
3 2020-09-14
Finally, you can drop those rows with:
df = df[df['Date'].notnull()]
df
Date
0 2020-09-13
3 2020-09-14
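If you also want something like read_csv's warning behaviour, one sketch (using a stand-in frame with a hypothetical Date column) is to coerce, report which rows failed to parse, and then drop them:

```python
import pandas as pd

# Stand-in for read_excel output containing a couple of bad values.
df = pd.DataFrame({"Date": ["2020-09-13", "not a date", "abc", "2020-09-14"]})

parsed = pd.to_datetime(df["Date"], errors="coerce")
bad = df.loc[parsed.isna(), "Date"]
print(f"dropping {len(bad)} unparseable row(s): {bad.tolist()}")

df["Date"] = parsed
df = df[df["Date"].notna()]
```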
I am trying to extract dates from an Excel sheet using the pandas library.
data = pd.read_excel (import_file_path)
df = pd.DataFrame(data,columns = ['birthday'])
This works but I don't really know how to work with DataFrames and I just need a list/array of the ages, so I tried to convert it into a numpy array:
array = df.to_numpy()
This works fine as well but elements of the array look like:
[datetime.datetime(1983, 6, 4, 0, 0)]
But I can't use the methods provided by datetime to convert the dates.
What would be the best approach to get a list/array of ages eventually?
Update:
Birthday
1 2002-03-15 00:00:00
2 1999-04-17 00:00:00
3 1993-06-04 00:00:00
4 1997-07-04 00:00:00
5 1983-08-09 00:00:00
6 2000-01-10 00:00:00
7 1996-08-20 00:00:00
8 2003-11-06 00:00:00
Assuming your dates column is called Birthday, then something like the following:
import pandas as pd
import numpy as np

df = pd.DataFrame({'Birthday' : pd.date_range(start='01/01/88',end='02/02/95',freq='M')})
df['Today'] = pd.Timestamp(2019, 6, 13)  # pd.datetime is removed in recent pandas; use pd.Timestamp or the datetime module
df['Years'] = (df['Today'] - df['Birthday']) / np.timedelta64(1, 'Y')
print(df.head(5))
Birthday Today Years
0 1988-01-31 2019-06-13 31.365463
1 1988-02-29 2019-06-13 31.286063
2 1988-03-31 2019-06-13 31.201188
3 1988-04-30 2019-06-13 31.119051
4 1988-05-31 2019-06-13 31.034176
Then simply cast the col to a np.array
a = np.array(df['Years'])
print(a)
array([31.36546267, 31.28606337, 31.20118825, 31.11905104, 31.03417592,
30.95203871, 30.8671636 , 30.78228848, 30.70015127, 30.61527615,
30.53313894, 30.44826382, 30.36338871, 30.28672731, 30.20185219,
30.11971498, 30.03483987, 29.95270266, 29.86782754, 29.78295242])
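One caveat: np.timedelta64(1, 'Y') divides by the average year length (~365.2425 days), so the values above are fractional approximations. For exact whole-year ages you can compare calendar components instead; a sketch with two example birthdays:

```python
import pandas as pd

birthdays = pd.to_datetime(pd.Series(['2002-03-15', '1983-08-09']))
today = pd.Timestamp(2019, 6, 13)

# Whole years elapsed, minus one where this year's birthday hasn't happened yet.
before_birthday = (birthdays.dt.month > today.month) | (
    (birthdays.dt.month == today.month) & (birthdays.dt.day > today.day))
age = today.year - birthdays.dt.year - before_birthday.astype(int)
print(age.tolist())  # [17, 35]
```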
OK, there was a row with irregular data in it, which messed up the conversion.
Handling of the types works fine now, thanks!
I have a data upload in MS Excel format.
This file has a column with dates in "dd.mm.yyyy 00:00:00" format.
Reading file with code:
df = pd.read_excel('data_from_db.xlsx')
I receive a frame where the dates column has "object" type. I then convert this column to date format with the command:
df['Date_Column'] = pd.to_datetime(df['Date_Column'])
That gives me "datetime64[ns]" type.
But this command does not work correctly every time. I see rows with muddled data:
somewhere rows have format "yyyy.mm.dd",
somewhere "yyyy.dd.mm".
How do I correctly convert an Excel column in "dd.mm.yyyy 00:00:00" format to a pandas dataframe column with date type and "dd.mm.yyyy" format?
P.S. I also noticed this oddity: some values in the raw date column have str type, others float. I can't wrap my head around it, because the raw table is an upload from a database.
Without specifying a format, pd.to_datetime has to guess from the data how a date string is to be interpreted. With default parameters this fails for the second and third row of your data:
In [5]: date_of_hire = pd.Series(['18.01.2018 0:00:00',
'01.02.2018 0:00:00',
'06.11.2018 0:00:00'])
In [6]: pd.to_datetime(date_of_hire)
Out[6]:
0 2018-01-18
1 2018-01-02
2 2018-06-11
dtype: datetime64[ns]
The quickest solution would be to pass dayfirst=True:
In [7]: pd.to_datetime(date_of_hire, dayfirst=True)
Out[7]:
0 2018-01-18
1 2018-02-01
2 2018-11-06
dtype: datetime64[ns]
If you know the complete format of your data, you can specify it directly. This only works if the format matches exactly; if a row should, e.g., lack the time component, the conversion will fail.
In [8]: pd.to_datetime(date_of_hire, format='%d.%m.%Y %H:%M:%S')
Out[8]:
0 2018-01-18
1 2018-02-01
2 2018-11-06
dtype: datetime64[ns]
In case you have little information about the date format except that it is consistent, pandas can infer the format from the data beforehand (note that infer_datetime_format is deprecated in pandas 2.0 and later, where strict format inference is the default behavior):
In [9]: pd.to_datetime(date_of_hire, infer_datetime_format=True)
Out[9]:
0 2018-01-18
1 2018-02-01
2 2018-11-06
dtype: datetime64[ns]
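As for the "dd.mm.yyyy" display part of the question: a datetime64 column has no display format of its own, so to show dates that way you have to format them out to strings with dt.strftime (which returns object dtype, not datetimes):

```python
import pandas as pd

s = pd.to_datetime(pd.Series(['18.01.2018 0:00:00', '06.11.2018 0:00:00']),
                   dayfirst=True)
formatted = s.dt.strftime('%d.%m.%Y')
print(formatted.tolist())  # ['18.01.2018', '06.11.2018']
```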