python pandas incorrectly reading excel dates


I have an excel file with dates formatted as such:
22.10.07 16:00
22.10.07 17:00
22.10.07 18:00
22.10.07 19:00
After using the parse method of pandas to read the data, the dates are read almost correctly:
In [55]: nts.data['Tid'][10000:10005]
Out[55]:
10000 2007-10-22 15:59:59.997905
10001 2007-10-22 16:59:59.997904
10002 2007-10-22 17:59:59.997904
10003 2007-10-22 18:59:59.997904
What do I need to do to either a) get it to work correctly, or b) fix this easily afterwards (e.g. with some kind of 'round' function for datetime)?

I encountered the same issue and got around it by not parsing the dates using Pandas, but rather applying my own function (shown below) to the relevant column(s) of the dataframe:
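import datetime as dt
import pandas as pd
def ExcelDateToDateTime(xlDate):
    # 1899-12-30 is day zero for serials in Excel's 1900 date system
    epoch = dt.datetime(1899, 12, 30)
    # round to the nearest whole hour to absorb floating-point error
    delta = dt.timedelta(hours=round(xlDate * 24))
    return epoch + delta
# DataFrame.from_csv is gone from current pandas; read_csv is the replacement
df = pd.read_csv('path', index_col=0)
df['Date'] = df['Date'].apply(ExcelDateToDateTime)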
Note: This will ignore any time granularity below the hour level, but that's all I need, and it looks from your example that this could be the case for you too.

Excel serializes datetimes with a ddddd.tttttt format, where the d part is an integer number representing the offset from a reference day (like Dec 31st, 1899), and the t part is a fraction between 0.0 and 1.0 which stands for the part of the day at the given time (for example at 12:00 it's 0.5, at 18:00 it's 0.75 and so on).
I asked you to upload a file with sample data. .xlsx files are really ZIP archives which contain your XML-serialized worksheets. These are the dates I extracted from the relevant column. Excerpt:
38961.666666666628
38961.708333333292
38961.749999999956
When you try to manually deserialize them you get the same datetimes as pandas. Unfortunately, the way Excel stores times makes it impossible to represent some values exactly, so you have to round them for display purposes. I'm not sure if rounded data is needed for analysis, though.
This is the script I used to test that the deserialized datetimes are really the same ones pandas produces:
from datetime import date, datetime, time, timedelta
from urllib.request import urlopen  # urllib2.urlopen in Python 2
def deserialize(text):
    tokens = text.strip().split(".")
    date_tok = tokens[0]
    time_tok = tokens[1] if len(tokens) == 2 else "0"
    # 1899-12-30 is the effective day zero for serials after February 1900
    d = date(1899, 12, 30) + timedelta(int(date_tok))
    t = time(*helper(float("0." + time_tok), (24, 60, 60, 1000000)))
    return datetime.combine(d, t)
def helper(factor, units):
    # peel hours, minutes, seconds and microseconds off the day fraction
    result = list()
    for unit in units:
        value, factor = divmod(factor * unit, 1)
        result.append(int(value))
    return result
url = "https://gist.github.com/RaffaeleSgarro/877d7449bd19722b44cb/raw/" \
      "45d5f0b339d4abf3359fe673fcd2976374ed61b8/dates.txt"
for line in urlopen(url):
    print(deserialize(line.decode()))
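The rounding mentioned in this answer can be sketched with the standard library alone. This is only an illustration, not what pandas does internally: it assumes the workbook uses the 1900 date system and that all serial numbers are above 60, so that 1899-12-30 works as day zero. The name excel_serial_to_datetime is made up for this example.

```python
from datetime import datetime, timedelta

# Assumption: 1900 date system, serial numbers above 60
EXCEL_EPOCH = datetime(1899, 12, 30)

def excel_serial_to_datetime(serial):
    """Convert an Excel serial number to a datetime, rounded to the nearest second."""
    # Rounding to whole seconds absorbs the floating-point error in the day fraction
    return EXCEL_EPOCH + timedelta(seconds=round(serial * 86400))

# The three serial numbers quoted in the excerpt above
for serial in (38961.666666666628, 38961.708333333292, 38961.749999999956):
    print(excel_serial_to_datetime(serial))
```

For a column that pandas has already parsed, Series.dt.round with a frequency such as '1s' may achieve a similar cleanup without re-reading the file.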

Related

Convert 18-digit LDAP/FILETIME timestamps to human readable date

I have exported a list of AD users out of AD and need to validate their login times.
The output from the PowerShell script gives lastlogin as LDAP/FILETIME.
EXAMPLE 130305048577611542
I am having trouble converting this to a readable time in pandas.
I'm using the following code:
df['date of login'] = pd.to_datetime(df['FileTime'], unit='ns')
The column FileTime contains times formatted like the EXAMPLE above.
I'm getting the following output in my new column date of login:
EXAMPLE 1974-02-17 03:50:48.577611542
I know this is being parsed incorrectly because when I enter this timestamp in an online converter I get this output:
EXAMPLE:
Epoch/Unix time: 1386031258
GMT: Tuesday, December 3, 2013 12:40:58 AM
Your time zone: Monday, December 2, 2013 4:40:58 PM GMT-08:00
Does anyone have an idea of what is occurring here? Why are all my dates in the 1970s?

I know this answer is very late to the party, but for anyone else looking in the future.
The 18-digit Active Directory timestamps (LDAP) are also known as 'Windows NT time format', 'Win32 FILETIME or SYSTEMTIME', or NTFS file time. They are used in Microsoft Active Directory for pwdLastSet, accountExpires, LastLogon, LastLogonTimestamp and LastPwdSet. The timestamp is the number of 100-nanosecond intervals (1 nanosecond = one billionth of a second) since Jan 1, 1601 UTC.
Therefore, 130305048577611542 does indeed relate to December 3, 2013.
When this value is passed to pd.to_datetime with unit='ns', it is interpreted as a count of nanoseconds since 1.1.1970 rather than 100-nanosecond intervals since 1601. That amounts to roughly 130305048 seconds after the Unix epoch, which does result in a 1974 date!
In order to get the correct Unix timestamp you need to do:
(130305048577611542 / 10000000) - 11644473600
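As a small sketch of that formula using only the standard library: the helper name filetime_to_unix and the constant names are invented for this example, and integer division is used so that the 18-digit value is not pushed through a float.

```python
from datetime import datetime, timezone

FILETIME_TICKS_PER_SECOND = 10_000_000  # one tick is 100 nanoseconds
EPOCH_DIFF_SECONDS = 11_644_473_600     # seconds from 1601-01-01 to 1970-01-01

def filetime_to_unix(filetime):
    """Convert an 18-digit FILETIME value to whole Unix seconds."""
    return filetime // FILETIME_TICKS_PER_SECOND - EPOCH_DIFF_SECONDS

unix_seconds = filetime_to_unix(130305048577611542)
print(unix_seconds)
print(datetime.fromtimestamp(unix_seconds, tz=timezone.utc))
```

The resulting seconds could then be handed to pd.to_datetime with unit='s' instead of unit='ns' to convert a whole column.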

Here's a solution I did in Python that worked well for me:
import datetime
import numpy as np
def ad_timestamp(timestamp):
    # 0 stands in for a missing value (see the fillna note below)
    if timestamp != 0:
        return datetime.datetime(1601, 1, 1) + datetime.timedelta(seconds=timestamp/10000000)
    return np.nan
So then if you need to convert a Pandas column:
df.lastLogonTimestamp = df.lastLogonTimestamp.fillna(0).apply(ad_timestamp)
Note: I needed to use fillna before using apply. Also, since I filled with 0's, I checked for that in the conversion function above (if timestamp != 0). Hope that makes sense. It's extra stuff but you may need it to convert the column in question.

I was stuck on this for a couple of days, but now I am ready to share a working solution in an easier-to-use form:
import datetime
timestamp = 132375402928051110
value = datetime.datetime(1601, 1, 1) + datetime.timedelta(seconds=timestamp/10000000)
print(value.strftime('%Y-%m-%d %H:%M:%S'))

formatting timedelta64 when using pandas.to_excel

I am writing to an excel file using an ExcelWriter:
writer = pd.ExcelWriter(fn,datetime_format=' d hh:mm:ss')
df.to_excel(writer,sheet_name='FOO')
The writing operation is successful and opening the corresponding excel file I see datetimes nicely formatted as required. However, another column of the dataframe with dtype timedelta64[ns] is automatically converted to a numerical value, so in Python I see
0 days 00:23:33.499998
while in excel:
0.016359954
which is likely the same duration converted to a number of days.
Is there any way to control the timedelta formatting using pd.ExcelWriter?
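The guess about days can be checked with plain datetime arithmetic. This sketch only reproduces the number seen in Excel from the duration quoted above; it says nothing about how the Excel writer performs the conversion internally.

```python
from datetime import timedelta

# The duration shown in Python: 0 days 00:23:33.499998
duration = timedelta(minutes=23, seconds=33, microseconds=499998)

# Dividing one timedelta by another yields a float: the duration as a fraction of a day
fraction_of_day = duration / timedelta(days=1)
print(round(fraction_of_day, 9))
```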

Excel has no data type for a timedelta or equivalent, so you have a couple imperfect choices.
To keep their "datetime-ness" in Excel, you could convert to a datetime, then display them in Excel with a format showing only the time part.
import pandas as pd
df = pd.DataFrame({'td': [pd.Timedelta(1, 'h'), pd.Timedelta(1.5, 'h')]})
df['td_datetime'] = df['td'] + pd.Timestamp(0)
# the with-block saves the file when it exits
with pd.ExcelWriter('tmp.xlsx', datetime_format='hh:mm:ss') as writer:
    df.to_excel(writer)
# tmp.xlsx
# td td_datetime
# 0.041667 01:00:00
# 0.0625 01:30:00
Alternatively, you could format as string before serializing:
df['td_str'] = df['td'].astype(str)
df
Out[24]:
td td_str
0 01:00:00 0 days 01:00:00.000000000
1 01:30:00 0 days 01:30:00.000000000

Some additions to the above.
Excel's zero date is 1-1-1900, while pandas.Timestamp(0) gives me 1-1-1970.
So I changed the code to
df['td_datetime'] = df['td'] + pd.Timestamp('1900-01-01')
and now it works correctly (and you can correctly sum cells to add timedeltas).
You might also like to display hours only (not 1 day 1 hour, but 25 hours); for this you can use the following format:
writer = pd.ExcelWriter('tmp.xlsx', datetime_format='[h]:mm:ss')

Converting Matlab's datenum format to Python

I just started moving from Matlab to Python 2.7 and I have some trouble reading my .mat-files. Time information is stored in Matlab's datenum format. For those who are not familiar with it:
A serial date number represents a calendar date as the number of days that has passed since a fixed base date. In MATLAB, serial date number 1 is January 1, 0000.
MATLAB also uses serial time to represent fractions of days beginning at midnight; for example, 6 p.m. equals 0.75 serial days. So the string '31-Oct-2003, 6:00 PM' in MATLAB is date number 731885.75.
(taken from the Matlab documentation)
I would like to convert this to Python's time format and I found this tutorial. In short, the author states that
If you parse this using python's datetime.fromordinal(731965.04835648148) then the result might look reasonable [...]
(before any further conversions), which doesn't work for me, since datetime.fromordinal expects an integer:
>>> datetime.fromordinal(731965.04835648148)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: integer argument expected, got float
While I could just round them down for daily data, I actually need to import minutely time series. Does anyone have a solution for this problem? I would like to avoid reformatting my .mat files since there's a lot of them and my colleagues need to work with them as well.
If it helps, someone else asked for the other way round. Sadly, I'm too new to Python to really understand what is happening there.
/edit (2012-11-01): This has been fixed in the tutorial posted above.

The solution you link to has a small issue. The corrected version is:
python_datetime = datetime.fromordinal(int(matlab_datenum)) + timedelta(days=matlab_datenum%1) - timedelta(days = 366)
A longer explanation can be found here
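As a worked check of that one-liner, here it is wrapped in a function and applied to the example from the Matlab documentation quoted in the question (date number 731885.75 for '31-Oct-2003, 6:00 PM'). The function name matlab_datenum_to_datetime is made up for this sketch.

```python
from datetime import datetime, timedelta

def matlab_datenum_to_datetime(datenum):
    """Convert a Matlab datenum (days, with a fractional part) to a datetime."""
    whole_days = int(datenum)
    day_fraction = datenum % 1
    # Matlab's day count runs 366 days ahead of Python's proleptic ordinal
    return (datetime.fromordinal(whole_days)
            + timedelta(days=day_fraction)
            - timedelta(days=366))

print(matlab_datenum_to_datetime(731885.75))
```

Most fractional parts are not exactly representable in floating point, so minutely data may come back a few microseconds off and could need rounding afterwards.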

Using pandas, you can convert a whole array of datenum values with fractional parts:
import numpy as np
import pandas as pd
datenums = np.array([737125, 737124.8, 737124.6, 737124.4, 737124.2, 737124])
timestamps = pd.to_datetime(datenums-719529, unit='D')
The value 719529 is the datenum value of the Unix epoch start (1970-01-01), which is the default origin for pd.to_datetime().
I used the following Matlab code to set this up:
datenum('1970-01-01') % gives 719529
datenums = datenum('06-Mar-2018') - linspace(0,1,6) % test data
datestr(datenums) % human readable format

Just in case it's useful to others, here is a full example of loading time series data from a Matlab mat file, converting a vector of Matlab datenums to a list of datetime objects using carlosdc's answer (defined as a function), and then plotting as time series with Pandas:
from scipy.io import loadmat
import pandas as pd
import datetime as dt
from urllib.request import urlretrieve  # urllib.urlretrieve in Python 2
# In Matlab, I created this sample 20-day time series:
# t = datenum(2013,8,15,17,11,31) + [0:0.1:20];
# x = sin(t)
# y = cos(t)
# plot(t,x)
# datetick
# save sine.mat
urlretrieve('http://geoport.whoi.edu/data/sine.mat', 'sine.mat')
# If you don't use squeeze_me=True, then Pandas doesn't like
# the arrays in the dictionary, because they look like arrays
# of 1-element arrays. squeeze_me=True fixes that.
mat_dict = loadmat('sine.mat', squeeze_me=True)
# make a new dictionary with just the dependent variables we want
# (we handle the time variable separately, below)
my_dict = {k: mat_dict[k] for k in ['x', 'y']}
def matlab2datetime(matlab_datenum):
    day = dt.datetime.fromordinal(int(matlab_datenum))
    dayfrac = dt.timedelta(days=matlab_datenum % 1) - dt.timedelta(days=366)
    return day + dayfrac
# convert Matlab variable "t" into list of python datetime objects
my_dict['date_time'] = [matlab2datetime(tval) for tval in mat_dict['t']]
df = pd.DataFrame(my_dict)
df = df.set_index('date_time')
# printing df showed:
# <class 'pandas.core.frame.DataFrame'>
# DatetimeIndex: 201 entries, 2013-08-15 17:11:30.999997 to 2013-09-04 17:11:30.999997
# Data columns (total 2 columns):
# x 201 non-null values
# y 201 non-null values
# dtypes: float64(2)
# plot with Pandas
df.plot()

Here's a way to convert these using numpy.datetime64, rather than datetime.
import numpy as np
origin = np.datetime64('0000-01-01', 'D') - np.timedelta64(1, 'D')
date = serdate * np.timedelta64(1, 'D') + origin
This works when serdate is either a single integer or an integer array.

Just building on and adding to previous comments. The key is in the day counting as carried out by the method toordinal and constructor fromordinal in the class datetime and related subclasses. For example, from the Python Library Reference for 2.7, one reads that fromordinal
Return the date corresponding to the proleptic Gregorian ordinal, where January 1 of year 1 has ordinal 1. ValueError is raised unless 1 <= ordinal <= date.max.toordinal().
However, year 0 AD is still one (leap) year to count in, so there are still 366 days that need to be taken into account. (Leap year it was, like 2016 that is exactly 504 four-year cycles ago.)
These are two functions that I have been using for similar purposes:
import datetime
def datetime_pytom(d, t):
    '''
    Input
        d   Date as an instance of type datetime.date
        t   Time as an instance of type datetime.time
    Output
        The fractional day count since 0-Jan-0000 (proleptic ISO calendar)
        This is the 'datenum' datatype in matlab
    Notes on day counting
        matlab: day one is 1 Jan 0000
        python: day one is 1 Jan 0001
        hence an increase of 366 days, for year 0 AD was a leap year
    '''
    dd = d.toordinal() + 366
    tt = datetime.timedelta(hours=t.hour, minutes=t.minute,
                            seconds=t.second)
    tt = tt.total_seconds() / 86400
    return dd + tt
def datetime_mtopy(datenum):
    '''
    Input
        The fractional day count according to datenum datatype in matlab
    Output
        The date and time as an instance of type datetime in python
    Notes on day counting
        matlab: day one is 1 Jan 0000
        python: day one is 1 Jan 0001
        hence a reduction of 366 days, for year 0 AD was a leap year
    '''
    ii = datetime.datetime.fromordinal(int(datenum) - 366)
    ff = datetime.timedelta(days=datenum % 1)
    return ii + ff
Hope this helps and happy to be corrected.

How to convert a python datetime.datetime to excel serial date number

I need to convert dates into Excel serial numbers for a data munging script I am writing. By playing with dates in my OpenOffice Calc workbook, I was able to deduce that '1-Jan 1899 00:00:00' maps to the number zero.
I wrote the following function to convert from a python datetime object into an Excel serial number:
def excel_date(date1):
    temp=dt.datetime.strptime('18990101', '%Y%m%d')
    delta=date1-temp
    total_seconds = delta.days * 86400 + delta.seconds
    return total_seconds
However, when I try some sample dates, the numbers are different from those I get when I format the date as a number in Excel (well OpenOffice Calc). For example, testing '2009-03-20' gives 3478032000 in Python, whilst excel renders the serial number as 39892.
What is wrong with the formula above?
*Note: I am using Python 2.6.3, so do not have access to datetime.total_seconds()

It appears that the Excel "serial date" format is actually the number of days since 1900-01-00, with a fractional component that's a fraction of a day, based on http://www.cpearson.com/excel/datetime.htm. (I guess that date should actually be considered 1899-12-31, since there's no such thing as a 0th day of a month)
So, it seems like it should be:
import datetime as dt
def excel_date(date1):
    temp = dt.datetime(1899, 12, 30) # Note, not 31st Dec but 30th!
    delta = date1 - temp
    return float(delta.days) + (float(delta.seconds) / 86400)
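A quick check of this approach against the value from the question (39892 for 2009-03-20) might look like the sketch below. It uses timedelta.total_seconds(), which needs Python 2.7 or later, so it is a variant of the function above and not a drop-in for the asker's Python 2.6.3. The name to_excel_serial is invented here.

```python
from datetime import datetime

# Day zero that matches Excel's 1900 date system for dates after February 1900
EXCEL_EPOCH = datetime(1899, 12, 30)

def to_excel_serial(moment):
    """Return days (with a fractional time part) elapsed since EXCEL_EPOCH."""
    return (moment - EXCEL_EPOCH).total_seconds() / 86400

print(to_excel_serial(datetime(2009, 3, 20)))         # the date from the question
print(to_excel_serial(datetime(2009, 3, 20, 18, 0)))  # same day at 18:00
```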

While this is not exactly relevant to the excel serial date format, this was the top hit for exporting python date time to Excel. What I have found particularly useful and simple is to just export using strftime.
import datetime
current_datetime = datetime.datetime.now()
current_datetime.strftime('%x %X')
This will output in the following format '06/25/14 09:59:29' which is accepted by Excel as a valid date/time and allows for sorting in Excel.

If the problem is that we want the DATEVALUE() Excel serial number for dates, the toordinal() function can be used. Python serial numbers start from Jan 1 of year 1 whereas Excel starts from 1 Jan 1900, so apply an offset. Also see the Excel 1900 leap year bug (https://support.microsoft.com/en-us/help/214326/excel-incorrectly-assumes-that-the-year-1900-is-a-leap-year)
from datetime import date
def convert_date_to_excel_ordinal(day, month, year):
    offset = 693594  # ordinal of 1899-12-30
    current = date(year, month, day)
    n = current.toordinal()
    return n - offset

With the 3rd party xlrd.xldate module, you can supply a tuple structured as (year, month, day, hour, minute, second) and, if necessary, calculate a day fraction from any microseconds component:
from datetime import datetime
from xlrd import xldate
from operator import attrgetter
def excel_date(input_date):
    components = ('year', 'month', 'day', 'hour', 'minute', 'second')
    frac = input_date.microsecond / (86400 * 10**6) # divide by microseconds in one day
    return xldate.xldate_from_datetime_tuple(attrgetter(*components)(input_date), 0) + frac
res = excel_date(datetime(1900, 3, 1, 12, 0, 0, 5*10**5))
# 61.50000578703704

Following @akgood's answer: when the datetime is before 1/0/1900, the return value is wrong. The corrected return expression may be:
def excel_date(date1):
    temp = dt.datetime(1899, 12, 30) # Note, not 31st Dec but 30th!
    delta = date1 - temp
    return float(delta.days) + (-1.0 if delta.days < 0 else 1.0) * delta.seconds / 86400

This worked when I tested using the csv package to create a spreadsheet:
from datetime import datetime
def excel_date(date1):
    return date1.strftime('%x %-I:%M:%S %p')
now = datetime.now()
current_datetime=now.strftime('%x %-I:%M:%S %p')
time_data.append(excel_date(datetime.now()))
...

How to use ``xlrd.xldate_as_tuple()``

I am not quite sure how to use the following function:
xlrd.xldate_as_tuple
for the following data
xldate:39274.0
xldate:39839.0
Could someone please give me an example on usage of the function for the data?

Quoth the documentation:
Dates in Excel spreadsheets
In reality, there are no such things. What you have are floating point numbers and pious hope. There are several problems with Excel dates:
(1) Dates are not stored as a separate data type; they are stored as floating point numbers and you have to rely on (a) the "number format" applied to them in Excel and/or (b) knowing which cells are supposed to have dates in them. This module helps with (a) by inspecting the format that has been applied to each number cell; if it appears to be a date format, the cell is classified as a date rather than a number. Feedback on this feature, especially from non-English-speaking locales, would be appreciated.
(2) Excel for Windows stores dates by default as the number of days (or fraction thereof) since 1899-12-31T00:00:00. Excel for Macintosh uses a default start date of 1904-01-01T00:00:00. The date system can be changed in Excel on a per-workbook basis (for example: Tools -> Options -> Calculation, tick the "1904 date system" box). This is of course a bad idea if there are already dates in the workbook. There is no good reason to change it even if there are no dates in the workbook. Which date system is in use is recorded in the workbook. A workbook transported from Windows to Macintosh (or vice versa) will work correctly with the host Excel. When using this module's xldate_as_tuple function to convert numbers from a workbook, you must use the datemode attribute of the Book object. If you guess, or make a judgement depending on where you believe the workbook was created, you run the risk of being 1462 days out of kilter.
Reference: http://support.microsoft.com/default.aspx?scid=KB;EN-US;q180162
(3) The Excel implementation of the Windows-default 1900-based date system works on the incorrect premise that 1900 was a leap year. It interprets the number 60 as meaning 1900-02-29, which is not a valid date. Consequently any number less than 61 is ambiguous. Example: is 59 the result of 1900-02-28 entered directly, or is it 1900-03-01 minus 2 days? The OpenOffice.org Calc program "corrects" the Microsoft problem; entering 1900-02-27 causes the number 59 to be stored. Save as an XLS file, then open the file with Excel -- you'll see 1900-02-28 displayed.
Reference: http://support.microsoft.com/default.aspx?scid=kb;en-us;214326
which I quote here because the answer to your question is likely to be wrong unless you take that into account.
So to put this into code would be something like:
import datetime
import xlrd
book = xlrd.open_workbook("myfile.xls")
sheet = book.sheet_by_index(0)
cell = sheet.cell(5, 19) # type, <class 'xlrd.sheet.Cell'>
if sheet.cell(5, 19).ctype == 3: # 3 means 'xldate' , 1 means 'text'
    ms_date_number = sheet.cell_value(5, 19) # Correct option 1
    ms_date_number = sheet.cell(5, 19).value # Correct option 2
    # the last element of the tuple is the nearest whole second
    year, month, day, hour, minute, second = xlrd.xldate_as_tuple(ms_date_number,
                                                                  book.datemode)
    py_date = datetime.datetime(year, month, day, hour, minute, second)
which gives you a Python datetime in py_date that you can do useful operations upon using the standard datetime module.
I've never used xlrd, and my example is completely made up, but if there is a myfile.xls and it really has a date number in cell F20, and you aren't too fussy about precision as noted above, this code should work.
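To make the "1462 days out of kilter" warning concrete, here is a standard-library sketch that interprets the same serial number under both date systems. It is not xlrd's implementation: the function name serial_to_datetime is made up, and the 1900-based branch assumes serials of 61 or more so that the leap-year quirk can be ignored.

```python
from datetime import datetime, timedelta

EPOCH_1900 = datetime(1899, 12, 30)  # assumption: only valid for serials >= 61
EPOCH_1904 = datetime(1904, 1, 1)

def serial_to_datetime(serial, datemode=0):
    """Interpret an Excel serial; datemode 0 is 1900-based, 1 is 1904-based."""
    epoch = EPOCH_1904 if datemode == 1 else EPOCH_1900
    return epoch + timedelta(days=serial)

as_1900 = serial_to_datetime(39274.0, 0)
as_1904 = serial_to_datetime(39274.0, 1)
print(as_1900, as_1904, (as_1904 - as_1900).days)
```

Guessing the wrong date system shifts every date by the same 1462 days, which is why the documentation insists on passing book.datemode.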

The documentation of the function (minus the list of possible exceptions):
xldate_as_tuple(xldate, datemode)
Convert an Excel number (presumed to represent a date, a datetime or a
time) into a tuple suitable for feeding to datetime or mx.DateTime
constructors.
xldate
The Excel number
datemode
0: 1900-based, 1: 1904-based.
WARNING: when using this function to interpret the contents of
a workbook, you should pass in the Book.datemode attribute of that
workbook. Whether the workbook has ever been anywhere near a Macintosh is
irrelevant.
Returns:
Gregorian (year, month, day, hour, minute, nearest_second).
As the author of xlrd, I'm interested in knowing how the documentation can be made better. Could you please answer these:
Did you read the general section on dates (quoted by @msw)?
Did you read the above specific documentation of the function?
Can you suggest any improvement in the documentation?
Did you actually try running the function, like this:
>>> import xlrd
>>> xlrd.xldate_as_tuple(39274.0, 0)
(2007, 7, 11, 0, 0, 0)
>>> xlrd.xldate_as_tuple(39274.0 - 1.0/60/60/24, 0)
(2007, 7, 10, 23, 59, 59)
>>>

Use it as such:
number = 39274.0
book_datemode = my_book.datemode
year, month, day, hour, minute, second = xldate_as_tuple(number, book_datemode)

import datetime as dt
import xlrd
log_dir = 'C:\\Users\\'
infile = 'myfile.xls'
book = xlrd.open_workbook(log_dir+infile)
sheet1 = book.sheet_by_index(0)
date_column_idx = 1
## iterate through the sheet to locate the date columns
for rownum in range(sheet1.nrows):
    rows = sheet1.row_values(rownum)
    ## check if the cell is a date; continue otherwise
    if sheet1.cell(rownum, date_column_idx).ctype != 3:
        continue
    date_tuple = xlrd.xldate_as_tuple(rows[date_column_idx], book.datemode)
    ## the "*date_tuple" will automatically unpack the tuple. Thanks mfitzp :-)
    date = dt.datetime(*date_tuple)

Here's what I use to automatically convert dates:
cell = sheet.cell(row, col)
value = cell.value
if cell.ctype == 3: # xldate
    value = datetime.datetime(*xlrd.xldate_as_tuple(value, workbook.datemode))
