formatting timedelta64 when using pandas.to_excel - python

I am writing to an Excel file using an ExcelWriter:
writer = pd.ExcelWriter(fn,datetime_format=' d hh:mm:ss')
df.to_excel(writer,sheet_name='FOO')
The write succeeds, and when I open the resulting Excel file I see the datetimes nicely formatted as required. However, another column of the dataframe, with dtype timedelta64[ns], is automatically converted to a numeric value, so in Python I see
0 days 00:23:33.499998
while in excel:
0.016359954
which is likely the same duration expressed as a number of days.
Is there any way to control the timedelta formatting using pd.ExcelWriter?
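To double-check that interpretation, dividing the timedelta by one day reproduces the number Excel displays:

```python
import pandas as pd

# Dividing the timedelta by one day gives the duration as a fraction of a day,
# which is exactly the numeric value Excel shows
td = pd.Timedelta('0 days 00:23:33.499998')
as_days = td / pd.Timedelta(days=1)
print(round(as_days, 9))  # 0.016359954
```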

Excel has no data type for a timedelta or equivalent, so you have a couple of imperfect choices.
To keep their "datetime-ness" in Excel, you could convert to a datetime, then display them in Excel with a format showing only the time part.
df = pd.DataFrame({'td': [pd.Timedelta(1, 'h'), pd.Timedelta(1.5, 'h')]})
df['td_datetime'] = df['td'] + pd.Timestamp(0)
writer = pd.ExcelWriter('tmp.xlsx', datetime_format='hh:mm:ss')
df.to_excel(writer)
# tmp.xlsx
# td td_datetime
# 0.041667 01:00:00
# 0.0625 01:30:00
Alternatively, you could format as string before serializing:
df['td_str'] = df['td'].astype(str)
df
Out[24]:
td td_str
0 01:00:00 0 days 01:00:00.000000000
1 01:30:00 0 days 01:30:00.000000000
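If the "0 days ..." prefix and nanosecond tail of the default string form are too verbose, a hedged sketch that builds a plain HH:MM:SS string from dt.components (assuming durations under one day):

```python
import pandas as pd

df = pd.DataFrame({'td': [pd.Timedelta(1, 'h'), pd.Timedelta(1.5, 'h')]})

# dt.components exposes days/hours/minutes/seconds as integer columns
c = df['td'].dt.components
df['td_str'] = (c['hours'].astype(str).str.zfill(2) + ':'
                + c['minutes'].astype(str).str.zfill(2) + ':'
                + c['seconds'].astype(str).str.zfill(2))
print(df['td_str'].tolist())  # ['01:00:00', '01:30:00']
```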

Some additions to the above.
Excel's zero date is 1900-01-01, while pandas.Timestamp(0) gives 1970-01-01.
So I changed the code to
df['td_datetime'] = df['td'] + pd.Timestamp('1900-01-01')
and now it works correctly (and you can add cells together to sum the timedeltas).
Also, you might like to display hours only (not "1 day 1 hour" but "25 hours"), and for this you can use the following format:
writer = pd.ExcelWriter('tmp.xlsx', datetime_format='[h]:mm:ss')
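If you'd rather precompute that elapsed-hours display in pandas instead of relying on Excel's [h]:mm:ss format, a sketch (the elapsed_hhmmss helper is my own, not a pandas API):

```python
import pandas as pd

df = pd.DataFrame({'td': [pd.Timedelta(hours=25), pd.Timedelta(hours=1, minutes=30)]})

def elapsed_hhmmss(td):
    # Total elapsed hours, not wrapped at 24 -- mirrors Excel's [h]:mm:ss display
    total = int(td.total_seconds())
    h, rem = divmod(total, 3600)
    m, s = divmod(rem, 60)
    return f'{h}:{m:02d}:{s:02d}'

df['td_str'] = df['td'].map(elapsed_hhmmss)
print(df['td_str'].tolist())  # ['25:00:00', '1:30:00']
```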

Related

How do I parse time values that are over 23:59 in Python?

In a csv file, there is a column with a date (month/day/year) and TWO columns with time (hour:minute). One time column is the start time and the other column is the end time. All columns are objects that are not converted into datetime.
In the time columns, there are some time values that are over 23:59 and if they are over, the format is hour:minute:second (what I've seen so far). Ex: 24:50:00, 25:35:00, etc. How would I parse the time columns? I'm getting an error message and I think it's because the time is over the usual limit. Also, for the date column I'm told that if the start time column exceeds 23:59, the date would increase based on how much the time is over the limit. Ex: date of 1/1/2000 with start time column of 24:50:00 (hour:hour:minute) is 1/2/2000 with time 0:50 (hour:minute). Do I create a new column and merge the two, and if so, how? And what should I do for the end time column?
When reading the csv file, I tried to parse the time series with parse_dates:
time_parser = lambda x: pd.datetime.strptime(x, '%H:%M')
df = pd.read_csv('data.csv', parse_dates = ['StartTime'], date_parser = time_parser)
But I get an error message that tells me something like:
"25:39 does not match format %H:%M".
I'm not sure if the parser just ignores the extra :00 (second) as mentioned above, but I think the problem is that the time exceeds 23:59.
How should I go about approaching this?
Parse the date to datetime, parse the time to timedelta and add the two together. Note that to_timedelta expects a certain input format (HH:MM:SS), which in your case can be enforced by appending ":00". Ex:
import pandas as pd

df = pd.DataFrame({"date": ["1/1/2000", "1/1/2000", "1/1/2000"],
                   "time": ["23:59", "24:50", "25:30"]})
df["datetime"] = (
    pd.to_datetime(df["date"], format="%m/%d/%Y") +
    pd.to_timedelta(df["time"] + ":00")
)
df
date time datetime
0 1/1/2000 23:59 2000-01-01 23:59:00
1 1/1/2000 24:50 2000-01-02 00:50:00
2 1/1/2000 25:30 2000-01-02 01:30:00
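The same recipe applies to the end-time column the question asks about; a sketch using hypothetical start/end column names:

```python
import pandas as pd

df = pd.DataFrame({"date": ["1/1/2000", "1/1/2000"],
                   "start": ["23:59", "24:50"],
                   "end": ["25:10", "26:30"]})

# to_timedelta happily accepts hour values past 23, rolling into the next day
base = pd.to_datetime(df["date"], format="%m/%d/%Y")
df["start_dt"] = base + pd.to_timedelta(df["start"] + ":00")
df["end_dt"] = base + pd.to_timedelta(df["end"] + ":00")
print(df[["start_dt", "end_dt"]])
```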

Data parsing in pandas, python

I have an excel file with many columns, one of them, 'Column3' is date with some text in it, basically it looks like that:
26/05/20
XXX
YYY
12/05/2020
The data is written in DD/MM/YY format, but pandas, just like Excel, thinks that 12/05/2020 is 5 Dec 2020 while it is actually 12 May 2020. (My Windows is set to the American date format.)
Important note: when I open the original Excel file, cells like 12/05/2020 are already of Date type; converting one to text gives me 44170, which yields the wrong date if I simply reformat it into DD/MM/YY.
I added this line of code:
import pandas as pd
dateparse = lambda x: pd.datetime.strptime(x,'%d/%m/%y')
df = pd.read_excel("my_file.xlsx", parse_dates=['Column3'], date_parser=dateparse)
But the text in the column generates an error.
ValueError: time data 'XXX' does not match format '%d/%m/%y'
I went a step further and manually removed all text (obviously I can't do that all the time) to see whether it works or not, but then I got the following error:
dateparse = lambda x: pd.datetime.strptime(x,'%d/%m/%y')
TypeError: strptime() argument 1 must be str, not datetime.datetime
I also tried this:
df['Column3'] = pd.to_datetime(df.Column3, format ='%d/%m/%y', errors="coerce")
# if I make errors="ignore" it doesn't change anything.
in that case my 26/05/20 was correctly converted to 26 May 2020, but I lost all my text data (that's OK) and the other dates which didn't match my format argument, because previously they were recognized as American-style dates.
My objective is to convert the data in Column3 to the same format so I could apply filters with pandas.
I think there are a couple of possible solutions:
tell pandas to not convert text to date at all (but it is already saved as Date type in the original file, will it work?)
somehow ignore text values and use the date_parser method to convert all dates to DD/MM/YY
with help of pd.to_datetime convert 26/05/20 to 26 May 2020 and then convert 2020-09-06 00:00:00 to 9 June 2020 (seems to be the simplest one, but the ignore argument doesn't work)
Here's link to small sample file https://easyupload.io/ca5p6w
You can pass a date_parser to read_excel:
dateparser = lambda x: pd.to_datetime(x, dayfirst=True)
pd.read_excel('test.xlsx', date_parser = dateparser)
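To see what dayfirst=True does with the two spellings from the question, a small sketch (parsing element-wise, since newer pandas versions refuse mixed formats in a single to_datetime call unless told otherwise):

```python
import pandas as pd

s = pd.Series(['26/05/20', '12/05/2020'])
# dayfirst=True reads both as day/month/year, regardless of 2- vs 4-digit years
parsed = s.apply(lambda x: pd.to_datetime(x, dayfirst=True))
print(parsed.tolist())  # [Timestamp('2020-05-26 00:00:00'), Timestamp('2020-05-12 00:00:00')]
```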
Posting this as an answer, since it's too long for a comment
The problem originates in Excel. If I open it in Excel, I see 2 strings that look like dates: 26/05/20, 05/12/2020 and 06/02/2020. Note the difference between the 20 and the 2020. On lines 24 and 48 I see dates in Column4. This seems to indicate how the Excel file was put together. Is this Excel assembled by copy-paste, or programmatically?
loading it with just pd.read_excel gives these results for the dates:
26/05/20
2020-12-05 00:00:00
2020-02-06 00:00:00
Running df["Column3"].apply(type)
gives me
<class 'str'>
<class 'datetime.datetime'>
<class 'datetime.datetime'>
So in the Excel file these are marked as datetime.
Loading them with df = pd.read_excel(DATA_DIR / "sample.xlsx", dtype={"Column3": str}) changes the type of all to str, but does not change the output.
If you extract the file (an .xlsx is a ZIP archive) and look at the xml file xl\worksheets\sheet1.xml directly, you see cell C26 as 44170, while C5 is 6, which is a reference to 26/05/20 in xl/sharedStrings.xml.
How do you 'make' this Excel file? This is best solved at the point where the file is put together.
Workaround
As a workaround, you can convert the dates piecemeal. The different format allows this:
format1 = "%d/%m/%y"
format2 = "%Y-%d-%m %H:%M:%S"
Then you can do pd.to_datetime(dates, format=format1, errors="coerce") to only get the first dates, and NaT for the ones not according to the format. Then you use combine_first to fill the voids.
dates = df["Column3"] # of the one imported with dtype={"Column3": str}
dates_parsed = (
pd.to_datetime(dates, format=format1, errors="coerce")
.combine_first(pd.to_datetime(dates, format=format2, errors="coerce"))
.astype(object)
.combine_first(dates)
)
The astype(object) is needed to fill in the empty places with the string values.
I think you should first import the file without date parsing, then convert the column to date format using the following:
df['column3']= pd.to_datetime(df['column3'], errors='coerce')
Hope this will work

Convert Pandas column into same date time format

I have a file where the date and time are in mixed formats as per below:
Ref_ID Date_Time
5.645217e 2020-12-02 16:23:15
5.587422e 2019-02-25 18:33:24
What I'm trying to do is convert the dates into a standard format so that I can further analyse my dataset.
Expected Outcome:
Ref_ID Date_Time
5.645217e 2020-02-12 16:23:15
5.587422e 2019-02-25 18:33:24
So far I've tried a few things like Pandas to_datetime conversion and converting the date using strptime but none has worked so far.
# Did not work
data["Date_Time"] = pd.to_datetime(data["Date_Time"], errors="coerce")
# Also Did not work
data["Date_Time"] = data["Date_Time"].apply(lambda x: datetime.datetime.strptime(x, '%m/%d/%y'))
I've also searched this site for a solution but haven't found one yet.
You could try using str.split to extract the day and month, plus some boolean testing.
This may be a bit confusing with all the variables, but all we are doing is creating new series and dataframes to manipulate the day and month of your original date-time column.
import numpy as np
import pandas as pd

# create a new dataframe with the column split on whitespace, so date and time are separate
s = df['Date_Time'].str.split(r'\s', expand=True)
# split the date into its own dataframe of year/month/day parts
m = s[0].str.split('-', expand=True).astype(int)
# use conditional logic to figure out which column is the month and which the day
m['possible_month'] = np.where(m[1].ge(12), m[2], m[1])
m['possible_day'] = np.where(m[1].ge(12), m[1], m[2])
# concat this back into the first split to re-create a proper datetime
s[0] = m[0].astype(str).str.cat([m['possible_month'].astype(str),
                                 m['possible_day'].astype(str)], '-')
df['fixed_date'] = pd.to_datetime(s[0].str.cat(s[1].astype(str), ' '),
                                  format='%Y-%m-%d %H:%M:%S')
print(df)
Ref_ID Date_Time fixed_date
0 5.645217e 2020-12-02 16:23:15 2020-02-12 16:23:15
1 5.587422e 2019-02-25 18:33:24 2019-02-25 18:33:24
print(df.dtypes)
Ref_ID object
Date_Time object
fixed_date datetime64[ns]
dtype: object

How to import date column from csv in python in format d/m/y

I have a data sheet in which issue_d is a date column having values stored in a format - 11-Dec. On clicking any cell of the column, date is coming as 12/11/2018.
But while reading the csv file, issue_d is getting imported as 11-Dec. Year is not getting imported.
How do I get the issue_d column in format- d/m/y?
Code I tried:
import pandas
data=pandas.read_csv('Project_data.csv')
print(data)
Checking the issue_d column with data['issue_d'] gives this result:
0 11-Dec
1 11-Dec
2 11-Dec
expected:
0 11-Dec-2018
1 11-Dec-2018
2 11-Dec-2018
You can use to_datetime and append the year to the column:
df['issue_d'] = pd.to_datetime(df['issue_d'] + '-2018')
print (df)
issue_d
0 2018-12-11
1 2018-12-11
2 2018-12-11
A more 'controllable' way of getting the data is to first get the datetime from the data frame as normal, and then convert it:
dt = dt.strftime('%Y-%m-%d')
In this case, you'd put %d in front. strftime is a great technique because it allows the most customization when converting a datetime variable.
After you do this, you can splice out each individual month, day, and year, and then use
strftime("%B")
to get the string-name of the month (e.g. "February").
Good Luck!
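For example, with the 11-Dec-2018 date from the question above:

```python
from datetime import datetime

d = datetime(2018, 12, 11)
print(d.strftime('%d/%m/%Y'))  # 11/12/2018
print(d.strftime('%B'))        # December
```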

python pandas incorrectly reading excel dates

I have an excel file with dates formatted as such:
22.10.07 16:00
22.10.07 17:00
22.10.07 18:00
22.10.07 19:00
After using the parse method of pandas to read the data, the dates are read almost correctly:
In [55]: nts.data['Tid'][10000:10005]
Out[55]:
10000 2007-10-22 15:59:59.997905
10001 2007-10-22 16:59:59.997904
10002 2007-10-22 17:59:59.997904
10003 2007-10-22 18:59:59.997904
What do I need to do to either a) get it to work correctly, or b) is there a trick to fix this easily? (e.g. some kind of 'round' function for datetime)
I encountered the same issue and got around it by not parsing the dates using pandas, but rather applying my own function (shown below) to the relevant column(s) of the dataframe:
import datetime as dt
import pandas as pd

def ExcelDateToDateTime(xlDate):
    # Excel's day zero is 1899-12-30 (which accounts for the 1900 leap-year bug)
    epoch = dt.datetime(1899, 12, 30)
    delta = dt.timedelta(hours=round(xlDate * 24))
    return epoch + delta

df = pd.read_csv('path')  # DataFrame.from_csv has been removed; read_csv replaces it
df['Date'] = df['Date'].apply(ExcelDateToDateTime)
Note: This will ignore any time granularity below the hour level, but that's all I need, and it looks from your example that this could be the case for you too.
Excel serializes datetimes with a ddddd.tttttt format, where the d part is an integer number representing the offset from a reference day (like Dec 31st, 1899), and the t part is a fraction between 0.0 and 1.0 which stands for the part of the day at the given time (for example at 12:00 it's 0.5, at 18:00 it's 0.75 and so on).
I asked you to upload a file with sample data. .xlsx files are really ZIP archives containing your XML-serialized worksheets. These are the dates I extracted from the relevant column. Excerpt:
38961.666666666628
38961.708333333292
38961.749999999956
When you try to manually deserialize, you get the same datetimes as pandas. Unfortunately, the way Excel stores times makes it impossible to represent some values exactly, so you have to round them for display purposes. I'm not sure whether rounded data is needed for analysis, though.
This is the script I used to verify that the deserialized datetimes really are the same ones as pandas produces:
from datetime import date, datetime, time, timedelta
from urllib.request import urlopen

def deserialize(text):
    tokens = text.split(".")
    date_tok = tokens[0]
    time_tok = tokens[1] if len(tokens) == 2 else "0"
    d = date(1899, 12, 31) + timedelta(int(date_tok))
    t = time(*helper(float("0." + time_tok), (24, 60, 60, 1000000)))
    return datetime.combine(d, t)

def helper(factor, units):
    result = list()
    for unit in units:
        value, factor = divmod(factor * unit, 1)
        result.append(int(value))
    return result

url = "https://gist.github.com/RaffaeleSgarro/877d7449bd19722b44cb/raw/" \
      "45d5f0b339d4abf3359fe673fcd2976374ed61b8/dates.txt"

for line in urlopen(url):
    print(deserialize(line.decode().strip()))
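As a side note, current pandas can round those near-miss timestamps directly with Series.dt.round, which may be simpler than manual deserialization if sub-second noise is all you need to remove; a sketch:

```python
import pandas as pd

s = pd.Series(pd.to_datetime(['2007-10-22 15:59:59.997905',
                              '2007-10-22 16:59:59.997904']))
rounded = s.dt.round('s')  # round to the nearest whole second
print(rounded.tolist())
```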
