I have a list of times in h:m format in an Excel spreadsheet, and I'm trying to do some manipulation with DataNitro but it doesn't seem to like the way Excel formats times.
For example, in Excel the time 8:32 is actually just the decimal number .355556 formatted to appear as 8:32. When I access that time with DataNitro it sees it as the decimal, not the string 8:32. If I change the format in Excel from Time to General or Number, it converts it to the decimal (which I don't want). The only thing I've found that works is manually going through each cell and placing ' in front of each one, then going through and changing the format type to General.
Is there any way to convert these times in Excel into strings so I can extract the info with DataNitro (which is only viewing it as a decimal)?
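If .355556 (displayed as 8:32) is in A1, then =HOUR(A1)&":"&MINUTE(A1) followed by Copy / Paste Special > Values should get you a string. Note that MINUTE is not zero-padded, so 8:05 would come out as 8:5; =TEXT(A1,"h:mm") avoids that.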
Ideally, you probably don't want to change the way Excel stores the data (depending on your use case, of course).
If that is the case, there is an excellent post, "How do I read a date in Excel format in Python?", that explains how to convert the float to a Python datetime object. Specifically, the script by @John Machin works great.
import datetime
def minimalist_xldate_as_datetime(xldate, datemode):
    # datemode: 0 for 1900-based, 1 for 1904-based
    return (
        datetime.datetime(1899, 12, 30)
        + datetime.timedelta(days=xldate + 1462 * datemode)
    )
Note his disclaimer: "Here's the bare-knuckle no-seat-belts use-at-own-risk version:". I have used it with no problems.
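If all you need is the h:mm string from the question, a smaller helper may be enough. This is a minimal sketch, assuming the cell holds only a time (a fraction of a day) and that rounding to the nearest minute is acceptable; the name excel_time_to_hm is made up for illustration and is not part of DataNitro or xlrd.

```python
def excel_time_to_hm(day_fraction):
    """Format the time-of-day part of an Excel serial number as 'h:mm'."""
    minutes_per_day = 24 * 60
    # Keep only the fractional (time) part, then round to the nearest minute.
    # The final modulo wraps a value that rounds up to 24:00 back to 0:00.
    total_minutes = round((day_fraction % 1) * minutes_per_day) % minutes_per_day
    hours, minutes = divmod(total_minutes, 60)
    return f"{hours}:{minutes:02d}"


print(excel_time_to_hm(0.355556))  # 8:32
```

The minutes are zero-padded, so 8:05 stays 8:05 rather than becoming 8:5.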
(This is a mix between a code issue and a 'user' issue, but since I suspect the issue is code, I opted to post on Stack Overflow instead of Super User.)
I generated a .csv file with the pandas.DataFrame.to_csv() method. The file consists of 2 columns: one is a label (text) and the other is a numeric value called accuracy (float). The delimiter used to separate columns is a comma (,) and all float values are stored with a dot as the decimal separator, like this: 0.9438245862
Even though this column is saved as float, Excel and Google Sheets infer its type as text. And when I try to format the column as a number, they ignore the "0." and return a very large value instead of a decimal, like:
(text) 0.9438245862 => (number) 9438245862,00
I double-checked my .csv file by reimporting it with pandas.read_csv() and printing dataframe.dtypes, and the column is imported as float successfully.
I'd appreciate some guidance on what I am missing.
Thanks,
By itself, the CSV file should be correct. Both you and pandas know what the delimiter and floating-point format are. But Excel might not agree with you, depending on your locale. A simple way to make sure is to create a tiny Excel sheet whose first row contains one text value and one floating-point value. You then export the sheet as CSV and check which delimiter and floating-point format it uses.
AFAIK, it is much easier to change your Python code to follow what your Excel expects than to try to explain to Excel that the format of CSV files can vary...
I know that you can change the delimiter and floating-point format in the current locale on a Windows system. It is simply a global setting...
A short example of data would be most useful here. Otherwise we have no idea what you're actually writing/reading. But I'll hazard a guess based on the information you've provided.
The pandas dataframe will have column names. These column names will be text. Unless you tell Excel/Sheets to use the first row as the column names, it will have to treat the column as text. If this isn't the case, could you perhaps save the head of the dataframe to a CSV, check it in a text editor, and see how Excel/Sheets imports it? Then include those five rows and two columns in your follow-up.
The coding is not necessarily the issue here, but a combination of various factors. I am assuming that your computer is not using the dot character as a decimal separator, due to your language settings (for example, French, Dutch, etc). Instead your computer (and thus also Excel) is likely using a comma as a decimal separator.
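To see how a locale mismatch can produce the value from the question, here is a toy model of locale-dependent number parsing. It is only a sketch under stated assumptions: parse_number is a hypothetical helper, not Excel's or Google Sheets' actual import logic, and it simply drops the thousands separator before converting.

```python
def parse_number(text, decimal=".", thousands=","):
    """Parse a number using the given decimal and thousands separators."""
    # Drop grouping characters first, then normalise the decimal mark to '.'.
    cleaned = text.replace(thousands, "").replace(decimal, ".")
    return float(cleaned)


# Dot-decimal reading (what pandas wrote):
print(parse_number("0.9438245862"))  # 0.9438245862

# Comma-decimal locale: the dot is taken as a thousands separator and dropped.
print(parse_number("0.9438245862", decimal=",", thousands="."))  # 9438245862.0
```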
If you want to open the data of your analysis / work later with Excel with little to no changes, you can either opt to change how Excel works or how you store the data to a CSV file.
Choosing the latter, you can specify the decimal character for the df.to_csv method via its "decimal" keyword. Remember that you then also have to specify the decimal character when importing the data (if you want to read it back in).
Continuing with the approach of adapting your Python code, you can use the following snippet to change how you write the dataframe to a CSV file:
import pandas as pd
# ... some transformations here ...
# sep=';' keeps the decimal comma from clashing with the default ',' delimiter
df.to_csv('myfile.csv', sep=';', decimal=',')
If you then want to read that output file back in with Python (using pandas), you can use the following:
import pandas as pd
df = pd.read_csv('myfile.csv', sep=';', decimal=',')
I'm using to_csv to save a dataframe which looks like this:
PredictionIdx CustomerInterest
0 fe789a06f3 0.654059
1 6238f6b829 0.654269
2 b0e1883ce5 0.666289
3 85e07cdd04 0.664172
in which I have the value '0e15826235' in the first column. I'm writing this dataframe to CSV using pandas to_csv(). But when I open the CSV in Google Sheets/Excel or LibreOffice, it shows 0E in Excel and 0 in LibreOffice. This is causing problems with my Kaggle submission. One point to note is that when I read the same CSV with pandas read_csv, the value shows up correctly in the dataframe.
As noted in the first comment, the error results from your choice of editor. Many editors apply some version of scientific notation that reads an e (in specific places, such as the second character) as the marker of an exponent. Excel, for instance, reads XeY as "X times 10 raised to the power Y", where X is the number before the e and Y is the number after it, so 0e15826235 is taken to be 0 × 10^15826235, which is 0. This is a brief description of Excel's scientific notation.
This does not happen in the other cell entries because they contain other non-numeric characters. Excel, LibreOffice, and possibly Google Sheets attempt to interpret what the entry is, rather than taking it literally.
In your question you write '0e15826235' with single quotes, indicating that it might be a string, but this might be something to make sure of when writing out the values to a file -- Excel and the rest might not know this is meant to be a string literal.
In general, check for the format of the value and consider what your eventual editor might "think" it is when it opens. For Excel specifically, a single quote character at the start of the string will force Excel to read it as a string. See this answer.
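One way to find the values at risk before writing the file is to check which IDs happen to parse as numbers. The sketch below uses Python's float() as a stand-in for the spreadsheet's guess; that is an assumption, since each spreadsheet has its own rules, and looks_like_number is a name invented for this example.

```python
def looks_like_number(text):
    """Return True if the string parses as a float, e.g. '0e15826235'."""
    try:
        float(text)
    except ValueError:
        return False
    return True


ids = ["fe789a06f3", "6238f6b829", "b0e1883ce5", "85e07cdd04", "0e15826235"]
at_risk = [value for value in ids if looks_like_number(value)]
print(at_risk)  # ['0e15826235']
```

Only the ID made up of digits around a single e is flagged; the others contain extra letters and cannot be read as a number.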
For me, the code below works correctly with Google Sheets:
import pandas as pd
df = pd.DataFrame({'PredictionIdx': ['fe789a06f3',
                                     '6238f6b829',
                                     'b0e1883ce5',
                                     '85e07cdd04'],
                   'CustomerInterest': [0.654059,
                                        0.654269,
                                        0.666289,
                                        0.664172]})
df.to_csv('./test.csv', index=False)
Also, CSV is a very simple text format; it doesn't hold any information about data types.
So you could use df.to_excel() as Nihal suggested, or adjust the column type settings in your favourite spreadsheet viewer.
I have a Python 3 script that is loading some data into an Excel file on a Windows machine. I need the cell itself, not just the number, to be formatted as Currency.
I can use the following format to set the Number format for a cell:
sheet['D48'].number_format = '#,##0'
However, when I try a similar approach using the number format for Currency:
sheet['M48'].number_format = '($#,##0.00_);[Red]($#,##0.00)'
This is what I get for the custom format. Notice the extra backslashes: they are added to the format, so it no longer matches the predefined Currency style.
(\$#,##0.00_);[Red](\$#,##0.00)
I have seen this question and used it to get this far. However the answer does not solve the extra backslash issue I am seeing.
Set openpyxl cell format to currency
I just formatted the value before placing it into the cell.
>>> "${:10,.2f}".format(7622086.82)
'$7,622,086.82'
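A slightly fuller version of this pre-formatting idea, which also wraps negative amounts in parentheses like the ($#,##0.00) format in the question, might look like the sketch below. format_currency is a hypothetical helper. Keep in mind that the result is a Python string, so the cell will hold text rather than a number carrying a Currency format.

```python
def format_currency(amount):
    """Render a number as US-style currency text, e.g. '$1,234.50'."""
    text = f"${abs(amount):,.2f}"
    # Negative amounts are shown in parentheses, accounting style.
    return f"({text})" if amount < 0 else text


print(format_currency(7622086.82))  # $7,622,086.82
print(format_currency(-1234.5))     # ($1,234.50)
```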
I formatted the cell in Excel, and then copied the format.
This worked for me
.number_format = '[$$-409]#,##0.00;[RED]-[$$-409]#,##0.00'
I am using DataNitro in my spreadsheet. When I write a value to a cell, it automatically guesses whether the value looks like a date. This is obviously not always helpful!
dt_str = "08/20/13"
Cell("A1").value = dt_str
# puts date type in that cell
I am not sure whether this behaviour comes from Excel 2010 or from DataNitro. As I am writing this, I am getting more convinced that this is an Excel issue. Does anybody have experience with this?
I have done some more research and am almost convinced it is an Excel issue. The solution when entering data directly is to start the cell with a '. This is obviously(?) not possible if I come in from Python.
This is an Excel issue, and putting a single quote at the beginning is correct. You can do that as long as you use double quotes to delimit the string:
Cell("A1").value = "'10/1/2013"
I'm working with some code that reads data from xlsx files by parsing the XML. It is all pretty straightforward, with the exception of date cells.
Dates are stored as integers and have an "s" attribute that is an index into the stylesheet, which can be used to get a date formatting string. Here are some examples from a previous stackoverflow question that is linked below:
19 = 'h:mm:ss AM/PM';
20 = 'h:mm';
21 = 'h:mm:ss';
22 = 'm/d/yy h:mm';
These are the built-in date formatting strings from the OOXML standard; however, it seems like Excel tends to use custom format strings instead of the built-ins. Here is an example format from an Excel 2007 spreadsheet. A numFmtId of 164 or greater is a custom format.
<numFmt formatCode="MM/DD/YY" numFmtId="165"/>
Determining if a cell should be formatted as a date is difficult because the only indicator I can find is the formatCode. This one is obviously a date, but cells could be formatted any number of ways. My initial attempt is to look for Ms, Ds, and Ys in the formatCode, but that seems problematic.
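For reference, the lookup described here (a cell's "s" attribute indexes the cellXfs list, whose entries carry a numFmtId) can be sketched with the standard library. This is a minimal sketch, assuming the styles part uses the usual spreadsheetml main namespace; SAMPLE_STYLES is a hand-written fragment rather than a real workbook's styles.xml, and style_index_to_format is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

NS = {"m": "http://schemas.openxmlformats.org/spreadsheetml/2006/main"}

SAMPLE_STYLES = """
<styleSheet xmlns="http://schemas.openxmlformats.org/spreadsheetml/2006/main">
  <numFmts count="1">
    <numFmt formatCode="MM/DD/YY" numFmtId="165"/>
  </numFmts>
  <cellXfs count="2">
    <xf numFmtId="0"/>
    <xf numFmtId="165"/>
  </cellXfs>
</styleSheet>
"""


def style_index_to_format(styles_xml):
    """Return (numFmtId, custom formatCode or None) for each cellXfs entry."""
    root = ET.fromstring(styles_xml)
    # Custom format codes declared in this workbook, keyed by numFmtId.
    custom = {
        int(el.get("numFmtId")): el.get("formatCode")
        for el in root.findall("m:numFmts/m:numFmt", NS)
    }
    formats = []
    for xf in root.findall("m:cellXfs/m:xf", NS):
        fmt_id = int(xf.get("numFmtId", "0"))
        formats.append((fmt_id, custom.get(fmt_id)))
    return formats


print(style_index_to_format(SAMPLE_STYLES))  # [(0, None), (165, 'MM/DD/YY')]
```

A cell with s="1" would then map to the custom code MM/DD/YY, while built-in ids come back with None and need a separate table.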
Has anybody had any luck with this problem? It seems like the standard excel reading libraries are lacking in xlsx support at this time. I've read through the standards and have dug through a lot of xlsx files without much luck.
The best information seems to come from this stackoverflow question:
what indicates an office open xml cell contains a date time value
Thanks!
Dates are stored as integers
In the Excel data model, there is really no such thing as an integer. Everything is a float. Dates and datetimes are floats, representing days and a fraction since a variable epoch. Times are fractions of a day.
It seems like the standard excel reading libraries are lacking in xlsx support at this time.
google("xlsxrd"). To keep up to date, join the python-excel group.
Edit: I see that you have already asked a question there. If you had asked a question there as specific as this one, or responded to my request for clarification, you would have had this info over two weeks ago.
Have a look at the xlrd documentation. Up the front there is a discussion on Excel dates. All of it applies to Excel 2007 as well as earlier versions. In particular: it is necessary to parse custom formats. It is necessary to have a table of "standard" format indexes which are for date formats. "Standard" formats listed in some places don't include the formats used in CJK locales.
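The table of "standard" date format indexes can be as small as a set of ids. The sketch below covers only the built-in ids that denote dates or times in Western locales (14 to 22 and 45 to 47); as noted above, CJK locales use further ids that are deliberately left out here. The names are illustrative and are not taken from xlrd.

```python
# Built-in numFmtId values for date/time formats in Western locales:
# 14-22 (e.g. 20 = 'h:mm', 22 = 'm/d/yy h:mm') and 45-47 (e.g. 45 = 'mm:ss').
BUILTIN_DATE_FORMAT_IDS = frozenset(range(14, 23)) | {45, 46, 47}


def is_builtin_date_format(num_fmt_id):
    """Return True if a built-in numFmtId denotes a date or time format."""
    return num_fmt_id in BUILTIN_DATE_FORMAT_IDS


print(is_builtin_date_format(20))  # True
print(is_builtin_date_format(2))   # False
```

Custom ids (164 and up) are not covered by this table and still require parsing the format code itself.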
Options for you:
(1) Borrow from the xlrd source code, including the xldate_as_tuple function.
(2) Option (1) + Get the xlsxrd bolt-on kit and borrow from its source code.
(3) [Recommended] Get the xlsxrd bolt-on kit and use it ... you get a set of APIs that operate across Excel versions 2.0 to 2007 and Python versions 2.1 to 2.7.
It isn't enough simply to look for Ms, Ds, and Ys in the number format code:
[Red]#,##0 ;[Yellow](#,##0)
is a perfectly valid number format, which contains both Y and D, but isn't a date format. I specifically test for any of the standard date/time formatting characters ('y', 'm', 'd', 'H', 'i', 's') that are outside of square brackets ('[' ']').
Even then, I was finding that a few false positives were slipping through, mainly associated with accounting and currency formats. Because these typically begin with either an underscore ('_') or a zero followed by a space ('0 '), neither of which I've ever encountered in a date format, I explicitly filter these values out.
A part of my (PHP) code for determining if a format mask is a date or not:
private static $possibleDateFormatCharacters = 'ymdHis';
// Typically number, currency or accounting (or occasionally fraction) formats
if ((substr($pFormatCode,0,1) == '_') || (substr($pFormatCode,0,2) == '0 ')) {
    return false;
}
// Try checking for any of the date formatting characters that don't appear within square brackets
if (preg_match('/(^|\])[^\[]*['.self::$possibleDateFormatCharacters.']/i',$pFormatCode)) {
    return true;
}
// No date...
return false;
I'm sure that there may still be exceptions that I'm missing, but (if so) they are probably extreme cases.
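For readers working in Python, the same heuristic can be ported almost line for line. This is a sketch of the check above and not code from any Python library; is_date_format_code is a name chosen for this example, and it inherits the same blind spots as the PHP version.

```python
import re

DATE_FORMAT_CHARS = "ymdHis"

# A date character that follows either the start of the string or a closing
# bracket, with no opening bracket in between (case-insensitive).
_DATE_CHAR_OUTSIDE_BRACKETS = re.compile(
    r"(^|\])[^\[]*[" + DATE_FORMAT_CHARS + "]", re.IGNORECASE
)


def is_date_format_code(format_code):
    """Guess whether an Excel number format code describes a date or time."""
    # Typically number, currency or accounting (or occasionally fraction) formats.
    if format_code.startswith("_") or format_code.startswith("0 "):
        return False
    return _DATE_CHAR_OUTSIDE_BRACKETS.search(format_code) is not None


print(is_date_format_code("MM/DD/YY"))                     # True
print(is_date_format_code("[Red]#,##0 ;[Yellow](#,##0)"))  # False
```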