Pandas DataFrame Float64 - Point Precision - python

I have an Excel file that contains, among other columns, one with float numbers.
The problem is: when I convert its values to strings, Pandas seems to lose the decimal precision for some of them.
The flow is described below:
Action - Reading Data:
df = pd.read_excel(io=_file, sheet_name="History", header=4, skiprows=[5, 6, 7]).fillna("")
Action - Converting:
test = df.assign(STRING_VALUES=df.SOURCE_VALUES.map(lambda x: str(x)))
Action - Checking Pandas Decimal Pattern:
pd.get_option("styler.format.decimal")
Result: ',' (that's correct).
Action - The Reality
Highlights:
My job is to check whether SOURCE_VALUES contains values with more than two decimal places. With this behavior, I am getting false-positive reports.
I am in Brazil, so we use the same format for (currency) float numbers as Europeans do: "." for thousands and "," for decimals. Apparently, the default format presented by Pandas is correct.
Conclusion:
As shown in my image, not all values lose their decimal precision. What's the workaround for this behavior?
Conclusion 2:
I have now received the real source file, which is XML. By the time the spreadsheet reached me, it was no longer in that format but in xlsx, so Excel had applied its standard display mask for floating-point numbers.
That is why I was puzzled that, after converting to string, some values had the expected decimals and others did not.
But now I am sure that it's not Pandas' fault, and the real values are "correct".
Final Thoughts
The most important lesson here is to be aware of the origin of the data, so that you don't mistrust your results!

It looks like the values in Excel are formatted to display as 0.00, but the underlying data is different. For example, if you change the format of the value 4556.38 to display more digits, you will see 4556.379999... . Pandas reads the real value, not the formatted one.
To control float to str conversion format, maybe use format method (https://www.w3schools.com/python/ref_string_format.asp)
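A minimal sketch of both points, using only the standard library. The helper name has_more_than_two_decimals and the tolerance of 1e-9 are assumptions made for illustration, not something from the question; pick a tolerance that suits your data.

```python
import math
from decimal import Decimal

# A float such as 4556.38 is stored as the nearest binary value,
# so its exact expansion is not exactly 4556.38.
exact = Decimal(4556.38)
is_exactly_two_decimals = exact == Decimal("4556.38")

# Explicit formatting controls how many decimals end up in the string.
formatted = f"{4556.379999999999:.2f}"


def has_more_than_two_decimals(value, tol=1e-9):
    """Hypothetical helper: True if value is further than tol from its 2-decimal rounding."""
    return not math.isclose(value, round(value, 2), rel_tol=0.0, abs_tol=tol)


# Stored-value noise is ignored, a genuine third decimal is reported.
checks = [has_more_than_two_decimals(v) for v in (4556.38, 4556.379999999999, 4556.385)]
```

Comparing against the rounded value with a tolerance, rather than counting characters in str(x), is what avoids the false positives caused by values like 4556.379999999999.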

Related

Convert csv to csv, removing scientific notation from one column

I'm starting with a CSV exported from a system with 3 columns, the first column is displaying a number in scientific notation. I need to transform only that column to a number and save to another CSV. Note there are thousands of lines, converting using Excel is not an option.
I have found many articles close to this, using "float", using "round", but I haven't found anything that can handle a large file.
Example, file1.csv:
ID, Phone, Email
1.23E+15, 123-456-7890, johnsmith@test.com
Need the output to file2.csv:
ID, Phone, Email
1234680000000000, 123-456-7890, johnsmith@test.com
I know I'm way off, but this may give you an idea of what I'm trying to accomplish...
import pandas
import numpy as np
pandas.read_csv('file1.csv', dtype=np.float64)
df = df.apply(pd.to_numeric, errors='coerce')
df.round(0)
df.to_csv(float_format='file2.csv')
Here is the error I receive:
error
The text in your CSV file, "1.23E+15", means "one-point-two-three times ten raised to the 15th power"... that's all Python, Pandas, or anything else (except you) can know about that number.
I say "but you", because you seem to know that before "1.23E+15", there was the value 1234680000000000.
But, then some other program/process chopped off the "46800..." part and all it left was "1.23E+15"—something decreased the precision of the original value.
That's why @TimRoberts asked "How was this generated?" To get back 1234680000000000, you need to go to the program/process that last had that higher-precision value and try to change that program/process to not decrease the precision of the number.
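For what it's worth, expanding the notation itself is easy; it is only the dropped digits that cannot be recovered. A sketch using the standard library, assuming the three-column layout from the question (the in-memory strings stand in for file1.csv and file2.csv):

```python
import csv
import io
from decimal import Decimal

# Stand-in for file1.csv, using the sample row from the question.
source = io.StringIO("ID, Phone, Email\n1.23E+15, 123-456-7890, johnsmith@test.com\n")
target = io.StringIO()  # stand-in for file2.csv

reader = csv.reader(source, skipinitialspace=True)
writer = csv.writer(target, lineterminator="\n")

writer.writerow(next(reader))  # copy the header row unchanged
for row in reader:
    # Decimal parses the text exactly; int() then writes it without an exponent.
    row[0] = str(int(Decimal(row[0])))
    writer.writerow(row)

result = target.getvalue()
```

The ID comes out as 1230000000000000, not 1234680000000000, which is exactly the point made above. Because rows are processed one at a time, the same loop should cope with a large file when real file handles are used instead of StringIO.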

CSV cannot be interpreted by numeric values

(This is a mix of a code issue and a 'user' issue, but since I suspect the issue is code, I opted to post on StackOverflow instead of SuperUser.)
I generated a .csv file with the pandas.DataFrame.to_csv() method. The file consists of 2 columns: one is a label (text) and the other is a numeric value called accuracy (float). The delimiter used to separate columns is a comma (,) and all float values are stored with a dot as the decimal separator, like this: 0.9438245862
Even though this column is saved as float, Excel and Google Sheets infer its type as text. And when I try to format the column as a number, they ignore the "0." and return a very large value instead of a decimal, like:
(text) 0.9438245862 => (number) 9438245862,00
I double-checked my .csv file by reimporting it with pandas.read_csv() and printing dataframe.dtypes, and the column is imported as float successfully.
I'd appreciate some guidance on what I am missing.
Thanks,
By itself, the csv file should be correct. Both you and Pandas know what the delimiter and floating point format are. But Excel might not agree with you, depending on your locale. A simple way to make sure is to create a tiny Excel sheet whose first row contains one text value and one floating point value. Then export the file as csv and check which delimiter and floating point format it uses.
AFAIK, it is much easier to change your Python code to follow what your Excel expects than to try to explain to Excel that the format of CSV files can vary...
I know that you can change the delimiter and floating point format of the current locale on a Windows system, but that is a global setting...
A short example of data would be most useful here. Otherwise we have no idea what you're actually writing/reading. But I'll hazard a guess based on the information you've provided.
The pandas dataframe will have column names. These column names will be text. Unless you tell Excel/Sheets to use the first row as the column name, it will have to treat the column as text. If this isn't the case, could you perhaps save the head of the dataframe to a csv, check it in a text editor, and see how Excel/Sheets imports it. Then include those five rows and two columns in your follow up.
The coding is not necessarily the issue here, but a combination of various factors. I am assuming that your computer is not using the dot character as a decimal separator, due to your language settings (for example, French, Dutch, etc). Instead your computer (and thus also Excel) is likely using a comma as a decimal separator.
If you want to open the data of your analysis / work later with Excel with little to no changes, you can either opt to change how Excel works or how you store the data to a CSV file.
Choosing the latter, you can specify the decimal character for the df.to_csv method via its "decimal" keyword. Remember that you then also have to specify the decimal character when importing the data (if you want to read it back in).
Continuing with the approach of adapting your Python code, you can use the following snippet to change how you write the dataframe to a csv:
import pandas as pd
# ... some transformations here ...
df.to_csv('myfile.csv', decimal=',')
If you, then, want to read that output file back in with Python (using Pandas), you can use the following:
import pandas as pd
df = pd.read_csv('myfile.csv', decimal=',')
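One thing to watch with the snippets above: the field separator is still a comma by default, so it collides with the decimal comma. Excel installations that use a decimal comma commonly expect a semicolon between fields, but that depends on the regional settings, so treat sep=";" below as an assumption to verify on your machine. The column names and values are made up for illustration.

```python
import io

import pandas as pd

# Made-up sample with the two columns described in the question.
df = pd.DataFrame({"label": ["model_a", "model_b"], "accuracy": [0.9438245862, 0.5]})

# Semicolon between fields, comma inside numbers: the two no longer clash.
csv_text = df.to_csv(sep=";", decimal=",", index=False)

# Reading the text back needs the same two settings.
round_trip = pd.read_csv(io.StringIO(csv_text), sep=";", decimal=",")
```

Passing a file path instead of relying on the returned string works the same way; the string form is used here only so the result is easy to inspect.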

Converting schemas via pandas vs pyarrow

I have a dataframe in pandas that I want to write out as a parquet file using pyarrow.
I also need to be able to specify column types. If I change the type via pandas, I get no error; but when I change it via pyarrow, I get an error. See the examples:
Given
import pandas as pd
import pyarrow as pa
data = {"col": [86002575]}
df = pd.DataFrame(data)
Via Pandas
df = df.astype({"col": "float32"})
table = pa.Table.from_pandas(df)
No errors
Via PyArrow
schema = pa.Schema.from_pandas(df)
i = schema.get_field_index("col")
schema = schema.set(i, pa.field("col", pa.float32()))
table = pa.Table.from_pandas(df, schema=schema)
get error:
pyarrow.lib.ArrowInvalid: ('Integer value 86002575 not in range: -16777216 to 16777216', 'Conversion failed for column col with type int64')
I don't even recognize that range either. Is it trying to do some intermediary conversion when converting between the two?
When converting from one type to another, arrow is much stricter than pandas.
In your case you are converting from int64 to float32. Because there are limits to the exact representation of whole numbers in floating point, arrow limits the range you can convert to ±16777216 (2^24). Past that limit, float32 can no longer represent every integer, so if you were to convert the float value back to an int, you are not guaranteed to get the same value.
You can easily ignore these checks though:
schema_float32 = pa.schema([pa.field("col", pa.float32())])
table = pa.Table.from_pandas(df, schema=schema_float32, safe=False)
EDIT:
It's not documented explicitly in arrow. It's common software engineering knowledge.
From Wikipedia:
Any integer with absolute value less than 2^24 can be exactly
represented in the single precision format, and any integer with
absolute value less than 2^53 can be exactly represented in the double
precision format. Furthermore, a wide range of powers of 2 times such
a number can be represented. These properties are sometimes used for
purely integer data, to get 53-bit integers on platforms that have
double precision floats but only 32-bit integers.
2^24 = 16777216
It's not very well documented in arrow. You can look at the code
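The 2^24 limit can be checked without arrow at all. This is a small sketch using the standard struct module to round-trip numbers through IEEE 754 single precision; to_float32 is a hypothetical helper name, not an arrow or pandas function.

```python
import struct


def to_float32(value):
    """Hypothetical helper: round-trip a number through IEEE 754 single precision."""
    return struct.unpack("f", struct.pack("f", float(value)))[0]


limit = 2 ** 24                       # 16777216, the bound in the error message
fits = to_float32(limit) == limit     # the limit itself is still exact
next_up = to_float32(limit + 1)       # 16777217 has no float32 representation
from_question = to_float32(86002575)  # the value from the question
```

Here from_question comes back as 86002576.0 rather than 86002575, which is the silent change that arrow's check refuses to make and that safe=False allows.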

astype('float') changes data, not just data type

I download a bunch of csv-files from an aws s3-bucket and put them in a dataframe. Before uploading the dataframe to sql server I would like to change the columns of the dataframe to have the right datatypes.
When I run astype('float64') on a column I want to change, it changes not only the datatype but also the data.
Code:
df['testcol'] = df['lineId'].astype('float64')
pycharm image of the result
I attached a picture to visualize the error. As you can see the data in the third column (testcol) is different to the data in the second column (lineId) even though only the datatype should be changed.
A pl_id can have multiple lineIds; that's why I added and sorted by pl_id in the picture.
Am I using astype() wrong or is this a pandas bug?
Basically it seems that float64 is not sufficient to hold that long integer:
np.float64('211052094743748628')
Out[135]: 2.1105209474374864e+17
"The max precision a float 64 can reach is close to 10^-16 (unit in the last place (ULP), see en.wikipedia.org/wiki/Floating-point_arithmetic) so the idea of an exact decimal value with significantly more than 16 digits for a floating point is misleading."
Numpy float64 vs Python float
Consider maybe using int64, which can be more suitable for the size of Id in your dataset:
np.int64('211052094743748628')
Out[150]: 211052094743748628
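To see how far off the float version is, here is a quick check with plain Python floats, which use the same 64-bit double layout as float64. The variable names are illustrative only.

```python
line_id = 211052094743748628

exact_limit = 2 ** 53            # every integer up to here survives float64 unchanged
too_big = line_id > exact_limit  # this id is far beyond that limit

as_float = float(line_id)        # same storage format as numpy's float64
round_tripped = int(as_float)    # nearest representable value, not the original id
lost = round_tripped - line_id   # size of the rounding error
```

At this magnitude neighbouring float64 values are 32 apart, so the id is rounded to the nearest one. If the column also contains missing values, pandas' nullable "Int64" dtype may be worth a look, since plain int64 cannot hold NaN.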

not able to read currency symbol from the cell using pandas python

I am using pandas.read_excel(file) to read the file, but instead of getting numbers with the currency symbol, I get only the numbers, without the currency symbol.
Help will be appreciated.
Thanks
When the Excel file is read by Pandas it reads the underlying value of the cell, which fundamentally is either a string or a number. Things like currency symbols are just applied as formatting by Excel: they affect what you see on screen but don't actually 'exist' in the cell value. For example the number 1.1256 might appear as 1.1 if you select one decimal place, or £1.13 if you select currency; it could even appear as a date, but fundamentally it is just 1.1256 and this is the value Pandas reads. Generally this is useful because with numbers you can perform arithmetic, whereas if pandas imported '£1.13' you would be unable to do any arithmetic (for example add this to another amount of money) until you had removed the £ symbol and converted to a decimal, and in so doing you would have lost some precision (as the number is now 1.13 not 1.1256).
If you need to view the data with the currency symbol, I'd suggest just adding it on at the last moment, for example if you print the data to screen you could use print(f'£{your_number}')
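As a rough sketch of that "format at the last moment" idea: as_currency is a hypothetical helper, and the two-decimal, comma-grouped layout is an assumption about how you want the amount displayed.

```python
def as_currency(amount, symbol="£"):
    """Hypothetical helper: build a display string; keep the float for arithmetic."""
    return f"{symbol}{amount:,.2f}"


price = 1.1256
total = price * 3               # arithmetic uses the full-precision value
label = as_currency(price)      # display string with two decimals
big_label = as_currency(1234567.891)
```

Only the strings are rounded; price and total keep all their digits for further calculations.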
