I have a pandas dataframe that I want to write out as a Parquet file using pyarrow.
I also need to be able to specify the column types. If I change the type via pandas, I get no error; but when I change it via pyarrow, I get an error. See the examples:
Given
import pandas as pd
import pyarrow as pa
data = {"col": [86002575]}
df = pd.DataFrame(data)
Via Pandas
df = df.astype({"col": "float32"})
table = pa.Table.from_pandas(df)
No errors
Via PyArrow
schema = pa.Schema.from_pandas(df)
i = schema.get_field_index("col")
schema = schema.set(i, pa.field("col", pa.float32()))
table = pa.Table.from_pandas(df, schema=schema)
I get this error:
pyarrow.lib.ArrowInvalid: ('Integer value 86002575 not in range: -16777216 to 16777216', 'Conversion failed for column col with type int64')
I don't recognize that range either. Is it doing some intermediary conversion when going between the two?
When converting from one type to another, arrow is much stricter than pandas.
In your case you are converting from int64 to float32. Because there are limits to the exact representation of whole numbers in floating point, arrow restricts the range you can convert to ±16777216. Past that limit the float precision degrades, and if you were to convert the float value back to an int, you would not be guaranteed to get the same value.
You can easily ignore these checks though:
schema_float32 = pa.schema([pa.field("col", pa.float32())])
table = pa.Table.from_pandas(df, schema=schema_float32, safe=False)
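Once the conversion succeeds, writing the table out is a single call to pyarrow.parquet (the output path below is just a placeholder):
import pyarrow.parquet as pq

# Write the converted table to disk
pq.write_table(table, "out.parquet")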
EDIT:
It's not documented explicitly in arrow; it's common software engineering knowledge.
In wikipedia:
Any integer with absolute value less than 2^24 can be exactly
represented in the single precision format, and any integer with
absolute value less than 2^53 can be exactly represented in the double
precision format. Furthermore, a wide range of powers of 2 times such
a number can be represented. These properties are sometimes used for
purely integer data, to get 53-bit integers on platforms that have
double precision floats but only 32-bit integers.
2^24 = 16777216
It's not very well documented in arrow itself, but you can look at the code.
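As a quick sanity check of that limit, a small sketch using numpy:
import numpy as np

np.float32(16777216)       # 16777216.0, still exact
np.float32(16777217)       # 16777216.0, the next integer already collapses
int(np.float32(16777217))  # 16777216, round-tripping changes the value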
Related
To be precise, I have an Excel file that, among other columns, has one with float numbers.
The problem is: when trying to convert its values to string, pandas loses the decimal precision for some of them.
The flow is described below:
Action - Reading Data:
df = pd.read_excel(io=_file, sheet_name="History", header=4, skiprows=[5, 6, 7]).fillna("")
Action - Converting:
test = df.assign(STRING_VALUES=df.SOURCE_VALUES.map(lambda x: str(x)))
Action - Checking Pandas Decimal Pattern:
pd.get_option("styler.format.decimal")
Result: ',' (that's correct).
Action - The Reality
Highlights:
My job is to check whether SOURCE_VALUES has more than two decimal places, and with this behavior I am getting false-positive reports.
I am in Brazil, so we use the same format for (currency) float numbers as Europeans do: "." for thousands and "," for decimals. Apparently the default format presented by pandas is correct.
Conclusion:
As shown in my image, not all values lose the decimal precision. What is the workaround for this behavior?
Conclusion 2:
I got the real source file now, in XML! When I received the spreadsheet it was no longer in that format but in xlsx, so Excel had applied its usual default mask for floating-point numbers.
That is why I was intrigued: what would be the reason for some values to have the correct decimal formatting and others not after converting them to string?
But now I am sure that it is not a fault of pandas, and the real values are "correct".
Final Thoughts
The most important lesson here is to be aware of the origin of the data, so that you don't mistrust your results!
It looks like the values in Excel are formatted to display in the 0.00 format, but the real data is different. For example, if you change the format of the value 4556.38 to display more digits, you will see 4556.379999... . So pandas is reading the real value, not the formatted value.
To control the float-to-string conversion, you could use Python's string formatting (https://www.w3schools.com/python/ref_string_format.asp).
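For example, a minimal sketch (assuming SOURCE_VALUES holds floats, with the blanks introduced by fillna("") passed through unchanged):
test = df.assign(
    STRING_VALUES=df.SOURCE_VALUES.map(lambda x: f"{x:.2f}" if isinstance(x, float) else x)
)

# To check for "more than two decimals" on the real value instead of its string form:
extra_decimals = df.SOURCE_VALUES.map(lambda x: isinstance(x, float) and round(x, 2) != x)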
I'm wondering why pandas.read_sql adds .0 at the end of my data.
In SQL Server, the data are numbers declared as int.
Does anyone know why?
When I look at the dtype of the data, it is reported as float64.
According to this solution, this happens because your data might contain NaN values; in order to handle them, the int data type is automatically converted to float.
When introducing NAs into an existing Series or DataFrame via reindex or some other means, boolean and integer types will be promoted to a different dtype in order to store the NAs. These are summarized by this table:
Typeclass    Promotion dtype for storing NAs
floating     no change
object       no change
integer      cast to float64
boolean      cast to object
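If you need to keep the values as integers, a small sketch using pandas' nullable integer dtype avoids the promotion to float64 (the Series here is just an illustration):
import pandas as pd

s = pd.Series([1, 2, 3])
s.reindex(range(4)).dtype            # float64 -- the introduced NaN forced a cast
s.reindex(range(4)).astype("Int64")  # 1, 2, 3, <NA> kept as nullable integers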
I download a bunch of csv files from an AWS S3 bucket and put them in a dataframe. Before uploading the dataframe to SQL Server I would like to change the columns of the dataframe to the right datatypes.
When I run astype('float64') on a column I want to change, it not only changes the datatype but also the data.
Code:
df['testcol'] = df['lineId'].astype('float64')
(PyCharm screenshot of the result)
I attached a picture to visualize the error. As you can see the data in the third column (testcol) is different to the data in the second column (lineId) even though only the datatype should be changed.
A pl_id can have multiple lineId's, that's why I added and sorted by pl_id in the picture.
Am I using astype() wrong or is this a pandas bug?
Basically, it seems that float64 does not have enough precision to hold an integer that long:
np.float64('211052094743748628')
Out[135]: 2.1105209474374864e+17
"The max precision a float 64 can reach is close to 10-16 (unit in the last place (ULP), see en.wikipedia.org/wiki/Floating-point_arithmetic) so the idea of an exact decimal value with significantly more than 16 digits for a floating point is misleading."
Numpy float64 vs Python float
Consider maybe using int64, which can be more suitable for the size of Id in your dataset:
np.int64('211052094743748628')
Out[150]: 211052094743748628
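A minimal sketch of that fix, reusing the column names from the question:
# Keep the ids as integers instead of going through float64
df['testcol'] = df['lineId'].astype('int64')

# If the column can contain missing values, pandas' nullable integer dtype
# avoids the silent fallback to float64:
df['testcol'] = df['lineId'].astype('Int64')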
I get the below error while trying to fetch rows from an Excel file loaded into a dataframe. Some of the columns have very big values like 1405668170987000000, while others are timestamp columns with values like 11:46:00.180630.
I did convert the format of those columns to text. However, I'm still getting the error below for a simple select statement (select * from df limit 5):
Overflow Error: Python int too large to convert to SQLite INTEGER
SQLite INTEGERS are 64-bit, meaning the maximum value is 9,223,372,036,854,775,807.
It looks like some of your values are larger than that so they will not fit into the SQLite INTEGER type. You could try converting them to text in order to extract them.
SQLite integer values have an upper bound of 2**63 - 1, and it looks like some of the values in your case are simply too large for that.
Try converting them into strings and then perform the required operation.
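A hedged sketch of that workaround (the table name, the connection object conn, and the column name big_col are placeholders):
# Cast the oversized column to string so SQLite stores it as TEXT
df["big_col"] = df["big_col"].astype(str)
df.to_sql("my_table", con=conn, if_exists="replace", index=False)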
I have a dataframe with columns of different datatypes, including dates. Now, after doing some modifications, I want to save it as a feather file so I can access it later. But I am getting an error on the following step:
historical_transactions.to_feather('tmp/historical-raw')
ArrowNotImplementedError: halffloat
I guess there are columns in your dataframe with dtype float16, which is not supported by the feather format. You can convert those columns to float32 and try again.
You could try this:
historical_transactions.astype('float32').to_feather('tmp/historical-raw')
Note that the line above could fail if you also have fields that are not convertible into float32. In order to ignore those columns and leave them as they are, try:
historical_transactions.astype('float32', errors='ignore').to_feather('tmp/historical-raw')
Feather format depends on Pyarrow which in turn depends on the Apache Parquet format. Regarding float formats, it only supports float (32) and double (64). Not sure how big of a deal this is for you but there is also an open request to automatically "Coerce Arrow half-precision float to float32" in GitHub.
Improving on Kocas' answer, this converts only the half-float columns:
half_floats = historical_transactions.select_dtypes(include="float16")
historical_transactions[half_floats.columns] = half_floats.astype("float32")
historical_transactions.to_feather('tmp/historical-raw')