Reading bigint (int8) column data from Redshift without Scientific Notation using Pandas - python

I am reading data from Redshift using Pandas. I have one bigint (int8) column which is being read in exponential (scientific) notation.
I tried the following approaches, but the data gets truncated in each case.
A sample value in that column is 635284328055690862. It is being read as 6.352843e+17.
I tried converting it to int64 in Python:
import numpy as np
df["column_name"] = df["column_name"].astype(np.int64)
The output in this case is: 635284328055690880. I am losing data here; the last digits are being rounded off to 0.
Expected Output: 635284328055690862
I get the same result even if I do this:
pd.set_option('display.float_format', lambda x: '%.0f' % x)
Output: 635284328055690880
Expected Output: 635284328055690862
It seems like this is normal Pandas behavior. I even tried creating a DataFrame from a list and I still get the same result:
import pandas as pd
import numpy as np
pd.set_option('display.float_format', lambda x: '%.0f' % x)
sample_data = [[635284328055690862, 758364950923147626], [np.NaN, np.NaN], [1, 3]]
df = pd.DataFrame(sample_data)
Output:
0 635284328055690880 758364950923147648
1 nan nan
2 1 3
What I have noticed is that whenever there is a nan in the DataFrame, we run into this issue.
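A minimal sketch of that behavior: a column of plain ints stays int64, but adding a NaN forces the whole column to float64, and float64 cannot hold every 18-digit integer exactly.
import numpy as np
import pandas as pd

print(pd.Series([635284328055690862, 1]).dtype)              # int64
print(pd.Series([635284328055690862, np.nan]).dtype)         # float64
print(int(pd.Series([635284328055690862, np.nan]).iloc[0]))  # 635284328055690880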
I am using the code below to fetch data from Redshift.
from sqlalchemy import create_engine
import pandas as pd
connstr = 'redshift+psycopg2://<username>:<password>@<cluster_name>/<db_name>'
engine = create_engine(connstr)
with engine.connect() as conn, conn.begin():
    df = pd.read_sql('''select * from schema.table_name''', conn)
print(df)
Please help me fix this. Thanks in advance.

This happens because standard integer datatypes do not provide a way to represent missing data. Since floating point datatypes do provide nan, the old way of handling this was to convert numerical columns with missing data to float.
To correct this, pandas has introduced a Nullable integer data type. If you were doing something as simple as reading a csv, you could explicitly specify this type in your call to read_csv like so:
>>> pandas.read_csv('sample.csv', dtype="Int64")
             column_a  column_b
0  635284328055690880     45564
1                <NA>        45
2                   1      <NA>
3                   1         5
However, the problem persists! It seems that even though 635284328055690862 can be represented as a 64-bit integer, at some point, pandas still passes the value through a floating-point conversion step, changing the value. This is pretty odd, and might even be worth raising as an issue with the pandas developers.
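A short sketch makes both halves of that visible: constructing the Series directly keeps the exact integer, while a round trip through float produces exactly the corrupted value seen above.
>>> import pandas
>>> pandas.Series([635284328055690862, None], dtype="Int64")
0    635284328055690862
1                  <NA>
dtype: Int64
>>> int(float(635284328055690862))  # the float detour explains the change
635284328055690880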
The best workaround I see in this scenario is to use the "object" datatype, like so:
>>> pandas.read_csv('sample.csv', dtype="object")
             column_a  column_b
0  635284328055690862     45564
1                  NaN        45
2                    1       NaN
3                    1         5
This preserves the exact value of the large integer and also allows for NaN values. However, because these are now arrays of python objects, there will be a significant performance hit for compute-intensive tasks. Furthermore, on closer examination, it appears that these are Python str objects, so we still need another conversion step. To my surprise, there was no straightforward approach. This was the best I could do:
def col_to_intNA(col):
    return {ix: pandas.NA if pandas.isnull(v) else int(v)
            for ix, v in col.to_dict().items()}

sample = {col: col_to_intNA(sample[col])
          for col in sample.columns}
sample = pandas.DataFrame(sample, dtype="Int64")
This gives the desired result:
>>> sample
             column_a  column_b
0  635284328055690862     45564
1                <NA>        45
2                   1      <NA>
3                   1         5
>>> sample.dtypes
column_a    Int64
column_b    Int64
dtype: object
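A slightly more compact variant of the same conversion, sketched against the object-dtype frame read above (it skips the intermediate dict):
for col in sample.columns:
    sample[col] = sample[col].map(
        lambda v: pandas.NA if pandas.isnull(v) else int(v)
    ).astype("Int64")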
So that solves one problem. But a second problem arises, because to read from a Redshift database, you would normally use read_sql, which doesn't provide any way to specify data types.
So we'll roll our own! This is based on the code you posted, as well as some code from the pandas_redshift library. It uses psycopg2 directly, rather than using sqlalchemy, because I am not sure sqlalchemy provides a cursor_factory parameter that accepts a RealDictCursor. Caveat: I have not tested this at all because I am too lazy to set up a postgres database just to test a StackOverflow answer! I think it should work but I am not certain. Please let me know whether it works and/or what needs to be corrected.
import psycopg2
from psycopg2.extras import RealDictCursor  # Turn rows into proper dicts.
import pandas

def row_null_to_NA(row):
    return {col: pandas.NA if pandas.isnull(val) else val
            for col, val in row.items()}

# psycopg2.connect() expects a libpq-style connection string or URI,
# not a SQLAlchemy URL.
connstr = 'postgresql://<username>:<password>@<cluster_name>/<db_name>'

try:  # `with conn:` only closes the transaction, not the connection
    conn = psycopg2.connect(connstr, cursor_factory=RealDictCursor)
    cursor = conn.cursor()
    cursor.execute('''select * from schema.table_name''')
    # The DataFrame constructor accepts generators of dictionary rows.
    df = pandas.DataFrame(
        (row_null_to_NA(row) for row in cursor.fetchall()),
        dtype="Int64"
    )
finally:
    conn.close()

print(df)
Note that this assumes that all your columns are integer columns. You might need to load the data column-by-column if not.
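If the table mixes column types, one possible adaptation (a sketch with the same untested caveat, reusing row_null_to_NA and the cursor from above; int_cols is a hypothetical list of your bigint column names):
rows = [row_null_to_NA(row) for row in cursor.fetchall()]
df = pandas.DataFrame(rows)      # let pandas infer the non-integer columns
int_cols = ['column_name']       # hypothetical: your bigint columns
for col in int_cols:
    df[col] = df[col].astype("Int64")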

One possible fix: instead of doing select * from schema.table_name, you can list all the columns explicitly and cast the problematic column in SQL.
Let's say you have 5 columns in the table and col2 is the bigint (int8) column. Then you can read it like below:
from sqlalchemy import create_engine
import pandas as pd
connstr = 'redshift+psycopg2://<username>:<password>@<cluster_name>/<db_name>'
engine = create_engine(connstr)
with engine.connect() as conn, conn.begin():
    df = pd.read_sql('''select col1, cast(col2 as int), col3, col4, col5... from schema.table_name''', conn)
print(df)
P.S.: I am not sure this is the smartest solution, but logically, if Python is not able to cast to int64 properly, then we can read the already-cast value from SQL itself.
Further, I would like to try casting integer columns dynamically whenever their length is more than 17 digits; a rough sketch of that idea follows.
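Sketch of the dynamic approach (untested; it assumes a PostgreSQL-style information_schema is available in Redshift, and that casting bigint columns to varchar is acceptable downstream):
from sqlalchemy import create_engine
import pandas as pd

connstr = 'redshift+psycopg2://<username>:<password>@<cluster_name>/<db_name>'
engine = create_engine(connstr)

with engine.connect() as conn, conn.begin():
    # Look up the column types first, then cast bigint columns to varchar
    # so pandas never routes them through float.
    cols = pd.read_sql(
        """select column_name, data_type
           from information_schema.columns
           where table_schema = 'schema' and table_name = 'table_name'
           order by ordinal_position""",
        conn)
    select_list = ", ".join(
        "cast({0} as varchar) as {0}".format(c) if t == 'bigint' else c
        for c, t in zip(cols.column_name, cols.data_type))
    df = pd.read_sql("select {} from schema.table_name".format(select_list), conn)
print(df)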

Related

Pandas unexpectedly casts type in heterogeneous dataframe

If I define a DataFrame like this:
import numpy as np
import pandas as pd

str1 = ["woz", "baz", "fop", "jok"]
arr = np.array([2, 3, 4])
data = {"Goz": str1, "Jok": np.hstack((np.array([5, "fava"]), np.array([1, 2])))}
df = pd.DataFrame(data)
Goz Jok
0 woz 5
1 baz fava
2 fop 1
3 jok 2
I am puzzled by the fact that those nice numbers in the second column, coming from a NumPy array, are cast to strings:
type(df.loc[3]["Jok"])
out: str
Further, if I try to manually cast back to float64 for a subset of the dataframe, say
df.iloc[2:, 1].astype("float64", copy=False)
the type of the entries does not change. Besides, I understand the copy=False option comes with extreme danger.
If, on the other hand, I define the dataframe using
data = {"Goz": str1, "Jok": np.array([1, 2, 3, 4])}
so that the whole column has a uniform numeric dtype, everything seems to work.
Can somebody please clarify what is going on here? How should I handle type-heterogeneous columns? Thanks a lot.
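For what it's worth, a quick sketch of where the strings come from: np.hstack has to settle on a single dtype for the combined array, and mixing strings with numbers promotes everything to a string dtype before pandas is even involved.
import numpy as np

mixed = np.hstack((np.array([5, "fava"]), np.array([1, 2])))
print(mixed.dtype)     # a string dtype such as dtype('<U21')
print(type(mixed[0]))  # <class 'numpy.str_'>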

python float64 type conversion issue with pandas

I need to convert an 18-digit float64 pandas column to an integer or string so that it is readable without exponential notation, but I have not been successful so far.
df=pd.DataFrame(data={'col1':[915235514180670190,915235514180670208]},dtype='float64')
print(df)
col1
0 9.152355e+17
1 9.152355e+17
Then I tried converting it to int64, but the last 3 digits come out wrong:
df.col1.astype('int64')
0 915235514180670208
1 915235514180670208
Name: col1, dtype: int64
But as you can see, the value goes wrong, and I am not sure why. I read from the documentation that int64 should be able to hold an 18-digit number:
int64 Integer (-9223372036854775808 to 9223372036854775807)
Any idea what I am doing wrong? How can I achieve my requirement?
Giving further info based on Eric Postpischil's comment:
If float64 can't hold 18 digits, I might be in trouble.
The thing is that I get this data through a pandas read_sql call from the DB, and it is automatically cast to float64.
I don't see an option to specify the datatype in pandas read_sql().
Any thoughts from anyone on what I can do to overcome this problem?
The problem is that a float64 has a 53-bit mantissa, which can represent only 15 or 16 decimal digits (ref).
That means an 18-digit float64 pandas column is an illusion. No need to even go into Pandas or NumPy types:
>>> n = 915235514180670190
>>> d = float(n)
>>> print(n, d, int(d))
915235514180670190 9.152355141806702e+17 915235514180670208
read_sql in Pandas has a coerce_float parameter that might help. It's on by default, and is documented as:
Attempts to convert values of non-string, non-numeric objects (like decimal.Decimal) to floating point, useful for SQL result sets.
Setting this to False helps, e.g. with the following schema/data:
import psycopg2

con = psycopg2.connect()
with con, con.cursor() as cur:
    cur.execute("CREATE TABLE foo ( id SERIAL PRIMARY KEY, num DECIMAL(30,0) )")
    cur.execute("INSERT INTO foo (num) VALUES (123456789012345678901234567890)")
I can run:
print(pd.read_sql("SELECT * FROM foo", con))
print(pd.read_sql("SELECT * FROM foo", con, coerce_float=False))
which gives me the following output:
id num
0 1 1.234568e+29
id num
0 1 123456789012345678901234567890
preserving the precision of the value I inserted.
You've not given many details of the database you're using, but hopefully the above is helpful to somebody!
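If plain Python ints are preferred over the decimal.Decimal objects that coerce_float=False hands back, a small follow-up conversion like this sketch keeps full precision (assuming the column has no NULLs):
df = pd.read_sql("SELECT * FROM foo", con, coerce_float=False)
df["num"] = df["num"].map(int)   # Decimal -> int, no float in between
print(df["num"].iloc[0])         # 123456789012345678901234567890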
I did a workaround to deal with that problem. Thought of sharing it, as it may help someone else.
# Preparing SQL to extract all rows.
sql = 'SELECT *, CAST(col1 AS CHAR(18)) AS DUMMY_COL FROM table1;'
# Get data from postgres.
df = pd.read_sql(sql, self.conn)
# Convert the dummy column to integer.
df['DUMMY_COL'] = df['DUMMY_COL'].astype('int64')
# Replace the original col1 column with the int64-converted one.
df['col1'] = df['DUMMY_COL']
df.drop('DUMMY_COL', axis=1, inplace=True)

Efficiently running operations on Pandas DataFrame columns that are not unique

I have a DataFrame similar to this:
import numpy as np
raw_data = {'Identifier':['10','10','10','11','11','12','13']}
import pandas as pd
df = pd.DataFrame(raw_data,columns=['Identifier'])
print df
As you can see, the 'Identifier' column is not unique and the dataframe itself has many rows.
Every time I want to do a calculation on the Identifier column, I use something like:
df['CalculatedColumn'] = df['Identifier'] + apply calculation here
Since Identifier is not unique, is there a better way of doing this? Maybe store the calculation for each unique identifier and then pass the results back in? The calculation is quite complex and, combined with the number of rows, it takes a long time; I would like to reduce that, since the identifiers are not unique.
Any thoughts?
I'm pretty sure there is a more pythonic way, but this works for me:
import numpy as np
import pandas as pd

raw_data = {'Identifier':['10','10','10','11','11','12','13']}
df = pd.DataFrame(raw_data, columns=['Identifier'])
df['CalculatedColumn'] = 0
dfuni = df.drop_duplicates(['Identifier'])
dfuni['CalculatedColumn'] = dfuni['Identifier'] * 2  # perform calculation
for j in range(len(dfuni)):
    df['CalculatedColumn'][df['Identifier'] == dfuni['Identifier'].iloc[j]] = dfuni['CalculatedColumn'].iloc[j]
print df
print dfuni
As an explanation: I'm creating a new dataframe dfuni which contains all the unique values of your original dataframe. Then, you perform your calculation on this (I just multiplied the value of the Identifier by two, and because it is a string, the result is 1010, 1111 etc.). Up to here I like the code, but then I'm using a loop over all the values of dfuni to copy them back into your original df. On this point, there might be a more elegant solution; see the sketch after the output below.
As a result, I get:
Identifier CalculatedColumn
0 10 1010
1 10 1010
2 10 1010
3 11 1111
4 11 1111
5 12 1212
6 13 1313
PS: This code was tested with Python 3. The only thing I adapted was the print-statements. I may have missed something.
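For what it's worth, a more vectorized variant of the copy-back step, sketched on the same toy data: compute once per unique Identifier, then map the results back instead of looping.
calc = dfuni.set_index('Identifier')['CalculatedColumn']
df['CalculatedColumn'] = df['Identifier'].map(calc)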

Pandas to_dict unwantedly modifying float numbers

My code below takes in CSV data and uses the pandas to_dict() function as one step in converting the data to JSON. The problem is that it modifies the float numbers (e.g. 1.6 becomes 1.6000000000000001). I am not concerned about the loss of accuracy, but because users will see the change in the numbers, it looks amateurish.
I am aware:
this has come up before here, but that was two years ago and it was not really answered in a great way;
also, I have an additional complication: the data frames I am looking to convert to dictionaries could be any combination of datatypes.
As such, the issues with the previous solutions are:
Converting all the numbers to objects only works if you don't need to use the numbers numerically. I want the option to calculate sums and averages, which reintroduces the decimal issue.
Forcing rounding of numbers to x decimals will either reduce accuracy or add unnecessary trailing zeros, depending on the data the user provides.
My question:
Is there a better way to ensure the numbers are not being modified, but are kept in a numeric datatype? Is it a question of changing how I import the CSV data in the first place? Surely there is a simple solution I am overlooking?
Here is a simple script that will reproduce this bug:
import pandas as pd
import sys

if sys.version_info[0] < 3:
    from StringIO import StringIO
else:
    from io import StringIO

CSV_Data = "Index,Column_1,Column_2,Column_3,Column_4,Column_5,Column_6,Column_7,Column_8\nindex_1,1.1,1.2,1.3,1.4,1.5,1.6,1.7,1.8\nindex_2,2.1,2.2,2.3,2.4,2.5,2.6,2.7,2.8\nindex_3,3.1,3.2,3.3,3.4,3.5,3.6,3.7,3.8\nindex_4,4.1,4.2,4.3,4.4,4.5,4.6,4.7,4.8"
input_data = StringIO(CSV_Data)
df = pd.DataFrame.from_csv(path=input_data, header=0, sep=',', index_col=0, encoding='utf-8')
print(df.to_dict(orient='records'))
You could use pd.io.json.dumps to handle nested dicts with pandas objects.
Let's create a summary dict with dataframe records and custom metric.
In [137]: summary = {'df': df.to_dict(orient = 'records'), 'df_metric': df.sum() / df.min()}
In [138]: summary['df_metric']
Out[138]:
Column_1 9.454545
Column_2 9.000000
Column_3 8.615385
Column_4 8.285714
Column_5 8.000000
Column_6 7.750000
Column_7 7.529412
Column_8 7.333333
dtype: float64
In [139]: pd.io.json.dumps(summary)
Out[139]: '{"df":[{"Column_7":1.7,"Column_6":1.6,"Column_5":1.5,"Column_4":1.4,"Column_3":1.3,"Column_2":1.2,"Column_1":1.1,"Column_8":1.8},{"Column_7":2.7,"Column_6":2.6,"Column_5":2.5,"Column_4":2.4,"Column_3":2.3,"Column_2":2.2,"Column_1":2.1,"Column_8":2.8},{"Column_7":3.7,"Column_6":3.6,"Column_5":3.5,"Column_4":3.4,"Column_3":3.3,"Column_2":3.2,"Column_1":3.1,"Column_8":3.8},{"Column_7":4.7,"Column_6":4.6,"Column_5":4.5,"Column_4":4.4,"Column_3":4.3,"Column_2":4.2,"Column_1":4.1,"Column_8":4.8}],"df_metric":{"Column_1":9.4545454545,"Column_2":9.0,"Column_3":8.6153846154,"Column_4":8.2857142857,"Column_5":8.0,"Column_6":7.75,"Column_7":7.5294117647,"Column_8":7.3333333333}}'
Use double_precision to alter the maximum digit precision of doubles. Notice the df_metric values.
In [140]: pd.io.json.dumps(summary, double_precision=2)
Out[140]: '{"df":[{"Column_7":1.7,"Column_6":1.6,"Column_5":1.5,"Column_4":1.4,"Column_3":1.3,"Column_2":1.2,"Column_1":1.1,"Column_8":1.8},{"Column_7":2.7,"Column_6":2.6,"Column_5":2.5,"Column_4":2.4,"Column_3":2.3,"Column_2":2.2,"Column_1":2.1,"Column_8":2.8},{"Column_7":3.7,"Column_6":3.6,"Column_5":3.5,"Column_4":3.4,"Column_3":3.3,"Column_2":3.2,"Column_1":3.1,"Column_8":3.8},{"Column_7":4.7,"Column_6":4.6,"Column_5":4.5,"Column_4":4.4,"Column_3":4.3,"Column_2":4.2,"Column_1":4.1,"Column_8":4.8}],"df_metric":{"Column_1":9.45,"Column_2":9.0,"Column_3":8.62,"Column_4":8.29,"Column_5":8.0,"Column_6":7.75,"Column_7":7.53,"Column_8":7.33}}'
You could use orient='records/index/..' to handle dataframe -> to_json construction.
In [144]: pd.io.json.dumps(summary, orient='records')
Out[144]: '{"df":[{"Column_7":1.7,"Column_6":1.6,"Column_5":1.5,"Column_4":1.4,"Column_3":1.3,"Column_2":1.2,"Column_1":1.1,"Column_8":1.8},{"Column_7":2.7,"Column_6":2.6,"Column_5":2.5,"Column_4":2.4,"Column_3":2.3,"Column_2":2.2,"Column_1":2.1,"Column_8":2.8},{"Column_7":3.7,"Column_6":3.6,"Column_5":3.5,"Column_4":3.4,"Column_3":3.3,"Column_2":3.2,"Column_1":3.1,"Column_8":3.8},{"Column_7":4.7,"Column_6":4.6,"Column_5":4.5,"Column_4":4.4,"Column_3":4.3,"Column_2":4.2,"Column_1":4.1,"Column_8":4.8}],"df_metric":[9.4545454545,9.0,8.6153846154,8.2857142857,8.0,7.75,7.5294117647,7.3333333333]}'
In essence, pd.io.json.dumps recursively converts an arbitrary object into JSON; internally it uses ultrajson.
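If you need the result as plain Python objects rather than a JSON string, one option is a round trip through that serializer, sketched here against the df from the question:
import json
import pandas as pd

records = json.loads(pd.io.json.dumps(df.to_dict(orient='records')))
print(records[0]['Column_6'])   # 1.6 rather than 1.6000000000000001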
I need to make df.to_dict('list') with correct float numbers, but df.to_json() doesn't support orient='list' yet. So I do the following:
import json

list_oriented_dict = {
    column: list(data.values())
    for column, data in json.loads(df.to_json()).items()
}
Not the best way, but it works for me. Maybe someone has a more elegant solution?

Pandas Groupby - Sparse Matrix Error

This question is related to the question I asked previously about using the pandas get_dummies() function (link below).
Pandas Get_dummies for nested tables
However, in the course of using the solution provided in that answer I noticed odd behavior in the groupby function: repeated (non-unique) index values for a dataframe appear to cause an error when the matrix is represented in sparse format, while everything works as expected for a dense matrix.
I have extremely high-dimensional data, so a sparse matrix is required for memory reasons. An example of the error is below. If anyone has a workaround, it would be greatly appreciated.
Working:
import pandas as pd

df = pd.DataFrame({'Instance': [1, 1, 2, 3],
                   'Cat_col': ['John', 'Smith', 'Jane', 'Doe']})
result = pd.get_dummies(df.Cat_col, prefix='Name')
result['Instance'] = df.Instance
result = result.set_index('Instance')
result = result.groupby(level=0).apply(max)
Failing:
import pandas as pd

df = pd.DataFrame({'Instance': [1, 1, 2, 3],
                   'Cat_col': ['John', 'Smith', 'Jane', 'Doe']})
result = pd.get_dummies(df.Cat_col, prefix='Name', sparse=True)
result['Instance'] = df.Instance
result = result.set_index('Instance')
result = result.groupby(level=0).apply(max)
Note that you will need pandas version 0.16.1 or greater.
Thank you in advance
You can perform your groupby in a different way as a workaround. Don't set Instance as the index; use that column for your groupby and drop the Instance column afterwards (the last column in this case, since it was just added). Groupby will make an Instance index.
import pandas as pd

df = pd.DataFrame({'Instance': [1, 1, 2, 3],
                   'Cat_col': ['John', 'Smith', 'Jane', 'Doe']})
result = pd.get_dummies(df.Cat_col, prefix='Name', sparse=True)
result['Instance'] = df.Instance

# WORKAROUND:
result = result.groupby('Instance').apply(max)[result.columns[:-1]]
result
Out[58]:
Name_Doe Name_Jane Name_John Name_Smith
Instance
1 0 0 1 1
2 0 1 0 0
3 1 0 0 0
Note: The sparse dataframe stores your Instance ints as floats within a BlockIndex in the dataframe column. In order to have the index exactly the same as in the first example, you'd need to change it from float to int:
result.index=result.index.map(int)
result.index.name='Instance'
