If I define a dataframe like this:
import numpy as np
import pandas as pd
str1 = ["woz", "baz", "fop", "jok"]
arr = np.array([2, 3, 4])
data = {"Goz": str1, "Jok": np.hstack((np.array([5, "fava"]), np.array([1, 2])))}
df = pd.DataFrame(data)
Goz Jok
0 woz 5
1 baz fava
2 fop 1
3 jok 2
I am puzzled by the fact that those nice numbers in the second column, coming from a numpy array, are cast to strings:
type(df.loc[3]["Jok"])
Out: str
Further, if I try to manually cast back to float64 for a subset of the dataframe, say
df.iloc[2:, 1].astype("float64", copy=False)
the type of the column does not change. Besides, I understand the copy=False option comes with extreme danger.
If, on the other hand, I define the dataframe using
data = {"Goz": str1, "Jok": np.array([1, 2, 3, 4])}
so that the whole column holds one numeric dtype uniformly, all seems to work.
Can somebody please clarify what is going on here? How should I handle type-heterogeneous columns? Thanks a lot.
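A minimal sketch of what seems to be going on: numpy arrays are homogeneous, so the np.hstack call already promotes everything to a common string dtype before pandas ever sees the data, and astype on a slice returns a new object instead of changing the column in place. One way to convert back is pd.to_numeric with errors="coerce", which turns the non-numeric "fava" into NaN:
import numpy as np
import pandas as pd

str1 = ["woz", "baz", "fop", "jok"]
mixed = np.hstack((np.array([5, "fava"]), np.array([1, 2])))
print(mixed.dtype)  # a Unicode string dtype such as <U21 -- numpy chose a common type

df = pd.DataFrame({"Goz": str1, "Jok": mixed})

# astype/to_numeric return a new Series, so the result must be assigned back;
# errors="coerce" turns unparseable values ("fava") into NaN.
df["Jok"] = pd.to_numeric(df["Jok"], errors="coerce")
print(df.dtypes)  # Jok is now float64 (the NaN forces a float column)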
I need to convert an 18-digit float64 pandas column to an integer or a string so it is readable without exponential notation.
But I have not been successful so far.
df=pd.DataFrame(data={'col1':[915235514180670190,915235514180670208]},dtype='float64')
print(df)
col1
0 9.152355e+17
1 9.152355e+17
Then I tried converting it to int64, but the last three digits come out wrong.
df.col1.astype('int64')
0 915235514180670208
1 915235514180670208
Name: col1, dtype: int64
But as you can see, the value goes wrong, and I am not sure why.
I read in the documentation that int64 should be able to hold an 18-digit number:
int64 Integer (-9223372036854775808 to 9223372036854775807)
Any idea what I am doing wrong?
How can I achieve my requirement?
Giving further info based on Eric Postpischil's comment:
If float64 can't hold 18 digits, I might be in trouble.
The thing is that I get this data through a pandas read_sql call from the DB, and it is automatically cast to float64.
I don't see an option to specify the datatype in pandas read_sql().
Any thoughts from anyone on what I can do to overcome this problem?
The problem is that a float64 has a 53-bit mantissa, which can represent only 15 or 16 decimal digits (ref).
That means an 18-digit float64 pandas column is an illusion. No need to go into pandas, or even into numpy types:
>>> n = 915235514180670190
>>> d = float(n)
>>> print(n, d, int(d))
915235514180670190 9.152355141806702e+17 915235514180670208
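In fact the two values from the question collapse to the very same double, which is why the int64 conversion showed two identical rows:
>>> float(915235514180670190) == float(915235514180670208)
True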
read_sql in Pandas has a coerce_float parameter that might help. It's on by default, and is documented as:
Attempts to convert values of non-string, non-numeric objects (like decimal.Decimal) to floating point, useful for SQL result sets.
Setting this to False helps, e.g. with the following schema/data:
import psycopg2
con = psycopg2.connect()
with con, con.cursor() as cur:
cur.execute("CREATE TABLE foo ( id SERIAL PRIMARY KEY, num DECIMAL(30,0) )")
cur.execute("INSERT INTO foo (num) VALUES (123456789012345678901234567890)")
I can run:
print(pd.read_sql("SELECT * FROM foo", con))
print(pd.read_sql("SELECT * FROM foo", con, coerce_float=False))
which gives me the following output:
id num
0 1 1.234568e+29
id num
0 1 123456789012345678901234567890
preserving the precision of the value I inserted.
You've not given many details of the database you're using, but hopefully the above is helpful to somebody!
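If it helps: with coerce_float=False the num column typically comes back as an object column of exact decimal.Decimal values (at least with psycopg2), and those can be converted without losing digits. A sketch, reusing the foo table and connection from above:
import pandas as pd

df = pd.read_sql("SELECT * FROM foo", con, coerce_float=False)
# Strings keep every digit, and Python ints are arbitrary precision,
# so neither conversion rounds the value.
df["num_str"] = df["num"].astype(str)
df["num_int"] = df["num"].apply(int)
print(df.dtypes)  # num, num_str and num_int are all object columns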
I did a workaround to deal with that problem. Thought of sharing it, as it may help someone else.
# Preparing SQL to extract all rows; col1 is cast to text on the database side.
sql = 'SELECT *, CAST(col1 AS CHAR(18)) AS DUMMY_COL FROM table1;'
# Get data from postgres
df = pd.read_sql(sql, self.conn)
# Converting the dummy column to integer
df['DUMMY_COL'] = df['DUMMY_COL'].astype('int64')
# Replacing the original col1 column with the int64-converted one
df['col1'] = df['DUMMY_COL']
df.drop('DUMMY_COL', axis=1, inplace=True)
I have a DataFrame similar to this:
import numpy as np
raw_data = {'Identifier':['10','10','10','11',11,'12','13']}
import pandas as pd
df = pd.DataFrame(raw_data,columns=['Identifier'])
print df
As you can see, the 'Identifier' column is not unique, and the dataframe itself has many rows.
Every time I try to do a calculation on the Identifier column I use something like:
df['CalculatedColumn'] = df['Identifier'] + apply calculation here
As Identifier is not unique, is there a better way of doing this? Maybe store the calculations for each unique identifier and then pass in the results? The calculation is quite complex and, combined with the number of rows, it takes a long time. But I would want to reduce that, as the identifiers are not unique.
Any thoughts?
I'm pretty sure there is a more pythonic way, but this works for me:
import numpy as np
import pandas as pd
raw_data = {'Identifier':['10','10','10','11','11','12','13']}
df = pd.DataFrame(raw_data,columns=['Identifier'])
df['CalculatedColumn']=0
dfuni = df.drop_duplicates(['Identifier']).copy()
dfuni['CalculatedColumn'] = dfuni['Identifier']*2  # perform calculation
for j in range(len(dfuni)):
    # copy the per-identifier result back onto every matching row of df
    df.loc[df['Identifier'] == dfuni['Identifier'].iloc[j], 'CalculatedColumn'] = dfuni['CalculatedColumn'].iloc[j]
print(df)
print(dfuni)
As an explanation: I'm creating a new dataframe dfuni which contains all the unique fields of your original dataframe. Then, you perform your calculation on this (I just multiplied the value of the Identifier by two, and because it is a string, the result is 1010, 1111 etc.). Up to here, I like the code, but then, I'm using a loop through all the values of dfuni to copy them back into your original df. For this point, there might be a more elegant solution.
As a result, I get:
Identifier CalculatedColumn
0 10 1010
1 10 1010
2 10 1010
3 11 1111
4 11 1111
5 12 1212
6 13 1313
PS: This code was tested with Python 3; the only thing I adapted was the print statements. I may have missed something.
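As a possibly more elegant alternative to the loop above (a sketch using the same toy doubling calculation): compute the result once per unique identifier and broadcast it back with Series.map.
import pandas as pd

raw_data = {'Identifier': ['10', '10', '10', '11', '11', '12', '13']}
df = pd.DataFrame(raw_data, columns=['Identifier'])

# Run the expensive calculation only once per unique identifier ...
unique_results = {ident: ident*2 for ident in df['Identifier'].unique()}

# ... then map the results onto every row.
df['CalculatedColumn'] = df['Identifier'].map(unique_results)
print(df)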
My code below takes in CSV data and uses the pandas to_dict() function as one step in converting the data to JSON. The problem is that it modifies the float numbers (e.g. 1.6 becomes 1.6000000000000001). I am not concerned about the loss of accuracy, but because users will see the change in the numbers, it looks amateurish.
I am aware:
this is something that has come up before here, but that was two years ago and was not really answered well,
also I have an additional complication: the data frames I am looking to convert to dictionaries could contain any combination of datatypes.
As such, the issues with the previous solutions are:
Converting all the numbers to objects only works if you don't need to use the numbers numerically. I want the option to calculate sums and averages, which reintroduces the extra-decimals issue.
Forcing rounding to x decimals will either reduce accuracy or add unnecessary trailing 0s, depending on the data the user provides.
My question:
Is there a better way to ensure the numbers are not being modified, but are kept in a numeric datatype? Is it a question of changing how I import the CSV data in the first place? Surely there is a simple solution I am overlooking?
Here is a simple script that will reproduce this bug:
import pandas as pd
import sys
if sys.version_info[0] < 3:
from StringIO import StringIO
else:
from io import StringIO
CSV_Data = "Index,Column_1,Column_2,Column_3,Column_4,Column_5,Column_6,Column_7,Column_8\nindex_1,1.1,1.2,1.3,1.4,1.5,1.6,1.7,1.8\nindex_2,2.1,2.2,2.3,2.4,2.5,2.6,2.7,2.8\nindex_3,3.1,3.2,3.3,3.4,3.5,3.6,3.7,3.8\nindex_4,4.1,4.2,4.3,4.4,4.5,4.6,4.7,4.8"
input_data = StringIO(CSV_Data)
df = pd.read_csv(input_data, header=0, sep=',', index_col=0, encoding='utf-8')
print(df.to_dict(orient = 'records'))
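As a side note that frames the answers below: the value is not really being modified; 1.6 simply has no exact float64 representation, and to_dict hands back the stored double, whose full-precision form is what gets displayed (depending on the numpy/Python version). A quick check:
print(format(1.6, '.17g'))  # 1.6000000000000001 -- the closest double to 1.6
print(repr(1.6))            # Python's short repr prints 1.6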
You could use pd.io.json.dumps to handle nested dicts with pandas objects.
Let's create a summary dict with the dataframe records and a custom metric.
In [137]: summary = {'df': df.to_dict(orient = 'records'), 'df_metric': df.sum() / df.min()}
In [138]: summary['df_metric']
Out[138]:
Column_1 9.454545
Column_2 9.000000
Column_3 8.615385
Column_4 8.285714
Column_5 8.000000
Column_6 7.750000
Column_7 7.529412
Column_8 7.333333
dtype: float64
In [139]: pd.io.json.dumps(summary)
Out[139]: '{"df":[{"Column_7":1.7,"Column_6":1.6,"Column_5":1.5,"Column_4":1.4,"Column_3":1.3,"Column_2":1.2,"Column_1":1.1,"Column_8":1.8},{"Column_7":2.7,"Column_6":2.6,"Column_5":2.5,"Column_4":2.4,"Column_3":2.3,"Column_2":2.2,"Column_1":2.1,"Column_8":2.8},{"Column_7":3.7,"Column_6":3.6,"Column_5":3.5,"Column_4":3.4,"Column_3":3.3,"Column_2":3.2,"Column_1":3.1,"Column_8":3.8},{"Column_7":4.7,"Column_6":4.6,"Column_5":4.5,"Column_4":4.4,"Column_3":4.3,"Column_2":4.2,"Column_1":4.1,"Column_8":4.8}],"df_metric":{"Column_1":9.4545454545,"Column_2":9.0,"Column_3":8.6153846154,"Column_4":8.2857142857,"Column_5":8.0,"Column_6":7.75,"Column_7":7.5294117647,"Column_8":7.3333333333}}'
Use double_precision to alter the maximum digit precision of doubles.
Notice the df_metric values.
In [140]: pd.io.json.dumps(summary, double_precision=2)
Out[140]: '{"df":[{"Column_7":1.7,"Column_6":1.6,"Column_5":1.5,"Column_4":1.4,"Column_3":1.3,"Column_2":1.2,"Column_1":1.1,"Column_8":1.8},{"Column_7":2.7,"Column_6":2.6,"Column_5":2.5,"Column_4":2.4,"Column_3":2.3,"Column_2":2.2,"Column_1":2.1,"Column_8":2.8},{"Column_7":3.7,"Column_6":3.6,"Column_5":3.5,"Column_4":3.4,"Column_3":3.3,"Column_2":3.2,"Column_1":3.1,"Column_8":3.8},{"Column_7":4.7,"Column_6":4.6,"Column_5":4.5,"Column_4":4.4,"Column_3":4.3,"Column_2":4.2,"Column_1":4.1,"Column_8":4.8}],"df_metric":{"Column_1":9.45,"Column_2":9.0,"Column_3":8.62,"Column_4":8.29,"Column_5":8.0,"Column_6":7.75,"Column_7":7.53,"Column_8":7.33}}'
You could use orient='records'/'index'/... to control the dataframe-to-JSON construction.
In [144]: pd.io.json.dumps(summary, orient='records')
Out[144]: '{"df":[{"Column_7":1.7,"Column_6":1.6,"Column_5":1.5,"Column_4":1.4,"Column_3":1.3,"Column_2":1.2,"Column_1":1.1,"Column_8":1.8},{"Column_7":2.7,"Column_6":2.6,"Column_5":2.5,"Column_4":2.4,"Column_3":2.3,"Column_2":2.2,"Column_1":2.1,"Column_8":2.8},{"Column_7":3.7,"Column_6":3.6,"Column_5":3.5,"Column_4":3.4,"Column_3":3.3,"Column_2":3.2,"Column_1":3.1,"Column_8":3.8},{"Column_7":4.7,"Column_6":4.6,"Column_5":4.5,"Column_4":4.4,"Column_3":4.3,"Column_2":4.2,"Column_1":4.1,"Column_8":4.8}],"df_metric":[9.4545454545,9.0,8.6153846154,8.2857142857,8.0,7.75,7.5294117647,7.3333333333]}'
In essence, pd.io.json.dumps converts an arbitrary object recursively into JSON; internally it uses ultrajson.
I needed df.to_dict('list') with the right float numbers, but df.to_json() doesn't support orient='list' yet. So I do the following:
import json

list_oriented_dict = {
    column: list(data.values())
    for column, data in json.loads(df.to_json()).items()
}
Not the best way, but it works for me. Maybe someone has a more elegant solution?
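One possible alternative, just a sketch and not necessarily more elegant: round-trip through orient='split', which hands back the rows plus the column names, and transpose the rows into per-column lists.
import json
import pandas as pd

df = pd.DataFrame({'a': [1.6, 2.6], 'b': [3.1, 4.1]})

# orient='split' yields {'columns': [...], 'index': [...], 'data': [[row], ...]}
split = json.loads(df.to_json(orient='split'))
list_oriented_dict = {
    column: list(values)
    for column, values in zip(split['columns'], zip(*split['data']))
}
print(list_oriented_dict)  # {'a': [1.6, 2.6], 'b': [3.1, 4.1]}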
This question is related to a question I asked previously about using the pandas get_dummies() function (link below).
Pandas Get_dummies for nested tables
However, in the course of using the solution provided in the answer, I noticed odd behavior with the groupby function. The issue is that repeated (non-unique) index values for a dataframe appear to cause an error when the matrix is represented in sparse format, while it works as expected for a dense matrix.
I have extremely high-dimensional data, so a sparse matrix is required for memory reasons. An example of the error is below. If anyone has a workaround it would be greatly appreciated.
Working:
import pandas as pd
df = pd.DataFrame({'Instance': [1, 1, 2, 3],
                   'Cat_col': ['John', 'Smith', 'Jane', 'Doe']})
result= pd.get_dummies(df.Cat_col, prefix='Name')
result['Instance'] = df.Instance
result = result.set_index('Instance')
result = result.groupby(level=0).apply(max)
Failing
import pandas as pd
df = pd.DataFrame({'Instance': [1, 1, 2, 3],
                   'Cat_col': ['John', 'Smith', 'Jane', 'Doe']})
result= pd.get_dummies(df.Cat_col, prefix='Name',sparse=True)
result['Instance'] = df.Instance
result = result.set_index('Instance')
result = result.groupby(level=0).apply(max)
Note you will need version 0.16.1 or greater of pandas.
Thank you in advance
You can perform your groupby in a different way as a workaround. Don't set Instance as the index; use the column for your groupby and then drop the Instance column (the last column in this case, since it was just added). Groupby will make an Instance index.
import pandas as pd
df = pd.DataFrame({'Instance': [1, 1, 2, 3],
                   'Cat_col': ['John', 'Smith', 'Jane', 'Doe']})
result= pd.get_dummies(df.Cat_col, prefix='Name',sparse=True)
result['Instance'] = df.Instance
#WORKAROUND:
result=result.groupby('Instance').apply(max)[result.columns[:-1]]
result
Out[58]:
Name_Doe Name_Jane Name_John Name_Smith
Instance
1 0 0 1 1
2 0 1 0 0
3 1 0 0 0
Note: The sparse dataframe stores your Instance ints as floats within a BlockIndex in the dataframe column. In order to have the index exactly the same as in the first example, you'd need to change it from float to int:
result.index=result.index.map(int)
result.index.name='Instance'