Pandas to_dict unwantedly modifying float numbers - python

My code below takes in CSV data and uses pandas' to_dict() function as one step in converting the data to JSON. The problem is that it modifies the float numbers (e.g. 1.6 becomes 1.6000000000000001). I am not concerned about the loss of accuracy, but because users will see the change in the numbers it looks amateurish.
I am aware that:
this has come up before here, but that was two years ago and was not really answered well;
I also have an additional complication: the data frames I am looking to convert to dictionaries could contain any combination of datatypes.
As such, the issues with the previous solutions are:
Converting all the numbers to objects only works if you don't need to use the numbers numerically. I want the option to calculate sums and averages, which reintroduces the extra-decimals issue.
Forcing the numbers to be rounded to x decimals will either reduce accuracy or add unnecessary trailing 0s, depending on the data the user provides.
My question:
Is there a better way to ensure the numbers are not being modified, but are kept in a numeric datatype? Is it a question of changing how I import the CSV data in the first place? Surely there is a simple solution I am overlooking?
Here is a simple script that will reproduce this bug:
import pandas as pd
import sys
if sys.version_info[0] < 3:
    from StringIO import StringIO
else:
    from io import StringIO
CSV_Data = "Index,Column_1,Column_2,Column_3,Column_4,Column_5,Column_6,Column_7,Column_8\nindex_1,1.1,1.2,1.3,1.4,1.5,1.6,1.7,1.8\nindex_2,2.1,2.2,2.3,2.4,2.5,2.6,2.7,2.8\nindex_3,3.1,3.2,3.3,3.4,3.5,3.6,3.7,3.8\nindex_4,4.1,4.2,4.3,4.4,4.5,4.6,4.7,4.8"
input_data = StringIO(CSV_Data)
df = pd.DataFrame.from_csv(path = input_data, header = 0, sep=',', index_col=0, encoding='utf-8')
print(df.to_dict(orient = 'records'))

You could use pd.io.json.dumps to handle nested dicts with pandas objects.
Let's create a summary dict with the dataframe records and a custom metric.
In [137]: summary = {'df': df.to_dict(orient = 'records'), 'df_metric': df.sum() / df.min()}
In [138]: summary['df_metric']
Out[138]:
Column_1 9.454545
Column_2 9.000000
Column_3 8.615385
Column_4 8.285714
Column_5 8.000000
Column_6 7.750000
Column_7 7.529412
Column_8 7.333333
dtype: float64
In [139]: pd.io.json.dumps(summary)
Out[139]: '{"df":[{"Column_7":1.7,"Column_6":1.6,"Column_5":1.5,"Column_4":1.4,"Column_3":1.3,"Column_2":1.2,"Column_1":1.1,"Column_8":1.8},{"Column_7":2.7,"Column_6":2.6,"Column_5":2.5,"Column_4":2.4,"Column_3":2.3,"Column_2":2.2,"Column_1":2.1,"Column_8":2.8},{"Column_7":3.7,"Column_6":3.6,"Column_5":3.5,"Column_4":3.4,"Column_3":3.3,"Column_2":3.2,"Column_1":3.1,"Column_8":3.8},{"Column_7":4.7,"Column_6":4.6,"Column_5":4.5,"Column_4":4.4,"Column_3":4.3,"Column_2":4.2,"Column_1":4.1,"Column_8":4.8}],"df_metric":{"Column_1":9.4545454545,"Column_2":9.0,"Column_3":8.6153846154,"Column_4":8.2857142857,"Column_5":8.0,"Column_6":7.75,"Column_7":7.5294117647,"Column_8":7.3333333333}}'
Use double_precision to alter the maximum digit precision of doubles. Notice the df_metric values.
In [140]: pd.io.json.dumps(summary, double_precision=2)
Out[140]: '{"df":[{"Column_7":1.7,"Column_6":1.6,"Column_5":1.5,"Column_4":1.4,"Column_3":1.3,"Column_2":1.2,"Column_1":1.1,"Column_8":1.8},{"Column_7":2.7,"Column_6":2.6,"Column_5":2.5,"Column_4":2.4,"Column_3":2.3,"Column_2":2.2,"Column_1":2.1,"Column_8":2.8},{"Column_7":3.7,"Column_6":3.6,"Column_5":3.5,"Column_4":3.4,"Column_3":3.3,"Column_2":3.2,"Column_1":3.1,"Column_8":3.8},{"Column_7":4.7,"Column_6":4.6,"Column_5":4.5,"Column_4":4.4,"Column_3":4.3,"Column_2":4.2,"Column_1":4.1,"Column_8":4.8}],"df_metric":{"Column_1":9.45,"Column_2":9.0,"Column_3":8.62,"Column_4":8.29,"Column_5":8.0,"Column_6":7.75,"Column_7":7.53,"Column_8":7.33}}'
You could pass orient='records', 'index', etc. to control how the pandas objects are serialized, just as with to_json.
In [144]: pd.io.json.dumps(summary, orient='records')
Out[144]: '{"df":[{"Column_7":1.7,"Column_6":1.6,"Column_5":1.5,"Column_4":1.4,"Column_3":1.3,"Column_2":1.2,"Column_1":1.1,"Column_8":1.8},{"Column_7":2.7,"Column_6":2.6,"Column_5":2.5,"Column_4":2.4,"Column_3":2.3,"Column_2":2.2,"Column_1":2.1,"Column_8":2.8},{"Column_7":3.7,"Column_6":3.6,"Column_5":3.5,"Column_4":3.4,"Column_3":3.3,"Column_2":3.2,"Column_1":3.1,"Column_8":3.8},{"Column_7":4.7,"Column_6":4.6,"Column_5":4.5,"Column_4":4.4,"Column_3":4.3,"Column_2":4.2,"Column_1":4.1,"Column_8":4.8}],"df_metric":[9.4545454545,9.0,8.6153846154,8.2857142857,8.0,7.75,7.5294117647,7.3333333333]}'
In essence, pd.io.json.dumps recursively converts an arbitrary object into JSON; internally it uses ultrajson (ujson).
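If you need plain Python dictionaries rather than a JSON string, a minimal sketch (using the df from the question) is to round-trip through pandas' JSON writer and the standard-library json parser; the writer emits the short, human-friendly float text and json.loads turns it back into ordinary Python floats:
import json
# to_json(orient='records') matches to_dict(orient='records'), but serialises
# the floats with their short representation.
clean_records = json.loads(df.to_json(orient='records'))
print(clean_records[0]['Column_6'])  # 1.6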

I needed df.to_dict('list') with correct float numbers, but df.to_json() doesn't support orient='list' yet, so I do the following:
import json
list_oriented_dict = {
    column: list(data.values())
    for column, data in json.loads(df.to_json()).items()
}
Not the best way, but it works for me. Maybe someone has a more elegant solution?
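A hedged sketch of a slightly tidier variant: the 'split' orient gives back the column names and the row data separately, so the same 'list'-oriented dict can be built by transposing the rows (still only to_json plus the stdlib json module):
import json
split = json.loads(df.to_json(orient='split'))
# split is {'columns': [...], 'index': [...], 'data': [[row values], ...]};
# zip(*data) transposes the rows into one tuple per column.
list_oriented_dict = {col: list(vals)
                      for col, vals in zip(split['columns'], zip(*split['data']))}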

Related

Quickly search through a numpy array and sum the corresponding values

I have an array with around 160k entries which I get from a CSV-file and it looks like this:
data_arr = np.array([['ID0524', 1.0],
                     ['ID0965', 2.5],
                     ...
                     ['ID0524', 6.7],
                     ['ID0324', 3.0]])
I now get around 3k unique IDs from a database, and what I have to do is look up each of these IDs in the array and sum the corresponding numbers.
So if I needed to look up "ID0524", the sum would be 7.7.
My current working code looks something like this (I'm sorry that it's pretty ugly, I'm very new to numpy):
def sumValues(self, id):
    sub_arr = data_arr[data_arr[0:data_arr.size, 0] == id]
    sum_arr = sub_arr[0:sub_arr.size, 1]
    return sum_arr.sum()
It takes around 18 s to do this for all 3k IDs.
I wondered if there is a faster way to do this, as the current runtime seems a bit too long to me. I would appreciate any guidance and hints. Thank you!
You could try using the builtin numpy methods:
numpy.intersect1d to find the unique IDs
numpy.sum to sum them up
A vectorised sketch along these lines (using np.unique and np.bincount rather than intersect1d) is shown below.
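For illustration, here is a rough sketch of a fully vectorised variant; it assumes the object-array layout from the question and swaps intersect1d for np.unique plus np.bincount:
import numpy as np
# Small stand-in for the 160k-row array from the question.
data_arr = np.array([['ID0524', 1.0],
                     ['ID0965', 2.5],
                     ['ID0524', 6.7],
                     ['ID0324', 3.0]], dtype=object)
ids = data_arr[:, 0]
values = data_arr[:, 1].astype(float)
# Map each row's ID to an integer code, then sum the values per code in one pass.
unique_ids, codes = np.unique(ids, return_inverse=True)
sums = np.bincount(codes, weights=values)
totals = dict(zip(unique_ids, sums))
print(totals['ID0524'])  # sum for ID0524 (≈ 7.7)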
A convenient tool to do your task is Pandas, with its grouping mechanism.
Start from the necessary import:
import pandas as pd
Then convert data_arr to a pandasonic DataFrame:
df = pd.DataFrame({'Id': data_arr[:, 0], 'Amount': data_arr[:, 1].astype(float)})
The reason for some complication in the above code is that the elements of your input array are all of a single type (in this case object), so it is necessary to convert the second column to float.
Then you can get the expected result in a single instruction:
result = df.groupby('Id').sum()
The result, for your data sample, is:
Amount
Id
ID0324 3.0
ID0524 7.7
ID0965 2.5
Another approach: you could read your CSV file directly into a DataFrame (see the read_csv method), so there is no need to use any Numpy array at all.
The advantage is that read_csv is clever enough to recognize the data type of each column separately; at the very least it can tell numbers apart from strings.
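A minimal sketch of that idea (the file name and column names are assumptions, since the original CSV layout isn't shown):
import pandas as pd
# read_csv infers that the second column is numeric, so no manual astype(float) is needed.
df = pd.read_csv('data.csv', names=['Id', 'Amount'])
result = df.groupby('Id')['Amount'].sum()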

Not getting stats analysis of binary column pandas

I have a dataframe with 11 columns and 18k rows. The last column is either a 1 or a 0, but when I use .describe() all I get is
count 19020
unique 2
top 1
freq 12332
Name: Class, dtype: int64
as opposed to an actual statistical analysis with mean, std, etc.
Is there a way to do this?
If your numeric (0, 1) column is not being picked up automatically by .describe(), it might be because it's not actually encoded as an int dtype. You can see this in the documentation of the .describe() method, which tells you that the default include parameter is only for numeric types:
None (default) : The result will include all numeric columns.
My suggestion would be the following:
df.dtypes # check datatypes
df['num'] = df['num'].astype(int) # if it's not integer, cast it as such
df.describe(include=['object', 'int64']) # explicitly state the data types you'd like to describe
That is, first check the datatypes (I'm assuming the column is called num and the dataframe df, but feel free to substitute with the right ones). If this indicator/(0,1) column is indeed not encoded as int/integer type, then cast it as such by using .astype(int). Then, you can freely use df.describe() and perhaps even specify columns of which data types you want to include in the description output, for more fine-grained control.
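For illustration, a small self-contained example of the difference (hypothetical data; the real column name may differ):
import pandas as pd
# If the (0, 1) column is stored as strings, describe() gives the object-style
# summary (count / unique / top / freq)...
df = pd.DataFrame({'Class': ['1', '0', '1', '1']})
print(df['Class'].describe())
# ...whereas after casting to int you get mean, std, quartiles and so on.
df['Class'] = df['Class'].astype(int)
print(df['Class'].describe())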
You could use
# percentile list
perc = [.20, .40, .60, .80]
# list of dtypes to include
include = ['object', 'float', 'int']
data.describe(percentiles=perc, include=include)
where data is your dataframe (an important point).
Since you are new to Stack Overflow, I might suggest that you include some actual code (i.e. something showing how and on what you are using your methods). You'll get better answers.

Reading bigint (int8) column data from Redshift without Scientific Notation using Pandas

I am reading data from Redshift using Pandas. I have one bigint (int8) column which is coming back in exponential (scientific) notation.
I tried the following approaches, but got data truncation in those cases.
A sample value in that column is 635284328055690862. It is being read as 6.352843e+17.
I tried to convert that into int64 in Python.
import numpy as np
df["column_name"] = df["column_name"].astype(np.int64)
Output in this case is: 635284328055690880. Here I am losing data; the trailing digits are being zeroed out.
Expected Output: 635284328055690862
I get the same result even if I do this:
pd.set_option('display.float_format', lambda x: '%.0f' % x)
Output: 635284328055690880
Expected Output: 635284328055690862
It seems like this is normal Pandas behavior. I even tried creating a DataFrame from a list and still get the same result.
import pandas as pd
import numpy as np
pd.set_option('display.float_format', lambda x: '%.0f' % x)
sample_data = [[635284328055690862, 758364950923147626], [np.NaN, np.NaN], [1, 3]]
df = pd.DataFrame(sample_data)
Output:
0 635284328055690880 758364950923147648
1 nan nan
2 1 3
What I have noticed is that whenever we have NaN in the dataframe, we have this issue.
I am using the code below to fetch data from Redshift.
from sqlalchemy import create_engine
import pandas as pd
connstr = 'redshift+psycopg2://<username>:<password>@<cluster_name>/<db_name>'
engine = create_engine(connstr)
with engine.connect() as conn, conn.begin():
    df = pd.read_sql('''select * from schema.table_name''', conn)
print(df)
Please help me fix this. Thanks in advance.
This happens because standard integer datatypes do not provide a way to represent missing data. Since floating point datatypes do provide nan, the old way of handling this was to convert numerical columns with missing data to float.
To correct this, pandas has introduced a Nullable integer data type. If you were doing something as simple as reading a csv, you could explicitly specify this type in your call to read_csv like so:
>>> pandas.read_csv('sample.csv', dtype="Int64")
column_a column_b
0 635284328055690880 45564
1 <NA> 45
2 1 <NA>
3 1 5
However, the problem persists! It seems that even though 635284328055690862 can be represented as a 64-bit integer, at some point, pandas still passes the value through a floating-point conversion step, changing the value. This is pretty odd, and might even be worth raising as an issue with the pandas developers.
The best workaround I see in this scenario is to use the "object" datatype, like so:
>>> pandas.read_csv('sample.csv', dtype="object")
column_a column_b
0 635284328055690862 45564
1 NaN 45
2 1 NaN
3 1 5
This preserves the exact value of the large integer and also allows for NaN values. However, because these are now arrays of python objects, there will be a significant performance hit for compute-intensive tasks. Furthermore, on closer examination, it appears that these are Python str objects, so we still need another conversion step. To my surprise, there was no straightforward approach. This was the best I could do:
def col_to_intNA(col):
    # Convert one column's values to exact Python ints, mapping missing values to pandas.NA.
    return {ix: pandas.NA if pandas.isnull(v) else int(v)
            for ix, v in col.to_dict().items()}

# 'sample' is the DataFrame read above with dtype="object".
sample = {col: col_to_intNA(sample[col])
          for col in sample.columns}
sample = pandas.DataFrame(sample, dtype="Int64")
This gives the desired result:
>>> sample
column_a column_b
0 635284328055690862 45564
1 <NA> 45
2 1 <NA>
3 1 5
>>> sample.dtypes
column_a Int64
column_b Int64
dtype: object
So that solves one problem. But a second problem arises, because to read from a Redshift database, you would normally use read_sql, which doesn't provide any way to specify data types.
So we'll roll our own! This is based on the code you posted, as well as some code from the pandas_redshift library. It uses psycopg2 directly, rather than using sqlalchemy, because I am not sure sqlalchemy provides a cursor_factory parameter that accepts a RealDictCursor. Caveat: I have not tested this at all because I am too lazy to set up a postgres database just to test a StackOverflow answer! I think it should work but I am not certain. Please let me know whether it works and/or what needs to be corrected.
import psycopg2
from psycopg2.extras import RealDictCursor  # Turn rows into proper dicts.
import pandas

def row_null_to_NA(row):
    return {col: pandas.NA if pandas.isnull(val) else val
            for col, val in row.items()}

# Note: psycopg2.connect expects a plain libpq DSN/URI such as
# 'postgresql://<username>:<password>@<cluster_name>/<db_name>',
# not the sqlalchemy-style 'redshift+psycopg2://...' string.
connstr = 'postgresql://<username>:<password>@<cluster_name>/<db_name>'

try:  # `with conn:` only closes the transaction, not the connection
    conn = psycopg2.connect(connstr, cursor_factory=RealDictCursor)
    cursor = conn.cursor()
    cursor.execute('''select * from schema.table_name''')
    # The DataFrame constructor accepts generators of dictionary rows.
    df = pandas.DataFrame(
        (row_null_to_NA(row) for row in cursor.fetchall()),
        dtype="Int64"
    )
finally:
    conn.close()

print(df)
Note that this assumes that all your columns are integer columns. You might need to load the data column-by-column if not.
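If only some of your columns are integers, a rough per-column variant (hypothetical column names; this would replace the DataFrame construction inside the try block above) is to skip the forced dtype and convert selectively afterwards:
int_cols = ['col2', 'col5']  # whichever columns really hold integers
df = pandas.DataFrame(row_null_to_NA(row) for row in cursor.fetchall())
for col in int_cols:
    # Object columns of Python ints and pandas.NA convert cleanly to the nullable Int64 dtype.
    df[col] = df[col].astype('Int64')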
One possible fix: instead of doing select * from schema.table_name, you can pass all the columns separately and cast the particular column.
Let's say you have 5 columns in the table and col2 is the bigint (int8) column. Then you can read it like below:
from sqlalchemy import create_engine
import pandas as pd
connstr = 'redshift+psycopg2://<username>:<password>@<cluster_name>/<db_name>'
engine = create_engine(connstr)
with engine.connect() as conn, conn.begin():
    df = pd.read_sql('''select col1, cast(col2 as int), col3, col4, col5... from schema.table_name''', conn)
print(df)
P.S.: I am not sure this is the smartest solution, but logically, if Python is not able to cast to int64 properly, then we can read the already-cast value from SQL itself.
Further, I would like to try casting the int columns dynamically when their length is more than 17 digits.
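A related hedged sketch (hypothetical column and table names): cast the bigint column to varchar on the SQL side so pandas never routes it through a float, then convert it to the nullable Int64 dtype in Python:
from sqlalchemy import create_engine
import pandas as pd
engine = create_engine('redshift+psycopg2://<username>:<password>@<cluster_name>/<db_name>')
query = '''select col1, cast(col2 as varchar) as col2, col3 from schema.table_name'''
with engine.connect() as conn, conn.begin():
    df = pd.read_sql(query, conn)
# The values arrive as strings (or None/NaN for NULL); convert them to exact integers.
df['col2'] = df['col2'].map(lambda v: pd.NA if pd.isna(v) else int(v)).astype('Int64')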

Python how does == work for float/double?

I know using == for float is generally not safe. But does it work for the below scenario?
Step 1: Read from csv file A.csv and save the first half of the data to csv file B.csv without doing anything.
Step 2: Read from both A.csv and B.csv, and use == to check whether the data match everywhere in the first half.
These are all done with Pandas. The columns in A.csv have types datetime, string, and float. Obviously == works for datetime and string, so if == works for float as well in this case, it saves a lot of work.
It seems to be working for all my tests, but can I assume it will work all the time?
The same string representation will become the same float representation when put through the same parse routine. Float inaccuracy is an issue when mathematical operations are performed on the values or when high-precision representations are needed; simply parsing the same text twice and comparing for equality is not, by itself, a reason to worry.
No, you cannot assume that this will work all the time.
For this to work, you need to know that the text value written out by Pandas when it's writing to a CSV file recovers the exact same value when read back in (again using Pandas). But by default, the Pandas read_csv function sacrifices accuracy for speed, and so the parsing operation does not automatically recover the same float.
To demonstrate this, try the following: we'll create some random values, write them out to a CSV file, and read them back in, all using Pandas. First the necessary imports:
>>> import pandas as pd
>>> import numpy as np
Now create some random values, and put them into a Pandas Series object:
>>> test_values = np.random.rand(10000)
>>> s = pd.Series(test_values, name='test_values')
Now we use the to_csv method to write these values out to a file, and then read the contents of that file back into a DataFrame:
>>> s.to_csv('test.csv', header=True)
>>> df = pd.read_csv('test.csv')
Finally, let's extract the values from the relevant column of df and compare. We'll sum the result of the == operation to find out how many of the 10000 input values were recovered exactly.
>>> sum(test_values == df['test_values'])
7808
So approximately 78% of the values were recovered correctly; the others were not.
This behaviour is considered a feature of Pandas, rather than a bug. However, there's a workaround: Pandas 0.15 added a new float_precision argument to the CSV reader. By supplying float_precision='round_trip' to the read_csv operation, Pandas uses a slower but more accurate parser. Trying that on the example above, we get the values recovered perfectly:
>>> df = pd.read_csv('test.csv', float_precision='round_trip')
>>> sum(test_values == df['test_values'])
10000
Here's a second example, going in the other direction. The previous example showed that writing and then reading doesn't give back the same data. This example shows that reading and then writing doesn't preserve the data, either. The setup closely matches the one you describe in the question. First we'll create A.csv, this time using regularly-spaced values instead of random ones:
>>> import pandas as pd, numpy as np
>>> s = pd.Series(np.arange(10**4) / 1e3, name='test_values')
>>> s.to_csv('A.csv', header=True)
Now we read A.csv, and write the first half of the data back out again to B.csv, as in your Step 1.
>>> recovered_s = pd.read_csv('A.csv').test_values
>>> recovered_s[:5000].to_csv('B.csv', header=True)
Then we read in both A.csv and B.csv, and compare the first half of A with B, as in your Step 2.
>>> a = pd.read_csv('A.csv').test_values
>>> b = pd.read_csv('B.csv').test_values
>>> (a[:5000] == b).all()
False
>>> (a[:5000] == b).sum()
4251
So again, several of the values don't compare correctly. Opening up the files, A.csv looks pretty much as I expect. Here are the first entries in A.csv:
,test_values
0,0.0
1,0.001
2,0.002
3,0.003
4,0.004
5,0.005
6,0.006
7,0.007
8,0.008
9,0.009
10,0.01
11,0.011
12,0.012
13,0.013
14,0.014
15,0.015
And here are the corresponding entries in B.csv:
,test_values
0,0.0
1,0.001
2,0.002
3,0.003
4,0.004
5,0.005
6,0.006
7,0.006999999999999999
8,0.008
9,0.009000000000000001
10,0.01
11,0.011000000000000001
12,0.012
13,0.013000000000000001
14,0.013999999999999999
15,0.015
See this bug report for more information on the introduction of the float_precision keyword to read_csv.

mean() of column in pandas DataFrame returning inf: how can I solve this?

I'm trying to implement some machine learning algorithms, but I'm having some difficulties putting the data together.
In the example below, I load an example dataset from UCI, remove lines with missing data (thanks to help from a previous question), and now I would like to try to normalize the data.
For many datasets, I just used:
valores = (valores - valores.mean()) / (valores.std())
But for this particular dataset the approach above doesn't work. The problem is that the mean function is returning inf, perhaps due to a precision issue. See the example below:
import pandas as pd

bcw = pd.read_csv('http://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data', header=None)
for col in bcw.columns:
    if bcw[col].dtype != 'int64':
        print("Removing possible '?' in column %s..." % col)
        bcw = bcw[bcw[col] != '?']

valores = bcw.iloc[:, 1:10]

# mean returns inf
print(valores.iloc[:, 5].mean())
My question is how to deal with this. It seems that I need to change the type of this column, but I don't know how to do it.
I'm not so familiar with pandas, but if you convert to a numpy array it works; try:
np.asarray(valores.iloc[:, 5], dtype=float).mean()
NaN values should not matter when computing the mean of a pandas.Series. Precision is also irrelevant. The only explanation I can think of is that one of the values in valores is equal to infinity.
You could exclude any values that are infinite when computing the mean, like this:
import numpy as np
col = valores.iloc[:, 5]
is_inf = col == np.inf
col[~is_inf].mean()
If the elements of the pandas series are strings, you can get inf as the mean result. In this specific case you can simply convert the series elements to float and then calculate the mean. No need to use numpy.
Example:
valores.iloc[:,5].astype(float).mean()
I had the same problem with a column that was of dtype 'O' (object) and whose max value was 9999. Have you tried using the convert_objects method with the convert_numeric=True parameter? This fixed the problem for me. (Note that convert_objects has since been deprecated in favour of pd.to_numeric.)
For me, the reason was an overflow: my original data was in float16, and calling .mean() on it returned inf. After converting the data to float32 (e.g. via .astype("float32")), .mean() worked as expected.
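A tiny illustration of that last point (a sketch; the exact behaviour depends on the pandas version, since the reduction is carried out in the column's own dtype):
import numpy as np
import pandas as pd
# float16 can only hold values up to ~65504, so the running sum inside .mean() can overflow.
s = pd.Series(np.full(10_000, 1000.0), dtype='float16')
print(s.mean())                     # inf on setups where the sum is accumulated in float16
print(s.astype('float32').mean())   # 1000.0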
