Python how does == work for float/double?

Python how does == work for float/double? - python

I know using == for float is generally not safe. But does it work for the below scenario?
Read from csv file A.csv, save first half of the data to csv file B.csv without doing anything.
Read from both A.csv and B.csv. Use == to check if data match everywhere in the first half.
These are all done with Pandas. The columns in A.csv have types datetime, string, and float. Obviously == works for datetime and string, so if == works for float as well in this case, it saves a lot of work.
It seems to be working for all my tests, but can I assume it will work all the time?

The same string representation will become the same float representation when put through the same parse routine. The float inaccuracy issue occurs either when mathematical operations are performed on the values or when high-precision representations are used, but equality on low-precision values is no reason to worry.

No, you cannot assume that this will work all the time.
For this to work, you need to know that the text value written out by Pandas when it's writing to a CSV file recovers the exact same value when read back in (again using Pandas). But by default, the Pandas read_csv function sacrifices accuracy for speed, and so the parsing operation does not automatically recover the same float.
To demonstrate this, try the following: we'll create some random values, write them out to a CSV file, and read them back in, all using Pandas. First the necessary imports:
>>> import pandas as pd
>>> import numpy as np
Now create some random values, and put them into a Pandas Series object:
>>> test_values = np.random.rand(10000)
>>> s = pd.Series(test_values, name='test_values')
Now we use the to_csv method to write these values out to a file, and then read the contents of that file back into a DataFrame:
>>> s.to_csv('test.csv', header=True)
>>> df = pd.read_csv('test.csv')
Finally, let's extract the values from the relevant column of df and compare. We'll sum the result of the == operation to find out how many of the 10000 input values were recovered exactly.
>>> sum(test_values == df['test_values'])
7808
So approximately 78% of the values were recovered correctly; the others were not.
This behaviour is considered a feature of Pandas, rather than a bug. However, there's a workaround: Pandas 0.15 added a new float_precision argument to the CSV reader. By supplying float_precision='round_trip' to the read_csv operation, Pandas uses a slower but more accurate parser. Trying that on the example above, we get the values recovered perfectly:
>>> df = pd.read_csv('test.csv', float_precision='round_trip')
>>> sum(test_values == df['test_values'])
10000
Here's a second example, going in the other direction. The previous example showed that writing and then reading doesn't give back the same data. This example shows that reading and then writing doesn't preserve the data, either. The setup closely matches the one you describe in the question. First we'll create A.csv, this time using regularly-spaced values instead of random ones:
>>> import pandas as pd, numpy as np
>>> s = pd.Series(np.arange(10**4) / 1e3, name='test_values')
>>> s.to_csv('A.csv', header=True)
Now we read A.csv, and write the first half of the data back out again to B.csv, as in your Step 1.
>>> recovered_s = pd.read_csv('A.csv').test_values
>>> recovered_s[:5000].to_csv('B.csv', header=True)
Then we read in both A.csv and B.csv, and compare the first half of A with B, as in your Step 2.
>>> a = pd.read_csv('A.csv').test_values
>>> b = pd.read_csv('B.csv').test_values
>>> (a[:5000] == b).all()
False
>>> (a[:5000] == b).sum()
4251
So again, several of the values don't compare correctly. Opening up the files, A.csv looks pretty much as I expect. Here are the first 15 entries in A.csv:
,test_values
0,0.0
1,0.001
2,0.002
3,0.003
4,0.004
5,0.005
6,0.006
7,0.007
8,0.008
9,0.009
10,0.01
11,0.011
12,0.012
13,0.013
14,0.014
15,0.015
And here are the corresponding entries in B.csv:
,test_values
0,0.0
1,0.001
2,0.002
3,0.003
4,0.004
5,0.005
6,0.006
7,0.006999999999999999
8,0.008
9,0.009000000000000001
10,0.01
11,0.011000000000000001
12,0.012
13,0.013000000000000001
14,0.013999999999999999
15,0.015
See this bug report for more information on the introduction of the float_precision keyword to read_csv.

Related

Which file format uses less memory in python?

I wrote the code for points generation which will generate a dataframe for every one second and it keeps on generating. Each dataframe has 1000 rows and 7 columns.. It was implemented using while loop and thus for every iteration one dataframe is generated and it must be appended on a file. While file format should I use to manage the memory efficiency? Which file format takes less memory.? Can anyone give me a suggestion.. Is it okay to use csv? If so what datatype should I prefer to use.. Currently my dataframe has int16 values.. Should I append the same or should I convert it into binary format or byte format?

numpy arrays can be stored in binary format. Since you you have a single int16 data type, you can create a numpy array and write that. You would have 2 bytes per int16 value which is fairly good for size. The trick is that you need to know the dimensions of the stored data when you read it later. In this example its hard coded. This is a bit fragile - if you change your mind and start using different dimensions later, old data would have to be converted.
Assuming you want to read a bunch of 1000x7 dataframes later, you could do something like the example below. The writer keeps appending 1000x7 int16s and the reader chunks them back into dataframes. If you don't use anything specific to pandas itself, you would be better off just sticking with numpy for all of your operations and skip the demonstrated conversions.
import pandas as pd
import numpy as np
def write_df(filename, df):
with open(filename, "ab") as fp:
np.array(df, dtype="int16").tofile(fp)
def read_dfs(filename, dim=(1000,7)):
"""Sequentially reads dataframes from a file formatted as raw int16
with dimension 1000x7"""
size = dim[0] * dim[1]
with open(filename, "rb") as fp:
while True:
arr = np.fromfile(fp, dtype="int16", count=size)
if not len(arr):
break
yield pd.DataFrame(arr.reshape(*dim))
import os
# ready for test
test_filename = "test123"
if os.path.exists(test_filename):
os.remove(test_filename)
df = pd.DataFrame({"a":[1,2,3], "b":[4,5,6]})
# write test file
for _ in range(5):
write_df(test_filename, df)
# read and verify test file
return_data = [df for df in read_dfs(test_filename, dim=(3,2))]
assert len(return_data) == 5

Quickly search through a numpy array and sum the corresponding values

I have an array with around 160k entries which I get from a CSV-file and it looks like this:
data_arr = np.array(['ID0524', 1.0]
['ID0965', 2.5]
.
.
['ID0524', 6.7]
['ID0324', 3.0])
I now get around 3k unique ID's from some database and what I have to do is look up each of these IDs in the array and sum the corresponding numbers.
So if I would need to look up "ID0524", the sum would be 7.7.
My current working code looks something like this (I'm sorry that it's pretty ugly, I'm very new to numpy):
def sumValues(self, id)
sub_arr = data_arr[data_arr[0:data_arr.size, 0] == id]
sum_arr = sub_arr[0:sub_arr.size, 1]
return sum_arr.sum()
And it takes around ~18s to do this for all 3k IDs.
I wondered if there is probably any faster way to this as the current runtime seems a bit too long for me. I would appreciate any guidance and hints on this. Thank you!

You could try the using builtin numpy methods.
numpy.intersect1d to find the unique IDs
numpy.sum to sum them up

A convenient tool to do your task is Pandas, with its grouping mechanism.
Start from the necessary import:
import pandas as pd
Then convert data_arr to a pandasonic DataFrame:
df = pd.DataFrame({'Id': data_arr[:, 0], 'Amount': data_arr[:, 1].astype(float)})
The reason for some complication in the above code is that:
elements of your input array are of a single type (in this case
object),
so there is necessary to convert the second column to float.
Then you can get the expected result in a single instruction:
result = df.groupby('Id').sum()
The result, for your data sample, is:
Amount
Id
ID0324 3.0
ID0524 7.7
ID0965 2.5
Another approach is that you could read your CSV file directly
into a DataFrame (see read_csv method), so there is no need to use
any Numpy array.
The advantage is that read_csv is clever enough to recognize the data
type of each column separately, at least it is able to tell apart numbers
from strings.

Mixed data types in data fields

I am authoring a code designed to detect index and report errors found within very large data sets. I am reading in the data set (csv) using pandas, creating a dataframe with dozens of columns. Numerical errors are easy by converting the column of interest to an np array and using basic logical expression and the np.where function. Bam!
One of the errors I am looking for is an
invalid data type
For example, if the column was supposed to be an array of floats but a string was inadvertently entered smack dab in the middle of all of the floats. When converting to a np array it THEN converts all values into strings and causes the logic expressions to fail (as would be expected).
Ideally all non-numeric entries for that data column would be indexed as
invalid data type
with the values logged. It would then replace the value with NaN, convert the array of strings to the originally intended float values, and then continue with the assessment of numerical error checks.
This could be simply solved through for loops with a few try/catch statements. But being new to python. I am hoping for a more elegant solution.
Any suggestions?

Have a look at great expectatoins which aims to solve a similar problem. Note that until they implement their expect_column_values_to_be_parseable_as_type, you can force you column to be a string and use regex for the checks instead. For example, say you had a column called 'AGE' and wanted to validate it as an integer between 18 and 120
import great_expectations as ge
gf = ge.read_csv("my_datacsv",
dtype={
'AGE':str,
})
result = gf.expect_column_values_to_match_regex('AGE',
r'1[8-9]|[2-9][0-9]',
result_format={'result_format': 'COMPLETE'})
Alternatively, using numpy maybe something like this:
import numpy as np
#np.vectorize
def is_num(num):
try:
float(num)
return True
except:
return False
A = np.array([1,2,34,'e',5])
is_num(A)
which returns
array([ True, True, True, False, True])

pandas: check whether an element is in dataframe or given column leads to strange results

I am doing some data handling based on a DataFrame with the shape of (135150, 12) so double checking my results manually is not applicable anymore.
I encountered some 'strange' behavior when I tried to check if an element is part of the dataframe or a given column.
This behavior is reproducible with even smaller dataframes as follows:
import numpy as np
import pandas as pd
start = 1e-3
end = 2e-3
step = 0.01e-3
arr = np.arange(start, end+step, step)
val = 0.0019
df = pd.DataFrame(arr, columns=['example_value'])
print(val in df) # prints `False`
print(val in df['example_value']) # prints `True`
print(val in df.values) # prints `False`
print(val in df['example_value'].values) # prints `False`
print(df['example_value'].isin([val]).any()) # prints `False`
Since I am a very beginner in data analysis I am not able to explain this behavior.
I know that I am using different approaches involving different datatypes (like pd.Series, np.ndarray or np.array) in order to check if the given value exists in the dataframe. Additionally when using np.array or np.ndarray the machine accuracy comes in play which I am aware of in mind.
However, at the end, I need to implement several functions to filter the dataframe and count the occurrences of some values, which I have done several times before based on boolean columns in combination with performed operations like > and < successfully.
But in this case I need to filter by the exact value and count its occurrences which after all lead me to the issue described above.
So could anyone explain, what's going on here?

The underlying issue, as Divakar suggested, is floating point precision. Because DataFrames/Series are built on top of numpy, there isn't really a penalty for using numpy methods though, so you can just do something like:
df['example_value'].apply(lambda x: np.isclose(x, val)).any()
or
np.isclose(df['example_value'], val).any()
both of which correctly return True.

Pandas to_dict unwantedly modifying float numbers

My code below takes in CSV data and uses pandas to_dict() function as one step in converting the data to JSON. The problem is it is modifying the float numbers (e.g. 1.6 becomes 1.6000000000000001). I am not concerned about the loss of accuracy, but because users will see the change in the numbers, it looks amateurish.
I am aware:
this is something that has come up before here, but it was two years ago, was not really answered in a great way,
also I have an additional complication: the data frames I am looking to convert to dictionaries could be any combination of datatypes
As such the issue with the previous solutions are:
Converting all the numbers to objects only works if you don't need to (numerically) use the numbers. I want the option to calculate sums and averages which reintroduces the addition decimal issue.
Force rounding of numbers to x decimals will either reduce accuracy or add additional unnecessary 0s depending on the data the user provides
My question:
Is there a better way to ensure the numbers are not being modified, but are kept in a numeric datatype? Is it a question of changing how I import the CSV data in the first place? Surely there is a simple solution I am overlooking?
Here is a simple script that will reproduce this bug:
import pandas as pd
import sys
if sys.version_info[0] < 3:
from StringIO import StringIO
else:
from io import StringIO
CSV_Data = "Index,Column_1,Column_2,Column_3,Column_4,Column_5,Column_6,Column_7,Column_8\nindex_1,1.1,1.2,1.3,1.4,1.5,1.6,1.7,1.8\nindex_2,2.1,2.2,2.3,2.4,2.5,2.6,2.7,2.8\nindex_3,3.1,3.2,3.3,3.4,3.5,3.6,3.7,3.8\nindex_4,4.1,4.2,4.3,4.4,4.5,4.6,4.7,4.8"
input_data = StringIO(CSV_Data)
df = pd.DataFrame.from_csv(path = input_data, header = 0, sep=',', index_col=0, encoding='utf-8')
print(df.to_dict(orient = 'records'))

You could use pd.io.json.dumps to handle nested dicts with pandas objects.
Let's create a summary dict with dataframe records and custom metric.
In [137]: summary = {'df': df.to_dict(orient = 'records'), 'df_metric': df.sum() / df.min()}
In [138]: summary['df_metric']
Out[138]:
Column_1 9.454545
Column_2 9.000000
Column_3 8.615385
Column_4 8.285714
Column_5 8.000000
Column_6 7.750000
Column_7 7.529412
Column_8 7.333333
dtype: float64
In [139]: pd.io.json.dumps(summary)
Out[139]: '{"df":[{"Column_7":1.7,"Column_6":1.6,"Column_5":1.5,"Column_4":1.4,"Column_3":1.3,"Column_2":1.2,"Column_1":1.1,"Column_8":1.8},{"Column_7":2.7,"Column_6":2.6,"Column_5":2.5,"Column_4":2.4,"Column_3":2.3,"Column_2":2.2,"Column_1":2.1,"Column_8":2.8},{"Column_7":3.7,"Column_6":3.6,"Column_5":3.5,"Column_4":3.4,"Column_3":3.3,"Column_2":3.2,"Column_1":3.1,"Column_8":3.8},{"Column_7":4.7,"Column_6":4.6,"Column_5":4.5,"Column_4":4.4,"Column_3":4.3,"Column_2":4.2,"Column_1":4.1,"Column_8":4.8}],"df_metric":{"Column_1":9.4545454545,"Column_2":9.0,"Column_3":8.6153846154,"Column_4":8.2857142857,"Column_5":8.0,"Column_6":7.75,"Column_7":7.5294117647,"Column_8":7.3333333333}}'
Use, double_precision to alter the maximum digit precision of doubles.
Notice. df_metric values.
In [140]: pd.io.json.dumps(summary, double_precision=2)
Out[140]: '{"df":[{"Column_7":1.7,"Column_6":1.6,"Column_5":1.5,"Column_4":1.4,"Column_3":1.3,"Column_2":1.2,"Column_1":1.1,"Column_8":1.8},{"Column_7":2.7,"Column_6":2.6,"Column_5":2.5,"Column_4":2.4,"Column_3":2.3,"Column_2":2.2,"Column_1":2.1,"Column_8":2.8},{"Column_7":3.7,"Column_6":3.6,"Column_5":3.5,"Column_4":3.4,"Column_3":3.3,"Column_2":3.2,"Column_1":3.1,"Column_8":3.8},{"Column_7":4.7,"Column_6":4.6,"Column_5":4.5,"Column_4":4.4,"Column_3":4.3,"Column_2":4.2,"Column_1":4.1,"Column_8":4.8}],"df_metric":{"Column_1":9.45,"Column_2":9.0,"Column_3":8.62,"Column_4":8.29,"Column_5":8.0,"Column_6":7.75,"Column_7":7.53,"Column_8":7.33}}'
You could use orient='records/index/..' to handle dataframe -> to_json construction.
In [144]: pd.io.json.dumps(summary, orient='records')
Out[144]: '{"df":[{"Column_7":1.7,"Column_6":1.6,"Column_5":1.5,"Column_4":1.4,"Column_3":1.3,"Column_2":1.2,"Column_1":1.1,"Column_8":1.8},{"Column_7":2.7,"Column_6":2.6,"Column_5":2.5,"Column_4":2.4,"Column_3":2.3,"Column_2":2.2,"Column_1":2.1,"Column_8":2.8},{"Column_7":3.7,"Column_6":3.6,"Column_5":3.5,"Column_4":3.4,"Column_3":3.3,"Column_2":3.2,"Column_1":3.1,"Column_8":3.8},{"Column_7":4.7,"Column_6":4.6,"Column_5":4.5,"Column_4":4.4,"Column_3":4.3,"Column_2":4.2,"Column_1":4.1,"Column_8":4.8}],"df_metric":[9.4545454545,9.0,8.6153846154,8.2857142857,8.0,7.75,7.5294117647,7.3333333333]}'
In essence, pd.io.json.dumps - converts arbitrary object recursively into JSON, which internally uses ultrajson

I need to make df.to_dict('list') with right float numbers. But df.to_json() doesn't support orient='list' yet. So I do following:
list_oriented_dict = {
column: list(data.values())
for column, data in json.loads(df.to_json()).items()
}
Not the best way, but it works for me. Maybe some one has a more elegant solution?

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Python how does == work for float/double? - python

Related

Which file format uses less memory in python?

Quickly search through a numpy array and sum the corresponding values

Mixed data types in data fields

pandas: check whether an element is in dataframe or given column leads to strange results

Pandas to_dict unwantedly modifying float numbers

Categories

Resources