Get the same hash value for a Pandas DataFrame each time - python

My goal is to get a unique hash value for a DataFrame that I obtain from a .csv file.
The whole point is to get the same hash each time I call hash() on it.
My idea was to create the function
def _get_array_hash(arr):
    arr_hashable = arr.values
    arr_hashable.flags.writeable = False
    hash_ = hash(arr_hashable.data)
    return hash_
which accesses the underlying numpy array, sets it to an immutable state, and hashes its buffer.
INLINE UPD.
As of 08.11.2016, this version of the function doesn't work anymore. Instead, you should use
hash(df.values.tobytes())
See the comments on Most efficient property to hash for numpy array.
END OF INLINE UPD.
It works for a regular pandas DataFrame:
In [12]: data = pd.DataFrame({'A': [0], 'B': [1]})
In [13]: _get_array_hash(data)
Out[13]: -5522125492475424165
In [14]: _get_array_hash(data)
Out[14]: -5522125492475424165
But when I apply it to a DataFrame obtained from a .csv file, the hash changes between calls:
In [15]: fpath = 'foo/bar.csv'
In [16]: data_from_file = pd.read_csv(fpath)
In [17]: _get_array_hash(data_from_file)
Out[17]: 6997017925422497085
In [18]: _get_array_hash(data_from_file)
Out[18]: -7524466731745902730
Can somebody explain to me how that's possible?
I can create a new DataFrame out of it, like
new_data = pd.DataFrame(data=data_from_file.values,
                        columns=data_from_file.columns,
                        index=data_from_file.index)
and it works again:
In [25]: _get_array_hash(new_data)
Out[25]: -3546154109803008241
In [26]: _get_array_hash(new_data)
Out[26]: -3546154109803008241
But my goal is to preserve the same hash value for a DataFrame across application launches in order to retrieve some value from a cache.

As of Pandas 0.20.1, you can use the little-known (and poorly documented) hash_pandas_object (source code), which was recently made public in pandas.util. It returns one hash value for each row of the DataFrame (and works on Series etc. too).
import pandas as pd
import numpy as np
np.random.seed(42)
arr = np.random.choice(['foo', 'bar', 42], size=(3,4))
df = pd.DataFrame(arr)
print(df)
#      0    1   2    3
# 0   42  foo  42   42
# 1  foo  foo  42  bar
# 2   42   42  42   42
from pandas.util import hash_pandas_object
h = hash_pandas_object(df)
print(h)
# 0     5559921529589760079
# 1    16825627446701693880
# 2     7171023939017372657
# dtype: uint64
You can always do hash_pandas_object(df).sum() if you want an overall hash of all rows.
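If row order matters for your cache key, note that a plain .sum() is order-insensitive; a minimal sketch (my own variant, not part of the original answer) that digests the per-row hashes in order:
import hashlib
from pandas.util import hash_pandas_object

def df_digest(df):
    # Hash the per-row uint64 values in sequence, so reordering rows
    # changes the result (a .sum() would not notice).
    return hashlib.sha256(hash_pandas_object(df).values.tobytes()).hexdigest()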

Joblib provides a hashing function optimized for objects containing numpy arrays (e.g. pandas dataframes).
import joblib
joblib.hash(df)
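For instance, a quick stability check (assuming joblib is installed):
import joblib
import pandas as pd

df = pd.DataFrame({'A': [0], 'B': [1]})
# joblib.hash pickles the object and returns an MD5 hex digest by default,
# so the value is stable across interpreter sessions, unlike built-in hash().
print(joblib.hash(df))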

I had a similar problem: checking whether a DataFrame has changed. I solved it by hashing the msgpack serialization string, which seems stable across different reloads of the same data.
import pandas as pd
import hashlib
DATA_FILE = 'data.json'
data1 = pd.read_json(DATA_FILE)
data2 = pd.read_json(DATA_FILE)
assert hashlib.md5(data1.to_msgpack()).hexdigest() == hashlib.md5(data2.to_msgpack()).hexdigest()
assert hashlib.md5(data1.values.tobytes()).hexdigest() != hashlib.md5(data2.values.tobytes()).hexdigest()
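Since to_msgpack was removed from pandas in 1.0, a comparable digest of a deterministic serialization (my substitution, not the original answer's code) might look like:
import hashlib
import pandas as pd

def stable_digest(df: pd.DataFrame) -> str:
    # to_csv() with no path returns a string; hashing its bytes gives a
    # digest that is stable across reloads of the same data.
    return hashlib.md5(df.to_csv().encode('utf-8')).hexdigest()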

This function seems to work fine:
from hashlib import sha256

def hash_df(df):
    s = str(df.columns) + str(df.index) + str(df.values)
    return sha256(s.encode()).hexdigest()
(One caveat: numpy abbreviates the repr of large arrays with '...', so str(df.values) only covers the edges of a big frame and distinct frames can collide.)

Related

pandas.read_csv changes values on import

I have a csv file that looks as so:
"3040",0.24948,-0.89496
"3041",0.25344,-0.89496
"3042",0.2574,-0.891
"3043",0.2574,-0.89496
"3044",0.26136,-0.89892
"3045",0.2574,-0.891
"3046",0.26532,-0.9108
"3047",0.27324,-0.9306
"3048",0.23424,-0.8910
This data is "reference" data intended to validate calculations run on other data. Reading the data in gives me this:
In [2]: test = pd.read_csv('test.csv', header=0, names=['lx', 'ly'])
In [3]: test
Out[3]:
          lx       ly
3041  0.25344 -0.89496
3042  0.25740 -0.89100
3043  0.25740 -0.89496
3044  0.26136 -0.89892
3045  0.25740 -0.89100
3046  0.26532 -0.91080
3047  0.27324 -0.93060
3048  0.23424 -0.89100
Which looks as you might expect. Problem is, these values are not quite as they appear and comparisons with them don't work:
In [4]: test.loc[3042,'ly']
Out[4]: -0.8909999999999999
Why is it doing that? It seems to be specific to values in the csv that only have 3 places to the right of the decimal, at least so far:
In [5]: test.loc[3048,'ly']
Out[5]: -0.891
In [6]: test.loc[3047,'ly']
Out[6]: -0.9306
In [7]: test.loc[3046,'ly']
Out[7]: -0.9108
I just want the exact values from the csv, not an interpretation. Ideas?
Update:
I set float_precision='round_trip' in the read_csv parameters and that seemed to fix it (documented here). What I don't understand is why the data is changed by default as it is read in. That doesn't seem good for comparing data sets. Is there a better way to read in data for testing against other DataFrames?
Update with answer:
Changing float_precision is what I went with, although I still don't understand how pandas can misrepresent the data in this way. I get that a conversion happens on import, but 0.891 should be 0.891.
For my comparison case rather than testing equivalence I went with something different:
# rather than
df1 == df2
# I tested as
(df1 / df2) - 1 > 1e-14
This works fine for my purposes.
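A more conventional tolerance test, for what it's worth, is numpy's allclose, which also copes with zeros where a ratio would blow up (a sketch, not part of the original post):
import numpy as np

# True when every element matches to within the relative tolerance.
np.allclose(df1.values, df2.values, rtol=1e-14, atol=0.0)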
For comparison purposes with other DataFrames, you can use pd.option_context (note I removed header=0 because it was swallowing the first row of your df):
import pandas as pd

test = pd.read_csv('./Desktop/dummy.csv', names=['lx', 'ly'])
test.dtypes
with pd.option_context('display.precision', 5):
    print(test.loc[3042,'ly'])
output:
-0.891
This isn't the nicest fix, but adding
float_precision='round_trip'
won't always fix your problem either:
import pandas as pd
test = pd.read_csv('./Desktop/dummy.csv', names=['lx', 'ly'], float_precision='round_trip')
test.dtypes
test.loc[3042,'ly']
output:
-0.89100000000000001
With display.precision, everything inside the with block is displayed at the precision you set, so DataFrames you inspect under it will show the values you expect.
It seems to be linked to the data type you are loading, which in your case is float64. Using float32 you get what you expect, so you can change the dtype while loading
import numpy as np

test = pd.read_csv('test.csv', header=0, names=['lx', 'ly'],
                   dtype={'lx': np.float32, 'ly': np.float32})
or afterward
print(type(test.loc[3042,'ly'])) # <class 'numpy.float64'>
test[['lx', 'ly']] = test[['lx', 'ly']].astype('float32')
print(test.loc[3042,'ly']) # -0.891
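The root cause, for the record, is that 0.891 has no exact binary float64 representation; printing more digits shows the value actually stored (a quick check, independent of pandas):
# Every parser stores some nearby binary double, never 0.891 exactly;
# the default parser and the round_trip parser may simply pick
# slightly different nearest doubles.
print(f"{0.891:.20f}")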

Hash each value in a pandas data frame

In Python, I am trying to find the quickest way to hash each value in a pandas data frame.
I know any string can be hashed using:
hash('a string')
But how do I apply this function on each element of a pandas data frame?
This may be a very simple thing to do, but I have just started using python.
Pass the hash function to apply on the str column:
In [37]:
df = pd.DataFrame({'a':['asds','asdds','asdsadsdas']})
df
Out[37]:
a
0 asds
1 asdds
2 asdsadsdas
In [39]:
df['hash'] = df['a'].apply(hash)
df
Out[39]:
a hash
0 asds 4065519673257264805
1 asdds -2144933431774646974
2 asdsadsdas -3091042543719078458
If you want to do this to every element then call applymap (renamed to DataFrame.map in pandas 2.1):
In [42]:
df = pd.DataFrame({'a':['asds','asdds','asdsadsdas'],'b':['asewer','werwer','tyutyuty']})
df
Out[42]:
a b
0 asds asewer
1 asdds werwer
2 asdsadsdas tyutyuty
In [43]:
df.applymap(hash)
Out[43]:
a b
0 4065519673257264805 7631381377676870653
1 -2144933431774646974 -6124472830212927118
2 -3091042543719078458 -1784823178011532358
Pandas also has a function to apply a hash function on an array or column:
import pandas as pd
df = pd.DataFrame({'a':['asds','asdds','asdsadsdas']})
df["hash"] = pd.util.hash_array(df["a"].to_numpy())
In addition to @EdChum's answer, a heads-up: hash() does not return the same value for a given string on every run or every machine, because Python salts string hashes per process (unless PYTHONHASHSEED is fixed). Depending on your use case, you are better off using
import hashlib

def md5hash(s: str):
    return hashlib.md5(s.encode('utf-8')).hexdigest()  # or SHA, ...

df['a'].apply(md5hash)
# or
df.applymap(md5hash)

Sympify() and Eval() Equations Not Executing Within a Function

I am writing a script which reads values from a CSV through pandas into a DataFrame. The values, 'A' and 'B', are inputs into an equation. This equation is obtained from an XML output file from an external program. The equation provides a result for 'A' and 'B' row-by-row of the DataFrame and places those results back into the original DataFrame.
If I make a function definition, explicitly write the equation in the definition, and return that equation, things work fine, e.g.,
import pandas as pd

dataFrame = pd.read_csv()   # Reads CSV to "dataFrame"
A = dataFrame['A']          # Defines A as column A of "dataFrame"
B = dataFrame['B']          # Defines B as column B of "dataFrame"

def Func(a, b):
    P = 2*a + 3*b
    return P

outPut['P'] = Func(A, B)    # Assigns a value to each row of "outPut" for each 'A' and 'B' per row of "dataFrame"
However, what I really want to do is "build" that same equation from an XML file rather than entering it in explicitly. So, I basically pull 'terms' and 'coefficients' from the xml file and result in a string form of the equation. I then convert the string to an executable function using sympy.sympify(). e.g.,
import pandas as pd
import sympy as sy
import xml.etree.ElementTree as etree

dataFrame = pd.read_csv()   # Reads CSV to "dataFrame"
A = dataFrame['A']          # Defines A as column A of "dataFrame"
B = dataFrame['B']          # Defines B as column B of "dataFrame"
tree = etree.parse('C:\...')
.
..(some XML stuff with etree)
.
equationString = "some code that grabs terms and coefficients from XML file"  # Builds equation from XML 'terms' and 'coefficients'
P = sy.sympify(equationString)

def Func(A, B):
    global P
    return P

outPut['P'] = Func(A, B)    # Assigns a value to each row of "outPut" for each 'A' and 'B' per row of "dataFrame"
The result is that when I execute this equation over the dataFrame, the literal equation is copied into the "outPut" DF rather than the row-by-row result for each 'A' and 'B'. I don't understand why Python treats these two examples differently, nor how to achieve the result of the first example. For some reason the sympify() result is not executable. The same seems to occur when I use eval().
Elaborating on my comment, here's how to solve the problem with lambdify:
In [1]: import sympy as sp
In [2]: import pandas as pd
In [3]: import numpy as np
In [4]: df = pd.DataFrame(np.random.randn(5,2), columns=['A', 'B'])
In [5]: equationString = "2*A+3*B"
In [7]: expr = sp.S(equationString)
In [8]: expr
Out[8]: 2*A + 3*B
In [10]: f = sp.lambdify(sp.symbols("A B"), expr, modules="numpy")
In [11]: f(df['A'],df['B'])
Out[11]:
0 -2.779739
1 -1.176580
2 3.911066
3 1.888639
4 0.745293
dtype: float64
In [12]: 2*df["A"]+3*df["B"] - f(df["A"],df["B"])
Out[12]:
0 0
1 0
2 0
3 0
4 0
dtype: float64
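Tying this back to the question: once f has been built from the XML-derived string, the output frame can be filled column-wise (a sketch reusing the names from the question and the session above):
# "outPut" is the DataFrame named in the question; f evaluates the
# lambdified expression over whole columns at once.
outPut = pd.DataFrame(index=df.index)
outPut['P'] = f(df['A'], df['B'])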
Depending on the expressions encountered in your xml file, sympy may be overkill. Here's how to use eval:
In [1]: import pandas as pd
In [2]: import numpy as np
In [3]: df = pd.DataFrame(np.random.randn(5, 2), columns=['A', 'B'])
In [4]: equationString = "2*A+3*B"
In [5]: f = eval("lambda A, B: "+equationString)
In [6]: f(df['A'],df['B'])
Out[6]:
0 1.094797
1 -1.942295
2 -5.181502
3 1.888990
4 3.069017
dtype: float64
In [7]: 2*df["A"]+3*df["B"] - f(df["A"],df["B"])
Out[7]:
0 0
1 0
2 0
3 0
4 0
dtype: float64
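For simple arithmetic over columns, pandas can also evaluate the expression string directly against a DataFrame, which avoids building a lambda at all (one possible alternative; note that eval of an arbitrary string executes arbitrary code, so only feed it trusted input):
# DataFrame.eval resolves the names A and B against df's columns.
result = df.eval("2*A + 3*B")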

How to add a column to Pandas based off of other columns

I'm using Pandas and I have a very basic dataframe:
session_id datetime
5 t0ubmqqpbt01rhce201cujjtm7 2014-11-28T04:30:09Z
6 k87akpjpl004nbmhf4loiafi72 2014-11-28T04:30:11Z
7 g0t7hrqo8hgc5vlb7240d1n9l5 2014-11-28T04:30:12Z
8 ugh3fkskmedq3br99d20t78gb2 2014-11-28T04:30:15Z
9 fckkf16ahoe1uf9998eou1plc2 2014-11-28T04:30:18Z
I wish to add a third column based on the values of the current columns:
df['key'] = urlsafe_b64encode(md5('l' + df['session_id'] + df['datetime']))
But I receive:
TypeError: must be convertible to a buffer, not Series
You need to use pandas.DataFrame.apply. The code below will apply the lambda function to each row of df. You could, of course, define a separate function if you need to do something more complicated.
import pandas as pd
from io import StringIO
from base64 import urlsafe_b64encode
from hashlib import md5
s = ''' session_id datetime
5 t0ubmqqpbt01rhce201cujjtm7 2014-11-28T04:30:09Z
6 k87akpjpl004nbmhf4loiafi72 2014-11-28T04:30:11Z
7 g0t7hrqo8hgc5vlb7240d1n9l5 2014-11-28T04:30:12Z
8 ugh3fkskmedq3br99d20t78gb2 2014-11-28T04:30:15Z
9 fckkf16ahoe1uf9998eou1plc2 2014-11-28T04:30:18Z'''
df = pd.read_csv(StringIO(s), sep='\s+')
df['key'] = df.apply(lambda x: urlsafe_b64encode(md5('l' + x['session_id'] + x['datetime'])), axis=1)
Note: I couldn't get the hashing bit working on my machine, unfortunately; some unicode error (it might be because I'm using Python 3) and I don't have time to debug the inner workings of it, but the pandas part I'm pretty sure about :P
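For what it's worth, the unicode error is most likely because on Python 3 md5 wants bytes and urlsafe_b64encode wants the raw digest; a variant that should run there (my adjustment, untested against the original data):
df['key'] = df.apply(
    lambda x: urlsafe_b64encode(
        md5(('l' + x['session_id'] + x['datetime']).encode('utf-8')).digest()
    ).decode('ascii'),
    axis=1,
)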

How to use pandas dataframes and numpy arrays in Rpy2?

I'd like to use pandas for all my analysis along with numpy, but use Rpy2 for plotting my data. I want to do all analyses using pandas DataFrames and then plot them via R's full plotting capabilities through rpy2, and I am using ipython for plotting. What's the correct way to do this?
Nearly all commands I try fail. For example:
I'm trying to plot a scatter between two columns of a pandas DataFrame df. I'd like the labels of df to be used in x/y axis just like would be used if it were an R dataframe. Is there a way to do this? When I try to do it with r.plot, I get this gibberish plot:
In: r.plot(df.a, df.b) # df is pandas DataFrame
yields:
Out: rpy2.rinterface.NULL
resulting in the plot (image omitted):
As you can see, the axes labels are messed up and it's not reading the axes labels from the DataFrame like it should (the X axis is column a of df and the Y axis is column b).
If I try to make a histogram with r.hist, it doesn't work at all, yielding the error:
In: r.hist(df.a)
Out:
...
vectors.pyc in <genexpr>((x,))
293 if l < 7:
294 s = '[' + \
--> 295 ', '.join((p_str(x, max_width = math.floor(52 / l)) for x in self[ : 8])) +\
296 ']'
297 else:
vectors.pyc in p_str(x, max_width)
287 res = x
288 else:
--> 289 res = "%s..." % (str(x[ : (max_width - 3)]))
290 return res
291
TypeError: slice indices must be integers or None or have an __index__ method
And resulting in this plot (image omitted):
Any idea what the error means? And again here, the axes are all messed up and littered with gibberish data.
EDIT: This error occurs only when using ipython. When I run the command from a script, it still produces the problematic plot, but at least runs with no errors. It must be something wrong with calling these commands from ipython.
I also tried to convert the pandas DataFrame df to an R DataFrame as recommended by the poster below, but that fails too with this error:
com.convert_to_r_dataframe(mydf) # mydf is a pandas DataFrame
----> 1 com.convert_to_r_dataframe(mydf)
in convert_to_r_dataframe(df, strings_as_factors)
275 # FIXME: This doesn't handle MultiIndex
276
--> 277 for column in df:
278 value = df[column]
279 value_type = value.dtype.type
TypeError: iteration over non-sequence
How can I get these basic plotting features to work on Pandas DataFrame (with labels of plots read from the labels of the Pandas DataFrame), and also get the conversion between a Pandas DF to an R DF to work?
EDIT2: Here is a complete example of a csv file "test.txt" (http://pastebin.ca/2311928) and my code, in answer to @dale's comment:
import rpy2
from rpy2.robjects import r
import rpy2.robjects.numpy2ri
import pandas.rpy.common as com
from rpy2.robjects.packages import importr
from rpy2.robjects.lib import grid
from rpy2.robjects.lib import ggplot2
rpy2.robjects.numpy2ri.activate()
from numpy import *
import scipy
# load up pandas df
import pandas
data = pandas.read_table("./test.txt")
# plotting a column fails
print "data.c2: ", data.c2
r.plot(data.c2)
# Conversion and then plotting also fails
r_df = com.convert_to_r_dataframe(data)
r.plot(r_df)
The call to plot the column of "data.c2" fails, even though data.c2 is a column of a pandas df and therefore for all intents and purposes should be a numpy array. I use the activate() call so I thought it would handle this column as a numpy array and plot it.
The second call to plot the dataframe data after conversion to an R dataframe also fails. Why is that? If I load up test.txt from R as a dataframe, I'm able to plot() it and since my dataframe was converted from pandas to R, it seems like it should work here too.
When I do try rmagic in ipython, it does not fire up a plot window for some reason, though it does not error. I.e. if I do:
In [12]: X = np.array([0,1,2,3,4])
In [13]: Y = np.array([3,5,4,6,7])
In [14]: import rpy2
In [15]: from rpy2.robjects import r
In [16]: import rpy2.robjects.numpy2ri
In [17]: import pandas.rpy.common as com
In [18]: from rpy2.robjects.packages import importr
In [19]: from rpy2.robjects.lib import grid
In [20]: from rpy2.robjects.lib import ggplot2
In [21]: rpy2.robjects.numpy2ri.activate()
In [22]: from numpy import *
In [23]: import scipy
In [24]: r.assign("x", X)
Out[24]:
<Array - Python:0x592ad88 / R:0x6110850>
[ 0, 1, 2, 3, 4]
In [25]: r.assign("y", Y)
<Array - Python:0x592f5f0 / R:0x61109b8>
[ 3, 5, 4, 6, 7]
In [27]: %R plot(x,y)
There's no error, but no plot window either. In any case, I'd like to stick to rpy2 and not rely on rmagic if possible.
Thanks.
[note: Your code in "edit 2" is working here (Python 2.7, rpy2-2.3.2, R-2.15.2).]
As @dale mentions, whenever an R object is anonymous (that is, no R symbol exists for the object), deparse(substitute()) ends up returning the structure() of the R object; a possible fix is to specify the "xlab" and "ylab" parameters. For some plots you'll also have to specify main (the title).
Another way to work around that is to use R's formulas and feed the data frame to them (more below, after we work out the conversion part).
Forget about what is in pandas.rpy. It is both broken and seems to ignore features available in rpy2.
An earlier quick fix for conversion with ipython can be turned into a proper conversion rather easily. I am considering adding one to the rpy2 codebase (with more bells and whistles), but in the meantime just add the following snippet after all the imports in your code examples. It will transparently convert pandas DataFrame objects into rpy2 DataFrames whenever an R call is made.
from collections import OrderedDict
# (imports for the ro alias and ListVector added for completeness)
import rpy2.robjects as ro
from rpy2.robjects.vectors import ListVector

py2ri_orig = rpy2.robjects.conversion.py2ri

def conversion_pydataframe(obj):
    if isinstance(obj, pandas.core.frame.DataFrame):
        od = OrderedDict()
        for name, values in obj.iteritems():
            if values.dtype.kind == 'O':
                od[name] = rpy2.robjects.vectors.StrVector(values)
            else:
                od[name] = rpy2.robjects.conversion.py2ri(values)
        return rpy2.robjects.vectors.DataFrame(od)
    elif isinstance(obj, pandas.core.series.Series):
        # converted as a numpy array
        res = py2ri_orig(obj)
        # "index" is equivalent to "names" in R
        if obj.ndim == 1:
            res.names = ListVector({'x': ro.conversion.py2ri(obj.index)})
        else:
            res.dimnames = ListVector(ro.conversion.py2ri(obj.index))
        return res
    else:
        return py2ri_orig(obj)

rpy2.robjects.conversion.py2ri = conversion_pydataframe
Now the following code will "just work":
r.plot(rpy2.robjects.Formula('c3~c2'), data)
# `data` is converted to an rpy2 data.frame on the fly, and a scatter
# plot of c3 vs c2 is drawn (with "c2" and "c3" as the labels on the
# "x" and "y" axes).
I also note that you are importing ggplot2 without using it. Currently the conversion has to be explicitly requested. For example:
p = ggplot2.ggplot(rpy2.robjects.conversion.py2ri(data)) + \
    ggplot2.geom_histogram(ggplot2.aes_string(x = 'c3'))
p.plot()
You need to pass in the labels explicitly when calling the r.plot function.
r.plot([1,2,3],[1,2,3], xlab="X", ylab="Y")
When you plot in R, it grabs the labels via deparse(substitute(x)), which essentially grabs the variable name from plot(testX, testY). When you pass in Python objects via rpy2, the result is an anonymous R object, akin to the following in R:
> deparse(substitute(c(1,2,3)))
[1] "c(1, 2, 3)"
which is why you're getting the crazy labels.
A lot of times it's saner to use rpy2 only to push data back and forth.
r.assign('testX', df.A)
r.assign('testY', df.B)
%R plot(testX, testY)

rdf = com.convert_to_r_dataframe(df)
r.assign('bob', rdf)
%R plot(bob$A, bob$B)
http://nbviewer.ipython.org/4734581/
Use rpy. The conversion is part of pandas, so you don't need to do it yourself:
http://pandas.pydata.org/pandas-docs/dev/r_interface.html
In [1217]: from pandas import DataFrame
In [1218]: df = DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6], 'C':[7,8,9]},
......: index=["one", "two", "three"])
......:
In [1219]: r_dataframe = com.convert_to_r_dataframe(df)
In [1220]: print type(r_dataframe)
<class 'rpy2.robjects.vectors.DataFrame'>
