Hash each value in a pandas data frame - python

In Python, I am trying to find the quickest way to hash each value in a pandas data frame.
I know any string can be hashed using:
hash('a string')
But how do I apply this function on each element of a pandas data frame?
This may be a very simple thing to do, but I have just started using python.

Pass the hash function to apply on the str column:
In [37]:
df = pd.DataFrame({'a':['asds','asdds','asdsadsdas']})
df
Out[37]:
a
0 asds
1 asdds
2 asdsadsdas
In [39]:
df['hash'] = df['a'].apply(hash)
df
Out[39]:
a hash
0 asds 4065519673257264805
1 asdds -2144933431774646974
2 asdsadsdas -3091042543719078458
If you want to do this to every element then call applymap:
In [42]:
df = pd.DataFrame({'a':['asds','asdds','asdsadsdas'],'b':['asewer','werwer','tyutyuty']})
df
Out[42]:
a b
0 asds asewer
1 asdds werwer
2 asdsadsdas tyutyuty
In [43]:
df.applymap(hash)
Out[43]:
a b
0 4065519673257264805 7631381377676870653
1 -2144933431774646974 -6124472830212927118
2 -3091042543719078458 -1784823178011532358

Pandas also has a function to apply a hash function on an array or column:
import pandas as pd
df = pd.DataFrame({'a':['asds','asdds','asdsadsdas']})
df["hash"] = pd.util.hash_array(df["a"].to_numpy())

In addition to @EdChum's answer, a heads-up: hash() does not return the same value for a given string across runs or machines (string hashing is salted per process unless PYTHONHASHSEED is fixed). Depending on your use case, you may be better off using
import hashlib
def md5hash(s: str):
    return hashlib.md5(s.encode('utf-8')).hexdigest()  # or SHA, ...
df['a'].apply(md5hash)
# or
df.applymap(md5hash)
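On newer pandas (2.1 or later), applymap is deprecated in favor of DataFrame.map; a minimal, self-contained sketch of the element-wise hashlib approach, assuming every value is a string:
import hashlib
import pandas as pd

def md5hash(s: str) -> str:
    # deterministic across runs, unlike the built-in hash()
    return hashlib.md5(s.encode('utf-8')).hexdigest()

df = pd.DataFrame({'a': ['asds', 'asdds'], 'b': ['asewer', 'werwer']})
hashed = df.map(md5hash)  # use df.applymap(md5hash) on pandas < 2.1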

Related

How to implement python custom function on dictionary of dataframes

I have a dictionary that contains 3 dataframes.
How do I apply a custom function to each dataframe in the dictionary?
In simpler terms, I want to apply the function find_outliers, as seen below,
# User defined function : find_outliers
#(I)
from scipy import stats
outlier_threshold = 1.5
ddof = 0
def find_outliers(s: pd.Series):
    outlier_mask = np.abs(stats.zscore(s, ddof=ddof)) > outlier_threshold
    # replace boolean values with corresponding strings
    return ['background-color:blue' if val else '' for val in outlier_mask]
To the dictionary of dataframes dict_of_dfs below
# the dataset
import numpy as np
import pandas as pd
df = {
'col_A':['A_1001', 'A_1001', 'A_1001', 'A_1001', 'B_1002','B_1002','B_1002','B_1002','D_1003','D_1003','D_1003','D_1003'],
'col_X':[110.21, 191.12, 190.21, 12.00, 245.09,4321.8,122.99,122.88,134.28,148.14,161.17,132.17],
'col_Y':[100.22,199.10, 191.13,199.99, 255.19,131.22,144.27,192.21,7005.15,12.02,185.42,198.00],
'col_Z':[140.29, 291.07, 390.22, 245.09, 4122.62,4004.52,395.17,149.19,288.91,123.93,913.17,1434.85]
}
df = pd.DataFrame(df)
df
#dictionary_of_dataframes
#(II)
dict_of_dfs=dict(tuple(df.groupby('col_A')))
and lastly, flag outliers in each df of the dict_of_dfs
# end goal is to find/flag outliers in each `df` of the `dict_of_dfs`
#(III)
desired_cols = ['col_X','col_Y','col_Z']
dict_of_dfs.style.apply(find_outliers, subset=desired_cols)
In summary, I want to apply (I) to (II) and finally flag the outliers as in (III).
Thanks for your attempt. :)
Desired output should look like this, but for the three dataframes
This may not be exactly what you want, but here is how I'd approach it. You'll have to work out the details of the function, because you have it written to receive a Series rather than a DataFrame. A groupby apply() will send each subset of rows to the function, where you can perform the actions on that subset and return the result.
For consideration:
inside the function you may be able to handle all columns like so:
def find_outliers(x):
    for col in ['col_X','col_Y','col_Z']:
        outlier_mask = np.abs(stats.zscore(x[col], ddof=ddof)) > outlier_threshold
        x[col] = ['outlier' if val else '' for val in outlier_mask]
    return x
newdf = df.groupby('col_A').apply(find_outliers)
col_A col_X col_Y col_Z
0 A_1001 outlier
1 A_1001
2 A_1001
3 A_1001 outlier
4 B_1002 outlier
5 B_1002 outlier
6 B_1002
7 B_1002
8 D_1003 outlier
9 D_1003
10 D_1003
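If the goal is to keep the original Styler-based highlighting from (I) and apply it to every DataFrame in the dictionary, a minimal sketch (assuming the find_outliers from (I) and the dict_of_dfs from (II) in the question) is a plain dict comprehension:
desired_cols = ['col_X', 'col_Y', 'col_Z']
# one Styler per group; find_outliers is applied column-wise to the chosen subset
styled = {
    key: d.style.apply(find_outliers, subset=desired_cols)
    for key, d in dict_of_dfs.items()
}
styled['A_1001']  # display one of the styled frames, e.g. in a notebook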

manipulate a column of dataframe with conditions

In order to change a string's suffix into a prefix in a column of a dataframe, which is created, for example, with the following code:
import pandas as pd
df = pd.DataFrame({'a':['100000.ss','200000.zz'],'b':[10,18]},index=[1,2])
a b
1 100000.ss 10
2 200000.zz 18
I tried the one-line code below, but the result shows that the if/else statement doesn't work. Why?
df['a'] = df['a'].apply(lambda x: 'ss.'+x[:6] if x.find("ss") else 'zz.'+x[:6])
a b
1 ss.100000 10
2 ss.200000 18
Each x in your lambda function is a string. x.find returns -1 if the substring is not found, and -1 is truthy (only 0 is falsy). Therefore, your lambda always returns 'ss.' + .... Try changing your lambda to this:
df['a'].apply(lambda x: 'ss.'+x[:6] if x.find("ss") != -1 else 'zz.'+x[:6])
Out[4]:
1 ss.100000
2 zz.200000
Name: a, dtype: object
Anyway, you don't need apply for this. Just use the pandas str accessor:
df['a'].str[-2:] + '.' + df['a'].str[:-3]
Out[10]:
1 ss.100000
2 zz.200000
Name: a, dtype: object
Why do the hard work when there is a library that does it for you?
import pandas as pd
from pathlib import Path
df = pd.DataFrame({'a':['100000.ss','200000.zz'],'b':[10,18]},index=[1,2])
df.assign(
    a=lambda x: x["a"].apply(lambda s: f"{Path(s).suffix[1:]}.{Path(s).stem}")
)
output
   a          b
1  ss.100000  10
2  zz.200000  18
There may be ways to do this in fewer lines, but here is one solution:
import pandas as pd
df = pd.DataFrame({'a':['100000.ss','200000.zz'],'b':[10,18]},index=[1,2])
df[['First','Last']] = df.a.str.split(".",expand=True)
df['a']=df['Last']+'.'+df['First']
df = df.drop(['First','Last'], axis=1)  # drop the helper columns
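As another sketch, a single regex with the str accessor can swap the two parts directly, assuming each value is digits, then a dot, then a suffix:
df['a'] = df['a'].str.replace(r'^(\d+)\.(\w+)$', r'\2.\1', regex=True)  # '100000.ss' -> 'ss.100000'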

Compare columns in Dataframe with inconsistent type values

import pandas as pd
df = pd.DataFrame({'RMDS': ['10.686000','NYSE_XNAS','0.472590','qrtr'], 'Mstar': ['10.690000', 'NYSE_XNAS', '0.473590','mnthly']})
Dataframe df will look like this:
Mstar RMDS
0 10.690000 10.686000
1 NYSE_XNAS NYSE_XNAS
2 0.473590 0.472590
3 mnthly qrtr
I want to compare the values of 'RMDS' with 'Mstar'. The dtype of the dataframe is 'object'; it is a huge dataframe and I need to compare rounded values:
mask = np.around(pd.to_numeric(df.Mstar), 2) != np.around(pd.to_numeric(df.RMDS), 2)
df_Difference = df[mask]
Since the values in the columns are not consistent, the logic above fails whenever string values like 'qrtr' appear, because I am using pd.to_numeric. But I still want to compare 'qrtr' in 'RMDS' to 'mnthly' in 'Mstar'.
Is there any way I could handle this type of situation?
Use pd.to_numeric to convert what you can, then .fillna to get everything back that wasn't converted.
import pandas as pd
import numpy as np
df = np.round(df.apply(pd.to_numeric, errors='coerce'),2).fillna(df)
# RMDS Mstar
#0 10.69 10.69
#1 NYSE_XNAS NYSE_XNAS
#2 0.47 0.47
#3 qrtr mnthly
df.RMDS == df.Mstar
#0 True
#1 True
#2 True
#3 False
#dtype: bool
Alternatively, define your own function and use .applymap
def my_round(x):
    try:
        return np.round(float(x), 2)
    except ValueError:
        return x
df = df.applymap(my_round)
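To get back to the df_Difference from the question, a short sketch of the final comparison after the element-wise rounding above:
mask = df['RMDS'] != df['Mstar']   # df has already been rounded element-wise by my_round
df_Difference = df[mask]           # only the 'qrtr' vs 'mnthly' row remains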

Get the same hash value for a Pandas DataFrame each time

My goal is to get a unique hash value for a DataFrame, which I obtain from a .csv file.
The whole point is to get the same hash each time I call hash() on it.
My idea was to create the function
def _get_array_hash(arr):
    arr_hashable = arr.values
    arr_hashable.flags.writeable = False
    hash_ = hash(arr_hashable.data)
    return hash_
which gets the underlying numpy array, sets it to an immutable state, and hashes the buffer.
INLINE UPD.
As of 08.11.2016, this version of the function doesn't work anymore. Instead, you should use
hash(df.values.tobytes())
See the comments on "Most efficient property to hash for numpy array".
END OF INLINE UPD.
It works for a regular pandas DataFrame:
In [12]: data = pd.DataFrame({'A': [0], 'B': [1]})
In [13]: _get_array_hash(data)
Out[13]: -5522125492475424165
In [14]: _get_array_hash(data)
Out[14]: -5522125492475424165
But when I try to apply it to a DataFrame obtained from a .csv file:
In [15]: fpath = 'foo/bar.csv'
In [16]: data_from_file = pd.read_csv(fpath)
In [17]: _get_array_hash(data_from_file)
Out[17]: 6997017925422497085
In [18]: _get_array_hash(data_from_file)
Out[18]: -7524466731745902730
Can somebody explain to me how that's possible?
I can create a new DataFrame out of it, like
new_data = pd.DataFrame(data=data_from_file.values,
                        columns=data_from_file.columns,
                        index=data_from_file.index)
and it works again
In [25]: _get_array_hash(new_data)
Out[25]: -3546154109803008241
In [26]: _get_array_hash(new_data)
Out[26]: -3546154109803008241
But my goal is to preserve the same hash value for a dataframe across application launches in order to retrieve some value from cache.
As of pandas 0.20.1, you can use the little-known (and poorly documented) hash_pandas_object (source code), which was recently made public in pandas.util. It returns one hash value for each row of the dataframe (and works on Series etc. too).
import pandas as pd
import numpy as np
np.random.seed(42)
arr = np.random.choice(['foo', 'bar', 42], size=(3,4))
df = pd.DataFrame(arr)
print(df)
# 0 1 2 3
# 0 42 foo 42 42
# 1 foo foo 42 bar
# 2 42 42 42 42
from pandas.util import hash_pandas_object
h = hash_pandas_object(df)
print(h)
# 0 5559921529589760079
# 1 16825627446701693880
# 2 7171023939017372657
# dtype: uint64
You can always do hash_pandas_object(df).sum() if you want an overall hash of all rows.
Joblib provides a hashing function optimized for objects containing numpy arrays (e.g. pandas dataframes).
import joblib
joblib.hash(df)
I had a similar problem: checking whether a dataframe has changed. I solved it by hashing the msgpack serialization string, which seems stable across different reloads of the same data. (Note that DataFrame.to_msgpack was deprecated in pandas 0.25 and removed in 1.0, so this only works on older pandas versions.)
import pandas as pd
import hashlib
DATA_FILE = 'data.json'
data1 = pd.read_json(DATA_FILE)
data2 = pd.read_json(DATA_FILE)
assert hashlib.md5(data1.to_msgpack()).hexdigest() == hashlib.md5(data2.to_msgpack()).hexdigest()
assert hashlib.md5(data1.values.tobytes()).hexdigest() != hashlib.md5(data2.values.tobytes()).hexdigest()
This function seems to work fine:
from hashlib import sha256
def hash_df(df):
    s = str(df.columns) + str(df.index) + str(df.values)
    return sha256(s.encode()).hexdigest()
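If a hex digest that stays stable across application launches is the goal, one more sketch combines hash_pandas_object (deterministic, covers values and index) with hashlib; it assumes the column dtypes are hashable by pandas:
import hashlib
import pandas as pd
from pandas.util import hash_pandas_object

def stable_df_digest(df: pd.DataFrame) -> str:
    # one uint64 per row, deterministic across runs and machines
    row_hashes = hash_pandas_object(df, index=True)
    return hashlib.sha256(row_hashes.values.tobytes()).hexdigest()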

Sympify() and Eval() Equations Not Executing Within a Function

I am writing a script which reads values from a CSV through pandas into a DataFrame. The values, 'A' and 'B', are inputs into an equation. This equation is obtained from an XML output file from an external program. The equation provides a result for 'A' and 'B' row-by-row of the DataFrame and places those results back into the original DataFrame.
If I make a function definition, explicitly write the equation in the definition, and return that equation, things work fine, e.g.,
import pandas as pd
dataFrame = pd.read_csv() # Reads CSV to "dataFrame"
A = dataFrame['A'] # Defines A as row A in "dataFrame"
B = dataFrame['B'] # Defines B as row B in "dataFrame"
def Func(a,b):
    P = 2*a+3*b
    return P
outPut['P'] = Func(A, B) # Assigns a value to each row in "outPut" for each 'A' and 'B' per row of "dataFrame"
However, what I really want to do is "build" that same equation from an XML file rather than entering it explicitly. So, I basically pull 'terms' and 'coefficients' from the XML file, resulting in a string form of the equation. I then convert the string to an executable function using sympy.sympify(), e.g.,
import pandas as pd
import sympy as sy
import xml.etree.ElementTree as etree
dataFrame = pd.read_csv() # Reads CSV to "dataFrame"
A = dataFrame['A'] # Defines A as row A in "dataFrame"
B = dataFrame['B'] # Defines B as row B in "dataFrame"
tree = etree.parse('C:\...')
.
..(some XML stuff with etree)
.
equationString = "some code that grabs terms and coefficients from XML file" # Builds equation from XML 'terms' and 'coefficients'
P = sy.sympify(equationString)
def Func(A, B):
    global P
    return P
outPut['P'] = Func(A, B) # Assigns a value to each row in "outPut" for each 'A' and 'B' per row of "dataFrame"
The result is that when I execute this equation over the dataFrame, the literal equation is copied into the "outPut" DataFrame rather than the row-by-row result for each 'A' and 'B'. I don't understand why Python treats these code examples differently, nor how to achieve the result of the first example. For some reason the sympify() result is not executable. The same seems to occur when I use eval().
Elaborating on my comment, here's how to solve the problem with lambdify
In [1]: import sympy as sp
In [2]: import pandas as pd
In [3]: import numpy as np
In [4]: df = pd.DataFrame(np.random.randn(5,2), columns=['A', 'B'])
In [5]: equationString = "2*A+3*B"
In [7]: expr = sp.S(equationString)
In [8]: expr
Out[8]: 2*A + 3*B
In [10]: f = sp.lambdify(sp.symbols("A B"), expr, modules="numpy")
In [11]: f(df['A'],df['B'])
Out[11]:
0 -2.779739
1 -1.176580
2 3.911066
3 1.888639
4 0.745293
dtype: float64
In [12]: 2*df["A"]+3*df["B"] - f(df["A"],df["B"])
Out[12]:
0 0
1 0
2 0
3 0
4 0
dtype: float64
Depending on the expressions encountered in your xml file, sympy may be overkill. Here's how to use eval
In [1]: import pandas as pd
In [2]: import numpy as np
In [3]: df = pd.DataFrame(np.random.randn(5, 2), columns=['A', 'B'])
In [4]: equationString = "2*A+3*B"
In [5]: f = eval("lambda A, B: "+equationString)
In [6]: f(df['A'],df['B'])
Out[6]:
0 1.094797
1 -1.942295
2 -5.181502
3 1.888990
4 3.069017
dtype: float64
In [7]: 2*df["A"]+3*df["B"] - f(df["A"],df["B"])
Out[7]:
0 0
1 0
2 0
3 0
4 0
dtype: float64
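Depending on the expression, pandas itself can evaluate the string: DataFrame.eval accepts column names as variables. A minimal sketch, assuming the XML-built string only uses column names and arithmetic operators:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(5, 2), columns=['A', 'B'])
equationString = "2*A+3*B"
df['P'] = df.eval(equationString)  # evaluated column-wise over A and B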
