First, the imports:
import pandas as pd
import numpy as np
import hashlib
Next, consider the following:
np.random.seed(42)
arr = np.random.choice([41, 43, 42], size=(3,3))
df = pd.DataFrame(arr)
print(arr)
print(df)
print(hashlib.sha256(arr.tobytes()).hexdigest())
print(hashlib.sha256(df.values.tobytes()).hexdigest())
Multiple executions of this snippet always print the same hash twice: ddfee4572d380bef86d3ebe3cb7bfa7c68b7744f55f67f4e1ca5f6872c2c9ba1.
However, if we consider the following:
np.random.seed(42)
arr = np.random.choice(['foo', 'bar', 42], size=(3,3))
df = pd.DataFrame(arr)
print(arr)
print(df)
print(hashlib.sha256(arr.tobytes()).hexdigest())
print(hashlib.sha256(df.values.tobytes()).hexdigest())
Note that there are strings in the data now. The hash of arr stays fixed across evaluations (52db9328682317c44370b8186a5c6bae75f2a94c9d0d5b24d61f602857acd3de), but the hash of the pandas.DataFrame changes on every run.
Is there any Pythonic way around this? Or even a non-Pythonic one?
Edit: Related links:
Hashable DataFrames
A pandas DataFrame or Series can be hashed using the pandas.util.hash_pandas_object function, starting in version 0.20.1.
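For a single digest you can feed its per-row hashes into hashlib; a minimal sketch:
import hashlib
import pandas as pd
# hash_pandas_object returns a Series with one uint64 per row (index included);
# hashing its raw bytes gives one stable digest for the whole frame.
row_hashes = pd.util.hash_pandas_object(df, index=True)
print(hashlib.sha256(row_hashes.values.tobytes()).hexdigest())
This stays stable across runs even for object-dtype frames, because the values themselves are hashed rather than their memory addresses.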
When you use strings as values for your cells, the dataframe's dtype is object, as
df.dtypes
shows. An object array stores pointers to Python objects, and .tobytes() serializes those pointer addresses, which change from run to run. That is why you get a different hash each time.
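A quick way to see the difference, using the arrays from the snippet above:
print(arr.dtype)        # <U3: fixed-width unicode, tobytes() serializes the characters themselves
print(df.values.dtype)  # object: tobytes() serializes pointer addresses, which vary between runs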
A naive workaround is to take a string representation of the whole dataframe and hash that. In particular, either of the following works:
print(hashlib.sha256(df.to_json().encode()).hexdigest())
print(hashlib.sha256(df.to_csv().encode()).hexdigest())
Naturally, this is going to be very slow for big dataframes.
Still, it remains the case that pd.DataFrame(arr).values is not the same as arr (the round trip changes the dtype to object), and this is counter-intuitive.
See a summary: https://gist.github.com/drorata/bfc5d956c4fb928dcc77510a33009691
I wrote a package with hashable subclasses of Series and DataFrame for my needs. Hope this helps.
What would be the fastest way to convert Redis Stream output (aioredis client / hiredis parser) to a pandas DataFrame, where the Redis Stream ID's timestamp and sequence number, as well as the values, become properly type-converted pandas index columns?
Example Redis output:
[[b'1554900384437-0', [b'key', b'1']],
 [b'1554900414434-0', [b'key', b'1']]]
There seem to be two main bottlenecks here:
Pandas DataFrames store their data in column-major format, meaning each column maps to one numpy array, whereas the Redis stream data is row-by-row.
Pandas MultiIndex is made for categorical data, and converting raw arrays to the required levels/codes structure seems to be non-optimized.
Due to number 1. it is inevitable to loop over all Redis stream entries. Assuming we know the length beforehand, we can pre-allocate numpy arrays that we fill as we go along, and with some tricks reuse these arrays as the DataFrame columns. If the overhead of looping in Python is still too much, rewriting in Cython should be straightforward.
Since you didn't specify datatypes, the answer keeps everything in bytes using object-dtype numpy arrays; it should be reasonably obvious how to adapt this to a custom setting. The only reason to put all of the columns in the same array is to move an inner loop over the columns/fields from Python into C. It could be split up into e.g. one array per data type or one array per column.
from functools import partial, reduce
import numpy as np
import pandas as pd
data = [[b'1554900384437-0', [b'foo', b'1', b'bar', b'2', b'bla', b'abc']],
        [b'1554900414434-0', [b'foo', b'3', b'bar', b'4', b'bla', b'xyz']]]
colnames = data[0][1][0::2]
ncols = len(colnames)
nrows = len(data)
ts_seq = np.empty((2, nrows), dtype=np.int64)
cols = np.empty((ncols, nrows), dtype=object)  # note: dtype=np.object is deprecated in newer numpy
for i, (entry_id, fields) in enumerate(data):
    ts, seq = entry_id.split(b"-", 1)  # stream id has the form b"<ms-timestamp>-<sequence>"
    ts_seq[:, i] = (int(ts), int(seq))
    cols[:, i] = fields[1::2]
colframes = [pd.DataFrame(cols[i:i+1, :].T) for i in range(ncols)]
merge = partial(pd.merge, left_index=True, right_index=True, copy=False)
df = reduce(merge, colframes[1:], colframes[0])
df.columns = colnames
For number 2. we can use numpy.unique to create the levels/codes structure needed by Pandas MultiIndex. From the documentation it seems that numpy.unique also sorts the data. Since our data is presumably already sorted, a possible future optimisation would be to try to skip the sorting step.
ts = ts_seq[0, :]
seq = ts_seq[1, :]
maxseq = np.max(seq)
ts_levels, ts_codes = np.unique(ts, return_inverse=True)
seq_levels = np.arange(maxseq+1)
seq_codes = seq
df.index = pd.MultiIndex(levels=[ts_levels, seq_levels], codes=[ts_codes, seq_codes], names=["Timestamp", "Seq"])
Finally, we can verify that there was no copying involved by doing
cols[0, 0] = b'79'
and checking that the entries in df do indeed change.
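For example:
print(df.iloc[0, 0])  # prints b'79': the DataFrame shares the column arrays, so no copy was made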
The quickest way is to process the data in batches:
Do IO in batches of N messages (e.g. 100 messages per batch).
Convert each batch into one DataFrame (using pd.DataFrame).
Apply a lambda or conversion function to the timestamp column via its numpy values (.values), e.g. the line below; a fuller sketch of the whole batch loop follows after it:
from datetime import datetime
df['time'] = [datetime.fromtimestamp(int(t.split(b'-')[0]) / 1000) for t in df['time'].values]
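Putting the three steps together, a minimal sketch of one batch (process_batch is a hypothetical helper; it assumes raw entries shaped like the Redis output above):
from datetime import datetime
import pandas as pd

def process_batch(batch):
    # batch: list of [stream_id, [key1, val1, key2, val2, ...]] entries
    rows = [dict(zip(fields[0::2], fields[1::2]), time=stream_id)
            for stream_id, fields in batch]
    df = pd.DataFrame(rows)
    # millisecond timestamp part of the stream id -> datetime
    df['time'] = [datetime.fromtimestamp(int(t.split(b'-')[0]) / 1000)
                  for t in df['time'].values]
    return df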
you can use this:
pd.read_msgpack(redisConn.get("key"))
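This assumes the frame was serialized with the matching writer on the way in, e.g.:
redisConn.set("key", df.to_msgpack(compress='zlib'))
df = pd.read_msgpack(redisConn.get("key"))
Note that to_msgpack/read_msgpack were deprecated in pandas 0.25 and removed in 1.0, so this only applies to older versions.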
import pandas as pd
import numpy as np
test_df = pd.DataFrame([[1,2]]*4, columns=['x','y'])
test_df.iloc[0,0] = '1'
test_df.iloc[0,0] = 1
test_df.select_dtypes(include=['number'])
I want to know why column x is not included in this case.
I can reproduce on Pandas v0.19.2. The issue is when, if at all, Pandas chooses to check and recast series. You first define the series as dtype object with this assignment:
test_df.iloc[0, 0] = '1'
Pandas stores any series with strings as object dtype. You then overwrite a value in the next line without explicitly changing the dtype of the series:
test_df.iloc[0, 0] = 1
But you should not assume this automatically triggers conversion to a numeric dtype for the entire series. As far as I am aware, this is not a documented behaviour. While it may work in more recent versions, it is not a behaviour you should assume for a production workflow.
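If you need column x back in the numeric selection, recast it explicitly instead of relying on that implicit behaviour; a small sketch:
test_df['x'] = pd.to_numeric(test_df['x'])  # explicit, version-independent recast to a numeric dtype
test_df.select_dtypes(include=['number'])   # now includes column x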
Let me start off by saying that I'm fairly new to numpy and pandas. I'm trying to construct a pandas dataframe but I'm not sure that I'm doing things in an appropriate way.
My setting is that I have a large list of .Net objects (that I have very little control over) and I want to build a time series from them using a pandas DataFrame. I have an example where I have replaced the .Net class with a simplified placeholder class just for demonstration. listOfThings in the code is basically what I get from .Net, and I want to convert that into a pandas dataframe.
My questions are:
I construct the dataframe by first constructing a numpy array. Is this necessary? Also, this array doesn't have the size 1000x2 as I expect. Is there a better way to use numpy here?
This code doesn't work because it doesn't seem to be able to cast the string to a datetime64. This confuses me, since the string is in ISO format and parsing works when I try it like this: np.datetime64(str(np.datetime64('now', 'us'))).
Code sample:
import numpy as np
import pandas as pd
class PlaceholderClass:
    def time(self):
        return str(np.datetime64('now', 'us'))

    def value(self):
        return 100 * np.random.random_sample()
listOfThings = [PlaceholderClass() for i in range(1000)]
arr = np.array([(x.time(), x.value()) for x in listOfThings], dtype=[('time', np.datetime64), ('value', np.float)])
dataframe = pd.DataFrame(data=arr['value'], index=arr['time'])
Thanks in advance
Q1:
I think it is not necessary to first make an np.array and then create the dataframe. This works perfectly fine, for example:
import datetime
from random import randint

rd = lambda: datetime.date(randint(2005, 2025), randint(1, 12), randint(1, 28))
df = pd.DataFrame([(rd(), rd()) for x in range(100)])
Added later:
df = pd.DataFrame((x.value() for x in listOfThings), index=(pd.to_datetime(x.time()) for x in listOfThings))
Q2:
I noticed that pd.to_datetime('some date') almost always gets it right, even without specifying the format. Perhaps this helps.
In [115]: pd.to_datetime('2008-09-22T13:57:31.2311892-04:00')
Out[115]: Timestamp('2008-09-22 17:57:31.231189200')
Recently I needed to write a Python script to find out how many times a specific string occurs in an Excel sheet.
I noted that we can use xlwings.Range('A1').table.formula to achieve this task only if the cells are contiguous. If the cells are not contiguous, how can I accomplish it?
It's a little hacky, but why not.
By the way, I'm assuming you are using python 3.x.
First we'll create a new boolean dataframe that marks the cells matching the value you are looking for.
import pandas as pd
import numpy as np
df = pd.read_excel('path_to_your_excel..')
b = df.applymap(lambda x: x == 'value_you_want_to_find' if isinstance(x, str) else False)
and then simply sum all occurrences:
print(np.count_nonzero(b.values))
As clarified in the comments, if you already have a dataframe, you can simply use count (Note: there must be a better way of doing it):
df = pd.DataFrame({'col_a': ['a'], 'col_b': ['ab'], 'col_c': ['c']})
string_to_search = '^a$' # should actually be a regex, in this example searching for 'a'
print(sum(df[col].str.count(string_to_search).sum() for col in df.columns))
>> 1
I am trying to pass values to stats.friedmanchisquare from a dataframe df that has shape (11, 17).
This is what works for me (only for three rows in this example):
df = df.as_matrix()
print stats.friedmanchisquare(df[1, :], df[2, :], df[3, :])
which yields
(16.714285714285694, 0.00023471398805908193)
However, the line of code gets far too long when I want to use all 11 rows of df.
First, I tried to pass the values in the following manner:
df = df.as_matrix()
print stats.friedmanchisquare([df[x, :] for x in np.arange(df.shape[0])])
but I get:
ValueError:
Less than 3 levels. Friedman test not appropriate.
Second, I also tried not converting it to matrix form and leaving it as a DataFrame (which would be ideal for me), but I guess this is not supported yet, or I am doing it wrong:
print stats.friedmanchisquare([row for index, row in df.iterrows()])
which also gives me the error:
ValueError:
Less than 3 levels. Friedman test not appropriate.
So, my question is: what is the correct way of passing parameters to stats.friedmanchisquare based on df? (or even using its df.as_matrix() representation)
You can download my dataframe in csv format here and read it using:
df = pd.read_csv('df.csv', header=0, index_col=0)
Thank you for your help :)
Solution:
Based on #Ami Tavory and #vicg's answers (please vote on them), the solution to my problem, based on the matrix representation of the data, is to add the *-operator defined here, but better explained here, as follows:
df = df.as_matrix()
print stats.friedmanchisquare(*[df[x, :] for x in np.arange(df.shape[0])])
And the same is true if you want to work with the original dataframe, which is what I ideally wanted:
print stats.friedmanchisquare(*[row for index, row in df.iterrows()])
In this manner you iterate over the dataframe in its native format.
Note that I went ahead and ran some timeit tests to see which way is faster; as it turns out, converting to a numpy array first is roughly twice as fast as using df in its original DataFrame format.
This was my experimental setup:
import timeit
setup = '''
import pandas as pd
import scipy.stats as stats
import numpy as np
df = pd.read_csv('df.csv', header=0, index_col=0)
'''
theCommand = '''
df = np.array(df)
stats.friedmanchisquare(*[df[x, :] for x in np.arange(df.shape[0])])
'''
print min(timeit.Timer(stmt=theCommand, setup=setup).repeat(10, 10000))
theCommand = '''
stats.friedmanchisquare(*[row for index, row in df.iterrows()])
'''
print min(timeit.Timer(stmt=theCommand, setup=setup).repeat(10, 10000))
which yields the following results:
4.97029900551
8.7627799511
The problem I see with your first attempt is that you end up passing one list with all the row arrays inside it.
stats.friedmanchisquare needs multiple array_like arguments, not one list.
Try using the * (star/unpack) operator to unpack the list, like this:
df = df.as_matrix()
print stats.friedmanchisquare(*[df[x, :] for x in np.arange(df.shape[0])])
You could pass it using the "star operator", similarly to this:
import numpy as np
from scipy.stats import friedmanchisquare

a = np.array([[1, 2, 3], [2, 3, 4], [4, 5, 6]])
friedmanchisquare(*(a[i, :] for i in range(a.shape[0])))
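Note that unpacking a 2-D array iterates over its rows, so the generator is not strictly needed; assuming df is the original DataFrame from the question, either of these is equivalent:
friedmanchisquare(*a)          # each row of a becomes one argument
friedmanchisquare(*df.values)  # same idea, straight from the DataFrame's underlying array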