I am currently working with a smallish dataset (about 9 million rows). Unfortunately, most of the entries are strings, and even with coercion to categories, the frame sits at a few GB in memory.
What I would like to do is compare each row with other rows and do a straight comparison of contents. For example, given
A B C D
0 cat blue old Saturday
1 dog red old Saturday
I would like to compute
d_A d_B d_C d_D
0, 0 True True True True
0, 1 False False True True
1, 0 False False True True
1, 1 True True True True
Obviously, combinatorial explosion will preclude a comparison of every record with every other record. So we can instead use blocking, by applying groupby, say on column A.
My question is, is there a way to do this in either pandas or dask that is faster than the following sequence:
Group by index
Outer join each group to itself to produce pairs
dataframe.apply comparison function on each row of pairs
For reference, assume I have access to a good number of cores (hundreds), and about 200G of memory.
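For concreteness, here is a rough sketch of that three-step baseline (not code from the question; the blocking column "A" and the helper name naive_pairwise are illustrative, and merge(how="cross") assumes pandas 1.2+):
import pandas as pd

def naive_pairwise(df, block_col="A"):
    out = []
    for _, block in df.groupby(block_col):          # 1) group (block) the rows
        g = block.reset_index()                     # keep the original row labels as a column
        pairs = g.merge(g, how="cross", suffixes=("_l", "_r"))  # 2) all row pairs within the block
        comp = pairs.apply(                         # 3) row-wise comparison (the slow step)
            lambda row: pd.Series({"d_" + c: row[c + "_l"] == row[c + "_r"]
                                   for c in df.columns}),
            axis=1,
        )
        out.append(pd.concat([pairs[["index_l", "index_r"]], comp], axis=1))
    return pd.concat(out, ignore_index=True)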
The solution turned out to be using numpy in place of step 3). While we cannot create an outer join of every row, we can group by values in column A and create smaller groups to outer join.
The trick is then to use numpy.equal.outer(df1, df2).ravel(). When dataframes are passed as inputs to a numpy function in this way, the comparison is vectorized and much faster (at least 30x). For example:
>>> import numpy as np, pandas as pd
>>> df = pd.DataFrame({"A": ["cat", "dog"], "B": ["blue", "red"],
...                    "C": ["old", "old"], "D": ["Saturday", "Saturday"]})
>>> df
A B C D
0 cat blue old Saturday
1 dog red old Saturday
>>> result = pd.DataFrame(columns=["A", "B", "C", "D"],
...                       index=pd.MultiIndex.from_product([df.index, df.index]))
>>> result["A"] = np.equal.outer(df["A"], df["A"]).ravel()
>>> result
A B C D
0, 0 True NaN NaN NaN
0, 1 False NaN NaN NaN
1, 0 False NaN NaN NaN
1, 1 True NaN NaN NaN
You can repeat for each column, or just automate the process with columnwise apply on result.
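For example, the per-column automation can be a simple loop over df.columns (same df and result as in the snippet above):
for col in df.columns:
    # repeat the vectorized outer equality comparison for every column
    result[col] = np.equal.outer(df[col], df[col]).ravel()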
You might consider phrasing your problem as a join operation
You might consider using categoricals to reduce memory use
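For the categoricals suggestion, a minimal sketch (column names taken from the example above):
for col in ["A", "B", "C", "D"]:
    # each distinct string is stored once, with small integer codes per row
    df[col] = df[col].astype("category")
print(df.memory_usage(deep=True))  # compare before/after to see the saving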
I want to use numpy.where() to add a column to a pandas.DataFrame. I'd like to use NaN values for the rows where the condition is false (to indicate that these values are "missing").
Consider:
>>> import numpy; import pandas
>>> df = pandas.DataFrame({'A':[1,2,3,4]}); print(df)
A
0 1
1 2
2 3
3 4
>>> df['B'] = numpy.nan
>>> df['C'] = numpy.where(df['A'] < 3, 'yes', numpy.nan)
>>> print(df)
A B C
0 1 NaN yes
1 2 NaN yes
2 3 NaN nan
3 4 NaN nan
>>> df.isna()
A B C
0 False True False
1 False True False
2 False True False
3 False True False
Why does B show "NaN" but C shows "nan"? And why does DataFrame.isna() fail to detect the NaN values in C?
Should I use something other than numpy.nan inside where? None and pandas.NA both seem to work and can be detected by DataFrame.isna(), but I'm not sure these are the best choice.
Thank you!
Edit: As per @Tim Roberts and @DYZ, numpy.where returns an array of strings, so str() is called on numpy.nan. The values in column C are actually the strings "nan". The question remains, however: what is the most elegant thing to do here? Should I use None? Or something else?
np.where coerces the second and the third parameters to the same datatype. Since the second parameter is a string, the third one is converted to a string too, by calling str():
str(numpy.nan)
# 'nan'
As the result, the values in column C are all strings.
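A quick way to see this (a small check, not part of the original answer):
print(df['C'].tolist())          # ['yes', 'yes', 'nan', 'nan']
print((df['C'] == 'nan').sum())  # 2 -- the "missing" entries are the literal string 'nan'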
You can use None as the replacement value first and then convert those entries to numpy.nan with fillna():
df['C'] = numpy.where(df['A'] < 3, 'yes', None)
df['C'] = df['C'].fillna(numpy.nan)
B is a pure numeric column. C has a mixture of strings and numerics, so the column has type "object", and it prints differently.
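After applying that fix, a quick check (values follow from the example above) shows the missing entries are detected again:
print(df['C'].tolist())         # ['yes', 'yes', nan, nan]
print(df['C'].isna().tolist())  # [False, False, True, True]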
I'm in a situation in Dask that I would like to get out of without using a lot of expensive reset_index operations.
I have a task that does a groupby-apply (where the apply returns a dataframe with a different size than the input dataframe; in the example this is simulated by the .head() and .tail() calls with reset_index()).
An operation is carried out on a different dataframe, and these two dataframes need to be joined. However, the behavior is not as I had expected. I had expected the join to happen only on the dask index, and since dask doesn't implement a multi-index, I am surprised to see that it joins on both the dask index and the index returned from the apply:
import dask.dataframe as dd
import pandas as pd
ddf = dd.from_pandas(
    pd.DataFrame(
        {
            "group_col": ["A", "A", "A", "B", "B"],
            "val_col": [1, 2, 3, 4, 5],
            "val_col2": [5, 4, 3, 2, 1]
        }
    ), npartitions=1)
ddf = ddf.set_index("group_col")
out_ddf = ddf.groupby("group_col").apply(lambda _df: _df.head(2).reset_index(drop=True))
out_ddf2 = ddf.groupby("group_col").apply(lambda _df: _df.tail(1).reset_index(drop=True))
out_ddf.join(out_ddf2, rsuffix="_other").compute()
Below is the output of the above.
val_col val_col2 val_col_other val_col2_other
group_col
A 0 1 5 3.0 3.0
1 2 4 NaN NaN
B 0 4 2 5.0 1.0
1 5 1 NaN NaN
The desired output (without expensive reshuffling) would be:
val_col val_col2 val_col_other val_col2_other
group_col
A 1 5 3 3
2 4 3 3
B 4 2 5 1
5 1 5 1
I have tried various combinations of .join/.merge calls, and I have been able to achieve the result with:
out_ddf.reset_index().merge(out_ddf2.reset_index(), suffixes=(None, "_other"), on="group_col").compute()
but I want to do some more operations on the same index later on, so I'm concerned this will hurt the performance, having to jiggle around the index so much.
So I'm looking for solutions which will give the desired result without the overhead of changing the dask indices during the operation, since the data frames are pretty big.
Thanks!
The code below might not work in general, but for your example I would use the fact that the computations are done within a group and combine them into a single function that is applied per group. This avoids merges/data shuffles:
def myfunc(df):
    # first piece: the first two rows of the group, with a fresh 0..n index
    df1 = df.head(2).reset_index(drop=True)
    # second piece: the last row of the group, columns suffixed to avoid name clashes
    df2 = df.tail(1).add_suffix('_other').reset_index(drop=True)
    # join on the fresh positional index and forward-fill the shorter frame
    return df1.join(df2).fillna(method='ffill')
out_ddf = ddf.groupby('group_col').apply(myfunc)
print(out_ddf.compute())
For more complex workflows, a more nuanced solution will be needed to keep track of data dependencies in each computation.
I need to read a very large Excel file into a DataFrame. The file has string, integer, float, and Boolean data, as well as missing data and totally empty rows. It may also be worth noting that some of the cell values are derived from cell formulas and/or VBA - although theoretically that shouldn't affect anything.
As the title says, pandas sometimes reads Boolean values as float or int 1's and 0's instead of True and False. It appears to have something to do with the number of empty rows and the types of the other data. For simplicity's sake, I'm just linking a 2-sheet Excel file where the issue is replicated.
Boolean_1.xlsx
Here's the code:
import pandas as pd
df1 = pd.read_excel('Boolean_1.xlsx','Sheet1')
df2 = pd.read_excel('Boolean_1.xlsx','Sheet2')
print(df1, '\n' *2, df2)
Here's the printed output. Mainly note row ZBA, which has the same values in both sheets but different values in the DataFrames:
Name stuff Unnamed: 1 Unnamed: 2 Unnamed: 3
0 AFD a dsf ads
1 DFA 1 2 3
2 DFD 123.3 41.1 13.7
3 IIOP why why why
4 NaN NaN NaN NaN
5 ZBA False False True
Name adslfa Unnamed: 1 Unnamed: 2 Unnamed: 3
0 asdf 6.0 3.0 6.0
1 NaN NaN NaN NaN
2 NaN NaN NaN NaN
3 NaN NaN NaN NaN
4 NaN NaN NaN NaN
5 ZBA 0.0 0.0 1.0
I was also able to get integer 1's and 0's output in the large file I'm actually working on (yay), but wasn't able to easily replicate it.
What could be causing this inconsistency, and is there a way to force pandas to read Booleans as they should be read?
Pandas type-casting is applied by column / series. In general, Pandas doesn't work well with mixed types, or object dtype. You should expect internalised logic to determine the most efficient dtype for a series. In this case, Pandas has chosen float dtype as applicable for a series containing float and bool values. In my opinion, this is efficient and neat.
However, as you noted, this won't work when you have a transposed input dataset. Let's set up an example from scratch:
import pandas as pd, numpy as np
df = pd.DataFrame({'A': [True, False, True, True],
'B': [np.nan, np.nan, np.nan, False],
'C': [True, 'hello', np.nan, True]})
df = df.astype({'A': bool, 'B': float, 'C': object})
print(df)
A B C
0 True NaN True
1 False NaN hello
2 True NaN NaN
3 True 0.0 True
Option 1: change "row dtype"
You can, without transposing your data, change the dtype for objects in a row. This will force series B to have object dtype, i.e. a series storing pointers to arbitrary types:
df.iloc[3] = df.iloc[3].astype(bool)
print(df)
A B C
0 True NaN True
1 False NaN hello
2 True NaN NaN
3 True False True
print(df.dtypes)
A bool
B object
C object
dtype: object
Option 2: transpose and cast to Boolean
In my opinion, this is the better option, as a data type is being attached to a specific category / series of input data.
df = df.T # transpose dataframe
df[3] = df[3].astype(bool) # convert series to Boolean
print(df)
0 1 2 3
A True False True True
B NaN NaN NaN False
C True hello NaN True
print(df.dtypes)
0 object
1 object
2 object
3 bool
dtype: object
read_excel will determine the dtype for each column based on the first row in the column with a value. If the first row of that column is empty, read_excel will continue to the next row until a value is found.
In Sheet1, your first row with values in columns B, C, and D contains strings. Therefore, all subsequent rows will be treated as strings for these columns. In this case, FALSE = False.
In Sheet2, your first row with values in columns B, C, and D contains integers. Therefore, all subsequent rows will be treated as integers for these columns. In this case, FALSE = 0.
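If you need the Booleans preserved regardless of which row comes first, one workaround (not from the answers above; to_bool is a hypothetical helper and the column names are the auto-generated ones from the example) is to pass per-column converters to read_excel, which keeps the converted values in an object column instead of letting them be coerced:
import pandas as pd

def to_bool(x):
    # keep genuine Booleans; map 0/1 back to bool as a fallback
    # (caveat: this cannot tell a real numeric 0/1 apart from a coerced Boolean)
    if isinstance(x, bool):
        return x
    if x in (0, 1):
        return bool(x)
    return x

df2 = pd.read_excel('Boolean_1.xlsx', 'Sheet2',
                    converters={'Unnamed: 1': to_bool,
                                'Unnamed: 2': to_bool,
                                'Unnamed: 3': to_bool})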
The problem is that I want to get the trimmed mean of all the columns in a pandas dataframe (i.e. the mean of the values in a given column, excluding the max and the min values). It's likely that some columns will have nan values. Basically, I want to get the exact same functionality as the pandas.DataFrame.mean function, except that it's the trimmed mean.
The obvious solution is to use the scipy tmean function, and iterate over the df columns. So I did:
from scipy import stats
trim_mean = []
for i in data_clean3.columns:
    trim_mean.append(stats.tmean(data_clean3[i]))
This worked great until I encountered NaN values, which caused tmean to choke. Worse, when I dropped the NaN values from the dataframe, some datasets were wiped out completely because they had a NaN value in every column. This means that when I amalgamate all my datasets into a master set, there will be holes in the master set where the trimmed mean should be.
Does anyone know of a way around this? As in, is there a way to get tmean to behave like the standard scipy stats functions and ignore NaN values?
(Note that my code is calculating a large number of descriptive statistics on large datasets with limited hardware; highly involved or inefficient workarounds might not be optimal. Hopefully, though, I'm just missing something simple.)
(EDIT: Someone suggested in a comment (that has since vanished?) that I should use the scipy trim_mean function, which allows you to top and tail a specific proportion of the data. This is just to say that that solution won't work for me, as my datasets are of unequal sizes, so I cannot specify a fixed proportion of data that is OK to remove in every case; it must always be just the max and the min values.)
Consider df:
np.random.seed()
data = np.random.choice((0, 25, 35, 100, np.nan),
                        (1000, 2),
                        p=(.01, .39, .39, .01, .2))
df = pd.DataFrame(data, columns=list('AB'))
Construct your mean using sums and divide by relevant normalizer.
(df.sum() - df.min() - df.max()) / (df.notnull().sum() - 2)
A 29.707674
B 30.402228
dtype: float64
df.mean()
A 29.756987
B 30.450617
dtype: float64
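If you need this in several places, the same expression wraps naturally into a small helper (a sketch; trimmed_mean here is just an illustrative name):
def trimmed_mean(frame):
    # column-wise mean excluding one min and one max per column; sums/counts skip NaN
    return (frame.sum() - frame.min() - frame.max()) / (frame.notnull().sum() - 2)

print(trimmed_mean(df))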
You could use df.mean(skipna=True); see DataFrame.mean:
df1 = pd.DataFrame([[5, 1, 'a'], [6, 2, 'b'], [7, 3, 'd'],
                    [np.nan, 4, 'e'], [9, 5, 'f'], [5, 1, 'g']],
                   columns=["A", "B", "C"])
print(df1)
df1 = df1[df1.A != df1.A.max()]  # remove rows holding the max of A
df1 = df1[df1.A != df1.A.min()]  # remove rows holding the min of A
print("\nDataframe after removing max and min\n")
print(df1)
print("\nMean of A\n")
print(df1["A"].mean(skipna=True))
Output:
A B C
0 5.0 1 a
1 6.0 2 b
2 7.0 3 d
3 NaN 4 e
4 9.0 5 f
5 5.0 1 g
Dataframe after removing max and min
A B C
1 6.0 2 b
2 7.0 3 d
3 NaN 4 e
Mean of A
6.5
I have the following DataFrame:
2010-01-03 2010-01-04 2010-01-05 2010-01-06 2010-01-07
1560 0.002624 0.004992 -0.011085 -0.007508 -0.007508
14 0.000000 -0.000978 -0.016960 -0.016960 -0.009106
2920 0.000000 0.018150 0.018150 0.002648 0.025379
1502 0.000000 0.018150 0.011648 0.005963 0.005963
78 0.000000 0.018150 0.014873 0.014873 0.007564
I have a list of index values corresponding to rows that I want to drop from my DataFrame. For simplicity, assume my list is idx_to_drop = [1560, 1502], which corresponds to the 1st and 4th rows in the dataframe above.
I tried to run df2 = df.drop(df.index[idx_to_drop]), but that expects row numbers (positions) rather than the index values. I have many more rows and many more columns, and getting the row numbers by using the where() function takes a while.
How can I drop rows whose index values match my list?
I would tackle this by breaking the problem into two pieces. Mask what you are looking for, then sub-select the inverse.
Short answer:
df[~df.index.isin([1560, 1502])]
Explanation with runnable example, using isin:
import pandas as pd
df = pd.DataFrame({'index': [1, 2, 3, 1500, 1501],
                   'vals': [1, 2, 3, 4, 5]}).set_index('index')
bad_rows = [1500, 1501]
mask = df.index.isin(bad_rows)
print(mask)
[False False False True True]
df[mask]
vals
index
1500 4
1501 5
print(~mask)
[ True True True False False]
df[~mask]
vals
index
1 1
2 2
3 3
You can see that we've identified the two bad rows; then we want to choose all the rows that aren't the bad ones. Our mask is for the bad rows, and every other row is anything that is not the mask (~mask).
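Applied to the DataFrame from the question, with its idx_to_drop list, that becomes:
idx_to_drop = [1560, 1502]
df2 = df[~df.index.isin(idx_to_drop)]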