vectorized shuffling per row in pandas - python

I want to shuffle the columns of a pandas data frame.
However, the default method (sample) shuffles all the columns in the same way.
How can I efficiently shuffle the columns of each row differently?
import pandas as pd
df = pd.DataFrame({'foo':[1,4,7],'bar':[2,5,8],'baz':[3,6,9],})
display(df)
df.sample(frac=1, axis=1)
Certainly, an apply-based solution would work - but it would not be vectorized and would therefore be slow.
Is there a fast (and ideally vectorized) way to sample differently for each row?

Let us try np.random.rand and argsort to generate the shuffled indices:
import numpy as np

# argsort of uniform random values yields an independent permutation per row
i = np.random.rand(*df.shape).argsort(axis=1)
df.values[:] = np.take_along_axis(df.to_numpy(), i, axis=1)
print(df)
   foo  bar  baz
0    3    1    2
1    4    5    6
2    7    9    8
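On NumPy 1.20 or newer there is also a single call that shuffles each row independently; a minimal sketch, assuming the same df as in the question:
import numpy as np
import pandas as pd

df = pd.DataFrame({'foo': [1, 4, 7], 'bar': [2, 5, 8], 'baz': [3, 6, 9]})

# Generator.permuted shuffles along axis=1 independently for every row (NumPy >= 1.20)
rng = np.random.default_rng()
shuffled = pd.DataFrame(rng.permuted(df.to_numpy(), axis=1),
                        index=df.index, columns=df.columns)
print(shuffled)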

You can try this solution, which builds one random column permutation per row and applies them with fancy indexing:
import numpy as np

def shuffle_columns_per_row(df):
    arr = df.values
    x, y = arr.shape
    # row indices of shape (x, y), paired with one column permutation per row
    rows = np.indices((x, y))[0]
    cols = [np.random.permutation(y) for _ in range(x)]
    return pd.DataFrame(arr[rows, cols], columns=df.columns)
| foo | bar | baz |
|-----|-----|-----|
| 3 | 2 | 1 |
| 5 | 6 | 4 |
| 9 | 7 | 8 |
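For completeness, output like the table above would come from a call such as the one below; note that the permutations themselves are still built in a Python-level loop, so only the final indexing step is vectorized:
shuffled = shuffle_columns_per_row(df)
print(shuffled)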

A quick check gives the benchmark:
%%timeit
df.sample(frac=1, axis=1)
# 1000 loops, best of 5: 288 µs per loop
With apply, as you said, we get:
%%timeit
idx = np.random.choice([0, 1, 2], size=(3,), replace=False)
df.apply(lambda x: x.iloc[idx], axis=1)
# 1000 loops, best of 5: 1.47 ms per loop -> ~5 times slower
We could instead use iloc:
%%timeit
idx = np.random.choice([0, 1, 2], size=(3,), replace=False)
df.iloc[:, idx]
# 1000 loops, best of 5: 398 µs per loop -> ~1.4 times slower
If you can live with a roughly 1.4x slowdown, I think the iloc version would work.

Related

Adding column of weights to pandas DF by a condition on the DFs column

What's the most pythonic way to add a column (of weights) to an existing pandas DataFrame "df", based on a condition on one of df's columns?
Small example:
df = pd.DataFrame({'A' : [1, 2, 3], 'B' : [4, 5, 6]})
df
Out[110]:
A B
0 1 4
1 2 5
2 3 6
I'd like to add a "weight" column where df['weight'] = 20 if df['B'] >= 6, and df['weight'] = 1 otherwise.
So my output will be:
A B weight
0 1 4 1
1 2 5 1
2 3 6 20
Approach #1
Here's one with type-conversion and scaling -
df['weight'] = (df['B'] >= 6)*19+1
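The trick is that the boolean mask converts to 0/1 under arithmetic, so scaling by 19 and adding 1 maps False to 1 and True to 20. A quick illustration on the example frame:
mask = df['B'] >= 6
print(mask.astype(int))   # 0, 0, 1
print(mask * 19 + 1)      # 1, 1, 20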
Approach #2
Another, possibly faster, one using the underlying array data -
df['weight'] = (df['B'].values >= 6)*19+1
Approach #3
Leverage multiple cores with the numexpr module -
import numexpr as ne
val = df['B'].values
df['weight'] = ne.evaluate('(val >= 6)*19+1')
Timings on 500k rows (as mentioned by the OP) of random data in the range [0, 9), for the vectorized methods posted thus far -
In [149]: np.random.seed(0)
...: df = pd.DataFrame({'B' : np.random.randint(0,9,(500000))})
# @jpp's soln
In [150]: %timeit df['weight1'] = np.where(df['B'] >= 6, 20, 1)
100 loops, best of 3: 3.57 ms per loop
# @jpp's soln with array data
In [151]: %timeit df['weight2'] = np.where(df['B'].values >= 6, 20, 1)
100 loops, best of 3: 3.27 ms per loop
In [154]: %timeit df['weight3'] = (df['B'] >= 6)*19+1
100 loops, best of 3: 2.73 ms per loop
In [155]: %timeit df['weight4'] = (df['B'].values >= 6)*19+1
1000 loops, best of 3: 1.76 ms per loop
In [156]: %%timeit
...: val = df['B'].values
...: df['weight5'] = ne.evaluate('(val >= 6)*19+1')
1000 loops, best of 3: 1.14 ms per loop
One last one...
Since the output is either 1 or 20, we can safely use lower precision (uint8) for a further speedup over the versions already discussed, like so -
In [208]: %timeit df['weight6'] = (df['B'].values >= 6)*np.uint8(19)+1
1000 loops, best of 3: 428 µs per loop
You can use numpy.where for a vectorised solution:
df['weight'] = np.where(df['B'] >= 6, 20, 1)
Result:
A B weight
0 1 4 1
1 2 5 1
2 3 6 20
Here's a method using df.apply
df['weight'] = df.apply(lambda row: 20 if row['B'] >= 6 else 1, axis=1)
Output:
In [6]: df
Out[6]:
A B weight
0 1 4 1
1 2 5 1
2 3 6 20

pandas dataframe use np.where and drop together

I have a dataframe and I'd like to be able to use np.where to find certain elements based on a given condition, and then use pd.drop to erase the elements corresponding to the index found with np.where.
I.e.,
idx_to_drop = np.where(myDf['column10'].isnull() | myDf['column14'].isnull())
myDf.drop(idx_to_drop)
But I get a value error since drop does not take numpy array indexes. Is there a way to achieve this using np.where and some drop function in pandas?
There are two common patterns to achieve that:
1. Select the rows that DON'T satisfy your "dropping" condition, i.e. negate the conditions and keep the rows that satisfy the negated conditions - @jezrael has provided a good example of that approach.
2. Drop the rows that satisfy your "dropping" conditions:
df = df.drop(np.where(df['column10'].isnull() | df['column14'].isnull())[0])
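Note that np.where returns positional indices while drop expects index labels, so the line above only works as-is with a default RangeIndex. A safer sketch that maps positions back to labels first:
import numpy as np

# positions of rows where either column is null
pos = np.where(df['column10'].isnull() | df['column14'].isnull())[0]

# translate positions into index labels before dropping, so this also works
# when the DataFrame does not have a default RangeIndex
df = df.drop(df.index[pos])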
Timing: the first approach seems to be a bit faster:
Setup:
df = pd.DataFrame(np.random.rand(100,5), columns=list('abcde'))
df.loc[::7, ::2] = np.nan
df = pd.concat([df] * 10**4, ignore_index=True)
In [117]: df.shape
Out[117]: (1000000, 5)
In [118]: %timeit df[~(df['a'].isnull() | df['e'].isnull())]
10 loops, best of 3: 46.6 ms per loop
In [119]: %timeit df[df['a'].notnull() & df['e'].notnull()]
10 loops, best of 3: 39.9 ms per loop
In [120]: %timeit df.drop(np.where(df['a'].isnull() | df['e'].isnull())[0])
10 loops, best of 3: 65.5 ms per loop
In [122]: %timeit df.drop(np.where(df[['a','e']].isnull().any(1))[0])
10 loops, best of 3: 97.1 ms per loop
In [123]: %timeit df[df[['a','e']].notnull().all(1)]
10 loops, best of 3: 72 ms per loop
I think you need boolean indexing with the condition inverted by ~, using isnull and | (bitwise OR):
print (~(myDf['column10'].isnull() | myDf['column14'].isnull()))
0 False
1 True
2 False
dtype: bool
myDf[~(myDf['column10'].isnull() | myDf['column14'].isnull())]
Sample:
myDf = pd.DataFrame({'column10':[np.nan, 1,5], 'column14':[np.nan, 1,np.nan]})
print (myDf)
column10 column14
0 NaN NaN
1 1.0 1.0
2 5.0 NaN
myDf = myDf[~(myDf['column10'].isnull() | myDf['column14'].isnull())]
print (myDf)
column10 column14
1 1.0 1.0
Solution with notnull and & (bitwise and)
myDf = myDf[myDf['column10'].notnull() & myDf['column14'].notnull()]
print (myDf)
column10 column14
1 1.0 1.0
Other solutions, with any or all:
myDf = myDf[~myDf[['column10', 'column14']].isnull().any(axis=1)]
print (myDf)
column10 column14
1 1.0 1.0
myDf = myDf[myDf[['column10', 'column14']].notnull().all(axis=1)]
print (myDf)
column10 column14
1 1.0 1.0
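For reference, the same row filter can also be written with dropna and a column subset, which drops any row where either of the listed columns is NaN (a sketch equivalent to the filters above):
myDf = myDf.dropna(subset=['column10', 'column14'])
print(myDf)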

pandas rolling max with groupby

I have a problem getting the rolling function of Pandas to do what I want. For each row, I want to calculate the maximum so far within the group. Here is an example:
df = pd.DataFrame([[1,3], [1,6], [1,3], [2,2], [2,1]], columns=['id', 'value'])
looks like
id value
0 1 3
1 1 6
2 1 3
3 2 2
4 2 1
Now I wish to obtain the following DataFrame:
id value
0 1 3
1 1 6
2 1 6
3 2 2
4 2 2
The problem is that when I do
df.groupby('id')['value'].rolling(1).max()
I get the same DataFrame back. And when I do
df.groupby('id')['value'].rolling(3).max()
I get a DataFrame with Nans. Can someone explain how to properly use rolling or some other Pandas function to obtain the DataFrame I want?
It looks like you need cummax() instead of .rolling(N).max()
In [29]: df['new'] = df.groupby('id').value.cummax()
In [30]: df
Out[30]:
id value new
0 1 3 3
1 1 6 6
2 1 3 6
3 2 2 2
4 2 1 2
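The reason .rolling(N).max() doesn't give this is that rolling uses a fixed-size window, whereas "maximum so far" is an expanding window; cummax is the direct spelling of that. For reference, a sketch of the equivalent (but slower) expanding formulation:
# expanding() grows the window from the start of each group, so its max
# is the running maximum; cummax() computes the same thing directly
df['new'] = (df.groupby('id')['value']
               .expanding()
               .max()
               .reset_index(level=0, drop=True))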
Timing (using brand new Pandas version 0.20.1):
In [3]: df = pd.concat([df] * 10**4, ignore_index=True)
In [4]: df.shape
Out[4]: (50000, 2)
In [5]: %timeit df.groupby('id').value.apply(lambda x: x.cummax())
100 loops, best of 3: 15.8 ms per loop
In [6]: %timeit df.groupby('id').value.cummax()
100 loops, best of 3: 4.09 ms per loop
NOTE: from Pandas 0.20.0 what's new
Improved performance of groupby().cummin() and groupby().cummax() (GH15048, GH15109, GH15561, GH15635)
Using apply will be a tiny bit faster:
# Using apply
df['output'] = df.groupby('id').value.apply(lambda x: x.cummax())
%timeit df['output'] = df.groupby('id').value.apply(lambda x: x.cummax())
1000 loops, best of 3: 1.57 ms per loop
Other method:
df['output'] = df.groupby('id').value.cummax()
%timeit df['output'] = df.groupby('id').value.cummax()
1000 loops, best of 3: 1.66 ms per loop

Use Map function with subsets of a pandas series or dataframe

Let's say I have a pandas.DataFrame that looks as follows:
c1 | c2
-------
1 | 5
2 | 6
3 | 7
4 | 8
.....
1 | 7
and I'm looking to map a function (DataFrame.corr) over it, but taking n rows at a time. The result should be a Series of correlation values, which would be shorter than the original DataFrame (or have a few entries that didn't get a full n rows of data).
Is there a way to do this, and how? I've been looking through the DataFrame and Map, Apply, Filter documentation, but there doesn't seem to be an obvious or clean solution.
With pandas 0.20, using rolling with corr produces a multi-indexed DataFrame. You can slice afterwards to get at what you're looking for.
Consider the dataframe df
np.random.seed([3,1415])
df = pd.DataFrame(np.random.randint(10, size=(10, 2)), columns=['c1', 'c2'])
c1 c2
0 0 2
1 7 3
2 8 7
3 0 6
4 8 6
5 0 2
6 0 4
7 9 7
8 3 2
9 4 3
rolling + corr... pandas 0.20.x
df.rolling(5).corr().dropna().c1.xs('c2', level=1)
# Or equivalently
# df.rolling(5).corr().stack().xs(['c1', 'c2'], level=[1, 2])
4 0.399056
5 0.399056
6 0.684653
7 0.696074
8 0.841136
9 0.762187
Name: c1, dtype: float64
rolling + corr... pandas 0.19.x or prior
Prior to 0.20, rolling + corr produced a pd.Panel
df.rolling(5).corr().loc[:, 'c1', 'c2'].dropna()
4 0.399056
5 0.399056
6 0.684653
7 0.696074
8 0.841136
9 0.762187
Name: c2, dtype: float64
numpy + as_strided
However, I wasn't satisfied with the options above. Below is a specialized function that takes an n-by-2 DataFrame and returns a Series of the rolling correlations. DISCLAIMER: this uses some advanced techniques and should really only be used if you understand what it does. Meaning, if you need a detailed breakdown of how this works... then it probably isn't for you.
from numpy.lib.stride_tricks import as_strided as strided

def rolling_correlation(a, w):
    # build a (2, w, n - w + 1) strided view: one length-w window per
    # output position for each of the two columns, without copying
    n, m = a.shape[0], 2
    s1, s2 = a.strides
    b = strided(a, (m, w, n - w + 1), (s2, s1, s1))
    b_mb = b - b.mean(1, keepdims=True)   # de-mean each window
    b_ss = (b_mb ** 2).sum(1) ** .5       # sqrt of sum of squared deviations per window
    return (b_mb[0] * b_mb[1]).sum(0) / (b_ss[0] * b_ss[1])

def rolling_correlation_df(df, w):
    a = df.values
    return pd.Series(rolling_correlation(a, w), df.index[w - 1:])
rolling_correlation_df(df, 5)
4 0.399056
5 0.399056
6 0.684653
7 0.696074
8 0.841136
9 0.762187
dtype: float64
Timing
small data
%timeit rolling_correlation_df(df, 5)
10000 loops, best of 3: 79.9 µs per loop
%timeit df.rolling(5).corr().stack().xs(['c1', 'c2'], level=[1, 2])
100 loops, best of 3: 14.6 ms per loop
large data
np.random.seed([3,1415])
df = pd.DataFrame(np.random.randint(10, size=(10000, 2)), columns=['c1', 'c2'])
%timeit rolling_correlation_df(df, 5)
1000 loops, best of 3: 615 µs per loop
%timeit df.rolling(5).corr().stack().xs(['c1', 'c2'], level=[1, 2])
1 loop, best of 3: 1.98 s per loop
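For reference, recent pandas can also compute the pairwise rolling correlation of one column against another directly, which returns a flat Series and avoids the MultiIndex handling; a minimal sketch:
# rolling correlation of c1 against c2 over a 5-row window
out = df['c1'].rolling(5).corr(df['c2'])
print(out.dropna())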

How to check if a particular cell in pandas DataFrame isnull?

I have the following df in pandas.
0 A B C
1 2 NaN 8
How can I check if df.iloc[1]['B'] is NaN?
I tried using df.isnan() and I get a table like this:
0 A B C
1 false true false
but I am not sure how to index the table, or whether this is an efficient way of performing the job at all.
Use pd.isnull; for the selection, use loc or iloc:
print (df)
0 A B C
0 1 2 NaN 8
print (df.loc[0, 'B'])
nan
a = pd.isnull(df.loc[0, 'B'])
print (a)
True
print (df['B'].iloc[0])
nan
a = pd.isnull(df['B'].iloc[0])
print (a)
True
jezrael's response is spot on. If you are only concerned with whether there is any NaN value at all, I was exploring whether there's a faster option, since in my experience summing flat arrays is (strangely) faster than counting. This code seems faster:
df.isnull().values.any()
For example:
In [2]: df = pd.DataFrame(np.random.randn(1000,1000))
In [3]: df[df > 0.9] = np.nan
In [4]: %timeit df.isnull().any().any()
100 loops, best of 3: 14.7 ms per loop
In [5]: %timeit df.isnull().values.sum()
100 loops, best of 3: 2.15 ms per loop
In [6]: %timeit df.isnull().sum().sum()
100 loops, best of 3: 18 ms per loop
In [7]: %timeit df.isnull().values.any()
1000 loops, best of 3: 948 µs per loop
If you are looking for the indexes of NaN values in a specific column, you can use
list(df['B'].index[df['B'].apply(np.isnan)])
In case you want to get the indexes of all NaN values in the DataFrame, you can do the following:
row_col_indexes = list(map(list, np.where(np.isnan(np.array(df)))))
indexes = []
for i in zip(row_col_indexes[0], row_col_indexes[1]):
    indexes.append(list(i))
And if you are looking for a one-liner, you can use:
list(zip(*[x for x in list(map(list, np.where(np.isnan(np.array(df)))))]))
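For reference, a shorter equivalent of the loop above is np.argwhere on the boolean mask (a sketch; like the version above, it returns positional indices, not index labels):
# (row position, column position) pairs for every NaN in the DataFrame
nan_positions = np.argwhere(df.isnull().values)
print(nan_positions.tolist())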
