Let's say I have a pandas.DataFrame that looks as follows:
c1 | c2
-------
1 | 5
2 | 6
3 | 7
4 | 8
.....
1 | 7
and I'm looking to apply a function (DataFrame.corr), but I would like it to take n rows at a time. The result should be a Series of correlation values that is shorter than the original DataFrame, possibly with a few values that didn't get a full n rows of data.
Is there a way to do this, and how? I've been looking through the DataFrame, map, apply, and filter documentation, but there doesn't seem to be an obvious or clean solution.
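For concreteness, here is a minimal sketch of one reading of the request (chunking into consecutive, non-overlapping groups of n rows is my interpretation, not taken from the answers below); it computes the c1/c2 correlation per chunk, with the last chunk possibly shorter than n:
import numpy as np
import pandas as pd

def chunked_corr(df, n):
    # Label consecutive rows 0, 0, ..., 1, 1, ... and compute corr per block.
    blocks = np.arange(len(df)) // n
    return df.groupby(blocks).apply(lambda g: g['c1'].corr(g['c2']))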
With pandas 0.20, using rolling with corr produces a multi-indexed DataFrame. You can slice afterwards to get at what you're looking for.
Consider the dataframe df
np.random.seed([3,1415])
df = pd.DataFrame(np.random.randint(10, size=(10, 2)), columns=['c1', 'c2'])
c1 c2
0 0 2
1 7 3
2 8 7
3 0 6
4 8 6
5 0 2
6 0 4
7 9 7
8 3 2
9 4 3
rolling + corr... pandas 0.20.x
df.rolling(5).corr().dropna().c1.xs('c2', level=1)
# Or equivalently
# df.rolling(5).corr().stack().xs(['c1', 'c2'], level=[1, 2])
4 0.399056
5 0.399056
6 0.684653
7 0.696074
8 0.841136
9 0.762187
Name: c1, dtype: float64
rolling + corr... pandas 0.19.x or prior
Prior to 0.20, rolling + corr produced a pd.Panel
df.rolling(5).corr().loc[:, 'c1', 'c2'].dropna()
4 0.399056
5 0.399056
6 0.684653
7 0.696074
8 0.841136
9 0.762187
Name: c2, dtype: float64
numpy + as_strided
However, I wasn't satisfied with the approaches above. Below is a specialized function that takes an n x 2 DataFrame and returns a Series of the rolling correlations. Disclaimer: this uses some advanced techniques and should really only be used if you know what it does. Meaning, if you need a detailed breakdown of how it works... then it probably isn't for you.
from numpy.lib.stride_tricks import as_strided as strided
def rolling_correlation(a, w):
    # a is an (n, 2) array; build overlapping length-w windows for each of the
    # two columns as a strided view, without copying the data.
    n, m = a.shape[0], 2
    s1, s2 = a.strides
    b = strided(a, (m, w, n - w + 1), (s2, s1, s1))
    b_mb = b - b.mean(1, keepdims=True)   # de-mean each window
    b_ss = (b_mb ** 2).sum(1) ** .5       # sqrt of each window's sum of squared deviations
    return (b_mb[0] * b_mb[1]).sum(0) / (b_ss[0] * b_ss[1])

def rolling_correlation_df(df, w):
    a = df.values
    return pd.Series(rolling_correlation(a, w), df.index[w-1:])
rolling_correlation_df(df, 5)
4 0.399056
5 0.399056
6 0.684653
7 0.696074
8 0.841136
9 0.762187
dtype: float64
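As a side note, on newer NumPy (1.20+, an assumption beyond what this answer used), the raw as_strided call can be replaced with the safer sliding_window_view helper; a rough sketch of the same computation:
import numpy as np
import pandas as pd
from numpy.lib.stride_tricks import sliding_window_view

def rolling_correlation_df_v2(df, w):
    a = df.values                                # shape (n, 2)
    windows = sliding_window_view(a, w, axis=0)  # shape (n - w + 1, 2, w)
    x = windows[:, 0, :] - windows[:, 0, :].mean(axis=1, keepdims=True)
    y = windows[:, 1, :] - windows[:, 1, :].mean(axis=1, keepdims=True)
    corr = (x * y).sum(axis=1) / np.sqrt((x ** 2).sum(axis=1) * (y ** 2).sum(axis=1))
    return pd.Series(corr, index=df.index[w - 1:])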
Timing
small data
%timeit rolling_correlation_df(df, 5)
10000 loops, best of 3: 79.9 µs per loop
%timeit df.rolling(5).corr().stack().xs(['c1', 'c2'], level=[1, 2])
100 loops, best of 3: 14.6 ms per loop
large data
np.random.seed([3,1415])
df = pd.DataFrame(np.random.randint(10, size=(10000, 2)), columns=['c1', 'c2'])
%timeit rolling_correlation_df(df, 5)
1000 loops, best of 3: 615 µs per loop
%timeit df.rolling(5).corr().stack().xs(['c1', 'c2'], level=[1, 2])
1 loop, best of 3: 1.98 s per loop
I have a dataframe from source data that resembles the following:
In [1]: df = pd.DataFrame({'test_group': [1, 1, 1, 2, 2, 2, 3, 3, 3],
                           'test_type': [np.nan, 'memory', np.nan, np.nan, 'visual', np.nan, np.nan,
                                         'auditory', np.nan]})
Out[1]:
test_group test_type
0 1 NaN
1 1 memory
2 1 NaN
3 2 NaN
4 2 visual
5 2 NaN
6 3 NaN
7 3 auditory
8 3 NaN
test_group represents the grouping of the rows, which represent a test. I need to replace the NaNs in column test_type in each test_group with the value of the row that is not a NaN, e.g. memory, visual, etc.
I've tried a variety of approaches including isolating the "real" value in test_type such as
In [4]: df.groupby('test_group')['test_type'].unique()
Out[4]:
test_group
1 [nan, memory]
2 [nan, visual]
3 [nan, auditory]
Easy enough, I can index into each row and pluck out the value I want. This seems to head in the right direction:
In [6]: df.groupby('test_group')['test_type'].unique().apply(lambda x: x[1])
Out[6]:
test_group
1 memory
2 visual
3 auditory
I tried this among many other things but it doesn't quite work (note: apply and transform give the same result):
In [15]: grp = df.groupby('test_group')
In [16]: df['test_type'] = grp['test_type'].unique().transform(lambda x: x[1])
In [17]: df
Out[17]:
test_group test_type
0 1 NaN
1 1 memory
2 1 visual
3 2 auditory
4 2 NaN
5 2 NaN
6 3 NaN
7 3 NaN
8 3 NaN
I'm sure if I looped it I'd be done with things, but loops are too slow as the data set is millions of records per file.
You can use GroupBy.size to get the size of each group, boolean-index the non-NaN rows using Series.isna, and then use Index.repeat with DataFrame.reindex:
repeats = df.groupby('test_group').size()
out = df[~df['test_type'].isna()]
out.reindex(out.index.repeat(repeats)).reset_index(drop=True)
test_group test_type
0 1 memory
1 1 memory
2 1 memory
3 2 visual
4 2 visual
5 2 visual
6 3 auditory
7 3 auditory
8 3 auditory
timeit analysis:
Benchmarking dataframe:
df = pd.DataFrame({'test_group': [1]*10_001 + [2]*10_001 + [3]*10_001,
                   'test_type': [np.nan]*10_000 + ['memory'] +
                                [np.nan]*10_000 + ['visual'] +
                                [np.nan]*10_000 + ['auditory']})
df.shape
# (30003, 2)
Results:
# Ch3steR's answer
In [54]: %%timeit
...: repeats = df.groupby('test_group').size()
...: out = df[~df['test_type'].isna()]
...: out.reindex(out.index.repeat(repeats)).reset_index(drop=True)
...:
...:
2.56 ms ± 73.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
# timgeb's answer
In [55]: %%timeit
...: df['test_type'] = df.groupby('test_group')['test_type'].fillna(method='ffill').fillna(method='bfill')
...:
...:
10.1 ms ± 724 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Almost 4x faster. I believe that's because boolean indexing is very fast, and reindex + repeat is lightweight compared to the double fillna.
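As a side note, if every group is guaranteed to contain at least one non-NaN test_type, a GroupBy.transform('first') one-liner (not benchmarked above, and a different technique from either answer) fills the gaps as well, since 'first' broadcasts the first non-null value of each group:
df['test_type'] = df.groupby('test_group')['test_type'].transform('first')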
Under the assumption that there's a unique non-NaN value per group, the following should satisfy your request.
>>> df['test_type'] = df.groupby('test_group')['test_type'].ffill().bfill()
>>> df
test_group test_type
0 1 memory
1 1 memory
2 1 memory
3 2 visual
4 2 visual
5 2 visual
6 3 auditory
7 3 auditory
8 3 auditory
edit:
The original answer used
df.groupby('test_group')['test_type'].fillna(method='ffill').fillna(method='bfill')
but according to schwim's timings, ffill/bfill is significantly faster (for some reason).
I have a problem getting the rolling function of pandas to do what I want. For each row, I want to calculate the maximum so far within its group. Here is an example:
df = pd.DataFrame([[1,3], [1,6], [1,3], [2,2], [2,1]], columns=['id', 'value'])
looks like
id value
0 1 3
1 1 6
2 1 3
3 2 2
4 2 1
Now I wish to obtain the following DataFrame:
id value
0 1 3
1 1 6
2 1 6
3 2 2
4 2 2
The problem is that when I do
df.groupby('id')['value'].rolling(1).max()
I get the same DataFrame back. And when I do
df.groupby('id')['value'].rolling(3).max()
I get a DataFrame with Nans. Can someone explain how to properly use rolling or some other Pandas function to obtain the DataFrame I want?
It looks like you need cummax() instead of .rolling(N).max()
In [29]: df['new'] = df.groupby('id').value.cummax()
In [30]: df
Out[30]:
id value new
0 1 3 3
1 1 6 6
2 1 3 6
3 2 2 2
4 2 1 2
Timing (using brand new Pandas version 0.20.1):
In [3]: df = pd.concat([df] * 10**4, ignore_index=True)
In [4]: df.shape
Out[4]: (50000, 2)
In [5]: %timeit df.groupby('id').value.apply(lambda x: x.cummax())
100 loops, best of 3: 15.8 ms per loop
In [6]: %timeit df.groupby('id').value.cummax()
100 loops, best of 3: 4.09 ms per loop
NOTE: from Pandas 0.20.0 what's new
Improved performance of groupby().cummin() and groupby().cummax() (GH15048, GH15109, GH15561, GH15635)
Using apply will be a tiny bit faster:
# Using apply
df['output'] = df.groupby('id').value.apply(lambda x: x.cummax())
%timeit df['output'] = df.groupby('id').value.apply(lambda x: x.cummax())
1000 loops, best of 3: 1.57 ms per loop
Other method:
df['output'] = df.groupby('id').value.cummax()
%timeit df['output'] = df.groupby('id').value.cummax()
1000 loops, best of 3: 1.66 ms per loop
I was trying to use pandas to analyze a fairly large data set (~5 GB). I wanted to divide the data into groups, then perform a Cartesian product on each group, and then aggregate the result.
The apply operation of pandas is quite expressive: I could first group, then do the Cartesian product on each group using apply, and then aggregate the result using sum. The problem with this approach, however, is that apply is not lazy; it computes all the intermediate results before the aggregation, and those intermediate results (the Cartesian product of each group) are very large.
I was looking at Apache Spark and found one very interesting operator called cogroup. The definition is here:
When called on datasets of type (K, V) and (K, W), returns a dataset of (K, (Iterable<V>, Iterable<W>)) tuples. This operation is also called groupWith.
This seems to be exactly what I want. If I could first cogroup and then do a sum, then the intermediate results won't be expanded (assuming cogroup works in the same lazy fashion as group).
Is there operation similar to cogroup in pandas, or how to achieve my goal efficiently?
Here is my example:
I want to group the data by id, and then do a Cartesian product for each group, and then group by cluster_x and cluster_y and aggregate the count_x and count_y using sum. The following code works, but is extremely slow and consumes too much memory.
# add dummy_key to do Cartesian product by merge
df['dummy_key'] = 1

def join_group(g):
    return pandas.merge(g, g, on='dummy_key')\
        [['cluster_x', 'count_x', 'cluster_y', 'count_y']]

df_count_stats = df.groupby(['id'], as_index=True).apply(join_group).\
    groupby(['cluster_x', 'cluster_y'], as_index=False)\
    [['count_x', 'count_y']].sum()
A toy data set
id cluster count
0 i1 A 2
1 i1 B 3
2 i2 A 1
3 i2 B 4
Intermediate result after the apply (can be large)
cluster_x count_x cluster_y count_y
id
i1 0 A 2 A 2
1 A 2 B 3
2 B 3 A 2
3 B 3 B 3
i2 0 A 1 A 1
1 A 1 B 4
2 B 4 A 1
3 B 4 B 4
The desired final result
cluster_x cluster_y count_x count_y
0 A A 3 3
1 A B 3 7
2 B A 7 3
3 B B 7 7
My first attempt failed, sort of: while I was able to limit the memory use (by summing over the Cartesian product within each group), it was considerably slower than the original. But for your particular desired output, I think we can simplify the problem considerably:
import numpy as np, pandas as pd
def fake_data(nids, nclusters, ntile):
    ids = ["i{}".format(i) for i in range(1, nids+1)]
    clusters = ["A{}".format(i) for i in range(nclusters)]
    df = pd.DataFrame(index=pd.MultiIndex.from_product([ids, clusters], names=["id", "cluster"]))
    df = df.reset_index()
    df = pd.concat([df]*ntile)
    df["count"] = np.random.randint(0, 10, size=len(df))
    return df

def join_group(g):
    m = pd.merge(g, g, on='dummy_key')
    return m[['cluster_x', 'count_x', 'cluster_y', 'count_y']]

def old_method(df):
    df["dummy_key"] = 1
    h1 = df.groupby(['id'], as_index=True).apply(join_group)
    h2 = h1.groupby(['cluster_x', 'cluster_y'], as_index=False)
    h3 = h2[['count_x', 'count_y']].sum()
    return h3

def new_method1(df):
    m1 = df.groupby("cluster", as_index=False)["count"].sum()
    m1["dummy_key"] = 1
    m2 = m1.merge(m1, on="dummy_key")
    m2 = m2.sort_index(axis=1).drop(["dummy_key"], axis=1)
    return m2
which gives (with df as your toy frame):
>>> new_method1(df)
cluster_x cluster_y count_x count_y
0 A A 3 3
1 A B 3 7
2 B A 7 3
3 B B 7 7
>>> df2 = fake_data(100, 100, 1)
>>> %timeit old_method(df2)
1 loops, best of 3: 954 ms per loop
>>> %timeit new_method1(df2)
100 loops, best of 3: 8.58 ms per loop
>>> (old_method(df2) == new_method1(df2)).all().all()
True
and even
>>> df2 = fake_data(100, 100, 100)
>>> %timeit new_method1(df2)
10 loops, best of 3: 88.8 ms per loop
Whether this will be enough of an improvement to handle your actual case, I'm not sure.
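For what it's worth, the simplification works because each count_x in the desired output is just the total count of cluster_x (e.g. 3 for A and 7 for B), independent of which cluster it is paired with, so the sums can be taken before the Cartesian product rather than after. On pandas 1.2 or later (newer than anything assumed above), the dummy_key trick in new_method1 could also be replaced by a cross merge; a hedged sketch of the same idea:
def new_method1_cross(df):
    # Aggregate counts per cluster first, then take the Cartesian product of the
    # (much smaller) aggregated frame via how="cross" instead of a dummy key.
    m1 = df.groupby("cluster", as_index=False)["count"].sum()
    m2 = m1.merge(m1, how="cross")
    return m2.sort_index(axis=1)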
I have a multiindex dataframe in pandas, with 4 columns in the index, and some columns of data. An example is below:
import pandas as pd
import numpy as np
cnames = ['K1', 'K2', 'K3', 'K4', 'D1', 'D2']
rdata = pd.DataFrame(np.random.randint(1, 3, size=(8, len(cnames))), columns=cnames)
rdata.set_index(cnames[:4], inplace=True)
rdata.sortlevel(inplace=True)
print(rdata)
D1 D2
K1 K2 K3 K4
1 1 1 1 1 2
1 1 2
2 1 2 1
2 1 2 2 1
2 1 2 1
2 1 2 2 2 1
2 1 2 1 1
2 1 1
[8 rows x 2 columns]
What I want to do is select the rows where there are exactly 2 values at the K3 level. Not 2 rows, but two distinct values. I've found how to generate a sort of mask for what I want:
filterFunc = lambda x: len(set(x.index.get_level_values('K3'))) == 2
mask = rdata.groupby(level=cnames[:2]).apply(filterFunc)
print(mask)
K1 K2
1 1 True
2 True
2 1 False
2 False
dtype: bool
And I'd hoped that since rdata.loc[1, 2] allows you to match on just part of the index, it would be possible to do the same thing with a boolean vector like this. Unfortunately, rdata.loc[mask] fails with IndexingError: Unalignable boolean Series key provided.
This question seemed similar, but the answer given there doesn't work for anything other than the top level index, since index.get_level_values only works on a single level, not multiple ones.
Following the suggestion here I managed to accomplish what I wanted with
rdata[[mask.loc[k1, k2] for k1, k2, k3, k4 in rdata.index]]
however, both getting the count of distinct values using len(set(index.get_level_values(...))) and building the boolean vector afterwards by iterating over every row feel like I'm fighting the framework to achieve what seems like a simple task in a multiindex setup. Is there a better solution?
This is using pandas 0.13.1.
There might be something better, but you could at least bypass defining mask by using groupby-filter:
rdata.groupby(level=cnames[:2]).filter(
    lambda grp: (grp.index.get_level_values('K3')
                 .unique().size) == 2)
Out[83]:
D1 D2
K1 K2 K3 K4
1 1 1 1 1 2
1 1 2
2 1 2 1
2 1 2 2 1
2 1 2 1
[5 rows x 2 columns]
It is faster than my previous suggestions. It does really well for small DataFrames:
In [84]: %timeit rdata.groupby(level=cnames[:2]).filter(lambda grp: grp.index.get_level_values('K3').unique().size == 2)
100 loops, best of 3: 3.84 ms per loop
In [76]: %timeit rdata2.groupby(level=cnames[:2]).filter(lambda grp: grp.groupby(level=['K3']).ngroups == 2)
100 loops, best of 3: 11.9 ms per loop
In [77]: %timeit rdata2.groupby(level=cnames[:2]).filter(lambda grp: len(set(grp.index.get_level_values('K3'))) == 2)
100 loops, best of 3: 13.4 ms per loop
and is still the fastest for large DataFrames, though not by as much:
In [78]: rdata2 = pd.concat([rdata]*100000)
In [85]: %timeit rdata2.groupby(level=cnames[:2]).filter(lambda grp: grp.index.get_level_values('K3').unique().size == 2)
1 loops, best of 3: 756 ms per loop
In [79]: %timeit rdata2.groupby(level=cnames[:2]).filter(lambda grp: grp.groupby(level=['K3']).ngroups == 2)
1 loops, best of 3: 772 ms per loop
In [80]: %timeit rdata2.groupby(level=cnames[:2]).filter(lambda grp: len(set(grp.index.get_level_values('K3'))) == 2)
1 loops, best of 3: 1 s per loop
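If you do want to keep the precomputed mask from the question, one loop-free way to broadcast it back onto the full four-level index (a suggestion for more recent pandas versions, not tested on 0.13.1) is to drop the unused levels from rdata's index and select with the aligned values:
# Align the (K1, K2)-level boolean mask with rdata's rows, then index with it.
aligned = mask.reindex(rdata.index.droplevel(['K3', 'K4']))
result = rdata[aligned.values]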
Currently I have a pandas DataFrame like this:
ID A1 A2 A3 B1 B2 B3
Ku8QhfS0n_hIOABXuE 6.343 6.304 6.410 6.287 6.403 6.279
fqPEquJRRlSVSfL.8A 6.752 6.681 6.680 6.677 6.525 6.739
ckiehnugOno9d7vf1Q 6.297 6.248 6.524 6.382 6.316 6.453
x57Vw5B5Fbt5JUnQkI 6.268 6.451 6.379 6.371 6.458 6.333
This DataFrame is used with a statistic which then requires a permutation test (EDIT: to be precise, random permutation). The indices of each column need to be shuffled (sampled) 100 times. To give an idea of the size, the number of rows can be around 50,000.
EDIT: The permutation is along the rows, i.e. shuffle the index for each column.
The biggest issue here is one of performance. I want to permute things in a fast way.
An example I had in mind was:
import random
import joblib
def permutation(dataframe):
    return dataframe.apply(random.sample, axis=1, k=len(dataframe))

permute = joblib.delayed(permutation)
pool = joblib.Parallel(n_jobs=-2)  # all cores minus 1
result = pool(permute(dataframe) for item in range(100))
The issue here is that by doing this, the test is not stable: apparently the permutation works, but it is not as "random" as it would be without being done in parallel, so there's a loss of stability in the results when I use the permuted data in follow-up calculations.
So my only "solution" was to precalculate all the indices for all columns prior to running the parallel code, which slows things down considerably.
My questions are:
Is there a more efficient way to do this permutation? (not necessarily parallel)
Is the parallel approach (using multiple processes, not threads) feasible?
EDIT: To make things clearer, here's what should happen for example to column A1 after one shuffling:
Ku8QhfS0n_hIOABXuE 6.268
fqPEquJRRlSVSfL.8A 6.343
ckiehnugOno9d7vf1Q 6.752
x57Vw5B5Fbt5JUnQkI 6.297
(i.e. the row values were moving around).
EDIT2: Here's what I'm using now:
def _generate_indices(indices, columns, nperm):
    random.seed(1234567890)
    num_genes = indices.size
    genes = list(indices)  # the row labels to resample from
    for item in range(nperm):
        permuted = pandas.DataFrame(
            {column: random.sample(genes, num_genes) for column in columns},
            index=range(num_genes)
        )
        yield permuted
(in short, building a DataFrame of resampled indices for each column)
And later on (yes, I know it's pretty ugly):
# data is the original DataFrame
# indices is one of the results of that generator
permuted = dict()
for column in data.columns:
    value = data[column]
    permuted[column] = value[indices[column].values].values
permuted_table = pandas.DataFrame(permuted, index=data.index)
How about this:
In [1]: import numpy as np; import pandas as pd
In [2]: df = pd.DataFrame(np.random.randn(50000, 10))
In [3]: def shuffle(df, n):
   ....:     for i in range(n):
   ....:         np.random.shuffle(df.values)
   ....:     return df
In [4]: df.head()
Out[4]:
0 1 2 3 4 5 6 7 8 9
0 0.329588 -0.513814 -1.267923 0.691889 -0.319635 -1.468145 -0.441789 0.004142 -0.362073 -0.555779
1 0.495670 2.460727 1.174324 1.115692 1.214057 -0.843138 0.217075 0.495385 1.568166 0.252299
2 -0.898075 0.994281 -0.281349 -0.104684 -1.686646 0.651502 -1.466679 -1.256705 1.354484 0.626840
3 1.158388 -1.227794 -0.462005 -1.790205 0.399956 -1.631035 -1.707944 -1.126572 -0.892759 1.396455
4 -0.049915 0.006599 -1.099983 0.775028 -0.694906 -1.376802 -0.152225 1.413212 0.050213 -0.209760
In [5]: shuffle(df, 1).head(5)
Out[5]:
0 1 2 3 4 5 6 7 8 9
0 2.044131 0.072214 -0.304449 0.201148 1.462055 0.538476 -0.059249 -0.133299 2.925301 0.529678
1 0.036957 0.214003 -1.042905 -0.029864 1.616543 0.840719 0.104798 -0.766586 -0.723782 -0.088239
2 -0.025621 0.657951 1.132175 -0.815403 0.548210 -0.029291 0.575587 0.032481 -0.261873 0.010381
3 1.396024 0.859455 -1.514801 0.353378 1.790324 0.286164 -0.765518 1.363027 -0.868599 -0.082818
4 -0.026649 -0.090119 -2.289810 -0.701342 -0.116262 -0.674597 -0.580760 -0.895089 -0.663331 0.
In [6]: %timeit shuffle(df, 100)
1 loops, best of 3: 14.4 s per loop
This does what you need it to. The only question is whether or not it is fast enough.
Update
Per the comments by #Einar I have changed my solution.
In [7]: def shuffle2(df, n):
   ....:     ind = df.index
   ....:     for i in range(n):
   ....:         sampler = np.random.permutation(df.shape[0])
   ....:         new_vals = df.take(sampler).values
   ....:         df = pd.DataFrame(new_vals, index=ind)
   ....:     return df
In [8]: df.head()
Out[8]:
0 1 2 3 4 5 6 7 8 9
0 -0.175006 -0.462306 0.565517 -0.309398 1.100570 0.656627 1.207535 -0.221079 -0.933068 -0.192759
1 0.388165 0.155480 -0.015188 0.868497 1.102662 -0.571818 -0.994005 0.600943 2.205520 -0.294121
2 0.281605 -1.637529 2.238149 0.987409 -1.979691 -0.040130 1.121140 1.190092 -0.118919 0.790367
3 1.054509 0.395444 1.239756 -0.439000 0.146727 -1.705972 0.627053 -0.547096 -0.818094 -0.056983
4 0.209031 -0.233167 -1.900261 -0.678022 -0.064092 -1.562976 -1.516468 0.512461 1.058758 -0.206019
In [9]: shuffle2(df, 1).head()
Out[9]:
0 1 2 3 4 5 6 7 8 9
0 0.054355 0.129432 -0.805284 -1.713622 -0.610555 -0.874039 -0.840880 0.593901 0.182513 -1.981521
1 0.624562 1.097495 -0.428710 -0.133220 0.675428 0.892044 0.752593 -0.702470 0.272386 -0.193440
2 0.763551 -0.505923 0.206675 0.561456 0.441514 -0.743498 -1.462773 -0.061210 -0.435449 -2.677681
3 1.149586 -0.003552 2.496176 -0.089767 0.246546 -1.333184 0.524872 -0.527519 0.492978 -0.829365
4 -1.893188 0.728737 0.361983 -0.188709 -0.809291 2.093554 0.396242 0.402482 1.884082 1.373781
In [10]: %timeit shuffle2(df, 100)
1 loops, best of 3: 2.47 s per loop
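As a final note on the reproducibility concern raised in the question: one way to keep parallel runs stable (a sketch under the assumption that each permutation may run in a separate worker process; it is not part of the answer above) is to give every permutation its own explicit seed, so the results are identical whether the jobs run serially or in parallel:
import numpy as np
import pandas as pd

def shuffle_once(df, seed):
    # Each call builds its own seeded generator, so the output depends only on
    # the seed, not on which process runs it or in what order.
    rng = np.random.RandomState(seed)
    sampler = rng.permutation(df.shape[0])
    return pd.DataFrame(df.values[sampler], index=df.index, columns=df.columns)

results = [shuffle_once(df, seed) for seed in range(100)]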