I was using a function that calculates wind speed from its lon and lat components:
def wind_speed(u, v):
    return np.sqrt(u ** 2 + v ** 2)
and calling it to calculate a new pandas column from two existing ones:
df['wspeed'] = map(wind_speed, df['lonwind'], df['latwind'])
Since I switched from Python 2.7 to Python 3.5, the function no longer works. Could the change of Python version be the cause?
With a single-argument (single-column) function:
def celsius(T):
    return round(T - 273, 1)
I am now using:
df['temp'] = df['t2m'].map(celsius)
And it works fine.
Could you help me?
If you want to use map, wrap it in list:
df = pd.DataFrame({'lonwind': [1, 2, 3],
                   'latwind': [4, 5, 6]})
print (df)
latwind lonwind
0 4 1
1 5 2
2 6 3
def wind_speed(u, v):
    return np.sqrt(u ** 2 + v ** 2)
df['wspeed'] = list(map(wind_speed, df['lonwind'], df['latwind']))
print (df)
latwind lonwind wspeed
0 4 1 4.123106
1 5 2 5.385165
2 6 3 6.708204
Without list:
df['wspeed'] = (map(wind_speed, df['lonwind'], df['latwind']))
print (df)
latwind lonwind wspeed
0 4 1 <map object at 0x000000000AC42DA0>
1 5 2 <map object at 0x000000000AC42DA0>
2 6 3 <map object at 0x000000000AC42DA0>
map(function, iterable, ...)
Return an iterator that applies function to every item of iterable, yielding the results. If additional iterable arguments are passed, function must take that many arguments and is applied to the items from all iterables in parallel. With multiple iterables, the iterator stops when the shortest iterable is exhausted. For cases where the function inputs are already arranged into argument tuples, see itertools.starmap().
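To make the Python 3 behaviour concrete, here is a small illustrative sketch showing the lazy map object and the itertools.starmap variant mentioned in the docs:
import itertools
import numpy as np

def wind_speed(u, v):
    return np.sqrt(u ** 2 + v ** 2)

it = map(wind_speed, [1, 2, 3], [4, 5, 6])
print(it)          # <map object at 0x...> -- nothing computed yet
print(list(it))    # [4.123..., 5.385..., 6.708...]

# itertools.starmap is the equivalent when the arguments are already paired:
pairs = [(1, 4), (2, 5), (3, 6)]
print(list(itertools.starmap(wind_speed, pairs)))   # same three values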
Another solution:
df['wspeed'] = (df['lonwind'] ** 2 + df['latwind'] ** 2) ** 0.5
print (df)
latwind lonwind wspeed
0 4 1 4.123106
1 5 2 5.385165
2 6 3 6.708204
I would try to stick to existing numpy/scipy functions as they are extremely fast and optimized (numpy.hypot):
df['wspeed'] = np.hypot(df.latwind, df.lonwind)
Timing against a 300K-row DataFrame:
In [47]: df = pd.concat([df] * 10**5, ignore_index=True)
In [48]: df.shape
Out[48]: (300000, 2)
In [49]: %paste
def wind_speed(u, v):
    return np.sqrt(u ** 2 + v ** 2)
## -- End pasted text --
In [50]: %timeit list(map(wind_speed, df['lonwind'], df['latwind']))
1 loop, best of 3: 922 ms per loop
In [51]: %timeit np.hypot(df.latwind, df.lonwind)
100 loops, best of 3: 4.08 ms per loop
Conclusion: the vectorized approach was roughly 225 times faster.
If you have to write your own function, try to use vectorized math (working with whole columns / vectors instead of scalars):
def wind_speed(u, v):
    # using vectorized approach - column math instead of scalar
    return np.sqrt(u * u + v * v)

df['wspeed'] = wind_speed(df['lonwind'], df['latwind'])
demo:
In [39]: df['wspeed'] = wind_speed(df['lonwind'] , df['latwind'])
In [40]: df
Out[40]:
latwind lonwind wspeed
0 4 1 4.123106
1 5 2 5.385165
2 6 3 6.708204
The same vectorized approach with the celsius() function:
def celsius(T):
    # using a vectorized function: np.round()
    return np.round(T - 273, 1)
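With the vectorized version you can then skip .map() entirely and pass the whole column (a sketch reusing the 't2m' column from the question):
df['temp'] = celsius(df['t2m'])   # instead of df['t2m'].map(celsius)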
Related
There are many methods for creating new columns in Pandas (I may have missed some in my examples, so please let me know if there are others and I will include them here) and I wanted to figure out when it is best to use each method. Obviously some methods are better in certain situations than others, but I want to evaluate them holistically, looking at efficiency, readability, and usefulness.
I'm primarily concerned with the first three, but I included the other ways simply to show that it's possible with different approaches. Here's a sample dataframe:
df = pd.DataFrame({'a':[1,2,3],'b':[4,5,6]})
The most commonly known way is to name a new column, such as df['c'], and use apply:
df['c'] = df['a'].apply(lambda x: x * 2)
df
a b c
0 1 4 2
1 2 5 4
2 3 6 6
Using assign can accomplish the same thing:
df = df.assign(c = lambda x: x['a'] * 2)
df
a b c
0 1 4 2
1 2 5 4
2 3 6 6
Updated via #roganjosh:
df['c'] = df['a'] * 2
df
a b c
0 1 4 2
1 2 5 4
2 3 6 6
Using map (definitely not as efficient as apply):
df['c'] = df['a'].map(lambda x: x * 2)
df
a b c
0 1 4 2
1 2 5 4
2 3 6 6
Creating a new pd.Series and then using concat to bring it into the dataframe:
c = pd.Series(df['a'] * 2).rename("c")
df = pd.concat([df,c], axis = 1)
df
a b c
0 1 4 2
1 2 5 4
2 3 6 6
Using join:
df.join(c)
a b c
0 1 4 2
1 2 5 4
2 3 6 6
Short answer: vectorized calls (df['c'] = 2 * df['a']) almost always win on both speed and readability. See this answer regarding what you can use as a "hierarchy" of options when it comes to performance.
In general, if you have a for i in ... or a lambda present somewhere in a Pandas operation, this (sometimes) means that the resulting calculations call Python code rather than the optimized C code that Pandas' Cython library relies on for vectorized operations. (Same goes for operations that rely on NumPy ufuncs for the underlying .values.)
As for .assign(), it is correctly pointed out in the comments that this creates a copy, whereas you can view df['c'] = 2 * df['a'] as the equivalent of setting a dictionary key/value. The former also takes twice as long, although this is perhaps a bit apples-to-oranges because one operation is returning a DataFrame while the other is just assigning a column.
>>> %timeit df.assign(c=df['a'] * 2)
498 µs ± 15.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
>>> %timeit -r 7 -n 1000 df['c'] = df['a'] * 2
239 µs ± 22.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
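A quick way to see the copy behaviour for yourself (a small sketch, not a benchmark):
df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})

out = df.assign(c=df['a'] * 2)
print(out is df)           # False -- .assign() returns a new DataFrame
print('c' in df.columns)   # False -- the original frame is untouched

df['c'] = df['a'] * 2      # item assignment modifies df in place
print('c' in df.columns)   # True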
As for .map(): generally you see this when, as the name implies, you want to provide a mapping for a Series (though it can be passed a function, as in your question). That doesn't mean it's not performant, it just tends to be used as a specialized method in cases that I've seen:
>>> df['a'].map(dict(enumerate('xyz', 1)))
0 x
1 y
2 z
Name: a, dtype: object
And as for .apply(): to inject a bit of opinion into the answer, I would argue it's more idiomatic to use vectorization where possible. You can see in the code for the module where .apply() is defined: because you are passing a lambda, not a NumPy ufunc, what ultimately gets called is technically a Cython function, map_infer, but it is still performing whatever function you passed on each individual member of the Series df['a'], one at a time.
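If you want to check this on your own data, a minimal comparison could look like the following sketch (no timings quoted, since they depend on the size of the frame):
df = pd.DataFrame({'a': range(100000), 'b': range(100000)})

%timeit df['a'].apply(lambda x: x * 2)   # Python-level call per element
%timeit 2 * df['a']                      # vectorized column arithmetic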
A succinct way would be:
df['c'] = 2 * df['a']
No need to compute the new column elementwise.
Why are you using a lambda function?
You can achieve the above-mentioned task simply with:
df['c'] = 2 * df['a']
This adds no extra overhead.
I've answered this question several times in different contexts, and I realized that there isn't a good canonical approach specified anywhere.
So, to set up a simple problem:
Problem
df = pd.DataFrame(dict(A=range(6), B=[1, 2] * 3))
print(df)
A B
0 0 1
1 1 2
2 2 1
3 3 2
4 4 1
5 5 2
Question:
How do I sort by the product of columns 'A' and 'B'?
Here is an approach where I add a temporary column to the dataframe, use it with sort_values, then drop it.
df.assign(P=df.prod(1)).sort_values('P').drop('P', 1)
A B
0 0 1
1 1 2
2 2 1
4 4 1
3 3 2
5 5 2
Is there a better, more concise, clearer, more consistent approach?
TL;DR
iloc + argsort
We can approach this using iloc where we can take an array of ordinal positions and return the dataframe reordered by these positions.
With the power of iloc, we can sort with any array that specifies the order.
Now, all we need to do is identify a method for getting this ordering. Turns out there is a method called argsort which does exactly this. By passing the results of argsort to iloc, we can get our dataframe sorted out.
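For instance (a tiny sketch), argsort on a small Series returns the positional order that would sort the values, which is exactly what iloc expects:
s = pd.Series([3, 1, 2])
print(s.argsort().tolist())           # [1, 2, 0] -- take row 1 first, then 2, then 0
print(s.iloc[s.argsort()].tolist())   # [1, 2, 3]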
Example 1
Using the specified problem above
df.iloc[df.prod(1).argsort()]
Same results as above
A B
0 0 1
1 1 2
2 2 1
4 4 1
3 3 2
5 5 2
That was for simplicity. We could take this further if performance is an issue and work directly with NumPy:
v = df.values
a = v.prod(1).argsort()
pd.DataFrame(v[a], df.index[a], df.columns)
How fast are these solutions?
We can see that pd_ext_sort is the most concise but does not scale as well as the others.
np_ext_sort gives the best performance at the expense of transparency. Though, I'd argue that it's still very clear what is going on.
backtest setup
def add_drop():
    return df.assign(P=df.prod(1)).sort_values('P').drop('P', 1)

def pd_ext_sort():
    return df.iloc[df.prod(1).argsort()]

def np_ext_sort():
    v = df.values
    a = v.prod(1).argsort()
    return pd.DataFrame(v[a], df.index[a], df.columns)
results = pd.DataFrame(
    index=pd.Index([10, 100, 1000, 10000], name='Size'),
    columns=pd.Index(['add_drop', 'pd_ext_sort', 'np_ext_sort'], name='method')
)

for i in results.index:
    df = pd.DataFrame(np.random.rand(i, 2), columns=['A', 'B'])
    for j in results.columns:
        stmt = '{}()'.format(j)
        setup = 'from __main__ import df, {}'.format(j)
        results.set_value(i, j, timeit(stmt, setup, number=100))

results.plot()
Example 2
Suppose I have a column of negative and positive values. I want to sort by increasing magnitude... however, I want the negatives to come first.
Suppose I have dataframe df
df = pd.DataFrame(dict(A=range(-2, 3)))
print(df)
A
0 -2
1 -1
2 0
3 1
4 2
I'll set up 3 versions again. This time I'll use np.lexsort, which returns the same type of array as argsort, meaning I can use it to reorder the dataframe.
Caveat: np.lexsort sorts by the last array in its list first. \shrug
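A tiny illustration of that key order (the last key in the list is the primary sort key):
primary = np.array([1, 0, 1, 0])     # sorted first, because it is last in the list
tiebreak = np.array([9, 8, 7, 6])    # used within equal primary values
print(np.lexsort([tiebreak, primary]))   # [3 1 2 0]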
def add_drop():
    return df.assign(P=df.A >= 0, M=df.A.abs()).sort_values(['P', 'M']).drop(['P', 'M'], 1)

def pd_ext_sort():
    v = df.A.values
    return df.iloc[np.lexsort([np.abs(v), v >= 0])]

def np_ext_sort():
    v = df.A.values
    a = np.lexsort([np.abs(v), v >= 0])
    return pd.DataFrame(v[a, None], df.index[a], df.columns)
All of which return
A
1 -1
0 -2
2 0
3 1
4 2
How fast this time?
In this example, both pd_ext_sort and np_ext_sort outperformed add_drop.
backtest setup
results = pd.DataFrame(
    index=pd.Index([10, 100, 1000, 10000], name='Size'),
    columns=pd.Index(['add_drop', 'pd_ext_sort', 'np_ext_sort'], name='method')
)

for i in results.index:
    df = pd.DataFrame(np.random.randn(i, 1), columns=['A'])
    for j in results.columns:
        stmt = '{}()'.format(j)
        setup = 'from __main__ import df, {}'.format(j)
        results.set_value(i, j, timeit(stmt, setup, number=100))

results.plot(figsize=(15, 6))
I'm trying to compute the Hamming distance between all strings in a column of a large dataframe. I have over 100,000 rows in this column, so all pairwise combinations amount to roughly 10x10^9 comparisons. These strings are short DNA sequences. I would like to quickly convert every string in the column to a list of integers, where a unique integer represents each character in the string. E.g.
"ACGTACA" -> [0, 1, 2, 3, 1, 2, 1]
then I use scipy.spatial.distance.pdist to quickly and efficiently compute the hamming distance between all of these. Is there a fast way to do this in Pandas?
I have tried using apply but it is pretty slow:
mapping = {"A":0, "C":1, "G":2, "T":3}
df.apply(lambda x: np.array([mapping[char] for char in x]))
get_dummies and other categorical operations don't apply here because they operate at the per-row level, not within a row.
Since Hamming distance doesn't care about magnitude differences, I can get about a 40-60% speedup just replacing df.apply(lambda x: np.array([mapping[char] for char in x])) with df.apply(lambda x: map(ord, x)) on made-up datasets.
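For reference, a Python 3 flavoured sketch of the two variants mirroring the question's df.apply call (the bare map above is Python 2; in Python 3 you would materialize it, e.g. with a comprehension):
mapping = {"A": 0, "C": 1, "G": 2, "T": 3}

# dictionary lookup, as in the question
df.apply(lambda x: np.array([mapping[char] for char in x]))

# ord-based variant: the integer values differ, but Hamming distance
# only cares about equality, so that is fine
df.apply(lambda x: np.array([ord(char) for char in x]))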
Create your test data
In [39]: pd.options.display.max_rows=12
In [40]: N = 100000
In [41]: chars = np.array(list('ABCDEF'))
In [42]: s = pd.Series(np.random.choice(chars, size=4 * np.prod(N)).view('S4'))
In [45]: s
Out[45]:
0 BEBC
1 BEEC
2 FEFA
3 BBDA
4 CCBB
5 CABE
...
99994 EEBC
99995 FFBD
99996 ACFB
99997 FDBE
99998 BDAB
99999 CCFD
dtype: object
These don't actually have to be the same length the way we are doing it.
In [43]: maxlen = s.str.len().max()
In [44]: result = pd.concat([ s.str[i].astype('category',categories=chars).cat.codes for i in range(maxlen) ], axis=1)
In [47]: result
Out[47]:
0 1 2 3
0 1 4 1 2
1 1 4 4 2
2 5 4 5 0
3 1 1 3 0
4 2 2 1 1
5 2 0 1 4
... .. .. .. ..
99994 4 4 1 2
99995 5 5 1 3
99996 0 2 5 1
99997 5 3 1 4
99998 1 3 0 1
99999 2 2 5 3
[100000 rows x 4 columns]
So you get a factorization according to the same categories (i.e. the codes are meaningful).
And it's pretty fast:
In [46]: %timeit pd.concat([ s.str[i].astype('category',categories=chars).cat.codes for i in range(maxlen) ], axis=1)
10 loops, best of 3: 118 ms per loop
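One way to convince yourself that the codes are meaningful: indexing chars with the code matrix reconstructs the original letters (a quick sanity check using the objects defined above):
chars[result.values[:3]]
# array([['B', 'E', 'B', 'C'],
#        ['B', 'E', 'E', 'C'],
#        ['F', 'E', 'F', 'A']], ...)   -- matches the first rows of s above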
I didn't test the performance of this, but you could also try something like:
atest = "ACGTACA"
alist = atest.replace('A', '3.').replace('C', '2.').replace('G', '1.').replace('T', '0.').split('.')
anumlist = [int(x) for x in alist if x.isdigit()]
results in:
[3, 2, 1, 0, 3, 2, 3]
Edit: Ok, so testing it with atest = "ACTACA"*100000 takes a while :/
Maybe not the best idea...
Edit 5:
Another improvement:
import datetime
import numpy as np

class Test(object):
    def __init__(self):
        self.mapping = {'A': 0, 'C': 1, 'G': 2, 'T': 3}

    def char2num(self, astring):
        return [self.mapping[c] for c in astring]

def main():
    now = datetime.datetime.now()
    atest = "AGTCAGTCATG" * 10000000
    t = Test()
    alist = t.char2num(atest)
    testme = np.array(alist)
    print(testme, len(testme))
    print(datetime.datetime.now() - now)

if __name__ == "__main__":
    main()
Takes about 16 seconds for 110,000,000 characters and keeps your processor busy instead of your RAM:
[0 2 3 ..., 0 3 2] 110000000
0:00:16.866659
There doesn't seem to be much difference between using ord or a dictionary-based lookup that exactly maps A->0, C->1 etc:
import pandas as pd
import numpy as np
bases = ['A', 'C', 'T', 'G']
rowlen = 4
nrows = 1000000
dna = pd.Series(np.random.choice(bases, nrows * rowlen).view('S%i' % rowlen))
lookup = dict(zip(bases, range(4)))
%timeit dna.apply(lambda row: map(lookup.get, row))
# 1 loops, best of 3: 785 ms per loop
%timeit dna.apply(lambda row: map(ord, row))
# 1 loops, best of 3: 713 ms per loop
Jeff's solution is also not far off in terms of performance:
%timeit pd.concat([dna.str[i].astype('category', categories=bases).cat.codes for i in range(rowlen)], axis=1)
# 1 loops, best of 3: 1.03 s per loop
A major advantage of this approach over mapping the rows to lists of ints is that the categories can then be viewed as a single (nrows, rowlen) uint8 array via the .values attribute, which could then be passed directly to pdist.
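A sketch of that last step, reusing dna, bases and rowlen from the snippet above (scipy's 'hamming' metric returns the fraction of differing positions, so multiply by rowlen to get a count; note that the condensed result for n rows has n*(n-1)/2 entries, which is enormous at 100,000 rows):
from scipy.spatial.distance import pdist

codes = pd.concat([dna.str[i].astype('category', categories=bases).cat.codes
                   for i in range(rowlen)], axis=1).values

d = pdist(codes, metric='hamming') * rowlen   # pairwise Hamming distances as counts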
I am trying to use pandas to analyze a fairly large data set (~5 GB). I want to divide the data set into groups, perform a Cartesian product on each group, and then aggregate the result.
The apply operation in pandas is quite expressive: I could first group, then do the Cartesian product on each group using apply, and then aggregate the result using sum. The problem with this approach, however, is that apply is not lazy; it computes all the intermediate results before the aggregation, and the intermediate results (the Cartesian product of each group) are very large.
I was looking at Apache Spark and found one very interesting operator called cogroup. The definition is here:
When called on datasets of type (K, V) and (K, W), returns a dataset of (K, (Iterable&lt;V&gt;, Iterable&lt;W&gt;)) tuples. This operation is also called groupWith.
This seems to be exactly what I want. If I could first cogroup and then do a sum, the intermediate results wouldn't be expanded (assuming cogroup works in the same lazy fashion as group).
Is there an operation similar to cogroup in pandas, or is there another way to achieve my goal efficiently?
Here is my example:
I want to group the data by id, and then do a Cartesian product for each group, and then group by cluster_x and cluster_y and aggregate the count_x and count_y using sum. The following code works, but is extremely slow and consumes too much memory.
# add dummy_key to do Cartesian product by merge
df['dummy_key'] = 1

def join_group(g):
    return pandas.merge(g, g, on='dummy_key')\
        [['cache_cluster_x', 'count_x', 'cache_cluster_y', 'count_y']]

df_count_stats = df.groupby(['id'], as_index=True).apply(join_group).\
    groupby(['cache_cluster_x', 'cache_cluster_y'], as_index=False)\
    [['count_x', 'count_y']].sum()
A toy data set
id cluster count
0 i1 A 2
1 i1 B 3
2 i2 A 1
3 i2 B 4
Intermediate result after the apply (can be large)
cluster_x count_x cluster_y count_y
id
i1 0 A 2 A 2
1 A 2 B 3
2 B 3 A 2
3 B 3 B 3
i2 0 A 1 A 1
1 A 1 B 4
2 B 4 A 1
3 B 4 B 4
The desired final result
cluster_x cluster_y count_x count_y
0 A A 3 3
1 A B 3 7
2 B A 7 3
3 B B 7 7
My first attempt failed, sort of: while I was able to limit the memory use (by summing over the Cartesian product within each group), it was considerably slower than the original. But for your particular desired output, I think we can simplify the problem considerably:
import numpy as np, pandas as pd
def fake_data(nids, nclusters, ntile):
    ids = ["i{}".format(i) for i in range(1, nids + 1)]
    clusters = ["A{}".format(i) for i in range(nclusters)]
    df = pd.DataFrame(index=pd.MultiIndex.from_product([ids, clusters], names=["id", "cluster"]))
    df = df.reset_index()
    df = pd.concat([df] * ntile)
    df["count"] = np.random.randint(0, 10, size=len(df))
    return df

def join_group(g):
    m = pd.merge(g, g, on='dummy_key')
    return m[['cluster_x', 'count_x', 'cluster_y', 'count_y']]

def old_method(df):
    df["dummy_key"] = 1
    h1 = df.groupby(['id'], as_index=True).apply(join_group)
    h2 = h1.groupby(['cluster_x', 'cluster_y'], as_index=False)
    h3 = h2[['count_x', 'count_y']].sum()
    return h3

def new_method1(df):
    m1 = df.groupby("cluster", as_index=False)["count"].sum()
    m1["dummy_key"] = 1
    m2 = m1.merge(m1, on="dummy_key")
    m2 = m2.sort_index(axis=1).drop(["dummy_key"], axis=1)
    return m2
which gives (with df as your toy frame):
>>> new_method1(df)
cluster_x cluster_y count_x count_y
0 A A 3 3
1 A B 3 7
2 B A 7 3
3 B B 7 7
>>> df2 = fake_data(100, 100, 1)
>>> %timeit old_method(df2)
1 loops, best of 3: 954 ms per loop
>>> %timeit new_method1(df2)
100 loops, best of 3: 8.58 ms per loop
>>> (old_method(df2) == new_method1(df2)).all().all()
True
and even
>>> df2 = fake_data(100, 100, 100)
>>> %timeit new_method1(df2)
10 loops, best of 3: 88.8 ms per loop
Whether this will be enough of an improvement to handle your actual case, I'm not sure.
Currently I have a pandas DataFrame like this:
ID A1 A2 A3 B1 B2 B3
Ku8QhfS0n_hIOABXuE 6.343 6.304 6.410 6.287 6.403 6.279
fqPEquJRRlSVSfL.8A 6.752 6.681 6.680 6.677 6.525 6.739
ckiehnugOno9d7vf1Q 6.297 6.248 6.524 6.382 6.316 6.453
x57Vw5B5Fbt5JUnQkI 6.268 6.451 6.379 6.371 6.458 6.333
This DataFrame is used with a statistic which then requires a permutation test (EDIT: to be precise, random permutation). The indices of each column need to be shuffled (sampled) 100 times. To give an idea of the size, the number of rows can be around 50,000.
EDIT: The permutation is along the rows, i.e. shuffle the index for each column.
The biggest issue here is one of performance. I want to permute things in a fast way.
An example I had in mind was:
import random
import joblib
def permutation(dataframe):
    return dataframe.apply(random.sample, axis=1, k=len(dataframe))

permute = joblib.delayed(permutation)
pool = joblib.Parallel(n_jobs=-2)  # all cores minus 1
result = pool(permute(dataframe) for item in range(100))
The issue here is that by doing this, the test is not stable: apparently the permutation works, but it is not as "random" as it would without being done in parallel, and thus there's a loss of stability in the results when I use the permuted data in follow-up calculations.
So my only "solution" was to precalculate all indices for all columns prior to running the parallel code, which slows things down considerably.
My questions are:
Is there a more efficient way to do this permutation? (not necessarily parallel)
Is the parallel approach (using multiple processes, not threads) feasible?
EDIT: To make things clearer, here's what should happen for example to column A1 after one shuffling:
Ku8QhfS0n_hIOABXuE 6.268
fqPEquJRRlSVSfL.8A 6.343
ckiehnugOno9d7vf1Q 6.752
x57Vw5B5Fbt5JUnQkI 6.297
(i.e. the row values were moving around).
EDIT2: Here's what I'm using now:
def _generate_indices(indices, columns, nperm):
    random.seed(1234567890)
    num_genes = indices.size
    for item in range(nperm):
        permuted = pandas.DataFrame(
            {column: random.sample(list(indices), num_genes) for column in columns},
            index=range(num_genes)
        )
        yield permuted
(in short, building a DataFrame of resampled indices for each column)
And later on (yes, I know it's pretty ugly):
# data is the original DataFrame
# indices is one of the results of that generator
permuted = dict()
for column in data.columns:
    value = data[column]
    permuted[column] = value[indices[column].values].values
permuted_table = pandas.DataFrame(permuted, index=data.index)
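The same per-column resampling can be written more compactly with NumPy (a sketch; the seeded RandomState mirrors the random.seed call above, and the helper name is just illustrative):
import numpy as np
import pandas

rng = np.random.RandomState(1234567890)

def permute_columns(data, rng):
    # draw an independent permutation of the row positions for every column
    permuted = {col: data[col].values[rng.permutation(len(data))]
                for col in data.columns}
    return pandas.DataFrame(permuted, index=data.index)

permuted_table = permute_columns(data, rng)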
How about this:
In [1]: import numpy as np; import pandas as pd
In [2]: df = pd.DataFrame(np.random.randn(50000, 10))
In [3]: def shuffle(df, n):
   ....:     for i in range(n):
   ....:         np.random.shuffle(df.values)
   ....:     return df
In [4]: df.head()
Out[4]:
0 1 2 3 4 5 6 7 8 9
0 0.329588 -0.513814 -1.267923 0.691889 -0.319635 -1.468145 -0.441789 0.004142 -0.362073 -0.555779
1 0.495670 2.460727 1.174324 1.115692 1.214057 -0.843138 0.217075 0.495385 1.568166 0.252299
2 -0.898075 0.994281 -0.281349 -0.104684 -1.686646 0.651502 -1.466679 -1.256705 1.354484 0.626840
3 1.158388 -1.227794 -0.462005 -1.790205 0.399956 -1.631035 -1.707944 -1.126572 -0.892759 1.396455
4 -0.049915 0.006599 -1.099983 0.775028 -0.694906 -1.376802 -0.152225 1.413212 0.050213 -0.209760
In [5]: shuffle(df, 1).head(5)
Out[5]:
0 1 2 3 4 5 6 7 8 9
0 2.044131 0.072214 -0.304449 0.201148 1.462055 0.538476 -0.059249 -0.133299 2.925301 0.529678
1 0.036957 0.214003 -1.042905 -0.029864 1.616543 0.840719 0.104798 -0.766586 -0.723782 -0.088239
2 -0.025621 0.657951 1.132175 -0.815403 0.548210 -0.029291 0.575587 0.032481 -0.261873 0.010381
3 1.396024 0.859455 -1.514801 0.353378 1.790324 0.286164 -0.765518 1.363027 -0.868599 -0.082818
4 -0.026649 -0.090119 -2.289810 -0.701342 -0.116262 -0.674597 -0.580760 -0.895089 -0.663331 0.
In [6]: %timeit shuffle(df, 100)
Out[6]:
1 loops, best of 3: 14.4 s per loop
This does what you need it to. The only question is whether or not it is fast enough.
Update
Per the comments by #Einar I have changed my solution.
In [7]: def shuffle2(df, n):
   ....:     ind = df.index
   ....:     for i in range(n):
   ....:         sampler = np.random.permutation(df.shape[0])
   ....:         new_vals = df.take(sampler).values
   ....:         df = pd.DataFrame(new_vals, index=ind)
   ....:     return df
In [8]: df.head()
Out[8]:
0 1 2 3 4 5 6 7 8 9
0 -0.175006 -0.462306 0.565517 -0.309398 1.100570 0.656627 1.207535 -0.221079 -0.933068 -0.192759
1 0.388165 0.155480 -0.015188 0.868497 1.102662 -0.571818 -0.994005 0.600943 2.205520 -0.294121
2 0.281605 -1.637529 2.238149 0.987409 -1.979691 -0.040130 1.121140 1.190092 -0.118919 0.790367
3 1.054509 0.395444 1.239756 -0.439000 0.146727 -1.705972 0.627053 -0.547096 -0.818094 -0.056983
4 0.209031 -0.233167 -1.900261 -0.678022 -0.064092 -1.562976 -1.516468 0.512461 1.058758 -0.206019
In [9]: shuffle2(df, 1).head()
Out[9]:
0 1 2 3 4 5 6 7 8 9
0 0.054355 0.129432 -0.805284 -1.713622 -0.610555 -0.874039 -0.840880 0.593901 0.182513 -1.981521
1 0.624562 1.097495 -0.428710 -0.133220 0.675428 0.892044 0.752593 -0.702470 0.272386 -0.193440
2 0.763551 -0.505923 0.206675 0.561456 0.441514 -0.743498 -1.462773 -0.061210 -0.435449 -2.677681
3 1.149586 -0.003552 2.496176 -0.089767 0.246546 -1.333184 0.524872 -0.527519 0.492978 -0.829365
4 -1.893188 0.728737 0.361983 -0.188709 -0.809291 2.093554 0.396242 0.402482 1.884082 1.373781
In [10]: timeit shuffle2(df, 100)
1 loops, best of 3: 2.47 s per loop