cogroup-like operation for pandas - python

I was trying to use pandas to analyze a fairly large data set (~5GB). I wanted to divide the data set into groups, then perform a Cartesian product on each group, and then aggregate the result.
The apply operation of pandas is quite expressive: I could first group, then do the Cartesian product on each group using apply, and then aggregate the result using sum. The problem with this approach, however, is that apply is not lazy; it computes all the intermediate results before the aggregation, and the intermediate results (the Cartesian product of each group) are very large.
I was looking at Apache Spark and found one very interesting operator called cogroup. The definition is here:
When called on datasets of type (K, V) and (K, W), returns a dataset of (K, (Iterable&lt;V&gt;, Iterable&lt;W&gt;)) tuples. This operation is also called groupWith.
This seems to be exactly what I want. If I could first cogroup and then do a sum, the intermediate results wouldn't be expanded (assuming cogroup works in the same lazy fashion as groupby).
Is there an operation similar to cogroup in pandas, or is there another way to achieve my goal efficiently?
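Conceptually, the helper I'm imagining would look roughly like this (just a sketch of the shape I'm after, not something I have working against my real data): group both sides by the key and hand back the paired groups, so each pair can be aggregated immediately instead of materializing a joined frame.
def cogroup(left, right, key):
    """Sketch: yield (key, left_group, right_group) without building any joined frame."""
    lg, rg = left.groupby(key), right.groupby(key)
    for k in lg.groups.keys() | rg.groups.keys():
        l = lg.get_group(k) if k in lg.groups else left.iloc[0:0]
        r = rg.get_group(k) if k in rg.groups else right.iloc[0:0]
        yield k, l, r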
Here is my example:
I want to group the data by id, do a Cartesian product within each group, and then group by cluster_x and cluster_y and aggregate count_x and count_y using sum. The following code works, but it is extremely slow and consumes too much memory.
# add dummy_key to do Cartesian product by merge
df['dummy_key'] = 1

def join_group(g):
    return pandas.merge(g, g, on='dummy_key')\
        [['cluster_x', 'count_x', 'cluster_y', 'count_y']]

df_count_stats = df.groupby(['id'], as_index=True).apply(join_group).\
    groupby(['cluster_x', 'cluster_y'], as_index=False)\
    [['count_x', 'count_y']].sum()
A toy data set
id cluster count
0 i1 A 2
1 i1 B 3
2 i2 A 1
3 i2 B 4
Intermediate result after the apply (can be large)
cluster_x count_x cluster_y count_y
id
i1 0 A 2 A 2
1 A 2 B 3
2 B 3 A 2
3 B 3 B 3
i2 0 A 1 A 1
1 A 1 B 4
2 B 4 A 1
3 B 4 B 4
The desired final result
cluster_x cluster_y count_x count_y
0 A A 3 3
1 A B 3 7
2 B A 7 3
3 B B 7 7

My first attempt failed, sort of: while I was able to limit the memory use (by summing over the Cartesian product within each group), it was considerably slower than the original. But for your particular desired output, I think we can simplify the problem considerably:
import numpy as np, pandas as pd

def fake_data(nids, nclusters, ntile):
    ids = ["i{}".format(i) for i in range(1, nids + 1)]
    clusters = ["A{}".format(i) for i in range(nclusters)]
    df = pd.DataFrame(index=pd.MultiIndex.from_product([ids, clusters], names=["id", "cluster"]))
    df = df.reset_index()
    df = pd.concat([df] * ntile)
    df["count"] = np.random.randint(0, 10, size=len(df))
    return df

def join_group(g):
    m = pd.merge(g, g, on='dummy_key')
    return m[['cluster_x', 'count_x', 'cluster_y', 'count_y']]

def old_method(df):
    df["dummy_key"] = 1
    h1 = df.groupby(['id'], as_index=True).apply(join_group)
    h2 = h1.groupby(['cluster_x', 'cluster_y'], as_index=False)
    h3 = h2[['count_x', 'count_y']].sum()
    return h3

def new_method1(df):
    m1 = df.groupby("cluster", as_index=False)["count"].sum()
    m1["dummy_key"] = 1
    m2 = m1.merge(m1, on="dummy_key")
    m2 = m2.sort_index(axis=1).drop(["dummy_key"], axis=1)
    return m2
which gives (with df as your toy frame):
>>> new_method1(df)
cluster_x cluster_y count_x count_y
0 A A 3 3
1 A B 3 7
2 B A 7 3
3 B B 7 7
>>> df2 = fake_data(100, 100, 1)
>>> %timeit old_method(df2)
1 loops, best of 3: 954 ms per loop
>>> %timeit new_method1(df2)
100 loops, best of 3: 8.58 ms per loop
>>> (old_method(df2) == new_method1(df2)).all().all()
True
and even
>>> df2 = fake_data(100, 100, 100)
>>> %timeit new_method1(df2)
10 loops, best of 3: 88.8 ms per loop
Whether this will be enough of an improvement to handle your actual case, I'm not sure.
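For reference, here is roughly what that first, memory-bounded attempt looked like (a sketch, assuming the toy frame's column names): each group's Cartesian product is aggregated immediately, so only one group's product is ever held in memory, but the per-group merges are what make it slow.
def per_group_method(df):
    # sketch: sum over each group's Cartesian product right away instead of
    # materializing the full product for all groups at once
    pieces = []
    for _, g in df.groupby("id"):
        g = g.assign(dummy_key=1)
        prod = g.merge(g, on="dummy_key")
        pieces.append(prod.groupby(["cluster_x", "cluster_y"], as_index=False)
                          [["count_x", "count_y"]].sum())
    return (pd.concat(pieces)
              .groupby(["cluster_x", "cluster_y"], as_index=False)
              [["count_x", "count_y"]].sum())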

Related

Pandas alternative to apply - to create new column based on multiple columns

I have a Pandas dataframe and I would like to add a new column based on the values of the other columns. A minimal example illustrating my use case is below.
df = pd.DataFrame([[4,5,19],[1,2,0],[2,5,9],[8,2,5]], columns=['a','b','c'])
df
a b c
---------------
0 4 5 19
1 1 2 0
2 2 5 9
3 8 2 5
x = df.sample(n=2)
x
a b c
---------------
3 8 2 5
1 1 2 0
def get_new(row):
    a, b, c = row
    return random.choice(df[(df['a'] != a) & (df['b'] == b) & (df['c'] != c)]['c'].values)
y = x.apply(lambda row: get_new(row), axis=1)
x['new'] = y
x
a b c new
--------------------
3 8 2 5 0
1 1 2 0 5
Note: The original dataframe has ~4 million rows and ~6 columns. The number of rows in the sample might vary between 50 and 500. I am running on a 64-bit machine with 8 GB RAM.
The above works, except that it is quite slow (takes about 15 seconds for me). I also tried using x.itertuples() instead of apply and there is not much of an improvement in this case.
It seems that apply (with axis=1) is slow since it does not make use of vectorized operations. Is there some way I could achieve this faster?
Can the filtering (in the get_new function) be modified or made more efficient compared to the conditional boolean masks I currently use?
Can I in some way use numpy here for some speedup?
Edit: df.sample() is also quite slow and I cannot use .iloc or .loc since I am further modifying the sample and do not wish for this to affect the original dataframe.
I see a reasonable performance improvement by using .loc rather than chained indexing:
import random, pandas as pd, numpy as np

df = pd.DataFrame([[4,5,19],[1,2,0],[2,5,9],[8,2,5]], columns=['a','b','c'])
df = pd.concat([df]*1000000)
x = df.sample(n=2)

def get_new(row):
    a, b, c = row
    return random.choice(df[(df['a'] != a) & (df['b'] == b) & (df['c'] != c)]['c'].values)

def get_new2(row):
    a, b, c = row
    return random.choice(df.loc[(df['a'] != a) & (df['b'] == b) & (df['c'] != c), 'c'].values)

%timeit x.apply(lambda row: get_new(row), axis=1)   # 159ms
%timeit x.apply(lambda row: get_new2(row), axis=1)  # 119ms
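If that is still not fast enough, the question's last point (using numpy) is worth a try. A sketch, continuing from the df and x above and not benchmarked here: pull the columns out as plain arrays once, then build the boolean mask on those arrays instead of on the DataFrame.
# pre-extract the columns as numpy arrays (done once, outside the per-row function)
a_vals, b_vals, c_vals = df['a'].values, df['b'].values, df['c'].values

def get_new3(row):
    a, b, c = row
    # boolean mask built from the raw arrays, avoiding DataFrame indexing overhead
    candidates = c_vals[(a_vals != a) & (b_vals == b) & (c_vals != c)]
    return random.choice(candidates)

x['new'] = x.apply(lambda row: get_new3(row), axis=1)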

How do I sort a dataframe by an array not in the dataframe

I've answered this question several times in different contexts and I realized that there isn't a good canonical approach specified anywhere.
So, to set up a simple problem:
Problem
df = pd.DataFrame(dict(A=range(6), B=[1, 2] * 3))
print(df)
A B
0 0 1
1 1 2
2 2 1
3 3 2
4 4 1
5 5 2
Question:
How do I sort by the product of columns 'A' and 'B'?
Here is an approach where I add a temporary column to the dataframe, use it with sort_values, then drop it.
df.assign(P=df.prod(1)).sort_values('P').drop('P', 1)
A B
0 0 1
1 1 2
2 2 1
4 4 1
3 3 2
5 5 2
Is there a better, more concise, clearer, more consistent approach?
TL;DR
iloc + argsort
We can approach this using iloc where we can take an array of ordinal positions and return the dataframe reordered by these positions.
With the power of iloc, we can sort with any array that specifies the order.
Now, all we need to do is identify a method for getting this ordering. Turns out there is a method called argsort which does exactly this. By passing the results of argsort to iloc, we can get our dataframe sorted out.
Example 1
Using the specified problem above
df.iloc[df.prod(1).argsort()]
Same results as above
A B
0 0 1
1 1 2
2 2 1
4 4 1
3 3 2
5 5 2
That was for simplicity. If performance is an issue, we could take this further and drop down to numpy:
v = df.values
a = v.prod(1).argsort()
pd.DataFrame(v[a], df.index[a], df.columns)
How fast are these solutions?
We can see that pd_ext_sort is the most concise but does not scale as well as the others.
np_ext_sort gives the best performance at the expense of transparency. Though, I'd argue that it's still very clear what is going on.
backtest setup
from timeit import timeit

def add_drop():
    return df.assign(P=df.prod(1)).sort_values('P').drop('P', 1)

def pd_ext_sort():
    return df.iloc[df.prod(1).argsort()]

def np_ext_sort():
    v = df.values
    a = v.prod(1).argsort()
    return pd.DataFrame(v[a], df.index[a], df.columns)

results = pd.DataFrame(
    index=pd.Index([10, 100, 1000, 10000], name='Size'),
    columns=pd.Index(['add_drop', 'pd_ext_sort', 'np_ext_sort'], name='method')
)

for i in results.index:
    df = pd.DataFrame(np.random.rand(i, 2), columns=['A', 'B'])
    for j in results.columns:
        stmt = '{}()'.format(j)
        setup = 'from __main__ import df, {}'.format(j)
        results.set_value(i, j, timeit(stmt, setup, number=100))

results.plot()
Example 2
Suppose I have a column of negative and positive values. I want to sort by increasing magnitude... however, I want the negatives to come first.
Suppose I have dataframe df
df = pd.DataFrame(dict(A=range(-2, 3)))
print(df)
A
0 -2
1 -1
2 0
3 1
4 2
I'll set up 3 versions again. This time I'll use np.lexsort which returns the same type of array as argsort. Meaning, I can use it to reorder the dataframe.
Caveat: np.lexsort sorts by the last array in its list first. \shrug
def add_drop():
    return df.assign(P=df.A >= 0, M=df.A.abs()).sort_values(['P', 'M']).drop(['P', 'M'], 1)

def pd_ext_sort():
    v = df.A.values
    return df.iloc[np.lexsort([np.abs(v), v >= 0])]

def np_ext_sort():
    v = df.A.values
    a = np.lexsort([np.abs(v), v >= 0])
    return pd.DataFrame(v[a, None], df.index[a], df.columns)
All of which return
A
1 -1
0 -2
2 0
3 1
4 2
How fast this time?
In this example, both pd_ext_sort and np_ext_sort outperformed add_drop.
backtest setup
results = pd.DataFrame(
    index=pd.Index([10, 100, 1000, 10000], name='Size'),
    columns=pd.Index(['add_drop', 'pd_ext_sort', 'np_ext_sort'], name='method')
)

for i in results.index:
    df = pd.DataFrame(np.random.randn(i, 1), columns=['A'])
    for j in results.columns:
        stmt = '{}()'.format(j)
        setup = 'from __main__ import df, {}'.format(j)
        results.set_value(i, j, timeit(stmt, setup, number=100))

results.plot(figsize=(15, 6))

Fastest way to compare rows of two pandas dataframes?

So I have two pandas dataframes, A and B.
A is 1000 rows x 500 columns, filled with binary values indicating either presence or absence.
B is 1024 rows x 10 columns, and is a full enumeration of 0's and 1's, hence the 1024 rows.
I am trying to find which rows in A, at a particular 10 columns of A, correspond with a given row in B. I need the whole row to match up, rather than element by element.
For example, I would want
A[(A.ix[:,(1,2,3,4,5,6,7,8,9,10)] == (1,0,1,0,1,0,0,1,0,0)).all(axis=1)]
to return something indicating that rows (3,5,8,11,15) of A match that row (1,0,1,0,1,0,0,1,0,0) of B at those particular columns (1,2,3,4,5,6,7,8,9,10).
And I want to do this over every row in B.
The best way I could figure out to do this was:
import numpy as np

for i in B.values:
    B_array = np.array(i)
    Matching_Rows = A[(A.ix[:,(1,2,3,4,5,6,7,8,9,10)] == B_array).all(axis=1)]
    Matching_Rows_Index = Matching_Rows.index
This isn't terrible for one instance, but I use it in a while loop that runs around 20,000 times; therefore, it slows it down quite a bit.
I have been messing around with DataFrame.apply to no avail. Could map work better?
I was just hoping someone saw something obviously more efficient as I am fairly new to python.
Thanks and best regards!
We can abuse the fact that both dataframes hold only binary values (0 or 1) by collapsing the relevant columns of A and all columns of B into 1D arrays, treating each row as a binary number and converting it to its decimal equivalent. This reduces the problem considerably, which helps with performance. Once we have those 1D arrays, we can use np.in1d to look for matches from B in A, and finally np.where on the result to get the matching indices.
Thus, we would have an implementation like so -
# Setup 1D arrays corresponding to selected cols from A and entire B
S = 2**np.arange(10)
A_ID = np.dot(A[range(1,11)],S)
B_ID = np.dot(B,S)
# Look for matches that exist from B_ID in A_ID, whose indices
# would be desired row indices that have matched from B
out_row_idx = np.where(np.in1d(A_ID,B_ID))[0]
Sample run -
In [157]: # Setup dataframes A and B with rows 0, 4 in A having matches from B
...: A_arr = np.random.randint(0,2,(10,14))
...: B_arr = np.random.randint(0,2,(7,10))
...:
...: B_arr[2] = A_arr[4,1:11]
...: B_arr[4] = A_arr[4,1:11]
...: B_arr[5] = A_arr[0,1:11]
...:
...: A = pd.DataFrame(A_arr)
...: B = pd.DataFrame(B_arr)
...:
In [158]: S = 2**np.arange(10)
...: A_ID = np.dot(A[range(1,11)],S)
...: B_ID = np.dot(B,S)
...: out_row_idx = np.where(np.in1d(A_ID,B_ID))[0]
...:
In [159]: out_row_idx
Out[159]: array([0, 4])
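If we also need to know which row of B produced each match, the decimal IDs can be mapped back (a sketch; it assumes each pattern occurs at most once in B, which holds when B enumerates all 1024 patterns):
# map each decimal ID back to its row position in B, then pair it with the matched A rows
id_to_brow = {bid: i for i, bid in enumerate(B_ID)}
matched_pairs = [(a_row, id_to_brow[A_ID[a_row]]) for a_row in out_row_idx]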
You can use merge with reset_index; the output gives the indexes of rows in B that equal rows of A on the chosen columns:
A = pd.DataFrame({'A':[1,0,1,1],
                  'B':[0,0,1,1],
                  'C':[1,0,1,1],
                  'D':[1,1,1,0],
                  'E':[1,1,0,1]})
print (A)
A B C D E
0 1 0 1 1 1
1 0 0 0 1 1
2 1 1 1 1 0
3 1 1 1 0 1
B = pd.DataFrame({'0':[1,0,1],
                  '1':[1,0,1],
                  '2':[1,0,0]})
print (B)
0 1 2
0 1 1 1
1 0 0 0
2 1 1 0
print (pd.merge(B.reset_index(),
                A.reset_index(),
                left_on=B.columns.tolist(),
                right_on=A.columns[[0,1,2]].tolist(),
                suffixes=('_B','_A')))
index_B 0 1 2 index_A A B C D E
0 0 1 1 1 2 1 1 1 1 0
1 0 1 1 1 3 1 1 1 0 1
2 1 0 0 0 1 0 0 0 1 1
print (pd.merge(B.reset_index(),
                A.reset_index(),
                left_on=B.columns.tolist(),
                right_on=A.columns[[0,1,2]].tolist(),
                suffixes=('_B','_A'))[['index_B','index_A']])
index_B index_A
0 0 2
1 0 3
2 1 1
You can do it in pandas by using loc or ix and telling it to find the rows where the ten columns are all equal. Like this:
A.loc[(A[1]==B[1]) & (A[2]==B[2]) & (A[3]==B[3]) & (A[4]==B[4]) & (A[5]==B[5]) & (A[6]==B[6]) & (A[7]==B[7]) & (A[8]==B[8]) & (A[9]==B[9]) & (A[10]==B[10])]
This is quite ugly in my opinion but it will work and gets rid of the loop so it should be significantly faster. I wouldn't be surprised if someone could come up with a more elegant way of coding the same operation.
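If you'd rather avoid the long chain of conditions, a more compact sketch of the same "all ten columns equal" check (here B's first row stands in for whichever row you are matching against) is to broadcast that single row against the ten columns of A at once:
cols = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
b_row = B.iloc[0].values                            # the one row of B you are matching against
matches = A[(A[cols].values == b_row).all(axis=1)]  # rows of A equal to b_row on those columns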
In this special case, your rows of 10 zeros and ones can be interpreted as 10-digit binary numbers. If B is in order, then it can be interpreted as the range from 0 to 1023. In that case, all we need to do is take A's rows in 10-column chunks and calculate the decimal equivalent of each chunk.
I'll start by defining a range of powers of two so I can do matrix multiplication with it.
twos = pd.Series(np.power(2, np.arange(10)))
Next, I'll relabel A's columns into a MultiIndex and stack to get my chunks of 10.
A = pd.DataFrame(np.random.binomial(1, .5, (1000, 500)))
A.columns = pd.MultiIndex.from_tuples(list(zip((A.columns // 10).tolist(), (A.columns % 10).tolist())))
A_ = A.stack(0)
A_.head()
Finally, I'll multiply A_ with twos to get integer representation of each row and unstack.
A_.dot(twos).unstack()
This is now a 1000 x 50 DataFrame where each cell represents which of B's rows we matched for that particular 10 column chunk for that particular row of A. There isn't even a need for B.
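To actually read matches off that 1000 x 50 frame — say, to find every (row, chunk) position equal to a given pattern — compare it against that pattern's decimal value (a sketch, continuing from the A_ and twos defined above; the pattern is the one from the question):
codes = A_.dot(twos).unstack()                    # 1000 x 50 frame of decimal codes
pattern = pd.Series([1, 0, 1, 0, 1, 0, 0, 1, 0, 0])
target = pattern.dot(twos)                        # decimal value of that pattern
rows, chunks = np.where(codes.values == target)   # matching (row, chunk) positions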

How do I calculate a pandas column with multiple columns as arguments?

I was using a wind speed calculation function from lon and lat components:
def wind_speed(u, v):
    return np.sqrt(u ** 2 + v ** 2)
and calling it to calculate a new pandas column from two existing ones:
df['wspeed'] = map(wind_speed, df['lonwind'], df['latwind'])
Since I changed from Python 2.7 to Python 3.5 the function is not working anymore. Could the change be the cause?
In a single argument (column) function:
def celsius(T):
    return round(T - 273, 1)
I am now using:
df['temp'] = df['t2m'].map(celsius)
And it works fine.
Could you help me?
If you want to use map, wrap it in list:
df = pd.DataFrame({'lonwind':[1,2,3],
                   'latwind':[4,5,6]})
print (df)
latwind lonwind
0 4 1
1 5 2
2 6 3
def wind_speed(u, v):
    return np.sqrt(u ** 2 + v ** 2)
df['wspeed'] = list(map(wind_speed, df['lonwind'], df['latwind']))
print (df)
latwind lonwind wspeed
0 4 1 4.123106
1 5 2 5.385165
2 6 3 6.708204
Without list:
df['wspeed'] = (map(wind_speed, df['lonwind'], df['latwind']))
print (df)
latwind lonwind wspeed
0 4 1 <map object at 0x000000000AC42DA0>
1 5 2 <map object at 0x000000000AC42DA0>
2 6 3 <map object at 0x000000000AC42DA0>
map(function, iterable, ...)
Return an iterator that applies function to every item of iterable, yielding the results. If additional iterable arguments are passed, function must take that many arguments and is applied to the items from all iterables in parallel. With multiple iterables, the iterator stops when the shortest iterable is exhausted. For cases where the function inputs are already arranged into argument tuples, see itertools.starmap().
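For completeness, the itertools.starmap form mentioned at the end of that quote would look like this (a sketch, equivalent to the list(map(...)) call above, reusing the same wind_speed and df):
from itertools import starmap

# zip the two columns into (u, v) argument tuples, then apply wind_speed to each tuple
df['wspeed'] = list(starmap(wind_speed, zip(df['lonwind'], df['latwind'])))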
Another solution:
df['wspeed'] = (df['lonwind'] ** 2 + df['latwind'] ** 2) **0.5
print (df)
latwind lonwind wspeed
0 4 1 4.123106
1 5 2 5.385165
2 6 3 6.708204
I would try to stick to existing numpy/scipy functions as they are extremely fast and optimized (numpy.hypot):
df['wspeed'] = np.hypot(df.latwind, df.lonwind)
Timing against a 300K-row DF:
In [47]: df = pd.concat([df] * 10**5, ignore_index=True)
In [48]: df.shape
Out[48]: (300000, 2)
In [49]: %paste
def wind_speed(u, v):
    return np.sqrt(u ** 2 + v ** 2)
## -- End pasted text --
In [50]: %timeit list(map(wind_speed, df['lonwind'], df['latwind']))
1 loop, best of 3: 922 ms per loop
In [51]: %timeit np.hypot(df.latwind, df.lonwind)
100 loops, best of 3: 4.08 ms per loop
Conclusion: vectorized approach was 230 times faster
If you have to write your own, try to use vectorized math (working with vectors/columns instead of scalars):
def wind_speed(u, v):
    # using vectorized approach - column math instead of scalars
    return np.sqrt(u * u + v * v)
df['wspeed'] = wind_speed(df['lonwind'] , df['latwind'])
demo:
In [39]: df['wspeed'] = wind_speed(df['lonwind'] , df['latwind'])
In [40]: df
Out[40]:
latwind lonwind wspeed
0 4 1 4.123106
1 5 2 5.385165
2 6 3 6.708204
same vectorized approach with celsius() function:
def celsius(T):
    # using vectorized function: np.round()
    return np.round(T - 273, 1)
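Since this version is built from vectorized numpy operations, it can be applied to the whole column directly, with no map needed (assuming the same df and column names as in the question):
df['temp'] = celsius(df['t2m'])   # vectorized: operates on the whole column at once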

Efficient way of doing permutations with pandas over a large DataFrame

Currently I have a pandas DataFrame like this:
ID A1 A2 A3 B1 B2 B3
Ku8QhfS0n_hIOABXuE 6.343 6.304 6.410 6.287 6.403 6.279
fqPEquJRRlSVSfL.8A 6.752 6.681 6.680 6.677 6.525 6.739
ckiehnugOno9d7vf1Q 6.297 6.248 6.524 6.382 6.316 6.453
x57Vw5B5Fbt5JUnQkI 6.268 6.451 6.379 6.371 6.458 6.333
This DataFrame is used with a statistic which then requires a permutation test (EDIT: to be precise, random permutation). The indices of each column need to be shuffled (sampled) 100 times. To give an idea of the size, the number of rows can be around 50,000.
EDIT: The permutation is along the rows, i.e. shuffle the index for each column.
The biggest issue here is one of performance. I want to permute things in a fast way.
An example I had in mind was:
import random
import joblib
def permutation(dataframe):
    return dataframe.apply(random.sample, axis=1, k=len(dataframe))

permute = joblib.delayed(permutation)
pool = joblib.Parallel(n_jobs=-2)  # all cores minus 1
result = pool(permute(dataframe) for item in range(100))
The issue here is that by doing this, the test is not stable: apparently the permutation works, but it is not as "random" as it would be without being done in parallel, and thus there's a loss of stability in the results when I use the permuted data in follow-up calculations.
So my only "solution" was to precalculate all indices for all columns prior to running the parallel code, which slows things down considerably.
My questions are:
Is there a more efficient way to do this permutation? (not necessarily parallel)
Is the parallel approach (using multiple processes, not threads) feasible?
EDIT: To make things clearer, here's what should happen for example to column A1 after one shuffling:
Ku8QhfS0n_hIOABXuE 6.268
fqPEquJRRlSVSfL.8A 6.343
ckiehnugOno9d7vf1Q 6.752
x57Vw5B5Fbt5JUnQkI 6.297
(i.e. the row values were moving around).
EDIT2: Here's what I'm using now:
def _generate_indices(indices, columns, nperm):
    random.seed(1234567890)
    num_genes = indices.size
    for item in range(nperm):
        permuted = pandas.DataFrame(
            {column: random.sample(list(indices), num_genes) for column in columns},
            index=range(num_genes)
        )
        yield permuted
(in short, building a DataFrame of resampled indices for each column)
And later on (yes, I know it's pretty ugly):
# Data is the original DataFrame
# Indices is one of the results of that generator
permuted = dict()
for column in data.columns:
    value = data[column]
    permuted[column] = value[indices[column].values].values
permuted_table = pandas.DataFrame(permuted, index=data.index)
How about this:
In [1]: import numpy as np; import pandas as pd
In [2]: df = pd.DataFrame(np.random.randn(50000, 10))
In [3]: def shuffle(df, n):
   ....:     for i in range(n):
   ....:         np.random.shuffle(df.values)
   ....:     return df
In [4]: df.head()
Out[4]:
0 1 2 3 4 5 6 7 8 9
0 0.329588 -0.513814 -1.267923 0.691889 -0.319635 -1.468145 -0.441789 0.004142 -0.362073 -0.555779
1 0.495670 2.460727 1.174324 1.115692 1.214057 -0.843138 0.217075 0.495385 1.568166 0.252299
2 -0.898075 0.994281 -0.281349 -0.104684 -1.686646 0.651502 -1.466679 -1.256705 1.354484 0.626840
3 1.158388 -1.227794 -0.462005 -1.790205 0.399956 -1.631035 -1.707944 -1.126572 -0.892759 1.396455
4 -0.049915 0.006599 -1.099983 0.775028 -0.694906 -1.376802 -0.152225 1.413212 0.050213 -0.209760
In [5]: shuffle(df, 1).head(5)
Out[5]:
0 1 2 3 4 5 6 7 8 9
0 2.044131 0.072214 -0.304449 0.201148 1.462055 0.538476 -0.059249 -0.133299 2.925301 0.529678
1 0.036957 0.214003 -1.042905 -0.029864 1.616543 0.840719 0.104798 -0.766586 -0.723782 -0.088239
2 -0.025621 0.657951 1.132175 -0.815403 0.548210 -0.029291 0.575587 0.032481 -0.261873 0.010381
3 1.396024 0.859455 -1.514801 0.353378 1.790324 0.286164 -0.765518 1.363027 -0.868599 -0.082818
4 -0.026649 -0.090119 -2.289810 -0.701342 -0.116262 -0.674597 -0.580760 -0.895089 -0.663331 0.
In [6]: %timeit shuffle(df, 100)
Out[6]:
1 loops, best of 3: 14.4 s per loop
This does what you need it to. The only question is whether or not it is fast enough.
Update
Per the comments by @Einar I have changed my solution.
In [7]: def shuffle2(df, n):
   ....:     ind = df.index
   ....:     for i in range(n):
   ....:         sampler = np.random.permutation(df.shape[0])
   ....:         new_vals = df.take(sampler).values
   ....:         df = pd.DataFrame(new_vals, index=ind)
   ....:     return df
In [8]: df.head()
Out[8]:
0 1 2 3 4 5 6 7 8 9
0 -0.175006 -0.462306 0.565517 -0.309398 1.100570 0.656627 1.207535 -0.221079 -0.933068 -0.192759
1 0.388165 0.155480 -0.015188 0.868497 1.102662 -0.571818 -0.994005 0.600943 2.205520 -0.294121
2 0.281605 -1.637529 2.238149 0.987409 -1.979691 -0.040130 1.121140 1.190092 -0.118919 0.790367
3 1.054509 0.395444 1.239756 -0.439000 0.146727 -1.705972 0.627053 -0.547096 -0.818094 -0.056983
4 0.209031 -0.233167 -1.900261 -0.678022 -0.064092 -1.562976 -1.516468 0.512461 1.058758 -0.206019
In [9]: shuffle2(df, 1).head()
Out[9]:
0 1 2 3 4 5 6 7 8 9
0 0.054355 0.129432 -0.805284 -1.713622 -0.610555 -0.874039 -0.840880 0.593901 0.182513 -1.981521
1 0.624562 1.097495 -0.428710 -0.133220 0.675428 0.892044 0.752593 -0.702470 0.272386 -0.193440
2 0.763551 -0.505923 0.206675 0.561456 0.441514 -0.743498 -1.462773 -0.061210 -0.435449 -2.677681
3 1.149586 -0.003552 2.496176 -0.089767 0.246546 -1.333184 0.524872 -0.527519 0.492978 -0.829365
4 -1.893188 0.728737 0.361983 -0.188709 -0.809291 2.093554 0.396242 0.402482 1.884082 1.373781
In [10]: timeit shuffle2(df, 100)
1 loops, best of 3: 2.47 s per loop
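If each column actually needs an independent permutation (as in your EDIT2) rather than whole rows moving together, and you also want reproducible draws across parallel runs, a per-column sketch along the same take/permutation lines might look like this (the per-permutation seed is just one way to keep results stable; adjust to taste):
import numpy as np
import pandas as pd

def shuffle_cols(df, seed):
    # permute each column independently and reproducibly, using numpy instead of random.sample
    rng = np.random.RandomState(seed)
    permuted = {col: df[col].values[rng.permutation(len(df))] for col in df.columns}
    return pd.DataFrame(permuted, index=df.index)

# e.g. 100 independent permutations, one seed per draw
permutations = [shuffle_cols(df, seed) for seed in range(100)]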
