avoid repetitive operations with Pandas - python

Derived from another question (linked here).
I have a 2-million-row DataFrame, something similar to this:
import pandas as pd

final_df = pd.DataFrame.from_dict({
    'ts': [0, 1, 2, 3, 4, 5],
    'speed': [5, 4, 1, 4, 1, 4],
    'temp': [9, 8, 7, 8, 7, 8],
    'temp2': [2, 2, 7, 2, 7, 2],
})
I need to run calculations with the values on each row and append the results as new columns, something similar to the question in this link.
I know that there are a lot of combinations of speed, temp, and temp2 that are repeated; if I drop_duplicates, the resulting DataFrame is only 50k rows long, which takes significantly less time to process using an apply function like this:
# k1 and k2 are constants (0.5 and 1 in the EDIT below)
def dafunc(row):
    row['r1'] = row['speed'] * row['temp'] * k1
    row['r2'] = row['speed'] * row['temp2'] * k2
    return row

nodup_df = final_df.drop_duplicates(['speed', 'temp', 'temp2'])
nodup_df = nodup_df.apply(dafunc, axis=1)
The above code is a very simplified version of what I actually do.
So far I've been trying to use a dictionary where I store the results, keyed by a string formed from each combination; if the dictionary already has those results, I get them instead of doing the calculations again.
Is there a more efficient way to do this using Pandas' vectorized operations?
EDIT:
In the end, the resulting DataFrame should look like this:
# assuming k1 = 0.5, k2 = 1
resulting_df = pd.DataFrame.from_dict({
    'ts': [0, 1, 2, 3, 4, 5],
    'speed': [5, 4, 1, 4, 1, 4],
    'temp': [9, 8, 7, 8, 7, 8],
    'temp2': [2, 2, 7, 2, 7, 2],
    'r1': [22.5, 16, 3.5, 16, 3.5, 16],
    'r2': [10, 8, 7, 8, 7, 8],
})

Well, if you access the columns from a NumPy array based on the column index it will be a lot faster, i.e.:
final_df['r1'] = final_df.values[:,0]*final_df.values[:,1]*k1
final_df['r2'] = final_df.values[:,0]*final_df.values[:,2]*k2
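Note that positional indexing into .values depends on the column order of the frame (in the output below the columns happen to be ordered speed, temp, temp2, ts). A sketch that selects by column name instead, so it does not depend on that order; the constants are the ones assumed in the question's EDIT:
k1, k2 = 0.5, 1  # assumed from the question's EDIT

# select columns by name rather than by position
final_df['r1'] = final_df['speed'].to_numpy() * final_df['temp'].to_numpy() * k1
final_df['r2'] = final_df['speed'].to_numpy() * final_df['temp2'].to_numpy() * k2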
If you want to create multiple columns at once you can use a for loop for that, and the speed will be similar:
k = [0.5, 1]
for i in range(1, 3):
    final_df['r' + str(i)] = final_df.values[:, 0] * final_df.values[:, i] * k[i - 1]
If you drop duplicates it will be much faster.
Output:
speed temp temp2 ts r1 r2
0 5 9 2 0 22.5 10.0
1 4 8 2 1 16.0 8.0
2 1 7 7 2 3.5 7.0
3 4 8 2 3 16.0 8.0
4 1 7 7 4 3.5 7.0
5 4 8 2 5 16.0 8.0
For a small dataframe:
%%timeit
final_df['r1'] = final_df.values[:,0]*final_df.values[:,1]*k1
final_df['r2'] = final_df.values[:,0]*final_df.values[:,2]*k2
1000 loops, best of 3: 708 µs per loop
For a large dataframe:
%%timeit
ndf = pd.concat([final_df]*10000)
ndf['r1'] = ndf.values[:,0]*ndf.values[:,1]*k1
ndf['r2'] = ndf.values[:,0]*ndf.values[:,2]*k2
1 loop, best of 3: 6.19 ms per loop
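To actually exploit the repeated combinations mentioned in the question, one sketch (not from the answer above) is to run the vectorized calculation on the deduplicated frame and then merge the results back onto the full 2-million-row frame on the key columns; the constants are the ones assumed in the question's EDIT:
k1, k2 = 0.5, 1  # assumed constants

keys = ['speed', 'temp', 'temp2']
nodup_df = final_df[keys].drop_duplicates().copy()

# vectorized calculations on the ~50k unique combinations only
nodup_df['r1'] = nodup_df['speed'] * nodup_df['temp'] * k1
nodup_df['r2'] = nodup_df['speed'] * nodup_df['temp2'] * k2

# broadcast the results back to every original row
final_df = final_df.merge(nodup_df, on=keys, how='left')
Whether this beats the plain vectorized version depends on how expensive the real per-row calculation is.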

Related

I need to create a dataframe where values reference previous rows

I am just starting to use Python and I'm trying to learn some general things about it. As I was playing around with it, I wanted to see if I could make a DataFrame that shows a starting number which is compounded by a return. Sorry if this description doesn't make much sense, but I basically want a DataFrame x rows long that shows me:
number*(return)^(row number) in each row
So, for example, say the number is 10 and the return is 10%; I would like the DataFrame to give me the series
1 11
2 12.1
3 13.3
4 14.6
5 ...
6 ...
Thanks so much in advance!
Let us try
import numpy as np
import pandas as pd

val = 10
det = 0.1
n = 4
out = val * ((1 + det) ** np.arange(n))
s = pd.Series(out)
s
Out[426]:
0 10.00
1 11.00
2 12.10
3 13.31
dtype: float64
Notice here I am using the index from 0, since 1.1**0 will yield the original value.
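If you want the series to start at the first compounded value (11, as in the question's example), a small variation shifting both the exponents and the index to start at 1:
s = pd.Series(val * (1 + det) ** np.arange(1, n + 1), index=range(1, n + 1))
# 1    11.000
# 2    12.100
# 3    13.310
# 4    14.641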
I think this does what you want:
df = pd.DataFrame({'returns': [x for x in range(1, 10)]})
df.index = df.index + 1
df.returns = df.returns.apply(lambda x: (10 * (1.1**x)))
print(df)
Out:
returns
1 11.000000
2 12.100000
3 13.310000
4 14.641000
5 16.105100
6 17.715610
7 19.487171
8 21.435888
9 23.579477
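As a side note, the apply with a lambda can be replaced by a fully vectorized expression over the exponents; a sketch that should produce the same values:
import numpy as np
import pandas as pd

idx = np.arange(1, 10)
df = pd.DataFrame({'returns': 10 * 1.1 ** idx}, index=idx)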

Aggregating a dataframe: weighted averages for numerical columns and concatenating as string for other types

I have a dataframe which looks like this:
import numpy as np
import pandas as pd

np.random.seed(11)
df = pd.DataFrame({
    'item_id': np.random.randint(1, 4, 10),
    'item_type': [chr(x) for x in np.random.randint(65, 80, 10)],
    'value1': np.round(np.random.rand(10) * 30, 1),
    'value2': np.round(np.random.randn(10) * 30, 1),
    'n': np.random.randint(100, size=10)
})
item_id item_type value1 value2 n
0 2 A 26.8 -39.2 59
1 1 N 25.7 -33.6 1
2 2 A 5.0 22.1 3
3 2 N 19.0 47.2 8
4 1 M 0.6 -0.9 87
5 2 N 3.5 -20.5 81
6 3 E 9.5 32.9 68
7 1 C 4.7 -9.3 72
8 2 M 22.8 21.8 32
9 1 B 24.5 46.5 78
I would like to transform this dataframe to have a single row for each item_id.
The columns should be aggregated by finding the weighted average of value1 and value2 (weighted by n), and combining categorical variable item_type if it is not unique. The end result looks like this:
item_type value1 value2
item_id
1 B/C/M/N 9.778571 11.955882
2 A/M/N 15.089071 -15.474317
3 E 9.500000 32.900000
What I have tried
This can be done with a custom function and using apply, like this one:
def func(x):
    record = ['/'.join(sorted(x.item_type.unique()))]
    total_rows = x.n.sum()
    for c in ['value1', 'value2']:
        record.append((x[c] * x.n / total_rows).sum())
    return pd.Series(record, index=['item_type', 'value1', 'value2'])
%%timeit
df.groupby('item_id').apply(func)
6.95 ms ± 30.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
This is for 10 records. I have a dataframe with over 40 million records. I am searching for the most efficient way to do this before I start thinking of going parallel. All other operations I've done on this dataframe take less than a minute, but this one is slow.
Any ideas appreciated!
Not sure, but it may be worth a try. We can do a similar operation on a window using transform, then group the results; string operations are quite slow in general:
(df.groupby("item_id")['item_type'].agg(lambda x: '/'.join(sorted(x.unique())))
.to_frame()
.join(
df[['value1','value2']].mul(df['n'],0).div(
df.groupby("item_id")['n'].transform('sum'),0)
.groupby(df['item_id']).sum()))
Or :
cols = ['value1','value2']
(df.groupby("item_id")['item_type'].agg(lambda x: '/'.join(sorted(x.unique())))
.to_frame().join
(pd.DataFrame(((df[cols].to_numpy() * df['n'].to_numpy()[:,None])/
df.groupby("item_id")['n'].transform('sum').to_numpy()[:,None]),
index=df.index,columns=cols)
.groupby(df['item_id']).sum()))
item_type value1 value2
item_id
1 B/C/M/N 9.778571 11.955882
2 A/M/N 15.089071 -15.474317
3 E 9.500000 32.900000
What about using NumPy functions only:
def numpy_func(group):
    n = group['n'].values
    # '/'.join replaces np.str.join, which has been removed from NumPy; np.unique returns sorted values
    item_type = '/'.join(np.unique(group["item_type"].values))
    value1 = np.average(group["value1"].values, weights=n)
    value2 = np.average(group["value2"].values, weights=n)
    return pd.Series([item_type, value1, value2], index=['item_type', 'value1', 'value2'])

df.groupby("item_id").apply(numpy_func)
Execution time comparison
I compared your function with mine:
from datetime import datetime

before = datetime.now()
for i in range(1000):
    df.groupby("item_id").apply(numpy_func)
after = datetime.now()
print(after - before)
# 0:00:03.954935

before = datetime.now()
for i in range(1000):
    df.groupby("item_id").apply(func)
after = datetime.now()
print(after - before)
# 0:00:06.307923
It was about a third faster.
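For a frame in the tens of millions of rows, another sketch worth trying (not from either answer above) is to precompute the weighted columns once and aggregate with plain per-group sums, leaving only the item_type concatenation as a Python-level lambda; the column names are those from the question:
w = df.assign(w1=df['value1'] * df['n'], w2=df['value2'] * df['n'])
g = w.groupby('item_id')
out = pd.DataFrame({
    'item_type': g['item_type'].agg(lambda x: '/'.join(sorted(x.unique()))),
    'value1': g['w1'].sum() / g['n'].sum(),
    'value2': g['w2'].sum() / g['n'].sum(),
})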

cogroup-like operation for pandas

I was trying to use pandas to analyze a fairly large data set (~5 GB). I wanted to divide the data set into groups, then perform a Cartesian product on each group, and then aggregate the result.
The apply operation of pandas is quite expressive: I could first group, then do the Cartesian product on each group using apply, and then aggregate the result using sum. The problem with this approach, however, is that apply is not lazy; it will compute all the intermediate results before the aggregation, and the intermediate results (the Cartesian product of each group) are very large.
I was looking at Apache Spark and found one very interesting operator called cogroup. The definition is here:
When called on datasets of type (K, V) and (K, W), returns a dataset of (K, (Iterable&lt;V&gt;, Iterable&lt;W&gt;)) tuples. This operation is also called groupWith.
This seems to be exactly what I want. If I could first cogroup and then do a sum, then the intermediate results won't be expanded (assuming cogroup works in the same lazy fashion as group).
Is there an operation similar to cogroup in pandas, or is there another way to achieve my goal efficiently?
Here is my example:
I want to group the data by id, and then do a Cartesian product for each group, and then group by cluster_x and cluster_y and aggregate the count_x and count_y using sum. The following code works, but is extremely slow and consumes too much memory.
# add dummy_key to do the Cartesian product by merge
df['dummy_key'] = 1

def join_group(g):
    return pandas.merge(g, g, on='dummy_key')\
        [['cluster_x', 'count_x', 'cluster_y', 'count_y']]

df_count_stats = df.groupby(['id'], as_index=True).apply(join_group)\
    .groupby(['cluster_x', 'cluster_y'], as_index=False)\
    [['count_x', 'count_y']].sum()
A toy data set
id cluster count
0 i1 A 2
1 i1 B 3
2 i2 A 1
3 i2 B 4
Intermediate result after the apply (can be large)
cluster_x count_x cluster_y count_y
id
i1 0 A 2 A 2
1 A 2 B 3
2 B 3 A 2
3 B 3 B 3
i2 0 A 1 A 1
1 A 1 B 4
2 B 4 A 1
3 B 4 B 4
The desired final result
cluster_x cluster_y count_x count_y
0 A A 3 3
1 A B 3 7
2 B A 7 3
3 B B 7 7
My first attempt failed, sort of: while I was able to limit the memory use (by summing over the Cartesian product within each group), it was considerably slower than the original. But for your particular desired output, I think we can simplify the problem considerably: since the final step is a plain sum and each cluster appears once per id in the toy data, the per-group Cartesian product followed by summation reduces to summing the counts per cluster and then cross-joining those sums:
import numpy as np, pandas as pd

def fake_data(nids, nclusters, ntile):
    ids = ["i{}".format(i) for i in range(1, nids + 1)]
    clusters = ["A{}".format(i) for i in range(nclusters)]
    df = pd.DataFrame(index=pd.MultiIndex.from_product([ids, clusters], names=["id", "cluster"]))
    df = df.reset_index()
    df = pd.concat([df] * ntile)
    df["count"] = np.random.randint(0, 10, size=len(df))
    return df

def join_group(g):
    m = pd.merge(g, g, on='dummy_key')
    return m[['cluster_x', 'count_x', 'cluster_y', 'count_y']]

def old_method(df):
    df["dummy_key"] = 1
    h1 = df.groupby(['id'], as_index=True).apply(join_group)
    h2 = h1.groupby(['cluster_x', 'cluster_y'], as_index=False)
    h3 = h2[['count_x', 'count_y']].sum()
    return h3

def new_method1(df):
    m1 = df.groupby("cluster", as_index=False)["count"].sum()
    m1["dummy_key"] = 1
    m2 = m1.merge(m1, on="dummy_key")
    m2 = m2.sort_index(axis=1).drop(["dummy_key"], axis=1)
    return m2
which gives (with df as your toy frame):
>>> new_method1(df)
cluster_x cluster_y count_x count_y
0 A A 3 3
1 A B 3 7
2 B A 7 3
3 B B 7 7
>>> df2 = fake_data(100, 100, 1)
>>> %timeit old_method(df2)
1 loops, best of 3: 954 ms per loop
>>> %timeit new_method1(df2)
100 loops, best of 3: 8.58 ms per loop
>>> (old_method(df2) == new_method1(df2)).all().all()
True
and even
>>> df2 = fake_data(100, 100, 100)
>>> %timeit new_method1(df2)
10 loops, best of 3: 88.8 ms per loop
Whether this will be enough of an improvement to handle your actual case, I'm not sure.

Pandas dataframe total row

I have a dataframe, something like:
foo bar qux
0 a 1 3.14
1 b 3 2.72
2 c 2 1.62
3 d 9 1.41
4 e 3 0.58
and I would like to add a 'total' row to the end of dataframe:
foo bar qux
0 a 1 3.14
1 b 3 2.72
2 c 2 1.62
3 d 9 1.41
4 e 3 0.58
5 total 18 9.47
I've tried to use the sum command, but I end up with a Series which, although I can convert it back to a DataFrame, doesn't maintain the data types:
tot_row = pd.DataFrame(df.sum()).T
tot_row['foo'] = 'tot'
tot_row.dtypes:
foo object
bar object
qux object
I would like to maintain the data types from the original data frame as I need to apply other operations to the total row, something like:
baz = 2*tot_row['qux'] + 3*tot_row['bar']
Update June 2022
DataFrame.append is now deprecated. You could use pd.concat instead, but it's probably easier to use df.loc['Total'] = df.sum(numeric_only=True), as Kevin Zhu commented. Or, better still, don't modify the data frame in place and keep your data separate from your summary statistics!
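For example, a concat-based sketch that leaves the original frame untouched (assuming the df from the question):
totals = df.sum(numeric_only=True).rename('total').to_frame().T
display_df = pd.concat([df, totals])  # df itself is unchanged; 'foo' is NaN in the totals row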
Append a totals row with
df.append(df.sum(numeric_only=True), ignore_index=True)
The conversion is necessary only if you have a column of strings or objects.
It's a bit of a fragile solution, so I'd recommend sticking to operations on the dataframe, though, e.g.:
baz = 2*df['qux'].sum() + 3*df['bar'].sum()
df.loc["Total"] = df.sum()
works for me and I find it easier to remember. Am I missing something?
Probably wasn't possible in earlier versions.
I'd actually like to add the total row only temporarily though.
Adding it permanently is good for display but makes it a hassle in further calculations.
Just found
df.append(df.sum().rename('Total'))
This prints what I want in a Jupyter notebook and appears to leave the df itself untouched.
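Note that DataFrame.append was removed in pandas 2.0; an equivalent sketch using concat would be:
pd.concat([df, df.sum().rename('Total').to_frame().T])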
New Method
To get both row and column total:
import numpy as np
import pandas as pd
df = pd.DataFrame({'a': [10,20],'b':[100,200],'c': ['a','b']})
df.loc['Column_Total']= df.sum(numeric_only=True, axis=0)
df.loc[:,'Row_Total'] = df.sum(numeric_only=True, axis=1)
print(df)
a b c Row_Total
0 10.0 100.0 a 110.0
1 20.0 200.0 b 220.0
Column_Total 30.0 300.0 NaN 330.0
Use DataFrame.pivot_table with margins=True:
import pandas as pd
data = [('a',1,3.14),('b',3,2.72),('c',2,1.62),('d',9,1.41),('e',3,.58)]
df = pd.DataFrame(data, columns=('foo', 'bar', 'qux'))
Original df:
foo bar qux
0 a 1 3.14
1 b 3 2.72
2 c 2 1.62
3 d 9 1.41
4 e 3 0.58
Since pivot_table requires some sort of grouping (without the index argument, it'll raise a ValueError: No group keys passed!), and your original index is vacuous, we'll use the foo column:
df.pivot_table(index='foo',
margins=True,
margins_name='total', # defaults to 'All'
aggfunc=sum)
Voilà!
bar qux
foo
a 1 3.14
b 3 2.72
c 2 1.62
d 9 1.41
e 3 0.58
total 18 9.47
Alternative way (verified on Pandas 0.18.1):
import numpy as np
total = df.apply(np.sum)
total['foo'] = 'tot'
df.append(pd.DataFrame(total.values, index=total.keys()).T, ignore_index=True)
Result:
foo bar qux
0 a 1 3.14
1 b 3 2.72
2 c 2 1.62
3 d 9 1.41
4 e 3 0.58
5 tot 18 9.47
Building on JMZ's answer
df.append(df.sum(numeric_only=True), ignore_index=True)
if you want to continue using your current index you can name the sum series using .rename() as follows:
df.append(df.sum().rename('Total'))
This will add a row at the bottom of the table.
This is the way that I do it, by transposing and using the assign method in combination with a lambda function. It makes it simple for me.
df.T.assign(GrandTotal = lambda x: x.sum(axis=1)).T
Building on answer from Matthias Kauer.
To add row total:
df.loc["Row_Total"] = df.sum()
To add column total,
df.loc[:,"Column_Total"] = df.sum(axis=1)
New method [September 2022]
TL;DR:
Just use
df.style.concat(df.agg(['sum']).style)
for a solution that won't change your dataframe, works even if you have a "sum" in your index, and can be styled!
Explanation
In pandas 1.5.0, a new method named .style.concat() gives you the ability to display several dataframes together. This is a good way to show the total (or any other statistics), because it is not changing the original dataframe, and works even if you have an index named "sum" in your original dataframe.
For example:
import pandas as pd
df = pd.DataFrame([[1, 2, 3], [4, 5, 6]], columns=['A', 'B', 'C'])
df.style.concat(df.agg(['sum']).style)
and it will return a formatted table that is visible in jupyter as this:
Styling
with a little longer code, you can even make the last row look different:
df.style.concat(
df.agg(['sum']).style
.set_properties(**{'background-color': 'yellow'})
)
to get a totals row highlighted in yellow.
see other ways to style (such as bold font, or table lines) in the docs
The following helped me add a column total and a row total to a dataframe.
Assume dft1 is your original dataframe; now add a column total and a row total with the following steps.
from io import StringIO
import pandas as pd
#create dataframe string
dfstr = StringIO(u"""
a;b;c
1;1;1
2;2;2
3;3;3
4;4;4
5;5;5
""")
#create dataframe dft1 from string
dft1 = pd.read_csv(dfstr, sep=";")
## add a column total to dft1
dft1['Total'] = dft1.sum(axis=1)
## add a row total to dft1 with the following steps
sum_row = dft1.sum(axis=0) #get sum_row first
dft1_sum=pd.DataFrame(data=sum_row).T #change it to a dataframe
dft1_sum=dft1_sum.reindex(columns=dft1.columns) #line up the col index to dft1
dft1_sum.index = ['row_total'] #change row index to row_total
dft1.append(dft1_sum) # append the row to dft1
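As noted above, DataFrame.append was removed in pandas 2.0; on current versions the last step could be written, for example, as:
dft1 = pd.concat([dft1, dft1_sum])  # append the row_total row to dft1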
Actually, all proposed solutions render the original DataFrame unusable for any further analysis and can invalidate following computations, which is easy to overlook and could lead to false results.
This is because you add a row to the data which Pandas cannot differentiate from an additional row of data.
Example:
import pandas as pd
data = [1, 5, 6, 8, 9]
df = pd.DataFrame(data)
df
df.describe()
yields

   0
0  1
1  5
2  6
3  8
4  9

and df.describe() yields

           0
count      5
mean     5.8
std  3.11448
min        1
25%        5
50%        6
75%        8
max        9
After

df.loc['Totals'] = df.sum(numeric_only=True, axis=0)

the dataframe looks like this:

         0
0        1
1        5
2        6
3        8
4        9
Totals  29
This looks nice, but the new row is treated as if it was an additional data item, so df.describe() will produce false results:

             0
count        6
mean   9.66667
std    9.87252
min          1
25%       5.25
50%          7
75%       8.75
max         29
So: watch out! Apply this only after doing all other analyses of the data, or work on a copy of the DataFrame!
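A small sketch of the copy-based approach, so the original frame stays clean for further analysis:
display_df = df.copy()
display_df.loc['Totals'] = df.sum(numeric_only=True, axis=0)
# df.describe() still reports count 5, mean 5.8, ...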
When the "totals" need to be added to an index column:
totals = pd.DataFrame(df.sum(numeric_only=True)).transpose().set_index(pd.Index({"totals"}))
df.append(totals)
e.g.
(Pdb) df
count min bytes max bytes mean bytes std bytes sum bytes
row_0 837200 67412.0 368733992.0 2.518989e+07 5.122836e+07 2.108898e+13
row_1 299000 85380.0 692782132.0 2.845055e+08 2.026823e+08 8.506713e+13
row_2 837200 67412.0 379484173.0 8.706825e+07 1.071484e+08 7.289354e+13
row_3 239200 85392.0 328063972.0 9.870446e+07 1.016989e+08 2.361011e+13
row_4 59800 67292.0 383487021.0 1.841879e+08 1.567605e+08 1.101444e+13
row_5 717600 112309.0 379483824.0 9.687554e+07 1.103574e+08 6.951789e+13
row_6 119600 664144.0 358486985.0 1.611637e+08 1.171889e+08 1.927518e+13
row_7 478400 67300.0 593141462.0 2.824301e+08 1.446283e+08 1.351146e+14
row_8 358800 215002028.0 327493141.0 2.861329e+08 1.545693e+07 1.026645e+14
row_9 358800 202248016.0 321657935.0 2.684668e+08 1.865470e+07 9.632590e+13
(Pdb) totals = pd.DataFrame(df.sum(numeric_only=True)).transpose()
(Pdb) totals
count min bytes max bytes mean bytes std bytes sum bytes
0 4305600.0 418466685.0 4.132815e+09 1.774725e+09 1.025805e+09 6.365722e+14
(Pdb) totals = pd.DataFrame(df.sum(numeric_only=True)).transpose().set_index(pd.Index({"totals"}))
(Pdb) totals
count min bytes max bytes mean bytes std bytes sum bytes
totals 4305600.0 418466685.0 4.132815e+09 1.774725e+09 1.025805e+09 6.365722e+14
(Pdb) df.append(totals)
count min bytes max bytes mean bytes std bytes sum bytes
row_0 837200.0 67412.0 3.687340e+08 2.518989e+07 5.122836e+07 2.108898e+13
row_1 299000.0 85380.0 6.927821e+08 2.845055e+08 2.026823e+08 8.506713e+13
row_2 837200.0 67412.0 3.794842e+08 8.706825e+07 1.071484e+08 7.289354e+13
row_3 239200.0 85392.0 3.280640e+08 9.870446e+07 1.016989e+08 2.361011e+13
row_4 59800.0 67292.0 3.834870e+08 1.841879e+08 1.567605e+08 1.101444e+13
row_5 717600.0 112309.0 3.794838e+08 9.687554e+07 1.103574e+08 6.951789e+13
row_6 119600.0 664144.0 3.584870e+08 1.611637e+08 1.171889e+08 1.927518e+13
row_7 478400.0 67300.0 5.931415e+08 2.824301e+08 1.446283e+08 1.351146e+14
row_8 358800.0 215002028.0 3.274931e+08 2.861329e+08 1.545693e+07 1.026645e+14
row_9 358800.0 202248016.0 3.216579e+08 2.684668e+08 1.865470e+07 9.632590e+13
totals 4305600.0 418466685.0 4.132815e+09 1.774725e+09 1.025805e+09 6.365722e+14
Since I generally want to do this at the very end, so as to avoid breaking the integrity of the dataframe (right before printing), I created a summary_rows_cols method which returns a printable dataframe:
def summary_rows_cols(df: pd.DataFrame,
                      column_sum: bool = False,
                      column_avg: bool = False,
                      column_median: bool = False,
                      row_sum: bool = False,
                      row_avg: bool = False,
                      row_median: bool = False
                      ) -> pd.DataFrame:
    ret = df.copy()
    if column_sum: ret.loc['Sum'] = df.sum(numeric_only=True, axis=0)
    if column_avg: ret.loc['Avg'] = df.mean(numeric_only=True, axis=0)
    if column_median: ret.loc['Median'] = df.median(numeric_only=True, axis=0)
    if row_sum: ret.loc[:, 'Sum'] = df.sum(numeric_only=True, axis=1)
    if row_avg: ret.loc[:, 'Avg'] = df.mean(numeric_only=True, axis=1)
    if row_median: ret.loc[:, 'Median'] = df.median(numeric_only=True, axis=1)
    ret.fillna('-', inplace=True)
    return ret
This allows me to enter a generic (numeric) df and get a summarized output such as:
a b c Sum Median
0 1 4 7 12 4
1 2 5 8 15 5
2 3 6 9 18 6
Sum 6 15 24 - -
from:
data = {
'a': [1, 2, 3],
'b': [4, 5, 6],
'c': [7, 8, 9]
}
df = pd.DataFrame(data)
printable = summary_rows_cols(df, row_sum=True, column_sum=True, row_median=True)

Efficient way of doing permutations with pandas over a large DataFrame

Currently I have a pandas DataFrame like this:
ID A1 A2 A3 B1 B2 B3
Ku8QhfS0n_hIOABXuE 6.343 6.304 6.410 6.287 6.403 6.279
fqPEquJRRlSVSfL.8A 6.752 6.681 6.680 6.677 6.525 6.739
ckiehnugOno9d7vf1Q 6.297 6.248 6.524 6.382 6.316 6.453
x57Vw5B5Fbt5JUnQkI 6.268 6.451 6.379 6.371 6.458 6.333
This DataFrame is used with a statistic which then requires a permutation test (EDIT: to be precise, random permutation). The indices of each column need to be shuffled (sampled) 100 times. To give an idea of the size, the number of rows can be around 50,000.
EDIT: The permutation is along the rows, i.e. shuffle the index for each column.
The biggest issue here is one of performance. I want to permute things in a fast way.
An example I had in mind was:
import random
import joblib

def permutation(dataframe):
    return dataframe.apply(random.sample, axis=1, k=len(dataframe))

permute = joblib.delayed(permutation)
pool = joblib.Parallel(n_jobs=-2)  # all cores minus 1
result = pool(permute(dataframe) for item in range(100))
The issue here is that by doing this, the test is not stable: apparently the permutation works, but it is not as "random" as it would be without being done in parallel, and thus there's a loss of stability in the results when I use the permuted data in follow-up calculations.
So my only "solution" was to precalculate all indices for all columns prior to running the parallel code, which slows things down considerably.
My questions are:
Is there a more efficient way to do this permutation? (not necessarily parallel)
Is the parallel approach (using multiple processes, not threads) feasible?
EDIT: To make things clearer, here's what should happen for example to column A1 after one shuffling:
Ku8QhfS0n_hIOABXuE 6.268
fqPEquJRRlSVSfL.8A 6.343
ckiehnugOno9d7vf1Q 6.752
x57Vw5B5Fbt5JUnQk 6.297
(i.e. the row values were moving around).
EDIT2: Here's what I'm using now:
def _generate_indices(indices, columns, nperm):
    # indices: the sequence of row positions to resample for each column
    random.seed(1234567890)
    num_genes = indices.size
    for item in range(nperm):
        permuted = pandas.DataFrame(
            {column: random.sample(indices, num_genes) for column in columns},
            index=range(num_genes)
        )
        yield permuted
(in short, building a DataFrame of resampled indices for each column)
And later on (yes, I know it's pretty ugly):
# Data is the original DataFrame
# Indices one of the results of that generator
permuted = dict()
for column in data.columns:
    value = data[column]
    permuted[column] = value[indices[column].values].values
permuted_table = pandas.DataFrame(permuted, index=data.index)
How about this:
In [1]: import numpy as np; import pandas as pd
In [2]: df = pd.DataFrame(np.random.randn(50000, 10))
In [3]: def shuffle(df, n):
   ....:     for i in range(n):
   ....:         np.random.shuffle(df.values)
   ....:     return df
In [4]: df.head()
Out[4]:
0 1 2 3 4 5 6 7 8 9
0 0.329588 -0.513814 -1.267923 0.691889 -0.319635 -1.468145 -0.441789 0.004142 -0.362073 -0.555779
1 0.495670 2.460727 1.174324 1.115692 1.214057 -0.843138 0.217075 0.495385 1.568166 0.252299
2 -0.898075 0.994281 -0.281349 -0.104684 -1.686646 0.651502 -1.466679 -1.256705 1.354484 0.626840
3 1.158388 -1.227794 -0.462005 -1.790205 0.399956 -1.631035 -1.707944 -1.126572 -0.892759 1.396455
4 -0.049915 0.006599 -1.099983 0.775028 -0.694906 -1.376802 -0.152225 1.413212 0.050213 -0.209760
In [5]: shuffle(df, 1).head(5)
Out[5]:
0 1 2 3 4 5 6 7 8 9
0 2.044131 0.072214 -0.304449 0.201148 1.462055 0.538476 -0.059249 -0.133299 2.925301 0.529678
1 0.036957 0.214003 -1.042905 -0.029864 1.616543 0.840719 0.104798 -0.766586 -0.723782 -0.088239
2 -0.025621 0.657951 1.132175 -0.815403 0.548210 -0.029291 0.575587 0.032481 -0.261873 0.010381
3 1.396024 0.859455 -1.514801 0.353378 1.790324 0.286164 -0.765518 1.363027 -0.868599 -0.082818
4 -0.026649 -0.090119 -2.289810 -0.701342 -0.116262 -0.674597 -0.580760 -0.895089 -0.663331 0.
In [6]: %timeit shuffle(df, 100)
Out[6]:
1 loops, best of 3: 14.4 s per loop
This does what you need it to. The only question is whether or not it is fast enough.
Update
Per the comments by @Einar I have changed my solution.
In [7]: def shuffle2(df, n):
   ....:     ind = df.index
   ....:     for i in range(n):
   ....:         sampler = np.random.permutation(df.shape[0])
   ....:         new_vals = df.take(sampler).values
   ....:         df = pd.DataFrame(new_vals, index=ind)
   ....:     return df
In [8]: df.head()
Out[8]:
0 1 2 3 4 5 6 7 8 9
0 -0.175006 -0.462306 0.565517 -0.309398 1.100570 0.656627 1.207535 -0.221079 -0.933068 -0.192759
1 0.388165 0.155480 -0.015188 0.868497 1.102662 -0.571818 -0.994005 0.600943 2.205520 -0.294121
2 0.281605 -1.637529 2.238149 0.987409 -1.979691 -0.040130 1.121140 1.190092 -0.118919 0.790367
3 1.054509 0.395444 1.239756 -0.439000 0.146727 -1.705972 0.627053 -0.547096 -0.818094 -0.056983
4 0.209031 -0.233167 -1.900261 -0.678022 -0.064092 -1.562976 -1.516468 0.512461 1.058758 -0.206019
In [9]: shuffle2(df, 1).head()
Out[9]:
0 1 2 3 4 5 6 7 8 9
0 0.054355 0.129432 -0.805284 -1.713622 -0.610555 -0.874039 -0.840880 0.593901 0.182513 -1.981521
1 0.624562 1.097495 -0.428710 -0.133220 0.675428 0.892044 0.752593 -0.702470 0.272386 -0.193440
2 0.763551 -0.505923 0.206675 0.561456 0.441514 -0.743498 -1.462773 -0.061210 -0.435449 -2.677681
3 1.149586 -0.003552 2.496176 -0.089767 0.246546 -1.333184 0.524872 -0.527519 0.492978 -0.829365
4 -1.893188 0.728737 0.361983 -0.188709 -0.809291 2.093554 0.396242 0.402482 1.884082 1.373781
In [10]: timeit shuffle2(df, 100)
1 loops, best of 3: 2.47 s per loop
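Both shuffle and shuffle2 permute whole rows at a time. If each column should be shuffled independently and reproducibly (which matters once the work is split across processes), one sketch is to draw per-column permutations from an explicitly seeded NumPy generator; the seed value here is arbitrary:
import numpy as np
import pandas as pd

def shuffle_columns(df, n, seed=1234567890):
    # a single seeded generator drives all draws, so the result does not
    # depend on how the repetitions are distributed over workers
    rng = np.random.default_rng(seed)
    for _ in range(n):
        yield pd.DataFrame(
            {col: df[col].to_numpy()[rng.permutation(len(df))] for col in df.columns},
            index=df.index,
        )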
