Calculating subtractions of pairs of columns in pandas DataFrame - python

I work with sizeable DataFrames (48K rows, up to tens of columns). At a certain point in their manipulation, I need to do pair-wise subtractions of column values, and I was wondering if there is a more efficient way to do so than the one I'm using (see below).
My current code:
import itertools
import pandas

# matrix is the pandas DataFrame containing all the data;
# group1 and group2 are lists of column names (e.g. the "A" and "B" columns)
comparison_df = pandas.DataFrame(index=matrix.index)
combinations = itertools.product(group1, group2)
for observed, reference in combinations:
    observed_data = matrix[observed]
    reference_data = matrix[reference]
    comparison = observed_data - reference_data
    name = observed + "_" + reference
    comparison_df[name] = comparison
Since the data can be large (I'm using this piece of code also during a permutation test), I'm interested in knowing if it can be optimized a bit.
EDIT: As requested, here's a sample of a typical data set
ID A1 A2 A3 B1 B2 B3
Ku8QhfS0n_hIOABXuE 6.343 6.304 6.410 6.287 6.403 6.279
fqPEquJRRlSVSfL.8A 6.752 6.681 6.680 6.677 6.525 6.739
ckiehnugOno9d7vf1Q 6.297 6.248 6.524 6.382 6.316 6.453
x57Vw5B5Fbt5JUnQkI 6.268 6.451 6.379 6.371 6.458 6.333
A typical result, if the "A" columns are group1 and the "B" columns group2, would have one column per pairing generated above (e.g., A1_B1, A2_B1, A3_B1, ...), each containing the corresponding subtraction for every row ID.

Using itertools.combinations() on DataFrame columns
You can create combinations of columns with itertools.combinations() and evaluate subtractions along with new names based on these pairs:
import pandas as pd
from io import StringIO
import itertools

matrix = pd.read_csv(StringIO('''ID,A1,A2,A3,B1,B2,B3
Ku8QhfS0n_hIOABXuE,6.343,6.304,6.410,6.287,6.403,6.279
fqPEquJRRlSVSfL.8A,6.752,6.681,6.680,6.677,6.525,6.739
ckiehnugOno9d7vf1Q,6.297,6.248,6.524,6.382,6.316,6.453
x57Vw5B5Fbt5JUnQkI,6.268,6.451,6.379,6.371,6.458,6.333''')).set_index('ID')

print('Original DataFrame:')
print(matrix)
print()

# Create DataFrame to fill with combinations
comparison_df = pd.DataFrame(index=matrix.index)

# Create combinations of columns
for a, b in itertools.combinations(matrix.columns, 2):
    # Subtract column combinations
    comparison_df['{}_{}'.format(a, b)] = matrix[a] - matrix[b]

print('Combination DataFrame:')
print(comparison_df)
Original DataFrame:
A1 A2 A3 B1 B2 B3
ID
Ku8QhfS0n_hIOABXuE 6.343 6.304 6.410 6.287 6.403 6.279
fqPEquJRRlSVSfL.8A 6.752 6.681 6.680 6.677 6.525 6.739
ckiehnugOno9d7vf1Q 6.297 6.248 6.524 6.382 6.316 6.453
x57Vw5B5Fbt5JUnQkI 6.268 6.451 6.379 6.371 6.458 6.333
Combination DataFrame:
A1_A2 A1_A3 A1_B1 A1_B2 A1_B3 A2_A3 A2_B1 A2_B2 A2_B3 A3_B1 A3_B2 A3_B3 B1_B2 B1_B3 B2_B3
ID
Ku8QhfS0n_hIOABXuE 0.039 -0.067 0.056 -0.060 0.064 -0.106 0.017 -0.099 0.025 0.123 0.007 0.131 -0.116 0.008 0.124
fqPEquJRRlSVSfL.8A 0.071 0.072 0.075 0.227 0.013 0.001 0.004 0.156 -0.058 0.003 0.155 -0.059 0.152 -0.062 -0.214
ckiehnugOno9d7vf1Q 0.049 -0.227 -0.085 -0.019 -0.156 -0.276 -0.134 -0.068 -0.205 0.142 0.208 0.071 0.066 -0.071 -0.137
x57Vw5B5Fbt5JUnQkI -0.183 -0.111 -0.103 -0.190 -0.065 0.072 0.080 -0.007 0.118 0.008 -0.079 0.046 -0.087 0.038 0.125
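If the goal is specifically the group1 x group2 pairs from the question (rather than all column combinations), a vectorized sketch using NumPy broadcasting can avoid the Python-level loop entirely; it assumes group1 and group2 are lists of column names such as ['A1', 'A2', 'A3'] and ['B1', 'B2', 'B3']:
import numpy as np
import pandas as pd

# (n_rows, len(group1)) and (n_rows, len(group2)) blocks of the data
obs = matrix[group1].values
ref = matrix[group2].values

# broadcast to (n_rows, len(group1), len(group2)): diffs[i, a, b] = obs[i, a] - ref[i, b]
diffs = obs[:, :, None] - ref[:, None, :]

# column order matches itertools.product(group1, group2)
names = ['{}_{}'.format(o, r) for o in group1 for r in group2]
comparison_df = pd.DataFrame(diffs.reshape(len(matrix), -1),
                             index=matrix.index, columns=names)
Whether this beats the loop depends on the number of pairs; for a handful of columns the difference is small, but inside a permutation test the saved Python overhead can add up.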

Related

Can you further speed up pd.DataFrame.agg()?

I have a Pandas DataFrame with 387 rows and 26 columns. This DataFrame is then groupby()-ed and agg()-ed, turning it into a DataFrame with 1 row and 111 columns. This takes about 0.05s. For example:
frames = frames.groupby(['id']).agg({"bytes": ["count",
                                               "sum",
                                               "median",
                                               "std",
                                               "min",
                                               "max"],
                                     # === add about 70 more lines of this ===
                                     "pixels": "sum"})
All of these use Pandas' built-in Cython functions, e.g. sum, std, min, max, first, etc. I am looking to speed this process up, but is there even a way to do so? To my understanding it is already 'vectorized', so there isn't anything more to gain from Cython, is there?
Maybe calculating each column separately without the .agg() would be faster?
Would greatly appreciate any ideas, or confirmation that there is nothing else to be done. Thanks!
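To sanity-check the "calculate each column separately" idea, here is a rough sketch that bypasses .agg() entirely (illustrative only: it assumes a single group, as in the working example below where every row has id == 1, and note that np.std defaults to ddof=0 while pandas uses ddof=1):
import numpy as np
import pandas as pd

def agg_numpy(df, aggs=("sum", "mean", "std", "min")):
    # one NumPy reduction per (column, aggregation) pair, no groupby machinery
    data = {col: [getattr(np, a)(df[col].to_numpy()) for a in aggs]
            for col in df.columns if col != "id"}
    return pd.DataFrame(data, index=list(aggs))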
Edit!
Here's a working example:
import pandas as pd
import numpy as np
aggs = ["sum", "mean", "std", "min"]
cols = {k:aggs for k in 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'}
df = pd.DataFrame(np.random.randint(0,100,size=(387, 26)), columns=list('ABCDEFGHIJKLMNOPQRSTUVWXYZ'))
df['id'] = 1
print(df.groupby("id").agg(cols))
cProfile results:
import cProfile
cProfile.run('df.groupby("id").agg(cols)', sort='cumtime')
79825 function calls (78664 primitive calls) in 0.076 seconds
Ordered by: cumulative time
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.000 0.000 0.076 0.076 {built-in method builtins.exec}
1 0.000 0.000 0.076 0.076 <string>:1(<module>)
1 0.000 0.000 0.076 0.076 generic.py:964(aggregate)
1 0.000 0.000 0.075 0.075 apply.py:143(agg)
1 0.000 0.000 0.075 0.075 apply.py:405(agg_dict_like)
1 0.000 0.000 0.062 0.062 apply.py:435(<dictcomp>)
130/26 0.000 0.000 0.059 0.002 generic.py:225(aggregate)
26 0.001 0.000 0.058 0.002 generic.py:278(_aggregate_multiple_funcs)
78 0.001 0.000 0.023 0.000 generic.py:322(_cython_agg_general)
28 0.000 0.000 0.023 0.001 frame.py:573(__init__)
I ran some benchmarks (with 10 columns to aggregate and 6 aggregation functions for each column, and at most 100 unique ids). It seems that the total time to run the aggregation does not change until the number of rows is somewhere between 10k and 100k.
If you know your dataframes in advance, you can concatenate them into a single big DataFrame with two-level index, run groupby on two columns and get significant speedup. In a way, this runs the calculation on batches of dataframes.
In my example, it takes around 400ms to process a single DataFrame, and 600ms to process a batch of 100 DataFrames, with an average speedup of around 60x.
Here is the general approach (using 100 columns instead of 10):
import numpy as np
import pandas as pd

num_rows = 100
num_cols = 100

# builds a random df
def build_df():
    # build some df with num_rows rows and num_cols cols to aggregate
    df = pd.DataFrame({
        "id": np.random.randint(0, 100, num_rows),
    })
    for i in range(num_cols):
        df[i] = np.random.randint(0, 10, num_rows)
    return df

agg_dict = {
    i: ["count", "sum", "median", "std", "min", "max"]
    for i in range(num_cols)
}

# get a single small df
df = build_df()

# build 100 random dataframes
dfs = [build_df() for _ in range(100)]
# set the first df to be equal to the "small" df we computed before
dfs[0] = df.copy()
big_df = pd.concat(dfs, keys=range(100))

%timeit big_df.groupby([big_df.index.get_level_values(0), "id"]).agg(agg_dict)
# 605 ms per loop, for 100 dataframes
agg_from_big = big_df.groupby([big_df.index.get_level_values(0), "id"]).agg(agg_dict).loc[0]

%timeit df.groupby("id").agg(agg_dict)
# 417 ms per loop, for one dataframe
agg_from_small = df.groupby("id").agg(agg_dict)

assert agg_from_small.equals(agg_from_big)
Here is the benchmarking code. The timings are comparable until the number of rows grows to somewhere between 10k and 100k:
def get_setup(n):
    return f"""
import pandas as pd
import numpy as np
N = {n}
num_cols = 10
df = pd.DataFrame({{
    "id": np.random.randint(0, 100, N),
}})
for i in range(num_cols):
    df[i] = np.random.randint(0, 10, N)
agg_dict = {{
    i: ["count", "sum", "median", "std", "min", "max"]
    for i in range(num_cols)
}}
"""

from timeit import timeit

def time_n(n):
    return timeit(
        "df.groupby('id').agg(agg_dict)", setup=get_setup(n), number=100
    )

times = pd.Series({n: time_n(n) for n in [10, 100, 1000, 10_000, 100_000]})
# 10          4.532458
# 100         4.398949
# 1000        4.426178
# 10000       5.009555
# 100000     11.660783
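If the dataframes arrive as a list, the concat/groupby/split round-trip can be wrapped in a small helper (batch_agg is a hypothetical name, sketched under the same assumptions as the code above):
import pandas as pd

def batch_agg(dfs, agg_dict, by="id"):
    # aggregate all frames in one groupby call, then split the result back up
    big = pd.concat(dfs, keys=range(len(dfs)))
    out = big.groupby([big.index.get_level_values(0), by]).agg(agg_dict)
    return [out.loc[i] for i in range(len(dfs))]

# e.g. results = batch_agg(dfs, agg_dict); results[0] matches df.groupby("id").agg(agg_dict)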

New pandas version: how to groupby all columns with different aggregation statistics

I have a df that looks like this:
time volts1 volts2
0 0.000 -0.299072 0.427551
2 0.001 -0.299377 0.427551
4 0.002 -0.298767 0.427551
6 0.003 -0.298767 0.422974
8 0.004 -0.298767 0.422058
10 0.005 -0.298462 0.422363
12 0.006 -0.298767 0.422668
14 0.007 -0.298462 0.422363
16 0.008 -0.301208 0.420227
18 0.009 -0.303345 0.418091
In actuality, the df has >50 columns, but for simplicity, I'm just showing 3.
I want to groupby this df every n rows, let's say 5. I want to aggregate time with max, and the rest of the columns I want to aggregate by mean. Because there are so many columns, I'd love to be able to loop this and not have to do it manually.
I know I can do something like this where I go through and create all new columns manually:
df.groupby(df.index // 5).agg(time=('time', 'max'),
                              volts1=('volts1', 'mean'),
                              volts2=('volts2', 'mean'),
                              ...
                              )
but because there are so many columns, I want to do this in a loop, something like:
df.groupby(df.index // 5).agg(time=('time', 'max'),
                              # df.time is always the first column
                              [i for i in df.columns[1:]]=(i, 'mean'),
                              )
If useful:
print(pd.__version__)
1.0.5
You can use a dictionary:
d = {col: "max" if col == 'time' else "mean" for col in df.columns}
#{'time': 'max', 'volts1': 'mean', 'volts2': 'mean'}
df.groupby(df.index // 5).agg(d)
time volts1 volts2
0 0.002 -0.299072 0.427551
1 0.004 -0.298767 0.422516
2 0.007 -0.298564 0.422465
3 0.009 -0.302276 0.419159
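The looped named-aggregation form asked about also works directly via keyword unpacking (a sketch; named aggregation needs pandas >= 0.25, so the 1.0.5 above is fine):
agg_kwargs = {'time': ('time', 'max')}
agg_kwargs.update({col: (col, 'mean') for col in df.columns[1:]})
df.groupby(df.index // 5).agg(**agg_kwargs)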

Q: Python (pandas or other) - I need to "flatten" a data file from many rows, few columns to 1 row many columns

I need to "flatten" a data file from many rows, few columns to 1 row many columns.
I currently have a dataframe in pandas (loaded from Excel) and ultimately need to change the way the data is displayed so I can accumulate large amounts of data in a logical manner. The below tables are an attempt at illustrating my requirements.
From:
1 2
Ryan 0.706 0.071
Chad 0.151 0.831
Stephen 0.750 0.653
To:
1_Ryan 1_Chad 1_Stephen 2_Ryan 2_Chad 2_Stephen
0.706 0.151 0.75 0.071 0.831 0.653
Thank you for any assistance!
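For reference, a small frame matching the table above can be built like this (a sketch; it assumes the names form the index and the column labels are the integers 1 and 2):
import pandas as pd

df = pd.DataFrame({1: [0.706, 0.151, 0.750],
                   2: [0.071, 0.831, 0.653]},
                  index=['Ryan', 'Chad', 'Stephen'])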
One line, for fun
df.unstack().pipe(
    lambda s: pd.DataFrame([s.values], columns=s.index.map('{0[0]}_{0[1]}'.format))
)
1_Ryan 1_Chad 1_Stephen 2_Ryan 2_Chad 2_Stephen
0 0.706 0.151 0.75 0.071 0.831 0.653
Let's use stack, swaplevel, to_frame, and T:
df_out = df.stack().swaplevel(1,0).to_frame().T.sort_index(axis=1)
Or better yet (using @piRSquared's unstack solution):
df_out = df.unstack().to_frame().T
df_out.columns = df_out.columns.map('{0[0]}_{0[1]}'.format)
df_out
Output:
1_Chad 1_Ryan 1_Stephen 2_Chad 2_Ryan 2_Stephen
0 0.151 0.706 0.75 0.831 0.071 0.653

Improve performance on processing a big pandas dataframe

I have a big pandas dataframe (1 million rows), and I need better performance in my code to process this data.
My code is below, and a profiling analysis is also provided.
Header of the dataset:
key_id, date, par1, par2, par3, par4, pop, price, value
For each key, there is one row for each of the 5000 possible dates.
There are 200 key_id * 5000 dates = 1,000,000 rows.
Using different values val1, ..., val4, I compute a value for each row, and I want to extract the top 20 dates with the best value for each key_id, and then compute the popularity of the set of values used.
In the end, I want to find the variables which optimize this popularity.
from itertools import product

def compute_value_col(dataset, val1=0, val2=0, val3=0, val4=0):
    dataset['value'] = dataset['price'] + val1 * dataset['par1'] \
        + val2 * dataset['par2'] + val3 * dataset['par3'] \
        + val4 * dataset['par4']
    return dataset

def params_to_score(dataset, top=10, val1=0, val2=0, val3=0, val4=0):
    dataset = compute_value_col(dataset, val1, val2, val3, val4)
    dataset = dataset.sort(['key_id', 'value'], ascending=True)
    dataset = dataset.groupby('key_id').head(top).reset_index(drop=True)
    return dataset['pop'].sum()

def optimize(dataset, top):
    for i, j, k, l in product(xrange(10), xrange(10), xrange(10), xrange(10)):
        print i, j, k, l, params_to_score(dataset, top, 10*i, 10*j, 10*k, 10*l)

optimize(my_dataset, 20)
I need to improve the performance.
Here is a %prun output, after running params_to_score 49 times:
ncalls tottime percall cumtime percall filename:lineno(function)
98 2.148 0.022 2.148 0.022 {pandas.algos.take_2d_axis1_object_object}
49 1.663 0.034 9.852 0.201 <ipython-input-59-88fc8127a27f>:150(params_to_score)
49 1.311 0.027 1.311 0.027 {method 'get_labels' of 'pandas.hashtable.Float64HashTable' objects}
49 1.219 0.025 1.223 0.025 {pandas.algos.groupby_indices}
49 0.875 0.018 0.875 0.018 {method 'get_labels' of 'pandas.hashtable.PyObjectHashTable' objects}
147 0.452 0.003 0.457 0.003 index.py:581(is_unique)
343 0.193 0.001 0.193 0.001 {method 'copy' of 'numpy.ndarray' objects}
1 0.136 0.136 10.058 10.058 <ipython-input-59-88fc8127a27f>:159(optimize)
147 0.122 0.001 0.122 0.001 {method 'argsort' of 'numpy.ndarray' objects}
833 0.112 0.000 0.112 0.000 {numpy.core.multiarray.empty}
49 0.109 0.002 0.109 0.002 {method 'get_labels_groupby' of 'pandas.hashtable.Int64HashTable' objects}
98 0.083 0.001 0.083 0.001 {pandas.algos.take_2d_axis1_float64_float64}
49 0.078 0.002 1.460 0.030 groupby.py:1014(_cumcount_array)
I think I could split the big dataframe into small dataframes by key_id to improve the sort time, since I only want the top 20 dates with the best value for each key_id, so sorting by key is just there to separate the different keys.
But I would appreciate any advice on how to improve the efficiency of this code, as I would need to run thousands of params_to_score calls.
EDIT: @Jeff
Thanks a lot for your help!
I tried using nsmallest instead of sort & head, but strangely it is 5-6 times slower, when I benchmark the two following functions:
def to_bench1(dataset):
    dataset = dataset.sort(['key_id', 'value'], ascending=True)
    dataset = dataset.groupby('key_id').head(50).reset_index(drop=True)
    return dataset['pop'].sum()

def to_bench2(dataset):
    dataset = dataset.set_index('pop')
    dataset = dataset.groupby(['key_id'])['value'].nsmallest(50).reset_index()
    return dataset['pop'].sum()
On a sample of ~100000 rows, to_bench2 performs in 0.5 seconds, while to_bench1 takes only 0.085 seconds on average.
After profiling to_bench2, I notice many more isinstance calls than before, but I do not know where they come from...
The way to make this significantly faster is like this.
Create some sample data
In [148]: df = DataFrame({'A' : range(5), 'B' : [1,1,1,2,2] })
Define compute_value_col like you have
In [149]: def f(p):
   .....:     return DataFrame({'A': df['A'] * p, 'B': df.B})
   .....:
These are the cases (you probably want a list of tuples), i.e. the Cartesian product of all of the parameter values that you want to feed into the above function.
In [150]: parms = [1,3]
Create a new data frame that has the full set of values, keyed by each of the parms. This is basically a broadcasting operation.
In [151]: df2 = pd.concat([ f(p) for p in parms ],keys=parms,names=['parm','indexer']).reset_index()
In [155]: df2
Out[155]:
parm indexer A B
0 1 0 0 1
1 1 1 1 1
2 1 2 2 1
3 1 3 3 2
4 1 4 4 2
5 3 0 0 1
6 3 1 3 1
7 3 2 6 1
8 3 3 9 2
9 3 4 12 2
Here's the magic. Group by whatever columns you want, including parm as the first one (or possibly multiple ones). Then do a partial sort (this is what nlargest does); this is more efficient than sort & head (though it depends a bit on the group density). Sum at the end (again by the groupers that we care about, as you are doing a 'partial' reduction).
In [153]: df2.groupby(['parm','B']).A.nlargest(2).sum(level=['parm','B'])
Out[153]:
parm B
1 1 3
2 7
3 1 9
2 21
dtype: int64
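Mapped back onto the original question, the same broadcast-then-groupby pattern might look roughly like this for a manageable batch of parameter tuples (a sketch only: dataset and compute_value_col are the objects from the question, and the full 10**4-case grid over 1M rows would not fit in memory, so the grid would have to be processed in chunks):
from itertools import product
import pandas as pd

param_list = [(10 * i, 10 * j, 10 * k, 10 * l)
              for i, j, k, l in product(range(2), repeat=4)]   # a small batch of cases

big = pd.concat(
    [compute_value_col(dataset.copy(), *p) for p in param_list],
    keys=range(len(param_list)), names=['case', 'row']
).reset_index()

# lowest 20 'value' rows per (case, key_id); the last index level of the
# nsmallest result points back at rows of `big`
rows = big.groupby(['case', 'key_id'])['value'].nsmallest(20).index.get_level_values(-1)

# popularity score per parameter case
scores = big.loc[rows].groupby('case')['pop'].sum()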

How to barplot Pandas dataframe columns aligning by sub-index?

I have a pandas dataframe df that contains two stocks' financial ratio data:
>>> df
ROIC ROE
STK_ID RPT_Date
600141 20110331 0.012 0.022
20110630 0.031 0.063
20110930 0.048 0.103
20111231 0.063 0.122
20120331 0.017 0.033
20120630 0.032 0.077
20120930 0.050 0.120
600809 20110331 0.536 0.218
20110630 0.734 0.278
20110930 0.806 0.293
20111231 1.679 0.313
20120331 0.666 0.165
20120630 1.039 0.257
20120930 1.287 0.359
I want to plot the ratios 'ROIC' & 'ROE' of stocks '600141' & '600809' together against the same 'RPT_Date' to benchmark their performance.
df.plot(kind='bar') draws '600141' on the left side and '600809' on the right side, which makes it inconvenient to compare the 'ROIC' & 'ROE' of the two stocks on the same report date 'RPT_Date'.
What I want is to put the 'ROIC' & 'ROE' bars indexed by the same 'RPT_Date' side by side in the same group (4 bars per group), with the x-axis labelled only by 'RPT_Date'; that will clearly show the difference between the two stocks.
How can I do that?
And if I use df.plot(kind='line'), it only shows two lines, but there should be four (2 stocks * 2 ratios):
Is it a bug, or what I can do to correct it ? Thanks.
I am using Pandas 0.8.1.
If you unstack STK_ID, you can create side by side plots per RPT_Date.
In [55]: dfu = df.unstack("STK_ID")
In [56]: fig, axes = subplots(2,1)
In [57]: dfu.plot(ax=axes[0], kind="bar")
Out[57]: <matplotlib.axes.AxesSubplot at 0xb53070c>
In [58]: dfu.plot(ax=axes[1])
Out[58]: <matplotlib.axes.AxesSubplot at 0xb60e8cc>
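A self-contained version of the session above might look like this (a sketch; it assumes matplotlib is installed and df is the frame from the question):
import matplotlib.pyplot as plt

dfu = df.unstack("STK_ID")           # columns become (ratio, STK_ID) pairs, index is RPT_Date
fig, axes = plt.subplots(2, 1, figsize=(10, 8))
dfu.plot(ax=axes[0], kind="bar")     # grouped bars: 4 bars per RPT_Date
dfu.plot(ax=axes[1])                 # 4 lines: 2 stocks * 2 ratios
plt.tight_layout()
plt.show()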
