New pandas version: how to groupby all columns with different aggregation statistics - python

I have a df that looks like this:
time volts1 volts2
0 0.000 -0.299072 0.427551
2 0.001 -0.299377 0.427551
4 0.002 -0.298767 0.427551
6 0.003 -0.298767 0.422974
8 0.004 -0.298767 0.422058
10 0.005 -0.298462 0.422363
12 0.006 -0.298767 0.422668
14 0.007 -0.298462 0.422363
16 0.008 -0.301208 0.420227
18 0.009 -0.303345 0.418091
In actuality, the df has >50 columns, but for simplicity, I'm just showing 3.
I want to groupby this df every n rows, let's say 5. I want to aggregate time with max, and the rest of the columns by mean. Because there are so many columns, I'd love to be able to loop this and not have to do it manually.
I know I can do something like this where I go through and create all new columns manually:
df.groupby(df.index // 5).agg(time=('time', 'max'),
                              volts1=('volts1', 'mean'),
                              volts2=('volts2', 'mean'),
                              ...
                              )
but because there are so many columns, I want to do this in a loop, something like:
df.groupby(df.index // 5).agg(time=('time', 'max'),
                              # df.time is always the first column
                              [i for i in df.columns[1:]]=(i, 'mean'),
                              )
If useful:
print(pd.__version__)
1.0.5

You can use a dictionary:
d = {col: "max" if col == 'time' else "mean" for col in df.columns}
#{'time': 'max', 'volts1': 'mean', 'volts2': 'mean'}
df.groupby(df.index // 5).agg(d)
time volts1 volts2
0 0.002 -0.299072 0.427551
1 0.004 -0.298767 0.422516
2 0.007 -0.298564 0.422465
3 0.009 -0.302276 0.419159
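If you prefer the named-aggregation style sketched in the question, the same dictionary idea can be unpacked into agg as keyword arguments (named aggregation exists since pandas 0.25, so it works on 1.0.5). A minimal sketch, assuming the same layout with time as the first column:
named = {'time': ('time', 'max')}
named.update({col: (col, 'mean') for col in df.columns if col != 'time'})
df.groupby(df.index // 5).agg(**named)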

Related

Can you further speed up pd.DataFrame.agg()?

I have a pandas DataFrame with 387 rows and 26 columns. This DataFrame is then passed through groupby() and agg(), turning into a DataFrame with 1 row and 111 columns. This takes about 0.05 s. For example:
frames = frames.groupby(['id']).agg({"bytes": ["count",
                                               "sum",
                                               "median",
                                               "std",
                                               "min",
                                               "max"],
                                     # === add about 70 more lines of this ===
                                     "pixels": "sum"})
All of these use pandas' built-in Cython functions, e.g. sum, std, min, max, first, etc. I am looking to speed this process up, but is there even a way to do so? As far as I understand, it is already considered 'vectorized', so there isn't anything more to gain from Cython, is there?
Maybe calculating each column separately without the .agg() would be faster?
Would greatly appreciate any ideas, or confirmation that there is nothing else to be done. Thanks!
Edit!
Here's a working example:
import pandas as pd
import numpy as np
aggs = ["sum", "mean", "std", "min"]
cols = {k:aggs for k in 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'}
df = pd.DataFrame(np.random.randint(0,100,size=(387, 26)), columns=list('ABCDEFGHIJKLMNOPQRSTUVWXYZ'))
df['id'] = 1
print(df.groupby("id").agg(cols))
cProfile results:
import cProfile
cProfile.run('df.groupby("id").agg(cols)', sort='cumtime')
79825 function calls (78664 primitive calls) in 0.076 seconds
Ordered by: cumulative time
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.000 0.000 0.076 0.076 {built-in method builtins.exec}
1 0.000 0.000 0.076 0.076 <string>:1(<module>)
1 0.000 0.000 0.076 0.076 generic.py:964(aggregate)
1 0.000 0.000 0.075 0.075 apply.py:143(agg)
1 0.000 0.000 0.075 0.075 apply.py:405(agg_dict_like)
1 0.000 0.000 0.062 0.062 apply.py:435(<dictcomp>)
130/26 0.000 0.000 0.059 0.002 generic.py:225(aggregate)
26 0.001 0.000 0.058 0.002 generic.py:278(_aggregate_multiple_funcs)
78 0.001 0.000 0.023 0.000 generic.py:322(_cython_agg_general)
28 0.000 0.000 0.023 0.001 frame.py:573(__init__)
I ran some benchmarks (with 10 columns to aggregate and 6 aggregation functions for each column, and at most 100 unique ids). It seems that the total time to run the aggregation does not change until the number of rows is somewhere between 10k and 100k.
If you know your dataframes in advance, you can concatenate them into a single big DataFrame with a two-level index, run a single groupby on the batch key plus id, and get a significant speedup. In a way, this runs the calculation on batches of dataframes.
In my example, it takes around 400ms to process a single DataFrame, and 600ms to process a batch of 100 DataFrames, with an average speedup of around 60x.
Here is the general approach (using 100 columns instead of 10):
import numpy as np
import pandas as pd
num_rows = 100
num_cols = 100
# builds a random df
def build_df():
    # build some df with num_rows rows and num_cols cols to aggregate
    df = pd.DataFrame({
        "id": np.random.randint(0, 100, num_rows),
    })
    for i in range(num_cols):
        df[i] = np.random.randint(0, 10, num_rows)
    return df

agg_dict = {
    i: ["count", "sum", "median", "std", "min", "max"]
    for i in range(num_cols)
}
# get a single small df
df = build_df()
# build 100 random dataframes
dfs = [build_df() for _ in range(100)]
# set the first df to be equal to the "small" df we computed before
dfs[0] = df.copy()
big_df = pd.concat(dfs, keys=range(100))
%timeit big_df.groupby([big_df.index.get_level_values(0), "id"]).agg(agg_dict)
# 605 ms per loop, for 100 dataframes
agg_from_big = big_df.groupby([big_df.index.get_level_values(0), "id"]).agg(agg_dict).loc[0]
%timeit df.groupby("id").agg(agg_dict)
# 417 ms per loop, for one dataframe
agg_from_small = df.groupby("id").agg(agg_dict)
assert agg_from_small.equals(agg_from_big)
Here is the benchmarking code. The timings are comparable until the number of rows increases to 10k to 100k:
def get_setup(n):
    return f"""
import pandas as pd
import numpy as np
N = {n}
num_cols = 10
df = pd.DataFrame({{
    "id": np.random.randint(0, 100, N),
}})
for i in range(num_cols):
    df[i] = np.random.randint(0, 10, N)
agg_dict = {{
    i: ["count", "sum", "median", "std", "min", "max"]
    for i in range(num_cols)
}}
"""
from timeit import timeit
def time_n(n):
    return timeit(
        "df.groupby('id').agg(agg_dict)", setup=get_setup(n), number=100
    )
times = pd.Series({n: time_n(n) for n in [10, 100, 1000, 10_000, 100_000]})
# 10 4.532458
# 100 4.398949
# 1000 4.426178
# 10000 5.009555
# 100000 11.660783

How to subtract buy/sell rows for each group in dataframe

I have a dataframe that looks like this:
symbol   side  min    max     mean   wav
1000038  buy   0.931  1.0162  0.977  0.992
1000038  sell  0.932  1.0173  0.978  0.995
1000039  buy   0.881  1.00    0.99   0.995
1000039  sell  0.885  1.025   0.995  1.001
What is the most pythonic (efficient) way of generating a new dataframe consisting of the differences between the buys and the sells of each symbol?
For example, for symbol 1000038 the difference between the min sell and the min buy is (0.932 - 0.931) = 0.001.
I am seeking a method that avoids looping through the dataframe rows as I believe this would be inefficient. Instead looking for a grouping type of solution.
I have tried something like this:
df1 = stats[['symbol','side']].join(stats[['mean','wav']].diff(-1))
df2 = df1[df1['side']=='sell']
print(df2)
but it does not seem to work as expected.
You could use the pandas MultiIndex. First, set up the data:
import pandas as pd
columns = ('symbol', 'side', 'min', 'max', 'mean', 'wav')
data = [
    (1000038, 'buy', 0.931, 1.0162, 0.977, 0.992),
    (1000038, 'sell', 0.932, 1.0173, 0.978, 0.995),
    (1000039, 'buy', 0.881, 1.00, 0.99, 0.995),
    (1000039, 'sell', 0.885, 1.025, 0.995, 1.001),
]
df = pd.DataFrame(data=data, columns=columns)
Then, create the index and compute the difference between two data frames:
df2 = df.set_index(['side', 'symbol'], verify_integrity=True)
df2 = df2.sort_index()
df2.loc[('buy',), :] - df2.loc[('sell',), :]
The result is:
min max mean wav
symbol
1000038 -0.001 -0.0011 -0.001 -0.003
1000039 -0.004 -0.0250 -0.005 -0.006
I'm assuming that each symbol (like 1000038) appears exactly twice. You could use fillna() if you have unmatched buys and sells.
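To make the unmatched case concrete, here is a rough sketch reusing df2 from above; the fillna(0) choice is just one option, you may prefer to keep the NaN and inspect them:
buys = df2.loc['buy']                 # cross-section indexed by symbol
sells = df2.loc['sell']
diff = (buys - sells).fillna(0)       # symbols present on only one side align to NaN, then become 0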
If needed, start with drop_duplicates and sort_values to make sure each symbol only has 1xbuy and 1xsell (in that order):
df = df.drop_duplicates(['symbol', 'side']).sort_values(['symbol', 'side'])
Then use either xs (faster) or groupby.diff for the group subtractions.
xs
Set the index to ['side', 'symbol'] and use xs to get cross-sections for buy and sell:
df.set_index(['side', 'symbol']).pipe(lambda df: df.xs('sell') - df.xs('buy'))
# min max mean wav
# symbol
# 1000038 0.001 0.0011 0.001 0.003
# 1000039 0.004 0.0250 0.005 0.006
groupby.diff
Set the index to symbol and subtract the groups using groupby.diff:
df.drop(columns='side').set_index('symbol').groupby('symbol').diff().dropna()
# min max mean wav
# symbol
# 1000038 0.001 0.0011 0.001 0.003
# 1000039 0.004 0.0250 0.005 0.006
- To flip the subtraction order, use diff(-1).
- If your version throws an error with groupby('symbol'), use groupby(level=0).

Q: Python (pandas or other) - I need to "flatten" a data file from many rows, few columns to 1 row many columns

I need to "flatten" a data file from many rows and few columns to one row and many columns.
I currently have a dataframe in pandas (loaded from Excel) and ultimately need to change the way the data is displayed so I can accumulate large amounts of data in a logical manner. The tables below illustrate my requirements.
From:
             1      2
Ryan     0.706  0.071
Chad     0.151  0.831
Stephen  0.750  0.653
To:
1_Ryan 1_Chad 1_Stephen 2_Ryan 2_Chad 2_Stephen
0.706 0.151 0.75 0.071 0.831 0.653
Thank you for any assistance!
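To make the answers below reproducible, the example frame can be rebuilt from the values shown in the question (a small setup sketch):
import pandas as pd

df = pd.DataFrame({1: [0.706, 0.151, 0.750], 2: [0.071, 0.831, 0.653]},
                  index=['Ryan', 'Chad', 'Stephen'])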
One line, for fun
df.unstack().pipe(
lambda s: pd.DataFrame([s.values], columns=s.index.map('{0[0]}_{0[1]}'.format))
)
1_Ryan 1_Chad 1_Stephen 2_Ryan 2_Chad 2_Stephen
0 0.706 0.151 0.75 0.071 0.831 0.653
Let's use stack, swaplevel, to_frame, and T:
df_out = df.stack().swaplevel(1,0).to_frame().T.sort_index(axis=1)
Or better yet (using @piRSquared's unstack solution):
df_out = df.unstack().to_frame().T
df_out.columns = df_out.columns.map('{0[0]}_{0[1]}'.format)
df_out
Output:
1_Chad 1_Ryan 1_Stephen 2_Chad 2_Ryan 2_Stephen
0 0.151 0.706 0.75 0.831 0.071 0.653

Improve performance on processing a big pandas dataframe

I have a big pandas dataframe (1 million rows), and I need better performance in my code to process this data.
My code is below, and a profiling analysis is also provided.
Header of the dataset:
key_id, date, par1, par2, par3, par4, pop, price, value
For each key_id, there is one row for each of the 5000 possible dates.
That makes 200 key_ids * 5000 dates = 1,000,000 rows.
Using four parameters val1, ..., val4, I compute a value for each row, and I want to extract the top 20 dates with the best value for each key_id, and then compute the popularity (pop) of the set of parameters used.
In the end, I want to find the parameters which optimize this popularity.
from itertools import product

def compute_value_col(dataset, val1=0, val2=0, val3=0, val4=0):
    dataset['value'] = dataset['price'] + val1 * dataset['par1'] \
        + val2 * dataset['par2'] + val3 * dataset['par3'] \
        + val4 * dataset['par4']
    return dataset

def params_to_score(dataset, top=10, val1=0, val2=0, val3=0, val4=0):
    dataset = compute_value_col(dataset, val1, val2, val3, val4)
    dataset = dataset.sort(['key_id', 'value'], ascending=True)
    dataset = dataset.groupby('key_id').head(top).reset_index(drop=True)
    return dataset['pop'].sum()

def optimize(dataset, top):
    for i, j, k, l in product(xrange(10), xrange(10), xrange(10), xrange(10)):
        print i, j, k, l, params_to_score(dataset, top, 10*i, 10*j, 10*k, 10*l)

optimize(my_dataset, 20)
I need to improve the performance of this.
Here is a %prun output, after 49 calls to params_to_score:
ncalls tottime percall cumtime percall filename:lineno(function)
98 2.148 0.022 2.148 0.022 {pandas.algos.take_2d_axis1_object_object}
49 1.663 0.034 9.852 0.201 <ipython-input-59-88fc8127a27f>:150(params_to_score)
49 1.311 0.027 1.311 0.027 {method 'get_labels' of 'pandas.hashtable.Float64HashTable' objects}
49 1.219 0.025 1.223 0.025 {pandas.algos.groupby_indices}
49 0.875 0.018 0.875 0.018 {method 'get_labels' of 'pandas.hashtable.PyObjectHashTable' objects}
147 0.452 0.003 0.457 0.003 index.py:581(is_unique)
343 0.193 0.001 0.193 0.001 {method 'copy' of 'numpy.ndarray' objects}
1 0.136 0.136 10.058 10.058 <ipython-input-59-88fc8127a27f>:159(optimize)
147 0.122 0.001 0.122 0.001 {method 'argsort' of 'numpy.ndarray' objects}
833 0.112 0.000 0.112 0.000 {numpy.core.multiarray.empty}
49 0.109 0.002 0.109 0.002 {method 'get_labels_groupby' of 'pandas.hashtable.Int64HashTable' objects}
98 0.083 0.001 0.083 0.001 {pandas.algos.take_2d_axis1_float64_float64}
49 0.078 0.002 1.460 0.030 groupby.py:1014(_cumcount_array)
I think I could split the big dataframe into small dataframes by key_id to improve the sort time, since I only want the top 20 dates with the best value for each key_id, so sorting by key is only there to separate the different keys.
But I would appreciate any advice: how can I improve the efficiency of this code, given that I will need to run thousands of params_to_score calls?
EDIT: @Jeff
Thanks a lot for your help!
I tried using nsmallest instead of sort & head, but strangely it is 5-6 times slower when I benchmark the following two functions:
def to_bench1(dataset):
    dataset = dataset.sort(['key_id', 'value'], ascending=True)
    dataset = dataset.groupby('key_id').head(50).reset_index(drop=True)
    return dataset['pop'].sum()

def to_bench2(dataset):
    dataset = dataset.set_index('pop')
    dataset = dataset.groupby(['key_id'])['value'].nsmallest(50).reset_index()
    return dataset['pop'].sum()
On a sample of ~100000 rows, to_bench2 performs in 0.5 seconds, while to_bench1 takes only 0.085 seconds on average.
After profiling to_bench2, I notice many more isinstance calls compared to before, but I do not know where they come from...
The way to make this significantly faster is as follows.
Create some sample data:
In [148]: df = DataFrame({'A' : range(5), 'B' : [1,1,1,2,2] })
Define the analogue of compute_value_col like you have:
In [149]: def f(p):
   .....:     return DataFrame({'A': df['A'] * p, 'B': df.B})
   .....:
These are the cases (you probably want a list of tuples), i.e. the cartesian product of all of the cases that you want to feed into the above function:
In [150]: parms = [1,3]
Create a new data frame that has the full set of values, keyed by each of the parms. This is basically a broadcasting operation.
In [151]: df2 = pd.concat([ f(p) for p in parms ],keys=parms,names=['parm','indexer']).reset_index()
In [155]: df2
Out[155]:
parm indexer A B
0 1 0 0 1
1 1 1 1 1
2 1 2 2 1
3 1 3 3 2
4 1 4 4 2
5 3 0 0 1
6 3 1 3 1
7 3 2 6 1
8 3 3 9 2
9 3 4 12 2
Here's the magic. Group by whatever columns you want, including parm as the first one (or possibly several). Then do a partial sort (this is what nlargest does); this is more efficient than sort & head (well, it depends a bit on the group density). Sum at the end (again by the groupers that we care about, as you are doing a 'partial' reduction).
In [153]: df2.groupby(['parm','B']).A.nlargest(2).sum(level=['parm','B'])
Out[153]:
parm B
1 1 3
2 7
3 1 9
2 21
dtype: int64
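Mapped back onto the original problem, the same broadcasting idea would mean stacking one value column per parameter tuple under a parm key and reducing once. A rough sketch for a small batch of parameter tuples (valued and param_batch are illustrative names, sort_values is the modern spelling of sort, and a full 10**4-combination grid would be far too large to materialize at once):
param_batch = [(0, 0, 0, 0), (10, 0, 0, 0), (0, 10, 0, 0)]

def valued(dataset, vals):
    # keep only the columns needed for the reduction, plus the computed value
    out = dataset[['key_id', 'pop']].copy()
    out['value'] = (dataset['price']
                    + vals[0] * dataset['par1'] + vals[1] * dataset['par2']
                    + vals[2] * dataset['par3'] + vals[3] * dataset['par4'])
    return out

big = pd.concat([valued(my_dataset, v) for v in param_batch],
                keys=range(len(param_batch)), names=['parm', 'row']).reset_index('parm')

top = (big.sort_values(['parm', 'key_id', 'value'])
          .groupby(['parm', 'key_id'])
          .head(20))                        # 20 best (smallest) values per key_id and parm
scores = top.groupby('parm')['pop'].sum()   # one popularity score per parameter tuple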

How to barplot Pandas dataframe columns aligning by sub-index?

I have a pandas dataframe df that contains two stocks' financial ratio data:
>>> df
ROIC ROE
STK_ID RPT_Date
600141 20110331 0.012 0.022
20110630 0.031 0.063
20110930 0.048 0.103
20111231 0.063 0.122
20120331 0.017 0.033
20120630 0.032 0.077
20120930 0.050 0.120
600809 20110331 0.536 0.218
20110630 0.734 0.278
20110930 0.806 0.293
20111231 1.679 0.313
20120331 0.666 0.165
20120630 1.039 0.257
20120930 1.287 0.359
I am trying to plot the ratios 'ROIC' & 'ROE' of stocks '600141' & '600809' together against the same 'RPT_Date' to benchmark their performance.
df.plot(kind='bar') gives the chart below.
The chart draws '600141' on the left side and '600809' on the right side. It is somewhat inconvenient to compare the 'ROIC' & 'ROE' of the two stocks on the same report date 'RPT_Date'.
What I want is to put the 'ROIC' & 'ROE' bars indexed by the same 'RPT_Date' side by side in the same group (4 bars per group), with the x-axis labelled only by 'RPT_Date'; that will clearly show the difference between the two stocks.
How can I do that?
And if I use df.plot(kind='line'), it only shows two lines, but there should be four lines (2 stocks * 2 ratios).
Is this a bug, or what can I do to correct it? Thanks.
I am using Pandas 0.8.1.
If you unstack STK_ID, you can create side-by-side plots per RPT_Date.
In [55]: dfu = df.unstack("STK_ID")
In [56]: fig, axes = subplots(2,1)
In [57]: dfu.plot(ax=axes[0], kind="bar")
Out[57]: <matplotlib.axes.AxesSubplot at 0xb53070c>
In [58]: dfu.plot(ax=axes[1])
Out[58]: <matplotlib.axes.AxesSubplot at 0xb60e8cc>
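The session above relies on IPython's pylab mode for subplots; outside of it, roughly the same thing in a plain script would look like the sketch below (plt.subplots is the standard matplotlib call, not specific to pandas 0.8.1):
import matplotlib.pyplot as plt

dfu = df.unstack("STK_ID")        # columns become (ratio, STK_ID) pairs
fig, axes = plt.subplots(2, 1)
dfu.plot(ax=axes[0], kind="bar")  # 4 bars per RPT_Date group
dfu.plot(ax=axes[1])              # 4 lines: 2 stocks x 2 ratios
plt.show()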
