Performance Comparisons between Series and Numpy - python

I am building a performance comparison table between numpy and pandas Series, and two instances caught my eye. Any help would be appreciated.
We are usually told to avoid loops with numpy and Series, but I came across one scenario where the for loop performs better.
In the code below I calculate planet densities with and without a for loop:
import numpy as np
import pandas as pd

mass = pd.Series([0.330, 4.87, 5.97, 0.073, 0.642, 1898, 568, 86.8, 102, 0.0146],
                 index=['MERCURY', 'VENUS', 'EARTH', 'MOON', 'MARS', 'JUPITER', 'SATURN', 'URANUS', 'NEPTUNE', 'PLUTO'])
diameter = pd.Series([4879, 12104, 12756, 3475, 6792, 142984, 120536, 51118, 49528, 2370],
                     index=['MERCURY', 'VENUS', 'EARTH', 'MOON', 'MARS', 'JUPITER', 'SATURN', 'URANUS', 'NEPTUNE', 'PLUTO'])
%%timeit -n 1000
density = mass / (np.pi * np.power(diameter, 3) /6)
1000 loops, best of 3: 617 µs per loop
%%timeit -n 1000
density = pd.Series()
for planet in mass.index:
    density[planet] = mass[planet] / ((np.pi * np.power(diameter[planet], 3)) / 6)
1000 loops, best of 3: 183 µs per loop
Second, I am replacing NaN values in a Series using two approaches.
Why does the first approach work faster? My guess is that the second approach converts the Series object into an ndarray.
sample2 = pd.Series([1, 2, 3, 4325, 23, 3, 4213, 102, 89, 4, np.nan, 6, 803, 43, np.nan, np.nan, np.nan])
x = np.mean(sample2)
x
%%timeit -n 10000
sample3 = pd.Series(np.where(np.isnan(sample2), x, sample2))
10000 loops, best of 3: 166 µs per loop
%%timeit -n 10000
sample2[np.isnan(sample2)] =x
10000 loops, best of 3: 1.08 ms per loop
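As an aside (my addition, not part of the original question), pandas has a built-in for this kind of replacement; a minimal sketch using Series.fillna that can be timed against the two approaches above:
import numpy as np
import pandas as pd

sample2 = pd.Series([1, 2, 3, 4325, 23, 3, 4213, 102, 89, 4, np.nan,
                     6, 803, 43, np.nan, np.nan, np.nan])

# Series.mean() skips NaN by default; fillna returns a new Series
# unless inplace=True is passed.
sample3 = sample2.fillna(sample2.mean())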

In an ipython console session:
In [1]: import numpy as np
   ...: import pandas as pd
In [2]: mass = pd.Series([0.330, 4.87, 5.97, 0.073, 0.642, 1898, 568, 86.8, 102, 0.0146],
   ...:                  index=['MERCURY', 'VENUS', 'EARTH', 'MOON', 'MARS', 'JUPITER', 'SATURN', 'URANUS', 'NEPTUNE', 'PLUTO'])
   ...: diameter = pd.Series([4879, 12104, 12756, 3475, 6792, 142984, 120536, 51118, 49528, 2370],
   ...:                      index=['MERCURY', 'VENUS', 'EARTH', 'MOON', 'MARS', 'JUPITER', 'SATURN', 'URANUS', 'NEPTUNE', 'PLUTO'])
   ...:
In [3]: mass
Out[3]:
MERCURY 0.3300
VENUS 4.8700
EARTH 5.9700
MOON 0.0730
MARS 0.6420
JUPITER 1898.0000
SATURN 568.0000
URANUS 86.8000
NEPTUNE 102.0000
PLUTO 0.0146
dtype: float64
In [4]: diameter
Out[4]:
MERCURY 4879
VENUS 12104
EARTH 12756
MOON 3475
MARS 6792
JUPITER 142984
SATURN 120536
URANUS 51118
NEPTUNE 49528
PLUTO 2370
dtype: int64
Your density calculation creates a Series with the same index. Here the indices of the two Series already match, but in general pandas will align differing indices.
In [5]: density = mass / (np.pi * np.power(diameter, 3) /6)
In [6]: density
Out[6]:
MERCURY 5.426538e-12
VENUS 5.244977e-12
EARTH 5.493286e-12
MOON 3.322460e-12
MARS 3.913302e-12
JUPITER 1.240039e-12
SATURN 6.194402e-13
URANUS 1.241079e-12
NEPTUNE 1.603427e-12
PLUTO 2.094639e-12
dtype: float64
In [7]: timeit density = mass / (np.pi * np.power(diameter, 3) /6)
532 µs ± 437 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Since indices match, we can get the same numbers by using the numpy values arrays:
In [8]: mass.values/(np.pi * np.power(diameter.values, 3)/6)
Out[8]:
array([5.42653818e-12, 5.24497707e-12, 5.49328558e-12, 3.32246038e-12,
3.91330208e-12, 1.24003876e-12, 6.19440202e-13, 1.24107933e-12,
1.60342694e-12, 2.09463905e-12])
In [9]: timeit mass.values/(np.pi * np.power(diameter.values, 3)/6)
11.5 µs ± 67.9 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
This is much faster: numpy doesn't spend time aligning indices, and it doesn't construct a new Series.
Your iteration approach:
In [11]: %%timeit
    ...: density = pd.Series(dtype=float)
    ...: for planet in mass.index:
    ...:     density[planet] = mass[planet] / ((np.pi * np.power(diameter[planet], 3)) / 6)
    ...:
7.36 ms ± 312 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
This is much slower.
Out of curiosity, let's initialize a Series and fill it with the numpy calculation:
In [18]: %%timeit
...: density = pd.Series(index=mass.index, dtype=float)
...: density[:] = mass.values/(np.pi * np.power(diameter.values, 3)/6)
241 µs ± 8.97 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
That's roughly 2x faster than [7], but still quite a bit slower than pure numpy. Pandas is built on numpy arrays (here both the index and the values are arrays), but it adds significant overhead relative to pure numpy code.
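If the alignment overhead matters, a common pattern (a sketch of mine, not from the answer above) is to do the arithmetic on the raw numpy arrays and wrap the result back into a Series with the original index:
import numpy as np
import pandas as pd

# mass and diameter are the Series defined earlier in this session.
# Compute on the underlying arrays, then restore the shared index explicitly.
values = mass.values / (np.pi * np.power(diameter.values, 3) / 6)
density = pd.Series(values, index=mass.index)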

Related

Is a nested comprehension not optimized in Python?

The question is: do I really have to optimize this by hand, or is there a better explanation for this incomprehensible comprehension?
Thanks! And please don't downvote my question... even Fortran has been able to optimize nested loops since the 1990s, or earlier.
Look at the example:
dict_groups = [{'name': 'Новые Альбомы', 'gid': 4100014},
{'name': 'Synthpop [Futurepop, Retrowave, Electropop]', 'gid': 8564},
{'name': 'E:\\music\\leftfield', 'gid': 101522128},
{'name': 'Бренд одежды | MEDICINE', 'gid': 134709480},
{'name': 'Другая Музыка', 'gid': 35486626},
{'name': 'E:\\music\\trip-hop', 'gid': 27683540},
{'name': 'Depeche Mode', 'gid': 125927592}]
x = [{'gid': 35486626},{'gid': 134709480},{'gid': 27683540}]
I need to get:
rez = [{'name': 'Другая Музыка', 'gid': 35486626},
{'name': 'E:\\music\\trip-hop', 'gid': 27683540},
{'name': 'Бренд одежды | MEDICINE', 'gid': 134709480}]
One of the solutions is:
x_val = tuple(d["gid"] for d in x)
rez = [dict_el for dict_el in dict_groups if dict_el["gid"] in x_val]
with timing
%timeit x_val = tuple(d["gid"] for d in x)
1.55 µs ± 81.1 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
%timeit [dict_el for dict_el in dict_groups if dict_el["gid"] in x_val]
2.19 µs ± 93.6 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
The one-line nested-comprehension solution gives:
%timeit [dict_el for dict_el in dict_groups if dict_el["gid"] in tuple(d["gid"] for d in x)]
11.9 µs ± 756 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
This is much slower! It looks like the expression tuple(d["gid"] for d in x) is evaluated on every iteration of the outer comprehension:
7 × 1.55 µs + 2.19 µs = 13.04 µs, which is close to the measured 11.9 µs.
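A minimal sketch of the usual fix (my addition, not from the original post): build the lookup collection once, ideally as a set so membership tests are O(1), and only then run the comprehension. The shortened dict_groups below is an assumption for illustration:
# Subset of the original data, for illustration only.
dict_groups = [{'name': 'Новые Альбомы', 'gid': 4100014},
               {'name': 'Другая Музыка', 'gid': 35486626},
               {'name': 'Depeche Mode', 'gid': 125927592}]
x = [{'gid': 35486626}, {'gid': 134709480}, {'gid': 27683540}]

# Build the lookup once, as a set, so it is not re-evaluated for every
# element of dict_groups and each membership test is O(1).
x_ids = {d['gid'] for d in x}
rez = [d for d in dict_groups if d['gid'] in x_ids]
# rez == [{'name': 'Другая Музыка', 'gid': 35486626}]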

Pandas DatetimeIndex: Number of periods in a frequency string?

How can I get a count of the number of periods in a Pandas DatetimeIndex using a frequency string (offset alias)? For example, let's say I have the following DatetimeIndex:
idx = pd.date_range("2019-03-01", periods=10000, freq='5T')
I would like to know how many 5 minute periods are in a week, or '7D'. I can calculate this "manually":
periods = (7*24*60)//5
Or I can get the length of a dummy index:
len(pd.timedelta_range(start='1 day', end='8 days', freq='5T'))
Neither approach seems very efficient. Is there a better way using Pandas date functionality?
Try using numpy:
from datetime import timedelta
len(np.arange(pd.Timedelta('1 days'), pd.Timedelta('8 days'), timedelta(minutes=5)))
out:
2016
My testing, first import time:
import time
the OP solution:
start_time = time.time()
len(pd.timedelta_range(start='1 day', end='8 days', freq='5T'))
print((time.time() - start_time))
out:
0.0011057853698730469
using numpy
start_time = time.time()
len(np.arange(pd.Timedelta('1 day'), pd.Timedelta('8 days'), timedelta(minutes=5)))
print((time.time() - start_time))
out:
0.0001723766326904297
Following the suggestion of @meW, here are the performance tests using timeit.
using timedelta_range:
%timeit len(pd.timedelta_range(start='1 day', end='8 days', freq='5T'))
out:
91.1 µs ± 1.31 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
using numpy:
%timeit len(np.arange(pd.Timedelta('1 day'), pd.Timedelta('8 days'), timedelta(minutes=5)))
out:
16.3 µs ± 196 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
I finally figured out a reasonable solution:
pd.to_timedelta('7D')//idx.freq
This has the advantage that I can specify the range with a frequency string (offset alias), while the period length is inferred from the DatetimeIndex. The numpy solution suggested by @Terry is still the fastest where speed matters.
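A quick cross-check (my own, not from the original answers) that the manual calculation, the Timedelta floor division, and the numpy approach all agree on the number of 5-minute periods in 7 days:
import numpy as np
import pandas as pd
from datetime import timedelta

idx = pd.date_range("2019-03-01", periods=10000, freq='5T')

manual = (7 * 24 * 60) // 5                          # 2016
via_freq = pd.to_timedelta('7D') // idx.freq         # divide by the index's offset
via_numpy = len(np.arange(pd.Timedelta('1 day'),
                          pd.Timedelta('8 days'),
                          timedelta(minutes=5)))

assert manual == via_freq == via_numpy == 2016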

Concatenate lists of strings by day in Pandas Dataframe

I have the following:
import pandas as pd
import numpy as np
documents = [['Human', 'machine', 'interface'],
['A', 'survey', 'of', 'user'],
['The', 'EPS', 'user'],
['System', 'and', 'human'],
['Relation', 'of', 'user'],
['The', 'generation'],
['The', 'intersection'],
['Graph', 'minors'],
['Graph', 'minors', 'a']]
df = pd.DataFrame({'date': np.array(['2014-05-01', '2014-05-02', '2014-05-10', '2014-05-10', '2014-05-15', '2014-05-15', '2014-05-20', '2014-05-20', '2014-05-20'], dtype=np.datetime64), 'text': documents})
There are only 5 unique days. I would like to group by day to end up with the following:
documents2 = [['Human', 'machine', 'interface'],
['A', 'survey', 'of', 'user'],
['The', 'EPS', 'user', 'System', 'and', 'human'],
['Relation', 'of', 'user', 'The', 'generation'],
['The', 'intersection', 'Graph', 'minors', 'Graph', 'minors', 'a']]
df2 = pd.DataFrame({'date': np.array(['2014-05-01', '2014-05-02', '2014-05-10', '2014-05-15', '2014-05-20'], dtype=np.datetime64), 'text': documents2})
IIUC, you can aggregate by sum
df.groupby('date').text.sum() # or .agg(sum)
date
2014-05-01 [Human, machine, interface]
2014-05-02 [A, survey, of, user]
2014-05-10 [The, EPS, user, System, and, human]
2014-05-15 [Relation, of, user, The, generation]
2014-05-20 [The, intersection, Graph, minors, Graph, mino...
Name: text, dtype: object
Or flatten your lists with a list comprehension, which has the same time complexity as chain.from_iterable but needs no extra import:
df.groupby('date').text.agg(lambda x: [item for z in x for item in z])
sum has already been shown in another answer, so let me propose a much faster (and more memory-efficient) solution using chain.from_iterable:
from itertools import chain
df.groupby('date').text.agg(lambda x: list(chain.from_iterable(x)))
date
2014-05-01 [Human, machine, interface]
2014-05-02 [A, survey, of, user]
2014-05-10 [The, EPS, user, System, and, human]
2014-05-15 [Relation, of, user, The, generation]
2014-05-20 [The, intersection, Graph, minors, Graph, mino...
Name: text, dtype: object
The problem with sum is that, for every two lists that are summed, a new intermediate result is created. So the operation is O(N^2). You can cut this down to linear time using chain.
The performance difference is apparent even with a relatively small DataFrame.
df = pd.concat([df] * 1000)
%timeit df.groupby('date').text.sum()
%timeit df.groupby('date').text.agg('sum')
%timeit df.groupby('date').text.agg(lambda x: [item for z in x for item in z])
%timeit df.groupby('date').text.agg(lambda x: list(chain.from_iterable(x)))
71.8 ms ± 5.02 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
68.9 ms ± 2.96 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
2.67 ms ± 199 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
2.25 ms ± 184 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
The problem will be more pronounced when the groups are larger. Particularly because sum is not vectorised for objects.
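To see the quadratic behaviour outside of pandas, here is a small sketch (my own illustration) comparing sum and chain.from_iterable on plain Python lists of lists:
from itertools import chain
from timeit import timeit

lists = [['a', 'b', 'c']] * 5000  # 5000 small lists

t_sum = timeit(lambda: sum(lists, []), number=3)
t_chain = timeit(lambda: list(chain.from_iterable(lists)), number=3)

# sum copies an ever-growing intermediate list on every addition (O(N^2)),
# while chain.from_iterable walks each element exactly once (O(N)).
print(f"sum:   {t_sum:.3f} s")
print(f"chain: {t_chain:.3f} s")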

Dataframe hierarchical indexing speedup

I have a DataFrame like this:
+----+------------+------------+------------+
| | | type | payment |
+----+------------+------------+------------+
| id | res_number | | |
+----+------------+------------+------------+
| a | 1 | toys | 20000 |
| | 2 | clothing | 30000 |
| | 3 | food | 40000 |
| b | 4 | food | 40000 |
| | 5 | laptop | 30000 |
+----+------------+------------+------------+
As you can see, id and res_number form a hierarchical row index (a MultiIndex), while type and payment are ordinary columns. What I want to get is below:
array([['toys', 20000],
['clothing', 30000],
['food', 40000]])
It is indexed by id (= 'a') regardless of the res_number values, and I know that
df.loc[['a']].values
works perfectly for this. But the indexing is too slow... I have to look up 150000 values.
So I tried indexing the DataFrame with
df.iloc[1].values
but that only returned
array(['toys', 20000])
Is there a faster way to index into a hierarchical structure?
Option 1
pd.DataFrame.xs
df.xs('a').values
Option 2
pd.DataFrame.loc
df.loc['a'].values
Option 3
pd.DataFrame.query
df.query('ilevel_0 == \'a\'').values
Option 4
A bit more roundabout, use pd.MultiIndex.get_level_values to create a mask:
df[df.index.get_level_values(0) == 'a'].values
array([['toys', 20000],
['clothing', 30000],
['food', 40000]], dtype=object)
Option 5
Use .loc with axis parameter
df.loc(axis=0)['a',:].values
Output:
array([['toys', 20000],
['clothing', 30000],
['food', 40000]], dtype=object)
Another option: keep an extra dictionary of the starting and ending positions of each group. (This assumes the index is sorted.)
Option 1: use the first and last positions of a group to query with iloc.
d = {k: slice(v[0], v[-1]+1) for k, v in df.groupby("id").indices.items()}
df.iloc[d["b"]]
array([['food', 40000],
['laptop', 30000]], dtype=object)
Option 2: use the same positions with numpy slice indexing on df.values.
df.values[d["a"]]
Timing
df_testing = pd.DataFrame({"id": [str(v) for v in np.random.randint(0, 100, 150000)],
"res_number": np.arange(150000),
"payment": [v for v in np.random.randint(0, 100000, 150000)]}
).set_index(["id","res_number"]).sort_index()
d = {k: slice(v[0], v[-1]+1) for k, v in df_testing.groupby("id").indices.items()}
# by COLDSPEED
%timeit df_testing.xs('5').values
303 µs ± 17.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
# by OP
%timeit df_testing.loc['5'].values
358 µs ± 22.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
# Tai 1
%timeit df_testing.iloc[d["5"]].values
130 µs ± 3.04 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
# Tai 2
%timeit df_testing.values[d["5"]]
7.26 µs ± 845 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
However, getting d is not costless.
%timeit {k: slice(v[0], v[-1]+1) for k, v in df_testing.groupby("id").indices.items()}
16.3 ms ± 6.89 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
Is creating the extra lookup table d worth it?
The cost of building it is amortized across the queries: with my toy dataset, it takes 16.3 ms / (300 µs − 7 µs) ≈ 56 queries to recover the cost of building the table.
Again, the index needs to be sorted.
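A minimal sketch (my own wrapper, not from the answer) that bundles the lookup-table approach into reusable helpers, assuming a sorted MultiIndex whose first level is the group key:
import pandas as pd

def make_group_lookup(df, level="id"):
    """Map each group key to a slice of row positions; assumes a sorted index."""
    positions = df.groupby(level=level).indices  # key -> array of integer positions
    return {k: slice(v[0], v[-1] + 1) for k, v in positions.items()}

def group_values(df, lookup, key):
    """Return a group's rows as a plain numpy array, bypassing label alignment."""
    return df.values[lookup[key]]

# Usage with a frame shaped like the one in the question:
df = pd.DataFrame(
    {"type": ["toys", "clothing", "food", "food", "laptop"],
     "payment": [20000, 30000, 40000, 40000, 30000]},
    index=pd.MultiIndex.from_tuples(
        [("a", 1), ("a", 2), ("a", 3), ("b", 4), ("b", 5)],
        names=["id", "res_number"]),
)
lookup = make_group_lookup(df)
print(group_values(df, lookup, "a"))  # the three 'a' rows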

pandas groupby apply is really slow

When I call df.groupby([...]).apply(lambda x: ...) the performance is horrible. Is there a faster / more direct way to do this simple query?
To demonstrate my point, here is some code to set up the DataFrame:
import pandas as pd
df = pd.DataFrame(data=
{'ticker': ['AAPL','AAPL','AAPL','IBM','IBM','IBM'],
'side': ['B','B','S','S','S','B'],
'size': [100, 200, 300, 400, 100, 200],
'price': [10.12, 10.13, 10.14, 20.3, 20.2, 20.1]})
price side size ticker
0 10.12 B 100 AAPL
1 10.13 B 200 AAPL
2 10.14 S 300 AAPL
3 20.30 S 400 IBM
4 20.20 S 100 IBM
5 20.10 B 200 IBM
Now here is the part that is extremely slow that I need to speed up:
%timeit avgpx = df.groupby(['ticker','side']) \
.apply(lambda group: (group['size'] * group['price']).sum() / group['size'].sum())
3.23 ms ± 148 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
This produces the correct result, but as you can see above it takes a long time (3.23 ms doesn't seem like much, but this is only 6 rows; on my real dataset it takes forever). The expected output:
ticker side
AAPL B 10.126667
S 10.140000
IBM B 20.100000
S 20.280000
dtype: float64
You can save some time by precomputing the product and getting rid of the apply.
df['scaled_size'] = df['size'] * df['price']
g = df.groupby(['ticker', 'side'])
g['scaled_size'].sum() / g['size'].sum()
ticker side
AAPL B 10.126667
S 10.140000
IBM B 20.100000
S 20.280000
dtype: float64
100 loops, best of 3: 2.58 ms per loop
Sanity Check
df.groupby(['ticker','side']).apply(
lambda group: (group['size'] * group['price']).sum() / group['size'].sum())
ticker side
AAPL B 10.126667
S 10.140000
IBM B 20.100000
S 20.280000
dtype: float64
100 loops, best of 3: 5.02 ms per loop
Getting rid of apply appears to result in a 2X speedup on my machine.
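The same idea can also be written as one chain with named aggregation (pandas 0.25+); a sketch of mine, not part of the original answer:
import pandas as pd

df = pd.DataFrame(
    {'ticker': ['AAPL', 'AAPL', 'AAPL', 'IBM', 'IBM', 'IBM'],
     'side':   ['B', 'B', 'S', 'S', 'S', 'B'],
     'size':   [100, 200, 300, 400, 100, 200],
     'price':  [10.12, 10.13, 10.14, 20.3, 20.2, 20.1]})

# Precompute the size-weighted price, sum both columns per group, then divide.
sums = (df.assign(notional=df['size'] * df['price'])
          .groupby(['ticker', 'side'])
          .agg(notional=('notional', 'sum'), size=('size', 'sum')))
avgpx = sums['notional'] / sums['size']
print(avgpx)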
