Concatenate lists of strings by day in Pandas Dataframe - python

I have the following:
import pandas as pd
import numpy as np
documents = [['Human', 'machine', 'interface'],
             ['A', 'survey', 'of', 'user'],
             ['The', 'EPS', 'user'],
             ['System', 'and', 'human'],
             ['Relation', 'of', 'user'],
             ['The', 'generation'],
             ['The', 'intersection'],
             ['Graph', 'minors'],
             ['Graph', 'minors', 'a']]
df = pd.DataFrame({'date': np.array(['2014-05-01', '2014-05-02', '2014-05-10', '2014-05-10',
                                     '2014-05-15', '2014-05-15', '2014-05-20', '2014-05-20',
                                     '2014-05-20'], dtype=np.datetime64),
                   'text': documents})
There are only 5 unique days. I would like to group by day to end up with the following:
documents2 = [['Human', 'machine', 'interface'],
              ['A', 'survey', 'of', 'user'],
              ['The', 'EPS', 'user', 'System', 'and', 'human'],
              ['Relation', 'of', 'user', 'The', 'generation'],
              ['The', 'intersection', 'Graph', 'minors', 'Graph', 'minors', 'a']]
df2 = pd.DataFrame({'date': np.array(['2014-05-01', '2014-05-02', '2014-05-10',
                                      '2014-05-15', '2014-05-20'], dtype=np.datetime64),
                    'text': documents2})

IIUC, you can aggregate with sum:
df.groupby('date').text.sum() # or .agg(sum)
date
2014-05-01 [Human, machine, interface]
2014-05-02 [A, survey, of, user]
2014-05-10 [The, EPS, user, System, and, human]
2014-05-15 [Relation, of, user, The, generation]
2014-05-20 [The, intersection, Graph, minors, Graph, mino...
Name: text, dtype: object
Or flatten your lists with a list comprehension, which has the same time complexity as chain.from_iterable but needs no extra import:
df.groupby('date').text.agg(lambda x: [item for z in x for item in z])

sum has already been shown in another answer, so let me propose a much faster (and more memory-efficient) solution using chain.from_iterable:
from itertools import chain
df.groupby('date').text.agg(lambda x: list(chain.from_iterable(x)))
date
2014-05-01 [Human, machine, interface]
2014-05-02 [A, survey, of, user]
2014-05-10 [The, EPS, user, System, and, human]
2014-05-15 [Relation, of, user, The, generation]
2014-05-20 [The, intersection, Graph, minors, Graph, mino...
Name: text, dtype: object
The problem with sum is that, for every two lists that are summed, a new intermediate result is created. So the operation is O(N^2). You can cut this down to linear time using chain.
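To see this outside pandas, here is a minimal sketch on plain Python lists (not the DataFrame above): repeated + copies every intermediate list, while chain makes a single linear pass.
from functools import reduce
from itertools import chain
import operator

lists = [['a', 'b'], ['c'], ['d', 'e', 'f']]

# sum-style: each + allocates a brand-new list, so total work grows quadratically
flat_sum = reduce(operator.add, lists, [])

# chain-style: one linear pass over all the elements
flat_chain = list(chain.from_iterable(lists))

assert flat_sum == flat_chain == ['a', 'b', 'c', 'd', 'e', 'f']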
The performance difference is apparent even with a relatively small DataFrame.
df = pd.concat([df] * 1000)
%timeit df.groupby('date').text.sum()
%timeit df.groupby('date').text.agg('sum')
%timeit df.groupby('date').text.agg(lambda x: [item for z in x for item in z])
%timeit df.groupby('date').text.agg(lambda x: list(chain.from_iterable(x)))
71.8 ms ± 5.02 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
68.9 ms ± 2.96 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
2.67 ms ± 199 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
2.25 ms ± 184 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
The problem will be more pronounced when the groups are larger, particularly because sum is not vectorised for object dtype.

Regex in python dataframe: count occurrences of pattern

I want to count how often a regex pattern (the surrounding characters are needed to identify it) occurs across multiple dataframe columns. I found a solution, but it seems a little slow. Is there a more sophisticated way?
column_A               column_B        column_C
Test • test abc        winter • sun    snow rain blank
blabla • summer abc    break • Data    test letter • stop.
So far I created a solution which is slow:
print(df["column_A"].str.count("(?<=[A-Za-z]) • (?=[A-Za-z])").sum() + df["column_B"].str.count("(?<=[A-Za-z]) • (?=[A-Za-z])").sum() + df["column_C"].str.count("(?<=[A-Za-z]) • (?=[A-Za-z])").sum())
str.count can be applied across the whole dataframe without hard-coding each column this way. Try:
sum(df.apply(lambda x: x.str.count("(?<=[A-Za-z]) • (?=[A-Za-z])").sum()))
I have tried this with a 1000 * 1000 dataframe. Here is a benchmark for your reference.
%timeit sum(df.apply(lambda x: x.str.count("(?<=[A-Za-z]) • (?=[A-Za-z])").sum()))
1.97 s ± 54.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
You can use a generator expression with re.search, which brings 938 µs down to about 26 µs (make sure not to build a list; use a generator). Note that this counts rows containing at least one match rather than every occurrence, which gives the same total here because no cell has more than one separator.
res = sum(sum(True for item in df[col] if re.search("(?<=[A-Za-z]) • (?=[A-Za-z])", item))
for col in ['column_A', 'column_B','column_C'])
print(res)
# 5
Benchmark:
%%timeit 
sum(sum(True for item in df[col] if re.search("(?<=[A-Za-z]) • (?=[A-Za-z])", item)) for col in ['column_A', 'column_B','column_C'])
# 26 µs ± 2.2 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%%timeit 
df["column_A"].str.count("(?<=[A-Za-z]) • (?=[A-Za-z])").sum() + df["column_B"].str.count("(?<=[A-Za-z]) • (?=[A-Za-z])").sum() + df["column_C"].str.count("(?<=[A-Za-z]) • (?=[A-Za-z])").sum()
# 938 µs ± 149 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
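For completeness, a small sketch that keeps the exact str.count semantics (every occurrence, not just rows containing one) while still compiling the pattern only once; the DataFrame is reconstructed from the question's table, so the exact cell layout is an assumption:
import re
import pandas as pd

# sample frame reconstructed from the question's table (layout is an assumption)
df = pd.DataFrame({
    "column_A": ["Test • test abc", "blabla • summer abc"],
    "column_B": ["winter • sun", "break • Data"],
    "column_C": ["snow rain blank", "test letter • stop."],
})

pattern = re.compile("(?<=[A-Za-z]) • (?=[A-Za-z])")  # compile once, reuse per cell

total = sum(
    len(pattern.findall(item))
    for col in ["column_A", "column_B", "column_C"]
    for item in df[col]
)
print(total)  # 5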

Performance Comparisons between Series and Numpy

I am building a performance comparison table between NumPy and pandas Series.
Two cases caught my eye; any help will be really appreciated.
We are told to avoid loops with NumPy and Series, but I came across one scenario where a for loop performs better.
In the code below I calculate the density of planets with and without a for loop:
mass= pd.Series([0.330, 4.87, 5.97, 0.073, 0.642, 1898, 568, 86.8, 102, 0.0146], index = ['MERCURY', 'VENUS', 'EARTH', 'MOON', 'MARS', 'JUPITER', 'SATURN', 'URANUS', 'NEPTUNE', 'PLUTO'])
diameter = pd.Series([4879, 12104, 12756, 3475, 6792, 142984, 120536, 51118, 49528, 2370], index = ['MERCURY', 'VENUS', 'EARTH', 'MOON', 'MARS', 'JUPITER', 'SATURN', 'URANUS', 'NEPTUNE', 'PLUTO'])
%%timeit -n 1000
density = mass / (np.pi * np.power(diameter, 3) /6)
1000 loops, best of 3: 617 µs per loop
%%timeit -n 1000
density = pd.Series()
for planet in mass.index:
    density[planet] = mass[planet] / ((np.pi * np.power(diameter[planet], 3)) / 6)
1000 loops, best of 3: 183 µs per loop
Second, I am trying to replace NaN values in a Series using two approaches.
Why does the first approach work faster? My guess is that the second approach converts the Series object into an N-d array.
sample2 = pd.Series([1, 2, 3, 4325, 23, 3, 4213, 102, 89, 4, np.nan, 6, 803, 43, np.nan, np.nan, np.nan])
x = np.mean(sample2)
x
%%timeit -n 10000
sample3 = pd.Series(np.where(np.isnan(sample2), x, sample2))
10000 loops, best of 3: 166 µs per loop
%%timeit -n 10000
sample2[np.isnan(sample2)] = x
10000 loops, best of 3: 1.08 ms per loop
In an ipython console session:
In [1]: import pandas as pd
In [2]: mass = pd.Series([0.330, 4.87, 5.97, 0.073, 0.642, 1898, 568, 86.8, 102, 0.0146],
   ...:                  index=['MERCURY', 'VENUS', 'EARTH', 'MOON', 'MARS', 'JUPITER',
   ...:                         'SATURN', 'URANUS', 'NEPTUNE', 'PLUTO'])
   ...: diameter = pd.Series([4879, 12104, 12756, 3475, 6792, 142984, 120536, 51118, 49528, 2370],
   ...:                      index=['MERCURY', 'VENUS', 'EARTH', 'MOON', 'MARS', 'JUPITER',
   ...:                             'SATURN', 'URANUS', 'NEPTUNE', 'PLUTO'])
   ...:
In [3]: mass
Out[3]:
MERCURY 0.3300
VENUS 4.8700
EARTH 5.9700
MOON 0.0730
MARS 0.6420
JUPITER 1898.0000
SATURN 568.0000
URANUS 86.8000
NEPTUNE 102.0000
PLUTO 0.0146
dtype: float64
In [4]: diameter
Out[4]:
MERCURY 4879
VENUS 12104
EARTH 12756
MOON 3475
MARS 6792
JUPITER 142984
SATURN 120536
URANUS 51118
NEPTUNE 49528
PLUTO 2370
dtype: int64
Your density calculation creates a Series with the same index. Here the indices of the two Series match, but in general pandas will align them for you.
In [5]: density = mass / (np.pi * np.power(diameter, 3) /6)
In [6]: density
Out[6]:
MERCURY 5.426538e-12
VENUS 5.244977e-12
EARTH 5.493286e-12
MOON 3.322460e-12
MARS 3.913302e-12
JUPITER 1.240039e-12
SATURN 6.194402e-13
URANUS 1.241079e-12
NEPTUNE 1.603427e-12
PLUTO 2.094639e-12
dtype: float64
In [7]: timeit density = mass / (np.pi * np.power(diameter, 3) /6)
532 µs ± 437 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
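As an aside (not part of the session above), a minimal sketch of the alignment pandas performs when the indices do not match: labels are paired up, and any label present on only one side becomes NaN.
import pandas as pd

a = pd.Series([1.0, 2.0, 3.0], index=['x', 'y', 'z'])
b = pd.Series([10.0, 20.0, 30.0], index=['y', 'z', 'w'])

# pandas aligns on the union of the labels before dividing
print(a / b)
# w     NaN
# x     NaN
# y    0.20
# z    0.15
# dtype: float64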
Since the indices of mass and diameter match, we can get the same numbers by using the underlying numpy arrays:
In [8]: mass.values/(np.pi * np.power(diameter.values, 3)/6)
Out[8]:
array([5.42653818e-12, 5.24497707e-12, 5.49328558e-12, 3.32246038e-12,
3.91330208e-12, 1.24003876e-12, 6.19440202e-13, 1.24107933e-12,
1.60342694e-12, 2.09463905e-12])
In [9]: timeit mass.values/(np.pi * np.power(diameter.values, 3)/6)
11.5 µs ± 67.9 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
This is much faster: numpy isn't spending time matching indices, and it isn't constructing a new Series.
Your iteration approach:
In [11]: %%timeit
    ...: density = pd.Series(dtype=float)
    ...: for planet in mass.index:
    ...:     density[planet] = mass[planet] / ((np.pi * np.power(diameter[planet], 3)) / 6)
    ...:
7.36 ms ± 312 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
This is much slower.
Out of curiosity, let's initialise a Series and fill it with the numpy calculation:
In [18]: %%timeit
...: density = pd.Series(index=mass.index, dtype=float)
...: density[:] = mass.values/(np.pi * np.power(diameter.values, 3)/6)
241 µs ± 8.97 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
That is about 2x better than [7], but still quite a bit slower than pure numpy. Pandas is built on numpy arrays (here both the index and the values are arrays), but it adds significant overhead relative to pure numpy code.
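A related sketch (reusing the mass and diameter Series from above): do the arithmetic in numpy and wrap the result in a Series once, rather than pre-allocating and filling. This keeps most of the numpy speed while still returning a labelled result.
import numpy as np
import pandas as pd

# assumes mass and diameter are the Series defined earlier
values = mass.to_numpy() / (np.pi * np.power(diameter.to_numpy(), 3) / 6)
density = pd.Series(values, index=mass.index)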

Are nested comprehensions not optimized in Python?

The question is: do I really have to hand-optimize this, or is there a better explanation for this puzzling comprehension?
Thanks! And please don't downvote my question... Even FORTRAN has been able to optimize nested loops since 1990... or earlier.
Look at the example.
dict_groups = [{'name': 'Новые Альбомы', 'gid': 4100014},
               {'name': 'Synthpop [Futurepop, Retrowave, Electropop]', 'gid': 8564},
               {'name': 'E:\\music\\leftfield', 'gid': 101522128},
               {'name': 'Бренд одежды | MEDICINE', 'gid': 134709480},
               {'name': 'Другая Музыка', 'gid': 35486626},
               {'name': 'E:\\music\\trip-hop', 'gid': 27683540},
               {'name': 'Depeche Mode', 'gid': 125927592}]
x = [{'gid': 35486626}, {'gid': 134709480}, {'gid': 27683540}]
I need to get:
rez = [{'name': 'Другая Музыка', 'gid': 35486626},
       {'name': 'E:\\music\\trip-hop', 'gid': 27683540},
       {'name': 'Бренд одежды | MEDICINE', 'gid': 134709480}]
One of the solutions is:
x_val = tuple(d["gid"] for d in x)
rez = [dict_el for dict_el in dict_groups if dict_el["gid"] in x_val]
with timing
%timeit x_val = tuple(d["gid"] for d in x)
1.55 µs ± 81.1 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
%timeit [dict_el for dict_el in dict_groups if dict_el["gid"] in x_val]
2.19 µs ± 93.6 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
The one-line nested comprehension gives:
%timeit [dict_el for dict_el in dict_groups if dict_el["gid"] in tuple(d["gid"] for d in x)]
11.9 µs ± 756 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
This is much slower! It looks like the expression tuple(d["gid"] for d in x) is evaluated for every element of the outer loop:
7 * 1.55 + 2.19 = 13.04 µs, which is close to the measured 11.9 µs.
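The straightforward fix is to hoist that expression out of the comprehension, as in the two-step solution above; a set also makes the membership test O(1). A minimal sketch, using dict_groups and x from above:
wanted = {d["gid"] for d in x}                        # built once, O(1) lookups
rez = [g for g in dict_groups if g["gid"] in wanted]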

Slice a DataFrame into smaller DataFrames

I'm using 12 hours of sensor data at 25 Hz that I query from MongoDB into a dataframe.
I'm trying to extract a list or a dict of 1 minute dataframes from the 12 hours.
I use a window of 1 minute and a stride/step of 10 seconds.
The goal is to build a dataset by creating a list or dict of 1 minute dataframes/samples from the 12 hours of data, which will be converted to tensors and fed to a deep learning model.
The dataframe has a datetime index and 4 columns of sensor values.
Here is what part of the data looks like:
A B C D
2020-06-17 22:00:00.000 1.052 -0.147 0.836 0.623
2020-06-17 22:00:00.040 1.011 -0.147 0.820 0.574
2020-06-17 22:00:00.080 1.067 -0.131 0.868 0.607
2020-06-17 22:00:00.120 1.033 -0.163 0.820 0.607
2020-06-17 22:00:00.160 1.030 -0.147 0.820 0.607
Below is sample code similar to how I extract windows of 1 minute of data. For 12 hours it takes 5 minutes, which is a long time.
Any ideas on how to reduce the running time in this case?
step = 10*25
w = 60*25
df  # 12 hours of data
sensor_dfs = []
df_range = range(0, df.shape[0]-step, step)
for a in df_range:
    sample = df.iloc[a:a+w]
    sensor_dfs.append(sample)
I created random data and made the following experiments looking at runtime:
# create random normal samples
w = 60*25          # 1 minute window
step = w           # no overlap
num_samples = 50000
data = np.random.normal(size=(num_samples, 3))
date_rng = pd.date_range(start="2020-07-09 00:00:00.000", freq="40ms", periods=num_samples)
data = pd.DataFrame(data, columns=["x", "y", "z"], index=date_rng)
data.head()
x y z
2020-07-09 00:00:00.000 -1.062264 -0.008656 0.399642
2020-07-09 00:00:00.040 0.182398 -1.014290 -1.108719
2020-07-09 00:00:00.080 -0.489814 -0.020697 0.651120
2020-07-09 00:00:00.120 -0.776405 -0.596601 0.611516
2020-07-09 00:00:00.160 0.663900 0.149909 -0.552779
numbers are of type float64
data.dtypes
x float64
y float64
z float64
dtype: object
using for loops
minute_samples = []
for i in range(0, len(data)-w, step):
    minute_samples.append(data.iloc[i:i+w])
result: 6.45 ms ± 256 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
using list comprehension
minute_samples=[data.iloc[i:i+w] for i in range(0,len(data)-w,step)]
result: 6.13 ms ± 181 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
using Grouper with list comprehension
minute_samples=[df for i, df in data.groupby(pd.Grouper(freq="1T"))]
result: 7.89 ms ± 382 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
using grouper with dict
minute_samples=dict(tuple(data.groupby(pd.Grouper(freq="1T"))))
result: 7.41 ms ± 38 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
resample is also an option here, but since it uses Grouper behind the scenes I don't think it will differ in runtime.
It seems like the list comprehension is slightly better than the rest.
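If the 1 minute samples only exist to be turned into tensors, one more option is to window the underlying numpy array directly instead of building thousands of intermediate DataFrames. A sketch (assuming numpy >= 1.20 for sliding_window_view, and note it drops the per-window timestamps):
import numpy as np

arr = data.to_numpy()                                         # shape (num_samples, 3)
windows = np.lib.stride_tricks.sliding_window_view(arr, w, axis=0)[::step]
windows = windows.transpose(0, 2, 1)                          # shape (n_windows, w, 3)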

Pandas DatetimeIndex: Number of periods in a frequency string?

How can I get a count of the number of periods in a Pandas DatetimeIndex using a frequency string (offset alias)? For example, let's say I have the following DatetimeIndex:
idx = pd.date_range("2019-03-01", periods=10000, freq='5T')
I would like to know how many 5 minute periods are in a week, or '7D'. I can calculate this "manually":
periods = (7*24*60)//5
Or I can get the length of a dummy index:
len(pd.timedelta_range(start='1 day', end='8 days', freq='5T'))
Neither approach seems very efficient. Is there a better way using Pandas date functionality?
Try using numpy:
from datetime import timedelta
len(np.arange(pd.Timedelta('1 days'), pd.Timedelta('8 days'), timedelta(minutes=5)))
out:
2016
My testing. First, import time:
import time
The OP's solution:
start_time = time.time()
len(pd.timedelta_range(start='1 day', end='8 days', freq='5T'))
print((time.time() - start_time))
out:
0.0011057853698730469
using numpy
start_time = time.time()
len(np.arange(pd.Timedelta('1 day'), pd.Timedelta('8 days'), timedelta(minutes=5)))
print((time.time() - start_time))
out:
0.0001723766326904297
Following the suggestion of @meW, here is the performance test using timeit.
using timedelta_range:
%timeit len(pd.timedelta_range(start='1 day', end='8 days', freq='5T'))
out:
91.1 µs ± 1.31 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
using numpy:
%timeit len(np.arange(pd.Timedelta('1 day'), pd.Timedelta('8 days'), timedelta(minutes=5)))
out:
16.3 µs ± 196 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
I finally figured out a reasonable solution:
pd.to_timedelta('7D') // idx.freq
This has the advantage that I can specify a range using a frequency string (offset alias) while the period frequency is inferred from the index. The numpy solution suggested by @Terry is still the fastest where speed matters.
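A short usage sketch of that solution (the '1D' line is my own addition, assuming the same floor division works for any offset alias string):
import pandas as pd

idx = pd.date_range("2019-03-01", periods=10000, freq='5T')

print(pd.to_timedelta('7D') // idx.freq)   # 2016 five-minute periods in a week
print(pd.to_timedelta('1D') // idx.freq)   # 288 five-minute periods in a day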
