Fast replacement of tzinfo of a pandas.Series of datetime - python

I have a pandas.Series of datetimes and need to replace the tzinfo for every element in it.
I know how to do it using apply with a Python function, but it's very slow: ~16 s for 1M elements on a MacBook Pro.
In [70]: import pandas as pd, pytz
In [71]: s = pd.date_range('2015-1-1', freq='h', periods=10**6).to_series().reset_index(drop=True)
In [72]: %timeit s.apply(lambda x: x.replace(tzinfo=pytz.utc))
1 loops, best of 3: 16.7 s per loop
Is there a numpy ufunc for this?

Use dt.tz_localize:
In [33]:
import pytz
%timeit s.dt.tz_localize(pytz.utc)
%timeit s.apply(lambda x: x.replace(tzinfo=pytz.utc))
10 loops, best of 3: 107 ms per loop
1 loops, best of 3: 10.4 s per loop
As you can see, it's ~100x faster.
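One caveat worth noting: tz_localize applies only to naive timestamps; if the series is already tz-aware you need tz_convert instead. A minimal sketch of the distinction (the target timezone is just an example):
import pandas as pd
import pytz
s = pd.date_range('2015-1-1', freq='h', periods=10).to_series().reset_index(drop=True)
# tz_localize attaches a timezone to naive timestamps; it is the
# vectorized equivalent of the replace(tzinfo=...) loop above
utc = s.dt.tz_localize(pytz.utc)
# on an already tz-aware series tz_localize raises a TypeError;
# tz_convert is the right tool there
eastern = utc.dt.tz_convert('US/Eastern')
print(utc.dtype, eastern.dtype)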

Related

Is there a more efficient and elegant way to filter pandas index by date?

I often use DatetimeIndex.date, especially in groupby methods. However, DatetimeIndex.date is slow compared to DatetimeIndex.year/month/day. From what I understand, this is because the .date attribute builds Python datetime.date objects element by element over the index, while index.year/month/day just return integer arrays. I have written a small example function that performs quite a bit better and would speed up some of my code (at least for finding the values in a groupby), but I feel that there must be a better way:
In [217]: index = pd.date_range('2011-01-01', periods=100000, freq='h')
In [218]: data = np.random.rand(len(index))
In [219]: df = pd.DataFrame({'data':data},index)
In [220]: def func(df):
...: groupby = df.groupby([df.index.year, df.index.month, df.index.day]).mean()
...: index = pd.date_range(df.index[0], periods = len(groupby), freq='D')
...: groupby.index = index
...: return groupby
...:
In [221]: df.groupby(df.index.date).mean().equals(func(df))
Out[221]: True
In [222]: df.groupby(df.index.date).mean().index.equals(func(df).index)
Out[222]: True
In [223]: %timeit df.groupby(df.index.date).mean()
1 loop, best of 3: 1.32 s per loop
In [224]: %timeit func(df)
10 loops, best of 3: 89.2 ms per loop
Does pandas have similar built-in functionality that I am not finding?
You can improve on it quite a bit:
In [69]: %timeit func(df)
10 loops, best of 3: 84.3 ms per loop
In [70]: %timeit df.groupby(pd.TimeGrouper('1D')).mean()
100 loops, best of 3: 6 ms per loop
In [84]: %timeit df.groupby(pd.Grouper(level=0, freq='1D')).mean()
100 loops, best of 3: 6.48 ms per loop
In [71]: (func(df) == df.groupby(pd.TimeGrouper('1D')).mean()).all()
Out[71]:
data True
dtype: bool
Another option is the DataFrame.resample() method:
In [73]: (df.resample('1D').mean() == func(df)).all()
Out[73]:
data True
dtype: bool
In [74]: %timeit df.resample('1D').mean()
100 loops, best of 3: 6.63 ms per loop
UPDATE: grouping by a formatted string is even slower than grouping by date:
In [75]: %timeit df.groupby(df.index.strftime('%Y%m%d')).mean()
1 loop, best of 3: 2.6 s per loop
In [76]: %timeit df.groupby(df.index.date).mean()
1 loop, best of 3: 1.07 s per loop
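For reference, here is a self-contained version of the fast approaches that runs on modern pandas, where pd.TimeGrouper has been removed in favor of pd.Grouper:
import numpy as np
import pandas as pd
index = pd.date_range('2011-01-01', periods=100000, freq='h')
df = pd.DataFrame({'data': np.random.rand(len(index))}, index=index)
# pd.Grouper replaces the removed pd.TimeGrouper
by_grouper = df.groupby(pd.Grouper(level=0, freq='1D')).mean()
# resample is equivalent here and usually the most readable spelling
by_resample = df.resample('1D').mean()
assert by_grouper.equals(by_resample)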

Difference between df.loc['col name'], df.loc[index]['col name'] and df.loc[index, 'col name'] in pandas?

I have a dataframe df with a column named 'Store'. If I want to retrieve the column, the following lines work equally well - df['Store'], df[:]['Store'], or df.loc[:, 'Store'].
What is the difference between them? And should one be used over the others?
Thank you.
df.loc[index, 'col name'] is the most idiomatic and preferred form, especially if you want to filter rows.
Demo with a DataFrame of shape 1,000,000 x 3:
In [26]: df = pd.DataFrame(np.random.rand(10**6,3), columns=list('abc'))
In [27]: %timeit df[df.a < 0.5]['a']
10 loops, best of 3: 45.8 ms per loop
In [28]: %timeit df.loc[df.a < 0.5]['a']
10 loops, best of 3: 45.8 ms per loop
In [29]: %timeit df.loc[df.a < 0.5, 'a']
10 loops, best of 3: 37 ms per loop
For constructions where you need only one column and don't filter rows, like df[:]['Store'], it's better to simply use df['Store']:
In [30]: %timeit df[:]['a']
1000 loops, best of 3: 436 µs per loop
In [31]: %timeit df.loc[:]['a']
10000 loops, best of 3: 25.9 µs per loop
In [36]: %timeit df['a'].loc[:]
10000 loops, best of 3: 26.5 µs per loop
In [32]: %timeit df.loc[:, 'a']
10000 loops, best of 3: 126 µs per loop
In [33]: %timeit df['a']
The slowest run took 5.08 times longer than the fastest. This could mean that an intermediate result is being cached.
100000 loops, best of 3: 8.17 µs per loop
Unconditional access of multiple columns:
In [34]: %timeit df[['a','b']]
10 loops, best of 3: 22 ms per loop
In [35]: %timeit df.loc[:, ['a','b']]
10 loops, best of 3: 22.6 ms per loop
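The distinction matters most for assignment: chained indexing like df[mask]['a'] = ... may operate on a temporary copy (pandas warns about this), while df.loc[mask, 'a'] = ... modifies the original frame in one operation. A small illustration:
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.rand(10, 3), columns=list('abc'))
# chained indexing: the assignment may land on a temporary copy,
# so df itself may not change; pandas emits a warning here
df[df.a < 0.5]['a'] = 0
# .loc with (row_indexer, column_indexer) assigns on the original frame
df.loc[df.a < 0.5, 'a'] = 0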

Turn pandas series to series of lists or numpy array to array of lists

I have a series s
s = pd.Series([1, 2])
What is an efficient way to make s look like
0 [1]
1 [2]
dtype: object
Here's one approach that extracts the underlying array and extends it to 2D by introducing a new axis with None/np.newaxis -
pd.Series(s.values[:,None].tolist())
Here's a similar one that instead extends to 2D by reshaping -
pd.Series(s.values.reshape(-1,1).tolist())
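To see why this works, here is a walk-through of the intermediate shapes for the first approach:
import pandas as pd
s = pd.Series([1, 2])
arr = s.values        # array([1, 2]), shape (2,)
col = arr[:, None]    # array([[1], [2]]), shape (2, 1)
lst = col.tolist()    # [[1], [2]], a list of one-element lists
out = pd.Series(lst)  # each element is now a list, dtype object
print(out)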
Runtime test using @P-robot's setup -
In [43]: s = pd.Series(np.random.randint(1,10,1000))
In [44]: %timeit pd.Series(np.vstack(s.values).tolist()) # @Nickil Maveli's soln
100 loops, best of 3: 5.77 ms per loop
In [45]: %timeit pd.Series([[a] for a in s]) # @P-robot's soln
1000 loops, best of 3: 412 µs per loop
In [46]: %timeit s.apply(lambda x: [x]) # @mgc's soln
1000 loops, best of 3: 551 µs per loop
In [47]: %timeit pd.Series(s.values[:,None].tolist()) # Approach1
1000 loops, best of 3: 307 µs per loop
In [48]: %timeit pd.Series(s.values.reshape(-1,1).tolist()) # Approach2
1000 loops, best of 3: 306 µs per loop
If you want the result to still be a pandas Series, you can use the apply method:
In [1]: import pandas as pd
In [2]: s = pd.Series([1, 2])
In [3]: s.apply(lambda x: [x])
Out[3]:
0 [1]
1 [2]
dtype: object
This does it:
import numpy as np
np.array([[a] for a in s],dtype=object)
array([[1],
[2]], dtype=object)
Adjusting atomh33ls' answer, here's a series of lists:
output = pd.Series([[a] for a in s])
type(output)
>> pandas.core.series.Series
type(output[0])
>> list
Timings for a selection of the suggestions:
import numpy as np, pandas as pd
s = pd.Series(np.random.randint(1,10,1000))
>> %timeit pd.Series(np.vstack(s.values).tolist())
100 loops, best of 3: 3.2 ms per loop
>> %timeit pd.Series([[a] for a in s])
1000 loops, best of 3: 393 µs per loop
>> %timeit s.apply(lambda x: [x])
1000 loops, best of 3: 473 µs per loop

Pandas selecting columns - best habit and performance

There are many different ways to select a column in a pandas.DataFrame (same for rows). I am wondering if it makes any difference and if there are any performance and style recommendations.
E.g., if I have a DataFrame as follows:
import pandas as pd
import numpy as np
df = pd.DataFrame(data=np.random.random((10,4)), columns=['a','b','c','d'])
df.head()
There are many different ways to select, e.g., column d:
1) df['d']
2) df.loc[:,'d'] (where df.loc[row_indexer,column_indexer])
3) df.loc[:]['d']
4) df.ix[:]['d']
5) df.ix[:,'d']
Intuitively, I would prefer 2), maybe because I am used to the [row_indexer, column_indexer] style from numpy.
I would use IPython's %timeit magic function to find the best-performing method.
The results are:
%timeit df['d']
100000 loops, best of 3: 5.35 µs per loop
%timeit df.loc[:,'d']
10000 loops, best of 3: 44.3 µs per loop
%timeit df.loc[:]['d']
100000 loops, best of 3: 12.4 µs per loop
%timeit df.ix[:]['d']
100000 loops, best of 3: 10.4 µs per loop
%timeit df.ix[:,'d']
10000 loops, best of 3: 53 µs per loop
It turns out that the 1st method is considerably faster than the others.
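Note that .ix was deprecated in pandas 0.20 and removed in 1.0, so only methods 1)-3) still run on modern pandas. A sketch of the surviving spellings:
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.random((10, 4)), columns=list('abcd'))
d1 = df['d']         # plain column access: fastest and most common
d2 = df.loc[:, 'd']  # explicit row/column indexers, numpy-like style
d3 = df.loc[:]['d']  # chained indexing: fine for reading, but avoid it
                     # for assignment (it may operate on a copy)
assert d1.equals(d2) and d1.equals(d3)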

python pandas: why map is faster?

In the pandas manual, there is this example about indexing:
In [653]: criterion = df2['a'].map(lambda x: x.startswith('t'))
In [654]: df2[criterion]
Then Wes wrote:
# equivalent but slower
In [655]: df2[[x.startswith('t') for x in df2['a']]]
Can anyone explain why the map approach is faster? Is this a Python feature or a pandas feature?
Arguments about why a certain way of doing things in Python "should be" faster can't be taken too seriously, because you're often measuring implementation details which may behave differently in certain situations. As a result, when people guess what should be faster, they're often (usually?) wrong. For example, I find that map can actually be slower. Using this setup code:
import numpy as np, pandas as pd
import random, string
def make_test(num, width):
    s = [''.join(random.sample(string.ascii_lowercase, width)) for i in range(num)]
    df = pd.DataFrame({"a": s})
    return df
Let's compare the time they take to make the indexing object -- whether a Series or a list -- and the resulting time it takes to use that object to index into the DataFrame. It could be, for example, that making a list is fast but before using it as an index it needs to be internally converted to a Series or an ndarray or something and so there's extra time added there.
First, for a small frame:
>>> df = make_test(10, 10)
>>> %timeit df['a'].map(lambda x: x.startswith('t'))
10000 loops, best of 3: 85.8 µs per loop
>>> %timeit [x.startswith('t') for x in df['a']]
100000 loops, best of 3: 15.6 µs per loop
>>> %timeit df['a'].str.startswith("t")
10000 loops, best of 3: 118 µs per loop
>>> %timeit df[df['a'].map(lambda x: x.startswith('t'))]
1000 loops, best of 3: 304 µs per loop
>>> %timeit df[[x.startswith('t') for x in df['a']]]
10000 loops, best of 3: 194 µs per loop
>>> %timeit df[df['a'].str.startswith("t")]
1000 loops, best of 3: 348 µs per loop
and in this case the listcomp is fastest. That doesn't actually surprise me too much, to be honest, because going via a lambda is likely to be slower than using str.startswith directly, but it's really hard to guess. 10 rows is small enough that we're probably still measuring things like setup costs for Series; what happens with a larger frame?
>>> df = make_test(10**5, 10)
>>> %timeit df['a'].map(lambda x: x.startswith('t'))
10 loops, best of 3: 46.6 ms per loop
>>> %timeit [x.startswith('t') for x in df['a']]
10 loops, best of 3: 27.8 ms per loop
>>> %timeit df['a'].str.startswith("t")
10 loops, best of 3: 48.5 ms per loop
>>> %timeit df[df['a'].map(lambda x: x.startswith('t'))]
10 loops, best of 3: 47.1 ms per loop
>>> %timeit df[[x.startswith('t') for x in df['a']]]
10 loops, best of 3: 52.8 ms per loop
>>> %timeit df[df['a'].str.startswith("t")]
10 loops, best of 3: 49.6 ms per loop
And now it seems like the map is winning when used as an index, although the difference is marginal. But not so fast: what if we manually turn the listcomp into an array or a Series?
>>> %timeit df[np.array([x.startswith('t') for x in df['a']])]
10 loops, best of 3: 40.7 ms per loop
>>> %timeit df[pd.Series([x.startswith('t') for x in df['a']])]
10 loops, best of 3: 37.5 ms per loop
and now the listcomp wins again!
Conclusion: who knows? But never believe anything without timeit results, and even then you have to ask whether you're testing what you think you are.
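If you want to reproduce such comparisons outside IPython, a minimal harness with the standard timeit module might look like this (reusing make_test from above; the loop counts are arbitrary):
import random, string
import timeit
import pandas as pd
def make_test(num, width):
    s = [''.join(random.sample(string.ascii_lowercase, width)) for i in range(num)]
    return pd.DataFrame({"a": s})
df = make_test(10**5, 10)
candidates = {
    'map':      lambda: df[df['a'].map(lambda x: x.startswith('t'))],
    'listcomp': lambda: df[[x.startswith('t') for x in df['a']]],
    'str':      lambda: df[df['a'].str.startswith('t')],
}
for name, fn in candidates.items():
    # run each candidate a few times and report the best, as %timeit does
    best = min(timeit.repeat(fn, number=5, repeat=3)) / 5
    print('%-8s %7.1f ms per loop' % (name, best * 1e3))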
