Using this PyCon talk as a source, I wrote the following function:
import numpy as np

def clean_string(item):
    if type(item) == type(1):
        return item
    else:
        return np.nan
My dataframe has a column containing both numerical and string data; I want to change the strings to np.nan while leaving the numerical data as it is.
This approach works fine:
df['Energy Supply'].apply(clean_string)
but when I try to use vectorisation, all of the values in the column are changed to np.nan:
df['Energy Supply'] = clean_string(df['Energy Supply']) # vectorisation
I believe this is because type(item) inside clean_string() is then pd.Series rather than the type of each element.
Is there a way to overcome this problem?
PS: I am a beginner in pandas
Vectorizing an operation in pandas isn't always possible. I'm not aware of a pandas built-in vectorized way to get the type of the elements in a Series, so your .apply() solution may be the best approach.
The reason your code doesn't work in the second case is that you are passing the entire Series to your clean_string() function. It compares the type of the Series to type(1); that comparison is False, so the function returns the single value np.nan. Pandas then broadcasts that value when assigning it back to the DataFrame, so you get a whole column of NaN. To avoid this, you would have to loop over all of the elements of the Series inside clean_string().
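To make the broadcast concrete, here is a small illustration (the column values are made up; only the mechanism matters):
import numpy as np
import pandas as pd

df = pd.DataFrame({'Energy Supply': [127191, 'Missing', 4923]})
s = df['Energy Supply']

print(type(s) == type(1))              # False: the whole Series is not an int
print(clean_string(s))                 # nan: a single scalar comes back
df['Energy Supply'] = clean_string(s)  # that scalar is broadcast to every row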
Out of curiosity, I tried a few other approaches to see if any of them would be faster than your version. To test, I created 10,000- and 100,000-element pd.Series with alternating integer and string values:
import numpy as np
import pandas as pd
s = pd.Series(i if i%2==0 else str(i) for i in range(10000))
s2 = pd.Series(i if i%2==0 else str(i) for i in range(100000))
These tests are done using pandas 1.0.3 and python 3.8.
Baseline using clean_string()
In []: %timeit s.apply(clean_string)
3.75 ms ± 14.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In []: %timeit s2.apply(clean_string)
39.5 ms ± 301 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Series.str methods
An alternative way to test for strings vs. non-strings is to use the built-in .str methods on the Series; for example, if you apply .str.len(), it returns NaN for any non-string element. These are even called "Vectorized String Methods" in the pandas documentation, so maybe they will be more efficient.
In []: %timeit s.mask(s.str.len()>0)
6 ms ± 39.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In []: %timeit s2.mask(s2.str.len()>0)
56.8 ms ± 142 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Unfortunately, this approach is slower than the .apply(). Despite being "vectorized", it doesn't look like a better approach. It is also not quite identical to the logic of clean_string(), because it tests for elements that are strings, not for elements that are integers.
Applying type directly to the Series
Based on this answer, I decided to test using .apply() with type to get the type of each element. Once we know the types, we can compare them to int and use the .mask() method to convert any non-integers to NaN.
In []: %timeit s.mask(s.apply(type)!=int)
1.88 ms ± 4.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In []: %timeit s2.mask(s2.apply(type)!=int)
15.2 ms ± 32.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
This turns out to be the fastest approach that I've found.
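Applied to the column from the question (assuming the same column name), the fastest variant would look something like this:
# mask() replaces values where the condition is True, so every element whose
# type is not int becomes NaN, while the integers are left untouched
df['Energy Supply'] = df['Energy Supply'].mask(df['Energy Supply'].apply(type) != int)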
I want to test whether all elements of an array are zero. According to the StackOverflow posts Test if numpy array contains only zeros and https://stackoverflow.com/a/72976775/5269892, compared to (array == 0).all(), not array.any() should be both the most memory-efficient and the fastest method.
I tested the performance with a random floating-point array, see below. Somehow though, at least for the given array size, not array.any() and even just casting the array to boolean type appear to be slower than (array == 0).all(). How come?
import numpy as np

np.random.seed(100)
a = np.random.rand(10418*144)
%timeit (a == 0)
%timeit (a == 0).all()
%timeit a.astype(bool)
%timeit a.any()
%timeit not a.any()
# 711 µs ± 192 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
# 740 µs ± 1.38 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
# 1.69 ms ± 587 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
# 1.71 ms ± 1.31 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
# 1.71 ms ± 2.05 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
The problem is that the first two operations are vectorized using SIMD instructions, while the last three are not. More specifically, the last three calls do an implicit conversion to bool (_aligned_contig_cast_double_to_bool) which is not yet vectorized. This is a known issue, and I have already proposed a pull request for it (which revealed some unexpected issues due to undefined behaviour, now fixed). If everything goes well, it should be available in the next major release of NumPy.
Note that a.any() and not a.any() implicitly cast the array to an array of booleans first, so that the any operation itself can then be performed faster. This is not very efficient, but it is done that way to reduce the number of generated function variants: NumPy is written in C, so a different implementation has to be generated for each type, and optimizing many variants is hard, so we prefer to perform an implicit cast here (this also reduces the size of the generated binaries). If this is not fast enough, you can use Cython to generate faster, specialized code.
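One way to express the same check without materialising the temporary boolean array yourself is np.count_nonzero (not discussed above; whether it actually beats (a == 0).all() depends on the NumPy version and dtype, so treat this as something to benchmark rather than a guaranteed win):
import numpy as np

np.random.seed(100)
a = np.random.rand(10418 * 144)

# True only if every element is exactly zero; logically equivalent to `not a.any()`
all_zero = np.count_nonzero(a) == 0
print(all_zero)  # False for this random array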
I have a dataframe with around 16,000 rows, and I'm performing a max aggregation of one column, grouped by another one.
df.groupby(['col1']).agg({'col2': 'max'}).reset_index()
It takes 1.97 s, and I'd like to improve its performance. Please suggest approaches along the lines of using numpy or vectorization.
The datatype of both columns is object.
%%timeit
df.groupby(['col1']).agg({'col2': 'max'}).reset_index()
1.97 s ± 42 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
I played around with datatypes and found that changing the datatype of col2 to integer reduced the run time significantly.
%%timeit
df['col2'] = df['col2'].astype(int)
df.groupby(['col1']).agg({'col2': 'max'}).reset_index()
6.58 ms ± 74.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Aggregating integers is faster than aggregating strings.
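A minimal sketch of the pattern on made-up data (the column names match the question, the values are invented; numbers are stored as strings to mimic the object dtype):
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'col1': np.random.choice(['a', 'b', 'c'], size=16000),
    'col2': np.random.randint(0, 1000, size=16000).astype(str),  # numbers stored as strings
})

# Converting the aggregated column to a numeric dtype before grouping lets pandas
# use its fast integer max instead of comparing Python strings element by element
df['col2'] = df['col2'].astype(int)
out = df.groupby(['col1']).agg({'col2': 'max'}).reset_index()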
I know there is more than one way to approach this and get the job done. Are there any considerations other than performance when choosing whether to use apply with a lambda? I have a particularly large dataframe with a column of emails, and I need to strip the '#domain' from all of them. There is the simple:
DF['PRINCIPAL'] = DF['PRINCIPAL'].str.split("#", expand=True)[0]
and then the Apply Lambda:
DF['PRINCIPAL'] = DF.apply(lambda x: x['PRINCIPAL'].split("#")[0], axis=1)
I assume they are roughly equivalent, but their methods of execution will mean each is more efficient in certain situations. Is there anything I should know?
Use:
import pandas as pd

df = pd.DataFrame({'email':['abc#ABC.com']*1000})
s1 = df['email'].str.split('#').str[0]
s2 = pd.Series([i.split('#')[0] for i in df['email']], name='email')
s1.eq(s2).all()
Output
True
Timings:
%timeit s1 = df['email'].str.split('#').str[0]
1.77 ms ± 75.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit s2 = pd.Series([i.split('#')[0] for i in df['email']], name='email')
737 µs ± 67.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
You can use assign, which is the method recommended by Marc Garcia in his talk towards pandas 1.0, because it lets you chain operations on the same dataframe (see the example between 6:17 and 7:30):
DF = DF.assign(PRINCIPAL=lambda x: x['PRINCIPAL'].str.split("#", expand=True)[0])
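The point of assign is that it returns the DataFrame, so it can sit in the middle of a method chain; a hypothetical sketch (the rename step is made up purely to show the chaining):
DF = (
    DF
    .assign(PRINCIPAL=lambda x: x['PRINCIPAL'].str.split("#", expand=True)[0])
    .rename(columns={'PRINCIPAL': 'principal'})  # made-up follow-up step, just to show chaining
)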
I've run across some legacy code with data stored as a single-row pd.DataFrame.
My intuition is that working with a pd.Series would be faster in this case - I don't know how pandas optimizes either structure internally, but I know that it can and does.
Is my intuition correct? Or is there no significant difference for most actions?
(to clarify - obviously the best practice would not be a single row DataFrame, but I'm asking about performance)
Yes, for a large number of columns there will be a noticeable impact on performance.
You should consider that a DataFrame is essentially a dict of Series, so when you perform an operation on the single row, pandas has to coalesce all of the column values first before performing the operation.
Even for 100 elements you can see there is a hit:
import numpy as np
import pandas as pd

s = pd.Series(np.random.randn(100))
df = pd.DataFrame(np.random.randn(1,100))
%timeit s.sum()
%timeit df.sum(axis=1)
104 µs ± 5.08 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
194 µs ± 2.8 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
In my opinion there is no reason to have a single-row DataFrame that couldn't be achieved with a Series whose index values are the same as the column names of that DataFrame.
The performance degradation isn't linear; for a 10k-element array it's not quite 2x worse:
s = pd.Series(np.random.randn(10000))
df = pd.DataFrame(np.random.randn(1,10000))
%timeit s.sum()
%timeit df.sum(axis=1)
149 µs ± 10.5 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
253 µs ± 36.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
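If you are stuck with legacy code that produces a single-row DataFrame, converting it once up front is cheap; a minimal sketch (iloc[0] and squeeze() are standard pandas, the frame here is made up):
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(1, 100))

# Either form gives a Series whose index is the former column labels
s = df.iloc[0]
s = df.squeeze(axis=0)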
I am a bit surprised that, for a single-dtype DataFrame (an n x n DataFrame), it is slower to access a row than a column. From what I gather, a DataFrame of a single dtype should be stored as a contiguous block in memory, so accessing rows or columns should be equally fast (just a matter of using the correct stride).
Sample code:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(100, 100))
%timeit df[0]
%timeit df.loc[0]
The slowest run took 12.86 times longer than the fastest. This could mean that an intermediate result is being cached.
100000 loops, best of 3: 2.72 µs per loop
10000 loops, best of 3: 116 µs per loop
There is definitely something I don't understand about how a DataFrame is stored. Thanks for your help!
I'm not an expert in the implementation details of Pandas, but I've used it enough that I can make an educated guess.
As I understand it, the Pandas data structure is most directly comparable to a dictionary of dictionaries, where the first index is the columns. Thus, the DF:
a b
c 1 2
d 3 4
is essentially {'a': {'c': 1, 'd': 3}, 'b': {'c': 2, 'd': 4}}. I'll assume I'm correct about that assertion from here on out, and would love to be corrected if someone knows more about pandas.
Thus, indexing a column is a simple hash lookup, whereas indexing a row requires iterating over all columns and doing a hash lookup for each one.
I think the reasoning is that this makes it really efficient to access a particular attribute of all rows and to add new columns, which is normally how you interact with a dataframe. For such tabular use cases it's much faster than a simple matrix layout, since you don't have to stride through memory (a whole column is stored more or less contiguously). The tradeoff, of course, is that interacting with rows is less efficient (hence why it's not as easy syntactically to do so; you'll note that most pandas operations default to working on columns, and interacting with rows is more or less a secondary objective in the library).
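A toy illustration of the dict-of-dicts analogy in plain Python (this is not how pandas is actually implemented; it only shows why a column lookup is one hash lookup while a row lookup has to touch every column):
data = {'a': {'c': 1, 'd': 3}, 'b': {'c': 2, 'd': 4}}

# "Column" access: a single hash lookup
col_a = data['a']                                            # {'c': 1, 'd': 3}

# "Row" access: iterate over every column and look up the row key in each one
row_c = {col: values['c'] for col, values in data.items()}   # {'a': 1, 'b': 2}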
If you look at the underlying numpy array, you'll see that access is the same speed for rows / columns, at least in my test:
%timeit df.values[0]
# 10.2 µs ± 596 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit df.values[:, 0]
# 10.2 µs ± 730 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
Series (columns) are more first-class citizens in a dataframe than rows are. I think accessing the columns is more like a dictionary lookup, which is why it's so fast. Usually there are few columns, and each is meaningful, so it makes sense to store them this way. There are often very many rows, though, and an individual row doesn't have as much significance. This is a bit of conjecture, though. You'd have to go look at the source code to see what is actually being called each time and determine from that why the operations take a different amount of time - maybe an answer will pop up with that later.
Here's another timing comparison:
%timeit df.iloc[0, :]
# 141 µs ± 7 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit df.iloc[:, 0]
# 61.9 µs ± 1.76 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
Accessing the column is quicker this way too, though both are much slower than the direct access above. I'm not sure what explains this. I assume that the slowdown compared with accessing a row/column directly comes from the need to construct a pd.Series. When accessing a row, a new pd.Series has to be created. But I don't know why iloc is slower for columns too - perhaps it also creates a new Series each time, since iloc can be used quite flexibly and might not return an existing Series (or could return a DataFrame). But if a new Series is created in both cases, then I'm again at a loss for why one operation beats the other.
And for more completeness
%timeit df.loc[0, :]
# 155 µs ± 6.48 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
%timeit df.loc[:, 0]
# 35.6 µs ± 1.28 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
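If fast positional row access is what you actually need on a single-dtype frame, the comparison above suggests dropping down to the underlying array; a small sketch (to_numpy() is the modern spelling of .values):
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(100, 100))

arr = df.to_numpy()   # one contiguous block for a single-dtype frame
row0 = arr[0]         # plain ndarray row, no Series construction
col0 = arr[:, 0]      # plain ndarray column, same cost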