Very slow indexing in Pandas 0.15 compared to 0.13.1 - python

I use pandas daily in my work. I recently upgraded to 0.15.1 from 0.13.1 and now a bunch of code is too slow to finish when iterating through relatively small DataFrames.
(I realize there are often better/faster ways to accomplish iteration on a DataFrame, but sometimes it's very clear and succinct to have a for loop structure)
I narrowed the problem down to an issue when mixing types:
def iterGet(df, col):
    for i in df.index:
        tmp = df[col].loc[i]

def iterLocSet(df, col, val):
    for i in df.index:
        df[col].loc[i] = val
        # df.loc[i, col] = val  # either of these fixes the slowdown (see EDIT below)
        # df.at[i, col] = val
    return df
import pandas as pd
from numpy import float64
from numpy.random import rand

N = 100
df = pd.DataFrame(rand(N, 3), columns=['a', 'b', 'c'])
df['listCol'] = [[] for i in range(df.shape[0])]
df['strCol'] = [str(i) for i in range(df.shape[0])]
df['intCol'] = [i for i in range(df.shape[0])]
df['float64Col'] = [float64(i) for i in range(df.shape[0])]

print df.a[:5]
%time iterGet(df[['a', 'intCol']].copy(), 'a')
%time tmpDf = iterLocSet(df[['a', 'intCol']].copy(), 'a', 0.)
print tmpDf.a[:5]
%time iterGet(df[['a', 'float64Col']].copy(), 'a')
%time tmpDf = iterLocSet(df[['a', 'float64Col']].copy(), 'a', 0.)
print tmpDf.a[:5]
On Pandas 0.15.1 the result is:
0 0.114738
1 0.586447
2 0.296024
3 0.446697
4 0.720984
Name: a, dtype: float64
Wall time: 6 ms
Wall time: 3.41 s
0 0
1 0
2 0
3 0
4 0
Name: a, dtype: float64
Wall time: 6 ms
Wall time: 18 ms
0 0
1 0
2 0
3 0
4 0
Name: a, dtype: float64
But on Pandas 0.13.1 the result is this:
0 0.651796
1 0.738661
2 0.885366
3 0.513006
4 0.846323
Name: a, dtype: float64
Wall time: 6 ms
Wall time: 14 ms
0 0
1 0
2 0
3 0
4 0
Name: a, dtype: float64
Wall time: 5 ms
Wall time: 15 ms
0 0
1 0
2 0
3 0
4 0
Name: a, dtype: float64
It appears that making an assignment using row indexing on a mixed-dtype DataFrame is ~200x slower in Pandas 0.15.1?
I am aware there may be a pitfall here in assigning to what may be a copy of the array, but I admit I do not fully understand that issue either. Here at least I can see the assignment is working.
EDIT: I see now that using either of these in the for loop fixes the problem:
df.loc[i,col] = val
df.at[i,col] = val
I don't know enough about the implementation to diagnose this. Can anyone reproduce this? Is this what you would expect? What am I doing wrong?
Thanks!

Using .loc, even on a single-dtype frame, can cause a copy of the data on a partial assignment. (This is almost always true when you have object dtypes, less so with numeric types.)
By partial assignment, I mean:
df.loc[1,'B'] = value
In other words, this is setting a single value (setting multiple values is similar). Setting an entire column, however, is very different:
df['B'] = values
df.loc[:,'B'] = values
is quite efficient and does not copy.
Thus you should completely avoid iteration and simply do:
df['B'] = [ ..... ] # if you want to set with a list-like
df['B'] = value # for a scalar
So in your example above, it is likely copying at every iteration. 0.13.1 was a bit buggy in handling partial assignments and would handle certain cases incorrectly, so a bit more copying was needed to fix that.
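To make the column-wise alternative concrete, here is a minimal sketch (reusing the column names from the question; exact timings will vary by version and frame size) of replacing the per-row loop with whole-column assignments:
import pandas as pd
from numpy.random import rand

df = pd.DataFrame(rand(100, 3), columns=['a', 'b', 'c'])
df['intCol'] = range(df.shape[0])

# Instead of assigning row by row with .loc/.at inside a loop,
# assign the whole column in one statement:
df['a'] = 0.              # scalar, broadcast to every row
df['a'] = rand(len(df))   # or a list-like of per-row values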

Related

I need to create a dataframe where values reference previous rows

I am just starting to use Python and I'm trying to learn some general things about it. While playing around with it I wanted to see if I could make a dataframe that shows a starting number compounded by a return. Sorry if this description doesn't make much sense, but I basically want a dataframe x rows long that shows me:
number*(return)^(row number) in each row
So, for example, say the number is 10 and the return is 10%; I would like the dataframe to give me the series:
1 11
2 12.1
3 13.3
4 14.6
5 ...
6 ...
Thanks so much in advance!
Let us try:
import numpy as np
import pandas as pd

val = 10
det = 0.1
n = 4
out = val * ((1 + det) ** np.arange(n))
s = pd.Series(out)
s
Out[426]:
0 10.00
1 11.00
2 12.10
3 13.31
dtype: float64
Notice that I start the index from 0, since 1.1**0 yields the original value.
I think this does what you want:
df = pd.DataFrame({'returns': [x for x in range(1, 10)]})
df.index = df.index + 1
df.returns = df.returns.apply(lambda x: (10 * (1.1**x)))
print(df)
Out:
returns
1 11.000000
2 12.100000
3 13.310000
4 14.641000
5 16.105100
6 17.715610
7 19.487171
8 21.435888
9 23.579477
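Combining the two answers above, a minimal sketch that starts the index at 1 and reproduces exactly the series asked for (a starting number of 10 with a 10% return) might be:
import numpy as np
import pandas as pd

number, ret, n = 10, 0.10, 5
# number * (1 + return)^row, with the row index starting at 1
s = pd.Series(number * (1 + ret) ** np.arange(1, n + 1), index=range(1, n + 1))
print(s)   # 1: 11.0, 2: 12.1, 3: 13.31, 4: 14.641, 5: 16.1051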

Python: how to multiply 2 columns?

I have a simple dataframe and I would like to add a column 'Pow_calkowita'. If 'liczba_kon' is 0, 'Pow_calkowita' should be 'Powierzchn'; but if 'liczba_kon' is not 0, 'Pow_calkowita' should be 'liczba_kon' * 'Powierzchn'. Why can't I do that?
for index, row in df.iterrows():
    if row['liczba_kon'] == 0:
        row['Pow_calkowita'] = row['Powierzchn']
    elif row['liczba_kon'] != 0:
        row['Pow_calkowita'] = row['Powierzchn'] * row['liczba_kon']
My code didn't return any values.
liczba_kon Powierzchn
0 3 69.60495
1 1 39.27270
2 1 130.41225
3 1 129.29570
4 1 294.94400
5 1 64.79345
6 1 108.75560
7 1 35.12290
8 1 178.23905
9 1 263.00930
10 1 32.02235
11 1 125.41480
12 1 47.05420
13 1 45.97135
14 1 154.87120
15 1 37.17370
16 1 37.80705
17 1 38.78760
18 1 35.50065
19 1 74.68940
I have found a solution:
result = []
for index, row in df.iterrows():
    if row['liczba_kon'] == 0:
        result.append(row['Powierzchn'])
    elif row['liczba_kon'] != 0:
        result.append(row['Powierzchn'] * row['liczba_kon'])
df['Pow_calkowita'] = result
Is it a good way?
To write idiomatic Pandas code and leverage its efficient array processing, you should avoid looping over the rows yourself. Pandas lets you write succinct code that executes efficiently through vectorization over its underlying numpy ndarrays, which use optimized C code. Pandas already handles the necessary looping behind the scenes, so a single vectorized statement is both shorter and much faster than an explicit loop over all elements.
As your formula is based on a condition, you cannot use direct multiplication. Instead you can use np.where() as follows:
import numpy as np
df['Pow_calkowita'] = np.where(df['liczba_kon'] == 0, df['Powierzchn'], df['Powierzchn'] * df['liczba_kon'])
When the test condition in first parameter is true, the value from second parameter is taken, else, the value from the third parameter is taken.
Test run output (I added 2 more test rows at the end, one with liczba_kon equal to 0):
print(df)
liczba_kon Powierzchn Pow_calkowita
0 3 69.60495 208.81485
1 1 39.27270 39.27270
2 1 130.41225 130.41225
3 1 129.29570 129.29570
4 1 294.94400 294.94400
5 1 64.79345 64.79345
6 1 108.75560 108.75560
7 1 35.12290 35.12290
8 1 178.23905 178.23905
9 1 263.00930 263.00930
10 1 32.02235 32.02235
11 1 125.41480 125.41480
12 1 47.05420 47.05420
13 1 45.97135 45.97135
14 1 154.87120 154.87120
15 1 37.17370 37.17370
16 1 37.80705 37.80705
17 1 38.78760 38.78760
18 1 35.50065 35.50065
19 1 74.68940 74.68940
20 0 69.60495 69.60495
21 2 74.68940 149.37880
To answer the first question: "Why I can't do that?"
The documentation states (in the notes):
Because iterrows returns a Series for each row, ....
and
You should never modify something you are iterating over. [...] the iterator returns a copy and not a view, and writing to it will have no effect.
this basically means that it returns a new Series with the values of that row
So, what you are getting is NOT the actual row, and definitely NOT the dataframe!
BUT what you are doing is working, although not in the way that you want to:
df = pd.DataFrame(dict(a=[1, 2, 3], b=list("abc")))
df   # To demonstrate what you are doing
   a  b
0  1  a
1  2  b
2  3  c
for index, row in df.iterrows():
...     print("\n------------------\n>>> Next Row:\n")
...     print(row)
...     row["c"] = "ADDED"   ####### HERE I am adding to 'the row'
...     print("\n -- >> added:")
...     print(row)
...     print("----------------------")
...
------------------
Next Row: # as you can see, this Series has the same values
a 1 # as the row that it represents
b a
Name: 0, dtype: object
-- >> added:
a 1
b a
c ADDED # and adding to it works... but you aren't doing anything
Name: 0, dtype: object # with it, unless you append it to a list
----------------------
------------------
Next Row:
a 2
b b
Name: 1, dtype: object
### same here
-- >> added:
a 2
b b
c ADDED
Name: 1, dtype: object
----------------------
------------------
Next Row:
a 3
b c
Name: 2, dtype: object
### and here
-- >> added:
a 3
b c
c ADDED
Name: 2, dtype: object
----------------------
To answer the second question: "Is it a good way?"
No, because using multiplication the way SeaBean has shown actually uses the power of numpy and pandas, which are vectorized operations.
This is a link to a good article on vectorization in numpy arrays, which are basically the building blocks of pandas DataFrames and Series.
A DataFrame is designed for vectorized operations; you can treat it like a database table, so you should use its built-in functions whenever possible.
tdf = df.copy()  # temporary copy (a plain tdf = df would alias df and also modify it)
tdf['liczba_kon'] = tdf['liczba_kon'].replace(0, 1)  # replace 0 with 1
tdf['Pow_calkowita'] = tdf['liczba_kon'] * tdf['Powierzchn']  # multiply
df['Pow_calkowita'] = tdf['Pow_calkowita']  # copy the column back
This simplifies the code and improves performance; we can compare the two approaches:
import time
import numpy as np
import pandas as pd

sampleSize = 100000
df = pd.DataFrame({
    'liczba_kon': np.random.randint(3, size=sampleSize),
    'Powierzchn': np.random.randint(1000, size=sampleSize),
})

# vectorization
s = time.time()
tdf = df.copy()  # temporary copy
tdf['liczba_kon'] = tdf['liczba_kon'].replace(0, 1)  # replace 0 with 1
tdf['Pow_calkowita'] = tdf['liczba_kon'] * tdf['Powierzchn']  # multiply
df['Pow_calkowita'] = tdf['Pow_calkowita']  # copy the column back
print(time.time() - s)

# iteration
s = time.time()
result = []
for index, row in df.iterrows():
    if row['liczba_kon'] == 0:
        result.append(row['Powierzchn'])
    elif row['liczba_kon'] != 0:
        result.append(row['Powierzchn'] * row['liczba_kon'])
df['Pow_calkowita'] = result
print(time.time() - s)
We can see that vectorization performed much faster:
0.0034716129302978516
6.193516492843628

How to get one hot encoding of specific words in a text in Pandas?

Let's say I have a dataframe and a list of words, i.e.
toxic = ['bad', 'horrible', 'disguisting']
df = pd.DataFrame({'text': ['You look horrible', 'You are good', 'you are bad and disguisting']})
main = pd.concat([df, pd.DataFrame(columns=toxic)]).fillna(0)
samp = main['text'].str.split().apply(lambda x: [i for i in toxic if i in x])
for i, j in enumerate(samp):
    for k in j:
        main.loc[i, k] = 1
This leads to :
bad disguisting horrible text
0 0 0 1 You look horrible
1 0 0 0 You are good
2 1 1 0 you are bad and disguisting
This is a bit faster than get_dummies, but for loops in pandas are not desirable when there is a huge amount of data.
I tried str.get_dummies, but it one-hot encodes every word in the series, which makes it a bit slower.
pd.concat([df, main['text'].str.get_dummies(' ')[toxic]], 1)
text bad horrible disguisting
0 You look horrible 0 1 0
1 You are good 0 0 0
2 you are bad and disguisting 1 0 1
If I try the same with sklearn:
from sklearn import preprocessing
le = preprocessing.LabelEncoder()
le.fit(toxic)
main['text'].str.split().apply(le.transform)
This leads to ValueError: y contains new labels. Is there a way to ignore the error in sklearn?
How can I improve the speed of this? Is there any other fast way of achieving the same result?
Use sklearn.feature_extraction.text.CountVectorizer:
from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer(vocabulary=toxic)
r = pd.SparseDataFrame(cv.fit_transform(df['text']),
                       df.index,
                       cv.get_feature_names(),
                       default_fill_value=0)
Result:
In [127]: r
Out[127]:
bad horrible disguisting
0 0 1 0
1 0 0 0
2 1 0 1
In [128]: type(r)
Out[128]: pandas.core.sparse.frame.SparseDataFrame
In [129]: r.info()
<class 'pandas.core.sparse.frame.SparseDataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 3 columns):
bad 3 non-null int64
horrible 3 non-null int64
disguisting 3 non-null int64
dtypes: int64(3)
memory usage: 104.0 bytes
In [130]: r.memory_usage()
Out[130]:
Index 80
bad 8 # <--- NOTE: it's using 8 bytes (1x int64) instead of 24 bytes for three values (3x8)
horrible 8
disguisting 8
dtype: int64
joining SparseDataFrame with the original DataFrame:
In [137]: r2 = df.join(r)
In [138]: r2
Out[138]:
text bad horrible disguisting
0 You look horrible 0 1 0
1 You are good 0 0 0
2 you are bad and disguisting 1 0 1
In [139]: r2.memory_usage()
Out[139]:
Index 80
text 24
bad 8
horrible 8
disguisting 8
dtype: int64
In [140]: type(r2)
Out[140]: pandas.core.frame.DataFrame
In [141]: type(r2['horrible'])
Out[141]: pandas.core.sparse.series.SparseSeries
In [142]: type(r2['text'])
Out[142]: pandas.core.series.Series
PS: In older Pandas versions, sparse columns lost their sparsity (became dense) after joining a SparseDataFrame with a regular DataFrame; now we can have a mixture of regular Series (columns) and SparseSeries - a really nice feature!
The accepted answer is deprecated, see release notes:
SparseSeries and SparseDataFrame were removed in pandas 1.0.0. This migration guide is present to aid in migrating from previous versions.
Pandas 1.0.5 solution:
r = pd.DataFrame.sparse.from_spmatrix(cv.fit_transform(df['text']),
                                      df.index,
                                      cv.get_feature_names())
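For completeness, here is a minimal end-to-end sketch of the updated approach (variable names follow the question; on recent scikit-learn you may need get_feature_names_out() instead of get_feature_names()):
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

toxic = ['bad', 'horrible', 'disguisting']
df = pd.DataFrame({'text': ['You look horrible', 'You are good', 'you are bad and disguisting']})

cv = CountVectorizer(vocabulary=toxic)
r = pd.DataFrame.sparse.from_spmatrix(cv.fit_transform(df['text']),
                                      index=df.index,
                                      columns=cv.get_feature_names_out())
print(df.join(r))   # join the indicator columns back onto the original text column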

Python Applymap taking time to run

I have a matrix of data (55K x 8.5K) with counts. Most of them are zeros, but a few of them contain some count. Let's say something like this:
a b c
0 4 3 3
1 1 2 1
2 2 1 0
3 2 0 1
4 2 0 4
I want to binarize the cell values.
I did the following:
df_preference=df_recommender.applymap(lambda x: np.where(x >0, 1, 0))
The code works fine, but it takes a lot of time to run.
Why is that?
Is there a faster way?
Thanks
Edit:
Error when doing df.to_pickle
df_preference.to_pickle('df_preference.pickle')
I get this:
---------------------------------------------------------------------------
SystemError Traceback (most recent call last)
<ipython-input-16-3fa90d19520a> in <module>()
1 # Pickling the data to the disk
2
----> 3 df_preference.to_pickle('df_preference.pickle')
\\dwdfhome01\Anaconda\lib\site-packages\pandas\core\generic.pyc in to_pickle(self, path)
1170 """
1171 from pandas.io.pickle import to_pickle
-> 1172 return to_pickle(self, path)
1173
1174 def to_clipboard(self, excel=None, sep=None, **kwargs):
\\dwdfhome01\Anaconda\lib\site-packages\pandas\io\pickle.pyc in to_pickle(obj, path)
13 """
14 with open(path, 'wb') as f:
---> 15 pkl.dump(obj, f, protocol=pkl.HIGHEST_PROTOCOL)
16
17
SystemError: error return without exception set
UPDATE:
read this topic and this issue regarding your error
Try to save your DF as HDF5 - it's much more convenient.
You may also want to read this comparison...
OLD answer:
try this:
In [110]: (df>0).astype(np.int8)
Out[110]:
a b c
0 1 1 1
1 1 1 1
2 1 1 0
3 1 0 1
4 1 0 1
.applymap() is one of the slowest methods, because it visits each cell (basically it performs nested loops internally).
df>0 works on vectorized data, so it is much faster.
.apply() works faster than .applymap() as it operates on whole columns, but it is still much slower than df>0.
UPDATE2: time comparison on a smaller DF (1000 x 1000), as applymap() will take ages on (55K x 9K) DF:
In [5]: df = pd.DataFrame(np.random.randint(0, 10, size=(1000, 1000)))
In [6]: %timeit df.applymap(lambda x: np.where(x >0, 1, 0))
1 loop, best of 3: 3.75 s per loop
In [7]: %timeit df.apply(lambda x: np.where(x >0, 1, 0))
1 loop, best of 3: 256 ms per loop
In [8]: %timeit (df>0).astype(np.int8)
100 loops, best of 3: 2.95 ms per loop
You could use a scipy sparse matrix. This would restrict the calculations to the data that is actually there instead of operating on all the zeros.
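As a rough illustration of that suggestion, here is a minimal sketch (column count and values are made up) that binarizes the counts through a scipy CSR matrix and, if needed, wraps the result back into a pandas frame with sparse columns:
import numpy as np
import pandas as pd
from scipy import sparse

# small stand-in for the 55K x 8.5K count matrix
df = pd.DataFrame(np.random.randint(0, 3, size=(1000, 500)))

m = sparse.csr_matrix(df.values)   # only the non-zero counts are stored
m.data = np.ones_like(m.data)      # binarize: every stored count becomes 1

# optional: back to a DataFrame whose columns use a sparse dtype
df_pref = pd.DataFrame.sparse.from_spmatrix(m, index=df.index, columns=df.columns)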

Writing to multiple adjacent columns in pandas efficiently

With a numpy ndarray it is possible to write to multiple columns at a time without making a copy first (as long as they are adjacent). If I wanted to write to the first three columns of an array I would write
a[0,0:3] = 1,2,3 # this is very fast ('a' is a numpy ndarray)
I was hoping that in pandas I would similarly be able to select multiple adjacent columns by "label-slicing" like so (assuming the first 3 columns are labeled 'a','b','c')
a.loc[0,'a':'c'] = 1,2,3 # this works but is very slow ('a' is a pandas DataFrame)
or similarly
a.iloc[0,3:6] = 1,2,3 # this is equally as slow
However, this takes several hundred milliseconds, compared to only a few microseconds for writing to a numpy array. I'm unclear on whether pandas is making a copy of the array under the hood. The only way I could find to write to the dataframe in this way with good speed is to work on the underlying ndarray directly:
a.values[0,0:3] = 1,2,3 # this works fine and is fast
Have I missed something in the Pandas docs, or is there no way to do multiple adjacent column indexing on a Pandas dataframe with speed comparable to numpy?
Edit
Here's the actual dataframe I am working with.
>> conn = sqlite3.connect('prath.sqlite')
>> prath = pd.read_sql("select image_id,pixel_index,skin,r,g,b from pixels",conn)
>> prath.shape
(5913307, 6)
>> prath.head()
image_id pixel_index skin r g b
0 21 113764 0 0 0 0
1 13 187789 0 183 149 173
2 17 535758 0 147 32 35
3 31 6255 0 116 1 16
4 15 119272 0 238 229 224
>> prath.dtypes
image_id int64
pixel_index int64
skin int64
r int64
g int64
b int64
dtype: object
Here are some runtime comparisons for the different indexing methods (again, the pandas indexing is very slow):
>> %timeit prath.loc[0,'r':'b'] = 4,5,6
1 loops, best of 3: 888 ms per loop
>> %timeit prath.iloc[0,3:6] = 4,5,6
1 loops, best of 3: 894 ms per loop
>> %timeit prath.values[0,3:6] = 4,5,6
100000 loops, best of 3: 4.8 µs per loop
Edit to clarify: I don't believe pandas has a direct analog to setting a view in numpy in terms of both speed and syntax. iloc and loc are probably the most direct analog in terms of syntax and purpose, but are much slower. This is a fairly common situation with numpy and pandas. Pandas does a lot more than numpy (labeled columns/indexes, automatic alignment, etc.), but is slower to varying degrees. When you need speed and can do things in numpy, then do them in numpy.
I think in a nutshell that the tradeoff here is that loc and iloc will be slower but work 100% of the time whereas values will be fast but not always work (to be honest, I didn't even realize it would work in the way you got it to work).
But here's a really simple example where values doesn't work because column 'g' is a float rather than integer.
prath['g'] = 3.33
prath.values[0,3:6] = 4,5,6
prath.head(3)
image_id pixel_index skin r g b
0 21 113764 0 0 3.33 0
1 13 187789 0 183 3.33 173
2 17 535758 0 147 3.33 35
prath.iloc[0,3:6] = 4,5,6
prath.head(3)
image_id pixel_index skin r g b
0 21 113764 0 4 5.00 6
1 13 187789 0 183 3.33 173
2 17 535758 0 147 3.33 35
You can often get numpy-like speed and behavior from pandas when columns are of homogeneous type, but you want to be careful about this. Edit to add: as @toes notes in the comment, the documentation does state that you can do this with homogeneous data. However, it's potentially very error prone, as the example above shows, and I don't think many people would consider this a good general practice in pandas.
My general recommendation would be to do things in numpy in cases where you need the speed (and have homogeneous data types), and pandas when you don't. The nice thing is that numpy and pandas play well together so it's really not that hard to convert between dataframes and arrays as you go.
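For what it's worth, a minimal sketch of that round trip (made-up names, and assuming a single-dtype frame on older pandas behavior where .values is a view) might look like this:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.zeros((5, 3)), columns=['r', 'g', 'b'])

arr = df.values          # for a single-dtype frame this has typically been a view of the data
arr[0, 0:3] = 1, 2, 3    # numpy-style adjacent-column write

df2 = pd.DataFrame(arr, columns=df.columns)  # wrap the array back in a frame when you need labels again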
Edit to add: The following seems to work (albeit with a warning) even with column 'g' as a float. The speed is in between the .values approach and the loc/iloc approaches. I'm not sure whether this can be expected to work all the time, though. Just putting it out there as a possible middle way.
prath[0:1][['r','g','b']] = 4,5,6
We are adding the ability to index directly even in a multi-dtype frame. This is in master now and will be in 0.17.0. You can do this in < 0.17.0, but it requires (more) manipulation of the internals.
In [1]: df = DataFrame({'A' : range(5), 'B' : range(6,11), 'C' : 'foo'})
In [2]: df.dtypes
Out[2]:
A int64
B int64
C object
dtype: object
The copy=False flag is new. This gives you a dict of dtypes->blocks (which are dtype separable)
In [3]: b = df.as_blocks(copy=False)
In [4]: b
Out[4]:
{'int64': A B
0 0 6
1 1 7
2 2 8
3 3 9
4 4 10, 'object': C
0 foo
1 foo
2 foo
3 foo
4 foo}
Here is the underlying numpy array.
In [5]: b['int64'].values
Out[5]:
array([[ 0, 6],
[ 1, 7],
[ 2, 8],
[ 3, 9],
[ 4, 10]])
This is the array in the original data set
In [7]: id(df._data.blocks[0].values)
Out[7]: 4429267232
Here is our view on it. They are the same
In [8]: id(b['int64'].values.base)
Out[8]: 4429267232
Now you can access the frame, and use pandas set operations to modify.
You can also directly access the numpy array via .values, which is now a VIEW into the original.
You will not incur any speed penalty for modifications as copies won't be made as long as you don't change the dtype of the data itself (e.g. don't try to put a string here; it will work but the view will be lost)
In [9]: b['int64'].loc[0,'A'] = -1
In [11]: b['int64'].values[0,1] = -2
Since we have a view, you can then change the underlying data.
In [12]: df
Out[12]:
A B C
0 -1 -2 foo
1 1 7 foo
2 2 8 foo
3 3 9 foo
4 4 10 foo
Note that if you modify the shape of the data (e.g. if you add a column) then the views will be lost.
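As a postscript, on modern pandas (where as_blocks has since been deprecated and removed) direct label-based setting on a mixed-dtype frame is supported out of the box; a minimal sketch:
import pandas as pd

df = pd.DataFrame({'A': range(5), 'B': range(6, 11), 'C': 'foo'})

# set two numeric cells in a mixed-dtype frame directly
df.loc[0, ['A', 'B']] = [-1, -2]
print(df.head(3))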
