How to speed up LabelEncoder when recoding a categorical variable into integers - python

I have a large csv with two strings per row in this form:
g,k
a,h
c,i
j,e
d,i
i,h
b,b
d,d
i,a
d,h
I read in the first two columns and recode the strings to integers as follows:
import pandas as pd
df = pd.read_csv("test.csv", usecols=[0,1], prefix="ID_", header=None)
from sklearn.preprocessing import LabelEncoder
# Initialize the LabelEncoder.
le = LabelEncoder()
le.fit(df.values.flat)
# Convert to digits.
df = df.apply(le.transform)
This code is from https://stackoverflow.com/a/39419342/2179021.
The code works very well but is slow when df is large. I timed each step and the result was surprising to me.
pd.read_csv takes about 40 seconds.
le.fit(df.values.flat) takes about 30 seconds.
df = df.apply(le.transform) takes about 250 seconds.
Is there any way to speed up this last step? It feels like it should be the fastest step of them all!
Here are more timings for the recoding step, on a computer with 4 GB of RAM.
The answer below by maxymoo is fast but doesn't give the right answer. Taking the example csv from the top of the question, it translates it to:
0 1
0 4 6
1 0 4
2 2 5
3 6 3
4 3 5
5 5 4
6 1 1
7 3 2
8 5 0
9 3 4
Notice that 'd' is mapped to 3 in the first column but 2 in the second.
I tried the solution from https://stackoverflow.com/a/39356398/2179021 and got the following.
df = pd.DataFrame({'ID_0':np.random.randint(0,1000,1000000), 'ID_1':np.random.randint(0,1000,1000000)}).astype(str)
df.info()
memory usage: 7.6MB
%timeit x = (df.stack().astype('category').cat.rename_categories(np.arange(len(df.stack().unique()))).unstack())
1 loops, best of 3: 1.7 s per loop
Then I increased the dataframe size by a factor of 10.
df = pd.DataFrame({'ID_0':np.random.randint(0,1000,10000000), 'ID_1':np.random.randint(0,1000,10000000)}).astype(str)
df.info()
memory usage: 76.3+ MB
%timeit x = (df.stack().astype('category').cat.rename_categories(np.arange(len(df.stack().unique()))).unstack())
MemoryError Traceback (most recent call last)
This method appears to use so much RAM trying to translate this relatively small dataframe that it crashes.
I also timed LabelEncoder with the larger dataset of 10 million rows. It runs without crashing, but the fit line alone took 50 seconds. The df.apply(le.transform) step took about 80 seconds.
How can I:
1. Get something with roughly the speed of maxymoo's answer and roughly the memory usage of LabelEncoder, but that gives the right answer when the dataframe has two columns?
2. Store the mapping so that I can reuse it for different data (as LabelEncoder allows me to do)?

It looks like it will be much faster to use the pandas category datatype; internally this uses a hash table, whereas LabelEncoder uses a sorted search:
In [87]: df = pd.DataFrame({'ID_0': np.random.randint(0, 1000, 1000000),
                            'ID_1': np.random.randint(0, 1000, 1000000)}).astype(str)
In [88]: le.fit(df.values.flat)
%time x = df.apply(le.transform)
CPU times: user 6.28 s, sys: 48.9 ms, total: 6.33 s
Wall time: 6.37 s
In [89]: %time x = df.apply(lambda x: x.astype('category').cat.codes)
CPU times: user 301 ms, sys: 28.6 ms, total: 330 ms
Wall time: 331 ms
EDIT: Here is a custom transformer class that you could use (you probably won't see this in an official scikit-learn release, since the maintainers don't want to have pandas as a dependency):
import pandas as pd
from pandas.core.nanops import unique1d
from sklearn.base import BaseEstimator, TransformerMixin
class PandasLabelEncoder(BaseEstimator, TransformerMixin):
    def fit(self, y):
        self.classes_ = unique1d(y)
        return self

    def transform(self, y):
        s = pd.Series(y).astype('category', categories=self.classes_)
        return s.cat.codes
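A rough usage sketch, assuming the df from the question (note that astype('category', categories=...) is the older pandas spelling; on current pandas the transform body would use pd.Categorical(y, categories=self.classes_).codes):
ple = PandasLabelEncoder()
ple.fit(df.values.flat)            # learn one shared set of classes for both columns
encoded = df.apply(ple.transform)  # the same mapping is applied to every column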

I tried this with the DataFrame:
In [xxx]: import string
In [xxx]: letters = np.array([c for c in string.ascii_lowercase])
In [249]: df = pd.DataFrame({'ID_0': np.random.choice(letters, 10000000), 'ID_1':np.random.choice(letters, 10000000)})
It looks like this:
In [261]: df.head()
Out[261]:
ID_0 ID_1
0 v z
1 i i
2 d n
3 z r
4 x x
In [262]: df.shape
Out[262]: (10000000, 2)
So, 10 million rows. Locally, my timings are:
In [257]: % timeit le.fit(df.values.flat)
1 loops, best of 3: 17.2 s per loop
In [258]: % timeit df2 = df.apply(le.transform)
1 loops, best of 3: 30.2 s per loop
Then I made a dict mapping letters to numbers and used pandas.Series.map:
In [248]: letters = np.array([l for l in string.ascii_lowercase])
In [263]: d = dict(zip(letters, range(26)))
In [273]: %timeit for c in df.columns: df[c] = df[c].map(d)
1 loops, best of 3: 1.12 s per loop
In [274]: df.head()
Out[274]:
ID_0 ID_1
0 21 25
1 8 8
2 3 13
3 25 17
4 23 23
So that might be an option. The dict just needs to have all of the values that occur in the data.
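To reuse that mapping on later data, keep in mind that .map returns NaN for values that were not in the dict, so unseen strings can be flagged with a sentinel. A minimal sketch, assuming the dict d built above and a hypothetical new frame:
new_df = pd.DataFrame({'ID_0': ['a', 'zz'], 'ID_1': ['z', 'b']})  # 'zz' was never seen
for c in new_df.columns:
    new_df[c] = new_df[c].map(d).fillna(-1).astype(int)           # unseen values become -1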
EDIT: The OP asked what timing I have for that second option, with categories. This is what I get:
In [40]: %timeit x=df.stack().astype('category').cat.rename_categories(np.arange(len(df.stack().unique()))).unstack()
1 loops, best of 3: 13.5 s per loop
EDIT: per the 2nd comment:
In [45]: %timeit uniques = np.sort(pd.unique(df.values.ravel()))
1 loops, best of 3: 933 ms per loop
In [46]: %timeit dfc = df.apply(lambda x: x.astype('category', categories=uniques))
1 loops, best of 3: 1.35 s per loop
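Putting those two pieces together answers both parts of the question: compute the uniques once, give every column the same categories, and keep the uniques array as the stored mapping. A sketch using pd.Categorical, which on current pandas replaces the astype('category', categories=...) form used above:
uniques = np.sort(pd.unique(df.values.ravel()))
codes = df.apply(lambda col: pd.Categorical(col, categories=uniques).codes)
# code i always means uniques[i], in both columns, and the same uniques array
# can be reused on new data (values not in it are encoded as -1)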

I would like to point out an alternative solution that should serve many readers well. Although I prefer to have a known set of IDs, it is not always necessary if all you need is a strictly one-way remapping.
Instead of
df[c] = df[c].apply(le.transform)
or
dict_table = {val: i for i, val in enumerate(uniques)}
df[c] = df[c].map(dict_table)
or (check out _encode() and _encode_python() in the sklearn source code, which I assume is faster on average than the other methods mentioned)
df[c] = np.array([dict_table[v] for v in df[c].values])
you can instead do
df[c] = df[c].apply(hash)
Pros: much faster, less memory needed, no training, hashes can be reduced to smaller representations (more collisions by casting dtype).
Cons: gives funky numbers, can have collisions (not guaranteed to be perfectly unique), and the values are not stable across runs or Python versions (Python salts string hashes per process unless PYTHONHASHSEED is fixed).
Note that the secure hash functions will have fewer collisions at the cost of speed.
Example of when to use: You have somewhat long strings that are mostly unique and the data set is huge. Most importantly, you don't care about rare hash collisions even though it can be a source of noise in your model's predictions.
I've tried all the methods above and my workload was taking about 90 minutes to learn the encoding from training (1M rows and 600 features) and reapply that to several test sets, while also dealing with new values. The hash method brought it down to a few minutes and I don't need to save any model.
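Because of that per-process salting, the built-in hash is not reproducible between runs. If you need stable codes, a keyed hash from the standard library is a possible substitute, at some speed cost. A rough sketch, not the method timed above:
import hashlib

def stable_hash64(value):
    # 8-byte BLAKE2b digest -> signed 64-bit integer; stable across runs and machines
    digest = hashlib.blake2b(str(value).encode('utf-8'), digest_size=8).digest()
    return int.from_bytes(digest, 'big', signed=True)

for c in df.columns:
    df[c] = df[c].apply(stable_hash64)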

Related

Adding a column to a dataframe that holds each row's duplicate count takes too long

I've read SOF posts on how to create a field that contains the number of duplicates that row contains in a pandas DataFrame. Without using any other libraries, I tried writing a function that does this, and it works on small DataFrame objects; however, it takes way too long on larger ones and consumes too much memory.
This is the function:
def count_duplicates(dataframe):
    function = lambda x: dataframe.to_numpy().tolist().count(x.to_list()) - 1
    return dataframe.apply(function, axis=1)
I ran dir() on the numpy array returned by DataFrame.to_numpy, and I didn't see a method quite like list.count. The reason this takes so long is that each row has to be compared with every row in the numpy array. I'd like a much more efficient way to do this, even if it doesn't use a pandas DataFrame. I feel like there should be a simple way to do this with numpy, but I'm just not familiar enough with it. I've been testing different approaches for a while, but they keep resulting in errors. I'm going to keep testing, but felt the community might provide a better way.
Thank you for your help.
Here is an example DataFrame:
one two
0 1 1
1 2 2
2 3 3
3 1 1
I'd use it like this:
d['duplicates'] = count_duplicates(d)
The resulting DataFrame is:
one two duplicates
0 1 1 1
1 2 2 0
2 3 3 0
3 1 1 1
The problem is the actual DataFrame will have 1.4 million rows, and each lambda takes an average of 0.148558 seconds, which if multiplied by 1.4 million rows is about 207981.459 seconds or 57.772 hours. I need a much faster way to accomplish this.
Thank you again.
I updated the function which is speeding things up:
def _counter(series_to_count, list_of_lists):
    return list_of_lists.count(series_to_count.to_list()) - 1

def count_duplicates(dataframe):
    df_list = dataframe.to_numpy().tolist()
    return dataframe.apply(_counter, args=(df_list,), axis=1)
This takes only 29.487 seconds. The bottleneck was converting the dataframe on each function call.
I'm still interested in optimizing this. I'd like to get this down to 2-3 seconds if at all possible. It may not be, but I'd like to make sure it is as fast as possible.
Thank you again.
Here is a vectorized way to do this. For 1.4 million rows, with an average of 140 duplicates for each row, it takes under 0.05 seconds. When there are no duplicates at all, it takes about 0.4 seconds.
d['duplicates'] = d.groupby(['one', 'two'], sort=False)['one'].transform('size') - 1
On your example:
>>> d
one two duplicates
0 1 1 1
1 2 2 0
2 3 3 0
3 1 1 1
Speed
Relatively high rate of duplicates:
n = 1_400_000
d = pd.DataFrame(np.random.randint(0, 100, size=(n, 2)), columns='one two'.split())
%timeit d.groupby(['one', 'two'], sort=False)['one'].transform('size') - 1
# 48.3 ms ± 110 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
# how many duplicates on average?
>>> (d.groupby(['one', 'two'], sort=False)['one'].transform('size') - 1).mean()
139.995841
# (as expected: n / 100**2)
No duplicates
n = 1_400_000
d = pd.DataFrame(np.arange(2 * n).reshape(-1, 2), columns='one two'.split())
%timeit d.groupby(['one', 'two'], sort=False)['one'].transform('size') - 1
# 389 ms ± 1.55 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
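If you want to keep the original count_duplicates interface, the groupby version drops in directly. A sketch (it groups on every column of the frame passed in, so pass only the key columns):
def count_duplicates(dataframe):
    cols = list(dataframe.columns)
    # size of each group of identical rows, minus the row itself
    return dataframe.groupby(cols, sort=False)[cols[0]].transform('size') - 1

d['duplicates'] = count_duplicates(d[['one', 'two']])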

Use pandas.read_csv to convert a comma-separated string into a dataframe

How can I use Pandas read_csv to convert a big list quickly into a dataframe?
import pandas as pd
x = '1,2,3,4,5,7,8,9'
df = pd.read_csv(x)
I know that I could split the string by comma -> put it into a list -> convert to a dataframe, but was wondering whether there is a way to do this with pd.read_csv that would be faster.
x = '1,2,3,4,5,7,8,9'
df = pd.read_csv(pd.io.common.StringIO(x), header=None)
df
0 1 2 3 4 5 7 8
0 1 2 3 4 5 7 8 9
That is about the best you can do with pd.read_csv.
Consider the much larger string
y = '\n'.join([','.join(['0,1,2,3,4,5,6,7,8,9'] * 100)] * 1000)
And compare timing of these two options
%timeit pd.DataFrame([l.split(',') for l in y.split('\n')]).astype(int)
%timeit pd.read_csv(pd.io.common.StringIO(y), header=None)
1 loop, best of 3: 200 ms per loop
10 loops, best of 3: 125 ms per loop
If all we needed to do was split the string, split would be faster. However, one of the things pd.read_csv does for us is parse the integers. That parsing gets expensive when it has to be done separately after the split.
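For a single short numeric line like the x above, the fixed overhead of read_csv dominates, so a plain split with numpy doing the integer parsing may be quicker. A sketch, not timed in the original answer:
import numpy as np
import pandas as pd

x = '1,2,3,4,5,7,8,9'
df = pd.DataFrame([np.array(x.split(','), dtype=int)])  # one row of parsed integers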

Python applymap taking a long time to run

I have a matrix of data (55K x 8.5K) with counts. Most of the entries are zeros, but a few of them hold arbitrary counts. Let's say something like this:
a b c
0 4 3 3
1 1 2 1
2 2 1 0
3 2 0 1
4 2 0 4
I want to binarize the cell values.
I did the following:
df_preference=df_recommender.applymap(lambda x: np.where(x >0, 1, 0))
The code works fine, but it takes a lot of time to run.
Why is that?
Is there a faster way?
Thanks
Edit:
Error when doing df.to_pickle
df_preference.to_pickle('df_preference.pickle')
I get this:
---------------------------------------------------------------------------
SystemError Traceback (most recent call last)
<ipython-input-16-3fa90d19520a> in <module>()
1 # Pickling the data to the disk
2
----> 3 df_preference.to_pickle('df_preference.pickle')
\\dwdfhome01\Anaconda\lib\site-packages\pandas\core\generic.pyc in to_pickle(self, path)
1170 """
1171 from pandas.io.pickle import to_pickle
-> 1172 return to_pickle(self, path)
1173
1174 def to_clipboard(self, excel=None, sep=None, **kwargs):
\\dwdfhome01\Anaconda\lib\site-packages\pandas\io\pickle.pyc in to_pickle(obj, path)
13 """
14 with open(path, 'wb') as f:
---> 15 pkl.dump(obj, f, protocol=pkl.HIGHEST_PROTOCOL)
16
17
SystemError: error return without exception set
UPDATE:
read this topic and this issue regarding your error.
Try to save your DF as HDF5 - it's much more convenient.
You may also want to read this comparison...
OLD answer:
try this:
In [110]: (df>0).astype(np.int8)
Out[110]:
a b c
0 1 1 1
1 1 1 1
2 1 1 0
3 1 0 1
4 1 0 1
.applymap() is one of the slowest methods, because it visits every cell individually (basically it performs nested loops internally).
df > 0 operates on the whole array at once (vectorized), so it is much faster.
.apply() is faster than .applymap() because it works column by column, but it is still much slower than df > 0.
UPDATE2: time comparison on a smaller DF (1000 x 1000), as applymap() will take ages on (55K x 9K) DF:
In [5]: df = pd.DataFrame(np.random.randint(0, 10, size=(1000, 1000)))
In [6]: %timeit df.applymap(lambda x: np.where(x >0, 1, 0))
1 loop, best of 3: 3.75 s per loop
In [7]: %timeit df.apply(lambda x: np.where(x >0, 1, 0))
1 loop, best of 3: 256 ms per loop
In [8]: %timeit (df>0).astype(np.int8)
100 loops, best of 3: 2.95 ms per loop
You could use a scipy sparse matrix. This would make the computation touch only the data that is actually there instead of operating on all the zeros.
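A rough sketch of that idea, assuming the counts in df_recommender fit in a SciPy CSR matrix:
import numpy as np
from scipy import sparse

sp = sparse.csr_matrix(df_recommender.values)   # only non-zero cells are stored
sp.data = np.ones_like(sp.data, dtype=np.int8)  # binarize: every stored entry becomes 1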

Writing to multiple adjacent columns in pandas efficiently

With a numpy ndarray it is possible to write to multiple columns at a time without making a copy first (as long as they are adjacent). If I wanted to write to the first three columns of an array I would write
a[0,0:3] = 1,2,3 # this is very fast ('a' is a numpy ndarray)
I was hoping that in pandas I would similarly be able to select multiple adjacent columns by "label-slicing" like so (assuming the first 3 columns are labeled 'a','b','c')
a.loc[0,'a':'c'] = 1,2,3 # this works but is very slow ('a' is a pandas DataFrame)
or similarly
a.iloc[0,3:6] = 1,2,3 # this is equally as slow
However, this takes several hundred milliseconds, compared to writing to a numpy array, which takes only a few microseconds. I'm unclear on whether pandas is making a copy of the array under the hood. The only way I could find to write to the dataframe in this way with good speed is to work on the underlying ndarray directly:
a.values[0,0:3] = 1,2,3 # this works fine and is fast
Have I missed something in the pandas docs, or is there no way to do multiple adjacent column indexing on a pandas dataframe with speed comparable to numpy?
Edit
Here's the actual dataframe I am working with.
>> conn = sqlite3.connect('prath.sqlite')
>> prath = pd.read_sql("select image_id,pixel_index,skin,r,g,b from pixels",conn)
>> prath.shape
(5913307, 6)
>> prath.head()
image_id pixel_index skin r g b
0 21 113764 0 0 0 0
1 13 187789 0 183 149 173
2 17 535758 0 147 32 35
3 31 6255 0 116 1 16
4 15 119272 0 238 229 224
>> prath.dtypes
image_id int64
pixel_index int64
skin int64
r int64
g int64
b int64
dtype: object
Here are some runtime comparisons for the different indexing methods (again, the pandas indexing is very slow):
>> %timeit prath.loc[0,'r':'b'] = 4,5,6
1 loops, best of 3: 888 ms per loop
>> %timeit prath.iloc[0,3:6] = 4,5,6
1 loops, best of 3: 894 ms per loop
>> %timeit prath.values[0,3:6] = 4,5,6
100000 loops, best of 3: 4.8 µs per loop
Edit to clarify: I don't believe pandas has a direct analog to setting a view in numpy in terms of both speed and syntax. iloc and loc are probably the most direct analog in terms of syntax and purpose, but are much slower. This is a fairly common situation with numpy and pandas. Pandas does a lot more than numpy (labeled columns/indexes, automatic alignment, etc.), but is slower to varying degrees. When you need speed and can do things in numpy, then do them in numpy.
I think in a nutshell that the tradeoff here is that loc and iloc will be slower but work 100% of the time whereas values will be fast but not always work (to be honest, I didn't even realize it would work in the way you got it to work).
But here's a really simple example where values doesn't work because column 'g' is a float rather than integer.
prath['g'] = 3.33
prath.values[0,3:6] = 4,5,6
prath.head(3)
image_id pixel_index skin r g b
0 21 113764 0 0 3.33 0
1 13 187789 0 183 3.33 173
2 17 535758 0 147 3.33 35
prath.iloc[0,3:6] = 4,5,6
prath.head(3)
image_id pixel_index skin r g b
0 21 113764 0 4 5.00 6
1 13 187789 0 183 3.33 173
2 17 535758 0 147 3.33 35
You can often get numpy-like speed and behavior from pandas when columns are of a homogeneous type, but you want to be careful about this. Edit to add: As #toes notes in the comment, the documentation does state that you can do this with homogeneous data. However, it's potentially very error prone as the example above shows, and I don't think many people would consider it good general practice in pandas.
My general recommendation would be to do things in numpy in cases where you need the speed (and have homogeneous data types), and pandas when you don't. The nice thing is that numpy and pandas play well together so it's really not that hard to convert between dataframes and arrays as you go.
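For example, one common pattern is to pull the homogeneous columns out as an array, do the heavy writes there, and assign the block back in one shot. A sketch, assuming the prath frame above (.to_numpy() is the newer spelling of .values):
rgb = prath[['r', 'g', 'b']].to_numpy(copy=True)  # independent, writable copy of the numeric columns
rgb[0, :] = 4, 5, 6                               # fast, numpy-style writes
prath[['r', 'g', 'b']] = rgb                      # single assignment back into the frame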
Edit to add: The following seems to work (albeit with a warning) even with column 'g' as a float. Its speed is in between the values approach and the loc/iloc approaches. I'm not sure whether it can be expected to work all the time, though. Just putting it out there as a possible middle way.
prath[0:1][['r','g','b']] = 4,5,6
We are adding the ability to index directly even in a multi-dtype frame. This is in master now and will be in 0.17.0. You can do this in < 0.17.0, but it requires (more) manipulation of the internals.
In [1]: df = DataFrame({'A' : range(5), 'B' : range(6,11), 'C' : 'foo'})
In [2]: df.dtypes
Out[2]:
A int64
B int64
C object
dtype: object
The copy=False flag is new. This gives you a dict of dtypes->blocks (which are dtype separable)
In [3]: b = df.as_blocks(copy=False)
In [4]: b
Out[4]:
{'int64': A B
0 0 6
1 1 7
2 2 8
3 3 9
4 4 10, 'object': C
0 foo
1 foo
2 foo
3 foo
4 foo}
Here is the underlying numpy array.
In [5]: b['int64'].values
Out[5]:
array([[ 0, 6],
[ 1, 7],
[ 2, 8],
[ 3, 9],
[ 4, 10]])
This is the array in the original data set
In [7]: id(df._data.blocks[0].values)
Out[7]: 4429267232
Here is our view on it. They are the same
In [8]: id(b['int64'].values.base)
Out[8]: 4429267232
Now you can access the frame, and use pandas set operations to modify.
You can also directly access the numpy array via .values, which is now a VIEW into the original.
You will not incur any speed penalty for modifications as copies won't be made as long as you don't change the dtype of the data itself (e.g. don't try to put a string here; it will work but the view will be lost)
In [9]: b['int64'].loc[0,'A'] = -1
In [11]: b['int64'].values[0,1] = -2
Since we have a view, you can then change the underlying data.
In [12]: df
Out[12]:
A B C
0 -1 -2 foo
1 1 7 foo
2 2 8 foo
3 3 9 foo
4 4 10 foo
Note that if you modify the shape of the data (e.g. if you add a column for example) then the views will be lost.

How to efficiently select rows from a pandas DataFrame?

The following table contains some keys and values:
N = 100
tbl = pd.DataFrame({'key': np.random.randint(0, 10, N),
                    'y': np.random.rand(N), 'z': np.random.rand(N)})
I would like to obtain a DataFrame in which each row contains a key and all the fields that correspond to the minimal value of a specified field.
Since the original table is very large, I'm interested in the most efficient way.
NOTE getting the minimal value of a field is simple:
tbl.groupby('key').agg(pd.Series.min)
But this takes the minimum value of every field independently; I would like to know what the minimum value of y is and which z value corresponds to it.
Below I post an answer to my question with my naive approach, but I suspect there are better ways
Here is a straightforward approach:
gr = tbl.groupby('key')
def take_min_y(t):
    ix = t.y.argmin()
    return t.loc[[ix]]
tbl_mins = gr.apply(take_min_y)
Is there a better way?
Based on your updated edit I believe the following is what you want:
In [107]:
tbl.iloc[gr['y'].agg(pd.Series.idxmin)]
Out[107]:
key y z
47 0 0.094841 0.221435
26 1 0.062200 0.748082
45 2 0.032497 0.160199
28 3 0.002242 0.064829
73 4 0.122438 0.723844
75 5 0.128193 0.638933
79 6 0.071833 0.952624
86 7 0.058974 0.113317
36 8 0.068757 0.611111
12 9 0.082604 0.271268
idxmin returns the index of the min value; we can then use this to filter the original dataframe and select those rows.
Timings show this method is approx 7 times faster:
In [108]:
%timeit tbl.iloc[gr['y'].agg(pd.Series.idxmin)]
def take_min_y(t):
    ix = t.y.argmin()
    return t.loc[[ix]]
%timeit tbl_mins = gr.apply(take_min_y)
1000 loops, best of 3: 1.08 ms per loop
100 loops, best of 3: 7.06 ms per loop
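On recent pandas the same idea can be written without the agg wrapper, since groupby objects expose idxmin directly; .loc is used because idxmin returns index labels. A sketch equivalent to the accepted line above:
tbl_mins = tbl.loc[tbl.groupby('key')['y'].idxmin()]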
