Writing to multiple adjacent columns in pandas efficiently

Writing to multiple adjacent columns in pandas efficiently - python

With a numpy ndarray it is possible to write to multiple columns at a time without making a copy first (as long as they are adjacent). If I wanted to write to the first three columns of an array I would write
a[0,0:3] = 1,2,3 # this is very fast ('a' is a numpy ndarray)
I was hoping that in pandas I would similarly be able to select multiple adjacent columns by "label-slicing" like so (assuming the first 3 columns are labeled 'a','b','c')
a.loc[0,'a':'c'] = 1,2,3 # this works but is very slow ('a' is a pandas DataFrame)
or similarly
a.iloc[0,3:6] = 1,2,3 # this is equally as slow
However, this takes several 100s of milliseconds as compared to writing to a numpy array which takes only a few microseconds. I'm unclear on whether pandas is making a copy of the array under the hood. The only way I could find to write to the dataframe in this way that gives good speed is to work on the underlying ndarray directly
a.values[0,0:3] = 1,2,3 # this works fine and is fast
Have I missed something in the Pandas docs or is their no way to do multiple adjacent column indexing on a Pandas dataframe with speed comparable to numpy?
Edit
Here's the actual dataframe I am working with.
>> conn = sqlite3.connect('prath.sqlite')
>> prath = pd.read_sql("select image_id,pixel_index,skin,r,g,b from pixels",conn)
>> prath.shape
(5913307, 6)
>> prath.head()
image_id pixel_index skin r g b
0 21 113764 0 0 0 0
1 13 187789 0 183 149 173
2 17 535758 0 147 32 35
3 31 6255 0 116 1 16
4 15 119272 0 238 229 224
>> prath.dtypes
image_id int64
pixel_index int64
skin int64
r int64
g int64
b int64
dtype: object
Here is some runtime comparisons for the different indexing methods (again, pandas indexing is very slow)
>> %timeit prath.loc[0,'r':'b'] = 4,5,6
1 loops, best of 3: 888 ms per loop
>> %timeit prath.iloc[0,3:6] = 4,5,6
1 loops, best of 3: 894 ms per loop
>> %timeit prath.values[0,3:6] = 4,5,6
100000 loops, best of 3: 4.8 µs per loop

Edit to clarify: I don't believe pandas has a direct analog to setting a view in numpy in terms of both speed and syntax. iloc and loc are probably the most direct analog in terms of syntax and purpose, but are much slower. This is a fairly common situation with numpy and pandas. Pandas does a lot more than numpy (labeled columns/indexes, automatic alignment, etc.), but is slower to varying degrees. When you need speed and can do things in numpy, then do them in numpy.
I think in a nutshell that the tradeoff here is that loc and iloc will be slower but work 100% of the time whereas values will be fast but not always work (to be honest, I didn't even realize it would work in the way you got it to work).
But here's a really simple example where values doesn't work because column 'g' is a float rather than integer.
prath['g'] = 3.33
prath.values[0,3:6] = 4,5,6
prath.head(3)
image_id pixel_index skin r g b
0 21 113764 0 0 3.33 0
1 13 187789 0 183 3.33 173
2 17 535758 0 147 3.33 35
prath.iloc[0,3:6] = 4,5,6
prath.head(3)
image_id pixel_index skin r g b
0 21 113764 0 4 5.00 6
1 13 187789 0 183 3.33 173
2 17 535758 0 147 3.33 35
You can often get numpy-like speed and behavior from pandas when columns are of homogeneous type, you want to be careful about this. Edit to add: As #toes notes in the comment, the documentation does state that you can do this with homogeneous data. However, it's potentially very error prone as the example above shows, and I don't think many people would consider this a good general practice in pandas.
My general recommendation would be to do things in numpy in cases where you need the speed (and have homogeneous data types), and pandas when you don't. The nice thing is that numpy and pandas play well together so it's really not that hard to convert between dataframes and arrays as you go.
Edit to add: The following seems to work (albeit with a warning) even with column 'g' as a float. The speed is in between the values way and loc/iloc ways. I'm not sure if this can be expected to work all the time though. Just putting it out as a possible middle way.
prath[0:1][['r','g','b']] = 4,5,6

We are adding the ability to index directly even in a multi-dtype frame. This is in master now and will be in 0.17.0. You can do this in < 0.17.0, but it requires (more) manipulation of the internals.
In [1]: df = DataFrame({'A' : range(5), 'B' : range(6,11), 'C' : 'foo'})
In [2]: df.dtypes
Out[2]:
A int64
B int64
C object
dtype: object
The copy=False flag is new. This gives you a dict of dtypes->blocks (which are dtype separable)
In [3]: b = df.as_blocks(copy=False)
In [4]: b
Out[4]:
{'int64': A B
0 0 6
1 1 7
2 2 8
3 3 9
4 4 10, 'object': C
0 foo
1 foo
2 foo
3 foo
4 foo}
Here is the underlying numpy array.
In [5]: b['int64'].values
Out[5]:
array([[ 0, 6],
[ 1, 7],
[ 2, 8],
[ 3, 9],
[ 4, 10]])
This is the array in the original data set
In [7]: id(df._data.blocks[0].values)
Out[7]: 4429267232
Here is our view on it. They are the same
In [8]: id(b['int64'].values.base)
Out[8]: 4429267232
Now you can access the frame, and use pandas set operations to modify.
You can also directly access the numpy array via .values, which is now a VIEW into the original.
You will not incur any speed penalty for modifications as copies won't be made as long as you don't change the dtype of the data itself (e.g. don't try to put a string here; it will work but the view will be lost)
In [9]: b['int64'].loc[0,'A'] = -1
In [11]: b['int64'].values[0,1] = -2
Since we have a view, you can then change the underlying data.
In [12]: df
Out[12]:
A B C
0 -1 -2 foo
1 1 7 foo
2 2 8 foo
3 3 9 foo
4 4 10 foo
Note that if you modify the shape of the data (e.g. if you add a column for example) then the views will be lost.

Related

Pandas Data Frame get the values only not the definition [duplicate]

This seems like a ridiculously easy question... but I'm not seeing the easy answer I was expecting.
So, how do I get the value at an nth row of a given column in Pandas? (I am particularly interested in the first row, but would be interested in a more general practice as well).
For example, let's say I want to pull the 1.2 value in Btime as a variable.
Whats the right way to do this?
>>> df_test
ATime X Y Z Btime C D E
0 1.2 2 15 2 1.2 12 25 12
1 1.4 3 12 1 1.3 13 22 11
2 1.5 1 10 6 1.4 11 20 16
3 1.6 2 9 10 1.7 12 29 12
4 1.9 1 1 9 1.9 11 21 19
5 2.0 0 0 0 2.0 8 10 11
6 2.4 0 0 0 2.4 10 12 15

To select the ith row, use iloc:
In [31]: df_test.iloc[0]
Out[31]:
ATime 1.2
X 2.0
Y 15.0
Z 2.0
Btime 1.2
C 12.0
D 25.0
E 12.0
Name: 0, dtype: float64
To select the ith value in the Btime column you could use:
In [30]: df_test['Btime'].iloc[0]
Out[30]: 1.2
There is a difference between df_test['Btime'].iloc[0] (recommended) and df_test.iloc[0]['Btime']:
DataFrames store data in column-based blocks (where each block has a single
dtype). If you select by column first, a view can be returned (which is
quicker than returning a copy) and the original dtype is preserved. In contrast,
if you select by row first, and if the DataFrame has columns of different
dtypes, then Pandas copies the data into a new Series of object dtype. So
selecting columns is a bit faster than selecting rows. Thus, although
df_test.iloc[0]['Btime'] works, df_test['Btime'].iloc[0] is a little bit
more efficient.
There is a big difference between the two when it comes to assignment.
df_test['Btime'].iloc[0] = x affects df_test, but df_test.iloc[0]['Btime']
may not. See below for an explanation of why. Because a subtle difference in
the order of indexing makes a big difference in behavior, it is better to use single indexing assignment:
df.iloc[0, df.columns.get_loc('Btime')] = x
df.iloc[0, df.columns.get_loc('Btime')] = x (recommended):
The recommended way to assign new values to a
DataFrame is to avoid chained indexing, and instead use the method shown by
andrew,
df.loc[df.index[n], 'Btime'] = x
or
df.iloc[n, df.columns.get_loc('Btime')] = x
The latter method is a bit faster, because df.loc has to convert the row and column labels to
positional indices, so there is a little less conversion necessary if you use
df.iloc instead.
df['Btime'].iloc[0] = x works, but is not recommended:
Although this works, it is taking advantage of the way DataFrames are currently implemented. There is no guarantee that Pandas has to work this way in the future. In particular, it is taking advantage of the fact that (currently) df['Btime'] always returns a
view (not a copy) so df['Btime'].iloc[n] = x can be used to assign a new value
at the nth location of the Btime column of df.
Since Pandas makes no explicit guarantees about when indexers return a view versus a copy, assignments that use chained indexing generally always raise a SettingWithCopyWarning even though in this case the assignment succeeds in modifying df:
In [22]: df = pd.DataFrame({'foo':list('ABC')}, index=[0,2,1])
In [24]: df['bar'] = 100
In [25]: df['bar'].iloc[0] = 99
/home/unutbu/data/binky/bin/ipython:1: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
self._setitem_with_indexer(indexer, value)
In [26]: df
Out[26]:
foo bar
0 A 99 <-- assignment succeeded
2 B 100
1 C 100
df.iloc[0]['Btime'] = x does not work:
In contrast, assignment with df.iloc[0]['bar'] = 123 does not work because df.iloc[0] is returning a copy:
In [66]: df.iloc[0]['bar'] = 123
/home/unutbu/data/binky/bin/ipython:1: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
In [67]: df
Out[67]:
foo bar
0 A 99 <-- assignment failed
2 B 100
1 C 100
Warning: I had previously suggested df_test.ix[i, 'Btime']. But this is not guaranteed to give you the ith value since ix tries to index by label before trying to index by position. So if the DataFrame has an integer index which is not in sorted order starting at 0, then using ix[i] will return the row labeled i rather than the ith row. For example,
In [1]: df = pd.DataFrame({'foo':list('ABC')}, index=[0,2,1])
In [2]: df
Out[2]:
foo
0 A
2 B
1 C
In [4]: df.ix[1, 'foo']
Out[4]: 'C'

Note that the answer from #unutbu will be correct until you want to set the value to something new, then it will not work if your dataframe is a view.
In [4]: df = pd.DataFrame({'foo':list('ABC')}, index=[0,2,1])
In [5]: df['bar'] = 100
In [6]: df['bar'].iloc[0] = 99
/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/pandas-0.16.0_19_g8d2818e-py2.7-macosx-10.9-x86_64.egg/pandas/core/indexing.py:118: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame
See the the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
self._setitem_with_indexer(indexer, value)
Another approach that will consistently work with both setting and getting is:
In [7]: df.loc[df.index[0], 'foo']
Out[7]: 'A'
In [8]: df.loc[df.index[0], 'bar'] = 99
In [9]: df
Out[9]:
foo bar
0 A 99
2 B 100
1 C 100

Another way to do this:
first_value = df['Btime'].values[0]
This way seems to be faster than using .iloc:
In [1]: %timeit -n 1000 df['Btime'].values[20]
5.82 µs ± 142 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [2]: %timeit -n 1000 df['Btime'].iloc[20]
29.2 µs ± 1.28 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

df.iloc[0].head(1) - First data set only from entire first row.
df.iloc[0] - Entire First row in column.

In a general way, if you want to pick up the first N rows from the J column from pandas dataframe the best way to do this is:
data = dataframe[0:N][:,J]

To access a single value you can use the method iat that is much faster than iloc:
df['Btime'].iat[0]
You can also use the method take:
df['Btime'].take(0)

.iat and .at are the methods for getting and setting single values and are much faster than .iloc and .loc. Mykola Zotko pointed this out in their answer, but they did not use .iat to its full extent.
When we can use .iat or .at, we should only have to index into the dataframe once.
This is not great:
df['Btime'].iat[0]
It is not ideal because the 'Btime' column was first selected as a series, then .iat was used to index into that series.
These two options are the best:
Using zero-indexed positions:
df.iat[0, 4] # get the value in the zeroth row, and 4th column
Using Labels:
df.at[0, 'Btime'] # get the value where the index label is 0 and the column name is "Btime".
Both methods return the value of 1.2.

To get e.g the value from column 'test' and row 1 it works like
df[['test']].values[0][0]
as only df[['test']].values[0] gives back a array

Another way of getting the first row and preserving the index:
x = df.first('d') # Returns the first day. '3d' gives first three days.

According to pandas docs, at is the fastest way to access a scalar value such as the use case in the OP (already suggested by Alex on this page).
Building upon Alex's answer, because dataframes don't necessarily have a range index it might be more complete to index df.index (since dataframe indexes are built on numpy arrays, you can index them like an array) or call get_loc() on columns to get the integer location of a column.
df.at[df.index[0], 'Btime']
df.iat[0, df.columns.get_loc('Btime')]
One common problem is that if you used a boolean mask to get a single value, but ended up with a value with an index (actually a Series); e.g.:
0 1.2
Name: Btime, dtype: float64
you can use squeeze() to get the scalar value, i.e.
df.loc[df['Btime']<1.3, 'Btime'].squeeze()

Finding top N values for each group, 200 million rows

I have a pandas DataFrame which has around 200 million rows and looks like this:
UserID MovieID Rating
1 455 5
2 411 4
1 288 2
2 300 3
2 137 5
1 300 3
...
I want to get top N movies for each user sorted by rating in descending order, so for N=2 the output should look like this:
UserID MovieID Rating
1 455 5
1 300 3
2 137 5
2 411 4
When I try to do it like this, I get a 'memory error' caused by the 'groupby' (I have 8gb of RAM on my machine)
df.sort_values(by=['rating']).groupby('userID').head(2)
Any suggestions?

Quick and dirty answer
Given that the sort works, you may be able to squeak by with the following, which uses a Numpy-based memory efficient alternative to the Pandas groupby:
import pandas as pd
d = '''UserID MovieID Rating
1 455 5
2 411 4
3 207 5
1 288 2
3 69 2
2 300 3
3 410 4
3 108 3
2 137 5
3 308 3
1 300 3'''
df = pd.read_csv(pd.compat.StringIO(d), sep='\s+', index_col='UserID')
df = df.sort_values(['UserID', 'Rating'])
# carefully handle the construction of ix to ensure no copies are made
ix = np.zeros(df.shape[0], np.int8)
np.subtract(df.index.values[1:], df.index.values[:-1], out=ix[:-1])
# the above assumes that UserID is the index of df. If it's just a column, use this instead
#np.subtract(df['UserID'].values[1:], df['UserID'].values[:-1], out=ix[:-1])
ix[:-1] += ix[1:]
ix[-2:] = 1
ix = ix.view(np.bool)
print(df.iloc[ix])
Output:
MovieID Rating
UserID
1 300 3
1 455 5
2 411 4
2 137 5
3 410 4
3 207 5
More memory efficient answer
Instead of a Pandas dataframe, for stuff this big you should just work with Numpy arrays (which Pandas uses for storing data under the hood). If you use an appropriate structured array, you should be able to fit all of your data into a single array roughly of size:
2 * 10**8 * (4 + 2 + 1)
1,400,000,000 bytes
or ~1.304 GB
which means that it (and a couple of temporaries for calculations) should easily fit into your 8 GB system memory.
Here's some details:
The trickiest part will be initializing the structured array. You may be able to get away with manually initializing the array and then copying the data over:
dfdtype = np.dtype([('UserID', np.uint32), ('MovieID', np.uint16), ('Rating', np.uint8)])
arr = np.empty(df.shape[0], dtype=dfdtype)
arr['UserID'] = df.index.values
for n in dfdtype.names[1:]:
arr[n] = df[n].values
If the above causes an out of memory error, from the start of your program you'll have to build and populate a structured array instead of a dataframe:
arr = np.empty(rowcount, dtype=dfdtype)
...
adapt the code you use to populate the df and put it here
...
Once you have arr, here's how you'd do the groupby you're aiming for:
arr.sort(order=['UserID', 'Rating'])
ix = np.zeros(arr.shape[0], np.int8)
np.subtract(arr['UserID'][1:], arr['UserID'][:-1], out=ix[:-1])
ix[:-1] += ix[1:]
ix[-2:] = 1
ix = ix.view(np.bool)
print(arr[ix])
The above size calculation and dtype assumes that no UserID is larger than 4,294,967,295, no MovieID is larger than 65535, and no rating is larger than 255. This means that the columns of your dataframe can be (np.uint32, np.uint16, np.uint8) without loosing any data.

If you want to keep working with pandas, you can divide your data into batches - 10K rows at a time, for example. You can split the data either after loading the source data to the DF, or even better, load the data in parts.
You can save the results of each iteration (batch) into a dictionary keeping only the number of movies you're interested with:
{userID: {MovieID_1: score1, MovieID_2: s2, ... MovieID_N: sN}, ...}
and update the nested dictionary on each iteration, keeping only the best N movies per user.
This way you'll be able to analyze data much larger than your computer's memory

Pandas column creation methods

There are many methods for creating new columns in Pandas (I may have missed some in my examples so please let me know if there are others and I will include here) and I wanted to figure out when is the best time to use each method. Obviously some methods are better in certain situations compared to others but I want to evaluate it from a holistic view looking at efficiency, readability, and usefulness.
I'm primarily concerned with the first three but included other ways simply to show it's possible with different approaches. Here's your sample dataframe:
df = pd.DataFrame({'a':[1,2,3],'b':[4,5,6]})
Most commonly known way is to name a new column such as df['c'] and use apply:
df['c'] = df['a'].apply(lambda x: x * 2)
df
a b c
0 1 4 2
1 2 5 4
2 3 6 6
Using assign can accomplish the same thing:
df = df.assign(c = lambda x: x['a'] * 2)
df
a b c
0 1 4 2
1 2 5 4
2 3 6 6
Updated via #roganjosh:
df['c'] = df['a'] * 2
df
a b c
0 1 4 2
1 2 5 4
2 3 6 6
Using map (definitely not as efficient as apply):
df['c'] = df['a'].map(lambda x: x * 2)
df
a b c
0 1 4 2
1 2 5 4
2 3 6 6
Creating a new pd.series and then concat to bring it into the dataframe:
c = pd.Series(df['a'] * 2).rename("c")
df = pd.concat([df,c], axis = 1)
df
a b c
0 1 4 2
1 2 5 4
2 3 6 6
Using join:
df.join(c)
a b c
0 1 4 2
1 2 5 4
2 3 6 6

Short answer: vectorized calls (df['c'] = 2 * df['a']) almost always win on both speed and readability. See this answer regarding what you can use as a "hierarchy" of options when it comes to performance.
In generally, if you have a for i in ... or lambda present somewhere in a Pandas operation, this (sometimes) means that the resulting calculations call Python code rather than the optimized C code that Pandas' Cython library relies on for vectorized operations. (Same goes for operations that rely on NumPy ufuncs for the underlying .values.)
As for .assign(), it is correctly pointed out in the comments that this creates a copy, whereas you can view df['c'] = 2 * df['a'] as the equivalent of setting a dictionary key/value. The former also takes twice as long, although this is perhaps a bit apples-to-orange because one operation is returning a DataFrame while the other is just assigning a column.
>>> %timeit df.assign(c=df['a'] * 2)
498 µs ± 15.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
>>> %timeit -r 7 -n 1000 df['c'] = df['a'] * 2
239 µs ± 22.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
As for .map(): generally you see this when, as the name implies, you want to provide a mapping for a Series (though it can be passed a function, as in your question). That doesn't mean it's not performant, it just tends to be used as a specialized method in cases that I've seen:
>>> df['a'].map(dict(enumerate('xyz', 1)))
0 x
1 y
2 z
Name: a, dtype: object
And as for .apply(): to inject a bit of opinion into the answer, I would argue it's more idiomatic to use vectorization where possible. You can see in the code for the module where .apply() is defined: because you are passing a lambda, not a NumPy ufunc, what ultimately gets called is technically a Cython function, map_infer, but it is still performing whatever function you passed on each individual member of the Series df['a'], one at a time.

A succinct way would be:
df['c'] = 2 * df['a']
No need to compute the new column elementwise.

Why are you using lambda function?
You can easily achieve the above-mentioned task easily by
df['c'] = 2 * df['a']
This will not increase the overhead.

How to speed LabelEncoder up recoding a categorical variable into integers

I have a large csv with two strings per row in this form:
g,k
a,h
c,i
j,e
d,i
i,h
b,b
d,d
i,a
d,h
I read in the first two columns and recode the strings to integers as follows:
import pandas as pd
df = pd.read_csv("test.csv", usecols=[0,1], prefix="ID_", header=None)
from sklearn.preprocessing import LabelEncoder
# Initialize the LabelEncoder.
le = LabelEncoder()
le.fit(df.values.flat)
# Convert to digits.
df = df.apply(le.transform)
This code is from https://stackoverflow.com/a/39419342/2179021.
The code works very well but is slow when df is large. I timed each step and the result was surprising to me.
pd.read_csv takes about 40 seconds.
le.fit(df.values.flat) takes about 30 seconds
df = df.apply(le.transform) takes about 250 seconds.
Is there any way to speed up this last step? It feels like it should be the fastest step of them all!
More timings for the recoding step on a computer with 4GB of RAM
The answer below by maxymoo is fast but doesn't give the right answer. Taking the example csv from the top of the question, it translates it to:
0 1
0 4 6
1 0 4
2 2 5
3 6 3
4 3 5
5 5 4
6 1 1
7 3 2
8 5 0
9 3 4
Notice that 'd' is mapped to 3 in the first column but 2 in the second.
I tried the solution from https://stackoverflow.com/a/39356398/2179021 and get the following.
df = pd.DataFrame({'ID_0':np.random.randint(0,1000,1000000), 'ID_1':np.random.randint(0,1000,1000000)}).astype(str)
df.info()
memory usage: 7.6MB
%timeit x = (df.stack().astype('category').cat.rename_categories(np.arange(len(df.stack().unique()))).unstack())
1 loops, best of 3: 1.7 s per loop
Then I increased the dataframe size by a factor of 10.
df = pd.DataFrame({'ID_0':np.random.randint(0,1000,10000000), 'ID_1':np.random.randint(0,1000,10000000)}).astype(str)
df.info()
memory usage: 76.3+ MB
%timeit x = (df.stack().astype('category').cat.rename_categories(np.arange(len(df.stack().unique()))).unstack())
MemoryError Traceback (most recent call last)
This method appears to use so much RAM trying to translate this relatively small dataframe that it crashes.
I also timed LabelEncoder with the larger dataset with 10 millions rows. It runs without crashing but the fit line alone took 50 seconds. The df.apply(le.transform) step took about 80 seconds.
How can I:
Get something of roughly the speed of maxymoo's answer and roughly the memory usage of LabelEncoder but that gives the right answer when the dataframe has two columns.
Store the mapping so that I can reuse it for different data (as in the way LabelEncoder allows me to do)?

It looks like it will be much faster to use the pandas category datatype; internally this uses a hash table rather whereas LabelEncoder uses a sorted search:
In [87]: df = pd.DataFrame({'ID_0':np.random.randint(0,1000,1000000),
'ID_1':np.random.randint(0,1000,1000000)}).astype(str)
In [88]: le.fit(df.values.flat)
%time x = df.apply(le.transform)
CPU times: user 6.28 s, sys: 48.9 ms, total: 6.33 s
Wall time: 6.37 s
In [89]: %time x = df.apply(lambda x: x.astype('category').cat.codes)
CPU times: user 301 ms, sys: 28.6 ms, total: 330 ms
Wall time: 331 ms
EDIT: Here is a custom transformer class that that you could use (you probably won't see this in an official scikit-learn release since the maintainers don't want to have pandas as a dependency)
import pandas as pd
from pandas.core.nanops import unique1d
from sklearn.base import BaseEstimator, TransformerMixin
class PandasLabelEncoder(BaseEstimator, TransformerMixin):
def fit(self, y):
self.classes_ = unique1d(y)
return self
def transform(self, y):
s = pd.Series(y).astype('category', categories=self.classes_)
return s.cat.codes

I tried this with the DataFrame:
In [xxx]: import string
In [xxx]: letters = np.array([c for c in string.ascii_lowercase])
In [249]: df = pd.DataFrame({'ID_0': np.random.choice(letters, 10000000), 'ID_1':np.random.choice(letters, 10000000)})
It looks like this:
In [261]: df.head()
Out[261]:
ID_0 ID_1
0 v z
1 i i
2 d n
3 z r
4 x x
In [262]: df.shape
Out[262]: (10000000, 2)
So, 10 million rows. Locally, my timings are:
In [257]: % timeit le.fit(df.values.flat)
1 loops, best of 3: 17.2 s per loop
In [258]: % timeit df2 = df.apply(le.transform)
1 loops, best of 3: 30.2 s per loop
Then I made a dict mapping letters to numbers and used pandas.Series.map:
In [248]: letters = np.array([l for l in string.ascii_lowercase])
In [263]: d = dict(zip(letters, range(26)))
In [273]: %timeit for c in df.columns: df[c] = df[c].map(d)
1 loops, best of 3: 1.12 s per loop
In [274]: df.head()
Out[274]:
ID_0 ID_1
0 21 25
1 8 8
2 3 13
3 25 17
4 23 23
So that might be an option. The dict just needs to have all of the values that occur in the data.
EDIT: The OP asked what timing I have for that second option, with categories. This is what I get:
In [40]: %timeit x=df.stack().astype('category').cat.rename_categories(np.arange(len(df.stack().unique()))).unstack()
1 loops, best of 3: 13.5 s per loop
EDIT: per the 2nd comment:
In [45]: %timeit uniques = np.sort(pd.unique(df.values.ravel()))
1 loops, best of 3: 933 ms per loop
In [46]: %timeit dfc = df.apply(lambda x: x.astype('category', categories=uniques))
1 loops, best of 3: 1.35 s per loop

I would like to point out an alternate solution that should serve many readers well. Although I prefer to have a known set of IDs, it is not always necessary if this is strictly one-way remapping.
Instead of
df[c] = df[c].apply(le.transform)
or
dict_table = {val: i for i, val in enumerate(uniques)}
df[c] = df[c].map(dict_table)
or (checkout _encode() and _encode_python() in sklearn source code, which I assume is faster on average than other methods mentioned)
df[c] = np.array([dict_table[v] for v in df[c].values])
you can instead do
df[c] = df[c].apply(hash)
Pros: much faster, less memory needed, no training, hashes can be reduced to smaller representations (more collisions by casting dtype).
Cons: gives funky numbers, can have collisions (not guaranteed to be perfectly unique), can't guarantee the function won't change with a new version of python
Note that the secure hash functions will have fewer collisions at the cost of speed.
Example of when to use: You have somewhat long strings that are mostly unique and the data set is huge. Most importantly, you don't care about rare hash collisions even though it can be a source of noise in your model's predictions.
I've tried all the methods above and my workload was taking about 90 minutes to learn the encoding from training (1M rows and 600 features) and reapply that to several test sets, while also dealing with new values. The hash method brought it down to a few minutes and I don't need to save any model.

Very slow indexing in Pandas 0.15 compared to 0.13.1

I use pandas daily in my work. I recently upgraded to 0.15.1 from 0.13.1 and now a bunch of code is too slow to finish when iterating through relatively small DataFrames.
(I realize there are often better/faster ways to accomplish iteration on a DataFrame, but sometimes it's very clear and succinct to have a for loop structure)
I narrowed the problem down to an issue when mixing types:
def iterGet(df,col):
for i in df.index:
tmp = df[col].loc[i]
def iterLocSet(df,col,val):
for i in df.index:
#df[col].loc[i] = val
df.loc[i,col] = val
df.at[i,col] = val
return df
N = 100
df = pd.DataFrame(rand(N,3),columns = ['a','b','c'])
df['listCol'] = [[] for i in range(df.shape[0])]
df['strCol'] = [str(i) for i in range(df.shape[0])]
df['intCol'] = [i for i in range(df.shape[0])]
df['float64Col'] = [float64(i) for i in range(df.shape[0])]
print df.a[:5]
%time iterGet(df[['a','intCol']].copy(),'a')
%time tmpDf = iterLocSet(df[['a','intCol']].copy(),'a',0.)
print tmpDf.a[:5]
%time iterGet(df[['a','float64Col']].copy(),'a')
%time tmpDf = iterLocSet(df[['a','float64Col']].copy(),'a',0.)
print tmpDf.a[:5]
On Pandas 0.15.1 the result is:
0 0.114738
1 0.586447
2 0.296024
3 0.446697
4 0.720984
Name: a, dtype: float64
Wall time: 6 ms
Wall time: 3.41 s
0 0
1 0
2 0
3 0
4 0
Name: a, dtype: float64
Wall time: 6 ms
Wall time: 18 ms
0 0
1 0
2 0
3 0
4 0
Name: a, dtype: float64
But on Pandas 0.13.1 the result is this:
0 0.651796
1 0.738661
2 0.885366
3 0.513006
4 0.846323
Name: a, dtype: float64
Wall time: 6 ms
Wall time: 14 ms
0 0
1 0
2 0
3 0
4 0
Name: a, dtype: float64
Wall time: 5 ms
Wall time: 15 ms
0 0
1 0
2 0
3 0
4 0
Name: a, dtype: float6
It appears that making an assignment using row-indexing on a multi-typed array is ~200x slower in Pandas 0.15.1?
I am aware there may be a potential pitfall here by assigning to what may be a copy of the array, but I admit I do not fully understand that issue either. Here at least I can see the assignment is working. EDIT Although I see now that using either of these in the for loop fixes the problem:
df.loc[i,col] = val
df.at[i,col] = val
I don't know enough about the implementation to diagnose this. Can anyone reproduce this? Is this what you would expect? What am I doing wrong?
Thanks!

Using .loc even on a single-dtyped frame, can cause a copy of the data on a partial assignment. (This is almost always true when you have object dtypes, less so with numeric types).
When partial assignment, I mean:
df.loc[1,'B'] = value
IOW. this is setting a single value in this case (setting multiple values is similar). However setting a column is very different.
df['B'] = values
df[:,'B'] = values
is quite efficient and does not copy.
Thus you should completely avoid iteration and simply do.
df['B'] = [ ..... ] # if you want to set with a list-like
df['B'] = value # for a scalar
So in your above example, it is likely copying at every iteration. 0.13.1 was a bit buggy in handling partial assignments and would incorrectly handle certain cases, so copying was needed a bit more.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.