How does adding a random byte *increase* duplicates? - python

Here is a Python function for generating my own specific type of UUID (it's a long story why I can't use uuid.uuid1()):
from time import time
from datetime import datetime
from random import choice
from string import upper  # assuming Python 2, where string.upper() exists

def uuid():
    sec = hex(int(time()))[2:]
    usec = hex(datetime.now().microsecond)[2:]
    rand = hex(choice(range(256)))[2:]
    return upper(sec + usec + rand)
# 534AD79CDF1D27
Now, let's let that run for a long period of time, and see if we find any duplicates:
UUIDs Duplicates
100000 2
200000 8
300000 8
400000 8
500000 8
600000 9
700000 9
800000 9
900000 9
1000000 10
1100000 14
1200000 14
1300000 14
1400000 17
1500000 17
1600000 18
1700000 21
1800000 24
1900000 24
2000000 27
Yep! Nearly 30 duplicates in fact... Now, here's a new function without the random byte at the end:
def uuid():
    sec = hex(int(time()))[2:]
    usec = hex(datetime.now().microsecond)[2:]
    return upper(sec + usec)
# 534ADA2AC4A41
Let's see how many duplicates we get now:
UUIDs Duplicates
100000 0
200000 0
300000 0
400000 0
500000 0
600000 0
700000 0
800000 0
900000 0
1000000 0
1100000 0
1200000 0
1300000 0
1400000 0
1500000 0
1600000 0
1700000 0
1800000 0
1900000 0
2000000 0
Well, would you look at that? Not a single duplicate! Also, if you're curious how I'm determining the number of duplicates, here is the code for that:
len([x for x, y in Counter(ids).items() if y > 1])
Now, on to the actual question: How does adding a randomly generated byte increase the number of duplicates?

The problem is that you are using hex() without zero-padding. hex(int(time())) is basically always 8 nybbles long because it increments quite slowly, so that first part is constant length. Nybble here refers to a single hex digit.
But hex(datetime.now().microsecond) is not a constant length. It varies from 1 nybble (for values up to 0xF, i.e. 15 µs) to 5 nybbles (for 999999 µs = 0xF423F). Without the "random byte", this isn't a problem, because you can still recover the microsecond value uniquely by truncating off the fixed-length seconds part.
However, your "random byte" is also produced without any padding, so it can come out as either 1 or 2 nybbles. That creates extra collisions, because the same uuid string can be produced by, e.g., a 3-nybble usec with a 2-nybble rand as by a 4-nybble usec with a 1-nybble rand. For example, these two are collisions:
usec = 0xabc
rand = 0xde
and
usec = 0xabcd
rand = 0xe
To fix this, pad all of your strings. This is really easy to do with format:
usec = format(datetime.now().microsecond, '05x') # hexify `microsecond` with 5 fixed hex digits
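Putting it together, a fully padded version of the original function might look like this (just a sketch, assuming the same imports as the question's snippet; the fixed widths mean the three fields can no longer borrow nybbles from each other):
def uuid():
    sec = format(int(time()), '08x')                  # seconds since the epoch: 8 hex digits
    usec = format(datetime.now().microsecond, '05x')  # microseconds: always 5 hex digits
    rand = format(choice(range(256)), '02x')          # random byte: always 2 hex digits
    return (sec + usec + rand).upper()                # always 15 hex digits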

usec will be between 1 and 5 characters and rand will be between 1 and 2 characters, so it's not too surprising that concatenating the two (within the same second) will produce collisions.
For example, usec = '12' and rand = '3' yields the same string as usec = '1' and rand = '23' (i.e. '123').
You could avoid this by left-padding them so that usec is always exactly 5 characters and rand is always exactly 2 characters.
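For instance, a minimal sketch of that left-padding using str.zfill:
usec = hex(datetime.now().microsecond)[2:].zfill(5)  # always 5 hex digits
rand = hex(choice(range(256)))[2:].zfill(2)          # always 2 hex digits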

Related

Optimizing Memory Allocations of Pandas Code to Process Rows Using Explicit Loops with Numba Optimization

Assume I have data in the form (As a Pandas' Data Frame):
Index  ID  Value  Div Factor  Weighted Sum
1      1   2      1
2      1   3      2
3      2   6      1
4      1   1      3
5      2   3      2
6      2   9      3
7      2   8      4
8      3   5      1
9      3   6      2
10     1   8      4
11     3   2      3
12     3   7      4
I want to calculate the column Weighted Sum as follows (for the $i$-th row):
Look at all values from row 1 to row i.
Sum the values in groups according to each row's ID. This gives k sums, where k is the number of unique ID values in rows 1 to i.
Divide each of the k sums by the number of elements in its group.
Sum those k averages and divide by k (the average of the averages).
For example, let's do rows 1, 7 and 12:
Row 1
For i = 1 we have a single value, hence the sum is 2, the average of the single group is 2, and the average over all groups is 2.
Row 7
For i = 7 we have only 2 unique values of ID above it: 1 and 2.
For the group of ID = 1 we have: (1 + 3 + 2) / 3 = 2.
For the group of ID = 2 we have: (8 + 9 + 3 + 6) / 4 = 6.5.
Then the average of averages is (2 + 6.5) / 2 = 4.25.
Row 12
For i = 12 we have 3 unique ID values on the rows 1:12.
For the group of ID = 1 we have: (8 + 1 + 3 + 2) / 4 = 3.5.
For the group of ID = 2 we have: (8 + 9 + 3 + 6) / 4 = 6.5.
For the group of ID = 3 we have: (7 + 2 + 6 + 5) / 4 = 5.
Then the average of averages is (3.5 + 6.5 + 5) / 3 = 5.
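(For reference, a direct, unoptimized sketch of this definition, useful only for verifying results on small inputs; the function name is illustrative and this is far too slow for the sizes mentioned in the remark below.)
import pandas as pd

def weighted_sum_reference(df):
    # Direct translation of the definition above: O(n * k), reference only.
    out = []
    for i in range(1, len(df) + 1):
        head = df.iloc[:i]                                # rows 1..i
        group_means = head.groupby('ID')['Value'].mean()  # one average per ID seen so far
        out.append(group_means.mean())                    # average of the averages
    result = df.copy()
    result['Weighted Sum'] = out
    return result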
Remark: The method should be feasible for the case of ~1e7 rows and ~1e6 unique ID's.
As a follow-up to Apply a Function per Row of Sub Groups of the Data **Above** the Current Row, I got a good answer, yet it allocates a lot of memory when both the number of rows and the number of unique IDs are high.
I was wondering if there is a way to create a much smaller auxiliary data using explicit loops while accelerating them using Numba.
The idea is to have similar or better performance while reducing the memory footprint considerably.
Here's a benchmark showing the performance of pandas, numpy and numba/numpy solutions at various row counts (12 to 36,000) and unique ID counts (3 to 9,000):
rows 12, unique ID values: 3
Timeit results:
foo_1 (pandas) ran in 0.003095399937592447 seconds using 1 iterations
foo_2 (numpy) ran in 0.0003358999965712428 seconds using 1 iterations
foo_3 (numpy_numba) ran in 0.00018770003225654364 seconds using 1 iterations
rows 120, unique ID values: 30
Timeit results:
foo_1 (pandas) ran in 0.0024368000449612737 seconds using 1 iterations
foo_2 (numpy) ran in 0.001127400086261332 seconds using 1 iterations
foo_3 (numpy_numba) ran in 0.00029390002600848675 seconds using 1 iterations
rows 1200, unique ID values: 300
Timeit results:
foo_1 (pandas) ran in 0.01624089991673827 seconds using 1 iterations
foo_2 (numpy) ran in 0.009926999919116497 seconds using 1 iterations
foo_3 (numpy_numba) ran in 0.002144100028090179 seconds using 1 iterations
rows 12000, unique ID values: 3000
Timeit results:
foo_1 (pandas) ran in 2.391147599904798 seconds using 1 iterations
foo_2 (numpy) ran in 0.2884287000633776 seconds using 1 iterations
foo_3 (numpy_numba) ran in 0.1226186000276357 seconds using 1 iterations
rows 36000, unique ID values: 9000
Timeit results:
foo_1 (pandas) ran in 44.33448620000854 seconds using 1 iterations
foo_2 (numpy) ran in 3.0259654000401497 seconds using 1 iterations
foo_3 (numpy_numba) ran in 1.6273660999722779 seconds using 1 iterations
The pandas solution creates an intermediate dataframe that is num IDs x num rows in size. The numpy and numpy/numba solutions calculate results column by column, so they create a handful of intermediate 1D arrays of length num rows. The numpy/numba solution is consistently 2-5 times faster than numpy, and pandas is 2-10 times slower than numpy.
Upping the size a bit more gives the following result (where the pandas solution is commented out):
rows 120000, unique ID values: 30000
Timeit results:
foo_1 (pandas) ran in 6.00004568696022e-06 seconds using 1 iterations
foo_2 (numpy) ran in 28.882483799941838 seconds using 1 iterations
foo_3 (numpy_numba) ran in 38.77682559995446 seconds using 1 iterations
So it appears that there is a threshold above which the numpy/numba solution loses ground to regular numpy.
Full test code:
import pandas as pd

# insert code to initialize dfInit here
print(dfInit)
'''
Index ID Value Div Factor
1 1 2 1
2 1 3 2
3 2 6 1
4 1 1 3
5 2 3 2
6 2 9 3
7 2 8 4
8 3 5 1
9 3 6 2
10 1 8 4
11 3 2 3
12 3 7 4
'''

def initDf(colMult=1):
    df = dfInit.copy()
    dfMult = pd.concat([df.assign(ID=dfInit.ID + 3*i, Index=df.Index + len(df)*i) for i in range(colMult)], axis=0).reset_index(drop=True)
    print(f'\nrows {len(dfMult)}, unique ID values: {len(dfMult.ID.unique())}')
    return dfMult

df = initDf()

def pd_foo_1(df):
    df1 = df[['ID', 'Value']].set_index('ID', append=True).unstack(-1)
    df2 = df1.fillna(0).cumsum() / df1.notnull().astype(int).cumsum()
    df['Weighted Sum'] = df2.mean(axis=1)
    return df

def foo_1(df):
    #return None
    try:
        return pd_foo_1(df)
    except (ValueError):
        print('overflow encountered')
        return None

import numpy as np

def foo_2(df):
    values = df.Value.to_numpy()
    ids = df.ID.to_numpy()
    uniqIds = df.ID.unique()
    aggSumsAcrossIds = np.zeros(values.shape)
    aggCntsAcrossIds = np.zeros(values.shape)
    for id in uniqIds:
        curCounts = (ids == id)
        cumCounts = np.cumsum(curCounts)
        curValues = values.copy()
        curValues[~curCounts] = 0
        cumValues = np.cumsum(curValues)
        aggSumsAcrossIds += cumValues / (cumCounts + (cumCounts == 0))
        curHasAppeared = cumCounts > 0
        aggCntsAcrossIds += curHasAppeared
    weightedSum = aggSumsAcrossIds / aggCntsAcrossIds
    df['Weighted Sum'] = weightedSum
    return df

from numba import njit

@njit
def np_foo_3(values, ids):
    uniqIds = np.unique(ids)
    aggSumsAcrossIds = np.zeros(values.shape)
    aggCntsAcrossIds = np.zeros(values.shape)
    for id in uniqIds:
        curCounts = (ids == id)
        cumCounts = np.cumsum(curCounts)
        curValues = values.copy()
        curValues[~curCounts] = 0
        cumValues = np.cumsum(curValues)
        aggSumsAcrossIds += cumValues / (cumCounts + (cumCounts == 0))
        curHasAppeared = cumCounts > 0
        aggCntsAcrossIds += curHasAppeared
    weightedSum = aggSumsAcrossIds / aggCntsAcrossIds
    return weightedSum

def foo_3(df):
    values = df.Value.to_numpy()
    ids = df.ID.to_numpy()
    weightedSum = np_foo_3(values, ids)
    df['Weighted Sum'] = weightedSum
    return df

foo_count = 3
foo_names = ['foo_' + str(i + 1) for i in range(foo_count)]
foo_labels = ['pandas', 'numpy', 'numpy_numba']
exec("foo_funcs=[" + ','.join(f"foo_{str(i + 1)}" for i in range(foo_count)) + "]")

for foo in foo_names:
    print(f'{foo} output:')
    #print(eval(f"{foo}(df)"))
    eval(f"{foo}(df)"); print("... output suppressed.")

# ===================== BENCHMARK with timeit:
from timeit import timeit
n = 1
for colMult in [1,10,100,1000,3000,10000]:
    df = initDf(colMult)
    print(f'Timeit results:')
    for i, foo in enumerate(foo_names):
        t = timeit(f"{foo}(df)", setup=f"from __main__ import df, {foo}", number=n) / n
        print(f'{foo} ({foo_labels[i]}) ran in {t} seconds using {n} iterations')
# ===================== ... END BENCHMARK with timeit.
Space used:
The memory of a pandas solution which pivots IDs is proportional to num rows x num unique IDs. By comparison, a solution that loops over IDs processing one copy of the Value column at a time uses memory proportional to num rows.
This means that for 10^7 rows x 10^6 unique IDs, or about 10^13 values of 4-8 bytes each (call it 10^14 bytes, or 100,000 GB), storing the pivot table in program memory with pandas is not feasible.
However, 10^7 rows of at most 10 1-D arrays of 8-byte values uses on the order of 10^9 bytes or 1 GB of program memory in the looping solutions (numpy or numpy/numba above).
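As a quick back-of-envelope check of those two estimates (assuming 8-byte values):
rows, uniq_ids = 10**7, 10**6
pivot_bytes = rows * uniq_ids * 8   # pandas pivot: num rows x num unique IDs
looped_bytes = rows * 10 * 8        # ~10 full-length 1-D work arrays
print(pivot_bytes / 1e9, "GB")      # 80000.0 GB  (order of 100,000 GB)
print(looped_bytes / 1e9, "GB")     # 0.8 GB      (order of 1 GB)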
Note that adding a nested loop over chunks of a fixed number of rows will allow us to cap the memory usage of most of the roughly 10 1-D arrays mentioned above, but we will still need to calculate at least 1 array (the result), so this will never give us more than a factor of 10 reduction in memory footprint.

Compute column from multiple previous rows in dataframes with conditionals

I'm starting to belive that pandas dataframes are much less intuitive to handle than Excel, but I'm not giving up yet!
So, I'm JUST trying to check data in the same column but in (various) previous rows using the .shift() method. I'm using the following DF as an example since the original is too complicated to copy into here, but the principle is the same.
counter = list(range(20))
df1 = pd.DataFrame(counter, columns=["Counter"])
df1["Event"] = [True, False, False, False, False, False, True, False,False,False,False,False,False,False,False,False,False,False,False,True]
I'm trying to create sums of the column counter, but only under the following conditions:
If "Event" is True, I want to sum the "Counter" values of the 10 rows preceding the event.
EXCEPT if there is another Event within those 10 previous rows; in that case I only want to sum the counter values between those two events (without exceeding 10 rows).
To clarify my goal this is the result I had in mind:
My attempt so far looks like this:
for index, row in df1.iterrows():
    if row["Event"] == True:
        counter = 1
        summ = 0
        while counter < 10 and row["Event"].shift(counter) == False:
            summ += row["Counter"].shift(counter)
            counter += 1
        else:
            df1.at[index, "Sum"] = summ
I'm trying to first find Event == True and from there start iterating backwards with a counter and summing up the counters as I go. However it seems to have a problem with shift:
AttributeError: 'bool' object has no attribute 'shift'
Please shatter my believes and show me, that Excel isn't actually superior.
We need to create a subgroup key with cumsum, then do a rolling sum:
n = 10
s = df1.Counter.groupby(df1.Event.iloc[::-1].cumsum()).\
        rolling(n+1, min_periods=1).sum().\
        reset_index(level=0, drop=True).where(df1.Event)
df1['sum'] = (s - df1.Counter).fillna(0)
df1
Counter Event sum
0 0 True 0.0
1 1 False 0.0
2 2 False 0.0
3 3 False 0.0
4 4 False 0.0
5 5 False 0.0
6 6 True 15.0
7 7 False 0.0
8 8 False 0.0
9 9 False 0.0
10 10 False 0.0
11 11 False 0.0
12 12 False 0.0
13 13 False 0.0
14 14 False 0.0
15 15 False 0.0
16 16 False 0.0
17 17 False 0.0
18 18 False 0.0
19 19 True 135.0
Element-wise approach
You definitely can approach a task in pandas the way you would in excel. Your approach needs to be tweaked a bit because pandas.Series.shift operates on whole arrays or Series, not on a single value - you can't use it just to move back up the dataframe relative to a value.
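For example (a small illustration using the df1 from your snippet):
df1["Event"].shift(1)   # a whole Series, shifted down by one row (NaN in the first slot)
# row["Event"] is a single bool, which is why row["Event"].shift(counter) raises
# AttributeError: 'bool' object has no attribute 'shift'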
The following loops through the indices of your dataframe, walking back up (up to) 10 spots for each Event:
def create_sum_column_loop(df):
    '''
    Adds a Sum column with the rolling sum of 10 Counters prior to an Event
    '''
    df["Sum"] = 0
    for index in range(df.shape[0]):
        counter = 1
        summ = 0
        if df.loc[index, "Event"]:  # == True is implied
            for backup in range(1, 11):
                # handle case where index - backup is before
                # the start of the dataframe
                if index - backup < 0:
                    break
                # stop counting when we hit another event
                if df.loc[index - backup, "Event"]:
                    break
                # increment by the counter
                summ += df.loc[index - backup, "Counter"]
            df.loc[index, "Sum"] = summ
    return df
This does the job:
In [15]: df1_sum1 = create_sum_column_loop(df1.copy()) # copy to preserve original
In [16]: df1_sum1
Counter Event Sum
0 0 True 0
1 1 False 0
2 2 False 0
3 3 False 0
4 4 False 0
5 5 False 0
6 6 True 15
7 7 False 0
8 8 False 0
9 9 False 0
10 10 False 0
11 11 False 0
12 12 False 0
13 13 False 0
14 14 False 0
15 15 False 0
16 16 False 0
17 17 False 0
18 18 False 0
19 19 True 135
Better: vectorized operations
However, the power of pandas comes in its vectorized operations. Python is an interpreted, dynamically-typed language, meaning it's flexible, user friendly (easy to read/write/learn), and slow. To combat this, many commonly-used workflows, including many pandas.Series operations, are written in optimized, compiled code from other languages like C, C++, and Fortran. Under the hood, they're doing the same thing... df1.Counter.cumsum() does loop through the elements and create a running total, but it does it in C, making it lightning fast.
This is what makes learning a framework like pandas difficult - you need to relearn how to do math using that framework. For pandas, the entire game is learning how to use pandas and numpy built-in operators to do your work.
Borrowing the clever solution from @YOBEN_S:
def create_sum_column_vectorized(df):
    n = 10
    s = (
        df.Counter
        # group by a unique identifier for each event. This is a
        # particularly clever bit, where @YOBEN_S reverses
        # the order of df.Event, then computes a running total
        .groupby(df.Event.iloc[::-1].cumsum())
        # compute the rolling sum within each group
        .rolling(n+1, min_periods=1).sum()
        # drop the group index so we can align with the original DataFrame
        .reset_index(level=0, drop=True)
        # drop all non-event observations
        .where(df.Event)
    )
    # remove the counter value for the actual event
    # rows, then fill the remaining rows with 0s
    df['sum'] = (s - df.Counter).fillna(0)
    return df
We can see that the result is the same as the one above (though the values are suddenly floats):
In [23]: df1_sum2 = create_sum_column_vectorized(df1) # copy to preserve original
In [24]: df1_sum2
The difference comes in the performance. In ipython or jupyter we can use the %timeit command to see how long a statement takes to run:
In [25]: %timeit create_sum_column_loop(df1.copy())
3.21 ms ± 54.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [26]: %timeit create_sum_column_vectorized(df1.copy())
7.76 ms ± 255 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
For small datasets, like the one in your example, the difference will be negligible or will even slightly favor the pure python loop.
For much larger datasets, the difference becomes apparent. Let's create a dataset similar to your example, but with 100,000 rows:
In [27]: df_big = pd.DataFrame({
...: 'Counter': np.arange(100000),
...: 'Event': np.random.random(size=100000) > 0.9,
...: })
...:
Now, you can really see the performance benefit of the vectorized approach:
In [28]: %timeit create_sum_column_loop(df_big.copy())
13 s ± 101 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [29]: %timeit create_sum_column_vectorized(df_big.copy())
5.81 s ± 28 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
The vectorized version takes less than half the time. This difference will continue to widen as the amount of data increases.
Compiling your own workflows with numba
Note that for specific operations, it is possible to speed up operations further by pre-compiling the code yourself. In this case, the looped version can be compiled with numba:
import numba

@numba.jit(nopython=True)
def _inner_vectorized_loop(counter, event, sum_col):
    for index in range(len(counter)):
        summ = 0
        if event[index]:
            for backup in range(1, 11):
                # handle case where index - backup is before
                # the start of the dataframe
                if index - backup < 0:
                    break
                # stop counting when we hit another event
                if event[index - backup]:
                    break
                # increment by the counter
                summ = summ + counter[index - backup]
            sum_col[index] = summ
    return sum_col

def create_sum_column_loop_jit(df):
    '''
    Adds a Sum column with the rolling sum of 10 Counters prior to an Event
    '''
    df["Sum"] = 0
    df["Sum"] = _inner_vectorized_loop(
        df.Counter.values, df.Event.values, df.Sum.values)
    return df
This beats both pandas and the for loop by a factor of more than 1000!
In [90]: %timeit create_sum_column_loop_jit(df_big.copy())
1.62 ms ± 53.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Balancing readability, efficiency, and flexibility is the constant challenge. Best of luck as you dive in!

Python numpy vectorization for heat dispersion

I'm supposed to write a code to represent heat dispersion using the finite difference formula given below.
$$u^{(t)}_{i,j} = \frac{u^{(t-1)}_{i+1,j} + u^{(t-1)}_{i-1,j} + u^{(t-1)}_{i,j+1} + u^{(t-1)}_{i,j-1}}{4}$$
The formula is supposed to produce the result only for a time step of 1. So, if an array like this was given:
100 100 100 100 100
100 0 0 0 100
100 0 0 0 100
100 0 0 0 100
100 100 100 100 100
The resulting array at time step 1 would be:
100 100 100 100 100
100 50 25 50 100
100 25 0 25 100
100 50 25 50 100
100 100 100 100 100
I know the representation using for loops would be as follows, where the array would have a minimum of 2 rows and 2 columns as a precondition:
h = np.copy(u)
for i in range(1, h.shape[0]-1):
    for j in range(1, h.shape[1]-1):
        num = u[i+1][j] + u[i-1][j] + u[i][j+1] + u[i][j-1]
        h[i][j] = num/4
But I cannot figure out how to vectorize this code. I am supposed to use numpy arrays and vectorization, and I am not allowed to use for loops of any kind. I think I am supposed to rely on slicing, but I cannot figure out how to write it; this is all I have so far:
r, c = h.shape
if c == 2 or r == 2:
    return h
I'm fairly sure that if rows == 2 or columns == 2 the array should just be returned as is, but correct me if I'm wrong. Any help would be greatly appreciated. Thank you!
Try:
h[1:-1,1:-1] = (h[2:,1:-1] + h[:-2,1:-1] + h[1:-1,2:] + h[1:-1,:-2]) / 4
This solution uses slicing, where:
1:-1 stands for indices 1, 2, ..., LAST - 1
2: stands for 2, 3, ..., LAST
:-2 stands for 0, 1, ..., LAST - 2
During each iteration only the inner elements (indices 1..LAST-1) are updated.
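As a quick sanity check (a small sketch, not part of the original answer), applying that one-liner to the 5x5 grid from the question reproduces the expected time-step-1 result:
import numpy as np

u = np.zeros((5, 5))
u[0, :] = u[-1, :] = u[:, 0] = u[:, -1] = 100.0

h = u.copy()
h[1:-1, 1:-1] = (u[2:, 1:-1] + u[:-2, 1:-1] + u[1:-1, 2:] + u[1:-1, :-2]) / 4
print(h)
# inner 3x3 block: [[50, 25, 50], [25, 0, 25], [50, 25, 50]]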

Is there a way to run this Python snippet faster?

from collections import defaultdict
dct = defaultdict(list)
for n in range(len(res)):
    for i in indices_ordered:
        dct[i].append(res[n][i])
Note that res is a list of pandas Series of length 5000, and indices_ordered is a list of strings of length 20000. It takes 23 minutes to run this code on my Mac (2.3 GHz Intel Core i5 and 16 GB 2133 MHz LPDDR3). I am pretty new to Python, but I feel that more clever coding (maybe less looping) would help a lot.
Edit:
Here is an example of how to create the data (res and indices_ordered) so that the snippet above can be run (the snippet is slightly changed to access the only field positionally rather than by field name, since I could not find how to construct a Series with a field name inline):
import random, string, pandas
index_sz = 20000
res_sz = 5000
indices_ordered = [''.join(random.choice(string.ascii_uppercase + string.digits) for _ in range(10)) for i in range(index_sz)]
res = [pandas.Series([random.randint(0,10) for i in range(index_sz)], index = random.sample(indices_ordered, index_sz)) for i in range(res_sz)]
The issue here is that you iterate over indices_ordered for every single value. Just drop indices_ordered. Stripping it way back in orders of magnitude to test the timings:
import random
import string
import numpy as np
import pandas as pd
from collections import defaultdict

index_sz = 200
res_sz = 50

indices_ordered = [''.join(random.choice(string.ascii_uppercase + string.digits)
                           for _ in range(10)) for i in range(index_sz)]
res = [pd.Series([random.randint(0, 10) for i in range(index_sz)],
                 index=random.sample(indices_ordered, index_sz))
       for i in range(res_sz)]

def your_way(res, indices_ordered):
    dct = defaultdict(list)
    for n in range(len(res)):
        for i in indices_ordered:
            dct[i].append(res[n][i])

def my_way(res):
    dct = defaultdict(list)
    for item in res:
        for string_item, value in item.iteritems():
            dct[string_item].append(value)
Gives:
%timeit your_way(res, indices_ordered)
160 ms ± 5.45 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit my_way(res)
6.79 ms ± 47.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
This reduces the time complexity of the whole approach because you don't keep going through indices_ordered each time and assigning values, so the difference will become much more stark as the size of the data grows.
Just increasing one order of magnitude:
index_sz = 2000
res_sz = 500
Gives:
%timeit your_way(res, indices_ordered)
17.8 s ± 999 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit my_way(res)
543 ms ± 9.07 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
EDIT: Now that testing data is available, it is clear that the changes below have no effect on run-time. The described techniques are only effective when the inner loop is very efficient (on the order of 5-10 dict lookups), making it more efficient still by removing some of the said lookups. Here the r[i] item lookup dwarfs anything else by orders of magnitude, so the optimizations are simply irrelevant.
Your outer loop takes 5000 iterations, and your inner loop 20000 iterations. This means that you are executing 100 million iterations in 23 minutes, i.e. that each iteration takes 13.8 μs. That is not fast, even in Python.
I would try to cut down the run-time by stripping any unnecessary work from the inner loop. Specifically:
convert for n in range(len(res)) followed by res[n] to for r in res. I don't know how efficient item lookup is in pandas, but it's better to do it in the outer than in the inner loop.
move the score attribute lookup to the outer loop.
get rid of defaultdict and pre-create the lists and use an ordinary dict.
avoid dict stores at all and work on the lists directly, pre-creating them in a sequence. Only create a dictionary at the end.
cache the lookup of the append list method, and prepare in advance the (append, i) pairs that the inner loop needs.
Here is code that implements the above suggestions:
# pre-create the lists
lsts = [[] for _ in range(len(indices_ordered))]

# prepare the pairs (appendfn, i)
fast_append = [(l.append, i)
               for (l, i) in zip(lsts, indices_ordered)]

for r in res:
    # pre-fetch res[n].score
    r_score = r.score
    for append, i in fast_append:
        append(r_score[i])

# finally, create the dict out of the lists
dct = {i: lst for (i, lst) in zip(indices_ordered, lsts)}
You really should use a DataFrame.
Here's a way to create the data directly:
import pandas as pd
import numpy as np
import random
import string
index_sz = 3
res_sz = 10
indices_ordered = [''.join(random.choice(string.ascii_uppercase + string.digits) for _ in range(3)) for i in range(index_sz)]
df = pd.DataFrame(np.random.randint(10, size=(res_sz, index_sz)), columns=indices_ordered)
There's no need to sort or index anything. A DataFrame can basically be accessed as an array or as a dict.
It should be much faster than juggling with defaultdicts, lists and Series.
df now looks like:
>>> df
7XQ VTV 38Y
0 6 9 5
1 5 5 4
2 6 0 7
3 0 0 8
4 7 8 9
5 8 6 4
6 2 4 9
7 3 2 2
8 7 6 0
9 8 0 1
>>> df['7XQ']
0 6
1 5
2 6
3 0
4 7
5 8
6 2
7 3
8 7
9 8
Name: 7XQ, dtype: int64
>>> df['7XQ'][:5]
0 6
1 5
2 6
3 0
4 7
Name: 7XQ, dtype: int64
With the original size, this script outputs a 5000 rows × 20000 columns DataFrame
in less than 3 seconds on my laptop.
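If you still need the dict-of-lists shape at the end, one way (assuming the df built above) is:
dct = df.to_dict('list')   # {'7XQ': [6, 5, 6, ...], 'VTV': [...], '38Y': [...]}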
Use pandas magic (with 2 lines of code) on your input list of pd.Series objects:
all_data = pd.concat([*res])
d = all_data.groupby(all_data.index).apply(list).to_dict()
Implied actions:
pd.concat([*res]) - concatenates all series into a single one preserving indices of each series object (pandas.concat)
all_data.groupby(all_data.index).apply(list).to_dict() - groups the values of all_data by index label, puts each group's values into a list with .apply(list), and finally converts the grouped result into a dictionary with .to_dict() (pandas.Series.groupby)

How to speed LabelEncoder up recoding a categorical variable into integers

I have a large csv with two strings per row in this form:
g,k
a,h
c,i
j,e
d,i
i,h
b,b
d,d
i,a
d,h
I read in the first two columns and recode the strings to integers as follows:
import pandas as pd
df = pd.read_csv("test.csv", usecols=[0,1], prefix="ID_", header=None)
from sklearn.preprocessing import LabelEncoder
# Initialize the LabelEncoder.
le = LabelEncoder()
le.fit(df.values.flat)
# Convert to digits.
df = df.apply(le.transform)
This code is from https://stackoverflow.com/a/39419342/2179021.
The code works very well but is slow when df is large. I timed each step and the result was surprising to me.
pd.read_csv takes about 40 seconds.
le.fit(df.values.flat) takes about 30 seconds
df = df.apply(le.transform) takes about 250 seconds.
Is there any way to speed up this last step? It feels like it should be the fastest step of them all!
More timings for the recoding step on a computer with 4GB of RAM
The answer below by maxymoo is fast but doesn't give the right answer. Taking the example csv from the top of the question, it translates it to:
0 1
0 4 6
1 0 4
2 2 5
3 6 3
4 3 5
5 5 4
6 1 1
7 3 2
8 5 0
9 3 4
Notice that 'd' is mapped to 3 in the first column but 2 in the second.
I tried the solution from https://stackoverflow.com/a/39356398/2179021 and get the following.
df = pd.DataFrame({'ID_0':np.random.randint(0,1000,1000000), 'ID_1':np.random.randint(0,1000,1000000)}).astype(str)
df.info()
memory usage: 7.6MB
%timeit x = (df.stack().astype('category').cat.rename_categories(np.arange(len(df.stack().unique()))).unstack())
1 loops, best of 3: 1.7 s per loop
Then I increased the dataframe size by a factor of 10.
df = pd.DataFrame({'ID_0':np.random.randint(0,1000,10000000), 'ID_1':np.random.randint(0,1000,10000000)}).astype(str)
df.info()
memory usage: 76.3+ MB
%timeit x = (df.stack().astype('category').cat.rename_categories(np.arange(len(df.stack().unique()))).unstack())
MemoryError Traceback (most recent call last)
This method appears to use so much RAM trying to translate this relatively small dataframe that it crashes.
I also timed LabelEncoder with the larger dataset of 10 million rows. It runs without crashing, but the fit line alone took 50 seconds. The df.apply(le.transform) step took about 80 seconds.
How can I:
Get something with roughly the speed of maxymoo's answer and roughly the memory usage of LabelEncoder, but that gives the right answer when the dataframe has two columns.
Store the mapping so that I can reuse it for different data (as in the way LabelEncoder allows me to do)?
It looks like it will be much faster to use the pandas category datatype; internally this uses a hash table, whereas LabelEncoder uses a sorted search:
In [87]: df = pd.DataFrame({'ID_0':np.random.randint(0,1000,1000000),
'ID_1':np.random.randint(0,1000,1000000)}).astype(str)
In [88]: le.fit(df.values.flat)
%time x = df.apply(le.transform)
CPU times: user 6.28 s, sys: 48.9 ms, total: 6.33 s
Wall time: 6.37 s
In [89]: %time x = df.apply(lambda x: x.astype('category').cat.codes)
CPU times: user 301 ms, sys: 28.6 ms, total: 330 ms
Wall time: 331 ms
EDIT: Here is a custom transformer class that you could use (you probably won't see this in an official scikit-learn release, since the maintainers don't want to have pandas as a dependency):
import pandas as pd
from pandas.core.nanops import unique1d
from sklearn.base import BaseEstimator, TransformerMixin

class PandasLabelEncoder(BaseEstimator, TransformerMixin):
    def fit(self, y):
        self.classes_ = unique1d(y)
        return self

    def transform(self, y):
        s = pd.Series(y).astype('category', categories=self.classes_)
        return s.cat.codes
I tried this with the DataFrame:
In [xxx]: import string
In [xxx]: letters = np.array([c for c in string.ascii_lowercase])
In [249]: df = pd.DataFrame({'ID_0': np.random.choice(letters, 10000000), 'ID_1':np.random.choice(letters, 10000000)})
It looks like this:
In [261]: df.head()
Out[261]:
ID_0 ID_1
0 v z
1 i i
2 d n
3 z r
4 x x
In [262]: df.shape
Out[262]: (10000000, 2)
So, 10 million rows. Locally, my timings are:
In [257]: % timeit le.fit(df.values.flat)
1 loops, best of 3: 17.2 s per loop
In [258]: % timeit df2 = df.apply(le.transform)
1 loops, best of 3: 30.2 s per loop
Then I made a dict mapping letters to numbers and used pandas.Series.map:
In [248]: letters = np.array([l for l in string.ascii_lowercase])
In [263]: d = dict(zip(letters, range(26)))
In [273]: %timeit for c in df.columns: df[c] = df[c].map(d)
1 loops, best of 3: 1.12 s per loop
In [274]: df.head()
Out[274]:
ID_0 ID_1
0 21 25
1 8 8
2 3 13
3 25 17
4 23 23
So that might be an option. The dict just needs to have all of the values that occur in the data.
EDIT: The OP asked what timing I have for that second option, with categories. This is what I get:
In [40]: %timeit x=df.stack().astype('category').cat.rename_categories(np.arange(len(df.stack().unique()))).unstack()
1 loops, best of 3: 13.5 s per loop
EDIT: per the 2nd comment:
In [45]: %timeit uniques = np.sort(pd.unique(df.values.ravel()))
1 loops, best of 3: 933 ms per loop
In [46]: %timeit dfc = df.apply(lambda x: x.astype('category', categories=uniques))
1 loops, best of 3: 1.35 s per loop
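(As a side note, the same shared-vocabulary idea can be written with pd.Categorical, which gives consistent codes across both columns and keeps uniques around as a reusable mapping; this is a sketch, not part of the original answer.)
import numpy as np
import pandas as pd

uniques = np.sort(pd.unique(df.values.ravel()))   # one shared vocabulary for both columns
for c in df.columns:
    df[c] = pd.Categorical(df[c], categories=uniques).codes
# code i corresponds to uniques[i]; values missing from uniques would map to -1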
I would like to point out an alternate solution that should serve many readers well. Although I prefer to have a known set of IDs, it is not always necessary if this is strictly one-way remapping.
Instead of
df[c] = df[c].apply(le.transform)
or
dict_table = {val: i for i, val in enumerate(uniques)}
df[c] = df[c].map(dict_table)
or (check out _encode() and _encode_python() in the sklearn source code, which I assume is faster on average than the other methods mentioned)
df[c] = np.array([dict_table[v] for v in df[c].values])
you can instead do
df[c] = df[c].apply(hash)
Pros: much faster, less memory needed, no training, hashes can be reduced to smaller representations (more collisions by casting dtype).
Cons: gives funky numbers, can have collisions (not guaranteed to be perfectly unique), can't guarantee the function won't change with a new version of python
Note that the secure hash functions will have fewer collisions at the cost of speed.
Example of when to use: You have somewhat long strings that are mostly unique and the data set is huge. Most importantly, you don't care about rare hash collisions even though it can be a source of noise in your model's predictions.
I've tried all the methods above and my workload was taking about 90 minutes to learn the encoding from training (1M rows and 600 features) and reapply that to several test sets, while also dealing with new values. The hash method brought it down to a few minutes and I don't need to save any model.
