Is there a way to run this Python snippet faster?

from collections import defaultdict

dct = defaultdict(list)
for n in range(len(res)):
    for i in indices_ordered:
        dct[i].append(res[n][i])
Note that res is a list of pandas Series of length 5000, and indices_ordered is a list of strings of length 20000. It takes 23 minutes on my Mac (2.3 GHz Intel Core i5, 16 GB 2133 MHz LPDDR3) to run this code. I am pretty new to Python, but I feel that more clever coding (maybe less looping) would help a lot.
Edit:
Here is an example of how to create the data (res and indices_ordered) so that the snippet above can be run. (It is slightly changed to access the Series by position rather than by field name, since I could not find how to construct a Series inline with a field name.)
import random, string, pandas
index_sz = 20000
res_sz = 5000
indices_ordered = [''.join(random.choice(string.ascii_uppercase + string.digits) for _ in range(10)) for i in range(index_sz)]
res = [pandas.Series([random.randint(0,10) for i in range(index_sz)], index = random.sample(indices_ordered, index_sz)) for i in range(res_sz)]

The issue here is that you iterate over the whole of indices_ordered for every single Series. Just drop indices_ordered. Scaling the sizes way down (by orders of magnitude) to test the timings:
import random
import string
import numpy as np
import pandas as pd
from collections import defaultdict
index_sz = 200
res_sz = 50
indices_ordered = [''.join(random.choice(string.ascii_uppercase + string.digits)
for _ in range(10)) for i in range(index_sz)]
res = [pd.Series([random.randint(0,10) for i in range(index_sz)],
index = random.sample(indices_ordered, index_sz))
for i in range(res_sz)]
def your_way(res, indices_ordered):
    dct = defaultdict(list)
    for n in range(len(res)):
        for i in indices_ordered:
            dct[i].append(res[n][i])

def my_way(res):
    dct = defaultdict(list)
    for item in res:
        for string_item, value in item.iteritems():
            dct[string_item].append(value)
Gives:
%timeit your_way(res, indices_ordered)
160 ms ± 5.45 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit my_way(res)
6.79 ms ± 47.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
This reduces the time complexity of the whole approach because you don't keep going through indices_ordered each time and assigning values, so the difference will become much more stark as the size of the data grows.
Just increasing one order of magnitude:
index_sz = 2000
res_sz = 500
Gives:
%timeit your_way(res, indices_ordered)
17.8 s ± 999 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit my_way(res)
543 ms ± 9.07 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
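Note: Series.iteritems was deprecated in later pandas releases and removed in pandas 2.0; Series.items() behaves the same way. A minimal variant of the function above for a recent pandas (with an explicit return added for convenience) would be:
def my_way(res):
    dct = defaultdict(list)
    for item in res:
        # items() yields (index_label, value) pairs, just like iteritems() did
        for string_item, value in item.items():
            dct[string_item].append(value)
    return dct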

EDIT: Now that testing data is available, it is clear that the changes below have no effect on run-time. The described techniques are only effective when the inner loop is very efficient (on the order of 5-10 dict lookups), making it more efficient still by removing some of the said lookups. Here the r[i] item lookup dwarfs anything else by orders of magnitude, so the optimizations are simply irrelevant.
Your outer loop takes 5000 iterations, and your inner loop 20000 iterations. This means that you are executing 100 million iterations in 23 minutes, i.e. that each iteration takes 13.8 μs. That is not fast, even in Python.
I would try to cut down the run-time by stripping any unnecessary work from the inner loop. Specifically:
convert for n in range(len(res)) followed by res[n] to for r in res. I don't know how efficient item lookup is in pandas, but it's better to do it in the outer than in the inner loop.
move the score attribute lookup to the outer loop.
get rid of defaultdict; pre-create the lists and use an ordinary dict.
avoid dict stores altogether and work on the lists directly, pre-creating them in a sequence. Only create a dictionary at the end.
cache the lookup of the list append method, and prepare in advance the (append, i) pairs that the inner loop needs.
Here is code that implements the above suggestions:
# pre-create the lists
lsts = [[] for _ in range(len(indices_ordered))]

# prepare the (append, i) pairs
fast_append = [(l.append, i)
               for (l, i) in zip(lsts, indices_ordered)]

for r in res:
    # pre-fetch the score attribute
    r_score = r.score
    for append, i in fast_append:
        append(r_score[i])

# finally, create the dict out of the lists
dct = {i: lst for (i, lst) in zip(indices_ordered, lsts)}

You really should use a DataFrame.
Here's a way to create the data directly:
import pandas as pd
import numpy as np
import random
import string
index_sz = 3
res_sz = 10
indices_ordered = [''.join(random.choice(string.ascii_uppercase + string.digits) for _ in range(3)) for i in range(index_sz)]
df = pd.DataFrame(np.random.randint(10, size=(res_sz, index_sz)), columns=indices_ordered)
There's no need to sort or index anything. A DataFrame can basically be accessed as an array or as a dict.
It should be much faster than juggling with defaultdicts, lists and Series.
df now looks like:
>>> df
   7XQ  VTV  38Y
0    6    9    5
1    5    5    4
2    6    0    7
3    0    0    8
4    7    8    9
5    8    6    4
6    2    4    9
7    3    2    2
8    7    6    0
9    8    0    1
>>> df['7XQ']
0    6
1    5
2    6
3    0
4    7
5    8
6    2
7    3
8    7
9    8
Name: 7XQ, dtype: int64
>>> df['7XQ'][:5]
0    6
1    5
2    6
3    0
4    7
Name: 7XQ, dtype: int64
With the original size, this script outputs a 5000 rows × 20000 columns DataFrame
in less than 3 seconds on my laptop.
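If you already have res as a list of Series like in the question, a minimal sketch of the same idea (assuming every Series carries the same set of index labels, as the generated test data does) would be:
# each Series becomes one row; columns are aligned on the Series index labels
df = pd.DataFrame(res)
# reorder the columns if the original ordering matters
df = df[indices_ordered]
# dict mapping each label to the list of its values across all Series,
# equivalent to the dct built in the question
dct = {col: df[col].tolist() for col in df.columns}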

Use pandas magic (with 2 lines of code) on your input list of pd.Series objects:
all_data = pd.concat([*res])
d = all_data.groupby(all_data.index).apply(list).to_dict()
Implied actions:
pd.concat([*res]) - concatenates all Series into a single one, preserving the index of each Series object (pandas.concat)
all_data.groupby(all_data.index).apply(list).to_dict() - groups equal index labels of all_data.index, puts each group's values into a list with .apply(list), and finally converts the grouped result into a dictionary with .to_dict() (pandas.Series.groupby)
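To see what this does on a tiny, self-contained example (made-up values, not from the question):
import pandas as pd

res = [pd.Series([1, 2], index=['a', 'b']),
       pd.Series([3, 4], index=['b', 'a'])]

all_data = pd.concat(res)   # same as pd.concat([*res]): one long Series with repeated labels
d = all_data.groupby(all_data.index).apply(list).to_dict()
print(d)                    # {'a': [1, 4], 'b': [2, 3]}
Note that groupby sorts the index labels, so the dictionary keys may come out in a different order than indices_ordered.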

Related

Add column to dataframe that has each row's duplicate count value takes too long

I've read SOF posts on how to create a field that contains the number of duplicates that row contains in a pandas DataFrame. Without using any other libraries, I tried writing a function that does this, and it works on small DataFrame objects; however, it takes way too long on larger ones and consumes too much memory.
This is the function:
def count_duplicates(dataframe):
    function = lambda x: dataframe.to_numpy().tolist().count(x.to_list()) - 1
    return dataframe.apply(function, axis=1)
I did a dir() on a numpy array produced by DataFrame.to_numpy(), and I didn't see a method quite like list.count. The reason this takes so long is that for each row it needs to compare the row with all of the rows in the numpy array. I'd like a much more efficient way to do this, even if it's not using a pandas DataFrame. I feel like there should be a simple way to do this with numpy, but I'm just not familiar enough. I've been testing different approaches for a while and it's resulting in a lot of errors. I'm going to keep testing different approaches, but felt the community might provide a better way.
Thank you for your help.
Here is an example DataFrame:
   one  two
0    1    1
1    2    2
2    3    3
3    1    1
I'd use it like this:
d['duplicates'] = count_duplicates(d)
The resulting DataFrame is:
   one  two  duplicates
0    1    1           1
1    2    2           0
2    3    3           0
3    1    1           1
The problem is the actual DataFrame will have 1.4 million rows, and each lambda takes an average of 0.148558 seconds, which if multiplied by 1.4 million rows is about 207981.459 seconds or 57.772 hours. I need a much faster way to accomplish this.
Thank you again.
I updated the function which is speeding things up:
def _counter(series_to_count, list_of_lists):
    return list_of_lists.count(series_to_count.to_list()) - 1

def count_duplicates(dataframe):
    df_list = dataframe.to_numpy().tolist()
    return dataframe.apply(_counter, args=(df_list,), axis=1)
This takes only 29.487 seconds. The bottleneck was converting the dataframe on each function call.
I'm still interested in optimizing this. I'd like to get this down to 2-3 seconds if at all possible. It may not be, but I'd like to make sure it is as fast as possible.
Thank you again.
Here is a vectorized way to do this. For 1.4 million rows, with an average of 140 duplicates for each row, it takes under 0.05 seconds. When there are no duplicates at all, it takes about 0.4 second.
d['duplicates'] = d.groupby(['one', 'two'], sort=False)['one'].transform('size') - 1
On your example:
>>> d
   one  two  duplicates
0    1    1           1
1    2    2           0
2    3    3           0
3    1    1           1
Speed
Relatively high rate of duplicates:
n = 1_400_000
d = pd.DataFrame(np.random.randint(0, 100, size=(n, 2)), columns='one two'.split())
%timeit d.groupby(['one', 'two'], sort=False)['one'].transform('size') - 1
# 48.3 ms ± 110 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
# how many duplicates on average?
>>> (d.groupby(['one', 'two'], sort=False)['one'].transform('size') - 1).mean()
139.995841
# (as expected: n / 100**2)
No duplicates
n = 1_400_000
d = pd.DataFrame(np.arange(2 * n).reshape(-1, 2), columns='one two'.split())
%timeit d.groupby(['one', 'two'], sort=False)['one'].transform('size') - 1
# 389 ms ± 1.55 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
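For clarity, here is a minimal sketch of what transform('size') is doing on the small example (the intermediate Series shown here is not part of the original answer):
import pandas as pd

d = pd.DataFrame({'one': [1, 2, 3, 1], 'two': [1, 2, 3, 1]})

# the size of each (one, two) group, broadcast back to every member row
sizes = d.groupby(['one', 'two'], sort=False)['one'].transform('size')
print(sizes.tolist())        # [2, 1, 1, 2]

# subtract 1 so each row counts only its *other* duplicates
d['duplicates'] = sizes - 1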

Compute column from multiple previous rows in dataframes with conditionals

I'm starting to believe that pandas dataframes are much less intuitive to handle than Excel, but I'm not giving up yet!
So, I'm JUST trying to check data in the same column but in (various) previous rows using the .shift() method. I'm using the following DF as an example, since the original is too complicated to copy in here, but the principle is the same.
counter = list(range(20))
df1 = pd.DataFrame(counter, columns=["Counter"])
df1["Event"] = [True, False, False, False, False, False, True, False,False,False,False,False,False,False,False,False,False,False,False,True]
I'm trying to create sums of the column counter, but only under the following conditions:
If the "Event" = True I want to sum the "Counter" values for the last 10 previous rows before the event happened.
EXCEPT if there is another Event within those 10 previous rows. In this case I only want to sum up the counter values between those two events (without exceeding 10 rows).
To clarify my goal this is the result I had in mind:
My attempt so far looks like this:
for index, row in df1.iterrows():
    if row["Event"] == True:
        counter = 1
        summ = 0
        while counter < 10 and row["Event"].shift(counter) == False:
            summ += row["Counter"].shift(counter)
            counter += 1
        else:
            df1.at[index, "Sum"] = summ
I'm trying to first find Event == True and from there start iterating backwards with a counter and summing up the counters as I go. However it seems to have a problem with shift:
AttributeError: 'bool' object has no attribute 'shift'
Please shatter my beliefs and show me that Excel isn't actually superior.
We need to create a subgroup key with cumsum, then do a rolling sum:
n = 10
s = df1.Counter.groupby(df1.Event.iloc[::-1].cumsum()).\
        rolling(n+1, min_periods=1).sum().\
        reset_index(level=0, drop=True).where(df1.Event)
df1['sum'] = (s - df1.Counter).fillna(0)
df1
    Counter  Event    sum
0         0   True    0.0
1         1  False    0.0
2         2  False    0.0
3         3  False    0.0
4         4  False    0.0
5         5  False    0.0
6         6   True   15.0
7         7  False    0.0
8         8  False    0.0
9         9  False    0.0
10       10  False    0.0
11       11  False    0.0
12       12  False    0.0
13       13  False    0.0
14       14  False    0.0
15       15  False    0.0
16       16  False    0.0
17       17  False    0.0
18       18  False    0.0
19       19   True  135.0
Element-wise approach
You definitely can approach a task in pandas the way you would in Excel. Your approach needs to be tweaked a bit, because pandas.Series.shift operates on whole arrays or Series, not on a single value - you can't use it just to move back up the dataframe relative to a value.
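To make the error concrete, here is a tiny illustration (a made-up Series, not from the question): shift is defined on a whole Series, while row["Event"] inside iterrows is already a single Python bool.
import pandas as pd

events = pd.Series([True, False, False, True])

# Series.shift moves the entire column; missing positions become NaN
print(events.shift(1).tolist())   # [nan, True, False, False]

# a single element is a plain bool, so this raises AttributeError:
# events[0].shift(1)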
The following loops through the indices of your dataframe, walking back up (up to) 10 spots for each Event:
def create_sum_column_loop(df):
    '''
    Adds a Sum column with the rolling sum of 10 Counters prior to an Event
    '''
    df["Sum"] = 0
    for index in range(df.shape[0]):
        counter = 1
        summ = 0
        if df.loc[index, "Event"]:  # == True is implied
            for backup in range(1, 11):
                # handle case where index - backup is before
                # the start of the dataframe
                if index - backup < 0:
                    break
                # stop counting when we hit another event
                if df.loc[index - backup, "Event"]:
                    break
                # increment by the counter
                summ += df.loc[index - backup, "Counter"]
            df.loc[index, "Sum"] = summ
    return df
This does the job:
In [15]: df1_sum1 = create_sum_column_loop(df1.copy()) # copy to preserve original
In [16]: df1_sum1
    Counter  Event  Sum
0         0   True    0
1         1  False    0
2         2  False    0
3         3  False    0
4         4  False    0
5         5  False    0
6         6   True   15
7         7  False    0
8         8  False    0
9         9  False    0
10       10  False    0
11       11  False    0
12       12  False    0
13       13  False    0
14       14  False    0
15       15  False    0
16       16  False    0
17       17  False    0
18       18  False    0
19       19   True  135
Better: vectorized operations
However, the power of pandas comes in its vectorized operations. Python is an interpreted, dynamically-typed language, meaning it's flexible, user friendly (easy to read/write/learn), and slow. To combat this, many commonly-used workflows, including many pandas.Series operations, are written in optimized, compiled code from other languages like C, C++, and Fortran. Under the hood, they're doing the same thing... df1.Counter.cumsum() does loop through the elements and create a running total, but it does it in C, making it lightning fast.
This is what makes learning a framework like pandas difficult - you need to relearn how to do math using that framework. For pandas, the entire game is learning how to use pandas and numpy built-in operators to do your work.
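As a tiny illustration of that point (not part of the original question), both of these compute the same running total; the second delegates the loop to compiled code:
import pandas as pd

counter = pd.Series(range(5))

# pure-Python running total, one element at a time
total, running = 0, []
for value in counter:
    total += value
    running.append(total)

print(running)                    # [0, 1, 3, 6, 10]
print(counter.cumsum().tolist())  # [0, 1, 3, 6, 10]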
Borrowing the clever solution from @YOBEN_S:
def create_sum_column_vectorized(df):
    n = 10
    s = (
        df.Counter
        # group by a unique identifier for each event. This is a
        # particularly clever bit, where @YOBEN_S reverses
        # the order of df.Event, then computes a running total
        .groupby(df.Event.iloc[::-1].cumsum())
        # compute the rolling sum within each group
        .rolling(n+1, min_periods=1).sum()
        # drop the group index so we can align with the original DataFrame
        .reset_index(level=0, drop=True)
        # drop all non-event observations
        .where(df.Event)
    )
    # remove the counter value for the actual event
    # rows, then fill the remaining rows with 0s
    df['sum'] = (s - df.Counter).fillna(0)
    return df
We can see that the result is the same as the one above (though the values are suddenly floats):
In [23]: df1_sum2 = create_sum_column_vectorized(df1) # copy to preserve original
In [24]: df1_sum2
The difference comes in the performance. In ipython or jupyter we can use the %timeit command to see how long a statement takes to run:
In [25]: %timeit create_sum_column_loop(df1.copy())
3.21 ms ± 54.8 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [26]: %timeit create_sum_column_vectorized(df1.copy())
7.76 ms ± 255 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
For small datasets, like the one in your example, the difference will be negligible or will even slightly favor the pure python loop.
For much larger datasets, the difference becomes apparent. Let's create a dataset similar to your example, but with 100,000 rows:
In [27]: df_big = pd.DataFrame({
...: 'Counter': np.arange(100000),
...: 'Event': np.random.random(size=100000) > 0.9,
...: })
...:
Now, you can really see the performance benefit of the vectorized approach:
In [28]: %timeit create_sum_column_loop(df_big.copy())
13 s ± 101 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [29]: %timeit create_sum_column_vectorized(df_big.copy())
5.81 s ± 28 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
The vectorized version takes less than half the time. This difference will continue to widen as the amount of data increases.
Compiling your own workflows with numba
Note that for specific operations, it is possible to speed up operations further by pre-compiling the code yourself. In this case, the looped version can be compiled with numba:
import numba

@numba.jit(nopython=True)
def _inner_vectorized_loop(counter, event, sum_col):
    for index in range(len(counter)):
        summ = 0
        if event[index]:
            for backup in range(1, 11):
                # handle case where index - backup is before
                # the start of the dataframe
                if index - backup < 0:
                    break
                # stop counting when we hit another event
                if event[index - backup]:
                    break
                # increment by the counter
                summ = summ + counter[index - backup]
        sum_col[index] = summ
    return sum_col

def create_sum_column_loop_jit(df):
    '''
    Adds a Sum column with the rolling sum of 10 Counters prior to an Event
    '''
    df["Sum"] = 0
    df["Sum"] = _inner_vectorized_loop(
        df.Counter.values, df.Event.values, df.Sum.values)
    return df
This beats both pandas and the for loop by a factor of more than 1000!
In [90]: %timeit create_sum_column_loop_jit(df_big.copy())
1.62 ms ± 53.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Balancing readability, efficiency, and flexibility is the constant challenge. Best of luck as you dive in!

Pandas column creation methods

There are many methods for creating new columns in Pandas (I may have missed some in my examples, so please let me know if there are others and I will include them here), and I wanted to figure out when it is best to use each method. Obviously some methods are better in certain situations compared to others, but I want to evaluate them from a holistic view looking at efficiency, readability, and usefulness.
I'm primarily concerned with the first three, but I included other ways simply to show that it's possible with different approaches. Here's the sample dataframe:
df = pd.DataFrame({'a':[1,2,3],'b':[4,5,6]})
The most commonly known way is to name a new column such as df['c'] and use apply:
df['c'] = df['a'].apply(lambda x: x * 2)
df
   a  b  c
0  1  4  2
1  2  5  4
2  3  6  6
Using assign can accomplish the same thing:
df = df.assign(c = lambda x: x['a'] * 2)
df
   a  b  c
0  1  4  2
1  2  5  4
2  3  6  6
Updated via @roganjosh:
df['c'] = df['a'] * 2
df
   a  b  c
0  1  4  2
1  2  5  4
2  3  6  6
Using map (definitely not as efficient as apply):
df['c'] = df['a'].map(lambda x: x * 2)
df
   a  b  c
0  1  4  2
1  2  5  4
2  3  6  6
Creating a new pd.Series and then using concat to bring it into the dataframe:
c = pd.Series(df['a'] * 2).rename("c")
df = pd.concat([df,c], axis = 1)
df
   a  b  c
0  1  4  2
1  2  5  4
2  3  6  6
Using join:
df.join(c)
   a  b  c
0  1  4  2
1  2  5  4
2  3  6  6
Short answer: vectorized calls (df['c'] = 2 * df['a']) almost always win on both speed and readability. See this answer regarding what you can use as a "hierarchy" of options when it comes to performance.
In general, if you have a for i in ... or a lambda present somewhere in a Pandas operation, this (sometimes) means that the resulting calculations call Python code rather than the optimized C code that Pandas' Cython internals rely on for vectorized operations. (The same fast path applies to operations that dispatch to NumPy ufuncs on the underlying .values.)
As for .assign(), it is correctly pointed out in the comments that this creates a copy, whereas you can view df['c'] = 2 * df['a'] as the equivalent of setting a dictionary key/value. The former also takes twice as long, although this is perhaps a bit apples-to-oranges, because one operation is returning a DataFrame while the other is just assigning a column.
>>> %timeit df.assign(c=df['a'] * 2)
498 µs ± 15.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
>>> %timeit -r 7 -n 1000 df['c'] = df['a'] * 2
239 µs ± 22.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
As for .map(): generally you see this when, as the name implies, you want to provide a mapping for a Series (though it can be passed a function, as in your question). That doesn't mean it's not performant, it just tends to be used as a specialized method in cases that I've seen:
>>> df['a'].map(dict(enumerate('xyz', 1)))
0    x
1    y
2    z
Name: a, dtype: object
And as for .apply(): to inject a bit of opinion into the answer, I would argue it's more idiomatic to use vectorization where possible. You can see in the code for the module where .apply() is defined: because you are passing a lambda, not a NumPy ufunc, what ultimately gets called is technically a Cython function, map_infer, but it is still performing whatever function you passed on each individual member of the Series df['a'], one at a time.
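To make the apply-vs-vectorized difference tangible, here is a small self-contained benchmark sketch (the column name and sizes are arbitrary, not taken from the question):
import timeit

import numpy as np
import pandas as pd

df = pd.DataFrame({'a': np.arange(100_000)})

# the lambda is called once per element, in Python
t_apply = timeit.timeit(lambda: df['a'].apply(lambda x: x * 2), number=20)

# one vectorized call into compiled code
t_vec = timeit.timeit(lambda: 2 * df['a'], number=20)

print(f"apply: {t_apply:.3f}s   vectorized: {t_vec:.3f}s")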
A succinct way would be:
df['c'] = 2 * df['a']
No need to compute the new column elementwise.
Why are you using a lambda function?
You can achieve the above-mentioned task simply with
df['c'] = 2 * df['a']
This will not add any extra overhead.

How to improve the speed of splitting a list?

I just want to improve the speed of splitting a list. Now I have a way to split the list, but the speed is not as fast as I expected.
def split_list(lines):
    return [x for xs in lines for x in xs.split('-')]

import time

lst = []
for i in range(1000000):
    lst.append('320000-320000')

start = time.clock()
lst_new = split_list(lst)
end = time.clock()
print('time\n', str(end - start))
For example, Input:
lst
['320000-320000', '320000-320000']
Output:
lst_new
['320000', '320000', '320000', '320000']
I'm not satisfied with the speed of splitting, as my data contains many lists.
But now I don't know whether there's a more effective way to do it.
Following the advice given, I will try to describe my whole question more specifically.
import pandas as pd

df = pd.DataFrame({'line': ["320000-320000, 340000-320000, 320000-340000",
                            "380000-320000",
                            "380000-320000,380000-310000",
                            "370000-320000,370000-320000,320000-320000",
                            "320000-320000, 340000-320000, 320000-340000",
                            "380000-320000",
                            "380000-320000,380000-310000",
                            "370000-320000,370000-320000,320000-320000",
                            "320000-320000, 340000-320000, 320000-340000",
                            "380000-320000",
                            "380000-320000,380000-310000",
                            "370000-320000,370000-320000,320000-320000"],
                   'id': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]})

def most_common(lst):
    return max(set(lst), key=lst.count)

def split_list(lines):
    return [x for xs in lines for x in xs.split('-')]

df['line'] = df['line'].str.split(',')
col_ix = df['line'].index.values
df['line_start'] = pd.Series(0, index=df.index)
df['line_destination'] = pd.Series(0, index=df.index)

import time

start = time.clock()
for ix in col_ix:
    col = df['line'][ix]
    col_split = split_list(col)
    even_col_split = col_split[0:][::2]
    even_col_split_most = most_common(even_col_split)
    df['line_start'][ix] = even_col_split_most
    odd_col_split = col_split[1:][::2]
    odd_col_split_most = most_common(odd_col_split)
    df['line_destination'][ix] = odd_col_split_most
end = time.clock()
print('time\n', str(end - start))

del df['line']
print('df\n', df)
Input:
df
    id                                          line
0    1  320000-320000, 340000-320000, 320000-340000
1    2                                 380000-320000
2    3                   380000-320000,380000-310000
3    4    370000-320000,370000-320000,320000-320000
4    5  320000-320000, 340000-320000, 320000-340000
5    6                                 380000-320000
6    7                   380000-320000,380000-310000
7    8    370000-320000,370000-320000,320000-320000
8    9  320000-320000, 340000-320000, 320000-340000
9   10                                 380000-320000
10  11                   380000-320000,380000-310000
11  12    370000-320000,370000-320000,320000-320000
Output:
df
    id  line_start  line_destination
0    1      320000            320000
1    2      380000            320000
2    3      380000            320000
3    4      370000            320000
4    5      320000            320000
5    6      380000            320000
6    7      380000            320000
7    8      370000            320000
8    9      320000            320000
9   10      380000            320000
10  11      380000            320000
11  12      370000            320000
You can regard each number pair in line (e.g. 320000-320000) as representing the starting point and destination of a route.
Expected:
Make the code run faster. (I can't bear the current speed of the code.)
'-'.join(lst).split('-')
seems quite a bit faster:
>>> timeit("'-'.join(lst).split('-')", globals=globals(), number=10)
1.0838123590219766
>>> timeit("[x for xs in lst for x in xs.split('-')]", globals=globals(), number=10)
3.1370303670410067
Depending on what you want to do with your list, using a generator can be slightly faster.
If you need to keep the output stored, then the list solution is faster.
If all you need is to iterate over the words once, you can get rid of some overhead by using a generator.
def split_list_gen(lines):
    for line in lines:
        yield from line.split('-')
Benchmark
import time

lst = ['32000-32000'] * 10000000

start = time.clock()
for x in split_list(lst):
    pass
end = time.clock()
print('list time:', str(end - start))

start = time.clock()
for y in split_list_gen(lst):
    pass
end = time.clock()
print('generator time:', str(end - start))
Output
The generator solution is consistently about 10% faster.
list time: 0.4568295369982612
generator time: 0.4020671741918084
Pushing more of the work below the Python level seems to provide a small speedup:
In [7]: %timeit x = split_list(lst)
407 ms ± 876 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [8]: %timeit x = list(chain.from_iterable(map(methodcaller("split", "-"), lst)))
374 ms ± 2.67 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
methodcaller creates a function that calls the function for you:
methodcaller("split", "-")(x) == x.split("-")
chain.from_iterable creates a single iterator consisting of the elements from a group of iterables:
list(chain.from_iterable([[1,2], [3,4]])) == [1,2,3,4]
Mapping the function returned by methodcaller onto your list of strings produces an iterable of lists suitable for flattening by from_iterable. The benefit of this more functional approach is that the functions involved are all implemented in C and can work with the data in the Python objects, rather than Python byte code that works on the Python objects.
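For completeness, here is the same approach as a self-contained script (the imports are implied by the %timeit session above):
from itertools import chain
from operator import methodcaller

lst = ['320000-320000'] * 1000000

# methodcaller("split", "-")(x) is equivalent to x.split("-")
split_dash = methodcaller("split", "-")

# map yields one small list per string; chain.from_iterable flattens them
lst_new = list(chain.from_iterable(map(split_dash, lst)))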

Efficient way of generating latin squares (or randomly permute numbers in matrix uniquely on both axes - using NumPy)

For example, if there are 5 numbers 1, 2, 3, 4, 5
I want a random result like
[[ 2, 3, 1, 4, 5]
[ 5, 1, 2, 3, 4]
[ 3, 2, 4, 5, 1]
[ 1, 4, 5, 2, 3]
[ 4, 5, 3, 1, 2]]
Ensure every number is unique in its row and column.
Is there an efficient way to do this?
I've tried to use while loops to generate one row for each iteration, but it seems not so efficient.
import numpy as np

numbers = list(range(1, 6))
result = np.zeros((5, 5), dtype='int32')
row_index = 0
while row_index < 5:
    np.random.shuffle(numbers)
    for column_index, number in enumerate(numbers):
        if number in result[:, column_index]:
            break
    else:
        result[row_index, :] = numbers
        row_index += 1
Just for your information, what you are looking for is a way of generating latin squares.
As for the solution, it depends on how random "random" needs to be for you.
I can think of at least four main techniques, two of which have already been proposed.
Hence, I will briefly describe the other two:
loop through all possible permutations of the items and accept the first that satisfy the unicity constraint along rows
use only cyclic permutations to build subsequent rows: these by construction satisfy the unicity constraint along rows (the cyclic transformation can be done forward or backward); for improved "randomness" the rows can be shuffled
Assuming we work with standard Python data types since I do not see a real merit in using NumPy (but results can be easily converted to np.ndarray if necessary), this would be in code (the first function is just to check that the solution is actually correct):
import random
import math
import itertools

# this only works for Iterable[Iterable]
def is_latin_rectangle(rows):
    valid = True
    for row in rows:
        if len(set(row)) < len(row):
            valid = False
    if valid and rows:
        for i, val in enumerate(rows[0]):
            col = [row[i] for row in rows]
            if len(set(col)) < len(col):
                valid = False
                break
    return valid

def is_latin_square(rows):
    return is_latin_rectangle(rows) and len(rows) == len(rows[0])

# : prepare the input
n = 9
items = list(range(1, n + 1))
# shuffle items
random.shuffle(items)
# number of permutations
print(math.factorial(n))

def latin_square1(items, shuffle=True):
    result = []
    for elems in itertools.permutations(items):
        valid = True
        for i, elem in enumerate(elems):
            orthogonals = [x[i] for x in result] + [elem]
            if len(set(orthogonals)) < len(orthogonals):
                valid = False
                break
        if valid:
            result.append(elems)
    if shuffle:
        random.shuffle(result)
    return result

rows1 = latin_square1(items)
for row in rows1:
    print(row)
print(is_latin_square(rows1))

def latin_square2(items, shuffle=True, forward=False):
    sign = -1 if forward else 1
    result = [items[sign * i:] + items[:sign * i] for i in range(len(items))]
    if shuffle:
        random.shuffle(result)
    return result

rows2 = latin_square2(items)
for row in rows2:
    print(row)
print(is_latin_square(rows2))

rows2b = latin_square2(items, False)
for row in rows2b:
    print(row)
print(is_latin_square(rows2b))
For comparison, an implementation by trying random permutations and accepting valid ones (fundamentally what @hpaulj proposed) is also presented.
def latin_square3(items):
    result = [list(items)]
    while len(result) < len(items):
        new_row = list(items)
        random.shuffle(new_row)
        result.append(new_row)
        if not is_latin_rectangle(result):
            result = result[:-1]
    return result

rows3 = latin_square3(items)
for row in rows3:
    print(row)
print(is_latin_square(rows3))
I did not have time (yet) to implement the other method (with backtracking, Sudoku-like solutions from @ConfusedByCode).
With timings for n = 5:
%timeit latin_square1(items)
321 µs ± 24.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%timeit latin_square2(items)
7.5 µs ± 222 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit latin_square2(items, False)
2.21 µs ± 69.9 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit latin_square3(items)
2.15 ms ± 102 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
... and for n = 9:
%timeit latin_square1(items)
895 ms ± 18.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit latin_square2(items)
12.5 µs ± 200 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit latin_square2(items, False)
3.55 µs ± 55.5 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
%timeit latin_square3(items)
The slowest run took 36.54 times longer than the fastest. This could mean that an intermediate result is being cached.
9.76 s ± 9.23 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
So, solution 1 gives a fair deal of randomness but it is not terribly fast (and scales with O(n!)); solutions 2 (and 2b) are much faster (scaling with O(n)) but not as random as solution 1. Solution 3 is very slow and its performance can vary significantly (it can probably be sped up by computing the last row instead of guessing it).
Getting more technical, other efficient algorithms are discussed in:
Jacobson, M. T. and Matthews, P. (1996), Generating uniformly distributed random latin squares. J. Combin. Designs, 4: 405-437. doi:10.1002/(SICI)1520-6610(1996)4:6<405::AID-JCD3>3.0.CO;2-J
This may seem odd, but you have basically described generating a random n-dimension Sudoku puzzle. From a blog post by Daniel Beer:
The basic approach to solving a Sudoku puzzle is by a backtracking search of candidate values for each cell. The general procedure is as follows:
Generate, for each cell, a list of candidate values by starting with the set of all possible values and eliminating those which appear in the same row, column and box as the cell being examined.
Choose one empty cell. If none are available, the puzzle is solved.
If the cell has no candidate values, the puzzle is unsolvable.
For each candidate value in that cell, place the value in the cell and try to recursively solve the puzzle.
There are two optimizations which greatly improve the performance of this algorithm:
When choosing a cell, always pick the one with the fewest candidate values. This reduces the branching factor. As values are added to the grid, the number of candidates for other cells reduces too.
When analysing the candidate values for empty cells, it's much quicker to start with the analysis of the previous step and modify it by removing values along the row, column and box of the last-modified cell. This is O(N) in the size of the puzzle, whereas analysing from scratch is O(N³).
In your case an "unsolvable puzzle" is an invalid matrix. Every element in the matrix will be unique on both axes in a solvable puzzle.
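To illustrate the idea, here is a hypothetical sketch of backtracking applied directly to latin squares (not the Sudoku code from the blog post): fill the grid cell by cell, keep only values not yet used in that row and column, and backtrack when a cell has no candidates.
import random

def random_latin_square(n):
    square = [[0] * n for _ in range(n)]

    def candidates(r, c):
        # values already used to the left in this row or above in this column
        used = set(square[r][:c]) | {square[i][c] for i in range(r)}
        return [v for v in range(1, n + 1) if v not in used]

    def fill(pos):
        if pos == n * n:
            return True          # every cell filled: done
        r, c = divmod(pos, n)
        options = candidates(r, c)
        random.shuffle(options)  # randomise the search order
        for v in options:
            square[r][c] = v
            if fill(pos + 1):
                return True
        square[r][c] = 0         # dead end: undo and backtrack
        return False

    fill(0)
    return square

print(random_latin_square(5))
This is fast for small n, but like the other constructive methods here it is not guaranteed to sample uniformly from all latin squares.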
I experimented with a brute-force random choice. Generate a row, and if valid, add to the accumulated lines:
def foo(n=5, maxi=200):
    arr = np.random.choice(numbers, n, replace=False)[None, :]
    for i in range(maxi):
        row = np.random.choice(numbers, n, replace=False)[None, :]
        if (arr == row).any(): continue
        arr = np.concatenate((arr, row), axis=0)
        if arr.shape[0] == n: break
    print(i)
    return arr
Some sample runs:
In [66]: print(foo())
199
[[1 5 4 2 3]
[4 1 5 3 2]
[5 3 2 1 4]
[2 4 3 5 1]]
In [67]: print(foo())
100
[[4 2 3 1 5]
[1 4 5 3 2]
[5 1 2 4 3]
[3 5 1 2 4]
[2 3 4 5 1]]
In [68]: print(foo())
57
[[1 4 5 3 2]
[2 1 3 4 5]
[3 5 4 2 1]
[5 3 2 1 4]
[4 2 1 5 3]]
In [69]: print(foo())
174
[[2 1 5 4 3]
[3 4 1 2 5]
[1 3 2 5 4]
[4 5 3 1 2]
[5 2 4 3 1]]
In [76]: print(foo())
41
[[3 4 5 1 2]
[1 5 2 3 4]
[5 2 3 4 1]
[2 1 4 5 3]
[4 3 1 2 5]]
The required number of tries varies all over the place, with some exceeding my iteration limit.
Without getting into any theory, there's going to be a difference between quickly generating a 2d permutation and generating one that is, in some sense or other, maximally random. I suspect my approach is closer to this random goal than a more systematic and efficient approach (but I can't prove it).
def opFoo():
    numbers = list(range(1, 6))
    result = np.zeros((5, 5), dtype='int32')
    row_index = 0; i = 0
    while row_index < 5:
        np.random.shuffle(numbers)
        for column_index, number in enumerate(numbers):
            if number in result[:, column_index]:
                break
        else:
            result[row_index, :] = numbers
            row_index += 1
        i += 1
    return i, result
In [125]: opFoo()
Out[125]:
(11, array([[2, 3, 1, 5, 4],
[4, 5, 1, 2, 3],
[3, 1, 2, 4, 5],
[1, 3, 5, 4, 2],
[5, 3, 4, 2, 1]]))
Mine is quite a bit slower than the OP's, but mine is correct.
This is an improvement on mine (2x faster):
def foo1(n=5,maxi=300):
numbers = np.arange(1,n+1)
np.random.shuffle(numbers)
arr = numbers.copy()[None,:]
for i in range(maxi):
np.random.shuffle(numbers)
if (arr==numbers).any(): continue
arr = np.concatenate((arr, numbers[None,:]),axis=0)
if arr.shape[0]==n: break
return arr, i
Why is translated Sudoku solver slower than original?
I found, with this translation of a Java Sudoku solver, that using Python lists was faster than numpy arrays.
I may try to adapt that script to this problem - tomorrow.
EDIT: Below is an implementation of the second solution in norok2's answer.
EDIT: We can shuffle the generated square again to make it more random.
So the solve function can be modified to:
def solve(numbers):
    shuffle(numbers)
    shift = randint(1, len(numbers)-1)
    res = []
    for r in xrange(len(numbers)):
        res.append(list(numbers))
        numbers = list(numbers[shift:] + numbers[0:shift])
    rows = range(len(numbers))
    shuffle(rows)
    shuffled_res = []
    for i in xrange(len(rows)):
        shuffled_res.append(res[rows[i]])
    return shuffled_res
EDIT: I previously misunderstood the question.
So, here's a 'quick' method which generates 'to-some-extent' random solutions.
The basic idea is,
a, b, c
b, c, a
c, a, b
We can just shift a row of data by a fixed step to form the next row, which satisfies our restriction.
So, here's the code:
from random import shuffle, randint

def solve(numbers):
    shuffle(numbers)
    shift = randint(1, len(numbers)-1)
    res = []
    for r in xrange(len(numbers)):
        res.append(list(numbers))
        numbers = list(numbers[shift:] + numbers[0:shift])
    return res

def check(arr):
    for c in xrange(len(arr)):
        col = [arr[r][c] for r in xrange(len(arr))]
        if len(set(col)) != len(col):
            return False
    return True

if __name__ == '__main__':
    from pprint import pprint
    res = solve(range(5))
    pprint(res)
    print check(res)
This is a possible solution using itertools, if you don't insist on using numpy (which I'm not familiar with):
import itertools
from random import randint

list(itertools.permutations(range(1, 6)))[randint(0, len(range(1, 6)))]
# itertools.permutations returns an iterator over all possible permutations of the given sequence.
Can't type code from the phone, here's the pseudocode:
1. Create a matrix with one dimension more than the target matrix (3-D).
2. Initialize the 25 elements, each with the numbers from 1 to 5.
3. Iterate over the 25 elements.
4. Choose a random value for the current element from its element list (which contains the numbers 1 through 5).
5. Remove the randomly chosen value from all the elements in its row and column.
6. Repeat steps 4 and 5 for all the elements.
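A rough translation of that pseudocode into Python (hypothetical names, not the author's code; note that without backtracking the crossing-off can dead-end, so the caller retries):
import random

def try_crossing_off(n=5):
    # one candidate set per cell, initially {1, ..., n}
    candidates = [[set(range(1, n + 1)) for _ in range(n)] for _ in range(n)]
    square = [[0] * n for _ in range(n)]
    for r in range(n):
        for c in range(n):
            if not candidates[r][c]:
                return None                     # dead end: no value left for this cell
            value = random.choice(sorted(candidates[r][c]))
            square[r][c] = value
            for k in range(n):                  # cross the value off its row and column
                candidates[r][k].discard(value)
                candidates[k][c].discard(value)
    return square

square = None
while square is None:                           # retry until a full square comes out
    square = try_crossing_off(5)
print(square)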
