Python - How to check list monotonicity

What would be an efficient and pythonic way to check list monotonicity? i.e. that it has monotonically increasing or decreasing values?
Examples:
[0, 1, 2, 3, 3, 4] # This is a monotonically increasing list
[4.3, 4.2, 4.2, -2] # This is a monotonically decreasing list
[2, 3, 1] # This is neither

It's better to avoid ambiguous terms like "increasing" or "decreasing", as it's not clear whether equality is acceptable or not. Instead use, for example, "non-increasing" (equality is clearly accepted) or "strictly decreasing" (equality is clearly NOT accepted).
def strictly_increasing(L):
    return all(x < y for x, y in zip(L, L[1:]))

def strictly_decreasing(L):
    return all(x > y for x, y in zip(L, L[1:]))

def non_increasing(L):
    return all(x >= y for x, y in zip(L, L[1:]))

def non_decreasing(L):
    return all(x <= y for x, y in zip(L, L[1:]))

def monotonic(L):
    return non_increasing(L) or non_decreasing(L)

If you have large lists of numbers it might be best to use numpy, and if you are already using it:
import numpy as np

def monotonic(x):
    dx = np.diff(x)
    return np.all(dx <= 0) or np.all(dx >= 0)
should do the trick.
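If you also need the strict variants, a minimal sketch along the same lines (my addition, assuming the same numpy import) would be:
import numpy as np

def strictly_monotonic(x):
    dx = np.diff(x)
    return bool(np.all(dx < 0) or np.all(dx > 0))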

import itertools
import operator

def monotone_increasing(lst):
    pairs = zip(lst, lst[1:])
    return all(itertools.starmap(operator.le, pairs))

def monotone_decreasing(lst):
    pairs = zip(lst, lst[1:])
    return all(itertools.starmap(operator.ge, pairs))

def monotone(lst):
    return monotone_increasing(lst) or monotone_decreasing(lst)
This approach is O(N) in the length of the list.
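A quick check against the question's examples:
monotone([0, 1, 2, 3, 3, 4])   # True (non-decreasing)
monotone([4.3, 4.2, 4.2, -2])  # True (non-increasing)
monotone([2, 3, 1])            # False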

#6502 has elegant code for sequences (iterables with __getitem__ and __len__ methods) and #chqrlie has an even better version which does not create temporary copies of sequences with slicing. I just want to add a general version that works for all iterables (objects with an __iter__ method):
def pairwise(iterable):
    items = iter(iterable)
    last = next(items)
    for item in items:
        yield last, item
        last = item

def strictly_increasing(iterable):
    return all(x < y for x, y in pairwise(iterable))

def strictly_decreasing(iterable):
    return all(x > y for x, y in pairwise(iterable))

def non_increasing(iterable):
    return all(x >= y for x, y in pairwise(iterable))

def non_decreasing(iterable):
    return all(x <= y for x, y in pairwise(iterable))

def monotonic(iterable):
    return non_increasing(iterable) or non_decreasing(iterable)
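On Python 3.10 and later, the hand-rolled pairwise() above can be replaced by itertools.pairwise, which yields the same consecutive pairs; for example:
from itertools import pairwise  # available in Python 3.10+

def strictly_increasing(iterable):
    return all(x < y for x, y in pairwise(iterable))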

The pandas package makes this convenient.
import pandas as pd
The following commands work with a list of integers or floats.
Monotonically increasing (≥):
pd.Series(mylist).is_monotonic_increasing
Strictly monotonically increasing (>):
myseries = pd.Series(mylist)
myseries.is_unique and myseries.is_monotonic_increasing
Alternative using an undocumented private method:
pd.Index(mylist)._is_strictly_monotonic_increasing
Monotonically decreasing (≤):
pd.Series(mylist).is_monotonic_decreasing
Strictly monotonically decreasing (<):
myseries = pd.Series(mylist)
myseries.is_unique and myseries.is_monotonic_decreasing
Alternative using an undocumented private method:
pd.Index(mylist)._is_strictly_monotonic_decreasing
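A quick usage sketch with the question's first example list (mylist is just an illustrative name):
import pandas as pd

mylist = [0, 1, 2, 3, 3, 4]
pd.Series(mylist).is_monotonic_increasing   # True: non-strictly increasing
pd.Series(mylist).is_unique                 # False: the repeated 3 rules out strict increase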

#6502 has elegant python code for this. Here is an alternative solution with simpler iterators and no potentially expensive temporary slices:
def strictly_increasing(L):
    return all(L[i] < L[i+1] for i in range(len(L) - 1))

def strictly_decreasing(L):
    return all(L[i] > L[i+1] for i in range(len(L) - 1))

def non_increasing(L):
    return all(L[i] >= L[i+1] for i in range(len(L) - 1))

def non_decreasing(L):
    return all(L[i] <= L[i+1] for i in range(len(L) - 1))

def monotonic(L):
    return non_increasing(L) or non_decreasing(L)

import operator, itertools

def is_monotone(lst):
    op = operator.le                 # pick 'op' based upon the trend between
    if not op(lst[0], lst[-1]):      # the first and last element in 'lst'
        op = operator.ge
    # itertools.izip is Python 2; on Python 3 use the built-in zip instead
    return all(op(x, y) for x, y in itertools.izip(lst, lst[1:]))

Here is a functional solution using reduce of complexity O(n):
is_increasing = lambda L: reduce(lambda a,b: b if a < b else 9999 , L)!=9999
is_decreasing = lambda L: reduce(lambda a,b: b if a > b else -9999 , L)!=-9999
Replace 9999 with the top limit of your values, and -9999 with the bottom limit. For example, if you are testing a list of digits, you can use 10 and -1.
I tested its performance against #6502's answer and it's faster.
Case True: [1,2,3,4,5,6,7,8,9]
# my solution ..
$ python -m timeit "inc = lambda L: reduce(lambda a,b: b if a < b else 9999 , L)!=9999; inc([1,2,3,4,5,6,7,8,9])"
1000000 loops, best of 3: 1.9 usec per loop
# while the other solution:
$ python -m timeit "inc = lambda L: all(x<y for x, y in zip(L, L[1:]));inc([1,2,3,4,5,6,7,8,9])"
100000 loops, best of 3: 2.77 usec per loop
Case False from the 2nd element: [4,2,3,4,5,6,7,8,7]:
# my solution ..
$ python -m timeit "inc = lambda L: reduce(lambda a,b: b if a < b else 9999 , L)!=9999; inc([4,2,3,4,5,6,7,8,7])"
1000000 loops, best of 3: 1.87 usec per loop
# while the other solution:
$ python -m timeit "inc = lambda L: all(x<y for x, y in zip(L, L[1:]));inc([4,2,3,4,5,6,7,8,7])"
100000 loops, best of 3: 2.15 usec per loop
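Two notes on the snippet above (my additions, not part of the original answer): on Python 3 reduce lives in functools, and the magic limits can be replaced with infinities so you don't have to know the range of your values:
from functools import reduce  # reduce is no longer a builtin on Python 3

is_increasing = lambda L: reduce(lambda a, b: b if a < b else float('inf'), L) != float('inf')
is_decreasing = lambda L: reduce(lambda a, b: b if a > b else float('-inf'), L) != float('-inf')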

I timed all of the answers in this question under different conditions, and found that:
Sorting was the fastest by a long shot IF the list was already monotonically increasing.
Sorting was the slowest by a long shot IF the list was shuffled/random, or if more than about one element was out of order. The more out of order the list, the slower the result.
Michael J. Barber's method was the fastest IF the list was mostly monotonically increasing, or completely random.
Here is the code to try it out:
import timeit

setup = '''
import random
from itertools import izip, starmap, islice
import operator

def is_increasing_normal(lst):
    for i in range(0, len(lst) - 1):
        if lst[i] >= lst[i + 1]:
            return False
    return True

def is_increasing_zip(lst):
    return all(x < y for x, y in izip(lst, islice(lst, 1, None)))

def is_increasing_sorted(lst):
    return lst == sorted(lst)

def is_increasing_starmap(lst):
    pairs = izip(lst, islice(lst, 1, None))
    return all(starmap(operator.le, pairs))

if {list_method} in (1, 2):
    lst = list(range({n}))
if {list_method} == 2:
    for _ in range(int({n} * 0.0001)):
        lst.insert(random.randrange(0, len(lst)), -random.randrange(1, 100))
if {list_method} == 3:
    lst = [int(1000*random.random()) for i in xrange({n})]
'''

n = 100000
iterations = 10000
list_method = 1

timeit.timeit('is_increasing_normal(lst)', setup=setup.format(n=n, list_method=list_method), number=iterations)
timeit.timeit('is_increasing_zip(lst)', setup=setup.format(n=n, list_method=list_method), number=iterations)
timeit.timeit('is_increasing_sorted(lst)', setup=setup.format(n=n, list_method=list_method), number=iterations)
timeit.timeit('is_increasing_starmap(lst)', setup=setup.format(n=n, list_method=list_method), number=iterations)
If the list was already monotonically increasing (list_method == 1), the fastest to slowest was:
sorted
starmap
normal
zip
If the list was mostly monotonically increasing (list_method == 2), the fastest to slowest was:
starmap
zip
normal
sorted
(Whether or not the starmap or zip was fastest depended on the execution and I couldn't identify a pattern. Starmap appeared to be usually faster)
If the list was completely random (list_method == 3), the fastest to slowest was:
starmap
zip
normal
sorted (extremely bad)

Here's a variant that accepts both materialized and non-materialized sequences. It automatically determines whether or not it's monotonic, and if so, its direction (i.e. increasing or decreasing) and strictness. Inline comments are provided to help the reader. Similarly for test-cases provided at the end.
def isMonotonic(seq):
    """
    seq.............: - A Python sequence, materialized or not.
    Returns.........:
       (True,0,True):    - Mono Const, Strict: Seq empty or 1-item.
       (True,0,False):   - Mono Const, Not-Strict: All 2+ Seq items same.
       (True,+1,True):   - Mono Incr, Strict.
       (True,+1,False):  - Mono Incr, Not-Strict.
       (True,-1,True):   - Mono Decr, Strict.
       (True,-1,False):  - Mono Decr, Not-Strict.
       (False,None,None) - Not Monotonic.
    """
    items = iter(seq)               # Ensure iterator (i.e. that next(...) works).
    prev_value = next(items, None)  # Fetch 1st item, or None if empty.
    if prev_value is None: return (True,0,True)   # seq was empty.
    # ============================================================
    # The next for/loop scans until it finds the first value-change.
    # ============================================================
    # Ex: [3,3,3,78,...] --or-- [-5,-5,-5,-102,...]
    # ============================================================
    # -- If that 'change-value' represents an Increase or Decrease,
    #    then we know to look for Monotonically Increasing or
    #    Decreasing, respectively.
    # -- If no value-change is found end-to-end (e.g. [3,3,3,...3]),
    #    then it's Monotonically Constant, Non-Strict.
    # -- Finally, if the sequence was exhausted above, which means
    #    it had exactly one element, then it's Monotonically Constant,
    #    Strict.
    # ============================================================
    isSequenceExhausted = True
    curr_value = prev_value
    for item in items:
        isSequenceExhausted = False        # Tiny inefficiency.
        if item == prev_value: continue
        curr_value = item
        break
    else:
        return (True,0,True) if isSequenceExhausted else (True,0,False)
    # ============================================================
    # If we trickled down to here, then none of the above
    # checked cases applied (i.e. didn't short-circuit and
    # 'return'); so we continue with the final step of
    # iterating through the remaining sequence items to
    # determine Monotonicity, direction and strictness.
    # ============================================================
    strict = True
    if curr_value > prev_value:            # Scan for Increasing Monotonicity.
        for item in items:
            if item < curr_value: return (False,None,None)
            if item == curr_value: strict = False   # Tiny inefficiency.
            curr_value = item
        return (True,+1,strict)
    else:                                  # Scan for Decreasing Monotonicity.
        for item in items:
            if item > curr_value: return (False,None,None)
            if item == curr_value: strict = False   # Tiny inefficiency.
            curr_value = item
        return (True,-1,strict)

# ============================================================
# Test cases ...
assert isMonotonic([1,2,3,4])     == (True,+1,True)
assert isMonotonic([4,3,2,1])     == (True,-1,True)
assert isMonotonic([-1,-2,-3,-4]) == (True,-1,True)
assert isMonotonic([])            == (True,0,True)
assert isMonotonic([20])          == (True,0,True)
assert isMonotonic([-20])         == (True,0,True)
assert isMonotonic([1,1])         == (True,0,False)
assert isMonotonic([1,-1])        == (True,-1,True)
assert isMonotonic([1,-1,-1])     == (True,-1,False)
assert isMonotonic([1,3,3])       == (True,+1,False)
assert isMonotonic([1,2,1])       == (False,None,None)
assert isMonotonic([0,0,0,0])     == (True,0,False)
I suppose this could be more Pythonic, but it's tricky because it avoids creating intermediate collections (e.g. lists, genexps, etc.) and employs a fall-through and short-circuit approach to filter the various cases: edge sequences (like empty or one-item sequences, or sequences with all identical items), identifying increasing or decreasing monotonicity, strictness, and so on. I hope it helps.

L = [1,2,3]
L == sorted(L)
L == sorted(L, reverse=True)

Here is an implementation that is both efficient (the space required is constant, no slicing performing a temporary shallow copy of the input) and general (any iterables are supported as input, not just sequences):
def is_weakly_increasing(iterable):
    iterator = iter(iterable)
    next(iterator)
    return all(x <= y for x, y in zip(iterable, iterator))

def is_weakly_decreasing(iterable):
    iterator = iter(iterable)
    next(iterator)
    return all(x >= y for x, y in zip(iterable, iterator))

def is_weakly_monotonic(iterable):
    return is_weakly_increasing(iterable) or is_weakly_decreasing(iterable)

def is_strictly_increasing(iterable):
    iterator = iter(iterable)
    next(iterator)
    return all(x < y for x, y in zip(iterable, iterator))

def is_strictly_decreasing(iterable):
    iterator = iter(iterable)
    next(iterator)
    return all(x > y for x, y in zip(iterable, iterator))

def is_strictly_monotonic(iterable):
    return is_strictly_increasing(iterable) or is_strictly_decreasing(iterable)

import numpy as np

def IsMonotonic(data):
    '''Returns True if data is monotonic.'''
    data = np.array(data)
    # Greater-Equal
    if data[-1] > data[0]:
        return np.all(data[1:] >= data[:-1])
    # Less-Equal
    else:
        return np.all(data[1:] <= data[:-1])

My proposition (with numpy), as a summary of a few ideas here (a quick check against the question's examples follows the list). It uses:
casting to np.array to create boolean values for each list comparison,
np.all to check whether all results are True,
the difference between the first and last element to choose the comparison operator,
direct comparison with >=, <= instead of calculating np.diff.
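A quick check against the question's examples (assuming numpy is imported as np, as above):
assert IsMonotonic([0, 1, 2, 3, 3, 4]) == True    # monotonically increasing
assert IsMonotonic([4.3, 4.2, 4.2, -2]) == True   # monotonically decreasing
assert IsMonotonic([2, 3, 1]) == False            # neither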

Here are two ways of determining whether a list is monotonically increasing or decreasing, using either an explicit loop over range or all() with a generator expression. Both can short-circuit as soon as a violation is found. Enjoy.
a = [1,2,3,4,5]
b = [0,1,6,1,0]
c = [9,8,7,6,5]
def monotonic_increase(x):
    if len(x) <= 1: return False
    for i in range(1, len(x)):
        if x[i-1] >= x[i]:
            return False
    return True

def monotonic_decrease(x):
    if len(x) <= 1: return False
    for i in range(1, len(x)):
        if x[i-1] <= x[i]:
            return False
    return True

monotonic_increase = lambda x: len(x) > 1 and all(x[i-1] < x[i] for i in range(1, len(x)))
monotonic_decrease = lambda x: len(x) > 1 and all(x[i-1] > x[i] for i in range(1, len(x)))
print(monotonic_increase(a))
print(monotonic_decrease(c))
print(monotonic_decrease([]))
print(monotonic_increase(c))
print(monotonic_decrease(a))
print(monotonic_increase(b))
print(monotonic_decrease(b))

def solution1(a):
    up, down = True, True
    for i in range(1, len(a)):
        if a[i] < a[i-1]: up = False
        if a[i] > a[i-1]: down = False
    return up or down

def solution2(a):
    return a == sorted(a) or a == sorted(a, reverse=True)

Solution1 is O(n) and the shorter solution2 is O(n log n).
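For example, with the question's sample lists:
print(solution1([0, 1, 2, 3, 3, 4]))   # True
print(solution2([4.3, 4.2, 4.2, -2]))  # True
print(solution1([2, 3, 1]))            # False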

Summary: Yes, setting a single variable from another variable is thread safe. This provides a high-speed way to transfer a single value between tasks.
This link has a clear explanation: Grok the GIL: How to write fast and thread-safe Python.
My simplified explanation:
All computers, way down at the machine code, provide some (not all) instructions that are non-interruptible. These are referred to as "atomic".
For example, storing a 16-bit constant into a 16-bit memory location. Way back, the 8-bit computers were a mix of features and implementations. For example, the 8-bit processors 6502 and 6800 both had non-interruptible (atomic) instructions to move 16 bits from one memory location to another. The Z80 did not.
The PDP-11 machine code had atomic increment/decrement instructions (var++; var--), which is how K&R came to add those to the early C compilers. At least that's what I remember.
Why does this matter?
Today a lot of computer programming languages are written to a custom-designed "virtual machine". These virtual machines, like all computer processors, support a mix of features and implementations.
Python's virtual machine is written with its custom GIL ("global interpreter lock").
What this means is that some (but not all) of Python's operations are "thread-safe", in that they execute in a single "GIL" cycle.
There are some surprises. For example, "sort" is atomic and therefore thread-safe. Since this is a relatively long operation, that's a surprise (read the link for a much better explanation).
Python has a function that will display the "virtual machine codes" a function is "compiled" into:
## ref: https://opensource.com/article/17/4/grok-gil
import dis

vara = 33
varb = 55

def foo():
    global vara
    vara = 33
    varb = 55
    varb = vara
    return

dis.dis(foo)
which gives the output:
415 0 LOAD_CONST 1 (33)
2 STORE_GLOBAL 0 (vara)
416 4 LOAD_CONST 2 (55)
6 STORE_FAST 0 (varb)
417 8 LOAD_GLOBAL 0 (vara)
10 STORE_FAST 0 (varb)
418 12 LOAD_CONST 0 (None)
14 RETURN_VALUE
Care must be taken when using this (well, all multithreaded programs should be carefully designed and coded) since ONLY A SINGLE value is safe! The good news is that this SINGLE VALUE can point to an entire data structure. Of course the entire data structure isn't atomic, so any changes to it should be wrapped in multithread "stalls" (another name for semaphores, locks, etc., since they basically "stall" the code until the semaphores/locks are released by other threads). I read this somewhere from one of the co-developers of multithreading; in hindsight he thought "stalls" was a better, more descriptive term. Perhaps Dijkstra?

>>> seq = [0, 1, 2, 3, 3, 4]
>>> seq == sorted(seq) or seq == sorted(seq, reverse=True)

Related

Fastest way to determine if an ordered sublist is in a large lists of lists?

Suppose I have a my_huge_list_of_lists with 2,000,000 lists in it, each list about 50 in length.
I want to shorten the 2,000,000 my_huge_list_of_lists by discarding sublists that do not contain two elements in the sequence.
So far I have:
# https://stackoverflow.com/questions/3313590/check-for-presence-of-a-sliced-list-in-python
def check_if_list_is_sublist(lst, sublst):
    # checks if a list appears in order in another larger list.
    n = len(sublst)
    return any((sublst == lst[i:i + n]) for i in xrange(len(lst) - n + 1))

my_huge_list_of_lists = [x for x in my_huge_list_of_lists
                         if not check_if_list_is_sublist(x, [a, b])]
my_huge_list_of_lists = [x for x in my_huge_list_of_lists
                         if not check_if_list_is_sublist(x, [b, a])]
The consecutiveness of the search term [a,b] or [b,a] is important so I can't use a set.issubset().
I find this slow. I'd like to speed it up. I've considered a few options like using an 'early exit' and statement:
my_huge_list_of_lists = [x for x in my_huge_list_of_lists
                         if (a in x and not check_if_list_is_sublist(x, [a, b]))]
and less times in the for loop with an or statement:
my_huge_list_of_lists = [x for x in my_huge_list_of_lists
                         if not (check_if_list_is_sublist(x, [a, b])
                                 or check_if_list_is_sublist(x, [b, a]))]
and also working on speeding up the function (WIP)
# https://stackoverflow.com/questions/48232080/the-fastest-way-to-check-if-the-sub-list-exists-on-the-large-list
def check_if_list_is_sublist(lst, sublst):
    """Checks if a list appears in order in another larger list."""
    set_of_sublists = {tuple(sublst) for sublist in lst}
and done some searching on Stack Overflow; but can't think of a way because the number of times check_if_list_is_sublist() is called is len(my_huge_list) * 2.
edit: add some user data as requested
from random import randint
from string import ascii_lowercase
my_huge_list_of_lists = [[ascii_lowercase[randint(0, 25)] for x in range(50)] for y in range(2000000)]
my_neighbor_search_fwd = [i,c]
my_neighbor_search_rev = my_neighbor_search_fwd.reverse()
Unpack the items of the n-sized subsequence into n variables. Then write a list comprehension to filter the list, doing a check for a, b or b, a in the sub-list, e.g.
a, b = sublst

def checklst(lst, a, b):
    l = len(lst)
    start = 0
    while True:
        try:
            a_index = lst.index(a, start)
        except ValueError:
            return False
        try:
            return a_index > -1 and lst[a_index+1] == b
        except IndexError:
            try:
                return a_index > -1 and lst[a_index-1] == b
            except IndexError:
                start = a_index + 1
                if start == l:
                    return False
                continue  # keep looking at the next a
%timeit found = [l for l in lst if checklst(l, a, b)]
1.88 s ± 31.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit found = [x for x in lst if (a in x and not check_if_list_is_sublist(x, [a,b]))]
22.1 s ± 1.67 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
So, I can't think of any clever algorithm checks to really reduce the amount of work here. However, you are doing a LOT of allocations in your code, and iterating too far. So, just moving some declarations out of the function a bit got me
sublst = [a, b]
l = len(sublst)
indices = range(len(sublst))

def check_if_list_is_sublist(lst):
    for i in range(len(lst) - (l - 1)):
        if lst[i] == sublst[0] and lst[i+1] == sublst[1]:
            return True
        if lst[i] == sublst[1] and lst[i + 1] == sublst[0]:
            return True
    return False

my_huge_list_of_lists = [x for x in my_huge_list_of_lists
                         if not check_if_list_is_sublist(x)]
Which reduced the run-time of your sample code above by about 50%. With a list this size, spawning some more processes and dividing the work would probably see a performance increase as well. Can't think of any way to really reduce the amount of comparisons though...
This isn't what you'd call an "answer" per se; rather, it's a benchmarking framework that should help you determine the quickest way to accomplish what you want, because it allows relatively easy modification as well as the addition of different approaches.
I've put the answers currently posted into it, as well as the results of running it with them.
Caveats: Note that I haven't verified that all the tested answers in it are "correct" in the sense that they actually do what you want, nor how much memory they'll consume in the process—which might be another consideration.
Currently it appears that #Oluwafemi Sule's answer is the fastest, by an order of magnitude (about 10x) over the closest competitor.
from __future__ import print_function
from collections import namedtuple
import sys
from textwrap import dedent
import timeit
import traceback

N = 10  # Number of executions of each "algorithm".
R = 3   # Number of repetitions of those N executions.

from random import randint, randrange, seed
from string import ascii_lowercase

a, b = 'a', 'b'
NUM_SUBLISTS = 1000
SUBLIST_LEN = 50
PERCENTAGE = 50  # Percentage of sublists that should get removed.

seed(42)  # Initialize random number generator so the results are reproducible.
my_huge_list_of_lists = [[ascii_lowercase[randint(0, 25)] for __ in range(SUBLIST_LEN)]
                         for __ in range(NUM_SUBLISTS)]

# Put the target sequence in a percentage of the sublists so they'll be removed.
for __ in range(NUM_SUBLISTS*PERCENTAGE // 100):
    list_index = randrange(NUM_SUBLISTS)
    sublist_index = randrange(SUBLIST_LEN)
    my_huge_list_of_lists[list_index][sublist_index:sublist_index+2] = [a, b]

# Common setup for all testcases (executed before any algorithm specific setup).
COMMON_SETUP = dedent("""
    from __main__ import a, b, my_huge_list_of_lists, NUM_SUBLISTS, SUBLIST_LEN, PERCENTAGE
""")

class TestCase(namedtuple('CodeFragments', ['setup', 'test'])):
    """ A test case is composed of separate setup and test code fragments. """
    def __new__(cls, setup, test):
        """ Dedent code fragment in each string argument. """
        return tuple.__new__(cls, (dedent(setup), dedent(test)))

testcases = {
    "OP (Nas Banov)": TestCase("""
        # https://stackoverflow.com/questions/3313590/check-for-presence-of-a-sliced-list-in-python
        def check_if_list_is_sublist(lst, sublst):
            ''' Checks if a list appears in order in another larger list. '''
            n = len(sublst)
            return any((sublst == lst[i:i+n]) for i in range(len(lst) - n + 1))
        """, """
        shortened = [x for x in my_huge_list_of_lists
                        if not check_if_list_is_sublist(x, [a, b])]
        """
    ),
    "Sphinx Solution 1 (hash)": TestCase("""
        # https://stackoverflow.com/a/49518843/355230

        # Solution 1: By using built-in hash function.
        def prepare1(huge_list, interval=1):  # Use built-in hash function.
            hash_db = {}
            for index in range(len(huge_list) - interval + 1):
                hash_sub = hash(str(huge_list[index:index+interval]))
                if hash_sub in hash_db:
                    hash_db[hash_sub].append(huge_list[index:index+interval])
                else:
                    hash_db[hash_sub] = [huge_list[index:index+interval]]
            return hash_db

        def check_sublist1(hash_db, sublst):  # Use built-in hash function.
            hash_sub = hash(str(sublst))
            if hash_sub in hash_db:
                return any([sublst == item for item in hash_db[hash_sub]])
            return False
        """, """
        hash_db = prepare1(my_huge_list_of_lists, interval=2)
        shortened = [x for x in my_huge_list_of_lists
                        if check_sublist1(hash_db, x)]
        """
    ),
    "Sphinx Solution 2 (str)": TestCase("""
        # https://stackoverflow.com/a/49518843/355230

        # Solution 2: By using str() as hash function.
        def prepare2(huge_list, interval=1):  # Use str() as hash function.
            return {str(huge_list[index:index+interval]): huge_list[index:index+interval]
                        for index in range(len(huge_list) - interval + 1)}

        def check_sublist2(hash_db, sublst):  # Use str() as hash function.
            hash_sub = str(sublst)
            if hash_sub in hash_db:
                return sublst == hash_db[hash_sub]
            return False
        """, """
        hash_db = prepare2(my_huge_list_of_lists, interval=2)
        shortened = [x for x in my_huge_list_of_lists
                        if check_sublist2(hash_db, x)]
        """
    ),
    "Paul Becotte": TestCase("""
        # https://stackoverflow.com/a/49504792/355230
        sublst = [a, b]
        l = len(sublst)
        indices = range(len(sublst))

        def check_if_list_is_sublist(lst):
            for i in range(len(lst) - (l - 1)):
                if lst[i] == sublst[0] and lst[i+1] == sublst[1]:
                    return True
                if lst[i] == sublst[1] and lst[i + 1] == sublst[0]:
                    return True
            return False
        """, """
        shortened = [x for x in my_huge_list_of_lists
                        if not check_if_list_is_sublist(x)]
        """
    ),
    "Oluwafemi Sule": TestCase("""
        # https://stackoverflow.com/a/49504440/355230
        def checklst(lst, a, b):
            try:
                a_index = lst.index(a)
            except ValueError:
                return False
            try:
                return a_index > -1 and lst[a_index+1] == b
            except IndexError:
                try:
                    return a_index > -1 and lst[a_index-1] == b
                except IndexError:
                    return False
        """, """
        shortened = [x for x in my_huge_list_of_lists
                        if not checklst(x, a, b)]
        """
    ),
}

# Collect timing results of executing each testcase multiple times.
try:
    results = [
        (label,
         min(timeit.repeat(testcases[label].test,
                           setup=COMMON_SETUP + testcases[label].setup,
                           repeat=R, number=N)),
        ) for label in testcases
    ]
except Exception:
    traceback.print_exc(file=sys.stdout)  # direct output to stdout
    sys.exit(1)

# Display results.
print('Results for {:,d} sublists of length {:,d} with {}% percent of them matching.'
      .format(NUM_SUBLISTS, SUBLIST_LEN, PERCENTAGE))
major, minor, micro = sys.version_info[:3]
print('Fastest to slowest execution speeds using Python {}.{}.{}\n'
      '({:,d} executions, best of {:d} repetitions)'.format(major, minor, micro, N, R))
print()

longest = max(len(result[0]) for result in results)  # length of longest label
ranked = sorted(results, key=lambda t: t[1])          # ascending sort by execution time
fastest = ranked[0][1]
for result in ranked:
    print('{:>{width}} : {:9.6f} secs, rel speed {:5.2f}x, {:6.2f}% slower '
          ''.format(
                result[0], result[1], round(result[1]/fastest, 2),
                round((result[1]/fastest - 1) * 100, 2),
                width=longest))
print()
Output:
Results for 1,000 sublists of length 50 with 50% percent of them matching
Fastest to slowest execution speeds using Python 3.6.4
(10 executions, best of 3 repetitions)
Oluwafemi Sule : 0.006441 secs, rel speed 1.00x, 0.00% slower
Paul Becotte : 0.069462 secs, rel speed 10.78x, 978.49% slower
OP (Nas Banov) : 0.082758 secs, rel speed 12.85x, 1184.92% slower
Sphinx Solution 2 (str) : 0.152119 secs, rel speed 23.62x, 2261.84% slower
Sphinx Solution 1 (hash) : 0.154562 secs, rel speed 24.00x, 2299.77% slower
For searching for a match in one large list, I believe building an index keyed by hash(element) is a good solution.
The benefits you get:
build the index once and save time for every future use (no need to loop again and again for each search);
you can even build the index when the program launches, then release it when the program exits.
The code below uses two methods to get a hash value: hash() and str(); sometimes you should customize a hash function based on your specific scenario.
If you use str(), the code looks simpler and you don't need to consider hash collisions, but it may blow up memory usage.
For hash(), I used a list to save all sub_lsts which have the same hash value, and you can use hash(sub_lst) % designed_length to control the hash size (but it will increase the hash collision rate).
Output for below codes:
By Hash: 0.00023986603994852955
By str(): 0.00022884208565612796
By OP's: 0.3001317172469765
[Finished in 1.781s]
Test Codes:
from random import randint
from string import ascii_lowercase
import timeit

# Generate test data
my_huge_list_of_lists = [[ascii_lowercase[randint(0, 25)] for x in range(50)] for y in range(10000)]
#print(my_huge_list_of_lists)
test_lst = [['a', 'b', 'c'], ['a', 'b', 'c']]

# Solution 1: By using built-in hash function
def prepare1(huge_list, interval=1):  # use built-in hash function
    hash_db = {}
    for index in range(len(huge_list) - interval + 1):
        hash_sub = hash(str(huge_list[index:index+interval]))
        if hash_sub in hash_db:
            hash_db[hash_sub].append(huge_list[index:index+interval])
        else:
            hash_db[hash_sub] = [huge_list[index:index+interval]]
    return hash_db

hash_db = prepare1(my_huge_list_of_lists, interval=2)

def check_sublist1(hash_db, sublst):  # use built-in hash function
    hash_sub = hash(str(sublst))
    if hash_sub in hash_db:
        return any([sublst == item for item in hash_db[hash_sub]])
    return False

print('By Hash:', timeit.timeit("check_sublist1(hash_db, test_lst)", setup="from __main__ import check_sublist1, my_huge_list_of_lists, test_lst, hash_db ", number=100))

# Solution 2: By using str() as hash function
def prepare2(huge_list, interval=1):  # use str() as hash function
    return {str(huge_list[index:index+interval]): huge_list[index:index+interval] for index in range(len(huge_list) - interval + 1)}

hash_db = prepare2(my_huge_list_of_lists, interval=2)

def check_sublist2(hash_db, sublst):  # use str() as hash function
    hash_sub = str(sublst)
    if hash_sub in hash_db:
        return sublst == hash_db[hash_sub]
    return False

print('By str():', timeit.timeit("check_sublist2(hash_db, test_lst)", setup="from __main__ import check_sublist2, my_huge_list_of_lists, test_lst, hash_db ", number=100))

# Solution 3: OP's current solution
def check_if_list_is_sublist(lst, sublst):
    # checks if a list appears in order in another larger list.
    n = len(sublst)
    return any((sublst == lst[i:i + n]) for i in range(len(lst) - n + 1))

print('By OP\'s:', timeit.timeit("check_if_list_is_sublist(my_huge_list_of_lists, test_lst)", setup="from __main__ import check_if_list_is_sublist, my_huge_list_of_lists, test_lst ", number=100))
If you'd like to remove the matched elements from the list, that is doable, but in effect you may have to rebuild the index for the new list, unless the list is a linked list and the index stores a pointer to each element. I just googled how to get a pointer to one element of a Python list, but couldn't find anything helpful. If someone knows how to do this, please don't hesitate to share your solution. Thanks.
Below is one sample (it generates a new list instead of returning the original one; sometimes we still need to filter something out of the original list):
from random import randint
from string import ascii_lowercase
import timeit

# Generate test data
my_huge_list_of_lists = [[ascii_lowercase[randint(0, 1)] for x in range(2)] for y in range(100)]
#print(my_huge_list_of_lists)
test_lst = [[['a', 'b'], ['a', 'b']], [['b', 'a'], ['a', 'b']]]

# Solution 1: By using built-in hash function
def prepare(huge_list, interval=1):  # use built-in hash function
    hash_db = {}
    for index in range(len(huge_list) - interval + 1):
        hash_sub = hash(str(huge_list[index:index+interval]))
        if hash_sub in hash_db:
            hash_db[hash_sub].append({'beg': index, 'end': index+interval, 'data': huge_list[index:index+interval]})
        else:
            hash_db[hash_sub] = [{'beg': index, 'end': index+interval, 'data': huge_list[index:index+interval]}]
    return hash_db

hash_db = prepare(my_huge_list_of_lists, interval=2)

def check_sublist(hash_db, sublst):  # use built-in hash function
    hash_sub = hash(str(sublst))
    if hash_sub in hash_db:
        return [item for item in hash_db[hash_sub] if sublst == item['data']]
    return []

def remove_if_match_sublist(target_list, hash_db, sublsts):
    matches = []
    for sublst in sublsts:
        matches += check_sublist(hash_db, sublst)
    # make sure to delete elements from end to beginning
    sorted_match = sorted(matches, key=lambda item: item['beg'], reverse=True)
    new_list = list(target_list)
    for item in sorted_match:
        del new_list[item['beg']:item['end']]
    return new_list

print('Removed By Hash:', timeit.timeit("remove_if_match_sublist(my_huge_list_of_lists, hash_db, test_lst)", setup="from __main__ import check_sublist, my_huge_list_of_lists, test_lst, hash_db, remove_if_match_sublist ", number=1))

Filtering another filter object

I am trying to generate primes endlessly by filtering out composite numbers. Using a list to store and test all primes makes the whole thing slow, so I tried to use generators.
from itertools import count

def chk(it, num):
    for i in it:
        if i % num:
            yield i

genStore = [count(2)]
primeStore = []
while 1:
    prime = next(genStore[-1])
    primeStore.append(prime)
    genStore.append(chk(genStore[-1], num))
It works quite well, generating primes, until it hits the maximum recursion depth.
So I found ifilter (or filter in Python 3).
From documentation of python standard library:
Make an iterator that filters elements from iterable returning only those for which the predicate is True. If predicate is None, return the items that are true. Equivalent to:
def ifilter(predicate, iterable):
    # ifilter(lambda x: x%2, range(10)) --> 1 3 5 7 9
    if predicate is None:
        predicate = bool
    for x in iterable:
        if predicate(x):
            yield x
So I get the following:
from itertools import count

genStore = [count(2)]
primeStore = []
while 1:
    prime = next(genStore[-1])
    primeStore.append(prime)
    genStore.append(filter(lambda x: x % num, genStore[-1]))
I expected to get:
2
3
5
7
11
13
17
...
What I get is:
2
3
4
5
6
7
...
It seems next() only iterates through count(), not the filters. Objects in the list should point to the underlying objects, so I expected it to work like filter(lambda x: x%n, (.... (filter(lambda x: x%3, filter(lambda x: x%2, count(2)))). I did some experiments and noticed the following characteristics:
filter(lambda x:x%2,filter(lambda x:x%3,count(0))) #works, filter all 2*n and 3*n
genStore = [count(2)]; genStore.append(filter(lambda x:x%2,genStore[-1])); genStore.append (filter(lambda x:x%2,genStore[-1])) - works, also filter all 2*n and 3*n
next(filter(lambda x:x%2,filter(lambda x:x%3,count(2)))) - works, printing out 5
In contrast:
from itertools import count

genStore = [count(2)]
primeStore = []
while 1:
    prime = next(genStore[-1])
    print(prime)
    primeStore.append(prime)
    genStore.append(filter(lambda x: x % prime, genStore[-1]))
    if len(genStore) == 3:
        for i in genStore[-1]:
            print(i)
# It doesn't work, only filtering out 4*n.
Questions:
Why doesn't it work?
Is it a feature of python, or I made mistakes somewhere?
Is there any way to fix it?
I think your problem stems from the fact that the lambdas are not evaluated until it's 'too late', and then prime is the same for all of them, since they all point at the same variable.
You can try to add a custom filter and use a normal function instead of a lambda:
def myfilt(f, i, p):
    for n in i:
        print("gen:", n, i)
        if f(n, p):
            yield n

def byprime(x, p):
    if x % p:
        print("pri:", x, p)
        return True

f = myfilt(byprime, genStore[-1], prime)
This way you avoid the problem of the lambdas all being the same.
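Another common fix for this late-binding problem, if you want to keep using filter with a lambda, is to bind the current value of prime as a default argument, so that each filter keeps its own copy. A sketch of that idea (my addition, not part of the original answer):
from itertools import count

genStore = [count(2)]
primeStore = []
while len(primeStore) < 10:        # stop after a few primes for the demo
    prime = next(genStore[-1])
    primeStore.append(prime)
    # p=prime freezes the current value; a bare 'lambda x: x % prime' would
    # look prime up later, so every filter would test against the same number
    genStore.append(filter(lambda x, p=prime: x % p, genStore[-1]))
print(primeStore)                  # [2, 3, 5, 7, 11, 13, 17, 19, 23, 29]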

Python: how to search for a substring in a set the fast way?

I have a set containing ~300,000 tuples
In [26]: sa = set(o.node for o in vrts_l2_5)
In [27]: len(sa)
Out[27]: 289798
In [31]: random.sample(sa, 1)
Out[31]: [('835644', '4696507')]
Now I want to look up elements based on a common substring, e.g. the first 4 'digits' (in fact the elements are strings). This is my approach:
def lookup_set(x_appr, y_appr):
    return [n for n in sa if n[0].startswith(x_appr) and n[1].startswith(y_appr)]
In [36]: lookup_set('6652','46529')
Out[36]: [('665274', '4652941'), ('665266', '4652956')]
Is there a more efficient, that is, faster, way to do this?
You can do it in O(log(n) + m) time, where n is the number of tuples and m is the number of matching tuples, if you can afford to keep two sorted copies of the tuples.
Sorting itself will cost O(n log(n)), i.e. it will be asymptotically slower than your naive approach, but if you have to do a certain number of queries (more than log(n), which is almost certainly quite small) it will pay off.
The idea is that you can use bisection to find the candidates that have the correct first value and the correct second value and then intersect these sets.
However note that you want a strange kind of comparison: you care for all strings starting with the given argument. This simply means that when searching for the right-most occurrence you should fill the key with 9s.
Complete working (although not heavily tested) code:
from random import randint
from operator import itemgetter

first = itemgetter(0)
second = itemgetter(1)

sa = [(str(randint(0, 1000000)), str(randint(0, 1000000))) for _ in range(300000)]

f_sorted = sorted(sa, key=first)
s_sorted = sa
s_sorted.sort(key=second)

max_length = max(len(s) for _, s in sa)

# See: bisect module from stdlib
def bisect_right(seq, element, key):
    lo = 0
    hi = len(seq)
    element = element.ljust(max_length, '9')
    while lo < hi:
        mid = (lo+hi)//2
        if element < key(seq[mid]):
            hi = mid
        else:
            lo = mid + 1
    return lo

def bisect_left(seq, element, key):
    lo = 0
    hi = len(seq)
    while lo < hi:
        mid = (lo+hi)//2
        if key(seq[mid]) < element:
            lo = mid + 1
        else:
            hi = mid
    return lo

def lookup_set(x_appr, y_appr):
    x_left = bisect_left(f_sorted, x_appr, key=first)
    x_right = bisect_right(f_sorted, x_appr, key=first)
    x_candidates = f_sorted[x_left:x_right + 1]

    y_left = bisect_left(s_sorted, y_appr, key=second)
    y_right = bisect_right(s_sorted, y_appr, key=second)
    y_candidates = s_sorted[y_left:y_right + 1]

    return set(x_candidates).intersection(y_candidates)
And the comparison with your initial solution:
In [2]: def lookup_set2(x_appr, y_appr):
...: return [n for n in sa if n[0].startswith(x_appr) and n[1].startswith(y_appr)]
In [3]: lookup_set('123', '124')
Out[3]: set([])
In [4]: lookup_set2('123', '124')
Out[4]: []
In [5]: lookup_set('123', '125')
Out[5]: set([])
In [6]: lookup_set2('123', '125')
Out[6]: []
In [7]: lookup_set('12', '125')
Out[7]: set([('12478', '125908'), ('124625', '125184'), ('125494', '125940')])
In [8]: lookup_set2('12', '125')
Out[8]: [('124625', '125184'), ('12478', '125908'), ('125494', '125940')]
In [9]: %timeit lookup_set('12', '125')
1000 loops, best of 3: 589 us per loop
In [10]: %timeit lookup_set2('12', '125')
10 loops, best of 3: 145 ms per loop
In [11]: %timeit lookup_set('123', '125')
10000 loops, best of 3: 102 us per loop
In [12]: %timeit lookup_set2('123', '125')
10 loops, best of 3: 144 ms per loop
As you can see this solution is about 240-1400 times faster(in these examples) than your naive approach.
If you have a big set of matches:
In [19]: %timeit lookup_set('1', '2')
10 loops, best of 3: 27.1 ms per loop
In [20]: %timeit lookup_set2('1', '2')
10 loops, best of 3: 152 ms per loop
In [21]: len(lookup_set('1', '2'))
Out[21]: 3587
In [23]: %timeit lookup_set('', '2')
10 loops, best of 3: 182 ms per loop
In [24]: %timeit lookup_set2('', '2')
1 loops, best of 3: 212 ms per loop
In [25]: len(lookup_set2('', '2'))
Out[25]: 33053
As you can see this solution is faster even if the number of matches is about 10% of the total size. However, if you try to match all the data:
In [26]: %timeit lookup_set('', '')
1 loops, best of 3: 360 ms per loop
In [27]: %timeit lookup_set2('', '')
1 loops, best of 3: 221 ms per loop
It becomes (not so much) slower, although this is a quite peculiar case, and I doubt you'll frequently match almost all the elements.
Note that the time taken to sort the data is quite small:
In [13]: from random import randint
...: from operator import itemgetter
...:
...: first = itemgetter(0)
...: second = itemgetter(1)
...:
...: sa2 = [(str(randint(0, 1000000)), str(randint(0, 1000000))) for _ in range(300000)]
In [14]: %%timeit
...: f_sorted = sorted(sa2, key=first)
...: s_sorted = sorted(sa2, key=second)
...: max_length = max(len(s) for _,s in sa2)
...:
1 loops, best of 3: 881 ms per loop
As you can see it takes less than one second to make the two sorted copies. Actually the above code would be slightly faster since it sorts the second copy "in place" (although timsort could still require O(n) memory).
This means that if you have to do more than about 6-8 queries this solution will be faster.
Note: Python's standard library provides a bisect module. However it doesn't allow a key parameter (even though I remember reading that Guido wanted it, so it may be added in the future). Hence if you want to use it directly, you'll have to use the "decorate-sort-undecorate" idiom.
Instead of:
f_sorted = sorted(sa, key=first)
You should do:
f_sorted = sorted((first, (first,second)) for first,second in sa)
I.e. you explicitly insert the key as the first element of the tuple. Afterwards you could use ('123', '') as element to pass to the bisect_* functions and it should find the correct index.
I decided to avoid this. I copy pasted the code from the sources of the module and slightly modified it to provide a simpler interface for your use-case.
Final remark: if you could convert the tuple elements to integers then the comparisons would be faster. However, most of the time would still be taken to perform the intersection of the sets, so I don't know exactly how much it will improve performances.
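A small update to the note above: since Python 3.10 the standard bisect functions do accept a key argument, so on recent versions the hand-rolled bisection can be dropped. A sketch under that assumption, reusing f_sorted, first and max_length from the code above:
from bisect import bisect_left, bisect_right  # key= requires Python 3.10+

x_appr = '12'  # example prefix
x_left = bisect_left(f_sorted, x_appr, key=first)
x_right = bisect_right(f_sorted, x_appr.ljust(max_length, '9'), key=first)
x_candidates = f_sorted[x_left:x_right]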
You could use a trie data structure. It is possible to build one with a tree of dict objects (see How to create a TRIE in Python) but there is a package marisa-trie that implements a memory-efficient version by binding to c++ libraries
I have not used this library before, but playing around with it, I got this working:
from random import randint
from marisa_trie import RecordTrie

sa = [(str(randint(1000000, 9999999)), str(randint(1000000, 9999999))) for i in range(100000)]

# make length of string in packed format big enough!
fmt = ">10p10p"
sa_tries = (RecordTrie(fmt, zip((unicode(first) for first, _ in sa), sa)),
            RecordTrie(fmt, zip((unicode(second) for _, second in sa), sa)))

def lookup_set(sa_tries, x_appr, y_appr):
    """lookup prefix in the appropriate trie and intersect the result"""
    return (set(item[1] for item in sa_tries[0].items(unicode(x_appr))) &
            set(item[1] for item in sa_tries[1].items(unicode(y_appr))))

lookup_set(sa_tries, "2", "4")
I went through and implemented the 4 suggested solutions to compare their efficiency. I ran the tests with different prefix lengths to see how the input would affect performance. The trie and sorted list performance is definitely sensitive to the length of input with both getting faster as the input gets longer (I think it is actually sensitivity to the size of output since the output gets smaller as the prefix gets longer). However, the sorted set solution is definitely faster in all situations.
In these timing tests, there were 200000 tuples in sa and 10 runs for each method:
for prefix length 1
lookup_set_startswith : min=0.072107 avg=0.073878 max=0.077299
lookup_set_int : min=0.030447 avg=0.037739 max=0.045255
lookup_set_trie : min=0.111548 avg=0.124679 max=0.147859
lookup_set_sorted : min=0.012086 avg=0.013643 max=0.016096
for prefix length 2
lookup_set_startswith : min=0.066498 avg=0.069850 max=0.081271
lookup_set_int : min=0.027356 avg=0.034562 max=0.039137
lookup_set_trie : min=0.006949 avg=0.010091 max=0.032491
lookup_set_sorted : min=0.000915 avg=0.000944 max=0.001004
for prefix length 3
lookup_set_startswith : min=0.065708 avg=0.068467 max=0.079485
lookup_set_int : min=0.023907 avg=0.033344 max=0.043196
lookup_set_trie : min=0.000774 avg=0.000854 max=0.000929
lookup_set_sorted : min=0.000149 avg=0.000155 max=0.000163
for prefix length 4
lookup_set_startswith : min=0.065742 avg=0.068987 max=0.077351
lookup_set_int : min=0.026766 avg=0.034558 max=0.052269
lookup_set_trie : min=0.000147 avg=0.000167 max=0.000189
lookup_set_sorted : min=0.000065 avg=0.000068 max=0.000070
Here's the code:
import random

def random_digits(num_digits):
    return random.randint(10**(num_digits-1), (10**num_digits)-1)

sa = [(str(random_digits(6)), str(random_digits(7))) for _ in range(200000)]

### naive approach
def lookup_set_startswith(x_appr, y_appr):
    return [item for item in sa if item[0].startswith(x_appr) and item[1].startswith(y_appr)]

### trie approach
from marisa_trie import RecordTrie

# make length of string in packed format big enough!
fmt = ">10p10p"
sa_tries = (RecordTrie(fmt, zip([unicode(first) for first, second in sa], sa)),
            RecordTrie(fmt, zip([unicode(second) for first, second in sa], sa)))

def lookup_set_trie(x_appr, y_appr):
    # lookup prefix in the appropriate trie and intersect the result
    return set(item[1] for item in sa_tries[0].items(unicode(x_appr))) & \
           set(item[1] for item in sa_tries[1].items(unicode(y_appr)))

### int approach
sa_ints = [(int(first), int(second)) for first, second in sa]
sa_lens = tuple(map(len, sa[0]))

def lookup_set_int(x_appr, y_appr):
    x_limit = 10**(sa_lens[0]-len(x_appr))
    y_limit = 10**(sa_lens[1]-len(y_appr))
    x_int = int(x_appr) * x_limit
    y_int = int(y_appr) * y_limit
    return [sa[i] for i, int_item in enumerate(sa_ints)
            if (x_int <= int_item[0] and int_item[0] < x_int+x_limit) and
               (y_int <= int_item[1] and int_item[1] < y_int+y_limit)]

### sorted set approach
from operator import itemgetter

first = itemgetter(0)
second = itemgetter(1)

sa_sorted = (sorted(sa, key=first), sorted(sa, key=second))
max_length = max(len(s) for _, s in sa)

# See: bisect module from stdlib
def bisect_right(seq, element, key):
    lo = 0
    hi = len(seq)
    element = element.ljust(max_length, '9')
    while lo < hi:
        mid = (lo+hi)//2
        if element < key(seq[mid]):
            hi = mid
        else:
            lo = mid + 1
    return lo

def bisect_left(seq, element, key):
    lo = 0
    hi = len(seq)
    while lo < hi:
        mid = (lo+hi)//2
        if key(seq[mid]) < element:
            lo = mid + 1
        else:
            hi = mid
    return lo

def lookup_set_sorted(x_appr, y_appr):
    x_left = bisect_left(sa_sorted[0], x_appr, key=first)
    x_right = bisect_right(sa_sorted[0], x_appr, key=first)
    x_candidates = sa_sorted[0][x_left:x_right]
    y_left = bisect_left(sa_sorted[1], y_appr, key=second)
    y_right = bisect_right(sa_sorted[1], y_appr, key=second)
    y_candidates = sa_sorted[1][y_left:y_right]
    return set(x_candidates).intersection(y_candidates)

####
# test correctness
ntests = 10
candidates = [lambda x, y: set(lookup_set_startswith(x, y)),
              lambda x, y: set(lookup_set_int(x, y)),
              lookup_set_trie,
              lookup_set_sorted]
print "checking correctness (or at least consistency)..."
for dlen in range(1, 5):
    print "prefix length %d:" % dlen,
    for i in range(ntests):
        print " #%d" % i,
        prefix = map(str, (random_digits(dlen), random_digits(dlen)))
        answers = [c(*prefix) for c in candidates]
        for i, ans in enumerate(answers):
            for j, ans2 in enumerate(answers[i+1:]):
                assert ans == ans2, "answers for %s for #%d and #%d don't match" \
                    % (prefix, i, j+i+1)
    print

####
# time calls
import timeit
import numpy as np

ntests = 10
candidates = [lookup_set_startswith,
              lookup_set_int,
              lookup_set_trie,
              lookup_set_sorted]
print "timing..."
for dlen in range(1, 5):
    print "for prefix length", dlen
    times = [[] for c in candidates]
    for _ in range(ntests):
        prefix = map(str, (random_digits(dlen), random_digits(dlen)))
        for c, c_times in zip(candidates, times):
            tstart = timeit.default_timer()
            trash = c(*prefix)
            c_times.append(timeit.default_timer()-tstart)
    for c, c_times in zip(candidates, times):
        print " %-25s: min=%f avg=%f max=%f" % (c.func_name, min(c_times), np.mean(c_times), max(c_times))
Integer manipulation is much faster than string manipulation (and uses less memory as well).
So if you can compare integers instead you'll be much faster.
I suspect something like this should work for you:
sa = set(int(o.node) for o in vrts_l2_5)
Then this may work for you:
def lookup_set(samples, x_appr, x_len, y_appr, y_len):
    """
    x_appr == SSS0000 where S is the digit to search for
    x_len  == number of digits to S (if SSS0000 then x_len == 4)
    """
    return ((x, y) for x, y in samples
            if round(x, -x_len) == x_appr and round(y, -y_len) == y_appr)
Also, it returns a generator, so you're not loading all the results into memory at once.
Updated to use round method mentioned by Bakuriu
There may be, but not by terribly much. str.startswith and the and operator are both short-circuiting (they can return once they find a failure), and indexing tuples is a fast operation. Most of the time spent here will be on object lookups, such as finding the startswith method for each string. Probably the most worthwhile option is to run it through PyPy.
A faster solution would be to create a dictionary, with the first value as the key and the second as the value.
Then you would search for keys matching x_appr in the ordered key list of the dictionary (the ordered list would allow you to optimize the search in the key list with a dichotomy, for example). This provides a key list, named for example k_list.
Then look up the values of the dictionary having a key in k_list and matching y_appr.
You can also include the second step (values that match y_appr) before appending to k_list, so that k_list contains all the keys of the correct elements of the dictionary.
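A rough sketch of that idea (my interpretation of this answer, not tested against the original data): map each first element to its second elements, keep the keys sorted, and bisect into the key list:
from bisect import bisect_left
from collections import defaultdict

index = defaultdict(list)            # first element -> all second elements seen with it
for x, y in sa:                      # sa is the set of tuples from the question
    index[x].append(y)
sorted_keys = sorted(index)

def lookup_set(x_appr, y_appr):
    # all keys starting with x_appr form a contiguous run beginning at this position
    i = bisect_left(sorted_keys, x_appr)
    result = []
    while i < len(sorted_keys) and sorted_keys[i].startswith(x_appr):
        k = sorted_keys[i]
        result.extend((k, y) for y in index[k] if y.startswith(y_appr))
        i += 1
    return result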
Here I've just compared the 'in' method and the 'find' method:
The CSV input file contains a list of URLs.
# -*- coding: utf-8 -*-
### test perfo str in set
import re
import sys
import time
import json
import csv
import timeit

cache = set()

#######################################################################
def checkinCache(c):
    global cache
    for s in cache:
        if c in s:
            return True
    return False

#######################################################################
def checkfindCache(c):
    global cache
    for s in cache:
        if s.find(c) != -1:
            return True
    return False

#######################################################################
print "1/3-loading pages..."
with open("liste_all_meta.csv.clean", "rb") as f:
    reader = csv.reader(f, delimiter=",")
    for i, line in enumerate(reader):
        cache.add(re.sub("'", "", line[2].strip()))
print " "+str(len(cache))+" PAGES IN CACHE"

print "2/3-test IN..."
tstart = timeit.default_timer()
for i in range(0, 1000):
    checkinCache("string to find"+str(i))
print timeit.default_timer()-tstart

print "3/3-test FIND..."
tstart = timeit.default_timer()
for i in range(0, 1000):
    checkfindCache("string to find"+str(i))
print timeit.default_timer()-tstart

print "\n\nBYE\n"
results in seconds:
1/3-loading pages...
482897 PAGES IN CACHE
2/3-test IN...
107.765980005
3/3-test FIND...
167.788629055
BYE
so, the 'in' method is faster than 'find' method :)
Have fun

Proper Usage of list.append in Python

I'd like to know why method 1 is correct and method 2 is wrong.
Method1:
def remove_duplicates(x):
    y = []
    for n in x:
        if n not in y:
            y.append(n)
    return y
Method 2:
def remove_duplicates(x):
    y = []
    for n in x:
        if n not in y:
            y = y.append(n)
    return y
I don't understand why the second method returns the wrong answer?
The list.append method returns None, so y = y.append(n) sets y to None.
If this happens on the very last iteration of the for-loop, then None is returned.
If it happens before the last iteration, then the next time through the loop
    if n not in y
will raise
    TypeError: argument of type 'NoneType' is not iterable
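You can see the None return directly in the interpreter:
>>> y = []
>>> print(y.append(1))
None
>>> y
[1]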
Note: In most cases there are faster ways to remove duplicates than Method 1, but how to do it depends on if you wish to preserve order, if the items are orderable, and if the items in x are hashable.
def unique_hashable(seq):
    # Not order preserving. Use this if the items in seq are hashable,
    # and you don't care about preserving order.
    return list(set(seq))

def unique_hashable_order_preserving(seq):
    # http://www.peterbe.com/plog/uniqifiers-benchmark (Dave Kirby)
    # Use this if the items in seq are hashable and you want to preserve the
    # order in which unique items in seq appear.
    seen = set()
    return [x for x in seq if x not in seen and not seen.add(x)]

def unique_unhashable_orderable(seq):
    # Author: Tim Peters
    # http://code.activestate.com/recipes/52560-remove-duplicates-from-a-sequence/
    # Use this if the items in seq are unhashable, but seq is sortable
    # (i.e. orderable). Note the result does not preserve order because of
    # the sort.
    #
    # We can't hash all the elements. Second fastest is to sort,
    # which brings the equal elements together; then duplicates are
    # easy to weed out in a single pass.
    # NOTE: Python's list.sort() was designed to be efficient in the
    # presence of many duplicate elements. This isn't true of all
    # sort functions in all languages or libraries, so this approach
    # is more effective in Python than it may be elsewhere.
    try:
        t = list(seq)
        t.sort()
    except TypeError:
        del t
    else:
        last = t[0]
        lasti = i = 1
        while i < len(seq):
            if t[i] != last:
                t[lasti] = last = t[i]
                lasti += 1
            i += 1
        return t[:lasti]

def unique_unhashable_nonorderable(seq):
    # Use this (your Method 1) if the items in seq are unhashable and unorderable.
    # This method is order preserving.
    u = []
    for x in seq:
        if x not in u:
            u.append(x)
    return u
And this may be the fastest if you have NumPy and the items in seq are orderable:
import numpy as np

def unique_order_preserving_numpy(seq):
    u, ind = np.unique(seq, return_index=True)
    return u[np.argsort(ind)]

Python set intersection question

I have three sets:
s0 = [set([16,9,2,10]), set([16,14,22,15]), set([14,7])] # true, 16 and 14
s1 = [set([16,9,2,10]), set([16,14,22,15]), set([7,8])] # false
I want a function that will return True if every set in the list intersects with at least one other set in the list. Is there a built-in for this or a simple list comprehension?
all(any(a & b for a in s if a is not b) for b in s)
Here's a very simple solution that's very efficient for large inputs:
def g(s):
    import collections
    count = collections.defaultdict(int)
    for a in s:
        for x in a:
            count[x] += 1
    return all(any(count[x] > 1 for x in a) for a in s)
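For example, with the two lists from the question:
g(s0)  # True  -- 16 and 14 each appear in more than one set
g(s1)  # False -- set([7, 8]) shares no element with the others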
It's a little verbose, but I think it's a pretty efficient solution. It takes advantage of the fact that when two sets intersect, we can mark them both as connected. It does this by keeping a list of flags as long as the list of sets. When set i and set j intersect, it sets the flag for both of them. It then loops over the list of sets and only tries to find an intersection for sets that haven't already been intersected. After reading the comments, I think this is what #Victor was talking about.
s0 = [set([16,9,2,10]), set([16,14,22,15]), set([14,7])] # true, 16 and 14
s1 = [set([16,9,2,10]), set([16,14,22,15]), set([7,8])]  # false

def connected(sets):
    L = len(sets)
    if not L: return True
    if L == 1: return False
    passed = [False] * L
    i = 0
    while True:
        while passed[i]:
            i += 1
            if i == L:
                return True
        for j, s in enumerate(sets):
            if j == i: continue
            if sets[i] & s:
                passed[i] = passed[j] = True
                break
        else:
            return False

print connected(s0)
print connected(s1)
I decided that an empty list of sets is connected (If you produce an element of the list, I can produce an element that it intersects ;). A list with only one element is dis-connected trivially. It's one line to change in either case if you disagree.
Here's a more efficient (if much more complicated) solution, that performs a linear number of intersections and a number of unions of order O( n*log(n) ), where n is the length of s:
def f(s):
    import math
    j = int(math.log(len(s) - 1, 2)) + 1
    unions = [set()] * (j + 1)
    for i, a in enumerate(s):
        unions[:j] = [set.union(set(), *s[i+2**k:i+2**(k+1)]) for k in range(j)]
        if not (a & set.union(*unions)):
            return False
        j = int(math.log(i ^ (i + 1), 2))
        unions[j] = set.union(a, *unions[:j])
    return True
Note that this solution only works on Python >= 2.6.
As usual I'd like to give the inevitable itertools solution ;-)
from itertools import combinations, groupby
from operator import itemgetter

def any_intersects(sets):
    # we are doing stuff with combinations of sets
    combined = combinations(sets, 2)
    # group these combinations by their first set
    grouped = (g for k, g in groupby(combined, key=itemgetter(0)))
    # are any intersections in each group
    intersected = (any((a & b) for a, b in group) for group in grouped)
    return all(intersected)

s0 = [set([16,9,2,10]), set([16,14,22,15]), set([14,7])]
s1 = [set([16,9,2,10]), set([16,14,22,15]), set([7,8])]
print any_intersects(s0)  # True
print any_intersects(s1)  # False
This is really lazy and will only do the intersections that are required. It can also be a very confusing and unreadable oneliner ;-)
To answer your question, no, there isn't a built-in or simple list comprehension that does what you want. Here's another itertools based solution that is very efficient -- surprisingly about twice as fast as #THC4k's itertools answer using groupby() in timing tests using your sample input. It could probably be optimized a bit further, but is very readable as presented. Like #AaronMcSmooth, I arbitrarily decided what to return when there are no or is only one set in the input list.
from itertools import combinations

def all_intersect(sets):
    N = len(sets)
    if not N: return True
    if N == 1: return False
    intersected = [False] * N
    for i, j in combinations(xrange(N), 2):
        if not intersected[i] or not intersected[j]:
            if sets[i] & sets[j]:
                intersected[i] = intersected[j] = True
    return all(intersected)
This strategy isn't likely to be as efficient as #Victor's suggestion, but might be more efficient than jchl's answer due to increased use of set arithmetic (union).
s0 = [set([16,9,2,10]), set([16,14,22,15]), set([14,7])]
s1 = [set([16,9,2,10]), set([16,14,22,15]), set([7,8])]

def freeze(list_of_sets):
    """Transform a list of sets into a frozenset of frozensets."""
    return frozenset(frozenset(set_) for set_ in list_of_sets)

def all_sets_have_relatives(set_of_sets):
    """Check if all sets have another set that they intersect with.

    >>> all_sets_have_relatives(s0)  # true, 16 and 14
    True
    >>> all_sets_have_relatives(s1)  # false
    False
    """
    set_of_sets = freeze(set_of_sets)
    def has_relative(set_):
        return set_ & frozenset.union(*(set_of_sets - set((set_,))))
    return all(has_relative(set) for set in set_of_sets)
This may give better performance depending on the distribution of the sets.
def all_intersect(s):
    count = 0
    for x, a in enumerate(s):
        for y, b in enumerate(s):
            if a & b and x != y:
                count += 1
                break
    return count == len(s)
