Surprising order in python set combination methods - python

I am aware that set() in python doesn't have an order as it is implemented as a hash table. However, I was a little surprised to solve a question which involved order using set.intersection().
So I am given two lists with an order, say for example, denoting some ranking or sequence of occurrence. I have to find the element that is common to both lists and has the highest rank (occurs earliest) in the two lists. For example,
List1 = ['j','s','l','p','t']
List2 = ['b','d','q','s','y','j']
should output 's', as it is second in List1 and is the first common element in List2.
If you convert each of the lists into sets and take the intersection Set1.intersection(Set2), you get the set set(['s', 'j']). In my case I could convert this into a list and return the first element, which is approximately O(n1 + n2).
I was happy to solve this interview question (all tests passed), but I am amazed that I could pull off such an order-based problem using a Python set.
Does anyone have a clue how this is working? What is a possible case where this could break down?
EDIT: This seems to be a stroke of luck, so if you have a good solution for this problem, it would also be appreciated.

I found an O(n1+n2) approach. Commented code follows. The trick is to create a lookup table (not a dictionary, a simple array) to index the minimum position of the letters in both lists, and then find the minimum of the sum of those positions and the associated letter.
List1 = ['j','s','l','p','t']
List2 = ['b','d','q','s','y','j']
# unreachable max value to initialize the slots
maxlen = max(len(List1),len(List2))+1000
# create empty slot_array (initialized to values higher than last letter pos
# for both lists (the value is greater than any "valid" position)
# first pos (index 0) is for letter "a", last pos is for letter "z"
slot_array = [[maxlen,maxlen] for x in range(ord('a'),ord('z')+1)]
# scan both lists, and update the position if lower than the one in slot_array
for list_index,the_list in enumerate((List1,List2)):
    print(list_index)
    for letter_index,letter in enumerate(the_list):
        slot = slot_array[ord(letter)-ord('a')]
        if slot[list_index]>letter_index:
            slot[list_index] = letter_index
# now compute minimum of sum of both minimum positions for each letter
min_value = maxlen*2
for i,(a,b) in enumerate(slot_array):
    sab = a+b
    if sab < min_value:
        min_value = sab
        min_index = i
# result is the letter with the minimal sum
print(chr(min_index+ord('a')))
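The same minimum-sum-of-positions idea can be sketched with dictionaries, which drops the assumption that every element is a lowercase letter. This is only an illustration: the function name best_common and the tie-breaking rule (prefer the earlier position in the first list) are assumptions, not part of the answer above.

```python
def best_common(first, second):
    """Return the common element with the smallest sum of first positions."""
    # Record the first position of each element in each list.
    pos1 = {}
    for i, x in enumerate(first):
        pos1.setdefault(x, i)
    pos2 = {}
    for i, x in enumerate(second):
        pos2.setdefault(x, i)

    common = pos1.keys() & pos2.keys()
    if not common:
        return None  # nothing is shared between the two lists
    # Assumed tie-break: the earlier position in `first` wins.
    return min(common, key=lambda x: (pos1[x] + pos2[x], pos1[x]))


List1 = ['j', 's', 'l', 'p', 't']
List2 = ['b', 'd', 'q', 's', 'y', 'j']
print(best_common(List1, List2))
```

Unlike the slot array, this returns None instead of an arbitrary letter when the lists share nothing.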

A list comprehension could do the job:
Set2 = set(List2)
[x for x in List1 if x in Set2]
This will maintain the order of List1, you could do the same for List2 too.
You can then call next on a generator expression (next does not accept a list, so use the generator form rather than the list comprehension) to get the first match without building the whole list.
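A minimal sketch of the generator variant follows. The helper name first_common is hypothetical, and passing None as the default to next is a choice made here so that lists with nothing in common do not raise StopIteration. Note that the result follows the order of the first argument only.

```python
def first_common(first, second):
    """Return the first element of `first` that also appears in `second`."""
    lookup = set(second)  # average O(1) membership tests
    # The generator stops at the first match instead of scanning everything.
    return next((x for x in first if x in lookup), None)


List1 = ['j', 's', 'l', 'p', 't']
List2 = ['b', 'd', 'q', 's', 'y', 'j']
print(first_common(List1, List2))  # ordered by List1
print(first_common(List2, List1))  # ordered by List2
```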

Related

Finding the lowest number that does not occur at the end of a string, in a tuple in a list

I have a list of tuples, each with a single string as element 0. I want to extract the final number from each of these strings, and then find the lowest (positive) number that is not among them.
How do you do this?
E.g. for the list tups:
tups=[('.p1.r1.c2',),('.p1.r1.c4',),('.p1.r1.c16',)]
the final numbers are 2, 4 and 16, so the lowest unused number is 1.
My attempt was this:
tups2 = [tup[0] for tup in tups] # convert tuples in lists to the strings with information we are interested in
tups3 = [tup.rfind("c") for tup in tups2] # find the bit we care about
I wasn't sure how to finish it, or whether it was a fast/smart way to proceed.
Where are you blocked? You can achieve that in basically two steps:
Step 1: Create the list of numbers
One way of doing this (inspired from there):
numbers = [int(s[0][len(s[0].rstrip('0123456789')):]) for s in tups]
In your example, numbers is [2, 4, 16].
Step 2: Find the lowest positive number that is not in this list
x = 1
while x in numbers:
    x += 1
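The two steps can also be sketched as one function. This is an illustration under assumptions: the name lowest_unused is hypothetical, a regular expression is swapped in for the rstrip trick, and a set is used so each membership test is O(1) on average rather than a scan of the list.

```python
import re


def lowest_unused(tups):
    """Return the lowest positive integer not used as a trailing number."""
    used = set()
    for tup in tups:
        match = re.search(r'(\d+)$', tup[0])  # digits at the very end
        if match:
            used.add(int(match.group(1)))
    candidate = 1
    while candidate in used:
        candidate += 1
    return candidate


tups = [('.p1.r1.c2',), ('.p1.r1.c4',), ('.p1.r1.c16',)]
print(lowest_unused(tups))
```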
You didn't really specify your problem, but I'm guessing that getting the lowest unused number is the issue.
The solution above is great, but it just gets the lowest number in the list and not the lowest unused one.
I tried to make a list of all the unused numbers and then get the minimum value of it.
I hope that helps.
tups=[('15.p1.r1.c2',),('.poj1.r1.c4',),('.p2.r4.c160',)]
numbers = []
unused_numbers = []
for tup in tups:
    words = tup[0].strip(".").split('.')
    digits_list = [''.join(x for x in i if x.isdigit()) for i in words]
    unused_numbers.extend(digits_list[:-1])
    numbers.append(digits_list[-1])
print(numbers)
print(min(unused_numbers))
I used the same method Thibault D used to get a list of numbers:
tups=[('.p1.r1.c2',),('.p1.r1.c4',),('.p1.r1.c16',)]
num = [int(i[0][len(i[0].rstrip('0123456789')):]) for i in tups]
However, I used an easier method to get the minimum number:
min(num) - 1
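This takes the lowest number in the list and subtracts 1 from it. That gives 1 for the example, but it matches the lowest unused positive number only when the lowest number in the list is 2.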

Longest repeated substring in massive string

Given a long string, find the longest repeated sub-string.
The brute-force approach of course is to find all substrings and check the substrings of the remaining string, but the string(s) in question have millions of characters (like a DNA sequence, AGGCTAGCT etc) and I'd like something that finishes before the universe collapses in on itself.
Tried a number of approaches, and I have one solution that works quite fast on strings of up to several million, but takes literally forever (6+ hours) for larger strings, particularly when the length of the repeated sequence gets really long.
from collections import defaultdict

def find_lrs(text, cntr=2):
    sol = (0, 0, 0)
    del_list = ['01','01','01']
    while len(del_list) != 0:
        d = defaultdict(list)
        for i in range(len(text)):
            d[text[i:i + cntr]].append(i)
        del_list = [(item, d[item]) for item in d if len(d[item]) > 1]
        # if list is empty, we're done
        if len(del_list) == 0:
            return sol
        else:
            sol = (del_list[0][1][0], (del_list[0][1][1]),len(del_list[0][0]))
            cntr += 1
    return sol
I know it's ugly, but hey, I'm a beginner, and I'm just happy I got something to work. The idea is to go through the string starting out with length-2 substrings as the keys, and the indices where each substring occurs as the value. If the text was, say, 'BANANA', after the first pass through, the dict would look like this:
{'BA': [0], 'AN': [1, 3], 'NA': [2, 4], 'A': [5]}
BA shows up only once - starting at index 0. AN and NA show up twice, showing up at index 1/3 and 2/4, respectively.
I then create a list that only includes keys that showed up at least twice. In the example above, we can remove BA, since it only showed up once - if there's no repeated substring of length 2 starting with 'BA', there won't be a repeated substring of length 3 starting with BA.
So after the first pass through, the pruned list is:
[('AN', [1, 3]), ('NA', [2, 4])]
Since there are at least two possibilities, we save the longest substring and indices found so far and increment the substring length to 3. We continue until no substring is repeated.
As noted, this works on strings up to 10 million in about 2 minutes, which apparently is reasonable - BUT, that's with the longest repeated sequence being fairly short. On a shorter string but longer repeated sequence, it takes -hours- to run. I suspect that it has something to do with how big the dictionary gets, but not quite sure why.
What I'd like to do of course is keep the dictionary short by removing the substrings that clearly aren't repeated, but I can't delete items from the dict while iterating over it. I know there are suffix tree approaches and such that - for now - are outside my ken.
It could simply be that this is beyond my current knowledge, which of course is fine, but I can't shake the idea that there is a solution here.
I forgot to update this. After going over my code again, away from my PC - literally writing out little diagrams on my iPad - I realized that the code above wasn't doing what I thought it was doing.
As noted above, my plan of attack was to go through the string starting out with length-2 substrings as the keys and the indices where each substring occurs as the value, create a list that captures only the length-2 substrings that occurred at least twice, and then only look at those locations.
All well and good - but look closely and you'll realize that I'm never actually updating the default dictionary to only have locations with two or more repeats! //bangs head against table.
I ultimately came up with two solutions. The first solution used a slightly different approach, the 'sorted suffixes' approach. This gets all the suffixes of the word, then sorts them in alphabetical order. For example, the suffixes of "BANANA", sorted, would be:
A
ANA
ANANA
BANANA
NA
NANA
We then look at each adjacent suffix and find how many letters each pair start out having in common. A and ANA have only 'A' in common. ANA and ANANA have "ANA" in common, so we have length 3 as the longest repeated substring. ANANA and BANANA have nothing in common at the start, ditto BANANA and NA. NA and NANA have "NA" in common. So 'ANA', length 3, is the longest repeated substring.
I made a little helper function to do the actual comparing. The code looks like this:
def longest_prefix(suf1, suf2, mx=None):
    min_len = min(len(suf1), len(suf2))
    for i in range(min_len):
        if suf1[i] != suf2[i]:
            return (suf1[0:i], i)
    # no mismatch found: the shorter suffix is a prefix of the longer one
    return (suf1[0:min_len], min_len)

def longest_repeat(txt):
    # all suffixes of txt, in alphabetical order
    lst = sorted([txt[i:] for i in range(len(txt))])
    mxLen = 0
    mx_string = ""
    # compare each suffix with its neighbour in sorted order
    for x in range(len(lst) - 1):
        temp = longest_prefix(lst[x], lst[x + 1])
        if temp[1] > mxLen:
            mxLen = temp[1]
            mx_string = temp[0]
    first = txt.find(mx_string)
    last = txt.rfind(mx_string)
    return (first, last, mxLen)
This works. I then went back and relooked at my original code and saw that I wasn't resetting the dictionary. The key is that after each pass through I update the dictionary to -only- look at repeat candidates.
from collections import defaultdict

def longest_repeat(text):
    # create the initial dictionary with all length-2 repeats
    cntr = 2 # size of initial substring length we look for
    d = defaultdict(list)
    for i in range(len(text)):
        d[text[i:i + cntr]].append(i)
    # keep only the position lists of substrings that occurred more than once
    del_list = [(d[item]) for item in d if len(d[item]) > 1]
    # a repeated length-2 substring is already a valid answer
    sol = (del_list[0][0], del_list[0][1], cntr) if del_list else (0, 0, 0)
    # Keep looking as long as del_list isn't empty,
    while len(del_list) > 0:
        d = defaultdict(list) # reset dictionary
        cntr += 1 # increment search length
        for item in del_list:
            for i in item:
                d[text[i:i + cntr]].append(i)
        # filter as above
        del_list = [(d[item]) for item in d if len(d[item]) > 1]
        # if not empty, update solution
        if len(del_list) != 0:
            sol = (del_list[0][0], del_list[0][1], cntr)
    return sol
This was quite fast, and I think it's easier to follow.
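For sanity-checking either version on small inputs, a brute-force reference can be useful. The sketch below is not one of the two solutions above; the name brute_force_lrs is hypothetical, and it is far too slow for strings with millions of characters. It tries lengths from longest to shortest and returns the first substring it sees twice (overlapping occurrences count).

```python
def brute_force_lrs(text):
    """Return the longest substring that occurs at least twice, or ''."""
    n = len(text)
    for length in range(n - 1, 0, -1):  # longest candidates first
        seen = set()
        for i in range(n - length + 1):
            piece = text[i:i + length]
            if piece in seen:
                return piece  # second occurrence found
            seen.add(piece)
    return ""


print(brute_force_lrs("BANANA"))
```

On "BANANA" this agrees with the sorted-suffix walk-through, which finds 'ANA'.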

Find pairs of numbers that add to a certain value?

I have a function match that takes in a list of numbers and a target number and I want to write a function that finds within the array two numbers that add to that target.
Here is my approach:
>>> def match(values, target=3):
...     for i in values:
...         for j in values:
...             if j != i:
...                 if i + j == target:
...                     return print(f'{i} and {j}')
...     return print('no matching pair')
Is this solution valid? Can it be improved?
One good approach results in an O(N log N) solution.
First, sort the list; this costs O(N log N).
Once the list is sorted, take two indices: the first points to the first element and the second to the last element. Check whether the sum of the two elements matches your target. If the sum is above the target, move the upper index down; if the sum is below the target, move the lower index up. Finish when the upper index is equal to the lower index. This scan is linear and can be done in O(N) time.
All in all, you have O(N log N) for the sorting and O(N) for the scan, bringing the complexity of the whole solution to O(N log N).
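The sort-then-scan steps above can be sketched as follows. The function name match_sorted and the choice to return a tuple (or None when no pair exists) are assumptions made for illustration.

```python
def match_sorted(values, target):
    """Two-pointer search on a sorted copy; returns a pair or None."""
    ordered = sorted(values)          # O(N log N)
    lo, hi = 0, len(ordered) - 1
    while lo < hi:                    # O(N): each step moves one index
        total = ordered[lo] + ordered[hi]
        if total == target:
            return ordered[lo], ordered[hi]
        if total < target:
            lo += 1                   # sum too small: raise the lower index
        else:
            hi -= 1                   # sum too large: lower the upper index
    return None


print(match_sorted([1, 2, 3], 3))
```

Because the two indices always point at different positions, a value is only paired with itself when it really appears twice in the input.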
There is room for improvement. Right now, you have a nested loop. Also, you do not need return when you call print.
As you iterate over values, you are getting the following:
values = [1, 2, 3]
target = 3
first_value = 1
difference: 3 - 1 = 2
We can see that in order for 1 to add up to 3, a 2 is required. Rather than iterating over the values a second time, we can simply ask whether 2 is in values.
def match(values, target):
    values = set(values)
    for value in values:
        summand = target - value
        if summand in values:
            break
    else:
        # the loop finished without break, so no pair was found
        print('No matching pair')
        return
    print(f'{value} and {summand}')
Edit: Converted values to a set, since a set handles in lookups more quickly than a list. If you require the indices of these pairs, such as in the LeetCode problem, you should not convert it to a set, since you will lose the order. You should also use enumerate in the for-loop to get the indices.
Edit: summand == value edge case
def match(values, target):
    for i, value in enumerate(values):
        summand = target - value
        if summand in values[i + 1:]:
            break
    else:
        print('No matching pair')
        return
    print(f'{value} and {summand}')
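A single pass with a set of already-seen values is another way to handle the summand == value edge case while keeping each lookup O(1) on average. This is a sketch under assumptions, not the answerer's code: the name match_one_pass is hypothetical, and it returns the pair instead of printing it.

```python
def match_one_pass(values, target):
    """Return the first pair (earlier, later) summing to target, or None."""
    seen = set()
    for value in values:
        summand = target - value
        # Only values from earlier positions are in `seen`, so a value
        # is paired with itself only if it occurs twice in the input.
        if summand in seen:
            return summand, value
        seen.add(value)
    return None


print(match_one_pass([1, 2, 3], 3))
```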

Counting string list in lists

I have two lists:
bigList = [["A1.1", "A2.1", "A3.1", "A4.1"], ["A3.1", "A4.1", "A5.1"], ["A4.1", "A5.1"]]
smallList = ["A4.1", "A5.1"]
What is the fastest way in Python to count how many of the lists in bigList contain every element of smallList?
At the moment, the right answer is 2.
Maybe I should use Numpy array?
You can use set method issubset:
Syntax:
A.issubset(B)
Return Value from issubset()
The issubset() returns
True if A is a subset of B
False if A is not a subset of B
bigList = [["A1.1", "A2.1", "A3.1", "A4.1"], ["A3.1", "A4.1", "A5.1"], ["A4.1", "A5.1"]]
smallList = ["A4.1", "A5.1"]
count={}
track=1
for sub_list in bigList:
    if set(smallList).issubset(sub_list):
        if tuple(smallList) not in count:
            count[tuple(smallList)]=track
        else:
            count[tuple(smallList)]+=1
print(count)
output:
{('A4.1', 'A5.1'): 2}
To check if a list contains another list entirely, we can use a set comparison:
set(smallerList) <= set(biggerList)
This returns True if all elements of smallerList are contained in the set of biggerList. Do note that this method checks if the individual items are contained and not the order - which sets do not regard. As such it cannot be used if the sequence order matters.
From here we can use a simple list comprehension to go through all the sublists of bigList and apply the above check to each one. Then we just sum up the number of sublists that did contain the smaller set (that is, we add 1 if it is a match, 0 if not).
count = sum([1 if set(smallList) <= set(x) else 0 for x in bigList])
As pointed out by N. Ivanov - this will scale linearly depending on how many sublists are contained in bigList.
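As a small variation on the answers above, the set for smallList can be built once outside the loop, and the booleans returned by issubset can be summed directly. The function name count_containing is hypothetical; like the set comparison, it ignores element order and duplicates.

```python
def count_containing(big_list, small_list):
    """Count the sublists of big_list that contain every item of small_list."""
    needed = set(small_list)  # built once, reused for every sublist
    # issubset accepts any iterable; True counts as 1 when summed.
    return sum(needed.issubset(sub) for sub in big_list)


bigList = [["A1.1", "A2.1", "A3.1", "A4.1"], ["A3.1", "A4.1", "A5.1"], ["A4.1", "A5.1"]]
smallList = ["A4.1", "A5.1"]
print(count_containing(bigList, smallList))
```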

Comparing adjacent elements together without using zip method

How would you go about comparing two adjacent elements in a list in Python? How would you save or store the value of that item while going through a for loop? I'm trying not to use the zip method and just use an ordinary for loop.
comparing_two_elements = ['Hi','Hello','Goodbye','Does it really work finding longest length of string','Jet','Yes it really does work']
longer_string = ''
for i in range(len(comparing_two_elements)-1):
    if len(prior_string) < len(comparing_two_elements[i + 1]):
        longer_string = comparing_two_elements[i+1]
        print(longer_string)
The below works simply by 'saving' the first element of your list as the longest element, as it will be the first time you loop over your list, and then on subsequent iterations it will compare the length of that item to the length of the next item in the list.
longest_element = None
for element in comparing_two_elements:
    if not longest_element:
        longest_element = element
        continue
    if len(longest_element) < len(element):
        longest_element = element
If you want to go the "interesting" route, you could do it with a combination of other functions, for example:
length_map = list(map(len, comparing_two_elements))  # a list, so it supports index() and can be read twice
longest_index = length_map.index(max(length_map))
longest_element = comparing_two_elements[longest_index]
Use the third, optional step argument to range, and don't subtract 1 from len(...)! Your logic is incomplete: what if the first of a pair of strings is longer? You don't do anything in that case.
It's not clear what you're trying to do. This for loop runs through i = 0, 2, 4, ... up to but excluding len(comparing_two_elements) (which is assumed to be even!), and prints the longer of each adjacent pair:
for i in range(0, len(comparing_two_elements), 2):
    if len(comparing_two_elements[i]) < len(comparing_two_elements[i + 1]):
        idx = i + 1  # the second string of the pair is longer
    else:
        idx = i
    print(comparing_two_elements[idx])
This may not do exactly what you want, but as several people have observed, it's unclear just what that is. At least it's something you can reason about and adapt.
If you just want the longest string in a sequence seq, the whole adjacent pairs rigamarole is pointless; simply use:
longest_string = max(seq, key=len)
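If the goal really is to compare every overlapping pair of neighbours (elements 0 and 1, then 1 and 2, and so on) with a plain for loop and no zip, one possible reading can be sketched like this. The function name longer_of_adjacent and the rule that ties keep the earlier string are assumptions, since the question does not say.

```python
def longer_of_adjacent(items):
    """For each neighbouring pair, keep the longer string (ties: the earlier)."""
    result = []
    for i in range(len(items) - 1):  # stop one short so i + 1 stays in range
        current, following = items[i], items[i + 1]
        if len(following) > len(current):
            result.append(following)
        else:
            result.append(current)
    return result


print(longer_of_adjacent(['Hi', 'Hello', 'Goodbye', 'Jet']))
```

Here subtracting 1 from len(...) is needed because the pairs overlap, unlike the step-2 loop above, which walks disjoint pairs.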
