Longest alphabetical substring - where to begin - python

I am working on the "longest alphabetical substring" problem from the popular MIT course. I have read a lot of the information on SO about how to code it but I'm really struggling to make the leap conceptually. The finger exercises preceding it were not too hard. I was wondering if anyone knows of any material out there that would really break down the problem solving being employed in this problem. I've tried getting out a pen and paper and I just get lost. I see people employing "counters" of sorts, or strings that contain the "longest substring so far" and when I'm looking at someone else's solution I can understand what they've done with their code, but if I'm trying to synthesize something of my own it's just not clicking.
I even took a break from the course and tried learning via some other books, but I keep coming back to this problem and feel like I need to break through it. I guess what I'm struggling with is making the leap from knowing some Python syntax and tools to actually employing those tools organically for problem solving or "computing".
Before anyone points me towards it, I'm aware of the course materials that are aimed at helping out. I've seen some videos that one of the TAs made that are somewhat helpful but he doesn't really break this down. I feel like I need to pair program it with someone or like... sit in front of a whiteboard and have someone walk me step by step and answer every stupid question I would have.
For reference, the problem is as follows:
Assume s is a string of lower case characters.
Write a program that prints the longest substring of s in which the letters occur in alphabetical order. For example, if s = 'azcbobobegghakl', then your program should print
Longest substring in alphabetical order is: beggh
In the case of ties, print the first substring. For example, if s = 'abcbcd', then your program should print
Longest substring in alphabetical order is: abc
I know that it's helpful to post code but I don't have anything that isn't elsewhere on SO because, well, that's what I've been playing with in my IDE to see if I can understand what's going on. Again, not looking for code snippets - more some reading or resources that will expand upon the logic being employed in this problem. I'll post what I do have but it's not complete and it's as far as I get before I start feeling confused.
s = 'azcbobobegghakl'
current = s[0]
longest = s[0]
for letter in range(0, len(s) -1):
    if s[letter + 1] >= s[letter]:
        current.append(s[letter + 1])
        if len(current) > len(longest):
            longest = current
    else:
        current =
Sorry for formatting errors, still new to this. I'm really frustrated with this problem.

You're almost there in your example; it just needs a little tweaking:
s = 'azcbobobegghakl'
longest = [s[0]]  # make them lists so we can append to them (strings have no append)
current = [s[0]]
for letter in range(0, len(s) - 1):
    if s[letter + 1] >= s[letter]:
        current.append(s[letter + 1])
        if len(current) > len(longest):
            longest = list(current)  # copy, so later appends to current don't also change longest
    else:
        current = [s[letter + 1]]  # reset current if we break an alphabetical chain
longest_string = ''.join(longest)  # turn our list back into a string
Output of longest_string:
'beggh'

If you are struggling with the concepts and logic behind solving this problem, I would recommend perhaps stepping back a little and going through easier coding tutorials and interactive exercises. You might also enjoy experimenting with JavaScript, where it might be easier to get creative right from the outset, building out snippets and/or webpages that one can immediately interact with in the browser. Then when you get more fun coding vocabulary under your belt, the algorithmic part of it will seem more familiar and natural. I also think letting your own creativity and imagination guide you can be a very powerful learning process.
Let's forget about the alphabetical part for the moment. Imagine we have a bag of letters that we pull out one at a time without knowing which is next and we have to record the longest run of Rs in a row. How would you do it? Let's try to describe the process in words, then pseudocode.
We'll keep a container for the longest run we've seen so far and another to check the current run. We pull letters until we hit two Rs in a row, which we put in the "current" container. The next letter is not an R, which means our run ended. The "longest-so-far" container is empty, so we pour the "current" container into it and continue. The next four letters are not Rs so we just ignore them. Then we get one R, which we put in "current," and then an H. Our run ended again, but this time the one R in "current" is fewer than the two we already have in "longest-so-far," so we keep those and empty "current."
We get an A, a B, and a C, and then a run of five Rs, which we put into the "current" container one by one. Our bag now contains the last letter, a T. We see that our run ended again and that "current" container has more than the "longest-so-far" container so we pour out "longest" and replace its contents with the five Rs in "current." That's it, we found the longest run of Rs in the bag. (If we had more runs, each time one ended we'd choose whether to replace the contents of "longest-so-far.")
In pseudocode:
// Initialise
current <- nothing
longest <- nothing
for letter in bag:
    if letter == 'R':
        add 'R' to current
    else:
        if there are more letters
        in current than longest:
            empty longest and pour
            in letters from current
        either way:
            empty current to get ready
            for the next possible run
// Bag is empty: run the same check one last time,
// in case the final run reached the end of the bag
Now the alphabetical stipulation only slightly complicates our condition. We need to keep track of the most recent letter placed in "current," and for a run to continue, it's no longer seeing another of the same letter that counts. Rather, the next letter has to be at least as "great" (lexicographically) as the last one we placed in "current"; repeated letters, like the gg in beggh, still count as alphabetical order. Otherwise the run ends and we perform our quantity check against "longest-so-far."
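The two-container idea above can be sketched in Python roughly as follows. This is only one possible translation, not the course's reference solution; the function name longest_alphabetical_run is made up for illustration, and plain strings stand in for the two containers.

```python
def longest_alphabetical_run(s):
    """Return the first longest substring of s whose letters never decrease."""
    longest = ''   # the "longest-so-far" container
    current = ''   # the "current" container
    for letter in s:
        # The run ends when the new letter sorts before the last one we kept.
        if current and letter < current[-1]:
            if len(current) > len(longest):
                longest = current
            current = ''
        current += letter
    # The bag is empty: check the final run as well.
    if len(current) > len(longest):
        longest = current
    return longest


print('Longest substring in alphabetical order is: '
      + longest_alphabetical_run('azcbobobegghakl'))
```

Using a strict > in the length check is what keeps the first substring in the case of ties.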

It is often easier to first list every candidate from the input and then filter that list with the additional logic needed, although this brute-force approach does far more work than a single pass. For instance, when finding longest substrings, all substrings of the input can be found, and then only elements that are valid sequences are retained:
def is_valid(d):
    return all(d[i] <= d[i+1] for i in range(len(d)-1))
def longest_substring(s):
    # every non-empty substring s[i:b], ordered by start index, including those that end at the last character
    candidates = [s[i:b] for i in range(len(s)) for b in range(i + 1, len(s) + 1)]
    substrings = list(filter(is_valid, candidates))
    max_length = len(max(substrings, key=len))  # length of the longest valid substring, used to resolve ties
    return [i for i in substrings if len(i) == max_length][0]  # first one wins a tie
l = [['abcbcd', 'abc'], ['azcbobobegghakl', 'beggh']]
for a, b in l:
    assert longest_substring(a) == b
print('all tests passed')
Output:
all tests passed

One way of dealing with implementation complexity is, for me, to write some unit tests: at some point, if I can't figure out from "reading the code" what is wrong, and/or what parts are missing, I like to write unit tests which is an "orthogonal" approach to the problem (instead of thinking "how can I solve this?" I ask myself "what tests should I write to verify it works ok?").
Then, by running the tests, I can observe how the implementation behaves and fix problems one by one, i.e. concentrate on making the next unit test pass.
It's also a way of "cutting a big problem in smaller problems which are easier to reason about".
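As a rough illustration of that workflow, here is what a handful of tests for this problem might look like with the standard unittest module. The function longest_substring below is a hypothetical implementation included only so the tests have something to run against; the test names are made up as well.

```python
import unittest


def longest_substring(s):
    """Hypothetical function under test: first longest alphabetical run in s."""
    longest = current = s[:1]
    for prev, letter in zip(s, s[1:]):
        # extend the run while letters do not decrease, otherwise start over
        current = current + letter if letter >= prev else letter
        if len(current) > len(longest):
            longest = current
    return longest


class LongestSubstringTest(unittest.TestCase):
    def test_example_from_problem(self):
        self.assertEqual(longest_substring('azcbobobegghakl'), 'beggh')

    def test_tie_returns_first(self):
        self.assertEqual(longest_substring('abcbcd'), 'abc')

    def test_single_letter(self):
        self.assertEqual(longest_substring('q'), 'q')

    def test_run_at_end_of_string(self):
        self.assertEqual(longest_substring('zyabc'), 'abc')
```

Saved in a file, these can be run with python -m unittest. Each failing test then points at one small behaviour to fix, such as the tie rule or the run that ends at the last character.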

s = 'azcbobobeggh'
ls = []  # create a new empty list
for i in range(len(s) - 1):  # iterate over s from index 0 to index -2
    if s[i] <= s[i+1]:  # compare neighbouring letters
        ls.append(s[i])  # still in order: just append the letter
    else:
        ls.append(s[i])
        ls.append('mark')  # place a 'mark' to separate the letters into ordered chunks
ls.append(s[-1])  # add the last letter (index -1), which the loop skipped
# at this point here ls:
# ['a', 'z', 'mark', 'c', 'mark', 'b', 'o', 'mark', 'b', 'o', 'mark', 'b', 'e', 'g', 'g', 'h']
ls = str(''.join(ls))  # 'azmarkcmarkbomarkbomarkbeggh'
ls = ls.split('mark')  # ['az', 'c', 'bo', 'bo', 'beggh']
res = max(ls, key=len)  # now just find the longest string in the list (the first one wins a tie)
print('Longest substring in alphabetical order is: ' + res)
# Longest substring in alphabetical order is: beggh

Related

Efficiently remove single letter substrings from a string

So I've been trying to attack this problem for a while but have no idea how to do it efficiently.
I'm given a string of N (N >= 3) characters that consists solely of the characters 'A' and 'B'. I have to efficiently count all the possible substrings, with the characters in the same order as given, that contain only one A or only one B.
For example ABABA:
For three letters, the substrings would be: ABA, BAB, ABA. For this all three count because all three of them contain only one B or only one A.
For four letters, the substrings would be: ABAB, BABA. None of these count because they both don't have only one A or B.
For five letters: ABABA. This doesn't count because it doesn't have only one A or B.
If the string was bigger, then all substring combinations would be checked.
I need to implement this in O(n^2) or even O(nlogn) time, but the best I've been able to do is O(n^3) time: I loop over substring lengths from 3 to the string's length, use a nested for loop to check each substring, and then use indexOf and lastIndexOf, checking for each substring whether they match and don't equal -1 (meaning that there is only 1 of the character), for both A and B.
Any ideas how to implement O(n^2) or O(nlogn) time? Thanks!
Efficiently remove single letter substrings from a string
This is completely impossible. Removing a letter is O(n) time already. The right answer is to not remove anything anywhere. You don't need to.
The actual answer is to stop removing letters and making substrings. If you call substring you messed up.
Any ideas how to implement O(n^2) or O(nlogn) time? Thanks!
I have no clue. Also seems kinda silly. But, there's some good news: There's an O(n) algorithm available, why mess about with pointlessly inefficient algorithms?
charAt(i) is efficient. We can use that.
Here's your algorithm, in pseudocode because if I just write it for you, you wouldn't learn much:
First do the setup. It's a little bit complicated:
Maintain counters for # of times A and B occurs.
Maintain the position of the start of the current substring you're on. This starts at 0, obviously.
Start off the proceedings by looping from 0 to x (x = substring length), and update your A/B counters. So, if x is 3, and input is ABABA, you want to end with aCount = 2 and bCount = 1.
With that prepwork completed, let's run the algorithm:
Check for your current substring (that's the substring that starts at 0) if it 'works'. You do not need to run substring or do any string manipulation at all to know this. Just check your aCount and bCount variables. Is one of them precisely 1? Then this substring works. If not, it doesn't. Increment your answer counter by 1 if it works, don't do that if it doesn't.
Next, move to the next substring. To calculate this, first get the character at your current position (0). Then subtract 1 from aCount or bCount depending on what's there. Then, fetch the char at 'the end' (.charAt(pos + x)) and add 1 to aCount or bCount depending on what's there. Your aCount and bCount vars now represent how many As and Bs, respectively, are in the substring that starts at pos 1. And it only took 2 constant steps to update these vars.
... and loop. Keep looping until the end (pos + x) is at the end of the string.
This is O(n) for one substring length: Given, say, an input string of 1000 chars, and a substring check of 10, then the setup costs 10, and the central loop costs 990 loops. O(n) to the dot. .charAt is O(1), and you need two of them on every loop. Constant factors don't change big-O number. Repeating the pass for every length from 3 up to n, as the question requires, gives O(n^2) overall.
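The sliding-counter steps above might look something like this in Python (the question appears to be about Java, so indexing s[i] stands in for charAt(i)). The function names are made up for illustration, and the second function simply repeats the pass for every length from 3 upward, which is an assumption about what the question wants counted.

```python
def count_windows(s, x):
    """Count the length-x windows of s holding exactly one 'A' or exactly one 'B'."""
    if x > len(s):
        return 0
    # Setup: count the As and Bs in the first window without building substrings.
    a_count = sum(1 for i in range(x) if s[i] == 'A')
    b_count = x - a_count
    answer = 0
    pos = 0
    while True:
        if a_count == 1 or b_count == 1:
            answer += 1
        if pos + x >= len(s):
            break  # the window has reached the end of the string
        # Slide by one: drop the character at pos, take in the one at pos + x.
        if s[pos] == 'A':
            a_count -= 1
        else:
            b_count -= 1
        if s[pos + x] == 'A':
            a_count += 1
        else:
            b_count += 1
        pos += 1
    return answer


def count_all_lengths(s):
    """Repeat the O(n) pass for every window length from 3 to len(s)."""
    return sum(count_windows(s, x) for x in range(3, len(s) + 1))


print(count_all_lengths('ABABA'))
```

For the ABABA example in the question this counts the three length-3 windows and nothing longer.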

Multiple index matches with a for loop in python

I'm trying to understand just how a python for loop iterates. I know how to iterate with c++ but I have been asked to write this program in python. Forgive my knowledge in python but I am by no means an expert on the subject.
I've googled many possible solutions, however, they have not given actual guidance to my issue. Meaning that there was never an actual explanation as to how the coding works to iterate one by one and to be able to match 3 consecutive indexes.
for i in range(0, len(dna)):
if dna[i] == 'A' & dna[i+1] == 'T' & dna[i+2] == 'G':
protein_sequence[dna[i:i+3]]
//for i in range(0, len(dna)-(3+len(dna)%3), 3):
// if protein[dna[i:i+3]] == "ATG":
// protein_sequence += protein[dna[i:i+3]]
if protein[dna[i:i+3]] == "STOP" :
break
protein_sequence += protein[dna[i:i+3]]
What I am trying to do is to iterate through and match an "exact" three character sequence. Once the sequence is found then I can iterate through by sequences of 3's until I match the "Stop" sequence. The for loop that is commented out didn't work either as far as finding the "Start" trigger to initiate the for loop. Thank you in advance for assistance.
In Python, there is no such thing as a multiple index match; in case you need to look up the surrounding values of an element in an array, use a sliding window of size len(pattern):
def match(s, pattern):  # returns the index of the FIRST match, or None
    for start in range(len(s) - len(pattern) + 1):  # + 1 so the last window is checked too
        if s[start: start + len(pattern)] == pattern:
            return start
    return None
idx = match(dna, "ATG")
if idx is not None:
    pass  # do something witty with it instead
Of course, this performs poorly on large data: for a text of length n and a pattern of length m, its worst-case time complexity is O(n*m). For large inputs or many patterns you'll need to employ faster algorithms, like KMP or Aho-Corasick.
You could simplify by using the split function limiting it to the first occurrence of ‘atg’ then doing your 3 letter loop:
dna='cgatgxggctatgaatcttccggtaatg'
z=dna.split('atg',1)
Output:
z
['cg', 'xggctatgaatcttccggtaatg']
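Putting the two ideas together, a minimal sketch of "find the start codon, then walk in steps of three until a stop codon" could look like this. The function name is hypothetical, and because the asker's protein lookup table is not shown, the sketch returns the codons themselves rather than translated amino acids. TAA, TAG and TGA are the stop codons of the standard genetic code.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # stop codons of the standard genetic code


def codons_from_start(dna, start_codon="ATG"):
    """Return the codons from the first start codon up to (not including) a stop codon."""
    start = dna.find(start_codon)  # index of the first match, or -1 if absent
    if start == -1:
        return []
    codons = []
    # Step through the sequence three letters at a time, ignoring a trailing partial codon.
    for i in range(start, len(dna) - 2, 3):
        codon = dna[i:i + 3]
        if codon in STOP_CODONS:
            break
        codons.append(codon)
    return codons


print(codons_from_start("CCATGGCTTAAGG"))
```

Each returned codon could then be looked up in a table such as the asker's protein dictionary.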

Function result varies on each run

I have the following function that generates the longest palindrome of a string by removing and re-ordering the characters:
from collections import Counter
def find_longest_palindrome(s):
    count = Counter(s)
    chars = list(set(s))
    beg, mid, end = '', '', ''
    for i in range(len(chars)):
        if count[chars[i]] % 2 != 0:
            mid = chars[i]
            count[chars[i - 1]] -= 1
        else:
            for j in range(0, int(count[chars[i]] / 2)):
                beg += chars[i]
    end = beg
    end = ''.join(list(reversed(end)))
    return beg + mid + end
out = find_longest_palindrome('aacggg')
print(out)
I got this function by 'translating' this example from C++
Whenever I run my function, I get one of the following outputs, seemingly at random:
a
aca
agcga
The correct one in this case is 'agcga' as this is the longest palindrome for the input string 'aacggg'.
Could anyone suggest why this is occurring and how I could get the function to reliably return the longest palindrome?
P.S. The C++ code does not have this issue.
Your code depends on the order of list(set(s)).
But sets are unordered.
In CPython 3.4-3.7, the specific order you happen to get for sets of strings depends on the hash values for strings, which are explicitly randomized at startup, so it makes sense that you’d get different results on each run.
The reason you don’t see this in C++ is that the C++ set class template is not an unordered set, but a sorted set (based on a binary search tree, instead of a hash table), so you always get the same order in every run.
You could get the same behavior in Python by calling sorted on the set instead of just copying it to a list in whatever order it has.
But the code still isn’t correct; it just happens to work for some examples because the sorted order happens to give you the characters in most-repeated order. But that’s obviously not true in general, so you need to rethink your logic.
The most obvious difference introduced in your translation is this:
count[ch--]--;
… or, since you're looping over the characters by index instead of directly, more like:
count[chars[i--]]--;
Either way, this decrements the count of the current character, and then decrements the current character so that the loop will re-check the same character the next time through. You've turned this into something completely different:
count[chars[i - 1]] -= 1
This just decrements the count of the previous character.
In a for-each loop, you can't just change the loop variable and have any effect on the looping. To exactly replicate the C++ behavior, you'd either need to switch to a while loop, or put a while True: loop inside the for loop to get the same "repeat the same character" effect.
And, of course, you have to decrement the count of the current character, not decrement the count of the previous character that you're never going to see again.
for i in range(len(chars)):
    while True:
        if count[chars[i]] % 2 != 0:
            mid = chars[i]
            count[chars[i]] -= 1
        else:
            for j in range(0, int(count[chars[i]] / 2)):
                beg += chars[i]
            break
Of course you could obviously simplify this—starting with just looping for ch in chars:, but if you think about the logic of how the two loops work together, you should be able to see how to remove a whole level of indentation here. But this seems to be the smallest change to your code.
Notice that if you do this change, without the sorted change, the answer is chosen randomly when the correct answer is ambiguous—e.g., your example will give agcga one time, then aggga the next time.
Adding the sorted will make that choice consistent, but no less arbitrary.
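For comparison, here is one way the simplified logic could look once the extra loop level is removed. This is a sketch rather than a translation of the C++ original: the name longest_palindrome is made up, and picking the first odd-count character in sorted order for the middle is an arbitrary choice that just makes the result repeatable.

```python
from collections import Counter


def longest_palindrome(s):
    """Build one longest palindrome from the characters of s (reordering allowed)."""
    counts = Counter(s)
    half = []
    middle = ''
    for ch in sorted(counts):  # sorted() makes the result the same on every run
        half.append(ch * (counts[ch] // 2))  # each pair contributes one char per side
        if counts[ch] % 2 and not middle:
            middle = ch  # a single leftover character can sit in the centre
    left = ''.join(half)
    return left + middle + left[::-1]


print(longest_palindrome('aacggg'))
```

With the input 'aacggg' this yields a five-character palindrome, and the choice of middle character is still arbitrary, just consistent.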

Reverse vowels in a String Python [closed]

I'm a beginner to your world. I have seen many attempts to answer this question using fancy builtin functions, and have found the answer on how to do this a thousand times, but NONE using the two for loops method.
I would like the code to reverse a vowel only when it finds one in a string, using the two for loops method.
I coded something that didn't seem to work and I don't quite understand why:
for x in newWord:
for y in vowels:
if x == y:
newWord[x] = newWord[7]
print newWord
with vowels being a list of vowels, with newWord also being a list.
This code is currently not working, like most others who have tried the two for loop method. Any help is appreciated. Thanks
Roughly the approach you want to use is to make two passes over the list of characters. In the first pass you find the index of each vowel, building a list of work to be done (locations of vowels to be swapped).
Then you prepare your work list by pairing off the first and last items until you have fewer than two items left in the work list. (An odd number of vowels means that the one in the middle doesn't have to be swapped with anything.)
Now you simply iterate over the work list, tuples/pairs of indexes. Swap the character at the first offset with the character at the other one for each pair. Done.
(This is assuming you want to transform the list in place. If not, then you can either just start with a copy, new_word = word[:], or iterate over enumerate(word) and conditionally either append the character at each point (if the offset isn't in your work list) or append the offset character (if this index matches one of those in your list). In the latter case you might make your work list a dictionary instead.)
Here's the code to demonstrate:
def rev_vowels(word):
    word = list(word)
    results = word[:]
    vowel_locations = [index for index, char in enumerate(word) if char in 'aeiou']
    work = zip(vowel_locations[:len(vowel_locations) // 2], reversed(vowel_locations))
    for left, right in work:
        results[left], results[right] = word[right], word[left]
    return results
This does use a list comprehension, the zip() and reversed() builtins, a complex slice for the first argument to zip(), and the Python tuple packing idiom for swapping variables. So you might have to replace those with more verbose constructs to fulfill your "no fancy builtins" constraint.
Fundamentally, however, a list comprehension is just syntactic sugar around a for loop. So, overall, this demonstrates the approach using two for loops over the data. While I'm returning a copy of the data as my results, the code would work without that. (That's why I'm using the tuple packing idiom on line seven, just before the return statement.)
If this question is being asked in an interview context I'm reasonably confident that this would be a reasonably good answer. You can easily break down how to implement zip, how to expand a list comprehension into a traditional for suite (block), and the swap line could be two separate assignments when returning a copy rather than performing an in-place transformation on the data.
These variable names are very verbose. But that's to make the intentions especially clear.
This code should solve your problem without using any "fancy builtin functions".
def f(word):
    vowels = "aeiou"
    string = list(word)
    i = 0
    j = len(word)-1
    while i < j:
        if string[i].lower() not in vowels:
            i += 1
        elif string[j].lower() not in vowels:
            j -= 1
        else:
            string[i], string[j] = string[j], string[i]
            i += 1
            j -= 1
    return "".join(string)

Generating expressions from permutations of variables and operators

So, I've decided that it's time to learn regular expressions. Thus, I set out to solve various problems, and after a bit of smooth sailing, I seem to have hit a wall and need help getting unstuck.
The task:
Given a list of characters and logical operators, find all possible combinations of these characters and operators that are not gibberish.
For example, given:
my_list = ['p', 'q', '&', '|']
the output would be:
answers = ['p', 'q', 'p&q', 'p|q'...]
However, strings like 'pq&' and 'p&|' are gibberish and therefore not allowed.
Naturally, as more elements are added to my_list, the more complicated the process becomes.
My current approach:
(I'd like to learn how to solve it with regex, but I am also curious if there exists a better way, too... but again, my focus is regex)
step 1:
find all permutations of the elements such that each permutation is 3 <= x <= len(my_list) long.
step 2:
Loop over the list, and if a regex match is found, pull that element out and put it in the answers list.
(I'm not married to this 2-step approach, it is just what seemed most logical to me)
My current code, minus the regex:
import re
from itertools import permutations
my_list = ['p', 'q', '~r', 'r', '|', '&']
foo = []
answers = []
count = 3
while count < 7:
    for i in permutations(my_list, count):
        i = ''.join(k for k in i)
        foo.append(i)
    count +=1
for i in foo:
    if re.match(r'insert_regex', i):
        answers.append(i)
    else:
        None
print(answers)
Now, I have tried a vast slew of different regex's to get this to work (too many to list them all here) but some of the main ones are:
A straightforward approach by finding all the cases that have two letters side by side, or two operators side by side, then instead of appending 'answers', I just removed them from 'foo'. This is the regex I tried:
r'(\w\w)[&\|]{2,}'
and did not even come close.
I then decided to try and find the strings that I wanted, as opposed to the ones I did not want.
First I tested:
r'^[~\w]'
to make sure I could get the strings whose first character were a letter or a negation. This worked. I was happy.
I then tried:
r'^[~\w][&\|]'
to try and get the next logical operator; however, it only picked up strings whose first character was a letter, and ignored all of the strings whose first character was a negation.
I then tried a conditional so that if the first character was a negation, the next character would be a letter, otherwise it would be an operator:
r'^(?(~)\w|[&\|])'
but this threw "error: bad character in group name".
I then tried to resolve this error by:
r'^(?:(~)\w|[&\|])'
But that returned only strings that started with '~' or an operator.
I then tried a slew of other things related to conditionals and groupings (2 days worth, actually), but I can't seem to find a solution. Part of the problem is that I don't know enough about regex to know where to go to find the solution, so I have kind of been wandering around the internet aimlessly.
I have run through a lot of tutorials and explanation pages, but they are all rather opaque and don't piece things together in a way that is conducive to understanding... they just sort of throw out code for you to copy and paste or mimic.
Any insights you have would be much appreciated, and as much as I would love an answer to the problem, if possible, an ELI5 explanation of what the solution does would be excellent for my own progress.
In a bitter twist of irony, it turns out that I had the solution written down (I documented all the regexes I tried), but it originally failed because I was removing strings from the original list while looping over it, instead of removing them from a copy.
If anyone is looking for a solution to the problem, the following code worked on all of my test cases (can't promise beyond that, however).
import re
from itertools import permutations
import copy
a = ['p', 'q', 'r', '~r', '|', '&']
foo = []
count = 3
while count < len(a)+1:
    for j in permutations(a, count):
        j = ''.join(k for k in j)
        foo.append(j)
    count +=1
foo_copy = copy.copy(foo)
for i in foo:
    if re.search(r'(^[&\|])|(\w\w)|(\w~)|([&\|][&\|])|([&\|]$)', i):
        foo_copy.remove(i)
    else:
        None
print(foo_copy)
You have a list of variables (characters), binary operators, and/or variables prefixed with a unary operator (like ~). The last case can be dealt with just like a variable.
As binary operators need a variable at either side, we can conclude that a valid expression is an alternation of variables and operators, starting and ending with a variable.
So, you could first divide the input list into two lists based on whether an item is a variable or an operator. Then you could increase the size of the output you will generate, and for each size, get the permutations of both lists and zip these in order to build a valid expression each time. This way you don't need a regular expression to verify the validity.
Here is the suggested function:
from itertools import permutations, zip_longest, chain
def expressions(my_list):
    answers = []
    variables = [x for x in my_list if x[-1].isalpha()]
    operators = [x for x in my_list if not x[-1].isalpha()]
    max_var_count = min(len(operators) + 1, len(variables))
    for var_count in range(1, max_var_count+1):
        for vars in permutations(variables, var_count):
            for ops in permutations(operators, var_count-1):
                answers.append(''.join(list(chain.from_iterable(zip_longest(vars, ops)))[:-1]))
    return answers
print(expressions(['p', 'q', '~r', 'r', '|', '&']))
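Since the question was specifically about regular expressions, the same "variables and operators alternate, starting and ending with a variable" rule can also be written as a single pattern. This is a sketch under the assumption that variables are single lowercase letters, optionally prefixed with ~, and that the only binary operators are & and |. The names VALID_EXPRESSION and is_valid_expression are made up for illustration.

```python
import re

# One variable (optionally negated), then zero or more "operator + variable" pairs.
VALID_EXPRESSION = re.compile(r'~?[a-z](?:[&|]~?[a-z])*')


def is_valid_expression(text):
    """True if text alternates variables and binary operators, ending on a variable."""
    # fullmatch requires the whole string to fit the pattern, not just a prefix.
    return VALID_EXPRESSION.fullmatch(text) is not None


for candidate in ['p', 'p&q', '~r|p&q', 'pq&', 'p&|', '&p']:
    print(candidate, is_valid_expression(candidate))
```

Read left to right, the pattern says: an optional ~, a letter, and then any number of repetitions of an operator followed by an optionally negated letter. Matching the strings you want in this way avoids having to enumerate every kind of gibberish.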
