Python Regex Findall Lookahead - python

I've created a function which searches a protein string for an open reading frame. Here it is:
import re
from Bio import SeqIO
from Bio.Blast import NCBIWWW, NCBIXML

def orf_finder(seq,format):
    record = SeqIO.read(seq,format) #Reads in the sequence and tells biopython what format it is.
    string = [] #creates an empty list
    for i in range(3):
        string.append(record.seq[i:]) #creates a list of three lists, each holding a different reading frame.
    protein_string = [] #creates an empty list
    protein_string.append([str(i.translate()) for i in string]) #translates each list in 'string' and combines them into one long list
    regex = re.compile('M''[A-Z]'+r'*') #compiles a regular expression pattern: methionine, followed by any amino acid and ending with a stop codon.
    res = max(regex.findall(str(protein_string)), key=len) #res is a string of the longest translated orf in the sequence.
    print "The longest ORF (translated) is:\n\n",res,"\n"
    print "The first blast result for this protein is:\n"
    blast_records = NCBIXML.parse(NCBIWWW.qblast("blastp", "nr", res)) #blasts the sequence and puts the results into a 'record object'.
    blast_record = blast_records.next()
    counter = 0 #the counter is a method for outputting the first blast record. After it is printed, the counter equals '1' and therefore the loop stops.
    for alignment in blast_record.alignments:
        for hsp in alignment.hsps:
            if counter < 1: #mechanism for stopping loop
                print 'Sequence:', alignment.title
                print 'Length:', alignment.length
                print 'E value:', hsp.expect
                print 'Query:',hsp.query[0:]
                print 'Match:',hsp.match[0:]
                counter = 1
The only issue is that I don't think my regex, re.compile('M''[A-Z]'+r'*'), finds overlapping matches. I know that a lookahead assertion, (?=...), might solve my problem, but I can't seem to implement it without getting an error.
Does anyone know how I can get it to work?
The code above uses Biopython to read in the DNA sequence, translate it, and then search for a protein reading frame: a sequence starting with 'M' and ending with '*'.

re.compile(r"M[A-Z]+\*")
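This assumes that the string you are looking for starts with 'M', is followed by one or more uppercase letters 'A-Z', and ends with '*'. The asterisk has to be escaped as \* because a bare * is a quantifier: the original pattern, 'M[A-Z]*', means 'M' followed by zero or more uppercase letters and never requires the stop symbol.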
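The question also asks about overlapping matches, which a plain pattern passed to findall does not report. A minimal sketch of the lookahead approach follows, assuming the translated frame is an ordinary string in which '*' marks a stop codon; the sample string and the variable names are made up for illustration. Wrapping the pattern in a capturing group inside (?=...) lets re.findall report a candidate at every 'M', even one that lies inside an earlier match.

```python
import re

# Hypothetical translated frame; '*' marks a stop codon.
protein = "AMKMLL*QQMAA*"

# The lookahead itself consumes no characters, so the scan moves on one
# position at a time and the capturing group reports the ORF that starts
# at each 'M', including an 'M' inside a previously reported ORF.
overlapping = re.compile(r"(?=(M[A-Z]*\*))")
orfs = overlapping.findall(protein)
longest = max(orfs, key=len)

print(orfs)
print(longest)
```

Because [A-Z] cannot match '*', each captured ORF stops at the first stop symbol after its starting 'M'.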

Related

Get N characters from string, respecting full words python

I am using this code to get the first 4000 characters of a long text.
text = data[0:4000]
print(text)
data is the variable containing the long text. The problem is that when I print text, it ends with half a word, for example "con" when the word should be "content".
I am wondering if there is a way to ensure the words aren't truncated.
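Find the first space after 4000 characters. Wrapping the result in max guards against find returning -1 when there is no later space (for example, text that ends a few characters past 4000); in that case the text is simply cut at 4000 characters, which can still split the final word.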
ix = max(data.find(' ', 4000), 4000)
text = data[:ix]
A simple find call that looks for a space starting at character 4000 gets this started:
x = txt.find(' ',4000)
But to avoid truncating the last word, you need to test the result of find.
If the starting point of 4000 is within the last word, find returns -1 and you print/return the entire text.
If the starting point is before the last word, find returns the index of the next space and you print up to that index:
x = txt.find(' ',4000)
if x < 0:
    print (txt)
else:
    print (txt[:x])
Also remember that the starting point on find is zero based so if the 4000th character is a space it will find the next space. As a simple example, the following code will return "four five" rather than simply "four". If this is not the desired result then consider using 3999 in your find.
txt = "four five six"
x = txt.find(' ',5)
print(txt[:x])
# returns "four five"
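The checks described above can be folded into one small helper. This is only a sketch: the name truncate_at_word and its limit parameter are made up here, and it assumes words are separated by ordinary spaces rather than tabs or newlines.

```python
def truncate_at_word(text, limit):
    """Cut text at the first space found at or after index `limit`.

    If there is no such space, `limit` falls inside the last word
    (or beyond the end), so the whole text is returned unchanged.
    """
    cut = text.find(' ', limit)
    return text if cut == -1 else text[:cut]


sample = "four five six"
print(truncate_at_word(sample, 4))
print(truncate_at_word(sample, 5))
print(truncate_at_word(sample, 11))
```

With limit 4 the space at index 4 is found straight away; with limit 5 the search starts just past it and runs on to the next space, which is the zero-based effect described above.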

Using nested for loop and if statement to replace character with integer

I need to output any repeated character to refer to the previous character.
For example: a(-1)rdv(-4)(-4)k or hel(-1)o
This is my code so far:
text= 'aardvark'
i=0
j=0
for i in range(len(text)-1):
    for j in range(i+1, len(text)):
        if text[j]==text[i]:
            sub= text[j]
            val2=text.find(sub, i+1, len(text))
            p=val2+1
            val=str(i-j)
            text= text[:val2] + val + text[p:]
            break
print(text)
Output: a-1rdva-4k
The second 'a' is not recognised. And I'm not sure how to include brackets in my print.
By updating the text in-place each time you find a back-reference, you muck up your indices (your text gets longer each time) and you never process the last characters properly. You stop checking when you find the first repeat of the 'current' character, so the 3rd a is never processed. This applies to every 3rd repeat in an input string. In addition, if your input text contains any - characters or digits they'll end up being tested against the -offset references you inserted before them too!
For your specific example of aardvark, a string with 8 characters, what happens is this:
You find the second a and set text to a-1rdvark. The text is now 9 characters long, so the last r will never be checked (you loop to i = 6 at most); this would be a problem if your test string ended in a double letter. You break out of the loop, so the j for loop never comes to the 3rd a, and the second a can't be tested for anymore as it has already been replaced.
Your code finds - (not repeated), 1 (not repeated) and then r (repeated once), so now you replace text with a-1rdva-4k. Now you have a string of 10 characters, so -, and 4 will never be tested. Not a big problem anymore, but what if there was a repeat in just the last 3 positions of the string?
Build a new object for the output (adding both letters you haven't seen before and backreferences). That way you won't cause the text you are looping over to grow, and you will continue to find repeats; for the parentheses you could use more string concatenation. You'll need to scan the part of the string before i, not after, for this to work, and go backwards! Testing i - 1, i - 2, etc, down to 0. Naturally, this means your i loop should then range up to the full length:
output = ''
for i in range(len(text)):
    current = text[i]
    for j in range(i - 1, -1, -1):
        if text[j] == current:
            current = '(' + str(j - i) + ')'
            break
    output = output + current
print(output)
I kept the fix to a minimum here, but ideally I'd also make some more changes:
Add all processed characters and references to a new list instead of a string, then use str.join() to join that list into the output afterwards. This is far more efficient than rebuilding the string each iteration.
Using two loops means you check every character in the string again while looping over the text, so the number of steps the algorithm takes grows quadratically with the length of the input. In Computer Science we talk about the time complexity of algorithms, and yours is an O(N^2) (N squared) quadratic algorithm. A text with 1000 letters would take up to 1 million steps to process! Rather than loop a quadratic number of times, you can use a dictionary to track indices of letters you have seen. If the current character is in the dictionary you can then trivially calculate the offset. Dictionary lookups take constant time (O(1)) on average, making the whole algorithm take linear time (O(N)), meaning that the time the process takes is directly proportional to the length of the input string.
Use enumerate() to add a counter to the loop so you can just loop over the characters directly, no need to use range().
You can use string formatting to build a "(<offset>)" string; Python 3.6 and newer have formatted string literals, where f'...' strings take {} placeholders that are just expressions. f'({some - calculation + or * other})' will execute the expression and put the result in a string that has ( and ) characters in it too. For earlier Python versions, you can use the [str.format() method](https://docs.python.org/3/library/stdtypes.html#str.format) to get the same result; the syntax then becomes '({})'.format(some - calculation + or * other).
Put together, that becomes:
def add_backrefs(text):
    output = []
    seen = {}
    for i, character in enumerate(text):
        if character in seen:
            # add a back-reference, we have seen this already
            output.append(f'({seen[character] - i})')
        else:
            # add the literal character instead
            output.append(character)
        # record the position of this character for later reference
        seen[character] = i
    return ''.join(output)
Demo:
>>> add_backrefs('aardvark')
'a(-1)rdv(-4)(-4)k'
>>> add_backrefs('hello')
'hel(-1)o'
text= 'aardvark'
d={} # create a dictionary to keep track of index of element last seen at
new_text='' # new text to be generated
for i in range(len(text)): # iterate over text from index 0 up to the length of text
    c = text[i] # store the character in a temporary variable, as it is used frequently
    if c not in d: # check whether this character has been seen before
        d[c] = i # first time we see this character: just record its index in the dictionary
        new_text += c # concatenate the character to the result text
    else: # this character has already been seen
        new_text += '({0})'.format(d[c]-i) # string formatting replaces {0} with the offset from the current index back to the last occurrence
        d[c] = i # update the last-seen index of this character
print(new_text)
Output:
a(-1)rdv(-4)(-4)k

Most Frequent Character - User Submitted String without Dictionaries or Counters

Currently, I am in the midst of writing a program that calculates all of the non white space characters in a user submitted string and then returns the most frequently used character. I cannot use collections, a counter, or the dictionary. Here is what I want to do:
Split the string so that white space is removed. Then count each character and return a value. I would have something to post here but everything I have attempted thus far has been met with critical failure. The closest I came was this program here:
strin=input('Enter a string: ')
fc=[]
nfc=0
for ch in strin:
    i=0
    j=0
    while i<len(strin):
        if ch.lower()==strin[i].lower():
            j+=1
        i+=1
    if j>nfc and ch!=' ':
        nfc=j
        fc=ch
print('The most frequent character in string is: ', fc )
If you can fix this code or tell me a better way of doing it that meets the required criteria that would be helpful. And, before you say this has been done a hundred times on this forum please note I created an account specifically to ask this question. Yes there are a ton of questions like this but some that are reading from a text file or an existing string within the program. And an overwhelmingly large amount of these contain either a dictionary, counter, or collection which I cannot presently use in this chapter.
Just do it "the old way". Create a list (okay it's a collection, but a very basic one so shouldn't be a problem) of 26 zeroes and increase according to position. Compute max index at the same time.
strin="lazy cat dog whatever"
l=[0]*26
maxindex=-1
maxvalue=0
for c in strin.lower():
    pos = ord(c)-ord('a')
    if 0<=pos<=25:
        l[pos]+=1
        if l[pos]>maxvalue:
            maxindex=pos
            maxvalue = l[pos]
print("max count {} for letter {}".format(maxvalue,chr(maxindex+ord('a'))))
result:
max count 3 for letter a
As an alternative to Jean's solution (not using a list that allows for one-pass over the string), you could just use str.count here which does pretty much what you're trying to do:
strin = input("Enter a string: ").strip()
maxcount = float('-inf')
maxchar = ''
for char in strin:
    c = strin.count(char) if not char.isspace() else 0
    if c > maxcount:
        maxcount = c
        maxchar = char
print("Char {}, Count {}".format(maxchar, maxcount))
If lists are available, I'd use Jean's solution. He doesn't use a O(N) function N times :-)
P.s: you could compact this with one line if you use max:
max(((strin.count(i), i) for i in strin if not i.isspace()))
To keep track of several counts for different characters, you have to use a collection (even if it is a global namespace implemented as a dictionary in Python).
To print the most frequent non-space character while supporting arbitrary Unicode strings:
import sys
text = input("Enter a string (case is ignored)").casefold() # default caseless matching
# count non-space character frequencies
counter = [0] * (sys.maxunicode + 1)
for nonspace in map(ord, ''.join(text.split())):
    counter[nonspace] += 1
# find the most common character
print(chr(max(range(len(counter)), key=counter.__getitem__)))
A similar list in Cython was the fastest way to find frequency of each character.

Python Challenge #3: Loop stops way too early

I'm working on PythonChallenge #3. I've got a huge block of text that I have to sort through. I am trying to find a sequence in which the first and last three letters are caps, and the middle one is lowercase.
My function loops through the text. The variable block stores the seven letters that are currently being looped through. There's a variable, toPrint, which gets turned on and off based on whether the letters in block correspond to my pattern (AAAaAAA). Based on the last block printed according to my function, my loop stops early in my text. I have no idea why this is happening and if you could help me figure this out, that would be great.
text = """kAewtloYgcFQaJNhHVGxXDiQmzjfcpYbzxlWrVcqsmUbCunkfxZWDZjUZMiGqhRRiUvGmYmvnJ"""
words = []
for i in text:
    toPrint = True
    block = text[text.index(i):text.index(i)+7]
    for b in block[:3]:
        if b.isupper() == False:
            toPrint = False
    for b in block[3]:
        if b.islower() == False:
            toPrint = False
    for b in block[4:]:
        if b.isupper() == False:
            toPrint = False
    if toPrint == True and block not in words:
        words.append(block)
        print (block)
print (words)
With Regex:
This is a really good time to use regex, it's super fast, more clear, and doesn't require a bunch of nested if statements.
import re
text = """kAewtloYgcFQaJNhHVGxXDiQmzjfcpYbzxlWrVcqsmUbCunkfxZWDZjUZMiGqhRRiUvGmYmvnJ"""
print(re.search(r"[A-Z]{3}[a-z][A-Z]{3}", text).group(0))
Explanation of regex:
[A-Z]{3} ---> matches any 3 uppercase letters
[a-z] -------> matches a single lowercase letter
[A-Z]{3} ---> matches 3 more uppercase letters
Without Regex:
If you really don't want to use regex this is how you could do it:
text = """kAewtloYgcFQaJNhHVGxXDiQmzjfcpYbzxlWrVcqsmUbCunkfxZWDZjUZMiGqhRRiUvGmYmvnJ"""
for i, _ in enumerate(text[:-6]): #loop through the index of each char (not including the last 6)
    sevenCharacters = text[i:i+7] #create a chunk of seven characters
    shouldBeCapital = sevenCharacters[0:3] + sevenCharacters[4:7] #combine all the chars that should be capital into one string
    if all(char.isupper() for char in shouldBeCapital): #make sure all those characters are indeed capital
        if sevenCharacters[3].islower(): #make sure the middle character is lowercase
            print(sevenCharacters)
I think your first problem is that you are using str.index(). Like find(), the .index() method of a string returns the index of the first match that is found.
Thus, in your example, whenever you search for 'x' you will get the index of the first 'x' found, etc. You can only ever work with the first occurrence of each character: every later occurrence of a repeated character is mapped back to that first position, so the block that actually starts there is never examined.
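A short illustration of the difference, using a made-up five-character string rather than the challenge text:

```python
text = "abcab"

# str.index always reports the FIRST position of a character,
# so the repeated 'a' and 'b' map back to indices 0 and 1.
via_index = [text.index(ch) for ch in text]

# enumerate yields the real position of every character.
via_enumerate = [i for i, _ in enumerate(text)]

print(via_index)      # [0, 1, 2, 0, 1]
print(via_enumerate)  # [0, 1, 2, 3, 4]
```

In a long block of letters almost every character is a repeat, which is why the original loop appears to stop early: it keeps revisiting blocks near the start of the text.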
In order to keep the same structure (which isn't necessary- there is an answer posted using enumerate that I prefer myself) I implemented a queuing approach with your block variable. Each iteration, a character is dropped from the front of block, while the new character is appended to the end.
I also cleaned up some of your needless comparisons with False. You will find that this is not only inefficient, it is frequently wrong, because many of the "boolean" activities you perform will not be on actual boolean values. Get out of the habit of spelling out True/False. Just use if c or if not c.
Here's the result:
text = """kAewtloYgcFQaJNhHVGxXDiQmzjfcpYbzxlWrVcqsmUbCunkfxZWDZjUZMiGqhRRiUvGmYmvnJ"""
words = []
block = '.' + text[0:6]
for i in text[6:]:
    block = block[1:] + i # Drop 1st char, append 'i'
    toPrint = True
    for b in block[:3]:
        if not b.isupper():
            toPrint = False
    if not block[3].islower():
        toPrint = False
    for b in block[4:]:
        if not b.isupper():
            toPrint = False
    if toPrint and block not in words:
        words.append(block)
print (words)
If I understood your question correctly, I don't think a loop is needed. This simple code finds the required sequence.
# Use this code
text = """kAewtloYgcFQaJNhHVGxXDiQmzjfcpYbzxlWrVcqsmUbCunkfxZWDZjUZMiGqhRRiUvGmYmvnJ"""
import re
print(re.findall("[A-Z]{3}[a-z][A-Z]{3}", text))

python: dictionary of words and wordforms

I have the following problem: I created a (German) dictionary mapping words to their corresponding lemmas. Example:
"Lagerbestände", "Lager-bestand"; "Wohnhäuser", "Wohn-haus"; "Bahnhof", "Bahn-hof"
I now have a text, and I want to look up the lemma of every word in it. A word may appear that is not in the dict, such as "Restbestände", even though we already know the lemma of its second part, "bestände". In that case I want to take the first part of the word, which is unknown in dicti, join it to the lemmatized second part, and print the result (or return it).
Example: "Restbestände" --> "Rest-bestand". ("bestand" is taken from the lemma of "Lagerbestände")
I coded the following:
for limit in range(1, len(Word)):
    for k, v in dicti.iteritems():
        if re.search('[\w]*'+Word[limit:], k, re.IGNORECASE) != None:
            if '-' in v:
                tmp = v.find('-')
                end = v[tmp:]
                end = re.sub(ur'[-]',"", end)
                Word = Word[:limit] + '-' + end
But I got 2 problems:
At the end of the words, it is printed out every time "&#10". How can I avoid this?
The second part of the word is sometimes not correct - there must be a logical error.
However; how would you solve this?
At the end of the words, it is printed out every time "&#10". How can
I avoid this?
You must use Unicode everywhere in your script. Everywhere, everywhere, everywhere.
Also, Python's regex functions accept the re.UNICODE flag, which you should always set. German letters such as ä, ö, ü and ß are outside the ASCII set, so without the flag a regex can give surprising results, for instance when matching r'\w'.
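To see what the flag changes, here is a small sketch. It assumes Python 3, where str patterns already match in Unicode mode by default and re.ASCII switches to the ASCII-only behaviour. In Python 2, which the question's code uses, the situation is reversed: you need unicode strings plus re.UNICODE to get the first result.

```python
import re

word = "Lagerbestände"

# Unicode-aware matching: 'ä' counts as a word character.
unicode_tokens = re.findall(r"\w+", word)

# ASCII-only matching: 'ä' is not a word character, so the word is split.
ascii_tokens = re.findall(r"\w+", word, re.ASCII)

print(unicode_tokens)
print(ascii_tokens)
```

A word that gets split this way can never be matched against the dictionary keys, which is one way the second part of a lemma can come out wrong.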
