How do I separate a specific sequence from a string multiple times?

I have a string named protein. It prints something like this: KALSKJKDALIEUTSTARTALKSJDALK*KAJSLDKJSTARTJAISOIEWORUNCD*
I want code that searches this string for START and * and prints the characters between them, in this case letters.
For example: protein = STARTJSADHFJAS*KJSTARTAKSLJDIOQWIE*
print protein_filtered = [JSADHFJAS, AKSLJDIOQWIE]
So far I have this, but it only gets me the first protein sequence. Also, I don't care whether the result is appended to a list or is a string.
start_marker = 'START'
end_marker = '*'
start = protein.index(start_marker) + len(start_marker)
end = protein.index(end_marker, start + 1)
print(protein[start:end])

START(.*?)\*
You can do this with the re module. See the demo:
https://regex101.com/r/hE4jH0/41
import re
p = re.compile(r'START(.*?)\*')
test_str = "STARTJSADHFJAS*KJSTARTAKSLJDIOQWIE*"
print(p.findall(test_str))
# => ['JSADHFJAS', 'AKSLJDIOQWIE']
The regex is made non-greedy by appending ? after .* so that each match stops at the first occurrence of *. A greedy .* would run up to the last occurrence of *.
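To make the greedy versus non-greedy difference concrete, here is a small sketch that runs both variants on the sample string from the question. The variable names lazy and greedy are only for illustration.

```python
import re

test_str = "STARTJSADHFJAS*KJSTARTAKSLJDIOQWIE*"

# Non-greedy: each match stops at the first * after START.
lazy = re.findall(r'START(.*?)\*', test_str)

# Greedy: .* runs on to the last * in the string, so only one match is found.
greedy = re.findall(r'START(.*)\*', test_str)

print(lazy)    # ['JSADHFJAS', 'AKSLJDIOQWIE']
print(greedy)  # ['JSADHFJAS*KJSTARTAKSLJDIOQWIE']
```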

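One solution can be to split on START, skip the text before the first START, and keep what precedes the first * in each remaining piece:
final_list = [chunk.split('*')[0] for chunk in protein.split('START')[1:]]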
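The index-based attempt from the question can also be extended to find every sequence by looping. This is a minimal sketch, assuming that a START without a following * should simply end the search; the function name extract_between is hypothetical.

```python
def extract_between(text, start_marker="START", end_marker="*"):
    """Collect every substring found between start_marker and the next end_marker."""
    results = []
    pos = 0
    while True:
        start = text.find(start_marker, pos)
        if start == -1:
            break  # no more start markers
        start += len(start_marker)
        end = text.find(end_marker, start)
        if end == -1:
            break  # unterminated sequence: stop (an assumption, not from the question)
        results.append(text[start:end])
        pos = end + len(end_marker)
    return results


protein = "KALSKJKDALIEUTSTARTALKSJDALK*KAJSLDKJSTARTJAISOIEWORUNCD*"
print(extract_between(protein))  # ['ALKSJDALK', 'JAISOIEWORUNCD']
```

Using str.find instead of str.index avoids a ValueError when a marker is missing.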

Related

Split String in till first encounter of number and ":"

I have a string "person:x:1319:nobody,jram,dapp,test1,app1,lasp\r\n" (for example) and need to split the string and get output only as
"nobody,jram,dapp,test1,app1,lasp\r\n"
How can I do that?

You can use str.rsplit(), which splits the string on the delimiter starting from the right side. rsplit() returns the result as a list, so you can then access the values by index.
s = "person:x:1319:nobody,jram,dapp,test1,app1,lasp\r\n"
res = s.rsplit(':', 1)[-1]
print(res)

This solution uses a regex to find the first occurrence of a digit followed by a colon and returns everything after it on that line as the match.
import re
s1 = "person:x:1319:nobody,jram,dapp,test1,app1,lasp\r\n"
m = re.search(r'(?<=\d:).*', s1)
match1 = m.group(0)
print(match1)
Output: nobody,jram,dapp,test1,app1,lasp
Note that this solution will still work (according to what was requested in the title) even if you have another colon in the text which is not preceded by a number.
s2 = "person:x:1319:test:nobody,jram,dapp,test1,app1,lasp\r\n"
m = re.search(r'(?<=\d:).*', s2)
match2 = m.group(0)
print(match2)
Output: test:nobody,jram,dapp,test1,app1,lasp
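The two answers behave differently when the tail itself contains a colon. The sketch below compares them on a shortened version of s2 (the shortening is only for readability) and uses repr() so the line-ending characters are visible.

```python
import re

s2 = "person:x:1319:test:nobody,jram\r\n"

# rsplit cuts at the LAST colon, wherever it is.
by_rsplit = s2.rsplit(':', 1)[-1]

# The lookbehind cuts after the FIRST "digit followed by a colon".
# Note that . does not match \n, so the trailing newline is not included.
by_regex = re.search(r'(?<=\d:).*', s2).group(0)

print(repr(by_rsplit))  # 'nobody,jram\r\n'
print(repr(by_regex))   # 'test:nobody,jram\r'
```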

How to find the substring if the substring has random characters replaced?

Let's say we have a string in Python:
original_string = "TwasTheNightBeforeChristmasWhenAllThroughTheHouse"
And we are interested in finding the beginning coordinate of the substring "ChristmasWhen". This is very straightforward in Python, i.e.
>>> substring ="ChristmasWhen"
>>> original_string.find(substring)
18
and this checks out
>>> "TwasTheNightBeforeChristmasWhenAllThroughTheHouse"[18]
'C'
If we look for a substring that doesn't exist, find() returns -1.
Here is my problem:
I have a substring which is guaranteed to be from the original string. However, characters in this substring have been randomly replaced with another character.
How could I algorithmically find the beginning coordinate of the substring (or at least, check if it's possible) if the substring has random characters '-' replacing certain letters?
Here's a concrete example:
original_string = "TwasTheNightBeforeChristmasWhenAllThroughTheHouse"
substring = '-hri-t-asW-en'
Naturally, original_string.find('-hri-t-asW-en') returns -1. But it would be possible to find that hri begins at 19, and therefore, given the prefix -, the substring '-hri-t-asW-en' must begin at 18.

This is typically what regular expressions are for: finding patterns. You can try:
import re # use regexp
original_string = "TwasTheNightBeforeChristmasWhenAllThroughTheHouse"
r = re.compile(".hri.t.asW.en") # constructs the search machinery
res = r.search(original_string) # search
print(res.group(0)) # get results
result will be:
ChristmasWhen
Now, if your input (the search string) must use '-' as a wildcard, you can translate it to obtain the right regular expression:
import re
original_string = "TwasTheNightBeforeChristmasWhenAllThroughTheHouse"
s = "-hri-t-asW-en" # supposedly input by the user
s = s.replace('-','.') # translate to regexp syntax
r = re.compile(s)
res = r.search(original_string)
print(res.group(0))
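A plain replace of '-' with '.' works for letters, but characters that are special in regex syntax (such as +, ( or ?) in the search string would be misread. The following is a sketch that escapes everything except the wildcard and returns the start index; the helper name wildcard_to_regex and the choice to return -1 on failure (mirroring str.find) are assumptions made for illustration.

```python
import re


def wildcard_to_regex(masked, wildcard="-"):
    """Turn each wildcard into '.', and escape every other character literally."""
    return "".join("." if ch == wildcard else re.escape(ch) for ch in masked)


original_string = "TwasTheNightBeforeChristmasWhenAllThroughTheHouse"
pattern = re.compile(wildcard_to_regex("-hri-t-asW-en"))

match = pattern.search(original_string)
start = match.start() if match else -1  # -1 mirrors str.find when nothing matches
print(start)  # 18
```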

Perhaps use a regular expression? For instance, you can use the . (dot character) to match any character (other than a newline, by default). So if you modify your substring to use dots instead of dashes for the erased letters in the string, you can use re.search to locate those patterns:
import re
text = 'TwasTheNightBeforeChristmasWhenAllThroughTheHouse'
re.search('.hri.t.asW.en', text)

You can use regular expressions to find both the match and its position:
import re
p = re.compile(".hri.t.asW.en")
for m in p.finditer('TwasTheNightBeforeChristmasWhenAllThroughTheHouse'):
    print(m.start(), m.group())
Output: 18 ChristmasWhen

A non-regex approach, less efficient than the regex answers above, but still a possibility:
o = "TwasTheNightBeforeChristmasWhenAllThroughTheHouse"
s = '-hri-t-asW-en'
r = next(i for i in range(len(o) - len(s) + 1) if all(a == b or b == '-' for a, b in zip(o[i:i+len(s)], s)))
print(r)
Output
18
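The same sliding-window idea can be written as a small function that returns -1 when nothing matches instead of raising StopIteration. This is only a sketch; the name find_masked is hypothetical. The range must include the last possible window, hence the + 1.

```python
def find_masked(text, masked, wildcard="-"):
    """Return the first index where masked fits text, treating wildcard as 'any character'."""
    width = len(masked)
    for i in range(len(text) - width + 1):  # + 1 so the final window is checked too
        window = text[i:i + width]
        if all(m == wildcard or m == c for c, m in zip(window, masked)):
            return i
    return -1  # same convention as str.find


o = "TwasTheNightBeforeChristmasWhenAllThroughTheHouse"
print(find_masked(o, "-hri-t-asW-en"))  # 18
print(find_masked(o, "-ouse"))          # 44, a match at the very end of the string
print(find_masked(o, "zzz"))            # -1
```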

Regex to remove specific words in python

I want to do some manipulation using regex in Python.
So input is +1223,+12_remove_me,+222,+2223_remove_me
and
output should be +1223,+222
The output should only contain comma-separated words which don't contain _remove_me, with exactly one comma between words.
Note: the regexes I tried were \+([0-9|+]*)_ and \+([0-9|+]*), plus some other combinations, none of which gave the required output.
Note 2: I can't use a loop; I need to do this with regex only.

Your regex seems incomplete, but you were on the right track. Note that a pipe symbol inside a character class is treated as a literal, so your [0-9|+] matches a digit, a | or a + symbol.
You may use
,?\+\d+_[^,]+
See the regex demo
Explanation:
,? - optional , (if the "word" is at the beginning of the string, it should be optional)
\+ - a literal +
\d+ - 1+ digits
_ - a literal underscore
[^,]+ - 1+ characters other than ,
Python demo:
import re
p = re.compile(r',?\+\d+_[^,]+')
test_str = "+1223,+12_remove_me,+222,+2223_remove_me"
result = p.sub("", test_str)
print(result)
# => +1223,+222
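One edge case is worth checking: when the item to remove comes first in the string, the optional leading comma has nothing to match, so the comma after the removed item survives. Below is a minimal sketch, assuming stray leading or trailing commas can simply be stripped afterwards; the helper name remove_tagged is hypothetical.

```python
import re

TAGGED = re.compile(r',?\+\d+_[^,]+')


def remove_tagged(text):
    """Drop every +digits_suffix item, then strip any comma left at either end."""
    return TAGGED.sub("", text).strip(",")


print(remove_tagged("+1223,+12_remove_me,+222,+2223_remove_me"))  # +1223,+222
print(remove_tagged("+12_remove_me,+1223,+222"))                  # +1223,+222
```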

A non-regex approach would involve using str.split() and excluding items ending with _remove_me:
>>> s = "+1223,+12_remove_me,+222,+2223_remove_me"
>>> items = [item for item in s.split(",") if not item.endswith("_remove_me")]
>>> items
['+1223', '+222']
Or, if _remove_me can be present anywhere inside each item, use not in:
>>> items = [item for item in s.split(",") if "_remove_me" not in item]
>>> items
['+1223', '+222']
You can then use str.join() to join the items into a string again:
>>> ",".join(items)
'+1223,+222'

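In your case you need a regex with negation. Note that a negated character class such as
[^(_remove_me)]
excludes single characters (here (, ), _, r, e, m, o and v), not the word _remove_me as a whole.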

You could perform this without a regex, just using string manipulation. The following can be written as a one-liner, but has been expanded for readability.
my_string = '+1223,+12_remove_me,+222,+2223_remove_me' #define string
my_list = my_string.split(',') #create a list of words
my_list = [word for word in my_list if '_remove_me' not in word] #stop here if you want a list of words
output_string = ','.join(my_list)

Python regex, how to search for multiple strings?

I'm new to Python and am trying to figure out a Python regex to find any strings that consist of a prefix followed by - and digits. For example, 'type1-001' and 'type2-001' should be a match, but 'type3-asdf001' shouldn't be a match. I would like to be able to match with a regex like [type1|type2|type3]-\d+ to find any strings that start with type1, type2, or type3 and then are appended with '-' and digits. Also, it would be cool to know how to search for any upper case text appended with '-' and digits.
Here's what I think should work, but I can't seem to get it right...
pref_num = re.compile(r'[type1|type2]-\d+')

[] matches any single character from the set appearing between the brackets. To group alternatives you need to use (). So, I think your regex should be something like:
pref_num = re.compile(r'(type1|type2)-\d+')
As to how to search any uppercase text appended with - and digits, I would suggest:
[A-Z]+-\d+
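The difference between a character class and a group is easy to see by running both. This sketch uses a non-capturing group (?:...) and fullmatch so that the whole string must fit the pattern; the sample strings other than those in the question are made up for illustration.

```python
import re

char_class = re.compile(r'[type1|type2]-\d+')   # ONE character from the set, then -digits
group = re.compile(r'(?:type1|type2)-\d+')      # the whole word type1 or type2, then -digits
upper = re.compile(r'[A-Z]+-\d+')               # upper-case text, then -digits

# The character class only consumes the single character before the dash.
print(char_class.search("type1-001").group())   # 1-001

for text in ["type1-001", "type2-001", "type3-asdf001", "ABC-42"]:
    print(text, bool(group.fullmatch(text)), bool(upper.fullmatch(text)))
```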

If you only want the digit after "type" to be variable then you should put only those digits in the square brackets (without a pipe, which would be matched literally), like so:
re.compile(r'type[12]-\d+')

You can use the pattern
'type[1-3]-[0-9]{3}'
Demo
>>> import re
>>> p = 'type[1-3]-[0-9]{3}'
>>> s = 'type2-005 with some text type1-101 and then type1-asdf001'
>>> re.findall(p, s)
['type2-005', 'type1-101']

import re
pref_num = re.compile(r'(type1|type2|type3)-\d+')
for text in ('type1-000', 'type2-000', 'type3-abc000'):
    m = pref_num.search(text)
    if m is not None:
        print(m.group())  # prints type1-000 and type2-000; the third string does not match

find string with a pattern at the end regex python

I want to check if a string ends with a "_INT".
Here is my code
import re
nOther = "c1_1"
tail = re.compile(r'_\d*$')
if tail.search(nOther):
    nOther = nOther.replace("_", "0")
print(nOther)
output:
c101
c102
c103
c104
But there may be two underscores in the string, and I am only interested in the last one.
How can I edit my code to handle this?

Two separate steps (checking whether the pattern matches, then making the replacement) are unnecessary, because re.sub does both in one step:
txt = re.sub(r'_(?=\d+$)', '0', txt)
The pattern uses a lookahead (?=...) (i.e. "followed by"), which is only a check: its content is not part of the match result. In other words, \d+$ is not replaced.
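A short sketch of the lookahead substitution on a few inputs, including one with two underscores. The function name zero_last_underscore and the sample strings other than "c1_1" are made up for illustration.

```python
import re

TRAILING_UNDERSCORE = re.compile(r'_(?=\d+$)')


def zero_last_underscore(name):
    """Replace an underscore with 0 only when digits alone follow it up to the end."""
    return TRAILING_UNDERSCORE.sub("0", name)


for name in ["c1_1", "c1_2_3", "c1_x", "plain"]:
    print(name, "->", zero_last_underscore(name))
# c1_1 -> c101
# c1_2_3 -> c1_203
# c1_x -> c1_x
# plain -> plain
```

Strings that do not end in an underscore plus digits come back unchanged, so no separate check is needed.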

One way to do it would be to capture everything before and after the last underscore and rebuild the string.
import re
nOther = "c1_1"
tail = re.compile(r'(.*)_(\d*$)')
m = tail.search(nOther)
if m:
    nOther = m.group(1) + '0' + m.group(2)
print(nOther)
