I have a string "person:x:1319:nobody,jram,dapp,test1,app1,lasp\r\n" (for example) and need to split the string and get output only as
"nobody,jram,dapp,test1,app1,lasp\r\n"
How can I do that?
You can use str.rsplit(), which splits the string on the delimiter starting from the right. rsplit() returns a list, so you can then access the value you need by index.
s = "person:x:1319:nobody,jram,dapp,test1,app1,lasp\r\n"
res = s.rsplit(':', 1)[-1]
print(res)
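If the member list itself could ever contain a colon, splitting from the left with a fixed field count may be safer. The following is a minimal sketch, assuming the line always has exactly three colon-separated fields before the list (an assumption based only on the example string):

```python
s = "person:x:1319:nobody,jram,dapp,test1,app1,lasp\r\n"

# Split from the right, at most once: keeps everything after the LAST colon.
from_right = s.rsplit(':', 1)[-1]

# Split from the left, at most three times: keeps everything after the THIRD
# colon, even if that remainder contains further colons.
from_left = s.split(':', 3)[-1]

# repr() makes the trailing \r\n visible; both approaches keep it.
print(repr(from_right))
print(repr(from_left))
```

For the example string both variants give the same result; they differ only when extra colons appear in the last field.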
This solution uses a regex lookbehind to find the first occurrence of a digit followed by a colon, and returns everything after it as the match.
import re
s1 = "person:x:1319:nobody,jram,dapp,test1,app1,lasp\r\n"
m = re.search(r'(?<=\d:).*', s1)
match1 = m.group(0)
print(match1)
Output: nobody,jram,dapp,test1,app1,lasp
Note that this solution will still work (according to what was requested in the title) even if you have another colon in the text which is not preceded by a number.
s2 = "person:x:1319:test:nobody,jram,dapp,test1,app1,lasp\r\n"
m = re.search(r'(?<=\d:).*', s2)
match2 = m.group(0)
print(match2)
Output: test:nobody,jram,dapp,test1,app1,lasp
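One detail worth checking: by default the dot does not match a newline character, so the match above stops before the final \n (while still including the \r). If the trailing \r\n must be kept, as in the question, the re.DOTALL flag is one option. A small sketch to compare the two:

```python
import re

s1 = "person:x:1319:nobody,jram,dapp,test1,app1,lasp\r\n"

# Default behaviour: '.' matches any character except '\n'.
default_match = re.search(r'(?<=\d:).*', s1).group(0)

# With re.DOTALL, '.' also matches '\n', so the full line ending is kept.
dotall_match = re.search(r'(?<=\d:).*', s1, re.DOTALL).group(0)

print(repr(default_match))
print(repr(dotall_match))
```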
Let's say we have a string in Python:
original_string = "TwasTheNightBeforeChristmasWhenAllThroughTheHouse"
And we are interested in finding the starting index of the substring "ChristmasWhen". This is very straightforward in Python:
>>> substring ="ChristmasWhen"
>>> original_string.find(substring)
18
and this checks out
>>> "TwasTheNightBeforeChristmasWhenAllThroughTheHouse"[18]
'C'
If we look for a string which doesn't exist, find() returns -1.
Here is my problem:
I have a substring which is guaranteed to be from the original string. However, characters in this substring have been randomly replaced with another character.
How could I algorithmically find the beginning coordinate of the substring (or at least, check if it's possible) if the substring has random characters '-' replacing certain letters?
Here's a concrete example:
original_string = "TwasTheNightBeforeChristmasWhenAllThroughTheHouse"
substring = '-hri-t-asW-en'
Naturally, original_string.find('-hri-t-asW-en') returns -1. However, it would be possible to find that hri begins at index 19 and therefore, given the one-character prefix -, that the substring must begin at index 18.
This is typically what regular expressions are for: finding patterns. You can try:
import re # use regexp
original_string = "TwasTheNightBeforeChristmasWhenAllThroughTheHouse"
r = re.compile(".hri.t.asW.en") # constructs the search machinery
res = r.search(original_string) # search
print (res.group(0)) # get results
result will be:
ChristmasWhen
Now if your input (the search string) must use '-' as a wildcard you can then translate it to obtain the right regular expression:
import re
original_string = "TwasTheNightBeforeChristmasWhenAllThroughTheHouse"
s = "-hri-t-asW-en" # supposedly input by the user, with '-' as the wildcard
s = s.replace('-','.') # translate to regexp syntax
r = re.compile(s)
res = r.search(original_string)
print (res.group(0))
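A plain replace of '-' with '.' works for the example, but it would misbehave if the search string contained characters that are special in a regex (such as '(' or '+'). The following is a sketch of a more defensive translation; the helper names wildcard_to_regex and find_with_regex are made up for illustration:

```python
import re

def wildcard_to_regex(masked, wildcard='-'):
    """Turn each wildcard into '.', and escape every other character."""
    return ''.join('.' if ch == wildcard else re.escape(ch) for ch in masked)

def find_with_regex(original, masked, wildcard='-'):
    """Return the start index of the first match, or -1 like str.find()."""
    m = re.search(wildcard_to_regex(masked, wildcard), original)
    return m.start() if m else -1

original_string = "TwasTheNightBeforeChristmasWhenAllThroughTheHouse"
print(find_with_regex(original_string, '-hri-t-asW-en'))
```

Note that '.' does not match a newline by default, so this sketch assumes the original string has no newline at a wildcard position.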
Perhaps use a regular expression? For instance, you can use the . (dot character) to match any character (other than a newline, by default). So if you modify your substring to use dots instead of dashes for the erased letters in the string, you can use re.search to locate those patterns:
import re
text = 'TwasTheNightBeforeChristmasWhenAllThroughTheHouse'
re.search('.hri.t.asW.en', text)
You can use regular expressions to find both the match and its position:
import re
p = re.compile(".hri.t.asW.en")
for m in p.finditer('TwasTheNightBeforeChristmasWhenAllThroughTheHouse'):
    print(m.start(), m.group())
Output: 18 ChristmasWhen
A non-regex approach, less efficient than the regex ones, but still a possibility:
o = "TwasTheNightBeforeChristmasWhenAllThroughTheHouse"
s = '-hri-t-asW-en'
# + 1 so that a match at the very end of o is also considered
r = next(i for i in range(len(o) - len(s) + 1) if all(a == b or b == '-' for a, b in zip(o[i:i+len(s)], s)))
print(r)
Output
18
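The same sliding-window idea can be written as a small function that mirrors str.find() by returning -1 when nothing fits (a bare next() without a default raises StopIteration in that case). This is only a sketch; the function name find_masked is an assumption, not part of the original answer:

```python
def find_masked(original, masked, wildcard='-'):
    """Return the first index where masked fits original, or -1 if none does."""
    n = len(masked)
    # Slide a window of len(masked) over the original, including the last position.
    for i in range(len(original) - n + 1):
        window = original[i:i + n]
        # A position fits if every character is equal or masked by the wildcard.
        if all(m == wildcard or m == c for c, m in zip(window, masked)):
            return i
    return -1

o = "TwasTheNightBeforeChristmasWhenAllThroughTheHouse"
print(find_masked(o, '-hri-t-asW-en'))
```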
Say I have str = "qwop(8) 5" and I want to return the position of 8.
I have the following solution:
import re
str = "qwop(8) 5"
regex = re.compile("\(\d\)")
match = re.search(regex, str) # match object has span = (4, 7)
print(match.span()[0] + 1) # +1 gets at the number 8 rather than the first bracket
This seems really messy. Is there a more sophisticated solution? Preferably using re as I've already imported that for other uses.
Use match.start() to get the start index of the match, and a capturing group to capture only the digit between the brackets, which avoids the +1 on the index. If you want the very start of the pattern, use match.start(); if you only want the digit, use match.start(1):
import re
test_str = 'qwop(8) 5'
pattern = r'\((\d)\)'
match = re.search(pattern, test_str)
start_index = match.start()
print('Start index:\t{}\nCharacter at index:\t{}'.format(start_index,
test_str[start_index]))
match_index = match.start(1)
print('Match index:\t{}\nCharacter at index:\t{}'.format(match_index,
test_str[match_index]))
Outputs:
Start index: 4
Character at index: (
Match index: 5
Character at index: 8
You can use:
regex = re.compile(r'\((\d+)\)')
The r prefix marks a raw string. In a raw string, Python does not process backslash escapes, so r'\n' is not a newline character but a string of two characters: a backslash ('\\') and an 'n'.
The additional parentheses define a capture group. Since a number is a sequence of one or more digits, the + ensures that a multi-digit case such as (1425) is captured as well.
We can then perform a .search() and obtain a match. You then can use .start(1) to obtain the start of the first capture group:
>>> data = "qwop(8) 5"
>>> regex.search(data)
<_sre.SRE_Match object; span=(4, 7), match='(8)'>
>>> regex.search(data).start(1)
5
If you are interested in the content of the first capture group, you can call .group(1):
>>> regex.search(data).group(1)
'8'
import re
s = "qwop(8)(9) 5"
regex = re.compile("\(\d\)")
match = re.search(regex, s)
print(match.start() + 1)
start() returns the start index of the match, and re.search finds only the first occurrence, so this prints only the index of the 8 in (8).
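If the positions of all bracketed numbers are needed rather than just the first, re.finditer combined with a capture group may help. A short sketch using the same example string:

```python
import re

s = "qwop(8)(9) 5"
pattern = re.compile(r'\((\d+)\)')  # group 1 is the number inside the brackets

# start(1) is the index of the first digit of each bracketed number.
positions = [(m.start(1), m.group(1)) for m in pattern.finditer(s)]
print(positions)  # [(5, '8'), (8, '9')]
```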
I have a string named protein. It prints something like this: KALSKJKDALIEUTSTARTALKSJDALK*KAJSLDKJSTARTJAISOIEWORUNCD*
I want code that will search this string for START and * and print the characters in between them, in this case letters.
For example: protein = STARTJSADHFJAS*KJSTARTAKSLJDIOQWIE*
print protein_filtered = [JSADHFJAS, AKSLJDIOQWIE]
So far I have this, but this will only get me the first protein sequence. Also, I don't care if it's appended to a list or if it's a string.
start_marker = 'START'
end_marker = '*'
start = protein.index(start_marker) + len(start_marker)
end = protein.index(end_marker, start + 1)
print(protein[start:end])
START(.*?)\*
You can do this with re. See the demo:
https://regex101.com/r/hE4jH0/41
import re
p = re.compile(r'START(.*?)\*')
test_str = "STARTJSADHFJAS*KJSTARTAKSLJDIOQWIE*"
print(p.findall(test_str))  # ['JSADHFJAS', 'AKSLJDIOQWIE']
We have used a non-greedy regex here by appending ? after .* so that the match stops at the first occurrence of *. If it were greedy, it would extend to the last occurrence of *.
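To see the difference between the greedy and non-greedy forms on the example string, a quick comparison can be run:

```python
import re

test_str = "STARTJSADHFJAS*KJSTARTAKSLJDIOQWIE*"

# Non-greedy: each match stops at the first '*' after START.
lazy = re.findall(r'START(.*?)\*', test_str)

# Greedy: the single match runs from the first START to the last '*'.
greedy = re.findall(r'START(.*)\*', test_str)

print(lazy)    # ['JSADHFJAS', 'AKSLJDIOQWIE']
print(greedy)  # ['JSADHFJAS*KJSTARTAKSLJDIOQWIE']
```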
One solution can be:
# Drop the text before the first START, then cut each piece at its first '*'
final_list = [part.split('*')[0] for part in protein.split('START')[1:] if '*' in part]
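The index-based attempt from the question can also be extended into a loop that keeps searching after each end marker. This is only a sketch of that idea; the function name extract_between is hypothetical, and it uses str.find() (which returns -1) rather than str.index() (which raises ValueError) so the loop can stop cleanly:

```python
def extract_between(text, start_marker='START', end_marker='*'):
    """Collect every substring found between start_marker and the next end_marker."""
    results = []
    pos = 0
    while True:
        start = text.find(start_marker, pos)
        if start == -1:
            break  # no further start marker
        start += len(start_marker)
        end = text.find(end_marker, start)
        if end == -1:
            break  # start marker without a closing end marker
        results.append(text[start:end])
        pos = end + len(end_marker)  # continue after this end marker
    return results

protein = "KALSKJKDALIEUTSTARTALKSJDALK*KAJSLDKJSTARTJAISOIEWORUNCD*"
print(extract_between(protein))
```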
I'm new to Python and am trying to figure out a Python regex to find any strings of the form prefix-digits. For example, 'type1-001' and 'type2-001' should be a match, but 'type3-asdf001' shouldn't be a match. I would like to be able to match with a regex like [type1|type2|type3]-\d+ to find any strings that start with type1, type2, or type3 and then are appended with '-' and digits. Also, it would be cool to know how to search for any upper case text appended with '-' and digits.
Here's what I think should work, but I can't seem to get it right...
pref_num = re.compile(r'[type1|type2]-\d+')
[] will match any of the set of characters appearing between the brackets. To group regexes you need to use (). So, I think your regex should be something like:
pref_num = re.compile(r'(type1|type2)-\d+')
As to how to search any uppercase text appended with - and digits, I would suggest:
[A-Z]+-\d+
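The difference between the square-bracket version and the grouped version can be checked directly. A small sketch (the sample strings other than those in the question are made up for illustration):

```python
import re

bracket = re.compile(r'[type1|type2]-\d+')         # character class: ONE char from the set
grouped = re.compile(r'(?:type1|type2|type3)-\d+') # alternation between whole words
upper = re.compile(r'[A-Z]+-\d+')                  # upper-case text, '-', digits

# The character class only consumes the single '1' before the dash.
bracket_hit = bracket.search('type1-001').group(0)
grouped_hit = grouped.search('type1-001').group(0)

# fullmatch() requires the whole string to fit the pattern.
rejected = grouped.fullmatch('type3-asdf001')
upper_hit = upper.fullmatch('ABC-123')

print(bracket_hit, grouped_hit, rejected, upper_hit is not None)
```

The (?:...) form is a non-capturing group; a plain (...) group behaves the same here but also records the matched prefix.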
If you only want the digit after "type" to vary, put only those digits in the square brackets (without |, which is matched literally inside a character class), like so:
re.compile(r'type[12]-\d+')
You can use the pattern
'type[1-3]-[0-9]{3}'
Demo
>>> import re
>>> p = 'type[1-3]-[0-9]{3}'
>>> s = 'type2-005 with some text type1-101 and then type1-asdf001'
>>> re.findall(p, s)
['type2-005', 'type1-101']
import re
pref_num = re.compile(r'(type1|type2|type3)-\d+')
m = pref_num.search('type1-000')
if m is not None: print(m.string)
m = pref_num.search('type2-000')
if m is not None: print(m.string)
m = pref_num.search('type3-abc000')  # no match, so nothing is printed
if m is not None: print(m.string)
I want to check if a string ends with a "_INT".
Here is my code
import re
nOther = "c1_1"
tail = re.compile(r'_\d*$')
if tail.search(nOther):
    nOther = nOther.replace("_", "0")
print(nOther)
output:
c101
c102
c103
c104
But there may be two underscores in the string, and I am only interested in the last one.
How can I edit my code to handle this?
Checking for a match and then making the replacement in two separate steps is unnecessary, because re.sub does both in one step:
txt = re.sub(r'_(?=\d+$)', '0', txt)
The pattern uses a lookahead (?=...) (meaning "followed by"), which is only a check: its content is not part of the match. In other words, \d+$ is not replaced.
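A quick way to convince yourself that only the last underscore is affected is to run the substitution on a few inputs. The strings with two underscores below are made-up test cases, not from the question:

```python
import re

def replace_last_underscore(txt):
    # Replace an underscore only when nothing but digits follows it to the end.
    return re.sub(r'_(?=\d+$)', '0', txt)

one = replace_last_underscore("c1_1")
two = replace_last_underscore("a_b_1")
none = replace_last_underscore("c1_x")

print(one, two, none)  # c101 a_b01 c1_x
```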
One way to do it would be to capture everything that is not the last underscore and rebuild the string.
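import re
nOther = "c1_1"
# The greedy (.*) consumes everything up to the last underscore
tail = re.compile(r'(.*)_(\d*$)')
m = tail.search(nOther)
if m:
    # Rebuild the string with '0' in place of the last underscore
    nOther = m.group(1) + '0' + m.group(2)
print(nOther)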
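The capture-and-rebuild step can also be folded into a single sub() call with group references in the replacement. This is a sketch of that variant; it uses \d+ (at least one digit) instead of \d*, which is an assumption about the intended inputs:

```python
import re

tail = re.compile(r'(.*)_(\d+)$')

def rebuild(name):
    # \g<1> is used instead of \1 because a bare \10 would be read as group 10.
    return tail.sub(r'\g<1>0\g<2>', name)

print(rebuild("c1_1"))   # c101
print(rebuild("a_b_1"))  # a_b01
```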