I'm new to Python and am trying to write a regex that finds any string made of a prefix, then '-', then digits. For example, 'type1-001' and 'type2-001' should match, but 'type3-asdf001' shouldn't. I would like to use a regex like [type1|type2|type3]-\d+ to find any strings that start with type1, type2, or type3, followed by '-' and digits. It would also be useful to know how to search for any uppercase text followed by '-' and digits.
Here's what I think should work, but I can't seem to get it right...
pref_num = re.compile(r'[type1|type2]-\d+')
[] matches any single character from the set between the brackets, so [type1|type2] matches just one of t, y, p, e, 1, 2 or |. To group alternatives you need to use (). So, I think your regex should be something like:
pref_num = re.compile(r'(type1|type2)-\d+')
As to how to search any uppercase text appended with - and digits, I would suggest:
[A-Z]+-\d+
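To see the difference between the two bracket styles in practice, here is a small sketch. The list `candidates` and the variable names are made up for illustration; the patterns are the ones from the question and the answer above. It assumes you want to validate whole strings, so it uses fullmatch(), and it uses a non-capturing group (?:...) because the alternatives only need grouping, not extracting.

```python
import re

# Hypothetical sample inputs, not from the original question.
candidates = ["type1-001", "type2-001", "type3-asdf001", "ABC-42"]

# Character class: ONE character out of t, y, p, e, 1, 2, |, then '-' and digits.
char_class = re.compile(r"[type1|type2]-\d+")
# Group with alternation: one of the whole words, then '-' and digits.
grouped = re.compile(r"(?:type1|type2|type3)-\d+")
# Any run of uppercase ASCII letters, then '-' and digits.
upper = re.compile(r"[A-Z]+-\d+")

# The character class only needs a single character before the '-'.
class_hit = char_class.search("type1-001").group()
grouped_hits = [c for c in candidates if grouped.fullmatch(c)]
upper_hits = [c for c in candidates if upper.fullmatch(c)]

print(class_hit)     # 1-001
print(grouped_hits)  # ['type1-001', 'type2-001']
print(upper_hits)    # ['ABC-42']
```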
If you only want the digit after "type" to be variable then you should put only those in the square brackets like so:
re.compile(r'type[12]-\d+')
You can use the pattern
'type[1-3]-[0-9]{3}'
Demo
>>> import re
>>> p = 'type[1-3]-[0-9]{3}'
>>> s = 'type2-005 with some text type1-101 and then type1-asdf001'
>>> re.findall(p, s)
['type2-005', 'type1-101']
import re
pref_num = re.compile(r'(type1|type2|type3)-\d+')
# search() returns None when nothing matches; group() is the matched text.
# Only type1-000 and type2-000 are printed.
for text in ('type1-000', 'type2-000', 'type3-abc000'):
    m = pref_num.search(text)
    if m is not None: print(m.group())
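One thing to keep in mind is that search() finds the pattern anywhere inside the string. If the goal is to check that the whole string has the expected shape, fullmatch() may be a better fit. A minimal sketch, where `is_valid_id` is a hypothetical helper name and the sample strings are invented:

```python
import re

pref_num = re.compile(r"(?:type1|type2|type3)-\d+")

def is_valid_id(text):
    """Return True only if the ENTIRE string is typeN-digits (hypothetical helper)."""
    return pref_num.fullmatch(text) is not None

print(is_valid_id("type1-000"))      # True
print(is_valid_id("type3-abc000"))   # False
print(is_valid_id("xtype1-000abc"))  # False, although search() would find a hit inside it
print(pref_num.search("xtype1-000abc").group())  # type1-000
```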
Related
I have a string "person:x:1319:nobody,jram,dapp,test1,app1,lasp\r\n" (for example) and need to split the string and get output only as
"nobody,jram,dapp,test1,app1,lasp\r\n"
How can I do that?
You can use str.rsplit(), which splits the string on the delimiter starting from the right side. It returns the result as a list, so you can pick out the value you want by index.
s = "person:x:1319:nobody,jram,dapp,test1,app1,lasp\r\n"
res = s.rsplit(':', 1)[-1]
print(res)
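If the last field could itself contain a colon, splitting from the right would cut it short. Assuming the record always has exactly three colon-separated fields before the member list (an assumption about the data, not something stated in the question), splitting from the left with a maximum of three splits keeps the last field whole. The string `tricky` is an invented example:

```python
s = "person:x:1319:nobody,jram,dapp,test1,app1,lasp\r\n"

from_right = s.rsplit(":", 1)[-1]  # split once, starting at the right
from_left = s.split(":", 3)[-1]    # split at most 3 times, starting at the left

# Hypothetical record whose last field contains a colon.
tricky = "person:x:1319:test:nobody,jram\r\n"
tricky_right = tricky.rsplit(":", 1)[-1]
tricky_left = tricky.split(":", 3)[-1]

# strip() drops the trailing "\r\n" if a clean list of names is wanted.
members = from_left.strip().split(",")

print(repr(from_right))
print(repr(tricky_right))
print(repr(tricky_left))
print(members)
```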
This solution uses a regex to find a digit followed by a colon, and returns everything after it as the match.
import re
s1 = "person:x:1319:nobody,jram,dapp,test1,app1,lasp\r\n"
m = re.search(r'(?<=\d:).*', s1)
match1 = m.group(0)
print(match1)
Output: nobody,jram,dapp,test1,app1,lasp
Note that this solution will still work (according to what was requested in the title) even if you have another colon in the text which is not preceded by a number.
s2 = "person:x:1319:test:nobody,jram,dapp,test1,app1,lasp\r\n"
m = re.search(r'(?<=\d:).*', s2)
match2 = m.group(0)
print(match2)
Output: test:nobody,jram,dapp,test1,app1,lasp
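A detail worth knowing about the regex version: by default . matches every character except "\n", so the match keeps the "\r" but drops the final "\n". If the line ending should be included, the re.DOTALL flag makes . match newlines too. A short sketch using repr() so the invisible characters show up:

```python
import re

s1 = "person:x:1319:nobody,jram,dapp,test1,app1,lasp\r\n"

# Default: '.' stops at "\n", so the match ends with "\r".
default_match = re.search(r"(?<=\d:).*", s1).group(0)
# DOTALL: '.' also matches "\n", so the full "\r\n" is kept.
dotall_match = re.search(r"(?<=\d:).*", s1, flags=re.DOTALL).group(0)

print(repr(default_match))  # 'nobody,jram,dapp,test1,app1,lasp\r'
print(repr(dotall_match))   # 'nobody,jram,dapp,test1,app1,lasp\r\n'
```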
Let's say we have a string in Python:
original_string = "TwasTheNightBeforeChristmasWhenAllThroughTheHouse"
We are interested in finding the starting index of the substring substring = "ChristmasWhen". This is very straightforward in Python:
>>> substring ="ChristmasWhen"
>>> original_string.find(substring)
18
and this checks out
>>> "TwasTheNightBeforeChristmasWhenAllThroughTheHouse"[18]
'C'
If we look for a string that doesn't exist, find() returns -1.
Here is my problem:
I have a substring which is guaranteed to be from the original string. However, characters in this substring have been randomly replaced with another character.
How could I algorithmically find the beginning coordinate of the substring (or at least, check if it's possible) if the substring has random characters '-' replacing certain letters?
Here's a concrete example:
original_string = "TwasTheNightBeforeChristmasWhenAllThroughTheHouse"
substring = '-hri-t-asW-en'
Naturally, original_string.find('-hri-t-asW-en') returns -1. But it would be possible to find that hri begins at index 19, and therefore, accounting for the leading -, the substring '-hri-t-asW-en' must begin at index 18.
This is typically what regular expressions are for: finding patterns. You can try:
import re # use regexp
original_string = "TwasTheNightBeforeChristmasWhenAllThroughTheHouse"
r = re.compile(".hri.t.asW.en") # constructs the search machinery
res = r.search(original_string) # search
print (res.group(0)) # get results
result will be:
ChristmasWhen
Now if your input (the search string) must use '-' as a wildcard you can then translate it to obtain the right regular expression:
import re
original_string = "TwasTheNightBeforeChristmasWhenAllThroughTheHouse"
s = "-hri-t-asW-en" # supposedly entered by the user
s = s.replace('-','.') # translate to regexp syntax
r = re.compile(s)
res = r.search(original_string)
print (res.group(0))
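If the masked string can contain characters that are special in regex syntax (such as . or +), a plain replace() would let them act as metacharacters. A hedged sketch of one way around that: escape every character except the wildcard with re.escape(). The function name `wildcard_find` and its `wildcard` parameter are made up for illustration, and the sketch assumes the wildcard stands for exactly one character that is not a newline.

```python
import re

def wildcard_find(text, masked, wildcard="-"):
    """Return the start index of `masked` in `text`, or -1 if absent.

    Each `wildcard` character matches any single character (except a newline);
    every other character is matched literally. Hypothetical helper.
    """
    pattern = "".join("." if ch == wildcard else re.escape(ch) for ch in masked)
    m = re.search(pattern, text)
    return m.start() if m else -1

original_string = "TwasTheNightBeforeChristmasWhenAllThroughTheHouse"
print(wildcard_find(original_string, "-hri-t-asW-en"))  # 18
print(wildcard_find("a.b+c", "-.b+-"))                  # 0, '.' and '+' are literal here
print(wildcard_find("axb+c", "-.b+-"))                  # -1, no literal '.' in the text
```

Returning -1 on failure mirrors the behaviour of str.find(), so the helper can stand in for it.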
Perhaps use a regular expression? For instance, you can use the . (dot character) to match any character (other than a newline, by default). So if you modify your substring to use dots instead of dashes for the erased letters in the string, you can use re.search to locate those patterns:
import re
text = 'TwasTheNightBeforeChristmasWhenAllThroughTheHouse'
m = re.search('.hri.t.asW.en', text)  # m.start() is 18
You can use regular expressions to find both the match and the position:
import re
p = re.compile(".hri.t.asW.en")
for m in p.finditer('TwasTheNightBeforeChristmasWhenAllThroughTheHouse'):
    print(m.start(), m.group())
Output: 18 ChristmasWhen
A non-regex approach, less efficient than the latter, but still a possibility:
o = "TwasTheNightBeforeChristmasWhenAllThroughTheHouse"
s = '-hri-t-asW-en'
r = next((i for i in range(len(o) - len(s) + 1) if all(a == b or b == '-' for a, b in zip(o[i:i+len(s)], s))), -1)  # -1 if no window matches
print(r)
Output
18
I have a string like this:
s = k0+k1+k1k2+k2k3+1+12
I want to convert this, such that every number, which follows a letter (k here) becomes surrounded by square brackets:
k[0]+k[1]+k[1]k[2]+k[2]k[3]+1+12
What is a good way to do that?
What I tried: using the replace() function 4 times (but it cannot leave alone the numbers that are not preceded by a letter).
Here is one option using re module with regex ([a-zA-Z])(\d+), which matches a single letter followed by digits and with sub, you can enclose the matched digits with a pair of brackets in the replacement:
import re
s = "k0+k1+k1k2+k2k3+1+12"
re.sub(r"([a-zA-Z])(\d+)", r"\1[\2]", s)
# 'k[0]+k[1]+k[1]k[2]+k[2]k[3]+1+12'
To also convert the matched letters to upper case, you can pass a lambda as the replacement argument:
re.sub(r"([a-zA-Z])(\d+)", lambda m: "%s[%s]" % (m.group(1).upper(), m.group(2)), s)
# 'K[0]+K[1]+K[1]K[2]+K[2]K[3]+1+12'
How about this?
s = re.sub(r'([a-z]+)([0-9]+)', r'\1[\2]', s)
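For comparison, here is a sketch that puts the variants side by side. It also shows an alternative that was not in the answers: a lookbehind, so that only the digits are matched and \g<0> (the whole match) is wrapped in brackets. The string `mixed` is an invented input used to show that the [a-z]+ version skips uppercase letters.

```python
import re

s = "k0+k1+k1k2+k2k3+1+12"

# Two groups: keep the letter (\1), wrap the digits (\2).
with_groups = re.sub(r"([a-zA-Z])(\d+)", r"\1[\2]", s)
# Lookbehind: match only digits that directly follow a letter.
with_lookbehind = re.sub(r"(?<=[a-zA-Z])\d+", r"[\g<0>]", s)

# Hypothetical input with an uppercase letter and a multi-digit index.
mixed = "K10+x2"
lower_only = re.sub(r"([a-z]+)([0-9]+)", r"\1[\2]", mixed)
any_case = re.sub(r"([a-zA-Z])(\d+)", r"\1[\2]", mixed)

print(with_groups)      # k[0]+k[1]+k[1]k[2]+k[2]k[3]+1+12
print(with_lookbehind)  # k[0]+k[1]+k[1]k[2]+k[2]k[3]+1+12
print(lower_only)       # K10+x[2]
print(any_case)         # K[10]+x[2]
```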
How could I find all instances of a substring in a string?
For example I have the string ("%1 is going to the %2 with %3"). I need to extract all placeholders in this string (%1, %2, %3)
The current code could only find the first two because the ending is not a white space.
import re
string = "%1 is going to the %2 with %3"
r = re.compile('%(.*?) ')
m = r.finditer(string)
for y in m:
    print(y.group())
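Don't match on whitespace; match on a word boundary instead using \b. Use +? rather than *?, because there is also a word boundary directly between the % and the digit, and a lazy *? would stop there and match only the %:
r = re.compile(r'%(.+?)\b')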
You may want to restrict your characters to word characters only instead of the . wildcard, and match at least one character:
r = re.compile(r'%(\w+)\b')
You don't appear to be using the capturing group either, so you could just omit that:
r = re.compile(r'%\w+\b')
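As a quick check of the final pattern, here is a short sketch using findall() and finditer(). The variable names are made up; note that the trailing \b can be left off, because \w+ is greedy and already runs to the end of the word. With a capturing group, findall() returns only the captured part.

```python
import re

template = "%1 is going to the %2 with %3"

# No capturing group: findall() returns the full matches.
placeholders = re.findall(r"%\w+", template)
# One capturing group: findall() returns just the captured text.
numbers = re.findall(r"%(\w+)", template)
# finditer() yields match objects, which also carry the position.
positions = [(m.start(), m.group()) for m in re.finditer(r"%\w+", template)]

print(placeholders)  # ['%1', '%2', '%3']
print(numbers)       # ['1', '2', '3']
print(positions)     # [(0, '%1'), (19, '%2'), (27, '%3')]
```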
I want to find a 6-digit number in my webpage:
<td style="width:40px;">705214</td>
My code is:
s = f.read()
m = re.search(r'\A>\d{6}\Z<', s)
l = m.group(0)
If you just want to find 6 digits in between a > and < symbol, use the following regex:
import re
s = '<td style="width:40px;">705214</td>'
m = re.search(r'>(\d{6})<', s)
l = m.groups()[0]
Note the use of parentheses ( and ) to denote a capturing group.
You can also use a look-ahead and a look-behind for the checking:
m = re.search(r'(?<=>)\d{6}(?=<)', s)
l = m.group(0)
This regex matches 6 digits that are preceded by a > and followed by a <.
You may want to check for any whitespace (tabs, space, newlines) between the tags. \s* means zero or more whitespace.
s='<td style="width:40px;">\n\n705214\t\n</td>'
m=re.search(r'>\s*(\d{6})\s*<',s)
m.groups()
('705214',)
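The original attempt found nothing because \A only matches at the very start of the string and \Z only at the very end, so they cannot sit in the middle of a page. Since re.search() returns None when nothing matches, calling .group() directly on the result can raise an AttributeError. A minimal sketch that guards against that; `find_six_digits` and `SIX_DIGITS` are hypothetical names:

```python
import re

# '>' then optional whitespace, exactly six digits, optional whitespace, '<'.
SIX_DIGITS = re.compile(r">\s*(\d{6})\s*<")

def find_six_digits(html):
    """Return the first 6-digit run between '>' and '<', or None (hypothetical helper)."""
    m = SIX_DIGITS.search(html)
    return m.group(1) if m else None

print(find_six_digits('<td style="width:40px;">705214</td>'))            # 705214
print(find_six_digits('<td style="width:40px;">\n\n705214\t\n</td>'))  # 705214
print(find_six_digits('<td>1234567</td>'))                               # None (7 digits)
```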
Parsing HTML is a blast. Usually you treat the file as one long line, remove leading and trailing whitespace between the values contained inside the tags. Maybe looking into a HTML table parsing module may help, especially if you need to parse several columns.
stackoverflow answer using lxml etree
Also, html.parser was suggested. Food for thought.
(Still learning what modules python has to offer :) )
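For what it's worth, here is a rough sketch of the html.parser route using only the standard library. The class name `CellCollector`, the helper `six_digit_cells`, and the sample table are all invented for illustration, and the sketch assumes td cells are not nested inside each other. It collects the text of every td and keeps the cells that are exactly six digits after stripping whitespace.

```python
import re
from html.parser import HTMLParser

class CellCollector(HTMLParser):
    """Collects the text content of each <td> cell (hypothetical helper class)."""

    def __init__(self):
        super().__init__()
        self.in_td = False
        self.cells = []

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self.in_td = True
            self.cells.append("")  # start a new, empty cell

    def handle_endtag(self, tag):
        if tag == "td":
            self.in_td = False

    def handle_data(self, data):
        if self.in_td:
            self.cells[-1] += data  # text may arrive in several chunks

def six_digit_cells(html):
    parser = CellCollector()
    parser.feed(html)
    parser.close()
    stripped = [cell.strip() for cell in parser.cells]
    return [cell for cell in stripped if re.fullmatch(r"\d{6}", cell)]

# Invented sample page fragment.
page = ('<table><tr><td style="width:40px;">\n705214\t</td>'
        '<td>abc</td><td>1234567</td></tr></table>')
print(six_digit_cells(page))  # ['705214']
```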
I think you want something like this:
m = re.search(r'>(\d{6})<', s)
l = m.group(1)
The ( ) around \d{6} indicate a subgroup of the result.
If you want to find multiple instances of 6-digit substrings between > and < then try this:
s = '<tag1>111111</tag1> <tag2>222222</tag2>'
m = re.findall(r'>(\d{6})<', s)
In this case, m will be ['111111','222222'].