I'm trying to do a simple regex split in Python. The string is in the form of FooX where Foo is some string and X is an arbitrary integer. I have a feeling this should be really simple, but I can't quite get it to work.
On that note, can anyone recommend some good Regex reading materials?
On older Pythons (before 3.7) you can't use re.split() here, since it has to consume some characters, but you can use normal matching to do it.
>>> import re
>>> r = re.compile(r'(\D+)(\d+)')
>>> r.match('abc444').groups()
('abc', '444')
Using groups:
import re
m = re.match(r'^(?P<first>[A-Za-z]+)(?P<second>[0-9]+)$', "Foo9")
print(m.group('first'))
print(m.group('second'))
Using search:
import re
s = 'Foo9'
m = re.search(r'(?<=\D)(?=\d)', s)
first = s[:m.start()]
second = s[m.end():]
print(first, second)
Keeping it simple:
>>> import re
>>> a = "Foo1String12345"
>>> re.split(r'(\d+)$', a)[0:2]
['Foo1String', '12345']
Assuming you want to split between the "Foo" and the number, you'd want something like:
r/(?<=\D)(?=\d)/
Which will match at a point between a nondigit and a digit, without consuming any characters in the split.
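For what it's worth, on Python 3.7 and later re.split() accepts a pattern that can match the empty string, so this lookaround can be handed to it directly. A minimal sketch, assuming Python 3.7+ (the helper name split_at_digit_boundary is made up for illustration):

```python
import re

# Zero-width pattern: a position preceded by a non-digit and followed by a digit.
BOUNDARY = re.compile(r'(?<=\D)(?=\d)')

def split_at_digit_boundary(s):
    """Split s wherever a non-digit is immediately followed by a digit.

    Assumes Python 3.7+, where re.split() can split on empty matches.
    """
    return BOUNDARY.split(s)

print(split_at_digit_boundary('Foo9'))  # ['Foo', '9']
```

Note that a string with digits in the middle, such as "Foo1String12345", has two such boundaries and would come back in three pieces.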
>>> import re
>>> s="gnibbler1234"
>>> re.findall(r'(\D+)(\d+)',s)[0]
('gnibbler', '1234')
In the regex, \D means anything that is not a digit, so \D+ matches one or more things that are not digits.
Likewise, \d means anything that is a digit, so \d+ matches one or more digits.
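If the input might not have the FooX shape at all, it may be worth checking for a failed match before unpacking. A rough sketch of that idea, in Python 3 (split_name_number is a hypothetical helper, and converting the digits to int is my own assumption about what you'd want):

```python
import re

# One or more non-digits followed by one or more digits, and nothing else.
NAME_NUMBER = re.compile(r'(\D+)(\d+)')

def split_name_number(s):
    """Return (name, number) for strings like 'Foo9', or None for any other shape."""
    m = NAME_NUMBER.fullmatch(s)
    if m is None:
        return None
    name, digits = m.groups()
    return name, int(digits)

print(split_name_number('gnibbler1234'))  # ('gnibbler', 1234)
print(split_name_number('nodigits'))      # None
```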
I have a more challenging task, but first I am faced with this issue. Given a string s, I want to extract all the groups of characters marked by some delimiter, e.g. parentheses. How can I accomplish this using regular expressions (or any Pythonic way)?
>>> import re
>>> s = '(3,1)-[(7,2),1,(a,b)]-8a'
>>> pattern = r'(\(.+\))'
>>> re.findall(pattern, s) # EDITED: findall vs. search
['(3,1)-[(7,2),1,(a,b)']
# Desired result
['(3,1)', '(7,2)', '(a,b)']
Use findall() instead of search(). The former finds all occurrences; the latter finds only the first.
Use the non-greedy ? operator. Otherwise, you'll find a match starting at the first ( and ending at the final ).
Note that regular expressions aren't a good tool for finding nested expressions like: ((1,2),(3,4)).
import re
s = '(3,1)-[(7,2),1,(a,b)]-8a'
pattern = r'(\(.+?\))'
print(re.findall(pattern, s))
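To illustrate the point about nesting: the non-greedy pattern cuts a nested group off at the first closing parenthesis. If nested input does turn up, one possible alternative is a plain depth counter. This is only a sketch in Python 3, and top_level_groups is a name I made up:

```python
import re

def top_level_groups(text):
    """Return each outermost parenthesized substring of text."""
    groups = []
    depth = 0      # how many '(' are currently open
    start = None   # index of the '(' that opened the current outermost group
    for i, ch in enumerate(text):
        if ch == '(':
            if depth == 0:
                start = i
            depth += 1
        elif ch == ')' and depth > 0:
            depth -= 1
            if depth == 0:
                groups.append(text[start:i + 1])
    return groups

print(re.findall(r'\(.+?\)', '((1,2),(3,4))'))       # ['((1,2)', '(3,4)']
print(top_level_groups('((1,2),(3,4))'))             # ['((1,2),(3,4))']
print(top_level_groups('(3,1)-[(7,2),1,(a,b)]-8a'))  # ['(3,1)', '(7,2)', '(a,b)']
```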
Use re.findall()
import re
data = '(3,1)-[(7,2),1,(a,b)]-8a'
found = re.findall(r'(\(\w,\w\))', data)
print(found)
Output:
['(3,1)', '(7,2)', '(a,b)']
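One caveat: \w,\w only accepts a single word character on each side of the comma, so something like (10,20) would be skipped. If the items can be longer, a negated character class may be a safer bet. A small Python 3 sketch (the second input string is invented for the example):

```python
import re

data = '(3,1)-[(7,2),1,(a,b)]-8a'

# \w,\w accepts exactly one word character on each side of the comma.
single_char = re.findall(r'\(\w,\w\)', data)

# [^()]* accepts any run of characters that are not parentheses.
any_content = re.findall(r'\([^()]*\)', '(10,20)-[(7,2),1,(a,b)]')

print(single_char)  # ['(3,1)', '(7,2)', '(a,b)']
print(any_content)  # ['(10,20)', '(7,2)', '(a,b)']
```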
I'd like to add [] around any sequence of digits in a string, e.g.
"pixel1blue pin10off output2high foo9182bar"
should convert to
"pixel[1]blue pin[10]off output[2]high foo[9182]bar"
I feel there must be a simple way, but it's eluding me :(
Yes, there is a simple way, using re.sub():
result = re.sub(r'(\d+)', r'[\1]', inputstring)
Here \d matches a digit, \d+ matches 1 or more digits. The (...) around that pattern groups the match so we can refer to it in the second argument, the replacement pattern. That pattern simply replaces the matched digits with [...] around the group.
Note that I used r'..' raw string literals; if you don't you'd have to double all the \ backslashes; see the Backslash Plague section of the Python Regex HOWTO.
Demo:
>>> import re
>>> inputstring = "pixel1blue pin10off output2high foo9182bar"
>>> re.sub(r'(\d+)', r'[\1]', inputstring)
'pixel[1]blue pin[10]off output[2]high foo[9182]bar'
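As a side note, the capturing group is optional here: the replacement string can refer to the whole match with \g<0>. A minimal Python 3 sketch (bracket_numbers is just an illustrative name):

```python
import re

def bracket_numbers(text):
    """Wrap every run of digits in text in square brackets."""
    # \g<0> stands for the whole match, so no capturing group is required.
    return re.sub(r'\d+', r'[\g<0>]', text)

print(bracket_numbers("pixel1blue pin10off output2high foo9182bar"))
# pixel[1]blue pin[10]off output[2]high foo[9182]bar
```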
You can use re.sub:
>>> s="pixel1blue pin10off output2high foo9182bar"
>>> import re
>>> re.sub(r'(\d+)',r'[\1]',s)
'pixel[1]blue pin[10]off output[2]high foo[9182]bar'
Here (\d+) matches any run of digits, and re.sub replaces each match with the first group wrapped in brackets, r'[\1]'.
You can start here to learn regular expressions: http://www.regular-expressions.info/
I have a string that looks like this:
<123, 321>
The numbers can range from 0 to 999.
I need to get those coordinates into two variables as fast as possible, so I thought about regex. I've already split the string into two parts and now I need to isolate the integer from all the other characters.
I've tried this pattern:
^-?[0-9]+$
but the output is:
[]
any help? :)
If your strings always follow the same format, <123, 321>, then this should be a little bit faster than the regex approach:
def str_translate(s):
    # Python 2 only: translate(None, chars) deletes every character in chars
    return s.translate(None, " <>").split(',')
In [52]: str_translate("<123, 321>")
Out[52]: ['123', '321']
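The translate(None, " <>") call is Python 2 syntax. On Python 3 the same idea would look roughly like this (a sketch; converting the pieces to int at the end is my own addition):

```python
def str_translate(s):
    """Drop spaces and angle brackets, then split on the comma."""
    # Python 3: the third argument to str.maketrans lists characters to delete.
    table = str.maketrans('', '', ' <>')
    return s.translate(table).split(',')

x, y = (int(part) for part in str_translate('<123, 321>'))
print(x, y)  # 123 321
```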
All you need to do is get rid of the anchors (^ and $):
>>> import re
>>> string = "<123, 321>"
>>> re.findall(r"-?[0-9]+", string)
['123', '321']
>>>
Note: the anchors ^ and $ around -?[0-9]+ require the entire string to be a single (optionally negative) integer.
That is, the regex engine tries to match -?[0-9]+ starting at the beginning of the string (^) and running to its end ($). It fails because < cannot be matched by -?[0-9]+.
Whereas re.findall with the unanchored pattern finds all substrings that match -?[0-9]+, that is, the numbers.
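Here is a quick side-by-side of the anchored and unanchored patterns, as a Python 3 sketch (the variable names and the int conversion are mine, not from the question):

```python
import re

NUMBER = r'-?[0-9]+'
text = '<123, 321>'

# Anchored: the whole string would have to be one number, so nothing matches.
print(re.findall('^' + NUMBER + '$', text))  # []

# Unanchored: every number-shaped substring is returned.
x, y = (int(n) for n in re.findall(NUMBER, text))
print(x, y)  # 123 321
```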
"^-?[0-9]+$" will only match a string that contains a single number and nothing else.
You want to match a group and then extract that group:
>>> pattern = re.compile("(-?[0-9]+)")
>>> pattern.findall("<123, 321>")
['123', '321']
This code below should be self explanatory. The regular expression is simple. Why doesn't it match?
>>> import re
>>> digit_regex = re.compile('\d')
>>> string = 'this is a string with a 4 digit in it'
>>> result = digit_regex.match(string)
>>> print result
None
Alternatively, this works:
>>> char_regex = re.compile('\w')
>>> result = char_regex.match(string)
>>> print result
<_sre.SRE_Match object at 0x10044e780>
Why does the second regex work, but not the first?
Here is what the documentation for re.match() says: "If zero or more characters at the beginning of string match the regular expression pattern ..."
In your case the string doesn't have a digit (\d) at the beginning. But for \w, it has t at the beginning of your string.
If you want to check for a digit in your string using the same mechanism, then prepend .* to your regex:
digit_regex = re.compile(r'.*\d')
The second finds a match because string starts with a word character. If you want to find matches within the string, use the search or findall methods (I see this was suggested in a comment too). Or change your regex (e.g. .*(\d).*) and use the .groups() method on the result.
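To make the difference concrete, here is roughly how match, search and findall behave on the string from the question (a Python 3 sketch; the variable names are mine):

```python
import re

text = 'this is a string with a 4 digit in it'
digit = re.compile(r'\d')

print(digit.match(text))                # None: position 0 holds 't', not a digit
print(digit.search(text).group())       # 4
print(digit.findall(text))              # ['4']
print(re.match(r'.*\d', text).group())  # this is a string with a 4
```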
I'm having trouble coming up with a regular expression to match a particular case. I have a list of tv shows in about 4 formats:
Name.Of.Show.S01E01
Name.Of.Show.0101
Name.Of.Show.01x01
Name.Of.Show.101
What I want to match is the show name. My main problem is that my regex matches the name of the show with a trailing '.'. My regex is the following:
"^([0-9a-zA-Z\.]+)(S[0-9]{2}E[0-9]{2}|[0-9]{4}|[0-9]{2}x[0-9]{2}|[0-9]{3})"
Some Examples:
>>> import re
>>> SHOW_INFO = re.compile("^([0-9a-zA-Z\.]+)(S[0-9]{2}E[0-9]{2}|[0-9]{4}|[0-9]{2}x[0-9]{2}|[0-9]{3})")
>>> match = SHOW_INFO.match("Name.Of.Show.S01E01")
>>> match.groups()
('Name.Of.Show.', 'S01E01')
>>> match = SHOW_INFO.match("Name.Of.Show.0101")
>>> match.groups()
('Name.Of.Show.0', '101')
>>> match = SHOW_INFO.match("Name.Of.Show.01x01")
>>> match.groups()
('Name.Of.Show.', '01x01')
>>> match = SHOW_INFO.match("Name.Of.Show.101")
>>> match.groups()
('Name.Of.Show.', '101')
So the question is how do I avoid the first group ending with a period? I realize I could simply do:
var.strip(".")
However, that doesn't handle the case of "Name.Of.Show.0101". Is there a way I could improve the regex to handle that case better?
Thanks in advance.
I think this will do:
>>> regex = re.compile(r'^([0-9a-z.]+)\.(S[0-9]{2}E[0-9]{2}|[0-9]{3,4}|[0-9]{2}x[0-9]{2})$', re.I)
>>> regex.match('Name.Of.Show.01x01').groups()
('Name.Of.Show', '01x01')
>>> regex.match('Name.Of.Show.101').groups()
('Name.Of.Show', '101')
ETA: Of course, if you're just trying to extract different bits from trusted strings you could just use string methods:
>>> 'Name.Of.Show.101'.rpartition('.')
('Name.Of.Show', '.', '101')
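For completeness, this is how the regex above could be wrapped up and run over all four formats from the question. It's only a sketch in Python 3, and split_show is a made-up helper name:

```python
import re

SHOW_INFO = re.compile(
    r'^([0-9a-z.]+)\.(S[0-9]{2}E[0-9]{2}|[0-9]{3,4}|[0-9]{2}x[0-9]{2})$',
    re.I,
)

def split_show(filename):
    """Return (show_name, episode_code), or None if filename fits no known format."""
    m = SHOW_INFO.match(filename)
    return m.groups() if m else None

for name in ('Name.Of.Show.S01E01', 'Name.Of.Show.0101',
             'Name.Of.Show.01x01', 'Name.Of.Show.101'):
    print(split_show(name))
# ('Name.Of.Show', 'S01E01')
# ('Name.Of.Show', '0101')
# ('Name.Of.Show', '01x01')
# ('Name.Of.Show', '101')
```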
So the only real restriction on the last group is that it doesn’t contain a dot? Easy:
^(.*?)(\.[^.]+)$
This matches anything, non-greedily. The important part is the second group, which starts with a dot and then matches any non-dot character until the end of the string.
This works with all your test cases.
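One thing to keep in mind is that the second group, as written, keeps the leading dot. If that's unwanted, the dot can presumably be moved outside the group. A quick Python 3 sketch comparing the two (the variable names are mine):

```python
import re

with_dot = re.compile(r'^(.*?)(\.[^.]+)$')
without_dot = re.compile(r'^(.*?)\.([^.]+)$')  # dot moved outside the second group

print(with_dot.match('Name.Of.Show.0101').groups())     # ('Name.Of.Show', '.0101')
print(without_dot.match('Name.Of.Show.0101').groups())  # ('Name.Of.Show', '0101')
```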
It seems like the problem is that you haven't specified that the period before the last group is required, so something like ^([0-9a-zA-Z\.]+)\.(S[0-9]{2}E[0-9]{2}|[0-9]{4}|[0-9]{2}x[0-9]{2}|[0-9]{3}) might work.
I believe this will do what you want (match case-insensitively, e.g. with re.I, since the character class only lists lowercase letters):
^([0-9a-z\.]+)\.(?:S[0-9]{2}E[0-9]{2}|[0-9]{3,4}|[0-9]{2}(?:x[0-9]+)?)$
I tested this against the following list of shows:
30.Rock.S01E01
The.Office.0101
Lost.01x01
How.I.Met.Your.Mother.101
If those 4 cases are representative of the types of files you have, then that regex should place the show title in its own capture group and toss away the rest. This filter is, perhaps, a bit more restrictive than some others, but I'm a big fan of matching exactly what you need.
If the last part never contains a dot: ^(.*)\.([^\.]+)$