Using REGEX over split string - python

I'm trying to split a string into sub string, splitting by the 'AND' term, and after that
clean each sub string from "garbage".
The following code get the error:
AttributeError: 'NoneType' object has no attribute 'group'
import re
def fun(self, str):
for subStr in str.split('AND'):
p = re.compile('[^"()]+')
m = p.match(subStr)
print (m.group())

It means the match is not found, and it returned None.
Note that you might want to use re.search here instead of re.match. re.match matches only at the beginning of the string while re.search can search anywhere in the string.
From the docs:
Python offers two different primitive operations based on regular
expressions: re.match() checks for a match only at the beginning of
the string, while re.search() checks for a match anywhere in the
string (this is what Perl does by default).
If you already know that then you can handle that None using:
if m:
print (m.group())
else:
#do something else

If the code above is what you really want to do, wouldn't it be easier to remove the garbage first using string.translate. Something like:
import string
def clean_and_split(x):
return string.translate(x, None, r'^"()').split("AND")

Related

Can't use '\1' backreference to capture-group in a function call in re.sub() repr expression

I have a string S = '02143' and a list A = ['a','b','c','d','e']. I want to replace all those digits in 'S' with their corresponding element in list A.
For example, replace 0 with A[0], 2 with A[2] and so on. Final output should be S = 'acbed'.
I tried:
S = re.sub(r'([0-9])', A[int(r'\g<1>')], S)
However this gives an error ValueError: invalid literal for int() with base 10: '\\g<1>'. I guess it is considering backreference '\g<1>' as a string. How can I solve this especially using re.sub and capture-groups, else alternatively?
The reason the re.sub(r'([0-9])',A[int(r'\g<1>')],S) does not work is that \g<1> (which is an unambiguous representation of the first backreference otherwise written as \1) backreference only works when used in the string replacement pattern. If you pass it to another method, it will "see" just \g<1> literal string, since the re module won't have any chance of evaluating it at that time. re engine only evaluates it during a match, but the A[int(r'\g<1>')] part is evaluated before the re engine attempts to find a match.
That is why it is made possible to use callback methods inside re.sub as the replacement argument: you may pass the matched group values to any external methods for advanced manipulation.
See the re documentation:
re.sub(pattern, repl, string, count=0, flags=0)
If repl is a function, it is called for every non-overlapping
occurrence of pattern. The function takes a single match object
argument, and returns the replacement string.
Use
import re
S = '02143'
A = ['a','b','c','d','e']
print(re.sub(r'[0-9]',lambda x: A[int(x.group())],S))
See the Python demo
Note you do not need to capture the whole pattern with parentheses, you can access the whole match with x.group().

Python Regular expression return none on byte string [duplicate]

I'm having a hell of a time trying to transfer my experience with javascript regex to Python.
I'm just trying to get this to work:
print(re.match('e','test'))
...but it prints None. If I do:
print(re.match('e','est'))
It matches... does it by default match the beginning of the string? When it does match, how do I use the result?
How do I make the first one match? Is there better documentation than the python site offers?
re.match implicitly adds ^ to the start of your regex. In other words, it only matches at the start of the string.
re.search will retry at all positions.
Generally speaking, I recommend using re.search and adding ^ explicitly when you want it.
http://docs.python.org/library/re.html
the docs is clear i think.
re.match(pattern, string[, flags])¶
If zero or more characters **at the beginning of string** match the
regular expression pattern, return a corresponding MatchObject
instance. Return None if the string does not match the pattern; note
that this is different from a zero-length match.

String replacement under regular expression not working as expected

I'm trying to search and replace part of strings using re.sub and format capabilities of Python.
I want all text like 'ESO \d+-\d+" to be replaced in the format 'ESO \d{3}-\d{3}' using leading zeroes.
I thought that this would work:
re.sub(r"ESO (\d+)-(\d+)" ,"ESO {:0>3}-{:0>3}".format(r"\1",r"\2"), line)
But I get strange results:
'ESO 409-22' becomes 'ESO 0409-022'
'ESO 539-4' becomes 'ESO 0539-04'
I can't see the error, in fact if I use two operations I get the correct result:
>>> ricerca = re.search(r"ESO (\d+)-(\d+)","ESO 409-22")
>>> print("ESO {:0>3}-{:0>3}".format(ricerca.group(1),ricerca.group(2)))
ESO 409-022
"ESO {:0>3}-{:0>3}".format(r"\1",r"\2")
evaluates to the same as:
r"ESO 0\1-0\2"
and then the group substitution proceeds normally, so it just puts a 0 in front of the numbers.
Your last code sample is a very sensible way to solve this problem, stick to it. If you really need to use re.sub, pass a function as the replacement:
>>> import re
>>> line = 'ESO 409-22'
>>> re.sub(r"ESO (\d+)-(\d+)", lambda match: "ESO {:0>3}-{:0>3}".format(*match.groups()), line)
'ESO 409-022'
>>> help(re.sub)
Help on function sub in module re:
sub(pattern, repl, string, count=0, flags=0)
Return the string obtained by replacing the leftmost
non-overlapping occurrences of the pattern in string by the
replacement repl. repl can be either a string or a callable;
if a string, backslash escapes in it are processed. If it is
a callable, it's passed the match object and must return
a replacement string to be used.

Is there a way to use regular expressions in the replacement string in re.sub() in Python?

In Python in the re module there is the following function:
re.sub(pattern, repl, string, count=0, flags=0) – Return the string obtained by replacing the leftmost non-overlapping occurrences of pattern in string by the replacement repl. If the pattern isn’t found, string is returned unchanged.
I've found it can work like this:
print re.sub('[a-z]*\d+','lion','zebra432') # prints 'lion'
I was wondering, is there an easy way to use regular expressions in the replacement string, so that the replacement string contains part of the original regular expression/original string? Specifically, can I do something like this (which doesn't work)?
print re.sub('[a-z]*\d+', 'lion\d+', 'zebra432')
I want that to print 'lion432'. Obviously, it does not. Rather, it prints 'lion\d+'. Is there an easy way to use parts of the matching regular expression in the replacement string?
By the way, this is NOT a special case. Please do NOT assume that the number will always come at the end, the words will always come in the beginning, etc. I want to know a solution to all regexes in general.
Thanks
Place \d+ in a capture group (...) and then use \1 to refer to it:
>>> import re
>>> re.sub('[a-z]*(\d+)', r'lion\1', 'zebra432')
'lion432'
>>>
>>> # You can also refer to more than one capture group
>>> re.sub('([a-z]*)(\d+)', r'\1lion\2', 'zebra432')
'zebralion432'
>>>
From the docs:
Backreferences, such as \6, are replaced with the substring matched
by group 6 in the pattern.
Note that you will also need to use a raw-string so that \1 is not treated as an escape sequence.

Difference in regex behavior between Perl and Python?

I have a couple email addresses, 'support#company.com' and '1234567#tickets.company.com'.
In perl, I could take the To: line of a raw email and find either of the above addresses with
/\w+#(tickets\.)?company\.com/i
In python, I simply wrote the above regex as'\w+#(tickets\.)?company\.com' expecting the same result. However, support#company.com isn't found at all and a findall on the second returns a list containing only 'tickets.'. So clearly the '(tickets\.)?' is the problem area, but what exactly is the difference in regular expression rules between Perl and Python that I'm missing?
The documentation for re.findall:
findall(pattern, string, flags=0)
Return a list of all non-overlapping matches in the string.
If one or more groups are present in the pattern, return a
list of groups; this will be a list of tuples if the pattern
has more than one group.
Empty matches are included in the result.
Since (tickets\.) is a group, findall returns that instead of the whole match. If you want the whole match, put a group around the whole pattern and/or use non-grouping matches, i.e.
r'(\w+#(tickets\.)?company\.com)'
r'\w+#(?:tickets\.)?company\.com'
Note that you'll have to pick out the first element of each tuple returned by findall in the first case.
I think the problem is in your expectations of extracted values. Try using this in your current Python code:
'(\w+#(?:tickets\.)?company\.com)'
Two problems jump out at me:
You need to use a raw string to avoid having to escape "\"
You need to escape "."
So try:
r'\w+#(tickets\.)?company\.com'
EDIT
Sample output:
>>> import re
>>> exp = re.compile(r'\w+#(tickets\.)?company\.com')
>>> bool(exp.match("s#company.com"))
True
>>> bool(exp.match("1234567#tickets.company.com"))
True
There isn't a difference in the regexes, but there is a difference in what you are looking for. Your regex is capturing only "tickets." if it exists in both regexes. You probably want something like this
#!/usr/bin/python
import re
regex = re.compile("(\w+#(?:tickets\.)?company\.com)");
a = [
"foo#company.com",
"foo#tickets.company.com",
"foo#ticketsacompany.com",
"foo#compant.org"
];
for string in a:
print regex.findall(string)

Categories

Resources