Wildcard matching in Python - python

I have a class called Pattern with two methods, equates and setwildcard. equates returns the index at which a substring first appears in a string, and setwildcard sets a wildcard character in the substring.
So
p = Pattern('xyz')
t = 'xxxxxyz'
p.equates(t)
Returns 4
Also
p = Pattern('x*z', '*')
t = 'xxxxxgzx'
p.equates(t)
Returns 4, because * is the wildcard and can match any letter within t, as long as x and z match.
What's the best way to implement this?

Regex, as the accepted answer suggests, is one way to handle the problem. If you only need simpler patterns (such as Unix shell-style wildcards), the built-in fnmatch module can help:
Expressions:
* - matches everything
? - matches any single character
[seq] - matches any character in seq
[!seq] - matches any character not in seq
So for example, trying to find anything that would match with localhost:
import fnmatch
my_pattern = "http://localhost*"
name_to_check = "http://localhost:8080"
fnmatch.fnmatch(name_to_check, my_pattern) # True
The nice part of this is that / is not considered a special character, so for filename/URL matching this works out quite well without having to pre-escape all slashes!
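As a rough sketch of the other wildcard forms listed above, the snippet below uses only the standard fnmatch module. The sample names are made up for illustration. It uses fnmatchcase so that the comparison stays case-sensitive on every platform (plain fnmatch normalizes case on case-insensitive systems).

```python
import fnmatch

# Hypothetical sample names, chosen only to exercise each wildcard form.
names = ["data1.csv", "data2.csv", "dataA.csv", "notes.txt"]

# '?' matches exactly one character.
single = [n for n in names if fnmatch.fnmatchcase(n, "data?.csv")]

# '[seq]' matches one character from the set; '[!seq]' one character not in it.
digits = [n for n in names if fnmatch.fnmatchcase(n, "data[0-9].csv")]
non_digits = [n for n in names if fnmatch.fnmatchcase(n, "data[!0-9].csv")]

# fnmatch.filter applies a pattern to a whole list at once.
csv_files = fnmatch.filter(names, "*.csv")

# A pattern has to match the whole string, so the result is yes/no, not an index.
whole_only = fnmatch.fnmatchcase("xxxxxgzx", "x?z")
```

Because the match covers the entire string and returns a boolean, this fits filename or URL filtering better than finding the position of a substring.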

It looks like you're essentially implementing a subset of regular expressions. Luckily, Python has a library for that built-in! If you're not familiar with how regular expressions (or, as their friends call them, regexes) work, I highly recommend you read through the documentation for them.
In any event, the function re.search is, I think, exactly what you're looking for. It takes a pattern as its first argument and the string to search as its second. If the pattern matches, search returns a match object (shown as SRE_Match in older Python versions and re.Match in newer ones), whose start() method returns the index at which the match begins. If nothing matches, it returns None.
To use the data from your example:
import re
start_index = re.search(r'x.z', 'xxxxxgzx').start()
Note that in regexes it is . (not *) that matches any single character, so you'll have to replace the wildcard in the pattern you're using.
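One possible way to wrap this in the Pattern class from the question is sketched below. The constructor signature, the use of re.escape for the non-wildcard characters, and returning -1 when nothing matches are all assumptions; the question does not specify them.

```python
import re


class Pattern:
    """Sketch of the Pattern class from the question, built on re.search."""

    def __init__(self, text, wildcard=None):
        self.text = text
        self.regex = None
        self.setwildcard(wildcard)

    def setwildcard(self, wildcard):
        # Escape every literal character so regex metacharacters in the
        # pattern (e.g. '+', '(') are matched literally; only the chosen
        # wildcard character becomes '.', which matches any one character.
        parts = [
            "." if wildcard is not None and ch == wildcard else re.escape(ch)
            for ch in self.text
        ]
        # DOTALL lets the wildcard match a newline as well.
        self.regex = re.compile("".join(parts), re.DOTALL)

    def equates(self, target):
        # Return the index of the first match, or -1 if there is none
        # (the -1 convention mirrors str.find and is an assumption).
        match = self.regex.search(target)
        return match.start() if match else -1
```

With this sketch, Pattern('xyz').equates('xxxxxyz') and Pattern('x*z', '*').equates('xxxxxgzx') both give 4, as in the question.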

Related

python regex: match everything inside brackets including other brackets [duplicate]

In python, I can easily search for the first occurrence of a regex within a string like this:
import re
re.search("pattern", "target_text")
Now I need to find the last occurrence of the regex in a string, which doesn't seem to be supported by the re module.
I can reverse the string to "search for the first occurrence", but I also need to reverse the regex, which is a much harder problem.
I can also iterate to find all occurrences from left to right, and just keep the last one, but that looks awkward.
Is there a smart way to find the rightmost occurrence?
One approach is to prefix the regex with (?s:.*) and force the engine to try matching at the furthest position and gradually backing off:
re.search("(?s:.*)pattern", "target_text")
Do note that the result of this method may differ from re.findall("pattern", "target_text")[-1], since the findall method searches for non-overlapping matches, and not all substrings which can be matched are included in the result.
For example, executing the regex a.a on abaca, findall would return aba as the only match and select it as the last match, while the code above will return aca as the match.
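The abaca example can be checked with a short sketch. One detail is an addition here rather than part of the answer: the pattern is wrapped in a capture group, so that the text consumed by the (?s:.*) prefix can be left out of the result. The scoped (?s:...) syntax needs Python 3.6 or later.

```python
import re

text = "abaca"

# findall scans left to right and never reuses characters, so the
# overlapping match "aca" is skipped.
non_overlapping = re.findall(r"a.a", text)

# The greedy (?s:.*) prefix consumes as much as possible, then backs off
# until the capture group can match, so group(1) is the rightmost match.
last = re.search(r"(?s:.*)(a.a)", text)
last_text = last.group(1)
last_start = last.start(1)
```

Here non_overlapping holds only aba, while last_text is aca, starting at index 2.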
Yet another alternative is to use regex package, which supports REVERSE matching mode.
The result would be more or less the same as the method with (?s:.*) in re package as described above. However, since I haven't tried the package myself, it's not clear how backreference works in REVERSE mode - the pattern might require modification in such cases.
import re
re.search("pattern(?!.*pattern)", "target_text")
or
import re
re.findall("pattern", "target_text")[-1]
You can use these 2 approaches.
If you want positions, use:
x = "abc abc abc"
print([(i.start(), i.end(), i.group()) for i in re.finditer(r"abc", x)][-1])
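If building the full list only to take its last element feels wasteful, the same idea can be sketched as a small helper that keeps just the most recent match. The name last_match is hypothetical, and like findall it only sees non-overlapping matches.

```python
import re


def last_match(pattern, text):
    """Return the last non-overlapping match of pattern in text, or None."""
    match = None
    for match in re.finditer(pattern, text):
        pass  # keep overwriting; the loop variable ends on the final match
    return match


m = last_match(r"abc", "abc abc abc")
span = (m.start(), m.end(), m.group())
```

Returning None when there is no match avoids the IndexError that indexing an empty list with [-1] would raise.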
One approach is to use split. For example if you wanted to get the last group after ':' in this sample string:
mystr = 'dafdsaf:ewrewre:cvdsfad:ewrerae'
':'.join(mystr.split(':')[-1:])

Regex search multiple suffixes

I have a big list of target words I am searching
words = ['Word1', 'Word2', 'Word3']
I've been told that a regular expression of this sort:
suffix = re.compile('(?:{words}) (\\w+)'.format(words='|'.join(words)))
Is pretty efficient, since it fails the regex evaluation immediately when a character that doesn't match the expression is met.
However, the other way around is not efficient:
prefix = re.compile('(\\w+) (?:{words})'.format(words='|'.join(words)))
Is there an elegant way to instruct Python's regex engine to do the search in reverse?
Edit
I've been asked to add example usages:
# prefix search
title = re.compile('(?:Mr\\.|Mrs\\.|Ms\\.|Dr\\.|Lt\\.) (\\w+)')
# suffix search
company = re.compile('(\\w+) (?:Inc\\.|LLP\\.|ltd\\.|GMBH)')
# invoking the regex
all_people_names = title.findall(document)
all_company_names = company.findall(document)
Edit 2
A lot of people were skeptical about the significance of the timing differences.
I've implemented two methods: endswith(), and endswith_rev(), which reverses the string and the results as kabanus suggested.
These are the results:
As you can see, it makes a huge difference, even with a small number of suffixes.
Well, the way you did it you have to test all the possible prefixes up to the suffix. One way to beat this, only if the string is long enough, is to reverse everything, so you get back to your first example:
prefix = re.compile('(?:{words}) (\\w+)'.format(words='|'.join([word[::-1] for word in words])))
re.match(prefix,mystring[::-1])
This way you search from the end with the same kind of pattern as in your first example; remember to reverse the matches afterwards. I wonder how long the list of words and the string need to be to make this worth it. Apparently it is a major optimization; see the question for some timings.
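The reversal trick can be sketched end to end as follows. The word list and the input text are made up for illustration, and escaping each word with re.escape is an extra precaution that the answer itself does not include.

```python
import re

# Hypothetical suffix list and input, for illustration only.
words = ["Inc", "LLP", "GMBH"]
text = "Acme Inc"

# Reverse each word and escape it, then anchor the match at the start of
# the reversed text (which is the end of the original text).
reversed_alternatives = "|".join(re.escape(word[::-1]) for word in words)
reversed_pattern = re.compile(r"(?:{}) (\w+)".format(reversed_alternatives))

match = reversed_pattern.match(text[::-1])
# The captured group is reversed too, so flip it back.
name = match.group(1)[::-1] if match else None
```

Because re.match only tries the start of the reversed string, the engine does not have to test every possible prefix position before reaching the suffix.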
Using regular expressions is fine in some cases and required in others, e.g. when you configure a system that matches patterns and expects its input as a regex, but for this simple use case a regex just wastes CPU cycles.
This use case is simple because you know the position at which you want to match the substrings: they are always at the end of the input, so each suffix either matches the end of inputString or it doesn't:
inputString[ len(inputString) - len(suffix) : ] == suffix
But of course, you already have the Python method endswith(suffix), so you can test with:
inputString.endswith( suffix )
The suffix argument can also be a tuple, so you can do the following:
suffixes = ( "Inc.", "inc.", "Gmbh", "ltd.", "LTD", "LLP" )
inputString.endswith( suffixes )
Or for a case insensitive search:
suffixes = ( "inc.", "gmbh", "ltd.", "llp" )
inputString.lower().endswith( suffixes )
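Building on the tuple form of endswith, here is a minimal sketch of how the text before a suffix might be recovered without a regex. The helper name strip_company_suffix and the requirement that a space precede the suffix are assumptions, and the slicing assumes that lower() does not change the length of the string (true for ASCII input).

```python
# Hypothetical lowercase suffix list, taken from the example above.
SUFFIXES = ("inc.", "gmbh", "ltd.", "llp")


def strip_company_suffix(name):
    """Return the text before a known suffix, or None if no suffix matches."""
    lowered = name.lower()
    for suffix in SUFFIXES:
        # Require a space before the suffix so "Binc." is not read as "B" + "inc.".
        if lowered.endswith(" " + suffix):
            # Drop the suffix and the space in front of it.
            return name[: -len(suffix) - 1]
    return None
```

For example, strip_company_suffix("Acme Inc.") gives "Acme", and a name without any listed suffix gives None.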
Anyway, if performance is really important then perhaps Python is not the best language.
Try:
.*\.(?:jpg|gif|png)
which will match
1.jpg
b.png
c.gif
You can test it at https://regex101.com/
Non-capturing group (?:jpg|gif|png)
1st Alternative jpg
jpg matches the characters jpg literally (case sensitive)
2nd Alternative gif
gif matches the characters gif literally (case sensitive)
3rd Alternative png
png matches the characters png literally (case sensitive)
Global pattern flags
g modifier: global. All matches (don't return after first match)

Python regex find all matches

I'm using python 2.7 re library to find all numbers written in scientific form in a string. I'm using the following code:
import re
y = re.findall(".([0-9]+\.[0-9]+[eE][-+]?[0-9]+).","{8.25e+07|8.26206e+07}")
print y
However, the output is only ['8.25e+07'], while I'm expecting something like [('8.25e+07'),(8.26206e+07)]. I've been trying things but couldn't find where the problem is. If I input y = re.findall(".([0-9]+\.[0-9]+[eE][-+]?[0-9]+).","|8.26206e+07}") then it gives ['8.26206e+07'], so the pattern does match the second number, but I don't get why it doesn't match both at the same time.
You are slightly overcomplicating your regex by misusing the . which matches any character while not actually needing it and using a capturing group () without really using it.
With your pattern you are looking for a number in scientific notation which has to be BOTH preceded and followed by exactly one character.
{8.25e+07|8.26206e+07}
[--------]
Traversing your string from the beginning, re.findall finds your pattern there, drops the { and the | because only the capturing group (...) is returned, and saves the result as a match. It then continues, but only 8.26206e+07} is left. That does not satisfy your pattern, because there is no character left over for the leading . to consume before the number, so no further match is found. Note that findall only looks for non-overlapping matches[1].
To illustrate, change your input string by duplicating your separator |:
>>> p = ".([0-9]+\.[0-9]+[eE][-+]?[0-9]+)."
>>> s = "{8.25e+07||8.26206e+07}"
>>> print(re.findall(p, s))
['8.25e+07', '8.26206e+07']
To satisfy your two .s you need two separators between any two numbers.
Two things I would change in your pattern: (1) remove the .s and (2) remove your capturing group ( ), since you have no need for it:
p = r"[0-9]+\.[0-9]+[eE][-+]?[0-9]+"
Capturing groups can be very useful if you need to refer to specific captured groups again later, but your task at hand has no need for them.
[1] https://docs.python.org/2/library/re.html?highlight=findall#re.findall
Because findall is documented to
... Return all non-overlapping matches of pattern in string, as a list of strings.
But your patterns overlap: the leading . of the second match would have to be the | character, but that was already consumed by the trailing . of the first match.
Just remove those non-captured .s at the start and end of your regex.
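The effect of the extra dots can be reproduced with a short sketch that uses the string from the question. Raw strings are used here so that \. is passed to the regex engine unchanged.

```python
import re

s = "{8.25e+07|8.26206e+07}"

# Original pattern: each match also consumes one character on either side,
# so the single '|' cannot serve both numbers.
with_dots = re.findall(r".([0-9]+\.[0-9]+[eE][-+]?[0-9]+).", s)

# Dropping the surrounding dots removes the competition for the separator.
without_dots = re.findall(r"[0-9]+\.[0-9]+[eE][-+]?[0-9]+", s)
```

Here with_dots contains only the first number, while without_dots contains both.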
I think you have extra dots.
Try this:
import re
y = re.findall("([0-9]+\.[0-9]+[eE][-+]?[0-9]+)","{8.25e+07|8.26206e+07}")
print (y)
When you match with regular expressions, findall by default returns all non-overlapping matches. With a dot at both the beginning and the end, consecutive matches would have to overlap.
"([0-9]+\.[0-9]+[eE][-+]?[0-9]+)"
should work

re.match() multiple times in the same string with Python

I have a regular expression to find :ABC:`hello` pattern. This is the code.
format =r".*\:(.*)\:\`(.*)\`"
patt = re.compile(format, re.I|re.U)
m = patt.match(l.rstrip())
if m:
    ...
It works well when the pattern happens once in a line, but with an example ":tagbox:`Verilog` :tagbox:`Multiply` :tagbox:`VHDL`". It finds only the last one.
How can I find all the three patterns?
EDIT
Based on Paul Z's answer, I could get it working with this code
format = r"\:([^:]*)\:\`([^`]*)\`"
patt = re.compile(format, re.I|re.U)
for m in patt.finditer(l.rstrip()):
    tag, value = m.groups()
    print(tag, ":::", value)
Result
tagbox ::: Verilog
tagbox ::: Multiply
tagbox ::: VHDL
Yeah, dcrosta suggested looking at the re module docs, which is probably a good idea, but I'm betting you actually wanted the finditer function. Try this:
format = r"\:([^:]*)\:\`([^`]*)\`"
patt = re.compile(format, re.I|re.U)
for m in patt.finditer(l.rstrip()):
    tag, value = m.groups()
    ...
Your current solution always finds the last one because the initial .* eats as much as it can while still leaving a valid match (the last one). Incidentally, this probably also makes your program much slower than it needs to be, because .* first tries to eat the entire string, then backs up character by character as the rest of the expression tells it "that was too much, go back". Using finditer should perform much better.
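To see how much the greedy groups swallow, the sketch below runs finditer over the sample line from the question twice: once with .* inside the groups and once with negated character classes. The variable names are illustrative only.

```python
import re

line = ":tagbox:`Verilog` :tagbox:`Multiply` :tagbox:`VHDL`"

# Greedy '.*' groups stretch across the whole line, so finditer sees one
# big match instead of three.
greedy = re.compile(r":(.*):`(.*)`")
greedy_pairs = [m.groups() for m in greedy.finditer(line)]

# Negated character classes cannot cross a ':' or a backtick, so each tag
# is matched on its own.
bounded = re.compile(r":([^:]*):`([^`]*)`")
bounded_pairs = [m.groups() for m in bounded.finditer(line)]
```

The greedy version yields a single pair whose value is VHDL, while the bounded version yields one (tag, value) pair for each of the three tags.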
A good place to start is the re module docs. In addition to re.match (which only matches at the beginning of the string), there is re.findall (which finds all non-overlapping occurrences of the pattern), and the match and search methods of compiled RegexObjects, both of which accept start and end positions to limit the portion of the string being considered. See also split, which returns a list of substrings, split by the pattern. Depending on how you want your output, one of these may help.
re.findall or even better regex.findall can do that for you in a single line:
import regex as re #or just import re
s = ":tagbox:`Verilog` :tagbox:`Multiply` :tagbox:`VHDL`"
format = r"\:([^:]*)\:\`([^`]*)\`"
re.findall(format,s)
result is:
[('tagbox', 'Verilog'), ('tagbox', 'Multiply'), ('tagbox', 'VHDL')]

Difference in regex behavior between Perl and Python?

I have a couple email addresses, 'support#company.com' and '1234567#tickets.company.com'.
In perl, I could take the To: line of a raw email and find either of the above addresses with
/\w+#(tickets\.)?company\.com/i
In Python, I simply wrote the above regex as '\w+#(tickets\.)?company\.com', expecting the same result. However, support#company.com isn't found at all, and a findall on the second returns a list containing only 'tickets.'. So clearly the '(tickets\.)?' is the problem area, but what exactly is the difference in regular expression rules between Perl and Python that I'm missing?
The documentation for re.findall:
findall(pattern, string, flags=0)
Return a list of all non-overlapping matches in the string.
If one or more groups are present in the pattern, return a
list of groups; this will be a list of tuples if the pattern
has more than one group.
Empty matches are included in the result.
Since (tickets\.) is a group, findall returns that instead of the whole match. If you want the whole match, put a group around the whole pattern or use a non-capturing group, i.e.
r'(\w+#(tickets\.)?company\.com)'
r'\w+#(?:tickets\.)?company\.com'
Note that you'll have to pick out the first element of each tuple returned by findall in the first case.
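The three behaviours of findall can be compared side by side in a small sketch. The To: line is made up from the two addresses in the question, written with # exactly as they appear there.

```python
import re

text = "To: support#company.com, 1234567#tickets.company.com"

# One capturing group: findall returns only what that group captured
# (an empty string where the optional group did not take part).
only_group = re.findall(r"\w+#(tickets\.)?company\.com", text)

# Non-capturing group: findall returns the whole match.
whole = re.findall(r"\w+#(?:tickets\.)?company\.com", text)

# Outer group plus inner group: findall returns tuples, whole match first.
tuples = re.findall(r"(\w+#(tickets\.)?company\.com)", text)
first_items = [t[0] for t in tuples]
```

The first call shows that support#company.com is in fact matched; findall simply reports the empty capture for it instead of the address.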
I think the problem is in your expectations of extracted values. Try using this in your current Python code:
'(\w+#(?:tickets\.)?company\.com)'
Two problems jump out at me:
You need to use a raw string to avoid having to escape "\"
You need to escape "."
So try:
r'\w+#(tickets\.)?company\.com'
EDIT
Sample output:
>>> import re
>>> exp = re.compile(r'\w+#(tickets\.)?company\.com')
>>> bool(exp.match("s#company.com"))
True
>>> bool(exp.match("1234567#tickets.company.com"))
True
There isn't a difference in the regexes, but there is a difference in what you are looking for. Your regex is capturing only "tickets." if it exists in both regexes. You probably want something like this
#!/usr/bin/python
import re
regex = re.compile(r"(\w+#(?:tickets\.)?company\.com)")
a = [
    "foo#company.com",
    "foo#tickets.company.com",
    "foo#ticketsacompany.com",
    "foo#compant.org",
]
for string in a:
    print(regex.findall(string))
