How to match one of multiple regular expressions? - python

I have the following patterns:
is\s?(this|that|it)\s?true\s[?]?
^real$
^reall[y]*[\s]?[?]*$
wh[a]*[t]*[?!][?]*
For every string, I have to check whether any of these patterns occurs in it.
What's the best way to do it?
I have tried using:
re.search(
'is\s?(this|that|it)\s?true\s[?]?|^real$|^reall[y]*[\s]?[?]*$|wh[a]*[t]*[?!][?]*',
string)
But it is very slow. Is there a better way to do this?

If you're using the same regular expression on a lot of strings, you can try using re.compile to save some time.
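A minimal sketch of that suggestion, using the alternation from the question. The names PATTERN and matches_any are made up for illustration. Note that the re module already caches recently compiled patterns, so the gain from compiling once may be modest; measure before relying on it.

```python
import re

# Compile the combined alternation once, at import time, and reuse it.
# Raw strings keep the backslashes intact.
PATTERN = re.compile(
    r"is\s?(this|that|it)\s?true\s[?]?"
    r"|^real$"
    r"|^reall[y]*[\s]?[?]*$"
    r"|wh[a]*[t]*[?!][?]*"
)


def matches_any(text):
    """Return True if any of the alternatives matches somewhere in text."""
    return PATTERN.search(text) is not None
```

Calling matches_any(s) in a loop over many strings then skips the pattern lookup that re.search(pattern_string, s) performs on every call.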

Related

In Python, how to efficiently check if a string matches one of the regexes in an array?

I know there are already some threads about matching regexes in an array, such as: How do you use a regex in a list comprehension in Python? But I don't think these approaches are very scalable.
My question is how to do the regex matching as efficiently as possible. For example, I have the profanity word list below (it has 2000 lines in total):
.*damn
bollock.*
...
(You get the idea…)
What I want to do is find out whether a sentence contains any profanity word/pattern as fast as possible. Concatenating all these patterns into a single pattern with | will lead to a huge pattern. Does anyone have ideas about how to optimize this in Python?
I will give this library a try:
https://code.google.com/archive/p/esmre/
Regular expression acceleration in Python using Aho-Corasick
Or this:
https://github.com/WojciechMula/pyahocorasick/

Find string in possibly multiple parentheses?

I am looking for a regular expression that discriminates between a string that contains a numerical value enclosed in parentheses and a string that contains one outside of them. The problem is that the parentheses may be nested.
So, for example the expression should match the following strings:
hey(example1)
also(this(onetoo2(hard)))
but(here(is(a(harder)one)maybe23)Hehe)
But it should not match any of the following:
this(one)is22misleading
how(to(go)on)with(multiple)3parent(heses(around))
So far I've tried
\d[A-Za-z] \)
and simple things like this one. The problem with this one is that it does not match the second example, because the number is followed by a ( group rather than a ).
How could I solve this one?
The problem is not one of pattern matching. That means regular expressions are not the right tool for this.
Instead, you need lexical analysis and parsing. There are many libraries available for that job.
You might try the parsing or pyparsing libraries.
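For the examples in the question, a full parser library may not be needed: a single pass that tracks nesting depth is enough. The sketch below is plain Python and assumes that "match" means at least one digit inside parentheses and no digit outside any parentheses; the function name is made up for illustration.

```python
def digits_only_inside_parens(text):
    """True if text has a digit inside parentheses and none outside them."""
    depth = 0            # current nesting level of parentheses
    seen_inside = False  # set once a digit is found at depth >= 1
    for ch in text:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth = max(depth - 1, 0)  # tolerate a stray ')'
        elif ch.isdigit():
            if depth == 0:
                return False  # a digit outside all parentheses
            seen_inside = True
    return seen_inside
```

Under that assumption, the three strings that should match return True and the two that should not return False.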
These types of regexes are not always easy, but sometimes it's possible to come up with one, provided the input remains somewhat consistent. A pattern generally like this should work:
(.*(\([\d]+[^(].*\)|\(.*[^)][\d]+.*\)).*)
Code:
import re
p = re.compile(r'(.*(\([\d]+[^(].*\)|\(.*[^)][\d]+.*\)).*)', re.MULTILINE)  # r'' instead of ur'': the ur prefix is a syntax error in Python 3
result = re.findall(p, searchtext)
print(result)
Result:
https://regex101.com/r/aL8bB8/1

Matching characters in two Python strings

I am trying to print the characters shared between two strings in Python. I am hoping to find out how to do this using nothing but Python regular expressions (I don't know regex, so this might be a good time to learn it).
So if first_word = "peepa" and second_word = "poopa" I want the return value to be: "pa"
since p and a are the characters shared by both variables. So far I have been following the documentation on how to use the re module, but I can't seem to grasp the basic concepts.
Any ideas as to how I would solve this problem?
This sounds like a problem where you want to find the intersection of characters between the two strings. The quickest way would be to do this:
>>> set(first_word).intersection(second_word)
set(['a', 'p'])
I don't think regular expressions are the right fit for this problem.
Use sets. Casting a string to a set returns an iterable with unique letters. Then you can retrieve the intersection of the two sets.
match = set(first_word.lower()) & set(second_word.lower())
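A set has no order, so to get the string "pa" asked for in the question the common letters still need to be joined in some order. A minimal sketch, assuming the order of first appearance in the first word is what is wanted; shared_letters is a made-up name.

```python
def shared_letters(first_word, second_word):
    """Return the letters common to both words, in first_word order."""
    common = set(first_word) & set(second_word)
    # dict.fromkeys keeps one copy of each letter in first-seen order.
    return "".join(ch for ch in dict.fromkeys(first_word) if ch in common)


print(shared_letters("peepa", "poopa"))  # pa
```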
Using regular expressions
This problem is tailor-made for sets. But you asked "how to do this using nothing but python regular expressions."
Here is a start:
>>> import re
>>> re.sub('[^peepa]', '', "poopa")
'ppa'
The above uses regular expressions to remove from "poopa" every letter that was not already in "peepa". (As you see it leaves duplicated letters which sets would not do.)
In more detail, re.sub does substitutions based on regular expressions. [peepa] is a regular expression that means any of the letters peepa. The regular expression [^peepa] means anything that is not in peepa. Anything matching this regular expression is replaced with the empty string "", that is, it is removed. What remains are only the common letters.
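The same idea can be wrapped in a function that builds the character class from any first word. This is only a sketch: shared_chars is a made-up name, re.escape is used so that characters such as ] or ^ in the first word are taken literally, and the final de-duplication step is plain Python rather than regex.

```python
import re


def shared_chars(first_word, second_word):
    """Keep the characters of second_word that also occur in first_word."""
    if not first_word:
        return ""  # "[^]" would not be a valid character class
    not_in_first = "[^" + re.escape(first_word) + "]"
    kept = re.sub(not_in_first, "", second_word)  # e.g. "ppa"
    # Drop repeated letters while keeping first-seen order.
    return "".join(dict.fromkeys(kept))


print(shared_chars("peepa", "poopa"))  # pa
```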

How to find if a string contains all the certain characters?

For example:
Characters to match: 'czk'
string1: 'zack' Matches
string2: 'zak' Does not match
I tried (c)+(k)+(z) and [ckz], which are obviously wrong. I feel this is a simple task, but I am unable to find an answer.
The most natural way would probably be to use sets rather than regex, like so:
set('czk').issubset(s)
Code is very often simpler and easier to maintain without using regex much.
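Applied to the two strings from the question, the set approach can be sketched as follows; contains_all is a made-up helper name.

```python
def contains_all(required, text):
    """True if every character in required occurs somewhere in text."""
    return set(required).issubset(text)


print(contains_all("czk", "zack"))  # True
print(contains_all("czk", "zak"))   # False
```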
If you do want a regex, you have to sort the string first, so that "zack" becomes "ackz", and then match it against a regex like /.*c.*k.*z.*/, in which the required characters are also listed in sorted order.
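A rough Python sketch of this sort-then-match idea. The function name is made up, and the pattern is built from the sorted required characters instead of being written by hand; re.DOTALL is an added precaution so that . also covers newlines in the sorted text.

```python
import re


def contains_all_sorted(required, text):
    """Sort text, then look for the sorted required characters in order."""
    # "czk" -> "c.*k.*z"; re.escape keeps special characters literal.
    pattern = ".*".join(re.escape(ch) for ch in sorted(set(required)))
    sorted_text = "".join(sorted(text))  # "zack" -> "ackz"
    return re.search(pattern, sorted_text, re.DOTALL) is not None


print(contains_all_sorted("czk", "zack"))  # True
print(contains_all_sorted("czk", "zak"))   # False
```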

Pad an integer using a regular expression

I'm using regular expressions with a python framework to pad a specific number in a version number:
10.2.11
I want to transform the second element to be padded with a zero, so it looks like this:
10.02.11
My regular expression looks like this:
^(\d{2}\.)(\d{1})([\.].*)
If I just regurgitate back the matching groups, I use this string:
\1\2\3
When I use my favorite regular expression test harness (http://kodos.sourceforge.net/), I can't get it to pad the second group. I tried \1\20\3, but that interprets the second reference as 20, and not 2.
Because of the library I'm using this with, I need it to be a one liner. The library takes a regular expression string, and then a string for what should be used to replace it with.
I'm assuming I just need to escape the matching groups string, but I can't figure it out. Thanks in advance for any help.
How about a completely different approach?
nums = version_string.split('.')
print(".".join("%02d" % int(n) for n in nums))  # pads every component to two digits
What about moving the . out of the capture groups?
^(\d{2})\.(\d{1})[\.](.*)
replace with:
\1.0\2.\3
Try this:
(^\d(?=\.)|(?<=\.)\d(?=\.)|(?<=\.)\d$)
And replace the match by 0\1. This will make any number at least two digits long.
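In Python's re module this pattern and replacement can be tried as below. pad_components is a made-up name, and whether the asker's library hands the replacement string to re.sub unchanged is an assumption.

```python
import re

# A single digit at the start, between two dots, or at the end.
SINGLE_DIGIT = re.compile(r"(^\d(?=\.)|(?<=\.)\d(?=\.)|(?<=\.)\d$)")


def pad_components(version):
    """Prefix every one-digit component of a dotted version with 0."""
    return SINGLE_DIGIT.sub(r"0\1", version)


print(pad_components("10.2.11"))  # 10.02.11
print(pad_components("1.2.3"))    # 01.02.03
```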
Does your library support named groups? That might solve your problem.
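If the replacement string ends up in Python's re.sub (an assumption about the library), the ambiguity of \20 can be avoided with the \g<2> form, which works for numbered and named groups alike. A minimal sketch; the group names major, minor and rest are made up for illustration.

```python
import re

version = "10.2.11"

# Numbered groups: \g<2>0 would mean "group 2, then a literal 0";
# here the literal 0 goes before group 2 to pad it.
numbered = re.sub(r"^(\d{2}\.)(\d{1})([\.].*)", r"\g<1>0\g<2>\g<3>", version)

# The same pattern with named groups.
named = re.sub(
    r"^(?P<major>\d{2}\.)(?P<minor>\d{1})(?P<rest>[\.].*)",
    r"\g<major>0\g<minor>\g<rest>",
    version,
)

print(numbered)  # 10.02.11
print(named)     # 10.02.11
```

Both forms stay a single pattern string plus a single replacement string, which fits the one-liner constraint in the question.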
