For example:
Characters to match: 'czk'
string1: 'zack' Matches
string2: 'zak' Does not match
I tried (c)+(k)+(z) and [ckz], which are obviously wrong. I feel this is a simple task, but I am unable to find an answer.
The most natural way would probably be to use sets rather than regex, like so:
set('czk').issubset(s)
Code is very often simpler and easier to maintain when it doesn't lean heavily on regex.
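For instance, with the sample strings from the question:

required = set('czk')
print(required.issubset('zack'))  # True: 'zack' contains c, z and k
print(required.issubset('zak'))   # False: 'c' is missing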
Basically, you have to sort the string first so 'zack' becomes 'ackz', and then you can match it against a regex like /.*c.*k.*z.*/.
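A rough sketch of that idea, sorting the input first and searching with the wanted characters in sorted order:

import re

pattern = re.compile(r'c.*k.*z')  # 'c', 'k', 'z' in sorted order

def contains_all(s):
    # Sort the string so the required letters appear in a predictable order
    return bool(pattern.search(''.join(sorted(s))))

print(contains_all('zack'))  # True  ('zack' sorts to 'ackz')
print(contains_all('zak'))   # False (no 'c')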
I have a URL like this:
http://foo.com/bar_by_baz.html
Now I want to extract baz from that URL using a regex, but so far I have only managed to write this much:
[_]+?\w[^.]+
This is giving me
_by_baz
as output. Now I want to know how I can select any special character exactly one time, or what the best approach to solving this with regex would be.
I am trying it on Python 3.x.
Here's your regex: [_]+?([^_.]+) and the group match will return baz. The concept is to exclude the underscore and the dot from the captured text.
Alternatively, this works by capturing only the alphanumerics: [_]+?([A-Za-z0-9]+)
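As an illustrative sketch, assuming re.findall on the sample URL and taking the last capture as the piece after the final underscore:

import re

url = 'http://foo.com/bar_by_baz.html'
pieces = re.findall(r'_+?([^_.]+)', url)  # ['by', 'baz']
print(pieces[-1])  # baz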
I am going to assume from your profile that you are seeking a JavaScript-friendly solution (you should update your question & tags).
For JavaScript, you could use this pattern: /[^_]+(?=\.[a-z]+$)/
Demo Link. The pattern matches the substring containing no underscores that is followed by a dot and then one or more alphabetical characters running to the end of the string.
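Since the question mentions Python 3.x, the same idea ports directly to Python; a small sketch:

import re

url = 'http://foo.com/bar_by_baz.html'
# no underscores, followed by a dot and letters running to the end of the string
m = re.search(r'[^_]+(?=\.[a-z]+$)', url)
print(m.group(0) if m else None)  # baz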
There will be several ways to accomplish your task. Finding the best/most efficient one can only be achieved if you provide more information about the coding environment/language and a few more sample strings.
I know there are already some threads about matching a regex against an array: How do you use a regex in a list comprehension in Python? But I don't think those approaches are very scalable.
My question is how to do the regex matching as efficiently as possible. For example, I have a profanity word list below (it has 2000 lines in total):
.*damn
bollock.*
...
(You get the idea…)
What I want to do is find, as fast as possible, whether a sentence contains any profanity word/pattern. Concatenating all of these patterns into one pattern with | would produce a huge expression. Does anyone have ideas about how to optimize this in Python?
I would give this library a try:
https://code.google.com/archive/p/esmre/
Regular expression acceleration in Python using Aho-Corasick
Or this:
https://github.com/WojciechMula/pyahocorasick/
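A minimal sketch with pyahocorasick, assuming the patterns can be reduced to literal substrings (Aho-Corasick matches literal words, not wildcards like .*):

import ahocorasick  # pip install pyahocorasick

words = ['damn', 'bollock']  # literal cores of the patterns

automaton = ahocorasick.Automaton()
for idx, word in enumerate(words):
    automaton.add_word(word, (idx, word))
automaton.make_automaton()

sentence = 'well damn, that was rude'
hits = [word for _, (_, word) in automaton.iter(sentence.lower())]
print(hits)  # ['damn']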
I have the following patterns:
is\s?(this|that|it)\s?true\s[?]?
^real$
^reall[y]*[\s]?[?]*$
wh[a]*[t]*[?!][?]*
For every string, I have to check whether any of these patterns are present in it.
What's the best way to do it?
I have tried using:
re.search(
    r'is\s?(this|that|it)\s?true\s[?]?|^real$|^reall[y]*[\s]?[?]*$|wh[a]*[t]*[?!][?]*',
    string)
But it is very slow. Is there a better way to do this?
If you're using the same regular expression on a lot of strings, you can try using re.compile to save some time.
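A minimal sketch using the patterns from the question, compiled once and reused for every string:

import re

pattern = re.compile(
    r'is\s?(this|that|it)\s?true\s[?]?'
    r'|^real$'
    r'|^reall[y]*[\s]?[?]*$'
    r'|wh[a]*[t]*[?!][?]*'
)

strings = ['is this true ?', 'really??', 'hello world']
print([s for s in strings if pattern.search(s)])  # ['is this true ?', 'really??']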
I am looking for a regular expression that discriminates between a string that contains a numerical value enclosed in parentheses and a string that contains one outside of them. The problem is that parentheses may be nested inside each other:
So, for example the expression should match the following strings:
hey(example1)
also(this(onetoo2(hard)))
but(here(is(a(harder)one)maybe23)Hehe)
But it should not match any of the following:
this(one)is22misleading
how(to(go)on)with(multiple)3parent(heses(around))
So far I've tried
\d[A-Za-z] \)
and simple things like it. The problem with this one is that it does not match example 2, because the digit there is followed by a ( rather than a ).
How could I solve this one?
The problem is not one of pattern matching. That means regular expressions are not the right tool for this.
Instead, you need lexical analysis and parsing. There are many libraries available for that job.
You might try the parsing or pyparsing libraries.
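As an illustration of the parse-it-yourself idea, here is a hand-rolled sketch that just tracks nesting depth, assuming the requirement is that the string contains a digit and that every digit sits inside at least one pair of parentheses:

def digits_only_inside_parens(s):
    depth = 0
    for ch in s:
        if ch == '(':
            depth += 1
        elif ch == ')':
            depth = max(depth - 1, 0)
        elif ch.isdigit() and depth == 0:
            return False  # a digit at depth 0 is outside every pair of parentheses
    return any(ch.isdigit() for ch in s)

print(digits_only_inside_parens('also(this(onetoo2(hard)))'))  # True
print(digits_only_inside_parens('this(one)is22misleading'))    # False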
These types of regexes are not always easy, but sometimes it's possible to come up with one, provided the input stays somewhat consistent. A pattern along these general lines should work:
(.*(\([\d]+[^(].*\)|\(.*[^)][\d]+.*\)).*)
Code:
import re

p = re.compile(r'(.*(\([\d]+[^(].*\)|\(.*[^)][\d]+.*\)).*)', re.MULTILINE)
result = re.findall(p, searchtext)  # searchtext holds the text to test
print(result)
Result:
https://regex101.com/r/aL8bB8/1
I am trying to print the characters shared between two strings in Python. I am doing this in the hope of learning how to do it using nothing but Python regular expressions (I don't know regex, so this might be a good time to learn).
So if first_word = "peepa" and second_word = "poopa", I want the return value to be "pa",
since the characters shared by both variables are p and a. So far I have been following the documentation for the re module, but I can't seem to grasp the basic concepts.
Any ideas as to how would I solve this problem?
This sounds like a problem where you want to find the intersection of characters between the two strings. The quickest way would be to do this:
>>> set(first_word).intersection(second_word)
set(['a', 'p'])
I don't think regular expressions are the right fit for this problem.
Use sets. Converting a string to a set gives you an iterable of its unique letters. Then you can take the intersection of the two sets.
match = set(first_word.lower()) & set(second_word.lower())
Using regular expressions
This problem is tailor-made for sets. But you asked how to do it "using nothing but Python regular expressions."
Here is a start:
>>> import re
>>> re.sub('[^peepa]', '', "poopa")
'ppa'
The above uses regular expressions to remove from "poopa" every letter that is not already in "peepa". (As you can see, it leaves duplicated letters, which sets would not.)
In more detail, re.sub does substitutions based on regular expressions. [peepa] is a regular expression that means any of the letters in peepa. The regular expression [^peepa] means anything that is not in peepa. Anything matching this regular expression is replaced with the empty string "", that is, it is removed. What remains are only the common letters.
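If you also want each shared letter only once (so "pa" rather than "ppa"), one further regex-only sketch is to drop repeated letters with a backreference; note this keeps the last occurrence of each letter:

import re

first_word = 'peepa'
second_word = 'poopa'

shared = re.sub('[^' + re.escape(first_word) + ']', '', second_word)  # 'ppa'
unique = re.sub(r'(.)(?=.*\1)', '', shared)                           # 'pa'
print(unique)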