I am trying to print the shared characters between two strings in Python. I am doing this in the hope of learning how to do it using nothing but Python regular expressions (I don't know regex, so this might be a good time to learn it).
So if first_word = "peepa" and second_word = "poopa", I want the return value to be "pa", since the characters shared by both variables are p and a. So far I am following the documentation on how to use the re module, but I can't seem to grasp the basic concepts of it.
Any ideas as to how I would solve this problem?
This sounds like a problem where you want to find the intersection of characters between the two strings. The quickest way would be to do this:
>>> set(first_word).intersection(second_word)
set(['a', 'p'])
I don't think regular expressions are the right fit for this problem.
Use sets. Converting a string to a set gives you its unique characters. Then you can take the intersection of the two sets.
match = set(first_word.lower()) & set(second_word.lower())
Using regular expressions
This problem is tailor-made for sets. But you asked for "how to do this using nothing but python regular expressions."
Here is a start:
>>> import re
>>> re.sub('[^peepa]', '', "poopa")
'ppa'
The above uses a regular expression to remove from "poopa" every letter that is not in "peepa". (As you can see, it keeps duplicated letters, which sets would not.)
In more detail, re.sub does substitutions based on regular expressions. [peepa] is a regular expression that matches any one of the letters in peepa. The regular expression [^peepa] matches anything that is not in peepa. Anything matching this regular expression is replaced with the empty string "", that is, it is removed. What remains are only the common letters.
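If you want to generalize that one-liner to arbitrary inputs, here is a minimal sketch (the helper name shared_chars is just for illustration); re.escape keeps characters that are special inside a character class from breaking the pattern:

import re

def shared_chars(first_word, second_word):
    # Build a character class of everything NOT in first_word and delete it
    # from second_word. re.escape guards against characters such as ']' or '\'
    # that would otherwise be special inside the class.
    # (Assumes first_word is non-empty; duplicates are kept, unlike sets.)
    pattern = '[^' + re.escape(first_word) + ']'
    return re.sub(pattern, '', second_word)

print(shared_chars("peepa", "poopa"))  # -> 'ppa'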
Related
I want to find out if there could ever be conflicts between two known regular expressions, in order to allow the user to construct a list of mutually exclusive regular expressions.
For example, we know that the regular expressions below are quite different, but they both match xy12:
'^xy1\d'
'[^\d]\d2$'
Is it possible to determine, using a computer algorithm, if two regular expressions can have such a conflict? How?
There's no halting problem involved here. All you need is to compute whether the intersection of ^xy1\d and [^\d]\d2$ is non-empty.
I can't give you an algorithm here, but here are two discussions of a method to generate the intersection without resorting to the construction of a DFA:
http://sulzmann.blogspot.com/2008/11/playing-with-regular-expressions.html
And then there's RAGEL
http://www.complang.org/ragel/
which can compute the intersection of regular expressions too.
UPDATE: I just tried out Ragel with the OP's regexps. Ragel can generate a "dot" file for Graphviz from the resulting state machine, which is terrific. The intersection of the OP's regexps looks like this in Ragel syntax:
('xy1' digit any*) & (any* ^digit digit '2')
and produces a non-empty state machine, while the empty intersection:
('xy1' digit any*) & ('q' any* ^digit digit '2')
produces an empty one.
So if all else fails, then you can still have Ragel compute the intersection and check if it outputs the empty state machine, by comparing the generated "dot" file.
The problem can be restated as: "Do the languages described by two or more regular expressions have a non-empty intersection?"

If you confine the question to pure regular expressions (no backreferences, lookahead, lookbehind, or other features that allow recognition of context-free or more complex languages), the question is at least decidable. Regular languages are closed under intersection, and there is an algorithm that takes the two regular expressions as inputs and produces, in finite time, a DFA that recognizes the intersection. Each regular expression can be converted to a nondeterministic finite automaton, and then to a deterministic finite automaton. A pair of DFAs can be converted to a DFA that recognizes the intersection. If there is a path from the start state to any accepting state of that final DFA, the intersection is non-empty (a "conflict", using your terminology).

Unfortunately, there is a possibly-exponential blowup when converting the initial NFA to a DFA, so the problem quickly becomes infeasible in practice as the size of the input expressions grows.

And if extensions to pure regular expressions are permitted, all bets are off: such languages are no longer closed under intersection, so this construction won't work.
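To make the product construction concrete, here is a minimal Python sketch (not taken from any particular library; the DFAs are assumed to be given already as explicit transition tables, and the NFA-to-DFA conversion is omitted) that checks whether two DFAs accept a common string:

from collections import deque

def dfas_intersect(dfa1, dfa2):
    # Each DFA is (start_state, accepting_states, transitions), where
    # transitions maps (state, symbol) -> state and missing entries count
    # as a dead state. BFS over the product automaton; a reachable pair of
    # accepting states means the intersection is non-empty.
    start1, accept1, trans1 = dfa1
    start2, accept2, trans2 = dfa2
    alphabet = {sym for _, sym in trans1} & {sym for _, sym in trans2}
    seen = {(start1, start2)}
    queue = deque(seen)
    while queue:
        s1, s2 = queue.popleft()
        if s1 in accept1 and s2 in accept2:
            return True
        for sym in alphabet:
            pair = (trans1.get((s1, sym)), trans2.get((s2, sym)))
            if None not in pair and pair not in seen:
                seen.add(pair)
                queue.append(pair)
    return False

# Example: "even number of 1s" vs. "ends in '1'" -- '11' is in both languages.
even_ones = ('e', {'e'}, {('e', '0'): 'e', ('e', '1'): 'o',
                          ('o', '0'): 'o', ('o', '1'): 'e'})
ends_in_1 = ('n', {'y'}, {('n', '0'): 'n', ('n', '1'): 'y',
                          ('y', '0'): 'n', ('y', '1'): 'y'})
print(dfas_intersect(even_ones, ends_in_1))  # True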
Yes, I think this is solvable: instead of thinking of regular expressions as matching strings, you can also think of them as generating strings, that is, all the strings they would match.
Let [R] be the set of strings generated by the regular expression R. Then, given two regular expressions R and T, we could try to match T against [R]. That is, [R] matches T iff there is a string in [R] which matches T.
It should be possible to develop this into an algorithm where [R] is lazily constructed as needed: where normal regular expression matching would try to match the next character from an input string and then advance to the next character in the string, the modified algorithm would check whether the FSM corresponding to the input regular expression can generate a matching character at its current state, and then advance to 'all next states' simultaneously.
Another approach would be to leverage Dan Kogai's Perl Regexp::Optimizer instead.
use Regexp::Optimizer;
my $re = Regexp::Optimizer->new->optimize(qr/foobar|fooxar|foozap/);
# $re is now qr/foo(?:[bx]ar|zap)/
That is, first optimize and then discard all redundant patterns.
Maybe Ron Savage's Regexp::Assemble could be even more helpful.
It allows assembling an arbitrary number of regular expressions into a single regular expression that matches everything the individual REs match.* Or use a combination of both approaches.
* However, you need to be aware of some differences between Perl and Java or other PCRE flavors.
If you are looking for a library in Java, you can use Automaton with the '&' (intersection) operator:

RegExp re = new RegExp("(ABC_123.*56.txt)&(ABC_12.*456.*\\.txt)", RegExp.INTERSECTION); // parse the RegExp
Automaton a = re.toAutomaton(); // convert the RegExp to an automaton

if (a.isEmpty()) { // test whether the intersection is empty
    System.out.println("Intersection is empty!");
} else {
    // print the shortest accepted string
    System.out.println("Intersection is non-empty, example: " + a.getShortestExample(true));
}
Original Answer:
Detecting if two regexes could possibly match the same string
I am looking for a regular expression that discriminates between a string that contains a numerical value enclosed in parentheses and a string that contains one outside of them. The problem is, parentheses may be nested:
So, for example the expression should match the following strings:
hey(example1)
also(this(onetoo2(hard)))
but(here(is(a(harder)one)maybe23)Hehe)
But it should not match any of the following:
this(one)is22misleading
how(to(go)on)with(multiple)3parent(heses(around))
So far I've tried
\d[A-Za-z] \)
and easy things like this one. The problem with this one is that it does not match example 2, because there the digit is followed by a ( rather than a letter.
How could I solve this one?
The problem is not one of pattern matching. That means regular expressions are not the right tool for this.
Instead, you need lexical analysis and parsing. There are many libraries available for that job.
You might try the parsing or pyparsing libraries.
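If a full parsing library feels heavy, one reading of the requirement (every digit must sit inside at least one pair of parentheses) can be checked with a few lines of plain Python and no regex at all; this is just a sketch of the idea, and the function name is only for illustration:

def digits_only_inside_parens(s):
    # Track parenthesis nesting depth; reject as soon as a digit shows up
    # at depth 0. Require at least one digit overall so a string with no
    # numbers at all does not match.
    depth = 0
    saw_digit = False
    for ch in s:
        if ch == '(':
            depth += 1
        elif ch == ')':
            depth -= 1
        elif ch.isdigit():
            if depth == 0:
                return False
            saw_digit = True
    return saw_digit

print(digits_only_inside_parens('also(this(onetoo2(hard)))'))  # True
print(digits_only_inside_parens('this(one)is22misleading'))    # False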
These types of regexes are not always easy, but sometimes it's possible to come up with one, provided the input remains somewhat consistent. A pattern along these general lines should work:
(.*(\([\d]+[^(].*\)|\(.*[^)][\d]+.*\)).*)
Code:
import re
p = re.compile(r'(.*(\([\d]+[^(].*\)|\(.*[^)][\d]+.*\)).*)', re.MULTILINE)
result = re.findall(p, searchtext)  # searchtext holds the lines to test
print(result)
Result:
https://regex101.com/r/aL8bB8/1
For example, how could we recognize a string of the following format with a single RE:
LenOfStr:Str
An example string in this format is:
5:5:str
The string we're looking for is "5:str".
In python, maybe something like the following (this isn't working):
r'(?P<len>\d+):(?P<str>.{int((?P=len))})'
In general, is there a way to change previously matched groups before using them, or have I just asked yet another question not meant for REs?
Thanks.
Yep, what you're describing is outside the bounds of regular expressions. Regular expressions only deal with actual character data. This provides some limited ability to make matches dependent on context (e.g., (.)\1 to match the same character twice), but you can't apply arbitrary functions to pieces of an in-progress match and use the results later in the same match.
You could do something like search for text matching the regex (\d+):\w+, and then postprocess the results to check if the string length is equal to the int value of the first part of the match. But you can't do that as part of the matching process itself.
Well this can be done with a regex (if I understand the question):
>>> import re
>>> s = '5:5:str and some more characters...'
>>> m = re.search(r'^(\d+):(.*)$', s)
>>> m.group(2)[0:int(m.group(1))]
'5:str'
It just cannot be done by dynamically changing the previous match group.
You can make it look like a single regex like so:
>>> re.sub(r'^(\d+):(.*)$',lambda m: m.group(2)[0:int(m.group(1))],s)
'5:str'
Possible Duplicate:
How to check if text is “empty” (spaces, tabs, newlines) in Python?
I am trying to write a short function to process lines of text in a file. When it encounters a line with significant content (meaning more than just whitespace), it is to do something with that line. The control structure I wanted was
if '\S' in line: do something
or
if r'\S' in line: do something
(I tried the same combinations with double quotes also, and yes I had imported re.) The if statement above, in all the forms I tried, always returns False. In the end, I had to resort to the test
if re.search('\S', line) is not None: do something
This works, but it feels a little clumsy in relation to a simple if statement. My question, then, is why isn't the if statement working, and is there a way to do something as (seemingly) elegant and simple?
I have another question unrelated to control structures, but since my suspicion is that it is also related to a possibly illegal use of regular expressions, I'll ask it here. If I have a string
s = " \t\tsome text \t \n\n"
The code
s.strip('\s')
returns the same string complete with spaces, tabs, and newlines (r'\s' is no different). The code
s.strip()
returns "some text". This, even though strip called with no character string supposedly defaults to stripping whitespace characters, which to my mind is exactly what the expression '\s' is doing. Why is the one stripping whitespace and the other not?
Thanks for any clarification.
Python string functions are not aware of regular expressions, so if you want to use them you have to use the re module.
However, if you are only interested in finding out whether a string is entirely whitespace, you can use the str.isspace() method:
>>> 'hello'.isspace()
False
>>> ' \n\t '.isspace()
True
This is what you're looking for:
if not line.isspace(): do something
Also, str.strip does not use regular expressions.
If you really just want to find out whether the line consists only of whitespace characters, a regex is a little overkill. You should go for the following instead:
if text.strip():
#do stuff
which is basically the same as:
if not text.strip() == "":
#do stuff
Python treats every non-empty string as True. So if text consists only of whitespace characters, text.strip() equals "" and therefore evaluates to False.
The expression '\S' in line does the same thing as any other string in line test; it tests whether the string on the left occurs inside the string on the right. It does not implicitly compile a regular expression and search for a match. This is a good thing. What if you were writing a program that manipulated regular expressions input by the user and you actually wanted to test whether some sub-expression like \S was in the input expression?
Likewise, read the documentation of str.strip. Does it say that it will treat its input as a regular expression and remove matching strings? No. If you want to do something with regular expressions, you have to actually tell Python that, not expect it to somehow guess that you meant a regular expression this time while at other times you just meant a plain string. While you might think of searching for a regular expression as very similar to searching for a string, they are completely different operations as far as the language implementation is concerned. And most str methods wouldn't even make sense when applied to a regular expression.
Because re.match objects are "truthy" in boolean context (like most class instances), you can at least shorten your if statement by dropping the is not None test. The rest of the line is necessary to actually tell Python what you want. As for your str.strip case (or other cases where you want to do something similar to a string operation but with a regular expression), have a look at the functions in the re module; there are a number of convenience functions there that can be helpful. Or else it should be pretty easy to implement a regex-based strip yourself.
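For instance, a sketch of such a regex-based strip might look like this (the name re_strip is only for illustration):

import re

def re_strip(s, pattern=r'\s'):
    # Remove any run of characters matching `pattern` from both ends of s.
    return re.sub(r'^(?:{0})+|(?:{0})+$'.format(pattern), '', s)

print(repr(re_strip(" \t\tsome text \t \n\n")))  # 'some text'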
I want to check whether a string contains all of a given set of characters, in any order. For example:
Characters to match: 'czk'
string1: 'zack' Matches
string2: 'zak' Does not match
I tried (c)+(k)+(z) and [ckz], which are obviously wrong. I feel this is a simple task, but I am unable to find an answer.
The most natural way would probably be to use sets rather than regex, like so:
set('czk').issubset(s)
Code is very often simpler and easier to maintain without using regex much.
Basically you have to sort the string first, so "zack" becomes "ackz", and then you can match it against a regex like /.*c.*k.*z.*/.
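As a rough sketch of that idea in Python (the name contains_all is only illustrative; sorting puts both the needle and the haystack characters in a known order, so a single ordered pattern suffices):

import re

def contains_all(chars, s):
    # Sort the haystack's characters, then require the (sorted, deduplicated)
    # needle characters to appear in that order.
    pattern = '.*'.join(re.escape(c) for c in sorted(set(chars)))
    return re.search(pattern, ''.join(sorted(s))) is not None

print(contains_all('czk', 'zack'))  # True
print(contains_all('czk', 'zak'))   # False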