Given:
ABC
content 1
123
content 2
ABC
content 3
XYZ
Is it possible to create a regex that matches the shortest version of "ABC[\W\w]+?XYZ"?
Essentially, I'm looking for "ABC followed by any characters, terminating with XYZ, but don't match if ABC is encountered in between" (but think of ABC as potentially being a regex itself, as it would not always be a fixed length...so ABC or ABcC could also match).
So, more generally: REGEX1, followed by any characters, terminated by REGEX2, without matching if REGEX1 occurs in between.
In this example, I would not want the first 4 lines.
(I'm sure this explanation could potentially need...further explanation haha)
EDIT:
Alright, I see the need for further explanation now! Thanks for the suggestions thus far. I'll at least give you all more to think about while I start looking into how each of your proposed solutions can be applied to my problem.
Proposal 1: Reverse the string contents and the regex.
This is certainly a very fun hack that solves the problem as I explained it. In simplifying the issue, though, I failed to mention that the same thing can happen in reverse, because the ending signature can also occur later on (and does in my specific situation). That introduces the problem illustrated below:
ABC
content 1
123
content 2
ABC
content 3
XYZ
content 4
MNO
content 5
XYZ
In this instance, I would search for something like "ABC through XYZ", meaning to catch [ABC, content 3, XYZ]...but accidentally catching [ABC, content 1, 123, content 2, ABC, content 3, XYZ]. Reversing the string and the regex would catch [ABC, content 3, XYZ, content 4, MNO, content 5, XYZ] instead of the [ABC, content 3, XYZ] that we want. The point is to make this as general as possible, because I will also be searching for things that could share the same starting signature (the regex "ABC" in this case) but have different ending signatures.
If there is a way to build the regexes so that they encapsulate this limitation themselves, it would be much easier to reuse that construction every time I build a regex to search this type of string, rather than creating a custom search algorithm that deals with it.
Proposal 2: A+B+C+[^A]+[^B]+[^C]+XYZ with IGNORECASE flag
This seems nice in the case that ABC is finite. Think of it as a regex in itself though. For example:
Hello!GoodBye!Hello.Later.
A VERY simplified version of what I'm trying to do. I would want "Hello.Later." given the start regex Hello[!.] and the end regex Later[!.]. Running something as simple as Hello[!.].*Later[!.] would grab the entire string, but I'm looking to say: if the start regex Hello[!.] occurs again between a start match and the first end match, ignore the earlier start.
The conversation below this proposal indicates that I might be up against the limitations of regular languages, similar to the parentheses-matching problem (Google it, it's fun to think about). The purpose of this post is to see if I do in fact have to resort to writing an underlying algorithm that handles the issue I'm encountering. I would very much like to avoid that if possible (the simple example I gave above is pretty easy to design a finite state machine for...I hope that holds as it grows slightly more complex).
Proposal 3: ABC(?:(?!ABC).)*?XYZ with DOTALL flag
I like the idea of this if it actually allows ABC to be a regex. I'll have to explore it when I get into the office tomorrow. Nothing looks too out of the ordinary at first glance, but I'm entirely new to Python regex (and new to actually applying regexes in code instead of just in theory homework).
A regex solution would be ABC(?:(?!ABC).)*?XYZ with the DOTALL flag.
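A quick sketch of how this behaves on the sample text; note that the inner ABC can itself be any subpattern, e.g. the Hello[!.] case from the question:
import re

s = "ABC\ncontent 1\n123\ncontent 2\nABC\ncontent 3\nXYZ"
# The (?!ABC) guard stops the dot from crossing another prefix match,
# so only the last ABC before XYZ can start a match.
print(re.findall(r'ABC(?:(?!ABC).)*?XYZ', s, re.DOTALL))
# ['ABC\ncontent 3\nXYZ']

# The same construction with a non-trivial prefix regex:
print(re.findall(r'Hello[!.](?:(?!Hello[!.]).)*?Later[!.]',
                 'Hello!GoodBye!Hello.Later.'))
# ['Hello.Later.']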
Edit
So after reading your further explanations, I would say that my previous proposal, as well as MRAB's, are quite similar and won't be of any help here. Your problem is actually the problem of nested structures.
Think of your 'prefix' and 'suffix' as symbols. You could just as well replace them with an opening and a closing parenthesis, and what you want is to be able to match only the smallest (that is, deepest) pair...
For example, if your prefix is 'ABC' and your suffix is 'XYZ':
ABChello worldABCfooABCbarXYZ
You want to get only ABCbarXYZ.
It's the same as if the prefix were ( and the suffix were ); in the string:
(hello world(foo(bar)
it would ideally match only (bar)...
You definitely have to use a context-free grammar (as programming languages do: the C grammar, the Python grammar) and a parser, or roll your own by combining regexes with the iteration and storage mechanisms of your programming language.
But that's not possible with regular expressions alone. They will probably help in your algorithm, but they just are not designed to handle this on their own. Not the right tool for the job...you cannot inflate tires with a screwdriver. Therefore, you will have to use some external mechanism (not a complicated one, though) to store the context: your position in the nested stack. Using your regular expression within each single context may still be possible.
Finite state machines are finite, and nested structures have an arbitrary depth that would require the automaton to grow arbitrarily; thus nested structures do not form regular languages.
Since recursion in a grammar allows the definition of nested syntactic structures, any language (including any programming language) which allows nested structures is a context-free language, not a regular language. For example, the set of strings consisting of balanced parentheses [like a LISP program with the alphanumerics removed] is a context-free language.
see here
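To make that "external mechanism" concrete, here is a minimal sketch (my own illustration, with made-up names) that collects every prefix and suffix match and uses a stack, so that the first suffix closes the nearest preceding prefix, i.e. the deepest pair:
import re

def innermost(text, prefix, suffix):
    # Locate all prefix/suffix matches, in order of appearance.
    tokens = sorted(
        [(m.start(), m.end(), 'pre') for m in re.finditer(prefix, text)] +
        [(m.start(), m.end(), 'suf') for m in re.finditer(suffix, text)]
    )
    stack = []
    for start, end, kind in tokens:
        if kind == 'pre':
            stack.append(start)           # remember where each prefix opened
        elif stack:
            return text[stack.pop():end]  # first suffix closes the deepest prefix
    return None

print(innermost('ABChello worldABCfooABCbarXYZ', 'ABC', 'XYZ'))  # ABCbarXYZ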
Former proposal (not relevant anymore)
If I do:
>>> s = """ABC
content 1
123
content 2
ABC
content 3
XYZ"""
>>> r = re.compile(r'A+B+C+[^A]+[^B]+[^C]+XYZ', re.I)
>>> re.findall(r,s)
I get
['ABC\ncontent 3\nXYZ']
Is that what you want?
There is another method of solving this problem: not trying to do it in one regex. You could split the string with the first regex, and then use the second one on the last part.
Code is the best explanation:
s = """ABC
content 1
123
content 2
ABC
content 3
XYZ
content 4
XYZ"""
# capturing groups to preserve the matched section
prefix = re.compile('(ABC)')
suffix = re.compile('(XYZ)')
# prefix.split(s) == ['', 'ABC', [..], 'ABC', '\ncontent 3\nXYZ\ncontent 4\nXYZ']
# prefixmatch ^^^^^ ^^^^^^^^^^^^ rest ^^^^^^^^^^^^^^^^
prefixmatch, rest = prefix.split(s)[-2:]
# suffix.split(rest,1) == ['\ncontent 3\n', 'XYZ', '\ncontent 4\nXYZ']
# ^^ interior ^^ ^^^^^ suffixmatch
interior, suffixmatch = suffix.split(rest,1)[:2]
# join the parts up.
result = '%s%s%s' % (prefixmatch, interior, suffixmatch)
# result == 'ABC\ncontent 3\nXYZ'
Some points:
there should be appropriate error handling (even just try: ... except ValueError: ... around the whole thing) to handle the case when either regex doesn't match at all, which makes the list unpacking fail.
this assumes that the desired segment occurs immediately after the last occurrence of prefix; if not, you can iterate through the results of prefix.split(s) two at a time (starting at index 1) and do the same splitting trick with suffix to find all the matches (see the sketch after this list).
this is likely to be reasonably inefficient, since it creates quite a few intermediate data structures.
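Here is a sketch of the iteration described in the second point, reusing prefix, suffix and s from above:
# parts[i] is a prefix match, parts[i+1] is the text that follows it
# (up to the next prefix match).
parts = prefix.split(s)
matches = []
for i in range(1, len(parts) - 1, 2):
    prefixmatch, rest = parts[i], parts[i + 1]
    pieces = suffix.split(rest, 1)
    if len(pieces) == 3:  # the suffix occurs somewhere in this chunk
        matches.append('%s%s%s' % (prefixmatch, pieces[0], pieces[1]))
# matches == ['ABC\ncontent 3\nXYZ']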
Related
I am trying to create a Python dictionary to map codes like 'WHM1', 'WHM2', 'WHM3', 'HISPM1', 'HISPM2', 'HISPM3', and other such iterations onto a new column containing a specific string, e.g. White or Hispanic. Using regex seems like the right path, but I am missing something here and refuse to hard-code the whole thing in the dictionary.
I have tried several iterations of regex and regexdict:
d = regexdict({'W*': 'White', 'H*': 'Hispanic'})
eeoc_nac2_All_unpivot_df['Race'] = eeoc_nac2_All_unpivot_df['EEOC_Code'].map(d)
A new column will be created with 'White' or 'Hispanic' for each row based on what is in an existing column called 'EEOC_Code'.
Your regular expressions are wrong - you appear to be using glob syntax instead of proper regular expressions.
In regex, x* means "zero or more of x" and so both your regexes will trivially match the empty string. You apparently mean
d = regexdict({'^W':'White', '^H':'Hispanic'})
instead, where the regex anchor ^ matches beginning of string.
There are several third-party packages named regexdict, so you should probably point out which one you use. I can't tell whether the ^ is necessary here, or whether the regexes need to match the input completely (I have assumed a substring match is sufficient, as is usually the case with regexes), because this sort of detail may well differ between implementations.
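If the package turns out not to behave as you expect, the same mapping can be done with plain re and no third-party dependency. A minimal sketch, assuming eeoc_nac2_All_unpivot_df['EEOC_Code'] is a pandas Series of strings (classify is a made-up helper name):
import re

patterns = {'^W': 'White', '^H': 'Hispanic'}

def classify(code):
    # Return the label of the first pattern that matches, else None.
    for pattern, label in patterns.items():
        if re.search(pattern, code):
            return label
    return None

eeoc_nac2_All_unpivot_df['Race'] = eeoc_nac2_All_unpivot_df['EEOC_Code'].map(classify)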
I'm not sure I have completely understood your problem. However, if all your labels have the structure WHM... or HISP..., then you can simply check the first character:
# Build the new column row by row instead of overwriting it each iteration.
races = []
for race in eeoc_nac2_All_unpivot_df['EEOC_Code']:
    if race.startswith('W'):
        races.append("White")
    else:
        races.append("Hispanic")
eeoc_nac2_All_unpivot_df['Race'] = races
Note: it only works if what you have inside eeoc_nac2_All_unpivot_df['EEOC_Code'] is iterable.
Here's the skinny: how do you make a character set match NOT a previously captured character?
r'(.)[^\1]' # doesn't work
Here's the uh... fat? It's part of a (simple) cryptography program. Suppose "hobo" got coded to "fxgx". The program only gets the encoded text and has to figure out what it could be, so it generates the pattern:
r'(.)(.)(.)\2' # 1st and 3rd letters *should* be different!
Now it (correctly) matches "hobo", but also matches "hoho" (think about it!). I've tried stuff like:
r'(.)([^\1])([^\1\2])\2' # also doesn't work
and MANY variations but alas! Alack...
Please help!
P.S. The work-around (which I had to implement) is to just retrieve the "hobo"s as well as the "hoho"s, and then filter the results (discarding the "hoho"s), if you catch my drift ;)
P.P.S Now I want a hoho
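A minimal sketch of the work-around described in the P.S. (match coarsely, then filter):
import re

# Grab everything that fits the coarse shape, then drop the "hoho"-style
# candidates whose 1st and 3rd letters coincide.
candidates = re.findall(r'(.)(.)(.)\2', 'hobo hoho haha toto')
decodings = [g for g in candidates if g[0] != g[2]]
# candidates == [('h','o','b'), ('h','o','h'), ('h','a','h'), ('t','o','t')]
# decodings  == [('h','o','b')]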
VVVVV THE ANSWER VVVVV
Yes, I re-re-read the documentation and it does say:
Inside the '[' and ']' of a character class, all numeric escapes are treated as characters.
As well as:
Special characters lose their special meaning inside sets.
Which pretty much means (I think) that NO, you can't do anything like:
re.compile(r'(.)[\1]') # Well you can, but it kills the back-reference!
Thanks for the help!
1st and 3rd letters should be different!
This cannot be detected using a regular expression (not just Python's implementation). More specifically, it can't be detected by an automaton without memory; you'll have to use a different kind of automaton.
The kind of grammar you're trying to discover (reduplication) is not regular. Moreover, it is not context-free.
Automata are the mechanism that allows regular expression matching to be so efficient.
I'm parsing some TV episodes that have been transcribed by different people, meaning I need to search for a variety of formats. For example, new scenes are indicated one of two ways:
[A coffee shop]
or
INT. Coffee shop - NIGHT
Right now, I match this with the following regex in Python:
re.findall("(^\[(.+?)\]$)|(^[INTEXT]{3}\. .+?$)", text)
where "text" is the text of the entire script (hence using findall). This always appears on its own line, hence the ^$
This gives me something like: (None, None, "INT. Coffee Shop - NIGHT") for example.
My question: How do you construct a regex to search for one of two complex patterns, using the | notation, without also creating submatches that you don't really want? Or is there a better way?
Many thanks.
UPDATE: I had overlooked the idea of non-capturing groups. I can accomplish what I want with:
"(?:^\[.+?\]$)|(?:^[INTEX]{3}\. .+?$)"
However, this raises a new question. I don't actually want the brackets or the INT/EXT in the scenes, just the location. I thought that I could use actual groups within the non-capturing groups, but I'm still getting those blank matches for the other alternative, like so:
import re
pattern = "(?:^\[(.+?)\]$)|(?:^[INTEX]{3}\. (.+?)$)"
examples = [
    "[coffee shop]",
    "INT. COFFEE SHOP - DAY",
    "EXT. FIELD - NIGHT",
    "[Hugh's apartment]"
]
for example in examples:
    print re.findall(pattern, example)
'''
[('coffee shop', '')]
[('', 'COFFEE SHOP - DAY')]
[('', 'FIELD - NIGHT')]
[("Hugh's apartment", '')]
'''
I can just join() them, but is there a better way?
Based on the limited examples you've provided, how about using assertions for the brackets:
re.findall("((?<=^\[)[^[\]]+(?=\]$)|^[INTEXT]{3}\. .+?$)", text)
You may be better off just using two expressions.
patterns = [r'^\[(.+?)\]$', r'^(?:INT|EXT)\. (.+?)$']
for example in examples:
    print re.findall(patterns[0], example) or re.findall(patterns[1], example)
This seems to do what you want:
(?m)^(?=(?:\[|[INTEX]{3}\.\s+)([^\]\r\n]+))(?:\[\1\]|[INTEX]{3}\. \1)$
First the lookahead peeks at the text of the scene marker, capturing it in group #1. Then the rest of the regex goes ahead and consumes the whole line containing the marker. Although now that I think about it, you don't really have to consume anything. This works, too:
result = re.findall(r"(?m)^(?=(?:\[|[INTEX]{3}\.\s+)([^\]\r\n]+))", subject)
The marker text is still captured in group #1, so it still gets added to the result of findall(). Then again, I don't see why you would want to use findall() here. If you're trying to normalize the scene markers by replacing them in place, you'll have to use the consuming version of the regex.
Also, notice the (?m). In your examples you always apply the regex to the scene markers in isolation. To pluck them out of the whole script, you have to set the MULTILINE flag, turning ^ and $ into line anchors.
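For instance, a small self-contained check (script is made-up sample text):
import re

script = """[coffee shop]
Some dialogue.
INT. COFFEE SHOP - DAY
More dialogue.
"""
markers = re.findall(r"(?m)^(?=(?:\[|[INTEX]{3}\.\s+)([^\]\r\n]+))", script)
# markers == ['coffee shop', 'COFFEE SHOP - DAY']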
For example, how could we recognize a string of the following format with a single RE:
LenOfStr:Str
An example string in this format is:
5:5:str
The string we're looking for is "5:str".
In Python, maybe something like the following (this isn't working):
r'(?P<len>\d+):(?P<str>.{int((?P=len))})'
In general, is there a way to change previously matched groups before using them, or have I just asked yet another question not meant for REs?
Thanks.
Yep, what you're describing is outside the bounds of regular expressions. Regular expressions only deal with actual character data. This provides some limited ability to make matches dependent on context (e.g., (.)\1 to match the same character twice), but you can't apply arbitrary functions to pieces of an in-progress match and use the results later in the same match.
You could do something like search for text matching the regex (\d+):\w+, and then postprocess the results to check if the string length is equal to the int value of the first part of the match. But you can't do that as part of the matching process itself.
Well this can be done with a regex (if I understand the question):
>>> import re
>>> s='5:5:str and some more characters...'
>>> m=re.search(r'^(\d+):(.*)$',s)
>>> m.group(2)[0:int(m.group(1))]
'5:str'
It just cannot be done by dynamically changing the previous match group.
You can make it look like a single regex like so:
>>> re.sub(r'^(\d+):(.*)$',lambda m: m.group(2)[0:int(m.group(1))],s)
'5:str'
I'm parsing a source file, and I want to "suppress" strings. By this I mean transforming every string literal like "bla bla bla +/*" into something deterministic like "string" that does not contain any characters that might confuse my parser, because I don't care about the value of the strings. One of the issues here is string formatting using e.g. "%s"; please see my remark about this below.
Take for example the following pseudo-code, which may be the contents of a file I'm parsing. Assume strings start with ", and that the " character is escaped by doubling it (""):
print(i)
print("hello**")
print("hel"+"lo**")
print("h e l l o "+
"hello\n")
print("hell""o")
print(str(123)+"h e l l o")
print(uppercase("h e l l o")+"g o o d b y e")
Should be transformed to the following result:
print(i)
print("string")
print("string"+"string")
print("string"
"string")
print("string")
print(str(123)+"string")
print(uppercase("string")+"string")
Currently I treat this as a special case in my code (i.e. detect the beginning of a string and "manually" run until its end, with several sub-special-cases along the way). If there's a Python library function I can use, or a nice regex that could make my code more efficient, that would be great.
A few remarks:
I would like the "start-of-string" character to be a variable, e.g. ' vs ".
I'm not parsing Python code at this stage, but I plan to, and there the problem obviously becomes more complex because strings can start in several ways and must end in a way corresponding to the start. I'm not attempting to deal with that right now, but if there's any well-established best practice I would like to know about it.
The thing bothering me the most about this "suppression" is the case of string formatting with the likes of '%s', which are meaningful tokens. I'm currently not dealing with this and haven't completely thought it through, but if any of you have suggestions about how to deal with it, that would be great. Please note I'm not interested in the specific type or formatting of the in-string tokens; it's enough for me to know that there are tokens inside the string (and how many). A remark that may be important here: my tokenizer is not nested, because my goal is quite simple (I'm not compiling anything...).
I'm not quite sure about the escaping of the start-string character. What would you say are the common ways this is implemented in most programming languages? Is the assumption of double-occurrence (e.g. "") or any two-character escape (e.g. '\"') enough? Do I need to handle other cases (think of languages like Java, C/C++, PHP, C#)?
Option 1: To sanitize Python source code, try the built-in tokenize module. It can correctly find strings and other tokens in any Python source file.
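A minimal sketch of that option, assuming Python 3's tokenize API (suppress_strings is a made-up name):
import io
import tokenize

def suppress_strings(source):
    # Re-emit every token unchanged, except STRING tokens, which are
    # swapped for a fixed placeholder.
    tokens = tokenize.generate_tokens(io.StringIO(source).readline)
    out = [tok._replace(string='"string"') if tok.type == tokenize.STRING else tok
           for tok in tokens]
    return tokenize.untokenize(out)

print(suppress_strings('print(uppercase("h e l l o")+"g o o d b y e")\n'))
# print(uppercase("string")+"string")
# (spacing can shift when the placeholder is longer than the original literal)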
Option 2: For most languages, you can build a custom regexp substitution. For example, the following sanitizes Python source code (but it doesn't work if the source file contains """ or '''):
import re

sanitized = re.sub(r'(#.*)|\'(?:[^\'\\]+|\\.)*\'|"(?:[^"\\]+|\\.)*"',
                   lambda match: match.group(1) or '"string"', source_code)
The regexp above works properly even if the strings contain backslashes (\", \\, \n, \\", \\\" etc. all work fine).
When you are building your regexp, make sure to match comments (so your regexp substitution won't touch strings inside comments) and regular expression literals (e.g. in Perl, Ruby and JavaScript), and pay attention that you match backslashes and newlines properly (e.g. in Perl and Ruby a string can contain a newline).
Option 3: Use pygments with HTML output, and replace anything in blue (etc.) with "string". pygments supports a few dozen languages.
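If you do take the pygments route, you can also skip the HTML and walk its token stream directly. A sketch, assuming pygments is installed; note that pygments emits one string literal as several adjacent String tokens, hence the run-collapsing:
from pygments.lexers import PythonLexer
from pygments.token import String

def suppress_strings(source):
    out, in_string = [], False
    for tok_type, value in PythonLexer().get_tokens(source):
        if tok_type in String:
            if not in_string:  # one placeholder per run of String tokens
                out.append('"string"')
            in_string = True
        else:
            out.append(value)
            in_string = False
    return ''.join(out)

print(suppress_strings('print("hell""o")'))  # print("string")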
Use a dedicated parser for each language, especially since people have already done that work for you. Most of the languages you mentioned already have a grammar.
You don't mention anywhere that you are taking an approach using a lexer and parser. If in fact you are not, have a look at e.g. the tokenize module (which is probably what you want), or the third-party module PLY (Python Lex-Yacc). Your problem needs a systematic approach, and these tools (and others) provide it.
(Note that once you have tokenized the code, you can apply another specialized tokenizer to the contents of the strings to detect special formatting directives such as %s. In this case a regular expression may do the job, though.)