Regex matching open or closed Python string - python

I'd like find the contents of all Python strings in source code as that code is being typed. I assume the string is contained within a single line, but it might not be closed yet.
Right now I have
for m in re.finditer('''(?P<open>(?:""")|"|(?:''\')|')(?:((?P<closed>.*?)(?P=open))|(?P<unclosed>.*))''', 'as"df'):
i = 3 if m.group(3) else 4
print m.group(i)
But I'd love to have a predictable match group to search on. Something like
re.finditer('''(?P<open>(?:""")|"|(?:''\')|')(.*?)(?P=open)''', line)
is nicer because the contents of the string literal will always be in match group (but this one doesn't match strings that aren't yet closed).
Edit: I'm fine with multiline matches, I just mean to make the problem simpler by excluding them from being in the input.

You can try this:
(?s)('''|"""|'|")((?:(?=([^"'\\]+|\\.|(?!\1)["']))\3)*)\1?
The quote is captured in group 1, a backreference is used at the end to close the string \1.
[^"'\\]+ | \\. | (?!\1)["'] describes allowed content:
[^"'\\]+ # all that is not a quote or a backslash
\\. # an escaped character
(?!\1)["'] # a quote that is not the captured quote
Then, to repeat these elements without risking a catastrophic backtracking, I emulate an atomic group with this trick:(?>subpattern)* => (?:(?=(subpattern))\1)*
Note: If you want to forbid multiline matches, you only need to change the allowed content to
[^"'\r\n\\]+ | \\. | (?!\1)["'] and to remove the (?s) modifier.
[EDIT]
If you want to match a backslash at the end of the string (example: text = r'''abc def ghi\), you need to change the pattern to:
multiline mode:
(?m)('''|"""|'|")((?:(?=([^"'\r\n\\]+|\\(?:.|$)|(?!\1)["']))\3)*)\1?
singleline mode:
(?s)('''|"""|'|")((?:(?=([^"'\\]+|\\(?:.|$)|(?!\1)["']))\3)*)\1?

How about this:
^("""|'''|"|')((?!\\").*?)(?:(?<!\\)\1$|$)
I'm not exactly sure as to the behaviour you would like when it comes down strings that are syntactically wrong, when you have two double quotes before you get to three (at the start and end). But from what I understand this should do the job.
Use the second match group in your code.

Related

regex: replace hyphens with en-dashes with re.sub

I am using a small function to loop over files so that any hyphens - get replaced by en-dashes – (alt + 0150).
The function I use adds some regex flavor to a solution in a related problem (how to replace a character INSIDE the text content of many files automatically?)
def mychanger(fileName):
with open(fileName,'r') as file:
str = file.read()
str = str.decode("utf-8")
str = re.sub(r"[^{]{1,4}(-)","–", str).encode("utf-8")
with open(fileName,'wb') as file:
file.write(str)
I used the regular expression [^{]{1,4}(-) because the search is actually performed on latex regression tables and I only want to replace the hyphens that occur around numbers.
To be clear: I want to replace all hyphens EXCEPT in cases where we have genuine latex code such as \cmidrule(lr){2-4}.
In this case there is a { close (within 3-4 characters max) to the hyphen and to the left of it. Of course, this hyphen should not be changed into an en-dash otherwise the latex code will break.
I think the left part condition of the exclusion is important to write the correct exception in regex. Indeed, in a regression table you can have things like -0.062\sym{***} (that is, a { on the close right of the hyphen) and in that case I do want to replace the hyphen.
A typical line in my table is
variable & -2.061\sym{***}& 4.032\sym{**} & 1.236 \\
& (-2.32) & (-2.02) & (-0.14)
However, my regex does not appear to be correct. For instance, a (-1.2) will be replaced as –1.2, dropping the parenthesis.
What is the problem here?
Thanks!
I can offer the following two step replacement:
str = "-1 Hello \cmidrule(lr){2-4} range 1-5 other stuff a-5"
str = re.sub(r"((?:^|[^{])\d+)-(\d+[^}])","\\1$\\2", str).encode("utf-8")
str = re.sub(r"(^|[^0-9])-(\d+)","\\1$\\2", str).encode("utf-8")
print(str)
The first replacement targets all ranges which are not of the LaTex form {1-9} i.e. are not contained within curly braces. The second replacement targets all numbers prepended with a non number or the start of the string.
Demo
re.sub replaces the entire match. In this case that includes the non-{ character preceding your -. You can wrap that bit in parentheses to create a \1 group and include that in your substitution (you also don't need parentheses around your –):
re.sub(r"([^{]{1,4})-",r"\1–", str)

Matching everything after series of hyphens

I'm trying to capture all the remaining text in a file after three hyphens at the start of a line (---).
Example:
Anything above this first set of hyphens should not be captured.
---
This is content. It should be captured.
Any sets of three hyphens beyond this point should be ignored.
Everything after the first set of three hyphens should be captured. The closest I've gotten is using this regex [^(---)]+$ which works slightly. It will capture everything after the hyphens, but if the user places any hyphens after that point it instead then captures after the last hyphen the user placed.
I am using this in combination with python to capture text.
If anyone can help me sort out this regex problem I'd appreciate it.
pat = re.compile(r'(?ms)^---(.*)\Z')
The (?ms) adds the MULTILINE and DOTALL flags.
The MULTILINE flag makes ^ match the beginning of lines (not just the beginning of the string.) We need this because the --- occurs at the beginning of a line, but not necessarily the beginning of the string.
The DOTALL flag makes . match any character, including newlines. We need this so that (.*) can match more than one line.
\Z matches the end of the string (as opposed to the end of a line).
For example,
import re
text = '''\
Anything above this first set of hyphens should not be captured.
---
This is content. It should be captured.
Any sets of three hyphens beyond this point should be ignored.
'''
pat = re.compile(r'(?ms)^---(.*)\Z')
print(re.search(pat, text).group(1))
prints
This is content. It should be captured.
Any sets of three hyphens beyond this point should be ignored.
Note that when you define a regex character class with brackets, [...], the stuff inside the brackets are (in general, except for hyphenated ranges like a-z) interpreted as single characters. They are not patterns. So [---] is not different than [-]. In fact, [---] is the range of characters from - to -, inclusive.
The parenthese inside the character class are interpreted as literal parentheses too, not grouping delimiters. So [(---)] is equivalent to [-()], the character class including the hyphen and left and right parentheses.
Thus the character class [^(---)]+ matches any character other than the hyphen or parentheses:
In [23]: re.search('[^(---)]+', 'foo - bar').group()
Out[23]: 'foo '
In [24]: re.search('[^(---)]+', 'foo ( bar').group()
Out[24]: 'foo '
You can see where this is going, and why it does not work for your problem.
Sorry for not directly answering your question, but I wonder if regular expressions are overcomplicating the problem? You could do something like this:
f = open('myfile', 'r')
for i in f:
if i[:3] == "---":
break
text = f.readlines()
f.close()
Or, am I missing something?
I tend to find that regular expressions are difficult enough to maintain that if you don't need their unique capabilities for a given purpose it'll be cleaner and more readable to avoid using them entirely.
s = open(myfile).read().split('\n\n---\n\n', 1)
print s[0] # first part
print s[1] # second part after the dashes
This should work for your example. The second parameter to split specifies how many times to split the string.

Finding big string sequence between two keywords within multiple lines

I have a file with the format of
sjaskdjajldlj_abc:
cdf_asjdl_dlsf1:
dfsflks %jdkeajd
sdjfls:
adkfld %dk_.(%sfj)sdaj, %kjdflajfs
afjdfj _ajhfkdjf
zjddjh -15afjkkd
xyz
and I want to find the text in between the string _abc: in the first line and xyz in the last line.
I have already tried print
re.findall(re.escape("*_abc:")+"(*)"+re.escape("xyz"),line)
But I got null.
If I understood the requirement correctly:
a1=re.search(r'_abc(.*)xyz',line,re.DOTALL)
print a1.group(1)
Use re.DOTALL which will enable . to match a newline character as well.
You used re.escape on your pattern when it contains special characters, so there's no way it will work.
>>>>re.escape("*_abc:")
'\\*_abc\\:'
This will match the actual phrase *_abc:, but that's not what you want.
Just take the re.escape calls out and it should work more or less correctly.
It sounds like you have a misunderstanding about what the * symbol means in a regular expression. It doesn't mean "match anything", but rather "repeat the previous thing zero or more times".
To match any string, you need to combine * with ., which matches any single character (almost, more on this later). The pattern .* matches any string of zero or more characters.
So, you could change your pattern to be .*abc(.*)xyz and you'd be most of the way there. However, if the prefix and suffix only exist once in the text the leading .* is unnecessary. You can omit it and just let the regular expression engine handle skipping over any unmatched characters before the abc prefix.
The one remaining issue is that you have multiple lines of text in your source text. I mentioned above that the . patter matches character, but that's not entirely true. By default it won't match a newline. For single-line texts that doesn't matter, but it will cause problems for you here. To change that behavior you can pass the flag re.DOTALL (or its shorter spelling, re.S) as a third argument to re.findall or re.search. That flag tells the regular expression system to allow the . pattern to match any character including newlines.
So, here's how you could turn your current code into a working system:
import re
def find_between(prefix, suffix, text):
pattern = r"{}.*{}".format(re.escape(prefix), re.escape(suffix))
result = re.search(pattern, text, re.DOTALL)
if result:
return result.group()
else:
return None # or perhaps raise an exception instead
I've simplified the pattern a bit, since your comment suggested that you want to get the whole matched text, not just the parts in between the prefix and suffix.

How could I get regex to start when it has reached a specific point within a string?

Say I have a string like {{ComputersRule}} and a regex like: [^\}]+. How would I get regular expressions to start at a specified point in the string, i.e. Once it has reached the third character in the string. If it's relevant, and I doubt it is, I'm working in Python version 2.7.3. Thank you.
I'd recommend using Python to grab the substring from the third character onwards, and then apply the regex to the rest.
Otherwise, you could just use the regex . (any character except newline) to gobble up the first n characters:
^.{3}([^\}]+)
Notice the ^.{3} which forces the [^\}]+ to not include the first three characters of the string (the ^ anchors to the start of the string/line). The brackets capture the bit you want to extract (so get capturing group 1).
In your particular case, if it's just a case of "I want the text inside the {{ and }}" you could do \{\{([^\}]+)\}\} or [^\{\}]+.
It appears that what you want to do is to match text within the double braces.
The trick is to specify the braces in the regex but capture the part within. In this case try
\{\{([^}]+)\}\}

Python: Regex to extract part of URL found between parentheses

I have this weirdly formatted URL. I have to extract the contents in '()'.
Sample URL : http://sampleurl.com/(K(ThinkCode))/profile/view.aspx
If I can extract ThinkCode out of it, I will be a happy man! I am having a tough time with regexing special chars like '(' and '/'.
>>> foo = re.compile( r"(?<=\(K\()[^\)]*" )
>>> foo.findall( r"http://sampleurl.com/(K(ThinkCode))/profile/view.aspx" )
['ThinkCode']
Explanation
In regex-world, a lookbehind is a way of saying "I want to match ham, but only if it's preceded by spam. We write this as (?<=spam)ham. So in this case, we want to match [^\)]*, but only if it's preceded by \(K\(.
Now \(K\( is a nice, easy regex, because it's plain text! It means, match exactly the string (K(. Notice that we have to escape the brackets (by putting \ in front of them), since otherwise the regex parser would think they were part of the regex instead of a character to match!
Finally, when you put something in square brackets in regex-world, it means "any of the characters in here is OK". If you put something inside square brackets where the first character is ^, it means "any character not in here is OK". So [^\)] means "any character that isn't a right-bracket", and [^\)]* means "as many characters as possible that aren't right-brackets".
Putting it all together, (?<=\(K\()[^\)]* means "match as many characters as you can that aren't right-brackets, preceded by the string (K(.
Oh, one last thing. Because \ means something inside strings in Python as well as inside regexes, we use raw strings -- r"spam" instead of just "spam". That tells Python to ignore the \'s.
Another way
If lookbehind is a bit complicated for you, you can also use capturing groups. The idea behind those is that the regex matches patterns, but can also remember subpatterns. That means that you don't have to worry about lookaround, because you can match the entire pattern and then just extract the subpattern inside it!
To capture a group, simply put it inside brackets: (foo) will capture foo as the first group. Then, use .groups() to spit out all the groups that you matched! This is the way the other answer works.
It's not too hard, especially since / isn't actually a special character in Python regular expressions. You just backslash the literal parens you want. How about this:
s = "http://sampleurl.com/(K(ThinkCode))/profile/view.aspx"
mo = re.match(r"http://sampleurl\.com/\(K\(([^)]+)\)\)/profile.view\.aspx", s);
print mo.group(1)
Note the use of r"" raw strings to preserve the backslashes in the regular expression pattern string.
If you want to have special characters in a regex, you need to escape them, such as \(, \/, \\.
Matching things inside of nested parenthesis is quite a bit of a pain in regex. if that format is always the same, you could use this:
\(.*?\((.*?)\).*?\)
Basically: find a open paren, match characters until you find another open paren, group characters until I see a close paren, then make sure there are two more close paren somewhere in there.
mystr = "http://sampleurl.com/(K(ThinkCode))/profile/view.aspx"
import re
re.sub(r'^.*\((\w+)\).*',r'\1',mystr)

Categories

Resources