Python, applying regex negative lookahead recursivly - python

In python, I am trying to implement a user defined regex expression by parsing it to a custom regex expression. this custom regex expression is then applied on a space-sperated string. The idea is to apply user regex on second column without using a for loop.
Stream //streams/sys_util mainline none 'sys_util'
Stream //streams/gta mainline none 'gta'
Stream //streams/gta_client development //streams/gta_cdevelop 'gta_client'
Stream //streams/gta_develop development //streams/gta 'gta_develop'
Stream //streams/gta_infrastructure development //streams/gta 'gta_infrastructure'
Stream //streams/gta_server development //streams/gta_cdevelop 'gta_server'
Stream //streams/0222_ImplAlig1.0 task none '0222_ImplAlig1.0'
Stream //streams/0377_kzo_the_wart task //streams/applications_int '0377_tta'
Expected output should be
//streams/gta
//streams/gta_client
//streams/gta_develop
//streams/gta_infrastructure
//streams/gta_server
here is my code,
import re
mystring = "..."
match_rgx = r'Stream\s(\/\/streams\/gta.*)(?!\s)'
result = re.findall(match_rgx, mystring, re.M)
NOTE: The expression inside first parenthesis can not be changed (as it is parsed from user input) so \/\/streams\/gta.* must remain as it is.
how can I improve negative look-ahead to get the desired results?

You can use:
match_rgx = 'Stream\s(//streams/gta.*?)\s'
result = re.findall(match_rgx, mystring)
By default, the operator * is greedy, so it will try to catch as much text as possible (for example: "//streams/gta mainline none" will match without the ?). But you only want the second column, so, with ? your operator become non-greedy, and stop at the minimal pattern, here, at the first occurrence of \s ("//streams/gta").
Hope this is clear, put a look at the doc (https://docs.python.org/2/library/re.html#contents-of-module-re) if it's not.
Btw, you don't have to escape the /, it is not a special character.
And it's useless to use the re.M flag if you don't use ^ or $.
Edit: Since your edit, if you don't want to catch development, some informations became useless.
Edit 2: Didn't see you don't want to change the pattern. In this case, just do:
match_rgx = 'Stream\s(\/\/streams\/gta.*?)\s'
Edit3: See comment.

Tested on https://regex101.com/ , this should do the work for all 2nd columns:
(?:\w+\s([^\s]+)\s.*[\n|\n\r]*)
And this for the GTAs 2nd column only:
(?:\w+\s(\/\/streams\/gta[^\s]*)\s.*[\n|\n\r]*)
For one line it would be just like (2nd col):
\w+\s([^\s]+)\s.*
Gta only for 1 line:
\w+\s(\/\/streams\/gta[^\s]*)\s.*

Related

Regex to match only part of certain line

I have some config file from which I need to extract only some values. For example, I have this:
PART
{
title = Some Title
description = Some description here. // this 2 params are needed
tags = qwe rty // don't need this param
...
}
I need to extract value of certain param, for example description's value. How do I do this in Python3 with regex?
Here is the regex, assuming that the file text is in txt:
import re
m = re.search(r'^\s*description\s*=\s*(.*?)(?=(//)|$)', txt, re.M)
print(m.group(1))
Let me explain.
^ matches at beginning of line.
Then \s* means zero or more spaces (or tabs)
description is your anchor for finding the value part.
After that we expect = sign with optional spaces before or after by denoting \s*=\s*.
Then we capture everything after the = and optional spaces, by denoting (.*?). This expression is captured by parenthesis. Inside the parenthesis we say match anything (the dot) as many times as you can find (the asterisk) in a non greedy manner (the question mark), that is, stop as soon as the following expression is matched.
The following expression is a lookahead expression, starting with (?= which matches the thing right after the (?=.
And that thing is actually two options, separated by the vertical bar |.
The first option, to the left of the bar says // (in parenthesis to make it atomic unit for the vertical bar choice operation), that is, the start of the comment, which, I suppose, you don't want to capture.
The second option is $, meaning the end of the line, which will be reached if there is no comment // on the line.
So we look for everything we can after the first = sign, until either we meet a // pattern, or we meet the end of the line. This is the essence of the (?=(//)|$) part.
We also need the re.M flag, to tell the regex engine that we want ^ and $ match the start and end of lines, respectively. Without the flag they match the start and end of the entire string, which isn't what we want in this case.
The better approach would be to use an established configuration file system. Python has built-in support for INI-like files in the configparser module.
However, if you just desperately need to get the string of text in that file after the description, you could do this:
def get_value_for_key(key, file):
with open(file) as f:
lines = f.readlines()
for line in lines:
line = line.lstrip()
if line.startswith(key + " ="):
return line.split("=", 1)[1].lstrip()
You can use it with a call like: get_value_for_key("description", "myfile.txt"). The method will return None if nothing is found. It is assumed that your file will be formatted where there is a space and the equals sign after the key name, e.g. key = value.
This avoids regular expressions altogether and preserves any whitespace on the right side of the value. (If that's not important to you, you can use strip instead of lstrip.)
Why avoid regular expressions? They're expensive and really not ideal for this scenario. Use simple string matching. This avoids importing a module and simplifies your code. But really I'd say to convert to a supported configuration file format.
This is a pretty simple regex, you just need a positive lookbehind, and optionally something to remove the comments. (do this by appending ?(//)? to the regex)
r"(?<=description = ).*"
Regex101 demo

Vim searching: avoid matches within comments

Using vim searching capabilities, is it possible to avoid matching on a comment block/line.
As an example, I would like to match on 'image' in this python code, but only in the code, not in the comment:
# Export the image
if export_images and i + j % 1000 == 0:
export_image(img_mask, "images/image{}.png".format(image_id))
image_id += 1
Using regular expressions I would do something like this: /^[^#].*(image).*/gm
But it does not seem to work in vim.
You can use
/^[^#].*\zsimage\ze
The \zs and \ze signalize the start and end of a match respectively.
setting the start and end of the match: \zs \ze
Note that this will not match several "image"s on a line, just the last one.
Also, perhaps, a "negative lookahead" would be better than a negated character class at the beginning:
/^#\#!.*\zsimage\ze
^^^^
The #\#! is equal to (?!#) in Python.
And since look-behinds are non-fixed-width in Vim (like (?<=pattern) in Perl, but Vim allows non-fixed-width patterns), you can match all occurrences of the character sequence image with
/\(^#\#!.*\)\#<=image
And to finally skip matching image on an indented comment line, you just need to match optional (zero or more) whitespace symbol(s) at the beginning of the line:
\(^\(\s*#\)\#!.*\)\#<=image
^^^^^^^^^^^
This \(\s*#\)\#! is equivalent to Python (?!\s*#) (match if not followed by zero or more whitespace followed with a #).
This mailing list post suggest using folds:
To search only in open folds (unfolded text):
:set fdo-=search
To fold # comments, adapting on this Vi and Vim post (where an autocmd for Python files is given):
set foldmethod=expr foldexpr=getline(v:lnum)=~'^\s*#'
However, folding by default works only on multiple lines. You need to enable folding of a single line, for single-line comments to be excluded:
set fml=0
After folding everything (zM, since I did not have anything else to be folded), a search for /image does not match anything in the comments.
A more generic way to ignore matches inside end-of-line comment markers (which does not account for the more complicated case of avoiding literal string delimiters, which could be achieved for a simpl-ish case if you want) is:
/\v%(^.{-}\/\/.{-})#<!<%(this|that|self|other)>
Where:
/ is the ex command to execute the search (remove if using the regex as part of another command, or a vimscript expression).
\v forces the "very magic" for this regex.
\/\/ is the end-of-line comment token (escaped, so the / characters are not interpreted as the end of the regex by vim). The example above works for C/C++, JavaScript, Node.JS, etc.
(^.{-}\/\/.{-})#<! is the zero-width expression that means "there should not be any comments preceding the start of the expression following this" (see vim's documentation for \#<!). In our case, this expression is just trying to find the end-of-line comment token at any point in the line (hence the ^.{-}) and making sure that the zero-width match can end up in the first character of the positive expression that follows this one (achieved by the .{-} at the end of the parenthesised expression).
<%(this|that|self|other)> can be replaced by any regex. In this case, this shows that you can match an arbitrary expression at this point, without worrying about the comments (which are handled by the zero-width expression preceding this one).
As an example, consider the following piece of code:
some.this = 'hello'
some.this = 'hello' + this.someother;
some.this = 'hello' + this.someother; // this is a comment, and this is another
The above expression will match all the this words, except the ones inside the comment (or any other //-prefixed comments, for that matter).
(Note: links all pointing to the vim-7.0 reference documentation, which should (and does in my own testing) also work in the latest vim and nvim releases as of the time of writing)

Python non-greedy regular expression is not exactly what I expected

string: XXaaaXXbbbXXcccXXdddOO
I want to match the minimal string that begin with 'XX' and end with 'OO'.
So I write the non-greedy reg: r'XX.*?OO'
>>> str = 'XXaaaXXbbbXXcccXXdddOO'
>>> re.findall(r'XX.*?OO', str)
['XXaaaXXbbbXXcccXXdddOO']
I thought it will return ['XXdddOO'] but it was so 'greedy'.
Then I know I must be mistaken, because the qualifier above will firstly match the 'XX' and then show it's 'non-greedy'.
But I still want to figure out how can I get my result ['XXdddOO'] straightly. Any reply appreciated.
Till now, the key point is actually not about non-greedy , or in other words, it is about the non-greedy in my eyes: it should match as few characters as possible between the left qualifier(XX) and the right qualifier(OO).
And of course the fact is that the string is processed from left to right.
How about:
.*(XX.*?OO)
The match will be in group 1.
Regex work from left to the right: non-greedy means that it will match XXaaaXXdddOO and not XXaaaXXdddOOiiiOO. If your data structure is that fixed, you could do:
XX[a-z]{3}OO
to select all patterns like XXiiiOO (it can be adjusted to fit your your needs, with XX[^X]+?OO for instance selecting everything in between the last XX pair before an OO up to that OO: for example in XXiiiXXdddFFcccOOlll it would match XXdddFFcccOO)
Indeed, issue is not with greedy/non-greedy… Solution suggested by #devnull should work, provided you want to avoid even a single X between your XX and OO groups.
Else, you’ll have to use a lookahead (i.e. a piece of regex that will go “scooting” the string ahead, and check whether it can be fulfilled, but without actually consuming any char). Something like that:
re.findall(r'XX(?:.(?!XX))*?OO', str)
With this negative lookahead, you match (non-greedily) any char (.) not followed by XX…
The behaviour is due to the fact that the string is processed from left to right. A way to avoid the problem is to use a negated character class:
XX(?:(?=([^XO]+|O(?!O)|X(?!X)))\1)+OO

Regular expression code is not working (Python)

Assume I have a word AB1234XZY or even 1AB1234XYZ.
I want to extract ONLY 'AB1234' or 1AB1234 (ie. everything up until the letters at the end).
I have used the following code to extract that but it's not working:
base= re.match(r"^(\D+)(\d+)", word).group(0)
When I print base, it's not working for the second case. Any ideas why?
Your regex doesn't work for the second case because it starts with a number; the \D at the beginning of your pattern matches anything that ISN'T a number.
You should be able to use something quite simple for this--simpler, in fact, than anything else I see here.
'.*\d'
That's it! This should match everything up to and including the last number in your string, and ignore everything after that.
Here's the pattern working online, so you can see for yourself.
(.+?\d+)\w+ would give you what you want.
Or even something like this
^(.+?)[a-zA-Z]+$
re.match starts at the beginning of the string, and re.search simply looks for it in the string. both return the first match. .group(0) is everything included in the match, if you had capturing groups, then .group(1) is the first group...etc etc... as opposed to normal convention where 0 is the first index, in this case, 0 is a special use case meaning everything.
in your case, depending on what you really need to capture, maybe using re.search is better. and instead of using 2 groups, you can use (\D+\d+) keep in mind, it will capture the first (non-digits,digits) group. it might be sufficient for you, but you might want to be more specific.
after reading your comment "everything before the letters at the end"
this regex is what you need:
regex = re.compile(r'(.+)[A-Za-z]')

Repeating a python regular expression until a certain char

I want to get all of the text until a ! appears. Example
some textwfwfdsfosjtortjk\n
sdsfsdfsdfsdfsdfsdfsdfsfsfsdfsdfsdf\n
sfsgdfgdfgdgdfgdg\n
!
The number of lines before the ! changes so I can't hardcode a reg exp like this
"+\n^.+\n^.+"
I am using re.MULTLINE, but should I be using re.DOTALL?
Thanks
Why does this need a regular expression?
index = str.find('!')
if index > -1:
str = str[index:] # or (index+1) to get rid of the '!', too
So you want to match everything from the beginning of the input up to (but not including) the first ! character? This should do it:
re.match(r'[^!]*', input)
If there are no exclamation points this will match the whole string. If you want to match only strings with ! in them, add a lookahead:
re.match(r'[^!]*(?=!)', input)
The MULTILINE flag is not needed because there are no anchors (^ and $), and DOTALL isn't needed because there are no dots.
Following the Python philosophy of "Easier to Ask Forgiveness Than Permission" (EAFP), I suggest you create a subroutine which is easy to understand and later maintain, should your separator change.
SEPARATOR = u"!"
def process_string(s):
try:
return s[:s.index(SEPARATOR)]
except ValueError:
return s
This function will return the string from the beginning up to, and not including, whatever you defined as separator. If the separator is not found, it will return the whole string. The function works regardless of new lines. If your separator changes, simply change SEPARATOR and you are good to go.
ValueError is the exception raised when you request the index of a character not in the string (try it in the command line: "Hola".index("1") (will raise ValueError: substring not found). The workflow then assumes that most of the time you expect the SEPARATOR character to be in the string, so you attempt that first without asking for permission (testing if SEPARATOR is in the string); if you fail (the index method raises ValueError) then you ask forgiveness (return the string as originally received). This approach (EAFP) is considered Pythonic when it applies, as it does in this case.
No regular expressions needed; this is a simple problem.
Look into a 'lookahead' for that particular character you're reading, and match the whole first part as a pattern instead.
I'm not sure exactly how Python's regex reader is different from Ruby, but you can play with it in rubular.com
Maybe something like:
([^!]*(?=\!))
(Just tried this, seems to work)
It should do the job.
re.compile('(.*?)!', re.DOTALL).match(yourString).group(1)
I think you're making this more complex than it needs to be. Your reg exp just needs to say "repeat(any character except !) followed by !". Remember [^!] means "any character except !".
So, like this:
>>> import re
>>> rexp = re.compile("([^!]*)!")
>>> test = """sdasd
... asdasdsa
... asdasdasd
... asdsadsa
... !"""
>>> rexp.findall(test)
['sdasd\nasdasdsa\nasdasdasd\nasdsadsa\n']
>>>
re.DOTALL should be sufficient:
import re
text = """some textwfwfdsfosjtortjk
sdsfsdfsdfsdfsdfsdfsdfsfsfsdfsdfsdf
sfsgdfgdfgdgdfgdg
!"""
rExp = re.compile("(.*)\!", re.S)
print rExp.search(text).groups()[0]
some textwfwfdsfosjtortjk
sdsfsdfsdfsdfsdfsdfsdfsfsfsdfsdfsdf
sfsgdfgdfgdgdfgdg

Categories

Resources