FutureWarning: Possible nested set at position 1 Error Python - python

I was working on something and at some point I needed to check whether a string satisfies this rule:
The string must contain at least 5 words, each separated by a hyphen (-) or an underscore (_).
Here is the code that I wrote:
password=eval(input('Password:'))
pattern=r'[[\w][-_]]{5,}'
import re
re.fullmatch(pattern,password)
But it gives this warning:
<ipython-input-32-7c87b09218f8>:4: FutureWarning: Possible nested set at position 1
re.fullmatch(pattern,password)
Why does that happen? Any ideas? Thanks in advance. By the way, I'm using Jupyter Notebook.

You can match 1+ word characters, and then repeat at least 4 times matching either - or _ followed by 1 or more word characters.
\w+(?:[-_]\w+){4,}
Explanation
\w+ Match 1+ word characters
(?: Non-capturing group to repeat as a whole
[-_] Character class matching either - or _
\w+ Match 1+ word characters
){4,} Close the non-capturing group and repeat it 4 or more times
See a regex demo.
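As for the warning itself: as far as I know, Python's re module (3.7 and later) emits this FutureWarning when a character set starts with a literal [, because nested sets may get a special meaning in a future version. The pattern from the question is not read as nested groups; it parses as the set [[\w] (a literal [ or a word character), then the set [-_], then a literal ] repeated 5 or more times. A minimal sketch, using the hyphen and underscore separators from the question (the name is_valid and the sample strings are made up for illustration):

```python
import re
import warnings

# Compiling the original pattern triggers the warning. re.purge() clears the
# module's pattern cache so the compile step really runs here.
re.purge()
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    re.compile(r'[[\w][-_]]{5,}')
warned = any(issubclass(w.category, FutureWarning) for w in caught)

# Word, then at least four "separator + word" repetitions.
WORDS = re.compile(r'\w+(?:[-_]\w+){4,}')


def is_valid(password: str) -> bool:
    """Return True if the whole string is 5+ words joined by - or _."""
    return WORDS.fullmatch(password) is not None


print(warned)
print(is_valid("one-two_three-four-five"))  # five words
print(is_valid("one-two-three"))            # only three words
```

Keep in mind that \w itself matches an underscore, so a run of underscores can also count as "words" with this pattern.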

Related

Not able get desired output after string parsing through regex

input =
6:/BENM/Gravity Exports/REM//INV: 3267/FEB20:65:ghgh
6:/BENM/Tabuler Trading/REM//IMP/2020-341
original_regex = 6:[A-Za-z0-9 \/\.\-:] - but this matches the full string 6:/BENM/Gravity Exports/REM//INV: 3267/FEB20:65:ghgh
modified_regex_pattern = 6:[A-Za-z0-9 \/\.\-:]{1,}[\/-:]
In the first string I want the output up to
6:/BENM/Gravity Exports/REM//INV: 3267/FEB20
but it matches up to :65:
Can anyone suggest a better way to write this?
Example below:
https://regex101.com/r/pAduvy/1
You could for example use a capturing group with an optional part at the end to match the :digits:a-z part.
(6:[A-Za-z0-9 \/.:-]+?)(?::\d+:[a-z]+)?$
( Capture group 1
6:[A-Za-z0-9 \/.:-]+? Match 6: and then any of the characters listed in the character class, as few times as possible
) Close group 1
(?::\d+:[a-z]+)? optionally match the part at the end that you don't want to include
$ End of string
Regex demo
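A quick way to check the capture-group pattern in Python, run against the two input lines from the question (a sketch; the variable names are only for illustration):

```python
import re

# Group 1 is lazy; the optional :digits:letters tail is matched but not captured.
PATTERN = re.compile(r'(6:[A-Za-z0-9 \/.:-]+?)(?::\d+:[a-z]+)?$')

samples = [
    "6:/BENM/Gravity Exports/REM//INV: 3267/FEB20:65:ghgh",
    "6:/BENM/Tabuler Trading/REM//IMP/2020-341",
]

results = [PATTERN.search(s).group(1) for s in samples]
for r in results:
    print(r)
# 6:/BENM/Gravity Exports/REM//INV: 3267/FEB20
# 6:/BENM/Tabuler Trading/REM//IMP/2020-341
```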
Note: not sure if it is intended, but the last part of your pattern, [\/-:], denotes a range from / (ASCII 47) to : (ASCII 58), which also includes the digits 0-9.
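The range behaviour mentioned in the note can be seen with a small test string (the string "a/0-9:b" is made up for this sketch):

```python
import re

# Inside a set, "/-:" is a range from "/" (code 47) to ":" (code 58).
# It covers the digits 0-9 but not the hyphen itself (code 45).
as_range = re.findall(r'[\/-:]', "a/0-9:b")

# Placing the hyphen last in the set makes it a literal hyphen.
as_literal = re.findall(r'[\/:-]', "a/0-9:b")

print(as_range)    # ['/', '0', '9', ':']
print(as_literal)  # ['/', '-', ':']
```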
Or a more precise pattern to get the match only
6:/\w+/\w+ \w+/[A-Z]+//[A-Z]+(?:: \d+)?/[A-Z]*\d+(?:-\d+)?
6:/\w+/\w+ Match 6: and 2 times / followed by 1+ word chars, then a space
\w+/[A-Z]+//[A-Z]+ Match 1+ word chars, / and uppercase chars, // and again uppercase chars
(?:: \d+)? Optionally match a colon, a space and 1+ digits
/[A-Z]*\d+ Match /, optional uppercase chars and 1+ digits
(?:-\d+)? Optionally match - and 1+ digits
Regex demo
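The more precise pattern returns the wanted text as the match itself, so no capture group is needed. A sketch against the same two input lines:

```python
import re

PRECISE = re.compile(
    r'6:/\w+/\w+ \w+/[A-Z]+//[A-Z]+(?:: \d+)?/[A-Z]*\d+(?:-\d+)?'
)

samples = [
    "6:/BENM/Gravity Exports/REM//INV: 3267/FEB20:65:ghgh",
    "6:/BENM/Tabuler Trading/REM//IMP/2020-341",
]

# group() with no argument returns the whole match.
matches = [PRECISE.search(s).group() for s in samples]
for m in matches:
    print(m)
# 6:/BENM/Gravity Exports/REM//INV: 3267/FEB20
# 6:/BENM/Tabuler Trading/REM//IMP/2020-341
```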

Key error when using regex quantifier python

I am trying to capture words following specified stocks in a pandas df. I have several stocks in the format $IBM and am setting a python regex pattern to search each tweet for 3-5 words following the stock if found.
My df, called stock_news, looks like this:
     Word  Count
0    $IBM     10
1  $GOOGL      8
etc
pattern = ''
for word in stock_news.Word:
    pattern += '{} (\w+\s*\S*){3,5}|'.format(re.escape(word))
My understanding is that {3,5} should act as a quantifier, in my case matching between 3 and 5 times. However, I receive the following KeyError:
KeyError: '3,5'
I have also tried using a raw string, r'{} (\w+\s*\S*){3,5}|', but to no avail. I also tried this pattern on regex101 and it seems to work there, but not in my PyCharm IDE. Any help would be appreciated.
Code for finding:
pat = re.compile(pattern, re.I)
for i in tweet_df.Tweets:
    for x in pat.findall(i):
        print(x)
When you build your pattern, an empty alternative is left at the end (after the trailing |), so the pattern can match the empty string at any position, including every position where none of the stocks matches.
You need to build the pattern like
(?:\$IBM|\$GOOGL)\s+(\w+(?:\s+\S+){3,5})
You may use
pattern = r'(?:{})\s+(\w+(?:\s+\S+){{3,5}})'.format(
    "|".join(map(re.escape, stock_news['Word'])))
Mind that literal curly braces inside an f-string or a format string must be doubled. Otherwise {3,5} is parsed as a replacement field named 3,5, which is what raises the KeyError.
Regex details
(?:\$IBM|\$GOOGL) - a non-capturing group matching either $IBM or $GOOGL
\s+ - 1+ whitespaces
(\w+(?:\s+\S+){3,5}) - Capturing group 1 (when using findall, only this part will be returned):
\w+ - 1+ word chars
(?:\s+\S+){3,5} - a non-capturing group matching three, four or five occurrences of 1+ whitespaces followed by 1+ non-whitespace characters
Note that non-capturing groups are meant to group some patterns, or quantify them, without actually allocating any memory buffer for the values they match, so that you can capture only what you need to return or keep.
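The brace-doubling rule can be checked without pandas. In this sketch the list words is a hypothetical stand-in for stock_news['Word'], and the tweet text is made up:

```python
import re

# Hypothetical stand-in for stock_news['Word'].
words = ["$IBM", "$GOOGL"]

# Single braces around 3,5 are read by str.format as a field named "3,5".
try:
    r'{} (\w+\s*\S*){3,5}|'.format(re.escape(words[0]))
    raised = False
except KeyError:
    raised = True

# Doubled braces survive formatting as a literal {3,5} quantifier.
pattern = r'(?:{})\s+(\w+(?:\s+\S+){{3,5}})'.format(
    "|".join(map(re.escape, words)))

tweet = "Today $IBM shares rose sharply after strong earnings report"
found = re.findall(pattern, tweet, re.I)

print(raised)   # True
print(pattern)  # (?:\$IBM|\$GOOGL)\s+(\w+(?:\s+\S+){3,5})
print(found)    # ['shares rose sharply after strong earnings']
```

Note that the captured text is one word plus three to five more, so four to six words in total.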

how to have regex stop searching for A after a max number of matching of B

I am trying to search for a keyword A in a group of lines with the Python re library. The number of lines in a group ranges from 3 to 5. Each line is enclosed by "<BR>" and "</BR>". The keyword A may or may not appear in the group. If it doesn't, I want to get None. A sample of the text looks like:
<BR>GROUP #1</BR>
<BR>arbitrary characters 1</BR>
<BR>arbitrary characters 2</BR>
<BR>arbitrary characters 3</BR>
<BR>GROUP #2</BR>
<BR>arbitrary characters 4</BR>
<BR>arbitrary characters 5</BR>
<BR>KEYWORD_A_2</BR>
<BR>Group #3</BR>
<BR>arbitrary characters 6</BR>
<BR>arbitrary characters 7</BR>
<BR>arbitrary characters 8</BR>
<BR>KEYWORD_A_3</BR>
....
(Note: the uppercase characters may be keywords and should appear exactly as in the original text.)
My first attempt, '<BR>Group #(\d+)</BR>.*?<BR>Keyword_A_(\d+)</BR>', obviously may cross the border between groups and return a match of (1, 2) instead of (1, None) as I wished.
My next attempt was '<BR>Group #(\d+)</BR>(?:<BR>.*?</BR>){,3}<BR>Keyword_A_(\d+)</BR>', to limit the <BR>..</BR> pairs to 3. But that is a greedy match, so 'KEYWORD_A_3' is matched and (1, 3) is returned.
So, in summary, I am trying to have the regex find 'KEYWORD_A_(\d+)' within a maximum of 5 lines after a match of 'GROUP #(\d+)'. If there is no match within 5 lines, it should stop searching, return None, and leave the regex's current position at the end of the match of 'GROUP #(\d+)', so I can start searching in the next group.
Is that possible with the re library of Python? Thanks for any help.
You may use
re.findall(r'<BR>Group\s+#(\d+)</BR>((?:(?!<BR>Group\s+#\d).)*?)<BR>Keyword_A_(\d+)</BR>', text, re.DOTALL)
See the regex demo
Details
<BR>Group - a literal <BR>Group string
\s+ - 1+ whitespaces
# - a # char
(\d+) - Capturing group 1: one or more digits
</BR> - a literal </BR> substring
((?:(?!<BR>Group\s+#\d).)*?) - Capturing group 2: any char, 0 or more occurrences but as few as possible, that does not start a <BR>Group\s+#\d pattern
<BR>Keyword_A_ - a literal substring
(\d+) - Capturing group 3: one or more digits
</BR> - a literal </BR> substring
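A sketch of how this behaves on a sample shaped like the one in the question. The sample text here is an assumption: it uses the Group / Keyword_A_ casing of the pattern, since the pattern is case-sensitive as written. Groups without a keyword simply produce no match, so the last step (my own addition, not part of the answer) fills in None for them:

```python
import re

text = (
    "<BR>Group #1</BR>\n"
    "<BR>arbitrary characters 1</BR>\n"
    "<BR>arbitrary characters 2</BR>\n"
    "<BR>arbitrary characters 3</BR>\n"
    "<BR>Group #2</BR>\n"
    "<BR>arbitrary characters 4</BR>\n"
    "<BR>arbitrary characters 5</BR>\n"
    "<BR>Keyword_A_2</BR>\n"
    "<BR>Group #3</BR>\n"
    "<BR>arbitrary characters 6</BR>\n"
    "<BR>arbitrary characters 7</BR>\n"
    "<BR>arbitrary characters 8</BR>\n"
    "<BR>Keyword_A_3</BR>\n"
)

PATTERN = re.compile(
    r'<BR>Group\s+#(\d+)</BR>((?:(?!<BR>Group\s+#\d).)*?)<BR>Keyword_A_(\d+)</BR>',
    re.DOTALL,
)

# Keep the group number and keyword number, drop the text in between.
pairs = [(group_no, keyword_no)
         for group_no, _between, keyword_no in PATTERN.findall(text)]

# Assumed extra step: list every group and report None where no keyword matched.
all_groups = re.findall(r'<BR>Group\s+#(\d+)</BR>', text)
found = dict(pairs)
report = [(g, found.get(g)) for g in all_groups]

print(pairs)   # [('2', '2'), ('3', '3')]
print(report)  # [('1', None), ('2', '2'), ('3', '3')]
```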

strange output regular expression r'[-.\:alnum:](.*)'

I expect to fetch all alphanumeric characters after "-".
For example:
>>> str1 = "12 - mystr"
>>> re.findall(r'[-.\:alnum:](.*)', str1)
[' mystr']
First, it's strange that white space is considered alphanumeric, while I expected to get ['mystr'].
Second, I cannot understand why this can be fetched, if there is no "-":
>>> str2 = "qwertyuio"
>>> re.findall(r'[-.\:alnum:](.*)', str2)
['io']
First of all, Python re does not support POSIX character classes.
The white space is not considered alphanumeric. Your first pattern matches - with [-.\:alnum:], and then (.*) captures into Group 1 all 0 or more chars other than a newline, which includes the leading space. The [-.\:alnum:] pattern matches one char that is either -, ., :, a, l, n, u or m. Thus, when run against qwertyuio, u is matched and io is captured into Group 1.
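To see which single characters the set actually matches, it helps to run it without the trailing (.*). A small sketch using the two strings from the question:

```python
import re

# The set lists individual characters: - . : a l n u m
char_set = r'[-.\:alnum:]'

in_str1 = re.findall(char_set, "12 - mystr")
in_str2 = re.findall(char_set, "qwertyuio")

print(in_str1)  # ['-', 'm']
print(in_str2)  # ['u']
```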
Alphanumeric chars can be matched with the [^\W_] pattern. So, to capture all alphanumeric chars after - that is followed with 0+ whitespaces you may use
re.findall(r'-\s*([^\W_]+)', s)
See the regex demo
Details
- - a hyphen
\s* - 0+ whitespaces
([^\W_]+) - Capturing group 1: one or more (+) chars that are letters or digits.
Python demo:
print(re.findall(r'-\s*([^\W_]+)', '12 - mystr')) # => ['mystr']
print(re.findall(r'-\s*([^\W_]+)', 'qwertyuio')) # => []
Your regex says: "Find any one of the characters -.:alnum, then capture any amount of any characters into the first capture group".
In the first test, it found - for the first character, then captured mystr in the first capture group. If any groups are in the regex, findall returns list of found groups, not the matches, so the matched - is not included.
Your second test found u as one of the -.:alnum characters (as none of qwerty matched any), then captured and returned the rest after it, io.
As @revo notes in comments, [....] is a character class - matching any one character in it. In order to include a POSIX character class (like [:alnum:]) inside it, you need two sets of brackets. Also, there is no order in a character class; the fact that you included - inside it just means it would be one of the matched characters, not that alphanumeric characters would be matched without it. Finally, if you want to match any number of alphanumerics, you have your quantifier * on the wrong thing.
Thus, "match -, then any number of alphanumeric characters" would be -([[:alnum:]]*), except... Python does not support POSIX character classes. So you have to write your own: -([A-Za-z0-9]*).
However, that will not match your string because the intervening space is, as you note, not an alphanumeric character. To account for that, use -\s*([A-Za-z0-9]*).
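A short comparison of the hand-written ASCII class with the [^\W_] form, as a sketch. The second test string, with a non-ASCII letter, is made up to show where the two differ in Python 3, where \w is Unicode-aware for str patterns:

```python
import re

ascii_result = re.findall(r'-\s*([A-Za-z0-9]*)', "12 - mystr")

# [A-Za-z0-9] stops at the first non-ASCII letter; [^\W_] does not.
ascii_accented = re.findall(r'-\s*([A-Za-z0-9]*)', "- caf\u00e9")
unicode_accented = re.findall(r'-\s*([^\W_]+)', "- caf\u00e9")

print(ascii_result)      # ['mystr']
print(ascii_accented)    # ['caf']
print(unicode_accented)  # ['café']
```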
Not quite sure what you want to match. I'll assume you don't want to include '-' in any matches.
If you want to get all alphanumeric chars after the first '-' and skip all other characters you can do something like this.
re.match(r'.*?-\s*([a-zA-Z\d]+(?:\s+[a-zA-Z\d]+)*)', inputString)
If you want to find each string of alphanumerics directly after each '-' then you can do this.
re.findall(r'(?<=-)[a-zA-Z\d]+', inputString)

Trying to repeat the regex breaks the regex

I have a working regex that matches ONE of the following lines:
A punctuation from the following list [.,!?;]
A word that is preceded by the beginning of the string or a space.
Here's the regex in question ([.,!?;] *|(?<= |\A)[\-'’:\w]+)
What I need it to do however is for it to match 3 instances of this. So, for example, the ideal end result would be something like this.
Sample text: "This is a test. Test"
Output
"This" "is" "a"
"is" "a" "test"
"a" "test" "."
"test" "." "Test"
I've tried simply adding {3} to the end in the hopes of it matching 3 times. This however results in it matching nothing at all or the occasional odd character. The other possibility I've tried is just repeating the whole regex 3 times like so ([.,!?;] *|(?<= |\A)[\-'’:\w]+)([.,!?;] *|(?<= |\A)[\-'’:\w]+)([.,!?;] *|(?<= |\A)[\-'’:\w]+) which is horrible to look at but I hoped it would work. This had the odd effect of working, but only if at least one of the matches was one of the previously listed punctuation.
Any insights would be appreciated.
I'm using the new regex module found here so that I can have overlapping searches.
What is wrong with your approach
The ([.,!?;] *|(?<= |\A)[\-'’:\w]+) pattern matches a single "unit" (either a word or a single punctuation mark from the specified set [.,!?;] followed by 0+ spaces). Thus, when you feed this pattern to regex.findall, it can only return the list of chunks ['This', 'is', 'a', 'test', '. ', 'Test'].
Solution
You can use a slightly different approach: match all words, and all chunks that are not words. Here is a demo (note that C'est and AUX-USB are treated as single "words"):
>>> pat = r"((?:[^\w\s'-]+(?=\s|\b)|\b(?<!')\w+(?:['-]\w+)*))\s*((?1))\s*((?1))"
>>> results = regex.findall(pat, text, overlapped = True)
>>> results
[("C'est", 'un', 'test'), ('un', 'test', '....'), ('test', '....', 'aux-usb')]
Here, the pattern has 3 capture groups, and the second and third ones contain the same pattern as Group 1 ((?1) is a subroutine call used to avoid repeating the Group 1 pattern). The groups may be separated by whitespace, but the whitespace is optional; otherwise punctuation glued to a word would not be matched. Also, note the negative lookbehind (?<!') that ensures that C'est is treated as a single entity.
Explanation
The pattern details:
((?:[^\w\s'-]+(?=\s|\b)|\b(?<!')\w+(?:['-]\w+)*)) - Group 1 matching:
[^\w\s'-]+(?=\s|\b) - 1+ characters other than word characters, whitespace, ' and -, immediately followed by a whitespace or a word boundary
| - or
\b(?<!')\w+(?:['-]\w+)* - 1+ word characters preceded by a word boundary (\b) and not preceded by a ' (due to (?<!')), followed by 0+ sequences of - or ' and 1+ word characters.
\s* - 0+ whitespaces
((?1)) - Group 2 (same pattern as for Group 1)
\s*((?1)) - 0+ whitespaces, then Group 3 (same pattern as for Group 1)
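If the third-party regex module is not available, a similar result can be had with the standard re module by a different technique: tokenize once with the Group 1 pattern, then slide a window of three over the tokens. This is a swapped-in approach rather than the overlapped search from the answer, and the function name triples is made up for this sketch:

```python
import re

# Same token definition as Group 1 of the pattern above.
TOKEN = re.compile(r"[^\w\s'-]+(?=\s|\b)|\b(?<!')\w+(?:['-]\w+)*")


def triples(text):
    """Return every run of three consecutive tokens as a tuple."""
    tokens = TOKEN.findall(text)
    return list(zip(tokens, tokens[1:], tokens[2:]))


for triple in triples("This is a test. Test"):
    print(triple)
# ('This', 'is', 'a')
# ('is', 'a', 'test')
# ('a', 'test', '.')
# ('test', '.', 'Test')
```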
