Regex to not match expressions that contain a specific number - python

I want to match a regex like this
] prima 1 words 2 words
And not if it's
] prima 1 words 2 words 3 words
My trial is this one:
\]\s*prima\s*1([\w\s]+)\s2([\w\s][^3]+)
But it matches only part of the expression that I don't want to match at all. My exclusion is wrong. How can I do it? I need to pass it to re.compile, so it has to be one line.

This pattern will match the example data, but note that \w by itself can also match a digit.
If you want to match 1 or more whitespace characters (which could also match newlines), you could use \s+ instead of a space.
^\] prima 1 \w+ 2 \w+$
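Since the pattern has to go into re.compile, here is a minimal sketch of how this first pattern behaves on the two sample strings from the question. The variable names are only illustrative.

```python
import re

# First pattern: exactly one word after 1 and exactly one word after 2.
pattern = re.compile(r"^\] prima 1 \w+ 2 \w+$")

wanted = "] prima 1 words 2 words"
unwanted = "] prima 1 words 2 words 3 words"

print(bool(pattern.search(wanted)))    # True
print(bool(pattern.search(unwanted)))  # False

# \w also matches digits, so a digit-only "word" is accepted as well.
print(bool(pattern.search("] prima 1 9 2 9")))  # True
```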
If you want to match ] prima followed by 1 and 2, each of which can be followed by 1 or more words that cannot start with a digit:
^] prima 1 [^\W\d]\w*(?: [^\W\d]\w*)* 2 [^\W\d]\w*(?: [^\W\d]\w*)*$
^ Start of string
] prima 1 Match literally
[^\W\d]\w* Match a word that does not start with a digit
(?: [^\W\d]\w*)* Repeat 0+ times matching a space and a word that does not start with a digit
2 Match literally
[^\W\d]\w* Match a word that does not start with a digit
(?: [^\W\d]\w*)* Repeat 0+ times matching a space and a word that does not start with a digit
$ End of string
If the following words cannot consist solely of digits, you can use a negative lookahead (?!\d+\b) that rules out digits-only words:
^\] prima 1 (?!\d+\b)\w+(?: (?!\d+\b)\w+)* 2 (?!\d+\b)\w+(?: (?!\d+\b)\w+)*$
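The difference between the two stricter patterns ("no leading digit" versus "not only digits") is easiest to see side by side. The following is a small sketch; the sample strings other than the two from the question are made up for illustration, and the leading ] is escaped in both patterns for consistency.

```python
import re

# Words must not START with a digit.
no_leading_digit = re.compile(
    r"^\] prima 1 [^\W\d]\w*(?: [^\W\d]\w*)* 2 [^\W\d]\w*(?: [^\W\d]\w*)*$"
)
# Words must not consist ONLY of digits.
not_all_digits = re.compile(
    r"^\] prima 1 (?!\d+\b)\w+(?: (?!\d+\b)\w+)* 2 (?!\d+\b)\w+(?: (?!\d+\b)\w+)*$"
)

samples = [
    "] prima 1 some words 2 more words",  # accepted by both
    "] prima 1 words 2 words 3 words",    # rejected by both
    "] prima 1 word 2 2nd",               # "2nd" starts with a digit
]

results = [
    (bool(no_leading_digit.search(s)), bool(not_all_digits.search(s)))
    for s in samples
]
for s, r in zip(samples, results):
    print(s, r)
```

Only the last sample separates the two: "2nd" starts with a digit but is not made up of digits alone.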


Not able get desired output after string parsing through regex

input =
6:/BENM/Gravity Exports/REM//INV: 3267/FEB20:65:ghgh
6:/BENM/Tabuler Trading/REM//IMP/2020-341
original_regex = 6:[A-Za-z0-9 \/\.\-:] - but this matches the full string 6:/BENM/Gravity Exports/REM//INV: 3267/FEB20:65:ghgh
modified_regex_pattern = 6:[A-Za-z0-9 \/\.\-:]{1,}[\/-:]
In the first string I want the output up to
6:/BENM/Gravity Exports/REM//INV: 3267/FEB20
but it matches up to :65:
Can anyone suggest a better way to write this?
Example as below
https://regex101.com/r/pAduvy/1
You could for example use a capturing group with an optional part at the end to match the :digits:a-z part.
(6:[A-Za-z0-9 \/.:-]+?)(?::\d+:[a-z]+)?$
( Capture group 1
6:[A-Za-z0-9 \/.:-]+? Match 6: and then any of the characters listed in the character class, as few times as possible
) Close group 1
(?::\d+:[a-z]+)? optionally match the part at the end that you don't want to include
$ End of string
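As a rough sketch of how the capture group can be used in Python, the two input lines from the question are checked one at a time and group 1 is read from each match. The names `pattern`, `lines` and `results` are illustrative.

```python
import re

pattern = re.compile(r"(6:[A-Za-z0-9 \/.:-]+?)(?::\d+:[a-z]+)?$")

lines = [
    "6:/BENM/Gravity Exports/REM//INV: 3267/FEB20:65:ghgh",
    "6:/BENM/Tabuler Trading/REM//IMP/2020-341",
]

results = []
for line in lines:
    m = pattern.search(line)
    if m:
        # Group 1 holds the part before the optional :digits:letters tail.
        results.append(m.group(1))

print(results)
```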
Note: not sure if it is intended, but the last part of your pattern, [\/-:], denotes a range from ASCII 47 (/) to 58 (:), so it also matches the digits 0-9 and does not match a literal hyphen.
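A quick way to see the effect of that range is to run both spellings of the character class over the same characters. This is only an illustrative check; the test string is made up.

```python
import re

chars = "/0159:-"

# Hyphen between / and : is read as a range (ASCII 47 to 58).
as_range = re.findall(r"[\/-:]", chars)
# Hyphen at the end of the class is a literal hyphen.
as_literal = re.findall(r"[\/:-]", chars)

print(as_range)    # ['/', '0', '1', '5', '9', ':']
print(as_literal)  # ['/', ':', '-']
```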
Or a more precise pattern to get the match only
6:/\w+/\w+ \w+/[A-Z]+//[A-Z]+(?:: \d+)?/[A-Z]*\d+(?:-\d+)?
6:/\w+/\w+ Match 6: and 2 times / followed by 1+ word chars, then a space
\w+/[A-Z]+//[A-Z]+ Match 1+ word chars, / and uppercase chars, // and again uppercase chars
(?:: \d+)? Optionally match :, a space and 1+ digits
/[A-Z]*\d+ Match /, optional uppercase chars and 1+ digits
(?:-\d+)? Optionally match - and 1+ digits
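The more precise pattern returns the wanted part as the whole match, so no capture group is needed. A minimal sketch on the same two input lines, with illustrative variable names:

```python
import re

precise = re.compile(
    r"6:/\w+/\w+ \w+/[A-Z]+//[A-Z]+(?:: \d+)?/[A-Z]*\d+(?:-\d+)?"
)

lines = [
    "6:/BENM/Gravity Exports/REM//INV: 3267/FEB20:65:ghgh",
    "6:/BENM/Tabuler Trading/REM//IMP/2020-341",
]

# group(0) is the full match; lines without a match are skipped.
matches = [m.group(0) for m in map(precise.search, lines) if m]
print(matches)
```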

Match text with 4 to 5 CAPITAL ALPHABETS along with minimum 1 or maximum 2 digit number

Requirement:
4 to 5 characters made up of CAPITAL ALPHABETS and digits, with a minimum of 1 and a maximum of 2 digits
I have created a regex which matches strings of CAPITAL ALPHABETS that contain more than 1 digit, but I want to match text which has only 1 or 2 digits.
\b(?=.*\d){1,2}(?=.*[A-Z])[A-Z\d]{4,5}\b
Match Cases:
Allow
8HB8
H8ER
D5KC2
Disallow
8HB88
HEER
D54C2
Edit 1:
I should also be able to match words of that format within a sentence, not only as a standalone word.
Allow:
This is a valid 9CB8 code
This is another valid H1CS code
One option is to assert 4-5 chars [A-Z0-9].
Then match at least 1 digit 0-9 between optional chars [A-Z] and optionally match a second digit.
^(?=[A-Z0-9]{4,5}$)[A-Z]*[0-9][A-Z]*(?:[0-9][A-Z]*)?$
In parts
^ Start of string
(?=[A-Z0-9]{4,5}$) Assert 4-5 chars A-Z0-9
[A-Z]*[0-9][A-Z]* Match a digit between optional chars A-Z
(?: Non capture group
[0-9][A-Z]* Match a digit 0-9 followed by optional chars A-Z
)? Close group and make it optional
$ End of string
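To check the pattern against the Allow and Disallow lists from the question, a short sketch like the following can be used. The list and variable names are illustrative.

```python
import re

pattern = re.compile(r"^(?=[A-Z0-9]{4,5}$)[A-Z]*[0-9][A-Z]*(?:[0-9][A-Z]*)?$")

allow = ["8HB8", "H8ER", "D5KC2"]
disallow = ["8HB88", "HEER", "D54C2"]

allowed_ok = [bool(pattern.match(code)) for code in allow]
disallowed_ok = [bool(pattern.match(code)) for code in disallow]

print(allowed_ok)     # [True, True, True]
print(disallowed_ok)  # [False, False, False]
```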
So maybe you could use:
^(?=[A-Z0-9]{4,5}$)(?:\D*\d\D*){1,2}$
I based my answer on the same principle as I did here.
^ - Start of string anchor.
(?=[A-Z0-9]{4,5}$) - A positive lookahead for a minimum of 4 and a maximum of 5 characters in the range of [A-Z0-9] before the end of string anchor, $.
(?:\D*\d\D*) - A non-capture group where we have a combination of: zero or more non-digits followed by a digit and again zero or more non-digits.
{1,2} - Allow the previous non-capture group to occur a minimum of 1 and a maximum of 2 times (to make sure there are only 1 or 2 digits).
$ - End of string anchor.
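Both anchored patterns test a whole string. For the "within a sentence" requirement from the question's edit, one possible adaptation (an assumption, not part of either answer) is to swap the anchors for word boundaries and replace \D with [A-Z], so that the group cannot run across spaces. The third sentence below is made up for illustration.

```python
import re

# Hypothetical adaptation: \b instead of ^ and $, [A-Z] instead of \D.
in_text = re.compile(r"\b(?=[A-Z0-9]{4,5}\b)(?:[A-Z]*\d[A-Z]*){1,2}\b")

sentences = [
    "This is a valid 9CB8 code",
    "This is another valid H1CS code",
    "Codes 8HB88 and HEER are invalid, D5KC2 is fine",
]

found = [in_text.findall(s) for s in sentences]
print(found)  # [['9CB8'], ['H1CS'], ['D5KC2']]
```

Note that \b treats the underscore as a word character, so a code glued to an underscore would not be found by this sketch.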

Regex complete words pattern

I want to get patterns involving complete words, not pieces of words.
E.g. 12345 [some word] 1234567 [some word] 123 1679. Random text and the pattern appears again 1111 123 [word] 555.
This should return
[[12345, 1234567, 123, 1679],[1111, 123, 555]]
I am only tolerating one word between the numbers otherwise the whole string would match.
Also note that it is important to capture that 2 matches were found and so a two-element list was returned.
I am running this in python3.
I have tried:
\b(\d+)\b\s\b(\w+)?\b\s\b(\d+)\b
but I am not sure how to scale this to an unrestricted number of matches.
re.findall(r'\b(\d+)\b\s\b(\w+)?\b\s\b(\d+)\b', string)
This matches [number] [word] [number] but not any number that might follow with or without a word in between.
Are you expecting re.findall() to return a list of lists? It will only return a list - no matter what regex you use.
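One approach is to split your input string into sentences and then loop through them:
import re

# '<pattern>' is a placeholder for whatever separates the sentences in your input.
inputArray = re.split('<pattern>', inputText)
outputArray = []
for item in inputArray:
    # Raw string, so that \b is a word boundary and not a backspace character.
    outputArray.append(re.findall(r'\b(\d+)\b\s\b(\w+)?\b\s\b(\d+)\b', item))
The trick is to find a <pattern> to split your input.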
You can't do this in one operation with the Python re engine.
But you could match the sequence with one match, then extract the
digits with another.
This matches the sequence
r"(?<!\w)\d+(?:(?:[^\S\r\n]+[a-zA-Z](?:\w*[a-zA-Z])*)?[^\S\r\n]+\d+)*(?!\w)"
https://regex101.com/r/73AYLU/1
Explained
(?<! \w )                      # Not a word behind
\d+                            # Many digits
(?:                            # Optional word block
     (?:                            # Optional words
          [^\S\r\n]+                     # Horizontal whitespace
          [a-zA-Z]                       # Starts with a letter
          (?: \w* [a-zA-Z] )*            # Can be digits in middle, ends with a letter
     )?                             # End words, do once
     [^\S\r\n]+                     # Horizontal whitespace
     \d+                            # Many digits
)*                             # End word block, do many times
(?! \w )                       # Not a word ahead
This gets the array of digits from the sequence matched above (use findall)
r"(?<!\S)(\d+)(?!\S)"
https://regex101.com/r/BHov38/1
Explained
(?<! \S ) # Whitespace boundary
( \d+ ) # (1)
(?! \S ) # Whitespace boundary
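Put together, the two-step approach could look roughly like this in Python: finditer locates each sequence, and findall pulls the numbers out of it. The sample text replaces the [some word] placeholders from the question with made-up words, and converting to int is an assumption based on the desired output.

```python
import re

# Step 1: one match per run of numbers with at most one word between them.
sequence_re = re.compile(
    r"(?<!\w)\d+(?:(?:[^\S\r\n]+[a-zA-Z](?:\w*[a-zA-Z])*)?[^\S\r\n]+\d+)*(?!\w)"
)
# Step 2: the whitespace-delimited numbers inside one run.
number_re = re.compile(r"(?<!\S)(\d+)(?!\S)")

text = ("12345 foo 1234567 bar 123 1679. Random text and the pattern "
        "appears again 1111 123 baz 555.")

result = [
    [int(n) for n in number_re.findall(m.group(0))]
    for m in sequence_re.finditer(text)
]
print(result)  # [[12345, 1234567, 123, 1679], [1111, 123, 555]]
```

A number standing on its own also counts as a sequence here and would produce a one-element list.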
This is a bit complicated, maybe this expression would be just something to look into:
(((\d+)\s*)*(?:\s*\[.*?\]\s*)((\d+)\s*)*)|([A-za-z\s]+)
and script the rest of the problem for a valid solution.

Build regular expression to recognize at least a given interval

I have a regular expression given by a word and a range of words following.
For example:
pattern = 'word \\w+ \\w+ \\w+'
result = [text[match.start():match.end()] for match in re.finditer(pattern, text)]
How can I modify the regular expression so that it also matches when fewer words than that follow? For example, if the word is near the end of the string, I would like it to return that shorter interval too.
Whenever possible, it should return the longest possible match.
Your 'word \\w+ \\w+ \\w+' regex matches word and then 3 more "words" (space separated). You want to match 0 to 3 of these words. Use
re.findall(r'word(?:\s+\w+){0,3}', s)
Or, to allow any non-word chars in between the "words", replace \s with \W:
re.findall(r'word(?:\W+\w+){0,3}', s)
Details:
word - word string
(?:\s+\w+){0,3} - 0 to 3 sequences (the {0,3} is a greedy version of the limiting quantifier, it will match as many occurrences as possible) of:
\s+ - 1+ whitespaces
\w+ - 1 or more word chars.
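A short sketch of both variants on made-up input shows the greedy {0,3} taking up to three following words and falling back to fewer near the end of the string:

```python
import re

text = "word a b c d and at the end word x"

# Up to three whitespace-separated words after "word".
spaced = re.findall(r"word(?:\s+\w+){0,3}", text)
print(spaced)  # ['word a b c', 'word x']

# \W+ also allows punctuation between the words.
punctuated = re.findall(r"word(?:\W+\w+){0,3}", "word, a; b")
print(punctuated)  # ['word, a; b']
```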

How do I get strings which starts with number as start of line and ends with 5 digit number

I have text like:
asf aSD ikugfr jddc ghddfj gjn dfxg
sdgal fghfh 16 rgjodrisgj frth fghsdf,
dfghdf dfhgdh gho h ghdof 67676
szdgfads
2 adf dojosd hsh fghs,
zfgdf dhgdzsfb dfgdz,
dzgdzfvg 47564
asdgasdg asdg
4334 ersga errr ertgerfd ertera erers qereadf erfesfdc wefadfe,
sfsdgfg-43647
I need to extract every string that starts with a number at the start of a line and ends with 5 digits. There can be multiple lines in between.
2 adf dojosd hsh fghs,
zfgdf dhgdzsfb dfgdz,
dzgdzfvg 47564
4334 ersga errr ertgerfd ertera erers qereadf erfesfdc wefadfe,
sfsdgfg-43647
I have tried this regex, but it fails. It matches exactly two lines, not single lines or more than two lines together.
regex = ^[0-9](.*)(?<=,)*\n?(.*\D\d{5}\D)
Your ^[0-9](.*)(?<=,)*\n?(.*\D\d{5}\D) regex matches the start of a string/line, then 1 digit, then 0+ any characters (except newlines if DOTALL mode is not used), then (?<=,)* is supposed to check 0+ times if the preceding character is a comma (which does not make much sense though Python does not mind it), then \n? matches 1 or 0 newlines, .* matches 0+ any chars except newline, \D matches a non-digit, \d{5} matches 5 digits, and \D again matches a non-digit. Yucky. I do not think it can work for any matches that contain more than 3 lines (note that \D matches a newline), and it will never match a valid match at the end of the string as the last \D requires a character after the last 5 digits.
You may use
re.compile(r'^\d.*?\b\d{5}$', re.M|re.DOTALL)
You need to use a DOTALL modifier with the pattern so that . could match a newline, and MULTILINE modifier for the ^ and $ to match start/end of the line. The \b will not allow matching strings with more than 5 digits at the end of the line.
Use with re.findall:
import re
p = re.compile(r'^\d.*?\b\d{5}$', re.MULTILINE | re.DOTALL)
test_str = "asf aSD ikugfr jddc ghddfj gjn dfxg \nsdgal fghfh 16 rgjodrisgj frth fghsdf,\ndfghdf dfhgdh gho h ghdof 67676\n\nszdgfads\n2 adf dojosd hsh fghs, \nzfgdf dhgdzsfb dfgdz,\ndzgdzfvg 47564\n\nasdgasdg asdg\n4334 ersga errr ertgerfd ertera erers qereadf erfesfdc wefadfe, \nsfsdgfg-43647"
print(p.findall(test_str))
# => ['2 adf dojosd hsh fghs, \nzfgdf dhgdzsfb dfgdz,\ndzgdzfvg 47564', '4334 ersga errr ertgerfd ertera erers qereadf erfesfdc wefadfe, \nsfsdgfg-43647']
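To see what the \b contributes, the same pattern can be compared with and without it on a line that ends in six digits. This is only an illustrative sketch; the test string is made up.

```python
import re

flags = re.MULTILINE | re.DOTALL
with_boundary = re.compile(r"^\d.*?\b\d{5}$", flags)
without_boundary = re.compile(r"^\d.*?\d{5}$", flags)

line = "7 items, ref 123456"  # ends with six digits, not five

strict = with_boundary.findall(line)
loose = without_boundary.findall(line)

print(strict)  # []
print(loose)   # ['7 items, ref 123456']
```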
