Match a sequence of numbers preceded by certain text - python

How do I match a sequence of numbers preceded by certain text but not return the text, just the sequence of numbers?
For example, let's assume I have the following string:
url = "sampleurl/485734/abcdefgh/83275/"
I want to match all numbers that comes after the word sampleurl. So far, I`ve been using the following code
re.search("sampleurl/[0-9]+", url).group(0)[9:]
that works, but I'm assuming there is a fancier way of doing that instead of needing to use [9:] at the end.
For a quick reference, I've been using regex101 to check the validation of the regex.

You can place a capturing group around the part you want and refer to that group number for the match result.
re.search(r'sampleurl/(\d+)', url).group(1)
Another way would be implementing a lookaround assertion.
re.search(r'(?<=sampleurl/)\d+', url).group(0)

Related

Python regex expression example

I have an input that is valid if it has this parts:
starts with letters(upper and lower), numbers and some of the following characters (!,#,#,$,?)
begins with = and contains only of numbers
begins with "<<" and may contain anything
example: !!Hel##lo!#=7<<vbnfhfg
what is the right regex expression in python to identify if the input is valid?
I am trying with
pattern= r"([a-zA-Z0-9|!|#|#|$|?]{2,})([=]{1})([0-9]{1})([<]{2})([a-zA-Z0-9]{1,})/+"
but apparently am wrong.
For testing regex I can really recommend regex101. Makes it much easier to understand what your regex is doing and what strings it matches.
Now, for your regex pattern and the example you provided you need to remove the /+ in the end. Then it matches your example string. However, it splits it into four capture groups and not into three as I understand you want to have from your list. To split it into four caputre groups you could use this:
"([a-zA-Z0-9!##$?]{2,})([=]{1}[0-9]+)(<<.*)"
This returns the capture groups:
!!Hel##lo!#
=7
<<vbnfhfg
Notice I simplified your last group a little bit, using a dot instead of the list of characters. A dot matches anything, so change that back to your approach in case you don't want to match special characters.
Here is a link to your regex in regex101: link.

Regular expression - How to eliminate certain pattern in python

I have some articles containing match scores like 13-9, 34-12, 22-10 which I want to extract using a regular expression to find the pattern in Python. re.compile(r'[0-9]+-[0-9]')works but how can I modify to eliminate 1999-06, 2020-01? I tried re.compile(r'[0-9]{1,2}-[0-9]')but those year values return as 99-06 which is also invalid in my case.
You can match for exact number of digits required with look behind assertions, not to slice log numbers, like below
(?<!\d)\d{2}-\d{1,2}
Demo
You can avoid matching in the middle of a number with
r'(?<!\d)[0-9]{1,2}-[0-9]'
The negative lookbehind prohibits matching immediately after another digit.
Perhaps also add
(?!\d)
at the end to impose a similar restriction at the end of the match.

Regex: Another way to match structure separated by commas

I want to know if a string is a collection of, by example, numbers ([0-9]).
I this case, i'm using the regular expression [0-9](,[0-9])* to find one or more numbers separated by commas (A collection of numbers).
Is there a better way to do it? I mean a shorter expression perhaps.
I would suggest the following pattern:
(?<=^|,|\s)(\d+)
(?<=...) is a lookbehind assertion that will not be captured into the groups nor be included into the matched string. It is used to identify the starting position of the number to be matched.
You can try the above pattern interactively in the following website:
https://regex101.com/r/IKGWtA/1
\d*(,\d*)* will catch the situation where you have multiple digits before and after a comma e.g. 100,000. This regex will only grab 0,0 from that same number.

Python Regex: Match paragraph numbers

I am attempting to match paragraph numbers inside my block of text. Given the following sentence:
Refer to paragraph C.2.1a.5 for examples.
I would like to match the word C.2.1a.5.
My current code like so:
([0-9a-zA-Z]{1,2}\.)
Only matches C.2.1a. and es., which is not what I want. Is there a way to match the full C.2.1a.5 and not match es.?
https://regex101.com/r/cO8lqs/13723
I have attempted to use ^ and $, but doing so returns no matches.
You should use following regex to match the paragraph numbers in your text.
\b(?:[0-9a-zA-Z]{1,2}\.)+[0-9a-zA-Z]\b
Try this demo
Here is the explanation,
\b - Matches a word boundary hence avoiding matching partially in a large word like examples.
(?:[0-9a-zA-Z]{1,2}\.)+ - This matches an alphanumeric text with length one or two as you tried to match in your own regex.
[0-9a-zA-Z] - Finally the match ends with one alphanumeric character at the end. In case you want it to match one or two alphanumeric characters at the end too, just add {1,2} after it
\b - Matches a word boundary again to ensure it doesn't match partially in a large word.
EDIT:
As someone pointed out, in case your text has strings like A.A.A.A.A.A. or A.A.A or even 1.2 and you don't want to match these strings and only want to match strings that has exactly three dots within it, you should use following regex which is more specific in matching your paragraph numbers.
(?<!\.)\b(?:[0-9a-zA-Z]{1,2}\.){3}[0-9a-zA-Z]\b(?!\.)
This new regex matches only paragraph numbers having exactly three dots and those negative look ahead/behind ensures it doesn't match partially in large string like A.A.A.A.A.A
Updated regex demo
Check these python sample codes,
import re
s = 'Refer to paragraph C.2.1a.5 for examples. Refer to paragraph A.A.A.A.A.A.A for examples. Some more A.A.A or like 1.22'
print(re.findall(r'(?<!\.)\b(?:[0-9a-zA-Z]{1,2}\.){3}[0-9a-zA-Z]\b(?!\.)', s))
Output,
['C.2.1a.5']
Also for trying to use ^ and $, they are called start and end anchors respectively, and if you use them in your regex, then they will expect matching start of line and end of line which is not what you really intend to do hence you shouldn't be using them and like you already saw, using them won't work in this case.
If simple version is required, you can use this easy to understand and modify regex ([A-Z]{1}\.[0-9]{1,3}\.[0-9]{1,3}[a-z]{1}\.[0-9]{1,3})
I think we should keep the regex expression simple and readable.
You can use the regex
**(?:[a-zA-Z]+\.){3}[a-zA-Z]+**
Explanation -
The expression (?:[a-zA-Z]+.){3} ensures that the group (?:[a-zA-Z]+.) is to be repeated 3 times within the word. The group contains an alphabetic character followed a dot.
The word would end with an alphabetic character.
Output:
['C.2.1a.5']

Negating match if a string is just before another string

I'm struggling to get a regex to work where it matches a certain pattern, so long as isn't proceeded by another. For example,
Accessory for MyProduct01 <<< Should be classified as an accessory
MyProduct01 with accessory << Should be classified as a product
So I need to add something to my 'accessory' regex, something like 'match "accessory" so long as the word before isn't "with"'.
I have seen some examples where people are using negative lookaheads to find if a word is anywhere in the string, but I want to be a bit more specific regarding the position of the word to negate. Something like:
(?!with\s)accessory
Just use a negative look-behind in your regex:
(?<!with\s)accessory
Since Python doesn't support unbounded lookbehinds, I think you are going to have to use a lookahead similar to what you are currently using, but change the original pattern a bit.
^(?!\bwith\b.*\baccessory\b)(?=.*\b(accessory)\b)
Here, the negative lookahead is used to ensure that "accessory" doesn't come after the word "with". Then, the positive lookahead is used to ensure that the word "accessory" occurs within the string, captured with a group if you need to capture it for some reason.
Based on the way that I wrote the above, you'd want to use the search method and not the match method. In order to use match, which requires that the entire search string match the pattern, you'd need to add a bit more to the pattern:
^(?!\bwith\b.*\baccessory\b)(?=.*\b(accessory)\b).*$

Categories

Resources