Regex for Backward Look up - python

Hi, I have a string that I want to parse with Python.
I am new to regex, so I would really appreciate help.
ABC_XYZ::A_BCD_XYZ_C9_KDFJ_7011_1_11_14
Output C9 : Always starts with letter and a digit
Output 7011: Always 4 or more digits
Output 1, 11, 14: Always at the end of the string. One or two digits. May have more than 3.
Update:
I was using [^_]+, but it returns every field between the '_' separators. I want only the matches listed above.

You can use this regex:
((?<=_)\d{4})|((?<=_)\w?\d{1})
https://regex101.com/r/0fhJFY/1
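A minimal sketch of how the lookbehind pattern above could be tried out, assuming re.finditer and reading group(0) from each match (the pattern has two alternative groups, so findall would return tuples). The variable names are illustrative. Note that \d{4} keeps only the first four digits of a longer run, so \d{4,} may be closer to the "4 or more digits" requirement.

```python
import re

text = "ABC_XYZ::A_BCD_XYZ_C9_KDFJ_7011_1_11_14"
pattern = re.compile(r"((?<=_)\d{4})|((?<=_)\w?\d{1})")

# group(0) is the whole match, whichever alternative produced it.
found = [m.group(0) for m in pattern.finditer(text)]
print(found)  # ['C9', '7011', '1', '11', '14']
```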

You might get along with
import re
def get_values(string):
    rx = re.compile(r'_([A-Z]\d)_.*?_(\d{4,}(?=_)).*?((?:_\d{1,2})+)')
    m = rx.search(string)
    if m:
        return (m.group(1), m.group(2), [item for item in m.group(3).split("_") if item])
print(get_values("ABC_XYZ::A_BCD_XYZ_C9_KDFJ_7011_1_11_14"))
# ('C9', '7011', ['1', '11', '14'])
See a demo for the expression on regex101.com.

I don't understand exactly what you need, but this regex:
[^_]+_[^_]+::[^_]_[^_]+_[^_]+_([A-Z]\d)_[^_]+_(\d{4,})_(\d)_(\d+)_(\d+)
gives the output you want for the string you provided.
To test and learn regex, I advise you to visit a site like this.
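As a rough illustration of the fixed-layout pattern above, the sketch below applies it with re.match and reads the five groups. The names fixed_layout, fields and shorter are illustrative. Because the pattern spells out exactly three trailing numbers, a string with only two of them does not match at all.

```python
import re

fixed_layout = re.compile(
    r"[^_]+_[^_]+::[^_]_[^_]+_[^_]+_([A-Z]\d)_[^_]+_(\d{4,})_(\d)_(\d+)_(\d+)"
)

text = "ABC_XYZ::A_BCD_XYZ_C9_KDFJ_7011_1_11_14"
m = fixed_layout.match(text)
fields = m.groups() if m else None
print(fields)  # ('C9', '7011', '1', '11', '14')

# Only two trailing numbers: the pattern requires three, so there is no match.
shorter = fixed_layout.match("ABC_XYZ::A_BCD_XYZ_C9_KDFJ_7011_1_11")
print(shorter)  # None
```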

Related

Is there a way to find (potentially) multiple results with re.search?

While parsing file names of TV shows, I would like to extract information about them to use for renaming. I have a working model, but it currently uses 28 if/elif statements for every iteration of filename I've seen over the last few years. I'd love to be able to condense this to something that I'm not ashamed of, so any help would be appreciated.
Phase one of this code repentance is to hopefully grab multiple episode numbers. I've gotten as far as the code below, but in the first entry it only displays the first episode number and not all three.
import re
def main():
    pattern = '(.*)\.S(\d+)[E(\d+)]+'
    strings = ['blah.s01e01e02e03', 'foo.s09e09', 'bar.s05e05']
    #print(strings)
    for string in strings:
        print(string)
        result = re.search("(.*)\.S(\d+)[E(\d+)]+", string, re.IGNORECASE)
        print(result.group(2))
if __name__== "__main__":
    main()
This outputs:
blah.s01e01e02e03
01
foo.s09e09
09
bar.s05e05
05
It's probably trivial, but regular expressions might as well be Cuneiform most days. Thanks in advance!
No. You can use findall to find every e\d+, but findall cannot return overlapping matches, so you cannot also require the s\d+ prefix in the same pattern (i.e. you can't distinguish the e007 in "foo.s01e006e007" from the one in "age007.s01e001"). Python's re module also does not support variable-length lookbehind, which would otherwise let you check that s\d+ comes first without consuming it.
The way to do this is to find \.s\d+((?:e\d+)+)$ then split the resultant group 1 in another step (whether by using findall with e\d+, or by splitting with (?<!^)(?=e)).
import re
text = 'blah.s01e01e02e03'
match = re.search(r'\.(s\d+)((?:e\d+)+)$', text, re.I)
season = match.group(1)
episodes = re.findall(r'e\d+', match.group(2), re.I)
print(season, episodes)
# => s01 ['e01', 'e02', 'e03']
Using re.findall instead of re.search will return a list of all matches.
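A small sketch of what a bare findall does here, and of the caveat raised earlier in the thread: it collects every e-plus-digits run, whether or not it follows the season marker. The second file name is taken from the earlier answer; the variable names are illustrative.

```python
import re

episodes = re.findall(r"e\d+", "blah.s01e01e02e03", re.IGNORECASE)
print(episodes)  # ['e01', 'e02', 'e03']

# The title itself contains "e007", which findall cannot tell apart
# from a real episode marker.
ambiguous = re.findall(r"e\d+", "age007.s01e001", re.IGNORECASE)
print(ambiguous)  # ['e007', 'e001']
```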
If you can use the regex module from PyPI, you can put a repeated capture group in the pattern and then read every repetition with .captures().
For example:
import regex
s = "blah.s01e01e02e03"
pattern = r"\.(s\d+)(e\d+)+"
m = regex.search(pattern, s, regex.IGNORECASE)
if m:
    print(m.captures(1)[0], m.captures(2))
Output:
s01 ['e01', 'e02', 'e03']
See a Python demo and a regex101 demo.
Or use .capturesdict() with named capture groups.
For example:
import regex
s = "blah.s01e01e02e03"
pattern = r"\.(?P<season>s\d+)(?P<episodes>e\d+)+"
m = regex.search(pattern, s, regex.IGNORECASE)
if m:
    print(m.capturesdict())
Output:
{'season': ['s01'], 'episodes': ['e01', 'e02', 'e03']}
See a Python demo.
Note that the notation [E(\d+)] that you used is a character class. It matches one of the listed characters: E, (, a digit, +, or ).
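The character-class point can be checked directly. In this sketch (variable names are illustrative), the asker's pattern turns out to have only two capture groups, and the bracketed part happily matches a string made of parentheses and plus signs.

```python
import re

asker_pattern = re.compile(r"(.*)\.S(\d+)[E(\d+)]+", re.IGNORECASE)
# Only (.*) and (\d+) are groups; the parentheses inside [...] are literals.
group_count = asker_pattern.groups
print(group_count)  # 2

char_class = re.compile(r"[E(\d+)]+")
matches_symbols = char_class.fullmatch("E(12+)") is not None
matches_episodes = char_class.fullmatch("E01E02") is not None
print(matches_symbols, matches_episodes)  # True True
```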

regex to match version number

Hi everyone, I have parsed some data that I want to match.
I have two strings, which I lowercase and split with:
technologytitle=technologytitle.lower()
vulntitle=vulntitle.lower()
ree1=re.split(technologytitle, vulntitle)
This produces the following:
['\nmultiple cross-site scripting (xss) vulnerabilities in', '9.0.1 and earlier\n\n\n\n\n']
I am now trying to write a re.match call that matches the second value:
ree2=re.match(r'^[0-9].[0-9]*$', ree1[1])
print("ree2 {}".format(ree2))
However, this returns None.
Any thoughts? Thanks.
It is unclear whether you want the whole version string or its individual parts, but you can get both without ^ or $:
import re
regex = r'((?P<major>\d+)\.(?P<minor>\d+)\.(?P<patch>\d+))'
s = '9.0.1 and earlier\n\n\n\n\n'
matches = re.search(regex, s)
print(matches.group(0))
for v in ['major', 'minor', 'patch']:
    print(v, matches.group(v))
Output:
9.0.1
major 9
minor 0
patch 1
The dollar sign anchors the match at the end of the string, and your string does not end right after the digits, so re.match returns None. I used this pattern and it worked for me:
regexPattern = "[0-9].*[0-9]"
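To make the anchoring point concrete, here is a minimal sketch using the string from the question. It also shows that the greedy [0-9].*[0-9] pattern runs to the last digit on the line, so a stricter alternative (an assumption of mine, not something from the thread) is included for comparison.

```python
import re

s = "9.0.1 and earlier\n\n\n\n\n"

# Anchored at both ends: the text after "9.0" prevents $ from matching.
anchored = re.match(r"^[0-9].[0-9]*$", s)
print(anchored)  # None

# Unanchored and greedy: from the first digit to the last digit on the line.
loose = re.search(r"[0-9].*[0-9]", s).group(0)
print(loose)  # 9.0.1

# The greedy version over-matches when a later digit appears on the same line.
over = re.search(r"[0-9].*[0-9]", "9.0.1 and earlier than 10").group(0)
print(over)  # 9.0.1 and earlier than 10

# Hypothetical stricter pattern: digits separated by dots only.
strict = re.search(r"\d+(?:\.\d+)+", "9.0.1 and earlier than 10").group(0)
print(strict)  # 9.0.1
```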

Regex in python: combining 2 regex expressions into one

Suppose I have the following list:
a = ['35','years','opened','7,000','churches','rev.','mr.','brandt','said','adding','denomination','national','goal','one','church','every','10,000','persons']
I want to remove all elements that contain numbers and all elements that end with a dot.
So I want to delete '35','7,000','10,000','mr.','rev.'
I can do it separately using the following regex:
regex = re.compile('[a-zA-Z\.]')
regex2 = re.compile('[0-9]')
But when I try to combine them I delete either all elements or nothing.
How can I combine two regex correctly?
This should work:
reg = re.compile(r'[a-zA-Z]+\.|[0-9,]+')
Note that your first regex is wrong because it deletes any string with a dot inside it.
To avoid this, I included [a-zA-Z]+\. in the combined regex.
Your second regex is also wrong, as it is missing a "+" and a ",", which I included in the above solution.
Here is a demo.
Also, if you assume that elements which end with a dot might contain some numbers, the complete solution should be:
reg = re.compile(r'[a-zA-Z0-9]+\.|[0-9,]+')
If you don't need to capture the result, this matches any string with a dot at the end, or any with a number in it.
\.$|\d
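As a sketch of how this short pattern could be applied to the list from the question, the snippet below uses re.search so that the pattern may match anywhere in each element. The names drop and kept are illustrative.

```python
import re

a = ['35', 'years', 'opened', '7,000', 'churches', 'rev.', 'mr.', 'brandt',
     'said', 'adding', 'denomination', 'national', 'goal', 'one', 'church',
     'every', '10,000', 'persons']

# Drop an element if it ends with a dot or contains any digit.
drop = re.compile(r"\.$|\d")
kept = [s for s in a if not drop.search(s)]
print(kept)
```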
You could use:
(?:[^\d\n]*\d)|.*\.$
See a demo on regex101.com.
Here is a way to do the job:
import re
a = ['35','years','opened','7,000','churches','rev.','mr.','brandt','said','adding','denomination','national','goal','one','church','every','10,000','per.sons']
b = []
for s in a:
    if not re.search(r'^(?:[\d,]+|.*\.)$', s):
        b.append(s)
print(b)
Output:
['years', 'opened', 'churches', 'brandt', 'said', 'adding', 'denomination', 'national', 'goal', 'one', 'church', 'every', 'per.sons']
Demo & explanation

Grabbing multiple patterns in a string using regex

In python I'm trying to grab multiple inputs from string using regular expression; however, I'm having trouble. For the string:
inputs = 12 1 345 543 2
I tried using:
match = re.match(r'\s*inputs\s*=(\s*\d+)+',string)
However, this only returns the value '2'. I'm trying to capture all the values '12','1','345','543','2' but not sure how to do this.
Any help is greatly appreciated!
EDIT: Thank you all for explaining why this does not work and for providing alternative suggestions. Sorry if this is a repeat question.
You could try something like:
re.findall(r"\d+", your_string)
You cannot do this with a single regex (unless you were using .NET), because each capturing group will only ever return one result even if it is repeated (the last one in the case of Python).
Since variable length lookbehinds are also not possible (in which case you could do (?<=inputs.*=.*)\d+), you will have to separate this into two steps:
match = re.match(r'\s*inputs\s*=\s*(\d+(?:\s*\d+)+)', string)
integers = re.split(r'\s+',match.group(1))
So now you capture the entire list of integers (and the spaces between them), and then you split that capture at the spaces.
The second step could also be done using findall:
integers = re.findall(r'\d+',match.group(1))
The results are identical.
You can embed your regular expression:
import re
s = 'inputs = 12 1 345 543 2'
print(re.findall(r'(\d+)', re.match(r'inputs\s*=\s*([\s\d]+)', s).group(1)))
Output:
['12', '1', '345', '543', '2']
Or do it in layers:
import re
def get_inputs(s, regex=r'inputs\s*=\s*([\s\d]+)'):
    match = re.match(regex, s)
    if not match:
        return False # or raise an exception - whatever you want
    else:
        return re.findall(r'(\d+)', match.group(1))
s = 'inputs = 12 1 345 543 2'
print(get_inputs(s))
Output:
['12', '1', '345', '543', '2']
You should look at this answer: https://stackoverflow.com/a/4651893/1129561
In short:
In Python, this isn’t possible with a single regular expression: each capture of a group overrides the last capture of that same group (in .NET, this would actually be possible since the engine distinguishes between captures and groups).
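The overriding behaviour is easy to observe. In this minimal sketch (variable names are illustrative), the repeated group from the question keeps only its final repetition, including the leading space, while findall on the same string returns every number. Running findall over the whole string is only safe when no other digits appear in it.

```python
import re

s = "inputs = 12 1 345 543 2"

# The group repeats five times, but only the last capture survives.
m = re.match(r"\s*inputs\s*=(\s*\d+)+", s)
last_capture = m.group(1)
print(repr(last_capture))  # ' 2'

all_numbers = re.findall(r"\d+", s)
print(all_numbers)  # ['12', '1', '345', '543', '2']
```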

How do I extract some string from a long string in Python?

I have a lot of long strings - not all of them have the same length and content, so that's why I can't use indices - and I want to extract a string from all of them. This is what I want to extract:
http://www.someDomainName.com/anyNumber
SomeDomainName doesn't contain any numbers, and anyNumber is different in each long string. The code should extract the desired string from any possible string and should cope with spaces and any other odd content that might appear in the long string. That should be possible with regex, right? Could anybody help me with this? Thank you.
Update: I should have said that www. and .com are always the same, and so is someDomainName. But there's another http://www. in the string.
import re
results = re.findall(r'\bhttp://www\.someDomainName\.com/\d+\b', long_string)
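A short sketch of the findall call above on a made-up input. The sample long_string, including the second http://www. address, is invented for illustration and only mimics the situation described in the update.

```python
import re

# Hypothetical input containing an unrelated http://www. URL as well.
long_string = (
    "noise http://www.other.org/page 42 "
    "then http://www.someDomainName.com/2134 and more text"
)

results = re.findall(r"\bhttp://www\.someDomainName\.com/\d+\b", long_string)
print(results)  # ['http://www.someDomainName.com/2134']
```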
>>> import re
>>> pattern = re.compile("(http://www\\.)(\\w*)(\\.com/)(\\d+)")
>>> matches = pattern.search("http://www.someDomainName.com/2134")
>>> if matches:
...     print(matches.group(0))
...     print(matches.group(1))
...     print(matches.group(2))
...     print(matches.group(3))
...     print(matches.group(4))
...
http://www.someDomainName.com/2134
http://www.
someDomainName
.com/
2134
With the above pattern, the match exposes five groups:
Group 0 is the complete string that is matched.
Groups 1 to 4 follow the order of the parentheses you see (so you are looking for the second one, (\\w*)).
If you want, you can capture only the part of the string you are interested in. To do that, remove the parentheses from the rest of the pattern and keep just (\\w*):
>>> pattern = re.compile("http://www\\.(\\w*)\\.com/\\d+")
>>> matches = pattern.search("http://www.someDomainName.com/2134")
>>> if matches:
...     print(matches.group(1))
...
someDomainName
In this second example there are no groups 2, 3 and 4, because the pattern captures only one group. Group 0 is always available: it is the complete string that matches.
Yeah, your simplest bet is regex. Here's something that will probably get the job done:
import re
matcher = re.compile(r'www\.(.+)\.com/(.+)')
matches = matcher.search(yourstring)
if matches:
    str1, str2 = matches.groups()
If you are sure that there are no dots in SomeDomainName, you can just find the first occurrence of the string ".com/" and take everything from that index on.
This avoids regexes, which are harder to maintain.
exp = 'http://www.aejlidjaelidjl.com/alieilael'
print(exp[exp.find('.com/') + 5:])
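One caveat with the find-and-slice approach: str.find returns -1 when the marker is absent, so the slice quietly returns an unrelated tail. A possible variation, sketched here with str.partition and a hypothetical helper name after_marker, makes the missing-marker case explicit.

```python
def after_marker(text, marker=".com/"):
    """Return everything after the first marker, or None if it is absent."""
    before, sep, after = text.partition(marker)
    return after if sep else None

exp = 'http://www.aejlidjaelidjl.com/alieilael'
found = after_marker(exp)
print(found)  # alieilael

missing = after_marker("no marker here")
print(missing)  # None

# For comparison, find() gives -1 here, so the slice starts at index 4.
misleading = "no marker here"["no marker here".find(".com/") + 5:]
print(misleading)  # arker here
```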
