While parsing file names of TV shows, I would like to extract information about them to use for renaming. I have a working model, but it currently uses 28 if/elif statements for every iteration of filename I've seen over the last few years. I'd love to be able to condense this to something that I'm not ashamed of, so any help would be appreciated.
Phase one of this code repentance is to hopefully grab multiple episode numbers. I've gotten as far as the code below, but in the first entry it only displays the first episode number and not all three.
import re
def main():
pattern = '(.*)\.S(\d+)[E(\d+)]+'
strings = ['blah.s01e01e02e03', 'foo.s09e09', 'bar.s05e05']
#print(strings)
for string in strings:
print(string)
result = re.search("(.*)\.S(\d+)[E(\d+)]+", string, re.IGNORECASE)
print(result.group(2))
if __name__== "__main__":
main()
This outputs:
blah.s01e01e02e03
01
foo.s09e09
09
bar.s05e05
05
It's probably trivial, but regular expressions might as well be Cuneiform most days. Thanks in advance!
No. You can use findall to find all e\d+, but it cannot find overlapping matches, which makes it impossible to use s\d+ together with it (i.e. you can't distinguish e02 in "foo.s01e006e007" from that of "age007.s01e001"), and Python doesn't let you use variable-length lookbehind (to make sure s\d+ is before it without overlapping).
The way to do this is to find \.s\d+((?:e\d+)+)$ then split the resultant group 1 in another step (whether by using findall with e\d+, or by splitting with (?<!^)(?=e)).
text = 'blah.s01e01e02e03'
match = re.search(r'\.(s\d+)((?:e\d+)+)$', text, re.I)
season = match.group(1)
episodes = re.findall(r'e\d+', match.group(2), re.I)
print(season, episodes)
# => s01 ['e01', 'e02', 'e03']
re.findall instead of re.search will return a list of all matches
If you can make use of the PyPi regex module you could make use of repeating capture groups in the pattern, and then use .captures()
For example:
import regex
s = "blah.s01e01e02e03"
pattern = r"\.(s\d+)(e\d+)+"
m = regex.search(pattern, s, regex.IGNORECASE)
if m:
print(m.captures(1)[0], m.captures(2))
Output:
s01 ['e01', 'e02', 'e03']
See a Python demo and a regex101 demo.
Or using .capturesdict () with named capture groups.
For example:
import regex
s = "blah.s01e01e02e03"
pattern = r"\.(?P<season>s\d+)(?P<episodes>e\d+)+"
m = regex.search(pattern, s, regex.IGNORECASE)
if m:
print(m.capturesdict())
Output:
{'season': ['s01'], 'episodes': ['e01', 'e02', 'e03']}
See a Python demo.
Note that the notation [E(\d+)] that you used is a character class, that matches 1 or the listed characters like E ( a digit + )
Related
Suppose I have the following list:
a = ['35','years','opened','7,000','churches','rev.','mr.','brandt','said','adding','denomination','national','goal','one','church','every','10,000','persons']
I want to remove all elements, that contain numbers and elements, that end with dots.
So I want to delete '35','7,000','10,000','mr.','rev.'
I can do it separately using the following regex:
regex = re.compile('[a-zA-Z\.]')
regex2 = re.compile('[0-9]')
But when I try to combine them I delete either all elements or nothing.
How can I combine two regex correctly?
This should work:
reg = re.compile('[a-zA-Z]+\.|[0-9,]+')
Note that your first regex is wrong because it deletes any string within a dot inside it.
To avoid this, I included [a-zA-Z]+\. in the combined regex.
Your second regex is also wrong as it misses a "+" and a ",", which I included in the above solution.
Here a demo.
Also, if you assume that elements which end with a dot might contain some numbers the complete solution should be:
reg = re.compile('[a-zA-Z0-9]+\.|[0-9,]+')
If you don't need to capture the result, this matches any string with a dot at the end, or any with a number in it.
\.$|\d
You could use:
(?:[^\d\n]*\d)|.*\.$
See a demo on regex101.com.
Here is a way to do the job:
import re
a = ['35','years','opened','7,000','churches','rev.','mr.','brandt','said','adding','denomination','national','goal','one','church','every','10,000','per.sons']
b = []
for s in a:
if not re.search(r'^(?:[\d,]+|.*\.)$', s):
b.append(s)
print b
Output:
['years', 'opened', 'churches', 'brandt', 'said', 'adding', 'denomination', 'national', 'goal', 'one', 'church', 'every', 'per.sons']
Demo & explanation
Is there any mechanism in Python for combining compiled regular expressions?
I know it's possible to compile a new expression by extracting the plain-old-string .pattern property from existing pattern objects. But this fails in several ways. For example:
import re
first = re.compile(r"(hello?\s*)")
# one-two-three or one/two/three - but not one-two/three or one/two-three
second = re.compile(r"one(?P<r1>[-/])two(?P=r1)three", re.IGNORECASE)
# Incorrect - back-reference \1 would refer to the wrong capturing group now,
# and we get an error "redefinition of group name 'r1' as group 3; was
# group 2 at position 47" for the `(?P)` group.
# Result is also now case-sensitive, unlike 'second' which is IGNORECASE
both = re.compile(first.pattern + second.pattern + second.pattern)
The result I'm looking for is achievable like so in Perl:
$first = qr{(hello?\s*)};
# one-two-three or one/two/three - but not one-two/three or one/two-three
$second = qr{one([-/])two\g{-1}three}i;
$both = qr{$first$second$second};
A test shows the results:
test($second, "...one-two-three..."); # Matches
test($both, "...hello one-two-THREEone-two-three..."); # Matches
test($both, "...hellone/Two/ThreeONE-TWO-THREE..."); # Matches
test($both, "...HELLO one/Two/ThreeONE-TWO-THREE..."); # No match
sub test {
my ($pat, $str) = #_;
print $str =~ $pat ? "Matches\n" : "No match\n";
}
Is there a library somewhere that makes this use case possible in Python? Or a built-in feature I'm missing somewhere?
(Note - one very useful feature in the Perl regex above is \g{-1}, which unambiguously refers to the immediately preceding capture group, so that there are no collisions of the type that Python is complaining about when I try to compile the combined expression. I haven't seen that anywhere in Python world, not sure if there's an alternative I haven't thought of.)
Ken, this is an interesting problem. I agree with you that the Perl solution is very slick.
I came up with something, but it is not so elegant. Maybe it gives you some idea to further explore the solution using Python. The idea is to simulate the concatenation using Python re methods.
first = re.compile(r"(hello?\s*)")
second = re.compile(r"one(?P<r1>[-/])two(?P=r1)three", re.IGNORECASE)
str="...hello one-two-THREEone/two/three..."
#str="...hellone/Two/ThreeONE-TWO-THREE..."
if re.search(first,str):
first_end_pos = re.search(first,str).end()
if re.match(second,str[first_end_pos:]):
second_end_pos = re.match(second,str[first_end_pos:]).end() + first_end_pos
if re.match(second,str[second_end_pos:]):
print ('Matches')
It will work for most of the cases but it is not working for the below case:
...hellone/Two/ThreeONE-TWO-THREE...
So, yes I admit it is not a complete solution to your problem. Hope this helps though.
I'm not a perl expert, but it doesn't seem like you're comparing apples to apples. You're using named capture groups in python, but I don't see any named capture groups in the perl example. This causes the error you mention, because this
both = re.compile(first.pattern + second.pattern + second.pattern)
tries to create two capture groups named r1
For example, if you use the regex below, then try to access group_one by name, would you get the numbers before "some text" or after?
# Not actually a valid regex
r'(?P<group_one>[0-9]*)some text(?P<group_one>[0-9]*)'
Solution 1
An easy solution is probably to remove the names from the capture groups. Also add the re.IGNORECASE to both. The code below works, although I'm not sure the resulting regex pattern will match what you want it to match.
first = re.compile(r"(hello?\s*)")
second = re.compile(r"one([-/])two([-/])three", re.IGNORECASE)
both = re.compile(first.pattern + second.pattern + second.pattern, re.IGNORECASE)
Solution 2
What I'd probably do instead is define the separate regular expressions as strings, then you can combine them however you'd like.
pattern1 = r"(hello?\s*)"
pattern2 = r"one([-/])two([-/])three"
first = re.compile(pattern1, re.IGNORECASE)
second = re.compile(pattern2, re.IGNORECASE)
both = re.compile(r"{}{}{}".format(pattern1, pattern2, pattern2), re.IGNORECASE)
Or better yet, for this specific example, don't repeat pattern2 twice, just account for the fact that it'll repeat in the regex:
both = re.compile("{}({}){{2}}".format(pattern1, pattern2), re.IGNORECASE)
which gives you the following regex:
r'(hello?\s*)(one([-/])two([-/])three){2}'
I am trying to look for multiple specific sequences in a DNA sequence within a FASTA format and then print them out. For simplicity, I made a short string sequence to show my problem.
import re
seq = "QPPLSK"
find_in_seq = re.search(r"[^P](P|K|R|H|W)", seq)
print find_in_seq.string[find_in_seq.start():find_in_seq.end()]
I only get one output of a match "QP" when there are 2 matches "QP" and "SK". How do I get to show the 2 matches instead of just only showing the first match?
Thanks
Use re.findall and change the regex so that there is no more capturing group - [^P](?:P|K|R|H|W) or [^P][PKRHW]:
import re
seq = "QPPLSK"
find_in_seq = re.findall(r"[^P][PKRHW]", str(seq))
print(find_in_seq)
See the Python demo
Note that if you want to match any letter other than P, you'd better use [A-OQ-Z].
I'm looking for a squaremeter term in some kind of text using this RegExpression:
([0-9]{1,3}[\.|,]?[0-9]{1,2}?)\s?m\s?[qm|m\u00B2]
Works pretty well.
Now, this thing should only be matched if before OR after it, a string like "Wohnfläche"/"Wohnfl"/"Wfl" exists. In other words: the latter term is mandatory, however its positon is not.
Writing a RegEx for this is not the issue in general, my problem is how to write it most elegant. Currently I only see one approach:
^[.]*[Wohnfläche|Wohnfl|Wfl]([0-9]{1,3}[\.|,]?[0-9]{1,2}?)\s?m\s?[qm|m\u00B2]
new search, kombined with 'or' statement (I'm using Python)
([0-9]{1,3}[\.|,]?[0-9]{1,2}?)\s?m\s?[qm|m\u00B2][.]*[Wohnfläche|Wohnfl|Wfl]$
Ugly, isn't it? ;)
You can use alternation like this:
(?:Wohnfläche|Wohnfl|Wfl)\s*(\d{1,3}(?:[.,]\d{1,2})?)\s?m\s?(qm|m\u00B2)|(\d{1,3}(?:[.,]\d{1,2})?)\s?m\s?(qm|m\u00B2)\s*(?:Wohnfläche|Wohnfl|Wfl)
And check which capture group matched. It is just not possible to use the restrictive strings optionally in the regex on both sides, the will just be ignored.
See the regex demo
IDEONE demo:
import re
pat = re.compile(r'(?:Wohnfläche|Wohnfl|Wfl)\s*(\d{1,3}(?:[.,]\d{1,2})?)\s?m\s?(qm|m\u00B2)|(\d{1,3}(?:[.,]\d{1,2})?)\s?m\s?(qm|m\u00B2)\s*(?:Wohnfläche|Wohnfl|Wfl)')
strs = ["12,56m qm Wohnfläche", "14.54 mqm Wohnfl", "Wfl 134 m qm"]
for x in strs:
m = pat.search(x)
if m:
if m.group(1): # First alternative found a match
print("{}".format(m.group(1), " - ", m.group(2)))
else: # Second alternative "won"
print("{}".format(m.group(3), " - ", m.group(4)))
Specify a logical conjunction in the controlling application, like (pseudo-code) <area-regex>.match(string) and <text-regex>.match(string).
This assumes that any pair of matches of the two regexen on the same string will never overlap ( if they did, you'd get a false positive ). Your regexen meet this requirement.
Note that your regex for the textual context contains the additional restriction that your test string either starts or ends with a match, while in your informal description you just require a match to either occur before or after the area spec. This difference is incorporated in pt vs pt_anchored in the code below.
Python fragment (untested):
import re
...
# pa: <area_regex>
# pt: <text_regex>
# pt_anchored: <text_regex>, anchored
#
pa = re.compile ( r'([0-9]{1,3}[\.|,]?[0-9]{1,2}?)\s?m\s?[qm|m\u00B2]' )
pt = re.compile ( r'[.]*[Wohnfläche|Wohnfl|Wfl]' )
pt_anchored = re.compile ( r'^[.]*[Wohnfläche|Wohnfl|Wfl]|[.]*[Wohnfläche|Wohnfl|Wfl]$' )
if pa.match(<teststring>) and pt.match(<teststring>):
print 'Match found: '
else:
print 'No match'
...
I have a lot of long strings - not all of them have the same length and content, so that's why I can't use indices - and I want to extract a string from all of them. This is what I want to extract:
http://www.someDomainName.com/anyNumber
SomeDomainName doesn't contain any numbers and and anyNumber is different in each long string. The code should extract the desired string from any string possible and should take into account spaces and any other weird thing that might appear in the long string - should be possible with regex right? -. Could anybody help me with this? Thank you.
Update: I should have said that www. and .com are always the same. Also someDomainName! But there's another http://www. in the string
import re
results = re.findall(r'\bhttp://www\.someDomainName\.com/\d+\b', long_string)
>>> import re
>>> pattern = re.compile("(http://www\\.)(\\w*)(\\.com/)(\\d+)")
>>> matches = pattern.search("http://www.someDomainName.com/2134")
>>> if matches:
print matches.group(0)
print matches.group(1)
print matches.group(2)
print matches.group(3)
print matches.group(4)
http://www.someDomainName.com/2134
http://www.
someDomainName
.com/
2134
In the above pattern, we have captured 5 groups -
One is the complete string that is matched
Rest are in the order of the brackets you see.. (So, you are looking for the second one..) - (\\w*)
If you want, you can capture only the part of the string you are interested in.. So, you can remove the brackets from rest of the pattern that you don't want and just keep (\w*)
>>> pattern = re.compile("http://www\\.(\\w*)\\.com/\\d+")
>>> matches = patter.search("http://www.someDomainName.com/2134")
>>> if matches:
print matches.group(1)
someDomainName
In the above example, you won't have groups - 2, 3 and 4, as in the previous example, as we have captured only 1 group.. And yes group 0 is always captured.. That is the complete string that matches..
Yeah, your simplest bet is regex. Here's something that will probably get the job done:
import re
matcher = re.compile(r'www.(.+).com\/(.+)
matches = matcher.search(yourstring)
if matches:
str1,str2 = matches.groups()
If you are sure that there are no dots in SomeDomainName you can just take the first occurence of the string ".com/" and take everything from that index on
this will avoid you the use of regex which are harder to maintain
exp = 'http://www.aejlidjaelidjl.com/alieilael'
print exp[exp.find('.com/')+5:]