Regular expression to match video file names - python

To be more specific, I do not know what the exact values will be.
Essentially, I need to find 3 or 4 digit characters followed by the literal letter p,
i.e. 1080p, 720p, 576p, etc.
I have looked through a lot of material but just couldn't make it work.
Here's an example string:
Back.to.the.Future.1985.720p.BluRay.X264-AMIABLE.mkv
In this case I want to return
Back.to.the.Future.1985.
re.search("^[^\d{3,4}p]+","Back.to.the.Future.1985.720p.BluRay.X264-AMIABLE.mkv")
however, returns
'Back.to.the.Future.'
Thanks a lot

I need to find 3 or 4 digit characters followed by the literal letter p
\d{3,4}p
If you do not want to match 3456p in 123456poop, you can add word-boundary assertions, like:
\b\d{3,4}p\b
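To see the effect of the word boundaries, here is a small sketch comparing the two patterns. The variable names are only illustrative, and the sample strings are the ones from the question and from this answer.

```python
import re

# Without \b the pattern can fire inside a longer run of word characters.
loose = re.compile(r"\d{3,4}p")
# With \b on both sides the digits+p must stand alone as a token.
strict = re.compile(r"\b\d{3,4}p\b")

name = "Back.to.the.Future.1985.720p.BluRay.X264-AMIABLE.mkv"
noise = "123456poop"

loose_hit = loose.search(noise).group()    # finds a fragment inside the token
strict_hit = strict.search(noise)          # no standalone token, so no match
resolution = strict.search(name).group()   # the resolution token in the file name

print(loose_hit, strict_hit, resolution)
```

The dots in the file name count as non-word characters, which is why \b works on both sides of 720p.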

>>> import re
>>> re.search(r"(.*)\d{3,4}p",
...           "Back.to.the.Future.1985.720p.BluRay.X264-AMIABLE.mkv").groups()[0]
'Back.to.the.Future.1985.'

It sounds like what you want is "anything, followed by one or more digits, followed by 'p'". Then you want to 'capture' the "anything" part.
That looks like
(.*)\d+p
Because .* is greedy, that version leaves only the final digit for \d+ (the capture ends in 1985.72), so specify the number of digits instead:
(.*)\d{3,4}p
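If you want this as a reusable helper, a minimal sketch could look like the following. The function name title_prefix is made up for illustration, and it assumes you want everything before the first standalone 3 or 4 digit token ending in p, which is why it uses a lazy (.*?) and word boundaries rather than the greedy form above.

```python
import re

# Lazy prefix, then a standalone NNNp / NNNNp token.
RESOLUTION_PREFIX = re.compile(r"^(.*?)\b\d{3,4}p\b")

def title_prefix(filename):
    """Return the text before the first resolution token, or None if there is none."""
    m = RESOLUTION_PREFIX.search(filename)
    return m.group(1) if m else None

name = "Back.to.the.Future.1985.720p.BluRay.X264-AMIABLE.mkv"
print(title_prefix(name))
print(title_prefix("notes.txt"))
```

Checking for None before calling group() avoids an AttributeError on file names that carry no resolution.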

Related

How to search double digit numbers in python with regular expression?

I have a piece of code that records times in this format:
0.0-8.0
0.0-9.0
0.0-10.0
I want to use a regular expression that will find all of these strings and have checked here and here for help but am still confused. I understand how to do it if I only wanted to do single digit numbers, but I can't figure out how to handle double digit numbers like 10 or 20.
It is also important that the expression does not find the string
0.0-1.0
as it should be ignored.
So far my expression looks like this:
expression = re.compile(r',0\.0\-[0-2][0-9]')
If you want to match each line shown in your question, try an expression like this:
0\.0\-[0-2]?\d\.\d
\d is the same as [0-9]. The ? means 0 or 1 occurrences, so this will only match 1- or 2-digit numbers. If you need the comma at the start of the regex, add that in.
If you want to exclude 0.0-1.0, then you should do that in code, not in the regular expression, since that would make it less readable. But if you insist, I have included one that will exclude that string for you:
0\.0\-[0-2]?[0-9]\.(?<!0-1\.)\d
This uses a negative lookbehind to ensure the previous part is not 0-1., which would only occur in the match you didn't want.
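Both approaches can be compared side by side. This is only a sketch: the sample list is taken from the question, the helper name keep is made up, and treating "ignore 0.0-1.0" as "drop the entry whose upper bound is exactly 1" is an assumption.

```python
import re

# Simple pattern; the upper bound is captured so code can inspect it.
TIME_RANGE = re.compile(r"0\.0-([0-2]?\d)\.\d")
# Same pattern with the negative lookbehind from the answer.
WITH_LOOKBEHIND = re.compile(r"0\.0-[0-2]?\d\.(?<!0-1\.)\d")

lines = ["0.0-1.0", "0.0-8.0", "0.0-9.0", "0.0-10.0"]

def keep(s):
    """Filter in code: accept the line unless its upper bound is '1'."""
    m = TIME_RANGE.fullmatch(s)
    return m is not None and m.group(1) != "1"

kept_in_code = [s for s in lines if keep(s)]
kept_by_regex = [s for s in lines if WITH_LOOKBEHIND.fullmatch(s)]

print(kept_in_code)
print(kept_by_regex)
```

Both lists come out the same here; the first variant keeps the exclusion rule visible in ordinary Python.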

Python regex for int with at least 4 digits

I am just learning regex and I'm a bit confused here. I've got a string from which I want to extract an int with at least 4 digits and at most 7 digits. I tried it as follows:
>>> import re
>>> teststring = 'abcd123efg123456'
>>> re.match(r"[0-9]{4,7}$", teststring)
Where I was expecting 123456, unfortunately this results in nothing at all. Could anybody help me out a little bit here?
@ExplosionPills is correct, but there would still be two problems with your regex.
First, $ matches the end of the string. I'm guessing you'd like to be able to extract an int in the middle of the string as well, e.g. abcd123456efg789 to return 123456. To fix that, you want this:
r"[0-9]{4,7}(?![0-9])"
            ^^^^^^^^^
The added portion is a negative lookahead assertion, meaning, "...not followed by any more numbers." Let me simplify that by the use of \d though:
r"\d{4,7}(?!\d)"
That's better. Now, the second problem. You have no constraint on the left side of your regex, so given a string like abcd123efg123456789, you'd actually match 3456789. So, you need a negative lookbehind assertion as well:
r"(?<!\d)\d{4,7}(?!\d)"
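A quick way to check the final pattern is to run it against the strings mentioned above. This is a sketch; the helper name first_int is hypothetical.

```python
import re

# 4 to 7 digits that are not part of a longer run of digits.
BOUNDED_INT = re.compile(r"(?<!\d)\d{4,7}(?!\d)")

def first_int(text):
    """Return the first 4-7 digit number in text, or None."""
    m = BOUNDED_INT.search(text)
    return m.group() if m else None

print(first_int("abcd123efg123456"))
print(first_int("abcd123456efg789"))
print(first_int("abcd123efg123456789"))  # 9-digit run: rejected entirely
```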
.match will only match if the string starts with the pattern. Use .search.
You can also use:
re.findall(r"[0-9]{4,7}", teststring)
Which will return a list of all substrings that match your regex, in your case ['123456']
If you're interested in just the first matched substring, then you can write this as:
next(iter(re.findall(r"[0-9]{4,7}", teststring)), None)
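For the "first match only" case, re.search gives the same result without building the whole list. A small sketch using the test string from the question:

```python
import re

teststring = "abcd123efg123456"
pattern = re.compile(r"[0-9]{4,7}")

at_start = pattern.match(teststring)    # match() only tries position 0
anywhere = pattern.search(teststring)   # search() scans the whole string

print(at_start)
print(anywhere.group(), anywhere.span())
```

Since search returns None when nothing matches, check the result before calling group() on it.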

Parsing a regex for optional sections

Similar to this question but with a difference subtle enough that I still need some help.
Currently I have:
'(.*)\[(\d+\-\d+)\]'
as my regex, which matches any number of characters followed by square brackets [] that contain two decimals separated by a dash. My issue is, I'd like it to also match with just one decimal number between the square brackets, and possibly even with nothing in between the square brackets. So:
word[1-5] = match
word[5] = match
word[] = match (not essential)
and ensuring
word[-5] = no match
Could anyone possibly point me in the direction of the next step? I currently find regex to be a bit of a guessing game, though I would like to become better with them.
Keep your regex and make the dash-and-second-number part optional using ?:
(.*)\[(\d+(-\d+)?)\]
To also allow empty brackets, use ? again on the whole bracket content:
(.*)\[(\d+(-\d+)?)?\]
                  ^here
A working example http://rubular.com/r/t0MaHyHfeS
Use ? to match 0 or 1 occurrences.
So apply ? to the -\d+ part, and again to the whole group of digits separated by -:
(.*)\[(\d+(-\d+)?)?\]
There is no need to escape -. It has special meaning only inside a character class.
(.*)\[((\d+(?:\-\d+)?)?)\]
This will match everything, even with no digits inside the brackets, and gives you these capture groups (for match[1-5]):
group 1: match
group 2: 1-5
Not every regex interpreter supports this, but you could try an "or" operator for the part inside the brackets:
'(.*)\[(\d+\-\d+|\d+)\]'
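The four cases from the question can be checked directly. This sketch uses the doubly optional pattern with a non-capturing inner group; the function name parse is made up, and using fullmatch (so the whole string must fit the pattern) is an assumption about the intended use.

```python
import re

# name, then [], [N] or [N-M]; the bracket content is optional as a whole.
BRACKETED = re.compile(r"(.*)\[(\d+(?:-\d+)?)?\]")

def parse(text):
    """Return (name, range_or_None), or None if text does not fit the pattern."""
    m = BRACKETED.fullmatch(text)
    return m.groups() if m else None

for sample in ["word[1-5]", "word[5]", "word[]", "word[-5]"]:
    print(sample, "->", parse(sample))
```

For empty brackets the second group did not take part in the match, so Python reports it as None rather than as an empty string.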

Regular expression capturing entire match consisting of repeated groups

I've looked through the forums but could not find exactly how to solve my problem.
Let's say I have a string like the following:
UDK .636.32/38.082.4454.2(575.3)
and I would like to match the expression with a regex, capturing the actual number (in this case the '.636.32/38.082.4454.2(575.3)').
There could be some garbage characters between the 'UDK' and the actual number, and characters like '.', '/' or '-' are valid parts of the number. Essentially the number is a sequence of digits separated by some allowed characters.
What I've come up with is the following regex:
'UDK.*(\d{1,3}[\.\,\(\)\[\]\=\'\:\"\+/\-]{0,3})+'
but it does not group the '.636.32/38.082.4454.2(575.3)'! It leaves me with nothing more than a last digit of the last group (3 in this case).
Any help would be greatly appreciated.
First, you need a non-greedy .*?.
Second, you don't need to escape some chars in [ ].
Third, why not just treat it as a sequence of digits and allowed characters? Why is there a \d{1,3} when the number contains 4454?
>>> import re
>>> s = 'UDK .636.32/38.082.4454.2(575.3)'
>>> re.match(r'UDK.*?([\d.,()\[\]=\':"+/-]+)', s).group(1)
'.636.32/38.082.4454.2(575.3)'
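The reason the original attempt returned only the tail is that a quantified group keeps just its last repetition. A small sketch of that behaviour, using a cut-down set of separators (only . / ( and )) chosen for this example rather than the full list from the question:

```python
import re

s = "UDK .636.32/38.082.4454.2(575.3)"

# A repeated capturing group: group(1) holds only the final repetition,
# while group(0) is the whole matched run.
repeated = re.search(r"(\d+[./()]?)+", s)
last_piece = repeated.group(1)
whole_run = repeated.group(0)

# One group around a character class captures the run in a single piece,
# including the leading dot.
number = re.search(r"UDK.*?([\d./()]+)", s).group(1)

print(last_piece)
print(whole_run)
print(number)
```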
Not so much a direct answer to your problem, but a general regexp tip: use Kodos (http://kodos.sourceforge.net/). It is simply awesome for composing/testing out regexps. You can enter some sample text, and "try out" regular expressions against it, seeing what matches, groups, etc. It even generates Python code when you're done. Good stuff.
Edit: using Kodos I came up with:
UDK.*?(?P<number>[\d/.)(]+)
as a regexp which matches the given example. Code that Kodos produces is:
import re
rawstr = r"""UDK.*?(?P<number>[\d/.)(]+)"""
matchstr = """UDK .636.32/38.082.4454.2(575.3)"""
# method 1: using a compile object
compile_obj = re.compile(rawstr)
match_obj = compile_obj.search(matchstr)
# Retrieve group(s) by name
number = match_obj.group('number')

matching 3 or more of the same character in python

I'm trying to use regular expressions to find three or more of the same character in a string. So for example:
'hello' would not match
'ohhh' would.
I've tried doing things like:
re.compile('(?!.*(.)\1{3,})^[a-zA-Z]*$')
re.compile('(\w)\1{5,}')
but neither seems to work.
(\w)\1{2,} is the regex you are looking for.
In Python, write it as a raw string so the backreference survives: r"(\w)\1{2,}"
If you're looking for the same character three times consecutively, you can do this:
(\w)\1\1
If you want to find the same character three times anywhere in the string, you need to put a dot and an asterisk between the parts of the expression above, like so:
(\w).*\1.*\1
The .* matches any number of any character, so this expression should match any string which has any single word character that appears three or more times, with any number of any characters in between them.
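Here is a short sketch that puts the consecutive and the anywhere variants next to each other. The helper names are made up, and 'banana' is an extra sample word that is not from the question.

```python
import re

# Same word character three or more times in a row.
CONSECUTIVE = re.compile(r"(\w)\1{2,}")
# Same word character at least three times, anything in between.
ANYWHERE = re.compile(r"(\w).*\1.*\1")

def has_run(word):
    """True if some character repeats 3+ times consecutively."""
    return CONSECUTIVE.search(word) is not None

def has_three(word):
    """True if some character occurs 3+ times anywhere in the word."""
    return ANYWHERE.search(word) is not None

for word in ["hello", "ohhh", "banana"]:
    print(word, has_run(word), has_three(word))
```

Note the raw strings: without the r prefix, Python would read \1 as an octal escape instead of passing a backreference to the regex engine.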
Hope that helps.
