Regex to ensure group match doesn't end with a specific character - python

I'm having trouble coming up with a regular expression to match a particular case. I have a list of tv shows in about 4 formats:
Name.Of.Show.S01E01
Name.Of.Show.0101
Name.Of.Show.01x01
Name.Of.Show.101
What I want to match is the show name. My main problem is that my regex matches the name of the show with a trailing '.'. My regex is the following:
"^([0-9a-zA-Z\.]+)(S[0-9]{2}E[0-9]{2}|[0-9]{4}|[0-9]{2}x[0-9]{2}|[0-9]{3})"
Some Examples:
>>> import re
>>> SHOW_INFO = re.compile("^([0-9a-zA-Z\.]+)(S[0-9]{2}E[0-9]{2}|[0-9]{4}|[0-9]{2}x[0-9]{2}|[0-9]{3})")
>>> match = SHOW_INFO.match("Name.Of.Show.S01E01")
>>> match.groups()
('Name.Of.Show.', 'S01E01')
>>> match = SHOW_INFO.match("Name.Of.Show.0101")
>>> match.groups()
('Name.Of.Show.0', '101')
>>> match = SHOW_INFO.match("Name.Of.Show.01x01")
>>> match.groups()
('Name.Of.Show.', '01x01')
>>> match = SHOW_INFO.match("Name.Of.Show.101")
>>> match.groups()
('Name.Of.Show.', '101')
So the question is how do I avoid the first group ending with a period? I realize I could simply do:
var.strip(".")
However, that doesn't handle the case of "Name.Of.Show.0101". Is there a way I could improve the regex to handle that case better?
Thanks in advance.

I think this will do:
>>> regex = re.compile(r'^([0-9a-z.]+)\.(S[0-9]{2}E[0-9]{2}|[0-9]{3,4}|[0-9]{2}x[0-9]{2})$', re.I)
>>> regex.match('Name.Of.Show.01x01').groups()
('Name.Of.Show', '01x01')
>>> regex.match('Name.Of.Show.101').groups()
('Name.Of.Show', '101')
ETA: Of course, if you're just trying to extract different bits from trusted strings you could just use string methods:
>>> 'Name.Of.Show.101'.rpartition('.')
('Name.Of.Show', '.', '101')

So the only real restriction on the last group is that it doesn’t contain a dot? Easy:
^(.*?)(\.[^.]+)$
This matches anything, non-greedily. The important part is the second group, which starts with a dot and then matches any non-dot character until the end of the string.
This works with all your test cases.

It seems like the problem is that you haven't specified that the period before the last group is required, so something like ^([0-9a-zA-Z\.]+)\.(S[0-9]{2}E[0-9]{2}|[0-9]{4}|[0-9]{2}x[0-9]{2}|[0-9]{3}) might work.
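A quick way to check this suggestion is to run it over the four formats from the question. This is only a sketch: the names SHOW_INFO and split_show are made up for illustration, and the pattern is the one suggested above, unchanged.

```python
import re

# Pattern from the answer above: the dot before the episode part is required
# and sits outside both groups. SHOW_INFO and split_show are illustrative names.
SHOW_INFO = re.compile(
    r"^([0-9a-zA-Z\.]+)\."
    r"(S[0-9]{2}E[0-9]{2}|[0-9]{4}|[0-9]{2}x[0-9]{2}|[0-9]{3})"
)

def split_show(filename):
    """Return (show_name, episode_code), or None if the name does not fit."""
    match = SHOW_INFO.match(filename)
    return match.groups() if match else None

for name in ("Name.Of.Show.S01E01", "Name.Of.Show.0101",
             "Name.Of.Show.01x01", "Name.Of.Show.101"):
    print(split_show(name))
```

Because the first group is greedy, it backtracks only as far as the last dot that is followed by a valid episode code, so "Name.Of.Show.0101" should come out as ('Name.Of.Show', '0101') rather than being split inside the digits.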

I believe this will do what you want (compile it with the re.I flag, since the character class only lists lowercase letters):
^([0-9a-z\.]+)\.(?:S[0-9]{2}E[0-9]{2}|[0-9]{3,4}|[0-9]{2}(?:x[0-9]+)?)$
I tested this against the following list of shows:
30.Rock.S01E01
The.Office.0101
Lost.01x01
How.I.Met.Your.Mother.101
If those 4 cases are representative of the types of files you have, then that regex should place the show title in its own capture group and toss away the rest. This filter is, perhaps, a bit more restrictive than some others, but I'm a big fan of matching exactly what you need.
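Here is a small sketch of that test in Python. One assumption on my part: the pattern only lists lowercase letters, so it presumably was tested case-insensitively, and I pass re.IGNORECASE to get the same effect. The names PATTERN, SHOWS, strict and relaxed are just for illustration.

```python
import re

# The pattern from the answer above, unchanged.
PATTERN = r'^([0-9a-z\.]+)\.(?:S[0-9]{2}E[0-9]{2}|[0-9]{3,4}|[0-9]{2}(?:x[0-9]+)?)$'
SHOWS = ["30.Rock.S01E01", "The.Office.0101", "Lost.01x01",
         "How.I.Met.Your.Mother.101"]

strict = re.compile(PATTERN)                  # lowercase letters only
relaxed = re.compile(PATTERN, re.IGNORECASE)  # assumption: flag added here

# Group 1 is the only capture group, so it holds the show title.
titles = [relaxed.match(show).group(1) for show in SHOWS]
print(titles)

# Without the flag, the uppercase letters keep every sample from matching.
print([strict.match(show) for show in SHOWS])
```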

If the last part never contains a dot: ^(.*)\.([^\.]+)$

Related

How do I write a regular expression in Python 2.7 to return two words in a string with an underscore between them

I have strings consistent with this example:
>>> s = "plant yard !!# blah HELLO OS=puffin_CuteDeer_cat_anteater"
Every string has the "OS=" expression and its latter part comprises words linked by underscores. The first part of the string up to "OS=" and the actual words linked by underscores differ among strings.
I want to write a regular expression with the 're' module to ignore the first part of the string up to the pattern part, and then return the first two words within that pattern maintaining the underscore between them.
I want:
>>> 'puffin_CuteDeer'
I can get rid of the first part, and am getting close (I think) to handling the pattern part. Here's what I have and what it returns:
>>> example = re.search('(?<=OS=)(.*(?=_))',s)
>>> example.group(0)
'puffin_CuteDeer_cat'
I have tried many, many different possibilities and none of them are working.
I was surprised that
>>> example = re.search('(?<=OS=)(.*(?=_){2})',s)
did not work.
Your help is sincerely appreciated.
Update: I realize that there are non-regex ways of obtaining the desired output. However, for reasons that are probably beyond the scope of the question, I think regex is the best choice for me.
You can do:
(?<=OS=)[^_]+_[^_]+
The zero-width positive lookbehind, (?<=OS=), asserts that the match is immediately preceded by OS= without including it in the result.
[^_]+ matches one or more characters up to the next _, and _ matches a literal _.
Example:
In [90]: s
Out[90]: 'plant yard !!# blah HELLO OS=puffin_CuteDeer_cat_anteater'
In [91]: re.search(r'(?<=OS=)[^_]+_[^_]+', s).group()
Out[91]: 'puffin_CuteDeer'
You can try this:
import re
s = "plant yard !!# blah HELLO OS=puffin_CuteDeer_cat_anteater"
s = re.findall(r'(?<=OS=)[a-zA-Z]+_[a-zA-Z]+', s)[0]
Output:
'puffin_CuteDeer'
The following uses a capturing group (...) and negation [^...] to get the desired part:
>>> re.search(r'OS=([^_]+_[^_]+)', s).group(1)
'puffin_CuteDeer'
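If some strings might lack the OS= part, calling .group(1) directly would raise an AttributeError on None. A minimal sketch of a wrapper that guards against that, assuming returning None is acceptable for such strings (OS_PREFIX and first_two_words are hypothetical names):

```python
import re

# Same capturing-group pattern as above, compiled once.
OS_PREFIX = re.compile(r'OS=([^_]+_[^_]+)')

def first_two_words(text):
    """Return the first two underscore-joined words after OS=, or None."""
    match = OS_PREFIX.search(text)
    return match.group(1) if match else None

s = "plant yard !!# blah HELLO OS=puffin_CuteDeer_cat_anteater"
print(first_two_words(s))                 # puffin_CuteDeer
print(first_two_words("no marker here"))  # None
```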
Regex may not be necessary:
s = "plant yard !!# blah HELLO OS=puffin_CuteDeer_cat_anteater"
right_side = s.split("=")[-1]
"_".join(right_side.split("_")[:2])
# 'puffin_CuteDeer'

Regex for the repeated pattern

Can you please give me a Python regex that can match
9am, 5pm, 4:30am, 3am
Simply put, the string is a comma-separated list of times.
I know the pattern for time, here it is:
'^(\\d{1,2}|\\d{1,2}:\\d{1,2})(am|pm)$'
^(\d+(:\d+)?(am|pm)(, |$))+ will work for you.
If you have a regex X and you want a list of them separated by comma and (optional) spaces, it's a simple matter to do:
^X(,\s*X)*$
The X is, of course, your current search pattern sans anchors though you could adapt that to be shorter as well. To my mind, a better pattern for the times would be:
\d{1,2}(:\d{2})?[ap]m
meaning that the full pattern for what you want would be:
^\d{1,2}(:\d{2})?[ap]m(,\s*\d{1,2}(:\d{2})?[ap]m)*$
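Rather than typing the time pattern twice, the ^X(,\s*X)*$ recipe can be assembled from a single definition of X. A rough sketch, assuming you want the individual times back as a list once the whole string validates (TIME, TIME_LIST and parse_times are illustrative names, and I made the inner groups non-capturing so findall returns whole matches):

```python
import re

# X: one time such as 9am or 4:30am.
TIME = r'\d{1,2}(?::\d{2})?[ap]m'

# ^X(,\s*X)*$ built from the single definition above.
TIME_LIST = re.compile('^' + TIME + r'(?:,\s*' + TIME + r')*$')

def parse_times(text):
    """Return the list of times if text is a valid list, else None."""
    if not TIME_LIST.match(text):
        return None
    return re.findall(TIME, text)

print(parse_times('9am, 5pm, 4:30am, 3am'))  # ['9am', '5pm', '4:30am', '3am']
print(parse_times('9am, 5xm'))               # None
```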
You can use re.findall() to get all the matches for a given regex
>>> text = "hello world 9am, 5pm, 4:30am, 3am hai"
>>> re.findall(r'\d{1,2}(?::\d{1,2})?(?:am|pm)', text)
['9am', '5pm', '4:30am', '3am']
What does it do?
\d{1,2} matches one or two digits.
(?::\d{1,2}) matches : followed by one or two digits. The ?: prevents the regex from capturing the group.
The ? at the end makes this part optional.
(?:am|pm) matches am or pm.
Use the following regex pattern:
tstr = '9am, 5pm, 4:30am, 3amsdfkldnfknskflksd hello'
print(re.findall(r'\b\d+(?::\d+)?(?:am|pm)', tstr))
The output:
['9am', '5pm', '4:30am', '3am']
Try this,
((?:\d?\d(?:\:?\d\d?)?(?:am|pm)\,?\s?)+)
https://regex101.com/r/nkcWt5/1

How can I express 'repeat this part' in a regular expression?

Suppose I want to match a string like this:
123(432)123(342)2348(34)
I can match digits like 123 with [\d]* and (432) with \([\d]+\).
How can I match the whole string by repeating either of the 2 patterns?
I tried [[\d]* | \([\d]+\)]+, but this is incorrect.
I am using python re module.
I think you need this regex:
"^(\d+|\(\d+\))+$"
and to avoid catastrophic backtracking you need to change it to a regex like this:
"^(\d|\(\d+\))+$"
You can use a character class to match the whole string:
[\d()]+
But if you want to match the separate parts in separate groups, you can use re.findall with a special regex based on your need, for example:
>>> import re
>>> s="123(432)123(342)2348(34)"
>>> re.findall(r'\d+\(\d+\)',s)
['123(432)', '123(342)', '2348(34)']
>>>
Or :
>>> re.findall(r'(\d+)\((\d+)\)',s)
[('123', '432'), ('123', '342'), ('2348', '34')]
Or you can just use \d+ to get all the numbers :
>>> re.findall(r'\d+',s)
['123', '432', '123', '342', '2348', '34']
If you want to match the pattern \d+\(\d+\) repeatedly, you can use the following regex:
(?:\d+\(\d+\))+
You can achieve it with this pattern:
^(?=.)\d*(?:\(\d+\)\d*)*$
(?=.) ensures there is at least one character (if you want to allow empty strings, remove it).
\d*(?:\(\d+\)\d*)* is an unrolled sub-pattern. Explanation: With a backtracking regex engine, when you have a sub-pattern like (A|B)* where A and B are mutually exclusive (or at least when the end of A or B doesn't match respectively the beginning of B or A), you can rewrite the sub-pattern like this: A*(BA*)* or B*(AB*)*. For your example, it replaces (?:\d+|\(\d+\))*
This new form is more efficient: it reduces the steps needed to obtain a match, and it avoids a large part of the potential backtracking.
Note that you can improve it more, if you emulate an atomic group (?>....) with this trick (?=(....))\1 that uses the fact that a lookahead is naturally atomic:
^(?=.)(?=(\d*(?:\(\d+\)\d*)*))\1$
Note: if you don't want two consecutive numbers enclosed in parentheses, you only need to change the quantifier * to + inside the non-capturing group and add (?:\(\d+\))? at the end of the pattern, before the anchor $:
^(?=.)\d*(?:\(\d+\)\d+)*(?:\(\d+\))?$
or
^(?=.)(?=(\d*(?:\(\d+\)\d+)*(?:\(\d+\))?))\1$
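To convince yourself that the unrolled pattern and the lookahead-plus-backreference variant accept the same strings as the plain alternation, you can compare them on a few short inputs with Python's re module. This is only a sanity check on hand-picked samples, not a proof or a benchmark, and the names ALTERNATION, UNROLLED, ATOMIC_LIKE and SAMPLES are made up here.

```python
import re

# The straightforward alternation the unrolled form replaces.
ALTERNATION = re.compile(r'^(?:\d+|\(\d+\))+$')
# Unrolled form: A*(BA*)* with A = \d and B = \(\d+\).
UNROLLED = re.compile(r'^(?=.)\d*(?:\(\d+\)\d*)*$')
# Same sub-pattern wrapped in (?=(...))\1 to emulate an atomic group.
ATOMIC_LIKE = re.compile(r'^(?=.)(?=(\d*(?:\(\d+\)\d*)*))\1$')

SAMPLES = ["123(432)123(342)2348(34)", "123", "(1)(2)",
           "", "12(", "()", "1(2)x"]

for sample in SAMPLES:
    # One boolean per pattern; the three should always agree.
    results = [bool(p.match(sample))
               for p in (ALTERNATION, UNROLLED, ATOMIC_LIKE)]
    print(repr(sample), results)
```

The first three samples should be accepted by all three patterns and the remaining four rejected by all three.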

Regex Price Matching

I have a webscraper that scrapes prices, for that I need it to find following prices in strings:
762,50
1.843,75
In my first naive implementation, I didn't take the . into consideration and matched the first number with this regex perfectly:
re.findall("\d+,\d+", string)[0]
Now I need to match both cases and my initial idea was this:
re.findall("(\d+.\d+,\d+|\d+,\d+)", string)[0]
The idea was that the or operator would match either the first or the second form, but that doesn't work. Any suggestions?
No need to use an or, just make the first part optional:
(?:\d+\.)?\d+,\d+
The ? after (?:\d+\.) makes that group optional.
The ?: tells the engine not to capture this group, just to match it.
>>> re.findall(r'(?:\d+\.)?\d+,\d+', '1.843,75 762,50')
['1.843,75', '762,50']
Also note that you have to escape the . (dot) that would match any character except a newline (see http://docs.python.org/2/library/re.html#regular-expression-syntax)
In regular expressions, a dot (.) matches any character (except a newline, unless the DOTALL flag is set). Escape it to match . literally:
\d+\.\d+,\d+|\d+,\d+
   ^^
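A tiny sketch of the difference, using a made-up input where the separator is an x instead of a dot:

```python
import re

text = '1x843,75'  # hypothetical input: 'x' where a '.' would normally be

# Unescaped dot: any character is accepted as the "separator".
unescaped = re.findall(r'\d+.\d+,\d+', text)
# Escaped dot: only a literal '.' counts, so the second alternative matches.
escaped = re.findall(r'\d+\.\d+,\d+|\d+,\d+', text)

print(unescaped)  # ['1x843,75']
print(escaped)    # ['843,75']
```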
To match multiple leading digits, the regular expression should be:
>>> re.findall(r'(?:\d+\.)*\d+,\d+', '1,23 1.843,75 123.456.762,50')
['1,23', '1.843,75', '123.456.762,50']
NOTE: a non-capturing group is used because re.findall returns a list of groups if one or more groups are present in the pattern.
UPDATE
>>> re.findall(r'(?<![\d.])\d{1,3}(?:\.\d{3})*,\d+',
... '1,23 1.843,75 123.456.762,50 1.2.3.4.5.6.789,123')
['1,23', '1.843,75', '123.456.762,50']
How about:
(\d+[,.]\d+(?:[.,]\d+)?)
Matches:
- some digits followed by , or . and some digits
OR
- some digits followed by , or . and some digits followed by , or . and some digits
It matches: 762,50 and 1.843,75 and 1,75
It will also match 1.843.75. Are you OK with that?
I'd use this:
\d{1,3}(?:\.\d{3})*,\d\d
This will match numbers that have a dot as the thousands separator:
\d*\.?\d{3},\d{2}
This might be slower than regex, but given that the strings you are parsing are probably short, it should not matter.
Since the solution below does not use regex, it is simpler, and you can be more sure you are finding valid floats. Moreover, it parses the digit-strings into Python floats which is probably the next step you intend to perform anyway.
import locale
locale.setlocale(locale.LC_ALL, 'en_DK.UTF-8')
def float_filter(iterable):
    result = []
    for item in iterable:
        try:
            result.append(locale.atof(item))
        except ValueError:
            pass
    return result
text = 'The price is 762,50 kroner'
print(float_filter(text.split()))
yields
[762.5]
The basic idea: by setting a Danish locale, locale.atof parses commas as the decimal marker and dots as the grouping separator.
In [107]: import locale
In [108]: locale.setlocale(locale.LC_ALL, 'en_DK.UTF-8')
Out[108]: 'en_DK.UTF-8'
In [109]: locale.atof('762,50')
Out[109]: 762.5
In [110]: locale.atof('1.843,75')
Out[110]: 1843.75
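One caveat: the en_DK.UTF-8 locale may not be installed on every system, in which case setlocale raises locale.Error. As a fallback, here is a rough locale-free sketch that combines a regex from another answer in this thread with plain string replacement. It assumes dots are always grouping separators and the comma is always the decimal marker; PRICE and parse_prices are hypothetical names.

```python
import re

# Regex borrowed from another answer here: groups of three digits separated
# by dots, then a comma and the decimals.
PRICE = re.compile(r'(?<![\d.])\d{1,3}(?:\.\d{3})*,\d+')

def parse_prices(text):
    """Find prices like 1.843,75 in text and return them as floats."""
    prices = []
    for raw in PRICE.findall(text):
        # Drop grouping dots, then turn the decimal comma into a dot.
        normalized = raw.replace('.', '').replace(',', '.')
        prices.append(float(normalized))
    return prices

print(parse_prices('The price is 762,50 kroner, or 1.843,75 with extras'))
# [762.5, 1843.75]
```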
In general, you have a set of zero or more XXX., followed by one or more XXX, (each up to 3 digits), always followed by two digits. Do you also want to support numbers like 1,375 (without 'cents')? You also need to avoid some false detection cases.
That looks like this:
matcher=r'((?:(?:(?:\d{1,3}\.)?(?:\d{3}\.)*\d{3}\,)|(?:(?<![.0-9])\d{1,3},))\d\d)'
re.findall(matcher, '1.843,75 762,50')
This detects a lot of boundary cases, but may not catch everything....

Find ISBN with regex in Python

I have a text (actually lots of texts) that contains one ISBN somewhere inside, and I have to find it.
I know: my ISBN-13 will start with "978" followed by 10 digits.
I don't know: how many '-' (hyphens) there are, or whether they are in the correct place.
My code only finds ISBNs without any hyphens:
regex=r'978[0-9]{10}'
pattern = re.compile(regex, re.UNICODE)
for match in pattern.findall(mytext):
    print(match)
But how can I find ISBN like these:
978-123-456-789-0
978-1234-567890
9781234567890
etc...
Is this possible with one regex-pattern?
Thanks!
This matches 10 digits and allows one optional hyphen before each:
regex = r'978(?:-?\d){10}'
Since you can't have 2 consecutive hyphens, and it must end with a digit:
r'978(-?\d){10}'
... allowing for a hyphen right after the 978, mandating a digit after every hyphen (so it does not end in a hyphen), and allowing for consecutive digits by making each hyphen optional.
I would add \b before the 978 and after the {10}, to make sure the ISBNs are well separated from surrounding text.
Also, I would add ?: right after the opening parenthesis, to make the group non-capturing (slightly better performance, and also more expressive; with findall it matters too, since a capturing group would make findall return the group's content instead of the whole match), making it:
r'\b978(?:-?\d){10}\b'
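A short sketch of that final pattern in use, on a made-up sentence containing the three layouts from the question. Stripping the hyphens afterwards is my own addition, in case you want a canonical form; ISBN13, found and normalized are illustrative names.

```python
import re

ISBN13 = re.compile(r'\b978(?:-?\d){10}\b')

# Hypothetical sample text with the three layouts from the question.
mytext = "See 978-123-456-789-0, 978-1234-567890 and 9781234567890 for details."

found = ISBN13.findall(mytext)
# Optional: normalize every hit to plain digits.
normalized = [isbn.replace('-', '') for isbn in found]

print(found)       # ['978-123-456-789-0', '978-1234-567890', '9781234567890']
print(normalized)  # ['9781234567890', '9781234567890', '9781234567890']
```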
What about adding the - char in the pattern for the regex? This way, it will look for any combination of (number or -)x10 times.
regex=r'978[0-9\-]{10}'
Although it may be better to use
regex=r'978[0-9\-]+'
because otherwise if we use {10} and some - are found, not all the digits will be found.
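To illustrate the difference, here is a small comparison of the two patterns on a hyphenated ISBN (the variable names fixed and open_ended are just for this sketch):

```python
import re

mytext = "978-123-456-789-0"

# {10} counts hyphens as well as digits, so the match stops early.
fixed = re.findall(r'978[0-9\-]{10}', mytext)
# + keeps going until the run of digits and hyphens ends.
open_ended = re.findall(r'978[0-9\-]+', mytext)

print(fixed)       # ['978-123-456-7']
print(open_ended)  # ['978-123-456-789-0']
```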
Test
>>> import re
>>> regex=r'978[0-9\-]+'
>>> pattern = re.compile(regex, re.UNICODE)
>>> mytext="978-123-456-789-0"
>>> for match in pattern.findall(mytext):
...     print(match)
...
978-123-456-789-0
>>> mytext="978-1234-567890"
>>> for match in pattern.findall(mytext):
...     print(match)
...
978-1234-567890
>>> mytext="9781234567890"
>>> for match in pattern.findall(mytext):
...     print(match)
...
9781234567890
>>>
You can try to match every digit and - character. In that case, however, you can't know how many characters will be matched:
regex=r'978[\d\-]+\d'
pattern = re.compile(regex, re.UNICODE)
for match in pattern.findall(mytext):
    print(match)
If your ISBN is stuck between other digits or hyphens, you'll have some problems, but if it's clearly separated, no worries :)
EDIT: According to the first comment, you can add an extra \d at the end of the regex (I've updated the code above) because you know that an ISBN ends with a digit.
The simplest way should be
regex=r'978[-0-9]{10,15}'
which will accept them.
If someone is still looking: ISBN Detail and Constraints
Easy one: regex = r'^(978-?|979-?)?\d(-?\d){9}$'
Strong one: isbnRegex = r'^(978-?|979-?)?\d{1,5}-?\d{1,7}-?\d{1,6}-?\d{1,3}$', plus a length check of 10 or 13 after removing hyphens. (Note: also add a prefix check for length 13, i.e. only 978 or 979 is allowed. Some edge cases still need to be checked.)
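A minimal sketch of how the "strong" regex and the extra length and prefix checks described above could fit together. It only checks the shape of the string and does not verify the check digit; ISBN_SHAPE and looks_like_isbn are hypothetical names.

```python
import re

# The "strong" pattern from above, unchanged.
ISBN_SHAPE = re.compile(r'^(978-?|979-?)?\d{1,5}-?\d{1,7}-?\d{1,6}-?\d{1,3}$')

def looks_like_isbn(candidate):
    """Shape check only: regex, then length and prefix after removing hyphens."""
    if not ISBN_SHAPE.match(candidate):
        return False
    digits = candidate.replace('-', '')
    if len(digits) == 13:
        # A 13-digit ISBN must start with 978 or 979.
        return digits.startswith(('978', '979'))
    return len(digits) == 10

print(looks_like_isbn('978-123-456-789-0'))  # True
print(looks_like_isbn('1234567890123'))      # False
```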
