Do not match if word appears in regex - python

I have a url, and I want it to NOT match if the word 'season' is contained in the url. Here are two examples:
CONTAINS SEASON, DO NOT MATCH
'http://imdb.com/title/tt0285331/episodes?this=1&season=7&ref_=tt_eps_sn_7'
DOES NOT CONTAIN SEASON, MATCH
'http://imdb.com/title/tt0285331/'
Here is what I have so far, but I'm afraid the .+ will match everything until the end. What would be the correct regex to use here?
r'http://imdb.com/title/tt(\d)+/.+^[season].+'

Use a negative lookahead:
urls='''\
http://imdb.com/title/tt0285331/episodes?this=1&season=7&ref_=tt_eps_sn_7
http://imdb.com/title/tt0285331/'''
import re
print re.findall(r'^(?!.*\bseason\b)(.*)', urls, re.M)
# ['http://imdb.com/title/tt0285331/']

You cannot use whole words inside of character classes, you have to use a Negative Lookahead.
>>> s = '''
http://imdb.com/title/tt0285331/episodes?this=1&season=7&ref_=tt_eps_sn_7
http://imdb.com/title/tt0285331/
http://imdb.com/title/tt1111111/episodes?this=2
http://imdb.com/title/tt0123456/episodes?this=1&season=1&ref_=tt_eps_sn_1'''
>>> import re
>>> re.findall(r'\bhttp://imdb.com/title/tt(?!\S+\bseason)\S+', s)
# ['http://imdb.com/title/tt0285331/', 'http://imdb.com/title/tt1111111/episodes?this=2']
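To illustrate the point about character classes, here is a small sketch in Python 3. The sample strings and the variable names are made up for illustration. A class such as [^season] only ever tests one character at a time, whereas the lookahead tests for the whole word.

```python
import re

# [^season] is a set of single characters: "one character that is not
# s, e, a, o or n". It does not mean "anything except the word season".
single_chars = re.findall(r'[^season]', 'seasons x')
print(single_chars)

# A negative lookahead checks for the whole word without consuming any text.
no_season = re.compile(r'^(?!.*season)')
print(bool(no_season.search('http://imdb.com/title/tt0285331/')))
print(bool(no_season.search('http://imdb.com/title/tt0285331/episodes?season=7')))
```

The first call returns only the space and the x, because every letter of "seasons" belongs to the excluded set.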

Use a negative lookahead just after tt\d+/:
>>> import re
>>> s = """http://imdb.com/title/tt0285331/episodes?this=1&season=7&ref_=tt_eps_sn_7
... http://imdb.com/title/tt0285331/
... """
>>> m = re.findall(r'^http://imdb.com/title/tt\d+/(?:(?!season).)*$', s, re.M)
>>> for i in m:
...     print i
...
http://imdb.com/title/tt0285331/
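If the URLs arrive one at a time rather than as one multi-line string, the same idea can be wrapped in a small check. This is only a sketch in Python 3: the helper name is hypothetical, and the dots in the host name are escaped here, which the patterns above do not do.

```python
import re

# Hypothetical helper, not from the original answers.
# (?:(?!season).)* consumes one character at a time, but only while the
# text at that position does not start with "season".
TITLE_WITHOUT_SEASON = re.compile(r'http://imdb\.com/title/tt\d+/(?:(?!season).)*')

def is_title_without_season(url):
    """Return True if url is an IMDb title URL that does not contain 'season'."""
    return TITLE_WITHOUT_SEASON.fullmatch(url) is not None

urls = [
    'http://imdb.com/title/tt0285331/episodes?this=1&season=7&ref_=tt_eps_sn_7',
    'http://imdb.com/title/tt0285331/',
]
matching = [u for u in urls if is_title_without_season(u)]
print(matching)
```

Using fullmatch plays the role of the ^ and $ anchors, so re.M is not needed when each URL is checked on its own.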

Related

String with no spaces needs to split based on pattern

I have a string
number234-456-132
abc235-456-456
bhjklsds:456-133-456
I want split the strings as
number 234-456-132
abc 235-456-456
bhjklsds: 456-133-456
There is no pattern to the text which is joined with the number.
try this regex --> '([^0-9]*)(.*)'
>>> import re
>>> def foo(text):
...     result = re.search('([^0-9]*)(.*)', text)
...     return " ".join(result.groups())
...
>>> foo("number234-456-132")
'number 234-456-132'
>>> foo("abc235-456-456")
'abc 235-456-456'
>>> foo("bhjklsds:456-133-456")
'bhjklsds: 456-133-456'
>>>
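I would try to explicitly match the three groups of digits at the end, and include anything else in the first string:
import re
strings = ["number234-456-132", "abc235-456-456", "bhjklsds:456-133-456"]
for string in strings:
    # Raw string so that \d reaches the regex engine unchanged
    match = re.match(r"(.*)(\d{3}-\d{3}-\d{3})$", string)
    print([match[1], match[2]])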
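A slightly more defensive variant of this idea is sketched below in Python 3. It assumes the number always has the form ddd-ddd-ddd at the end of the string; the helper name and the extra "no digits here" input are made up for illustration. Text that does not fit the pattern is returned unchanged instead of raising an error.

```python
import re

# Hypothetical helper; assumes the trailing number looks like ddd-ddd-ddd.
TRAILING_NUMBER = re.compile(r"(.*?)(\d{3}-\d{3}-\d{3})$")

def split_prefix(text):
    """Return 'prefix number', or text unchanged if it does not end in ddd-ddd-ddd."""
    match = TRAILING_NUMBER.match(text)
    if match is None:
        return text
    return " ".join(match.groups())

for line in ["number234-456-132", "abc235-456-456",
             "bhjklsds:456-133-456", "no digits here"]:
    print(split_prefix(line))
```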

Combining two patterns with named capturing group in Python?

I have a regular expression which uses the before pattern like so:
>>> RE_SID = re.compile(r'(?P<sid>(?<=sid:)([A-Za-z0-9]+))')
>>> x = RE_SID.search('sid:I118uailfriedx151201005423521">>')
>>> x.group('sid')
'I118uailfriedx151201005423521'
and another like so:
>>> RE_SID = re.compile(r'(?P<sid>(?<=sid:<<")([A-Za-z0-9]+))')
>>> x = RE_SID.search('sid:<<"I118uailfriedx151201005423521')
>>> x.group('sid')
'I118uailfriedx151201005423521'
How can I combine these two patterns so that parsing these two different lines:
sid:A111uancalual2626x151130185758596
sid:<<"I118uailfriedx151201005423521">>
returns only the corresponding id to me.
RE_SID = re.compile(r'sid:(<<")?(?P<sid>([A-Za-z0-9]+))')
Use this, I've just tested and it is working for me. I've moved some part out.
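As a quick check, the pattern can be run against both lines from the question. The sketch below is Python 3 and uses a non-capturing group (?:...) for the optional <<" prefix, which is a small variation on the pattern above rather than the exact original.

```python
import re

# The optional prefix is matched but not captured; only the id is named.
RE_SID = re.compile(r'sid:(?:<<")?(?P<sid>[A-Za-z0-9]+)')

lines = [
    'sid:A111uancalual2626x151130185758596',
    'sid:<<"I118uailfriedx151201005423521">>',
]
sids = [RE_SID.search(line).group('sid') for line in lines]
print(sids)
```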
Instead of tweaking your regex, you can make your strings easier to parse by just removing any characters except alphanumeric and a colon. Then, just split by colon and get the last item:
>>> import re
>>>
>>> test_strings = ['sid:I118uailfriedx151201005423521">>', 'sid:<<"I118uailfriedx151201005423521']
>>> pattern = re.compile(r"[^A-Za-z0-9:]")
>>> for test_string in test_strings:
...     print(pattern.sub("", test_string).split(":")[-1])
...
I118uailfriedx151201005423521
I118uailfriedx151201005423521
You can achieve what you want with a single regex:
\bsid:\W*(?P<sid>\w+)
The regex breakdown:
\bsid - whole word sid
: - a literal colon
\W* - zero or more non-word characters
(?P<sid>\w+) - one or more word characters captured into a group named "sid"
Python demo:
import re
p = re.compile(r'\bsid:\W*(?P<sid>\w+)')
#test_str = "sid:I118uailfriedx151201005423521\">>" # => I118uailfriedx151201005423521
test_str = "sid:<<\"I118uailfriedx151201005423521" # => I118uailfriedx151201005423521
m = p.search(test_str)
if m:
    print(m.group("sid"))

Python Regex using a wildcard to match the beginning of a string and replacing the entire string

I'm trying to match the beginning of a word and then replace the entire word with something. Below is what I'm trying to do.
add23khh234 > REMOVED
add2asdf675 > REMOVED
Below is the regex statement I'm using.
string_reg = re.sub(ur'add*', 'REMOVED', string_reg)
But this code gives me the following.
add23khh234 > REMOVED23khh234
add2asdf675 > REMOVED2asdf675
add* is ad followed by d*. From the documentation:
'*'
Causes the resulting RE to match 0 or more repetitions of the preceding RE, as many repetitions as are possible. ab* will match a, ab, or a followed by any number of bs.
So it matches ad, add, adddddd, and so on. But it matches neither add23khh234 nor add2asdf675 (nor anything like them) as a whole.
You should use .+? or .*? here (not .*, which is greedy). Try something like this:
string_reg = re.sub(ur'add.+? ', 'REMOVED ', string_reg)
Demo:
>>> import re
>>> string_reg = """\
... add23khh234 > REMOVED23khh234
... add2asdf675 > REMOVED2asdf675"""
>>> string_reg = re.sub(ur'add.+? ', 'REMOVED ', string_reg)
>>> print string_reg
REMOVED > REMOVED23khh234
REMOVED > REMOVED2asdf675
>>>
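The difference between the two patterns can be seen in a short Python 3 sketch. The sample strings and variable names are made up for illustration, and the r'' prefix is used instead of ur'' because the latter is not valid in Python 3.

```python
import re

# 'add*' means 'a', 'd', then zero or more 'd', so in 'add23khh234'
# only the leading 'add' is matched.
partial = re.findall(r'add*', 'ad add adddd add23khh234')
print(partial)

# A lazy '.+?' followed by a space stops at the first space after 'add'.
replaced = re.sub(r'add.+? ', 'REMOVED ', 'add23khh234 and add2asdf675 end')
print(replaced)
```

Because this pattern requires a trailing space, a word at the very end of the string would not be replaced.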
Try this
string_reg = re.sub(ur'^add.*', 'REMOVED', string_reg)
If you have multiple patterns on one line:
string_reg=re.sub("add[^ ]+","REMOVED",string_reg)
Short answer
\badd\w*
A quantifier such as * is applied to the previous token or subpattern. For example, the regex you're using, add*, matches a literal ad followed by any number of subsequent d characters.
Meeting your criteria
You need to match add at the beginning of a word, so use a word boundary \b.
Then you also need to match the rest of the word in order to replace it. \w is a shorthand for [a-zA-Z0-9_] (for ASCII text) and matches one word character, which is what you need to repeat any number of times with *.
Code
import re
string_reg = 'add23khh234 ... add2asdf675 ... xxxadd2axxx'
string_reg = re.sub(r'\badd\w*', 'REMOVED', string_reg)
print(string_reg)
Output
REMOVED ... REMOVED ... xxxadd2axxx

python 3 regular expression match string meta-character

I want to write a regular expression that can match strings like "(2000)", with the year in parentheses. Then I can check if any string contains the substring "2000".
For example, I want the regex to match (2000) but not 2000, (20000), or (200).
That is to say: the match has to have exactly four digits, with the first digit being 1 or 2, and it has to include the parentheses.
Also, 2000 is just an example; really I want the regex to cover all possible years.
You have to escape the opening and closing parentheses:
>>> import re
>>> str = """foo(2000)bar(1000)foobar2000"""
>>> regex = r'\(2000\)'
>>> result = re.findall(regex, str)
>>> print(result)
['(2000)']
OR
>>> import re
>>> str = """foo(2000)bar(1000)foobar(2014)barfoo(2020)"""
>>> regex = r'\([0-9]{4}\)'
>>> result = re.findall(regex, str)
>>> print(result)
['(2000)', '(1000)', '(2014)', '(2020)']
It matches every four-digit number (year) enclosed in parentheses.
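The question also asks that the first digit be 1 or 2. A possible way to add that constraint is sketched below in Python 3; the pattern name and the sample text are assumptions made for illustration.

```python
import re

# Assumption taken from the question: exactly four digits, first digit 1 or 2,
# and the parentheses must be present. The inner group captures just the year.
YEAR_IN_PARENS = re.compile(r'\(([12][0-9]{3})\)')

text = 'foo(2000)bar 2000 (20000) (200) (1999) (3000)'
years = YEAR_IN_PARENS.findall(text)
print(years)
```

Because the closing \) must come directly after the fourth digit, (20000) and (200) are rejected, and the bare 2000 is skipped because it has no parentheses.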
Special characters need to be escaped with a backslash. A parenthesis ( becomes \(. Therefore (2000) becomes \(2000\).
Then you can do something like:
if re.search(r"\(2000\)", subject):
    # Successful match
    pass
else:
    # Match attempt failed
    pass
>>> import re
>>> x = re.match(r'\((\d*?)\)', "(2000)")
>>> x.group(1)
'2000'

How do I extract certain parts of strings in Python?

Say I have three strings:
abc534loif
tvd645kgjf
tv96fjbd_gfgf
and three lists:
beginning captures just the first part of the string "the name"
middle captures just the number
end contains only the rest of the characters that are after the number portion
How do I accomplish this in the most efficient way?
Use regular expressions?
>>> import re
>>> strings = 'abc534loif tvd645kgjf tv96fjbd_gfgf'.split()
>>> for s in strings:
...     for match in re.finditer(r'\b([a-z]+)(\d+)(.+?)\b', s):
...         print match.groups()
...
('abc', '534', 'loif')
('tvd', '645', 'kgjf')
('tv', '96', 'fjbd_gfgf')
This is a language-agnostic approach that aims at higher efficiency:
find first digit in the string and save its position p0
find last digit in the string and save its position p1
extract substring from 0 to p0-1 into beginning
extract substring from p0 to p1 into middle
extract substring from p1+1 to length-1 into end
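The steps above could be sketched in Python 3 as follows. The function name is hypothetical, and the sketch assumes that the digits form a single run, as in the three example strings.

```python
def split_on_digits(text):
    """Split text into (beginning, middle, end) around its first and last digit."""
    digit_positions = [i for i, ch in enumerate(text) if ch.isdigit()]
    if not digit_positions:
        # No digits at all: everything counts as the beginning.
        return text, '', ''
    p0, p1 = digit_positions[0], digit_positions[-1]
    # Python slices exclude the end index, hence p1 + 1.
    return text[:p0], text[p0:p1 + 1], text[p1 + 1:]

for s in ['abc534loif', 'tvd645kgjf', 'tv96fjbd_gfgf']:
    print(split_on_digits(s))
```

If a string contained two separate runs of digits, the middle part would also include the characters between them.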
I guess you're looking for re.findall:
strs = """
abc534loif
tvd645kgjf
tv96fjbd_gfgf
"""
import re
print re.findall(r'\b(\w+?)(\d+)(\w+)', strs)
# [('abc', '534', 'loif'), ('tvd', '645', 'kgjf'), ('tv', '96', 'fjbd_gfgf')]
>>> import itertools as it
>>> s="abc534loif"
>>> [''.join(j) for i,j in it.groupby(s, key=str.isdigit)]
['abc', '534', 'loif']
I'd do something like this:
>>> import re
>>> l = ['abc534loif', 'tvd645kgjf', 'tv96fjbd_gfgf']
>>> regex = re.compile(r'([a-z_]+)(\d+)([a-z_]+)')
>>> beginning, middle, end = zip(*[regex.match(s).groups() for s in l])
>>> beginning
('abc', 'tvd', 'tv')
>>> middle
('534', '645', '96')
>>> end
('loif', 'kgjf', 'fjbd_gfgf')
I would use a regular expression like:
(?P<beginning>[^0-9]*)(?P<middle>[0-9]*)(?P<end>[^0-9]*)
and pull out the three matching sections.
import re
m = re.match(r"(?P<beginning>[^0-9]*)(?P<middle>[0-9]*)(?P<end>[^0-9]*)", "abc534loif")
m.group('beginning')  # 'abc'
m.group('middle')  # '534'
m.group('end')  # 'loif'
import re #You want to match a string against a pattern, so you import the regular expressions module 're'
mystring = "abc1234def" #Just a string to test with
match = re.match(r"^(\D+)([0-9]+)(\D+)$", mystring) #Our regular expression. Everything between parentheses is 'captured', meaning that it is accessible as one of the 'groups' in the returned match object. The ^ sign matches at the beginning of a string, while the $ matches the end. The characters between the square brackets are a character range, so [0-9] matches any digit character; \D is any non-digit character.
if match: # match will be None if the string didn't match the pattern, so we need to check for that, as None has no group method.
    beginning = match.group(1)
    middle = match.group(2)
    end = match.group(3)
