How to look behind in regex without matching the pattern itself? - Python

Let's say we want to extract the link in a tag like this:
input:
<p><b>some text</b></p>
desired output:
http://www.google.com/home/etc
The first solution is to match with a capturing group, using the regex href=[\'"]?([^\'" >]+)
But what I want is to match only the link that follows href. Trying (?=href\")... (a lookahead assertion, which matches without consuming) still matches the href itself.
This is a regex-only question.

One of many regex based solutions would be a capturing group:
>>> re.search(r'href="([^"]*)"', s).group(1)
'http://www.google.com/home/etc'
[^"]* matches any number of characters other than ".
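Since the question asks about looking behind, here is a minimal sketch comparing the capturing group above with a lookbehind assertion. The sample string s is an assumption, because the anchor tag in the question's input was not preserved. Note that Python's re module only accepts fixed-width lookbehinds, so the optional quote from the question's pattern cannot be used inside one.

```python
import re

# Hypothetical sample input: the anchor tag from the question is assumed.
s = '<p><a href="http://www.google.com/home/etc"><b>some text</b></a></p>'

# Capturing group: href=" is part of the overall match, the URL is group 1.
by_group = re.search(r'href="([^"]*)"', s).group(1)

# Lookbehind: href=" must precede the match but is not included in it,
# so the whole match (group 0) is already the URL.
by_lookbehind = re.search(r'(?<=href=")[^"]*', s).group(0)

print(by_group)
print(by_lookbehind)
```

Both approaches give the same URL here; the lookbehind version is mainly useful when you need the bare match, for example with re.sub or re.findall.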

A solution could be:
(?:href=)('|")(.*)\1
(?:href=) is a non-capturing group: the engine uses href= while matching but does not store it in a group. If you try this pattern, you will see that no group holds it.
Besides, every pair of plain parentheses creates a capturing group. As a consequence, ('|") defines group #1 and the URL you want will be in group #2. The way you retrieve this depends on the programming language.
At the end, \1 is a backreference: it matches the same text that group #1 captured (in this case "), which acts as the closing delimiter of the URL.
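A rough sketch of how this pattern behaves in Python follows. The sample tags are assumptions made up for illustration. One caveat worth showing: the greedy .* can run past the closing quote when another quoted attribute follows on the same line, so the sketch also tries a lazy .*? variant, which is my modification and not part of the answer above.

```python
import re

# Hypothetical samples, one per quote style.
samples = [
    '<a href="http://www.google.com/home/etc">x</a>',
    "<a href='http://www.google.com/home/etc'>x</a>",
]

# Lazy variant of the answer's pattern; group 1 is the quote, group 2 the URL.
lazy_pattern = re.compile(r'''(?:href=)('|")(.*?)\1''')
urls = [lazy_pattern.search(t).group(2) for t in samples]

# The original greedy pattern on a tag with a second quoted attribute.
tag = '<a href="/home" class="nav">x</a>'
greedy = re.search(r'''(?:href=)('|")(.*)\1''', tag).group(2)
lazy = lazy_pattern.search(tag).group(2)

print(urls)
print(repr(greedy))
print(repr(lazy))
```

With the second tag, the greedy version captures everything up to the last double quote on the line, while the lazy version stops at the first matching quote.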

Make yourself comfortable with a parser, e.g. with BeautifulSoup.
With this, it could be achieved with
from bs4 import BeautifulSoup
html = """<p><b>some text</b></p>"""
soup = BeautifulSoup(html, "html5lib")
print(soup.find('a').text)
# some text
BeautifulSoup supports a number of selectors including CSS selectors.
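Note that .text returns the link text; with BeautifulSoup the URL itself would be read from the tag's href attribute, e.g. soup.find('a')['href']. If installing bs4 is not an option, the same idea can be sketched with the standard library's html.parser instead. This is a swapped-in technique, not what the answer above uses, and both the HrefCollector class and the sample markup are assumptions for illustration.

```python
from html.parser import HTMLParser


class HrefCollector(HTMLParser):
    """Collects the href attribute of every <a> start tag it sees."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value may be None.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.hrefs.append(value)


# Hypothetical markup: the anchor tag from the question is assumed.
html = '<p><a href="http://www.google.com/home/etc"><b>some text</b></a></p>'

collector = HrefCollector()
collector.feed(html)
print(collector.hrefs)
```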

Related

Unable to extract the time from tycho.usno.navy.mil/timer.html with regex

I need to extract the time from US Naval Observatory Master Clock Time webpage for EDT, MDT from the mentioned URL. I've been trying to extract it using re.findall but I am unable. I am using the following regex \d{2}\:\d{2}\:\d{2}\s(AM|PM)\s(MDT|PDT). The output is only PM and MDT or PDT.
First of all, that's an HTML page and using regex with HTML (or any nested/hierarchical data) is a bad idea. That being said, given the relative simplicity of the page we can let it slide in this instance, but keep in mind that this is not the recommended way of doing things.
Your issue is that re.findall() returns only the captured groups ((AM|PM) and (MDT|PDT)) if your pattern contains capturing groups. You can turn them into non-capturing groups to collect the whole pattern, i.e.:
matches = re.findall(r"\d{2}:\d{2}:\d{2}\s(?:AM|PM)\s(?:MDT|PDT)", your_data)
Or, alternatively, you can use re.finditer() and extract the matches:
matches = [x.group() for x in re.finditer(r"\d{2}:\d{2}:\d{2}\s(AM|PM)\s(MDT|PDT)", data)]
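To make the difference concrete, here is a small sketch of the three variants side by side. The data string is made up for illustration and is not the actual page content.

```python
import re

# Hypothetical sample, standing in for the downloaded page text.
data = "01:15:03 PM MDT\n12:15:03 PM PDT\n03:15:03 PM EDT"

# Capturing groups: findall returns only the groups, as tuples.
with_groups = re.findall(r"\d{2}:\d{2}:\d{2}\s(AM|PM)\s(MDT|PDT)", data)

# Non-capturing groups: findall returns the whole match.
without_groups = re.findall(r"\d{2}:\d{2}:\d{2}\s(?:AM|PM)\s(?:MDT|PDT)", data)

# finditer: group() with no argument is the whole match, groups or not.
via_finditer = [m.group() for m in
                re.finditer(r"\d{2}:\d{2}:\d{2}\s(AM|PM)\s(MDT|PDT)", data)]

print(with_groups)
print(without_groups)
print(via_finditer)
```

The EDT line is skipped in all three cases because the pattern only lists MDT and PDT.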

Regular expression to find the http:// link out of the provided google search results link

I have
/url?q=http://dl.mytehranmusic.com/1392/Poya/New/1392/7/8/1/&sa=U&ved=0ahUKEwjIhcufvJXOAhWKrY8KHWjQBgQQFggTMAA&usg=AFQjCNF4phMtVM1Gmm1_kTpNOM6CXO0wIw
/url?q=http://mp3lees.org/index.php%3Fq%3DSia%2B-%2BElastic%2BHeart%2B(Feat.%2BThe%2BWeeknd%2B%2B%2BDiplo)&sa=U&ved=0ahUKEwjIhcufvJXOAhWKrY8KHWjQBgQQFggZMAE&usg=AFQjCNED4J0NRY5dmpC_cYMDJP9YM_Oxww
I am trying to find the http:// link out of the provided google search results link.
I have tried href = re.findall ('/url?q=(+/S)&', mixed) where mixed is variable name in which the unformatted link is stored.
You do not really need a regex to parse query strings. Use urllib.parse (the urlparse module in Python 2):
from urllib.parse import urlparse, parse_qs
s = '/url?q=http://dl.mytehranmusic.com/1392/Poya/New/1392/7/8/1/&sa=U&ved=0ahUKEwjIhcufvJXOAhWKrY8KHWjQBgQQFggTMAA&usg=AFQjCNF4phMtVM1Gmm1_kTpNOM6CXO0wIw'
res = parse_qs(urlparse(s).query)  # maps each parameter name to a list of values
if res.get('q'):
    print(res['q'][0])
If you absolutely want a regex solution, for a reason you have not explained, I'd suggest
r'/url\?(?:\S*?&)?q=([^&]+)'
The optional (?:\S*?&)? part allows the q to be matched anywhere inside the query string, and ([^&]+) matches one or more characters other than & and captures them into a group, which re.findall returns.
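A quick sketch of the regex in use follows. The first link is taken from the question; the second, where q is not the first parameter, is a made-up example. Unlike parse_qs, the regex leaves percent-escapes in place, so the sketch decodes them with urllib.parse.unquote.

```python
import re
from urllib.parse import unquote

pattern = re.compile(r'/url\?(?:\S*?&)?q=([^&]+)')

links = [
    # From the question: q is the first parameter.
    '/url?q=http://dl.mytehranmusic.com/1392/Poya/New/1392/7/8/1/'
    '&sa=U&ved=0ahUKEwjIhcufvJXOAhWKrY8KHWjQBgQQFggTMAA'
    '&usg=AFQjCNF4phMtVM1Gmm1_kTpNOM6CXO0wIw',
    # Hypothetical: q appears after another parameter and is percent-encoded.
    '/url?sa=U&q=http://example.com/page%3Fid%3D1&ved=abc',
]

# findall returns the captured group for each match; unquote decodes %XX.
found = [unquote(url) for link in links for url in pattern.findall(link)]

for url in found:
    print(url)
```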

Regex quantifiers

I'm new to regex and this is stumping me.
In the following example, I want to extract facebook.com/pages/Dr-Morris-Westfried-Dermatologist/176363502456825?id=176363502456825&sk=info. I've read up on lazy quantifiers and lookbehinds but I still can't piece together the right regex. I'd expect facebook.com\/.*?sk=info to work but it captures too much. Can you guys help?
<i class="mrs fbProfileBylineIcon img sp_2p7iu7 sx_96df30"></i></span><span class="fbProfileBylineLabel"><span itemprop="address" itemscope="itemscope" itemtype="http://schema.org/PostalAddress">7508 15th Avenue, Brooklyn, New York 11228</span></span></span><span class="fbProfileBylineFragment"><span class="fbProfileBylineIconContainer"><i class="mrs fbProfileBylineIcon img sp_2p7iu7 sx_9f18df"></i></span><span class="fbProfileBylineLabel"><span itemprop="telephone">(718) 837-9004</span></span></span></div></div></div><a class="title" href="https://www.facebook.com/pages/Dr-Morris-Westfried-Dermatologist/176363502456825?id=176363502456825&sk=info" aria-label="About Dr. Morris Westfried - Dermatologist">
As much as I love regex, this is an HTML parsing task:
>>> from bs4 import BeautifulSoup
>>> html = .... # that whole text in the question
>>> soup = BeautifulSoup(html, 'html.parser')
>>> pred = lambda tag: tag.get('href', '').endswith('sk=info')
>>> [tag.attrs['href'] for tag in filter(pred, soup.find_all('a'))]
['https://www.facebook.com/pages/Dr-Morris-Westfried-Dermatologist/176363502456825?id=176363502456825&sk=info']
This works :)
facebook\.com\/[^>]*?sk=info
With only .* the match starts at the first facebook.com and continues until sk=info. Since there is another facebook.com in between, the match spans both.
What separates the two, and what you don't want inside the match, is a > (or <, among other characters). Changing "anything" to "anything but a >" finds the facebook.com closest to the sk=info, as you want.
And yes, regex should only be used on HTML for basic tasks. Otherwise, use a parser.
Why your pattern doesn't work:
Your pattern doesn't work because the regex engine tries your pattern from left to right in the string.
When the regex engine meets the first facebook.com\/ in the string, and since you use .*? after it, the engine adds to the (possible) match all the characters (including ", > or spaces) until it finds sk=info (since . can match any character except a newline).
This is why fejese suggests replacing the dot with [^"] and aliteralmind suggests replacing it with [^>]: either makes the pattern fail at this (first) position in the string.
Using an HTML parser is the easiest way to deal with HTML. However, for a one-off match or search/replace, note that while an HTML parser provides safety and simplicity, it has a performance cost, since you need to load the whole document tree for a single task.
The problem is that you have another facebook.com part. You can restrict the .* so that it does not match ", which keeps the match within one attribute:
facebook\.com\/[^"]*sk=info
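The difference between the lazy dot and a negated character class can be seen in a small sketch. The html string is a shortened, made-up stand-in for the markup in the question, with an earlier facebook.com link added so the overlap shows up.

```python
import re

# Hypothetical markup with two facebook.com links; only the second has sk=info.
html = ('<a href="https://www.facebook.com/help">Help</a> '
        '<a class="title" href="https://www.facebook.com/pages/Example/123'
        '?id=123&sk=info">About</a>')

# Lazy dot: starts at the FIRST facebook.com and grows until sk=info.
lazy = re.search(r'facebook\.com/.*?sk=info', html).group(0)

# Negated class: cannot cross a double quote, so the first attempt fails
# and the engine retries from the second facebook.com.
bounded = re.search(r'facebook\.com/[^"]*sk=info', html).group(0)

print(lazy)
print(bounded)
```

Laziness only controls where the match ends, not where it starts, which is why the lazy version still begins at the first link.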

Python Regex 'not' to identify pattern within <a></a>

I am facing a problem writing a Python regex 'not' to identify a certain pattern within href tags.
My aim is to replace all occurrences of DSS[a-z]{2}[0-9]{2} with an href link as shown below, but without replacing the same pattern occurring inside href tags.
Present Regex:
replaced = re.sub("[^http://*/s](DSS[a-z]{2}[0-9]{2})", "\\1", input)
I need to add this new regex using an OR operator to the existing one I have
EDIT:
I am trying to use regex just for a simple operation. I want to replace the occurrences of the pattern anywhere in the HTML using a regex, except those occurring within <a></a>.
The answer to any question having regexp and HTML in the same sentence is here.
In Python, the best HTML parser is indeed Beautiful Soup.
If you want to persist with regexp, you can try a negative lookbehind to avoid anything preceded by a ". At your own risk.
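A minimal sketch of the negative lookbehind idea, under heavy assumptions: the link format in the replacement and the sample text are invented here, since the question's replacement markup was not preserved. The lookbehind only skips a code that comes immediately after a double quote, so it will not protect codes that sit deeper inside a URL or in the link text.

```python
import re

# (?<!") rejects any match whose preceding character is a double quote.
CODE = re.compile(r'(?<!")(DSS[a-z]{2}[0-9]{2})')

# Hypothetical input: one bare code and one code already used as an href.
text = 'See DSSab12 and <a href="DSSxy34">the old entry</a>.'

# Hypothetical link format: the code is reused as both href and link text.
replaced = CODE.sub(r'<a href="\1">\1</a>', text)

print(replaced)
```

Running the substitution a second time on its own output would wrap the new link text again, which is one more reason to treat this as a one-off fix and not a general solution.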

python regex to find any link that contains the text 'abc123'

I am using Beautiful Soup to find all href tags.
links = myhtml.findAll('a', href=re.compile('????'))
I need to find all links that have 'abc123' in the href text.
I need help with the regex , see ??? in my code snippet.
If 'abc123' is literally what you want to search for, anywhere in the href, then re.compile('abc123') as suggested by other answers is correct. If the actual string you want to match contains punctuation, e.g. 'abc123.com', then use instead
re.compile(re.escape('abc123.com'))
The re.escape part will "escape" any punctuation so that it's taken literally, just like alphanumerics are; without it, some punctuation gets interpreted in various ways by RE's engine, for example the dot ('.') in the above example would be taken as "any single character whatsoever", so re.compile('abc123.com') would match, e.g. 'abc123zcom' (and many other strings of a similar nature).
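The effect of re.escape can be checked with plain re, without involving Beautiful Soup. The test URLs below are made up for illustration.

```python
import re

loose = re.compile('abc123.com')              # the dot matches any character
strict = re.compile(re.escape('abc123.com'))  # the dot matches only a dot

# Hypothetical hrefs to test against.
lookalike = 'http://abc123zcom/'
real = 'http://abc123.com/page'

loose_hits_lookalike = loose.search(lookalike) is not None
strict_hits_lookalike = strict.search(lookalike) is not None
strict_hits_real = strict.search(real) is not None

print(loose_hits_lookalike, strict_hits_lookalike, strict_hits_real)
```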
"abc123" should give you what you want
If that doesn't work, then BS is probably using re.match, in which case you would want ".*abc123.*"
If you want all the links with exactly 'abc123' you can simply put:
links = myhtml.findAll('a', href=re.compile('abc123'))
