Python regex to match HTML - python

I am trying to request a web page via urllib2 and match part of it using a regex.
Here is my code
def Get(url):
    request = urllib2.Request(url)
    page = urlOpener.open(request)
    return page.read()
page = Get(myurl)
#page = "<html>.....</html>" #local string for test
pattern = re.compile(r'^\s*(<tr>$\s*<td height="25.*?</tr>)$', re.M | re.I | re.DOTALL)
for task in pattern.findall(taskListPage):
If I use a local string (the same as the result of Get(myurl)) for testing, the pattern works, but if I use Get(myurl), the pattern does not work.
I will be grateful if someone can tell me why.

Valid reservations about using regex on html aside, try this regex instead:
(<tr>\s*<td height="25.*?</tr>)
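The $ anchors were most likely the problem, and the leading ^\s* was unnecessary. Even with re.M, $ matches only at the end of the string or immediately before a \n, so <tr>$ fails when the downloaded page uses \r\n line endings or has trailing whitespace, which a hand-typed test string usually lacks.
This match is brittle: let's hope the web guy doesn't change the height of the rows...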
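To see why a local test string can behave differently from a downloaded page, here is a minimal sketch comparing the anchored and relaxed patterns. The sample table and the variable names are made up for illustration, and the assumption that the real page uses \r\n line endings is only a likely explanation, not something the question confirms.

```python
import re

# Hypothetical sample: the same rows with Windows (\r\n) and Unix (\n) line endings.
crlf_page = ('<table>\r\n<tr>\r\n<td height="25">Task A</td>\r\n</tr>\r\n'
             '<tr>\r\n<td height="25">Task B</td>\r\n</tr>\r\n</table>')
lf_page = crlf_page.replace('\r\n', '\n')

# The pattern from the question and the relaxed pattern from the answer.
anchored = re.compile(r'^\s*(<tr>$\s*<td height="25.*?</tr>)$', re.M | re.I | re.DOTALL)
relaxed = re.compile(r'(<tr>\s*<td height="25.*?</tr>)', re.I | re.DOTALL)

anchored_lf = anchored.findall(lf_page)      # "$" sits right before "\n": rows are found
anchored_crlf = anchored.findall(crlf_page)  # "$" cannot match before "\r": nothing found
relaxed_crlf = relaxed.findall(crlf_page)    # no anchors, so line endings do not matter

print(len(anchored_lf), len(anchored_crlf), len(relaxed_crlf))
```

If the counts differ in the same way for your page, normalizing line endings or dropping the anchors should both work.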

Related

Unable to get regex in python to match pattern

I'm trying to pull a number out of a copy of an HTML page that I fetched with urllib.request.
I've tried a few different regex patterns but keep getting None as the output, so I'm clearly not writing the pattern correctly, and I can't get it to work.
Below is a small part of the HTML I have in the string:
</ul>\n \n <p>* * * * *</p>\n -->\n \n <b>DistroWatch database summary</b><br/>\n <ul>\n <li>Number of all distributions in the database: 926<br/>\n <li>Number of <a href="search.php?status=Active">
I'm trying to get just the 926 out of the string. My code is below, and I can't figure out what I'm doing wrong:
import urllib.request
import re
page = urllib.request.urlopen('http://distrowatch.com/weekly.php?issue=current')
#print(page.read())
print(page.read())
pageString = str(page.read())
#print(pageString)
DistroCount = re.search('^all distributions</a> in the database: ....<br/>\n$', pageString)
print(DistroCount)
Any help, pointers or resource suggestions would be much appreciated.
You can use BeautifulSoup to convert HTML to text, and then apply a simple regex to extract a number after a hardcoded string:
import urllib.request, re
from bs4 import BeautifulSoup
page = urllib.request.urlopen('http://distrowatch.com/weekly.php?issue=current')
html = page.read()
soup = BeautifulSoup(html, 'lxml')
text = soup.get_text()
m = re.search(r'all distributions in the database:\s*(\d+)', text)
if m:
    print(m.group(1))
    # => 926
Here,
soup.get_text() converts HTML to plain text and keeps it in the text variable
The all distributions in the database:\s*(\d+) regex matches all distributions in the database:, then zero or more whitespace chars and then captures into Group 1 any one or more digits (with (\d+))
I think your problem is that you are reading the whole document into a single string, but use "^" at beginning of your regex and "$" at the end, so the regex will only match the entire string.
Either drop ^ and $ (and \n as well…), or process your document line by line.
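As a rough illustration of the "drop ^ and $" advice, the sketch below runs both patterns against a shortened copy of the HTML from the question instead of the live page. The html variable is a stand-in; with a real response you would presumably decode the bytes (for example page.read().decode()) rather than wrap them in str(), which yields the b'...' representation with literal backslash-n sequences.

```python
import re

# Stand-in for the decoded page text (shortened from the question's sample).
html = ('<b>DistroWatch database summary</b><br/>\n <ul>\n'
        ' <li>Number of all distributions in the database: 926<br/>\n'
        ' <li>Number of <a href="search.php?status=Active">')

# The anchored pattern from the question: ^ and $ tie it to the whole string.
anchored = re.search('^all distributions</a> in the database: ....<br/>\n$', html)

# Unanchored pattern with a capture group for the digits.
m = re.search(r'all distributions in the database:\s*(\d+)', html)
count = int(m.group(1)) if m else None

print(anchored, count)
```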

finding the regex to get a url between two phrases

I have the following script trying to get this URL: https://clips-media-assets.twitch.tv/178569498.mp4, which sits between {"quality":"1080","source":" and a closing ", but my regex doesn't seem to be working:
dt = """
<body>
<script>jQuery(window).load(function () {
setTimeout(function(){s
}, 1000);quality_options: [{"quality":"1080","source":"https://clips-media-assets.twitch.tv/178569498.mp4","frame_rate":60},{"quality":"720","source":"https://clips-media-assets.twitch.tv/AT-178569498-1280x720.mp4","frame_rate":60},{"quality":"480","source":"https://clips-media-assets.twitch.tv/AT-178569498-854x480.mp4","frame_rate":30},{"quality":"360","source":"https://clips-media-assets.twitch.tv/AT-178569498-640x360.mp4","frame_rate":30}]
});</script>
</body>
[download] 28.2x of 57.90MiB at 1.54MiB/s ETA 00:26
"""
pattern = re.compile(r'(?:\G(?!\A)|quality\":\"1080\",\"source\":\")(?:(?!\").)*', re.MULTILINE | re.DOTALL)
clipHTML = BeautifulSoup(dt, "html.parser")
scripts = clipHTML.findAll(['script'])
for script in scripts:
    if script:
        match = pattern.search(script.text)
        if match:
            email = match.group(0)
            print(email)
If you insist on using a regex to solve this, try this one:
(?<=quality\":\"1080\",\"source\":\")[^\"]+(?=\")
I don't know this specific case, but in general it's not ideal to parse JSON with regular expressions. You can of course allow a variable number of spaces in the regex using ( *), but I still think it's better to use a JSON parser.
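Both suggestions can be sketched side by side. Note that Python's built-in re module does not support the \G anchor used in the question's pattern, which is one reason that pattern fails. The script_text below is a shortened stand-in for script.text, and locating the array by searching for the first [ after quality_options: is an assumption about how the page is laid out.

```python
import json
import re

# Shortened stand-in for the <script> body from the question.
script_text = ('setTimeout(function(){}, 1000);quality_options: ['
               '{"quality":"1080","source":"https://clips-media-assets.twitch.tv/178569498.mp4","frame_rate":60},'
               '{"quality":"720","source":"https://clips-media-assets.twitch.tv/AT-178569498-1280x720.mp4","frame_rate":60}]\n'
               '});')

# Regex route: fixed-width lookbehind, then everything up to the next quote.
m = re.search(r'(?<=quality":"1080","source":")[^"]+(?=")', script_text)
regex_url = m.group(0) if m else None

# JSON route: decode the array that starts at the first "[" after the key.
start = script_text.index('[', script_text.index('quality_options:'))
options, _ = json.JSONDecoder().raw_decode(script_text, start)
json_url = next((o['source'] for o in options if o['quality'] == '1080'), None)

print(regex_url)
print(json_url)
```

The JSON route keeps working if the keys are reordered or whitespace is added, which is where the regex would need adjusting.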

Python strip Google Alerts URL

I currently have a dataframe filled with Google Alerts URLs that look like:
link = 'https://www.google.com/url?rct=j&sa=t&url=http://3dprint.com/4353/littledlper-dlp-3d-printer-kickstarter/&ct=ga&cd=CAEYBCoSODQ1OTg1ODMwMzQwNDUzMTUxMhw2NTFlMTg3MTI1ZGE4Nzc3OmNvLnVrOmVuOkdC&usg=AFQjCNF0HOEhqIZHEpdkH1eVdXt-JRBF3Q'
and I just want the part following url= and before the junk.
http://3dprint.com/4353/littledlper-dlp-3d-printer-kickstarter/
I used urllib.parse.urlparse(link) to get a list of URL elements...
parsed = ParseResult(scheme='https', netloc='www.google.com', path='/url', params='', query='rct=j&sa=t&url=http://3dprint.com/4353/littledlper-dlp-3d-printer-kickstarter/&ct=ga&cd=CAEYBCoSODQ1OTg1ODMwMzQwNDUzMTUxMhw2NTFlMTg3MTI1ZGE4Nzc3OmNvLnVrOmVuOkdC&usg=AFQjCNF0HOEhqIZHEpdkH1eVdXt-JRBF3Q', fragment='')
but even then parsed[4] only breaks it down to...
'rct=j&sa=t&url=http://3dprint.com/4353/littledlper-dlp-3d-printer-kickstarter/&ct=ga&cd=CAEYBCoSODQ1OTg1ODMwMzQwNDUzMTUxMhw2NTFlMTg3MTI1ZGE4Nzc3OmNvLnVrOmVuOkdC&usg=AFQjCNF0HOEhqIZHEpdkH1eVdXt-JRBF3Q'
I found other queries on Stack with this same question but they were in other programming languages than Python.
Any ideas on a Python approach?
You may use a regex on parsed[4] to extract that URL:
(?:^|&)url=([^&]+)
Details:
(?:^|&) - either start of string or &
url= - literal text url=
([^&]+) - Group 1 capturing one or more symbols other than &.
Python demo:
import re
p = re.compile(r'(?:^|&)url=([^&]+)')
s = "rct=j&sa=t&url=http://3dprint.com/4353/littledlper-dlp-3d-printer-kickstarter/&ct=ga&cd=CAEYBCoSODQ1OTg1ODMwMzQwNDUzMTUxMhw2NTFlMTg3MTI1ZGE4Nzc3OmNvLnVrOmVuOkdC&usg=AFQjCNF0HOEhqIZHEpdkH1eVdXt-JRBF3Q"
mObj = p.search(s)
if mObj:
    print(mObj.group(1))
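As an alternative to the regex, the standard library can split the query string itself. This is a sketch using urllib.parse.parse_qs on the link from the question; the variable name target is just for illustration.

```python
from urllib.parse import urlparse, parse_qs

link = ('https://www.google.com/url?rct=j&sa=t'
        '&url=http://3dprint.com/4353/littledlper-dlp-3d-printer-kickstarter/'
        '&ct=ga&cd=CAEYBCoSODQ1OTg1ODMwMzQwNDUzMTUxMhw2NTFlMTg3MTI1ZGE4Nzc3OmNvLnVrOmVuOkdC'
        '&usg=AFQjCNF0HOEhqIZHEpdkH1eVdXt-JRBF3Q')

# parse_qs maps each query parameter to a list of its values.
params = parse_qs(urlparse(link).query)
target = params['url'][0]

print(target)
```

For a pandas column, the same two calls could be wrapped in a small function and applied per row.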

How to match 0 or 1 time character at the end of line?

I am trying to normalize a URL, to extract the content after :// and before the last / at the end of line if it exists.
Here is my script:
url = "https://example.com/25194425/"
matchUrl = re.findall(r'://(.*)/?$', url)
print matchUrl
What I want is example.com/25194425, but I get example.com/25194425/. How to deal with the last /?
Why doesn't /? work?
An alternative that avoids regex is to use urlparse:
>>> from urlparse import urlparse
>>> url = 'https://example.com/25194425/'
>>> '{url.netloc}{url.path}'.format(url=urlparse(url)).rstrip('/')
'example.com/25194425'
Later on, if you want to include the protocol, port, params, ... in the normalized URL, this is easier to extend than the regex:
>>> '{url.scheme}://{url.netloc}{url.path}'.format(url=urlparse(url)).rstrip('/')
'https://example.com/25194425'
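On Python 3 the same function lives in urllib.parse. A minimal sketch of the approach above, where the normalize helper and its keep_scheme flag are hypothetical names introduced for illustration:

```python
from urllib.parse import urlparse

def normalize(url, keep_scheme=False):
    """Return host + path without a trailing slash (hypothetical helper)."""
    parts = urlparse(url)
    base = f"{parts.netloc}{parts.path}".rstrip("/")
    return f"{parts.scheme}://{base}" if keep_scheme else base

print(normalize("https://example.com/25194425/"))
print(normalize("https://example.com/25194425/", keep_scheme=True))
```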
As one of the commenters said, you just need to make the quantifier non-greedy:
://(.*?)/?$
However, the result of findall() is a list, not a string. In this case it's a list with only one entry, but it's still a list. To get the actual string, you need to provide the index:
url = "https://example.com/25194425/"
match = re.findall(r'://(.*?)/?$', url)
print match[0]
But that seems like an inappropriate use of findall() to me. I would have gone with search():
url = "https://example.com/25194425/"
match = re.search(r'://(.*?)/?$', url)
if match:
    print match.group(1)
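The difference between the greedy and non-greedy quantifier can be checked directly. This sketch uses Python 3 syntax (print as a function), unlike the Python 2 snippets above, and the variable names are illustrative.

```python
import re

url = "https://example.com/25194425/"

# Greedy: (.*) swallows the trailing slash, so the optional /? matches nothing.
greedy = re.search(r'://(.*)/?$', url).group(1)

# Non-greedy: (.*?) stops as early as it can, leaving the slash to /?.
lazy = re.search(r'://(.*?)/?$', url).group(1)

# The non-greedy pattern also handles a URL with no trailing slash.
no_slash = re.search(r'://(.*?)/?$', "https://example.com/25194425").group(1)

print(greedy)
print(lazy)
print(no_slash)
```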
By default, quantifiers are greedy: they match as many characters as possible. So (.*) runs through the last slash, and the optional /? is left to match nothing.
You can use this instead:
matchUrl = re.findall(r'://(.*)/[^/]?$', url)
EDIT Please try the following pattern (python 2.7x):
import re
url1 = 'https://example.com/25194425?/'
url2 = 'https://example.com/25194425?'
print re.findall('https?://([\S]+)(?<!/)[/]?', url1)
print re.findall('https?://([\S]+)(?<!/)[/]?', url2)
Output:
['example.com/25194425?']
['example.com/25194425?']
Thanks @Alan Moore for pointing out the word boundary issue. Now it should work for both scenarios.

count the number of images on a webpage, using urllib

For a class, I have an exercise where I need to count the number of images on any given web page. I know that every image starts with <img, so I am using a regexp to try and locate them. But I keep getting a count of one, which I know is wrong. What is wrong with my code?
import sys
import urllib.request
import re
img_pat = re.compile('<img.*>', re.I)
def get_img_cnt(url):
    try:
        w = urllib.request.urlopen(url)
    except IOError:
        sys.stderr.write("Couldn't connect to %s " % url)
        sys.exit(1)
    contents = str(w.read())
    img_num = len(img_pat.findall(contents))
    return img_num
print(get_img_cnt('http://www.americascup.com/en/schedules/races'))
Don't ever use regex for parsing HTML, use an html parser, like lxml or BeautifulSoup. Here's a working example, how to get img tag count using BeautifulSoup and requests:
from bs4 import BeautifulSoup
import requests
def get_img_cnt(url):
    response = requests.get(url)
    # Name the parser explicitly; bs4 warns when it has to guess one.
    soup = BeautifulSoup(response.content, 'html.parser')
    return len(soup.find_all('img'))
print(get_img_cnt('http://www.americascup.com/en/schedules/races'))
Here's a working example using lxml and requests:
from lxml import etree
import requests
def get_img_cnt(url):
    response = requests.get(url)
    parser = etree.HTMLParser()
    root = etree.fromstring(response.content, parser=parser)
    return int(root.xpath('count(//img)'))
print(get_img_cnt('http://www.americascup.com/en/schedules/races'))
Both snippets print 106.
Also see:
Python Regex - Parsing HTML
Python regular expression for HTML parsing (BeautifulSoup)
Hope that helps.
Ahhh regular expressions.
Your regex pattern <img.*> says "Find me something that starts with <img and stuff and make sure it ends with >."
Regular expressions are greedy, though; it'll fill that .* with literally everything it can while leaving a single > character somewhere afterwards to satisfy the pattern. In this case, it would go all the way to the end, </html>, and say "look! I found a > right there!"
You should come up with the right count by making .* non-greedy, like this:
<img.*?>
Your regular expression is greedy, so it matches much more than you want. I suggest using an HTML parser.
img_pat = re.compile('<img.*?>',re.I) will do the trick if you must do it the regex way. The ? makes it non-greedy.
A good website for checking what your regex matches on the fly: http://www.pyregex.com/
Learn more about regexes: http://docs.python.org/2/library/re.html
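The greedy versus non-greedy counts are easy to compare on a small string. The sketch below uses a made-up one-line HTML sample and, as a parser-based cross-check that needs no third-party packages, the standard library's html.parser module; the ImgCounter class is a hypothetical helper, not part of any answer above.

```python
import re
from html.parser import HTMLParser

# Made-up sample: three images on a single line, as str(w.read()) would produce.
html = ('<html><body><img src="a.png"><p>text</p>'
        '<IMG src="b.png" alt="b"><img src="c.png"/></body></html>')

greedy_count = len(re.findall(r'<img.*>', html, re.I))   # one giant match
lazy_count = len(re.findall(r'<img.*?>', html, re.I))    # one match per tag

class ImgCounter(HTMLParser):
    """Counts <img> start tags (hypothetical helper for this example)."""
    def __init__(self):
        super().__init__()
        self.count = 0

    def handle_starttag(self, tag, attrs):
        # Tag names arrive lowercased; self-closing tags also pass through here.
        if tag == 'img':
            self.count += 1

counter = ImgCounter()
counter.feed(html)

print(greedy_count, lazy_count, counter.count)
```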
