For a class, I have an exercise where I need to count the number of images on any given web page. I know that every image starts with <img, so I am using a regexp to try and locate them. But I keep getting a count of one, which I know is wrong. What is wrong with my code?
import sys
import urllib
import urllib.request
import re
img_pat = re.compile('<img.*>',re.I)
def get_img_cnt(url):
    try:
        w = urllib.request.urlopen(url)
    except IOError:
        sys.stderr.write("Couldn't connect to %s " % url)
        sys.exit(1)
    contents = str(w.read())
    img_num = len(img_pat.findall(contents))
    return (img_num)
print (get_img_cnt('http://www.americascup.com/en/schedules/races'))
Don't ever use regex for parsing HTML, use an html parser, like lxml or BeautifulSoup. Here's a working example, how to get img tag count using BeautifulSoup and requests:
from bs4 import BeautifulSoup
import requests
def get_img_cnt(url):
    response = requests.get(url)
    soup = BeautifulSoup(response.content)
    return len(soup.find_all('img'))
print(get_img_cnt('http://www.americascup.com/en/schedules/races'))
Here's a working example using lxml and requests:
from lxml import etree
import requests
def get_img_cnt(url):
    response = requests.get(url)
    parser = etree.HTMLParser()
    root = etree.fromstring(response.content, parser=parser)
    return int(root.xpath('count(//img)'))
print(get_img_cnt('http://www.americascup.com/en/schedules/races'))
Both snippets print 106.
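If installing third-party packages is not an option, the same idea can probably be sketched with the standard library's html.parser module. This is only an illustrative alternative, not what the snippets above ran: the ImgCounter class, the count_imgs helper and the sample string are made up for this example, and the count on a real page may differ from what lxml or BeautifulSoup report, because each parser recovers from broken markup differently.

```python
from html.parser import HTMLParser


class ImgCounter(HTMLParser):
    """Counts <img> start tags (hypothetical helper, not from the answer)."""

    def __init__(self):
        super().__init__()
        self.count = 0

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names, so <IMG> is counted as well.
        # Self-closing tags such as <img/> also end up here by default.
        if tag == 'img':
            self.count += 1


def count_imgs(html_text):
    parser = ImgCounter()
    parser.feed(html_text)
    return parser.count


# Made-up sample markup, used instead of a live page.
sample = '<div><img src="a.png"><IMG SRC="b.png"/><p>no image</p></div>'
print(count_imgs(sample))  # 2
```

To use it on a real page you would pass in the decoded response text rather than the sample string.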
Also see:
Python Regex - Parsing HTML
Python regular expression for HTML parsing (BeautifulSoup)
Hope that helps.
Ahhh regular expressions.
Your regex pattern <img.*> says "Find me something that starts with <img, followed by any stuff, and make sure it ends with >."
Regular expressions are greedy, though; it'll fill that .* with literally everything it can while leaving a single > character somewhere afterwards to satisfy the pattern. In this case, it would go all the way to the end, </html>, and say "look! I found a > right there!"
You should come up with the right count by making .* non-greedy, like this:
<img.*?>
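A small sketch of the difference, using a made-up one-line HTML string rather than the real page (the variable names are just for illustration):

```python
import re

# Made-up sample with two image tags on a single line.
html = '<p><img src="a.png"> text <img src="b.png"></p>'

# Greedy: .* runs on to the LAST > in the string, so everything is one match.
greedy = re.findall(r'<img.*>', html, re.I)

# Non-greedy: .*? stops at the FIRST > after each <img.
lazy = re.findall(r'<img.*?>', html, re.I)

print(len(greedy))  # 1
print(len(lazy))    # 2
print(lazy)         # ['<img src="a.png">', '<img src="b.png">']
```

This still treats HTML as plain text, so it can miscount on things like commented-out tags; it only shows why the greedy version collapses to a single match.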
Your regular expression is greedy, so it matches much more than you want. I suggest using an HTML parser.
img_pat = re.compile('<img.*?>',re.I) will do the trick if you must do it the regex way. The ? makes it non-greedy.
A good website for checking what your regex matches on the fly: http://www.pyregex.com/
Learn more about regexes: http://docs.python.org/2/library/re.html
Related
I'm trying to pull out a number from a copy of an HTML page which I got from using urllib.request
I've tried a few different patterns in regex but keep getting none as the output so I'm clearly not formatting the pattern correctly but can't get it to work
Below is a small part of the HTML I have in the string
</ul>\n \n <p>* * * * *</p>\n -->\n \n <b>DistroWatch database summary</b><br/>\n <ul>\n <li>Number of all distributions in the database: 926<br/>\n <li>Number of <a href="search.php?status=Active">
I'm trying to just get the 926 out of the string and my code is below and I can't figure out what I'm doing wrong
import urllib.request
import re
page = urllib.request.urlopen('http://distrowatch.com/weekly.php?issue=current')
#print(page.read())
print(page.read())
pageString = str(page.read())
#print(pageString)
DistroCount = re.search('^all distributions</a> in the database: ....<br/>\n$', pageString)
print(DistroCount)
any help, pointers or resource suggestions would be much appreciated
You can use BeautifulSoup to convert HTML to text, and then apply a simple regex to extract a number after a hardcoded string:
import urllib.request, re
from bs4 import BeautifulSoup
page = urllib.request.urlopen('http://distrowatch.com/weekly.php?issue=current')
html = page.read()
soup = BeautifulSoup(html, 'lxml')
text = soup.get_text()
m = re.search(r'all distributions in the database:\s*(\d+)', text)
if m:
    print(m.group(1))
# => 926
Here,
soup.get_text() converts HTML to plain text and keeps it in the text variable
The all distributions in the database:\s*(\d+) regex matches all distributions in the database:, then zero or more whitespace chars and then captures into Group 1 any one or more digits (with (\d+))
I think your problem is that you are reading the whole document into a single string, but use "^" at beginning of your regex and "$" at the end, so the regex will only match the entire string.
Either drop ^ and $ (and \n as well…), or process your document line by line.
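To illustrate the point about ^ and $, here is a minimal sketch on a string modelled on the HTML fragment from the question (page_string is a shortened stand-in, not the live page). Without re.MULTILINE, ^ only matches at the very start of the string and $ only at the very end, so the anchored pattern finds nothing:

```python
import re

# Shortened stand-in for the downloaded page, based on the question's excerpt.
page_string = ('<b>DistroWatch database summary</b><br/>\n <ul>\n'
               ' <li>Number of all distributions in the database: 926<br/>\n')

# Anchored: the string does not START with "all distributions", so no match.
anchored = re.search(r'^all distributions in the database: (\d+)<br/>$',
                     page_string)

# Unanchored, with a capture group for the digits.
unanchored = re.search(r'all distributions in the database: (\d+)',
                       page_string)

print(anchored)             # None
print(unanchored.group(1))  # 926
```

Capturing (\d+) instead of matching four arbitrary characters with .... also avoids depending on how many digits the number has.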
I am trying to use regular expression to extract phone number from web links. The problem I am facing is with unwanted id's and other elements of webpage. If anyone can suggest some improvements, it would be really helpful. Below is the code and regular expression I am using in Python,
from urllib2 import urlopen as uReq
uClient = uReq(url)
page_html = uClient.read()
print re.findall(r"(\(?\d{3}\D{0,3}\d{3}\D{0,3}\d{4}).*?",page_html)
Now, for most websites, the script picks up some unrelated page element values and is only sometimes accurate. Please suggest some modifications to the expression:
re.findall(r"(\(?\d{3}\D{0,3}\d{3}\D{0,3}\d{4}).*?",page_html)
My output looks like below for different url's
http://www.fraitagengineering.com/index.html
['(877) 424-4752']
http://hunterhawk.com/
['1481240672', '1481240643', '1479852632', '1478013441', '1481054486', '1481054560', '1481054598', '1481054588', '1476820246', '1481054521', '1481054540', '1476819829', '1481240830', '1479855986', '1479855990', '1479855994', '1479855895', '1476819760', '1476741750', '1476741750', '1476820517', '1479862863', '1476982247', '1481058326', '1481240672', '1481240830', '1513106590', '1481240643', '1479855986', '1479855990', '1479855994', '1479855895', '1479852632', '1478013441', '1715282331', '1041873852', '1736722557', '1525761106', '1481054486', '1476819760', '1481054560', '1476741750', '1481054598', '1476741750', '1481054588', '1476820246', '1481054521', '1476820517', '1479862863', '1481054540', '1476982247', '1476819829', '1481058326', '(925) 798-4950', '2093796260']
http://www.lbjewelrydesign.com/
['213-629-1823', '213-629-1823']
I want just phone numbers in the (000) 000-0000 (note that I have added a space after the parenthesis), (000)-000-0000 or 000-000-0000 format. Any suggestions appreciated. Please note that I have already referred to this link: Find phone numbers in python script
I need improvement in regex for my specific needs.
The following regular expression can be used to match the samples that you presented and other similar numbers:
(\([0-9]{3}\)[\s-]?|[0-9]{3}-)[0-9]{3}-[0-9]{4}
The following example script can be used to test positive and negative cases, and to experiment with the regular expression:
import re
positiveExamples = [
    '(000) 000-0000',
    '(000)-000-0000',
    '(000)000-0000',
    '000-000-0000'
]
negativeExamples = [
    '000 000-0000',
    '000-000 0000',
    '000 000 0000',
    '000000-0000',
    '000-0000000',
    '0000000000'
]
reObj = re.compile(r"(\([0-9]{3}\)[\s-]?|[0-9]{3}-)[0-9]{3}-[0-9]{4}")
for example in positiveExamples:
    print('Asserting positive example: %s' % example)
    assert reObj.match(example)
for example in negativeExamples:
    print('Asserting negative example: %s' % example)
    assert reObj.match(example) is None
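One thing to watch out for when plugging this pattern into re.findall, as the question does: if the pattern contains a capturing group, findall returns only the group, not the whole number. A sketch of one way around that, assuming a non-capturing group (?:...) is acceptable; the sample text and variable names below are made up for illustration:

```python
import re

# Same pattern as above, with the first group as written (capturing)...
capturing = re.compile(r"(\([0-9]{3}\)[\s-]?|[0-9]{3}-)[0-9]{3}-[0-9]{4}")
# ...and with it turned into a non-capturing group.
non_capturing = re.compile(r"(?:\([0-9]{3}\)[\s-]?|[0-9]{3}-)[0-9]{3}-[0-9]{4}")

# Made-up page text mixing real-looking numbers with a 10-digit id.
text = 'Call (925) 798-4950 or 213-629-1823; id=1481240672 is not a phone.'

print(capturing.findall(text))      # ['(925) ', '213-']
print(non_capturing.findall(text))  # ['(925) 798-4950', '213-629-1823']
```

Using finditer and calling group() on each match would work equally well without changing the pattern.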
You can avoid matching inside ids, other attributes, or HTML markup altogether by searching only the plain text of the web page. You can get that text by running the page content through the BeautifulSoup HTML parser:
from urllib2 import urlopen as uReq
from bs4 import BeautifulSoup
page_text = BeautifulSoup(uReq(url), "html.parser").get_text()
Then, as Jake mentioned in comments, you can make your regular expression more reliable:
Find phone numbers in python script
I am new to Python. A coder helped me out by giving me some code to parse HTML. I'm having trouble understanding how it works. My idea is for it to grab (consume?) HTML from funtweets.com/random and basically tell me a funny joke in the morning as an alarm clock. It currently extracts all jokes on the page and I only want one. Either modifying the code or a detailed explanation as to how the code works would be helpful to me. This is the code:
import re
import urllib2
page = urllib2.urlopen("http://www.m.funtweets.com/random").read()
user = re.compile(r'<span>#</span>(\w+)')
text = re.compile(r"</b></a> (\w.*)")
user_lst =[match.group(1) for match in re.finditer(user, page)]
text_lst =[match.group(1) for match in re.finditer(text, page)]
for _user, _text in zip(user_lst, text_lst):
    print '#{0}\n{1}\n'.format(_user,_text)
user3530608 you want one match, instead of iterating through matches?
This is a nice way to get started with python regular expressions.
Here is a small tweak to your code. I don't have python in front of me to test it, so let me know if you run into any issues.
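import re
import urllib2
page = urllib2.urlopen("http://www.m.funtweets.com/random").read()
# re.search returns only the first match, so no loop is needed
umatch = re.search(r"<span>#</span>(\w+)", page)
user = umatch.group(1)  # group(1) is the captured user name, not the whole match
utext = re.search(r"</b></a> (\w.*)", page)
text = utext.group(1)  # group(1) is the joke text without the leading tags
print '#{0}\n{1}\n'.format(user,text)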
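For anyone unsure what group() returns, here is a small Python 3 sketch on a made-up string shaped like the markup those patterns expect (the real page may look different). group() with no argument is the whole match including the tags, while group(1) is only the parenthesised part:

```python
import re

# Made-up stand-in for the page; two "jokes" to show only the first is used.
page = ('<a href="/u"><b><span>#</span>some_user</b></a> '
        'Why did the chicken cross the road?\n'
        '<a href="/v"><b><span>#</span>other_user</b></a> Second joke.\n')

umatch = re.search(r'<span>#</span>(\w+)', page)
utext = re.search(r'</b></a> (\w.*)', page)

# re.search returns None when nothing matches, so check before using it.
if umatch and utext:
    print(umatch.group())   # <span>#</span>some_user
    print(umatch.group(1))  # some_user
    print(utext.group(1))   # Why did the chicken cross the road?
```

Because . does not match a newline by default, (\w.*) stops at the end of the first joke's line.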
Although you can parse HTML with regex, I strongly suggest you use a third-party Python library.
My favorite HTML parsing library is PyQuery; you can use it much like jQuery, for example:
from pyquery import PyQuery as pq
page=pq(url='http://www.m.funtweets.com/random')
users=page("#user_id")
a_first=page("a:first")
...
You can find it here: https://pypi.python.org/pypi/pyquery
Just:
pip install PyQuery
or
easy_install PyQuery
You'll love it !
Another htmlparse-lib: https://pypi.python.org/pypi/beautifulsoup4/4.3.2
If anyone is interested in getting only one joke from the html with no html tags, here is the final code:
import re
import urllib2
def remove_html_tags(text):
    pattern = re.compile(r'</b></a>')
    return pattern.sub('', text)
page = urllib2.urlopen("http://www.m.funtweets.com/random").read()
umatch = re.search(r"<span>#</span>(\w+)", page)
user = umatch.group()
utext = re.search(r"</b></a> (\w.*)", page)
text = utext.group()
print remove_html_tags(text)
I need to extract the meta keywords from a web page using Python. I was thinking that this could be done using urllib or urllib2, but I'm not sure. Anyone have any ideas?
I am using Python 2.6 on Windows XP
lxml is faster than BeautifulSoup (I think) and has much better functionality, while remaining relatively easy to use. Example:
52> from urllib import urlopen
53> from lxml import etree
54> f = urlopen( "http://www.google.com" ).read()
55> tree = etree.HTML( f )
61> m = tree.xpath( "//meta" )
62> for i in m:
..>     print etree.tostring( i )
..>
<meta http-equiv="content-type" content="text/html; charset=ISO-8859-2"/>
Edit: another example.
75> f = urlopen( "http://www.w3schools.com/XPath/xpath_syntax.asp" ).read()
76> tree = etree.HTML( f )
85> tree.xpath( "//meta[@name='Keywords']" )[0].get("content")
85> "xml,tutorial,html,dhtml,css,xsl,xhtml,javascript,asp,ado,vbscript,dom,sql,colors,soap,php,authoring,programming,training,learning,beginner's guide,primer,lessons,school,howto,reference,examples,samples,source code,tags,demos,tips,links,FAQ,tag list,forms,frames,color table,w3c,cascading style sheets,active server pages,dynamic html,internet,database,development,Web building,Webmaster,html guide"
BTW: XPath is worth knowing.
Another edit:
Alternatively, you can just use regexp:
87> f = urlopen( "http://www.w3schools.com/XPath/xpath_syntax.asp" ).read()
88> import re
101> re.search( "<meta name=\"Keywords\".*?content=\"([^\"]*)\"", f ).group( 1 )
101>"xml,tutorial,html,dhtml,css,xsl,xhtml,javascript,asp,ado,vbscript,dom,sql, ...etc...
...but I find it less readable and more error prone (but involves only standard module and still fits on one line).
BeautifulSoup is a great way to parse HTML with Python.
Particularly, check out the findAll method:
http://www.crummy.com/software/BeautifulSoup/documentation.html
Why not use a regular expression?
keywordregex = re.compile(r'<meta\sname=["\']keywords["\']\scontent=["\'](.*?)["\']\s/>')
keywordlist = keywordregex.findall(html)
if len(keywordlist) > 0:
    keywordlist = keywordlist[0]
    keywordlist = keywordlist.split(", ")
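Note that a pattern like the one above assumes name comes before content, a lowercase "keywords", and a trailing " />". As a rough alternative that does not depend on attribute order, here is a Python 3 sketch using the standard library's html.parser; the MetaKeywords class and the sample markup are made up for this example and are not from any of the answers:

```python
from html.parser import HTMLParser


class MetaKeywords(HTMLParser):
    """Collects the content of <meta name="keywords"> (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.keywords = []

    def handle_starttag(self, tag, attrs):
        if tag != 'meta':
            return
        # attrs is a list of (name, value) pairs; names arrive lowercased.
        attr = dict(attrs)
        if (attr.get('name') or '').lower() == 'keywords':
            content = attr.get('content') or ''
            self.keywords = [k.strip() for k in content.split(',') if k.strip()]


# Made-up sample: content comes first and "Keywords" is capitalised.
sample = '<head><meta content="xml, tutorial,html" name="Keywords"></head>'
parser = MetaKeywords()
parser.feed(sample)
print(parser.keywords)  # ['xml', 'tutorial', 'html']
```

Splitting on "," and stripping each item also copes with pages that separate keywords with or without a space.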
I have written a Python program to find the carrier of a cell phone given the number. It downloads the source of http://www.whitepages.com/carrier_lookup?carrier=other&number_0=1112223333&response=1 (where 1112223333 is the phone number to lookup) and saves this as carrier.html. In the source, the carrier is in the line after the [div class="carrier_result"] tag. (switch in < and > for [ and ], as stackoverflow thought I was trying to format using the html and would not display it.)
My program currently searches the file and finds the line containing the div tag, but now I need a way to store the next line after that as a string. My current code is: http://pastebin.com/MSDN0vbC
What you really want to be doing is parsing the HTML properly. Use the BeautifulSoup library - it's wonderful at doing so.
Sample code:
import urllib2, BeautifulSoup
opener = urllib2.build_opener()
opener.addheaders[0] = ('User-agent', 'Mozilla/5.1')
response = opener.open('http://www.whitepages.com/carrier_lookup?carrier=other&number_0=1112223333&response=1').read()
bs = BeautifulSoup.BeautifulSoup(response)
print bs.findAll('div', attrs={'class': 'carrier_result'})[0].next.strip()
You should be using a HTML parser such as BeautifulSoup or lxml instead.
To get the next line, you can use:
htmlsource = open('carrier.html', 'r')
for line in htmlsource:
    if '<div class="carrier_result">' in line:
        nextline = next(htmlsource)
        print(nextline)
A "better" way is to split on </div> and then pick out the part you want, since sometimes the content you want sits on the same line as the tag. In that case, using next() will give the wrong result. For example:
data=open("carrier.html").read().split("</div>")
for item in data:
    if '<div class="carrier_result">' in item:
        print item.split('<div class="carrier_result">')[-1].strip()
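To show why splitting on </div> is more forgiving than reading the next line, here is a small Python 3 sketch of the same idea applied to in-memory strings. The carrier_from_html function and the "Example Carrier" text are made up for illustration; the real page layout is an assumption.

```python
def carrier_from_html(source):
    """Return the text right after the carrier_result div, or None."""
    marker = '<div class="carrier_result">'
    for chunk in source.split('</div>'):
        if marker in chunk:
            # Keep whatever follows the opening tag, minus whitespace.
            return chunk.split(marker)[-1].strip()
    return None


# Made-up samples: carrier on the same line, and on the following line.
same_line = '<div class="carrier_result">Example Carrier</div>'
next_line = '<div class="carrier_result">\n  Example Carrier\n</div>'

print(carrier_from_html(same_line))  # Example Carrier
print(carrier_from_html(next_line))  # Example Carrier
```

Both layouts give the same result, whereas the next-line approach would skip past the carrier in the first one.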
By the way, if it's possible, try to use Python's own web modules, like urllib or urllib2, instead of calling external wget.