Extract Meta Keywords From Webpage? - Python

I need to extract the meta keywords from a web page using Python. I was thinking that this could be done using urllib or urllib2, but I'm not sure. Anyone have any ideas?
I am using Python 2.6 on Windows XP

lxml is faster than BeautifulSoup (I think) and has much better functionality, while remaining relatively easy to use. Example:
>>> from urllib import urlopen
>>> from lxml import etree
>>> f = urlopen("http://www.google.com").read()
>>> tree = etree.HTML(f)
>>> m = tree.xpath("//meta")
>>> for i in m:
...     print etree.tostring(i)
...
<meta http-equiv="content-type" content="text/html; charset=ISO-8859-2"/>
Edit: another example.
>>> f = urlopen("http://www.w3schools.com/XPath/xpath_syntax.asp").read()
>>> tree = etree.HTML(f)
>>> tree.xpath("//meta[@name='Keywords']")[0].get("content")
"xml,tutorial,html,dhtml,css,xsl,xhtml,javascript,asp,ado,vbscript,dom,sql,colors,soap,php,authoring,programming,training,learning,beginner's guide,primer,lessons,school,howto,reference,examples,samples,source code,tags,demos,tips,links,FAQ,tag list,forms,frames,color table,w3c,cascading style sheets,active server pages,dynamic html,internet,database,development,Web building,Webmaster,html guide"
BTW: XPath is worth knowing.
Another edit:
Alternatively, you can just use a regexp:
>>> f = urlopen("http://www.w3schools.com/XPath/xpath_syntax.asp").read()
>>> import re
>>> re.search("<meta name=\"Keywords\".*?content=\"([^\"]*)\"", f).group(1)
"xml,tutorial,html,dhtml,css,xsl,xhtml,javascript,asp,ado,vbscript,dom,sql, ...etc...
...but I find it less readable and more error-prone (though it involves only a standard module and still fits on one line).

BeautifulSoup is a great way to parse HTML with Python.
Particularly, check out the findAll method:
http://www.crummy.com/software/BeautifulSoup/documentation.html
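For instance, a minimal sketch (assuming the page actually carries a keywords meta tag; BeautifulSoup 3 import style, to match the Python 2.6 setup above):
from urllib import urlopen
from BeautifulSoup import BeautifulSoup

html = urlopen("http://www.w3schools.com/XPath/xpath_syntax.asp").read()
soup = BeautifulSoup(html)
# findAll returns every matching tag; filter on the name attribute
tags = soup.findAll('meta', attrs={'name': 'Keywords'})
if tags:
    print tags[0]['content']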

Why not use a regular expression?
import re

keywordregex = re.compile('<meta\sname=["\']keywords["\']\scontent=["\'](.*?)["\']\s/>')
keywordlist = keywordregex.findall(html)
if len(keywordlist) > 0:
    keywordlist = keywordlist[0]
    keywordlist = keywordlist.split(", ")

Related

count the number of images on a webpage, using urllib

For a class, I have an exercise where I need to count the number of images on any given web page. I know that every image starts with <img, so I am using a regexp to try and locate them. But I keep getting a count of one, which I know is wrong. What is wrong with my code:
import sys
import urllib.request
import re

img_pat = re.compile('<img.*>', re.I)

def get_img_cnt(url):
    try:
        w = urllib.request.urlopen(url)
    except IOError:
        sys.stderr.write("Couldn't connect to %s\n" % url)
        sys.exit(1)
    contents = str(w.read())
    img_num = len(img_pat.findall(contents))
    return img_num

print(get_img_cnt('http://www.americascup.com/en/schedules/races'))
Don't ever use regex for parsing HTML; use an HTML parser, like lxml or BeautifulSoup. Here's a working example of how to get the img tag count using BeautifulSoup and requests:
from bs4 import BeautifulSoup
import requests

def get_img_cnt(url):
    response = requests.get(url)
    soup = BeautifulSoup(response.content)
    return len(soup.find_all('img'))

print(get_img_cnt('http://www.americascup.com/en/schedules/races'))
Here's a working example using lxml and requests:
from lxml import etree
import requests

def get_img_cnt(url):
    response = requests.get(url)
    parser = etree.HTMLParser()
    root = etree.fromstring(response.content, parser=parser)
    return int(root.xpath('count(//img)'))

print(get_img_cnt('http://www.americascup.com/en/schedules/races'))
Both snippets print 106.
Also see:
Python Regex - Parsing HTML
Python regular expression for HTML parsing (BeautifulSoup)
Hope that helps.
Ahhh regular expressions.
Your regex pattern <img.*> says "find me something that starts with <img and stuff, and make sure it ends with >."
Regular expressions are greedy, though; it'll fill that .* with literally everything it can while leaving a single > character somewhere afterwards to satisfy the pattern. In this case, it would go all the way to the end, to the closing </html>, and say "look! I found a > right there!"
You should come up with the right count by making .* non-greedy, like this:
<img.*?>
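To see the difference concretely, here is a small self-contained check on a made-up two-image snippet:
import re

html = '<img src="a.png"> some text <img src="b.png"> more text'
# Greedy: .* runs to the LAST '>', so the two tags collapse into one match
print len(re.findall('<img.*>', html))   # 1
# Non-greedy: .*? stops at the first '>', so each tag matches on its own
print len(re.findall('<img.*?>', html))  # 2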
Your regular expression is greedy, so it matches much more than you want. I suggest using an HTML parser.
img_pat = re.compile('<img.*?>',re.I) will do the trick if you must do it the regex way. The ? makes it non-greedy.
A good website for checking what your regex matches on the fly: http://www.pyregex.com/
Learn more about regexes: http://docs.python.org/2/library/re.html

Python Yahoo Stock Exchange (Web Scraping)

I'm having trouble with the following code. It's supposed to print the stock prices by accessing Yahoo Finance, but I can't figure out why it's returning empty strings?
import urllib
import re

symbolslist = ["aapl", "spy", "goog", "nflx"]
i = 0
while i < len(symbolslist):
    url = "http://finance.yahoo.com/q?s=" + symbolslist[i] + "&q1=1"
    htmlfile = urllib.urlopen(url)
    htmltext = htmlfile.read()
    regex = '<span id="yfs_l84_' + symbolslist[i] + '">(.+?)</span>'
    pattern = re.compile(regex)
    price = re.findall(pattern, htmltext)
    print price
    i += 1
Edit: It works fine now, it was a syntax error. Edited the code above as well.
These are just a few helpful tips for Python development (and scraping):
Python Requests library.
The Python requests library is excellent at simplifying the HTTP request process; see the sketch just below.
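A minimal sketch, reusing the question's Yahoo quote URL:
import requests

# One call replaces urlopen() + read(); .text is the decoded HTML body
response = requests.get("http://finance.yahoo.com/q?s=aapl&q1=1")
htmltext = response.text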
No need to use a while loop
for loops are really useful in this situation.
symbolslist = ["aapl", "spy", "goog", "nflx"]
for symbol in symbolslist:
    # Do logic here...
Use xpath over regular expressions
import requests
import lxml.html

url = "http://www.google.co.uk/finance?q=" + symbol + "&q1=1"
r = requests.get(url)
xpath = '//your/xpath'
root = lxml.html.fromstring(r.content)
No need to compile your regular expressions each time.
Compiling regexes takes time and effort, so hoist the compilation out of your loop.
# Compile once, matching any symbol in the span id, instead of
# building a new pattern on every iteration
pattern = re.compile(r'<span id="yfs_l84_[a-z]+">(.+?)</span>')
for symbol in symbolslist:
    # do logic
External Libraries
As mentioned in the comment by drewk, both Pandas and matplotlib have native functions to get Yahoo quotes, or you can use the ystockquote library to scrape from Yahoo. It is used like so:
#!/usr/bin/env python
import ystockquote

symbolslist = ["aapl", "spy", "goog", "nflx"]
for symbol in symbolslist:
    print(ystockquote.get_price(symbol))

Python if-statement based on content of HTML title tag

We are trying to write a Python script to parse HTML with the following conditions:
If the HTML title tag contains the string "Record doesn't exist," then continue running a loop.
If NOT, download the page contents.
How do we write an if-statement based on the conditions?
We're aware of Beautiful Soup, unfortunately we don't have permission to install it on the machine we're using.
Our code:
import urllib2

opp1 = 1
oppn = 2
for opp in range(opp1, oppn + 1):
    oppurl = (something.com)
    response = urllib2.urlopen(oppurl)
    html = response.read()
    # syntax error on the next line #
    if Title == 'Record doesn't exist':
        continue
    else:
        oppfilename = 'work/opptest' + str(opp) + '.htm'
        oppfile = open(oppfilename, 'w')
        opp.write(opphtml)
        print 'Wrote ', oppfile
        votefile.close()
You can use a regular expression to get the contents of the title tag:
import re

m = re.search('<title>(.*?)</title>', html)
if m:
    title = m.group(1)
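Plugged into the question's loop, that looks roughly like this (a sketch; opp1, oppn, and oppurl are the question's own variables, and the filename pattern follows the question):
import re
import urllib2

for opp in range(opp1, oppn + 1):
    html = urllib2.urlopen(oppurl).read()
    m = re.search('<title>(.*?)</title>', html)
    # skip pages whose title reports a missing record
    if m and "Record doesn't exist" in m.group(1):
        continue
    oppfile = open('work/opptest' + str(opp) + '.htm', 'w')
    oppfile.write(html)
    oppfile.close()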
Try Beautiful Soup. It's an amazingly easy-to-use library for parsing HTML documents and fragments.
import urllib2
from BeautifulSoup import BeautifulSoup

for opp in range(opp1, oppn + 1):
    oppurl = (www.myhomepage.com)
    response = urllib2.urlopen(oppurl)
    html = response.read()
    soup = BeautifulSoup(html)
    # soup.head.title is a Tag; compare its string contents
    if soup.head.title.string == "Record doesn't exist":
        continue
    else:
        oppfilename = 'work/opptest' + str(opp) + '.htm'
        oppfile = open(oppfilename, 'w')
        oppfile.write(html)
        print 'Wrote ', oppfilename
        oppfile.close()
---- EDIT ----
If Beautiful Soup isn't an option, I personally would resort to a regular expression. However, I refuse to admit that in public, as I can't let people know I would stoop to the easy solution. Let's see what's in that "batteries included" bag of tricks.
HTMLParser looks promising; let's see if we can bend it to our will.
from HTMLParser import HTMLParser

def titleFinder(html):
    class MyHTMLParser(HTMLParser):
        intitle = False  # guards handle_data before the first start tag
        def handle_starttag(self, tag, attrs):
            self.intitle = tag == "title"
        def handle_data(self, data):
            if self.intitle:
                self.title = data
    parser = MyHTMLParser()
    parser.feed(html)
    return parser.title

>>> print titleFinder('<html><head><title>Test</title></head>'
...                   '<body><h1>Parse me!</h1></body></html>')
Test
That's incredibly painful. It's almost as wordy as Java. (Just kidding.)
What else is there? There's xml.dom.minidom, a "lightweight DOM implementation". I like the sound of "lightweight"; it means we can do it with one line of code, right?
import xml.dom.minidom
html = '<html><head><title>Test</title></head><body><h1>Parse me!</h1></body></html>'
title = ''.join(node.data for node in xml.dom.minidom.parseString(html).getElementsByTagName("title")[0].childNodes if node.nodeType == node.TEXT_NODE)
>>> print title
Test
And we have our one-liner!
So I heard that these regular expression things are pretty efficient at extracting bits of text from HTML. I think you should use those.

Find and append each reference to a html link - Python

I have an HTML file I got from Wikipedia and would like to find every link on the page, such as /wiki/Absinthe, and replace it with the current directory added to the front, such as /home/fergus/wikiget/wiki/Absinthe, so:
<a href="/wiki/Absinthe">Absinthe</a>
becomes:
<a href="/home/fergus/wikiget/wiki/Absinthe">Absinthe</a>
and this is throughout the whole document.
Do you have any ideas? I'm happy to use BeautifulSoup or Regex!
If that's really all you have to do, you could do it with sed and its -i option to rewrite the file in-place:
sed -e 's,href="/wiki,href="/home/fergus/wikiget/wiki,' wiki-file.html
However, here's a Python solution using the lovely lxml API, in case you need to do anything more complex or you might have badly formed HTML, etc.:
from lxml import etree
import re

parser = etree.HTMLParser()
with open("wiki-file.html") as fp:
    tree = etree.parse(fp, parser)

for e in tree.xpath("//a[@href]"):
    link = e.attrib['href']
    if re.search('^/wiki', link):
        e.attrib['href'] = '/home/fergus/wikiget' + link

# Or you can just specify the same filename to overwrite it:
with open("wiki-file-rewritten.html", "w") as fp:
    fp.write(etree.tostring(tree))
Note that lxml is probably a better option than BeautifulSoup for this kind of task nowadays, for the reasons given by BeautifulSoup's author.
This is a solution using the re module:
#!/usr/bin/env python
import re

open('output.html', 'w').write(re.sub('href="http://en.wikipedia.org', 'href="/home/fergus/wikiget', open('file.html').read()))
Here's another one without using re:
#!/usr/bin/env python
open('output.html', 'w').write(open('file.html').read().replace('href="http://en.wikipedia.org', 'href="/home/fergus/wikiget'))
You can use a function with re.sub:
import re

def match(m):
    return '<a href="/home/fergus/wikiget' + m.group(1) + '">'

r = re.compile(r'<a\shref="([^"]+)">')
r.sub(match, yourtext)
An example:
>>> s = '<a href="/wiki/Absinthe">Absinthe</a>'
>>> r.sub(match, s)
'<a href="/home/fergus/wikiget/wiki/Absinthe">Absinthe</a>'
from lxml import html

el = html.fromstring('<a href="/wiki/word">word</a>')
# or `el = html.parse(file_or_url).getroot()`

def repl(link):
    if link.startswith('/'):
        link = '/home/fergus/wikiget' + link
    return link

print(html.tostring(el))
el.rewrite_links(repl)
print(html.tostring(el))
Output
<a href="/wiki/word">word</a>
<a href="/home/fergus/wikiget/wiki/word">word</a>
You could also use the function lxml.html.rewrite_links() directly:
from lxml import html

def repl(link):
    if link.startswith('/'):
        link = '/home/fergus/wikiget' + link
    return link

print html.rewrite_links(htmlstr, repl)
I would do
import re

ch = '<a href="/wiki/Absinthe">Absinthe</a>'
r = re.compile(r'(<a\s+href=")(/wiki/[^"]+">[^<]+</a>)')
print ch
print
print r.sub('\\1/home/fergus/wikiget\\2', ch)
EDIT:
This solution has been said not to capture tags with additional attributes. I thought only a narrow pattern of string was being targeted, such as <a href="/wiki/WORD">WORD</a>.
If not, well, no problem; a solution with a simpler RE is easy to write:
r = re.compile('(<a\s+href="/)([^>]+">)')
ch = '<a href="/wiki/Aide:Homonymie" title="Aide:Homonymie">'
print ch
print r.sub('\\1home/fergus/wikiget/\\2',ch)
or why not:
r = re.compile('(<a\s+href="/)')
ch = '<a href="/wiki/Aide:Homonymie" title="Aide:Homonymie">'
print ch
print r.sub('\\1home/fergus/wikiget/',ch)

Simple scraping of youtube xml to get a Python list of videos

I have an XML feed, say:
http://gdata.youtube.com/feeds/api/videos/-/bass/fishing/
I want to get the list of hrefs for the videos:
['http://www.youtube.com/watch?v=aJvVkBcbFFY', 'ht....', ... ]
from xml.etree import cElementTree as ET
import urllib

def get_bass_fishing_URLs():
    results = []
    data = urllib.urlopen(
        'http://gdata.youtube.com/feeds/api/videos/-/bass/fishing/')
    tree = ET.parse(data)
    ns = '{http://www.w3.org/2005/Atom}'
    for entry in tree.findall(ns + 'entry'):
        for link in entry.findall(ns + 'link'):
            if link.get('rel') == 'alternate':
                results.append(link.get('href'))
    return results
since it appears that what you want are the so-called "alternate" links. The many small possible variations, if you want something slightly different, should (I hope) be clear from the above code, plus the standard Python library docs for ElementTree.
Have a look at Universal Feed Parser, which is an open source RSS and Atom feed parser for Python.
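For instance, a minimal sketch; feedparser exposes each Atom entry's alternate link as entry.link:
import feedparser

d = feedparser.parse('http://gdata.youtube.com/feeds/api/videos/-/bass/fishing/')
# entry.link holds the href of the entry's "alternate" link, i.e. the watch URL
hrefs = [entry.link for entry in d.entries]
print hrefs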
In such a simple case, this should be enough:
import re, urllib2
request = urllib2.urlopen("http://gdata.youtube.com/feeds/api/videos/-/bass/fishing/")
text = request.read()
videos = re.findall("http:\/\/www\.youtube\.com\/watch\?v=[\w-]+", text)
If you want to do more complicated stuff, parsing the XML will be better suited than regular expressions:
import urllib
from xml.dom import minidom

xmldoc = minidom.parse(urllib.urlopen('http://gdata.youtube.com/feeds/api/videos/-/bass/fishing/'))
links = xmldoc.getElementsByTagName('link')
hrefs = []
for link in links:
    if link.getAttribute('rel') == 'alternate':
        hrefs.append(link.getAttribute('href'))
print hrefs
