python lxml not showing all content

I am trying to scrape a specific section of a web page, and eventually calculate word frequency. But I am finding it difficult to get the entire text. As far as I understand from looking at the HTML code, my script omits the parts of that section that are separated by a line break but have no <br> tag.
My code:
from lxml import html as LH
import requests

scripturl = "http://www.springfieldspringfield.co.uk/view_episode_scripts.php?tv-show=the-sopranos&episode=s06e21"
scripthtml = requests.get(scripturl)
tree = LH.fromstring(scripthtml.content)
script = tree.xpath('//div[@class="scrolling-script-container"]/text()')
print script
print type(script)
This is the output:
["\n\n\n\n \t\t\t ( radio clicks, \r music plays ) \r \r Disc jockey: \r
New York's classic rock \r q104.", '3.', '
\r \r Good morning.', " \r I'm jim kerr.",
' \r \r Coming up \r
When I iterate over the result, I only get the phrases that follow a \r and are followed by a comma or quotation mark.
for res in script:
    print res
The output is:
q104.
3.
Good morning.
I'm jim kerr.
I am not confined to lxml, but because I am rather new, I am less familiar with other methods.

An lxml element has both a text and a tail attribute. You are searching for text, but if there is an HTML element embedded in the element (a br, for example), your search for text will only go as deep as the first piece of text the parser gets from that element.
Try:
script = tree.xpath('//div[@class="scrolling-script-container"]')
print " ".join((script[0].text or "", script[0].tail or ""))
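To see how text and tail split up around an embedded tag, here is a minimal sketch (the HTML fragment is made up for illustration):

from lxml import html as LH

frag = LH.fromstring('<div>first line<br>second line</div>')
print frag.text   # 'first line'  -- the text before the first child element
br = frag[0]      # the embedded <br> element
print br.tail     # 'second line' -- the text that follows the <br>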

This was bothering me, so I wrote out a solution:
import requests
from lxml import etree
from io import StringIO

parser = etree.HTMLParser()
base_url = "http://www.springfieldspringfield.co.uk/view_episode_scripts.php?tv-show=the-sopranos&episode=s06e21"
resp = requests.get(base_url)
root = etree.parse(StringIO(resp.text), parser)
script = root.xpath('//div[@class="scrolling-script-container"]')

text_list = []
for elem in script:
    print(elem.attrib)
    if hasattr(elem, 'text'):
        text_list.append(elem.text)
    if hasattr(elem, 'tail'):
        text_list.append(elem.tail)

for elem in text_list:
    # only gets the first block of text before
    # it encounters a br tag
    print(elem)

for elem in script:
    # prints everything
    for sib in elem.iter():
        print(sib.attrib)
        if hasattr(sib, 'text'):
            print(sib.text)
        if hasattr(sib, 'tail'):
            print(sib.tail)
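For the original word-frequency goal there is also a shortcut: every lxml element has an itertext() iterator (and lxml.html elements additionally have a text_content() method) that walks past embedded <br> tags and yields all of the text. A minimal sketch, reusing the script variable from the code above:

full_text = " ".join(t.strip() for t in script[0].itertext() if t.strip())
print(full_text)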

Related

Unable to get regex in python to match pattern

I'm trying to pull out a number from a copy of an HTML page which I got using urllib.request.
I've tried a few different patterns in regex but keep getting None as the output, so I'm clearly not formatting the pattern correctly but can't get it to work.
Below is a small part of the HTML I have in the string:
</ul>\n \n <p>* * * * *</p>\n -->\n \n <b>DistroWatch database summary</b><br/>\n <ul>\n <li>Number of all distributions in the database: 926<br/>\n <li>Number of <a href="search.php?status=Active">
I'm trying to just get the 926 out of the string, and my code is below. I can't figure out what I'm doing wrong:
import urllib.request
import re
page = urllib.request.urlopen('http://distrowatch.com/weekly.php?issue=current')
#print(page.read())
pageString = str(page.read())
#print(pageString)
DistroCount = re.search('^all distributions</a> in the database: ....<br/>\n$', pageString)
print(DistroCount)
Any help, pointers or resource suggestions would be much appreciated.
You can use BeautifulSoup to convert HTML to text, and then apply a simple regex to extract a number after a hardcoded string:
import urllib.request, re
from bs4 import BeautifulSoup
page = urllib.request.urlopen('http://distrowatch.com/weekly.php?issue=current')
html = page.read()
soup = BeautifulSoup(html, 'lxml')
text = soup.get_text()
m = re.search(r'all distributions in the database:\s*(\d+)', text)
if m:
    print(m.group(1))
    # => 926
Here,
soup.get_text() converts the HTML to plain text and stores it in the text variable
The all distributions in the database:\s*(\d+) regex matches the literal all distributions in the database:, then zero or more whitespace chars, and then captures one or more digits into Group 1 (with (\d+))
I think your problem is that you are reading the whole document into a single string but using "^" at the beginning of your regex and "$" at the end, so the regex can only match if the pattern spans the entire string.
Either drop ^ and $ (and the \n as well…), or process your document line by line.
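A minimal sketch of both options, assuming pageString already holds the decoded HTML of the page (the pattern is simplified from the question):

import re

# Option 1: drop the anchors (and the \n) so the pattern can match
# anywhere inside the string
m = re.search(r'in the database: (\d+)<br/>', pageString)
if m:
    print(m.group(1))

# Option 2: keep ^ and $ but make them match at line boundaries
# with re.MULTILINE, effectively processing the document line by line
for m in re.finditer(r'^.*in the database: (\d+)<br/>.*$', pageString, re.MULTILINE):
    print(m.group(1))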

Parse complex matching delimiters

Structures like HTML tags have an opening and a closing part, sharing an identical tag name that matches them to each other.
<tag> ... </tag>
I want to capture these pairs and their content using the pyparsing library. I know how to specify a single tag.
from pyparsing import SkipTo, makeHTMLTags
open, close = makeHTMLTags("tag")
(open + SkipTo(close) + close).parseString("<tag> Tag content </tag>")
# yields ['tag', False, 'Tag content ', '</tag>']
I am also aware that, when specifying multiple distinct tags, each of them needs a dedicated rule to prevent one tag from closing another. So when the set of tags is Or(("tag", "other")), simply extending the former example
from pyparsing import SkipTo, makeHTMLTags, Or
open, close = makeHTMLTags(Or(("tag", "other")))
(open + SkipTo(close) + close).parseString("<other><tag> Tag content </tag></other>")
# yields ['other', False, '<tag> Tag content ', '</tag>']
yields mismatched tags. The parser closes the opening <other> with </tag>. This can be amended by specifying dedicated rules for each tag.
from pyparsing import SkipTo, makeHTMLTags, Or
Or((
    open + SkipTo(close) + close
    for open, close in
    map(makeHTMLTags, ("tag", "other"))
)).parseString("<other><tag> Tag content </tag></other>")
# yields ['other', False, '<tag> Tag content </tag>', '</other>']
Now I would, for example, like to find all tags starting with t, thus searching for Word('t', alphas) instead of Or(("tag", "other", ...)). How can I make tags match when the set of tags to match is possibly infinite?
I'm not familiar with the pyparsing module, but your problem, it seems, can be solved with lxml (a library for processing XML and HTML in Python). The following is my example code using lxml:
# -*- coding: utf-8 -*-
from lxml import etree

def pprint(l):
    for i, tag in enumerate(l):
        print 'Matched #%s: tag name=%s, content=%s' % (i + 1, tag.tag, tag.text)

def main():
    # Finding all <tag> tags
    pprint(etree.HTML('<tag>Tag content</tag>').xpath("//tag"))
    # Finding all tags starting with "t"
    pprint(etree.HTML('<tag>tag1 content</tag><tag2>tag2 content</tag2><other>other</other>').xpath(
        "//*[starts-with(local-name(), 't')]"))

if __name__ == '__main__':
    main()
This will output:
Matched #1: tag name=tag, content=Tag content
Matched #1: tag name=tag, content=tag1 content
Matched #2: tag name=tag2, content=tag2 content
Hope it helps.
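A small follow-up sketch of my own (not from the answer above), in case the full markup between a matched opening and closing tag is needed: XPath selects whole elements, so lxml can re-serialize each match, nested tags included.

from lxml import etree

root = etree.HTML('<other><tag> Tag content </tag></other>')
for el in root.xpath("//*[starts-with(local-name(), 't')]"):
    # tostring() serializes the element itself, i.e. the matched
    # <tag> ... </tag> pair and everything in between
    print etree.tostring(el)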

count the number of images on a webpage, using urllib

For a class, I have an exercise where I need to count the number of images on any given web page. I know that every image starts with <img, so I am using a regexp to try and locate them. But I keep getting a count of one, which I know is wrong. What is wrong with my code:
import sys
import urllib.request
import re

img_pat = re.compile('<img.*>', re.I)

def get_img_cnt(url):
    try:
        w = urllib.request.urlopen(url)
    except IOError:
        sys.stderr.write("Couldn't connect to %s " % url)
        sys.exit(1)
    contents = str(w.read())
    img_num = len(img_pat.findall(contents))
    return img_num

print(get_img_cnt('http://www.americascup.com/en/schedules/races'))
Don't ever use regex for parsing HTML; use an HTML parser, like lxml or BeautifulSoup. Here's a working example of how to get the img tag count using BeautifulSoup and requests:
from bs4 import BeautifulSoup
import requests

def get_img_cnt(url):
    response = requests.get(url)
    soup = BeautifulSoup(response.content)
    return len(soup.find_all('img'))

print(get_img_cnt('http://www.americascup.com/en/schedules/races'))
Here's a working example using lxml and requests:
from lxml import etree
import requests

def get_img_cnt(url):
    response = requests.get(url)
    parser = etree.HTMLParser()
    root = etree.fromstring(response.content, parser=parser)
    return int(root.xpath('count(//img)'))

print(get_img_cnt('http://www.americascup.com/en/schedules/races'))
Both snippets print 106.
Also see:
Python Regex - Parsing HTML
Python regular expression for HTML parsing (BeautifulSoup)
Hope that helps.
Ahhh, regular expressions.
Your regex pattern <img.*> says "find me something that starts with <img and stuff, and make sure it ends with >".
Regular expressions are greedy, though; it'll fill that .* with literally everything it can while leaving a single > character somewhere afterwards to satisfy the pattern. In this case, it would go all the way to the end of the document and say "look! I found a > right there!"
You should come up with the right count by making .* non-greedy, like this:
<img.*?>
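A quick illustration of the difference, on a made-up two-image snippet:

import re

html = '<img src="a.png"> some text <img src="b.png">'
print(re.findall('<img.*>', html))   # greedy: one match spanning both tags
print(re.findall('<img.*?>', html))  # non-greedy: ['<img src="a.png">', '<img src="b.png">']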
Your regular expression is greedy, so it matches much more than you want. I suggest using an HTML parser.
img_pat = re.compile('<img.*?>',re.I) will do the trick if you must do it the regex way. The ? makes it non-greedy.
A good website for checking what your regex matches on the fly: http://www.pyregex.com/
Learn more about regexes: http://docs.python.org/2/library/re.html

Find and append each reference to an HTML link

I have an HTML file I got from Wikipedia and would like to find every link on the page, such as /wiki/Absinthe, and replace it with the current directory added to the front, such as /home/fergus/wikiget/wiki/Absinthe, so:
<a href="/wiki/Absinthe">Absinthe</a>
becomes:
<a href="/home/fergus/wikiget/wiki/Absinthe">Absinthe</a>
and this is throughout the whole document.
Do you have any ideas? I'm happy to use BeautifulSoup or Regex!
If that's really all you have to do, you could do it with sed and its -i option to rewrite the file in-place:
sed -e 's,href="/wiki,href="/home/fergus/wikiget/wiki,' wiki-file.html
However, here's a Python solution using the lovely lxml API, in case you need to do anything more complex or you might have badly formed HTML, etc.:
from lxml import etree
import re

parser = etree.HTMLParser()
with open("wiki-file.html") as fp:
    tree = etree.parse(fp, parser)

for e in tree.xpath("//a[@href]"):
    link = e.attrib['href']
    if re.search('^/wiki', link):
        e.attrib['href'] = '/home/fergus/wikiget' + link

# Or you can just specify the same filename to overwrite it:
with open("wiki-file-rewritten.html", "w") as fp:
    fp.write(etree.tostring(tree))
Note that lxml is probably a better option than BeautifulSoup for this kind of task nowadays, for the reasons given by BeautifulSoup's author.
This is a solution using the re module:
#!/usr/bin/env python
import re
open('output.html', 'w').write(re.sub('href="http://en.wikipedia.org', 'href="/home/fergus/wikiget', open('file.html').read()))
Here's another one without using re:
#!/usr/bin/env python
open('output.html', 'w').write(open('file.html').read().replace('href="http://en.wikipedia.org', 'href="/home/fergus/wikiget'))
You can use a function with re.sub:
import re

def match(m):
    return '<a href="/home/fergus/wikiget' + m.group(1) + '">'

r = re.compile(r'<a\shref="([^"]+)">')
r.sub(match, yourtext)
An example:
>>> s = '<a href="/wiki/Absinthe">Absinthe</a>'
>>> r.sub(match, s)
'<a href="/home/fergus/wikiget/wiki/Absinthe">Absinthe</a>'
from lxml import html

el = html.fromstring('<a href="/wiki/word">word</a>')
# or `el = html.parse(file_or_url).getroot()`

def repl(link):
    if link.startswith('/'):
        link = '/home/fergus/wikiget' + link
    return link

print(html.tostring(el))
el.rewrite_links(repl)
print(html.tostring(el))
Output:
<a href="/wiki/word">word</a>
<a href="/home/fergus/wikiget/wiki/word">word</a>
You could also use the function lxml.html.rewrite_links() directly:
from lxml import html

def repl(link):
    if link.startswith('/'):
        link = '/home/fergus/wikiget' + link
    return link

print html.rewrite_links(htmlstr, repl)
I would do:
import re
ch = '<a href="/wiki/Absinthe">Absinthe</a>'
r = re.compile('(<a\s+href=")(/wiki/[^"]+">[^<]+</a>)')
print ch
print
print r.sub('\\1/home/fergus/wikiget\\2', ch)
EDIT:
This solution has been said not to capture tags with additional attributes. I thought only a narrow pattern of string was being targeted, such as <a href="/wiki/WORD">WORD</a>.
If not, well, no problem: a solution with a simpler RE is easy to write:
r = re.compile('(<a\s+href="/)([^>]+">)')
ch = '<a href="/wiki/Aide:Homonymie" title="Aide:Homonymie">'
print ch
print r.sub('\\1home/fergus/wikiget/\\2',ch)
or why not:
r = re.compile('(<a\s+href="/)')
ch = '<a href="/wiki/Aide:Homonymie" title="Aide:Homonymie">'
print ch
print r.sub('\\1home/fergus/wikiget/',ch)

How do I print a line following a line containing certain text in a saved file in Python?

I have written a Python program to find the carrier of a cell phone given the number. It downloads the source of http://www.whitepages.com/carrier_lookup?carrier=other&number_0=1112223333&response=1 (where 1112223333 is the phone number to look up) and saves this as carrier.html. In the source, the carrier is in the line after the <div class="carrier_result"> tag.
My program currently searches the file and finds the line containing the div tag, but now I need a way to store the next line after that as a string. My current code is: http://pastebin.com/MSDN0vbC
What you really want to be doing is parsing the HTML properly. Use the BeautifulSoup library - it's wonderful at doing so.
Sample code:
import urllib2, BeautifulSoup
opener = urllib2.build_opener()
opener.addheaders[0] = ('User-agent', 'Mozilla/5.1')
response = opener.open('http://www.whitepages.com/carrier_lookup?carrier=other&number_0=1112223333&response=1').read()
bs = BeautifulSoup.BeautifulSoup(response)
print bs.findAll('div', attrs={'class': 'carrier_result'})[0].next.strip()
You should be using an HTML parser such as BeautifulSoup or lxml instead.
To get the next line, you can use:
htmlsource = open('carrier.html', 'r')
for line in htmlsource:
    if '<div class="carrier_result">' in line:
        nextline = htmlsource.next()
        print nextline
A "better" way is to split on </div>, then get the things you want, as sometimes the stuff you want can occur all in one line. So using next() if give wrong result.eg
data = open("carrier.html").read().split("</div>")
for item in data:
    if '<div class="carrier_result">' in item:
        print item.split('<div class="carrier_result">')[-1].strip()
By the way, if it's possible, try to use Python's own web modules, like urllib or urllib2, instead of calling an external wget.
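A minimal sketch of that suggestion, fetching the page with urllib2 (Python 2) instead of shelling out to wget; the URL is the one from the question, and carrier.html is the same local filename used above:

import urllib2

url = 'http://www.whitepages.com/carrier_lookup?carrier=other&number_0=1112223333&response=1'
html = urllib2.urlopen(url).read()
with open('carrier.html', 'w') as f:
    f.write(html)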
