Parse complex matching delimiters - python

Structures like HTML tags have an opening and a closing part that share an identical tag name, which matches them to each other.
<tag> ... </tag>
I want to capture these pairs and their content using the pyparsing library. I know how to specify a single tag.
from pyparsing import SkipTo, makeHTMLTags
open, close = makeHTMLTags("tag")
(open + SkipTo(close) + close).parseString("<tag> Tag content </tag>")
# yields ['tag', False, 'Tag content ', '</tag>']
I am also aware that, when specifying multiple distinct tags, each of them needs a dedicated rule to prevent one tag from being closed by another. So when the set of tags is Or(("tag", "other")), simply extending the former example
from pyparsing import SkipTo, makeHTMLTags, Or
open, close = makeHTMLTags(Or(("tag", "other")))
(open + SkipTo(close) + close).parseString("<other><tag> Tag content </tag></other>")
# yields ['other', False, '<tag> Tag content ', '</tag>']
yields mismatched tags. The parser closes the opening <other> with </tag>. This can be amended by specifying dedicated rules for each tag.
from pyparsing import SkipTo, makeHTMLTags, Or
Or((
    open + SkipTo(close) + close
    for open, close in
    map(makeHTMLTags, ("tag", "other"))
)).parseString("<other><tag> Tag content </tag></other>")
# yields ['other', False, '<tag> Tag content </tag>', '</other>']
Now I would, for example, like to find all tags starting with t, thus searching for Word('t', alphas) instead of Or(("tag", "other", ...)). How can I make tags match when the set of tags to match is possibly infinite?

I'm not familiar with the pyparsing module, but your problem looks solvable with lxml (a library for processing XML and HTML in Python). Following is my example code using lxml:
# -*- coding: utf-8 -*-
from lxml import etree

def pprint(l):
    for i, tag in enumerate(l):
        print 'Matched #%s: tag name=%s, content=%s' % (i + 1, tag.tag, tag.text)

def main():
    # Finding all <tag> tags
    pprint(etree.HTML('<tag>Tag content</tag>').xpath("//tag"))
    # Finding all tags starting with "t"
    pprint(etree.HTML('<tag>tag1 content</tag><tag2>tag2 content</tag2><other>other</other>').xpath(
        "//*[starts-with(local-name(), 't')]"))

if __name__ == '__main__':
    main()
This will output:
Matched #1: tag name=tag, content=Tag content
Matched #1: tag name=tag, content=tag1 content
Matched #2: tag name=tag2, content=tag2 content
Hope it helps.
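For completeness, the pyparsing side of the question can also be handled: matchPreviousLiteral builds an expression that matches exactly the string a previous expression matched, so the closing tag is forced to repeat the opening tag's name. A minimal sketch (not from the original answer), assuming tag names starting with t as in the question:
from pyparsing import SkipTo, Suppress, Word, alphas, matchPreviousLiteral

name = Word("t", alphas)  # any tag name starting with "t"
open_tag = Suppress("<") + name("tag") + Suppress(">")
# matchPreviousLiteral re-matches the exact string that "name" just
# matched, so </this> can only close <this>, </that> only <that>, etc.
close_tag = Suppress("</") + matchPreviousLiteral(name) + Suppress(">")
rule = open_tag + SkipTo(close_tag)("body") + close_tag
print(rule.parseString("<this> Tag content </this>"))
This keeps a single rule even though the set of admissible tag names is infinite, because the matching constraint is enforced at parse time rather than enumerated up front.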

Related

Unable to get regex in python to match pattern

I'm trying to pull out a number from a copy of an HTML page which I got using urllib.request.
I've tried a few different patterns in regex, but I keep getting None as the output, so I'm clearly not formatting the pattern correctly and can't get it to work.
Below is a small part of the HTML I have in the string
</ul>\n \n <p>* * * * *</p>\n -->\n \n <b>DistroWatch database summary</b><br/>\n <ul>\n <li>Number of all distributions in the database: 926<br/>\n <li>Number of <a href="search.php?status=Active">
I'm trying to just get the 926 out of the string. My code is below, and I can't figure out what I'm doing wrong.
import urllib.request
import re
page = urllib.request.urlopen('http://distrowatch.com/weekly.php?issue=current')
#print(page.read())
pageString = str(page.read())
#print(pageString)
DistroCount = re.search('^all distributions</a> in the database: ....<br/>\n$', pageString)
print(DistroCount)
Any help, pointers or resource suggestions would be much appreciated.
You can use BeautifulSoup to convert HTML to text, and then apply a simple regex to extract a number after a hardcoded string:
import urllib.request, re
from bs4 import BeautifulSoup
page = urllib.request.urlopen('http://distrowatch.com/weekly.php?issue=current')
html = page.read()
soup = BeautifulSoup(html, 'lxml')
text = soup.get_text()
m = re.search(r'all distributions in the database:\s*(\d+)', text)
if m:
    print(m.group(1))
# => 926
Here, soup.get_text() converts the HTML to plain text and stores it in the text variable. The all distributions in the database:\s*(\d+) regex then matches the literal all distributions in the database:, skips zero or more whitespace chars, and captures one or more digits into Group 1 (with (\d+)).
I think your problem is that you are reading the whole document into a single string but using "^" at the beginning of your regex and "$" at the end, so the regex can only succeed if the pattern spans the entire string.
Either drop ^ and $ (and \n as well…), or process your document line by line.
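A minimal sketch of that fix (read the response once, decode the bytes, and drop the anchors so the pattern can match anywhere; the exact pattern is an assumption based on the snippet above):
import urllib.request
import re

page = urllib.request.urlopen('http://distrowatch.com/weekly.php?issue=current')
pageString = page.read().decode('utf-8', errors='replace')  # read() once; decode bytes to str
# no ^/$ anchors, so the pattern may match anywhere in the document
m = re.search(r'all distributions in the database: (\d+)<br/>', pageString)
if m:
    print(m.group(1))  # e.g. 926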

python lxml not showing all content

I am trying to scrape a specific section of a web page, and eventually calculate word frequency. But I am finding it difficult to get the entire text. As far as I understand from looking at the HTML code, my script omits the parts of that section that follow a line break (a <br> tag).
My code:
from lxml import html as LH
import requests

scripturl = "http://www.springfieldspringfield.co.uk/view_episode_scripts.php?tv-show=the-sopranos&episode=s06e21"
scripthtml = requests.get(scripturl)
tree = LH.fromstring(scripthtml.content)
script = tree.xpath('//div[@class="scrolling-script-container"]/text()')
print script
print type(script)
This is the output:
["\n\n\n\n \t\t\t ( radio clicks, \r music plays ) \r \r Disc jockey: \r
New York's classic rock \r q104.", '3.', '
\r \r Good morning.', " \r I'm jim kerr.",
' \r \r Coming up \r
When I iterate over the result, I only get the phrases that follow the \r characters:
for res in script:
    print res
The output is:
q104.
3.
Good morning.
I'm jim kerr.
I am not confined to lxml, but because I am rather new, I am less familiar with other methods.
An lxml element has both a text and a tail attribute. You are searching for text, but if there is an HTML element embedded in the element (a <br>, for example), your search for text will only go as deep as the first text node the parser gets from the element's text.
try:
script = tree.xpath('//div[@class="scrolling-script-container"]')
print " ".join((script[0].text or "", script[0].tail or ""))
This was bothering me, so I wrote out a solution:
import requests
import lxml
from lxml import etree
from io import StringIO

parser = etree.HTMLParser()
base_url = "http://www.springfieldspringfield.co.uk/view_episode_scripts.php?tv-show=the-sopranos&episode=s06e21"
resp = requests.get(base_url)
root = etree.parse(StringIO(resp.text), parser)
script = root.xpath('//div[@class="scrolling-script-container"]')
text_list = []
for elem in script:
    print(elem.attrib)
    if hasattr(elem, 'text'):
        text_list.append(elem.text)
    if hasattr(elem, 'tail'):
        text_list.append(elem.tail)
for elem in text_list:
    # only gets the first block of text before
    # it encounters a br tag
    print(elem)
for elem in script:
    # prints everything
    for sib in elem.iter():
        print(sib.attrib)
        if hasattr(sib, 'text'):
            print(sib.text)
        if hasattr(sib, 'tail'):
            print(sib.tail)
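For what it's worth, lxml elements also provide itertext(), which walks every text node in document order, including the tail text after each <br>, so nothing is lost; a minimal sketch against the same page:
import requests
from lxml import html

base_url = "http://www.springfieldspringfield.co.uk/view_episode_scripts.php?tv-show=the-sopranos&episode=s06e21"
tree = html.fromstring(requests.get(base_url).content)
div = tree.xpath('//div[@class="scrolling-script-container"]')[0]
# itertext() yields the element's .text and every descendant's
# .text and .tail in document order
full_text = " ".join(piece.strip() for piece in div.itertext() if piece.strip())
print(full_text)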

How to use lxml.html.document_fromstring() to remove tag content using python and lxml

I want to remove the contents present between <pre><code> and </code></pre> tags. I looked at remove(), strip_elements(), as well as BeautifulSoup methods, but all the examples I saw contained only a single tag, like only <pre> or only <code>. How can I use them when both are present together (as with formatted code blocks in Stack Overflow posts)?
EDIT: If I have something like this (this is how formatted code is present in Stack Overflow posts):
<pre><code> some code stuff </code></pre>
then I want to remove all the contents between <pre><code> and </code></pre>, including the tags themselves.
EDITED CODE:
The code that I have is given below, but it throws lxml.etree.XMLSyntaxError: Extra content at the end of the document at the line doc = etree.fromstring(record[1]):
import lxml.html
from lxml import etree

cur.execute('SELECT Title, Body FROM posts')
for item in cur:
    record = list(item)
    doc = etree.fromstring(record[1]) # error thrown here
    for node in doc.xpath('pre[code]'):
        doc.remove(node)
    record[1] = etree.tostring(doc)
    page = lxml.html.document_fromstring(record[1])
    record[0] = str(record[0])
    record[1] = str(page.text_content()) # Stripping HTML Tags
    print record[1]
UPDATE: I understand that the format of the XML that I have is not standard, and as such I will need to use lxml.html.document_fromstring() to remove the tag contents instead of etree.fromstring(). Can anyone provide an example, as I cannot find any implementation of lxml.html.document_fromstring() that removes the contents of a tag?
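A minimal sketch of that approach (the HTML fragment here is a made-up stand-in for one Body value): lxml.html.document_fromstring() tolerates non-XML input by wrapping it in <html><body> as needed, and drop_tree() removes an element together with everything inside it:
import lxml.html

# hypothetical stand-in for one Body value from the posts table
body = '<p>intro</p><pre><code> some code stuff </code></pre><p>outro</p>'
page = lxml.html.document_fromstring(body)  # forgiving HTML parser
for node in page.xpath('//pre[code]'):      # every <pre> that contains a <code>
    node.drop_tree()                        # removes the tags and their contents
print(page.text_content())                  # only 'intro' and 'outro' remain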

Regex Selection of Strings

Trying to use regex to select values between <title> </title>.
However sometimes these two tags are on different lines.
As the others have stated, it's more powerful and less brittle to use a full-fledged markup language parser, like HTMLParser from the stdlib or even BeautifulSoup, than regex. Though, since regex seems to be a requirement, maybe something like this will work:
import urllib2
import re

URL = 'http://amazon.com'
page = urllib2.urlopen(URL)
stream = page.readlines()
flag = False
for line in stream:
    if re.search("<title>", line):
        print line
        if not re.search("</title>", line):
            flag = True
    elif re.search("</title>", line):
        print line
        flag = False
    elif flag == True:
        print line
When it finds the <title> tag it prints the line, checks to make sure the closing tag isn't on the same line, and then continues to print lines until it finds the closing </title>.
If you can't use a parser, just do it by brute force. Read the HTML doc into the string doc, then:
try:
    title = doc.split('<title>')[1].split('</title>')[0]
except IndexError:
    ## no title tag, handle error as you see fit
    pass
Note that if there is an opening title tag without a matching closing tag, the search still succeeds and returns everything after the opening tag. Not a likely scenario in a well-formatted HTML doc, but FYI.
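Alternatively, a single non-greedy regex with re.DOTALL (which lets . match newlines too) handles the tags-on-different-lines case directly; a minimal sketch:
import re

doc = '<html><head><title>A title\nsplit over two lines</title></head></html>'
# .*? is non-greedy; re.DOTALL lets the match cross line boundaries
m = re.search(r'<title>(.*?)</title>', doc, re.DOTALL)
if m:
    print(m.group(1))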

Extracting parts of a webpage with python

So I have a data retrieval/entry project and I want to extract a certain part of a webpage and store it in a text file. I have a text file of urls and the program is supposed to extract the same part of the page for each url.
Specifically, the program copies the legal statute following "Legal Authority:" on pages such as this. As you can see, there is only one statute listed. However, some of the urls also look like this, meaning that there are multiple separated statutes.
My code works for pages of the first kind:
from sys import argv
from urllib2 import urlopen

script, urlfile, legalfile = argv
input = open(urlfile, "r")
output = open(legalfile, "w")

def get_legal(page):
    # this is where Legal Authority: starts in the code
    start_link = page.find('Legal Authority:')
    start_legal = page.find('">', start_link+1)
    end_link = page.find('<', start_legal+1)
    legal = page[start_legal+2: end_link]
    return legal

for line in input:
    pg = urlopen(line).read()
    statute = get_legal(pg)
    output.write(statute)
This gives me the desired statute name in the "legalfile" output .txt. However, it cannot copy multiple statute names. I've tried something like this:
def get_legal(page):
    # this is where Legal Authority: starts in the code
    legal = ""
    start_link = page.find('Legal Authority:')
    while True:
        start_legal = page.find('">', start_link+1)
        end_link = page.find('<', start_legal+1)
        legal += page[start_legal+2: end_link]
        # every list of statutes ends with '</a> ', so stop once the
        # tag we just reached is that closing anchor
        if page[end_link:end_link+5] == '</a> ':
            break
        start_link = end_link
    return legal
Since every list of statutes ends with '</a> ' (inspect the source of either of the two links), I thought I could use that fact (treating it as the end marker) to loop through and collect all the statutes in one string. Any ideas?
I would suggest using BeautifulSoup to parse and search your html. This will be much easier than doing basic string searches.
Here's a sample that pulls all the <a> tags found within the <td> tag that contains the <b>Legal Authority:</b> tag. (Note that I'm using the requests library to fetch page content here - this is just a recommended and very easy-to-use alternative to urlopen.)
import requests
from BeautifulSoup import BeautifulSoup

# fetch the content of the page with requests library
url = "http://www.reginfo.gov/public/do/eAgendaViewRule?pubId=200210&RIN=1205-AB16"
response = requests.get(url)

# parse the html
html = BeautifulSoup(response.content)

# find all the <a> tags
a_tags = html.findAll('a', attrs={'class': 'pageSubNavTxt'})

def fetch_parent_tag(tags):
    # fetch the parent <td> tag of the first <a> tag
    # whose "previous sibling" is the <b>Legal Authority:</b> tag.
    for tag in tags:
        sibling = tag.findPreviousSibling()
        if not sibling:
            continue
        if sibling.getText() == 'Legal Authority:':
            return tag.findParent()

# now, just find all the child <a> tags of the parent.
# i.e. finding the parent of one child, find all the children
parent_tag = fetch_parent_tag(a_tags)
tags_you_want = parent_tag.findAll('a')
for tag in tags_you_want:
    print 'statute: ' + tag.getText()
If this isn't exactly what you needed to do, BeautifulSoup is still the tool you likely want to use for sifting through html.
They provide XML data over there, see my comment. If you think you can't download that many files (or the other end could dislike so many HTTP GET requests), I'd recommend asking their admins if they would kindly provide you with a different way of accessing the data.
I have done so twice in the past (with scientific databases). In one instance the sheer size of the dataset prohibited a download; they ran a SQL query of mine and e-mailed the results (but had previously offered to mail a DVD or hard disk). In another case, I could have done some million HTTP requests to a webservice (and they were ok) each fetching about 1k bytes. This would have taken long, and would have been quite inconvenient (requiring some error-handling, since some of these requests would always time out) (and non-atomic due to paging). I was mailed a DVD.
I'd imagine that the Office of Management and Budget could be similarly accommodating.
