Regular text between nested elements using LXML / Etree [duplicate] - python

Python 2.7 using lxml
I have some awkwardly formatted HTML that looks like this:
<td>
<b>"John"
</b>
<br>
"123 Main st.
"
<br>
"New York
"
<b>
"Sally"
</b>
<br>
"101 California St.
"
<br>
"San Francisco
"
</td>
So basically it's a single td with a ton of stuff in it. I'm trying to compile a list or dict of the names and their addresses.
So far what I've done is gotten a list of nodes with names using tree.xpath('//td/b'). So let's assume I'm currently on the b node for John.
I'm trying to get whatever.xpath('string()') for everything following the current node but preceding the next b node (Sally). I've tried a bunch of different xpath queries but can't seem to get this right. In particular, any time I use an and operator in an expression that has no [] brackets, it returns a bool rather than a list of all nodes meeting the conditions. Can anyone help out?

This should work:
from lxml import etree
p = etree.HTMLParser()
html = open(r'./test.html','r')
data = html.read()
tree = etree.fromstring(data, p)
my_dict = {}
for b in tree.iter('b'):
    br = b.getnext().tail.replace('\n', '')
    my_dict[b.text.replace('\n', '')] = br
print my_dict
This code prints:
{'"John"': '"123 Main st."', '"Sally"': '"101 California St."'}
(You may want to strip the quotation marks out!)
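If you do want the quotes gone as well, a small tweak to the loop body (same variable names as above) should do it:
for b in tree.iter('b'):
    name = b.text.replace('\n', '').strip('"')
    addr = b.getnext().tail.replace('\n', '').strip('"')
    my_dict[name] = addr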
Rather than using XPath, you could use one of lxml's parsers to navigate the HTML directly. The parser turns the HTML document into an "etree" whose elements you can walk with the methods lxml provides. Elements have an iter() method that takes a tag name and yields every element in the subtree with that name. In your case, if you use it to obtain all of the <b> elements, you can then step to the following <br> element and read its tail text, which contains the information you need. See the "Elements contain text" section of the lxml.etree tutorial for details.
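To make the text/tail distinction concrete, here is a minimal sketch (it uses a tiny made-up snippet rather than the full document above):
from lxml import etree
root = etree.fromstring('<td><b>"John"</b><br/>"123 Main st."</td>')
b = root.find('b')
print b.text # '"John"' -- the text inside <b>
print b.getnext().tail # '"123 Main st."' -- the text after the following <br>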

Why not use the getchildren() function on each td? For example:
from lxml import html
s = """
<td>
<b>"John"
</b>
<br>
"123 Main st.
"
<br>
"New York
"
<b>
"Sally"
</b>
<br>
"101 California St.
"
<br>
"San Francisco
"
</td>
"""
records = []
cur_record = -1
cur_field = 1
FIELD_NAME = 0
FIELD_STREET = 1
FIELD_CITY = 2
doc = html.fromstring(s)
td = doc.xpath('//td')[0]
for child in td.getchildren():
    if child.tag == 'b':
        cur_record += 1
        record = dict()
        record['name'] = child.text.strip()
        records.append(record)
        cur_field = 1
    elif child.tag == 'br':
        if cur_field == FIELD_STREET:
            records[cur_record]['street'] = child.tail.strip()
            cur_field += 1
        elif cur_field == FIELD_CITY:
            records[cur_record]['city'] = child.tail.strip()
And the results are:
records = [
    {'city': '"New York\n"', 'name': '"John"', 'street': '"123 Main st.\n"'},
    {'city': '"San Francisco\n"', 'name': '"Sally"', 'street': '"101 California St.\n"'}
]
Note that you need to read tag.tail (rather than tag.text) to get the text that follows a void HTML tag such as <br>.
Hope this helps.
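A side note on the code above: getchildren() is deprecated in recent lxml releases in favor of iterating the element directly, so the loop header could equally be written as:
for child in td: # iterates the direct children, same as td.getchildren()
    ...
The rest of the field-tracking logic stays unchanged.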

Related

Scrape a <p> with optional <spans> with regex

I am trying to scrape a table like this:
<table><tr>
<td width="100"><p><span style=" font-family:'MS Shell Dlg 2'; font-size:8.25pt;">My title example:</span></p></td>
<td width="440"><p><span style=" font-family:'MS Shell Dlg 2'; font-size:8.25pt;">My text example.</span></p></td>
</tr>
<tr>
<td width="100">My second title:</p></td>
<td width="440"><p>My <span style=" font-family:'MS Shell Dlg 2'; font-size:8.25pt; text-decoration: underline;">second</span> text example.</p></td>
</tr></table>
To show the output in a simple list of dictionaries like this:
[
{"title": "My title example", "text": "My text example"},
{"title": "My other example", "text": "My <u>second</u> text example"},
{"title": "My title example", "text": "My new example"},
]
But I need to sanitize the code and swap the underlined sections to <u> tags. So this is the code that I have so far:
from bs4 import BeautifulSoup
import re
data = [] # collected rows
# html below is assumed to be the BeautifulSoup object for the document
# Find the rows in the table
for table_row in html.select("table tr"):
    cells = table_row.findAll('td')
    if len(cells) > 0:
        row_title = cells[0].text.strip()
        paragraphs = []
        # Find all spans in a row
        for run in cells[1].findAll('span'):
            print(run)
            if "text-decoration: underline" in str(run):
                paragraphs.append("{0}{1}{2}".format("<u>", run.text, "</u>"))
            else:
                paragraphs.append(run.text)
        # Build up a sanitized string with all the runs.
        row_text = "".join(paragraphs)
        row = {"title": row_title, "text": row_text}
        data.append(row)
print(data)
The issue: As you may have noticed, it scrapes the row with spans perfectly (the first example) but fails on the second one, where it only scrapes the underlined parts (because the rest of the text is not inside span tags). So I was thinking that instead of trying to find spans, I would just remove all the tags and replace the ones that I need with regex, something like this:
# Find all runs in a row
for paragraph in cells[1].findAll('p'):
    re.sub('<.*?>', '', str(paragraph))
And that would create text with no tags, but also without underline formatting, and that's where I am stuck.
I don't know how to add such a condition to the regex. Any help is welcome.
Expected output: Remove all tags from paragraph but replace spans where text-decoration: underline is found with <u></u> tags.
One idea would be to use .replace_with() to replace the "underline" span elements with the u elements and then use .encode_contents() to get the inner HTML of the "text" cells:
result = []
for row in soup.select("table tr"):
    title_cell, data_cell = row('td')[:2]
    for span in data_cell('span'):
        if 'underline' in span.get('style', ''):
            u = soup.new_tag("u")
            u.string = span.get_text()
            span.replace_with(u)
        else:
            # replacing the "span" element with its contents
            span.unwrap()
    # replacing the "p" element with its contents
    data_cell.p.unwrap()
    result.append({
        "title": title_cell.get_text(strip=True),
        "text": str(data_cell.encode_contents())
    })
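This assumes soup has already been built from the markup in the question; a minimal setup (html_doc is just a placeholder name for that HTML string) would be:
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc, "html.parser")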
When you find a <span> tag whose style contains the underline declaration, you can change its text to add the <u>...</u> tags using span.string = '<u>{}</u>'.format(span.text). After modifying the text, you can remove the <span> tag using unwrap().
result = []
for row in soup.select('table tr'):
    columns = row.find_all('td')
    title = columns[0]
    txt = columns[1]
    # the "s and" guard skips spans with no style attribute (style is None there)
    for span in txt.find_all('span', style=lambda s: s and 'text-decoration: underline' in s):
        span.string = '<u>{}</u>'.format(span.text)
        span.unwrap()
    result.append({'title': title.text, 'text': txt.text})
print(result)
# [{'title': 'My title example:', 'text': 'My text example.'}, {'title': 'My second title:', 'text': 'My <u>second</u> text example.'}]
Note: This approach won't actually change the tag. It modifies the string and removes the tag.

How to use beautifulsoup to get node text and children tag separately

My html is like:
<a class="title" href="">
<b>name
<span class="c-gray">position</span>
</b>
</a>
I want to get name and position string separately. So my script is like:
lia = soup.find('a',attrs={'class':'title'})
pos = lia.find('span').get_text()
lia.find('span').replace_with('')
name = lia.get_text()
print name.strip()+','+pos
Although it can do the job, I don't think it's a beautiful way. Any better ideas?
You can use the .contents method this way:
person = lia.find('b').contents
name = person[0].strip()
position = person[1].text
Here person[0] is the text node holding the name (hence .strip()) and person[1] is the inner span element (hence .text).
The idea is to locate the a element, then, for the name - get the first text node from an inner b element and, for the position - get the span element's text:
>>> a = soup.find("a", class_="title")
>>> name, position = a.b.find(text=True).strip(), a.b.span.get_text(strip=True)
>>> name, position
(u'name', u'position')

Python get tag with certain text

I've got a string with HTML blocks, like
a = '<div>Test moree test <div> London is ... <p>mooo</p></div></div>'
I need to get the block containing certain text, for example:
super_func("London", a) ==> '<div> London is ... <p>mooo</p></div>'
super_func('mooo', a) ==> '<p>mooo</p>'
You can use the following XPath query to find an element containing certain text, regardless of the element name and its location within the HTML document:
//*[contains(text(),'certain text')]
This is a working example using the lxml.html library:
from lxml import html
def super_func(keyword, htmldoc):
    query = '//*[contains(text(),"{0}")]'
    result = htmldoc.xpath(query.format(keyword))
    if len(result) > 0:
        return html.tostring(result[0])
    else:
        return ''
a = '<div>Test moree test <div> London is ... <p>mooo</p></div></div>'
doc = html.fromstring(a)
text = 'London'
print super_func(text, doc)
text = 'mooo'
print super_func(text, doc)
Output:
<div> London is ... <p>mooo</p></div>
<p>mooo</p>
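A caveat on contains(text(), ...): in XPath 1.0 it only tests the first text node of each element, so it can miss elements whose matching text appears after a child element. A variant that tests every direct text node would be (a drop-in replacement for the query string above):
query = '//*[text()[contains(., "{0}")]]'
For the two sample calls above, both forms should return the same result; the difference only shows up with mixed content.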

Scraping an Element using lxml and Xpath

The issue I'm having is scraping out the element itself. I'm able to scrape the first two (IncidentNbr and DispatchTime) but I can't get the address (1300 Dunn Ave). I want to scrape that element, but keep it dynamic enough that I'm not literally parsing for "1300 Dunn Ave"; I'm parsing for that element. Here is the source code:
<td><span id="lstCallsForService_ctrl0_lblIncidentNbr">150318182198</span></td>
<td><nobr><span id="lstCallsForService_ctrl0_lblDispatchTime">3-18 10:25</span></nobr></td>
<td>
<a id="lstCallsForService_ctrl0_lnkAddress" href="https://maps.google.com/?q=1300 DUNN AVE, Jacksonville, FL" target="_blank" style="text-decoration:underline;">1300 DUNN AVE</a>
</td>
And here is my code:
from lxml import html
import requests
page = requests.get('http://callsforservice.jaxsheriff.org/')
tree = html.fromstring(page.text)
callSignal = tree.xpath('//span[@id="lstCallsForService_ctrl0_lblIncidentNbr"]/text()')
dispatchTime = tree.xpath('//span[@id="lstCallsForService_ctrl0_lblDispatchTime"]/text()')
location = tree.xpath('//span[@id="lstCallsForService_ctrl0_lnkAddress"]/text()')
print 'Call Signal: ', callSignal
print "Dispatch Time: ", dispatchTime
print "Location: ", location
And this is my output:
Call Signal: ['150318182198']
Dispatch Time: ['3-18 10:25']
Location: []
Any idea on how I can scrape out the address?
First of all, it is an a element, not a span. And you need a double slash before the text():
//a[@id="lstCallsForService_ctrl0_lnkAddress"]//text()
Why a double slash? This is because in reality this a element has no direct text node children:
<a id="lstCallsForService_ctrl0_lnkAddress" href="https://maps.google.com/?q=5100 CLEVELAND RD, Jacksonville, FL" target="_blank">
<u>5100 CLEVELAND RD</u>
</a>
You could also reach the text through the u tag:
//a[@id="lstCallsForService_ctrl0_lnkAddress"]/u/text()
Besides, to scale the solution to multiple results:
iterate over the table rows
for every row, find the cell values with a partial match on the id attribute using contains()
use the text_content() method to get the text
Implementation:
for item in tree.xpath('//tr[@class="closedCall"]'):
    callSignal = item.xpath('.//span[contains(@id, "lblIncidentNbr")]')[0].text_content()
    dispatchTime = item.xpath('.//span[contains(@id, "lblDispatchTime")]')[0].text_content()
    location = item.xpath('.//a[contains(@id, "lnkAddress")]')[0].text_content()
    print 'Call Signal: ', callSignal
    print "Dispatch Time: ", dispatchTime
    print "Location: ", location
    print "------"
Prints:
Call Signal: 150318182333
Dispatch Time: 3-18 11:22
Location: 9600 APPLECROSS RD
------
Call Signal: 150318182263
Dispatch Time: 3-18 11:12
Location: 1100 E 1ST ST
------
...
This is the element you are looking for:
<a id="lstCallsForService_ctrl0_lnkAddress"
href="https://maps.google.com/?q=1300 DUNN AVE, Jacksonville, FL"
target="_blank" style="text-decoration:underline;">1300 DUNN AVE</a>
As you can see, it is not a span element. Your current XPath expression:
//span[@id="lstCallsForService_ctrl0_lnkAddress"]/text()
is looking for a span element with this ID, when it should actually be selecting an a element. Use
//a[@id="lstCallsForService_ctrl0_lnkAddress"]/text()
instead. Then, the result should be
Location: ['1300 DUNN AVE']
Please also read alecxe's answer which has more practical advice than mine.

Retrieve all content between a closing and opening html tag using Beautiful Soup

I am parsing content using Python and Beautiful Soup, then writing it to a CSV file, and have run into a bugger of a problem getting a certain set of data. The data is run through an implementation of TidyHTML that I have crafted, and then other unneeded data is stripped out.
The issue is that I need to retrieve all data between a set of <h3> tags.
Sample Data:
<h3>Pages 1-18</h3>
<ul><li>September 13 1880. First regular meeting of the faculty;
September 14 1880. Discussion of curricular matters. Students are
debarred from taking algebra until they have completed both mental
and fractional arithmetic; October 4 1880.</li><li>All members present.</li></ul>
<ul><li>Moved the faculty henceforth hold regular weekkly meetings in the
President's room of the University building; 11 October 1880. All
members present; 18 October 1880. Regular meeting 2. Moved that the
President wait on the property holders on 12th street and request
them to abate the nuisance on their property; 25 October 1880.
Moved that the senior and junior classes for rhetoricals be...</li></ul>
<h3>Pages 19-33</h3>
I need to retrieve all of the content between the first closing </h3> tag and the next opening <h3> tag. This shouldn't be hard, but my thick head isn't making the necessary connections. I can grab all of the <ul> tags but that doesn't work because there is not a one to one relationship between <h3> tags and <ul> tags.
The output I am looking to achieve is:
Pages 1-18|Vol-1-pages-001.pdf|content between the </h3> and <h3> tags.
The first two parts have not been a problem, but the content between a set of <h3> tags is difficult for me.
My current code is as follows:
import glob, re, os, csv
from BeautifulSoup import BeautifulSoup
from tidylib import tidy_document
from collections import deque
html_path = 'Z:\\Applications\\MAMP\\htdocs\\uoassembly\\AssemblyRecordsVol1'
csv_path = 'Z:\\Applications\\MAMP\\htdocs\\uoassembly\\AssemblyRecordsVol1\\archiveVol1.csv'
html_cleanup = {'\r\r\n':'', '\n\n':'', '\n':'', '\r':'', '\r\r': '', '<img src="UOSymbol1.jpg" alt="" />':''}
for infile in glob.glob( os.path.join(html_path, '*.html') ):
    print "current file is: " + infile
    html = open(infile).read()
    for i, j in html_cleanup.iteritems():
        html = html.replace(i, j)
    #parse cleaned up html with Beautiful Soup
    soup = BeautifulSoup(html)
    #print soup
    html_to_csv = csv.writer(open(csv_path, 'a'), delimiter='|',
                             quoting=csv.QUOTE_NONE, escapechar=' ')
    #retrieve the string that has the page range and file name
    volume = deque()
    fileName = deque()
    summary = deque()
    i = 0
    for title in soup.findAll('a'):
        if title['href'].startswith('V'):
            #print title.string
            volume.append(title.string)
            i+=1
            #print soup('a')[i]['href']
            fileName.append(soup('a')[i]['href'])
            #print html_to_csv
            #html_to_csv.writerow([volume, fileName])
    #retrieve the summary of each archive and store
    #for body in soup.findAll('ul') or soup.findAll('ol'):
    #    summary.append(body)
    for body in soup.findAll('h3'):
        body.findNextSibling(text=True)
        summary.append(body)
    #print out each field into the csv file
    for c in range(i):
        pages = volume.popleft()
        path = fileName.popleft()
        notes = summary
        if not summary:
            notes = "help"
        if summary:
            notes = summary.popleft()
        html_to_csv.writerow([pages, path, notes])
Extract content between </h3> and <h3> tags:
from itertools import takewhile
h3s = soup('h3') # find all <h3> elements
for h3, h3next in zip(h3s, h3s[1:]):
    # get elements in between
    between_it = takewhile(lambda el: el is not h3next, h3.nextSiblingGenerator())
    # extract text
    print(''.join(getattr(el, 'text', el) for el in between_it))
The code assumes that all <h3> elements are siblings. If it is not the case then you could use h3.nextGenerator() instead of h3.nextSiblingGenerator().
If you want to extract the data between <ul><li>...</li></ul> tags, lxml provides great CSS selector functionality:
import lxml.html
import urllib
data = urllib.urlopen('file:///C:/Users/ranveer/st.html').read() # contains your html snippet
doc = lxml.html.fromstring(data)
elements = doc.cssselect('ul li') # CSS path (found using the Firebug extension)
for element in elements:
    print element.text_content()
After executing the above code you will get all the text between the ul/li tags. It is much cleaner than Beautiful Soup.
If you by any chance plan to use lxml, then you can evaluate XPath expressions in the following way:
import urllib
from lxml import etree
content = etree.HTML(urllib.urlopen("file:///C:/Users/ranveer/st.html").read())
content_text = content.xpath("//h3[1]/a/@href | //ul[1]/li/text() | //ul[2]/li/text() | //h3[2]/a/@href")
print content_text
You can change XPath according to your need.

Categories

Resources