The issue I'm having is scraping out the element itself. I'm able to scrape the first two fields (IncidentNbr and DispatchTime), but I can't get the address (1300 Dunn Ave). I want to scrape that element, but dynamically, so I'm not actually parsing for the literal text "1300 Dunn Ave"; I'm parsing for the element. Here is the source code:
<td><span id="lstCallsForService_ctrl0_lblIncidentNbr">150318182198</span></td>
<td><nobr><span id="lstCallsForService_ctrl0_lblDispatchTime">3-18 10:25</span></nobr></td>
<td>
<a id="lstCallsForService_ctrl0_lnkAddress" href="https://maps.google.com/?q=1300 DUNN AVE, Jacksonville, FL" target="_blank" style="text-decoration:underline;">1300 DUNN AVE</a>
</td>
And here is my code:
from lxml import html
import requests
page = requests.get('http://callsforservice.jaxsheriff.org/')
tree = html.fromstring(page.text)
callSignal = tree.xpath('//span[@id="lstCallsForService_ctrl0_lblIncidentNbr"]/text()')
dispatchTime = tree.xpath('//span[@id="lstCallsForService_ctrl0_lblDispatchTime"]/text()')
location = tree.xpath('//span[@id="lstCallsForService_ctrl0_lnkAddress"]/text()')
print 'Call Signal: ', callSignal
print "Dispatch Time: ", dispatchTime
print "Location: ", location
And this is my output:
Call Signal: ['150318182198']
Dispatch Time: ['3-18 10:25']
Location: []
Any idea on how I can scrape out the address?
First of all, it is an a element, not a span. And you need a double slash before the text():
//a[@id="lstCallsForService_ctrl0_lnkAddress"]//text()
Why a double slash? This is because in reality this a element has no direct text node children:
<a id="lstCallsForService_ctrl0_lnkAddress" href="https://maps.google.com/?q=5100 CLEVELAND RD, Jacksonville, FL" target="_blank">
<u>5100 CLEVELAND RD</u>
</a>
You could also reach the text through the u tag:
//a[@id="lstCallsForService_ctrl0_lnkAddress"]/u/text()
Besides, to scale the solution to multiple results:
iterate over the table rows
for every row, find the cell values using a partial id attribute match with contains()
use the text_content() method to get the text
Implementation:
for item in tree.xpath('//tr[@class="closedCall"]'):
    callSignal = item.xpath('.//span[contains(@id, "lblIncidentNbr")]')[0].text_content()
    dispatchTime = item.xpath('.//span[contains(@id, "lblDispatchTime")]')[0].text_content()
    location = item.xpath('.//a[contains(@id, "lnkAddress")]')[0].text_content()
    print 'Call Signal: ', callSignal
    print "Dispatch Time: ", dispatchTime
    print "Location: ", location
    print "------"
Prints:
Call Signal: 150318182333
Dispatch Time: 3-18 11:22
Location: 9600 APPLECROSS RD
------
Call Signal: 150318182263
Dispatch Time: 3-18 11:12
Location: 1100 E 1ST ST
------
...
This is the element you are looking for:
<a id="lstCallsForService_ctrl0_lnkAddress"
href="https://maps.google.com/?q=1300 DUNN AVE, Jacksonville, FL"
target="_blank" style="text-decoration:underline;">1300 DUNN AVE</a>
As you can see, it is not a span element. Your current XPath expression:
//span[@id="lstCallsForService_ctrl0_lnkAddress"]/text()
is looking for a span element with this ID, when it should actually be selecting an a element. Use
//a[@id="lstCallsForService_ctrl0_lnkAddress"]/text()
instead. Then, the result should be
Location: ['1300 DUNN AVE']
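For completeness, a minimal sketch of the corrected lookup against the question's code; the double slash also covers the case where the address text sits inside a <u> child, as in alecxe's answer:
location = tree.xpath('//a[@id="lstCallsForService_ctrl0_lnkAddress"]//text()')
print "Location: ", location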
Please also read alecxe's answer which has more practical advice than mine.
I'm using Python 2.7 with lxml.
I have some annoyingly formed html that looks like this:
<td>
<b>"John"
</b>
<br>
"123 Main st.
"
<br>
"New York
"
<b>
"Sally"
</b>
<br>
"101 California St.
"
<br>
"San Francisco
"
</td>
So basically it's a single td with a ton of stuff in it. I'm trying to compile a list or dict of the names and their addresses.
So far what I've done is gotten a list of nodes with names using tree.xpath('//td/b'). So let's assume I'm currently on the b node for John.
I'm trying to get whatever.xpath('string()') for everything following the current node but preceding the next b node (Sally). I've tried a bunch of different xpath queries but can't seem to get this right. In particular, any time I use an and operator in an expression that has no [] brackets, it returns a bool rather than a list of all nodes meeting the conditions. Can anyone help out?
This should work:
from lxml import etree
p = etree.HTMLParser()
html = open(r'./test.html','r')
data = html.read()
tree = etree.fromstring(data, p)
my_dict = {}
for b in tree.iter('b'):
    br = b.getnext().tail.replace('\n', '')
    my_dict[b.text.replace('\n', '')] = br
print my_dict
This code prints:
{'"John"': '"123 Main st."', '"Sally"': '"101 California St."'}
(You may want to strip the quotation marks out!)
Rather than using XPath, you could use one of lxml's parsers in order to easily navigate the HTML. The parser turns the HTML document into an "etree", which you can navigate with the provided methods. Etree elements provide a method called iter() which allows you to pass in a tag name and receive all elements in the tree with that name. In your case, if you use this to obtain all of the <b> elements, you can then manually navigate to the following <br> element and retrieve its tail text, which contains the information you need. You can find information about this under the "Elements contain text" header of the lxml.etree tutorial.
Why not use the getchildren() function on each td? For example:
from lxml import html
s = """
<td>
<b>"John"
</b>
<br>
"123 Main st.
"
<br>
"New York
"
<b>
"Sally"
</b>
<br>
"101 California St.
"
<br>
"San Francisco
"
</td>
"""
records = []
cur_record = -1
cur_field = 1
FIELD_NAME = 0
FIELD_STREET = 1
FIELD_CITY = 2
doc = html.fromstring(s)
td = doc.xpath('//td')[0]
for child in td.getchildren():
    if child.tag == 'b':
        cur_record += 1
        record = dict()
        record['name'] = child.text.strip()
        records.append(record)
        cur_field = 1
    elif child.tag == 'br':
        if cur_field == FIELD_STREET:
            records[cur_record]['street'] = child.tail.strip()
            cur_field += 1
        elif cur_field == FIELD_CITY:
            records[cur_record]['city'] = child.tail.strip()
And the results are:
records = [
{'city': '"New York\n"', 'name': '"John"\n', 'street': '"123 Main st.\n"'},
{'city': '"San Francisco\n"', 'name': '\n"Sally"\n', 'street': '"101 California St.\n"'}
]
Note that you should use the element's tail attribute when you want the text that follows a tag with no inner content, e.g. <br>.
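A quick sketch of the difference between .text and .tail on a tiny fragment:
from lxml import etree

td = etree.fromstring('<td><b>name</b><br/>street</td>')
br = td.find('br')
print br.text   # None, since <br/> has no inner text
print br.tail   # 'street', the text that follows the tag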
Hope this is helpful.
I'm scraping the front end of a webpage and having difficulty getting the HTML text of a div within a div.
Basically, I'm simulating clicks - one for each event listed on the page. From there, I want to scrape the date and time of the event, as well as the location of the event.
Here's an example of one of the pages I'm trying to scrape:
https://www.bandsintown.com/e/1013664851-los-grandes-de-la-banda-at-aura-nightclub?came_from=257&utm_medium=web&utm_source=home&utm_campaign=event
<div class="eventInfoContainer-54d5deb3">
<div class="lineupContainer-570750d2">
<div class="eventInfoContainer-9e539994">
<img src="assets.bandsintown.com/images.clock.svg">
<div>Sunday, April 21st, 2019</div> <!-- *** -->
<div class="eventInfoContainer-50768f6d">5:00PM</div> <!-- *** -->
</div>
<div class="eventInfoContainer-1a68a0e1">
<img src="assets.bandsintown.com/images.clock.svg">
<div class="eventInfoContainer-2d9f07df">
<div>Aura Nightclub</div> <!-- *** -->
<div>283 1st St., San Jose, CA 95113</div> <!-- *** -->
</div>
I've marked the elements I want to extract with asterisks - the date, time, venue, and address. Here's my code:
base_url = 'https://www.bandsintown.com/?came_from=257&page='
events = []
eventContainerBucket = []
for i in range(1, 2):
    driver.get(base_url + str(i))
    # get event links
    event_list = driver.find_elements_by_css_selector('div[class^=eventList-] a[class^=event-]')
    # collect the href attribute of each event in event_list
    events.extend(list(event.get_attribute("href") for event in event_list))

# iterate through all events and open them.
for event in events:
    driver.get(event)
    uniqueEventContainer = driver.find_elements_by_css_selector('div[class^=eventInfoContainer-]')[0]
    print "Event information: " + uniqueEventContainer.text
This prints:
Event information: Sunday, April 21st, 2019
3:00 PM
San Francisco Brewing Co.
3150 Polk St, Sf, CA 94109
View All The Fourth Son Tour Dates
My issue is that I can't access the nested eventInfoContainer divs individually. For example, the 'date' div is at position [1], as it is the second element (after img) in its parent div "eventInfoContainer-9e539994". That parent div is likewise at position [1], as it is the second element in its parent div "eventInfoContainer-54d5deb3" (after "lineupContainer-570750d2").
By this logic, shouldn't I be able to access the date text with the following code, by taking the element at position [1], within the parent at position [1], within the container (the element at position [0])?
for event in events:
    driver.get(event)
    uniqueEventContainer = driver.find_elements_by_css_selector('div[class^=eventInfoContainer-]')[0][1][1]
I get the following error:
TypeError: 'WebElement' object does not support indexing
When you index into the list of WebElements (which is what find_elements_by_css_selector('div[class^=eventInfoContainer-]') returns) you get a single WebElement; you cannot index further into that. You can, however, split the text of a WebElement to generate a list for further indexing.
If the structure is regular across pages, you could load the HTML for the div into BeautifulSoup. For example, with the event URL from the question:
from selenium import webdriver
from bs4 import BeautifulSoup as bs
d = webdriver.Chrome()
d.get('https://www.bandsintown.com/e/1013664851-los-grandes-de-la-banda-at-aura-nightclub?came_from=257&utm_medium=web&utm_source=home&utm_campaign=event')
soup = bs(d.find_element_by_css_selector('[class^=eventInfoContainer-]').get_attribute('outerHTML'), 'lxml')
date = soup.select_one('img + div').text
time = soup.select_one('img + div + div').text
venue = soup.select_one('[class^=eventInfoContainer-]:nth-of-type(3) div > div').text
address = soup.select_one('[class^=eventInfoContainer-]:nth-of-type(3) div + div').text
print(date, time, venue, address)
If line breaks were consistent:
containers = d.find_elements_by_css_selector('div[class^=eventInfoContainer-]')
array = containers[0].text.split('\n')
date = array[3]
time = array[4]
venue = array[5]
address = array[6]
print(date, time, venue, address)
With index and split:
from selenium import webdriver
from bs4 import BeautifulSoup as bs
d = webdriver.Chrome()
d.get('https://www.bandsintown.com/e/1013664851-los-grandes-de-la-banda-at-aura-nightclub?came_from=257&utm_medium=web&utm_source=home&utm_campaign=event')
containers = d.find_elements_by_css_selector('div[class^=eventInfoContainer-]')
date_time = containers[1].text.split('\n')
i_date = date_time[0]
i_time = date_time[1]
venue_address = containers[3].text.split('\n')
venue = venue_address[0]
address = venue_address[1]
print(i_date, i_time, venue, address)
As the error suggests, a WebElement does not support indexing. What you are confusing it with is a list.
Here
driver.find_elements_by_css_selector('div[class^=eventInfoContainer-]')
This code returns a list of WebElements, which is why you can access a single WebElement with a list index. But that element does not itself support indexing into further WebElements; you are not getting a list of lists.
That is why
driver.find_elements_by_css_selector('div[class^=eventInfoContainer-]')[0] works, but driver.find_elements_by_css_selector('div[class^=eventInfoContainer-]')[0][1] doesn't.
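To drill into a container, search within the WebElement instead of indexing it; a minimal sketch, assuming the container layout from the question:
containers = driver.find_elements_by_css_selector('div[class^=eventInfoContainer-]')
# find_elements can be called on a WebElement to search only its subtree
inner_divs = containers[0].find_elements_by_css_selector('div')
print(inner_divs[0].text)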
Edit (answer for the question in the comment):
It is not Selenium code.
The code posted in the answer by QHarr uses BeautifulSoup. It is a python package for parsing HTML and XML documents.
BeautifulSoup has a .select() method which runs a CSS selector against a parsed document and returns all the matching elements.
There’s also a method called select_one(), which finds only the first tag that matches a selector.
In the code,
time = soup.select_one('img + div + div').text
venue = soup.select_one('[class^=eventInfoContainer-]:nth-of-type(3) div > div').text
Each line gets the first element found by the given CSS selector and returns the text inside the tag. The first line finds an img tag, then its immediately following sibling div, and then the next sibling div of that div.
The second line finds the third sibling tag whose class starts with eventInfoContainer-, then finds a child div, and then the child div of that div.
Check out CSS selectors
This could be done directly using selenium:
date = driver.find_element_by_css_selector("img[class^='eventInfoContainer-'][src$='clock.svg'] + div")
time = driver.find_element_by_css_selector("img[class^='eventInfoContainer-'] + div + div")
venue = driver.find_element_by_css_selector("img[class^='eventInfoContainer-'][src$='pin.svg'] + div > div")
address = driver.find_element_by_css_selector("img[class^='eventInfoContainer-'][src$='pin.svg'] + div > div:nth-of-type(2)")
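Note that each find_element_by_css_selector call returns a WebElement, not a string, so you would still read its .text, e.g.:
print(date.text, time.text, venue.text, address.text)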
I've used different CSS selectors, but they still select the same elements.
I'm not sure about the BeautifulSoup version, but in QHarr's answer the date selector would return a different value than intended when used with Selenium.
This is the source code layout from the website:
<div class="address">
<a href="https://website.ca/classifieds/59-barclay-street/">
59 Some Street<br />City, Zone 1
</a>
</div>
I would like to get the street number, route, and city for Google Geocoding. If I do this
>>>article.find('div', {'class': 'address'}).text
'59 Some StreetCity, Zone 1'
It takes away the <br /> and I'm left with no way to split the route from the city. If I do str().replace('<br />', ', '), I then have to somehow convert it back to whatever type it was before so I can call .text to get the actual text inside the <a href>, which is inefficient. I'd like the functionality that .text uses to get the actual text, without the part where it removes the <br> tags. I couldn't find a file called BeautifulSoup.py in my env, so I looked at the BeautifulSoup source code on GitHub, but I can't find a def text in there and I don't know where else to look.
Update:
articles = page_soup.find('h2', text='Ads').find_next_siblings('article')
for article in articles:
    link = article.find('a')
    br = link.find('br')
    ad_address = br.previous_sibling.strip() + ', ' + br.next_sibling.strip().partition(', Zone ')[0]
    #ad_address = link.br.replace_with(', ').get_text().strip().partition(', Zone ')
You can locate the br delimiter tag and get the siblings around it:
In [4]: br = soup.select_one("div.address > a > br")
In [5]: br.previous_sibling.strip()
Out[5]: u'59 Some Street'
In [6]: br.next_sibling.strip()
Out[6]: u'City, Zone 1'
You may also locate the br element and replace it with a space using replace_with():
In [4]: a = soup.select_one("div.address > a")
In [5]: a.br.replace_with(" ")
In [6]: a.get_text().strip()
Out[6]: u'59 Some Street City, Zone 1'
Or, you can join all text nodes inside the a tag:
In [7]: a = soup.select_one("div.address > a")
In [8]: " ".join(a.find_all(text=True)).strip()
Out[8]: u'59 Some Street City, Zone 1'
Try:
soup.find('div', {'class':'address'}).get_text(separator=u"<br/>").split(u'<br/>')
The separator keyword defines the string that is inserted between the pieces of text inside the tag, so you can split on it afterwards.
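A hedged sketch of that approach on markup like the question's (the href here is just a placeholder):
from bs4 import BeautifulSoup

html = '<div class="address"><a href="#">59 Some Street<br/>City, Zone 1</a></div>'
soup = BeautifulSoup(html, 'html.parser')
# the separator string is inserted where tags such as <br/> interrupt the text
parts = soup.find('div', {'class': 'address'}).get_text(separator=u'<br/>').split(u'<br/>')
print parts  # [u'59 Some Street', u'City, Zone 1']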
http://omz-software.com/pythonista/docs/ios/beautifulsoup_ref.html
Try:
for link_to_text in links:
    print link_to_text.get_text()
I'm learning how to write scrapers using Python in Scraperwiki. So far so good, but I have spent a couple of days scratching my head over a problem I can't get my head around: I am trying to take all the links from a table. It works, but out of the list of links, which go from 001 to 486, it only ever starts grabbing them at 045. The url/source is just a list of cities on a website; the source can be seen here:
http://www.tripadvisor.co.uk/pages/by_city.html and the specific html starts here:
</td></tr>
<tr><td class=dt1>'s-Gravenzande, South Holland Province - Aberystwyth, Ceredigion, Wales</td>
<td class=dt1>Los Corrales de Buelna, Cantabria - Lousada, Porto District, Northern Portugal</td>
</tr>
<tr><td class=dt1>Abetone, Province of Pistoia, Tuscany - Adamstown, Lancaster County, Pennsylvania</td>
<td class=dt1>Louth, Lincolnshire, England - Lucciana, Haute-Corse, Corsica</td>
</tr>
<tr><td class=dt1>Adamswiller, Bas-Rhin, Alsace - Aghir, Djerba Island, Medenine Governorate </td>
<td class=dt1>Luccianna, Haute-Corse, Corsica - Lumellogno, Novara, Province of Novara, Piedmont</td>
</tr>
What I am after is the links from "by_city_001.html" through to "by_city_486.html". Here is my code:
def scrapeCityList(pageUrl):
    html = scraperwiki.scrape(pageUrl)
    root = lxml.html.fromstring(html)
    print html
    links = root.cssselect('td.dt1 a')
    for link in links:
        url = 'http://www.tripadvisor.co.uk' + link.attrib['href']
        print url
Called in the code as follows:
scrapeCityList('http://www.tripadvisor.co.uk/pages/by_city.html')
Now when I run it, it only ever returns the links starting at 045!
The output (045~486)
http://www.tripadvisor.co.ukby_city_045.html
http://www.tripadvisor.co.ukby_city_288.html
http://www.tripadvisor.co.ukby_city_046.html
http://www.tripadvisor.co.ukby_city_289.html
http://www.tripadvisor.co.ukby_city_047.html
http://www.tripadvisor.co.ukby_city_290.html and so on...
I've tried changing the selector to:
links = root.cssselect('td.dt1')
And it grabs 487 'elements' like this:
<Element td at 0x13d75f0>
<Element td at 0x13d7650>
<Element td at 0x13d76b0>
But I'm not able to get the 'href' value from this. I can't figure out why it loses the first 44 links when I select 'a' in the cssselect line. I've looked at the code but I have no clue.
Thanks in advance for any help!
Claire
Your code works fine. You can see it in action here: https://scraperwiki.com/scrapers/tripadvisor_cities/
I've added in saving to the datastore so you can see that it actually processes all the links.
import scraperwiki
import lxml.html
def scrapeCityList(pageUrl):
    html = scraperwiki.scrape(pageUrl)
    root = lxml.html.fromstring(html)
    links = root.cssselect('td.dt1 a')
    print len(links)
    batch = []
    for link in links[1:]:  # skip the first link since it's only a link to tripadvisor and not a subpage
        record = {}
        url = 'http://www.tripadvisor.co.uk/' + link.attrib['href']
        record['url'] = url
        batch.append(record)
    scraperwiki.sqlite.save(["url"], data=batch)
scrapeCityList('http://www.tripadvisor.co.uk/pages/by_city.html')
If you use the second css selector:
links = root.cssselect('td.dt1')
then you are selecting the td element and not the a element (which is a sub-element of the td). You could select the a by doing this:
url = 'http://www.tripadvisor.co.uk/' + link[0].attrib['href']
where you are selecting the first sub-element of the td (that's the [0]).
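Putting that together, a minimal sketch, assuming every td.dt1 holds the a as its first child:
links = root.cssselect('td.dt1')
for td in links:
    a = td[0]  # the first child element of the td, i.e. the <a>
    print 'http://www.tripadvisor.co.uk/' + a.attrib['href']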
If you want to see all attributes of an element in lxml.html then use:
print element.attrib
which for the td gives:
{'class': 'dt1'}
{'class': 'dt1'}
{'class': 'dt1'}
...
and for the a:
{'href': 'by_city_001.html'}
{'href': 'by_city_244.html'}
{'href': 'by_city_002.html'}
...
I am trying to scrape some content (I am very new to Python) and I have hit a stumbling block. The HTML I am trying to scrape is:
<h2><a href="...">Spear & Jackson Predator Universal Hardpoint Saw - 22"</a></h2>
<p><span class="productlist_mostwanted_rrp">
Was: <span class="strikethrough">£12.52</span></span><span class="productlist_mostwanted_save">Save: £6.57(52%)</span></p>
<div class="clear"></div>
<p class="productlist_mostwanted_price">Now: £5.95</p>
What I am trying to scrape is the link text (Spear & Jackson etc.) and the price (£5.95). I have looked around on Google, in the BeautifulSoup documentation and on this forum, and I managed to extract the "Now: £5.95" using this code:
for node in soup.findAll('p', {"class": "productlist_grid_price"}):
    print ''.join(node.findAll(text=True))
However the result I am after is just 5.95. I have also had limited success trying to get the link text (Spear & Jackson) using:
soup.h2.a.contents[0]
However of course this returns just the first result.
The ultimate result that I am aiming for is to have the results look like:
Spear & Jackson Predator Universal Hardpoint Saw - 22 5.95
etc
etc
As I am looking to export this to a csv, I need to figure out how to put the data into 2 columns. Like I say I am very new to python so I hope this makes sense.
I appreciate any help!
Many thanks
I think what you're looking for is something like this:
from BeautifulSoup import BeautifulSoup
import re
soup = BeautifulSoup(open('prueba.html').read())
item = re.sub(r'\s+', ' ', soup.h2.a.text)
price = soup.find('p', {'class': 'productlist_mostwanted_price'}).text
price = re.search(r'\d+\.\d+', price).group(0)
print item, price
Example output:
Spear & Jackson Predator Universal Hardpoint Saw - 22" 5.95
Note that for the item, the regular expression is used just to remove extra whitespace, while for the price it is used to capture the number.
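As a quick illustration of the two regular expressions in isolation:
>>> import re
>>> re.sub(r'\s+', ' ', 'Spear & Jackson\n    Predator Saw')
'Spear & Jackson Predator Saw'
>>> re.search(r'\d+\.\d+', 'Now: 5.95').group(0)
'5.95'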
html = '''
<h2><a href="#">Spear & Jackson Predator Universal Hardpoint Saw - 22"</a></h2>
<p><span class="productlist_mostwanted_rrp">
Was: <span class="strikethrough">£12.52</span></span><span class="productlist_mostwanted_save">Save: £6.57(52%)</span></p>
<div class="clear"></div>
<p class="productlist_mostwanted_price">Now: £5.95</p>
'''
from BeautifulSoup import BeautifulSoup
import re
soup = BeautifulSoup(html)
desc = soup.h2.a.getText()
price_str = soup.find('p', {"class": "productlist_mostwanted_price" }).getText()
price = float(re.search(r'[0-9.]+', price_str).group())
print desc, price