Can I access the subchild of a parent in XPath? - python

As the title states, I have some HTML code from http://chem.sis.nlm.nih.gov/chemidplus/name/acetone that I am parsing, and I want to extract some data, such as the "Acetone" under "MeSH Heading", as in my similar post How to set up XPath query for HTML parsing?
<div id="names">
<h2>Names and Synonyms</h2>
<div class="ds">
<button class="toggle1Col" title="Toggle display between 1 column of wider results and multiple columns.">↔</button>
<h3>Name of Substance</h3>
<div class="yui3-g-r">
<div class="yui3-u-1-4">
<ul>
<li id="ds2">
<div>2-Propanone</div>
</li>
</ul>
</div>
<div class="yui3-u-1-4">
<ul>
<li id="ds3">
<div>Acetone</div>
</li>
</ul>
</div>
<div class="yui3-u-1-4">
<ul>
<li id="ds4">
<div>Acetone [NF]</div>
</li>
</ul>
</div>
<div class="yui3-u-1-4">
<ul>
<li id="ds5">
<div>Dimethyl ketone</div>
</li>
</ul>
</div>
</div>
<h3>MeSH Heading</h3>
<ul>
<li id="ds6">
<div>Acetone</div>
</li>
</ul>
</div>
</div>
Previously, on other pages, I would do mesh_name = tree.xpath('//*[text()="MeSH Heading"]/..//div')[1].text_content() to extract the data, because those pages had similar structures; now I see that is not the case, as I didn't account for inconsistency. So, is there a way of going to the node that I want and then obtaining its subchild, allowing for consistency across different pages?
Would doing tree.xpath('//*[text()="MeSH Heading"]//preceding-sibling::text()[1]') work?

From what I understand, you need to get the list of items by a heading title.
How about making a reusable function that would work for every heading in the "Names and Synonyms" container:
from lxml.html import parse

tree = parse("http://chem.sis.nlm.nih.gov/chemidplus/name/acetone")

def get_contents_by_title(tree, title):
    return tree.xpath("//h3[. = '%s']/following-sibling::*[1]//div/text()" % title)

print(get_contents_by_title(tree, "Name of Substance"))
print(get_contents_by_title(tree, "MeSH Heading"))
Prints:
['2-Propanone', 'Acetone', 'Acetone [NF]', 'Dimethyl ketone']
['Acetone']
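A nice side effect of anchoring on the h3 text: if a particular page lacks a given heading, the XPath simply matches nothing and the function returns an empty list, so callers can guard against inconsistent pages. A minimal sketch (get_mesh_name is a hypothetical wrapper, not part of the original answer):
def get_mesh_name(tree):
    # Hypothetical helper: first "MeSH Heading" entry, or None if the page has none.
    results = get_contents_by_title(tree, "MeSH Heading")
    return results[0] if results else None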

Related

Python BeautifulSoup Loop through divs and multiple elements

I have a website containing film listings; I've put together a simplified version of its HTML below. Please note that in the real-world example the <ul> tags are not direct children of the class film_listing or showtime. They are found under several <div> or <ul> elements.
<li class="film_listing">
<h3 class="film_title">James Bond</h3>
<ul class="showtimes">
<li class="showtime">
<p class="start_time">15:00</p>
</li>
<li class="showtime">
<p class="start_time">19:00</p>
<ul class="attributes">
<li class="audio_desc">
</li>
<li class="open_cap">
</li>
</ul>
</li>
</ul>
</li>
I have created a Python script to scrape the website, which currently lists all film titles with the first showtime and first attribute of each. However, I am trying to list all showtimes. The final aim is to list only film titles with open captions, plus the showtimes of those open-captioned performances.
Here is the Python script with a nested for loop that doesn't work: it prints all showtimes for all films, rather than the showtimes for a specific film. It is also not yet set up to list only captioned films. I suspect the logic may be wrong and would appreciate any advice. Thanks!
for i in soup.findAll('li', {'class':'film_listing'}):
    film_title = i.find('h3', {'class':'film_title'}).text
    print(film_title)
    for j in soup.findAll('li', {'class':'showtime'}):
        print(j['showtime.text'])
    # For the time listings, find ones with Open Captioned
    i = filmlisting.find('li', {'class':'open_cap'})
    print(film_access)
Edit: small correction to the HTML snippet.
There are many ways you could extract the information. One way is to "search backwards": search for the <li> with class="open_cap" and then find the previous start time and film title:
from bs4 import BeautifulSoup
txt = '''
<li class="film_listing">
<h3 class="film_title">James Bond</h3>
<ul class="showtimes">
<li class="showtime">
<p class="start_time">15:00</p>
</li>
<li class="showtime">
<p class="start_time">19:00</p>
<ul class="attributes">
<li class="audio_desc">
</li>
<li class="open_cap">
</li>
</ul>
</li>
</ul>
</li>'''
soup = BeautifulSoup(txt, 'html.parser')
for open_cap in soup.select('.open_cap'):
    print('Name :', open_cap.find_previous(class_='film_title').text)
    print('Start time :', open_cap.find_previous(class_='start_time').text)
    print('-' * 80)
Prints:
Name : James Bond
Start time : 19:00
--------------------------------------------------------------------------------
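One design note on this approach: find_previous() walks backwards through the whole document, so it pairs each open_cap with the nearest preceding start_time and film_title. That pairing should still hold when, as on the real site, the <ul> tags sit several <div> levels deeper, which is exactly the inconsistency the question mentions.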
Content of read.html
<li class="film_listing">
<h3 class="film_title">James Bond</h3>
<ul class="showtimes">
<li class="showtime">
<p class="start_time">15: 00</p>
</li>
<li class="showtime">
<p class="start_time">19:00</p>
<ul class="attributes">
<li class="audio_desc"></li>
<li class="open_cap"></li>
</ul>
</li>
</ul>
</li>
As you said, the <ul> tags are not direct children of the class film_listing or showtime, so you can try find() to get the first element with a specified tag name, or use find_all() to get a list of all elements with that tag name.
You can try this
from bs4 import BeautifulSoup as bs

text = open("read.html", "r")
soup = bs(text.read(), 'html.parser')

for listing in soup.find_all("li", class_="film_listing"):
    print("Film name: ", listing.find("h3", class_="film_title").text)
    print("Start time: ", listing.find("p", class_="start_time").text)
Output:
Film name: James Bond
Start time: 15: 00
Instead of find(), you can use the find_all() method, which will return all the tags with the name <p> and class start_time.
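For instance, a minimal sketch building on the loop above (only the inner find_all() over start times is new):
for listing in soup.find_all("li", class_="film_listing"):
    print("Film name: ", listing.find("h3", class_="film_title").text)
    for start in listing.find_all("p", class_="start_time"):
        print("Start time: ", start.text)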

Extracting full URL from href tag in scrapy

I'm trying to use Scrapy to scrape the URLs of the offers on this site.
This is the code I tried:
url = response.css('a[data-tracking="click_body"]::attr(href)').extract()
But my code returns something very different from a URL.
Here is the HTML code of the div I'm interested in.
<div class="offer-item-details">
<header class="offer-item-header">
<h3>
<a href="https://www.otodom.pl/oferta/gdansk-pod-inwestycje-cicha-lokalizacja-ID46DXu.html#ab04badaa0" data-tracking="click_body" data-tracking-data="{"touch_point_button":"title"}" data-featured-name="promo_top_ads">
<strong class="visible-xs-block">42 m²</strong>
<span class="text-nowrap">
<span class="offer-item-title">Gdańsk/ Pod Inwestycje/ Cicha Lokalizacja</span>
</span>
</a>
</h3>
<p class="text-nowrap"><span class="hidden-xs">Mieszkanie na sprzedaż: </span>Gdańsk, Ujeścisko-Łostowice, Łostowice</p>
<div class="vas-list-no-offer">
<a class="button-observed observe-link favourites-button observed-text svg-heart add-to-favourites" data-statkey="ad.observed.list" rel="nofollow" data-id="60688916" href="#" title="Obserwuj">
<div class="observed-text-container" style="display: flex;">
<span class="icon observed-60688916"></span>
<i class="icon-heart-filled"></i>
<div class="observed-label">Dodaj do ulubionych</div>
</div>
</a>
</div>
</header>
<ul class="params
" data-tracking="click_body" data-tracking-data="{"touch_point_button":"body"}">
<li class="offer-item-rooms hidden-xs">2 pokoje</li>
<li class="offer-item-price">
346 000 zł </li>
<li class="hidden-xs offer-item-area">42 m²</li>
<li class="hidden-xs offer-item-price-per-m">8 238 zł/m²</li>
</ul>
</div>
Copied selector of that tag:
#offer-item-ad_id45Wog > div.offer-item-details > header > h3 > a
Copied XPath:
//*[@id="offer-item-ad_id45Wog"]/div[1]/header/h3/a
Copied full XPath:
/html/body/div[3]/main/section[2]/div/div/div[1]/div/article[1]/div[1]/header/h3/a
Your code gives you a list of the URLs; the extract() method in this case returns a list. To have Scrapy export the data, you will have to use a for loop with a yield statement.
url = response.css('a[data-tracking="click_body"]::attr(href)').extract()
for a in url:
    yield {'url': a}
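As an aside (not part of the original answer): if a site ever returns relative hrefs instead of the absolute ones shown here, response.urljoin() will expand them against the page URL:
url = response.css('a[data-tracking="click_body"]::attr(href)').extract()
for a in url:
    yield {'url': response.urljoin(a)}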

Selenium : How to wait then click?

I'm using Selenium for automation, and I want to click each one of the <ul>'s <li> elements, then wait before clicking the next one. This is my code, but it doesn't seem to be the solution:
def navBar():
    driver = setup()
    navBar_List = driver.find_element_by_class_name("nav")
    listItem = navBar_List.find_elements_by_tag_name("li")
    for item in listItem:
        WebDriverWait(driver, 10).until(EC.element_to_be_clickable((By.TAG_NAME, "li")))
        item.click()
Here is the HTML code:
<ul class="nav navbar-nav">
<li tabindex="0">
<a class="h">
<div class="icon-left-navbar">
...
</div>
</a>
</li>
<li tabindex="0">
<a class="h">
<div class="icon-left-navbar">
...
</div>
</a>
</li>
<li tabindex="0">
<a class="h">
<div class="icon-left-navbar">
...
</div>
</a>
</li>
</ul>
Is Thread.sleep(100) an option?
Locate your li elements with .find_elements.
Use an XPath to recognize them: //ul[@class='nav navbar-nav']//li.
In the loop you can use an incrementing index to wait for each li in turn; I imagine it will produce locators like the ones below:
(xpath)[1]
(xpath)[2]
etc...
And try the following code:
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

listItem = WebDriverWait(driver, 30).until(EC.visibility_of_all_elements_located((By.XPATH, "//ul[@class='nav navbar-nav']//li")))
for x in range(1, len(listItem) + 1):
    WebDriverWait(driver, 10).until(EC.element_to_be_clickable((By.XPATH, "(//ul[@class='nav navbar-nav']//li)[" + str(x) + "]"))).click()
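A likely reason the answer re-locates each <li> by an indexed XPath inside the loop, rather than clicking the elements already collected in listItem: if a click re-renders the menu, the previously found elements can go stale and raise StaleElementReferenceException, while a fresh indexed lookup combined with element_to_be_clickable waits for the current version of each item.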

Using Selenium and BS4 is it possible to scrape the text outside the "=" within the div tag

I am looking at scraping the below information using both Selenium and bs4, and was wondering: if I find the below div tag, is it possible to scrape the data inside the quotation marks? For example: data-room-type-code="SUK"
<div
  class="sl-flexbox room-price-item hidden-top-border"
  data-room-name="Superior Shard Room"
  data-bed-type="K"
  data-bed-name="King"
  data-pay-type-tag-filter="No Prepayment"
  data-cancel-tag-filter=""
  data-breakfast-tag-filter=""
  data-room-type-code="SUK"
  data-rate-code="ZBAR"
  data-price="430"
>
  <div class="room-price-basic-info">
    <div class="room-price-title title-regular">Flexible Rate / CustomStay</div>
    <ul class="abstract text-regular">
      <li>No Prepayment</li>
    </ul>
    <div
      class="show-detail text-btn js-show-detail"
      data-index="0-productRates-0"
    >
      OFFER DETAILS
    </div>
  </div>
  <div class="room-price-book-info">
    <div class="number text-medium">GBP 430</div>
  </div>
  <div class="boot-btn text-medium js-booking-room" data-type="PRICE">
    Book Now
  </div>
</div>
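For what it's worth, data-* attribute values like these are exposed as plain dictionary-style attributes by BeautifulSoup. A minimal sketch, assuming the markup above is already in a string html (the variable names are illustrative):
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'html.parser')
div = soup.find('div', class_='room-price-item')  # matches any one of the div's classes
print(div['data-room-type-code'])  # SUK
print(div.get('data-price'))       # 430
With Selenium, the equivalent on a located element is element.get_attribute('data-room-type-code').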

Edit text from html with BeautifulSoup

I'm currently trying to extract the HTML elements which have text of their own, and to wrap that text with a special tag.
For example, my HTML looks like this:
<ul class="myBodyText">
<li class="fields">
This text still has children
<b>
Simple Text
</b>
<div class="s">
<ul class="section">
<li style="padding-left: 10px;">
Hello <br/>
World
</li>
</ul>
</div>
</li>
</ul>
I'm trying to wrap markers only around the text nodes, so I can further parse them at a later time. I tried to make it look like this:
<ul class="bodytextAttributes">
<li class="field">
[Editable]This text still has children[/Editable]
<b>
[Editable]Simple Text[/Editable]
</b>
<div class="sectionFields">
<ul class="section">
<li style="padding-left: 10px;">
[Editable]Hello [/Editable]<br/>
[Editable]World[/Editable]
</li>
</ul>
</div>
</li>
</ul>
My script so far iterates just fine, but the placement of the edit placeholders isn't working, and I currently have no idea how I can check this:
def parseSection(node):
    b = str(node)
    changes = set()
    tag_start, tag_end = extractTags(b)
    # index 0 is the element itself
    for cell in node.findChildren()[1:]:
        if cell.findChildren():
            cell = parseSection(cell)
        else:
            # safe to extract with regular expressions, only 1 standardized tag created by BeautifulSoup
            subtag_start, subtag_end = extractTags(str(cell))
            changes.add((str(cell), "[/EditableText]{0}[EditableText]{1}[/EditableText]{2}[EditableText]".format(subtag_start, str(cell.text), subtag_end)))
    text = extractText(b)
    for change in changes:
        text = text.replace(change[0], change[1])
    return bs("{0}[EditableText]{1}[/EditableText]{2}".format(tag_start, text, tag_end), "html.parser")
The script generates the following:
<ul class="myBodyText">
[EditableText]
<li class="fields">
This text still has children
[/EditableText]
<b>
[EditableText]
Simple Text
[/EditableText]
</b>
[EditableText]
<div class="s">
<ul class="section">
<li style="padding-left: 10px;">
Hello [/EditableText]
<br/>
[EditableText][/EditableText]
<br/>
[EditableText]
World
</li>
</ul>
</div>
</li>
[/EditableText]
</ul>
How can I check this and fix it? I'm grateful for every possible answer.
There is a built-in replace_with() method that fits the use case nicely:
from bs4 import BeautifulSoup

soup = BeautifulSoup(data, "html.parser")
for node in soup.find_all(text=lambda x: x.strip()):
    node.replace_with("[Editable]{}[/Editable]".format(node))
print(soup.prettify())
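Two notes on why this works, for the record: find_all() materializes its matches into a list before the loop runs, so replacing nodes mid-iteration doesn't disturb the traversal, and the text= filter with x.strip() skips whitespace-only text nodes between tags, so only real text like "Simple Text" gets wrapped.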
