Python Beautiful Soup 4 - finding element by class and aria-label

I am trying to find an element with a particular class name and aria-label using Beautiful Soup 4. More specifically, I am scraping an HTML page where each item on the list has the same class (nd-list__item in-feat-item) but a different aria-label (e.g. aria-label="rooms"). Source code below:
I have to search for a specific combination of class and aria-label because if I am unable to find it, I must return a None value (e.g. if there is no <li ... aria-label="rooms"></li>, I must return None). Using the bs_object.find_all method on the whole list and then iterating over each of the list elements is rather inefficient, as some listings may have different orderings (e.g. if no number of rooms is provided, then the first element will be the one with aria-label="surface"), so I must be able to query directly whether the particular element is contained in the bs object.
Do you have some recommendations on how to do that without resorting to bs_object.find_all('li', class_='nd-list__item in-feat-item') and then iterating over the whole list? I also thought about searching for the parent <ul></ul> tag and then using a regex, but that is also an overly complicated procedure. Thanks in advance for all the answers!
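A minimal sketch of one direct approach (the HTML below is a hypothetical reconstruction, since the original snippet is not shown): find accepts an attrs dict and returns None when nothing matches, and the same goes for select_one with a CSS attribute selector.
from bs4 import BeautifulSoup

# hypothetical markup mirroring the classes described in the question
html = '''<ul>
<li class="nd-list__item in-feat-item" aria-label="rooms">3 rooms</li>
<li class="nd-list__item in-feat-item" aria-label="surface">70 m2</li>
</ul>'''
soup = BeautifulSoup(html, 'html.parser')

# find() matches multi-valued classes one class at a time, so a single
# class from the pair is enough; it returns None if no such <li> exists
rooms = soup.find('li', attrs={'class': 'in-feat-item', 'aria-label': 'rooms'})

# equivalent CSS attribute selector; select_one() also returns None on no match
rooms = soup.select_one('li.in-feat-item[aria-label="rooms"]')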

Related

Webscraping specific sections of page without 'class' or 'id' identifiers

I am having issues web-scraping a tag element while using BeautifulSoup4 in Python. Typically the elements are given a class or id identifier, in which case I can use:
.find_all('p', class_='class-name')
to find the element. However, the elements I am trying to isolate are in a consecutive list of tags, none of which has any identifier.
Is there a way to choose every tag after a tag that has an identifier? Or maybe a way to isolate the specific tags I want without them having any shared class/id?
You could use find_next_sibling to find the classless next sibling of an element.
Consider this example HTML. The first div has the class "blah". The second div has no class but is beside the first div.
html = '<div><div class="blah">1</div><div>no class</div></div>'
import bs4
soup = bs4.BeautifulSoup(html, 'html.parser')
soup.find('div', {'class': 'blah'}).find_next_sibling()
# outputs the second div, which has no class:
# <div>no class</div>
See the Beautiful Soup documentation on find_next_sibling for more details.
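If you need every tag after the identified element rather than just the one immediately beside it, find_next_siblings (plural) returns them all; a minimal sketch extending the same example (the extra <p> is hypothetical):
import bs4

html = '<div><div class="blah">1</div><div>no class</div><p>also no class</p></div>'
soup = bs4.BeautifulSoup(html, 'html.parser')

# collect every later sibling of the identified element at the same level
for sibling in soup.find('div', {'class': 'blah'}).find_next_siblings():
    print(sibling)
# <div>no class</div>
# <p>also no class</p>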

How do I match up two lists and only change the second in each pair?

I’m trying to make a Python plugin to automate adding HTML attributes to hyperlinks that fulfil certain criteria for footnotes in ebooks – for example, if it’s superscript, or if it’s a number in square or round brackets. So far so good: I’ve managed to add the attributes using Beautiful Soup for these conditions.
There are many footnote pairs in different ebooks. Ebooks are all made differently (so the footnotes don't necessarily all have the same class, for example). Each footnote number has a URL with a fragment identifier that is bidirectionally linked to another link with a corresponding ID to help the reader navigate.
For example:
// on chapter.xhtml
Footnote 1 <a id="fn1" href="../Text/chapter.xhtml#rfn1">[1]</a>
Footnote 2 <a id="fn2" href="../Text/chapter.xhtml#rfn2">[2]</a>
1. <a id="rfn1" href="../Text/chapter.xhtml#fn1">1.</a> Footnote 1
2. <a id="rfn2" href="../Text/chapter.xhtml#fn2">2.</a> Footnote 2
Desired result (the returning links can appear anywhere in the ebook, which is why it's useful to automate this process):
Footnote 1 <a id="fn1" href="../Text/chapter.xhtml#rfn1">[1]</a>
Footnote 2 <a id="fn2" href="../Text/chapter.xhtml#rfn2">[2]</a>
1. <a id="rfn1" href="../Text/chapter.xhtml#fn1" role="doc-backlink">1.</a> Footnote 1
2. <a id="rfn2" href="../Text/chapter.xhtml#fn2" role="doc-backlink">2.</a> Footnote 2
Now I wish to add an HTML attribute to all the links whose job is to return to the initial link in the pair. These will always be the links in the footnote pair that come second in the ebook (but their identifier could be named anything). However, there are many footnotes and I’m struggling with the matching exercise.
So a few questions which I’d really appreciate some help with:
How do I find the fragment identifier of every footnote link?
How do I find the ID of every footnote link?
How do I compare the fragment identifiers and IDs?
How do I then add an HTML attribute to only the second occurrence in the ebook in each footnote pair?
I've tried nested for loops but I'm not actually sure how to achieve this. Currently I'm finding all the links using Beautiful Soup and, if they satisfy certain criteria, adding the relevant attributes using Beautiful Soup.
There are multiple chapters (xhtml files) in the ebooks so I'm hoping this won't affect the outcome of the plugin.
I’m completely new to this, so thanks for your time.
Assumption: Footnotes always come second.
We'll iterate over all the links in a page, trying to see if each link contains a fragment identifier in its href attribute. If it does, we'll use that to fetch the matching link.
We'll use find_next instead of find, because the latter will fetch a matching tag from anywhere in the document, whereas find_next will only try to find from the position of the object being processed. I'll make it clearer with an example:
some_link['href']
# ../Text/chapter.xhtml#rfn1
soup.find('a', {'id': 'rfn1'})
# <a id="rfn1" href="../Text/chapter.xhtml#fn1" role="doc-backlink">1.</a>
If we use find on the soup, we can't be sure whether the link found appeared before or after the original link. However, if we use find_next...
footnote_link = some_link.find_next('a', {'id': 'rfn1'})
footnote_link
# <a id="rfn1" href="../Text/chapter.xhtml#fn1" role="doc-backlink">1.</a>
footnote_link.find_next('a', {'id': 'fn1'})
# None
... we can be sure that this link appeared second (and is hence a footnote backlink), because find_next returns None if it can't find a match starting from the position of the object on which we call find_next.
Here's what the full code will probably look like:
for link in soup.find_all('a'):
    try:
        fragment_id = link['href'].rsplit('#', maxsplit=1)[1]
    except (KeyError, IndexError):
        # no href at all, or rsplit returned only one string,
        # meaning '#' wasn't found in the href
        continue
    footnote = link.find_next('a', {'id': fragment_id})
    if footnote:
        # a matching link was found later in the document, so it is the
        # second of the pair; add the attribute by modifying `footnote`
        footnote['role'] = 'doc-backlink'
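Since the question mentions multiple xhtml files, here is a hedged sketch of the outer loop (the directory layout is hypothetical; note that find_next only matches within one parsed document, so pairs that span two files would need a separate pass):
from pathlib import Path
from bs4 import BeautifulSoup

for path in Path('Text').glob('*.xhtml'):  # hypothetical ebook layout
    soup = BeautifulSoup(path.read_text(encoding='utf-8'), 'html.parser')
    # ... run the matching loop above on this soup ...
    path.write_text(str(soup), encoding='utf-8')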

I have created a list using find_all in Beautiful Soup based on an attribute. How do I return the node I want?

I have an MS Word document template that has structured document tags, including repeating sections. I am using a Python script to pull out the important parts and send them to a dataframe. My script works as intended on 80% of the documents I have attempted, but fails on the rest. The issue is in finding the first repeating section; I have been doing the following:
from bs4 import BeautifulSoup as BS
soup = BS(f, 'xml')           # entire xml file is called soup
soupdocument = soup.document  # document is the only child node of soup
soupbody = soupdocument.body  # body is the only child node of document
ODR = soupbody.contents[5]
This often works; however, some users have managed to hit Enter in places in the document that are not locked down. I know the issue should be resolved by not hard-coding the fifth element of soupbody.
soupbody.find_all('tag')
# [<w:tag w:val="First Name"/>,
#  <w:tag w:val="Last Name"/>,
#  <w:tag w:val="Position"/>,
#  <w:tag w:val="Phone Number"/>,
#  <w:tag w:val="Email"/>,
#  <w:tag w:val="ODR Repeating Section"/>,
#  ...]
The above is a partial list of what is returned; the actual list has several dozen tags, and some are repeated. The section I want is the last one listed above, and it is usually, but not always, found by the first code block. I believe I can pass a value along with the tag, something like find_all({tag: SOMETHING}); I have tried cutting and pasting different parts of "ODR Repeating Section", but it doesn't work. What is the correct way to find this section?
Perhaps specify the attribute you're searching for in addition to the tag name:
tags = soup.find_all('tag', {'w:val': 'ODR Repeating Section'})
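As a hedged follow-up sketch: in WordprocessingML the w:tag element normally sits inside w:sdtPr, so the repeating section itself is the enclosing w:sdt element, which find_parent can climb to (the file path is hypothetical):
from bs4 import BeautifulSoup

with open('word/document.xml') as f:  # hypothetical path inside an unzipped .docx
    soup = BeautifulSoup(f, 'xml')

# locate the tag by its w:val, then climb to the structured document
# tag (w:sdt) that actually contains the repeating section's content
tag = soup.find('tag', {'w:val': 'ODR Repeating Section'})
odr = tag.find_parent('sdt') if tag is not None else None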

PYTHON - Unable To Find Xpath Using Selenium

I have been struggling with this for a while now.
I have tried various ways of finding the XPath for the following highlighted HTML.
I am trying to grab the dollar value listed under the highlighted strong tag.
Here is what my last attempt looks like:
try:
    price = browser.find_element_by_xpath(".//table[@role='presentation']")
    price.find_element_by_xpath(".//tbody")
    price.find_element_by_xpath(".//tr")
    price.find_element_by_xpath(".//td[@align='right']")
    price.find_element_by_xpath(".//strong")
    print(price.get_attribute("text"))
except:
    print("Unable to find element text")
I attempted to access the table and all nested elements but I am still unable to access the highlighted portion. Using .text and get_attribute('text') also does not work.
Is there another way of accessing the nested element?
Or maybe I am not using XPath properly.
I have also tried the below:
price = browser.find_element_by_xpath("/html/body/div[4]")
UPDATE:
The site I am using is www.concursolutions.com. I am attempting to automate booking a flight using Selenium.
When you reach the end of the booking process and receive the price, I am unable to print out the price based on the HTML.
It may have something to do with the HTML being generated by JavaScript as you proceed.
Looking at the structure of the html, you could use this xpath expression:
//div[@id="gdsfarequote"]/center/table/tbody/tr[14]/td[2]/strong
Making it work
There are a few things keeping your code from working.
price.find_element_by_xpath(...) returns a new element.
Each time, you're not saving it to use with your next query. Thus, when you finally ask it for its text, you're still asking the <table> element—not the <strong> element.
Instead, you'll need to save each found element in order to use it as the scope for the next query:
table = browser.find_element_by_xpath(".//table[@role='presentation']")
tbody = table.find_element_by_xpath(".//tbody")
tr = tbody.find_element_by_xpath(".//tr")
td = tr.find_element_by_xpath(".//td[@align='right']")
strong = td.find_element_by_xpath(".//strong")
find_element_by_* returns the first matching element.
This means your call to tbody.find_element_by_xpath(".//tr") will return the first <tr> element in the <tbody>.
Instead, it looks like you want the third:
tr = tbody.find_element_by_xpath(".//tr[3]")
Note: XPath is 1-indexed.
get_attribute(...) returns HTML element attributes.
Therefore, get_attribute("text") will return the value of the text attribute on the element.
To return the text content of the element, use element.text:
strong.text
Cleaning it up
But even with the code working, there’s more that can be done to improve it.
You often don't need to specify every intermediate element.
Unless there is some ambiguity that needs to be resolved, you can ignore the <tbody> and <td> elements entirely:
table = browser.find_element_by_xpath(".//table[@role='presentation']")
tr = table.find_element_by_xpath(".//tr[3]")
strong = tr.find_element_by_xpath(".//strong")
XPath can be overkill.
If you're just looking for an element by its tag name, you can avoid XPath entirely:
strong = tr.find_element_by_tag_name("strong")
The fare row may change.
Instead of relying on a specific position, you can scope using a text search:
tr = table.find_element_by_xpath(".//tr[contains(text(), 'Base Fare')]")
Other <table> elements may be added to the page.
If the table had some header text, you could use the same text search approach as with the <tr>.
In this case, it would probably be more meaningful to scope to the #gdsfarequote <div> rather than something as ambiguous as a <table>:
farequote = browser.find_element_by_id("gdsfarequote")
tr = farequote.find_element_by_xpath(".//tr[contains(text(), 'Base Fare')]")
But even better, capybara-py provides a nice wrapper on top of Selenium, helping to make this even simpler and clearer:
fare_quote = page.find("#gdsfarequote")
base_fare_row = fare_quote.find("tr", text="Base Fare")
base_fare = base_fare_row.find("strong").text

Accessing content of all divs having same class name but different xpaths

I am trying to extract data from two divs in XHTML that have the same class name, using Python, but when I inspect their XPaths, they are different. I tried using
driver = webdriver.Chrome()
content = driver.find_element_by_class_name("abc")
print(content.text)
but it gives only the content of the first div. I heard this can be done using XPath. The XPaths of the divs are as follows:
//*[@id="u_jsonp_2_o"]/div[2]/div[1]/div[3]
//*[@id="tl_unit_-5698935668596454905"]/div/div[2]/div[1]/div[3]
//*[@id="u_jsonp_3_c"]/div[2]/div[1]/div[3]
Since each XPath has the same ending, can we use this similarity and access the divs in Python by writing [1], [2], [3], ... at the end of the XPath?
Also, I want to make content an array containing the text of all elements with the class name abc. Moreover, I don't know how many abcs exist! How do I integrate the data from all of them into one content array?
In your case it doesn't matter whether you search by class name or CSS selector; with find_element you are only searching for one element, but you want to find several elements:
you need to use find_elements_by_class_name (note the plural elements)
content = driver.find_elements_by_class_name("abc")
for element in content:
    # here you can work with every single element that has class "abc"
    # do whatever you want
    print(element.text)
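To address the array part of the question: find_elements_by_class_name already returns a list, so collecting the text of every matching div takes one line; a minimal sketch (the URL is hypothetical):
from selenium import webdriver

driver = webdriver.Chrome()
driver.get("https://example.com")  # hypothetical page containing the "abc" divs

# the find_elements_* methods return a (possibly empty) list, so you
# don't need to know in advance how many "abc" divs exist
content = [element.text for element in driver.find_elements_by_class_name("abc")]
print(content)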
