Webscraping specific sections of page without 'class' or 'id' identifiers - python

I am having issues web-scraping a tag element while using BeautifulSoup4 in Python. Typically the elements are given a class or id identifier, so I can use:
.find_all('p', class_='class-name')
to find the element. However, the elements I am trying to isolate are in a consecutive list of tags, none of which has an identifier.
Is there a way to choose every tag after a tag that has an identifier? Or maybe a way to isolate the specific tags I want without them having any shared class/id?

You could use find_next_sibling to find the classless next sibling of an element.
Consider this example HTML. The first div has the class "blah". The second div has no class but is beside the first div.
import bs4
html = '<div><div class="blah">1</div><div>no class</div></div>'
soup = bs4.BeautifulSoup(html, 'html.parser')
soup.find('div', {'class': 'blah'}).find_next_sibling()
# outputs the second div, which has no class:
# <div>no class</div>
See this and this for more details.
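To pick up every tag after the identified one, rather than just the next one, the plural find_next_siblings may be what you want. The following is a minimal sketch; the markup and the id "start" are made up for illustration and are not taken from the question.

```python
import bs4

# Hypothetical markup: one tag with an id, followed by tags without any identifier
html = (
    '<div>'
    '<h2 id="start">Heading</h2>'
    '<p>first</p><p>second</p><p>third</p>'
    '</div>'
)
soup = bs4.BeautifulSoup(html, 'html.parser')

# Locate the tag that does have an identifier
anchor = soup.find('h2', id='start')

# Collect every following <p> sibling, in document order
following = anchor.find_next_siblings('p')
texts = [p.get_text() for p in following]
print(texts)  # ['first', 'second', 'third']
```

Calling find_next_siblings() with no arguments would return all following sibling tags regardless of their name.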

Related

Python Beautiful Soup 4 - finding element by class and aria-label

I am trying to find an element with a particular class name and aria-label using Beautiful Soup 4. More specifically, I am scraping HTML where each item on the list has the same class (nd-list__item in-feat-item) but a different aria-label (e.g. aria-label="rooms"). Source code below:
I have to search for a specific combination of class and aria-label because, if it cannot be found, I must return None (e.g. if there is no <li .... aria-label="rooms"></li>, I must return None). Calling bs_object.find_all on the whole list and then iterating over the list elements is rather inefficient, and listings may order their items differently (e.g. if no number of rooms is provided, the first element will be aria-label="surface"), so I must be able to query directly whether a particular element is contained in the bs object.
Do you have some recommendations on how to do that without going in for bs_object.find_all('li', class_='nd-list_item in-feat__item') and then iterating over the whole list? I also thought about searching for the parent <ul></ul> tag and then using Regex - but it is also an overly complicated procedure. Thanks in advance for all the answers!
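One possible approach, sketched here under assumptions, is a CSS attribute selector with select_one, which returns None when nothing matches. The markup below is hypothetical (the original source code is not shown), the class names are copied from the first mention in the question, and find_feature is an illustrative helper name.

```python
from bs4 import BeautifulSoup

# Hypothetical markup modelled on the description in the question
html = """
<ul>
  <li class="nd-list__item in-feat-item" aria-label="surface">80 m2</li>
  <li class="nd-list__item in-feat-item" aria-label="bathrooms">2</li>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")


def find_feature(soup, label):
    """Return the first <li> with both classes and the given aria-label, or None."""
    selector = f'li.nd-list__item.in-feat-item[aria-label="{label}"]'
    return soup.select_one(selector)


rooms = find_feature(soup, "rooms")      # absent in this listing
surface = find_feature(soup, "surface")  # present

print(rooms)               # None
print(surface.get_text())  # 80 m2
```

Because the selector matches on the attribute value, the order of the list items does not matter.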

How to select the html attribute inside a selenium object

I am learning web scraping using selenium and I've come into an issue when trying to select an attribute inside of a selenium object. I can get the broader data if I just print elems.text inside the loop (this outputs the whole paragraph for each listing) however when I try to access the xpath of the h2 title tag of all the listings inside this broader element, it only appends the first listing to the titles array, whereas I want all of them. I checked the XPATH and they are the same for each listing. How can I get all of the listings instead of just the first one?
titles = []
driver.get("https://www.sellmytimesharenow.com/timeshare/All+Timeshare/vacation/buy-timeshare/")
results = driver.find_elements(By.CLASS_NAME, "results-list")
for elems in results:
    print(elems.text) #this prints out full description paragraphs
    elem_title = elems.find_element(By.XPATH, '//*[@id="search-page"]/div[3]/div/div/div[2]/div/div[2]/div/a[1]/div/div[1]/div/h2')
    titles.append(elem_title.text)
If you aren't limited to accessing the elements by XPATH only, then here is my solution:
results = driver.find_elements(By.CLASS_NAME, "result-box")
for elems in results:
    titles.append(elems.text.split("\n")[0])
When you try getting the listings, you use find_elements(By.CLASS_NAME, "results-list"), but on the website, there is only one element with the class name "results-list". This aggregates all the text in this div into one long string and therefore you can't get the heading.
However, there are multiple elements with the class name "result-box", so find_elements will store each as its own item in "results". Because the title of each listing is on the first line, you can split the text of each element by the newline.

How to select all table elements inside a div parent node with BeautifulSoup?

I am trying to select all table elements from a div parent node by using a customized function.
This is what I've got so far:
from bs4 import BeautifulSoup
import requests
import lxml
url = 'https://www.salario.com.br/profissao/abacaxicultor-cbo-612510'
def getTables(url):
    url = requests.get(url)
    soup = BeautifulSoup(url.text, 'lxml')
    div_component = soup.find('div', attrs={'class':'td-post-content'})
    tables = div_component.find_all('table', attrs={'class':'listas'})
    return tables
However when applied as getTables(url) the output is an empty list [].
I expect this function to return all HTML table elements inside the div node with the given attributes.
How could I adjust this function?
Is there any other library I could use to accomplish this task?
Taking what the other commenters have said, and expanding on it.
Your div_component is a single element that doesn't contain any tables, but using find_all() yields 8 elements:
len(soup.find_all('div', attrs={'class':'td-post-content'}))
So find() alone is not enough; you'll need to iterate through the find_all() results to find a div that contains tables.
Alternatively, to go straight after the tables you want, you can use
tables = soup.find_all('table', attrs={'class':'listas'})
where tables is a list with 6 elements. If you know which table you want, you can iterate through the tables until you find the one you want.
The first problem is that "find" returns only the first match, and the first td-post-content <div> does not contain any tables. I think you want "find_all". Second, you can use CSS selectors with BeautifulSoup through the select method. So, you can search for soup.select('div.td-post-content') without using the attributes parameter.
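To illustrate both points, here is a small sketch on made-up HTML (the real page is not reproduced here, so the structure below is an assumption): find() stops at the first div, which has no tables, while a CSS selector reaches the tables inside any matching div.

```python
from bs4 import BeautifulSoup

# Hypothetical page: the first matching div has no tables, the second has two
html = """
<div class="td-post-content"><p>intro, no tables here</p></div>
<div class="td-post-content">
  <table class="listas"><tr><td>a</td></tr></table>
  <table class="listas"><tr><td>b</td></tr></table>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# find() returns only the first div, so searching inside it finds nothing
first_div = soup.find("div", attrs={"class": "td-post-content"})
tables_in_first = first_div.find_all("table", attrs={"class": "listas"})
print(len(tables_in_first))  # 0

# select() with a descendant selector searches inside every matching div
tables = soup.select("div.td-post-content table.listas")
print(len(tables))  # 2
```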

Python/Selenium Finding a specific class element, analyzing if it contains a specific span class, if it does, copy the link

I am trying to create a script that loops through my inbox and finds all div elements whose class contains "relative flex". If the div contains a span with class "dn dib-l", the script should copy and save the following href link to my list and move on to the next div.
Here is the html code:
<div class="relative flex">
<span class="dn dib-l" style="left: -16px;"></span>
hey how are you?
Here is the code I have now:
link_list = []
sex_list = []
message = browser.find_elements_by_xpath('//*[@class="relative flex"]')
message_new = browser.find_elements_by_xpath('//*[@class="dn dib-l"]')
for item in message:
    link = item.find_element_by_xpath('.//a').get_attribute('href')
    if message_new in message:
        link_list.append(link)
Issue:
message, message_new all contain data when requested, however despite there being multiple messages with these classes, link variable only contains one element and link_list contains no elements. What changes do I need to make in my code in order for it to save all links within div classes that contain this span class?
I would restructure this code a bit to make it more efficient. To me, it sounds like you want to analyze all of the div elements that have class relative flex. Then, if the div contains a certain span element, you want to save the href tag of the following a item. Here's how I would write this:
# locate the span elements which exist under your desired div
spans_to_iterate = browser.find_elements_by_xpath("//div[contains(@class, 'relative flex')]/span[contains(@class, 'dn dib-l')]")
link_list = []
# iterate span elements to save the href attribute of the a element
for span in spans_to_iterate:
    # get the href attribute, where the 'a' element is a following sibling of span.
    link_text = span.find_element_by_xpath("following-sibling::a").get_attribute("href")
    link_list.append(link_text)
The idea behind this code is that we first retrieve the span elements that exist in your desired div. In your problem description, you mentioned you only wanted to save the link if the div and span elements contained specific class names. So, we query directly on the elements that you have mentioned, rather than find div first then find span.
Then, we iterate over these span elements and use XPath's following-sibling notation to grab the a element that appears right after. We can use get_attribute to read the href attribute, and then append the link to the list.
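The same traversal can be tried without a browser. The sketch below swaps in BeautifulSoup on static HTML instead of Selenium, purely so that it runs on its own; the markup is hypothetical (the HTML in the question is truncated, so the a elements and their URLs are invented for illustration).

```python
from bs4 import BeautifulSoup

# Hypothetical inbox markup; only the class names come from the question
html = """
<div class="relative flex">
  <span class="dn dib-l" style="left: -16px;"></span>
  <a href="/messages/1">hey how are you?</a>
</div>
<div class="relative flex">
  <a href="/messages/2">no marker span in this one</a>
</div>
<div class="relative flex">
  <span class="dn dib-l"></span>
  <a href="/messages/3">hello again</a>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

link_list = []
# Select only the spans that are direct children of the desired divs
for span in soup.select("div.relative.flex > span.dn.dib-l"):
    # The <a> element is a following sibling of the span
    link = span.find_next_sibling("a")
    if link is not None:
        link_list.append(link["href"])

print(link_list)  # ['/messages/1', '/messages/3']
```

The second div is skipped because it has no matching span, which is the filtering the question asks for.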
Try this:
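# find_elements returns elements, not attribute nodes, so select the <a> and read its href
xpth = "//div[@class='relative flex' and ./span[@class='dn dib-l']]//a[@href]"
links = [a.get_attribute('href') for a in browser.find_elements_by_xpath(xpth)]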

Accessing content of all divs having same class name but different xpaths

I am trying to extract data from two divisions in XHTML having the same class name using python but when I try to take out their xpaths, they are different. I tried using
driver = webdriver.Chrome()
content = driver.find_element_by_class_name("abc")
print(content.text)
but it gives only the content of first div. I heard this can be done using xpath. The xpaths of the divs are as follows:
//*[@id="u_jsonp_2_o"]/div[2]/div[1]/div[3]
//*[@id="tl_unit_-5698935668596454905"]/div/div[2]/div[1]/div[3]
//*[@id="u_jsonp_3_c"]/div[2]/div[1]/div[3]
Since each xpath has the same ending, can we use this similarity to access the divisions in Python by writing [1], [2], [3], ... at the end of the xpath?
Also, I want to make content an array containing the content of all elements with class abc. Moreover, I don't know how many of them exist. How can I collect the data of all of them in one content array?
In your case it doesn't matter whether you use the class name or CSS. The problem is that find_element searches for only one element, but you want to find several, so you need to use find_elements_by_class_name (note the plural "elements"):
content = driver.find_elements_by_class_name("abc")
for element in content:
    # here you can work with every single element that has class "abc"
    # do whatever you want
    print(element.text)
