I am trying to extract anchor elements from my BeautifulSoup object. They share a common class attribute, and each is nested inside several divs; the divs repeat and are separated by some scripts.
I have tried to take advantage of the common class attribute on the anchor elements to extract them.
The HTML I'm working with:
<div id='container'>
<div class='nested'>
<a href='some url' class='link'>
</a>
</div>
</div>
<!-- some scripts ... -->
<div id='container'>
<div class='nested'>
<a href='some url' class='link'>
</a>
</div>
</div>
What I tried:
import requests, bs4, webbrowser
webpage = requests.get('some url')
webpage.raise_for_status()
soup = bs4.BeautifulSoup(webpage.text, 'html.parser')
links = soup.select('.link a')
for i in range(0, 5):
    webbrowser.open('initial site url' + links[i].get('href'))
print(links)
No tabs were opened, and printing links gave an empty list.
Replace this line:
links = soup.select('.link a')
with:
links = soup.find_all('a', {'class': 'link'})
print(links)
Output:
[<a class="link" href="some url">
</a>, <a class="link" href="some url">
</a>]
To get the href from each a tag:
for link in links:
    href = link['href']
    print(href)
.link a selects all a tags that are descendants of elements with class link. The space between the two is the CSS descendant combinator, meaning the left-hand side is an ancestor and the right-hand side a descendant. Remove the space to apply both conditions to the same element. Note that you still need to extract the href attribute from the matched tags:
links = [item['href'] for item in soup.select('a.link')]
If you need to specify the parent div by class as well, it is
.nested a.link
or more simply
.nested .link
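For context, a runnable sketch of the corrected selector against the sample markup above (the URL strings are placeholders from the question):

import bs4

html = """
<div id='container'>
<div class='nested'>
<a href='some url' class='link'></a>
</div>
</div>
<div id='container'>
<div class='nested'>
<a href='some url' class='link'></a>
</div>
</div>
"""

soup = bs4.BeautifulSoup(html, 'html.parser')
# a.link matches an <a> element that itself carries class "link"
links = [a['href'] for a in soup.select('.nested a.link')]
print(links)  # ['some url', 'some url']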
I'm trying to use BeautifulSoup to return elements of the DOM that contain children matching filtering criteria.
In the example below, I want to return both divs based on finding a regex match in a child element.
<body>
<div class="randomclass1">
<span class="randomclass">regexmatch1</span>
<h2>title</h2>
</div>
<div class="randomclass2">
<span class="randomclass">regexmatch2</span>
<h2>title</h2>
</div>
</body>
The basic code setup is as follows:
from bs4 import BeautifulSoup as soup
page = soup(html, 'html.parser')
Results = page.find_all('div')
How do I add a regex test that evaluates the children of the target div? That is, how would I add the regex call below to BeautifulSoup's find or find_all functions?
re.compile(r'regexmatch\d')
The approach I landed on was find_parent, which returns the parent element of a BeautifulSoup result regardless of how the original result was found (regex or otherwise). For the example above:
childOfResults = page.find_all('span', string=re.compile(r'regexmatch\d'))
Results = childOfResults[0].find_parent()
...modified with the loop of your choice to cycle through all the members of childOfResults.
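For example, a minimal sketch of that loop (assuming html holds the markup from the question):

import re
from bs4 import BeautifulSoup as soup

page = soup(html, 'html.parser')
# Find every span whose text matches the pattern, then step up to its parent div
spans = page.find_all('span', string=re.compile(r'regexmatch\d'))
results = [span.find_parent('div') for span in spans]
for div in results:
    print(div['class'])  # ['randomclass1'], then ['randomclass2']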
Select the top-level divs, then run a for loop over them.
Example
from bs4 import BeautifulSoup
html = """<body>
<div class="randomclass1">
<span class="randomclass">regexmatch1</span>
<h2>title</h2>
</div>
<div class="randomclass2">
<span class="randomclass">regexmatch2</span>
<h2>title</h2>
</div>
</body>"""
page_soup = BeautifulSoup(html, features='html.parser')
elements = page_soup.select('body > div')
for element in elements:
    print(element.select("span:nth-child(1)")[0].text)
It prints out:
regexmatch1
regexmatch2
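If you still want the regex filter from the original question, a hedged variant of the same idea (reusing the html string and page_soup from above):

import re

pattern = re.compile(r'regexmatch\d')
# Keep only the divs that contain a span whose text matches the pattern
matching_divs = [div for div in page_soup.select('body > div')
                 if div.find('span', string=pattern)]
print(matching_divs)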
I've been googling through SO for quite some time, but I couldn't find a solution for this one. Sorry if it's a duplicate.
I'm trying to remove all the HTML tags from a snippet, but I don't want to use get_text() because there might be some other tags, like img, that I'd like to use later. BeautifulSoup doesn't quite behave as I expect it to:
from bs4 import BeautifulSoup
html = """
<div>
<div class="somewhat">
<div class="not quite">
</div>
<div class="here">
<blockquote>
<span>
<br />content<br />
</span>
</blockquote>
</div>
<div class="not here either">
</div>
</div>
</div>
"""
soup = BeautifulSoup(html, 'html.parser')
la_lista = []
for x in soup.find_all('div', {"class": "somewhat"}):  # in all the "somewhat" divs
    for y in x.find_all('div', {"class": "here"}):  # find all the "here" divs
        for inp in y.find_all("blockquote"):  # in a "here" div, find all blockquote tags for the relevant content
            for newlines in inp('br'):
                inp.br.replace_with("\n")  # replace br tags
            for link in inp('a'):
                inp.a.unwrap()  # unwrap all a tags
            for quote in inp('span'):
                inp.span.unwrap()  # unwrap all span tags
            for block in inp('blockquote'):
                inp.blockquote.unwrap()  # <----- should unwrap blockquote
            la_lista.append(inp)
print(la_lista)
The result is as follows:
[<blockquote>
content
</blockquote>]
Any ideas?
Each inp returned by y.find_all("blockquote") is a bs4.element.Tag, and calling inp('blockquote') searches only inside that tag; it can never match the tag itself, so the unwrap is never executed.
The solution is to remove:
for block in inp('blockquote'):
    inp.blockquote.unwrap()
and replace:
la_lista.append(inp)
with:
la_lista.append(inp.decode_contents())
This answer is based on the following answer: BeautifulSoup innerhtml.
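Putting it together, a sketch of the fixed innermost loop (same variable names as in the question):

for inp in y.find_all("blockquote"):
    for newlines in inp('br'):
        inp.br.replace_with("\n")  # replace br tags
    for link in inp('a'):
        inp.a.unwrap()  # unwrap all a tags
    for quote in inp('span'):
        inp.span.unwrap()  # unwrap all span tags
    # decode_contents() returns the tag's inner HTML, so the enclosing
    # <blockquote> is dropped without needing to unwrap it
    la_lista.append(inp.decode_contents())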
How can I get all the links of the DOM except those inside a certain div tag?
This is the div I don't want links from:
<div id="yii-debug-toolbar">
<div class="yii-debug-toolbar_bar">
<div class="yii-debug-toolbar_block>
<a>...</a>
</div>
<div class="yii-debug-toolbar_block>
<a>...</a>
</div>
</div>
</div>
I get the links in my code like this:
links = driver.find_elements_by_xpath("//a[@href]")
But I don't want to get the ones from that div, how can I do that?
I'm not sure if there is a simple way to do this with Selenium's XPath capabilities alone. However, a simple solution could be to parse the HTML with something like BeautifulSoup, get rid of all the <div id="yii-debug-toolbar">...</div> elements, and then select the remaining links.
from bs4 import BeautifulSoup
...
soup = BeautifulSoup(wd.page_source, 'html.parser')
for div in soup.find_all("div", {'id': 'yii-debug-toolbar'}):
    div.decompose()
links = soup.find_all('a', href=True)
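That said, XPath itself can express the exclusion with an ancestor check, if you prefer to stay in Selenium (a sketch, assuming the toolbar id is exactly as shown):

# Match anchors with an href that have no yii-debug-toolbar div among their ancestors
links = driver.find_elements_by_xpath(
    "//a[@href and not(ancestor::div[@id='yii-debug-toolbar'])]"
)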
I am using Scrapy's SitemapSpider to pull all product links from their respective collections. My sites are all Shopify stores, and the code that links to the products looks like this:
<div class="grid__item grid-product medium--one-half large--one-third">
<div class="grid-product__wrapper">
<div class="grid-product__image-wrapper">
<a class="grid-product__image-link" href="/collections/accessories/products/black-double-layer-braided-leather-bracelet">
<img src="//cdn.shopify.com/s/files/1/1150/5108/products/product-image_50ce19b1-c700-4a77-9638-e2ac66a3acef_grande.jpg?v=1457310318" alt="Black Double Layer Braided Leather Bracelet" class="grid-product__image">
</a>
</div>
<a href="/collections/accessories/products/black-double-layer-braided-leather-bracelet" class="grid-product__meta">
<span class="grid-product__title">Black Double Layer Braided Leather Bracelet</span>
<span class="grid-product__price-wrap">
<span class="long-dash">—</span>
<span class="grid-product__price">
$ 15
</span>
</span>
</a>
</div>
</div>
Obviously, both hrefs are exactly the same. The problem I'm having is that both links get scraped when using the following code:
product_links = response.xpath('//div//a[contains(@href, "collections") and contains(@href, "products")][1]/@href').extract()
I'm trying to select the div element that has both a tags as descendants. From that, I only want to pull the href from the first a tag to avoid duplicate links.
Although each site runs on Shopify, the source code of their collections pages isn't exactly the same, so the depth of the a tag under the div element is inconsistent and I'm not able to add a predicate like
//div[@class="grid__item grid-product medium--one-half large--one-third"]
product_links = response.xpath('//div//a[contains(@href, "collections") and contains(@href, "products")][1]/@href').extract()
print(product_links[0])  # This is your first a tag
Just use the extract_first() method to extract only the first matched element. The benefit of using it is that it avoids an IndexError and returns None when it doesn't find any element matching the selection.
So it should be:
>>> response.xpath('//div//a[contains(@href, "collections") and contains(@href, "products")]/@href').extract_first()
u'/collections/accessories/products/black-double-layer-braided-leather-bracelet'
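Note that extract_first() returns a single link for the whole page. If the goal is one link per product with the duplicates dropped, a small dedup sketch (assuming the same response object):

seen = set()
product_links = []
for href in response.xpath('//a[contains(@href, "collections") and contains(@href, "products")]/@href').extract():
    if href not in seen:  # keep only the first occurrence of each product URL
        seen.add(href)
        product_links.append(href)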
I am working with Python 2 and I want to get the content of a div in an HTML page.
<div class="lts-txt2">
Some Content
</div>
If the div is like the above, I can get the content using
soup.find_all('div', attrs={"class": 'lts-txt2'})
But if the div is like,
<div class="lts-txt2">
<div align="justify">
Some Content
</div>
</div>
then using
soup.find_all('div', attrs={"class": 'lts-txt2'})
doesn't return the content.
So I tried
soup.find_all('div', attrs={"align": 'justify'})
but that didn't work either.
How can I solve this problem?
You can extract all the text from a node, including nested nodes, with the get_text() method:
[el.get_text() for el in soup.find_all('div', attrs={"class": 'lts-txt2'})]
This produces a list with the textual content of each such div, whether or not there is a nested div inside.
You could also use the select() method with a CSS selector to target the nested div directly:
soup.select('div.lts-txt2 > div')
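Combining the two, a short sketch that pulls just the nested div's text (assuming soup was built from the second HTML snippet):

[el.get_text(strip=True) for el in soup.select('div.lts-txt2 > div')]
# ['Some Content']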