beautifulsoup filtering descendants using regex - python

I'm trying to use BeautifulSoup to return elements of the DOM that contain children matching filtering criteria.
In the example below, I want to return both divs based on finding a regex match in a child element.
<body>
<div class="randomclass1">
<span class="randomclass">regexmatch1</span>
<h2>title</h2>
</div>
<div class="randomclass2">
<span class="randomclass">regexmatch2</span>
<h2>title</h2>
</div>
</body>
The basic code setup is as follows:
from bs4 import BeautifulSoup as soup
page = soup(html, 'html.parser')
Results = page.find_all('div')
How do I add a regex test that evaluates the children of the target div? That is, how would I add the regex below to BeautifulSoup's find or find_all calls?
re.compile(r'regexmatch\d')

The approach I landed on was find_parent, which returns the parent element of a BeautifulSoup result regardless of how that result was found (regex or otherwise). For the example above:
childOfResult = page.find_all('span', string=re.compile(r'regexmatch\d'))
Results = childOfResult[0].find_parent()
...modified with the loop of your choice to cycle through all the members of childOfResult
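With the loop spelled out, a minimal self-contained sketch of that approach (using the question's HTML) could look like this:

```python
import re
from bs4 import BeautifulSoup

html = """<body>
<div class="randomclass1">
<span class="randomclass">regexmatch1</span>
<h2>title</h2>
</div>
<div class="randomclass2">
<span class="randomclass">regexmatch2</span>
<h2>title</h2>
</div>
</body>"""

page = BeautifulSoup(html, 'html.parser')

# Find each span whose text matches the pattern, then climb to its parent div
childOfResult = page.find_all('span', string=re.compile(r'regexmatch\d'))
Results = [child.find_parent() for child in childOfResult]

for div in Results:
    print(div['class'])  # ['randomclass1'], then ['randomclass2']
```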

Select the divs first, then run a for loop over all of them.
Example
from bs4 import BeautifulSoup
html = """<body>
<div class="randomclass1">
<span class="randomclass">regexmatch1</span>
<h2>title</h2>
</div>
<div class="randomclass2">
<span class="randomclass">regexmatch2</span>
<h2>title</h2>
</div>
</body>"""
page_soup = BeautifulSoup(html, features='html.parser')
elements = page_soup.select('body > div')
for element in elements:
    print(element.select("span:nth-child(1)")[0].text)
This prints:
regexmatch1
regexmatch2
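If the original regex requirement still matters, the selected divs can be filtered by whether they contain a matching span. A sketch (the second span's class and text are altered here just to show the filtering):

```python
import re
from bs4 import BeautifulSoup

html = """<body>
<div class="randomclass1">
<span class="randomclass">regexmatch1</span>
<h2>title</h2>
</div>
<div class="randomclass2">
<span class="otherclass">something else</span>
<h2>title</h2>
</div>
</body>"""

page_soup = BeautifulSoup(html, features='html.parser')
pattern = re.compile(r'regexmatch\d')

# Keep only the divs that contain a span whose text matches the pattern
matching_divs = [div for div in page_soup.select('body > div')
                 if div.find('span', string=pattern)]
print([div['class'] for div in matching_divs])  # [['randomclass1']]
```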

Related

Python 3.8 - BeautifulSoup 4 - unwrap() does not remove all tags

I've been googling through SO for quite some time, but I couldn't find a solution for this one. Sorry if it's a duplicate.
I'm trying to remove all the HTML tags from a snippet, but I don't want to use get_text() because there might be some other tags, like img, that I'd like to use later. BeautifulSoup doesn't quite behave as I expect it to:
from bs4 import BeautifulSoup
html = """
<div>
<div class="somewhat">
<div class="not quite">
</div>
<div class="here">
<blockquote>
<span>
<br />content<br />
</span>
</blockquote>
</div>
<div class="not here either">
</div>
</div>
</div>
"""
soup = BeautifulSoup(html, 'html.parser')
la_lista = []
for x in soup.find_all('div', {"class": "somewhat"}):  # in all the "somewhat" divs
    for y in x.find_all('div', {"class": "here"}):  # find all the "here" divs
        for inp in y.find_all("blockquote"):  # in a "here" div find all blockquote tags for the relevant content
            for newlines in inp('br'):
                inp.br.replace_with("\n")  # replace br tags
            for link in inp('a'):
                inp.a.unwrap()  # unwrap all a tags
            for quote in inp('span'):
                inp.span.unwrap()  # unwrap all span tags
            for block in inp('blockquote'):
                inp.blockquote.unwrap()  # <----- should unwrap blockquote
            la_lista.append(inp)
print(la_lista)
The result is as follows:
[<blockquote>
content
</blockquote>]
Any ideas?
Each item returned by y.find_all("blockquote") is a bs4.element.Tag, and inp is the blockquote itself, so inp('blockquote') searches its descendants and finds no nested blockquote to unwrap.
The solution is to remove:
for block in inp('blockquote'):
    inp.blockquote.unwrap()
and replace:
la_lista.append(inp)
with:
la_lista.append(inp.decode_contents())
The answer is based on the following answer BeautifulSoup innerhtml
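Putting both changes together, a runnable sketch (with the HTML trimmed to the relevant divs) could look like this:

```python
from bs4 import BeautifulSoup

html = """
<div class="somewhat">
<div class="here">
<blockquote>
<span>
<br />content<br />
</span>
</blockquote>
</div>
</div>
"""

soup = BeautifulSoup(html, 'html.parser')
la_lista = []
for x in soup.find_all('div', {"class": "somewhat"}):
    for y in x.find_all('div', {"class": "here"}):
        for inp in y.find_all("blockquote"):
            for br in inp('br'):
                br.replace_with('\n')  # replace each <br> with a newline
            for span in inp('span'):
                span.unwrap()          # unwrap <span>, keeping its contents
            # decode_contents() returns the inner HTML as a string,
            # without the <blockquote> wrapper -- no unwrap() needed
            la_lista.append(inp.decode_contents())
print(la_lista)
```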

python iterate multiple tags using beautiful soup

I'm using Python 3, and what I want to do is analyze an HTML page and extract some information from specific tags.
This operation must be done multiple times. To get the HTML page I'm using the beautifulsoup module, and I can fetch the HTML correctly this way:
import urllib.request as req
import bs4
url = 'http://myurl.com'
reqq = req.Request(url, headers={'User-Agent': 'Mozilla/5.0'})
reddit_file = req.urlopen(reqq)
reddit_data = reddit_file.read().decode('utf-8')
soup = bs4.BeautifulSoup(reddit_data, 'lxml')
My HTML structure is the following:
<div class="first_div" id="12345">
<div class="second_div">
<div class="third_div">
<div class="fourth_div">
<div class="fifth_div">
<a id="dealImage" class="checked_div" href="http://www.myurl.com/">
What I want to extract is the href value, i.e. http://www.myurl.com/
I tried using the find() function this way, and it works:
div = soup.find("div", {"class" : "first_div"})
But if I try to find the second div directly:
div = soup.find("div", {"class" : "second_div"})
it returns None.
Thanks
EDIT:
the source html page is the following:
view-source:https://www.amazon.it/gp/goldbox/ref=gbps_ftr_s-5_2d1d_page_1?gb_f_deals1=dealTypes:LIGHTNING_DEAL%252CBEST_DEAL%252CDEAL_OF_THE_DAY,sortOrder:BY_SCORE&pf_rd_p=82dc915a-4dd2-4943-b59f-dbdbc6482d1d&pf_rd_s=slot-5&pf_rd_t=701&pf_rd_i=gb_main&pf_rd_m=A11IL2PNWYJU7H&pf_rd_r=5Q5APCV900GSWS51A6QJ&ie=UTF8
What I have to extract is the href value from the div with class a-row dealContainer dealTile.
find returns only the first Tag matching the given criteria, but findAll extracts a list of all Tag objects that match. You can specify the name of the tag and any attributes you want it to have.
Here, if you want to extract every href, you need a for loop:
divs = soup.findAll("div", {"class" : "first_div"})
for item in divs:
    print(item.find('a').get('href'))
Use a CSS selector, which is much faster.
from bs4 import BeautifulSoup
reddit_data='''<div class="first_div" id="12345">
<div class="second_div">
<div class="third_div">
<div class="fourth_div">
<div class="fifth_div">
<a id="dealImage" class="checked_div" href="http://www.myurl.com/">
</div>
</div>
</div>
</div>
</div>'''
soup = BeautifulSoup(reddit_data, 'lxml')
for item in soup.select(".first_div a[href]"):
    print(item['href'])

Delete block in HTML based on text

I have the HTML snippet below, and I need to delete a block based on its text, for example Name: John. I know I could do this with decompose() from BeautifulSoup using the class name sample, but I cannot apply decompose that way because the blocks have different attributes and even tag names, while the text within follows the same pattern. Is there anything in bs4 that can solve this?
<div id="container">
<div class="sample">
Name:<br>
<b>John</b>
</div>
<div>
result:
<div id="container"><div>
To find tags based on inner text see How to select div by text content using Beautiful Soup?
Once you have the required div, you can simply call decompose():
html = '''<div id="container">
<div class="sample">
Name:<br>
<b>John</b>
</div>
</div>'''
import re
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'html.parser')
sample = soup.find(string=re.compile('Name'))
sample.parent.decompose()
print(soup.prettify())
Side note: notice that I fixed the closing tag for your "container" div!
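Since the question mentions blocks with differing attributes and even tag names but the same text pattern, the same idea extends to every match via find_all. A sketch (the second block's section tag and the name Jane are made up for illustration):

```python
import re
from bs4 import BeautifulSoup

html = '''<div id="container">
<div class="sample">
Name:<br>
<b>John</b>
</div>
<section class="other">
Name:<br>
<b>Jane</b>
</section>
</div>'''

soup = BeautifulSoup(html, 'html.parser')

# Find every text node containing "Name" and remove its enclosing block,
# regardless of that block's tag name or attributes
for text_node in soup.find_all(string=re.compile('Name')):
    text_node.parent.decompose()
print(soup)
```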

How to extract anchor elements nested in multiple division elements

I am trying to extract anchor elements that share a common class attribute from my BeautifulSoup object, each nested in multiple divisions. The divisions are repeated and separated by some scripts.
I have tried to take advantage of the common class attribute on the anchor elements to extract them.
The code I got:
<div id='container'>
<div class='nested'>
<a href='some url' class='link'>
</a>
</div>
</div>
#some scripts ....
<div id='container'>
<div class='nested'>
<a href='some url' class='link'>
</a>
</div>
</div>
What I tried:
import requests, bs4, webbrowser
webpage = requests.get('some url')
webpage.raise_for_status()
soup = bs4.BeautifulSoup(webpage.text, 'html.parser')
links = soup.select('.link a')
for i in range(0, 5):
    webbrowser.open('initial site url' + links[i].get('href'))
print(links)
No tabs were opened. Printing links gave an empty list.
Replace this line:
links = soup.select('.link a')
with:
links = soup.find_all('a', {'class': 'link'})
print(links)
Output:
[<a class="link" href="some url">
</a>, <a class="link" href="some url">
</a>]
To get the href from each a tag:
for link in links:
    href = link['href']
    print(href)
.link a will match all a tags that are descendants of an element with class link. The space in between is the CSS descendant combinator, which means the left-hand side is the ancestor and the right-hand side the descendant. Remove the space to apply both conditions to the same element. Notice that you still need to extract the href attribute from the matched tags.
links = [item['href'] for item in soup.select('a.link')]
If you need to specify the parent div by class then it is
.nested a.link
or more simply
.nested .link
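The difference between these selectors is easy to verify on a small snippet:

```python
from bs4 import BeautifulSoup

html = '''<div class="nested">
<a class="link" href="https://example.com/1"></a>
</div>'''

soup = BeautifulSoup(html, 'html.parser')

# descendant combinator: <a> tags *inside* an element with class "link" -- none here
print(len(soup.select('.link a')))       # 0
# compound selector: <a> tags that themselves have class "link"
print(len(soup.select('a.link')))        # 1
# elements with class "link" inside an element with class "nested"
print(len(soup.select('.nested .link'))) # 1
```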

How to parse HTML to a string template in Python?

I want to parse HTML and turn it into string templates. In the example below, I sought out elements marked with x-inner, and they became template placeholders in the final string. The x-attrsite attribute also became a template placeholder (with a different command, of course).
Input:
<div class="x,y,z" x-attrsite>
<div x-inner></div>
<div>
<div x-inner></div>
</div>
</div>
Desired output:
<div class="x,y,z" {attrsite}>{inner}<div>{inner}</div></div>
I know there are HTMLParser and BeautifulSoup, but I am at a loss on how to extract the strings before and after the x-* markers and how to escape those strings for templating.
Existing curly braces should be handled sanely, as in this sample:
<div x-maybe-highlighted> The template string "there are {n} message{suffix}" can be used.</div>
BeautifulSoup can handle this case:
find all div elements with the x-attrsite attribute, remove that attribute, and add an {attrsite} attribute with a value of None (which produces an attribute with no value)
find all div elements with the x-inner attribute and use replace_with() to replace each element with the text {inner}
Implementation:
from bs4 import BeautifulSoup
data = """
<div class="x,y,z" x-attrsite>
<div x-inner></div>
<div>
<div x-inner></div>
</div>
</div>
"""
soup = BeautifulSoup(data, 'html.parser')
for div in soup.find_all('div', {'x-attrsite': True}):
    del div['x-attrsite']
    div['{attrsite}'] = None
for div in soup.find_all('div', {'x-inner': True}):
    div.replace_with('{inner}')
print(soup.prettify())
Prints:
<div class="x,y,z" {attrsite}>
{inner}
<div>
{inner}
</div>
</div>
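Note that prettify() re-indents the tree, so the printed result is not the compact one-line template from the question. One way to get that exact string (a sketch, assuming all whitespace around the line breaks is insignificant) is to serialize with str() and strip newline-adjacent whitespace:

```python
import re
from bs4 import BeautifulSoup

data = """
<div class="x,y,z" x-attrsite>
<div x-inner></div>
<div>
<div x-inner></div>
</div>
</div>
"""

soup = BeautifulSoup(data, 'html.parser')
for div in soup.find_all('div', {'x-attrsite': True}):
    del div['x-attrsite']
    div['{attrsite}'] = None  # None serializes as a valueless attribute
for div in soup.find_all('div', {'x-inner': True}):
    div.replace_with('{inner}')

# str() keeps the source's newlines; collapse them to get the one-line template
compact = re.sub(r'\s*\n\s*', '', str(soup))
print(compact)
```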
