I have this small snippet of HTML that I pulled using Selenium & BeautifulSoup.
<footer>
<p class="tags environment-tags">Environment:
<span class="tag environment-tag">Desert</span>
</p>
<p class="source monster-source">Basic Rules
<span class="page-number">, pg. 334</span>
</p>
</footer>
I am trying to grab the text from just the p elements, but every time I try, it grabs the span as well. So far this is what I have tried:
for p in Environment.findAll('p'):
    print(p.text)
I have also tried to extract the information using .extract() but that doesn't seem to work for me.
You can use .contents and access the 0th element:
for tag in soup.find_all("p"):
    print(tag.contents[0].strip())
Output:
Environment:
Basic Rules
Or, with your attempt, you can remove the <span>s using .extract():
for tag in soup.select("p span"):
    tag.extract()

print(soup.prettify())
Output:
<footer>
<p class="tags environment-tags">
Environment:
</p>
<p class="source monster-source">
Basic Rules
</p>
</footer>
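Another option, sketched here against the same snippet, is to ask each <p> for only its direct strings with find(string=True, recursive=False), so the nested <span> text never comes back:

```python
from bs4 import BeautifulSoup

html = """
<footer>
<p class="tags environment-tags">Environment:
<span class="tag environment-tag">Desert</span>
</p>
<p class="source monster-source">Basic Rules
<span class="page-number">, pg. 334</span>
</p>
</footer>
"""

soup = BeautifulSoup(html, "html.parser")
for p in soup.find_all("p"):
    # recursive=False restricts the search to direct children,
    # so only the text node owned by the <p> itself is returned
    print(p.find(string=True, recursive=False).strip())
```

This prints Environment: and Basic Rules, the same result as the .contents[0] approach, but it keeps working even if the leading text node isn't the first child.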
This is my first time using BeautifulSoup as a scraping tool, and I'm just following along slowly, step by step.
I've used soup.find_all("div", class_="product-box__inner") to find the list of elements I want, but this particular part isn't clicking for me right now. My question is below.
Here is the HTML; my target is "$0". I have tried
element.find("span", title=re.compile("$")), and I can't use element.select("dt > dd > span > span") because there are multiple elements with the same tag structure that I don't need at all. Is there a way I can target span data-fees-annual-value="" to get .text working?
<div class="product-box__features-item">
<dt class="f-body-3 product-box__features-label">Annual fee</dt>
<dd class="f-title-5 product-box__features-text u-margin-0">
<span>
<span data-fees-annual-value="">$0</span>
</span>
</dd>
</div>
You are close to your goal with CSS selectors; they can be made more specific by referencing the attribute data-fees-annual-value directly:
soup.select_one('span[data-fees-annual-value]').text
Example
from bs4 import BeautifulSoup
html="""
<div class="product-box__features-item">
<dt class="f-body-3 product-box__features-label">Annual fee</dt>
<dd class="f-title-5 product-box__features-text u-margin-0">
<span>
<span data-fees-annual-value="">$0</span>
</span>
</dd>
</div>
"""
soup = BeautifulSoup(html, "html.parser")
print(soup.select_one('span[data-fees-annual-value]').text)
Output
$0
If you want to find the element by its text, use string instead of title, and escape the $, since it is a regex metacharacter:
element.find("span", string=re.compile(r'\$'))
Output:
<span data-fees-annual-value="">$0</span>
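If you prefer find over CSS selectors, the same attribute can be matched with an attrs dict; passing True matches any element that has the attribute, even when its value is the empty string. A sketch using the HTML from the question:

```python
from bs4 import BeautifulSoup

html = """
<div class="product-box__features-item">
<dt class="f-body-3 product-box__features-label">Annual fee</dt>
<dd class="f-title-5 product-box__features-text u-margin-0">
<span>
<span data-fees-annual-value="">$0</span>
</span>
</dd>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
# True means "the attribute is present", regardless of its value
fee = soup.find("span", attrs={"data-fees-annual-value": True})
print(fee.text)
```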
I'm working on a new project and I have run into an issue.
My problem looks like this:
<div class="news">
<p class="breaking"> </p>
...
<p> i need to pull here. </p>
but the class="breaking" element keeps getting in the way. I want to ignore the <p> with class "breaking" and pull the other <p>.
Maybe class_='' would do, with find_all or findAll:
from bs4 import BeautifulSoup
html = """
<div class="news">
<p class="breaking"> </p>
...
<p> i need to pull here. </p>
"""
soup = BeautifulSoup(html, 'html.parser')
print(soup.find_all('p', class_=''))
print(soup.findAll(True, {'class': ''}))
Output
[<p> i need to pull here. </p>]
[<p> i need to pull here. </p>]
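Recent BeautifulSoup versions (4.7+, which bundle soupsieve) also understand the :not() CSS pseudo-class, so you can express "every <p> except class breaking" directly. A sketch under that assumption:

```python
from bs4 import BeautifulSoup

html = """
<div class="news">
<p class="breaking"> </p>
...
<p> i need to pull here. </p>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
# :not(.breaking) keeps every <p> except those carrying that class
for p in soup.select("p:not(.breaking)"):
    print(p.text.strip())
```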
I want to wrap the content of a lot of div-elements/blocks with p tags:
<div class='value'>
some content
</div>
It should become:
<div class='value'>
<p>
some content
</p>
</div>
My idea was to get the content (using bs4) by filtering strings with find_all and then wrapping it with the new tag. I don't know if that works; I can't filter content from tags with specific attributes/values.
I could do this with regex instead of bs4, but I'd like to do all the transformations (there are a few more besides this one) in bs4.
Believe it or not, you can use wrap. :-)
Because you might, or might not, want to wrap inner div elements, I altered your HTML a little so I could show how to change an inner div without touching the one outside it. You will see how to alter all divs, I'm sure.
Here's how.
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(open('pjoern.htm').read(), 'lxml')
>>> inner_div = soup.findAll('div')[1]
>>> inner_div
<div>
some content
</div>
>>> inner_div.contents[0].wrap(soup.new_tag('p'))
<p>
some content
</p>
>>> print(soup.prettify())
<html>
<body>
<div class="value">
<div>
<p>
some content
</p>
</div>
</div>
</body>
</html>
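To apply the same idea to every <div class='value'> in a document rather than a single one, a minimal sketch:

```python
from bs4 import BeautifulSoup

html = "<div class='value'>some content</div><div class='value'>more content</div>"
soup = BeautifulSoup(html, "html.parser")

for div in soup.find_all("div", class_="value"):
    # wrap the first child (the text node) of each div in a new <p> tag
    div.contents[0].wrap(soup.new_tag("p"))

print(soup)
```

Each div's text node ends up inside a fresh <p>, while the divs themselves are left untouched.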
I am using Scrapy's SitemapSpider to pull all product links from their respective collections. My list of sites are all Shopify stores, and the code that links to the products looks like this:
<div class="grid__item grid-product medium--one-half large--one-third">
<div class="grid-product__wrapper">
<div class="grid-product__image-wrapper">
<a class="grid-product__image-link" href="/collections/accessories/products/black-double-layer-braided-leather-bracelet">
<img src="//cdn.shopify.com/s/files/1/1150/5108/products/product-image_50ce19b1-c700-4a77-9638-e2ac66a3acef_grande.jpg?v=1457310318" alt="Black Double Layer Braided Leather Bracelet" class="grid-product__image">
</a>
</div>
<a href="/collections/accessories/products/black-double-layer-braided-leather-bracelet" class="grid-product__meta">
<span class="grid-product__title">Black Double Layer Braided Leather Bracelet</span>
<span class="grid-product__price-wrap">
<span class="long-dash">—</span>
<span class="grid-product__price">
$ 15
</span>
</span>
</a>
</div>
</div>
Obviously, both hrefs are exactly the same. The problem I'm having is that both links get scraped when I use the following code:
product_links = response.xpath('//div//a[contains(@href, "collections") and contains(@href, "products")][1]/@href').extract()
I'm trying to select the div element that has both a tags as descendants. From that, I only want to pull the href from the first a tag to avoid duplicate links.
Although each site is built on Shopify, the source code of their collections pages isn't exactly the same. So the depth of the a tag under the div element is inconsistent, and I'm not able to add a predicate like
//div[#class="grid__item grid-product medium--one-half large--one-third"]
product_links = response.xpath('//div//a[contains(@href, "collections") and contains(@href, "products")][1]/@href').extract()
print(product_links[0])  # This is your first a tag
Just use the extract_first() method to extract only the first matched element. The benefit of this is that it avoids an IndexError and returns None when it doesn't find any element matching the selection.
So, it should be :
>>> response.xpath('//div//a[contains(@href, "collections") and contains(@href, "products")]/@href').extract_first()
u'/collections/accessories/products/black-double-layer-braided-leather-bracelet'
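If you do need every unique product link rather than only the first match, one approach (a plain-Python sketch, independent of Scrapy) is to deduplicate the extracted list while preserving order:

```python
# hrefs as they might come back from response.xpath(...).extract();
# each product appears twice (image link and meta link)
hrefs = [
    "/collections/accessories/products/black-double-layer-braided-leather-bracelet",
    "/collections/accessories/products/black-double-layer-braided-leather-bracelet",
    "/collections/accessories/products/another-item",
]

seen = set()
unique_links = []
for href in hrefs:
    if href not in seen:      # keep only the first occurrence
        seen.add(href)
        unique_links.append(href)

print(unique_links)
```

This sidesteps the markup-depth problem entirely: grab every matching href and let the set do the filtering.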
I have a website that has plenty of hidden tags in the html.
I have pasted the source code below.
The challenge is that there are 2 types of hidden tags:
1. Ones with style="display:none"
2. Ones hidden via a list of styles declared inside every td tag, and the class names change with every td tag.
For the example below it has the following styles:
.hLcj{display:none}
.J9pE{display:inline}
.kUC-{display:none}
.Dzkb{display:inline}
.mXJU{display:none}
.DZqk{display:inline}
.rr9s{display:none}
.nGF_{display:inline}
So the elements with class hLcj, kUC-, mXJU, rr9s, etc. are hidden elements.
I want to extract the text of the entire tr but exclude these hidden tags.
I have been scratching my head for hours with no success.
Any help would be much appreciated. Thanks.
I am using bs4 and Python 2.7.
<td class="leftborder timestamp" rel="1416853322">
<td>
<span>
<style>
.hLcj{display:none}
.J9pE{display:inline}
.kUC-{display:none}
.Dzkb{display:inline}
.mXJU{display:none}
.DZqk{display:inline}
.rr9s{display:none}
.nGF_{display:inline}
</style>
<span class="rr9s">35</span>
<span></span>
<div style="display:none">121</div>
<span class="226">199</span>
.
<span class="rr9s">116</span>
<div style="display:none">116</div>
<span></span>
<span class="Dzkb">200</span>
<span style="display: inline">.</span>
<span style="display:none">86</span>
<span class="kUC-">86</span>
<span></span>
120
<span class="kUC-">134</span>
<div style="display:none">134</div>
<span class="mXJU">151</span>
<div style="display:none">151</div>
<span class="rr9s">154</span>
<span class="Dzkb">.</span>
<span class="119">36</span>
<span class="kUC-">157</span>
<div style="display:none">157</div>
<span class="rr9s">249</span>
<div style="display:none">249</div>
</span>
</td>
<td> 7808</td>
Using Selenium would make the task much easier, since it knows which elements are hidden and which aren't.
But, anyway, here's basic code that you will probably need to improve further. The idea is to parse the style tag to get the list of classes to exclude, keep a list of tag names to exclude, and check the style attribute of each child element in the tr:
import re
from bs4 import BeautifulSoup
data = """ your html here """
soup = BeautifulSoup(data, 'html.parser')
tr = soup.tr
# get classes to exclude
classes_to_exclude = []
for line in tr.style.text.split():
    match = re.match(r'^\.(.*?)\{display:none\}', line)
    if match:
        classes_to_exclude.append(match.group(1))
tags_to_exclude = ['style', 'script']
texts = []
for item in tr.find_all(text=True):
    if item.parent.name in tags_to_exclude:
        continue
    class_ = item.parent.get('class')
    if class_ and class_[0] in classes_to_exclude:
        continue
    if item.parent.get('style') == 'display:none':
        continue
    texts.append(item)

print ''.join(item.strip() for item in texts)
Prints:
199.200.120.36
Also see:
BeautifulSoup Grab Visible Webpage Text
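For anyone on Python 3, the same approach can be sketched like this (a trimmed-down version of the question's markup; the regex and the three exclusion checks mirror the answer above):

```python
import re
from bs4 import BeautifulSoup

html = """
<table><tr><td><span>
<style>
.rr9s{display:none}
.Dzkb{display:inline}
</style>
<span class="rr9s">35</span>
<span class="Dzkb">200</span>
<span style="display:none">86</span>
120
</span></td></tr></table>
"""

soup = BeautifulSoup(html, "html.parser")
tr = soup.tr

# class names declared as display:none in the <style> block
hidden_classes = set(re.findall(r'\.(\S+)\{display:none\}', tr.style.get_text()))

texts = []
for item in tr.find_all(string=True):
    parent = item.parent
    if parent.name in ("style", "script"):
        continue
    classes = parent.get("class") or []
    if any(c in hidden_classes for c in classes):
        continue
    if parent.get("style") == "display:none":
        continue
    texts.append(item.strip())

print("".join(texts))
```

Only the text from visible elements ("200" and "120") survives; the class-hidden and inline-hidden values are dropped.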