Parsing HTML with BeautifulSoup: class == AND title CONTAINS

I am trying to parse the following HTML:
<div class="product-details">
<h4 class="title" >Blue - Standard</h4>
<a class="learn-more" data-test-selector="linkViewMoreDetails" href="https://productwebpage.com">Learn More</a>
<div class="tocart" <a class="" href="/store/addtocartplp?productId=3593" id="AddToCartSimple-3593">Add To Cart</a></div>
</div>
,
<div class="product-details">
<h4 class="title" >Blue - Wide</h4>
<a class="learn-more" data-test-selector="linkViewMoreDetails" href="https://productwebpage.com">Learn More</a>
<div class="tocart" <a class="disAddtoCardBtn" href="javascript:void(0)" id="AddToCartSimple-3576" >SOLD</a></div>
</div>
I am trying to get the 'id' where the title contains 'Blue' AND the item is not sold.
I have tried:
soup.find_all("a",href=re.compile("Blue"),class_="")
links = soup.find_all("a", href=re.compile("Blue", "Add To Cart"))
ids = [tag["id"] for tag in soup.find_all("a", href=re.compile("Blue"))]
But it is not returning the info I'm looking for.
I would like it to return:
AddToCartSimple-3593

I think your HTML is corrupted (the opening div class="tocart" tags are never closed with >). You can do the entire filtering with CSS selectors, using :has, :not, and :contains (named :-soup-contains in recent soupsieve releases) together with attribute = value selectors. The ^= operator means "starts with": the attribute value must begin with the string after the =. The ~ is the general sibling combinator and > is the child combinator. So the selector below starts from a .title element that has a descendant whose title attribute begins with "Blue -", looks for a following sibling with class tocart, and then takes a child whose id starts with AddToCartSimple- and whose text does not contain SOLD. Because :contains is a substring match, this is less strict than an exact != "SOLD" comparison; which one you want depends on the variation in your actual data.
from bs4 import BeautifulSoup as bs
html ='''
<div class="product-details">
<h4 class="title">Blue - Standard</h4> <a class="learn-more" data-test-selector="linkViewMoreDetails" href="https://productwebpage.com">Learn More</a>
<div class="tocart"> <a class="" href="/store/addtocartplp?productId=3593" id="AddToCartSimple-3593">Add To Cart</a>
</div>
<div class="product-details">
<h4 class="title">Blue - Wide</h4> <a class="learn-more" data-test-selector="linkViewMoreDetails" href="https://productwebpage.com">Learn More</a>
<div class="tocart"> <a class="disAddtoCardBtn" href="javascript:void(0)" id="AddToCartSimple-3576">SOLD</a>
</div>
'''
soup = bs(html, 'html.parser')
print(soup.select_one('.title:has([title^="Blue -"]) ~ .tocart > [id^=AddToCartSimple-]:not(:contains("SOLD"))')['id'])
You should check there was a match before accessing with ['id'] of course. You could also go for all matches as follows:
[i['id'] for i in soup.select('.title:has([title^="Blue -"]) ~ .tocart > [id^=AddToCartSimple-]:not(:contains("SOLD"))')]
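The match check mentioned above could look roughly like the sketch below. It is an illustration under assumptions, not the answer's exact selector: in the markup as posted the h4 has no child with a title attribute, so this version matches on the h4 text with :-soup-contains instead of :has([title^=...]), and it assumes soupsieve 2.1 or later. The trimmed HTML and the variable names are made up for the example.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the question's markup, with the tocart divs closed properly.
html = """
<div class="product-details">
  <h4 class="title">Blue - Standard</h4>
  <div class="tocart"><a class="" href="/store/addtocartplp?productId=3593" id="AddToCartSimple-3593">Add To Cart</a></div>
</div>
<div class="product-details">
  <h4 class="title">Blue - Wide</h4>
  <div class="tocart"><a class="disAddtoCardBtn" href="javascript:void(0)" id="AddToCartSimple-3576">SOLD</a></div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Assumption: the title is plain text inside the h4, so match on text.
selector = (
    '.title:-soup-contains("Blue") ~ .tocart > '
    '[id^="AddToCartSimple-"]:not(:-soup-contains("SOLD"))'
)

# select_one returns None when nothing matches, so test before indexing.
match = soup.select_one(selector)
available_id = match["id"] if match is not None else None
print(available_id)

# All matches; this is simply an empty list when nothing matches.
all_ids = [tag["id"] for tag in soup.select(selector)]
print(all_ids)
```

With select() no guard is needed, since an empty result just yields an empty list.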

To get the data where the "title" contains "Blue" and the item is not "SOLD":
Use the CSS selector .product-details > h4 a[title*='Blue'], which selects every a whose title attribute contains Blue, inside an h4 that is a direct child of an element with class product-details.
Find the next div using the find_next() method, and check that the text is not "SOLD".
Print the next div's id
from bs4 import BeautifulSoup
html = """<div class="product-details">
<h4 class="title" >Blue - Standard</h4>
<a class="learn-more" data-test-selector="linkViewMoreDetails" href="https://productwebpage.com">Learn More</a>
<div class="tocart" <a class="" href="/store/addtocartplp?productId=3593" id="AddToCartSimple-3593">Add To Cart</a></div>
</div>
<div class="product-details">
<h4 class="title" >Blue - Wide</h4>
<a class="learn-more" data-test-selector="linkViewMoreDetails" href="https://productwebpage.com">Learn More</a>
<div class="tocart" <a class="disAddtoCardBtn" href="javascript:void(0)" id="AddToCartSimple-3576" >SOLD</a></div>
</div>"""
soup = BeautifulSoup(html, "html.parser")
for tag in soup.select(".product-details > h4 a[title*='Blue']"):
    if tag.find_next("div").text != "SOLD":
        print(tag.find_next("div")["id"])
Output:
AddToCartSimple-3593

Related

BeautifulSoup Returns empty list which leads to an IndexError in my Python code

I am trying to do web scraping using BeautifulSoup. The code I have written is below:
import requests
from bs4 import BeautifulSoup
response = requests.get("https://stackoverflow.com/questions")
soup = BeautifulSoup(response.text, "html.parser")
questions = soup.select(".question-summary")
print(type(questions[0]))
When I run the code, I get the error message below:
print(type(questions[10]))
IndexError: list index out of range
Then I tried printing the list itself:
import requests
from bs4 import BeautifulSoup
response = requests.get("https://stackoverflow.com/questions")
soup = BeautifulSoup(response.text, "html.parser")
questions = soup.select(".question-summary")
print(questions)
And then I get an empty list: []
What am I doing wrong?
Thanks for your answers.

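.question-summary is the wrong selector: it looks for a class, but question-summary is only the prefix of each element's id (every id value starts with question-summary). Use an attribute selector that matches ids starting with that prefix, and it works: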
import requests
from bs4 import BeautifulSoup
response = requests.get("https://stackoverflow.com/questions")
soup = BeautifulSoup(response.text, "html.parser")
questions = soup.select('[id^="question-summary"]')
print(questions)
Output:
1" data-post-type-id="1" id="question-summary-71715531">
<div class="s-post-summary--stats js-post-summary-stats">
<div class="s-post-summary--stats-item s-post-summary--stats-item__emphasized" title="Score of 0">
<span class="s-post-summary--stats-item-number">0</span>
<span class="s-post-summary--stats-item-unit">votes</span>
</div>
<div class="s-post-summary--stats-item" title="0 answers">
<span class="s-post-summary--stats-item-number">0</span>
<span class="s-post-summary--stats-item-unit">answers</span>
</div>
<div class="s-post-summary--stats-item" title="5 views">
<span class="s-post-summary--stats-item-number">5</span>
<span class="s-post-summary--stats-item-unit">views</span>
</div>
</div>
<div class="s-post-summary--content">
<h3 class="s-post-summary--content-title">
<a class="s-link" href="/questions/71715531/is-it-possible-to-draw-a-logistic-regression-graph-with-multiple-x-variable">Is it possible to draw a
logistic regression graph with multiple x variable?</a>
</h3>
<div class="s-post-summary--content-excerpt">
Currently, this is my X and V value. May I know is it possible to draw a logistic regression curve with X that has multiple column? Or I am required to draw multiple graphs to do so?
X = df1.drop(['...
</div>
<div class="s-post-summary--meta">
<div class="s-post-summary--meta-tags tags js-tags t-python-3ûx t-machine-learning">
<a class="post-tag flex--item mt0 js-tagname-python-3ûx" href="/questions/tagged/python-3.x" rel="tag" title="show questions tagged 'python-3.x'">python-3.x</a> <a class="post-tag flex--item mt0 js-tagname-machine-learning" href="/questions/tagged/machine-learning" rel="tag" title="show questions tagged 'machine-learning'">machine-learning</a>
</div>
<div class="s-user-card s-user-card__minimal">
<a class="s-avatar s-avatar__16 s-user-card--avatar" href="/users/14128881/christopher-chua"> <div class="gravatar-wrapper-16" data-user-id="14128881">
<img ,="" alt="user avatar" class="s-avatar--image" height="16" src="https://lh6.googleusercontent.com/-Sn3B_E5hiJc/AAAAAAAAAAI/AAAAAAAAAAA/AMZuucl1oyfdhJiXhrx73JLYqzKAK9icag/photo.jpg?sz=32" width="16"/>
</div>
</a>
<div class="s-user-card--info">
<div class="s-user-card--link d-flex gs4">
<a class="flex--item" href="/users/14128881/christopher-chua">Christopher Chua</a>
</div>
<ul class="s-user-card--awards">
<li class="s-user-card--rep"><span class="todo-no-class-here" dir="ltr" title="reputation score ">7</span></li>
</ul>
</div>
<time class="s-user-card--time">asked <span class="relativetime" title="2022-04-02 07:03:06Z">13 mins ago</span></time>
.. so on
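To see the difference without hitting the network, here is a small sketch on a made-up stand-in for the page. The two div elements and their class are invented for illustration and are not Stack Overflow's real markup. It also shows one way to avoid the IndexError from the question by checking the list before indexing.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: ids start with "question-summary", but no element
# carries a class named "question-summary".
html = """
<div id="question-summary-1" class="s-post-summary">First</div>
<div id="question-summary-2" class="s-post-summary">Second</div>
"""
soup = BeautifulSoup(html, "html.parser")

by_class = soup.select(".question-summary")             # class selector: no match
by_id_prefix = soup.select('[id^="question-summary"]')  # id starts-with: matches

print(len(by_class), len(by_id_prefix))

# Guard against an empty result before indexing into it.
first = by_id_prefix[0] if by_id_prefix else None
print(first["id"] if first is not None else "no match")
```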

Find text between specific id beautifulsoup

I've an html like the following example:
<a class="anchor" id="category-1"></a>
<h2 class="text-muted">First Category</h2>
<div class="row">
<a class="anchor-entry" id="cat1-first-id"></a>
<div class="col-lg-10">
<h3>First H3 Title</h3>
</div>
<a class="anchor-entry" id="cat1-second-id"></a>
<div class="col-lg-10">
<h3>Second H3 Title</h3>
</div>
<div class="row">
<a class="anchor-entry" id="cat1-third-id"></a>
<div class="col-lg-10">
<h3>Third H3 Title</h3>
</div>
</div>
</div>
<a class="anchor" id="category-2"></a>
<h2 class="text-muted">Second Category</h2>
<div class="row">
<a class="anchor-entry" id="cat2-first-id"></a>
<div class="col-lg-10">
<h3>First H3 Title</h3>
</div>
<div class="row">
<a class="anchor-entry" id="cat2-second-id"></a>
<div class="col-lg-10">
<h3>Second H3 Title</h3>
</div>
</div>
</div>
<a class="anchor" id="category-3"></a>
<h2 class="text-muted">Third Category</h2>
<div class="row">
<a class="anchor-entry" id="cat3-first-id"></a>
<div class="col-lg-10">
<h3>Cat-3 First H3 Title</h3>
</div>
<div class="row">
<a class="anchor-entry" id="cat3-second-id"></a>
<div class="col-lg-10">
<h3>Cat-3 Second H3 Title</h3>
</div>
<div class="row">
<a class="anchor-entry" id="cat3-third-id"></a>
<div class="col-lg-10">
<h3>Cat-3 Third H3 Title</h3>
</div>
</div>
</div>
</div>
so there are some blocks not within any div, but contained between a with the specific id.
I've the list of every id I need (category-1, category-2) and I would like to get in a python object (dict, dataframe, whatever) all the h3 text for each category:
d = {
    'category-1': ['Cat-1 First H3 Title', 'Cat-1 Second H3 Title', 'Cat-1 Third H3 Title'],
    'category-2': ['Cat-2 First H3 Title', 'Cat-2 Second H3 Title']
}
The problem is that I didn't find any method to get in between information:
import requests
from bs4 import BeautifulSoup
url = 'myUrl'
page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')
category_list = ['category-1', 'category-2']
for i in category_list:
    # list like: [<a class="anchor" id="category-1"></a>]
    catid = soup.find_all(id=i)
    # long list like: [<a class="anchor-entry" id="cat1-first-id"></a>, ...]
    cata = soup.find_all('a', {'class': 'anchor-entry'})
But catid and cata aren't linked and I stopped here.

Your code will only select a tags with class anchor-entry.
category_list = ['category-1', 'category-2', 'category-3']
category_tags = soup.find_all("a", {"class": "anchor"})
d = {}
for i in category_list:
    tag = soup.find("a", {"id": i}).find_next()
    while tag not in category_tags:
        tag = tag.find_next()
        if tag is None: break
        if tag.name == "h3":
            if d.get(i): d[i].append(tag.text)
            else: d[i] = [tag.text]
My approach is to traverse the html tree, get h3 headers and store them in d until another category-id is found.
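The same traversal can also be sketched with find_all_next(), which walks the document in order starting from the category anchor. This is only an illustration, not the answerer's code: the shortened HTML and the function name h3_by_category are made up for the example.

```python
from bs4 import BeautifulSoup

# Shortened, hypothetical version of the question's markup.
html = """
<a class="anchor" id="category-1"></a>
<h2>First Category</h2>
<div class="row">
  <a class="anchor-entry" id="cat1-first-id"></a>
  <div><h3>Cat-1 First</h3></div>
  <a class="anchor-entry" id="cat1-second-id"></a>
  <div><h3>Cat-1 Second</h3></div>
</div>
<a class="anchor" id="category-2"></a>
<h2>Second Category</h2>
<div class="row">
  <div><h3>Cat-2 First</h3></div>
</div>
"""
soup = BeautifulSoup(html, "html.parser")


def h3_by_category(soup, category_ids):
    """Collect h3 texts that follow each category anchor, up to the next anchor."""
    result = {}
    for cat_id in category_ids:
        anchor = soup.find("a", id=cat_id)
        if anchor is None:
            continue  # unknown id: skip instead of raising
        titles = []
        # find_all_next(True) yields every later tag in document order.
        for tag in anchor.find_all_next(True):
            # Stop at the next category anchor (class "anchor", not "anchor-entry").
            if tag.name == "a" and "anchor" in tag.get("class", []):
                break
            if tag.name == "h3":
                titles.append(tag.get_text(strip=True))
        result[cat_id] = titles
    return result


categories = h3_by_category(soup, ["category-1", "category-2"])
print(categories)
```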

BeautifulSoup: get multiple elements for all img in a div with specific class

I am trying to get the links in image-file attribute (relative link as it is) in img tags under div with id previewImages (I don't want the src link).
Here is the sample HTML:
<div id="previewImages">
<div class="thumb"> <a><img src="https://example.com/s/15.jpg" image-file="/image/15.jpg" /></a> </div>
<div class="thumb"> <a><img src="https://example.com/s/2.jpg" image-file="/image/2.jpg" /> </a> </div>
<div class="thumb"> <a><img src="https://example.com/s/0.jpg" image-file="/image/0.jpg" /> </a> </div>
<div class="thumb"> <a><img src="https://example.com/s/3.jpg" image-file="/image/3.jpg" /> </a> </div>
<div class="thumb"> <a><img src="https://example.com/s/4.jpg" image-file="/image/4.jpg" /> </a> </div>
</div>
I tried the following but it only gives me the first link and not all:
import sys
import urllib2
from bs4 import BeautifulSoup
quote_page = sys.argv[1] # this should be the first argument on the command line
page = urllib2.urlopen(quote_page)
soup = BeautifulSoup(page, 'html.parser')
images_box = soup.find('div', attrs={'id': 'previewImages'})
if images_box.find('img'):
    imagesurl = images_box.find('img').get('image-file')
    print imagesurl
How can I get all the links in the image-file attribute for img tags in the div with id previewImages?

Use .findAll (the older spelling of .find_all()), which returns every match instead of only the first. For example:
from bs4 import BeautifulSoup
html = """<div id="previewImages">
<div class="thumb"> <a><img src="https://example.com/s/15.jpg" image-file="/image/15.jpg" /></a> </div>
<div class="thumb"> <a><img src="https://example.com/s/2.jpg" image-file="/image/2.jpg" /> </a> </div>
<div class="thumb"> <a><img src="https://example.com/s/0.jpg" image-file="/image/0.jpg" /> </a> </div>
<div class="thumb"> <a><img src="https://example.com/s/3.jpg" image-file="/image/3.jpg" /> </a> </div>
<div class="thumb"> <a><img src="https://example.com/s/4.jpg" image-file="/image/4.jpg" /> </a> </div>
</div>"""
soup = BeautifulSoup(html, "html.parser")
images_box = soup.find('div', attrs={'id': 'previewImages'})
for link in images_box.findAll("img"):
    print(link.get('image-file'))
Output:
/image/15.jpg
/image/2.jpg
/image/0.jpg
/image/3.jpg
/image/4.jpg

I think it is faster to pass an id selector combined with an attribute selector to select:
from bs4 import BeautifulSoup as bs
html = '''
<div id="previewImages">
<div class="thumb"> <a><img src="https://example.com/s/15.jpg" image-file="/image/15.jpg" /></a> </div>
<div class="thumb"> <a><img src="https://example.com/s/2.jpg" image-file="/image/2.jpg" /> </a> </div>
<div class="thumb"> <a><img src="https://example.com/s/0.jpg" image-file="/image/0.jpg" /> </a> </div>
<div class="thumb"> <a><img src="https://example.com/s/3.jpg" image-file="/image/3.jpg" /> </a> </div>
<div class="thumb"> <a><img src="https://example.com/s/4.jpg" image-file="/image/4.jpg" /> </a> </div>
</div>
'''
soup = bs(html, 'lxml')
links = [item['image-file'] for item in soup.select('#previewImages [image-file]')]
print(links)

BeautifulSoup has method .find_all() - check the docs. This is how you can use it in your code:
import sys
import urllib2
from bs4 import BeautifulSoup
quote_page = sys.argv[1] # this should be the first argument on the command line
page = urllib2.urlopen(quote_page)
soup = BeautifulSoup(page, 'html.parser')
images_box = soup.find('div', attrs={'id': 'previewImages'})
links = [img['image-file'] for img in images_box('img')]
print links # in Python 3: print(links)

To add to the above, here is the same scenario with lxml, assuming sample holds the HTML string:
import lxml.html
tree = lxml.html.fromstring(sample)
images = tree.xpath("//img/@image-file")  # @ selects the attribute value
print(images)
Output
['/image/15.jpg', '/image/2.jpg', '/image/0.jpg', '/image/3.jpg', '/image/4.jpg']

BeautifulSoup: find all tags before stopping condition is met

I'm trying to extract a class tag from an HTML file, but only if it is located before a given stopping point. What I have is:
import requests
from bs4 import BeautifulSoup
page = requests.get("https://mysite")
soup = BeautifulSoup(page.content, 'html.parser')
class_extr = soup.find_all("span", class_="myclass")
This works, but it finds all instances of myclass, and I only want those that appear before the following block in the soup:
<h4 class="cat-title" id="55">
Title text N1
<small>
Title text N2.
</small>
</h4>
The thing that makes this block unique are the Title text N lines, especially the Title text N2. line. There are many cat-title tags before, so I can't use that as a stopping condition.
The code surrounding this block looks like this:
...
<div class="myc">
<a class="bbb" href="linkhere_893">
<span class="myclass">Text893</span>
<img data-lazy="https://link893.jpg"/>
</a>
</div>
<div class="myc">
<a class="bbb" href="linkhere_96">
<span class="myclass">Text96</span>
<img data-lazy="https://link96.jpg"/>
</a>
</div>
</div><!-- This closes a list that starts above -->
<h4 class="cat-title" id="55">Title text N1 <small> Title text N2.</small></h4>
<div class="list" id="55">
<div class="myc">
<a class="bbb" href="linkhere_34">
<span class="myclass">Text34</span>
<img data-lazy="https://link34.jpg"/>
</a>
</div>
<div class="myc">
...
continuing both above and below.
How can I do this?

Try using find_all_previous():
import requests
from bs4 import BeautifulSoup
page = requests.get("https://mysite")
soup = BeautifulSoup(page.content, 'html.parser')
stop_at = soup.find("h4", class_="cat-title", id='55') # finds your stop tag
class_extr = stop_at.find_all_previous("span", class_="myclass")
This will stop at the first <h4 class='cat-title', id=55> tag in the event that there are multiple.
Reference: Beautiful Soup Documentation
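One detail worth knowing: find_all_previous() walks backwards from the stop tag, so the results come back nearest-first, that is, in reverse document order. A small sketch on a cut-down, made-up version of the question's markup shows this and restores the original order with reversed():

```python
from bs4 import BeautifulSoup

# Cut-down, hypothetical version of the markup from the question.
page = """
<div><span class="myclass">Text893</span></div>
<div><span class="myclass">Text96</span></div>
<h4 class="cat-title" id="55">Title text N1 <small>Title text N2.</small></h4>
<div><span class="myclass">Text34</span></div>
"""
soup = BeautifulSoup(page, "html.parser")

stop_at = soup.find("h4", class_="cat-title", id="55")

if stop_at is None:
    texts = []  # stop tag not found: nothing to collect
else:
    before = stop_at.find_all_previous("span", class_="myclass")
    # Results are nearest-first; reverse them to get document order.
    texts = [span.get_text() for span in reversed(before)]

print(texts)
```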

How about this:
page = requests.get("https://mysite")
# Split your page and unwanted string, then parse with BeautifulSoup
text = page.text.split('Title text N2.')
soup = BeautifulSoup(text[0], 'html.parser')
class_extr = soup.find_all("span", class_="myclass")

You can try something like this:
from bs4 import BeautifulSoup
page = """
<html><body><p>
<span class="myclass">text 1</span>
<span class="myclass">text 2</span>
</p>
<h4 class="cat-title" id="55">
Title text N1
<small>
Title text N2.
</small>
</h4>
<p>
<span class="myclass">text 3</span>
<span class="myclass">text 4</span>
</p>
</body>
</html>
"""
soup = BeautifulSoup(page, 'html.parser')
for i in soup.find_all():
    if i.name == 'h4' and i.has_attr('class') and i['class'][0] == 'cat-title' and i.has_attr('id') and i['id'] == '55':
        if i.find("small") and i.find("small").text.strip() == "Title text N2.":
            break
    elif i.name == 'span' and i.has_attr('class') and i['class'][0] == 'myclass':
        print(i)
Outputs:
<span class="myclass">text 1</span>
<span class="myclass">text 2</span>

Python v3, BeautifulSoup - multiple div tags with same name

soup = BeautifulSoup(html, "html.parser") # BeautifulSoup(markup, "lxml")
items = soup.find_all("div","_3u1 _gli _uvb", recursive=True)
for item in items:
    abouts = item.find_all("div", {"class":"_glo"}, recursive = True)[0].text
    print (abouts)
HTML page:
<div class="_glo">
<div>
<div class="_ajw">
<div class="_52eh">
"text
</div>
</div>
<div class="_ajw">
<div class="_52eh">
"text"
</div>
</div>
<div class="_ajw">
<div class="_52eh">
"text"
</div>
</div>
</div>
</div>
Good afternoon. I am trying to scrape a webpage using BeautifulSoup in Python. I need each of the "text" strings in a separate variable. When I print abouts I get "text text text", but I want the strings separated.
Kind regards

Try this:
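items = soup.find_all('div', attrs={'class':'_ajw'})
texts = {}  # renamed from dict so the built-in is not shadowed
for i in range(len(items)):
    # items[i], not item[i]: index into the list of matched divs
    texts['text'+str(i+1)] = items[i].find('div', attrs={'class':'_52eh'}).text
print(texts)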
This will give you something like this:
{'text1': text, 'text2': text, 'text3': text}

I'd use soup.select to apply a class selector to the html. It is a fast method to get a list of the appropriate elements by class
from bs4 import BeautifulSoup as bs
html = '''
<div class="_glo">
<div>
<div class="_ajw">
<div class="_52eh">
"text
</div>
</div>
<div class="_ajw">
<div class="_52eh">
"text"
</div>
</div>
<div class="_ajw">
<div class="_52eh">
"text"
</div>
</div>
</div>
</div>
'''
soup = bs(html, 'lxml')
items = [item.text.strip() for item in soup.select('._52eh')]
print(items)
