I am working on a webscraper project and can't get BeautifulSoup to give me the text between the Div. Below is my code. Any suggestions on how to get python to print just the "5x5" without the "Div to /Div" and without the whitespace?
source = requests.get('https://www.stor-it.com/self-storage/meridian-id-83646').text
soup = BeautifulSoup(source, 'lxml')
unit = soup.find('div', class_="unit-size")
print(unit)
This script returns the following:
<div class="unit-size">
5x5 </div>
You can use text to retrieve the text, then strip to remove whitespace
Try unit.text.strip()
Change your print statement from print(unit) to print(unit.text)
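To make the suggestions above concrete, here is a minimal sketch that parses the markup from the question's output instead of fetching the live page. It assumes the built-in "html.parser" rather than lxml so that nothing extra needs installing, and it also shows get_text(strip=True), which should give the same result here as .text.strip().

```python
from bs4 import BeautifulSoup

# Sample markup copied from the question's output (no network request needed).
html = '<div class="unit-size">\n5x5 </div>'
soup = BeautifulSoup(html, "html.parser")
unit = soup.find("div", class_="unit-size")

raw = unit.text                    # text with the surrounding whitespace intact
stripped = unit.text.strip()       # leading/trailing whitespace removed
same = unit.get_text(strip=True)   # strips each text fragment while extracting

print(repr(raw))
print(stripped)
print(same)
```

Printing the tag itself shows the markup; printing its text shows only the content.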
Use a faster css class selector
from bs4 import BeautifulSoup
source= '''
<div class="unit-size">
5x5 </div>
'''
soup = BeautifulSoup(source, 'lxml')
unit = soup.select('.unit-size')
print(unit[0].text.strip())
Related
I'm trying to scrape some data out of a web page, the data that I want to scrape is set like this:
<div id="pagetitle">
some_text
"some_text2"
some_text3
</div>
and I'm trying to get some_text3 I'm trying with this code
soup = soup(page, "html5lib")
author = soup.find('div', {'id' : 'pagetitle'}).a.string
print(author)
when I do this I only get some_text I also tried with:
author = soup.find_all('a', {'id' : 'pagetitle'})
but I get an empty list, I also tried it with:
author = soup.find(id='pagetitle').prettify()
and I get the whole code but I don't know how to get only some_text3
I also tried to use different parsers but none of them worked
Sorry if this is hard to understand; it's only my second question here. I would welcome any recommendations.
You can use CSS selector with :nth-last-child(). For example:
from bs4 import BeautifulSoup
html_doc = """
<div id="pagetitle">
<a>some_text</a>
"some_text2"
<a>some_text3</a>
</div>"""
soup = BeautifulSoup(html_doc, "html.parser")
txt = soup.select_one("#pagetitle > a:nth-last-child(1)").text
print(txt)
Prints:
some_text3
Or: use [-1] to get last element:
txt = soup.select("#pagetitle a")[-1].text
print(txt)
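It may help to see why the attempts in the question behaved the way they did. The sketch below assumes that some_text and some_text3 sit inside <a> tags (a guess based on the selectors used in the answer, since the posted markup does not show them) and that "some_text2" is a bare text node.

```python
from bs4 import BeautifulSoup

# Assumed markup: the <a> tags are a guess, they are not visible in the question.
html_doc = """
<div id="pagetitle">
<a href="#">some_text</a>
"some_text2"
<a href="#">some_text3</a>
</div>"""
soup = BeautifulSoup(html_doc, "html.parser")
title = soup.find("div", id="pagetitle")

# .a is shorthand for "the first <a> descendant", so it can never reach the last link.
first = title.a.string

# The id belongs to the <div>, not to any <a>, so this search matches nothing.
wrong = soup.find_all("a", id="pagetitle")

# Collect every <a> inside the div and take the last one.
last = title.find_all("a")[-1].get_text(strip=True)

print(first, wrong, last)
```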
I am using BeautifulSoup...
When I run this code:
inside_branding_info = container.div.find("div", "item-branding")
print(inside_branding_info)
It returns:
<div class="item-branding">
<a class="item-rating" href="https://www.newegg.com/gigabyte-geforce-rtx-2060-super-gv-n206swf2oc-8gd/p/N82E16814932174?cm_sp=SearchSuccess-_-INFOCARD-_-graphics+cards-_-14-932-174-_-1&Description=graphics+cards&IsFeedbackTab=true#scrollFullInfo"><i class="rating rating-4"></i><span class="item-rating-num">(12)</span></a>
</div>
However, in the HTML inspection this is what I see:
[Screenshot: raw site HTML]
Every time I run:
inside_branding_info.a.img["title"]
...Python thinks I want the "a" tag "item-rating", not the "a" href tag nested inside the div "item-branding".
How do I get inside of the "a href" tag, then into the "img", to finally extract the "title" (title = "MSI")? I want the title/brand of the item on the website. I am new to Python. I have only used R and SQL before this instance, any help would be greatly appreciated.
You need a selector path.
According to the image you provided...
soup = BeautifulSoup(data, 'html.parser')
# select() returns a list; select_one() returns the first matching tag
img = soup.select_one('.item-brand > img')
print(img['title'])
The above should work for you.
Try the following
from bs4 import BeautifulSoup
html = """<div class="item-branding">
<a href="https://www.newegg.com/" class="item-brand">
<img src="https://www.newegg.com/" title="MSI" alt="MSI">
</a></div>"""
soup = BeautifulSoup(html, features="lxml")
element = soup.select('.item-brand > img:nth-of-type(1)')[0]['title']
print(element)
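The same lookup can also be sketched with find() instead of a CSS selector. The markup below is an assumption: it combines the answer's sample with a shortened version of the rating link from the question, to show that filtering by class_ picks the brand link no matter which <a> comes first. The None guard is there because the printed output in the question contains no "item-brand" link at all, which suggests the fetched HTML may sometimes lack it.

```python
from bs4 import BeautifulSoup

# Assumed markup: a brand link plus a rating link inside the same div.
html = """<div class="item-branding">
<a class="item-rating" href="https://www.newegg.com/"><span class="item-rating-num">(12)</span></a>
<a href="https://www.newegg.com/" class="item-brand">
<img src="https://www.newegg.com/" title="MSI" alt="MSI">
</a>
</div>"""
soup = BeautifulSoup(html, "html.parser")
branding = soup.find("div", class_="item-branding")

# branding.a would return the rating link here, because it is the first <a>.
first_link_classes = branding.a["class"]

# Filter by class to get the brand link, then guard against a missing logo.
brand_link = branding.find("a", class_="item-brand")
brand = brand_link.img["title"] if brand_link and brand_link.img else None

print(first_link_classes, brand)
```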
I would like to extract the text 'THIS IS THE TEXT I WANT TO EXTRACT' from the snippet below. Does anyone have any suggestions? Thanks!
<span class="cw-type__h2 Ingredients-title">Ingredients</span>
<p>
THIS IS THE TEXT I WANT TO EXTRACT</p>
from bs4 import BeautifulSoup
html = """<span class="cw-type__h2 Ingredients-title">Ingredients</span><p>THIS IS THE TEXT I WANT TO EXTRACT</p>"""
soup = BeautifulSoup(html,'lxml')
print(soup.p.text)
Assuming there is likely more HTML, I would use the class of the preceding span with an adjacent sibling combinator and a p type selector to target the appropriate p tag
from bs4 import BeautifulSoup as bs
html = '''
<span class="cw-type__h2 Ingredients-title">Ingredients</span>
<p>
THIS IS THE TEXT I WANT TO EXTRACT</p>
'''
soup = bs(html, 'lxml')
print(soup.select_one('.Ingredients-title + p').text.strip())
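For comparison, roughly the same idea can be written without CSS selectors by using find_next_sibling(). This is only a sketch against the snippet from the question, parsed with the built-in "html.parser"; on the real page the span may not be followed directly by the wanted paragraph.

```python
from bs4 import BeautifulSoup

html = """
<span class="cw-type__h2 Ingredients-title">Ingredients</span>
<p>
THIS IS THE TEXT I WANT TO EXTRACT</p>
"""
soup = BeautifulSoup(html, "html.parser")

# class_ matches a single class even when the tag carries several.
heading = soup.find("span", class_="Ingredients-title")

# Skip over whitespace text nodes and take the next <p> sibling.
paragraph = heading.find_next_sibling("p")
text = paragraph.get_text(strip=True)

print(text)
```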
I'm using Python/beautifulSoup to find a div of a specific class, and I want to nuke that entire html element from a file.
This is what I have --
with open(url) as f:
elementToDelete = BeautifulSoup(f.read()).find("div", {'class': 'element-that-needs-to-go'})
removeTheElement = elementToDelete.replace('THISISWHEREIMSTUCK', '')
with open(url, 'w') as f:
f.write(removeTheElement)
I can't seem to find the right method to do what I want.
Use the decompose method:
Python code:
from bs4 import BeautifulSoup
html = '''
<div>
<div class="element-that-needs-to-go">
</div>
</div>
'''
soup = BeautifulSoup(html, 'html.parser')
tag_to_remove = soup.find("div", {'class': 'element-that-needs-to-go'})
tag_to_remove.decompose()
print(soup)
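Since the question was about removing the element from a file, here is one way the decompose() call might be wrapped in a read-modify-write round trip. The helper name remove_divs and the sample file contents are hypothetical, and the demo runs on a temporary file so nothing real is overwritten.

```python
import tempfile
from pathlib import Path

from bs4 import BeautifulSoup


def remove_divs(path, class_name):
    """Delete every <div> with the given class from the HTML file at path."""
    soup = BeautifulSoup(Path(path).read_text(encoding="utf-8"), "html.parser")
    for tag in soup.find_all("div", class_=class_name):
        tag.decompose()  # removes the tag and everything inside it
    # Write the modified tree (not the removed tag) back to the file.
    Path(path).write_text(str(soup), encoding="utf-8")


# Demo on a throwaway file with made-up contents.
with tempfile.TemporaryDirectory() as tmp:
    page = Path(tmp) / "page.html"
    page.write_text(
        '<div><div class="element-that-needs-to-go">x</div><p>keep</p></div>',
        encoding="utf-8",
    )
    remove_divs(page, "element-that-needs-to-go")
    result = page.read_text(encoding="utf-8")

print(result)
```

The key difference from the attempt in the question is that the whole soup is written back, not the tag that was found.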
I have started learning BeautifulSoup. I am trying to remove a line containing a stray </div> from an HTML document.
Most examples in the documentation deal with whole tags (both the opening and the closing part).
Is it possible to modify just one part of a tag?
For example:
</div>
<div >Hello</div>
<div data-foo="value">foo!</div>
How can I remove just the first line of the code?
You can use BeautifulSoup's unwrap() to specify the invalid tag, which will only remove the extra tags that don't have a open/close counterpart, while retaining others:
soup = BeautifulSoup(html_doc, 'html.parser')
invalid_tags = ['</div>']
for tag in invalid_tags:
for match in soup.findAll(tag):
match.unwrap()
print(soup)
result:
<div>Hello</div>
<div data-foo="value">foo!</div>
You don't need to do anything; the parser repairs it automatically.
from bs4 import BeautifulSoup
html_doc = '''</div>
<div>World</div>
<div data-foo="value">foo!''' # also invalid, no closing
soup = BeautifulSoup(html_doc, 'html.parser')
print(soup)
output
<div>World</div>
<div data-foo="value">foo!</div>
unwrap() is for removing a tag while keeping its contents, not for repairing broken markup.
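The difference can be checked with a short experiment. The sketch below uses the markup from the question for the stray closing tag and a made-up <p>/<b> snippet to contrast unwrap() with decompose(). It assumes "html.parser"; other parsers may repair broken markup differently.

```python
from bs4 import BeautifulSoup

# 1. A stray closing tag is dropped by the parser itself.
broken = '</div>\n<div>Hello</div>\n<div data-foo="value">foo!</div>'
repaired = BeautifulSoup(broken, "html.parser")
div_count = len(repaired.find_all("div"))
closing_count = str(repaired).count("</div>")

# 2. unwrap() removes the tag but keeps what was inside it.
soup = BeautifulSoup("<p>Hello <b>world</b></p>", "html.parser")
soup.b.unwrap()
unwrapped = str(soup)

# 3. decompose() removes the tag together with its contents.
soup = BeautifulSoup("<p>Hello <b>world</b></p>", "html.parser")
soup.b.decompose()
decomposed = str(soup)

print(div_count, closing_count)
print(unwrapped)
print(decomposed)
```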