I have the following html:
<div class="leftColumn">
<div>
<div class="static">
text1
<br>
text2
<br>
(222) 123 - 4567
<br>
<div class="summary">
How can I select just the text lines using Beautiful Soup?
I've tried a variety of things like:
soup.select('.leftColumn div').text
but so far no dice
Mauro's answer is probably more what you wanted, but this is another way to do it and how I thought about getting the inner div text:
from bs4 import BeautifulSoup
html = '''<div class="leftColumn">
<div>
<div class="static">
text1
<br>
text2
<br>
(222) 123 - 4567
<br>
<div class="summary">
'''
bs = BeautifulSoup(html, 'html.parser')
for div in bs.find_all('div', attrs={'class': 'leftColumn'}):
    # step from the outer div to its first nested div, then to the one after it
    print(div.find_next('div').find_next('div').text)
BeautifulSoup's select() returns a list, so you must specify the index:
soup.select('.leftColumn div')[0].text.split()
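To make the list-versus-element point concrete, here is a minimal sketch. The HTML is the question's snippet with closing tags filled in, which is an assumption since the original is truncated. It also shows select_one(), which returns the first match directly instead of a list.

```python
from bs4 import BeautifulSoup

# Assumption: the question's snippet with closing tags added for illustration.
html = """
<div class="leftColumn">
  <div>
    <div class="static"></div>
    text1
    <br>
    text2
    <br>
    (222) 123 - 4567
    <br>
    <div class="summary"></div>
  </div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# select() returns a list-like ResultSet; the list itself has no .text.
matches = soup.select(".leftColumn div")

# Either index into the list, or let select_one() return the first match (or None).
first_by_index = matches[0]
first_by_select_one = soup.select_one(".leftColumn div")

words = first_by_select_one.get_text().split()
print(len(matches))
print(words)
```

Because select_one() returns None when nothing matches, it is worth checking the result before calling get_text() on real pages.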
Related
Hello everyone. I'm trying to scrape a webpage that has tons of span elements, most of which hold useless information. Below is the section of span data that I need; I can't simply call find_all on span because there are hundreds of others I don't need.
<div class="col-md-4">
<p>
<span class="text-muted">File Number</span><br>
A-21-897274
</p>
</div>
<div class="col-md-4">
<p>
<span class="text-muted">Location</span><br>
Ohio
</p>
</div>
<div class="col-md-4">
<p>
<span class="text-muted">Date</span><br>
07/01/2022
</p>
</div>
</div>
I need the span titles:
File Number, Location, Date
and then the values that match:
"A-21-897274", "Ohio", "07/01/2022"
I need this printed out so I can make a pandas DataFrame, but I can't seem to get the specific spans printed with their values.
What I've tried:
import bs4
from bs4 import BeautifulSoup
soup = BeautifulSoup(..., 'lxml')
for title_tag in soup.find_all('span', class_='text-muted'):
    # get the last sibling
    *_, value_tag = title_tag.next_siblings
    title = title_tag.text.strip()
    if isinstance(value_tag, bs4.element.Tag):
        value = value_tag.text.strip()
    else:  # it's a NavigableString element
        value = value_tag.strip()
    print(title, value)
output:
File Number "A-21-897274"
Location "Ohio"
Operations_Manager "Joanna"
Date "07/01/2022"
Type "Transfer"
Status "Open"
ETC "ETC"
ETC "ETC"
This prints everything I need, but it also prints hundreds of other values I don't want or need.
You can pass a function to soup.find_all to select only the wanted elements, then use .find_next_sibling() to get the value. For example:
from bs4 import BeautifulSoup
html_doc = """
<div class="col-md-4">
<p>
<span class="text-muted">File Number</span><br>
A-21-897274
</p>
</div>
<div class="col-md-4">
<p>
<span class="text-muted">Location</span><br>
Ohio
</p>
</div>
<div class="col-md-4">
<p>
<span class="text-muted">Date</span><br>
07/01/2022
</p>
</div>
</div>
"""
soup = BeautifulSoup(html_doc, "html.parser")
def correct_tag(tag):
    return tag.name == "span" and tag.get_text(strip=True) in {
        "File Number",
        "Location",
        "Date",
    }
for t in soup.find_all(correct_tag):
    print(f"{t.text}: {t.find_next_sibling(text=True).strip()}")
Prints:
File Number: A-21-897274
Location: Ohio
Date: 07/01/2022
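Since the goal was a pandas DataFrame, it may help to collect the pairs into a dict rather than print them. The sketch below is one way to do that; the WANTED tuple, the record name, and the extra "Status" span are illustrative assumptions, not part of the original page.

```python
from bs4 import BeautifulSoup

# Assumption: a trimmed copy of the question's markup plus one unwanted span.
html_doc = """
<div class="col-md-4">
<p>
<span class="text-muted">File Number</span><br>
A-21-897274
</p>
</div>
<div class="col-md-4">
<p>
<span class="text-muted">Status</span><br>
Open
</p>
</div>
<div class="col-md-4">
<p>
<span class="text-muted">Location</span><br>
Ohio
</p>
</div>
<div class="col-md-4">
<p>
<span class="text-muted">Date</span><br>
07/01/2022
</p>
</div>
"""

WANTED = ("File Number", "Location", "Date")  # hypothetical list of labels to keep

soup = BeautifulSoup(html_doc, "html.parser")

record = {}
for span in soup.find_all("span", class_="text-muted"):
    label = span.get_text(strip=True)
    if label not in WANTED:
        continue  # skip the spans we do not care about
    # The value is the first text node that follows the span (after the <br>).
    value = span.find_next_sibling(string=True)
    record[label] = value.strip() if value is not None else None

print(record)
```

A list of such dicts, one per scraped page, can then be handed to pandas.DataFrame to get one row per record.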
I need to scrape data from a web page in the format below. I only need the inner text of the first child (the first span) of each h2 and h3, plus the text of all the <p> tags.
<div class="info">
<h2>
<span>first heading</span>
<span> not required</span>
</h2>
<p> 1 paragraph</p>
<p> 2 paragraph</p>
<div> some tags</div>
<h3>
<span>second heading</span>
<span> not required</span>
</h3>
<p> 3 paragraph</p>
<p> 4 paragraph</p>
</div>
Expected output:
first heading
1 paragraph
2 paragraph
second heading
3 paragraph
4 paragraph
I tried:
soup.find_all(["h1", "p", "h2", "h3"])
but this also returns the inner text of the second span, which I don't want.
I only need the first span's text from each h2 and h3, plus the content of the p tags.
I am new to Python and Beautiful Soup; any help would be appreciated.
Try this one
from bs4 import BeautifulSoup as bs
my_data = [your html above]
soup = bs(my_data, "lxml")
for head in ["h2", "h3"]:
    target = soup.find(head)
    print(target.findChild().text)
Output:
first heading
second heading
You can use find_all() to get the tags you want, then use findChild() on the elements where you want only the first child:
from bs4 import BeautifulSoup
html = """
<div class="info">
<h2>
<span>first heading</span>
<span> not required</span>
</h2>
<p> 1 paragraph</p>
<p> 2 paragraph</p>
<div> some tags</div>
<h3>
<span>second heading</span>
<span> not required</span>
</h3>
<p> 3 paragraph</p>
<p> 4 paragraph</p>
</div>
"""
soup = BeautifulSoup(html, "lxml")
for elem in soup.find_all(['h2', 'h3', 'p']):
    if elem.name == 'p':
        print(elem.text)
    else:
        print(elem.findChild().text)
Outputs:
first heading
1 paragraph
2 paragraph
second heading
3 paragraph
4 paragraph
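If the paragraphs need to stay associated with their heading, the same find_all() loop can build a dict instead of printing. This is only a sketch; the sections and current names are my own, and it assumes every paragraph belongs to the most recent h2 or h3.

```python
from bs4 import BeautifulSoup

html = """
<div class="info">
<h2>
<span>first heading</span>
<span> not required</span>
</h2>
<p> 1 paragraph</p>
<p> 2 paragraph</p>
<div> some tags</div>
<h3>
<span>second heading</span>
<span> not required</span>
</h3>
<p> 3 paragraph</p>
<p> 4 paragraph</p>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

sections = {}   # heading text -> list of paragraph texts
current = None  # heading that the next paragraphs belong to

# find_all() with a list of names yields the tags in document order.
for elem in soup.find_all(["h2", "h3", "p"]):
    if elem.name == "p":
        if current is not None:
            sections[current].append(elem.get_text(strip=True))
    else:
        # Only the first span inside the heading is wanted.
        first_span = elem.find("span")
        source = first_span if first_span is not None else elem
        current = source.get_text(strip=True)
        sections[current] = []

print(sections)
```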
I'm trying to extract spans with a given class from an HTML file, but only those located before a given stopping point. What I have is:
import requests
from bs4 import BeautifulSoup
page = requests.get("https://mysite")
soup = BeautifulSoup(page.content, 'html.parser')
class_extr = soup.find_all("span", class_="myclass")
This works, but it finds all instances of myclass, and I only want those before the following text appears in the soup:
<h4 class="cat-title" id="55">
Title text N1
<small>
Title text N2.
</small>
</h4>
What makes this block unique is the Title text N lines, especially the Title text N2. line. There are many cat-title tags before it, so I can't use the class alone as a stopping condition.
The code surrounding this block looks like this:
...
<div class="myc">
<a class="bbb" href="linkhere_893">
<span class="myclass">Text893</span>
<img data-lazy="https://link893.jpg"/>
</a>
</div>
<div class="myc">
<a class="bbb" href="linkhere_96">
<span class="myclass">Text96</span>
<img data-lazy="https://link96.jpg"/>
</a>
</div>
</div><!-- This closes a list that starts above -->
<h4 class="cat-title" id="55">Title text N1 <small> Title text N2.</small></h4>
<div class="list" id="55">
<div class="myc">
<a class="bbb" href="linkhere_34">
<span class="myclass">Text34</span>
<img data-lazy="https://link34.jpg"/>
</a>
</div>
<div class="myc">
...
continuing both above and below.
How can I do this?
Try using find_all_previous():
import requests
from bs4 import BeautifulSoup
page = requests.get("https://mysite")
soup = BeautifulSoup(page.content, 'html.parser')
stop_at = soup.find("h4", class_="cat-title", id='55') # finds your stop tag
class_extr = stop_at.find_all_previous("span", class_="myclass")
If there are multiple matching tags, this uses the first <h4 class="cat-title" id="55"> as the stopping point.
Reference: Beautiful Soup Documentation
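Two details may matter in practice: find_all_previous() returns its matches nearest-first, so in reverse document order, and the question says the <small> text is what really identifies the stop tag. The sketch below covers both on a small inline page. The is_stop_tag helper and the sample markup are assumptions for illustration.

```python
from bs4 import BeautifulSoup

# Assumption: a small stand-in for the real page.
page = """
<html><body><p>
<span class="myclass">text 1</span>
<span class="myclass">text 2</span>
</p>
<h4 class="cat-title" id="55">Title text N1 <small> Title text N2.</small></h4>
<p>
<span class="myclass">text 3</span>
<span class="myclass">text 4</span>
</p>
</body></html>
"""

soup = BeautifulSoup(page, "html.parser")


def is_stop_tag(tag):
    """Hypothetical helper: an h4 whose <small> child holds the marker text."""
    if tag.name != "h4":
        return False
    small = tag.find("small")
    return small is not None and small.get_text(strip=True) == "Title text N2."


stop_at = soup.find(is_stop_tag)

# Matches come back nearest-first, so reverse them to restore page order.
before = stop_at.find_all_previous("span", class_="myclass")
texts = [span.get_text(strip=True) for span in reversed(before)]
print(texts)
```

On a real page soup.find() may return None if the marker is missing, so a check before calling find_all_previous() is advisable.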
How about this:
import requests
from bs4 import BeautifulSoup
page = requests.get("https://mysite")
# Split the raw HTML at the unique marker text, then parse only the part before it
text = page.text.split('Title text N2.')
soup = BeautifulSoup(text[0], 'html.parser')
class_extr = soup.find_all("span", class_="myclass")
You can try something like this:
from bs4 import BeautifulSoup
page = """
<html><body><p>
<span class="myclass">text 1</span>
<span class="myclass">text 2</span>
</p>
<h4 class="cat-title" id="55">
Title text N1
<small>
Title text N2.
</small>
</h4>
<p>
<span class="myclass">text 3</span>
<span class="myclass">text 4</span>
</p>
</body>
</html>
"""
soup = BeautifulSoup(page, 'html.parser')
for i in soup.find_all():
    if i.name == 'h4' and i.has_attr('class') and i['class'][0] == 'cat-title' and i.has_attr('id') and i['id'] == '55':
        # stop once the <small> text confirms this is the marker heading
        if i.find("small") and i.find("small").text.strip() == "Title text N2.":
            break
    elif i.name == 'span' and i.has_attr('class') and i['class'][0] == 'myclass':
        print(i)
Outputs:
<span class="myclass">text 1</span>
<span class="myclass">text 2</span>
I have a page which contains several repetitions of: <div...><h4>...<p>... For example:
html = '''
<div class="proletariat">
<h4>sickle</h4>
<p>Ignore this text</p>
</div>
<div class="proletariat">
<h4>hammer</h4>
<p>This is the text we want</p>
</div>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(html)
If I write print soup.select('div[class^="proletariat"] > h4 ~ p'), I get:
[<p>Ignore this text</p>, <p>This is the text we want</p>]
How do I specify that I only want the text of p when it is preceded by <h4>hammer</h4>?
Thanks
html = '''
<div class="proletariat">
<h4>sickle</h4>
<p>Ignore this text</p>
</div>
<div class="proletariat">
<h4>hammer</h4>
<p>This is the text we want</p>
</div>
'''
import re
from bs4 import BeautifulSoup
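soup = BeautifulSoup(html, 'html.parser')
# the h4's next sibling is a newline text node; the element after that is the <p>
print(soup.find("h4", text=re.compile('hammer')).next_sibling.next_element.text)
Output:
This is the text we want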
:contains() could help here, but it was not supported by Beautiful Soup's select() when this answer was written.
Taking this into account, you can use select() in conjunction with find_next_sibling():
print(next(h4.find_next_sibling('p').text
           for h4 in soup.select('div[class^="proletariat"] > h4')
           if h4.text == "hammer"))
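Newer Beautiful Soup releases (4.7 and later) delegate select() to the soupsieve package, which offers a text-matching pseudo-class spelled :-soup-contains(). The sketch below assumes soupsieve 2.1 or later is installed; it is an alternative to the approach above, not a replacement for it.

```python
from bs4 import BeautifulSoup

html = '''
<div class="proletariat">
<h4>sickle</h4>
<p>Ignore this text</p>
</div>
<div class="proletariat">
<h4>hammer</h4>
<p>This is the text we want</p>
</div>
'''

soup = BeautifulSoup(html, "html.parser")

# :-soup-contains() is a substring match on the h4's text, so "hammer"
# would also match a heading such as "sledgehammer".
paragraphs = soup.select('div.proletariat > h4:-soup-contains("hammer") ~ p')
texts = [p.get_text() for p in paragraphs]
print(texts)
```

When an exact match on the heading text is required, the find_next_sibling() version with h4.text == "hammer" remains the stricter option.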
I have the following html:
<div class="leftColumn">
<div>
<div class="static">
.............................
</div>
text1
<br>
text2
<br>
(222) 123 - 4567
<br>
<div class="summary">
.........................
</div>
</div>
I've just been shown that the way to get the text is:
soup.select('.leftColumn div')[0].text.split()
This works, but there is so much junk left over from the two divs that it is very difficult to pick out the text I need reliably. Is there a way to remove the two divs (classes static and summary), which would make it much easier to process the remainder?
Here is an example based on your snippet:
from bs4 import BeautifulSoup
text = """
<div class="leftColumn">
<div>
<div class="static">
.............................
</div>
text1
<br>
text2
<br>
(222) 123 - 4567
<br>
<div class="summary">
.........................
</div>
</div>
</div>
"""
soup = BeautifulSoup(text, 'html.parser')
# Find divs with class "static" or "summary" and remove them using `extract`
div_nodes = soup.find_all('div', {'class': ['static', 'summary']})
for div in div_nodes:
    div.extract()
print(soup.text.split())
If you run the code, you will see that the static and summary divs are removed, and you get:
['text1', 'text2', '(222)', '123', '-', '4567']
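If modifying the tree is undesirable, another option is to read only the text nodes that are direct children of the inner div and skip the nested divs entirely. This is a sketch under the assumption that the wanted lines always sit directly inside that div; the "junk" placeholders stand in for the elided content.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString

# Assumption: "junk" replaces the content elided with dots in the question.
text = """
<div class="leftColumn">
<div>
<div class="static">
junk
</div>
text1
<br>
text2
<br>
(222) 123 - 4567
<br>
<div class="summary">
junk
</div>
</div>
</div>
"""

soup = BeautifulSoup(text, "html.parser")
inner = soup.select_one(".leftColumn > div")

# Keep only the div's own text nodes; nested tags (and their text) are skipped.
lines = [
    child.strip()
    for child in inner.children
    if isinstance(child, NavigableString) and child.strip()
]
print(lines)
```

Unlike split(), this keeps each line intact, so the phone number stays a single string.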