beautiful soup misses text when scraping html

beautiful soup misses text when scraping html - python

Here is a sample of my HTML. I need to parse all content for each record. This is one sample record.
<div class="list-group-item card-contact">
<div class="card-base">
<div class="card-name hot">ronald <b>loudia</b></div>
<div class="card-last-seen"><i>Inquired:</i> 8 day ago, Value $205,000 Engaged </div>
</div>
<a class="card-expand" href="#"></a>
<div class="card-mobile">
<ul class="card-activities" data-score="27">
<li class="card-update"><a class="btn" href="wrap.php?leaduuid=e888888e&aid=0">Update</a></li>
<li class="card-notes"><a class="btn" href="contact-notes.php?leaduuid=e888888e">Notes</a></li>
<li class="card-email">
<a class="btn" href="mailto:lou#ad.fr" title="leald#icloud.fr">
<div class="activity-label">Email:</div>
<div class="activity-value">lou#ad.fr</div>
</a>
</li>
<li class="card-sms"><a class="btn" href="sms:(222) 125-4444">SMS</a></li>
<li class="card-phone">
<a class="btn" href="tel:(222) 125-4444">
<div class="activity-label">Phone:</div>
<div class="activity-value">(222) 125-4444</div>
</a>
</li>
</ul>
</div>
</div>
I can't seem to get all the content between the tags.
Here is my code
from bs4 import BeautifulSoup
import re
import mechanize
import pandas as pd
br = mechanize.Browser()
br.open(url)
mylist = []
html_doc = br.response().read()
soup = BeautifulSoup(html_doc, 'html.parser')
mydivs = soup.find_all('div', ['card-name',
'card-last-seen',
'activity-value',
'card-activities',
'card-last-seen',
'activity-value',
'card-mobile'])
st = ''
for div in mydivs:
if re.search('^\([0-9][0-9][0-9]\)', div.text):
st += f'{div.text}\n'
else:
st += f'{div.text}, '
mylist.append(st)
#print(mylist)
smallerlist = [l.split(', ') for l in ', '.join(mylist).split('\n')]
smallerlist
df = pd.DataFrame(smallerlist)
I only get some of the content, missing data-score and other content.
Not sure how to get both div and ul content.
Html has many records that I need to loop through and write to pandas DataFrame.
The expected output in dataframe:
FirstName LastName LastLogin Value Score Email sms phone-activity-Value
ronald Lecloudia 8 day ago $205,000 Engaged 27 lou#ad.fr (222) 125-4444 (222) 125-4444

You can access the data-score attribute like this:
soup.ul.attrs['data-score']
Output:
'88888827'
More data:
import re
new_line_re = re.compile('\n{2,}')
new_line_re.sub('\n', soup.div.text).strip().split('\n')
Output:
['ronald Lecloudia',
'Inquired: 888888 day ago, Value $205,000 Engaged ',
'Update',
'Notes',
'Email:',
'leald#icloud.fr',
'SMS',
'Phone:',
'(232) 78888885-4444']

Related

BeautifulSoup Returns empty list which leads to an IndexError in my Python code

I am trying to do web scraping using BeautifulSoup. The code I have written is below:
import requests
from bs4 import BeautifulSoup
response = requests.get("https://stackoverflow.com/questions")
soup = BeautifulSoup(response.text, "html.parser")
questions = soup.select(".question-summary")
print(type(questions[0]))
When I run the code, I get the error message below:
print(type(questions[10]))
IndexError: list index out of range
Then i tried to print the list like below:
import requests
from bs4 import BeautifulSoup
response = requests.get("https://stackoverflow.com/questions")
soup = BeautifulSoup(response.text, "html.parser")
questions = soup.select(".question-summary")
print(questions)
And then I get an empty list: []
What am I doing wrong?
Thanks for your answers.

.question-summary is incorrect locator because it's a portion of id meaning each id value start with question-summary. Now it's working.
import requests
from bs4 import BeautifulSoup
response = requests.get("https://stackoverflow.com/questions")
soup = BeautifulSoup(response.text, "html.parser")
questions = soup.select('[id^="question-summary"]')
print(questions)
Output:
1" data-post-type-id="1" id="question-summary-71715531">
<div class="s-post-summary--stats js-post-summary-stats">
<div class="s-post-summary--stats-item s-post-summary--stats-item__emphasized" title="Score of 0">
<span class="s-post-summary--stats-item-number">0</span>
<span class="s-post-summary--stats-item-unit">votes</span>
</div>
<div class="s-post-summary--stats-item" title="0 answers">
<span class="s-post-summary--stats-item-number">0</span>
<span class="s-post-summary--stats-item-unit">answers</span>
</div>
<div class="s-post-summary--stats-item" title="5 views">
<span class="s-post-summary--stats-item-number">5</span>
<span class="s-post-summary--stats-item-unit">views</span>
</div>
</div>
<div class="s-post-summary--content">
<h3 class="s-post-summary--content-title">
<a class="s-link" href="/questions/71715531/is-it-possible-to-draw-a-logistic-regression-graph-with-multiple-x-variable">Is it possible to draw a
logistic regression graph with multiple x variable?</a>
</h3>
<div class="s-post-summary--content-excerpt">
Currently, this is my X and V value. May I know is it possible to draw a logistic regression curve with X that has multiple column? Or I am required to draw multiple graphs to do so?
X = df1.drop(['...
</div>
<div class="s-post-summary--meta">
<div class="s-post-summary--meta-tags tags js-tags t-python-3ûx t-machine-learning">
<a class="post-tag flex--item mt0 js-tagname-python-3ûx" href="/questions/tagged/python-3.x" rel="tag" title="show questions tagged 'python-3.x'">python-3.x</a> <a class="post-tag flex--item mt0 js-tagname-machine-learning" href="/questions/tagged/machine-learning" rel="tag" title="show questions tagged 'machine-learning'">machine-learning</a>
</div>
<div class="s-user-card s-user-card__minimal">
<a class="s-avatar s-avatar__16 s-user-card--avatar" href="/users/14128881/christopher-chua"> <div class="gravatar-wrapper-16" data-user-id="14128881">
<img ,="" alt="user avatar" class="s-avatar--image" height="16" src="https://lh6.googleusercontent.com/-Sn3B_E5hiJc/AAAAAAAAAAI/AAAAAAAAAAA/AMZuucl1oyfdhJiXhrx73JLYqzKAK9icag/photo.jpg?sz=32" width="16"/>
</div>
</a>
<div class="s-user-card--info">
<div class="s-user-card--link d-flex gs4">
<a class="flex--item" href="/users/14128881/christopher-chua">Christopher Chua</a>
</div>
<ul class="s-user-card--awards">
<li class="s-user-card--rep"><span class="todo-no-class-here" dir="ltr" title="reputation score ">7</span></li>
</ul>
</div>
<time class="s-user-card--time">asked <span class="relativetime" title="2022-04-02 07:03:06Z">13 mins ago</span></time>
.. so on

Scrape values inside span class webpage with beautifulsoup python

Hello everyone I have a webpage I'm trying to scrape and the page has tons of span classes and most of which is useless information I posted a section of the span class data that I need but I'm not able to do find.all span because there are 100's of others not needed.
<div class="col-md-4">
<p>
<span class="text-muted">File Number</span><br>
A-21-897274
</p>
</div>
<div class="col-md-4">
<p>
<span class="text-muted">Location</span><br>
Ohio
</p>
</div>
<div class="col-md-4">
<p>
<span class="text-muted">Date</span><br>
07/01/2022
</p>
</div>
</div>
I need the span titles:
File Number, Location, Date
and then the values that match:
"A-21-897274", "Ohio", "07/01/2022"
I need this printed out so I can make a pandas data frame. But I cant seem to get the specific spans printed with their value.
What I've tried:
import bs4
from bs4 import BeautifulSoup
soup = BeautifulSoup(..., 'lxml')
for title_tag in soup.find_all('span', class_='text-muted'):
# get the last sibling
*_, value_tag = title_tag.next_siblings
title = title_tag.text.strip()
if isinstance(value_tag, bs4.element.Tag):
value = value_tag.text.strip()
else: # it's a navigable string element
value = value_tag.strip()
print(title, value)
output:
File Number "A-21-897274"
Location "Ohio"
Operations_Manager "Joanna"
Date "07/01/2022"
Type "Transfer"
Status "Open"
ETC "ETC"
ETC "ETC"
This will print out everything I need BUT it also prints out 100's of other values I don't want/need.

You can use function in soup.find_all to select only wanted elements and then .find_next_sibling() to select the value. For example:
from bs4 import BeautifulSoup
html_doc = """
<div class="col-md-4">
<p>
<span class="text-muted">File Number</span><br>
A-21-897274
</p>
</div>
<div class="col-md-4">
<p>
<span class="text-muted">Location</span><br>
Ohio
</p>
</div>
<div class="col-md-4">
<p>
<span class="text-muted">Date</span><br>
07/01/2022
</p>
</div>
</div>
"""
soup = BeautifulSoup(html_doc, "html.parser")
def correct_tag(tag):
return tag.name == "span" and tag.get_text(strip=True) in {
"File Number",
"Location",
"Date",
}
for t in soup.find_all(correct_tag):
print(f"{t.text}: {t.find_next_sibling(text=True).strip()}")
Prints:
File Number: A-21-897274
Location: Ohio
Date: 07/01/2022

GitHub get commits number using python and beautiful soup

I'm trying to get the number of commits of github repos using python and beautiful soup
html code:
<div class="flex-shrink-0">
<h2 class="sr-only">Git stats</h2>
<ul class="list-style-none d-flex">
<li class="ml-0 ml-md-3">
<a data-pjax href="..." class="pl-3 pr-3 py-3 p-md-0 mt-n3 mb-n3 mr-n3 m-md-0 Link--primary no-underline no-wrap">
<span class="d-none d-sm-inline">
<strong>26</strong>
<span aria-label="Commits on master" class="color-text-secondary d-none d-lg-inline">
commits
</span>
</span>
</a>
</li>
</ul>
</div>
my code:
r = requests.get(source_code_link)
soup = bs(r.content, 'lxml')
spans = soup.find_all('span', class_='d-none d-sm-inline')
for span in spans:
number = span.select_one('strong')
sometimes works but sometimes no because there are more then one span tag with class d-none d-sm-inline.
how can i solve ?

Here's an approach using list commits from GitHub's REST API
import requests
user = ... # username or organisation
repo = ... # repository name
response = requests.get(f"https://api.github.com/repos/{user}/{repo}/commits")
if response.ok:
ncommits = len(response.json())
else:
raise ValueError(f"error: {response.url} responded {response.reason}")
print(ncommits)

from bs4 import BeautifulSoup as bs
html="""<span class="d-none d-sm-inline">
<strong>26</strong>
<span aria-label="Commits on master" class="color-text-secondary d-none d-lg-inline">
commits
</span>
</span>
<div class="flex-shrink-0">
<h2 class="sr-only">Git stats</h2>
<ul class="list-style-none d-flex">
<li class="ml-0 ml-md-3">
<a data-pjax href="..." class="pl-3 pr-3 py-3 p-md-0 mt-n3 mb-n3 mr-n3 m-md-0 Link--primary no-underline no-wrap">
<span class="d-none d-sm-inline">
<strong>23</strong>
<span aria-label="Commits on master" class="color-text-secondary d-none d-lg-inline">
commits
</span>
</span>
</a>
</li>
</ul>
</div>"""
I have combine both example which look up for tag strong and based on that prints data by using .contents method
soup = bs(html, 'lxml')
spans = soup.find_all('span', class_='d-none d-sm-inline')
for span in spans:
for tag in span.contents:
if tag.name=="strong" :
print(tag.get_text())
using list comprehension :
for span in spans:
data=[tag for tag in span.contents if tag.name=="strong"]
print(data[0].get_text())
Ouput for both case:
26
23

You can use the find_next() method to look for a <strong> after the class d-none d-sm-inline.
In your case:
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, "html.parser")
for tag in soup.find_all("span", class_="d-none d-sm-inline"):
print(tag.find_next("strong").text)

Parsing HTML with BeatifulSoup class == AND title CONTAINS

I am trying to parse the following HTML:
<div class="product-details">
<h4 class="title" >Blue - Standard</h4>
<a class="learn-more" data-test-selector="linkViewMoreDetails" href="https://productwebpage.com">Learn More</a>
<div class="tocart" <a class="" href="/store/addtocartplp?productId=3593" id="AddToCartSimple-3593">Add To Cart</a></div>
</div>
,
<div class="product-details">
<h4 class="title" >Blue - Wide</h4>
<a class="learn-more" data-test-selector="linkViewMoreDetails" href="https://productwebpage.com">Learn More</a>
<div class="tocart" <a class="disAddtoCardBtn" href="javascript:void(0)" id="AddToCartSimple-3576" >SOLD</a></div>
</div>
I am trying to get the 'id' where the title contains 'Blue' AND the item is not sold.
I have tried:
soup.find_all("a",href=re.compile("Blue"),class_="")
links = soup.find_all("a", href=re.compile("Blue", "Add To Cart"))
ids = [tag["id"] for tag in soup.find_all("a", href=re.compile("Blue"))]
But it is not returning the info I'm looking for.
I would like it to return:
AddToCartSimple-3593

I think your html is corrupted. You can do the entire filtering with css selectors using :has, :not, and :contains (:-soup-contains - latest soupsieve), along with attribute = value selectors. The ^ is a starts with operator, meaning attribute value starts with the string after the =. The ~ is a general sibling combinator and the > is a child combinator. This means looking for a sibling with class (.) tocart and then a child with id that starts with AddToCartSimple-, but that doesn't have text containing SOLD displayed. Less specific than !="SOLD" , as it can be a partial string exclusion. Depends on observed variation in actual data.
from bs4 import BeautifulSoup as bs
html ='''
<div class="product-details">
<h4 class="title">Blue - Standard</h4> <a class="learn-more" data-test-selector="linkViewMoreDetails" href="https://productwebpage.com">Learn More</a>
<div class="tocart"> <a class="" href="/store/addtocartplp?productId=3593" id="AddToCartSimple-3593">Add To Cart</a>
</div>
<div class="product-details">
<h4 class="title">Blue - Wide</h4> <a class="learn-more" data-test-selector="linkViewMoreDetails" href="https://productwebpage.com">Learn More</a>
<div class="tocart"> <a class="disAddtoCardBtn" href="javascript:void(0)" id="AddToCartSimple-3576">SOLD</a>
</div>
'''
soup = bs(html, 'html.parser')
print(soup.select_one('.title:has([title^="Blue -"]) ~ .tocart > [id^=AddToCartSimple-]:not(:contains("SOLD"))')['id'])
You should check there was a match before accessing with ['id'] of course. You could also go for all matches as follows:
[i['id'] for i in soup.select('.title:has([title^="Blue -"]) ~ .tocart > [id^=AddToCartSimple-]:not(:contains("SOLD"))')]

To get the data where the "title" contains "Blue" and the item is not "SOLD":
Use a CSS selector .product-details > h4 a[title*='Blue'] which will select all a where the title=Blue under an h4 under the class product-details
Find the next div using the find_next() method, and check that the text is not "SOLD".
Print the next div's id
from bs4 import BeautifulSoup
html = """<div class="product-details">
<h4 class="title" >Blue - Standard</h4>
<a class="learn-more" data-test-selector="linkViewMoreDetails" href="https://productwebpage.com">Learn More</a>
<div class="tocart" <a class="" href="/store/addtocartplp?productId=3593" id="AddToCartSimple-3593">Add To Cart</a></div>
</div>
<div class="product-details">
<h4 class="title" >Blue - Wide</h4>
<a class="learn-more" data-test-selector="linkViewMoreDetails" href="https://productwebpage.com">Learn More</a>
<div class="tocart" <a class="disAddtoCardBtn" href="javascript:void(0)" id="AddToCartSimple-3576" >SOLD</a></div>
</div>"""
soup = BeautifulSoup(html, "html.parser")
for tag in soup.select(".product-details > h4 a[title*='Blue']"):
if tag.find_next("div").text != "SOLD":
print(tag.find_next("div")["id"])
Output:
AddToCartSimple-3593

BeautifulSoup: find all tags before stopping condition is met

I'm trying to extract a class tag from an HTML file, but only if it is located before a given stopping point. What I have is:
import requests
from bs4 import BeautifulSoup
page = requests.get("https://mysite")
soup = BeautifulSoup(page.content, 'html.parser')
class_extr = soup.find_all("span", class_="myclass")
This works, but it finds all instances of myclass, and i only want those before the following text shows in the soup:
<h4 class="cat-title" id="55">
Title text N1
<small>
Title text N2.
</small>
</h4>
The thing that makes this block unique are the Title text N lines, especially the Title text N2. line. There are many cat-title tags before, so I can't use that as a stopping condition.
The code surrounding this block looks like this:
...
<div class="myc">
<a class="bbb" href="linkhere_893">
<span class="myclass">Text893</span>
<img data-lazy="https://link893.jpg"/>
</a>
</div>
<div class="myc">
<a class="bbb" href="linkhere_96">
<span class="myclass">Text96</span>
<img data-lazy="https://link96.jpg"/>
</a>
</div>
</div><!-- This closes a list that starts above -->
<h4 class="cat-title" id="55">Title text N1 <small> Title text N2.</small></h4>
<div class="list" id="55">
<div class="myc">
<a class="bbb" href="linkhere_34">
<span class="myclass">Text34</span>
<img data-lazy="https://link34.jpg"/>
</a>
</div>
<div class="myc">
...
continuing both above and below.
How can I do this?

Try using find_all_previous():
import requests
from bs4 import BeautifulSoup
page = requests.get("https://mysite")
soup = BeautifulSoup(page.content, 'html.parser')
stop_at = soup.find("h4", class_="cat-title", id='55') # finds your stop tag
class_extr = stop_at.find_all_previous("span", class_="myclass")
This will stop at the first <h4 class='cat-title', id=55> tag in the event that there are multiple.
Reference: Beautiful Soup Documentation

How about this:
page = requests.get("https://mysite")
# Split your page and unwanted string, then parse with BeautifulSoup
text = page.text.split('Title text N2.')
soup = BeautifulSoup(text[0], 'html.parser')
class_extr = soup.find_all("span", class_="myclass")

You can try something like this:
from bs4 import BeautifulSoup
page = """
<html><body><p>
<span class="myclass">text 1</span>
<span class="myclass">text 2</span>
</p>
<h4 class="cat-title" id="55">
Title text N1
<small>
Title text N2.
</small>
</h4>
<p>
<span class="myclass">text 3</span>
<span class="myclass">text 4</span>
</p>
</body>
</html>
"""
soup = BeautifulSoup(page, 'html.parser')
for i in soup.find_all():
if i.name == 'h4' and i.has_attr('class') and i['class'][0] == 'cat-title' and i.has_attr('id') and i['id'] == '55':
if i.find("small") and i.find("small").text.strip()== "Title text N2.":
break
elif i.name == 'span'and i.has_attr('class') and i['class'][0] == 'myclass':
print (i)
Outputs:
<span class="myclass">text 1</span>
<span class="myclass">text 2</span>

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

beautiful soup misses text when scraping html - python

Related

BeautifulSoup Returns empty list which leads to an IndexError in my Python code

Scrape values inside span class webpage with beautifulsoup python

GitHub get commits number using python and beautiful soup

Parsing HTML with BeatifulSoup class == AND title CONTAINS

BeautifulSoup: find all tags before stopping condition is met

Categories

Resources