Get data from HTML page - python

I have some data from an HTML page, as follows:
<span class="some class abc-vc"> 123</span>
<span class="some class vde-bc"> 435</span>
<span class="some class v9mo-04mg"> 456 </span>
I would like to match only the "some class" part of the class attribute so that I can store the values one by one.
How can I achieve this?
Code:
from urllib.request import Request, urlopen
import bs4
url = 'url'
page = urlopen(url).read()
soup = bs4.BeautifulSoup(page, 'html.parser')
data = soup.find('span',{'class':'some class'})
print (data.text)

You can use a regular expression to find specific items. Try the code below.
from bs4 import BeautifulSoup
import re
data='''<span class="some class abc-vc"> 123</span>
<span class="some class vde-bc"> 435</span>
<span class="some class v9mo-04mg"> 456 </span>'''
soup=BeautifulSoup(data,'html.parser')
for item in soup.find_all('span',class_=re.compile('some class')):
    print(item.text)
Output:
123
435
456

In HTML, distinct classes are separated by spaces, so the last span, for example, has three classes: some, class, and v9mo-04mg.
To search on those individual classes, use a list as your dictionary value. Note that Beautiful Soup treats a list as "match any of these", so this finds tags that have the class some or the class class, which is enough for the sample above:
data = soup.find('span', {'class':['some', 'class']})
If you need all matching tags rather than just the first one, replace the .find() method with .find_all().

They are compound classes. You can join them with "." and pass the result to select:
elements = [item for item in soup.select('.some.class')]
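Putting the answers above together, here is a minimal sketch of collecting the three values into a list so they can be used one by one. It assumes the sample HTML from the question; the variable names html and values are made up for illustration.

```python
from bs4 import BeautifulSoup

html = '''<span class="some class abc-vc"> 123</span>
<span class="some class vde-bc"> 435</span>
<span class="some class v9mo-04mg"> 456 </span>'''

soup = BeautifulSoup(html, 'html.parser')

# 'span.some.class' requires BOTH classes on the same span;
# the third, changing class is simply ignored.
values = [span.get_text(strip=True) for span in soup.select('span.some.class')]
print(values)  # ['123', '435', '456']
```

Because get_text(strip=True) trims the surrounding whitespace, each entry can be converted with int() if numbers are needed.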

Related

Scraping text from a <span> from <span> but both have inner text

These are the HTML tags I want to get the text from:
<span class="ms-2 d-flex">
<span class="d-none d-xl-block me-1"> mobile </span>
ItsMobileNumber
</span>
So it is one main span containing a nested span and some text, 'ItsMobileNumber'.
I want to get only 'ItsMobileNumber', but when I use get_text() it returns both texts, like this:
mobile
ItsMobileNumber
This is my Python code:
print(title.find("span").get_text())
How can I get just 'ItsMobileNumber' and not the inner span's text?
Try something like this:
from bs4 import BeautifulSoup as bs
soup = bs([your html file],'lxml')
data = soup.select("span.ms-2.d-flex")
for datum in data:
    print(list(datum.strings)[2].strip())
The output, based only on your sample html, should be
ItsMobileNumber
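Have you tried this?
data = title.find("span")  # keep the Tag itself rather than its text
# direct text nodes of the outer span only; the nested span's text is skipped
number = [s.strip() for s in data.find_all(string=True, recursive=False) if s.strip()]
Or, if the value consists only of digits:
import re
number = re.findall("[0-9]+", data.get_text())
If you want the first child node of the main span as well, you can do:
main = data.contents[0]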

Find Tags that Match Specific Classes but one class keeps changing

I want to extract information from a div tag which has some specific classes.
The classes are in the format abc def jss238 xyz.
The number in the jss class keeps changing, so after some time the classes become abc def jss384 xyz.
What is the best way to extract the information so that the code doesn't break if the classes change again?
The code I am currently using is
val = soup.findAll('div', class_="abc def jss238 xyz")
I feel regex could be a good way, but could I also ignore the jss class and search using only the other three?
Yes, you can use a regex to match the pattern abc def <three word characters followed by three digits> xyz.
Personally, I would first see whether you can get the data from the source. When classes change like that, it's usually because the page is rendered through JavaScript, which has to fetch the data from somewhere before putting it into the page. If you share the URL and the data you are after, I can check whether that's the case. But here's the regex version:
from bs4 import BeautifulSoup
import re
html = '''<div class="abc def jss238 xyz">jss238 text</div>
<div class="abc def jss384 xyz">jss384 text</div>
<div class="hij klm jss238 xyz">doesn't match the pattern</div>'''
soup = BeautifulSoup(html, 'html.parser')
regex = re.compile(r'abc def \w{3}\d{3} xyz')  # raw string so \w and \d are not treated as string escapes
specialDivs = soup.find_all('div', {'class':regex})
for each in specialDivs:
    print(f'html: {each}\tText: {each.text}')
Output:
html: <div class="abc def jss238 xyz">jss238 text</div> Text: jss238 text
html: <div class="abc def jss384 xyz">jss384 text</div> Text: jss384 text
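As for the second part of the question (searching on the three stable classes only), a CSS selector is one option instead of a regex. This is a sketch on the sample HTML from the answer above, not the only way to do it; the names stable_divs and texts are illustrative.

```python
from bs4 import BeautifulSoup

html = '''<div class="abc def jss238 xyz">jss238 text</div>
<div class="abc def jss384 xyz">jss384 text</div>
<div class="hij klm jss238 xyz">doesn't match the pattern</div>'''

soup = BeautifulSoup(html, 'html.parser')

# Match divs that carry abc, def and xyz, whatever the jss class happens to be.
stable_divs = soup.select('div.abc.def.xyz')
texts = [div.get_text() for div in stable_divs]
print(texts)  # ['jss238 text', 'jss384 text']
```

Unlike the regex, the selector does not depend on the order of the classes or on the jss class having exactly three digits.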

Access 'aria-label' of Yelp review page using BeautifulSoup

I want to access the text ("5 star rating") in the 'aria-label' attribute with BeautifulSoup
<div class="lemon--div__373c0__1mboc i-stars__373c0__1T6rz i-stars--regular-5__373c0__N5JxY border-color--default__373c0__3-ifU overflow--hidden__373c0__2y4YK" aria-label="5 star rating" role="img"><img class="lemon--img__373c0__3GQUb offscreen__373c0__1KofL" src="https://s3-media0.fl.yelpcdn.com/assets/public/stars_v2.yji-52d3d7a328db670d4402843cbddeed89.png" width="132" height="560" alt=""></div>
When I use soup.find_all('div',attrs={'class':' aria-label'}), it returns an empty list.
Can someone please help me with it?
Here aria-label is not a class; it is an attribute of the div tag, so you need to access it as an attribute.
from bs4 import BeautifulSoup
s = """<div class="lemon--div__373c0__1mboc i-stars__373c0__1T6rz i-stars--regular-5__373c0__N5JxY border-color--default__373c0__3-ifU overflow--hidden__373c0__2y4YK" aria-label="5 star rating" role="img"><img class="lemon--img__373c0__3GQUb offscreen__373c0__1KofL" src="https://s3-media0.fl.yelpcdn.com/assets/public/stars_v2.yji-52d3d7a328db670d4402843cbddeed89.png" width="132" height="560" alt=""></div>
"""
soup = BeautifulSoup(s, "html.parser")
print(soup.div["aria-label"])
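If the page holds several ratings, the same attribute lookup can be applied to every matching tag. The sketch below uses shortened, made-up class names (the real Yelp ones are generated and may change), and it assumes every rating div carries an aria-label attribute.

```python
from bs4 import BeautifulSoup

# Simplified stand-in for the Yelp markup; the class names here are hypothetical.
html = '''<div class="i-stars" aria-label="5 star rating" role="img"></div>
<div class="i-stars" aria-label="4 star rating" role="img"></div>
<div class="other">no label here</div>'''

soup = BeautifulSoup(html, 'html.parser')

# attrs={'aria-label': True} keeps only the divs that have the attribute at all.
rating_divs = soup.find_all('div', attrs={'aria-label': True})
ratings = [div['aria-label'] for div in rating_divs]
print(ratings)  # ['5 star rating', '4 star rating']
```

Filtering on the attribute rather than on the class avoids depending on the generated class names.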

Remove redundant class names in HTML using BeautifulSoup

I want to convert:
<span class = "foo">data-1</span>
<span class = "foo">data-2</span>
<span class = "foo">data-3</span>
to
<span class = "foo"> data-1 data-2 data-3 </span>
using BeautifulSoup in Python. This HTML fragment appears in multiple places in the page body, so I want to collapse it before scraping it. The middle span originally had an em class, which is why the spans were separate.
Adapted from this answer to show how this could be used for your span tags:
span_tags = container.find_all('span')  # container: the parent element holding the spans (assumed to exist already)
# combine the text from all the span tags, separated by spaces
text = ' '.join(span.get_text(strip=True) for span in span_tags)
# here you choose a tag you want to preserve and update its text
span_main = span_tags[0]  # you can target it however you want, I just take the first one from the list
span_main.string = text  # replace the text
for tag in span_tags:
    if tag is not span_main:
        tag.decompose()
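For reference, here is a self-contained sketch of the same merge that can be run as-is. The wrapping div and its id "box" are assumptions made for the example, since the question does not show the parent element.

```python
from bs4 import BeautifulSoup

# The <div id="box"> wrapper is hypothetical; use whatever element holds the spans.
html = ('<div id="box"><span class="foo">data-1</span>'
        '<span class="foo">data-2</span>'
        '<span class="foo">data-3</span></div>')

soup = BeautifulSoup(html, 'html.parser')
container = soup.find('div', id='box')
spans = container.find_all('span', class_='foo')

# Join the text of all spans, keep the first span, and drop the rest.
merged_text = ' '.join(span.get_text(strip=True) for span in spans)
spans[0].string = merged_text
for span in spans[1:]:
    span.decompose()

result = str(container)
print(result)  # <div id="box"><span class="foo">data-1 data-2 data-3</span></div>
```

If the spans are separated by newlines in the real page, those whitespace text nodes stay in the container after the extra spans are removed.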

BeautifulSoup Scraping Span Class HTML

I am trying to scrape from the <span class= ''>. The code looks like this on the pages I am scraping:
< span class = "catnum"> Disc Number < / span>
"1"
< br >
< span class = "catnum"> Track Number < / span>
"1"
< br>
< span class = "catnum" > Duration < /span>
"5:28"
<br>
What I need to get are those numbers after the </span> tag. I should also mention I am writing a larger piece of code that is scraping 1200 sites and this will have to loop over 1200 sites where the numbers in the quotation marks will change from page to page.
I tried this code as a test on one page:
from bs4 import BeautifulSoup
soup = BeautifulSoup (open("Smith.html"), "html.parser")
for tag in soup.findAll('span'):
    if tag.has_key('class'):
        if tag['class'] == 'catnum':
            print tag.string
I know that will print ALL the 'span class' tags and not just the three I want, but I thought I would still test it to see if it worked and I got this error:
/Library/Python/2.7/site-packages/bs4/element.py:1527: UserWarning:
has_key is deprecated. Use has_attr("class") instead. key))
As the warning message says, you should use tag.has_attr("class") in place of the deprecated tag.has_key("class") method.
Hope it helps.
Simone
You can constrain your search by attribute {'class': 'catnum'} and the text inside text=re.compile('Disc Number'). Then use .next_sibling to find the text:
from bs4 import BeautifulSoup
import re
s = '''
<span class = "catnum"> Disc Number </span>
"1"
<br/>
<span class = "catnum"> Track Number </span>
"1"
<br/>
<span class = "catnum"> Duration </span>
"5:28"
<br/>'''
soup = BeautifulSoup(s, 'html.parser')
span = soup.find('span', {'class': 'catnum'}, text=re.compile(r'Disc Number'))
print(span.next_sibling)  # parentheses make this work on both Python 2 and Python 3
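To pick up all three values at once, the same .next_sibling idea can be applied to every catnum span. This is a sketch in Python 3 that assumes each value is the text node directly after its span and is wrapped in double quotes, as in the sample; the dictionary name fields is illustrative.

```python
from bs4 import BeautifulSoup

s = '''
<span class="catnum"> Disc Number </span>
"1"
<br/>
<span class="catnum"> Track Number </span>
"1"
<br/>
<span class="catnum"> Duration </span>
"5:28"
<br/>'''

soup = BeautifulSoup(s, 'html.parser')

fields = {}
for span in soup.find_all('span', class_='catnum'):
    label = span.get_text(strip=True)
    sibling = span.next_sibling  # the text node right after </span>, if any
    # Trim whitespace first, then the surrounding double quotes.
    fields[label] = str(sibling).strip().strip('"') if sibling else ''

print(fields)  # {'Disc Number': '1', 'Track Number': '1', 'Duration': '5:28'}
```

When looping over many pages, the same loop can be run per page and the resulting dictionaries collected in a list.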
