How to extract data form the below HTML code using beautifulsoup? - python

I want to extract data from the div with class 'cinema' and 'timings' using BeautifulSoup in python3 . How can i do it using soup.findAll ?
<div data-order="0" class="cinema">
<div class="__name">SRS Shoppers Pride Mall<span class="__venue"> - Bijnor</span>
</div>
<div class="timings"><span class="__time _available" onclick="fnPushWzKmEvent('SRBI',ShowData);fnCallSeatLayout('SRBI','22876','ET00015438','01:30 PM');">01:30 PM</span><span class="__time _center _available" onclick="fnPushWzKmEvent('SRBI',ShowData);fnCallSeatLayout('SRBI','22877','ET00015438','04:00 PM');">04:00 PM</span><span class="__time _right _available" onclick="fnPushWzKmEvent('SRBI',ShowData);fnCallSeatLayout('SRBI','22878','ET00015438','06:30 PM');">06:30 PM</span><span class="__time _available" onclick="fnPushWzKmEvent('SRBI',ShowData);fnCallSeatLayout('SRBI','22879','ET00015438','09:00 PM');">09:00 PM</span>
</div>
</div>
This is my code:
for div in soup.findAll('div',{'class':'cinema'}):
print div.text # It printed nothing ,the program just ended

You can specify both classes in findAll:
soup.findAll(True, {'class': ['cinema', 'timings']})

The "div" you are interested in is another "div" child. To get that "div" you can use the .select method.
from bs4 import BeautifulSoup
html = <your html>
soup = BeautifulSoup(html, 'lxml')
for div in soup.select('div.cinema > div.timings'):
print(div.get_text(strip=True))
Or iterate the find_all() result and use the .find() method to return those "div" where class: "timings"
for div in soup.find_all('div', class_='cinema'):
timings = div.find('div', class_='timings')
print(timings.get_text(strip=True))

Related

How can I print all links within a found elements?

I'm new to BeautifulSoup, I found all the cards, about 12. But when I'm trying to loop through each card and print link href. I kept getting this error
AttributeError: ResultSet object has no attribute 'find'. You're probably treating a list of elements like a single element. Did you call find_all() when you meant to call find()?
from bs4 import BeautifulSoup
with open("index.html") as fp:
soup = BeautifulSoup(fp, 'html.parser')
cards = soup.find_all('div', attrs={'class': 'up-card-section'})
# print(cards)
print(len(cards))
for link in cards.find_all('a'):
print(link.get('href'))
cards = soup.find_all('div', attrs={'class': 'up-card-section'})
Will return a collection of all the div's found, you'll need to loop over them before finding the chil a's.
That said, you should probably use findChildren for finding the a elements.
Example Demo with an minimal piece of HTML
from bs4 import BeautifulSoup
html = """
<div class='up-card-section'>
<div class='foo'>
<a href='example.com'>FooBar</a>
</div>
</div>
<div class='up-card-section'>
<div class='foo'>
<a href='example2.com'>FooBar</a>
</div>
</div>
"""
res = []
soup = BeautifulSoup(html, 'html.parser')
for card in soup.findAll('div', attrs={'class': 'up-card-section'}):
for link in card.findChildren('a', recursive=True):
print(link.get('href'))
Output:
example.com
example2.com

Navigating through html with BeautifulSoup from a specific point

I'm using the following piece of code to find an attribute in a piece of HTML code:
results = soup.findAll("svg", {"data-icon" : "times"})
This works, and it returns me a list with the tag and attributes. However, I would also like to move from that part of the HTML code, to the sibling (if that's the right term) below it, and retrieve the contents of that paragraph. See the example below.
<div class="382"><svg aria-hidden="true" data-icon="times".......</svg></div>
<div class="405"><p>Example</p></div>
I can't seem to figure out how to do this properly. Searching for the div class names does not work, because the class name is randomised.
You can use CSS selector with +:
from bs4 import BeautifulSoup
html_doc = """
<div class="382"><svg aria-hidden="true" data-icon="times"> ... </svg></div>
<div class="405"><p>Example</p></div>
"""
soup = BeautifulSoup(html_doc, "html.parser")
div = soup.select_one('div:has(svg[data-icon="times"]) + div')
print(div.text)
Prints:
Example
Or without CSS selector:
div = soup.find("svg", attrs={"data-icon": "times"}).find_next("div")
print(div.text)
Prints:
Example

How do I specify which a tag I need when scraping in Python?

I am using BeautifulSoup...
When I run this code:
inside_branding_info = container.div.find("div", "item-branding")
print(inside_branding_info)
It returns:
div class="item-branding">
<a class="item-rating" href="https://www.newegg.com/gigabyte-geforce-rtx-2060-super-gv-n206swf2oc-8gd/p/N82E16814932174?cm_sp=SearchSuccess-_-INFOCARD-_-graphics+cards-_-14-932-174-_-1&Description=graphics+cards&IsFeedbackTab=true#scrollFullInfo"><i class="rating rating-4"></i><span class="item-rating-num">(12)</span></a>
</div>
However, in the HTML inspection this is what I see:
Raw Site HTML
Everytime I run:
inside_branding_info.a.img["title"]
...python thinks I want the "a" tag "item-rating"...not the "a" href tag nested inside of the div "item-branding".
How do I get inside of the "a href" tag, then into the "img", to finally extract the "title" (title = "MSI")? I want the title/brand of the item on the website. I am new to Python. I have only used R and SQL before this instance, any help would be greatly appreciated.
You need a selector path .
Accroding to the img you provided...
soup = BeautifulSoup(data)
img = soup.select('.item-brand > img')
print(img['title'])
The above should work for you.
Try the following
from bs4 import BeautifulSoup
html = """<div class="item-branding">
<a href="https://www.newegg.com/" class="item-brand">
<img src="https://www.newegg.com/" title="MSI" alt="MSI"> ==$0
</a></div>"""
soup = BeautifulSoup(html, features="lxml")
element = soup.select('.item-brand > img:nth-of-type(1)')[0]['title']
print(element)

Beautifulsoup find_all() captures too much text

I have some HTML I am parsing in Python using the BeautifulSoup package. Here's the HTML:
<div class='n'>Name</div>
<div class='x'>Address</div>
<div class='x'>Phone</div>
<div class='x c'>Other</div>
I am capturing the results using this code chunk:
names = soup3.find_all('div', {'class': "n"})
contact = soup3.find_all('div', {'class': "x"})
other = soup3.find_all('div', {'class': "x c"})
Right now, both classes 'x' and 'x c' are being captured in the 'contact' variable. How can I prevent this from happening?
Try:
soup.select('div[class="x"]')
Output:
[<div class="x">Address</div>, <div class="x">Phone</div>]
from bs4 import BeautifulSoup
html = """
<div class='n'>Name</div>
<div class='x'>Address</div>
<div class='x'>Phone</div>
<div class='x c'>Other</div>
"""
soup = BeautifulSoup(html, 'html.parser')
contact = soup.findAll("div", class_="x")[1]
print(contact)
Output:
<div class="x">Phone</div>
What about using sets?
others = set(soup.find_all('div', {'class': "x c"}))
contacts = set(soup.find_all('div', {'class': "x"})) - others
others will be {<div class="x c">Other</div>}
and
contacts will be {<div class="x">Phone</div>, <div class="x">Address</div>}
Noted that this will only work in this specific case of classes. It may not work in general, depends on the combinations of classes you have in the HTML.
See BeautifulSoup webscraping find_all( ): finding exact match for more details on how .find_all() works.

BeautifulSoup find tag by string without children text

I'm using Python3 and BeautifulSoup 4.4.0 to extract data from a website. I'm interested in the tables in the div tag but to tell what data is inside a table I have to get the text of the h4 tag then get the sibling which is the table. The problem is that one of the h4 tags has a span and BeautifulSoup returns None for the string value when there is another tag inside.
def get_table_items(self, soup, header_title):
header = soup.find('h4', string=re.compile(r'\b{}\b'.format(header_title), re.I))
header_table = header.find_next_sibling('table')
items = header_table.find_all('td')
return items
The code above works on all h4 except <h4>Unique Title 2<span>(Something)</span></h4>
....
<div id="some_id">
<h4>Unique Title 1</h4>
<table>
...
</table>
<h4>Unique Title 2<span>(Something)</span></h4>
<table>
...
</table>
<h4>Unique Title 3</h4>
<table>
...
</table>
</div>
You might need to do the search manually rather than relying on the regular expression:
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, "html.parser")
header_title = "Unique Title 2"
for h4 in soup.find_all('h4'):
if header_title in h4.text:
...

Categories

Resources