How can I print all links within the found elements? - python

I'm new to BeautifulSoup. I found all the cards (about 12), but when I try to loop through each card and print the link's href, I keep getting this error:
AttributeError: ResultSet object has no attribute 'find'. You're probably treating a list of elements like a single element. Did you call find_all() when you meant to call find()?
from bs4 import BeautifulSoup
with open("index.html") as fp:
    soup = BeautifulSoup(fp, 'html.parser')
cards = soup.find_all('div', attrs={'class': 'up-card-section'})
# print(cards)
print(len(cards))
for link in cards.find_all('a'):
    print(link.get('href'))

cards = soup.find_all('div', attrs={'class': 'up-card-section'})
returns a collection (a ResultSet) of all the divs found, so you need to loop over them before looking for the child a elements.
To find the a elements inside each card you can use findChildren, which in bs4 is an older name for find_all.
Example demo with a minimal piece of HTML:
from bs4 import BeautifulSoup
html = """
<div class='up-card-section'>
    <div class='foo'>
        <a href='example.com'>FooBar</a>
    </div>
</div>
<div class='up-card-section'>
    <div class='foo'>
        <a href='example2.com'>FooBar</a>
    </div>
</div>
"""
soup = BeautifulSoup(html, 'html.parser')
# Loop over each card first, then search for links inside that card
for card in soup.find_all('div', attrs={'class': 'up-card-section'}):
    for link in card.findChildren('a', recursive=True):
        print(link.get('href'))
Output:
example.com
example2.com
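As an alternative to the nested loop, a CSS selector can do the card lookup and the link lookup in one step. This is only a sketch that reuses the demo HTML from the answer; the variable name hrefs is made up for illustration.

```python
from bs4 import BeautifulSoup

html = """
<div class='up-card-section'>
    <div class='foo'><a href='example.com'>FooBar</a></div>
</div>
<div class='up-card-section'>
    <div class='foo'><a href='example2.com'>FooBar</a></div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# "div.up-card-section a[href]" matches every <a> that has an href attribute
# and sits anywhere inside a div with the class up-card-section.
hrefs = [a["href"] for a in soup.select("div.up-card-section a[href]")]
print(hrefs)
```

select() returns a list of tags, so there is no ResultSet to call find_all() on by mistake.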

Related

How to scrape for <span title>?

I have been trying to scrape indeed.com and ran into a problem. When scraping the position titles, for some results I get 'new', because there is a span labeled 'new' before the position name. I have researched this and tried different things but haven't gotten anywhere, so I've come for help. The position names live in the span tags that have a title attribute, but when I search for 'span', in some cases I get 'new' first because find() grabs the first span it sees. I have tried to exclude it in several ways without any luck.
Indeed Source Code:
<div class="heading4 color-text-primary singleLineTitle tapItem-gutter">
<h2 class="jobTitle jobTitle-color-purple jobTitle-newJob">
<div class="new topLeft holisticNewBlue desktop">
<span class = "label">new</span>
</div>
<span title="Freight Stocker"> Freight Stocker </span>
</h2>
</div>
Code I Tried:
import requests
from bs4 import BeautifulSoup
def extract(page):
    headers = {''}
    url = f'https://www.indeed.com/jobs?l=Bakersfield%2C%20CA&start={page}&vjk=42cee666fbd2fae9'
    r = requests.get(url, headers)
    soup = BeautifulSoup(r.content, 'html.parser')
    return soup
def transform(soup):
    divs = soup.find_all('div', class_ = 'heading4 color-text-primary singleLineTitle tapItem-gutter')
    for item in divs:
        res = item.find('span').text
        print(res)
    return
c=extract(0)
transform(c)
Results:
new
Hourly Warehouse Ope
Immediate FT/PT Open
Service Cashier/Rece
new
Cannabis Sales Repreresentative
new
new
new
new
new
You can use the CSS selector .resultContent span[title], which selects every <span> that has a title attribute and is a descendant of an element with the class resultContent.
To use a CSS selector, call the select() method instead of find():
soup = BeautifulSoup(requests.get(url).content, 'html.parser')
for tag in soup.select(".resultContent span[title]"):
    print(tag.text)
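The same idea can be tried offline against the HTML fragment posted in the question. That fragment does not include the resultContent wrapper, so this sketch assumes h2.jobTitle as the ancestor instead; the list name titles is made up for illustration.

```python
from bs4 import BeautifulSoup

# The fragment from the question, unchanged apart from indentation.
html = """
<div class="heading4 color-text-primary singleLineTitle tapItem-gutter">
  <h2 class="jobTitle jobTitle-color-purple jobTitle-newJob">
    <div class="new topLeft holisticNewBlue desktop">
      <span class="label">new</span>
    </div>
    <span title="Freight Stocker"> Freight Stocker </span>
  </h2>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# span[title] only matches spans that carry a title attribute, so the
# <span class="label">new</span> badge is skipped.
titles = [span.get_text(strip=True) for span in soup.select("h2.jobTitle span[title]")]
print(titles)
```

If only the attribute value is needed, span["title"] gives the same text without the surrounding whitespace.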

Beautifulsoup find_all() captures too much text

I have some HTML I am parsing in Python using the BeautifulSoup package. Here's the HTML:
<div class='n'>Name</div>
<div class='x'>Address</div>
<div class='x'>Phone</div>
<div class='x c'>Other</div>
I am capturing the results using this code chunk:
names = soup3.find_all('div', {'class': "n"})
contact = soup3.find_all('div', {'class': "x"})
other = soup3.find_all('div', {'class': "x c"})
Right now, both classes 'x' and 'x c' are being captured in the 'contact' variable. How can I prevent this from happening?
Try:
soup.select('div[class="x"]')
Output:
[<div class="x">Address</div>, <div class="x">Phone</div>]
from bs4 import BeautifulSoup
html = """
<div class='n'>Name</div>
<div class='x'>Address</div>
<div class='x'>Phone</div>
<div class='x c'>Other</div>
"""
soup = BeautifulSoup(html, 'html.parser')
contact = soup.findAll("div", class_="x")[1]
print(contact)
Output:
<div class="x">Phone</div>
What about using sets?
others = set(soup.find_all('div', {'class': "x c"}))
contacts = set(soup.find_all('div', {'class': "x"})) - others
others will be {<div class="x c">Other</div>}
and
contacts will be {<div class="x">Phone</div>, <div class="x">Address</div>}
Note that this only works for this specific combination of classes. It may not work in general; that depends on the combinations of classes you have in the HTML.
See BeautifulSoup webscraping find_all( ): finding exact match for more details on how .find_all() works.
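For a match that does not depend on which class combinations happen to appear, one option is to compare each tag's full class list yourself. The sketch below assumes the html.parser backend, where the class attribute is returned as a list; has_exactly_classes is a hypothetical helper, not part of BeautifulSoup.

```python
from bs4 import BeautifulSoup

html = """
<div class='n'>Name</div>
<div class='x'>Address</div>
<div class='x'>Phone</div>
<div class='x c'>Other</div>
"""

soup = BeautifulSoup(html, "html.parser")


def has_exactly_classes(tag, wanted):
    # tag.get("class") is a list such as ["x", "c"]; compare as sets so the
    # order of the class names in the HTML does not matter.
    return set(tag.get("class", [])) == set(wanted)


divs = soup.find_all("div")
contact = [d for d in divs if has_exactly_classes(d, {"x"})]
other = [d for d in divs if has_exactly_classes(d, {"x", "c"})]

contact_text = [d.get_text() for d in contact]
other_text = [d.get_text() for d in other]
print(contact_text, other_text)
```

Unlike the set-difference approach, this keeps the elements in document order.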

How to extract data from the below HTML code using beautifulsoup?

I want to extract data from the divs with class 'cinema' and 'timings' using BeautifulSoup in Python 3. How can I do it using soup.findAll?
<div data-order="0" class="cinema">
<div class="__name">SRS Shoppers Pride Mall<span class="__venue"> - Bijnor</span>
</div>
<div class="timings"><span class="__time _available" onclick="fnPushWzKmEvent('SRBI',ShowData);fnCallSeatLayout('SRBI','22876','ET00015438','01:30 PM');">01:30 PM</span><span class="__time _center _available" onclick="fnPushWzKmEvent('SRBI',ShowData);fnCallSeatLayout('SRBI','22877','ET00015438','04:00 PM');">04:00 PM</span><span class="__time _right _available" onclick="fnPushWzKmEvent('SRBI',ShowData);fnCallSeatLayout('SRBI','22878','ET00015438','06:30 PM');">06:30 PM</span><span class="__time _available" onclick="fnPushWzKmEvent('SRBI',ShowData);fnCallSeatLayout('SRBI','22879','ET00015438','09:00 PM');">09:00 PM</span>
</div>
</div>
This is my code:
for div in soup.findAll('div',{'class':'cinema'}):
    print(div.text)  # It printed nothing, the program just ended
You can specify both classes in findAll:
soup.findAll(True, {'class': ['cinema', 'timings']})
The "div" you are interested in is a child of another "div". To get that "div" you can use the .select() method with a child selector.
from bs4 import BeautifulSoup
html = "<your html>"  # placeholder: put your HTML here
soup = BeautifulSoup(html, 'lxml')
for div in soup.select('div.cinema > div.timings'):
    print(div.get_text(strip=True))
Or iterate over the find_all() result and use the .find() method to get the "div" with class "timings" inside each one:
for div in soup.find_all('div', class_='cinema'):
    timings = div.find('div', class_='timings')
    print(timings.get_text(strip=True))
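To keep the cinema name and its show times together, the two lookups can be combined per cinema. This is a sketch on a simplified copy of the question's HTML (the onclick attributes are dropped for readability); the dictionary name shows is made up for illustration.

```python
from bs4 import BeautifulSoup

# Simplified copy of the HTML from the question, onclick attributes removed.
html = """
<div data-order="0" class="cinema">
  <div class="__name">SRS Shoppers Pride Mall<span class="__venue"> - Bijnor</span></div>
  <div class="timings">
    <span class="__time _available">01:30 PM</span>
    <span class="__time _center _available">04:00 PM</span>
    <span class="__time _right _available">06:30 PM</span>
    <span class="__time _available">09:00 PM</span>
  </div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

shows = {}
for cinema in soup.find_all("div", class_="cinema"):
    name_div = cinema.find("div", class_="__name")
    # Collapse runs of whitespace so the name and venue read as one string.
    name = " ".join(name_div.get_text().split())
    timings_div = cinema.find("div", class_="timings")
    shows[name] = [span.get_text(strip=True) for span in timings_div.find_all("span")]

print(shows)
```

On a real page, find() can return None when a cinema has no name or timings block, so a check before calling get_text() or find_all() would be needed there.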

AttributeError: 'ResultSet' object has no attribute 'find_all' Beautifulsoup

I don't understand why I get this error.
I have a fairly simple function:
def scrape_a(url):
    r = requests.get(url)
    soup = BeautifulSoup(r.content)
    news = soup.find_all("div", attrs={"class": "news"})
    for links in news:
        link = news.find_all("href")
        return link
Here is the structure of the webpage I am trying to scrape:
<div class="news">
<a href="www.link.com">
<h2 class="heading">
heading
</h2>
<div class="teaserImg">
<img alt="" border="0" height="124" src="/image">
</div>
<p> text </p>
</a>
</div>
You are doing two things wrong:
You are calling find_all on the news result set; presumably you meant to call it on the links object, one element in that result set.
There are no <href ...> tags in your document, so searching with find_all('href') is not going to get you anything. You only have tags with an href attribute.
You could correct your code to:
def scrape_a(url):
    r = requests.get(url)
    soup = BeautifulSoup(r.content)
    news = soup.find_all("div", attrs={"class": "news"})
    for links in news:
        link = links.find_all(href=True)
        return link
to do what I think you tried to do.
I'd use a CSS selector:
def scrape_a(url):
    r = requests.get(url)
    soup = BeautifulSoup(r.content)
    news_links = soup.select("div.news [href]")
    if news_links:
        return news_links[0]
If you wanted to return the value of the href attribute (the link itself), you need to extract that too, of course:
return news_links[0]['href']
If you needed all the link objects, and not the first, simply return news_links for the link objects, or use a list comprehension to extract the URLs:
return [link['href'] for link in news_links]
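Putting the pieces together, here is a sketch that runs offline against the HTML structure from the question. extract_news_links is a hypothetical helper that takes the markup as a string instead of fetching a URL, and it narrows the selector to a[href] on the assumption that only anchor links are wanted.

```python
from bs4 import BeautifulSoup

# The page structure posted in the question.
html = """
<div class="news">
  <a href="www.link.com">
    <h2 class="heading">heading</h2>
    <div class="teaserImg"><img alt="" border="0" height="124" src="/image"></div>
    <p> text </p>
  </a>
</div>
"""


def extract_news_links(markup):
    """Return the href of every <a> found inside a div with class news."""
    soup = BeautifulSoup(markup, "html.parser")
    return [a["href"] for a in soup.select("div.news a[href]")]


links = extract_news_links(html)
print(links)
```

Passing "html.parser" explicitly keeps the result independent of which parser libraries happen to be installed.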

I'm trying to collect text with BeautifulSoup using Python

I want to know how I can collect the desired data with Beautiful Soup. Here is the HTML; I'm trying to collect the text "RoSharon1977":
<div id="twitter" class="editable-item">
<div id="twitter-view">
<ul><li>
RoSharon1977
</li></ul>
</div></div>
You have to find the div by its id, then get the next ul element, and keep drilling down until you reach the a element, then get its text:
from bs4 import BeautifulSoup
html = '''<div id="twitter" class="editable-item">
<div id="twitter-view">
<ul><li>
RoSharon1977
</li></ul>
</div></div>'''
soup = BeautifulSoup(html)
print(soup.find('div', attrs={'id': 'twitter-view'}).findNext('ul').findNext('li').findNext('a').text)
Or, depending on how the whole webpage looks, you could simply do:
soup = BeautifulSoup(html)
print(soup.find('a').text)
And if there are multiple a elements:
soup = BeautifulSoup(html)
for a in soup.find_all('a'):
    print(a.text)
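The HTML as posted above contains no a element, so on that exact fragment a lookup for 'a' returns nothing. A sketch that reads the text straight from the li instead, assuming the fragment is exactly as shown (the name username is made up for illustration):

```python
from bs4 import BeautifulSoup

html = '''<div id="twitter" class="editable-item">
<div id="twitter-view">
<ul><li>
RoSharon1977
</li></ul>
</div></div>'''

soup = BeautifulSoup(html, "html.parser")

# "#twitter-view li" selects the first <li> inside the element whose id is
# twitter-view; select_one returns None when nothing matches.
li = soup.select_one("#twitter-view li")
username = li.get_text(strip=True) if li is not None else None
print(username)
```

If the real page wraps the name in a link, get_text() on the li still returns the same text, because it collects the text of all nested elements.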
