I dont understand why do i get this error:
I have a fairly simple function:
def scrape_a(url):
r = requests.get(url)
soup = BeautifulSoup(r.content)
news = soup.find_all("div", attrs={"class": "news"})
for links in news:
link = news.find_all("href")
return link
Here is th estructure of webpage I am trying to scrape:
<div class="news">
<a href="www.link.com">
<h2 class="heading">
heading
</h2>
<div class="teaserImg">
<img alt="" border="0" height="124" src="/image">
</div>
<p> text </p>
</a>
</div>
You are doing two things wrong:
You are calling find_all on the news result set; presumably you meant to call it on the links object, one element in that result set.
There are no <href ...> tags in your document, so searching with find_all('href') is not going to get you anything. You only have tags with an href attribute.
You could correct your code to:
def scrape_a(url):
r = requests.get(url)
soup = BeautifulSoup(r.content)
news = soup.find_all("div", attrs={"class": "news"})
for links in news:
link = links.find_all(href=True)
return link
to do what I think you tried to do.
I'd use a CSS selector:
def scrape_a(url):
r = requests.get(url)
soup = BeautifulSoup(r.content)
news_links = soup.select("div.news [href]")
if news_links:
return news_links[0]
If you wanted to return the value of the href attribute (the link itself), you need to extract that too, of course:
return news_links[0]['href']
If you needed all the link objects, and not the first, simply return news_links for the link objects, or use a list comprehension to extract the URLs:
return [link['href'] for link in news_links]
Related
I'm new to BeautifulSoup, I found all the cards, about 12. But when I'm trying to loop through each card and print link href. I kept getting this error
AttributeError: ResultSet object has no attribute 'find'. You're probably treating a list of elements like a single element. Did you call find_all() when you meant to call find()?
from bs4 import BeautifulSoup
with open("index.html") as fp:
soup = BeautifulSoup(fp, 'html.parser')
cards = soup.find_all('div', attrs={'class': 'up-card-section'})
# print(cards)
print(len(cards))
for link in cards.find_all('a'):
print(link.get('href'))
cards = soup.find_all('div', attrs={'class': 'up-card-section'})
Will return a collection of all the div's found, you'll need to loop over them before finding the chil a's.
That said, you should probably use findChildren for finding the a elements.
Example Demo with an minimal piece of HTML
from bs4 import BeautifulSoup
html = """
<div class='up-card-section'>
<div class='foo'>
<a href='example.com'>FooBar</a>
</div>
</div>
<div class='up-card-section'>
<div class='foo'>
<a href='example2.com'>FooBar</a>
</div>
</div>
"""
res = []
soup = BeautifulSoup(html, 'html.parser')
for card in soup.findAll('div', attrs={'class': 'up-card-section'}):
for link in card.findChildren('a', recursive=True):
print(link.get('href'))
Output:
example.com
example2.com
I have been trying to scrape indeed.com and when doing so I ran into a problem. When scraping for the titles of the positions on some results i get 'new' because there is a span before the position name labeled as 'new'. I have tried researching and trying different things i still havent got no where. So i come for help. The position names live within the span title tags but when i scrape for 'span' in some cases i obviously get the 'new' first because it grabs the first span it sees. I have tried to exclude it several ways but havent had any luck.
Indeed Source Code:
<div class="heading4 color-text-primary singleLineTitle tapItem-gutter">
<h2 class="jobTitle jobTitle-color-purple jobTitle-newJob">
<div class="new topLeft holisticNewBlue desktop">
<span class = "label">new</span>
</div>
<span title="Freight Stocker"> Freight Stocker </span>
</h2>
</div>
Code I Tried:
import requests
from bs4 import BeautifulSoup
def extract(page):
headers = {''}
url = f'https://www.indeed.com/jobs?l=Bakersfield%2C%20CA&start={page}&vjk=42cee666fbd2fae9'
r = requests.get(url, headers)
soup = BeautifulSoup(r.content, 'html.parser')
return soup
def transform(soup):
divs = soup.find_all('div', class_ = 'heading4 color-text-primary singleLineTitle tapItem-gutter')
for item in divs:
res = item.find('span').text
print(res)
return
c=extract(0)
transform(c)
Results:
new
Hourly Warehouse Ope
Immediate FT/PT Open
Service Cashier/Rece
new
Cannabis Sales Repreresentative
new
new
new
new
new
You can use a CSS selector .resultContent span[title], which will select all <span> that have a title attribute within the class resultContent.
To use a CSS selector, use the select() method instead of .find():
soup = BeautifulSoup(requests.get(url).content, 'html.parser')
for tag in soup.select(".resultContent span[title]"):
print(tag.text)
This is the part of the html that I am extracting on the platform and it has the snippet I want to get, the value of the href attribute of the tag with the class "booktitle"
</div>
<div class="elementList" style="padding-top: 10px;">
<div class="left" style="width: 75%;">
<a class="leftAlignedImage" href="/book/show/2784.Ways_of_Seeing" title="Ways of Seeing"><img alt="Ways of Seeing" src="https://i.gr-assets.com/images/S/compressed.photo.goodreads.com/books/1464018308l/2784._SY75_.jpg"/></a>
<a class="bookTitle" href="/book/show/2784.Ways_of_Seeing">Ways of Seeing (Paperback)</a>
<br/>
<span class="by">by</span>
<span itemprop="author" itemscope="" itemtype="http://schema.org/Person">
<div class="authorName__container">
<a class="authorName" href="https://www.goodreads.com/author/show/29919.John_Berger" itemprop="url"><span itemprop="name">John Berger</span></a>
</div>
After logging in using the mechanize library I have this piece of code to try to extract it, but here it returns the name of the book as the code asks, I tried several ways to get only the href value but none worked so far
from bs4 import BeautifulSoup as bs4
from requests import Session
from lxml import html
import Downloader as dw
import requests
def getGenders(browser : mc.Browser, url: str, name: str) -> None:
res = browser.open(url)
aux = res.read()
html2 = bs4(aux, 'html.parser')
with open(name, "w", encoding='utf-8') as file2:
file2.write( str( html2 ) )
getGenders(br, "https://www.goodreads.com/shelf/show/art", "gendersBooks.html")
with open("gendersBooks.html", "r", encoding='utf8') as file:
contents = file.read()
bsObj = bs4(contents, "lxml")
aux = open("books.text", "w", encoding='utf8')
officials = bsObj.find_all('a', {'class' : 'booktitle'})
for text in officials:
print(text.get_text())
aux.write(text.get_text().format())
aux.close()
file.close()
Can you try this? (sorry if it doesn't work, I am not on a pc with python right now)
for text in officials:
print(text['href'])
BeautifulSoup works just fine with the html code that you provided, if you want to get the text of a tag you simply use ".text", if you want to get the href you use ".get('href')" or if you are sure the tag has an href value you can use "['href']".
Here is a simple example easy to understand with your html code snipet.
from bs4 import BeautifulSoup
html_code = '''
</div>
<div class="elementList" style="padding-top: 10px;">
<div class="left" style="width: 75%;">
<a class="leftAlignedImage" href="/book/show/2784.Ways_of_Seeing" title="Ways of Seeing"><img alt="Ways of Seeing" src="https://i.gr-assets.com/images/S/compressed.photo.goodreads.com/books/1464018308l/2784._SY75_.jpg"/></a>
<a class="bookTitle" href="/book/show/2784.Ways_of_Seeing">Ways of Seeing (Paperback)</a>
<br/>
<span class="by">by</span>
<span itemprop="author" itemscope="" itemtype="http://schema.org/Person">
<div class="authorName__container">
<a class="authorName" href="https://www.goodreads.com/author/show/29919.John_Berger" itemprop="url"><span itemprop="name">John Berger</span></a>
</div>
'''
soup = BeautifulSoup(html_code, 'html.parser')
tag = soup.find('a', {'class':'bookTitle'})
# - Book Title -
title = tag.text
print(title)
# - Href Link -
href = tag.get('href')
print(href)
I don't know why you downloaded the html and saved it to disk and then open it again, If you just want to get some tag values, then downloading the html, saving to disk and then reopening is totally unnecessary, you can save the html to a variable and then pass that variable to beautifulsoup.
Now I see that you imported requests library, but you used mechanize instead, as far as I know requests is the easiest and the most modern library to use when getting data from web pages in python. I also see that you imported "session" from requests, session is not necessary unless you want to make mulltiple requests and want to keep the connection open with the server for faster subsecuent request's.
Also if you open a file with the "with" statement, you are using python context managers, which handles the closing of a file, which means you don't have to close the file at the end.
So your code more simplify without saving the downloaded 'html' to disk, I will make it like this.
from bs4 import BeautifulSoup
import requests
url = 'https://www.goodreads.com/shelf/show/art/gendersBooks.html'
html_source = requests.get(url).content
soup = BeautifulSoup(html, 'html.parser')
# - To get the tag that we want -
tag = soup.find('a', {'class' : 'booktitle'})
# - Extract Book Title -
href = tag.text
# - Extract href from Tag -
title = tag.get('href')
Now if you got multiple "a" tags with the same class name: ('a', {'class' : 'booktitle'}) then you do it like this.
get all the "a" tags first:
a_tags = soup.findAll('a', {'class' : 'booktitle'})
and then scrape all the book tags info and append each book info to a books list.
books = []
for a in a_tags:
try:
title = a.text
href = a.get('href')
books.append({'title':title, 'href':href}) #<-- add each book dict to books list
print(title)
print(href)
except:
pass
To understand your code better I advise you to read this related links:
BeautifulSoup:
https://www.crummy.com/software/BeautifulSoup/bs4/doc/
requests:
https://requests.readthedocs.io/en/master/
Python Context Manager:
https://book.pythontips.com/en/latest/context_managers.html
https://effbot.org/zone/python-with-statement.htm
I am using BeautifulSoup...
When I run this code:
inside_branding_info = container.div.find("div", "item-branding")
print(inside_branding_info)
It returns:
div class="item-branding">
<a class="item-rating" href="https://www.newegg.com/gigabyte-geforce-rtx-2060-super-gv-n206swf2oc-8gd/p/N82E16814932174?cm_sp=SearchSuccess-_-INFOCARD-_-graphics+cards-_-14-932-174-_-1&Description=graphics+cards&IsFeedbackTab=true#scrollFullInfo"><i class="rating rating-4"></i><span class="item-rating-num">(12)</span></a>
</div>
However, in the HTML inspection this is what I see:
Raw Site HTML
Everytime I run:
inside_branding_info.a.img["title"]
...python thinks I want the "a" tag "item-rating"...not the "a" href tag nested inside of the div "item-branding".
How do I get inside of the "a href" tag, then into the "img", to finally extract the "title" (title = "MSI")? I want the title/brand of the item on the website. I am new to Python. I have only used R and SQL before this instance, any help would be greatly appreciated.
You need a selector path .
Accroding to the img you provided...
soup = BeautifulSoup(data)
img = soup.select('.item-brand > img')
print(img['title'])
The above should work for you.
Try the following
from bs4 import BeautifulSoup
html = """<div class="item-branding">
<a href="https://www.newegg.com/" class="item-brand">
<img src="https://www.newegg.com/" title="MSI" alt="MSI"> ==$0
</a></div>"""
soup = BeautifulSoup(html, features="lxml")
element = soup.select('.item-brand > img:nth-of-type(1)')[0]['title']
print(element)
I wish to scrape photo links from a facebook's html code. It is given as:
<img class="_7jys img" src="https://scontent-arn2-2.xx.fbcdn.net/v/t39.16868-6/s600x600/70767091_398966137483846_5448404540279750656_n.jpg?_ncx" alt="">
I am trying to do:
url = 'https://www.facebook.com/ads/archive/render_ad/?'
response = requests.get(url, timeout=5)
content = BeautifulSoup(response.content, "html.parser")
garbage = []
for item in content.findAll('src', attrs={'class': 'img' }):
print(item)
garbage.append(item.text)
It returns an empty list. How do I access the tag?