I want to scrape photo links from Facebook's HTML. The relevant markup looks like this:
<img class="_7jys img" src="https://scontent-arn2-2.xx.fbcdn.net/v/t39.16868-6/s600x600/70767091_398966137483846_5448404540279750656_n.jpg?_ncx" alt="">
I am trying to do:
import requests
from bs4 import BeautifulSoup

url = 'https://www.facebook.com/ads/archive/render_ad/?'
response = requests.get(url, timeout=5)
content = BeautifulSoup(response.content, "html.parser")
garbage = []
for item in content.findAll('src', attrs={'class': 'img' }):
    print(item)
    garbage.append(item.text)
It returns an empty list. How do I access the tag?
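For reference, find_all() takes a tag name as its first argument, and src is an attribute of the <img> tag rather than a tag itself, which is why the search above matches nothing. Below is a minimal sketch of looking up the tag and then reading the attribute. It runs against the snippet from the question instead of the live page, and the variable names are only illustrative:

```python
from bs4 import BeautifulSoup

# The <img> tag from the question, used here in place of the live response.
html_snippet = (
    '<img class="_7jys img" '
    'src="https://scontent-arn2-2.xx.fbcdn.net/v/t39.16868-6/s600x600/'
    '70767091_398966137483846_5448404540279750656_n.jpg?_ncx" alt="">'
)

soup = BeautifulSoup(html_snippet, "html.parser")

# Search for <img> tags whose class list contains "img",
# then read the src attribute of each match.
photo_links = [img["src"] for img in soup.find_all("img", class_="img")]
print(photo_links)
```

If the live page builds the ad with JavaScript, the tag may not be present in response.content at all, in which case no selector will find it.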
Related
This is part of the HTML I am extracting from the site. It contains the snippet I am after: the value of the href attribute of the tag with the class "bookTitle".
</div>
<div class="elementList" style="padding-top: 10px;">
<div class="left" style="width: 75%;">
<a class="leftAlignedImage" href="/book/show/2784.Ways_of_Seeing" title="Ways of Seeing"><img alt="Ways of Seeing" src="https://i.gr-assets.com/images/S/compressed.photo.goodreads.com/books/1464018308l/2784._SY75_.jpg"/></a>
<a class="bookTitle" href="/book/show/2784.Ways_of_Seeing">Ways of Seeing (Paperback)</a>
<br/>
<span class="by">by</span>
<span itemprop="author" itemscope="" itemtype="http://schema.org/Person">
<div class="authorName__container">
<a class="authorName" href="https://www.goodreads.com/author/show/29919.John_Berger" itemprop="url"><span itemprop="name">John Berger</span></a>
</div>
After logging in with the mechanize library, I use the code below to try to extract it. As written it prints the name of the book, which is what the code asks for. I have tried several ways to get only the href value, but none has worked so far.
from bs4 import BeautifulSoup as bs4
from requests import Session
from lxml import html
import Downloader as dw
import mechanize as mc  # assumed: the code uses mc.Browser but this import was not shown
import requests

# br is the mechanize Browser that is already logged in (login code not shown)

def getGenders(browser: mc.Browser, url: str, name: str) -> None:
    res = browser.open(url)
    aux = res.read()
    html2 = bs4(aux, 'html.parser')
    with open(name, "w", encoding='utf-8') as file2:
        file2.write(str(html2))

getGenders(br, "https://www.goodreads.com/shelf/show/art", "gendersBooks.html")

with open("gendersBooks.html", "r", encoding='utf8') as file:
    contents = file.read()

bsObj = bs4(contents, "lxml")
aux = open("books.text", "w", encoding='utf8')
officials = bsObj.find_all('a', {'class': 'bookTitle'})
for text in officials:
    print(text.get_text())
    aux.write(text.get_text().format())
aux.close()
file.close()
Can you try this? (sorry if it doesn't work, I am not on a pc with python right now)
for text in officials:
    print(text['href'])
BeautifulSoup works fine with the HTML you provided. To get the text of a tag, use ".text"; to get the href, use ".get('href')", or "['href']" if you are sure the tag has an href attribute.
Here is a simple example using your HTML snippet.
from bs4 import BeautifulSoup
html_code = '''
</div>
<div class="elementList" style="padding-top: 10px;">
<div class="left" style="width: 75%;">
<a class="leftAlignedImage" href="/book/show/2784.Ways_of_Seeing" title="Ways of Seeing"><img alt="Ways of Seeing" src="https://i.gr-assets.com/images/S/compressed.photo.goodreads.com/books/1464018308l/2784._SY75_.jpg"/></a>
<a class="bookTitle" href="/book/show/2784.Ways_of_Seeing">Ways of Seeing (Paperback)</a>
<br/>
<span class="by">by</span>
<span itemprop="author" itemscope="" itemtype="http://schema.org/Person">
<div class="authorName__container">
<a class="authorName" href="https://www.goodreads.com/author/show/29919.John_Berger" itemprop="url"><span itemprop="name">John Berger</span></a>
</div>
'''
soup = BeautifulSoup(html_code, 'html.parser')
tag = soup.find('a', {'class':'bookTitle'})
# - Book Title -
title = tag.text
print(title)
# - Href Link -
href = tag.get('href')
print(href)
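The difference between ".get('href')" and "['href']" only shows up when the attribute is missing. Here is a small sketch of that case, using made-up markup in which the second link has no href:

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the first link has an href, the second does not.
markup = (
    '<a class="bookTitle" href="/book/show/2784">A</a>'
    '<a class="bookTitle">B</a>'
)
soup = BeautifulSoup(markup, "html.parser")
with_href, without_href = soup.find_all("a", class_="bookTitle")

print(with_href.get("href"))     # the attribute value
print(without_href.get("href"))  # None, because the attribute is missing

try:
    without_href["href"]
    raised = False
except KeyError:
    raised = True  # subscripting a missing attribute raises KeyError
print(raised)
```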
I don't know why you download the HTML, save it to disk and then open it again. If you just want some tag values, that round trip is unnecessary: keep the HTML in a variable and pass that variable to BeautifulSoup.
I also see that you imported the requests library but used mechanize instead. As far as I know, requests is the easiest and most modern library for fetching web pages in Python. You also imported Session from requests; a session is not necessary unless you want to make multiple requests and either keep the connection to the server open so that subsequent requests are faster, or keep cookies (for example a login) between them.
Also, when you open a file with the "with" statement you are using a Python context manager, which closes the file for you, so you don't have to close it yourself at the end.
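A quick way to see this for yourself is the sketch below. It writes to a throwaway temporary file (the path and file name are made up for the example) and checks the file's closed flag inside and after the block:

```python
import os
import tempfile

# Hypothetical file name inside a throwaway temporary directory.
path = os.path.join(tempfile.mkdtemp(), "books.txt")

with open(path, "w", encoding="utf-8") as out_file:
    out_file.write("Ways of Seeing (Paperback)\n")
    print(out_file.closed)  # False: still open inside the block

print(out_file.closed)      # True: closed automatically on leaving the block
```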
So, simplified and without saving the downloaded HTML to disk, I would write your code like this:
from bs4 import BeautifulSoup
import requests
url = 'https://www.goodreads.com/shelf/show/art'
html_source = requests.get(url).content
soup = BeautifulSoup(html_source, 'html.parser')
# - To get the tag that we want -
tag = soup.find('a', {'class': 'bookTitle'})
# - Extract Book Title -
title = tag.text
# - Extract href from Tag -
href = tag.get('href')
If there are multiple "a" tags with the same class name, ('a', {'class': 'bookTitle'}), do it like this.
First get all the "a" tags:
a_tags = soup.find_all('a', {'class': 'bookTitle'})
Then scrape the info from each tag and append it to a books list:
books = []
for a in a_tags:
    title = a.text
    href = a.get('href')  # None if the tag has no href
    books.append({'title': title, 'href': href})  # <-- add each book dict to books list
    print(title)
    print(href)
To understand your code better, I advise you to read these related links:
BeautifulSoup:
https://www.crummy.com/software/BeautifulSoup/bs4/doc/
requests:
https://requests.readthedocs.io/en/master/
Python Context Manager:
https://book.pythontips.com/en/latest/context_managers.html
https://effbot.org/zone/python-with-statement.htm
I'm trying to extract Some_Product_Title from this block of HTML code
<div id="titleSection" class="a-section a-spacing-none">
<h1 id="title" class="a-size-large a-spacing-none">
<span id="productTitle" class="a-size-large">
Some_Product_Title
</span>
The lines below are working fine
page = requests.get(URL, headers = headers)
soup = BeautifulSoup(page.content, 'html.parser')
But the code below is not
title = soup.find_all(id="productTitle")
Since when I try print(title) I get None as the console output
Does anyone know how to fix this?
You're probably having trouble with .find() because the site you are building the soup from is, in all likelihood, generating its HTML with JavaScript.
If this is the case, re-parsing the prettified markup is a workaround worth trying, although it can only find elements that are already present in page.content, since BeautifulSoup does not run JavaScript:
soup1 = BeautifulSoup(page.content, "html.parser")
soup2 = BeautifulSoup(soup1.prettify(), "html.parser")
title = soup2.find(id = "productTitle")
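To check which situation applies, it can help to test whether the element is in the downloaded markup at all. This is a minimal sketch that uses two made-up HTML strings in place of page.content. The helper find_product_title is hypothetical and not part of BeautifulSoup:

```python
from bs4 import BeautifulSoup

# Hypothetical responses: one contains the element, the other is a page
# whose content would only be filled in later by a script.
static_html = '<h1 id="title"><span id="productTitle"> Some_Product_Title </span></h1>'
script_only_html = '<div id="root"></div><script src="app.js"></script>'

def find_product_title(markup):
    """Return the stripped title text, or None if the id is not in the markup."""
    tag = BeautifulSoup(markup, "html.parser").find(id="productTitle")
    return tag.get_text(strip=True) if tag is not None else None

print(find_product_title(static_html))
print(find_product_title(script_only_html))
```

If the second case is what you see for the real page, the element is not in the response and a different way of fetching the page is needed.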
BS4 has CSS selectors built in so you can use:
soup.select('#productTitle')
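Note that select() returns a list of matches, while select_one() returns only the first match (or None). A short sketch against the snippet from the question, with the closing tags filled in as an assumption about the rest of the markup:

```python
from bs4 import BeautifulSoup

# Snippet from the question; the closing </h1> and </div> are assumed.
html = """
<div id="titleSection" class="a-section a-spacing-none">
<h1 id="title" class="a-size-large a-spacing-none">
<span id="productTitle" class="a-size-large">
Some_Product_Title
</span>
</h1>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

matches = soup.select("#productTitle")    # list of matching tags
first = soup.select_one("#productTitle")  # first match, or None

# strip=True removes the surrounding newlines from the text
title = first.get_text(strip=True)
print(len(matches), title)
```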
This would also work:
title = soup.find_all("span", { "id" : "productTitle" })
import requests
from bs4 import BeautifulSoup
URL = 'https://your-own.address/some-thing'
headers = {'User-Agent': 'Mozilla/5.0'}  # placeholder; use the headers from your own code
page = requests.get(URL, headers=headers)
soup = BeautifulSoup(page.content, 'html.parser')
title = soup.find_all(id="productTitle")
print(*title)
The code I am using to scrape the content
from urllib import request
from bs4 import BeautifulSoup

class Scraper(object):
    # contains methods to scrape data from curse
    @staticmethod
    def scrape(url):
        req = request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
        return request.urlopen(req).read()

    @staticmethod
    def lookup(page, tag, class_name):
        parsed = BeautifulSoup(page, "html.parser")
        return parsed.find_all(tag, class_=class_name)
This returns a list with entries similar to this
<li class="title"><h4><a href="/addons/wow/world-quest-tracker">World Quest Tracker</a></h4></li>
I'm attempting to extract the text inbetween the href tags, in this instance
World Quest Tracker
How could I accomplish this?
Try this.
from bs4 import BeautifulSoup
html='''
<li class="title"><h4><a href="/addons/wow/world-quest-tracker">World Quest Tracker</a></h4></li>
'''
soup = BeautifulSoup(html, "lxml")
for item in soup.select(".title"):
    print(item.text)
Result:
World Quest Tracker
from bs4 import BeautifulSoup
html_doc = '<li class="title"><h4><a href="/addons/wow/world-quest-tracker">World Quest Tracker</a></h4></li>'
soup = BeautifulSoup(html_doc, 'html.parser')
print(soup.find('a').text)
This will print:
World Quest Tracker
I'm attempting to extract the text inbetween the href tags
If you actually want the text in the href attribute, and not the text content wrapped by the <a></a> anchor (your wording is a bit unclear), use get('href'):
from bs4 import BeautifulSoup
html = '<li class="title"><h4><a href="/addons/wow/world-quest-tracker">World Quest Tracker</a></h4></li>'
soup = BeautifulSoup(html, 'lxml')
print(soup.find('a').get('href'))
Result:
/addons/wow/world-quest-tracker
I have a list of urls
yandex.ru/search?text=игрушка%20"веселая%20гусеница"%20keenway%20отзывы&lr=47
yandex.ru/search?text=модис&lr=47
yandex.ru/search?text=модис&lr=47
yandex.ru/search?text=авито&lr=47
yandex.ru/search?text=авито&lr=47
yandex.ru/search?text=цветок%20киддиленд%20музыкальный&lr=47
dns-shop.ru/product/c7bf1138670f3361/rul-hori-racing-wheel
dns-shop.ru/product/c7bf1138670f3361/rul-hori-racing-wheel#opinion
kaluga.onlinetrade.ru/catalogue/ruli_dgoystiki_geympadi-c31/hori/reviews/rul_hori_racing_wheel_controller_ps_4_ps4_020e_acps440-274260.html
kazan.onlinetrade.ru/catalogue/ruli_dgoystiki_geympadi-c31/hori/reviews/rul_hori_racing_wheel_controller_xboxone_xbox_005u_acxone34-274261.html
kazan.onlinetrade.ru/catalogue/ruli_dgoystiki_geympadi-c31/hori/reviews/rul_hori_racing_wheel_controller_xboxone_xbox_005u_acxone34-274261.html
ebay.com
I need to get the text of the title tag for each of them.
html = urllib.urlopen(url, proxies=proxies).read()
print html
soup = BeautifulSoup(html, 'html.parser')
titles = soup.title.get_text()
When I print html I get the real source of the pages, but when I try to print the title I get
ERROR: The requested URL could not be retrieved
for most of the URLs.
What's wrong here?
I don't understand why I get this error:
I have a fairly simple function:
def scrape_a(url):
    r = requests.get(url)
    soup = BeautifulSoup(r.content)
    news = soup.find_all("div", attrs={"class": "news"})
    for links in news:
        link = news.find_all("href")
        return link
Here is the structure of the webpage I am trying to scrape:
<div class="news">
<a href="www.link.com">
<h2 class="heading">
heading
</h2>
<div class="teaserImg">
<img alt="" border="0" height="124" src="/image">
</div>
<p> text </p>
</a>
</div>
You are doing two things wrong:
You are calling find_all on the news result set; presumably you meant to call it on the links object, one element in that result set.
There are no <href ...> tags in your document, so searching with find_all('href') is not going to get you anything. You only have tags with an href attribute.
You could correct your code to:
def scrape_a(url):
    r = requests.get(url)
    soup = BeautifulSoup(r.content, "html.parser")
    news = soup.find_all("div", attrs={"class": "news"})
    for links in news:
        # search within one news element for any tag that has an href attribute
        link = links.find_all(href=True)
        return link
to do what I think you tried to do.
I'd use a CSS selector:
def scrape_a(url):
    r = requests.get(url)
    soup = BeautifulSoup(r.content, "html.parser")
    # any element with an href attribute inside a <div class="news">
    news_links = soup.select("div.news [href]")
    if news_links:
        return news_links[0]
If you wanted to return the value of the href attribute (the link itself), you need to extract that too, of course:
        return news_links[0]['href']
If you needed all the link objects, and not the first, simply return news_links for the link objects, or use a list comprehension to extract the URLs:
    return [link['href'] for link in news_links]
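To try the selector without hitting a live site, here is a small self-contained sketch that runs it against the markup from the question. The function name extract_news_links is made up for the example, and it takes an HTML string instead of a URL:

```python
from bs4 import BeautifulSoup

# Markup from the question, lightly condensed.
html = """
<div class="news">
  <a href="www.link.com">
    <h2 class="heading">heading</h2>
    <div class="teaserImg"><img alt="" border="0" height="124" src="/image"></div>
    <p> text </p>
  </a>
</div>
"""

def extract_news_links(markup):
    """Return the href of every element with an href inside a div.news."""
    soup = BeautifulSoup(markup, "html.parser")
    return [link["href"] for link in soup.select("div.news [href]")]

print(extract_news_links(html))
```

The <img> tag is not picked up because it has a src attribute, not an href.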