I'm using BeautifulSoup and Requests to scrape allrecipes user data.
When inspecting the HTML code I find that the data I want is contained within
<article class="profile-review-card">
However when I use the following code
URL = 'http://allrecipes.com/cook/2010/reviews/'
response = requests.get(URL ).content
soup = BeautifulSoup(response, 'html.parser')
X = soup.find_all('article', class_ = "profile-review-card" )
While soup and response are full of html, X is empty. I've looked through and there are some inconsistencies between what I see with inspect element and requests.get(URL).content, what is going on?
What Chrome inspect shows me
That's because it's loaded using Ajax/javascript. Requests library doesn't handle that, you'll need to use something that can execute these scripts and get the dom. There are various options, I'll list a couple to get you started.
Selenium
ghost.py
Related
I am trying to get the links to the individual search results on a website (National Gallery of Art). But the link to the search doesn't load the search results. Here is how I try to do it:
url = 'https://www.nga.gov/collection-search-result.html?artist=C%C3%A9zanne%2C%20Paul'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
I can see that the links to the individual results could be found under soup.findAll('a') but they do not appear, instead the last output is a link to empty search result:
https://www.nga.gov/content/ngaweb/collection-search-result.html
How could I get a list of links, the first of which is the first search result (https://www.nga.gov/collection/art-object-page.52389.html), the second is the second search result (https://www.nga.gov/collection/art-object-page.52085.html) etc?
Actually, data is generating from api calls json response. Here is the desired
list of links.
Code:
import requests
import json
url= 'https://www.nga.gov/collection-search-result/jcr:content/parmain/facetcomponent/parList/collectionsearchresu.pageSize__30.pageNumber__1.json?artist=C%C3%A9zanne%2C%20Paul&_=1634762134895'
r = requests.get(url)
for item in r.json()['results']:
url = item['url']
abs_url = f'https://www.nga.gov{url}'
print(abs_url)
Output:
https://www.nga.gov/content/ngaweb/collection/art-object-page.52389.html
https://www.nga.gov/content/ngaweb/collection/art-object-page.52085.html
https://www.nga.gov/content/ngaweb/collection/art-object-page.46577.html
https://www.nga.gov/content/ngaweb/collection/art-object-page.46580.html
https://www.nga.gov/content/ngaweb/collection/art-object-page.46578.html
https://www.nga.gov/content/ngaweb/collection/art-object-page.136014.html
https://www.nga.gov/content/ngaweb/collection/art-object-page.46576.html
https://www.nga.gov/content/ngaweb/collection/art-object-page.53120.html
https://www.nga.gov/content/ngaweb/collection/art-object-page.54129.html
https://www.nga.gov/content/ngaweb/collection/art-object-page.52165.html
https://www.nga.gov/content/ngaweb/collection/art-object-page.46575.html
https://www.nga.gov/content/ngaweb/collection/art-object-page.53122.html
https://www.nga.gov/content/ngaweb/collection/art-object-page.93044.html
https://www.nga.gov/content/ngaweb/collection/art-object-page.66405.html
https://www.nga.gov/content/ngaweb/collection/art-object-page.53119.html
https://www.nga.gov/content/ngaweb/collection/art-object-page.53121.html
https://www.nga.gov/content/ngaweb/collection/art-object-page.46579.html
https://www.nga.gov/content/ngaweb/collection/art-object-page.66406.html
https://www.nga.gov/content/ngaweb/collection/art-object-page.45866.html
https://www.nga.gov/content/ngaweb/collection/art-object-page.53123.html
https://www.nga.gov/content/ngaweb/collection/art-object-page.45867.html
https://www.nga.gov/content/ngaweb/collection/art-object-page.45986.html
https://www.nga.gov/content/ngaweb/collection/art-object-page.45877.html
https://www.nga.gov/content/ngaweb/collection/art-object-page.136025.html
https://www.nga.gov/content/ngaweb/collection/art-object-page.74193.html
https://www.nga.gov/content/ngaweb/collection/art-object-page.74192.html
https://www.nga.gov/content/ngaweb/collection/art-object-page.66486.html
https://www.nga.gov/content/ngaweb/collection/art-object-page.76288.html
https://www.nga.gov/content/ngaweb/collection/art-object-page.76223.html
https://www.nga.gov/content/ngaweb/collection/art-object-page.76268.html
This seems to work for me:
from bs4 import BeautifulSoup
import requests
url = 'https://www.nga.gov/collection-search-result.html?artist=C%C3%A9zanne%2C%20Paul'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
for a in soup.findAll('a'):
print(a['href'])
It returns all of the html a href links.
For the links from the search results specifically, those are loaded via AJAX and you would need to implement something that renders the javascript like headless chrome. You can read about one of the ways to implement this here, which fits your use case very closely. http://theautomatic.net/2019/01/19/scraping-data-from-javascript-webpage-python/
If you want to ask how to render javascript from python and then parse the result, you would need to close this question and open a new one, as it is not scoped correctly as is.
Hi, I want to get the text(number 18) from em tag as shown in the picture above.
When I ran my code, it did not work and gave me only empty list. Can anyone help me? Thank you~
here is my code.
from urllib.request import urlopen
from bs4 import BeautifulSoup
url = 'https://blog.naver.com/kwoohyun761/221945923725'
html = urlopen(url)
soup = BeautifulSoup(html, 'lxml')
likes = soup.find_all('em', class_='u_cnt _count')
print(likes)
When you disable javascript you'll see that the like count is loaded dynamically, so you have to use a service that renders the website and then you can parse the content.
You can use an API: https://www.scraperapi.com/
Or run your own for example: https://github.com/scrapinghub/splash
EDIT:
First of all, I missed that you were using urlopen incorrectly the correct way is described here: https://docs.python.org/3/howto/urllib2.html . Assuming you are using python3, which seems to be the case judging by the print statement.
Furthermore: looking at the issue again it is a bit more complicated. When you look at the source code of the page it actually loads an iframe and in that iframe you have the actual content: Hit ctrl + u to see the source code of the original url, since the side seems to block the browser context menu.
So in order to achieve your crawling objective you have to first grab the initial page and then grab the page you are interested in:
from urllib.request import urlopen
from bs4 import BeautifulSoup
# original url
url = "https://blog.naver.com/kwoohyun761/221945923725"
with urlopen(url) as response:
html = response.read()
soup = BeautifulSoup(html, 'lxml')
iframe = soup.find('iframe')
# iframe grabbed, construct real url
print(iframe['src'])
real_url = "https://blog.naver.com" + iframe['src']
# do your crawling
with urlopen(real_url) as response:
html = response.read()
soup = BeautifulSoup(html, 'lxml')
likes = soup.find_all('em', class_='u_cnt _count')
print(likes)
You might be able to avoid one round trip by analyzing the original url and the URL in the iframe. At first glance it looked like the iframe url can be constructed from the original url.
You'll still need a rendered version of the iframe url to grab your desired value.
I don't know what this site is about, but it seems they do not want to get crawled maybe you respect that.
I am trying to scrape a web-page to collect a list of Fortune 500 companies. However, when I run this code, BeautifulSoup can't find <div class="rt-tr-group" role="rowgroup"> tags.
import requests
from bs4 import BeautifulSoup
url = r'https://fortune.com/fortune500/2019/search/'
page = requests.get(url)
soup = BeautifulSoup(page.content, 'lxml')
data = soup.find_all('div', {'class': 'rt-tr-group'})
Instead, I just get an empty list. I've tried changing the parser but saw no results.
The tags exist and can be seen
here:
Data is loading on that page using JS, after some time.
Using Selenium, you can wait for page to be loaded completely, or try to get data from Javascript.
P.S. You can check for XHR requests and try to get JSON instead, without parsing. Here is one request
Content of your parsing page loading with JS, and you can get empty page with requests.get.
I'm trying to parse a page to learn beautifulSoup, here is the code
import requests as req
from bs4 import BeautifulSoup
page = 'https://www.pathofexile.com/trade/search/Delirium/w0brcb'
resp = req.get(page)
soup = BeautifulSoup(resp.content, 'html.parser')
res = soup.find_all('results')
print(len(res))
Result: 0
The goal is to get the first price.
I tried to look for the tag in Chrome and it's there, but probably the browser does another requests to get the results.
Can someone explain what am I missing here?
website's source code
Problems with the code
Your code is looking for a "results"-element. What you really have to look for (based on your screenshot) is a div-element with the class "results".
So try this:
soup.find_all("div", attrs={"class":"results"})
But if you want the price you have to dig deeper for the element which contains the price:
price = soup.find("span", attrs={"data-field":"price"}).text
Problems with the site
It seems the site is loading data by Ajax. With Requests you get the page before/without Ajax data call.
In this case you should change from Requests to Selenium module. This will navigate through a "real Browser" and you can wait until data is finally loaded before you start scraping.
Documentation: Selenium
I'm trying to scrape the links for the top 10 articles on medium each day - by the looks of it, it seems like all the article links are in the class "postArticle-content," but when I run this code, I only get the top 3. Is there a way to get all 10?
from bs4 import BeautifulSoup
import requests
r = requests.get("https://medium.com/browse/726a53df8c8b")
data = r.text
soup = BeautifulSoup(data)
data = soup.findAll('div', attrs={'class' : 'postArticle-content'})
for div in data:
links = div.findAll('a')
for link in links:
print(link.get('href'))
requests gave you the entire results.
That page contains only the first three. The website's design is to use javascript code, running in the browser, to load additional content and add it to the page.
You need an entire web browser, with a javascript engine, to do what you are trying to do. The requests and beautiful-soup libraries are not a web browser. They are merely an implementation of the HTTP protocol and an HTML parser, respectively.