Requests Only Returns Partial Results - python

I'm trying to scrape the links for the top 10 articles on medium each day - by the looks of it, it seems like all the article links are in the class "postArticle-content," but when I run this code, I only get the top 3. Is there a way to get all 10?
from bs4 import BeautifulSoup
import requests
r = requests.get("https://medium.com/browse/726a53df8c8b")
data = r.text
soup = BeautifulSoup(data)
data = soup.findAll('div', attrs={'class' : 'postArticle-content'})
for div in data:
links = div.findAll('a')
for link in links:
print(link.get('href'))

requests gave you the entire results.
That page contains only the first three. The website's design is to use javascript code, running in the browser, to load additional content and add it to the page.
You need an entire web browser, with a javascript engine, to do what you are trying to do. The requests and beautiful-soup libraries are not a web browser. They are merely an implementation of the HTTP protocol and an HTML parser, respectively.

Related

Why is there an empty result while scraping a Exchange Rate table?

I want to scrape Korea Exchange Rate by using http://www.smbs.biz/ExRate/StdExRate.jsp this website.
Daily exchange rate is provided by table, So I tried to scrape using BeautifulSoup, but it's responses are empty.
Table is like,
url = "http://www.smbs.biz/ExRate/StdExRate.jsp"
html = requests.get(url, verify=False).text
soup = BeautifulSoup(html, 'html.parser')
title = soup.select_one('#frm_SearchDate > div:nth-child(17) > table')
title.text
Result :
'\n일별 매매기준율\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n'
Always and first of all, take a look at your soup to see if all the expected ingredients are there.
Data is loaded via XHR request and table is rendered dynamically by JavaScript, That is why you won't get the table with BeautifulSoup cause it could not find it in response of your request.
There are option to get it anyway:
check your browser dev tools on XHR tab to locate the api and pull part of info from there.
use selenium to get driver.page_source with whole table from the 'browser like'rendered version of website.
Example
import requests
from bs4 import BeautifulSoup
url = 'http://www.smbs.biz/ExRate/StdExRate_xml.jsp?arr_value=USD_2023-01-12_2023-02-03'
soup=BeautifulSoup(requests.get(url).text)
{s.get('label'):s.get('value') for s in soup.select('set')}
Output
{'23.01.12': '1245.3',
'23.01.13': '1244.6',
'23.01.16': '1240.6',
'23.01.17': '1234',
'23.01.18': '1238.5',
'23.01.19': '1239.8',
'23.01.20': '1236',
'23.01.25': '1234.4',
'23.01.26': '1233.4',
'23.01.27': '1231.4',
'23.01.30': '1230.2',
'23.01.31': '1228.7',
'23.02.01': '1230.8',
'23.02.02': '1231.4',
'23.02.03': '1219.3'}

Fetch all pages using a Python request, using Beautiful Soup

I tried to fetch all product's name from the web page, but I could have only 12.
If I scroll down the web page then it gets refreshed and adds more information.
How can I to get all information?
import requests
from bs4 import BeautifulSoup
import re
url = "https://www.outre.com/product-category/wigs/"
res = requests.get(url)
res.raise_for_status()
soup = BeautifulSoup(res.text, "lxml")
items = soup.find_all("div", attrs={"class":"title-wrapper"})
for item in items:
print(item.p.a.get_text())
Your code is good. The thing is on the website; the products are dynamically loaded, so when you do your request you can only get the first 12 products.
You can check the developer console inside your browser to track the Ajax call made during browsing.
I did it, and it turns out a call is made to retrieve more product to the URL
https://www.outre.com/product-category/wigs/page/2/
So if you want to get all the products you need to browse multiple pages. I suggest you to use a loop and use your code several times.
N.B.: You can try to check the website to see is there is a more convenient place to get the product (like not from the main page)
The page loads the products from different URL via JavaScript, so Beautiful Soup doesn't see it. To get all pages, you can use the following example:
import requests
from bs4 import BeautifulSoup
url = "https://www.outre.com/product-category/wigs/page/{}/"
page = 1
while True:
soup = BeautifulSoup(requests.get(url.format(page)).content, "html.parser")
titles = soup.select(".product-title")
if not titles:
break
for title in titles:
print(title.text)
page += 1
Prints:
...
Wet & Wavy Loose Curl 18″
Wet & Wavy Boho Curl 20″
Nikaya
Jeanette
Natural Glam Body
Natural Free Deep

BeautifulSoup doesn't return the expected tag as Chrome

I'm trying to parse a page to learn beautifulSoup, here is the code
import requests as req
from bs4 import BeautifulSoup
page = 'https://www.pathofexile.com/trade/search/Delirium/w0brcb'
resp = req.get(page)
soup = BeautifulSoup(resp.content, 'html.parser')
res = soup.find_all('results')
print(len(res))
Result: 0
The goal is to get the first price.
I tried to look for the tag in Chrome and it's there, but probably the browser does another requests to get the results.
Can someone explain what am I missing here?
website's source code
Problems with the code
Your code is looking for a "results"-element. What you really have to look for (based on your screenshot) is a div-element with the class "results".
So try this:
soup.find_all("div", attrs={"class":"results"})
But if you want the price you have to dig deeper for the element which contains the price:
price = soup.find("span", attrs={"data-field":"price"}).text
Problems with the site
It seems the site is loading data by Ajax. With Requests you get the page before/without Ajax data call.
In this case you should change from Requests to Selenium module. This will navigate through a "real Browser" and you can wait until data is finally loaded before you start scraping.
Documentation: Selenium

Unable to navigate Amazon pagination with Python and BS4

I've been trying to create a simple web scraper program to scrape the book titles of a 100 bestseller list on Amazon. I've used this code before on another site with no problems. But for some reason, it scraps the first page fine but then posts the same results for the following iterations.
I'm not sure if it's something to do with how Amazon creates its urls or not. When I manually enter the "#2" (and beyond) at the end of the url in the browser it navigates fine.
(Once the scrape is working I plan on dumping the data in csv files. But for now, print to the terminal will do.)
import requests
from bs4 import BeautifulSoup
for i in range(5):
url = "https://smile.amazon.com/Best-Sellers-Kindle-Store-Dystopian-Science-Fiction/zgbs/digital-text/6361470011/ref=zg_bs_nav_kstore_4_158591011#{}".format(i)
r = requests.get(url)
soup = BeautifulSoup(r.content, "lxml")
for book in soup.find_all('div', class_='zg_itemWrapper'):
title = book.find('div', class_='p13n-sc-truncate')
name = book.find('a', class_='a-link-child')
price = book.find('span', class_='p13n-sc-price')
print(title)
print(name)
print(price)
print("END")
This is a common problem that you have to face, some sites load the data asynchronous(with ajax) those are XMLHttpRequest that you can see in the tab networking of your DOM inspector. Usually the websites load the data from a different endpoint with POST method to solve that you can use urllib or requests library.
In this case the request is through a GET method and you can scrape it from this url with no need of extend your code https://www.amazon.com/Best-Sellers-Kindle-Store-Dystopian-Science-Fiction/zgbs/digital-text/6361470011/ref=zg_bs_pg_3?_encoding=UTF8&pg=3&ajax=1 where you only change the pg parameter

Accessing tabular data via hyperlinks with BeautifulSoup

There's something I still don't understand about using BeautifulSoup. I can use this to parse the raw HTML of a webpage, here "example_website.com":
from bs4 import BeautifulSoup # load BeautifulSoup class
import requests
r = requests.get("http://example_website.com")
data = r.text
soup = BeautifulSoup(data)
# soup.find_all('a') grabs all elements with <a> tag for hyperlinks
Then, to retrieve and print all elements with the 'href' attribute, we can use a for loop:
for link in soup.find_all('a'):
print(link.get('href'))
What I don't understand: I have a website with several webpages, and each webpage lists several hyperlinks to a single webpage with tabular data.
I can use BeautifulSoup to parse the homepage, but how do I use the same Python script to scrape page 2, page 3, and so on? How do you "access" the contents found via the 'href' links?
Is there a way to write a python script to do this? Should I be using a spider?
You can do that with requests+BeautifulSoup for sure. It would be of a blocking nature, since you would process the extracted links one by one and you would not proceed to the next link until you are done with the current. Sample implementation:
from urlparse import urljoin
from bs4 import BeautifulSoup
import requests
with requests.Session() as session:
r = session.get("http://example_website.com")
data = r.text
soup = BeautifulSoup(data)
base_url = "http://example_website.com"
for link in soup.find_all('a'):
url = urljoin(base_url, link.get('href'))
r = session.get(url)
# parse the subpage
Though, it may quickly get complex and slow.
You may need to switch to Scrapy web-scraping framework which makes web-scraping, crawling, following the links easy (check out CrawlSpider with link extractors), fast and in a non-blocking nature (it is based on Twisted).

Categories

Resources