Having trouble setting up a web scraper with Python - python

Three days ago I started learning Python to create a web scraper and collect information about new book releases. I´m stuck on one of my target websites...I know this is a really basic question but I´ve watched some videos, looked at many related questions on stack overflow, tried more than 10 different solutions and nothing. If anybody could help, much appreciated:
My problem:
I can retrieve the title information but can´t retrieve the price information
Data Source:
https://www.bloomsbury.com/uk/non-fiction/business-and-management/?pagesize=25
My code:
from bs4 import BeautifulSoup
import requests
import csv
url = 'https://www.bloomsbury.com/uk/non-fiction/business-and-management/?pagesize=25'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36'}
source = requests.get(url, headers=headers).text
#code to retrieve title
soup = BeautifulSoup(source, 'lxml')
for productdetails in soup.find_all("div", class_='figDetails'):
producttitle = productdetails.a.text
print(producttitle)
#code to retrieve price
for productpricedetails in soup.find_all("div", class_='related-products-block'):
productprice = productdetails.find("div", class_="new-price").span.text
print(productprice)
There are two elements with the name span, I need the information on the second one but don´t know how to get to it.
Also, on trying different possible solutions I kept getting a noneType error...

It looks like the source you're trying to scrape populates this data via Javascript.
Viewing the source of the page you can see the raw HTML shows the div you're trying to target is empty.
<html>
...
<div class="related-products-block" id="reletedProduct_490420">
</div>
...
</html>
You can also see this if you update your second loop like so:
for productpricedetails in soup.find_all("div", class_="related-products-block"):
print(productpricedetails)
Edit:
As a bonus, you can inspect the Javascript the page uses. It is very easy to understand, and the request simply returns the HTML which you are looking for. It will be a bit more involved to get the JSON prepared for the requests but here's an example:
import requests
url = "https://www.bloomsbury.com/uk/catalog/RelatedProductsData"
payload = {"productId": 490420, "type": "List", "ordertype": 0, "formatType": 0}
headers = {"Content-Type": "application/json"}
response = requests.request("POST", url, headers=headers, data=payload)
print(response.text.encode("utf8"))

Related

Beautifulsoup scrap very few listings prices instead of all listings prices on a page

I want to scrap data from a real estate website for my education project. I am using beautifulsoup. I write following code. Code works properly but shows very less data.
import requests
from bs4 import BeautifulSoup
url = "https://www.zillow.com/homes/San-Francisco,-CA_rb/"
headers = {
"Accept-Language": "en-GB,en;q=0.5",
"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:104.0) Gecko/20100101 Firefox/104.0"
}
response = requests.get(url=url, headers=headers )
soup = BeautifulSoup(response.text, "html.parser")
prices = soup.find_all("span", attrs={"data-test":True})
prices_list = [price.getText().strip("+,/,m,o,1,bd, ") for price in prices]
print(prices_list)
The output of this only shows first 9 listings prices.
['$2,959', '$2,340', '$2,655', '$2,632', '$2,524', '$2,843', '$2,64', '$2,300', '$2,604']
It's because the content is created progressively with continuous requests (Lazy loading). You could try to reverse engineer the backend of the site. I'll look into it and if I find an easy solution I'll update the answer. :)
The API call to their backend looks something like this: https://www.zillow.com/search/GetSearchPageState.htm?searchQueryState=%7B%22pagination%22%3A%7B%7D%2C%22usersSearchTerm%22%3A%22San%20Francisco%2C%20CA%22%2C%22mapBounds%22%3A%7B%22west%22%3A-123.07190982226562%2C%22east%22%3A-121.79474917773437%2C%22south%22%3A37.63132659190023%2C%22north%22%3A37.918977518603874%7D%2C%22regionSelection%22%3A%5B%7B%22regionId%22%3A20330%2C%22regionType%22%3A6%7D%5D%2C%22isMapVisible%22%3Atrue%2C%22filterState%22%3A%7B%22sortSelection%22%3A%7B%22value%22%3A%22days%22%7D%2C%22isAllHomes%22%3A%7B%22value%22%3Atrue%7D%7D%2C%22isListVisible%22%3Atrue%7D&wants={%22cat1%22:[%22mapResults%22]}&requestId=3
You need to handle cookies correctly in order to see the results but if delivers around 1000 results. Have fun :)
UPDATE:
should look like this
import json
with open("GetSearchPageState.json", "r") as f:
a = json.load(f)
print(a["cat1"]["searchResults"]["mapResults"])

Find __INITIAL_STATE__ API url in website

I want to scrape a website that contains a JSON inside after the "window.INITIAL_STATE=". Although I am able to scrape it by parsing the HTML, I want to know how I can find (if exists at all) the API where that data comes from. I have also checked some APIs that documented in their developers site, but they miss some information that is only available in that INITIAL_STATE
My goal is to do the request directly to an API instead of having to load the entire HTML to then parse it.
Here is the website Im trying to get info from
That script is being loaded in the initial html, it's not pulled from any API. You can get that script data like below:
import requests
from bs4 import BeautifulSoup
import re
import json
headers = {'User-Agent' : 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:103.0) Gecko/20100101 Firefox/103.0',
'Accept-Language' : 'en-US,en;q=0.5'}
url = 'https://www.just-eat.co.uk/restaurants-theshadse1/'
r = requests.get(url, headers=headers)
soup = BeautifulSoup(r.text, 'html.parser')
data_script = soup.find('script', string=re.compile("INITIAL_STATE"))
data = json.loads(data_script.text.split('window.__INITIAL_STATE__=')[1])
print(data['state'])
Result in terminal:
{'brazeApiKey': 'f714b0fc-6de5-4460-908e-2d9930f31339', 'menuVersion': 'VCE.1b_tZDFq8FmpZNn.ric9K4otZqBy', 'restaurantId': '17395', 'countryCode': 'uk', 'language': 'en-GB', 'localDateTime': '2022-08-04T00:41:58.9747334', 'canonicalBase': 'www.just-eat.co.uk', 'androidAppBaseUrl': [...]
You can analyze and dissect that json object further, to get what you need from it.

Getting the page source of imgur

I'm trying to get the page source of an imgur website using requests, but the results I'm getting are different from the source. I understand that these pages are rendered using JS, but that is not what I am searching for.
It seems I'm getting redirected because they detect I'm using an automated browser, but I'd prefer not to use selenium here. For example, see the following code to scrape the page source of two imgur ID's (one valid ID and one invalid ID) with different page sources.
import requests
from bs4 import BeautifulSoup
url1 = "https://i.imgur.com/ssXK5" #valid ID
url2 = "https://i.imgur.com/ssXK4" #invalid ID
def get_source(url):
headers = {
"User-Agent": "Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.114 Mobile Safari/537.36"}
page = requests.get(url, headers = headers)
soup = BeautifulSoup(page.content, 'html.parser')
return soup
page1 = get_source(url1)
page2 = get_source(url2)
print(page1==page2)
#True
The scraped page sources are identical, so I presume it's an anti-scraping thing. I know there is an imgur API, but I'd like to know how to circumvent such a redirection, if possible. Is there any way to get the actual source code using the requests module?
Thanks.

I get nothing when trying to scrape a table

So I want to extract the number 45.5 from here: https://www.myscore.com.ua/match/I9pSZU2I/#odds-comparison;over-under;1st-qrt
But when I try to find the table I get nothing. Here's my code:
import requests
from bs4 import BeautifulSoup
url = 'https://www.myscore.com.ua/match/I9pSZU2I/#odds-comparison;over-under;1st-qrt'
headers = {'User-Agent': 'Mozilla/5.0 (X11; Linux armv7l) AppleWebKit/537.36 (KHTML, like Gecko) Raspbian Chromium/65.0.3325.181 Chrome/65.0.3325.181 Safari/537.36'}
response = requests.get(url, headers=headers)
soup = BeautifulSoup(response.content, 'html.parser')
text = soup.find_all('table', class_ = 'odds sortable')
print(text)
Can anybody help me to extract the number and store it's value into a variable?
You can try to do this without Selenium by recreating the dynamic request that loads the table.
Looking around in the network tab of the page, i saw this XMLHTTPRequest: https://d.myscore.com.ua/x/feed/d_od_I9pSZU2I_ru_1_eu
Try to reproduce the same parameters as the request.
To access the network tab: Click right->inspect element->Network tab->Select XHR and find the second request.
The final code would be like this:
headers = {'x-fsign' : 'SW9D1eZo'}
page =
requests.get('https://d.myscore.com.ua/x/feed/d_od_I9pSZU2I_ru_1_eu',
headers=headers)
You should check if the x=fisgn value is different based on your browser/ip.

Why Python's requests transforms parts of source code?

I am trying to scrape through google news search results using python's requests to get links to different articles. I get the links by using Beautiful Soup.
The problem I get is that although in browser's source view all links look normal, after the operation they are changed - all of the start with "/url?q=" and after the "core" of the link is finished there goes a string of characters which starts with "&". Also - some characters inside the link are also changed - for example url:
http://www.azonano.com/news.aspx?newsID=35576
changes to:
http://www.azonano.com/news.aspx%newsID%35576
I'm using standard "getting started" code:
import requests, bs4
url_list = list()
url = 'https://www.google.com/search?hl=en&gl=us&tbm=nws&authuser=0&q=graphene&oq=graphene&gs_l=news-cc.3..43j0l9j43i53.2022.4184.0.4322.14.10.3.1.1.1.166.884.5j5.10.0...0.0...1ac.1.-Q2j3YFqIPQ'
res = requests.get(url)
res.raise_for_status()
soup = bs4.BeautifulSoup(res.text, 'html.parser')
for link in soup.select('h3 > a'):
url_list.append(link.get('href'))
# First link on google news page is:
# https://www.theengineer.co.uk/graphene-sensor-could-speed-hepatitis-diagnosis/
print url_list[0] #this line will print url modified by requests.
I know it's possible to get around this problem by using selenium, but I'd like to know where lies a root cause of this problem with requests (or more plausible not with requests but the way I'm using it).
Thanks for any help!
You're comparing what you are seeing with a browser with what requests generates (i.e. there is no user agent header). If you specify this before making the initial request it will reflect what you would see in a web browser. Google serves the requests differently it looks like:
url = 'https://www.google.com/search?hl=en&gl=us&tbm=nws&authuser=0&q=graphene&oq=graphene&gs_l=news-cc.3..43j0l9j43i53.2022.4184.0.4322.14.10.3.1.1.1.166.884.5j5.10.0...0.0...1ac.1.-Q2j3YFqIPQ'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36'} # I just used a general Chrome 41 user agent header
res = requests.get(url, headers=headers)

Categories

Resources