Why does Python's requests transform parts of the source code? - python

I am trying to scrape Google News search results using Python's requests to get links to different articles, extracting the links with Beautiful Soup.
The problem is that although all the links look normal in the browser's source view, they are changed after the request: all of them start with "/url?q=", and after the "core" of the link ends there is a string of characters starting with "&". Some characters inside the link are changed as well; for example, the URL:
http://www.azonano.com/news.aspx?newsID=35576
changes to:
http://www.azonano.com/news.aspx%newsID%35576
I'm using standard "getting started" code:
import requests, bs4
url_list = list()
url = 'https://www.google.com/search?hl=en&gl=us&tbm=nws&authuser=0&q=graphene&oq=graphene&gs_l=news-cc.3..43j0l9j43i53.2022.4184.0.4322.14.10.3.1.1.1.166.884.5j5.10.0...0.0...1ac.1.-Q2j3YFqIPQ'
res = requests.get(url)
res.raise_for_status()
soup = bs4.BeautifulSoup(res.text, 'html.parser')
for link in soup.select('h3 > a'):
    url_list.append(link.get('href'))

# First link on the Google News page is:
# https://www.theengineer.co.uk/graphene-sensor-could-speed-hepatitis-diagnosis/
print(url_list[0])  # this line prints the URL as modified by requests
I know it's possible to get around this problem by using Selenium, but I'd like to know where the root cause lies: with requests itself or, more plausibly, with the way I'm using it.
Thanks for any help!

You're comparing what you see in a browser with what requests sends by default (i.e. with no User-Agent header). If you specify one before making the initial request, the response will reflect what you see in a web browser; it looks like Google serves requests differently depending on the User-Agent:
url = 'https://www.google.com/search?hl=en&gl=us&tbm=nws&authuser=0&q=graphene&oq=graphene&gs_l=news-cc.3..43j0l9j43i53.2022.4184.0.4322.14.10.3.1.1.1.166.884.5j5.10.0...0.0...1ac.1.-Q2j3YFqIPQ'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36'} # I just used a general Chrome 41 user agent header
res = requests.get(url, headers=headers)
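As an aside, the "/url?q=" links are Google's redirect wrappers, and the real target URL can be recovered with the standard library. A minimal sketch (the example href below is illustrative, not copied from a live results page):
from urllib.parse import urlparse, parse_qs

# a wrapper link as it appears in an href attribute (illustrative example)
href = "/url?q=http://www.azonano.com/news.aspx%3FnewsID%3D35576&sa=U"
query = parse_qs(urlparse(href).query)  # parse the query string of the /url link
real_url = query["q"][0]                # the original article URL, percent-decoded
print(real_url)                         # http://www.azonano.com/news.aspx?newsID=35576
This also accounts for the changed characters described above: "?" and "=" inside the wrapped URL are percent-encoded as %3F and %3D, and parse_qs decodes them back.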

Related

Getting the page source of imgur

I'm trying to get the page source of an imgur page using requests, but the results I get are different from the source I see in the browser. I understand that these pages are rendered using JS, but that is not what I am asking about here.
It seems I'm getting redirected because they detect I'm using an automated client, but I'd prefer not to use Selenium. For example, see the following code, which scrapes the page source for two imgur IDs (one valid and one invalid) whose pages differ in the browser.
import requests
from bs4 import BeautifulSoup

url1 = "https://i.imgur.com/ssXK5"  # valid ID
url2 = "https://i.imgur.com/ssXK4"  # invalid ID

def get_source(url):
    headers = {
        "User-Agent": "Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.114 Mobile Safari/537.36"}
    page = requests.get(url, headers=headers)
    soup = BeautifulSoup(page.content, 'html.parser')
    return soup

page1 = get_source(url1)
page2 = get_source(url2)
print(page1 == page2)
# True
The scraped page sources are identical, so I presume it's an anti-scraping measure. I know there is an imgur API, but I'd like to know how to get around such a redirection, if possible. Is there any way to get the actual source code using the requests module?
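To confirm whether a redirect is actually happening, one can disable requests' automatic redirect handling and inspect the raw response; a sketch, reusing the User-Agent from above:
import requests

headers = {"User-Agent": "Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.114 Mobile Safari/537.36"}
resp = requests.get("https://i.imgur.com/ssXK5", headers=headers, allow_redirects=False)
print(resp.status_code)              # a 30x status here confirms a redirect
print(resp.headers.get("Location"))  # where the redirect points, if any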
Thanks.

Having trouble setting up a web scraper with Python

Three days ago I started learning Python to create a web scraper and collect information about new book releases. I'm stuck on one of my target websites. I know this is a really basic question, but I've watched some videos, looked at many related questions on Stack Overflow, and tried more than 10 different solutions, with no luck. If anybody could help, much appreciated:
My problem:
I can retrieve the title information but can't retrieve the price information.
Data Source:
https://www.bloomsbury.com/uk/non-fiction/business-and-management/?pagesize=25
My code:
from bs4 import BeautifulSoup
import requests
import csv

url = 'https://www.bloomsbury.com/uk/non-fiction/business-and-management/?pagesize=25'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36'}
source = requests.get(url, headers=headers).text
soup = BeautifulSoup(source, 'lxml')

# code to retrieve title
for productdetails in soup.find_all("div", class_='figDetails'):
    producttitle = productdetails.a.text
    print(producttitle)

# code to retrieve price
for productpricedetails in soup.find_all("div", class_='related-products-block'):
    productprice = productpricedetails.find("div", class_="new-price").span.text
    print(productprice)
There are two span elements; I need the information in the second one but don't know how to get to it.
Also, while trying different possible solutions I kept getting a NoneType error.
It looks like the source you're trying to scrape populates this data via JavaScript.
If you view the source of the page, you can see that in the raw HTML the div you're trying to target is empty:
<html>
...
<div class="related-products-block" id="reletedProduct_490420">
</div>
...
</html>
You can also see this if you update your second loop like so:
for productpricedetails in soup.find_all("div", class_="related-products-block"):
    print(productpricedetails)
Edit:
As a bonus, you can inspect the JavaScript the page uses. It is very easy to understand, and the request simply returns the HTML you are looking for. Preparing the JSON for the request is a bit more involved, but here's an example:
import requests

url = "https://www.bloomsbury.com/uk/catalog/RelatedProductsData"
payload = {"productId": 490420, "type": "List", "ordertype": 0, "formatType": 0}
headers = {"Content-Type": "application/json"}

response = requests.post(url, headers=headers, json=payload)  # json= serializes the payload as a JSON body
print(response.text)
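The response body is an HTML fragment, so it can be fed back into BeautifulSoup to pull out the price. A sketch, assuming the fragment contains the new-price div described in the question, with the price in the second of its two spans:
from bs4 import BeautifulSoup

# parse the HTML fragment returned by the POST above
fragment = BeautifulSoup(response.text, "lxml")
price_div = fragment.find("div", class_="new-price")
spans = price_div.find_all("span") if price_div else []  # spans in document order
if len(spans) > 1:
    print(spans[1].text)  # index 1 is the second span, assumed to hold the price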

Web scraping with Python using BeautifulSoup: 429 error

First, I have to say that I'm quite new to web scraping with Python. I'm trying to scrape data using these lines of code:
import requests
from bs4 import BeautifulSoup
baseurl = 'https://name_of_the_website.com'
html_page = requests.get(baseurl).text
soup = BeautifulSoup(html_page, 'html.parser')
print(soup)
As output I do not get the expected HTML page but another HTML page that says:
Please use robots.txt
Your IP has been rate limited
To check the problem I wrote:
try:
    page_response = requests.get(baseurl, timeout=5)
    if page_response.status_code == 200:
        html_page = requests.get(baseurl).text
        soup = BeautifulSoup(html_page, 'html.parser')
    else:
        print(page_response.status_code)
except requests.Timeout as e:
    print(str(e))
Then I get 429 (Too Many Requests).
What can I do to handle this problem? Does it mean I cannot print the HTML of the page, and does it prevent me from scraping any of its content? Should I rotate the IP address?
If you are only hitting the page once and getting a 429, it's probably not that you are hitting them too much. You can't be sure the 429 error is accurate; it's simply what their web server returned. I've seen pages return a 404 response code when the page was fine, and a 200 response code on legitimately missing pages; sometimes a server is just misconfigured. They may return 429 for any bot, so try changing your User-Agent to Firefox, Chrome, or "Robot Web Scraper 9000" and see what you get. Like this:
requests.get(baseurl, headers = {'User-agent': 'Super Bot Power Level Over 9000'})
to declare yourself as a bot, or
requests.get(baseurl, headers = {'User-agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.186 Safari/537.36'})
if you want to mimic a browser more closely. Note the version numbers in the browser-mimicking string: they were current at the time of writing, and you may need later ones. Just find the user agent of the browser you use; this page will tell you what it is:
https://www.whatismybrowser.com/detect/what-is-my-user-agent
Some sites return code that is easier to search if you just say you are a bot; for others it's the opposite. It's basically the wild, wild west; you just have to try different things.
Another pro tip: you may have to write your code to have a 'cookie jar', i.e. a way to accept and return cookies. Usually it is just an extra line in your request, but I'll leave the details for another Stack Overflow question :)
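For example, a minimal sketch using requests.Session, which maintains that cookie jar for you (the URL and User-Agent are the placeholders used above):
import requests

# a Session keeps a cookie jar: Set-Cookie headers from responses are stored
# and sent back automatically on later requests
session = requests.Session()
session.headers.update({'User-Agent': 'Super Bot Power Level Over 9000'})

first = session.get('https://name_of_the_website.com')   # cookies captured here
second = session.get('https://name_of_the_website.com')  # cookies sent back automatically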
If you are indeed hitting them a lot, you need to sleep between calls. The response is completely controlled server-side by them. You will also want to investigate how your code interacts with robots.txt; that's a file, usually at the root of the web server, with the rules it would like your spider to follow.
You can read about that here: Parsing Robots.txt in python
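As a sketch of what that can look like with the standard library's urllib.robotparser (the URL and path are placeholders):
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url('https://name_of_the_website.com/robots.txt')
rp.read()  # fetch and parse the site's robots.txt
print(rp.can_fetch('Super Bot Power Level Over 9000', '/some/page'))  # True if allowed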
Spidering the web is fun and challenging; just remember that you could be blocked at any time by any site for any reason. You are their guest, so tread nicely :)

What I scrape with Python is not the same as what I see in Firebug

I am practicing writing a web crawler to collect some interesting information from a website. I tried this block of code on my personal website and it works as I expect, but when I run it against a real website it does not show what it should. Does anyone have any ideas? The following is my code and the results.
import requests
from bs4 import BeautifulSoup
url = 'https://angel.co/parkwhiz/jobs/284942-product-manager'
page = requests.get(url).text
soup = BeautifulSoup(page,'lxml')
print(soup.prettify())
The result from print and the result from Firebug (or Chrome Inspect) differ: the title in the printed output is "Page not found - 404 - AngelList", while the title in Firebug is "Product Manager Job at Parkwhiz - AngelList". Is there anything wrong with my code? Shouldn't these two match?
The website is blocking the script because you're sending the default User-Agent, which tells the website that the request comes from an automated Python script.
If you check the status code, you'll see that you're getting a 404:
>>> r = requests.get('https://angel.co/parkwhiz/jobs/284942-product-manager')
>>> r.status_code
404
To overcome this, change the User-Agent to look like a real browser:
>>> headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36'}
>>> r = requests.get('https://angel.co/parkwhiz/jobs/284942-product-manager', headers=headers)
>>> r.status_code
200

Different data in urllib2 than in Safari's Web Inspector

I looked here and here for information on my issue, but with no luck.
I wrote some Python code that is intended to grab a webpage's source, as in Safari's Web Inspector. However, I have been getting different code from my application than from Safari's Web Inspector. Here is my code so far:
#!/usr/bin/python
import urllib2

# headers
hdr = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_3) AppleWebKit/536.28.10 (KHTML, like Gecko) Version/6.0.3 Safari/536.28.10',
       'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
       'Cache-Control': 'max-age=0'}

# request data
req = urllib2.Request("https://www.google.com/#q=rainbow&safe=active", headers=hdr)

# try to get data
try:
    page = urllib2.urlopen(req)
    print page.info()
except urllib2.HTTPError, e:
    print e.fp.read()

content = page.read()
print content
The headers match up with what is in Web Inspector, but the code returned is different for a Google search for "rainbow".
My python:
http://paste.ubuntu.com/6270549/
Web Inspector:
http://paste.ubuntu.com/6270606/
As far as I can tell, my code is missing a large number of the ubiquitous }catch(e){gbar_._DumpException(e)} lines that are present in the Web Inspector code. Also, my output has only 78 lines, while the Web Inspector code has 235. Does this mean that my code is not getting all of the JavaScript, or some other portion of the webpage? How can I make my code retrieve the same data as the Web Inspector?
You are using the wrong link for a Google search. The correct link should be:
https://www.google.com/search?q=rainbow&safe=active
instead of:
https://www.google.com/#q=rainbow&safe=active
In the second link, everything after the # is a URL fragment. Fragments are never sent to the server, so from Python the request is effectively for Google's homepage, which is why you get redirected there. In Safari, Google's JavaScript reads the fragment client-side and loads the search results, which is why the code differs.
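A sketch of the fix, shown here with Python 3's urllib.request (the successor to urllib2), reusing the User-Agent from the question:
import urllib.request

# the query belongs in the query string (?q=...), not in the #fragment,
# because fragments are never sent to the server
url = "https://www.google.com/search?q=rainbow&safe=active"
hdr = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_3) AppleWebKit/536.28.10 (KHTML, like Gecko) Version/6.0.3 Safari/536.28.10'}

req = urllib.request.Request(url, headers=hdr)
with urllib.request.urlopen(req) as page:
    content = page.read().decode('utf-8', errors='replace')
print(content[:200])  # beginning of the real results page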
