Python Webscrape Google Custom Search URL with Parameters

Python Webscrape Google Custom Search URL with Parameters - python

I am trying to do a project where I search for similar images using Google Image stuff and Google's Custom Search API. From that, I get the correct URL that gets me similar images. Then, I simply want the HTML of that page. The page looks like this LINK. I just want the HTML to the page this leads to. But, I tried this:
r = requests.get(fetchUrl)
print(r.text)
This is just the HTML to a really old Google main page. I am not sure where this is coming from. I also tried adding a header to ensure that Google doesn't block me from scraping.
Entire code:
import requests
filePath = 'Initial_Img/a/frame1.jpg'
searchUrl = 'http://www.google.com/searchbyimage/upload'
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}
multipart = {'encoded_image': (filePath, open(filePath, 'rb')), 'image_content': ''}
response = requests.post(searchUrl, files=multipart, allow_redirects=False)
fetchUrl = response.headers['Location']
print(fetchUrl)
Do you have any ideas? Any help is truly appreciated.

The problem is something with the way Google renders the page. You would have to use Selenium and physically use the web browser to get the HTML. To solve your problem:
Run: sudo apt install firefox-geckodriver and install Firefox
Run: pip install selenium
Change your code to this:
import requests
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.firefox.options import Options
filePath = 'Initial_Img/a/test.jpg'
searchUrl = 'http://www.google.com/searchbyimage/upload'
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}
multipart = {'encoded_image': (filePath, open(filePath, 'rb')), 'image_content': ''}
response = requests.post(searchUrl, files=multipart, allow_redirects=False)
fetchUrl = response.headers['Location']
options = Options()
options.add_argument("--disable-extensions")
options.add_argument("--disable-gpu")
options.add_argument("--no-sandbox") # linux only
options.add_argument("--headless")
options.headless = True # also works
nav = webdriver.Firefox(options=options)
nav.get(fetchUrl)
print(nav.page_source)
nav.page_source gets you the HTML of the end page. I hope this helps. I don't know why the normal method doesn't work. If anyone knows the reason, please comment below.

Related

How to bypass Mod_Security while scraping

I tried running this Python script using BeautifulSoup and requests modules :
from bs4 import BeautifulSoup as bs
import requests
url = 'https://udemyfreecourses.org/
headers = {'UserAgent' : 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.193 Safari/537.36'
}
soup = bs(requests.get(url, headers= headers).text, 'lxml')
But when I send this line :
print(soup.get_text())
It doesn't scrape the text data but instead, It returns this output:
Not Acceptable!Not Acceptable!An appropriate representation of the requested resource could not be found on this server. This error was generated by Mod_Security.
I even even used headers when requesting the webpage, so It can looks like a normal navigator, but I'm still getting this message that's preventing me from accessing the real webpage
Note : The webpage is working perfectly on the navigator directly, but It doesn't show much info when I try to scrape it.
Is there any other way than the one I used with headers that can get a perfect valid request from the website and bypass this security called Mod_Security?
Any help would be very very helpful, Thanks.

EDIT: The Dash in "User-Agent" is essential.
Following this Answer https://stackoverflow.com/a/61968635/8106583
headers = {
'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.12; rv:55.0) Gecko/20100101 Firefox/55.0',
}
Your User-Agent is the problem. This User-Agent works for me.
Also: Your ip might be blocked by now :D

Why Request doesn't work on a specific URL?

I have a question re: requests module in Python.
So far I have been using this to scrape and it's been working well.
However when I do it against one particular website (code below - and refer to the Jupyter Notebook snapshot), it just doesn't want to complete the task (showing [*] forever).
from bs4 import BeautifulSoup
import requests
import pandas as pd
import json
page = requests.get('https://www.stoneisland.com/ca/stone-island-shadow-project/coats-jackets', verify = False)
soup = BeautifulSoup(page.content, 'html.parser')
Some users also suggest using headers such as below to speed it up but it doesnt work for me as well:
url = 'https://www.stoneisland.com/ca/stone-island-shadow-project/coats-jackets'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.3'}
req = requests.get(url = url, headers = headers)
Not sure what's going on (this is the first time for me) but I might be missing on something obvious. If someone can explain why this is not working? Or if it's working in your machine, please do let me know!

The page attempts to add a cookie the first time you visit it. By using the requests module and not defining a cookie will prevent you from being able to connect to the page.
I've modified your script to include my cookie which should work - if it doesn't, copy your cookie (for this host domain) from the browser to the script.
url = 'https://www.stoneisland.com/ca/stone-island-shadow-project/coats-jackets'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.3'}
cookies = {
'TS01e58ec0': '01a1c9e334eb0b8b191d36d0da302b2bca8927a0ffd2565884aff3ce69db2486850b7fb8e283001c711cc882a8d1f749838ff59d3d'
}
req = requests.get(url = url, headers = headers, cookies=cookies)

BeautifulSoup using Python keep returning null even though the element exists

I am running the following code to parse an amazon page using beautiful soup in Python but when I run the print line, I keep getting None. I am wondering whether I am doing something wrong or if theres an explanation/solution to this. Any help will be appreciated.
import requests
from bs4 import BeautifulSoup
URL = 'https://www.amazon.ca/Magnetic-Erase-Whiteboard-Bulletin-
Board/dp/B07GNVZKY2/ref=sr_1_3_sspa?keywords=whiteboard&qid=1578902710&s=office&sr=1-3-spons&psc=1&spLa=ZW5jcnlwdGVkUXVhbGlmaWVyPUEzOE5ZSkFGSDdCOFVDJmVuY3J5cHRlZElkPUEwMDM2ODA4M0dWMEtMWkI1U1hJJmVuY3J5cHRlZEFkSWQ9QTA0MDIwMjQxMEUwMzlMQ0pTQVlBJndpZGdldE5hbWU9c3BfYXRmJmFjdGlvbj1jbGlja1JlZGlyZWN0JmRvTm90TG9nQ2xpY2s9dHJ1ZQ=='
headers = {"User-Agent": 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.117 Safari/537.36'}
page = requests.get(URL, headers=headers)
soup = BeautifulSoup(page.content, 'html.parser')
title = soup.find(id="productTitle")
print(title)

Your code is absolutely correct.
There seems to be some issue with the the parser that you have used (html.parser)
I used html5lib in place of html.parser and the code now works:
import requests
from bs4 import BeautifulSoup
URL = 'https://www.amazon.ca/Magnetic-Erase-Whiteboard-BulletinBoard/dp/B07GNVZKY2/ref=sr_1_3_sspa?keywords=whiteboard&qid=1578902710&s=office&sr=1-3-spons&psc=1&spLa=ZW5jcnlwdGVkUXVhbGlmaWVyPUEzOE5ZSkFGSDdCOFVDJmVuY3J5cHRlZElkPUEwMDM2ODA4M0dWMEtMWkI1U1hJJmVuY3J5cHRlZEFkSWQ9QTA0MDIwMjQxMEUwMzlMQ0pTQVlBJndpZGdldE5hbWU9c3BfYXRmJmFjdGlvbj1jbGlja1JlZGlyZWN0JmRvTm90TG9nQ2xpY2s9dHJ1ZQ=='
headers = {"User-Agent": 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.117 Safari/537.36'}
page = requests.get(URL, headers=headers)
soup = BeautifulSoup(page.content, 'html5lib')
title = soup.find(id='productTitle')
print(title)
More Info not directly related to the answer:
For the other answer given to this question, I wasn't asked for a captcha when visiting the page.
However Amazon does change the response content if it detects that a bot is visiting the website: Remove the headers from requests.get() method, and try page.text
The default headers added by requests library lead to the identification of the request as being form a bot.

When requesting that page outside of a normal browser environment it asked for a captcha, I'd assume that's why the element doesn't exist.
Amazon probably has specific measures to counter "robots" accessing their pages, I suggest to look at their APIs to see if there's anything helpful instead of scraping the webpages directly.

python requests & beautifulsoup bot detection

I'm trying to scrape all the HTML elements of a page using requests & beautifulsoup. I'm using ASIN (Amazon Standard Identification Number) to get the product details of a page. My code is as follows:
from urllib.request import urlopen
import requests
from bs4 import BeautifulSoup
url = "http://www.amazon.com/dp/" + 'B004CNH98C'
response = urlopen(url)
soup = BeautifulSoup(response, "html.parser")
print(soup)
But the output doesn't show the entire HTML of the page, so I can't do my further work with product details.
Any help on this?
EDIT 1:
From the given answer, It shows the markup of the bot detection page. I researched a bit & found two ways to breach it :
I might need to add a header in the requests, but I couldn't understand what should be the value of header.
Use Selenium.
Now my question is, do both of the ways provide equal support?

It is better to use fake_useragent here for making things easy. A random user agent sends request via real world browser usage statistic. If you don't need dynamic content, you're almost always better off just requesting the page content over HTTP and parsing it programmatically.
import requests
from fake_useragent import UserAgent
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}
ua=UserAgent()
hdr = {'User-Agent': ua.random,
'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3',
'Accept-Encoding': 'none',
'Accept-Language': 'en-US,en;q=0.8',
'Connection': 'keep-alive'}
url = "http://www.amazon.com/dp/" + 'B004CNH98C'
response = requests.get(url, headers=hdr)
print response.content
Selenium is used for browser automation and high level web scraping for dynamic contents.

As some of the comments already suggested, if you need to somehow interact with Javascript on a page, it is better to use selenium. However, regarding your first approach using a header:
import requests
from bs4 import BeautifulSoup
url = "http://www.amazon.com/dp/" + 'B004CNH98C'
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}
response = requests.get(url, headers=headers)
soup = BeautifulSoup(response.text,"html.parser")
These headers are a bit old, but should still work. By using them you are pretending that your request is coming from a normal webbrowser. If you use requests without such a header your code is basically telling the server that the request is coming from python, which most of the servers are rejecting right away.
Another alternative for you could also be fake-useragent maybe you can also have a try with this.

try this:
import requests
from bs4 import BeautifulSoup
url = "http://www.amazon.com/dp/" + 'B004CNH98C'
r = requests.get(url)
r = r.text
##options #1
# print r.text
soup = BeautifulSoup( r.encode("utf-8") , "html.parser")
### options 2
print(soup)

POST request and headers in selenium

I'm trying to add functionality to a headless webbrowser. I know there are easier ways but I stumbled across seleniumrequests an it sparked my interest. I was wondering if there would be a way to add request headers as well as being able to POST data as a payload. I've done some searching around and haven't had much luck. The following prints the html of the first website and screenshots for verification and then my program just hangs on the POST request. Doesn't terminate or raise an exception or anything. Where am I going wrong?
Thanks!
#!/usr/bin/env python
from seleniumrequests import PhantomJS
from selenium import webdriver
#Setting user-agent
webdriver.DesiredCapabilities.PHANTOMJS['phantomjs.page.customHeaders.User-Agent'] = 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Ubuntu Chromium/37.0.2062.120 Chrome/37.0.2062.120 Safari/537.36'
browser = PhantomJS()
browser.get('http://www.google.com')
print browser.page_source
browser.save_screenshot('preSearch.png')
searchReq='https://www.google.com/complete/search?'
data={"q":"this is my search term"}
resp = browser.request('POST', str(searchReq), data=data)
print resp
browser.save_screenshot('postSearch.png')

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Python Webscrape Google Custom Search URL with Parameters - python

Related

How to bypass Mod_Security while scraping

Why Request doesn't work on a specific URL?

BeautifulSoup using Python keep returning null even though the element exists

python requests & beautifulsoup bot detection

POST request and headers in selenium

Categories

Resources