Selenium Webdriver / Beautifulsoup + Web Scraping + Error 416 - python

I'm doing web scraping with Selenium WebDriver in Python through a proxy.
I want to browse more than 10k pages of a single site this way.
The issue is that with this proxy I can send a request only once; when I send another request to the same link, or to another link on the same site, I get a 416 error (it looks like the firewall blocks my IP) for 1-2 hours.
Note: I can scrape all normal sites with this code, but this site has some kind of security that prevents me from scraping it.
Here is the code:
profile = webdriver.FirefoxProfile()
profile.set_preference("network.proxy.type", 1)
profile.set_preference("network.proxy.http", "74.73.148.42")
profile.set_preference("network.proxy.http_port", 3128)
profile.update_preferences()
browser = webdriver.Firefox(firefox_profile=profile)
browser.get('http://www.example.com/')
time.sleep(5)
element = browser.find_elements_by_css_selector(
    '.well-sm:not(.mbn) .row .col-md-4 ul .fs-small a')
for ele in element:
    print ele.get_attribute('href')
browser.quit()
Any solution?

Selenium wasn't helpful for me, so I solved the problem with BeautifulSoup instead. The website blocks a proxy as soon as it receives a request from it, so I keep changing the proxy URL and User-Agent whenever the server blocks the one currently in use.
Here is my code:
from bs4 import BeautifulSoup
import requests
import urllib2

url = 'http://terriblewebsite.com/'

proxy = urllib2.ProxyHandler({'http': '130.0.89.75:8080'})

# Create a URL opener that routes requests through the proxy
opener = urllib2.build_opener(proxy)
urllib2.install_opener(opener)

request = urllib2.Request(url)
request.add_header('User-Agent', 'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.2.15) Gecko/20110303 Firefox/3.6.15')
result = urllib2.urlopen(request)

data = result.read()
soup = BeautifulSoup(data, 'html.parser')
ptag = soup.find('p', {'class': 'text-primary'}).text
print ptag
Note:
change the proxy and User-Agent, and use only fresh, recently updated proxies
some servers accept only proxies from a specific country; in my case I used proxies from the United States
this process can be slow, but you can still scrape the data (a sketch of the rotation idea follows below)
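The rotation itself isn't shown in the snippet above, so here is a minimal sketch of the idea with urllib2; the proxy list and User-Agent strings are placeholders you would replace with your own working values.
import urllib2

# Placeholder pools -- swap in your own working proxies and User-Agent strings
proxy_pool = ['130.0.89.75:8080', 'PROXY_IP:PORT']
user_agents = [
    'Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.2.15) Gecko/20110303 Firefox/3.6.15',
    'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.106 Safari/537.36',
]

def fetch(url):
    # Walk through the proxy pool until one request gets through
    for i, proxy in enumerate(proxy_pool):
        opener = urllib2.build_opener(urllib2.ProxyHandler({'http': proxy}))
        request = urllib2.Request(url)
        request.add_header('User-Agent', user_agents[i % len(user_agents)])
        try:
            return opener.open(request, timeout=30).read()
        except urllib2.URLError:
            continue  # this proxy was blocked or unreachable -- try the next one
    return None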

Going through the 416 error discussions in the following links, it seems that some cached information (cookies, maybe) is causing the problem: the first request succeeds and subsequent requests fail.
https://webmasters.stackexchange.com/questions/17300/what-are-the-causes-of-a-416-error
416 Requested Range Not Satisfiable
Try not saving cookies at all by setting a preference, or delete the cookies after every request.
profile.set_preference("network.cookie.cookieBehavior", 2)
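Alternatively, if you want to keep cookies enabled but throw them away between requests, Selenium's delete_all_cookies() can be called on the browser object from the question, roughly like this:
# Using the `browser` object created in the question's code
browser.get('http://www.example.com/page-1')
# ... scrape the page here ...
browser.delete_all_cookies()   # drop whatever the site set before the next request
browser.get('http://www.example.com/page-2')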

Related

Web scraping on Pythonanywhere

In my project I scrape data from Amazon, and I deploy it on PythonAnywhere (I'm using a paid account). The problem is that my code (I'm using BeautifulSoup4) doesn't get the site's HTML when I run it on PythonAnywhere; it gets Amazon's "Something Went Wrong" page instead. On my local machine it works perfectly. I think it's about User-Agents. Locally I use my own User-Agent. Which User-Agent should I use when deploying, and how can I fix this?
Here is my code:
URL = link ##some amazon link
headers = {"User-Agent": " ##my user agent"}
page = requests.get(URL, headers=headers)
soup1 = BeautifulSoup(page.content, 'html.parser')
soup2 = BeautifulSoup(soup1.prettify(), "html.parser")
Is there any way I can do it on Pythonanywhere?
Your code works perfectly on my home machine, so the issue could be:
PythonAnywhere machine's IP being blocked by Amazon (as others have mentioned)
Another issue with the machine's access to the internet (Try scraping another site to test this)
To solve the former, you'd probably want to try out a proxy connection to change the IP you access Amazon with (I suggest you check PythonAnywhere's and Amazon's Terms of Service to be aware of any risks). The usage would look something like this:
import requests

proxies = {
    "http": "http://IP:Port",    # HTTP proxy
    "https": "https://IP:Port",  # HTTPS proxy
    # or, for a SOCKS5 proxy (needs requests[socks] installed):
    # "http": "socks5://user:pass@IP:Port",
    # "https": "socks5://user:pass@IP:Port",
}
URL = "https://api4.my-ip.io/ip" # Plaintext IPv4 to test
page = requests.get(URL, proxies=proxies)
print(page.text)
Finding proxies to use takes a couple of Google searches, but the difficult part is swapping them out occasionally, since they don't last forever.
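One way to handle the swapping is to test each candidate against the same IP-echo URL before using it; a rough sketch, with placeholder proxy addresses:
import requests

candidates = ["IP1:PORT", "IP2:PORT"]  # placeholder proxy addresses

def working_proxies(candidates, timeout=5):
    good = []
    for addr in candidates:
        proxies = {"http": "http://" + addr, "https": "http://" + addr}
        try:
            # If the proxy responds and echoes an IP, keep it in the pool
            requests.get("https://api4.my-ip.io/ip", proxies=proxies, timeout=timeout)
            good.append(addr)
        except requests.RequestException:
            pass  # dead or blocked proxy -- skip it
    return good

print(working_proxies(candidates))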
Try Selenium Webdriver instead of BeautifulSoup4. I had this issue myself when deploying a web scraper to pythonanywhere.com
pythonanywhere.com requires at least a Hacker plan to run web scraping applications; I was told this by their support team: https://www.pythonanywhere.com/pricing/
I also used the following user-agent and chrome options:
from fake_useragent import UserAgent
from selenium.webdriver.chrome.options import Options

chrome_options = Options()
chrome_options.add_argument("--headless")
chrome_options.add_argument('--no-sandbox')
chrome_options.add_argument('--disable-dev-shm-usage')

ua = UserAgent()
userAgent = ua.random
chrome_options.add_argument(f'user-agent={userAgent}')
As per: https://www.pythonanywhere.com/forums/topic/21948/
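The snippet above only prepares the options; a minimal sketch of wiring them into the driver might look like the following (the exact Chrome/chromedriver setup depends on your PythonAnywhere environment, so treat this as an assumption rather than a recipe):
from selenium import webdriver

# Pass the prepared chrome_options when starting Chrome; where the Chrome
# binary and chromedriver live depends on your environment
driver = webdriver.Chrome(options=chrome_options)
driver.get('https://www.example.com/')
print(driver.page_source[:200])   # quick sanity check that a page came back
driver.quit()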

Prevent from being banned from google scraping with BeautifulSoup

I want to make a Google News scraper with Python and BeautifulSoup, but I have read that there is a chance I could get banned.
I have also read that I can prevent this by using rotating proxies and rotating IP addresses.
The only thing I have managed to do is rotate the User-Agent.
Can you show me how to add a rotating proxy and a rotating IP address?
I know it should go in the requests.get() part, but I do not know how.
This is my code:
from bs4 import BeautifulSoup
import requests

headers = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.106 Safari/537.36'}
term = 'usa'
page = 0

for page in range(1, 5):
    page = page * 10
    url = 'https://www.google.com/search?q={}&tbm=nws&sxsrf=ACYBGNTx2Ew_5d5HsCvjwDoo5SC4U6JBVg:1574261023484&ei=H1HVXf-fHfiU1fAP65K6uAU&start={}&sa=N&ved=0ahUKEwi_q9qog_nlAhV4ShUIHWuJDlcQ8tMDCF8&biw=1280&bih=561&dpr=1.5'.format(term, page)
    print(url)
    response = requests.get(url, headers=headers)
    soup = BeautifulSoup(response.text, 'html.parser')
    headline_text = soup.find_all('h3', class_="r dO0Ag")
    snippet_text = soup.find_all('div', class_='st')
    news_date = soup.find_all('div', class_='slp')
    print(len(news_date))
You can do searches with the proper API from Google:
https://developers.google.com/custom-search/v1/overview
You can use https://gimmmeproxy.com for rotating proxies and its Python wrapper: https://github.com/DeyaaMuhammad/GimmeProxyApi.
import requests
# GimmeProxyAPI is provided by the wrapper linked above

proxy = GimmeProxyAPI(protocol="https")

proxies = {
    'http': proxy,
    'https': proxy
}

requests.get('https://example.org', proxies=proxies)
If you want to learn web scraping, it's best to pick some other website, like Reddit or an online magazine. Google News (and other Google services) are well protected against scraping, and they change their class names often enough to prevent you from doing it the easy way.
If your question is 'What do I do to not get banned?', then the answer is 'Don't violate the TOS', which means no scraping at all and using the proper search API instead.
There is a certain amount of 'free' Google search use, based on the IP address you are using, so if you are only scraping a handful of searches this should not be a problem.
If your question is 'How to use a proxy with requests module?', then you should start looking here.
import requests
proxies = {
'http': 'http://10.10.1.10:3128',
'https': 'http://10.10.1.10:1080',
}
requests.get('http://example.org', proxies=proxies)
But this is only the python side, you need to setup a web-proxy (or even better a pool of proxies) yourself and then use an algorithm to choose a different proxy every N requests for example.
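A bare-bones version of that rotation, assuming you already have a small pool of proxies (the addresses and URL list below are placeholders):
import itertools
import requests

# Placeholder pool -- replace with proxies you control or have verified
proxy_pool = itertools.cycle([
    'http://10.10.1.10:3128',
    'http://10.10.1.11:3128',
    'http://10.10.1.12:3128',
])

urls_to_fetch = ['http://example.org/page{}'.format(i) for i in range(30)]  # placeholder URLs
N = 10  # switch to the next proxy every N requests

proxy = next(proxy_pool)
for i, url in enumerate(urls_to_fetch):
    if i and i % N == 0:
        proxy = next(proxy_pool)  # rotate
    response = requests.get(url, proxies={'http': proxy, 'https': proxy})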
One more simple trick is to use Google Colab inside the Brave browser with Tor enabled; you will see that you get a different IP address each time.
Once you get the data you want, you can use it in your Jupyter notebook, VS Code, or elsewhere.
Using free proxies tends to give errors because too many requests go through them, so you would have to pick a different, less-loaded proxy every time, which is a terrible task when choosing one out of hundreds.
(Screenshots omitted: one showing the errors with free proxies, one showing correct results with the Brave Tor VPN.)

Python urllib2 open URL and wait some time

Here is the situation: I want to access the content of a URL in Python via urllib2.
import urllib2
url = 'http://www.iwanttoknowwhatsinside.com'
hdr = {
'User-Agent': 'OpenAnything/1.0 +http://somepage.org/',
'Connection': 'keep-alive'
}
request = urllib2.Request(url, headers=hdr)
opener = urllib2.build_opener()
HTML = opener.open(request).read()
This code normally works fine. But if I access a certain page via a web browser, it says something like "Checking your browser before accessing ... Your browser will be redirected shortly" and then the page loads. The URL never changes. ADD: After that I can freely click around on the page, or open a second tab with the same URL; I only have to wait before the initial access.
If I try to access this page via Python, I instantly get an urllib2.HTTPError - Service Temporarily Unavailable, so I figure urllib2 doesn't wait out that delay. Is there a way to force some wait time before exceptions are thrown or before retrieving the content? Or am I looking at this the wrong way?
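There is no built-in wait in urllib2, but one option is to catch the error and retry after a pause; a minimal sketch follows (note that a "checking your browser" interstitial usually relies on JavaScript, so a plain retry may still not be enough):
import time
import urllib2

def open_with_retries(request, retries=5, delay=10):
    # Retry the same request a few times, sleeping between attempts
    last_error = None
    for attempt in range(retries):
        try:
            return urllib2.urlopen(request).read()
        except urllib2.HTTPError as e:
            last_error = e
            time.sleep(delay)  # give the server some time before trying again
    raise last_error

HTML = open_with_retries(request)  # `request` built as in the snippet above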

Login into web page using BeautifulSoup and Mechanize

I am trying to log into a web page programmatically, using BeautifulSoup and Mechanize.
This is my code:
#import urllib2
from mechanize import Browser, _http, urlopen
from BeautifulSoup import BeautifulSoup
import cookielib

data_url = "http://data.theice.com/ViewData/EndOfDay/LdnOptions.aspx?p=AER"

def are_we_logged_on(html):
    soup = BeautifulSoup(html)
    elem = soup.find("input", {"id": "ctl00_ContentPlaceHolder1_LoginControl_m_userName"})
    return elem is None

# Browser
br = Browser()

# Cookie Jar
cj = cookielib.LWPCookieJar()
br.set_cookiejar(cj)

# Browser options
br.set_handle_equiv(True)
#br.set_handle_gzip(True)
br.set_handle_redirect(True)
br.set_handle_referer(True)
br.set_handle_robots(False)

# Follows refresh 0 but doesn't hang on refresh > 0
br.set_handle_refresh(_http.HTTPRefreshProcessor(), max_time=1)

# User-Agent (this is cheating, ok?)
br.addheaders = [('User-agent', 'Mozilla/5.0 (Windows NT 6.2; WOW64; rv:32.0) Gecko/20100101 Firefox/32.0')]

# The site we will navigate into, handling its session
response = br.open(data_url)
html = response.get_data()

# do we need to log in?
logged_on = are_we_logged_on(html)

if not logged_on:
    print "DEBUG: Attempting to log in ..."

    # Select the first (index zero) form
    br.select_form(nr=0)

    # User credentials
    br.form['ctl00$ContentPlaceHolder1$LoginControl$m_userName'] = 'username'
    br.form['ctl00$ContentPlaceHolder1$LoginControl$m_password'] = 'password'

    # Login
    post_url, post_data, headers = br.form.click_request_data()
    print post_url
    print post_data
    print headers

    resp = urlopen(post_url, post_data)

    # Check if login successful
    html2 = resp.read()
    logged_on = are_we_logged_on(html2)

    if not logged_on:
        with open("icedump_fail.html", "w") as f:
            f.write(html2)
        print "DEBUG: Failed to logon. Aborting script ...!"
        exit(-1)

# If we got this far, then we are logged in ...
When I run the script, the path of execution always results in the "Failed to logon" message being printed to the screen.
Can anyone spot what I may be doing wrong? I'm fresh out of ideas, and perhaps a fresh pair of eyes is what is needed.
Turning on the "debug" mode (br.set_debug_http(True)) helped me inspect the underlying request mechanize was sending to submit the login form and compare it with the actual request sent when logging in through the browser.
This revealed that the __EVENTTARGET parameter was being sent empty when it should not be.
Here is the fixed part of code that helped me to solve the issue:
br.select_form(nr=0)
br.form.set_all_readonly(False)
br.form['ctl00$ContentPlaceHolder1$LoginControl$m_userName'] = 'username'
br.form['ctl00$ContentPlaceHolder1$LoginControl$m_password'] = 'password'
br.form['__EVENTTARGET'] = 'ctl00$ContentPlaceHolder1$LoginControl$LoginButton'
# Login
response = br.submit()
html2 = response.read()
logged_on = are_we_logged_on(html2)
As a side note, make sure there are no violations of the Agreement you are "digitally signing" while registering at "ICE":
Scraping: The scraping of this website for the purpose of extracting data automatically from this website is strictly prohibited by ICE and it should be noted that this process could result in a drain on ICE's system resources. ICE (or its affiliates, agents or contractors) may monitor usage of this website for scraping purposes and may take all necessary actions to ensure that access to this website is removed from entities carrying out or reasonably believed to be carrying out web scraping activities.
I would use Selenium, as it's fully featured and much more powerful. You can actually see results too:
from selenium import webdriver
chrome = webdriver.Chrome()
chrome.get('http://data.theice.com/ViewData/EndOfDay/LdnOptions.aspx?p=AER')
user = chrome.find_element_by_name('ctl00$ContentPlaceHolder1$LoginControl$m_userName')
pswd = chrome.find_element_by_name('ctl00$ContentPlaceHolder1$LoginControl$m_password')
form = chrome.find_element_by_name('ctl00_ContentPlaceHolder1_LoginControl_LoginButton')
user.send_keys(your_username_string)
pswd.send_keys(your_password_string)
form.click() # hit the login button
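If you go the Selenium route you can still reuse the are_we_logged_on() helper from the question by feeding it the rendered page source, for example:
# After clicking the login button, check whether the login form is gone
if are_we_logged_on(chrome.page_source):
    print("DEBUG: logged in")
else:
    print("DEBUG: login failed")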

Get HTML source, including result of javascript and authentication

I am building a web scraper and need to get the HTML page source as it actually appears on the page. However, I only get a limited HTML source, one that does not include the needed info. I think I am either seeing it before the JavaScript has run, or maybe I'm not getting the full info because I don't have the right authentication? My result is the same as Chrome's "View Source", when what I want is what Chrome's "Inspect Element" shows. My test case is cimber.dk after entering flight information and searching.
I am coding in Python and tried the urllib2 library. Then I heard that Selenium was good for this, so I tried that too. However, it also gets me the same limited page source.
This is what I tried with urllib2 after using Firebug to see the parameters. (I deleted all my cookies after opening cimber.dk, so I was starting with a 'clean slate'.)
import urllib
import urllib2

url = 'https://www.cimber.dk/booking/'
values = {'ARRANGE_BY' : 'D',...} #one for each value

opener = urllib2.build_opener(urllib2.HTTPCookieProcessor())
#Using HTTPRedirectHandler instead of HTTPCookieProcessor gives the same.
urllib2.install_opener(opener)

request = urllib2.Request(url)
opener.addheaders = [('User-Agent', 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:2.0) Gecko/20100101 Firefox/4.0')]
request.add_header(....) # one for each header, also the cookie one

p = urllib.urlencode(values)
data = opener.open(request, p).read()
# data is now the limited source, like Chrome View Source

#I tried to add the following in some vain attempt to do a redirect.
#The result is always "HTTP Error 400: Bad request"
f = opener.open('https://wftc2.e-travel.com/plnext/cimber/Override.action')
data = f.read()
f.close()
Most libraries like this do not support JavaScript.
If you want JavaScript, you will need to either automate an existing browser or browser engine, or use a really big, monolithic library that is essentially an advanced web crawler.
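For the browser-automation route, a small Selenium sketch that waits for the page's JavaScript to finish before reading the rendered source might look like this (the wait condition is a generic assumption, not something specific to cimber.dk):
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait

driver = webdriver.Firefox()
driver.get('https://www.cimber.dk/booking/')

# Wait until the browser reports the document has finished loading
WebDriverWait(driver, 30).until(
    lambda d: d.execute_script('return document.readyState') == 'complete'
)

rendered_html = driver.page_source  # post-JavaScript source, like "Inspect Element"
driver.quit()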
