"Max retries exceeded with url" error with Python Requests

I'm trying to connect to a website with the code below. There is no problem on Heroku, but I am getting an error on DigitalOcean.
Code:
headers = {"User-Agent":"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36"}
def web():
rq = requests.get("https://myurl.com/gts?search=word", headers = headers)
print(rq)
The error I get in DigitalOcean:
HTTPSConnectionPool(host='myurl.com', port=443): Max retries exceeded with url: /gts?search=word (Caused by ConnectTimeoutError(<urllib3.connection.HTTPSConnection object at 0x7fa364a89690>, 'Connection to myurl.com timed out. (connect timeout=None)'))
What I've tried, without success:
I added verify=False to the request and disabled the firewall from the console. I also tried to access the site from the droplet's console with the "curl -I myurl.com" command. All failed.
Thank you for your help.

It seems like the issue is not caused by a bug in your code, but rather by a firewall (or other rules) on the remote server side (sozluk.gov.tr).
I tried to run curl -I https://sozluk.gov.tr and also your snippet on my local machine and on a remote (DigitalOcean-hosted) VM - both worked fine.
You mentioned that the curl command from the console (I assume the remote VM console) did not work. This indicates that the issue is at the network level rather than the code level.
I recommend spinning up a new VM in a different region (to get an IP from a different pool) or using some kind of proxy that is not blocked by the remote server. You can check available regions (datacenters) on this link.
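If you go the proxy route, a minimal sketch with requests could look like this (the proxy address is a placeholder, and the explicit timeout just makes a blocked IP fail fast instead of hanging):
import requests

headers = {"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_4) AppleWebKit/537.36 "
                         "(KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36"}
proxies = {
    "https": "http://user:password@proxy.example.com:3128",  # placeholder proxy, not a real endpoint
}

rq = requests.get(
    "https://sozluk.gov.tr/gts?search=word",
    headers=headers,
    proxies=proxies,
    timeout=(5, 15),  # (connect, read) timeouts
)
print(rq.status_code)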
Responses
user@do-server:~$ curl -I https://sozluk.gov.tr
HTTP/1.1 200 OK
Server: nginx
Date: Sat, 18 Feb 2023 11:18:37 GMT
Content-Type: text/html; charset=UTF-8
Content-Length: 108975
Last-Modified: Fri, 13 Jan 2023 12:38:15 GMT
Connection: keep-alive
Vary: Accept-Encoding
ETag: "63c150b7-1a9af"
Access-Control-Allow-Origin: *
Access-Control-Allow-Methods: GET, POST, OPTIONS
Accept-Ranges: bytes
user@do-server:~$ python3
Python 3.10.6 (main, Nov 14 2022, 16:10:14) [GCC 11.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import requests
>>> headers = {"User-Agent":"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36"}
>>> requests.get("https://sozluk.gov.tr/gts?search=word", headers = headers)
<Response [200]>

Related

Unable to log in to a site requiring an unconventional payload to be sent with POST requests

I'm trying to log in to a website using the requests module. While creating a script to do so, I noticed that the payload used there is completely different from the conventional approach. This is exactly what the payload looks like: +åEMAIL"PASSWORD(0. The content type is content-type: application/grpc-web+proto.
The following is what I see in dev tools when I log in to that site manually:
General
--------------------------------------------------------
Request URL: https://grips-web.aboutyou.com/checkout.CheckoutV1/logInWithEmail
Request Method: POST
Status Code: 200
Remote Address: 104.18.9.228:443
Response Headers
--------------------------------------------------------
Referrer Policy: strict-origin-when-cross-origin
access-control-allow-credentials: true
access-control-allow-origin: https://www.aboutyou.cz
access-control-expose-headers: Content-Encoding, Vary, Access-Control-Allow-Origin, Access-Control-Allow-Credentials, Date, Content-Type, grpc-status, grpc-message
cf-cache-status: DYNAMIC
cf-ray: 67d009674f604a4d-SIN
content-encoding: gzip
content-type: application/grpc-web+proto
date: Wed, 11 Aug 2021 08:19:04 GMT
expect-ct: max-age=604800, report-uri="https://report-uri.cloudflare.com/cdn-cgi/beacon/expect-ct"
server: cloudflare
set-cookie: __cf_bm=a45185d4acac45725b46236884673503104a9473-1628669944-1800-Ab2Aos6ocz7q8B8v53oEsSK5QiImY/zqlTba/Y0FqpdsaQt2c10FJylcwTacmdovm6tjGd8hLdy/LidfFCtOj70=; path=/; expires=Wed, 11-Aug-21 08:49:04 GMT; domain=.aboutyou.com; HttpOnly; Secure; SameSite=None
vary: Origin
Request Headers
--------------------------------------------------------
:authority: grips-web.aboutyou.com
:method: POST
:path: /checkout.CheckoutV1/logInWithEmail
:scheme: https
accept: */*
accept-encoding: gzip, deflate, br
accept-language: en-US,en;q=0.9
cache-control: no-cache
content-length: 48
content-type: application/grpc-web+proto
origin: https://www.aboutyou.cz
pragma: no-cache
referer: https://www.aboutyou.cz/
sec-ch-ua: "Chromium";v="92", " Not A;Brand";v="99", "Google Chrome";v="92"
sec-ch-ua-mobile: ?0
sec-fetch-dest: empty
sec-fetch-mode: cors
sec-fetch-site: cross-site
user-agent: Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.131 Safari/537.36
x-grpc-web: 1
Request Payload
--------------------------------------------------------
+åEMAIL"PASSWORD(0
This is what I've created so far (can't find any way to fill in the payload):
import requests
from bs4 import BeautifulSoup

start_url = 'https://www.aboutyou.cz/'
post_link = 'https://grips-web.aboutyou.com/checkout.CheckoutV1/logInWithEmail'

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.131 Safari/537.3',
    'content-type': 'application/grpc-web+proto',
    'origin': 'https://www.aboutyou.cz',
    'referer': 'https://www.aboutyou.cz/',
    'x-grpc-web': '1'
}
payload = {
}

with requests.Session() as s:
    s.headers.update(headers)
    r = s.post(post_link, data=payload)
    print(r.status_code)
    print(r.url)
Steps to log in to that site manually:
Go to this site
This is how to get the login form
Login form looks like this
How can I log in to that site using requests module?
I don't think that you'll be able to use Python Requests to log in to your target site.
Your post_link url:
post_link = 'https://grips-web.aboutyou.com/checkout.CheckoutV1/logInWithEmail'
indicates that it is a gRPC endpoint. gRPC requires HTTP/2, and Python Requests sends HTTP/1.1 requests only.
Additionally, I noted that the target site also uses Cloudflare, which is difficult to bypass with Python, especially when using Python Requests:
'Set-Cookie': '__cf_bm=11d867459fe0951da4157b475cf88eb3ab7658fb-1629229293-1800-AeFomlmROcmUYcRosxxcSnoJkGOW/WXjUe1WxK6SkM2eXIbnAqXRlpwOkpvOfONrbApJd4Qwj+a8+kOzLAfpHIE=; path=/; expires=Tue, 17-Aug-21 20:11:33 GMT; domain=.aboutyou.com; HttpOnly; Secure; SameSite=None', 'Vary': 'Accept-Encoding', 'Server': 'cloudflare', 'CF-RAY': '6805616b8facf1b2-ATL', 'Content-Encoding': 'gzip'}
Here are previous Stack Overflow questions on Python Requests with gRPC
Can't make gRPC work with python requests rest api call
Send plain JSON to a gRPC server using python
I looked through the GitHub repository for Python Requests and saw that HTTP/2 has been a requested feature for almost 7 years.
During my research, I discovered HTTPX, which is an HTTP client for Python 3 that provides sync and async APIs and support for both HTTP/1.1 and HTTP/2. The documentation states that the package is stable, but it is still considered a beta at this point. I would recommend trying HTTPX to see if it solves your issue with logging in to your target site.
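For what it is worth, a minimal HTTPX sketch could look like the one below; note that the HTTP/2 extra must be installed, and the payload bytes are a placeholder, since reproducing the grpc-web frame is exactly the open problem here:
import httpx  # needs the HTTP/2 extra: pip install "httpx[http2]"

post_link = 'https://grips-web.aboutyou.com/checkout.CheckoutV1/logInWithEmail'
headers = {
    'content-type': 'application/grpc-web+proto',
    'origin': 'https://www.aboutyou.cz',
    'referer': 'https://www.aboutyou.cz/',
    'x-grpc-web': '1',
}
payload = b''  # placeholder: the raw grpc-web frame captured in dev tools would go here

with httpx.Client(http2=True) as client:
    r = client.post(post_link, headers=headers, content=payload)
    print(r.status_code, r.http_version)  # http_version shows whether HTTP/2 was negotiated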

python urllib returns empty page for specific urls

I am having trouble with specific links when using urllib. Below is the code sample I use:
from urllib.request import Request, urlopen
import re
url = ""
req = Request(url)
html_page = urlopen(req).read()
print(len(html_page))
Here are the results I get for two links:
url = "https://www.dafont.com"
Length: 0
url = "https://www.stackoverflow.com"
Length: 196673
Anyone got any idea why this happens?
Try using a browser User-Agent header, as in the snippet below. You will get the response. Certain websites are secured and only respond to certain user agents.
from urllib.request import Request, urlopen
url = "https://www.dafont.com"
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36"}
req = Request(url, headers=headers)
html_page = urlopen(req).read()
print(len(html_page))
This is a limitation imposed by the authors of the dafont website.
By default, urllib sends a User-Agent header of urllib/VVV, where VVV is the urllib version number (for more, see https://docs.python.org/3/library/urllib.request.html). Many webmasters protect themselves from crawlers by parsing the User-Agent header, so when they come across a User-Agent like urllib/VVV, they treat it as a crawler.
Testing HEAD method:
$ curl -A "Python-urllib/2.6" -I https://www.dafont.com
HTTP/1.1 200 OK
Date: Sun, 13 Jun 2021 15:11:53 GMT
Server: Apache
Strict-Transport-Security: max-age=63072000; includeSubDomains; preload
Content-Type: text/html
$ curl -I https://www.dafont.com
HTTP/1.1 200 OK
Date: Sun, 13 Jun 2021 15:12:02 GMT
Server: Apache
Strict-Transport-Security: max-age=63072000; includeSubDomains; preload
Set-Cookie: PHPSESSID=dcauh0dp1antb7eps1smfg2a76; path=/
Expires: Thu, 19 Nov 1981 08:52:00 GMT
Cache-Control: no-store, no-cache, must-revalidate, post-check=0, pre-check=0
Pragma: no-cache
Content-Type: text/html
Testing GET method:
$ curl -sSL -A "Python-urllib/2.6" https://www.dafont.com | wc -c
0
$ curl -sSL https://www.dafont.com | wc -c
18543
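For completeness, here is a minimal sketch (my own, not part of the curl tests above) of overriding urllib's default User-Agent globally via an opener, so that every urlopen() call sends a browser-like header:
from urllib.request import build_opener, install_opener, urlopen

opener = build_opener()
opener.addheaders = [("User-Agent",
                      "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
                      "(KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36")]
install_opener(opener)  # every subsequent urlopen() call now sends this header

html_page = urlopen("https://www.dafont.com").read()
print(len(html_page))  # non-zero once the crawler-looking default User-Agent is gone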

Access denied error only on heroku remote machine through requests python

I am facing this issue where, when I access the page source of a URL from my local machine, it works fine, but when I run the same piece of code on a Heroku machine it shows Access Denied.
I have tried changing the headers (like adding a Referer or changing the User-Agent), but none of those solutions are working.
LOCAL MACHINE
~/Development/repos/eater-list  master  python manage.py shell  1 ↵  12051  21:15:32
>>> from accounts.zomato import *
>>> z = ZomatoAPI()
>>> response = z.page_source(url='https://www.zomato.com/ncr/the-immigrant-cafe-khan-market-new-delhi')
>>> response[0:50]
'<!DOCTYPE html>\n<html lang="en" prefix="og: http'
>>> response[0:100]
'<!DOCTYPE html>\n<html lang="en" prefix="og: http://ogp.me/ns#" >\n<head>\n <meta charset="utf-8"
REMOTE MACHINE
~ $ python manage.py shell
Python 3.5.7 (default, Jul 17 2019, 15:27:27)
[GCC 7.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
(InteractiveConsole)
>>> from accounts.zomato import *
>>> z = ZomatoAPI()
>>> response = z.page_source(url='https://www.zomato.com/ncr/the-immigrant-cafe-khan-market-new-delhi')
>>> response
'<HTML><HEAD>\n<TITLE>Access Denied</TITLE>\n</HEAD><BODY>\n<H1>Access Denied</H1>\n \nYou don\'t have permission to access "http://www.zomato.com/ncr/the-immigrant-cafe-khan-market-new-delhi" on this server.<P>\nReference #18.56273017.1572225939.46ec5af\n</BODY>\n</HTML>\n'
>>>
ZOMATO API CODE
There is no change in headers or requests version.
class ZomatoAPI:
    def __init__(self):
        self.user_key = api_key
        self.headers = {
            'Accept': 'application/json',
            'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) '
                          'Chrome/77.0.3865.90 Safari/537.36',
            'user-key': self.user_key}

    def page_source(self, url):
        fng = requests.session()
        page_source = fng.get(
            url, headers=self.headers).content.decode("utf-8")
        return page_source
Will appreciate some advice on it.
Check the response HTTP status code. It might be that Heroku's IP is simply banned from Zomato. This is more common than one might believe -- services like Cloudflare often put common cloud-provider IPs on a "banned list".
Here is what you should be looking for regarding the HTTP status code, to give you more context.
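As a minimal sketch (reusing the names from the ZomatoAPI snippet above), surface the status code instead of silently decoding the body, so a ban shows up immediately:
import requests

def page_source(url, headers):
    with requests.Session() as fng:
        resp = fng.get(url, headers=headers)
        print(resp.status_code)  # e.g. 200 locally vs. 403 from a blocked Heroku IP
        resp.raise_for_status()  # raises requests.HTTPError on 4xx/5xx instead of returning "Access Denied" HTML
        return resp.content.decode("utf-8")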

Python Requests (Web Scraping) - Building a cookie to be able to view data in a website

I'm trying to scrape a finance website to make an application that compares the accuracy of financial data from various other websites (google/yahoo finance).
The URL I am trying to scrape (specifically the stock's "Key Data" like Market Cap, Volume, Etc) is here:
https://www.marketwatch.com/investing/stock/sbux
I've figured out (with the help of others) that a cookie must be built and sent with each request in order for the page to display the data (otherwise the page html response pretty much returns empty).
I used Opera/Firefox/Chrome browsers to look into the HTTP Headers and requests that are being sent back from the browser. I've come to the conclusion that there are 3 steps/requests that need to be done to receive all the cookie data and build it piece by piece.
Step/Request 1
Simply visiting the above URL.
GET /investing/stock/sbux HTTP/1.1
Host: www.marketwatch.com:443
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8
Accept-Encoding: gzip, deflate, br
Accept-Language: en-US,en;q=0.9
Upgrade-Insecure-Requests: 1
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36 OPR/55.0.2994.44
HTTP/1.1 200 OK
Cache-Control: max-age=0, no-cache, no-store
Connection: keep-alive
Content-Length: 579
Content-Type: text/html; charset=utf-8
Date: Sun, 26 Aug 2018 05:12:16 GMT
Expires: Sun, 26 Aug 2018 05:12:16 GMT
Pragma: no-cache
Step/Request 2
I am not sure where this "POST" URL came from. However, using Firefox and viewing network connections, this URL popped up in the "Trace Stack" tab. Again, I have no idea where to get this URL, or whether it is the same for everyone or randomly created. I also don't know what POST data is being sent or where the values of X-Hash-Result and X-Token-Value came from. However, this request returns a very important value in the response header with the following line: 'Set-Cookie: ncg_g_id_zeta=701c19ee3f45d07b56b40fb8e313214d'. This piece of the cookie is crucial for the next request in order to return the full cookie and receive the data on the web page.
POST /149e9513-01fa-4fb0-aad4-566afd725d1b/2d206a39-8ed7-437e-a3be-862e0f06eea3/fingerprint HTTP/1.1
Host: www.marketwatch.com:443
Accept: */*
Accept-Encoding: gzip, deflate, br
Accept-Language: en-US,en;q=0.9
Content-Type: application/json; charset=UTF-8
Origin: https://www.marketwatch.com
Referer: https://www.marketwatch.com/investing/stock/sbux
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36 OPR/55.0.2994.44
X-Hash-Result: 701c19ee3f45d07b56b40fb8e313214d
X-Token-Value: 900c4055-ef7a-74a8-e9ec-f78f7edc363b
HTTP/1.1 200 OK
Cache-Control: max-age=0, no-cache, no-store
Connection: keep-alive
Content-Length: 17
Content-Type: application/json; charset=utf-8
Date: Sun, 26 Aug 2018 05:12:16 GMT
Expires: Sun, 26 Aug 2018 05:12:16 GMT
Pragma: no-cache
Set-Cookie: ncg_g_id_zeta=701c19ee3f45d07b56b40fb8e313214d; Path=/; HttpOnly
Step/Request 3
This request is sent to the original URL with the cookie picked up in step 2. The full cookie is then returned in the response, which can be used in step 1 to avoid going through steps 2 and 3 again. It will also display the full page of data.
GET /investing/stock/sbux HTTP/1.1
Host: www.marketwatch.com:443
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8
Accept-Encoding: gzip, deflate, br
Accept-Language: en-US,en;q=0.9
Cookie: ncg_g_id_zeta=701c19ee3f45d07b56b40fb8e313214d
Referer: https://www.marketwatch.com/investing/stock/sbux
Upgrade-Insecure-Requests: 1
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36 OPR/55.0.2994.44
HTTP/1.1 200 OK
Cache-Control: max-age=0, no-cache, no-store
Connection: keep-alive
Content-Encoding: gzip
Content-Length: 62944
Content-Type: text/html; charset=utf-8
Date: Sun, 26 Aug 2018 05:12:17 GMT
Expires: Sun, 26 Aug 2018 05:12:17 GMT
Pragma: no-cache
Server: Kestrel
Set-Cookie: seenads=0; expires=Sun, 26 Aug 2018 23:59:59 GMT; domain=.marketwatch.com; path=/
Set-Cookie: mw_loc=%7B%22country%22%3A%22CA%22%2C%22region%22%3A%22ON%22%2C%22city%22%3A%22MARKHAM%22%2C%22county%22%3A%5B%22%22%5D%2C%22continent%22%3A%22NA%22%7D; expires=Sat, 01 Sep 2018 23:59:59 GMT; domain=.marketwatch.com; path=/
Vary: Accept-Encoding
x-frame-options: SAMEORIGIN
x-machine: 8cfa9f20bf3eb
Summary
In summary, step 2 is the most important for getting the remaining cookie piece... but I can't figure out these 3 things (a rough sketch of the whole flow follows the list):
1) Where does the POST URL come from? (It's not embedded in the original page; is it the same for everyone or randomly generated by the site?)
2) What data is being sent in the POST request?
3) Where do X-Hash-Result and X-Token-Value come from? Are they required to be sent in the headers with the request?
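For reference, the three-step flow described above would look roughly like this with requests.Session; the POST body and the X-Hash-Result / X-Token-Value values are placeholders copied from the captured request, since obtaining them is exactly what questions 1-3 ask:
import requests

UA = ('Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_4) AppleWebKit/537.36 '
      '(KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36 OPR/55.0.2994.44')

with requests.Session() as s:
    s.headers.update({'User-Agent': UA})

    # Step 1: plain GET (returns the stripped-down page, no useful cookie yet)
    s.get('https://www.marketwatch.com/investing/stock/sbux')

    # Step 2: POST to the fingerprint URL; path, body, and X-* values are placeholders
    # taken from the capture above; I do not know how to derive them
    s.post(
        'https://www.marketwatch.com/149e9513-01fa-4fb0-aad4-566afd725d1b/'
        '2d206a39-8ed7-437e-a3be-862e0f06eea3/fingerprint',
        json={},  # unknown POST data
        headers={
            'X-Hash-Result': '701c19ee3f45d07b56b40fb8e313214d',
            'X-Token-Value': '900c4055-ef7a-74a8-e9ec-f78f7edc363b',
            'Referer': 'https://www.marketwatch.com/investing/stock/sbux',
        },
    )  # on success this should set the ncg_g_id_zeta cookie on the session

    # Step 3: repeat the GET; the session now sends the cookie picked up in step 2
    r = s.get('https://www.marketwatch.com/investing/stock/sbux')
    print(len(r.text))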
I tried to get the cookie-string assembly to work. MarketWatch has done a fairly decent job of protecting their data. In order to build the entire cookie you will need a wsj API key (their site's finance-data supplier, I think) and some hidden variables that are potentially only available to the client's server and strictly withheld based on your web driver, or lack thereof.
For example, if you try to hit the following endpoint with requests: POST https://browser.pipe.aria.microsoft.com/Collector/3.0/?qsp=true&content-type=application/bond-compact-binary&client-id=NO_AUTH&sdk-version=ACT-Web-JS-2.7.1&x-apikey=c34cce5c21da4a91907bc59bce4784fb-42e261e9-5073-49df-a2e1-42415e012bc6-6954
you'll get a 400 unauthorized error.
Remember there is also a good chance that the client host server cluster master and the various APIs it communicates with are communicating without our browsers being able to pick up the network traffic. This could be done through a middleware of some sort for example. I believe this could account for the missing X-Hash-Result and X-Token-Value values.
I am not saying it is impossible to build this cookie string, just that it is an inefficient route to take in terms of development time and effort. I also now question this method's ease of scalability in terms of using different tickers besides AAPL. Unless there is an explicit requirement to not use a web-driver and/or the script needs to be highly portable without any configuration allowed outside of pip install, I wouldn't choose this method.
That essentially leaves us with either a Scrapy spider or a Selenium scraper (plus a little extra environment configuration, unfortunately, but these are very important skills to learn if you want to write and deploy web scrapers; generally speaking, requests + bs4 is for ideal easy scrapes or unusual code-portability needs).
I went ahead and wrote a Selenium Scraper ETL Class using a PhantomJS Web-driver for you. It accepts a ticker string as a parameter and works on other stocks besides AAPL. It was tricky since marketwatch.com will not redirect traffic from a PhantomJS Web-driver (I can tell that they have spent a lot of resources trying to discourage web scrapers btw. Much much more so than say yahoo.com).
Anyway, here is the final Selenium script; it runs on Python 2 and 3:
# Market Watch Test Scraper ETL
# Tested on python 2.7 and 3.5
# IMPORTANT: Ensure PhantomJS Web Driver is configured and installed
import pip
import sys
import signal
import time


# Package installer function to handle missing packages
def install(package):
    print(package + ' package for Python not found, pip installing now....')
    pip.main(['install', package])
    print(package + ' package has been successfully installed for Python\n Continuing Process...')


# Ensure beautifulsoup4 is installed
try:
    from bs4 import BeautifulSoup
except:
    install('beautifulsoup4')
    from bs4 import BeautifulSoup

# Ensure selenium is installed
try:
    from selenium import webdriver
    from selenium.webdriver.common.desired_capabilities import DesiredCapabilities
except:
    install('selenium')
    from selenium import webdriver
    from selenium.webdriver.common.desired_capabilities import DesiredCapabilities


# Class to extract and transform raw marketwatch.com financial data
class MarketWatchETL:

    def __init__(self, ticker):
        self.ticker = ticker.upper()
        # Set up desired capabilities to spoof Firefox since marketwatch.com rejects any PhantomJS Request
        self._dcap = dict(DesiredCapabilities.PHANTOMJS)
        self._dcap["phantomjs.page.settings.userAgent"] = ("Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_4) "
                                                           "AppleWebKit/537.36 (KHTML, like Gecko) "
                                                           "Chrome/29.0.1547.57 Safari/537.36")
        self._base_url = 'https://www.marketwatch.com/investing/stock/'
        self._retries = 10

    # Private Static Method to clean and organize Key Data Extract
    @staticmethod
    def _cleaned_key_data_object(raw_data):
        cleaned_data = {}
        raw_labels = raw_data['labels']
        raw_values = raw_data['values']
        i = 0
        for raw_label in raw_labels:
            raw_value = raw_values[i]
            cleaned_data.update({str(raw_label.get_text()): raw_value.get_text()})
            i += 1
        return cleaned_data

    # Private Method to scrape data from MarketWatch's web page
    def _scrape_financial_key_data(self):
        raw_data_obj = {}
        try:
            driver = webdriver.PhantomJS(desired_capabilities=self._dcap)
        except:
            print('***SETUP ERROR: The PhantomJS Web Driver is either not configured or incorrectly configured!***')
            sys.exit(1)
        driver.get(self._base_url + self.ticker)
        i = 0
        while i < self._retries:
            try:
                time.sleep(3)
                html = driver.page_source
                soup = BeautifulSoup(html, "html.parser")
                labels = soup.find_all('small', class_="kv__label")
                values = soup.find_all('span', class_="kv__primary")
                if labels and values:
                    raw_data_obj.update({'labels': labels})
                    raw_data_obj.update({'values': values})
                    break
                else:
                    i += 1
            except:
                i += 1
                continue
        if i == self._retries:
            print('Please check your internet connection!\nUnable to connect...')
            sys.exit(1)
        driver.service.process.send_signal(signal.SIGTERM)
        driver.quit()
        return raw_data_obj

    # Public Method to return a Stock's Key Data Object
    def get_stock_key_data(self):
        raw_data = self._scrape_financial_key_data()
        return self._cleaned_key_data_object(raw_data)


# Script's Main Process to test MarketWatchETL('TICKER')
if __name__ == '__main__':
    # Run financial key data extracts for Microsoft, Apple, and Wells Fargo
    msft_key_data = MarketWatchETL('MSFT').get_stock_key_data()
    aapl_key_data = MarketWatchETL('AAPL').get_stock_key_data()
    wfc_key_data = MarketWatchETL('WFC').get_stock_key_data()
    # Print result dictionaries
    print(msft_key_data.items())
    print(aapl_key_data.items())
    print(wfc_key_data.items())
Which outputs:
dict_items([('Rev. per Employee', '$841.03K'), ('Short Interest', '44.63M'), ('Yield', '1.53%'), ('Market Cap', '$831.23B'), ('Open', '$109.27'), ('EPS', '$2.11'), ('Shares Outstanding', '7.68B'), ('Ex-Dividend Date', 'Aug 15, 2018'), ('Day Range', '108.51 - 109.64'), ('Average Volume', '25.43M'), ('Dividend', '$0.42'), ('Public Float', '7.56B'), ('P/E Ratio', '51.94'), ('% of Float Shorted', '0.59%'), ('52 Week Range', '72.05 - 111.15'), ('Beta', '1.21')])
dict_items([('Rev. per Employee', '$2.08M'), ('Short Interest', '42.16M'), ('Yield', '1.34%'), ('Market Cap', '$1.04T'), ('Open', '$217.15'), ('EPS', '$11.03'), ('Shares Outstanding', '4.83B'), ('Ex-Dividend Date', 'Aug 10, 2018'), ('Day Range', '216.33 - 218.74'), ('Average Volume', '24.13M'), ('Dividend', '$0.73'), ('Public Float', '4.82B'), ('P/E Ratio', '19.76'), ('% of Float Shorted', '0.87%'), ('52 Week Range', '149.16 - 219.18'), ('Beta', '1.02')])
dict_items([('Rev. per Employee', '$384.4K'), ('Short Interest', '27.44M'), ('Yield', '2.91%'), ('Market Cap', '$282.66B'), ('Open', '$58.87'), ('EPS', '$3.94'), ('Shares Outstanding', '4.82B'), ('Ex-Dividend Date', 'Aug 9, 2018'), ('Day Range', '58.76 - 59.48'), ('Average Volume', '18.45M'), ('Dividend', '$0.43'), ('Public Float', '4.81B'), ('P/E Ratio', '15.00'), ('% of Float Shorted', '0.57%'), ('52 Week Range', '49.27 - 66.31'), ('Beta', '1.13')])
The only extra step you will need to take prior to running this is to install and configure the PhantomJS web driver on your deployment environments. If you need to automate the deployment of a web scraper like this, you could write a bash/PowerShell installer script to handle pre-configuring your environment's PhantomJS.
Some resources for Installing and Configuring PhantomJS:
Windows/Mac PhantomJS Installation Executables
Debian Linux PhantomJS Installation Guide
RHEL PhantomJS Installation Guide
I just doubt the practicability and even possibility of assembling the Cookie in the manner I suggested in your prior post.
I think the other practical possibility here is to write a Scrapy Crawler.

Fuzzy Logic: How to detect when a 404 is not actually an error page?

I'm running into the strangest situation, in which a site (http://seventhgeneration.com/mission) erroneously returns a 404 response code.
I'm writing an automatic test suite which tests all the links within a site and tests that they aren't broken. In this case I'm testing a site which links to http://seventhgeneration.com/mission, though I have no control over the Seventh Generation Mission page. This page works in the browser, although it does return a 404 in the network monitor.
Is there any technical way to validate this page as not an error page, while correctly detecting other pages (e.g. https://github.com/thisShouldNotExist) as 404s? As someone mentioned in the comments, the Seventh Generation site does have a 404 page that appears for other broken URLs: http://seventhgeneration.com/shouldNotExist
# -*- coding: utf-8 -*-
import traceback
import urllib2
import httplib

url = 'http://seventhgeneration.com/mission'

HEADERS = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36',
           # 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11',
           'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
           'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3',
           'Accept-Encoding': 'gzip, deflate',
           'Accept-Language': 'en-US,en;q=0.8',
           'Connection': 'keep-alive'}

request = urllib2.Request(url, headers=HEADERS)

try:
    response = urllib2.urlopen(request)
    response_header = response.info()
    print "Success: %s - %s" % (response.code, response_header)
except urllib2.HTTPError, e:
    print 'urllib2.HTTPError %s - %s' % (e.code, e.headers)
except urllib2.URLError, e:
    print "Unknown URLError: %s" % (e.reason)
except httplib.BadStatusLine as e:
    print "Bad Status Error. (Presumably, the server closed the connection before sending a valid response)"
except Exception:
    print "Unknown Exception: %s" % (traceback.format_exc())
When run, this script returns:
urllib2.HTTPError 404 - Server: nginx
Content-Type: text/html; charset=utf-8
X-Drupal-Cache: HIT
Etag: "1422054308-1"
Content-Language: en
Link: </node/1523879>; rel="shortlink",</404>; rel="canonical",</node/1523879>; rel="shortlink",</404>; rel="canonical"
X-Generator: Drupal 7 (http://drupal.org)
Cache-Control: public, max-age=21600
Last-Modified: Fri, 23 Jan 2015 23:05:08 +0000
Expires: Sun, 19 Nov 1978 05:00:00 GMT
Vary: Cookie,Accept-Encoding
Content-Encoding: gzip
X-Request-ID: v-82b55230-a357-11e4-94fe-1231380988d9
X-AH-Environment: prod
Content-Length: 11441
Accept-Ranges: bytes
Date: Fri, 23 Jan 2015 23:28:17 GMT
X-Varnish: 2729940224
Age: 0
Via: 1.1 varnish
Connection: close
X-Cache: MISS
This server is clearly not conforming to the HTTP specification. It is returning the entire web page as the body of a response that is supposed to describe why the 404 error is occurring. You need to fix that, not find ways to get around it.
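If you want to see the behaviour for yourself, urllib2's HTTPError doubles as a response object, so you can read the body it carries and confirm that the "404" comes with the full page attached (a minimal Python 2 sketch in the style of the question's code):
import urllib2

url = 'http://seventhgeneration.com/mission'
request = urllib2.Request(url, headers={'User-Agent': 'Mozilla/5.0'})
try:
    urllib2.urlopen(request)
except urllib2.HTTPError as e:
    body = e.read()          # HTTPError is also a file-like response object
    print e.code, len(body)  # prints 404 together with kilobytes of page content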
