Detect if a web page has changed - Python

In my Python application I have to read many web pages to collect data. To reduce the number of HTTP calls I would like to fetch only the pages that have changed. My problem is that my code always tells me that the pages have changed (code 200), when in reality they have not.
This is my code:
from models import mytab
import re
import urllib2
from wsgiref.handlers import format_date_time
from datetime import datetime
from time import mktime
def url_change():
    urls = mytab.objects.all()
    # this is some urls:
    # http://www.venere.com/it/pensioni/venezia/pensione-palazzo-guardi/#reviews
    # http://www.zoover.it/italia/sardegna/cala-gonone/san-francisco/hotel
    # http://www.orbitz.com/hotel/Italy/Venice/Palazzo_Guardi.h161844/#reviews
    # http://it.hotels.com/ho292636/casa-del-miele-susegana-italia/
    # http://www.expedia.it/Venezia-Hotel-Palazzo-Guardi.h1040663.Hotel-Information#reviews
    # ...
    for url in urls:
        request = urllib2.Request(url.url)
        if url.last_date == None:
            now = datetime.now()
            stamp = mktime(now.timetuple())
            url.last_date = format_date_time(stamp)
            url.save()
        request.add_header("If-Modified-Since", url.last_date)
        try:
            response = urllib2.urlopen(request)  # Make the request
            # some actions
            now = datetime.now()
            stamp = mktime(now.timetuple())
            url.last_date = format_date_time(stamp)
            url.save()
        except urllib2.HTTPError, err:
            if err.code == 304:
                print "nothing...."
            else:
                print "Error code:", err.code
            pass
I do not understand what has gone wrong. Can anyone help me?

Web servers aren't required to send a 304 as the response when you send an 'If-Modified-Since' header. They're free to send an HTTP 200 and send the entire page again.
Sending an 'If-Modified-Since' or 'If-None-Match' header alerts the server that you'd like a cached response if one is available. It's like sending an 'Accept-Encoding: gzip, deflate' header -- you're just telling the server you'll accept something, not requiring it.
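One practical way around this, sketched below (this is not the asker's code; the URL and the stored Last-Modified value are placeholders), is to still send the conditional header but treat it only as a hint: if the server answers 200 anyway, compare a hash of the body with the hash stored from the previous fetch to decide whether the page really changed.
import hashlib
import requests

url = 'http://www.example.com/page'                # placeholder URL
last_modified = 'Sat, 01 Jan 2011 00:00:00 GMT'    # value saved from a previous fetch

resp = requests.get(url, headers={'If-Modified-Since': last_modified})
if resp.status_code == 304:
    print('server honoured the conditional request - page unchanged')
else:
    # The server ignored the header (or the page really changed):
    # fall back to hashing the body and comparing with the stored hash.
    body_hash = hashlib.sha1(resp.content).hexdigest()
    print('new body hash: ' + body_hash)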

A good way to check whether a site returns 304 is to use Google Chrome's dev tools. For example, watch the Network panel while browsing the BLS website: keep refreshing and you will see that the server keeps returning 304. If you force a refresh with Ctrl+F5 (Windows), you will see that it instead returns status code 200.
You can use this technique on your example to find out whether the server simply never returns 304, or whether you have formatted your request headers incorrectly. Sometimes a web page imports a resource that does not respect the If- headers, so it returns 200 whatever you do (if any resource on the page does not return 304, the whole page returns 200). But if you are only interested in a specific part of a website, you can cheat by loading that resource directly and bypassing the whole document.

Related

How to make a request inside a simple mitmproxy script?

Good day,
I am currently trying to figure out a way to make non-blocking requests inside a simple mitmproxy script, but the documentation doesn't seem clear to me at first glance.
I think it's probably easiest if I show my current code and describe my issue below:
from copy import copy
from mitmproxy import http

def request(flow: http.HTTPFlow):
    headers = copy(flow.request.headers)
    headers.update({"Authorization": "<removed>", "Requested-URI": flow.request.pretty_url})
    req = http.HTTPRequest(
        first_line_format="origin_form",
        scheme=flow.request.scheme,
        port=443,
        path="/",
        http_version=flow.request.http_version,
        content=flow.request.content,
        host="my.api.xyz",
        headers=headers,
        method=flow.request.method
    )
    print(req.get_text())
    flow.response = http.HTTPResponse.make(
        200, req.content,
    )
Basically I would like to intercept any HTTP(S) request made and issue a non-blocking request to an API endpoint at https://my.api.xyz/, which should take all original headers and return a PNG screenshot of the originally requested URL.
However, the code above produces empty content and the print returns nothing either.
My issue seems to be related to: mtmproxy http get request in script and Resubmitting a request from a response in mitmproxy, but I still couldn't figure out a proper way of sending requests inside mitmproxy.
The following piece of code probably does what you are looking for:
from copy import copy
from mitmproxy import http
from mitmproxy import ctx
from mitmproxy.addons import clientplayback

def request(flow: http.HTTPFlow):
    ctx.log.info("Inside request")
    if hasattr(flow.request, 'is_custom'):
        return
    headers = copy(flow.request.headers)
    headers.update({"Authorization": "<removed>", "Requested-URI": flow.request.pretty_url})
    req = http.HTTPRequest(
        first_line_format="origin_form",
        scheme='http',
        port=8000,
        path="/",
        http_version=flow.request.http_version,
        content=flow.request.content,
        host="localhost",
        headers=headers,
        method=flow.request.method
    )
    req.is_custom = True
    playback = ctx.master.addons.get('clientplayback')
    f = flow.copy()
    f.request = req
    playback.start_replay([f])
It uses the clientplayback addon to send out the request. When this new request is sent, it will itself trigger another request event, which would lead to an infinite loop. That is the reason for the is_custom attribute added to the request: if the request that generated this event is one we created ourselves, we don't create a new request from it.
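For completeness, a usage note rather than part of the answer: an addon script like this is normally loaded by passing it to mitmproxy or mitmdump on the command line, e.g. mitmdump -s my_addon.py (the file name here is just a placeholder).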

Python + requests + splinter: What's the fastest/best way to make multiple concurrent 'get' requests?

Currently taking a web scraping class with other students, and we are supposed to make 'get' requests to a dummy site, parse it, and visit another site.
The problem is that the content of the dummy site is only up for several minutes before it disappears, and then it comes back up at a certain interval. During the time the content is available, everyone tries to make 'get' requests at once, so mine just hangs until everyone else clears out, and by then the content has disappeared. So I end up not being able to successfully make the 'get' request:
import requests
from splinter import Browser
browser = Browser('chrome')
# Hangs here
requests.get('http://dummysite.ca').text
# Even if get is successful hangs here as well
browser.visit(parsed_url)
So my question is, what's the fastest/best way to make endless concurrent 'get' requests until I get a response?
Decide to use either requests or splinter:
- Read about Requests: HTTP for Humans
- Read about Splinter
Related reading:
- Read about keep-alive
- Read about blocking-or-non-blocking
- Read about timeouts
- Read about errors-and-exceptions
If you can get requests that do not hang, you can think of repeating the request until it succeeds, for instance:
import time
import requests

while True:
    response = requests.get('http://dummysite.ca')
    if response.status_code == 200:  # request was successful
        break
    time.sleep(1)
Gevent provides a framework for running asynchronous network requests.
It can patch Python's standard library so that existing libraries like requests and splinter work out of the box.
Here is a short example, based on the code above, of how to make 10 concurrent requests and get their responses.
from gevent import monkey
monkey.patch_all()

import gevent.pool
import requests

pool = gevent.pool.Pool(size=10)
greenlets = [pool.spawn(requests.get, 'http://dummysite.ca')
             for _ in range(10)]

# Wait for all requests to complete
pool.join()

for greenlet in greenlets:
    # This will raise any exceptions raised by the request
    # Need to catch errors, or check if an exception was
    # thrown by checking `greenlet.exception`
    response = greenlet.get()
    text_response = response.text
You could also use map and a response-handling function instead of get.
See the gevent documentation for more information.
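As a rough sketch of that map variant (reusing the pool and the monkey-patched requests from the example above; the URL is the same dummy one):
# Sketch only: assumes the `pool` and monkey-patching set up in the example above.
urls = ['http://dummysite.ca'] * 10
responses = pool.map(requests.get, urls)  # blocks until all finish; request exceptions propagate here
texts = [r.text for r in responses]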
In this situation, concurrency will not help much since the server seems to be the limiting factor. One solution is to send a request with a timeout; if the timeout is exceeded, retry the request after a few seconds and gradually increase the time between retries until you get the data you want. For instance, your code might look like this:
import time
import requests

def get_content(url, timeout):
    # raises a Timeout exception if more than `timeout` seconds have passed
    resp = requests.get(url, timeout=timeout)
    # raise a generic exception if the request is unsuccessful
    if resp.status_code != 200:
        raise LookupError('status is not 200')
    return resp.content

timeout = 5  # seconds
retry_interval = 0
max_retry_interval = 120

while True:
    try:
        response = get_content('https://example.com', timeout=timeout)
        retry_interval = 0  # reset retry interval after success
        break
    except (LookupError, requests.exceptions.Timeout):
        retry_interval += 10
        if retry_interval > max_retry_interval:
            retry_interval = max_retry_interval
        time.sleep(retry_interval)

# process response
If concurrency is required, consider the Scrapy project. It uses the Twisted framework. In Scrapy you can replace time.sleep with reactor.callLater(delay, fn, *args, **kw) or use one of its hundreds of middleware plugins.
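A minimal plain-Twisted sketch of that substitution (not Scrapy-specific; the delay and the callback below are placeholders):
from twisted.internet import reactor

def retry_request():
    # placeholder: issue the next request attempt here
    print('retrying now')

# schedule retry_request() 10 seconds from now instead of blocking with time.sleep(10)
reactor.callLater(10, retry_request)
reactor.callLater(11, reactor.stop)  # stop the event loop shortly afterwards, just for this demo
reactor.run()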
From the documentation for requests:
If the remote server is very slow, you can tell Requests to wait
forever for a response, by passing None as a timeout value and then
retrieving a cup of coffee.
import requests

# Wait potentially forever
r = requests.get('http://dummysite.ca', timeout=None)

# Check the status code to see how the server is handling the request
print r.status_code
Status codes beginning with 2 mean the request was received, understood, and accepted. 200 means the request was a success and the information was returned, while 503 means the server is overloaded or undergoing maintenance.
Requests used to include a module called async which could send concurrent requests. It is now an independent module named grequests
which you can use to make concurrent requests endlessly until a 200 response:
import grequests

urls = [
    'http://python-requests.org',  # Just include one url if you want
    'http://httpbin.org',
    'http://python-guide.org',
    'http://kennethreitz.com'
]

def keep_going():
    rs = (grequests.get(u) for u in urls)  # Make a set of unsent Requests
    out = grequests.map(rs)                # Send them all at the same time
    for i in out:
        if i.status_code == 200:
            print i.text
            del urls[out.index(i)]  # If we have the content, delete the URL
            return

while urls:
    keep_going()

invalid response from proxy with python requests

I am using the Requests API with Python 2.7.
I am trying to download certain webpages through proxy servers. I have a list of available proxy servers, but not all of them work as desired: some proxies require authentication, others redirect to advertisement pages, etc. In order to detect/verify incorrect responses, I have included two checks in my URL request code. It looks similar to this:
import requests

proxy = '37.228.111.137:80'
url = 'http://www.google.ca/'
response = requests.get(url, proxies={'http': 'http://%s' % proxy})
if response.url != url or response.status_code != 200:
    print 'incorrect response'
else:
    print 'response correct'
    print response.text
With some proxy servers the requests.get call succeeds and passes these two checks, yet response.text still contains invalid HTML. However, if I use the same proxy in my Firefox browser and try to open the same webpage, I am shown an invalid page, while my Python script says the response is valid.
Can someone point out what other checks I am missing to weed out incorrect HTML results? Or: how can I verify that the webpage I received is the one I intended to receive?
Regards.
What is an "invalid webpage" when displayed by your browser? The server can return an HTTP status code of 200 while the content is an error message; you recognize it as an error message because you can comprehend it, but a browser or your code cannot.
If you have any knowledge about the content of the target page, you could check whether the returned HTML contains that content and accept it on that basis.
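For instance, a minimal sketch of that idea (the marker string is an assumption about what the genuine page always contains; adjust it to your target page):
import requests

proxy = '37.228.111.137:80'
url = 'http://www.google.ca/'
expected_marker = '<title>Google</title>'  # assumption: text the genuine page always contains

response = requests.get(url, proxies={'http': 'http://%s' % proxy})
if response.url == url and response.status_code == 200 and expected_marker in response.text:
    print('response looks genuine')
else:
    print('proxy returned something unexpected')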

Python script to get Session Ticket

I have a service that exposes an API for logging in. In my browser, if I open
<host>:<port>/myservice/api/login?u=admin&pw=admin
the URL returns a ticket that I can pass along with my successive requests.
More details here.
Below is my python script.
import urllib
url = 'http://<host>:<port>/myservice/api/login?u=admin&pw=admin'
print 'Retrieving', url
uh = urllib.urlopen(url)
data = uh.read()
print 'Retrieved',len(data),'characters'
print data
When I run this I get
IOError: ('http error', 401, 'Authorization Required', <httplib.HTTPMessage instance at 0x<somenumber>>)
Now I am not sure what I am supposed to do, so I went to my browser and opened the developer console.
Apparently the URL has moved somewhere else; I see two requests.
The first one hits the URL that I am hitting, and its response header has a Location: parameter.
The second request hits the URL returned in that Location parameter; its Authorization header has 'Negotiate', and there is also a Set-Cookie in the response header.
Now I am not sure what exactly to do with this information, but any help is appreciated. Thanks.
I believe your problem is having the wrong URL for the login service.
If I change your code to instead be:
import urllib, json
url = 'http://localhost:8080/alfresco/service/api/login?u=admin&pw=admin&format=json'
print "Retrieving %s" % url
uh = urllib.urlopen(url)
data = uh.read()
print "Retrieved %d characters" % len(data)
print "Data is %s" % data
ticket = json.loads(data)["data"]["ticket"]
print "Ticket is %s" % ticket
Then, against a freshly installed Alfresco 4.2 server, I get back a login ticket for the admin user.
Note the use of the JSON format of the login API (much easier to parse) and of the correct path to the login API, /alfresco/service/api/login.
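As a hedged follow-up sketch, the ticket can then be passed along with later calls. The endpoint below is only a placeholder, and the alf_ticket parameter is an Alfresco convention, so check what your service actually expects; it reuses the urllib import and the ticket variable from the snippet above.
# Hypothetical follow-up call: placeholder endpoint, alf_ticket parameter as used by Alfresco
next_url = 'http://localhost:8080/alfresco/service/api/sites?format=json&alf_ticket=%s' % ticket
print "Retrieving %s" % next_url
print urllib.urlopen(next_url).read()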
Try these two small changes; maybe they will help:
1) Use urllib.urlencode when passing parameters to the request URL:
import urllib

params = urllib.urlencode({'u': 'admin', 'pw': 'admin'})
uh = urllib.urlopen("http://<host>:<port>/myservice/api/login?%s" % params)
2) Simulate a web browser when making the request, using urllib2:
import urllib2
req = urllib2.Request('http://<host>:<port>/myservice/api/login?u=admin&pw=admin', headers={ 'User-Agent': 'Mozilla/5.0' })
uh = urllib2.urlopen(req)
401 is an unauthorized error: it means you are not authorized to access the API. Did you already sign up for API keys and access tokens?
Check this detailed description of the 401 error:
http://techproblems.org/http-error-401/

Getting the options in a http request status 300

I read that when I get this error I should specify the URL better. I assume that means I should choose between two displayed or accessible options. How can I do that?
I couldn't find anything in urllib or its tutorial. Is my assumption true? Can I read the possible URLs somewhere?
When I open this URL in my browser I am redirected to a new URL.
The URL I try to access: http://www.uniprot.org/uniprot/P08198_CSG_HALHA.fasta
The new URL I am redirected to: http://www.uniprot.org/uniprot/?query=replaces:P08198&format=fasta
import urllib.request

try:
    response = urllib.request.urlopen(url)
except urllib.error.HTTPError as e:
    if int(e.code) == 300:
        pass  # what now?
The status code 300 is returned by the server to tell you that your request is not specific enough and that you should be more precise.
Testing the URL, I searched from http://www.uniprot.org/ and entered "P08198". This led to the page http://www.uniprot.org/uniprot/P08198, which tells me
Demerged into Q9HM69, B0R8E4 and P0DME1. [ List ]
To me it seems the query for this protein is not specific enough, as this protein code was split into the sub-codes Q9HM69, B0R8E4 and P0DME1.
Conclusion
Status code 300 is a signal from the server application that your request is ambiguous. How to make it specific enough is application dependent and has nothing to do with Python or HTTP status codes; you have to find the details of a good URL in the application's logic.
So I ran into this issue and wanted to get the actual content returned. It turns out that this is the solution to my problem:
import urllib.request

try:
    response = urllib.request.urlopen(url)
except urllib.error.HTTPError as e:
    if int(e.code) == 300:
        response = e.read()  # the HTTPError is file-like, so read its body
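A small hedged extension of that fix: the HTTPError object is file-like and also carries the response headers, so you can inspect both the body and any Location header sent with the 300 (the URL is the one from the question):
import urllib.request
import urllib.error

url = 'http://www.uniprot.org/uniprot/P08198_CSG_HALHA.fasta'
try:
    response = urllib.request.urlopen(url).read()
except urllib.error.HTTPError as e:
    if int(e.code) == 300:
        body = e.read()                   # the body often lists the alternatives
        print(e.headers.get('Location'))  # the server may also point at one of them
        print(body[:200])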
