I'm using the following infrastructure for scraping a web site:
Scrapy <--> Splash <--> Scrapoxy <--> web site
I'm doing requests via Splash's execute endpoint, with a Lua script like this:
function main(splash)
  local host = "..."
  local port = "..."
  local username = "..."
  local password = "..."
  splash:on_request(function(request)
    request:set_proxy{host, port, username=username, password=password}
  end)
  splash:go(splash.args.url)
  return splash:html()
end
I want to detect bans and remove banned proxies. According to the Scrapoxy documentation:
Scrapoxy adds to the response an HTTP header x-cache-proxyname
But I don't see this header in response.headers. The only headers are:
{b'Content-Type': b'text/html; charset=utf-8',
b'Date': b'Wed, 18 Apr 2018 19:02:21 GMT',
b'Server': b'TwistedWeb/16.1.1'}
What am I doing wrong? Should I add something to the Lua script to properly return headers?
UPDATE: Actually, it doesn't seem to be a Splash problem. Scrapoxy doesn't return x-cache-proxyname even if used via HTTPie.
http -v --proxy=https:http://<user>:<password>@<scrapoxy-server>:8888 https://<site>
GET / HTTP/1.1
User-Agent: HTTPie/0.9.9
Accept-Encoding: gzip, deflate
Accept: */*
Connection: keep-alive
Host: <site>
HTTP/1.1 200 OK
Server: nginx
Date: Thu, 28 Jun 2018 08:14:26 GMT
Content-Type: text/html; charset=utf-8
Transfer-Encoding: chunked
Connection: keep-alive
Vary: Accept-Encoding
Set-Cookie: <...>
X-Powered-By: Express
ETag: W/"5a31b-faPJ7bjKH24S/3EvHU/8IoJHyxw"
Vary: Cookie, User-Agent
Content-Security-Policy: default-src https:; child-src https:; connect-src https: wss:; form-action https:; frame-ancestors https: http://webvisor.com; media-src https:; object-src https:; img-src https: data: blob:; script-src https: data: 'unsafe-inline' 'unsafe-eval'; style-src https: 'unsafe-inline'; font-src https: data:; report-uri /ajax/csp-report/
Content-Encoding: gzip
I managed to get x-cache-proxyname with this Lua script:
function main(splash)
  local host = "..."
  local port = "..."
  local username = "..."
  local password = "..."
  local proxy = ""
  splash:on_request(function(request)
    request:set_proxy{host, port, username=username, password=password}
  end)
  splash:on_response_headers(function(response)
    proxy = response.headers["x-cache-proxyname"]
  end)
  splash.images_enabled = false
  splash:go(splash.args.url)
  splash:set_result_header("x-cache-proxyname", proxy)
  return splash:html()
end
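With that in place, the header is visible on the Scrapy side (a sketch; the scrapy-splash request wiring is assumed):

def parse(self, response):
    # Header injected by splash:set_result_header in the Lua script above.
    proxy_name = response.headers.get('x-cache-proxyname')
    if proxy_name:
        self.logger.info('response served via proxy %s', proxy_name)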
UPDATE:
When you use HTTPS, Scrapoxy cannot edit the response headers and add x-cache-proxyname, because the tunneled connection is encrypted end to end.
I'm trying to write a Python script that would help me install a theme remotely. Unfortunately, the upload part doesn't play nice; I'm trying to do it with requests' POST helpers.
The HTTP headers of a successful upload look like this:
http://127.0.0.1/wordpress/wp-admin/update.php?action=upload-theme
POST /wordpress/wp-admin/update.php?action=upload-theme HTTP/1.1
Host: 127.0.0.1
User-Agent: Mozilla/5.0 (Windows NT 6.3; WOW64; rv:53.0) Gecko/20100101 Firefox/53.0
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: en-US,en;q=0.5
Accept-Encoding: gzip, deflate
Content-Type: multipart/form-data; boundary=---------------------------2455316848522
Content-Length: 2580849
Referer: http://127.0.0.1/wordpress/wp-admin/theme-install.php
Cookie: wordpress_5bd7a9c61cda6e66fc921a05bc80ee93=admin%7C1497659497%7C4a1VklpOs93uqpjylWqckQs80PccH1QMbZqn15lovQu%7Cee7366eea9b5bc9a9d492a664a04cb0916b97b0d211e892875cec86cf43e2f9d; wordpress_test_cookie=WP+Cookie+check; wordpress_logged_in_5bd7a9c61cda6e66fc921a05bc80ee93=admin%7C1497659497%7C4a1VklpOs93uqpjylWqckQs80PccH1QMbZqn15lovQu%7C9949f19ef5d900daf1b859c0bb4e2129cf86d6a970718a1b63e3b9e56dc5e710; wp-settings-1=libraryContent%3Dbrowse; wp-settings-time-1=1497486698
Connection: keep-alive
Upgrade-Insecure-Requests: 1
-----------------------------2455316848522
Content-Disposition: form-data; name="_wpnonce"
b1467671e0
-----------------------------2455316848522
Content-Disposition: form-data; name="_wp_http_referer"
/wordpress/wp-admin/theme-install.php
-----------------------------2455316848522
Content-Disposition: form-data; name="themezip"; filename="oedipus_theme.zip"
Content-Type: application/octet-stream
PK
HTTP/1.1 200 OK
Date: Thu, 15 Jun 2017 01:33:25 GMT
Server: Apache/2.4.25 (Win32) OpenSSL/1.0.2j PHP/7.1.1
X-Powered-By: PHP/7.1.1
Expires: Wed, 11 Jan 1984 05:00:00 GMT
Cache-Control: no-cache, must-revalidate, max-age=0
X-Frame-Options: SAMEORIGIN
Keep-Alive: timeout=5, max=100
Connection: Keep-Alive
Transfer-Encoding: chunked
Content-Type: text/html; charset=UTF-8
----------------------------------------------------------
To create a simple session for WP, to use later for uploads:

import requests

wp_session = None

def wpCreateSession(uname, upassword, site_link):
    """
    :param uname: Username for the login.
    :param upassword: Password for the login.
    :param site_link: Site to log in on.
    :return: Creates a session for the said website.
    """
    global wp_session
    wp_session = requests.session()
    wp_session.post(site_link, data={'log': uname, 'pwd': upassword})
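For example (assuming the standard WordPress login endpoint, which accepts the log and pwd form fields):

wpCreateSession('admin', 'password', 'http://127.0.0.1/wordpress/wp-login.php')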
To upload the said file to WP, using the wp_session global:
def wpUploadTheme(file_name):
    global wp_session
    with open(file_name, 'rb') as up_file:
        r = wp_session.post('http://127.0.0.1/wordpress/wp-admin/update.php',
                            files={file_name: up_file})
    print "Got after try."
And this last bit is where it doesn't work: the upload is not successful and I get returned to WordPress' basic 404.
I have also tried requests_toolbelt's MultipartEncoder, to no avail.
Question: 'requests' POST file fails when trying to upload
Check your files dict; it is invalid:
files = {file_name: up_file}
Maybe you need a full-blown files dict, for instance:
files = {'themezip': ('oedipus_theme.zip',
open('oedipus_theme.zip', 'rb'),
'application/octet-stream', {'Expires': '0'})}
From docs.python-requests.org
files = {'file': open('test.jpg', 'rb')}
requests.post(url, files=files)
From SO Answer Upload Image using POST form data in Python-requests
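Applied to the question's code, a sketch of the corrected upload (the nonce is a placeholder you would first scrape from theme-install.php; the URL and field names mirror the captured form):

import requests

def wpUploadTheme(file_name):
    global wp_session
    url = 'http://127.0.0.1/wordpress/wp-admin/update.php?action=upload-theme'
    with open(file_name, 'rb') as up_file:
        files = {'themezip': (file_name, up_file, 'application/octet-stream')}
        # _wpnonce and _wp_http_referer mirror the captured form fields;
        # '<nonce>' is a placeholder, not a real value.
        data = {'_wpnonce': '<nonce>',
                '_wp_http_referer': '/wordpress/wp-admin/theme-install.php'}
        return wp_session.post(url, files=files, data=data)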
I'm getting "urllib.error.HTTPError: HTTP Error 401: Unauthorized" after trying this code:
import urllib.request

_HTTPHandler = urllib.request.HTTPBasicAuthHandler()
_HTTPHandler.add_password(None, 'http://192.168.1.205', 'admin', 'password')
opener = urllib.request.build_opener(_HTTPHandler)
opener.open('http://192.168.1.205/api/swis/resource')
I'm sure that the user/password is correct. I've tested it with the Postman app, setting a Basic Auth header, and I receive the correct response.
My question is: how can I see the headers being used by the opener, so I can check whether they are generated correctly?
For manual debugging, you can set the debuglevel of your HTTPHandler:

handler = urllib.request.HTTPHandler(debuglevel=1)

This will produce rich output on stdout, where you can see the full request/response dump:
send: GET / HTTP/1.1
Accept-Encoding: identity
Host: www.example.com
Connection: close
reply: HTTP/1.1 200 OK
header: Date: Mon May 22 2017 12:21:31 GMT
header: Cache-Control: private
header: Content-Type: text/html; charset=ISO-8859-1
Relevant docs: https://docs.python.org/3/library/urllib.request.html
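Combining this with the Basic Auth setup from the question gives a quick sketch (the dump, including any Authorization header, goes to stdout):

import urllib.request

auth_handler = urllib.request.HTTPBasicAuthHandler()
auth_handler.add_password(None, 'http://192.168.1.205', 'admin', 'password')
debug_handler = urllib.request.HTTPHandler(debuglevel=1)
opener = urllib.request.build_opener(debug_handler, auth_handler)
opener.open('http://192.168.1.205/api/swis/resource')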
I'm learning how to log in to an example website using the Python requests module. This Video Tutorial got me started. Of all the cookies that I see in Google Chrome > Inspect Element > Network tab, I'm not able to retrieve them all using the following code:
import requests
with requests.Session() as s:
    url = 'http://www.noobmovies.com/accounts/login/?next=/'
    s.get(url)
    allcookies = s.cookies.get_dict()
    print allcookies
Using this I only get csrftoken like below:
{'csrftoken': 'ePE8zGxV4yHJ5j1NoGbXnhLK1FQ4jwqO'}
But in Google Chrome, I see all these other cookies apart from csrftoken (sessionid, _gat, _ga, etc.).
I even tried the following code from here, but the result was the same:
from urllib2 import Request, build_opener, HTTPCookieProcessor, HTTPHandler
import cookielib

# Create a CookieJar object to hold the cookies
cj = cookielib.CookieJar()
# Create an opener to open pages using the http protocol and to process cookies.
opener = build_opener(HTTPCookieProcessor(cj), HTTPHandler())
# Create a request object to be used to get the page.
req = Request("http://www.noobmovies.com/accounts/login/?next=/")
f = opener.open(req)
# See the first few lines of the page
html = f.read()
print html[:50]
# Check out the cookies
print "the cookies are: "
for cookie in cj:
    print cookie
Output:
<!DOCTYPE html>
<html xmlns="http://www.w3.org
the cookies are:
<Cookie csrftoken=ePE8zGxV4yHJ5j1NoGbXnhLK1FQ4jwqO for www.noobmovies.com/>
So, how can I get all the cookies? Thanks.
The cookies being set come from other pages/resources, probably loaded by JavaScript code. You can check this by making the request to the page alone (without running the JS code), using tools such as wget, curl, or httpie.
The only cookie this server sets is csrftoken, as you can see in:
$ wget --server-response 'http://www.noobmovies.com/accounts/login/?next=/'
--2016-02-01 22:51:55-- http://www.noobmovies.com/accounts/login/?next=/
Resolving www.noobmovies.com (www.noobmovies.com)... 69.164.217.90
Connecting to www.noobmovies.com (www.noobmovies.com)|69.164.217.90|:80... connected.
HTTP request sent, awaiting response...
HTTP/1.1 200 OK
Server: nginx/1.4.6 (Ubuntu)
Date: Tue, 02 Feb 2016 00:51:58 GMT
Content-Type: text/html; charset=utf-8
Transfer-Encoding: chunked
Connection: keep-alive
Vary: Accept-Encoding
Expires: Tue, 02 Feb 2016 00:51:58 GMT
Vary: Cookie,Accept-Encoding
Cache-Control: max-age=0
Set-Cookie: csrftoken=XJ07sWhMpT1hqv4K96lXkyDWAYIFt1W5; expires=Tue, 31-Jan-2017 00:51:58 GMT; Max-Age=31449600; Path=/
Last-Modified: Tue, 02 Feb 2016 00:51:58 GMT
Length: unspecified [text/html]
Saving to: ‘index.html?next=%2F’
index.html?next=%2F [ <=> ] 10,83K 2,93KB/s in 3,7s
2016-02-01 22:52:03 (2,93 KB/s) - ‘index.html?next=%2F’ saved [11085]
Note the Set-Cookie line.
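The same check from Python (a sketch with requests; only csrftoken comes back, matching the wget output above):

import requests

r = requests.get('http://www.noobmovies.com/accounts/login/?next=/')
print r.headers.get('Set-Cookie')  # only the csrftoken cookie
print r.cookies.get_dict()         # {'csrftoken': '...'}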
Hi, I am trying to scrape some data from this URL:
http://www.21cineplex.com/nowplaying/jakarta,3,JKT.htm/1
As you may have noticed, if cookies and session data are not yet set, you will be redirected to the base URL (http://www.21cineplex.com/).
I tried to do it like this:
import re
import urllib2
from cookielib import CookieJar

def main():
    try:
        cj = CookieJar()
        baseurl = "http://www.21cineplex.com"
        opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
        opener.open(baseurl)
        urllib2.install_opener(opener)
        movieSource = urllib2.urlopen('http://www.21cineplex.com/nowplaying/jakarta,3,JKT.htm/1').read()
        splitSource = re.findall(r'<ul class="w462">(.*?)</ul>', movieSource)
        print splitSource
    except Exception, e:
        print str(e)
        print "Error occurred in main block"
However, I ended up failing to scrape that particular URL.
A quick inspection reveals that the website sets a session ID (PHPSESSID) and copies it into the client's cookies.
The question is: how do I handle this?
PS: I've tried to install request (via pip); however, it gives me a 404:
Getting page https://pypi.python.org/simple/request/
Could not fetch URL https://pypi.python.org/simple/request/: HTTP Error 404: Not Found (request does not have any releases)
Will skip URL https://pypi.python.org/simple/request/ when looking for download links for request
Getting page https://pypi.python.org/simple/
URLs to search for versions for request:
* https://pypi.python.org/simple/request/
Getting page https://pypi.python.org/simple/request/
Could not fetch URL https://pypi.python.org/simple/request/: HTTP Error 404: Not Found (request does not have any releases)
Will skip URL https://pypi.python.org/simple/request/ when looking for download links for request
Could not find any downloads that satisfy the requirement request
Cleaning up...
Thanks to @Chainik, I got it to work now. I ended up modifying my code like this:
import urllib2
from cookielib import CookieJar

cj = CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
baseurl = "http://www.21cineplex.com/"
regex = '<ul class="w462">(.*?)</ul>'

opener.open(baseurl)
urllib2.install_opener(opener)

request = urllib2.Request('http://www.21cineplex.com/nowplaying/jakarta,3,JKT.htm/1')
request.add_header('Referer', baseurl)
requestData = urllib2.urlopen(request)
htmlText = requestData.read()
Once the HTML text is retrieved, it's all about parsing its content.
Cheers
Try setting a Referer URL; see below.
Without the Referer URL set (302 redirect):
$ curl -I "http://www.21cineplex.com/nowplaying/jakarta,3,JKT.htm/1"
HTTP/1.1 302 Moved Temporarily
Server: nginx
Date: Thu, 19 Sep 2013 09:19:19 GMT
Content-Type: text/html
Connection: keep-alive
X-Powered-By: PHP/5.4.17
Set-Cookie: PHPSESSID=5effe043db4fd83b2c5927818cb1a7ca; path=/
Expires: Thu, 19 Nov 1981 08:52:00 GMT
Cache-Control: no-store, no-cache, must-revalidate, post-check=0, pre-check=0
Pragma: no-cache
Set-Cookie: kota=3; expires=Fri, 19-Sep-2014 09:19:19 GMT; path=/
Location: http://www.21cineplex.com/
With the Referer URL set (HTTP 200):
$ curl -I -e "http://www.21cineplex.com/" \
    "http://www.21cineplex.com/nowplaying/jakarta,3,JKT.htm/1"
HTTP/1.1 200 OK
Server: nginx
Date: Thu, 19 Sep 2013 09:19:24 GMT
Content-Type: text/html
Connection: keep-alive
Vary: Accept-Encoding
X-Powered-By: PHP/5.4.17
Set-Cookie: PHPSESSID=a7abd6592c87e0c1a8fab4f855baa0a4; path=/
Expires: Thu, 19 Nov 1981 08:52:00 GMT
Cache-Control: no-store, no-cache, must-revalidate, post-check=0, pre-check=0
Pragma: no-cache
Set-Cookie: kota=3; expires=Fri, 19-Sep-2014 09:19:24 GMT; path=/
To set referer URL using urllib, see this post
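If you would rather use requests (note the PyPI package is named requests, not request, which is why the pip install above 404s), the same flow is a short sketch:

import requests

s = requests.Session()  # the Session keeps PHPSESSID across calls
baseurl = 'http://www.21cineplex.com/'
s.get(baseurl)  # pick up the session cookies first
r = s.get('http://www.21cineplex.com/nowplaying/jakarta,3,JKT.htm/1',
          headers={'Referer': baseurl})
print r.status_code  # 200 instead of a 302 back to the base URL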
-- ab1
I am using the Python urllib2 library for opening URLs, and what I want is the complete header info of the request. When I use response.info() I only get this:
Date: Mon, 15 Aug 2011 12:00:42 GMT
Server: Apache/2.2.0 (Unix)
Last-Modified: Tue, 01 May 2001 18:40:33 GMT
ETag: "13ef600-141-897e4a40"
Accept-Ranges: bytes
Content-Length: 321
Connection: close
Content-Type: text/html
I am expecting the complete info as given by live_http_headers (an add-on for Firefox), e.g.:
http://www.yellowpages.com.mt/Malta-Web/127151.aspx
GET /Malta-Web/127151.aspx HTTP/1.1
Host: www.yellowpages.com.mt
User-Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:2.0.1) Gecko/20100101 Firefox/4.0.1
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: en-gb,en;q=0.5
Accept-Encoding: gzip, deflate
Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7
Keep-Alive: 115
Connection: keep-alive
Cookie: __utma=156587571.1883941323.1313405289.1313405289.1313405289.1; __utmz=156587571.1313405289.1.1.utmcsr=(direct)|utmccn=(direct)|utmcmd=(none)
HTTP/1.1 302 Found
Connection: Keep-Alive
Content-Length: 141
Date: Mon, 15 Aug 2011 12:17:25 GMT
Location: http://www.trucks.com.mt
Content-Type: text/html; charset=utf-8
Server: Microsoft-IIS/6.0
X-Powered-By: ASP.NET, UrlRewriter.NET 2.0.0
X-AspNet-Version: 2.0.50727
Set-Cookie: ASP.NET_SessionId=zhnqh5554omyti55dxbvmf55; path=/; HttpOnly
Cache-Control: private
My request function is:
import urllib
import urllib2
import cookielib

def dorequest(url, post=None, headers={}):
    cOpener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cookielib.CookieJar()))
    urllib2.install_opener(cOpener)
    if post:
        post = urllib.urlencode(post)
    req = urllib2.Request(url, post, headers)
    response = cOpener.open(req)
    print response.info()  # this does not give complete header info; how can I get it?
    return response.read()
url = 'http://www.yellowpages.com.mt/Malta-Web/127151.aspx'
html = dorequest(url)
Is it possible to achieve the desired header info details by using urllib2? I don't want to switch to httplib.
Those are all of the headers the server is sending when you do the request with urllib2.
Firefox is showing you the headers it's sending to the server as well.
When the server gets those headers from Firefox, some of them may trigger it to send back additional headers, so you end up with more response headers as well.
Duplicate the exact headers Firefox sends, and you'll get back an identical response.
Edit: That Location header is sent by the page that does the redirect, not the page you're redirected to. Just use response.url to get the location of the page you've been sent to.
That first URL uses a 302 redirect. If you don't want to follow the redirect but want to see the headers from the first page instead, use a urllib.URLopener instead of a FancyURLopener, which is the one that automatically follows redirects.
I see that the server returns HTTP/1.1 302 Found, an HTTP redirect.
urllib2 automatically follows redirects, so the headers it returns are the headers from http://www.trucks.com.mt, not from http://www.yellowpages.com.mt/Malta-Web/127151.aspx.
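To see the 302 response's own headers with urllib2, one option (a sketch; the handler name is illustrative) is to disable redirect handling and inspect the resulting HTTPError:

import urllib2

class NoRedirect(urllib2.HTTPRedirectHandler):
    def redirect_request(self, req, fp, code, msg, headers, newurl):
        return None  # returning None makes the 3xx surface as an HTTPError

opener = urllib2.build_opener(NoRedirect())
try:
    opener.open('http://www.yellowpages.com.mt/Malta-Web/127151.aspx')
except urllib2.HTTPError, e:
    print e.code     # 302
    print e.headers  # headers of the redirect itself, including Location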