Crawling SSL site with scrapy - python

I have to crawl https://dms.psc.sc.gov/Web/dockets, which uses TLS v1.2, with the Scrapy framework. But when requesting the URL it fails to load and raises [<twisted.python.failure.Failure <class 'OpenSSL.SSL.Error'>>].
There is an issue discussing this on GitHub (https://github.com/scrapy/scrapy/issues/981), but the suggestions there did not work for me. I have Scrapy v0.24.5 and Twisted version >= 14.
When I try to crawl another site which also uses TLS v1.2 it works, but not for https://dms.psc.sc.gov.
How can I solve this issue?

The PR fixing this problem in Scrapy was already merged. Recently (in February 2016) there was another pull request fixing a similar bug.
With the most recent Scrapy version I can fetch your page just fine, but with older versions the problem still appears.
In general, if you stumble on an HTTPS problem with Scrapy, the solution is:
upgrade Scrapy to the newest version
check which version of Twisted you use; if it's not the most recent, update to the most recent Twisted version (as of the time of writing, versions above 14 are confirmed to be significantly better when it comes to SSL)
If you still experience problems after updating Scrapy and Twisted, you may need to subclass ScrapyClientContextFactory - see the answer below for details.
More details in this GitHub issue.

1. Add DOWNLOADER_CLIENTCONTEXTFACTORY = 'testproject.CustomContext.CustomClientContextFactory' to your settings.py
2. Create a file called CustomContext.py in your project directory and add the below code
from OpenSSL import SSL
from twisted.internet.ssl import ClientContextFactory
from twisted.internet._sslverify import ClientTLSOptions
from scrapy.core.downloader.contextfactory import ScrapyClientContextFactory

class CustomClientContextFactory(ScrapyClientContextFactory):
    def getContext(self, hostname=None, port=None):
        ctx = ClientContextFactory.getContext(self)
        # Enable all workarounds to SSL bugs as documented by
        # http://www.openssl.org/docs/ssl/SSL_CTX_set_options.html
        ctx.set_options(SSL.OP_ALL)
        if hostname:
            ClientTLSOptions(hostname, ctx)
        return ctx
Note: It worked well for crawling https sites on Windows, but when I tried the same on Ubuntu 14.04 it threw the error below:
from twisted.internet._sslverify import ClientTLSOptions
exceptions.ImportError: cannot import name ClientTLSOptions
It would be great if anyone could add a solution for the above error.
EDIT:
Instead of using from twisted.internet._sslverify import ClientTLSOptions directly, I have changed it to the below:
try:
    # available since Twisted 14.0
    from twisted.internet._sslverify import ClientTLSOptions
except ImportError:
    ClientTLSOptions = None
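Putting the two pieces together, the whole CustomContext.py could look roughly like this (just a sketch based on the snippets above; it simply skips the SNI setup when ClientTLSOptions can't be imported):
from OpenSSL import SSL
from twisted.internet.ssl import ClientContextFactory
from scrapy.core.downloader.contextfactory import ScrapyClientContextFactory
try:
    # available since Twisted 14.0
    from twisted.internet._sslverify import ClientTLSOptions
except ImportError:
    ClientTLSOptions = None

class CustomClientContextFactory(ScrapyClientContextFactory):
    def getContext(self, hostname=None, port=None):
        ctx = ClientContextFactory.getContext(self)
        # Enable all workarounds to SSL bugs (see SSL_CTX_set_options docs)
        ctx.set_options(SSL.OP_ALL)
        # Only set up SNI when ClientTLSOptions could actually be imported
        if hostname and ClientTLSOptions is not None:
            ClientTLSOptions(hostname, ctx)
        return ctx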

Anyone getting "TypeError: unbound method getContext() must be called with ClientContextFactory instance as first argument ...":
Replace ctx = ClientContextFactory.getContext(self)
with ctx = ScrapyClientContextFactory.getContext(self)

Vinodh Velumayil's answer is right, but I had to change this line:
ctx = ClientContextFactory.getContext(self)
to this:
inst = ClientContextFactory()
ctx = inst.getContext()

Related

Building the SeeingWand on Raspberry Pi Zero and have coding issues

This is my first post, so please forgive any lack of decorum.
I am building a SeeingWand as outlined in MagPi issue #71.
I have installed and tested all the hardware. Then I installed the Python code; the original code was Python 2.7, and I have updated it to run under Python 3, but I get a strange error when I run the code:
I have tried several coding variations and several builds of the Raspberry Pi OS (Raspbian); they mostly give the same errors.
import picamera, http, urllib, base64, json, re
from os import system
from gpiozero import Button

# CHANGE {MS_API_KEY} BELOW WITH YOUR MICROSOFT VISION API KEY
ms_api_key = "{MS_API_KEY}"

# camera button - this is the BCM number, not the pin number
camera_button = Button(27)

# setup camera
camera = picamera.PiCamera()

# setup vision API
headers = {
    'Content-Type': 'application/octet-stream',
    'Ocp-Apim-Subscription-Key': ms_api_key,
}
params = urllib.parse.urlencode({
    'visualFeatures': 'Description',
})

# loop forever waiting for button press
while True:
    camera_button.wait_for_press()
    camera.capture('/tmp/image.jpg')
    body = open('/tmp/image.jpg', "rb").read()
    try:
        conn = http.client.HTTPsConnection('westcentralus.api.cognitive.microsoft.com')
        conn.request("POST", "/vision/v1.0/analyze?%s" % params, body, headers)
        response = conn.getresponse()
        analysis = json.loads(response.read())
        image_caption = analysis["description"]["captions"][0]["text"].capitalize()
        # validate text before system() call; use subprocess in next version
        if re.match("^[a-zA-z ]+$", image_caption):
            system('espeak -ven+f3 -k5 -s120 "' + image_caption + '"')
        else:
            system('espeak -ven+f3 -k5 -s120 "i do not know what i just saw"')
        conn.close()
    except Exception as e:
        print(e.args)
The system displays an error stating that the http module does not have a .client attribute.
The documentation says it does. I have tried the .client and .server attributes; both give the same error. What am I doing wrong?
Expected results are:
when I push button 1, I expect the camera to take a picture
when I push button 2, I expect to access MSFT Azure to identify the picture using AI
the final output is for the Wand to access the audio HAT and describe what the Wand is "looking" at.
try adding an import like this:
import http.client
Edit: http is a Python package. Even if the package contains some modules, it does not automatically import those modules when you import the package, unless the __init__.py for that package does so on your behalf. In the case of http, the __init__.py is empty, so you get nothing gratis just for importing the package.
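To make the difference concrete, here is a minimal sketch (example.com is just a placeholder host):
import http

# In a fresh interpreter the bare package does not expose the submodule:
# http.client  ->  AttributeError: module 'http' has no attribute 'client'

# Importing the submodule explicitly makes it available:
import http.client

conn = http.client.HTTPSConnection('example.com')  # note the capital "S" in HTTPSConnection
conn.request('GET', '/')
print(conn.getresponse().status)
conn.close()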

vscode import error: from scapy.all import IP

VS Code says it can't find IP in scapy.all,
but from the terminal I can import it.
Could somebody tell me why?
I get exactly the same issue with my Scapy code in VS Code. I think it's to do with the way pylint is working.
When you from scapy.all import IP, Python loads scapy/all.py, which includes the line from scapy.layers.all import *. scapy/layers/all.py includes this code:
for _l in conf.load_layers:
    log_loading.debug("Loading layer %s" % _l)
    try:
        load_layer(_l, globals_dict=globals(), symb_list=__all__)
    except Exception as e:
        log.warning("can't import layer %s: %s", _l, e)
conf.load_layers is defined over in scapy/config.py:
load_layers = ['bluetooth', 'bluetooth4LE', 'dhcp', 'dhcp6', 'dns',
'dot11', 'dot15d4', 'eap', 'gprs', 'hsrp', 'inet',
'inet6', 'ipsec', 'ir', 'isakmp', 'l2', 'l2tp',
'llmnr', 'lltd', 'mgcp', 'mobileip', 'netbios',
'netflow', 'ntp', 'ppp', 'pptp', 'radius', 'rip',
'rtp', 'sctp', 'sixlowpan', 'skinny', 'smb', 'snmp',
'tftp', 'vrrp', 'vxlan', 'x509', 'zigbee']
I suspect that pylint doesn't follow those imports correctly.
I've tried the workarounds suggested in the relevant GitHub issue, but they don't seem to fix anything for Scapy. Pylint eventually added specific workarounds for the equivalent issues in NumPy, but no one has done that for Scapy.
You can work around these issues by directly importing the IP class from the relevant layer at the top of your Python file:
from scapy.layers.inet import IP, UDP, TCP, ICMP
Et voila! No more pylint complaints about those imports.
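As a quick sanity check (the destination address below is just a documentation placeholder), the directly imported classes work exactly like the ones re-exported by scapy.all:
from scapy.layers.inet import IP, TCP

# Build a simple SYN packet; pylint now resolves IP and TCP via the explicit import
pkt = IP(dst="192.0.2.1") / TCP(dport=80, flags="S")
print(pkt.summary())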

Python urllib won't download file due to permissions, but wget will

I'm trying to download an MP3 file, via its URL, using Python's urllib2.
import urllib2

mp3file = urllib2.urlopen(url)
output = open(dst, 'wb')
output.write(mp3file.read())
output.close()
I'm getting a urllib2.HTTPError: HTTP Error 403: Forbidden error.
Trying urllib also fails, but silently.
urllib.urlretrieve(url, dst)
However, if I use wget, I can download the file successfully.
I've noted the general differences between the two methods mentioned in "Difference between Python urllib.urlretrieve() and wget", but they don't seem to apply here.
Is wget doing something to negotiate permissions that urllib2 doesn't do? If so, what, and how do I replicate this in urllib2?
It could be something on the server side, for example blocking the default Python user agent. Try using the wget user agent: Wget/1.13.4 (linux-gnu).
In Python 2:
import urllib

# Change header for User-Agent
class AppURLopener(urllib.FancyURLopener):
    version = "Wget/1.13.4 (linux-gnu)"

url = "http://www.example.com/test_file"
fname = "test_file"
urllib._urlopener = AppURLopener()
urllib.urlretrieve(url, fname)
The above didn't work for me (I'm using Python 3.5). wget works fine.
It's not (I assume) a huge problem for me - surely I can still do a system() and use wget to get the data, with some file renaming and munging.
But in case anyone else is suffering from the same problem, these are the errors I get from the above snippet:
Traceback (most recent call last):
File "./mksynt.py", line 10, in <module>
class AppURLopener(urllib.FancyURLopener):
AttributeError: module 'urllib' has no attribute 'FancyURLopener'
I see that the original answer was only promised to work in Python 2.
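For anyone on Python 3, a rough equivalent of the same trick uses urllib.request with an explicit User-Agent header (the URL and filename below are placeholders; the wget user-agent string is assumed to be acceptable to the server):
import shutil
import urllib.request

url = "http://www.example.com/test_file.mp3"
fname = "test_file.mp3"

# Send the request with a wget-like User-Agent instead of the default Python one
req = urllib.request.Request(url, headers={"User-Agent": "Wget/1.13.4 (linux-gnu)"})
with urllib.request.urlopen(req) as response, open(fname, "wb") as out:
    shutil.copyfileobj(response, out)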

remote: missing/incomplete bugzilla conf (no bugzilla_url) error with gitzilla

I have the gitzilla config file set up at /etc/gitzillarc on the remote central repository server, with read and write permissions for all.
The content of the config file is as follows:
[/home/gituser/repositories/git-main/git-main.git/.git]
bugzilla_url: https://repo.example.com/bugzilla/
bugzilla_user: sboppana#example.com
bugzilla_password: s123
user_config: deny
allowed_bug_states: NEW, ASSIGNED, REOPENED
logfile: /var/log/gitzilla.log
loglevel: info
python at 2.6.5
pybugz at 0.9.3 (tried with 0.8.0 also)
Gitzilla at gera-gitzilla-gitzilla-2.0-19-geceeaca.tar.gz
I get the error "remote: missing/incomplete bugzilla conf (no bugzilla_url)" on git push.
Of course, the bugzilla_url value has the real name in my config file, not the example name.
I've tried many things but couldn't get it to work. Thanks for all the help.
Adding xmlrpc.cgi to the bugzilla URL should solve this issue. This is the workaround for the problem in pybugz.
For instance, if your bugzilla URL is https://repo.example.com/bugzilla/, try using https://repo.example.com/bugzilla/xmlrpc.cgi instead. This should work.
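In other words, only the bugzilla_url line in the /etc/gitzillarc section above needs to change (keeping the example hostname from the question):
[/home/gituser/repositories/git-main/git-main.git/.git]
bugzilla_url: https://repo.example.com/bugzilla/xmlrpc.cgi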

How to use startTLS with ldaptor?

I'm trying to use ldaptor to connect via startTLS to an LDAP server. Searching on the internet and experimenting myself, I arrived at this snippet of code:
from ldaptor.protocols.ldap import ldapclient, ldapsyntax, ldapconnector, distinguishedname
[...]
def main(base, serviceLocationOverrides):
    c = ldapconnector.LDAPClientCreator(reactor, ldapclient.LDAPClient)
    d = c.connect(base, serviceLocationOverrides)
    d.addCallbacks(lambda proto: proto.startTLS(), error)
    [...]
    d.addErrback(error)
    d.addBoth(lambda dummy: reactor.stop())
    reactor.run()
but the code exits with an AssertionError:
[Failure instance: Traceback: <type 'exceptions.AssertionError'>:
/usr/lib/python2.7/dist-packages/twisted/internet/base.py:1167:mainLoop
/usr/lib/python2.7/dist-packages/twisted/internet/base.py:789:runUntilCurrent
/usr/lib/python2.7/dist-packages/twisted/internet/defer.py:361:callback
/usr/lib/python2.7/dist-packages/twisted/internet/defer.py:455:_startRunCallbacks
--- <exception caught here> ---
/usr/lib/python2.7/dist-packages/twisted/internet/defer.py:542:_runCallbacks
/usr/lib/pymodules/python2.7/ldaptor/protocols/ldap/ldapclient.py:239:_startTLS
/usr/lib/pymodules/python2.7/ldaptor/protocols/pureldap.py:1278:__init__
/usr/lib/pymodules/python2.7/ldaptor/protocols/pureldap.py:1144:__init__
]
I have tried to look in the ldaptor code for the offending assertion, but everything seems OK to me.
Has anyone succeeded in using ldaptor's LDAPClient with startTLS?
A code snippet?
Thank you very much.
Bye
I'm pretty certain that your problem is one I ran into a while back. In ldaptor/protocols/pureldap.py, line 1144 asserts that the LDAPExtendedRequest requestValue must be a string. But according to RFC 2251, that value is optional, and specifically should NOT be present in startTLS requests.
So your approach is correct; this is just a major bug in ldaptor. As far as I can tell, the author only tested using simple bind without TLS. You need to comment out that line in pureldap.py. If you're deploying this with the expectation that users will download or easy-install ldaptor, then you'll need to create a fixed copy of the LDAPExtendedRequest class in your own code, and sub it in at run-time.
Having had to maintain a project using ldaptor for several years, I would strongly urge you to switch to python-ldap if at all possible. Since it wraps the OpenLDAP libs, it can be much more difficult to build, especially with full support for SSL/SASL. But it's well worth it, because ldaptor has a lot more problems than just the one you ran across.
Using ldaptor 0.0.54 from https://github.com/twisted/ldaptor, I had no problems using StartTLS.
Here is the code:
#! /usr/bin/env python
from twisted.internet import reactor, defer
from ldaptor.protocols.ldap import ldapclient, ldapsyntax, ldapconnector

@defer.inlineCallbacks
def example():
    serverip = 'your.server.name.or.ip'
    basedn = 'o=Organization'
    binddn = 'cn=admin,o=Organization'
    bindpw = 'Sekret'
    query = '(uid=jetsong)'
    c = ldapconnector.LDAPClientCreator(reactor, ldapclient.LDAPClient)
    overrides = {basedn: (serverip, 389)}
    client = yield c.connect(basedn, overrides=overrides)
    client = yield client.startTLS()
    yield client.bind(binddn, bindpw)
    o = ldapsyntax.LDAPEntry(client, basedn)
    results = yield o.search(filterText=query)
    for entry in results:
        print entry

if __name__ == '__main__':
    df = example()
    df.addErrback(lambda err: err.printTraceback())
    df.addCallback(lambda _: reactor.stop())
    reactor.run()
