python HTTPSConnection to curl auto conversion - python

I'm writing a Python client/module to offer support for Python users of a non-Python server that is controlled by another group of developers and written in Node.js.
The server has a lot of capabilities including HTTPS, websockets, three authentication modes, etc.
My problem is that the time budget they can offer for help is very limited, and when some of my requests fail, no one on the other side speaks Python or has time to learn it, so they cannot check why this happens. From my side I am mainly using http.client's HTTPSConnection, e.g. in the simplest example
from http.client import HTTPSConnection

c = HTTPSConnection(self.credentials['host'])
headers = {'Authorization': 'Basic %s' % self.credentials['auth']}
c.request('GET', path, headers=headers)
(also PUT to send data, etc.)
Is there any general way to print the request's curl equivalent, so that I can send them
some non-Python code and complain that it does not work for me?
I can easily think how to do this by hand, and I have seen questions that ask about specifics of that; however, I would like a general, automatic way, like request_to_curl(HTTPSConnection.request) or something.
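I'm not aware of a built-in converter in http.client, but one rough approach is a small helper that builds the curl command from the same arguments you pass to request(). This is only a sketch under my own assumptions: request_to_curl is an illustrative name, it assumes HTTPS on the default port, and it only handles a plain headers dict and a string/bytes body.
```python
def request_to_curl(host, method, path, body=None, headers=None):
    # Illustrative helper, not part of http.client: builds a curl command
    # equivalent to HTTPSConnection(host).request(method, path, body, headers).
    parts = ['curl', '-X', method]
    for name, value in (headers or {}).items():
        parts.append("-H '%s: %s'" % (name, value))
    if body is not None:
        if isinstance(body, bytes):
            body = body.decode('utf-8', 'replace')
        parts.append("--data '%s'" % body)
    parts.append("'https://%s%s'" % (host, path))
    return ' '.join(parts)

# Usage, mirroring the snippet above (inside the same method):
#   print(request_to_curl(self.credentials['host'], 'GET', path, headers=headers))
```
Calling this right before each c.request(...) and logging the result would give you a copy-pasteable command to hand to the server team.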

Related

Python scraping HTTPError: 403 Client Error: Forbidden for url:

My Python code used to work, but when I tried it today it did not work anymore.
I assume the website owner recently forbade non-browser requests.
code
import requests, bs4
res = requests.get('https://manga1001.com/日常-raw-free/')
res.raise_for_status()
print(res.text)
I read that adding a header in the requests.get method may work, but I don't know exactly which header info I need to make it work.
error
---------------------------------------------------------------------------
HTTPError Traceback (most recent call last)
<ipython-input-15-ed1948d83d51> in <module>
3 # res = requests.get('https://manga1001.com/日常-raw-free/', headers=headers_dic)
4 res = requests.get('https://manga1001.com/日常-raw-free/')
----> 5 res.raise_for_status()
6 print(res.text)
7
~/opt/anaconda3/lib/python3.8/site-packages/requests/models.py in raise_for_status(self)
939
940 if http_error_msg:
--> 941 raise HTTPError(http_error_msg, response=self)
942
943 def close(self):
HTTPError: 403 Client Error: Forbidden for url: https://manga1001.com/%E6%97%A5%E5%B8%B8-raw-free/
requests.get has a headers argument
res = requests.get('https://manga1001.com/日常-raw-free/', headers="")
I think adding a proper value here could make it work, but I don't know what the value should be.
I would really appreciate it if you could tell me.
And if you know any other ways to make it work, that is also quite helpful.
Btw I have also tried the code below but it also didn't work.
code 2
from requests_html import HTMLSession
url = "https://search.yahoo.co.jp/realtime"
session = HTMLSession()
r = session.get(url)
r = r.html.render()
print(r)
FYI: HTMLSession may not work in an IDE like Jupyter Notebook, so I tried it after saving it as a Python file, but it still did not work.
When I run the first code without res.raise_for_status(), I can see HTML with "Why do I have to complete a CAPTCHA?" and "Cloudflare Ray ID", which shows what the problem is. The site uses Cloudflare to detect scripts/bots/hackers/spammers, and it uses a captcha to check. But if I use the header 'User-Agent' with a value from a real browser, or even with the short 'Mozilla/5.0', then it gets the expected page.
It works for me with both pages.
import requests
headers = {
'User-Agent': 'Mozilla/5.0'
}
url = 'https://manga1001.com/日常-raw-free/'
#url = 'https://search.yahoo.co.jp/realtime'
res = requests.get(url, headers=headers)
print('status_code:', res.status_code)
print(res.text)
BTW:
If you run it often for many links in a short time, then it may display the CAPTCHA again, and then you may need other methods to behave more like a real human - i.e. sleep() with a random time, Session() to use cookies, first getting the main page (to get fresh cookies) and only later this page, adding other headers.
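For illustration, here is a minimal sketch combining those ideas (a Session for cookies, a browser-like User-Agent, fetching the main page first, and a random pause); the delay range is an arbitrary example, not a value the site is known to require:
```python
import random
import time
import requests

headers = {'User-Agent': 'Mozilla/5.0'}

s = requests.Session()            # keeps cookies between requests
s.headers.update(headers)

s.get('https://manga1001.com/')   # visit the main page first to pick up fresh cookies
time.sleep(random.uniform(2, 5))  # pause for a human-ish, randomized amount of time

res = s.get('https://manga1001.com/日常-raw-free/')
print('status_code:', res.status_code)
print(res.text)
```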
I wanted to expand on the answer given by @furas, because I understand his fix will not be the solution in all cases. Yes, in this instance you're getting the 403 and the Cloudflare/security captcha page when you make a request because you're not "scoring" high enough on the security system (your HTTP client isn't similar enough to a real browser).
This raises a big question. What is a real browser, and what score do I need to beat? How do I increase my browser score and make my HTTP-request-based client look more real to the bot protection?
Firstly, it's important to understand that these 403/security blocks are based on different levels of security. Something you do on one site may not work on another due to different security configurations/versions. Two sites may use the same security system, and still the request you make may only work on one of them.
Why would they have different configurations, and why doesn't everyone use the highest security available? Because with each additional security measure there are more false positives and challenges to pass; on a large scale, or for an e-commerce store, this can mean lost sales due to a poor user experience, or additional bugs/downtime introduced by the security program.
What is a real browser?
A real browser can perform SSL/TLS handshakes, parse and run JavaScript, and make TCP requests. Along with this, the security programs analyze the patterns and timings of everything from Layer 2 up to see whether you're a "real" human. When you use something like Python to make a request that only performs the HTTP(S) part, it's really easy for these security programs to recognise you as a bot without some heavy configuration.
One way that security systems combat bots is by putting a JavaScript challenge as a proxy between the bot and the site. This requires running client-side JavaScript, which bots cannot do by default; not only do you need to run the client-side JavaScript, its output also needs to be similar to what your own browser would generate. The challenge can typically consist of a few hundred individual "browser" challenges, and/or a manual captcha, to fingerprint and track your browser and see whether you're a human (this is the page you're seeing).
The typical, lower-standard security systems/configurations can be beaten by using the correct headers (with the right capitalization, header order and HTTP version). As @furas mentioned, using consistent sessions can also help create longer-lasting sessions before getting another 403. More advanced, higher-level security configurations can track at lower levels by looking at flags of the TCP connection (such as window size) and by JA3 fingerprinting, which analyzes the TLS handshake and looks at your cipher suites and ALPN, amongst other things. Security systems can see characteristics that differentiate between browsers, browser versions and operating systems, and compare these all together to generate your realness score. Your IP can also be an important factor: requests can be cross-checked with other sites, intervals, older requests you tried before, and much more. You can use proxies to divide your requests between and look less suspicious, but this comes with additional problems of its own and can also cause your requests to be fingerprinted and blocked.
To understand this better, there's a great site you can visit in your browser and also make a GET request to: check your browser "rank" and look at the different values that can be seen from the TLS handshake alone.
I hope this provides some insight into why a block might appear, although it's impossible to tell from a single URL, since blocks can appear for such a variety of reasons.

How to post request to API using only code?

I am developing a DAG to be scheduled on Apache Airflow whose main purpose will be to post survey data (in JSON format) to an API and then get a response (the answers to the surveys). Since this whole process is going to be automated, every part of it has to be programmed into the DAG, so I can't use Postman or any similar app (unless there is a way to automate their usage, but I don't know if that is possible).
I was thinking of using the requests library for Python, and the function I've written for posting the json to the API looks like this:
import requests

def postFileToAPI(**context):
    print('uploadFileToAPI() ------ ')
    json_file = context['ti'].xcom_pull(task_ids='toJson')  # this pulls the json file from a previous task
    print('--------------- Posting survey request to API')
    r = requests.post('https://[request]', data=json_file)
(I haven't finished defining the HTTP link for the request because my source data is incomplete.)
However, since this is my first time working with APIs and the requests library, I don't know if this is enough. For example, I'm unsure whether I need to provide a token from the API to perform the request.
I also don't know if there are other libraries that are better suited for this or that could be good support.
In short: I don't know if what I'm doing will work as intended, what other information I need to provide my DAG, or if there are any libraries that would make my work easier.
The Python requests package that you're using is all you need, except if you're making a request that needs extra authorisation - then you should also import, for example, requests_jwt (then from requests_jwt import JWTAuth) if you're using JSON web tokens, or whatever companion requests package corresponds to your authorisation style.
You make POST and GET requests, and each individual request, separately.
Include the URL and data arguments as you have done and that should work!
You may also need headers and/or auth arguments to get through security;
e.g. for the GitLab API for a private repository you would include these extra arguments, where GITLAB_TOKEN is a GitLab web token:
```
headers={'PRIVATE-TOKEN': GITLAB_TOKEN},
auth=JWTAuth(GITLAB_TOKEN)
```
If you just try it, it should work; if it doesn't, test the API with curl requests directly in the terminal, or let us know :)
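For illustration, here is a minimal sketch of how the whole DAG task might look with a token passed in a header; API_URL and API_TOKEN are placeholders I've invented, and whether your survey API needs a token at all (and in what form) depends on its own documentation:
```python
import requests

API_URL = 'https://example.com/surveys'   # placeholder, not the real endpoint
API_TOKEN = 'replace-me'                  # placeholder; only needed if the API requires auth

def postFileToAPI(**context):
    json_file = context['ti'].xcom_pull(task_ids='toJson')  # JSON produced by the previous task
    headers = {
        'Content-Type': 'application/json',
        'Authorization': 'Bearer %s' % API_TOKEN,
    }
    r = requests.post(API_URL, data=json_file, headers=headers)
    r.raise_for_status()   # fail the Airflow task loudly if the API rejects the request
    return r.json()        # the survey answers, pushed to XCom for downstream tasks
```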

Building an http server

So I need to build an HTTP server that will contact a client and send it data like pictures or calculations, and create a page with those things. I guess you understood that I do not really know what I'm doing... :(
I know Python and the basics (and a bit more) of client-server projects, but I don't understand the HTTP protocol and didn't understand anything from what I read on the internet...
Can anyone explain to me how to work with this protocol? What is the form of HTTP packets?
Here is an example of one problem that I don't understand: I have been asked to receive a packet (which I did) and work out what the request in it is, then send back the name of the file the client wants, followed by the file itself. I printed the packet and didn't understand where the request is or what the client wants...
Thank you very very much!
Can anyone explain to me how to work with this protocol? What is the form of HTTP packets?
The specification might be helpful.
Concerning the web, you can find a lot of the specifications in the RFCs.
More on HTTP below.
(Since you seem to be new to programming, I figured I might want to tell you the following:)
Usually one doesn't directly interact with HTTP(S) packets. Instead you use a framework, such as Flask, Django, aiohttp and many more. The choice of framework depends on the use case. E.g.:
You need a database, authentication and any imaginable feature? Go with Django.
You just want to create a WebApplication without a bloated framework? Go with Flask.
You need the bare minimum or want to act as a client? Go with aiohttp.
More frameworks are listed here.
The advantage of using such frameworks is that they usually include useful things that are battle-tested (i.e. usually no bugs), while you don't have to figure out the peculiarities of certain protocols.
You just import the framework and write awesomeness! :)
(Anyway, here is a little, very oversimplified overview for completeness.)
So, HTTP is a text protocol over TCP, which basically means that you send text over a simple TCP socket. When you receive a request you have to "parse" it (i.e. comprehend its contents). Luckily for us the requests are standardized and follow the same scheme.
The smallest request would look like this:
GET / HTTP/1.0
Host: www.server.com
The first line starts with a verb (also called the request method); in our example the verb is GET. The / denotes the path - think of file paths on your HDD. The last part of the first line, namely HTTP/1.0, tells the receiver which version of HTTP we are operating with. Currently there are HTTP/1.0 and HTTP/1.1; however, I wouldn't bother with HTTP/1.1 yet and would stick with HTTP/1.0 if you're implementing the requests yourself.
Lastly, the Host: www.server.com line tells the server which host we want to talk to, since multiple instances of an HTTP server could be running under the same IP. This is used to resolve the subdomain.
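To make this concrete, here is a rough sketch of sending exactly that request by hand over a plain socket and reading whatever comes back; the host name is only an illustrative placeholder:
```python
import socket

host = 'www.example.com'          # illustrative host, not from the original question
request = (
    'GET / HTTP/1.0\r\n'
    'Host: %s\r\n'
    '\r\n' % host                 # the blank line terminates the header section
)

with socket.create_connection((host, 80)) as sock:
    sock.sendall(request.encode('ascii'))
    response = b''
    while True:                   # HTTP/1.0: read until the server closes the connection
        chunk = sock.recv(4096)
        if not chunk:
            break
        response += chunk

print(response.decode('iso-8859-1'))
```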
If you send this request to an HTTP server, you're likely to receive a response like this:
HTTP/1.0 200 OK
Server: Apache/1.3.29 (Unix) PHP/4.3.4
Content-Length: 1337
Connection: close
Content-Type: text/html
<DATA>
This response contains the status in the first line, HTTP/1.0 200 OK. The number and the 'OK' represent a status code, telling us that everything is fine. There are many status codes, each with its own meaning and usage.
The lines following the first are so-called response headers. They provide additional useful information about the response. For instance, when we open a site like 'stackoverflow.com', the server transmits an HTML file to us for the browser to interpret. Before we can do that, we need to know the size of the HTML file.
Luckily the server tells us beforehand, with the Content-Length: 1337 line, that the file is 1337 bytes big. The file itself would be present where the <DATA> placeholder stands.
There are, yet again, many of these headers.
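To tie this back to the original question (finding the requested path inside the received data), here is a deliberately oversimplified sketch of a server that parses the request line and answers with a response shaped like the one above; it handles one connection, does no error checking, and is for learning only:
```python
import socket

server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
server.bind(('127.0.0.1', 8080))
server.listen(1)

conn, addr = server.accept()
raw = conn.recv(65536).decode('iso-8859-1')

# The request line is the first line, e.g. "GET /picture.png HTTP/1.0"
request_line = raw.split('\r\n')[0]
method, path, version = request_line.split()

body = 'You asked for: %s' % path
response = (
    'HTTP/1.0 200 OK\r\n'
    'Content-Type: text/plain\r\n'
    'Content-Length: %d\r\n'
    'Connection: close\r\n'
    '\r\n'
    '%s' % (len(body), body)
)
conn.sendall(response.encode('iso-8859-1'))
conn.close()
server.close()
```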
As you can see, there are many things to account for when working with HTTP, which shows that it is not feasible, without a very good reason, to implement an HTTP client/server from scratch.
Instead it's preferred to use one of the frameworks (for python) listed above.
As a last note:
In the process of trying to explain the concepts as simply as possible, I probably left out or oversimplified some things. If you find any mistake, please let me know.

Which is best in Python: urllib2, PycURL or mechanize?

Ok so I need to download some web pages using Python and did a quick investigation of my options.
Included with Python:
urllib - seems to me that I should use urllib2 instead. urllib has no cookie support, HTTP/FTP/local files only (no SSL)
urllib2 - complete HTTP/FTP client, supports most needed things like cookies, does not support all HTTP verbs (only GET and POST, no TRACE, etc.)
Full featured:
mechanize - can use/save Firefox/IE cookies, take actions like follow second link, actively maintained (0.2.5 released in March 2011)
PycURL - supports everything curl does (FTP, FTPS, HTTP, HTTPS, GOPHER, TELNET, DICT, FILE and LDAP), bad news: not updated since Sep 9, 2008 (7.19.0)
New possibilities:
urllib3 - supports connection re-using/pooling and file posting
Deprecated (a.k.a. use urllib/urllib2 instead):
httplib - HTTP/HTTPS only (no FTP)
httplib2 - HTTP/HTTPS only (no FTP)
The first thing that strikes me is that urllib/urllib2/PycURL/mechanize are all pretty mature solutions that work well. mechanize and PycURL ship with a number of Linux distributions (e.g. Fedora 13) and BSDs, so installation is typically a non-issue (so that's good).
urllib2 looks good, but I'm wondering why PycURL and mechanize both seem very popular - is there something I am missing (i.e. if I use urllib2, will I paint myself into a corner at some point)? I'd really like some feedback on the pros/cons of these things so I can make the best choice for myself.
Edit: added note on verb support in urllib2
I think this talk (from PyCon 2009) has the answers to what you're looking for (Asheesh Laroia has lots of experience in the matter), and he points out the good and the bad of most of your listing:
Scrape the Web: Strategies for programming websites that don't expect it (Part 1 of 3)
Scrape the Web: Strategies for programming websites that don't expect it (Part 2 of 3)
Scrape the Web: Strategies for programming websites that don't expect it (Part 3 of 3)
From the PyCon 2009 schedule:
Do you find yourself faced with websites that have data you need to extract? Would your life be simpler if you could programmatically input data into web applications, even those tuned to resist interaction by bots?
We'll discuss the basics of web scraping, and then dive into the details of different methods and where they are most applicable. You'll leave with an understanding of when to apply different tools, and learn about a "heavy hammer" for screen scraping that I picked up at a project for the Electronic Frontier Foundation.
Attendees should bring a laptop, if possible, to try the examples we discuss and optionally take notes.
Update:
Asheesh Laroia has updated his presentation for PyCon 2010:
PyCon 2010: Scrape the Web: Strategies for programming websites that don't expect it
* My motto: "The website is the API."
* Choosing a parser: BeautifulSoup, lxml, HTMLParser, and html5lib.
* Extracting information, even in the face of bad HTML: Regular expressions, BeautifulSoup, SAX, and XPath.
* Automatic template reverse-engineering tools.
* Submitting to forms.
* Playing with XML-RPC
* DO NOT BECOME AN EVIL COMMENT SPAMMER.
* Countermeasures, and circumventing them:
o IP address limits
o Hidden form fields
o User-agent detection
o JavaScript
o CAPTCHAs
* Plenty of full source code to working examples:
o Submitting to forms for text-to-speech.
o Downloading music from web stores.
o Automating Firefox with Selenium RC to navigate a pure-JavaScript service.
* Q&A and workshopping
* Use your power for good, not evil.
Update 2:
PyCon US 2012 - Web scraping: Reliably and efficiently pull data from pages that don't expect it
Exciting information is trapped in web pages and behind HTML forms. In this tutorial, you'll learn how to parse those pages and when to apply advanced techniques that make scraping faster and more stable. We'll cover parallel downloading with Twisted, gevent, and others; analyzing sites behind SSL; driving JavaScript-y sites with Selenium; and evading common anti-scraping techniques.
Python requests is also a good candidate for HTTP stuff. It has a nicer API IMHO; here's an example HTTP request from their official documentation:
>>> r = requests.get('https://api.github.com', auth=('user', 'pass'))
>>> r.status_code
204
>>> r.headers['content-type']
'application/json'
>>> r.content
...
urllib2 is found in every Python install everywhere, so is a good base upon which to start.
PycURL is useful for people already used to using libcurl, exposes more of the low-level details of HTTP, plus it gains any fixes or improvements applied to libcurl.
mechanize is used to persistently drive a connection much like a browser would.
It's not a matter of one being better than the other, it's a matter of choosing the appropriate tool for the job.
To "get some webpages", use requests!
From http://docs.python-requests.org/en/latest/ :
Python's standard urllib2 module provides most of the HTTP capabilities you need, but the API is thoroughly broken. It was built for a different time - and a different web. It requires an enormous amount of work (even method overrides) to perform the simplest of tasks.
Things shouldn’t be this way. Not in Python.
>>> r = requests.get('https://api.github.com/user', auth=('user', 'pass'))
>>> r.status_code
200
>>> r.headers['content-type']
'application/json; charset=utf8'
>>> r.encoding
'utf-8'
>>> r.text
u'{"type":"User"...'
>>> r.json()
{u'private_gists': 419, u'total_private_repos': 77, ...}
Don't worry about "last updated". HTTP hasn't changed much in the last few years ;)
urllib2 is best (as it's built in); then switch to mechanize if you need cookies from Firefox. mechanize can be used as a drop-in replacement for urllib2 - they have similar methods, etc. Using Firefox cookies means you can get things from sites (like, say, Stack Overflow) using your personal login credentials. Just be responsible with your number of requests (or you'll get blocked).
PycURL is for people who need all the low level stuff in libcurl. I would try the other libraries first.
urllib2 only supports HTTP GET and POST; there might be workarounds, but if your app depends on other HTTP verbs, you will probably prefer a different module.
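For completeness, the workaround usually cited (this is Python 2 code, matching the urllib2 module under discussion, and a sketch rather than an officially documented pattern) is to override get_method on a Request subclass:
```python
import urllib2

class RequestWithMethod(urllib2.Request):
    """Request subclass that lets the caller pick the HTTP verb."""
    def __init__(self, url, method, **kwargs):
        self._method = method
        urllib2.Request.__init__(self, url, **kwargs)

    def get_method(self):
        return self._method

# e.g. issue a DELETE instead of the default GET/POST
req = RequestWithMethod('http://example.com/resource/42', 'DELETE')
response = urllib2.urlopen(req)
print(response.getcode())
```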
Every python library that speaks HTTP has its own advantages.
Use the one that has the minimum amount of features necessary for a particular task.
Your list is missing at least urllib3 - a cool third-party HTTP library which can reuse an HTTP connection, thus greatly speeding up the process of retrieving multiple URLs from the same site.
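A minimal sketch of that connection reuse with urllib3's PoolManager (the URL and paths are just placeholders):
```python
import urllib3

http = urllib3.PoolManager()

# Both requests go to the same host, so the underlying TCP/TLS connection
# is taken from the pool and reused instead of being reopened each time.
for path in ('/page1', '/page2'):
    r = http.request('GET', 'https://example.com' + path)
    print(r.status, len(r.data))
```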
Take a look at Grab (http://grablib.org). It is a network library which provides two main interfaces:
1) Grab for creating network requests and parsing the retrieved data
2) Spider for creating bulk site scrapers
Under the hood Grab uses pycurl and lxml, but it is possible to use other network transports (for example, the requests library). The requests transport is not well tested yet.

Python and curl question

I will be transmitting purchase info (like CC data) to a bank gateway and retrieving the result using Django, thus via Python.
What would be an efficient and secure way of doing this?
I have read the documentation of this gateway for PHP; they seem to use this method:
$xml = Some XML holding the data of a purchase.
$curl = `/usr/bin/curl -s -d 'DATA=$xml' "https://url of the virtual bank POS"`;
$data = explode("\n", $curl); // the return value is also XML; it seems they split it by each `\n`
and using $data, they work out whether the payment is accepted, rejected, etc.
I want to achieve this in Python. I have done some searching, and it seems there is a Python curl binding named pycurl, yet I have no experience using curl and do not know if this library is suitable for the task. Please keep in mind that, as this transfer requires security, I will be using SSL.
Any suggestion will be appreciated.
Use of the standard library urllib2 module should be enough:
import urllib
import urllib2
request_data = urllib.urlencode({"DATA": xml})
response = urllib2.urlopen("https://url of the virtual bank POS", request_data)
response_data = response.read()
data = response_data.split('\n')
I assume that xml variable holds data to be sent.
Citing pycurl.sourceforge.net:
To sum up, PycURL is very fast (esp. for multiple concurrent operations) and very feature complete, but has a somewhat complex interface. If you need something simpler or prefer a pure Python module you might want to check out urllib2 and urlgrabber. There is also a good comparison of the various libraries.
Both pycurl and urllib2 can work with HTTPS, so it's up to you.
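Since the question specifically asks about pycurl, here is a rough pycurl equivalent of the PHP snippet (Python 2, to match the urllib2 example above); the URL is the same placeholder as in the question, and this is a sketch rather than a tested gateway integration:
```python
import pycurl
import urllib
from StringIO import StringIO  # Python 2

buf = StringIO()
c = pycurl.Curl()
c.setopt(pycurl.URL, 'https://url of the virtual bank POS')
c.setopt(pycurl.POSTFIELDS, urllib.urlencode({'DATA': xml}))  # xml holds the purchase data
c.setopt(pycurl.WRITEFUNCTION, buf.write)
c.setopt(pycurl.SSL_VERIFYPEER, 1)   # keep certificate verification enabled
c.perform()
c.close()

data = buf.getvalue().split('\n')    # the gateway answers with XML split by newlines
```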
