Can't get redirect URL with python requests - python

import requests
print(requests.get("http://zfxxgk.nea.gov.cn/2022-01/17/c_1310427545.htm").url)
#returns 'http://zfxxgk.nea.gov.cn/2022-01/17/c_1310427545.htm'
The code above returns the same URL I passed in, but that URL actually redirects to another URL.
How can I get the URL after the redirect?
Thank you

With a curl call
curl http://zfxxgk.nea.gov.cn/2022-01/17/c_1310427545.htm
you get the page source, which contains the JavaScript redirect, as furas mentioned:
<script language="javascript" type="text/javascript">window.location.href="http://zfxxgk.nea.gov.cn/2021-12/31/c_1310427545.htm";</script>
If it were a "real" HTTP redirect, you would find the new location in the response headers:
curl -I http://something.somewhere
HTTP/1.1 301 Moved Permanently
Content-Type: text/html
Content-Length: 185
Connection: keep-alive
Location: https://inder.net
If you need to automate this with Python, look at response.history, as described in this answer to a similar question, to find all the redirects your call triggered: "Python Requests library redirect new url"
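A minimal sketch tying both points together (the regex extraction is just one way to pull out the JavaScript target; it is not from the original answer):

import re
import requests

response = requests.get("http://zfxxgk.nea.gov.cn/2022-01/17/c_1310427545.htm")

# Real HTTP redirects are recorded hop by hop in response.history
# (empty here, because the server answers 200 and redirects via JavaScript).
for hop in response.history:
    print(hop.status_code, hop.url)
print("final URL:", response.url)

# Pull the JavaScript redirect target out of the page source instead.
match = re.search(r'window\.location\.href="([^"]+)"', response.text)
if match:
    print("JavaScript redirect target:", match.group(1))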

Related

Send an HTTP request without "/" at the beginning of the path

I want to be able to send an HTTP request with Python without the slash "/" at the start of the path.
Simply, here is what the request should be like:
GET test HTTP/1.1
Host: example.com
Connection: keep-alive
Cache-Control: max-age=0
What I want to do is GET test HTTP/1.1 rather than GET /test HTTP/1.1
I am able to send the request using request repeating tools, but I am not sure how to do that with Python.
To clarify more: I don't want the request path to start with "/"
I am looking for the equivalent of this in python.
Thanks!
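One possible approach, sketched under the assumption that the target is example.com on port 80 (not from the original post): requests and urllib normalize the path, so writing the request over a raw socket keeps the request line exactly as typed.

import socket

# Build the request line without a leading "/" in the path.
request = (
    "GET test HTTP/1.1\r\n"
    "Host: example.com\r\n"
    "Connection: close\r\n"
    "\r\n"
)

with socket.create_connection(("example.com", 80)) as sock:
    sock.sendall(request.encode("ascii"))
    response = sock.recv(65535)

print(response.decode("iso-8859-1"))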

python requests.put() fails when urllib3 http.request('PUT', ...) succeeds. What gives?

I am trying to hit the Atlassian Confluence REST API using Python requests.
I've successfully called a GET API, but when I call the PUT to update a Confluence page, it returns 200 but doesn't update the page.
I used Chrome::YARC to verify that the API was working properly (which it was). After a while spent trying to debug it, I fell back to urllib3, which worked just fine.
I'd really like to use requests, but I can't for the life of me figure this one out after hours and hours of trying to debug, Google, etc.
I'm running Mac/Python3:
$ uname -a
Darwin mylaptop.local 16.7.0 Darwin Kernel Version 16.7.0: Thu Jun 15 17:36:27 PDT 2017; root:xnu-3789.70.16~2/RELEASE_X86_64 x86_64
$ python3 --version
Python 3.6.1
Here's my code that shows all three ways I'm trying this (two requests and one urllib3):
def update(self, spaceKey, pageTitle, newContent, contentType='storage'):
    if contentType not in ('storage', 'wiki', 'plain'):
        raise ValueError("Invalid contentType={}".format(contentType))

    # Get current page info
    self._refreshPage(spaceKey, pageTitle)  # I retrieve it before I update it.
    orig_version = self.version

    # Content already same as requested content. Do nothing
    if self.wiki == newContent:
        return

    data_dict = {
        'type': 'page',
        'version': {'number': self.version + 1},
        'body': {
            contentType: {
                'representation': contentType,
                'value': str(newContent)
            }
        }
    }
    data_json = json.dumps(data_dict).encode('utf-8')

    put = 'urllib3'  # for now until I figure out why requests.put() doesn't work

    enable_http_logging()
    if put == 'requests':
        r = self._cs.api.content(self.id).PUT(json=data_dict)
        r.raise_for_status()
    elif put == 'urllib3':
        urllib3.disable_warnings()  # I know, you can quit your whining now!!!
        headers = {'Content-Type': 'application/json;charset=utf-8'}
        auth_header = urllib3.util.make_headers(basic_auth=":".join(self._cs.session.auth))
        headers = {**headers, **auth_header}
        http = urllib3.PoolManager()
        r = http.request('PUT', str(self._cs.api.content(self.id)), body=data_json, headers=headers)
    else:
        raise ValueError("Huh? Unknown put type: {}".format(put))
    enable_http_logging(False)

    # Verify page was updated
    self._refreshPage(spaceKey, pageTitle)  # Check for changes
    if self.version != orig_version + 1:
        raise RuntimeError("Page not updated. Still at version {}".format(self.version))
    if self.wiki != newContent:
        raise RuntimeError("Page version updated, but not content.")
Any help would be great.
Update 1: Adding request dump
-----------START-----------
PUT http://confluence.myco.com/rest/api/content/101904815
User-Agent: python-requests/2.18.4
Accept-Encoding: gzip, deflate
Accept: */*
Connection: keep-alive
Content-Length: 141
Content-Type: application/json
Authorization: Basic <auth-token-here>==
b'{"type": "page", "version": {"number": 17}, "body": {"storage": {"representation": "storage", "value": "new body here version version 17"}}}'
requests never went back to PUT (Bug???)
What you're observing is requests behaving consistently with web browsers: reacting to an HTTP 302 redirect with a GET request.
From Wikipedia:
The user agent (e.g. a web browser) is invited by a response with this code to make a second, otherwise identical, request to the new URL specified in the location field.
(...)
Many web browsers implemented this code in a manner that violated this standard, changing the request type of the new request to GET, regardless of the type employed in the original request (e.g. POST)
(...)
As a consequence, the update of RFC 2616 changes the definition to allow user agents to rewrite POST to GET.
So this behaviour is consistent with RFC 2616. I don't think we can say which of the two libraries behaves "more correctly".
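If you need the PUT to survive the redirect, one option (a sketch, not from the original answer; the URL, credentials, and payload are placeholders) is to turn off automatic redirect handling and re-issue the PUT against the Location target yourself:

import requests

data_dict = {"type": "page"}  # placeholder for the payload built in the question
url = "http://confluence.example.com/rest/api/content/101906196"  # hypothetical URL

resp = requests.put(url, json=data_dict, auth=("user", "pass"), allow_redirects=False)

# On a redirect, repeat the PUT (with its body) against the new location.
if resp.status_code in (301, 302, 307, 308):
    resp = requests.put(resp.headers["Location"], json=data_dict,
                        auth=("user", "pass"))
resp.raise_for_status()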
Looks like a difference in how the requests and urllib3 modules deal with switching from http to https (see @Kos's answer above). Here's what I found when I checked the debug logs.
So I got to thinking after @JonClements suggested I send him the response dump. After doing some research I found the magic needed to enable debug logging for requests and urllib3 (see here).
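For reference, the usual recipe looks roughly like this (the standard logging setup, not code copied from the linked post):

import logging
import http.client as http_client

# Dump request/response lines and headers at the http.client level.
http_client.HTTPConnection.debuglevel = 1

# Route urllib3's own debug messages (connections, redirects, retries) to the console.
logging.basicConfig()
logging.getLogger().setLevel(logging.DEBUG)
logging.getLogger("urllib3").setLevel(logging.DEBUG)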
In looking at the diffs from both, I noticed that they were being redirected from http to https for my company's Confluence site:
urllib3:
DEBUG:urllib3.connectionpool:Starting new HTTP connection (1): confluence.myco.com
DEBUG:urllib3.connectionpool:http://confluence.myco.com:80 "PUT /rest/api/content/101906196 HTTP/1.1" 302 237
DEBUG:urllib3.util.retry:Incremented Retry for (url='http://confluence.myco.com/rest/api/content/101906196'): Retry(total=2, connect=None, read=None, redirect=None, status=None)
INFO:urllib3.poolmanager:Redirecting
http://confluence.myco.com/rest/api/content/101906196 ->
https://confluence.myco.com/rest/api/content/101906196
DEBUG:urllib3.connectionpool:Starting new HTTPS connection (1): confluence.myco.com
DEBUG:urllib3.connectionpool:https://confluence.myco.com:443 "PUT /rest/api/content/101906196 HTTP/1.1" 200 None
while requests tried my PUT and then, after the redirect, switched to GET:
DEBUG:urllib3.connectionpool:http://confluence.myco.com:80 "PUT /rest/api/content/101906196 HTTP/1.1" 302 237
DEBUG:urllib3.connectionpool:https://confluence.myco.com:443 "GET /rest/api/content/101906196 HTTP/1.1" 200 None
requests never went back to PUT
I changed my initial url from http: to https: and everything worked fine.

JSON in post request works in HttpRequester but not in python Requests

I'm stuck scraping a web page using Python. Basically, the following is the request from HttpRequester (in Mozilla), and it gives me the right response.
POST https://www.hpe.com/h20195/v2/Library.aspx/LoadMore
Content-Type: application/json
{"sort": "csdisplayorder", "hdnOffset": "1", "uniqueRequestId": "d6da6a30bdeb4d77b0e607a6b688de1e", "test": "", "titleSearch": "false", "facets": "wildcatsearchcategory#HPE,cshierarchycategory#No,csdocumenttype#41,csproducttype#18964"}
-- response --
200 OK
Cache-Control: private, max-age=0
Content-Length: 13701
Content-Type: application/json; charset=utf-8
Server: Microsoft-IIS/7.5
X-AspNet-Version: 4.0.30319
X-Powered-By: ASP.NET
Date: Sat, 28 May 2016 04:12:57 GMT
Connection: keep-alive
The exact same operation in Python 2.7.1 using Requests fails with an error. The following is the code snippet:
jsonContent = {"sort": "csdisplayorder", "hdnOffset": "1", "uniqueRequestId": "d6da6a30bdeb4d77b0e607a6b688de1e", "test": "", "titleSearch": "false", "facets": "wildcatsearchcategory#HPE,cshierarchycategory#No,csdocumenttype#41,csproducttype#18964"}
catResponse = requests.post('https://www.hpe.com/h20195/v2/Library.aspx/LoadMore', json = jsonContent)
The following is the error that I get:
{"Message":"Value cannot be null.\r\nParameter name: source","StackTrace":" at
System.Linq.Enumerable.Contains[TSource](IEnumerable`1 source, TSource value, I
EqualityComparer`1 comparer)\r\n
More information:
The POST request that I'm looking for is fired upon:
- opening this web page: https://www.hpe.com/h20195/v2/Library.aspx?doctype=41&doccompany=HPE&footer=41&filter_doctype=no&filter_doclang=no&country=&filter_country=no&cc=us&lc=en&status=A&filter_status=rw#doctype-41&doccompany-HPE&prodtype_oid-18964&status-a&sortorder-csdisplayorder&teasers-off&isRetired-false&isRHParentNode-false&titleCheck-false
- clicking on the "Load more" grey button at the end of the page
I'm capturing the exact set of request headers and response from the browser operation and trying to mimic that in Postman, Python code and HttpRequester (Mozilla).
It flags the same error (mentioned above) with Postman and Python, but works with no headers set on my part with HttpRequester.
Can anyone think of an explanation for this?
If both Postman and requests are receiving an error, then there is more context than what HttpRequester is showing. There are a number of headers that I'd expect to be set almost always, including User-Agent and Content-Length, that are missing here.
The usual suspects are cookies (look for Set-Cookie headers in earlier requests; preserve those by using a requests.Session() object), the User-Agent header and perhaps a Referer header, but do look for other headers like anything starting with Accept, for example.
Have HttpRequester post to http://httpbin.org/post instead for example, and inspect the returned JSON, which tells you what headers were sent. This won't include cookies (those are domain-specific), but anything else could potentially be something the server looks for. Try such headers one by one if cookies are not helping.
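A minimal sketch of that advice (the header values are placeholders to be filled in from the browser capture, not values from the original answer):

import requests

session = requests.Session()

# Hitting the library page first lets the server set any cookies it expects back.
session.get("https://www.hpe.com/h20195/v2/Library.aspx")

headers = {
    "User-Agent": "Mozilla/5.0",  # placeholder: copy the real browser value
    "Referer": "https://www.hpe.com/h20195/v2/Library.aspx",
}

jsonContent = {"sort": "csdisplayorder", "hdnOffset": "1",
               "uniqueRequestId": "d6da6a30bdeb4d77b0e607a6b688de1e", "test": "",
               "titleSearch": "false",
               "facets": "wildcatsearchcategory#HPE,cshierarchycategory#No,csdocumenttype#41,csproducttype#18964"}

catResponse = session.post("https://www.hpe.com/h20195/v2/Library.aspx/LoadMore",
                           json=jsonContent, headers=headers)
print(catResponse.status_code, catResponse.text[:200])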

How to view Boto3 HTTPS request string

I have been able to view the attributes of the PreparedRequest that botocore sends, but I'm wondering how I can view the exact request string that is sent to AWS. I need the exact request string to be able to compare it to another application I'm testing AWS calls with.
You could also enable debug logging in boto3. That will log all requests and responses, as well as lots of other things. It's a bit obscure to enable:
import boto3
boto3.set_stream_logger(name='botocore')
The reason you have to specify botocore as the name to log is that all of the actual requests and responses happen at the botocore layer.
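A minimal usage sketch (the S3 call is just an illustration, not from the original answer):

import logging
import boto3

# Log every signed request and raw response that botocore sends and receives.
boto3.set_stream_logger(name='botocore', level=logging.DEBUG)

s3 = boto3.client('s3')
s3.list_buckets()  # the full request headers and body appear in the console output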
So what you probably want to do is send your request through a proxy (mitmproxy, Squid), then check the proxy for what was sent.
Since HTTPS data is encrypted, you must first decrypt it, then log it, then encrypt it back and send it to AWS. One of the options is to use mitmproxy. (It's really easy to install.)
Run mitmproxy
Open up another terminal and point the proxy environment variables at mitmproxy's port:
export http_proxy=127.0.0.1:8080
export https_proxy=$http_proxy
Then set verify=False when creating session/client
In [1]: import botocore.session
In [2]: client = botocore.session.Session().create_client('elasticache', verify=False)
Send request and look at the output of mitmproxy
In [3]: client.describe_cache_engine_versions()
The result should be similar to this:
Host: elasticache.us-east-1.amazonaws.com
Accept-Encoding: identity
Content-Length: 53
Content-Type: application/x-www-form-urlencoded
Authorization: AWS4-HMAC-SHA256 Credential=FOOOOOO/20150428/us-east-1/elasticache/aws4_request, SignedHeaders=host;user-agent;x-amz-date, Signature=BAAAAAAR
X-Amz-Date: 20150428T213004Z
User-Agent: Botocore/0.103.0 Python/2.7.6 Linux/3.13.0-49-generic
<?xml version='1.0' encoding='UTF-8'?>
<DescribeCacheEngineVersionsResponse
xmlns="http://elasticache.amazonaws.com/doc/2015-02-02/">
<DescribeCacheEngineVersionsResult>
<CacheEngineVersions>
<CacheEngineVersion>
<CacheParameterGroupFamily>memcached1.4</CacheParameterGroupFamily>
<Engine>memcached</Engine>
<CacheEngineVersionDescription>memcached version 1.4.14</CacheEngineVersionDescription>
<CacheEngineDescription>memcached</CacheEngineDescription>
<EngineVersion>1.4.14</EngineVersion>

Not able to make post request to native domain from chrome extension

Description - I have a website www.robustest.com which is bound to http://robustestom.appspot.com.
When I am trying to make a POST request to /user/signup (robustest) from the Chrome extension Postman, I am getting the following error:
Request URL:http://robustest.com/user/signup
Request Method:POST
Status Code:301 Moved Permanently
Response Headers -
Alternate-Protocol:80:quic
Content-Length:233
Content-Type:text/html; charset=UTF-8
Date:Mon, 30 Dec 2013 03:43:43 GMT
Location:http://www.robustest.com/user/signup
Server:ghs
X-Frame-Options:SAMEORIGIN
X-XSS-Protection:1; mode=block
But it works as expected when I fire the request against http://robustestom.appspot.com/user/signup.
Why we need this - We are making an extension, and in it we need to make POST requests against our domain.
Debugging - I might be wrong, but it seems all POST requests are redirecting to their 'GET' counterparts because the origin is not 'robustest.com' but a 'chrome extension'.
The 301 is redirecting from robustest.com to www.robustest.com. Add the www to the domain the extension is making requests to and the 301 error should go away.
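A quick sketch of the same fix from Python (the payload is a placeholder, not from the question), just to show that pointing at the www host avoids the 301 that clients then follow with a GET:

import requests

resp = requests.post(
    "http://www.robustest.com/user/signup",
    data={"email": "user@example.com"},  # placeholder payload
    allow_redirects=False,               # surface any remaining redirect instead of following it
)
print(resp.status_code, resp.headers.get("Location"))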
