AWS URL or URI? - python

Hello, I am interested in one thing. I know it may be a silly question, but I cannot understand one thing here:
<CreateBucketConfiguration xmlns="http://s3.amazonaws.com/doc/2006-03-01/">
<LocationConstraint>eu-central-1</LocationConstraint>
</CreateBucketConfiguration>
is "http://s3.amazonaws.com/doc/2006-03-01/" URL or URI?
cause when I type it in browser it shows me this:
(screenshot of the error response)

It's both. A uniform resource identifier (URI) provides a name for a resource. A URL (uniform resource locator) is a kind of URI that describes where (and how) the resource can be accessed.
The URL you provide, http://s3.amazonaws.com/doc/2006-03-01/, doesn't tell you anything about what is there. It simply says that if you use the HTTP protocol to connect to s3.amazonaws.com and request /doc/2006-03-01/, you'll get something back. What that something is, is only implied by the name of the XML attribute that has the URL as its value.
(In practice, the server may not actually provide a resource at that location, but it could. The error message you see indicates there might be something there, but you don't have permission to access it.)
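You can see the "something back" for yourself; here is a minimal sketch using the requests library (the exact response depends on what Amazon chooses to serve at that path):

import requests

# Ask the server at s3.amazonaws.com for whatever lives at /doc/2006-03-01/.
r = requests.get("http://s3.amazonaws.com/doc/2006-03-01/")
print(r.status_code)   # an error status such as 403 still counts as "something back"
print(r.text[:200])    # S3 errors typically come back as a small XML document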

Related

AWS CDK: SNS HTTPS Subscription "Must provide protocol if url is unresolved"

I'm having trouble finding anything about this in the documentation. I know that this line is the problem line, because when I comment it out, "cdk deploy" works just fine. Basically, I am getting a URL as a parameter from the user (I am aware parameters are not recommended by AWS, but this is necessary for my use case). I then use that parameter to subscribe to an SNS topic in the following line:
my_sns_topic.add_subscription(subscriptions.UrlSubscription(pagerduty_url.value_as_string))
When this line isn't commented out, I get the following error:
jsii.errors.JSIIError: Must provide protocol if url is unresolved
I can subscribe something using the email protocol to this SNS topic just fine, so I don't think it's the SNS topic itself. It also works when I pass the URL directly into the function as a string, so it seems to be an issue with the parameter. So, how do I fix this? I understand it wants to have an alternative in case the URL is not valid, but there's no information in the CDK documentation on how to actually do this, or else I'm just not finding it.
Because you are using a CfnParameter, whose value can't be read at synth time, the UrlSubscription class has no way of inferring the protocol from the URL, so you must provide it yourself. Take a look here for reference:
https://docs.aws.amazon.com/cdk/api/v2/python/aws_cdk.aws_sns_subscriptions/UrlSubscription.html
Try:
my_sns_topic.add_subscription(subscriptions.UrlSubscription(pagerduty_url.value_as_string, protocol=sns.SubscriptionProtocol.HTTPS))
If it isn't always going to be HTTPS, then the protocol will likely have to be another parameter you request from the user.
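Put together, a minimal sketch of a CDK v2 stack (the construct IDs and stack class name below are made up for illustration):

from aws_cdk import Stack, CfnParameter, aws_sns as sns, aws_sns_subscriptions as subscriptions
from constructs import Construct

class AlertsStack(Stack):
    def __init__(self, scope: Construct, construct_id: str, **kwargs) -> None:
        super().__init__(scope, construct_id, **kwargs)

        # The URL is supplied at deploy time, so its value is unresolved at synth time.
        pagerduty_url = CfnParameter(self, "PagerDutyUrl", type="String")

        my_sns_topic = sns.Topic(self, "AlertsTopic")

        # Because the value can't be inspected during synth, the protocol
        # must be stated explicitly.
        my_sns_topic.add_subscription(
            subscriptions.UrlSubscription(
                pagerduty_url.value_as_string,
                protocol=sns.SubscriptionProtocol.HTTPS,
            )
        )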

Problem with detecting if link is invalid

Is there any way to detect if a link is invalid using webbot?
I need to tell the user that the link they provided was unreachable.
The only way to be completely sure that a URL leads to a valid page is to fetch that page and check that it works. You could try making a request other than GET (for example HEAD) to avoid wasting bandwidth downloading the page, but not all servers will answer it: the only way to be absolutely sure is to GET and see what happens. Something like:
import requests
from requests.exceptions import ConnectionError, Timeout

def check_url(url):
    # True only if the URL answers with HTTP 200 within the timeout.
    try:
        r = requests.get(url, timeout=1)
        return r.status_code == 200
    except (ConnectionError, Timeout):
        return False
Is this a good idea? It's only a GET request, and GET is supposed to be idempotent, so you shouldn't cause anybody any harm. On the other hand, what if a user sets up a script to add a new link every second pointing to the same website? Then you're DDoSing that website. So when you allow users to make your server do things like this, you need to think about how to protect it. (In this case: you could keep a cache of valid links expiring every n seconds, and only look a link up if the cache doesn't hold it.)
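For example, a simple in-memory TTL cache wrapped around the check_url helper above (the 300-second lifetime is an arbitrary choice):

import time

_cache = {}        # url -> (timestamp, result)
CACHE_TTL = 300    # seconds; tune to your needs

def check_url_cached(url):
    # Reuse a recent result so repeated submissions of the same link
    # don't hammer the target site.
    now = time.time()
    hit = _cache.get(url)
    if hit and now - hit[0] < CACHE_TTL:
        return hit[1]
    result = check_url(url)
    _cache[url] = (now, result)
    return result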
Note that if you just want to check that the link points to a valid domain, it's a bit easier: you can just do a DNS query. (The same point about caching and avoiding abuse probably applies.)
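A sketch of that DNS-only check, using just the standard library (this only confirms the hostname resolves, not that the page exists):

import socket
from urllib.parse import urlparse

def domain_resolves(url):
    # Extract the hostname and see whether DNS can resolve it at all.
    hostname = urlparse(url).hostname
    if not hostname:
        return False
    try:
        socket.getaddrinfo(hostname, None)
        return True
    except socket.gaierror:
        return False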
Note that I used requests because it is easy, but you likely want to do this in the background, either with requests in a thread, or with one of the asyncio HTTP libraries and an asyncio event loop. Otherwise your code will block for at least timeout seconds.
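One way to push the check into a thread, assuming the check_url helper from above (the pool size is arbitrary):

from concurrent.futures import ThreadPoolExecutor

_executor = ThreadPoolExecutor(max_workers=4)

def check_url_in_background(url, on_done):
    # Run check_url in a worker thread and hand the result to a callback,
    # so the caller isn't blocked for up to `timeout` seconds.
    future = _executor.submit(check_url, url)
    future.add_done_callback(lambda f: on_done(f.result()))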
(Another attack: this fetches the whole page. What if a user links to a massive page? See this question for a discussion of protecting against oversize responses. For your use case you likely just want to read a few bytes. I've deliberately not complicated the example code above, because I wanted to illustrate the principle.)
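If you do want to cap the download, something along these lines (reusing the imports from the snippet above; the byte limit is an arbitrary choice):

def check_url_capped(url, max_bytes=2048):
    # Stream the response and read at most one small chunk, so a link to a
    # huge file can't tie up your server.
    try:
        with requests.get(url, timeout=1, stream=True) as r:
            next(r.iter_content(chunk_size=max_bytes), b"")
            return r.status_code == 200
    except (ConnectionError, Timeout):
        return False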
Note that this just checks that something is available at that page. What if it's one of the many dead links that redirect to a domain-seller's website? You could enforce 'no redirects', but then some redirects are valid. (Likewise, you could try to detect redirects up to the main domain, or to a blacklist of vendors' domains, but this will always be imperfect.) There is a tradeoff here to consider, which depends on your concrete use case, but it's worth being aware of.
You could also try sending an HTTP request, inspecting the result, and keeping a list of known error codes (404, etc.). You can easily implement this in Python, and it is efficient and quick. Be warned that sometimes (quite rarely) a website might detect your scraper and artificially return an error code to confuse you.

What is the difference between HTTP Post URL with /post and without using Python requests module?

I am using Python 2.7 with the requests module to send an HTTP POST with parameters. I encountered a strange problem.
To do an HTTP POST, it is just one line:
x = requests.post(URL, params)
I have no problem with the params. It is the URL that puzzles me.
Sometimes the URL http://hostname/path/post works. Sometimes I have to use http://hostname/path, without the /post, to get the HTTP POST to work. I am puzzled why this is so. What is the difference between the two? Under what conditions do I use which one?
'http://hostname/path/post' is just a URL; the /post at the end is an ordinary path segment, nothing special. You could in principle issue an HTTP GET request to that same URL (although you probably wouldn't get anything meaningful back).
In general, you should look at the site's API documentation and post to the url that they say you should post to without adding anything extra to the url.
These are two different concepts, the URL and the HTTP method, and you are confusing yourself by mixing them.
URL - an address you talk to
The URL addresses something on some server. If you are given a valid URL, treat it as an opaque string: don't try to read anything into it, just use it as given.
To compare it to visiting a friend, the URL is the address of the door you arrive at.
HTTP method (POST, GET, DELETE...)
There are multiple HTTP methods, which differ in how you talk to the given URL.
To continue the visiting-a-friend analogy, the method is how you try to get the door opened (ring the bell, knock, or use a hammer).
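To make the distinction concrete, here is a small sketch with requests (the URL is a placeholder; use the exact address your API documentation gives you):

import requests

url = "http://hostname/path"

# Same address, two different ways of "knocking on the door":
get_response = requests.get(url)                            # ask for a representation
post_response = requests.post(url, data={"name": "value"})  # submit data to it

print(get_response.status_code, post_response.status_code)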

Odd redirect location causes proxy error with urllib2

I am using urllib2 to do an http post request using Python 2.7.3. My request is returning an HTTPError exception (HTTP Error 502: Proxy Error).
Looking at the messages traffic with Charles, I see the following is happening:
I send the HTTP request (POST /index.asp?action=login HTTP/1.1) using urllib2
The remote server replies with status 303 and a location header of ../index.asp?action=news
urllib2 retries sending a get request: (GET /../index.asp?action=news HTTP/1.1)
The remote server replies with status 502 (Proxy error)
The 502 reply includes this in the response body: "DNS lookup failure for: 10.0.0.30:80index.asp" (Notice the malformed URL)
So I take this to mean that a proxy server on the remote server's network sees the "/../index.asp" URL in the request and misinterprets it, sending my request on with a bad URL.
When I make the same request with my browser (Chrome), the retry is sent to GET /index.asp?action=news. So Chrome takes off the leading "/.." from the URL, and the remote server replies with a valid response.
Is this a urllib2 bug? Is there something I can do so the retry ignores the "/.." in the URL? Or is there some other way to solve this problem? Thinking it might be a urllib2 bug, I swapped out urllib2 with requests but requests produced the same result. Of course, that may be because requests is built on urllib2.
Thanks for any help.
The Location being sent with that 303 is wrong in multiple ways.
First, if you read RFC 2616 (HTTP/1.1 Header Field Definitions), 14.30 Location, the Location must be an absoluteURI, not a relative one. And section 10.3.4 (303 See Other) makes it clear that this is the relevant definition.
Second, even if a relative URI were allowed, RFC 1808 (Relative Uniform Resource Locators), section 4 "Resolving Relative URLs", step 6, only specifies special handling for .. in the pattern <segment>/../. A leading .. has no preceding segment to cancel, so it is simply left in place. That means a relative URL shouldn't begin with ..: for example, if the base URL is http://example.com/ and the relative URL is ../baz/, the resolved URL is not http://example.com/baz/ but http://example.com/../baz/. (Of course many servers will treat these two the same way, but that's up to each server.)
Finally, even if you did combine the relative and base URLs before resolving .., an absolute URI with a path starting with .. is invalid.
So, the bug is in the server's configuration.
Now, it just so happens that many user-agents will work around this bug. In particular, they turn /../foo into /foo to block users (or arbitrary JS running on their behalf without their knowledge) from trying to do "escape from webroot" attacks.
But that doesn't mean that urllib2 should do so, or that it's buggy for not doing so. Of course urllib2 should detect the error earlier so it can tell you "invalid path" or something, instead of running together an illegal absolute URI that's going to confuse the server into sending you back nonsense errors. But it is right to fail.
It's all well and good to say that the server configuration is wrong, but unless you're the one in charge of the server, you'll probably face an uphill battle trying to convince them that their site is broken and needs to be fixed when it works with every web browser they care about. Which means you may need to write your own workaround to deal with their site.
The way to do that with urllib2 is to supply your own HTTPRedirectHandler with an implementation of the redirect_request method that recognizes this case and returns a different Request than the default code would (in particular, one for http://example.com/index.asp?action=news instead of http://example.com/../index.asp?action=news).
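A rough sketch of such a handler (the URL-fixing rule here is a guess at what this particular broken server needs, not a general-purpose fix, and the login URL and POST body are placeholders):

import urllib2
import urlparse

class StripDotDotRedirectHandler(urllib2.HTTPRedirectHandler):
    def redirect_request(self, req, fp, code, msg, headers, newurl):
        # Drop any leading "/.." segments produced by the server's bad Location header.
        parts = urlparse.urlsplit(newurl)
        path = parts.path
        while path.startswith("/.."):
            path = path[3:] or "/"
        fixed = urlparse.urlunsplit((parts.scheme, parts.netloc, path,
                                     parts.query, parts.fragment))
        return urllib2.HTTPRedirectHandler.redirect_request(
            self, req, fp, code, msg, headers, fixed)

opener = urllib2.build_opener(StripDotDotRedirectHandler)
response = opener.open("http://example.com/index.asp?action=login",
                       data="user=me&password=secret")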

Python urllib.urlopen() call doesn't work with a URL that a browser accepts

If I point Firefox at http://bitbucket.org/tortoisehg/stable/wiki/Home/ReleaseNotes, I get a page of HTML. But if I try this in Python:
import urllib
site = 'http://bitbucket.org/tortoisehg/stable/wiki/Home/ReleaseNotes'
req = urllib.urlopen(site)
text = req.read()
I get the following:
500 Internal Server Error
The server encountered an internal error or misconfiguration and was unable to complete your request.
What am I doing wrong?
You are not doing anything wrong; Bitbucket does some user-agent detection (to detect Mercurial clients, for example). Just changing the user agent fixes it (as long as the new value doesn't have 'urllib' as a substring).
You should file an issue regarding this: http://bitbucket.org/jespern/bitbucket/issues/new/
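For example, switching to urllib2 so a User-Agent header can be set (the exact string is arbitrary, as long as it doesn't contain 'urllib'):

import urllib2

site = 'http://bitbucket.org/tortoisehg/stable/wiki/Home/ReleaseNotes'
req = urllib2.Request(site, headers={'User-Agent': 'Mozilla/5.0 (compatible; my-script)'})
text = urllib2.urlopen(req).read()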
You're doing nothing wrong, on the surface, and as the error page says you should contact the site's administrators because they're the ones with the server logs which may explain what's happening. Fortunately, bitbucket's site admins are a friendly bunch!
No doubt there is some header, or combination of headers, that browsers set one way and urllib sets another way, and a bug on the server gets tickled in the latter case. You may want to see exactly what headers are being sent, e.g. with Firebug in Firefox, and reproduce them until you isolate exactly which one triggers the server bug; most likely it's going to be the user agent or some "Accept"-ish header.
I don't think you're doing anything wrong -- it looks like this server was just down? Your script worked fine for me ('text' contained the same data as that displayed in the browser).
