I'm having issues getting a URL pattern to work.
The URL is in the format of the following:
/API#access_token=<string>&expires_in=<timestamp>
Unfortunately I can't change the #access_token=&expires_in= part, as it is outside of my control; I simply have to make my side of the code work.
I've tried a number of different patterns, some of which are outlined below. This is my first Django project, and any advice and pointers would be much appreciated.
url(r'^API#access_token=(?P<token_info>\w+)&expires_in(?P<time>\d+)$'
url(r'^API#(?P<tokens>\w+)$'
url(r'^API/#(?P<tokens>\w+)&(?P<expiration>\d+)$'
The issue is that the anchor #, also called the fragment identifier, is not sent to the server by the browser, so the regex cannot capture what is not there. From the Wikipedia article on the fragment identifier:
The fragment identifier functions differently than the rest of the URI: namely, its processing is exclusively client-side with no participation from the web server — of course the server typically helps to determine the MIME type, and the MIME type determines the processing of fragments. When an agent (such as a Web browser) requests a web resource from a Web server, the agent sends the URI to the server, but does not send the fragment. Instead, the agent waits for the server to send the resource, and then the agent processes the resource according to the document type and fragment value.
The only way around this is to parse the fragment in JavaScript on the client side and send it as a separate asynchronous request. For a GET request, you could send the fragment as a query parameter (after stripping off the hash) or put it in the header as a custom value.
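For example, the Django side could expose a plain endpoint and read the values that the client-side script re-sends as query parameters. A minimal sketch, assuming the client strips off the hash and re-requests /API/?access_token=...&expires_in=... (the URL and view names are illustrative, not from the original project):

# urls.py -- match only the path; the fragment never reaches the server
from django.conf.urls import url
from . import views

urlpatterns = [
    url(r'^API/$', views.api_token),
]

# views.py -- read the values the client re-sent as query parameters
from django.http import HttpResponse

def api_token(request):
    token = request.GET.get('access_token')
    expires_in = request.GET.get('expires_in')
    return HttpResponse('ok')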
Related
Sony's website provides an example that uses WebSockets to work with their API in Node.js:
https://developer.sony.com/develop/audio-control-api/get-started/websocket-example#tutorial-step-3
It worked fine for me, but when I tried to implement it in Python, it did not seem to work.
I use the websocket-client package:
import ssl
import websocket

ws = websocket.WebSocket()
ws.connect("ws://192.168.0.34:54480/sony/avContent",
           sslopt={"cert_reqs": ssl.CERT_NONE})
gives
websocket._exceptions.WebSocketBadStatusException: Handshake status 403 Forbidden
but in their example code there is not any kind of authorization or authentication
I recently had the same problem. Here is what I found out:
Normal HTTP responses can contain Access-Control-Allow-Origin headers to explicitly allow other websites to request data. Otherwise, web browsers block such "cross-origin" requests, because the user could be logged in there, for example.
This "same-origin-policy" apparently does not apply to WebSockets and the handshakes can't have these headers. Therefore any website could connect to your Sony device. You probably wouldn't want some website to set your speaker/receiver volume to 100% or maybe upload a defective firmware, right?
That's why the audio control API checks the Origin header of the handshake. It always contains the website the request is coming from.
The Python WebSocket client you use assumes http://192.168.0.34:54480/sony/avContent as the origin by default in your case. However, it seems that the API ignores the content of the Origin header and just checks whether it's there.
The WebSocket#connect method has a parameter named suppress_origin which can be used to exclude the Origin header.
TL;DR
The Sony audio control API doesn't accept WebSocket handshakes that contain an Origin header.
You can fix it like this:
ws.connect("ws://192.168.0.34:54480/sony/avContent",
sslopt={"cert_reqs": ssl.CERT_NONE},
suppress_origin=True)
I would like to send a requests.get request to this website:
requests.get('https://rent.591.com.tw')
and I always get
<Response [404]>
I know this is a common problem; I have tried different approaches but still get the same result.
All other websites are fine.
Any suggestions?
Webservers are black boxes. They are permitted to return any valid HTTP response, based on your request, the time of day, the phase of the moon, or any other criteria they pick. If another HTTP client gets a different response, consistently, try to figure out what the differences are in the request that Python sends and the request the other client sends.
That means you need to:
Record all aspects of the working request
Record all aspects of the failing request
Try out what changes you can make to make the failing request more like the working request, and minimise those changes.
I usually point my requests to an http://httpbin.org endpoint, have it record the request, and then experiment.
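For example, httpbin's /headers endpoint echoes back exactly what it received, which makes the comparison straightforward:

import requests

# httpbin returns the request headers as JSON, so you can diff what
# requests sends by default against what a working client sends.
print(requests.get('https://httpbin.org/headers').json())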
For requests, there are several headers that are set automatically, and many of these you would not normally expect to have to change:
Host: this must be set to the hostname you are contacting, so that the server can properly serve multiple sites from the same address. requests sets this one.
Content-Length and Content-Type, for POST requests, are usually set from the arguments you pass to requests. If these don't match, alter the arguments you pass in to requests (but watch out with multipart/* requests, which use a generated boundary recorded in the Content-Type header; leave generating that to requests).
Connection: leave this to the client to manage
Cookies: these are often set on an initial GET request, or after first logging into the site. Make sure you capture cookies with a requests.Session() object and that you are logged in (supply credentials the same way the browser did); see the sketch below.
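A minimal sketch of that cookie handling, assuming a conventional login form (the URLs and field names are illustrative):

import requests

session = requests.Session()  # persists cookies across requests, like a browser
session.get('https://example.com/login')  # pick up any initial cookies
session.post('https://example.com/login',
             data={'username': '...', 'password': '...'})
response = session.get('https://example.com/protected')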
Everything else is fair game but if requests has set a default value, then more often than not those defaults are not the issue. That said, I usually start with the User-Agent header and work my way up from there.
In this case, the site is filtering on the user agent; it looks like they are blacklisting Python, and setting the header to almost any other value already works:
>>> requests.get('https://rent.591.com.tw', headers={'User-Agent': 'Custom'})
<Response [200]>
Next, you need to take into account that requests is not a browser. requests is only a HTTP client, a browser does much, much more. A browser parses HTML for additional resources such as images, fonts, styling and scripts, loads those additional resources too, and executes scripts. Scripts can then alter what the browser displays and load additional resources. If your requests results don't match what you see in the browser, but the initial request the browser makes matches, then you'll need to figure out what other resources the browser has loaded and make additional requests with requests as needed. If all else fails, use a project like requests-html, which lets you run a URL through an actual, headless Chromium browser.
The site you are trying to contact makes an additional AJAX request to https://rent.591.com.tw/home/search/rsList?is_new_list=1&type=1&kind=0&searchtype=1&region=1, take that into account if you are trying to scrape data from this site.
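Something along those lines might look like the sketch below, hedged because the endpoint may well demand extra cookies or headers beyond the user agent:

# Sketch: call the JSON endpoint the site's own JavaScript uses.
data = requests.get(
    'https://rent.591.com.tw/home/search/rsList',
    params={'is_new_list': 1, 'type': 1, 'kind': 0,
            'searchtype': 1, 'region': 1},
    headers={'User-Agent': 'Custom'},
).json()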
Next, well-built sites will use security best-practices such as CSRF tokens, which require you to make requests in the right order (e.g. a GET request to retrieve a form before a POST to the handler) and handle cookies or otherwise extract the extra information a server expects to be passed from one request to another.
Last but not least, if a site is blocking scripts from making requests, they are probably either trying to enforce terms of service that prohibit scraping, or they have an API they would rather have you use. Check for either, and take into account that you may be blocked more effectively if you continue to scrape the site anyway.
One thing to note: I was using requests.get() to do some web scraping off links I was reading from a file. What I didn't realise was that each link had a trailing newline character (\n) when I read it from the file.
If you're reading links from a file rather than constructing them in code, make sure to strip any \r or \n characters before you call requests.get("your link"). In my case, I used:
with open("filepath", 'w') as file:
links = file.read().splitlines()
for link in links:
response = requests.get(link)
In my case this was due to the fact that the website address had recently changed, and I had been given the old address. At least this changed the status code from 404 to 500, which, I think, is progress :)
I am currently working on a Bottle project and will be using the Instagram API. I was hoping to use client-side authentication; however, I am having problems with the access token, as it does not come back as a parameter.
I am making the request here:
https://api.instagram.com/oauth/authorize/?client_id=client_id&redirect_uri=redirect_uri&response_type=token&scope=basic+follower_list
The app is redirected to the token page correctly, and I can even see the token in the URL. But when I try to parse it, it comes out empty.
@route('/oauth_callback')
def success_message():
    token = request.GET
    print token.values()
    return "success"
The token.values() call returns an empty list.
PS: Keep in mind that when I try the same operation with server-side authentication, I can successfully get the code and exchange it for a token.
Once you make a query to the Instagram API, you must be receiving the response below:
http://your-redirect-uri#access_token=ACCESS-TOKEN
The part after the # is termed the fragment, not a query-string parameter, and there is no way to retrieve that information server-side in Bottle.
The closest Bottle gets is bottle.request.urlparts, and even there the documentation is explicit:
urlparts
The url string as an urlparse.SplitResult tuple. The tuple contains (scheme, host, path, query_string and fragment), but the fragment is always empty because it is not visible to the server.
Use the SDK, and preferably server-side operations: https://github.com/facebookarchive/python-instagram
If you still want to go with this approach, you would have to manage a piece of JavaScript that parses the access token out of the fragment and then posts it to your Bottle API for your consumption, which I don't recommend...
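If you do take that route regardless, the Bottle side could be as small as this sketch (the route name and form field are illustrative):

from bottle import post, request

@post('/token')
def receive_token():
    # The client-side JavaScript parses location.hash and POSTs the
    # token here, because the fragment itself never reaches the server.
    token = request.forms.get('access_token')
    return "received"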
From https://instagram.com/developer/authentication/
Client-Side (Implicit) Authentication
If you are building an app that does not have a server component (a purely javascript app, for instance), you will notice that it is impossible to complete step three above to receive your access_token without also having to store the secret on the client. You should never pass or store your client_id secret onto a client. For these situations there is the Implicit Authentication Flow.
We often change IP addresses on AWS when switching EC2 servers (yes, we use Elastic IPs when applicable). Sometimes we get IP addresses that used to host other applications (which is fine, of course). When end users click on stale links they get to our servers (which is ok). But these HTTP GET requests have a "referer" header, thus the regular 404 error generates an automated Django email to our developers!
I can recreate the error emails easily using the following Python code:
import urllib2

req = urllib2.Request('http://my_website.com/some/url/we/dont/have')
req.add_header('Referer', 'http://whatever.i.want')
response = urllib2.urlopen(req)
When the Referer header is commented out, Django does not send emails.
We still want the emails to be sent when there are real broken links in OUR website, so I do not want to set SEND_BROKEN_LINK_EMAILS to False.
I cannot filter using IGNORABLE_404_URLS because (a) the logic is reversed, and (b) the regular expression only scans the path, not the hostname.
Help?
I'd probably knock up some middleware that catches the Http404 exception, examines the referer header, and either raises a separate exception (probably will still be mailed though), or does a simple redirect to your front page. Even better, create a custom page describing why the user ended up where they are, and just redirect to that.
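A minimal sketch of that idea, using the old-style middleware API that matches this era of Django (the class name, host list and redirect target are illustrative):

from django.http import Http404, HttpResponseRedirect

class StaleRefererMiddleware(object):
    OUR_HOSTS = ('my_website.com',)  # referers we still want 404 emails for

    def process_exception(self, request, exception):
        if isinstance(exception, Http404):
            referer = request.META.get('HTTP_REFERER', '')
            if referer and not any(host in referer for host in self.OUR_HOSTS):
                # Stale external link: show an explanation page instead of
                # letting the broken-link email fire.
                return HttpResponseRedirect('/stale-link/')
        return None  # fall through to normal 404 handling (and the email)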
I am serving some JSON content from a Google App Engine server. I need to serve an ETag for the content, so clients know whether it has changed since they last loaded the data. My app will then discard its old data and use the new JSON data to populate its views.
self.response.headers['Content-Type'] = "application/json; charset=utf-8"
self.response.out.write(json.dumps(to_dict(objects,"content")))
What's the best practice for setting ETags on the response? Do I have to calculate the ETag myself, or is there a way to get the HTTP layer to do this for me?
If you're using webapp2, it can add an md5 ETag based on the response body automatically.
self.response.md5_etag()
http://webapp-improved.appspot.com/guide/response.html
You'll have to calculate the e-tag value yourself. E-tags are opaque strings that only have meaning to the application.
Best practice is to just concatenate all the input variables (converted to string) that determine the JSON content; anything that, if changed, would result in a change in the JSON output, should be part of this. If there is anything sensitive in those strings you wouldn't want to be exposed, use the MD5 hash of the values instead.
For example, in a CMS application that I administer, the front page has the following e-tag:
|531337735|en-us;en;q=0.5|0|Eli Visual Theme|1|943ed3c25e6d44497deb3fe274f98a96||
The input variables that we care about have been concatenated with the | symbol into an opaque value, but it does represent several distinct input values, such as a last-modified timestamp (the number), the browser accepted-language header, the current visual theme, and an internal UID that is retrieved from a browser cookie (and which determines what context the content on the front page is taken from). If any of those variables were to change, the page is likely to be different and the cached copy would be stale.
Note that an e-tag is useless without a means to verify it quickly. A client will include it in a If-None-Match request header, and the server should be able to quickly re-calculate the e-tag header from the current variables and see if the tag is still current. If that re-calculation would take the same amount of time as regenerating the content body, you only save a little bandwidth sending the 304 Not Modified response instead of a full JSON body in a 200 OK response.
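A sketch of both halves, computing the tag from the input variables and honouring If-None-Match (the handler shape and variable names are illustrative; note that ETag values are conventionally quoted):

import hashlib

def compute_etag(*parts):
    # Hash the concatenated inputs so nothing sensitive leaks into the header.
    raw = '|'.join(str(p) for p in parts)
    return '"%s"' % hashlib.md5(raw.encode('utf-8')).hexdigest()

# Inside the request handler:
etag = compute_etag(last_modified, language, theme, uid)
if self.request.headers.get('If-None-Match') == etag:
    self.response.set_status(304)  # client copy is still current; skip the body
else:
    self.response.headers['ETag'] = etag
    self.response.out.write(json_body)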