Best pythonic way to insert quotes into a string - python

I'm dealing with this Python bug while writing my own reverse proxy. The server is sending my proxy this Set-Cookie response header:
workgroup_session_id=ilDJtR0rE1AG28C9ZxKLHj8TBtcT89sw; Path=/; Expires=Sun, 02-Dec-2012 5:57:25 GMT; HttpOnly
I am loading this string into a SimpleCookie instance from the Cookie module. Unfortunately, because of the bug referenced above, when I later pull expires out of the morsel dictionary it returns just Sun,. I have found that I can work around the bug by adding quotes around the Expires component of the Set-Cookie header (or, more generally, around any value that contains spaces).
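For reference, the parsing side looks roughly like this (Python 2's Cookie module; the truncated result is the behaviour I'm describing, on affected versions):
import Cookie

cookie = Cookie.SimpleCookie()
# the unquoted Expires value contains spaces, which trips the parser
cookie.load('workgroup_session_id=ilDJtR0rE1AG28C9ZxKLHj8TBtcT89sw; Path=/; '
            'Expires=Sun, 02-Dec-2012 5:57:25 GMT; HttpOnly')
print cookie['workgroup_session_id']['expires']  # -> 'Sun,' instead of the full date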
So this:
workgroup_session_id=ilDJtR0rE1AG28C9ZxKLHj8TBtcT89sw; Path=/; Expires=Sun, 02-Dec-2012 5:57:25 GMT; HttpOnly
Would become:
workgroup_session_id=ilDJtR0rE1AG28C9ZxKLHj8TBtcT89sw; Path=/; Expires="Sun, 02-Dec-2012 5:57:25 GMT"; HttpOnly
And this:
test=a b c; Path=/; Expires=a b c; HttpOnly
Would become:
test="a b c"; Path=/; Expires="a b c"; HttpOnly
I know that I could break the string into tokens, loop through them looking for spaces, and then reconstruct the string, but I am curious what the best-performing solution would be. As I mentioned, this is a reverse proxy that could potentially handle a few hundred requests a second, so I'd like this substitution to be as fast as possible.
Would a (pre-compiled, of course) regular expression substitution be efficient? I've heard that regular expressions are pretty heavyweight...

How about this regex:
import re
header = re.sub("(?<==)[^;]* [^;]*", r'"\g<0>"', header)
This inserts quotes around whatever follows a = up to the next ; (or the end of the string), but only if there is at least one space in between.
>>> header = 'test=a b c; Path=/; Expires=a b c; HttpOnly'
>>> re.sub("(?<==)[^;]* [^;]*", r'"\g<0>"', header)
'test="a b c"; Path=/; Expires="a b c"; HttpOnly'
>>> header = "workgroup_session_id=ilDJtR0rE1AG28C9ZxKLHj8TBtcT89sw; Path=/; Expires=Sun, 02-Dec-2012 5:57:25 GMT; HttpOnly"
>>> re.sub("(?<==)[^;]* [^;]*", r'"\g<0>"', header)
'workgroup_session_id=ilDJtR0rE1AG28C9ZxKLHj8TBtcT89sw; Path=/; Expires="Sun, 02-Dec-2012 5:57:25 GMT"; HttpOnly'
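Since this runs on a hot path, it's worth hoisting the compile out of the request loop; a minimal sketch (the re module caches compiled patterns internally, so pre-compiling mainly saves the per-call cache lookup; the names here are just illustrative):
import re

# compiled once at import time, reused for every request
QUOTE_SPACED_VALUE = re.compile(r'(?<==)[^;]* [^;]*')

def quote_header(header):
    # wrap any =value that contains a space in double quotes
    return QUOTE_SPACED_VALUE.sub(r'"\g<0>"', header)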

Do you need to put quotes just around the date following Expires, or any arbitrary date that appears anywhere in the header? If it's the former, try this:
header = "workgroup_session_id=ilDJtR0rE1AG28C9ZxKLHj8TBtcT89sw; Path=/; Expires=Sun, 02-Dec-2012 5:57:25 GMT; HttpOnly"
print(header.replace('Expires=', 'Expires="').replace('GMT', 'GMT"'))
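Note that the chained replace is very fast but brittle: it assumes the value always ends in GMT and that 'GMT' never appears elsewhere in the header. If you want to compare approaches under realistic input, a quick timeit sketch (using the header from the question):
from timeit import timeit

header = ('workgroup_session_id=ilDJtR0rE1AG28C9ZxKLHj8TBtcT89sw; '
          'Path=/; Expires=Sun, 02-Dec-2012 5:57:25 GMT; HttpOnly')

# time the chained-replace candidate over 100,000 iterations
print(timeit(lambda: header.replace('Expires=', 'Expires="').replace('GMT', 'GMT"'),
             number=100000))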

Related

Django REST Framework - cookies not being set via Response.set_cookie() call...?

I have the following standard Response setup in my DRF application:
response = Response(data=response, status=status.HTTP_200_OK)
I'm then attempting to add a split JWT header.payload and signature to the response headers with a response.set_cookie() call, as follows:
max_age = 365 * 24 * 60 * 60
expires = datetime.datetime.utcnow() + datetime.timedelta(seconds=max_age)
response.set_cookie(
    key='JWT_ACCESS_HEADER_PAYLOAD',
    value=header_payload,
    httponly=False,
    expires=expires.strftime("%a, %d-%b-%Y %H:%M:%S UTC"),
    max_age=max_age
)
response.set_cookie(
    key='JWT_ACCESS_SIGNATURE',
    value=signature,
    httponly=True,
    expires=expires.strftime("%a, %d-%b-%Y %H:%M:%S UTC"),
    max_age=max_age
)
return response
I can't see anything wrong with what I have done thus far, yet for some reason only some of the cookies appear to be set on the client side. What have I done wrong here?
The actual output in the headers seems to be a valid Set-Cookie value:
JWT_ACCESS_HEADER_PAYLOAD=aSasas; expires=Sat, 03-Apr-2021 10:24:31 GMT; Max-Age=31536000; Path=/
JWT_ACCESS_SIGNATURE=asaSasaS; expires=Sat, 03-Apr-2021 10:24:31 GMT; HttpOnly; Max-Age=31536000; Path=
N.B. Running on localhost...if that helps?
So, this turned out to be a very trivial solution, because it was just that.
I was using axios ... without sending { withCredentials: true } with the requests.
The cookies were being set all along; I just needed to refresh the browser to see them.
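For completeness, the server-side counterpart to withCredentials is that the response must carry Access-Control-Allow-Credentials when the frontend runs on a different origin; with the django-cors-headers package that means settings roughly like these (the origin below is a hypothetical frontend address):
# settings.py -- assumes django-cors-headers is installed and configured
CORS_ALLOW_CREDENTIALS = True      # emit Access-Control-Allow-Credentials: true
CORS_ALLOWED_ORIGINS = [
    'http://localhost:3000',       # hypothetical frontend origin
]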

Abort a request after checking response headers

I have a script that requests a URL via urllib.request's urlopen and then gets its info().
I don't want to proceed with the request once I have these headers, so at the moment I just leave the response as it is and forget about it. But that seems to leave the connection open, and perhaps the server keeps sending data that simply gets ignored.
How can I abort the request properly?
#!/usr/bin/python3
import urllib.request
response = urllib.request.urlopen('http://google.co.uk')
headers = dict(response.info())
print(headers)
# now finished with response, abort???
# ... more stuff
I think what you want is a HEAD request (the session below uses Python 2's httplib; a Python 3 version follows it). Something like:
>>> import httplib
>>> c = httplib.HTTPConnection("www.google.co.uk")
>>> c.request("HEAD", "/index.html")
>>> r = c.getresponse()
>>> r.getheaders()
[('x-xss-protection', '1; mode=block'), ('transfer-encoding', 'chunked'), ('set-cookie', 'PREF=ID=7867b0a5641d5f7b:FF=0:TM=1363882090:LM=1363882090:S=EXLl2JgBqzMKODcq; expires=Sat, 21-Mar-2015 16:08:10 GMT; path=/; domain=.google.co.uk, NID=67=qElAph6eqHyYKbh995ivP4B-21YRDRED4-uRXx0AvC3vLpv0SF1LkdsI2k6Hg1IhsatrVVqWf2slcMCaQsAZwZ89YfU0F1iPVBdt9PC2FItff31oRJ3gvhJVTQLa_RAt; expires=Fri, 20-Sep-2013 16:08:10 GMT; path=/; domain=.google.co.uk; HttpOnly'), ('expires', '-1'), ('server', 'gws'), ('cache-control', 'private, max-age=0'), ('date', 'Thu, 21 Mar 2013 16:08:10 GMT'), ('p3p', 'CP="This is not a P3P policy! See http://www.google.com/support/accounts/bin/answer.py?hl=en&answer=151657 for more info."'), ('content-type', 'text/html; charset=ISO-8859-1'), ('x-frame-options', 'SAMEORIGIN')]
>>>
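Since the question uses Python 3 (where httplib was renamed http.client), the same HEAD request can also be made with urllib.request directly; a minimal sketch:
import urllib.request

# method='HEAD' is supported by Request since Python 3.3;
# the server must not return a body for a HEAD request
req = urllib.request.Request('http://google.co.uk', method='HEAD')
with urllib.request.urlopen(req) as response:
    headers = dict(response.info())
print(headers)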
From w3.org
The HEAD method is identical to GET except that the server MUST NOT
return a message-body in the response. The metainformation contained
in the HTTP headers in response to a HEAD request SHOULD be identical
to the information sent in response to a GET request. This method can
be used for obtaining metainformation about the entity implied by the
request without transferring the entity-body itself. This method is
often used for testing hypertext links for validity, accessibility,
and recent modification.
The response to a HEAD request MAY be cacheable in the sense that the
information contained in the response MAY be used to update a
previously cached entity from that resource. If the new field values
indicate that the cached entity differs from the current entity (as
would be indicated by a change in Content-Length, Content-MD5, ETag or
Last-Modified), then the cache MUST treat the cache entry as stale.
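If you do stick with a plain GET, the response object returned by urlopen can at least be closed explicitly as soon as the headers have been read, so the connection isn't left dangling; a minimal sketch against the question's code:
import urllib.request

response = urllib.request.urlopen('http://google.co.uk')
headers = dict(response.info())
response.close()  # finished with the response; close the underlying connection
print(headers)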

How to convert unicode accented characters to pure ascii without accents?

I'm trying to download some content from a dictionary site like http://dictionary.reference.com/browse/apple?s=t
The problem I'm having is that the original paragraph has all those squiggly lines, and reverse letters, and such, so when I read the local files I end up with those funny escape characters like \x85, \xa7, \x8d, etc.
My question is: is there any way I can convert all those escape characters into their respective UTF-8 characters, e.g. if there is an 'à', how do I convert that into a standard 'a'?
Python calling code:
import os
word = 'apple'
os.system(r'wget.lnk --directory-prefix=G:/projects/words/dictionary/urls/ --output-document=G:\projects\words\dictionary\urls/' + word + '-dict.html http://dictionary.reference.com/browse/' + word)
I'm using wget-1.11.4-1 on a Windows 7 system (don't kill me Linux people, it was a client requirement), and the wget exe is being fired off with a Python 2.6 script file.
How do I convert all those escape characters into their respective characters, e.g. if there is a unicode à, how do I convert that into a standard a?
Assume you have loaded your unicode into a variable called my_unicode... normalizing à into a is this simple...
import unicodedata
output = unicodedata.normalize('NFD', my_unicode).encode('ascii', 'ignore')
Explicit example...
>>> myfoo = u'àà'
>>> myfoo
u'\xe0\xe0'
>>> unicodedata.normalize('NFD', myfoo).encode('ascii', 'ignore')
'aa'
>>>
How it works
unicodedata.normalize('NFD', "insert-unicode-text-here") performs a Canonical Decomposition (NFD) of the unicode text; then str.encode('ascii', 'ignore') transforms the decomposed characters into ascii, dropping anything that cannot be encoded (i.e. the combining accent marks).
Mike Pennington's solution works great, thanks to him. But when I tried that solution I noticed that it fails for some special characters (e.g. the ı character from the Turkish alphabet) which have no decomposition under NFD.
I discovered another solution: you can use the unidecode library for this conversion.
>>> import unidecode
>>> example = "ABCÇDEFGĞHIİJKLMNOÖPRSŞTUÜVYZabcçdefgğhıijklmnoöprsştuüvyz"
>>> # decode the byte string to unicode first (Python 2)
>>> utf8text = unicode(example, "utf-8")
>>> print utf8text
ABCÇDEFGĞHIİJKLMNOÖPRSŞTUÜVYZabcçdefgğhıijklmnoöprsştuüvyz
>>> # transliterate the unicode text to plain ascii
>>> asciitext = unidecode.unidecode(utf8text)
>>> print asciitext
ABCCDEFGGHIIJKLMNOOPRSSTUUVYZabccdefgghiijklmnooprsstuuvyz
(In Python 3 the unicode() step is unnecessary, since str is already unicode.)
I needed something like this, but to remove only accented characters while ignoring special ones, so I wrote this small function:
# -*- coding: utf-8 -*-
import re

def remove_accents(string):
    if type(string) is not unicode:
        string = unicode(string, encoding='utf-8')
    string = re.sub(u"[àáâãäå]", 'a', string)
    string = re.sub(u"[èéêë]", 'e', string)
    string = re.sub(u"[ìíîï]", 'i', string)
    string = re.sub(u"[òóôõö]", 'o', string)
    string = re.sub(u"[ùúûü]", 'u', string)
    string = re.sub(u"[ýÿ]", 'y', string)
    return string
I like that function because you can customize it in case you need to handle other characters.
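For example (note that only lowercase accented letters are mapped; uppercase ones pass through unchanged):
>>> print remove_accents(u'héllo wörld')
hello world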
The given URL returns UTF-8 as the HTTP response clearly indicates:
wget -S http://dictionary.reference.com/browse/apple?s=t
--2013-01-02 08:43:40-- http://dictionary.reference.com/browse/apple?s=t
Resolving dictionary.reference.com (dictionary.reference.com)... 23.14.94.26, 23.14.94.11
Connecting to dictionary.reference.com (dictionary.reference.com)|23.14.94.26|:80... connected.
HTTP request sent, awaiting response...
HTTP/1.1 200 OK
Server: Apache
Cache-Control: private
Content-Type: text/html;charset=UTF-8
Date: Wed, 02 Jan 2013 07:43:40 GMT
Transfer-Encoding: chunked
Connection: keep-alive
Connection: Transfer-Encoding
Set-Cookie: sid=UOPlLC7t-zl20-k7; Domain=reference.com; Expires=Wed, 02-Jan-2013 08:13:40 GMT; Path=/
Set-Cookie: cu.wz=0; Domain=.reference.com; Expires=Thu, 02-Jan-2014 07:43:40 GMT; Path=/
Set-Cookie: recsrch=apple; Domain=reference.com; Expires=Tue, 02-Apr-2013 07:43:40 GMT; Path=/
Set-Cookie: dcc=*~*~*~*~*~*~*~*~; Domain=reference.com; Expires=Thu, 02-Jan-2014 07:43:40 GMT; Path=/
Set-Cookie: iv_dic=1-0; Domain=reference.com; Expires=Thu, 03-Jan-2013 07:43:40 GMT; Path=/
Set-Cookie: accepting=1; Domain=.reference.com; Expires=Thu, 02-Jan-2014 07:43:40 GMT; Path=/
Set-Cookie: bid=UOPlLC7t-zlrHXne; Domain=reference.com; Expires=Fri, 02-Jan-2015 07:43:40 GMT; Path=/
Length: unspecified [text/html]
Investigating the saved file using vim also reveals that the data is correctly utf-8 encoded... the same is true when fetching the URL using Python.
The issue was different for me, but this page helped me resolve it:
unicodedata.normalize('NFKC', 'V').encode('ascii', 'ignore')
output: b'V'

Print line containing "word" python

I would like to print ONLY the line which contains "Server" in the below piece of output:
Date: Sun, 16 Dec 2012 20:07:44 GMT
Expires: -1
Cache-Control: private, max-age=0
Content-Type: text/html; charset=ISO-8859-1
Set-Cookie: PREF=ID=da8d52b67e5c7522:FF=0:TM=1355688464:LM=1355688464:S=CrK5vV-qb3UgWUM1; expires=Tue, 16-Dec-2014 20:07:44 GMT; path=/; domain=.google.com
Set-Cookie: NID=67=nICkwXDM6H7TNQfHbo06FbvZhO61bzNmtOn4HA71ukaVDSgywlBjBkAR-gXCpMNo1TlYym-eYMUlMkCHVpj7bDRwiHT6jkr7z4dMrApDuTk_HuTrZrkoctKlS7lXjz9a; expires=Mon, 17-Jun-2013 20:07:44 GMT; path=/; domain=.google.com; HttpOnly
P3P: CP="This is not a P3P policy! See http://www.google.com/support/accounts/bin/answer.py?hl=en&answer=151657 for more info."
Server: gws
X-XSS-Protection: 1; mode=block
X-Frame-Options: SAMEORIGIN
Connection: close
This information is fetched from a list called websiteheaders. The piece of code below is driving me crazy because it is not working properly...
for line in websiteheaders:
    if "Server" in line:
        print line
Now this piece of code prints exactly the same block of text shown at the beginning of my post. I just don't see why it does that...
As I've said, I only want to print the line that contains "Server", if possible without regex. And if not possible, with regex.
Please help and thanks!
EDIT: My complete code so far is pasted here: http://pastebin.com/sYuZyvX9
EDIT2: For completeness, in hosts.txt there currently is 1 host named "google.com"
Update
My code was actually working fine, but there was a mistake in another piece of my code which meant that the data put into the websiteheaders list was one large string instead of multiple entries. The loop above will of course find "Server" in that single entry and print the whole thing, which in my case was the full (large) string.
Using
websiteheaders.extend(headers.splitlines())
instead of
websiteheaders.append(headers)
did the trick for me. Thanks a lot, guys.
Is websiteheaders really a list with one entry per line? Because if it's a string you should use:
for line in websiteheaders.splitlines():
    if "Server" in line:
        print line
Also, a good tip: I would recommend adding some print statements when you run into this kind of problem. If you had added something like:
    else:
        print 'WRONG LINE:', line
You probably would have caught that this loop was not looping over every line but over every character.
Update
I can't see what's wrong with your code then. This is what I get:
In [3]: websiteheaders
Out[3]:
['Date: Sun, 16 Dec 2012 20:07:44 GMT',
'Expires: -1',
'Cache-Control: private, max-age=0',
'Content-Type: text/html; charset=ISO-8859-1',
'Set-Cookie: PREF=ID=da8d52b67e5c7522:FF=0:TM=1355688464:LM=1355688464:S=CrK5vV-qb3UgWUM1; expires=Tue, 16-Dec-2014 20:07:44 GMT; path=/; domain=.google.com',
'Set-Cookie: NID=67=nICkwXDM6H7TNQfHbo06FbvZhO61bzNmtOn4HA71ukaVDSgywlBjBkAR-gXCpMNo1TlYym-eYMUlMkCHVpj7bDRwiHT6jkr7z4dMrApDuTk_HuTrZrkoctKlS7lXjz9a; expires=Mon, 17-Jun-2013 20:07:44 GMT; path=/; domain=.google.com; HttpOnly',
'P3P: CP="This is not a P3P policy! See http://www.google.com/support/accounts/bin/answer.py?hl=en&answer=151657 for more info."',
'Server: gws',
'X-XSS-Protection: 1; mode=block',
'X-Frame-Options: SAMEORIGIN',
'Connection: close']
In [4]: for line in websiteheaders:
   ...:     if 'Server' in line:
   ...:         print line
   ...:
Server: gws
Or, if websiteheaders is a single string:
for single_line in websiteheaders.splitlines():
    if 'Server' in single_line:
        print single_line

JSON string decoding error

I am calling the URL:
http://code.google.com/feeds/issues/p/chromium/issues/full/291?alt=json
using urllib2 and decoding the response using the json module:
import json
import urllib2

url = "http://code.google.com/feeds/issues/p/chromium/issues/full/291?alt=json"
request = urllib2.Request(url)
response = urllib2.urlopen(request)
issue_report = json.loads(response.read())
I run into the following error:
ValueError: Invalid control character at: line 1 column 1120 (char 1120)
I tried checking the headers and I got the following:
Content-Type: application/json; charset=UTF-8
Access-Control-Allow-Origin: *
Expires: Sun, 03 Jul 2011 17:38:38 GMT
Date: Sun, 03 Jul 2011 17:38:38 GMT
Cache-Control: private, max-age=0, must-revalidate, no-transform
Vary: Accept, X-GData-Authorization, GData-Version
GData-Version: 1.0
ETag: W/"CUEGQX47eCl7ImA9WxJaFEw."
Last-Modified: Tue, 04 Aug 2009 19:20:20 GMT
X-Content-Type-Options: nosniff
X-Frame-Options: SAMEORIGIN
X-XSS-Protection: 1; mode=block
Server: GSE
Connection: close
I also tried adding an encoding parameter as follows:
issue_report = json.loads(response.read(), encoding='UTF-8')
I still run into the same error.
The feed has raw data from a JPEG in it at that point; the JSON is malformed, so it's not your fault. Report a bug to Google.
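If you just need to get past the parse error, one possible workaround (untested against this particular feed) is the json module's strict flag, which permits raw control characters inside strings; the binary garbage will then just end up in your decoded values:
# strict=False lets the parser accept raw control characters inside
# strings instead of raising ValueError
issue_report = json.loads(response.read(), strict=False)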
You could consider using lxml instead, since the JSON is malformed. Its XPath support makes working with the XML feed pretty straightforward:
import lxml.etree
url = 'http://code.google.com/feeds/issues/p/chromium/issues/full/291'
doc = lxml.etree.parse(url)
ns = {'issues': 'http://schemas.google.com/projecthosting/issues/2009'}
issues = doc.xpath('//issues:*', namespaces=ns)
It's fairly easy to manipulate the elements, for instance to strip the namespace from the tags and convert everything to a dict:
>>> dict((x.tag[len(ns['issues'])+2:], x.text) for x in issues)
{'closedDate': '2009-08-04T19:20:20.000Z',
 'id': '291',
 'label': 'Area-BrowserUI',
 'stars': '13',
 'state': 'closed',
 'status': 'Verified'}
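The slice in that dict comprehension just strips the '{namespace}' prefix lxml puts on each tag; QName can do the same thing a little more readably, e.g.:
>>> dict((lxml.etree.QName(x).localname, x.text) for x in issues)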
