Flask: flask.request.args.get replacing '+' with space in url - python

I am trying to use a Flask server for an API that takes image URLs through HTTP GET parameters.
I am using an example URL which is very long (on pastebin) and contains many +'s. I have the following route set up in my Flask server:
@webapp.route('/example', methods=['GET'])
def process_example():
    imageurl = flask.request.args.get('imageurl', '')
    url = StringIO.StringIO(urllib.urlopen(imageurl).read())
    ...
but the issue I get is
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Users/aly/anaconda/lib/python2.7/urllib.py", line 87, in urlopen
return opener.open(url)
File "/Users/aly/anaconda/lib/python2.7/urllib.py", line 208, in open
return getattr(self, name)(url)
File "/Users/aly/anaconda/lib/python2.7/urllib.py", line 597, in open_data
data = base64.decodestring(data)
File "/Users/aly/anaconda/lib/python2.7/base64.py", line 321, in decodestring
return binascii.a2b_base64(s)
binascii.Error: Incorrect padding
Upon further inspection (i.e. printing the imageurl that Flask gets), it would appear that the + characters are being replaced by literal spaces, which seems to be screwing things up.
Is there an option for the flask.request.args.get function that can handle this?

You need to encode your query parameters correctly; in URL query parameter encoding, spaces are encoded to +, while + itself is encoded to %2B.
Flask cannot be told to treat specific data differently; you cannot reliably detect which data was correctly encoded and which wasn't. You could, however, extract the parameters from the query string manually by using request.query_string.
The better approach is to escape your parameters correctly (in JavaScript, use encodeURIComponent(), for example). The + character is not the only problematic character in a Base64-encoded value; the format also uses / and =, both of which carry meaning in a URL, which is why there is a URL-safe variant of Base64.
In fact, it is probably the = character at the end of that data: URL that is missing, which is the more direct cause of the Incorrect padding error message. If you added it back, you'd then indeed have problems with all the + characters having been decoded to spaces.
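As an illustration, here is a minimal sketch of building the request with the parameter percent-encoded, assuming a Python 2 client to match the question's environment (the data: URL value is made up):
import urllib
# Hypothetical base64 data: URL; it contains '+', '/' and '=' characters,
# all of which must be percent-encoded before going into a query string.
imageurl = "data:image/png;base64,iVBORw0KGgo+/w=="
# urlencode() percent-encodes the value: '+' becomes %2B, '=' becomes %3D, etc.
query = urllib.urlencode({'imageurl': imageurl})
request_url = "http://localhost:5000/example?" + query
With the value encoded this way, flask.request.args.get('imageurl') returns the original string unchanged.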

Related

Firestore python query throwing ValueError: Path `number-field` not consumed residue: -number

I am playing around with Firestore for the first time and I am using Python 3.9.5.
I have successfully added data to my database in Python using:
addDoc = db.collection('My-Collection').document()
addDoc.set(val)
where val is a JSON object.
However, when I try to query the collection on the game-number field:
docs = db.collection("Keno-Games").where("game-number", "==", 373).stream()
This was copied from the Firestore console query creator, where it does work, but it doesn't work when I try to use it in my Python code. The error is:
Traceback (most recent call last):
File "/Users/timbolton/Desktop/kenoApi/kenoFirestoreIngest.py", line 21, in <module>
docs = db.collection("Keno-Games").where("game-number", "==", 373).stream()
File "/Users/timbolton/Desktop/kenoApi/venv/lib/python3.9/site-packages/google/cloud/firestore_v1/base_collection.py", line 243, in where
return query.where(field_path, op_string, value)
File "/Users/timbolton/Desktop/kenoApi/venv/lib/python3.9/site-packages/google/cloud/firestore_v1/base_query.py", line 278, in where
field_path_module.split_field_path(field_path) # raises
File "/Users/timbolton/Desktop/kenoApi/venv/lib/python3.9/site-packages/google/cloud/firestore_v1/field_path.py", line 84, in split_field_path
for element in _tokenize_field_path(path):
File "/Users/timbolton/Desktop/kenoApi/venv/lib/python3.9/site-packages/google/cloud/firestore_v1/field_path.py", line 64, in _tokenize_field_path
raise ValueError("Path {} not consumed, residue: {}".format(path, path[pos:]))
ValueError: Path game-number not consumed, residue: -number
Fixed:
Field paths cannot contain the - character, as per the documentation's Constraints on Field Paths:
Must separate field names with a single period (.)
May be passed as a string when all field names in the path are simple, otherwise must be passed as a FieldPath object (e.g. JavaScript FieldPath)
A simple field name is one where all of the following are true:
Contains only the characters a-z, A-Z, 0-9, and underscore (_)
Does not start with 0-9
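In code, a minimal sketch of the two workarounds, assuming the Python client's FieldPath helper (from the field_path module seen in the traceback) behaves as the documented constraints describe:
from google.cloud import firestore
from google.cloud.firestore_v1.field_path import FieldPath
db = firestore.Client()
# Option 1: store the value under a simple field name (underscore, no hyphen)
# and query it as a plain string.
docs = db.collection("Keno-Games").where("game_number", "==", 373).stream()
# Option 2: keep the hyphenated name but build a FieldPath; to_api_repr()
# yields the backtick-quoted form `game-number`, which the client can parse.
field = FieldPath("game-number")
docs = db.collection("Keno-Games").where(field.to_api_repr(), "==", 373).stream()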

Python -- get at JSON info that's written like XML

In Python, I usually do simple JSON with this sort of template:
import json
import urllib2
url = "url"
response = urllib2.urlopen(url)
data = response.read()
parsed = json.loads(data)
and then get at the variables with calls like:
parsed['object name']['value name']
But, this works with JSON that's formatted roughly like:
{'object':{'index':'value', 'index':'value'}}
The JSON I just encountered is formatted like:
{'index':'value', 'index':'value'},{'index':'value', 'index':'value'}
so there are no names for me to reference the different blocks. Of course the blocks give different info, but have the same "keys" -- much like XML is usually formatted. Using my method above, how would I parse through this JSON?
The following is not valid JSON:
{'index':'value', 'index':'value'},{'index':'value', 'index':'value'}
whereas
[{'index':'value', 'index':'value'},{'index':'value', 'index':'value'}]
is valid JSON (strictly speaking, JSON also requires double quotes, as in the working example further down).
and the Python traceback shows that:
import json
json_string = "{'index':'value', 'index':'value'},{'index':'value', 'index':'value'}"
parsed_json = json.loads(json_string)
print parsed_json
Traceback (most recent call last):
File "/Users/tron/Desktop/test3.py", line 3, in <module>
parsed_json = json.loads(json_string)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/json/__init__.py", line 338, in loads
return _default_decoder.decode(s)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/json/decoder.py", line 369, in decode
raise ValueError(errmsg("Extra data", s, end, len(s)))
ValueError: Extra data: line 1 column 27 - line 1 column 54 (char 26 - 53)
[Finished in 0.0s with exit code 1]
whereas if you do
json_string = '[{"a":"value", "b":"value"},{"a":"value", "b":"value"}]'
everything works fine.
If that is the case, you can treat the parsed result as an array of JSON objects, where parsed_json[0] is the first object, parsed_json[1] is the second, and so on.
Otherwise, if you think this is going to be an issue that you "just have to deal with", here is one option:
think of the ways the JSON can be malformed and write a simple helper to account for them. For the case above, here is a hacky way to deal with it:
import json
json_string = '{"a":"value", "b":"value"},{"a":"value", "b":"value"}'
def parseJson(string):
    try:
        parsed_json = json.loads(string)
        print parsed_json
        return parsed_json
    except ValueError, e:
        print string, "didn't parse"
        # "Extra data" means valid JSON was followed by more text, so wrap
        # the whole string in brackets and retry it as an array.
        if "Extra data" in str(e.args):
            newString = "[" + string + "]"
            print newString
            return parseJson(newString)
You could add more if/else to deal with various things you run into. I have to admit, this is very hacky and I don't think you can ever account for every possible mutation.
Good luck
The result should be a list of dicts:
[{'index1':'value1', 'index2':'value2'},{'index1':'value1', 'index2':'value2'}]
thus you can reference it using numbers: item[1]['index1']
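For completeness, a minimal illustration (with made-up keys and values) of loading the bracket-wrapped string and indexing into it:
import json
json_string = '[{"index1":"value1", "index2":"value2"},{"index1":"value3", "index2":"value4"}]'
items = json.loads(json_string)   # a list of dicts
print items[1]['index1']          # -> value3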

Python encoding of Latin american characters

I'm trying to allow users to sign up to my service and I'm noticing errors whenever somebody signs up with Latin American characters in their name. I tried reading several SO posts/websites as per below:
Python regex against Latin-1 character encoding?
http://www.w3.org/TR/2009/WD-html5-20090423/infrastructure.html#character-encodings-0
http://docs.python.org/2/library/json.html
https://pypi.python.org/pypi/anyjson
but was still unable to solve it. My code example is as per below:
>>> val = json.dumps({"name":"Déjà"}, encoding="ISO-8859-1")
>>> val
'{"name": "D\\u00c3\\u00a9j\\u00c3\\u00a0"}'
Is there anyway to force the encoding to work in this case for both that and deserializing? Any help is appreciated!
EDIT
The client is Android and iPhone applications. I'm using the following libraries to encode the json on the clients:
http://loopj.com/android-async-http/ (android)
https://github.com/AFNetworking/AFNetworking (ios)
EDIT 2
The same text was received by the server from the Android client as per below:
{"NAME":"D\ufffdj\ufffd"}
I was using anyjson to deserialize that and it said:
File "/usr/local/lib/python2.7/dist-packages/anyjson/__init__.py", line 135, in loads
return implementation.loads(value)
File "/usr/local/lib/python2.7/dist-packages/anyjson/__init__.py", line 99, in loads
return self._decode(s)
File "/usr/local/lib/python2.7/dist-packages/simplejson/__init__.py", line 454, in loads
return _default_decoder.decode(s)
File "/usr/local/lib/python2.7/dist-packages/simplejson/decoder.py", line 374, in decode
obj, end = self.raw_decode(s)
File "/usr/local/lib/python2.7/dist-packages/simplejson/decoder.py", line 393, in raw_decode
return self.scan_once(s, idx=_w(s, idx).end())
ValueError: ('utf8', "D\xe9j\xe0", 1, 2, 'invalid continuation byte')
JSON should almost always be in Unicode (when encoded), and if you're writing a webserver, UTF-8. The following, in Python 3, is basically correct:
In [1]: import json
In [2]: val = json.dumps({"name":"Déjà"})
In [3]: val
Out[3]: '{"name": "D\\u00e9j\\u00e0"}'
A closer look:
'{"name": "D\\u00e9j\\u00e0"}'
^^^^^^^
The text \u00e9, which in JSON means "é".
The backslash is doubled because we're looking at the repr of a str.
You can then send val to the client, and in Javascript, JSON.parse should give you the right result.
Because you mentioned "when somebody signs up": that implies data coming from the client (web browser) to you. How is that data being sent? What library or libraries are you using to write your webserver?
Turns out this was mainly an issue in how I was doing the encoding from the Android side.
I am now setting the StringEntity this way in Android and it's working now:
StringEntity se = new StringEntity(obj.toString(), "UTF-8");
se.setContentType("application/json;charset=UTF-8");
se.setContentEncoding( new BasicHeader(HTTP.CONTENT_TYPE, "application/json"));
Also, I was using anyjson on the server which was using simplejson. This was creating errors at times as well. I switched to using the json library for Python.
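On the server side, a minimal sketch of decoding such a UTF-8 request body with the stdlib json module (the payload bytes here are a hypothetical example of what the fixed client sends):
import json
# Hypothetical UTF-8 encoded body, as sent by the fixed Android client.
raw_body = '{"NAME": "D\xc3\xa9j\xc3\xa0"}'
parsed = json.loads(raw_body)           # stdlib json assumes UTF-8 for str input
print parsed[u'NAME'].encode('utf-8')   # -> Déjà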

Strange failure to make a HIT for Amazon Mechanical Turk with some URLs?

I was trying to include a link in a HIT request for Amazon Mechanical Turk, using boto, and kept getting an error that my XML was invalid. I gradually pared my HTML down to the bare minimum and isolated that some valid links seem to fail for no obvious reason. Can anyone with expertise in boto or AWS help me work out why?
I followed these two guides:
http://www.toforge.com/2011/04/boto-mturk-tutorial-create-hits/
https://gist.github.com/j2labs/740267
Here is my example:
from boto.mturk.connection import MTurkConnection
from boto.mturk.question import QuestionContent,Question,QuestionForm,Overview,AnswerSpecification,SelectionAnswer,FormattedContent,FreeTextAnswer
from config import *
HOST = 'mechanicalturk.sandbox.amazonaws.com'
mtc = MTurkConnection(aws_access_key_id=ACCESS_ID,
                      aws_secret_access_key=SECRET_KEY,
                      host=HOST)
title = 'HIT title'
description = ("HIT description.")
keywords = 'keywords'
s1 = """<![CDATA[<p>Here comes a link <a href='%s'>LINK</a></p>]]>""" % "http://www.example.com"
s2 = """<![CDATA[<p>Here comes a link <a href='%s'>LINK</a></p>]]>""" % "https://www.google.com/search?q=example&site=imghp&tbm=isch"
def makeahit(s):
    overview = Overview()
    overview.append_field('Title', 'HIT title itself')
    overview.append_field('FormattedContent', s)
    qc = QuestionContent()
    qc.append_field('Title', 'The title')
    fta = FreeTextAnswer()
    q = Question(identifier="URL",
                 content=qc,
                 answer_spec=AnswerSpecification(fta))
    question_form = QuestionForm()
    question_form.append(overview)
    question_form.append(q)
    mtc.create_hit(questions=question_form,
                   max_assignments=1,
                   title=title,
                   description=description,
                   keywords=keywords,
                   duration=30,
                   reward=0.05)
makeahit(s1) # SUCCESS!
makeahit(s2) # FAIL?
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "<stdin>", line 25, in makeahit
File "/usr/local/lib/python2.7/dist-packages/boto/mturk/connection.py", line 263, in create_hit
return self._process_request('CreateHIT', params, [('HIT', HIT)])
File "/usr/local/lib/python2.7/dist-packages/boto/mturk/connection.py", line 821, in _process_request
return self._process_response(response, marker_elems)
File "/usr/local/lib/python2.7/dist-packages/boto/mturk/connection.py", line 836, in _process_response
raise MTurkRequestError(response.status, response.reason, body)
boto.mturk.connection.MTurkRequestError: MTurkRequestError: 200 OK
<?xml version="1.0"?>
<CreateHITResponse><OperationRequest><RequestId>19548ab5-034b-49ec-86b2-9e499a3c9a79</RequestId></OperationRequest><HIT><Request><IsValid>False</IsValid><Errors><Error><Code>AWS.MechanicalTurk.XHTMLParseError</Code><Message>There was an error parsing the XHTML data in your request. Please make sure the data is well-formed and validates against the appropriate schema. Details: The reference to entity "site" must end with the ';' delimiter. Invalid content: <FormattedContent><![CDATA[<p>Here comes a link <a href='https://www.google.com/search?q=example&site=imghp&tbm=isch'>LINK</a></p>]]></FormattedContent> (1369323038698 s)</Message></Error></Errors></Request></HIT></CreateHITResponse>
Any idea why s2 fails, but s1 succeeds when both are valid links? Both link contents work:
http://www.example.com
https://www.google.com/search?q=example&site=imghp&tbm=isch
Things with query strings? Https?
UPDATE
I'm going to do some tests, but right now my candidate hypotheses are:
HTTPS doesn't work (so, I'll see if I can get another https link to work)
URLs with params don't work (so, I'll see if I can get another url with params to work)
Google doesn't allow its searches to get posted this way? (if 1 and 2 fail!)
You need to escape ampersands in URLs, i.e. & => &amp;.
At the end of s2, use
q=example&amp;site=imghp&amp;tbm=isch
instead of
q=example&site=imghp&tbm=isch
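A minimal sketch of doing that escaping in Python (xml.sax.saxutils.escape is from the standard library; the rest mirrors how s2 is built in the question):
from xml.sax.saxutils import escape
url = "https://www.google.com/search?q=example&site=imghp&tbm=isch"
# escape() turns & into &amp; so the XHTML inside FormattedContent stays well-formed.
s2 = """<![CDATA[<p>Here comes a link <a href='%s'>LINK</a></p>]]>""" % escape(url)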

Syntax error after uploading GAE Python app

I have created a GAE app that parses RSS feeds using cElementTree. Testing on my local installation of GAE works fine. When I uploaded this app and tried to test it, I got a SyntaxError.
The error is:
Traceback (most recent call last):
  File "/base/python_lib/versions/1/google/appengine/ext/webapp/__init__.py", line 509, in __call__
    handler.post(*groups)
  File "/base/data/home/apps/palmfeedparser/1-6.339910418736930444/pipes.py", line 285, in post
    tree = ET.parse(urlopen(URL))
  File "<string>", line 45, in parse
  File "<string>", line 32, in parse
SyntaxError: no element found: line 14039, column 45
I did what Mr. Alex Martelli suggested and it printed out the following on my local machine:
[
' <ac:tag><![CDATA[Mobilit\xc3\xa4t]]></ac:tag>\n',
' </ac:tags>\n',
' <ac:images>\n',
' <ac:image ac:number="1">\n',
' <ac:asset_url ac:type="app">http://cdn.downloads.example.com/public/1198/de/images/1/A/01.png</ac:asset_url>\n'
]
I uploaded the app and it printed out:
[
' <ac:tag><![CDATA[Mobilit\xc3\xa4t]]></ac:tag>\n',
' </ac:tags>\n',
' <ac:images>\n',
' <ac:image ac:number="1">\n',
' <ac:asset_url ac:type="app">http://cdn.downloads.example.com/public/1198/de/images/1/A/01.png</ac:asset_url>\n'
]
These lines correspond to the following lines in the RSS feed I am reading:
<ac:tags>
<ac:tag><![CDATA[Mobilität]]></ac:tag>
</ac:tags>
<ac:images>
<ac:image ac:number="1">
<ac:asset_url ac:type="app">http://cdn.downloads.example.com/public/1198/de/images/1/A/01.png</ac:asset_url>
I notice that there is a newline before the closing ac:tags. Line 14039 corresponds to this new line.
Update:
I use urllib.urlopen to access the URL of the feed. I displayed the contents it fetches both locally and on GAE proper. Locally, no content is truncated. Testing after uploading the app shows that the feed, which has 15289 lines, is truncated to 14185 lines.
What method can I use to fetch this huge feed? Would urlfetch work?
Thanks in advance for your help!
A_iyer
You may have run into one of the mysterious limits placed on GAE.
urlopen has been overridden by Google to use its urlfetch method, so there shouldn't be any difference between them (though it might be worth trying urlfetch directly; there are a lot of hidden things in GAE).
Newline characters shouldn't affect cElementTree.
Are there any other logging messages coming through in your App Engine logs, relating to the urlopen request?
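If you do want to try urlfetch directly, a minimal sketch might look like this (FEED_URL is a placeholder; allow_truncated and deadline are standard urlfetch.fetch() parameters):
import xml.etree.cElementTree as ET
from google.appengine.api import urlfetch
FEED_URL = "http://example.com/feed.xml"  # placeholder for the real feed URL
# allow_truncated=False makes urlfetch raise instead of silently returning a
# cut-off body; deadline extends the default fetch timeout (in seconds).
result = urlfetch.fetch(FEED_URL, allow_truncated=False, deadline=10)
if result.status_code == 200:
    tree = ET.fromstring(result.content)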
