Syntax error after uploading GAE Python app - python

I have created a GAE app that parses RSS feeds using cElementTree. Testing on my local installation of GAE works fine. When I uploaded this app and tried to test it, I get a SyntaxError.
The error is:
Traceback (most recent call last):
  File "/base/python_lib/versions/1/google/appengine/ext/webapp/__init__.py", line 509, in __call__
    handler.post(*groups)
  File "/base/data/home/apps/palmfeedparser/1-6.339910418736930444/pipes.py", line 285, in post
    tree = ET.parse(urlopen(URL))
  File "<string>", line 45, in parse
  File "<string>", line 32, in parse
SyntaxError: no element found: line 14039, column 45
I did what Alex Martelli suggested, and it printed out the following on my local machine:
[
' <ac:tag><![CDATA[Mobilit\xc3\xa4t]]></ac:tag>\n',
' </ac:tags>\n',
' <ac:images>\n',
' <ac:image ac:number="1">\n',
' <ac:asset_url ac:type="app">http://cdn.downloads.example.com/public/1198/de/images/1/A/01.png</ac:asset_url>\n'
]
I uploaded the app and it printed out:
[
' <ac:tag><![CDATA[Mobilit\xc3\xa4t]]></ac:tag>\n',
' </ac:tags>\n',
' <ac:images>\n',
' <ac:image ac:number="1">\n',
' <ac:asset_url ac:type="app">http://cdn.downloads.example.com/public/1198/de/images/1/A/01.png</ac:asset_url>\n'
]
These lines correspond to the following lines in the RSS feed I am reading:
<ac:tags>
<ac:tag><![CDATA[Mobilität]]></ac:tag>
</ac:tags>
<ac:images>
<ac:image ac:number="1">
<ac:asset_url ac:type="app">http://cdn.downloads.example.com/public/1198/de/images/1/A/01.png</ac:asset_url>
I notice that there is a newline before the closing ac:tags. Line 14039 corresponds to this newline.
Update:
I use urllib.urlopen to access the URL of the feed. I displayed the contents it fetches, both locally and on GAE proper. Locally, no content is truncated. Testing after uploading the app shows that the feed, which has 15289 lines, is truncated to 14185 lines.
What method can I use to fetch this huge feed? Would urlfetch work?
Thanks in advance for your help!
A_iyer

You may have run into one of the mysterious limits placed on GAE.
urlopen has been overridden by Google to use its urlfetch method, so there shouldn't be any difference (though it might be worth trying; there are a lot of hidden things in GAE).
Newline characters shouldn't affect cElementTree.
Are there any other logging messages coming through in your AppEngine Logs? (Relating to the urlopen request?)
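If you want to experiment with the urlfetch API directly rather than going through the overridden urlopen, a minimal sketch (assuming Python 2-era App Engine, and reusing URL from your handler) might look like this; passing allow_truncated=False makes urlfetch raise an exception instead of silently truncating a response that exceeds the service's response-size limit:

from StringIO import StringIO
from google.appengine.api import urlfetch
from xml.etree import cElementTree as ET

# Fetch the feed directly; raise rather than truncate if it is too large.
result = urlfetch.fetch(URL, deadline=10, allow_truncated=False)
if result.status_code == 200:
    # Parse from an in-memory buffer, mirroring the original ET.parse(urlopen(URL)).
    tree = ET.parse(StringIO(result.content))

If the fetch raises a ResponseTooLargeError, that would at least confirm the truncation happens inside the fetch rather than in cElementTree.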

Related

can't get page title from notion using api

I'm using notion.py and I'm new to Python. I want to get a page title from one page and post it to another page, but when I try I get this error:
Traceback (most recent call last):
  File "auto_notion_read.py", line 16, in <module>
    page_read = client.get_block(list_url_read)
  File "/home/lotfi/.local/lib/python3.6/site-packages/notion/client.py", line 169, in get_block
    block = self.get_record_data("block", block_id, force_refresh=force_refresh)
  File "/home/lotfi/.local/lib/python3.6/site-packages/notion/client.py", line 162, in get_record_data
    return self._store.get(table, id, force_refresh=force_refresh)
  File "/home/lotfi/.local/lib/python3.6/site-packages/notion/store.py", line 184, in get
    self.call_load_page_chunk(id)
  File "/home/lotfi/.local/lib/python3.6/site-packages/notion/store.py", line 286, in call_load_page_chunk
    recordmap = self._client.post("loadPageChunk", data).json()["recordMap"]
  File "/home/lotfi/.local/lib/python3.6/site-packages/notion/client.py", line 262, in post
    "message", "There was an error (400) submitting the request."
requests.exceptions.HTTPError: Invalid input.
The code I'm using is:
from notion.client import NotionClient
import time

token_v2 = "my page token"
client = NotionClient(token_v2=token_v2)
list_url_read = 'the url of the page to read'
page_read = client.get_block(list_url_read)
list_url_post = 'the url of the page to post to'
page_post = client.get_block(list_url_post)
print(page_read.title)
It isn't recommended to edit the source code of dependencies, as you will almost certainly cause a conflict when updating them in the future.
Fix PR 294 has been open since the 6th of March 2021 and has not been merged.
To fix this issue with the currently open PR (pull request) on GitHub, do the following:
pip uninstall notion
Then either:
pip install git+https://github.com/jamalex/notion-py.git#refs/pull/294/merge
OR in your requirements.txt add
git+https://github.com/jamalex/notion-py.git#refs/pull/294/merge
Source for the PR 294 fix
You can find the fix in PR 294 on the notion-py GitHub repository (jamalex/notion-py).
In a nutshell you need to modify two files in the library itself:
store.py
client.py
Find the "limit" value and change it to 100 in both files.
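Either way (patched install or manual edit), the original snippet should then run unchanged; as a quick sanity check, assuming a valid token_v2 and page URL:

from notion.client import NotionClient

client = NotionClient(token_v2="<your token_v2>")
page = client.get_block("<url of the page to read>")
print(page.title)  # should print the title instead of raising HTTPError: Invalid input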

How to download a dataset from The Humanitarian Data Exchange (hdx api python)

I don't quite understand how I can download data from a dataset. I only download one file, and there are several of them. How can I solve this problem?
I am using hdx api library. There is a small example in the documentation. A list is returned to me and I use the download method. But only the first file from the list is downloaded, not all of them.
My code
from hdx.hdx_configuration import Configuration
from hdx.data.dataset import Dataset
Configuration.create(hdx_site='prod', user_agent='A_Quick_Example', hdx_read_only=True)
dataset = Dataset.read_from_hdx('novel-coronavirus-2019-ncov-cases')
resources = dataset.get_resources()
print(resources)
url, path = resources[0].download()
print('Resource URL %s downloaded to %s' % (url, path))
I tried to use different methods, but only this one turned out to be working, it seems some kind of error in the loop, but I do not understand how to solve it.
Result
Resource URL https://data.humdata.org/hxlproxy/api/data-preview.csv?url=https%3A%2F%2Fraw.githubusercontent.com%2FCSSEGISandData%2FCOVID-19%2Fmaster%2Fcsse_covid_19_data%2Fcsse_covid_19_time_series%2Ftime_series_covid19_confirmed_global.csv&filename=time_series_covid19_confirmed_global.csv downloaded to C:\Users\tred1\AppData\Local\Temp\time_series_covid19_confirmed_global.csv.CSV
I forgot to add that I get a list of strings containing a download url value. The problem is probably in the loop.
When I use a for-loop I get this:
for res in resources:
    print(res)
    res[0].download()
Traceback (most recent call last):
  File "C:/Users/tred1/PycharmProjects/pythonProject2/HDXapi.py", line 31, in <module>
    main()
  File "C:/Users/tred1/PycharmProjects/pythonProject2/HDXapi.py", line 21, in main
    res[0].download()
  File "C:\Users\tred1\AppData\Local\Programs\Python\Python38\lib\collections\__init__.py", line 1010, in __getitem__
    raise KeyError(key)
KeyError: 0
You can get the download link as follows:
dataset = Dataset.read_from_hdx('acled-conflict-data-for-africa-1997-lastyear')
lista_resources = dataset.get_resources()
dictio = lista_resources[1]
url = dictio['download_url']
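If the goal is to download every file rather than just one, iterate over the resource objects themselves and call download() on each; the KeyError: 0 in the question came from indexing into a single resource with res[0]. A sketch reusing the question's own calls:

from hdx.hdx_configuration import Configuration
from hdx.data.dataset import Dataset

Configuration.create(hdx_site='prod', user_agent='A_Quick_Example', hdx_read_only=True)
dataset = Dataset.read_from_hdx('novel-coronavirus-2019-ncov-cases')

for resource in dataset.get_resources():
    # Each resource downloads itself; no indexing into the loop variable.
    url, path = resource.download()
    print('Resource URL %s downloaded to %s' % (url, path))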

Flask: flask.request.args.get replacing '+' with space in url

I am trying to use a Flask server for an API that takes image URLs through HTTP GET parameters.
I am using an example URL (on pastebin) which is very long and contains many +'s. I have the following route set up in my Flask server:
import StringIO
import urllib

import flask

@webapp.route('/example', methods=['GET'])
def process_example():
    imageurl = flask.request.args.get('imageurl', '')
    url = StringIO.StringIO(urllib.urlopen(imageurl).read())
    ...
but the issue I get is
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/aly/anaconda/lib/python2.7/urllib.py", line 87, in urlopen
    return opener.open(url)
  File "/Users/aly/anaconda/lib/python2.7/urllib.py", line 208, in open
    return getattr(self, name)(url)
  File "/Users/aly/anaconda/lib/python2.7/urllib.py", line 597, in open_data
    data = base64.decodestring(data)
  File "/Users/aly/anaconda/lib/python2.7/base64.py", line 321, in decodestring
    return binascii.a2b_base64(s)
binascii.Error: Incorrect padding
Upon further inspection (i.e. printing the imageurl that flask gets) it would appear that the + characters are being replaced by literal spaces which seems to be screwing things up.
Is there an option for the flask.args.get function that can handle this?
You need to encode your query parameters correctly; in URL query parameter encoding, spaces are encoded to +, while + itself is encoded to %2B.
Flask cannot be told to treat specific data differently; you cannot reliably detect which data was correctly encoded and which wasn't. You could, however, extract the parameter from the raw query string manually by using request.query_string.
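A hedged sketch of that manual route (the route name and parsing below are illustrative, not part of Flask's API): read the raw query string and decode the value with urllib.unquote, which decodes %XX escapes but, unlike unquote_plus, leaves + characters alone:

import urllib
import flask

@webapp.route('/example-raw', methods=['GET'])
def process_example_raw():
    raw = ''
    for pair in flask.request.query_string.split('&'):
        if pair.startswith('imageurl='):
            raw = pair[len('imageurl='):]
            break
    # unquote (not unquote_plus) keeps literal '+' characters intact
    imageurl = urllib.unquote(raw)
    ...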
The better approach is to escape your parameters correctly (in JavaScript, use encodeURIComponent(), for example). The + character is not the only problematic character in a Base64-encoded value; the format also uses / and =, both of which carry meaning in a URL, which is why there is a URL-safe variant.
In fact, it is probably the = character at the end of that data: URL that is missing, which is the more direct cause of the Incorrect padding error message. If you added it back, you'd next run into problems with all the + characters having been decoded to ' '.
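For example, if the request URL is being built in Python, urllib.urlencode percent-encodes the value for you (a sketch; the data: URL value is hypothetical):

import urllib

imageurl = 'data:image/png;base64,iVBORw0+KGgo='  # hypothetical value containing '+' and '='
query = urllib.urlencode({'imageurl': imageurl})  # '+' -> '%2B', '=' -> '%3D'
request_url = 'http://localhost:5000/example?' + query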

django test file download - "ValueError: I/O operation on closed file"

I have code for a view which serves a file download, and it works fine in the browser. Now I am trying to write a test for it, using the internal django Client.get:
response = self.client.get("/compile-book/", {'id': book.id})
self.assertEqual(response.status_code, 200)
self.assertEquals(response.get('Content-Disposition'),
                  "attachment; filename=book.zip")
so far so good. Now I would like to test if the downloaded file is the one I expect it to download. So I start by saying:
f = cStringIO.StringIO(response.content)
Now my test runner responds with:
Traceback (most recent call last):
  File ".../tests.py", line 154, in test_download
    f = cStringIO.StringIO(response.content)
  File "/home/epub/projects/epub-env/lib/python2.7/site-packages/django/http/response.py", line 282, in content
    self._consume_content()
  File "/home/epub/projects/epub-env/lib/python2.7/site-packages/django/http/response.py", line 278, in _consume_content
    self.content = b''.join(self.make_bytes(e) for e in self._container)
  File "/home/epub/projects/epub-env/lib/python2.7/site-packages/django/http/response.py", line 278, in <genexpr>
    self.content = b''.join(self.make_bytes(e) for e in self._container)
  File "/usr/lib/python2.7/wsgiref/util.py", line 30, in next
    data = self.filelike.read(self.blksize)
ValueError: I/O operation on closed file
Even when I do simply: self.assertIsNotNone(response.content) I get the same ValueError
The only topic on the entire internet (including django docs) I could find about testing downloads was this stackoverflow topic: Django Unit Test for testing a file download. Trying that solution led to these results. It is old and rare enough for me to open a new question.
Anybody knows how the testing of downloads is supposed to be handled in Django? (btw, running django 1.5 on python 2.7)
This works for us. We return rest_framework.response.Response but it should work with regular Django responses, as well.
import io
response = self.client.get(download_url, {'id': archive_id})
downloaded_file = io.BytesIO(b"".join(response.streaming_content))
Note:
streaming_content is only available on StreamingHttpResponse (docs for Django 1.10):
https://docs.djangoproject.com/en/1.10/ref/request-response/#django.http.StreamingHttpResponse.streaming_content
I had some file download code and a corresponding test that worked with Django 1.4. The test failed when I upgraded to Django 1.5 (with the same ValueError: I/O operation on closed file error that you encountered).
I fixed it by changing my non-test code to use a StreamingHttpResponse instead of a standard HttpResponse. My test code used response.content, so I first migrated to CompatibleStreamingHttpResponse, then changed my test code to use response.streaming_content instead, which allowed me to drop CompatibleStreamingHttpResponse in favour of StreamingHttpResponse.
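For reference, the view-side change amounts to returning a StreamingHttpResponse; a hypothetical sketch (the view name and file path are illustrative, not the question's actual code):

from django.http import StreamingHttpResponse

def compile_book(request):
    # Locate or build the zip for the requested book id, then stream it.
    zip_file = open('/path/to/book.zip', 'rb')  # hypothetical path
    response = StreamingHttpResponse(zip_file, content_type='application/zip')
    response['Content-Disposition'] = 'attachment; filename=book.zip'
    return response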

Strange failure to make a HIT for Amazon Mechanical Turk with some URLs?

I was trying to include a link in a HIT request on Amazon Mechanical Turk, using boto, and kept getting an error that my XML was invalid. I gradually pared my HTML down to the bare minimum, and isolated the problem: some valid links seem to fail for no reason. Can anyone with expertise in boto or AWS help me figure out why?
I followed these two guides:
http://www.toforge.com/2011/04/boto-mturk-tutorial-create-hits/
https://gist.github.com/j2labs/740267
Here is my example:
from boto.mturk.connection import MTurkConnection
from boto.mturk.question import QuestionContent,Question,QuestionForm,Overview,AnswerSpecification,SelectionAnswer,FormattedContent,FreeTextAnswer
from config import *
HOST = 'mechanicalturk.sandbox.amazonaws.com'
mtc = MTurkConnection(aws_access_key_id=ACCESS_ID,
                      aws_secret_access_key=SECRET_KEY,
                      host=HOST)
title = 'HIT title'
description = ("HIT description.")
keywords = 'keywords'
s1 = """<![CDATA[<p>Here comes a link <a href='%s'>LINK</a></p>]]>""" % "http://www.example.com"
s2 = """<![CDATA[<p>Here comes a link <a href='%s'>LINK</a></p>]]>""" % "https://www.google.com/search?q=example&site=imghp&tbm=isch"
def makeahit(s):
    overview = Overview()
    overview.append_field('Title', 'HIT title itself')
    overview.append_field('FormattedContent', s)
    qc = QuestionContent()
    qc.append_field('Title', 'The title')
    fta = FreeTextAnswer()
    q = Question(identifier="URL",
                 content=qc,
                 answer_spec=AnswerSpecification(fta))
    question_form = QuestionForm()
    question_form.append(overview)
    question_form.append(q)
    mtc.create_hit(questions=question_form,
                   max_assignments=1,
                   title=title,
                   description=description,
                   keywords=keywords,
                   duration=30,
                   reward=0.05)
makeahit(s1) # SUCCESS!
makeahit(s2) # FAIL?
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<stdin>", line 25, in makeahit
  File "/usr/local/lib/python2.7/dist-packages/boto/mturk/connection.py", line 263, in create_hit
    return self._process_request('CreateHIT', params, [('HIT', HIT)])
  File "/usr/local/lib/python2.7/dist-packages/boto/mturk/connection.py", line 821, in _process_request
    return self._process_response(response, marker_elems)
  File "/usr/local/lib/python2.7/dist-packages/boto/mturk/connection.py", line 836, in _process_response
    raise MTurkRequestError(response.status, response.reason, body)
boto.mturk.connection.MTurkRequestError: MTurkRequestError: 200 OK
<?xml version="1.0"?>
<CreateHITResponse><OperationRequest><RequestId>19548ab5-034b-49ec-86b2-9e499a3c9a79</RequestId></OperationRequest><HIT><Request><IsValid>False</IsValid><Errors><Error><Code>AWS.MechanicalTurk.XHTMLParseError</Code><Message>There was an error parsing the XHTML data in your request. Please make sure the data is well-formed and validates against the appropriate schema. Details: The reference to entity "site" must end with the ';' delimiter. Invalid content: <FormattedContent><![CDATA[<p>Here comes a link <a href='https://www.google.com/search?q=example&site=imghp&tbm=isch'>LINK</a></p>]]></FormattedContent> (1369323038698 s)</Message></Error></Errors></Request></HIT></CreateHITResponse>
Any idea why s2 fails, but s1 succeeds when both are valid links? Both link contents work:
http://www.example.com
https://www.google.com/search?q=example&site=imghp&tbm=isch
Things with query strings? Https?
UPDATE
I'm going to do some tests, but right now my candidate hypotheses are:
HTTPS doesn't work (so, I'll see if I can get another https link to work)
URLs with params don't work (so, I'll see if I can get another url with params to work)
Google doesn't allow its searches to get posted this way? (if 1 and 2 fail!)
You need to escape ampersands in URLs, i.e. & => &amp;.
At the end of s2, use
q=example&amp;site=imghp&amp;tbm=isch
instead of
q=example&site=imghp&tbm=isch
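In Python you can let the standard library do the escaping rather than editing the string by hand; a sketch using xml.sax.saxutils.escape, which rewrites & to &amp; (and < and > likewise):

from xml.sax.saxutils import escape

raw_url = "https://www.google.com/search?q=example&site=imghp&tbm=isch"
s2 = """<![CDATA[<p>Here comes a link <a href='%s'>LINK</a></p>]]>""" % escape(raw_url)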
