I am building a REST API on Google App Engine (not using Endpoints) that will allow users to upload a CSV or tab-delimited file and search for potential duplicates. Since it's an API, I cannot use <form>s or the BlobStore's upload_url. I also cannot rely on having a single web client that will call this API. Instead, ideally, users will send the file in the body of the request.
My problem is, when I try to read the content of a tab-delimited file, I find that all newline characters have been removed, so there is no way of splitting the content into rows.
If I check the content of the file directly in the Python interpreter, I can see that the tabs and newlines are there (output is truncated in the example):
>>> with open('./data/occ_sample.txt') as o:
...     o.read()
...
'id\ttype\tmodified\tlanguage\trights\n123456\tPhysicalObject\t2015-11-11 11:50:59.0\ten\thttp://creativecommons.org/licenses/by-nc/3.0\n...'
The RequestHandler logs the content of the request body:
import logging
import webapp2

class ReportApi(webapp2.RequestHandler):
    def post(self):
        logging.info(self.request.body)
        ...
So when I call the API running in the dev_appserver via curl
curl -X POST -d @data/occ_sample.txt http://localhost:8080/api/v0/report
This shows up in the logs:
id type modified language rights123456 PhysicalObject 2015-11-11 11:50:59.0 en http://creativecommons.org/licenses/by-nc/3.0
As you can see, there is nothing between the last value of the headers and the first record (rights and 123456 respectively) and the same happens with the last value of each record and the first one of the next.
Am I missing something obvious here? I have tried loading the data with self.request.body, self.request.body_file and self.request.POST, and none seem to work. I also tried applying the Content-Type values text/csv, text/plain, application/csv in the request headers, with no success. Should I add a different Content-Type?
You are using the wrong curl command-line option to send your file data, and it is this option that is stripping the newlines.
The -d option parses out your data, sends an application/x-www-form-urlencoded request, and strips newlines. From the curl manpage:
-d, --data <data>
[...]
If you start the data with the letter @, the rest should be a file name to read the data from, or - if you want curl to read the data from stdin. Multiple files can also be specified. Posting data from a file named 'foobar' would thus be done with --data @foobar. When --data is told to read from a file like that, carriage returns and newlines will be stripped out.
Bold emphasis mine.
Use the --data-binary option instead:
--data-binary <data>
(HTTP) This posts data exactly as specified with no extra processing whatsoever.
If you start the data with the letter @, the rest should be a filename. Data is posted in a similar manner as --data-ascii does, except that newlines and carriage returns are preserved and conversions are never done.
You may want to include a Content-Type header in that case; whether that matters depends, of course, on how your handler treats that header.
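Once the newlines survive the upload, splitting the body into rows is straightforward. A minimal sketch, using the (truncated) sample data from the question as a stand-in for self.request.body:

```python
import csv
import io

# Body as it should arrive when sent with --data-binary (newlines intact);
# sample data taken from the question, truncated
body = ('id\ttype\tmodified\tlanguage\trights\n'
        '123456\tPhysicalObject\t2015-11-11 11:50:59.0\ten\t'
        'http://creativecommons.org/licenses/by-nc/3.0\n')

rows = list(csv.reader(io.StringIO(body), delimiter='\t'))
print(rows[0])  # header row: ['id', 'type', 'modified', 'language', 'rights']
```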
I am simply trying to grab the program header information with pyelftools (the offset, virtual address, and physical address).
This can be done from the terminal by running:
readelf -l <elf_file>
But I am having trouble getting the same information from pyelftools. From the examples, I have pieced together this:
elffile = ELFFile(stream)
section_header = elffile.structs.Elf_Shdr.parse_stream(stream)
print (section_header)
Note: Elf_Shdr is the program header file.
This will print the offset, virtual address, physical address, etc. But not in the hexadecimal format I want, or like how readelf prints it. Is there a way to get it to print out the hex format like readelf?
There are some strange things happening in your post.
You say:
I am simply trying to grab the program header information with
pyelftools (the offset, virtual address, and physical address).
Note: Elf_Shdr is the program header file.
But Elf_Shdr is not a program header; it is a section header. Look at elftools/elf/structs.py:
Elf_Shdr:
    Section header
Then, in your code, you parse the file twice for some reason. The first line is enough to parse it; you can access all the header data from the elffile object:
for segment in elffile.iter_segments():
    header = segment.header
    for key, value in header.items():
        if isinstance(value, int):
            print(key, hex(value))
Here I iterate over all segments (they are described by the program headers) of the ELF file, then iterate over all attributes in each header and print a value as hex if it is an integer. No magic here: header is just a Container, which behaves like a standard dictionary.
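Since the header behaves like a dict, the same loop can be exercised without an ELF file at hand. A stand-in sketch (the field values below are made up for illustration):

```python
# Stand-in for segment.header (a pyelftools Container behaves like a dict);
# the values here are hypothetical
header = {
    'p_type': 'PT_LOAD',   # non-integer fields are skipped
    'p_offset': 64,
    'p_vaddr': 0x400040,
    'p_paddr': 0x400040,
}

for key, value in header.items():
    if isinstance(value, int):
        print(key, hex(value))
```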
Also you may be interested in readelf implemented with pyelftools - here.
I'm trying to upload a PDF as an attachment to a Trello card using python-requests. I've been unable to get the request in the function below to return anything other than 400: Error parsing body despite significant tweaks (detailed below).
I should note that I'm able to create cards and add URL attachments to them (neither of which require a file upload) without any problems.
Here's the code that handles the POST of the file:
def post_pdf(session, design, card_id):
    attachment = {
        "name": design["campaign_title"] + " - Combined PDF",
        "mimeType": "application/pdf"
    }
    pdf_post = session.post(
        url="https://api.trello.com/1/cards/" + card_id + "/attachments",
        files={"file": open("combined_pdf.pdf", "rb")},
        data=attachment
    )
The authentication key and token were set as Session params when the session was created, so they're not added here.
Also, in the actual code, the POST is handled by a wrapper function that adds some boilerplate error-checking and rate limiting to the request, as well as more-verbose error dumps when a request fails, but I've confirmed (in the above example) that the same error persists without the wrapper.
Adjustments I've tried
Substituting data = attachment with json = attachment
Substituting data = attachment with params = attachment
Omitting attachment completely and POSTing the file with no associated data
Adding stream = True to the request parameters (this doesn't seem to matter for uploads, but I figured it couldn't hurt to try)
Encoding the file as base64 (this encoding has been required elsewhere; I was grasping at straws)
Encoding the file as base64, combined with the above tweaks to data / json / params
Note: The PDF file is potentially a source of the problem - it's generated by converting several images to PDF format and then concatenating them with pdfunite, so I could well have made mistakes in its creation that are causing Trello to reject the file. What seems to confirm this is that Googling for Trello "Error parsing body" returns two hits, only one of which deals with Trello, and neither of which are useful. This leads me to think that this is a particularly odd / rare error message, which means to me that I've made some kind of serious error encoding the file.
However, the PDF file opens properly on my (and my coworkers') systems without any error messages, artifacts, or other strange behavior. More importantly, trying this with other "known good" PDFs also fails, with the same error code. Because the file's contents fall within the bounds of "company property / information", I'd like to avoid posting it (and / or the raw request body), but I'll do so if there's agreement that it's causing the problem.
I found the solution: the Content-Type header was set incorrectly, because a session-wide setting (Session.headers.update({"Content-Type": "application/json"})) overrode the multipart/form-data header when the upload request was sent. This caused Trello to reject the body. I solved the problem by removing the session-level header, which allows requests to set the content type for each request.
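A minimal sketch of the fix, assuming the session was configured as described; once the session-level header is gone, requests sets multipart/form-data (with the correct boundary) itself on file uploads:

```python
import requests

session = requests.Session()
# The offending session-wide header from the original code:
session.headers.update({"Content-Type": "application/json"})

# Drop the session-level Content-Type so requests can choose the
# right content type (multipart/form-data plus boundary) per request
del session.headers["Content-Type"]
```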
I just need to write a simple python CGI script to parse the contents of a POST request containing JSON. This is only test code so that I can test a client application until the actual server is ready (written by someone else).
I can read the cgi.FieldStorage() and dump the keys() but the request body containing the JSON is nowhere to be found.
I can also dump the os.environ() which provides lots of info except that I do not see a variable containing the request body.
Any input appreciated.
Chris
If you're using CGI, just read data from stdin:
import sys
data = sys.stdin.read()
Note that if you call cgi.FieldStorage() earlier in your code, you can't get the body data from stdin, because it can only be read once.
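A minimal sketch of that pattern for a test script; the JSON payload below is made up, and stdin is simulated with an in-memory stream:

```python
import io
import json
import sys

# Simulate the request body arriving on stdin (hypothetical payload)
sys.stdin = io.StringIO('{"name": "widget", "count": 3}')

raw_body = sys.stdin.read()   # read once -- a second read() returns ''
payload = json.loads(raw_body)
print(payload["count"])
```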
I am looking to download a file from a http url to a local file. The file is large enough that I want to download it and save it chunks rather than read() and write() the whole file as a single giant string.
The interface of urllib.urlretrieve is essentially what I want. However, I cannot see a way to set request headers when downloading via urllib.urlretrieve, which is something I need to do.
If I use urllib2, I can set request headers via its Request object. However, I don't see an API in urllib2 to download a file directly to a path on disk like urlretrieve. It seems that instead I will have to use a loop to iterate over the returned data in chunks, writing them to a file myself and checking when we are done.
What would be the best way to build a function that works like urllib.urlretrieve but allows request headers to be passed in?
What is the harm in writing your own function using urllib2?
import urllib2

def urlretrieve(urlfile, fpath):
    chunk = 4096
    f = open(fpath, "wb")
    while 1:
        data = urlfile.read(chunk)
        if not data:
            print "done."
            break
        f.write(data)
        print "Read %s bytes" % len(data)
    f.close()
and use a Request object to set the headers:
request = urllib2.Request("http://www.google.com")
request.add_header('User-agent', 'Chrome XXX')
urlretrieve(urllib2.urlopen(request), "/tmp/del.html")
If you want to use urllib and urlretrieve, subclass urllib.URLopener and use its addheader() method to adjust the headers (e.g. addheader('Accept', 'sound/basic'), which I'm pulling from the docstring for urllib.addheader).
To install your URLopener for use by urllib, see the example in the urllib._urlopener section of the docs (note the underscore):
import urllib

class MyURLopener(urllib.URLopener):
    pass  # your override here, perhaps to __init__

urllib._urlopener = MyURLopener
However, you'll be pleased to hear (regarding your comment on the question) that reading an empty string from read() is indeed the signal to stop. This is how urlretrieve decides when to stop, for example. TCP/IP and sockets abstract the reading process: read() blocks waiting for additional data until the connection on the other end reaches EOF and is closed, in which case read()ing from the connection returns an empty string. An empty string means no more data is trickling in, and you don't have to worry about ordered packet re-assembly, as that has all been handled for you. If that's your concern about urllib2, I think you can safely use it.
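The empty-read stop condition can be exercised without a network at all, using in-memory streams in place of the urllib2 response and the output file:

```python
import io

def copy_chunks(urlfile, outfile, chunk=4096):
    # Same loop as the download function above: read() returning an
    # empty string/bytes is the EOF signal
    while True:
        data = urlfile.read(chunk)
        if not data:
            break
        outfile.write(data)

src = io.BytesIO(b'x' * 10000)  # stands in for urllib2.urlopen(...)
dst = io.BytesIO()
copy_chunks(src, dst)
```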
We're receiving some POST data of xml + arbitrary binary files (like images and audio) from a device that only gives us multipart/mixed encoding.
I've setup a cherrypy upload/POST handler for our receiver end. I've managed to allow it to do arbitrary number of parameters using multipart/form-data. However when we try to send the multipart-mixed data, we're not getting any processing.
@cherrypy.expose
def upload(self, *args, **kwargs):
    """upload adapted from cherrypy tutorials

    We use our variation of cgi.FieldStorage to parse the MIME
    encoded HTML form data containing the file."""
    print args
    print kwargs
    cherrypy.response.timeout = 1300
    lcHDRS = {}
    for key, val in cherrypy.request.headers.iteritems():
        lcHDRS[key.lower()] = val
    incomingBytes = int(lcHDRS['content-length'])
    print cherrypy.request.rfile
    #etc..etc...
So, when submitting multipart/form-data, args and kwargs are well defined: args holds the form field names, and kwargs is a hash of variables and values.
When I submit multipart/mixed, args and kwargs are empty, and I just have cherrypy.request.rfile as the raw POST information.
My question is, does cherrypy have a built in handler to handle multipart/mixed and chunked encoding for POST? Or will I need to override the cherrypy.tools.process_request_body and roll my own decoder?
It seems like the builtin wsgi server with cherrypy handles this as part of the HTTP/1.1 spec, but I could not seem to find documentation in cherrypy in accessing this functionality.
...to clarify
I'm using latest version 3.1.1 or so of Cherrypy.
Making a default form just involves making parameters in the upload function.
For the multipart/form-data, I've been calling curl -F param1=@file1.jpg -F param2=sometext -F param3=@file3.wav http://destination:port/upload
In that example, I get:
args = ['param1', 'param2', 'param3']
kwargs = {'param1':CString<>, 'param2': 'sometext', 'param3':CString<>}
When trying to submit the multipart/mixed, I tried looking at the request.body, but kept on getting None for that, regardless of setting the body processing.
The input we're getting is coming in as this:
user-agent:UNTRUSTED/1.0 Profile/MIDP-2.0 Configuration/CLDC-1.1
content-language:en-US
content-length:565719
mime-version:1.0
content-type:multipart/mixed; boundary='newdivider'
host:192.168.1.1:8180
transfer-encoding:chunked
--newdivider
Content-type: text/xml
<?xml version='1.0' ?><data><Stuff>....
etc...etc...
--newdivider
Content-type: image/jpeg
Content-ID: file://localhost/root1/photos/Garden.jpg
Content-transfer-encoding: binary
<binary data>
I've got a sneaking suspicion that multipart/mixed is the reason cherrypy is giving me just the rfile. Our goal is to have cherrypy process the body into its parts with minimal processing on the receiving side (i.e., let cherrypy do its magic). If that requires us to be stricter about the sending format, using a content-type that cherrypy likes, then so be it. What are the accepted formats? Is it only multipart/form-data?
My bad. Whenever the Content-Type is of type "multipart/*", then CP tries to stick the contents into request.params (if any other Content-Type, it goes into request.body).
Unfortunately, CP has assumed that any multipart message is form-data, and made no provision for other subtypes. I've just fixed this in trunk, and it should be released in 3.1.2. Sorry for the inconvenience. In the short term, you can try applying the changeset locally; see http://www.cherrypy.org/ticket/890.
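In the short term, the raw body can also be split with Python's stdlib email parser, which handles multipart/mixed directly. A minimal sketch with a made-up two-part message modeled on the input above (note that any chunked transfer-encoding would still have to be undone before parsing):

```python
import email

# Hypothetical multipart/mixed body, modeled on the question's input
raw = (b"Content-Type: multipart/mixed; boundary=newdivider\r\n"
       b"MIME-Version: 1.0\r\n"
       b"\r\n"
       b"--newdivider\r\n"
       b"Content-Type: text/xml\r\n"
       b"\r\n"
       b"<?xml version='1.0' ?><data><Stuff>...</Stuff></data>\r\n"
       b"--newdivider\r\n"
       b"Content-Type: image/jpeg\r\n"
       b"Content-Transfer-Encoding: binary\r\n"
       b"\r\n"
       b"\xff\xd8fake-jpeg-bytes\r\n"
       b"--newdivider--\r\n")

msg = email.message_from_bytes(raw)
# walk() yields the multipart container plus each leaf part
parts = [p for p in msg.walk() if not p.is_multipart()]
for part in parts:
    print(part.get_content_type(), len(part.get_payload(decode=True)))
```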