I may be poor at googling, but so far I have come up dry. Is there no such thing as a universal decoder for HTTP responses, where you give it the body and the headers, and it returns the decoded data?
For example:
response = requests.get("...")
body = clever_package.decode(response.content, response.headers)
This is using the requests package to get the data, though this isn't strictly necessary. Is there no universal decoder which takes the contentType and isBase64Encoded headers and works its magic?
Perhaps I'm not seeing an obvious flaw in such a package, which explains why I can't find it anywhere.
Cheers!
What do you mean by decoding? Just bytes to string data? In that case, python-chardet would be what you are looking for in cases where the headers don't specify the encoding (if they do, just decode using the encoding named in the header).
If you want to parse XML, JSON, ... in different ways, you'd probably use the respective libraries (the built-in json module, the yaml module, etc.) after having decoded the data into a unicode string.
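A minimal sketch of that bytes-to-string step, assuming the chardet package is installed and that headers is a plain dict of response headers (decode_body is a made-up name for illustration):

import chardet

def decode_body(body, headers):
    # Trust the declared charset if the server sent one...
    content_type = headers.get("Content-Type", "")
    if "charset=" in content_type:
        charset = content_type.split("charset=")[-1].split(";")[0].strip()
    else:
        # ...otherwise fall back to chardet's statistical guess.
        charset = chardet.detect(body)["encoding"] or "utf-8"
    return body.decode(charset, "replace")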
Related
I need to send an encoded and decoded image along with some metadata via HTTP.
I would like to send the images as binary data instead of encoding them as base64 as encoding & decoding adds unnecessary latency.
So for example, the encoded image may look like this:
img = open(img_file, 'rb').read()
and the decoded image may look like this:
img = cv2.imread(img_file)
Assume I also need to send some additional information in POST request, such as the image name for example.
What is the most efficient way to send these? What would the code look like in Python? What content-type or other headers would I need to use?
I've found some examples like this online, but they only send a single image and therefore set the content-type as image/jpeg, but I'm wondering what happens when you have additional fields to send.
If you want to send additional fields you have a few options:
Base64 encode the image data and embed it in a json string with all your extra data
Add custom HTTP headers with your fields in
Add your fields to the image metadata itself
I know you said you didn't want to do option 1, but how do you know it adds unnecessary latency if you've never tried it? I expect it's far less than the latency of the HTTP request itself. Option 2 is risky because headers can get stripped or changed by network infrastructure, and your users might not expect to find data in the headers. Option 3 depends a bit on what the data is and whether it makes sense for it to live inside the image (and, again, whether your users know to look for it there).
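For reference, a minimal sketch of option 1, assuming the requests library; the endpoint URL, filename, and field names are placeholders:

import base64
import requests

with open("photo.jpg", "rb") as f:
    img_bytes = f.read()

payload = {
    "name": "photo.jpg",  # extra metadata travels alongside the image
    "image": base64.b64encode(img_bytes).decode("ascii"),  # binary -> base64 text
}
# requests serializes the dict to JSON and sets Content-Type: application/json.
resp = requests.post("http://example.com/upload", json=payload)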
What I'm doing is:
via JavaScript, reading the DOM of the webpage
converting it to a JSON string
sending it to Python via AJAX
in Python, JSON-decoding the string into an object
What I want is for any text that is part of the JSON to be in unicode, to avoid any character issues. I used to use BeautifulSoup for this:
from bs4.dammit import UnicodeDammit
text_unicode = UnicodeDammit(text, [None, None], "html", True).unicode_markup
But that doesn't work with the json string. Running the string through UnicodeDammit causes an error when I try to json decode it.
The thing is, I'm not even sure that collecting the DOM doesn't handle this issue automatically.
For starters, I would therefore like a series of test webpages to test this: one encoded with utf-8, another with something else, etc., using characters that will look wrong if, for example, you assume utf-8 when it's not. Note that I don't even bother considering the webpage's stated encoding; that is too often wrong.
You are trying to solve a problem that does not exist.
The browser is responsible for detecting and handling the web page encoding. It'll determine the correct encoding based on the server headers, meta tags in the HTML page and plain guessing if needed. The DOM gives you Unicode data.
JSON handles Unicode data; sending JSON data to your Python process sends appropriately encoded byte data that any decent JSON library will turn back into Unicode values for you. The Python json module is such a library.
Just load the data from your JavaScript script with the json.load() or json.loads() functions as is. Your browser will already have used the correct encoding (most likely UTF-8), and the Python json module will decode any of the standard encodings used without additional configuration or handling.
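A framework-agnostic sketch of the Python side; raw_body stands in for the bytes read from the AJAX POST body (how you obtain them depends on your framework, so the variable is an assumption):

import json

raw_body = b'{"title": "caf\\u00e9"}'  # example JSON bytes; non-ASCII may arrive UTF-8-encoded or \u-escaped
data = json.loads(raw_body.decode("utf-8"))
print(data["title"])  # -> café, already a unicode string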
I have a script that needs to determine the charset before the document is read by lxml.HTML() for parsing. I will assume ISO-8859-1 (that's the normal assumed charset for this, right?) if it can't be found, and search the html for the meta tag with the charset attribute. However, I'm not sure of the best way to do that. I could try to create an etree with lxml, but I don't want to read the whole file, since I may run into encoding problems. However, if I don't read the whole file, I can't build an etree, since some tags will not have been closed.
Should I just find the meta tag with some fancy string subscripting and break out of the loop once it's found or a certain number of lines have been read? Maybe use a low-level HTML parser, e.g. html.parser? Using Python 3, btw. Thanks.
You should first try to extract the encoding from the HTTP headers. If it is not present there, you should parse the document with lxml. This can be tricky, since lxml throws parse errors if the charset does not match. A work-around is to decode and re-encode the data, ignoring the unknown characters:
html_data = html_data.decode("UTF-8", "ignore")  # drop bytes that aren't valid UTF-8
html_data = html_data.encode("UTF-8", "ignore")
After this, you can parse the result by invoking lxml.HTML() with utf-8 encoding.
This way, you'll be able to find the encoding declared in the document's <meta> tags.
After finding the encoding, you'll have to re-parse the HTML document with the proper encoding.
Unfortunately, sometimes you won't find a character encoding even in the <meta> tags. I'd suggest using the chardet module to find the proper encoding only after these steps fail.
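Pulling those steps together, a sketch of the whole fallback chain, assuming lxml and chardet are installed (detect_charset and its arguments are made-up names):

import chardet
from lxml import html

def detect_charset(raw, http_charset=None):
    # 1. Trust the HTTP Content-Type header if it named a charset.
    if http_charset:
        return http_charset
    # 2. Lossy decode/encode round-trip so lxml can parse despite stray
    #    bytes, then look at the <meta> declarations (note: the XPath
    #    match on http-equiv is case-sensitive, fine for a sketch).
    cleaned = raw.decode("utf-8", "ignore").encode("utf-8", "ignore")
    tree = html.fromstring(cleaned)
    for value in tree.xpath("//meta/@charset"):  # HTML5 style
        return value.strip()
    for value in tree.xpath("//meta[@http-equiv='Content-Type']/@content"):
        if "charset=" in value:  # e.g. "text/html; charset=EUC-KR"
            return value.split("charset=")[-1].split(";")[0].strip()
    # 3. Last resort: chardet's statistical guess, else the usual default.
    return chardet.detect(raw)["encoding"] or "ISO-8859-1"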
Determining the character encoding of an HTML file correctly is actually quite a complex matter, but the HTML5 spec defines exactly how a processor should do it. You can find the algorithm here: http://dev.w3.org/html5/spec/parsing.html#determining-the-character-encoding
I'm reading some documentation on a service I'm trying to use, and it reads something like this:
All requests must be sent using HTTP Post.
The XML engine only accepts plain ASCII (text) UTF-8 requests/streams. Encoded streams are not acceptable.
All requests/responses are XML.
But I really just don't understand what it's asking for. From what I've been reading on HTTP POST in Python, you still need to encode key=value pairs to make a request, where it sounds like they just want the plain XML itself (as a multipart, maybe? I am very confused). Are they giving me enough information and I'm just fundamentally misunderstanding their documentation, or should I ask for more details?
Using urllib2.Request (Python 2), you can POST the raw XML directly as the request body; there is no need to encode key=value pairs:

import urllib2

# Passing a data argument to Request makes this a POST with the XML as the raw body.
req = urllib2.Request("http://foo.com/post_here", "<xml data to post>")
req.add_header("Content-Type", "text/xml; charset=utf-8")  # tell the server it's plain XML
response = urllib2.urlopen(req)
the_page = response.read()
"plain ASCII UTF-8" is a contradiction in terms, IMHO -- ASCII is a subset of UTF-8, though. Try sending UTF-8 including some "special" (non-ASCII) character and see what happens (or, if you can, do ask them to reword said contradition-in-terms!-).
I am really lost in all the encoding/decoding issues with Python. Having read quite a few docs about how to handle incoming data perfectly, I still have issues with a few languages, like Korean. Anyhow, here is what I am doing:
korean_text = korean_text.encode('utf-8', 'ignore')
korean_text = unicode(korean_text, 'utf-8')
I save the above data to the database, which goes through fine.
Later, when I need to display the data, I fetch the content from the db and do the following:
korean_text = korean_text.encode( 'utf-8' )
print korean_text
And all I see is '???' echoed in the browser. Can someone please let me know the right way to save and display the above data?
Thanks
Even having read some docs, you seem to be confused about how unicode works.
Unicode is not an encoding. Unicode is the absence of encodings.
utf-8 is not unicode. utf-8 is an encoding.
You decode utf-8 bytestrings to get unicode. You encode unicode using an encoding, say, utf-8, to get an encoded bytestring.
Only bytestrings can be saved to disk or a database, sent over a network, or printed to a printer or screen. Unicode only exists inside your code.
The good practice is to decode everything you get as early as possible, work with it decoded, as unicode, in all your code, and then encode it as late as possible, when the text is ready to leave your program, to screen, database or network.
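A minimal sketch of that round-trip in Python 2 (matching the unicode/str split in the question); the bytes below are the UTF-8 encoding of the Korean word 안녕:

raw = '\xec\x95\x88\xeb\x85\x95'  # UTF-8 bytestring, e.g. straight from the browser
text = raw.decode('utf-8')        # boundary in: bytestring -> unicode
assert isinstance(text, unicode)  # all internal work happens on `text`
out = text.encode('utf-8')        # boundary out: unicode -> bytestring
print out                         # safe as long as the terminal/page is utf-8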
Now for your problem:
If you have a text that came from the browser, say, from a form, then it is encoded. It is a bytestring. It is not unicode.
You must then decode it to get unicode. Decode it using the encoding the browser used to encode it; the correct encoding comes from the browser itself, in the HTTP request headers.
Don't use 'ignore' when decoding. Since the browser said which encoding it is using, you shouldn't get any errors. Using 'ignore' means you will hide a bug if there is one.
Perhaps your web framework of choice already does that. I know that django, pylons, werkzeug, cherrypy all do that. In that case you already get unicode.
Now that you have a decoded unicode string, you can encode it using whatever encoding you like to store on the database. utf-8 is a good choice, since it can encode all unicode codepoints.
When you retrieve the data from the database, decode it using the same encoding you used to store it. And then encode it using the encoding you want to use on the page - the one declared in the html meta header <meta http-equiv="Content-Type" content="text/html; charset=utf-8"/>. If the encoding is the same used on the previous step, you can skip the decode/reencode since it is already encoded in utf-8.
If you see ???, then the data is being lost at one of the steps above. To know exactly where, more information is needed.
Read through this post about handling Unicode in Python.
You basically want to be doing these things:
.encode() text to a particular encoding (such as utf-8) before sending it to the database.
.decode() text back to unicode (from your encoding) when reading it from the database
The problem is most likely that your browser or OS doesn't have the right fonts to display Korean text, or that your browser's default font doesn't support Korean (especially if other non-ASCII characters appear to work fine). Try choosing another font until it works.