How to handle unicode of an unknown encoding in Django? - python

I want to save some text to the database using the Django ORM wrappers. The problem is, this text is generated by scraping external websites and many times it seems they are listed with the wrong encoding. I would like to store the raw bytes so I can improve my encoding detection as time goes on without redoing the scrapes. But Django seems to want everything to be stored as unicode. Can I get around that somehow?

You can store the data base64-encoded, for example. Or try to analyze the HTTP response headers; it may be simpler to get the proper encoding from the Content-Type charset there.
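A minimal sketch of the base64 approach, assuming a hypothetical model with a plain TextField (the model and field names are made up for illustration):

import base64

from django.db import models

class ScrapedPage(models.Model):
    # Hypothetical model: keep the raw bytes as base64 text so the ORM
    # never has to decode them.
    raw_base64 = models.TextField()

    def set_raw(self, raw_bytes):
        # base64 output is pure ASCII, so it survives any unicode field.
        self.raw_base64 = base64.b64encode(raw_bytes).decode("ascii")

    def get_raw(self):
        # Recover the exact scraped bytes for later re-decoding attempts.
        return base64.b64decode(self.raw_base64)

This keeps the exact bytes you scraped, so you can re-run charset detection whenever your heuristics improve.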

Create a file with the data and use a Django models.FileField to hold a reference to it.
No, it does not involve a ton of I/O. If your file is small it adds two or three I/Os (the directory read, the inode read and the data read).
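A sketch of what that could look like; the model and upload path here are illustrative, not from the question:

from django.core.files.base import ContentFile
from django.db import models

class ScrapeResult(models.Model):
    # Hypothetical model: the raw response bytes live on disk and the row
    # only stores a reference to them.
    source_url = models.URLField()
    raw_file = models.FileField(upload_to="scrapes/")

def save_scrape(url, raw_bytes):
    result = ScrapeResult(source_url=url)
    # ContentFile wraps the bytes so FileField can write them unmodified.
    result.raw_file.save("page.bin", ContentFile(raw_bytes), save=True)
    return result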

Related

how to get text from a downloadable .doc file without saving?

I'm trying to download a .doc file with a requests.get() request (though I've heard about other methods - they all require saving too).
Is there any method I could use to extract the text from it (or even convert it into a .txt, for example) straight away without saving it into a file?
I've tried passing request.raw into various converters (docx2txt.process() for example) but I assume they all work with files, not with streams.
While the script is running, memory allocation is handled by the Python interpreter; if you save the content to a file, the memory is managed differently. This article may be helpful to you.
Link: article
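If the file is really a .docx (which is a zip archive) rather than a legacy binary .doc, one way to avoid touching the disk is to wrap the downloaded bytes in an in-memory buffer; docx2txt opens its argument through zipfile, so a file-like object should work. A sketch, with a placeholder URL:

import io
import docx2txt
import requests

response = requests.get("https://example.com/some-document.docx")
response.raise_for_status()

# BytesIO gives the converter a seekable, file-like view of the bytes
# without ever writing them to disk.
buffer = io.BytesIO(response.content)
text = docx2txt.process(buffer)
print(text)

A legacy .doc file uses a different (OLE) binary format and would need a different tool.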

How to remove non-english from the textual data in csv file

I am cleaning the textual data that I scraped from multiple URLs. How can I remove non-English words/symbols from the data in a csv file?
I saved the data and read the data using the following codes:
To save the data as csv file:
df.to_csv("blogdata.csv", encoding = "utf-8")
After saving the data, the csv file shows as follows including non-English words and symbols (e.g., '\n\t\t\t', m’, etc.):
The symbols did not show in the original data and some of them even appear from the data that are in English. Take 'Ross Parker' in the 7th row as an example.
The data saved in the csv file says: ['\n\t\t\t', 'It’s about time I wrote an update on what we’ve been up to over the past few months. We’re about to...
Whereas in the original data scraped from the URL, it shows as follows:
Can anybody explain why this happens and help me solve this issue and clean the non-English data from the file?
Thank you so much in advance!
It looks like pilot error: the data is correct but you are looking at it in a tool which is configured or hard-coded to display the text as Latin-1 (or Windows code page 1252?) even though you saved it as UTF-8.
Some tools - especially on Windows - will do whimsical things with UTF-8 that doesn't have a BOM. Maybe try adding one (and maybe file a bug report if this really helps; the tool should at the very least let you override its default encoding, without modifying the input data).
In other words, if the screenshot with the broken data is from Excel, you probably selected a DOS code page (or the horribly mislabelled "ANSI") instead of UTF-8 when it asked how to import this CSV file. Perhaps the best fix would be to devise a workflow which does not involve a spreadsheet.
Or perhaps you used a tool which didn't ask you anything, and tried to "sniff" the data to determine its encoding, and it guessed wrong. Adding an invisible byte sequence called a BOM which is unique to UTF-8 should hopefully allow it to guess right; but this is buggy behavior - you should not be a hostage to its clearly imperfect heuristics. (See also "Bush hid the facts" for a related story.)
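With pandas, the simplest way to prepend a UTF-8 BOM is the utf-8-sig codec, which BOM-sniffing tools such as Excel use as a cue to decode the file as UTF-8. A sketch reusing the blogdata.csv name from the question:

import pandas as pd

# Tiny stand-in for the scraped DataFrame from the question.
df = pd.DataFrame({"text": ["It’s about time I wrote an update…"]})

# "utf-8-sig" writes a UTF-8 BOM (EF BB BF) at the start of the file.
df.to_csv("blogdata.csv", encoding="utf-8-sig", index=False)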

Elasticsearch indexing with Python UTF-8 problems

I am indexing data on Elasticsearch using the official Python library for this: elasticsearch-py. The data is taken directly from Oracle using the cx_Oracle Python library, cast into a document format and sent for indexing to Elasticsearch. For the most part this works great, but sometimes I encounter problems with characters like ö. Sometimes this character is indexed as \xc3\xb8 and sometimes as ö. This happens even in the same database entry. One variable can have the ö indexed correctly while for another variable this is not the case.
Does anyone have an idea what might cause this?
Thanks in advance
If your "ö" is sometimes right - and sometimes not, the data must be corrupted in your database. This is not a problem of Elasticsearch. (I had the exact same problem one month ago!)
Strings with various encodings were likely put into your database without all being converted to a single format first.
text = "ö"
asUtf=text.encode('UTF-8')
print(asUtf)
print(asUtf.decode())
Result:
b'\xc3\xb6'
ö
This problem could be solved before insertion into Elasticsearch. Find the text sequences matching '\xXX\xXX', treat them as UTF-8 and decode them to unicode (see the sketch below). Try to sanitize your database and fix the way you put information into it.
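A sketch of that repair step; it assumes the corrupted values are UTF-8 bytes that were mis-decoded as Latin-1 (classic mojibake), which is only one of several ways the data could have been mangled:

def fix_mojibake(value):
    # Try to undo a UTF-8 string that was mis-decoded as Latin-1.
    try:
        return value.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # The value does not fit the mojibake pattern; leave it alone.
        return value

print(fix_mojibake("Ã¶"))  # -> "ö" if the value was double-encoded
print(fix_mojibake("ö"))   # already clean, returned unchanged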
PS: a better practice for moving information from a database to Elasticsearch is to use rivers or to write a script that sends the data directly to Elasticsearch, without saving it to a file first.
2016 edit: rivers are deprecated now, so you should find an alternative such as Logstash.

Robust way to put contents of any arbitrary text file in the database (using Django/Python)?

As part of my Django app, I have to get the contents of a text file which a user uploads (which could be any charset) and save it to my DB. I keep running into issues (like having to remove UTF-8's BOM manually, or having to figure out how to account for non-printable characters, or having to figure out how to make all unicode characters work - not just Latin ones, etc.) and each of these issues requires its own hack.
Is there a robust way to do this that doesn't require each of these case-by-case fixes? Right now I'm just using file.read() to get the contents, then doing all of those workarounds to clean the contents, and then using .save() to save it to the DB (I have a model for this).
What else can I be doing?
It causes some overhead, but you could base64-encode the entire string before persisting it to the DB. Then no escaping is required.
If you want to explicitly steer away from any issues with encoding and just see files as bunches of binary data (not strings of text in a specific encoding) you might want to use your database's binary format.
For MySQL this is BINARY and VARBINARY: http://dev.mysql.com/doc/refman/5.0/en/binary-varbinary.html
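In Django specifically, BinaryField maps to those binary column types on MySQL, so one option (a sketch with made-up model names) is to skip decoding altogether and store the uploaded bytes as-is:

from django.db import models

class UploadedText(models.Model):
    # Hypothetical model: the bytes go in untouched; decoding (and any
    # BOM stripping) is deferred until the data is actually displayed.
    original_name = models.CharField(max_length=255)
    raw_content = models.BinaryField()

def save_upload(uploaded_file):
    # uploaded_file is a Django UploadedFile; .read() returns bytes.
    return UploadedText.objects.create(
        original_name=uploaded_file.name,
        raw_content=uploaded_file.read(),
    )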
For a deeper understanding of unicode & utf-8 issues (recommended) this is a nice read on the subject:
http://www.tbray.org/ongoing/When/200x/2003/04/26/UTF

Printing several binary data fields from Google DataStore?

I'm using Google App Engine and python for a web service. Some of the models (tables) I have in my web service have several binary data fields in them, and I'd like to present this data to a computer requesting it, all fields at the same time. Now, the problem is I don't know how to write it out in a way that the other computer knows where the first data ends and the other starts. I've been using JSON for all the things that aren't binary, but afaik JSON doesn't work for binary data. So how do you get around this?
You could of course separate the data and put it in its own model, and then reference it back to some metadata model. That would allow you to make a single page that just prints one data field of one of the items, but that is clumsy to implement on both the server and the client.
Another solution would be to put in some kind of separator, and just split the data on that. I suppose it would work and that's how you do it, but isn't there like a standardized way to do that? Any libraries I could use?
In short, I'd like to be able to do something like this:
binaryDataField1: data data data ...
binaryDataField2: data data data ...
etc
Several easy options:
base64 encode your data - meaning you can still use JSON.
Use Protocol Buffers.
Prefix each field with its length - either as a 4- or 8-byte integer, or as a numeric string (see the sketch below).
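A sketch of the length-prefix option using Python's struct module; the 4-byte big-endian framing here is an assumption, not an established format:

import struct

def pack_fields(*fields):
    # Concatenate binary fields, each preceded by a 4-byte big-endian length.
    out = bytearray()
    for field in fields:
        out += struct.pack(">I", len(field))
        out += field
    return bytes(out)

def unpack_fields(blob):
    # Split a packed blob back into its original fields.
    fields, offset = [], 0
    while offset < len(blob):
        (length,) = struct.unpack_from(">I", blob, offset)
        offset += 4
        fields.append(blob[offset:offset + length])
        offset += length
    return fields

packed = pack_fields(b"data data data", b"\x00\x01\x02binary")
print(unpack_fields(packed))  # -> [b'data data data', b'\x00\x01\x02binary']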
One solution that would leverage your JSON investment would be to simply convert the binary data to something that JSON can support. For example, Base64 encoding might work well for you. You could treat the output of your Base64 encoder just like you would a normal string in JSON. It looks like Python has Base64 support built in, though I only use Java on App Engine so I can't guarantee whether the linked library works in the sandbox or not.
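A sketch of the Base64-over-JSON approach in Python; the field names below are just placeholders matching the example above:

import base64
import json

fields = {
    "binaryDataField1": b"data data data \x00\x01",
    "binaryDataField2": b"more data \xff\xfe",
}

# Encode each binary field as ASCII-safe Base64 text so JSON can carry it.
payload = json.dumps(
    {name: base64.b64encode(value).decode("ascii") for name, value in fields.items()}
)

# The client reverses the transformation to recover the original bytes.
decoded = {name: base64.b64decode(value) for name, value in json.loads(payload).items()}
print(decoded)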
