I am throwing in the towel here. I'm trying to convert a string scraped from the source code of a website with scrapy (injected javascript) to json so I can easily access the data. The problem comes down to a decode error. I tried all kinds of encoding, decoding, escaping, codecs, regular expressions, string manipulations and nothing works. Oh, using Python 3.
I narrowed down the culprit on the string (or at least part of it)
scraped = '{"propertyNotes": [{"title": "Local Description", "text": "\u003Cp\u003EAPPS\u003C/p\u003E\n\n\u003Cp\u003EBig Island Revealed (comes as app or as a printed book)\u003C/p\u003E\n\n\u003Cp\u003EAloha Big Island\u003C/p\u003E\n\n\u003Cp\u003EBig Island\u003C/p\u003E\n\n\u003Cp\u003EBig Island Smart Maps (I like this one a lot)\u003C/p\u003E\n\n\u003Cp\u003EBig Island Adventures (includes videos)\u003C/p\u003E\n\n\u003Cp\u003EThe descriptions of beaches are helpful. Suitability for swimming, ease of access, etc. is included. Some beaches are great for picnics and scenic views, while others are suitable for swimming and snorkeling. Check before you go.\u003C/p\u003E"}]}'
scraped_raw = r'{"propertyNotes": [{"title": "Local Description", "text": "\u003Cp\u003EAPPS\u003C/p\u003E\n\n\u003Cp\u003EBig Island Revealed (comes as app or as a printed book)\u003C/p\u003E\n\n\u003Cp\u003EAloha Big Island\u003C/p\u003E\n\n\u003Cp\u003EBig Island\u003C/p\u003E\n\n\u003Cp\u003EBig Island Smart Maps (I like this one a lot)\u003C/p\u003E\n\n\u003Cp\u003EBig Island Adventures (includes videos)\u003C/p\u003E\n\n\u003Cp\u003EThe descriptions of beaches are helpful. Suitability for swimming, ease of access, etc. is included. Some beaches are great for picnics and scenic views, while others are suitable for swimming and snorkeling. Check before you go.\u003C/p\u003E"}]}'
data = json.loads(scraped_raw) #<= works
print(data["propertyNotes"])
failed = json.loads(scraped) #no work
print(failed["propertyNotes"])
Unfortunately, I cannot find a way for scrapy/splash to return the string as raw. So, somehow I need to have python interprets the string as raw while it is loading the json. Please help
Update:
What worked for that string was json.loads(str(data.encode('unicode_escape'), 'utf-8')) However, it didnt work with the larger string. The error I get doing this is JSONDecodeError: Invalid \escape on the larger json string
The problem exists because the string you're getting has escaped control characters which when interpreted by python become actual bytes when encoded (while this is not necessarily bad, we know that these escaped characters are control characters that json would not expect). Similar to Turn's answer, you need to interpret the string without interpreting the escaped values which is done using
json.loads(scraped.encode('unicode_escape'))
This works by encoding the contents as expected by the latin-1 encoding whilst interpreting any \u003 like escaped character as literally \u003 unless it's some sort of control character.
If my understanding is correct however, you may not want this because you then lose the escaped control characters so the data might not be the same as the original.
You can see this in action by noticing that the control chars disappear after converting the encoded string back to a normal python string:
scraped.encode('unicode_escape').decode('utf-8')
If you want to keep the control characters you're going to have to attempt to escape the strings before loading them.
If you are using Python 3.6 or later I think you can get this to work with
json.loads(scraped.encode('unicode_escape'))
As per the docs, this will give you an
Encoding suitable as the contents of a Unicode literal in
ASCII-encoded Python source code, except that quotes are not escaped.
Decodes from Latin-1 source code. Beware that Python source code
actually uses UTF-8 by default.
Which seems like exactly what you need.
Ok. so since I am on windows, I have to set the console to handle special characters. I did this by typing chcp 65001 into the terminal. I also use a regular expression and chained the string manipulation functions which is the python way anyways.
usable_json = json.loads(re.search('start_sub_string(.*)end_sub_string', hxs.xpath("//script[contains(., 'some_string')]//text()").extract_first()).group(1))
Then everything went smoth. I'll sort out the encoding and escaping when writing to database down the line.
Related
I'm working to export a small subset of music from my iTunes library to an external drive, for use with a Sonos speaker (via Media Library on Sonos). All was going fine until I came across some unicode text in track, album and artist names.
I'm moving from iTunes on Mac to a folder structure on Linux (Ubuntu), and the file paths all contain the original Unicode names and these are displayed and play fine from Sonos in the Artist / Album view. The only problem is playlists, which I'm generating via a bit of Python3 code.
Sonos does not appear to support UTF-8 encoding in .m3u / .m3u8 playlists. The character ÷ was interpreted by Sonos as ÷, which after a bit of Googling I found was clearly mixing up UTF-8 and UTF-16 (÷ 0xC3 0xB7 in UTF-8, whilst à is U+00C3 in UTF-16 and · is U+00B7 in UTF-16). I've tried many different ways of encoding it, and just can't get it to recognise tracks with non-standard (non-ASCII?) characters in their names.
I then tried .wpl playlists, and thought I'd solved it. Tracks with characters such as ÷ and • in their path now work perfectly, just using those characters in their unicode / UTF-8 form in the playlist file itself.
However, just as I was starting to tidy up and finish off the code, I found some other characters that weren't being handled correctly: ö, å, á and a couple of others. I've tried both using these as their original unicode characters, but also as their encoded XML identifier e.g. ́ Using this format doesn't make a difference to what works or does not work - ÷ (÷) and • (•) are fine, whilst ö (ö), å (å) and á (á) are not.
I've never really worked with unicode / UTF-8 before, but having read various guides and how-to's I feel like I'm getting close but probably just missing something simple. The fact that some unicode characters work now, and others don't, makes me think it's got to be something obvious! I'm guessing the difference is that accents modify the previous character, rather than being a character in itself, but tried removing the previous letter and that didn't work!
Within Python itself I'm not doing anything particularly clever. I read in the data from iTunes' XML file using:
with open(settings['itunes_path'], 'rb') as itunes_handle:
itunes_library = plistlib.load(itunes_handle)
For export I've tried dozens of different options, but generally something like the below (sometimes with encoding='utf-8' and various other options):
with open(dest_path, 'w') as playlist_file:
playlist_file.write(generated_playlist)
Where generated_playlist is the result of extracting and filtering data from itunes_library, having run urllib.parse.unquote() on any iTunes XML data.
Any thoughts or tips on where to look would be very much appreciated! I'm hoping that to someone who understands Unicode better the answer will be really really obvious! Thanks!
Current version of the code available here: https://github.com/dwalker-uk/iTunesToSonos
With thanks to #lenz for the suggestions above, I do now have unicode playlists fully working with Sonos.
A couple of critical points that should save someone else a lot of time:
Only .wpl playlists seem to work. Unicode will not work with .m3u or .m3u8 playlists on Sonos.
Sonos needs any unicode text to be normalised into NFC form - I'd never heard of this before, but essentially means that any accented characters have to be represented by a single character, not as a normal character with a separate accent.
The .pls playlist, which is an XML format, needs to have unicode characters encoded in an XML format, i.e. é is represented in the .pls file as é.
The .pls file also needs the XML reserved characters (& < > ' ") in their escaped form, i.e & is &.
In Python 3, converting a path from iTunes XML format into something suitable for a .pls playlist on Sonos, needs the following key steps:
left = len(itunes_library['Music Folder'])
path_relative = 'Media/' + itunes_library['Tracks'][track_id]['Location'][left:]
path_unquoted = urllib.parse.unquote(path_relative)
path_norm = unicodedata.normalize('NFC', path_unquoted)
path = path_norm.replace('&', '&').replace('<', '<').replace('>', '>').replace('"', '"')
playlist_wpl += '<media src="%s"/>\n' % path
with open(pl_path, 'wb') as pl_file:
pl_file.write(playlist_wpl.encode('ascii', 'xmlcharrefreplace'))
A full working demo for exporting from iTunes for use in Sonos (or anything else) as .pls is available here: https://github.com/dwalker-uk/iTunesToSonos
Hope that helps someone!
I am working on a program (Python 2.7) that reads xls files (in MHTML format). One of the problems I have is that files contain symbols/characters that are not ascii. My initial solution was to read the files in using unicode
Here is how I am reading in a file:
theString=unicode(open(excelFile).read(),'UTF-8','replace')
I am then using lxml to do some processing. These files have many tables, the first step of my processing requires that I find the right table. I can find the table based on words that are in the the first cell of the first row. This is where is gets tricky. I had hoped to use a regular expression to test the text_content() of the cell but discovered that there were too many variants of the words (in a test run of 3,200 files I found 91 different ways that the concept that defines just one of the tables was expressed. Therefore I decided to dump all of the text_contents of the particular cell out and use some algorithims in excel to strictly identify all of the variants.
The code I used to write the text_content() was
headerDict['header_'+str(column+1)]=encode(string,'Latin-1','replace')
I did this baseed on previous answers to questions similar to mine here where it seems the consensus was to read in the file using unicode and then encode it just before the file is written out.
So I processed the labels/words in excel - converted them all to lower case and got rid of the spaces and saved the output as a text file.
The text file has a column of all of the unique ways the table I am looking for is labeled
I then am reading in the file - and the first time I did I read it in using
labels=set([label for label in unicode(open('C:\\balsheetstrings-1.txt').read(),'UTF-8','replace').split('\n')])
I ran my program and discovered that some matches did not occur, investigating it I discovered that unicode replaced certain charactors with \ufffd like in the example below
u'unauditedcondensedstatementsoffinancialcondition(usd\ufffd$)inthousands'
More research turns up that the replacement happens when unicode does not have a mapping for the character (probably not the exact explanation but that was my interpretation)
So then I tried (after thinking what do I have to lose) reading in my list of labels without using unicode. So I read it in using this code:
labels=set(open('C:\\balsheetstrings-1.txt').readlines())
now looking at the same label in the interpreter I see
'unauditedcondensedstatementsoffinancialcondition(usd\xa0$)inthousands'
I then try to use this set of labels to match and I get this error
Warning (from warnings module):
File "C:\FunctionsForExcel.py", line 128
if tableHeader in testSet:
UnicodeWarning: Unicode equal comparison failed to convert both arguments to Unicode - interpreting them as being unequal
Now the frustrating thing is that the value for tableHeader is NOT in the test set When I ask for the value of tableHeader after it broke I received this
'fairvaluemeasurements:'
And to add insult to injury when I type the test into Idle
tableHeader in testSet
it correctly returns false
I understand that the code '\xa0' is code for a non-breaking space. So does Python when I read it in without using unicode. I thought I had gotten rid of all the spaces in excel but to handle these I split them and then joined them
labels=[''.joiin([word for word in label.split()] for label in labels])
I still have not gotten to a question yet. Sorry I am still trying to get my head around this. It seems to me that I am dealing with inconsistent behavior here. When I read the string in originally and used unicode and UTF-8 all the characters were perserved/transportable if you will. I encoded them to write them out and they displayed fine in Excel, I then saved them as a txt file and they looked okay But something is going on and I can't seem to figure out where.
If I could avoid writing the strings out to identify the correct labels I have a feeling my problem would go away but there are 20,000 or more labels. I can use a regular expression to cut my potential list down significantly but some of it just requires inspection.
As an aside I will note that the source files all specify the charset='UTF-8'
Recap- when I read sourcedocument and list of labels in using unicode I fail to make some matches because the labels have some characters replaced by the ufffd, and when I read the sourcedocument in using unicode and the list of labels in without any special handling I get the warning.
I would like to understand what is going on so I can fix it but I have exhausted all the places I can think to look
You read (and write) encoded files like this:
import codecs
# read a utf8 encoded file and return the data as unicode
data = codecs.open(excelFile, 'rb', 'UTF-8').read()
The encoding you use does not matter as long as you do all the comparisons in unicode.
I understand that the code '\xa0' is code for a non-breaking space.
In a byte string, \xA0 is a byte representing non-breaking space in a few encodings; the most likely of those would be Windows code page 1252 (Western European). But it's certainly not UTF-8, where byte \xA0 on its own is invalid.
Use .decode('cp1252') to turn that byte string into Unicode instead of 'utf-8'. In general if you want to know what encoding an HTML file is in, look for the charset parameter in the <meta http-equiv="Content-Type"> tag; it is likely to differ depending on what exported it.
Not exactly a solution, but something like xlrd would probably make a lot more sense than jumping through all those hoops.
I'm working in django and Python and I'm having issues with saving utf-16 characters in PostgreSQL. Is there any method to convert utf-16 to utf-8 before saving?
I'm using python 2.6 here is my code snippets
sample_data="This is the time of year when Travel & Leisure, TripAdvisor and other travel media trot out their “Best†lists, so I thought I might share my own list of outstanding hotels I’ve had the good fortune to visit over the years."
Above data contains some latin special characters but it is not showing correctly, I just want to show those latin special characters in appropriate formats.
There are no such things as "utf-16 characters". You should show your data by using print repr(data), and tell us which pieces of your data you are having trouble with. Show us the essence of your data e.g. the repr() of "Leisure “Best†lists I’ve had"
What you actually have is a string of bytes containing text encoded in UTF-8. Here is its repr():
'Leisure \xe2\x80\x9cBest\xe2\x80\x9d lists I\xe2\x80\x99ve had'
You'll notice 3 clumps of guff in what you showed. These correspond to the 3 clumps of \xhh in the repr.
Clump1 (\xe2\x80\x9c) decodes to U+201C LEFT DOUBLE QUOTATION MARK.
Clump 2 is \xe2\x80\x9d. Note that only the first 2 "latin special characters" aka "guff" showed up in your display. That is because your terminal's encoding is cp1252 which doesn't map \x9d; it just ignored it. Unicode is U+201D RIGHT DOUBLE QUOTATION MARK.
Clump 3: becomes U+2019 RIGHT SINGLE QUOTATION MARK (being used as an apostrophe).
As you have UTF-8-encoded bytes, you should be having no trouble with PostgreSQL. If you are getting errors, show your code, the full error message and the full traceback.
If you really need to display the guff to your Windows terminal, print guff.decode('utf8').encode('cp1252') ... just be prepared for unicode characters that are not supported by cp1252.
Update in response to comment I dont have any issue with saving data,problem is while displaying it is showing weired characters,so what iam thinking is convert those data before saving am i right?
Make up your mind. (1) In your question you say "I'm having issues with saving utf-16 characters in PostgreSQL". (2) Now you say "I dont have any issue with saving data,problem is while displaying it is showing weired characters"
Summary: Your sample data is encoded in UTF-8. If UTF-8 is not acceptable to PostgreSQL, decode it to Unicode. If you are having display problems, first try displaying the corresponding Unicode; if that doesn't work, try an encoding that your terminal will support (presumably one of the cp125X family.
This works for me to convert strings: sample_data.decode('mbcs').encode('utf-8')
I'm a German developer writing web applications for Germans, which means I cannot by any means rely on plain ASCII encoding. At least characters like ä, ö, ü, ß have to be supported.
Fortunately, Django treats ByteStrings as utf-8 encoded by default (as described in the docs). So it should just work, if I add the # -*- coding: utf-8 -*- line to the beginning of each .py file and set the editor encoding, shouldn't it? Well, it does most of the time...
But I seem to miss something when it comes to URLs. Or maybe that has not to do anything with URLs but until now I didn't notice any other encoding misbehavior. There are two cases I can remember as examples:
The URL pattern url(r'^([a-z0-9äöüß_\-]+)/$', views.view_page) doesn't recognize URLs containing ä, ö, ü, ß at all. Those characters are simply ignored.
The following code of a view function throws an Exception:
def do_redirect(request, id):
return redirect('/page/{0}'.format(id))
Where the id argument is captured from the URL like the one in the first example. If I fix the URL pattern (by specifying it as unicode string) and than access /ä/, I get the Exception
UnicodeEncodeError at /ä/
'ascii' codec can't encode character u'\xe4' in position 0: ordinal not in range(128)
However, trying the following code for the view function:
def do_redirect(request, id):
return redirect('/page/' + id)
everything works out fine. That makes me belief the actual problem lies not within Django but derives from Python, treating ByteStrings as ASCII. I'm not that much into encoding but the problem in the second example is obviously the format() method of the String object. So, in the first example it might fail because of the way Python handles regular expressions (though I don't know if Django uses the re module or something else).
My workaround until now is just prefixing the string with u whenever such an error occurs. That's a bad solution since I might easily overlook something. I tried marking every Python string as unicode but that causes other exceptions and is quite ugly.
Does anyone know exactly, what the problem is and how to solve it in a pleasant way (i.e. a way that doesn't let your head explode when the code grows bigger)?
Thanks in advance!
EDIT: For my regular expression I found out, why the u is needed. Specifying a string as Raw String (r) makes it being interpreted as ASCII. Leaving the r away makes the regex work without the u but introduces some headache with backslashes.
Prefixing your strings with u is the solution.
If it's a problem for you, then it looks like a symptom of a more general problem: you have a lot of magic constants in your code. It is bad (and you already see why). Try to avoid them, for example you can use named url pattern or view name for redirecting instead of re-typing the part of URL.
If you can't avoid them, turn them into named constants, and place their assignments in one place. Then, you'll see that all of them are prefixed properly, and it will be difficult to overlook it.
In django 1.4, one of the new features is better support for url internationalization, including support for translating URLs.
This would go a long way in helping you out, but it doesn't mean you should ignore the other advice as that is for Python in general and applies to everything, not just django.
I am trying to encode and store, and decode arguments in Python and getting lost somewhere along the way. Here are my steps:
1) I use google toolkit's gtm_stringByEscapingForURLArgument to convert an NSString properly for passing into HTTP arguments.
2) On my server (python), I store these string arguments as something like u'1234567890-/:;()$&#".,?!\'[]{}#%^*+=_\\|~<>\u20ac\xa3\xa5\u2022.,?!\'' (note that these are the standard keys on an iphone keypad in the "123" view and the "#+=" view, the \u and \x chars in there being some monetary prefixes like pound, yen, etc)
3) I call urllib.quote(myString,'') on that stored value, presumably to %-escape them for transport to the client so the client can unpercent escape them.
The result is that I am getting an exception when I try to log the result of % escaping. Is there some crucial step I am overlooking that needs to be applied to the stored value with the \u and \x format in order to properly convert it for sending over http?
Update: The suggestion marked as the answer below worked for me. I am providing some updates to address the comments below to be complete, though.
The exception I received cited an issue with \u20ac. I don't know if it was a problem with that specifically, rather than the fact that it was the first unicode character in the string.
That \u20ac char is the unicode for the 'euro' symbol. I basically found I'd have issues with it unless I used the urllib2 quote method.
url encoding a "raw" unicode doesn't really make sense. What you need to do is .encode("utf8") first so you have a known byte encoding and then .quote() that.
The output isn't very pretty but it should be a correct uri encoding.
>>> s = u'1234567890-/:;()$&#".,?!\'[]{}#%^*+=_\|~<>\u20ac\xa3\xa5\u2022.,?!\''
>>> urllib2.quote(s.encode("utf8"))
'1234567890-/%3A%3B%28%29%24%26%40%22.%2C%3F%21%27%5B%5D%7B%7D%23%25%5E%2A%2B%3D_%5C%7C%7E%3C%3E%E2%82%AC%C2%A3%C2%A5%E2%80%A2.%2C%3F%21%27'
Remember that you will need to both unquote() and decode() this to print it out properly if you're debugging or whatever.
>>> print urllib2.unquote(urllib2.quote(s.encode("utf8")))
1234567890-/:;()$&#".,?!'[]{}#%^*+=_\|~<>€£¥•.,?!'
>>> # oops, nasty  means we've got a utf8 byte stream being treated as an ascii stream
>>> print urllib2.unquote(urllib2.quote(s.encode("utf8"))).decode("utf8")
1234567890-/:;()$&#".,?!'[]{}#%^*+=_\|~<>€£¥•.,?!'
This is, in fact, what the django functions mentioned in another answer do.
The functions
django.utils.http.urlquote() and
django.utils.http.urlquote_plus() are
versions of Python’s standard
urllib.quote() and urllib.quote_plus()
that work with non-ASCII characters.
(The data is converted to UTF-8 prior
to encoding.)
Be careful if you are applying any further quotes or encodings not to mangle things.
i want to second pycruft's remark. web protocols have evolved over decades, and dealing with the various sets of conventions can be cumbersome. now URLs happen to be explicitly not defined for characters, but only for bytes (octets). as a historical coincidence, URLs are one of the places where you can only assume, but not enforce or safely expect an encoding to be present. however, there is a convention to prefer latin-1 and utf-8 over other encodings here. for a while, it looked like 'unicode percent escapes' would be the future, but they never caught on.
it is of paramount importance to be pedantically picky in this area about the difference between unicode objects and octet strings (in Python < 3.0; that's, confusingly, str unicode objects and bytes/bytearray objects in Python >= 3.0). unfortunately, in my experience it is for a number of reasons pretty difficult to cleanly separate the two concepts in Python 2.x.
even more OT, when you want to receive third-party HTTP requests, you can not absolutely rely on URLs being sent in percent-escaped, utf-8-encoded octets: there may both be the occasional %uxxxx escape in there, and at least firefox 2.x used to encode URLs as latin-1 where possible, and as utf-8 only where necessary.
You are out of your luck with stdlib, urllib.quote doesn't work with unicode. If you are using django you can use django.utils.http.urlquote which works properly with unicode