Issue in decoding string in python - python

I have a set of strings that need to be decoded. The strings format varies with products on the site. So its pretty unpredictable. Few examples of the format are given below:
1. longDescription":"\u003cul\u003e \u003cli\u003eTender grill’d bites made " (unicode and symbol combination)
2. longDescription":"Goodness You Can See™" (all decoded, to be picked as is)
3. longDescription":"With a wide variety of headphones, \u003cbr /\u003e \u003cb\u003e\u003cbr /\u003eBlackWeb Flat CAT6 Network Cable:\u003c/b\u003e \u003cbr /\u003e \u003cul\u003e \u003cli\u003eFlat CAT6 Network Cable\u003c/li\u003e \u003cli\u003eLength: 14'\u003c/li\u003e \u003cli\u003eUltra-slim design\u003c/li\u003e \u003cli\u003e1GBPS" (all unicode)
Basically, I want to extract this long description key (backend) or (bulleted list in the front end) from products like https://www.walmart.com/ip/Friskies-Gravy-Wet-Cat-Food-Warm-d-Serv-d-Grill-d-Bites-With-Shrimp-3-5-oz-Pouch/842464118
I have tried the below codes:
if '\\u' in longdescription:
try:
#temp['Key_Features'] =longdescription
temp['Key_Features'] =longdescription.decode("unicode-escape").encode()
except Exception as e:
temp['Key_Features'] =HTMLParser.HTMLParser().unescape(longdescription)
else:
temp['Key_Features'] =longdescription
I have tried all these above cases separately and the above one is with a combination. These work for most cases but in cases like the 1st one, it encodes and decodes ' symbol (or any other symbol) too and my output becomes:
Tender grillâd bites (see the change in grill'd)
We have a dependency on python2 for this code, so requesting a solution in python2. Also, I am ok with HTML tags coming in output. Just need to have a code that works for all three cases. Thanks.

This is fixed in python3 now. Used below code to convert :
temp['Key_Features']=longDescription.encode().decode('unicode-escape').encode('latin1').decode('utf8').replace('&','&').replace(' ','').replace('"','"')
This happened because data was in different encoding formats and couldn't be handled by a single encoding/decoding. The above logic works for all.

Related

printing \x1bZ in python is weird, explanation needed

The following python code has some weird behavior which I don't understand
print('\x1bZ')
when I run this code whether in a file or in the interpreter I have a wierd outcome:
actual values as displayed when you write this value to a file as bytes:
Discoveries at time of posting this question:
whether single quotes or double quotes make a difference (they don't)
0x1b is hex for 27 which in ascii is ESC which matches as displayed with the second picture. This lead me to theorize that the letter Z in the string literal can be replaced but as per my test in point number 3 it cant be reproduced with other letters
instead of \x1bZ (ESC and then Z) trying ESC and then some other letter (I haven't checked all possibilities) yielded no apparent result except from replacing Z with c which seems to clear the terminal
Hypothesis that I came up with:
This page may be relevant to the answer: https://pypi.org/project/py100/ because I have found a pattern there that resembles weird result: Esc[?1;Value0c where Value would be replaced by something. also ^[[?1;<n>0c appears in https://espterm.github.io/docs/VT100%20escape%20codes.html
Is this some encoding problem?
Is this related to ANSI character escaping? [?1;0c vs [38;2; which is used when changing background color of text
Questions:
Why is this particular sequence of characters results in this output?
What is VT100 and how it is related if it is related? (I visited it's Wikipedia page)
whether it is possible to print a string that contains this specific sequence without that weird outcome as displayed in the first picture
all help and knowledge about this will be appreciated!!

Geopandas encodings for "never seen before" characters

during these days I'm struggling with geographical dataframes which I'm managing with geopandas. My problem comes from weird format of special characters that belong to the names of regions and towns. I never saw the format which I'm in front of. Fortunately they are not so many.
I tried to select all kind of encodings, from latin-1 to several ISO-xxx but the only way that appears to work properly is a manual replacement with a dictionary (which I don't like as it is built only with the examples I can reach from the dataframe itself. If it does change in the future, it will omit that).
Here's an example of how I approached the replacement. Since I couldn't find any good encoding that allowed me to read the dataframe properly, I put the 'utf-8' encoding as a parameter of the geopandas opener.
df1 = gpd.read_file('path/to/my/file.shp', encoding='utf-8')
The obtained result is the same inserted in the example, anyway. For the sake of the example, I put only 2 instances, beside in my original dataframe there is at least one for each pair in the dictionary.
df = pd.DataFrame([[b"Pr\x8e-Saint-Didier", b"Vall\x8e d'Aoste"],[ "Bozen", b"Trentino Alto Adige - S\x9ddtirol"]], columns = ['town', 'region'])
special_chars = {
'\x9f':'ü',
'\x93':'ì',
'\xed':'ì',
'\x8e':'é',
'\x8f':'è',
'\x8d':'ç',
'\x90':'ê',
'\x98':'ò',
'\x9d':'ù',
'\x88':'à',
}
df['town'] = df['town'].str.decode('latin-1').replace(special_chars, regex=True)
df['region'] = df['region'].str.decode('latin-1').replace(special_chars, regex=True)
Does anybody have any idea on how to solve this problem?
How to handle it?
Probably it is an existing encoding, so you have several possibilities: check few of such characters in Wikipedia. Some of accented characters have a list of possible encoding. In this case, I found that an old MacOS codepage had some of your characters correct. So I checked other Mac encodings, I think I found it.
Alternatively (and do this if you have many different files and encodings): you can write a Python script with a short conversion table, and iterate all encodings. Select the 3 encodings with better point (and maybe print also the character in such encodings). This is longer on first try, but if you have often such problem, it will help you (especially because it seems you are dealing with old data).
Note: It seems that maybe few guesses of you are wrong (wrong case?).
What I found?
I think it is Mac OS Roman. Or maybe some related Mac_OS encoding. Now it is your task to check carefully if my guess is correct (I didn't check all characters).
Note: This encoding is known as mac_roman in Python.

How to transform the emoji code to the unicode? [duplicate]

I'm trying to build a way to find emojis in twitter and relate them to the unicode table that one can find in unicode.org but I'm finding hard to identify them because of what I think are encoding problems or simply my misunderstanding on this topic. In short, what I did is build a "library" of emojis from the table found in http://www.unicode.org/emoji/charts/full-emoji-list.html that contains the title and the code point (code) of the emoji. I scrapped this in R with the library rvest.
The problem comes when I grab the information from twitter with the twitteR API in R. As the codes for the emojis do not look at all like the ones in this table.
Let's have an example with the emoji of the 100 (one hundred points) red icon. This is the number 1468 in the before linked table and its code point code is:
U+1F4AF
Now, when I grab it from twitter, first of all it is shown like this in the status class that the API has builtin to work with the tweets.
\xed��\xed��
Then, once I convert it to a dataframe, I do it also with a builtin function from the twitter API. For example:
tweet$toDataFrame()
The emoji becomes this:
<ed><U+00A0><U+00BD><ed><U+00B2><U+00AF>
I tried to convert it with the function iconv in R, with the following code:
iconv(tweet$text, from="UTF-8", to="ASCII", "byte)
and I only manage to make it look like this:
<ed><a0><bd><ed><b2><af>
So, wrapping up and at the end of my tests, I got to the following results:
<ed><a0><bd><ed><b2><af>
<ed><U+00A0><U+00BD><ed><U+00B2><U+00AF>
\xed��\xed��
None of which look like the code point specified by the table:
U+1F4AF
Is there any possibility to transform between the two strings?
What am I missing? Why is twitter returning this information for emojis?
I didn't know anything about enconding before, but after days of reading I think I know what is going on. I don't understand perfectly how the encoding for emoji works, but I stumbled upon the same problem and solved it.
You want to map \xed��\xed�� to its name-decoded version: hundred points. A sensible way could be to scrape a dictionary online and use a key, such as Unicode, to replace it. In this case it would be U+1F4AF.
The conversions you show are not different encodings but different notation for the same encoded emoji:
as.data.frame(tweet) returns <ed><U+00A0><U+00BD><ed><U+00B2><U+00AF>.
iconv(tweet, from="UTF-8", to="ASCII", "byte") returns <ed><a0><bd><ed><b2><af>.
So using Unicode directly isn't feasible. Another way could be to use a dictionary that already encodes emoji in the <ed>...<ed>... way like the one here: emoji list. Voilà! Only her list is incomplete because it comes from
a dictionary that contains fewer emoticons.
The fast solution is to simply scrape a more complete dictionary and map the <ed>...<ed>... with its corresponding english text translation. I have done that already and posted here.
Although the fact that nobody else posted a list with the proper encoding bugged me. In fact, most dictionaries I found had an UTF-8 encoding using not an <ed>...<ed>... representation but rather <f0>.... It turns out they are both correct UTF-8 encodings for the same unicode U+1F4AF only the Bytes are read differently.
Long answer. The tweet is read in UTF-16 and then converted to UTF-8, and here is where conversions diverge. When the read is done by pairs of bytes the result will be UTF-8 <ed>...<ed>..., when it is read by chunks of four bytes the result will be UTF-8 <f0>... (Why is this? I don't fully understand, but I suspect it has something to do with the architecture of your processor).
So a slower (but more conscious) way to solve your problem is to scrape the <f0>... dictionary, convert it to UTF-16, convert it back to UTF-8 by pairs and you'll end up with two <ed>.... These two <ed>... is known as the low-high surrogate pair representation for the Unicode U+xxxxx.
As an example:
unicode <- 0x1F4Af
# Multibyte Version
intToUtf8(unicode)
# Byte-pair Version
hilo <- unicode2hilo(unicode)
intToUtf8(hilo)
Returns:
[1] "\xf0\u009f\u0092�"
[1] "\xed��\xed��"
Which, again, using iconv(..., 'utf-8', 'latin1', 'byte'), is the same as:
[1] "<f0><9f><92><af>"
[1] "<ed><a0><bd><ed><b2><af>"
PS1.:
Function unicode2hilo is a simple linear transformation of hi-lo to unicode
unicode2hilo <- function(unicode){
hi = floor((unicode - 0x10000)/0x400) + 0xd800
lo = (unicode - 0x10000) + 0xdc00 - (hi-0xd800)*0x400
hilo = paste('0x', as.hexmode(c(hi,lo)), sep = '')
return(hilo)
}
hilo2unicode <- function(hi,lo){
unicode = (hi - 0xD800) * 0x400 + lo - 0xDC00 + 0x10000
unicode = paste('0x', as.hexmode(unicode), sep = '')
return(unicode)
}
PS2.:
I would recommend using iconv(tweet, 'UTF-8', 'latin1', 'byte') to preserve special characters like áäà.
PS3.:
To replace the emoji with its english text, tag, hash, or anything you want to map it to, I would suggest using DFS in a graph of emojis because there are some emojis whose unicode is the concatenation of other simpler unicodes (i.e. <f0><9f><a4><b8><e2><80><8d><e2><99><82><ef><b8><8f> is a man cartwheeling, while independently <f0><9f><a4><b8> is person cartwheeling, <e2><80><8d> is nothing, <e2><99><82> is a male sign, and <ef><b8><8f> is nothing) and while man cartwheeling and person cartwheeling male sign are obviously semantically related, I prefer the more faithfull translation.
The answer provided by Felipe Suárez Colmenares is excellent because it describes the mechanics of this issue, but I wanted to point you here, which is a dictionary I made with the < ed > R encoding specifically for Twitter. I also have code on how to go through and identify prose versions of emojis. Thought this might be easier for people who stumble into this problem in the future. The dictionary is up to date to the most recent Unicode version (9) and once the even newer one comes out I'll update it then too.
Please try type this: iconv(tweet$text, "latin1", "ASCII", sub="")
There you have also similar discussion:
Emoticons in Twitter Sentiment Analysis in r
Regards,
Magda

Python3 src encodings of Emojis

I'd like to print emojis from python(3) src
I'm working on a project that analyzes Facebook Message histories and in the raw htm data file downloaded I find a lot of emojis are displayed as boxes with question marks, as happens when the value can't be displayed. If I copy paste these symbols into terminal as strings, I get values such as \U000fe328. This is also the output I'm getting when I run the htm files through BeautifulSoup and output the data.
I Googled this string (and others), and consistently one of the only sites that comes up with them is iemoji.com, in the case of the above string this page, that lists the string as a Python Src. I want to be able to print out these strings as their corresponding emojis (after all, they were originallly emojis when being messaged), and after looking around I found a mapping of src encodings at this page, that mapped the above like strings to emoji string names. I then found this emoji string names to Unicode list, that for the most part seems to map the emoji names to Unicode. If I try printing out these values, I get good output. Like following
>>> print(u'\U0001F624')
😤
Is there a way to map these "Python src" encodings to their unicode values? Chaining both libraries would work if not for the fact that the original src mapping is missing around 50% of the unicode values found in the unicode library. And if I do end up having to do that, is there a good way to find the Python Src value of a given emoji? From my testing emoji as strings equal their Unicode, such as '😤' == u'\U0001F624', but I can't seem to get any sort of relations to \U000fe328
This has nothing to do with Python. An escape like \U000fe328 just contains the hexadecimal representation of a code point, so this one is U+0FE328 (which is a private use character).
These days a lot of emoji are assigned to code points, eg. 😤 is U+01F624 — FACE WITH LOOK OF TRIUMPH.
Before these were assigned, various programs used various code points in the private use ranges to represent emoji. Facebook apparently used the private use character U+0FE328. The mapping from these code points to the standard code points is arbitrary. Some of them may not have a standard equivalent at all.
So what you have to look for is a table which tells you which of these old assignments correspond to which standard code point.
There's php-emoji on GitHub which appears to contain these mappings. But note that this is PHP code, and the characters are represented as UTF-8 (eg. the character above would be "\xf3\xbe\x8c\xa8").

Pitfalls in my code for detecting text file encoding with Python?

I know more about bicycle repair, chainsaw use and trench safety than I do Python or text encoding; with that in mind...
Python text encoding seems to be a perennial issue (my own question: Searching text files' contents with various encodings with Python?, and others I've read: 1, 2. I've taken a crack at writing some code to guess the encoding below.
In limited testing this code seems to work for my purposes* without me having to know an excess about the first three bytes of text encoding and the situations where those data aren't informative.
*My purposes are:
Have a dependency-free snippet I can use with a moderate-high degree of success,
Scan a local workstation for text based log files of any encoding and identify them as a file I am interested in based on their contents (which requires the file to be opened with the proper encoding)
for the challenge of getting this to work.
Question: What are the pitfalls with using a what I assume to be a klutzy method of comparing and counting characters like I do below? Any input is greatly appreciated.
def guess_encoding_debug(file_path):
"""
DEBUG - returns many 2 value tuples
Will return list of all possible text encodings with a count of the number of chars
read that are common characters, which might be a symptom of success.
SEE warnings in sister function
"""
import codecs
import string
from operator import itemgetter
READ_LEN = 1000
ENCODINGS = ['ascii','cp1252','mac_roman','utf_8','utf_16','utf_16_le',\
'utf_16_be','utf_32','utf_32_le','utf_32_be']
#chars in the regular ascii printable set are BY FAR the most common
#in most files written in English, so their presence suggests the file
#was decoded correctly.
nonsuspect_chars = string.printable
#to be a list of 2 value tuples
results = []
for e in ENCODINGS:
#some encodings will cause an exception with an incompatible file,
#they are invalid encoding, so use try to exclude them from results[]
try:
with codecs.open(file_path, 'r', e) as f:
#sample from the beginning of the file
data = f.read(READ_LEN)
nonsuspect_sum = 0
#count the number of printable ascii chars in the
#READ_LEN sized sample of the file
for n in nonsuspect_chars:
nonsuspect_sum += data.count(n)
#if there are more chars than READ_LEN
#the encoding is wrong and bloating the data
if nonsuspect_sum <= READ_LEN:
results.append([e, nonsuspect_sum])
except:
pass
#sort results descending based on nonsuspect_sum portion of
#tuple (itemgetter index 1).
results = sorted(results, key=itemgetter(1), reverse=True)
return results
def guess_encoding(file_path):
"""
Stupid, simple, slow, brute and yet slightly accurate text file encoding guessing.
Will return one likely text encoding, though there may be others just as likely.
WARNING: DO NOT use if your file uses any significant number of characters
outside the standard ASCII printable characters!
WARNING: DO NOT use for critical applications, this code will fail you.
"""
results = guess_encoding_debug(file_path)
#return the encoding string (second 0 index) from the first
#result in descending list of encodings (first 0 index)
return results[0][0]
I am assuming it would be slow compared to chardet, which I am not particularly familiar with. Also less accurate. They way it is designed, any roman character based language that uses accents, umlauts, etc. will not work, at least not well. It will be hard to know when it fails. However, most text in English, including most programming code, would largely be written with string.printable on which this code depends.
External libraries may be an option in the future, but for now I want to avoid them because:
This script will be run on multiple company computers on and off the network with various versions of python, so the fewer complications the better. When I say 'company' I mean small non-profit of social scientists.
I am in charge of collecting the logs from GPS data processing, but I am not the systems administrator - she is not a python programmer and the less time I take of hers the better.
The installation of Python that is generally available at my company is installed with a GIS software package, and is generally better when left alone.
My requirements aren't too strict, I just want to identify the files I am interested in and use other methods to copy them to an archive. I am not reading the full contents to memory to manipulate, appending or to rewriting the contents.
It seems like a high-level programming language should have some way of accomplishing this on its own. While "seems like" is a shaky foundation for any endeavor, I wanted to try and see if I could get it to work.
Probably the simplest way to find out how well your code works is to take the test suites for the other existing libraries, and use those as a base to create your own comprehensive test suite. They you will know if your code works for all of those cases, and you can also test for all of the cases you care about.

Categories

Resources