I'm trying to send Bengali text through an SMS gateway that doesn't normally support Bengali. Its documentation says I need to convert the SMS string to UTF-16BE, without any further details. However, I found a Python implementation of what I'm looking for here:
>>> message = 'আমার সোনার বাংলা'
>>> message
'আমার সোনার বাংলা'
>>> message.encode('utf-16-be')
b'\t\x86\t\xae\t\xbe\t\xb0\x00 \t\xb8\t\xcb\t\xa8\t\xbe\t\xb0\x00 \t\xac\t\xbe\t\x82\t\xb2\t\xbe'
>>> message.encode('utf-16-be').hex()
'098609ae09be09b0002009b809cb09a809be09b0002009ac09be098209b209be'
>>> message.encode('utf-16-be').hex().upper()
'098609AE09BE09B0002009B809CB09A809BE09B0002009AC09BE098209B209BE'
I am trying to accomplish two things here:
Understand the Python Implementation
Replicate the same procedure in Ruby 2.6
So far I've come up with the following:
text = 'আমার সোনার বাংলা'.encode("UTF-16BE")
p text
#output-> "\u0986\u09AE\u09BE\u09B0 \u09B8\u09CB\u09A8\u09BE\u09B0 \u09AC\u09BE\u0982\u09B2\u09BE"
In Ruby, converting a string to its bytes (and then to hex) is typically accomplished with the unpack method:
# See the String#unpack documentation for specifics; 'H*' extracts the whole string as hex
message.encode('UTF-16BE').unpack('H*')
# => ["098609ae09be09b0002009b809cb09a809be09b0002009ac09be098209b209be"]
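To sanity-check that this matches the Python implementation, the hex string can be round-tripped in Python (a quick check, assuming Python 3):
payload = '098609ae09be09b0002009b809cb09a809be09b0002009ac09be098209b209be'
# Decoding the hex back through UTF-16BE should reproduce the original message
assert bytes.fromhex(payload).decode('utf-16-be') == 'আমার সোনার বাংলা'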
Related
I'm trying to build a way to find emojis on Twitter and relate them to the Unicode table that one can find at unicode.org, but I'm finding it hard to identify them because of what I think are encoding problems, or simply my misunderstanding of this topic. In short, what I did was build a "library" of emojis from the table found at http://www.unicode.org/emoji/charts/full-emoji-list.html, which contains the title and the code point of each emoji. I scraped this in R with the rvest library.
The problem comes when I grab the information from Twitter with the twitteR API in R: the codes for the emojis do not look at all like the ones in that table.
Let's take as an example the emoji of the 100 (one hundred points) red icon. It is number 1468 in the table linked above, and its code point is:
U+1F4AF
Now, when I grab it from Twitter, it is shown like this in the status class that the API provides for working with tweets:
\xed��\xed��
Then, once I convert it to a data frame, also with a built-in function from the twitteR API, for example:
tweet$toDataFrame()
The emoji becomes this:
<ed><U+00A0><U+00BD><ed><U+00B2><U+00AF>
I tried to convert it with the iconv function in R, with the following code:
iconv(tweet$text, from="UTF-8", to="ASCII", "byte")
and I only manage to make it look like this:
<ed><a0><bd><ed><b2><af>
So, wrapping up, at the end of my tests I got the following results:
<ed><a0><bd><ed><b2><af>
<ed><U+00A0><U+00BD><ed><U+00B2><U+00AF>
\xed��\xed��
None of which look like the code point specified by the table:
U+1F4AF
Is there any way to convert between these representations?
What am I missing? Why is Twitter returning this information for emojis?
I didn't know anything about encoding before, but after days of reading I think I know what is going on. I don't understand perfectly how the encoding for emoji works, but I stumbled upon the same problem and solved it.
You want to map \xed��\xed�� to its name-decoded version: hundred points. A sensible way could be to scrape a dictionary online and use a key, such as the Unicode code point, to replace it. In this case it would be U+1F4AF.
The conversions you show are not different encodings but different notations for the same encoded emoji:
as.data.frame(tweet) returns <ed><U+00A0><U+00BD><ed><U+00B2><U+00AF>.
iconv(tweet, from="UTF-8", to="ASCII", "byte") returns <ed><a0><bd><ed><b2><af>.
So using the Unicode code point directly isn't feasible. Another way could be to use a dictionary that already encodes emoji in the <ed>...<ed>... way, like the one here: emoji list. Voilà! Except that list is incomplete, because it comes from a dictionary that contains fewer emoticons.
The fast solution is to simply scrape a more complete dictionary and map each <ed>...<ed>... sequence to its corresponding English text translation. I have done that already and posted it here.
Still, the fact that nobody else had posted a list with the proper encoding bugged me. In fact, most dictionaries I found use a UTF-8 encoding with not an <ed>...<ed>... representation but rather <f0>.... It turns out both are byte representations of the same code point U+1F4AF; the bytes are just derived differently.
Long answer: the tweet is handled as UTF-16 and then converted to UTF-8, and here is where the conversions diverge. When each 16-bit code unit (each half of the surrogate pair) is converted to UTF-8 separately, the result is the six-byte <ed>...<ed>... form (often called CESU-8); when the code point is encoded directly, the result is the standard four-byte UTF-8 <f0>... form.
So a slower (but more deliberate) way to solve your problem is to scrape the <f0>... dictionary, convert it to UTF-16, and convert it back to UTF-8 code unit by code unit; you'll end up with the two <ed>... sequences. These two units are known as the high-low surrogate pair representation of a code point U+xxxxx above the Basic Multilingual Plane.
As an example:
unicode <- 0x1F4AF
# Multibyte Version
intToUtf8(unicode)
# Byte-pair Version
hilo <- unicode2hilo(unicode)
intToUtf8(hilo)
Returns:
[1] "\xf0\u009f\u0092�"
[1] "\xed��\xed��"
Which, again, using iconv(..., 'utf-8', 'latin1', 'byte'), is the same as:
[1] "<f0><9f><92><af>"
[1] "<ed><a0><bd><ed><b2><af>"
PS1.:
The function unicode2hilo is a simple linear transformation from a code point to its hi-lo surrogate pair, and hilo2unicode is the inverse:
unicode2hilo <- function(unicode){
  # Split a code point above U+FFFF into its UTF-16 surrogate pair.
  hi = floor((unicode - 0x10000)/0x400) + 0xd800
  lo = (unicode - 0x10000) + 0xdc00 - (hi - 0xd800)*0x400
  hilo = paste('0x', as.hexmode(c(hi, lo)), sep = '')
  return(hilo)
}

hilo2unicode <- function(hi, lo){
  # Recombine a surrogate pair into the original code point.
  unicode = (hi - 0xD800) * 0x400 + lo - 0xDC00 + 0x10000
  unicode = paste('0x', as.hexmode(unicode), sep = '')
  return(unicode)
}
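The same surrogate arithmetic can be cross-checked in Python (a minimal sketch, assuming Python 3; 'surrogatepass' is needed because strict UTF-8 refuses lone surrogates):
cp = 0x1F4AF
hi = 0xD800 + ((cp - 0x10000) >> 10)    # high surrogate: 0xD83D
lo = 0xDC00 + ((cp - 0x10000) & 0x3FF)  # low surrogate:  0xDCAF
# Standard UTF-8: one four-byte sequence for the code point.
print(chr(cp).encode('utf-8'))          # b'\xf0\x9f\x92\xaf'
# CESU-8 style: each surrogate encoded separately yields the <ed>... form.
cesu = chr(hi).encode('utf-8', 'surrogatepass') + chr(lo).encode('utf-8', 'surrogatepass')
print(cesu)                             # b'\xed\xa0\xbd\xed\xb2\xaf'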
PS2.:
I would recommend using iconv(tweet, 'UTF-8', 'latin1', 'byte') to preserve special characters like áäà.
PS3.:
To replace the emoji with its English text, tag, hash, or anything you want to map it to, I would suggest using DFS over a graph of emojis, because some emojis' code sequences are the concatenation of other, simpler sequences. For example, <f0><9f><a4><b8><e2><80><8d><e2><99><82><ef><b8><8f> is man cartwheeling, while independently <f0><9f><a4><b8> is person cartwheeling, <e2><80><8d> is the zero-width joiner (U+200D, which renders as nothing), <e2><99><82> is the male sign, and <ef><b8><8f> is variation selector-16 (U+FE0F, also invisible on its own). While man cartwheeling and person cartwheeling + male sign are obviously semantically related, I prefer the more faithful translation.
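As a quick Python illustration of that decomposition (assuming Python 3; unicodedata.name falls back to a default when the local Unicode tables lack an entry), each code point of the sequence and its UTF-8 bytes can be listed:
import unicodedata
seq = '\U0001F938\u200D\u2642\uFE0F'  # person cartwheeling + ZWJ + male sign + VS-16
for ch in seq:
    print('U+%04X' % ord(ch), ch.encode('utf-8'), unicodedata.name(ch, '(no name)'))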
The answer provided by Felipe Suárez Colmenares is excellent because it describes the mechanics of this issue, but I wanted to point you here, to a dictionary I made with the <ed> R encoding specifically for Twitter. I also have code for going through tweets and identifying prose versions of emojis. I thought this might be easier for people who stumble into this problem in the future. The dictionary is up to date with the most recent Unicode version (9), and once the newer one comes out I'll update it then too.
Please try this: iconv(tweet$text, "latin1", "ASCII", sub="")
There is also a similar discussion here:
Emoticons in Twitter Sentiment Analysis in r
Regards,
Magda
I've got two bytes-type variables that I've concatenated (separated by a space) so I can send them as one variable to a server (socket programming). What I'm trying to figure out is how to then separate them and assign them back to their original variables using regular expressions. I've consulted regular expressions parsing a binary file, but it didn't work for me. Here is my attempt, trying just to get the cipher variable:
ciphertext = re.match(b'\S', ciphertext)
It generally only matches the first couple of characters and returns a match object, which isn't what I want. What am I doing wrong?
Edit: I'm probably doing it the hard way. Honestly, any recommendation on how to send two bytes objects over a socket using UDP would help; it's proving really difficult.
Ended up using str.rpartition to solve my problem. It wasn't the most obvious answer, but it worked.
Why are you using regex to do this? You should take a look at the struct module:
In [1]: import struct
In [2]: magic = b'\xcf\xfa\xed\xfe'
In [3]: decoded = struct.unpack('<I', magic)[0]
In [4]: hex(decoded)
Out[4]: '0xfeedfacf'
Also, you can use this recipe for decoding binary files.
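If the goal is just to send two bytes objects in one UDP datagram and split them again, length-prefix framing with struct avoids delimiters entirely. A minimal sketch (the function names here are hypothetical, not from any library):
import struct

def pack_two(a, b):
    # Prefix each part with its 4-byte big-endian length, so the payload
    # may contain spaces (or any byte at all) without breaking the split.
    return struct.pack('!I', len(a)) + a + struct.pack('!I', len(b)) + b

def unpack_two(payload):
    (n,) = struct.unpack_from('!I', payload, 0)
    a = payload[4:4 + n]
    (m,) = struct.unpack_from('!I', payload, 4 + n)
    b = payload[8 + n:8 + n + m]
    return a, b

# Round trip:
data = pack_two(b'cipher\x00text', b'key \xff bytes')
assert unpack_two(data) == (b'cipher\x00text', b'key \xff bytes')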
I am getting an error with ZeroMQ in Python while sending strings through a ROUTER socket. String-type messages are received successfully, but sometimes a unicode message throws the exception "TypeError: unicode not allowed. use send_unicode". I have been trying to use msg.encode('utf-8'), but I can't figure out a way to get past it.
I am on Python 2.7.3. I am not using pyzmq (import zmq only). Looking forward to your suggestions :) Thanks
if backend in sockets:
    request = backend.recv_multipart()
    # print("Backend Thread is ready")
    worker_id, client_id = request[:2]
    if client_id != b"READY" and len(request) > 3:
        # print(len(request))
        empty2, reply = request[2:]
        router_socket.send_multipart([client_id, reply.encode('utf-8')])
The problem was resolved; the only thing was that I needed to convert the unicode strings back to ASCII using string.encode('ascii').
I got the same error. My erroneous code was:
socket.send("Server message to client3")
You must convert the message to bytes to solve it. To do so, just add a b prefix like this:
socket.send(b"Server message to client3")
Is it better to convert strings to bytes, and then bytes back to strings, when data is sent through the network, and why?
So, because PyZMQ is actually a good library, they have some docs:
https://pyzmq.readthedocs.io/en/latest/unicode.html
What they say is that the str object changed its nature over the course of Python's evolution.
In Python 3, str is a collection of characters, and in Python 2 it is a simple wrapper (with some sugar) for the char* that we know from C :).
The docs explain why the people behind PyZMQ chose to make the differences explicit: performance is the answer.
To send strings in Python 3 you should use the right method, which is send_string; correspondingly, to send unicode in Python 2 you should use send_unicode.
It is however recommended to stick to bytes and explicitly provide correct encoding and decoding where needed.
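For example (a sketch against the pyzmq API; the PUSH socket and endpoint are made up for illustration):
import zmq

ctx = zmq.Context.instance()
sock = ctx.socket(zmq.PUSH)
sock.connect('tcp://127.0.0.1:5555')  # hypothetical endpoint

# Explicit: encode to bytes yourself and send bytes.
sock.send(u'h\xe9llo'.encode('utf-8'))

# Or let pyzmq encode for you, stating the encoding explicitly.
sock.send_string(u'h\xe9llo', encoding='utf-8')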
Also, you are using pyzmq: the module named "zmq" comes from the pyzmq library/package.
To confirm this, use pip list | grep zmq (or pip list | select-string zmq on Windows).
I am using the Readability Parser API to extract content from a web page. It works when the web page uses a Latin character set, but when I extract an article in Cyrillic, the content ends up as the following:
<div>&#x412;&#x432;&#x43E;&#x441;&#x43A;&#x440;&#x435;&#x441;&#x435;&#x43D;&#x44C;</div>...etc
The interesting thing here is that the title of the web page is extracted correctly in Cyrillic, but not the content. My attempt was to do the following, as suggested in this SO answer:
content = unicodedata.normalize('NFKD', content).encode('ascii','ignore')
but it did not work. Could you tell me if there is a way to convert this string before saving it to the database?
Please let me know if the title of my question correctly describes what I need. Thank you.
One way (Python 3.3):
>>> s='<div>&#x412;&#x432;&#x43E;&#x441;&#x43A;&#x440;&#x435;&#x441;&#x435;&#x43D;&#x44C;</div>'
>>> import html.parser
>>> h=html.parser.HTMLParser()
>>> h.unescape(s)
'<div>Ввоскресень</div>'
Python 2.7:
>>> s='<div>&#x412;&#x432;&#x43E;&#x441;&#x43A;&#x440;&#x435;&#x441;&#x435;&#x43D;&#x44C;</div>'
>>> import HTMLParser
>>> h=HTMLParser.HTMLParser()
>>> print(h.unescape(s))
<div>Ввоскресень</div>
P.S. I went to look for the documentation link, and it appears that unescape isn't documented. Here's a way that doesn't use an undocumented API:
>>> import re
>>> re.sub(r'&#x(.*?);', lambda x: chr(int(x.group(1), 16)), s)
'<div>Ввоскресень</div>'
Per the comment, it looks finally documented (and moved) in Python 3.4:
https://docs.python.org/3.4/library/html.html#html.unescape
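For completeness, on Python 3.4+ that documented function can be called directly (a quick check):
>>> import html
>>> html.unescape('<div>&#x412;&#x432;&#x43E;&#x441;&#x43A;&#x440;&#x435;&#x441;&#x435;&#x43D;&#x44C;</div>')
'<div>Ввоскресень</div>'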
I'm reading some strings from a memory buffer written by a C program. I need to fetch them using Python and print them. However, when I encounter a format string containing %llx, Python does not know how to parse it:
"unsupported format character 'l' (0x6c) at index 14"
I could use replace('%llx', '%x'), but then it would not be a long long... would Python handle this correctly in this case?
then it would not be a long long
Python (essentially) doesn't have any concept of a long long. If you're pulling long longs from C code, just use %x and be done with it; you're not ever going to get values from the C code that are outside the long long range. The only issue that could arise is if you were trying to send them from Python code into C. Just use (with a new-style format string):
print('{0:x}'.format(your_int))
Tested on both Python v3.3.3 and v2.7.6 :
>>> print('%x' % 523433939134152323423597861958781271347434)
6023bedba8c47434c84785469b1724910ea
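If you want to keep the C format strings and only neutralize the length modifier, the replace approach from the question also works, since Python's %x handles arbitrarily large ints (a minimal sketch; the format string here is made up):
fmt = 'value=%llx'  # hypothetical format string read from the buffer
print(fmt.replace('%llx', '%x') % 0xFEEDFACECAFEBEEF)
# value=feedfacecafebeef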