I have some str variables of the form
'Nov 3, 2019 16:13:05.882679000
\xe4\xb8\xad\xe5\x9b\xbd\xe6\xa0\x87\xe5\x87\x86\xe6\x97\xb6\xe9\x97\xb4', and I want to convert the escaped part '\xe4\xb8\xad\xe5\x9b...' to the Chinese it represents; here it means "中国标准时间". I have tried this method:
t.encode('raw_unicode_escape').decode()
It works well when I assign the string literal directly to t. However, when t comes from the capture, i.e. I do not assign the string to it myself, the method doesn't work.
Is there another method to solve the problem, or is something wrong with my code?
from pyshark.packet.fields import LayerField
from scapy.all import *
import pyshark
from pyshark.packet.packet import Packet
capture = pyshark.LiveCapture(interface='WLAN')
capture.sniff(packet_count=10)
pkt = capture[0] # type: Packet
time = pkt.frame_info.time.fields[0] # type: LayerField
t = time.showname_value # type: str
s = '\xe4\xb8\xad\xe5\x9b\xbd\xe6\xa0\x87\xe5\x87\x86\xe6\x97\xb6\xe9\x97\xb4'
print(t)
print()
print(t[t.find('\\'):])
print(s)
print()
print(t[t.find('\\'):].encode('raw_unicode_escape'))
print(s.encode('raw_unicode_escape'))
------------------------ the outcome ------------------------
Nov 3, 2019 16:33:57.630346000 \xe4\xb8\xad\xe5\x9b\xbd\xe6\xa0\x87\xe5\x87\x86\xe6\x97\xb6\xe9\x97\xb4
\xe4\xb8\xad\xe5\x9b\xbd\xe6\xa0\x87\xe5\x87\x86\xe6\x97\xb6\xe9\x97\xb4
ä¸å½æ åæ¶é´
b'\\xe4\\xb8\\xad\\xe5\\x9b\\xbd\\xe6\\xa0\\x87\\xe5\\x87\\x86\\xe6\\x97\\xb6\\xe9\\x97\\xb4'
b'\xe4\xb8\xad\xe5\x9b\xbd\xe6\xa0\x87\xe5\x87\x86\xe6\x97\xb6\xe9\x97\xb4'
The bytes that you provide are not Unicode sequences. The string "中国标准时间" written as Unicode escape sequences would look like this: \u4e2d\u56fd\u6807\u51c6\u65f6\u95f4. You could use the MgntUtils Java open-source library, which has a feature that converts any string in any language into a Unicode sequence and vice versa. The code I used to convert your string into the above Unicode sequence looks like this:
System.out.println(StringUnicodeEncoderDecoder.encodeStringToUnicodeSequence("中国标准时间"));
The library can be found on Maven Central or on GitHub. It comes as a Maven artifact, with sources and javadoc.
Here is the javadoc for the class StringUnicodeEncoderDecoder.
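For comparison, Python's standard library can produce the same kind of escape sequence without any extra dependency (a small sketch, Python 3):

print('中国标准时间'.encode('unicode_escape').decode('ascii'))
# \u4e2d\u56fd\u6807\u51c6\u65f6\u95f4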
This is UTF-8 incorrectly decoded as latin-1. Mojibake. To reverse it, undo the incorrect decoder and apply the correct decoder:
>>> s = '\xe4\xb8\xad\xe5\x9b\xbd\xe6\xa0\x87\xe5\x87\x86\xe6\x97\xb6\xe9\x97\xb4'
>>> s.encode('latin-1').decode('utf-8')
'中国标准时间'
Zhōngguó biāozhǔn shíjiān, or China Standard Time, according to Google Translate.
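Note that the output above suggests t actually contains literal backslash-escape text (the doubled backslashes in b'\\xe4...' give it away), so for t you would first undo the escaping and then reverse the mojibake. A sketch, assuming Python 3:

t = 'Nov 3, 2019 16:33:57.630346000 \\xe4\\xb8\\xad\\xe5\\x9b\\xbd\\xe6\\xa0\\x87\\xe5\\x87\\x86\\xe6\\x97\\xb6\\xe9\\x97\\xb4'
# interpret the literal \xNN escapes, recover the raw bytes, then decode as UTF-8
print(t.encode('latin-1').decode('unicode_escape').encode('latin-1').decode('utf-8'))
# Nov 3, 2019 16:33:57.630346000 中国标准时间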
I'm trying to build a way to find emojis on Twitter and relate them to the Unicode table one can find at unicode.org, but I'm finding it hard to identify them because of what I think are encoding problems, or simply my misunderstanding of this topic. In short, I built a "library" of emojis from the table found at http://www.unicode.org/emoji/charts/full-emoji-list.html that contains the title and the code point (code) of each emoji. I scraped this in R with the library rvest.
The problem comes when I grab the information from Twitter with the twitteR API in R: the codes for the emojis do not look at all like the ones in this table.
Let's take as an example the emoji of the 100 (one hundred points) red icon. It is number 1468 in the table linked above, and its code point is:
U+1F4AF
Now, when I grab it from Twitter, it is shown like this in the status class that the API provides for working with tweets:
\xed��\xed��
Then I convert it to a data frame, also with a built-in function from the twitteR API. For example:
tweet$toDataFrame()
The emoji becomes this:
<ed><U+00A0><U+00BD><ed><U+00B2><U+00AF>
I tried to convert it with the function iconv in R, with the following code:
iconv(tweet$text, from="UTF-8", to="ASCII", "byte")
and I only manage to make it look like this:
<ed><a0><bd><ed><b2><af>
So, wrapping up, at the end of my tests I got the following results:
<ed><a0><bd><ed><b2><af>
<ed><U+00A0><U+00BD><ed><U+00B2><U+00AF>
\xed��\xed��
None of which look like the code point specified by the table:
U+1F4AF
Is there any possibility to transform between the two strings?
What am I missing? Why is Twitter returning this information for emojis?
I didn't know anything about encoding before, but after days of reading I think I know what is going on. I don't understand perfectly how encoding for emoji works, but I stumbled upon the same problem and solved it.
You want to map \xed��\xed�� to its name-decoded version: hundred points. A sensible way would be to scrape a dictionary online and use a key, such as the Unicode code point, to do the replacement. In this case it would be U+1F4AF.
The conversions you show are not different encodings but different notation for the same encoded emoji:
as.data.frame(tweet) returns <ed><U+00A0><U+00BD><ed><U+00B2><U+00AF>.
iconv(tweet, from="UTF-8", to="ASCII", "byte") returns <ed><a0><bd><ed><b2><af>.
So using the Unicode code point directly isn't feasible. Another way would be to use a dictionary that already encodes emoji in the <ed>...<ed>... way, like the one here: emoji list. Voilà! The only catch is that her list is incomplete, because it comes from a dictionary that contains fewer emoticons.
The fast solution is to simply scrape a more complete dictionary and map each <ed>...<ed>... sequence to its corresponding English text translation. I have done that already and posted it here.
Still, the fact that nobody else had posted a list with the proper encoding bugged me. In fact, most dictionaries I found were UTF-8 encoded using not an <ed>...<ed>... representation but rather <f0>.... It turns out both are correct UTF-8 encodings of the same code point U+1F4AF; the bytes are just read differently.
Long answer: the tweet is read as UTF-16 and then converted to UTF-8, and this is where the conversions diverge. When the read is done by pairs of bytes, the result is the UTF-8 <ed>...<ed>... form; when it is read in chunks of four bytes, the result is the UTF-8 <f0>... form. (Why? I don't fully understand, but I suspect it has something to do with the architecture of your processor.)
So a slower (but more deliberate) way to solve your problem is to scrape the <f0>... dictionary, convert it to UTF-16, and convert it back to UTF-8 by pairs; you'll end up with two <ed>... sequences. These two are known as the high-low surrogate pair representation of the code point U+xxxxx.
As an example:
unicode <- 0x1F4AF
# Multibyte Version
intToUtf8(unicode)
# Byte-pair Version
hilo <- unicode2hilo(unicode)
intToUtf8(hilo)
Returns:
[1] "\xf0\u009f\u0092�"
[1] "\xed��\xed��"
Which, again, using iconv(..., 'utf-8', 'latin1', 'byte'), is the same as:
[1] "<f0><9f><92><af>"
[1] "<ed><a0><bd><ed><b2><af>"
PS1.:
The function unicode2hilo is a simple linear transformation from a code point to its hi-lo surrogate pair, and hilo2unicode is the inverse:
unicode2hilo <- function(unicode){
  hi = floor((unicode - 0x10000)/0x400) + 0xd800
  lo = (unicode - 0x10000) + 0xdc00 - (hi-0xd800)*0x400
  hilo = paste('0x', as.hexmode(c(hi,lo)), sep = '')
  return(hilo)
}

hilo2unicode <- function(hi,lo){
  unicode = (hi - 0xD800) * 0x400 + lo - 0xDC00 + 0x10000
  unicode = paste('0x', as.hexmode(unicode), sep = '')
  return(unicode)
}
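The same arithmetic in Python, for anyone who wants to verify both byte sequences outside R (a sketch, Python 3; not part of the original workflow):

cp = 0x1F4AF                            # the "hundred points" code point
hi = 0xD800 + ((cp - 0x10000) >> 10)    # high surrogate: 0xD83D
lo = 0xDC00 + ((cp - 0x10000) & 0x3FF)  # low surrogate:  0xDCAF

def utf8_3byte(u):
    # encode a single 16-bit code unit with the 3-byte UTF-8 pattern
    return bytes([0xE0 | (u >> 12), 0x80 | ((u >> 6) & 0x3F), 0x80 | (u & 0x3F)])

print(chr(cp).encode('utf-8'))          # b'\xf0\x9f\x92\xaf'  -> the <f0>... form
print(utf8_3byte(hi) + utf8_3byte(lo))  # b'\xed\xa0\xbd\xed\xb2\xaf'  -> the <ed>...<ed>... form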
PS2.:
I would recommend using iconv(tweet, 'UTF-8', 'latin1', 'byte') to preserve special characters like áäà.
PS3.:
To replace the emoji with its English text, tag, hash, or anything you want to map it to (see the sketch below), I would suggest using DFS in a graph of emojis, because some emojis' encodings are the concatenation of other, simpler emojis (i.e. <f0><9f><a4><b8><e2><80><8d><e2><99><82><ef><b8><8f> is man cartwheeling, while independently <f0><9f><a4><b8> is person cartwheeling, <e2><80><8d> is a zero-width joiner, <e2><99><82> is a male sign, and <ef><b8><8f> is a variation selector). While "man cartwheeling" and "person cartwheeling + male sign" are obviously semantically related, I prefer the more faithful translation.
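In Python the decomposition of such a composed emoji is easy to inspect (a sketch; the code points are standard Unicode):

s = '\U0001F938\u200D\u2642\uFE0F'  # man cartwheeling, one ZWJ sequence
for c in s:
    print('U+{:04X}'.format(ord(c)), c.encode('utf-8'))
# U+1F938 b'\xf0\x9f\xa4\xb8'  person cartwheeling
# U+200D  b'\xe2\x80\x8d'      zero width joiner
# U+2642  b'\xe2\x99\x82'      male sign
# U+FE0F  b'\xef\xb8\x8f'      variation selector-16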
The answer provided by Felipe Suárez Colmenares is excellent because it describes the mechanics of this issue, but I wanted to point you here, to a dictionary I made with the <ed> R encoding specifically for Twitter. I also have code for going through tweets and identifying prose versions of emojis. I thought this might be easier for people who stumble into this problem in the future. The dictionary is up to date with the most recent Unicode version (9), and once a newer one comes out I'll update it too.
Please try typing this: iconv(tweet$text, "latin1", "ASCII", sub="")
There is also a similar discussion here:
Emoticons in Twitter Sentiment Analysis in r
Regards,
Magda
I'm writing a Python script to fetch Korean vocabulary pronunciation. I have a URL ready to go, and when I open the URL in Safari, it retrieves the expected JSON from the server.
When I use requests to get the JSON, the call fails and no results are found.
Using Charles, I can see that the URL with my original query, a Hangul word, is URL encoded after I paste the URL into Safari and hit enter. For example, the instance of 소식 in the URL string becomes %EC%86%8C%EC%8B%9D on its way out.
However, when I make that same request with requests, the word is encoded as %E1%84%89%E1%85%A9%E1%84%89%E1%85%B5%E1%86%A8. Both encodings can be decoded back to the original word 소식 (using a web app to confirm). The former encoding is accepted by the server, the latter is not.
Why would I be getting a different encoding from requests?
Edit
Query string comes into the script as 소식
query = sys.argv[1]
sys.stderr.write(query) -> 소식
Interpolating the query into the URL string yields ...json/word/소식... when printing it.
Going through Charles it now looks like this /json/word/%E1%84%89%E1%85%A9%E1%84%89%E1%85%B5%E1%86%A8/. Everything is default, no specified encoding.
These are both valid url-encodings of the "same" input text:
>>> from urllib.parse import unquote
>>> ulong = unquote('%E1%84%89%E1%85%A9%E1%84%89%E1%85%B5%E1%86%A8')
>>> ushort = unquote('%EC%86%8C%EC%8B%9D')
>>> ulong
'소식'
>>> ushort
'소식'
The strings are not actually equal, though; they have different forms in Unicode:
>>> from unicodedata import name
>>> [name(x) for x in ulong]
['HANGUL CHOSEONG SIOS',
'HANGUL JUNGSEONG O',
'HANGUL CHOSEONG SIOS',
'HANGUL JUNGSEONG I',
'HANGUL JONGSEONG KIYEOK']
>>> [name(x) for x in ushort]
['HANGUL SYLLABLE SO', 'HANGUL SYLLABLE SIG']
I do not know any Korean, but it looks like the long string is composed of combining characters (you can also see similar things with latin characters and accents). If I perform a canonical decomposition and composition of the forms, I get equality:
>>> from unicodedata import normalize
>>> normalize('NFC', ulong) == ushort
True
So either you are using different input texts that merely happen to look the same (even repr is not enough to see the difference; you have to examine the code points), or one of the methods you are using, probably the browser, is performing a normalization/transformation.
Since the short form of the text is what worked with the server, I suggest you normalize the inputs to your script into the NFC form.
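A minimal sketch of that normalization step (Python 3; the URL is a placeholder for your real endpoint):

import sys
from unicodedata import normalize
from urllib.parse import quote

query = normalize('NFC', sys.argv[1])  # compose jamo into precomposed syllables
url = 'https://example.com/json/word/{}/'.format(quote(query))  # placeholder endpoint
# quote(query) now yields %EC%86%8C%EC%8B%9D for 소식, the form the server accepts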
I would like to encode an IP address in as short a string as possible using all the printable characters. According to https://en.wikipedia.org/wiki/ASCII#Printable_characters these are codes 20hex to 7Ehex.
For example:
shorten("172.45.1.33") --> "^.1 9" maybe.
To make decoding easy, I also need the length of the encoding to always be the same. I would also like to avoid the space character, to make parsing easier in the future.
How can one do this?
I am looking for a solution that works in Python 2.7.x.
My attempt so far to modify Eloims's answer to work in Python 2:
First I installed the ipaddress backport for Python 2 (https://pypi.python.org/pypi/ipaddress) .
# This is needed because ipaddress expects character strings and not byte strings
# for textual IP address representations
from __future__ import unicode_literals
import ipaddress
import base64

# Taken from http://stackoverflow.com/a/20793663/2179021
def to_bytes(n, length, endianess='big'):
    h = '%x' % n
    s = ('0'*(len(h) % 2) + h).zfill(length*2).decode('hex')
    return s if endianess == 'big' else s[::-1]

def encode(ip):
    ip_as_integer = int(ipaddress.IPv4Address(ip))
    ip_as_bytes = to_bytes(ip_as_integer, 4, endianess="big")
    ip_base85 = base64.a85encode(ip_as_bytes)
    return ip_base85

print(encode("192.168.0.1"))
This now fails because base64 doesn't have an attribute 'a85encode'.
An IP stored in binary is 4 bytes.
You can encode it in 5 printable ASCII characters using Base85.
Using more printable characters won't shorten the result any further: even with all 95 printable ASCII characters, 4 characters cover only 95^4 ≈ 8.1×10^7 values, fewer than the 2^32 possible IPv4 addresses, while 85^5 ≈ 4.4×10^9 is just enough.
import ipaddress
import base64
def encode(ip):
    ip_as_integer = int(ipaddress.IPv4Address(ip))
    ip_as_bytes = ip_as_integer.to_bytes(4, byteorder="big")
    ip_base85 = base64.a85encode(ip_as_bytes)
    return ip_base85
print(encode("192.168.0.1"))
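For completeness, a matching decoder with the same stdlib modules (a sketch; Python 3, like the code above):

import base64
import ipaddress

def decode(encoded):
    ip_as_bytes = base64.a85decode(encoded)  # always 5 chars -> 4 bytes
    ip_as_integer = int.from_bytes(ip_as_bytes, byteorder="big")
    return str(ipaddress.IPv4Address(ip_as_integer))

print(decode(encode("192.168.0.1")))  # 192.168.0.1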
I found this question while looking for a way to use base85/ascii85 on Python 2. Eventually I discovered a couple of projects available to install via PyPI. I settled on one called hackercodecs, because that project is specific to encoding/decoding, whereas the others I found offered the implementation only as a byproduct of necessity.
from __future__ import unicode_literals
import ipaddress
from hackercodecs import ascii85_encode
def encode(ip):
    return ascii85_encode(ipaddress.ip_address(ip).packed)[0]
print(encode("192.168.0.1"))
https://pypi.python.org/pypi/hackercodecs
https://github.com/jdukes/hackercodecs
from pybtex.database.input import bibtex
parser = bibtex.Parser()
bibdata = parser.parse_file("sample.bib")
The above code snippet works really well for parsing a .bib file, but it seems not to support accented characters, like {\"u} or \"{u} (from LaTeX). I'd just like to confirm whether pybtex supports that or not.
For example, according to LaTeX/Special Characters and How to write "ä" and other umlauts and accented letters in bibliography?, \"{o} should convert to ö, and so should {\"o}.
Update: this feature is now supported by pybtex since version 0.20.
It does not at the moment. But you can read the bib file using a latex codec before you process it with pybtex, e.g. with https://pypi.python.org/pypi/latexcodec/ This codec will convert (a wide range of) LaTeX commands to unicode for you.
However, you'll have to remove the brackets in a post-processing stage. Why? To handle bibtex code gracefully, \"{U} has to be converted into {Ü} rather than into Ü, to prevent it from being lowercased in titles. The following example demonstrates this behaviour:
import pybtex.database.input.bibtex
import pybtex.plugin
import codecs
import latexcodec

style = pybtex.plugin.find_plugin('pybtex.style.formatting', 'plain')()
backend = pybtex.plugin.find_plugin('pybtex.backends', 'latex')()
parser = pybtex.database.input.bibtex.Parser()

with codecs.open("test.bib", encoding="latex") as stream:
    # this shows what the latexcodec does to the source
    print stream.read()

with codecs.open("test.bib", encoding="latex") as stream:
    data = parser.parse_stream(stream)

for entry in style.format_entries(data.entries.itervalues()):
    print entry.text.render(backend)
where test.bib is
@Article{test,
    author = {John Doe},
    title = {Testing \"UTEST \"{U}TEST},
    journal = {Journal of Test},
    year = {2000},
}
This will print how the latexcodec converted test.bib into unicode (edited for readability):
@Article{test,
    author = {John Doe}, title = {Testing ÜTEST {Ü}TEST},
    journal = {Journal of Test}, year = {2000},
}
followed by the pybtex rendered entry (in this case, producing latex code):
John Doe.
\newblock Testing ütest {Ü}test.
\newblock \emph{Journal of Test}, 2000.
If the codec were to strip the brackets, pybtex would convert the case wrongly. Further, in (pathological) cases like journal = {\"u}, the brackets clearly cannot be removed either.
An obvious downside is that if you render to a non-LaTeX backend, then you have to remove the brackets in a post-processing stage. But you may want to do that anyway to process any special LaTeX commands (such as \url). It would be nice if pybtex could somehow do that for you, but it doesn't at the moment.
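For the non-LaTeX case, a crude post-processing sketch (a hypothetical helper; it just strips braces, which is enough for simple rendered entries but not for arbitrary nested LaTeX):

import re

def strip_braces(rendered):
    # remove the protective brackets that the latex codec left in place
    return re.sub(r'[{}]', '', rendered)

print strip_braces(u'Testing \xfctest {\xdc}test.')  # Testing ütest Ütest.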
pylatexenc (https://pypi.org/project/pylatexenc/)
from pylatexenc.latex2text import LatexNodes2Text
latex_text = 'Gl{\\"o}ckner'
text = LatexNodes2Text().latex_to_text(latex_text)
print(text) # Glöckner
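The same call appears to handle both accent spellings from the earlier question (a quick check, assuming pylatexenc's default settings):

print(LatexNodes2Text().latex_to_text('\\"{o} and {\\"o}'))  # ö and ö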
Conclusion: it's impossible to override or disable Python's built-in escape-sequence processing so that you can skip using the raw prefix specifier. I dug into Python's internals to figure this out. So if anyone designs objects that work on complex strings (like regexes) as part of some kind of framework, make sure to specify in the docstrings that string arguments to the object's __init__() MUST include the r prefix!
Original question: I am finding it difficult to force Python not to "change" anything about a user-inputted string, which may contain, among other things, regexes or escaped hexadecimal sequences. I've already tried various combinations of raw strings and .encode('string-escape') (and its decode counterpart), but I can't find the right approach.
Given an escaped hexadecimal representation of the documentation IPv6 address 2001:0db8:85a3:0000:0000:8a2e:0370:7334, and using .encode(), this small script (called x.py):
#!/usr/bin/env python

class foo(object):
    __slots__ = ("_bar",)

    def __init__(self, input):
        if input is not None:
            self._bar = input.encode('string-escape')
        else:
            self._bar = "qux?"

    def _get_bar(self):
        return self._bar

    bar = property(_get_bar)

x = foo("\x20\x01\x0d\xb8\x85\xa3\x00\x00\x00\x00\x8a\x2e\x03\x70\x73\x34")
print x.bar
Will yield the following output when executed:
$ ./x.py
\x01\r\xb8\x85\xa3\x00\x00\x00\x00\x8a.\x03ps4
Note that the \x20 got converted to an ASCII space character, along with a few others. This is expected, since Python processes the escaped hex sequences in the literal and converts them to their character values.
This can be solved if the initializer to foo() was treated as a raw string (and the .encode() call removed), like this:
x = foo(r"\x20\x01\x0d\xb8\x85\xa3\x00\x00\x00\x00\x8a\x2e\x03\x70\x73\x34")
However, my end goal is to create a kind of framework that can be used and I want to hide these kinds of "implementation details" from the end user. If they called foo() with the above IPv6 address in escaped hexadecimal form (without the raw specifier) and immediately print it back out, they should get back exactly what they put in w/o knowing or using the raw specifier. So I need to find a way to have foo's __init__() do whatever processing is necessary to enable that.
Edit: per this SO question, it seems this is a limitation of Python, in that it always performs escape-sequence processing on string literals. There does not appear to be any facility to turn escape-sequence processing off completely, even temporarily. Sucks. I guess I am going to have to look into subclassing str to create something like rawstr that intelligently determines which escape sequences Python processed in a string and converts them back to their original form. This is not going to be fun...
Edit2: Another example, given the sample regex below:
"^.{0}\xcb\x00\x71[\x00-\xff]"
If I assign this to a var or pass it to a function without using the raw specifier, the \x71 gets converted to the letter q. Even if I add .encode('string-escape') or .replace('\\', '\\\\'), the escape sequences are still processed, resulting in this output:
"^.{0}\xcb\x00q[\x00-\xff]"
How can I stop this, again without using the raw specifier? Is there some way to "turn off" the escape-sequence processing, or "revert" it after the fact so that the q turns back into \x71? Is there a way to process the string and escape the backslashes before the escape-sequence processing happens?
I think you have an understandable confusion about the difference between Python string literals (the source-code representation), Python string objects in memory, and how those objects can be printed (in what format they can be represented in the output).
If you read some bytes from a file into a bytestring you can write them back as is.
r"" exists only in source code there is no such thing at runtime i.e., r"\x" and "\\x" are equal, they may even be the exact same string object in memory.
To see that the input is not corrupted, you could print each byte as an integer:
print " ".join(str(ord(c)) for c in raw_input("input something"))
Or just echo it as is (there could be a difference, but it is unrelated to your "string-escape" issue):
print raw_input("input something")
Identity function:
def identity(obj):
    return obj
If you do nothing to the string, your users will receive the exact same object back. You can provide examples in the docs of what you consider a concise, readable way to represent input strings as Python literals. If you find it confusing to work with binary strings such as "\x20\x01", you could accept an ASCII hex representation instead: "2001" (you can use binascii.hexlify/unhexlify to convert between the two).
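For example (a sketch in Python 2, to match the code in the question):

import binascii

print binascii.hexlify("\x20\x01\x0d\xb8")  # 20010db8
print binascii.unhexlify("20010db8")        # the same four raw bytes back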
The regex case is more complex because there are two languages involved (see the sketch below):
Escape sequences are interpreted by Python according to its string-literal syntax.
The regex engine then interprets the resulting string object as a regex pattern that has its own escape sequences.
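A small sketch (Python 2) showing the two layers:

import re

pattern_src = "\\x71"  # four characters: backslash, x, 7, 1 -> the regex engine sees \x71
pattern_chr = "\x71"   # Python's literal processing already turned this into 'q'

print pattern_src == pattern_chr              # False
print re.match(pattern_src, "q") is not None  # True: the regex engine decodes \x71 itself
print re.match(pattern_chr, "q") is not None  # True: plain literal match on 'q'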
I think you will have to go the join route.
Here's an example:
>>> m = {chr(c): '\\x{0}'.format(hex(c)[2:].zfill(2)) for c in xrange(0,256)}
>>>
>>> x = "\x20\x01\x0d\xb8\x85\xa3\x00\x00\x00\x00\x8a\x2e\x03\x70\x73\x34"
>>> print ''.join(map(m.get, x))
\x20\x01\x0d\xb8\x85\xa3\x00\x00\x00\x00\x8a\x2e\x03\x70\x73\x34
I'm not entirely sure why you need that though. If your code needs to interact with other pieces of code, I'd suggest that you agree on a defined format, and stick to it.