Decoding normal URL escaped characters is a fairly easy task with python.
If you want to decode something like: Wikivoyage:%E5%88%A0%E9%99%A4%E8%A1%A8%E5%86%B3
All you need to use is:
import urllib.parse
urllib.parse.unquote('Wikivoyage:%E5%88%A0%E9%99%A4%E8%A1%A8%E5%86%B3')
And you get: 'Wikivoyage:删除表决'
However, I have identified some characters which this does not work with, namely 4-digit % encoded strings:
For example: %25D8
This apparently decodes to ◘
But if you use the urllib function I demonstrated previously, you get: %D8
I understand why this happens: the unquote function reads the %25 as a '%', which is what it normally translates to. Is there any way to get Python to read this properly? Especially in a string of similar characters?
The actual problem
In a comment you posted the real examples:
The data I am pulling from is just a list of url-encoded strings. One of the example strings I am trying to decode is represented as: %25D8%25A5%25D8%25B2%25D8%25A7%25D9%2584%25D8%25A9_%25D8%25A7%25D9%2584%25D8%25B4%25D8%25B9%25D8%25B1_%25D8%25A8%25D8%25A7%25D9%2584%25D9%2584%25D9%258A%25D8%25B2%25D8%25B1 This is the raw form of it. Other strings are normal url escapes such as: %D8%A5%D9%88%D8%B2
The first one is double-quoted, as wim pointed out. So they unquote as: إزالة_الشعر_بالليزر and إوز (which are Arabic for "laser hair removal" and "geese").
So you were mistaken about the unquoting and ◘ is a red herring.
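To make the double quoting visible, here is a small sketch. The double-quoted input is built with urllib.parse.quote from the second example string, which is my own construction for illustration and not part of your data:

```python
import urllib.parse

single_quoted = '%D8%A5%D9%88%D8%B2'
# Quoting an already-quoted string turns every '%' into '%25'.
double_quoted = urllib.parse.quote(single_quoted)

once = urllib.parse.unquote(double_quoted)   # undoes the outer layer only
twice = urllib.parse.unquote(once)           # undoes the inner layer

print(double_quoted)
print(once)
print(twice)
```

The first unquote only gets you back to the ordinary percent-encoded form; the second one produces the text.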
Solution
Ideally you would fix whatever gave you this inconsistent data, but if nothing else, you could try detecting double-quoted strings, for example, by checking if the number of % equals the number of %25.
def unquote_possibly_double_quoted(s: str) -> str:
    if s.count('%') == s.count('%25'):
        # Double
        s = urllib.parse.unquote(s)
    return urllib.parse.unquote(s)
>>> s = '%25D8%25A5%25D8%25B2%25D8%25A7%25D9%2584%25D8%25A9_%25D8%25A7%25D9%2584%25D8%25B4%25D8%25B9%25D8%25B1_%25D8%25A8%25D8%25A7%25D9%2584%25D9%2584%25D9%258A%25D8%25B2%25D8%25B1'
>>> unquote_possibly_double_quoted(s)
'إزالة_الشعر_بالليزر'
>>> unquote_possibly_double_quoted('%D8%A5%D9%88%D8%B2')
'إوز'
You might want to add some checks to this, like for example, s.count('%') > 0 (or '%' in s).
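A possible version with that extra check might look like the sketch below. The name unquote_guarded is made up for illustration, and the heuristic is still only a guess about the data, not a reliable detector:

```python
import urllib.parse

def unquote_guarded(s: str) -> str:  # hypothetical name
    """Unquote once, or twice when every '%' in s starts a '%25' escape."""
    if '%' in s and s.count('%') == s.count('%25'):
        # Looks double-quoted: strip the outer layer first.
        s = urllib.parse.unquote(s)
    return urllib.parse.unquote(s)

print(unquote_guarded('plain'))
print(unquote_guarded('%D8%A5%D9%88%D8%B2'))
print(unquote_guarded('%25D8%25A5%25D9%2588%25D8%25B2'))
```

One limitation to keep in mind: a singly-quoted string whose only escapes are literal percent signs is indistinguishable from a double-quoted one, so '%2541' (the single-quoted form of the text '%41') comes out as 'A'.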
There seem to be a lot of posts about doing this in other languages, but I can't seem to figure out how in Python (I'm using 2.7).
To be clear, I would ideally like to keep the string in unicode, just be able to replace certain specific characters.
For instance:
thisToken = u'tandh\u2013bm'
print(thisToken)
prints the word with the dash in the middle (U+2013 is actually an en dash, not an em dash). I would just like to delete the dash, but not by indexing, because I want to be able to do this anywhere I find these specific characters.
I try using replace like you would with any other character:
newToke = thisToken.replace('\u2013','')
print(newToke)
but it just doesn't work. Any help is much appreciated.
Seth
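The string you're searching for to replace must also be a Unicode string. In Python 2, '\u2013' without the u prefix is a byte string of six literal characters (a backslash, 'u', and four digits), so it never matches the dash. Try: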
newToke = thisToken.replace(u'\u2013','')
You can see the answer in this post: How to replace unicode characters in string with something else python?
Decode the string to Unicode. Assuming it's UTF-8-encoded:
str.decode("utf-8")
Call the replace method and be sure to pass it a Unicode string as its first argument:
str.decode("utf-8").replace(u"\u2022", "")
Encode back to UTF-8, if needed:
str.decode("utf-8").replace(u"\u2022", "").encode("utf-8")
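For comparison, here is a rough sketch of the same idea on Python 3, where str is already Unicode and no u prefix or decode step is needed. The list of characters to delete is only an example:

```python
# Python 3: plain string literals understand \uXXXX escapes.
token = 'tandh\u2013bm'
cleaned = token.replace('\u2013', '')

# To delete several characters in one pass, map each code point to None
# and use str.translate (the chosen characters are just an illustration).
unwanted = dict.fromkeys(map(ord, '\u2013\u2014\u2022'))
cleaned_many = 'a\u2013b\u2014c\u2022d'.translate(unwanted)

print(cleaned)
print(cleaned_many)
```

str.translate avoids chaining one replace call per character when the list of unwanted characters grows.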
I have a text file containing something that behaves like C-strings. For example:
something = "some text\nin two lines\tand tab";
somethingElse = "some text with \"quotes\"";
Fetching things between quotes is not a problem. The problem is that I process this string later, and the backslash escapes make that hard.
I'd like to decode these strings, process them, then encode them back to C-string literals.
So from that raw input
some text\\with line wrap\nand \"quote\"
I need:
some text\with line wrap
and "quote"
and vice versa.
What I've tried:
I've found an API for processing Python string literals (string_escape). It is close to what I need, but since I'm processing C strings it is useless. I've tried to find other codecs that match my problem, but no luck so far.
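I'm looking for a simple solution also, and the json module seems to be the easiest solution. The following is my quick hack. Note that JSON strings must be double-quoted, so the input is always wrapped in double quotes; this assumes every double quote inside the string is already backslash-escaped, as it would be in a C literal. It also only covers escapes that JSON and C share (no octal or \x sequences)... And I suspect you will have issues with unicode characters...
import json

def c_decode(in_str: str) -> str:
    """Decode the body of a C string literal (without its surrounding quotes)."""
    return json.loads('"' + in_str + '"')

def c_encode(in_str: str) -> str:
    """Encode a string as the body of a C string literal."""
    return json.dumps(in_str)[1:-1]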
Note also that if in_str is "AB\n\r\tYZ" ...
then we alternatively have: ("%r"%(in_str.join('""')))[2:-2]
giving: 'AB\\n\\r\\tYZ' # almost the same c_encode above.
Here's hoping that someone has a nicer solution.
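One alternative that avoids JSON entirely is a small hand-written table of escapes. This is only a sketch under the assumption that the input uses nothing beyond the simple one-character escapes listed; octal, \x and \u escapes are not handled, and the names c_unescape and c_escape are made up here:

```python
import re

# Simple one-character C escapes only (an assumption of this sketch).
_UNESCAPE = {'n': '\n', 't': '\t', 'r': '\r', '\\': '\\', '"': '"', "'": "'"}
_ESCAPE = {'\n': '\\n', '\t': '\\t', '\r': '\\r', '\\': '\\\\', '"': '\\"'}

def c_unescape(text):
    # Replace each backslash + character pair; unknown escapes are left as-is.
    return re.sub(r'\\(.)',
                  lambda m: _UNESCAPE.get(m.group(1), m.group(0)),
                  text, flags=re.DOTALL)

def c_escape(text):
    # Escape character by character so a backslash is never escaped twice.
    return ''.join(_ESCAPE.get(ch, ch) for ch in text)

raw = r'some text\\with line wrap\nand \"quote\"'
decoded = c_unescape(raw)
print(decoded)
print(c_escape(decoded) == raw)
```

Because the regex consumes a backslash together with the character after it, an escaped backslash followed by 'n' is not mistaken for a newline.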
I'd need to know how many displayable characters are in a unicode string containing japanese / chinese characters.
Sample code to make the question very obvious :
# -*- coding: UTF-8 -*-
str = '\xe7\x9d\xa1\xe7\x9c\xa0\xe6\x99\x82\xe9\x96\x93'
print len(str)
12
print str
睡眠時間 <<<
note that four characters are displayed
How can i know, from the string, that 4 characters are going to be displayed ?
This string
str = '\xe7\x9d\xa1\xe7\x9c\xa0\xe6\x99\x82\xe9\x96\x93'
is an encoded representation of Unicode code points. It contains bytes, so len(str) returns the number of bytes.
You want to know how many Unicode code points the string contains. For that, you need to know which encoding was used to encode those code points. The most popular encoding is UTF-8, in which one code point takes from 1 to 4 bytes. But you don't need to remember that; just decode the string:
>>> str.decode('utf8')
u'\u7761\u7720\u6642\u9593'
Here you can see 4 Unicode code points.
Print it, to see printable version:
>>> print str.decode('utf8')
睡眠時間
And get the number of Unicode code points:
>>> len(str.decode('utf8'))
4
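If you are on Python 3, the same steps look roughly like this. Bytes and text are separate types there, so the byte string needs a b prefix; the variable names are only illustrative:

```python
# Python 3: the UTF-8 data is a bytes object, the decoded text is a str.
raw = b'\xe7\x9d\xa1\xe7\x9c\xa0\xe6\x99\x82\xe9\x96\x93'
text = raw.decode('utf-8')

print(len(raw))    # number of bytes
print(len(text))   # number of code points
print(text)
```

Each of these four characters takes three bytes in UTF-8, which is where the 12 comes from.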
UPDATE: See also abarnert's answer, which covers more cases.
If you actually want "displayable characters", you have to do two things.
First, you have to convert the string from UTF-8 to Unicode, as explained by stalk:
s = '\xe7\x9d\xa1\xe7\x9c\xa0\xe6\x99\x82\xe9\x96\x93'
u = s.decode('utf-8')
Next, you have to filter out all code points that don't represent displayable characters. You can use the unicodedata module for this. The category function can give you the general category of any code point. To make sense of these categories, look at the General Categories table in the version of the Unicode Character Database linked from your version of Python's unicodedata docs.
For Python 2.7.8, which uses UCD 5.2.0, you have to do a bit of interpretation to decide what counts as "displayable", because Unicode didn't really have anything corresponding to "displayable". But let's say you've decided that all control, format, private-use, and unassigned characters are not displayable, and everything else is. Then you'd write:
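import unicodedata

def displayable(c):
    # General categories starting with 'C' (Cc, Cf, Cs, Co, Cn) are not displayable.
    return not unicodedata.category(c).startswith('C')
p = u''.join(c for c in u if displayable(c))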
Or, if you've decided that Mn and Me are also not "displayable" but Mc is:
def displayable(c):
    return unicodedata.category(c) not in {'Mn', 'Me', 'Cc', 'Cf', 'Co', 'Cn'}
But even this may not be what you want. For example, does a letter followed by a nonspacing combining mark count as one character or two? The standard example is U+0043 plus U+0327: two code points that make up one character, Ç (but U+00C7 is also that same character in a single code point). Often, just normalizing your string appropriately (which usually means NFKC or NFKD) is enough to solve that, once you know what answer you want. Until you can answer that, of course, nobody can tell you how to do it.
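As a small illustration of the normalization point, here is a sketch in Python 3 syntax using the Ç example. It only shows how the two forms differ in length, not which one you should pick:

```python
import unicodedata

decomposed = 'C\u0327'   # LATIN CAPITAL LETTER C + COMBINING CEDILLA
composed = unicodedata.normalize('NFKC', decomposed)

print(len(decomposed))   # two code points
print(len(composed))     # one code point after composition
print(composed == '\u00c7')

# Going the other way splits the single code point back into two.
print(unicodedata.normalize('NFKD', '\u00c7') == decomposed)
```

Counting after normalizing to a composed form gives one "character" here, while counting the decomposed form gives two, which is exactly the ambiguity described above.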
If you're thinking, "This sucks, there should be an official definition of what 'printable' means, and Python should know that definition", well, they do, you just need to use a newer version of Python. In 3.x, you can just write:
p = ''.join(c for c in u if c.isprintable())
But of course that only works if their definition of "printable" happens to match what you mean by "displayable". And it very well may not—for example, they consider all separators except ' ' non-printable. Obviously they can't include definitions for any distinction anyone might want to make.
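A quick Python 3 sketch of how that definition behaves on a couple of edge cases. The sample string and the helper name count_printable are my own, chosen to include a format character and a non-ASCII separator:

```python
def count_printable(text):
    # Uses Python's definition of "printable", which may differ from "displayable".
    return sum(1 for ch in text if ch.isprintable())

# ZERO WIDTH SPACE (category Cf) and NO-BREAK SPACE (category Zs)
sample = 'a\u200bb\u00a0c'

print(len(sample))
print(count_printable(sample))
print(count_printable('a b'))   # the ASCII space is the one printable separator
```

Whether dropping the no-break space is what you want depends on what "displayable" means for your use case.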
I'm trying to process some Bibtex entries converted to an XML tree via Pybtex. I'd like to go ahead and process all the special characters from their LaTeX specials to unicode characters, via latexcodec. Via question Does pybtex support accent/special characters in .bib file? and the documentation I have checked the syntax, however, I am not getting the correct output.
>>> import latexcodec
>>> name = 'Br\"{u}derle'
>>> name.decode('latex')
u'Br"{u}derle'
I have tested this across different strings and special characters, and it always just strips off the first backslash without translating the character. Should I be using latexcodec differently to get the correct output?
Your backslash is not included in the string at all because it is treated as a string escape, so the codec never sees it:
>>> print 'Br\"{u}derle'
Br"{u}derle
Use a raw string:
name = r'Br\"{u}derle'
Alternatively, try reading actual data from a file, in which case the raw/non-raw distinction will not matter. (The distinction only applies to literal strings in Python source code.)
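You can check the difference between the two literals without involving the codec at all. This sketch uses only built-in string operations and makes no latexcodec calls:

```python
plain = 'Br\"{u}derle'    # \" is a string escape: the backslash disappears
raw = r'Br\"{u}derle'     # raw string: the backslash is kept

print(plain)
print(raw)
print(len(plain), len(raw))
print('\\' in plain, '\\' in raw)
```

Only the raw version still contains the backslash that the LaTeX decoder needs to see.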
How do I get rid of non-ASCII characters like "^L,¢,â" in Perl and Python? I'm getting these special characters while parsing PDF files in Python and Perl. I now have text versions of these PDF files, but they contain these special characters. Is there any function that ensures a file or a variable does not contain any non-ASCII characters?
The direct answer to your question, in Python, is to use .encode('ascii', 'ignore'), on the Unicode string in question. This will convert the Unicode string to an ASCII string and take out any non-ASCII characters:
>>> u'abc\x0c¢â'.encode('ascii', errors='ignore')
'abc\x0c'
Note that it did not take out the '\x0c'. I put that in because you mentioned the character "^L", by which I assume you mean the form-feed character '\x0c' which can be typed with Ctrl+L. That is an ASCII character, and if you want to take that out, you will also need to write some other code to remove it, such as:
>>> str(''.join([c for c in u'abc\x0c¢â' if 32 <= ord(c) < 128]))
'abc'
BUT this possibly won't help you, because I suspect you don't just want to delete these characters, but actually resolve problems relating to why they are there in the first place. In this case, it could be because of Unicode encoding issues. To deal with that, you will need to ask much more specific questions with specific examples about what you expect and what you are seeing.
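For reference, the two snippets above translate to Python 3 roughly as follows. The helper name keep_printable_ascii is made up, and the 32 to 127 range simply mirrors the filter shown earlier:

```python
def keep_printable_ascii(text):  # hypothetical helper name
    # Keep code points 32..127, the same range as the filter above.
    return ''.join(ch for ch in text if 32 <= ord(ch) < 128)

sample = 'abc\x0c\u00a2\u00e2'   # 'abc', form feed, cent sign, a with circumflex

# encode(..., 'ignore') drops non-ASCII characters but keeps the form feed.
ascii_only = sample.encode('ascii', 'ignore').decode('ascii')
cleaned = keep_printable_ascii(sample)

print(repr(ascii_only))
print(repr(cleaned))
```

In Python 3, encode returns bytes, so the decode call is needed to get a str back.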
For the sake of completeness, some Perl solutions. Both return ,,. Unlike the accepted Python answer, I have used no magic numbers like 32 or 128. The constants here can be looked up much more easily in the documentation.
use 5.014; use Encode qw(encode); encode('ANSI_X3.4-1968', "\cL,¢,â", sub{q()}) =~ s/\p{PosixCntrl}//gr;
use 5.014; use Unicode::UCD qw(charinfo); join q(), grep { my $u = charinfo ord $_; 'Basic Latin' eq $u->{block} && 'Cc' ne $u->{category} } split //, "\cL,¢,â";
In Python you can (ab)use the encode function for this purpose (Python 3 prompt):
>>> "hello swede åäö".encode("ascii", "ignore")
b'hello swede '
åäö yields encoding errors, but since I have the errors flag on "ignore", it just happily goes on. Obviously this can mask other errors.
If you want to be absolutely sure you are not missing any "important" errors, register an error handler with codecs.register_error(name, error_handler). This would let you specify a replacement for each error instance.
Also note that in the example above, using Python 3, I get a bytes object back. I would need to decode it back to Unicode should I need a string object.
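A minimal sketch of both points, a custom error handler and the conversion back to str, might look like this. The handler name "question_mark" and the choice of '?' as the replacement are assumptions made for illustration:

```python
import codecs

def replace_with_question_mark(error):
    # This sketch only handles encoding errors; anything else is re-raised.
    if not isinstance(error, UnicodeEncodeError):
        raise error
    bad = error.object[error.start:error.end]
    # Return the replacement text and the position where encoding resumes.
    return ('?' * len(bad), error.end)

# "question_mark" is a made-up handler name.
codecs.register_error('question_mark', replace_with_question_mark)

encoded = 'hello swede åäö'.encode('ascii', 'question_mark')
text = encoded.decode('ascii')   # back to a str object

print(encoded)
print(text)
```

Unlike "ignore", the replaced positions stay visible in the output, so dropped characters are easier to notice.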