I'm trying to process some BibTeX entries converted to an XML tree via Pybtex. I'd like to convert all the special characters from their LaTeX forms to Unicode characters via latexcodec. Following the question "Does pybtex support accent/special characters in .bib file?" and the documentation, I have checked the syntax; however, I am not getting the correct output:
>>> import latexcodec
>>> name = 'Br\"{u}derle'
>>> name.decode('latex')
u'Br"{u}derle'
I have tested this across different strings and special characters, and it always just strips off the backslash without translating the character. Should I be using latexcodec differently to get the correct output?
Your backslash is not included in the string at all because it is treated as a string escape, so the codec never sees it:
>>> print 'Br\"{u}derle'
Br"{u}derle
Use a raw string:
name = r'Br\"{u}derle'
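With the raw string the backslash survives, so the codec actually sees it and can translate the accent. A quick sanity check (Python 2, as in the question; a sketch of the output I'd expect from latexcodec):
>>> import latexcodec
>>> name = r'Br\"{u}derle'
>>> name.decode('latex')
u'Br\xfcderle'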
Alternatively, try reading actual data from a file, in which case the raw/non-raw distinction will not matter. (The distinction only applies to literal strings in Python source code.)
Decoding normal URL-escaped characters is a fairly easy task with Python.
If you want to decode something like: Wikivoyage:%E5%88%A0%E9%99%A4%E8%A1%A8%E5%86%B3
All you need to use is:
import urllib.parse
urllib.parse.unquote('Wikivoyage:%E5%88%A0%E9%99%A4%E8%A1%A8%E5%86%B3')
And you get: 'Wikivoyage:删除表决'
However, I have identified some characters which this does not work with, namely 4-digit % encoded strings:
For example: %25D8
This apparently decodes to ◘
But if you use the urllib function I demonstrated previously, you get: %D8
I understand why this happens, the unquote command reads the %25 as a '%', which is what it normally translates to. Is there any way to get Python to read this properly? Especially in a string of similar characters?
The actual problem
In a comment you posted the real examples:
The data I am pulling from is just a list of url-encoded strings. One of the example strings I am trying to decode is represented as: %25D8%25A5%25D8%25B2%25D8%25A7%25D9%2584%25D8%25A9_%25D8%25A7%25D9%2584%25D8%25B4%25D8%25B9%25D8%25B1_%25D8%25A8%25D8%25A7%25D9%2584%25D9%2584%25D9%258A%25D8%25B2%25D8%25B1 This is the raw form of it. Other strings are normal url escapes such as: %D8%A5%D9%88%D8%B2
The first one is double-quoted, as wim pointed out. So they unquote as: إزالة_الشعر_بالليزر and إوز (which are Arabic for "laser hair removal" and "geese").
So you were mistaken about the unquoting and ◘ is a red herring.
Solution
Ideally you would fix whatever gave you this inconsistent data, but failing that, you could try detecting double-quoted strings, for example by checking whether the number of % characters equals the number of %25 sequences.
import urllib.parse

def unquote_possibly_double_quoted(s: str) -> str:
    if s.count('%') == s.count('%25'):
        # every '%' begins a '%25', so the string looks double-quoted
        s = urllib.parse.unquote(s)
    return urllib.parse.unquote(s)
>>> s = '%25D8%25A5%25D8%25B2%25D8%25A7%25D9%2584%25D8%25A9_%25D8%25A7%25D9%2584%25D8%25B4%25D8%25B9%25D8%25B1_%25D8%25A8%25D8%25A7%25D9%2584%25D9%2584%25D9%258A%25D8%25B2%25D8%25B1'
>>> unquote_possibly_double_quoted(s)
'إزالة_الشعر_بالليزر'
>>> unquote_possibly_double_quoted('%D8%A5%D9%88%D8%B2')
'إوز'
You might want to add some checks to this, for example requiring s.count('%') > 0 (or '%' in s), as in the sketch below.
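For instance, folding that guard in (a sketch; for a %-free string the behavior is unchanged anyway, since unquoting it is a no-op):
import urllib.parse

def unquote_possibly_double_quoted(s: str) -> str:
    # require at least one '%' before applying the double-quoted heuristic
    if '%' in s and s.count('%') == s.count('%25'):
        s = urllib.parse.unquote(s)
    return urllib.parse.unquote(s)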
This link lists some Python-specific encodings.
One of the encodings is "unicode_escape".
I am just trying to figure out: is this special encoding really needed?
>>> l = r'C:\Users\userx\toot'
>>> l
'C:\\Users\\userx\\toot'
>>> l.encode('unicode_escape').decode()
'C:\\\\Users\\\\userx\\\\toot'
As you can see above, l, which is a unicode object, has already taken care of escaping the backslashes. Converting it to the "unicode_escape" encoding adds one more set of escaped backslashes, which doesn't make any sense to me.
Questions:
Is "unicode_escape" encoding really needed?
why did "unicode_escape" added one more sets of backslashes above?
Quoting the document you linked:
Encoding suitable as the contents of a Unicode literal in ASCII-encoded Python source code, except that quotes are not escaped. Decodes from Latin-1 source code. Beware that Python source code actually uses UTF-8 by default.
Thus, print(l.encode('unicode_escape').decode()) does something almost exactly equivalent to print(repr(l)), except that it doesn't add quotes on the outside and escape quotes on the inside of the string.
When you leave off the print(), the REPL does a default repr(), so you get backslashes escaped twice -- exactly the same thing as happens when you run >>> repr(l).
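To see the equivalence concretely:
>>> l = r'C:\Users\userx\toot'
>>> print(l.encode('unicode_escape').decode())
C:\\Users\\userx\\toot
>>> print(repr(l))
'C:\\Users\\userx\\toot'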
I am trying to make a random wiki page generator that asks the user whether or not they want to access a random wiki page. However, some of these pages have accented characters, and I would like to display them in Git Bash when I run the code. I am using the cmd module to allow for user input. Right now I display titles using:
r_site = requests.get("http://en.wikipedia.org/w/api.php?action=query&list=random&rnnamespace=0&rnlimit=10&format=json")
print(json.loads(r_site.text)["query"]["random"][0]["title"].encode("utf-8"))
At times it works, but whenever an accented character appears it shows up like 25\xe2\x80\x9399.
Any workarounds or alternatives? Thanks.
import sys
change your encode to .encode(sys.stdout.encoding, errors="some string")
where "some string" can be one of the following:
'strict' (the default) - raises a UnicodeEncodeError when an unencodable character is encountered
'ignore' - don't print the unencodable characters
'replace' - replace the unencodable characters with a ?
'xmlcharrefreplace' - replace unencodable characters with xml escape sequence
'backslashreplace' - replace unencodable characters with an escaped Unicode code point value
So no, there is no way to get the character to show up if the locale of your terminal doesn't support it. But these options let you choose what to do instead.
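For example, using 'ascii' to stand in for a terminal encoding that can't represent the en dash from the question's output (the \xe2\x80\x93 bytes are U+2013):
>>> title = '25\u201399'
>>> title.encode('ascii', errors='replace')
b'25?99'
>>> title.encode('ascii', errors='xmlcharrefreplace')
b'25&#8211;99'
>>> title.encode('ascii', errors='backslashreplace')
b'25\\u201399'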
Check the Error Handlers section of the codecs documentation for more reference.
I assume this is Python 3.x, given that you're writing 3.x-style print function calls.
In Python 3.x, printing any object calls str on that object, then encodes it to sys.stdout.encoding for printing.
So, if you pass it a Unicode string, it just works (assuming your terminal can handle Unicode, and Python has correctly guessed sys.stdout.encoding):
>>> print('abcé')
abcé
But if you pass it a bytes object, like the one you got back from calling .encode('utf-8'), the str function formats it like this:
>>> print('abcé'.encode('utf-8'))
b'abc\xc3\xa9'
Why? Because a bytes object isn't a string, and that's how bytes objects get printed: the b prefix, the quotes, and backslash escapes for every non-printable-ASCII byte.
The solution is just to not call encode('utf-8').
Most likely your confusion comes from reading code written for Python 2.x, where bytes and str are the same type (and the type that print actually wants), and trying to use it in Python 3.x.
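Applied to the code from the question, that just means dropping the encode call and printing the str (assuming, as above, that the terminal and sys.stdout.encoding can handle it):
import json
import requests

r_site = requests.get("http://en.wikipedia.org/w/api.php?action=query&list=random&rnnamespace=0&rnlimit=10&format=json")
# print() encodes the str for the terminal itself; no .encode('utf-8') needed
print(json.loads(r_site.text)["query"]["random"][0]["title"])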
I'm tagging some unicode text with Python NLTK.
The issue is that the text is from data sources that are badly encoded, and do not specify the encoding. After some messing, I figured out that the text must be in UTF-8.
Given the input string:
s = u"The problem isn’t getting to Huancavelica from Huancayo to the north."
I want to process it with NLTK, for example for POS tagging, but the special characters are not resolved, and I get output like:
The/DT problem/NN isn&#8217;t/NN getting/VBG
Instead of:
The/DT problem/NN isn't/VBG getting/VBG
How do I clean these special characters out of the text?
Thanks for any feedback,
Mulone
UPDATE: If I run HTMLParser().unescape(s), I get:
u'The problem isn\u2019t getting to Huancavelica from Huancayo to the north.'
In other cases, I still get things like &amp; and &#13; in the text.
What do I need to do to translate this into something that NLTK will understand?
This is not a character/Unicode encoding issue. The text you have contains XML/HTML numeric character references, which are markup. Whatever library you're using to parse the file should provide some function to dereference &#8217; to the appropriate character.
If you're not bound to any library, see Decode HTML entities in Python string?
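For instance, with the standard library (html.unescape in Python 3; the linked question also covers Python 2's HTMLParser):
>>> import html
>>> html.unescape('isn&#8217;t')
'isn’t'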
The resulting string includes a special apostrophe instead of an ASCII single quote. You can just replace it in the result:
In [6]: s = u"isn&#8217;t"
In [7]: print HTMLParser.HTMLParser().unescape(s)
isn’t
In [8]: print HTMLParser.HTMLParser().unescape(s).replace(u'\u2019', "'")
isn't
Unescape will take care of the rest of the characters. For example, &amp; is the & symbol itself.
&#13; is a CR character (\r) and can be either ignored or converted into a newline, depending on where the original text comes from (old Macs used it for line endings).
How do I get rid of non-ASCII characters like "^L", "¢", and "â" in Perl and Python? I'm getting these special characters while parsing PDF files in Python and Perl. I now have text versions of these PDF files, but with the special characters still in them. Is there any function available that will ensure a file or a variable does not contain any non-ASCII characters?
The direct answer to your question, in Python, is to use .encode('ascii', 'ignore'), on the Unicode string in question. This will convert the Unicode string to an ASCII string and take out any non-ASCII characters:
>>> u'abc\x0c¢â'.encode('ascii', errors='ignore')
'abc\x0c'
Note that it did not take out the '\x0c'. I put that in because you mentioned the character "^L", by which I assume you mean the form-feed character '\x0c' which can be typed with Ctrl+L. That is an ASCII character, and if you want to take that out, you will also need to write some other code to remove it, such as:
>>> str(''.join([c for c in u'abc\x0c¢â' if 32 <= ord(c) < 128]))
'abc'
BUT this possibly won't help you, because I suspect you don't just want to delete these characters, but actually resolve problems relating to why they are there in the first place. In this case, it could be because of Unicode encoding issues. To deal with that, you will need to ask much more specific questions with specific examples about what you expect and what you are seeing.
For the sake of completeness, some Perl solutions. Both return ",,". Unlike the accepted Python answer, I have used no magic numbers like 32 or 128. The constants here can be looked up much more easily in the documentation.
use 5.014;
use Encode qw(encode);
encode('ANSI_X3.4-1968', "\cL,¢,â", sub { q() }) =~ s/\p{PosixCntrl}//gr;

use 5.014;
use Unicode::UCD qw(charinfo);
join q(),
    grep {
        my $u = charinfo ord $_;
        'Basic Latin' eq $u->{block} && 'Cc' ne $u->{category}
    }
    split //, "\cL,¢,â";
In Python you can (ab)use the encode function for this purpose (Python 3 prompt):
>>> "hello swede åäö".encode("ascii", "ignore")
b'hello swede '
åäö yields encoding errors, but since the errors flag is set to "ignore", encoding just happily goes on. Obviously this can mask other errors.
If you want to be absolutely sure you are not missing any "important" errors, register an error handler with codecs.register_error(name, error_handler). This would let you specify a replacement for each error instance.
Also note that, in the example above using Python 3, I get a bytes object back; I would need to decode it back into a str should I need a proper string object.
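A minimal sketch of such a handler (the name mark_errors is made up for illustration; the encoder passes the handler whole runs of consecutive unencodable characters):
import codecs

def mark_errors(err):
    # drop the unencodable run, insert a visible marker instead,
    # and resume encoding after the run
    return ('<?>', err.end)

codecs.register_error('mark_errors', mark_errors)

print("hello swede åäö".encode('ascii', 'mark_errors'))
# b'hello swede <?>'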