I'm tagging some unicode text with Python NLTK.
The issue is that the text is from data sources that are badly encoded, and do not specify the encoding. After some messing, I figured out that the text must be in UTF-8.
Given the input string:
s = u"The problem isn&#8217;t getting to Huancavelica from Huancayo to the north."
I want to process it with NLTK, for example for POS tagging, but the special characters are not resolved, and I get output like:
The/DT problem/NN isn&#8217;t/NN getting/VBG
Instead of:
The/DT problem/NN isn't/VBG getting/VBG
How do I clean these special characters out of the text?
Thanks for any feedback,
Mulone
UPDATE: If I run HTMLParser().unescape(s), I get:
u'The problem isn\u2019t getting to Huancavelica from Huancayo to the north.'
In other cases, I still get things like &amp; and &#13; in the text.
What do I need to do to translate this into something that NLTK will understand?
This is not a character/Unicode encoding issue. The text you have contains XML/HTML numeric character references, which are markup. Whatever library you're using to parse the file should provide some function to dereference &#8217; to the appropriate character.
If you're not bound to any library, see Decode HTML entities in Python string?
The resulting string includes a special apostrophe instead of an ascii single-quote. You can just replace it in the result:
In [6]: s = u"isn&#8217;t"
In [7]: print HTMLParser.HTMLParser().unescape(s)
isn’t
In [8]: print HTMLParser.HTMLParser().unescape(s).replace(u'\u2019', "'")
isn't
Unescape will take care of the rest of the characters. For example, &amp; is the & symbol itself.
&#13; is a CR symbol (\r) and can be either ignored or converted into a newline, depending on where the original text comes from (old Macs used it for newlines).
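On Python 3, the same steps can be sketched with the standard library's html.unescape, which takes the place of the HTMLParser().unescape call used above. The helper name clean_entities and the choice to turn carriage returns into newlines are assumptions made for this illustration, not part of the original answer.

```python
import html

def clean_entities(text):
    """Decode HTML/XML character references, then tidy the result."""
    decoded = html.unescape(text)              # &#8217; -> U+2019, &amp; -> &
    decoded = decoded.replace("\u2019", "'")   # curly apostrophe -> ASCII quote
    # Assumption: carriage returns should be treated as newlines.
    decoded = decoded.replace("\r\n", "\n").replace("\r", "\n")
    return decoded

cleaned = clean_entities("The problem isn&#8217;t here &amp; there.&#13;Next line")
print(cleaned)
```

The cleaned string can then be passed to the tokenizer and tagger as usual. If the source text comes from old Mac files, converting \r to \n is probably right; otherwise dropping it may be the better choice.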
I only code in English, but I have to deal with Python Unicode all the time.
Sometimes it's hard to remove Unicode characters from a dict.
How can I change Python's default character encoding to ASCII?
That would be the wrong thing to do. As in very wrong. To start with, it would only give you a UnicodeDecodeError instead of removing the characters. Learn proper encoding and decoding to/from Unicode so that you can filter out the values using rules like errors="ignore".
You can't just ignore the characters that are part of your data just because you "dislike" them. It is text, and in an interconnected world, text is not composed of only 26 glyphs.
I'd suggest you get started by reading this document: http://www.joelonsoftware.com/articles/Unicode.html
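As a rough illustration of what explicit decoding and encoding with an error rule looks like (Python 3 syntax; the sample bytes are made up for this sketch):

```python
raw = b"caf\xc3\xa9 na\xc3\xafve"   # UTF-8 bytes, as they might arrive from a file
text = raw.decode("utf-8")          # bytes -> str

# Encoding to ASCII fails loudly by default...
try:
    text.encode("ascii")
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

# ...unless an error rule says what to do with unencodable characters.
ignored = text.encode("ascii", errors="ignore")    # characters silently dropped
replaced = text.encode("ascii", errors="replace")  # each one becomes '?'
print(strict_failed, ignored, replaced)
```

Note that both rules lose information, which is why decoding correctly at the boundary is usually preferable to stripping characters afterwards.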
I'm trying to process some Bibtex entries converted to an XML tree via Pybtex. I'd like to go ahead and process all the special characters from their LaTeX specials to unicode characters, via latexcodec. Via question Does pybtex support accent/special characters in .bib file? and the documentation I have checked the syntax, however, I am not getting the correct output.
>>> import latexcodec
>>> name = 'Br\"{u}derle'
>>> name.decode('latex')
u'Br"{u}derle'
I have tested this across different strings and special characters, and it always just strips off the first backslash without translating the character. Should I be using latexcodec differently to get the correct output?
Your backslash is not included in the string at all because it is treated as a string escape, so the codec never sees it:
>>> print 'Br\"{u}derle'
Br"{u}derle
Use a raw string:
name = r'Br\"{u}derle'
Alternatively, try reading actual data from a file, in which case the raw/non-raw distinction will not matter. (The distinction only applies to literal strings in Python source code.)
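The difference can be checked without latexcodec at all. A minimal sketch (Python 3 syntax; the variable names are only for illustration):

```python
plain = 'Br\"{u}derle'   # \" is a string escape for a double quote
raw = r'Br\"{u}derle'    # raw string: the backslash is kept

print(plain)
print(raw)
# Only the raw string still contains a backslash for the codec to see.
print('\\' in plain, '\\' in raw)
```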
I want to search and replace the character '№' in a string.
I am not sure if it's actually a single character or two.
How do I do it? What is its unicode?
If it's any help, I am using Python3.
EDIT: The sentence "I am not sure if it's actually a single character or two" kind of deformed my question. I actually wanted to know its unicode so that I could use the code instead of pasting the character in my python script.
In Python 3 it is always a single character.
>>> 'foo№bar'.replace('№', '#')
'foo#bar'
That character is U+2116 ɴᴜᴍᴇʀᴏ sɪɢɴ.
You can just type it directly in your source file, taking care to specify the source file encoding as per PEP 263.
Alternatively, you can use either the numeric Unicode escapes, or the more readable named Unicode escapes:
>>> 'foo\u2116'
'foo№'
>>> 'foo\N{NUMERO SIGN}'
'foo№'
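If you have the character but not its code point, the standard library can tell you. A small sketch in Python 3 (the variable name ch is arbitrary):

```python
import unicodedata

ch = '№'
print(len(ch))                # number of code points in the string
print('U+%04X' % ord(ch))     # code point in the usual U+ notation
print(unicodedata.name(ch))   # official Unicode character name
```

The name printed by unicodedata.name is the one to use inside a \N{...} escape.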
I have a file with sentences, some of which are in Spanish and contain accented letters (e.g. é) or special characters (e.g. ¿). I have to be able to search for these characters in the sentence so I can determine if the sentence is in Spanish or English.
I've tried my best to accomplish this, but have had no luck in getting it right. Below is one of the solutions I tried, but it clearly gave the wrong answer.
sentence = '¿Qué tipo es el?'  # in str format, received from standard open file method
sentence = sentence.decode('latin-1')
print 'é'.decode('latin-1') in sentence
>>> False
I've also tried using codecs.open(.., .., 'latin-1') to read in the file instead, but that didn't help. Then I tried u'é'.encode('latin-1'), and that didn't work.
I'm out of ideas here, any suggestions?
@icktoofay provided the solution. I ended up keeping the decoding of the file (using latin-1), but then using Python Unicode literals for the characters (u'é'). This required me to declare the source encoding at the top of the script. The final step was to use the unicodedata.normalize method to normalize both strings, then compare accordingly. Thank you guys for the prompt and great support.
Use unicodedata.normalize on the string before checking.
Explanation
Unicode offers multiple forms to create some characters. For example, á could be represented with a single character, á, or two characters: a, then 'put a ´ on top of that'. Normalizing the string will force it to one or the other of the representations. (which representation it normalizes to depends on what you pass as the form parameter)
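A short sketch of the two representations and the effect of normalizing (Python 3 syntax; NFC is chosen here only as an example form, NFD would work equally well as long as both sides use the same one):

```python
import unicodedata

composed = '\u00e9'      # é as a single code point
decomposed = 'e\u0301'   # e followed by COMBINING ACUTE ACCENT

print(composed == decomposed)   # the raw strings differ

# After normalizing both to the same form, they compare equal.
nfc_equal = (unicodedata.normalize('NFC', composed)
             == unicodedata.normalize('NFC', decomposed))

# Substring checks work once the haystack is normalized too.
sentence = unicodedata.normalize('NFC', 'Qu' + decomposed + ' tipo')
found = composed in sentence
print(nfc_equal, found)
```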
I suspect your terminal is using UTF-8, so 'é'.decode('latin-1') is incorrect. Just use a Unicode constant instead: u'é'.
To handle Unicode correctly in a script, declare the script and data file encodings, decode incoming data, and encode outgoing data. Use Unicode strings for text in the script.
Example (save script in UTF-8):
# coding: utf8
import codecs
with codecs.open('input.txt',encoding='latin-1') as f:
    sentence = f.readline()
if u'é' in sentence:
    print u'Found é'
Note that print implicitly encodes the output in the terminal encoding.
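For comparison, a rough Python 3 version of the same idea, where open() accepts the encoding directly and string literals are already Unicode. The temporary file and the has_spanish_marks name are stand-ins invented for this sketch so that it runs on its own:

```python
import os
import tempfile

# Create a small Latin-1 encoded sample file (stand-in for the real input).
with tempfile.NamedTemporaryFile(delete=False, suffix='.txt') as tmp:
    tmp.write('¿Qué tipo es el?\n'.encode('latin-1'))
    path = tmp.name

# open() decodes the bytes for us when given the file's encoding.
with open(path, encoding='latin-1') as f:
    sentence = f.readline()
os.remove(path)

# Check for any of the characters that hint at Spanish text.
has_spanish_marks = any(ch in sentence for ch in 'é¿')
print(has_spanish_marks)
```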
How do I get rid of non-ASCII characters like "^L,¢,â" in Perl and Python? While parsing PDF files in Python and Perl, I'm getting these special characters. I now have text versions of these PDF files, but with these special characters. Is there any function available that will ensure that a file or a variable does not contain any non-ASCII characters?
The direct answer to your question, in Python, is to use .encode('ascii', 'ignore'), on the Unicode string in question. This will convert the Unicode string to an ASCII string and take out any non-ASCII characters:
>>> u'abc\x0c¢â'.encode('ascii', errors='ignore')
'abc\x0c'
Note that it did not take out the '\x0c'. I put that in because you mentioned the character "^L", by which I assume you mean the form-feed character '\x0c' which can be typed with Ctrl+L. That is an ASCII character, and if you want to take that out, you will also need to write some other code to remove it, such as:
>>> str(''.join([c for c in u'abc\x0c¢â' if 32 <= ord(c) < 128]))
'abc'
BUT this possibly won't help you, because I suspect you don't just want to delete these characters, but actually resolve problems relating to why they are there in the first place. In this case, it could be because of Unicode encoding issues. To deal with that, you will need to ask much more specific questions with specific examples about what you expect and what you are seeing.
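If deleting really is what you want, the filter above could be wrapped up roughly like this in Python 3. The function name printable_ascii is made up for this sketch, and unlike the 32 <= ord(c) < 128 test it also drops DEL (0x7F); note that it removes tabs and newlines as well, which may not be desirable for multi-line text.

```python
def printable_ascii(text):
    """Keep only characters in the printable ASCII range (space to tilde)."""
    return ''.join(c for c in text if ' ' <= c <= '~')

sample = 'abc\x0c¢â'
print(repr(printable_ascii(sample)))
```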
For the sake of completeness, some Perl solutions. Both return ,,. Unlike the accepted Python answer, I have used no magic numbers like 32 or 128. The constants here can be looked up much more easily in the documentation.
use 5.014; use Encode qw(encode); encode('ANSI_X3.4-1968', "\cL,¢,â", sub{q()}) =~ s/\p{PosixCntrl}//gr;
use 5.014; use Unicode::UCD qw(charinfo); join q(), grep { my $u = charinfo ord $_; 'Basic Latin' eq $u->{block} && 'Cc' ne $u->{category} } split //, "\cL,¢,â";
In Python you can (ab)use the encode function for this purpose (Python 3 prompt):
>>> "hello swede åäö".encode("ascii", "ignore")
b'hello swede '
åäö yields encoding errors, but since I have the errors flag on "ignore", it just happily goes on. Obviously this can mask other errors.
If you want to be absolutely sure you are not missing any "important" errors, register an error handler with codecs.register_error(name, error_handler). This would let you specify a replacement for each error instance.
Also note that in the example above, using Python 3, I get a bytes object back; I would need to decode it back to a proper string should I need a str object.
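A minimal sketch of such an error handler, assuming you want to substitute a question mark and keep a record of what was replaced. The handler name, the registered name "qmark_example", and the dropped list are all invented for this illustration:

```python
import codecs

dropped = []  # record of every character run the handler replaced

def question_mark_handler(error):
    """Replace unencodable characters with '?' and remember them."""
    if not isinstance(error, UnicodeEncodeError):
        raise error
    bad = error.object[error.start:error.end]
    dropped.append(bad)
    # Return the replacement text and the position to resume encoding from.
    return "?" * len(bad), error.end

codecs.register_error("qmark_example", question_mark_handler)

result = "hello swede åäö".encode("ascii", "qmark_example")
as_text = result.decode("ascii")   # back to a str object
print(result, "".join(dropped))
```

Inspecting dropped afterwards shows exactly which characters were lost, so unexpected ones do not go unnoticed.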