What are those u characters that appear when I use NLTK? [duplicate]

Like in:
u'Hello'
My guess is that it indicates "Unicode", is that correct?
If so, since when has it been available?

You're right, see 3.1.3. Unicode Strings.
It's been the syntax since Python 2.0.
Python 3 made them redundant, since the default string type is Unicode. Versions 3.0 through 3.2 removed the prefix, but it was re-added in 3.3+ for compatibility with Python 2, to aid the 2-to-3 transition.

The u in u'Some String' means that your string is a Unicode string.
Q: I'm in a terrible, awful hurry and I landed here from Google Search. I'm trying to write this data to a file, I'm getting an error, and I need the dead simplest, probably flawed, solution this second.
A: You should really read Joel's Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) essay on character sets.
Q: sry no time code pls
A: Fine. Try str('Some String') or 'Some String'.encode('ascii', 'ignore'). But you should really read some of the answers and discussion on Converting a Unicode string and this excellent primer on character encoding.
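A minimal Python 2.x sketch of that quick-and-dirty fix, assuming you just need to write some Unicode text to a plain file (the file name is hypothetical, and note that 'ignore' silently drops every non-ASCII character):

# Lossy quick fix: throw away anything that isn't ASCII before writing
text = u'S\xe3o Paulo'                        # u'São Paulo'
with open('out.txt', 'w') as fh:              # hypothetical output file
    fh.write(text.encode('ascii', 'ignore'))  # writes 'So Paulo' -- the accent is gone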

My guess is that it indicates "Unicode", is it correct?
Yes.
If so, since when is it available?
Python 2.x.
In Python 3.x, strings use Unicode by default and there's no need for the u prefix. Note: in Python 3.0-3.2, the u prefix is a syntax error. In Python 3.3+ it's legal again, to make it easier to write 2/3-compatible code.
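A quick interactive sketch of the difference on Python 2.x (on Python 3.3+ both literals produce the same str type):

>>> type('Hello')      # Python 2.x: a byte string
<type 'str'>
>>> type(u'Hello')     # Python 2.x: a Unicode string
<type 'unicode'>
>>> u'Hello' == 'Hello'
True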

I came here because I had funny-char-syndrome on my requests output. I thought response.text would give me a properly decoded string, but in the output I found funny double-chars where German umlauts should have been.
Turns out response.encoding was empty somehow and so response did not know how to properly decode the content and just treated it as ASCII (I guess).
My solution was to get the raw bytes with response.content and manually apply decode('utf_8') to them. The result was schöne Umlaute: the correctly decoded für vs. the improperly decoded fĂźr.
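A minimal sketch of that workaround, assuming a requests Response object whose declared encoding is missing or wrong (the URL is hypothetical):

import requests

response = requests.get('http://example.com/page')  # hypothetical URL
# response.text decodes using response.encoding, which may be guessed wrongly;
# decoding the raw bytes yourself sidesteps the bad guess:
text = response.content.decode('utf_8')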

All strings meant for humans should use u"".
I found that the following mindset helps a lot when dealing with Python strings: All Python manifest strings should use the u"" syntax. The "" syntax is for byte arrays only.
Before the bashing begins, let me explain. Most Python programs start out with using "" for strings. But then they need to support documentation off the Internet, so they start using "".decode and all of a sudden they are getting exceptions everywhere about decoding this and that - all because of the use of "" for strings. In this case, Unicode does act like a virus and will wreak havoc.
But, if you follow my rule, you won't have this infection (because you will already be infected).

Related

What and why is the small u in {u'key': value} printed out from the list of collections.Counter item [duplicate]

For-loop printing oddly [duplicate]

How to find right encoding in python? [duplicate]

This question already has answers here:
How to determine the encoding of text
(16 answers)
Closed 5 years ago.
I'm trying to get rid of diacritics in my text file. I converted a PDF to text with a tool I didn't write myself, and I wasn't able to find out which encoding it uses. The text is written in Nahuatl, whose orthography is similar to Spanish.
I transformed the text into a list of strings. Now I'm trying to do the following:
# check whether there is a non-ASCII character in the item
def is_ascii(word):
    check = string.ascii_letters + "."
    if word not in check:
        return False
    return True

# if there is a non-ASCII character, encode the string
def to_ascii(word):
    if is_ascii(word) == False:
        newWord = word.encode("utf8")
        return newWord
    return word
What I want to get is a Unicode version of my string. It doesn't work so far, and I tried several encodings like latin1, cp1252, and iso-8859-1. Can anybody tell me what I did wrong?
How can I find out the right encoding?
Thank you!
EDIT:
I wrote to the people that developed the converter (pdf-txt) and they said they were using unicode already. So John Machin was right with (1) in his answer.
As I wrote in a comment, this wasn't clear to me, because in the Eclipse debugger the list itself showed some characters as Unicode escapes and others not. But when I looked at the items separately, they were all decoded in some way, so I actually saw Unicode.
Thank you for your help!
Edit your question to show the version of Python you are using. Guessing the version from your code is not possible. Whether you are using Python 3.X or 2.X matters a lot. Following remarks assume Python 2.x.
You already seem to have determined that you have UTF-8 encoded text. Try the_text.decode('utf8'). Note decode, NOT encode.
If decoding with UTF-8 does not raise UnicodeDecodeError and your text is not trivially short, then it is very close to certain that UTF-8 is the correct encoding.
If the above does not work, show us the result of print repr(the_text).
Note that it is counter-productive trying to check whether the file is encoded in ASCII -- ASCII is a subset of UTF-8. Leaving some data as str objects and other data as unicode is messy in Python 2.x and won't work at all in Python 3.x.
In any case, your first function doesn't do what you think it does: the in operator tests substring membership, not character-by-character membership, so it returns False for virtually any real word of two or more characters. Please consider unit-testing functions as you write them; it makes debugging much faster later on.
Note that latin1 and iso-8859-1 are the same encoding. Because latin1 maps every byte value 0-255 directly to the first 256 code points in Unicode, it is impossible for text.decode('latin1') to raise UnicodeDecodeError. "No error" in this case has exactly zero diagnostic value.
Update in response to this comment from OP:
I use Python 2.7. If I use text.decode("utf8") it raises the following
error: UnicodeEncodeError: 'latin-1' codec can't encode character
u'\u2014' in position 0: ordinal not in range(256).
That can happen two ways:
(1) In a single statement like foo = text.decode('utf8'), text is already a unicode object so Python 2.X tries to encode it using the default encoding (latin-1 ???).
(2) Possibly in two different statements, first foo = text.decode('utf8') where text is an str object encoded in UTF-8, and this statement doesn't raise an error, followed by something like print foo and your sys.stdout.encoding is latin-1 (???).
I can't imagine why you have "ticked" my answer as correct. Nobody knows what the question is yet!
Please edit your question to show your code (insert print repr(text) just before the text.decode("utf8") line), and the result of running it. Show the repr() result and the full traceback (so that we can determine which line is causing the error).
I ask again: can you make your file available for analysis?
By the way, u'\u2014' is an "EM DASH" and is a valid character in cp1252 (but not in latin-1, as you have seen from the error message). What version of what operating system are you using?
And to answer your last question, NO, you must NOT attempt to decode your text using every codec in the known universe. You are ALREADY getting plausible Unicode; something (your code?) is decoding something somehow -- the presence of u'\u2014' is enough evidence of that. Just show us your code and its results.
If you have read some bytes and want to interpret them as a Unicode string, then you have to use .decode() rather than .encode().
Like #delnan said in the comment, I hope you know the encoding. If not, the guesswork should get easier once you fix the function.
BTW even if there are only ASCII characters in that word, why not .decode() it too? You'd have the same data type (unicode) everywhere, which will make your program simpler.
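A small Python 2.x sketch of the decode/encode direction, reusing the umlaut example from above:

raw = 'f\xc3\xbcr'               # UTF-8 encoded bytes, e.g. read from a file
text = raw.decode('utf-8')       # u'f\xfcr', i.e. the unicode string u'für'
back = text.encode('utf-8')      # bytes again, identical to raw
assert back == raw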

Python and Unicode: How everything should be Unicode

Forgive me if this is a long question:
I have been programming in Python for around six months. Self taught, starting with the Python tutorial and then SO and then just using Google for stuff.
Here is the sad part: No one told me all strings should be Unicode. No, I am not lying or making this up, but where does the tutorial mention it? And most examples also I see just make use of byte strings, instead of Unicode strings. I was just browsing and came across this question on SO, which says how every string in Python should be a Unicode string. This pretty much made me cry!
I read that every string in Python 3.0 is Unicode by default, so my questions are for 2.x:
Should I do a:
print u'Some text' or just print 'Text' ?
Everything should be Unicode, does this mean, like say I have a tuple:
t = ('First', 'Second'), it should be t = (u'First', u'Second')?
I read that I can do a from __future__ import unicode_literals and then every string will be a Unicode string, but should I do this inside a container also?
When reading/writing to a file, I should use the codecs module. Right? Or should I just use the standard way of reading/writing and encode or decode where required?
If I get the string from say raw_input(), should I convert that to Unicode also?
What is the common approach to handling all of the above issues in 2.x? The from __future__ import unicode_literals statement?
Sorry for being a such a noob, but this changes what I have been doing for a long time and so clearly I am confused.
The "always use Unicode" suggestion is primarily to make the transition to Python 3 easier. If you have a lot of non-Unicode string access in your code, it'll take more work to port it.
Also, you shouldn't have to decide on a case-by-case basis whether a string should be stored as Unicode or not. You shouldn't have to change the types of your strings and their very syntax just because you changed their contents, either.
It's also easy to use the wrong string type, leading to code that mostly works, or code which works in Linux but not in Windows, or in one locale but not another. For example, for c in "漢字" in a UTF-8 locale will iterate over each UTF-8 byte (all six of them), not over each character; whether that breaks things depends on what you do with them.
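For instance, a quick Python 2.x sketch of that byte/character mismatch (assuming a UTF-8 encoded source file):

# -*- coding: utf-8 -*-
print len("漢字")     # 6 -- six UTF-8 bytes
print len(u"漢字")    # 2 -- two characters
for c in "漢字":      # iterates over six individual bytes, not two characters
    print repr(c)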
In principle, nothing should break if you use Unicode strings, but things may break if you use regular strings when you shouldn't.
In practice, however, it's a pain to use Unicode strings everywhere in Python 2. codecs.open doesn't pick the correct locale automatically; this fails:
codecs.open("blar.txt", "w").write(u"漢字")
The real answer is:
import locale, codecs
lang, encoding = locale.getdefaultlocale()
codecs.open("blar.txt", "w", encoding).write(u"漢字")
... which is cumbersome, forcing people to make helper functions just to open files. codecs.open should be using the encoding from locale automatically when one isn't specified; Python's failure to make such a simple operation convenient is one of the reasons people generally don't use Unicode everywhere.
Finally, note that Unicode strings are even more critical in Windows in some cases. For example, if you're in a Western locale and you have a file named "漢字", you must use a Unicode string to access it, eg. os.stat(u"漢字"). It's impossible to access it with a non-Unicode string; it just won't see the file.
So, in principle I'd say the Unicode string recommendation is reasonable, but with the caveat that I don't generally even follow it myself.
No, not every string "should be Unicode". Within your Python code, you know if a string literal needs to be Unicode or not, so it doesn't make any sense to make every string literal into a Unicode literal.
But there are cases where you should use Unicode. For example, if you have arbitrary input that is text, use Unicode for it. You will sooner or later find a non-American using it, and he wants to wrîte têxt ås hé is üsed tö. And you'll get problems in that case unless your input and output happen to use the same encoding, which you can't be sure of.
So in short, no, strings shouldn't be Unicode. Text should be. But YMMV.
Specifically:
No need to use Unicode here. You know if that string is ASCII or not.
Depends if you need to merge those strings with Unicode or not.
Both ways work. But do not encode/decode "when required": decode as soon as possible and encode as late as possible. Using codecs works well (or io, from Python 2.7); see the sketch below.
Yeah.
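A small sketch of the "decode early, encode late" pattern using io.open (the file names are hypothetical):

import io

# Decode at the boundary: io.open returns unicode text directly.
with io.open('notes.txt', 'r', encoding='utf-8') as fh:    # hypothetical input file
    text = fh.read()                                       # a unicode object
# ... work only with unicode inside the program ...
with io.open('out.txt', 'w', encoding='utf-8') as fh:      # hypothetical output file
    fh.write(text)                                         # encoded back to UTF-8 only on write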
IMHO (my simple rules):
Should I do a:
print u'Some text' or just print 'Text' ?
Everything should be Unicode, does this mean, like say I have a tuple:
t = ('First', 'Second'), it should be t = (u'First', u'Second')?
Well, I use unicode literals only when I have some char above ASCII 128:
print 'New York', u'São Paulo'
t = ('New York', u'São Paulo')
When reading/writing to a file, I should use the codecs module. Right? Or should I just use the standard way of reading/writing and encode or decode where required?
If you expect unicode text, use codecs.
If I get the string from say raw_input(), should I convert that to Unicode also?
Only if you expect unicode text that may get transfered to another system with distinct default encoding (including databases).
EDITED (about mixing unicode and byte strings):
>>> print 'New York', 'to', u'São Paulo'
New York to São Paulo
>>> print 'New York' + ' to ' + u'São Paulo'
New York to São Paulo
>>> print "Côte d'Azur" + ' to ' + u'São Paulo'
Traceback (most recent call last):
File "<interactive input>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 1:
ordinal not in range(128)
>>> print "Côte d'Azur".decode('utf-8') + ' to ' + u'São Paulo'
Côte d'Azur to São Paulo
So if you mix a byte string that contains UTF-8 (or other non-ASCII chars) with unicode text without explicit conversion, you will have trouble, because the default codec assumes ASCII. The other way around seems to be safe. If you follow the rule of writing every string containing non-ASCII characters as a unicode literal, you should be OK.
DISCLAIMER: I live in Brazil where people speak Portuguese, a language with lots of non-ascii chars. My default encoding is always set to 'utf-8'. Your mileage may vary in English/ascii systems.
I’m just adding my personal opinion here. Not as long and elaborate as the other answers, but maybe it can help, too.
print u'Some text' or just print 'Text' ?
I’d indeed prefer the first. If you know that you only have Unicode strings, you have one invariant more. Various other languages (C, C++, Perl, PHP, Ruby, Lua, …) sometimes encounter painful problems because of their lack of separation between code unit sequences and integer sequences. I find the approach of strict distinction between them that exists in .NET, Java, Python etc. quite a bit cleaner.
Everything should be Unicode, does this mean, like say I have a tuple:
t = ('First', 'Second'), it should be t = (u'First', u'Second')?
Yes.
I read that I can do a from __future__ import unicode_literals and then every string will be a Unicode string, but should I do this inside a container also?
Yes. Future statements apply only to the file where they’re used, so you can use them without interfering with other modules. I generally import all futures in Python 2.x modules to make the transition to 3.x easier.
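A tiny sketch of what that future statement does on Python 2.x (the module contents are hypothetical):

# -*- coding: utf-8 -*-
from __future__ import unicode_literals

s = 'São Paulo'      # a unicode object now, even without the u prefix
print(type(s))       # <type 'unicode'> on Python 2.x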
When reading/writing to a file, I should use the codecs module. Right? Or should I just use the standard way of reading/writing and encode or decode where required?
You should use the codecs module because that makes it impossible (or at least very hard) to accidentally write differently-encoded representations to a single file. It is also the way Python 3.x works when you open a file in text mode.
If I get the string from say raw_input(), should I convert that to Unicode also?
I’d say yes to this too: In most cases it’s easier to deal with only one encoding, so I recommend converting to Python Unicode strings as early as possible.
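A sketch of decoding raw_input() as early as possible on Python 2.x (the UTF-8 fallback is my own assumption, not a rule):

import sys

raw = raw_input('Name: ')                          # a byte string on Python 2.x
name = raw.decode(sys.stdin.encoding or 'utf-8')   # convert to unicode right away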
What is the common approach to handling all of the above issues in 2.x? The from __future__ import unicode_literals statement?
I don’t know what the common approach is, but I use that statement all the time. I have encountered only very few issues with this approach, and most of them are related to bugs in external libraries (e.g., NumPy sometimes requires byte strings without documenting that).
The fact that you were writing Python code for 6 months before encountering anything about Unicode means that the Python 2.x ASCII default for strings didn't cause you any problems. Certainly for a beginner to try to grasp the idea of Unicode/code points/encoding in itself is a hard issue to tackle; therefore, most tutorials naturally bypass it until you get more of a grounding in the fundamentals. That's why in a book like Dive Into Python, Unicode is only mentioned in later chapters.
If you need to support Unicode in your application, I suggest looking at Kumar McMillan's PyCon 2008 talk on Unicode for a list of best practices. It should answer your remaining questions.
1/2) Personally I've never heard of "always use unicode". That seems pretty stupid to me. I guess I understand if you plan to support other languages that need unicode support. But other than that I wouldn't do that, it seems like more of a pain than it's worth.
3) I would just read/write the standard way and encode when necessary.

Reading "raw" Unicode-strings in Python

I am quite new to Python, so my question might be silly, but even after reading through a lot of threads I didn't find an answer to my question.
I have a mixed source document which contains HTML, XML, LaTeX and other text formats, and which I am trying to get into a LaTeX-only format.
To that end, I have used Python to recognise the different commands as regular expressions and replace them with the adequate LaTeX command. Everything has worked out fine so far.
Now I am left with some "raw-type" Unicode signs, such as the Greek letters. Unfortunately it is just about too much to do by hand. Therefore, I am looking for a way to do this the smart way too. Is there a way for Python to recognise / read them? And how do I tell Python to recognise / read e.g. Pi written as a Greek letter?
A minimal example of the code I use is:
import re

fh = open('SOURCE_DOCUMENT', 'r')
stuff = fh.read()
fh.close()
new_stuff = re.sub('READ', 'REPLACE', stuff)
fh = open('LATEX_DOCUMENT', 'w')
fh.write(new_stuff)
fh.close()
I am not sure whether it is an important information or not, but I am using Python 2.6 running on windows.
I would be really glad, if someone might be able to give me hint, at least where to find the according information or how this might work. Or whether I am completely wrong, and Python can't do this job ...
Many thanks in advance.
Cheers,
Britta
You talk of "raw" Unicode strings. What does that mean? Unicode itself is not an encoding, but there are different encodings to store Unicode characters (read this post by Joel).
The open function in Python 3.0 takes an optional encoding argument that lets you specify the encoding, e.g. UTF-8 (a very common way to encode Unicode). In Python 2.x, have a look at the codecs module, which also provides an open function that allows specifying the encoding of the file.
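For example, a sketch of reading the source with codecs.open, assuming the document is UTF-8 encoded (adjust the encoding argument if it is not):

import codecs

fh = codecs.open('SOURCE_DOCUMENT', 'r', encoding='utf-8')
stuff = fh.read()   # a unicode object; a Greek pi is the single character u'\u03c0'
fh.close()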
Edit: alternatively, why not just let those poor characters be, and specify the encoding of your LaTeX file at the top:
\usepackage[utf8]{inputenc}
(I never tried this, but I figure it should work. You may need to replace utf8 by utf8x, though)
Please, first, read this:
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
Then, come back and ask questions.
You need to determine the "encoding" of the input document. Unicode can encode millions of characters but files can only store 8-bit values (0-255), so the Unicode text must be encoded in some way.
If the document is XML, it should be in the first line (encoding="..."; "utf-8" is the default if there is no "encoding" field). For HTML, look for "charset".
If all else fails, open the document in an editor where you can set the encoding (jEdit, for example). Try them until the text looks right. Then use this value as the encoding parameter for codecs.open() in Python.
