How to get rid of non-ASCII characters in Perl & Python?

How do I get rid of non-ASCII characters like "^L", "¢", and "â" in Perl and Python? I'm getting these special characters while parsing PDF files in Python and Perl. I now have text versions of these PDF files, but they still contain these special characters. Is there any function available that ensures a file or a variable does not contain any non-ASCII characters?

The direct answer to your question, in Python, is to use .encode('ascii', 'ignore') on the Unicode string in question. This will convert the Unicode string to an ASCII string and take out any non-ASCII characters:
>>> u'abc\x0c¢â'.encode('ascii', 'ignore')
'abc\x0c'
Note that it did not take out the '\x0c'. I put that in because you mentioned the character "^L", by which I assume you mean the form-feed character '\x0c' which can be typed with Ctrl+L. That is an ASCII character, and if you want to take that out, you will also need to write some other code to remove it, such as:
>>> str(''.join([c for c in u'abc\x0c¢â' if 32 <= ord(c) < 128]))
'abc'
BUT this possibly won't help you, because I suspect you don't just want to delete these characters, but actually resolve problems relating to why they are there in the first place. In this case, it could be because of Unicode encoding issues. To deal with that, you will need to ask much more specific questions with specific examples about what you expect and what you are seeing.

For the sake of completeness, some Perl solutions. Both return ',,'. Unlike the accepted Python answer, I have used no magic numbers like 32 or 128; the constants here can be looked up much more easily in the documentation.
use 5.014; use Encode qw(encode);
encode('ANSI_X3.4-1968', "\cL,¢,â", sub { q() }) =~ s/\p{PosixCntrl}//gr;

use 5.014; use Unicode::UCD qw(charinfo);
join q(), grep { my $u = charinfo ord $_; 'Basic Latin' eq $u->{block} && 'Cc' ne $u->{category} } split //, "\cL,¢,â";

In Python you can (ab)use the encode function for this purpose (Python 3 prompt):
>>> "hello swede åäö".encode("ascii", "ignore")
b'hello swede '
åäö yields encoding errors, but since I have the errors flag set to "ignore", it just happily goes on. Obviously this can mask other errors.
If you want to be absolutely sure you are not missing any "important" errors, register an error handler with codecs.register_error(name, error_handler). This would let you specify a replacement for each error instance.
Also note that in the example above, using Python 3, I get a bytes object back; I would need to convert it back to proper Unicode should I need a string object.

Related

replace or delete specific unicode characters in python

There seem to be a lot of posts about doing this in other languages, but I can't seem to figure out how in Python (I'm using 2.7).
To be clear, I would ideally like to keep the string in unicode, just be able to replace certain specific characters.
For instance:
thisToken = u'tandh\u2013bm'
print(thisToken)
prints the word with the en dash (U+2013) in the middle. I would just like to delete the dash (but not by indexing, because I want to be able to do this anywhere I find these specific characters).
I try using replace like you would with any other character:
newToke = thisToken.replace('\u2013','')
print(newToke)
but it just doesn't work. Any help is much appreciated.
Seth
The string you're searching for to replace must also be a Unicode string. Try:
newToke = thisToken.replace(u'\u2013','')
You can see the answer in this post: How to replace unicode characters in string with something else python?
Decode the string to Unicode. Assuming it's UTF-8-encoded:
str.decode("utf-8")
Call the replace method and be sure to pass it a Unicode string as its first argument:
str.decode("utf-8").replace(u"\u2022", "")
Encode back to UTF-8, if needed:
str.decode("utf-8").replace(u"\u2022", "").encode("utf-8")

How many displayable characters in a unicode string (Japanese / Chinese)

I need to know how many displayable characters are in a Unicode string containing Japanese / Chinese characters.
Sample code to make the question very obvious:
# -*- coding: UTF-8 -*-
str = '\xe7\x9d\xa1\xe7\x9c\xa0\xe6\x99\x82\xe9\x96\x93'
print len(str)
12
print str
睡眠時間 <<<
note that four characters are displayed
How can I know, from the string, that 4 characters are going to be displayed?
This string
str = '\xe7\x9d\xa1\xe7\x9c\xa0\xe6\x99\x82\xe9\x96\x93'
is an encoded representation of Unicode code points. It contains bytes, and len(str) returns the number of bytes.
You want to know how many Unicode code points the string contains. For that, you need to know what encoding was used to encode those code points. The most popular encoding is UTF-8. In UTF-8, one Unicode code point can take from 1 to 4 bytes. But you don't need to remember that; just decode the string:
>>> str.decode('utf8')
u'\u7761\u7720\u6642\u9593'
Here you can see 4 Unicode code points.
Print it to see the printable version:
>>> print str.decode('utf8')
睡眠時間
And get the number of code points:
>>> len(str.decode('utf8'))
4
UPDATE: Also look at abarnert's answer, which covers all the possible cases.
If you actually want "displayable characters", you have to do two things.
First, you have to convert the string from UTF-8 to Unicode, as explained by stalk:
s = '\xe7\x9d\xa1\xe7\x9c\xa0\xe6\x99\x82\xe9\x96\x93'
u = s.decode('utf-8')
Next, you have to filter out all code points that don't represent displayable characters. You can use the unicodedata module for this. The category function can give you the general category of any code unit. To make sense of these categories, look at the General Categories table in the version of the Unicode Character Database linked from your version of Python's unicodedata docs.
For Python 2.7.8, which uses UCD 5.2.0, you have to do a bit of interpretation to decide what counts as "displayable", because Unicode didn't really have anything corresponding to "displayable". But let's say you've decided that all control, format, private-use, and unassigned characters are not displayable, and everything else is. Then you'd write:
import unicodedata
def displayable(c):
    # keep anything whose general category does not start with 'C'
    return not unicodedata.category(c).startswith('C')
p = u''.join(c for c in u if displayable(c))
Or, if you've decided that Mn and Me are also not "displayable" but Mc is:
def displayable(c):
    return unicodedata.category(c) not in {'Mn', 'Me', 'Cc', 'Cf', 'Co', 'Cn'}
But even this may not be what you want. For example, does a nonspacing combining mark followed by a letter count as one character or two? The standard example is U+0043 plus U+0327: two code points that make up one character, Ç (but U+00C7 is also that same character in a single code point). Often, just normalizing your string appropriately (which usually means NFKC or NFKD) is enough to solve that—once you know what answer you want. Until you can answer that, of course, nobody can tell you how to do it.
If you're thinking, "This sucks, there should be an official definition of what 'printable' means, and Python should know that definition", well, they do, you just need to use a newer version of Python. In 3.x, you can just write:
p = ''.join(c for c in u if c.isprintable())
But of course that only works if their definition of "printable" happens to match what you mean by "displayable". And it very well may not—for example, they consider all separators except ' ' non-printable. Obviously they can't include definitions for any distinction anyone might want to make.

python: extended ASCII codes

Hi, I want to know how I can append and then print extended ASCII codes in Python.
I have the following.
code = chr(247)
li = []
li.append(code)
print li
The result Python prints out is ['\xf7'] when it should be a division symbol. If I simply print code directly ("print code") then I get the division symbol, but not if I append it to a list. What am I doing wrong?
Thanks.
When you print a list, it outputs the default representation of all its elements - i.e., by calling repr() on each of them. The repr() of a string is its escaped code, by design. If you want to output all the elements of the list properly, you should convert it to a string, e.g. via ', '.join(li).
Note that as those in the comments have stated, there isn't really any such thing as "extended ASCII", there are just various different encodings.
You probably want the charmap encoding, which lets you turn unicode into bytes without 'magic' conversions.
s = '\xf7'
b = s.encode('charmap')
with open('/dev/stdout', 'wb') as f:
    f.write(b)
    f.flush()
Will print ÷ on my system.
Note that 'extended ASCII' refers to any of a number of proprietary extensions to ASCII, none of which were ever officially adopted and all of which are incompatible with each other. As a result, the symbol output by that code will vary based on the controlling terminal's choice of how to interpret it.
There's no single defined standard named "extended ASCII codes" - there are, however, plenty of characters, tens of thousands of them, as defined in the Unicode standards.
You may be limited to the charset encoding of your text terminal, which you may think of as "extended ASCII", but which might be "latin-1", for example (if you are on a Unix system such as Linux or Mac OS X, your text terminal will likely use UTF-8 encoding and be able to display any of the tens of thousands of characters available in Unicode).
So, you should read this piece in order to understand what text is after 1992: http://www.joelonsoftware.com/articles/Unicode.html
If you try to build any production application believing in "extended ASCII", you are harming yourself, your users, and the whole ecosystem at once.
That said, Python 2's (and Python 3's) print will perform an implicit str conversion for the objects passed in. If you use a list, this conversion does not recursively call str for each list element; instead, it uses each element's repr, which displays non-ASCII characters as their numeric representation or other unsuitable notations.
You can simply join your desired characters in a unicode string, for example, and then print them normally, using the terminal encoding:
import sys
mytext = u""
mytext += unichr(247) #check the codes for unicode chars here: http://en.wikipedia.org/wiki/List_of_Unicode_characters
print mytext.encode(sys.stdout.encoding, errors="replace")
You are doing nothing wrong.
What you are doing is adding a string of length 1 to a list.
This string contains a character outside the range of printable characters, and outside of ASCII (which is only 7 bit). That's why its representation looks like '\xf7'.
If you print it, it will be transformed as good as the system can.
In Python 2, the byte will be just printed. The resulting output may be the division symbol, or any other thing, according to what your system's encoding is.
In Python 3, it is a unicode character and will be processed according to how stdout is set up. Normally, this indeed should be the division symbol.
In a representation of a list, the __repr__() of the string is called, leading to what you see.

Operating on character '№'

I want to search and replace the character '№' in a string.
I am not sure if it's actually a single character or two.
How do I do it? What is its Unicode code point?
If it's any help, I am using Python3.
EDIT: The sentence "I am not sure if it's actually a single character or two" kind of distorted my question. I actually wanted to know its Unicode code point so that I could use the code instead of pasting the character into my Python script.
In Python 3 it is always a single character.
3>> 'foo№bar'.replace('№', '#')
'foo#bar'
That character is U+2116 ɴᴜᴍᴇʀᴏ sɪɢɴ.
You can just type it directly in your source file, taking care to specify the source file encoding as per PEP 263.
Alternatively, you can use either the numeric Unicode escapes, or the more readable named Unicode escapes:
>>> 'foo\u2116'
'foo№'
>>> 'foo\N{NUMERO SIGN}'
'foo№'

What is the foolproof way to convert some string (UTF-8 or otherwise) to a simple ASCII string in Python?

Inside my Python script, I get some string back from a function which I didn't write. Its encoding varies. I need to convert it to ASCII format. Is there some foolproof way of doing this? I don't mind replacing the non-ASCII chars with blanks or something else...
If you want an ASCII string that unambiguously represents what you have got, without losing any information, the answer is simple:
Don't muck about with encode/decode, use the repr() function (Python 2.X) or the ascii() function (Python 3.x).
You say "the encoding of it varies". I guess that by "it" you mean a Python 2.x "string", which is really a sequence of bytes.
Answer part one: if you do not know the encoding of that encoded string, then no, there is no way at all to do anything meaningful with it*. If you do know the encoding, then step one is to convert your str into a unicode:
encoded_string = i_have_no_control()
the_encoding = 'utf-8' # for the sake of example
text = unicode(encoded_string, the_encoding)
Then you can re-encode your unicode object as ASCII, if you like.
ascii_garbage = text.encode('ascii', 'replace')
* There are heuristic methods for guessing encodings, but they are slow and unreliable. Here's one excellent attempt in Python.
I'd try to normalize the string, then encode it. What about:
import unicodedata
s = u"éèêàùçÇ"
print unicodedata.normalize('NFKD',s).encode('ascii','ignore')
This works only if you have Unicode as input. Therefore, you must know what kind of encoding the function outputs and decode it first. If you don't, there are encoding detection heuristics, but on short strings they are not reliable.
Of course, you could get lucky and the function's outputs could rely on various unknown encodings that use ASCII as a base, and therefore allocate the same values to the bytes from 0 to 127 (like UTF-8).
In that case, you can just get rid of the unwanted chars by filtering them with an OrderedSet:
from ordered_set import OrderedSet   # third-party class; not in the standard library
import string                        # string.printable holds the printable ASCII characters
print "".join(OrderedSet(string.printable) & OrderedSet(s))
Or if you want blanks instead:
print("".join(((char if char in string.printable else " ") for char in s )))
"translate" can help you to do the same.
The only way to know if your are this lucky is to try it out... Sometimes, a big fat lucky day is what any dev need :-)
What's meant by "foolproof" is that the function does not fail with even the most obscure, impossible input -- meaning, you could feed the function random binary data and IT WOULD NEVER FAIL, NO MATTER WHAT. That's what "foolproof" means.
The function should then proceed to do its best to convert to the destination encoding. If it has to throw away all the trash it does not understand, then that is perfectly fine and is in fact the most desirable result. Why try to salvage all the junk? Just discard the junk. Tell the user he's not merely a moron for using Microsoft anything, but a non-standard moron for using non-standard Microsoft anything...or for attempting to send in binary data!
I have just precisely this same need (though my need is in PHP), and I also have users who are at least as moronic as I am, sometimes more so; however, they are definitely nicer and no doubt more patient.
The best, bottom-line thing I've found so far is (in PHP 5.3):
$fixed_string = iconv( 'ISO-8859-1', 'UTF-8//TRANSLIT//IGNORE', $in_string );
This attempts to translate whatever it can and simply throws away all the junk, resulting in a legal UTF-8 string output. I've also not been able to break it or cause it to fail or reject any incoming text or data, even by feeding it gobs of binary junk data.
Finding the iconv() and getting it to work is easy; what's so maddening and wasteful is reading through all the total garbage and bend-over-backwards idiocy that so many programmers seem to espouse when dealing with this encoding fiasco. What's become of the enviable (and respectable) "Flail and Burn The Idiots" mentality of old school programming? Let's get back to basics. Use iconv() and throw away their garbage, and don't be bashful when telling them you threw away their garbage -- in short, don't fail to flail the morons who feed you garbage. And you can tell them I told you so.
If all you want to do is preserve ASCII-compatible characters and throw away the rest, then in most encodings that boils down to removing all characters that have the high bit set -- i.e., characters with value over 127. This works because nearly all character sets are extensions of 7-bit ASCII.
If it's a normal string (i.e., not unicode), you need to decode it in an arbitrary character set (such as iso-8859-1 because it accepts any byte values) and then encode in ascii, using the ignore or replace option for errors:
>>> orig = '1ä2äö3öü4ü'
>>> orig.decode('iso-8859-1').encode('ascii', 'ignore')
'1234'
>>> orig.decode('iso-8859-1').encode('ascii', 'replace')
'1??2????3????4??'
The decode step is necessary because you need a unicode string in order to use encode. If you already have a Unicode string, it's simpler:
>>> orig = u'1ä2äö3öü4ü'
>>> orig.encode('ascii', 'ignore')
'1234'
>>> orig.encode('ascii', 'replace')
'1??2????3????4??'
