There are a lot of posts about doing this in other languages, but I can't figure out how to do it in Python (I'm using 2.7).
To be clear, I would ideally like to keep the string in unicode, just be able to replace certain specific characters.
For instance:
thisToken = u'tandh\u2013bm'
print(thisToken)
prints the word with the dash (an en dash, U+2013) in the middle. I would just like to delete the dash, but without using indexing, because I want to be able to do this anywhere I find these specific characters.
I try using replace like you would with any other character:
newToke = thisToken.replace('\u2013','')
print(newToke)
but it just doesn't work. Any help is much appreciated.
Seth
The search string you pass to replace() must also be a Unicode string. Try:
newToke = thisToken.replace(u'\u2013','')
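For what it's worth, the reason the original attempt finds nothing: in Python 2 a plain byte-string literal has no \u escape, so '\u2013' is six literal characters. A quick check (Python 2.7):
thisToken = u'tandh\u2013bm'
print(repr('\u2013'))  # prints '\u2013' -- six characters, not a dash
newToke = thisToken.replace(u'\u2013', u'')
print(newToke)  # prints: tandhbm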
You can see the answer in this post: How to replace unicode characters in string with something else python?
Decode the string to Unicode. Assuming your byte string is named s and is UTF-8-encoded:
s.decode("utf-8")
Call the replace method and be sure to pass it a Unicode string as its first argument:
s.decode("utf-8").replace(u"\u2022", "")
Encode back to UTF-8, if needed:
s.decode("utf-8").replace(u"\u2022", "").encode("utf-8")
Decoding normal URL-escaped characters is a fairly easy task with Python.
If you want to decode something like: Wikivoyage:%E5%88%A0%E9%99%A4%E8%A1%A8%E5%86%B3
All you need to use is:
import urllib.parse
urllib.parse.unquote('Wikivoyage:%E5%88%A0%E9%99%A4%E8%A1%A8%E5%86%B3')
And you get: 'Wikivoyage:删除表决'
However, I have identified some characters which this does not work with, namely 4-digit % encoded strings:
For example: %25D8
This apparently decodes to ◘
But if you use the urllib function I demonstrated previously, you get: %D8
I understand why this happens: the unquote command reads the %25 as a '%', which is what it normally translates to. Is there any way to get Python to read this properly, especially in a string of similar characters?
The actual problem
In a comment you posted the real examples:
The data I am pulling from is just a list of url-encoded strings. One of the example strings I am trying to decode is represented as: %25D8%25A5%25D8%25B2%25D8%25A7%25D9%2584%25D8%25A9_%25D8%25A7%25D9%2584%25D8%25B4%25D8%25B9%25D8%25B1_%25D8%25A8%25D8%25A7%25D9%2584%25D9%2584%25D9%258A%25D8%25B2%25D8%25B1 This is the raw form of it. Other strings are normal url escapes such as: %D8%A5%D9%88%D8%B2
The first one is double-quoted, as wim pointed out. So they unquote as: إزالة_الشعر_بالليزر and إوز (which are Arabic for "laser hair removal" and "geese").
So you were mistaken about the unquoting and ◘ is a red herring.
Solution
Ideally you would fix whatever gave you this inconsistent data, but if nothing else, you could try detecting double-quoted strings, for example, by checking if the number of % equals the number of %25.
import urllib.parse

def unquote_possibly_double_quoted(s: str) -> str:
    if s.count('%') == s.count('%25'):
        # every '%' begins a '%25', so the string was quoted twice:
        # unquote one extra time
        s = urllib.parse.unquote(s)
    return urllib.parse.unquote(s)
>>> s = '%25D8%25A5%25D8%25B2%25D8%25A7%25D9%2584%25D8%25A9_%25D8%25A7%25D9%2584%25D8%25B4%25D8%25B9%25D8%25B1_%25D8%25A8%25D8%25A7%25D9%2584%25D9%2584%25D9%258A%25D8%25B2%25D8%25B1'
>>> unquote_possibly_double_quoted(s)
'إزالة_الشعر_بالليزر'
>>> unquote_possibly_double_quoted('%D8%A5%D9%88%D8%B2')
'إوز'
You might want to add some checks to this, like for example, s.count('%') > 0 (or '%' in s).
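With that guard folded in, a possible variant of the same helper:
import urllib.parse

def unquote_possibly_double_quoted(s: str) -> str:
    # only attempt double-unquoting when the string contains escapes at all
    if '%' in s and s.count('%') == s.count('%25'):
        s = urllib.parse.unquote(s)
    return urllib.parse.unquote(s)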
This question already has answers here:
What is the best way to remove accents (normalize) in a Python unicode string?
I'm retrieving data from the internet and I want to convert it to ASCII, but I can't get it fixed. (Python 2.7)
When I use decode('utf-8') on the strings I get for example Yalçınkaya. I want this however converted to Yalcinkaya. Raw data was Yalçınkaya.
Anyone who can help me?
Thanks.
Edit: I've tried the suggestion made by the user who marked this question as a duplicate (What is the best way to remove accents in a Python unicode string?), but it did not solve my problem.
That post mainly talks about removing the special characters, which did not solve my problem of converting the Turkish characters (Yalçınkaya) to their ASCII equivalents (Yalcinkaya).
import unicodedata
import HTMLParser

# Printing the raw string results in "Yalçınkaya".
# Decoding from UTF-8 changes the string to 'Yalçınkaya'.
name = unicodedata.normalize('NFKD', unicode(name, 'utf8'))
# HTMLParser is used to revert HTML escapes such as commas.
name = HTMLParser.HTMLParser().unescape(name)
# NFKD normalization converts the string to 'Yalçınkaya'; encoding to ASCII
# then results in 'Yalcnkaya', missing the original Turkish 'ı', which is
# not what I wanted.
name = unicodedata.normalize('NFKD', u'%s' % name).encode('ascii', 'ignore')
Let's check - first, one really needs to understand what character encodings and Unicode are. That is dead serious. I'd suggest you read http://www.joelonsoftware.com/articles/Unicode.html before continuing any further in your project. (By the way, "converting to ASCII" is not a generally useful solution - it's more like breakage. Think about trying to parse numbers, but since you don't understand the digit "9", you decide just to skip it.)
That said - you can tell Python to "decode" a string and replace any characters unknown to the chosen codec with a proper "unknown" character (u"\ufffd") - you can then replace that character before re-encoding the data to your preferred output: raw_data.decode("ascii", "replace"). If you choose to break your parsing even further, you can use "ignore" instead of "replace": the unknown characters will simply be suppressed. Remember that you get a unicode object after decoding - you have to apply the encode method to it before outputting the data anywhere (printing, recording to a file, etc.) - please read the article linked above.
Now - checking your specific data - the particular Yalçınkaya is exactly what you get when UTF-8 bytes are displayed as Latin-1. Just decode it from UTF-8 as usual, and then use the recipe above to strip the accents - but be advised that this only works for Latin letters with diacritics, and "world text" from the Internet may contain all kinds of other characters - you should not rely on everything being convertible to ASCII. I have to say it again: read that article, and rethink your practices.
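A minimal sketch of that recipe (Python 2.7). The explicit dotless-ı replacement is my addition: U+0131 has no NFKD decomposition, so encoding with 'ignore' would otherwise drop it and produce 'Yalcnkaya' rather than 'Yalcinkaya':
import unicodedata

raw = 'Yal\xc3\xa7\xc4\xb1nkaya'  # the UTF-8 bytes behind the "Yalçınkaya" mojibake
text = raw.decode('utf-8')  # u'Yal\xe7\u0131nkaya'
text = text.replace(u'\u0131', u'i')  # Turkish dotless i -> plain ASCII i
print(unicodedata.normalize('NFKD', text).encode('ascii', 'ignore'))  # Yalcinkaya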
I want to search and replace the character '№' in a string.
I am not sure if it's actually a single character or two.
How do I do it, and what is its Unicode code point?
If it's any help, I am using Python 3.
EDIT: The sentence "I am not sure if it's actually a single character or two" kind of distorted my question. I actually wanted to know its code point, so that I could use an escape instead of pasting the character into my Python script.
In Python 3 it is always a single character.
>>> 'foo№bar'.replace('№', '#')
'foo#bar'
That character is U+2116 NUMERO SIGN.
You can just type it directly in your source file, taking care to specify the source file encoding as per PEP 263.
Alternatively, you can use either the numeric Unicode escapes, or the more readable named Unicode escapes:
>>> 'foo\u2116'
'foo№'
>>> 'foo\N{NUMERO SIGN}'
'foo№'
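Combining the two, so that nothing non-ASCII has to be pasted into the source file (Python 3; the sample string is made up):
s = 'Order \u2116 42'
print(s.replace('\N{NUMERO SIGN}', 'No.'))  # Order No. 42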
Is there a codec in Python that will escape everything that is not in the ASCII range 48-57 or 65-122 (i.e. not alphanumeric)?
The only exceptions would be the slash and backslash characters.
Ideally, I would want to convert something like this:
/MyString/My#^/Blah/
To this:
/MyString/My\x23\x5e/Blah/
I know that there is the string-escape encoding which does something similar, but I need a custom range of characters to be encoded. I'm looking for clever suggestions or modules that can do this efficiently.
Thanks!
You can use re.sub with a function parameter like this:
s = "/MyString/My#^/Blah/"
import re
# %02x pads to two hex digits so characters below 0x10 are escaped correctly
print re.sub(r'[^\w/\\]', lambda m: '\\x%02x' % ord(m.group(0)), s)
#/MyString/My\x23\x5e/Blah/
I haven't looked into this, but the first thing that came to mind is the str.translate method:
http://docs.python.org/library/string.html#string.translate
If you put together your translation table correctly (i.e. write a script to do so), translate should do the job.
Hope that helps.
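For what it's worth, a rough sketch of that idea in Python 3, where str.translate accepts a mapping from code points to replacement strings. I've read "alphanumeric" from the question as the set to keep (plus slash and backslash); characters above U+007F pass through untouched and would need extra handling:
# keep digits, letters, slash and backslash; escape every other ASCII character
keep = set(range(48, 58)) | set(range(65, 91)) | set(range(97, 123)) | {ord('/'), ord('\\')}
table = {c: '\\x%02x' % c for c in range(128) if c not in keep}
print('/MyString/My#^/Blah/'.translate(table))
# /MyString/My\x23\x5e/Blah/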
How do I get rid of non-ASCII characters like "^L,¢,â" in Perl and Python? I'm getting these special characters while parsing PDF files in Python and Perl. Now I have text versions of these PDF files, but with these special characters. Is there any function available which will ensure that a file or a variable does not contain any non-ASCII character?
The direct answer to your question, in Python, is to use .encode('ascii', 'ignore'), on the Unicode string in question. This will convert the Unicode string to an ASCII string and take out any non-ASCII characters:
>>> u'abc\x0c¢â'.encode('ascii', 'ignore')
'abc\x0c'
Note that it did not take out the '\x0c'. I put that in because you mentioned the character "^L", by which I assume you mean the form-feed character '\x0c' which can be typed with Ctrl+L. That is an ASCII character, and if you want to take that out, you will also need to write some other code to remove it, such as:
>>> str(''.join([c for c in u'abc\x0c¢â' if 32 <= ord(c) < 128]))
'abc'
BUT this possibly won't help you, because I suspect you don't just want to delete these characters, but actually resolve problems relating to why they are there in the first place. In this case, it could be because of Unicode encoding issues. To deal with that, you will need to ask much more specific questions with specific examples about what you expect and what you are seeing.
For the sake of completeness, some Perl solutions. Both return ",,". Unlike the accepted Python answer, I have used no magic numbers like 32 or 128; the constants here can be looked up much more easily in the documentation.
use 5.014; use Encode qw(encode); encode('ANSI_X3.4-1968', "\cL,¢,â", sub{q()}) =~ s/\p{PosixCntrl}//gr;
use 5.014; use Unicode::UCD qw(charinfo); join q(), grep { my $u = charinfo ord $_; 'Basic Latin' eq $u->{block} && 'Cc' ne $u->{category} } split //, "\cL,¢,â";
In Python you can (ab)use the encode function for this purpose (Python 3 prompt):
>>> "hello swede åäö".encode("ascii", "ignore")
b'hello swede '
åäö yields encoding errors, but since the errors flag is set to "ignore", the encoder just happily goes on. Obviously this can mask other errors.
If you want to be absolutely sure you are not missing any "important" errors, register an error handler with codecs.register_error(name, error_handler). This would let you specify a replacement for each error instance.
Also note that in the example above, using Python 3, I get a bytes object back; I would need to decode it back to str should I need a proper string object.
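A rough sketch of such a handler (Python 3); the name 'warn_replace' is made up for the example. The handler receives the UnicodeEncodeError and must return the replacement plus the position at which to resume encoding:
import codecs

def warn_replace(exc):
    # exc.object is the whole string; exc.start:exc.end is the unencodable run
    print('replacing unencodable run %r' % exc.object[exc.start:exc.end])
    return (u'?', exc.end)

codecs.register_error('warn_replace', warn_replace)

print('hello swede åäö'.encode('ascii', 'warn_replace'))
# replacing unencodable run 'åäö'
# b'hello swede ?'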