Using re.sub to remove double quotes - python

I am trying to remove double quotes from a text file in Python. The statement print re.sub(r'"', '', line) works in the interpreter, but not when I run it from a file. Why would this be?
From the interpreter directly:
>>>
>>> import re
>>> str = "bill"
>>> print re.sub(r'"', '', str)
bill
>>>
from my .py file:
import re

def remove_quotes(filename):
    with open(filename, 'rU') as file:
        print re.sub(r'"', '', file.read())
output:
“Bill”
“pretty good” bastante bien
“friendship” amistad
“teenager” adolescent
OK, as col6y pointed out, I am dealing with fancy L/R quotes. Trying to get rid of them:
>>> line
'\xe2\x80\x9cBill\xe2\x80\x9d\n'
text = line.replace(u'\xe2\x80\x9c', '')
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 0: ordinal not in range(128)
Tried another character encoding:
text = line.replace(u"\u201c", '')
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 0: ordinal not in range(128)

In your interpreter example, you say:
>>>
>>> import re
>>> str = "bill"
>>> print re.sub(r'"', '', str)
bill
>>>
However, the string "bill" does not contain any quotes, so this doesn't test anything. If you try print str, you'll see it never had quotes in the first place - this is because the quotes mark that str is a string, and are therefore not included. (You wouldn't always want quotes in your strings.) If you wanted to include quotes, you could say "\"bill\"" or '"bill"'.
However, this doesn't explain the real issue in your other program. To understand that, note the difference between “, ”, and ". They look similar, but they're slightly different, and are definitely different to the computer. In your file, you have “ and ”, but you are replacing the ". You'll want to replace the other two as well.
Also, as @MikeT pointed out, it would be easier to use file.read().replace(...) instead of re.sub(..., file.read()). re.sub is for regular expressions, but you don't need their power here.
You should also note that file.read() reads the whole file into memory at once. For large files, consider iterating over the lines of the file instead and cleaning each line in turn.
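A minimal sketch of the fix in Python 3, assuming the file is UTF-8 encoded. The names QUOTES, strip_quotes and the encoding parameter are illustrative and not from the original post. It deletes the straight quote and both curly quotes in one pass:

```python
# Characters to delete: the straight quote plus the left/right curly quotes.
QUOTES = '"\u201c\u201d'

def strip_quotes(text):
    """Return text with straight and curly double quotes removed."""
    return text.translate({ord(ch): None for ch in QUOTES})

def remove_quotes(filename, encoding='utf-8'):
    """Read a text file (the encoding is an assumption) and strip its quotes."""
    with open(filename, encoding=encoding) as handle:
        return strip_quotes(handle.read())

print(strip_quotes('\u201cBill\u201d says "hi"'))  # Bill says hi
```

If you prefer to stay with re.sub, the character class '["\u201c\u201d]' should match the same three characters.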

Related

How to tell python that a string is actually bytes-object? Not converting

I have a txt file which contains a line:
' 6: "\\351\\231\\220\\346\\227\\266\\345\\205\\215\\350\\264\\271"'
The contents in the double quotes is actually octal encoding, but with two escape characters.
After the line has been read in, I used regex to extract the contents in the double quotes.
c = re.search(r': "(.+)"', line).group(1)
After that, I have two problems:
First, I need to replace the two escape characters with one.
Second, I need to tell Python that the str object c is actually a bytes object.
I have not managed to do either.
I have tried:
re.sub('\\', '\', line)
re.sub(r'\\', '\', line)
re.sub(r'\\', r'\', line)
All failed.
A bytes object can easily be defined with the 'b' prefix.
c = b'\351\231\220\346\227\266\345\205\215\350\264\271'
How do I change the type of a variable from string to bytes? I think this is not an encode-and-decode thing.
I googled a lot but found no answers. Maybe I used the wrong keywords.
Does anyone know how to do this, or another way to get what I want?
This is always a little confusing. I assume your bytes object should represent a string like:
b = b'\351\231\220\346\227\266\345\205\215\350\264\271'
b.decode()
# '限时免费'
To get that with your escaped string, you could use the codecs library and try:
import re
import codecs
line = ' 6: "\\351\\231\\220\\346\\227\\266\\345\\205\\215\\350\\264\\271"'
c = re.search(r': "(.+)"', line).group(1)
codecs.escape_decode(bytes(c, "utf-8"))[0].decode("utf-8")
# '限时免费'
giving the same result.
The string contains literal text for escape codes. You cannot just replace the literal backslashes with a single backslash as escape codes are used in source code to indicate a single character. Decoding is needed to change literal escape codes to the actual character, but only byte strings can be decoded.
Encoding a Unicode string to a byte string with the Latin-1 codec translates Unicode code points 1:1 to the corresponding byte, so it is the common way to directly convert a "byte-string-like" Unicode string to an actual byte string.
Step-by-Step:
>>> s = "\\351\\231\\220\\346\\227\\266\\345\\205\\215\\350\\264\\271"
>>> print(s) # Actual text of the string
\351\231\220\346\227\266\345\205\215\350\264\271
>>> s.encode('latin1') # Convert to byte string
b'\\351\\231\\220\\346\\227\\266\\345\\205\\215\\350\\264\\271'
>>> # decode the escape codes...result is latin-1 characters in Unicode
>>> s.encode('latin1').decode('unicode-escape')
'é\x99\x90æ\x97¶å\x85\x8dè´¹'
>>> # convert back to byte string
>>> s.encode('latin1').decode('unicode-escape').encode('latin1')
b'\xe9\x99\x90\xe6\x97\xb6\xe5\x85\x8d\xe8\xb4\xb9'
>>> # data is UTF-8-encoded text so decode it correctly now
>>> s.encode('latin1').decode('unicode-escape').encode('latin1').decode('utf8')
'限时免费'
Your text example looks like part of a Python dictionary. You may be able to save some steps by using the ast module's literal_eval function to turn the dictionary directly into a Python object, and then just fix this line of code:
>>> # Python dictionary-like text
>>> d='{6: "\\351\\231\\220\\346\\227\\266\\345\\205\\215\\350\\264\\271"}'
>>> import ast
>>> ast.literal_eval(d) # returns Python dictionary with value already decoded
{6: 'é\x99\x90æ\x97¶å\x85\x8dè´¹'}
>>> ast.literal_eval(d)[6] # but decoded incorrectly as Latin-1 text.
'é\x99\x90æ\x97¶å\x85\x8dè´¹'
>>> ast.literal_eval(d)[6].encode('latin1').decode('utf8') # undo Latin1, decode as UTF-8
'限时免费'
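The step-by-step chain above can be wrapped in one small helper. This is only a sketch for Python 3: the function name decode_escaped_utf8 is made up for illustration, and it assumes the escaped bytes really are UTF-8 text, as in the question.

```python
def decode_escaped_utf8(text):
    """Turn literal octal escapes such as \\351\\231\\220 into UTF-8 text."""
    # 1. latin1 maps each code point 1:1 to a byte, giving a bytes object.
    # 2. unicode_escape interprets the literal backslash escapes.
    # 3. latin1 again recovers the raw bytes those escapes stood for.
    raw = text.encode('latin1').decode('unicode_escape').encode('latin1')
    # 4. The raw bytes are assumed to be UTF-8 encoded text.
    return raw.decode('utf-8')

s = "\\351\\231\\220\\346\\227\\266\\345\\205\\215\\350\\264\\271"
print(decode_escaped_utf8(s))  # 限时免费
```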

python unicode replace backslash u with an empty string

I'm sanitizing a pandas dataframe and have encountered a unicode string containing a backslash followed by a u, which I need to replace, e.g.
u'\u2014'.replace('\u','')
Result: u'\u2014'
I've tried encoding it as utf-8 then decoding it but that didn't work and I feel there must be an easier way around this.
pandas code
merged['Rank World Bank'] = merged['Rank World Bank'].astype(str)
Error
UnicodeEncodeError: 'ascii' codec can't encode character u'\u2014' in position 0: ordinal not in range(128)
u'\u2014' is actually an em dash, not a hyphen and not a number. It is a single Unicode character. Print it and you will see.
This is the output in ipython:
In [4]: print("val = ", u'\u2014')
val = —
Based on your comment, here is what you are doing wrong
"-" is not same as "EM Dash" Unicode character(u'\u2014')
So, you should do the following
print(u'\u2014'.replace("\u2014",""))
and that will work
EDIT:
Since you are using Python 2.x, the search string must be a unicode literal as well:
u'\u2014'.replace(u'\u2014', u'')
Yes, because Python parses '\u' followed by '2014' as a single Unicode escape (one character), not as the literal characters backslash, u, 2, 0, 1, 4.
Things that can help:
Converting to ascii using .encode('ascii', 'ignore')
As you are using pandas, you can use 'encoding' parameter and pass 'ascii' there.
Do this instead : u'\u2014'.replace(u'\u2014', u'2014').encode('ascii', 'ignore')
Hope this helps.
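To make the point concrete, here is a small Python 3 sketch (the sample text is invented for illustration). It shows that the escape is a single character, and two ways of getting rid of it:

```python
text = 'Rank \u2014 World Bank'   # hypothetical cell value with an em dash

# The escape denotes ONE character; the raw string is six literal characters.
print(len('\u2014'))    # 1
print(len(r'\u2014'))   # 6

# Option 1: remove the em dash explicitly.
removed = text.replace('\u2014', '')

# Option 2: drop every non-ASCII character.
ascii_only = text.encode('ascii', 'ignore').decode('ascii')

print(removed == ascii_only)  # True
```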

How to decode utf-8 string in python? [duplicate]

For example, if I have a unicode string, I can encode it as an ASCII string like so:
>>> u'\u003cfoo/\u003e'.encode('ascii')
'<foo/>'
However, I have e.g. this ASCII string:
'\u003cfoo/\u003e'
... that I want to turn into the same ASCII string as in my first example above:
'<foo/>'
It took me a while to figure this one out, but this page had the best answer:
>>> s = '\u003cfoo/\u003e'
>>> s.decode( 'unicode-escape' )
u'<foo/>'
>>> s.decode( 'unicode-escape' ).encode( 'ascii' )
'<foo/>'
There's also a 'raw-unicode-escape' codec to handle the other way to specify Unicode strings -- check the "Unicode Constructors" section of the linked page for more details (since I'm not that Unicode-savvy).
EDIT: See also Python Standard Encodings.
On Python 2.5 the correct encoding is "unicode_escape", not "unicode-escape" (note the underscore).
I'm not sure whether newer versions of Python changed the codec name, but here it only worked with the underscore.
Anyway, this is it.
At some point you will run into issues when you encounter special characters like Chinese characters or emoticons in a string you want to decode i.e. errors that look like this:
UnicodeEncodeError: 'ascii' codec can't encode characters in position 109-123: ordinal not in range(128)
For my case (twitter data processing), I decoded as follows to allow me to see all characters with no errors
>>> s = '\u003cfoo\u003e'
>>> s.decode( 'unicode-escape' ).encode( 'utf-8' )
'<foo>'
Ned Batchelder said:
It's a little dangerous depending on where the string is coming from,
but how about:
>>> s = '\u003cfoo\u003e'
>>> eval('u"'+s.replace('"', r'\"')+'"').encode('ascii')
'<foo>'
Actually this method can be made safe like so:
>>> s = '\u003cfoo\u003e'
>>> s_unescaped = eval('u"""'+s.replace('"', r'\"')+'-"""')[:-1]
Mind the triple-quote string and the dash right before the closing 3-quotes.
Using a 3-quoted string will ensure that if the user enters ' \\" ' (spaces added for visual clarity) in the string it would not disrupt the evaluator;
The dash at the end is a failsafe in case the user's string ends with a ' \" ' . Before we assign the result we slice the inserted dash with [:-1]
So there would be no need to worry about what the users enter, as long as it is captured in raw format.
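For comparison, here is a sketch of the same unescaping in Python 3 without eval at all, using the unicode_escape codec mentioned earlier. The helper name unescape is made up, and it assumes the input contains only ASCII characters (otherwise the encode step raises an error):

```python
def unescape(text):
    """Interpret literal \\uXXXX escapes in an ASCII-only string."""
    # Nothing is evaluated as code; the codec only translates escapes.
    return text.encode('ascii').decode('unicode_escape')

s = r'\u003cfoo/\u003e'
print(unescape(s))  # <foo/>

# Input that would be risky with eval comes back as plain text.
hostile = '"+__import__("os").name+"'
print(unescape(hostile) == hostile)  # True
```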

Backslash to forward in unicode string in Python

I have a spreadsheet with dates, usually encoded as strings in the format "DD\MM\YYYY", such as 08\09\2014. The function I use returns the data as unicode, and I use Python 2.7. So, I start with:
> data_prob_raw
08\09\2014
To convert the string to a datetime object (dateutil.parser.parse()) I need a string without '\', but I can't find a way to remove that problematic character or replace it with '/'.
I already tried with unicode codes:
data_prob_raw=data_prob_raw.replace(r'\x81', '/201')
data_prob_raw=data_prob_raw.replace(u'\x81', '/201')
And simply a string:
data_prob_raw=data_prob_raw.replace('\201','/201')
But it doesn't change anything:
08\09\2014
I also tried encoding the string:
data_prob_raw=data_raw_unic.encode('ascii')
But \201 falls outside the 128 ASCII characters:
UnicodeDecodeError: 'ascii' codec can't decode byte 0x81 in position 0: ordinal not in range(128)
How can I solve that problem?
When you read data from a file in Python, each backslash arrives as a literal character; the repr shows it escaped as '\\'.
I have a file called test.txt with the contents 01\01\2010
>>> with open(r'C:\users\john\desktop\test.txt') as f:
...     s = f.read()
...
>>> s
'01\\01\\2010'
>>> s.replace('\\', '/')
'01/01/2010'
and I have no problem using .replace on the string. What might be happening is that you are creating a variable directly, to test the functionality, and are assigning data_prob_raw='08\09\2014' when you should be testing with either data_prob_raw='08\\09\\2014' or reading the date in from the file.
As zondo suggested, you can also use raw strings, like so: data_prob_raw=r'08\09\2014'. Notice the preceding r; it tells Python to treat the backslashes as literal backslashes instead of parsing them as escape characters.
To process simply a backslash in a string, you just have to put it twice. It is the escape character, so the following replace should be enough:
data_prob_raw=data_prob_raw.replace('\\', '/')
You don't need to perform replacement. datetime can parse any date format you specify:
>>> data = ur'08\09\2014'
>>> from datetime import datetime
>>> datetime.strptime(data,ur'%d\%m\%Y')
datetime.datetime(2014, 9, 8, 0, 0)
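The answers above can be combined into a short Python 3 sketch. It assumes the DD\MM\YYYY order stated in the question, and the variable names are illustrative:

```python
from datetime import datetime

# A raw string keeps the backslashes as literal characters.
raw = r'08\09\2014'

# Without the r prefix, \0 and \201 are parsed as octal escapes,
# which is where the stray 0x81 byte in the question comes from.
unescaped = '08\09\2014'
print(len(raw), len(unescaped))  # 10 6

# Replace the backslashes, then parse with an explicit format.
slashed = raw.replace('\\', '/')
parsed = datetime.strptime(slashed, '%d/%m/%Y')
print(slashed)        # 08/09/2014
print(parsed.date())  # 2014-09-08
```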

How can I change unicode to ascii and drop unrecognized characters

My file is in unicode. However, for some reason, I want to change it to plain ascii while dropping any characters that are not recognized in ascii. For example, I want to change u'This is a string�' to just 'This is a string'. Following is the code I use to do so.
ascii_str = unicode_str.encode('ascii', 'ignore')
However, I still get the following annoying error.
UnicodeDecodeError: 'ascii' codec can't decode byte 0xf3 in position 0:
ordinal not in range(128)
How can I solve this problem? I am fine with plain ascii strings.
I assume that your unicode_str is a real unicode string.
>>> u"\xf3".encode("ascii", "ignore")
''
If it is not, use this:
>>> "\xf3".decode("ascii", "ignore").encode("ascii")
The best approach is always to find out which encoding you are dealing with and then decode the data with it, so that you end up with a proper unicode string. That means unicode_str should either already be a real unicode string or be read with the right codec. I assume the data comes from a file, so the very best would be:
import codecs
f = codecs.open('unicode.rst', encoding='utf-8')
for line in f:
    print repr(line)
Another desperate approach would be:
>>> import string
>>> a = "abc\xf3abc"
>>> "".join(b for b in a if b in string.printable)
'abcabc'
You need to decode it. If you have a file:
with open('example.csv', 'rb') as f:
    csv = f.read().decode("utf-8")
If you want to decode a string, you can do it this way:
data.decode('UTF-8')
UPDATE
You can use ord() to get the ASCII code of every character:
d=u'This is a string'
l=[ord(s) for s in d.encode('ascii', 'ignore')]
print l
If you need to join them back into a string, convert each code back with chr() first:
print "".join(chr(n) for n in l)
Because your string contains a replacement character (the symbol at code point U+FFFD in the Unicode Specials block), you need to mark the literal as unicode by prefixing it with u before encoding:
>>> unicode_str=u'This is a string�'
>>> unicode_str.encode('ascii', 'ignore')
'This is a string'
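In Python 3 the same idea might look like the sketch below. The helper name to_ascii is made up for illustration, and the bytes example assumes the data is UTF-8 encoded:

```python
def to_ascii(text):
    """Drop every character that has no ASCII representation."""
    # encode(..., 'ignore') skips unencodable characters; decode back to str.
    return text.encode('ascii', 'ignore').decode('ascii')

# A str containing the replacement character U+FFFD.
print(to_ascii('This is a string\ufffd'))  # This is a string

# Raw bytes must be decoded with their real encoding first.
data = b'caf\xc3\xa9'                    # UTF-8 bytes (assumed encoding)
print(to_ascii(data.decode('utf-8')))    # caf
```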
