Unable to match and replace " and ' characters in a text file - python

Hello: I have a text file where the double- and single-quote characters cannot be matched and replaced (Python 3.5.2). Below is a sample word copied and pasted.
>>> line_copied_pasted = 'gilingan.”'
>>> line_copied_pasted.replace('"','')
'gilingan.”'
When the string is manually entered, matching succeeds:
>>> line_manually_entered = 'gilingan."'
>>> line_manually_entered
'gilingan."'
>>> line_manually_entered.replace('"','')
'gilingan.'
The file is UTF-16 encoded, I think. Any help to fix the problem? Thanks.

You seem to have it figured out. Since ” and " are different characters, it does not make sense to search for the latter when the string contains the former.
Just do:
line_copied_pasted.replace('”','')

In the copied text, ” (RIGHT DOUBLE QUOTATION MARK, U+201D) and " (QUOTATION MARK, U+0022) are different characters. You can look up their code points in any Unicode character table.
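To make the difference between the two characters visible, here is a minimal sketch that prints their code points and names and then deletes both curly and straight quotes in one pass. The sample strings mirror the question; the name QUOTES_TO_DELETE and the choice of which quote characters to drop are assumptions made for illustration.

```python
import unicodedata

copied = 'gilingan.\u201d'   # ends with the curly quote from the pasted text
typed = 'gilingan."'         # ends with the plain ASCII quote

# Print the code point and the official Unicode name of each final character.
for ch in (copied[-1], typed[-1]):
    print(hex(ord(ch)), unicodedata.name(ch))

# Hypothetical translation table: delete common curly and straight quotes.
QUOTES_TO_DELETE = str.maketrans('', '', '\u201c\u201d\u2018\u2019"\'')

cleaned = copied.translate(QUOTES_TO_DELETE)
print(cleaned)
```

A translation table is handy when several look-alike characters may appear in the file; for a single known character, replace is enough.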

Related

remove apostrophe from string python

I'm trying to remove the apostrophe from a string in python.
Here is what I am trying to do:
source = 'weatherForecast/dataRAW/2004/grib/tmax/'
destination= 'weatherForecast/csv/2004/tmax'
for file in sftp.listdir(source):
    filepath = source + str(file)
    subprocess.call(['degrib', filepath, '-C', '-msg', '1', '-Csv', '-Unit', 'm', '-namePath', destination, '-nameStyle', '%e_%R.csv'])
filepath currently comes out as the path wrapped in apostrophes,
i.e.
`subprocess.call(['', 'weatherForecast/.../filename')]`
and I want to get the path without the apostrophes
i.e.
subprocess.call(['', weatherForecast/.../filename)]
I have tried source.strip(" ' ", ""), but it doesn't really do anything.
I have tried putting in print(filepath) or return(filepath), since these will remove the apostrophes, but they gave me syntax errors.
filepath = print(source + str(file))
^
SyntaxError: invalid syntax
I'm currently out of ideas. Any suggestions?
The strip method of a string object only removes matching characters from the ends of a string; it stops as soon as it encounters a character that is not in the set to remove.
To remove characters, replace them with the empty string.
s = s.replace("'", "")
The accepted answer to this question is actually wrong and can cause lots of trouble. The strip method removes leading and trailing characters, so you use it when the characters to remove are at the start and end of the string.
If you use replace instead, you will remove every occurrence of the character in the string. Here is a quick example.
my_string = "'Hello rokman's iphone'"
my_string.replace("'", "")
The above code will return Hello rokmans iphone. As you can see, you lost the apostrophe before the s. This is not something you would want in general. However, I believe the paths you parse do not contain that character, which is why replace was fine for you at the time.
As for your attempt, you are doing just one thing wrong: when you call the strip method, you leave a space before and after the apostrophe. The right way to use it is like this.
my_string = "'Hello world'"
my_string.strip("'")
However, this assumes that the string is wrapped in '. If you get " from the response instead, you can swap the quotes like this.
my_string = '"Hello world"'
my_string.strip('"')
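The contrast between the two methods described above can be sketched as follows. The sample string is the one from the answer; the variable names are illustrative only.

```python
quoted = "'Hello rokman's iphone'"

# strip() only trims the given characters from both ends of the string.
stripped = quoted.strip("'")

# replace() removes every occurrence, including the apostrophe inside the text.
replaced = quoted.replace("'", "")

print(stripped)   # Hello rokman's iphone
print(replaced)   # Hello rokmans iphone
```

Which one is appropriate depends on whether an apostrophe can legitimately occur inside the value.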

remove this unidentified character "\" from string python

I want to remove the character "\" from a string, but I can't, because the backslash is needed to write "\t" or "\n". I then tried """"\"""", but Python still doesn't do anything. I think the interpreter does not recognize this string. This is the code:
remove = string.replace(""""\"""", " ")
I want "\workspace\file.txt" to become "workspace file.txt".
Anyone have any idea? Thanks in advance
You're trying to replace a backslash, but since Python uses backslashes as escape characters, you actually have to escape the backslash itself.
remove = string.replace("\\", " ")
DEMO:
In [1]: r"\workspace\file.txt".replace("\\", " ")
Out[1]: ' workspace file.txt'
Note the leading space. You may want to call str.strip on the end result.
You have to escape the backslash, as it has special meaning in strings. Even in your original string, if you leave it like that, it'll come out...not as you expect:
"\workspace\file.txt" --> '\\workspace\x0cile.txt'
Here's something that will break the string up by a backslash, join them together with a space where the backslash was, and trim the whitespace before and after the string. It also contains the correctly escaped string you need.
" ".join("\\workspace\\file.txt".split("\\")).strip()
You can achieve it this way:
>>> path = r"\workspace\file.txt"
>>> path.replace("\\", " ").strip()
'workspace file.txt'
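The escaping behaviour discussed in the answers above can be checked directly. This is a small sketch; the variable names and the shortened "work\file.txt" literal are assumptions chosen to isolate the "\f" escape.

```python
# In a normal literal, "\f" is the form-feed character (0x0c), not "\" + "f".
unescaped = "work\file.txt"
print('\x0c' in unescaped)   # True

# A raw string and a doubled backslash spell the same text.
raw = r"\workspace\file.txt"
escaped = "\\workspace\\file.txt"
print(raw == escaped)        # True

# "\\" in source code is a single backslash character at run time.
print(len("\\"))             # 1

# Replace each backslash with a space, then trim the leading space.
result = raw.replace("\\", " ").strip()
print(result)                # workspace file.txt
```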

How to remove this \xa0 from a string in python?

I have the following string:
word = u'Buffalo,\xa0IL\xa060625'
I don't want the "\xa0" in there. How can I get rid of it? The string I want is:
word = 'Buffalo, IL 60625'
The most robust way would be to use the unidecode module to convert all non-ASCII characters to their closest ASCII equivalent automatically.
The character \xa0 (not \xa as you stated) is a NO-BREAK SPACE, and the closest ASCII equivalent would of course be a regular space.
import unidecode
word = unidecode.unidecode(word)
If you know for sure that is the only character you don't want, you can .replace it:
>>> word.replace(u'\xa0', ' ')
u'Buffalo, IL 60625'
If you need to handle all non-ascii characters, encoding and replacing bad characters might be a good start...:
>>> word.encode('ascii', 'replace')
'Buffalo,?IL?60625'
There is no \xa there. If you try to put that into a string literal, you're going to get a syntax error if you're lucky, or it's going to swallow up the next attempted character if you're not, because \x sequences always have to be followed by two hexadecimal digits.
What you have is \xa0, which is an escape sequence for the character U+00A0, aka "NO-BREAK SPACE".
I think you want to replace them with spaces, but whatever you want to do is pretty easy to write:
word.replace(u'\xa0', u' ') # replaced with space
word.replace(u'\xa0', u'0') # closest to what you were literally asking for
word.replace(u'\xa0', u'') # removed completely
You can use unicodedata to normalize compatibility characters such as \xa0 to their plain equivalents:
>>> from unicodedata import normalize
>>> normalize('NFKD', word)
'Buffalo, IL 60625'
This seems to work for getting rid of non-ascii characters:
fixedword = word.encode('ascii','ignore')
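The approaches suggested above give different results, which a short Python 3 sketch makes visible. It uses only the standard library (unidecode is a third-party package and is left out here); the variable names are illustrative.

```python
import unicodedata

word = 'Buffalo,\xa0IL\xa060625'

# Replace the no-break space with a regular space.
replaced = word.replace('\xa0', ' ')

# NFKD normalization maps U+00A0 to a regular space as well.
normalized = unicodedata.normalize('NFKD', word)

# Encoding with 'ignore' drops the character entirely, so the words run together.
ascii_only = word.encode('ascii', 'ignore').decode('ascii')

print(replaced)     # Buffalo, IL 60625
print(normalized)   # Buffalo, IL 60625
print(ascii_only)   # Buffalo,IL60625
```

If the spacing matters, as in an address, dropping the character is probably not what you want.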

Python: Find backslash in imported string

I'm trying to snip the first phrase in an imported string (s) which always takes the form:
"\first phrase\\...\ ... "
The first phrase can be any length and consist of more than one word
The code I initially tried was:
phrase = s[1:s.find('\',1,len(s))]
which obviously didn't work.
r'\' won't compile (returns EOL error).
Variations of the following: r'\\\'; r'\\\\\\\', "\\\", "\\\\\\\""
resolve to: phrase = s[1:-1].
As the first character is always a backslash I've also tried:
phrase = s[1:find(s[0:1],1,len(s))], but it wasn't having any of it.
Any suggestions appreciated, this was supposed to be a 10 minute job!
Backslashes in string literals need to be escaped.
'\\'
I just use the split command, which will handle your multi-word requirement easily:
>>> s='\\first phrase\\second phrase\\third phrase\\'
>>> print s
\first phrase\second phrase\third phrase\
>>> s.split('\\')
['', 'first phrase', 'second phrase', 'third phrase', '']
>>> s.split('\\')[1]
'first phrase'
The trick is to make sure the backslash is escaped by a backslash.
That's why it turns out to be \\ that you are searching for or splitting on.
You can't have a '\' as the last character of a string literal, even if it's a raw string; it needs to be written '\\'. In fact, if you look at your question, you'll see the highlighting go somewhat wonky. Try changing it as suggested and it may well correct itself.
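For completeness, the slicing approach from the question also works once the backslash is escaped. A minimal sketch, assuming the sample string from the split answer above; the fallback for a missing second backslash is an addition for illustration.

```python
s = '\\first phrase\\second phrase\\third phrase\\'

# Find the next backslash after the leading one (index 0).
end = s.find('\\', 1)

# find() returns -1 when nothing is found; fall back to the rest of the string.
phrase = s[1:end] if end != -1 else s[1:]
print(phrase)   # first phrase
```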

search for hexadecimal number on python using re

I am processing an HTML text file and searching for hexadecimal numbers such as the following:
example \xb7\xc7\xa0....
I tried with this code
t=re.findall (r'\\x[0-9a-fA-F]+', line)
but I only get an empty list.
Please tell me the right way to write the code.
It works fine for me. Two scenarios come to mind that might explain your problem:
You're testing this by assigning the string to a variable line like so:
line = 'example \xb7\xc7\xa0....'
In this case, you need to escape the backslashes:
line = 'example \\xb7\\xc7\\xa0....'
You are viewing the contents of the file or line as a Python string, so that the \xb7 you are seeing is actually the character whose code is B7 hex, not the character sequence '\', 'x', 'b', '7'.
Your code works fine if the backslash is escaped inside the regular expression:
t = re.findall (r'\\x[0-9a-fA-F]+', line)
Result:
['\\xb7', '\\xc7', '\\xa0']
ideone: http://ideone.com/MPO5j
If it doesn't work, it might be because your string contains literal binary characters. Then try something like this instead:
t = re.findall (r'[\x80-\xff]', line)
ideone: http://ideone.com/ChIsh
Your code works fine for me:
>>> line = r'\xb7\xc7\xa0....'
>>> t=re.findall (r'\\x[0-9a-fA-F]+', line)
>>> t
['\\xb7', '\\xc7', '\\xa0']
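The two scenarios described in the answers can be put side by side. This is a sketch in Python 3, assuming the file content is already a str; the variable names are made up for illustration.

```python
import re

# Case 1: the text contains a literal backslash followed by "x" and hex digits.
escaped_text = r'example \xb7\xc7\xa0....'

# Case 2: the text contains the actual characters U+00B7, U+00C7 and U+00A0.
real_chars = 'example \xb7\xc7\xa0....'

pattern_escaped = re.compile(r'\\x[0-9a-fA-F]+')
pattern_chars = re.compile(r'[\x80-\xff]')

found_escaped = pattern_escaped.findall(escaped_text)
found_none = pattern_escaped.findall(real_chars)
found_chars = pattern_chars.findall(real_chars)

print(found_escaped)   # ['\\xb7', '\\xc7', '\\xa0']
print(found_none)      # []
print(found_chars)
```

An empty list from the first pattern is therefore a hint that the line holds real characters rather than escape sequences spelled out as text.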
