I have a string that looks like this: text = u'\xd7\nRecord has been added successfully, record id: 92'. I want to remove the escape characters \xd7 and \n from it so that I can use the string for another purpose.
I tried str(text), but it cannot handle the character \xd7 and raises:
UnicodeEncodeError: 'ascii' codec can't encode character u'\xd7' in position 0: ordinal not in range(128)
Is there a way to remove escape characters like these from a string? Thanks.
You can try the following, using replace:
text = u'\xd7\nRecord has been added successfully, record id: 92'
# Unicode literals, so replace() never has to decode a non-ASCII byte string
bad_chars = [u'\xd7', u'\n', u'\x99m', u'\xf0']
for i in bad_chars:
    text = text.replace(i, u'')
text
It seems you have a unicode string, as in Python 2.x:
inp_str = u'\xd7\nRecord has been added successfully, record id: 92'
If you want to remove the escape characters (that is, the non-ASCII special characters), here is one way to keep only the ASCII characters without a regex or a hard-coded list:
inp_str = u'\xd7\nRecord has been added successfully, record id: 92'
print inp_str.encode('ascii', errors='ignore').strip('\n')
Result: 'Record has been added successfully, record id: 92'
The string is already unicode, so I encode it to ASCII; with errors='ignore', any character outside the ASCII range is dropped. After that you just strip the '\n'.
Hope this helps you :)
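On Python 3 the same idea needs one extra step, because str.encode returns bytes rather than a string. A minimal sketch, assuming Python 3; the helper name strip_non_ascii is made up for illustration and is not part of the answer above:

```python
def strip_non_ascii(text):
    """Drop every non-ASCII character, then trim surrounding whitespace."""
    # errors='ignore' silently discards characters that have no ASCII
    # representation; decode() turns the resulting bytes back into str.
    ascii_only = text.encode('ascii', errors='ignore').decode('ascii')
    return ascii_only.strip()


text = '\xd7\nRecord has been added successfully, record id: 92'
cleaned = strip_non_ascii(text)
print(cleaned)
```

Note that this drops all non-ASCII characters, not only \xd7, so it is only suitable when the remaining text is expected to be plain ASCII.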
I believe Regex can help
import re
text = u'\xd7\nRecord has been added successfully, record id: 92'
res = re.sub('[^A-Za-z0-9]+', ' ', text).strip()
Result:
'Record has been added successfully record id 92'
You could do it by 'slicing' the string:
string = '\xd7\nRecord has been added successfully, record id: 92'
text = string[2:]
Try a regex:
import re
def escape_ansi(line):
    # Despite the name, this removes only the two characters \xd7 and \n
    ansi_escape = re.compile(r'(\xd7|\n)')
    return ansi_escape.sub('', line)
text = u'\xd7\nRecord has been added successfully, record id: 92'
print(escape_ansi(text))
You could use the built-in regex library.
import re
text = u'\xd7\nRecord has been added successfully, record id: 92'
result = re.sub('[^A-Za-z0-9]+', ' ', text)
print(result)
That spits out Record has been added successfully record id 92
This seems to pass your test case if you can live without the punctuation.
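If the punctuation matters, a narrower character class might be used instead. The sketch below (Python 3 syntax; the helper name keep_printable_ascii is hypothetical) removes everything outside the printable ASCII range, so commas and colons survive:

```python
import re

# Matches runs of characters outside the printable ASCII range (space .. ~).
NON_PRINTABLE_ASCII = re.compile(r'[^\x20-\x7e]+')


def keep_printable_ascii(text):
    """Remove control characters and all non-ASCII characters."""
    return NON_PRINTABLE_ASCII.sub('', text)


text = '\xd7\nRecord has been added successfully, record id: 92'
result = keep_printable_ascii(text)
print(result)
```

Newlines in the middle of the text are removed too, so this is best applied to single-line messages like the one in the question.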
I have a dictionary serialized as a string, as follows:
CREDENTIALS = "{\"aaaUser\": {\"attributes\": {\"pwd\": \"cisco123\", \"name\": \"admin\"}}}"
Now I want to format this string to replace the pwd and name dynamically. What I've tried is:
CREDENTIALS = "{\"aaaUser\": {\"attributes\": {\"pwd\": \"{0}\", \"name\": \"{1}\"}}}".format('password', 'username')
But this gives following error:
Traceback (most recent call last):
  File ".\ll.py", line 4, in <module>
    CREDENTIALS = "{\"aaaUser\": {\"attributes\": {\"pwd\": \"{0}\", \"name\": \"{1}\"}}}".format('password', 'username')
KeyError: '"aaaUser"'
It is possible to load the string as a dict using json.loads() and then set the attributes as required, but this is not what I want. I want to format the string, so that I can use this string in other files/modules.
What am I missing here? Any help would be appreciated.
Don't try to work with the JSON string directly; decode it, update the data structure, and re-encode it:
import json
# Use single quotes instead of escaping all the double quotes
CREDENTIALS = '{"aaaUser": {"attributes": {"pwd": "cisco123", "name": "admin"}}}'
d = json.loads(CREDENTIALS)
attributes = d["aaaUser"]["attributes"]
# username and password are assumed to be defined elsewhere
attributes["name"] = username
attributes["pwd"] = password
CREDENTIALS = json.dumps(d)
With string formatting, you would need to change your string to look like
CREDENTIALS = '{{"aaaUser": {{"attributes": {{"pwd": "{0}", "name": "{1}"}}}}}}'
doubling all the literal braces so that the format method doesn't mistake them for placeholders.
However, formatting also means that the password needs to be pre-escaped if it contains anything that could be mistaken for JSON syntax, such as a double quote.
# This produces invalid JSON
NEW_CREDENTIALS = CREDENTIALS.format('new"password', 'bob')
# This produces valid JSON
NEW_CREDENTIALS = CREDENTIALS.format('new\\"password', 'bob')
It's far easier and safer to just decode and re-encode.
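To make the decode and re-encode route reusable from other modules, it could be wrapped in a small function. This is only a sketch: the names TEMPLATE and build_credentials are hypothetical, and the template is the one from the question.

```python
import json

# Template taken from the question; only "pwd" and "name" are replaced.
TEMPLATE = '{"aaaUser": {"attributes": {"pwd": "cisco123", "name": "admin"}}}'


def build_credentials(username, password, template=TEMPLATE):
    """Return a JSON string with the given name and password filled in."""
    data = json.loads(template)
    attributes = data["aaaUser"]["attributes"]
    attributes["name"] = username
    attributes["pwd"] = password
    return json.dumps(data)


# A password containing a double quote needs no manual escaping.
credentials = build_credentials('bob', 'new"password')
round_trip = json.loads(credentials)
print(round_trip["aaaUser"]["attributes"]["pwd"])
```

Because json.dumps does the escaping, the returned string parses back to exactly the password that was passed in.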
str.format treats the text enclosed in braces {} as a replacement field. Here CREDENTIALS starts with an opening brace {, so str.format tries to parse what follows as a field. It takes "aaaUser" (quotes included) as the field name, finds no argument with that name, and raises the KeyError.
The string on which this method is called can contain literal text or replacement fields delimited by braces {}
To get literal braces while still replacing only the intended fields, double the braces:
'{{ Hey Escape }} {0}'.format(12) # O/P '{ Hey Escape } 12'
If you escape the enclosing (parent and grandparent) braces this way, it will work.
Example:
'{{Escape Me {n} }}'.format(n='Yes') # {Escape Me Yes}
So, following the str.format rule, I escape each enclosing brace by adding one extra brace.
"{{\"aaaUser\": {{\"attributes\": {{\"pwd\": \"{0}\", \"name\": \"{1}\"}}}}}}".format('password', 'username')
#O/P '{"aaaUser": {"attributes": {"pwd": "password", "name": "username"}}}'
There is another way to make string formatting work. It is not recommended in your case, because you must make sure the string always has exactly the format you mentioned; otherwise the result could change drastically.
The idea is to convert each placeholder from {0} to %(0)s, so that %-style string formatting works and never cares about braces.
'Hello %(0)s' % {'0': 'World'} # Hello World
So here I'm using re.sub to replace all occurrences:
import re
def myReplace(obj):
    # Turn a matched '{N}' placeholder into '%(N)s'
    found = obj.group(0)
    if found:
        found = found.replace('{', '%(')
        found = found.replace('}', ')s')
    return found
CREDENTIALS = re.sub(r'\{\d{1}\}', myReplace, "{\"aaaUser\": {\"attributes\": {\"pwd\": \"{0}\", \"name\": \"{1}\"}}}") % {'0': 'password', '1': 'username'}
print CREDENTIALS # It should print the desired result
I am using a SAX parser. I am trying to pass the 'content' I retrieve to another function using the code below.
After checking the startElement and endElement, I have the following:
def characters(self, content):
    text = format.formatter(content)
format.formatter is expected to read the data I pass as 'content', do processing such as removing junk characters, and return it. I do that using the string replace function:
remArticles = {' ! ':'', ' $ ':''}
for line in content:
    for i in remArticles:
        line = line.replace(i, remArticles[i])
    #FormattedFileForIndexing.write(line)
return line
However, the output is not what I expect.
It would be great if someone could help with this.
The source will be something like:
"Oh! That's lots and 1000s of $$$$"
Expected: Oh That's lot of 1000s
I have tried:
def characters(content):
    remArticles = {'!': '', '$': ''} # remove spaces from " ! "
    for i in remArticles:
        content = content.replace(i, remArticles[i])
    return content
But it didn't help; nothing gets replaced. I thought this worked earlier, but it does not seem to be working today:
Pass Content To Function of Another Module in Python
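One possible arrangement, offered as a sketch rather than a diagnosis of the code above: clean each chunk inside characters() and collect the results on the handler, since a SAX parser may deliver the text of one element in several characters() calls. It assumes Python 3; the names REMOVE_CHARS, clean_text and CleaningHandler are made up for illustration.

```python
import xml.sax

REMOVE_CHARS = "!$"  # characters treated as junk in this sketch


def clean_text(content):
    """Return content with every character in REMOVE_CHARS deleted."""
    return content.translate({ord(ch): None for ch in REMOVE_CHARS})


class CleaningHandler(xml.sax.ContentHandler):
    def __init__(self):
        super().__init__()
        self.chunks = []

    def characters(self, content):
        # The parser may call this several times per element,
        # so clean each chunk and keep it instead of returning it.
        self.chunks.append(clean_text(content))

    def text(self):
        return "".join(self.chunks)


handler = CleaningHandler()
xml.sax.parseString(b"<doc>Oh! That's lots and 1000s of $$$$</doc>", handler)
print(handler.text())
```

The return value of characters() is ignored by the parser, which is why the cleaned text is stored on the handler here.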
I'm trying to read the words on a line that follow a matched substring.
To be exact, I have a file with the text below:
-- Host: localhost
-- Generation Time: Nov 15, 2006 at 09:58 AM
-- Server version: 5.0.21
-- PHP Version: 5.1.2
I want to check whether the file contains the substring 'Server version:' and, if it does, read the characters that follow it up to the end of the line, in this case '5.0.21'.
I tried the following code, but it gives the next line (-- PHP Version: 5.1.2) instead of the next word (5.0.21).
with open('/root/Desktop/test.txt', 'r+') as f:
    for line in f:
        if 'Server version:' in line:
            print f.next()
You are using f.next(), which returns the next line.
Instead you need:
with open('/root/Desktop/test.txt', 'r+') as f:
    for line in f:
        found = line.find('Server version:')
        if found != -1:
            version = line[found+len('Server version:')+1:]
            print version
You might want to replace that text like this:
if 'Server version: ' in line:
    print line.rstrip().replace('-- Server version: ', '')
We call line.rstrip() because the line read from the file ends with a newline, which we strip.
Might be overkill, but you could also use the regular expressions module re:
import re
match = re.search("Server version: (.+)", line)
if match: # found a line matching this pattern
    print match.group(1) # whatever was matched for (.+)
The advantage is that you need to type the key only once, but of course you can have the same effect by wrapping any of the other solutions into a function definition. Also, you could do some additional validation.
You can try using the split method on strings, using the string to remove (i.e. 'Server version: ') as separator:
if 'Server version: ' in line:
    print line.split('Server version: ', 1)[1]
Since you have
line = '-- Server version: 5.0.21'
you can just use:
line.split()[-1]
This gives you the last word rather than all the characters after the colon.
If you want all the characters after the colon:
line.split(':', 1)[-1].strip()
Replace ':' with another string as needed.
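As one of the answers suggests, any of these approaches can be wrapped in a function so the key is typed only once. A minimal sketch in Python 3 syntax, using str.partition; the function name find_server_version and the in-memory sample lines are assumptions for illustration:

```python
def find_server_version(lines, key='Server version:'):
    """Return the text after key on the first line containing it, else None."""
    for line in lines:
        # partition() splits at the first occurrence of key;
        # sep is an empty string when key is not on the line.
        _before, sep, after = line.partition(key)
        if sep:
            return after.strip()
    return None


sample = [
    '-- Host: localhost\n',
    '-- Generation Time: Nov 15, 2006 at 09:58 AM\n',
    '-- Server version: 5.0.21\n',
    '-- PHP Version: 5.1.2\n',
]
version = find_server_version(sample)
print(version)
```

An open file object can be passed in place of the list, since iterating over a file also yields one line at a time.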
I'm editing a script which gets the name of a video from its URL, and the line of code that does this is:
title = unicode(entry.title.text, "utf-8")
It can be found here. Is there a simple way to add a predefined prefix before this?
For example, if there is a YouTube video named "test", the script should show "Testing Videos: test".
Just prepend a unicode string:
title = u'Testing Videos: ' + unicode(entry.title.text, "utf-8")
or use string formatting for more complex options; like adding both a prefix and a postfix:
title = u'Testing Videos: {} (YouTube)'.format(unicode(entry.title.text, "utf-8"))
All that unicode(inputvalue, codec) does is decode a byte string to a unicode value; you are free to concatenate that with other unicode values, including unicode literals.
An alternative spelling would be to use the str.decode() method on the entry.title.text object:
title = u'Testing Videos: ' + entry.title.text.decode("utf-8")
but the outcome would be the same.
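For readers on Python 3, where unicode() no longer exists, the same step might look like the sketch below. raw_title is a stand-in for entry.title.text and is assumed here to hold UTF-8 encoded bytes:

```python
# Hypothetical stand-in for entry.title.text, assumed to be UTF-8 bytes.
raw_title = 'test'.encode('utf-8')

# Decode the bytes to str, then concatenate with the prefix.
title = 'Testing Videos: ' + raw_title.decode('utf-8')
print(title)
```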
For example, I have a file a.js whose content is:
Hello, 你好, bye.
It contains two Chinese characters whose unicode escape form is \u4f60\u597d
I want to write a Python program which converts the Chinese characters in a.js to their unicode escape form and writes the output to b.js, whose content should be: Hello, \u4f60\u597d, bye.
My code:
fp = open("a.js")
content = fp.read()
fp.close()
fp2 = open("b.js", "w")
result = content.decode("utf-8")
fp2.write(result)
fp2.close()
But it seems that each Chinese character is still a single character, not the ASCII escape sequence I want.
>>> print u'Hello, 你好, bye.'.encode('unicode-escape')
Hello, \u4f60\u597d, bye.
But you should consider using JSON, via json.
You can try the codecs module:
codecs.open(filename, mode[, encoding[, errors[, buffering]]])
a = codecs.open("a.js", "r", "cp936").read() # a is a unicode object
codecs.open("b.js", "w", "utf16").write(a)
There are two ways to do this.
First, use the encode method:
str1 = "Hello, 你好, bye. "
print(str1.encode("raw_unicode_escape"))
print(str1.encode("unicode_escape"))
Also you can use 'codecs' module:
import codecs
print(codecs.raw_unicode_escape_encode(str1))
I found that repr(content.decode("utf-8")) will return "u'Hello, \u4f60\u597d, bye'",
so repr(content.decode("utf-8"))[2:-1] will do the job.
you can use repr:
a = u"Hello, 你好, bye. "
print repr(a)[2:-1]
or you can use encode method:
print a.encode("raw_unicode_escape")
print a.encode("unicode_escape")
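For a whole-file conversion on Python 3, one option is the backslashreplace error handler, which escapes only the non-ASCII characters and leaves newlines alone (unicode_escape would turn them into a literal \n). The sketch below is an assumption-laden example: it works in a temporary directory, and the helper name escape_non_ascii is made up. Characters between U+0080 and U+00FF come out as \xNN and characters above U+FFFF as \UNNNNNNNN, so it fits the question's input but is not a general JavaScript escaper.

```python
import os
import tempfile


def escape_non_ascii(text):
    """Replace each non-ASCII character with a backslash escape."""
    return text.encode('ascii', errors='backslashreplace').decode('ascii')


# Work in a temporary directory so the example is self-contained.
workdir = tempfile.mkdtemp()
src = os.path.join(workdir, 'a.js')
dst = os.path.join(workdir, 'b.js')

with open(src, 'w', encoding='utf-8') as f:
    f.write('Hello, 你好, bye.\n')

with open(src, encoding='utf-8') as f:
    content = f.read()

escaped = escape_non_ascii(content)
with open(dst, 'w', encoding='ascii') as f:
    f.write(escaped)

print(escaped)
```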