Python encoding issue in script if string not hard-coded


I have an encoding issue with strings I get from an external source.
This source sends the strings encoded to me and I can decode them only if they are part of the script's code.
I've looked at several threads here and even some recommended tutorials (such as this one) but came up empty.
For example, if I run this:
python -c 'print "gro\303\237e"'
I get:
große
Which is the correct result.
But if I use it in a script, such as:
import sys
print sys.argv[1]
and call it like test.py "gro\303\237e", I get:
gro\303\237e
I intend to write the correct string to syslog, but I can't seem to get this to work.
Some data on my system:
- Python 2.7.10
- CentOS Linux
- LANG=en_US.UTF-8
- LC_CTYPE=UTF-8
I will appreciate any help, please let me know if you need more information.
Thanks!

If you really have the characters gro\303\237e, which is different from "gro\303\237e" (the first is the characters g r o \ 3 0 3 \ 2 3 7 e, the second is the characters g r o ß e), you can use decode("string_escape") as described in this SO answer.
Note that this is probably an encoding error on the part of whoever produced the data, so it may contain other errors that you cannot fix with this method.
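As a rough sketch of what that decoding step does, here is a Python 3 version (an assumption; the question uses Python 2.7, where the string_escape codec is available directly). The function name decode_escaped is made up for illustration, and the sketch assumes the argument is plain ASCII text containing backslash escapes.

```python
def decode_escaped(text):
    """Turn literal escapes such as \\303\\237 into the UTF-8 text they encode."""
    # Step 1: interpret the backslash escapes; each one becomes a code point 0-255.
    unescaped = text.encode("latin-1").decode("unicode_escape")
    # Step 2: map those code points back to raw bytes, then decode them as UTF-8.
    return unescaped.encode("latin-1").decode("utf-8")


if __name__ == "__main__":
    import sys
    # Hypothetical usage: python3 test.py 'gro\303\237e'
    if len(sys.argv) > 1:
        print(decode_escaped(sys.argv[1]))
```

If the input can contain characters outside Latin-1, the first encode step will fail, so this only suits the ASCII-plus-escapes case described in the question.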

This will work:
import sys
import ast
print ast.literal_eval('b"%s"' % sys.argv[1]).decode("utf-8")
But please read about literal_eval first to make sure it suits your needs (I think it should be safe to use but you should read and make sure).

Related

Removing a control character using Python

I have a script that processes the output of a command (the aws help cli command).
I step through the output line-by-line and don't start the actual real parsing until I encounter the text "AVAILABLE COMMANDS" at which point I set a flag to true and start further processing on each line.
I've had this working fine, but on Ubuntu we run into the following problem:
The CLI highlights the text in a way I have not seen before:
The output is very long, so I've grep'd the particular line in question - see below:
># aws ec2 help | egrep '^A'
>AVAILABLE COMMANDS
># aws ec2 help | egrep '^A' | cat -vet
>A^HAV^HVA^HAI^HIL^HLA^HAB^HBL^HLE^HE C^HCO^HOM^HMM^HMA^HAN^HND^HDS^HS$
What I haven't seen before is that each highlighted letter is in the format X^HX.
I'd like to apply a simple transformation of the type X^HX --> X (for all a-zA-Z).
What have I tried so far:
Well, my workaround is this: first I remove control characters like this:
String = re.sub(r'[\x00-\x1f\x7f-\x9f]','',String)
but I still have to search for 'AAVVAAIILLAABBLLEE' which is totally ugly. I considered using a further regex to turn doubles to singles but that will catch true doubles and get messy.
I started writing a function with an iteration across a constructed list of alpha characters to translate as described, and I used hexdump to try to figure out the exact \x code of the control characters in question but could not get it working - I could remove H but not the ^.
I really don't want to use any additional modules because I want to make this available to people without them having to install extras. In conclusion, I have a workaround that is quite ugly, but I'm sure someone must know a quick and easy way to do this translation. It's odd that it only seems to show up on Ubuntu.

After looking at this a little further I was able to put in place a solution:
import re
from string import ascii_lowercase
from string import ascii_uppercase
def RemoveUbuntuHighlighting(String):
    # Replace every letter-backspace-letter triple (X\x08X) with the plain letter.
    for Char in ascii_uppercase + ascii_lowercase:
        Match = Char + '\x08' + Char
        String = re.sub(Match, Char, String)
    return String
I'm still a little confounded to see characters highlighted in the format (X\x08X), the arrangement does seem to repeat the same information unnecessarily.
The other thing I would point out to anyone not familiar with reading hexdump output is that each pair of bytes is shown swapped around with respect to the order in which they appear in the input.

A much simpler and more reliable fix is to replace a backspace and duplicate of any character.
I have also augmented this to handle underscores using the same mechanism (character, backspace, underscore).
String = re.sub(r'(.)\x08(\1|_)', r'\1', String)
Demo: https://ideone.com/yzwd2V
This highlighting was standard back when output was to a line printer; backspacing and printing the same character again would add pigmentation to produce boldface. (Backspacing and printing an underscore would produce underlining.)
Probably the AWS CLI can be configured to disable this by setting the TERM variable to something like dumb. There is also a utility, col, which can remove this formatting (try col -b; maybe see also colcrt). Though perhaps the best solution would really be to import the AWS Python code and extract the help message natively.
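To make the regex approach concrete, here is a minimal sketch that wraps the same substitution in a function and runs it on the line from the question. The function name strip_overstrike and the sample constant are made up for illustration; only the standard re module is used.

```python
import re

# A character, a backspace, then either the same character again (bold)
# or an underscore (underline).
OVERSTRIKE = re.compile(r'(.)\x08(\1|_)')


def strip_overstrike(text):
    """Collapse X<backspace>X and X<backspace>_ sequences to a plain X."""
    return OVERSTRIKE.sub(r'\1', text)


# The highlighted line from the question, with ^H written as \x08.
sample = ("A\x08AV\x08VA\x08AI\x08IL\x08LA\x08AB\x08BL\x08LE\x08E "
          "C\x08CO\x08OM\x08MM\x08MA\x08AN\x08ND\x08DS\x08S")
print(strip_overstrike(sample))
```

Because the pattern requires a backspace between the two characters, genuine doubles such as the MM in COMMANDS are left alone.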

base64 syntax in python is not working

this code works on the command line.
python -c 'import base64,sys; u,p=sys.argv[1:3]; print base64.encodestring("%s\x00%s\x00%s" % (u,u,p))' user pass
output is
dXNlcgB1c2VyAHBhc3M=
I am trying to get this to work in my script
test = base64.encodestring("{0}{0}{1}").format(acct_name,pw)
print test
output is
ezB9ezB9ezF9
Does anyone know what I am doing wrong?
Thank you.

You have a misplaced parenthesis. Instead of:
test = base64.encodestring("{0}{0}{1}").format(acct_name,pw)
(which first encodes "{0}{0}{1}" in base64 and then tries to substitute variables using format),
you should have
test = base64.encodestring("{0}{0}{1}".format(acct_name,pw))
(which first substitutes variables using format and then encodes in base64).
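For readers on Python 3, the same ordering (format first, then encode) might look like the sketch below. It assumes base64.b64encode in place of encodestring, which newer Python 3 releases no longer provide, and it uses the NUL-separated layout from the command-line version in the question. The function name build_token is made up for illustration.

```python
import base64


def build_token(acct_name, pw):
    """Format the credentials first, then base64-encode the result."""
    # Same layout as the command-line version: user NUL user NUL password.
    raw = "{0}\x00{0}\x00{1}".format(acct_name, pw)
    # b64encode works on bytes and returns bytes, so convert on both sides.
    return base64.b64encode(raw.encode("utf-8")).decode("ascii")


print(build_token("user", "pass"))
```

Unlike encodestring, b64encode does not append a trailing newline to its output.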

Thanks SZYM, I am all set. This is the code that gets it to work:
test = base64.encodestring("{0}\x00{0}\x00{1}".format(acct_name,pw))
It turns out the \x00 (NUL) separator is needed so that the program receiving the encoded string knows where the username stops and the password begins.
-ALF

Repeated errors while working with Portable Python [duplicate]

This question already has answers here:
Syntax error on print with Python 3 [duplicate]
(3 answers)
Closed 9 years ago.
Note: This was not answered by the question that was marked as the original. This is more than just a Python v2 vs v3 problem, which I explain in the comments below.
Original post:
I am trying to learn Python at work, so I am currently using Portable Python 3.2.1.1 (which will henceforth be referred to as PP). (I mention this because this problem doesn't happen at home when I use my Mac and regular Python.)
I am working through exercise 16 of Learning Python the Hard Way (http://learnpythonthehardway.org/book/ex16.html). I've heard this isn't the best learning tool, but I am a complete programming n00b and I'm a hands-on learner. If you have any better suggestions, I'm open!
The first few lines of the exercise read:
from sys import argv
script, filename = argv
print "We're going to erase %r." % filename
print "If you don't want that, hit CTRL-C (^C)."
My script is titled Ex16.py and the file I am using is Python.txt, and both of these are in the same folder as the PP .exes. I don't think that's necessary, but hoped maybe it would fix the problem... negative. When I press "Run" in PP, it doesn't work because argv requires you provide an argument when you start the script: python Ex16.py Python.txt
When I launch Python.exe (which, in PP is Portable-Python.exe), I get the standard Python prompt, >>>, but whatever I enter I get the same error message:
File "<stdin>", line 1
with whatever I've just tried repeated back to me with the marker to
indicate where the problem is. (has not been helpful so far)
SyntaxError: invalid syntax
I have tried typing the following at the >>> prompt:
python Ex16.py Python.txt
Ex16.py Python.txt
"%PATH%\Ex16.py" "%PATH%\Python.txt" (with the actual filepaths)
print 'hello world'
I just keep getting the same invalid syntax error over and over. Even a basic print command returned an invalid syntax error. The only one that triggered a different error was the one where I tried whole filepaths. That one returned:
File "<stdin>", line 1
SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in
position 2-3: truncated \UXXXXXXXX escape
Yes, I have Googled the crap outta both errors. I read that sometimes the problem is not doubling the backslashes, so I tried that, too, putting two \ where just one had been before in both filepaths. I even tried putting # -*- coding: utf-8 -*- at the beginning of the script, thinking maybe there was some unicode error. That, with the full filepaths, resulted in the same unicode error mentioned earlier.
Yes, I have checked that my code is matching that in the exercise.
Yes, this works at home on non-PP.
All this leads me to believe that the problem is probably in the way I'm trying to run the scripts in PP (but why won't print work?), but I haven't a clue what I'm doing wrong.
Thanks!

print is a function in Python 3:
print('my string with content and the like')
It is no longer supported as being a 'statement'. You might want to check out a list of things that changed from python2.x to python3.x (there's a number of incompatibilities). Also, you might be better off finding a tutorial using Python3.
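As a small illustration, the two print lines from the exercise could be written for Python 3 roughly as follows. The helper name erase_warning and the hard-coded filename are only there to keep the sketch self-contained; the exercise itself reads the filename from argv.

```python
def erase_warning(filename):
    """Build the first message from the exercise using %-formatting."""
    return "We're going to erase %r." % filename


# In Python 3, print is called like any other function, with parentheses.
print(erase_warning("Python.txt"))
print("If you don't want that, hit CTRL-C (^C).")
```

The %r formatting still works as before; only the print syntax changes.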

You have to type:
Portable-Python.exe Ex16.py Python.txt
at your command prompt. To get a command prompt, press WindowsKey-R, then type "cmd" and press enter. You should now be looking at something like c:\>. Navigate to your portable python installation by using the cd command.

Help me make my Python 2 code work in Python 3

import math,sys,time;i=0
while 1: sys.stdout.write("\r"+':(__)'[:3+int(round(math.sin(i)))]+'n'+':(__)'[3+int(round(math.sin(i))):]);sys.stdout.flush();time.sleep(.15);i+=0.5*math.pi
I wrote that simple program in Python 2 a long time ago and it worked fine but it has syntax errors in Python 3. I would greatly appreciate if someone could help me update it to be Python 3 compliant. Thanks.

I pasted your code in a file, saved it, then opened it in a Python shell:
In [10]: f=open('test2.py')
In [11]: content=f.read()
In [12]: content
Out[12]: '#!/usr/bin/env python\n# coding: utf-8\n\nimport math,sys,time;i=0\nwhile 1: sys.stdout.write("\\r"+\':(_\xe2\x80\x8b_)\'[:3+int(round(math.sin(\xe2\x80\x8bi)))]+\'n\'+\':(__)\'[3+int(ro\xe2\x80\x8bund(math.sin(i))):]);sys.s\xe2\x80\x8btdout.flush();time.sleep(.\xe2\x80\x8b15);i+=0.5*math.pi\n'
Notice the '\xe2\x80\x8b' bytes sprinkled here and there. These are ZERO WIDTH SPACE characters encoded in utf-8:
In [24]: print(repr(u'\N{ZERO WIDTH SPACE}'.encode('utf-8')))
'\xe2\x80\x8b'
This is why your code is giving rise to SyntaxErrors.
Just retype it (or copy the code below) and it will run in Python3:
import math, sys, time; i=0
while 1: sys.stdout.write('\r'+':(__)'[:3+int(round(math.sin(i)))]+'n'+':(__)'[3+int(round(math.sin(i))):]); sys.stdout.flush(); time.sleep(0.15); i+=0.5*math.pi

The problem has nothing to do with your Python version. You've got weird characters in your code.
I pasted it in Metapad and a bunch of ? showed up, I assume meaning unprintable character.
Just retype it and it will work fine, or find a text editor which will show those characters and delete them, or use Python to delete any non-printable characters.
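A minimal sketch of the last option, assuming Python 3 and assuming the unwanted characters are invisible Unicode format characters (category Cf, which includes ZERO WIDTH SPACE). The function name strip_invisible is made up for illustration.

```python
import unicodedata


def strip_invisible(source):
    """Drop Unicode format characters (category Cf) such as U+200B."""
    # Newlines and tabs belong to category Cc, so they are left untouched.
    return "".join(ch for ch in source if unicodedata.category(ch) != "Cf")


dirty = "':(_\u200b_)'[:3]\n"
print(repr(strip_invisible(dirty)))
```

To clean a whole file, read it as UTF-8 text, pass the contents through the function, and write the result back.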

Indeed, @agf is correct. There was a weird character between the underscores in the first (__). Corrected (and works fine with Python 3):
import math,sys,time;i=0
while 1: sys.stdout.write("\r"+':(__)'[:3+int(round(math.sin(i)))]+'n'+':(__)'[3+int(round(math.sin(i))):]);sys.stdout.flush();time.sleep(.15);i+=0.5*math.pi

Use 2to3 on your python installation. It comes standard (I think) with 2.7.2+

Unpredictable results from os.path.join in windows

So what I'm trying to do is to join something in the form of
os.path.join('C:\path\to\folder', 'filename')
Edit:
Actual code is :
filename = 'creepy_%s.pcl' % identifier
file = open(os.path.join(self.cache_dir, filename), 'w')
where self.cache_dir is read from a file using configobj (returns string) and in the particular case is '\Documents and Settings\Administrator\creepy\cache'
The first part is returned from a configuration file, using configobj. The second is a concatenation of 2 strings like: 'file%s' % name
When I run the application through the console in windows using the python interpreter installed, I get the expected result which is
C:\\path\\to\\folder\\filename
When I bundle the same application and the python interpreter (same version, 2.6) in an executable in windows and run the app the result is instead
C:\\path\\to\\folderfilename
Any clues as to what might be the problem, or what would cause such inconsistencies in the output ?

Your code is malformed. You need to double those backslashes or use a raw string; otherwise sequences such as \t and \f are read as tab and form feed characters.
os.path.join('C:\\path\\to\\folder', 'filename')
I don't know why it works in one interpreter and not the other, but your code will not be interpreted properly as is. The weird thing is that I'd have expected a different output, i.e. C:pathtofolder\filename.

It is surprising behavior. There is no reason it should behave in such a way.
Just to be cautious, you can change the line to the following.
os.path.join(r'C:\path\to\folder', 'filename')
Note the r'' raw string. A raw string literal cannot end in a single backslash, so the final separator is left for os.path.join to add.

Three things you can do:
Use double backslashes in your original string, 'C:\\path\\to\\folder'
Use a raw string, r'C:\path\to\folder'
Use forward-slashes, 'C:/path/to/folder'
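The three spellings can be compared with a short sketch. It uses ntpath, the standard-library module that os.path resolves to on Windows, so the result is the same on any platform; the variable names are illustrative.

```python
import ntpath  # the Windows flavour of os.path, importable on any platform

escaped = 'C:\\path\\to\\folder'   # doubled backslashes
raw = r'C:\path\to\folder'         # raw string
forward = 'C:/path/to/folder'      # forward slashes

# Without a raw string, \t and \f silently turn into tab and form feed.
broken = "C:\\path\to\folder"

print(repr(ntpath.join(escaped, 'filename')))
print(repr(ntpath.join(raw, 'filename')))
print(repr(ntpath.join(forward, 'filename')))
print(repr(broken))
```

The first two literals are the same string; the forward-slash form keeps its slashes and only the separator added by join is a backslash.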

I figured it out yesterday. As usual when things seem really strange, the explanation is very simple and most of the time involves you being stupid.
To cut a long story short, there were leftovers from some previous installations in dist-packages. The bundled interpreter loaded the module from there, but when I ran the Python script from the terminal, the module (newer version) in the current dir was loaded. Hence the "unpredictable" results.
