I am using Graphviz's dot to generate some SVG graphs for a web application. I call dot using Popen:
p = subprocess.Popen(u'/usr/bin/dot -Kfdp -Tsvg', shell=True,\
stdin=subprocess.PIPE, stdout=subprocess.PIPE)
str = u'long-unicode-string-i-want-to-convert'
(stdout,stderr) = p.communicate(str)
What happens is that the dot program throws errors like:
Error: not well-formed (invalid token) in line 1
... <tr><td cellpadding="4bgcolor="#EEE8AA"> ...
in label of node n260
That obvious error is most certainly NOT in the input string. In particular, if I save it to str.txt with utf-8 encoding and do
/usr/bin/dot -Kfdp -Tsvg < str.txt > myimg.svg
I get the desired output. The only 'special' thing about str is that it contains characters like the Danish øæå.
Right now I have no clue what I should do. The problem may very well be in dot, but it certainly seems to be triggered by Popen behaving differently from using < in the shell, and I have no idea where to begin. Any help or ideas for calling dot another way (besides writing all the data to a file and pointing dot at that!) would be much appreciated!
Sounds like you should be doing:
stdout, stderr = p.communicate(str.encode('utf-8'))
(except, of course, that you shouldn't shadow the builtin str.) The unicode type in Python holds unicode data, not UTF-8. If you want UTF-8, you need to explicitly encode it.
On top of that, there's no reason to use shell=True in that snippet, nor is the unicode literal passed to subprocess.Popen a particularly good idea (it just gets encoded to ASCII anyway.) And the backslash at the end is unnecessary -- Python knows the line is continued, because you have an open parenthesis that hasn't been closed yet. So, use:
p = subprocess.Popen(['/usr/bin/dot', '-Kfdp', '-Tsvg'],
stdin=subprocess.PIPE, stdout=subprocess.PIPE)
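Putting both fixes together, a minimal sketch (the string is a placeholder for your real dot source):

import subprocess

p = subprocess.Popen(['/usr/bin/dot', '-Kfdp', '-Tsvg'],
                     stdin=subprocess.PIPE, stdout=subprocess.PIPE)

dot_source = u'long-unicode-string-i-want-to-convert'  # placeholder graph source
stdout, stderr = p.communicate(dot_source.encode('utf-8'))  # encode to UTF-8 bytes first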
Related
Why is it that calling an executable via subprocess.call gives different results to subprocess.run?
The output of the call method is perfect - all new lines removed, formatting of the document is exactly right, '-' characters, bullets and tables are handled perfectly.
Running exactly the same command with the run method, however, and reading the output from stdout completely garbles the output: it is full of '\n', 'Â\xad', '\x97', '\x8f' characters, with spacing all over the place.
Here's the code I'm using:
Subprocess.CALL
result = subprocess.call(['/path_to_pdftotext', '-layout', '/path_to_file.pdf', '-'])
Subprocess.RUN
result = subprocess.run(['/path_to_pdftotext', '-layout', '/path_to_file.pdf', '-'],
                        stdout=subprocess.PIPE, stderr=subprocess.PIPE,
                        universal_newlines=True, encoding='utf-8')
I don't understand why the run method doesn't parse and display the file in the same way. I'd use call however I need to save the result of the pdftotext conversion to a variable (in the case of run: var = result.stdout).
I can go through and just identify all the unicode it's not picking up in run and strip it out but I figure there must just be some encoding / decoding settings that the run method changes.
EDIT
Having read a similarly worded question - I believe this is different in scope as I'm wanting to understand why the output is different.
I've run some tests.
Are you printing the content on the console? Try sending the text to a file with subprocess in both cases and see if it differs:
result = subprocess.call(['/path_to_pdftotext', '-layout', '/path_to_file.pdf', 'test.txt'])
result = subprocess.run(['/path_to_pdftotext', '-layout', '/path_to_file.pdf', 'test2.txt'])
and compare test.txt and test2.txt. In my case they are identical.
I suspect that the difference you are experiencing is not strictly related to subprocess, but to how the console represents the output in the two cases.
As said in the answer I linked in the comments, call():
It is equivalent to: run(...).returncode (except that the input and
check parameters are not supported)
That is, your result stores an integer (the returncode) and the output is printed straight to the console, which shows it with the correct encoding, newlines, etc.
With run() the result is a CompletedProcess instance. The CompletedProcess.stdout attribute is:
Captured stdout from the child process. A bytes sequence, or a string
if run() was called with an encoding or errors. None if stdout was not
captured.
So, being a bytes sequence or a string, Python represents it differently when echoed on the console, showing all the '\n', 'Â\xad', '\x97', '\x8f' escapes and so on.
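To see the difference for yourself, here is a small sketch (the paths are placeholders, as in the question):

import subprocess

# call(): the child's stdout goes straight to the terminal; the result is just the return code.
rc = subprocess.call(['/path_to_pdftotext', '-layout', '/path_to_file.pdf', '-'])

# run() with capture: stdout comes back into Python as data.
result = subprocess.run(['/path_to_pdftotext', '-layout', '/path_to_file.pdf', '-'],
                        stdout=subprocess.PIPE, encoding='utf-8')
print(result.stdout)        # printing the string looks like call()'s output
print(repr(result.stdout))  # the repr shows the '\n', '\x97', ... escapes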
I'm new to python and unicode is starting to give me headaches.
Currently I write to file like this:
my_string = "马/馬"
f = codecs.open(local_filepath, encoding='utf-8', mode='w+')
f.write(my_string)
f.close()
And when I open the file with e.g. Gedit, I can see something like this:
\u9a6c/\u99ac\tm\u01ce
While I'd like to see exactly what I've written:
马/馬
I've tried a few different variations, like writing my_string.decode() or my_string.encode('utf-8') instead of just my_string; I know those two methods are opposites, but I was not sure which one I needed. Neither worked anyway.
If I manually write these symbols to a text file, then read the file with Python and write what I've just read back to the same file and save, the symbols get turned into codes like \u9a6c. Not sure if this is important; I figured I'd mention it to help identify the problem.
Edit: the strings came from SQLAlchemy objects' repr method, which turned out to be where the problem lay. I didn't mention it because it just didn't occur to me that it could be related to the problem. Thanks again for your help!
From the comments it is now clear you are using either the repr() function or calling the object.__repr__() method directly.
Don't do that. You are writing debugging information to your file:
>>> my_string = u"马/馬"
>>> print repr(my_string)
u'\u9a6c/\u99ac'
The value produced is meant to be pastable back into a Python session so you can reproduce the exact same value, and as such it is ASCII-safe (so it can be used in Python 2 source code without encoding issues).
From the repr() documentation:
For many types, this function makes an attempt to return a string that would yield an object with the same value when passed to eval(), otherwise the representation is a string enclosed in angle brackets that contains the name of the type of the object together with additional information often including the name and address of the object.
Write the Unicode objects to your file directly instead; codecs.open() handles encoding to UTF-8 correctly if you do.
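For example, a minimal sketch (Python 2, matching the codecs.open() usage above):

import codecs

my_string = u"马/馬"

f = codecs.open('out.txt', encoding='utf-8', mode='w+')
f.write(my_string)          # writes the characters themselves, encoded as UTF-8
# f.write(repr(my_string))  # would write the debug form: u'\u9a6c/\u99ac'
f.close()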
A python module I am using provides a hook that allows capturing user keyboard input before it is sent to a shell terminal. The problem I am facing is that it captures input character-by-character, which makes capturing the input commands difficult when the user performs such things as backspacing or moving the cursor.
For example, given the string exit\x1b[4D\x1b[Jshow myself out, the following takes place:
>>> a = 'exit\x1b[4D\x1b[Jshow myself out'
>>> print(a)
show myself out
>>> with open('file.txt', 'w+') as f:
...     f.write(a)
>>> exit()
less file.txt
The less command shows the raw command (exit\x1b[4D\x1b[Jshow myself out), when in fact I would like it to be stored 'cleanly' as it is displayed when using the print function (show myself out).
Printing the result, or 'cat'ing the file shows exactly what I would want to be displayed, but I am guessing here that the terminal is transforming the output.
Is there a way to achieve a 'clean' write to file, either using some python module, or some bash utility? Surely there must be some module out there that can do this for me?
By default, less does not pass the control characters through to the terminal; it displays them in caret notation instead.
You can get around this with the -r command line option:
$ less -r file.txt
show myself out
From the manual:
-r or --raw-control-chars
Causes "raw" control characters to be displayed. The default is
to display control characters using the caret notation; for
example, a control-A (octal 001) is displayed as "^A". Warning:
when the -r option is used, less cannot keep track of the actual
appearance of the screen (since this depends on how the screen
responds to each type of control character). Thus, various display
problems may result, such as long lines being split in the wrong
place.
The raw control characters are sent to the terminal, which then interprets them as cat would.
As others have stated, you would need to interpret the characters yourself before writing them to a file.
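If you do want to resolve the escapes in Python before writing, one option is to replay the captured text through a terminal-emulator library such as pyte (a third-party package; the following is a sketch under that assumption, not something from the question):

import pyte  # pip install pyte

raw = 'exit\x1b[4D\x1b[Jshow myself out'

screen = pyte.Screen(80, 24)   # an emulated 80x24 terminal
stream = pyte.Stream(screen)
stream.feed(raw)               # replay the input, applying cursor moves and erases

# screen.display is the rendered screen; keep the non-empty lines.
clean = '\n'.join(line.rstrip() for line in screen.display if line.strip())

with open('file.txt', 'w') as f:
    f.write(clean)             # the file now contains: show myself out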
I am quite new to Python and I've run into an issue: I am dynamically retrieving a string from a dictionary, and it looks like this
files="eputilities/epbalancing_alb/referenced assemblies/model/cv6_xmltypemodel_xp2.cs"
I am unable to perform any actions on this particular file, as the path is being read as two different strings
eputilities/epbalancing_alb/referenced and assemblies/model/cv6_xmltypemodel_xp2.cs
since there is a space between referenced and assemblies.
I wanted to know how to treat this as one raw string, i.e. keep the space between the two words but handle the whole path as a single string.
I'm not able to figure this out, although there are several comments about it on the web.
Please do help.
Thanks
From the comments to the other answer, I understand that you want to execute some external tool and pass a parameter (a filename) to it. This parameter, however, has spaces in it.
I'd propose two approaches; in either case, I'd use subprocess, not os.system.
import subprocess
# Option 1
subprocess.call([path_to_executable, parameter])
# Option 2
subprocess.call("%s \"%s\"" % (path_to_executable, parameter), shell=True)
For me, both worked; please check whether they work for you as well.
Explanations:
Option 1 takes a list of strings, where the first string has to be the path to the executable and all others are interpreted as command-line arguments. As subprocess.call knows about each of these entities, it properly invokes the external program so that it understands that parameter is to be interpreted as one string with spaces - and not as two or more arguments.
Option 2 is different. With the keyword-argument shell=True we tell subprocess.call to execute the call through a shell, i.e., the first positional argument is "interpreted as if it was typed like this in a shell". But now, we have to prepare this string accordingly. So what would you do if you had to type a filename with spaces as a parameter? You'd put it between double quotes. This is what I do here.
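A safer variant of option 2 is to let the standard library do the quoting; shlex.quote (Python 3; pipes.quote in Python 2) escapes spaces and other shell metacharacters for you:

import shlex
import subprocess

# shlex.quote wraps/escapes the value so the shell sees one argument
subprocess.call("%s %s" % (path_to_executable, shlex.quote(parameter)),
                shell=True)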
Standard string building in Python works like this:
'%s foo %s' % (str_val_1, str_val_2)
So if I'm understanding you right, you either have a list of two strings or two separate string variables.
For the former, do this:
' '.join(my_list)
For the latter, do this:
'%s %s' % (string_1, string_2)
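Applied to the question's path, the join would look like this (a tiny sketch):

parts = ['eputilities/epbalancing_alb/referenced',
         'assemblies/model/cv6_xmltypemodel_xp2.cs']
path = ' '.join(parts)
# 'eputilities/epbalancing_alb/referenced assemblies/model/cv6_xmltypemodel_xp2.cs'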
I'm using minidom to parse an XML file and it threw an error indicating that the data is not well formed. I figured out that some of the pages have characters like ไà¸à¹€à¸Ÿà¸¥ &, causing the parser to hiccup. Is there an easy way to clean the file before I start parsing it? Right now I'm using a regular expression to throw away anything that isn't an alphanumeric character or the </> characters, but it isn't quite working.
Try
xmltext = re.sub(u"[^\x20-\x7f]+",u"",xmltext)
It will get rid of everything except 0x20-0x7F range.
You may start from \x01, if you want want to keep control characters like tab, line breaks.
xmltext = re.sub(u"[^\x01-\x7f]+",u"",xmltext)
Take a look at µTidyLib, a Python wrapper to TidyLib.
If you do need the data with the strange characters you could, instead of just stripping them, convert them to codes the XML parser can understand.
You could have a look at the unicodedata package, especially the normalize method.
I haven't used it myself, so I can't tell you all that much, but you could ask again here on SO if you decide you're going to convert and keep that data.
>>> import unicodedata
>>> unicodedata.normalize("NFKD" , u"ไภเฟล &")
u'a\u03001\u201ea\u0300 \u0327 a\u03001\u20aca\u0300 \u0327Y\u0308a\u0300 \u0327\xa5 &'
It looks like you're dealing with data that was saved with some encoding "as if" it were ASCII. XML files should normally be UTF-8, and SAX (the underlying parser used by minidom) should handle that, so it looks like something's wrong in that part of the processing chain. Instead of focusing on "cleaning up", I'd first try to make sure the encoding is correct and correctly recognized. Maybe a broken XML directive? Can you edit your Q to show the first few lines of the file, especially the <?xml ... directive at the very start?
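For example, a sketch of fixing the decoding instead of stripping data (ftfy is a third-party library for repairing mojibake; using it here is an assumption, not something from the question):

import ftfy  # pip install ftfy
from xml.dom import minidom

with open('page.xml', 'rb') as f:  # 'page.xml' is a placeholder path
    raw = f.read()

text = ftfy.fix_text(raw.decode('utf-8', errors='replace'))  # repair mis-decoded sequences
doc = minidom.parseString(text.encode('utf-8'))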
I'd throw out all non-ASCII characters, which can be identified by having the 8th bit (0x80) set (decimal 128..255, i.e. 0x80..0xFF).
You could read the file into a Python string named old_str.
Then perform a filter call in conjunction with a lambda expression:
import string
# Filtering on string.printable (rather than only string.ascii_letters) keeps
# digits, whitespace and markup like <> that the XML parser needs.
new_str = filter(lambda x: x in string.printable, old_str)
Parse new_str
Many ways exist to accomplish stripping non-ASCII characters from a string.
This question might be related: How to check if a string in Python is in ASCII?