I am using argparse to take a list of input files:
import argparse
p = argparse.ArgumentParser()
p.add_argument("infile", nargs='+', type=argparse.FileType('r'), help="copy from")
p.add_argument("outfile", help="copy to")
args = p.parse_args()
However, this opens the door for the user to pass in prog /path/to/* outfile, where the source directory could potentially have millions of files, and the shell expansion can overrun the parser. My questions are:
is there a way to disable the shell expansion (*) within Python?
if not, is there a way to put a cap on the number of input files before they are assembled into a list?
(1) No, the shell expansion is done by the shell. By the time Python runs, the command line has already been expanded. Quoting the argument ("*" or '*') will deactivate the expansion, but that also happens in the shell.
(2) Yes, get the length of sys.argv early in your code and exit if it is too long.
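For example, a minimal sketch of such a cap, assuming the positional layout from the question (the limit of 1000 is an arbitrary value for illustration):
import sys
MAX_INFILES = 1000  # arbitrary cap, tune to your needs
# sys.argv is [prog, infile1, ..., infileN, outfile]
if len(sys.argv) - 2 > MAX_INFILES:
    sys.exit("too many input files (limit is %d)" % MAX_INFILES)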
Also most shells have a built-in limit to the expansion.
If you are concerned about too many infile values, don't use FileType.
p.add_argument("infile", nargs='+', help="copy from")
Just accept a list of file names. That's not going to cost you much. Then you can open and process just as many of the files as you want.
FileType opens the file when the name is parsed. That is ok for a few files that you will use right away in a small script. But usually you don't want, or need, to have all those files open at once. In modern Python you are encouraged to open files in a with context, so they get closed right away (instead of hanging around till the script is done).
FileType handles the '-' (stdin) value. And it will issue a nice error report if it fails to open a file. But is that what you want? Or would you rather process each file yourself, skipping over the bad names?
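For example, a minimal sketch of the names-only approach, opening each file in a with context and skipping bad names (process() is a hypothetical placeholder for your own logic):
for name in args.infile:
    try:
        with open(name) as f:
            process(f)  # hypothetical placeholder for your own logic
    except OSError as e:
        print("skipping %s: %s" % (name, e))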
Overall FileType is a convenience, but generally a poor choice in serious applications.
Something else to be worried about: outfile is the last of a (potentially) long list of files, the '+' input ones and 1 more. argparse accepts that, but it could give problems. For example, what if the user forgets to provide an outfile? Then the last of the input files will be used as the outfile. That error could result in unintentionally overwriting a file. It may be safer to use '-o', '--outfile', making the user explicitly mark the outfile. And the user could give it first, so they don't forget.
In general '+' and '*' positionals are safest when used last.
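A minimal sketch of the flagged-outfile layout suggested above (required=True is just one way to turn a forgotten outfile into an explicit error):
import argparse
p = argparse.ArgumentParser()
p.add_argument("-o", "--outfile", required=True, help="copy to")
p.add_argument("infile", nargs='+', help="copy from")
args = p.parse_args()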
I am generating an ImageMagick bash command using Python. Something like
import subprocess
input_file = "hello.png"
output_file = "world.jpg"
subprocess.run(["convert", input_file, output_file])
where there might be more arguments before input_file or output_file. My question is: if either of the filenames is user-provided and the user supplies a filename that can be parsed as a command line option for ImageMagick, isn't that unsafe?
If the filename starts with a dash, ImageMagick indeed could think that this is an option instead of a filename. Most programs - including, AFAIK, the ImageMagick command line tools - follow the convention that a double dash (--) denotes the end of the options. If you do a
subprocess.run(["convert", "--", input_file, output_file])
you should be safe in this respect.
From the man page (and a few tests), convert requires an input file and an output file. If you only allow two tokens and if a file name is interpreted as an option then convert is going to miss at least one of the files, so you'll get an ugly message but you should be fine.
Otherwise you can prefix any file name that starts with - with ./ (except - itself, which means stdin or stdout depending on position), so that it becomes an unambiguous file path to the same file.
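A minimal sketch of that prefixing, continuing the question's snippet (safe_path is a hypothetical helper, not part of any library):
def safe_path(name):
    # leave '-' alone (stdin/stdout), disarm anything else starting with '-'
    if name != "-" and name.startswith("-"):
        return "./" + name
    return name
subprocess.run(["convert", safe_path(input_file), safe_path(output_file)])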
I am trying to convert a .pdf file into several .png files using Ghostscript in Python. The other answers on here were pretty old hence this new thread.
The following code was given as an example on pypi.org of the 'high level' interface, and I am trying to model my code after the example code below.
import sys
import locale
import ghostscript
args = [
    "ps2pdf",  # actual value doesn't matter
    "-dNOPAUSE", "-dBATCH", "-dSAFER",
    "-sDEVICE=pdfwrite",
    "-sOutputFile=" + sys.argv[1],
    "-c", ".setpdfwrite",
    "-f", sys.argv[2]
]
# arguments have to be bytes, encode them
encoding = locale.getpreferredencoding()
args = [a.encode(encoding) for a in args]
ghostscript.Ghostscript(*args)
Can someone explain what this code is doing? And can it be used somehow to convert a .pdf into .png files?
I am new to this and am truly confused. Thanks so much!
That's calling Ghostscript, obviously. From the arguments it's not spawning a process; it's linked (either dynamically or statically) to the Ghostscript library.
The args are Ghostscript arguments. These are documented in the Ghostscript documentation, which you can find online. Because the interface mimics the command line, where the first argument is the name of the calling program, the first argument here is meaningless and can be anything you want (as the comment says).
The next three arguments turn on SAFER (which prevents some potentially dangerous operations and is now the default anyway), NOPAUSE (so the entire input is processed without pausing between pages), and BATCH (so that on completion Ghostscript exits instead of returning to the interactive prompt).
Then it selects a device. In Ghostscript (due to the PostScript language) devices are what actually output stuff. In this case the device selected is the pdfwrite device, which outputs PDF.
Then there's the OutputFile; you can probably guess that this is the name (and path) of the file where the output is to be written.
The next 3 arguments, -c .setpdfwrite -f, are frankly archaic and pointless. They were once recommended when using the pdfwrite device (and only the pdfwrite device) but they have no useful effect these days.
The very last argument is, of course, the input file.
Certainly you can use Ghostscript to render PDF files to PNG. You want to use one of the PNG devices; there are several, depending on what colour depth you want to support. Unless you have some strange requirement, just use png16m. If your input file contains more than one page you'll want to set the OutputFile to use %d so that it writes one file per page.
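For example, a minimal sketch adapting the question's argument list to PNG output (the -r150 resolution and the file names are arbitrary choices for illustration; note that -c .setpdfwrite is dropped, since it only ever applied to pdfwrite):
args = [
    "gs",  # actual value doesn't matter
    "-dNOPAUSE", "-dBATCH", "-dSAFER",
    "-sDEVICE=png16m",
    "-r150",  # rendering resolution in dpi
    "-sOutputFile=page-%d.png",  # %d is replaced by the page number
    "-f", "input.pdf",
]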
More details on all of this can, of course, be found in the documentation.
I'm working on a PDF generator project. The goal is to have a program that takes document files and generate a PDF file. I'm having trouble in finding a way to input a file into the program to be converted.
I started out by using the input function, where I input the file in the terminal. As a test, I wanted to input, open, read, and print a csv file containing US zipcode data. The rest of the program opens, reads and prints out some of the data. Here is the code:
import csv
file = input("Drop file here: ")
with open(file, 'r', encoding='utf8') as zf:
    rf = csv.reader(zf, delimiter=',')
    header = next(rf)
    data = [row for row in rf]
print(header)
print(data[1])
print(data[10])
print(data[100])
print(data[1000])
When I ran this in the terminal and input the file, this error appeared: TypeError: 'encoding' is an invalid keyword argument for this function.
Is there a better way I can code a program to input a file so it can be open and converted into a PDF?
There is more going on here, and as was mentioned in the comments, in this case it is very relevant which version of Python you are using. A bit more of the back story:
The input built-in has a different meaning in Python 2 (https://docs.python.org/2.7/library/functions.html#input) and Python 3 (https://docs.python.org/3.6/library/functions.html#input). In Python 2 it reads the user input and tries to evaluate it as Python code, which is unlikely to be what you actually wanted.
Then, as pointed out, the arguments of open are different as well (https://docs.python.org/2.7/library/functions.html#open and https://docs.python.org/3.6/library/functions.html#open); Python 2's open does not accept an encoding keyword, which is where your TypeError comes from.
In short, as suggested by @idlehands, if you have both versions installed, try calling python3 instead of python and this code should actually run.
Recommendation: I would suggest not using interactive input like this at all (unless there is a good reason to) and instead letting the desired filename be passed in from outside. I'd opt for argparse (https://docs.python.org/3.6/library/argparse.html#module-argparse) in this case, which very comfortably gives you great flexibility, for instance myscript.py:
#!/usr/bin/env python3
import argparse
import sys
parser = argparse.ArgumentParser(description='My script to do stuff.')
parser.add_argument('-o', '--output', metavar='OUTFILE', dest='out_file',
                    type=argparse.FileType('w'), default=sys.stdout,
                    help='Resulting file.')
parser.add_argument('in_file', metavar='INFILE', nargs="?",
                    type=argparse.FileType('r'), default=sys.stdin,
                    help='File to be processed.')
args = parser.parse_args()
args.out_file.write(args.in_file.read())  # replace with actual action
This gives you the ability to run the script as a pass-through (piping stuff in and out), to work on specified file(s), and to explicitly use - to denote that stdin/stdout is to be used. argparse also gives you command line usage/help for free.
You may want to tweak the specifics for different behavior, but bottom line, I'd still go with a command line argument.
EDIT: I should add one more comment for consideration. I'd write the actual code performing the wanted action (a function or a more complex object) so that it exposes its ins/outs through its interface, and then write the command line handling to gather these bits and call my action code with them. That way you can reuse it from another Python script easily, or write a GUI for it should you need/want to.
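A minimal sketch of that separation (convert() is a hypothetical stand-in for the real work, and parser is the one defined above):
def convert(in_file, out_file):
    # the actual action, reusable from another script or a GUI
    out_file.write(in_file.read())

def main():
    args = parser.parse_args()  # parser as defined above
    convert(args.in_file, args.out_file)

if __name__ == '__main__':
    main()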
I am using rsync command with python like this:
import subprocess
rsync_out = subprocess.Popen(['sshpass', '-p', password, 'rsync', '--recursive', source],
                             stdout=subprocess.PIPE)
command = subprocess.Popen(('grep', '\.'), stdin=rsync_out.stdout, stdout=subprocess.PIPE).communicate()[0]
The purpose of using grep is to display files like this:
rathi/20090209.02s1.2_sequence.txt
rathi/20090729.02s4.2_sequence.txt.gz
rathi/Homo_sapiens_UCSC_hg19.tar.gz
rathi/hello/ok.txt
instead of
rathi
rathi/20090209.02s1.2_sequence.txt
rathi/20090729.02s4.2_sequence.txt.gz
rathi/Homo_sapiens_UCSC_hg19.tar.gz
hello
rathi/hello/ok.txt
It works fine except if a directory name has a '.' in it.
If there's a directory named hello.v1 then the output would be:
rathi/hello.v1
rathi/hello.v1/ok.txt
Since hello.v1 is a directory name I only want to show like this:
rathi/hello.v1/ok.txt
How can I do this?
Personally I wouldn't bother using grep; I'd simply use Python's own string filtering - however, that wasn't the question you asked.
Since the filenames are remote and Python sees them simply as strings, we can't use any of Python's own file manipulation routines (e.g. os.path.isdir()). So, I think you have three basic approaches:
1. Split each string by slashes and use this to build your own representation of the filesystem tree in memory. Then, do a pass through the tree and only display leaf nodes (i.e. files).
2. If you can assume that files within a directory are always listed immediately after that directory, then you can do a quick check against the previous entries to see if this entry is a file within one of those directories.
3. Use meta-information from rsync.
I would suggest the third option. My experience with rsync is that it usually gives you full file information like this:
drwxr-xr-x 4096 2013/06/14 17:19:13 tmp/t
-rwxrwxr-x 14532 2013/06/14 17:17:23 tmp/t/a.out
-rwxrwxr-x 14539 2013/06/14 17:19:13 tmp/t/static-order
In your example I can't see any code which removes this additional information, and you could easily use this to filter out directories by looking for any line which starts with a d instead of a -.
If you don't have this extended information, you'll need to do one of the other two. The first option is pretty simple - just split by slashes and then descend a standard tree structure, adding entries for directories and files which haven't been seen yet. Once all the entries have been parsed, you can traverse the tree and print out anything which is a node with no children.
The second option is somewhat more complicated, but more memory efficient: you maintain a list of parent directories and check whether they're a prefix of the current item in the list. If so, you can be sure the previous one is a directory and the current one is a file, so you can mark the previous one as something not to show. You can also throw items off this list once you've recursed "out" of that directory, provided that rsync returns them in a predictable order. You have to make sure you only check for prefixes at slash boundaries (so foo/dir is not a parent of foo/dir-bar, but it is a parent of foo/dir/bar). Generally this approach is rather fiddly, and unless you're dealing with an awfully large directory tree then one of the other approaches is probably preferable.
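A minimal sketch of that slash-boundary check, under the assumption that rsync lists each directory immediately before its contents (depth-first order):
def leaf_entries(paths):
    # yield only entries that are not the parent directory of the next entry
    prev = None
    for path in paths:
        # the appended "/" keeps foo/dir from matching foo/dir-bar
        if prev is not None and not path.startswith(prev + "/"):
            yield prev  # prev has nothing below it, so treat it as a file
        prev = path
    if prev is not None:
        yield prev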
By the way, either of the pure string-based approaches also has the disadvantage that an empty directory will be indistinguishable from a file, since it's only the presence or absence of files within a directory which distinguishes them. This is another reason I suggest using the meta-information from rsync.
EDIT
As requested, an example using the rsync meta-data:
import subprocess
cmdline = ["rsync", "-e", "ssh", "-r", "user@host:/dir"]
proc = subprocess.Popen(cmdline, stdout=subprocess.PIPE)
for entry in proc.stdout:
    items = entry.strip().split(None, 4)
    if not items[0].startswith("d") and "." in items[4]:
        print items[4]
In this example, I'm invoking rsync directly and having it use ssh, assuming that appropriate SSH keys are set up. I would strongly suggest using SSH keys instead of the sshpass utility - storing your passwords in plaintext is a really bad idea from a security perspective. You can always set up your keys with no passphrase if you're not worried about them being stolen. There are lots of pages which explain how to create SSH keys (this one, for example).
Replace user, host and /dir with your username on the remote machine, the hostname of the remote machine and the parent directory you wish to list (you can omit /dir if you want to list the user's home directory). Otherwise the code should run unmodified. It will print the path name of each file that it finds, skipping directories and items which don't contain a dot. If your dot filter was just an attempt to skip directories as well, you can omit the 'and "." in items[4]' part.
EDIT 2
This example just prints the entries, but of course you'll presumably want to do something else. If you want to be really clever, you could write it as a generator which calls yield on the items as they crop up. I've got an example of this below, which also prints the items but you can see how it can be used to do anything else. I've also added some better error handling to make sure the use of subprocess can't deadlock:
EDIT 3: I've updated this example to also include file size and modification time. This is based on what I get back from my rsync - if yours has a different format you might need to use different members from items, or possibly change the format string passed to strptime() to match the format returned by your rsync.
from datetime import datetime
import os
import subprocess

def find_remote_files(hostspec):
    cmdline = ["rsync", "-e", "ssh", "-r", hostspec]
    with open(os.devnull, "w") as devnull:
        proc = subprocess.Popen(cmdline, stdout=subprocess.PIPE, stderr=devnull)
        try:
            for entry in proc.stdout:
                items = entry.strip().split(None, 4)
                if not items[0].startswith("d"):
                    dt = datetime.strptime(" ".join(items[2:4]),
                                           "%Y/%m/%d %H:%M:%S")
                    yield (int(items[1]), dt, items[4])
            proc.wait()
        except:
            # On any exception, terminate process and re-raise exception.
            proc.terminate()
            proc.wait()
            raise

for filesize, filedate, filename in find_remote_files("user@host:/dir"):
    print "Filename: %s" % (filename,)
    print "(%d bytes, modified %s)" % (filesize, filedate.strftime("%Y-%m-%d"))
You should be able to paste the whole find_remote_files() function into your code and use it directly, if you like.
Update/Solution: the answer is below, from Zack. The problem was, indeed, DOS line endings on the script file itself, clenotes.cmd. Since I futzed with the various files so much, I deleted the whole directory and then re-downloaded a fresh copy from HERE. I ran Zack's perl script on the file just like so:
perl -pi.bak -e 's/[ \t\r]+$//' clenotes.cmd
I then edited the command execution just slightly so that the final script became:
CWD=`dirname $0`
JYTHON_HOME="$CWD"
LIB_DIR="$JYTHON_HOME/lib"
NOTES_HOME="/opt/ibm/lotus/notes/"
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$NOTES_HOME
java -cp "$LIB_DIR" -jar "$LIB_DIR/jython.jar" -Djython.home="$CWD/" -Dpython.path="$LIB_DIR:$CWD/ext" -Djava.library.path="$NOTES_HOME" "$LIB_DIR/clenotes/cletes/clenotes.py" "$#"
That was it -- everything else worked. No edits needed to clenotes.py or clenotes.cfg. Many thanks for sticking with the question, which I guess ended up being quite simple.
Update: I'm cutting down on some of the code to make this more readable and remove unnecessary information from the post.
I'm trying to get Lotus Notes command line to work on Linux and am having an issue with something related to sys.argv[1:] in the python file. The windows script is here:
@echo off
@setlocal
set CWD=%~dp0
set JYTHON_HOME=%CWD%
set LIB_DIR=%JYTHON_HOME%/lib
java -cp %LIB_DIR% -jar %LIB_DIR%/jython.jar -Djython.home=%CWD% -Dpython.path=%LIB_DIR%;%CWD%/ext %LIB_DIR%/clenotes/clenotes.py %*
@endlocal
I was having a tough time with variables, so for Linux, it simply looks like this:
java -cp ./lib/ -jar ./lib/jython.jar -Djython.home=./ -Dpython.path=./lib:./ext -Djava.library.path=/opt/ibm/lotus/notes/ ./lib/clenotes/clenotes.py $*
I run it from within the directory. In any case, what puzzles me is that it's not picking up any options I pass from the command line. clenotes.cmd --help results in
No commands specified. Use --help option for usage.
Here is the section where the command line arguments are supposed to be parsed:
def main():
    Output.log("Entering %s v%s" % (PROGRAM_NAME,VERSION),Output.LOGTYPE_DEBUG)
    cliOptions2=[]
    for opt in cliOptions:
        opt2=opt.replace('--','')
        opt2=opt2.replace('!','=')
        cliOptions2.append(opt2)
    opts=[]
    args=[]
    try:
        opts, args = getopt.getopt(sys.argv[1:], '', cliOptions2)
I'm using Python 3.1.3 on Arch Linux 64bit in a 32bit chroot environment. Can I provide anything else?
Just in case it's needed... HERE is the whole clenotes.py file.
Also, as requested in the comments, the config file (which contains the help message and viable options/arguments) is HERE
Update
After a lot of fiddling, the best progress I have made has been to examine what it's setting as opts and args in the main() method. Most surprising was that when passing an argument and then looking at its parsed result using print sys.argv, the option would come up with a trailing \r in it. For example:
clenotes.cmd appointments
args is ['appointments\r']
On Windows I did the same and args was reported as ['appointments']. Furthermore, manually setting args=['appointments'] and then commenting out the section where getopt.getopt is assigning a value worked.
Lastly, I've found that when using multiple arguments, n-1 of them get interpreted and used while the nth one gets ignored. This is kind of a workaround since I can actually use the script... but obviously it's not preferred. If I want to look at today's appointments, I can execute clenotes.cmd appointments --today --today and it will work. sys.argv will spit out: ['appointments', '--today', '--today\r'].
So... what's causing the trailing \r? I'm thinking it has to do with the actual script. Note it again:
java -cp ./lib/ -jar ./lib/jython.jar -Djython.home=./ -Dpython.path=./lib:./ext -Djava.library.path=/opt/ibm/lotus/notes/ ./lib/clenotes/clenotes.py $*
So... bunch of path stuff and then the actual python file: clenotes.py $*
I got the $* from HERE
Is it picking up the carriage return??
I think your problem is that clenotes.cfg has DOS line endings, which Python is misinterpreting. Try changing this line of clenotes.py
config.readfp(open('%sconfig/clenotes.cfg' % System.getProperty('jython.home')))
to read
config.readfp(open('%sconfig/clenotes.cfg' % System.getProperty('jython.home'), "rU"))
The "rU" tells Python that even though it's running on a Unix system it should be prepared to cope with a file containing DOS line endings. See http://docs.python.org/library/functions.html#open -- scroll down to the paragraph that begins "In addition to the standard fopen() modes...".
(Or you could run this command: perl -pi.bak -e 's/[ \t\r]+$//' clenotes.cfg -- that will convert it to Unix line endings. In your shoes I would probably do both.)
(If neither of the above suggestions helps, the next thing I would try is hitting clenotes.py itself with the above perl command. I don't see how that could be the problem, but if the \r characters are not coming from clenotes.cfg, the .py file is the only plausible remaining source.)
(EDIT: Based on your comments on the question itself, I now think it's clenotes.cmd, the shell script wrapper, that needs to be converted from DOS to Unix line endings.)
I'll have to keep looking to figure out where that \r is coming from. But in the meanwhile, this problem has become much simpler. Once the args are parsed, do this:
args = [arg.strip() for arg in args]
That will get rid of the \r
EDIT: But wait -- is this only a partial solution? Is it still not parsing options correctly?
EDIT2: Seems like the \r needs to be stripped earlier. When there's no command, the \r never gets stripped, because the above only strips \r after getopt is done. This should have been obvious to me before -- instead of passing sys.argv[1:] here
opts, args = getopt.getopt(sys.argv[1:], '', cliOptions2)
modify it first
argv = [arg.strip() for arg in sys.argv[1:]]
opts, args = getopt.getopt(argv, '', cliOptions2)
You could also just do sys.argv[-1] = sys.argv[-1].strip()... but the C programmer in me starts to feel a bit queasy looking at that. Probably irrational, I know.
Or just do what Zack said and convert clenotes.cmd to Unix format -- however, note that stripping here will ensure that other people will not have to solve the same problem over again. (On the other hand, it's a little ugly, or at least mysterious, to people not expecting such problems.)