Can't pass file handle to subprocess - python

I created a file in the current directory with echo "foo" > foo. I then tried to pass that file to subprocess.run, but I seem to misunderstand how file paths are handled in Python, since I'm getting an error. What's wrong?
My test code
with open('foo') as file:
    import subprocess
    subprocess.run(['cat', file])
yields
TypeError: expected str, bytes or os.PathLike object, not _io.TextIOWrapper
What is a PathLike object? How do I get it from open('foo')? Where can I find more information about how files are handled in Python?

There's no need to open the file in the first place. You can simply run
import subprocess
subprocess.run(['cat', 'foo'])
The cat command is run by your machine as an external command, so you can just pass the file name as a string.
Python does not handle the file at all. The point of subprocess is to pass a command to the underlying system (in this case, apparently a UNIX-based OS). All you are doing is passing a plain-text command to the command line.
I won't, however, discourage you from reading about file handling. Look at this documentation.

PathLike object: docs
How to get it from the open call's return value:
Use the name attribute:
subprocess.run(['cat', file.name])
Learn about python files: Reading and writing files
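As an aside, a pathlib.Path is one example of an os.PathLike object, so on a POSIX system a sketch along these lines should also work (reusing the question's file name 'foo'):
import pathlib
import subprocess

path = pathlib.Path('foo')       # pathlib.Path implements os.PathLike
subprocess.run(['cat', path])    # run() accepts str, bytes or path-like objects here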

Related

Finding which files are being read from during a session (python code)

I have a large system written in Python. When I run it, it reads all sorts of data from many different files on my filesystem. There are thousands of lines of code and hundreds of files, most of which are not actually being used. I want to see which files are actually being accessed by the system (Ubuntu), and hopefully where in the code they are being opened. Filenames are decided dynamically using variables etc., so the actual filenames cannot be determined just by looking at the code.
I have access to the code, of course, and can change it.
I'm trying to figure out how to do this efficiently, with minimal changes in the code:
1. Is there a Linux way to determine which files are accessed, and at what times? This might be useful, although it won't tell me where in the code this happens.
2. Is there a simple way to make an "open file" command also log the file name, time, etc. of the opened file? Hopefully without having to go into the code and change every open command; there are many of them, and some are not used at runtime.
Thanks
You can trace file accesses without modifying your code, using strace.
Either start your program under strace, like this:
strace -f -e trace=file python your_program.py
or attach strace to an already running process, like this:
strace -f -e trace=file -p <PID>
For 1: you can use
ls -la /proc/<PID>/fd
replacing <PID> with your process ID.
Note that this lists all open file descriptors; some of them are stdin, stdout and stderr, and there are often other things, such as open websockets (which also use file descriptors), but filtering the list for regular files should be easy.
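If you would rather get the same list from inside the Python process itself, here is a minimal sketch (Linux only, since it relies on /proc):
import os

# list the files this process currently has open (Linux only)
fd_dir = '/proc/self/fd'
for fd in os.listdir(fd_dir):
    try:
        print(fd, '->', os.readlink(os.path.join(fd_dir, fd)))
    except OSError:
        pass  # the descriptor may already have been closed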
For 2: see the great solution proposed here:
Override python open function when used with the 'as' keyword to print anything
i.e. overriding the open function with your own, which can include the additional logging.
One possible method is to "overload" the open function. This will have many effects that depend on the code, so I would do that very carefully if needed, but basically here's an example:
>>> _open = open
>>> def open(filename):
...     print(filename)
...     return _open(filename)
...
>>> open('somefile.txt')
somefile.txt
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "<stdin>", line 3, in open
FileNotFoundError: [Errno 2] No such file or directory: 'somefile.txt'
As you can see, my new open function calls the original open (saved as _open) but first prints out its argument (the filename). This can be made more sophisticated to log the filename if needed; the most important thing is that this needs to run before any use of open in your code.
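For instance, a rough sketch of a logging wrapper along those lines; patching builtins.open makes the replacement visible in other modules too, and the log file name open_log.txt is only an illustration:
import builtins
import datetime

_open = builtins.open  # keep a reference to the real open

def logged_open(file, *args, **kwargs):
    # append a timestamp and the filename to a log, then delegate to the real open
    with _open('open_log.txt', 'a') as log:  # hypothetical log file name
        log.write('%s %s\n' % (datetime.datetime.now().isoformat(), file))
    return _open(file, *args, **kwargs)

builtins.open = logged_open  # must run before any other open() calls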

What is 'target' in 'target = open(...)'

I'm learning Python as someone more familiar with databases and ETL. I'm not sure where target comes from in the following code.
from sys import argv
script, filename = argv
target = open(filename, 'w')
I think argv is a class in the sys module, but I don't think target comes from argv.
If you inspect target (for example by typing target at the interactive prompt), you will get something like this: <_io.TextIOWrapper name='yourfile.txt' mode='w' encoding='UTF-8'>
In simple terms, that means it is an object for accessing that particular file (with write access only, because you opened it in 'w' mode).
You can use this object to write data into the file with target.write(...).
Remember, however, to close the file by calling target.close() at the end.
Another way to do the same, which I prefer most of the time, is:
with open(filename, 'w') as target:
    target.write(...)
This way the file is closed automatically once you are out of the with context.
argv is the list populated with the arguments provided by the user when running the program from the shell. Please see https://docs.python.org/3/library/sys.html#sys.argv for more info on that.
The user supplied the filename from the shell, and the program used the open call https://docs.python.org/3/library/functions.html#open to get a file handle on that filename.
That file handle is stored in a variable called target (which could be named anything you like) so that you can process the file using the other file methods.
You are using open(), a built-in function in Python. This function returns a file object, which is assigned to the target variable. Now you can interact with target to write data (since you are using the 'w' mode).
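Putting the answers together, a minimal sketch of the same pattern (the script and file names here are only placeholders for whatever the user passes on the command line):
# save as write_target.py and run, for example:  python3 write_target.py notes.txt
from sys import argv

script, filename = argv        # argv[0] is the script name, argv[1] the user-supplied filename
target = open(filename, 'w')   # target is the file object returned by open()
target.write('hello\n')        # write data through the file object
target.close()                 # remember to close it (or use a with block instead)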

Facing errors when reading huge text in python

Using Python 3, my requirement is to read email files from a directory and filter out the HTML tags in them.
I have managed to do it to a large extent. When I try to read the content of my output, it gives an error:
for line in output.splitlines():
AttributeError: 'int' object has no attribute 'splitlines'
for file in glob.glob('spam/*.*'):
    output = os.system("python html2txt.py " + file)
    for line in output.splitlines():
        print(line)
When I print output, it shows the filtered text. Any help is appreciated.
Try this as a replacement for the code you've provided:
import glob
files = glob.glob('spam/*.*')
for f in files:
    with open(f) as spam_file:
        for line in spam_file:
            print(line)
If the files are indeed html files, I would recommend looking into BeautifulSoup.
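For instance, a rough sketch with BeautifulSoup, assuming the bs4 package is installed and the files really contain HTML (the file name below is only illustrative):
from bs4 import BeautifulSoup

with open('spam/example.eml') as f:               # hypothetical file name
    soup = BeautifulSoup(f.read(), 'html.parser')
print(soup.get_text())                            # the text with HTML tags stripped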
The return value of os.system(command) is system-dependent; it is supposed to return the (encoded) process exit status, which is represented by an int. Read more here:
On Unix, the return value is the exit status of the process encoded in
the format specified for wait(). Note that POSIX does not specify the
meaning of the return value of the C system() function, so the return
value of the Python function is system-dependent.
On Windows, the return value is that returned by the system shell
after running command, given by the Windows environment variable
COMSPEC: on command.com systems (Windows 95, 98 and ME) this is always
0; on cmd.exe systems (Windows NT, 2000 and XP) this is the exit
status of the command run; on systems using a non-native shell,
consult your shell documentation.
But on no system does it return a str, and splitlines() is a str method. Read more here.
You are calling a str method on an int; that is why you get the error:
AttributeError: 'int' object has no attribute 'splitlines'
On Unix, the return value is the exit status of the process encoded in
the format specified for wait(). Note that POSIX does not specify the
meaning of the return value of the C system() function, so the return
value of the Python function is system-dependent.
On Windows, the return value is that returned by the system shell
after running command. The shell is given by the Windows environment
variable COMSPEC: it is usually cmd.exe, which returns the exit status
of the command run; on systems using a non-native shell, consult your
shell documentation.
python docs
So your output variable is an integer, not the result of the file being parsed by the html2txt.py script.
And why do you run another Python script outside of your current process? Can't you just import whatever class or function is doing the job from that module?
Also, there is an email module in the standard library that can help you.
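If you do want to keep html2txt.py as a separate process, a sketch of capturing its output instead of its exit status could look like this (it assumes the script prints the filtered text to stdout; capture_output needs Python 3.7+):
import glob
import subprocess

for file in glob.glob('spam/*.*'):
    # capture the child's stdout as text instead of the exit status os.system() returns
    result = subprocess.run(['python', 'html2txt.py', file],
                            capture_output=True, text=True, check=True)
    for line in result.stdout.splitlines():
        print(line)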

Python stdin filename

I'm trying to get the filename that's given on the command line. For example:
python3 ritwc.py < DarkAndStormyNight.txt
I'm trying to get DarkAndStormyNight.txt
When I try fileinput.filename() I get the same result as with sys.stdin. Is this possible? I'm not looking for sys.argv[0], which returns the current script name.
Thanks!
In general it is not possible to obtain the filename in a platform-agnostic way. The other answers cover sensible alternatives like passing the name on the command-line.
On Linux, and some related systems, you can obtain the name of the file through the following trick:
import os
print(os.readlink('/proc/self/fd/0'))
/proc/ is a special filesystem on Linux that gives information about processes on the machine. self refers to the currently running process (the one that opened the file). fd is a directory containing a symbolic link for each open file descriptor of the process. 0 is the file descriptor number for stdin.
You can use ArgumentParser, which automatically gives you an interface to command-line arguments, and even provides help, etc.
from argparse import ArgumentParser
parser = ArgumentParser()
parser.add_argument('fname', metavar='FILE', help='file to process')
args = parser.parse_args()
with open(args.fname) as f:
    # do stuff with f
Now you call python3 ritwc.py DarkAndStormyNight.txt. If you call python3 ritwc.py with no argument, it'll give an error saying it expected an argument for FILE. You can also now call python3 ritwc.py -h and it will explain that a file to process is required.
PS: here's a great intro to how to use it: http://docs.python.org/3.3/howto/argparse.html
In fact, since it seems that Python cannot see the filename when stdin is redirected from the console, you have an alternative:
Call your program like this:
python3 ritwc.py -i your_file.txt
and then add the following code to redirect stdin from inside Python, so that you have access to the filename through the variable filename_in:
import sys

# scan sys.argv for "-i" and take the argument after it as the input filename
flag = 0
for arg in sys.argv:
    if flag:
        filename_in = arg
        break
    if arg == "-i":
        flag = 1
sys.stdin = open(filename_in, 'r')
# the rest of your code...
If you now use the command:
print(sys.stdin.name)
you get your filename; however, if you do the same print after redirecting stdin from the console, you get <stdin>, which is evidence that Python can't see the filename that way.
I don't think it's possible. As far as your Python script is concerned, it's just reading from stdin. The fact that the shell is feeding a file into stdin has nothing to do with the Python script.

Calling a subprocess with mixed data type arguments in Python

I am a bit confused as to how to get this done.
What I need to do is call an external command, from within a Python script, that takes as input several arguments, and a file name.
Let's call the executable that I am calling "prog" and the input file "file", so the command line (in a Bash terminal) looks like this:
$ prog --{arg1} {arg2} < {file}
In the above {arg1} is a string, and {arg2} is an integer.
If I use the following:
#!/usr/bin/python
import subprocess as sbp
sbp.call(["prog","--{arg1}","{arg2}","<","{file}"])
The result is an error output from "prog", where it claims that the input is missing {arg2}
The following produces an interesting error:
#!/usr/bin/python
import subprocess as sbp
sbp.call(["prog","--{arg1} {arg2} < {file}"])
all the spaces seem to have been removed from the second string, and an equals sign appended at the very end:
command not found --{arg1}{arg2}<{file}=
None of this behavior makes any sense to me, and there isn't much to go on in the Python documentation found online. Please note that replacing sbp.call with sbp.Popen does not fix the problem.
The issue is that < {file} isn't actually an argument to the program, but is syntax for the shell to set up redirection. You can tell Python to use the shell, or you can set up the redirection yourself.
from subprocess import *
# have shell interpret redirection
check_call('wc -l < /etc/hosts', shell=True)
# set up redirection in Python
with open('/etc/hosts', 'r') as f:
    check_call(['wc', '-l'], stdin=f.fileno())
The advantage of the first method is that it’s faster and easier to type. There are a lot of disadvantages, though: it’s potentially slower since you’re launching a shell; it’s potentially non-portable because it depends on the operating system shell’s syntax; and it can easily break when there are spaces or other special characters in filenames.
So the second method is preferred.
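Applied to the question's command, a sketch of the second method might look like this; arg1, arg2 and the file name are placeholders for the question's values, and str(arg2) is needed because every element of the argument list must be a string:
import subprocess as sbp

arg1 = 'option'      # placeholder for {arg1}, a string
arg2 = 42            # placeholder for {arg2}, an integer
filename = 'file'    # placeholder for {file}

# equivalent of:  prog --{arg1} {arg2} < {file}
with open(filename, 'r') as f:
    sbp.check_call(['prog', '--' + arg1, str(arg2)], stdin=f)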
