Finding which files are being read from during a session (python code)

I have a large system written in Python. When I run it, it reads all sorts of data from many different files on my filesystem. There are thousands of lines of code and hundreds of files, most of which are not actually used. I want to see which files are actually being accessed by the system (Ubuntu), and hopefully where in the code they are being opened. Filenames are decided dynamically using variables etc., so the actual filenames cannot be determined just by looking at the code.
I have access to the code, of course, and can change it.
I am trying to figure out how to do this efficiently, with minimal changes to the code:
1. Is there a Linux way to determine which files are accessed, and at what times? This might be useful, although it won't tell me where in the code this happens.
2. Is there a simple way to make an "open file" command also log the file name, time, etc. of the opened file? Hopefully without having to go into the code and change every open command; there are many of them, and some are not used at runtime.
Thanks

You can trace file accesses without modifying your code by using strace.
Either start your program under strace, like this:
strace -f -e trace=file your_program.py
or attach strace to a running program, like this:
strace -f -e trace=file -p <PID>
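The strace output can be long, so a small script can reduce it to the list of files that were actually opened. The following is only a sketch: it assumes the trace was saved to a file (for example with strace's -o option) and that each call appears on one line in the usual open(...)/openat(...) form. The function name opened_paths is made up for this illustration.

```python
import re

# Matches lines such as:
#   1234 openat(AT_FDCWD, "/etc/hosts", O_RDONLY|O_CLOEXEC) = 3
#   open("data.txt", O_RDONLY) = -1 ENOENT (No such file or directory)
_OPEN_CALL = re.compile(
    r'\bopen(?:at)?\((?:[A-Z_0-9]+, )?"([^"]*)".*\)\s*=\s*(-?\d+)'
)


def opened_paths(trace_lines):
    """Return the paths that were opened successfully, in first-seen order."""
    seen = []
    for line in trace_lines:
        match = _OPEN_CALL.search(line)
        if not match:
            continue  # some other syscall (stat, access, ...)
        path, result = match.group(1), int(match.group(2))
        # A negative return value means the open failed (e.g. ENOENT).
        if result >= 0 and path not in seen:
            seen.append(path)
    return seen
```

Typical use would be something like opened_paths(open("trace.log")), where trace.log is whatever file the trace was written to. Paths containing unusual characters are escaped by strace and are not decoded by this sketch.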

For 1, you can use
ls -la /proc/<PID>/fd
replacing <PID> with your process ID.
Note that this lists all file descriptors that are open at that moment. Some of them are stdin, stdout and stderr, and often other things such as open websockets (which also use a file descriptor), but filtering the list for regular files should be easy.
For 2, see the great solution proposed here:
Override python open function when used with the 'as' keyword to print anything
i.e. override the open function with your own, which can include the additional logging.

One possible method is to "overload" (shadow) the built-in open function. The effects depend heavily on the code, so I would do this very carefully, but here is a basic example:
>>> _open = open
>>> def open(filename):
...     print(filename)
...     return _open(filename)
...
>>> open('somefile.txt')
somefile.txt
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<stdin>", line 3, in open
FileNotFoundError: [Errno 2] No such file or directory: 'somefile.txt'
As you can see, my new open function calls the original open (renamed to _open) but first prints its argument (the filename). This simplified version accepts only the filename; a real replacement must also pass through the mode and any other arguments. It can be made more sophisticated, for example by logging the filename, but the most important thing is that it must run before any use of open in your code.

Related

Remove file in python with absolute path: error "no such file or directory" but file exists

I'm trying to remove a file in Python 3 on Linux (RHEL) the following way:
os.remove(os.getcwd() + '/file.txt')
(Sorry, I'm not allowed to publish the real paths.)
It gives me the usual error:
No such file or directory: '/path/to/file/file.txt'
(I've checked the direction of the slashes in the path.)
What is strange is that when I just ls the file (by copy-pasting, so the very same path), the file does exist.
I've read this post, but I'm not on Windows and the slash direction seems correct.
Any idea?
EDIT: as suggested by @DominicPrice, os.system('ls') shows the file while os.listdir() does not (but it shows other files in the same directory).
EDIT 2: So my issue was due to bad usage of os.popen. I used this method to copy the file but did not wait for the subprocess to terminate. So my understanding is that the file had not been copied yet when I tried to delete it.

The problem is that, as you have explained in the comments, you are creating the file using os.popen("cp ..."). This works asynchronously, so it may not have had time to complete by the time you call os.remove(). You can force python to wait for it to finish by calling the close method:
proc = os.popen("cp myfile myotherfile")
proc.close() # wait for process to finish
os.remove("myotherfile") # we're all good
I would highly recommend staying away from using os.popen in favour of the subprocess library, which has a run function which is way safer to use.
For the specific functions of copying a file, an even better (and cross platform) solution is to use the shutil library:
import shutil
shutil.copyfile("myfile", "myotherfile")
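For cases where an external command really is needed, here is a small sketch of the subprocess.run approach mentioned above. It assumes a POSIX system with cp on the PATH; the function name copy_then_remove is made up for the illustration.

```python
import os
import subprocess


def copy_then_remove(src, dst):
    # subprocess.run() returns only after the child process has exited;
    # check=True raises CalledProcessError if cp reports a failure.
    subprocess.run(["cp", src, dst], check=True)
    existed = os.path.exists(dst)  # the copy is complete at this point
    os.remove(dst)
    return existed
```

Passing the command as a list avoids going through a shell, so file names containing spaces need no extra quoting.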

You should use os.path.dirname(__file__), which gives the directory containing the current script.
dirname is part of Python's standard os.path module.
You can read more here:
https://www.geeksforgeeks.org/find-path-to-the-given-file-using-python/
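As a sketch of that suggestion, the helper below builds a path relative to a script's own directory instead of the current working directory. The name path_next_to is hypothetical; inside a script you would pass __file__ as the first argument.

```python
import os.path


def path_next_to(script_file, name):
    # abspath() first, so the result does not change if the process
    # later changes its working directory.
    script_dir = os.path.dirname(os.path.abspath(script_file))
    return os.path.join(script_dir, name)
```

For example, os.remove(path_next_to(__file__, 'file.txt')) would target file.txt beside the script, wherever the script was started from.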

Why does python3.7 pass argv elements alongside 'garbage'

I had this script working for me, before I decided I'm gonna rewrite everything and make it portable.
Without delving too much into the details, there's a central Bash script, which calls 5 other Bash scripts in their own respective folders. I have no intention of porting to Windows anytime soon, as of current this is just for Linux.
The execution path of the central Bash script is:
dos.1/1-init.sh dos.1/
dos.2/1-trace-to-file.sh dos.2/ dos.1/
dos.3/1-recognize-categories.sh dos.3/
dos.4/1-ping-in-groups.sh dos.4/ dos.3/
dos.5/init.sh dos.5/ dos.4/
I run with ./init.sh
Before the script was 'portable' I was using explicit file paths inside each respective script. All was well and good. The program itself is a combination of Bash and Python, and writes to files in one directory, so that they can be manipulated in various ways, before being read back into different parts of the program.
I understand that the fastest way to do this would be to write a monolithic Python script, using subprocess calls for the Bash side of things... However, I am doing it this way to ease maintenance, and (before I started making it 'portable') it was lightning fast.
My issue now is this: each time I have to read text into Python (either from SQL or from file) there's always this added garbage. Up until this point, I have been using sed, awk and Python's .rstrip() function to manage this... Which is all well and good, but this one damn function will not play nice... And I feel there must be a better way.
In bash I call it with:
prog_dir=$1
data_dir=$2
$prog_dir/2fast-ping.py $data_dir/group0.txt > $prog_dir/group0_averages.txt
$prog_dir/2fast-ping.py $data_dir/group1.txt > $prog_dir/group1_averages.txt
...
Now I know that I could write to file from within Python, but in this instance I have other reasons not to.
The issue is that when the 2fast-ping.py script is run, it reads the text file in with commas and a newline char. I have checked thoroughly and can confirm that the group#.txt files definitely do not contain commas. Here's the Python:
import sys
import subprocess
import select
from concurrent.futures import ThreadPoolExecutor
filename = sys.argv[1]
f = open(filename, "r")
ips = [elem.rstrip('\n') for elem in f]
print(ips)
f.close()
The script goes on to do some work on the IPs afterwards, but this is the painful part. If I call the script directly from the CLI: ./2fast-ping.py ../dos.3/group0.txt, the text is processed PROPERLY and the subsequent instructions actually function. But when called from the first init script, the program basically sh*ts itself because each line is read in with commas. It works until the point where it starts to use the processed info, then:
<actual IP would be here>
ping: ('##.###.###.###',): Name or service not known
Of course, the issue is the ('',) But, Python is adding that in, and I don't know how to stop it :(
Any ideas?

The Python code was okay; I was just passing an additional / with the argument :(
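One way to make a script less sensitive to stray slashes in directory arguments is to build the path with os.path rather than by string concatenation. This is only a sketch of that idea; the helper name data_file is invented here, and the directory names are taken from the question.

```python
import os.path


def data_file(data_dir, name):
    # join() copes with a trailing slash on data_dir;
    # normpath() then collapses any doubled separators.
    return os.path.normpath(os.path.join(data_dir, name))
```

With this, data_file("dos.3/", "group0.txt") and data_file("dos.3", "group0.txt") produce the same path.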

Using python subprocess to fake running a cmd from a terminal

We have a vendor-supplied Python tool (it's byte-compiled; we don't have the source). Because of this, we're also locked into using the vendor-supplied Python 2.4. The way to run the util is:
source login.sh
oupload [options]
The login.sh just sets a few env variables, and then 2 aliases:
odownload () {
    ${PYTHON_CMD} ${OCLIPATH}/ocli/commands/word_download_command.pyc "$@"
}
oupload () {
    ${PYTHON_CMD} ${OCLIPATH}/ocli/commands/word_upload_command.pyc "$@"
}
Now, when I run it their way, it works fine. It will prompt for a username and password, then do its thing.
I'm trying to create a wrapper around the tool to do some extra steps after it's run and provide some sane defaults for the utility. The problem I'm running into is I cannot, for the life of me, figure out how to use subprocess to successfully do this. It seems to realize that the original command isn't running directly from the terminal and bails.
I created a '/usr/local/bin/oupload' and copied from the original login.sh. Only difference is instead of doing an alias at the end, I actually run the command.
Then, in my python script, I try to run my new shell script:
if os.path.exists(options.zipfile):
    try:
        cmd = string.join(cmdargs,' ')
        p1 = Popen(cmd, shell=True, stdin=PIPE)
But I get:
Enter Opsware Username: Traceback (most recent call last):
  File "./command.py", line 31, in main
  File "./controller.py", line 51, in handle
  File "./controllers/word_upload_controller.py", line 81, in _handle
  File "./controller.py", line 66, in _determineNew
  File "./lib/util.py", line 83, in determineNew
  File "./lib/util.py", line 112, in getAuth
Empty Username not legal
Unknown Error Encountered
SUMMARY:
Name: Empty Username not legal
Description: None
So it seemed like an extra carriage return was getting sent ( I tried rstripping all the options, didn't help ).
If I don't set stdin=PIPE, I get:
Enter Opsware Username: Traceback (most recent call last):
  File "./command.py", line 31, in main
  File "./controller.py", line 51, in handle
  File "./controllers/word_upload_controller.py", line 81, in _handle
  File "./controller.py", line 66, in _determineNew
  File "./lib/util.py", line 83, in determineNew
  File "./lib/util.py", line 109, in getAuth
IOError: [Errno 5] Input/output error
Unknown Error Encountered
I've tried other variations using p1.communicate and p1.stdin.write(), along with shell=False and shell=True, but I've had no luck figuring out how to properly send along the username and password. As a last resort, I tried looking at the byte code for the utility they provided. It didn't help: once I called the util's main routine with the proper arguments, it ended up core dumping with thread errors.
Final thoughts: the utility doesn't seem to want to 'wait' for any input. When run from the shell, it pauses at the 'Username' prompt. When run through Python's popen, it just blazes through and ends, assuming no password was given. I tried to look up ways of preloading the stdin buffer, thinking the process might read from that if it was available, but couldn't figure out whether that was possible.
I'm trying to stay away from using pexpect, mainly because we have to use the vendor's provided python 2.4 because of the precompiled libraries they provide and I'm trying to keep distribution of the script to as minimal a footprint as possible - if I have to, I have to, but I'd rather not use it ( and I honestly have no idea if it works in this situation either ).
Any thoughts on what else I could try would be most appreciated.
UPDATE
So I solved this by diving further into the bytecode and figuring out what I was missing from the compiled command.
However, this presented two problems:
1. The vendor code, when called, was doing an exit when it completed.
2. The vendor code was writing to stdout, which I needed to store and operate on (it contains the ID of the uploaded pkg). I couldn't just redirect stdout, because the vendor code was still asking for the username/password.
Problem 1 was solved easily enough by wrapping their code in a try/except clause.
Problem 2 was solved by doing something similar to: https://stackoverflow.com/a/616672/677373
Instead of a log file, I used cStringIO. I also had to implement a fake 'flush' method, since the vendor code was calling it and complaining that the new object I had provided for stdout didn't supply it. The code ends up looking like:
import os
import sys
from cStringIO import StringIO
class Logger(object):
    def __init__(self):
        self.terminal = sys.stdout
        self.log = StringIO()
    def write(self, message):
        self.terminal.write(message)
        self.log.write(message)
    def flush(self):
        self.terminal.flush()
        self.log.flush()
if os.path.exists(options.zipfile):
    try:
        os.environ['OCLI_CODESET'] = 'ISO-8859-1'
        backup = sys.stdout
        sys.stdout = output = Logger()
        # UploadCommand was the command found in the bytecode
        upload = UploadCommand()
        try:
            upload.main(cmdargs)
        except Exception, rc:
            pass
        sys.stdout = backup
        # now do some fancy stuff with output from output.log
I should note that the only reason I simply do a 'pass' in the except: clause is that the except clause is always called. The 'rc' is actually the return code from the command, so I will probably add handling for non-zero cases.
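For readers on Python 3, the same idea can be sketched as follows. This is an adaptation and not the vendor-specific code above: Tee and run_captured are hypothetical names, io.StringIO stands in for cStringIO, and SystemExit is caught explicitly because in Python 3 it is not a subclass of Exception.

```python
import io
import sys


class Tee:
    """Write to the real stdout and keep a copy in memory."""

    def __init__(self, terminal):
        self.terminal = terminal
        self.log = io.StringIO()

    def write(self, message):
        self.terminal.write(message)
        return self.log.write(message)

    def flush(self):
        # Some callers insist on flush() being present.
        self.terminal.flush()


def run_captured(func, *args):
    """Call func(*args); return (exit_code, captured_stdout_text)."""
    backup = sys.stdout
    sys.stdout = tee = Tee(backup)
    exit_code = 0
    try:
        func(*args)
    except SystemExit as exc:
        # The called code ended with sys.exit(); keep its code instead of exiting.
        exit_code = exc.code
    finally:
        sys.stdout = backup  # always restore the real stdout
    return exit_code, tee.log.getvalue()
```

The finally clause restores sys.stdout even if the called code raises something unexpected, which the flat assignment in the original would skip.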

I tried to lookup ways of maybe preloading the stdin buffer
Do you perhaps want to create a named fifo, fill it with username/password info, then reopen it in read mode and pass it to popen (as in popen(..., stdin=myfilledbuffer))?
You could also just create an ordinary temporary file, write the data to it, and reopen it in read mode, again passing the reopened handle as stdin. (This is something I'd personally avoid, since writing usernames/passwords to temporary files is generally a bad idea. On the other hand, it's easier to test than FIFOs.)
As for the underlying cause: I suspect that the offending software is reading from stdin via a non-blocking method. Not sure why that works when connected to a terminal.
AAAANYWAY: no need to use pipes directly via Popen at all, right? I kinda laugh at the hackishness of this, but I'll bet it'll work for you:
# you don't actually seem to need popen here IMO -- call() does better for this application.
statuscode = call('echo "%s\n%s\n" | oupload %s' % (username, password, options) , shell=True)
tested with status = call('echo "foo\nbar\nbar\nbaz" |wc -l', shell = True) (output is '4', naturally.)
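On Python 3.7 or later, the same effect as the echo pipeline can be had without a shell by passing the text through the input parameter of subprocess.run. This is a sketch under that assumption (it does not apply to the vendor's Python 2.4), and run_with_answers is a made-up name. As with the pipeline, the child's stdin is a pipe and not a terminal, so a tool that insists on a terminal will still refuse it.

```python
import subprocess


def run_with_answers(argv, answers):
    # One answer per line, as if typed at successive prompts.
    stdin_text = "".join(answer + "\n" for answer in answers)
    result = subprocess.run(
        argv, input=stdin_text, capture_output=True, text=True
    )
    return result.returncode, result.stdout
```

Compared with building an echo command line, the credentials never appear in a shell command string and need no quoting.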

The original question was solved by just avoiding the issue and not using the terminal and instead importing the python code that was being called by the shell script and just using that.
I believe J.F. Sebastian's answer would probably work better for what was originally asked, however, so I'd suggest people looking for an answer to a similar question look down the path of using the pty module.
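As a rough illustration of the pty route, the sketch below starts a child with a pseudo-terminal as its stdin/stdout/stderr, so the child sees a terminal. It is written for Python 3 on a POSIX system and is not the answer referred to above; run_on_pty is a hypothetical name. It writes all answers up front, which will not work for tools that flush pending terminal input before prompting (getpass-style prompts can do this); those need the answers sent after each prompt has been read.

```python
import os
import pty
import subprocess


def run_on_pty(argv, answers):
    master_fd, slave_fd = pty.openpty()
    proc = subprocess.Popen(
        argv, stdin=slave_fd, stdout=slave_fd, stderr=slave_fd
    )
    os.close(slave_fd)  # the child keeps its own copies
    for answer in answers:
        os.write(master_fd, (answer + "\n").encode())
    chunks = []
    while True:
        try:
            data = os.read(master_fd, 1024)
        except OSError:
            break  # Linux reports EIO once the child side is closed
        if not data:
            break  # other systems signal end-of-file instead
        chunks.append(data)
    proc.wait()
    os.close(master_fd)
    return proc.returncode, b"".join(chunks).decode(errors="replace")
```

The returned text includes the terminal's echo of the typed answers and uses \r\n line endings, because it is what a terminal user would have seen.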

Can't seem to get fortran executable to run correctly through python

I have read a bunch of different topics on SO and other sites and cannot get a direct answer to my question/problem. Currently I have a Python script that runs completely fine, except that none of the calls I make to run a Fortran program work correctly. I have tried subprocess commands, os.system commands, and running bash scripts from Python, with no luck. Here are some examples and the errors I'm getting.
One attempt:
subprocess.Popen(["sh", "{0}{1}".format(SCRIPTS,"qlmtconvertf.sh"), "qlmt"], shell=False, stdout=subprocess.PIPE)
This gives an error that the program has trouble reading the file correctly.
forrtl: severe (24): end-of-file during read, unit 1, file /home/akoufos/lapw/Ar/lda/bcc55_mt1.5_lo_e8_o4/DOS/lat70/qlmt
Another attempt:
subprocess.Popen(["./{0}{1}".format(SOURCE,"qlmtconvertf"), "qlmt"], shell=False, stdout=subprocess.PIPE)
This gives an error of not finding the file.
  File "/home/akoufos/lapw/Scripts_Plots/LAPWanalysis.py", line 59, in DOS
    subprocess.Popen(["./{0}{1}".format(SOURCE,"qlmtconvertf"), "qlmt"], shell=False, stdout=subprocess.PIPE)
  File "/usr/lib64/python2.7/subprocess.py", line 672, in __init__
    errread, errwrite)
  File "/usr/lib64/python2.7/subprocess.py", line 1202, in _execute_child
    raise child_exception
OSError: [Errno 2] No such file or directory
Yet another attempt:
os.system("{0}{1}".format(SOURCE,"qlmtconvertf qlmt"))
This gives an error equivalent to the first example. In all examples SOURCE="/home/myusername/lapw/Source/", where the fortran source files are, SCRIPTS="/home/myusername/lapw/Scripts_Plots/", where I have other files and the python scripts in, qlmtconvertf is a compiled fortran program, and qlmt is a file the qlmtconvertf reads. This source code works completely fine if I call it in the shell, like I have done countless times, but I'm trying to automate calling these codes. I have written a bash script as well, that does what I need, but I'm trying to do everything through python instead. Any ideas, suggestions, or answers to what I am doing incorrectly and what is going on would be greatly appreciated. Thank you all in advance.
EDIT: I got it working with the suggestion given below by Francis. I had to keep the complete paths (i.e. /home/username/etc) and the os.path.join to call the program correctly.
import os.path
import subprocess
LAPW = "/home/myusername/lapw/"
SOURCE = os.path.join(LAPW,'Source')
SCRIPTS = os.path.join(LAPW,'Scripts_Plots')
QLMTCONVERT = os.path.join(SOURCE,'qlmtconvertf')
qargs = [QLMTCONVERT,'qlmt']
#CALLING PROGRAM
subprocess.Popen(qargs, stdout=subprocess.PIPE).communicate(input=None)
To get it to work correctly I also had to close the 'qlmt' file I had created earlier in the Python script. Also, I am working in the directory that contains the 'qlmt' file.
(Edit: I also added .communicate(input=None) to the end of the subprocess call. This was unnecessary for this particular call, but it was important for a later one in the script that tried to use a file the process was creating. From my understanding, .communicate() talks to the process and basically waits for it to finish before the next Python line is executed. It is similar to .wait(), but more advanced. If someone who understands this better wants to elaborate, please feel free.)
I'm not exactly sure why this method worked, but using strings as inputs for the subprocess was giving errors. If anyone has any insight on this, I would be very thankful if you could pass on your knowledge. Thank you everyone for the help.

I think you forgot a slash in your filenames:
"{0}{1}".format(SOURCE,"qlmtconvertf qlmt") == '/home/myusername/lapw/Sourceqlmtconvertf qlmt'
I assume you mean this?
"{0}/{1}".format(SOURCE,"qlmtconvertf qlmt") == '/home/myusername/lapw/Source/qlmtconvertf qlmt'
I recommend using os.path.join rather than direct string construction for pathname creation:
import os.path
import subprocess
executable = os.path.join(SOURCE, 'qlmtconvertf')
args = ['qlmt']
# Popen expects one list: the executable followed by its arguments
subprocess.Popen([executable] + args, stdout=subprocess.PIPE)
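On Python 3.7 or later, the pieces discussed in this thread (an argument list built from a joined path, an explicit working directory, and waiting for the program to finish) can be combined in one call to subprocess.run. This is a sketch under those assumptions; run_tool and its parameters are made-up names, not part of the original scripts.

```python
import subprocess


def run_tool(executable, args, workdir):
    # cwd= makes relative file names (such as an input file) resolve in workdir.
    # run() waits for the program to exit; check=True raises on a non-zero status.
    completed = subprocess.run(
        [executable] + list(args),
        cwd=workdir,
        stdout=subprocess.PIPE,
        check=True,
        text=True,
    )
    return completed.stdout
```

A call might look like run_tool(os.path.join(SOURCE, 'qlmtconvertf'), ['qlmt'], data_directory), where data_directory is whichever directory holds the qlmt file.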

Detect file handle leaks in python?

My program appears to be leaking file handles. How can I find out where?
My program uses file handles in a few different places: output from child processes, calls through ctypes to an API (ImageMagick) that opens files, and files being copied.
It crashes in shutil.copyfile, but I'm pretty sure this is not the place it is leaking.
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "C:\Python25\Lib\site-packages\magpy\magpy.py", line 874, in main
    magpy.run_all()
  File "C:\Python25\Lib\site-packages\magpy\magpy.py", line 656, in run_all
    [operation.operate() for operation in operations]
  File "C:\Python25\Lib\site-packages\magpy\magpy.py", line 417, in operate
    output_file = self.place_image(output_file)
  File "C:\Python25\Lib\site-packages\magpy\magpy.py", line 336, in place_image
    shutil.copyfile(str(input_file), str(self.full_filename))
  File "C:\Python25\Lib\shutil.py", line 47, in copyfile
    fdst = open(dst, 'wb')
IOError: [Errno 24] Too many open files: 'C:\\Documents and Settings\\stuart.axon\\Desktop\\calzone\\output\\wwtbam4\\Nokia_NCD\\nl\\icon_42x42_V000.png'
Press any key to continue . . .

I had similar problems, running out of file descriptors during subprocess.Popen() calls. I used the following script to debug what was happening:
import os
import stat
_fd_types = (
    ('REG', stat.S_ISREG),
    ('FIFO', stat.S_ISFIFO),
    ('DIR', stat.S_ISDIR),
    ('CHR', stat.S_ISCHR),
    ('BLK', stat.S_ISBLK),
    ('LNK', stat.S_ISLNK),
    ('SOCK', stat.S_ISSOCK)
)
def fd_table_status():
    result = []
    # probe the first 100 descriptor numbers
    for fd in range(100):
        try:
            s = os.fstat(fd)
        except OSError:
            # not an open descriptor
            continue
        for fd_type, func in _fd_types:
            if func(s.st_mode):
                break
        else:
            # no known type matched: report the raw mode
            fd_type = str(s.st_mode)
        result.append((fd, fd_type))
    return result
def fd_table_status_logify(fd_table_result):
    return ('Open file handles: ' +
            ', '.join(['{0}: {1}'.format(*i) for i in fd_table_result]))
def fd_table_status_str():
    return fd_table_status_logify(fd_table_status())
if __name__=='__main__':
    print(fd_table_status_str())
You can import this module and call fd_table_status_str() to log the file descriptor table status at different points in your code.
Also, make sure that subprocess.Popen instances are destroyed. On Windows, keeping references to Popen instances prevents them from being garbage-collected, and as long as the instances are kept, the associated pipes are not closed. More info here.
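Building on the same fstat probing, a before/after comparison can point at the call that leaks. The sketch below is for Python 3; open_fds and leaked_fds are made-up names, and the limit of 1024 descriptors is an arbitrary assumption that may need raising for processes with many open files.

```python
import os


def open_fds(limit=1024):
    """Return the set of descriptor numbers below limit that are open."""
    fds = set()
    for fd in range(limit):
        try:
            os.fstat(fd)
        except OSError:
            continue  # not an open descriptor
        fds.add(fd)
    return fds


def leaked_fds(func):
    """Call func() and return the descriptors it left open, sorted."""
    before = open_fds()
    func()
    return sorted(open_fds() - before)
```

Wrapping suspect functions one at a time, for example leaked_fds(lambda: operation.operate()), narrows down where descriptors are left open. The comparison is only meaningful if no other thread opens or closes files during the call.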

Use Process Explorer, select your process, View->Lower Pane View->Handles - then look for what seems out of place - usually lots of the same or similar files open points to the problem.

lsof -p <process_id> works well on several UNIX-like systems including FreeBSD.

Look at the output from ls -l /proc/$pid/fd/ (substituting the PID of your process, of course) to see which files are open [or, on win32, use Process Explorer to list open files]; then figure out where in your code you're opening them, and make sure that close() is being called. (Yes, the garbage collector will eventually close things, but it's not always fast enough to avoid running out of fds).
Checking for any circular references which might be preventing garbage collection is also a good practice. (The cycle collector will eventually dispose of these -- but it may not run frequently enough to avoid file descriptor exhaustion; I've been bitten by this personally).
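The simplest way to guarantee that close() is called, without relying on the garbage collector, is a with block. A minimal sketch (read_first_line is just an illustrative name):

```python
def read_first_line(path):
    # The with block closes the file as soon as the block exits,
    # even if an exception is raised while reading.
    with open(path) as handle:
        line = handle.readline()
    return line, handle.closed
```

The second value returned is True, showing that the descriptor has already been released by the time the function returns.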

While the OP has a Windows system, I'm sure plenty of people here (such as myself) are looking for others too (it's not even tagged Windows).
The psutil package (originally hosted on Google Code) has a get_open_files() method, which was renamed to Process.open_files() in psutil 2.0. It looks like an excellent interface, but when I wrote this it seemed not to have been maintained for a couple of years. I actually wrote an implementation for my own Python 2 project on Linux. I'm using it with unittest to make sure my functions clean up their resources.
import os
# calling this **synchronously** will accurately relay open files on Linux
def get_open_files(pid):
    # directory exposed by the kernel for the process, containing its file descriptors
    path = "/proc/%d/fd" % pid
    # list the abspaths belonging to that directory
    links = ["%s/%s" % (path, f) for f in os.listdir(path)]
    # filter out the bad ones returned by os.listdir()
    valid_links = filter(lambda f: os.path.exists(f), links)
    # these links are fd integers, so map them to their actual file devices
    devices = map(lambda f: os.readlink(f), valid_links)
    # remove any ones that are stdin, stdout, stderr, etc.
    return list(filter(lambda f: "/dev/pts" not in f, devices))

Python's own test suite has a refleak module that utilizes fd_count. Works across operating systems and is available on full installs:
>>> from test.support.os_helper import fd_count
>>> fd_count()
27
On Python 3.9 and earlier, os_helper doesn't exist, so use from test.support import fd_count instead.
