What is Hudson doing with my Python script?

I am developing a Python script (hereafter "the script") that searches through large log files for a search string. In normal use, it is called from a Hudson front end.
About the Hudson interface:
The Hudson job creates a temporary batch file on a connected virtual machine (VM) that calls the script and passes it some parameters. We have had hundreds of successful instances using this setup, but something is now creating an error.
About the script:
The log files are contained in dozens of compressed .tgz files. My script searches each log in each .tgz file.
One of the command-line args that my script accepts is a True/False parameter called PROCESS_IN_PARALLEL. If PROCESS_IN_PARALLEL is set to True, then each .tgz file is searched in its own worker process (using the multiprocessing module). If PROCESS_IN_PARALLEL is set to False, then each .tgz file is searched in sequence (using a loop).
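A minimal sketch of that dispatch; searchTar here is a stand-in for the script's actual search function, not the real implementation:
import multiprocessing

def searchTar(tgz_path):
    # Stand-in for the real search: open the archive, scan each log,
    # and return any hits for the search string.
    return []

def search_all(tgz_paths, process_in_parallel):
    if process_in_parallel:
        # One .tgz per worker process, mirroring the
        # processPool.map(searchTar, tarMap) call in the traceback below.
        pool = multiprocessing.Pool()
        try:
            return pool.map(searchTar, tgz_paths)
        finally:
            pool.close()
            pool.join()
    # Sequential fallback: same work, one archive at a time.
    return [searchTar(path) for path in tgz_paths]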
What works:
I have a batch file on the VM that I use for testing my script. I can successfully use this .bat to call my script with PROCESS_IN_PARALLEL set to either (1a) True or (1b) False. Of course, it runs much faster when True (about 4x faster).
I have quadruple-checked that this .bat passes the same parameters to my script as Hudson, and in the same order. I have also added a line to my script that logs the input parameters to a file, and have found that Hudson is indeed passing the correct parameters in the correct order.
I can successfully use Hudson to call my script with PROCESS_IN_PARALLEL set to False.
I have tested the current iteration of my script using the above three test cases multiple times (even using multiple configurations of other parameters), all successfully.
What doesn't work:
If I use Hudson to call my script with PROCESS_IN_PARALLEL set to True, then I get a strange error. Here is the traceback:
Traceback (most recent call last):
File "F:\Scripts\Parse_LogFiles_Archive\parseLogs_Archive_8-19-13.py", line 40, in main
searchHits = searchTarList(searchDir, newDirectory, argv)
File "F:\Scripts\Parse_LogFiles_Archive\parseLogs_Archive_8-19-13.py", line 163, in searchTarList
hits = processPool.map(searchTar, tarMap)
File "E:\Python27\lib\multiprocessing\pool.py", line 225, in map
return self.map_async(func, iterable, chunksize).get()
File "E:\Python27\lib\multiprocessing\pool.py", line 522, in get
raise self._value
IOError: [Errno 9] Bad file descriptor
According to my research, this error happens when Python attempts to read a file that is open in write mode.
My question:
Is there a genius out there who knows both Python and Hudson well enough to know what is happening?

Related

Pyinstaller not allowing multiprocessing with MacOS

I have a Python file that I would like to package as an executable for macOS 11.6.
The Python file (called Service.py) relies on one other JSON file and runs perfectly fine when run with python. My file uses argparse, as the arguments can differ depending on what is needed.
Example of how the file is called with python:
python3 Service.py -v Zephyr_Scale_Cloud https://myurl.cloud/ philippa#email.com password1 group3
The file is run in exactly the same way when it is an executable:
./Service.py -v Zephyr_Scale_Cloud https://myurl.cloud/ philippa#email.com password1 group3
I can package the file using PyInstaller and the executable runs.
Command used to package the file:
pyinstaller --paths=/Library/Frameworks/Python.framework/Versions/3.9/lib/python3.9/site-packages/ Service.py
However, when I get to the point that requires multiprocessing, the arguments get lost. My second argument (here noted as https://myurl.cloud) is a URL that I require.
The error I see is:
[MAIN] Starting new process RUNID9157
url before constructing the client recognised as pipe_handle=15
usage: Service [-h] test_management_tool url
Service: error: the following arguments are required: url
Traceback (most recent call last):
File "urllib3/connection.py", line 174, in _new_conn
File "urllib3/util/connection.py", line 72, in create_connection
File "socket.py", line 954, in getaddrinfo
I have done some logging, and the url does get correctly read. But as soon as the process starts and picks up what it needs, the url is changed to 'pipe_handle=x'; in the output above it is pipe_handle=15.
I need the url to retrieve an authentication token, but it just stops being read as the correct value and is changed to this pipe_handle value. I have no idea why.
Has anyone else seen this?!
I am using Python 3.9, PyInstaller 4.4 and ArgParse.
I have also added
if __name__ == "__main__":
    if sys.platform.startswith('win'):
        # On Windows - multiprocessing is different to Unix and Mac.
        multiprocessing.freeze_support()
to my if __name__ == "__main__": section, as I saw this on other posts, but it doesn't help.
Can someone please assist?
Sending commands via sys.argv is complicated by the fact that multiprocessing's "spawn" start method uses sys.argv to pass the file descriptors for the initial communication pipes between the parent and child.
I'm projecting here a little, because you did not share the code of how/where you call argparse and how/where you call multiprocessing.
If you are parsing args outside of if __name__ == "__main__":, the args may get parsed (re-parsed when the child imports __main__) before sys.argv gets automatically cleaned up by multiprocessing.spawn.prepare() in the child. You should be able to fix this by moving the argparse stuff inside your target function. It may also be easier to parse the args in the parent and simply send the parsed results as an argument to the target function, as in the sketch below. See this answer of mine for further discussion on sys.argv with multiprocessing.
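A minimal sketch of that parent-side approach, with a hypothetical parser mirroring the usage line in the error output:
import argparse
import multiprocessing

def worker(args):
    # The child receives the already-parsed namespace directly, so it
    # never has to touch sys.argv (which spawn uses for its pipe handle).
    print("url is still:", args.url)

if __name__ == "__main__":
    # Hypothetical parser matching "usage: Service [-h] test_management_tool url".
    parser = argparse.ArgumentParser("Service")
    parser.add_argument("test_management_tool")
    parser.add_argument("url")
    args = parser.parse_args()  # parsed once, in the parent only

    multiprocessing.freeze_support()  # no-op except in frozen Windows builds
    process = multiprocessing.Process(target=worker, args=(args,))
    process.start()
    process.join()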

Finding which files are being read from during a session (python code)

I have a large system written in Python. When I run it, it reads all sorts of data from many different files on my filesystem. There are thousands of lines of code and hundreds of files; most of them are not actually being used. I want to see which files are actually being accessed by the system (Ubuntu), and hopefully where in the code they are being opened. Filenames are decided dynamically using variables etc., so the actual filenames cannot be determined just by looking at the code.
I have access to the code, of course, and can change it.
I am trying to figure out how to do this efficiently, with minimal changes to the code:
1. Is there a Linux way to determine which files are accessed, and at what times? This might be useful, although it won't tell me where in the code this happens.
2. Is there a simple way to make an "open file" command also log the file name, time, etc. of the opened file? Hopefully without having to go into the code and change every open command; there are many of them, and some are not even used at runtime.
Thanks
You can trace file accesses without modifying your code, using strace.
Either start your program under strace, like this:
strace -f -e trace=file python your_program.py
or attach strace to a running program, like this:
strace -f -e trace=file -p <PID>
For 1 - You can use
ls -la /proc/<PID>/fd
Replacing <PID> with your process id.
Note that it will give you all the open file descriptors, some of them are stdin stdout stderr, and often other things, such as open websockets (which use a file descriptor), however filtering it for files should be easy.
For 2 - See the great solution proposed here:
Override python open function when used with the 'as' keyword to print anything
e.g. overriding the open function with your own, which could include the additional logging.
One possible method is to "overload" the open function. This will have many effects that depend on the code, so I would do that very carefully if needed, but basically here's an example:
>>> _open = open
>>> def open(filename):
... print(filename)
... return _open(filename)
...
>>> open('somefile.txt')
somefile.txt
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "<stdin>", line 3, in open
FileNotFoundError: [Errno 2] No such file or directory: 'somefile.txt'
As you can see, my new open function calls the original open (saved as _open) but first prints out the argument (the filename). This can be done with more sophistication to log the filename if needed, but the most important thing is that this needs to run before any use of open in your code.
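Taking that idea one step further, here is a sketch that logs both the filename and the call site; the wrapper name and log path are illustrative, not part of the original answer:
import builtins
import traceback

_open = builtins.open  # keep a handle on the real built-in

def logging_open(file, *args, **kwargs):
    # extract_stack()[-2] is the frame that called open()
    caller = traceback.extract_stack()[-2]
    with _open("/tmp/file_access.log", "a") as log:  # calling _open avoids recursion
        log.write("%s opened at %s:%s\n" % (file, caller.filename, caller.lineno))
    return _open(file, *args, **kwargs)

builtins.open = logging_open  # must run before any code under test calls open()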

Swift task open file descriptor of pipe

I am trying to open a Swift Pipe from a Python script that is executed via a Swift Process
Swift code
let pipe = Pipe()
let task = Process()
var env = ProcessInfo.processInfo.environment
task.launchPath = "/pythonscript.py"
let fh = pipe.fileHandleForWriting
task.arguments = ["\(fh.fileDescriptor)"]
task.launch()
Python
#!/usr/local/bin/python
import os
import sys
fd=int(sys.argv[1])
print(os.fdopen(fd, u'w'))
What I get back from the python script is
Traceback (most recent call last):
File "./test.py", line 7, in <module>
print(os.fdopen(fd, u'w'))
OSError: [Errno 9] Bad file descriptor
Why can't python open the file descriptor I created in Swift?
Why can't python open the file descriptor I created in Swift?
Short answer (fudging a little): because a file descriptor is a process-local identifier which the OS uses to link to the open-file information it keeps for each process. You cannot copy file descriptors between processes.
Long answer:
In macOS/Unix/Linux (*nix) a file descriptor is just a process-local value which is used by the OS to link to the appropriate open file information within the OS. Different processes can have exactly the same file descriptor values which identify completely different files. Therefore you cannot simply copy a file descriptor value between processes.
In *nix a child process inherits the open files, and their associated descriptors, from its parent. This is the only way file descriptors get passed between processes. In outline the steps are:
The parent process forks, creating a clone of itself
The clone then closes any files the child should not access (usually all of them except the standard input, output and error files).
If the parent has pre-opened files that should be the child's standard input, output or error the clone then reassigns the file descriptors for those files to the standard file descriptors for standard input, output and error.
After all this file descriptor work is done the clone then replaces its code with the code the child needs to run - this keeps the open files and file descriptors.
The child code now executes unaware of all this setup.
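For illustration, here is a Unix-only Python sketch of those steps (Swift's Process does the equivalent internally; the interpreter and script paths are placeholders):
import os

read_end, write_end = os.pipe()  # parent creates the pipe first
pid = os.fork()                  # step 1: clone the process
if pid == 0:                     # in the child (the clone)
    os.close(write_end)          # step 2: close what the child shouldn't use
    os.dup2(read_end, 0)         # step 3: the pipe becomes standard input (fd 0)
    os.close(read_end)
    # step 4: replace the clone's code; fd 0 survives the exec
    os.execv("/usr/local/bin/python", ["python", "/pythonscript.py"])
else:                            # in the parent
    os.close(read_end)
    os.write(write_end, b"hello from the parent\n")
    os.close(write_end)
    os.waitpid(pid, 0)           # step 5: the child runs, unaware of the setup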
In Swift all the above is handled by Process, in Terminal it is handled by the shell which uses it to set up file redirection, pipes etc.
To get your pipe to your Python process you can (a) use the Process methods to attach it to the spawned processes standard input or output; (b) create a named pipe, that is one with a file path, and pass the file path to your python to open; or (c) go low-level and write some interfacing C code which does the fork/dup(2)/exec calls and starts up your python code with the pipe on a known descriptor other than standard input or output.
(a) is easiest! (b) requires you to do some research on named pipes; it's not hard, but you'll need to work with sandboxing if it's enabled and create the pipe in a directory both processes can access. (c) is best avoided.
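For option (a), the Python side needs no descriptor plumbing at all. Assuming the Swift parent attaches the pipe with task.standardInput = pipe, a sketch of the script is just:
#!/usr/local/bin/python
import sys

# The parent attached its Pipe to our standard input, so we simply
# read stdin; no file descriptor number is passed as an argument.
for line in sys.stdin:
    print("received: " + line.rstrip())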
Have fun, and if you get stuck ask a new question showing what you've tried, where it goes wrong, etc. and someone will undoubtedly help you along.
HTH

Python Path.rglob failing on network error when encountering folder without permission

I am new to Python and have been using this site as a reference...thanks for everything, I have learned a ton. First question:
I am running a basic recursive file search with Path.rglob(). I am running into an error when it encounters a folder that it does not have permission to access. I am running Python 3.7 on Windows and connecting to a Windows share on a network drive.
Here's my code:
scan_folder = pathlib.Path("//192.168.1.242/Media")
nfo_files = list(scan_folder.rglob("*.nfo"))
It works perfectly until I encounter a folder that I do not have permission to access, then it errors out with:
Traceback (most recent call last):
File "D:/Working/media_tools/media_tools/movies_nfo_cataloger.py", line 337, in <module>
nfo_files = list(scan_folder.rglob("*.nfo"))
File "C:\Users\ulrick65\Anaconda3\lib\pathlib.py", line 1094, in rglob
for p in selector.select_from(self):
File "C:\Users\ulrick65\Anaconda3\lib\pathlib.py", line 544, in _select_from
for p in successor_select(starting_point, is_dir, exists, scandir):
File "C:\Users\ulrick65\Anaconda3\lib\pathlib.py", line 507, in _select_from
entries = list(scandir(parent_path))
OSError: [WinError 59] An unexpected network error occurred: '\\\\192.168.1.242\\Media\\#recycle'
Process finished with exit code 1
I searched and found the following issue for pathlib that appears to have been fixed; however, the error is different in my case, as it points to an "unexpected network error" instead of permissions.
https://bugs.python.org/issue24120
I verified that this is indeed a permissions error: I did not have access to that #recycle folder as the user I am logged in as. I edited the permissions for that folder to give myself access, and the code runs fine after that.
I know I could use os.walk, as it ignores these... but given the bug fix I linked to above, I figured Path.rglob() should handle this too; it doesn't. Also, Path.rglob() is pretty slick: one line of code, and fast (not that os.walk wouldn't be just as fast).
Any help is appreciated.
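A minimal sketch of the os.walk workaround mentioned above, which skips unreadable directories and reports them via the onerror callback (the function name and log wording are illustrative):
import fnmatch
import os

def rglob_skip_errors(root, pattern):
    matches = []
    def on_error(err):
        # os.walk already skips directories it cannot list; this just
        # reports them, e.g. the inaccessible \\192.168.1.242\Media\#recycle.
        print("skipping:", err)
    for dirpath, _dirnames, filenames in os.walk(root, onerror=on_error):
        for name in fnmatch.filter(filenames, pattern):
            matches.append(os.path.join(dirpath, name))
    return matches

nfo_files = rglob_skip_errors(r"\\192.168.1.242\Media", "*.nfo")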

extendscript return argument from python script

I am working on an ExtendScript script (Adobe After Effects - but it is basically just JavaScript) which needs to iterate over tens of thousands of file names on a server. This is extremely slow in ExtendScript, but I can accomplish what I need in just a few seconds with Python, which is my preferred language anyway. So I would like to run a Python file and return an array back into ExtendScript. I'm able to run my Python file and pass an argument (the root folder) by creating and executing a batch file, but how would I pass the result (an array) back into ExtendScript? I suppose I could write out a .csv and read it back in, but that seems a bit "hacky".
In After Effects you can use the "system" object's callSystem() method. This gives you access to the system's shell, so you can run any script from your code. You can write a Python script that echoes or prints the array, and whatever it writes to standard output is essentially what system.callSystem() returns. It's a synchronous call, so it has to complete before the next line of ExtendScript executes.
The actual code might be something like:
var stdOut = system.callSystem("python my-python-script.py")
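The Python side then only needs to print something easy to split; a hypothetical my-python-script.py that emits one file name per line (the root-folder argument and output format are illustrative):
import os
import sys

root = sys.argv[1]  # root folder passed in from ExtendScript / the batch file
for dirpath, _dirnames, filenames in os.walk(root):
    for name in filenames:
        # One path per line; callSystem returns all of stdout as one string.
        print(os.path.join(dirpath, name))
On the ExtendScript side, something like stdOut.split("\n") would then turn the returned text into an array.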
