I have a web app backend where I use pysondb (https://github.com/pysonDB/pysonDB) to upload tasks which are then executed by another program (the sniffer).
The sniffer (a completely separate program) checks the database in an infinite loop for new unfinished tasks, executes them, and updates the database.
I don't want to read the database repeatedly; instead I want to watch the database file (db.json) for changes and only then read it. I have looked into watchdog, but I was hoping for something more lightweight and modern to suit my needs.
# infinite loop
import pysondb
import time
from datetime import datetime
# calling aligner with os.system
import os
import subprocess
from pathlib import Path
while True:
    # always alive
    time.sleep(2)
    try:
        # process files
        db = pysondb.getDb("../tasks_db.json")
        tasks = db.getBy({"task_status": "uploaded"})
        for task in tasks:
            try:
                task_path = task["task_path"]
                cost = task["cost"]
                corpus_folder = task_path
                get_output = subprocess.Popen(
                    f"mfa validate {corpus_folder} english english",
                    shell=True, stdout=subprocess.PIPE).stdout
                res = get_output.read().decode("utf-8")
                if "ERROR - There was an error in the run, please see the log." in res:
                    # log errors
                    with open("sniffer_log.error", "a+") as f:
                        f.write(f"{datetime.now()} :: {res}\n")
                else:
                    align_folder = f"{corpus_folder}_aligned"
                    Path(align_folder).mkdir(parents=True, exist_ok=True)
                    o = subprocess.Popen(
                        f"mfa align {corpus_folder} english english {align_folder}",
                        shell=True, stdout=subprocess.PIPE).stdout.read().decode("utf-8")
                    # success
            except subprocess.CalledProcessError:
                # mfa align ~/mfa_data/my_corpus english english ~/mfa_data/my_corpus_aligned
                # log errors
                with open("sniffer_log.error", "a+") as f:
                    f.write(f"{datetime.now()} :: Files not in right format\n")
    except Exception as e:
        # log errors
        with open("sniffer_log.error", "a+") as f:
            f.write(f"{datetime.now()} :: {e}\n")
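If polling is acceptable but you want to avoid re-parsing the database every cycle, one lightweight stdlib-only option is to watch the file's modification time and only open db.json when it actually changes. A minimal sketch (the `wait_for_change` helper and its parameters are hypothetical, not part of pysondb):

```python
import os
import time

def wait_for_change(path, last_mtime, interval=1.0, timeout=None):
    """Block until the mtime of `path` differs from `last_mtime`.

    Returns the new mtime, or None if `timeout` seconds elapse first.
    """
    deadline = None if timeout is None else time.monotonic() + timeout
    while True:
        mtime = os.stat(path).st_mtime
        if mtime != last_mtime:
            return mtime
        if deadline is not None and time.monotonic() >= deadline:
            return None
        time.sleep(interval)
```

The sniffer loop could then call wait_for_change("../tasks_db.json", last_mtime) before each db.getBy(...), so the JSON file is only parsed when something was actually written.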
Using python-rq would be a much more efficient way of doing this, and it wouldn't need a database at all. Its only requirement beyond Python is a Redis installation. From there, you could just move all of that into a function:
def task(task_path, cost):
    corpus_folder = task_path
    get_output = subprocess.Popen(
        f"mfa validate {corpus_folder} english english",
        shell=True, stdout=subprocess.PIPE).stdout
    res = get_output.read().decode("utf-8")
    if "ERROR - There was an error in the run, please see the log." in res:
        # log errors
        with open("sniffer_log.error", "a+") as f:
            f.write(f"{datetime.now()} :: {res}\n")
    ...  # etc
Obviously you would rename that function and put the try-except statement back, but then you could just call that through RQ:
# ... where you want to call the function
from wherever.you.put.your.task.function import task
result = your_redis_queue.enqueue(task, "whatever", "arguments")
I'm building an icecast2 radio station which will restream existing stations at a lower quality. This program will generate multiple FFmpeg processes restreaming 24/7. For troubleshooting purposes, I would like to have the output of every FFmpeg process redirected to a separate file.
import ffmpeg, csv
from threading import Thread
def run(name, mount, source):
    icecast = "icecast://" + ICECAST2_USER + ":" + ICECAST2_PASS + "@localhost:" + ICECAST2_PORT + "/" + mount
    stream = (
        ffmpeg
        .input(source)
        .output(
            icecast,
            audio_bitrate=BITRATE, sample_rate=SAMPLE_RATE, format=FORMAT, acodec=CODEC,
            reconnect="1", reconnect_streamed="1", reconnect_at_eof="1", reconnect_delay_max="120",
            ice_name=name, ice_genre=source
        )
    )
    return stream

with open('stations.csv', mode='r') as data:
    for station in csv.DictReader(data):
        stream = run(station['name'], station['mount'], station['url'])
        thread = Thread(target=stream.run)
        thread.start()
As I understand it, I can't redirect the stdout of each thread separately. I also can't use ffmpeg's built-in reporting, which can only be configured through an environment variable. Do I have any other options?
You need to create a thread function of your own:

import subprocess as sp

def stream_runner(stream, id):
    # open a stream-specific log file to write to
    with open(f'stream_{id}.log', 'wt') as f:
        # block until ffmpeg is done
        sp.run(stream.compile(), stderr=f)

for i, station in enumerate(csv.DictReader(data)):
    stream = run(station['name'], station['mount'], station['url'])
    thread = Thread(target=stream_runner, args=(stream, i))
    thread.start()
Something like this should work.
ffmpeg-python doesn't quite give you the tools to do this - you want to control one of the arguments to subprocess, stderr, but ffmpeg-python doesn't expose an argument for it.
However, what ffmpeg-python does have, is the ability to show the command line arguments that it would have used. You can make your own call to subprocess after that.
You also don't need to use threads to do this - you can set up each ffmpeg subprocess, without waiting for it to complete, and check in on it each second. This example starts up two ffmpeg instances in parallel, and monitors each one by printing out the most recent line of output from each one every second, as well as tracking if they've exited.
I made two changes for testing:
It gets the stations from a dictionary rather than a CSV file.
It transcodes an MP4 file rather than an audio stream, since I don't have an icecast server. If you want to test it, it expects to have a file named 'sample.mp4' in the same directory.
Both should be pretty easy to change back.
import ffmpeg
import subprocess
import os
import time

stations = [
    {'name': 'foo1', 'input': 'sample.mp4', 'output': 'output.mp4'},
    {'name': 'foo2', 'input': 'sample.mp4', 'output': 'output2.mp4'},
]

class Transcoder():
    def __init__(self, arguments):
        self.arguments = arguments

    def run(self):
        stream = (
            ffmpeg
            .input(self.arguments['input'])
            .output(self.arguments['output'])
        )
        args = stream.compile(overwrite_output=True)
        with open(self.log_name(), 'ab') as logfile:
            self.subproc = subprocess.Popen(
                args,
                stdin=None,
                stdout=None,
                stderr=logfile,
            )

    def log_name(self):
        return self.arguments['name'] + "-ffmpeg.log"

    def still_running(self):
        return self.subproc.poll() is None

    def last_log_line(self):
        with open(self.log_name(), 'rb') as f:
            try:  # catch OSError in case of a one-line file
                f.seek(-2, os.SEEK_END)
                while f.read(1) not in [b'\n', b'\r']:
                    f.seek(-2, os.SEEK_CUR)
            except OSError:
                f.seek(0)
            last_line = f.readline().decode()
        last_line = last_line.split('\n')[-1]
        return last_line

    def name(self):
        return self.arguments['name']

transcoders = []
for station in stations:
    t = Transcoder(station)
    t.run()
    transcoders.append(t)

while True:
    for t in list(transcoders):
        if not t.still_running():
            print(f"{t.name()} has exited")
            transcoders.remove(t)
        print(t.name(), repr(t.last_log_line()))
    if len(transcoders) == 0:
        break
    time.sleep(1)
I have the following piece of code, where the C++ executable (run.out) prints out a bunch of info at runtime using std::cout. This code stores the output of run.out in storage.txt.
import subprocess

storage = open("storage.txt", "w")
shell_cmd = "run.out"
proc = subprocess.Popen([shell_cmd], stdout=storage, stderr=storage)
Once the subprocess starts, I need to frequently check the contents of storage.txt and decide based on what has just been stored in there. How may I do that?
You could use Popen.poll(), which returns immediately and indicates whether the subprocess is still running:
import time

while proc.poll() is None:
    time.sleep(0.25)  # reads the content four times a second
    data = open("storage.txt").read()
    if 'error' in data:
        print("failed ...")
        # do something ...
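If the file grows large, re-reading it from the start on every poll gets expensive. A sketch that remembers the last read offset and only hands newly appended text to a callback (`read_new`, `watch_output`, and `handle` are hypothetical names, not standard-library APIs):

```python
import time

def read_new(path, pos):
    """Return (text, new_pos): whatever was appended to `path` since offset `pos`."""
    with open(path, "r") as f:
        f.seek(pos)
        data = f.read()
        return data, f.tell()

def watch_output(proc, path, handle, interval=0.25):
    """Poll `path` while `proc` runs, passing only newly appended text to `handle`."""
    pos = 0
    while proc.poll() is None:
        text, pos = read_new(path, pos)
        if text:
            handle(text)
        time.sleep(interval)
    # drain anything written just before the process exited
    text, pos = read_new(path, pos)
    if text:
        handle(text)
```

You would call watch_output(proc, "storage.txt", decide) with your own decide function in place of the `if 'error' in data` check.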
I have a list of commands (any number) that have to be executed over telnet against one particular IP/host, with the output of each stored in a separate file. These commands are specific to log collection.
I need to start all of the required commands at once (to enable log collection), one telnet session per command. Some time later (this is not a timed activity), another script should stop all of them, with the logs stored in separate files based on the commands that were executed.
So far I have only managed to do this for a single command, and only for a short interval of time.
I hope the details are clear. Please let me know if not, and please help me in this regard.
import sys
import telnetlib
import time

orig_stdout = sys.stdout
f = open('out.txt', 'w')
sys.stdout = f
try:
    tn = telnetlib.Telnet(IP)
    tn.read_until(b"login: ")
    tn.write(username.encode('ascii') + b"\n")
    tn.read_until(b"# ")
    tn.write(command1.encode('ascii') + b"\n")
    # time.sleep(30)
    # 'abcd\b\n' is just a random pattern, so that it reads for a long duration
    z = tn.read_until(b'abcd\b\n', 4)
    # z = tn.read_very_eager()
    output = z.splitlines()
except:
    sys.exit(f"Telnet failed to {IP}")
for i in output:
    i = i.strip().decode("utf-8")
    print(i)
sys.stdout = orig_stdout
f.close()
This is probably a bit of a silly exercise for me, but it raises a bunch of interesting questions. I have a directory of logfiles from my chat client, and I want to be notified using notify-osd every time one of them changes.
The script that I wrote basically uses os.popen to run the Linux tail command on every one of the files to get the last line, and then checks each line against a dictionary of what the lines were the last time it ran. If a line changed, it uses pynotify to send me a notification.
This script actually worked perfectly, except that it used a huge amount of CPU (probably because it was running tail about 16 times on every pass of the loop, on files that were mounted over sshfs).
It seems like something like this would be a great solution, but I don't see how to implement that for more than one file.
Here is the script that I wrote. Pardon my lack of comments and poor style.
Edit: To clarify, this is all linux on a desktop.
Without even looking at your source code, there are two ways you could easily do this more efficiently and handle multiple files.
Don't bother running tail unless you have to. Simply os.stat all of the files and record the last-modified times. If a last-modified time has changed, then raise a notification.
Use pyinotify to call out to Linux's inotify facility; the kernel will do option 1 for you and call back when any file in your directory changes. Then translate the callback into your osd notification.
Now, there might be some trickiness depending on how many notifications you want when there are multiple messages and whether you care about missing a notification for a message.
An approach that preserves the use of tail would be to instead use tail -f. Open all of the files with tail -f and then use the select module to have the OS tell you when there's additional input on one of the file descriptors open for tail -f. Your main loop would call select and then iterate over each of the readable descriptors to generate notifications. (You could probably do this without using tail and just calling readline() when it's readable.)
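The tail -f idea above could be sketched like this (a sketch assuming GNU tail is available; `follow` is a made-up helper name, and note that inotify-style approaches generally don't fire for files mounted over sshfs, so a polling tail is the safer bet there):

```python
import select
import subprocess

def follow(paths):
    """Yield (path, line) pairs as lines are appended to any of the files.

    Spawns one `tail -F` per file and multiplexes them with select().
    """
    procs = {}
    for path in paths:
        # -n 0: skip existing content; -F: keep following across rotation
        t = subprocess.Popen(["tail", "-n", "0", "-F", path],
                             stdout=subprocess.PIPE)
        procs[t.stdout.fileno()] = (path, t)

    def lines():
        while True:
            # block until at least one tail has produced output
            readable, _, _ = select.select(list(procs), [], [])
            for fd in readable:
                path, t = procs[fd]
                line = t.stdout.readline()
                if line:
                    yield path, line.decode(errors="replace")

    return lines()
```

Each yielded line would become one notification; you could compare it against your last-line dictionary exactly as before.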
Other areas of improvement in your script:
Use os.listdir and native Python filtering (say, using list comprehensions) instead of a popen with a bunch of grep filters.
Update the list of buffers to scan periodically instead of only doing it at program boot.
Use subprocess.Popen instead of os.popen.
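For the first point, a list comprehension over os.listdir replaces the popen-plus-grep pipeline (a sketch; the `.log` suffix and `chat_logs` name are just examples):

```python
import os

def chat_logs(logdir, suffix=".log"):
    # stdlib filtering instead of shelling out to ls | grep
    return sorted(
        os.path.join(logdir, name)
        for name in os.listdir(logdir)
        if name.endswith(suffix)
    )
```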
If you're already using the pyinotify module, it's easy to do this in pure Python (i.e. no need to spawn a separate process to tail each file).
Here is an example that is event-driven by inotify, and should use very little cpu. When IN_MODIFY occurs for a given path we read all available data from the file handle and output any complete lines found, buffering the incomplete line until more data is available:
import os
import sys
import pynotify
import pyinotify

class Watcher(pyinotify.ProcessEvent):
    def __init__(self, paths):
        self._manager = pyinotify.WatchManager()
        self._notify = pyinotify.Notifier(self._manager, self)
        self._paths = {}
        for path in paths:
            self._manager.add_watch(path, pyinotify.IN_MODIFY)
            fh = open(path, 'rb')
            fh.seek(0, os.SEEK_END)
            self._paths[os.path.realpath(path)] = [fh, '']

    def run(self):
        while True:
            self._notify.process_events()
            if self._notify.check_events():
                self._notify.read_events()

    def process_default(self, evt):
        path = evt.pathname
        fh, buf = self._paths[path]
        data = fh.read()
        lines = data.split('\n')
        # prepend the previous incomplete line
        if buf:
            lines[0] = buf + lines[0]
        # only output the last line if it was complete
        if lines[-1]:
            buf = lines[-1]
            lines.pop()
        # display a notification
        notice = pynotify.Notification('%s changed' % path, '\n'.join(lines))
        notice.show()
        # and output to stdout
        for line in lines:
            sys.stdout.write(path + ': ' + line + '\n')
        sys.stdout.flush()
        self._paths[path][1] = buf

pynotify.init('watcher')
paths = sys.argv[1:]
Watcher(paths).run()
Usage:
% python watcher.py [path1 path2 ... pathN]
A simple pure-Python solution (not the best, but it doesn't fork, prints four empty lines after an idle period, and marks the source of each chunk whenever it changes):
#!/usr/bin/env python
'''
Implement multi-file tail
'''
import os
import sys
import time

def print_file_from(filename, pos):
    with open(filename, 'rb') as fh:
        fh.seek(pos)
        while True:
            chunk = fh.read(8192)
            if not chunk:
                break
            sys.stdout.buffer.write(chunk)

def _fstat(filename):
    st_results = os.stat(filename)
    return (st_results.st_size, st_results.st_mtime)

def _print_if_needed(filename, last_stats, no_fn, last_fn):
    changed = False
    # find the size of the file and move to the end
    tup = _fstat(filename)
    if last_stats[filename] != tup:
        changed = True
        if not no_fn and last_fn != filename:
            print('\n<%s>' % filename)
        print_file_from(filename, last_stats[filename][0])
        last_stats[filename] = tup
    return changed

def multi_tail(filenames, stdout=sys.stdout, interval=1, idle=10, no_fn=False):
    S = lambda st: (max(0, st[0] - 124), st[1])
    last_stats = dict((fn, S(_fstat(fn))) for fn in filenames)
    last_fn = None
    last_print = 0
    while True:
        changed = False
        for filename in filenames:
            if _print_if_needed(filename, last_stats, no_fn, last_fn):
                changed = True
                last_fn = filename
        if changed:
            if idle > 0:
                last_print = time.time()
        else:
            if idle > 0 and last_print is not None:
                if time.time() - last_print >= idle:
                    last_print = None
                    print('\n' * 4)
        time.sleep(interval)

if '__main__' == __name__:
    from optparse import OptionParser
    op = OptionParser()
    op.add_option('-F', '--no-fn', help="don't print the filename when it changes",
                  default=False, action='store_true')
    op.add_option('-i', '--idle', help='idle time, in seconds (0 turns off)',
                  type='int', default=10)
    op.add_option('--interval', help='check interval, in seconds', type='int',
                  default=1)
    opts, args = op.parse_args()
    try:
        multi_tail(args, interval=opts.interval, idle=opts.idle,
                   no_fn=opts.no_fn)
    except KeyboardInterrupt:
        pass
Suppose you're running Django on Linux and you've got a view, and you want that view to return the data from a subprocess called cmd that operates on a file that the view creates, for example like so:
def call_subprocess(request):
    response = HttpResponse()
    with tempfile.NamedTemporaryFile("w") as f:
        f.write(request.GET['data'])  # i.e. some data
        f.flush()
        # cmd operates on f.name and returns output
        p = subprocess.Popen(["cmd", f.name],
                             stdout=subprocess.PIPE,
                             stderr=subprocess.PIPE)
        out, err = p.communicate()
    response.write(out)  # would be text/plain...
    return response
Now, suppose cmd has a very slow start-up time, but a very fast operating time, and it does not natively have a daemon mode. I would like to improve the response-time of this view.
I would like to make the whole system run much faster by starting up a number of instances of cmd in a worker pool, having them wait for input, and having call_process ask one of those worker-pool processes to handle the data.
This is really 2 parts:
Part 1. A function that calls cmd and cmd waits for input. This could be done with pipes, i.e.
def _run_subcmd():
    p = subprocess.Popen(["cmd", fname],
                         stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    out, err = p.communicate()
    # write 'out' to a tmp file
    o = open("out.txt", "w")
    o.write(out)
    o.close()
    p.close()
    exit()

def _run_cmd(data):
    f = tempfile.NamedTemporaryFile("w")
    pipe = os.mkfifo(f.name)
    if os.fork() == 0:
        _run_subcmd(fname)
    else:
        f.write(data)
    r = open("out.txt", "r")
    out = r.read()  # read 'out' from a tmp file
    return out

def call_process(request):
    response = HttpResponse()
    out = _run_cmd(request.GET['data'])
    response.write(out)  # would be text/plain...
    return response
def call_process(request):
response = HttpResponse()
out = _run_cmd(request.GET['data'])
response.write(out) # would be text/plain...
return response
Part 2. A set of workers running in the background, waiting on data. That is, we want to extend the above so that the subprocess is already running; e.g. when the Django instance initializes, or when call_process is first called, a set of these workers is created:
WORKER_COUNT = 6
WORKERS = []

class Worker(object):
    def __init__(self, index):
        self.tmp_file = tempfile.NamedTemporaryFile("w")  # get a tmp file name
        os.mkfifo(self.tmp_file.name)
        self.p = subprocess.Popen(["cmd", self.tmp_file],
                                  stdout=subprocess.PIPE, stderr=subprocess.PIPE)
        self.index = index

    def run(self, out_filename, data):
        WORKERS[self.index] = None  # qua-mutex??
        self.tmp_file.write(data)
        if os.fork() == 0:  # does the child have access to self.p??
            out, err = self.p.communicate()
            o = open(out_filename, "w")
            o.write(out)
            exit()
        self.p.close()
        self.o.close()
        self.tmp_file.close()
        WORKERS[self.index] = Worker(index)  # replace this one
        return out_file

    @classmethod
    def get_worker(cls):
        # get the next worker ... static, incrementing index
        ...
There should be some initialization of workers somewhere, like this:
def init_workers():  # create WORKER_COUNT workers
    for i in range(0, WORKER_COUNT):
        tmp_file = tempfile.NamedTemporaryFile()
        WORKERS.append(Worker(i))
Now, what I have above becomes something like so:

def _run_cmd(data):
    worker = Worker.get_worker()  # this needs to be atomic & lock the worker at Worker.index
    fifo = open(tempfile.NamedTemporaryFile("r"))  # this stores the output of cmd
    worker.run(fifo.name, data)
    # please ignore the fact that everything will be appended to out.txt ...
    # these will be tmp files, too, but named elsewhere.
    out = fifo.read()  # read 'out' from a tmp file
    return out

def call_process(request):
    response = HttpResponse()
    out = _run_cmd(request.GET['data'])
    response.write(out)  # would be text/plain...
    return response
Now, the questions:
Will this work? (I've just typed this off the top of my head into Stack Overflow, so I'm sure there are problems, but conceptually, will it work?)
What are the problems to look for?
Are there better alternatives to this? e.g. Could threads work just as well (it's Debian Lenny Linux)? Are there any libraries that handle parallel process worker-pools like this?
Are there interactions with Django that I ought to be conscious of?
Thanks for reading! I hope you find this as interesting a problem as I do.
Brian
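On the worker-pool question specifically: the standard library's multiprocessing.Pool handles the pooling and dispatch, though it pre-forks Python workers rather than pre-starting cmd itself, so on its own it does not hide cmd's slow start-up. A sketch under that caveat (`run_cmd` and `make_pool` are hypothetical helpers; `cmd` is the binary from the question):

```python
import subprocess
from multiprocessing import Pool

def run_cmd(path):
    """Run `cmd` on one input file and return its stdout."""
    proc = subprocess.run(["cmd", path], capture_output=True, text=True)
    return proc.stdout

def make_pool(workers=6):
    # one pool shared by the Django process; apply_async hands a task
    # to the next free worker and returns immediately
    return Pool(processes=workers)
```

A view would then call pool.apply_async(run_cmd, (f.name,)).get() instead of spawning cmd directly, which at least removes the fork/lock bookkeeping from your own code.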
It may seem like I am punting this product, as this is the second time I have responded with a recommendation of it.
But it sounds like you need a message queuing service, in particular a distributed message queue.
Here is how it would work:
Your Django app requests CMD
CMD gets added to a queue
CMD gets pushed to several workers
It is executed and the results are returned upstream
Most of this code exists, and you don't have to go about building your own system.
Have a look at Celery, which was initially built with Django:
http://www.celeryq.org/
http://robertpogorzelski.com/blog/2009/09/10/rabbitmq-celery-and-django/
Issy already mentioned Celery, but since comments don't work well with code samples, I'll reply with an answer instead.
You should try to use Celery synchronously with the AMQP result store.
You could distribute the actual execution to another process or even another machine. Executing synchronously in Celery is easy, e.g.:
>>> from celery.task import Task
>>> from celery.registry import tasks

>>> class MyTask(Task):
...
...     def run(self, x, y):
...         return x * y
>>> tasks.register(MyTask)

>>> async_result = MyTask.delay(2, 2)
>>> retval = async_result.get()  # Now synchronous
>>> retval
4
The AMQP result store makes sending back the result very fast, but it's only available in the current development version (in code freeze to become 0.8.0).
How about "daemonizing" the subprocess call using python-daemon or its successor, grizzled?