In my program a user uploads a CSV file.
While the file is uploading and being processed by my app, I'd like to show a progress bar.
The problem is that this process isn't entirely under my control: I can't really tell how long it will take for the file to finish loading and being processed, since that depends on the file's content and size.
What would be the correct approach for doing this? It's not like I have many discrete steps where I could increment the progress bar each time one completes. It's basically waiting for a file to be loaded, and I can't determine how long that will take!
Is this even possible?
Thanks in advance
You don't give much detail, so I'll explain what I think is happening and give some suggestions from my thought process.
You have some kind of app with some kind of function/process that is a black box (i.e. you can't see inside it or change it). This black box uploads a CSV file to some server and returns control to your app when it's done. Since you can't see inside the black box, you can't determine how much it has uploaded, and thus can't create an accurate progress bar.
Named Pipes:
If you're passing only the filename of the CSV to the black box, you might be able to use a named pipe (depending on your situation). Since a named pipe blocks once its buffer is full, until the receiver reads from it, you can keep track of how much has been read and thus create an accurate progress bar.
So you would create a named pipe, pass the black box its filename, then read from the CSV and write to the named pipe. How far you've read is your progress.
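Here is a minimal sketch of that idea, assuming a POSIX system and that the black box accepts a file path to read from; process_csv below is just a stand-in for the real black-box call, and the file names are placeholders:

    import os
    import threading

    CSV_PATH = "input.csv"            # the real CSV (placeholder name)
    PIPE_PATH = "/tmp/progress_pipe"  # the named pipe handed to the black box

    def process_csv(path):
        # Stand-in for the black box: in reality this is the upload/processing call.
        with open(path, "rb") as f:
            while f.read(64 * 1024):
                pass

    os.mkfifo(PIPE_PATH)  # POSIX only
    worker = threading.Thread(target=process_csv, args=(PIPE_PATH,))
    worker.start()

    total = os.path.getsize(CSV_PATH)
    written = 0
    with open(CSV_PATH, "rb") as src, open(PIPE_PATH, "wb", buffering=0) as pipe:
        while True:
            chunk = src.read(64 * 1024)
            if not chunk:
                break
            pipe.write(chunk)      # blocks once the pipe buffer is full,
            written += len(chunk)  # so this roughly tracks what the black box has consumed
            print(f"progress: {written / total:.0%}")

    worker.join()
    os.unlink(PIPE_PATH)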
More Pythonic:
Since you tagged Python: if you're passing the CSV as a file-like object, this ActiveState recipe could help.
Same kind of idea, just in Python.
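I can't inline the recipe here, but the gist is to wrap the file-like object you hand to the black box and count the bytes as they are read. A rough sketch of that idea (class and callback names are mine, not from the recipe):

    import os

    class ProgressFile:
        """Wraps a real file and reports the fraction of it that has been read."""

        def __init__(self, path, callback):
            self._f = open(path, "rb")
            self._total = max(os.path.getsize(path), 1)
            self._read = 0
            self._callback = callback

        def read(self, size=-1):
            data = self._f.read(size)
            self._read += len(data)
            self._callback(self._read / self._total)  # fraction between 0.0 and 1.0
            return data

        def __getattr__(self, name):
            # Delegate everything else (seek, close, ...) to the real file object.
            return getattr(self._f, name)

    # Usage (hypothetical black box that accepts a file-like object):
    # black_box(ProgressFile("input.csv", lambda p: print(f"{p:.0%}")))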
Conclusion: these are two possible solutions; there may well be more, but I can't narrow it down further since you haven't given us much to work with.
To answer your question at an abstract level: you can't make accurate progress bars for black-box functions; after all, they could have a sleep(random()) call in them for all you know.
There are ways around this that are implementation-specific; the two ideas above are examples. The common idea is to make the black box take a stream instead and count the bytes as you pass them through.
Alternatively, you can guess/approximate: a rough calculation of how many bytes are going in, together with a previously measured average speed per byte, gives you some indication of when the process will complete. You could even record how long each run takes, so the estimate improves automatically with every run.
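As a rough illustration of that last idea (the seconds-per-byte figure and file name are placeholders you would measure and fill in yourself; black_box stands in for the real call):

    import os
    import threading
    import time

    CSV_PATH = "input.csv"         # placeholder
    AVG_SECONDS_PER_BYTE = 2e-7    # placeholder: derive this from timings of previous runs

    def black_box(path):
        time.sleep(3)              # stand-in for the real upload/processing call

    estimate = max(os.path.getsize(CSV_PATH) * AVG_SECONDS_PER_BYTE, 0.001)
    worker = threading.Thread(target=black_box, args=(CSV_PATH,))
    start = time.time()
    worker.start()
    while worker.is_alive():
        progress = min(0.99, (time.time() - start) / estimate)  # never claim 100% early
        print(f"~{progress:.0%}", end="\r")
        time.sleep(0.25)
    worker.join()
    print("100%")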
Related
I use ffmpeg-python to extract metadata from video files. I think the official documentation is available here: https://kkroening.github.io/ffmpeg-python/
To extract metadata (duration, resolution, frames per second, etc.) I use the provided function "ffmpeg.probe". Sadly, when running it on a large number of video files it is rather inefficient, as it seems to (obviously?) load the whole file into memory each time just to read a small amount of data.
If this is not what it does, maybe someone could explain what else might cause the rather long runtime.
Otherwise, is there any way to retrieve metadata more efficiently using ffmpeg or some other library?
Any feedback or help is very much appreciated.
Edit: For clarity I added my code here:
    import multiprocessing as mp
    import os

    import ffmpeg

    pool = mp.Pool()
    videos = []
    for file in os.listdir(directory):
        pool.apply_async(ffmpeg.probe, args=[os.path.join(directory, file)], callback=videos.append)
    pool.close()
    pool.join()
The definition of the directory path is omitted, but this should suffice to understand what is going on.
running it on a large amount of video files
This is where multithreading/multiprocessing could potentially help, IF the slowdown comes from spawning the ffprobe subprocesses (and not from actual I/O). It may not help much, since file I/O generally takes time compared to virtually everything else.
load the whole file into memory each time to read just a small amount of data
This is an incorrect assertion, IMO. ffprobe should only read the relevant headers/packets needed to retrieve the metadata. You are most likely paying the subprocess tax more than anything else.
a way to retrieve meta data
(1) Adding to what @Peter Hassaballeh said above, ffprobe has options to limit what it looks up. If you only need the container (format) level info, or only that of a particular stream, you can specify exactly what you need (to an extent). This could save some time; see the sketch after this list.
(2) You can try MediaInfo (another free tool, like ffprobe), which you should be able to call from Python as well.
(3) If you are dealing with one particular file format, the fastest way is to decode it yourself in Python and read only the bytes that matter to you. Depending on where the current bottleneck is, it may not be that drastic an improvement, though.
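For (1), here is a hedged sketch of calling ffprobe directly with its output limited to a few fields; the flags below are standard ffprobe options, and the field list is just an example to trim down to what you actually need:

    import json
    import subprocess

    def quick_probe(path):
        """Ask ffprobe for only a handful of fields instead of a full probe."""
        cmd = [
            "ffprobe", "-v", "quiet",
            "-select_streams", "v:0",  # first video stream only
            "-show_entries", "format=duration:stream=width,height,r_frame_rate",
            "-of", "json",
            path,
        ]
        return json.loads(subprocess.check_output(cmd))

    # Usage: info = quick_probe("clip.mp4")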
I suggest using ffprobe directly. Unfortunately ffmpeg can be CPU-expensive at times, but it all depends on your hardware specs.
I'm very new to python and programming in general, and I'm looking to make a discord bot that has a lot of hand-written chat lines to randomly pick from and send back to the user. Making a really huge variable full of a list of sentences seems like a bad idea. Is there a way that I can store the chatlines on a different file and have the bot pick from the lines in that file? Or is there anything else that would be better, and how would I do it?
I'll interpret this question as "how large a variable is too large?", to which the answer is pretty simple: a variable is too large when it becomes a problem. So, how can a variable become a problem? The big one is that the machine could run out of memory, and an OOM killer (out-of-memory killer) or similar will stop your program. How would you know if your variable is causing these issues? Pretty simple: your program crashes.
If the variable is static (with a size fully known at compile time or prior to interpretation), you can calculate how much RAM it will take. (This is a bit finicky with Python, so it might be easier to load it up at runtime and figure it out with a profiler.) If it's more than ~500 megabytes, you should be concerned. Over a gigabyte, and you'll probably want to reconsider your approach[^0]. So, what do you do then?
As suggested by @FishballNooodles, you can store your data line by line in a file and read the lines into an array. Unfortunately, the code they've provided still reads the entire thing into memory. If that ever becomes a problem, you've got a few options, non-exhaustively listed below.
Consume a random number of newlines from the file when you need a line of text. You would look at one character at a time, compare it to \n, and take the line once you've encountered the requested number of newlines. This is O(n) worst case with respect to the number of lines in the file.
Rather than storing the text itself at a given index, store its byte offset in the file. Then you can seek to that location (which is roughly O(1)) and read the text. This requires an O(n) index-building pass at the start of the program, but works much better at runtime; see the sketch after this list.
Use an actual database. It's usually better not to reinvent the wheel. If you're just storing plain text, this is probably overkill, but don't discount it.
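A rough sketch of the offset-index idea from the second option (the file name is just an example):

    import random

    def build_index(path):
        """One O(n) pass: record the byte offset where each line starts."""
        offsets = []
        pos = 0
        with open(path, "rb") as f:
            for line in f:
                offsets.append(pos)
                pos += len(line)
        return offsets

    def random_line(path, offsets):
        """Seek straight to one random line instead of loading the whole file."""
        with open(path, "rb") as f:
            f.seek(random.choice(offsets))
            return f.readline().decode("utf-8").rstrip()

    # offsets = build_index("responses.txt")
    # print(random_line("responses.txt", offsets))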
[^0]: These numbers are essentially arbitrary. If you control the server environment on which you run the code, you can probably come up with more precise signposts.
You can store your data in a file, say response.txt, and retrieve it in the Discord bot file with open("response.txt").readlines()
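Building on that, picking a random reply is only a couple more lines (assuming the same response.txt with one reply per line):

    import random

    with open("response.txt", encoding="utf-8") as f:
        responses = [line.rstrip("\n") for line in f]

    reply = random.choice(responses)  # the line the bot sends back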
My web app asks users 3 questions and simply writes the answers to a file as a1,a2,a3. I also have a real-time visualization of the average of the data (it reads from the file in real time).
Must I use a database to ensure that no (or minimal) information is lost? Is it possible to produce a queue of reads/writes? (Since the files are small, I am not too worried about the execution time of each call.) Does Python/Flask already take care of this?
I am quite experienced in Python itself, but not in this area (with Flask).
I see a few solutions:
read /dev/urandom a few times, calculate the SHA-256 of the number, and use it as a file name; a collision is extremely improbable
use Redis and commands like LPUSH; using it from Python is very easy. Then RPOP from the right end of the linked list, and there's your queue (see the sketch below)
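A minimal sketch of the Redis approach using the redis-py client (the key name "answers" and the output file name are arbitrary; it assumes a Redis server on localhost):

    import json
    import redis

    r = redis.Redis()  # defaults to localhost:6379

    # Producer: call this from the Flask view when a user submits their answers.
    def enqueue(a1, a2, a3):
        r.lpush("answers", json.dumps([a1, a2, a3]))

    # Consumer: a single worker drains the queue and appends to the file,
    # so concurrent requests never write to the file at the same time.
    def drain(path="answers.csv"):
        while True:
            item = r.rpop("answers")
            if item is None:
                break
            a1, a2, a3 = json.loads(item)
            with open(path, "a") as f:
                f.write(f"{a1},{a2},{a3}\n")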
I've done some research on progress bars in Python, and a lot of the solutions seem to be based on work being divided into known, discrete chunks. I.e., iterating a known number of times and updating the progress bar with stdout every time a percentage point of the progress toward the end of the iterations is made.
My problem is a little less discrete. It involves walking a user directory that contains hundreds of sub-directories, gathering MP3 information, and entering it into a database. I could probably count the number of MP3 files in the directory before iteration and use that as a guideline for discrete chunks, but many of the mp3s may already be in the database, some of the files will take longer to read than others, errors will occur and have to be handled in some cases, etc. Besides, I'd like to know how to pull this off with non-discrete chunks for future reference. Here is the code for my directory-walk/database-update, if you're interested:
    import os
    import sys
    import sqlite3 as lite

    import mutagen
    from mutagen.mp3 import MP3

    # isMP3(), pathDict, and startDir are defined elsewhere in my program.
    for root, dirs, files in os.walk(startDir):
        for file in files:
            if isMP3(file):
                fullPath = os.path.join(root, file)
                # Check if path already exists in DB, skip iteration if so
                if unicode(fullPath, errors="replace") in pathDict:
                    continue
                try:
                    audio = MP3(fullPath)
                except mutagen.mp3.HeaderNotFoundError:  # Invalid file/ID3 info
                    # TODO: log for user to look up what files were not visitable
                    continue
                # Do database operations and error handling therein.
Is threading the best way to approach something like this? And if so, are there any good examples on how threading achieves this? I don't want a module for this because (a) it seems like something I should know how to do and (b) I'm developing for a dependency-lite situation.
If you don't know how many steps are in front of you, how can you show progress? That's the first thing: you have to count them all before starting the job.
Now, even if tasks differ in how long they take to finish, you shouldn't worry about that. Think about games: sometimes a progress bar seems to stall at one point and then jump ahead very fast. That's exactly what's happening under the hood: some tasks take longer than others. It's not a big deal (unless a single task is really long, like minutes, maybe).
Of course you can use threads. It can actually be quite simple with Queue and a thread pool: run, for example, 20 threads and build a Queue of jobs. Your progress is then the number of items taken off the Queue, with the initial length of the Queue as the limit. This seems like a good design.
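A sketch of that design (20 threads as in the example above; process_file is a stand-in for the mutagen/database work, and the directory name is a placeholder):

    import os
    import queue
    import threading

    START_DIR = "music"  # placeholder for startDir

    def process_file(path):
        pass  # stand-in for the mutagen lookup and database insert

    jobs = queue.Queue()
    for root, dirs, files in os.walk(START_DIR):  # count everything up front
        for name in files:
            if name.lower().endswith(".mp3"):
                jobs.put(os.path.join(root, name))

    total = max(jobs.qsize(), 1)

    def worker():
        while True:
            try:
                path = jobs.get_nowait()
            except queue.Empty:
                return
            process_file(path)
            done = total - jobs.qsize()  # approximate: in-flight items count as done
            print(f"{done}/{total} ({done / total:.0%})", end="\r")

    threads = [threading.Thread(target=worker) for _ in range(20)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

One caveat: if the per-file work writes to SQLite, give each worker thread its own connection rather than sharing one across threads.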
Update: for anyone wondering what I went with in the end -
I divided the result set into 4 and ran 4 instances of the same program, each with one argument indicating which set to process. That did the trick for me. I also considered the PP module; though it worked, I prefer the same-program approach. Please pitch in if this is a horrible implementation! Thanks.
The following is what my program does. Nothing memory-intensive. It is serial processing, and boring. Could you help me convert this into a more efficient and exciting process? Say I process 1000 records this way; with 4 threads I'd want it to run in 25% of the time!
I've read articles on how Python threading can be inefficient if done wrong. Even Python's creator says the same. So I'm wary, and while I read more about it, I want to see if the bright folks on here can steer me in the right direction. Muchas gracias!
    def startProcessing(sernum, name):
        '''
        Bunch of statements depending on result,
        will write to database (one update statement)
        Try Catch blocks which upon failing,
        will call this function until the request succeeds.
        '''

    for record in result:
        startProc = startProcessing(str(record[0]), str(record[1]))
Python threads can't run Python code at the same time because of the Global Interpreter Lock; you want separate processes instead. Look at the multiprocessing module.
(I was instructed to post this as an answer =p.)
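A hedged sketch of that multiprocessing route applied to the loop in the question (it assumes result is a list of rows as in the original program, and that startProcessing opens its own database connection inside each worker):

    from multiprocessing import Pool

    def startProcessing(sernum, name):
        """Same body as in the question: per-record work plus one database update."""
        pass  # stand-in

    if __name__ == "__main__":
        result = []  # the rows fetched earlier in the real program
        args = [(str(record[0]), str(record[1])) for record in result]
        with Pool(processes=4) as pool:  # 4 workers, matching the 4-way split in the update
            pool.starmap(startProcessing, args)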