I've done some research on progress bars in Python, and a lot of the solutions seem to be based on work being divided into known, discrete chunks. I.e., iterating a known number of times and updating the progress bar with stdout every time a percentage point of the progress toward the end of the iterations is made.
My problem is a little less discrete. It involves walking a user directory that contains hundreds of sub-directories, gathering MP3 information, and entering it into a database. I could probably count the number of MP3 files in the directory before iteration and use that as a guideline for discrete chunks, but many of the mp3s may already be in the database, some of the files will take longer to read than others, errors will occur and have to be handled in some cases, etc. Besides, I'd like to know how to pull this off with non-discrete chunks for future reference. Here is the code for my directory-walk/database-update, if you're interested:
import mutagen
import sys
import os
import sqlite3 as lite
from mutagen.mp3 import MP3

# startDir, isMP3 and pathDict are defined elsewhere in the program.
for root, dirs, files in os.walk(startDir):
    for file in files:
        if isMP3(file):
            fullPath = os.path.join(root, file)
            # Check if path already exists in DB, skip iteration if so
            if unicode(fullPath, errors="replace") in pathDict:
                continue
            try:
                audio = MP3(fullPath)
            except mutagen.mp3.HeaderNotFoundError:  # Invalid file/ID3 info
                # TODO: log for user to look up what files were not visitable
                continue
            # Do database operations and error handling therein.
Is threading the best way to approach something like this? And if so, are there any good examples on how threading achieves this? I don't want a module for this because (a) it seems like something I should know how to do and (b) I'm developing for a dependency-lite situation.
If you don't know how many steps are in front of you, how can you measure progress? That's the first thing: you have to count all of them before starting the job.
Even if tasks differ in how long they take to finish, you should not worry about that. Think about games. Sometimes their progress bars seem to stall at one point and then jump ahead very fast. This is exactly what's happening under the hood: some tasks take longer than others. But it's not a big deal (unless a task is really long, like minutes maybe?).
Of course you can use threads. It might actually be quite simple with Queue and ThreadPool. Run, for example, 20 threads and build a Queue of jobs. Your progress would then be based on the number of items left in the Queue, with the initial length of the Queue as the limit. This seems like a good design.
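A minimal sketch of that queue-of-jobs design, assuming Python 3's standard queue and threading modules. process_file is a hypothetical stand-in for the per-file MP3 work, and the worker count and report callback are illustrative choices, not part of the original code.

```python
import queue
import threading


def process_file(path):
    """Hypothetical placeholder for reading one MP3 and updating the DB."""
    return len(path)


def run_with_progress(paths, num_workers=4, report=print):
    jobs = queue.Queue()
    for path in paths:
        jobs.put(path)
    total = jobs.qsize()      # initial queue length is the progress limit
    done = 0
    lock = threading.Lock()

    def worker():
        nonlocal done
        while True:
            try:
                path = jobs.get_nowait()
            except queue.Empty:
                return        # no jobs left, this worker is finished
            try:
                process_file(path)
            except Exception:
                pass          # a failed file still counts as handled
            with lock:
                done += 1
                report("%d/%d (%.0f%%)" % (done, total, 100.0 * done / total))

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return done, total
```

Files that are skipped or fail still advance the counter, so the bar reaches 100% even when some MP3s are already in the database or unreadable.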
In my program a user uploads a csv file.
While the file is uploading & being processed by my app, I'd like to show a progress bar.
The problem is that this process isn't entirely under my control (I can't really tell how long it'll take for the file to finish loading & be processed, as this depends on the file content and the size).
What would be the correct approach for doing this? It's not like I have many steps and I could increment the progress bar every time a step happens.... It's basically waiting for a file to be loaded, I cannot determine the time for that!
Is this even possible?
Thanks in advance
You don't give much detail, so I'll explain what I think is happening and give some suggestions from my thought process.
You have some kind of app that has some kind of function/process that
is a black-box (i.e., you can't see inside it or change it); this
black-box uploads a csv file to some server and returns control back to
your app when it's done. Since you can't see inside the black-box you
can't determine how much it has uploaded and thus can't create an
accurate progress bar.
Named Pipes:
If you're passing only the filename of the csv to the black-box, you might be able to create a named pipe (depending on your situation). Since writes to a named pipe block once the buffer is full, until the receiver reads from it, you could keep track of how much has been read and thus create an accurate progress bar.
So you would create a named pipe, pass the black-box its filename, and then read in from the csv and write to the named pipe. How far you've read in is your progress.
More Pythonic:
Since you tagged Python, if you're passing the csv as a file-like object, this activestate recipe could help.
Same kind of idea just for Python.
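The recipe itself is not reproduced here. As a rough sketch of the same idea, a thin wrapper around a file-like object can count bytes as the consumer reads them. ProgressReader and consume are hypothetical names, and consume only stands in for the black-box.

```python
import io


class ProgressReader:
    """Wrap a file-like object and count the bytes read through it."""

    def __init__(self, fileobj, total_size, callback):
        self._fileobj = fileobj
        self._total = total_size
        self._callback = callback   # called with (bytes_read, total_size)
        self.bytes_read = 0

    def read(self, size=-1):
        chunk = self._fileobj.read(size)
        self.bytes_read += len(chunk)
        self._callback(self.bytes_read, self._total)
        return chunk


def consume(stream, chunk_size=4):
    """Hypothetical stand-in for the black-box that reads the stream."""
    while stream.read(chunk_size):
        pass


data = b"a,b,c\n1,2,3\n"
reader = ProgressReader(io.BytesIO(data), len(data),
                        lambda n, total: print("%d of %d bytes" % (n, total)))
consume(reader)
```

This only works if the black-box accepts a file-like object and calls read() on it. If it needs other methods (readline, iteration), the wrapper would have to count in those too.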
Conclusion: These are two possible solutions. I'm getting tired, and there may be many more - but I can't help more since you haven't given us much to work with.
To answer your question at an abstract level: you can't make accurate progress bars for black-box functions. After all, they could have a sleep(random()) call in them for all you know.
There are ways around this that are implementation specific, and the two ideas above are examples: you make the black-box take a stream instead, and count the bytes as you pass them through.
Alternatively you can guess/approximate: a rough calculation of how many bytes are going in and a (previously calculated) average speed per byte would give you some kind of indication of when it would complete. You could even save how long each run took in your code and apply the same idea automatically, so the estimate gets better each time.
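A small sketch of that approximation, assuming you keep a history of (bytes, seconds) pairs from earlier runs. The function names and the 0.99 cap are illustrative choices, not anything from the question.

```python
def estimate_seconds(num_bytes, history):
    """Estimate processing time from past (num_bytes, seconds) runs."""
    total_bytes = sum(b for b, _ in history)
    total_seconds = sum(s for _, s in history)
    if total_bytes == 0:
        return None           # no data yet, nothing to extrapolate from
    seconds_per_byte = total_seconds / total_bytes
    return num_bytes * seconds_per_byte


def fraction_done(elapsed, estimated):
    """Approximate progress, capped below 1.0 until the job really ends."""
    if not estimated:
        return 0.0
    return min(elapsed / estimated, 0.99)
```

Appending each finished run to the history is what makes the estimate improve over time. The cap keeps the bar from claiming completion when a run takes longer than predicted.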
My web app asks users 3 questions and simply writes the answers to a file as a1,a2,a3. I also have a real-time visualization of the average of the data (read in real time from the file).
Must I use a database to ensure that no (or minimal) information is lost? Is it possible to produce a queue of reads/writes? (Since the files are small, I am not too worried about the execution time of each call.) Does Python/Flask already take care of this?
I am quite experienced in Python itself, but not in this area (with Flask).
I see a few solutions:
read /dev/urandom a few times, calculate the SHA-256 of the result and use it as a file name; a collision is extremely improbable
use Redis and commands like LPUSH (using it from Python is very easy), then RPOP from the right end of the linked list; there's your queue
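The first option can be sketched with the standard library alone. os.urandom draws from the operating system's random source, and random_filename and save_answers are hypothetical helper names. Each submission gets its own file, so concurrent requests never write to the same one.

```python
import hashlib
import os


def random_filename(suffix=".txt"):
    """Build a file name from the SHA-256 of random bytes from the OS."""
    digest = hashlib.sha256(os.urandom(32)).hexdigest()
    return digest + suffix


def save_answers(directory, answers):
    """Write one submission (a1, a2, a3) to its own uniquely named file."""
    path = os.path.join(directory, random_filename())
    with open(path, "w") as f:
        f.write(",".join(answers) + "\n")
    return path
```

The visualization would then average over all files in the directory instead of reading a single shared file.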
How can I use the wx.ProgressDialog to time my method called imgSearch? The imgSearch method finds image files on the user's pc. How can I make the wx.ProgressDialog run while imgSearch is still running and display how long the imgSearch is taking?
Here's my code:
def onFind(self, event):  # triggered by a button click
    max = 80
    dlg = wx.ProgressDialog("Progress dialog example", "An informative message", parent=self, style=wx.PD_CAN_ABORT | wx.PD_APP_MODAL | wx.PD_ELAPSED_TIME | wx.PD_REMAINING_TIME)
    keepGoing = True
    count = 0
    imageExtentions = ['*.jpg', '*.jpeg', '*.png', '*.tif', '*.tiff']
    selectedDir = 'C:\\'
    imgSearch.findImages(imageExtentions, selectedDir)  # my method
    while keepGoing and count < max:
        count += 1
        wx.MilliSleep(250)
        if count >= max / 2:
            (keepGoing, skip) = dlg.Update(count, "Half-time!")
        else:
            (keepGoing, skip) = dlg.Update(count)
    dlg.Destroy()
The Problem
Your question is vague but after looking at your code I am going to guess that your problem is that imgSearch.findImages() completes before the dialog ever even opens. I'd appreciate if you could edit your question so I know if this is correct.
I see you are using code from this tutorial to show you how to use wx.ProgressDialog. Unfortunately, you are taking a rather naive approach to this. The purpose of that tutorial is to introduce readers to a wide range of built-in dialogs. As such, the wx.ProgressDialog example is a simple example simply to show you the methods, not to be used as an implementation.
Background Information
First, you need to know what multithreading is and how it works. Everything a computer does is broken down into a series of mathematical operations. The computer processor can only do one operation at a time. Therefore, the computer can only do one thing at a time. Now you might be thinking "well that isn't true. I'm listening to an MP3 while browsing the internet. That's at least two things". Well you're right and you're wrong. Imagine a chef. A chef can prepare many dishes at once but he can only pay attention to them one at a time. Therefore, he must switch between the dishes, performing some small task, until they are all complete. That's how a computer multitasks. Instead of "dishes", computers switch between "threads".
The Solution
Step 1: Threading
By default wxPython is only a single thread (this isn't exactly true, but just go with it). wxPython literally runs your code line-by-line such that one line must finish completely before the next line is run. Therefore, while imgSearch.findImages() is running no other code is executing and it must complete before the wx.ProgressDialog can even be created. The good news is that there are several ways to fix this. The bad news is that fixing this can become quite complicated. I'm not an expert but I suggest doing a search for "wxpython threading" and trying to find some tutorials. I found this tutorial but it is fairly old. The main point is that you will need to run imgSearch.findImages() in its own thread so that it does not block the GUI.
Step 2: Metrics
So now, how do you make the wx.ProgressDialog reflect the actual status of imgSearch.findImages()? Well, that depends. You need to find some metric to help you measure how much has been done and how much is left to do. In your code the metric is time, because the progress is programmed to take 80 * 250ms = 20s. Time probably isn't what you want. You will need to decide what is best for your program, but I will use the example of "number of folders" because it's easy to understand. I will count the total number of folders as well as the number of folders I have completely scanned.
Now there are two ways to initialize your metric. You can calculate the max you need up front if the number is small or easy to calculate (such as time), or you can take a dynamic approach. In my case, calculating "number of folders" exactly would require almost as much time as the image search itself, so dynamic is the best option. So, I would modify imgSearch.findImages() so that it keeps a count of the number of folders it has seen as well as the number of folders it has scanned. For example, if my starting location has 3 folders, I start with 3 seen and 0 scanned (3/0). I begin scanning the first folder, which itself contains 2 more folders, so I now have 5/0. Every time I completely scan a folder I add 1 to the number of scanned folders, i.e. 5/1 after scanning one folder.
Knowing how many folders I have scanned and how many folders I have left to scan, I can estimate my percentage completion using # of folders scanned / total # of folders. 1 / 5 = 0.20, therefore I estimate that I am 20% complete. Note that if the next folder I scan contains 5 more folders, then this number automatically drops to 10%. That is why I called 20% an estimate.
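The dynamic folder count can be sketched like this. It is not the asker's imgSearch.findImages(), only a hypothetical scanner that reports (scanned, seen) as it goes. Unlike the 3/0 example, this sketch also counts the starting folder itself.

```python
import os


def scan_with_estimate(start_dir, on_progress):
    """Walk start_dir, reporting (scanned, seen) folder counts as it goes."""
    pending = [start_dir]
    seen = 1        # folders discovered so far, including start_dir
    scanned = 0     # folders completely scanned
    while pending:
        current = pending.pop()
        try:
            entries = list(os.scandir(current))
        except OSError:
            entries = []      # unreadable folder: count it and move on
        for entry in entries:
            if entry.is_dir(follow_symlinks=False):
                pending.append(entry.path)
                seen += 1
        # (the real image matching for `current` would go here)
        scanned += 1
        on_progress(scanned, seen)   # estimate = scanned / seen
    return scanned, seen
```

The ratio scanned / seen can go down as well as up while new folders are discovered. It only reaches 1.0 when the pending list is empty.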
Step 3: Updating
So by now we should have imgSearch.findImages() in its own thread and it should be able to estimate how far from completion it is. Now we need to update the wx.ProgressDialog. My favourite method for these types of problems is using the pubsub library. Basically, the imgSearch.findImages() thread must occasionally send a message which says how close to completion it is. Maybe at the end of each folder. Maybe every three folders. It's up to you. Next, you will need to create a method in your main GUI's thread which can read these messages and update wx.ProgressDialog accordingly.
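To show the message flow without depending on wxPython or pubsub, here is a sketch that swaps in the standard library's queue.Queue as the channel between the two threads. find_images is a hypothetical stand-in for imgSearch.findImages(), and update is whatever the GUI side does with each message (in wxPython, presumably a call to dlg.Update).

```python
import queue
import threading


def find_images(folders, messages):
    """Hypothetical stand-in for imgSearch.findImages running in a thread."""
    total = len(folders)
    for index, folder in enumerate(folders, start=1):
        # ... the real search of `folder` would happen here ...
        messages.put(("progress", index, total))
    messages.put(("done", total, total))


def run_search(folders, update):
    """Start the worker thread and relay its progress messages to update()."""
    messages = queue.Queue()
    worker = threading.Thread(target=find_images, args=(folders, messages))
    worker.start()
    while True:
        kind, scanned, total = messages.get()   # blocks until a message arrives
        if kind == "done":
            break
        update(scanned, total)
    worker.join()


run_search(["C:\\a", "C:\\b", "C:\\c"],
           lambda scanned, total: print("%d of %d folders" % (scanned, total)))
```

In a real GUI the main thread should not sit in a blocking loop like run_search does. The messages would be handled from the event loop instead, for example by having the worker schedule the update with wx.CallAfter or by polling the queue from a timer.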
It's a lot of work but I wish you the best of luck.
I have created a simple substring search program that recursively looks through a folder and scans a large number of files. The program uses the Boyer-Moore-Horspool algorithm and is very efficient at parsing large amounts of data.
Link to program: http://pastebin.com/KqEMMMCT
What I'm trying to do now is make it even more efficient. If you look at the code, you'll notice that there are three different directories being searched. I would like to be able to create a process/thread that searches each directory concurrently; that would greatly speed up my program.
What is the best way to implement this? I have done some preliminary research, but my implementations have been unsuccessful. They seem to die after 25 minutes or so of processing (right now the single process version takes nearly 24 hours to run; it's a lot of data, and there are 648 unique keywords).
I have done various experiments using the multiprocessing API and condensing all the various files into 3 files (one for each directory) and then mapping the files to memory via mmap(), but (a) I'm not sure if this is the appropriate route to go, and (b) my program kept dying at random points, and debugging was an absolute nightmare.
Yes, I have done extensive googling, but I'm getting pretty confused between pools/threads/subprocesses/multithreading/multiprocessing.
I'm not asking for you to write my program, just help me understand the thought process needed to go about implementing a solution. Thank you!
FYI: I plan to open-source the code once I get the program running. I think it's a fairly useful script, and there are limited examples of real world implementations of multiprocessing available online.
What to do depends on what's slowing down the process.
If you're reading from a single disk, and disk I/O is slowing you down, multiple threads/processes will probably just slow you down, as the read head will now be jumping all over the place as different threads get control, and you'll be spending more time seeking than reading.
If you're reading from a single disk, and processing is slowing you down, then you might get a speedup from using multiprocessing to analyze the data, but you should still read from a single thread to avoid seek time delays (which are usually very long, multiple milliseconds).
If you're reading from multiple disks, and disk I/O is slowing you down, then either multiple threads or processes will probably give you a speed improvement. Threads are easier, and since most of your delay time is away from the processor, the GIL won't be in your way.
If you're reading from multiple disks, and processing is slowing you down, then you'll need to go with multiprocessing.
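For the third case (multiple disks, I/O-bound), a thread per directory is probably the simplest thing to try. The sketch below assumes Python 3's concurrent.futures and uses a plain substring test in place of the Boyer-Moore-Horspool search from the question. search_directory and search_all are hypothetical names.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def search_directory(directory, keyword):
    """Return paths of files under directory whose bytes contain keyword."""
    hits = []
    for root, _dirs, files in os.walk(directory):
        for name in files:
            path = os.path.join(root, name)
            try:
                with open(path, "rb") as f:
                    if keyword in f.read():
                        hits.append(path)
            except OSError:
                continue      # unreadable file: skip it
    return hits


def search_all(directories, keyword):
    """Search each directory in its own thread and merge the results."""
    workers = max(1, len(directories))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(lambda d: search_directory(d, keyword), directories)
        return sorted(path for hits in results for path in hits)
```

If profiling shows the time goes into the matching rather than the reading, this will not help much because of the GIL, and the multiprocessing route from the fourth case applies instead.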
Multiprocessing is easier to understand/use than multithreading (IMO). For my reasons, I suggest reading this section of TAOUP. Basically, everything a thread does, a process does too; the difference is that with threads the programmer has to do everything that the OS would otherwise handle. Sharing resources (memory/files/CPU cycles)? Learn locking/mutexes/semaphores and so on for threads. The OS does this for you if you use processes.
I would suggest building 4+ processes: 1 to pull data from the hard drive, and the other three to query it for their next piece. Perhaps a fifth process to stick it all together.
This naturally fits into generators. See the genfind example, along with the gengrep example that uses it.
Also on the same site, check out the coroutines section.
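The linked examples are not reproduced here. The following is a sketch in the same generator-pipeline spirit, written for Python 3; the function names and signatures are my own choices for illustration.

```python
import fnmatch
import os
import re


def gen_find(pattern, top):
    """Yield paths under top whose file name matches a shell-style pattern."""
    for dirpath, _dirnames, filenames in os.walk(top):
        for name in fnmatch.filter(filenames, pattern):
            yield os.path.join(dirpath, name)


def gen_open(paths):
    """Yield every line of every file, one file at a time."""
    for path in paths:
        with open(path, errors="replace") as f:
            yield from f


def gen_grep(regex, lines):
    """Yield only the lines that match the regular expression."""
    compiled = re.compile(regex)
    return (line for line in lines if compiled.search(line))
```

The stages chain together as gen_grep("keyword", gen_open(gen_find("*.log", "some_dir"))), and nothing is read from disk until the result is iterated.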
Attempt #2:
People don't seem to be understanding what I'm trying to do. Let me see if I can state it more clearly:
1) Reading a list of files is much faster than walking a directory.
2) So let's have a function that walks a directory and writes the resulting list to a file. Now, in the future, if we want to get all the files in that directory we can just read this file instead of walking the dir. I call this file the index.
3) Obviously, as the filesystem changes the index file gets out of sync. To overcome this, we have a separate program that hooks into the OS in order to monitor changes to the filesystem. It writes those changes to a file called the monitor log. Immediately after we read the index file for a particular directory, we use the monitor log to apply the various changes to the index so that it reflects the current state of the directory.
Because reading files is so much cheaper than walking a directory, this should be much faster than walking for all calls after the first.
Original post:
I want a function that will recursively get all the files in any given directory and filter them according to various parameters. And I want it to be fast -- like, an order of magnitude faster than simply walking the dir. And I'd prefer to do it in Python. Cross-platform is preferable, but Windows is most important.
Here's my idea for how to go about this:
I have a function called all_files:
def all_files(dir_path, ...parms...):
    ...
The first time I call this function it will use os.walk to build a list of all the files, along with info about the files such as whether they are hidden, a symbolic link, etc. I'll write this data to a file called ".index" in the directory. On subsequent calls to all_files, the .index file will be detected, and I will read that file rather than walking the dir.
This leaves the problem of the index getting out of sync as files are added and removed. For that I'll have a second program that runs on startup, detects all changes to the entire filesystem, and writes them to a file called "mod_log.txt". It detects changes via Windows signals, like the method described here. This file will contain one event per line, with each event consisting of the path affected, the type of event (create, delete, etc.), and a timestamp. The .index file will have a timestamp as well for the time it was last updated. After I read the .index file in all_files I will tail mod_log.txt and find any events that happened after the timestamp in the .index file. I will take these recent events, find any that apply to the current directory, and update the .index accordingly.
Finally, I'll take the list of all files, filter it according to various parameters, and return the result.
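To make the update step concrete, here is a rough sketch of applying logged events to a cached list. The event format, a (timestamp, kind, path) tuple with kind being "create" or "delete", and the forward-slash paths are assumptions for the sketch; the post does not pin them down.

```python
def apply_events(index_paths, index_time, events, dir_path):
    """Bring a cached file list up to date using logged filesystem events.

    events is an iterable of (timestamp, kind, path) tuples; this layout
    is an assumption, not a real monitor-log format.
    """
    current = set(index_paths)
    prefix = dir_path.rstrip("/") + "/"
    for timestamp, kind, path in sorted(events):
        if timestamp <= index_time or not path.startswith(prefix):
            continue          # already reflected in the index, or elsewhere
        if kind == "create":
            current.add(path)
        elif kind == "delete":
            current.discard(path)
    return sorted(current)
```

Events are applied in timestamp order so that a create followed by a delete of the same path ends with the file absent.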
What do you think of my approach? Is there a better way to do this?
Edit:
Check this code out. I'm seeing a drastic speedup from reading a cached list over a recursive walk.
import os
from os.path import join, exists
import cProfile, pstats

dir_name = "temp_dir"
index_path = ".index"

def create_test_files():
    os.mkdir(dir_name)
    index_file = open(index_path, 'w')
    for i in range(10):
        print("creating dir: %d" % i)
        sub_dir = join(dir_name, str(i))
        os.mkdir(sub_dir)
        for j in range(100):
            file_path = join(sub_dir, str(j))
            open(file_path, 'w').close()
            index_file.write(file_path + "\n")
    index_file.close()

#
# 0.238 seconds
def test_walk():
    for info in os.walk("temp_dir"):
        pass

# 0.001 seconds
def test_read():
    open(index_path).readlines()

if not exists("temp_dir"):
    create_test_files()

def profile(s):
    cProfile.run(s, 'profile_results.txt')
    p = pstats.Stats('profile_results.txt')
    p.strip_dirs().sort_stats('cumulative').print_stats(10)

profile("test_walk()")
profile("test_read()")
Do not try to duplicate the work that the filesystem already does. You are not going to do better than it already does.
Your scheme is flawed in many ways and it will not get you an order-of-magnitude improvement.
Flaws and potential problems:
You are always going to be working with a snapshot of the file system. You will never know with any certainty that it is not significantly disjoint from reality. If that is within the working parameters of your application, no sweat.
The filesystem monitor program still has to recursively walk the file system, so the work is still being done.
In order to increase the accuracy of the cache, you have to increase the frequency with which the filesystem monitor runs. The more it runs, the less actual time that you are saving.
Your client application likely won't be able to read the index file while it is being updated by the filesystem monitor program, so you'll lose time while the client waits for the index to be readable.
I could go on.
If, in fact, you don't care about working with a snapshot of the filesystem that may be very disjoint from reality, I think that you'd be much better off keeping the index in memory and updating it from within the application itself. That will scrub any file contention issues that would otherwise arise.
The best answer came from Michał Marczyk toward the bottom of the comment list on the initial question. He pointed out that what I'm describing is very close to the UNIX locate program. I found a Windows version here: http://locate32.net/index.php. It solved my problem.
Edit: Actually the Everything search engine looks even better. Apparently Windows keeps journals of changes to the filesystem, and Everything uses that to keep the database up to date.
Doesn't Windows Desktop Search provide such an index as a byproduct? On the Mac, the Spotlight index can be queried for filenames like this: mdfind -onlyin . -name '*'.
Of course it's much faster than walking the directory.
The short answer is "no". You will not be able to build an indexing system in Python that will outpace the file system by an order of magnitude.
"Indexing" a filesystem is an intensive/slow task, regardless of the caching implementation. The only realistic way to avoid the huge overhead of building filesystem indexes is to "index as you go" to avoid the big traversal. (After all, the filesystem itself is already a data indexer.)
There are operating system features that are capable of doing this "build as you go" filesystem indexing. It's the very foundation of services like Spotlight on OSX and Windows Desktop Search.
To have any hope of getting faster speeds than walking the directories, you'll want to leverage one of those OS or filesystem level tools.
Also, try not to mislead yourself into thinking solutions are faster just because you've "moved" the work to a different time/process. Your example code does exactly that. You traverse the directory structure of your sample files while you're building those same files and creating the index, and then later just read that file.
There are two lessons here. (a) To create a proper test it's essential to separate the "setup" from the "test". Here your performance test essentially says, "Which is faster, traversing a directory structure or reading an index that's already been created in advance?" Clearly this is not an apples-to-apples comparison.
However, (b) you've stumbled on the correct answer at the same time. You can get a list of files much faster if you use an already existing index. This is where you'd need to leverage something like the Windows Desktop Search or Spotlight indexes.
Make no mistake, in order to build an index of a filesystem you must, by definition, "visit" every file. If your files are stored in a tree, then a recursive traversal is likely going to be the fastest way you can visit every file. If the question is "can I write Python code to do exactly what os.walk does but be an order of magnitude faster than os.walk" the answer is a resounding no. If the question is "can I write Python code to index every file on the system without taking the time to actually visit every file" then the answer is still no.
(Edit in response to "I don't think you understand what I'm trying to do")
Let's be clear here, virtually everyone here understands what you're trying to do. It seems that you're taking "no, this isn't going to work like you want it to work" to mean that we don't understand.
Let's look at this from another angle. File systems have been an essential component of modern computing from the very beginning. The categorization, indexing, storage, and retrieval of data is a serious part of computer science and computer engineering, and many of the most brilliant minds in computer science are working on it constantly.
You want to be able to filter/select files based on attributes/metadata/data of the files. This is an extremely common task utilized constantly in computing. It's likely happening several times a second even on the computer you're working with right now.
If it were possible to speed up this process by an order of magnitude(!) by simply keeping a text file index of the filenames and attributes, don't you think every single file system and operating system in existence would do exactly that?
That said, of course caching the results of your specific queries could net you some small performance increases. And, as expected, file system and disk caching is a fundamental part of every modern operating system and file system.
But your question, as you asked it, has a clear answer: No. In the general case, you're not going to get an order of magnitude faster by reimplementing os.walk. You may be able to get a better amortized runtime by caching, but you're not going to beat it by an order of magnitude if you properly include the work to build the cache in your profiling.
I would like to recommend you just use a combination of os.walk (to get directory trees) & os.stat (to get file information) for this. Using the std-lib will ensure it works on all platforms, and they do the job nicely. And no need to index anything.
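A minimal sketch of that combination, assuming Python 3. The filter parameters (min_size, include_symlinks) are made up for illustration, and os.lstat is used instead of os.stat so that symbolic links can be detected instead of followed.

```python
import os
import stat


def all_files(dir_path, min_size=0, include_symlinks=True):
    """Yield (path, stat_result) for files under dir_path passing the filters."""
    for root, _dirs, files in os.walk(dir_path):
        for name in files:
            path = os.path.join(root, name)
            try:
                info = os.lstat(path)     # lstat does not follow symlinks
            except OSError:
                continue                  # file vanished or is unreadable
            if stat.S_ISLNK(info.st_mode) and not include_symlinks:
                continue
            if info.st_size >= min_size:
                yield path, info
```

Because it is a generator, callers can start consuming results before the walk has finished.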
As others have stated, I don't really think you're going to buy much by attempting to index and re-index the filesystem, especially if you're already limiting your functionality by path and parameters.
I'm new to Python, but according to reports I've read, a combination of list comprehensions, an iterator and a generator should scream.
import os
import re

class DirectoryIterator:
    def __init__(self, start_dir, pattern):
        self.directory = start_dir
        self.pattern = pattern

    def __iter__(self):
        # os.walk already recurses, so a plain generator loop is enough.
        for dirpath, dirnames, filenames in os.walk(self.directory):
            for name in filenames:
                if re.search(self.pattern, name):
                    yield os.path.join(dirpath, name)

###########
for file_name in DirectoryIterator(".", r"\.py$"):
    print(file_name)