I'm working with fairly large dataframes and text files (thousands of docs) that I'm opening in my IPython notebook. I'm noticing that after a while, my computer becomes really slow. Is there a way to take inventory of my Python program to find out what's slowing down my computer?
You have a few options. First, you can use third-party tools like heapy or PySizer to evaluate your memory usage at different points in your program. This (now closed) SO question discusses them a little bit. There is also a third option called memory_profiler, hosted here on GitHub, and according to this blog there are some handy shortcuts for it in IPython.
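For example (hedging a bit on the details), the IPython integration looks roughly like this once memory_profiler is installed; my_module and process_docs are made-up names:
%load_ext memory_profiler

%memit [x ** 2 for x in range(10 ** 6)]     # peak memory used by a single statement

from my_module import process_docs          # hypothetical module and function
%mprun -f process_docs process_docs(docs)   # line-by-line memory report for that function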
Once you have identified the data structures that are consuming the most memory, there are a few options:
Refactor to take advantage of garbage collection
Examine the flow of data through your program and see if there are any places where large data structures are kept around when they don't need to be. If you have a large data structure that you do some processing on, put that processing in a function and return the processed result so the original memory hog can go out of scope and be destroyed.
A comment suggested using the del statement. Although the commenter is correct that it will free memory, it really should indicate to you that your program isn't structured correctly. Python has good garbage collection, and if you find yourself manually messing with memory freeing, you should probably put that section of code in a function or method instead, and let the garbage collector do its thing.
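To illustrate the scoping point, here's a rough sketch (the corpus file name is arbitrary):
def summarize(path):
    # the big list of documents exists only inside this function...
    docs = open(path).read().split('\n\n')
    # ...and only the small per-document summary escapes it
    return [len(d) for d in docs]

summary = summarize('corpus.txt')
# here `docs` is out of scope, so the garbage collector can reclaim it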
Temporary Files
If you really need access to large data structures (almost) simultaneously, consider writing one or several of them to temporary files while they're not needed. You can use the json or pickle modules to write things out in structured formats, or simply pprint your data to a file and read it back in later.
I know that seems like some kind of manual hard disk thrashing, but it gives you great control over exactly when the writes to and reads from the hard disk occur. Also, in this case only your files are bouncing on and off the disk. When you use up your memory and swapping starts occurring, everything gets bounced around - data files, program instructions, memory page tables, etc... Everything grinds to a halt instead of just your program running a little more slowly.
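A minimal sketch of the temporary-file idea using pickle (the nested list below is just a stand-in for your real memory hog):
import pickle, tempfile

big_structure = [list(range(100)) for _ in range(10 ** 4)]  # stand-in data

tmp = tempfile.NamedTemporaryFile(delete=False)
pickle.dump(big_structure, tmp)   # park it on disk while it's not needed
tmp.close()
big_structure = None              # drop the in-memory reference

# ... later, when you need it again ...
with open(tmp.name, 'rb') as f:
    big_structure = pickle.load(f)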
Buy More Memory
Yes, this is an option. But like the del statement, it can usually be avoided by more careful data abstraction and should be a last resort, reserved for special cases.
IPython is a wonderful tool, but sometimes it tends to slow things down.
If you have large print outputs, lots of graphics, or your code has grown too big, autosave can take forever to save your notebooks. Try autosaving less frequently with:
%autosave 300
Or disabling it entirely:
%autosave 0
I'm developing a Python command line utility that potentially involves rather large queries against a set of files. It's a reasonably finite list of queries (think indexed DB columns). To improve in-process performance, I can generate sorted/structured lists, maps, and trees once, and hit those repeatedly, rather than hit the file system each time.
However, these caches are lost when the process ends, and need to be rebuilt every time the script runs, which dramatically increases the runtime of my program. I'd like to identify the best way to share this data between multiple executions of my command, which may be concurrent, one after another, or with significant delays between executions.
Requirements:
Must be fast - any sort of per-execution processing should be minimized; this includes disk IO and object construction.
Must be OS agnostic (or at least be able to hook into similar underlying behaviors on Unix/Windows, which is more likely).
Must allow reasonably complex querying / filtering - I don't think a key/value map will be good enough
Does not need to be up-to-date - (briefly) stale data is perfectly fine, this is just a cache, the actual data is being written to disk separately.
Can't use a heavyweight daemon process, like MySQL or MemCached - I want to minimize installation costs, and asking each user to install these services is too much.
Preferences:
I'd like to avoid any sort of long-running daemon process at all, if possible.
While I'd like to be able to update the cache quickly, rebuilding the whole cache on update isn't the end of the world, fast reads are much more important than fast writes.
In my ideal fantasy world, I'd be able to directly keep Python objects around between executions, sort of like Java threads (like Tomcat requests) sharing singleton data store objects, but I realize that may not be possible. The closer I can get to that though, the better.
Candidates:
SQLite in memory
SQLite on its own doesn't seem fast enough for my use case, since it's backed by disk and therefore will have to read from the file on every execution. Perhaps this isn't as bad as it seems, but it seems necessary to persistently store the database in memory. SQLite allows for DBs to use memory as storage, but those DBs are destroyed upon program exit and cannot be shared between instances.
Flat file database loaded into memory with mmap
On the opposite end of the spectrum, I could write the caches to disk, then load them into memory with mmap, which could share the same memory space between separate executions. It's not clear to me what happens to the mmap if all processes exit, however. It's OK if the mmap is eventually flushed from memory, but I'd want it to stick around for a little bit (30 seconds? a few minutes?) so a user can run commands one after another and the cache can be reused. This example seems to imply that there needs to be an open mmap handle, but I haven't found any exact description of when memory-mapped files get dropped from memory and need to be reloaded from disk.
I think I could implement this, if mmap objects do stick around after exit, but it feels very low level, and I imagine someone's already got a more elegant solution implemented. I'd hate to start building this only to realize I've been rebuilding SQLite. On the other hand, it feels like it would be very fast, and I could make optimizations given my specific use case.
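For what it's worth, the read side of that idea is only a few lines (cache.bin is a placeholder and is assumed to exist and be non-empty; whether the pages stay warm after the last process exits is entirely up to the OS page cache):
import mmap

with open('cache.bin', 'rb') as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    header = mm[:16]   # slice pages on demand instead of reading the whole file
    mm.close()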
Share Python objects between processes using Processing
The Processing package indicates "Objects can be shared between processes using ... shared memory". Looking through the rest of the docs, I didn't see any further mention of this behavior, but that sounds very promising. Can anyone direct me to more information?
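For reference, the Processing package lives on as the standard multiprocessing module, and in Python 3.8+ the closest match to that quote seems to be multiprocessing.shared_memory. A rough sketch (the block name 'query_cache' is just something I made up):
from multiprocessing import shared_memory

# producer: create a named block and fill it
shm = shared_memory.SharedMemory(create=True, size=1024, name='query_cache')
shm.buf[:5] = b'hello'

# consumer (possibly another process): attach by name and read
other = shared_memory.SharedMemory(name='query_cache')
print(bytes(other.buf[:5]))

other.close()
shm.close()
shm.unlink()   # frees the block; note it won't survive a reboot, and on Windows it
               # vanishes once the last handle closes, which matters for this use case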
Store data on a RAM disk
My concern here is OS-specific capabilities, but I could create a RAM disk and then simply read/write to it as I please (SQLite?). The fs.memoryfs package seems like a promising alternative to work with multiple OSs, but the comments imply a fair number of limitations.
I know pickle is an efficient way to store Python objects, so it might have speed advantages over any sort of manual data storage. Can I hook pickle into any of the above options? Would that be better than flat files or SQLite?
I know there's a lot of questions related to this, but I did a fair bit of digging and couldn't find anything directly addressing my question with regards to multiple command line executions.
I fully admit, I may be way overthinking this. I'm just trying to get a feel for my options, and if they're worthwhile or not.
Thank you so much for your help!
I would just do the simplest thing that might possibly work, which in your case would likely be to dump to a pickle file. If you find it's not fast enough, try something more involved (like memcached or SQLite). As Donald Knuth says, "premature optimization is the root of all evil"!
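Roughly, something like this (cache.pkl and build_indexes are made-up names):
import os, pickle

CACHE = 'cache.pkl'

def load_or_build():
    if os.path.exists(CACHE):
        with open(CACHE, 'rb') as f:
            return pickle.load(f)      # cheap path: reuse the previous run's work
    data = build_indexes()             # hypothetical: the expensive part
    with open(CACHE, 'wb') as f:
        pickle.dump(data, f, pickle.HIGHEST_PROTOCOL)
    return data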
I have a relatively large dictionary. How do I know the size? Well, when I save it using cPickle, the file is about 400 MB. cPickle is supposed to be much faster than pickle, but loading and saving this file just takes a lot of time. I have a dual-core 2.6 GHz laptop with 4 GB RAM running Linux. Does anyone have any suggestions for faster saving and loading of dictionaries in Python? Thanks.
Use the protocol=2 option of cPickle. The default protocol (0) is much slower, and produces much larger files on disk.
If you just want to work with a larger dictionary than memory can hold, the shelve module is a good quick-and-dirty solution. It acts like an in-memory dict, but stores itself on disk rather than in memory. shelve is based on cPickle, so be sure to set your protocol to anything other than 0.
The advantages of a database like sqlite over cPickle will depend on your use case. How often will you write data? How many times do you expect to read each datum that you write? Will you ever want to perform a search of the data you write, or load it one piece at a time?
If you're doing write-once, read-many, and loading one piece at a time, by all means use a database. If you're doing write once, read once, cPickle (with any protocol other than the default protocol=0) will be hard to beat. If you just want a large, persistent dict, use shelve.
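For example, the shelve route with a non-default protocol is only a couple of lines (mydict.db is a placeholder filename):
import shelve

db = shelve.open('mydict.db', protocol=2)   # disk-backed, pickled with protocol 2
db['key'] = list(range(10 ** 6))            # behaves like a dict
value = db['key']
db.close()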
I know it's an old question, but as an update for those who are still looking for an answer:
The protocol argument was updated in Python 3, and there are now even faster and more efficient options (protocol=3 and protocol=4) which might not work under Python 2.
You can read more about it in the reference.
In order to always use the best protocol supported by the python version you're using, you can simply use pickle.HIGHEST_PROTOCOL. The following example is taken from the reference:
import pickle
# ...
with open('data.pickle', 'wb') as f:
    # Pickle the 'data' dictionary using the highest protocol available.
    pickle.dump(data, f, pickle.HIGHEST_PROTOCOL)
SQLite
It might be worthwhile to store the data in an SQLite database. Although there will be some development overhead when refactoring your program to work with SQLite, querying the data also becomes much easier and more performant.
You also get transactions, atomicity, serialization, and so on for free.
Depending on what version of Python you're using, you might already have sqlite built-in.
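A minimal sketch with the built-in sqlite3 module (the table and column names are made up):
import sqlite3

conn = sqlite3.connect('cache.db')
conn.execute('CREATE TABLE IF NOT EXISTS items (key TEXT PRIMARY KEY, value TEXT)')
conn.executemany('INSERT OR REPLACE INTO items VALUES (?, ?)',
                 [('a', 'alpha'), ('b', 'beta')])
conn.commit()

# querying is where SQLite pays off compared to unpickling everything
rows = conn.execute('SELECT value FROM items WHERE key = ?', ('a',)).fetchall()
conn.close()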
I have tried this for many projects and concluded that shelve is faster than pickle in saving data. Both perform the same at loading data.
Shelve is in fact a dirty solution.
That is because you have to be very careful with it. If you do not close a shelve file after opening it, or if something interrupts your code between opening and closing it, there is a high chance the shelve file will get corrupted (resulting in frustrating KeyErrors). That's really annoying, given that those of us using it are doing so precisely to store LARGE dicts that clearly took a long time to construct.
And that is why shelve is a dirty solution. It's still faster, though.
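If you do use shelve, closing it deterministically goes a long way toward avoiding that corruption; a small sketch:
import shelve
from contextlib import closing

# closing() guarantees the shelf is closed even if an exception interrupts the block
with closing(shelve.open('mydict.db', protocol=2)) as db:
    db['key'] = 'value'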
You might try compressing your dictionary (with some restrictions; see this post). It will be effective if disk access is the bottleneck.
That is a lot of data...
What kind of contents does your dictionary have? If it's only primitive or fixed datatypes, maybe a real database or a custom file format would be a better option?
We've got a Python-based web server that unpickles a number of large data files on startup using cPickle. The data files (pickled using HIGHEST_PROTOCOL) are around 0.4 GB on disk and load into memory as about 1.2 GB of Python objects -- this takes about 20 seconds. We're using Python 2.6 on 64-bit Windows machines.
The bottleneck is certainly not disk (it takes less than 0.5s to actually read that much data), but memory allocation and object creation (there are millions of objects being created). We want to reduce the 20s to decrease startup time.
Is there any way to deserialize more than 1GB of objects into Python much faster than cPickle (like 5-10x)? Because the execution time is bound by memory allocation and object creation, I presume using another unpickling technique such as JSON wouldn't help here.
I know some interpreted languages have a way to save their entire memory image as a disk file, so they can load it back into memory all in one go, without allocation/creation for each object. Is there a way to do this, or achieve something similar, in Python?
Try the marshal module - it's internal (used by the byte-compiler) and intentionally not advertised much, but it is much faster. Note that it doesn't serialize arbitrary instances like pickle, only builtin types (don't remember the exact constraints, see docs). Also note that the format isn't stable.
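A minimal sketch, with the caveats above in mind (builtin types only, and the format is tied to the interpreter version):
import marshal

data = {'ids': list(range(1000)), 'name': 'example'}   # builtins only

with open('data.marshal', 'wb') as f:
    marshal.dump(data, f)

with open('data.marshal', 'rb') as f:
    data = marshal.load(f)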
If you need to initialize multiple processes and can tolerate one process always loaded, there is an elegant solution: load the objects in one process, and then do nothing in it except forking processes on demand. Forking is fast (copy on write) and shares the memory between all processes. [Disclaimers: untested; unlike Ruby, Python ref counting will trigger page copies so this is probably useless if you have huge objects and/or access a small fraction of them.]
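A very rough, Unix-only sketch of that idea (load_data, wait_for_request, and handle_request are placeholders; as per the disclaimer, untested):
import os

data = load_data()               # the expensive load happens once, in the parent

while wait_for_request():        # placeholder for however requests arrive
    pid = os.fork()
    if pid == 0:                 # child shares the parent's pages copy-on-write
        handle_request(data)
        os._exit(0)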
If your objects contain lots of raw data like numpy arrays, you can memory-map them for much faster startup. pytables is also good for these scenarios.
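For the numpy case, that can look like this (arr.npy is a placeholder):
import numpy as np

arr = np.arange(10 ** 7, dtype=np.float64)
np.save('arr.npy', arr)

# later: map the file instead of reading it; pages are pulled in lazily on access
arr = np.load('arr.npy', mmap_mode='r')
print(arr[:5])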
If you'll only use a small part of the objects, then an OO database (like Zope's) can probably help you. Though if you need them all in memory, you will just waste lots of overhead for little gain. (never used one, so this might be nonsense).
Maybe other python implementations can do it? Don't know, just a thought...
Are you load()ing the pickled data directly from the file? What about trying to load the file into memory first and then doing the load?
I would start by trying cStringIO; alternatively, you could write your own version of StringIO that uses buffer() to slice the memory, which would reduce the number of copy operations needed (cStringIO may still be faster, but you'll have to try).
There are sometimes huge performance bottlenecks when doing these kinds of operations, especially on the Windows platform; Windows is somehow very unoptimized for doing lots of small reads, while UNIXes cope quite well. If load() does lots of small reads, or you are calling load() several times to read the data, this would help.
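Concretely, something along these lines (Python 2 / cPickle, as in the question; whether it helps depends on how load() actually reads):
import cPickle as pickle

with open('data.pickle', 'rb') as f:
    blob = f.read()              # one big sequential read instead of many small ones

objects = pickle.loads(blob)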
I haven't used cPickle (or Python), but in cases like this I think the best strategy is to
avoid unnecessary loading of the objects until they are really needed - say, load them after startup on a different thread. Actually, it's usually better to avoid unnecessary loading/initialization at any time, for obvious reasons. Google 'lazy loading' or 'lazy initialization'. If you really need all the objects to do some task before server startup, then maybe you can try to implement a manual custom deserialization method; in other words, implement something yourself if you have intimate knowledge of the data you're dealing with, which can help you squeeze out better performance than a general-purpose tool.
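A small sketch of the lazy-loading idea (load_table is a made-up loader):
class DataStore(object):
    def __init__(self, path):
        self.path = path
        self._table = None          # nothing is loaded yet

    @property
    def table(self):
        if self._table is None:     # the first access pays the loading cost
            self._table = load_table(self.path)   # hypothetical loader
        return self._table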
Did you try sacrificing efficiency of pickling by not using HIGHEST_PROTOCOL? It isn't clear what performance costs are associated with using this protocol, but it might be worth a try.
Impossible to answer this without knowing more about what sort of data you are loading and how you are using it.
If it is some sort of business logic, maybe you should try turning it into a pre-compiled module;
If it is structured data, can you delegate it to a database and only pull what is needed?
Does the data have a regular structure? Is there any way to divide it up and decide what is required and only then load it?
I'll add another answer that might be helpful: if you can, try to define __slots__ on the class that is most commonly created. This may be a little limiting and sometimes impossible, but it cut the initialization time in my test roughly in half.
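For reference, a minimal sketch of what that looks like (Record is a made-up class):
class Record(object):
    __slots__ = ('doc_id', 'title', 'score')   # no per-instance __dict__

    def __init__(self, doc_id, title, score):
        self.doc_id = doc_id
        self.title = title
        self.score = score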
As a maintenance issue I need to routinely (3-5 times per year) copy a repository that now has over 20 million files and exceeds 1.5 terabytes in total disk space. I am currently using RICHCOPY, but have tried others. RICHCOPY seems the fastest, but I do not believe I am getting close to the limits of the capabilities of my XP machine.
I am toying around with using what I have read in The Art of Assembly Language to write a program to copy my files. My other thought is to start learning how to multi-thread in Python to do the copies.
I am toying around with the idea of doing this in Assembly because it seems interesting, but while my time is not incredibly precious, it is precious enough that I am trying to get a sense of whether or not I will see significant enough gains in copy speed. I am assuming that I would, but I only started really learning to program 18 months ago and it is still more or less a hobby. Thus I may be missing some fundamental concept of what happens with interpreted languages.
Any observations or experiences would be appreciated. Note, I am not looking for any code. I have already written a basic copy program in Python 2.6 that is no slower than RICHCOPY. I am looking for some observations on which will give me more speed. Right now it takes me over 50 hours to make a copy from a disk to a Drobo and then back from the Drobo to a disk. I have a LogicCube for when I am simply duplicating a disk but sometimes I need to go from a disk to Drobo or the reverse. I am thinking that given that I can sector copy a 3/4 full 2 terabyte drive using the LogicCube in under seven hours I should be able to get close to that using Assembly, but I don't know enough to know if this is valid. (Yes, sometimes ignorance is bliss)
The reason I need to speed it up is I have had two or three cycles where something has happened during copy (fifty hours is a long time to expect the world to hold still) that has caused me to have to trash the copy and start over. For example, last week the water main broke under our building and shorted out the power.
Thanks for the early responses, but I don't think it is I/O limitations. I am not going over a network; the drive is plugged into my motherboard with a SATA connection and my Drobo is plugged into a FireWire port. My thinking is that both connections should allow faster transfer.
Actually I can't use a sector copy except going from a single disk to the Drobo. It won't work the other way since the Drobo file structure is a mystery. My unscientific observation is that the copy from one internal disk to another is no faster than a copy to or from the Drobo to an internal disk.
I am bound by the hardware, I can't afford 10K rpm 2 terabyte drives (if they even make them).
A number of you are suggesting a file-syncing solution. But that does not solve my problem. First off, the file-syncing solutions I have played with build a map (for want of a better term) of the data first; I have too many little files, so they choke. One of the reasons I use RICHCOPY is that it starts copying immediately and does not use memory to build a map. Second, one of my three Drobo backups failed a couple of weeks ago. My rule is that if I have a backup failure, the other two have to stay offline until the new one is built. So I need to copy from one of the three backup single-drive copies I use with the LogicCube.
At the end of the day I have to have a good copy on a single drive because that is what I deliver to my clients. Because my clients have diverse systems I deliver to them on SATA drives.
I rent some cloud space from someone where my data is also stored as the deepest backup, but it is expensive to pull it off of there.
Copying files is an I/O bound process. It is unlikely that you will see any speed up from rewriting it in assembly, and even multithreading may just cause things to go slower as different threads requesting different files at the same time will result in more disk seeks.
Using a standard tool is probably the best way to go here. If there is anything to optimize, you might want to consider changing your file system or your hardware.
As the other answers mention (+1 to mark), when copying files, disk i/o is the bottleneck. The language you use won't make much of a difference. How you've laid out your files will make a difference, how you're transferring data will make a difference.
You mentioned copying to a DROBO. How is your DROBO connected? Check out this graph of connection speeds.
Let's look at the max copy rates you can get over certain wire types:
USB = 97 days (1.5 TB / 1.5 Mbps). Lame, at least your performance is not this bad.
USB2.0 = ~7hrs (1.5 TB / 480 Mbps). Maybe LogicCube?
Fast SCSI = ~40hrs (1.5 TB / 80 Mbps). Maybe your hard drive speed?
100 Mbps ethernet = 1.4 days (1.5 TB / 100 Mbps).
So, depending on the constraints of your problem, it's possible you can't do better. But you may want to start doing a raw disk copy (like Unix's dd), which should be much faster than a file-system level copy (it's faster because there are no random disk seeks for directory walks or fragmented files).
To use dd, you could live boot linux onto your machine (or maybe use cygwin?). See this page for reference or this one about backing up from windows using a live-boot of Ubuntu.
If you were to organize your 1.5 TB data on a RAID, you could probably speed up the copy (because the disks will be reading in parallel), and (depending on the configuration) it'll have the added benefit of protecting you from drive failures.
There are 2 places for slowdown:
Per-file copy is MUCH slower than a disk copy (where you literally clone 100% of each sector's data), especially for 20 million files. You can't fix that with the most tuned assembly, unless you switch from cloning files to cloning raw disk data. In the latter case, yes, assembly is indeed your ticket (or C).
Simply storing 20 million files and recursively finding them may be less efficient in Python. But that's more a matter of finding a better algorithm and is not likely to be significantly improved by assembly. Plus, that will NOT be the main contributor to the 50 hours.
In summary - Assembly WILL help if you do raw disk sector copy, but will NOT help if you do filesystem level copy.
I don't think writing it in assembly will help you. Writing a routine in assembly could help you if you are processor-bound and think you can do something smarter than your compiler. But in a network copy, you will be IO bound, so shaving a cycle here or there almost certainly will not make a difference.
I think the general rule here is that it's always best to profile your process to see where you are spending the time before thinking about optimizations.
I don't believe it will make a discernable difference which language you use for this purpose. The bottleneck here is not your application but the disk performance.
Just because a language is interpreted, it doesn't mean that every single operation in it is slow. As an example, it's a fairly safe bet that the lower-level code in Python will call assembly (or compiled) code to do copying.
Similarly, when you do stuff with collections and other libraries in Java, that's mostly compiled C, not interpreted Java.
There are a couple of things you can do to possibly speed up the process.
Buy faster hard disks (10K RPMs rather than 7.5K or less latency, larger caches and so forth).
Copying between two physical disks may be faster than copying on a single disk (due to the head movement).
If you're copying across the network, stage it. In other words, copy it fast to another local disk, then slow from there across the net.
You can also stage it in a different way. If you run a nightly (or even weekly) process to keep the copy up to date (only copying changed files) rather than three times a year, you won't find yourself in a situation where you have to copy a massive amount.
Also if you're using the network, run it on the box where the repository is. You don't want to copy all the data from a remote disk to another PC then back to yet another remote disk.
You may also want to be careful with Python. I may be mistaken (and no doubt the Pythonistas will set me straight if I'm mistaken on this count), but I have a vague recollection that its threading may not fully utilise multi-core CPUs. In that case, you'd be better off with another solution.
You may well be better off sticking with your current solution. I suspect a specialised copy program will already be optimised as much as possible since that's what they do.
There's no reason at all to write a copy program in assembly. The problem is with the amount of IO involved, not CPU. Also, the copy function in Python is already written in C by experts, and you won't eke out any more speed by writing one yourself in assembler.
Lastly, threading won't help either, especially in Python. Go with either Twisted or just use the new multiprocessing module in Python 2.6 and kick off a pool of processes to do the copies. Save yourself a lot of torment while getting the job done.
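A rough sketch of the process-pool version (file_pairs is a placeholder list of (src, dst) tuples; whether it actually beats a single copier depends entirely on the disks):
import shutil
from multiprocessing import Pool

def copy_one(pair):
    src, dst = pair
    shutil.copy2(src, dst)          # copy2 also preserves timestamps

if __name__ == '__main__':
    pool = Pool(processes=4)
    pool.map(copy_one, file_pairs)  # file_pairs: hypothetical list of (src, dst) tuples
    pool.close()
    pool.join()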
Before you question the copying app, you should most likely question the data path. What are the theoretical limits and what are you achieving? What are the potential bottlenecks? If there is a single data path, you are probably not going to get a significant boost by parallelizing storage tasks. You may even exacerbate it. Most of the benefits you'll get with asynchronous I/O come at the block level - a level lower than the file system.
One thing you could do to boost I/O is decouple the fetch from source and store to destination portions. Assuming that the source and destination are separate entities, you could theoretically halve the amount of time for the process. But are the standard tools already doing this??
Oh - and on Python and the GIL - with I/O-bound execution, the GIL is really not quite that bad of a penalty.
RICHCOPY is already copying files in parallel, and I expect the only way to beat it is to get in bed with the filesystem so that you minimize disk I/O, especially seeking. I suggest you try ntfsclone to see if it meets your needs. If not, my next suggestion would be to parallelize ntfsclone.
In any case, working directly with filesystem layout on disk is going to be easiest in C, not Python and certainly not assembly. Especially since you can get started by using the C code from the NTFS 3G project. This code is designed for reliability and ease of porting, not performance, but it's still probably the easiest way to get started.
My time is precious enough that I am trying to get a sense of whether or not I will see significant enough gains in copy speed.
No. Or more accurately, at your current level of mastery of systems programming, achieving significant improvements in speed will be prohibitively expensive. What you're asking for requires very specialized expertise. Although I myself have prior experience in implementing filesystems (much simpler ones than NTFS, XFS, or ext2), I would not tackle this job; I would hire it done.
Footnote: if you have access to a Linux box, find out what raw write bandwidth you can get to the target drive:
time dd if=/dev/zero of=/dev/sdc bs=1024k count=100
will give you the time to write 100MB sequentially in the fastest possible way. That will give you an absolute limit on what is possible with your hardware. Don't try this without understanding the man page for dd! dd stands for "destroy data". (Actually it stands for "copy and convert", but cc was taken.)
A Windows programmer can probably point you to an equivalent test for Windows.
Right, here the bottleneck is not in the execution of the copying software itself but rather the disk access.
Going lower level does not mean that you will get better performance. Take the simple example of the open() and fopen() APIs, where open() is much lower level and more direct, and fopen() is a library wrapper around the system open() function.
But in reality fopen() often has better performance, because it adds buffering and optimizes a lot of things that are not done in the raw open() function.
Implementing optimizations at the assembly level is much harder and less efficient than doing it in Python.
1.5 TB in approximately 50 hours gives a throughput of (1.5 * 1024^2) MB / (50 * 60^2) s = 8.7 MB/s. A theoretical 100 Mbit/s bandwidth should give you 12.5 MB/s. It seems to me that your FireWire connection is a problem. You should look at upgrading drivers, or upgrading to a better FireWire/eSATA/USB interface.
That said, rather than the Python/assembly question, you should look at acquiring a file-syncing solution. It shouldn't be necessary to copy that data over and over again.
As already said, it is not the language that makes the difference here; assembly can be cool or fast for computations, but when the processor has to "speak" to peripherals, the limit is set by those. In this case the speed is limited by your hard disk speed, which is a limit you can hardly change without changing your drive (or waiting for better drives in the future), but also by the way data are organized on the disk, i.e. by the filesystem. AFAIK, the most widely used filesystems are not optimized to handle tons of "small" files quickly; rather, they are optimized to hold "few" huge files.
So, changing the filesystem you're using could increase your copy speed, insofar as it is better suited to your case (and of course the hard-disk limits still apply!). If you want to "taste" the real limit of your disk, you should try a "sector by sector" copy, replicating the exact image of your source disk to the destination disk. (But this option has some points to be aware of.)
Since I posted the question I have been playing around with some things, and I think, first off (not to be argumentative), those of you who have been posting that I am I/O bound are only partially correct. It is seek time that is the constraint. Long story short: to test various options I built a new machine with an i7 processor and a reasonably powerful/functional motherboard, and using the same two drives I was working with before, I noted a fairly significant increase in speed. I also noted that when I am moving big files (a gigabyte or so) I get sustained transfer speeds in excess of 50 MB/s, and the speed drops significantly when moving small files. I think the speed difference is due to an unordered disk relative to the way the copy program reads the directory structure to determine the files to copy.
What I think needs to be done is to
1. Read the MFT and sort by sector, working from the outside to the inside of the platter (which means I have to figure out how multi-platter disks work).
2. Analyze and separate all contiguous versus non-contiguous files. I would handle the contiguous files first and go back to handle the non-contiguous files.
3. Start copying the contiguous files from the outside to the inside.
4. When finished, copy the non-contiguous files; by default they will end up on the inner rings of the platter(s) and they will be contiguous. (I want to note that I regularly defragment and have less than 1% of my files/directories fragmented, but 1% of 20 million is still 200K files.)
Why is this better than just running a copy program?
When running a copy program, the program is going to use some internal ordering mechanism to determine the copy order. Windows uses alphabetic (more or less); I imagine others do something similar, but that order may not (in my case probably does not) conform to the way the files were initially laid out on the disk, which is what I believe is the biggest factor affecting copy speed.
The problem with a sector-copy is it does not fix anything and so when I migrate across disk sizes and add data I end up with new problems to handle.
If I do this right I should be able to check file headers and the EOF record and do some housekeeping. CHKDSK is a great program but kind of dumb. When I do get file/folder corruption it is really hard to identify what was lost. By building my own copy program I could include a maintenance cycle that I could invoke when I want to run some tests on the files during copying. This might slow it down some, but I don't think by very much, because the CPU can move the files much faster than they can be read or written. And even if it slows things down when being run, at least I get some control (maybe "understanding" is a better word) over the problems that will invariably crop up in an imperfect world.
I may not have to do this in assembly; I have been looking around for ways to play with (read) the MFT, and there are even Python tools for this, see http://www.integriography.com
Neither. If you want to take advantage of OS features to speed up I/O, you'll need to use some specialized system calls that are most easily accessed in C (or C++). You don't need to know a lot of C to write such a program, but you really need to know the system call interfaces.
In all likelihood, you can solve the problem without writing any code by using an existing tool or tuning the operating system, but if you really do need to write a tool, C is the most straightforward way to do it.