Conclusion:
It seems that HDF5 is the way to go for my purposes. Basically, "HDF5 is a data model, library, and file format for storing and managing data," and it is designed to handle incredibly large amounts of data. It has a Python binding called PyTables (the link is in the answer below).
HDF5 does the job 1000% better at saving tons and tons of data. Reading/modifying the data from 200 million rows is a pain though, so that's the next problem to tackle.
I am building a directory tree which has tons of subdirectories and files. There are about 10 million files spread across a hundred thousand directories, and each file sits 32 subdirectories deep.
I have a Python script that builds this filesystem and reads & writes those files. The problem is that when I reach more than a million files, the read and write methods become extremely slow.
Here's the function I have that reads the contents of a file (the file contains an integer string), adds a certain number to it, then writes it back to the original file.
import shutil

def addInFile(path, scoreToAdd):
    num = scoreToAdd
    try:
        shutil.copyfile(path, '/tmp/tmp.txt')
        fp = open('/tmp/tmp.txt', 'r')
        num += int(fp.readlines()[0])
        fp.close()
    except:
        pass
    fp = open('/tmp/tmp.txt', 'w')
    fp.write(str(num))
    fp.close()
    shutil.copyfile('/tmp/tmp.txt', path)
Relational databases seem too slow for accessing this data, so I opted for a filesystem approach.
I previously tried performing Linux console commands for this, but it was way slower.
I copy the file to a temporary file first, then access/modify it, then copy it back, because I found this was faster than accessing the file directly.
Putting all the files into one directory (in reiserfs format) caused too much slowdown when accessing the files.
I think the cause of the slowdown is that there are tons of files. Performing this function 1000 times used to clock in at less than a second, but now it's reaching 1 minute.
How do you suggest I fix this? Should I change my directory tree structure?
All I need is to quickly access each file in this very huge pool of files.
I know this isn't a direct answer to your question, but it is a direct solution to your problem.
You need to research using something like HDF5. It is designed for exactly this kind of hierarchical data with millions of individual data points.
You are REALLY in luck because there are awesome Python bindings for HDF5 called pytables.
I have used it in a very similar way and had tremendous success.
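For a sense of what this can look like, here is a minimal sketch using PyTables (the scores.h5 file name and the one-score-per-row layout are assumptions for illustration, not part of the question):

import tables

# create one extendable HDF5 array holding every score
with tables.open_file('scores.h5', mode='w') as h5:
    scores = h5.create_earray(h5.root, 'scores', tables.Int64Atom(), shape=(0,))
    scores.append([0] * 1000)              # seed some rows

# later: read-modify-write a single entry in place
with tables.open_file('scores.h5', mode='r+') as h5:
    scores = h5.root.scores
    scores[42] = scores[42] + 5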
Two suggestions:
First, a structure that involves 32-deep nesting of subdirectories is inherently flawed. Assuming that you really have "about 10 million files", one level of subdirectories should absolutely be enough (assuming you use a modern filesystem).
Second: You say you have "about 10 million files" and that each file "contains an integer string". Assuming that those are 32-bit integers and you store them directly instead of as strings, that amounts to a total dataset size of 40MiB (10M files * 4 bytes per file). Assuming that each filename is 32 bytes long, add another 320MiB for "keys" to this data.
So you'll easily be able to fit the whole dataset into memory. I suggest doing just that, and operating on the data held in main memory. And unless there is some reason you need an elaborate directory structure, I further suggest storing the data in a single file.
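As a rough sketch of that idea (the file name and the key/value text format here are made up), you could hold everything in one dict and persist it to a single flat file:

def load_scores(store='scores.txt'):
    scores = {}
    try:
        with open(store) as f:
            for line in f:
                key, value = line.rsplit('\t', 1)
                scores[key] = int(value)
    except IOError:
        pass
    return scores

def save_scores(scores, store='scores.txt'):
    with open(store, 'w') as f:
        for key, value in scores.items():
            f.write('%s\t%d\n' % (key, value))

scores = load_scores()
scores['00/01/02/1F'] = scores.get('00/01/02/1F', 0) + 5   # illustrative key
save_scores(scores)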
I would suggest you rethink your approach; using lots of extremely small files is bound to give you serious performance problems. Depending on the purpose of your program, some kind of database could be far more efficient.
If you're doing lots of I/O, you can also just throw more hardware at the problem and use SSDs, or keep all the data in RAM (explicitly or by caching). With hard drives alone you have no chance of achieving good performance in this scenario.
I've never used it, but Redis, for example, is a persistent key-value store that is supposed to be very fast. If your data fits this model, I would definitely try it or something similar. You'll find some performance data in this article, which should give you an idea of what speeds you can achieve.
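As a rough sketch of what that key-value approach could look like with the redis-py client (the key naming is made up, and a Redis server is assumed to be running locally):

import redis

r = redis.Redis(host='localhost', port=6379, db=0)

def add_score(key, score_to_add):
    # INCRBY creates the key at 0 if it doesn't exist, then adds atomically
    return r.incrby(key, score_to_add)

add_score('scores:00/01/02/1F', 5)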
The disk is limited by the number of bytes it can read/write per second and also by the number of operations it can perform per second.
While your small files are cached, operations are significantly faster than with uncached files.
It looks like you are hitting both issues:
doing too many I/O operations
running out of cache
I'd suggest revisiting the structure you are using and switching to fewer, larger files. Keep in mind (as a rule of thumb) that an I/O operation smaller than 128K costs roughly the same as an I/O of 1 byte!
Resolving all of those subdirectories takes time. You're over-taxing the file-system.
Maybe instead of using the directory tree, you could encode the path information into the file name. So instead of creating a file with a path like this:
/parent/00/01/02/03/04/05/06/07
/08/09/0A/0B/0C/0D/0E/0F
/10/11/12/13/14/15/16/17
/18/19/1A/1B/1C/1D/1E/1F.txt
...you could create a file with a path like this:
/parent/00_01_02_03_04_05_06_07_
08_09_0A_0B_0C_0D_0E_0F_
10_11_12_13_14_15_16_17_
18_19_1A_1B_1C_1D_1E_1F.txt
...of course, you'll still have a problem, because now all of your ten million files will be in a single directory, and in my experience (NTFS), a directory with more than a few thousand files in it still over-taxes the file-system.
You could come up with a hybrid approach:
/parent/00_01_02_03/04_05_06_07
/08_09_0A_0B/0C_0D_0E_0F
/10_11_12_13/14_15_16_17
/18_19_1A_1B/1C_1D_1E_1F.txt
But that will still give you problems if you exhaustively create all those directories. Even though most of those directories are "empty" (in that they don't contain any files), the operating system still has to create an INODE record for each directory, and that takes space on disk.
Instead, you should only create a directory when you have a file to put into it. Also, if you delete all the files in any given directory, then delete the empty directory.
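A rough sketch of both ideas together (the grouping of four components per level follows the example above; the helper names are made up):

import os

def hybrid_path(parent, components, per_level=4):
    levels = ['_'.join(components[i:i + per_level])
              for i in range(0, len(components), per_level)]
    return os.path.join(parent, *levels) + '.txt'

def write_score(parent, components, value):
    path = hybrid_path(parent, components)
    directory = os.path.dirname(path)
    if not os.path.isdir(directory):       # create directories lazily
        os.makedirs(directory)
    with open(path, 'w') as f:
        f.write(str(value))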
How many levels deep should you create the directory hierarchy? In my little example, I transformed your 32-level hierarchy into an 8-level hierarchy, but after doing some testing, you might decide on a slightly different mapping. It really depends on your data, and how evenly those paths are distributed through the combinatorial solution space. You need to optimize a solution with two constraints:
1) Minimize the number of directories you create, knowing that each directory becomes an INODE in the underlying file-system, and creating too many of them will overwhelm the file system.
2) Minimize the number of files in each directory, knowing that having too many files per directory (in my experience, more than 1000) overwhelms the file-system.
There's one other consideration to keep in mind: Storage space on disks is addressed and allocated using "blocks". If you create a file smaller than the minimum block size, it nevertheless consumes the whole block, wasting disk space. In NTFS, those blocks are defined by their "cluster size" (which is partially determined by the overall size of the volume), and usually defaults to 4kB:
http://support.microsoft.com/kb/140365
So if you create a file with only one byte of data, it will still consume 4kB worth of disk space, wasting 4095 bytes.
In your example, you said you had about 10 million files with about 1 GB of data. If that's true, then each of your files is only about 100 bytes long. With a cluster size of 4096, that's roughly a 98% space-wasted ratio.
If at all possible, try to consolidate some of those files. I don't know what kind of data they contain, but if it's a text format, you might try doing something like this:
[id:01_23_45_67_89_AB_CD_EF]
lorem ipsum dolor sit amet consectetur adipiscing elit
[id:fe_dc_ba_98_76_54_32_10]
ut non lorem quis quam malesuada lacinia
[id:02_46_81_35_79_AC_DF_BE]
nulla semper nunc id ligula eleifend pulvinar
...and so on and so forth. It might look like you're wasting space with all those verbose headers, but as far as the disk is concerned, this is a much more space-efficient strategy than having separate files for all those little snippets. This little example used exactly 230 bytes (including newlines) for three records, so you might try to put about sixteen records into each file (remembering that it's much better to have slightly less than 4096 bytes-per-file than to have slightly more than 4096, wasting a whole extra disk block).
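As a rough sketch of that consolidation idea (the header format follows the example above; the helper names are made up):

def append_record(bundle_path, record_id, text):
    with open(bundle_path, 'a') as f:
        f.write('[id:%s]\n%s\n' % (record_id, text))

def read_record(bundle_path, record_id):
    wanted, lines = '[id:%s]' % record_id, []
    with open(bundle_path) as f:
        collecting = False
        for line in f:
            line = line.rstrip('\n')
            if line.startswith('[id:'):
                collecting = (line == wanted)
            elif collecting:
                lines.append(line)
    return '\n'.join(lines)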
Anyhow, good luck!
You're copying a file, opening it to read, closing it, then reopening it for writing, then copying it back. It would be faster to do it in one go.
EDIT: the previous version had a bug when the new number has fewer digits than the old one (e.g. if you're subtracting or adding a negative number); this version fixes it, and the timing is essentially unaffected.
def addInFile(path, scoreToAdd):
    fp = None
    try:
        fp = open(path, 'r+')
    except IOError as e:
        print e
    else:
        num = str(scoreToAdd + int(fp.read()))
        fp.seek(0)
        fp.write(num)
        # truncate in case the new value has fewer digits than the old one
        fp.truncate(len(num))
    finally:
        if fp:
            fp.close()
Alternatively, if you want to avoid file loss and writes to cache, you should do the copying and the summing in one go, then do an overwrite dance in another step:
import os
import shutil

def addInFile(path, scoreToAdd):
    orig = tmp = None
    try:
        orig = open(path, 'r')
        tmp = open('/home/lieryan/junks/tmp.txt', 'w')
    except IOError as e:
        print e
    else:
        num = int(orig.read())
        tmp.write(str(scoreToAdd + num))
    finally:
        if orig:
            orig.close()
        if tmp:
            tmp.close()
    try:
        # make sure the temporary file and path are on the same partition,
        # otherwise the fast shutil.move becomes a slow shutil.copy
        shutil.move(path, '/home/lieryan/junks/backup.txt')
        shutil.move('/home/lieryan/junks/tmp.txt', path)
        os.remove('/home/lieryan/junks/backup.txt')
    except (IOError, shutil.Error) as e:
        print e
Also, don't use bare excepts.
Alternatively, how about grouping all 256 files in the lowest leaf into one bigger file? Then you can read multiple numbers in one go, from one cached read. And if you use a fixed-width file, you can quickly use seek() to get to any entry in the file in O(1).
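A rough sketch of the fixed-width idea (the 10-character field width is an arbitrary choice, and the leaf file is assumed to be pre-filled with padded entries):

WIDTH = 10

def add_in_leaf(leaf_path, index, score_to_add):
    with open(leaf_path, 'r+') as f:
        f.seek(index * WIDTH)              # O(1) jump to the n-th entry
        current = int(f.read(WIDTH))
        f.seek(index * WIDTH)
        f.write(str(score_to_add + current).rjust(WIDTH))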
Some timings, writing 1000 times on the same file:
Your original approach: 1.87690401077
My first approach (open with rw+): 0.0926730632782
My second approach, copy to the same partition: 0.464048147202
(all functions untested on their error handling path)
If you're on Linux and have a lot of memory (64GB+), try tmpfs. It truly works like a mounted disk, and you don't need to change your code or buy another SSD.
Related
I have an interesting problem at hand. As someone who's a beginner when it comes to working with data at even a moderate scale, I'd love some tips from the veterans here.
I have around 6000 JSON.gz files totalling around 5GB compressed and 20GB uncompressed.
I'm opening each file and reading it line by line using the gzip module; then I use json.loads() to load each line and parse the complicated JSON structure. Then I insert the lines from each file into a PyTables table all at once before iterating to the next file.
All this is taking me around 3 hours. Bulk inserting into the PyTables table didn't really help the speed at all. Much of the time is spent getting values from the parsed JSON line, since they have a truly horrible structure. Some are straightforward, like 'attrname':attrvalue, but some are complicated and time-consuming structures, like:
'attrarray':[{'name':abc, 'value':12},{'value':12},{'name':xyz, 'value':12}...]
...where I need to pick up the values of all the objects in the attr array which have a certain corresponding name, and ignore those that don't. So I need to iterate through the list and inspect each JSON object inside. (I'd be glad if you could point out a quicker, cleverer way, if one exists.)
So I suppose the actual parsing part doesn't have much scope for speedup. Where I think there might be scope for speedup is the part that actually reads the files.
So I ran a few tests (I don't have the numbers with me right now), and even after removing the parsing part of my program, simply going through the files line by line was itself taking a considerable amount of time.
So I ask: Is there any part of this problem that you think I might be doing suboptimally?
for filename in filenamelist:
    f = gzip.open(filename)
    toInsert = []
    for line in f:
        parsedline = json.loads(line)
        attr1 = parsedline['attr1']
        attr2 = parsedline['attr2']
        .
        .
        .
        attr10 = parsedline['attr10']
        arr = parsedline['attrarray']
        for el in arr:
            try:
                if el['name'] == 'abc':
                    attrABC = el['value']
                elif el['name'] == 'xyz':
                    attrXYZ = el['value']
                .
                .
                .
            except KeyError:
                pass
        toInsert.append([attr1,attr2,...,attr10,attrABC,attrXYZ...])
    table.append(toInsert)
One clear piece of "low-hanging fruit"
If you're going to be accessing the same compressed files over and over (it's not especially clear from your description whether this is a one-time operation), then you should decompress them once rather than decompressing them on-the-fly each time you read them.
Decompression is a CPU-intensive operation, and Python's gzip module is not that fast compared to C utilities like zcat/gunzip.
Likely the fastest approach is to gunzip all these files, save the results somewhere, and then read from the uncompressed files in your script.
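A minimal sketch of that one-time decompression step (it assumes the standard zcat utility is on the PATH, and reuses the filenamelist from the question):

import subprocess

for filename in filenamelist:
    out_name = filename[:-3] if filename.endswith('.gz') else filename + '.txt'
    with open(out_name, 'wb') as out:
        subprocess.check_call(['zcat', filename], stdout=out)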
Other issues
The rest of this is not really an answer, but it's too long for a comment. In order to make this faster, you need to think about a few other questions:
What are you trying to do with all this data?
Do you really need to load all of it at once?
If you can segment the data into smaller pieces, then you can reduce the latency of the program if not the overall time required. For example, you might know that you only need a few specific lines from specific files for whatever analysis you're trying to do... great! Only load those specific lines.
If you do need to access the data in arbitrary and unpredictable ways, then you should load it into another system (RDBMS?) which stores it in a format that is more amenable to the kinds of analyses you're doing with it.
If the last bullet point is true, one option is to load each JSON "document" into a PostgreSQL 9.3 database (the JSON support is awesome and fast) and then do your further analyses from there. Hopefully you can extract meaningful keys from the JSON documents as you load them.
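If you go that route, a rough sketch of the loading step with psycopg2 might look like this (the connection details, table name, and column are made up; it assumes a PostgreSQL version with the json type):

import gzip
import psycopg2

conn = psycopg2.connect(dbname='mydb', user='me')
cur = conn.cursor()
cur.execute('CREATE TABLE IF NOT EXISTS docs (id serial PRIMARY KEY, doc json)')
with gzip.open('file0001.json.gz', 'rt') as f:
    for line in f:
        cur.execute('INSERT INTO docs (doc) VALUES (%s::json)', (line,))
conn.commit()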
I need a smart copy function for reliable and fast file copying & linking. The files are very large (from a few gigabytes to over 200GB) and distributed over a lot of folders, with people renaming files and maybe folders during the day, so I want to use hashes to see whether I've already copied a file, maybe under a different name, and only create a link in that case.
I'm completely new to hashing and I'm using this function to hash:
import hashlib
def calculate_sha256(cls, file_path, chunk_size=2 ** 10):
    '''
    Calculate the Sha256 for a given file.
    #param file_path: The file_path including the file name.
    #param chunk_size: The chunk size to allow reading of large files.
    #return Sha256 sum for the given file.
    '''
    sha256 = hashlib.sha256()
    with open(file_path, mode="rb") as f:
        for i in xrange(0, 16):
            chunk = f.read(chunk_size)
            if not chunk:
                break
            sha256.update(chunk)
    return sha256.hexdigest()
This takes one minute for a 3GB file, so in the end, the process might be very slow for a 16TB HD.
Now my idea is to use some additional knowledge about the files' internal structure to speed things up: I know they contain a small header, then a lot of measurement data, and I know they contain real-time timestamps, so I'm quite sure that the chance that, let's say, the first 16MB of two files are identical, is very low (for that to happen, two files would need to be created at exactly the same time under exactly the same environmental conditions). So my conclusion is that it should be enough to hash only the first X MB of each file.
It works on my example data, but as I'm inexperienced I just wanted to ask whether there is something I'm not aware of (a hidden danger, or a better way to do it).
Thank you very much!
You can get the MD5 hash of large files by breaking them into small byte chunks.
Also, calculating MD5 hashes is significantly faster than SHA-256, and should be favored for performance reasons in any application that doesn't rely on the hash for security purposes.
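A minimal sketch of chunked MD5 hashing (the 1MiB chunk size is an arbitrary choice), mirroring the SHA-256 function in the question:

import hashlib

def calculate_md5(file_path, chunk_size=1024 * 1024):
    md5 = hashlib.md5()
    with open(file_path, 'rb') as f:
        for chunk in iter(lambda: f.read(chunk_size), b''):
            md5.update(chunk)
    return md5.hexdigest()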
baseline - I have CSV data with 10,000 entries. I save this as 1 CSV file and load it all at once.
alternative - I have CSV data with 10,000 entries. I save this as 10,000 CSV files and load each one individually.
Approximately how much more computationally inefficient is this? I'm not hugely interested in memory concerns. The purpose of the alternative method is that I frequently need to access subsets of the data and don't want to have to read the entire array.
I'm using Python.
Edit: I can use other file formats if needed.
Edit1: SQLite wins. Amazingly easy and efficient compared to what I was doing before.
SQLite is an ideal solution for your application.
Simply import your CSV file into an SQLite database table (it is going to be a single file), then add indexes as necessary.
To access your data, use Python's sqlite3 library. You can use this tutorial on how to use it.
Compared to many other solutions, SQLite will be the fastest way to select partial data sets locally - certainly much, much faster than accessing 10,000 files. Also read this answer, which explains why SQLite is so good.
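A minimal sketch of that workflow with the sqlite3 module (the file names, table name, and column layout are made up):

import csv
import sqlite3

conn = sqlite3.connect('data.db')          # the whole database is one file
conn.execute('CREATE TABLE IF NOT EXISTS entries (id INTEGER PRIMARY KEY, value TEXT)')

with open('data.csv', newline='') as f:
    rows = [(int(r[0]), r[1]) for r in csv.reader(f)]
conn.executemany('INSERT INTO entries (id, value) VALUES (?, ?)', rows)
conn.commit()

# later: pull just the subset you need instead of reading everything
subset = conn.execute('SELECT value FROM entries WHERE id BETWEEN ? AND ?',
                      (100, 200)).fetchall()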
I would write all the lines to one file. For 10,000 lines it's probably not worthwhile, but you can pad all the lines to the same length - say 1000 bytes.
Then it's easy to seek to the nth line; just multiply n by the line length.
10,000 files are going to be slower to load and access than one file, if only because the files' data will likely be fragmented around your disk drive, so accessing it will require a much larger number of seeks than accessing the contents of a single file, which will generally be stored as sequentially as possible. Seek times are a big slowdown on spinning media, since your program has to wait while the drive heads are physically repositioned, which can take milliseconds. (Slow seek times aren't an issue for SSDs, but even then there will still be the overhead of 10,000 files' worth of metadata for the operating system to deal with.)
Also, with a single file the OS can speed things up for you by doing read-ahead buffering (as it can reasonably assume that if you read one part of the file, you will likely want to read the next part soon). With multiple files, the OS can't do that.
My suggestion (if you don't want to go the SQLite route) would be to use a single CSV file, and (if possible) pad all of the lines of your CSV file out with spaces so that they all have the same length. For example, say you make sure when writing out the CSV file to make all lines in the file exactly 80 bytes long. Then reading the (n)th line of the file becomes relatively fast and easy:
myFileObject.seek(n*80)
theLine = myFileObject.read(80)
I am working on a project that monitors a set of remote directories (FTP, networked paths, and another). If a file is considered new and meets certain criteria, we download it and process it. However, I am stuck on the best way to keep track of the files we have already downloaded. I don't want to download any duplicate files, so I need to keep track of what has already been downloaded.
Originally I was storing it as a tree:
server->directory->file_name
When the service shuts down, it writes the tree to a file and reads it back in when it starts up. However, once there are around 20,000 or so files in the tree, things start to slow down a lot.
Is there a better way to do this?
EDIT
The lookup times start to slow down a lot. My basic implementation is a dict of dicts. Storing the data on disk is fine; it's more or less just the lookup time. I know I can optimize the tree and partition it, but that seems excessive for such a small project; I was hoping Python would have something like that built in.
I would create a set of tuples, then pickle it to a file. The tuples would be (server, directory, file_name), or even just (server, full_file_name_including_directory). There's no need for a multiple-level data structure. The tuples will hash into the set and give you O(1) lookups.
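A minimal sketch of that approach (the file name and the download hook are made up):

import pickle

def load_seen(path='seen.pkl'):
    try:
        with open(path, 'rb') as f:
            return pickle.load(f)
    except IOError:
        return set()

def save_seen(seen, path='seen.pkl'):
    with open(path, 'wb') as f:
        pickle.dump(seen, f, pickle.HIGHEST_PROTOCOL)

seen = load_seen()
key = ('ftp.example.com', '/incoming/report.csv')
if key not in seen:                        # O(1) membership test
    # download_and_process(key)            # hypothetical hook for the real work
    seen.add(key)
save_seen(seen)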
You mention that things "start to slow down a lot," but you don't say whether it's the reading and writing time or the lookup times that are slowing down. If your lookup times are slowing down, you may be paging. Is your data structure approaching a significant fraction of your physical memory?
One way to get back some memory is to intern() the server names. This way, each server name will be stored only once in memory.
An interesting alternative is to use a Bloom filter. This will let you use far less memory, but will occasionally download a file that you didn't have to. This might be a reasonable trade-off, depending on why you didn't want to download the file twice.
What is the most efficient way to delete an arbitrary chunk of a file, given the start and end offsets? I'd prefer to use Python, but I can fall back to C if I have to.
Say the file is this
..............xxxxxxxx----------------
I want to remove a chunk of it:
..............[xxxxxxxx]----------------
After the operation it should become:
..............----------------
Reading the whole thing into memory and manipulating it in memory is not a feasible option.
The best performance will almost invariably be obtained by writing a new version of the file and then having it atomically replace the old version, because filesystems are strongly optimized for such sequential access, and so is the underlying hardware (with the possible exception of some of the newest SSDs, but even then it's an iffy proposition). In addition, this avoids destroying data in the case of a system crash at any time -- you're left with either the old version of the file intact, or the new one in its place. Since every system could always crash at any time (and by Murphy's Law, it will choose the most unfortunate moment;-), integrity of data is generally considered very important (often data is more valuable than the system on which it's kept -- hence, "mirroring" RAID solutions to ensure against disk crashes losing precious data;-).
If you accept this sane approach, the general idea is: open the old file for reading, the new one for writing (creation); copy N1 bytes over from the old file to the new one; then skip N2 bytes of the old file; then copy the rest over; close both files; atomically rename new to old. (Windows apparently has no "atomic rename" system call usable from Python -- to keep integrity in that case, instead of the atomic rename you'd do three steps: rename the old file to a backup name, rename the new file to the old name, delete the backup-named file -- in case of a system crash during the second of these three very fast operations, one rename is all it will take to restore data integrity.)
N1 and N2, of course, are the two parameters saying where the deleted piece starts, and how long it is. For the part about opening the files, with open('old.dat', 'rb') as oldf: and with open('NEWold.dat', 'wb') as newf: statements, nested into each other, are clearly best (the rest of the code until the rename step must be nested in both of them of course).
For the "copy the rest over" step, shutil.copyfileobj is best (be sure to specify a buffer length that's comfortably going to fit in your available RAM, but a large one will tend to give better performance). The "skip" step is clearly just a seek on the oldf open-for-reading file object. For copying exactly N1 bytes from oldf to newf, there is no direct support in Python's standard library, so you have to write your own, e.g:
def copyN1(oldf, newf, N1, buflen=1024*1024):
    # copy exactly N1 bytes from oldf to newf, buflen bytes at a time
    while N1 > 0:
        chunk = oldf.read(min(N1, buflen))
        if not chunk:          # unexpected EOF: the old file is shorter than N1
            break
        newf.write(chunk)
        N1 -= len(chunk)
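And a sketch of wiring the whole thing together (the helper name is made up; os.rename is atomic on POSIX filesystems, while Windows needs the three-step dance described above). It removes N2 bytes starting at offset N1 by writing a new file and renaming it over the old one:

import os
import shutil

def delete_chunk(path, N1, N2, buflen=1024*1024):
    newpath = path + '.new'
    with open(path, 'rb') as oldf, open(newpath, 'wb') as newf:
        copyN1(oldf, newf, N1, buflen)            # keep the first N1 bytes
        oldf.seek(N2, os.SEEK_CUR)                # skip the N2 bytes to delete
        shutil.copyfileobj(oldf, newf, buflen)    # copy the rest over
    os.rename(newpath, path)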
I'd suggest memory mapping. Though it does manipulate the file in memory, it is more efficient than plainly reading the whole file into memory.
Well, you have to manipulate the file contents in memory one way or another, as there's no system call for such an operation in either *nix or Windows (at least none that I'm aware of).
Try mmaping the file. This won't necessarily read it all into memory at once.
If you really want to do it by hand, choose some chunk size and do back-and-forth reads and writes. But the seeks are going to kill you...