I'm reading from certain offsets in several hundred and possibly thousands of files. Because I only need certain data from certain offsets at that particular time, I must either keep the file handle open for later use OR write the parts I need into separate files.
I figured keeping all these file handles open, rather than doing a significant amount of writing of new temporary files to disk, is the lesser of two evils. I was just worried about the efficiency of having so many file handles open.
Typically, I'll open a file, seek to an offset, read some data, then 5 seconds later do the same thing at another offset, and do all this on thousands of files within a 2 minute timeframe.
Is that going to be a problem?
A follow-up: really, I'm asking which is better: leaving these thousands of file handles open, or constantly closing and re-opening them only at the moment I need them.
Some systems may limit the number of file descriptors that a single process can have open simultaneously. 1024 is a common default, so if you need "thousands" open at once, you might want to err on the side of portability and design your application to use a smaller pool of open file descriptors.
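As a quick sanity check, you can query (and often raise) the per-process descriptor limit from Python; this is a minimal sketch, and the resource module is Unix-only:

    import resource

    # Current soft and hard limits on open file descriptors for this process.
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    print("soft limit:", soft, "hard limit:", hard)

    # The soft limit can usually be raised up to the hard limit without extra privileges.
    resource.setrlimit(resource.RLIMIT_NOFILE, (min(4096, hard), hard))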
I recommend that you take a look at Storage.py in BitTorrent. It includes an implementation of a pool of file handles.
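I haven't reproduced Storage.py here, but the idea is roughly this kind of LRU pool; this is a hedged sketch, and the class name, pool size, and open mode are all made up:

    from collections import OrderedDict

    class FilePool:
        """Keep at most max_open files open; reopen transparently on demand."""

        def __init__(self, max_open=512):
            self.max_open = max_open
            self._open = OrderedDict()          # path -> file object, in LRU order

        def get(self, path):
            f = self._open.pop(path, None)
            if f is None:
                if len(self._open) >= self.max_open:
                    _, oldest = self._open.popitem(last=False)
                    oldest.close()              # evict the least recently used handle
                f = open(path, "rb")
            self._open[path] = f                # mark as most recently used
            return f

        def close_all(self):
            for f in self._open.values():
                f.close()
            self._open.clear()

    # Usage: f = pool.get(path); f.seek(offset); data = f.read(length)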
Though I would imagine that append mode is "smart" enough to only insert the new bytes being appended, I want to make absolutely sure that Python doesn't handle it by re-writing the entire file along with the new bytes.
I am attempting to keep a running backup of a program log, and it could reach several thousand records in a CSV format.
Python file operations are convenience wrappers over operating system file operations. The operating system either implements these file system operations internally, forwards them to a loadable module (plugin), or forwards them to an external server (NFS, SMB). Most operating systems since roughly 1971 have been capable of appending data to an existing file; at least all the ones that claim to be even remotely POSIX compliant.
The POSIX append mode simply opens the file for writing and positions the file pointer at the end of the file before each write. This means that every write operation just adds data past the current end of the file.
There might be a few exceptions to that; for example, some routine might use low-level system calls to move the file pointer backwards, or the underlying file system might not be POSIX compliant and use some form of object/transactional storage like AWS S3. But for any standard scenario I wouldn't worry about such cases.
However, since you mentioned backup as your use case, you need to be extra careful. Backups are not as easy as they seem on the surface. Things to worry about: various caches might hold data in memory before it is written to disk; what happens if the power goes out right after you appended new records; and what happens if somebody starts several copies of your program?
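For the cache concern, a minimal sketch (the filename and record are made up) is to flush and fsync after each appended record so it actually reaches the disk:

    import os

    # Append one record and push it through Python's buffer and the OS cache.
    with open("backup.csv", "a", newline="") as f:
        f.write("2024-01-01 12:00:00,event,42\n")
        f.flush()                    # flush Python's own buffer to the OS
        os.fsync(f.fileno())         # ask the OS to flush its cache to the disk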
And the last thing: unless you are running on a 1980s 8-bit computer, a few thousand CSV lines are nothing for modern hardware. Even if the files were loaded and written back in full, you wouldn't notice any difference.
Background:
I am getting a temperature float from an Arduino via a serial connection. I need to be able to cache this temperature data every 30 seconds for other applications (e.g. web, thermostat controller) to access, without overloading the serial connection.
Currently I cache this data to RAM as a file in /run (I'm trying to follow Linux convention). Then, other applications can poll the file for the temperature as often as they want all day long, with I/O now the only bottleneck (I'm using an RPi, so not a lot of enterprise-level need here).
Problem:
I think that when an app reads this file, it risks reading corrupt data. If a writer updates the file at the same time a reader is reading it, could corrupt data be read, causing the thermostat to behave erratically?
Should I just use sqlite3 as an overkill solution, or use file locks (and does that risk something else not working perfectly)?
This is all taking place in multiple Python processes. Is Linux able to handle this situation natively, or do I need to somehow apply the principles mentioned here?
Calls to write(2) ought to be atomic under Linux.
Which means that as long as you are writing a single buffer, you can be certain that readers won't read an incomplete record. You might want to use os.write to make sure that no buffering or chunking happens that you are not aware of.
if a read is happening and a file is updated, will the read use the new data while in the middle of a file, or does it somehow know how to get data from the old file (how)?
If there is exactly one read(2) and one write(2), you are guaranteed to see a consistent result. If you split your write into two, it might happen that you write the first part, a read happens, and then you write the second part, which would be an atomicity violation. In case you need to write multiple buffers, either combine them yourself or use writev(2).
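In Python terms, a sketch of the single-buffer approach for the temperature file might look like this (the path and format are just examples); the point is that the whole record reaches the kernel in one write call:

    import os

    def publish_temperature(path, celsius):
        record = f"{celsius:.2f}\n".encode()             # build one buffer up front
        # O_TRUNC keeps old bytes from lingering if the new record is shorter,
        # at the cost of a brief window where the file is empty.
        fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
        try:
            os.write(fd, record)                         # a single write(2) call
        finally:
            os.close(fd)

    publish_temperature("/run/thermostat/temperature", 21.37)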
baseline - I have CSV data with 10,000 entries. I save this as 1 csv file and load it all at once.
alternative - I have CSV data with 10,000 entries. I save this as 10,000 CSV files and load it individually.
Approximately how much more inefficient is this computationally? I'm not hugely interested in memory concerns. The purpose of the alternative method is that I frequently need to access subsets of the data and don't want to have to read the entire array.
I'm using python.
Edit: I can use other file formats if needed.
Edit1: SQLite wins. Amazingly easy and efficient compared to what I was doing before.
SQLite is an ideal solution for your application.
Simply import your CSV file into an SQLite database table (it is going to be a single file), then add indexes as necessary.
To access your data, use the Python sqlite3 library. You can follow this tutorial on how to use it.
Compared to many other solutions, SQLite will be the fastest way to select partial data sets locally - certainly much, much faster than accessing 10,000 files. Also read this answer, which explains why SQLite is so good.
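A minimal sketch of that import (the file names, table, and column layout are made up here, assuming a two-column CSV; adjust to your data):

    import csv
    import sqlite3

    conn = sqlite3.connect("data.db")            # the whole database is one file
    conn.execute("CREATE TABLE IF NOT EXISTS entries (key TEXT, value TEXT)")

    with open("data.csv", newline="") as f:
        reader = csv.reader(f)
        conn.executemany("INSERT INTO entries VALUES (?, ?)",
                         ((row[0], row[1]) for row in reader))
    conn.execute("CREATE INDEX IF NOT EXISTS idx_key ON entries(key)")
    conn.commit()

    # Pull only the subset you need instead of opening thousands of files:
    subset = conn.execute("SELECT * FROM entries WHERE key = ?", ("sensor_42",)).fetchall()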
I would write all the lines to one file. For 10,000 lines it's probably not worthwhile, but you can pad all the lines to the same length - say 1000 bytes.
Then it's easy to seek to the nth line: just multiply n by the line length.
10,000 files are going to be slower to load and access than one file, if only because the files' data will likely be fragmented around your disk drive, so accessing it will require a much larger number of seeks than accessing the contents of a single file, which will generally be stored as sequentially as possible. Seek times are a big slowdown on spinning media, since your program has to wait while the drive heads are physically repositioned, which can take milliseconds. (Slow seek times aren't an issue for SSDs, but even then there will still be the overhead of 10,000 files' worth of metadata for the operating system to deal with.) Also, with a single file, the OS can speed things up for you by doing read-ahead buffering (as it can reasonably assume that if you read one part of the file, you will likely want to read the next part soon). With multiple files, the OS can't do that.
My suggestion (if you don't want to go the SQLite route) would be to use a single CSV file, and (if possible) pad all of the lines of your CSV file out with spaces so that they all have the same length. For example, say you make sure when writing out the CSV file to make all lines in the file exactly 80 bytes long. Then reading the (n)th line of the file becomes relatively fast and easy:
    myFileObject.seek(n * 80)          # jump straight to the start of line n
    theLine = myFileObject.read(80)    # read exactly one 80-byte padded line
So I am developing my own download manager for educational purposes. I have multiple connections/threads downloading a file, each connection working on a particular range of the file. Now, after they have all fetched their chunks, I don't know exactly how to bring these chunks together to re-create the original file.
What I did:
First, I created a temporary file in 'wb' mode and allowed each connection/thread to dump its chunk. But every time a connection did this, it overwrote previously saved chunks. I figured this was because I had used the 'wb' mode. I changed it to 'ab', but then I could no longer perform seek() operations.
What I am looking for:
I need an elegant way of re-packaging these chunks into the original file. I would like to know how other download managers do it.
Thanks in advance.
You need to write the chunks to different temporary files and then join them in the original order. If you open one file for all the threads, you have to make access to it sequential to preserve the correct order of data, which defeats the purpose of using threads, since each thread would have to wait for the previous one. BTW, you should open the files in wb mode.
You were doing it just fine: seek() and write(). That should work!
Now, if you want a cleaner structure, without so many threads moving their hands all over a file, you might want to consider having downloader threads and a disk-writing thread. This last one may just sleep until woken by one of the others, write some kb to disk, and go back to sleep.
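A hedged sketch of that downloader/disk-writer split (the names, the queue protocol, and the sentinel value are all made up):

    import queue
    import threading

    write_queue = queue.Queue()      # downloader threads put (offset, chunk_bytes) tuples here

    def disk_writer(path, total_size):
        with open(path, "wb") as f:
            f.truncate(total_size)               # pre-size the file once
        with open(path, "r+b") as f:
            while True:
                item = write_queue.get()
                if item is None:                 # sentinel: all downloads are finished
                    break
                offset, chunk = item
                f.seek(offset)                   # jump to this chunk's slot...
                f.write(chunk)                   # ...and drop it in place

    # writer = threading.Thread(target=disk_writer, args=("download.bin", total_size))
    # writer.start()
    # each downloader thread: write_queue.put((start_offset, chunk))
    # when all downloads are done: write_queue.put(None); writer.join()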
The title could probably have been put better, but anyway: I was wondering if there are any functions for writing to files that are like what the ACID properties are for databases. The reason is, I would like to make sure that the file writes I am doing won't mess up and corrupt the file if the power goes out.
Depending on what exactly you're doing with your files and the platform there are a couple options:
If you're serializing a blob from memory to disk repeatedly to maintain state (example: a dhcp leases file), and you're on a POSIX system, you can write your data to a temporary file and rename the temporary file to your target. On POSIX compliant systems this is guaranteed to be an atomic operation; it shouldn't even matter whether the filesystem is journaled or not. If you're on a Windows system, there's a native function named MoveFileTransacted that you might be able to utilize via bindings. But the key concept here is: the temporary file protects your data, and if the system reboots the worst case is that your file contains the last good refresh of data. This option requires that you write the entire file out every time you want a change to be recorded. In the case of a dhcp.leases file this isn't a big performance hit; larger files might prove to be more cumbersome.
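On a POSIX system that pattern is only a few lines; here is a sketch (os.replace also handles overwriting an existing target on Windows, though it is not the transacted API mentioned above):

    import os
    import tempfile

    def atomic_overwrite(path, data: bytes):
        # Write the full new contents to a temp file in the same directory,
        # then rename it over the target; the rename is the atomic step.
        directory = os.path.dirname(path) or "."
        fd, tmp_path = tempfile.mkstemp(dir=directory)
        try:
            os.write(fd, data)
            os.fsync(fd)                 # make sure the bytes are on disk before the rename
        finally:
            os.close(fd)
        os.replace(tmp_path, path)       # readers see either the old file or the new one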
If you're reading and writing bits of data constantly, sqlite3 is the way to go -- it supports atomic commits for groups of queries and has its own internal journal. One thing to watch out for here is that atomic commits will be slower due to the overhead of locking the database, waiting for the data to flush, etc.
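A sketch of grouping writes into one atomic commit with the standard sqlite3 module (the table and values are made up):

    import sqlite3

    conn = sqlite3.connect("state.db")
    conn.execute("CREATE TABLE IF NOT EXISTS leases (mac TEXT PRIMARY KEY, ip TEXT)")

    # Using the connection as a context manager wraps the block in a transaction:
    # both statements are committed together, or rolled back together on an error.
    with conn:
        conn.execute("INSERT OR REPLACE INTO leases VALUES (?, ?)", ("aa:bb:cc", "10.0.0.5"))
        conn.execute("INSERT OR REPLACE INTO leases VALUES (?, ?)", ("dd:ee:ff", "10.0.0.6"))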
A couple of other things to consider -- if your filesystem is mounted async, writes will appear to be complete because the write() returns, but they might not be flushed to disk yet. The rename approach protects you in this case; sqlite3 does as well.
If your filesystem is mounted async, it might be possible to write data and move it before the data is written. So if you're on a unix system it might be safest to mount sync. That's on the level of 'people might die if this fails' paranoia, though. But if it's an embedded system and it dies, 'I might lose my job if this fails' is also a good rationalization for the extra protection.
The ZODB is an ACID compliant database storage written (mostly) in Python, so in a sense the answer is yes. But I can imagine this is a bit overkill :)
Either the OS has to provide this for you, or you'll need to implement your own ACID compliance. For example, by defining 'records' in the file you write and, when opening/reading, verifying which records have been fully written (which may mean you need to throw away some partially written data). ZODB, for example, implements this by ending a record with the size of the record itself; if you can read this size and it matches, you know the record has been fully written.
And, of course, you always need to append records and not rewrite the entire file.
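A sketch of that record idea, using a length prefix rather than ZODB's trailing size (the format and names are made up): records are only ever appended, and a record whose payload was cut short by a crash fails the completeness check and is discarded on load.

    import struct

    def append_record(f, payload: bytes):
        # 4-byte little-endian length, then the payload; never rewrite earlier data.
        f.write(struct.pack("<I", len(payload)) + payload)
        f.flush()

    def load_records(path):
        records = []
        with open(path, "rb") as f:
            while True:
                header = f.read(4)
                if len(header) < 4:
                    break                            # clean end of file (or a torn header)
                (size,) = struct.unpack("<I", header)
                payload = f.read(size)
                if len(payload) < size:
                    break                            # torn final record: throw it away
                records.append(payload)
        return records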
It looks to me like your main goal is to ensure the integrity of written files in case of power failures and system crashes. There are a couple of things to consider when doing this:
Ensure that your data is written to disk when you close a file. Even after you close it, some of the data may sit in the OS cache for several seconds waiting to be written to the disk. You can force writing to disk with f.flush(), followed by os.fsync(f.fileno()).
Don't modify existing data before you are certain that the updated data is safely on the disk. This part can be quite tricky (and OS/filesystem dependent).
Use a file format that helps you verify the integrity of the data (e.g. checksums).
Another alternative is to use sqlite3.
EDIT: Regarding my second point, I highly recommend this presentation: http://www.flamingspork.com/talks/2007/06/eat_my_data.odp. This also covers issues with "atomic rename".