Questions: How do you represent a file in python as an object, entirely in memory, making no calls to the hard drive?
Info:
I'm working on a project that has files distributed across many computers at once, these files are all stored in a SQLite3 database with identifiers so that the files can be keep in sync and I only have to have 1 file to deal with on the computers rather than many.
My problem is that the "open" command requires a path on the hard drive. well that path doesn't exist. I still need to be able to interact with these file objects though. What I'm looking for is a way to interact with these files as if they were on the hard drive, but they are only in memory, probably as a byte string. so kind like if i were to go:
file = open(<location in memory>,'r')
I've tried searching for this, but all search results just point to streaming files from the hard drive. so just to make this clear, I am no streaming from the hard drive, and that is not an option. If I have to do that, I'll rework my system for that. but right now that is an extra pointless step.
Take a look at the StringIO (and cStringIO) module: http://docs.python.org/library/stringio.html
Related
I am in the process of writing a program and need some guidance. Essentially, I am trying to determine if a file has some marker or flag attached to it. Sort of like the attributes for a HTTP Header.
If such a marker exists, that file will be manipulated in some way (moved to another directory).
My question is:
Where exactly should I be storing this flag/marker? Do files have a system similar to HTTP Headers? I don't want to access or manipulate the contents of the file, just some kind of property of the file that can be edited without corrupting the actual file--and it must be rather universal among file types as my potential domain of file types is unbound. I have some experience with Web APIs so I am familiar with HTTP Headers and json. Does any similar system exist for local files in windows? I am especially interested in anyone who has professional/industry knowledge of common techniques that programmers use when trying to store 'meta data' in files in order to access them later. Or if anyone knows of where to point me, as I am unsure to what I should be researching.
For the record, I am going to write a program for Windows probably using Golang or Python. And the files I am going to manipulate will be potentially all common ones (.docx, .txt, .pdf, etc.)
Metadata you wish to add is best kept in a separate file or database for all files.
Or in another file with same name and different extension or prefix, that you can make hidden.
Relying on a file system is very tricky and your data will be bound by the restrictions and capabilities of the file system your file is stored on.
And, you cannot count on your data remaining intact as any application may wish to change these flags.
And some of those have very specific, clearly defined use, such as creation time, modification time, access time...
See, if you need only flagging the document, you may wish to use creation time, which will stay unchanged through out the live of this document (until is copied) to store your flags. :D
Very dirty business, unprofessional, unreliable and all that.
But it's a solution. Poor one, but exists.
I do not know that FAT32 or NTFS file systems support any extra bits for flagging except those already used by the OS.
Unixes EXT family FS's do support some extra bits. And even than you should be careful in case some other important application makes use of them for something.
Mac OS may support some metadata by itself, but I am not 100% sure.
On Windows, you have one more option to associate more data with a file, but I wouldn't use that as well.
Well, NTFS file system (FAT doesn't support that) has a feature called streams.
In essential, same file can have multiple data streams under itself. I.e. You have more than one file contents under same file node.
To be more clear. Same file contains two different files.
When you open the file normally only main stream is visible to the application. Applications must check whether the other streams are present and choose the one they want to follow.
So, you may choose to store metadata under the second stream of the file.
But, what if all streams are taken?
Even more, anti-virus programs may prevent you access to the metadata out of paranoya, or at least ask for a permission.
I don't know why MS included that option, probably for file duplication or something, but bad hackers made use of the fact that you can store some data, under existing regular file, that nobody is aware of.
Imagine a virus writing it's copy into another stream of one of programs already there.
All that is needed for it to start, instead of your old program next time you run it is a batch script added to task scheduler that flips two streams making the virus data the main one.
Nasty trick! So when this feature started to be abused, anti-virus software started restricting files with multiple streams, so it's like this feature doesn't exist.
If you want to add some metadata using OS's technology, use Windows registry,
but even that is unwise.
What to tell you?
Don't add metadata to files, organize a separate file, or index your data in special files with same name as the file you are refering to and in same folder.
If you are dealing with binary files like docx and pdf, you're best off storing the metadata in seperate files or in a sqlite file.
Metadata is usually stored seperate from files, in data structures called inodes (at least in Unix systems, Windows probably has something similar). But you probably don't want to get that deep into the rabbit hole.
If your goal is to query the system based on metadata, then it would be easier and more efficient to use something SQLite. Having the meta data in the file would mean that you would need to open the file, read it into memory from disk, and then check the meta data - i.e slower queries.
If you don't need to query based on metadata, then storing metadata in the file might make sense. It would reduce the dependencies in your application, but in order to access the contents of the file through Word or Adobe Reader, you'd need to strip the metadata before handing it off to the application. Not worth the hassle, usually
I have a large text file stored in a shared directory on a server in which different other machines have access to that. I'm running various analysis on this text file without changing or updating it. I'd like to know whether I can run different python scripts on different machines in which all of them reading that large text file? None of the scripts make any change to that file, they just need to read it.
You should be able to do multiple read access, but it might get really, really, really slow, compared to reading the same file by several scripts on the same computer (obviously, the degree will very much depend on how much reading you are doing). You may want to copy the file over before processing.
I'm implementing a FUSE driver for Google Drive. The aim is to allow a user to mount her Google Drive/Docs account as a virtual filesystem. Full source at https://github.com/jforberg/drivefs. I use the fusepy bindings to integrate FUSE with Python, and Google's Document List API to access Drive.
My driver is complete to the degree that readdir(2), stat(2) and read(2) work as expected. In the filesystem, each file read translates to a HTTPS request which has a large overhead. I've managed to limit the overhead by forcing a larger buffer size for reads.
Now to my problem. File explorers like Thunar and Nautilus build thumbs and determine file types by reading the first part of each file (the first 4k bytes or so). But in my filessystem, reading from many files at once is a painful procedure, and getting a file listing in thunar takes a very long time compared with a simple ls (which only stat(2)s each file).
I need some way to tell file explorers that my filessystem does not play well with "mini-reads", or some way to identify these mini-reads and feed them made-up data to make them happy. Any help would be appreciated!
EDIT: The problem was not with HTTPS overhead, but with my handling of Google's native "doc" format. I added a line to make read(2) return an empty string when someone tries to read a native doc, and the file listing is now almost instantaneous.
This seems a mild limitation, as not even Google's official client program is able to edit native docs.
Here is pycloudfuse which is a similar attempt but for cloud files / openstack object storage which you might find useful bits in.
When writing this I can't say I noticed any problems with Thunar and Nautilus with the directory listings.
I don't think you can feed the file managers made up data - that is bound to lead to problems.
I like the option is to signal to the file explorer not to do thumbnails etc, but I don't think that is possible either.
I think the best option is to remind your users that drivefs is not a real filesystem, and to give a list of its limitations, and if it is anything like pycloudfuse there will be lots!
If you open a file for reading, read from it, close it, and then repeat the process (in a loop) does python continually access the hard-drive? Because it doesn't seem to from my experimentation, and I'd like to understand why.
An (extremely) simple example:
while True:
file = open('/var/log/messages', 'r')
stuff = file.read()
file.close()
time.sleep(2)
If I run this, my hard-drive access LED lights up once and then the hard-drive remains dormant. How is this possible? Is python somehow keeping the contents of the file stored in RAM? Because logic would tell me that it should be accessing the hard-drive every two seconds.
The answer depends on your operating system and the type of hard drive you have. Most of the time, when you access something off the drive, the information is cached in main memory in case you need it again soon. Depending on the replacement strategy used by your OS, the data may stay in main memory for a while or be replaced relatively soon.
In some cases, your hard drive will cache frequently accessed information in its own memory, and then while the drive might be accessed, it will retrieve the information faster and send it to the processor faster than if it had to search the drive platters.
Likely your operating system or file-system is smart enough to serve the file from the OS cache if the file has not changed in between.
Python does not cache, the operating system does. You can find out the size of these caches with top. In the third line, it will say something like:
Mem: 5923332k total, 5672832k used, 250500k free, 43392k buffers
That means about 43MB are being used by the OS to cache data recently written to or read from the hard disk. You can turn this caching off by writing 2 or 3 to /proc/sys/vm/drop_caches.
The title could have probably been put better, but anyway. I was wondering if there are any functions for writing to files that are like what the ACID properties are for databases. Reason is, I would like to make sure that the file writes I am doin won't mess up and corrupt the file if the power goes out.
Depending on what exactly you're doing with your files and the platform there are a couple options:
If you're serializing a blob from memory to disk repeatedly to maintain state (example: dhcp leases file),
if you're on a Posix system you can write your data to a temporary file and 'rename' the temporary file to your target. On Posix compliant systems this is guaranteed to be an atomic operation, shouldn't even matter if the filesystem is journaled or not. If you're on a Windows system, there's a native function named MoveFileTransacted that you might be able to utilize via bindings. But the key concept here is, the temporary file protects your data, if the system reboots the worst case is that your file contains the last good refresh of data. This option requires that you write the entire file out every time you want a change to be recorded. In the case of dhcp.leases file this isn't a big performance hit, larger files might prove to be more cumbersome.
If you're reading and writing bits of data constantly, sqlite3 is the way to go -- it supports atomic commits for groups of queries and has it's own internal journal. One thing to watch out for here is that atomic commits will be slower due to the overhead of locking the database, waiting for the data to flush, etc.
A couple other things to consider -- if your filesystem is mounted async, writes will appear to be complete because the write() returns, but it might not be flushed to disk yet. Rename protects you in this case, sqlite3 does as well.
If your filesystem is mounted async, it might be possible to write data and move it before the data is written. So if you're on a unix system it might be safest to mount sync. That's on the level of 'people might die if this fails' paranoia though. But if it's an embedded system and it dies 'I might lose my job if this fails' is also a good rationalization for the extra protection.
The ZODB is an ACID compliant database storage written in (mostly) python, so in a sense the answer is yes. But I can imagine this is a bit overkill :)
Either the OS has to provide this for you, or you'll need to implement your own ACID compliancy. For example, by defining 'records' in the file you write and, when opening/reading, verifying which records have been written (which may mean you need to throw away some non-fully written data). ZODB, for example, implements this by ending a record by writing the size of the record itself; if you can read this size and it matches, you know the record has been fully written.
And, of course, you always need to append records and not rewrite the entire file.
It looks to me that your main goal is to ensure the integrity of written files in case of power failures and system crashes. There a couple of things to be considered when doing this:
Ensure that your data is written to disk when you close a file. Even if you close it, some of the data may be in OS cache for several seconds waiting to be written to the disk. You can force writing to disk with f.flush(), followed with os.fsync(f.fileno()).
Don't modify existing data before you are certain that the updated data is safely on the disk. This part can be quite tricky (and OS/filesystem dependent).
Use file format that helps you to verify the integrity of data (e.g. use checksums).
Another alternative is to use sqlite3.
EDIT: Regarding my second point, I highly recommend this presentation: http://www.flamingspork.com/talks/2007/06/eat_my_data.odp. This also covers issues with "atomic rename".