Download Manager: How to re-construct chunks fetched by multiple connections - python

So I am developing my own download manager for educational purposes. I have multiple connections/threads downloading a file; each connection works on a particular range of the file. Now, after they have all fetched their chunks, I don't know exactly how to bring these chunks together to re-make the original file.
What I did:
First, I created a temporary file in 'wb' mode and allowed each connection/thread to dump its chunk into it. But every time a connection does this, it overwrites the previously saved chunks. I figured this was because I used the 'wb' mode. I changed it to 'ab', but then I can no longer perform seek() operations.
What I am looking for:
I need an elegant way of re-packaging these chunks into the original file. I would like to know how other download managers do it.
Thanks in advance.

You need to write the chunks to different temporary files and then join them in the original order. If you open one file for all the threads, you have to make access to it sequential to preserve the correct order of the data, which defeats the purpose of using threads, since each thread would have to wait for the previous one. By the way, you should open the files in 'wb' mode.
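For what it's worth, here is a minimal sketch of the join step (the part-file names are made up for illustration); each thread writes its range to its own part file, and the parts are concatenated in offset order at the end:

    import shutil

    def join_parts(part_paths, output_path, chunk_size=1024 * 1024):
        # Concatenate the part files, already sorted by their range offset,
        # into the final file.
        with open(output_path, 'wb') as out:
            for part in part_paths:
                with open(part, 'rb') as src:
                    shutil.copyfileobj(src, out, chunk_size)

    # e.g. parts written by 4 threads as download.part0 .. download.part3
    join_parts(['download.part%d' % i for i in range(4)], 'download.bin')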

You were doing it just fine: seek() and write(). That should work!
Now, if you want a cleaner structure, without so many threads moving their hands all over one file, you might want to consider having downloader threads and a single disk-writing thread. The latter may just sleep until woken by one of the others, write a few KB to disk, and go back to sleep.
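To make the seek()-and-write() route concrete, here is a minimal sketch under a couple of assumptions: the overwriting described in the question most likely came from each thread reopening the file in 'wb' mode (which truncates it), so the destination is opened exactly once here, and the seek+write pair is guarded by a lock so two threads can't interleave between the two calls:

    import threading

    lock = threading.Lock()

    def write_chunk(out_file, offset, data):
        # Serialize seek() + write() so concurrent threads cannot
        # interleave between the two calls.
        with lock:
            out_file.seek(offset)
            out_file.write(data)

    # Open the destination once; 'wb' truncates only this single time.
    with open('download.bin', 'wb') as out:
        threads = []
        for offset, data in [(0, b'aaaa'), (4, b'bbbb'), (8, b'cccc')]:
            t = threading.Thread(target=write_chunk, args=(out, offset, data))
            t.start()
            threads.append(t)
        for t in threads:
            t.join()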

Related

Python: How to link a filename to a memory file

What I want to realize has the following features:
A Python program (or process, thread, ...) creates a memory file that can be read or written.
As long as the program is alive, the file data exists only in memory (NOT on disk). Once the program is no longer alive, no data is left behind.
However, there is an interface on the disk with a filename. This interface is linked to the memory file, and read and write operations on the interface are possible.
Why not use io?
The memory file will be the input of another program (not Python), so a file name is needed.
Why not use tempfile?
The major reason is security. The finalization of a tempfile differs between operating systems (right?), and in some occasional cases, such as an OS interruption, data may remain on disk. So program-held data seems more secure (at least to an extent).
Anyway, I just want to try and see whether tempfile can be avoided.
You could consider using a named pipe (using mkfifo). Another option is to create an actual file which the two programs open. Once both open it, you can unlink it so that it's no longer accessible on disk.
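A minimal sketch of the named-pipe option on a POSIX system (the path is made up for illustration); the data goes through the kernel pipe buffer and never lands on disk, only the name does:

    import os

    FIFO_PATH = '/tmp/myprogram.fifo'  # hypothetical name visible to the other program

    if not os.path.exists(FIFO_PATH):
        os.mkfifo(FIFO_PATH)

    # The other (non-Python) program opens FIFO_PATH like a regular file.
    # Note: this open() blocks until a reader has opened the other end.
    with open(FIFO_PATH, 'w') as fifo:
        fifo.write('data that never touches the disk\n')

    os.unlink(FIFO_PATH)  # remove the on-disk name when done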

How do I create a file I can stream data to in Python?

I'd like to create a file similar to those under /dev that I can stream lines of text to without actually writing anything to the disk.
I want to still be able to read this stream like a regular text file.
Call the os.mkfifo function, then open the file it creates as normal. Anything that gets written by one process will get read back out by another, and not saved to disk or anywhere else along the way. Note that reads and writes will block (i.e. appear to hang) if one process gets too far ahead of the other.
Alternatively, you can use the socket library to create a UNIX domain socket, which is bidirectional and has more features, but is more complicated to set up.
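If the UNIX domain socket route appeals to you, here is a rough sketch of the receiving side (the socket path is made up); makefile() lets you read the connection like a regular text file:

    import os
    import socket

    SOCK_PATH = '/tmp/stream.sock'  # hypothetical path

    if os.path.exists(SOCK_PATH):
        os.unlink(SOCK_PATH)

    server = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    server.bind(SOCK_PATH)
    server.listen(1)

    conn, _ = server.accept()            # blocks until a writer connects
    with conn.makefile('r') as stream:   # treat the connection as a text file
        for line in stream:
            print('got:', line.rstrip())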

Fine control over h5py buffering

I have some data in memory that I want to store in an HDF file.
My data are not huge (<100 MB, so they fit in memory very comfortably), so for performance it seems to make sense to keep them there. At the same time, I also want to store them on disk. It is not critical that the two are always exactly in sync, as long as they are both valid (i.e. not corrupted) and I can trigger a synchronization manually.
I could just keep my data in a separate container in memory and shovel it into an HDF object on demand. If possible, I would like to avoid writing this layer: it would require me to keep track of which parts have changed and selectively update those. I was hoping HDF would take care of that for me.
I know about the driver='core' with backing store functionality, but AFAICT it only syncs the backing store when the file is closed. I can flush the file, but does that guarantee that the object is written to storage?
From looking at the HDF5 source code, it seems that the answer is yes. But I'd like to hear a confirmation.
Bonus question: Is driver='core' actually faster than normal filesystem back-ends? What do I need to look out for?
What the H5Fflush command does is request that the file system transfer all buffers to the file.
The documentation has a specific note about it:
HDF5 does not possess full control over buffering. H5Fflush flushes
the internal HDF5 buffers then asks the operating system (the OS) to
flush the system buffers for the open files. After that, the OS is
responsible for ensuring that the data is actually flushed to disk.
In practice, I have noticed that most of the time I can read the data from an HDF5 file that has been flushed (even if the process was subsequently killed), but this is not guaranteed by HDF5: there is no safety in relying on the flush operation to get a valid HDF5 file, as further operations (on the metadata, for instance) can corrupt the file if the process is then interrupted. You have to close the file completely to have this consistency.
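For reference, here is a minimal sketch of the core driver with a backing store in h5py (the file name and dataset are made up); as described above, flush() only pushes HDF5's internal buffers toward the OS, and only close() leaves you with a file you can rely on:

    import h5py
    import numpy as np

    # Keep the file image in memory; backing_store=True writes it to disk.
    f = h5py.File('data.h5', 'w', driver='core', backing_store=True)
    dset = f.create_dataset('temperature', data=np.random.rand(1000))

    dset[:100] = 0.0   # all of this happens on the in-memory image
    f.flush()          # flush HDF5's internal buffers (see the caveats above)

    f.close()          # closing writes the image out and is the only point
                       # where a consistent on-disk file is guaranteed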
If you need consistency and want to avoid corrupted HDF5 files, you may like to:
1) use a write-ahead log: append a log entry describing what is being added/updated each time (see the sketch after this list), with no need to write to the HDF5 file at that moment;
2) periodically, or at the time you need to shut down, replay the logs and apply them one by one to the HDF5 file;
3) if your process goes down during 1), you won't lose data: after you start up next time, just replay the logs and write them to the HDF5 file;
4) if your process goes down during 2), you will not lose data either: just remove the corrupted HDF5 file, replay the logs, and write it again.
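A rough sketch of the write-ahead-log idea with a JSON-lines log (the file names, dataset shape, and record format are all made up for illustration):

    import json
    import os

    import h5py
    import numpy as np

    LOG_PATH = 'updates.log'   # hypothetical file names
    H5_PATH = 'data.h5'

    def log_update(name, start, values):
        # Step 1: append the update to the log; the HDF5 file is not touched.
        record = {'dataset': name, 'start': start, 'values': list(values)}
        with open(LOG_PATH, 'a') as log:
            log.write(json.dumps(record) + '\n')
            log.flush()
            os.fsync(log.fileno())   # make the log entry itself durable

    def replay_log():
        # Steps 2-4: apply all logged updates to the HDF5 file, then clear the log.
        if not os.path.exists(LOG_PATH):
            return
        with h5py.File(H5_PATH, 'a') as f:
            with open(LOG_PATH) as log:
                for line in log:
                    rec = json.loads(line)
                    if rec['dataset'] not in f:
                        f.create_dataset(rec['dataset'], shape=(1000,), dtype='f8')
                    values = np.asarray(rec['values'])
                    start = rec['start']
                    f[rec['dataset']][start:start + len(values)] = values
        os.remove(LOG_PATH)   # the updates are now safely in the HDF5 file

    log_update('temperature', 0, [20.5, 20.6, 20.7])
    replay_log()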

ACID Transactions at the File System

Background:
I am getting a temperature float from an Arduino via a serial connection. I need to cache this temperature data every 30 seconds for other applications (e.g. web, thermostat controller) to access, without overloading the serial connection.
Currently I cache this data to RAM as a file in /run (I'm trying to follow Linux convention). Other applications can then poll the file for the temperature whenever they want it, all day long, with I/O now the only bottleneck (I'm using an RPi, so there is not a lot of enterprise-level need here).
Problem:
I think that when an app reads these files, it risks reading corrupt data. If a writer updates the file while a reader is reading it at the same time, can corrupt data be read, causing the thermostat to behave erratically?
Should I just use sqlite3 as an overkill solution, or use file locks (and does that risk something else not working perfectly)?
This is all taking place in multiple Python processes. Is Linux able to handle this situation natively, or do I need to somehow apply the principles mentioned here?
Calls to write(2) ought to be atomic under Linux.
Which means that as long as you are writing a single buffer, you can be certain that readers won't read an incomplete record. You might want to use os.write to make sure that no buffering/chunking happens that you are not aware of.
if a read is happening and a file is updated, will the read use the new data while in the middle of a file, or does it somehow know how to get data from the old file (how)?
If there is exactly one read(2) and one write(2), you are guaranteed to see a consistent result. If you split your write into two, it might happen that you write the first part, read and then write the second part which would be an atomicity violation. In case you need to write multiple buffers, either combine them yourself or use writev(2).
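A minimal sketch along those lines (the path is made up); a fixed-width record keeps the file length constant, and the whole record goes out in a single write(2) call:

    import os

    TEMP_PATH = '/run/thermostat/temperature'  # hypothetical path

    def write_temperature(value):
        # One fixed-width record, written at offset 0 with a single
        # write(2) call, so a concurrent reader sees either the old
        # value or the new one in full, never a mix.
        buf = ('%08.2f\n' % value).encode('ascii')
        fd = os.open(TEMP_PATH, os.O_WRONLY | os.O_CREAT, 0o644)
        try:
            os.write(fd, buf)
        finally:
            os.close(fd)

    write_temperature(21.4)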

Having several file pointers open simultaneously alright?

I'm reading from certain offsets in several hundred and possibly thousands of files. Because I need only certain data from certain offsets at that particular time, I must either keep the file handles open for later use OR write the parts I need into separate files.
I figured that keeping all these file handles open, rather than doing a significant amount of writing of new temporary files to disk, is the lesser of two evils. I was just worried about the efficiency of having so many file handles open.
Typically, I'll open a file, seek to an offset, read some data, then 5 seconds later do the same thing but at another offset, and do all this on thousands of files within a 2 minute timeframe.
Is that going to be a problem?
A follow-up: really, I'm asking which is better: to leave these thousands of file handles open, or to constantly close them and re-open them just when I instantaneously need them.
Some systems may limit the number of file descriptors that a single process can have open simultaneously. 1024 is a common default, so if you need "thousands" open at once, you might want to err on the side of portability and design your application to use a smaller pool of open file descriptors.
I recommend that you take a look at Storage.py in BitTorrent. It includes an implementation of a pool of file handles.
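This is not the Storage.py code itself, but a rough sketch of the same idea: an LRU-style pool that stays below a chosen descriptor limit (max_open is an arbitrary number for illustration):

    from collections import OrderedDict

    class FileHandlePool:
        # Keep at most max_open files open; the least-recently-used
        # handle is closed when the limit is reached.

        def __init__(self, max_open=256):
            self.max_open = max_open
            self._open = OrderedDict()   # path -> file object

        def get(self, path):
            if path in self._open:
                self._open.move_to_end(path)      # mark as recently used
                return self._open[path]
            if len(self._open) >= self.max_open:
                _, oldest = self._open.popitem(last=False)
                oldest.close()
            f = open(path, 'rb')
            self._open[path] = f
            return f

        def close_all(self):
            for f in self._open.values():
                f.close()
            self._open.clear()

    # usage:
    # pool = FileHandlePool(max_open=512)
    # fh = pool.get(path); fh.seek(offset); data = fh.read(length)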
