Read a file using threads - python

I'm trying to write a Python program that sends files from one PC to another using Python's sockets, but as the file size increases it takes a lot of time. Is it possible to read lines of a file sequentially using threads?
The concept I have in mind is as follows:
Each thread separately and sequentially reads lines from the file and sends them over the socket. Is it possible to do this? Or do you have another suggestion?

First, if you want to speed this up as much as possible without using threads, reading and sending a line at a time can be pretty slow. Python does a great job of buffering up the file to give you a line at a time for reading, but then you're sending tiny 72-byte packets over the network. You want to try to send at least 1.5KB at a time when possible.
Ideally, you want to use the sendfile method. Python will tell the OS to send the whole file over the socket in whatever way is most efficient, without getting your code involved at all. Unfortunately, this doesn't work on Windows; if you care about that, you may want to drop to the native APIs[1] directly with pywin32 or switch to a higher-level networking library like twisted or asyncio.
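On platforms where it is supported, the whole send side can be as small as the following sketch; send_whole_file is just an illustrative helper name, and the host and filename in the usage comment are placeholders.
import socket

def send_whole_file(sock, path):
    # socket.sendfile() hands the transfer to the OS (os.sendfile where
    # available), so Python never copies the file contents itself.
    with open(path, 'rb') as f:
        sock.sendfile(f)

# usage sketch:
# with socket.create_connection(('receiver.example', 9000)) as sock:
#     send_whole_file(sock, 'bigfile.bin')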
Now, what about threading?
Well, reading a line at a time in different threads is not going to help very much. The threads have to read sequentially, fighting over the read pointer (and buffer) in the file object, and they presumably have to write to the socket sequentially, and you probably even need a mutex to make sure they write things in order. So, whichever one of those is slowest, all of your threads are going to end up waiting for their turn.[2]
Also, even forgetting about the sockets: Reading a file in parallel can be faster in some situations on modern hardware, but in general it's actually a lot slower. Imagine the file is on a slow magnetic hard drive. One thread is trying to read the first chunk, the next thread is trying to read the 64th chunk, the next thread is trying to read the 4th chunk… this means you spend more time seeking the disk head back and forth than actually reading data.
But, if you think you might be in one of those situations where parallel reads might help, you can try it. It's not trivial, but it's not that hard.
First, you want to do binary reads of fixed-size chunks. You're going to need to experiment with different sizes—maybe 4KB is fastest, maybe 1MB… so make sure to make it a constant you can easily change in just one place in the code.
Next, you want to be able to send the data as soon as you can get it, rather than serializing. This means you have to send some kind of identifier, like the offset into the file, before each chunk.
The function will look something like this:
import struct

def sendchunk(sock, lock, file, offset):
    with lock:
        sock.send(struct.pack('>Q', offset))
        sent = sock.sendfile(file, offset, CHUNK_SIZE)
        if sent < CHUNK_SIZE:
            raise OopsError(f'Only sent {sent} out of {CHUNK_SIZE} bytes')
… except that (unless your files actually are all multiples of CHUNK_SIZE) you need to decide what you want to do for a legitimate EOF. Maybe send the total file size before any of the chunks, and pad the last chunk with null bytes, and have the receiver truncate the last chunk.
The receiving side can then just loop reading 8+CHUNK_SIZE bytes, unpacking the offset, seeking, and writing the bytes.
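A receiving loop for that scheme might look roughly like this; recv_exactly, receive_file, total_size, and the CHUNK_SIZE value are illustrative, and it assumes the sender transmits the total file size up front as suggested above.
import struct

CHUNK_SIZE = 64 * 1024  # must match the sender's constant

def recv_exactly(sock, n):
    # keep calling recv() until we have exactly n bytes or the peer closes
    data = b''
    while len(data) < n:
        more = sock.recv(n - len(data))
        if not more:
            break
        data += more
    return data

def receive_file(sock, out_path, total_size):
    with open(out_path, 'wb') as f:
        remaining = total_size
        while remaining > 0:
            header = recv_exactly(sock, 8)
            if len(header) < 8:
                break  # connection closed early
            (offset,) = struct.unpack('>Q', header)
            chunk = recv_exactly(sock, CHUNK_SIZE)
            f.seek(offset)
            f.write(chunk)
            remaining -= CHUNK_SIZE
        f.truncate(total_size)  # drop any null padding on the final chunk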
[1] See TransmitFile—but in order to use that, you have to know how to go between Python-level socket objects and Win32-level HANDLEs, and so on; if you've never done that, there's a learning curve, and I don't know of a good tutorial to get you started.
[2] If you're really lucky, and, say, the file reads are only twice as fast as the socket writes, you might actually get a 33% speedup from pipelining—that is, only one thread can be writing at a time, but the threads waiting to write have mostly already done their reading, so at least you don't need to wait there.

Not Threads.
import shutil

source_path = r"\\mynetworkshare"
dest_path = r"C:\TEMP"
file_name = "\\myfile.txt"
shutil.copyfile(source_path + file_name, dest_path + file_name)
https://docs.python.org/3/library/shutil.html
shutil offers a high-level copy function that uses the OS layer to copy. It is your best bet for this scenario.

Python 3.5 non blocking functions

I have a fairly large python package that interacts synchronously with a third party API server and carries out various operations with the server. Additionally, I am now also starting to collect some of the data for future analysis by pickling the JSON responses. After profiling several serialisation/database methods, using pickle was the fastest in my case. My basic pseudo-code is:
while True:
    do_existing_api_stuff()
    # additional data pickling
    data = {'info': []}  # there are multiple keys in real version!
    if pickle_file_exists:
        data = unpickle_file()
    data['info'].append(new_data)
    pickle_data(data)
    if len(data['info']) >= 100:  # file size limited for read/write speed
        create_new_pickle_file()
    # intensive section...
    # move files from "wip" (Work In Progress) dir to "complete"
    if number_of_pickle_files >= 100:
        compress_pickle_files()  # with lzma
        move_compressed_files_to_another_dir()
My main issue is that the compressing and moving of the files takes several seconds to complete and is therefore slowing my main loop. What is the easiest way to call these functions in a non-blocking way without any major modifications to my existing code? I do not need any return from the function, however it will raise an error if anything fails. Another "nice to have" would be for the pickle.dump() to also be non-blocking. Again, I am not interested in the return beyond "did it raise an error?". I am aware that unpickle/append/re-pickle every loop is not particularly efficient, however it does avoid data loss when the api drops out due to connection issues, server errors, etc.
I have zero knowledge on threading, multiprocessing, asyncio, etc and after much searching, I am currently more confused than I was 2 days ago!
FYI, all of the file related functions are in a separate module/class, so that could be made asynchronous if necessary.
EDIT:
There may be multiple calls to the above functions, so I guess some sort of queuing will be required?
The easiest solution is probably the threading standard library package. This will allow you to spawn a thread to do the compression while your main loop continues.
There is almost certainly quite a bit of 'dead time' in your existing loop waiting for the API to respond, and conversely there is quite a bit of time spent doing the compression when you could usefully be making another API call. For this reason I'd suggest separating these two aspects. There are lots of good tutorials on threading, so I'll just describe a pattern you could aim for:
Keep the API call and the pickling in the main loop, but add a step which passes the file path of each pickle to a queue after it is written
Write a function which takes the queue as its input and works through the file paths, performing the compression
Before starting the main loop, start a thread with the new function as its target, as sketched below
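A minimal sketch of that pattern follows; compress_and_move() is a hypothetical wrapper around your existing compression and move functions, adapted to work on a single path, and the sentinel scheme is illustrative.
import queue
import threading

to_compress = queue.Queue()

def compression_worker(q):
    while True:
        path = q.get()       # blocks until the main loop enqueues a path
        if path is None:     # sentinel value used to shut the worker down
            break
        try:
            compress_and_move(path)   # your existing lzma + move step
        except Exception as exc:
            print(f'compression failed for {path}: {exc}')
        finally:
            q.task_done()

worker = threading.Thread(target=compression_worker, args=(to_compress,), daemon=True)
worker.start()

# in the main loop, instead of compressing inline:
#     to_compress.put(path_of_finished_pickle)
# and on shutdown:
#     to_compress.put(None)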

How to know when the disk is ready after calling flush() on the stream?

I have a file called foo.txt with 1B rows.
I want to apply an operation on each row that produces 10 new rows. Output is expected to be around 10B rows.
To increase the speed and the IO, foo.txt is on DiskA and bar.txt on DiskB (different drives - physically speaking).
DiskB is going to be the limiting factor. Because there's a lot of rows to write, I added a large buffer when I write to DiskB.
My question is: when I call flush() on the file written to DiskB, the buffer of the file handle is flushed to the hard drive. It seems to be a non-blocking call, since the call returns but I can still see that the disk is writing and its busy indicator is at 100%. After a couple of seconds, the indicator goes back to 0%. Is there a way in Python to wait for the disk to finish? Ideally, I would like flush() to be a blocking call. The only solution I see now is to add an arbitrary sleep() and hope that the disk is ready.
Here's a snippet to show it visually (it's a bit more complicated in practice, as bar.txt is not just one file but thousands of files, so the IO efficiency is very poor):
import io

with open('bar.txt', 'w', buffering=100 * io.DEFAULT_BUFFER_SIZE) as w:
    with open('foo.txt') as r:
        for line in r:
            # writes each line of foo 10 times in bar.
            for i in range(10):
                w.write(line)
            # w.flush()
I think there are several issues.
"when the disk is ready" had to be defined very clearly.
your OS, file system and configuration might be important
your exact use case might be important
Knowing when data is written to the disk will have different answers depending on the OS, the file system, and the OS/filesystem configuration.
Please note that the following situations are not (or might not be) identical:
to know when your process can write the next bytes without blocking
know when another process can read the bytes from a file (might be after a flush)
know when the OS wrote the last byte to the disk controller (when all write caches are flushed)
know when you could cut the power without losing data / when data was really written to the disk (when your disk controller flushed out its buffers)
The main question is really: why exactly do you have to know when the last byte has been written?
If your motivation is performance, then perhaps it's just good enough to have two threads:
a reading and processing thread that places the data to write in a queue (queue.Queue) with a maximum number of entries; when the queue reaches that size, the reading/processing thread will block
a writing thread that just reads from the queue and writes to disk.
If the above is your case and you have never used threading or queue.Queue, I can expand this answer; just tell me.
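In the meantime, here is a minimal sketch of that two-thread pattern, reusing the foo.txt/bar.txt names from the question; the queue size, sentinel, and buffer size are illustrative choices.
import io
import queue
import threading

MAX_PENDING = 1000  # bounded queue: the reader blocks when the writer falls behind

def reader(out_q):
    with open('foo.txt') as src:
        for line in src:
            for _ in range(10):      # the "1 row in, 10 rows out" step
                out_q.put(line)      # blocks while the queue is full
    out_q.put(None)                  # sentinel: no more data

def writer(in_q):
    with open('bar.txt', 'w', buffering=100 * io.DEFAULT_BUFFER_SIZE) as dst:
        while True:
            line = in_q.get()
            if line is None:
                break
            dst.write(line)

q = queue.Queue(maxsize=MAX_PENDING)
t_read = threading.Thread(target=reader, args=(q,))
t_write = threading.Thread(target=writer, args=(q,))
t_read.start()
t_write.start()
t_read.join()
t_write.join()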
However, if you say that writing / flushing is not / never blocking, then this wouldn't help.
Just for fun you could implement the above threads and have a third thread periodically check the size of the queue to see whether writing is really the bottleneck. If it were, then your FIFO should be (almost) full most of the time.
Comment after first feedback:
You're running on Linux, writing to an SSD with ext4.
It seems, though I'm still not sure, that a more representative example than the one in the question would be a script that just writes to N files in an alternating manner at different data rates.
I still have the impression that increased write buffer sizes and letting the OS do the rest should give you performance that is difficult to improve on with manual intervention.
Disabling journaling on the disk might improve performance, though.
import io

writers = []
writers.append((open("f1", "w", buffering=100 * io.DEFAULT_BUFFER_SIZE), "a"))
writers.append((open("f2", "w", buffering=100 * io.DEFAULT_BUFFER_SIZE), "b" * 10000))
writers.append((open("f3", "w", buffering=100 * io.DEFAULT_BUFFER_SIZE), "c" * 100))
...
writers.append((open("f1000", "w", buffering=100 * io.DEFAULT_BUFFER_SIZE), "a" * 200))

for n in range(x):
    # Normally this is where you would read data from a file,
    # analyse the data and write some data to one or multiple writers.
    # As a very approximate simulation I just write to the writers in
    # data chunks in alternating order.
    for writer, data in writers:
        writer.write(data)
        # this is the question:
        # Can I write lines of the following nature, that will increase
        # the write rate?
        if some_condition:
            writer.flush()
Would this model your problem? (I know that in reality the write rates of a writer won't be constant and that the order in which the writers write is random.)
I have the impression I am missing something.
Why should these flushes accelerate anything?
This is an SSD. It doesn't have any mechanical delay from waiting for the disk to spin to a certain place. The buffering will only write to the file once there is 'enough data worth being written'.
What I'm also confused about is that you say flush() is non-blocking.
A buffered write just puts data in a buffer and calls flush() when the buffer is full, which means that write() would also be non-blocking.
If everything were non-blocking, your process would lose no time on writing and there wouldn't be anything to optimize.
So I guess that write() and flush() are blocking calls, but not blocking in the way you expect them to block. They probably block until the OS has accepted the data for writing (which does not mean the data has been written).
The real writes to the disk will happen whenever the OS decides to do so.
There are write caches involved, the disk controller might add some other layers of write caching / write reordering.
To check this, you can add debug code of the following kind around every write.
import time

t_max = 0
...
# this has to be done for every `write` or `flush`,
# or at least for some representative calls of them
t0 = time.time()
bla.write()
t = time.time() - t0
t_max = max(t_max, t)
You will probably end up with a t_max that indicates that .write() is blocking.

Download webpage source up to a keyword

I'm looking to download the source code of a webpage up to a particular keyword. (The websites are all from a forum, so I'm only interested in the source code for the first post's user details.) This means I only need to download the source code until I find "<!-- message, attachments, sig -->" for the first time.
How to get webpage title without downloading all the page source
That question, although in a different language, is quite similar to what I'm looking to do, but I'm not that experienced with Python, so I can't figure out how to recode that answer into Python.
First, be aware that you may have already gotten all or most of each page into your OS buffers, NIC, router, or ISP before you cancel, so there may be no benefit at all to doing this. And there will be a cost—you can't reuse connections if you close them early; you have to recv smaller pieces at a time if you want to be able to cancel early; etc.
If you have a rough idea of how many bytes you probably need to read (better to often go a little bit over than to sometimes go a little bit under), and the server handles HTTP range requests, you may want to try that instead of requesting the entire file and then closing the socket early.
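With the requests library, for example, a range request is just an extra header; the URL and byte count below are placeholders.
import requests

URL = 'http://example.com/forum/thread-12345'   # placeholder URL
# Ask for roughly the first 16KB; err on the side of requesting a bit too much.
resp = requests.get(URL, headers={'Range': 'bytes=0-16383'})
# A 206 Partial Content status means the server honoured the range;
# a plain 200 means it ignored the header and sent the whole page anyway.
html_prefix = resp.text
print(resp.status_code, len(resp.content))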
But, if you want to know how to close the socket early:
urllib2.urlopen, requests, and most other high-level libraries are designed around the idea that you're going to want to read the whole file. They buffer the data as it comes in, to give you a high-level file-like interface. On top of that, their API is blocking. Neither of those is what you want. You want to get the bytes as they come in, as fast as possible, and when you close the socket, you want that to be as soon after the recv as possible.
So, you may want to consider using one of the Python wrappers around libcurl, which gives you a pretty good balance between power/flexibility and ease-of-use. For example, with pycurl:
import pycurl

buf = b''
def callback(newbuf):
    global buf
    buf += newbuf
    if b'<div style="float: right; margin-left: 8px;">' in buf:
        return 0  # returning a short count makes libcurl abort the transfer
    return len(newbuf)

c = pycurl.Curl()
c.setopt(c.URL, 'http://curl.haxx.se/dev/')
c.setopt(c.WRITEFUNCTION, callback)
try:
    c.perform()
except Exception as e:
    print(e)
c.close()
print(len(buf))
As it turns out, this ends up reading 12259/12259 bytes on that test. But if I change it to a string that comes in the first 2650 bytes, I only read 2650/12259 bytes. And if I fire up Wireshark and instrument recv, I can see that, although the next packet did arrive at my NIC, I never actually read it; I closed the socket immediately after receiving 2650 bytes. So, that might save some time… although probably not too much. More importantly, though, if I throw it at a 13MB image file and try to stop after 1MB, I only receive a few KB extra, and most of the image hasn't even made it to my router yet (although it may have all left the server, if you care at all about being nice to the server), so that definitely will save some time.
Of course a typical forum page is a lot closer to 12KB than to 13MB. (This page, for example, was well under 48KB even after all my rambling.) But maybe you're dealing with atypical forums.
If the pages are really big, you may want to change the code to only check buf[-len(needle):] + newbuf instead of the whole buffer each time. Even with a 13MB image, searching the whole thing over and over again didn't add much to the total runtime, but it did raise my CPU usage from 1% to 9%…
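That change would only touch the callback; something like this, using the same hypothetical needle as above:
needle = b'<div style="float: right; margin-left: 8px;">'

def callback(newbuf):
    global buf
    # search only the seam between the old data and the new chunk,
    # not the whole (possibly multi-MB) buffer on every call
    window = buf[-len(needle):] + newbuf
    buf += newbuf
    if needle in window:
        return 0
    return len(newbuf)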
One last thing: If you're reading from, say, 500 pages, doing them concurrently—say, 8 at a time—is probably going to save you a lot more time than just canceling each one early. Both together might well be better than either on its own, so that's not an argument against doing this—it's just a suggestion to do that as well. (See the receiver-multi.py sample if you want to let curl handle the concurrency for you… or just use multiprocessing or concurrent.futures to use a pool of child processes.)
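As a rough sketch of that combination, here is a fetch_prefix() helper built around the pycurl code above, fanned out over a pool of child processes; the URLs, worker count, and helper name are placeholders.
from concurrent.futures import ProcessPoolExecutor
import pycurl

NEEDLE = b'<!-- message, attachments, sig -->'

def fetch_prefix(url):
    # fetch one page, aborting the transfer soon after NEEDLE shows up
    buf = b''
    def callback(newbuf):
        nonlocal buf
        buf += newbuf
        return 0 if NEEDLE in buf else len(newbuf)
    c = pycurl.Curl()
    c.setopt(c.URL, url)
    c.setopt(c.WRITEFUNCTION, callback)
    try:
        c.perform()
    except pycurl.error:
        pass  # aborting early makes perform() raise; that's expected here
    c.close()
    return buf

if __name__ == '__main__':
    urls = ['http://example.com/forum/thread-%d' % i for i in range(500)]  # placeholders
    with ProcessPoolExecutor(max_workers=8) as pool:
        pages = list(pool.map(fetch_prefix, urls))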

How to stream a video online while it is being generated and a failed CGI approach

I have a program that can generate video in real time. Now I would like to stream this video online while it is being generated. Does anybody know an easy way to do it?
I am describing a CGI approach I tried that did not work, but please note that I am open to any option that would achieve my goal. I am just wondering if anybody knows why my approach doesn't work and how I should fix it.
I set the content-type to mpeg, for example, and print out a chunk of data from the mpeg file periodically. But the video only lasts for a very short amount of time and then stops streaming. My code is something like this (in Python):
print "Content-type: video/mpeg"
print
f = open("test2.mpg")
while (True):
st = f.read(1024*1024)
sys.stdout.write(st)
time.sleep(0.5)
The following, though, works fine, and I really don't see why the outputs of these two programs are different. But obviously I can't use this approach, since I can't wait until the entire file is generated before reading it in.
print "Content-type: video/mpeg"
print
f = open("test2.mpg")
print f.read()
What type of file is test2.mpg?
If it's an mpeg4 file your approach won't work because you will have headers at the start or end of the file.
If your file is an mpeg2 transport stream, then this should work.
You're probably hitting end-of-file, so your loop is failing, either with an EOFError or by crashing somewhere. If the video is being generated in real time, then unless test2.mpg is a FIFO pipe (created using mkfifo, in which case you can only have one reader at a time), reading from the file may return no data, and your loop is likely to run much, much faster than your video data is being saved. So you need a strategy to handle EOF.
Also, you need to make sure to flush your output -- both after the sys.stdout.write() line in this program, and after the video stream in the other program. Since your loop has no end condition and no output, and you may never end up writing any data, it could be that after one iteration of the loop, something fails, and the webserver discards the buffered data.
Additionally, reading a constant size of 1MB at a time may cause latency issues. For better latency it's good to use smaller sizes, while for better quality and throughput you can use larger sizes. That said, the latency point is moot if the program generating the video, your CGI script, and the webserver aren't all flushing at regular intervals.
I'd also suggest looking into "select" or "poll"/epoll -- either of those methods will give you better control over reading, and might help you solve the end-of-file issue by sleeping until data is available. If you find yourself needing to sleep(0.5), you might be better off using select/poll correctly.
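Putting those pieces together, a sketch in the same Python 2 CGI style as the snippet above might look like this; the chunk size, sleep interval, and EOF strategy are illustrative, and a real script would need some way to decide when the stream is actually finished.
import sys
import time

CHUNK = 64 * 1024                    # smaller chunks keep latency down

# (Content-type headers printed exactly as in the question's script.)
f = open("test2.mpg", "rb")
while True:
    st = f.read(CHUNK)
    if not st:
        # EOF for now: the generator may still be appending, so wait and retry
        time.sleep(0.1)
        continue
    sys.stdout.write(st)
    sys.stdout.flush()               # push the data on to the webserver right away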

Python twisted asynchronous write using deferred

With regard to the Python Twisted framework, can someone explain to me how to asynchronously write a very large data string to a consumer, say, the protocol.transport object?
I think what I am missing is a write(data_chunk) function that returns a Deferred. This is what I would like to do:
data_block = get_lots_and_lots_data()
CHUNK_SIZE = 1024 # write 1-K at a time.
def write_chunk(data, i):
d = transport.deferredWrite(data[i:i+CHUNK_SIZE])
d.addCallback(write_chunk, data, i+1)
write_chunk(data, 0)
But, after a day of wandering around in the Twisted API/documentation, I can't seem to locate anything like a deferredWrite equivalent. What am I missing?
As Jean-Paul says, you should use IProducer and IConsumer, but you should also note that the lack of deferredWrite is a somewhat intentional omission.
For one thing, creating a Deferred for potentially every byte of data that gets written is a performance problem: we tried it in the web2 project and found that it was the most significant performance issue with the whole system, and we are trying to avoid that mistake as we backport web2 code to twisted.web.
More importantly, however, having a Deferred which gets returned when the write "completes" would provide a misleading impression: that the other end of the wire has received the data that you've sent. There's no reasonable way to discern this. Proxies, smart routers, application bugs and all manner of network contrivances can conspire to fool you into thinking that your data has actually arrived on the other end of the connection, even if it never gets processed. If you need to know that the other end has processed your data, make sure that your application protocol has an acknowledgement message that is only transmitted after the data has been received and processed.
The main reason to use producers and consumers in this kind of code is to avoid allocating memory in the first place. If your code really does read all of the data that it's going to write to its peer into a giant string in memory first (data_block = get_lots_and_lots_data() pretty directly implies that) then you won't lose much by doing transport.write(data_block). The transport will wake up and send a chunk of data as often as it can. Plus, you can simply do transport.write(hugeString) and then transport.loseConnection(), and the transport won't actually disconnect until either all of the data has been sent or the connection is otherwise interrupted. (Again: if you don't wait for an acknowledgement, you won't know if the data got there. But if you just want to dump some bytes into the socket and forget about it, this works okay.)
If get_lots_and_lots_data() is actually reading a file, you can use the included FileSender class. If it's something which is sort of like a file but not exactly, the implementation of FileSender might be a useful example.
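For reference, a minimal sketch of using FileSender; send_file() is just an illustrative helper, and it assumes transport is a connected Twisted transport (which implements IConsumer).
from twisted.protocols.basic import FileSender

def send_file(transport, path):
    # FileSender registers itself as a producer on the transport and feeds
    # it chunks only when the reactor says the transport can take more.
    f = open(path, 'rb')
    d = FileSender().beginFileTransfer(f, transport)

    def done(last_byte_sent):
        f.close()
        transport.loseConnection()

    d.addCallback(done)
    return d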
The way large amounts of data is generally handled in Twisted is using the Producer/Consumer APIs. This doesn't give you a write method that returns a Deferred, but it does give you notification about when it's time to write more data.
