Download webpage source up to a keyword - python

I'm looking to download the source code from a website up to a particular keyword (The websites are all from a forum so I'm only interested in the source code for the first posts user details) so I only need to download the source code until I find "<!-- message, attachments, sig -->" for the first time in the source code.
How to get webpage title without downloading all the page source
This question although in a different language is quite similar to what I'm looking to do although I'm not that experienced with python so I can't figure out how to recode that answer into python.

First, be aware that you may have already gotten all or most of each page into your OS buffers, NIC, router, or ISP before you cancel, so there may be no benefit at all to doing this. And there will be a cost—you can't reuse connections if you close them early; you have to recv smaller pieces at a time if you want to be able to cancel early; etc.
If you have a rough idea of how many bytes you probably need to read (better to often go a little bit over than to sometimes go a little bit under), and the server handles HTTP range requests, you may want to try that instead of requesting the entire file and then closing the socket early.
But, if you want to know how to close the socket early:
urllib2.urlopen, requests, and most other high-level libraries are designed around the idea that you're going to want to read the whole file. They buffer the data as it comes in, to give you a high-level file-like interface. On top of that, their API is blocking. Neither of that is what you want. You want to get the bytes as they come in, as fast as possible, and when you close the socket, you want that to be as soon after the recv as possible.
So, you may want to consider using one of the Python wrappers around libcurl, which gives you a pretty good balance between power/flexibility and ease-of-use. For example, with pycurl:
import pycurl
buf = ''
def callback(newbuf):
global buf
buf += newbuf
if '<div style="float: right; margin-left: 8px;">' in buf:
return 0
return len(newbuf)
c = pycurl.Curl()
c.setopt(c.URL, 'http://curl.haxx.se/dev/')
c.setopt(c.WRITEFUNCTION, callback)
try:
c.perform()
except Exception as e:
print(e)
c.close()
print len(buf)
As it turns out, this ends up reading 12259/12259 bytes on that test. But if I change it to a string that comes in the first 2650 bytes, I only read 2650/12259 bytes. And if I fire up Wireshark and instrument recv, I can see that, although the next packet did arrive at my NIC, I never actually read it; I closed the socket immediately after receiving 2650 bytes. So, that might save some time… although probably not too much. More importantly, though, if I throw it at a 13MB image file and try to stop after 1MB, I only receive a few KB extra, and most of the image hasn't even made it to my router yet (although it may have all left the server, if you care at all about being nice to the server), so that definitely will save some time.
Of course a typical forum page is a lot closer to 12KB than to 13MB. (This page, for example, was well under 48KB even after all my rambling.) But maybe you're dealing with atypical forums.
If the pages are really big, you may want to change the code to only check buf[-len(needle):] + newbuf instead of the whole buffer each time. Even with a 13MB image, searching the whole thing over and over again didn't add much to the total runtime, but it did raise my CPU usage from 1% to 9%…
One last thing: If you're reading from, say, 500 pages, doing them concurrently—say, 8 at a time—is probably going to save you a lot more time than just canceling each one early. Both together might well be better than either on its own, so that's not an argument against doing this—it's just a suggestion to do that as well. (See the receiver-multi.py sample if you want to let curl handle the concurrency for you… or just use multiprocessing or concurrent.futures to use a pool of child processes.)

Related

Read a file using threads

I try to write a python program that send files from one PC to another using python's sockets. But when file size increase it takes lots of time. Is it possible to read lines of a file sequentially using threads?
The concepts which I think is as follows:
Each thread separately and sequentially read lines from file and send it over socket. Is it possible to do? Or do you have any suggestion for it?
First, if you want to speed this up as much as possible without using threads, reading and sending a line at a time can be pretty slow. Python does a great job of buffering up the file to give you a line at a time for reading, but then you're sending tiny 72-byte packets over the network. You want to try to send at least 1.5KB at a time when possible.
Ideally, you want to use the sendfile method. Python will tell the OS to send the whole file over the socket in whatever way is most efficient, without getting your code involved at all. Unfortunately, this doesn't work on Windows; if you care about that, you may want to drop to the native APIs1 directly with pywin32 or switch to a higher-level networking library like twisted or asyncio.
Now, what about threading?
Well, reading a line at a time in different threads is not going to help very much. The threads have to read sequentially, fighting over the read pointer (and buffer) in the file object, and they presumably have to write to the socket sequentially, and you probably even need a mutex to make sure they write things in order. So, whichever one of those is slowest, all of your threads are going to end up waiting for their turn.2
Also, even forgetting about the sockets: Reading a file in parallel can be faster in some situations on modern hardware, but in general it's actually a lot slower. Imagine the file is on a slow magnetic hard drive. One thread is trying to read the first chunk, the next thread is trying to read the 64th chunk, the next thread is trying to read the 4th chunk… this means you spend more time seeking the disk head back and forth than actually reading data.
But, if you think you might be in one of those situations where parallel reads might help, you can try it. It's not trivial, but it's not that hard.
First, you want to do binary reads of fixed-size chunks. You're going to need to experiment with different sizes—maybe 4KB is fastest, maybe 1MB… so make sure to make it a constant you can easily change in just one place in the code.
Next, you want to be able to send the data as soon as you can get it, rather than serializing. This means you have to send some kind of identifier, like the offset into the file, before each chunk.
The function will look something like this:
def sendchunk(sock, lock, file, offset):
with lock:
sock.send(struct.pack('>Q', offset)
sent = sock.sendfile(file, offset, CHUNK_SIZE)
if sent < CHUNK_SIZE:
raise OopsError(f'Only sent {sent} out of {CHUNK_SIZE} bytes')
… except that (unless your files actually are all multiples of CHUNK_SIZE) you need to decide what you want to do for a legitimate EOF. Maybe send the total file size before any of the chunks, and pad the last chunk with null bytes, and have the receiver truncate the last chunk.
The receiving side can then just loop reading 8+CHUNK_SIZE bytes, unpacking the offset, seeking, and writing the bytes.
1. See TransmitFile—but in order to use that, you have to know about how to go between Python-level socket objects and Win32-level HANDLEs, and so on; if you've never done that, there's a learning curve—and I don't know of a good tutorial to get you started..
2. If you're really lucky, and, say, the file reads are only twice as fast as the socket writes, you might actually get a 33% speedup from pipelining—that is, only one thread can be writing at a time, but the threads waiting to write have mostly already done their reading, so at least you don't need to wait there.
Not Threads.
source_path = r"\\mynetworkshare"
dest_path = r"C:\TEMP"
file_name = "\\myfile.txt"
shutil.copyfile(source_path + file_name, dest_path + file_name)
https://docs.python.org/3/library/shutil.html
Shutil offers a high level copy function that uses the OS layer to copy. It is your best bet for this scenario.

set_sequential_download() and set_piece_deadline() in libtorrent

i'm working on my project which is to make a streaming client over libtorrent.
i'm using the python client (python binding).
i searched a lot about these functions set_sequential_download() and set_piece_deadline() and i couldn't find a good answer on how to force download pieces in order, which means first piece 1 and then 2,3,4 etc..
i saw people are asking this in forums, but none of them got a good answer on the changes need to be done in order it to succeed.
i understood that the set_sequential_download() just asks for the pieces in order but in fact they are randomly downloaded. i tried to change the deadline of the pieces using set_piece_deadline() , increment each piece but it doesn't work for me at all.
** UPDATE
the goal i'm trying to acomplish , it's downloading one piece at a time so i can make a streaming throgh torrents.
i hope some of you can help me,
thanks Ben.
set_sequential_download() will request pieces in order. However:
all peers may not have all pieces. If the next piece you want to download is 3 and one of your peers doesn't have 3 but the next it has is 5, libtorrent will start requesting blocks from piece 5 from that peer.
peers provide varying upload rates, which means that some peers will satisfy your request sooner than others.
This makes it possible for the pieces to complete out-of-order.
set_piece_deadline() is a more flexible way to specify piece priority. It supports arbitrary range requests (as described by Jacob Zelek). Its main feature, though, is that it uses a different approach to requesting blocks. Instead of considering a peer at a time, and asking "what should I request from this peer", it considers a piece at a time, asking "which peer should I request this block from".
This makes it deliberately attempt to make pieces complete in the order of their deadlines. It is still an estimate based on historical download rates from peers, and if the bottleneck for download rates is your own download capacity, it may be very difficult to make predictions of future download rates for peers. A few important things to keep in mind when using the `set_piece_deadline()`` API are:
It's not important that the deadline is in the future. If the deadline cannot be met given the current download or upload capacity, the pieces will be prioritized in the order they were asked to be completed.
If a deadline is far out in the future, libtorrent may wait to prioritize it until it believe it needs to request it to make the deadline. If you're streaming a large file, and you know the bit-rate, you can set up deadlines for every piece, and if your capacity is higher than the bitrate, you'll still request some pieces in rarest-first order. Improves swarm quality.
When streaming data, it's absolutely critical to read-ahead. If you don't set the deadline until you want the piece, you'll always fall behind. There's typically a pretty long round-trip between requesting a piece and completing it. If you don't keep the request pipe full of deadline-pieces, libtorrent will start requesting other pieces again, and you'll get non-prioritized pieces interleaved with your high-priority pieces. You should probably keep a few seconds and at least a few pieces as read-ahead. For video, I would imagine tens of megabytes is appropriate (but experimentation and measurement is the best way to tweak it).
If you are in fact looking to stream video to a player or web browser over HTTP, you may want to take a look at (or use and submit pull requests to):
https://github.com/arvidn/libtorrent-webui/blob/master/src/file_downloader.cpp
that's a file-downloader provider that fits into simple http framework in that repository.
UPDATE:
If all you want is to guarantee that piece 1 completes before piece 2 (at any cost, specifically very poor performance), you can set the priority of all pieces to 0, except for the one piece you want to download. Once it completes, you'll be notified by an alert and you can set the priority of the next piece you want to 1. And so on.
This will be incredibly slow, since you'll pause the download constantly, and be in constant end-game mode (where you may download the same block from multiple peers, if one is slow). For instance, if you have more peers than there are blocks in one piece, you will leave download bandwidth unused, by not being able to request from all peers.
I've ran into the same problem as you. Setting a torrent to sequential download means the pieces will be downloaded in a somewhat ordered fashion. This may be the intuitive solution for streaming. However, streaming video is more complicated then just downloading all the pieces in order.
Video files come in different containers (e.g. mkv, mp4, avi) and different codes (h264, theora, etc). Some codecs/containers store metadata/headers in different locations in a file. I can't remember off the top of my head but a certain container/codec stores all header information at the end of the file. Such a file may not stream well if downloaded sequentially.
Unless you write the code for determining which pieces are needed to start streaming, you will have to rely on an existing mechanisms. Take for example Peerflix which spawns a browser video player, VLC, of Mplayer. These applications have a good idea of what byte ranges they need for various containers/codecs. When Peerflix launches VLC to play, lets say, an AVI file, VLC will attempt to read the first several bytes and last several bytes (headers).
The genius behind Peerflix is that it tries to serve the video file through it's own web server and therefore knows what byte ranges of the file VLC is seeking. It then determines which pieces the byte ranges fall into and prioritizes those pieces. Peerflix uses some Node.js BitTorrent library, whose exact piece prioritization mechanisms are unknown to me. However, in the case of libtorrent-rasterbar, the set_piece_deadline() function allows you to signal the library to what pieces you need. In my experience, once I determined the pieces needed, I would call set_piece_deadline() with a short deadline (50ms or so) and wait for the arrival. Please note that using set_piece_dealine() is incompatible with sequential downloads (just set them to false).
One thing to note, libtorrent-rasterbar will not write the piece to the hard drive as soon as it gets it. This is a trap I fell into because I tried to read that byte range from the file when the piece arrived. For this you will need to run a thread to catch the alerts that libtorrent-rasterbar passes to your application. More specifically you will receive the raw binary data for that piece in a read_piece_alert.

Improving speed of xmlrpclib

I'm working with a device that is essentially a black box, and the only known communication method for it is XML-RPC. It works for most needs, except for when I need to execute two commands very quickly after each other. Due to the overhead and waiting for the RPC response, this is not as quick as desired.
My main question is, how does one reduce this overhead to make this functionality possible? I know the obvious solution is to ditch XML-RPC, but I don't think that's possible for this device, as I have no control over implementing any other protocols from the "server". This also makes it impossible to do a MultiCall, as I can not add valid instructions for MultiCall. Does MultiCall have to be implemented server side? For example, if I have method1(), method2(), and method3() all implemented by the server already, should this block of code work to execute them all in one reply? I'd assume no from my testing so far, as the documentation shows examples where I need to initialize commands on the server side.
server=xmlrpclib.ServerProxy(serverURL)
multicall=xmlrpclib.MultiCall(server)
multicall.method1()
multicall.method2()
mutlicall.method3()
multicall()
Also, looking through the source of xmlrpclib, I see references to a "FastParser" as opposed to a default one that is used. However, I can not determine how to enable this parser over the default. Additionally, the comment on this answer mentions that it parses one character at a time. I believe this is related, but again, no idea how to change this setting.
Unless the bulk size of your requests or responses are very large, it's unlikely that changing the parser will affect the turnaround time (since CPU is much faster than network).
You might want to consider, if possible, sending more than one command to the device without waiting for the response from the first one. If the device can handle multiple requests at once, then this may be of benefit. Even if the device only handles requests in sequence, you can still have the next request waiting at the device so that there is no delay after processing the previous one. If the device serialises requests in this way, then that's goingn to be about the best you can do.

How to stream a video online while it is being generated and a failed CGI approach

I have a program that can generate video in real time. Now I would like to stream this video online while it is being generated. Does anybody know an easy way to do it?
I am describing a CGI approach I tried but did not work, but please note that I am open to all options that would achieve my goal. I am just wondering if anybody knows why my approach doesn't work and how I should fix it
I set the content-type to mpeg for example, and print out a chunk of data in the mpeg file periodically. But the video only lasts for very short amount of time and stop streaming. My code is something like this (in Python).
print "Content-type: video/mpeg"
print
f = open("test2.mpg")
while (True):
st = f.read(1024*1024)
sys.stdout.write(st)
time.sleep(0.5)
Though this would work fine. I really don't see why the output of these two programs are different. But obviously I can't use this approach since i can't wait until the entire file is generated before reading in.
print "Content-type: video/mpeg"
print
f = open("test2.mpg")
print f.read()
What type of file is test2.mpg?
If it's an mpeg4 file your approach won't work because you will have headers at the start or end of the file.
If your file is an mpeg2 transport stream, then this should work.
You're probably hitting end-of-file and so your loop is failing, either with EOFError or crashing somewhere. If the video is being generated in real time, unless test2.mpg is a FIFO pipe (created using mkfifo -- in which case you can only have one reader at a time) -- reading from the pipe may return no data, and your loop is likely to run much, much faster than your video data is being saved. So you need a strategy to handle EOF.
Also, you need to make sure to flush your output -- both after the sys.stdout.write() line in this program, and after the video stream in the other program. Since your loop has no end condition and no output, and you may never end up writing any data, it could be that after one iteration of the loop, something fails, and the webserver discards the buffered data.
Additionally, reading a constant size of 1MB at a time may cause latency issues. For better latency, it's good to use smaller sizes; however, for better quality and throughput, you can use larger sizes. However, the latency point is moot if the program generating the video, your cgi script, or the webserver aren't all flushing at regular intervals.
I'd also suggest looking into "select" or "poll"/epoll -- either of those methods will give you better control over reading, and might help you solve the end-of-file issue by sleeping until data is available. If you find yourself needing to sleep(0.5), you might be better off using select/poll correctly.

Python twisted asynchronous write using deferred

With regard to the Python Twisted framework, can someone explain to me how to write asynchronously a very large data string to a consumer, say the protocol.transport object?
I think what I am missing is a write(data_chunk) function that returns a Deferred. This is what I would like to do:
data_block = get_lots_and_lots_data()
CHUNK_SIZE = 1024 # write 1-K at a time.
def write_chunk(data, i):
d = transport.deferredWrite(data[i:i+CHUNK_SIZE])
d.addCallback(write_chunk, data, i+1)
write_chunk(data, 0)
But, after a day of wandering around in the Twisted API/Documentation, I can't seem to locate anything like the deferredWrite equivalence. What am I missing?
As Jean-Paul says, you should use IProducer and IConsumer, but you should also note that the lack of deferredWrite is a somewhat intentional omission.
For one thing, creating a Deferred for potentially every byte of data that gets written is a performance problem: we tried it in the web2 project and found that it was the most significant performance issue with the whole system, and we are trying to avoid that mistake as we backport web2 code to twisted.web.
More importantly, however, having a Deferred which gets returned when the write "completes" would provide a misleading impression: that the other end of the wire has received the data that you've sent. There's no reasonable way to discern this. Proxies, smart routers, application bugs and all manner of network contrivances can conspire to fool you into thinking that your data has actually arrived on the other end of the connection, even if it never gets processed. If you need to know that the other end has processed your data, make sure that your application protocol has an acknowledgement message that is only transmitted after the data has been received and processed.
The main reason to use producers and consumers in this kind of code is to avoid allocating memory in the first place. If your code really does read all of the data that it's going to write to its peer into a giant string in memory first (data_block = get_lots_and_lots_data() pretty directly implies that) then you won't lose much by doing transport.write(data_block). The transport will wake up and send a chunk of data as often as it can. Plus, you can simply do transport.write(hugeString) and then transport.loseConnection(), and the transport won't actually disconnect until either all of the data has been sent or the connection is otherwise interrupted. (Again: if you don't wait for an acknowledgement, you won't know if the data got there. But if you just want to dump some bytes into the socket and forget about it, this works okay.)
If get_lots_and_lots_data() is actually reading a file, you can use the included FileSender class. If it's something which is sort of like a file but not exactly, the implementation of FileSender might be a useful example.
The way large amounts of data is generally handled in Twisted is using the Producer/Consumer APIs. This doesn't give you a write method that returns a Deferred, but it does give you notification about when it's time to write more data.

Categories

Resources