I am trying to send multiple requests to different web pages. At the moment I am using the "requests" library with multithreading, because I have found it performs better than urllib2. Is it possible to load only part of a webpage? Do you have any other ideas for speeding up my requests beyond keep-alive and multithreading?
Thanks.
As you clarified in a comment:
Hi, I'm trying to extract several stock quotes and financial ratios from the Italian Stock Exchange website. Every page that I load is related to a specific company.
This means there aren't very many easy optimisations left to make. If the web pages themselves are very large and the data you want appears early on in the page, you might be able to avoid downloading some of the data by streaming the download: that is, set stream=True on the request and then use Response.iter_content() to read the body in chunks.
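A minimal sketch of that streaming idea, assuming the requests library; the URL and the marker string are placeholders rather than the real page:

    import requests

    # Stream the response and stop reading once the part of the page you need
    # has arrived, instead of downloading the whole document.
    resp = requests.get('http://example.com/quote-page', stream=True)

    data = b''
    for chunk in resp.iter_content(chunk_size=8192):
        data += chunk
        if b'id="quote"' in data:   # hypothetical marker for the data you want
            break                   # stop pulling down the rest of the page
    resp.close()                    # release the connection without reading the rest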
If you're fortunate, you might be able to take advantage of caching to reduce response times or sizes. Try plugging something like CacheControl into your Session objects to see if this improves anything.
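A minimal sketch of that caching idea, assuming the third-party cachecontrol package (pip install cachecontrol); the URL is a placeholder:

    import requests
    from cachecontrol import CacheControl

    # Wrap a normal requests Session so responses with caching headers
    # can be reused instead of re-downloaded.
    session = CacheControl(requests.Session())
    resp = session.get('http://example.com/quote-page')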
Otherwise, you're already getting almost as big an improvement as you can get in software alone. If the Italian Stock Exchange supports SPDY (they probably don't), using a SPDY library could improve things, but that rules out Requests (and possibly multithreading as well, for reasons that are totally tangential to this answer). Another outside-the-box option is to run on a machine closer to the web server providing the data.
Framework: Scrapy.
I am currently using a web-scraper but I am getting disconnected from the server.
The scraper will (eventually) scrape between 100k and 150k pages, with each page containing 11 fields of data to be scraped.
My idea is that the scraper will be used once per month.
Estimated size of the database upon completion is between 200 MB and 300 MB (not accounting for bandwidth).
I do not know if I need a paid proxy for this or if I can use free proxies.
Any advice (or a proxy provider suited to my needs) would be gratefully received.
There are several techniques to avoid being disconnected from the server you are scraping.
These are some of the common ones:
You can rotate user agents; there is a library for this, and there are tutorials on how to use user agents with Scrapy.
You can go to your settings.py and uncomment DOWNLOAD_DELAY = x (a minimal settings sketch follows below).
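A minimal sketch of those settings.py tweaks; DOWNLOAD_DELAY, RANDOMIZE_DOWNLOAD_DELAY and USER_AGENT are standard Scrapy settings, and the values here are just examples:

    # settings.py
    DOWNLOAD_DELAY = 2                 # wait a couple of seconds between requests
    RANDOMIZE_DOWNLOAD_DELAY = True    # jitter the delay so requests look less robotic
    USER_AGENT = 'Mozilla/5.0 (compatible; my-scraper)'   # placeholder user agent string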
Without a proxy you're very likely to have your IP address blocked and then even with proxies you may run into a CAPTCHA that prevents you from scraping pages.
For scraping 100k-150k pages per month, as you indicated, I would strongly recommend not using free proxies. The problem with free proxies is that they're incredibly unreliable: you never know who's using them, what they're being used for, or when they'll stop working, which could lead to any or all of your proxies being shut down or blocked, so your scraper could stop working at any given moment.
Paid proxies are certainly the way to go although they can get quite expensive and some of the proxy providers have been known to use shady techniques to obtain IP addresses.
However, https://htmlapi.io (my service) can bridge this gap for you, and it's free to get started (you don't even need to create an account). HtmlAPI returns the raw HTML contents of the page you're scraping. It handles rotating proxy servers for you automatically, defeats CAPTCHAs, renders JavaScript, and more.
All you have to do is call the API and extract the data you need from the webpage.
Try it from your command line:
curl "https://htmlapi.io/example.com"
I'm scraping websites for a research project, and for the first time I've encountered:
"dotdefender blocked your request."
I'm not doing anything malicious; just scraping basic information. Is it possible to let them know this and/or overcome the block?
Here's the site.
Some sites will block scraping even if it is not malicious. You can try to run the scraping through a proxy, but depending on how quickly you are scraping and the quality of the proxy, you may still eventually get blocked. If you are collecting a low volume of data the proxy should work, but if you are doing a larger quantity you may want to consider a premium service, i.e. an IP rotation service (think premium proxy).
You could also try Tor, but you may still run into speed issues.
For proxies, there are plenty of free and paid options, but the quality is hard to measure.
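If you do route the scraper through a proxy, a minimal sketch with the requests library (the proxy address and the target URL are placeholders):

    import requests

    # Route both HTTP and HTTPS traffic through one proxy endpoint.
    proxies = {
        'http': 'http://user:pass@proxy.example.com:8080',
        'https': 'http://user:pass@proxy.example.com:8080',
    }
    resp = requests.get('http://example.com/page-to-scrape', proxies=proxies, timeout=30)
    print(resp.status_code)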
I'd like to know if there is a way to get information from my banking website with Python. I'd like to retrieve my card history and display it, and possibly save it into a text document each month.
I have found the URLs, etc., to log in and get the information from the website, which work from a browser, but I have been using urllib2 to "open" the webpages from Python and I have a feeling it's not working because of some cookie or session issue.
I can get any information I want from a website that does not require a login with urllib2, and then save the actual HTML and go through it later, but I can't on my bank's website.
Any help would be appreciated
This falls under web scraping:
Web scraping is a standard task that can serve various needs.
Scraping data out of a secure website means HTTPS.
Handling HTTPS is not a problem with mechanize and BeautifulSoup.
urllib2 with an HTTPCookieJar also works fine (a minimal sketch follows below).
If managing the cookies is the problem, then I would recommend mechanize.
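A minimal sketch of the urllib2-plus-cookie-jar approach (Python 2); the login URL and form field names are hypothetical placeholders, not your bank's:

    import cookielib
    import urllib
    import urllib2

    # Keep cookies across requests so the session survives the login.
    cookie_jar = cookielib.CookieJar()
    opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cookie_jar))

    # Log in; the session cookie ends up in the jar automatically.
    login_data = urllib.urlencode({'username': 'me', 'password': 'secret'})
    opener.open('https://example.com/login', login_data)

    # Subsequent requests reuse the stored cookies.
    history_html = opener.open('https://example.com/account/history').read()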
Considering the case of your bank's site:
I would recommend not playing with your account.
If you must, it's not as easy as a normal secure or non-secure site.
These sites are designed to withstand such scripts.
Problems that you would face with this:
Bank sites will almost certainly have a CAPTCHA that is nearly impossible to bypass with a script unless you put in a great deal of effort.
Another problem you will definitely face is JavaScript: standard scripting solutions focus on managing cookies, HTML parsing, etc. To follow JavaScript-driven links you will have to process the JS in your Python script, which again takes a lot of effort.
Then there is AJAX, which again uses JavaScript to fetch data from the server after the page loads.
So this task will require a lot of effort.
Also, if you try this you risk having access to your account blocked, since banking sites are quick to lock accounts after 3-4 unsuccessful login or CAPTCHA attempts.
So think before you do it.
I'm working with a librarian to re-structure his organization's digital photography archive.
I've built a Python robot with Mechanize and BeautifulSoup to pull about 7000 poorly structured and mildly incorrect/incomplete documents from a collection. The data will be formatted for a spreadsheet he can use to correct it. Right now I'm guesstimating 7500 HTTP requests in total to build the search dictionary and then harvest the data, not counting mistakes and do-overs in my code, and then many more as the project progresses.
I assume there's some sort of built-in limit to how quickly I can make these requests, and even if there isn't I'll give my robot delays so it behaves politely toward the over-burdened web server(s). My question (admittedly impossible to answer with complete accuracy) is: about how quickly can I make HTTP requests before encountering a built-in rate limit?
I would prefer not to publish the URL for the domain we're scraping, but if it's relevant I'll ask my friend if it's okay to share.
Note: I realize this is not the best way to solve our problem (re-structuring/organizing the database) but we're building a proof-of-concept to convince the higher-ups to trust my friend with a copy of the database, from which he'll navigate the bureaucracy necessary to allow me to work directly with the data.
They've also given us the API for an ATOM feed, but it requires a keyword to search and seems useless for the task of stepping through every photograph in a particular collection.
There's no built-in rate limit for HTTP. Most common web servers are not configured out of the box to rate limit. If rate limiting is in place, it will almost certainly have been put there by the administrators of the website and you'd have to ask them what they've configured.
Some search engines respect a non-standard extension to robots.txt that suggests a rate limit, so check for Crawl-delay in robots.txt.
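A minimal sketch of being polite about it: read a Crawl-delay hint from robots.txt if one exists, and sleep between requests. The domain is a placeholder, and urls_to_fetch is assumed to be built elsewhere by your robot:

    import time
    import urllib2

    def crawl_delay(base_url, default=1.0):
        # Return the Crawl-delay value from robots.txt, or a default if absent.
        try:
            robots = urllib2.urlopen(base_url + '/robots.txt').read()
        except urllib2.URLError:
            return default
        for line in robots.splitlines():
            if line.lower().startswith('crawl-delay:'):
                try:
                    return float(line.split(':', 1)[1].strip())
                except ValueError:
                    pass
        return default

    delay = crawl_delay('http://example.com')
    for url in urls_to_fetch:                 # assumed: list built by your robot
        page = urllib2.urlopen(url).read()
        time.sleep(delay)                     # pause politely between requests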
The HTTP spec does suggest a limit of two concurrent connections per server, but browsers have already started ignoring it, and efforts are underway to revise that part of the standard as it is quite outdated.
I am working on a social-network type of application on App Engine, and would like to send multiple images to the client based on a single get request. In particular, when a client loads a page, they should see all images that are associated with their account.
I am using python on the server side, and would like to use Javascript/JQuery on the client side to decode/display the received images.
The difficulty is that I would like to only perform a single query on the server side (ie. query for all images associated with a single user) and send all of the images resulting from the query to the client as a single unit, which will then be broken up into the individual images. Ideally, I would like to use something similar to JSON, but while JSON appears to allow multiple "objects" to be sent as a JSON response, it does not appear to have the ability to allow multiple images (or binary files) to be sent as a JSON response.
Is there another way that I should be looking at this problem, or perhaps a different technology that I should be considering that might allow me to send multiple images to the client, in response to a single get request?
Thank you and Kind Regards
Alexander
The App Engine part isn't much of a problem (as long as the number of images and total size doesn't exceed GAE's limits), but the user's browser is unlikely to know what to do in order to receive multiple payloads per GET request -- that's just not how the web works. I guess you could concatenate all the blobs/bytestreams (together with metadata needed for the client to reconstruct them) and send that (it will still have to be a separate payload from the HTML / CSS / Javascript that you're also sending), as long as you can cajole Javascript into separating the megablob into the needed images again (but for that part you should open a separate question and tag it Javascript, as Python has little to do with it, and GAE nothing at all).
I would instead suggest just accepting the fact that the browser (presumably via ajax, as you mention in tags) will be sending multiple requests, just as it does to every other webpage on the WWW, and focus on optimizing the serving side -- the requests will be very close in time, so you should just use memcache to keep the yet-unsent images to avoid multiple fetch-from-storage requests in your GAE app.
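One way to read that memcache suggestion, as a minimal sketch on the old db API ("Photo" and its image_data property are hypothetical names; the key string is supplied by each per-image request from the client):

    from google.appengine.api import memcache
    from google.appengine.ext import db

    def get_image_bytes(key_str):
        # Serve from memcache when possible; fall back to the datastore once.
        data = memcache.get(key_str)
        if data is None:
            photo = db.get(db.Key(key_str))
            data = photo.image_data
            memcache.set(key_str, data, time=300)   # keep it around for the burst of requests
        return data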
As an improvement to Alex's answer, there's no need to use memcache: Simply do a keys-only query to get a list of keys of images you want to send to the client, then use db.get() to fetch the image corresponding to the required key for each image request. This requires roughly the same amount of effort as a single regular query.
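A minimal sketch of that keys-only approach on the old db API; "Photo", "owner" and "image_data" are hypothetical names:

    from google.appengine.ext import db

    class Photo(db.Model):
        owner = db.UserProperty()
        image_data = db.BlobProperty()

    def list_photo_keys(user):
        # Keys-only query: cheap, and returns just the keys of this user's images.
        q = Photo.all(keys_only=True).filter('owner =', user)
        return [str(k) for k in q.fetch(1000)]

    def photo_bytes(key_str):
        # Called once per image request from the client; fetch by key only.
        return db.get(db.Key(key_str)).image_data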
Trying to send all of the images in one request means that you will be fighting very hard against some of the fundamental assumptions of the web and browser technology. If you don't have a really, really compelling reason to do this, you should consider delivering one image per request. That already works now, no sweat, no effort, no wheels reinvented.
I can't think of a sensible way to do what you ask, but I can tell you that you are asking for pain in trying to implement the solution that you are describing.
Send the client URLs for all the images in one hit, and deal with them on the client. That fits with the design of the protocol, and still lets you make only one query. The client might, if you're lucky, be able to stream those back in its next request, but the neat thing is that it'll work (eventually) even if it can't reuse the connection for some reason (usually a busted proxy in the way).
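To make that concrete, a minimal sketch of the "one query, many URLs" response with webapp2; the handler name, the route, and the list_photo_keys helper from the sketch above are assumptions:

    import json
    import webapp2
    from google.appengine.api import users

    class ImageListHandler(webapp2.RequestHandler):
        def get(self):
            # One datastore query, then hand back a URL per image as JSON.
            keys = list_photo_keys(users.get_current_user())   # keys-only helper sketched above
            urls = ['/image?key=%s' % k for k in keys]
            self.response.headers['Content-Type'] = 'application/json'
            self.response.write(json.dumps({'images': urls}))

    app = webapp2.WSGIApplication([('/images', ImageListHandler)])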