Web Scraper: Limit to Requests Per Minute/Hour on Single Domain? - python

I'm working with a librarian to re-structure his organization's digital photography archive.
I've built a Python robot with Mechanize and BeautifulSoup to pull about 7000 poorly structured and mildly incorrect/incomplete documents from a collection. The data will be formatted for a spreadsheet he can use to correct it. Right now I'm guesstimating 7500 HTTP requests total to build the search dictionary and then harvest the data, not counting mistakes and do-overs in my code, and then many more as the project progresses.
I assume there's some sort of built-in limit to how quickly I can make these requests, and even if there's not I'll give my robot delays to behave politely toward the over-burdened web server(s). My question (admittedly impossible to answer with complete accuracy) is: roughly how quickly can I make HTTP requests before encountering a built-in rate limit?
I would prefer not to publish the URL for the domain we're scraping, but if it's relevant I'll ask my friend if it's okay to share.
Note: I realize this is not the best way to solve our problem (re-structuring/organizing the database) but we're building a proof-of-concept to convince the higher-ups to trust my friend with a copy of the database, from which he'll navigate the bureaucracy necessary to allow me to work directly with the data.
They've also given us the API for an ATOM feed, but it requires a keyword to search and seems useless for the task of stepping through every photograph in a particular collection.

There's no built-in rate limit for HTTP. Most common web servers are not configured out of the box to rate limit. If rate limiting is in place, it will almost certainly have been put there by the administrators of the website and you'd have to ask them what they've configured.
Some search engines respect a non-standard extension to robots.txt that suggests a rate limit, so check for Crawl-delay in robots.txt.
The HTTP/1.1 specification did suggest a limit of two concurrent connections per server, but browsers have long ignored that, and later revisions of the standard have dropped the guidance as outdated.
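If you do add delays, one polite pattern is to read any Crawl-delay from robots.txt and sleep between requests. Here is a minimal sketch assuming the asker's Mechanize stack; the example.org base URL, the URL pattern, and the 2-second fallback are placeholders, not the real archive:

import time
import urllib.robotparser
import mechanize  # the asker's stack; urllib/requests would work the same way

BASE = "http://example.org"   # placeholder for the real archive domain
FALLBACK_DELAY = 2.0          # seconds between requests if robots.txt gives no Crawl-delay

rp = urllib.robotparser.RobotFileParser()
rp.set_url(BASE + "/robots.txt")
rp.read()
delay = rp.crawl_delay("*") or FALLBACK_DELAY

br = mechanize.Browser()
br.set_handle_robots(True)    # let Mechanize honour robots.txt rules as well

for url in ["%s/photo/%d" % (BASE, i) for i in range(1, 11)]:  # stand-in URL list
    html = br.open(url).read()
    # ... parse with BeautifulSoup here ...
    time.sleep(delay)         # be polite: at most one request every `delay` seconds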

Related

Web scraping and proxy type

Framework: Scrapy.
I am currently using a web-scraper but I am getting disconnected from the server.
The scraper will (eventually) scrape between 100k and 150k pages with each page containing 11 fields that contain data that will be scraped.
My idea is that the scraper will be used once per month.
Estimated size of database upon completion is between 200 MB and 300 MB (not accounting for bandwidth).
I do not know if I need a paid proxy for this or if I can use free proxies.
Any advice (or a proxy provider recommendation for my needs) will be greatly appreciated.
There are several techniques to avoid being disconnected from the server you are scraping.
These are some of the common ones:
You can set custom user agents; there are libraries and tutorial pages on how to rotate them in Scrapy.
You can go to your settings.py and uncomment DOWNLOAD_DELAY = x; a settings sketch follows below.
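As a hedged illustration only (the values are guesses to tune against the target site, not recommended defaults), a settings.py fragment combining a download delay, AutoThrottle, and a custom user agent:

# settings.py -- illustrative values, tune for the site you are scraping
DOWNLOAD_DELAY = 2                  # seconds between requests to the same domain
RANDOMIZE_DOWNLOAD_DELAY = True     # jitter the delay (0.5x to 1.5x)
AUTOTHROTTLE_ENABLED = True         # back off automatically when the server slows down
AUTOTHROTTLE_START_DELAY = 1
AUTOTHROTTLE_MAX_DELAY = 10
CONCURRENT_REQUESTS_PER_DOMAIN = 2
USER_AGENT = "Mozilla/5.0 (compatible; MyScraper/1.0)"  # placeholder UA string
RETRY_ENABLED = True
RETRY_TIMES = 3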
Without a proxy you're very likely to have your IP address blocked and then even with proxies you may run into a CAPTCHA that prevents you from scraping pages.
For scraping 100K - 150K pages per month, as you indicated, I would highly recommend not using free proxies. The problem with free proxies is that they're incredibly unreliable: you never know who's using them, what they're being used for, or when they'll stop working. This could lead to any or all of your proxies being shut down or blocked, so your scraper could stop working at any given moment.
Paid proxies are certainly the way to go although they can get quite expensive and some of the proxy providers have been known to use shady techniques to obtain IP addresses.
However https://htmlapi.io (my service) can bridge this gap for you and it's free to get started with (you don't even need to create an account). HtmlAPI returns the raw HTML contents of the page you're scraping. It handles rotating proxy servers for you automatically, defeating CAPTCHA's, rendering JavaScript, and more...
All you have to do is call the API and extract the data you need from the webpage.
Try it from your command line:
curl "https://htmlapi.io/example.com"

Steam Web API GetOwnedGames multiple SteamIDs

We are trying to get the owned games of a lot of users, but our problem is that after a while the API call limit (100,000 a day) kicks in and we stop getting results.
We use 'IPlayerService/GetOwnedGames/v0001/?key=APIKEY&steamid=STEAMID' in our call and it works for the first entries.
There are several other queries like the GetPlayerSummaries query which take multiple Steam IDs, but according to the documentation, this one only takes one.
Is there any other way to combine/merge our queries? We are using Python and the urllib.request library to create the request.
Depending on the payload of the requests you have the following possibilities:
if each request brings only the newest updates, you could serialize the Steam IDs when you get the response that you've hit the daily limit
if you have the ability to control via the request payload what data you receive, you could go for a multithreaded/multiprocessing approach that consumes the request queries and the Steam IDs from a couple of shared resources
As #andreihondrari indirectly stated in his comment under his answer, one can request an API key that allows more than the 100,000 calls/day. This is stated under the "License to Steam Web API & Steam Data" part of the documentation:
You are limited to one hundred thousand (100,000) calls to the Steam Web API per day. Valve may approve higher daily call limits if you adhere to these API Terms of Use.
This may be complicated and there is of course the possibility that you won't get approved, but this is pretty much the only stable way you can go.
Furthermore, you could theoretically use multiple Steam Web API keys, BUT:
Each API key still has the limitation of 100,000 calls/day, so you'll need to implement a failsafe and a way to switch between keys, and you may need to create lots of accounts (a minimal key-rotation sketch follows this list).
Because each user has their own friend list and blocked list, each API key can "see" only a portion of the Steam Community (friends data is not public otherwise). So it could be that the key you are using can't "see" a certain user while another key could have "seen" them properly.
You'll need a unique email address for each created account.
Note: Having multiple accounts actually complies with Valve's ToS according to this post on Arqade.
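To illustrate the multiple-keys idea with the asker's urllib.request, here is a sketch under my own assumptions: the key list, the error codes treated as quota exhaustion, and the helper name are all hypothetical, not Valve-endorsed:

import json
import urllib.error
import urllib.request

API_KEYS = ["KEY_A", "KEY_B"]  # placeholder keys, one per approved account
URL = ("https://api.steampowered.com/IPlayerService/GetOwnedGames/v0001/"
       "?key=%s&steamid=%s&format=json")

def get_owned_games(steam_id, keys=API_KEYS):
    for key in keys:
        try:
            with urllib.request.urlopen(URL % (key, steam_id), timeout=10) as resp:
                return json.load(resp)
        except urllib.error.HTTPError as err:
            # 403/429 are treated here as "this key has hit its quota"; try the next key
            if err.code in (403, 429):
                continue
            raise
    raise RuntimeError("all API keys appear to have exhausted their daily limit")

# usage: games = get_owned_games("STEAMID")  # substitute a real 64-bit SteamID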

Django and Rails with one common DB

I have earlier worked on Java+Spring to create a web-app.
I have to build a new web-app now.
It will have one centralized db.
There will be two different type of instance of web-app.
Web-App 1:
a) It would have no UI to render: no HTML, JS, etc.
b) All it needs to provide is a set of REST APIs which will
b.1) create some new entries in the DB
b.2) modify some entries in the DB
b.3) retrieve some DB records in JSON format.
Some frontend code (which doesn't belong to this app) will periodically fetch
these details.
c) It will be used by at most 100,000 people, but at any given point in time
we can expect about 1000 users logged in and doing what is described in b)
Web-App2 :
a) It will have some dashboards
b) 90% of DB operations would be read operations
c) 10% of DB operations would be write/modify
d) There will be thousands of users of this system, and at any given point in time
hardly 50-1000 people will be accessing it.
I am thinking of the following.
Have Web-App 1 created in python+Django and Web-App 2 created in RoR.
I am planning to use DynamoDB and memcached.
Why two different frameworks?
1) So that I get to learn both of them
2) There have been concerns about scalability in RoR (and I also know people claim they are unfounded), and Web-App 1 may have scaling needs in the future.
My question is: do you see any problem with this combination?
For example, ActiveRecord expects a specific naming format for your database tables. Are there any other concerns similar to this?
Has anyone else used a similar technology stack?
Both frameworks are full-stack frameworks and provide MVC, templating, unit testing, security, DB migrations, caching, and ORMs.
For my startup, we also needed to put out a full-fledged website along with an API. We are also using DynamoDB for storing most of the data and are only using MySQL for session info.
I opted to use Ruby on Rails for the webapp and Sinatra for the API. If your criterion is simply learning as many new things as possible, then it would make sense to opt for relatively different stacks (Django/Python and RoR). In our case, we went with Sinatra because it's essentially a very lightweight wrapper around Rack and perfect for an API which receives requests, calls one or more services or does some processing, and hands out a formatted response. While I don't see any problem with using Python/Django instead of Sinatra, in our case the benefit was spending less time working with a different language.
Also, scalability in rails is a bit of an iffy subject. In the end, it's about how you use it. We've had no issues scaling rails with unicorn and nginx. Our business logic is all in the API service and the rails server as well uses the API for most of the work. This means we don't use active record on rails and the website is just another consumer for our API which does all the heavy lifting whether the request comes from an app or the website. Using MySQL for the session store ensures we can route requests to any of the application servers without having to worry about always routing requests from the same client to the same server every time. This allows us to ramp up and down easily only considering the amount of traffic we're getting.
At the time we started working on this, there wasn't an ORM for DynamoDB which looked and felt just like ActiveRecord, so we ended up writing a few high-level classes of our own to handle storage and retrieval of models on DynamoDB. Considering DynamoDB is not tailored for scans or joins, this didn't take a lot of effort since we were almost always doing lookups based on keys and ranges. This meant we didn't really need a replacement for ActiveRecord, since the real strength of ActiveRecord is being able to intuitively do joins, etc. by convention.
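For comparison, here is a rough Python sketch of that kind of thin wrapper using boto3 (the table name, key names, and region are my assumptions; the answerer's actual code was Ruby and is not shown here):

import boto3
from boto3.dynamodb.conditions import Key

dynamodb = boto3.resource("dynamodb", region_name="us-east-1")  # placeholder region

class Records:
    """Thin model-style wrapper: key/range lookups only, no scans or joins."""

    def __init__(self, table_name="records"):  # placeholder table with user_id/created_at keys
        self.table = dynamodb.Table(table_name)

    def get(self, user_id, created_at):
        # single-item lookup by the full primary key (hash + range)
        resp = self.table.get_item(Key={"user_id": user_id, "created_at": created_at})
        return resp.get("Item")

    def list_for_user(self, user_id, limit=50):
        # range query: every item under one hash key, newest first
        resp = self.table.query(
            KeyConditionExpression=Key("user_id").eq(user_id),
            ScanIndexForward=False,
            Limit=limit,
        )
        return resp["Items"]

    def put(self, item):
        self.table.put_item(Item=item)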
DynamoDB does have its limitations though, and you might find yourself in situations where you need to scan a large number of records. In our case, we also use CloudSearch to index some important info and use it as a fallback for cases when we need to do text-based searches that would otherwise have to scan all our data.

requests library: how to speed it up?

I am trying to send multiple requests to different web pages. At the moment I am using the "requests" library with multithreading, because I have found it more performant than urllib2. Is it possible to load only a part of a webpage? Do you have any other ideas for speeding up my requests besides KeepAlive and multithreading?
Thanks.
As you clarified in a comment:
Hi, I'm trying to extract several stock quotes and financial ratios from the Italian Stock Exchange website. Every page that I load is related to a specific company.
This means there aren't very many easy optimisations left to make. If the web pages themselves are very large and the data you want is early-on in the page, you might be able to avoid downloading some of the data by streaming the download: that is, setting stream=True on the request and then using Response.iter_content() to read in chunks at a time.
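A minimal sketch of that streaming approach (the URL and the stop condition are placeholders; whether it saves anything depends on where in the page the data sits):

import requests

resp = requests.get("http://example.com/quote-page", stream=True, timeout=10)
collected = b""
for chunk in resp.iter_content(chunk_size=8192):
    collected += chunk
    # placeholder stop condition: bail out once the part of the page we need has arrived
    if b"</table>" in collected:
        break
resp.close()  # stop the transfer and release the connection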
If you're fortunate, you might be able to take advantage of caching to reduce response times or sizes. Try plugging something like CacheControl into your Session objects to see if this improves anything.
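For example, a sketch assuming the cachecontrol package is installed (whether the site actually sends cacheable headers is another matter):

import requests
from cachecontrol import CacheControl

session = CacheControl(requests.Session())           # wraps the session with HTTP caching
resp = session.get("http://example.com/quote-page")  # repeat requests may be served from cache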
Otherwise, you're already getting almost as big an improvement as you can get in software alone. If the Italian Stock Exchange supports SPDY (they probably don't), using a SPDY library could improve things, but that rules out Requests (and possibly multithreading as well, for reasons that are totally tangential to this answer). Another outside-the-box option is to run on a machine closer to the web server providing the data.

URL fetch: prevent abuse, malicious urls etc. in python/django

I'm building a webpage featuring something very much like the Facebook wall/newsfeed. Registered users (or users signed in through Facebook Connect or Google auth) can submit URLs. At the moment, I'm taking these URLs and using urllib2 to fetch the content of the URL and search for relevant information like og: properties, the HTML title tag, and perhaps some image tags.
Now, I understand that I'm putting my server at risk when I'm letting users feed my server with URLs to open.
My question is: how high is the risk? What standard security checks can I make?
As for now, I am simply opening the url without any "active" protection because I don't know what to check for.
And what about storing the fetched content in the database? Does Django have built-in protection against SQL injection?
Thank you!
One of the obvious risks here is that one could use your website as a vector for spreading malicious URLs.
E.g. say I figure out a malformed HTML document that allows arbitrary code execution in WebKit-based browsers, say by exploiting a certain 0-day buffer overflow. If your website becomes popular, it would be one of the spots I'd definitely try.
Now, you can't possibly scan the contents of the submitted URLs for security flaws yourself; you'd become an anti-virus/security company then. Both Chrome & Safari do take care of these to some extent.
For the users' and the content's sake, and because of the risk I explained, you could build in a flagging system that learns from users' actions. You could train a classifier whenever someone flags a URL; see examples here.
I'm sure there is a variety of such solutions, also in Python; a rough sketch follows.
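As a rough, hedged sketch of what such a classifier could look like (the feature choice, the scikit-learn pipeline, and the toy training data are all my assumptions, not a vetted anti-abuse model):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# hypothetical training data: URLs submitted so far, 1 = flagged by users, 0 = fine
urls = ["http://example.com/article", "http://bad.example/free-money.exe"]
labels = [0, 1]

# character n-grams work reasonably well on URLs, which have little word structure
model = make_pipeline(
    TfidfVectorizer(analyzer="char_wb", ngram_range=(3, 5)),
    LogisticRegression(),
)
model.fit(urls, labels)

# score a newly submitted URL; hold suspicious ones for manual review
print(model.predict_proba(["http://bad.example/free-money2.exe"])[0][1])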
For a quick overview of security and SQL injection in Django's context, check out this link.
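On the SQL injection question specifically, Django's ORM parameterizes values for you; here is a minimal sketch inside a configured Django project (the FetchedPage model and its fields are hypothetical):

# models.py (hypothetical model for the fetched content)
from django.db import models

class FetchedPage(models.Model):
    url = models.URLField(max_length=500)
    title = models.CharField(max_length=200, blank=True)

# elsewhere in the project:
submitted_url = "http://example.com/some-page"   # placeholder values
page_title = "A title'); DROP TABLE pages;--"    # hostile-looking input is still safe

FetchedPage.objects.create(url=submitted_url, title=page_title)  # values are parameterized
FetchedPage.objects.filter(title__icontains=page_title)          # so is filtering

# Only hand-built SQL strings are risky; pass params to raw() instead of formatting them in:
FetchedPage.objects.raw("SELECT * FROM myapp_fetchedpage WHERE title = %s", [page_title])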
