I'm writing a web crawler with the ultimate goal of creating a map of the path the crawler has taken. While I have no idea at what rate other (and most definitely better) crawlers pull down pages, mine clocks about 2,000 pages per minute.
The crawler works on a recursive backtracking algorithm which I have limited to a depth of 15.
Furthermore, in order to prevent my crawler from endlessly revisiting pages, it stores the URL of each page it has visited in a list and checks that list before following the next candidate URL.
for href in tempUrl:
    ...
    if href not in urls:                  # linear scan of the list -- O(n) per check
        collect(href, parent, depth + 1)
This method seems to become a problem by the time the crawler has pulled down around 300,000 pages; at that point it is averaging about 500 pages per minute.
So my question is: what is another way of achieving the same functionality while improving efficiency?
I thought that decreasing the size of each entry might help, so instead of appending the entire URL I appended just the first 2 and last 2 characters of each URL as a string. This, however, hasn't helped.
Is there a way I could do this with sets or something?
Thanks for the help
edit: As a side note, my program is not yet multithreaded. I figured I should resolve this bottleneck before I get into learning about threading.
Perhaps you could use a set instead of a list for the urls that you have seen so far.
Simply replace your "list of crawled URLs" with a "set of crawled URLs". Sets are optimized for fast membership tests (using the same hashing machinery that dictionaries use), so they're a heck of a lot faster; a lookup in a list is done with a linear search, so it's not particularly fast. You won't need to change the actual code that does the lookup.
Check this out.
In [3]: timeit.timeit("500 in t", "t = list(range(1000))")
Out[3]: 10.020853042602539
In [4]: timeit.timeit("500 in t", "t = set(range(1000))")
Out[4]: 0.1159818172454834
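A minimal sketch of the change, reusing the names from the question (urls, tempUrl, collect); only the container type and the way entries are added need to change:

urls = set()                              # was: urls = []

for href in tempUrl:
    if href not in urls:                  # average O(1) membership test on a set
        urls.add(href)                    # was: urls.append(href)
        collect(href, parent, depth + 1)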
I had a similar problem. I ended up profiling various methods (list/file/set/sqlite) for memory versus time; see the two posts linked below.
In the end sqlite was the best choice. You can also hash the URLs to reduce their size, as in the sketch after the links.
Searching for a string in a large text file - profiling various methods in python
sqlite database design with millions of 'url' strings - slow bulk import from csv
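A minimal sketch of the sqlite approach, assuming a single table keyed by an MD5 hash of the URL (the table and column names are illustrative):

import hashlib
import sqlite3

conn = sqlite3.connect("crawled.db")
conn.execute("CREATE TABLE IF NOT EXISTS seen (url_hash TEXT PRIMARY KEY)")

def url_hash(url):
    # A fixed-size hash keeps each row small and is stable across runs.
    return hashlib.md5(url.encode("utf-8")).hexdigest()

def is_seen(url):
    row = conn.execute("SELECT 1 FROM seen WHERE url_hash = ?",
                       (url_hash(url),)).fetchone()
    return row is not None

def mark_seen(url):
    conn.execute("INSERT OR IGNORE INTO seen (url_hash) VALUES (?)",
                 (url_hash(url),))
    conn.commit()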
Use a dict with the URLs as keys instead (average O(1) access time).
But a set will also work. See
http://wiki.python.org/moin/TimeComplexity
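A minimal sketch, reusing the loop from the question above; the value stored against each URL (its parent page here) is only illustrative:

crawled = {}                              # url -> parent page (value is illustrative)

for href in tempUrl:
    if href not in crawled:               # average O(1) lookup on a dict
        crawled[href] = parent
        collect(href, parent, depth + 1)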
Related
I am building a web crawler which has to crawl hundreds of websites. My crawler keeps a list of URLs already crawled. Whenever the crawler is about to crawl a new page, it first searches the list of URLs already crawled, and if the URL is already listed it skips to the next one, and so on. Once a URL has been crawled, it is added to the list.
Currently I am using binary search to search the URL list, but the problem is that once the list grows large, searching becomes very slow. So my question is: what algorithm should I use to search a list of URLs (the list grows by about 20k to 100k entries daily)?
The crawler is currently coded in Python, but I am going to port it to C++ or another, faster language.
You have to decide at some point just how large you want your crawled list to become. Up to a few tens of millions of items, you can probably just store the URLs in a hash map or dictionary, which gives you O(1) lookup.
In any case, with an average URL length of about 80 characters (that was my experience five years ago when I was running a distributed crawler), you're only going to get about 10 million URLs per gigabyte. So you have to start thinking about either compressing the data or allowing re-crawl after some amount of time. If you're only adding 100,000 URLs per day, then it would take you 100 days to crawl 10 million URLs. That's probably enough time to allow re-crawl.
If those are your limitations, then I would suggest a simple dictionary or hash map that's keyed by URL. The value should contain the last crawl date and any other information that you think is pertinent to keep. Limit that data structure to 10 million URLs. It'll probably eat up close to 2 GB of space, what with dictionary overhead and such.
You will have to prune it periodically. My suggestion would be to have a timer that runs once per day and cleans out any URLs that were crawled more than X days ago. In this case, you'd probably set X to 100. That gives you 100 days of 100,000 URLs per day.
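A minimal sketch of that structure and the daily pruning pass; the 100-day cutoff follows the answer, everything else (names, use of UTC timestamps) is illustrative:

from datetime import datetime, timedelta

crawled = {}                                  # url -> date of last crawl

def mark_crawled(url):
    crawled[url] = datetime.utcnow()

def needs_crawl(url):
    return url not in crawled

def prune(max_age_days=100):
    # Run once a day: drop URLs crawled more than X days ago so they become
    # eligible for re-crawl and the dictionary stays bounded.
    cutoff = datetime.utcnow() - timedelta(days=max_age_days)
    for url in [u for u, ts in crawled.items() if ts < cutoff]:
        del crawled[url]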
If you start talking about high capacity crawlers that do millions of URLs per day, then you get into much more involved data structures and creative ways to manage the complexity. But from the tone of your question, that's not what you're interested in.
I would suggest hashing your values before putting them into your binary-searched list; this removes the likely bottleneck of string comparisons, swapping them for integer equality checks, and it keeps the O(log2(n)) binary search time. Note that Python's built-in hash() may not give consistent results between runs, since it is implementation-specific; within a single run it will be consistent. There's always the option of implementing your own hash that is consistent between sessions as well.
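A minimal sketch of that idea, using hashlib for a hash that is stable between runs and the bisect module to keep the list of integer hashes sorted:

import bisect
import hashlib

seen = []                                     # sorted list of integer hashes

def stable_hash(url):
    # Unlike the built-in hash(), this is consistent across sessions.
    return int.from_bytes(hashlib.md5(url.encode("utf-8")).digest()[:8], "big")

def check_and_add(url):
    h = stable_hash(url)
    i = bisect.bisect_left(seen, h)           # O(log2(n)) search
    if i < len(seen) and seen[i] == h:
        return False                          # already crawled (or a hash collision)
    seen.insert(i, h)                         # insertion keeps the list sorted
    return True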
So, I'm working on a Python web application: a search engine for sporting goods (sports outfits, tools, etc.). Basically, it should search for a given keyword on multiple stores and compare the results to return the 20 best ones.
I was thinking that the best and easiest way to do this is to write a JSON file which contains rules telling the scraper how to extract data from each website. For example:
[{"www.decathlon.com" : { "rules" : { "productTag" : "div['.product']",
"priceTag" : "span[".price"]" } }]
So for Decathlon, to get a product item we search for div tags with the product class.
I have a list of around 10-15 websites to scrape. For each website, the script goes to rules.json, looks up the related rules, and uses them to extract the data, roughly as in the sketch below.
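A minimal sketch of that flow with requests and BeautifulSoup, assuming the rule strings are plain CSS selectors (e.g. "div.product" and "span.price"), which BeautifulSoup's select() understands directly (the bracket notation above would need a small translation step), and assuming each store exposes some search URL (the /search?q=... pattern here is made up):

import json
import requests
from bs4 import BeautifulSoup

with open("rules.json") as f:
    # Flatten the list of one-key objects into {domain: config}.
    rules = {site: cfg for entry in json.load(f) for site, cfg in entry.items()}

def scrape(site, keyword):
    site_rules = rules[site]["rules"]
    resp = requests.get("http://" + site + "/search",
                        params={"q": keyword}, timeout=5)
    soup = BeautifulSoup(resp.text, "html.parser")
    products = soup.select(site_rules["productTag"])
    prices = soup.select(site_rules["priceTag"])
    return [(p.get_text(strip=True), pr.get_text(strip=True))
            for p, pr in zip(products, prices)]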
Pros of this method:
It's very easy to write; we need only a minimal Python script for the logic of reading the rules, mapping URLs to them, and extracting the data through BeautifulSoup. It's also very easy to add or delete URLs and their rules.
Cons of this method: for each search we launch a request to each website, making around 10 requests at the same time, and then compare the results; so if 20 users search at the same time we end up with around 200 requests, which will slow down our app a lot!
Another method:
I thought we could have a huge list of keywords; then at 00:00 a script launches requests to all the URLs for each keyword in the list, compares the results, and stores them in CouchDB to be used throughout the day, updated daily. The only problem with this method is that it's nearly impossible to have a list of all possible keywords.
So please help me decide how I should proceed with this, given that I don't have a lot of time.
Along the lines of your "keyword" list: rather than keeping a list of all possible keywords, perhaps you could maintain a priority queue of keywords whose importance is based on how often a keyword is searched. When a new keyword is encountered, add it to the list; otherwise update its importance every time it's searched. Launch a script to request URLs for the top, say, 30 keywords each day (more or less depending on how often words are searched and what you want to do).
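A minimal sketch of that bookkeeping with collections.Counter; the cutoff of 30 keywords follows the suggestion above, the rest is illustrative:

from collections import Counter

keyword_counts = Counter()

def record_search(keyword):
    keyword_counts[keyword] += 1          # new keywords start at 1, old ones bump up

def keywords_to_refresh(n=30):
    # The nightly job would pre-fetch and store results for the n most-searched keywords.
    return [kw for kw, _ in keyword_counts.most_common(n)]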
This doesn't necessarily solve your problem of having too many requests, but may decrease the likelihood of it becoming too much of a problem.
HTTP requests can be very expensive. That's why you want to make sure you parallelize your requests, and for that you can use something like Celery. This way you reduce the total time to that of the slowest-responding website.
It may be a good idea to set a short request timeout (5 seconds?) in case one of the websites isn't responding.
Have the ability to flag a domain as "down/not responding" and handle those exceptions.
Another optimization would be to cache page contents for some time after each search, so that if the same keyword comes in again you can skip the expensive requests.
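A minimal sketch combining those suggestions (parallel requests, a short timeout with error handling, and a result cache), using concurrent.futures rather than Celery just to keep the example self-contained; the 5-second timeout and one-hour cache TTL are illustrative:

import time
from concurrent.futures import ThreadPoolExecutor

import requests

cache = {}                                    # (site, keyword) -> (timestamp, html)
CACHE_TTL = 3600                              # reuse results for an hour

def fetch(site, keyword):
    key = (site, keyword)
    if key in cache and time.time() - cache[key][0] < CACHE_TTL:
        return cache[key][1]                  # skip the expensive request entirely
    try:
        resp = requests.get("http://" + site + "/search",
                            params={"q": keyword}, timeout=5)
        cache[key] = (time.time(), resp.text)
        return resp.text
    except requests.RequestException:
        return None                           # treat the site as down/not responding

def search_all(sites, keyword):
    # Total wall time is roughly that of the slowest-responding site.
    with ThreadPoolExecutor(max_workers=len(sites)) as pool:
        return list(pool.map(lambda s: fetch(s, keyword), sites))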
I am currently working on a web application where, ideally, I would be able to support a search bar over the documents that are going to be stored for users. Each of these documents will range from a small snippet to a decently sized article (I don't imagine any document will be larger than a few KB of text for search purposes). As I have been reading about the proper ways of using RethinkDB, one thing that has stuck out as worrying is the performance of operations such as a filter on non-indexed data, where I have seen people mention multiple minutes spent in one of those calls. Considering that, over the long run, there are going to be at least 10,000+ documents (and in the really long run 100,000+, 1,000,000+, etc.), is there a way to search those documents with sub-second (preferably tens of milliseconds) response time using the standard RethinkDB API? Or am I going to have to come up with a separate scheme that allows for quick search through clever use of indexes? Or would I be better off using another database that provides that capability?
If you don't use an index, your query is going to have to look at every document in your table, so it will get slower as your table gets larger. 10,000 documents should be reasonable to search through on fast hardware, but you probably can't do it in 10s of milliseconds, and millions of documents will probably be slow to search through.
You may want to look into elasticsearch as a way to do this: http://www.rethinkdb.com/docs/elasticsearch/
How can I generate a random yet valid website link, regardless of language? Actually, the more diverse the languages of the websites it generates, the better.
I've been doing it by using other people's scripts on their webpages; how can I stop relying on these random-site-forwarding scripts and make my own? I've been doing it like this:
import webbrowser
from random import choice
random_page_generator = ['http://www.randomwebsite.com/cgi-bin/random.pl',
                         'http://www.uroulette.com/visit']
webbrowser.open(choice(random_page_generator), new=2)
I've been doing it by using other people's scripts on their webpages; how can I stop relying on these random-site-forwarding scripts and make my own?
There are two ways to do this:
Create your own spider that amasses a huge collection of websites, and pick from that collection.
Access some pre-existing collection of websites, and pick from that collection. For example, DMOZ/ODP lets you download their entire database;* Google used to have a customized random site URL;** etc.
There is no other way around it (short of randomly generating and testing valid strings of arbitrary characters, which would be a ridiculously bad idea).
Building a web spider for yourself can be a fun project. Link-driven scraping libraries like Scrapy can do a lot of the grunt work for you, leaving you to write the part you care about.
* Note that ODP is a pretty small database compared to something like Google's or Yahoo's, because it's primarily a human-edited collection of significant websites rather than an auto-generated collection of everything anyone has put on the web.
** Google's random site feature was driven by both popularity and your own search history. However, by feeding it an empty search history, you could remove that part of the equation. Anyway, I don't think it exists anymore.
A conceptual explanation, not a code one.
Their scripts are likely very large and comprehensive. If it's a random website selector, they have a huge, huge list of websites line by line, and the script just picks one. If it's a random URL generator, it probably generates a string of letters (e.g. "asljasldjkns"), plugs it between http:// and .com, tries to see if it is a valid URL, and if it is, sends you that URL.
The easiest way to design your own might be to ask to have a look at theirs, though I'm not certain of the success you'd have there.
The best way as a programmer is simply to decipher the nature of URL language. Practice the building of strings and testing them, or compile a huge database of them yourself.
As a hybridization, you might try building two things. One script that, while you're away, searches for/tests URLs and adds them to a database. Another script that randomly selects a line out of this database to send you on your way. The longer you run the first, the better the second becomes.
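A rough sketch of the generate-and-test idea described above (brute force only; as the other answer notes, the hit rate for made-up names will be very low):

import random
import string

import requests

def random_candidate_url(length=8):
    name = "".join(random.choice(string.ascii_lowercase) for _ in range(length))
    return "http://" + name + ".com"

def find_random_site(max_tries=100, timeout=3):
    for _ in range(max_tries):
        url = random_candidate_url()
        try:
            requests.head(url, timeout=timeout)   # any response means something is there
            return url
        except requests.RequestException:
            continue                              # didn't resolve or didn't answer
    return None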
EDIT: Do Abarnert's thing about spiders, that's much better than my answer.
The other answers suggest building large databases of URLs; there is another method, which I've used in the past and documented here:
http://41j.com/blog/2011/10/find-a-random-webserver-using-libcurl/
The idea is to create a random IP address and then try to grab a site from port 80 of that address. This method isn't perfect with modern virtually hosted sites, and of course it only fetches the top page, but it can be an easy and effective way of getting random sites. The code linked above is C, but it should be easily callable from Python, or the method could easily be adapted to Python.
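A rough Python adaptation of that method (a sketch only; real code should also skip private and reserved address ranges):

import random

import requests

def random_site(timeout=3, max_tries=200):
    # Keep generating random IPv4 addresses until one answers on port 80.
    for _ in range(max_tries):
        ip = ".".join(str(random.randint(1, 254)) for _ in range(4))
        try:
            resp = requests.get("http://" + ip + "/", timeout=timeout)
            return ip, resp.status_code
        except requests.RequestException:
            continue                          # nothing listening there, try again
    return None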
I have some webpages where I'm collecting data over time. I don't care about the content itself, just whether the page has changed.
Currently, I use Python's requests.get to fetch a page, hash the page (md5), and store that hash value to compare in the future.
Is there a computationally cheaper or smaller-storage strategy for this? Things work now; I just wanted to check if there's a better/cheaper way. :)
You can keep track of the date of the last version you got and use the If-Modified-Since header in your request. However, some resources ignore that header (in general it's difficult to handle for dynamically generated content), in which case you'll have to fall back to a less efficient method.
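A minimal sketch of the conditional-request approach with requests; it only helps when the server sends a Last-Modified header and honors If-Modified-Since:

import requests

last_modified = {}                            # url -> Last-Modified header value

def has_changed(url):
    headers = {}
    if url in last_modified:
        headers["If-Modified-Since"] = last_modified[url]
    resp = requests.get(url, headers=headers)
    if resp.status_code == 304:               # Not Modified: nothing to re-hash or store
        return False
    if "Last-Modified" in resp.headers:
        last_modified[url] = resp.headers["Last-Modified"]
    return True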
A hash would be the most reliable way to detect changes. I would use CRC32: it's only 32 bits, as opposed to 128 bits for MD5, and even a JavaScript implementation in the browser can be very fast. I have personal experience improving the speed of a JS implementation of CRC32 for very large datasets.
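A minimal sketch of the CRC32 variant using zlib from the standard library; storage per page drops from a 32-character MD5 hex digest to a single 32-bit integer:

import zlib

import requests

checksums = {}                                # url -> last CRC32 checksum

def page_changed(url):
    body = requests.get(url).content
    checksum = zlib.crc32(body) & 0xFFFFFFFF  # mask keeps it an unsigned 32-bit value
    changed = checksums.get(url) != checksum
    checksums[url] = checksum
    return changed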