Is there any way to find a site's URL folders? - python

This may be a weird question, but I am making a spider and I am wondering: is there any way to discover the folders of certain URLs on a site, like:
mysite.com/drupal
mysite.com/wordpress
mysite.com/abc
Is there any way to find this kind of information?

Web sites don't typically advertise their entire set of URLs. You can try a few things:
Read the main page, and follow the links on the page. Each leads to another page, which contains links, and so on.
Guess at common folder names.
Examine the robots.txt file if the site has one. Be a good citizen and do not retrieve pages it forbids you from fetching.
Try to get the site's sitemap, as described here: http://support.google.com/webmasters/bin/answer.py?hl=en&answer=156184
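For example, here is a minimal sketch using Python 3's standard library that pulls a site's robots.txt and sitemap.xml and prints the paths and URLs they mention. The domain is a placeholder, error handling is omitted, and not every site exposes these files.

# Sketch: list the paths/URLs a site mentions in robots.txt and sitemap.xml.
# "example.com" is a placeholder; add error handling and a User-Agent for real use.
import urllib.request
import xml.etree.ElementTree as ET

base = "http://example.com"

# robots.txt: Allow/Disallow/Sitemap lines hint at folders the site knows about
robots = urllib.request.urlopen(base + "/robots.txt").read().decode("utf-8", "replace")
for line in robots.splitlines():
    if line.lower().startswith(("allow:", "disallow:", "sitemap:")):
        print(line.strip())

# sitemap.xml: <loc> elements list URLs the owner wants indexed
root = ET.fromstring(urllib.request.urlopen(base + "/sitemap.xml").read())
ns = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
for loc in root.findall(".//sm:loc", ns):
    print(loc.text)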

If you implement a traditional spider, it will only traverse URLs it finds in the content as it goes along. You could try a dictionary or an every-string-in-the-universe check at every directory level, but that wouldn't be playing nice.
So, the short answer is "no".

Related

What methods are used to crawl a website that only offers a search bar for navigation

How would you go about crawling a website so that you can index every page when there is really only a search bar for navigation, like the following sites:
https://plejehjemsoversigten.dk/
https://findadentist.ada.org/
Do people just brute force the search queries, or is there a method that's usually implemented to index these kinds of websites?
There are several ways to approach this (though if the owner of a resource does not want it crawled, that can be genuinely challenging):
Check robots.txt of a resource. It might give you a clue on the site structure.
Check the sitemap.xml of the resource. It might give the URLs of the pages the owner wishes to make public.
Use alternative indexers (like Google), with advanced search syntax narrowing the scope to a particular site (like site:your.domain).
Exploit weaknesses in the site's search design. For example, the first site on your list does not enforce a minimum search-string length, so you can search for, say, a and get the 800 results containing a, then repeat for the remaining letters (a rough sketch of this follows below).
Once you have search results, crawl all the links on the result pages, since related pages are often listed there.
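As a rough sketch of the minimum-search-length idea: the search endpoint and query parameter below are hypothetical (inspect the real site's search request in your browser's network tab first), and requests/BeautifulSoup are just one convenient choice.

# Sketch: query a hypothetical search endpoint with every single letter and
# collect all links found on the result pages.
import string
import requests
from bs4 import BeautifulSoup

SEARCH_URL = "https://example.org/search"      # hypothetical endpoint
found = set()
for letter in string.ascii_lowercase:
    resp = requests.get(SEARCH_URL, params={"q": letter}, timeout=10)
    soup = BeautifulSoup(resp.text, "html.parser")
    for a in soup.select("a[href]"):           # every link on the result page
        found.add(a["href"])

print(len(found), "unique links collected")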

How to web scrape to find new updates on a website

I know it's a broad question, but I'm looking for ideas on how to go about this. Not looking for exact code, just a rough game plan of how to approach it!
I'm trying to scrape a blog site to check for new blog posts, and if so, to return the URL of that particular blog post.
There are 2 parts to this question, namely
Finding out if the website has been updated
Finding what is the difference (new content)
I'm wondering what approaches I could take. I have been using Selenium for quite a while, and I am aware that with the Selenium driver I could check 1. with driver.page_source.
Is there a better way to do both 1 and 2 together, and if possible even across various different blog sites (i.e., is it possible to write more general code that applies to various blogs at once, rather than a custom script for each site)?
Bonus: Is there a way to do a "diff" on the before and after of the code to see the difference, and extract necessary information from there?
Thanks so much in advance!
If you're looking for a way to know whether pages have been added or deleted, you can either look at the site's sitemap.xml directly or build up your own copy of one. If the site does not have a sitemap.xml, you can crawl its menus and navigation and build one yourself. Sitemap files have a 'last modified' entry, so if you know the interval you are scraping on, you can quickly work out whether a change occurred within that interval. This is good for site-wide changes.
Alternatively, you can check the page's response headers to determine the last modified time for that page. Apply the same interval check as with the sitemap and go from there.
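As an illustration of the sitemap approach, here is a rough sketch assuming the blog exposes a standard sitemap.xml with <lastmod> entries; the domain and the one-day interval are placeholders.

# Sketch: report sitemap URLs whose <lastmod> falls inside the last scrape interval.
from datetime import datetime, timedelta
import requests
import xml.etree.ElementTree as ET

sitemap = requests.get("https://example-blog.com/sitemap.xml", timeout=10).content
root = ET.fromstring(sitemap)
ns = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
cutoff = datetime.utcnow() - timedelta(days=1)     # the interval you scrape on

for entry in root.findall("sm:url", ns):
    loc = entry.findtext("sm:loc", namespaces=ns)
    lastmod = entry.findtext("sm:lastmod", default="", namespaces=ns)
    if lastmod and datetime.strptime(lastmod[:10], "%Y-%m-%d") >= cutoff:
        print("changed since last run:", loc)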
You can always check the Last-Modified value in the website's response headers:
https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Last-Modified
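For instance, here is a minimal sketch of the header check, combined with a content-hash fallback since not every server sends Last-Modified or honours conditional requests; the URL is a placeholder.

# Sketch: detect a change via Last-Modified / conditional GET, falling back to hashing the body.
import hashlib
import requests

url = "https://example-blog.com/"                  # placeholder URL
first = requests.get(url, timeout=10)
last_modified = first.headers.get("Last-Modified")
old_hash = hashlib.sha256(first.content).hexdigest()

# ... later, on the next polling interval ...
headers = {"If-Modified-Since": last_modified} if last_modified else {}
second = requests.get(url, headers=headers, timeout=10)
if second.status_code == 304:
    print("unchanged (server says not modified)")
elif hashlib.sha256(second.content).hexdigest() == old_hash:
    print("unchanged (same content hash)")
else:
    print("page changed")

For the "diff" part of the question, difflib.unified_diff over the old and new HTML (or better, over the extracted article text) gives a line-by-line view of what actually changed.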

How to stop infinite loops caused by dynamic links when writing a website crawler?

I am doing a small project: creating a crawler which will extract all the links present on a website to the maximum possible depth.
Below is a portion of the code I am using to avoid erroneous links, or links which take the crawler outside the target website.
Code Snippet :
# block all things that can't be urls
if not url.startswith(("http", "/")):
    continue
# block all links going away from the website
if url.startswith("http") and not url.startswith(seed):
    continue
# prefix relative links to php pages with the seed
if "php" in url.split('/')[1]:
    url = seed + url
The problem I am facing is that I encountered a link like this:
http://www.msit.in/index.php/component/jevents/day.listevents/2015/10/13/-?Itemid=1
This link keeps producing results endlessly; the date portion of the URL (2015/10/13) is what changes.
When the crawler follows this link, it gets into an effectively infinite loop, as shown below. I checked on the website: even the link for 2050/10/13 exists, so this would take a huge amount of time.
A few output sequences:
http://www.msit.in/index.php/component/jevents/day.listevents/2015/04/13/-?Itemid=1
http://www.msit.in/index.php/component/jevents/day.listevents/2015/05/13/-?Itemid=1
http://www.msit.in/index.php/component/jevents/day.listevents/2015/06/13/-?Itemid=1
http://www.msit.in/index.php/component/jevents/day.listevents/2015/07/13/-?Itemid=1
http://www.msit.in/index.php/component/jevents/day.listevents/2015/08/13/-?Itemid=1
http://www.msit.in/index.php/component/jevents/day.listevents/2015/09/13/-?Itemid=1
http://www.msit.in/index.php/component/jevents/day.listevents/2015/10/13/-?Itemid=1
http://www.msit.in/index.php/component/jevents/day.listevents/2015/11/13/-?Itemid=1
http://www.msit.in/index.php/component/jevents/day.listevents/2015/11/14/-?Itemid=1
http://www.msit.in/index.php/component/jevents/day.listevents/2015/11/15/-?Itemid=1
http://www.msit.in/index.php/component/jevents/day.listevents/2015/11/16/-?Itemid=1
http://www.msit.in/index.php/component/jevents/day.listevents/2015/11/17/-?Itemid=1
http://www.msit.in/index.php/component/jevents/day.listevents/2015/11/18/-?Itemid=1
http://www.msit.in/index.php/component/jevents/day.listevents/2015/11/19/-?Itemid=1
http://www.msit.in/index.php/component/jevents/day.listevents/2015/11/20/-?Itemid=1
http://www.msit.in/index.php/component/jevents/day.listevents/2015/11/21/-?Itemid=1
My question:
How can I avoid this problem?
If you are writing your project for this site specifically, you can try to filter out these event links by comparing the dates embedded in the URL. However, this will most likely result in site-specific code; if the project needs to be more general, it is probably not an option.
If this doesn't work for you, can you add some more information (what the project is for, whether there are time constraints, etc.)?
Edit: I missed the part about dynamic links; since this is not a finite set, the first part of my answer doesn't fully apply on its own.
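If you do go the site-specific route, here is a rough sketch of skipping calendar URLs whose embedded date falls outside a window you care about. The regex matches the jevents URLs shown above; the window bounds are arbitrary examples.

# Sketch: skip calendar-style URLs whose embedded date is outside a chosen window.
import re
from datetime import date

DATE_URL = re.compile(r"/day\.listevents/(\d{4})/(\d{2})/(\d{2})/")

def within_window(url, lo=date(2014, 1, 1), hi=date(2016, 12, 31)):
    m = DATE_URL.search(url)
    if not m:
        return True                    # not a calendar URL, keep it
    y, mth, d = map(int, m.groups())
    try:
        return lo <= date(y, mth, d) <= hi
    except ValueError:                 # nonsense date embedded in the URL
        return False

# in the crawl loop:
# if not within_window(url):
#     continue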
If the content of a site is stored in a database and pulled for display on pages on demand, dynamic URLs may be used. In that case the site basically serves as a template for the content. Usually, a dynamic URL looks something like this: http://code.google.com/p/google-checkout-php-sample-code/issues/detail?id=31.
You can spot dynamic URLs by looking for characters like ? = &. Dynamic URLs have the disadvantage that different URLs can have the same content, so different users might link to URLs with different parameters that serve the same content. That is one reason why webmasters sometimes want to rewrite their URLs to static ones.
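As a small illustration of coping with such dynamic URLs in a crawler, here is a sketch that normalises the query string before checking the "seen" set, so the same page reached with reordered or irrelevant parameters is not crawled twice. The parameters listed for removal are only examples.

# Sketch: normalise dynamic URLs before deduplication.
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

IGNORED_PARAMS = {"Itemid", "utm_source", "utm_medium"}    # example parameters to drop

def normalise(url):
    parts = urlsplit(url)
    query = sorted((k, v) for k, v in parse_qsl(parts.query) if k not in IGNORED_PARAMS)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(query), ""))

seen = set()
def should_visit(url):
    key = normalise(url)
    if key in seen:
        return False
    seen.add(key)
    return True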

How to crawl a file hosting website with Scrapy in Python?

Can anyone help me figure out how to crawl a file hosting website like filefactory.com? I don't want to download all the files hosted there, just to index all available files with Scrapy.
I have read the tutorial and docs on Scrapy's spider classes. If I only give the website's main page as the starting URL, I won't crawl the whole site, because the crawling depends on links and the starting page doesn't seem to point to any file pages. That's the problem I'm thinking about, and any help would be appreciated!
I have two pieces of advice. The first is to make sure you are using Scrapy correctly, and the second concerns the best way to collect a larger sample of URLs.
First:
Make sure you are using the CrawlSpider to crawl the website. This is what most people use when they want to take all the links on a crawled page and turn them into new requests for Scrapy to crawl. See http://doc.scrapy.org/en/latest/topics/spiders.html for more information on the crawl spider.
If you build the crawl spider correctly, it should be able to find, and then crawl, the majority of the links on each page.
However, if the pages that contain the download links are not themselves linked to by pages that Scrapy encounters, then there is no way for Scrapy to know about them.
One way to counter this is to use multiple entry points on the website, in the areas you know Scrapy has difficulty finding. You can do this by putting multiple initial URLs in the start_urls variable, as in the sketch below.
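For instance, here is a minimal CrawlSpider sketch; the start URLs, allowed domain, and the fields yielded by the callback are placeholders to adapt to the pages you actually need to index.

# Sketch: a CrawlSpider that follows every in-domain link and records each page's URL.
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class FileIndexSpider(CrawlSpider):
    name = "fileindex"
    allowed_domains = ["filefactory.com"]
    start_urls = [
        "http://www.filefactory.com/",
        # add further entry points Scrapy would not reach from the front page
    ]

    rules = (
        Rule(LinkExtractor(allow_domains=["filefactory.com"]),
             callback="parse_page", follow=True),
    )

    def parse_page(self, response):
        # record the URL of every crawled page; extend with selectors for file metadata
        yield {"url": response.url}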
Second:
Since it is likely that this is already what you were doing, here is my next bit of advice.
If you go to Google and type site:www.filefactory.com, you will see a link to every page that Google has indexed for www.filefactory.com. Make sure you also check site:filefactory.com, because there are some canonicalization issues. When I did this, I saw that around 600,000 pages were indexed. What you should do is crawl Google, collect all of these indexed URLs first, and store them in a database. Then use them to seed further searches on the FileFactory.com website.
Also
If you have a membership to Filefactory.com, you can also program Scrapy to submit forms or sign in. Doing so might give you even further access.
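As a rough sketch of that idea, Scrapy's FormRequest.from_response can submit a login form before crawling; the login URL, form field names, and credentials below are placeholders you would need to check against the real site.

# Sketch: log in with a form submission, then continue crawling as an authenticated user.
import scrapy

class LoginThenCrawlSpider(scrapy.Spider):
    name = "login_example"
    start_urls = ["http://www.filefactory.com/member/signin.php"]    # hypothetical login page

    def parse(self, response):
        yield scrapy.FormRequest.from_response(
            response,
            formdata={"email": "you@example.com", "password": "secret"},   # placeholder credentials
            callback=self.after_login,
        )

    def after_login(self, response):
        # from here, yield further requests for the pages you want to index
        yield {"logged_in_url": response.url}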

Read all pages within a domain

I am using the urllib library to fetch pages. Typically I have the top-level domain name and I wish to extract some information from EVERY page within that domain. Thus, if I have xyz.com, I'd like my code to fetch the data from xyz.com/about, etc. Here's what I am using:
import urllib,re
htmlFile = urllib.urlopen("http://www.xyz.com/"+r"(.*)")
html = htmlFile.read()
...............
This does not do the trick for me, though. Any ideas are appreciated.
Thanks.
-T
I don't know why you would expect domain.com/(.*) to work. You need to have a list of all the pages (dynamic or static) within that domain. Your Python program cannot automatically know that; this knowledge must come from elsewhere, either by following links or by looking at the site's sitemap.
As a footnote, scraping is a slightly shady business. Always make sure, no matter what method you employ, that you are not violating any terms and conditions.
You are trying to pass a regular expression to the web server. Web servers don't actually support that kind of request, so it fails.
To do what you're trying to do, you need to implement a spider: a program that downloads a page, finds all the links within it, and decides which of them to follow. Then it downloads each of those pages and repeats.
Some things to watch out for: loops, multiple links that end up pointing at the same page, links going outside of the domain, and getting banned from the web server for spamming it with thousands of requests.
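For illustration, here is a bare-bones breadth-first spider along those lines. requests and BeautifulSoup are just one common choice, the start URL is the placeholder from the question, and it leaves out the politeness features (robots.txt checks, rate limiting, retry logic) you would want in practice.

# Sketch: download a page, collect its in-domain links, and repeat breadth-first.
from collections import deque
from urllib.parse import urljoin, urlparse
import requests
from bs4 import BeautifulSoup

START = "http://www.xyz.com/"
DOMAIN = urlparse(START).netloc

seen = {START}
queue = deque([START])

while queue:
    url = queue.popleft()
    try:
        html = requests.get(url, timeout=10).text
    except requests.RequestException:
        continue
    # ... extract whatever information you need from `html` here ...
    for a in BeautifulSoup(html, "html.parser").find_all("a", href=True):
        link = urljoin(url, a["href"]).split("#")[0]     # resolve relative links, drop fragments
        if urlparse(link).netloc == DOMAIN and link not in seen:
            seen.add(link)
            queue.append(link)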
In addition to @zigdon's answer, I recommend you take a look at the Scrapy framework.
CrawlSpider will help you implement crawling quite easily.
Scrapy has this functionality built in. There is no need to recursively collect links yourself; it asynchronously handles all the heavy lifting for you. Just specify your domain, your search terms, and how deep you want it to search, e.g. the whole site.
http://doc.scrapy.org/en/latest/index.html
