How to web scrape to find out new updates on website - python

I know it's a broad question, but I'm looking for ideas to go about doing this. Not looking for the exact coded answer, but a rough gameplan of how to go about this!
I'm trying to scrape a blog site to check for new blog posts, and if so, to return the URL of that particular blog post.
There are 2 parts to this question, namely:
1. Finding out if the website has been updated
2. Finding what the difference is (new content)
I'm wondering what approaches I could take. I have been using Selenium for quite a bit, and am aware that with the Selenium driver I could check for part 1 with driver.page_source.
Is there a better way to do both 1 and 2 together, and if possible even across various different blog sites (thinking whether it is possible to write more general code that applies to various blogs at once, not a custom script for each one)?
Bonus: Is there a way to do a "diff" on the before and after of the code to see the difference, and extract necessary information from there?
Thanks so much in advance!

If you want to know whether pages have been added or deleted, you can either look at the site's sitemap.xml file directly or build your own copy of one. If the site does not have a sitemap.xml, you can crawl its menu and navigation and build up your own from that. Sitemap entries can carry a 'last modified' value (the lastmod element). If you know the interval you are scraping on, you can quickly work out whether a change occurred within that interval. This is good for site-wide changes.
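As a rough illustration of the interval check, here is a minimal sketch that parses sitemap XML and returns the URLs whose lastmod falls on or after a cutoff. The function name `urls_modified_since`, the sample URLs, and the choice to treat dates without a timezone as UTC are assumptions made for the example, not something the site guarantees.

```python
from datetime import datetime, timezone
import xml.etree.ElementTree as ET

# Namespace used by the sitemaps.org 0.9 schema.
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def urls_modified_since(sitemap_xml, since):
    """Return the <loc> of every <url> whose <lastmod> is at or after `since`."""
    root = ET.fromstring(sitemap_xml)
    changed = []
    for url in root.findall("sm:url", NS):
        loc = url.findtext("sm:loc", default="", namespaces=NS).strip()
        lastmod = url.findtext("sm:lastmod", default="", namespaces=NS).strip()
        if not loc or not lastmod:
            continue  # <lastmod> is optional, so skip entries without it
        stamp = datetime.fromisoformat(lastmod.replace("Z", "+00:00"))
        if stamp.tzinfo is None:
            stamp = stamp.replace(tzinfo=timezone.utc)  # assumption: bare dates are UTC
        if stamp >= since:
            changed.append(loc)
    return changed


# Hypothetical sitemap content, inlined so the sketch runs without network access.
SAMPLE = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://blog.example.com/old-post</loc><lastmod>2024-01-01</lastmod></url>
  <url><loc>https://blog.example.com/new-post</loc><lastmod>2024-03-10T08:00:00Z</lastmod></url>
</urlset>"""

last_run = datetime(2024, 3, 1, tzinfo=timezone.utc)
new_urls = urls_modified_since(SAMPLE, last_run)
```

In practice the XML would come from downloading the site's sitemap, and `last_run` would be the time of your previous scrape.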
Alternatively, you can check the HTTP response headers to determine the last modified time for the page. Apply the same interval check as for the sitemap and go from there.
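A minimal sketch of the header check, assuming you already have the response headers as a mapping (for example from an HTTP library of your choice). The helper name `changed_since` is made up for the example. Not every server sends Last-Modified, so the sketch returns None in that case rather than guessing.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def changed_since(headers, last_check):
    """Return True/False based on Last-Modified, or None when the header is absent."""
    value = headers.get("Last-Modified")
    if value is None:
        return None  # many dynamically generated pages omit the header
    modified = parsedate_to_datetime(value)
    if modified.tzinfo is None:
        modified = modified.replace(tzinfo=timezone.utc)  # assumption: treat as UTC
    return modified > last_check


# Hypothetical header value in the standard HTTP date format.
headers = {"Last-Modified": "Wed, 21 Oct 2015 07:28:00 GMT"}
is_new = changed_since(headers, datetime(2015, 10, 20, tzinfo=timezone.utc))
```

Note that a plain dict is case-sensitive, so the key must match exactly; header mappings returned by HTTP libraries are often case-insensitive.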

You can always check the Last-Modified value in the website's HTTP response headers:
https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Last-Modified

Related

Web scraping when multiple clicks are needed

Kindly have a short look here: https://www.cbp.gov/contact/find-broker-by-port/4901. I am trying to scrape the list of all brokers, port by port. My question is about the approach to take when multiple clicks (forward/back) are needed to arrive at one or more data items. Could you point me to some reading material on this, or any other solution you deem fit? Many thanks.
You can use Selenium to automate multiple clicks (forward/back) as needed, and also to identify specific data items.
The getting-started guide below has a very good example.
[1] https://selenium-python.readthedocs.io/getting-started.html
Update: If the website is static, another approach is to use requests with BeautifulSoup. Here is an example: https://medium.com/#itylergarrett.tag/learning-web-scraping-with-python-requests-beautifulsoup-936e6445312

I am trying to scrape this website for all of the documents that are produced from the drop down forms

The site I am trying to scrape has drop-down menus that end up producing a link to a document. The end documents are what I want. I have no experience with web scraping, so I don't know where to start. I have tried adapting this to my needs, but I couldn't get it working. I also tried to adapt this.
I know basically I need to:
for state in states:
    select state
    for type in types:
        select type
        select wage_area_radio button
        for area in wage_area:
            select area
            for locality in localities:
                select locality
                for date in dates:
                    select date
                    get_document
I just haven't found anything that works for me yet. Is there a tool better than Selenium for this? I am currently trying to bend it to my will using the code from my second example as a starter.
Depending on your coding skills and knowledge of HTTP, I would try one of two things. Note that scraping this site appears slightly non-trivial because different form options appear based on what was previously selected, and because there are a lot of AJAX calls happening.
1) Follow the HTTP requests (especially the AJAX ones) that are being made in something like Chrome DevTools. You'll get a good understanding of how the final URL is being formed and how to construct it yourself. In particular, it looks like the last POST to AFWageScheduleYearSelected is the one that generates the final url. Then, you can make these calls yourself in a Python HTTP library to get the documents.
2) Use something like PhantomJS (http://phantomjs.org/) which is a headless browser. I don't have experience scraping with Selenium, but my understanding is that it is more of a testing/automation tool. In any case, PhantomJS is pretty easy to get up and running and you can basically click page elements, fill out forms, etc.
If you do end up using PhantomJS (or any other browser-like tool), you'll run into issues with the AJAX calls that populate the forms. Basically, you'll end up trying to fill out forms that don't yet exist on the page because the data is still being sent over the network. The easiest way to get around this is to just set timeouts (of say 2 seconds) in between each form field that you fill out. The alternative to using timeouts (which may be unreliable and slow) is to continuously poll the page until the AJAX call is finished.
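The polling alternative can be sketched in a browser-agnostic way. The helper below, `wait_until`, is a hypothetical name; `condition` stands for whatever check tells you the form field has arrived (for instance, a function that looks the element up in your browser tool and returns it, or None if it is not there yet). Selenium ships a similar facility in its WebDriverWait class.

```python
import time


def wait_until(condition, timeout=10.0, interval=0.25):
    """Call `condition` repeatedly until it returns a truthy value or time runs out."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result  # hand back whatever the check found
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        time.sleep(interval)


# Stand-in for an AJAX-populated field: it "appears" on the third check.
attempts = {"count": 0}


def field_is_ready():
    attempts["count"] += 1
    return "locality-dropdown" if attempts["count"] >= 3 else None


found = wait_until(field_is_ready, timeout=2.0, interval=0.01)
```

Compared with a fixed 2-second sleep, this returns as soon as the field is present and fails loudly if it never shows up.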

Scraping information from a flash object on a website using python or any other method

I was just wondering if it is possible to scrape information from this website that is contained in a Flash file (http://www.tomtom.com/lib/doc/licensing/coverage/).
I am trying to get all the text from the different components of this website.
Can anyone suggest a good starting point in Python, or any simpler method?
I believe the following blog post answers your question well. The author had the same need, to scrape Flash content using Python. And the same problem came up. He realized that he just needed to instantiate a browser (even just an in-memory one that did not even display to the screen) and then scrape its output. I think this could be a successful approach for what you need, and he makes it easy to understand.
http://blog.motane.lu/2009/06/18/pywebkitgtk-execute-javascript-from-python/

Is there any way to find urls folders?

I am kind of asking a weird question, but I am making a spider and I am wondering whether there is any way to find the folders under a certain URL, like:
mysite.com/drupal
mysite.com/wordpress
mysite.com/abc
Is there any way to find this kind of information?
Web sites don't typically advertise their entire set of URLs. You can try a few things:
Read the main page, and follow the links on the page. Each leads to another page, which contains links, and so on.
Guess at common folder names.
Examine the robots.txt file if the site has one. You should be a good citizen and not retrieve pages it forbids you to.
Try to get the site's sitemap, as this shows: http://support.google.com/webmasters/bin/answer.py?hl=en&answer=156184
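For the robots.txt and sitemap suggestions, Python's standard library has a parser. The sketch below feeds it a hypothetical robots.txt for mysite.com (the rules and the user agent name "my-spider" are invented for illustration) and asks which paths may be fetched and whether a sitemap is advertised.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real spider would download it from the site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Sitemap: https://mysite.com/sitemap.xml
"""


def build_parser(robots_text):
    """Parse robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)
allowed = parser.can_fetch("my-spider", "https://mysite.com/drupal")
blocked = parser.can_fetch("my-spider", "https://mysite.com/private/page")
sitemaps = parser.site_maps()  # available in Python 3.8 and later
```

The Disallow lines themselves can also hint at folders that exist, but they are exactly the ones you should not crawl.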
If you implement a traditional spider, it will only traverse URLs it finds in the content as it goes along. You could try a dictionary or every-string-in-the-universe check at every directory level, but that wouldn't be playing nice.
So, the short answer is "no".

Read all pages within a domain

I am using the urllib library to fetch pages. Typically I have the top-level domain name & I wish to extract some information from EVERY page within that domain. Thus, if I have xyz.com, I'd like my code to fetch the data from xyz.com/about etc. Here's what I am using:
import urllib,re
htmlFile = urllib.urlopen("http://www.xyz.com/"+r"(.*)")
html = htmlFile.read()
...............
This does not do the trick for me though. Any ideas are appreciated.
Thanks.
-T
I don't know why you would expect domain.com/(.*) to work. You need to have a list of all the pages (dynamic or static) within that domain. Your python program cannot automatically know that. This knowledge you must obtain from elsewhere, either by following links or looking at the sitemap of the website.
As a footnote, scraping is a slightly shady business. Always make sure, no matter what method you employ, that you are not violating any terms and conditions.
You are trying to use a regular expression on the web server. Turns out, web servers don't actually support this kind of format, so it's failing.
To do what you're trying to do, you need to implement a spider: a program that downloads a page, finds all the links within it, and decides which of them to follow. It then downloads each of those pages and repeats.
Some things to watch out for: looping, multiple links that end up pointing at the same page, links going outside of the domain, and getting banned from the web server for spamming it with thousands of requests.
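A minimal sketch of such a spider, using only the standard library. To keep it self-contained, the download step is passed in as a `fetch` callable (an assumption of this example), and a small in-memory dictionary stands in for the website. The `seen` set handles loops and duplicate links, the host comparison keeps the crawl inside the domain, and `max_pages` caps the number of requests.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(start_url, fetch, max_pages=100):
    """Breadth-first crawl limited to the start URL's host.

    `fetch` is any callable that takes a URL and returns the page's HTML.
    """
    host = urlparse(start_url).netloc
    seen = {start_url}  # guards against loops and duplicate links
    queue = deque([start_url])
    pages = {}
    while queue and len(pages) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        pages[url] = html
        collector = LinkCollector()
        collector.feed(html)
        for href in collector.links:
            absolute, _fragment = urldefrag(urljoin(url, href))
            if urlparse(absolute).netloc != host:
                continue  # stay inside the domain
            if absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return pages


# Fake two-page site so the sketch runs without network access.
SITE = {
    "http://www.xyz.com/": '<a href="/about">About</a> <a href="http://other.com/">Other</a>',
    "http://www.xyz.com/about": '<a href="/">Home</a> <a href="/about#team">Team</a>',
}
pages = crawl("http://www.xyz.com/", SITE.__getitem__)
```

A real `fetch` would download the URL, handle errors, and pause between requests so the server is not flooded.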
In addition to @zigdon's answer, I recommend you take a look at the Scrapy framework.
CrawlSpider will help you implement crawling quite easily.
Scrapy has this functionality built in, so there is no need to follow links recursively yourself. It works asynchronously and handles the heavy lifting for you. Just specify your domain, your search terms, and how deep you want it to search, i.e. up to the whole site.
http://doc.scrapy.org/en/latest/index.html
