Scrape website for certain text and send e-mail - python

I am fairly new to coding, but I have a use case that I want to realize with Python.
Use Case:
I'd like a Python script that checks a certain website for a certain string. If the string is found, I'd like to get some kind of notification or e-mail. Could you point me to the right library or to code snippets showing how I can realize this?

The use case you mention (apart from sending the notification/e-mail) is called web scraping. I have listed different Python modules below that will help you learn web scraping; a minimal sketch that checks a page for a string and sends an e-mail follows the list.
Beautiful Soup - Beautiful Soup is a Python library for pulling data out of HTML and XML files.
Docs: https://www.crummy.com/software/BeautifulSoup/bs4/doc/
requests - Requests is an elegant and simple HTTP library for Python.
Docs: https://docs.python-requests.org/en/master/
Selenium - Selenium is an umbrella project for a range of tools and libraries that enable and support the automation of web browsers.
Docs: https://selenium-python.readthedocs.io/
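For the notification part, here is a minimal sketch, assuming a hypothetical target URL, search string, and SMTP server, that combines requests with the standard-library smtplib to send an e-mail when the string is found:

import smtplib
from email.message import EmailMessage

import requests

URL = "https://example.com/page"   # hypothetical page to watch
SEARCH_STRING = "in stock"         # hypothetical string to look for

def check_and_notify():
    # Fetch the page; raise an error on a bad HTTP status.
    response = requests.get(URL, timeout=30)
    response.raise_for_status()

    if SEARCH_STRING in response.text:
        msg = EmailMessage()
        msg["Subject"] = "String found on " + URL
        msg["From"] = "me@example.com"   # hypothetical sender
        msg["To"] = "me@example.com"     # hypothetical recipient
        msg.set_content("'%s' was found on %s" % (SEARCH_STRING, URL))

        # The SMTP host, port, and credentials are placeholders;
        # use your e-mail provider's settings.
        with smtplib.SMTP("smtp.example.com", 587) as server:
            server.starttls()
            server.login("me@example.com", "app-password")
            server.send_message(msg)

if __name__ == "__main__":
    check_and_notify()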

Related

Does requests rely on selenium?

I am running some crawls in order to test whether the results I get deviate. For this I created two test suites: the first one uses the requests and BeautifulSoup libraries, the other one is based on Selenium. I would like to find out whether pages detect both bots in the same way.
But I am still unsure whether I am right in assuming that requests and BeautifulSoup are independent of Selenium.
I hope it's not a dumb question, but I haven't found a proper answer yet (maybe because I used the wrong keywords). Any help would be appreciated.
Thanks in advance
I checked the requests documentation. I wrote a mail to the developer, without any answer. And of course I checked on Google. I found something about Scrapy vs. Selenium, but well... are requests and BeautifulSoup related to Scrapy?
The Python requests module does not use Selenium, and neither does BeautifulSoup. Both run independently of a web browser; both are pure Python implementations.
Selenium automates browsers, so you'll present to a web service with the user-agent string and other variables of the browser you choose to drive with Selenium.
You can specify a user-agent string when you use requests, or not, but requests doesn't drive a browser, so by default you present as a different entity from the user-agent perspective, e.g. python-requests/2.18.4.
BeautifulSoup is a parser, and so it presents to a web service through another library (like requests); it doesn't have its own native presentation.
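To illustrate the user-agent point, a short sketch (the URL and browser-like User-Agent string are placeholders) comparing what requests presents by default with an explicit header:

import requests

url = "https://example.com"  # placeholder URL

# Default: the server sees something like "python-requests/2.x.x".
r1 = requests.get(url)
print(r1.request.headers["User-Agent"])

# With an explicit User-Agent header you present as a browser would.
headers = {"User-Agent": "Mozilla/5.0 (X11; Linux x86_64) Firefox/115.0"}
r2 = requests.get(url, headers=headers)
print(r2.request.headers["User-Agent"])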

Python - how to trick an anti-adblock filter while scraping?

I'm trying to download the content of a website using Python's urllib, but I have a problem: the site has an adblock filter, and the only thing I can get is text asking me to disable adblock... Is there any way to trick this kind of filter?
Thanks in advance. (:
JavaScript Parsing
The issue you are running into is a JavaScript filter that loads data after the page has loaded. The message warning that you are using adblock is there in the raw HTML and is completely static. It is replaced once a JavaScript call has verified whether adblock is present. There are several ways you can get around this; however, each requires finding some way of loading JavaScript.
Solution(s)
There are several solutions to your problem; they are discussed in more detail in the question linked under Further Research below.
Embed a web browser within an application and simulate a normal user.
Remotely connect to a web browser and automate it from a scripting language.
Use special-purpose add-ons to automate the browser.
Use a framework/library to simulate a complete browser.
As you can see, each one in some way requires emulating a browser and its DOM objects. Since there are several libraries to help you accomplish this, I highly recommend you look into the question linked below.
The following is a code example from the same page that shows how to retrieve the URLs on a page that generates them via JavaScript. It relies on the HtmlUnit library from Gargoyle Software, which is a Java library, so the snippet must be run under Jython rather than CPython.
import com.gargoylesoftware.htmlunit.WebClient as WebClient
import com.gargoylesoftware.htmlunit.BrowserVersion as BrowserVersion

def main():
    webclient = WebClient(BrowserVersion.FIREFOX_3_6)  # create a WebClient that emulates Firefox 3.6
    url = "http://www.gartner.com/it/products/mq/mq_ms.jsp"
    page = webclient.getPage(url)  # fetch the page and execute its JavaScript
    articles = page.getByXPath("//table[@id='mqtable']//tr/td/a")  # collect all hyperlinks in the table
    print(articles)

if __name__ == '__main__':
    main()
However,
I am not sure why you are scraping a webpage, or what website you are scraping it from. However, it is against the terms and conditions of various sites to automate such data collection, and I advise you to review those terms before you get yourself into any trouble.
Further Research
If you are looking for a more generic answer to your question (e.g. "How can I load JavaScript with Python?"), I highly recommend looking at previous answers on this site, because they offer some really good insight into the matter:
Web-scraping JavaScript page with Python

python read data from web application

Hi everyone. I would like to create a small bot to help me with binary options.
I am not an expert in Python, but I can already read a web page and retrieve a precise value from a tag. However, the information I need is in a web application, not in the source code of the web page. I am not an expert on web applications, and I want to know whether I can retrieve a value displayed by the application with Python.
Here is a link to a picture of the application:
"http://comparatif-options-binaires.fr/wp-content/uploads/2014/05/optionweb-analyse-technique-ow-school.jpg"
I think the problem you face here is that the value you need is being loaded via JavaScript of some sort (though without access to the web application, and with no code from you to look at, I can't be sure).
Expanding on @sabhirams' answer (and agreeing that requests and BeautifulSoup are excellent libraries for static text), I would recommend having a look at the following:
Selenium - automates web-browser usage from Python (so it will run the full JavaScript).
Webkit - another headless-browser route for Python, with some excellent SO questions on the matter.
Ghost.py - attempts to make the WebKit experience a little smoother.
pyv8 - something a bit more bare-bones; pyv8 is a Python wrapper for the Google V8 JavaScript engine and can be used to run the JavaScript on the page and, hopefully, extract the element you need.
And if you're not set on Python, why not look at using a JavaScript headless browser, such as PhantomJS, to run the JavaScript.
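As a minimal sketch of the Selenium route (assuming Selenium 4.x with a driver-managed Firefox; the URL and element id are hypothetical placeholders), something like this renders the JavaScript and reads the displayed value:

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Firefox()  # starts a real browser under Selenium's control
try:
    driver.get("http://example.com/app")  # hypothetical web-application URL

    # "current-value" is a hypothetical id of the element showing the value.
    value = driver.find_element(By.ID, "current-value").text
    print(value)
finally:
    driver.quit()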
As mentioned before: respect others when scraping, and be aware there may be consequences if you are caught.
I think you mean you want to build a script which can scrape a given webpage, and extract a certain value out of a given target DOM element.
I don't currently have the time to write this code for you, but it should be rather simple to put together. Here are some modules that might help you:
requests - use this to fetch a given webpage into your Python script.
BeautifulSoup - feed the fetched HTML to Beautiful Soup, and you will be able to manipulate the page more easily (fetch your variable of interest, etc.).
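Put together, a small sketch (the URL and CSS selector are hypothetical placeholders) might look like this:

import requests
from bs4 import BeautifulSoup

url = "http://example.com/page"  # hypothetical page to scrape

html = requests.get(url, timeout=30).text
soup = BeautifulSoup(html, "html.parser")

# "span.price" is a hypothetical CSS selector for the target DOM element.
element = soup.select_one("span.price")
if element is not None:
    print(element.get_text(strip=True))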
EDIT:
As pointed out in the comments above, please consider the Terms and Conditions of the web-service you are trying to scrape info from.

Python 3 - way to interact with a web page

I have experience with reading and extracting HTML source 'as given' (via urllib.request), but now I would like to perform browser-like actions (such as filling in a form or selecting a value from an option menu) and then, of course, read the resulting HTML as usual. I did come across some modules that seemed promising, but they turned out not to support Python 3.
So I'm here asking for the name of a library/module that does what I want, or a pointer to a solution within the standard library if one exists and I failed to see it.
Many websites (like Twitter, Facebook, or Wikipedia) provide APIs to let developers hook into their apps and perform activities programmatically. For whatever website you wish to act on through code, first look for API support.
If you need to do web scraping, you can use Scrapy, but at the time of writing it only supported Python up to 2.7.x. Alternatively, you can use requests as the HTTP client and Beautiful Soup for HTML parsing; both work on Python 3.
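For the form-filling part specifically, here is a minimal sketch (the URL and form-field names are hypothetical placeholders) that uses requests to POST form data and read the resulting HTML; inspect the page's <form> element to find the real action URL and input names:

import requests

url = "http://example.com/search"  # hypothetical form action URL
form_data = {
    "query": "web scraping",  # hypothetical text input
    "category": "books",      # hypothetical value from an option menu
}

response = requests.post(url, data=form_data, timeout=30)
response.raise_for_status()

html = response.text  # the resulting HTML, read as usual
print(html[:200])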

I want to scrape a site using GAE and post the results into a Google Entity

I want to scrape this URL: https://www.xstreetsl.com/modules.php?searchSubmitImage_x=0&searchSubmitImage_y=0&SearchLocale=0&name=Marketplace&SearchKeyword=business&searchSubmitImage.x=0&searchSubmitImage.y=0&SearchLocale=0&SearchPriceMin=&SearchPriceMax=&SearchRatingMin=&SearchRatingMax=&sort=&dir=asc
go into each of the links and extract various pieces of information (e.g. permissions, prims, etc.), then post the results into an Entity on Google App Engine.
What is the best way to go about it?
Chris
For normalizing HTML with a pure-Python library, I have had better experiences with html5lib than with BeautifulSoup.
However, you just want to extract simply structured information, which doesn't actually require normalizing the HTML. I have a few scraping apps on Google App Engine that use my own XPath library, which works with raw HTML.
Or you can use regular expressions for one-off jobs.
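As a quick illustration of the regular-expression route (the HTML fragment and pattern are hypothetical), a one-off extraction can be as simple as:

import re

html = '<td class="prims">Prims: 42</td>'  # hypothetical raw HTML fragment

# A one-off pattern for this exact markup; regexes are brittle for
# general HTML, so reserve them for quick, throwaway jobs.
match = re.search(r'Prims:\s*(\d+)', html)
if match:
    print(match.group(1))  # prints "42"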
There are several nice screen scraping libraries you can use in Python.
Perhaps the easiest one for knocking up an advanced scraper is Scrapy. It relies on Twisted for its main engine but provides a very easy-to-use interface for implementing custom scraping code.
Otherwise, you can do it more manually with something like BeautifulSoup, or with Mechanize, which provides a "mechanical" browser implementation (see the sketch below).
BeautifulSoup and Mechanize should both work out of the box on App Engine, which provides a wrapper around httplib and urllib that uses urlfetch as a backend. Only Scrapy will be problematic, due to its use of Twisted. [Thanks to Nick Johnson for the update.]
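As a minimal sketch of the Mechanize route (the link filter is a hypothetical way to pick out item links, and the search URL is shortened here; use the full URL from the question):

import mechanize

# Shortened for readability; use the full search URL from the question.
SEARCH_URL = "https://www.xstreetsl.com/modules.php?name=Marketplace&SearchKeyword=business"

br = mechanize.Browser()
br.set_handle_robots(False)  # skip robots.txt handling for this sketch

br.open(SEARCH_URL)

# Snapshot the result links, then visit each one and read its HTML.
for link in list(br.links()):
    if link.url and "modules.php" in link.url:  # hypothetical item-link filter
        html = br.follow_link(link).read()
        # ...extract permissions, prims, etc. from html here...
        br.back()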
