Are bots different from crawlers from python Django point of view - python

Actually, I am confused by the terminology. I am studying Scrapy, and I think it is for crawling a website and extracting data.
But I want to write Python programs that do what actual users do, i.e. automate tasks.
E.g. go to www.myblah.com, get the cheapest product in some category, and if its price is less than my preset amount, send me an email.
Now I don't know whether this type of thing comes under crawling or something else.
Can I do that in Scrapy, or are there other libraries for this kind of task?

Scrapy is a framework that can be used to create a bot or a crawler (aka spider). A crawler is a specific kind of bot, but a bot isn't necessarily a crawler. Crawlers are defined by being designed to explore the graph of pages (nodes) and their embedded URLs (edges), although they may be restricted from following particular URLs.
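To make the graph picture concrete, here is a minimal sketch of a breadth-first crawl using only the standard library. It is not how Scrapy is implemented; the `crawl` function, the `fetch` callback, and the three-page `SITE` dictionary are all made up for illustration, and the pages are held in memory rather than fetched over HTTP.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collect the href of every <a> tag in an HTML document."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def crawl(start_url, fetch, allowed_prefix):
    """Breadth-first walk of the link graph; returns {page: [linked pages]}."""
    graph = {}
    queue = deque([start_url])
    while queue:
        url = queue.popleft()
        if url in graph:
            continue  # already visited
        parser = LinkExtractor()
        parser.feed(fetch(url))
        # Resolve relative links, then drop the ones the crawler may not follow.
        targets = [urljoin(url, href) for href in parser.links]
        targets = [t for t in targets if t.startswith(allowed_prefix)]
        graph[url] = targets
        queue.extend(t for t in targets if t not in graph)
    return graph


# Hypothetical three-page site (assumption: stands in for real HTTP responses).
SITE = {
    "http://example.test/": '<a href="/a">A</a> <a href="/b">B</a>',
    "http://example.test/a": '<a href="/b">B</a> <a href="http://other.test/">out</a>',
    "http://example.test/b": '<a href="/">home</a>',
}

graph = crawl("http://example.test/", SITE.__getitem__, "http://example.test/")
for page, links in graph.items():
    print(page, "->", links)
```

The `allowed_prefix` filter plays the role of the restriction on which URLs may be followed; a real crawler would pass a `fetch` that performs an HTTP request.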
Automating tasks is the work of a bot. Whether Scrapy will work for that depends on what information is needed and how actions have to be taken. Many sites are heavy on JavaScript these days, so if the bot can't execute JavaScript and correctly provide cookies, it may not be able to get the information it needs for its task. Some web automation tasks may require a browser plug-in or even GUI automation tools.
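For the price-alert task in the question, the bot logic itself can be small. The sketch below uses only the standard library and works on an HTML string, so it leaves open how the page is obtained (Scrapy, an HTTP client, or a real browser). The `class="price"` markup, the sample HTML, the threshold, and the e-mail address are assumptions for illustration, not details of any real site.

```python
import re
from email.message import EmailMessage
from html.parser import HTMLParser


class PriceExtractor(HTMLParser):
    """Collect numeric prices from elements with class="price" (assumed markup)."""

    def __init__(self):
        super().__init__()
        self._in_price = False
        self.prices = []

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if "price" in classes:
            self._in_price = True

    def handle_endtag(self, tag):
        self._in_price = False

    def handle_data(self, data):
        if self._in_price:
            # Drop thousands separators, then take the first number in the text.
            match = re.search(r"\d+(?:\.\d+)?", data.replace(",", ""))
            if match:
                self.prices.append(float(match.group()))


def cheapest_price(html):
    """Return the lowest price found in the page, or None if there is none."""
    parser = PriceExtractor()
    parser.feed(html)
    return min(parser.prices) if parser.prices else None


def build_alert(html, threshold, recipient):
    """Return an EmailMessage if the cheapest price is below threshold, else None."""
    price = cheapest_price(html)
    if price is None or price >= threshold:
        return None
    msg = EmailMessage()
    msg["To"] = recipient
    msg["Subject"] = f"Price alert: {price:.2f}"
    msg.set_content(f"The cheapest product is now {price:.2f} (limit {threshold:.2f}).")
    return msg


# Hypothetical category page (assumption: real markup will differ).
SAMPLE = (
    '<ul><li>Mug <span class="price">$12.50</span></li>'
    '<li>Lamp <span class="price">$8.75</span></li></ul>'
)

alert = build_alert(SAMPLE, threshold=10.0, recipient="me@example.test")
print(alert["Subject"] if alert else "no alert")
```

To actually send the message, it could be handed to `smtplib.SMTP(...).send_message(alert)` with the details of your own mail server. If the prices are filled in by JavaScript, the HTML passed to `build_alert` would have to come from something that executes it.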

Related

How to Figure Out The Relationship among Webpages in An Individual Website?

Normally, when we write test scripts, whether in Robot Framework or Behat, we manually find the locator of each element of interest on each web page using the browser's Developer Tools. We can write the scripts because we know the flow, or steps, of each testing scenario. However, I want to find an automated way to extract the relationships among the web pages of a single website without any manual input, and to generate a script from it.
The purpose of this question is to find a way to automatically detect the relationships between the pages of a website in order to derive test steps, and from there to build a generator of automated test scripts (for Robot Framework or Behat) for a website developed with the Laravel framework, which normally contains many interrelated pages.
Do you guys have any ideas about this?
Please tell me if you have any.
Please leave your comments below.

What information do I need when scraping a website that requires logging in?

I want to access my business' database on some site and scrape it using Python (I'm using Requests and BS4, I can go further if needed). But I couldn't.
Can someone provide us with info and simple resources on how to scrape such sites.
I'm not talking about providing usernames and passwords. The site requires much more than this.
How do I know what information my script must provide aside from the username and password (e.g. how do I know that I must provide, say, an auth token)?
How do I deal with the site when there are no HTTP URLs, only hrefs of the form javascript:__doPostBack?
And in this regard, how do I get from the login page to the page I want (the one behind the aforementioned javascript:__doPostBack)?
Are the libraries I'm using enough? or do you recommend using—and learning in my case—something else?
Your help is greatly appreciated and thanked.
You didn't mention what you use for scraping, but since this sounds like a lot of the interaction on this site is based on client-side code, I'd suggest using a real browser to do the scraping, and interacting with the site not using low-level HTTP requests but using client side interaction (such as typing in elements or clicking buttons). This way, you don't need to worry about what form data to send or how to get the URLs of links yourself.
One recommended way of doing this is to use BeautifulSoup with Selenium / WebDriver. There are multiple resources on how to do this, for example: How can I parse a website using Selenium and Beautifulsoup in python?
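If a real browser is not an option, the `javascript:__doPostBack` links from the question can sometimes be reproduced by hand. In ASP.NET Web Forms, `__doPostBack(target, argument)` copies its two arguments into the hidden fields `__EVENTTARGET` and `__EVENTARGUMENT` and submits the page's form along with its other hidden fields (such as `__VIEWSTATE`). The sketch below builds that form data with the standard library only. The page markup, the control name, and the helper `postback_payload` are hypothetical, and whether a given site accepts such a request is an assumption to verify against the real traffic in the browser's network tab.

```python
import re
from html.parser import HTMLParser


class HiddenFieldParser(HTMLParser):
    """Collect name/value pairs of <input type="hidden"> fields."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "input" and (attrs.get("type") or "").lower() == "hidden":
            name = attrs.get("name")
            if name:
                self.fields[name] = attrs.get("value") or ""


# Matches __doPostBack('target','argument') inside an href.
POSTBACK_RE = re.compile(r"__doPostBack\('([^']*)'\s*,\s*'([^']*)'\)")


def postback_payload(page_html, href):
    """Build the form data that clicking a javascript:__doPostBack link would submit."""
    match = POSTBACK_RE.search(href)
    if match is None:
        raise ValueError("not a __doPostBack link: %r" % href)
    parser = HiddenFieldParser()
    parser.feed(page_html)
    payload = dict(parser.fields)
    payload["__EVENTTARGET"] = match.group(1)
    payload["__EVENTARGUMENT"] = match.group(2)
    return payload


# Hypothetical page fragment (assumption: real pages carry much longer values).
PAGE = """
<form method="post" action="Default.aspx">
  <input type="hidden" name="__VIEWSTATE" value="abc123" />
  <input type="hidden" name="__EVENTVALIDATION" value="xyz789" />
  <input type="text" name="search" value="" />
  <a href="javascript:__doPostBack('ctl00$grid$lnkReport','')">Report</a>
</form>
"""

payload = postback_payload(PAGE, "javascript:__doPostBack('ctl00$grid$lnkReport','')")
print(payload)
```

With Requests, the payload would then be sent to the form's action URL from the same `requests.Session()` that performed the login, e.g. `session.post(url, data=payload)`, so that the login cookies are included.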

Can I scrape this kind of website?

I am learning Python right now and I want to level up my knowledge of it, particularly scraping. I am now using Scrapy and starting to use it along with Splash. I wanted to scrape a more challenging website, an airline website, "https://www.airasia.com/en/home.page?cid=1". One of my web developer friends told me that it would be impossible to scrape this type of website, since no regular JSON or XML files are returned for the data to be scraped. He said the data can only be accessed using an API (he said something about a RESTful API). I don't quite believe him. So as not to waste my time: if someone can confirm it, I would be happy; if someone says it can be scraped, I would be happier if they could give me tips on how to scrape it, and happiest if they could show proof.
Many thanks.
Almost ANY website can be scraped but some websites are trickier than others.
Instead of Scrapy, I would recommend using a better alternative called Selenium which happens to have a library for python as well.
Long story short: you start a web browser through a driver, navigate to the page of your choice, and simulate user interactions such as clicking, entering data in forms, and submitting them. You will also be able to run JavaScript functions.
You might also want to do some research on legal constraints to ensure your operation is not unlawful. For instance, refer to case law of Ryanair Ltd v PR Aviation BV (Case C-30/14 CJEU).
You have 2 options: use their API, if they provide one, to make HTTP requests and obtain data from their servers.
Or use a Python scraping / web testing framework, e.g. Scrapy or Selenium, to scrape their website directly from a Python program.
Scrapy will be harder than Selenium on this website because a lot of the content is dynamic and will require custom code to trigger. Selenium should be easy to use.

Python 3 - way to interact with a web page

I have experience with reading and extracting HTML source 'as given' (via urllib.request), but now I would like to perform browser-like actions (such as filling in a form or selecting a value from an option menu) and then, of course, read the resulting HTML as usual. I did come across some modules that seemed promising, but they turned out not to support Python 3.
So I'm asking for the name of a library/module that does this, or a pointer to a solution within the standard library if one exists and I failed to see it.
Many websites (like Twitter, Facebook or Wikipedia) provide APIs that let developers hook into their apps and perform activities programmatically. Whatever website you wish to act on through code, first look for its API support.
In case you need to do web scraping, you can use Scrapy. At the time of writing it only supported Python up to 2.7.x (Python 3 support was added in later Scrapy releases). In any case, you can use requests as the HTTP client and Beautiful Soup for HTML parsing.
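For plain HTML forms that do not depend on JavaScript, the standard library the asker already uses can be enough: filling in a form amounts to sending its field names and values to the form's action URL, and choosing an option in a menu means sending that option's `value` under the `<select>` element's name. A minimal sketch, in which the URL and field names are placeholders rather than anything from a real site:

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_form_request(action_url, fields):
    """Encode form fields the way a browser does for a POST submission."""
    body = urlencode(fields).encode("ascii")
    return Request(
        action_url,
        data=body,
        headers={"Content-Type": "application/x-www-form-urlencoded"},
        method="POST",
    )


request = build_form_request(
    "http://example.test/search",            # hypothetical form action
    {"query": "blue mug", "category": "2"},  # hypothetical field names and values
)
print(request.get_method(), request.full_url, request.data)
# To submit it and read the resulting HTML, pass `request` to
# urllib.request.urlopen() and decode the response body.
```

The field names and the action URL come from the `<form>` markup of the page in question; forms that rely on JavaScript or hidden tokens need more than this.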

Python get data from secured website

I'd like to know if there is a way to get information from my banking website with Python. I'd like to retrieve my card history and display it, and possibly save it into a text document each month.
I have found the URLs etc. to log in and get the information from the website, which work from a browser, but I have been using urllib2 to "open" the web pages from Python, and I have a feeling it's not working because of some cookie or session issue.
I can get any information I want with urllib2 from a website that does not require a login, and then save the actual HTML and go through it later, but I can't on my bank's website.
Any help would be appreciated.
This is a part of web scraping:
Web scraping is a standard task that can serve various needs.
Scraping data from a secure website means HTTPS.
Handling HTTPS is not a problem with mechanize and BeautifulSoup.
urllib2 with a cookielib.CookieJar (via HTTPCookieProcessor) also works fine.
If managing the cookies is the problem, then I would recommend mechanize.
Considering the case of your bank site:
I would recommend not playing with your account.
If you must, it's not as easy as with a normal secure or non-secure site.
These sites are designed to withstand such scripts.
Problems that you would face with this:
Bank sites will surely have a CAPTCHA, which is almost impossible to bypass with a script unless you employ a lot of rocket science and effort.
Another problem you will definitely face is JavaScript. Standard scripting solutions focus on managing cookies, HTML parsing, etc. To follow links driven by JavaScript you will have to process the JS in your Python script, which again needs a lot of effort.
Then there is AJAX, which is also driven by JavaScript and fetches data from the server after the page loads.
So this task will take a lot of effort.
Also, if you try this you risk being locked out of your account, since banking sites are quick to block access after 3-4 unsuccessful attempts at login, CAPTCHA, etc.
So, think before you do.
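For the cookie and session part of the problem, here is a minimal sketch of the cookie-jar approach mentioned above, written for Python 3, where `urllib2` and `cookielib` became `urllib.request` and `http.cookiejar`. The helper name `make_session_opener`, the URLs, and the field names in the comments are placeholders; it does not address CAPTCHAs or JavaScript.

```python
import http.cookiejar
import urllib.request


def make_session_opener():
    """Return (opener, jar); the opener stores cookies and resends them on later requests."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    return opener, jar


opener, jar = make_session_opener()

# Hypothetical usage (URLs and form fields are placeholders):
#   opener.open("https://example.test/login", data=b"user=...&password=...")
#   page = opener.open("https://example.test/history").read()
# Cookies set by the login response stay in `jar` and are sent with the second request.
```

Every request must go through the same `opener`; calling `urllib.request.urlopen` directly would bypass the jar and lose the session.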
