I have experience with reading and extracting HTML source 'as given' (via urllib.request), but now I would like to perform browser-like actions (such as filling in a form, or selecting a value from an option menu) and then, of course, read the resulting HTML code as usual. I did come across some modules that seemed promising, but they turned out not to support Python 3.
So I'm here asking for the name of a library/module that does what I want, or a pointer to a solution within the standard library if one exists and I failed to see it.
Many websites (like Twitter, Facebook, or Wikipedia) provide APIs to let developers hook into their apps and perform activities programmatically. For whatever website you wish to interact with through code, first look for their API support.
In case you need to do web scraping, you can use Scrapy, but it only supports Python up to 2.7.x. Either way, you can use requests as an HTTP client and Beautiful Soup for HTML parsing.
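A minimal sketch of that requests + Beautiful Soup combination, here submitting a form by POSTing its fields and parsing the result; the URL and field names are placeholders, not any real site's endpoints:

```python
# Minimal requests + BeautifulSoup sketch: POST a form's fields,
# then parse the returned HTML. URL and field names are placeholders.
import requests
from bs4 import BeautifulSoup

payload = {"username": "me", "password": "secret"}  # hypothetical form fields
response = requests.post("https://example.com/login", data=payload)
response.raise_for_status()

soup = BeautifulSoup(response.text, "html.parser")
print(soup.title.string if soup.title else "no <title> found")
```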
I want to access my business's database on some site and scrape it using Python (I'm using Requests and BS4; I can go further if needed), but so far I haven't been able to.
Can someone provide info and simple resources on how to scrape such sites?
I'm not talking about providing usernames and passwords. The site requires much more than this.
How do I know what info I am required to provide in my script aside from the username and password (e.g., how do I know that I must provide, say, an auth token)?
How do I deal with the site when there are no HTTP URLs, but hrefs in the form of javascript:__doPostBack?
And in this regard, how do I get from the login page to the page I want (the one behind the aforementioned javascript:__doPostBack)?
Are the libraries I'm using enough, or do you recommend using (and, in my case, learning) something else?
Your help is greatly appreciated. Thanks.
You didn't mention what you use for scraping, but since it sounds like a lot of the interaction on this site is based on client-side code, I'd suggest using a real browser to do the scraping, interacting with the site not through low-level HTTP requests but through client-side interaction (such as typing into elements or clicking buttons). This way, you don't need to worry about what form data to send or how to get the URLs of links yourself.
One recommended method of doing this would be to use BeautifulSoup with Selenium / WebDriver. There are multiple resources on how to do this, for example: How can I parse a website using Selenium and Beautifulsoup in python?
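A rough sketch of that combination, assuming geckodriver (or chromedriver) is installed: Selenium renders the page, BeautifulSoup parses the result. The URL is illustrative only:

```python
# Drive a real browser with Selenium, then hand the rendered HTML
# to BeautifulSoup. Assumes geckodriver is on your PATH.
from selenium import webdriver
from bs4 import BeautifulSoup

driver = webdriver.Firefox()
try:
    driver.get("https://example.com/login")
    # ... type into fields and click buttons here as needed ...
    soup = BeautifulSoup(driver.page_source, "html.parser")
    print(soup.title.string if soup.title else "no <title> found")
finally:
    driver.quit()
```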
I am learning Python right now and I want to level up my knowledge of it, particularly scraping. I am now using Scrapy and getting started with Splash alongside it. I wanted to scrape a more challenging website, an airline site: "https://www.airasia.com/en/home.page?cid=1". A web developer friend of mine told me it would be impossible to scrape this type of website, since no regular JSON or XML files are returned with the data to be scraped; he said the data can only be accessed through an API (he said something about a RESTful API). I somehow don't believe him. So as not to waste my time, I would be happy if someone could CONFIRM this, and happier still if someone could say it can be scraped and give me tips on how, with proof being even better.
Many thanks.
Almost ANY website can be scraped, but some websites are trickier than others.
Instead of Scrapy, I would recommend using a better alternative called Selenium, which has a library for Python as well.
Long story short: you will start a web browser in the form of a driver, navigate to the page of your choice, and simulate user interactions such as clicking, entering data into forms, and submitting them. You will also be able to run JavaScript functions.
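For illustration, a minimal sketch of that flow with Selenium's Python bindings; the URL and the field name "q" are hypothetical, and chromedriver is assumed to be on your PATH:

```python
# Start a driver, fill in a form field, submit it, and run some JavaScript.
# The URL and the field name "q" are hypothetical examples.
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
try:
    driver.get("https://example.com/search")
    box = driver.find_element(By.NAME, "q")
    box.send_keys("flights")
    box.submit()
    # Arbitrary JavaScript can be executed in the page context:
    height = driver.execute_script("return document.body.scrollHeight;")
    print(height)
finally:
    driver.quit()
```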
You might also want to do some research on legal constraints to ensure your operation is not unlawful. For instance, refer to the case law of Ryanair Ltd v PR Aviation BV (Case C-30/14, CJEU).
You have two options. First, use their API, if they provide one, to make HTTP requests and obtain data and information from their servers.
Or use a Python scraping / web-testing framework, e.g. Scrapy or Selenium, to scrape their website directly from a Python program.
Scrapy will be harder than Selenium on this website, because a lot of the content is dynamic and would require custom code to trigger. Selenium should be easy to use.
Hi everyone, I would like to create a small bot to help me with binary options.
I am not an expert in Python, but I can already read a web page and retrieve a precise value from a tag. However, the information I need is in a web application, not in the source code of the web page. I am not an expert on web applications, and I want to know whether I can retrieve a value displayed in the application with Python.
Here is a link to a picture of the application:
"http://comparatif-options-binaires.fr/wp-content/uploads/2014/05/optionweb-analyse-technique-ow-school.jpg"
I think the problem you face here is that the value you need is being loaded via JavaScript of some sort (though without access to the web application, and with no code shown, I can't be sure).
Expanding on @sabhirams' answer (and agreeing that requests and BeautifulSoup are excellent libraries for static text), I would recommend having a look at the following:
Selenium - automates web browser usage in Python (so it will run the full JavaScript); see the sketch after this list.
Webkit - a browser engine with Python bindings that can also be used headlessly; there are some excellent SO questions on the matter.
Ghost.py - attempts to make the Webkit experience a little smoother.
pyv8 - something a bit more barebones: pyv8 is a Python wrapper for the Google V8 JavaScript engine and can be used to run the JavaScript on the page and, hopefully, extract the element you need.
And if you're really not settled on Python, why not look at using a JavaScript headless browser such as PhantomJS to run the JavaScript.
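As one concrete example of the Selenium option above, here is a hedged sketch of reading a JavaScript-rendered value in headless mode; the element id "price" and the URL are purely illustrative:

```python
# Headless Selenium sketch: let the browser run the page's JavaScript,
# then read the rendered value. Element id and URL are illustrative.
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By

options = Options()
options.add_argument("--headless")
driver = webdriver.Chrome(options=options)
try:
    driver.get("https://example.com/app")
    print(driver.find_element(By.ID, "price").text)
finally:
    driver.quit()
```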
As mentioned before: respect others when scraping, and be aware there may be consequences if you are caught.
I think you mean you want to build a script which can scrape a given webpage, and extract a certain value out of a given target DOM element.
I don't currently have the time to write this code for you, but it should be rather simple to put together. Here are some modules which might help you (a short sketch follows the list):
Requests - use this to fetch a given webpage into your Python script.
BeautifulSoup - feed the above "DOM text" to Beautiful Soup, and you will be able to more easily manipulate the HTML page (fetch your variable of interest, etc.).
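A short sketch of that two-step flow, under the assumption that the value lives in static HTML; the URL and the selector are placeholders:

```python
# Fetch the page with requests, then pull a value out of a target
# DOM element with BeautifulSoup. Selector and URL are placeholders.
import requests
from bs4 import BeautifulSoup

html = requests.get("https://example.com/page").text
soup = BeautifulSoup(html, "html.parser")
element = soup.select_one("#target-value")  # hypothetical element id
if element is not None:
    print(element.get_text(strip=True))
```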
EDIT:
As pointed out in the comments above, please consider the Terms and Conditions of the web service you are trying to scrape info from.
We have developed a web-based application, with user login etc., and we developed a Python application that has to get some data from this page.
Is there any way for Python to communicate with the system's default browser?
Our main goal is to open a webpage with the system browser and get the HTML source code from it. We tried Python's webbrowser module and opened the web page successfully, but we could not get the source code. We also tried urllib2, but in that case I think we would have to use the system default browser's cookies etc., and I don't want to do this for security reasons.
https://pypi.python.org/pypi/selenium
You can try to use Selenium; it was made for testing, but nothing prevents you from using it for other purposes.
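A minimal sketch of that idea: Selenium drives its own browser instance (a real browser, though not the system default), you log in through the page itself, and the rendered source is exposed via driver.page_source. The URL is a placeholder:

```python
# Selenium keeps its own session and cookies, so no system-browser
# cookies are touched. The URL below is a placeholder.
from selenium import webdriver

driver = webdriver.Firefox()
try:
    driver.get("https://example.com/protected-page")
    # ... perform the login through the rendered page here ...
    html = driver.page_source
    print(html[:200])
finally:
    driver.quit()
```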
If your website is navigable without JavaScript, then you could try Mechanize or zope.testbrowser. These tools offer a higher-level API than urllib2, letting you do things like follow links on pages and fill out HTML forms.
This can be helpful in navigating a site that uses cookie-based authentication with HTML forms for login, for example.
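A hedged sketch of that Mechanize flow (Python 2 era, matching the urllib2 context here); the form and field names are hypothetical:

```python
# Open a login page, fill the HTML form, and submit it with Mechanize.
# Form name and field names are hypothetical.
import mechanize

br = mechanize.Browser()
br.open("http://example.com/login")
br.select_form(name="loginform")  # hypothetical form name
br["username"] = "me"
br["password"] = "secret"
response = br.submit()
print(response.read()[:200])
```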
Have a look at the nltk module---they have some utilities for looking at web pages and getting text. There's also BeautifulSoup, which is a bit more elaborate. I'm currently using both to scrape web pages for a learning algorithm---they're pretty widely used modules, so you can find lots of hints out there :)
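For the BeautifulSoup half of that, a tiny sketch of pulling the visible text out of a page (the URL is a placeholder):

```python
# Extract a page's visible text with BeautifulSoup for downstream
# text processing. The URL is a placeholder.
import requests
from bs4 import BeautifulSoup

soup = BeautifulSoup(requests.get("https://example.com").text, "html.parser")
text = soup.get_text(separator=" ", strip=True)
print(text[:300])
```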
I am building a web application as a college project (using Python), where I need to read content from websites. It could be any website on the internet.
At first I thought of using screen scrapers like BeautifulSoup or lxml to read content (data written by authors), but I am unable to extract the content with a single piece of logic, as each website is built to different standards.
So I thought of using RSS/Atom (via Universal Feed Parser), but I could only get the content summary! And I want all the content, not just the summary.
So, is there a way to have one logic by which we can read any website's content using libraries like BeautifulSoup, lxml, etc.?
Or should I use the APIs provided by the websites?
My job becomes easy if it's a Blogger blog, as I can use the Google Data API, but the trouble is: would I need to write code against every different API to do the same job?
What is the best solution?
Using the website's public API, when it exists, is by far the best solution. That is precisely why the API exists; it is the way the website administrators say "use our content". Scraping may work one day and break the next, and it does not imply the website administrators' consent to have their content reused.
You could look into content extraction libraries; I've used Full-Text RSS (PHP) and Boilerpipe (Java). Both have a web service available, so you can easily test whether they meet your requirements. You can also download and run them yourself, and further modify their behavior on individual sites.