I have a small project to web-scrape prices from some stores using requests and Beautiful Soup.
One of these stores now checks for a JavaScript-enabled browser. I have found that it uses Magento 1.
I have tried the requests-html library, but it did not work.
I know that I can use Selenium with headless Chrome. I have tested it and it works fine.
But since I run the project in the cloud, it would be much easier and less expensive to use requests.
On Stack Overflow there is a post where one solution was suggested: send the request with the cookie that the website sets when it checks whether JavaScript is enabled.
https://stackoverflow.com/a/66917621
Has anyone tried this solution on Magento 1?
I suspect that the data I need to scrape is not generated by JavaScript on the page; the check for a JavaScript-enabled browser simply prevents the page from loading.
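For reference, the linked answer's approach would look roughly like the sketch below. The cookie name, value, and URLs are hypothetical; you would copy the real ones from your browser's developer tools after the JavaScript check passes:

```python
import requests

# Hypothetical sketch of the linked answer's approach: the cookie name,
# value, and URLs below are placeholders -- find the real ones by loading
# the page in a browser with DevTools open and copying the cookie that
# the JavaScript check sets.
session = requests.Session()
session.cookies.set("js_check", "passed", domain="store.example.com")

resp = session.get(
    "https://store.example.com/some-product.html",
    headers={"User-Agent": "Mozilla/5.0"},  # many shops also check the UA
)
print(resp.status_code)
```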
Related
I'm trying to scrape a JS-intensive website, and I want to do this by loading the page, rendering the JS, and then doing the scraping with BeautifulSoup.
I want to do this, if possible, on a Raspberry Pi.
I've tried using Requests-HTML, which worked fine for a while, but I couldn't get Python 3.7 to run it on the Raspberry Pi due to memory limitations.
Then I tried using Selenium, with both Geckodriver, which isn't available for ARMv6 and which I don't know how to compile for the Raspberry Pi, and PhantomJS, which I couldn't get to work properly.
You have two options:
Use a tool that can mimic a browser and render the JS parts of the page, like Selenium.
Examine the page and see which requests to the backend are fetching the data you need (see the sketch below).
I would go with the first approach if you need a general-purpose tool that can scrape data from all sorts of pages.
And I would go with the second if you need to scrape pages from a few specific sites and be done with it. If you provide a link, I can try to help you with this.
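A hedged sketch of the second approach; the endpoint URL, parameters, and JSON field names below are all invented, stand-ins for whatever you find in your browser's network tab:

```python
import requests

# Open the page in a browser, watch the Network tab (filter by XHR/Fetch),
# and look for a request that returns the data as JSON. Then call that
# endpoint directly. Everything below is a placeholder example.
API_URL = "https://www.example-shop.com/api/products"

resp = requests.get(API_URL, params={"category": "books", "page": 1})
resp.raise_for_status()

for item in resp.json().get("products", []):
    print(item.get("name"), item.get("price"))
```

This route is usually far lighter on memory than a headless browser, which matters on a Raspberry Pi.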
I am learning Python right now and I want to level up my knowledge of it, particularly scraping. I am currently using Scrapy and getting started with Splash alongside it. I wanted to scrape a more challenging website, an airline website: "https://www.airasia.com/en/home.page?cid=1". One of my web-developer friends told me it would be impossible to scrape this type of website, since no regular JSON or XML files are returned for the data to be scraped. He said the data can only be accessed using an API (he said something about a RESTful API). I somehow don't believe him. So as not to waste my time, if someone can CONFIRM it, I would be happy; and if someone says it can be scraped, I would be happier still if they gave me tips on how to scrape it, and hands down if they showed proof.
Many thanks.
Almost ANY website can be scraped, but some websites are trickier than others.
Instead of Scrapy, I would recommend a better-suited alternative called Selenium, which happens to have a Python library as well.
Long story short: you start a web browser through a driver, navigate to the page of your choice, and simulate user interactions such as clicking, entering data into forms, and submitting them. You can also run JavaScript functions.
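A minimal sketch of that workflow, assuming chromedriver is installed; the URL and element IDs are placeholders, not the real page's:

```python
from selenium import webdriver
from selenium.webdriver.common.by import By

# Assumes chromedriver is on your PATH; the URL and element locators
# below are placeholders for whatever page you are automating.
driver = webdriver.Chrome()
try:
    driver.get("https://www.example.com/search")

    box = driver.find_element(By.ID, "origin")  # hypothetical form field
    box.send_keys("MNL")
    driver.find_element(By.ID, "search-button").click()

    # You can also run JavaScript directly in the page:
    title = driver.execute_script("return document.title;")
    print(title)
finally:
    driver.quit()
```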
You might also want to do some research on legal constraints to ensure your operation is not unlawful. For instance, refer to case law of Ryanair Ltd v PR Aviation BV (Case C-30/14 CJEU).
You have two options: use their API, if they have one, to make HTTP requests and obtain data directly from their servers;
or use a Python scraping / web-testing framework, e.g. Scrapy or Selenium, to scrape their website directly from a Python program.
Scrapy will be harder than Selenium on this website, because a lot of the content is dynamic and will require custom code to trigger. Selenium should be easy to use.
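For that dynamic content, Selenium's explicit waits usually do the trick. A sketch under the assumption that the data ends up in elements like the one below; the CSS selector is a guess, so inspect the live page for the real one:

```python
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
try:
    driver.get("https://www.airasia.com/en/home.page?cid=1")

    # Wait up to 30 s for the JS-rendered elements to appear.
    # The selector ".fare-price" is hypothetical -- inspect the page.
    fares = WebDriverWait(driver, 30).until(
        EC.presence_of_all_elements_located((By.CSS_SELECTOR, ".fare-price"))
    )
    for fare in fares:
        print(fare.text)
finally:
    driver.quit()
```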
Hi everyone. I would like to create a small bot to help me with binary options.
I am not an expert in Python, but I can already read a web page and
retrieve a precise value from a tag.
However, the information I need is in a web application,
not in the source code of the web page. I am not an expert on web applications, and I want to know whether I can retrieve a value displayed in the application with Python.
Here is a link to a picture of the application:
"http://comparatif-options-binaires.fr/wp-content/uploads/2014/05/optionweb-analyse-technique-ow-school.jpg"
I think the problem you face here is that the value you need is being loaded via JavaScript of some sort (though without access to the web application, and with no visible effort from your code, I can't be sure).
Expanding on @sabhirams' answer (and agreeing that requests and BeautifulSoup are excellent libraries for static text), I would recommend having a look at the following:
Selenium - automates web-browser usage in Python (so it will run the full JavaScript).
Webkit - another headless browser for Python, with some excellent SO questions on the matter.
Ghost.py - attempts to make the WebKit experience a little smoother.
pyv8 - something a bit more bare-bones: pyv8 is a Python wrapper for the Google V8 JavaScript engine, and it can be used to run the JavaScript on the page and, hopefully, extract the element you need.
And if you're really not settled on Python, why not look at using a JavaScript headless browser, such as PhantomJS, to run the JavaScript.
As mentioned before: respect others when scraping, and be aware there may be consequences if you are caught.
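To give a feel for the pyv8 route, here is a tiny sketch, assuming PyV8 builds on your platform (which can be fiddly); the JavaScript snippet stands in for whatever script the page actually runs:

```python
import PyV8  # Python wrapper around Google's V8 engine

# Evaluate a small piece of JavaScript and read the result back in Python.
# In a real scrape you would feed in the page's own script instead.
ctxt = PyV8.JSContext()
ctxt.enter()
try:
    result = ctxt.eval("var price = 19.99; price * 2;")
    print(result)  # 39.98
finally:
    ctxt.leave()
```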
I think you mean that you want to build a script which can scrape a given webpage and extract a certain value out of a given target DOM element.
I don't currently have the time to write this code for you, but it should be rather simple to put together. Here are some modules which might help you:
Requests - use this to fetch a given webpage into your Python script.
BeautifulSoup - feed the above "DOM text" to Beautiful Soup, and you will be able to manipulate the HTML page more easily (fetch your var of interest, etc.).
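Put together, a minimal sketch of that pipeline; the URL and the CSS selector are placeholders for your actual target:

```python
import requests
from bs4 import BeautifulSoup

# Fetch the page, then parse it and pull out one target element.
resp = requests.get("https://www.example.com/quote-page")
resp.raise_for_status()

soup = BeautifulSoup(resp.text, "html.parser")
value = soup.select_one("span.price")  # hypothetical selector
if value is not None:
    print(value.get_text(strip=True))
```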
EDIT:
As pointed out in the comments above, please consider the Terms and Conditions of the web service you are trying to scrape info from.
I need to develop a web app for extracting prices of books from different e-commerce sites, like Amazon and Homeshop18: the user enters a book name in the interface, and the app displays all the information.
My questions are:
1) How do I pass that query to the Amazon site's search box, so that I get only the pages relevant to the query instead of crawling the whole site?
2) What can be used to develop this application, BeautifulSoup or Scrapy? APIs are not available for all e-commerce sites.
I am new to Python, so any help will be highly appreciated.
I personally use BeautifulSoup to parse web pages, but beware: it's a bit slow if you have to parse pages massively. I know that lxml is faster, but a bit less coder-friendly. To guess the right parameters (either for an HTTP GET or POST) for getting the result page you want, you should proceed like this:
Switch on the Firebug plugin for Firefox or the integrated inspector for Chrome.
Go to the web page you're interested in and do the search.
Go into Firebug/the inspector to see the parameters of the HTTP request Firefox or Chrome sent to the website.
Reproduce the request in your Python script, for example using urllib (a sketch follows this list).
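A hedged sketch of that last step; the URL and parameter names are invented, so copy the real ones from the inspector:

```python
import urllib.parse
import urllib.request

# Rebuild the search request you observed in Firebug/the inspector.
# Both the URL and the parameter name here are placeholders.
params = urllib.parse.urlencode({"field-keywords": "python web scraping"})
url = "https://www.example.com/s?" + params

req = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
with urllib.request.urlopen(req) as resp:
    html = resp.read().decode("utf-8", errors="replace")

print(html[:200])  # first few characters of the result page
```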
There is another way to guess the right HTTP GET or POST parameters: use a network analyzer like Wireshark. This is a more detailed approach, but it will feel like
finding a needle in a haystack once you have used the tools built into Firefox/Chrome.
We have developed a web-based application, with user login etc., and we developed a Python application that has to get some data from this page.
Is there any way to make Python communicate with the system's default browser?
Our main goal is to open a webpage with the system browser and get the HTML source code from it. We tried Python's webbrowser module and opened the web page successfully, but we could not get the source code. We also tried urllib2, but in that case I think we would have to use the system default browser's cookies etc., and I don't want to do that, for security reasons.
https://pypi.python.org/pypi/selenium
You can try to use Selenium; it was made for testing, but nothing prevents you from using it for other purposes.
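For your specific goal of reading the HTML source, a minimal sketch, assuming Firefox plus geckodriver are installed; the URL is a placeholder:

```python
from selenium import webdriver

# Drive a real browser and read back the rendered HTML.
driver = webdriver.Firefox()  # assumes geckodriver is on your PATH
try:
    driver.get("https://www.example.com/dashboard")  # placeholder URL
    html = driver.page_source  # the page's HTML after JS has run
    print(len(html), "characters of HTML")
finally:
    driver.quit()
```

This sidesteps the cookie problem entirely, because Selenium drives its own browser profile rather than borrowing the system browser's session.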
If your website is navigable without JavaScript, then you could try Mechanize or zope.testbrowser. These tools offer a higher-level API than urllib2, letting you do things like follow links on pages and fill out HTML forms.
This can be helpful for navigating a site that uses cookie-based authentication with HTML forms for login, for example.
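A hedged sketch of such a login with mechanize; the URL and the form field names are assumptions about your login page:

```python
import mechanize

br = mechanize.Browser()
br.open("https://www.example.com/login")  # placeholder URL

# Fill in and submit the first form on the page; the field names
# "username"/"password" are assumptions about your login form.
br.select_form(nr=0)
br["username"] = "myuser"
br["password"] = "secret"
resp = br.submit()

html = resp.read()  # HTML of the page behind the login
print(len(html))
```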
Have a look at the nltk module; it has some utilities for looking at web pages and getting text. There's also BeautifulSoup, which is a bit more elaborate. I'm currently using both to scrape web pages for a learning algorithm; they're pretty widely used modules, so you can find lots of hints here :)
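For the text-extraction part, a small sketch with BeautifulSoup; the URL is a placeholder:

```python
import requests
from bs4 import BeautifulSoup

# Pull the visible text out of a page, e.g. as input to a learning algorithm.
resp = requests.get("https://www.example.com/article")  # placeholder URL
soup = BeautifulSoup(resp.text, "html.parser")

for tag in soup(["script", "style"]):  # drop non-visible content
    tag.decompose()

text = soup.get_text(separator=" ", strip=True)
print(text[:300])
```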