I need to develop a web app that extracts book prices from different e-commerce sites such as Amazon and HomeShop18. When the user enters a book name in the interface, it displays all the information.
My questions are:
1) How do I pass that query to the Amazon site's search box so that I get only the pages relevant to the query, instead of crawling the whole site?
2) What can be used to develop this application: BeautifulSoup or Scrapy? APIs are not available for all e-commerce sites.
I am new to Python, so any help will be highly appreciated.
I personally use BeautifulSoup to parse web pages, but beware: it is a bit slow if you have to parse a large number of pages. I know that lxml is faster but a bit less coder-friendly.
To work out the right parameters (for either an HTTP GET or POST) to get the result page you want, you should proceed like this:
Switch on the Firebug plugin for Firefox or the integrated inspector for Chrome.
Go to the web page you're interested in and do the search.
Go into Firebug or the inspector to see the parameters of the HTTP request that Firefox or Chrome sent to the website.
Reproduce the request in your Python script, for example using urllib.
There is another way to find the right HTTP GET or POST parameters: use a network analyzer like Wireshark. This is a more detailed approach, but it feels more like finding a needle in a haystack once you have used the tools in Firefox/Chrome.
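The last step above (reproducing the request with urllib) could look roughly like the sketch below. The URL and the parameter names `q` and `category` are placeholders, not Amazon's real ones; copy the actual values from the request shown in the browser's inspector. The request is only built here, not sent.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical values: replace with the URL and parameter names that the
# browser's network inspector shows for the real search request.
SEARCH_URL = "https://www.example.com/search"


def build_search_request(book_title):
    """Build (but do not send) a GET request that mimics the site's search form."""
    query = urlencode({"q": book_title, "category": "books"})
    return Request(
        SEARCH_URL + "?" + query,
        # Some sites reject urllib's default User-Agent; this value is an example.
        headers={"User-Agent": "Mozilla/5.0 (book-price-demo)"},
    )


request = build_search_request("Learning Python")
print(request.full_url)
# To actually send it: urllib.request.urlopen(request).read()
```

The returned HTML can then be handed to BeautifulSoup or lxml for parsing.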
I have a small project that scrapes prices from some stores using requests and Beautiful Soup.
One of these stores now checks for a JavaScript-enabled browser. I have found that it uses Magento 1.
I have tried the requests-html library, but it did not work.
I know that I can use Selenium with headless Chrome. I have tested it and it works fine.
But since I run the project in the cloud, it would be much easier and less expensive to use requests.
On Stack Overflow there is a post where one solution was suggested: send the request with the cookie that the website sets when it checks whether JavaScript is enabled.
https://stackoverflow.com/a/66917621
Has anyone tried this solution on Magento 1?
I suspect that the data I need to scrape is not generated by JavaScript on the page, but that the check for a JavaScript-enabled browser prevents the page from loading.
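A minimal sketch of the cookie idea described above, using only the standard library so it runs anywhere. The cookie name `js_enabled`, its value, and the URL are made up for illustration; the real ones would have to be read from the browser's developer tools after the JavaScript check has passed. Whether this is enough for a Magento 1 store is not something the sketch can confirm.

```python
from urllib.request import Request

# Hypothetical cookie name/value: look up the real ones in the browser's
# developer tools once the JavaScript check has been passed.
JS_CHECK_COOKIE = {"js_enabled": "1"}


def request_with_cookie(url, cookies):
    """Build a request that carries the given cookies in its Cookie header."""
    cookie_header = "; ".join(f"{name}={value}" for name, value in cookies.items())
    return Request(url, headers={"Cookie": cookie_header})


req = request_with_cookie("https://shop.example.com/product/123", JS_CHECK_COOKIE)
print(req.get_header("Cookie"))
```

With the requests library the same thing is usually written as `requests.get(url, cookies=JS_CHECK_COOKIE)`.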
I want to access my business' database on some site and scrape it using Python (I'm using Requests and BS4, I can go further if needed). But I couldn't.
Can someone provide us with info and simple resources on how to scrape such sites.
I'm not talking about providing usernames and passwords. The site requires much more than this.
How do I know what information my script has to provide aside from the username and password (e.g. how do I know that I must provide, say, an auth token)?
How do I deal with the site when there are no HTTP URLs, only hrefs of the form javascript:__doPostBack?
And in this regard, how do I get from the login page to the page I want (the one behind the aforementioned javascript:__doPostBack)?
Are the libraries I'm using enough, or do you recommend using (and, in my case, learning) something else?
Your help is greatly appreciated.
You didn't mention what you use for scraping, but since this sounds like a lot of the interaction on this site is based on client-side code, I'd suggest using a real browser to do the scraping, and interacting with the site not using low-level HTTP requests but using client side interaction (such as typing in elements or clicking buttons). This way, you don't need to worry about what form data to send or how to get the URLs of links yourself.
One recommended method of doing this would be to use BeautifulSoup with Selenium / WebDriver. There are multiple resources on how to do this, for example: How can I parse a website using Selenium and Beautifulsoup in python?
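For completeness, here is a sketch of the lower-level alternative to a real browser, for the javascript:__doPostBack links in the question. It assumes the site is an ASP.NET Web Forms application, where `__doPostBack(target, argument)` copies its two arguments into the hidden fields `__EVENTTARGET` and `__EVENTARGUMENT` and submits the page's form together with the other hidden fields such as `__VIEWSTATE`. The sample HTML, control names, and helper names are invented for illustration.

```python
import re
from html.parser import HTMLParser


class HiddenFieldParser(HTMLParser):
    """Collect the name/value pairs of all <input type="hidden"> fields."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        is_hidden = (attrs.get("type") or "").lower() == "hidden"
        if tag == "input" and is_hidden and attrs.get("name"):
            self.fields[attrs["name"]] = attrs.get("value") or ""


def postback_form_data(page_html, href):
    """Build the form data that clicking a javascript:__doPostBack link submits."""
    match = re.search(r"__doPostBack\('([^']*)'\s*,\s*'([^']*)'\)", href)
    if match is None:
        raise ValueError("not a __doPostBack link: " + href)
    parser = HiddenFieldParser()
    parser.feed(page_html)
    data = dict(parser.fields)  # hidden fields such as __VIEWSTATE are echoed back
    data["__EVENTTARGET"], data["__EVENTARGUMENT"] = match.groups()
    return data


# Invented sample page, standing in for the HTML of the real site.
SAMPLE_HTML = """
<form method="post" action="Orders.aspx">
  <input type="hidden" name="__VIEWSTATE" value="dDwtMTIz" />
  <input type="hidden" name="__EVENTVALIDATION" value="abc123" />
  <a href="javascript:__doPostBack('ctl00$grid$next','')">Next</a>
</form>
"""

form_data = postback_form_data(
    SAMPLE_HTML, "javascript:__doPostBack('ctl00$grid$next','')"
)
print(form_data)
```

The resulting dictionary would be sent as a POST to the form's action URL in the same logged-in session (for example with `requests.Session().post(url, data=form_data)`). If the site also depends on other client-side code, the browser-based approach above is the safer choice.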
I am learning Python right now and I want to level up my knowledge of it, particularly scraping. I am now using Scrapy and starting to use it along with Splash. I wanted to scrape a more challenging website, an airline website: "https://www.airasia.com/en/home.page?cid=1". One of my web developer friends told me that it would be impossible to scrape this type of website, since no regular JSON or XML files are returned for the data to be scraped. He said the data can only be accessed using an API (he said something about a RESTful API). Somehow I don't believe him. So as not to waste my time: if someone can confirm it, I would be happy; if someone says it can be scraped, I would be happier still if they could give me tips on how to scrape it, and best of all if they could show proof.
Many thanks.
Almost ANY website can be scraped but some websites are trickier than others.
Instead of Scrapy, I would recommend using a better alternative called Selenium, which has a library for Python as well.
Long story short: you start a web browser through a driver, navigate to the page of your choice, and simulate user interactions such as clicking, entering data in forms, and submitting them. You will also be able to run JavaScript functions.
You might also want to do some research on legal constraints to ensure your operation is not unlawful. For instance, refer to case law of Ryanair Ltd v PR Aviation BV (Case C-30/14 CJEU).
You have 2 options: Use their API, if they have one, to make HTTP requests and obtain data from their servers.
Or use a Python scraping / web testing framework, e.g. Scrapy or Selenium, to scrape their website directly from a Python program.
Scrapy will be harder than Selenium on this website because a lot of the content is dynamic and will require custom code to trigger. Selenium should be easy to use.
We have developed a web-based application, with user login etc., and we have developed a Python application that has to get some data from this page.
Is there any way for Python to communicate with the system default browser?
Our main goal is to open a web page with the system browser and get the HTML source code from it. We tried Python's webbrowser module and opened the web page successfully, but could not get the source code. We also tried urllib2, but in that case I think we would have to use the system default browser's cookies etc., and I don't want to do this, for security reasons.
https://pypi.python.org/pypi/selenium
You can try Selenium. It was designed for testing, but nothing prevents you from using it for other purposes.
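A minimal sketch of getting the HTML source this way. The helper name `get_page_source` is made up; it only relies on the `get`, `page_source`, and `quit` members that Selenium WebDriver objects provide, and it takes the driver as a parameter so the function itself does not depend on which browser is installed.

```python
def get_page_source(driver, url):
    """Load url in the browser controlled by driver and return the rendered HTML."""
    try:
        driver.get(url)
        return driver.page_source
    finally:
        driver.quit()  # always close the browser, even if loading fails


# Usage with Selenium (requires the selenium package and a local Firefox);
# the URL is a placeholder:
#     from selenium import webdriver
#     html = get_page_source(webdriver.Firefox(), "https://intranet.example.com/report")
```

Note that Selenium starts its own browser session rather than reusing the one already open on the desktop, so the script would have to perform the login itself before reading the page.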
If your web site is navigable without Javascript, then you could try Mechanize or zope.testbrowser. These tools offer a higher level API than urllib2, letting you do things like follow links on pages and fill out HTML forms.
This can be helpful in navigating a site that uses cookie based authentication with HTML forms for login, for example.
Have a look at the nltk module---they have some utilities for looking at web pages and getting text. There's also BeautifulSoup, which is a bit more elaborate. I'm currently using both to scrape web pages for a learning algorithm---they're pretty widely used modules, so that means you can find lots of hints here :)
I'd like to know if there is a way to get information from my banking website with Python. I'd like to retrieve my card history and display it, and possibly save it into a text document each month.
I have found the URLs etc. to log in and get the information from the website, which work from a browser, but I have been using urllib2 to "open" the web pages from Python and I have a feeling it's not working because of some cookie or session issue.
I can get any information I want from a website that does not require a login with urllib2, and then save the actual HTML and go through it later, but I can't on my bank's website.
Any help would be appreciated.
This is a part of web scraping:
Web scraping is a standard task that can serve various needs.
Scraping data out of a secure website means HTTPS.
Handling HTTPS is not a problem with mechanize and BeautifulSoup.
urllib2 with HTTPCookieProcessor and a cookielib.CookieJar also works fine.
If managing the cookies is the problem, then I would recommend mechanize.
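A sketch of the cookie-handling setup mentioned above, written with the Python 3 module names (`urllib.request` and `http.cookiejar`; in Python 2 these were `urllib2` and `cookielib`). The login URL and the form field names `username` and `password` are placeholders; a real site's names have to be read from its login form. Nothing is sent over the network here.

```python
from http.cookiejar import CookieJar
from urllib.parse import urlencode
from urllib.request import HTTPCookieProcessor, Request, build_opener

# One opener shared by all requests, so cookies set at login are reused later.
cookie_jar = CookieJar()
opener = build_opener(HTTPCookieProcessor(cookie_jar))


def build_login_request(login_url, username, password):
    """Build the POST request for a login form (field names are hypothetical)."""
    form = urlencode({"username": username, "password": password}).encode("ascii")
    return Request(login_url, data=form)


login = build_login_request("https://www.example.com/login", "alice", "secret")
print(login.get_method())  # POST, because the request carries form data
# opener.open(login) would send it and store the session cookies in cookie_jar;
# later opener.open(...) calls then send those cookies back automatically.
```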
Considering the case of your bank's site:
I would recommend not playing with your account.
If you must, then it's not as easy as any normal secure or non-secure site.
These sites are designed to withstand such scripts.
Problems that you would face with this:
Bank sites will most likely have a CAPTCHA, which is almost impossible to bypass with a script unless you employ a lot of rocket science and effort.
Another problem that you will definitely face is JavaScript. Standard scripting solutions focus on managing cookies, HTML parsing, etc. To follow JavaScript-driven links you would have to process the JavaScript in your Python script, which again needs a lot of effort.
Then there is AJAX, which also relies on JavaScript and fetches data from the server after the page has loaded.
So this task will require a lot of effort.
Also, if you try this you risk losing access to your account, since banking sites are quick to block an account after 3-4 unsuccessful login or CAPTCHA attempts.
So, think before you do.