I am running some crawls to test whether the results I get deviate. For this I created two test suites: the first uses the requests and BeautifulSoup libraries, the other is based on Selenium. I would like to find out whether pages detect both bots in the same way.
But I am still unsure whether I am right in assuming that requests and BeautifulSoup are independent of Selenium.
I hope it's not a dumb question, but I haven't found a proper answer yet (maybe because of the wrong keywords). However, any help would be appreciated.
Thanks in advance
I checked the requests documentation. I wrote an e-mail to the developer, without any answer. And of course I searched on Google. I found something about Scrapy vs. Selenium, but well... are requests and BeautifulSoup related to Scrapy?
The Python requests module does not use Selenium, and neither does BeautifulSoup. Both run independently of a web browser; both are pure Python implementations.
Selenium automates browsers, so you'll present to a web service with the user-agent string and other variables that the browser you choose to drive with Selenium would present.
You can specify a user-agent string when you use requests, or not; but requests doesn't drive a browser, so from the user-agent perspective you'll present as a different entity, like python-requests/2.18.4.
BeautifulSoup is a parser, and so it presents to a web service through another library (like requests); it doesn't have its own native presentation.
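To make the difference concrete, here is a minimal sketch; httpbin.org is used purely as an echo service that reports back the headers it receives, and the browser-like header value is just an example:

```python
import requests

# With no explicit header, the server sees requests' default user agent,
# e.g. "python-requests/2.18.4" (the version depends on your install)
resp = requests.get("https://httpbin.org/user-agent")
print(resp.json())

# Supplying a browser-like User-Agent header (example value only)
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"}
resp = requests.get("https://httpbin.org/user-agent", headers=headers)
print(resp.json())
```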
I am fairly new to coding, but I have a use case that I want to realize with Python.
Use Case:
I'd like to have a Python script that checks a certain website for a certain string. If the string is found, I'd like to get some kind of notification or e-mail. Maybe you could point me to the right library or to code snippets showing how I can realize this.
The use case you mention (apart from sending the notification/e-mail) is called web scraping. The Python modules listed below will help you learn web scraping; a minimal sketch combining them follows the list.
Beautiful Soup - Beautiful Soup is a Python library for pulling data out of HTML and XML files.
Docs: https://www.crummy.com/software/BeautifulSoup/bs4/doc/
requests - Requests is an elegant and simple HTTP library for Python.
Docs: https://docs.python-requests.org/en/master/
Selenium - Selenium is an umbrella project for a range of tools and libraries that enable and support the automation of web browsers.
Docs: https://selenium-python.readthedocs.io/
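As promised, a minimal sketch tying requests and Beautiful Soup to the notification part, assuming a plain SMTP server; the URL, search string, addresses, and credentials are all placeholders:

```python
import smtplib
from email.message import EmailMessage

import requests
from bs4 import BeautifulSoup

URL = "https://example.com/page"   # page to watch (placeholder)
TARGET = "in stock"                # string to look for (placeholder)

def page_contains_target():
    resp = requests.get(URL, timeout=10)
    resp.raise_for_status()
    # Strip the HTML tags so we search the visible text only
    text = BeautifulSoup(resp.text, "html.parser").get_text()
    return TARGET in text

def send_mail():
    msg = EmailMessage()
    msg["Subject"] = f"Found '{TARGET}' on {URL}"
    msg["From"] = "me@example.com"            # placeholder address
    msg["To"] = "me@example.com"
    msg.set_content(f"The string '{TARGET}' appeared on {URL}.")
    with smtplib.SMTP("smtp.example.com", 587) as smtp:  # placeholder server
        smtp.starttls()
        smtp.login("user", "password")        # placeholder credentials
        smtp.send_message(msg)

if page_contains_target():
    send_mail()
```

Run it from cron or a task scheduler if you want the check to happen periodically.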
I have a small project to scrape prices from some stores using requests and Beautiful Soup.
One of these stores is now checking for a JavaScript-enabled browser. I have found that it runs Magento 1.
I have tried the requests-html library, but it did not work.
I know that I can use Selenium with headless Chrome. I have tested it and it works fine.
But since I run the project in the cloud, it would be much easier and less expensive to use requests.
On Stack Overflow there is a post suggesting one solution: send the request with the cookie that the website sets when it checks whether JavaScript is enabled.
https://stackoverflow.com/a/66917621
Has anyone tried this solution on Magento 1?
I suspect that the data I need to scrape is not generated by JavaScript on the page, but that the check for a JavaScript-enabled browser prevents the page from loading.
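For reference, the approach from the linked answer boils down to something like the sketch below. The cookie name and value here are hypothetical; the real ones (and the CSS selector) would have to be read from the browser's dev tools for the specific store:

```python
import requests
from bs4 import BeautifulSoup

session = requests.Session()
session.headers["User-Agent"] = "Mozilla/5.0"  # present as a browser

# Hypothetical: the cookie the site's JavaScript check would normally set.
# Inspect a real browser session in dev tools to find the actual name/value.
session.cookies.set("js_enabled", "1", domain="store.example.com")

resp = session.get("https://store.example.com/some-product")
soup = BeautifulSoup(resp.text, "html.parser")
price = soup.select_one(".price")  # hypothetical selector
print(price.get_text(strip=True) if price else "price not found")
```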
I want to access my business's database on a site and scrape it using Python (I'm using Requests and BS4; I can go further if needed), but I couldn't manage it.
Can someone provide me with info and simple resources on how to scrape such sites?
I'm not talking about providing usernames and passwords. The site requires much more than this.
How do I know which info my script must provide aside from the username and password (e.g. how do I know that I must provide, say, an auth token)?
How do I deal with the site when there are no HTTP URLs, but hrefs in the form of javascript:__doPostBack?
And in this regard, how do I get from the login page to the page I want (the one behind the aforementioned javascript:__doPostBack)?
Are the libraries I'm using enough, or do you recommend using (and, in my case, learning) something else?
Your help is greatly appreciated.
Since it sounds like a lot of the interaction on this site is based on client-side code, I'd suggest using a real browser to do the scraping and interacting with the site not through low-level HTTP requests but through client-side interaction (such as typing into elements or clicking buttons). That way you don't need to worry about what form data to send or how to work out the URLs of links yourself.
One recommended way of doing this is to use BeautifulSoup with Selenium / WebDriver. There are multiple resources on how to do this, for example: How can I parse a website using Selenium and Beautifulsoup in python?
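A minimal sketch of that combination; the URL, field names, and link text are placeholders for whatever the real site uses:

```python
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()  # assumes a matching Chrome/chromedriver setup
driver.get("https://example.com/login")  # placeholder URL

# Interact like a user instead of hand-crafting POST bodies:
driver.find_element(By.NAME, "username").send_keys("my_user")  # placeholder names
driver.find_element(By.NAME, "password").send_keys("my_pass")
driver.find_element(By.ID, "login-button").click()

# javascript:__doPostBack links can simply be clicked; the browser executes them:
driver.find_element(By.LINK_TEXT, "Reports").click()  # placeholder link text

# Hand the rendered HTML to BeautifulSoup for parsing:
soup = BeautifulSoup(driver.page_source, "html.parser")
print(soup.title.get_text())
driver.quit()
```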
I am currently trying to write a small bot for a banking site that doesn't supply an API. The security of the login page seems a little more ingenious than I'd have expected: even though I don't see any significant difference between Chrome and Python, it doesn't let requests made by Python through (I accounted for things such as headers and cookies).
I've been wondering if there is a tool to record requests in Firefox/Chrome/any browser and replicate them in Python (or any other language). Think Selenium, but without the overhead of Selenium :p
You can use Selenium web drivers to have an actual browser make the requests for you.
In such cases, I usually check the request made by Chrome in the dev tools' "Network" tab. Then I right-click the request and copy it as cURL to run on the command line and see if it works. If it does, I can be fairly certain it can be reproduced with Python's requests package.
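Once the copied cURL command works, translating it to requests is mostly a matter of carrying over the same headers and cookies (there are also online converters that turn a copied cURL command into equivalent requests code). A rough sketch with made-up values:

```python
import requests

# Header and cookie values copied from the cURL command (all made up here)
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    "Accept": "application/json",
    "Referer": "https://bank.example.com/login",
}
cookies = {
    "SESSIONID": "abc123",  # session cookie captured from the browser
}

resp = requests.get("https://bank.example.com/api/accounts",
                    headers=headers, cookies=cookies)
print(resp.status_code, resp.text[:200])
```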
Look into PhantomJS or CasperJS. PhantomJS is a complete headless browser that can be programmed using JavaScript, and CasperJS is a scripting layer on top of it.
We have developed a web-based application with user login etc., and we developed a Python application that has to get some data from this page.
Is there any way for Python to communicate with the system's default browser?
Our main goal is to open a web page with the system browser and get the HTML source code from it. We tried Python's webbrowser module and opened the web page successfully, but could not get the source code. We also tried urllib2, but in that case I think we would have to use the system default browser's cookies etc., and I don't want to do that, for security reasons.
https://pypi.python.org/pypi/selenium
You can try Selenium; it was made for testing, but nothing prevents you from using it for other purposes.
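A minimal sketch of getting the rendered source through a Selenium-driven browser; the URL is a placeholder, and the driver for your browser must be available:

```python
from selenium import webdriver

driver = webdriver.Firefox()  # or webdriver.Chrome(); launches a real browser
driver.get("https://example.com/protected-page")  # placeholder URL

# After any login steps, the fully rendered HTML is available:
html = driver.page_source
print(html[:200])
driver.quit()
```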
If your website is navigable without JavaScript, then you could try Mechanize or zope.testbrowser. These tools offer a higher-level API than urllib2, letting you do things like follow links on pages and fill out HTML forms.
This can be helpful in navigating a site that uses cookie based authentication with HTML forms for login, for example.
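A minimal Mechanize sketch of that pattern; the URL, form fields, and link text are placeholders:

```python
import mechanize

br = mechanize.Browser()
br.set_handle_robots(False)  # depending on the site's robots.txt
br.open("https://example.com/login")  # placeholder URL

# Fill out and submit the login form; the form index and field names are
# guesses -- list the real ones with: for f in br.forms(): print(f)
br.select_form(nr=0)
br["username"] = "my_user"
br["password"] = "my_pass"
resp = br.submit()
print(resp.read()[:200])  # HTML of the page after login

# Follow a link by its visible text
br.follow_link(text="My data")  # placeholder link text
print(br.response().read()[:200])
```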
Have a look at the nltk module; it has some utilities for looking at web pages and getting text. There's also BeautifulSoup, which is a bit more elaborate. I'm currently using both to scrape web pages for a learning algorithm; they're pretty widely used modules, so you can find lots of hints out there :)