I would like to ask whether (and if so, how) it is possible to web scrape from a webpage that is already open. I don't want Python to open the webpage, and I also want to avoid pasting the URL into the code, because the webpage uses multi-step authorization (you need to approve the login in an app on your mobile phone, and I don't know how to get around this) and its link doesn't change, no matter where on the webpage you are.
Basically, I need to scrape specific info from this webpage into Excel (the exact format doesn't matter). The webpage URL is constant, and when it is pasted into the code, Python can't access it because of the authentication.
In other words: how do I make Python identify the webpage that is already running and scrape the info I need?
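One commonly suggested approach, assuming you can start Chrome yourself with its remote-debugging flag and log in manually (including the mobile approval) in that window, is to attach Selenium to that already-running, already-authenticated browser instead of letting Python open a new one. A minimal sketch; the port number and profile path are placeholders:

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

# Chrome must already be running, started beforehand with something like:
#   chrome.exe --remote-debugging-port=9222 --user-data-dir="C:\chrome-debug-profile"
# Log in manually in that window, then run this script.
options = Options()
options.add_experimental_option("debuggerAddress", "127.0.0.1:9222")

driver = webdriver.Chrome(options=options)  # attaches to the running browser
print(driver.title)                         # sanity check: title of the open page
html = driver.page_source                   # feed this to BeautifulSoup, pandas, etc.

From there you can write the extracted data to Excel with, for example, pandas or openpyxl.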
I'm new to web scraping and am trying to learn more. I know some websites load products on the back end before they make them available to the general public. Is there a way I can access that information using an HTML parser or any other library?
I suspect the website developers use dynamic JavaScript to alter the information after loading, or use different tags/classes to hide the information.
I see two questions here:
1) Can I access information on the web server that isn't sent to the client page?
No. You can only scrape what exists on the page. Anything else would be illegally accessing a non-public server and goes beyond scraping into hacking.
2) If the site loads asynchronously and/or dynamically, can I access the content that loads after the main portion of the html?
Yes, using browser automation tools like selenium, you can approximate a user experiencing the site and wait for the full content to load before you scrape it. This is different from simple requests/beautifulsoup, which only gathers the HTML at the point when you send the request.
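For example, a minimal selenium sketch that waits for dynamically loaded elements before reading them; the URL and CSS selector are placeholders you would replace with the real ones:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.get("https://example.com/products")  # placeholder URL

# Wait up to 15 seconds for the asynchronously loaded elements to appear.
items = WebDriverWait(driver, 15).until(
    EC.presence_of_all_elements_located((By.CSS_SELECTOR, ".product"))  # placeholder selector
)
for item in items:
    print(item.text)

driver.quit()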
To be specific, I want to make a Python web crawler that uses a plugin on Chrome called "Adapt Prospector," which allows you to find people's emails once you land on their linkedin page. Here is an example of what I mean:
https://i.postimg.cc/DyxWzxWJ/example_pic.png
You first go on the person's linkedin page, then click the plugin logo on Chrome's extension bar, then the plugin will show you the linkedin profile's email (if there is one).
Basically, I want to create a program that goes to a person's linkedin page, then clicks the plugin logo on the extension bar, then scrapes the data the plugin is showing.
I definitely know how to do the first part, but I'm not sure if the last two parts are possible. I searched extensively on whether it's possible to make a web scraper that uses a plugin, but I haven't found any "yes" or "no" answers to this.
You can try to:
Find which request returns the information you need using the Network tab of your browser's developer console, then make the same request with your favorite Python library (a sketch follows below).
Use selenium, which will behave more or less like your browser: go to the person's linkedin page, and the information should be somewhere in the page, possibly hidden.
Your plugin just reorganizes the information it finds on the page; Linkedin already provides your browser with all the information you need.
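Here is the sketch mentioned in the first option, replaying a request found in the Network tab with the requests library; the URL, headers and cookie value are placeholders to be copied from your own browser session ("Copy as cURL" in the Network tab makes this easy):

import requests

url = "https://www.example.com/api/profile-data"  # placeholder: the request you found in the Network tab
headers = {
    "User-Agent": "Mozilla/5.0",                       # copy the real header values from the browser
    "Cookie": "session=PASTE_YOUR_SESSION_COOKIE_HERE",  # placeholder session cookie
}

response = requests.get(url, headers=headers)
response.raise_for_status()
print(response.json())  # many such endpoints return JSON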
EDIT: following Using Extensions with Selenium (Python) you can try loading the extension, but I think that selenium without the extension will do fine as well.
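If you do want the extension loaded, here is a minimal sketch; the .crx path and the profile URL are placeholders. Note that Selenium cannot click the extension icon in the toolbar (it is browser chrome, not page content), so you would normally open the extension's own popup page or read whatever it injects into the page instead:

from selenium import webdriver

options = webdriver.ChromeOptions()
options.add_extension("/path/to/adapt_prospector.crx")  # placeholder path to the packed extension
# or, for an unpacked extension directory:
# options.add_argument("--load-extension=/path/to/unpacked_extension")

driver = webdriver.Chrome(options=options)
driver.get("https://www.linkedin.com/in/some-profile/")  # placeholder profile URL
# driver.page_source now also contains anything the extension injected into the page, if it injects at all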
I'm running Python 3.6 and having a lot of issues logging in to a site, primarily because of a captcha. I really only need to look up URLs and retrieve the html of each page, but I need to be logged in for certain additional information to appear on the accessible URLs.
I was using urllib to read the URLs, but now I'm looking for a way to log in and then request information. The automated route won't work because of those issues, so I'm looking for a method where I am already logged in in an open browser and Python then fetches the URLs (the searches can be hidden; they don't have to literally open new tabs). When I open new tabs manually on the site it still shows I'm logged in, so if I can log in manually each time I want to run the script and then work from that session, it would work just fine.
Thanks
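One way to reuse the session you already created by logging in manually, assuming Chrome and the third-party browser_cookie3 package, is to load the browser's cookies from disk and pass them to requests; the URL is a placeholder:

import browser_cookie3
import requests

# Load the cookies of the local Chrome profile (where you are already logged in).
cookies = browser_cookie3.chrome()

response = requests.get("https://example.com/members-only-page", cookies=cookies)  # placeholder URL
print(response.status_code)
print(response.text[:500])  # the html you wanted, if the site accepted the cookies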
I am a novice in Python (I'm a C++ developer), and I am trying to get some hands-on experience with web scraping on Windows IE.
The problem I am facing is that when I open a URL using the "requests" library, the server always sends me the login page. I figured out the problem: the server presumes you are coming through IE and tries to execute a function that uses information from the SSO (single sign-on) process running in the background on Windows since the first login to the web server (consider this a somewhat unusual setup).
On observing this I changed my strategy and started using the webbrowser lib.
Now, when I call webbrowser.open("url"), the browser opens the page properly, which is great!
But my problems now are:
1) I do not want the opened browser page to be visible to the user (i.e. some way to open the browser in the background). I tried this:
ie = webbrowser.BackgroundBrowser(webbrowser.iexplore)
ie.Visible = 0
ie.open('url')
but with no success: it still opens a page that is visible to the user.
2) [This is the main activity] I want to scrape the page that was opened in the IE window above. How do I do that?
I tried to dig into this link but did not find any APIs for getting the data.
Kindly help.
PS: I tried using Beautiful Soup to scrape some other web pages fetched with requests. It was successful and I got the data I wanted, but not in this case.
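Since the question is about reading a page that IE already has open, one technique often suggested on Windows (not using webbrowser at all, but the pywin32 package and COM automation) is to enumerate the open IE windows through Shell.Application and read their live DOM; a rough sketch:

import win32com.client  # from the pywin32 package

shell = win32com.client.Dispatch("Shell.Application")
for window in shell.Windows():
    url = getattr(window, "LocationURL", "")
    # Shell.Windows() also returns file-explorer windows, so keep only web pages.
    if url.startswith("http"):
        html = window.Document.body.innerHTML  # live DOM of the open IE tab
        print(url, len(html))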
The webbrowser module doesn't let you do that. The get function you mentioned retrieves registered web browsers; it does not perform an HTTP GET request whose response you could scrape.
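For illustration, this is roughly all the webbrowser module is designed to do; it hands a URL to a registered browser and gives you nothing back:

import webbrowser

controller = webbrowser.get()           # controller for the default registered browser
controller.open("https://example.com")  # opens the page; returns a bool, not the HTML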
I don't know what is triggering the behavior you described with IE. Have you tried changing your User-Agent to an IE one? You can check this post for more details: Sending "User-agent" using Requests library in Python
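If you want to try the User-Agent idea, here is a minimal sketch with requests; the UA string below is just an example IE 11 value and the URL is a placeholder:

import requests

headers = {
    # Example IE 11 User-Agent; copy the exact string your IE sends
    # (visible in the request headers in the browser's developer tools).
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64; Trident/7.0; rv:11.0) like Gecko"
}

response = requests.get("https://intranet.example.com/page", headers=headers)  # placeholder URL
print(response.status_code)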
I am writing a web scraping application. When I enter the URL directly into a browser, it displays the JSON data I want.
However, if I use Python's requests lib, or URLDownloadToFile in C++, it simply downloads the html for the login page.
The site I am trying to scrape (DraftKings.com) requires a login; the other sites I scrape from don't.
I am 100% sure this is related, since if I paste the url when I am logged out, I get the login page, rather than the JSON data. Once I log in, if I paste the URL again, I get the JSON data again.
The thing is, even if I remain logged in and then use the Python script or C++ app to download the JSON data as mentioned, it still downloads the login HTML.
Anyone know how I can fix this issue?
Please don't ask us to help with an activity that violates the terms of service of the site you are trying to (ab-)use:
Using automated means (including but not limited to harvesting bots, robots, parser, spiders or screen scrapers) to obtain, collect or access any information on the Website or of any User for any purpose.
Even if that kind of usage were allowed, the answer would be boring:
You'd need to implement the login functionality in your scraper.