Interact with webpage Beautifulsoup and python

Interact with webpage Beautifulsoup and python - python

I'm using Python 2.7 with beautifulsoup and urllib2, I'm trying to scrap this page: angel.co/companies
As you see it shows a list with companies and it ends with a button "More" to show the others. As you click the button, more companies appear to watch and it creates a new tag with the new list of resutls. The button is in this div: <div class="more" data-page="2">More</div> and each time you click it the data-page increases.
I'd like to know if it's possible to scrap this page completely (so it clicks the "More" button each time it arrives to the end). I suppose it is scrapping the css and changing it but I never did so and I haven't found information about this anywhere.

Depending on what you want to do you could use their API for this. If you are not sure what it is and how to use it, try googling around for an answer. Here's one for starters.

Related

Get current HTML from browser tab with Python

I know there are plenty ways to get a HTML source passing the page url.
But is there a way to get the current html of a page if it displays data after some action ?
For example: A simple html page with a button (thats the source html) that displays random data when you click it.
Thanks

I believe you're looking for a tool collectively known as a "headless browser". The only one I've used that is available in Python (and can vouch for) is Selenium WebDriver, but there are plenty to choose from if you're searching up headless browsers for Python.
https://pypi.org/project/selenium
With this you should be able to programmatically load a web page, look up and click the button in the virtually rendered DOM, then lookup the innerHTML property of the targeted element.

Click an element to dynamically change content with python web scraping

So I'm scraping all the data off my works website to get all my shifts and data about those shifts with python and beautiful soup. Scraping the shifts is fine since they are just elements. But to get info like who is working on a shift you have to click an element which displays a hidden element but also changes the info depending on which day you have clicked on. This is changed with a javascript function showFloorPlan('N','N','N','20200624')
I need to be able to scrape the data from a weeks worth of shifts showing who is working on which shift. I've tried added javascript:showFloorPlan('N','N','N','20200624') to the URL I am scraping from but with no luck.
Any help is greatly appreciated

You can try selenium library for browser-like actions like clicking, scrolling, navigating, and much more. Functionalities of BeautifulSoup are limited when it comes to dynamic scraping.
Selenium Offical Documentation
Selenium for python
If you want to scrape without opening browser, you should also look for
Headless Browser

Selenium unable to automate click on web element

I should start by saying that I'm not a web developer but I have been using Selenium with Python for a few weeks and think I've gotten the basics down.
What I have is a page with a weekly calendar on it. Image as follows:
What happens is that you can click on any of the coloured boxes which will bring up a register for a class. It features items that you can click on which bring up new information. The problem I have is that I can't click on the items - or, rather, I don't know how to automate the click. I have automated clicking with Selenium before in other situations.
That looks as follows:
As you can see, the calendar still appears in the background. That is to say, the item we click on doesn't take us to a new page but, I suppose, runs some kind of method which shows the register and populates it with data.
My problem is this: I want to automate the process so that I can click on each of these items then scrape the information. In order to do that I need to be able to automate the clicking for each of the items.
So what have I done? When I've done this before, I've searched through the web html for the relevant part and then grabbed the xPath to the element I needed. But here I can't do that. Why not? Well, firstly, I just can't find that element!
Take a look at this close up of the first column:
It's divided into columns, but then I'd expect the clickable area to be an element within that. As you can see, it's not. Furthermore, the clickable area is just the coloured box itself, but if you look closely you can see the element goes outside of that area. I have gone very close with my mouse cursor to see exactly what's clickable, and it definitely is just the coloured box.
So I've not been able to get the element at all.
I thought I might be able to just find out where we went after I clicked the button, but when I got the link address, it just said it was the same page with no differences.
I appreciate I'm asking quite a broad question here, but the problem is that I don't really know where to start. If someone could give me at least that much, I would be grateful. Like if I could just click on each of these one at a time... I've found where the populated data is so I could grab that without a problem.
Well, here's to hoping.
Edit: I should add that there are some JavaScript items (tag type script, type='text/javascript'). I presume that the answer is in there somewhere, but there is a lot of Javascript and I'm not adept at reading it. It's hard for me to tell what script does what. If I could at least figure out what script runs when I click the item then I think I'd be onto something, but I have no idea. Even that would help me.

I had encountered similar problem when scraping for Instagram followers in mobile view where it was floating box when showing the accounts followers name. The approach I took was identifying the name of floating dialog box and clicking the element in it. It might different in your case of html.
Trying looking at this link. Selenium Scroll inside of popup div

Hard to say without the HTML. Maybe try Katalon Recorder (chrome extension) and see if that can detect the xpath for you? It might also be you have to use some kind of javascript to invoke the method for the element

How to use Python to scrape all the table contents on this website which is written by AJAX?

https://www.fedsdatacenter.com/federal-pay-rates/index.php?y=2017&n=&l=&a=&o=
This website seems to be written by jquery(AJAX). I would like to scrape all pages' tables. When I inspect the 1,2,3,4 page tags, they do not have a specific href link. Besides, clicking on them does not create a clear pattern of get requests, therefore, I find it hard to use Python urllib to send a get request for each page.

You can use Selenium with Python http://selenium-python.readthedocs.io/ to navigate through the pages. I would find the Next button and .click() it then time.sleep(seconds) and scrape the page. I can't navigate to the last page on this site, unfortunately (it seems broken - which you should also be aware of), but I'm assuming the Next button disappears or something when you get to the last page. If not, you might want to save the what you've scraped everytime you go to a new page, this way you don't lose your data in the event of an error.

Access widget window beautifulsoup python mechanize

I am trying to scrape information off websites like this:
https://www.glassdoor.com/Overview/Working-at-7-Eleven-EI_IE3581.11,19.htm
using python + beautifulsoup + mechanize.
Accessing anything on the main-site is no problem. However, I also need the information that appears in a overlay-window that appears when one clicks on the "Rating Trends" button next to the bar with stars.
This overlay-window can also be accessed directly by using the url:
https://www.glassdoor.com/Reviews/7-Eleven-Reviews-E3581.htm#trends-overallRating
The html associated with this page is a modification of the original site's html.
However, regardless of what element I try to find (via findAll ) on that overlay-window website, beautifulsoup returns zero hits.
How can I fix this? I tried adding a sleep time between accessing the website and reading anything in, to no avail.
Thanks!

If you're using the Chrome browser select the background of that page (without the additional information displayed) and select 'Inspect' from the context menu (for Windows anyway), then the 'Network' tab, so that you can see network traffic. Now click on 'Rating trends'. The entry marked 'xhr' will be https://www.glassdoor.ca/api/employer/3581-rating.htm?locationStr=&jobTitleStr=&filterCurrentEmployee=false&filterEmploymentStatus=REGULAR&filterEmploymentStatus=PART_TIME (I much hope!) and its contents will be the following.
{"employerId":3581,"ratings":[{"hasRating":true,"type":"overallRating","value":2.9},{"hasRating":true,"type":"ceoRating","value":0.54},{"hasRating":true,"type":"bizOutlook","value":0.35},{"hasRating":true,"type":"recommend","value":0.4},{"hasRating":true,"type":"compAndBenefits","value":2.4},{"hasRating":true,"type":"cultureAndValues","value":2.5},{"hasRating":true,"type":"careerOpportunities","value":2.5},{"hasRating":true,"type":"workLife","value":2.4},{"hasRating":true,"type":"seniorManagement","value":2.3}],"week":0,"year":0}
Whether this URL can be altered for use in obtaining information for other employers, I regret, I cannot tell you.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.