In Python, I am trying to fetch pages from a specific website.
On this website, some of the information is not fully present in the HTML page and requires a bit of user interaction. More precisely, there are reviews, but long reviews are truncated, and to see the whole review the user must click a 'More' hyperlink. Is there any way to handle these hyperlinks in Python and fetch the complete reviews in all such cases?
Here is a snapshot of the 'More' hyperlink:
<span class="bla bla" onclick="ta.util.cookie.setPIDCookie(123); ta.call('ta.servlet.Reviews.expandReviews',event,this,'review_331979201', '1', 123);"> More </span>
You could use the Selenium WebDriver API. For an example, see:
https://www.reddit.com/r/selenium/comments/2lscf4/clicking_a_button_using_selenium_python/
For the complete documentation, see http://www.seleniumhq.org/docs/
Use the Selenium Python bindings: http://selenium-python.readthedocs.org/
The algorithm could be as follows:
1. If the "More" hyperlink is not visible in the viewport, scroll to the element.
2. Click the hyperlink.
3. Fetch all reviews.
A similar case of scrolling to and clicking a web element: https://stackoverflow.com/a/34271050/2517622
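The scroll, click, and fetch steps could be sketched roughly as below. This is only a sketch under assumptions: the function name expand_and_collect, its parameters, and whatever selectors are passed in are hypothetical, and driver is assumed to be a Selenium WebDriver instance created elsewhere. The string "css selector" is the value of Selenium's By.CSS_SELECTOR constant, written out so the block has no import-time dependency.

```python
import time

# Locator strategy string; same value as selenium's By.CSS_SELECTOR.
CSS = "css selector"


def expand_and_collect(driver, more_css, review_css, pause=1.0):
    """Click every 'More' link found, then return the text of each review."""
    for link in driver.find_elements(CSS, more_css):
        # Step 1: bring the link into the viewport before clicking it.
        driver.execute_script("arguments[0].scrollIntoView(true);", link)
        try:
            # Step 2: click the hyperlink.
            link.click()
        except Exception:
            # Assumption: a link can go stale or become hidden once an earlier
            # click has already expanded the reviews. Selenium raises
            # WebDriverException subclasses here; caught broadly in this sketch.
            continue
        time.sleep(pause)  # crude wait for the expanded text to load
    # Step 3: fetch all reviews.
    return [element.text for element in driver.find_elements(CSS, review_css)]
```

With a real browser, one would create the driver (for example with webdriver.Chrome()), load the page with driver.get(url), and then call expand_and_collect(driver, more_css, review_css) with selectors taken from the actual page markup. A fixed sleep is the simplest way to wait; Selenium's explicit waits are usually more robust.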
Related
I know there are plenty of ways to get the HTML source of a page given its URL.
But is there a way to get the current HTML of a page if it displays data only after some action?
For example: a simple HTML page with a button (that's the source HTML) that displays random data when you click it.
Thanks
I believe you're looking for the kind of tool generally known as a "headless browser". The only one I've used that is available in Python (and can vouch for) is Selenium WebDriver, but there are plenty to choose from if you search for headless browsers for Python.
https://pypi.org/project/selenium
With this you should be able to programmatically load a web page, look up and click the button in the virtually rendered DOM, then read the innerHTML property of the targeted element.
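A minimal sketch of that load, click, and read sequence follows. The function name html_after_click and its parameters are hypothetical, and driver is assumed to be a Selenium WebDriver instance created elsewhere; "css selector" is the value of Selenium's By.CSS_SELECTOR constant.

```python
# Locator strategy string; same value as selenium's By.CSS_SELECTOR.
CSS = "css selector"


def html_after_click(driver, url, button_css, target_css):
    """Load url, click the button, and return the target's current innerHTML."""
    driver.get(url)
    driver.find_element(CSS, button_css).click()
    target = driver.find_element(CSS, target_css)
    # innerHTML is a DOM property, so it reflects the page after the click,
    # not the HTML source that was originally downloaded.
    return target.get_property("innerHTML")
```

If the page fills in the data asynchronously, a wait between the click and the read would probably be needed; this sketch assumes the content is present as soon as the click returns.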
I was trying to scrape a website and ran into a problem: the data on the page is hidden, and it only appears when I click the "+" sign.
How do I scrape this data using Python?
<tr class="ob_gDGC" style="display: none;">
The style only determines what the screen displays, not what the document contains, so display:none doesn't prevent you from accessing the data.
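As an illustration, the sketch below reads the text of a hidden row using only the standard library. The sample markup reuses the tr tag from the question, but the surrounding table and the cell content "hidden value" are made up for the example, and the class name HiddenRowText is hypothetical.

```python
from html.parser import HTMLParser


class HiddenRowText(HTMLParser):
    """Collect the text inside <tr> elements styled with display:none."""

    def __init__(self):
        super().__init__()
        self.in_hidden_row = False
        self.rows = []

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            style = dict(attrs).get("style") or ""
            # Ignore spaces so "display: none" and "display:none" both match.
            self.in_hidden_row = "display:none" in style.replace(" ", "")
            if self.in_hidden_row:
                self.rows.append([])

    def handle_endtag(self, tag):
        if tag == "tr":
            self.in_hidden_row = False

    def handle_data(self, data):
        if self.in_hidden_row and data.strip():
            self.rows[-1].append(data.strip())


# Made-up sample document containing one hidden row.
html = '<table><tr class="ob_gDGC" style="display: none;"><td>hidden value</td></tr></table>'
parser = HiddenRowText()
parser.feed(html)
print(parser.rows)  # [['hidden value']]
```

The same idea applies with BeautifulSoup or lxml: the parser sees every element in the downloaded HTML regardless of its CSS.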
However, if the data you are trying to access is not in the DOM, then you have a problem. View the page in dev tools to check whether the data is there before you click the button. If clicking the button appends children (or the DOM node flashes in Google Chrome dev tools), then the website you are trying to scrape uses JavaScript DOM manipulation, and this is difficult or impossible to extract with the requests library alone. In that case you would need a package like pyppeteer (or equivalent). With it you could load the web page, simulate the click event on the plus sign, and then extract the data you need.
I would advise you to modify your post to be a bit clearer and add an example of the DOM you are trying to scrape.
https://www.fedsdatacenter.com/federal-pay-rates/index.php?y=2017&n=&l=&a=&o=
This website seems to be built with jQuery (AJAX). I would like to scrape the tables on all pages. When I inspect the 1, 2, 3, 4 page links, they do not have a specific href. Besides, clicking on them does not produce a clear pattern of GET requests, so I find it hard to use Python's urllib to send a GET request for each page.
You can use Selenium with Python (http://selenium-python.readthedocs.io/) to navigate through the pages. I would find the Next button, .click() it, then time.sleep(seconds) and scrape the page. Unfortunately, I can't navigate to the last page on this site (it seems broken, which you should also be aware of), but I'm assuming the Next button disappears or something when you get to the last page. If not, you might want to save what you've scraped every time you go to a new page, so that you don't lose your data in the event of an error.
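A rough sketch of that loop is shown below. It assumes the Next button really does disappear on the last page, which is unverified for this site, so a max_pages cap guards against looping forever. The function name scrape_all_pages, its parameters, and the on_page callback are hypothetical; driver is assumed to be a Selenium WebDriver instance that has already loaded the first page.

```python
import time

# Locator strategy string; same value as selenium's By.CSS_SELECTOR.
CSS = "css selector"


def scrape_all_pages(driver, row_css, next_css, on_page, pause=2.0, max_pages=1000):
    """Scrape the current page, click Next, and repeat until Next is gone.

    on_page is called with the rows of each page as soon as they are read,
    so the caller can save them before moving on.
    """
    pages = 0
    while pages < max_pages:
        rows = [element.text for element in driver.find_elements(CSS, row_css)]
        on_page(rows)  # save immediately so a later error loses nothing
        pages += 1
        # find_elements returns an empty list when nothing matches.
        next_buttons = driver.find_elements(CSS, next_css)
        if not next_buttons:
            break
        next_buttons[0].click()
        time.sleep(pause)  # give the next page's table time to load
    return pages
```

The on_page callback could, for instance, append each page's rows to a CSV file. If the site keeps the Next button in place and only disables it on the last page, the selector passed as next_css would need to exclude the disabled state.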
I'm using Python 2.7 with BeautifulSoup and urllib2, and I'm trying to scrape this page: angel.co/companies
As you can see, it shows a list of companies and ends with a "More" button that shows the others. When you click the button, more companies appear and a new tag is created with the new list of results. The button is in this div: <div class="more" data-page="2">More</div> and each time you click it, data-page increases.
I'd like to know if it's possible to scrape this page completely (so that it clicks the "More" button each time it reaches the end). I suppose this means scraping the CSS and changing it, but I have never done that and I haven't found information about it anywhere.
Depending on what you want to do you could use their API for this. If you are not sure what it is and how to use it, try googling around for an answer. Here's one for starters.
I am trying to scrape the reviews on this webpage.
http://www.tripadvisor.com/Hotel_Review-g294265-d2309275-Reviews-The_Forest_by_Wangz-Singapore.html
The only problem in each review is the "More" link, which loads more text on an onclick event.
For example:
<span class="taLnk hvrIE6 tr147826763 moreLink" onclick = " ta.util.cookie.setPIDCookie(2247); ta.call('ta.servlet.Reviews.expandReviews', event,this,'review_147826763', '1', 2247)">
More </span>
How can I scrape the complete review text using lxml/BeautifulSoup?
This probably isn't the kind of answer you're looking for, but I've started looking at PhantomJS, which gives you a headless, scriptable WebKit browser. I'd bet it's an easier path than whatever AJAX reverse-engineering rabbit hole you're about to go down...