I am trying to scrape the reviews on this webpage.
http://www.tripadvisor.com/Hotel_Review-g294265-d2309275-Reviews-The_Forest_by_Wangz-Singapore.html
The only problem is the "More" link in each review, which loads the rest of the text on an onclick event.
For example:
<span class="taLnk hvrIE6 tr147826763 moreLink" onclick = " ta.util.cookie.setPIDCookie(2247); ta.call('ta.servlet.Reviews.expandReviews', event,this,'review_147826763', '1', 2247)">
More </span>
How can I scrape the complete review text using lxml/BeautifulSoup?
This probably isn't the kind of answer you're looking for, but I've started looking at PhantomJS, which gives you a headless, scriptable WebKit browser. I'd bet it's an easier path than whatever AJAX reverse-engineering rabbit hole you're about to go down...
Related
I am automating a process using Selenium and Python. Right now, I am trying to click a button on a web page (sorry, I cannot share the link, since it requires credentials to log in), but my code cannot find the button element. I have tried every selector (by id, CSS selector, XPath, etc.) and done a lot of googling, with no success.
Here is the source content from the web page:
<button onclick="javascript: switchTabs('public');" aria-selected="false" itemcount="-1" type="button" title="Public Reports" dontactassubmit="false" id="public" aria-label="" class="col-xs-12 text-left list-group-item tabbing_class active"> Public Reports </button>
I also added a sleep call before this to make sure the page is fully loaded, but it does not help.
Can anyone help me select this onclick button?
Please let me know if you need more info.
Edit: you can take a look at this picture to get more insight (https://ibb.co/cYXWkL0). The yellow arrow indicates the button I want to click on.
The element you are trying to click is inside an iframe, so you need to switch the driver into the iframe content before accessing elements inside it.
I can't give you a specific code solution since you didn't share a link to that page, or even the full HTML block. Solutions to similar questions have been posted elsewhere, and more results can be found with a Google search.
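A minimal sketch of the iframe switch, assuming the page has the button inside its first iframe and that the button keeps the id `public` from the question. The function name `click_inside_iframe` and the `frame_index` parameter are hypothetical; the locator strings "tag name" and "id" are the values behind Selenium's `By.TAG_NAME` and `By.ID`.

```python
def click_inside_iframe(driver, button_id="public", frame_index=0):
    """Switch into an iframe, click a button by id, then switch back.

    `driver` is expected to be a Selenium WebDriver. Which iframe holds the
    button is an assumption: the question does not show the frame markup.
    """
    # "tag name" and "id" are the string values of By.TAG_NAME and By.ID.
    frames = driver.find_elements("tag name", "iframe")
    if frame_index >= len(frames):
        raise LookupError("no iframe at index %d" % frame_index)
    driver.switch_to.frame(frames[frame_index])
    try:
        driver.find_element("id", button_id).click()
    finally:
        # Return to the top-level document so later lookups work as before.
        driver.switch_to.default_content()
```

With a real browser this would be called after `driver.get(...)`. An explicit wait such as `WebDriverWait(driver, 10).until(EC.frame_to_be_available_and_switch_to_it(...))` is usually more reliable than a fixed sleep if the frame loads late.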
I was trying to scrape a website and ran into a problem: the data on the page is hidden, and it only shows up when I click the "+" sign.
How do I scrape this data using Python?
<tr class="ob_gDGC" style="display: none;">
The style only controls what the screen displays, not what the document contains, so display:none doesn't stop you from accessing the data.
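As a minimal sketch of that point, the following uses only Python's standard `html.parser` to read rows that are hidden with an inline style. The class name and style come from the question; the cell contents and the `HiddenRowParser` name are made up for illustration.

```python
from html.parser import HTMLParser


class HiddenRowParser(HTMLParser):
    """Collect the text of every <tr> whose inline style hides it."""

    def __init__(self):
        super().__init__()
        self.rows = []        # one list of cell strings per hidden row
        self._current = None  # cells of the hidden row being read, if any

    def handle_starttag(self, tag, attrs):
        style = dict(attrs).get("style") or ""
        if tag == "tr" and "display:none" in style.replace(" ", ""):
            self._current = []

    def handle_endtag(self, tag):
        if tag == "tr" and self._current is not None:
            self.rows.append(self._current)
            self._current = None

    def handle_data(self, data):
        if self._current is not None and data.strip():
            self._current.append(data.strip())


# Hypothetical sample: only the class and style are from the question.
sample = """
<table>
  <tr><td>visible</td></tr>
  <tr class="ob_gDGC" style="display: none;"><td>hidden A</td><td>hidden B</td></tr>
</table>
"""

parser = HiddenRowParser()
parser.feed(sample)
print(parser.rows)
```

The hidden row is parsed like any other, because the parser never evaluates CSS.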
However, if the data you are trying to access is not in the DOM, then you have a problem. View the page in dev tools to see whether the data is there before you click the button. If clicking the button appends children (or the DOM node flashes in Google Chrome dev tools), then the website uses JavaScript DOM manipulation, and this is difficult or impossible to extract with the requests library alone. For that you would want a package like pyppeteer (or an equivalent). With it you could load the web page, simulate the click event on the plus sign, and then extract the data you need.
I would advise you to modify your post to be a bit clearer and add an example of the DOM you are trying to scrape.
I'm writing a script in Python to download, each day, the PDFs that are published on a site.
I had no problem scraping the page and downloading the files.
The problem I'm facing now is that the site has multiple pages. I know what you are thinking ;) but it wouldn't be a problem if the site were structured like this:
page 1 -> www.example.com/page1
page 2 -> www.example.com/page2 ...
But unfortunately, when I click a page number to change page,
nothing changes in the URL field.
The only thing I was able to find was this event in the console:
The page buttons I need to click are these:
<nav class="text-center">
<ul class="pagination pagination-sm files_paging"><li><a data-page="1" aria-label="Previous"><span aria-hidden="true">«</span></a></li><li class="active"><a data-page="1">1</a></li><li><a data-page="2">2</a></li><li><a data-page="3">3</a></li><li><a data-page="4">4</a></li><li class="disabled"><a data-page="4"><span aria-hidden="true">...</span></a></li><li><a data-page="9">9</a></li><li><a data-page="2" aria-label="Next"><span aria-hidden="true">»</span></a></li></ul>
</nav>
Does anyone have any ideas?
I assume that the mentioned page uses a JavaScript framework for displaying the content. You should try the following options.
Guess the pattern of the URLs.
Download the frontend part of the page (HTML and JavaScript files) and search for the point where the URLs are generated or retrieved.
If you are interested in similar tasks, you should try Selenium or another similar browser-based, programmable testing tool.
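Whichever option is used, it helps to know which page numbers exist. A minimal sketch, assuming the pagination markup from the question is present in the fetched HTML: it reads the `data-page` attributes with the standard `html.parser`. The `PageNumberParser` name is hypothetical, and how a page number maps to an actual request is not shown in the question, so that part still has to be found in the browser's network tab or the JavaScript files.

```python
from html.parser import HTMLParser


class PageNumberParser(HTMLParser):
    """Collect the data-page value of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.pages = set()

    def handle_starttag(self, tag, attrs):
        value = dict(attrs).get("data-page")
        if tag == "a" and value and value.isdigit():
            self.pages.add(int(value))


# Shortened copy of the pagination markup from the question.
markup = (
    '<ul class="pagination pagination-sm files_paging">'
    '<li class="active"><a data-page="1">1</a></li>'
    '<li><a data-page="2">2</a></li>'
    '<li><a data-page="9">9</a></li>'
    '<li><a data-page="2" aria-label="Next">&raquo;</a></li>'
    '</ul>'
)

parser = PageNumberParser()
parser.feed(markup)
last_page = max(parser.pages)
print(sorted(parser.pages), last_page)
```

The highest `data-page` value gives the upper bound for a loop over pages once the request pattern is known.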
In Python, I am trying to fetch pages from a specific website.
On this website, some of the information is not fully available in the HTML page and needs a bit of user interaction. To be more precise, there are some reviews, but the long ones are shortened, and to see the whole review the user must click on a 'More' hyperlink. Is there any way to handle these hyperlinks in Python and fetch the whole review in all those cases?
Here is a snapshot of the 'More' hyperlink:
<span class="bla bla" onclick="ta.util.cookie.setPIDCookie(123); ta.call('ta.servlet.Reviews.expandReviews',event,this,'review_331979201', '1', 123);"> More </span>
You could use the Selenium WebDriver API. For an example, see this:
https://www.reddit.com/r/selenium/comments/2lscf4/clicking_a_button_using_selenium_python/
For the complete docs, see http://www.seleniumhq.org/docs/
Use the Selenium Python bindings: http://selenium-python.readthedocs.org/
The algorithm could be as follows:
If the "More" hyperlink is not visible in the viewport, scroll to this element
Click the hyperlink
Fetch all reviews
A similar case of scrolling to and clicking a web element: https://stackoverflow.com/a/34271050/2517622
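The three steps above can be sketched as follows. This is a rough outline under assumptions, not a tested scraper: the function name `expand_and_collect` is hypothetical, both CSS selectors are placeholders that have to be read from the real page, and "css selector" is the string value behind Selenium's `By.CSS_SELECTOR`.

```python
def expand_and_collect(driver, more_selector, review_selector):
    """Click every "More" link, then return the text of every review.

    `driver` is expected to be a Selenium WebDriver. Both selectors are
    placeholders: the real class names have to be read from the page.
    """
    # "css selector" is the string value of By.CSS_SELECTOR.
    for link in driver.find_elements("css selector", more_selector):
        # Step 1: bring the link into the viewport.
        driver.execute_script("arguments[0].scrollIntoView(true);", link)
        # Step 2: click it so the page loads the full text.
        link.click()
    # Step 3: read the reviews after all links were clicked.
    reviews = driver.find_elements("css selector", review_selector)
    return [review.text for review in reviews]
```

A call might look like `expand_and_collect(driver, "span.moreLink", "div.review")`, where `moreLink` is taken from the snippet in the first question and `div.review` is a guess. Because the expansion is asynchronous, a real script would probably need a wait before step 3, and may need to look the links up again if a click re-renders the list.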
I'm using Python 2.7 with BeautifulSoup and urllib2, and I'm trying to scrape this page: angel.co/companies
As you can see, it shows a list of companies and ends with a "More" button to show the others. When you click the button, more companies appear and it creates a new tag with the new list of results. The button is in this div: <div class="more" data-page="2">More</div> and each time you click it the data-page value increases.
I'd like to know if it's possible to scrape this page completely (so that it clicks the "More" button each time it reaches the end). I suppose it involves scraping the CSS and changing it, but I have never done that and I haven't found information about it anywhere.
Depending on what you want to do you could use their API for this. If you are not sure what it is and how to use it, try googling around for an answer. Here's one for starters.