How to scrape hidden website data using Python (style display:none)

I was trying to scrape a website and ran into a problem: the data on the page is hidden, and it only appears when I click the "+" sign.
How do I scrape this data using Python?
<tr class="ob_gDGC" style="display: none;">

The style attribute only controls what the browser displays, not what the document contains, so display:none does not stop you from accessing the data.
However, if the data you are trying to access is not in the DOM, you have a problem. Open the page in dev tools and check whether the data is there before you click the button. If clicking the button appends child nodes (or the DOM node flashes in Google Chrome dev tools), the website uses JavaScript DOM manipulation, and this is difficult to impossible to extract with the requests library alone, because requests does not execute JavaScript. For that you would be looking at a package like pyppeteer (or an equivalent). With it you could load the page, simulate the click event on the plus sign, and then extract the data you need.
I would advise you to edit your post to make it a bit clearer and add an example of the DOM you are trying to scrape.
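To illustrate the first point, here is a minimal sketch showing that a row styled display:none is still present in the parsed document. It uses only the standard library's html.parser so it runs without installing anything; the same idea applies with BeautifulSoup. The markup is hypothetical: only the ob_gDGC class and the style attribute come from the question, and HiddenRowParser and hidden_cells are names made up for this example.

```python
from html.parser import HTMLParser

# Hypothetical markup: only the class name and the style come from the question.
HTML = """
<table>
  <tr class="visible"><td>shown</td></tr>
  <tr class="ob_gDGC" style="display: none;"><td>hidden value</td></tr>
</table>
"""


class HiddenRowParser(HTMLParser):
    """Collect the text found inside rows styled display:none."""

    def __init__(self):
        super().__init__()
        self.in_hidden_row = False
        self.cells = []

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            style = dict(attrs).get("style") or ""
            # Drop spaces so "display: none" and "display:none" both match.
            self.in_hidden_row = "display:none" in style.replace(" ", "")

    def handle_endtag(self, tag):
        if tag == "tr":
            self.in_hidden_row = False

    def handle_data(self, data):
        if self.in_hidden_row and data.strip():
            self.cells.append(data.strip())


def hidden_cells(html):
    """Return the text of every cell inside a display:none row."""
    parser = HiddenRowParser()
    parser.feed(html)
    return parser.cells


print(hidden_cells(HTML))
```

If this kind of check finds nothing on the real page source, the rows are probably added by JavaScript after load, which is the second case described above.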

Related

Get links from ::before ::after using Selenium in Python

I am trying to scrape a website using Selenium in Python in order to extract a few links.
But for some of the tags, I am not able to find the links. When I inspect these elements, dev tools points me to ::before and ::after. One way to get a link is to click the element, which opens a new window, and read the link from that window. But this solution is quite slow. Is there a way to fetch these links directly from the page?
It looks like the links you are trying to extract are not statically stored inside the i elements you see there. They are generated dynamically by JavaScript running on that page.
So the answer is "No": you cannot extract these links from that page without iterating over its elements the way a human would.

Get current HTML from browser tab with Python

I know there are plenty of ways to get the HTML source by passing the page URL.
But is there a way to get the current HTML of a page if it displays data after some action?
For example: a simple HTML page with a button (that's the source HTML) that displays random data when you click it.
Thanks
I believe you're looking for a kind of tool known as a "headless browser". The only one I've used that is available in Python (and can vouch for) is Selenium WebDriver, but there are plenty to choose from if you search for headless browsers for Python.
https://pypi.org/project/selenium
With this you should be able to programmatically load a web page, look up and click the button in the virtually rendered DOM, then read the innerHTML property of the targeted element.
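A rough sketch of that click-then-read flow follows, assuming Selenium 4 with Chrome installed. The function names and the CSS selectors passed in are hypothetical, since the question does not show the real page. The core helper takes any WebDriver object, so the Selenium import lives only in the second function.

```python
def click_and_read(driver, button_css, target_css):
    """Click one element, then return the innerHTML of another."""
    # "css selector" is the string value of selenium's By.CSS_SELECTOR.
    driver.find_element("css selector", button_css).click()
    target = driver.find_element("css selector", target_css)
    return target.get_attribute("innerHTML")


def read_with_headless_chrome(url, button_css, target_css):
    """Run click_and_read in headless Chrome (needs selenium and Chrome)."""
    # Imported here so click_and_read itself has no third-party dependency.
    from selenium import webdriver

    options = webdriver.ChromeOptions()
    options.add_argument("--headless=new")
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        return click_and_read(driver, button_css, target_css)
    finally:
        driver.quit()


# Hypothetical usage; the URL and selectors are placeholders:
# html = read_with_headless_chrome("https://example.com", "#show", "#result")
```

If the page fills the target asynchronously after the click, you would probably need to wait for it (for example with Selenium's WebDriverWait) before reading innerHTML.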

How to access the subtags within a tag using beautifulsoup in python?

I am attempting to retrieve player statistics from MLB.com for the 2016 season. I am using Beautiful Soup in Python, and I need to extract the information in the table seen here:
http://mlb.mlb.com/stats/sortable.jsp#elem=%5Bobject+Object%5D&tab_level=child&click_text=Sortable+Player+hitting&game_type='R'&season=2016&season_type=ANY&league_code='MLB'&sectionType=sp&statType=hitting&page=1&ts=1493672037085&playerType=ALL&sportCode='mlb'&split=&team_id=&active_sw=&position=&page_type=SortablePlayer&sortOrder='desc'&sortColumn=ab&results=&perPage=442&timeframe=&last_x_days=&extended=0
Here is what I have attempted:
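import requests
from bs4 import BeautifulSoup
# url is the sortable.jsp address shown above
r = requests.get(url)
soup = BeautifulSoup(r.content, 'html.parser')
gdata = soup.find_all('div', {'id': 'datagrid'})
print(gdata)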
This should return all of the subtags within the tag, but it does not. This results in the following:
[<div id="datagrid"></div>]
Can anyone explain why this is not producing the contents of the table? Furthermore, what can I do to access the contents of the table?
Thanks
If you look at the source for the webpage, it looks like the datagrid div is actually empty and the stats are inserted dynamically as JSON from this URL. Maybe you can use that instead. To figure this out I looked at the page source to see that the div had no children, and then used the Network tab of Chrome developer tools to find the request that pulled the data:
1. Open the web page.
2. Open the Chrome developer tools: Command+Option+I (Mac) or Control+Shift+I (Windows, Linux).
3. Refresh the web page with the tools open so they record the network requests, then wait for the page to load.
4. (optional) Type xml in the search bar of the Network tab to narrow the results to requests that are likely to carry data.
5. Click on each request and look at the preview of the response. At this point I just manually examined the responses to see which one had your data. I got lucky and found yours on the first try, since it has stats in the name.
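Once the Network tab has revealed the data request, you can usually call that URL directly instead of parsing the empty HTML. Below is a minimal sketch using only the standard library. fetch_json is a name made up for this example, and the endpoint in the usage comment is a placeholder, not the real MLB URL; paste in whatever address the Network tab shows.

```python
import json
from urllib.request import Request, urlopen


def fetch_json(url, timeout=10):
    """Download a URL and decode the response body as JSON."""
    request = Request(url, headers={"Accept": "application/json"})
    with urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Hypothetical usage; replace the placeholder with the request URL
# copied from the Network tab:
# stats = fetch_json("https://example.com/stats.json")
# print(stats.keys())
```

The structure of the returned JSON depends entirely on the site, so inspect it (for example by printing its keys) before writing code that picks out the table rows.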

Fetching a page which needs user interaction

In Python, I am trying to fetch pages from a specific website.
On this website, some of the information is not fully present in the HTML page and needs a bit of user interaction. To be clearer: there are reviews, but the long ones are shortened, and to see the whole review the user must click the 'More' hyperlink. Is there any way to handle these hyperlinks in Python and fetch the whole review in all those cases?
Here is a snapshot of the 'More' hyperlink:
<span class="bla bla" onclick="ta.util.cookie.setPIDCookie(123); ta.call('ta.servlet.Reviews.expandReviews',event,this,'review_331979201', '1', 123);"> More </span>
You could use the Selenium WebDriver API. For an example, see:
https://www.reddit.com/r/selenium/comments/2lscf4/clicking_a_button_using_selenium_python/
For the complete docs, see http://www.seleniumhq.org/docs/
Use the Selenium Python bindings: http://selenium-python.readthedocs.org/
The algorithm may be as follows:
1. If the "More" hyperlink is not visible in the viewport, scroll to the element
2. Click the hyperlink
3. Fetch all reviews
A similar case of scrolling to and clicking a web element: https://stackoverflow.com/a/34271050/2517622
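The algorithm above could be sketched roughly like this. It assumes driver is an already started Selenium WebDriver (for example webdriver.Chrome()); expand_and_collect and both CSS selectors are hypothetical, because the real class names on the review page are not shown in the question. For simplicity it scrolls to every "More" link before clicking rather than first testing whether the link is in the viewport.

```python
def expand_and_collect(driver, more_css, review_css):
    """Click every 'More' link, then return the text of all reviews."""
    # "css selector" is the string value of selenium's By.CSS_SELECTOR.
    for link in driver.find_elements("css selector", more_css):
        # Bring the link into the viewport so the click can reach it.
        driver.execute_script("arguments[0].scrollIntoView(true);", link)
        link.click()
    # Read the reviews only after all of them have been expanded.
    reviews = driver.find_elements("css selector", review_css)
    return [review.text for review in reviews]


# Hypothetical usage; the selectors are placeholders:
# texts = expand_and_collect(driver, "span.more", "div.review")
```

If clicking one link re-renders the others, the remaining elements may go stale; in that case you would likely need to look the links up again after each click.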

Interact with webpage Beautifulsoup and python

I'm using Python 2.7 with beautifulsoup and urllib2, and I'm trying to scrape this page: angel.co/companies
As you can see, it shows a list of companies and ends with a "More" button to show the others. When you click the button, more companies appear and the page creates a new tag with the new list of results. The button is in this div: <div class="more" data-page="2">More</div> and each time you click it the data-page value increases.
I'd like to know if it's possible to scrape this page completely (so that it clicks the "More" button each time it reaches the end). I suppose this means scraping the CSS and changing it, but I have never done that and I haven't found information about it anywhere.
Depending on what you want to do you could use their API for this. If you are not sure what it is and how to use it, try googling around for an answer. Here's one for starters.
