I'm trying to scrape the content from the following website:
https://mobile.admiral.at/en/event/event/all#/event/15a822ab-84a1-e511-90a2-000c297013a7
I have previously scraped the content successfully using dryscrape and the following code:
import dryscrape
import webkit_server
from lxml import html
session = dryscrape.Session()
session.set_timeout(20)
session.set_attribute('auto_load_images', False)
session.visit('https://mobile.admiral.at/en/event/event/all#/event/15a822ab-84a1-e511-90a2-000c297013a7')
response = session.body()
tree = html.fromstring(response)
print(tree.xpath('(//td[@class="team-name"]/text())[1]'))
The above example would print the home team (which in this case would be 'France')
It seems that the structure of the source has changed, so I'm unable to scrape the contents properly.
What confuses me is that I can see the tags using the Firefox Inspector tool, yet they are not present in the response when I pull the source.
I assume they must have hidden the content somehow to make it impossible(?) to scrape the data.
Could someone please point me in the right direction on how to scrape the content properly?
The content that you need is loaded using jQuery (Ajax). I don't know if dryscrape has been updated lately, but the last time I used it, it didn't support Ajax content loaded by jQuery...
Anyway, just take a look at the network inspector in Chrome and you will realize that the main content is loaded from an API. You can call that API directly and you will get a JSON object with all the data of the page:
import requests
data = requests.get('https://mobile.admiral.at/;apiVer=json;api=main;jsonType=object;apiRw=1/en/api/event/get-event?id=15a822ab-84a1-e511-90a2-000c297013a7').json()
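Once you have that JSON, extracting a field like the home team is plain dict/list navigation. The structure below is hypothetical (the keys are invented for illustration; print the real response to see its actual shape), but the pattern is the same:

```python
# Stand-in for: data = requests.get('...get-event?id=...').json()
# NOTE: these keys are hypothetical; inspect the real response for the actual ones.
data = {
    'event': {
        'name': 'France - Romania',
        'competitors': [
            {'name': 'France', 'type': 'HomeTeam'},
            {'name': 'Romania', 'type': 'AwayTeam'},
        ],
    }
}

# Pull the home team out of the competitor list
home = next(c['name']
            for c in data['event']['competitors']
            if c['type'] == 'HomeTeam')
print(home)  # France
```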
Related
I am trying to scrape the table from this link. There are no table tags on the page, so I am trying to access it using the class "rt-table". When I inspect the table in developer tools, I can see the HTML I need in the Elements panel (I am using Chrome). However, when I view the source fetched with requests, this part of the markup is missing. Does anyone know what the problem could be?
If you just want those stats you can access the hidden api like this:
import requests
import pandas as pd
url = 'https://api.nhle.com/stats/rest/en/team/summary?isAggregate=false&isGame=true&sort=%5B%7B%22property%22:%22points%22,%22direction%22:%22DESC%22%7D,%7B%22property%22:%22wins%22,%22direction%22:%22DESC%22%7D,%7B%22property%22:%22teamId%22,%22direction%22:%22ASC%22%7D%5D&start=0&limit=100&factCayenneExp=gamesPlayed%3E=1&cayenneExp=gameTypeId=2%20and%20seasonId%3C=20072008%20and%20seasonId%3E=20072008'
data = requests.get(url).json()
df = pd.DataFrame(data['data'])
df.to_csv('nj devils data.csv',index=False)
To see where I got that URL, go to the page you are trying to scrape in your browser, then open Developer Tools > Network > Fetch/XHR and hit refresh. You'll see a bunch of network requests happen; if you click on them, you can explore the data returned in the "Preview" tab and recreate the request as above. This is not always so easy: sometimes you need to send certain headers or a payload of information to authenticate a request to an API, but in this case it works without either.
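Those long query strings are less opaque than they look: the sort parameter, for example, is just URL-encoded JSON. A quick way to decode it with the standard library (no network needed):

```python
import json
from urllib.parse import unquote

# The encoded `sort` parameter from the NHL stats URL above
encoded = ('%5B%7B%22property%22:%22points%22,%22direction%22:%22DESC%22%7D,'
           '%7B%22property%22:%22wins%22,%22direction%22:%22DESC%22%7D,'
           '%7B%22property%22:%22teamId%22,%22direction%22:%22ASC%22%7D%5D')

decoded = unquote(encoded)   # plain JSON text
sort_spec = json.loads(decoded)
print(sort_spec[0])  # {'property': 'points', 'direction': 'DESC'}
```

Reversing the process with urllib.parse.quote makes it easy to tweak such parameters (season, limit, sort order) instead of hand-editing the encoded string.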
After following this tutorial on finding a CSS class and copying the text on a website, I tried to implement this in a small test script, but sadly it didn't work.
I followed the tutorial exactly on the same website and did get the headline of the webpage, but can't get this process to work for any other class on that, or any other, webpage. Am I missing something? I am a beginner programmer and have never used Requests-HTML or anything similar before.
Here is an example of the code I'm using; the purpose is to grab the random fact that appears in the "af-description" class when you load the webpage.
from requests_html import HTMLSession
session = HTMLSession()
r = session.get('http://mentalfloss.com/amazingfactgenerator')
description = r.html.find('.af-description', first=True)
print("Fun Fact:" + description.text)
No matter how hard I try, and no matter how I rearrange things or try different code, I can't get it to work. It seems unable to find the class or the text the class contains. Please help.
What you are trying to do requires that the HTML source contains an element with such a class. A browser does much more than just download HTML; it also downloads CSS and Javascript code when referenced by the page, and executes any scripts attached to the page, which can trigger further network activity. If the content you are looking for was generated by Javascript, you can see the elements in the browser development tools inspector, but that doesn't make the element accessible to the r.html object!
In the case of the URL you tried to scrape, if you look at the network console you'll see that an AJAX GET request to http://mentalfloss.com/api/facts is made to fill the af-details <div> structures, so if you wanted to scrape that data you could just get it as JSON directly from the API:
r = session.get('http://mentalfloss.com/api/facts')
description = r.json()[0]['fact']
print("Fun Fact:" + description)
You can make the requests_html session render the page with Javascript too by calling r.html.render().
This then uses a headless browser to render the HTML, execute the JavaScript code embedded in it, fetch the AJAX request and render the additional DOM elements, then reflect the whole page back to HTML for your code to mine. The first time you do this the required libraries for the headless browser infrastructure are downloaded for you:
>>> from requests_html import HTMLSession
>>> session = HTMLSession()
>>> r = session.get('http://mentalfloss.com/amazingfactgenerator')
>>> r.html.render()
[W:pyppeteer.chromium_downloader] start chromium download.
Download may take a few minutes.
# .... a lot more information elided
[W:pyppeteer.chromium_downloader] chromium extracted to: /Users/mj/.pyppeteer/local-chromium/533271
>>> r.html.render()
>>> r.html.find('.af-description', first=True)
<Element 'div' class=('af-description',)>
>>> _.text
'The cubicle did not get its name from its shape, but from the Latin “cubiculum” meaning bed chamber.'
However, this requires your computer to do a lot more work; for this specific example, it's easier to just call the API directly.
The div containing the class 'af-description' is not present in the static DOM; it is generated by a JS script, so it's normal that you can't find it.
If you test your script with a class that is part of the static DOM, such as 'afg-page row', you should be fine.
I am unable to extract any data from this website, although the same code works for other sites. Also, the table on this website keeps extending as a registered user scrolls down. How can I extract data from the table of such a website?
from pyquery import PyQuery as pq
import requests
url = "https://uk.tradingview.com/screener/"
content = requests.get(url).content
doc = pq(content)
Tickers = doc(".tv-screener__symbol").text()
Tickers
You're using a class name which doesn't appear in the source of the page. The most likely reason for this is that the page uses JavaScript either to load data from a server or to change the DOM once the page is loaded, adding the class name in question.
Since neither the requests library nor the pyquery library you're using has a JavaScript engine to duplicate that feat, you are left with the raw static HTML, which doesn't contain the tv-screener__symbol class.
To solve this, look at the document you actually receive from the server and try to find the data you're interested in within that raw HTML:
...
content = requests.get(url).content
print(content)
(Or you can look at the data in the browser, but you must turn off JavaScript in order to see the same document that Python gets to see.)
If the data isn't in the raw HTML, you have to look at the JavaScript to see how it makes its requests to the server backend to load the data, and then replicate those requests with the Python requests library.
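As a concrete illustration of that check, here is a tiny self-contained sketch; the HTML snippet is made up to stand in for what requests actually returns from such a page:

```python
# Hypothetical stand-in for: raw_html = requests.get(url).content
raw_html = b'''
<html><body>
  <div class="tv-screener">Loading...</div>
  <script src="/static/app.js"></script>
</body></html>
'''

# The class the browser shows only exists after JavaScript has run,
# so it is absent from the static document:
if b'tv-screener__symbol' in raw_html:
    print('present in static HTML; pyquery can select it')
else:
    print('added by JavaScript; inspect the XHR requests instead')
```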
Say I want to scrape the following url:
https://soundcloud.com/search/sounds?q=edm&filter.created_at=last_week
I have the following python code:
import requests
from lxml import html
urlToSearch = 'https://soundcloud.com/search/sounds?q=edm&filter.created_at=last_week'
page = requests.get(urlToSearch)
tree = html.fromstring(page.content)
print(tree.xpath('//*[@id="content"]/div/div/div[3]/div/div/div/ul/div/div/text()'))
The trouble is when I print the text at the following xpath:
//*[@id="content"]/div/div/div[3]/div/div/div/ul/div/div
nothing appears but [], despite my confirming that "Found 500+ tracks" should be there. What am I doing wrong?
The problem is that requests does not execute JavaScript, so it never sees dynamically generated content.
Right-click on the page and view the page source; you'll see that the static source does not include any of the content that appears after the dynamic content has loaded.
However, (using Chrome) open dev tools and click on Network and then XHR. It looks like you can get the data through an API, which is better than scraping anyway!
The problem is that with modern websites, almost every page changes quite a lot after it has been loaded, through JavaScript, CSS, and so on. You fetch the basic HTML before any DOM updates have been made, so it will look different from what you see when you actually visit the page with a browser.
Use the Selenium WebDriver framework (mostly used for test automation); it will emulate loading the page, executing the JavaScript, and so on.
Selenium Documentation for Python
I am attempting to retrieve player statistics from MLB.com for the 2016 season. I am using Beautiful Soup in Python, and I need to extract the information in the table seen here:
http://mlb.mlb.com/stats/sortable.jsp#elem=%5Bobject+Object%5D&tab_level=child&click_text=Sortable+Player+hitting&game_type='R'&season=2016&season_type=ANY&league_code='MLB'&sectionType=sp&statType=hitting&page=1&ts=1493672037085&playerType=ALL&sportCode='mlb'&split=&team_id=&active_sw=&position=&page_type=SortablePlayer&sortOrder='desc'&sortColumn=ab&results=&perPage=442&timeframe=&last_x_days=&extended=0
Here is what I have attempted:
import requests
from bs4 import BeautifulSoup

r = requests.get(url)  # url is the sortable.jsp address above
soup = BeautifulSoup(r.content, 'html.parser')
gdata = soup.find_all('div', {'id': 'datagrid'})
print(gdata)
This should return all of the subtags within the div tag, but it does not. It results in the following:
[<div id="datagrid"></div>]
Can anyone explain why this is not producing the contents of the table? Furthermore, what can I do to access the contents of the table?
Thanks
If you look at the source for the webpage, the datagrid div is actually empty, and the stats are inserted dynamically as JSON from this URL. Maybe you can use that instead. To figure this out, I looked at the page source to see that the div had no children, then used the Chrome developer tools Network tab to find the request where it pulled the data:
Open the web page
Open the chrome developer tools, Command+Option+I (Mac) or Control+Shift+I (Windows, Linux).
Refresh the web page with the tools open so it records the network requests, then wait for the page to load
(optional) Type xml in the filter bar to narrow the results to requests that are likely to contain data
Click on each request and look at the preview of the response. At this point I just manually examined the responses to see which had your data. I got lucky and found yours on the first try, since it has "stats" in the name.