How to extract values from an infinite scroll webpage with python?

I am unable to extract any data from this website, even though the same code works for other sites. The site also loads more rows when a registered user scrolls down. How can I extract the data in the table from such a website?
from pyquery import PyQuery as pq
import requests
url = "https://uk.tradingview.com/screener/"
content = requests.get(url).content
doc = pq(content)
Tickers = doc(".tv-screener__symbol").text()
print(Tickers)

You're using a class name which doesn't appear in the source of the page. The most likely reason is that the page uses JavaScript either to load data from a server or to change the DOM once the page has loaded, adding the class name in question.
Since neither the requests library nor the pyquery library you're using has a JavaScript engine to duplicate that feat, you are left with the raw static HTML, which doesn't contain any tv-screener__symbol elements.
To solve this, look at the document you actually receive from the server and try to find the data you're interested in within that raw HTML:
...
content = requests.get(url).content
print(content)
(Or you can look at the data in the browser, but you must turn off JavaScript in order to see the same document that Python sees.)
If the data isn't in the raw HTML, you have to look at the JavaScript to see how it makes its requests to the server backend to load the data, and then replicate those requests with the requests library.
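As a quick diagnostic, you can test whether the class name appears anywhere in the document before trying to select against it. A minimal sketch (the HTML string below is a simplified stand-in for a real server response, since the real one is large):

```python
from html.parser import HTMLParser

# Simulated raw HTML as the server returns it: the JavaScript that would
# normally create the tv-screener__symbol elements has not run yet.
raw_html = "<html><body><div id='js-screener-container'></div></body></html>"

class ClassFinder(HTMLParser):
    """Record whether any tag in the document carries the target CSS class."""
    def __init__(self, target):
        super().__init__()
        self.target = target
        self.found = False

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; split the class attribute
        # into individual class names before comparing.
        if self.target in (dict(attrs).get("class") or "").split():
            self.found = True

finder = ClassFinder("tv-screener__symbol")
finder.feed(raw_html)
print(finder.found)  # False: the selector has nothing to match in the static HTML
```

If this prints False on the document you actually downloaded, you know the element is injected by JavaScript and you need to find the backend request instead.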

Related

How can i fix the find_all error while web scraping?

I am kind of a newbie in the data world, so I tried to use bs4 and requests to scrape data from trending YouTube videos. I used the soup.find_all() method and displayed the result to see if it works, but it gives me an empty list. Can you help me fix it?
from bs4 import BeautifulSoup
import requests
r = requests.get("https://www.youtube.com/feed/explore")
soup = BeautifulSoup(r.content,"lxml")
soup.prettify()
trendings = soup.find_all("ytd-video-renderer", attrs={"class": "style-scope ytd-expanded-shelf-contents-renderer"})
print(trendings)
This webpage is dynamic and contains scripts to load data. Whenever you make a request with requests.get("https://www.youtube.com/feed/explore"), it loads the initial source file, which only contains things like the head, meta tags, and scripts. In the browser, those scripts then fetch the data from the server. BeautifulSoup cannot see DOM changes made via JavaScript; that's why soup.find_all("ytd-video-renderer", attrs={"class": "style-scope ytd-expanded-shelf-contents-renderer"}) gives you an empty list: there is no ytd-video-renderer tag or style-scope ytd-expanded-shelf-contents-renderer class in the static HTML.
For dynamic webpages you need a tool that executes JavaScript, such as Selenium (or Scrapy with a JavaScript-rendering add-on).
For YouTube, you can use its official API as well.
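If you would rather stay with plain requests, one middle ground is that YouTube (like many dynamic sites) embeds its initial data as a JSON blob inside a script tag, commonly assigned to a variable named ytInitialData, and you can extract that blob instead of the rendered DOM. A sketch of the idea; the HTML string and the "videos" key below are simplified stand-ins for the real, much larger response:

```python
import json
import re

# Simplified stand-in for a fetched page: the real response embeds a much
# larger JSON object in the same "var ytInitialData = {...};" pattern.
html = '<script>var ytInitialData = {"videos": [{"title": "Trending video"}]};</script>'

# Grab everything between the assignment and the closing script tag,
# then parse it as ordinary JSON.
match = re.search(r"var ytInitialData = (\{.*?\});?</script>", html)
data = json.loads(match.group(1))
print(data["videos"][0]["title"])  # Trending video
```

Inspect the real blob with print(json.dumps(data, indent=2)) to find where the video entries actually live, since the nesting is deep and changes over time.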

How to scrape a react table when there are no table tags

I am trying to scrape the table from this link. There are no table tags on the page so I am trying to access it using the class "rt-table". When I inspect the table in developer tools, I can see the html I need in the elements section (I am using Chrome). However, when I view the source code using requests, this part of the code is now missing. Does anyone know what the problem could be?
If you just want those stats you can access the hidden api like this:
import requests
import pandas as pd
url = 'https://api.nhle.com/stats/rest/en/team/summary?isAggregate=false&isGame=true&sort=%5B%7B%22property%22:%22points%22,%22direction%22:%22DESC%22%7D,%7B%22property%22:%22wins%22,%22direction%22:%22DESC%22%7D,%7B%22property%22:%22teamId%22,%22direction%22:%22ASC%22%7D%5D&start=0&limit=100&factCayenneExp=gamesPlayed%3E=1&cayenneExp=gameTypeId=2%20and%20seasonId%3C=20072008%20and%20seasonId%3E=20072008'
data = requests.get(url).json()
df = pd.DataFrame(data['data'])
df.to_csv('nj devils data.csv',index=False)
To see where I got that URL, go to the page you are trying to scrape in your browser, open Developer Tools → Network → Fetch/XHR, and hit refresh; you'll see a bunch of network requests happen. If you click on them, you can explore the data returned in the "Preview" tab and recreate the request as above. It is not always this easy: sometimes you need to send certain headers or a payload of information to authenticate a request to an API, but in this case it works with a plain GET.
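Rather than hand-encoding that long URL, you can build the same query string from readable parts with the standard library (the parameter names below are taken from the request above):

```python
from urllib.parse import urlencode

base = "https://api.nhle.com/stats/rest/en/team/summary"
params = {
    "isAggregate": "false",
    "isGame": "true",
    # urlencode percent-escapes the JSON and the comparison operators for us.
    "sort": '[{"property":"points","direction":"DESC"},'
            '{"property":"wins","direction":"DESC"},'
            '{"property":"teamId","direction":"ASC"}]',
    "start": 0,
    "limit": 100,
    "factCayenneExp": "gamesPlayed>=1",
    "cayenneExp": "gameTypeId=2 and seasonId<=20072008 and seasonId>=20072008",
}
url = base + "?" + urlencode(params)
print(url)
```

Note that urlencode encodes spaces as + rather than %20; most servers treat the two interchangeably in a query string, but you can pass quote_via=urllib.parse.quote if an API insists on %20.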

Cannot find css class using Request HTML

After following this tutorial on finding a CSS class and copying the text on a website, I tried to implement this in a small test script, but sadly it didn't work.
I followed the tutorial exactly on the same website and did get the headline of the webpage, but I can't get this process to work for any other class on that, or any other, webpage. Am I missing something? I am a beginner programmer and have never used requests_html or anything similar before.
Here is an example of the code I'm using, the purpose being to grab the random fact that appears in the "af-description" class when you load the webpage.
from requests_html import HTMLSession
session = HTMLSession()
r = session.get('http://mentalfloss.com/amazingfactgenerator')
description = r.html.find('.af-description', first=True)
print("Fun Fact:" + description.text)
No matter how hard I try, and no matter how I rearrange things or try different code, I can't get it to work. It seems unable to find the class or the text it contains. Please help.
What you are trying to do requires that the HTML source contains an element with such a class. A browser does much more than just download HTML; it also downloads CSS and Javascript code when referenced by the page, and executes any scripts attached to the page, which can trigger further network activity. If the content you are looking for was generated by Javascript, you can see the elements in the browser development tools inspector, but that doesn't make the element accessible to the r.html object!
In the case of the URL you tried to scrape, if you look at the network console you'll see that an AJAX GET request to http://mentalfloss.com/api/facts is made to fill the <div af-details> structures, so if you want that data you can just get it as JSON directly from the API:
r = session.get('http://mentalfloss.com/api/facts')
description = r.json()[0]['fact']
print("Fun Fact:" + fact)
You can make the requests_html session render the page with Javascript too by calling r.html.render().
This uses a headless browser to render the HTML, execute the JavaScript code embedded in it, perform the AJAX requests, and render the additional DOM elements, then reflects the whole page back to HTML for your code to mine. The first time you do this, the required libraries for the headless browser infrastructure are downloaded for you:
>>> from requests_html import HTMLSession
>>> session = HTMLSession()
>>> r = session.get('http://mentalfloss.com/amazingfactgenerator')
>>> r.html.render()
[W:pyppeteer.chromium_downloader] start chromium download.
Download may take a few minutes.
# .... a lot more information elided
[W:pyppeteer.chromium_downloader] chromium extracted to: /Users/mj/.pyppeteer/local-chromium/533271
>>> r.html.render()
>>> r.html.find('.af-description', first=True)
<Element 'div' class=('af-description',)>
>>> _.text
'The cubicle did not get its name from its shape, but from the Latin “cubiculum” meaning bed chamber.'
However, this requires your computer to do a lot more work; for this specific example, it's easier to just call the API directly.
The div with the class 'af-description' is not included in the initial DOM; it is generated by a JS script, so it's normal that you cannot find it.
If you test your script with a class that is present in the static DOM, such as 'afg-page row', it should work fine.

BeautifulSoup4 output with JS Filters

Newbie here. I'm trying to scrape some sports statistics off a website using BeautifulSoup4. The script below does output a table, but it's not the specific data that appears in the browser (the data in the browser is what I'm after: goalscorer data for a single season, not the all-time records).
#import libraries
from urllib.request import urlopen
from bs4 import BeautifulSoup
#specify the url
stat_page = 'https://www.premierleague.com/stats/top/players/goals?se=79'
# query the website and return the html to the variable ‘page’
page = urlopen(stat_page)
#parse the html using beautiful soup and store in variable `soup`
soup = BeautifulSoup(page, 'html.parser')
# Take out the <div> of name and get its value
stats = soup.find('tbody', attrs={'class': 'statsTableContainer'})
name = stats.text.strip()
print(name)
It appears there is some filtering of data going on behind the scenes but I am not sure how I can filter the output with BeautifulSoup4. It would appear there is some Javascript filtering happening on top of the HTML.
I have tried to identify what this specific filter is, and it appears the filtering is done here.
<div class="current" data-dropdown-current="FOOTBALL_COMPSEASON" role="button" tabindex="0" aria-expanded="false" aria-labelledby="dd-FOOTBALL_COMPSEASON" data-listen-keypress="true" data-listen-click="true">2017/18</div>
I've had a read of the link below, but I'm not entirely sure how to apply it to my problem (again, beginner here).
Having problems understanding BeautifulSoup filtering
I've tried installing, importing and applying the different parsers, but I always get the same error (Couldn't find a Tree Builder). Any suggestions on how I can pull data off a website that appears to be using a JS filter?
Thanks.
In these cases, it's usually useful to track the network requests using your browser's developer tools, since the data is usually retrieved using AJAX and then displayed in the browser with JS.
In this case, it looks like the data you're looking for can be accessed at:
https://footballapi.pulselive.com/football/stats/ranked/players/goals?page=0&pageSize=20&compSeasons=79&comps=1&compCodeForActivePlayer=EN_PR&altIds=true
It has a standard JSON format so you should be able to parse and extract the data with minimal effort.
However, note that this endpoint requires the Origin HTTP header to be set to https://www.premierleague.com in order for it to serve your request.
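With requests, setting that header is a one-liner. The snippet below builds the request without sending it, just to show where the header ends up (a sketch; the field names in the JSON response vary, so inspect the real payload before parsing it):

```python
import requests

# Attach the Origin header to a session so every request carries it.
session = requests.Session()
session.headers.update({"Origin": "https://www.premierleague.com"})

req = requests.Request(
    "GET",
    "https://footballapi.pulselive.com/football/stats/ranked/players/goals",
    params={"page": 0, "pageSize": 20, "compSeasons": 79, "comps": 1,
            "compCodeForActivePlayer": "EN_PR", "altIds": "true"},
)
prepared = session.prepare_request(req)  # merges in the session's headers
print(prepared.headers["Origin"])  # https://www.premierleague.com
print(prepared.url)

# To actually fetch the data: data = session.send(prepared).json()
```

Using a Session also reuses the underlying connection if you page through the results with several requests.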

Scraping hidden content from a javascript webpage with python

I'm trying to scrape the content from the following website:
https://mobile.admiral.at/en/event/event/all#/event/15a822ab-84a1-e511-90a2-000c297013a7
I have previously scraped the content successfully using dryscrape and the following code:
import dryscrape
import webkit_server
from lxml import html
session = dryscrape.Session()
session.set_timeout(20)
session.set_attribute('auto_load_images', False)
session.visit('https://mobile.admiral.at/en/event/event/all#/event/15a822ab-84a1-e511-90a2-000c297013a7')
response = session.body()
tree = html.fromstring(response)
print(tree.xpath('(//td[@class="team-name"]/text())[1]'))
The above example would print the home team (which in this case would be 'France').
It seems that the structure of the source has been changed, so I'm unable to scrape the contents properly.
What confuses me is that I'm able to see the tags using the Firefox Inspector tool, however it's not visible in the response when I pull the source.
I assume they must have hidden the content somehow to make it impossible (?) to scrape the data.
Could someone please point me in the right direction how to scrape the content properly.
The content that you need is loaded using jQuery (AJAX). I don't know if dryscrape has been updated lately, but the last time I used it, it didn't support content loaded via jQuery AJAX calls...
Anyway, just take a look at Chrome's network inspector and you will see that the main content is loaded from an API. You can call that API directly and you will get an awesome JSON with all the data of the page:
import requests
data = requests.get('https://mobile.admiral.at/;apiVer=json;api=main;jsonType=object;apiRw=1/en/api/event/get-event?id=15a822ab-84a1-e511-90a2-000c297013a7').json()
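Once you have the JSON, pulling out a value such as the home team name is ordinary dict/list indexing. The structure below is made up purely for illustration; print the real response (or browse it in the network inspector) to find the actual key names:

```python
# Hypothetical shape of the API response, for illustration only.
data = {"event": {"teams": [{"name": "France"}, {"name": "Iceland"}]}}

home_team = data["event"]["teams"][0]["name"]
print(home_team)  # France
```

This replaces the XPath lookup entirely: once the data comes from the API, there is no HTML to parse at all.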
