How to scrape an HTML table only after data loads using Python Requests?

I am trying to learn data scraping using Python and have been using the Requests and BeautifulSoup4 libraries. It works well for normal websites. But when I tried to get some data out of websites where the table data loads after some delay, I found that I get an empty table. An example would be this webpage
The script I've tried is a fairly routine one.
import requests
from bs4 import BeautifulSoup
response = requests.get("http://www.oddsportal.com/soccer/england/premier-league/everton-arsenal-tnWxil2o#over-under;2")
soup = BeautifulSoup(response.text, "html.parser")
content = soup.find('div', {'id': 'odds-data-portal'})
The data loads into the odds-data-portal table on the page, but the code doesn't give me that. How can I make sure the table has loaded its data before I grab it?

Sorry, I can't open the link. But the table is probably generated in one of two ways:
Purely by JavaScript, with no AJAX call.
Using an AJAX call plus some JavaScript for DOM manipulation.
If it is the first case, then you have no option but to use selenium-webdriver in Python. Also, you can have a look at the example in this answer.
If it is the second case, then you can find out the URL and the data sent, and then use the requests module to send a similar request to fetch the data. The data may come back as JSON or HTML (depending on how the developer built it); you'll have to parse it accordingly. A sketch of this approach is shown below.
Sometimes the AJAX call may require a CSRF token or a cookie as part of the request data; in that case you'll have to fall back to the solution for the first case.
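For the second case, a rough sketch of replaying such a call might look like the following. The URL, parameters, and headers here are placeholders you would copy from the browser's Network tab, not values from the original answer:
import requests

# Placeholder endpoint and parameters: copy the real ones from the DevTools Network tab
ajax_url = "http://www.example.com/ajax/odds"
params = {"match_id": "tnWxil2o"}
headers = {
    "User-Agent": "Mozilla/5.0",
    "X-Requested-With": "XMLHttpRequest",  # many AJAX endpoints expect this header
    "Referer": "http://www.example.com/some-page",
}

response = requests.get(ajax_url, params=params, headers=headers)

# If the endpoint returns JSON, parse it directly; otherwise feed the HTML to BeautifulSoup
if "json" in response.headers.get("Content-Type", ""):
    data = response.json()
else:
    from bs4 import BeautifulSoup
    data = BeautifulSoup(response.text, "html.parser")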

You will need to use something like Selenium to get the HTML. You can, though, continue to use BeautifulSoup to parse it, as follows:
from bs4 import BeautifulSoup
from operator import itemgetter
from selenium import webdriver

url = "http://www.oddsportal.com/soccer/england/premier-league/everton-arsenal-tnWxil2o#over-under;2"
browser = webdriver.Firefox()
browser.get(url)

soup = BeautifulSoup(browser.page_source, "html.parser")
data_table = soup.find('div', {'id': 'odds-data-table'})

for div in data_table.find_all_next('div', class_='table-container'):
    row = div.find_all(['span', 'strong'])
    if len(row):
        print(','.join(cell.get_text(strip=True) for cell in itemgetter(0, 4, 3, 2, 1)(row)))
This would display:
Over/Under +0.5,(8),1.04,11.91,95.5%
Over/Under +0.75,(1),1.04,10.00,94.2%
Over/Under +1,(1),1.04,11.00,95.0%
Over/Under +1.25,(2),1.13,5.88,94.8%
Over/Under +1.5,(9),1.21,4.31,94.7%
Over/Under +1.75,(2),1.25,3.93,94.8%
Over/Under +2,(2),1.31,3.58,95.9%
Over/Under +2.25,(4),1.52,2.59,95.7%
Update - as suggested by @JRodDynamite, to run headless, PhantomJS can be used instead of Firefox. To do this:
Download the PhantomJS Windows binary.
Extract the phantomjs.exe executable and ensure it is on your PATH.
Change the browser line to: browser = webdriver.PhantomJS()

Related

Scraping Data from Table with Multiple Pages

I am trying to scrape data from the AGMARKNET website. The tables are split into 11 pages, but all of the pages use the same URL. I am very new to web scraping (and to Python in general), but AGMARKNET does not have a public API, so scraping the page seems to be my only option. I am currently using BeautifulSoup to parse the HTML and I am able to scrape the initial table, but that only contains the first 500 data points; I want the data from all 11 pages. I am stuck and frustrated. The link and my current code are below. Any direction would be helpful, thank you.
https://agmarknet.gov.in/SearchCmmMkt.aspx?Tx_Commodity=17&Tx_State=JK&Tx_District=0&Tx_Market=0&DateFrom=01-Oct-2004&DateTo=18-Oct-2022&Fr_Date=01-Oct-2004&To_Date=18-Oct-2022&Tx_Trend=2&Tx_CommodityHead=Apple&Tx_StateHead=Jammu+and+Kashmir&Tx_DistrictHead=--Select--&Tx_MarketHead=--Select--
import requests
import pandas as pd
from bs4 import BeautifulSoup

url = 'https://agmarknet.gov.in/SearchCmmMkt.aspx?Tx_Commodity=17&Tx_State=JK&Tx_District=0&Tx_Market=0&DateFrom=01-Oct-2004&DateTo=18-Oct-2022&Fr_Date=01-Oct-2004&To_Date=18-Oct-2022&Tx_Trend=2&Tx_CommodityHead=Apple&Tx_StateHead=Jammu+and+Kashmir&Tx_DistrictHead=--Select--&Tx_MarketHead=--Select--'
response = requests.get(url)

# Use BeautifulSoup to parse the HTML code
soup = BeautifulSoup(response.content, 'html.parser')

# changes stat_table from ResultSet to a Tag
stat_table = stat_table[0]

# Convert html table to list
rows = []
for tr in stat_table.find_all('tr')[1:]:
    cells = []
    tds = tr.find_all('td')
    if len(tds) == 0:
        ths = tr.find_all('th')
        for th in ths:
            cells.append(th.text.strip())
    else:
        for td in tds:
            cells.append(td.text.strip())
    rows.append(cells)

# convert table to df
table = pd.DataFrame(rows)
The website you linked to seems to be using JavaScript to navigate to the next page. The requests and BeautifulSoup libraries only fetch and parse static HTML, so they can't run JavaScript.
Instead of using them, you should try something like Selenium that actually simulates a full browser environment (including HTML, CSS, etc.). In fact, Selenium can even open a full browser window so you can see it in action as it navigates!
Here is a quick sample code:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.firefox.options import Options
# If you prefer Chrome to Firefox, there is a driver available
# for that as well
# Set the URL
url = 'https://agmarknet.gov.in/SearchCmmMkt.aspx?Tx_Commodity=17&Tx_State=JK&Tx_District=0&Tx_Market=0&DateFrom=01-Oct-2004&DateTo=18-Oct-2022&Fr_Date=01-Oct-2004&To_Date=18-Oct-2022&Tx_Trend=2&Tx_CommodityHead=Apple&Tx_StateHead=Jammu+and+Kashmir&Tx_DistrictHead=--Select--&Tx_MarketHead=--Select--'
# Start the browser
opts = Options()
driver = webdriver.Firefox(options=opts)
driver.get(url)
Now you can use functions like driver.find_element(...) and driver.find_elements(...) to extract the data you want from this page, the same way you did with BeautifulSoup.
For your given link, the page number navigators seem to be running a function of the form,
__doPostBack('ctl00$cphBody$GridViewBoth','Page$2')
...replacing Page$2 with Page$3, Page$4, etc. depending on which page you want. So you can use Selenium to run that JavaScript function when you're ready to navigate.
driver.execute_script("__doPostBack('ctl00$cphBody$GridViewBoth','Page$2')")
A more generic solution is to just select which button you want and then run that button's click() function. General example (not necessarily for the current website):
btn = driver.find_element('id', 'next-button')
btn.click()
A final note: after the button is clicked, you might want to time.sleep(...) for a little while to make sure the page is fully loaded before you start processing the next set of data.
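To tie the pieces above together, here is a rough sketch of walking through all the pages with the driver and url from the earlier snippet. The page count of 11, the sleep durations, and the assumption that the results grid is the first table on the page are all guesses you would adapt to the actual site:
import time
import pandas as pd

# continuing with the `driver` and `url` defined in the snippet above
driver.get(url)
time.sleep(5)  # crude wait for the first page of results to render

# pd.read_html parses every <table> in the page source; [0] assumes the results grid is the first one
frames = [pd.read_html(driver.page_source)[0]]          # page 1
for page in range(2, 12):                                # pages 2..11 (assumed page count)
    driver.execute_script(f"__doPostBack('ctl00$cphBody$GridViewBoth','Page${page}')")
    time.sleep(5)                                        # give the postback time to finish loading
    frames.append(pd.read_html(driver.page_source)[0])

full_table = pd.concat(frames, ignore_index=True)
print(len(full_table))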

Web scraping an element using beautifulSoup and Python

I am trying to grab an element from tradingview.com, specifically this link. I want the price of a symbol for whatever link I give my program. When looking through the elements of the page, I noticed I can find the price of the stock here.
<div class="tv-symbol-price-quote__value js-symbol-last">
  "3.065"
  <span class>57851</span>
</div>
When running this code below, I get this output.
#This will not run on online IDE
import requests
from bs4 import BeautifulSoup
URL = "https://www.tradingview.com/symbols/NEARUSD/"
r = requests.get(URL)
soup = BeautifulSoup(r.content, 'html.parser') # If this line causes an error, run 'pip install html5lib' or install html5lib
L = [soup.find_all(class_ = "tv-symbol-price-quote__value js-symbol-last")]
print(L)
output
[[<div class="tv-symbol-price-quote__value js-symbol-last"></div>]]
How can I grab the entire price from this website? I would like the 3.065 as well as the 57851.
You have the most common problem: the page uses JavaScript to add/update elements, but BeautifulSoup/lxml and requests/urllib can't run JavaScript. You may need Selenium to control a real web browser, which can run JS. Or use DevTools in Firefox/Chrome (Network tab) manually to see if JavaScript reads the data from some URL, and then try to use that URL with requests. JS usually gets JSON, which can be easily converted to a Python dictionary (without BS). You can also check whether the page has a (free) API for programmers.
Using DevTools I found that the page uses JavaScript to send a POST request (with some JSON data), which gets a fresh price.
import requests

payload = {
    "columns": ["market_cap_calc", "market_cap_diluted_calc", "total_shares_outstanding", "total_shares_diluted", "total_value_traded"],
    "range": [0, 1],
    "symbols": {"tickers": ["BINANCE:NEARUSD"]}
}

url = 'https://scanner.tradingview.com/crypto/scan'
response = requests.post(url, json=payload)
print(response.text)

data = response.json()
print(data['data'][0]["d"][1]/1_000_000_000)
Result:
{"totalCount":1,"data":[{"s":"BINANCE:NEARUSD","d":[2507704855.0467912,3087555230,812197570,1000000000,106737372.9550421]}]}
3.08755523
EDIT:
It seems the above code gives only the market cap, and the page uses a websocket to get a fresh price every few seconds.
wss://data.tradingview.com/socket.io/websocket?from=symbols%2FNEARUSD%2F&date=2022_10_17-11_33
And this would need more complex code.
The other answer (with Selenium) gives you the correct value.
The webpage's contents are loaded dynamically by JavaScript, so you have to use an automation tool such as Selenium, or find the hidden API.
Here I use selenium with bs4 to grab the desired dynamic content.
import time
from selenium import webdriver
from bs4 import BeautifulSoup
from selenium.webdriver.chrome.service import Service
webdriver_service = Service("./chromedriver") #Your chromedriver path
driver = webdriver.Chrome(service=webdriver_service)
url= "https://www.tradingview.com/symbols/NEARUSD/"
driver.get(url)
driver.maximize_window()
time.sleep(5)
soup = BeautifulSoup(driver.page_source,"lxml")
price = soup.find('div',class_ = "tv-symbol-price-quote__value js-symbol-last").get_text(strip=True)
print(price)
Output:
3.07525163

How do I fix getting "None" as a response when web scraping?

So I am trying to create a small script that gets the view count from a YouTube video and prints it. However, when printing the text variable with this code, I just get "None" as the response. Is there a way to get the actual view count using these libraries?
import requests
from bs4 import BeautifulSoup
url = requests.get("https://www.youtube.com/watch?v=dQw4w9WgXcQ")
soup = BeautifulSoup(url.text, 'html.parser')
text = soup.find('span', {'class': "view-count style-scopeytd-video-view-count-renderer"})
print(text)
To see why, you should use wget or curl to fetch a copy of that page and look at it, or use "view source" from your browser. That's what requests sees. None of those classes appear in the HTML you get back. That's why you get None -- because there ARE none.
YouTube builds all of its pages dynamically, through Javascript. requests doesn't interpret Javascript. If you need to do this, you'll need to use something like Selenium to run a real browser with a Javascript interpreter built in.
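As a rough illustration of that suggestion (not part of the original answer), a Selenium version might look like the sketch below; the span.view-count selector is an assumption about YouTube's current markup and may need adjusting:
import time
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()  # or webdriver.Firefox()
driver.get("https://www.youtube.com/watch?v=dQw4w9WgXcQ")
time.sleep(5)  # crude wait for the page to finish rendering its JavaScript

# 'span.view-count' is an assumed selector; inspect the page to confirm it
views = driver.find_element(By.CSS_SELECTOR, "span.view-count").text
print(views)
driver.quit()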

Why am I getting an empty body tag content when trying to use web scraping using the requests library?

I have been trying to scrape a website using the requests and BeautifulSoup Python libraries.
The problem is that I'm getting the HTML data of the web page, but the body tag content is empty, while in the inspect panel on the website it isn't.
Can anyone explain why this is happening and what I can do to get the content of the body?
Here is my code:
from bs4 import BeautifulSoup
import requests
source = requests.get('https://webaccess-il.rexail.com/?s_jwe=eyJhbGciOiJkaXIiLCJlbmMiOiJBMTI4Q0JDLUhTMjU2In0..gKfb7AnqhUiIMIn0PGb35g.SUsLS70gBec9GBgraaV5BK8hKyqm-VvMSNjP3nIumtcrj9h19zOkYjaBHrW4SDL10DjeIcwQcz9ul1p8umMHKxPPC-QZpCyJbk7JQkUSqFM._d_sGsiSyPF_Xqs2hmLN5A#/store-products-shopping-non-customers').text
soup = BeautifulSoup(source, 'lxml')
print(soup)
Here is the inspect panel of the website:
And here is the output of my code:
Thank you :)
There are two reasons why your code might not work. The first one is that the website requires additional header or cookie information, which you could try to find using the browser's Inspect/DevTools and add via
requests.get(url, headers=headers, cookies=cookies)
where headers and cookies are dictionaries.
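For example (a sketch only; the header and cookie names below are placeholders you would copy from the request shown in DevTools, not values from the original answer):
import requests

url = 'https://example.com/some-page'  # the page you want to scrape

# copy the relevant values from the request shown in the DevTools Network tab
headers = {
    'User-Agent': 'Mozilla/5.0',
    'Accept': 'text/html,application/xhtml+xml',
}
cookies = {
    'sessionid': 'value-copied-from-devtools',  # placeholder cookie name and value
}

response = requests.get(url, headers=headers, cookies=cookies)
print(response.status_code)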
The other reason, which I believe applies here, is that the content is dynamically loaded via JavaScript after the site is built, and what you get is only the initially loaded page.
To also provide you with a solution, I attach an example using Selenium, which simulates a whole browser and therefore serves the full website. Selenium has a bit of setup overhead, but you can easily google how to set it up.
from time import sleep
from selenium import webdriver
from bs4 import BeautifulSoup
url = 'https://webaccess-il.rexail.com/?s_jwe=eyJhbGciOiJkaXIiLCJlbmMiOiJBMTI4Q0JDLUhTMjU2In0..gKfb7AnqhUiIMIn0PGb35g.SUsLS70gBec9GBgraaV5BK8hKyqm-VvMSNjP3nIumtcrj9h19zOkYjaBHrW4SDL10DjeIcwQcz9ul1p8umMHKxPPC-QZpCyJbk7JQkUSqFM._d_sGsiSyPF_Xqs2hmLN5A#/store-products-shopping-non-customers'
driver = webdriver.Firefox()
driver.get(url)
sleep(10)
content = driver.page_source
soup = BeautifulSoup(content)
If you want the browser simulation to not be visible, you can add
from selenium.webdriver.firefox.options import Options
options = Options()
options.headless = True
driver = webdriver.Firefox(options=options)
which will make it run in the background.
Alternatively to Firefox, you can use pretty much any browser using the appropriate driver.
A Linux based setup example can be found here Link
Even though I find the use of Selenium easier for beginners, that site bothered me, so I figured out a pure requests way that I also want to share.
Process:
When you look at the network traffic after loading the website, you find a lot of outgoing GET requests. Assuming you are interested in the products that are loaded, I found a call, right above the product images being loaded from Amazon S3, going to
https://client-il.rexail.com/client/public/public-catalog?s_jwe=eyJhbGciOiJkaXIiLCJlbmMiOiJBMTI4Q0JDLUhTMjU2In0..gKfb7AnqhUiIMIn0PGb35g.SUsLS70gBec9GBgraaV5BK8hKyqm-VvMSNjP3nIumtcrj9h19zOkYjaBHrW4SDL10DjeIcwQcz9ul1p8umMHKxPPC-QZpCyJbk7JQkUSqFM._d_sGsiSyPF_Xqs2hmLN5A
that is, importantly,
https://client-il.rexail.com/client/public/public-catalog?s_jwe=[...]
Opening the URL, I found it to be indeed a JSON of the products. However, the s_jwe token is dynamic, and without it the JSON doesn't load.
Now, investigating the initially loaded URL and searching for s_jwe, you will find
<script>
window.customerStore = {store: angular.fromJson({"id":26,"name":"\u05de\u05e9\u05e7 \u05d4\u05e8 \u05e4\u05e8\u05d7\u05d9\u05dd","imagePath":"images\/stores\/26\/88aa6827bcf05f9484b0dafaedf22b0a.png","secondaryImagePath":"images\/stores\/4d5d1f54038b217244956071ca62312d.png","thirdImagePath":"images\/stores\/26\/2f9294180e7d656ba7280540379869ee.png","fourthImagePath":"images\/stores\/26\/bd2861565b18613497a6ce66903bf9eb.png","externalWebTrackingAccounts":"[{\"accountType\":\"googleAnalytics\",\"identifier\":\"UA-130110792-1\",\"primaryDomain\":\"ecomeshek.co.il\"},{\"accountType\":\"facebookPixel\",\"identifier\":\"3958210627568899\"}]","worksWithStoreCoupons":false,"performSellingUnitsEstimationLearning":false}), s_jwe: "eyJhbGciOiJkaXIiLCJlbmMiOiJBMTI4Q0JDLUhTMjU2In0..gKfb7AnqhUiIMIn0PGb35g.SUsLS70gBec9GBgraaV5BK8hKyqm-VvMSNjP3nIumtcrj9h19zOkYjaBHrW4SDL10DjeIcwQcz9ul1p8umMHKxPPC-QZpCyJbk7JQkUSqFM._d_sGsiSyPF_Xqs2hmLN5A"};
const externalWebTrackingAccounts = angular.fromJson(customerStore.store.externalWebTrackingAccounts);
</script>
containing
s_jwe: "eyJhbGciOiJkaXIiLCJlbmMiOiJBMTI4Q0JDLUhTMjU2In0..gKfb7AnqhUiIMIn0PGb35g.SUsLS70gBec9GBgraaV5BK8hKyqm-VvMSNjP3nIumtcrj9h19zOkYjaBHrW4SDL10DjeIcwQcz9ul1p8umMHKxPPC-QZpCyJbk7JQkUSqFM._d_sGsiSyPF_Xqs2hmLN5A"
So to summarize: even though the initial page does not contain the products, it does contain the token and the product URL.
Now you can extract the two and call the product catalog directly as such:
FINAL CODE:
import requests
import re
import json

s = requests.Session()

initial_url = 'https://webaccess-il.rexail.com/?s_jwe=eyJhbGciOiJkaXIiLCJlbmMiOiJBMTI4Q0JDLUhTMjU2In0..gKfb7AnqhUiIMIn0PGb35g.SUsLS70gBec9GBgraaV5BK8hKyqm-VvMSNjP3nIumtcrj9h19zOkYjaBHrW4SDL10DjeIcwQcz9ul1p8umMHKxPPC-QZpCyJbk7JQkUSqFM._d_sGsiSyPF_Xqs2hmLN5A#/store-products-shopping-non-customers'
initial_site = s.get(url=initial_url).content.decode('utf-8')
jwe = re.findall(r's_jwe:.*"(.*)"', initial_site)

product_url = "https://client-il.rexail.com/client/public/public-catalog?s_jwe=" + jwe[0]
products_site = s.get(url=product_url).content.decode('utf-8')
products = json.loads(products_site)["data"]
print(products[0])
There is a little bit of fine-tuning required with the decoding, but I am sure you can manage that. ;)
This of course is the leaner way of scraping that website, but as I hopefully showed, scraping is always a bit of playing Sherlock Holmes.
Any questions, glad to help.
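As a small usage sketch (the field names below are assumptions about the JSON structure, not something confirmed above; inspect products[0] to see the real keys):
# 'id' and 'name' are assumed keys; adjust after inspecting products[0]
for product in products[:5]:
    print(product.get("id"), product.get("name"))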

Go through to original URL on social media management websites

I'm doing web scraping as part of an academic project, where it's important that all links are followed through to the actual content. Annoyingly, there are some important error cases with "social media management" sites, where users post their links to detect who clicks on them.
For instance, consider this link on linkis.com, which links to http:// + bit.ly + /1P1xh9J (separated link due to SO posting restrictions), which in turn links to http://conservatives4palin.com. The issue arises as the original link at linkis.com does not automatically redirect forward. Instead, the user has to click the cross in the top right corner to go to the original URL.
Furthermore, there seems to be different variations (see e.g. linkis.com link 2, where the cross is at the bottom left of the website). These are the only two variations I've found, but there might be more. Note that I'm using a web scraper very similar to this one. The functionality to go through to the actual link does not need to be stable/functioning over time as this is a one-time academic project.
How do I automatically go on to the original URL? Would the best approach be to design a regex that finds the relevant link?
In many cases, you will have to use browser automation to scrape web pages that generate their content using JavaScript; scraping the HTML returned by a GET request will not yield the result you want. You have two options here:
Try to work your way around all the additional JavaScript requests to get the content you want, which can be very time consuming.
Use browser automation, which lets you open a real browser and automate its tasks; you can use Selenium for that.
I have been developing bots and scrapers for years now, and unless the webpage you are requesting does not rely heavily on JavaScript, you should use something like Selenium.
Here is some code to get you started with selenium:
from selenium import webdriver
#Create a chrome browser instance, other drivers are also available
driver = webdriver.Chrome()
#Request a page
driver.get('http://linkis.com/conservatives4palin.com/uGXam')
#Select elements on the page and trigger events
#Selenium supports also xpath and css selectors
#Clicks the tag with the given id
driver.find_element_by_id('some_id').click()
The common architecture that these sites follow is to show the target website inside an iframe. The sample code below runs for both of the cases.
In order to get the final URL you can do something like this:
import requests
from bs4 import BeautifulSoup

urls = ["http://linkis.com/conservatives4palin.com/uGXam", "http://linkis.com/paper.li/gsoberon/jozY2"]
response_data = []

for url in urls:
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'html.parser')
    short_url = soup.find("iframe", {"id": "source_site"})['src']
    response_data.append(requests.get(short_url).url)

print(response_data)
According to the two websites that you gave, I think you may try the following code to get the original URL, since it is hidden in a part of the JavaScript (the main scraper code I am using is from the question that you posted):
try:
    from HTMLParser import HTMLParser
except ImportError:
    from html.parser import HTMLParser

import requests, re
from contextlib import closing

CHUNKSIZE = 1024
reurl = re.compile("\"longUrl\":\"(.*?)\"")
buffer = ""
htmlp = HTMLParser()

with closing(requests.get("http://linkis.com/conservatives4palin.com/uGXam", stream=True)) as res:
    for chunk in res.iter_content(chunk_size=CHUNKSIZE, decode_unicode=True):
        buffer = "".join([buffer, chunk])
        match = reurl.search(buffer)
        if match:
            print(htmlp.unescape(match.group(1)).replace('\\', ''))
            break
say you're able to grab the href attribute/value:
s = 'href="/url/go/?url=http%3A%2F%2Fbit.ly%2F1P1xh9J"'
then you need to do the following:
import urllib.parse
s=s.partition('http')
s=s[1]+urllib.parse.unquote(s[2][0:-1])
s=urllib.parse.unquote(s)
and s will now be a string of the original bit.ly link!
try the following code:
import requests
url = 'http://'+'bit.ly'+'/1P1xh9J'
realsite = requests.get(url)
print(realsite.url)
it prints the desired output:
http://conservatives4palin.com/2015/11/robert-tracinski-the-climate-change-inquisition-begins.html?utm_source=twitterfeed&utm_medium=twitter
