This question already has answers here:
Web-scraping JavaScript page with Python
(18 answers)
Closed 2 years ago.
I am trying to make a website that uses Beautiful Soup 4 to search g2a for the prices of games (by class). The problem is that when I inspect the page in the browser, the HTML clearly shows the price of the first result (£2.30), but when I search for the same class with Beautiful Soup 4, there is nothing between that class's tags:
# summoning g2a
import requests
from bs4 import BeautifulSoup

r = requests.get('https://www.g2a.com/?search=x')
data = r.text
soup = BeautifulSoup(data, 'html.parser')

# finding prices
prices = soup.find_all("strong", class_="mp-pi-price-min")
print(soup.prettify())
requests doesn't handle dynamic page content. Your best bet is using Selenium to drive a real browser. From there you can parse page_source with BeautifulSoup to get the results you're looking for.
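For instance, a minimal sketch of that approach (this assumes Chrome with a matching chromedriver is installed; the class name is taken from the question and may not match g2a's current markup):

from bs4 import BeautifulSoup
from selenium import webdriver

driver = webdriver.Chrome()
driver.get('https://www.g2a.com/?search=x')

# the browser has executed the JavaScript, so page_source holds the rendered HTML
soup = BeautifulSoup(driver.page_source, 'html.parser')
driver.quit()

for price in soup.find_all("strong", class_="mp-pi-price-min"):
    print(price.text)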
In Chrome developer tools, you can check the URL of the AJAX request (made by JavaScript). You can mimic that request and get the data back.
r = requests.get('the ajax requests url')
data = r.text
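If that endpoint returns JSON, as such search APIs commonly do, you can skip BeautifulSoup entirely. A hypothetical sketch (the 'items' and 'price' field names are illustrative, not g2a's actual API):

import requests

r = requests.get('the ajax requests url')  # replace with the URL from the Network tab
data = r.json()  # assumes the endpoint returns JSON
for item in data.get('items', []):  # hypothetical field names
    print(item.get('price'))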
This question already has answers here:
BS4 Beautiful Soup extract text from find_all
(2 answers)
Closed 2 years ago.
I'm trying to learn how to use Beautiful Soup, using this website as a very simple example:
https://www.espncricinfo.com/ci/content/ground/56490.html#Profile
Let's say I want to extract the capacity of the ground. So far I have written the following code, which gives me the field names, but I can't work out how to get the actual value of 18,000.
Can anyone help?
url="https://www.espncricinfo.com/ci/content/ground/56490.html"
response = requests.get(url)
soup = BeautifulSoup(response.text)
soup.findAll('label')
Perhaps something like
from bs4 import BeautifulSoup
import requests
url = "https://www.espncricinfo.com/ci/content/ground/56490.html"
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
stats = soup.find('div', {'id': 'stats'})
for e in stats.findAll('label'):
    print(f"{e.text}: {e.nextSibling}")
This question already has answers here:
Scraping YouTube links from a webpage
(3 answers)
Closed 2 years ago.
I am scraping YouTube search results using the following code:
import requests
from bs4 import BeautifulSoup
url = "https://www.youtube.com/results?search_query=python"
response = requests.get(url)
soup = BeautifulSoup(response.content,'html.parser')
for each in soup.find_all("a", class_="yt-simple-endpoint style-scope ytd-video-renderer"):
    print(each.get('href'))
But it returns nothing. What is wrong with this code?
BeautifulSoup is not the right tool for YouTube scraping - YouTube generates a lot of its content using JavaScript.
You can easily test it:
>>> import requests
>>> from bs4 import BeautifulSoup
>>> url = "https://www.youtube.com/results?search_query=python"
>>> response = requests.get(url)
>>> soup = BeautifulSoup(response.content,'html.parser')
>>> soup.find_all("a")
[About, Press, Copyright, Contact us, Creators, Advertise, Developers, Terms, Privacy, Policy and Safety, Test new features]
(note that the video links you see in the browser are not present in this list)
You need to use another solution for that - Selenium might be a good choice. Please have a look at this thread for details: Fetch all href link using selenium in python
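A minimal sketch of the Selenium route (assumes Chrome with a matching chromedriver; YouTube's markup changes often, so the a#video-title selector is illustrative):

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get("https://www.youtube.com/results?search_query=python")
# an explicit wait may be needed here so the results finish rendering

# let Selenium query the rendered DOM directly instead of re-parsing with BeautifulSoup
for a in driver.find_elements(By.CSS_SELECTOR, "a#video-title"):
    print(a.get_attribute('href'))

driver.quit()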
This question already has answers here:
Web-scraping JavaScript page with Python
(18 answers)
Closed 4 years ago.
I'm trying to get all the links available on this page using BeautifulSoup.
But when I fetch the URL with urllib and then parse it with BeautifulSoup, it doesn't return all the information available on the page.
I have tried different parsers (html.parser, lxml, xml, html5lib), but none of them return the desired result.
I know how to get tag details, but the file in which I store the HTML data does not contain the links, even though Chrome's inspect element does show them. Below is my code with the URL I'm working on:
import ssl
import urllib.request
from bs4 import BeautifulSoup

def fetch_html(fullurl, contextstring):
    print("Opening the file connection for " + fullurl)
    uh = urllib.request.urlopen(fullurl, context=contextstring)
    print("HTTP status", uh.getcode())
    html = uh.read()
    bs = BeautifulSoup(html, 'lxml')
    return bs

ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE

mainurl = 'https://www.daad.de/deutschland/studienangebote/international-programmes/en/result/?q=&degree%5B%5D=2&lang%5B%5D=2&fos=3&crossFac=&cert=&admReq=&scholarshipLC=&scholarshipSC=&langDeAvailable=&langEnAvailable=&lvlEn%5B%5D=&cit%5B%5D=&tyi%5B%5D=&fee=&bgn%5B%5D=&dur%5B%5D=&sort=4&ins%5B%5D=&subjects%5B%5D=&limit=10&offset=&display=list'

a = fetch_html(mainurl, ctx)
f = open(r"F:\Harsh docs\python\courselinks.py", "w")  # raw string keeps the backslashes literal
f.write(a.prettify())
f.close()  # close() must be called, not just referenced
In the results, I'm interested in getting the link for "Embedded Systems (ESY)".
It seems the page you are scraping is rendered with JavaScript.
You can try using Selenium with Chrome.
Or you can use the requests_html package (https://html.python-requests.org/) to render the JavaScript before getting the HTML.
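For instance, a minimal sketch with requests_html (assuming the package is installed; the first render() call downloads Chromium):

from requests_html import HTMLSession

session = HTMLSession()
r = session.get('https://www.daad.de/deutschland/studienangebote/international-programmes/en/result/?q=&degree%5B%5D=2&lang%5B%5D=2&fos=3&crossFac=&cert=&admReq=&scholarshipLC=&scholarshipSC=&langDeAvailable=&langEnAvailable=&lvlEn%5B%5D=&cit%5B%5D=&tyi%5B%5D=&fee=&bgn%5B%5D=&dur%5B%5D=&sort=4&ins%5B%5D=&subjects%5B%5D=&limit=10&offset=&display=list')
r.html.render()  # executes the page's JavaScript

# print every absolute link found on the rendered page
for link in r.html.absolute_links:
    print(link)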
To get all the links from a page, use the code below (Python 3):
from bs4 import BeautifulSoup
import re
from urllib.request import urlopen
html_page = urlopen("http://www.google.com/")
soup = BeautifulSoup(html_page, 'html.parser')
for link in soup.findAll('a', attrs={'href': re.compile("^http://")}):
    print(link.get('href'))
This question already has an answer here:
Python change Accept-Language using requests
(1 answer)
Closed 6 years ago.
I am using requests and bs4 to scrape some data from a Chinese website that also has an English version. I wrote this to see if I get the right data:
import requests
from bs4 import BeautifulSoup
page = requests.get('http://dotamax.com/hero/rate/')
soup = BeautifulSoup(page.content, "lxml")
for i in soup.find_all('span'):
    print i.text
And I do, the only problem is that the text is in Chinese, even though it is in English when I look at the page source. Why do I get Chinese instead of English, and how can I fix that?
The website appears to check the incoming request for an Accept-Language header. If the request doesn't have one, it serves the Chinese version. However, this is an easy fix - pass headers as described in the requests documentation:
import requests
from bs4 import BeautifulSoup
headers = {'Accept-Language': 'en-US,en;q=0.8'}
page = requests.get('http://dotamax.com/hero/rate/', headers=headers)
soup = BeautifulSoup(page.content, "lxml")
for i in soup.find_all('span'):
    print i.text
produces:
Anti-Mage
Axe
Bane
Bloodseeker
Crystal Maiden
Drow Ranger
...
etc.
Usually when a request shows up differently in your browser and in the requests content, it has to do with the type of request and the headers you're using. One really useful web-scraping tip that I wish I had realized much earlier: if you hit F12 and go to the "Network" tab in Chrome or Firefox, you can see the exact requests the browser makes, headers included, which gives you a lot of useful information for debugging.
You have to tell the server which language you prefer in the HTTP headers:
import requests
from bs4 import BeautifulSoup
header = {
    'Accept-Language': 'en-US'
}
page = requests.get('http://dotamax.com/hero/rate/', headers=header)
soup = BeautifulSoup(page.content, "html5lib")
for i in soup.find_all('span'):
    print(i.text)
This question already has answers here:
Web-scraping JavaScript page with Python
(18 answers)
Closed 6 years ago.
I want to get the rainfall data for each day from here.
When I am in inspect mode, I can see the data. However, when I view the source code, I cannot find it.
I am using urllib2 and BeautifulSoup from bs4
Here is my code:
import urllib2
from bs4 import BeautifulSoup
link = "http://www.hko.gov.hk/cis/dailyExtract_e.htm?y=2015&m=1"
r = urllib2.urlopen(link)
soup = BeautifulSoup(r, 'html.parser')
print soup.find_all("td", class_="td1_normal_class")
# I also tried this one
# print soup.find_all("div", class_="dataTable")
And I got an empty list.
My question is: how can I get the rendered page content, rather than what is in the page source?
If you open up the dev tools in Chrome/Firefox and look at the requests, you'll see that the data comes from a request to http://www.hko.gov.hk/cis/dailyExtract/dailyExtract_2015.xml, which returns the data for all 12 months; you can then extract what you need from it.
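A minimal sketch of that approach (I have not inspected the XML's structure, so the actual extraction step is left as exploration):

import requests
from bs4 import BeautifulSoup

# one request fetches the whole year's data
r = requests.get('http://www.hko.gov.hk/cis/dailyExtract/dailyExtract_2015.xml')
soup = BeautifulSoup(r.content, 'xml')  # the 'xml' parser requires lxml
print(soup.prettify())  # inspect the structure to locate the daily rainfall values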
If you cannot find the div in the source, it means the div you are looking for is generated by JavaScript. It could be using a JS framework like Angular, or just jQuery. If you want to browse the rendered HTML, you have to use a browser that runs the included JS code.
Try using selenium
How can I parse a website using Selenium and Beautifulsoup in python?
from bs4 import BeautifulSoup
from selenium import webdriver
driver = webdriver.Firefox()
driver.get('http://www.hko.gov.hk/cis/dailyExtract_e.htm?y=2015&m=1')
html = driver.page_source
soup = BeautifulSoup(html, 'html.parser')
print soup.find_all("td", class_="td1_normal_class")
However, note that using Selenium considerably slows down the process, since it has to launch a full browser.