scraping youtube website using beautiful soup [duplicate] - python

This question already has answers here:
Scraping YouTube links from a webpage
(3 answers)
Closed 2 years ago.
I am scraping YouTube search results using the following code:
import requests
from bs4 import BeautifulSoup
url = "https://www.youtube.com/results?search_query=python"
response = requests.get(url)
soup = BeautifulSoup(response.content,'html.parser')
for each in soup.find_all("a", class_="yt-simple-endpoint style-scope ytd-video-renderer"):
    print(each.get('href'))
but it returns nothing. What is wrong with this code?

BeautifulSoup is not the right tool for YouTube scraping: YouTube generates a lot of its content with JavaScript.
You can easily test it:
>>> import requests
>>> from bs4 import BeautifulSoup
>>> url = "https://www.youtube.com/results?search_query=python"
>>> response = requests.get(url)
>>> soup = BeautifulSoup(response.content,'html.parser')
>>> soup.find_all("a")
[About, Press, Copyright, Contact us, Creators, Advertise, Developers, Terms, Privacy, Policy and Safety, Test new features]
(note that the video links you see on the results page are not present in this list)
You need to use another solution for that; Selenium might be a good choice. Have a look at this thread for details: Fetch all href link using selenium in python


Node missing data when scraping with Beautiful Soup [duplicate]

This question already has answers here:
python selenium get data from table
(3 answers)
Closed 2 years ago.
I am writing a web scraper to pull financial data and analyst recommendations. I have an issue where the data seems to be missing or incorrect in the node: when I extract the data from the page source I get $0.00, but the correct value is $884.23.
Here is the example code below:
import requests as rq
from bs4 import BeautifulSoup as bs
sym='cmg'
url='https://www.nasdaq.com/market-activity/stocks/{}/analyst-research'.format(sym)
page_response = rq.get(url, timeout=5)
page=bs(page_response.content, 'html.parser')
sr=page.find('div', attrs={'class':'analyst-target-price__price'})
print(sr.text)
Out[546]: '$0.00'
From the html code on the site the value should be $884.23 at the time of writing this question.
As I said above, I assume the issue is that the site was not fully rendered when I got the page response/content. Does anyone have a solution to this?
The value you are trying to scrape is generated by JavaScript, so it's not in the source code of the page.
You can get the same value by sending the same request the JS is making:
import requests as rq
sym = 'cmg'
url = 'https://api.nasdaq.com/api/analyst/{}/targetprice'.format(sym)
page_response = rq.get(url).json()
priceTarget = page_response['data']['consensusOverview']['priceTarget']
lowPriceTarget = page_response['data']['consensusOverview']['lowPriceTarget']
highPriceTarget = page_response['data']['consensusOverview']['highPriceTarget']
print('priceTarget', priceTarget)
print('lowPriceTarget', lowPriceTarget)
print('highPriceTarget', highPriceTarget)
Output:
priceTarget 884.23
lowPriceTarget 550.0
highPriceTarget 1050.0

Extract only image links from beautiful soup output [duplicate]

This question already has an answer here:
BeautifulSoup HTML getting src link
(1 answer)
Closed 3 years ago.
I'm new to BeautifulSoup, and I've been trying to pull each image link out of a webpage using bs4 and requests. However, when I try to print each image link it spits out html and not a direct link to any images.
I've tried switching from using 'find' to using 'findAll', but that still doesn't solve my problem.
import bs4
import requests
req = requests.get('https://www.gnu.org/home.en.html')
soup = bs4.BeautifulSoup(req.text, features='html.parser')
html = (soup.findAll('img'))
print(html)
I expected the output to be web URLs such as
https://www.gnu.org/distros/screenshots/guixSD-gnome3-medium.jpg, but instead the output gives me HTML which looks like this:
[<img alt=" [A GNU head] " src="/graphics/heckert_gnu.transp.small.png"/>,
The relative link can be taken from the src attribute. You can use:
for im in html:
    print(im['src'])
Then, concatenating with the base URL, you can get the complete URL.
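That concatenation is easiest with urllib.parse.urljoin from the standard library, which resolves a relative src against the page URL:

```python
from urllib.parse import urljoin

# Resolve a relative src (as returned by im['src']) against the page URL.
base = "https://www.gnu.org/home.en.html"
src = "/graphics/heckert_gnu.transp.small.png"
print(urljoin(base, src))
# https://www.gnu.org/graphics/heckert_gnu.transp.small.png
```

urljoin also handles srcs that are already absolute URLs, leaving them unchanged.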

Code not on BS4, but can be found in 'Inspect Element' [duplicate]

This question already has answers here:
Web-scraping JavaScript page with Python
(18 answers)
Closed 2 years ago.
I have tried making a website which uses Beautiful Soup 4 to search g2a for the prices of games (by class). The problem is that when I look in the HTML code, it clearly shows the price of the first result (£2.30), but when I search for the class in Beautiful Soup 4, there is nothing between the same class's tags:
#summoningg2a
r = requests.get('https://www.g2a.com/?search=x')
data = r.text
soup = BeautifulSoup(data, 'html.parser')
#finding prices
prices = soup.find_all("strong", class_="mp-pi-price-min")
print(soup.prettify())
requests doesn't handle dynamic page content. Your best bet is using Selenium to drive a browser. From there you can parse page_source with BeautifulSoup to get the results you're looking for.
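A sketch of that pipeline: the rendering half needs a real browser, but the parsing half is plain BeautifulSoup and can be tried on a static snippet (the inline HTML below just stands in for driver.page_source):

```python
from bs4 import BeautifulSoup

def extract_prices(html):
    # Pull the text of every price element out of (rendered) HTML.
    soup = BeautifulSoup(html, "html.parser")
    return [tag.text for tag in soup.find_all("strong", class_="mp-pi-price-min")]

# With Selenium you would pass driver.page_source here instead.
rendered = '<div><strong class="mp-pi-price-min">£2.30</strong></div>'
print(extract_prices(rendered))  # ['£2.30']
```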
In the Chrome developer tools, you can check the URL of the AJAX request (made by JavaScript). You can mimic that request and get the data back.
r = requests.get('the ajax requests url')
data = r.text

Why does python show me text in Chinese? [duplicate]

This question already has an answer here:
Python change Accept-Language using requests
(1 answer)
Closed 6 years ago.
I am using requests and bs4 to scrape some data from a Chinese website that also has an English version. I wrote this to see if I get the right data:
import requests
from bs4 import BeautifulSoup
page = requests.get('http://dotamax.com/hero/rate/')
soup = BeautifulSoup(page.content, "lxml")
for i in soup.find_all('span'):
    print i.text
And I do; the only problem is that the text is in Chinese, although it is in English when I look at the page source. Why do I get Chinese instead of English, and how can I fix that?
The website appears to check the GET request for an Accept-Language header. If the request doesn't have one, it serves the Chinese version. However, this is an easy fix: use headers as described in the requests documentation:
import requests
from bs4 import BeautifulSoup
headers = {'Accept-Language': 'en-US,en;q=0.8'}
page = requests.get('http://dotamax.com/hero/rate/', headers=headers)
soup = BeautifulSoup(page.content, "lxml")
for i in soup.find_all('span'):
    print i.text
produces:
Anti-Mage
Axe
Bane
Bloodseeker
Crystal Maiden
Drow Ranger
...
etc.
Usually when a request shows up differently in your browser and in the requests content, it has to do with the type of request and headers you're using. One really useful tip for web scraping that I wish I had realized much earlier on is that if you hit F12 and go to the "Network" tab in Chrome or Firefox, you can get a lot of useful information to use for debugging.
You have to tell the server which language you prefer in the HTTP headers:
import requests
from bs4 import BeautifulSoup
header = {
    'Accept-Language': 'en-US'
}
page = requests.get('http://dotamax.com/hero/rate/', headers=header)
soup = BeautifulSoup(page.content, "html5lib")
for i in soup.find_all('span'):
    print(i.text)

Get web page content (Not from source code) [duplicate]

This question already has answers here:
Web-scraping JavaScript page with Python
(18 answers)
Closed 6 years ago.
I want to get the rainfall data of each day from here.
When I am in inspect mode, I can see the data. However, when I view the source code, I cannot find it.
I am using urllib2 and BeautifulSoup from bs4
Here is my code:
import urllib2
from bs4 import BeautifulSoup
link = "http://www.hko.gov.hk/cis/dailyExtract_e.htm?y=2015&m=1"
r = urllib2.urlopen(link)
soup = BeautifulSoup(r)
print soup.find_all("td", class_="td1_normal_class")
# I also tried this one
# print.find_all("div", class_="dataTable")
And I get an empty list.
My question is: How can I get the page content, but not from the page source code?
If you open up the dev tools in Chrome/Firefox and look at the requests, you'll see that the data is generated from a request to http://www.hko.gov.hk/cis/dailyExtract/dailyExtract_2015.xml, which returns the data for all 12 months, from which you can then extract what you need.
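Once you have that XML, the standard library can parse it. The element and attribute names below are illustrative assumptions, since the feed's actual schema isn't shown in this thread:

```python
import xml.etree.ElementTree as ET

# Illustrative fragment; the real dailyExtract_2015.xml has its own schema.
xml_data = """<year>
  <month num="1">
    <day date="1" rainfall="0.0"/>
    <day date="2" rainfall="1.5"/>
  </month>
</year>"""

root = ET.fromstring(xml_data)
for day in root.iter("day"):
    print(day.get("date"), day.get("rainfall"))
```

The same iter/get pattern works on the real feed once you substitute its actual tag and attribute names.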
If you cannot find the div in the source, it means the div you are looking for is generated, possibly by a JS framework like Angular or just jQuery. If you want to browse the rendered HTML, you have to use a browser which runs the included JS code.
Try using selenium
How can I parse a website using Selenium and Beautifulsoup in python?
from bs4 import BeautifulSoup
from selenium import webdriver
driver = webdriver.Firefox()
driver.get('http://www.hko.gov.hk/cis/dailyExtract_e.htm?y=2015&m=1')
html = driver.page_source
soup = BeautifulSoup(html, 'html.parser')
print soup.find_all("td", class_="td1_normal_class")
However, note that using Selenium considerably slows down the process, since it has to spin up a full browser.
