This question already has answers here:
python selenium get data from table
(3 answers)
Closed 2 years ago.
I am writing a web scraper to pull financial data and analyst recommendations. I have an issue where the data seems to be missing or incorrect in the node: when I extract the data from the page source I get $0.00, but the correct value is $884.23.
Here is the example code below:
import requests as rq
from bs4 import BeautifulSoup as bs
sym='cmg'
url='https://www.nasdaq.com/market-activity/stocks/{}/analyst-research'.format(sym)
page_response = rq.get(url, timeout=5)
page=bs(page_response.content, 'html.parser')
sr=page.find('div', attrs={'class':'analyst-target-price__price'})
print(sr.text)
Out[546]: '$0.00'
From the HTML on the site, the value should be $884.23 at the time of writing this question.
As I said above, I assume the issue is that the site was not fully rendered when I got the page response/content. Does anyone have a solution to this?
The value you are trying to scrape is generated by JavaScript, so it is not in the source code of the page.
You can get the same value by sending the same request the JavaScript makes:
import requests as rq
sym = 'cmg'
url = 'https://api.nasdaq.com/api/analyst/{}/targetprice'.format(sym)
page_response = rq.get(url).json()
priceTarget = page_response['data']['consensusOverview']['priceTarget']
lowPriceTarget = page_response['data']['consensusOverview']['lowPriceTarget']
highPriceTarget = page_response['data']['consensusOverview']['highPriceTarget']
print('priceTarget', priceTarget)
print('lowPriceTarget', lowPriceTarget)
print('highPriceTarget', highPriceTarget)
Output:
priceTarget 884.23
lowPriceTarget 550.0
highPriceTarget 1050.0
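One caveat: if a symbol has no analyst coverage, indexing straight into `['data']['consensusOverview']` will raise a `KeyError` or `TypeError`. A minimal sketch of more defensive extraction; the `sample_response` dict below is hand-written to mirror the shape shown above, not a real API response:

```python
# Sample payload, hand-written to mirror the shape shown above --
# not a real api.nasdaq.com response.
sample_response = {
    'data': {
        'consensusOverview': {
            'priceTarget': 884.23,
            'lowPriceTarget': 550.0,
            'highPriceTarget': 1050.0,
        }
    }
}

def extract_targets(payload):
    """Return (priceTarget, lowPriceTarget, highPriceTarget),
    substituting None for any missing field instead of raising."""
    overview = (payload.get('data') or {}).get('consensusOverview') or {}
    return (overview.get('priceTarget'),
            overview.get('lowPriceTarget'),
            overview.get('highPriceTarget'))

print(extract_targets(sample_response))  # (884.23, 550.0, 1050.0)
print(extract_targets({}))               # (None, None, None)
```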
This question already has answers here:
Scraping YouTube links from a webpage
(3 answers)
Closed 2 years ago.
I am scraping YouTube search results using the following code:
import requests
from bs4 import BeautifulSoup
url = "https://www.youtube.com/results?search_query=python"
response = requests.get(url)
soup = BeautifulSoup(response.content,'html.parser')
for each in soup.find_all("a", class_="yt-simple-endpoint style-scope ytd-video-renderer"):
print(each.get('href'))
but it returns nothing. What is wrong with this code?
BeautifulSoup is not the right tool for YouTube scraping - YouTube generates a lot of its content using JavaScript.
You can easily test it:
>>> import requests
>>> from bs4 import BeautifulSoup
>>> url = "https://www.youtube.com/results?search_query=python"
>>> response = requests.get(url)
>>> soup = BeautifulSoup(response.content,'html.parser')
>>> soup.find_all("a")
[About, Press, Copyright, Contact us, Creators, Advertise, Developers, Terms, Privacy, Policy and Safety, Test new features]
(note that the links you see in the screenshot are not present in the list)
You need to use another solution for that - Selenium might be a good choice. Please have a look at this thread for details: Fetch all href link using selenium in python
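Before switching tools, it can help to confirm that the selector itself is fine and the real problem is the missing markup. Running the same find_all against a static snippet containing the class works as expected (the snippet below is invented for illustration, not real YouTube markup):

```python
from bs4 import BeautifulSoup

# Invented snippet for illustration only -- not real YouTube markup.
html = ('<div>'
        '<a class="yt-simple-endpoint style-scope ytd-video-renderer"'
        ' href="/watch?v=abc123">A video</a>'
        '<a class="other" href="/about">About</a>'
        '</div>')

soup = BeautifulSoup(html, 'html.parser')
links = [a.get('href') for a in soup.find_all(
    'a', class_='yt-simple-endpoint style-scope ytd-video-renderer')]
print(links)  # ['/watch?v=abc123']
```

So the selector is correct; the tags simply never appear in the HTML that requests receives, because YouTube injects them with JavaScript after the page loads.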
New coder here. I am trying to return all the earnings per share data from this website here: https://www.nasdaq.com/market-activity/stocks/csco/revenue-eps
I started off slow by just trying to return "March", and used this code:
from bs4 import BeautifulSoup
from requests import get
url = "https://www.nasdaq.com/market-activity/stocks/csco/revenue-eps"
response = get(url)
soup = BeautifulSoup(response.text, 'html.parser')
month = soup.find("th", {"class": "revenue-eps__cell revenue-eps__cell--rowheading"})
print(month.text)
When I run it there are no errors, but nothing is returned.
When I try running the same code but use print(month) instead, I return the HTML from the element that looks like the following:
<th class="revenue-eps__cell revenue-eps__cell--rowheading" scope="row"></th>
I noticed in the HTML that is returned, that the text isn't inside the th. Why is that? Am I doing something wrong or is it the site I'm trying to scrape?
The data is not embedded in the page but retrieved from an API. You can pass the company symbol as a parameter to get all the data directly:
import requests
import json
company = "CSCO"
r = requests.get("https://api.nasdaq.com/api/company/{}/revenue?limit=1".format(company))
print(json.loads(r.text)['data'])
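The 'data' object is deeply nested, so the raw print is hard to read. json.dumps with indent makes the structure visible; the keys in this sample dict are invented for illustration, since the real field names come from the API response:

```python
import json

# Keys invented for illustration; the real names come from the API.
data = {'revenueTable': {'headers': {'value1': 'Quarter', 'value2': 'EPS'},
                         'rows': [{'value1': 'March', 'value2': '0.77'}]}}
print(json.dumps(data, indent=2, sort_keys=True))
```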
This question already has an answer here:
BeautifulSoup HTML getting src link
(1 answer)
Closed 3 years ago.
I'm new to BeautifulSoup, and I've been trying to pull each image link out of a webpage using bs4 and requests. However, when I try to print each image link it spits out html and not a direct link to any images.
I've tried switching from using 'find' to using 'findAll', but that still doesn't solve my problem.
import bs4
import requests
req = requests.get('https://www.gnu.org/home.en.html')
soup = bs4.BeautifulSoup(req.text, features='html.parser')
html = soup.find_all('img')
print(html)
I expected the output to be web url's such as
https://www.gnu.org/distros/screenshots/guixSD-gnome3-medium.jpg, but instead the output gives me html which looks like this.
[<img alt=" [A GNU head] " src="/graphics/heckert_gnu.transp.small.png"/>,
The relative link can be taken from the src attribute. You can use:
for im in html:
    print(im['src'])
Then, concatenating with the base URL, you can get the complete URL.
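Rather than concatenating strings by hand, urllib.parse.urljoin resolves both root-relative and page-relative paths against the page URL. A short sketch; the first path is from the output above, the second is an invented page-relative example:

```python
from urllib.parse import urljoin

base = 'https://www.gnu.org/home.en.html'
# First src is root-relative (from the output above);
# second is an invented page-relative example.
srcs = ['/graphics/heckert_gnu.transp.small.png',
        'distros/screenshots/guixSD-gnome3-medium.jpg']
full_urls = [urljoin(base, src) for src in srcs]
print(full_urls)
```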
This question already has answers here:
Web-scraping JavaScript page with Python
(18 answers)
Closed 2 years ago.
I have tried making a website which uses Beautiful Soup 4 to search g2a for the prices of games (by class). The problem is that when I look in the HTML code, it clearly shows the price of the first result (£2.30), but when I search for the class in Beautiful Soup 4, there is nothing between the same class's tags:
import requests
from bs4 import BeautifulSoup

# summoning g2a
r = requests.get('https://www.g2a.com/?search=x')
data = r.text
soup = BeautifulSoup(data, 'html.parser')
#finding prices
prices = soup.find_all("strong", class_="mp-pi-price-min")
print(soup.prettify())
requests doesn't handle dynamic page content. Your best bet is using Selenium to drive a browser. From there you can parse page_source with BeautifulSoup to get the results you're looking for.
In Chrome developer tools, you can check the URL of the AJAX request (the one made by JavaScript). You can mimic that request and get the data back:
import requests

r = requests.get('the ajax requests url')
data = r.text
This question already has answers here:
Web-scraping JavaScript page with Python
(18 answers)
Closed 6 years ago.
I want to get the rainfall data of each day from here.
When I am in inspect mode, I can see the data. However, when I view the source code, I cannot find it.
I am using urllib2 and BeautifulSoup from bs4
Here is my code:
import urllib2
from bs4 import BeautifulSoup
link = "http://www.hko.gov.hk/cis/dailyExtract_e.htm?y=2015&m=1"
r = urllib2.urlopen(link)
soup = BeautifulSoup(r)
print soup.find_all("td", class_="td1_normal_class")
# I also tried this one
# print soup.find_all("div", class_="dataTable")
And I got an empty array.
My question is: How can I get the page content, but not from the page source code?
If you open up the dev tools on chrome/firefox and look at the requests, you'll see that the data is generated from a request to http://www.hko.gov.hk/cis/dailyExtract/dailyExtract_2015.xml which gives the data for all 12 months which you can then extract from.
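Once you have that XML, no browser is needed; xml.etree.ElementTree from the standard library can pull the values out. The exact schema of dailyExtract_2015.xml is not shown in the thread, so the inline sample below only illustrates the parsing pattern, with invented element names:

```python
import xml.etree.ElementTree as ET

# Invented element names; the real schema of dailyExtract_2015.xml
# may differ -- inspect the file to adapt the tag names.
xml_text = '''<dailyExtract>
  <month code="201501">
    <day><date>1</date><rainfall>0.5</rainfall></day>
    <day><date>2</date><rainfall>Trace</rainfall></day>
  </month>
</dailyExtract>'''

root = ET.fromstring(xml_text)
rainfall = [(d.findtext('date'), d.findtext('rainfall'))
            for d in root.iter('day')]
print(rainfall)  # [('1', '0.5'), ('2', 'Trace')]
```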
If you cannot find the div in the source, it means the div you are looking for is generated. It could be by a JS framework like Angular, or just jQuery. If you want to browse the rendered HTML, you have to use a browser that runs the included JS code.
Try using selenium
How can I parse a website using Selenium and Beautifulsoup in python?
from bs4 import BeautifulSoup
from selenium import webdriver
driver = webdriver.Firefox()
driver.get('http://www.hko.gov.hk/cis/dailyExtract_e.htm?y=2015&m=1')
html = driver.page_source
soup = BeautifulSoup(html, 'html.parser')
print soup.find_all("td", class_="td1_normal_class")
However, note that using Selenium considerably slows down the process, since it has to launch a full browser.