Extract only image links from beautiful soup output [duplicate] - python

This question already has an answer here:
BeautifulSoup HTML getting src link
(1 answer)
Closed 3 years ago.
I'm new to BeautifulSoup, and I've been trying to pull each image link out of a webpage using bs4 and requests. However, when I try to print each image link, it prints HTML rather than a direct link to any image.
I've tried switching from using 'find' to using 'findAll', but that still doesn't solve my problem.
import bs4
import requests
req = requests.get('https://www.gnu.org/home.en.html')
soup = bs4.BeautifulSoup(req.text, features='html.parser')
html = (soup.findAll('img'))
print(html)
I expected the output to be web URLs such as
https://www.gnu.org/distros/screenshots/guixSD-gnome3-medium.jpg, but instead the output gives me HTML which looks like this:
[<img alt=" [A GNU head] " src="/graphics/heckert_gnu.transp.small.png"/>,

The relative link can be taken from the src attribute. You can use:
for im in html:
    print(im['src'])
Then, concatenating with the base URL, you can get the complete URL.
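As a minimal sketch of that last step, the standard library's urllib.parse.urljoin resolves a relative src against the page URL. The helper name absolute_image_urls is hypothetical, and the second src value is an assumed relative form of the URL the question expected; in practice the values would come from im['src'].

```python
from urllib.parse import urljoin


def absolute_image_urls(page_url, srcs):
    """Resolve each (possibly relative) src value against the page URL."""
    return [urljoin(page_url, src) for src in srcs]


# Page URL from the question. The first src is taken from the question's
# output; the second is an assumed relative path used for illustration.
page_url = 'https://www.gnu.org/home.en.html'
srcs = [
    '/graphics/heckert_gnu.transp.small.png',
    'distros/screenshots/guixSD-gnome3-medium.jpg',
]

for url in absolute_image_urls(page_url, srcs):
    print(url)
```

urljoin also leaves a src that is already absolute unchanged, so it is usually safer than plain string concatenation.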

Related

scraping youtube website using beautiful soup [duplicate]

This question already has answers here:
Scraping YouTube links from a webpage
(3 answers)
Closed 2 years ago.
I am scraping YouTube search results using the following code:
import requests
from bs4 import BeautifulSoup
url = "https://www.youtube.com/results?search_query=python"
response = requests.get(url)
soup = BeautifulSoup(response.content,'html.parser')
for each in soup.find_all("a", class_="yt-simple-endpoint style-scope ytd-video-renderer"):
    print(each.get('href'))
But it returns nothing. What is wrong with this code?
BeautifulSoup is not the right tool for scraping YouTube - YouTube generates a lot of its content using JavaScript.
You can easily test it:
>>> import requests
>>> from bs4 import BeautifulSoup
>>> url = "https://www.youtube.com/results?search_query=python"
>>> response = requests.get(url)
>>> soup = BeautifulSoup(response.content,'html.parser')
>>> soup.find_all("a")
[About, Press, Copyright, Contact us, Creators, Advertise, Developers, Terms, Privacy, Policy and Safety, Test new features]
(Note that the links you see on the screenshot are not present in the list.)
You need to use another solution for that - Selenium might be a good choice. Please have a look at this thread for details: Fetch all href link using selenium in python
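Before reaching for a browser, it can help to confirm that the element really is absent from the HTML the server returns. The sketch below uses only the standard library and a made-up page shell standing in for response.text; the names ClassCounter and count_class_in_html are hypothetical.

```python
from html.parser import HTMLParser


class ClassCounter(HTMLParser):
    """Count start tags whose class attribute contains a given class name."""

    def __init__(self, class_name):
        super().__init__()
        self.class_name = class_name
        self.count = 0

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value may be None
        classes = (dict(attrs).get('class') or '').split()
        if self.class_name in classes:
            self.count += 1


def count_class_in_html(html, class_name):
    """Return how many tags in the raw HTML carry the given class."""
    parser = ClassCounter(class_name)
    parser.feed(html)
    return parser.count


# Hypothetical stand-in for response.text: an almost empty shell that
# JavaScript would fill in later, inside a real browser.
static_html = (
    '<html><body><div id="content"></div>'
    '<script src="app.js"></script></body></html>'
)

print(count_class_in_html(static_html, 'ytd-video-renderer'))  # 0
```

A count of zero on the raw response suggests the content is added by JavaScript, which neither requests nor BeautifulSoup will execute.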

Node missing data when scraping with Beautiful Soup [duplicate]

This question already has answers here:
python selenium get data from table
(3 answers)
Closed 2 years ago.
I am writing a web scraper to pull financial data and analyst recommendations. I have an issue where the data in the node seems to be missing or incorrect: when I extract the data from the page source I get $0.00, but the correct value is $884.23.
Here is the example code below:
import requests as rq
from bs4 import BeautifulSoup as bs
sym='cmg'
url='https://www.nasdaq.com/market-activity/stocks/{}/analyst-research'.format(sym)
page_response = rq.get(url, timeout=5)
page=bs(page_response.content, 'html.parser')
sr=page.find('div', attrs={'class':'analyst-target-price__price'})
print(sr.text)
Out[546]: '$0.00'
From the HTML code on the site, the value should be $884.23 at the time of writing this question.
As I said above, I assume the issue is that the site was not fully rendered when I got the page response/content. Does anyone have a solution to this?
The value you are trying to scrape is generated by JavaScript, so it's not in the source code of the page.
You can get the same value by sending the same request the JavaScript is making:
import requests as rq
sym = 'cmg'
url = 'https://api.nasdaq.com/api/analyst/{}/targetprice'.format(sym)
page_response = rq.get(url).json()
priceTarget = page_response['data']['consensusOverview']['priceTarget']
lowPriceTarget = page_response['data']['consensusOverview']['lowPriceTarget']
highPriceTarget = page_response['data']['consensusOverview']['highPriceTarget']
print('priceTarget', priceTarget)
print('lowPriceTarget', lowPriceTarget)
print('highPriceTarget', highPriceTarget)
Output:
priceTarget 884.23
lowPriceTarget 550.0
highPriceTarget 1050.0
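Since an undocumented endpoint like this can change or omit fields, a defensive lookup may be worth having. The sketch below is an illustration only: get_nested is a hypothetical helper, and the sample dict is built by hand from the key names and output values shown in the answer rather than fetched from the API.

```python
def get_nested(data, *keys, default=None):
    """Walk nested dicts, returning default if any key is missing."""
    current = data
    for key in keys:
        if not isinstance(current, dict) or key not in current:
            return default
        current = current[key]
    return current


# Hand-built sample shaped like the keys used in the answer; the values
# are copied from the answer's output, not retrieved live.
sample = {
    'data': {
        'consensusOverview': {
            'priceTarget': 884.23,
            'lowPriceTarget': 550.0,
            'highPriceTarget': 1050.0,
        }
    }
}

for name in ('priceTarget', 'lowPriceTarget', 'highPriceTarget'):
    print(name, get_nested(sample, 'data', 'consensusOverview', name))

# A response with no data yields the default instead of raising an error.
print(get_nested({'data': None}, 'data', 'consensusOverview', 'priceTarget'))
```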

Code not on BS4, but can be found in 'Inspect Element' [duplicate]

This question already has answers here:
Web-scraping JavaScript page with Python
(18 answers)
Closed 2 years ago.
I have tried making a website which uses Beautiful Soup 4 to search g2a for the prices of games (by class). The problem is that when I look in the HTML code, it clearly shows the price of the first result (£2.30), but when I search for the class in Beautiful Soup 4, there is nothing between the same class's tags:
import requests
from bs4 import BeautifulSoup
# fetch the G2A search page
r = requests.get('https://www.g2a.com/?search=x')
data = r.text
soup = BeautifulSoup(data, 'html.parser')
# find the price elements by class
prices = soup.find_all("strong", class_="mp-pi-price-min")
print(soup.prettify())
requests doesn't handle dynamic page content. Your best bet is using Selenium to drive a browser. From there you can parse page_source with BeautifulSoup to get the results you're looking for.
In Chrome's developer tools, you can check the URL of the AJAX request (made by JavaScript). You can mimic that request and get the data back.
r = requests.get('the ajax requests url')
data = r.text
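A slightly fuller sketch of mimicking such a request is shown below. It swaps requests for the standard library's urllib.request so that it has no third-party dependency. The URL is a placeholder, the helper names are hypothetical, and the headers are assumptions: a real endpoint may need different headers or cookies, and may not return JSON at all.

```python
import json
from urllib.request import Request, urlopen


def build_json_request(url):
    """Build a GET request with headers similar to a browser's AJAX call."""
    return Request(url, headers={
        'Accept': 'application/json',
        'User-Agent': 'Mozilla/5.0',  # assumed; some sites reject default clients
    })


def fetch_json(url, timeout=10):
    """Send the request and decode a JSON body (performs a network call)."""
    with urlopen(build_json_request(url), timeout=timeout) as response:
        return json.loads(response.read().decode('utf-8'))


# Placeholder URL: replace it with the request URL copied from the
# Network tab of the browser's developer tools.
request = build_json_request('https://example.com/api/search?query=x')
print(request.get_method(), request.full_url)
```

Calling fetch_json with the real URL would then return the parsed data, assuming the endpoint responds with JSON.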

How to scrape text from a hidden div and class using Python?

I am working on a script for scraping video titles from this webpage:
https://www.google.com.eg/trends/hotvideos
The problem is that the titles are hidden in the HTML source of the page, but I can see them if I use the inspector.
Here is my code. It works fine with "class":"wrap",
but when I use it with the hidden one, "class":"hotvideos-single-trend-title-container", it doesn't give me any output.
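from urllib.request import urlopen
from bs4 import BeautifulSoup
html = urlopen('https://www.google.com.eg/trends/hotvideos').read()
# name the parser explicitly to avoid bs4's "no parser specified" warning
soup = BeautifulSoup(html, 'html.parser')
# using "wrap" as the class works; this class returns an empty list
print(soup.findAll('div', {"class": "hotvideos-single-trend-title-container"}))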
The page is generated/populated using JavaScript.
BeautifulSoup won't help you here; you need a library which supports JavaScript-generated HTML pages (see here for a list), or have a look at Selenium.

Get web page content (Not from source code) [duplicate]

This question already has answers here:
Web-scraping JavaScript page with Python
(18 answers)
Closed 6 years ago.
I want to get the rainfall data of each day from here.
When I am in inspect mode, I can see the data. However, when I view the source code, I cannot find it.
I am using urllib2 and BeautifulSoup from bs4
Here is my code:
import urllib2
from bs4 import BeautifulSoup
link = "http://www.hko.gov.hk/cis/dailyExtract_e.htm?y=2015&m=1"
r = urllib2.urlopen(link)
soup = BeautifulSoup(r)
print soup.find_all("td", class_="td1_normal_class")
# I also tried this one
# print.find_all("div", class_="dataTable")
And I got an empty array.
My question is: How can I get the page content, but not from the page source code?
If you open up the dev tools in Chrome/Firefox and look at the requests, you'll see that the data comes from a request to http://www.hko.gov.hk/cis/dailyExtract/dailyExtract_2015.xml, which gives the data for all 12 months; you can then extract what you need from it.
If you cannot find the div in the source, it means that the div you are looking for is generated dynamically. It could be produced by a JS framework like Angular or just jQuery. To browse the rendered HTML, you have to use a browser that runs the included JS code.
Try using Selenium; see:
How can I parse a website using Selenium and Beautifulsoup in python?
from bs4 import BeautifulSoup
from selenium import webdriver
driver = webdriver.Firefox()
driver.get('http://www.hko.gov.hk/cis/dailyExtract_e.htm?y=2015&m=1')
# page_source holds the HTML after the browser has run the page's JavaScript
html = driver.page_source
driver.quit()
soup = BeautifulSoup(html, 'html.parser')
print(soup.find_all("td", class_="td1_normal_class"))
However, note that using Selenium considerably slows down the process, since it has to launch a browser.
