I'd like to scrape the news headline, the link to the story, and the picture for each story.
I tried web scraping as shown below,
but my code only handles the headlines, and it doesn't work.
import requests
import pandas as pd
from bs4 import BeautifulSoup
nbc_business = "https://news.mongabay.com/list/environment"
res = requests.get(nbc_business, verify=False)
soup = BeautifulSoup(res.content, 'html.parser')
headlines = soup.find_all('h2',{'class':'post-title-news'})
len(headlines)
for i in range(len(headlines)):
    print(headlines[i].text)
Please recommend how to fix it.
This is because the site blocks bots; if you print res.content, you will see a 403 response.
Add headers={'User-Agent': 'Mozilla/5.0'} to the request.
Try the code below:
nbc_business = "https://news.mongabay.com/list/environment"
res = requests.get(nbc_business, verify=False, headers={'User-Agent':'Mozilla/5.0'})
soup = BeautifulSoup(res.content, 'html.parser')
headlines = soup.find_all('h2', class_='post-title-news')
print(len(headlines))
for i in range(len(headlines)):
    print(headlines[i].text)
First things first: never post code as an image.
The <h2> in your HTML has no text of its own. What it does have is an <a> element, so:
for hl in headlines:
    link = hl.findChild()
    text = link.text
    url = link.attrs['href']
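The question also asks for the picture and imports pandas. Here is a minimal sketch that pulls headline, link, and image into a DataFrame; it assumes each headline's parent element is the article card and that the card contains an <img> tag (the exact markup on the live page may differ):

import requests
import pandas as pd
from bs4 import BeautifulSoup

url = "https://news.mongabay.com/list/environment"
res = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'})
soup = BeautifulSoup(res.content, 'html.parser')

rows = []
for h2 in soup.find_all('h2', class_='post-title-news'):
    a = h2.find('a')
    if a is None:
        continue
    # Assumption: the picture lives in the same card as the headline,
    # so walk up to the headline's parent and look for an <img> there.
    card = h2.parent
    img = card.find('img') if card is not None else None
    rows.append({
        'headline': a.get_text(strip=True),
        'link': a.get('href'),
        'image': img.get('src') if img is not None else None,
    })

df = pd.DataFrame(rows)
print(df.head())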
Related
Hi. I am trying to scrape the links for all the products on a specific Sephora page. My code only gives me the first 12 links, while there are 48 products on the page. I think this is because Sephora loads the rest of the products dynamically (please correct me if I am wrong), so they are not in the initial HTML. But I do not know how to get the rest. Please send some help! Thank you!
Here is my code:
from bs4 import BeautifulSoup
import requests
url = "https://www.sephora.com/brand/estee-lauder/skincare"
r = requests.get(url)
data = r.text
soup = BeautifulSoup(data,'html.parser')
link_list = []
keyword = 'product'
for link in soup.findAll('a'):
    href = link.get('href')
    # skip anchors without an href attribute
    if href and keyword in href:
        link_list.append('https://www.sephora.com/' + href)
If you take a look at the page source, you will see the data stored as a JSON object inside a <script id="linkJSON"> tag. You can get the JSON object like this:
from bs4 import BeautifulSoup
import requests
import json

url = "https://www.sephora.com/brand/estee-lauder/skincare"
r = requests.get(url)
soup = BeautifulSoup(r.text, 'html.parser')

# The page embeds its product data as JSON in a <script id="linkJSON"> tag
data = json.loads(soup.find('script', id='linkJSON').text)
products = data[3]['props']['products']

prefix = "https://www.sephora.com"
url_links = [prefix + p["targetUrl"] for p in products]
print(url_links)
By inspecting the JSON data, you can find where the links are stored. To view the JSON more clearly, I use this site: https://codebeautify.org/jsonviewer
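The hard-coded data[3] index could break if the page's JSON layout shifts. As a more defensive variant (just a sketch, assuming the JSON is a list of dicts and that one entry carries props.products), you could search for the right entry instead:

# Hypothetical helper: locate the entry that actually holds the products,
# instead of relying on it always being at index 3.
def find_products(entries):
    for entry in entries:
        if isinstance(entry, dict):
            props = entry.get('props', {})
            if 'products' in props:
                return props['products']
    return []

products = find_products(data)
url_links = ["https://www.sephora.com" + p["targetUrl"] for p in products]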
I am currently trying to scrape product URLs from the Lazada e-commerce platform; however, I am getting random links from the page rather than the product links.
https://www.lazada.com.my/oldtown-white-coffee/?langFlag=en&q=All-Products&from=wangpu&pageTypeId=2
My code is below:
from bs4 import BeautifulSoup, SoupStrainer
import requests

url = "https://www.lazada.com.my/oldtown-white-coffee/?langFlag=en&q=All-Products&from=wangpu&pageTypeId=2"
page = requests.get(url)
data = page.text
soup = BeautifulSoup(data, 'html.parser')

links = soup.find_all('div', {'class': 'c16H9d'})
for link in soup.find_all("a"):
    print(link.get("href"))
The result I am getting out of this code is not what I want. I want to list all the product URLs from the products page.
I hope you can help me with this. I know it should be simple, it just doesn't seem to work; I have been looking at it since yesterday.
The page is dynamic. Within the HTML source there is a script that holds the products in JSON format. You can pull that, then parse the JSON object to print the URLs:
from bs4 import BeautifulSoup
import requests
import json

url = "https://www.lazada.com.my/oldtown-white-coffee/?langFlag=en&q=All-Products&from=wangpu&pageTypeId=2"
page = requests.get(url)
soup = BeautifulSoup(page.text, 'html.parser')

# Find the <script> tag that assigns window.pageData and pull out the JSON
scripts = soup.find_all('script')
jsonObj = None
for script in scripts:
    if 'window.pageData=' in script.text:
        jsonStr = script.text.split("window.pageData=")[1]
        jsonObj = json.loads(jsonStr)

products = jsonObj['mods']['listItems']
for item in products:
    print(item['productUrl'])
Output:
//www.lazada.com.my/products/bundle-of-6-oldtown-white-coffee-3-in-1-classic-15s-x6-packs-free-oldtown-4-sticks-random-flavor-i434054933-s631848411.html?mp=1
//www.lazada.com.my/products/oldtown-white-coffee-3-in-1-hazelnut-15s-x6-packs-i436317818-s639072921.html?mp=1
//www.lazada.com.my/products/oldtown-white-coffee-3-in-1-classic-15s-x4-packs-i436315786-s639051449.html?mp=1
//www.lazada.com.my/products/bundle-of-6-oldtown-white-coffee-3-in-1-extra-rich-15s-x6-packs-free-oldtown-brown-mug-x1-4-sticks-random-flavor-i434056795-s631848533.html?mp=1
//www.lazada.com.my/products/bundle-of-6-oldtown-white-coffee-2-in-1-coffee-creamer-15s-x6-packs-free-oldtown-brown-mug-x1-4-sticks-random-flavor-i434061102-s631849047.html?mp=1
//www.lazada.com.my/products/oldtown-white-coffee-3-in-1-hazelnut-15s-x2-packs-free-2-sticks-random-flavor-i434043717-s631808896.html?mp=1
//www.lazada.com.my/products/oldtown-white-coffee-3-in-1-hazelnut-15s-x4-packs-i436302855-s639061262.html?mp=1
//www.lazada.com.my/products/bundle-of-6-oldtown-white-coffee-3-in-1-less-sugar-15s-x6-packs-free-oldtown-brown-mug-x1-4-sticks-random-flavor-i434054922-s631844581.html?mp=1
//www.lazada.com.my/products/oldtown-white-coffee-3-in-1-extra-rich-15s-x4-packs-i436302850-s639053701.html?mp=1
//www.lazada.com.my/products/oldtown-white-coffee-3-in-1-less-sugar-15s-i428864462-s623547178.html?mp=1
//www.lazada.com.my/products/oldtown-white-coffee-3-in-1-extra-rich-15s-x2-packs-free-2-sticks-random-flavor-i434051011-s631820678.html?mp=1
//www.lazada.com.my/products/oldtown-white-coffee-3-in-1-classic-15s-i429053474-s623935676.html?mp=1
//www.lazada.com.my/products/oldtown-3-in-1-white-milk-tea-13s-i429057450-s623942383.html?mp=1
//www.lazada.com.my/products/oldtown-white-coffee-3-in-1-natural-cane-sugar-15s-x6-packs-i436315838-s639079875.html?mp=1
//www.lazada.com.my/products/oldtown-white-coffee-3-in-1-mocha-15s-i429056707-s623944523.html?mp=1
//www.lazada.com.my/products/oldtown-white-coffee-3-in-1-hazelnut-15s-i446475373-s668555737.html?mp=1
//www.lazada.com.my/products/oldtown-white-coffee-3-in-1-less-sugar-15s-x2-packs-free-2-sticks-random-flavor-i434038995-s631814134.html?mp=1
//www.lazada.com.my/products/bundle-of-6-oldtown-white-coffee-3-in-1-hazelnut-15s-x6-packs-free-oldtown-4-sticks-random-flavor-i434059429-s631843615.html?mp=1
//www.lazada.com.my/products/oldtown-3-in-1-white-milk-tea-13s-x2-packs-free-2-sticks-random-flavor-i434048248-s631818679.html?mp=1
//www.lazada.com.my/products/oldtown-white-coffee-2-in-1-coffee-creamer-15s-x2-packs-free-2-sticks-random-flavor-i434045440-s631815896.html?mp=1
//www.lazada.com.my/products/oldtown-white-coffee-3-in-1-natural-cane-sugar-15s-x4-packs-i436347739-s639057542.html?mp=1
//www.lazada.com.my/products/oldtown-black-series-enrich-freeze-dried-instant-coffee-100g-i434617928-s633601781.html?mp=1
//www.lazada.com.my/products/oldtown-white-coffee-3-in-1-less-sugar-15s-x4-packs-i436353732-s639059385.html?mp=1
//www.lazada.com.my/products/oldtown-white-coffee-2-in-1-coffee-creamer-15s-i429056294-s623942054.html?mp=1
//www.lazada.com.my/products/bundle-of-4-oldtown-white-coffee-2-in-1-coffee-creamer-15s-x4-packs-free-4-sticks-random-flavor-i434053005-s631824941.html?mp=1
//www.lazada.com.my/products/oldtown-white-coffee-3-in-1-classic-15s-x2-packs-free-2-sticks-random-flavor-i434041889-s631816196.html?mp=1
//www.lazada.com.my/products/oldtown-white-coffee-3-in-1-extra-rich-15s-i429055995-s623946380.html?mp=1
//www.lazada.com.my/products/oldtown-white-coffee-3-in-1-less-sugar-15s-x6-packs-i436317821-s639075740.html?mp=1
//www.lazada.com.my/products/oldtown-black-series-enrich-freeze-dried-instant-coffee-100g-2-glass-i435467037-s636334815.html?mp=1
//www.lazada.com.my/products/oldtown-white-coffee-3-in-1-natural-cane-sugar-15s-x2-packs-free-2-sticks-random-flavor-i434043759-s631818279.html?mp=1
//www.lazada.com.my/products/oldtown-white-coffee-2-in-1-coffee-creamer-15s-x2-packs-i436315754-s639016791.html?mp=1
//www.lazada.com.my/products/oldtown-white-coffee-3-in-1-classic-15s-x6-packs-i436353749-s639073757.html?mp=1
//www.lazada.com.my/products/bundle-of-6-oldtown-3-in-1-white-milk-tea-13s-x6-packs-free-oldtown-4-sticks-random-flavor-i434058314-s631843262.html?mp=1
//www.lazada.com.my/products/bundle-of-4-oldtown-white-coffee-3-in-1-hazelnut-15s-x4-packs-free-4-sticks-random-flavor-i446532232-s668570216.html?mp=1
//www.lazada.com.my/products/oldtown-3-in-1-white-milk-tea-13s-x6-packs-i436303866-s639071227.html?mp=1
//www.lazada.com.my/products/oldtown-black-series-enrich-freeze-dried-instant-coffee-100g-2-glass-foc-glass-mug-i442451196-s654637474.html?mp=1
//www.lazada.com.my/products/bundle-of-4-oldtown-3-in-1-white-milk-tea-13s-x4-packs-free-4-sticks-random-flavor-i434046913-s631825874.html?mp=1
//www.lazada.com.my/products/bundle-of-4-oldtown-white-coffee-3-in-1-classic-15s-x4-packs-free-4-sticks-random-flavor-i434052193-s631832149.html?mp=1
//www.lazada.com.my/products/oldtown-3-in-1-white-milk-tea-13s-x2-packs-i436282923-s639018195.html?mp=1
//www.lazada.com.my/products/bundle-of-6-oldtown-white-coffee-3-in-1-natural-cane-sugar-15s-x6-packs-free-oldtown-4-sticks-random-flavor-i434062035-s631849174.html?mp=1
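Note that the URLs in window.pageData are protocol-relative (they start with //). If you want fully qualified links, prefix them with the scheme, e.g.:

full_urls = ['https:' + item['productUrl'] for item in products]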
I need to scrape a website to obtain some information, such as film titles and the corresponding links. My code runs without errors but it only returns the first entry on the page. This is my code; thank you in advance for your help, and sorry if this is not a smart question, but I'm a novice.
import requests
from bs4 import BeautifulSoup

URL = 'http://www.simplyscripts.com/genre/horror-scripts.html'

def scarica_pagina(URL):
    page = requests.get(URL)
    html = page.text
    soup = BeautifulSoup(html, 'lxml')
    films = soup.find_all("div", {"id": "movie_wide"})
    for film in films:
        link = film.find('p').find("a").attrs['href']
        title = film.find('p').find("a").text.strip('>')
        print(link)
        print(title)
Try the approach below. I've slightly modified your script to serve the purpose and make it look cleaner. The reason your version stops after the first film is that there is only one div with id="movie_wide"; every film sits in its own <p> inside that container, so you need to loop over the <p> tags rather than over the (single) container. Let me know if you encounter any further issues:
import requests
from bs4 import BeautifulSoup

URL = 'http://www.simplyscripts.com/genre/horror-scripts.html'

def scarica_pagina(link):
    page = requests.get(link)
    soup = BeautifulSoup(page.text, 'lxml')
    for film in soup.find(id="movie_wide").find_all("p"):
        link = film.find("a")['href']
        title = film.find("a").text
        print(link, title)

if __name__ == '__main__':
    scarica_pagina(URL)
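If you would rather collect the results than print them (a small variation on the answer above, not part of the original), you can return a list of (title, link) pairs:

def scarica_pagina(link):
    page = requests.get(link)
    soup = BeautifulSoup(page.text, 'lxml')
    films = []
    for p in soup.find(id="movie_wide").find_all("p"):
        a = p.find("a")
        # guard against paragraphs without a usable link
        if a is not None and a.get('href'):
            films.append((a.text.strip(), a['href']))
    return films

films = scarica_pagina(URL)
print(len(films))
print(films[:5])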
I am still new to Python, and especially to BeautifulSoup. I've been reading up on this stuff for a few days and playing around with a bunch of different snippets, getting mixed results. However, this page has the Bitcoin price I would like to scrape. The price is located in:
<span class="text-large2" data-currency-value="">$16,569.40</span>
Meaning that I'd like my script to print only the line where that value is. My current code prints the whole page, and it doesn't look very nice since it prints a lot of data. Could anybody please help me improve my code?
import requests
from BeautifulSoup import BeautifulSoup
url = 'https://coinmarketcap.com/currencies/bitcoin/'
response = requests.get(url)
html = response.content
soup = BeautifulSoup(html)
div = soup.find('text-large2', attrs={'class': 'stripe'})
for row in soup.findAll('div'):
    for cell in row.findAll('tr'):
        print cell.text
And this is a snippet of the output I get after running the code. It doesn't look very nice or readable.
#SourcePairVolume (24h)PriceVolume (%)Updated
1BitMEXBTC/USD$3,280,130,000$15930.0016.30%Recently
2BithumbBTC/KRW$2,200,380,000$17477.6010.94%Recently
3BitfinexBTC/USD$1,893,760,000$15677.009.41%Recently
4GDAXBTC/USD$1,057,230,000$16085.005.25%Recently
5bitFlyerBTC/JPY$636,896,000$17184.403.17%Recently
6CoinoneBTC/KRW$554,063,000$17803.502.75%Recently
7BitstampBTC/USD$385,450,000$15400.101.92%Recently
8GeminiBTC/USD$345,746,000$16151.001.72%Recently
9HitBTCBCH/BTC$305,554,000$15601.901.52%Recently
Try this:
import requests
from BeautifulSoup import BeautifulSoup
url = 'https://coinmarketcap.com/currencies/bitcoin/'
response = requests.get(url)
html = response.content
soup = BeautifulSoup(html)
div = soup.find("div", {"class": "col-xs-6 col-sm-8 col-md-4 text-left"}).find("span", {"class": "text-large2"})
for i in div:
    print i
This prints 16051.20 for me.
Later edit: if you put the above code in a function and loop it, it will update constantly. I get different values now.
This works, but I think you are using an older version of BeautifulSoup. Try pip install bs4 in Command Prompt or PowerShell, then:
import requests
from bs4 import BeautifulSoup
url = 'https://coinmarketcap.com/currencies/bitcoin/'
response = requests.get(url)
html = response.text
soup = BeautifulSoup(html, 'html.parser')
value = soup.find('span', {'class': 'text-large2'})
print(''.join(value.stripped_strings))
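Class names like text-large2 tend to change when the site is redesigned. Since the snippet in the question also shows a data-currency-value attribute on the same span, you could (as an alternative sketch, assuming that attribute is still present in the markup) select on the attribute instead:

# Match on the presence of the data-currency-value attribute rather than the class
value = soup.find('span', attrs={'data-currency-value': True})
if value is not None:
    print(value.get_text(strip=True))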
I am trying to extract the image URL and image caption from an article using BeautifulSoup. I can separate the article's image URL and image caption from the preceding and following HTML, but I cannot figure out how to pull these two values out of their HTML tags. Here is my code:
from bs4 import BeautifulSoup
import requests
url = 'http://www.prnewswire.com/news-releases/dutch-philosopher-koert-van-mensvoort-founder-of-the-next-nature-network-writes-a-letter-to-humanity-619925063.html'
r = requests.get(url)
html = r.text
soup = BeautifulSoup(html, 'lxml')
links = soup.find_all('div', {'class': 'image'})
The two attributes I am trying to extract are src= and title=. Any ideas on how to accomplish these two parses would be appreciated.
from bs4 import BeautifulSoup
import requests
url = 'http://www.prnewswire.com/news-releases/dutch-philosopher-koert-van-mensvoort-founder-of-the-next-nature-network-writes-a-letter-to-humanity-619925063.html'
r = requests.get(url)
html = r.text
soup = BeautifulSoup(html, 'lxml')
links = soup.find_all('div', {'class': 'image'})
print([i.find('img')['src'] for i in links])
print([i.find('img')['title'] for i in links])
Try the following to extract all the image tags. Depending on how many images are on the page, you will probably need to loop over them:
for img in soup.findAll('img'):
    src = img.get('src')
    title = img.get('title')
    print(src, title)
Late answer, but you can use:
from bs4 import BeautifulSoup
import requests
url = 'http://www.prnewswire.com/news-releases/dutch-philosopher-koert-van-mensvoort-founder-of-the-next-nature-network-writes-a-letter-to-humanity-619925063.html'
r = requests.get(url)
html = r.text
soup = BeautifulSoup(html, "html5lib")
links = soup.find_all('div', {'class': 'image'})
if links:
    print(links[0].find('img')['src'])
    print(links[0].find('img')['title'])
Output:
http://mma.prnewswire.com/media/491859/Koert_van_Mensvoort.jpg?w=950
Dutch philosopher Koert van Mensvoort – founder of the Next Nature
Network and Fellow of ‘Next Nature’ at the University of Technology in
Eindhoven – has written a ‘Letter to Humanity’ in support of
International Earth Day. (PRNewsfoto/Next Nature Network)
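If you want every image in the article rather than just the first one, you can collect the src/title pairs from all the matched divs (a small variation on the answers above, using the same links variable):

images = [
    {'src': img.get('src'), 'title': img.get('title')}
    for div in links
    for img in div.find_all('img')
]
print(images)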