Web scraping with Python: how to get the text - python

I'm trying to get the text from a website but can't find a way to do it. How should I write it?
import requests
from bs4 import BeautifulSoup

link = "https://www.ynet.co.il/articles/0,7340,L-5553905,00.html"
response = requests.get(link)
soup = BeautifulSoup(response.text, 'html.parser')
info = soup.find('div', attrs={'class': 'text14'})
name = info.text.strip()
print(name)
This is how it looks: I'm getting None every time.

import requests
from bs4 import BeautifulSoup
import json
link="https://www.ynet.co.il/articles/0,7340,L-5553905,00.html"
response = requests.get(link)
soup = BeautifulSoup(response.text,'html.parser')
info = soup.findAll('script',attrs={'type':"application/ld+json"})[0].text.strip()
jsonDict = json.loads(info)
print(jsonDict['articleBody'])
The page seems to store all the article data as JSON inside a <script> tag, so the code above pulls it from there.

The solution is:
info = soup.find('meta', attrs={'property':'og:description'})
It gave me the text I needed.
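As a minimal sketch of that approach (my addition, assuming the og:description meta tag is present; the text of a <meta> tag lives in its content attribute, not in .text):

import requests
from bs4 import BeautifulSoup

link = "https://www.ynet.co.il/articles/0,7340,L-5553905,00.html"
response = requests.get(link)
soup = BeautifulSoup(response.text, 'html.parser')

# og:description is a <meta> tag, so the description is stored in its 'content' attribute
info = soup.find('meta', attrs={'property': 'og:description'})
if info is not None:
    print(info['content'])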

Related

How to scrape headline news, link and image?

I'd like to scrape the news headlines, the link of each news item, and its picture.
I tried web scraping as shown below, but it only covers the headlines and it doesn't work.
import requests
import pandas as pd
from bs4 import BeautifulSoup

nbc_business = "https://news.mongabay.com/list/environment"
res = requests.get(nbc_business, verify=False)
soup = BeautifulSoup(res.content, 'html.parser')
headlines = soup.find_all('h2', {'class': 'post-title-news'})
len(headlines)
for i in range(len(headlines)):
    print(headlines[i].text)
Please advise me on how to fix this.
This is because the site blocks bots. If you print res.content you will see a 403 error.
Add headers={'User-Agent': 'Mozilla/5.0'} to the request.
Try the code below:
nbc_business = "https://news.mongabay.com/list/environment"
res = requests.get(nbc_business, verify=False, headers={'User-Agent': 'Mozilla/5.0'})
soup = BeautifulSoup(res.content, 'html.parser')
headlines = soup.find_all('h2', class_='post-title-news')
print(len(headlines))
for i in range(len(headlines)):
    print(headlines[i].text)
First things first: never post code as an image.
The <h2> in your HTML has no text of its own. What it does have is an <a> element, so:
for hl in headlines:
    link = hl.findChild()
    text = link.text
    url = link.attrs['href']
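Putting the two answers together, a minimal sketch (assuming the class name post-title-news and the User-Agent header from the answers above are still valid):

import requests
from bs4 import BeautifulSoup

url = "https://news.mongabay.com/list/environment"
res = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'})
soup = BeautifulSoup(res.content, 'html.parser')

for hl in soup.find_all('h2', class_='post-title-news'):
    link = hl.find('a')  # the headline text and URL live in the child <a>
    if link is not None:
        print(link.text.strip(), link.get('href'))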

Soup works on one IMDB page but not on another. How to solve?

import requests
from bs4 import BeautifulSoup

headers = {'User-Agent': 'Mozilla/5.0'}  # assumed; the original snippet does not show the headers dict

url1 = "https://www.imdb.com/user/ur34087578/watchlist"
url = "https://www.imdb.com/search/title/?groups=top_1000&ref_=adv_prv"
results1 = requests.get(url1, headers=headers)
results = requests.get(url, headers=headers)
soup1 = BeautifulSoup(results1.text, "html.parser")
soup = BeautifulSoup(results.text, "html.parser")

# using a unique tag for each movie in the respective link
movie_div1 = soup1.find_all('div', class_='lister-item-content')
movie_div = soup.find_all('div', class_='lister-item mode-advanced')

print(movie_div1)  # empty list
print(movie_div)   # gives the expected list
Why is movie_div1 giving an empty list? I am not able to identify any difference in the URL structures to indicate the code should be different. All leads appreciated.
Unfortunately the div you want is rendered by JavaScript, so you can't get it by scraping the raw HTML response.
You can instead fetch the JSON request your browser makes, which means you won't need to scrape the page with BeautifulSoup at all, making your script much faster.
The second option is to use Selenium.
Good luck.
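For the Selenium option, a minimal sketch (my addition, assuming a local chromedriver is installed; the class name is the one from the question) might look like:

from bs4 import BeautifulSoup
from selenium import webdriver

url1 = "https://www.imdb.com/user/ur34087578/watchlist"

driver = webdriver.Chrome()  # assumes chromedriver is on your PATH
driver.get(url1)             # the browser executes the page's JavaScript
soup1 = BeautifulSoup(driver.page_source, "html.parser")
driver.quit()

# after rendering, the watchlist items exist in the DOM
movie_div1 = soup1.find_all('div', class_='lister-item-content')
print(len(movie_div1))

You may need to wait briefly (for example with WebDriverWait) before reading page_source if the list loads slowly.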
As @SakuraFreak mentioned, you could parse the JSON received. However, this JSON response is embedded within the HTML itself and is later converted to HTML by the browser's JavaScript (this is what you see as <div class="lister-item-content">...</div>).
For example, this is how you would extract the JSON content from the HTML to display movie/show names from the watchlist:
import requests
from bs4 import BeautifulSoup
import json

url = "https://www.imdb.com/user/ur34087578/watchlist"
response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")

details = str(soup.find('span', class_='ab_widget'))

json_initial = "IMDbReactInitialState.push("
json_leftover = ");\n"
json_start = details.find(json_initial) + len(json_initial)
details = details[json_start:]
json_end = details.find(json_leftover)
json_data = json.loads(details[:json_end])

imdb_titles = json_data["titles"]
for item in imdb_titles.values():
    print(item["primary"]["title"])

Scraping links from href on Sephora website

Hi, I am trying to scrape the links for all the products on a specific Sephora page. My code only gives me the first 12 links while there are 48 products on the page. I think this is because Sephora is a user-interactive website (please correct me if I am wrong), so it doesn't load the rest. But I do not know how to get the rest. Please send some help! Thank you!
Here is my code:
from bs4 import BeautifulSoup
import requests

url = "https://www.sephora.com/brand/estee-lauder/skincare"
r = requests.get(url)
data = r.text
soup = BeautifulSoup(data, 'html.parser')

link_list = []
keyword = 'product'
for link in soup.findAll('a'):
    href = link.get('href')
    if keyword in href:
        link_list.append('https://www.sephora.com/' + href)
    else:
        continue
If you take a look at the page source, you will see the data stored as a JSON object. You can get the JSON object like this:
from bs4 import BeautifulSoup
import requests
import json
url = "https://www.sephora.com/brand/estee-lauder/skincare"
r = requests.get(url)
data = r.text
soup = BeautifulSoup(data,'html.parser')
data = json.loads(soup.find('script', id='linkJSON').text)
products = data[3]['props']['products']
prefix = "https://www.sephora.com"
url_links = [prefix+p["targetUrl"] for p in products]
print(url_links)
By investigating the JSON data, you can find where the links are stored. To view the JSON data more clearly, I use this website: https://codebeautify.org/jsonviewer
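If you prefer to inspect the JSON locally instead of pasting it into a website, a small sketch (my addition, reusing the same linkJSON script tag) is to pretty-print it with json.dumps:

from bs4 import BeautifulSoup
import requests
import json

url = "https://www.sephora.com/brand/estee-lauder/skincare"
soup = BeautifulSoup(requests.get(url).text, 'html.parser')
data = json.loads(soup.find('script', id='linkJSON').text)

# pretty-print so the nesting down to props -> products is easy to see
print(json.dumps(data[3]['props'], indent=2)[:2000])  # only the first 2000 characters; the blob is large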

Web Scraping - URL extraction from Lazada ecommerce platform

I am currently trying to scrape the product URLs from the Lazada e-commerce platform; however, I am getting random links from the website rather than the product links.
https://www.lazada.com.my/oldtown-white-coffee/?langFlag=en&q=All-Products&from=wangpu&pageTypeId=2
My code below:
from bs4 import BeautifulSoup, SoupStrainer
import requests

url = "https://www.lazada.com.my/oldtown-white-coffee/?langFlag=en&q=All-Products&from=wangpu&pageTypeId=2"
page = requests.get(url)
data = page.text
soup = BeautifulSoup(data, 'html.parser')

links = soup.find_all('div', {'class': 'c16H9d'})
for link in soup.find_all("a"):
    print(link.get("href"))
The result I am getting from this code (which is not what I want):
This is the section of the links that I need; I want to list all the product URLs from the products page.
I hope you can help me with this. I know it should be simple, it just doesn't seem to work; I have been looking at it since yesterday.
The page is dynamic. Within the HTML source is a script that holds the products in JSON format. You can pull that out, then parse the JSON object to print the URLs:
from bs4 import BeautifulSoup, SoupStrainer
import requests
import json

url = "https://www.lazada.com.my/oldtown-white-coffee/?langFlag=en&q=All-Products&from=wangpu&pageTypeId=2"
page = requests.get(url)
data = page.text
soup = BeautifulSoup(data, 'html.parser')

scripts = soup.find_all('script')
jsonObj = None
for script in scripts:
    if 'window.pageData=' in script.text:
        jsonStr = script.text
        jsonStr = jsonStr.split("window.pageData=")[1]
        jsonObj = json.loads(jsonStr)

products = jsonObj['mods']['listItems']
for item in products:
    print(item['productUrl'])
Output:
//www.lazada.com.my/products/bundle-of-6-oldtown-white-coffee-3-in-1-classic-15s-x6-packs-free-oldtown-4-sticks-random-flavor-i434054933-s631848411.html?mp=1
//www.lazada.com.my/products/oldtown-white-coffee-3-in-1-hazelnut-15s-x6-packs-i436317818-s639072921.html?mp=1
//www.lazada.com.my/products/oldtown-white-coffee-3-in-1-classic-15s-x4-packs-i436315786-s639051449.html?mp=1
//www.lazada.com.my/products/bundle-of-6-oldtown-white-coffee-3-in-1-extra-rich-15s-x6-packs-free-oldtown-brown-mug-x1-4-sticks-random-flavor-i434056795-s631848533.html?mp=1
//www.lazada.com.my/products/bundle-of-6-oldtown-white-coffee-2-in-1-coffee-creamer-15s-x6-packs-free-oldtown-brown-mug-x1-4-sticks-random-flavor-i434061102-s631849047.html?mp=1
//www.lazada.com.my/products/oldtown-white-coffee-3-in-1-hazelnut-15s-x2-packs-free-2-sticks-random-flavor-i434043717-s631808896.html?mp=1
//www.lazada.com.my/products/oldtown-white-coffee-3-in-1-hazelnut-15s-x4-packs-i436302855-s639061262.html?mp=1
//www.lazada.com.my/products/bundle-of-6-oldtown-white-coffee-3-in-1-less-sugar-15s-x6-packs-free-oldtown-brown-mug-x1-4-sticks-random-flavor-i434054922-s631844581.html?mp=1
//www.lazada.com.my/products/oldtown-white-coffee-3-in-1-extra-rich-15s-x4-packs-i436302850-s639053701.html?mp=1
//www.lazada.com.my/products/oldtown-white-coffee-3-in-1-less-sugar-15s-i428864462-s623547178.html?mp=1
//www.lazada.com.my/products/oldtown-white-coffee-3-in-1-extra-rich-15s-x2-packs-free-2-sticks-random-flavor-i434051011-s631820678.html?mp=1
//www.lazada.com.my/products/oldtown-white-coffee-3-in-1-classic-15s-i429053474-s623935676.html?mp=1
//www.lazada.com.my/products/oldtown-3-in-1-white-milk-tea-13s-i429057450-s623942383.html?mp=1
//www.lazada.com.my/products/oldtown-white-coffee-3-in-1-natural-cane-sugar-15s-x6-packs-i436315838-s639079875.html?mp=1
//www.lazada.com.my/products/oldtown-white-coffee-3-in-1-mocha-15s-i429056707-s623944523.html?mp=1
//www.lazada.com.my/products/oldtown-white-coffee-3-in-1-hazelnut-15s-i446475373-s668555737.html?mp=1
//www.lazada.com.my/products/oldtown-white-coffee-3-in-1-less-sugar-15s-x2-packs-free-2-sticks-random-flavor-i434038995-s631814134.html?mp=1
//www.lazada.com.my/products/bundle-of-6-oldtown-white-coffee-3-in-1-hazelnut-15s-x6-packs-free-oldtown-4-sticks-random-flavor-i434059429-s631843615.html?mp=1
//www.lazada.com.my/products/oldtown-3-in-1-white-milk-tea-13s-x2-packs-free-2-sticks-random-flavor-i434048248-s631818679.html?mp=1
//www.lazada.com.my/products/oldtown-white-coffee-2-in-1-coffee-creamer-15s-x2-packs-free-2-sticks-random-flavor-i434045440-s631815896.html?mp=1
//www.lazada.com.my/products/oldtown-white-coffee-3-in-1-natural-cane-sugar-15s-x4-packs-i436347739-s639057542.html?mp=1
//www.lazada.com.my/products/oldtown-black-series-enrich-freeze-dried-instant-coffee-100g-i434617928-s633601781.html?mp=1
//www.lazada.com.my/products/oldtown-white-coffee-3-in-1-less-sugar-15s-x4-packs-i436353732-s639059385.html?mp=1
//www.lazada.com.my/products/oldtown-white-coffee-2-in-1-coffee-creamer-15s-i429056294-s623942054.html?mp=1
//www.lazada.com.my/products/bundle-of-4-oldtown-white-coffee-2-in-1-coffee-creamer-15s-x4-packs-free-4-sticks-random-flavor-i434053005-s631824941.html?mp=1
//www.lazada.com.my/products/oldtown-white-coffee-3-in-1-classic-15s-x2-packs-free-2-sticks-random-flavor-i434041889-s631816196.html?mp=1
//www.lazada.com.my/products/oldtown-white-coffee-3-in-1-extra-rich-15s-i429055995-s623946380.html?mp=1
//www.lazada.com.my/products/oldtown-white-coffee-3-in-1-less-sugar-15s-x6-packs-i436317821-s639075740.html?mp=1
//www.lazada.com.my/products/oldtown-black-series-enrich-freeze-dried-instant-coffee-100g-2-glass-i435467037-s636334815.html?mp=1
//www.lazada.com.my/products/oldtown-white-coffee-3-in-1-natural-cane-sugar-15s-x2-packs-free-2-sticks-random-flavor-i434043759-s631818279.html?mp=1
//www.lazada.com.my/products/oldtown-white-coffee-2-in-1-coffee-creamer-15s-x2-packs-i436315754-s639016791.html?mp=1
//www.lazada.com.my/products/oldtown-white-coffee-3-in-1-classic-15s-x6-packs-i436353749-s639073757.html?mp=1
//www.lazada.com.my/products/bundle-of-6-oldtown-3-in-1-white-milk-tea-13s-x6-packs-free-oldtown-4-sticks-random-flavor-i434058314-s631843262.html?mp=1
//www.lazada.com.my/products/bundle-of-4-oldtown-white-coffee-3-in-1-hazelnut-15s-x4-packs-free-4-sticks-random-flavor-i446532232-s668570216.html?mp=1
//www.lazada.com.my/products/oldtown-3-in-1-white-milk-tea-13s-x6-packs-i436303866-s639071227.html?mp=1
//www.lazada.com.my/products/oldtown-black-series-enrich-freeze-dried-instant-coffee-100g-2-glass-foc-glass-mug-i442451196-s654637474.html?mp=1
//www.lazada.com.my/products/bundle-of-4-oldtown-3-in-1-white-milk-tea-13s-x4-packs-free-4-sticks-random-flavor-i434046913-s631825874.html?mp=1
//www.lazada.com.my/products/bundle-of-4-oldtown-white-coffee-3-in-1-classic-15s-x4-packs-free-4-sticks-random-flavor-i434052193-s631832149.html?mp=1
//www.lazada.com.my/products/oldtown-3-in-1-white-milk-tea-13s-x2-packs-i436282923-s639018195.html?mp=1
//www.lazada.com.my/products/bundle-of-6-oldtown-white-coffee-3-in-1-natural-cane-sugar-15s-x6-packs-free-oldtown-4-sticks-random-flavor-i434062035-s631849174.html?mp=1
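Note that these URLs are protocol-relative (they start with //). If you need absolute links, prepend the scheme inside the loop above, for example:

for item in products:
    print('https:' + item['productUrl'])  # prepend the scheme to the protocol-relative URL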

How to crawl the description for sfglobe using python

I am trying to use Python and BeautifulSoup to get this page from the sfglobe website: http://sfglobe.com/2015/04/28/stirring-pictures-from-the-riots-in-baltimore.
This is the code:
import urllib2
from bs4 import BeautifulSoup
url = 'http://sfglobe.com/2015/04/28/stirring-pictures-from-the-riots-in-baltimore'
req = urllib2.urlopen(url)
html = req.read()
soup = BeautifulSoup(html)
desc = soup.find('span', class_='articletext intro')
Could anyone help me to solve this problem?
From the question title, I am assuming that the only thing you want is the description of the article, which can be found in a <meta> tag within the HTML <head>.
You were on the right track, but I'm not exactly sure why you did:
desc = soup.find('span', class_='articletext intro')
Regardless, I came up with something using requests (see http://stackoverflow.com/questions/2018026/should-i-use-urllib-or-urllib2-or-requests) rather than urllib2:
import requests
from bs4 import BeautifulSoup

url = 'http://sfglobe.com/2015/04/28/stirring-pictures-from-the-riots-in-baltimore'
req = requests.get(url)
html = req.text
soup = BeautifulSoup(html, 'html.parser')

tag = soup.find('meta', attrs={'name': 'description'})  # find the meta description tag
desc = tag['content']  # standard <meta> tags store their text in the 'content' attribute
print(desc)
If that isn't what you are looking for, please clarify so I can try and help you more.
EDIT: after some clarification, I pieced together why you were originally using desc = soup.find('span', class_='articletext intro').
Maybe this is what you are looking for:
import requests
from bs4 import BeautifulSoup, NavigableString

url = 'http://sfglobe.com/2015/04/28/stirring-pictures-from-the-riots-in-baltimore'
req = requests.get(url)
html = req.text
soup = BeautifulSoup(html, 'html.parser')

body = soup.find('span', class_='articletext intro')
# remove script tags
[s.extract() for s in body('script')]

text = ""
# iterate through non-script elements in the content body
for stuff in body.select('*'):
    # get contents of tags; .contents returns a list
    content = stuff.contents
    # check that the list has text content (isn't empty) AND is a NavigableString, not a tag
    if len(content) == 1 and isinstance(content[0], NavigableString):
        text += content[0]
print(text)
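If you only need the visible text and not the per-element handling, a shorter alternative (my suggestion, not from the original answer) is BeautifulSoup's get_text():

import requests
from bs4 import BeautifulSoup

url = 'http://sfglobe.com/2015/04/28/stirring-pictures-from-the-riots-in-baltimore'
soup = BeautifulSoup(requests.get(url).text, 'html.parser')

body = soup.find('span', class_='articletext intro')
[s.extract() for s in body('script')]  # still strip <script> tags first

print(body.get_text(separator=' ', strip=True))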
