How to webscrape a text inside a link in Python? - python

I would like to webscrape the following page: https://www.ecb.europa.eu/press/inter/date/2021/html/index_include.en.html
In particular, I would like to get the text inside every link you see displayed clicking on the link above. I am able to do it only by clickling on the link. For example, clicking on the first one:
import pandas as pd
from bs4 import BeautifulSoup
import requests
x = "https://www.ecb.europa.eu/press/inter/date/2021/html/ecb.in211222~5f9a709924.en.html"
x1=[requests.get(x)]
x2 = [BeautifulSoup(x1[0].text)]
x3 = [x2[0].select("p+ p") for i in range(len(x2)-1)]
The problem is that I am not able to automate the process that leads me from the url with the list of links containing text (https://www.ecb.europa.eu/press/inter/date/2021/html/index_include.en.html) to the actual link where the text I need is stored (e.g. https://www.ecb.europa.eu/press/inter/date/2021/html/ecb.in211222~5f9a709924.en.html)
Can anyone help me?
Thanks!

To get a list of all links on https://www.ecb.europa.eu/press/inter/date/2021/html/index_include.en.html:
from bs4 import BeautifulSoup
import requests
r = requests.get('https://www.ecb.europa.eu/press/inter/date/2021/html/index_include.en.html')
soup = BeautifulSoup(r.text, 'html.parser')
links = [link.get('href') for link in soup.find_all('a')]

Wouter's answer is correct for getting all links, but if you need just the the title links, you could try a more specific selector query like select('div.title > a'). Here's an example:
from bs4 import BeautifulSoup
import requests
url = "https://www.ecb.europa.eu/press/inter/date/2021/html/index_include.en.html"
html = BeautifulSoup(requests.get(url).text, 'html.parser')
links = html.select('div.title > a')
for link in links:
print(link.attrs['href'])

In particular, I would like to get the text inside every link you see displayed clicking on the link above.
To get the text of every linked article you have to iterate over your list of links and request each of them:
for link in soup.select('div.title > a'):
soup = BeautifulSoup(requests.get(f"https://www.ecb.europa.eu{link['href']}").content)
data.append({
'title':link.text,
'url': url,
'subtitle':soup.main.h2.text,
'text':' '.join([p.text for p in soup.select('main .section p:not([class])')])
})
Example
Contents are stored in a list of dicts, so you can easily access and process the data later.
from bs4 import BeautifulSoup
import requests
url = "https://www.ecb.europa.eu/press/inter/date/2021/html/index_include.en.html"
soup = BeautifulSoup(requests.get(url).content)
data = []
for link in soup.select('div.title > a'):
soup = BeautifulSoup(requests.get(f"https://www.ecb.europa.eu{link['href']}").content)
data.append({
'title':link.text,
'url': url,
'subtitle':soup.main.h2.text,
'text':' '.join([p.text for p in soup.select('main .section p:not([class])')])
})
print(data)

Related

BeautifulSoup - Why am I getting an empty list as a result?

Why does my code result in an empty list? It's as if the page is too big and it doesn't parse it all... could it be the case?
from bs4 import BeautifulSoup
source = requests.get('https://www.youtube.com/nitroparkour')
soup = BeautifulSoup(source.text, 'lxml')
doc = soup.findAll("a", id="video-title")
print(doc)
If you right-click on the page then "view page source" you will find the html of the website, try and search for any of the titles of the videos ex. "Super Mario Maker", you will find them stored on a JSON inside script tag in the HTML.
then why do you see the videos inside a tag with id="video-title" in the page when you "inspect element" using the "dev-tools"?
that's because youtube uses javascript to render the site.
here is how to capture that JSON, you will need to explore it and figure which data you need.
import requests, json, re
from bs4 import BeautifulSoup
source = requests.get('https://www.youtube.com/nitroparkour')
soup = BeautifulSoup(source.text, 'lxml')
unparsed_js = soup.find(string=re.compile('var ytInitialData ='))
js = json.loads(unparsed_js.replace('var ytInitialData = ', '').rstrip(';'))

Can't scrape <h3> tag from page

Seems like i can scrape any tag and class, except h3 on this page. It keeps returning None or an empty list. I'm trying to get this h3 tag:
...on the following webpage:
https://www.empireonline.com/movies/features/best-movies-2/
And this is the code I use:
from bs4 import BeautifulSoup
import requests
from pprint import pprint
from bs4 import BeautifulSoup
URL = "https://www.empireonline.com/movies/features/best-movies-2/"
response = requests.get(URL)
web_html = response.text
soup = BeautifulSoup(web_html, "html.parser")
movies = soup.findAll(name = "h3" , class_ = "jsx-4245974604")
movies_text=[]
for item in movies:
result = item.getText()
movies_text.append(result)
print(movies_text)
Can you please help with the solution for this problem?
As other people mentioned this is dynamic content, which needs to be generated first when opening/running the webpage. Therefore you can't find the class "jsx-4245974604" with BS4.
If you print out your "soup" variable you actually can see that you won't find it. But if simply you want to get the names of the movies you can just use another part of the html in this case.
The movie name is in the alt tag of the picture (and actually also in many other parts of the html).
import requests
from pprint import pprint
from bs4 import BeautifulSoup
URL = "https://www.empireonline.com/movies/features/best-movies-2/"
response = requests.get(URL)
web_html = response.text
soup = BeautifulSoup(web_html, "html.parser")
movies = soup.findAll("img", class_="jsx-952983560")
movies_text=[]
for item in movies:
result = item.get('alt')
movies_text.append(result)
print(movies_text)
If you run into this issue in the future, remember to just print out the initial html you can get with soup and just check by eye if the information you need can be found.

Cannot reach to the href in a html element -Beautifulsoup

I have been trying to randomize a wikipedia page and get the URL of that randomized site. Even though I can fetch every link on the site, I can not reach to this piece of html code and fetch the href for some reason.
An example of a randomized wikipedia page.
<a accesskey="v" href="https://en.wikipedia.org/wiki/T%C5%99eb%C3%ADvlice?action=edit" class="oo-ui-element-hidden"></a>
All the wikipedia pages have this and I need to get the href so that I can manipulate this in a way that I can get the current URL.
The code I have written this far:
from bs4 import BeautifulSoup
import requests
links = []
for x in range(0, 1):
source = requests.get("https://en.wikipedia.org/wiki/Special:Random").text
soup = BeautifulSoup(source, "lxml")
print(soup.find(id="firstHeading"))
for link in soup.findAll('a'):
links.append(link.get('href'))
print(links)
Directly getting the current URL would also help too, however I couldn't find a solution for that online.
Also I'm using Lunix OS -if that would help-
Take a look for the attributes
You should specify your search by using the attribute of this <a>:
soup.find_all('a', accesskey='e')
Example
import requests
from bs4 import BeautifulSoup
links = []
for x in range(0, 1):
source = requests.get("https://en.wikipedia.org/wiki/Special:Random").text
soup = BeautifulSoup(source, "lxml")
print(soup.find(id="firstHeading"))
for link in soup.find_all('a', accesskey='e'):
links.append(link.get('href'))
print(links)
Output
<h1 class="firstHeading" id="firstHeading" lang="en">James Stack (golfer)</h1>
['/w/index.php?title=James_Stack_(golfer)&action=edit']
Just in case
You do not need the second loop, if you just wanna handle that single <a> use find() instead of find_all()
Example
import requests
from bs4 import BeautifulSoup
links = []
for x in range(0, 5):
source = requests.get("https://en.wikipedia.org/wiki/Special:Random").text
soup = BeautifulSoup(source, "lxml")
links.append(soup.find('a', accesskey='e').get('href'))
links
Output
['/w/index.php?title=Rick_Moffat&action=edit',
'/w/index.php?title=Mount_Burrows&action=edit',
'/w/index.php?title=The_Rock_Peter_and_the_Wolf&action=edit',
'/w/index.php?title=Yamato,_Yamanashi&action=edit',
'/w/index.php?title=Craig_Henderson&action=edit']

Scraping links from href on Sephora website

Hi So I am trying to scrape the links for all the products a specific page on Sephora. My code only gives me the first 12 links while there are 48 products on the website. I think this is because Sephora is a User-Interactive-website(Please correct me if I am wrong) so it doesn't load the rest. But I do not know how to get the rest. Please send some help! Thank you!!!
Here is my code:
from bs4 import BeautifulSoup
import requests
url = "https://www.sephora.com/brand/estee-lauder/skincare"
r = requests.get(url)
data = r.text
soup = BeautifulSoup(data,'html.parser')
link_list = []
keyword = 'product'
for link in soup.findAll('a'):
href = link.get('href')
if keyword in href:
link_list.append('https://www.sephora.com/' + href)
else:
continue
If you take a look at the source code, you will see their data stored as a json object. You can get the json object by this:
from bs4 import BeautifulSoup
import requests
import json
url = "https://www.sephora.com/brand/estee-lauder/skincare"
r = requests.get(url)
data = r.text
soup = BeautifulSoup(data,'html.parser')
data = json.loads(soup.find('script', id='linkJSON').text)
products = data[3]['props']['products']
prefix = "https://www.sephora.com"
url_links = [prefix+p["targetUrl"] for p in products]
print(url_links)
By investigating the json data, you can find where the links stored. To view the json data more clearly, I use this website: https://codebeautify.org/jsonviewer

Using Python to Scrape Sky Cinema List

I'd like to gather a list of films and their links to all available movies on Sky Cinema website.
The website is:
http://www.sky.com/tv/channel/skycinema/find-a-movie#/search?genre=all&window=skyCinema&certificate=all&offset=0&scrollPosition=200
I am using Python 3.6 and Beautiful Soup.
I am having problems finding the title and link. Especially as there are several pages to click through - possibly based on scroll position (in the URL?)
I've tried using BS and Python but there is no output. The code I have tried would only return the title. I'd like the title and the link to the film. As these are in different areas on the site, I am unsure on how this is done.
Code I have tried:
from bs4 import BeautifulSoup
import requests
link = "http://www.sky.com/tv/channel/skycinema/find-a-movie#/search?genre=all&window=skyCinema&certificate=all&offset=0&scrollPosition=200"
r = requests.get(link)
page = BeautifulSoup(r.content, "html.parser")
for dd in page.find_all("div", {"class":"sentence-result-infos"}):
title = dd.find(class_="title ellipsis ng-binding").text.strip()
print(title)
spans=page.find_all('span', {'class': 'title ellipsis ng-binding'})
for span in spans:
print(span.text)
I'd like the output to show as the title, link.
EDIT:
I have just tried the following but get "text" is not an attribute:
from bs4 import BeautifulSoup
from requests_html import HTMLSession
session = HTMLSession()
response = session.get('http://www.sky.com/tv/channel/skycinema/find-a-movie/search?genre=all&window=skyCinema&certificate=all&offset=0&scrollPosition=200')
soup = BeautifulSoup(response.content, 'html.parser')
title = soup.find('span', {'class': 'title ellipsis ng-binding'}).text.strip()
print(title)
There is an API to be found in network tab. You can get all results with one call. You can set the limit to a number greater than the expected result count
r = requests.get('http://www.sky.com/tv/api/search/movie?limit=10000&window=skyMovies').json()
Or use the number you can see on the page
import requests
import pandas as pd
base = 'http://www.sky.com/tv'
r = requests.get('http://www.sky.com/tv/api/search/movie?limit=1555&window=skyMovies').json()
data = [(item['title'], base + item['url']) for item in r['items']]
df = pd.DataFrame(data, columns = ['Title', 'Link'])
print(df)
First of all, read terms and conditions of the site you are going to scrape.
Next, you need selenium:
from selenium import webdriver
import bs4
# MODIFY the url with YOURS
url = "type the url to scrape here"
driver = webdriver.Firefox()
driver.get(url)
html = driver.page_source
soup = bs4.BeautifulSoup(html, "html.parser")
baseurl = 'http://www.sky.com/'
titles = [n.text for n in soup.find_all('span', {'class':'title ellipsis ng-binding'})]
links = [baseurl+h['href'] for h in soup.find_all('a', {'class':'sentence-result-pod ng-isolate-scope'})]

Categories

Resources