Python - How to pull back a specific link from a website

Python noob incoming,
I am attempting to web-scrape a specific link from a website, but I am pulling back multiple links and I don't know how to refine the code so that it only pulls back the one I want.
I believe the problem is due to there being a duplicate 'target' attribute in the HTML.
Here is an example of the HTML below:
<ul><li>Weekly Metrics</li>
<li><a rel="noreferrer noopener" href="Link2.xlsx" target="_blank">Monthly Website Statistics</a></li>
<li><a rel="noreferrer noopener" href="Link3.pdf" target="_blank">2020 Overview</a></li></ul>
My attempt at it:
import requests
import pandas as pd
from bs4 import BeautifulSoup
raw_url = 'https://url1.com/'
r = requests.get(raw_url)
soup = BeautifulSoup(r.content, 'html.parser')
monthly_url = soup.find_all('a', target="_blank")
print(monthly_url)
# ******** Pulls back 2 results *********
# monthly_url = url.get('href')  # this would give me just the URL inside the <a>...</a> tag I want
I would like to pull back ONLY the Link for the "Monthly Website Statistics" excel sheet.
Any thoughts on how I could define this further?
Thank you in advance.

You are using find_all to find all the elements with target="_blank", and unfortunately there are two of them.
You could filter on other attributes instead; bs4 lets you do so:
soup.find_all(attrs={"href": "Link2.xlsx"})
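As a minimal sketch of filtering on other attributes, the snippet below reuses the HTML from the question and narrows the match in two ways: by an href ending in .xlsx, and by the exact link text. The variable names by_href and by_text are only illustrative, and the assumption is that the real page has exactly one link matching each filter.

```python
import re
from bs4 import BeautifulSoup

# Sample markup copied from the question (the real page may differ).
html = '''<ul><li>Weekly Metrics</li>
<li><a rel="noreferrer noopener" href="Link2.xlsx" target="_blank">Monthly Website Statistics</a></li>
<li><a rel="noreferrer noopener" href="Link3.pdf" target="_blank">2020 Overview</a></li></ul>'''

soup = BeautifulSoup(html, "html.parser")

# Option 1: first <a> whose href ends with ".xlsx".
by_href = soup.find("a", href=re.compile(r"\.xlsx$"))

# Option 2: first <a> whose text is exactly the label we want.
by_text = soup.find("a", string="Monthly Website Statistics")

print(by_href["href"])
print(by_text["href"])
```

Both calls use find rather than find_all, so each returns a single tag (or None when nothing matches), which is worth checking before indexing on a live page.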

from bs4 import BeautifulSoup
html = '''<ul><li>Weekly Metrics</li>
<li><a rel="noreferrer noopener" href="Link2.xlsx" target="_blank">Monthly Website Statistics</a></li>
<li><a rel="noreferrer noopener" href="Link3.pdf" target="_blank">2020 Overview</a></li></ul>'''
soup = BeautifulSoup(html, 'lxml')
print(soup.select_one('a:-soup-contains(Monthly)')['href'])
Output:
Link2.xlsx

Related

Python BeautifulSoup Printing info from multiple tags div class

Trying to create a python script to collect info from a website.
Trying to work out how I can extract information from 2/3 DIV tags and print.
example of html code
<div class="PowerDetails">
<p class="RunningCost">$4.44</p>
<p class="Time">peek</p>
<p class="RunningCost"> $2.33</p>
<p class="Time">Off-peek</p>
</div>
I have managed to get one of them by running a for loop, but I am trying to get RunningCost and Time side by side.
Here is my Python script. I'm new to it, so I'm playing around and trying a few different things:
import bs4, requests, time
while True:
    url = "https://www.website.com"
    response = requests.get(url)
    soup = bs4.BeautifulSoup(response.text, 'html.parser')
    #soupTitle = soup.select('.RunningCost')
    soupDetail = soup.select('.Time')
    for soupDetailList in soupDetail:
        print (soupDetailList.text)
End goal for this script is a web monitor to list changes/updates
zip should do the job.
from bs4 import BeautifulSoup
soup = BeautifulSoup("<html_text>" , "html.parser")
div = soup.find("div")
for r, t in zip(div.find_all("p", {"class":"RunningCost"}),
                div.find_all("p", {"class":"Time"})):
    print(r.string, t.string)
$4.44 peek
$2.33 Off-peek
Assuming that soup is the parsed HTML (note that the keyword argument is class_, with a trailing underscore):
PowerDetails = soup.find("div")
RunningCost = PowerDetails.find_all("p", class_="RunningCost")
Time = PowerDetails.find_all("p", class_="Time")
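Putting the two lookups together, here is a small self-contained sketch that pairs each RunningCost with its Time, using the HTML from the question. It assumes the two kinds of paragraph always appear in matching order and equal number; the names details and pairs are illustrative only.

```python
from bs4 import BeautifulSoup

# Sample markup copied from the question.
html = '''<div class="PowerDetails">
<p class="RunningCost">$4.44</p>
<p class="Time">peek</p>
<p class="RunningCost"> $2.33</p>
<p class="Time">Off-peek</p>
</div>'''

soup = BeautifulSoup(html, "html.parser")
details = soup.find("div", class_="PowerDetails")

costs = details.find_all("p", class_="RunningCost")
times = details.find_all("p", class_="Time")

# zip pairs the n-th cost with the n-th time; strip=True trims stray whitespace.
pairs = [(c.get_text(strip=True), t.get_text(strip=True))
         for c, t in zip(costs, times)]

for cost, time_label in pairs:
    print(cost, time_label)
```

Keeping the pairs in a list makes it easier to compare one poll against the next, which is what a change monitor would need.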

Scrape data-encoded-url from website with beautiful soup

I am trying to scrape restaurant websites from www.tripadvisor.de.
For example, I took this restaurant, and on its page there is a link to the URL I want to scrape: http://leniliebtkaffee.de
The source code looks like this:
<a data-encoded-url="VUxRX2h0dHA6Ly9sZW5pbGllYnRrYWZmZWUuZGUvX3FLOQ==" class="_2wKz--mA _27M8V6YV"
target="_blank" href="http://leniliebtkaffee.de/"><span class="ui_icon laptop _3ZW3afUk"></span><span
class="_2saB_OSe">Website</span><span class="ui_icon external-link-no-box _2OpUzCuO"></span></a>
However, if I try to scrape this with the following python code:
import requests
from bs4 import BeautifulSoup
URL = 'https://www.tripadvisor.de/Restaurant_Review-g187367-d12632224-Reviews-Leni_Liebt_Kaffee-Aachen_North_Rhine_Westphalia.html'
page = requests.get(URL)
soup = BeautifulSoup(page.content, 'html.parser')
for website in soup.findAll('a', attrs={'class':'_2wKz--mA _27M8V6YV'}):
    print(website)
I get
<a class="_2wKz--mA _27M8V6YV" data-encoded-url="NVh0X2h0dHA6Ly9sZW5pbGllYnRrYWZmZWUuZGUvX1dDWg==" target="_blank"><span class="ui_icon laptop _3ZW3afUk"></span><span class="_2saB_OSe">Website</span><span class="ui_icon external-link-no-box _2OpUzCuO"></span></a>
Unfortunately, there is no href link in there. How can I get it?
There's a URL base64-encoded in data-encoded-url:
>>> import base64
>>> base64.b64decode(b"NVh0X2h0dHA6Ly9sZW5pbGllYnRrYWZmZWUuZGUvX1dDWg==")
b'5Xt_http://leniliebtkaffee.de/_WCZ'
As you can see, the URL seems to be padded with either nonsense or some kind of flags, so you'll want to strip that.
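A minimal sketch of the stripping step follows. It assumes the decoded value always has the shape prefix, underscore, URL, underscore, suffix, which is only inferred from the samples in this thread and is not documented anywhere; decode_encoded_url is a hypothetical helper name.

```python
import base64


def decode_encoded_url(encoded: str) -> str:
    """Decode a data-encoded-url value and drop the padding around the URL."""
    decoded = base64.b64decode(encoded).decode("utf-8")
    # Assumption: the text looks like "<prefix>_<url>_<suffix>".
    # Split on the first and last underscore so underscores inside the URL survive.
    without_prefix = decoded.split("_", 1)[1]
    return without_prefix.rsplit("_", 1)[0]


print(decode_encoded_url("NVh0X2h0dHA6Ly9sZW5pbGllYnRrYWZmZWUuZGUvX1dDWg=="))
```

In the scraping loop, the encoded value can be read from the tag with website['data-encoded-url'] and passed to this helper.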

How to get data from nested HTML using BeautifulSoup in Django

I am trying to learn web scraping and I'm stuck at a point where the data I want is wrapped by a div tag as so:
<div class="maincounter-number">
<span style="color:#aaa">803 </span>
</div>
There are several values like that and I need all of them (e.g. 803). So I guess I need to do soup.find_all(...), but I don't know what to put inside. Can anyone help?
I am working in python (Django.)
This should do what you are looking to do:
from bs4 import BeautifulSoup
html_doc = '<div class="maincounter-number"><span style="color:#aaa">803 </span></div>'
soup = BeautifulSoup(html_doc, 'html.parser')
print(soup.find_all('span', {'style': 'color:#aaa'})[0].get_text())
If you just want to query the text in the div and search by class:
print(soup.find_all('div', {'class': 'maincounter-number'})[0].get_text())
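Since the question asks for all of the values rather than only the first, here is a small sketch that loops over every matching div. The second div in the sample is made up for illustration; only the first comes from the question.

```python
from bs4 import BeautifulSoup

# First div is from the question; the second is a hypothetical extra counter.
html_doc = '''
<div class="maincounter-number"><span style="color:#aaa">803 </span></div>
<div class="maincounter-number"><span>27 </span></div>
'''

soup = BeautifulSoup(html_doc, "html.parser")

# Collect the stripped text of every div with the counter class.
values = [div.get_text(strip=True)
          for div in soup.find_all("div", class_="maincounter-number")]

print(values)
```

Searching by the div class rather than the span style should be less fragile, because the inline colour may differ between counters.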

How to extract link to image "a href" & "class" in a html page using beautifulsoup

So I have several images using the same line of code to refer to html image links on a page: <a href="#" class="sh-mo__image" data-image="http://somejpgimage.jpeg">
I would like to retrieve the link only but just can't seem to navigate beyond the class to the link.
Can anyone help?
Also I have "n" number of links which I would like to retrieve separately.
You can do what @D.Chel suggested using a list comprehension.
>>> links = [x['data-image'] for x in soup.find_all('a', {'class': 'sh-mo__image'})]
>>> links
['http://somejpgimage1.jpeg', 'http://somejpgimage2.jpeg']
I believe that you are looking for something like this:
from bs4 import BeautifulSoup
html = ''' <a href="#" class="sh-mo__image" data-image="http://somejpgimage1.jpeg">
<a href="#" class="sh-mo__image" data-image="http://somejpgimage2.jpeg"> '''
soup = BeautifulSoup(html,'lxml')
mylinks = []
for link in soup.find_all('a',{'class':'sh-mo__image'}):
    mylinks.append(link['data-image'])

find specific text in beautifulsoup

I have a specific piece of text I'm trying to get using BeautifulSoup and Python, however I am not sure how to get it using soup.find().
I am trying to obtain "#1 in Beauty" only from the following.
<ul>
<li>...</li>
<li>...</li>
<li id="salesRank">
<b>Amazon Best Sellers Rank:</b>
"#1 in Beauty ("
See top 100
")
Can anyone help me with this?
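You need to use the find_all method of soup with a string filter. Try the code below (updated for Python 3):
import re
from urllib.request import urlopen
from bs4 import BeautifulSoup
url = 'your url here'
content = urlopen(url).read()
soup = BeautifulSoup(content, "html.parser")
# Passing the text as the first argument would search for a tag with that name,
# so match against the text nodes instead.
print(soup.find_all(string=re.compile('#1 in Beauty')))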
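Another way to approach it, sketched here under assumptions, is to locate the salesRank list item by its id and pull the rank out of its text with a regular expression. The markup below is a reconstruction: the question's HTML is truncated, so the closing tags and the "See top 100" link are guesses, and the pattern assumes the rank always looks like "#N in Category" followed by an opening parenthesis.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical reconstruction of the truncated markup from the question.
html = '''<ul>
<li>...</li>
<li id="salesRank">
<b>Amazon Best Sellers Rank:</b>
#1 in Beauty (<a href="#">See top 100</a>)
</li>
</ul>'''

soup = BeautifulSoup(html, "html.parser")
item = soup.find("li", id="salesRank")

# Match "#<number> in <category>" up to, but not including, the "(".
match = re.search(r"#\d+ in [^(]+", item.get_text())
rank = match.group(0).strip() if match else None

print(rank)
```

Anchoring on the id first keeps the regular expression from matching similar text elsewhere on the page.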
