Python noob incoming,
I am attempting to web-scrape a specific link from a website, although I am pulling back multiple and I don't know how I could define the code further to only pull back the one I want.
I believe the problem is due to there being a duplicate 'target' attribute in the HTML.
Here is an example of the HTML below:
<ul><li>Weekly Metrics</li>
<li><a rel="noreferrer noopener" href="Link2.xlsx" target="_blank">Monthly Website Statistics</a></li>
<li><a rel="noreferrer noopener" href="Link3.pdf" target="_blank">2020 Overview</a></li></ul>
My attempt at it:
import requests
import pandas as pd
from bs4 import BeautifulSoup
raw_url = 'https://url1.com/'
r = requests.get(raw_url)
soup = BeautifulSoup(r.content, 'html.parser')
monthly_url = soup.find_all('a', target="_blank")
print(monthly_url)
# ******** Pulls back 2 results *********
monthly_url = (url.get('href'))  # this would give me just the URL inside the <a>...</a> tag I want.
I would like to pull back ONLY the Link for the "Monthly Website Statistics" excel sheet.
Any thoughts on how I could define this further?
Thank you in advance.
You are using find_all to find every element with target="_blank", and unfortunately there are two of them.
You could match on other attributes instead; bs4 lets you do so:
soup.find_all(attrs={"href": "Link2.xlsx"})
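If the exact file name is not known in advance, matching on a pattern in href may be enough. This is only a sketch: it assumes the link you want is the only one ending in .xlsx, and it runs against the HTML sample from the question rather than the live page.

```python
import re
from bs4 import BeautifulSoup

# HTML sample taken from the question.
html = '''<ul><li>Weekly Metrics</li>
<li><a rel="noreferrer noopener" href="Link2.xlsx" target="_blank">Monthly Website Statistics</a></li>
<li><a rel="noreferrer noopener" href="Link3.pdf" target="_blank">2020 Overview</a></li></ul>'''

soup = BeautifulSoup(html, "html.parser")

# Assumption: only one link on the page points at an .xlsx file.
xlsx_link = soup.find("a", href=re.compile(r"\.xlsx$"))
xlsx_href = xlsx_link["href"] if xlsx_link else None
print(xlsx_href)
```

find returns None when nothing matches, so the check before indexing avoids a TypeError on pages where the link is missing.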
from bs4 import BeautifulSoup
html = '''<ul><li>Weekly Metrics</li>
<li><a rel="noreferrer noopener" href="Link2.xlsx" target="_blank">Monthly Website Statistics</a></li>
<li><a rel="noreferrer noopener" href="Link3.pdf" target="_blank">2020 Overview</a></li></ul>'''
soup = BeautifulSoup(html, 'lxml')
print(soup.select_one('a:-soup-contains(Monthly)')['href'])
Output:
Link2.xlsx
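A variant that does not rely on the :-soup-contains selector is to match the link text with find and a regular expression. The sketch below also joins the relative href onto the page URL; raw_url is the placeholder from the question, so the resulting address is illustrative only.

```python
import re
from urllib.parse import urljoin
from bs4 import BeautifulSoup

raw_url = "https://url1.com/"  # placeholder URL from the question

# HTML sample taken from the question.
html = '''<ul><li>Weekly Metrics</li>
<li><a rel="noreferrer noopener" href="Link2.xlsx" target="_blank">Monthly Website Statistics</a></li>
<li><a rel="noreferrer noopener" href="Link3.pdf" target="_blank">2020 Overview</a></li></ul>'''

soup = BeautifulSoup(html, "html.parser")

# Match the <a> tag whose text contains the label we are after.
link = soup.find("a", string=re.compile("Monthly Website Statistics"))

# href is relative in the sample, so resolve it against the page URL.
monthly_url = urljoin(raw_url, link["href"]) if link else None
print(monthly_url)
```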
Trying to create a python script to collect info from a website.
Trying to work out how I can extract information from 2/3 DIV tags and print.
example of html code
<div class="PowerDetails">
<p class="RunningCost">$4.44</p>
<p class="Time">peek</p>
<p class="RunningCost"> $2.33</p>
<p class="Time">Off-peek</p>
</div>
I have managed to get one of them by running a for loop, but I am trying to get RunningCost and Time side by side.
Python Script, I'm new to it so playing around trying a few different things
import bs4, requests, time
while True:
    url = "https://www.website.com"
    response = requests.get(url)
    soup = bs4.BeautifulSoup(response.text, 'html.parser')
    #soupTitle = soup.select('.RunningCost')
    soupDetail = soup.select('.Time')
    for soupDetailList in soupDetail:
        print (soupDetailList.text)
End goal for this script is a web monitor to list changes/updates
zip should do the job.
from bs4 import BeautifulSoup
soup = BeautifulSoup("<html_text>" , "html.parser")
div = soup.find("div")
for r, t in zip(div.find_all("p", {"class":"RunningCost"}),
                div.find_all("p", {"class":"Time"})):
    print(r.string, t.string)
$4.44 peek
$2.33 Off-peek
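For reference, here is the same zip idea as a self-contained sketch run against the HTML from the question. It uses get_text(strip=True) so the leading space in " $2.33" is dropped; the variable names are only illustrative.

```python
from bs4 import BeautifulSoup

# HTML sample taken from the question.
html = """<div class="PowerDetails">
<p class="RunningCost">$4.44</p>
<p class="Time">peek</p>
<p class="RunningCost"> $2.33</p>
<p class="Time">Off-peek</p>
</div>"""

soup = BeautifulSoup(html, "html.parser")
div = soup.find("div", class_="PowerDetails")

# Pair the n-th RunningCost with the n-th Time.
rows = [
    (cost.get_text(strip=True), period.get_text(strip=True))
    for cost, period in zip(div.find_all("p", class_="RunningCost"),
                            div.find_all("p", class_="Time"))
]

for cost, period in rows:
    print(cost, period)
```

zip stops at the shorter list, so an unmatched RunningCost or Time at the end is silently dropped.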
Assuming that soup is the parsed HTML (note that the keyword argument is class_, with a trailing underscore):
PowerDetails = soup.find("div")
RunningCost = PowerDetails.find_all("p", class_="RunningCost")
Time = PowerDetails.find_all("p", class_="Time")
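For the end goal mentioned in the question (a monitor that lists changes), one possible shape is to poll, extract the pairs, and print only when they differ from the previous poll. This is a rough sketch, not a finished monitor: extract_pairs, monitor, fetch, interval_seconds and max_polls are all hypothetical names, and fetch is whatever function returns the page HTML.

```python
import time
from bs4 import BeautifulSoup


def extract_pairs(html):
    """Return [(running_cost, time_label), ...] from the PowerDetails div."""
    soup = BeautifulSoup(html, "html.parser")
    div = soup.find("div", class_="PowerDetails")
    if div is None:
        return []
    costs = div.find_all("p", class_="RunningCost")
    times = div.find_all("p", class_="Time")
    return [(c.get_text(strip=True), t.get_text(strip=True))
            for c, t in zip(costs, times)]


def monitor(fetch, interval_seconds=60, max_polls=None):
    """Call fetch() repeatedly and print the pairs whenever they change."""
    previous = None
    polls = 0
    while max_polls is None or polls < max_polls:
        current = extract_pairs(fetch())
        if current != previous:
            print("changed:", current)
            previous = current
        polls += 1
        # Sleep between polls, but not after the final one.
        if max_polls is None or polls < max_polls:
            time.sleep(interval_seconds)
    return previous
```

With requests installed, fetch could be something like lambda: requests.get(url).text. Sleeping between polls keeps the loop from hitting the site continuously.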
I am trying to scrape restaurant websites from www.tripadvisor.de.
For example, I took this one:
Restaurant. On its page there is a link to the URL I want to scrape: http://leniliebtkaffee.de
The source code looks like this:
<a data-encoded-url="VUxRX2h0dHA6Ly9sZW5pbGllYnRrYWZmZWUuZGUvX3FLOQ==" class="_2wKz--mA _27M8V6YV"
target="_blank" href="http://leniliebtkaffee.de/"><span class="ui_icon laptop _3ZW3afUk"></span><span
class="_2saB_OSe">Website</span><span class="ui_icon external-link-no-box _2OpUzCuO"></span></a>
However, if I try to scrape this with the following python code:
import requests
from bs4 import BeautifulSoup
URL = 'https://www.tripadvisor.de/Restaurant_Review-g187367-d12632224-Reviews-Leni_Liebt_Kaffee-Aachen_North_Rhine_Westphalia.html'
page = requests.get(URL)
soup = BeautifulSoup(page.content, 'html.parser')
for website in soup.find_all('a', attrs={'class':'_2wKz--mA _27M8V6YV'}):
    print(website)
I get
<a class="_2wKz--mA _27M8V6YV" data-encoded-url="NVh0X2h0dHA6Ly9sZW5pbGllYnRrYWZmZWUuZGUvX1dDWg==" target="_blank"><span class="ui_icon laptop _3ZW3afUk"></span><span class="_2saB_OSe">Website</span><span class="ui_icon external-link-no-box _2OpUzCuO"></span></a>
Unfortunately, there is no href link in there. How can I get it?
There's a URL base64-encoded in data-encoded-url:
>>> import base64
>>> base64.b64decode(b"NVh0X2h0dHA6Ly9sZW5pbGllYnRrYWZmZWUuZGUvX1dDWg==")
b'5Xt_http://leniliebtkaffee.de/_WCZ'
As you can see, the URL seems to be padded with either nonsense or some kind of flags, so you'll want to strip that.
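A minimal sketch of the stripping step, assuming the decoded value always has the shape prefix_URL_suffix, as both samples in this question do. That layout is inferred from those two values only, and decode_encoded_url is a hypothetical helper name.

```python
import base64


def decode_encoded_url(encoded):
    """Decode a data-encoded-url value and strip the surrounding padding."""
    decoded = base64.b64decode(encoded).decode("utf-8")
    # Assumption: the payload looks like "<prefix>_<url>_<suffix>".
    # Drop everything up to the first "_" and from the last "_" onwards.
    _, _, rest = decoded.partition("_")
    url, _, _ = rest.rpartition("_")
    return url


print(decode_encoded_url("NVh0X2h0dHA6Ly9sZW5pbGllYnRrYWZmZWUuZGUvX1dDWg=="))
```

Applied to the scraped tag, this would be called as decode_encoded_url(website['data-encoded-url']). If the site changes the padding scheme, the function returns a wrong or empty string, so it is worth checking that the result starts with http.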
I am trying to learn web scraping and I'm stuck at a point where the data I want is wrapped by a div tag as so:
<div class="maincounter-number">
<span style="color:#aaa">803 </span>
</div>
There are several values like that and I need all of them (e.g. 803). So I guess I need to do soup.find_all(...), but I don't know what to put inside. Can anyone help?
I am working in python (Django.)
This should do what you are looking to do:
from bs4 import BeautifulSoup
html_doc = '<div class="maincounter-number"><span style="color:#aaa">803 </span></div>'
soup = BeautifulSoup(html_doc, 'html.parser')
print(soup.find_all('span', {'style': 'color:#aaa'})[0].get_text())
If you just want to query the text in the div and search by class:
print(soup.find_all('div', {'class': 'maincounter-number'})[0].get_text())
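Since the question asks for all of the values rather than just the first, a list comprehension over find_all may be closer to what is needed. In this sketch only the first div comes from the question; the other two divs and their numbers are made up for illustration.

```python
from bs4 import BeautifulSoup

# First div is from the question; the other two are hypothetical extras.
html_doc = """
<div class="maincounter-number"><span style="color:#aaa">803 </span></div>
<div class="maincounter-number"><span>1,204 </span></div>
<div class="maincounter-number"><span>57 </span></div>
"""

soup = BeautifulSoup(html_doc, "html.parser")

# One stripped string per matching div.
counters = [div.get_text(strip=True)
            for div in soup.find_all("div", class_="maincounter-number")]
print(counters)
```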
So I have several images using the same line of code to refer to html image links on a page: <a href="#" class="sh-mo__image" data-image="http://somejpgimage.jpeg">
I would like to retrieve the link only but just can't seem to navigate beyond the class to the link.
Can anyone help?
Also I have "n" number of links which I would like to retrieve separately.
You can do what @D.Chel suggested using a list comprehension.
>>> links = [x['data-image'] for x in soup.find_all('a', {'class': 'sh-mo__image'})]
>>> links
['http://somejpgimage1.jpeg', 'http://somejpgimage2.jpeg']
I believe that you are looking for something like this:
from bs4 import BeautifulSoup
html = ''' <a href="#" class="sh-mo__image" data-image="http://somejpgimage1.jpeg">
<a href="#" class="sh-mo__image" data-image="http://somejpgimage2.jpeg"> '''
soup = BeautifulSoup(html,'lxml')
mylinks = []
for link in soup.find_all('a',{'class':'sh-mo__image'}):
    mylinks.append(link['data-image'])
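If some of the anchors might lack a data-image attribute, link['data-image'] raises a KeyError. One way around that, sketched here, is a CSS selector that only matches anchors carrying the attribute. The middle anchor in this sample is a made-up addition to show the effect.

```python
from bs4 import BeautifulSoup

# The anchor without data-image is hypothetical, added to show the filtering.
html = '''<a href="#" class="sh-mo__image" data-image="http://somejpgimage1.jpeg"></a>
<a href="#" class="sh-mo__image"></a>
<a href="#" class="sh-mo__image" data-image="http://somejpgimage2.jpeg"></a>'''

soup = BeautifulSoup(html, "html.parser")

# [data-image] restricts the match to anchors that have the attribute.
links = [a["data-image"] for a in soup.select("a.sh-mo__image[data-image]")]

# Each link can then be handled separately by index.
for index, link in enumerate(links):
    print(index, link)
```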
I have a specific piece of text I'm trying to get using BeautifulSoup and Python, however I am not sure how to get it using soup.find().
I am trying to obtain "#1 in Beauty" only from the following.
<ul>
<li>...</li>
<li>...</li>
<li id="salesRank">
<b>Amazon Best Sellers Rank:</b>
"#1 in Beauty ("
See top 100
")
Can anyone help me with this?
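You need to use the find_all method of soup with the string argument, so that it searches text nodes rather than tag names. Try the following:
import re
from urllib.request import urlopen
from bs4 import BeautifulSoup
url = 'your url here'
content = urlopen(url).read()
soup = BeautifulSoup(content, "html.parser")
# A bare string argument is treated as a tag name; string=... matches text nodes instead.
print(soup.find_all(string=re.compile(r'#1 in Beauty')))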
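To get only the rank text without the trailing parenthesis, one option is to take the text of the salesRank item and cut the rank out with a regular expression. This is a sketch against a guessed version of the markup: the question shows the browser inspector view, so the sample HTML below (including the link around "See top 100") is an assumption.

```python
import re
from bs4 import BeautifulSoup

# Assumed markup, reconstructed from the inspector view in the question.
html = """<ul>
<li id="salesRank">
<b>Amazon Best Sellers Rank:</b>
#1 in Beauty (<a href="#">See top 100</a>)
</li>
</ul>"""

soup = BeautifulSoup(html, "html.parser")
item = soup.find("li", id="salesRank")

# Flatten the item to one string, then keep "#<number> in <category>".
text = item.get_text(" ", strip=True)
match = re.search(r"#\d+ in [^(]+", text)
rank = match.group(0).strip() if match else None
print(rank)
```

The pattern stops at the first opening parenthesis, so a category name that itself contains one would be cut short.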