Scrape data-encoded-url from website with beautiful soup - python

I am trying to scrape the restaurant websites listed on www.tripadvisor.de.
For example, I took this restaurant, and on its page there is a link to the URL I want to scrape: http://leniliebtkaffee.de
The source code looks like this:
<a data-encoded-url="VUxRX2h0dHA6Ly9sZW5pbGllYnRrYWZmZWUuZGUvX3FLOQ==" class="_2wKz--mA _27M8V6YV"
target="_blank" href="http://leniliebtkaffee.de/"><span class="ui_icon laptop _3ZW3afUk"></span><span
class="_2saB_OSe">Website</span><span class="ui_icon external-link-no-box _2OpUzCuO"></span></a>
However, if I try to scrape this with the following python code:
import requests
from bs4 import BeautifulSoup
URL = 'https://www.tripadvisor.de/Restaurant_Review-g187367-d12632224-Reviews-Leni_Liebt_Kaffee-Aachen_North_Rhine_Westphalia.html'
page = requests.get(URL)
soup = BeautifulSoup(page.content, 'html.parser')
for website in soup.findAll('a', attrs={'class':'_2wKz--mA _27M8V6YV'}):
    print(website)
I get
<a class="_2wKz--mA _27M8V6YV" data-encoded-url="NVh0X2h0dHA6Ly9sZW5pbGllYnRrYWZmZWUuZGUvX1dDWg==" target="_blank"><span class="ui_icon laptop _3ZW3afUk"></span><span class="_2saB_OSe">Website</span><span class="ui_icon external-link-no-box _2OpUzCuO"></span></a>
Unfortunately, there is no href link in there. How can I get it?

There's a URL base64-encoded in data-encoded-url:
>>> import base64
>>> base64.b64decode(b"NVh0X2h0dHA6Ly9sZW5pbGllYnRrYWZmZWUuZGUvX1dDWg==")
b'5Xt_http://leniliebtkaffee.de/_WCZ'
As you can see, the URL seems to be padded with either nonsense or some kind of flags, so you'll want to strip that.
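A minimal sketch of stripping that padding. It assumes, based only on the two samples above, that the decoded value always has the shape `<prefix>_<url>_<suffix>`; the helper name `decode_encoded_url` is made up for illustration.

```python
import base64


def decode_encoded_url(encoded):
    """Decode a data-encoded-url value and strip the padding around the URL."""
    decoded = base64.b64decode(encoded).decode("utf-8")
    # Assumption: the value looks like "<prefix>_<url>_<suffix>".
    # Drop everything up to the first underscore and after the last one.
    _, _, rest = decoded.partition("_")
    url, _, _ = rest.rpartition("_")
    return url


print(decode_encoded_url("NVh0X2h0dHA6Ly9sZW5pbGllYnRrYWZmZWUuZGUvX1dDWg=="))
```

In the scraping loop from the question, the value to pass in would be `website["data-encoded-url"]`. If the site ever changes the padding scheme, this helper would need to be adjusted.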

Related

Unable to find element BeautifulSoup

I am trying to parse a specific href link from the following website: https://www.murray-intl.co.uk/en/literature-library.
The element I want to parse:
<a class="btn btn--naked btn--icon-left btn--block focus-within" href="https://www.aberdeenstandard.com/docs?editionId=9123afa2-5318-4715-9783-e07d08e2e7cc&_ga=2.12911351.1364356977.1629796255-1577053129.1629192717" target="blank">Portfolio Holding Summary<i class="material-icons btn__icon">library_books</i></a>
However, using BeautifulSoup I am unable to obtain the desired element, perhaps due to cookies acceptance.
from bs4 import BeautifulSoup
import urllib.request
import requests
page = requests.get('https://www.murray-intl.co.uk/en/literature-library')
soup = BeautifulSoup(page.content, 'html.parser')
link = soup.find_all('a', class_='btn btn--naked btn--icon-left btn--block focus-within')
url = link[0].get('href')
url
I am still new to BS4 and hope someone can point me in the right direction.
Thank you in advance!
To get the correct tags, remove the "focus-within" class (it is added later by JavaScript):
import requests
from bs4 import BeautifulSoup
url = "https://www.murray-intl.co.uk/en/literature-library"
soup = BeautifulSoup(requests.get(url).content, "html.parser")
links = soup.find_all("a", class_="btn btn--naked btn--icon-left btn--block")
for u in links:
    print(u.get_text(strip=True), u.get("href", ""))
Prints:
...
Portfolio Holding Summarylibrary_books https://www.aberdeenstandard.com/docs?editionId=9123afa2-5318-4715-9783-e07d08e2e7cc
...
EDIT: To get only the specified link you can use for example CSS selector:
link = soup.select_one('a:-soup-contains("Portfolio Holding Summary")')
print(link["href"])
Prints:
https://www.aberdeenstandard.com/docs?editionId=9123afa2-5318-4715-9783-e07d08e2e7cc
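The reason the original search came back empty is that a multi-class string passed to `class_` has to equal the whole class attribute. A CSS class selector only requires the listed classes to be present, so it keeps working whether or not "focus-within" is there. A small sketch on an inline copy of the element (the href is the shortened one from the output above):

```python
from bs4 import BeautifulSoup

# Inline copy of the element as the browser shows it (with "focus-within").
html = '''<a class="btn btn--naked btn--icon-left btn--block focus-within"
href="https://www.aberdeenstandard.com/docs?editionId=9123afa2-5318-4715-9783-e07d08e2e7cc"
target="blank">Portfolio Holding Summary<i class="material-icons btn__icon">library_books</i></a>'''
soup = BeautifulSoup(html, "html.parser")

# A multi-class string passed to class_ must equal the whole attribute value.
exact = soup.find_all("a", class_="btn btn--naked btn--icon-left btn--block")

# A CSS selector only requires the listed classes to be present.
by_selector = soup.select("a.btn.btn--naked.btn--icon-left.btn--block")

print(len(exact), len(by_selector))
```

The selector form is also independent of the order in which the classes appear in the attribute.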

Python - How to pull back a Specific Link from a website

Python noob incoming.
I am attempting to web-scrape a specific link from a website, but I am pulling back multiple links and I don't know how to refine the code to pull back only the one I want.
I believe the problem is due to there being a duplicate 'target' attribute in the HTML.
Here is an example of the HTML below:
<ul><li>Weekly Metrics</li>
<li><a rel="noreferrer noopener" href="Link2.xlsx" target="_blank">Monthly Website Statistics</a></li>
<li><a rel="noreferrer noopener" href="Link3.pdf" target="_blank">2020 Overview</a></li></ul>
My attempt at it:
import requests
import pandas as pd
from bs4 import BeautifulSoup
raw_url = 'https://url1.com/'
r = requests.get(raw_url)
soup = BeautifulSoup(r.content, 'html.parser')
monthly_url = soup.find_all('a', target="_blank")
print(monthly_url)
# ******** Pulls back 2 results *********
monthly_url = (url.get('href'))  # this would give me just the URL inside the <a> tag I want.
I would like to pull back ONLY the Link for the "Monthly Website Statistics" excel sheet.
Any thoughts on how I could define this further?
Thank you in advance.
You are using find_all to find every element with target="_blank", and unfortunately two elements match.
You could filter on other attributes instead; bs4 lets you do so:
soup.find_all(attrs={"href": "Link2.xlsx"})
from bs4 import BeautifulSoup
html = '''<ul><li>Weekly Metrics</li>
<li><a rel="noreferrer noopener" href="Link2.xlsx" target="_blank">Monthly Website Statistics</a></li>
<li><a rel="noreferrer noopener" href="Link3.pdf" target="_blank">2020 Overview</a></li></ul>'''
soup = BeautifulSoup(html, 'lxml')
print(soup.select_one('a:-soup-contains(Monthly)')['href'])
Output:
Link2.xlsx
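Another option, sketched here on the same sample HTML, is to match the link by its exact text with the `string` argument of `find()`. This assumes the link text is stable and that the `<a>` tag contains nothing but that text.

```python
from bs4 import BeautifulSoup

html = '''<ul><li>Weekly Metrics</li>
<li><a rel="noreferrer noopener" href="Link2.xlsx" target="_blank">Monthly Website Statistics</a></li>
<li><a rel="noreferrer noopener" href="Link3.pdf" target="_blank">2020 Overview</a></li></ul>'''
soup = BeautifulSoup(html, "html.parser")

# Match the <a> tag whose text is exactly the given string.
link = soup.find("a", string="Monthly Website Statistics")

# find() returns None when nothing matches, so guard before indexing.
monthly_url = link["href"] if link is not None else None
print(monthly_url)
```

Unlike `find_all`, `find` returns a single tag, so there is no list to index into.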

How to Get data-* attributes when web scraping using python requests (Python Requests Creating Some Issues)

How can I get the value of data-d1-value when I am using requests library of python?
requests.get(URL) itself does not return the data-* attributes of the div, although they are present in the original web page.
The web page is as follows:
<div id="test1" class="class1" data-d1-value="150">
180
</div>
The code I am using is :
req = requests.get(url)
soup = BeautifulSoup(req.text, 'lxml')
d1_value = soup.find('div', {'class':"class1"})
print(d1_value)
The result I get is:
<div id="test1" class="class1">
180
</div>
When I debugged this, I found that requests.get(URL) does not return the full div: it returns only the id and class, not the data-* attributes.
How should I modify my code to get the full value?
For better example:
For my case the URL is:
https://www.moneycontrol.com/india/stockpricequote/oil-drillingexploration/oilnaturalgascorporation/ONG
And the Information of variable:
The div has class="inprice1 nsecp", and the value of data-numberanimate-value is what I am trying to fetch.
Thanks in advance :)
EDIT
The website's response differs depending on how it is requested. In your case, using requests, the value you are looking for is served like this:
<div class="inprice1 nsecp" id="nsecp" rel="92.75">92.75</div>
So you can get it from the rel or from the text:
soup.find('div', {'class':"inprice1"})['rel']
soup.find('div', {'class':"inprice1"}).get_text()
Example
import requests
from bs4 import BeautifulSoup
req = requests.get('https://www.moneycontrol.com/india/stockpricequote/oil-drillingexploration/oilnaturalgascorporation/ONG')
soup = BeautifulSoup(req.text, 'lxml')
print('rel: '+soup.find('div', {'class':"inprice1"})['rel'])
print('text: '+soup.find('div', {'class':"inprice1"}).get_text())
Output
rel: 92.75
text: 92.75
To get a response that matches the source you see when you inspect the page, you can use Selenium:
Example
from selenium import webdriver
from bs4 import BeautifulSoup
from time import sleep
driver = webdriver.Chrome(executable_path=r'C:\Program Files\ChromeDriver\chromedriver.exe')
url = "https://www.moneycontrol.com/india/stockpricequote/oil-drillingexploration/oilnaturalgascorporation/ONG"
driver.get(url)
sleep(2)
soup = BeautifulSoup(driver.page_source, "lxml")
print(soup.find('div', class_='inprice1 nsecp')['data-numberanimate-value'])
driver.close()
To get the attribute value just add ['data-d1-value'] to your find()
Example
from bs4 import BeautifulSoup
html='''
<div id="test1" class="class1" data-d1-value="150">
180
</div>
'''
soup = BeautifulSoup(html, 'lxml')
d1_value = soup.find('div', {'class':"class1"})['data-d1-value']
print(d1_value)
You are seeing this issue because you didn't retrieve the other attributes that were defined on the div.
The code below retrieves all of the attributes defined on the div, including the custom ones:
from bs4 import BeautifulSoup
s = '<div id="test1" class="class1" data-d1-value="150">180</div>'
soup = BeautifulSoup(s, 'html.parser')
attributes_dictionary = soup.find('div',{'class':"class1"}).attrs
print(attributes_dictionary)
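Building on the attribute dictionary, here is a small sketch that reads a single data-* attribute without risking a KeyError and then keeps only the data-* entries. It uses the sample div from the question; the variable names are made up for illustration.

```python
from bs4 import BeautifulSoup

html = '<div id="test1" class="class1" data-d1-value="150">180</div>'
div = BeautifulSoup(html, "html.parser").find("div", class_="class1")

# .get() returns None instead of raising KeyError when the attribute is absent,
# which is useful when the server omits it from the raw response.
d1_value = div.get("data-d1-value")
missing = div.get("data-numberanimate-value")

# Keep only the data-* attributes from the full attribute dictionary.
data_attrs = {k: v for k, v in div.attrs.items() if k.startswith("data-")}
print(d1_value, missing, data_attrs)
```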
You can get the data from the HTML, or you can get it by querying the API directly.
Here is an example.
The website is Money Control.
If you open the developer tools in your browser and select the Network tab, you can see the requests that the website makes.
In the request headers you can see the URL of the API: priceapi.moneycontrol.com.
This is an unusual case, because the API is open, and usually it isn't.
You can read the price from the response.
If you save the JSON data in a variable called 'json', you can access it with:
json.data.pricecurrent
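The dotted access above is JavaScript-style; in Python the parsed JSON is a dict. A minimal sketch, assuming the response has the nesting implied by that expression. Only the keys "data" and "pricecurrent" come from the answer; the sample body and its value are made up, and the exact API endpoint is not shown in the source.

```python
import json

# Hypothetical response body: only the keys "data" and "pricecurrent" come from
# the answer above; the value is invented for illustration.
response_text = '{"data": {"pricecurrent": "92.75"}}'

payload = json.loads(response_text)

# Chained .get() calls avoid a KeyError if the API changes its layout.
price = payload.get("data", {}).get("pricecurrent")
print(price)
```

With requests, `response.json()` would give the same dict as `json.loads(response.text)`.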

Download Multiple PDF files from a webpage

So I am trying to download a few eBooks that I have purchased through Humble Bundle. I am using BeautifulSoup and requests to parse the HTML and get the href links for the PDFs.
Python
import requests
from bs4 import BeautifulSoup
r = requests.get("https://www.humblebundle.com/downloads?key=fkuzzq6R8MA8ydEw")
soup = BeautifulSoup(r.content, "html.parser")
links = soup.find_all("div", {"class": "js-all-downloads-holder"})
print(links)
I am going to add an Imgur link showing the site and HTML layout, because I don't believe you can access the HTML page without being prompted to log in (which might be one of the reasons I am having this issue to begin with): https://imgur.com/24x2X0m
HTML
<div class="flexbtn active noicon js-start-download">
<div class="right"></div>
<span class="label">PDF</span>
<a class="a" download="" href="https://dl.humble.com/makea2drpginaweekend.pdf?gamekey=fkuzzq6R8MA8ydEw&ttl=1521117317&t=b714bb732413a1f0532ec6aa72b282f9">
PDF
</a>
</div>
So the print statement should output the contents of the div, but that is not the case.
Output
python3 pdf_downloader.py
[]
Sorry for the long post, I have just been up all night working on this and at this point it would have just been easier to hit the download button 20+ times but that is not how you learn.
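The empty list suggests that the HTML returned to requests does not contain the download markup, presumably because the page requires a login or builds that part with JavaScript. A rough sketch of the extraction step only, assuming the rendered HTML has been obtained some other way (for example saved from the browser). The sample markup follows the snippet above, but its href is shortened and the gamekey value is a placeholder.

```python
import os
from urllib.parse import urlparse

from bs4 import BeautifulSoup

# Markup modeled on the snippet above; the href is shortened for illustration.
html = '''<div class="flexbtn active noicon js-start-download">
<span class="label">PDF</span>
<a class="a" download="" href="https://dl.humble.com/makea2drpginaweekend.pdf?gamekey=EXAMPLE">PDF</a>
</div>'''
soup = BeautifulSoup(html, "html.parser")

pdf_links = []
for a in soup.select("div.js-start-download a[href]"):
    href = a["href"]
    # Take the file name from the URL path, ignoring the query string.
    filename = os.path.basename(urlparse(href).path)
    if filename.lower().endswith(".pdf"):
        pdf_links.append((filename, href))

print(pdf_links)
```

Each (filename, href) pair could then be fetched with requests.get(href) and written to disk in binary mode.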

Scrape Javascript page with Python or other

I want to scrape the team grids for each game on the following website:
http://mc.championdata.com/nrl/
and I believe the code below is for away teams:
<div class="cd6364_component cd6364_div_away_team_single" style="width: 100%;
How can I scrape this site?
I'm a beginner, but I think I understand what you are asking for. You could do it in Python using BeautifulSoup and urllib. Something like this:
from bs4 import BeautifulSoup
from urllib.request import urlopen
quote_page = "http://mc.championdata.com/nrl"
page = urlopen(quote_page)
soup = BeautifulSoup(page, "html.parser")
# Take out the <div> of name and get its value
name_box = soup.find("")
#for example: soup.find("a", {"class": "price", "data-usd": True})['data-usd']
I don't understand what you are looking for exactly though.
