Download Multiple PDF files from a webpage - python

So I am trying to download a few eBooks that I have purchased through Humble Bundle. I am using BeautifulSoup and requests to parse the HTML and get the href links for the PDFs.
Python
import requests
from bs4 import BeautifulSoup
r = requests.get("https://www.humblebundle.com/downloads?key=fkuzzq6R8MA8ydEw")
soup = BeautifulSoup(r.content, "html.parser")
links = soup.find_all("div", {"class": "js-all-downloads-holder"})
print(links)
I am going to put an Imgur link to the site and HTML layout because I don't believe you can access the HTML page without logging in (which might be one of the reasons I am having this issue in the first place): https://imgur.com/24x2X0m
HTML
<div class="flexbtn active noicon js-start-download">
<div class="right"></div>
<span class="label">PDF</span>
<a class="a" download="" href="https://dl.humble.com/makea2drpginaweekend.pdf?gamekey=fkuzzq6R8MA8ydEw&ttl=1521117317&t=b714bb732413a1f0532ec6aa72b282f9">
PDF
</a>
</div>
So the print statement should output the contents of the div, but that is not the case.
Output
python3 pdf_downloader.py
[]
Sorry for the long post, I have just been up all night working on this and at this point it would have just been easier to hit the download button 20+ times but that is not how you learn.
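For what it's worth, the empty list suggests requests fetched the login page rather than the downloads page, so the div simply isn't in the HTML you parsed; the parsing itself is fine. Once you have the logged-in HTML (for example by reusing your browser's session cookies with requests.Session), extracting the PDF links from markup like the snippet above could look like this sketch (the HTML here is a reduced, made-up sample based on the question):

```python
from bs4 import BeautifulSoup

# Reduced, made-up sample of the download markup shown in the question.
html = """
<div class="js-all-downloads-holder">
  <div class="flexbtn active noicon js-start-download">
    <span class="label">PDF</span>
    <a class="a" download="" href="https://dl.humble.com/makea2drpginaweekend.pdf?gamekey=X&ttl=1&t=Y">PDF</a>
  </div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
holder = soup.find("div", {"class": "js-all-downloads-holder"})
# Collect every href under the holder; each one is a direct download URL.
pdf_links = [a["href"] for a in holder.find_all("a", href=True)]
print(pdf_links)
```

Each collected URL could then be fetched with the same authenticated session and written to disk.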

Related

How to get a style value inside a span tag with a python scraper?

I downloaded the code of this website: https://www.ilmakiage.co.il/mineral-lip-pencil-4040 with requests.get. In the browser inspector I could see a style value ("display: block;" or "display: none;") on a unique span class="qtyValidate_color" tag that I wanted to read. But when I opened my soup or response.text, I couldn't find that value in the span tag; it's empty. Please let me know what library or method I can use to get this style value from the span tag.
My Python code
from bs4 import BeautifulSoup
import requests
response = requests.get('https://www.ilmakiage.co.il/mineral-lip-pencil-4043')
soup = BeautifulSoup(response.text, 'lxml')
no_quantity = soup.find('span', class_='qtyValidate_color').contents[0]
if no_quantity == 'הכמות המבוקשת אינה קיימת':  # "the requested quantity does not exist"
    print("Ooops, 'הכמות המבוקשת אינה קיימת': the stock is empty. Open the app later. You will see the 'Shopping time' phrase when the lipstick is back in stock")
else:
    print('Shopping time')
HTML code on the website in Chrome inspector mode
<span class="qtyValidate_color" style="display: block;">הכמות המבוקשת אינה קיימת</span>
<span class="qtyValidate_color" style="display: none;">הכמות המבוקשת אינה קיימת</span>
Screenshot:
Website code in Inspector mode Chrome
lxml code in my soup
<span class="qtyValidate_color">הכמות המבוקשת אינה קיימת</span>
Screenshot 2:
Website code lxml in my soup
I tried reading Stack Overflow and replacing the requests.get method with urllib3 methods.
P.S. I'm doing a data analysis course now and created this tool for my girlfriend as part of my training.
When I inspect the page source, the span element appears the same as in your soup (and your code works for me as I would expect it to).
It may be that the element appears differently on different devices, or if logged in, for example.
You can include style in your search with attrs={'style': 'my_style'}; for example, on that URL:
example = soup.find('span', class_='show_sku', attrs={'style': 'display:none'})
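As a self-contained illustration of matching on the style attribute (using made-up spans modeled on the question; note the value must match the page's exact string, including spacing):

```python
from bs4 import BeautifulSoup

# Made-up sample: two spans that differ only in their style attribute.
html = """
<span class="qtyValidate_color" style="display: block;">out of stock</span>
<span class="qtyValidate_color" style="display: none;">out of stock</span>
"""

soup = BeautifulSoup(html, "html.parser")
# Match only the span that is actually visible on the page.
visible = soup.find("span", attrs={"class": "qtyValidate_color",
                                   "style": "display: block;"})
print(visible["style"])
```

Keep in mind that if the site sets the style via Javascript after page load (as appears to be the case here), the attribute will never be present in the raw HTML that requests downloads, no matter how you search the soup.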

Python BeautifulSoup Printing info from multiple tags div class

Trying to create a python script to collect info from a website.
Trying to work out how I can extract information from two or three div tags and print it.
example of html code
<div class="PowerDetails">
<p class="RunningCost">$4.44</p>
<p class="Time">peek</p>
<p class="RunningCost"> $2.33</p>
<p class="Time">Off-peek</p>
</div>
I have managed to get one of them by running a for loop, but I am trying to get RunningCost and Time side by side.
Python script (I'm new to this, so I'm playing around trying a few different things):
import bs4, requests, time

while True:
    url = "https://www.website.com"
    response = requests.get(url)
    soup = bs4.BeautifulSoup(response.text, 'html.parser')
    #soupTitle = soup.select('.RunningCost')
    soupDetail = soup.select('.Time')
    for soupDetailList in soupDetail:
        print(soupDetailList.text)
The end goal for this script is a web monitor that lists changes/updates.
zip should do the job.
from bs4 import BeautifulSoup
soup = BeautifulSoup("<html_text>" , "html.parser")
div = soup.find("div")
for r, t in zip(div.find_all("p", {"class": "RunningCost"}),
                div.find_all("p", {"class": "Time"})):
    print(r.string, t.string)
$4.44 peek
$2.33 Off-peek
Assuming that soup is the parsed HTML:
PowerDetails = soup.find("div")
RunningCost = PowerDetails.find_all("p", class_="RunningCost")
Time = PowerDetails.find_all("p", class_="Time")
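Either way, the cost/time pairs can be collected into records, which suits the change-monitoring goal. A sketch on a made-up copy of the HTML above:

```python
from bs4 import BeautifulSoup

# Made-up copy of the snippet from the question.
html = """
<div class="PowerDetails">
  <p class="RunningCost">$4.44</p>
  <p class="Time">peek</p>
  <p class="RunningCost">$2.33</p>
  <p class="Time">Off-peek</p>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
div = soup.find("div", class_="PowerDetails")
# Pair each RunningCost with the matching Time, in document order.
records = [
    {"cost": cost.get_text(strip=True), "time": time.get_text(strip=True)}
    for cost, time in zip(div.find_all("p", class_="RunningCost"),
                          div.find_all("p", class_="Time"))
]
print(records)
```

A monitor could then compare the records list between polls and report any difference.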

Scrape data-encoded-url from website with beautiful soup

I am trying to scrape the restaurant websites on www.tripadvisor.de
For example I took this one:
Restaurant, and on its page there is a link to the URL I want to scrape: http://leniliebtkaffee.de
The source code looks like this:
<a data-encoded-url="VUxRX2h0dHA6Ly9sZW5pbGllYnRrYWZmZWUuZGUvX3FLOQ==" class="_2wKz--mA _27M8V6YV"
target="_blank" href="http://leniliebtkaffee.de/"><span class="ui_icon laptop _3ZW3afUk"></span><span
class="_2saB_OSe">Website</span><span class="ui_icon external-link-no-box _2OpUzCuO"></span></a>
However, if I try to scrape this with the following python code:
import requests
from bs4 import BeautifulSoup
URL = 'https://www.tripadvisor.de/Restaurant_Review-g187367-d12632224-Reviews-Leni_Liebt_Kaffee-Aachen_North_Rhine_Westphalia.html'
page = requests.get(URL)
soup = BeautifulSoup(page.content, 'html.parser')
for website in soup.findAll('a', attrs={'class': '_2wKz--mA _27M8V6YV'}):
    print(website)
I get
<a class="_2wKz--mA _27M8V6YV" data-encoded-url="NVh0X2h0dHA6Ly9sZW5pbGllYnRrYWZmZWUuZGUvX1dDWg==" target="_blank"><span class="ui_icon laptop _3ZW3afUk"></span><span class="_2saB_OSe">Website</span><span class="ui_icon external-link-no-box _2OpUzCuO"></span></a>
Unfortunately, there is no href link in there. How can I get it?
There's a URL base64-encoded in data-encoded-url:
>>> import base64
>>> base64.b64decode(b"NVh0X2h0dHA6Ly9sZW5pbGllYnRrYWZmZWUuZGUvX1dDWg==")
b'5Xt_http://leniliebtkaffee.de/_WCZ'
As you can see, the URL seems to be padded with either nonsense or some kind of flags, so you'll want to strip that.
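Putting that together, a sketch of a decode helper. It assumes the decoded string always has the shape '<token>_<url>_<token>' with no underscores in the padding tokens; that is an inference from the two samples above, not a documented format:

```python
import base64

def decode_tripadvisor_url(encoded):
    """Decode a data-encoded-url value and strip the padding tokens.

    Assumes the form '<token>_<url>_<token>' observed in the samples;
    the real format is undocumented, so treat this as a best guess.
    """
    decoded = base64.b64decode(encoded).decode("utf-8")
    # Drop the first and last underscore-delimited tokens.
    return decoded.split("_", 1)[1].rsplit("_", 1)[0]

print(decode_tripadvisor_url("NVh0X2h0dHA6Ly9sZW5pbGllYnRrYWZmZWUuZGUvX1dDWg=="))
```

Note that rsplit is used for the trailing token so that any underscores inside the URL itself would survive.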

Scraping difficult table

I have been trying to scrape a table from here for quite some time but have been unsuccessful. The table I am trying to scrape is titled "Team Per Game Stats". I am confident that once I am able to scrape one element of that table that I can iterate through the columns I want from the list and eventually end up with a pandas data frame.
Here is my code so far:
from bs4 import BeautifulSoup
import requests
# url that we are scraping
r = requests.get('https://www.basketball-reference.com/leagues/NBA_2019.html')
# Let's look at what the request content looks like
print(r.content)
# use BeautifulSoup on content from the request
c = r.content
soup = BeautifulSoup(c, 'html.parser')
print(soup)
# using prettify() in Beautiful Soup indents the HTML like it should be in the web page
# This can make reading the HTML a little bit easier
print(soup.prettify())
# get elements within the 'main-content' tag
team_per_game = soup.find(id="all_team-stats-per_game")
print(team_per_game)
Any help would be greatly appreciated.
That webpage employs a trick to try to stop search engines and other automated web clients (including scrapers) from finding the table data: the tables are stored in HTML comments:
<div id="all_team-stats-per_game" class="table_wrapper setup_commented commented">
<div class="section_heading">
<span class="section_anchor" id="team-stats-per_game_link" data-label="Team Per Game Stats"></span><h2>Team Per Game Stats</h2> <div class="section_heading_text">
<ul> <li><small>* Playoff teams</small></li>
</ul>
</div>
</div>
<div class="placeholder"></div>
<!--
<div class="table_outer_container">
<div class="overthrow table_container" id="div_team-stats-per_game">
<table class="sortable stats_table" id="team-stats-per_game" data-cols-to-freeze=2><caption>Team Per Game Stats Table</caption>
...
</table>
</div>
</div>
-->
</div>
I note that the opening div has setup_commented and commented classes. Javascript code included in the page is executed by your browser; it loads the text from those comments and replaces the placeholder div with the contents as new HTML for the browser to display.
You can extract the comment text here:
from bs4 import BeautifulSoup, Comment
soup = BeautifulSoup(r.content, 'lxml')
placeholder = soup.select_one('#all_team-stats-per_game .placeholder')
comment = next(elem for elem in placeholder.next_siblings if isinstance(elem, Comment))
table_soup = BeautifulSoup(comment, 'lxml')
then continue to parse the table HTML.
This specific site has published both terms of use, and a page on data use you should probably read if you are going to use their data. Specifically, their terms state, under section 6. Site Content:
You may not frame, capture, harvest, or collect any part of the Site or Content without SRL's advance written consent.
Scraping the data would fall under that heading.
Just to complete Martijn Pieters's answer (and without lxml)
from bs4 import BeautifulSoup, Comment
import requests
r = requests.get('https://www.basketball-reference.com/leagues/NBA_2019.html')
soup = BeautifulSoup(r.content, 'html.parser')
placeholder = soup.select_one('#all_team-stats-per_game .placeholder')
comment = next(elem for elem in placeholder.next_siblings if isinstance(elem, Comment))
table = BeautifulSoup(comment, 'html.parser')
rows = table.find_all('tr')
for row in rows:
    cells = row.find_all('td')
    if cells:
        print([cell.text for cell in cells])
Partial output
[u'New Orleans Pelicans', u'71', u'240.0', u'43.6', u'91.7', u'.476', u'10.1', u'29.4', u'.344', u'33.5', u'62.4', u'.537', u'18.1', u'23.9', u'.760', u'11.0', u'36.0', u'47.0', u'27.0', u'7.5', u'5.5', u'14.5', u'21.4', u'115.5']
[u'Milwaukee Bucks*', u'69', u'241.1', u'43.3', u'90.8', u'.477', u'13.3', u'37.9', u'.351', u'30.0', u'52.9', u'.567', u'17.6', u'22.8', u'.773', u'9.3', u'40.1', u'49.4', u'26.0', u'7.4', u'6.0', u'14.0', u'19.8', u'117.6']
[u'Los Angeles Clippers', u'70', u'241.8', u'41.0', u'87.6', u'.469', u'9.8', u'25.2', u'.387', u'31.3', u'62.3', u'.502', u'22.8', u'28.8', u'.792', u'9.9', u'35.7', u'45.6', u'23.4', u'6.6', u'4.7', u'14.5', u'23.5', u'114.6']
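The comment-extraction trick can be demonstrated without a network request on a self-contained snippet (a made-up miniature of the page structure, not the real table):

```python
from bs4 import BeautifulSoup, Comment

# Made-up miniature of the basketball-reference structure: the real
# table is hidden inside an HTML comment next to a placeholder div.
html = """
<div id="all_team-stats-per_game">
  <div class="placeholder"></div>
  <!--
  <table id="team-stats-per_game">
    <tr><td>New Orleans Pelicans</td><td>115.5</td></tr>
    <tr><td>Milwaukee Bucks*</td><td>117.6</td></tr>
  </table>
  -->
</div>
"""

soup = BeautifulSoup(html, 'html.parser')
placeholder = soup.select_one('#all_team-stats-per_game .placeholder')
# The comment node is a sibling of the placeholder div.
comment = next(elem for elem in placeholder.next_siblings
               if isinstance(elem, Comment))
# Re-parse the comment text as HTML to get a normal table to walk.
table = BeautifulSoup(comment, 'html.parser')
rows = [[td.text for td in tr.find_all('td')] for tr in table.find_all('tr')]
print(rows)
```

From here the list of rows could be handed to pandas for the DataFrame the question asked about.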

Text missing after scraping a website using BeautifulSoup

I'm writing a Python script to get the number of pull requests generated by a particular user during the ongoing Hacktoberfest event.
Here's a link to the official website of hacktoberfest.
Here's my code:
import urllib.request
from bs4 import BeautifulSoup

url = 'https://hacktoberfest.digitalocean.com/stats/user'
req = urllib.request.Request(url, headers={'User-Agent': 'Mozilla/5.0'})
html = urllib.request.urlopen(req).read()
soup = BeautifulSoup(html, 'html.parser')
name_box = soup.find('div', attrs={'class': 'userstats--progress'})
print(name_box)
Where 'user' at the end of the URL should be replaced by the user's GitHub handle (e.g. BAJUKA).
Below is the HTML tag I'm aiming to scrape:
<div class="userstats--progress">
<p>
Progress (<span data-js="userPRCount">5</span>/5)
</p>
<div class="ProgressBar ProgressBar--three u-mb--regular ProgressBar--full" data-js="progressBar"></div>
</div>
This is what I get after I run my code:
<div class="userstats--progress">
<p>
Progress (<span data-js="userPRCount"></span>/5)
</p>
<div class="ProgressBar ProgressBar--three u-mb--regular" data-js="progressBar"></div>
</div>
The difference is on the third line, where the number of pull requests is missing (i.e. the 5 inside the span tag is missing).
These are the questions that I want to ask:
1. Why is the number of pull requests (i.e. 5 in this case) missing from the scraped lines?
2. How can I solve this issue and successfully get the number of pull requests?
The data you're looking for is not in the original data that the Hacktoberfest server sends and that Beautiful Soup parses; it's inserted into the HTML by the Javascript code that runs on that page in your browser after that original data is loaded.
If you use this shell command to download the data that's actually served as the page, you'll see that the span tag you're looking at starts off empty:
curl -s 'https://hacktoberfest.digitalocean.com/stats/BAJUKA' | grep -3 Progress
What's the javascript that fills that tag? Well, it's minified, so it's very hard to unpick what's going on. You can find it included in the very bottom of the original data, here:
curl -s 'https://hacktoberfest.digitalocean.com/stats/BAJUKA' | grep -3 "script src=" | tail -n5
which when I run it, outputs this:
<script src="https://go.digitalocean.com/js/forms2/js/forms2.min.js"></script>
<script src="/assets/application-134859a20456d7d32be9ea1bc32779e87cad0963355b5372df99a0cff784b7f0.js"></script>
That crazy looking source URL is a minified piece of Javascript, which means that it's been automatically shrunk, which also means that it's almost unreadable. But if you go to that page and scroll right down to the bottom, you can see some garbled Javascript which you can try to decode.
I noticed this bit:
var d="2018-09-30T10%3A00%3A00%2B00%3A00",f="2018-11-01T12%3A00%3A00%2B00%3A00";$.getJSON("https://api.github.com/search/issues?q=-label:invalid+created:"+d+".."+f+"+type:pr+is:public+author:"+t+"&per_page=300"
Which I think is where it gets the data to fill that DIV. If you load up and parse that URL, I think you'll find the data you need. You'll need to fill in the dates for that search, and the author. Good luck!
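Based on that snippet, here is a sketch of building the same GitHub search query yourself. The date range and handle are example values lifted from the minified code above, and the query shape is inferred from it rather than documented by Hacktoberfest:

```python
import urllib.parse

def hacktoberfest_pr_search_url(author,
                                start="2018-09-30T10:00:00+00:00",
                                end="2018-11-01T12:00:00+00:00"):
    """Build the GitHub search-API URL the page's Javascript appears to use."""
    # Percent-encode the timestamps exactly as the minified code does.
    d = urllib.parse.quote(start, safe="")
    f = urllib.parse.quote(end, safe="")
    return ("https://api.github.com/search/issues?q=-label:invalid"
            f"+created:{d}..{f}+type:pr+is:public+author:{author}&per_page=300")

url = hacktoberfest_pr_search_url("BAJUKA")
print(url)
# Fetching the URL (subject to GitHub's rate limits) would return JSON
# whose "total_count" field is the number of matching pull requests.
```

This sidesteps the scraping problem entirely: the span was only ever filled from this API, so you can query the API directly.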
