Downloading PDF's using Python webscraping not working

Here is my code:
import os
import requests
from urllib.parse import urljoin
from bs4 import BeautifulSoup
url = "https://mathsmadeeasy.co.uk/gcse-maths-revision/"
# If there is no such folder, the script will create one automatically
folder_location = r'E:\webscraping'
if not os.path.exists(folder_location):
    os.mkdir(folder_location)
response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")
for link in soup.select("a[href$='.pdf']"):
    # Name the PDF files using the last portion of each link, which is unique in this case
    filename = os.path.join(folder_location, link['href'].split('/')[-1])
    with open(filename, 'wb') as f:
        f.write(requests.get(urljoin(url, link['href'])).content)
Can anyone help me understand why this code does not download any of the files from the maths revision site?
Thanks.

Although the page looks static, it isn't. The content you are trying to access is loaded by JavaScript after the initial HTML arrives. To check this, I simply saved the page that requests actually received and opened it in a text editor:
with open(os.path.join(folder_location, "page.html"), 'wb') as f:
    f.write(response.content)
By the look of it, the page replaces placeholders using JS, as hinted by the comment on line 70 of the HTML file: // interpolate json by replacing placeholders with variables
As for a solution: BS4 only parses the HTML it is given and does not execute JavaScript. I suggest looking at this answer from someone who had a similar problem. I also suggest looking into Scrapy if you intend to do more complex web scraping.
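The check described above (does the static HTML contain any PDF links at all?) can also be scripted. The following is only a minimal sketch using the standard library's html.parser instead of BS4; the class name, function name, and sample markup are hypothetical and not taken from the site in question.

```python
from html.parser import HTMLParser


class PdfLinkCollector(HTMLParser):
    """Collects href values ending in .pdf from <a> tags in static HTML."""

    def __init__(self):
        super().__init__()
        self.pdf_links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        # An <a> tag may have no href, or an href with no value
        href = dict(attrs).get("href") or ""
        if href.lower().endswith(".pdf"):
            self.pdf_links.append(href)


def find_pdf_links(html_text):
    """Return the .pdf links present in the given HTML string."""
    collector = PdfLinkCollector()
    collector.feed(html_text)
    return collector.pdf_links


# Hypothetical markup: one page with a real link, one with only a script placeholder
static_page = '<a href="/papers/algebra.pdf">Algebra</a>'
js_page = '<div id="papers"></div><script>// links injected later</script>'

print(find_pdf_links(static_page))  # ['/papers/algebra.pdf']
print(find_pdf_links(js_page))      # []
```

If the list comes back empty for the HTML that requests received, the links are most likely being added by JavaScript in the browser, and a parser alone will not see them.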

Related

Issue downloading multiple PDFs

After running the following code, I am unable to open the downloaded PDF's. Even though the code ran successfully, the downloaded PDF files are damaged.
My computer's error message is
Unable to open file. it may be damaged or in a format Preview doesn't recognize.
Why are they damaged and how do I solve this?
import os
import requests
from urllib.parse import urljoin
from bs4 import BeautifulSoup
url = "https://github.com/sonhuytran/MIT8.01SC.2010F/tree/master/References/University%20Physics%20with%20Modern%20Physics%2C%2013th%20Edition%20Solutions%20Manual"
# If there is no such folder, the script will create one automatically
folder_location = r'/Users/rahelmizrahi/Desktop/ Physics_Solutions'
if not os.path.exists(folder_location):
    os.mkdir(folder_location)
response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")
for link in soup.select("a[href$='.pdf']"):
    filename = os.path.join(folder_location, link['href'].split('/')[-1])
    with open(filename, 'wb') as f:
        f.write(requests.get(urljoin(url, link['href'])).content)

The issue is that you are requesting the GitHub 'blob' link when you need the 'raw' link. You are requesting:
'/sonhuytran/MIT8.01SC.2010F/blob/master/References/University%20Physics%20with%20Modern%20Physics%2C%2013th%20Edition%20Solutions%20Manual/A01_YOUN6656_09_ISM_FM.pdf'
but you want:
'/sonhuytran/MIT8.01SC.2010F/raw/master/References/University%20Physics%20with%20Modern%20Physics%2C%2013th%20Edition%20Solutions%20Manual/A01_YOUN6656_09_ISM_FM.pdf'
So just adjust that. Full code below:
import os
import requests
from urllib.parse import urljoin
from bs4 import BeautifulSoup
url = "https://github.com/sonhuytran/MIT8.01SC.2010F/tree/master/References/University%20Physics%20with%20Modern%20Physics%2C%2013th%20Edition%20Solutions%20Manual"
# If there is no such folder, the script will create one automatically
folder_location = r'/Users/rahelmizrahi/Desktop/Physics_Solutions'
if not os.path.exists(folder_location):
    os.mkdir(folder_location)
response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")
for link in soup.select("a[href$='.pdf']"):
    # Swap the 'blob' page link for the 'raw' file link
    pdf_link = link['href'].replace('blob', 'raw')
    pdf_file = requests.get('https://github.com' + pdf_link)
    filename = os.path.join(folder_location, link['href'].split('/')[-1])
    with open(filename, 'wb') as f:
        f.write(pdf_file.content)

I had to use soup.select("a[href$=.pdf]") (without the inner quotes) to get it to select the links correctly.
After that, your script works, but: what you're downloading is not a PDF, but an HTML webpage! Try visiting one of the URLs: https://github.com/sonhuytran/MIT8.01SC.2010F/blob/master/References/University%20Physics%20with%20Modern%20Physics%2C%2013th%20Edition%20Solutions%20Manual/A01_YOUN6656_09_ISM_FM.pdf
You'll be presented with a GitHub webpage, not the actual PDF. To get that, you need the "raw" GitHub URL, which you can see when you hover over the Download button: https://github.com/sonhuytran/MIT8.01SC.2010F/raw/master/References/University%20Physics%20with%20Modern%20Physics%2C%2013th%20Edition%20Solutions%20Manual/A01_YOUN6656_09_ISM_FM.pdf
So, it looks like you just have to replace blob with raw at the proper spot to make it work:
href = link['href']
href = href.replace('/blob/', '/raw/')
pdf_bytes = requests.get(urljoin(url, href)).content
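The blob-to-raw substitution can be wrapped in a small helper, together with a cheap sanity check on the downloaded bytes. This is a hedged sketch: the function names and the example repository path are made up for illustration, and the check relies on the fact that PDF files normally start with the bytes %PDF.

```python
from urllib.parse import urljoin


def github_raw_url(page_url, href):
    """Turn a GitHub 'blob' link into the matching 'raw' download URL."""
    # Replace only the first '/blob/' path segment so that file names
    # which happen to contain the word 'blob' are left untouched.
    raw_href = href.replace("/blob/", "/raw/", 1)
    return urljoin(page_url, raw_href)


def looks_like_pdf(data):
    """Return True if the bytes start with the usual PDF header."""
    return data.startswith(b"%PDF")


# Hypothetical repository layout, for illustration only
page = "https://github.com/user/repo/tree/master/docs"
link = "/user/repo/blob/master/docs/blob_notes.pdf"
print(github_raw_url(page, link))
# https://github.com/user/repo/raw/master/docs/blob_notes.pdf

print(looks_like_pdf(b"%PDF-1.4 ..."))     # True
print(looks_like_pdf(b"<!DOCTYPE html>"))  # False
```

Calling looks_like_pdf on the response content before writing the file would have flagged the original problem, since the blob URL returns an HTML page.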

The issue is that the file is not properly closed after the open/write.
Just add f.close() at the end of the code to do that.

Automate download all links (of PDFs) inside multiple pdf files

I'm trying to download journal issues from a website (http://cis-ca.org/islamscience1.php). I ran a script to get all the PDFs on this page. However, these PDFs contain links that point to other PDFs.
I want to get the final articles from all of those PDF links.
Here is how I got all the PDFs from the page http://cis-ca.org/islamscience1.php:
import os
import requests
from urllib.parse import urljoin
from bs4 import BeautifulSoup
url = "http://cis-ca.org/islamscience1.php"
# If there is no such folder, the script will create one automatically
folder_location = r'webscraping'
if not os.path.exists(folder_location):
    os.mkdir(folder_location)
response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")
for link in soup.select("a[href$='.pdf']"):
    # Name the PDF files using the last portion of each link, which is unique in this case
    filename = os.path.join(folder_location, link['href'].split('/')[-1])
    with open(filename, 'wb') as f:
        f.write(requests.get(urljoin(url, link['href'])).content)
I'd like to get the articles linked inside these PDF's.
Thanks in advance

https://mamclain.com/?page=Blog_Programing_Python_Removing_PDF_Hyperlinks_With_Python
Take a look at this link. It shows how to identify hyperlinks and sanitize a PDF document. You could follow it up to the identification step and then store each hyperlink instead of removing it.
Alternatively, take a look at this library: https://github.com/metachris/pdfx
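To illustrate the "identify the hyperlinks" step without any third-party library, here is a rough sketch that scans raw PDF bytes for /URI entries. It is not how pdfx or the linked article works, and it rests on a strong assumption: it only finds links stored in uncompressed objects, and it does not handle escaped parentheses inside a URI. The function name and sample bytes are hypothetical.

```python
import re

# Matches link actions of the form: /URI (http://example.org/a.pdf)
URI_PATTERN = re.compile(rb"/URI\s*\((.*?)\)")


def extract_uris(pdf_bytes):
    """Return URI strings found in uncompressed link annotations."""
    return [match.decode("latin-1") for match in URI_PATTERN.findall(pdf_bytes)]


# Hypothetical fragment of a PDF link annotation
sample = b"<< /Type /Annot /Subtype /Link /A << /S /URI /URI (http://example.org/a.pdf) >> >>"
print(extract_uris(sample))  # ['http://example.org/a.pdf']
```

For PDFs whose objects are compressed, this scan returns nothing, and a real PDF parser is the more reliable route.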

Scraping data from multiple places on same HTML with Beautiful Soup (Python)

I'm having problems scraping data from certain URLs with Beautiful Soup.
I've successfully written the part where my code opens a text file with a list of URLs and goes through them.
The first problem I encounter is when I want to go through two separate places on the HTML page.
With the code I've written so far, it only goes through the first "class" and doesn't search for and scrape the other one that I defined.
The second issue is that I can only get data if I run my script in a terminal with:
python mitel.py > mitel.txt
The output I get is not what I want. I'm only looking for two strings from it, but I cannot find a way to extract them.
Finally, I can't get my results written to CSV.
I only get the last string of the last URL from the URL list in my CSV.
Can you assist a TOTAL beginner in Python?
Here's my script:
import urllib2
from bs4 import BeautifulSoup
import csv
import os
import itertools
import sys
reload(sys)
sys.setdefaultencoding('utf-8')
with open('urllist.txt') as inf:
    urls = (line.strip() for line in inf)
    for url in urls:
        site = urllib2.urlopen(url)
        soup = BeautifulSoup(site.read(), 'html.parser')
        for target in soup.findAll(True, {"class":["tel-number", "tel-result main"]}):
            finalt = target.text.strip()
            print finalt
with open('output_file.csv', 'wb') as f:
    writer = csv.writer(f)
    writer.writerows(finalt)
For some reason I cannot successfully paste the targeted HTML code, so I'll just put a link here to one of the pages. If it's needed, I'll try to paste it somehow, although it's very big and complex.
Targeted URL for scraping
Thank you so much in advance!

Well, I managed to get some results with the help of @furas and Google.
With this code, I can get all the "a" elements from the page, and then in MS Excel I was able to get rid of everything that wasn't a name or a phone number.
Sorting and the rest of the work is also done in Excel... I guess I am too much of a newbie to accomplish everything in one script.
import urllib2
from bs4 import BeautifulSoup
import csv
import os
import itertools
import requests
import sys
reload(sys)
sys.setdefaultencoding('utf-8')
finalt = []
proxy = urllib2.ProxyHandler({'http': 'http://163.158.216.152:80'})
auth = urllib2.HTTPBasicAuthHandler()
opener = urllib2.build_opener(proxy, auth, urllib2.HTTPHandler)
urllib2.install_opener(opener)
with open('mater.txt') as inf:
    urls = (line.strip() for line in inf)
    for url in urls:
        site = urllib2.urlopen(url)
        soup = BeautifulSoup(site.read(), 'html.parser')
        for target in soup.findAll('a'):
            finalt.append( target.text.strip() )
print finalt
with open('imena1-50.csv', 'wb') as f:
    writer = csv.writer(f)
    for i in finalt:
        writer.writerow([i])
It also uses a proxy... sort of. I didn't manage to make it read proxies from a .txt list.
Not bad for a first Python scraping attempt, but far from efficient and not quite the way I imagined it.

Maybe your selector is wrong. Try this:
for target in soup.findAll(True, {"class":["tel-number",'tel-result-main']}):
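On the CSV part of the question: the original script reassigns finalt on every iteration, so only the last string survives until the file is written. A minimal sketch of the collect-then-write pattern follows, in Python 3 rather than the Python 2 used in the question. The function name and the sample values are hypothetical, and an in-memory buffer stands in for a real file.

```python
import csv
import io


def write_single_column(rows, out_file):
    """Write each string in rows as its own one-cell CSV row."""
    writer = csv.writer(out_file)
    for text in rows:
        # Wrap the string in a list; passing a bare string to writerow
        # would split it into one cell per character.
        writer.writerow([text])


# Collect every scraped value in a list instead of overwriting one variable
results = []
for scraped in ["  Alice  ", "555-0100 "]:  # hypothetical scraped values
    results.append(scraped.strip())

buffer = io.StringIO()
write_single_column(results, buffer)
print(buffer.getvalue())
```

With a real file in Python 3, the usual form is open('output_file.csv', 'w', newline='') so that the csv module controls the line endings.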

Automating filename generation from url text

I am parsing some content from the web and then saving it to a file. So far I manually create the filename.
Here's my code:
import requests
url = "http://www.amazon.com/The-Google-Way-Revolutionizing-Management/dp/1593271840"
html = requests.get(url).text.encode('utf-8')
with open("html_output_test.html", "wb") as file:
file.write(html)
How could I automate the process of creating and saving the following html filename from the url:
The-Google-Way-Revolutionizing-Management (instead of html_output_test)?
This name comes from the original bookstore url that I posted and that probably was modified to avoid product adv.
Thanks!

You can use BeautifulSoup to get the title text from the page. I would let requests handle the encoding with .content:
url = "http://rads.stackoverflow.com/amzn/click/1593271840"
html = requests.get(url).content
from bs4 import BeautifulSoup
print(BeautifulSoup(html).title.text)
with open("{}.html".format(BeautifulSoup(html).title.text), "wb") as file:
    file.write(html)
The Google Way: How One Company is Revolutionizing Management As We Know It: Bernard Girard: 9781593271848: Amazon.com: Books
For that particular page if you just want The Google Way: How One Company is Revolutionizing Management As We Know It the product title is in the class a-size-large:
text = BeautifulSoup(html).find("span",attrs={"class":"a-size-large"}).text
with open("{}.html".format(text), "wb") as file:
    file.write(html)
The link with The-Google-Way-Revolutionizing-Management is in the link tag:
link = BeautifulSoup(html).find("link",attrs={"rel":"canonical"})
print(link["href"])
http://www.amazon.com/The-Google-Way-Revolutionizing-Management/dp/1593271840
So to get that part you need to parse it:
print(link["href"].split("/")[3])
The-Google-Way-Revolutionizing-Management
link = BeautifulSoup(html).find("link",attrs={"rel":"canonical"})
with open("{}.html".format(link["href"].split("/")[3]), "wb") as file:
    file.write(html)

You could parse the web page using Beautiful Soup, get the <title> of the page, then slugify it and use it as the file name. Alternatively, generate a random filename with something like os.tmpfile.
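Both ideas (taking the name from the URL path, or slugifying a title) can be sketched with the standard library alone. The helper names below are hypothetical, and the slugify shown here is a deliberately simple version that keeps only ASCII letters and digits.

```python
import re
from urllib.parse import urlparse


def name_from_url(url, index=0):
    """Return one path segment of the URL (the first by default)."""
    segments = [s for s in urlparse(url).path.split("/") if s]
    return segments[index]


def slugify(text):
    """Reduce arbitrary text to letters, digits and hyphens."""
    text = re.sub(r"[^A-Za-z0-9]+", "-", text)
    return text.strip("-")


url = "http://www.amazon.com/The-Google-Way-Revolutionizing-Management/dp/1593271840"
print(name_from_url(url) + ".html")
# The-Google-Way-Revolutionizing-Management.html

title = "The Google Way: How One Company is Revolutionizing Management"
print(slugify(title) + ".html")
# The-Google-Way-How-One-Company-is-Revolutionizing-Management.html
```

Slugifying matters mostly for titles, since characters such as the colon are not allowed in file names on some systems.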

Inherent way to save web page source

I have read a lot of answers about web scraping that mention BeautifulSoup, Scrapy, etc.
Is there a way to do the equivalent of saving a page's source from a web browser?
That is, is there a way in Python to point it at a website and get it to save the page's source to a text file with just the standard Python modules?
Here is where I got to:
import urllib
f = open('webpage.txt', 'w')
html = urllib.urlopen("http://www.somewebpage.com")
#somehow save the web page source
f.close()
Not much I know - but looking for code to actually pull the source of the page so I can write it. I gather that urlopen just makes a connection.
Perhaps there is a readlines() equivalent for reading lines of a web page?

You may try urllib2:
import urllib2
page = urllib2.urlopen('http://stackoverflow.com')
page_content = page.read()
with open('page_content.html', 'w') as fid:
    fid.write(page_content)

Updated code for Python 3 (where urllib2 was replaced by urllib.request):
from urllib.request import urlopen
html = urlopen("http://www.google.com/")
with open('page_content.html', 'w') as fid:
    fid.write(html)

Answer from SoHei will not work because it's missing html.read() and the file must be opened with 'wb' parameter instead of just a 'w'. The 'b' indicates that data will be written in binary mode (since .read() returns sequence of bytes).
The fully working code is:
from urllib.request import urlopen
html = urlopen("http://www.google.com/")
page_content = html.read()
with open('page_content.html', 'wb') as fid:
    fid.write(page_content)
