Python - Scraping a PDF file from a URL

I want to scrape PDF files from this site:
https://www.sigmaths.net/Reader.php?var=manuels/ph/physique_pilote_7b.pdf
I tried the code below, but it doesn't work. Can anybody tell me why, please?
import requests
res = requests.get('https://www.sigmaths.net/Reader.php?var=manuels/ph/physique_7b.pdf')
with open('C:\\Users\\sioud\\Desktop\\Manuels scolaires TN\\1\\test.pdf', 'wb') as f:
    f.write(res.content)

Your URL points to a reader page (https://www.sigmaths.net/Reader.php?var=manuels/ph/physique_7b.pdf), not to the file itself. Remove 'Reader.php?var=' to get the actual PDF:
import requests
res = requests.get('https://www.sigmaths.net/manuels/ph/physique_7b.pdf')
with open('test.pdf', 'wb') as f:
    f.write(res.content)
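The URL rewrite described in this answer can be sketched with the standard library. This is a minimal sketch, assuming the reader page always carries the file's path, relative to the reader page itself, in its var query parameter, as in the URL from the question. The helper name direct_pdf_url is hypothetical.

```python
from urllib.parse import parse_qs, urljoin, urlsplit


def direct_pdf_url(reader_url, param="var"):
    """Turn a reader-page URL into the URL of the file it displays.

    Assumption: the query parameter `param` holds the file's path relative
    to the directory of the reader page, as in the question's URL.
    """
    values = parse_qs(urlsplit(reader_url).query).get(param)
    if not values:
        # No such parameter: assume the URL already points at the file.
        return reader_url
    # Resolve the relative path against the reader URL's location.
    return urljoin(reader_url, values[0])


reader = "https://www.sigmaths.net/Reader.php?var=manuels/ph/physique_7b.pdf"
print(direct_pdf_url(reader))
# https://www.sigmaths.net/manuels/ph/physique_7b.pdf
```

The returned URL can then be passed to requests.get as in the answer.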

You can also use urlretrieve:
from urllib.request import urlretrieve
pdfurl = "https://www.sigmaths.net/manuels/ph/physique_7b.pdf"
urlretrieve(pdfurl, "test.pdf")
The PDF is then saved under the name test.pdf in the current working directory.

Related

Saving an image from a URL that does not end with image extension

I'm a Python beginner. I have a dataset column that contains thousands of URLs, and I want to save the image behind each URL with its extension. URLs that end with the image extension, such as https://web.archive.org/web/20170628093753im_/http://politicot.com/wp-content/uploads/2016/12/Sean-Spicer.jpg, are no problem (with urllib or requests).
However, I failed to save the images from URLs such as link1 = https://nypost.com/wp-content/uploads/sites/2/2017/11/171106-texas-shooter-church-index.jpg?quality=90&strip=all&w=1200 or link2 = https://i2.wp.com/www.huzlers.com/wp-content/uploads/2017/03/maxresdefault.jpeg?fit=1280%2C720&ssl=1.
I want to save the images from these links as image1.jpg and image2.jpeg. How can I do this?
Any help would be appreciated.

The following seems to work for me; give it a try:
import requests
urls = ['https://nypost.com/wp-content/uploads/sites/2/2017/11/171106-texas-shooter-church-index.jpg?quality=90&strip=all&w=1200',
        'https://i2.wp.com/www.huzlers.com/wp-content/uploads/2017/03/maxresdefault.jpeg?fit=1280%2C720&ssl=1']
for i, url in enumerate(urls):
    r = requests.get(url)
    filename = 'image{0}.jpg'.format(i+1)
    with open(filename, 'wb') as f:
        f.write(r.content)
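The snippet above saves every file as .jpg. To keep each image's own extension, as the question asks (image1.jpg, image2.jpeg), one option is to take the extension from the path part of the URL and ignore the query string. A minimal sketch follows; the helper image_filename and the .jpg fallback are assumptions, not part of the answer.

```python
import posixpath
from urllib.parse import urlsplit


def image_filename(index, url, default_ext=".jpg"):
    """Build 'image<index><ext>' using the extension found in the URL path."""
    # urlsplit separates the path from the '?quality=90&...' query string.
    path = urlsplit(url).path
    ext = posixpath.splitext(path)[1].lower()
    # Fall back to default_ext when the path carries no extension.
    return "image{}{}".format(index, ext or default_ext)


urls = [
    "https://nypost.com/wp-content/uploads/sites/2/2017/11/171106-texas-shooter-church-index.jpg?quality=90&strip=all&w=1200",
    "https://i2.wp.com/www.huzlers.com/wp-content/uploads/2017/03/maxresdefault.jpeg?fit=1280%2C720&ssl=1",
]
for i, url in enumerate(urls, start=1):
    print(image_filename(i, url))
# image1.jpg
# image2.jpeg
```

In the download loop, image_filename(i + 1, url) would replace the hard-coded 'image{0}.jpg' name.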

Download Images to Local Folder from Urls in CSV file in Python

I wrote a program that reads image URLs from a CSV file and downloads the images to a local folder, but it fails with the error below:
"TypeError: cannot use a string pattern on a bytes-like object"
Please check the code below:
import pandas as pd
import urllib.request
def url_to_jpg(i, url, File_Path):
    filename = 'image_{}.jpg'.format(i)
    full_path = '{}{}'.format(File_Path, filename)
    urllib.request.urlretrieve(url, full_path)
    print('{} saved.'.format(filename))
    return None
FileName = "C:/Users/IT City/Desktop/Kwiat-USA/KavantaCSV.csv"
File_Path = "C:/Users/IT City/Desktop/Kwiat-USA/images"
urls = pd.read_csv(FileName)
for i, url in enumerate(urls.values):
    url_to_jpg(i, url, File_Path)
Any help would be highly appreciated.
Thank you.

Maybe you have to decode it?
Or you can simply use the requests library:
import requests
# url and filename are the values from the question's loop
r = requests.get(url)
with open(filename, "wb") as f:
    f.write(r.content)

The url passed to urlretrieve is an array, not a string: urls.values is a 2D array, so each item in the loop is a whole row (an array of cell values) rather than a single URL.
It always helps to post the stack trace as well.
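One way to avoid passing a whole row to urlretrieve is to read the URL column into a flat list of strings first. The sketch below uses the standard csv module instead of pandas, which is a swap; with pandas, selecting a single column before looping (for example urls.iloc[:, 0]) gives the same kind of flat sequence. It assumes the URLs sit in the first column under a header row. The helpers read_urls and image_path are hypothetical; image_path joins folder and file name with os.path.join, since the question's File_Path has no trailing separator.

```python
import csv
import io
import os


def read_urls(csv_file, column=0, has_header=True):
    """Return the URLs of one CSV column as a flat list of strings."""
    reader = csv.reader(csv_file)
    if has_header:
        next(reader, None)  # skip the header row
    return [row[column].strip() for row in reader
            if len(row) > column and row[column].strip()]


def image_path(folder, index):
    """Join folder and file name with a proper path separator."""
    return os.path.join(folder, "image_{}.jpg".format(index))


# In-memory stand-in for the CSV file from the question.
sample = io.StringIO("url\nhttps://example.com/a.jpg\nhttps://example.com/b.jpg\n")
for i, url in enumerate(read_urls(sample)):
    # Each url is now a plain str, which urllib.request.urlretrieve accepts.
    print(i, url, image_path("images", i))
```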

Don't understand this PdfReadError: EOF marker not found

I am downloading multiple PDFs. I have a list of urls and the code is written to download them and also create one big pdf with them all in. The code works for the first 144 pdfs then it throws this error:
PdfReadError: EOF marker not found
I've tried making all the pdfs end in %%EOF but that doesn't work - it still reaches the same point then I get the error again.
Here's my code:
# my file, converted to a list so Python can read each link separately
with open('minutelinks.txt', 'r') as file:
    data = file.read()
links = data.split()
# download pdfs
from PyPDF2 import PdfFileMerger
import requests
urls = links
merger = PdfFileMerger()
for url in urls:
    response = requests.get(url)
    title = url.split("/")[-1]
    with open(title, 'wb') as f:
        f.write(response.content)
    merger.append(title)
merger.write("allminues.pdf")
merger.close()
I want to be able to download all of them and create one big pdf - which it appears to do until it throws this error. I have about 750 pdfs and it only gets to 144.

This is how I changed my code so that it now downloads all of the PDFs and skips any that may be corrupted. I also had to add the self argument to the function.
from PyPDF2 import PdfFileMerger
from PyPDF2.utils import PdfReadError  # PyPDF2 1.x; later releases keep it in PyPDF2.errors
import requests
urls = links
def download_pdfs(self):
    merger = PdfFileMerger()
    for url in urls:
        response = requests.get(url)
        title = url.split("/")[-1]
        with open(title, 'wb') as f:
            f.write(response.content)
        try:
            # appending is the step that parses the file and can raise PdfReadError
            merger.append(title)
        except PdfReadError:
            # report the unreadable file and carry on with the next one
            print(title)
    merger.write("allminues.pdf")
    merger.close()
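A corrupted or truncated download can also be caught before it reaches the merger. The sketch below is a cheap pre-check under stated assumptions, not PyPDF2's own logic: it only tests whether the bytes start with the %PDF- header and contain %%EOF near the end. The helper has_pdf_markers and the 1024-byte tail window are hypothetical choices.

```python
def has_pdf_markers(data, tail_window=1024):
    """Cheap sanity check on downloaded bytes before merging.

    Returns True only if the data starts with the PDF header and has an
    %%EOF marker within the last `tail_window` bytes.
    """
    starts_like_pdf = data.lstrip().startswith(b"%PDF-")
    has_eof = b"%%EOF" in data[-tail_window:]
    return starts_like_pdf and has_eof


complete = b"%PDF-1.4\n...objects...\nstartxref\n0\n%%EOF\n"
truncated = b"%PDF-1.4\n...objects...\nstartx"
error_page = b"<html><body>404 Not Found</body></html>"

print(has_pdf_markers(complete))    # True
print(has_pdf_markers(truncated))   # False
print(has_pdf_markers(error_page))  # False
```

In the download loop, a file would be written and appended only when has_pdf_markers(response.content) is true; the remaining URLs can be printed for a later retry.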

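The end-of-file marker '%%EOF' is meant to be the very last line of the file. It tells the PDF parser that the document ends here.
My solution is to force this marker to be at the end:
import PyPDF4
def reset_eof(self, pdf_file):
    with open(pdf_file, 'rb') as p:
        txt = p.readlines()
    # search backwards for the last line that contains the marker
    for i, x in enumerate(txt[::-1]):
        if b'%%EOF' in x:
            actual_line = len(txt) - i - 1
            break
    else:
        # no marker anywhere in the file, so there is nothing to reset
        raise ValueError('no %%EOF marker found in ' + pdf_file)
    # drop that line and everything after it, then end with a clean marker
    txtx = txt[:actual_line] + [b'%%EOF']
    with open(pdf_file, 'wb') as f:
        f.writelines(txtx)
    return PyPDF4.PdfFileReader(pdf_file)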
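The same idea can be tried on the raw bytes, without splitting the file into lines. This is a minimal sketch, assuming the file is otherwise intact and only carries stray bytes after its last %%EOF. The helper trim_after_eof is hypothetical, and it returns None rather than guessing when no marker exists.

```python
def trim_after_eof(data):
    """Cut everything that follows the last %%EOF marker.

    Returns the trimmed bytes ending in the marker plus a newline, or None
    if the data contains no marker at all.
    """
    marker = b"%%EOF"
    pos = data.rfind(marker)
    if pos == -1:
        return None
    return data[:pos + len(marker)] + b"\n"


dirty = b"%PDF-1.4\n...objects...\n%%EOF\n<html>trailing junk</html>"
print(trim_after_eof(dirty))
# b'%PDF-1.4\n...objects...\n%%EOF\n'
print(trim_after_eof(b"no marker here"))
# None
```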

I read that EOF is a kind of tag included in PDF files (link in Portuguese).
However, I guess some PDF files do not have the EOF marker, and PyPDF2 does not recognize those.
So, to fix "PdfReadError: EOF marker not found", I opened my PDF in Google Chrome and printed it to PDF once more, so that Chrome rewrites the file, hopefully with the EOF marker.
I ran my script on the new PDF produced by Chrome and it worked fine.

unable to download pdf from this particular url using python

I tried downloading a .pdf file with the following code, but I can't open the downloaded file; the viewer reports a PDF error. I also tried urllib2 and requests, but neither helped. Please help me resolve this.
import urllib
import os
pdf_link = "https://www.indeed.com/resumes/account/login?dest=%2Fr%2F23c59475ad19d393/pdf"
pdf_file = "sample.pdf"
response = urllib.urlopen(pdf_link)
file = open(pdf_file, 'wb')
file.write(response.read())
file.close()
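One way to narrow this down is to look at what was actually saved. The sketch below classifies a payload by its first bytes. It is only a diagnostic aid, under the assumption that a server answering with a web page (for example a login form, which the account/login part of the URL hints at) sends HTML instead of PDF data. The helper sniff_payload is hypothetical.

```python
def sniff_payload(data):
    """Guess what a downloaded payload is from its leading bytes."""
    head = data[:1024].lstrip().lower()
    if head.startswith(b"%pdf-"):
        return "pdf"
    if head.startswith((b"<!doctype html", b"<html")):
        return "html"
    return "unknown"


print(sniff_payload(b"%PDF-1.7\n%..."))                       # pdf
print(sniff_payload(b"<!DOCTYPE html><html>Sign in</html>"))  # html
print(sniff_payload(b"\x89PNG\r\n"))                          # unknown
```

Reading sample.pdf in binary mode and passing its contents to sniff_payload would show whether the file holds PDF data at all.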

Download csv file through python (url)

I am working on a project and want to download a CSV file from a URL. I did some research on the site, but none of the solutions presented worked for me.
The URL directly offers to download or open the file, so I do not know how to tell Python to save the file (it would be nice if I could also rename it).
When I open the URL with this code, nothing happens.
import urllib
url='https://data.toulouse-metropole.fr/api/records/1.0/download/?dataset=dechets-menagers-et-assimiles-collectes'
testfile = urllib.request.urlopen(url)
Any ideas?

Try this. Change "folder" to a folder on your machine:
import os
import requests
url='https://data.toulouse-metropole.fr/api/records/1.0/download/?dataset=dechets-menagers-et-assimiles-collectes'
response = requests.get(url)
with open(os.path.join("folder", "file"), 'wb') as f:
    f.write(response.content)

You can adapt an example from the docs:
import urllib.request
url='https://data.toulouse-metropole.fr/api/records/1.0/download/?dataset=dechets-menagers-et-assimiles-collectes'
# decode() assumes UTF-8; write with the same encoding and without newline translation
with urllib.request.urlopen(url) as testfile, open('dataset.csv', 'w', encoding='utf-8', newline='') as f:
    f.write(testfile.read().decode())
