Python downloading PDF into a .zip - python

What I am trying to do is loop through a list of URL to download a series of .pdfs, and save them to a .zip. At the moment I am just trying to test code using just one URL. The ERROR I am getting is:
Traceback (most recent call last):
File "I:\test_pdf_download_zip.py", line 36, in <module>
zip_file(zipfile_name, url)
File "I:\test_pdf_download_zip.py", line 30, in zip_file
myzip.write(dowload_pdf(url))
TypeError: expected a string or other character buffer object
Would someone know how to pass .pdf request to the .zip correctly (avoiding the error above) in order for me to append it, or know if it is possible to do this?
import os
import zipfile
import requests
output = r"I:"
# File name of the zipfile
zipfile_name = os.path.join(output, "test.zip")
# Random test pdf
url = r"http://www.pdf995.com/samples/pdf.pdf"
def create_zipfile(zipfile_name):
zipfile.ZipFile(zipfile_name, "w")
def dowload_pdf(url):
response = requests.get(url, stream=True)
with open('test.pdf', 'wb') as f:
f.write(response.content)
def zip_file(zip_name, url):
with open(zip_name,'a') as myzip:
myzip.write(dowload_pdf(url))
if __name__ == "__main__":
create_zipfile(zipfile_name)
zip_file(zipfile_name, url)
print("Done")

Your download_pdf() function is saving a file but it doesn't return anything. You need to modify it so it actually returns the file path to myzip.write(). You don't want to hardcode test.pdf but pass unique paths to your download function so you don't end up with multiple test.pdf in your archive.
def dowload_pdf(url, path):
response = requests.get(url, stream=True)
with open(path, 'wb') as f:
f.write(response.content)
return path

Related

python download folder of text files

The goal is to download GTFS data through python web scraping, starting with https://transitfeeds.com/p/agence-metropolitaine-de-transport/129/latest/download
Currently, I'm using requests like so:
def download(url):
fpath = "prov/city/GTFS"
r = requests.get(url)
if r.ok:
print("Saving file.")
open(fpath, "wb").write(r.content)
else:
print("Download failed.")
The results of requests.content of the above url unfortunately renders the following:
You can see the files of interest within the output (e.g. stops.txt) but how might I access them to read/write?
I fear you're trying to read a zip file with a text editor, perhaps you should try using the "zipfile" module.
The following worked:
def download(url):
fpath = "path/to/output/"
f = requests.get(url, stream = True, headers = headers)
if f.ok:
print("Saving to {}".format(fpath))
g=open(fpath+'output.zip','wb')
g.write(f.content)
g.close()
else:
print("Download failed with error code: ", f.status_code)
You need to write this file into a zip.
import requests
url = "https://transitfeeds.com/p/agence-metropolitaine-de-transport/129/latest/download"
fname = "gtfs.zip"
r = requests.get(url)
open(fname, "wb").write(r.content)
Now fname exists and has several text files inside. If you want to programmatically extract this zip and then read the content of a file, for example stops.txt, then you need first to extract a single file, or simply extractall.
import zipfile
# this will extract only a single file, and
# raise a KeyError if the file is missing from the archive
zipfile.ZipFile(fname).extract("stops.txt")
# this will extract all the files found from the archive,
# overwriting files in the process
zipfile.ZipFile(fname).extractall()
Now you just need to work with your file(s).
thefile = "stops.txt"
# just plain text
text = open(thefile).read()
# csv file
import csv
reader = csv.reader(open(thefile))
for row in reader:
...

Download file from URL and save it in a folder Python

I've a lot of URL with file types .docx and .pdf I want to run a python script that downloads them from the URL and saves it in a folder. Here is what I've done for a single file I'll add them to a for loop:
response = requests.get('http://wbesite.com/Motivation-Letter.docx')
with open("my_file.docx", 'wb') as f:
f.write(response.content)
but the my_file.docx that it is saving is only 266 bytes and is corrupt but the URL is fine.
UPDATE:
Added this code and it works but I want to save it in a new folder.
import os
import shutil
import requests
def download_file(url, folder_name):
local_filename = url.split('/')[-1]
path = os.path.join("/{}/{}".format(folder_name, local_filename))
with requests.get(url, stream=True) as r:
with open(path, 'wb') as f:
shutil.copyfileobj(r.raw, f)
return local_filename
Try using stream option:
import os
import requests
def download(url: str, dest_folder: str):
if not os.path.exists(dest_folder):
os.makedirs(dest_folder) # create folder if it does not exist
filename = url.split('/')[-1].replace(" ", "_") # be careful with file names
file_path = os.path.join(dest_folder, filename)
r = requests.get(url, stream=True)
if r.ok:
print("saving to", os.path.abspath(file_path))
with open(file_path, 'wb') as f:
for chunk in r.iter_content(chunk_size=1024 * 8):
if chunk:
f.write(chunk)
f.flush()
os.fsync(f.fileno())
else: # HTTP status code 4XX/5XX
print("Download failed: status code {}\n{}".format(r.status_code, r.text))
download("http://website.com/Motivation-Letter.docx", dest_folder="mydir")
Note that mydir in example above is the name of folder in current working directory. If mydir does not exist script will create it in current working directory and save file in it. Your user must have permissions to create directories and files in current working directory.
You can pass an absolute file path in dest_folder, but check permissions first.
P.S.: avoid asking multiple questions in one post
try:
import urllib.request
urllib.request.urlretrieve(url, filename)

"File Does Not Exist" when dynamically creating files for PDF download with Requests in Python 2.7

I'm trying to dynamically download pdf's from a web site. I am sure I'm listing them correctly but I am not sure I'm doing the actual file I/O correctly. I get the following error
File "download.py", line 22, in <module>
with open("'"+url+"'", "wb") as pdf:
IOError: [Errno 2] No such file or directory: "'http://www.lcs.mit.edu/publications/pubs/pdf/MIT-LCS-TR-179.pdf'"
Here is my code:
import requests
import re
from bs4 import BeautifulSoup
origin = requests.get("http://freehaven.net/anonbib")
soup=BeautifulSoup(origin.text)
results = soup.find_all(href=re.compile("(http).*(pdf)"))
for link in results:
url = (link.get('href'))
r = requests.get(url)
with open("'"+url+"'", "wb") as pdf:
try:
pdf.write(r.content)
finally:
pdf.close
If url is set to 'http://www.lcs.mit.edu/publications/pubs/pdf/MIT-LCS-TR-179.pdf', your code fails because it is trying to open a file with that name on your filesystem.
Instead, try something like this:
fileForUrl = '/tmp/' + url.split('/')[-1]
with open(fileForUrl, 'wb') as pdf:
# Rest of the code as before

How to make a dictionary in Python with files as keys?

I am trying to POST all files in a folder on my local drive to a certain web URL by using Requests and Glob. Every time I POST a new file to the URL, I want to add to a dictionary a new "key-value" item that is "name of the file (key), output from server after POSTing the file (value)":
import requests, glob, unicodedata
outputs = {}
text_files = glob.iglob("/Users/ME/Documents/folder/folder/*.csv")
url = 'http://myWebsite.com/extension/extension/extension'
for data in text_files:
file2 = {'file': open(data)}
r = requests.post(url, files=file2)
outputs[file2] = r.text
This gives me the error:
Traceback (most recent call last):
File "/Users/ME/Documents/folder/folder/myProgram.py", line 15, in <module>
outputs[file2] = r.text
TypeError: unhashable type: 'dict'
This is because (I think) "file2" if of type 'dict'. Is there anyway to cast/alter 'file2' after I POST it to just be a string of the file name?
You are trying to use the file object, no the file name. Use data as the key:
for data in text_files:
file2 = {'file': open(data)}
r = requests.post(url, files=file2)
outputs[data] = r.text
better yet, use a more meaningful name, and use with to have the open file object closed again for you:
for filename in text_files:
with open(filename) as fileobj:
files = {'file': fileobj}
r = requests.post(url, files=files)
outputs[filename] = r.text

Using urlopen to open list of urls

I have a python script that fetches a webpage and mirrors it. It works fine for one specific page, but I can't get it to work for more than one. I assumed I could put multiple URLs into a list and then feed that to the function, but I get this error:
Traceback (most recent call last):
File "autowget.py", line 46, in <module>
getUrl()
File "autowget.py", line 43, in getUrl
response = urllib.request.urlopen(url)
File "/usr/lib/python3.2/urllib/request.py", line 139, in urlopen
return opener.open(url, data, timeout)
File "/usr/lib/python3.2/urllib/request.py", line 361, in open
req.timeout = timeout
AttributeError: 'tuple' object has no attribute 'timeout'
Here's the offending code:
url = ['https://www.example.org/', 'https://www.foo.com/', 'http://bar.com']
def getUrl(*url):
response = urllib.request.urlopen(url)
with urllib.request.urlopen(url) as response, open(file_name, 'wb') as out_file:
shutil.copyfileobj(response, out_file)
getUrl()
I've exhausted Google trying to find how to open a list with urlopen(). I found one way that sort of works. It takes a .txt document and goes through it line-by-line, feeding each line as a URL, but I'm writing this using Python 3 and for whatever reason twillcommandloop won't import. Plus, that method is unwieldy and requires (supposedly) unnecessary work.
Anyway, any help would be greatly appreciated.
In your code there are some errors:
You define getUrls with variable arguments list (the tuple in your error);
You manage getUrls arguments as a single variable (list instead)
You can try with this code
import urllib2
import shutil
urls = ['https://www.example.org/', 'https://www.foo.com/', 'http://bar.com']
def getUrl(urls):
for url in urls:
#Only a file_name based on url string
file_name = url.replace('https://', '').replace('.', '_').replace('/','_')
response = urllib2.urlopen(url)
with open(file_name, 'wb') as out_file:
shutil.copyfileobj(response, out_file)
getUrl(urls)
It do not support tuple:
urllib.request.urlopen(url[, data][, timeout])
Open the URL url, which can be either a string or a Request object.
And your calling is incorrect. It should be:
getUrl(url[0],url[1],url[2])
And inside the function, use a loop like "for u in url" to travel all urls.
You should just iterate over your URLs using a for loop:
import shutil
import urllib.request
urls = ['https://www.example.org/', 'https://www.foo.com/']
file_name = 'foo.txt'
def fetch_urls(urls):
for i, url in enumerate(urls):
file_name = "page-%s.html" % i
response = urllib.request.urlopen(url)
with open(file_name, 'wb') as out_file:
shutil.copyfileobj(response, out_file)
fetch_urls(urls)
I assume you want the content saved to separate files, so I used enumerate here to create a uniqe file name, but you can obviously use anything from hash(), the uuid module to creating slugs.

Categories

Resources