I'm still working on my MP3 downloader, but now I'm having trouble with the files being downloaded. I have two versions of the part that's tripping me up. The first gives me a proper file but causes an error. The second gives me a file that is way too small, but no error. I've tried opening the file in binary mode, but that didn't help. I'm pretty new to doing any work with HTML, so any help would be appreciated.
import urllib
import urllib2
def milk():
    SongList = []
    SongStrings = []
    SongNames = []
    earmilk = urllib.urlopen("http://www.earmilk.com/category/pop")
    reader = earmilk.read()
    #gets the position of the playlist
    PlaylistPos = reader.find("var newPlaylistTracks = ")
    #finds the number of songs in the playlist
    NumberSongs = reader[reader.find("var newPlaylistIds = "): PlaylistPos].count(",") + 1
    initPos = PlaylistPos
    #goes through the playlist and records the html address and name of the song
    for song in range(0, NumberSongs):
        songPos = reader[initPos:].find("http:") + initPos
        namePos = reader[songPos:].find("name") + songPos
        namePos += reader[namePos:].find(">")
        nameEndPos = reader[namePos:].find("<") + namePos
        SongStrings.append(reader[songPos: reader[songPos:].find('"') + songPos])
        SongNames.append(reader[namePos + 1: nameEndPos])
        initPos = nameEndPos
    for correction in range(0, NumberSongs):
        SongStrings[correction] = SongStrings[correction].replace('\\/', "/")
    #downloading songs
    fileName = ''.join([a.isalnum() and a or '_' for a in SongNames[0]])
    fileName = fileName.replace("_", " ") + ".mp3"
    # This version writes a file that can be played but gives an error saying: "TypeError: expected a character buffer object"
    ## songDL = open(fileName, "wb")
    ## songDL.write(urllib.urlretrieve(SongStrings[0], fileName))
    # This version creates the file but it cannot be played (file size is much smaller than it should be)
    ## url = urllib.urlretrieve(SongStrings[0], fileName)
    ## url = str(url)
    ## songDL = open(fileName, "wb")
    ## songDL.write(url)
    songDL.close()
    earmilk.close()
Re-read the documentation for urllib.urlretrieve:
Return a tuple (filename, headers) where filename is the local file
name under which the object can be found, and headers is whatever the
info() method of the object returned by urlopen() returned (for a
remote object, possibly cached).
You appear to be expecting it to return the bytes of the file itself. The point of urlretrieve is that it handles writing to a file for you, and returns the filename it was written to (which will generally be the same thing as your second argument to the function if you provided one).
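For example, a minimal sketch for your Python 2 code (reusing SongStrings[0] and fileName from the question) would let urlretrieve do all the writing:

local_path, headers = urllib.urlretrieve(SongStrings[0], fileName)
# local_path is the same as fileName here, since a target name was supplied;
# the MP3 bytes are already on disk at this point, so no open()/write() is needed
print local_path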
I admit that I am new to Python.
We have to process PDF files with attachments or annotated attachments. I am trying to extract attachments from a PDF file using PyPDF2 library.
The only (!) example found on GitHub contains the following code:
import PyPDF2
def getAttachments(reader):
    catalog = reader.trailer["/Root"]
    # VK
    print(catalog)
    #
    fileNames = catalog['/Names']['/EmbeddedFiles']['/Names']
And the call is:
rootdir = "C:/Users/***.pdf" # My file path
handler = open(rootdir, 'rb')
reader = PyPDF2.PdfFileReader(handler)
dictionary = getAttachments(reader)
I am getting a KeyError: '/EmbeddedFiles'
A print of the catalog indeed does not contain EmbeddedFiles:
{'/Extensions': {'/ADBE': {'/BaseVersion': '/1.7', '/ExtensionLevel': 3}}, '/Metadata': IndirectObject(2, 0), '/Names': IndirectObject(5, 0), '/OpenAction': IndirectObject(6, 0), '/PageLayout': '/OneColumn', '/Pages': IndirectObject(3, 0), '/PieceInfo': IndirectObject(7, 0), '/Type': '/Catalog'}
This particular PDF contains 9 attachments. How can I get them?
This is too long for a comment, and I have not personally tested this code, which looks very similar to your outline in the question, but I am adding it here for others to test. It is the subject of a pull request, https://github.com/mstamy2/PyPDF2/pull/440, and the full updated sequence is described by Kevin M Loeffler in https://kevinmloeffler.com/2018/07/08/how-to-extract-pdf-file-attachments-using-python-and-pypdf2/
Viewable at https://gist.github.com/kevinl95/29a9e18d474eb6e23372074deff2df38#file-extract_pdf_attachments-py
Download as
https://gist.github.com/kevinl95/29a9e18d474eb6e23372074deff2df38/raw/acdc194058f9fa2c4d2619a4c623d0efeec32555/extract_pdf_attachments.py
It always helps if you can provide an example input of the type you have problems with so that others can adapt the extraction routine to suit.
In response to getting an error
"I’m guessing the script is breaking because the embedded files section of the PDF doesn’t always exist so trying to access it throws an error."
"Something I would try is to put everything after the ‘catalog’ line in the get_attachments method in a try-catch."
Unfortunately there are many pending pull requests that have not been merged into PyPDF2 (https://github.com/mstamy2/PyPDF2/pulls), and others may also be relevant or needed to address this and other shortcomings, so you may need to check whether any of those help.
For one pending example of a try/catch that you might be able to include and adapt for other use cases, see https://github.com/mstamy2/PyPDF2/pull/551/commits/9d52ef517319b538f007669631ba6b778f8ec3a3
Associated keywords for embedded files, apart from /Type /EmbeddedFiles, include /Type /Filespec and /Subtype /FileAttachment (note the pairs may not always contain spaces), so perhaps see whether those can be interrogated for the attachments.
On that last point, note that the example searches for /EmbeddedFiles, the plural form used in the name-tree index, whereas each individual entry is identified in the singular.
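To make the try/except idea concrete, here is an untested sketch that reuses the getAttachments outline from the question (the assumption that the /Names array alternates file name and filespec entries comes from the general PDF name-tree layout, not from your specific file):

import PyPDF2

def getAttachments(reader):
    attachments = {}
    catalog = reader.trailer["/Root"]
    try:
        fileNames = catalog['/Names']['/EmbeddedFiles']['/Names']
    except KeyError:
        # no /EmbeddedFiles name tree; attachments may instead be stored as
        # /FileAttachment annotations on individual pages
        return attachments
    # the /Names array alternates file name, filespec reference
    for i in range(0, len(fileNames), 2):
        name = fileNames[i]
        fDict = fileNames[i + 1].getObject()
        attachments[name] = fDict['/EF']['/F'].getData()
    return attachments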
This can be improved but it was tested to work (using PyMuPDF).
It detects corrupted PDF files, encryption, attachments, annotations and portfolios.
I have yet to compare the output with our internal classification.
It produces a semicolon-separated file that can be imported into Excel.
import fitz  # = PyMuPDF
import os

outfile = open("C:/Users/me/Downloads/testPDF3.txt", "w", encoding="utf-8")
folder = "C:/Users/me/Downloads"
print("filepath;", "encrypted;", "pages;", "embedded;", "attachments;", "annotations;", "portfolio", file=outfile)
enc = pages = count = names = annots = collection = ''
for subdir, dirs, files in os.walk(folder):
    for file in files:
        #print(os.path.join(subdir, file))
        filepath = subdir + os.sep + file
        if filepath.endswith(".pdf"):
            #print(filepath, file=outfile)
            try:
                doc = fitz.open(filepath)
                enc = doc.is_encrypted
                #print("Encrypted? ", enc, file=outfile)
                pages = doc.page_count
                #print("Number of pages: ", pages, file=outfile)
                count = doc.embfile_count()
                #print("Number of embedded files:", count, file=outfile)  # shows number of embedded files
                names = doc.embfile_names()
                #print("Embedded files:", str(names), file=outfile)
                #if count > 0:
                #    for emb in names:
                #        print(doc.embfile_info(emb), file=outfile)
                annots = doc.has_annots()
                #print("Has annots?", annots, file=outfile)
                links = doc.has_links()
                #print("Has links?", links, file=outfile)
                trailer = doc.pdf_trailer()
                #print("Trailer: ", trailer, file=outfile)
                xreflen = doc.xref_length()  # length of objects table
                for xref in range(1, xreflen):  # skip item 0!
                    #print("", file=outfile)
                    #print("object %i (stream: %s)" % (xref, doc.is_stream(xref)), file=outfile)
                    #print(doc.xref_object(xref, compressed=False), file=outfile)
                    if "Collection" in doc.xref_object(xref, compressed=False):
                        #print("Portfolio", file=outfile)
                        collection = 'True'
                        break
                    else:
                        collection = "False"
                    #print(doc.xref_object(xref, compressed=False), file=outfile)
            except Exception:
                #print("Not a valid PDF", file=outfile)
                enc = pages = count = names = annots = collection = "Not a valid PDF"
            print(filepath, ";", enc, ";", pages, ";", count, ";", names, ";", annots, ";", collection, file=outfile)
outfile.close()
I was also running into the same problem with several pdfs that I have. I was able to make these changes to the referenced code that got it to work for me:
import PyPDF2

def getAttachments(reader):
    """
    Retrieves the file attachments of the PDF as a dictionary of file names
    and the file data as a bytestring.
    :return: dictionary of filenames and bytestrings
    """
    attachments = {}
    # First, get those that are pdf attachments
    catalog = reader.trailer["/Root"]
    if "/EmbeddedFiles" in catalog["/Names"]:
        fileNames = catalog['/Names']['/EmbeddedFiles']['/Names']
        for f in fileNames:
            if isinstance(f, str):
                name = f
                dataIndex = fileNames.index(f) + 1
                fDict = fileNames[dataIndex].getObject()
                fData = fDict['/EF']['/F'].getData()
                attachments[name] = fData
    # Next, go through all pages and all annotations to those pages
    # to find any attached files
    for pagenum in range(0, reader.getNumPages()):
        page_object = reader.getPage(pagenum)
        if "/Annots" in page_object:
            for annot in page_object['/Annots']:
                annotobj = annot.getObject()
                if annotobj['/Subtype'] == '/FileAttachment':
                    fileobj = annotobj["/FS"]
                    attachments[fileobj["/F"]] = fileobj["/EF"]["/F"].getData()
    return attachments

handler = open(filename, 'rb')
reader = PyPDF2.PdfFileReader(handler)
dictionary = getAttachments(reader)
for fName, fData in dictionary.items():
    with open(fName, 'wb') as outfile:
        outfile.write(fData)
I know it is a late reply, but I only started looking into this yesterday. I have used the PyMuPDF library to extract the embedded files. Here is my code:
import os
import fitz  # PyMuPDF

def get_embedded_pdfs(input_pdf_path, output_path=None):
    input_path = "/".join(input_pdf_path.split('/')[:-1])
    if not output_path:
        output_path = input_pdf_path.split(".")[0] + "_embeded_files/"
    # create the output folder if it does not exist yet
    if not os.path.exists(output_path):
        os.mkdir(output_path)
    doc = fitz.open(input_pdf_path)
    item_name_dict = {}
    for each_item in doc.embfile_names():
        item_name_dict[each_item] = doc.embfile_info(each_item)["filename"]
    for item_name, file_name in item_name_dict.items():
        out_pdf = output_path + file_name
        ## get embedded file as bytes (embfile_get is the current name of the
        ## older embeddedFileGet method)
        fData = doc.embfile_get(item_name)
        ## save embedded file
        with open(out_pdf, 'wb') as outfile:
            outfile.write(fData)
disclaimer: I am the author of borb (the library used in this answer)
borb is an open-source, pure Python PDF library. It abstracts away most of the unpleasantness of dealing with PDF (such as having to deal with dictionaries and having to know PDF-syntax and structure).
There is a huge repository of examples, containing a section on dealing with embedded files, which you can find here.
I'll repeat the relevant example here for completeness:
import typing
from borb.pdf.document.document import Document
from borb.pdf.pdf import PDF

def main():
    # read the Document
    doc: typing.Optional[Document] = None
    with open("output.pdf", "rb") as pdf_file_handle:
        doc = PDF.loads(pdf_file_handle)

    # check whether we have read a Document
    assert doc is not None

    # retrieve all embedded files and their bytes
    for k, v in doc.get_embedded_files().items():
        # display the file name, and the size
        print("%s, %d bytes" % (k, len(v)))

if __name__ == "__main__":
    main()
After the Document has been read, you can simply ask it for a dict mapping the file names to their bytes.
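If you also want to write those attachments to disk, a small follow-up sketch (using only the get_embedded_files() call shown above; the output names are whatever the PDF declares) could look like this:

import typing
from borb.pdf.document.document import Document
from borb.pdf.pdf import PDF

with open("output.pdf", "rb") as pdf_file_handle:
    doc: typing.Optional[Document] = PDF.loads(pdf_file_handle)
assert doc is not None

for file_name, file_bytes in doc.get_embedded_files().items():
    # save each embedded file next to the script, under its declared name
    with open(file_name, "wb") as out_handle:
        out_handle.write(file_bytes)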
I'm fairly new to using Python. I have been trying to set up a very basic web scraper to help speed up my workday; it is supposed to download images from a section of a website and save them.
I have a list of urls and I am trying to use urllib.request.urlretrieve to download all the images.
The output location (savepath) updates so it adds 1 to the current highest number in the folder.
I've tried a bunch of different ways but urlretrieve only saves the image from the last url in the list. Is there a way to download all the images in the url list?
to_download=['url1','url2','url3','url4']
for t in to_download:
    urllib.request.urlretrieve(t, savepath)
This is the code I was trying to use to update the savepath every time
def getNextFilePath(photos):
    highest_num = 0
    for f in os.listdir(photos):
        if os.path.isfile(os.path.join(photos, f)):
            file_name = os.path.splitext(f)[0]
            try:
                file_num = int(file_name)
                if file_num > highest_num:
                    highest_num = file_num
            except ValueError:
                print('The file name "%s" is not an integer. Skipping' % file_name)
    output_file = os.path.join(output_folder, str(highest_num + 1))
    return output_file
As suggested by #vks, you need to update savepath (otherwise you save each url onto the same file). One way to do so is to use enumerate:
from urllib import request
to_download=['https://edition.cnn.com/','https://edition.cnn.com/','https://edition.cnn.com/','https://edition.cnn.com/']
for i, url in enumerate(to_download):
    save_path = f'website_{i}.txt'
    print(save_path)
    request.urlretrieve(url, save_path)
which you may want to contract into:
from urllib import request
to_download=['https://edition.cnn.com/','https://edition.cnn.com/','https://edition.cnn.com/','https://edition.cnn.com/']
[request.urlretrieve(url, f'website_{i}.txt') for i, url in enumerate(to_download)]
see:
Python3 doc: Python enumerate doc
Example of enumerate: enumerate example
Example of f' using a string with a {variable}': f string example
FOR SECOND PART OF THE QUESTION:
Not sure what you are trying to achieve but:
def getNextFilePath(photos):
    file_list = os.listdir(photos)
    file_list = [int(s) for s in file_list if s.isdigit()]
    print(file_list)
    max_id_file = max(file_list)
    print(f'max id:{max_id_file}')
    output_file = os.path.join(output_folder, str(max_id_file + 1))
    print(f'output file path:{output_file}')
    return output_file
This will hopefully find all files that are named with digits (IDs), find the highest ID, and return a new file name as max_id + 1.
I guess that this will replace the save_path in your example.
Quickly coding, and modifying the above function so that it returns the max_id and not the path, the code below is a working example using the iterator:
import os
from urllib import request

photo_folder = os.path.curdir

def getNextFilePath(photos):
    file_list = os.listdir(photos)
    print(file_list)
    file_list = [int(os.path.splitext(s)[0]) for s in file_list if os.path.splitext(s)[0].isdigit()]
    if not file_list:
        return 0
    print(file_list)
    max_id_file = max(file_list)
    #print(f'max id:{max_id_file}')
    #output_file = os.path.join(photo_folder, str(max_id_file + 1))
    #print(f'output file path:{output_file}')
    return max_id_file

def download_pic(to_download):
    start_id = getNextFilePath(photo_folder)
    for i, url in enumerate(to_download):
        save_path = f'{i + start_id}.png'
        output_file = os.path.join(photo_folder, save_path)
        print(output_file)
        request.urlretrieve(url, output_file)
You should add exception handling etc., but this seems to be working, if I understood correctly.
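For instance, the download loop could be hardened with a simple try/except (a sketch only, reusing getNextFilePath and photo_folder from the code above):

import os
from urllib import error, request

def download_pic(to_download):
    start_id = getNextFilePath(photo_folder)
    for i, url in enumerate(to_download):
        output_file = os.path.join(photo_folder, f'{i + start_id}.png')
        try:
            request.urlretrieve(url, output_file)
        except error.URLError as exc:
            # skip URLs that fail instead of aborting the whole run
            print(f'could not download {url}: {exc}')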
Are you updating savepath? If you pass the same savepath to each loop iteration, it is likely just overwriting the same file over and over.
Hope that helps, happy coding!
I'm trying to extract files from emails via IMAP using Python 3.7 (on Windows, FYI), and each of my attempts produces extracted files whose modification and creation dates are the time of extraction (which is incorrect).
As full email applications have the ability to preserve that information, it must be stored somewhere. I also gave working with structs a try, thinking the information may be stored in binary, but had no luck.
import email
from email.header import decode_header
import imaplib
import os

SERVER = None
OUT_DIR = '/var/out'
IMP_SRV = 'mail.domain.tld'
IMP_USR = 'user#domain.tld'
IMP_PWD = 'hunter2'

def login_mail():
    global SERVER
    SERVER = imaplib.IMAP4_SSL(IMP_SRV)
    SERVER.login(IMP_USR, IMP_PWD)

def get_mail(folder='INBOX'):
    mails = []
    _, data = SERVER.uid('SEARCH', 'ALL')
    uids = data[0].split()
    for uid in uids:
        _, s = SERVER.uid('FETCH', uid, '(RFC822)')
        mail = email.message_from_bytes(s[0][1])
        mails.append(mail)
    return mails

def parse_attachments(mail):
    for part in mail.walk():
        if part.get_content_type() == 'application/octet-stream':
            filename = get_filename(part)
            output = os.path.join(OUT_DIR, filename)
            with open(output, 'wb') as f:
                f.write(part.get_payload(decode=True))

def get_filename(part):
    filename = part.get_filename()
    binary = part.get_payload(decode=True)
    if decode_header(filename)[0][1] is not None:
        filename = decode_header(filename)[0][0].decode(decode_header(filename)[0][1])
    filename = os.path.basename(filename)
    return filename
Can anyone tell me what I'm doing wrong and if it's somehow possible?
After getting said information it could be possible to modify the timestamps utilizing How do I change the file creation date of a Windows file?.
I was able to extract the creation-date and modification-date from the content-disposition header. Setting the file modified date is simple too.
attachment_creation_date = attachment.get_param('creation-date', None, 'content-disposition')
attachment_modification_date = attachment.get_param('modification-date', None, 'content-disposition')
Here's a more complete example that shows how to read these parameters if present:
# imports assumed by this snippet
import email.utils
import os

def process_email_attachments(msg, output_directory):
    for attachment in msg.iter_attachments():
        try:
            output_filename = attachment.get_filename()
        except AttributeError:
            print("Couldn't get attachment filename. Skipping.")
            continue
        # If no attachments are found, skip this file
        if output_filename:
            attachment_creation_date = attachment.get_param('creation-date', None, 'content-disposition')
            attachment_modification_date = attachment.get_param('modification-date', None, 'content-disposition')
            try:
                output_file_full_path = os.path.join(output_directory, output_filename)
                with open(output_file_full_path, "wb") as of:
                    payload = attachment.get_payload(decode=True)
                    of.write(payload)
                if attachment_modification_date is not None:
                    attachment_modification_datetime = email.utils.parsedate_to_datetime(attachment_modification_date)
                    set_file_last_modified(output_file_full_path, attachment_modification_datetime)
            except TypeError:
                print("Couldn't get payload for %s" % output_filename)

def set_file_last_modified(file_path, dt):
    dt_epoch = dt.timestamp()
    os.utime(file_path, (dt_epoch, dt_epoch))
The second part of your question is how to set the file created date. This is platform dependent. There is already a separate question with answers demonstrating how to set the creation date on a Windows file: How do I change the file creation date of a Windows file?
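For completeness, a sketch of that Windows-only step using pywin32 (untested here, and following the approach in the linked question; attachment_creation_date would be parsed with email.utils.parsedate_to_datetime just like the modification date):

import pywintypes
import win32con
import win32file

def set_file_created(file_path, dt):
    # open the file for writing and overwrite only its creation timestamp
    handle = win32file.CreateFile(
        file_path,
        win32con.GENERIC_WRITE,
        win32con.FILE_SHARE_READ | win32con.FILE_SHARE_WRITE,
        None,
        win32con.OPEN_EXISTING,
        win32con.FILE_ATTRIBUTE_NORMAL,
        None)
    win32file.SetFileTime(handle, pywintypes.Time(dt), None, None)
    handle.Close()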
I'm going to create a zip file from some of the image files stored on my server.
I've used the following function to do this:
def create_zip_file(user, examination):
    from lms.models import StudentAnswer
    f = BytesIO()
    zip = zipfile.ZipFile(f, 'w')
    this_student_answer = StudentAnswer.objects.filter(student_id=user.id, exam=examination)
    for answer in this_student_answer:
        if answer.answer_file:
            answer_file_full_path = answer.answer_file.path
            fdir, fname = os.path.split(answer_file_full_path)
            zip.writestr(fname, answer_file_full_path)
    zip.close()  # Close
    zip_file_name = "student-answers_" + str(examination.id) + "_" + str(user.id) + "_" + datetime.datetime.now().strftime("%Y-%m-%d-%H-%M") + '.zip'
    response = HttpResponse(f.getvalue(), content_type="application/x-zip-compressed")
    response['Content-Disposition'] = 'attachment; filename=%s' % zip_file_name
    return response
Everything seems fine and all the photos end up in the zip file, but there is one problem.
The problem is that the photos won't open, and this error appears in Windows:
It looks like we don't support this file format.
What is wrong with my code?
To add the data from a file you have to use
write(filename)
Using writestr you only store the string you pass to it (here, the file path itself), not the contents of the file at that path.
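Concretely, the loop in your view could be changed to something like the following (a sketch based on your own function; arcname just controls the name stored inside the archive):

for answer in this_student_answer:
    if answer.answer_file:
        answer_file_full_path = answer.answer_file.path
        fdir, fname = os.path.split(answer_file_full_path)
        # write() copies the file found at this path into the archive;
        # writestr() would store the path string itself as the file's contents
        zip.write(answer_file_full_path, arcname=fname)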
When I use QFTP's put command to upload a file it only uploads around 40 bytes of the specified file. I'm catching the dataProgress signal and I'm getting the progress but the total size of the file is only read to be around 40 bytes. Is there anything wrong with my code, or is it a problem on the FTP server's side?
Here is my upload function:
def upload(self):
    filename = QFileDialog.getOpenFileName(self, 'Upload File', '.')
    fname = QIODevice(filename[0])
    dataname = filename[0]
    data = os.path.basename(dataname)
    #data = data[data.find("/") + 1:]
    print data
    print fname
    if not self.fileTree.currentItem():
        self.qftp.put(fname, data)
    elif "." in self.fileTree.currentItem().text(0):
        self.qftp.put(fname, self.fileTree.currentItem().parent().text(0) + data)
    elif self.fileTree.currentItem().text(0) == "/":
        self.qftp.put(fname, data)
    else:
        return
Alright, figured out what I needed to do. I needed to create a QFile and read all of the bytes from that file and then pass that to the put command.
def upload(self):
    filename = QFileDialog.getOpenFileName(self, 'Upload File', '.')
    data = QFile(filename[0])
    data.open(1)
    qdata = QByteArray(data.readAll())
    file = os.path.basename(filename[0])
    print data
    if not self.fileTree.currentItem():
        self.qftp.put(qdata, file, self.qftp.TransferType())
    elif "." in self.fileTree.currentItem().text(0):
        self.qftp.put(qdata, self.fileTree.currentItem().parent().text(0) + file)
    elif self.fileTree.currentItem().text(0) == "/":
        self.qftp.put(qdata, file)
    else:
        return
I'm guessing that data = os.path.basename(dataname) means data is always a string containing the name of the file. Try changing this to an open file object by using data = open(os.path.basename(dataname), 'rb')
Edit:
Looking at PySide.QtNetwork.QFtp.put(data, file[, type=Binary]) and PySide.QtNetwork.QFtp.put(dev, file[, type=Binary]), the order of arguments is data/dev first and then file, so it's the wrong way around in your code...
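For example, with the dev overload the upload could be sketched like this (illustrative only; QFile and QIODevice come from QtCore, and the device must stay open until the FTP command finishes):

source = QFile(filename[0])
source.open(QIODevice.ReadOnly)
# device (or data) first, then the remote file name
self.qftp.put(source, os.path.basename(filename[0]))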