Python: Crossplatform code to download a valid .zip file - python

I have a requirement to download and unzip a file from a website. Here is the code I'm using:
#!/usr/bin/python
#geoipFolder = r'/my/folder/path/ ' #Mac/Linux folder path
geoipFolder = r'D:\my\folder\path\ ' #Windows folder path
geoipFolder = geoipFolder[:-1] #workaround for Windows escaping trailing quote
geoipName = 'GeoIPCountryWhois'
geoipURL = 'http://geolite.maxmind.com/download/geoip/database/GeoIPCountryCSV.zip'
import urllib2
response = urllib2.urlopen(geoipURL)
f = open('%s.zip' % (geoipFolder+geoipName),"w")
f.write(repr(response.read()))
f.close()
import zipfile
zip = zipfile.ZipFile(r'%s.zip' % (geoipFolder+geoipName))
zip.extractall(r'%s' % geoipFolder)
This code works on Mac and Linux boxes, but not on Windows. There, the .zip file is written, but the script throws this error:
zipfile.BadZipfile: File is not a zip file
I can't unzip the file using Windows Explorer either. It says that:
The compressed (zipped) folder is empty.
However the file on disk is 6MB large.
Thoughts on what I'm doing wrong on Windows?
Thanks

Your zipfile is corrupt on windows because you're opening the file in write/text mode (line-terminator conversion trashes binary data):
f = open('%s.zip' % (geoipFolder+geoipName),"w")
You have to open in write/binary mode like this:
f = open('%s.zip' % (geoipFolder+geoipName),"wb")
(will still work on Linux of course)
To sum it up, a more pythonic way of doing it, using a with block (and remove repr):
with open('{}{}.zip'.format(geoipFolder,geoipName),"wb") as f:
f.write(response.read())
EDIT: no need to write a file to disk, you can use io.BytesIO, since the ZipFile object accepts a file handle as first parameter.
import io
import zipfile
with open('{}{}.zip'.format(geoipFolder,geoipName),"wb") as f:
outbuf = io.BytesIO(f.read())
zip = zipfile.ZipFile(outbuf) # pass the fake-file handle: no disk write, no temp file
zip.extractall(r'%s' % geoipFolder)

Related

How to speed up zip file extraction in Python

I'm trying to extract data from a zip file in Python, but it's kind of slow. Could anyone advise me and see if I'm doing something that obviously makes it slower?
def go_through_zip(zipname):
out = {}
with ZipFile(zipname) as z:
for filename in z.namelist():
with z.open(filename) as f:
try:
outdict = make_dict(f)
out.update(outdict)
except:
print("File is not in the correct format")
return out
make_dict(f) just takes the file path and makes a dictionary, and this function is probably also slow, but that's not what I want to speed up right now.
Try using the following code for file extraction. it works fast as long as the size of the file being extracted is reasonable.
# importing required modules
from zipfile import ZipFile
# specifying the zip file name
file_name = "my_python_files.zip"
# opening the zip file in READ mode
with ZipFile(file_name, 'r') as zip:
# printing all the contents of the zip file
zip.printdir()
# extracting all the files
print('Extracting all the files now...')
zip.extractall()
print('Done!')
```

Creating view in browser functionality with python

I have been struggling with this problem for a while but can't seem to find a solution for it. The situation is that I need to open a file in browser and after the user closes the file the file is removed from their machine. All I have is the binary data for that file. If it matters, the binary data comes from Google Storage using the download_as_string method.
After doing some research I found that the tempfile module would suit my needs, but I can't get the tempfile to open in browser because the file only exists in memory and not on the disk. Any suggestions on how to solve this?
This is my code so far:
import tempfile
import webbrowser
# grabbing binary data earlier on
temp = tempfile.NamedTemporaryFile()
temp.name = "example.pdf"
temp.write(binary_data_obj)
temp.close()
webbrowser.open('file://' + os.path.realpath(temp.name))
When this is run, my computer gives me an error that says that the file cannot be opened since it is empty. I am on a Mac and am using Chrome if that is relevant.
You could try using a temporary directory instead:
import os
import tempfile
import webbrowser
# I used an existing pdf I had laying around as sample data
with open('c.pdf', 'rb') as fh:
data = fh.read()
# Gives a temporary directory you have write permissions to.
# The directory and files within will be deleted when the with context exits.
with tempfile.TemporaryDirectory() as temp_dir:
temp_file_path = os.path.join(temp_dir, 'example.pdf')
# write a normal file within the temp directory
with open(temp_file_path, 'wb+') as fh:
fh.write(data)
webbrowser.open('file://' + temp_file_path)
This worked for me on Mac OS.

Downloading a file with a wildcard from Sharepoint URL with Python

I'm not sure if this can be done or not but I thought I'd ask!
I'm running a windows 10 PC using Python 2.7. I'm wanting to download a file form sharepoint to a folder on my C: drive.
OpenPath = "https://office.test.com/sites/Rollers/New improved files/"
OpenFile = "ABC UK.xlsb"
The downside is the file is uploaded externally & due to human error it can be saved as a .xlsx or ABC_UK. Therefor I only want to use the first 3 characters with a wildcard (ABC*) to open that file. Thankfully the first 3 Characters are unique & there only be one file in the path that should match.
to find the file in your dir:
import os, requests, fnmatch
#loop over dir
for filename in os.listdir(OpenPath):
#find the file
if fnmatch.fnmatch(filename, 'ABC_UK*'):
#download the file
# open file handler
with open('C:\dwnlfile.xls', 'wb') as fh:
#try to get it
result = requests.get(OpenPath+filename)
#check u got it
if not result.ok:
print result.reason # or text
exit(1)
#save it
fh.write(result.content)
print 'got it and saved'

Error when trying to read and write multiple files

I modified the code based on the comments from experts in this thread. Now the script reads and writes all the individual files. The script reiterates, highlight and write the output. The current issue is, after highlighting the last instance of the search item, the script removes all the remaining contents after the last search instance in the output of each file.
Here is the modified code:
import os
import sys
import re
source = raw_input("Enter the source files path:")
listfiles = os.listdir(source)
for f in listfiles:
filepath = source+'\\'+f
infile = open(filepath, 'r+')
source_content = infile.read()
color = ('red')
regex = re.compile(r"(\b be \b)|(\b by \b)|(\b user \b)|(\bmay\b)|(\bmight\b)|(\bwill\b)|(\b's\b)|(\bdon't\b)|(\bdoesn't\b)|(\bwon't\b)|(\bsupport\b)|(\bcan't\b)|(\bkill\b)|(\betc\b)|(\b NA \b)|(\bfollow\b)|(\bhang\b)|(\bbelow\b)", re.I)
i = 0; output = ""
for m in regex.finditer(source_content):
output += "".join([source_content[i:m.start()],
"<strong><span style='color:%s'>" % color[0:],
source_content[m.start():m.end()],
"</span></strong>"])
i = m.end()
outfile = open(filepath, 'w+')
outfile.seek(0)
outfile.write(output)
print "\nProcess Completed!\n"
infile.close()
outfile.close()
raw_input()
The error message tells you what the error is:
No such file or directory: 'sample1.html'
Make sure the file exists. Or do a try statement to give it a default behavior.
The reason why you get that error is because the python script doesn't have any knowledge about where the files are located that you want to open.
You have to provide the file path to open it as I have done below. I have simply concatenated the source file path+'\\'+filename and saved the result in a variable named as filepath. Now simply use this variable to open a file in open().
import os
import sys
source = raw_input("Enter the source files path:")
listfiles = os.listdir(source)
for f in listfiles:
filepath = source+'\\'+f # This is the file path
infile = open(filepath, 'r')
Also there are couple of other problems with your code, if you want to open the file for both reading and writing then you have to use r+ mode. More over in case of Windows if you open a file using r+ mode then you may have to use file.seek() before file.write() to avoid an other issue. You can read the reason for using the file.seek() here.

unzipping file results in "BadZipFile: File is not a zip file"

I have two zip files, both of them open well with Windows Explorer and 7-zip.
However when i open them with Python's zipfile module [ zipfile.ZipFile("filex.zip") ], one of them gets opened but the other one gives error "BadZipfile: File is not a zip file".
I've made sure that the latter one is a valid Zip File by opening it with 7-Zip and looking at its properties (says 7Zip.ZIP). When I open the file with a text editor, the first two characters are "PK", showing that it is indeed a zip file.
I'm using Python 2.5 and really don't have any clue how to go about for this. I've tried it both with Windows as well as Ubuntu and problem exists on both platforms.
Update: Traceback from Python 2.5.4 on Windows:
Traceback (most recent call last):
File "<module1>", line 5, in <module>
zipfile.ZipFile("c:/temp/test.zip")
File "C:\Python25\lib\zipfile.py", line 346, in init
self._GetContents()
File "C:\Python25\lib\zipfile.py", line 366, in _GetContents
self._RealGetContents()
File "C:\Python25\lib\zipfile.py", line 378, in _RealGetContents
raise BadZipfile, "File is not a zip file"
BadZipfile: File is not a zip file
Basically when the _EndRecData function is called for getting data from End of Central Directory" record, the comment length checkout fails [ endrec[7] == len(comment) ].
The values of locals in the _EndRecData function are as following:
END_BLOCK: 4096,
comment: '\x00',
data: '\xd6\xf6\x03\x00\x88,N8?<e\xf0q\xa8\x1cwK\x87\x0c(\x82a\xee\xc61N\'1qN\x0b\x16K-\x9d\xd57w\x0f\xa31n\xf3dN\x9e\xb1s\xffu\xd1\.....', (truncated)
endrec: ['PK\x05\x06', 0, 0, 4, 4, 268, 199515, 0],
filesize: 199806L,
fpin: <open file 'c:/temp/test.zip', mode 'rb' at 0x045D4F98>,
start: 4073
files named file can confuse python - try naming it something else. if it STILL wont work, try this code:
def fixBadZipfile(zipFile):
f = open(zipFile, 'r+b')
data = f.read()
pos = data.find('\x50\x4b\x05\x06') # End of central directory signature
if (pos > 0):
self._log("Trancating file at location " + str(pos + 22)+ ".")
f.seek(pos + 22) # size of 'ZIP end of central directory record'
f.truncate()
f.close()
else:
# raise error, file is truncated
I run into the same issue. My problem was that it was a gzip instead of a zip file. I switched to the class gzip.GzipFile and it worked like a charm.
astronautlevel's solution works for most cases, but the compressed data and CRCs in the Zip can also contain the same 4 bytes. You should do an rfind (not find), seek to pos+20 and then add write \x00\x00 to the end of the file (tell zip applications that the length of the 'comments' section is 0 bytes long).
# HACK: See http://bugs.python.org/issue10694
# The zip file generated is correct, but because of extra data after the 'central directory' section,
# Some version of python (and some zip applications) can't read the file. By removing the extra data,
# we ensure that all applications can read the zip without issue.
# The ZIP format: http://www.pkware.com/documents/APPNOTE/APPNOTE-6.3.0.TXT
# Finding the end of the central directory:
# http://stackoverflow.com/questions/8593904/how-to-find-the-position-of-central-directory-in-a-zip-file
# http://stackoverflow.com/questions/20276105/why-cant-python-execute-a-zip-archive-passed-via-stdin
# This second link is only losely related, but echos the first, "processing a ZIP archive often requires backwards seeking"
content = zipFileContainer.read()
pos = content.rfind('\x50\x4b\x05\x06') # reverse find: this string of bytes is the end of the zip's central directory.
if pos>0:
zipFileContainer.seek(pos+20) # +20: see secion V.I in 'ZIP format' link above.
zipFileContainer.truncate()
zipFileContainer.write('\x00\x00') # Zip file comment length: 0 byte length; tell zip applications to stop reading.
zipFileContainer.seek(0)
return zipFileContainer
I had the same problem and was able to solve this issue for my files, see my answer at
zipfile cant handle some type of zip data?
I'm very new at python and i was facing the exact same issue, none of the previous methods were working.
Trying to print the 'corrupted' file just before unzipping it returned an empty byte object.
Turned out, I was trying to unzip the file right after writing it to disk, without closing the file handler.
with open(path, 'wb') as outFile:
outFile.write(data)
outFile.close() # was missing this
with zipfile.ZipFile(path, 'r') as zip:
zip.extractall(destination)
Closing the file stream then unzipping the file resolved my issue.
Sometime there are zip file which contain corrupted files and upon unzipping the zip gives badzipfile error. but there are tools like 7zip winrar which ignores these errors and successfully unzip the zip file. you can create a sub process and use this code to unzip your zip file without getting BadZipFile Error.
import subprocess
ziploc = "C:/Program Files/7-Zip/7z.exe" #location where 7zip is installed
cmd = [ziploc, 'e',your_Zip_file.zip ,'-o'+ OutputDirectory ,'-r' ]
sp = subprocess.Popen(cmd, stderr=subprocess.STDOUT, stdout=subprocess.PIPE)
Show the full traceback that you got from Python -- this may give a hint as to what the specific problem is. Unanswered: What software produced the bad file, and on what platform?
Update: Traceback indicates having problem detecting the "End of Central Directory" record in the file -- see function _EndRecData starting at line 128 of C:\Python25\Lib\zipfile.py
Suggestions:
(1) Trace through the above function
(2) Try it on the latest Python
(3) Answer the question above.
(4) Read this and anything else found by google("BadZipfile: File is not a zip file") that appears to be relevant
I faced this problem and was looking for a good and clean solution; But there was no solution until I found this answer. I had the same problem that #marsl (among the answers) had. It was a gzipfile instead of a zipfile in my case.
I could unarchive and decompress my gzipfile with this approach:
with tarfile.open(archive_path, "r:gz") as gzip_file:
gzip_file.extractall()
Have you tried a newer python, or if that is too much trouble, simply a newer zipfile.py? I have successfully used a copy of zipfile.py from Python 2.6.2 (latest at the time) with Python 2.5 in order to open some zip files that weren't supported by Py2.5s zipfile module.
In some cases, you have to confirm if the zip file is actually in gzip format. this was the case for me and i solved it by :
import requests
import tarfile
url = ".tar.gz link"
response = requests.get(url, stream=True)
file = tarfile.open(fileobj=response.raw, mode="r|gz")
file.extractall(path=".")
for this this happened when the file wasn't downloaded fully I think. So I just delete it in my download code.
def download_and_extract(url: str,
path_used_for_zip: Path = Path('~/data/'),
path_used_for_dataset: Path = Path('~/data/tmp/'),
rm_zip_file_after_extraction: bool = True,
force_rewrite_data_from_url_to_file: bool = False,
clean_old_zip_file: bool = False,
gdrive_file_id: Optional[str] = None,
gdrive_filename: Optional[str] = None,
):
"""
Downloads data and tries to extract it according to different protocols/file types.
note:
- to force a download do:
force_rewrite_data_from_url_to_file = True
clean_old_zip_file = True
- to NOT remove file after extraction:
rm_zip_file_after_extraction = False
Tested with:
- zip files, yes!
Later:
- todo: tar, gz, gdrive
force_rewrite_data_from_url_to_file = remvoes the data from url (likely a zip file) and redownloads the zip file.
"""
path_used_for_zip: Path = expanduser(path_used_for_zip)
path_used_for_zip.mkdir(parents=True, exist_ok=True)
path_used_for_dataset: Path = expanduser(path_used_for_dataset)
path_used_for_dataset.mkdir(parents=True, exist_ok=True)
# - download data from url
if gdrive_filename is None: # get data from url, not using gdrive
import ssl
ctx = ssl.create_default_context()
ctx.check_hostname = False
ctx.verify_mode = ssl.CERT_NONE
print("downloading data from url: ", url)
import urllib
import http
response: http.client.HTTPResponse = urllib.request.urlopen(url, context=ctx)
print(f'{type(response)=}')
data = response
# save zipfile like data to path given
filename = url.rpartition('/')[2]
path2file: Path = path_used_for_zip / filename
else: # gdrive case
from torchvision.datasets.utils import download_file_from_google_drive
# if zip not there re-download it or force get the data
path2file: Path = path_used_for_zip / gdrive_filename
if not path2file.exists():
download_file_from_google_drive(gdrive_file_id, path_used_for_zip, gdrive_filename)
filename = gdrive_filename
# -- write downloaded data from the url to a file
print(f'{path2file=}')
print(f'{filename=}')
if clean_old_zip_file:
path2file.unlink(missing_ok=True)
if filename.endswith('.zip') or filename.endswith('.pkl'):
# if path to file does not exist or force to write down the data
if not path2file.exists() or force_rewrite_data_from_url_to_file:
# delete file if there is one if your going to force a rewrite
path2file.unlink(missing_ok=True) if force_rewrite_data_from_url_to_file else None
print(f'about to write downloaded data from url to: {path2file=}')
# wb+ is used sinze the zip file was in bytes, otherwise w+ is fine if the data is a string
with open(path2file, 'wb+') as f:
# with open(path2file, 'w+') as f:
print(f'{f=}')
print(f'{f.name=}')
f.write(data.read())
print(f'done writing downloaded from url to: {path2file=}')
elif filename.endswith('.gz'):
pass # the download of the data doesn't seem to be explicitly handled by me, that is done in the extract step by a magic function tarfile.open
# elif is_tar_file(filename):
# os.system(f'tar -xvzf {path_2_zip_with_filename} -C {path_2_dataset}/')
else:
raise ValueError(f'File type {filename=} not supported.')
# - unzip data written in the file
extract_to = path_used_for_dataset
print(f'about to extract: {path2file=}')
print(f'extract to target: {extract_to=}')
if filename.endswith('.zip'):
import zipfile # this one is for zip files, inspired from l2l
zip_ref = zipfile.ZipFile(path2file, 'r')
zip_ref.extractall(extract_to)
zip_ref.close()
if rm_zip_file_after_extraction:
path2file.unlink(missing_ok=True)
elif filename.endswith('.gz'):
import tarfile
file = tarfile.open(fileobj=response, mode="r|gz")
file.extractall(path=extract_to)
file.close()
elif filename.endswith('.pkl'):
# no need to extract it, but when you use the data make sure you torch.load it or pickle.load it.
print(f'about to test torch.load of: {path2file=}')
data = torch.load(path2file) # just to test
assert data is not None
print(f'{data=}')
pass
else:
raise ValueError(f'File type {filename=} not supported, edit code to support it.')
# path_2_zip_with_filename = path_2_ziplike / filename
# os.system(f'tar -xvzf {path_2_zip_with_filename} -C {path_2_dataset}/')
# if rm_zip_file:
# path_2_zip_with_filename.unlink(missing_ok=True)
# # raise ValueError(f'File type {filename=} not supported.')
print(f'done extracting: {path2file=}')
print(f'extracted at location: {path_used_for_dataset=}')
print(f'-->Succes downloading & extracting dataset at location: {path_used_for_dataset=}')
you can use my code with pip install ultimate-utils for the most up to date version.
In the other case, this warning showing up when the ml/dl model has different format.
For the example:
you want to open pickle, but the model format is .sav
Solution:
you need to change the format to original format
pickle --> .pkl
tensorflow --> .h5
etc.
In my case, the zip file itself was missing from that directory - thus when I tried to unzip it, I got the error "BadZipFile: File is not a zip file". It got resolved after I moved the .zip file to the directory. Please confirm that the file is indeed present in your directory before running the python script.
In my case, the zip file was corrupted. I was trying to download the zip file with urllib.request.urlretrieve but the file wouldn't completely download for some reason.
I connected to a VPN, the file downloaded just fine, and I was able to open the file.

Categories

Resources