I'm trying to unzip a zip file inside my script using Python's zipfile module. When I try to extract it, zipfile raises a "Bad magic number for file header" error:
..
zip_ref.extractall(destination_to_unzip_file)
File "C:\Python27\lib\zipfile.py", line 1040, in extractall
self.extract(zipinfo, path, pwd)
File "C:\Python27\lib\zipfile.py", line 1028, in extract
return self._extract_member(member, path, pwd)
File "C:\Python27\lib\zipfile.py", line 1082, in _extract_member
with self.open(member, pwd=pwd) as source, \
File "C:\Python27\lib\zipfile.py", line 971, in open
raise BadZipfile("Bad magic number for file header")
zipfile.BadZipfile: Bad magic number for file header
The file I want to unzip is downloaded this way:
_url = """http://edane.drsr.sk/report/ds_dphs_csv.zip"""
def download_platici_dph(self):
    if os.path.isfile(_destination_for_downloads+'platici_dph.zip'):
        os.remove(_destination_for_downloads+'platici_dph.zip')
    with open(_destination_for_downloads+'platici_dph.zip','w') as f:
        response = requests.get(_url,stream=True)
        if not response.ok:
            print 'Something went wrong'
            return False
        else:
            for block in response.iter_content(1024):
                f.write(block)
Does anybody know where the problem is?
Quoth the documentation for open(): "when opening a binary file, you should append 'b' to the mode value to open the file in binary mode"
Open your output file using b for binary:
with open(_destination_for_downloads+'platici_dph.zip','wb') as f:
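A minimal sketch of that fix, assuming Python 3. The helper name `save_chunks` is hypothetical and not part of the question's code. It writes byte chunks, such as those yielded by `response.iter_content(1024)`, through a file opened with 'wb', so newline bytes inside the archive reach the disk unchanged:

```python
import os
import tempfile


def save_chunks(chunks, path):
    """Write an iterable of byte chunks to path and return the byte count.

    The file is opened with 'wb'. In text mode on Windows, newline bytes
    are translated on write, which corrupts binary data such as a zip.
    """
    written = 0
    with open(path, "wb") as f:
        for chunk in chunks:
            if chunk:  # skip empty chunks
                f.write(chunk)
                written += len(chunk)
    return written


if __name__ == "__main__":
    # Stand-in for a download: two chunks that contain newline bytes.
    demo_path = os.path.join(tempfile.mkdtemp(), "demo.bin")
    print(save_chunks([b"PK\x03\x04\n", b"\r\nrest"], demo_path))
```

With requests, the call would look roughly like `save_chunks(response.iter_content(1024), _destination_for_downloads + 'platici_dph.zip')`.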
I tried downloading your archive without using your download-code and then extracting it with:
import zipfile
with zipfile.ZipFile("ds_dphs_csv.zip") as a:
    a.extractall()
It worked fine. The exception zipfile.BadZipfile is raised when there is a problem with a header, so I think your file is being corrupted during the download. There must be a problem with your download method.
You can find more details on the exception in this post: Python - Extracting files from a large (6GB+) zip file
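One way to confirm that a downloaded archive is damaged before calling extractall() is to let zipfile read every member. This is a sketch only, assuming Python 3, where the exception is spelled `zipfile.BadZipFile`. The function name `zip_problem` is made up for this example:

```python
import zipfile


def zip_problem(path):
    """Return None if the archive reads cleanly, else a short description."""
    try:
        with zipfile.ZipFile(path) as zf:
            # testzip() reads every member and returns the first bad name.
            bad_member = zf.testzip()
    except zipfile.BadZipFile as exc:
        return "not a usable zip archive: %s" % exc
    if bad_member is not None:
        return "corrupt member: %s" % bad_member
    return None
```

A result other than None for the downloaded file, together with None for a copy fetched another way, would point at the download step rather than the archive itself.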
I am getting an error when opening an .xlsx file on Windows 8 using the tablib library.
Python version: 2.7.14
The error is as follows:
python suit_simple_sheet_product.py
Traceback (most recent call last):
File "suit_simple_sheet_product.py", line 19, in <module>
data = tablib.Dataset().load(open(BASE_PATH).read())
File "C:\Python27\lib\site-packages\tablib\core.py", line 446, in load
format = detect_format(in_stream)
File "C:\Python27\lib\site-packages\tablib\core.py", line 1157, in detect_format
if fmt.detect(stream):
File "C:\Python27\lib\site-packages\tablib\formats\_xls.py", line 25, in detect
xlrd.open_workbook(file_contents=stream)
File "C:\Python27\lib\site-packages\xlrd\__init__.py", line 120, in open_workbook
zf = zipfile.ZipFile(timemachine.BYTES_IO(file_contents))
File "C:\Python27\lib\zipfile.py", line 770, in __init__
self._RealGetContents()
File "C:\Python27\lib\zipfile.py", line 811, in _RealGetContents
raise BadZipfile, "File is not a zip file"
zipfile.BadZipfile: File is not a zip file
The path is defined as follows:
BASE_PATH = 'C:\Users\anju\Downloads\automate\catalog-5090 fabric detail and price list.xlsx'
Excel .xlsx files are actually zip files. For the unzip to work correctly, the file must be opened in binary mode, so you need to open the file using:
import tablib
BASE_PATH = r'c:\my folder\my_test.xlsx'
data = tablib.Dataset().load(open(BASE_PATH, 'rb').read())
print data
Add r before your string to stop Python from trying to interpret the backslash characters in your path.
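Both points can be checked with a short sketch, assuming Python 3. The helper name `starts_like_zip` is hypothetical, and the signature check is only a rough indicator: a typical non-empty zip (and therefore a typical .xlsx) begins with the bytes `PK\x03\x04`, but this does not validate the archive.

```python
ZIP_LOCAL_HEADER = b"PK\x03\x04"  # signature at the start of a typical zip


def starts_like_zip(path):
    """Read the first four bytes in binary mode and compare to the signature."""
    with open(path, "rb") as f:
        return f.read(4) == ZIP_LOCAL_HEADER


# '\a' is an escape sequence (the BEL character), so a path segment such as
# '\automate' is silently altered; a raw string keeps the backslash.
print(len("\a"), len(r"\a"))
```

In the path from the question, both `\anju` and `\automate` contain `\a`, which is why the raw-string prefix matters there.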
I am storing data using pandas built-in HDF5 methods.
Somehow, these HDF5 files were turned into 'read-only' files. I get a lot of "Opening xxx in read-only mode" messages when I open them in write mode, and I can't write to them, which is something I really need to do.
The thing I really don't understand so far is how come those files turned into read-only, as I am not aware of a piece of code that I wrote that may result in that behavior. (I have tried to check if the data stored in the HDF5 is corrupt, but I am able to read it and manipulate it, so it seems to be working just fine)
I have 2 questions:
How can I append data to those 'read-only mode' HDF5 files? (Can I convert them back to write mode or any other clever solution?)
Is there any pandas method that would change the HDF5 file to a 'read-only mode' by default so I can avoid turning those files into read-only in the first place?
Code:
The piece of code that raises this issue is the one I use to save the output I generated:
with pd.HDFStore('data/observer/' + self._currency + '_' + str(ts)) as hdf:
    hdf.append(key='observers', value=df, format='table', data_columns=True)
I also use this piece of code to manipulate the outputs that were generated previously:
for the_file in list_dir:
    if currency in the_file:
        temp_df = pd.read_hdf(folder + the_file)
        ...
I use some select commands as well to get specific columns from the data files:
with pd.HDFStore('data/observer/' + self.currency + '_' + timestamp) as hdf:
    df = hdf.select(key='observers', columns=[x, y])
Error Traceback:
File ".../data_processing/observer_data.py", line 52, in save_obs_to_pandas
hdf.append(key='observers', value=df, format='table', data_columns=True)
File ".../venv/lib/python3.5/site-packages/pandas/io/pytables.py", line 963, in append
**kwargs)
File ".../venv/lib/python3.5/site-packages/pandas/io/pytables.py", line 1341, in _write_to_group
s.write(obj=value, append=append, complib=complib, **kwargs)
File ".../venv/lib/python3.5/site-packages/pandas/io/pytables.py", line 3930, in write
self.set_info()
File ".../venv/lib/python3.5/site-packages/pandas/io/pytables.py", line 3163, in set_info
self.attrs.info = self.info
File ".../venv/lib/python3.5/site-packages/tables/attributeset.py", line 464, in __setattr__
nodefile._check_writable()
File ".../venv/lib/python3.5/site-packages/tables/file.py", line 2119, in _check_writable
raise FileModeError("the file is not writable")
tables.exceptions.FileModeError: the file is not writable
"""
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File ".../general_manager.py", line 144, in <module>
gm.run()
File ".../general_manager.py", line 114, in run
list_of_observer_managers = self.load_all_observer_managers()
File ".../general_manager.py", line 64, in load_all_observer_managers
observer = currency_pool.map(self.load_observer_manager, list_of_currencies)
File "/usr/lib/python3.5/multiprocessing/pool.py", line 260, in map
return self._map_async(func, iterable, mapstar, chunksize).get()
File "/usr/lib/python3.5/multiprocessing/pool.py", line 608, in get
raise self._value
tables.exceptions.FileModeError: the file is not writable
The issue was that I had messed up the OS file permissions. The files I was trying to open belonged to root (I had run the code that generated them as root), and I was trying to access them from a regular user account.
I am running Debian, and the following command (run as root) solved my issue:
chown -R user.user folder
This command recursively changes the owner and group of every file inside that folder to user.
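A quick way to spot this situation from Python, before pandas reports the file as read-only, is to look at the file's owner and ask the OS whether the current process may write to it. This is only a sketch; `access_report` is a hypothetical helper name, and `os.access` reports True for root regardless of the permission bits.

```python
import os


def access_report(path):
    """Return the owner uid of path and whether this process may write to it."""
    st = os.stat(path)
    return {
        "owner_uid": st.st_uid,
        "writable": os.access(path, os.W_OK),
    }
```

Calling it on one of the HDF5 files as the regular user would be expected to show root's uid (0) and `writable` set to False in the situation described above.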
I am trying to download a SWF file using Python,
but for some reason every time I use urllib.urlretrieve(url,filepathwithname) or wget.download(url), it says:
File "C:\Users\User-PC\Desktop\check.py", line 3, in <module>
response = urllib.urlretrieve('www.domain.com\\folder\\folder\\52sd1399sc2emaple-story-vthree-f.swf','file.swf')
File "C:\Python27\lib\urllib.py", line 98, in urlretrieve
return opener.retrieve(url, filename, reporthook, data)
File "C:\Python27\lib\urllib.py", line 245, in retrieve
fp = self.open(url, data)
File "C:\Python27\lib\urllib.py", line 213, in open
return getattr(self, name)(url)
File "C:\Python27\lib\urllib.py", line 469, in open_file
return self.open_local_file(url)
File "C:\Python27\lib\urllib.py", line 483, in open_local_file
raise IOError(e.errno, e.strerror, e.filename)
IOError: [Errno 2] The system cannot find the path specified: 'www.domain.com\\folder\\folder\\52sd1399sc2emaple-story-vthree-f.swf'
The wget line gives the same error even when I run it in a separate file from the rest of my code (which runs perfectly without the download line). It does work sometimes, but never with the SWF links I have tried.
*Note. The string I enter is "www.domain.com\folder\folder\52sd1399sc2emaple-story-vthree-f.swf" and the error contains "www.domain.com\folder\folder\52sd1399sc2emaple-story-vthree-f.swf".
Please help.
#Blckknght You are amazing. that was the whole problem. Thanks to everyone who tried to help!
The whole problem was a missing http:// in the urlretrieve link... I have spent almost 10 hours trying to solve that problem using wget and other methods, and #Blckknght just solved it for me. =)
I wish you all the best, have a great day
-CodingCode.
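The traceback shows why the missing scheme matters: without `http://`, urllib routed the string through `open_local_file` and looked for it on disk. A small guard like the following could catch that before the download call. It is a sketch with a hypothetical helper name, `with_scheme`, and it also assumes the backslashes in the string were meant to be URL forward slashes.

```python
def with_scheme(url, default="http"):
    """Return url with forward slashes and a scheme, adding default if missing."""
    url = url.replace("\\", "/")  # assumption: backslashes were meant as '/'
    if "://" not in url:
        url = default + "://" + url
    return url


print(with_scheme("www.domain.com\\folder\\file.swf"))
```

The result can then be passed to the download function in place of the bare host name.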
I am new to python. I can't figure out what I am doing wrong when trying to read the contents of .tar.gz file into python. The tarfile I would like to read is hosted at the following web address:
ftp://ftp.ncbi.nlm.nih.gov/pub/pmc/b0/ac/Breast_Cancer_Res_2001_Nov_9_3(1)_61-65.tar.gz
more info on file at this site (just so you can trust contents)
http://www.pubmedcentral.nih.gov/utils/oa/oa.fcgi?id=PMC13901
The tarfile contains .pdf and .nxml copies of the journal article. And also a couple of image files.
If I open the address in my browser by copying and pasting it, I can save the file to a location on my PC and open the tarfile fine using the following commands (note: WinZip changes the file from .tar.gz to simply .tar when I save it):
import tarfile
thetarfile = "C:/Users/dfcm/Documents/Breast_Cancer_Res_2001_Nov_9_3(1)_61-65.tar"
tfile = tarfile.open(thetarfile)
tfile
However, if I try to access the file directly using similar commands:
thetarfile = "ftp://ftp.ncbi.nlm.nih.gov/pub/pmc/b0/ac/Breast_Cancer_Res_2001_Nov_9_3(1)_61-65.tar.gz"
bbb = tarfile.open(thetarfile)
That results in the following error:
Traceback (most recent call last):
File "<pyshell#137>", line 1, in <module>
bbb = tarfile.open(thetarfile)
File "C:\Python30\lib\tarfile.py", line 1625, in open
return func(name, "r", fileobj, **kwargs)
File "C:\Python30\lib\tarfile.py", line 1687, in gzopen
fileobj = bltn_open(name, mode + "b")
File "C:\Python30\lib\io.py", line 278, in __new__
return open(*args, **kwargs)
File "C:\Python30\lib\io.py", line 222, in open
closefd)
File "C:\Python30\lib\io.py", line 615, in __init__
_fileio._FileIO.__init__(self, name, mode, closefd)
IOError: [Errno 22] Invalid argument: 'ftp://ftp.ncbi.nlm.nih.gov/pub/pmc/b0/ac/Breast_Cancer_Res_2001_Nov_9_3(1)_61-65.tar'
Can anyone explain what I am doing wrong when trying to read the .tar.gz file directly from the web address? Thanks in advance. Chris
Unfortunately you cannot just open files from the network. Things are a bit more complex here. You have to instruct the interpreter to create a network request and create an object representing the request state. This can be done using the urllib module.
import urllib.request
import tarfile
thetarfile = "ftp://ftp.ncbi.nlm.nih.gov/pub/pmc/b0/ac/Breast_Cancer_Res_2001_Nov_9_3(1)_61-65.tar.gz"
ftpstream = urllib.request.urlopen(thetarfile)
thetarfile = tarfile.open(fileobj=ftpstream, mode="r|gz")
The ftpstream object is a file-like object that represents the connection to the FTP server. The tarfile module can then read from this stream. Since we do not pass a filename, we have to specify the compression in the mode parameter.
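The same stream mode can be tried without a network connection. In the sketch below, an in-memory `io.BytesIO` buffer stands in for ftpstream, and the helper names `list_members` and `make_demo_archive` are invented for the example. In "r|gz" mode the archive is read strictly front to back, so members are visited in order rather than looked up at random.

```python
import io
import tarfile


def list_members(stream):
    """Return member names from a gzip-compressed tar read as a forward-only stream."""
    names = []
    with tarfile.open(fileobj=stream, mode="r|gz") as tar:
        for member in tar:  # stream mode: members are visited in order
            names.append(member.name)
    return names


def make_demo_archive():
    """Build a tiny .tar.gz in memory (illustrative stand-in for the FTP download)."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as tar:
        payload = b"hello"
        info = tarfile.TarInfo(name="article.nxml")
        info.size = len(payload)
        tar.addfile(info, io.BytesIO(payload))
    buf.seek(0)
    return buf


print(list_members(make_demo_archive()))
```

Passing the object returned by `urllib.request.urlopen(...)` instead of the demo buffer should behave the same way, since both are readable file-like objects.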
I have a .zip file and would like to know the names of the files within it. Here's the code:
zip_path = glob.glob(path + '/*.zip')[0]
file = open(zip_path, 'r') # opens without error
if zipfile.is_zipfile(file):
    print str(file) # prints to console
    my_zipfile = zipfile.ZipFile(zip_path) # throws IOError
Here is the traceback:
<open file u'/Users/me/Documents/project/uploads/assets/peter/offline_message/offline_imgs.zip', mode 'r' at 0x107b2a150>
Traceback (most recent call last):
File "/Users/me/Documents/project/admin_dev/proj_name/views.py", line 1680, in get_dps_app_builder_assets
link_to_assets_zip = zip_dps_app_builder_assets(server_url, app_slug, button_slugs)
File "/Users/me/Documents/project/admin_dev/proj_name/views.py", line 1724, in zip_dps_app_builder_assets
my_zipfile = zipfile.ZipFile(zip_path)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/zipfile.py", line 712, in __init__
self._GetContents()
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/zipfile.py", line 746, in _GetContents
self._RealGetContents()
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/zipfile.py", line 779, in _RealGetContents
fp.seek(self.start_dir, 0)
IOError: [Errno 22] Invalid argument
I am very confused as to why this is happening since the file is clearly there and is a valid .zip file. The documentation clearly states that you can pass it either the path to the file or a file-like object, neither of which work in my case:
http://docs.python.org/2/library/zipfile#zipfile-objects
I was not able to figure this issue out and ended up doing it a different way entirely.
EDIT: In the Django app I work with, users needed to be able to upload assets in the form of .zip files and later download everything they had uploaded (plus other content we generate dynamically) in another zip with a different structure. So, I wanted to unzip a previously uploaded file and zip up the contents of that file up in another zip, which I couldn't do because of the error. Instead of reading the zip file when the user requested the download, I ended up unzipping it from a Django InMemoryUploadedFile (whose contents I was able to successfully read) and just leaving the unzipped files on the file system to work with later. The contents of the zip are only two smallish image files, so this workaround of unzipping the zip ahead of time to be used later worked OK for my purposes.
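For reference, listing the names inside an archive usually needs only `ZipFile.namelist()`. The sketch below assumes Python 3 and uses a hypothetical helper name, `zip_member_names`. It opens the file in binary mode, unlike the 'r' in the question, but the thread never establishes what caused the IOError, so this is not a confirmed fix for it.

```python
import zipfile


def zip_member_names(path):
    """Open path in binary mode and return the names of the archive members."""
    with open(path, "rb") as f:
        with zipfile.ZipFile(f) as zf:
            return zf.namelist()
```

Passing the path string directly to `zipfile.ZipFile` is equivalent; the explicit 'rb' only makes the binary mode visible.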