I have a large zip file that contains many jar files. I want to read the content of the jar files.
What I tried is to read the inner jar file into memory, which seems to work (see below). However, I am not sure how large the jar files can be, and I am concerned that they won't fit into memory.
Is there a streaming solution for this problem?
hello.zip
+- hello.jar
   +- Hello.class
#!/usr/local/bin/python3

import os
import io
import zipfile

zip = zipfile.ZipFile('hello.zip', 'r')
for zipname in zip.namelist():
    if zipname.endswith('.jar'):
        print(zipname)
        jarname = zip.read(zipname)
        memfile = io.BytesIO(jarname)
        jar = zipfile.ZipFile(memfile)
        for f in jar.namelist():
            print(f)
hello.jar
META-INF/
META-INF/MANIFEST.MF
Hello.class
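One streaming option (a sketch, not from the original post): on Python 3.7+, ZipFile.open() returns a seekable file-like object, so the inner jar can be wrapped in a second ZipFile without reading it fully into memory. Backward seeks re-decompress from the start of the member, so this trades CPU for memory.

# Sketch of a streaming variant; assumes Python 3.7+, where ZipExtFile is seekable.
import zipfile

with zipfile.ZipFile('hello.zip', 'r') as outer:
    for zipname in outer.namelist():
        if zipname.endswith('.jar'):
            print(zipname)
            # open() yields a file-like object; the jar's bytes are decompressed
            # on demand instead of being loaded into memory all at once.
            with outer.open(zipname) as jarstream:
                with zipfile.ZipFile(jarstream) as jar:
                    for f in jar.namelist():
                        print(f)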
Related
I have a data dump from Wikipedia of about 30 files, each about ~2.5 GB uncompressed. I want to extract these files automatically, but as I understand it I cannot use Lambda because of its file size limitations.
I found an alternative solution using SQS to invoke an EC2 instance, which I am working on. However, for that to work my script needs to read all compressed files (.gz and .bz2) from the S3 bucket and its folders and extract them.
But on using zipfile module from python, I receive the following error:
zipfile.BadZipFile: File is not a zip file
Is there a solution to this?
This is my code:
import boto3
from io import BytesIO
import zipfile

s3_resource = boto3.resource('s3')
zip_obj = s3_resource.Object(bucket_name="backupwikiscrape", key='raw/enwiki-20200920-pages-articles-multistream1.xml-p1p41242.bz2')
buffer = BytesIO(zip_obj.get()["Body"].read())
z = zipfile.ZipFile(buffer)
for filename in z.namelist():
    file_info = z.getinfo(filename)
    s3_resource.meta.client.upload_fileobj(
        z.open(filename),
        Bucket='backupwikiextract',
        Key=f'{filename}'
    )
The above code doesn't seem to be able to extract the above formats. Any suggestions?
Your file is bz2, so you should use the bz2 Python library.
To decompress your object:
import bz2

decompressed_bytes = bz2.decompress(zip_obj.get()["Body"].read())
I'd suggest you use smart_open; it's much easier, and it handles both gz and bz2 files.
from smart_open import open
import boto3

s3_session = boto3.Session()

with open(path_to_my_file, transport_params={'session': s3_session}) as fin:
    for line in fin:
        print(line)
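If the goal is to push the decompressed Wikipedia dumps back to the second bucket, a rough sketch along the same lines (bucket and key names are taken from the question, the 1 MB chunk size is an arbitrary choice, and it assumes a smart_open version that accepts a boto3 session in transport_params, like the snippet above):

from smart_open import open
import boto3

params = {'session': boto3.Session()}
src = 's3://backupwikiscrape/raw/enwiki-20200920-pages-articles-multistream1.xml-p1p41242.bz2'
dst = 's3://backupwikiextract/enwiki-20200920-pages-articles-multistream1.xml-p1p41242'

# smart_open decompresses .bz2/.gz transparently based on the extension, so
# reading src yields plain XML bytes; dst has no extension, so it is written
# uncompressed. Data is streamed in 1 MB chunks, never held fully in memory.
with open(src, 'rb', transport_params=params) as fin, \
        open(dst, 'wb', transport_params=params) as fout:
    for chunk in iter(lambda: fin.read(1024 * 1024), b''):
        fout.write(chunk)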
I am trying to run a Python zip file which is retrieved using requests.get. The zip file has several directories of Python files in addition to the __main__.py, so in the interest of easily sending it as a single file, I am zipping it.
I know the file is being sent correctly, as I can save it to a file and then run it; however, I want to execute it without writing it to storage.
The working part is more or less as follows:
import requests
response = requests.get("http://myurl.com/get_zip")
I can write the zip to a file using
f = open("myapp.zip","wb")
f.write(response.content)
f.close()
and manually run it from the command line. However, I want something more like
exec(response.content)
This doesn't work since it's still compressed, but you get the idea.
I am also open to ideas that replace the zip with some other format of sending the code over the internet, if that makes it easier to execute from memory.
A possible solution is this:
import io
import requests
from zipfile import ZipFile

response = requests.get("http://myurl.com/get_zip")

# Read the contents of the zip into a bytes object.
binary_zip = io.BytesIO(response.content)
# Convert the bytes object into a ZipFile.
zip_file = ZipFile(binary_zip, "r")
# Iterate over all files in the zip (folders should be also ok).
for script_file in zip_file.namelist():
    exec(zip_file.read(script_file))
But it is a bit convoluted and probably can be improved.
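One possible refinement (a sketch, using only names from the question): since the archive is a zipapp-style bundle with a __main__.py entry point, you can execute just that file instead of every member. Note that the other modules inside the zip will not be importable this way unless the archive is also put on sys.path, which zipimport only supports for on-disk files.

import io
import requests
from zipfile import ZipFile

response = requests.get("http://myurl.com/get_zip")

with ZipFile(io.BytesIO(response.content), "r") as zf:
    # Run the entry point in a fresh namespace, as if it were a script.
    code = zf.read("__main__.py")
    exec(compile(code, "__main__.py", "exec"), {"__name__": "__main__"})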
I have numerous files that are compressed in the bz2 format and I am trying to uncompress them into a temporary directory using Python so I can then analyze them. There are hundreds of thousands of files, so manually decompressing them isn't feasible, so I wrote the following script.
My issue is that whenever I try to do this, the maximum file size is 900 kB even though manual decompression gives each file around 6 MB. I am not sure if this is a flaw in my code and how I am saving the data as a string to then copy to the file, or a problem with something else. I have tried this with different files and I know that it works for files smaller than 900 kB. Has anyone else had a similar problem and know of a solution?
My code is below:
import numpy as np
import bz2
import os
import glob

def unzip_f(filepath):
    '''
    Input a filepath specifying a group of Himawari .bz2 files with common names
    Outputs the path of all the temporary files that have been uncompressed
    '''
    cpath = os.getcwd()  # get current path
    filenames_ = []  # list to add filenames to for future use
    for zipped_file in glob.glob(filepath):  # loop over the files that meet the name criteria
        with bz2.BZ2File(zipped_file, 'rb') as zipfile:  # read in the bz2 files
            newfilepath = cpath + '/temp/' + zipped_file[-47:-4]  # create a temporary file
            with open(newfilepath, "wb") as tmpfile:  # open the temporary file
                for i, line in enumerate(zipfile.readlines()):
                    tmpfile.write(line)  # write the data from the compressed file to the temporary file
        filenames_.append(newfilepath)
    return filenames_

path_ = 'test/HS_H08_20180930_0710_B13_FLDK_R20_S*bz2'
unzip_f(path_)
It returns the correct file paths, but the file sizes are capped at 900 kB.
It turns out this issue is due to the files being multi-stream, which does not work in Python 2.7. There is more info here, as mentioned by jasonharper, and here. Below is a solution that just uses the Unix command to decompress the bz2 files and then moves them to the temporary directory I want. It is not as pretty, but it works.
import numpy as np
import os
import glob
import shutil

def unzip_f(filepath):
    '''
    Input a filepath specifying a group of Himawari .bz2 files with common names
    Outputs the path of all the temporary files that have been uncompressed
    '''
    cpath = os.getcwd()  # get current path
    filenames_ = []  # list to add filenames to for future use
    for zipped_file in glob.glob(filepath):  # loop over the files that meet the name criteria
        newfilepath = cpath + '/temp/'  # path of the temporary directory
        newfilename = newfilepath + zipped_file[-47:-4]
        os.popen('bzip2 -kd ' + zipped_file)
        shutil.move(zipped_file[-47:-4], newfilepath)
        filenames_.append(newfilename)
    return filenames_

path_ = 'test/HS_H08_20180930_0710_B13_FLDK_R20_S0*bz2'
unzip_f(path_)
This is a known limitation in Python 2, where the BZ2File class doesn't support multiple streams.
It can be easily resolved by using bz2file, https://pypi.org/project/bz2file/, which is a backport of the Python 3 implementation and can be used as a drop-in replacement.
After running pip install bz2file you can just replace bz2 with it:
import bz2file as bz2, and everything should just work :)
The original Python bug report: https://bugs.python.org/issue1625
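For illustration, a minimal sketch of the drop-in replacement (the file names are made up; assumes pip install bz2file has been run). Only the import changes; bz2file reads across all streams, so the output is no longer truncated at the end of the first stream:

import bz2file as bz2

with bz2.BZ2File('test/sample.DAT.bz2', 'rb') as fin, open('temp/sample.DAT', 'wb') as fout:
    # Copy in 1 MB chunks rather than readlines(), since the payload is binary.
    for chunk in iter(lambda: fin.read(1024 * 1024), b''):
        fout.write(chunk)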
I have a big zip file containing many files that I'd like to unzip in chunks to avoid consuming too much memory.
I tried the Python zipfile module, but I didn't find a way to load the archive in chunks and extract it to disk.
Is there a simple way to do that in Python?
EDIT
@steven-rumbalski correctly pointed out that zipfile handles big files correctly by unzipping the files one by one without loading the full archive.
My problem here is that my zip file is on AWS S3 and my EC2 instance cannot load such a big file in RAM, so I download it in chunks and I would like to unzip it in chunks.
You don't need a special way to extract a large archive to disk. The source Lib/zipfile.py shows that zipfile is already memory efficient. Creating a zipfile.ZipFile object does not read the whole file into memory. Rather it just reads in the table of contents for the ZIP file. ZipFile.extractall() extracts files one at a time using shutil.copyfileobj() copying from a subclass of io.BufferedIOBase.
If all you want to do is a one-time extraction, Python provides a shortcut from the command line:
python -m zipfile -e archive.zip target-dir/
You can use zipfile (or possibly tarfile) as follows:
import zipfile

def extract_chunk(fn, directory, ix_begin, ix_end):
    with zipfile.ZipFile(fn, 'r') as zf:
        infos = zf.infolist()
        print(infos)
        for ix in range(max(0, ix_begin), min(ix_end, len(infos))):
            zf.extract(infos[ix], directory)

directory = "path"
extract_chunk("{}/file.zip".format(directory), directory, 0, 50)
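For the S3 case from the EDIT, a sketch that avoids downloading the whole archive first (the bucket, key, and target directory are hypothetical; assumes the smart_open package mentioned elsewhere on this page). Its S3 reader supports seeking, so ZipFile only fetches the central directory plus the members it extracts:

import zipfile
from smart_open import open as s3_open

with s3_open('s3://my-bucket/big-archive.zip', 'rb') as fileobj:
    with zipfile.ZipFile(fileobj) as zf:
        # Members are extracted one by one; only the byte ranges needed for the
        # current member are read from S3, so memory use stays bounded.
        for info in zf.infolist():
            zf.extract(info, '/tmp/extracted')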
Is there a way to generate a file on HDFS directly?
I want to avoid generating a local file and then copying it to HDFS with the command line, like:
hdfs dfs -put - "file_name.csv"
Or is there any Python library?
Have you tried HdfsCLI?
To quote the paragraph Reading and Writing files:
# Loading a file in memory.
with client.read('features') as reader:
    features = reader.read()

# Directly deserializing a JSON object.
with client.read('model.json', encoding='utf-8') as reader:
    from json import load
    model = load(reader)
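The quoted paragraph only shows reads, but writing works the same way through client.write(), which is what the original question needs. A rough sketch (the endpoint, user, and CSV content are placeholders):

from hdfs import InsecureClient

client = InsecureClient('http://localhost:50070', user='myuser')

# Stream a CSV straight to HDFS without creating a local file first.
with client.write('/user/myuser/file_name.csv', encoding='utf-8') as writer:
    writer.write('col1,col2\n')
    writer.write('1,2\n')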
The write method is extremely slow when I use hdfscli. Is there any way to speed it up using hdfscli?
with client.write(conf.hdfs_location + '/' + conf.filename, encoding='utf-8', buffersize=10000000) as f:
    writer = csv.writer(f, delimiter=conf.separator)
    for i in tqdm(range(10000000000)):
        row = [column.get_value() for column in conf.columns]
        writer.writerow(row)
Thanks a lot.
hdfs dfs -put does not require you to create a local file. There is also no need to create a zero-byte file on HDFS (touchz) and append to it (appendToFile). You can directly write a file on HDFS as:
hadoop fs -put - /user/myuser/testfile
Hit Enter. At the command prompt, enter the text you want to put in the file. Once you are finished, press Ctrl+D.
There are two ways to write local files to HDFS using Python.
One way is using the hdfs Python package.
Code snippet:
from hdfs import InsecureClient

hdfsclient = InsecureClient('http://localhost:50070', user='madhuc')
hdfspath = "/user/madhuc/hdfswritedata/"
localpath = "/home/madhuc/sample.csv"
hdfsclient.upload(hdfspath, localpath)
Output location: '/user/madhuc/hdfswritedata/sample.csv'
The other way is the subprocess Python package, using PIPE.
Code snippet:
from subprocess import PIPE, Popen
# put file into hdfs
put = Popen(["hadoop", "fs", "-put", localpath, hdfspath], stdin=PIPE, bufsize=-1)
put.communicate()
print("File Saved Successfully")