MetaData of downloaded zipped file - python

import zipfile
import urllib
from io import BytesIO

url = 'http://www.test.com/test.zip'
z = zipfile.ZipFile(BytesIO(urllib.urlopen(url).read()))  # Python 2; on Python 3 use urllib.request.urlopen
z.extractall(path='D:')
I am using the above code to download a zipped file from a URL and extract all of its files to a specified drive, and it is working fine.
Is there a way I can get metadata of all the files extracted from z, for example
filenames, file sizes, file extensions, etc.?

ZipFile objects actually have built-in tools for this that you can use without even extracting anything. infolist returns a list of ZipInfo objects that you can read certain information out of, including the full file name and uncompressed size.
import os
import zipfile
import urllib
from io import BytesIO

url = 'http://www.test.com/test.zip'
z = zipfile.ZipFile(BytesIO(urllib.urlopen(url).read()))  # Python 2; on Python 3 use urllib.request.urlopen
data = []
for obj in z.infolist():
    name, ext = os.path.splitext(obj.filename)
    data.append((name, ext, obj.file_size))  # append takes a single argument, so pack a tuple
I also used os.path.splitext to separate the file's name from its extension, since you asked for the file type separately from the name.
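Beyond filename and file_size, each ZipInfo also carries compress_size, date_time, and a CRC. A minimal self-contained sketch (building a throwaway archive in memory, so no download is needed):

```python
import zipfile
from io import BytesIO

# Build a small in-memory archive so the example needs no network access.
buf = BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("docs/readme.txt", "hello")

# Each ZipInfo exposes more than the name: sizes, timestamp, CRC, etc.
with zipfile.ZipFile(buf) as zf:
    for info in zf.infolist():
        print(info.filename, info.file_size, info.compress_size)
```

With the default ZIP_STORED compression the compressed and uncompressed sizes are equal here.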

I don't know of a built-in way to do that using the zipfile module; however, it is easily done using os.path:
import os
import zipfile
import urllib
from io import BytesIO

EXTRACT_PATH = "D:"
url = 'http://www.test.com/test.zip'
z = zipfile.ZipFile(BytesIO(urllib.urlopen(url).read()))  # Python 2; on Python 3 use urllib.request.urlopen
z.extractall(path=EXTRACT_PATH)
extracted_files = [os.path.join(EXTRACT_PATH, filename) for filename in z.namelist()]
for extracted_file in extracted_files:
    # All metadata operations here, such as:
    print(os.path.getsize(extracted_file))

python: set file path to only point to files with a specific ending

I am trying to run a program which requires pVCF files alone as inputs. Due to the size of the data, I am unable to create a separate directory containing the particular files that I need.
The directory contains multiple files with '.vcf.gz.tbi' and '.vcf.gz' endings. Using the following code:
file_url = "file:///mnt/projects/samples/vcf_format/*.vcf.gz"
I tried to create a file path that only grabs the '.vcf.gz' files while excluding the '.vcf.gz.tbi' files, but I have been unsuccessful.
The code you have, as written, just assigns your file path to the variable file_url. For something like this, glob is popular but isn't the only option:
import glob, os

file_url = "file:///mnt/projects/samples/vcf_format/"
local_path = file_url.replace("file://", "")  # os.chdir needs a plain filesystem path, not a URL
os.chdir(local_path)
for file in glob.glob("*.vcf.gz"):
    print(file)
Note that the file path itself doesn't encode the kind of file you want (in this case, a gzipped VCF); the glob pattern in the for loop does that.
Check out this answer for more options.
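If changing the working directory feels intrusive, glob also accepts a full pattern, so the same filter can be written without os.chdir. A self-contained sketch (using a throwaway directory in place of the real one):

```python
import glob
import os
import tempfile

# Set up a throwaway directory with both file types to demonstrate the filter.
d = tempfile.mkdtemp()
for name in ("a.vcf.gz", "a.vcf.gz.tbi", "b.vcf.gz"):
    open(os.path.join(d, name), "w").close()

# A full pattern returns complete paths; '*.vcf.gz' does not match names
# ending in '.tbi', so the index files are excluded automatically.
vcf_files = sorted(glob.glob(os.path.join(d, "*.vcf.gz")))
print([os.path.basename(p) for p in vcf_files])  # ['a.vcf.gz', 'b.vcf.gz']
```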
It took some digging, but it looks like you're trying to use the import_vcf function of Hail. To put the files in a list so they can be passed as input:
import glob, os

file_url = "file:///mnt/projects/samples/vcf_format/"

def get_vcf_list(path):
    vcf_list = []
    os.chdir(path.replace("file://", ""))  # strip the URL scheme for local filesystem access
    for file in glob.glob("*.vcf.gz"):
        vcf_list.append(path + file)
    return vcf_list

# Now you pass 'get_vcf_list(file_url)' as your input instead of 'file_url'
mt = hl.import_vcf(get_vcf_list(file_url), force_bgz=True, reference_genome="GRCh38", array_elements_required=False)

How to load all .txt files in a folder with separate names - Python

I have to create a bunch of numpy arrays by importing all the .txt files from a folder. This is the way I'm doing it right now:
wil_davide_noIA_IST_nz300 = np.genfromtxt("%s/davide/wil_davide_noIA_IST_nz300.txt" %path)
wig_davide_multiBinBias_IST_nz300 = np.genfromtxt("%s/davide/wig_davide_multiBinBias_IST_nz300.txt" %path)
wig_davide_multiBinBias_PySSC_nz300 = np.genfromtxt("%s/davide/wig_davide_multiBinBias_PySSC_nz300.txt" %path)
wig_davide_noBias_PySSC_nz300 = np.genfromtxt("%s/davide/wig_davide_noBias_PySSC_nz300.txt" %path)
wig_davide_IST_nz300 = np.genfromtxt("%s/davide/wig_davide_IST_nz300.txt" %path)
wig_davide_noBias_IST_nz300 = np.genfromtxt("%s/davide/wig_davide_noBias_IST_nz300.txt" %path)
wig_davide_PySSC_nz300 = np.genfromtxt("%s/davide/wig_davide_PySSC_nz300.txt" %path)
...
and so on, from the folder.
Can I automate the process somehow? Note that I'd like each array to have the same name as the imported file (without the .txt, of course).
Thank you very much
Creating a variable number of variables, with names determined at runtime, is never a good idea. Instead, keep your numpy arrays in some kind of collection. In this case I would suggest a dictionary, so that you can access the individual arrays by key, where the key is the name of the corresponding file as a string, without the extension of course. In modern Python you would use the pathlib or glob standard library modules to iterate over all text files in a given directory. I haven't tested this, but the following should work; the only thing you'll have to change is the hardcoded "dir/to/files" string, which should be the path to the directory containing the text files:
import numpy as np
from pathlib import Path

def get_kv_pairs():
    for path in Path("dir/to/files").glob("*.txt"):
        yield path.stem, np.genfromtxt(str(path))

arrays = dict(get_kv_pairs())
print(arrays["wil_davide_noIA_IST_nz300"])  # Here's how you would access the individual numpy arrays.
This is how you can get all the filenames.
path is whatever your file path is ("./" if in the same directory):
import os

path = "./"
filenames = []
for filename in os.listdir(path):
    filenames.append(filename)  # note: parentheses, and append the loop variable
...
Also, you can filter the files with an if statement. Here I'm selecting only .txt files.
suffix is whatever filter you want to apply (use "." not in filename to get directories):
import os

path = "./"
suffix = ".txt"  # renamed from 'filter' to avoid shadowing the built-in
txt_filenames = []
for filename in os.listdir(path):
    if filename.endswith(suffix):
        txt_filenames.append(filename)
...
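As a small aside, str.endswith also accepts a tuple of suffixes, which is handy if more than one extension qualifies. A quick sketch (the file names are made up):

```python
# endswith with a tuple matches any of the listed suffixes in one test.
# Note the match is case-sensitive, so "d.TXT" is not selected.
names = ["a.txt", "b.csv", "c.dat", "d.TXT"]
data_files = [n for n in names if n.endswith((".txt", ".csv"))]
print(data_files)  # ['a.txt', 'b.csv']
```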
A heads up: you should not do this if there are any security concerns with the files or their names.
Otherwise you can use the exec function to create the variables you want.
For a list approach, see the other solution.
import os
import numpy as np

path_to_folder = "."
command = ""
for f in os.listdir(path_to_folder):
    if f.endswith(".txt"):
        command += f.replace(".txt", "") + f" = np.genfromtxt('{path_to_folder}/{f}')\n"
exec(command)
Depending on your folder/file structure you may need to adjust the code a bit, but it should work as asked.

Unzip a file and save its content into a database

I am building a website using Django where the user can upload a .zip file. I do not know how many subfolders the file has or which types of files it contains.
I want to:
1) Unzip the file
2) Get all the files in the unzipped directory (which might contain nested subfolders)
3) Save these files (the content, not the path) into the database.
I managed to unzip the file and to output the file paths.
However, this is not exactly what I want, because I do not care about the file paths but about the files themselves.
In addition, since I am saving the unzipped files into media/documents, if different users upload different zips and all the zip files are unzipped, the media/documents folder would become huge and it would be impossible to know who uploaded what.
Unzipping the .zip file
myFile = request.FILES.get('my_uploads')
with ZipFile(myFile, 'r') as zipObj:
    zipObj.extractall('media/documents/')
Getting the paths of files in subfolders
x = [i[2] for i in os.walk('media/documents/')]
file_names = []
for t in x:
    for f in t:
        file_names.append(f)
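Note that os.walk's third tuple element holds bare filenames only; if full paths are needed (for reading the files later), the first element can be joined in. A self-contained sketch (using a throwaway directory standing in for media/documents/):

```python
import os
import tempfile

# Throwaway tree with a nested subfolder, standing in for media/documents/.
base = tempfile.mkdtemp()
os.makedirs(os.path.join(base, "sub"))
open(os.path.join(base, "top.txt"), "w").close()
open(os.path.join(base, "sub", "inner.txt"), "w").close()

# os.walk yields (dirpath, dirnames, filenames); joining dirpath with each
# filename keeps the full path instead of just the bare name.
full_paths = sorted(
    os.path.join(root, f)
    for root, dirs, files in os.walk(base)
    for f in files
)
print([os.path.relpath(p, base) for p in full_paths])
```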
views.py # It is not perfect, it is just an idea. I am just debugging.
def homeupload(request):
    if request.method == "POST":
        my_entity = Uploading()
        # my_entity.my_uploads = request.FILES["my_uploads"]
        myFile = request.FILES.get('my_uploads')
        with ZipFile(myFile, 'r') as zipObj:
            zipObj.extractall('media/documents/')
        x = [i[2] for i in os.walk('media/documents/')]
        file_names = []
        for t in x:
            for f in t:
                file_names.append(f)
        my_entity.save()
You really don't have to clutter up your filesystem when using a ZipFile, as it contains methods that allow you to read the files stored in the zip directly into memory, and then you can save those objects to a database.
Specifically, we can use .infolist() or .namelist() to get a list of all the files in the zip, and .read() to actually get their contents:
with ZipFile(myFile, 'r') as zipObj:
    file_objects = [zipObj.read(item) for item in zipObj.namelist()]
Now file_objects is a list of bytes objects containing the content of all the files. I didn't bother saving the names or file paths because you said that was unnecessary, but it can be done too. To see what's available, check out what actually gets returned from infolist.
If you want to save these bytes objects to your database, that's usually possible if your database supports it (most do). If you want the files as plain text rather than bytes, you just have to convert them first with something like .decode:
with ZipFile(myFile, 'r') as zipObj:
    file_objects = [zipObj.read(item).decode() for item in zipObj.namelist()]
Notice that we didn't save any files to the filesystem at any point, so there's no need to worry about a lot of user-uploaded files cluttering up your system. The database's storage size on disk will, however, increase accordingly.
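To actually persist those bytes, in Django you would typically use a model field that holds binary data (e.g. BinaryField). As a framework-free illustration of the same idea, here is a sketch using the standard library's sqlite3 module (the table and column names are invented for the example, and a small in-memory zip stands in for the upload):

```python
import sqlite3
import zipfile
from io import BytesIO

# Build a small in-memory zip standing in for the uploaded file.
buf = BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("a/deep/nested.txt", "payload")

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE documents (name TEXT, content BLOB)")

# Read each member straight from the archive and insert its bytes;
# nothing is ever written to the filesystem.
with zipfile.ZipFile(buf) as zf:
    for name in zf.namelist():
        conn.execute("INSERT INTO documents VALUES (?, ?)", (name, zf.read(name)))

row = conn.execute("SELECT name, content FROM documents").fetchone()
print(row)  # ('a/deep/nested.txt', b'payload')
```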

Limit on bz2 file decompression using python?

I have numerous files compressed in the bz2 format, and I am trying to decompress them into a temporary directory using Python so I can then analyze them. There are hundreds of thousands of files, so manually decompressing them isn't feasible, and I wrote the following script.
My issue is that whenever I try this, the maximum file size is 900 kB, even though manual decompression gives each file around 6 MB. I am not sure if this is a flaw in my code, in how I am saving the data as a string to then copy to the file, or a problem with something else. I have tried this with different files and I know that it works for files smaller than 900 kB. Has anyone else had a similar problem and knows of a solution?
My code is below:
import numpy as np
import bz2
import os
import glob

def unzip_f(filepath):
    '''
    Input a filepath specifying a group of Himawari .bz2 files with common names
    Outputs the paths of all the temporary files that have been uncompressed
    '''
    cpath = os.getcwd()  # get current path
    filenames_ = []  # list of filenames for future use
    for zipped_file in glob.glob(filepath):  # loop over the files that meet the name criteria
        with bz2.BZ2File(zipped_file, 'rb') as zipfile:  # read in the bz2 files
            newfilepath = cpath + '/temp/' + zipped_file[-47:-4]  # create a temporary file
            with open(newfilepath, "wb") as tmpfile:  # open the temporary file
                for line in zipfile.readlines():
                    tmpfile.write(line)  # write the decompressed data to the temporary file
        filenames_.append(newfilepath)
    return filenames_

path_ = 'test/HS_H08_20180930_0710_B13_FLDK_R20_S*bz2'
unzip_f(path_)
It returns the correct file paths, but with the wrong sizes, capped at 900 kB.
It turns out this issue is due to the files being multi-stream, which does not work in Python 2.7. There is more info here, as mentioned by jasonharper, and here. Below is a solution that just uses the Unix bzip2 command to decompress the files and then moves them to the temporary directory I want. It is not as pretty, but it works.
import os
import glob
import shutil
import subprocess

def unzip_f(filepath):
    '''
    Input a filepath specifying a group of Himawari .bz2 files with common names
    Outputs the paths of all the temporary files that have been uncompressed
    '''
    cpath = os.getcwd()  # get current path
    filenames_ = []  # list of filenames for future use
    for zipped_file in glob.glob(filepath):  # loop over the files that meet the name criteria
        newfilepath = cpath + '/temp/'  # temporary directory
        newfilename = newfilepath + zipped_file[-47:-4]
        subprocess.call(['bzip2', '-kd', zipped_file])  # blocks until finished, unlike os.popen
        shutil.move(zipped_file[-47:-4], newfilepath)
        filenames_.append(newfilename)
    return filenames_

path_ = 'test/HS_H08_20180930_0710_B13_FLDK_R20_S0*bz2'
unzip_f(path_)
This is a known limitation in Python 2, where the BZ2File class doesn't support multiple streams.
It can easily be resolved by using bz2file (https://pypi.org/project/bz2file/), which is a backport of the Python 3 implementation and can be used as a drop-in replacement.
After running pip install bz2file you can just replace bz2 with it:
import bz2file as bz2, and everything should just work :)
The original Python bug report: https://bugs.python.org/issue1625
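On Python 3 no extra package is needed: the standard bz2 module reads across stream boundaries transparently. A quick self-contained check (building a two-stream file by concatenating two compressed chunks):

```python
import bz2

# A multi-stream .bz2 is just two compressed streams concatenated.
data = bz2.compress(b"first stream\n") + bz2.compress(b"second stream\n")

# Python 3's bz2.decompress handles the concatenation transparently,
# which is exactly what Python 2's BZ2File could not do.
print(bz2.decompress(data))  # b'first stream\nsecond stream\n'
```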

Walking through directory path and opening them with trimesh

I have the following code:
import os
import trimesh

# Core settings
rootdir = 'path'
extension = ".zip"

for root, dirs, files in os.walk(rootdir):
    if not root.endswith(".zip"):
        for file in files:
            if file.endswith(".stl"):
                mesh = trimesh.load(file)
And I get the following error:
ValueError: File object passed as string that is not a file!
When I open the files one by one, however, it works. What could be the reason?
That's because file is just the filename, not the full file path.
Fix that by joining it with the containing directory using os.path.join:
mesh = trimesh.load(os.path.join(root, file))
This is not a direct answer to your question. However, you might be interested in noting that there is now a less complicated paradigm for this situation. It involves using the pathlib module.
I don't use trimesh. I will process pdf documents instead.
First, you can identify all of the pdf files in a directory and its subdirectories recursively with just a single line.
>>> from pathlib import Path
>>> path = Path('C:/Quantarctica2')
>>> for item in path.glob('**/*.pdf'):
...     item
...
WindowsPath('C:/Quantarctica2/Quantarctica-Get_Started.pdf')
WindowsPath('C:/Quantarctica2/Quantarctica2_GetStarted.pdf')
WindowsPath('C:/Quantarctica2/Basemap/Terrain/BEDMAP2/tc-7-375-2013.pdf')
WindowsPath('C:/Quantarctica2/Scientific/Glaciology/ALBMAP/1st_ReadMe_ALBMAP_LeBrocq_2010_EarthSystSciData.pdf')
WindowsPath('C:/Quantarctica2/Scientific/Glaciology/ASAID/Bindschadler2011TC_GroundingLines.pdf')
WindowsPath('C:/Quantarctica2/Software/CIA_WorldFactbook_Antarctica.pdf')
WindowsPath('C:/Quantarctica2/Software/CIA_WorldFactbook_SouthernOcean.pdf')
WindowsPath('C:/Quantarctica2/Software/QGIS-2.2-UserGuide-en.pdf')
You will have noticed that (a) the complete paths are made available, and (b) the paths are available within object instances. Fortunately, it's easy to recover the full paths using str.
>>> import fitz
>>> for item in path.glob('**/*.pdf'):
...     doc = fitz.Document(str(item))
...
This line shows that the final pdf document has been loaded as a fitz document, ready for subsequent processing.
>>> doc
fitz.Document('C:\Quantarctica2\Software\QGIS-2.2-UserGuide-en.pdf')
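Applied back to the original question, the same recursive-glob pattern collects the .stl files with their full paths, so no os.walk or os.path.join is needed. A self-contained sketch (using a throwaway directory in place of the real mesh folder; the trimesh.load call is left as a comment since trimesh isn't needed to demonstrate the globbing):

```python
from pathlib import Path
import tempfile

# Build a small demo tree with nested .stl files (stand-ins for real meshes).
root = Path(tempfile.mkdtemp())
(root / "sub").mkdir()
(root / "a.stl").touch()
(root / "sub" / "b.stl").touch()
(root / "notes.txt").touch()

# '**/*.stl' recurses into subdirectories and yields full paths.
stl_paths = sorted(root.glob("**/*.stl"))
print([p.name for p in stl_paths])  # trimesh.load(str(p)) would go here
```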
