Limit on bz2 file decompression using Python?

I have numerous files compressed in the bz2 format, and I am trying to decompress them into a temporary directory using Python so I can then analyze them. There are hundreds of thousands of files, so decompressing them manually isn't feasible, and I wrote the following script.
My issue is that whenever I run it, each output file is capped at 900 KB, even though decompressing the same files manually yields files of around 6 MB. I am not sure whether this is a flaw in my code, in how I write the data out to the temporary file, or something else entirely. I have tried this with different files and know that it works correctly for files smaller than 900 KB. Has anyone else had a similar problem and found a solution?
My code is below:
import numpy as np
import bz2
import os
import glob

def unzip_f(filepath):
    '''
    Input a filepath specifying a group of Himawari .bz2 files with common names
    Outputs the path of all the temporary files that have been uncompressed
    '''
    cpath = os.getcwd()  # get current path
    filenames_ = []  # list to add filenames to for future use
    for zipped_file in glob.glob(filepath):  # loop over the files that meet the name criteria
        with bz2.BZ2File(zipped_file, 'rb') as zipfile:  # read in the bz2 file
            newfilepath = cpath + '/temp/' + zipped_file[-47:-4]  # path of the temporary file
            with open(newfilepath, "wb") as tmpfile:  # open the temporary file
                for i, line in enumerate(zipfile.readlines()):
                    tmpfile.write(line)  # write the data from the compressed file to the temporary file
        filenames_.append(newfilepath)
    return filenames_
path_='test/HS_H08_20180930_0710_B13_FLDK_R20_S*bz2'
unzip_f(path_)
It returns the correct file paths, but the file sizes are wrong, capped at 900 KB.

It turns out this issue is due to the files being multi-stream, which the bz2 module in Python 2.7 does not support. There is more info here, as mentioned by jasonharper, and here. Below is a solution that just uses the Unix bzip2 command to decompress the files and then moves them to the temporary directory I want. It is not as pretty, but it works.
import numpy as np
import os
import glob
import shutil

def unzip_f(filepath):
    '''
    Input a filepath specifying a group of Himawari .bz2 files with common names
    Outputs the path of all the temporary files that have been uncompressed
    '''
    cpath = os.getcwd()  # get current path
    filenames_ = []  # list to add filenames to for future use
    for zipped_file in glob.glob(filepath):  # loop over the files that meet the name criteria
        newfilepath = cpath + '/temp/'  # path of the temporary directory
        newfilename = newfilepath + zipped_file[-47:-4]
        os.popen('bzip2 -kd ' + zipped_file)
        shutil.move(zipped_file[-47:-4], newfilepath)
        filenames_.append(newfilename)
    return filenames_
path_='test/HS_H08_20180930_0710_B13_FLDK_R20_S0*bz2'
unzip_f(path_)
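As a side note, not part of the original answer: os.popen returns immediately without waiting for bzip2 to finish, so the shutil.move can race the decompression. A rough sketch of the same loop using subprocess.run, which blocks until bzip2 exits, might look like this (it assumes filepath and cpath as defined in the function above):
import glob
import shutil
import subprocess

for zipped_file in glob.glob(filepath):
    subprocess.run(['bzip2', '-kd', zipped_file], check=True)  # wait for decompression to finish
    decompressed = zipped_file[:-4]                            # bzip2 writes the output next to the input, minus the .bz2 suffix
    shutil.move(decompressed, cpath + '/temp/')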

This is a known limitation in Python 2, where the BZ2File class doesn't support multiple streams.
It can be easily resolved by using bz2file (https://pypi.org/project/bz2file/), which is a backport of the Python 3 implementation and can be used as a drop-in replacement.
After running pip install bz2file you can just replace bz2 with it:
import bz2file as bz2 and everything should just work :)
The original Python bug report: https://bugs.python.org/issue1625
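As an illustration only, here is roughly how the drop-in replacement would slot into the question's function (zipped_file and newfilepath as defined there):
import bz2file as bz2  # drop-in replacement for the standard bz2 module

with bz2.BZ2File(zipped_file, 'rb') as zipfile:  # multi-stream archives are now read in full
    with open(newfilepath, 'wb') as tmpfile:
        tmpfile.write(zipfile.read())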

Related

Compressing Excel file does not reduce the size

I am using Python to automate the conversion of an Excel file to a zip archive. The file is converted to zip using the code below.
import os
from zipfile import ZipFile
path = r'C:/Users/kj/Desktop/Results'
os.chdir(path)
new_path = "C:/Users/kj/Desktop/New folder/"
ZipFile(new_path+"Week6.zip", 'w').write("Week6.csv")
The problem is that this code creates the zip file but does not reduce its size; it stays the same. When I zip the file manually, the size is reduced considerably. Please suggest what can be done to get the same result from the code.
That is because the default compression algorithm is ZIP_STORED, which simply stores the files uncompressed. You have to specify a compression algorithm explicitly, for example LZMA:
ZipFile(new_path+"Week6.zip", mode='w', compression=ZIP_LZMA)
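A possible self-contained version, combining this with the question's code (ZIP_LZMA kept from the answer; ZIP_DEFLATED would be another common choice):
from zipfile import ZipFile, ZIP_LZMA

new_path = "C:/Users/kj/Desktop/New folder/"
with ZipFile(new_path + "Week6.zip", mode='w', compression=ZIP_LZMA) as zf:
    zf.write("Week6.csv")  # now actually compressed, not just stored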

Access data from a particular directory with fewer lines of code

Assume I have a csv file data.csv located in the following directory: 'C:\\Users\\rp603\\OneDrive\\Documents\\Python Scripts\\Basics\\tutorials\\Revision\\datasets'. Using this code, I can access my csv file:
## read the csv file from a particular folder
import pandas as pd
import glob

files = glob.glob(r"C:\\Users\\rp603\\OneDrive\\Documents\\Python Scripts\\Basics\\tutorials\\Revision\\datasets*.csv")
df = pd.DataFrame()
for f in files:
    csv = pd.read_csv(f)
    df = df.append(csv)
But as you can see, the csv file path is long. Is there any way to do the same operation while shortening both the path to my data and the code itself?
use the "dot" notation for a relative path (it does not depend on the programming language)
# example for a "shorter" version of the path
import os
my_current_position = '.' # where you launch the program
files = '' # from above
print(os.path.relpath(files, my_current_position)
Remark relpath is order sensitive
You can also use a context manager to open the file; it is not shorter, but it is more elegant:
with open(file, 'r') as fd:
    data_table = pd.read_csv(fd)
If you put your script in the same directory as the datasets folder, you can simply do:
import glob
files = glob.glob("datasets/*.csv")
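Another option, not mentioned in the answers above: build the path relative to the script's own location with pathlib, so it works no matter where the script is launched from. This sketch assumes the datasets folder sits next to the script:
from pathlib import Path
import pandas as pd

data_dir = Path(__file__).parent / "datasets"                  # folder next to this script
df = pd.concat(pd.read_csv(f) for f in data_dir.glob("*.csv"))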

Read compressed binary file (.grib2.bz2)

I have downloaded one of the files from this list: https://opendata.dwd.de/weather/nwp/icon-eu/grib/03/t_2m/ (the actual filenames change every day). These files are bz2-compressed.
I can read in the decompressed file using e.g.
import xarray as xr
# cfgrib + dependencies are also required
grib1 = xr.open_dataset("icon-eu_europe_regular-lat-lon_single-level_2020101212_001_ASHFL_S.grib2", engine='cfgrib')
However, I would like to read in the compressed file.
I tried things like
import bz2

with bz2.open("icon-eu_europe_regular-lat-lon_single-level_2020101818_002_ASWDIFD_S.grib2.bz2", "rb") as f:
    xr.open_dataset(f, engine='cfgrib')
but this does not work.
I am looking for any way to programmatically read in the compressed file.
I had the same issue when processing numerical weather prediction data.
What I do is download the file and hold it as a binary object in memory (e.g. with urlopen or requests), then pass that object into the following function:
import bz2, shutil
from io import BytesIO
from pathlib import Path

def bunzip_store(file: BytesIO, local_intermediate_file: Path):
    with bz2.BZ2File(file) as fr, local_intermediate_file.open(mode="wb") as fw:
        shutil.copyfileobj(fr, fw)
The decompressed file will be stored under local_intermediate_file. Now you should be able to open this file.
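For illustration, a rough end-to-end sketch of how this could be used; the URL is a placeholder (the real filenames on the DWD server change daily) and the output filename is arbitrary:
import requests
import xarray as xr
from io import BytesIO
from pathlib import Path

# placeholder URL; pick an actual file from the DWD directory listing
url = "https://opendata.dwd.de/weather/nwp/icon-eu/grib/03/t_2m/some_file.grib2.bz2"

raw = BytesIO(requests.get(url).content)       # compressed file held in memory
out = Path("icon_eu_t2m.grib2")                # local intermediate file
bunzip_store(raw, out)                         # function defined above
ds = xr.open_dataset(out, engine="cfgrib")     # read the decompressed GRIB2 file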

Using gdal in python to produce tiff files from csv files

I have many csv files with this format:
Latitude,Longitude,Concentration
53.833399,-122.825257,0.021957
53.837893,-122.825238,0.022642
....
My goal is to produce GeoTiff files based on the information within these files (one tiff file per csv file), preferably using Python. This was done several years ago on the project I am working on; however, how it was done has since been lost. All I know is that they most likely used GDAL.
I have attempted to do this by researching how to use GDAL, but that has not got me anywhere, as resources are limited and I have no prior knowledge of the library.
Can someone help me with this?
Here is a little code I adapted for your case. You need to have the GDAL directory with all the *.exe utilities added to your PATH for it to work (in most cases it is C:\Program Files (x86)\GDAL).
It uses the gdal_grid utility (see the docs here: http://www.gdal.org/gdal_grid.html).
You can modify the gdal_cmd variable as you wish to suit your needs.
import subprocess
import os

# your directory with all your csv files in it
dir_with_csvs = r"C:\my_csv_files"

# make it the active directory
os.chdir(dir_with_csvs)

# function to get the csv filenames in the directory
def find_csv_filenames(path_to_dir, suffix=".csv"):
    filenames = os.listdir(path_to_dir)
    return [filename for filename in filenames if filename.endswith(suffix)]

# get the filenames
csvfiles = find_csv_filenames(dir_with_csvs)

# loop through each CSV file
# for each CSV file, make an associated VRT file to be used with the gdal_grid command
# and then run the gdal_grid util in a subprocess instance
for fn in csvfiles:
    vrt_fn = fn.replace(".csv", ".vrt")
    lyr_name = fn.replace('.csv', '')
    out_tif = fn.replace('.csv', '.tiff')
    with open(vrt_fn, 'w') as fn_vrt:
        fn_vrt.write('<OGRVRTDataSource>\n')
        fn_vrt.write('\t<OGRVRTLayer name="%s">\n' % lyr_name)
        fn_vrt.write('\t\t<SrcDataSource>%s</SrcDataSource>\n' % fn)
        fn_vrt.write('\t\t<GeometryType>wkbPoint</GeometryType>\n')
        fn_vrt.write('\t\t<GeometryField encoding="PointFromColumns" x="Longitude" y="Latitude" z="Concentration"/>\n')
        fn_vrt.write('\t</OGRVRTLayer>\n')
        fn_vrt.write('</OGRVRTDataSource>\n')
    gdal_cmd = 'gdal_grid -a invdist:power=2.0:smoothing=1.0 -zfield "Concentration" -of GTiff -ot Float64 -l %s %s %s' % (lyr_name, vrt_fn, out_tif)
    subprocess.call(gdal_cmd, shell=True)

Compile latex from python

I have written a Python function that compiles a passed string to a PDF file using LaTeX. The function works as expected and has been quite useful, so now I am looking for ways to improve it.
The code which I have:
def generate_pdf(pdfname, table):
    """
    Generates the pdf from string
    """
    import subprocess
    import os
    f = open('cover.tex', 'w')
    tex = standalone_latex(table)
    f.write(tex)
    f.close()
    proc = subprocess.Popen(['pdflatex', 'cover.tex'])
    proc.communicate()
    os.unlink('cover.tex')
    os.unlink('cover.log')
    os.unlink('cover.aux')
    os.rename('cover.pdf', pdfname)
The problem with the code is that it creates a bunch of files named cover in the working directory, which are only removed afterwards.
How can I avoid creating unneeded files in the working directory?
Solution
def generate_pdf(pdfname, tex):
    """
    Generates the pdf from string
    """
    import subprocess
    import os
    import tempfile
    import shutil
    current = os.getcwd()
    temp = tempfile.mkdtemp()
    os.chdir(temp)
    f = open('cover.tex', 'w')
    f.write(tex)
    f.close()
    proc = subprocess.Popen(['pdflatex', 'cover.tex'])
    proc.communicate()
    os.rename('cover.pdf', pdfname)
    shutil.copy(pdfname, current)
    os.chdir(current)  # leave the temporary directory before deleting it
    shutil.rmtree(temp)
Use a temporary directory. Temporary directories are always writable and can be cleared by the operating system after a restart. The tempfile library lets you create temporary files and directories in a secure way.
import shutil
import tempfile

path_to_temporary_directory = tempfile.mkdtemp()
# work in the temporary directory
# ...
# move the necessary files to the destination
shutil.move(source, destination)
# delete the temporary directory (recommended)
shutil.rmtree(path_to_temporary_directory)
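A variant worth noting, not part of the original answer: tempfile.TemporaryDirectory is a context manager that removes the directory automatically, so the cleanup step cannot be forgotten. A minimal sketch applied to the pdflatex case, assuming the LaTeX source is already in tex:
import os
import shutil
import subprocess
import tempfile

def generate_pdf(pdfname, tex):
    with tempfile.TemporaryDirectory() as temp:  # deleted automatically on exit
        with open(os.path.join(temp, 'cover.tex'), 'w') as f:
            f.write(tex)
        # run pdflatex inside the temporary directory so the aux/log files stay there
        subprocess.run(['pdflatex', 'cover.tex'], cwd=temp, check=True)
        shutil.copy(os.path.join(temp, 'cover.pdf'), pdfname)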
