How to read a gzip netCDF file in Python?

I have a working python program that reads in a number of large netCDF files using the Dataset command from the netCDF4 module. Here is a snippet of the relevant parts:
from netCDF4 import Dataset
import glob
infile_root = 'start_of_file_name_'
for infile in sorted(glob.iglob(infile_root + '*')):
    ncin = Dataset(infile, 'r')
    ncin.close()
I want to modify this to read in netCDF files that are gzipped. The files themselves were gzipped after creation; they are not internally compressed (i.e., the files are *.nc.gz). If I were reading in gzipped text files, the command would be:
from netCDF4 import Dataset
import glob
import gzip
infile_root = 'start_of_file_name_'
for infile in sorted(glob.iglob(infile_root + '*.gz')):
    f = gzip.open(infile, 'rb')
    file_content = f.read()
    f.close()
After googling around for maybe half an hour and reading through the netCDF4 documentation, the only way I can come up with to do this for netCDF files is:
from netCDF4 import Dataset
import glob
import os
infile_root = 'start_of_file_name_'
for infile in sorted(glob.iglob(infile_root + '*.gz')):
    os.system('gzip -d ' + infile)
    ncin = Dataset(infile[:-3], 'r')
    ncin.close()
    os.system('gzip ' + infile[:-3])
Is it possible to read gzip files with the Dataset command directly? Or without otherwise calling gzip through os?

Reading datasets from memory is supported since netCDF4-1.2.8 (Changelog):
import netCDF4
import gzip
with gzip.open('test.nc.gz') as gz:
    with netCDF4.Dataset('dummy', mode='r', memory=gz.read()) as nc:
        print(nc.variables)
See the description of the memory parameter in the Dataset documentation.

Because netCDF4-python wraps the C netCDF4 library, you're out of luck as far as using the gzip module to pass in a file-like object. The only option, as suggested by @tdelaney, is to use gzip to extract to a temporary file.
If you happen to have any control over the creation of these files, netCDF version 4 files support zlib compression internally, so gzipping them afterwards is superfluous. It might also be worth converting the files from version 3 to version 4 if you need to process them repeatedly.
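For reference, a minimal sketch of writing a netCDF4 file with internal zlib compression (the file, dimension, and variable names here are made up for illustration); existing version 3 files can also be converted and compressed with the nccopy utility, e.g. nccopy -d4 in.nc out.nc:
import numpy as np
from netCDF4 import Dataset

# Hypothetical example: write a NETCDF4 file whose variable is compressed
# internally with zlib, so no external gzip step is needed afterwards.
with Dataset('compressed_example.nc', 'w', format='NETCDF4') as nc:
    nc.createDimension('time', 100)
    var = nc.createVariable('temperature', 'f4', ('time',),
                            zlib=True, complevel=4)
    var[:] = np.random.rand(100)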

Since I just had to solve the same problem, here is a ready-made solution:
import gzip
import os
import shutil
import tempfile
import netCDF4
def open_netcdf(fname):
    if fname.endswith(".gz"):
        infile = gzip.open(fname, 'rb')
        tmp = tempfile.NamedTemporaryFile(delete=False)
        shutil.copyfileobj(infile, tmp)
        infile.close()
        tmp.close()
        data = netCDF4.Dataset(tmp.name)
        os.unlink(tmp.name)
    else:
        data = netCDF4.Dataset(fname)
    return data
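A usage sketch for the helper above; the file-name pattern is a placeholder borrowed from the question:
import glob

# Hypothetical usage: open gzipped and plain netCDF files transparently.
for fname in sorted(glob.iglob('start_of_file_name_*')):
    data = open_netcdf(fname)
    print(fname, list(data.variables))
    data.close()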

Related

Access data from a particular directory with fewer lines of code

Assume I have a CSV file data.csv located in the directory 'C:\\Users\\rp603\\OneDrive\\Documents\\Python Scripts\\Basics\\tutorials\\Revision\\datasets'. Using this code, I can access my CSV file:
## read the csv file from a particular folder
import pandas as pd
import glob
files = glob.glob(r"C:\\Users\\rp603\\OneDrive\\Documents\\Python Scripts\\Basics\\tutorials\\Revision\\datasets*.csv")
df = pd.DataFrame()
for f in files:
    csv = pd.read_csv(f)
    df = df.append(csv)
But as you can see, the CSV file path is long. Is there any way to do the same operation while shortening both the path to my data and the code?
use the "dot" notation for a relative path (it does not depend on the programming language)
# example for a "shorter" version of the path
import os
my_current_position = '.' # where you launch the program
files = '' # from above
print(os.path.relpath(files, my_current_position))
Note that relpath is order-sensitive.
You can also use a context manager to open the file; it is not shorter, but more elegant:
with open(file, 'r') as fd:
    data_table = pd.read_csv(fd)
If you put your script in the same directory as the datasets, you can simply do:
import glob
files = glob.glob("datasets*.csv")

Read compressed binary file (.grib2.bz2)

I have downloaded one of the files from this list: https://opendata.dwd.de/weather/nwp/icon-eu/grib/03/t_2m/ (the actual filenames change every day). The files are bz2-compressed.
I can read in the decompressed file using e.g.
import xarray as xr
# cfgrib + dependencies are also required
grib1 = xr.open_dataset("icon-eu_europe_regular-lat-lon_single-level_2020101212_001_ASHFL_S.grib2", engine='cfgrib')
However, I would like to read in the compressed file.
I tried things like
with bz2.open("icon-eu_europe_regular-lat-lon_single-level_2020101818_002_ASWDIFD_S.grib2.bz2", "rb") as f:
    xr.open_dataset(f, engine='cfgrib')
but this does not work.
I am looking for any way to programmatically read in the compressed file.
I had the same issue when processing numerical weather prediction data.
What I do is download the file and hold it as a binary object (e.g. with urlopen or requests), then pass this object to the following function:
import bz2, shutil
from io import BytesIO
from pathlib import Path
def bunzip_store(file: BytesIO, local_intermediate_file: Path):
    with bz2.BZ2File(file) as fr, local_intermediate_file.open(mode="wb") as fw:
        shutil.copyfileobj(fr, fw)
The unzipped file will be stored under local_intermediate_file. Now you should be able to open this file.
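An end-to-end usage sketch; the URL is a placeholder since the real filenames on opendata.dwd.de change daily, and requests plus cfgrib are assumed to be installed:
from io import BytesIO
from pathlib import Path

import requests
import xarray as xr

# Placeholder URL; substitute a current filename from the DWD listing.
url = "https://opendata.dwd.de/weather/nwp/icon-eu/grib/03/t_2m/some_file.grib2.bz2"

resp = requests.get(url)
resp.raise_for_status()

tmp_path = Path("t_2m.grib2")
bunzip_store(BytesIO(resp.content), tmp_path)  # helper from above

ds = xr.open_dataset(tmp_path, engine='cfgrib')
print(ds)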

Extracting compressed .gz and .bz2 s3 files automatically

I have a data dump from Wikipedia of about 30 files, each about 2.5 GB uncompressed. I want to extract these files automatically, but as I understand it I cannot use Lambda because of its file-size limitations.
I found an alternative solution using SQS to trigger an EC2 instance, which I am working on. However, for that to work, my script needs to read all compressed files (.gz and .bz2) from the S3 bucket and its folders and extract them.
But when using Python's zipfile module, I receive the following error:
zipfile.BadZipFile: File is not a zip file
Is there a solution to this?
This is my code:
import boto3
from io import BytesIO
import zipfile
s3_resource = boto3.resource('s3')
zip_obj = s3_resource.Object(bucket_name="backupwikiscrape", key= 'raw/enwiki-20200920-pages-articles-multistream1.xml-p1p41242.bz2')
buffer = BytesIO(zip_obj.get()["Body"].read())
z = zipfile.ZipFile(buffer)
for filename in z.namelist():
    file_info = z.getinfo(filename)
    s3_resource.meta.client.upload_fileobj(
        z.open(filename),
        Bucket='backupwikiextract',
        Key=f'{filename}'
    )
The above code doesn't seem to be able to extract the above formats. Any suggestions?
Your file is bz2-compressed, so you should use Python's bz2 library.
To decompress your object:
decompressed_bytes = bz2.decompress(zip_obj.get()["Body"].read())
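Putting that together, a rough sketch that decompresses the object in memory and uploads the result to the target bucket (bucket and key names are copied from the question; holding the whole file in memory only works if it fits in RAM):
import bz2
from io import BytesIO

import boto3

s3_resource = boto3.resource('s3')
zip_obj = s3_resource.Object(
    bucket_name="backupwikiscrape",
    key='raw/enwiki-20200920-pages-articles-multistream1.xml-p1p41242.bz2')

# Decompress the whole object in memory (bz2.decompress handles multi-stream data).
decompressed_bytes = bz2.decompress(zip_obj.get()["Body"].read())

# Upload the decompressed data to the destination bucket, dropping the .bz2 suffix.
s3_resource.meta.client.upload_fileobj(
    BytesIO(decompressed_bytes),
    Bucket='backupwikiextract',
    Key='enwiki-20200920-pages-articles-multistream1.xml-p1p41242')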
I suggest you use smart_open; it's much easier, and it handles both gz and bz2 files.
from smart_open import open
import boto3
s3_session = boto3.Session()
with open(path_to_my_file, transport_params={'session': s3_session}) as fin:
    for line in fin:
        print(line)
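smart_open can also write to S3, so a rough sketch of streaming the decompressed content straight to the target bucket might look like this (bucket and key names follow the question and are placeholders; newer smart_open versions expect a boto3 client rather than a session in transport_params):
from smart_open import open
import boto3

session = boto3.Session()
params = {'session': session}  # use {'client': boto3.client('s3')} on smart_open >= 5

src = 's3://backupwikiscrape/raw/enwiki-20200920-pages-articles-multistream1.xml-p1p41242.bz2'
dst = 's3://backupwikiextract/enwiki-20200920-pages-articles-multistream1.xml-p1p41242'

# smart_open decompresses .bz2/.gz transparently on read and streams the upload.
with open(src, 'rb', transport_params=params) as fin:
    with open(dst, 'wb', transport_params=params) as fout:
        for chunk in iter(lambda: fin.read(1024 * 1024), b''):
            fout.write(chunk)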

Limit on bz2 file decompression using python?

I have numerous files that are compressed in the bz2 format and I am trying to uncompress them in a temporary directory using python to then analyze. There are hundreds of thousands of files so manually decompressing the files isn't feasible so I wrote the following script.
My issue is that whenever I try to do this, the maximum file size is 900 kb, even though manual decompression gives each file a size of around 6 MB. I am not sure whether this is a flaw in my code, in how I am saving the data as a string before copying it to the file, or a problem with something else. I have tried this with different files and I know that it works for files smaller than 900 kb. Has anyone else had a similar problem and knows of a solution?
My code is below:
import numpy as np
import bz2
import os
import glob
def unzip_f(filepath):
    '''
    Input a filepath specifying a group of Himawari .bz2 files with common names
    Outputs the path of all the temporary files that have been uncompressed
    '''
    cpath = os.getcwd()  # get current path
    filenames_ = []  # list to add filenames to for future use
    for zipped_file in glob.glob(filepath):  # loop over the files that meet the name criteria
        with bz2.BZ2File(zipped_file, 'rb') as zipfile:  # read in the bz2 files
            newfilepath = cpath + '/temp/' + zipped_file[-47:-4]  # create a temporary file
            with open(newfilepath, "wb") as tmpfile:  # open the temporary file
                for i, line in enumerate(zipfile.readlines()):
                    tmpfile.write(line)  # write the data from the compressed file to the temporary file
        filenames_.append(newfilepath)
    return filenames_
path_='test/HS_H08_20180930_0710_B13_FLDK_R20_S*bz2'
unzip_f(path_)
It returns the correct file paths, but with the wrong sizes, capped at 900 kb.
It turns out this issue is due to the files being multi-stream, which is not supported in Python 2.7, as mentioned by jasonharper. Below is a solution that just uses the Unix command to decompress the bz2 files and then moves them to the temporary directory I want. It is not as pretty, but it works.
import numpy as np
import os
import glob
import shutil
def unzip_f(filepath):
    '''
    Input a filepath specifying a group of Himawari .bz2 files with common names
    Outputs the path of all the temporary files that have been uncompressed
    '''
    cpath = os.getcwd()  # get current path
    filenames_ = []  # list to add filenames to for future use
    for zipped_file in glob.glob(filepath):  # loop over the files that meet the name criteria
        newfilepath = cpath + '/temp/'  # path to the temporary directory
        newfilename = newfilepath + zipped_file[-47:-4]
        os.popen('bzip2 -kd ' + zipped_file)
        shutil.move(zipped_file[-47:-4], newfilepath)
        filenames_.append(newfilename)
    return filenames_
path_='test/HS_H08_20180930_0710_B13_FLDK_R20_S0*bz2'
unzip_f(path_)
This is a known limitation in Python 2, where the BZ2File class doesn't support multiple streams.
This can easily be resolved by using bz2file, https://pypi.org/project/bz2file/, which is a backport of the Python 3 implementation and can be used as a drop-in replacement.
After running pip install bz2file you can just replace bz2 with it:
import bz2file as bz2 and everything should just work :)
The original Python bug report: https://bugs.python.org/issue1625
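For what it's worth, Python 3's bz2 module handles multi-stream archives natively, so on Python 3 the temporary copy can be written directly; a minimal sketch with a placeholder filename following the question's naming pattern:
import bz2
import shutil

# Python 3's bz2 supports multi-stream archives, so a straight copy works.
src = 'test/HS_H08_20180930_0710_B13_FLDK_R20_S0101.DAT.bz2'  # placeholder name
dst = 'temp/HS_H08_20180930_0710_B13_FLDK_R20_S0101.DAT'

with bz2.open(src, 'rb') as fin, open(dst, 'wb') as fout:
    shutil.copyfileobj(fin, fout)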

python reading ini files

With Python I want to transform Joomla ini language files to SQL. However, the Joomla ini files lack any section header (for example: [translations]).
RawConfigParser almost does the job, but it demands a section, so I construct a temp file with a 'dummy' section named [ALL]:
fout = tempfile.NamedTemporaryFile(delete=True)
fin = file(self._inFilename, "r")
fout.write("[ALL]\n")
for f in fin.read():
    fout.write(f)
config = ConfigParser.RawConfigParser(allow_no_value=True)
config.read(fout.name)
for c in config.items("ALL"):
    self._ini2sql(unicode(c[0]).upper(), unicode('de'), unicode(c[1][1:-1]))
However, this is definitely not the most elegant solution. Any tips to make this more Pythonic?
You could use a StringIO instead of creating an actual file:
from cStringIO import StringIO
import shutil
data = StringIO()
data.write('[ALL]\n')
with open(self._inFilename, 'r') as f:
    shutil.copyfileobj(f, data)
data.seek(0)
config.readfp(data)
You can use StringIO instead, which keeps the content in RAM:
import cStringIO
fout = cStringIO.StringIO()
fout.write("[ALL]\n")
with open(self._inFilename) as fobj:
    fout.write(fobj.read())
fout.seek(0)
config = ConfigParser.RawConfigParser(allow_no_value=True)
config.readfp(fout)
Please note there are some optimizations here compared to your code which are worth learning:
Always close a file safely. This is done with the with statement.
You were iterating over each character of the input and writing it one at a time. This is unnecessary and a serious performance drawback.
As an alternative to ConfigParser I would really recommend the configobj library, which has a much cleaner and more pythonic API (and does not require a default section). Example:
from configobj import ConfigObj
config = ConfigObj('myConfigFile.ini')
config.get('key1')
config.get('key2')
Reading .ini file in current directory
import configparser
import os
ini_file = configparser.ConfigParser()
ini_file_path = os.path.join(os.path.dirname(__file__),"filename.ini")
ini_file.read(ini_file_path)  # ini_file can now be used like a dictionary of sections
print(ini_file["key1"])
