Join/merge multiple NetCDF files using xarray

I have a folder with NetCDF files covering 2006-2100, in ten-year blocks (2011-2020, 2021-2030, etc.).
I want to create a new NetCDF file which contains all of these files joined together. So far I have read in the files:
ds = xarray.open_dataset('Path/to/file/20062010.nc')
ds1 = xarray.open_dataset('Path/to/file/20112020.nc')
etc.
Then merged these like this:
dsmerged = xarray.merge([ds,ds1])
This works, but is clunky and there must be a simpler way to automate this process, as I will be doing this for many different folders full of files. Is there a more efficient way to do this?
EDIT:
Trying to join these files using glob:
for filename in glob.glob('path/to/file/*.nc'):
    dsmerged = xarray.merge([filename])
Gives the error:
AttributeError: 'str' object has no attribute 'items'
This is reading only the text of the filename, and not the actual file itself, so it can't merge it. How do I open, store as a variable, then merge without doing it bit by bit?

If you are looking for a clean way to merge all your datasets, you can combine a list comprehension with the xarray.merge function. The following is an illustration:
ds = xarray.merge([xarray.open_dataset(f) for f in glob.glob('path/to/file/*.nc')])
The out-of-memory issues you encountered probably arise because the files together hold more data than the Python process can fit in memory. The best fix for that is the xarray.open_mfdataset function, which uses the dask library under the hood to break the data into smaller chunks for processing. This is usually more memory efficient and will often allow you to bring your data into Python. With this function you do not need a for-loop; you can just pass it a glob string of the form "path/to/my/files/*.nc". The following is equivalent to the previous solution, but more memory efficient:
ds = xarray.open_mfdataset('path/to/file/*.nc')
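Since the same job has to be repeated for many folders, it may help to wrap the call in a small function. The following is only a sketch: the names list_netcdf_files and combine_folder are made up for this example, and combine="by_coords" assumes the files share coordinates (such as time) that xarray can use to put them in order.

```python
import glob
import os


def list_netcdf_files(folder):
    """Return the .nc files in `folder`, sorted by name."""
    return sorted(glob.glob(os.path.join(folder, "*.nc")))


def combine_folder(folder, out_path):
    """Combine every .nc file in `folder` and write the result to `out_path`."""
    # Imported here so that the file listing above works without xarray installed.
    import xarray as xr

    files = list_netcdf_files(folder)
    # combine="by_coords" orders the pieces by their coordinate values,
    # which suits files that hold consecutive blocks of years.
    with xr.open_mfdataset(files, combine="by_coords") as ds:
        ds.to_netcdf(out_path)
    return files
```

It could then be called once per folder, for example combine_folder('path/to/file', 'path/to/output/merged.nc'). Writing the result outside the input folder keeps a later run from picking up the merged file as an input.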
I hope this proves useful.

Related

h5py file subset taking more space than parent file?

I have an existing h5py file that I downloaded which is ~18G in size. It has a number of nested datasets within it:
h5f = h5py.File('input.h5', 'r')
data = h5f['data']
latlong_data = data['lat_long'].value
I want to be able to do some basic min/max scaling of the numerical data within lat_long, so I want to put it in its own h5py file for easier use and lower memory usage.
However, when I try to write it out to its own file:
out = h5py.File('latlong_only.h5', 'w')
out.create_dataset('latlong', data=latlong_data)
out.close()
The output file is incredibly large. It's still not done writing to disk and is ~85GB in space. Why is the data being written to the new file not compressed?

It could be that h5f['data/lat_long'] uses compression filters (and your new dataset doesn't). To check the original dataset's compression settings, use this line:
print(h5f['data/lat_long'].compression, h5f['data/lat_long'].compression_opts)
After writing my answer, it occurred to me that you don't need to copy the data to another file to reduce the memory footprint. Your code reads the dataset into an array, which is not necessary in most use cases. An h5py dataset object behaves much like a NumPy array. Instead, use ds = h5f['data/lat_long'] to create a dataset object (instead of an array) and use it as if it were a NumPy array. FYI, .value is a deprecated attribute for returning the dataset as an array (it was removed in h5py 3.0). Use this syntax instead: arr = h5f['data/lat_long'][()]. Loading the dataset into an array also requires more memory than using an h5py dataset object (which could be an issue with large datasets).
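To illustrate using the dataset object "like" an array, here is a rough sketch of min/max scaling that only ever holds one block of rows in memory. The function names and the rows_per_chunk value are assumptions made for this example; the code only relies on .shape and slicing, so it should work the same on a NumPy array or on an h5py dataset.

```python
import numpy as np


def chunked_min_max(dset, rows_per_chunk=100_000):
    """Global min and max of an array-like, read one block of rows at a time."""
    lo, hi = np.inf, -np.inf
    for start in range(0, dset.shape[0], rows_per_chunk):
        # Slicing reads only this block of rows into memory.
        block = np.asarray(dset[start:start + rows_per_chunk])
        lo = min(lo, block.min())
        hi = max(hi, block.max())
    return lo, hi


def scale_block(block, lo, hi):
    """Min/max scale one block to the range [0, 1] (assumes hi > lo)."""
    return (np.asarray(block) - lo) / (hi - lo)
```

With the file from the question this would be called as lo, hi = chunked_min_max(h5f['data/lat_long']), and scale_block can then be applied to whichever slices are needed.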
There are other ways to access the data. My suggestion to use dataset objects is one way. Your method (extracting data to a new file) is another. I am not fond of that approach because you now have two copies of the data, which is a bookkeeping nightmare. Another alternative is to create external links from the new file to the existing 18GB file. That way you have a small file that links to the big file (and no duplicate data). I describe that method in this post: How can I combine multiple .h5 file? (see Method 1: Create External Links).
If you still want to copy the data, here is what I would do. Your code reads the dataset into an array and then writes the array to the new file (uncompressed). Instead, copy the dataset using h5py's group .copy() method, which retains the compression settings and attributes.
See below:
with h5py.File('input.h5', 'r') as h5f1, \
     h5py.File('latlong_only.h5', 'w') as h5f2:
    h5f1.copy(h5f1['data/lat_long'], h5f2, 'latlong')

How can I import many binary files in Dask?

I have many binary files (.tdms format, similar to .wav) stored in S3 and I would like to read them with nptdms then process them in a distributed fashion with Dask on a cluster.
In PySpark there is pyspark.SparkContext.binaryFiles() which produces an RDD with a bytearray for each input file which is a simple solution to this problem.
I have not found an equivalent function in Dask - is there one? If not, how could the equivalent functionality be achieved in Dask?
I noticed there's dask.bytes.read_bytes(), in case it's necessary to involve it. However, nptdms can't read a chunk of a file: it needs the entire file to be available, and I'm not sure how to accomplish that.

dask.bytes.read_bytes() will give you whole files if you use blocksize=None, i.e., exactly one block per file. The most common use case for that is compressed files (e.g., gzip) where you can't start mid-stream, but it should work for your use case too. Note that the delayed objects you get each return bytes, not open files.
Alternatively, you can use fsspec.open_files. This returns OpenFile objects, which are safe to serialise, so you can use them in dask.delayed calls such as
import dask
import fsspec
from nptdms import TdmsFile
ofs = fsspec.open_files("s3://...", ...)
@dask.delayed
def read_a_file(of):
    with of as f:
        # entering context actually touches storage
        return TdmsFile.read(f)
tdms = [read_a_file(of) for of in ofs]

How to read a lot of excel files in python pandas?

I have lots of Excel files (xlsx format) and want to read and process them.
For example, the file names are ex201901, ex201902, ..., ex201912.
The names follow the format exYYYYMM.
Importing these files into pandas one at a time is easy:
import pandas as pd
df201901 = pd.read_excel(r'C:\users\ex201901.xlsx')
df201902 = pd.read_excel(r'C:\users\ex201902.xlsx')
df201903 = pd.read_excel(r'C:\users\ex201903.xlsx')
df201904 = pd.read_excel(r'C:\users\ex201904.xlsx')
....
df201912 = pd.read_excel(r'C:\users\ex201912.xlsx')
However, this seems boring and tedious.
In SAS, I would use the Macro() syntax, but in Python I have no idea how to handle this.
Can you help me handle these repeated jobs in an easy way, like a SAS MACRO()?
Thanks for reading.

Given that you'll probably want to work with all the data frames at once afterwards, putting them into separate local variables is a code smell. In general, whenever a task feels repetitive because you're doing the same thing over and over again, that calls for a loop of some sort. As you're planning to use pandas, chances are that you'll be iterating again soon (now that you have your files, you're probably going to transform their rows), so you'll be best off looking into how control flow such as loops works in Python (and indeed in pandas) in general; good tutorials are plentiful.
In your particular case, depending on what kind of processing you are planning on doing afterwards, you'd probably benefit from having something like
df2019 = [pd.read_excel(rf'C:\users\ex2019{str(i).zfill(2)}.xlsx') for i in range(1, 13)]
With that, you can access the individual data frames through e.g. df2019[5] to get the data frame corresponding to June, or you can collapse all of them into a single data frame using df = pd.concat(df2019) if that's what suits your needs.
If you have less structure in your file names, glob can come in handy. With that, the above could become something like
import glob
df2019 = list(map(pd.read_excel, glob.glob(r'C:\users\ex2019*.xlsx')))
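One thing to keep in mind with glob is that the order of the results is not guaranteed, so list positions may not line up with months. A possible variation, sketched below, keeps each frame under its file name instead. The function name load_monthly and its pattern and reader parameters are invented for this example; reader defaults to pd.read_excel but can be swapped for another pandas reader.

```python
from pathlib import Path

import pandas as pd


def load_monthly(folder, pattern="ex2019*.xlsx", reader=pd.read_excel):
    """Read every matching file into a dict keyed by file stem, e.g. 'ex201906'."""
    # Sort the paths so the dict is filled in a predictable (month) order.
    paths = sorted(Path(folder).glob(pattern))
    return {path.stem: reader(path) for path in paths}
```

With frames = load_monthly(r'C:\users'), June is frames['ex201906'], and pd.concat(frames) stacks everything into one data frame with the file name as the outer index level.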

You can use Python's os module. Its listdir function returns the names of all the files in a folder. Check the code below:
import os, re
import pandas as pd
dfList = []
for aFile in os.listdir(FILE_PATH):
    # match ex201901.xlsx through ex201912.xlsx
    if re.fullmatch(r'ex2019(0[1-9]|1[0-2])\.xlsx', aFile):
        tmpDf = pd.read_excel(os.path.join(FILE_PATH, aFile))
        dfList.append(tmpDf)
outDf = pd.concat(dfList)

How to export a list of dataframes from R to Python?

I'm currently working with functional MRI data in R but I need to import it to Python for some faster analysis. How can I do that in an efficient way?
I currently have a list of 198135 dataframes in R. Each of them has 5 variables and 84 observations of connectivity between brain regions. I need to have the same 198135 dataframes in Python to run some specific analysis there (with the same structure as in R: one object that contains all dataframes separately).
Initially I tried exporting a .RDS file from R and then importing it into Python using "pyreadr", but I'm getting empty objects in every attempt with the "pyreadr.read_r()" function.
My other method was to save every dataframe of the R list as a separate .csv file and then import them into Python. That way I could get what I wanted (I tried it with only 100 dataframes to test the code). The problem with this method is that it is highly inefficient and slow.
I found several answers to similar problems, but most of them were to merge all dataframes and load it as a unique .csv into Python, which is not the solution I need.
Is there some more efficient way to do this process, without altering the data structure that I mentioned?
Thanks for your help!
# This is the code in R for an example
a <- as.data.frame(cbind(c(1:3), c(1:3), c(4:6), c(7:9)))
b <- as.data.frame(cbind(c(11:13), c(21:23), c(64:66), c(77:79)))
c <- as.data.frame(cbind(c(31:33), c(61:63), c(34:36), c(57:59)))
d <- as.data.frame(cbind(c(12:14), c(13:15), c(54:56), c(67:69)))
e <- as.data.frame(cbind(c(31:33), c(51:53), c(54:56), c(37:39)))
somelist_of_df <- list(a,b,c,d,e)
saveRDS(somelist_of_df, "somefile.rds")
## This is the function I used from pyreadr in Python
import pyreadr
results = pyreadr.read_r('/somepath/somefile.rds')

Thanks for the help in the other answers, but it's not exactly what I was looking for (I wanted to export just one file with the list of dataframes within it, and then load that single file into Python, keeping the same structure). To use feather you have to break the list into its individual dataframes, much like saving separate .csv files, and then load each one of them into Python (or R). Still, it must be said that it's much faster than the method with .csv.
I leave the code that I used successfully in this separate answer; maybe it could be useful for other people, since I used a simple loop to load the dataframes into Python as a list:
## Exporting a list of dataframes from R to .feather files
library(feather) #required package
a <- as.data.frame(cbind(c(1:3), c(1:3), c(4:6), c(7:9))) #Example DFs
b <- as.data.frame(cbind(c(11:13), c(21:23), c(64:66), c(77:79)))
c <- as.data.frame(cbind(c(31:33), c(61:63), c(34:36), c(57:59)))
d <- as.data.frame(cbind(c(12:14), c(13:15), c(54:56), c(67:69)))
e <- as.data.frame(cbind(c(31:33), c(51:53), c(54:56), c(37:39)))
somelist_of_df <- list(a,b,c,d,e)
## With sapply you loop over the list for creating the .feather files
sapply(seq_along(1:length(somelist_of_df)),
       function(i) write_feather(somelist_of_df[[i]],
                                 paste0("/your/directory/","DF",i,".feather")))
(Using just a MacBook Air, the code above took less than 5 seconds to run for a list of 198135 DFs)
## Importing .feather files into a list of DFs in Python
import os
import feather
os.chdir('/your/directory')
directory = '/your/directory'
py_list_of_DFs = []
for filename in os.listdir(directory):
    DF = feather.read_dataframe(filename)
    py_list_of_DFs.append(DF)
(This code did the job for me, although it was a bit slow: it took 12 minutes to load the 198135 DFs)
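One caveat with the loop above: os.listdir returns names in arbitrary order, so the Python list may not follow the order of the R list, and a plain alphabetical sort would put DF10 before DF2. A small helper along these lines could restore the numbering. The name numeric_order is made up for this sketch, and the pattern assumes the DF<i>.feather names written by the R code.

```python
import re


def numeric_order(filenames, pattern=r"DF(\d+)\.feather$"):
    """Keep only names like 'DF12.feather' and sort them by their number."""
    numbered = []
    for name in filenames:
        match = re.search(pattern, name)
        if match:
            # Store the number as an int so that 2 sorts before 10.
            numbered.append((int(match.group(1)), name))
    return [name for _, name in sorted(numbered)]
```

The loop would then iterate over numeric_order(os.listdir(directory)), which also skips any stray files that are not feather files.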
I hope this could be useful for somebody.

This package may be of some interest to you

pandas can also read .feather files directly:
pd.read_feather()

pyreadr cannot currently read R lists, so you need to save the dataframes individually. To host multiple dataframes in one file, save them to an RDA file:
# first construct a list with the names of dataframes you want to save
# instead of the dataframes themselves
somelist_of_df <- list("a", "b", "c", "d", "e")
do.call("save", c(somelist_of_df, file="somefile.rda"))
or any other variant as described here.
Then you can read the file in python:
import pyreadr
results = pyreadr.read_r('/somepath/somefile.rda')
The advantage is that there will be only one file with all dataframes.

I cannot comment on @crlagos0's answer because I lack the reputation, but I want to add a couple of things:
seq_along(list_of_things) is enough; there is no need to write seq_along(1:length(list_of_things)) in R. Also, I want to point out that the official package to read and write feather files in R is called arrow, and you can find its documentation here. In Python it is pyarrow.

Increase speed numpy.loadtxt?

I have hundreds of thousands of data text files to read. As of now, I'm importing the data from the text files every time I run the code. Perhaps the easy solution would be to simply reformat the data into a file that is faster to read.
Anyway, right now every text file looks like this:
User: unknown
Title : OE1_CHANNEL1_20181204_103805_01
Sample data
Wavelength OE1_CHANNEL1
185.000000 27.291955
186.000000 27.000877
187.000000 25.792290
188.000000 25.205620
189.000000 24.711882
.
.
.
The code where I read and import the txt files is:
# IMPORT DATA
import os
import sys
import numpy as np
path = 'T2'
if len(sys.argv) == 2:
    path = sys.argv[1]
files = os.listdir(path)
trans_import = []
for item in files:
    # read each file in turn, skipping the four header lines
    trans_import.append(np.loadtxt(os.path.join(path, item), dtype=float, skiprows=4, usecols=(0, 1)))
The resulting array looks in the variable explorer as:
{ndarray} = [[185. 27.291955]\n [186. 27.000877]\n ... ]
I'm wondering how I could speed up this part. It takes a little too long as of now just to import ~4k text files. There are 841 lines in every text file (spectrum). The output I get with this code is 841 * 2 = 1682. Obviously, it considers the \n as a line...

It would probably be much faster if you had one large file instead of many small ones; this is generally more efficient. Additionally, you might get a speedup from saving the NumPy array directly and loading that .npy file instead of reading a large text file. I'm not as sure about the last part, though. As always when time is a concern, I would try both of these options and then measure the performance improvement.
If for some reason you really can't have just one large text file / .npy file, you could also probably get a speedup by using, e.g., multiprocessing to have multiple workers reading the files at the same time. Then you can concatenate the matrices together at the end.
Not your primary question, but since it seems to be an issue: you can rewrite the text files to not have those extra newlines, but I don't think np.loadtxt can ignore them. If you're open to using pandas, though, pandas.read_csv with skip_blank_lines=True (the default) should handle that for you. To get a numpy.ndarray from a pandas.DataFrame, just do dataframe.to_numpy().
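As a rough sketch of the .npy idea above: parse the text files once, store them as a single array, and load that array on later runs. The function names and the cache path are assumptions for this example, and np.stack only works if every file has the same number of rows (841 in the question).

```python
import os

import numpy as np


def build_cache(folder, cache_path):
    """Parse every .txt file once and save one (n_files, n_rows, 2) array."""
    names = sorted(n for n in os.listdir(folder) if n.endswith(".txt"))
    spectra = [
        np.loadtxt(os.path.join(folder, n), skiprows=4, usecols=(0, 1))
        for n in names
    ]
    stacked = np.stack(spectra)  # fails if the files differ in length
    np.save(cache_path, stacked)  # cache_path should end in .npy
    return stacked


def load_spectra(folder, cache_path):
    """Load the cached array if it exists, otherwise build it from the text files."""
    if os.path.exists(cache_path):
        return np.load(cache_path)
    return build_cache(folder, cache_path)
```

The first call to load_spectra('T2', 'T2_spectra.npy') pays the full parsing cost; later calls only read the binary file. Whether that is fast enough is something to measure, as suggested above.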

Let's use pandas.read_csv (which parses at C speed) instead of numpy.loadtxt. This is a very helpful post:
http://akuederle.com/stop-using-numpy-loadtxt
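For the file layout shown in the question, a read_csv call might look roughly like the sketch below. The column names and the function name read_spectrum are made up for this example, and the sample text is the excerpt from the question, shortened to three data rows.

```python
import io

import pandas as pd

SAMPLE = """User: unknown
Title : OE1_CHANNEL1_20181204_103805_01
Sample data
Wavelength OE1_CHANNEL1
185.000000 27.291955
186.000000 27.000877
187.000000 25.792290
"""


def read_spectrum(source):
    """Read one spectrum (path or file-like object) into an (n_rows, 2) array."""
    frame = pd.read_csv(
        source,
        sep=r"\s+",      # columns are separated by whitespace
        skiprows=4,      # the four header lines shown in the question
        header=None,
        names=["wavelength", "value"],
    )
    return frame.to_numpy()


spectrum = read_spectrum(io.StringIO(SAMPLE))
print(spectrum.shape)  # (3, 2)
```

In the original loop, read_spectrum(os.path.join(path, item)) would take the place of the np.loadtxt call.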
