Merging multiple .npz files into a single .npz file - python

I have multiple .npz files of the same kind in a folder, and I want to merge them all into a single .npz file in that folder.
I have tried the code below, but it doesn't seem to combine the multiple .npz files into one.
Here is the code:
import numpy as np
file_list = ['image-embeddings\img-emb-1.npz', 'image-embeddings\img-emb-2.npz']
data_all = [np.load(fname) for fname in file_list]
merged_data = {}
for data in data_all:
    [merged_data.update({k: v}) for k, v in data.items()]
np.savez('new_file.npz', **merged_data)
Here, img-emb-1.npz and img-emb-2.npz hold different values.

Maybe try the following to construct merged_data:
arrays_read = dict(
    chain.from_iterable(np.load(file(arr_name)).items() for arr_name in arrays.keys())
)
Full example:
from itertools import chain
import os

import numpy as np

file = lambda name: f"arrays/{name}.npz"

# Make sure the target directory exists, otherwise np.savez fails.
os.makedirs("arrays", exist_ok=True)

# Create data
arrays = {f"arr{i:02d}": np.random.randn(10, 20) for i in range(10)}

# Save data in separate files
for arr_name, arr in arrays.items():
    np.savez(file(arr_name), **{arr_name: arr})

# Read all files into a dict
arrays_read = dict(
    chain.from_iterable(np.load(file(arr_name)).items() for arr_name in arrays.keys())
)

# Save into a single file
np.savez(file("arrays"), **arrays_read)

# Load to compare
arrays_read_single = dict(np.load(file("arrays")).items())
assert arrays_read.keys() == arrays_read_single.keys()
for k in arrays_read.keys():
    assert np.array_equal(arrays_read[k], arrays_read_single[k])
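Note that if the input .npz files share array names (for example the default arr_0 key that np.savez assigns to positional arrays), the dict update in the question's code silently overwrites earlier entries, which can look as if nothing was appended. A minimal sketch that avoids collisions by prefixing each key with its source file's stem (the paths are the ones from the question):
from pathlib import Path

import numpy as np

file_list = ['image-embeddings/img-emb-1.npz', 'image-embeddings/img-emb-2.npz']
merged_data = {}
for fname in file_list:
    stem = Path(fname).stem              # e.g. "img-emb-1"
    with np.load(fname) as data:
        for k, v in data.items():
            merged_data[f"{stem}_{k}"] = v   # unique key per source file
np.savez('new_file.npz', **merged_data)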

Related

Sorting a large amount of data into separate sets

I'm extracting up to 2500 frame files per experiment (not always the same amount), and my process at the moment is to manually divide the total number of frames by three to separate them into three subset folders, since the file size is too large to convert them all into a single .mat file. I simply want to automate this.
Once the files are separated into three subsets ('Subset1', 'Subset2', 'Subset3'), I run each folder through my code to convert and rename.
from scipy.io import savemat
import numpy as np
import os

arrays = []
directory = r"F:\...\Experiment 24\Imaging\Subset3"  # **something here that will look at the whole directory and create a different file for each subset folder**
sorted(os.listdir(directory))
for filename in sorted(os.listdir(directory)):
    f = os.path.join(directory, filename)
    arrays.append(np.load(f))
data = np.array(arrays)
data = data.astype('uint16')
data = np.moveaxis(data, [0, 1, 2], [2, 1, 0])
savemat('24_subset3.mat', {'data': data})
How can I automatically sort my frame files into three separate subset folders and convert?
Create subsets from the filenames and copy them to new subset directories:
import os
import shutil

num_subsets = 3
in_dir = "/some/path/to/input"
out_dir = "/some/path/to/output/subsets"

filenames = sorted(os.listdir(in_dir))
chunk_size = len(filenames) // num_subsets

for i in range(num_subsets):
    subset = filenames[i * chunk_size : (i + 1) * chunk_size]
    # Create subset output directory.
    subset_dir = f"{out_dir}/subset_{i}"
    os.makedirs(subset_dir, exist_ok=True)
    for filename in subset:
        # os.listdir returns bare names, so rebuild the source path.
        shutil.copyfile(f"{in_dir}/{filename}", f"{subset_dir}/{filename}")
NOTE: Any extra files that cannot be distributed into equal subsets will be skipped.
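If you would rather not skip those leftovers, you can hand out the remainder one extra file at a time to the first few subsets; a minimal sketch reusing the filenames and num_subsets variables from above:
# Chunk sizes differ by at most one, so every file ends up in some subset.
base, extra = divmod(len(filenames), num_subsets)
start = 0
for i in range(num_subsets):
    size = base + (1 if i < extra else 0)
    subset = filenames[start : start + size]
    start += size
    # ...then copy `subset` into its directory exactly as in the loop above.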
If your goal is simply to create your three .mat files, you don't necessarily need to create subfolders and move your files around at all; you can iterate through subsets of them in-place. You could manually calculate the indexes at which to divide into subsets, but more_itertools.divide is convenient and readable.
Additionally, pathlib is usually a more convenient way of manipulating paths and filenames. No more worrying about os.path.join! The paths yielded by Path.iterdir or Path.glob know where they're located, and don't need to be recombined with their parent.
from pathlib import Path

from more_itertools import divide
import numpy as np
from scipy.io import savemat

directory = Path("F:/.../Experiment 24/Imaging/")
subsets = divide(3, sorted(directory.iterdir()))

for index, subset in enumerate(subsets, start=1):
    arrays = [np.load(file) for file in subset]
    data = np.array(arrays).astype('uint16')
    data = np.moveaxis(data, [0, 1, 2], [2, 1, 0])
    savemat(f'24_subset{index}.mat', {'data': data})

Reading multiple text files and separating them into different arrays in Python

I have this code in MATLAB
txtFiles = dir('*.txt') ; % loads txt files
N = length(txtFiles) ;
for i = 1:N
    data = importdata(txtFiles(i).name);
    x = data(:,1);
    y(:,i) = data(:,2) ;
end
This takes all 100 of my txt files, creates an array for x, then stores the y data in a separate array where each column corresponds to a different txt file's values.
Is there a similar trick in Python?
This is how the data files are constructed:
896.5000000000 0.8694710776
896.7500000000 0.7608314184
897.0000000000 0.6349069122
897.2500000000 0.5092121001
897.5000000000 0.3955858698
There are 50 of them, and each one has about 1000 rows like this.
My solution so far jams it all into a massive list which is impossible to handle. In MATLAB, the second column of each text file goes into its own column of an array, and I can easily cycle through them.
This is my solution:
#%%
import matplotlib.pyplot as plt
import os
import numpy as np
import glob

# This can be further shortened, but that would result in less clear code.

# Get the file names and read them.
data = [open(file).read() for file in glob.glob('*.txt')]

# Each block of data has its lines split and is put
# into a separate list:
# [['data1 - line1', 'data1 - line2'], ['data2 - line1', ...]]
data = [block.splitlines() for block in data]

x, y = [], []
# Take each line within each file's data,
# split it, and append to x and y.
for file in glob.glob('*.txt'):
    # open the file
    with open(file) as _:
        # read and split into lines
        for line in _.read().splitlines():
            # split the columns by whitespace
            # example line: -2000 data1
            # output: ['-2000', 'data1']
            line = line.split()
            # append to lists
            x.append(line[0])
            y.append(line[1])
Your files are pretty much csv files, and could be read using np.loadtxt or pd.read_csv.
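For instance, a minimal sketch with np.loadtxt (whitespace is the default delimiter; the files/ directory name just matches the code further below):
import pathlib

import numpy as np

files = sorted(pathlib.Path("files/").glob("*.txt"))
columns = [np.loadtxt(f) for f in files]           # each array has shape (n_rows, 2)
x = columns[0][:, 0]                               # first column, shared across files
y = np.stack([c[:, 1] for c in columns], axis=-1)  # column i holds file i's second column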
But, as you did, you can also extract the values from the text yourself; the following will work for any number of columns:
def extract_values(text, sep=" ", dtype=float):
    return (
        np.array(x, dtype=dtype)
        for x in zip(*(l.split(sep) for l in text.splitlines()))
    )
Then just concatenate the results in the shape you want:
import pathlib

dir_in = pathlib.Path("files/")

indexes, datas = zip(
    *(
        extract_values(f.read_text())
        for f in sorted(dir_in.glob("*.txt"))
    )
)
index = np.stack(indexes, axis=-1)
data = np.stack(datas, axis=-1)
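Assuming every file shares the same first column, the result can then be used much like the MATLAB arrays; a small illustrative snippet:
import matplotlib.pyplot as plt

x = index[:, 0]                  # shared first column
for i in range(data.shape[1]):
    plt.plot(x, data[:, i])      # equivalent of MATLAB's y(:, i)
plt.show()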

Python for loop over non-numeric folders in hdf5 file

I want to pull the numbers from an HDF5 data file, in which the data sits in folders (groups) with increasing numbers:
Folder_001, Folder_002, Folder_003, ... Folder_100.
In each folder, the dataset I want to pull has the same name: 'Time'. So, to pull the numbers out of every folder, I am trying to loop over the folder names, yet I still can't figure out how to structure the code. I did the following:
f = h5.File('name.h5'.'r')
folders = list(f.keys())
for i in folders:
    dataset_folder = f['i']
import h5py as h5

f = h5.File('name.h5', 'r')  # comma between the filename and the mode, not a period
groups = f.keys()
adict = {}
for key in groups:
    agroup = f[key]
    ds = agroup['Time']  # a dataset
    arr = ds[:]          # read the dataset into a numpy array
    adict[key] = arr
Now adict should be a dictionary with keys like 'Folder_001', and values being the respective Time array. You could also collect those arrays in a list.
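If every 'Time' dataset has the same length, the per-group arrays can also be stacked into a single 2-D array; a minimal sketch under that assumption:
import numpy as np

keys = sorted(adict)                        # e.g. ['Folder_001', 'Folder_002', ...]
times = np.stack([adict[k] for k in keys])  # shape: (n_folders, n_samples)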

How do I import a load of csvs into different python dataframes via a loop?

I have a load of csv files. I want to create a loop that allows me to do this:
df_20180731 = pd.read_csv('path/cust_20180731.csv')
for each of about 36 files.
My files are df_20160131, df_20160231 ...... df_20181231, etc. - basically month-end dates.
Thanks
import pandas

# include here all ids
files = ['20160131', '20160231']

_g = globals()
for f in files:
    _g['df_{}'.format(f)] = pandas.read_csv('path/cust_{}.csv'.format(f))

print(df_20160131)
You could do something like:
import glob
import pandas as pd
datasets = {}
for file in glob.glob('path/df_*'):
    datasets[file] = pd.read_csv(file)
import os
import pandas as pd

# get a list of all the files in the directory
path = <path of the directory containing all the files>
files = os.listdir(path)

# iterate over all the files and store them in a dictionary
# (os.listdir returns bare names, so join them with the directory path)
dataframe = {file: pd.read_csv(os.path.join(path, file)) for file in files}

# if the directory contains other files as well, you can filter the file
# names with any logic (extension etc.), in that case:
def logic(fname):
    return '.csv' in fname

dataframe = {file: pd.read_csv(os.path.join(path, file)) for file in files if logic(file)}
# this will create a dictionary of file : dataframe_objects
I hope it helps
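With either dictionary approach, each dataframe is then looked up by its key rather than by a generated variable name; a small illustrative example (the file name here is hypothetical):
df = dataframe['cust_20180731.csv']   # pick one month-end file
print(df.head())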

How can I combine several database files with numpy?

I know that I can read a file with numpy with the genfromtxt command. It works like this:
data = numpy.genfromtxt('bmrbtmp',unpack=True,names=True,dtype=None)
I can plot the stuff in there easily with:
ax.plot(data['field'],data['field2'], linestyle=" ",color="red")
or
ax.boxplot(data)
and it's awesome. What I would really like to do now is read a whole folder of files and combine them into one giant dataset. How do I add data points to the data structure?
And how do I read a whole folder at once?
To visit all the files in a directory, use os.walk.
To stack two structured numpy arrays "vertically", use np.vstack.
To save the result, use np.savetxt to save in a text format, or np.save to save the array in a (smaller) binary format.
import os
import numpy as np

result = None
for root, dirs, files in os.walk('.', topdown=True):
    for filename in files:
        with open(os.path.join(root, filename), 'r') as f:
            data = np.genfromtxt(f, unpack=True, names=True, dtype=None)
        if result is None:
            result = data
        else:
            result = np.vstack((result, data))

print(result[:10])  # print the first 10 rows
np.save('/tmp/outfile.npy', result)
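The saved binary file can later be reloaded without re-parsing the text files; a minimal check:
loaded = np.load('/tmp/outfile.npy')
print(loaded.shape, loaded.dtype)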
