h5py file subset taking more space than parent file?

h5py file subset taking more space than parent file? - python

I have an existing h5py file that I downloaded which is ~18G in size. It has a number of nested datasets within it:
h5f = h5py.File('input.h5', 'r')
data = h5f['data']
latlong_data = data['lat_long'].value
I want to be able to some basic min/max scaling of the numerical data within latlong, so i want to put it in its own h5py file for easier use and lower memory usage.
However, when i try to write it out to its own file:
out = h5py.File('latlong_only.h5', 'w')
out.create_dataset('latlong', data=latlong)
out.close()
The output file is incredibly large. It's still not done writing to disk and is ~85GB in space. Why is the data being written to the new file not compressed?

Could be h5f['data/lat_long'] is using compression filters (and you aren't). To check the original dataset's compression settings, use this line:
print (h5f['data/latlong'].compression, h5f['data/latlong'].compression_opts)
After writing my answer, it occurred to me that you don't need to copy the data to another file to reduce the memory footprint. Your code reads the dataset into an array, which is not necessary in most use cases. A h5py dataset object behaves similar to a NumPy array. Instead, use this line: ds = h5f1['data/latlong'] to create a dataset object (instead of an array) and use it "like" it's a NumPy array. FYI, .value is a deprecated method to return the dataset as an array. Use this syntax instead arr = h5f1['data/latlong'][()]. Loading the dataset into an array also requires more memory than using an h5py object (which could be an issue with large datasets).
There are other ways to access the data. My suggestion to use dataset objects is 1 way. Your method (extracting data to a new file) is another way. I am not found of that approach because you now have 2 copies of the data; a bookkeeping nightmare. Another alternative is to create external links from the new file to the existing 18GB file. That way you have a small file that links to the big file (and no duplicate data). I describe that method in this post: [How can I combine multiple .h5 file?][1] Method 1: Create External Links.
If you still want to copy the data, here is what I would do. Your code reads the dataset into an array then writes the array to the new file (uncompressed). Instead, copy the dataset using h5py's group .copy() method, it will retain compression settings and attributes.
See below:
with h5py.File('input.h5', 'r') as h5f1, \
h5py.File('latlong_only.h5', 'w') as h5f2:
h5f1.copy(h5f1['data/latlong'], h5f2,'latlong')

Related

Converting CSV to numpy NPY efficiently

How to convert a .csv file to .npy efficently?
I've tried:
import numpy as np
filename = "myfile.csv"
vec =np.loadtxt(filename, delimiter=",")
np.save(f"{filename}.npy", vec)
While the above works for smallish file, the actual .csv file I'm working on has ~12 million lines with 1024 columns, it takes quite a lot to load everything into RAM before converting into an .npy format.
Q (Part 1): Is there some way to load/convert a .csv to .npy efficiently for large CSV file?
The above code snippet is similar to the answer from Convert CSV to numpy but that won't work for ~12M x 1024 matrix.
Q (Part 2): If there isn't any way to to load/convert a .csv to .npy efficiently, is there some way to iteratively read the .csv file into .npy efficiently?
Also, there's an answer here https://stackoverflow.com/a/53558856/610569 to save the csv file as numpy array iteratively. But seems like the np.vstack isn't the best solution when reading the file. The accepted answer there suggests hdf5 but the format is not the main objective of this question and the hdf5 format isn't desired in my use-case since I've to read it back into a numpy array afterwards.
Q (Part 3): If part 1 and part2 are not possible, are there other efficient storage (e.g. tensorstore) that can store and efficiently convert to numpy array when loading the saved storage format?
There is another library tensorstore that seems to efficiently handles arrays which support conversion to numpy array when read, https://google.github.io/tensorstore/python/tutorial.html. But somehow there isn't any information on how to save the tensor/array without the exact dimensions, all of the examples seem to include configurations like 'dimensions': [1000, 20000],.
Unlike the HDF5, the tensorstore doesn't seem to have reading overhead issues when converting to numpy, from docs:
Conversion to an numpy.ndarray also implicitly performs a synchronous read (which hits the in-memory cache since the same region was just retrieved)

Nice question; Informative in itself.
I understand you want to have the whole data set/array in memory, eventually, as a NumPy array. I assume, then, you have enough (RAM) memory to host such array -- 12M x 1K.
I don't specifically know about how np.loadtxt (genfromtxt) is operating behind the scenes, so I will tell you how I would do (after trying like you did).
Reasoning about memory...
Notice that a simple boolean array will cost ~12 GBytes of memory:
>>> print("{:.1E} bytes".format(
np.array([True]).itemsize * 12E6 * 1024
))
1.2E+10 bytes
And this is for a Boolean data type. Most likely, you have -- what -- a dataset of Integer, Float? The size may increase quite significantly:
>>> np.array([1], dtype=bool).itemsize
1
>>> np.array([1], dtype=int).itemsize
8
>>> np.array([1], dtype=float).itemsize
8
It's a lot of memory (which you know, just want to emphasize).
At this point, I would like to point out a possible swapping of the working memory. You may have enough physical (RAM) memory in your machine, but if not enough of free memory, your system will use the swap memory (i.e, disk) to keep your system stable & have the work done. The cost you pay is clear: read/writing from/to the disk is very slow.
My point so far is: check the data type of your dataset, estimate the size of your future array, and guarantee you have that minimum amount of RAM memory available.
I/O text
Considering you do have all the (RAM) memory necessary to host the whole numpy array: I would then loop over the whole (~12M lines) text file, filling the pre-existing array row-by-row.
More precisely, I would have the (big) array already instantiated before start reading the file. Only then, I would read each line, split the columns, and give it to np.asarray and assign those (1024) values to each respective row of the output array.
The looping over the file is slow, yes. The thing here is that you limit (and control) the amount of memory being used. Roughly speaking, the big objects consuming your memory are the "output" (big) array, and the "line" (1024) array. Sure, there are quite a considerable amount of memory being consumed in each loop in the temporary objects during reading (text!) values, splitting into list elements and casting to an array. Still, it's something that will remain largely constant during the whole ~12M lines.
So, the steps I would go through are:
0) estimate and guarantee enough RAM memory available
1) instantiate (np.empty or np.zeros) the "output" array
2) loop over "input.txt" file, create a 1D array from each line "i"
3) assign the line values/array to row "i" of "output" array
Sure enough, you can even make it parallel: If on one hand text files cannot be randomly (r/w) accessed, on the other hand you can easily split them (see How can I split one text file into multiple *.txt files?) to have -- if fun is at the table -- them read in parallel, if that time if critical.
Hope that helps.

TL;DR
Export to a different function other than .npy seems inevitable unless your machine is able to handle the size of the data in-memory as per described in #Brandt answer.
Reading the data, then processing it (Kinda answering Q part 2)
To handle data size larger than what the RAM can handle, one would often resort to libraries that performs "out-of-core" computation, e.g. turicreate.SFrame, vaex or dask . These libraries would be able to lazily load the .csv files into dataframes and process them by chunks when evaluated.
from turicreate import SFrame
filename = "myfile.csv"
sf = SFrame.read_csv(filename)
sf.apply(...) # Trying to process the data
or
import vaex
filename = "myfile.csv"
df = vaex.from_csv(filename,
convert=True,
chunk_size=50_000_000)
df.apply(...)
Converting the read data into numpy array (kinda answering Q part 1)
While out-of-core libraries can read and process the data efficiently, converting into numpy is an "in-memory" operation, the machine needs to have enough RAM to fit all data.
The turicreate.SFrame.to_numpy documentation writes:
Converts this SFrame to a numpy array
This operation will construct a numpy array in memory. Care must be taken when size of the returned object is big.
And the vaex documentation writes:
In-memory data representations
One can construct a Vaex DataFrame from a variety of in-memory data representations.
And dask best practices actually reimplemented their own array objects that are simpler than numpy array, see https://docs.dask.org/en/stable/array-best-practices.html. But when going through the docs, it seems like the format they have saved the dask array in are not .npy but various other formats.
Writing the file into non-.npy versions (answering Q Part 3)
Given the numpy arrays are inevitably in-memory, trying to save the data into one single .npy isn't the most viable option.
Different libraries seems to have different solutions for storage. E.g.
vaex saves the data into hdf5 by default if the convert=True argument is set when data is read through vaex.from_csv()
sframe saves the data into their own binary format
dask export functions save to_hdf() and to_parquet() format

It it's latest version (4.14) vaex support "streaming", i.e. lazy loading of CSV files. It uses pyarrow under the hood so it is supper fast. Try something like
df = vaex.open(my_file.csv)
# or
df = vaex.from_csv_arrow(my_file.csv, lazy=True)
Then you can export to bunch of formats as needed, or keep working with it like that (it is surprisingly fast). Of course, it is better to convert to some kind of binary format..

import numpy as np
import pandas as pd
# Define the input and output file names
csv_file = 'data.csv'
npy_file = 'data.npy'
# Create dummy data
data = np.random.rand(10000, 100)
df = pd.DataFrame(data)
df.to_csv(csv_file, index=False)
# Define the chunk size
chunk_size = 1000
# Read the header row and get the number of columns
header = pd.read_csv(csv_file, nrows=0)
num_cols = len(header.columns)
# Initialize an empty array to store the data
data = np.empty((0, num_cols))
# Loop over the chunks of the csv file
for chunk in pd.read_csv(csv_file, chunksize=chunk_size):
# Convert the chunk to a numpy array
chunk_array = chunk.to_numpy()
# Append the chunk to the data array
data = np.append(data, chunk_array, axis=0)
np.save(npy_file, data)
# Load the npy file and check the shape
npy_data = np.load(npy_file)
print('Shape of data before conversion:', data.shape)
print('Shape of data after conversion:', npy_data.shape)```

I'm not aware of any existing function or utility that directly and efficiently converts csv files into npy files. With efficient I guess primarily meaning with low memory requirements.
Writing a npy file iteratively is indeed possible, with some extra effort. There's already a question on SO that addresses this, see:
save numpy array in append mode
For example using the NpyAppendArray class from Michael's answer you can do:
with open('data.csv') as csv, NpyAppendArray('data.npy') as npy:
for line in csv:
row = np.fromstring(line, sep=',')
npy.append(row[np.newaxis, :])
The NpyAppendArray class updates the npy file header on every call to append, which is a bit much for your 12M rows. Maybe you could update the class to (optionally) only write the header on close. Or you could easily batch the writes:
batch_lines = 128
with open('data.csv') as csv, NpyAppendArray('data.npy') as npy:
done = False
while not done:
batch = []
for count, line in enumerate(csv):
row = np.fromstring(line, sep=',')
batch.append(row)
if count + 1 >= batch_lines:
break
else:
done = True
npy.append(np.array(batch))
(code is not tested)

Convert huge csv to hdf5 format

I downloaded IBM's Airline Reporting Carrier On-Time Performance Dataset; the uncompressed CSV is 84 GB. I want to run an analysis, similar to Flying high with Vaex, with the vaex libary.
I tried to convert the CSV to a hdf5 file, to make it readable for the vaex libary:
import time
import vaex
start=time.time()
df = vaex.from_csv(r"D:\airline.csv", convert=True, chunk_size=1000000)
end=time.time()
print("Time:",(end-start),"Seconds")
I always get an error when running the code:
RuntimeError: Dirty entry flush destroy failed (file write failed: time = Fri Sep 30 17:58:55 2022
, filename = 'D:\airline.csv_chunk_8.hdf5', file descriptor = 7, errno = 22, error message = 'Invalid argument', buf = 0000021EA8C6B128, total write size = 2040, bytes this sub-write = 2040, bytes actually written = 18446744073709551615, offset = 221133661).
Second run, I get this error:
RuntimeError: Unable to flush file's cached information (file write failed: time = Fri Sep 30 20:18:19 2022
, filename = 'D:\airline.csv_chunk_18.hdf5', file descriptor = 7, errno = 22, error message = 'Invalid argument', buf = 000002504659B828, total write size = 2048, bytes this sub-write = 2048, bytes actually written = 18446744073709551615, offset = 348515307)
Is there an alternative way to convert the CSV to hdf5 without Python? For example, a downloadable software which can do this job?

I'm not familiar with vaex, so can't help with usage and functions. However, I can read error messages. :-)
It reports "bytes written" with a huge number (18_446_744_073_709_551_615), much larger than the 84GB CSV. Some possible explanations:
you ran out of disk
you ran out of memory, or
had some other error
To diagnose, try testing with a small csv file and see if vaex.from_csv() works as expected. I suggest the lax_to_jfk.csv file.
Regarding your question, is there an alternative way to convert a csv to hdf5?, why not use Python?
Are you more comfortable with other languages? If so, you can install HDF5 and write your code with their C or Fortran API.
OTOH, if you are familiar with Python, there are other packages you can use to read the CSV file and create the HDF5 file.
Python packages to read the CSV
Personally, I like NumPy's genfromtxt() to read the CSV (You can also use loadtxt() to read the CSV, if you don't have missing values and don't need the field names.) However, I think you will run into memory problems reading a 84GB file. That said, you can use the skip_header and max_rows parameters with genfromtxt() to read and load a subset of lines. Alternately you can use csv.DictReader(). It reads a line at a time. So, you avoid memory issues, but it could be very slow loading the HDF5 file.
Python packages to create the HDF5 file
I have used both h5py and pytables (aka tables) to create and read HDF5 files. Once you load the CSV data to a NumPy array, it's a snap to create the HDF5 dataset.
Here is a very simple example that reads the lax_to_jfk.csv data and loads to a HDF5 file.
csv_name = 'lax_to_jfk'
rec_arr = np.genfromtxt(csv_name+'.csv', delimiter=',',
dtype=None, names=True, encoding='bytes')
with h5py.File(csv_name+'.h5', 'w') as h5f:
h5f.create_dataset(csv_name,data=rec_arr)
Update:
After posting this example, I decided to test with a larger file (airline_2m.csv). It's 861 MB, and has 2M rows. I discovered the code above doesn't work. However, it's not because of the number of rows. The problem is the columns (field names). Turns out the data isn't as clean; there are 109 field names on row 1, and some rows have 111 columns of data. As a result, the auto-generated dtype doesn't have a matching field. While investigating this, I also discovered many rows only have the values for first 56 fields. In other words, fields 57-111 are not very useful. One solution to this is to add the usecols=() parameter. Code below reflects this modification, and works with this test file. (I have not tried testing with your large file airline.csv. Given it's size likely you will need to read and load incrementally.)
csv_name = 'airline_2m'
rec_arr = np.genfromtxt(csv_name+'.csv', delimiter=',',
dtype=None, names=True, encoding='bytes') #,
usecols=(i for i in range(56)) )
with h5py.File(csv_name+'.h5', 'w') as h5f:
h5f.create_dataset(csv_name,data=rec_arr)

I tried reproducing your example. I believe the problem you are facing is quite common when dealing with CSVs. The schema is not known.
Sometimes there are "mixed types" and pandas (used underneath vaex's read_csv or from_csv ) casts those columns as dtype object.
Vaex does not really support such mixed dtypes, and requires each column to be of a single uniform type (kind of a like a database).
So how to go around this? Well, the best way I can think of is to use the dtype argument to explicitly specify the types of all columns (or those that you suspect or know to have mixed types). I know this file has like 100+ columns and that's annoying.. but that is also kind of the price to pay when using a format such as CSV...
Another thing i noticed is the encoding.. using pure pandas.read_csv failed at some point because of encoding and requires one to add encoding="ISO-8859-1". This is also supported by vaex.open (since the args are just passed down to pandas).
In fact if you want to do manually what vaex.open does automatically for you (given that this CSV file might not be as clean as one would hope), do something like (this is pseudo code but I hope close to the real thing)
# Iterate over the file in chunks
for i, df_tmp in enumerate(pd.read_csv(file, chunksize=11_000_000, encoding="ISO-8859-1", dtype=dtype)):
# Assert or check or do whatever needs doing to ensure column types are as they should be
# Pass the data to vaex (this does not take extra RAM):
df_vaex = vaex.from_pandas(df_tmp)
# Export this chunk into HDF5
# df_vaex.export_hdf5(f'chunk_{i}.hdf5')
# When the above loop finishes, just concat and export the data to a single file if needed (gives some performance benefit).
df = vaex.open('chunk*.hdf5')
df.export_hdf5('converted.hdf5', progress='rich')
I've seen potentially much better/faster way of doing this with vaex, but it is not released yet (i saw it in the code repo on github), so I will not go into it, but if you can install from source, and want me to elaborate further feel free to drop a comment.
Hope this at least gives some ideas on how to move forward.
EDIT:
In last couple of versions of vaex core, vaex.open() opens all CSV files lazily, so then just export to hdf5/arrow directly, it will do it in one go. Check the docs for more details: https://vaex.io/docs/guides/io.html#Text-based-file-formats

How to avoid memory mapping when loading a numpy file

Csv file:
0,0,0,0,0,0,0,0,0,0.32,0.21,0,0.16,0,0,0,0,0,0,0.32
0,0,0,0,0,0,0.17,0,0.04,0,0,0.25,0.03,0.32,0,0.02,0.05,0.03,0.08,0
0.08,0.07,0.09,0.06,0,0,0.21,0.02,0,0,0,0,0,0,0,0.1,0.36,0,0,0
[goes on always 20 columns and x number of rows]
I'm saving the array this way:
with open(csv_profile) as csv_file:
array = np.loadtxt(csv_file, delimiter=",",dtype='str')
npy_profile=open(outfile, "wb")
np.save(npy_profile, array)
Which is saved as u4 instead of f8 which is what I need.
I noticed this error in the datatype as the output file says
<93>NUMPY^A^#v^#{'descr': '<U4', 'fortran_order': False, 'shape': (680, 20), }
Also when I load it:
profile_matrix=np.load(npy_profile,"r")
the class type is numpy.memmap instead of numpy.ndarray. How can I avoid this issue?
Both saving it in the correct format and loading it in the correct format.

Looking into the manual we can see that the second parameter of numpy.load is called mmap_mode and is set to "r" in your code. This enables memory mapping the file:
A memory-mapped array is kept on disk. However, it can be accessed and sliced like any ndarray. Memory mapping is especially useful for accessing small fragments of large files without reading the entire file into memory.
Memory mapping is normally not an "issue" as you called it, but a feature that enables faster file access and saves memory for large files. When doing memory mapped I/O, your operating system maps parts of the file into the RAM address space of your program. That way the data has not to be copied into RAM. Any changes that are made to the memory mapped numpy array are directly reflected in the file. Because you specified read only access, you probably cannot change values in the array.
If you want to disable memory mapping, you could remove the second argument "r" from the call to numpy.load, which leads to a fresh copy of the array in RAM, that you can modify without affecting the file.

While the answer from Jakob Stark explains what the additional "r" argument to np.load() does, let me just suggest a simpler and safer usage. To save and load NumPy arrays in the straight-forward way (no memory mapping, etc.), use the most straight-forward syntax:
np.save('filename.npy', array)
array2 = np.load('filename.npy')
You don't have to specify the dtype or anything, it just does the simplest possible thing, as you are expecting. Also, not manually opening the file prior to calling np.save() means that you do not have to worry about closing it again (these acts should generally be written inside a try/except block, which further adds to the complexity).

Most efficient way to use a large data set for PyTorch?

Perhaps this question has been asked before, but I'm having trouble finding relevant info for my situation.
I'm using PyTorch to create a CNN for regression with image data. I don't have a formal, academic programming background, so many of my approaches are ad-hoc and just terribly inefficient. May times I can go back through my code and clean things up later because the inefficiency is not so drastic that performance is significantly affected. However, in this case, my method for using the image data takes a long time, uses a lot of memory, and it is done every time I want to test a change in the model.
What I've done is essentially loaded the image data into numpy arrays, saved those arrays in an .npy file, and then when I want to use said data for the model I import all of the data in that file. I don't think the data set is really THAT large, as it is comprised of 5000, 3 color channel images of size 64x64. Yet my memory usage shoots up to 70%-80% (out of 16gb) when it is being loaded, and it takes 20-30 seconds to load in every time.
My guess is that I'm being dumb about the way I'm loading it in, but frankly I'm not sure what the standard is. Should I, in some way, put the image data somewhere before I need it, or should the data be loaded directly from the image files? And in either case, what is the best, most efficient way to do that, independent of file structure?
I would really appreciate any help on this.

For speed I would advise to used HDF5 or LMDB:
Reasons to use LMDB:
LMDB uses memory-mapped files, giving much better I/O performance.
Works well with really large datasets. The HDF5 files are always read
entirely into memory, so you can’t have any HDF5 file exceed your
memory capacity. You can easily split your data into several HDF5
files though (just put several paths to h5 files in your text file).
Then again, compared to LMDB’s page caching the I/O performance won’t
be nearly as good.
[http://deepdish.io/2015/04/28/creating-lmdb-in-python/]
If you decide to used LMDB:
ml-pyxis is a tool for creating and reading deep learning datasets using LMDBs.*(I am co author of this tool)
It allows to create binary blobs (LMDB) and they can be read quite fast. The link above comes with some simple examples on how to create and read the data. Including python generators/ iteratos .
This notebook has an example on how to create a dataset and read it paralley while using pytorch.
If you decide to use HDF5:
PyTables is a package for managing hierarchical datasets and designed to efficiently and easily cope with extremely large amounts of data.
https://www.pytables.org/

Here is a concrete example to demonstrate what I meant. This assumes that you've already dumped the images into an hdf5 file (train_images.hdf5) using h5py.
import h5py
hf = h5py.File('train_images.hdf5', 'r')
group_key = list(hf.keys())[0]
ds = hf[group_key]
# load only one example
x = ds[0]
# load a subset, slice (n examples)
arr = ds[:n]
# should load the whole dataset into memory.
# this should be avoided
arr = ds[:]
In simple terms, ds can now be used as an iterator which gives images on the fly (i.e. it doesn't load anything in memory). This should make the whole run time blazing fast.
for idx, img in enumerate(ds):
# do something with `img`

In addition to the above answers, the following may be useful due to some recent advances (2020) in the Pytorch world.
Your question: Should I, in some way, put the image data somewhere before I need it, or should the data be loaded directly from the image files? And in either case, what is the best, most efficient way to do that, independent of file structure?
You can leave the image files in their original format (.jpg, .png, etc.) on your local disk or on the cloud storage, but with one added step - compress the directory as a tar file. Please read this for more details:
Pytorch Blog (Aug 2020): Efficient PyTorch I/O library for Large Datasets, Many Files, Many GPUs (https://pytorch.org/blog/efficient-pytorch-io-library-for-large-datasets-many-files-many-gpus/)
This package is designed for situations where the data files are too large to fit in memory for training. Therefore, you give the URL of the dataset location (local, cloud, ..) and it will bring in the data in batches and in parallel.
The only (current) requirement is that the dataset must be in a tar file format.
The tar file can be on the local disk or on the cloud. With this, you don't have to load the entire dataset into the memory every time. You can use the torch.utils.data.DataLoader to load in batches for stochastic gradient descent.

No need saving image into npy and loading all into memory. Just load a batch of image path and transform then into tensor.
The following code define the MassiveDataset, and pass it into DataLoader, everything goes well.
from torch.utils.data.dataset import Dataset
from typing import Optional, Callable
import os
import multiprocessing
def apply_transform(transform: Callable, data):
try:
if isinstance(data, (list, tuple)):
return [transform(item) for item in data]
return transform(data)
except Exception as e:
raise RuntimeError(f'applying transform {transform}: {e}')
class MassiveDataset(Dataset):
def __init__(self, filename, transform: Optional[Callable] = None):
self.offset = []
self.n_data = 0
if not os.path.exists(filename):
raise ValueError(f'filename does not exist: {filename}')
with open(filename, 'rb') as fp:
self.offset = [0]
while fp.readline():
self.offset.append(fp.tell())
self.offset = self.offset[:-1]
self.n_data = len(self.offset)
self.filename = filename
self.fd = open(filename, 'rb', buffering=0)
self.lock = multiprocessing.Lock()
self.transform = transform
def __len__(self):
return self.n_data
def __getitem__(self, index: int):
if index < 0:
index = self.n_data + index
with self.lock:
self.fd.seek(self.offset[index])
line = self.fd.readline()
data = line.decode('utf-8').strip('\n')
return apply_transform(self.transform, data) if self.transform is not None else data
NB: open file with buffering=0 and multiprocessing.Lock() are used to avoid loading bad data (usually a bit from one part of the file and a bit from the another part of the file).
additionally, if using multiprocessing in DataLoader, one could get such exception TypeError: cannot serialize '_io.BufferedReader' object. This is caused by pickle module used in multiprocessing, it cannot serialize _io.BufferedReader, but dill can. Replacing multiprocessing with multiprocess, things goes okay (major changes compare with multiprocessing, enhanced serialization is done with dill)
same thing was discussed in this issue

How to read file to import matrix?

I need to use some matrices in Python programs, like
Q = np.matrix([[1,0,1,1,0],
[0,2,0,1,1],
[1,0,2,0,1],
[1,1,0,1,0],
[0,1,1,0,1]])
and I want to import the matrix (use numpy) from a file, so what should I do to realize it? what code should I write and what file should I use (.txt?). I am quite new to python, anyone can help me? Thank you in advance.

I'm assuming that you're not only importing the matrices, but also exporting them to files in the first place.
If that's true, there are multiple easy options, with different tradeoffs.
np.save saves the array in a binary format that's only usable by NumPy. But it's very fast, and generates reasonably small files.
np.save('matrix.npy', Q)
Q = np.load('matrix.npy')
np.savetxt saves the array in a text file, using a dialect of CSV (with whitespace separators, by default). It's slower, and generates bigger files, but if you want to be able to read or edit the files (or send them through an ASCII-only channel, like email without attachments), it's the best option.
np.savetxt('matrix.txt', Q)
Q = np.loadtxt('matrix.txt')
np.savetxt can also save the array in a compressed text file. This gives you small files, but they're slower to save and load. They're not directly human-readable, but it's very easy to un-gzip a file, and then you've got a text file you can read and edit. So, sometimes this is worth doing.
np.savetxt('matrix.txt.gz', Q)
Q = np.loadtxt('matrix.txt.gz')
Finally, you can just use standard Python saving and loading mechanisms, like pickle:
with open('matrix.pickle', 'wb') as f:
pickle.dump(Q, f)
with open('matrix.pickle', 'rb') as f:
Q = pickle.load(f)
This is really only useful if you need to store NumPy arrays together with non-NumPy objects.
If you have to save multiple matrices, instead of saving one per file, you might want to look at savez and savez_compressed. Or, if you need multiple objects, only some of which are NumPy, pickle may be the best option.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

h5py file subset taking more space than parent file? - python

Related

Converting CSV to numpy NPY efficiently

Convert huge csv to hdf5 format

How to avoid memory mapping when loading a numpy file

Most efficient way to use a large data set for PyTorch?

How to read file to import matrix?

Categories

Resources