I have a lot of satellite data that consists of two-dimensional arrays.
(I converted the H5 files to 2D array data that does not include latitude information;
I created the Lat/Lon information data separately.)
For each dataset I know both the real Lat/Lon coordinates and the grid coordinates.
How can I partially read a 2D satellite file in Python?
numpy.fromfile is usually used to read binary files.
If I use the count option of numpy.fromfile, I can read a binary file partially.
However, I also want to skip the leading records of a file to save memory.
For example, I have a 3x3 2D array as follows:
python
a = [[1, 2, 3],
     [4, 5, 6],
     [7, 8, 9]]
I just want to read a[2][0] in Python (result = 7).
When I read the file in Fortran, I used recl and rec:
Fortran
open(1, file='example.bin', access='direct', recl=4)   ! recl=4 means each record is 4 bytes
read(1, rec=(lat-1)*x+lon) value                       ! read only the record that holds the wanted value
close(1)
lat is the latitude index in the data
(lat = 3 in the above example; indexing starts at 1 in Fortran).
lon is the longitude index in the data
(lon = 1 in the above example).
x is the length of a row
(x = 3 in the above example, since the array is 3x3).
This way I can read the value while using only 4 bytes of memory.
I want to know a similar method in Python.
Please give me any advice that saves time and memory.
Thank you for reading my question.
Solution
python
import numpy as np

# `name` is the path to a binary file holding the nine int8 values 1..9 in row order.
# Element [2][0] is the 7th value, so its byte offset is (2*3 + 0) * 1 byte = 6.
a = np.memmap(name, dtype='int8', mode='r', shape=(1,), offset=6)
print(a[0])   # result: 7
To read .h5 files :
import h5py
ds = h5py.File(filename, "r")
variable = ds['variable_name']
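Since the question is about partial reads, note that h5py datasets support NumPy-style slicing, so a single value can be read without loading the whole array. A small sketch, assuming 'variable_name' holds the 3x3 example above:
import h5py

# Only the requested element is read from disk, not the full dataset.
with h5py.File(filename, "r") as ds:
    value = ds['variable_name'][2, 0]   # row 2, column 0 -> 7 in the 3x3 example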
It's hard to follow your description. Some proper code formatting would help overcome the English language problems.
So you have data in an H5 file. The simplest approach is to use h5py to load it into a Python/numpy session and select the necessary data from those arrays.
But it sounds as though you have written a portion of this data to a 'plain' binary file. It would help to know how you did that. Also, in what way is it 2d?
np.fromfile reads a file as though it were 1d. Can you read this file, up to some count, and with the correct dtype?
np.fromfile accepts an open file. So I think you can open the file, use seek to skip forward, and then read count items from there. But I haven't tested that idea.
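For example, a minimal sketch of that seek-then-read idea, assuming the 3x3 example has been written row-major as int8 to a file named 'example.bin' (names are illustrative, and this is untested):
import numpy as np

itemsize = np.dtype('int8').itemsize
with open('example.bin', 'rb') as f:
    f.seek((2 * 3 + 0) * itemsize)                  # skip the first 6 values
    value = np.fromfile(f, dtype='int8', count=1)   # read a single element
print(value[0])                                     # 7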
Related
How to convert a .csv file to .npy efficiently?
I've tried:
import numpy as np
filename = "myfile.csv"
vec = np.loadtxt(filename, delimiter=",")
np.save(f"{filename}.npy", vec)
While the above works for smallish files, the actual .csv file I'm working on has ~12 million lines with 1024 columns, and it takes quite a while to load everything into RAM before converting it into the .npy format.
Q (Part 1): Is there some way to load/convert a .csv to .npy efficiently for large CSV file?
The above code snippet is similar to the answer from Convert CSV to numpy but that won't work for ~12M x 1024 matrix.
Q (Part 2): If there isn't any way to load/convert a .csv to .npy efficiently, is there some way to iteratively read the .csv file into .npy efficiently?
Also, there's an answer here https://stackoverflow.com/a/53558856/610569 to save the csv file as a numpy array iteratively. But it seems like np.vstack isn't the best solution when reading the file. The accepted answer there suggests hdf5, but that format is not the main objective of this question and isn't desired in my use case, since I have to read it back into a numpy array afterwards.
Q (Part 3): If part 1 and part 2 are not possible, are there other efficient storage formats (e.g. tensorstore) that can store the data and efficiently convert it to a numpy array when loading?
There is another library, tensorstore, that seems to handle arrays efficiently and supports conversion to a numpy array when read, https://google.github.io/tensorstore/python/tutorial.html. But somehow there isn't any information on how to save the tensor/array without knowing the exact dimensions; all of the examples seem to include configurations like 'dimensions': [1000, 20000],.
Unlike HDF5, tensorstore doesn't seem to have reading-overhead issues when converting to numpy; from the docs:
Conversion to an numpy.ndarray also implicitly performs a synchronous read (which hits the in-memory cache since the same region was just retrieved)
Nice question; informative in itself.
I understand you want to have the whole data set/array in memory, eventually, as a NumPy array. I assume, then, that you have enough (RAM) memory to host such an array -- 12M x 1K.
I don't know specifically how np.loadtxt (genfromtxt) operates behind the scenes, so I will tell you how I would do it (after trying what you did).
Reasoning about memory...
Notice that a simple boolean array will cost ~12 GBytes of memory:
>>> print("{:.1E} bytes".format(
...     np.array([True]).itemsize * 12E6 * 1024
... ))
1.2E+10 bytes
And this is for a Boolean data type. Most likely you have -- what -- a dataset of integers, or floats? The size may increase quite significantly:
>>> np.array([1], dtype=bool).itemsize
1
>>> np.array([1], dtype=int).itemsize
8
>>> np.array([1], dtype=float).itemsize
8
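With float64, for example, the same back-of-the-envelope estimate grows to roughly 98 GB:
>>> print("{:.1E} bytes".format(8 * 12E6 * 1024))
9.8E+10 bytes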
It's a lot of memory (which you know, just want to emphasize).
At this point, I would like to point out possible swapping of the working memory. You may have enough physical (RAM) memory in your machine, but if there is not enough free memory, your system will use swap memory (i.e., disk) to keep the system stable and get the work done. The cost you pay is clear: reading/writing from/to the disk is very slow.
My point so far is: check the data type of your dataset, estimate the size of your future array, and guarantee that you have at least that much RAM available.
I/O text
Considering you do have all the (RAM) memory necessary to host the whole numpy array: I would then loop over the whole (~12M lines) text file, filling the pre-existing array row by row.
More precisely, I would have the (big) array already instantiated before starting to read the file. Only then would I read each line, split the columns, give them to np.asarray, and assign those (1024) values to the respective row of the output array.
Looping over the file is slow, yes. The point here is that you limit (and control) the amount of memory being used. Roughly speaking, the big objects consuming your memory are the "output" (big) array and the "line" (1024) array. Sure, there is a considerable amount of memory consumed on each iteration by the temporary objects created while reading the (text!) values, splitting them into list elements, and casting them to an array. Still, that overhead remains largely constant over the whole ~12M lines.
So, the steps I would go through are:
0) estimate and guarantee enough RAM memory available
1) instantiate (np.empty or np.zeros) the "output" array
2) loop over "input.txt" file, create a 1D array from each line "i"
3) assign the line values/array to row "i" of "output" array
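A minimal sketch of those steps, assuming ~12M rows of exactly 1024 numeric columns with no header row in 'myfile.csv' (shape and file names are taken from the question; adjust as needed):
import numpy as np

n_rows, n_cols = 12_000_000, 1024
out = np.empty((n_rows, n_cols), dtype=np.float64)   # step 1: pre-allocate the "output" array

with open('myfile.csv') as f:                        # step 2: loop over the text file
    for i, line in enumerate(f):
        out[i, :] = np.asarray(line.split(','), dtype=np.float64)   # step 3: fill row i

np.save('myfile.npy', out)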
Sure enough, you can even make it parallel: while text files cannot be randomly (r/w) accessed, you can easily split them (see How can I split one text file into multiple *.txt files?) and -- if fun is at the table -- read them in parallel, if that time is critical.
Hope that helps.
TL;DR
Exporting to a format other than .npy seems inevitable unless your machine is able to handle the size of the data in memory, as described in @Brandt's answer.
Reading the data, then processing it (Kinda answering Q part 2)
To handle data sizes larger than what the RAM can handle, one would often resort to libraries that perform "out-of-core" computation, e.g. turicreate.SFrame, vaex or dask. These libraries are able to lazily load the .csv files into dataframes and process them by chunks when evaluated.
from turicreate import SFrame
filename = "myfile.csv"
sf = SFrame.read_csv(filename)
sf.apply(...) # Trying to process the data
or
import vaex
filename = "myfile.csv"
df = vaex.from_csv(filename,
                   convert=True,
                   chunk_size=50_000_000)
df.apply(...)
Converting the read data into a numpy array (kinda answering Q part 1)
While out-of-core libraries can read and process the data efficiently, converting into numpy is an "in-memory" operation: the machine needs to have enough RAM to fit all the data.
The turicreate.SFrame.to_numpy documentation writes:
Converts this SFrame to a numpy array
This operation will construct a numpy array in memory. Care must be taken when size of the returned object is big.
And the vaex documentation writes:
In-memory data representations
One can construct a Vaex DataFrame from a variety of in-memory data representations.
And dask actually implements its own array object, which is simpler than a numpy array; see the best practices at https://docs.dask.org/en/stable/array-best-practices.html. But going through the docs, it seems the formats dask arrays are saved in are not .npy but various other formats.
Writing the file into non-.npy versions (answering Q Part 3)
Given that numpy arrays are inevitably in-memory, trying to save the data into one single .npy isn't the most viable option.
Different libraries seem to have different solutions for storage. E.g.:
- vaex saves the data into hdf5 by default if the convert=True argument is set when data is read through vaex.from_csv()
- sframe saves the data into its own binary format
- dask export functions save to the to_hdf() and to_parquet() formats (see the sketch below)
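For instance, a dask-based sketch of that last route, with illustrative file names (it requires a parquet engine such as pyarrow to be installed):
import dask.dataframe as dd

# dask reads the CSV in partitions and writes each partition out on its own,
# so the full table never has to fit in memory at once.
ddf = dd.read_csv('myfile.csv')
ddf.to_parquet('myfile_parquet/')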
In its latest version (4.14), vaex supports "streaming", i.e. lazy loading of CSV files. It uses pyarrow under the hood so it is super fast. Try something like:
df = vaex.open('my_file.csv')
# or
df = vaex.from_csv_arrow('my_file.csv', lazy=True)
Then you can export to a bunch of formats as needed, or keep working with it like that (it is surprisingly fast). Of course, it is better to convert to some kind of binary format.
import numpy as np
import pandas as pd
# Define the input and output file names
csv_file = 'data.csv'
npy_file = 'data.npy'
# Create dummy data
data = np.random.rand(10000, 100)
df = pd.DataFrame(data)
df.to_csv(csv_file, index=False)
# Define the chunk size
chunk_size = 1000
# Read the header row and get the number of columns
header = pd.read_csv(csv_file, nrows=0)
num_cols = len(header.columns)
# Initialize an empty array to store the data
data = np.empty((0, num_cols))
# Loop over the chunks of the csv file
for chunk in pd.read_csv(csv_file, chunksize=chunk_size):
# Convert the chunk to a numpy array
chunk_array = chunk.to_numpy()
# Append the chunk to the data array
data = np.append(data, chunk_array, axis=0)
np.save(npy_file, data)
# Load the npy file and check the shape
npy_data = np.load(npy_file)
print('Shape of data before conversion:', data.shape)
print('Shape of data after conversion:', npy_data.shape)
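A small variant of the loop above avoids re-copying the growing array on every chunk by collecting the chunks in a list and concatenating once at the end (a sketch using the same names as above; note the full array still has to fit in memory):
# Collect the chunks, then build the final array in a single allocation.
chunks = [chunk.to_numpy() for chunk in pd.read_csv(csv_file, chunksize=chunk_size)]
data = np.concatenate(chunks, axis=0)
np.save(npy_file, data)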
I'm not aware of any existing function or utility that directly and efficiently converts csv files into npy files. By efficient I guess you primarily mean with low memory requirements.
Writing a npy file iteratively is indeed possible, with some extra effort. There's already a question on SO that addresses this, see:
save numpy array in append mode
For example, using the NpyAppendArray class from Michael's answer you can do:
import numpy as np
from npy_append_array import NpyAppendArray   # NpyAppendArray comes from Michael's answer (the npy-append-array package)

with open('data.csv') as csv, NpyAppendArray('data.npy') as npy:
    for line in csv:
        row = np.fromstring(line, sep=',')
        npy.append(row[np.newaxis, :])
The NpyAppendArray class updates the npy file header on every call to append, which is a bit much for your 12M rows. Maybe you could update the class to (optionally) only write the header on close. Or you could easily batch the writes:
batch_lines = 128
with open('data.csv') as csv, NpyAppendArray('data.npy') as npy:
    done = False
    while not done:
        batch = []
        for count, line in enumerate(csv):
            row = np.fromstring(line, sep=',')
            batch.append(row)
            if count + 1 >= batch_lines:
                break
        else:
            done = True
        if batch:                        # guard against an empty final batch
            npy.append(np.array(batch))
(code is not tested)
I'm currently doing some basic analysis and trying to make tools to automate some of the more quantitative parts of my job. One of these tasks is analyzing data from local instruments, and using that data to draw quantitative conclusions. The end goal is to calculate percent data coverage over a given region (what percent of values in area 'x' exceed value 'y'?). However, there are problems.
First, the data we are looking at is binary. While the programmer's guides for the data document some of the data structure, they are very sparse on how to actually use the data for analysis outside of their proprietary programs.
Second, I am new to Python. While I tried programming tasks in Python years ago, I did not end up making anything useful; I am more adept at shell scripting, can work with HTML/JavaScript/PHP, and manage a program using Fortran; I'm trying to learn Python to diversify.
What I know about the data in question: the binary file contains a 640-character-long header made up of three parts. Each part is a mixture of characters; unsigned and signed 8-, 16-, and 32-bit integers; and 16- and 32-bit binary angles. After the header, the files contain a cartesian grid of data as 'pixels' in an 'image'. Each 'pixel' within the 'image' is a one-byte unsigned character with a value between 0 and 255. The 'image' is a 2-D grid of 'x by y', with the next 'image' occurring after a given number of bytes (in this data set, the images are 720 by 720 'pixels', so the 'images' are separated by 720^2 bytes).
Right now, my goal is just to read the file into a python program and separate the various "images" for inspection. The initialized data/format are below:
testFile = 'C:/path/to/file/binaryFile'
headerFormat = '640c'
nBytesData = 720 * 720
# Below is commented out
# inputFile = open(testFile, 'rb')
I have been able to read the file in as a binary file, but I have no clue how to inspect it. My first instinct was to try to put it in a numpy array, but additional research suggested using the struct module and struct.unpack to break apart the data. From what I've read, the following block should unpack each 'image' correctly after the initial header, even if it's not the most efficient method:
header_size = struct.calcsize(headerFormat)
testUnpacked = []
with open(testFile, 'rb') as testData:
    headerOut = testData.read(header_size)
    print("header is: ", headerOut)
    while True:
        testContent = testData.read()
        if not testContent: break
        testArray = struct.unpack(testContent, nBytesData)
        testUnpacked.append(testArray)
The problem is that I do not know how to set up the code to unpack/skip the header of the binary file. I do not think the headerFormat = '640c' line of code, plus the next couple of commands that try to format its output, is correct. I was able to output a line that the program, run in PyCharm, interpreted as the "header"; below is a sample of the output starting from the first 'print':
b'\x1b\x00\x08\x00\x80\xd4\x0f\x00\x00\x00\x00\x00\x1a\x00\x06\x00#\x01\x00\x00\x00\x00\x00\x00\x03\x00\x02\x00\x00\x00\x00\x00}\t\x0
After that, I got an error stating that there is an embedded null character preventing the data from saving to the designated array.
Other questions I referenced to try and figure out how to read the data:
Reading a binary file with python
Reading a binary file into a struct
Fastest way to read a binary file with a defined format?
Main questions are as follows:
How do I tell the program to read the binary file header and then start reading the file according to the 720^2 arrays?
How do I tell the program to save the header in a format I can understand?
How do I figure out what is causing the struct.error message?
Based on this description it is difficult to say how one could read the header, since this will depend on its specific structure. It should be possible though to read the rest of the file.
Start by reading the file as a byte array:
with open(testFile, 'rb') as testData:
    data = testData.read()
len(data) will give the number of bytes. Assuming that the header consists of fewer than 720^2 bytes, and that the rest of the bytes are subdivided into images of 720^2 bytes each, the remainder from the division of len(data) by 720^2 will give the length of the header:
len_header = len(data) % 720**2
You can then disregard the header and convert the remaining bytes into integers:
pixels = [b for b in data[len_header:]]
Next, you can use numpy to rearrange this list into a 2-dimensional array with 720^2 columns, so that each row consists of pixels of a single image:
import numpy as np
images = np.array(pixels).reshape(-1, 720**2)
Each image can be now accessed as images[i] where i is the index of a row. This is a 1-dimensional array, so to make it into a 2-dimensional structure representing an image reshape again:
images[i].reshape(720, 720)
Finally, you can use matplotlib to display the image and check if it looks correctly:
import matplotlib.pyplot as plt
plt.imshow(images[i].reshape(720, 720), cmap="gray_r")
plt.show()
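If the files are large, a memmap-based variant of the same idea avoids reading everything into RAM at once. This is an untested sketch that assumes the header is exactly 640 bytes; the modulo check above is the safer way to determine the header length:
import numpy as np

header_bytes = 640                          # assumption: the 640-character header occupies 640 bytes
pixels = np.memmap(testFile, dtype=np.uint8, mode='r', offset=header_bytes)
n_images = pixels.size // (720 * 720)
images = pixels[:n_images * 720 * 720].reshape(n_images, 720, 720)
plt.imshow(images[0], cmap="gray_r")        # reuses the matplotlib import from above
plt.show()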
A little background:
I am running a binary stellar evolution code and storing the evolution histories as gzipped .dat files. I used to have a smaller dataset resulting in ~2000 .dat files, which I read during post-processing by appending the list data from each file to create a 3D list. Each .dat file looks somewhat like this file.
But recently I started working with a larger dataset, and the number of evolutionary history files rose to ~100000. So I decided to compress the .dat files as gzips and save them in a zipped folder. The reason is that I am doing all this on a remote server and have a limited disk quota.
Main query:
During post-processing, I try to read data using pandas from all these files as 2D numpy arrays, which are stacked to form a 3D list (each file has a different length, so I could not use numpy.append and have to use lists instead). To achieve this, I use this:
import os
import zipfile
import pandas as pd

def read_evo_history(EvoHist, zipped, z):
    ehists = []
    for i in range(len(EvoHist)):
        if zipped == True:
            try:
                ehists.append(pd.read_csv(z.open(EvoHist[i]), delimiter="\t",
                                          compression='gzip', header=None).to_numpy())
            except pd.errors.EmptyDataError:
                pass
    return ehists
outdir = "plots"
indir = "OutputFiles_allsys"
z = zipfile.ZipFile(indir + '.zip')
EvoHist = []
for filename in z.namelist():
    if not os.path.isdir(filename):
        # read the file
        if filename[0:len("OutputFiles_allsys/EvoHist")] == "OutputFiles_allsys/EvoHist":
            EvoHist.append(filename)
zipped = True
ehists = read_evo_history(EvoHist, zipped, z)
del z  # Cleanup (if there's no further use of it after this)
The problem I am now facing is that one particular column in the data is being read as a list of strings rather than floats. Do I need to somehow convert the datatype while reading the file? Or is this caused by datatype inconsistencies in the files being read? Is there a way to get the data as a 3D list of numpy arrays of floats?
P.S.: If this is being caused by inconsistencies in the input files, then I am afraid I won't be able to run my binary stellar evolution code again, as it takes days to produce all these files.
I will be more than happy to clarify more on this if needed. Thanks in advance for your time and energy.
Edit:
I noticed that only the 16th column of some files is being read in as a string, and I think this is because there are some NaN values in there, but I may be wrong.
This image shows the raw data with the NaN values pointed out. A demonstration showing that particular column being read as a string can be seen here. However, another column is read as float: image.
The workaround for overcoming a missing value was simple: pandas.read_csv has a parameter called na_values which allows users to pass specific values that they want to be read as NaNs. From the pandas docs:
na_values: scalar, str, list-like, or dict, default None
Additional strings to recognize as NA/NaN. If dict passed, specific per-column NA values. By default, the following values are interpreted as NaN: '', ... 'NA', ....
Pandas itself is smart enough to automatically recognize those values without us explicitly stating them. But in my case, the file had NaN values written as 'nan ' (yeah, with a space!), which is why I was facing this issue. A minute change in the code fixed this:
pd.read_csv(z.open(EvoHist[i]), delimiter="\t",
            compression='gzip', header=None, na_values='nan ').to_numpy()
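If some files still slip through with a string column, an alternative (not part of the original fix, and untested here) is to coerce that column after reading and before converting to numpy, e.g. with pandas.to_numeric; the column index 15 assumes the 16th column mentioned above:
df = pd.read_csv(z.open(EvoHist[i]), delimiter="\t", compression='gzip', header=None)
df[15] = pd.to_numeric(df[15], errors='coerce')   # anything unparseable becomes NaN
ehists.append(df.to_numpy())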
This forum has been extremely helpful for a Python novice like me to improve my knowledge. I have generated a large number of raw data files in text format from my CFD simulation. My objective is to import these text files into Python and do some post-processing on them. This is the code I have currently:
import numpy as np
from matplotlib import pyplot as plt
import os
filename = np.array(['v1-0520.txt', 'v1-0878.txt', 'v1-1592.txt', 'v1-3020.txt', 'v1-5878.txt'])

for i in filename:
    format_name = i
    path = 'E:/Fall2015/Research/CFDSimulations_Fall2015/ddn310/Autoexport/v1'
    data = os.path.join(path, format_name)
    # X and Y are the coordinates; U, V, T, Tr are the dependent variables
    X, Y, U, V, T, Tr = np.loadtxt(data, usecols=(1, 2, 3, 4, 5, 6), skiprows=1, unpack=True)
    plt.figure(1)
    plt.plot(T, Y)
    plt.legend(['vt1a', 'vtb', 'vtc', 'vtd', 'vte', 'vtf'])
    plt.grid(b=True)
Is there a better way to do this, like importing all the text files (~10000 files) at once into Python and then accessing whichever files I need for post-processing (maybe by indexing)? All the text files have the same number of columns and rows.
I am just a beginner at Python. I will be grateful if someone can help me or point me in the right direction.
Your post needs to be edited to show proper indentation.
Based on a quick read, I think you are:
- reading a file, making a small edit, and writing it back
- then loading it into a numpy array and plotting it
Presumably the purpose of your edit is to correct some header or value.
You don't need to write the file back. You can use content directly in loadtxt.
content = content.replace("nodenumber", "#nodenumber")  # ignoring the node number column
data1 = np.loadtxt(content.splitlines())
Y = data1[:, 2]
temp = data1[:, 5]
loadtxt accepts anything that feeds it line by line. content.splitlines() makes a list of lines, which loadtxt can use.
the load could be more compact with:
Y, temp = np.loadtxt(content.splitlines(), usecols=(2,5), unpack=True)
With usecols you might not even need the replace step. You haven't given us a sample file to test.
I don't understand your multiple file needs. One way or another you need to open and read each file, one by one. And it would be best to close one before going on to the next. The with open(name) as f: syntax is great for ensuring that a file is closed.
You could collect the loaded data in larger lists or arrays. If Y and temp are identical in size for all files, they can be collected into a higher-dimensional array, e.g. YY[i,:] = Y for the ith file, where YY is preallocated. If they can vary in size, it is better to collect them in lists.
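For example, a sketch that collects the loaded columns of every file in a dictionary keyed by file name (using the path and file list from the question), so any file can be looked up later for post-processing:
import os
import numpy as np

path = 'E:/Fall2015/Research/CFDSimulations_Fall2015/ddn310/Autoexport/v1'
results = {}
for name in filename:                              # `filename` is the array of names from the question
    X, Y, U, V, T, Tr = np.loadtxt(os.path.join(path, name),
                                   usecols=(1, 2, 3, 4, 5, 6),
                                   skiprows=1, unpack=True)
    results[name] = (X, Y, U, V, T, Tr)

# later, e.g.: X, Y, U, V, T, Tr = results['v1-0520.txt']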
Back on Feb 8 '13 at 20:20, YamSMit asked a question (see: How to read and write a table / matrix to file with python?) similar to what I am struggling with: starting out with an Excel table (CSV) that has 3 columns and a varying number of rows. The contents of the columns are string, floating point, and string. The first string will vary in length, while the other string can be fixed (e.g., 2 characters). The table needs to go into a 2-dimensional array so that I can do manipulations on the data to produce a final file (which will be a text file). I have experimented with a variety of strategies presented on stackoverflow, but I am always missing something, and I haven't seen an example with all the parts, which is the reason for the struggle to figure this out.
Sample data will be similar to:
Ray Smith, 41645.87778, V1
I have read and explored numpy and astropy, since the available documentation says they make this type of code easy. I have tried import csv. Somehow, the code doesn't come together. I should add that I am writing in Python 3.2.3 (which seems to be a mistake, since a lot of documentation is for Python 2.x).
I realize the basic nature of this question directs me to read more tutorials. I have been reading many, yet the tutorials always refer to enough that is different that I fail to assemble the right pieces: read the table file, write it into a 2D array, then... do more stuff.
I am grateful to anyone who might provide me with a workable outline of the code, or point me to specific documentation I should read to handle the specific nature of the code I am trying to write.
Many thanks in advance. (Sorry for the wordiness - just trying to be complete.)
I am more familiar with 2.x, but from the 3.3 csv documentation found here, it seems to be mostly the same as 2.x. The following function will read a csv file, and return a 2D array of the rows found in the file.
import csv
def read_csv(file_name):
    array_2D = []
    # In Python 3 the file must be opened in text mode with newline='', not 'rb'
    with open(file_name, 'r', newline='') as csvfile:
        read = csv.reader(csvfile, delimiter=';')  # assuming your csv file uses the ';' delimiter - there are other options, for which see the first link
        for row in read:
            array_2D.append(row)
    return array_2D
You would then be able to manipulate the data as follows (assuming your csv file is called 'foo.csv' and the desired text file is 'foo.txt'):
data = read_csv('foo.csv')
with open('foo.txt', 'w') as textwrite:   # open for writing
    for row in data:
        string = '{0} has {1} apples in his Ford {2}.\n'.format(row[0], row[1], row[2])
        textwrite.write(string)
        # if you know the second column is a float:
        manipulate = float(row[1]) * 3
        textwrite.write(str(manipulate) + '\n')
string would then be written to 'foo.txt' as:
Ray Smith has 41645.87778 apples in his Ford V1.\n
and manipulate would be written to 'foo.txt' as:
124937.63334
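Since the question also mentions numpy, here is an alternative sketch using np.genfromtxt, assuming 'foo.csv' is comma-separated with the three columns described (string, float, string); dtype=None lets numpy infer a type per column:
import numpy as np

table = np.genfromtxt('foo.csv', delimiter=',', dtype=None, encoding='utf-8')
for row in table:
    # default field names for an inferred structured array are 'f0', 'f1', 'f2'
    name, value, code = row['f0'], row['f1'], row['f2']
    print('{0} has {1} apples in his Ford {2}.'.format(name.strip(), value, code.strip()))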