I have a large binary file (~4GB) of 4-byte reals. I am trying to read this file using numpy's fromfile as follows.
data = np.fromfile(filename, dtype=np.single)
Upon inspecting data, I see that all elements are zeros. However, when I read the file in Matlab I can see that it contains the correct data, not zeros. I tested a smaller file (~2.5GB) and numpy read that one fine.
I finally tried using np.memmap to read the large file (~4GB), as
data = np.memmap(filename, dtype=np.single, mode='r')
and upon inspecting data, I can see that it correctly reads the data.
My question is: why is np.fromfile giving me all zeros in the array? Is there a memory limit to what np.fromfile can read?
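For reference, a diagnostic sketch (not a confirmed fix) that spot-checks chunks read with np.fromfile's count and offset arguments against the memmap that is known to work; the file path and chunk size below are placeholders:
import numpy as np

filename = 'data.bin'              # placeholder; use the actual file path
dtype = np.single                  # 4-byte reals, as in the question
itemsize = np.dtype(dtype).itemsize

# The memory map reads the file correctly, so use it as the reference.
mm = np.memmap(filename, dtype=dtype, mode='r')

# Read a few chunks with np.fromfile (the offset argument needs numpy >= 1.17)
# and compare each one against the corresponding memmap slice.
chunk = 1_000_000                  # arbitrary chunk length, in elements
for start in (0, len(mm) // 2, len(mm) - chunk):
    piece = np.fromfile(filename, dtype=dtype, count=chunk, offset=start * itemsize)
    print(start, np.array_equal(piece, np.asarray(mm[start:start + chunk])))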
I am trying to perform analysis on dozens of very large CSV files, each with hundreds of thousands of rows of time-series data and each roughly 5GB in size.
My goal is to read each of these CSV files into a dataframe, perform calculations on that dataframe, append some new columns based on those calculations, and then write the new dataframe to a unique output CSV file for each input CSV file. This whole process occurs within a for loop iterating through a folder containing all of these large CSV files, roughly as sketched below. The whole process is very memory intensive, and when I try to run my code I am met with this error message: MemoryError: Unable to allocate XX MiB for an array with shape (XX,) and data type int64
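The per-file loop looks roughly like this; the folder paths and the calculation step are placeholders for my actual code:
import os
import pandas as pd

input_dir = 'input_csvs'     # placeholder folder of large input CSVs
output_dir = 'output_csvs'   # placeholder folder for the per-file results

for name in os.listdir(input_dir):
    if not name.endswith('.csv'):
        continue
    df = pd.read_csv(os.path.join(input_dir, name))          # this is where the MemoryError hits
    df['new_col'] = df.select_dtypes('number').sum(axis=1)   # placeholder calculation
    df.to_csv(os.path.join(output_dir, name), index=False)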
And so I want to explore a way to make the process of reading in my CSVs much less memory intensive, which is why I want to try out the pickle module in Python.
To "pickle" each CSV and then read it in I try the following:
import pandas as pd
import pickle

# Pickle the CSV and read it back in as a pickle
df = pd.read_csv(path_to_csv)
filename = "pickle.csv"
with open(filename, 'wb') as file:
    pickle.dump(df, file)
with open(filename, 'rb') as file:
    pickled_df = pickle.load(file)
print(pickled_df)
However, after including this pickling code to read in my data in my larger script, I get the same error message as above. I suspect this is because I am still reading the file in with pandas to begin with before pickling it and then reading that pickle. My question is, how do I avoid the memory-intensive process of reading my data into a pandas dataframe by just reading in the CSV with pickle? Most of the instruction I am finding tells me to pickle the CSV and then read in that pickle, but I do not understand how to pickle the CSV without first reading it in with pandas, which is what is causing my code to crash. I am also confused about whether reading in my data as a pickle would still provide me with a dataframe I can perform calculations on.
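(For reference on that last point: un-pickling returns the same object that was pickled, so a pickled DataFrame comes back as a DataFrame. A minimal sketch with a toy frame:)
import pickle
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3]})
restored = pickle.loads(pickle.dumps(df))   # round-trip through pickle
print(type(restored))                       # <class 'pandas.core.frame.DataFrame'>
print(restored.equals(df))                  # True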
I have a large npy file with 33,696 rows of data. I would like to separate it into 18 small files with 1872 rows of data in each. I have tried to use the same code I use for splitting a large text file into small text files, but I am not able to get the output that I want. What alternative code can be used to achieve this?
I tried to repeat the same steps as for the text file but did not get the output that I wanted.
An npy file is a binary-format file that must be loaded as a whole, whereas a text file can be read one line at a time. That means you cannot expect code written for text files to correctly process binary npy files.
Here you should:
load the npy file into a large numpy array (numpy.load)
split the large array into smaller ones
save each smaller array to its own file (numpy.save)
Possible code:
import numpy as np

arr = np.load('large_file.npy')
for rank, start in enumerate(range(0, 33696, 1872)):
    np.save(f'small_{rank}.npy', arr[start:start + 1872])
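An equivalent sketch using numpy.array_split, assuming (as stated) that the 33,696 rows divide evenly into 18 chunks of 1872:
import numpy as np

arr = np.load('large_file.npy')
# array_split splits along the first axis into 18 nearly equal pieces.
for rank, chunk in enumerate(np.array_split(arr, 18)):
    np.save(f'small_{rank}.npy', chunk)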
I am generating some arrays for a simulation and I want to save them in a JSON file. I am using the jsonpickle library.
The problem is that the arrays I need to save can be very large in size (hundreds of MB up to some GB). Thus, I need to save each array to the JSON file immediately after its generation.
Basically, I am generating multiple independent large arrays, storing them in another array, and saving them to the JSON file after all of them have been generated:
import numpy as np
import jsonpickle
import jsonpickle.ext.numpy as jsonpickle_numpy

jsonpickle_numpy.register_handlers()   # let jsonpickle encode numpy arrays

N = 1000                                              # Number of arrays
z_NM = np.zeros((64000, 1), dtype=complex)            # One large array
z_NM_array = np.zeros((N, 64000, 1), dtype=complex)   # Array holding all z_NM arrays

for i in range(N):
    z_NM[:, 0] = GenerateArray()   # GenerateArray() is my own generator function (not shown)
    z_NM_array[i] = z_NM           # Store the generated array

# Write all the data to the JSON file at the end
data = {"z_NM_array": z_NM_array}
outdata = jsonpickle.encode(data)
with open(filename, "wb+") as f:   # filename is defined elsewhere in my script
    f.write(outdata.encode("utf-8"))
I was wondering if it is instead possible to append the new data to the existing JSON file, by writing each array to the file immediately after its generation, inside the for loop? If so, how? And how can it be read back? Maybe using a library different from jsonpickle?
I know I could save each array in a separate file, but I'm wondering if there's a solution that lets me use a single file. I also have some settings in the dict which I want to save along with the array.
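One possible pattern (a sketch, not necessarily the best fit) is to write one JSON document per line ("JSON Lines"), appending each array's record as soon as it is generated, and then decoding the file back one line at a time; the file name and toy sizes below are placeholders:
import numpy as np
import jsonpickle
import jsonpickle.ext.numpy as jsonpickle_numpy

jsonpickle_numpy.register_handlers()   # enable numpy array support in jsonpickle

filename = 'simulation.jsonl'          # placeholder path; one JSON document per line

# Write: the settings dict first, then each array immediately after it is generated.
with open(filename, 'w', encoding='utf-8') as f:
    f.write(jsonpickle.encode({'settings': {'N': 3, 'size': 4}}) + '\n')
    for i in range(3):
        z = np.random.randn(4, 1) + 1j * np.random.randn(4, 1)   # stand-in for GenerateArray()
        f.write(jsonpickle.encode({'index': i, 'z_NM': z}) + '\n')

# Read back: decode one record per line, so the whole file never has to be held at once.
with open(filename, 'r', encoding='utf-8') as f:
    for line in f:
        record = jsonpickle.decode(line)
        print(record.keys())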
I'm probably trying to reinvent the wheel here, but numpy has a fromfile() function that can read - I imagine - CSV files.
It appears to be incredibly fast, even compared to Pandas' read_csv(), but I'm unclear on how it works.
Here's some test code:
import pandas as pd
import numpy as np

# Create the file here: two columns, one million rows of random numbers.
filename = 'my_file.csv'
df = pd.DataFrame({'a': np.random.randint(100, 10000, 1000000),
                   'b': np.random.randint(100, 10000, 1000000)})
df.to_csv(filename, index=False)

# Now read the file into memory.
arr = np.fromfile(filename)
print(len(arr))
I included the len() at the end there to make sure it wasn't reading just a single line. But curiously, the length for me (will vary based on your random number generation) was 1,352,244. Huh?
The docs show an optional sep parameter. But when that is used:
arr = np.fromfile(filename, sep = ',')
...we get a length of 0?!
Ideally I'd be able to load a 2D array of arrays from this CSV file, but I'd settle for a single array from this CSV.
What am I missing here?
numpy.fromfile is not made to read .csv files; instead, it is made for reading data written with the numpy.ndarray.tofile method.
From the docs:
A highly efficient way of reading binary data with a known data-type, as well as parsing simply formatted text files. Data written using the tofile method can be read using this function.
By using it without a sep parameter, numpy assumes you are reading binary data with the default dtype of float64, so it simply reinterprets the raw bytes of the CSV as 8-byte floats; that is why the length looks arbitrary (it is roughly the file size in bytes divided by 8). When you specify a separator, the first token it tries to parse is the header text, which is not a number, so parsing stops straight away and you get an empty array.
To read a .csv file using numpy, I think you can use numpy.genfromtxt or numpy.loadtxt (from this question).
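A hedged sketch of reading the CSV generated in the question with genfromtxt (names=True consumes the 'a,b' header row); loadtxt with skiprows=1 works similarly:
import numpy as np

# Structured array with fields named after the CSV header.
arr = np.genfromtxt('my_file.csv', delimiter=',', names=True)
print(arr.shape)       # (1000000,)
print(arr['a'][:5])    # first five values of column 'a'

# Or a plain 2D float array, skipping the header row.
arr2d = np.loadtxt('my_file.csv', delimiter=',', skiprows=1)
print(arr2d.shape)     # (1000000, 2)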
I need to read a file into a numpy array. The program only has access to the binary data from the file, and the original file extension if needed. The data the program receives would look something like the "data" shown below.
data = open('file.csv', 'rb').read()
I need to generate an array from this binary data. I do not have permission to write the data out to a file, so writing it to disk and then pointing numpy at that file won't work.
Is there some way I can treat the binary data like a file so I can use the numpy function below?
my_data = genfromtxt(data, delimiter=',')
Thanks.
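A minimal sketch, assuming the bytes are ordinary comma-separated text: wrap them in io.BytesIO, which genfromtxt accepts as a file-like object (the sample bytes below are placeholders):
import io
from numpy import genfromtxt

# 'data' stands in for the raw bytes the program receives.
data = b"1,2,3\n4,5,6\n"

# Wrap the bytes in an in-memory file object so genfromtxt can read them.
my_data = genfromtxt(io.BytesIO(data), delimiter=',')
print(my_data)   # [[1. 2. 3.]
                 #  [4. 5. 6.]]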