I have a large npy file with 33,696 lines of data. I would like to split this into 18 small files with 1872 lines of data in each. I have tried to use the same code I use for splitting a large text file into small text files, but I do not get the output that I want. What alternative code can be used to achieve this?
I tried to repeat the same steps as done for the text file but did not receive the output that I wanted.
A .npy file is a binary format that must be loaded as a whole, whereas a text file can be read one line at a time. That means you cannot expect code written for text files to correctly process binary .npy files.
Here you should:
load the npy file into a large numpy array (numpy.load)
split the large numpy array into smaller ones
save each smaller array to its own file (numpy.save)
Possible code:
import numpy as np

arr = np.load('large_file.npy')
for rank, start in enumerate(range(0, 33696, 1872)):
    # each slice holds 1872 rows; rank numbers the output files 0..17
    np.save(f'small_{rank}.npy', arr[start:start + 1872])
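For what it's worth (my addition, not part of the original answer), numpy.array_split can produce the same 18 pieces without computing the offsets by hand:

import numpy as np

arr = np.load('large_file.npy')
# 33,696 rows divide evenly into 18 arrays of 1872 rows each
for rank, piece in enumerate(np.array_split(arr, 18)):
    np.save(f'small_{rank}.npy', piece)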
Related
I'm trying to save data from an array called Sevol. This matrix has 100 rows and 1000 columns, so len(Sevol[i]) is 1000 and Sevol[0][0] is the first element of the first row.
I tried to save this array with the command
np.savetxt(path + '/data_Sevol.txt', Sevol[i], delimiter=" ")
It works fine. However, I would like the file to be organized as an array anyway. For example, currently, the file is being saved like this in Notepad:
And I would like the data to remain organized, as for example in this file:
Is there an argument in the np.savetxt function or something I can do to better organize the text file?
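One thing that may help (a sketch under assumptions, since the screenshots are not reproduced here): np.savetxt writes a 1-D array one value per line, so passing the whole 2-D array, or reshaping a single row to shape (1, N), together with the fmt argument, keeps the values laid out as aligned rows and columns:

import numpy as np

Sevol = np.random.rand(100, 1000)  # stand-in for the real data (assumption)

# A 2-D array is written one matrix row per line; fmt pads every value
# to a fixed width so the columns stay aligned in a plain text editor.
np.savetxt('data_Sevol.txt', Sevol, fmt='%12.6f', delimiter=' ')

# A single row Sevol[i] is 1-D and would be written one value per line;
# reshaping it to (1, 1000) keeps it on a single line instead.
np.savetxt('data_Sevol_row0.txt', Sevol[0].reshape(1, -1), fmt='%12.6f')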
I am trying to perform analysis on dozens of very large CSV files, each with hundreds of thousands of rows of time series data and each roughly 5 GB in size.
My goal is to read each of these CSV files in as a dataframe, perform calculations on it, append some new columns based on those calculations, and then write each new dataframe to a unique output CSV file. This whole process happens within a for loop iterating over a folder containing all of these large CSV files, so it is very memory intensive, and when I try to run my code I am met with this error message: MemoryError: Unable to allocate XX MiB for an array with shape (XX,) and data type int64
And so I want to explore a way to make reading in my CSVs much less memory intensive, which is why I want to try out the pickle module in Python.
To "pickle" each CSV and then read it in I try the following:
#Pickle the dataframe built from the CSV, then read the pickle back in
import pickle
import pandas as pd

df = pd.read_csv(path_to_csv)
filename = "pickle.csv"
file = open(filename, 'wb')
pickle.dump(df, file)
file.close()
file = open(filename, 'rb')
pickled_df = pickle.load(file)
print(pickled_df)
However, after including this pickling code in my larger script, I get the same error message as above. I suspect this is because I am still reading the file in with pandas to begin with before pickling, and then reading that pickle. My question is: how do I avoid the memory-intensive process of reading my data into a pandas dataframe, by just reading in the CSV with pickle? Most of the instructions I find tell me to pickle the CSV and then read in that pickle, but I do not understand how to pickle the CSV without first reading in that CSV with pandas, which is what is causing my code to crash. I am also confused about whether reading in my data as a pickle would still give me a dataframe I can perform calculations on.
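For context, one common pattern that avoids holding a whole file in memory (this is only a sketch, not taken from the original post: the chunk size, the column names, and the output_path variable are assumptions) is pandas' chunksize argument, which yields the CSV in pieces:

import pandas as pd

chunk_iter = pd.read_csv(path_to_csv, chunksize=100_000)  # 100,000 rows per chunk (assumption)
with open(output_path, "w", newline="") as out:
    for i, chunk in enumerate(chunk_iter):
        # placeholder calculation on a hypothetical column
        chunk["new_col"] = chunk["some_col"] * 2
        # write the header only for the first chunk, then append rows
        chunk.to_csv(out, header=(i == 0), index=False)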
I have a large binary file (~4 GB) written as 4-byte reals. I am trying to read this file using numpy's fromfile as follows.
data = np.fromfile(filename, dtype=np.single)
Upon inspecting data, I see that all elements are zeros. However, when I read the file in Matlab, I can see that it contains the correct data and not zeros. I tested a smaller file (~2.5GB) and numpy could read that fine.
I finally tried using np.memmap to read the large file (~4GB), as
data = np.memmap(filename, dtype=np.single, mode='r')
and upon inspecting data, I can see that it correctly reads the data.
My question is: why is np.fromfile giving me all zeros in the array? Is there a limit on how much np.fromfile can read?
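As a workaround sketch (my addition, not something from the original post), the memmap that does read correctly can be copied into an ordinary in-memory array if one is needed:

import numpy as np

# np.memmap lazily maps the file; np.array(...) copies it into RAM,
# which requires enough free memory to hold the full ~4 GB of float32 data.
data = np.array(np.memmap(filename, dtype=np.single, mode='r'))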
I have a .csv file several million lines long. I would like to have Python sequentially read N lines at a time and then put them into a NumPy array to be processed. I would like to be able to move on to the next N lines without reading the file from the beginning. numpy.genfromtxt() works very well for the simple subcase of an N-line csv file.
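A minimal sketch of one way to do this (assumptions: a purely numeric, comma-delimited file named data.csv, a block size of 1000, and a hypothetical process() function):

import itertools
import numpy as np

N = 1000  # rows per block (assumption)
with open("data.csv") as fh:
    while True:
        # islice pulls the next N lines from the open handle without rewinding
        lines = list(itertools.islice(fh, N))
        if not lines:
            break
        block = np.genfromtxt(lines, delimiter=",")
        process(block)  # hypothetical per-block processing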
I have a csv file which is too large to completely fit into my laptop's memory (about 10GB). Is there a way to truncate the file such that only the first n entries are saved in a new file? I started by trying
df = pandas.read_csv("path/data.csv").as_matrix()
but this doesn't work since the memory is too small.
Any help will be appreciated!
Leon
Use nrows:
df = pandas.read_csv("path/data.csv", nrows=1000)
The nrows docs say:
Number of rows of file to read. Useful for reading pieces of large files
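If the goal is then to save those first n rows as a new, smaller file, the truncated dataframe can be written back out with to_csv (the output filename here is just an example):

import pandas

df = pandas.read_csv("path/data.csv", nrows=1000)
# write the truncated data to a new CSV without the integer index column
df.to_csv("path/data_first_1000.csv", index=False)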