How to append to a pickle file in batches - Python

So I have a large dataset of files and I wrote a program to put them into a pickle file, but I only have 2 GB of RAM, so I can't hold the entire dataset in an array. How can I append to the file in multiple batches ("stuff data into the array, append it to the pickle file, clear the array, repeat")? How can I do that?
thanks

Actually, I don't think it's possible to append data to a pickle file, and even if it were, you would run into memory issues when trying to read the pickle file back.
Pickle files are not designed for large data storage, so it might be worth switching to another file format.
You could go with text-based formats like CSV or JSON, or binary formats like HDF5, which is specifically optimized for large amounts of numerical data.
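As an illustration, here is a minimal sketch of the batch-append workflow using h5py and a resizable HDF5 dataset. The file name, the dataset name, and the batches() generator are placeholders for your own data-loading loop; it assumes each batch is a NumPy array whose shape only varies along the first axis.

import numpy as np
import h5py

# Placeholder: replace with your own loop that reads one batch of files into an array.
def batches():
    for _ in range(10):
        yield np.random.rand(100_000, 3)

with h5py.File('dataset.h5', 'w') as f:
    dset = None
    for batch in batches():
        if dset is None:
            # First batch: create a dataset that can grow along axis 0.
            dset = f.create_dataset('data', data=batch,
                                    maxshape=(None,) + batch.shape[1:],
                                    chunks=True)
        else:
            # Grow the dataset and write the new batch at the end,
            # then let the in-memory array go out of scope before the next batch.
            dset.resize(dset.shape[0] + batch.shape[0], axis=0)
            dset[-batch.shape[0]:] = batch

Only one batch is held in RAM at a time, and the data can later be read back in slices without loading the whole file.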

Related

Is it more beneficial to read many small files or fewer large files of the exact same data?

I am working on a project where I am combining 300,000 small files to form a dataset for training a machine learning model. Because each of these files does not represent a single sample, but rather a variable number of samples, the dataset I need can only be formed by iterating through each file and concatenating/appending its contents to a single, unified array. That said, I unfortunately cannot avoid iterating through these files to form the dataset I require, so the data loading that happens before model training is very slow.
Therefore my question is this: would it be better to merge these small files into relatively larger files, e.g., reducing the 300,000 files to 300 (merged) files? I assume that iterating through fewer (but larger) files would be faster than iterating through many (but smaller) files. Can someone confirm whether this is actually the case?
For context, my programs are written in Python and I am using PyTorch as the ML framework.
Thanks!
Usually working with one bigger file is faster than working with many small files.
It needs fewer open, read, close, etc. calls, each of which takes time to
check if the file exists,
check if you have the privilege to access the file,
get the file's information from disk (where the file begins on disk, what its size is, etc.),
seek to the beginning of the file on disk (when it has to read data),
create the system's buffer for data from disk (the system reads extra data into the buffer so that later read() calls can read from the buffer instead of going back to the disk).
With many files it has to do all of this for every file, and the disk is much slower than a buffer in memory.
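As a minimal sketch of the merging step: the directory names, the 1,000-files-per-shard figure, and the assumption that each small file is a .npy array that can be concatenated along axis 0 are all hypothetical; adapt them to your data.

import numpy as np
from pathlib import Path

small_files = sorted(Path('small_files').glob('*.npy'))   # hypothetical input directory
files_per_shard = 1000                                     # e.g. 300,000 files -> 300 shards
out_dir = Path('shards')
out_dir.mkdir(exist_ok=True)

for i in range(0, len(small_files), files_per_shard):
    chunk = small_files[i:i + files_per_shard]
    # Each small file holds a variable number of samples, so concatenate along axis 0.
    merged = np.concatenate([np.load(p) for p in chunk], axis=0)
    np.save(out_dir / f'shard_{i // files_per_shard:04d}.npy', merged)

Training code then only has to open 300 files instead of 300,000, so the per-file open/seek overhead largely disappears.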

Is there a way to read an array in csv format?

This is the code I usually use:
dataset = np.loadtxt('path to dataset', delimiter=',')
x_train = dataset[:700, 0:3]
y_train = dataset[:700, 3]
x_test = dataset[700:, 0:3]
y_test = dataset[700:, 3]
And I have billions of rows of training data.
Putting all of this data into a CSV file and loading it this way is a pain for the computer.
I use 'sleep' to fetch 100,000 rows at a time into a NumPy array.
Is there a way to read the data into memory directly in CSV format?
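A common way to avoid loading the whole CSV at once is to stream it in fixed-size chunks; the sketch below uses pandas' chunksize option. The file path and the 100,000-row chunk size follow the question, and the column split mirrors the loadtxt example above; everything else is an assumption.

import numpy as np
import pandas as pd

# Read the CSV 100,000 rows at a time instead of all at once.
reader = pd.read_csv('path to dataset', header=None, chunksize=100_000)

for chunk in reader:
    batch = chunk.to_numpy()
    x_batch = batch[:, 0:3]   # same column split as in the loadtxt example
    y_batch = batch[:, 3]
    # ... train on (x_batch, y_batch) here, then let the chunk be garbage-collected ...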

Why would data have different footprints on disk versus in memory?

"We have the 2015 Yellow Cab NYC Taxi data as 12 CSV files on S3... This data is about 20GB on disk or 60GB in RAM."
I came across this observation while trying out Dask, a Python framework for handling out-of-memory datasets.
Can someone explain why there is a 3x difference? I'd imagine it has to do with Python objects, but I am not 100% sure.
Thanks!
You are reading from a CSV on disk into a structured data frame object in memory. The two things are not at all analogous. The CSV data on disk is a single string of text. The data in memory is a complex data structure, with multiple data types, internal pointers, etc.
The CSV itself is not taking up any RAM. There is a complex data structure that is taking up RAM, and it was populated using data sourced from the CSV on disk. This is not at all the same thing.
To illustrate the difference, you could try reading the CSV into an actual single string variable and seeing how much memory that consumes. In this case, it would effectively be a single CSV string in memory:
with open('data.csv', 'r') as csvFile:
    data = csvFile.read()
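To make the comparison concrete, a rough sketch like the following (assuming the same hypothetical 'data.csv') shows the gap between the raw text and the parsed data frame:

import sys
import pandas as pd

with open('data.csv', 'r') as csvFile:
    data = csvFile.read()
print(sys.getsizeof(data))                 # roughly the on-disk size plus some string overhead

df = pd.read_csv('data.csv')
print(df.memory_usage(deep=True).sum())    # usually much larger: per-column dtypes, index, object overhead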

np.load to a paths file

I have a file with all of the pathnames, one per .npy file. I have about 5 million files, so I would like to avoid unnecessary for loops.
What I need to do is to load them all into my data variable like this:
data = np.load( input_file_w_pathnames )
I know this will not work but I was wondering if someone knows of a clever way to do something similar, or at least a way to do this efficiently.
np.load takes a filename or a file object (a file that you opened). It uses standard Python file reading tools. It does not take multiple names or files.
np.stack([np.load(f) for f in ['x.npy','x.npy','x.npy']])
can join the arrays from each file into a larger array, but it is still loading file by file.
Keep in mind that numpy 'efficiency' is achieved by performing the task in compiled code - it's faster because of the compilation, not because it gets around the serial nature of the task. And this task does not come up often enough to warrant special code.
I assume you can easily deal with loading the filenames into a list.
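As a minimal sketch of that last step (assuming a hypothetical 'paths.txt' with one .npy path per line, and arrays whose shapes are compatible for stacking):

import numpy as np

with open('paths.txt') as f:
    paths = [line.strip() for line in f if line.strip()]

# Still one np.load call per file, just driven from the list of pathnames.
data = np.stack([np.load(p) for p in paths])

With ~5 million files, the per-file load will dominate no matter how this is written, so merging the arrays into fewer, larger files ahead of time (as discussed in the earlier answer) may help more than any one-liner.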

Loading Large Files as Dictionary

My first question on stackoverflow :)
I am trying to load pretrained GloVe vectors and create a dictionary with words as keys and the corresponding vectors as values. I did it the usual naive way:
fp = open(wordEmbdfile)
self.wordVectors = {}
# Create wordVector dictionary
for aline in fp:
    w = aline.rstrip().split()
    self.wordVectors[w[0]] = w[1:]
fp.close()
I see huge memory pressure in Activity Monitor, and eventually, after running for an hour or two, it crashes.
I am going to try splitting the file into multiple smaller files and creating multiple dictionaries.
In the meantime I have the following questions:
To read the word2vec file, is it better to read the gzipped file with gzip.open, or to uncompress it first and then read it with plain open?
The word-vector file has text in the first column and floats in the rest; would it be more optimal to use genfromtxt or loadtxt from numpy?
I intend to save this dictionary using cPickle, and I know loading it is going to be hard too. I read the suggestion to use shelve; how does it compare to cPickle in loading time and access time? Maybe it's better to spend some more time loading with cPickle if that improves future accesses (assuming cPickle does not crash with 8 GB of RAM). Does anyone have a suggestion on this?
Thanks!
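One detail worth noting about the code above: storing each vector as a list of Python strings costs far more memory than storing it as a NumPy float32 array. A hedged sketch of the same loop with that change (the file path is hypothetical, and the gzip handling is an assumption):

import gzip
import numpy as np

wordEmbdfile = 'glove.6B.100d.txt'   # hypothetical path; may also be a .gz file
wordVectors = {}

# Open as text, transparently handling a gzipped file if the name ends in .gz.
opener = gzip.open if wordEmbdfile.endswith('.gz') else open
with opener(wordEmbdfile, 'rt') as fp:
    for aline in fp:
        w = aline.rstrip().split()
        # float32 array: ~4 bytes per dimension instead of one Python string per value
        wordVectors[w[0]] = np.asarray(w[1:], dtype=np.float32)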
