I'm processing a large CSV file (200 MB) with Python; the file was originally generated with R.
I do some tinkering with the file (normalization, scaling, removing junk columns, etc.) and then save it again using numpy's savetxt with the delimiter set to ',' to keep the CSV format.
The thing is, the new file is almost twice as large as the original (almost 400 MB). Both the original data and the new data are just arrays of floats.
If it helps, it looks as if the new file has really small values that need exponential notation, which the original did not have.
Any idea why this is happening?
Have you looked at the way the floats are represented as text before and after? A line like "1.,2.,3." might become "1.000000e+00,2.000000e+00,3.000000e+00" or something like that; the two are both valid and both represent the same numbers.
More likely, however, is that the original file contained floats with relatively few significant digits (for example "1.1, 2.2, 3.3"), and after you do normalization and scaling you "create" more digits, which are needed to represent the results of your math but do not correspond to a real increase in precision (for example, normalizing the values in the last example to sum to 1.0 gives "0.1666666, 0.3333333, 0.5").
I guess the short answer is that there is no guarantee (and no requirement) for floats represented as text to occupy any particular amount of storage space, or less than the maximum possible per float; it can vary a lot even if the data remains the same, and will certainly vary if the data changes.
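If the growth comes purely from the text representation, savetxt's fmt argument lets you control how many digits get written; by default it uses '%.18e', i.e. full double precision in exponential form. A minimal sketch (the file names and chosen precision are illustrative):

import numpy as np

data = np.loadtxt("original.csv", delimiter=",")   # illustrative input file

# ... normalization, scaling, dropping junk columns ...

# write at most ~6 significant digits per value instead of the default '%.18e'
np.savetxt("processed.csv", data, delimiter=",", fmt="%.6g")

This trades file size against precision, so it only makes sense if that many significant digits are genuinely enough for your data.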
I hope this is a good question; if I should post it as an issue on the PyPolars GitHub instead, please let me know.
I have a quite large parquet file where some columns contain binary data.
These columns are not interesting to me right now, so it is fine that PyPolars does not support the Binary datatype so far (this is how I understand it at least; my question would be irrelevant if that were not the case!), but I would like to make full use of the query optimization by lazily reading the file with .scan_parquet() instead of .read_parquet().
Currently .scan_parquet() gives me the following error:
pyo3_runtime.PanicException: Arrow datatype Binary not supported by Polars. You probably need to activate that data-type feature.
and I don't know of a way to 'activate that data-type feature'.
So my workaround is to use .read_parquet() and specify in advance which columns I want to use so that it never attempts to read the Binary ones.
The problem is that I am doing exploratory data analysis and there are a large number of columns, so for one it is annoying to have to specify a long list of columns (basically ~150 minus the two that cause the issue), and it is also inefficient to read all of these columns every time when I only need a small subset (it is even more annoying to have to edit that list every time I, for example, add a filter).
It would be ideal if I could use .scan_parquet() and let the query optimizer figure out that it only needs to read the (unproblematic) columns I actually need.
Is there a better way of doing things that I am not seeing?
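Not a real fix, but a possible workaround sketch while .scan_parquet() rejects the Binary columns: read the file's schema with pyarrow, drop the binary fields, and hand the remaining names to .read_parquet() via its columns argument, so the exclusion list is built automatically instead of maintained by hand (the path and variable names are illustrative):

import polars as pl
import pyarrow as pa
import pyarrow.parquet as pq

path = "data.parquet"                                    # illustrative path

# collect every column name except the Binary ones
schema = pq.read_schema(path)
usable = [f.name for f in schema if not pa.types.is_binary(f.type)]

df = pl.read_parquet(path, columns=usable)

This still reads all the usable columns eagerly, so it does not give you the lazy query optimization, but it at least removes the hand-maintained column list.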
I've just started learning numpy to analyse experimental data, and I am trying to give custom names to the columns of an ndarray so that I can do operations without slicing.
The data I receive is generally a two-column .txt file (I'll call the columns X and Y for the sake of clarity) with a large number of rows corresponding to the measured data. I do operations on those data and generate new columns (I'll call them F(X,Y), G(X,Y,F), ...). I know I can do column-wise operations by slicing, i.e. Data[:,2] = Data[:,1] + Data[:,0], but with a large number of added columns this becomes tedious. Hence I'm looking for a way to label the columns so that I can refer to a column by its label, and can also label the new columns I generate. So essentially I'm looking for something that'll allow me to directly write F = X + Y (as a substitute for the example above).
Currently, I'm assigning each column to a new variable, doing the operations, and then 'hstack'ing the result to the data, but I'm unsure of the memory usage here. For example,
X = Data[:, 0]                                  # first measured column
Y = Data[:, 1]                                  # second measured column
F = X + Y                                       # derived quantity
Data = numpy.hstack((Data, F.reshape(n, 1)))    # append F as a new column
I've seen the use of structured arrays and record arrays, but the data I'm dealing with is homogeneous and new columns are being added continuously. Also, I hear Pandas is well suited for what I'm describing, but since I'm working with purely numerical data, I don't see the need to learn a new module unless it's really necessary. Thanks in advance.
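For what it's worth, one lightweight option (a sketch of an assumption, not something from the post) is to keep each column as a 1-D array in a plain dict keyed by name; that already gives the F = X + Y style without structured arrays or pandas (the file name and column names are illustrative):

import numpy

raw = numpy.loadtxt("measurement.txt")        # illustrative two-column data file

cols = {"X": raw[:, 0], "Y": raw[:, 1]}       # label the measured columns by name
cols["F"] = cols["X"] + cols["Y"]             # derived column, no slicing needed
cols["G"] = cols["F"] * cols["X"]             # further derived columns work the same way

# stack back into a single 2-D array only if/when a flat table is needed
Data = numpy.column_stack(list(cols.values()))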
I had a DataFrame whose memory usage was 159.7 MB. When I used the .to_csv method to write it to storage, the written file was about 400 MB, yet when I loaded that file back its memory usage was again 159.7 MB. Is there an explanation for this difference in sizes, and how can I write it so that it takes less space on the hard drive? Thank you for your help.
If your DataFrame contains strings, try using a tab as the delimiter instead of a comma. That could save you the need for quotes.
df.to_csv('new_file.csv', sep='\t')
The easiest way to reduce the size of the csv is to compress it when writing, using the compression parameter of to_csv. For example df.to_csv('new_file.csv.gz', compression='gzip').
There are a variety of reasons the memory usage could be so different from the size of the csv on disk, it's a little hard to say without knowing any specifics about the data you're working with.
One generic recommendation is to check the precision of any floating point values in your dataframe; if you're writing a bunch of numbers with 15 decimal places of precision or something, that will take up a lot of space. Try truncating these values to the precision you actually need.
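A short sketch combining both ideas (the file name and precision are illustrative; float_format and compression are both regular to_csv parameters):

# round floats to 4 decimal places in the output and gzip the result
df.to_csv('new_file.csv.gz', float_format='%.4f', compression='gzip')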
I have a huge sequence (1,000,000) of small matrices (32x32) stored in an hdf5 file, each one with a label.
Each of these matrices represents the data of a sensor at a specific time.
I want to obtain the evolution of each pixel over a small time slice, different for each x,y position in the matrix.
This is taking more time than I expected.
def getPixelSlice(self, xpixel, ypixel, initphoto, endphoto):
    # obtain h5 keys inside the time range between initphoto and endphoto
    valid = np.where(np.logical_and(self.photoList >= initphoto, self.photoList < endphoto))

    # look at pixel data in the valid frames
    evolution = []

    # for each valid frame, obtain the data and append the target pixel to the list
    for frame in valid[0]:
        data = self.h5f[str(self.photoList[frame])]
        evolution.append(data[ypixel][xpixel])

    return evolution, valid
So, there is a problem here that took me a while to sort out for a similar application. Due to the physical limitations of hard drives, the data are stored in such a way that with a three dimensional array it will always be easier to read in one orientation than another. It all depends on what order you stored the data in.
How you handle this problem depends on your application. My specific application can be characterized as "write few, read many". In this case, it makes the most sense to store the data in the order that I expect to read it. To do this, I use PyTables and specify a "chunkshape" that is the same as one of my timeseries. So, in your case it would be (1,1,1000000). I'm not sure if that size is too large or not, though, so you may need to break it down a bit farther, say (1,1,10000) or something like that.
For more info see PyTables Optimization Tips.
For applications where you intend to read in a specific orientation many times, it is crucial that you choose an appropriate chunk shape for your HDF5 arrays.
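A minimal sketch of what that could look like with PyTables (the file name, dtype, and chunkshape are illustrative assumptions; the point is that each chunk runs along the time axis, so reading one pixel's timeseries touches few chunks):

import tables

# store the data as (x, y, time) with chunks that are long in the time axis
with tables.open_file("sensors.h5", mode="w") as h5f:
    frames = h5f.create_carray(h5f.root, "frames",
                               atom=tables.Float32Atom(),
                               shape=(32, 32, 1000000),
                               chunkshape=(1, 1, 10000))
    # frames[x, y, t0:t1] = ...   # fill the array chunk by chunk
    # later: frames[xpixel, ypixel, initphoto:endphoto] reads one contiguous slice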
I have a dictionary with many entries and a huge vector as each value. These vectors can be 60,000 dimensions large and I have about 60,000 entries in the dictionary. To save time, I want to store this after computation. However, using pickle led to a huge file. I have tried storing to JSON, but the file remains extremely large (like 10.5 MB on a sample of 50 entries with fewer dimensions). I have also read about sparse matrices. As most entries will be 0, this is a possibility. Will this reduce the file size? Is there any other way to store this information? Or am I just unlucky?
Update:
Thank you all for the replies. I want to store this data because these are word counts. For example, when given sentences, I store the number of times word 0 (at location 0 in the array) appears in the sentence. There are obviously more words in all sentences than appear in one sentence, hence the many zeros. Then, I want to use this array to train at least three, maybe six classifiers. It seemed easier to create the arrays with word counts and then run the classifiers overnight to train and test. I use sklearn for this. This format was chosen to be consistent with other feature vector formats, which is why I am approaching the problem this way. If this is not the way to go in this case, please let me know. I am very much aware that I have much to learn in coding efficiently!
I also started implementing sparse matrices. The file is even bigger now (testing with a sample set of 300 sentences).
Update 2:
Thank you all for the tips. John Mee was right that I didn't need to store the data. Both he and Mike McKerns told me to use sparse matrices, which sped up the calculation significantly! So thank you for your input. Now I have a new tool in my arsenal!
See my answer to a very closely related question https://stackoverflow.com/a/25244747/2379433, if you are ok with pickling to several files instead of a single file.
Also see: https://stackoverflow.com/a/21948720/2379433 for other potential improvements, and here too: https://stackoverflow.com/a/24471659/2379433.
If you are using numpy arrays, it can be very efficient, as both klepto and joblib understand how to use minimal state representation for an array. If you indeed have most elements of the arrays as zeros, then by all means, convert to sparse matrices... and you will find huge savings in storage size of the array.
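As an illustration of the sparse route (a hedged sketch, not part of the original answer; it assumes scipy is available and uses its CSR format plus save_npz for a compact on-disk file):

import numpy as np
from scipy import sparse

# tiny illustrative sample: 3 count vectors of dimension 60,000, mostly zeros
rows = np.zeros((3, 60000), dtype=np.uint8)
rows[0, 42] = 3                          # e.g. word 42 appears 3 times in sentence 0

counts = sparse.csr_matrix(rows)         # only the nonzero entries are stored
sparse.save_npz("word_counts.npz", counts)
counts_back = sparse.load_npz("word_counts.npz")

Most sklearn estimators accept CSR matrices directly, so the counts never need to be densified for training.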
As the links above discuss, you could use klepto -- which provides you with the ability to easily store dictionaries to disk or database, using a common API. klepto also enables you to pick a storage format (pickle, json, etc.) -- where HDF5 is coming soon. It can utilize both specialized pickle formats (like numpy's) and compression (if you care about size and not speed).
klepto gives you the option to store the dictionary with "all-in-one" file or "one-entry-per" file, and also can leverage multiprocessing or multithreading -- meaning that you can save and load dictionary items to/from the backend in parallel.
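A minimal sketch of the klepto route, under the assumption that dir_archive's one-file-per-entry layout is what you want (the archive name and keys are illustrative):

from klepto.archives import dir_archive

# one file per dictionary entry, kept in memory until dump() is called
store = dir_archive("word_vectors", cached=True, serialized=True)
store["sentence_0"] = some_numpy_count_vector    # hypothetical array for one entry
store.dump()                                     # write the cached entries to disk

# later: reload the archive back into the in-memory cache
store = dir_archive("word_vectors", cached=True, serialized=True)
store.load()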
With 60,000 dimensions, do you mean 60,000 elements? If this is the case and the numbers are 1..10, then a reasonably compact but still efficient approach is to use a dictionary of Python array.array objects with 1 byte per element (type 'B').
The size in memory should be about 60,000 entries x 60,000 bytes, totaling about 3.35 GiB of data.
That data structure pickles to about the same size on disk, too.
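A small sketch of that layout (the dictionary keys are illustrative; type 'B' stores unsigned bytes, so each count must stay in 0..255):

from array import array
import pickle

vectors = {}
vectors["sentence_0"] = array("B", [0] * 60000)   # one byte per word slot
vectors["sentence_0"][42] = 3                     # word 42 seen 3 times

with open("vectors.pkl", "wb") as f:
    pickle.dump(vectors, f, protocol=pickle.HIGHEST_PROTOCOL)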