How to have indexed files in storage in Python

I have a huge dataset of images and I am processing them one by one.
All the images are stored in a folder.
My Approach:
What I have tried is reading all the filenames into memory and, whenever a call for a certain index comes in, loading the corresponding image.
The problem is that it is not even possible to keep the paths and names of the files in memory, because the dataset is so large.
Is it possible to have an indexed file on storage, so that one can read the file name at a certain index?
Thanks a lot.
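One possible approach, sketched below with assumed names (not from the original thread): build a small on-disk SQLite index once (os.scandir streams directory entries, so the full listing never has to sit in memory), then look any filename up by its position.

import os
import sqlite3

# Minimal sketch with assumed paths/names: map a position to a filename stored on disk.
conn = sqlite3.connect('file_index.db')
with conn:
    conn.execute('CREATE TABLE IF NOT EXISTS files (id INTEGER PRIMARY KEY, name TEXT)')
    conn.executemany('INSERT INTO files (name) VALUES (?)',
                     ((entry.name,) for entry in os.scandir('images')))

def filename_at(index):
    # ids start at 1, so index 0 maps to id 1
    row = conn.execute('SELECT name FROM files WHERE id = ?', (index + 1,)).fetchone()
    return row[0] if row else None

A fixed-width text file read with seek() would also work; SQLite is just the least code.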

Related

Is it more beneficial to read many small files or fewer large files of the exact same data?

I am working on a project where I am combining 300,000 small files to form a dataset to be used for training a machine learning model. Because each of these files does not represent a single sample, but rather a variable number of samples, the dataset I require can only be formed by iterating through each of these files and concatenating/appending them into a single, unified array. This means I unfortunately cannot avoid iterating through the files to form the dataset, so the data-loading step prior to model training is very slow.
Therefore my question is this: would it be better to merge these small files into relatively larger files, e.g., reducing the 300,000 files to 300 (merged) files? I assume that iterating through fewer (but larger) files would be faster than iterating through many (but smaller) files. Can someone confirm whether this is actually the case?
For context, my programs are written in Python and I am using PyTorch as the ML framework.
Thanks!
Usually working with one bigger file is faster than working with many small files.
It needs fewer open, read, close, etc. calls, each of which takes time to:
check if the file exists,
check if you have the privilege to access the file,
get the file's information from disk (where the file begins on disk, what its size is, etc.),
seek to the beginning of the file on disk (when it has to read data),
create the system's buffer for data from disk (the system reads more data into the buffer so that later read() calls can read partially from the buffer instead of partially from disk).
With many files it has to do all of this for every file, and the disk is much slower than a buffer in memory.
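For example, here is a rough sketch of merging the small files into fewer, larger ones (assuming each small file is a .npy array whose rows are samples; the folder names and group size are made up):

import numpy as np
from pathlib import Path

small_files = sorted(Path('small_files').glob('*.npy'))
Path('merged').mkdir(exist_ok=True)
group_size = 1000  # e.g. 300,000 small files -> 300 merged files
for i in range(0, len(small_files), group_size):
    group = small_files[i:i + group_size]
    # concatenate the variable-length sample arrays of this group into one array
    merged = np.concatenate([np.load(f) for f in group], axis=0)
    np.save(Path('merged') / f'part_{i // group_size:05d}.npy', merged)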

Save python objects to separate files, and also their structure

I have a collection of mainly numerical data-files that are the result of running a physics simulation (or several). I can convert the files into pandas dataframes. It is natural to organize the dataframe objects in lists, lists of lists etc. For example:
allData = [df1, [df11, df12], df2, [df21, df22]]
I want to save this data to files (to be sent). I know the whole thing can be dumped into one file with e.g. a pickle format, but I don't want this because some files can be large and I want to be able to load the files selectively. So each dataframe should be stored as a separate file.
But I also want to store how the objects are organized into lists, for example in another file, so that when reading the files somewhere else, Python will know how the data files are connected.
Possibly I could solve this by inventing some system of writing the filenames and how they are structured into a txt file. But is there a proper/cleaner way to do it?
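One option, sketched below with assumed names (not from the original post): save each dataframe to its own file and write the nesting as a JSON manifest of filenames that mirrors the list structure, so a loader can rebuild the same lists.

import json
from itertools import count
import pandas as pd

_ids = count()

def save_tree(obj, folder):
    # Recursively replace each DataFrame by the name of the file it was saved to,
    # keeping the nesting of the lists intact.
    if isinstance(obj, pd.DataFrame):
        name = f'df_{next(_ids):04d}.pkl'
        obj.to_pickle(f'{folder}/{name}')
        return name
    return [save_tree(item, folder) for item in obj]

def load_tree(node, folder):
    # Mirror of save_tree: strings become DataFrames again, lists stay lists.
    if isinstance(node, str):
        return pd.read_pickle(f'{folder}/{node}')
    return [load_tree(item, folder) for item in node]

# usage sketch (the folder 'out' must already exist):
# manifest = save_tree(allData, 'out')
# with open('out/structure.json', 'w') as fh:
#     json.dump(manifest, fh)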

Sort labeled dataset with labels in separate csv file?

I have a dataset of ~3500 images, with the labels of each image in a csv file. The csv file has two columns: the first one contains the exact name of the image file (i.e. 00001.jpg) and the second column contains the label of the image. There are a total of 7 different labels.
How can I sort the images from one huge folder into 7 different folders (each image in its respective category) in an efficient manner? Does anyone have a script that can do this?
Also, is there any way I can do this with Google Drive? I've already uploaded the dataset to Drive in order to use with Colab soon, so I don't want to have to do it again (takes ~2.5 hours).
I'm not sure about performance; there are probably better ways...
But this would be my take on the problem (not tested, so it might need small adjustments).
I'm assuming the images are in a subfolder images/, while the CSV and the script are in the root folder. Furthermore, I'm assuming the CSV is named images.csv and its columns are titled file and label.
import pandas as pd
import os

df = pd.read_csv('images.csv')  # columns: 'file' and 'label'
for _, row in df.iterrows():
    f = row['file']
    l = row['label']
    os.makedirs(f'images/{l}', exist_ok=True)  # create the label folder if it doesn't exist yet
    os.replace(f'images/{f}', f'images/{l}/{f}')
I don't know how Google Drive would handle it, but as long as you can run the script on a Drive-synced folder, I don't see why this should be an issue.
Note: if you test it, you may want to do so on a copy of the files, in case I screwed up...

Python: Can I write to a file without loading its contents in RAM?

I've got a big dataset that I want to shuffle. The entire set won't fit into RAM, so it would be good if I could open several files (e.g. hdf5, numpy) simultaneously, loop through my data chronologically and randomly assign each data point to one of the piles (then afterwards shuffle each pile).
I'm really inexperienced with working with data in Python, so I'm not sure if it's possible to write to files without holding the rest of their contents in RAM (I've been using np.save and savez with little success).
Is this possible in h5py or numpy and, if so, how could I do it?
Memory-mapped files will allow for what you want. They create a numpy array that leaves the data on disk, loading it only as needed. The complete manual page is here. The easiest way to use them is to pass mmap_mode='r+' or mmap_mode='w+' in the call to np.load, which leaves the data on disk (see here).
I'd suggest using advanced indexing. If you have data in a one-dimensional array arr, you can index it using a list, so arr[[0, 3, 5]] will give you the 0th, 3rd and 5th elements of arr. That will make selecting the shuffled versions much easier. Since this will overwrite the data, you'll need to open the source files read-only and create copies (using mmap_mode='w+') to put the shuffled data in.
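As a rough illustration of the two ideas above (the file names, shapes and the use of np.lib.format.open_memmap are assumptions, not part of the original answer):

import numpy as np

src = np.load('data.npy', mmap_mode='r')                 # read-only, data stays on disk
dst = np.lib.format.open_memmap('shuffled.npy', mode='w+',
                                dtype=src.dtype, shape=src.shape)
order = np.random.permutation(len(src))
chunk = 10_000                                           # copy in pieces to bound RAM use
for start in range(0, len(order), chunk):
    idx = order[start:start + chunk]
    dst[start:start + chunk] = src[idx]                  # advanced indexing loads only these rows
dst.flush()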

Pandas: efficiently write thousands of small files

Here is my problem.
I have a single big CSV file containing a bit more than 100M rows, which I need to divide into much smaller files (if needed I can add more details). At the moment I'm reading the big CSV in chunks, doing some computations to determine how to subdivide each chunk, and finally writing (appending) to the output files with
df.to_csv(outfile, float_format='%.8f', index=False, mode='a', header=header)
(the header variable is True if it is the first time that I write to 'outfile', otherwise it is False).
While running the code I noticed that the total disk space taken by the smaller files was on track to become larger than three times the size of the single big CSV.
So here are my questions:
Is this behavior normal? (It probably is, but I'm asking just in case.)
Is it possible to reduce the size of the files? (Different file formats?) [SOLVED through compression, see the update below and the comments]
Are there better file formats than CSV for this situation?
Please note that I don't have extensive knowledge of programming; I'm just using Python for my thesis.
Thanks in advance to whoever can help.
UPDATE: thanks to @AshishAcharya and @PatrickArtner I learned how to use compression while writing and reading the CSV. Still, I'd like to know if there are any file types that may be better than CSV for this task.
NEW QUESTION: (maybe stupid question) does appending work on compressed files?
UPDATE 2: using the compression option I noticed something that I don't understand. To determine the size of folders I was taught to use the du -hs <folder> command, but using it on the folder containing the compressed files or on the one containing the uncompressed files gives the same value of '3.8G' (both were created from the same first 5M rows of the big CSV). From the file explorer (Nautilus), instead, I get about 590MB for the folder containing the uncompressed CSVs and 230MB for the other. What am I missing?
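For reference, a minimal sketch of the chunked read/append pattern described above (big.csv, the 'key' column and the output naming are made up, not taken from the original post):

import pandas as pd

written = set()  # output files that already have a header
for chunk in pd.read_csv('big.csv', chunksize=1_000_000):
    # hypothetical rule: one output file per value of a 'key' column
    for key, part in chunk.groupby('key'):
        outfile = f'out_{key}.csv'
        part.to_csv(outfile, float_format='%.8f', index=False,
                    mode='a', header=outfile not in written)
        written.add(outfile)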
