Split mf4 files - python

I run into a memory problem when trying to convert MF4 files larger than 7 MB (for example 100 MB). I would like to split any file larger than 7 MB into several 7 MB files. Is this possible?
Here is the memory error I get for large files:
Unable to allocate 123. MiB for an array with shape (3570125,) and data type |S36
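
One way to approach this, assuming the asammdf library is used to handle the MF4 files (the question does not name a library), is to cut the measurement into time windows and save each slice as its own file. The window length, total duration, and file names below are placeholder assumptions you would tune so that each slice lands around 7 MB:

from asammdf import MDF  # assumption: asammdf is the MF4 library in use
import math

src = MDF("big_file.mf4")  # hypothetical input file

window = 60.0   # assumed slice length in seconds; adjust so each slice is ~7 MB
total = 600.0   # assumed total measurement duration in seconds

part = 0
start = 0.0
while start < total:
    # cut() returns a new MDF containing only the samples in [start, start + window]
    chunk = src.cut(start=start, stop=start + window)
    chunk.save(f"big_file_part{part}.mf4", overwrite=True)
    start += window
    part += 1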

Related

Is it more beneficial to read many small files or fewer large files of the exact same data?

I am working on a project where I am combining 300,000 small files to form a dataset for training a machine learning model. Because each of these files does not represent a single sample but rather a variable number of samples, the dataset I need can only be formed by iterating through each file and concatenating/appending its contents to a single, unified array. That said, I unfortunately cannot avoid iterating through these files to build the dataset, so data loading prior to model training is very slow.
Therefore my question is this: would it be better to merge these small files into relatively larger files, e.g. reducing the 300,000 files to 300 (merged) files? I assume that iterating through fewer (but larger) files would be faster than iterating through many (but smaller) files. Can someone confirm whether this is actually the case?
For context, my programs are written in Python and I am using PyTorch as the ML framework.
Thanks!
Usually working with one bigger file is faster than working with many small files.
It needs fewer open, read, close, etc. calls, each of which takes time to:
check whether the file exists,
check whether you have the privilege to access the file,
get the file's information from disk (where the file begins on disk, what its size is, etc.),
seek to the beginning of the file on disk (when it has to read data),
create the system's buffer for data from disk (the system reads extra data into the buffer so that later read() calls can read from the buffer instead of from disk).
With many files it has to do all of this for every file, and disk is much slower than a buffer in memory.
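
As a rough sketch of the merging idea, assuming each small file holds a NumPy array of samples with identical columns (the file layout and shard size here are made up for illustration), you could pack them into a few hundred larger shards once and then iterate over those during training:

import glob
import os
import numpy as np

small_files = sorted(glob.glob("samples/*.npy"))  # assumed input layout
files_per_shard = 1000  # e.g. 300,000 small files -> 300 merged shards

os.makedirs("shards", exist_ok=True)
for shard_idx in range(0, len(small_files), files_per_shard):
    batch = small_files[shard_idx:shard_idx + files_per_shard]
    # Concatenate the samples of this batch and write them out as one shard
    merged = np.concatenate([np.load(p) for p in batch], axis=0)
    np.save(f"shards/shard_{shard_idx // files_per_shard:04d}.npy", merged)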

bigquery.Client().extract_table() does not (always) divide a big table into small CSV files

My Python application needs to export BigQuery tables into small CSV files in GCS (e.g., smaller than 1 GB).
I referred to the document, and wrote the following code:
from google.cloud import bigquery
bigquery.Client().extract_table('my_project.my_dataset.my_5GB_table',
                                destination_uris='gs://my-bucket/*.csv')
The size of my_5GB_table is approximately 5GB.
But it results in a single 10GB CSV file in GCS.
I tried other tables of various sizes; some were split into files of about 200 MB, while others also ended up as a single huge file.
The documentation reads as if tables are always divided into 1 GB files, but now I don't know by what rules the files are divided.
Q1: How can I make sure that tables are always divided into files smaller than 1 GB?
Q2: Is there a way to specify the size of the files into which tables are divided?
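
For reference, a fuller version of the export call, with an explicit job config and waiting for the job to finish, looks roughly like the sketch below (the project, dataset, and bucket names are placeholders). As far as I know this does not let you choose an exact output file size; the wildcard URI only allows the service to shard the export:

from google.cloud import bigquery

client = bigquery.Client()
job_config = bigquery.ExtractJobConfig(destination_format="CSV")

# The wildcard in the URI lets BigQuery split the export into several files,
# but the sharding itself is decided by the service, not by this code.
extract_job = client.extract_table(
    "my_project.my_dataset.my_5GB_table",
    "gs://my-bucket/export-*.csv",
    job_config=job_config,
)
extract_job.result()  # block until the export job completes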

how to have indexed files in storage in python

I have a huge dataset of images and I am processing them one by one.
All the images are stored in a folder.
My Approach:
I have tried reading all the filenames into memory, and whenever a call for a certain index comes in, I load the corresponding image.
The problem is that it is not even possible to keep the paths and file names in memory because the dataset is so huge.
Is it possible to have an indexed file in storage from which I can read the file name at a certain index?
Thanks a lot.
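
One simple way to get an "indexed file" like this, sketched below under the assumption that every filename fits into a fixed-width record, is to store each name padded to a constant length; the name at index i can then be fetched with a single seek instead of keeping the whole list in RAM:

RECORD = 256  # assumed maximum filename length in bytes

def build_index(filenames, index_path="filenames.idx"):
    # Write every filename as a fixed-width, zero-padded record.
    with open(index_path, "wb") as f:
        for name in filenames:
            f.write(name.encode("utf-8").ljust(RECORD, b"\0"))

def name_at(i, index_path="filenames.idx"):
    # Seek directly to record i and strip the padding.
    with open(index_path, "rb") as f:
        f.seek(i * RECORD)
        return f.read(RECORD).rstrip(b"\0").decode("utf-8")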

Can I use ffmpeg to output jpgs to a numpy array in python without writing the files to disk etc?

I have to read thousands of images into memory. This has to be done. When I extract frames from a video using ffmpeg, the 14400 files take 92 MB of disk space and are in JPG format. When I read those images in Python and append them to a Python list using libraries like OpenCV, SciPy, etc., the same 14400 files take 2.5 to 3 GB. I guess the decoding is the reason? Any thoughts on this would be helpful.
You are exactly right: JPEG images are compressed (lossy compression, even; PNG would be a lossless format), and JPEG files are much smaller than the same data in uncompressed form.
When you load the images to memory, they are in uncompressed form, and having several GB of data with 14400 images is not surprising.
Basically, my advice is don't do that. Load them one at a time (or in batches), process them, then load the next images. If you load everything to memory beforehand, there will be a point when you run out of memory.
I'm doing a lot of image processing, and I have trouble imagining a case where it is necessary to have that many images loaded at once.
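
If the frames really have to come from ffmpeg without intermediate JPG files, one common pattern (a sketch, assuming a known frame size and an input called input.mp4) is to have ffmpeg write raw frames to a pipe and decode them one at a time with NumPy, processing each frame before reading the next instead of collecting them all in a list:

import subprocess
import numpy as np

width, height = 1280, 720  # assumed frame size; query it with ffprobe in practice

# ffmpeg decodes the video and writes raw BGR frames to stdout, so no JPGs hit the disk.
proc = subprocess.Popen(
    ["ffmpeg", "-i", "input.mp4", "-f", "rawvideo", "-pix_fmt", "bgr24", "pipe:1"],
    stdout=subprocess.PIPE,
)

frame_size = width * height * 3
while True:
    raw = proc.stdout.read(frame_size)
    if len(raw) < frame_size:
        break
    frame = np.frombuffer(raw, dtype=np.uint8).reshape((height, width, 3))
    # process `frame` here, then let it go out of scope instead of appending it to a list

proc.wait()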

how to append to a pickle file in batches

So I have this large dataset of files, and I created a program to put them into a pickle file, but I only have 2 GB of RAM, so I can't hold the entire dataset in an array. How can I append the array in multiple batches ("stuff data into the array, append to the pickle file, clear the array, repeat")?
thanks
Actually, I don't think it's possible to append data to a pickle file, and even if it were, you would run into memory issues when trying to read the pickle file back.
Pickle files are not designed for large data storage, so it might be worth switching to another file format.
You could go with text-based formats like CSV or JSON, or binary formats like HDF5, which is specifically optimized for large amounts of numerical data.
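
As a sketch of the HDF5 route with h5py (the dataset name, column count, and the batch generator below are made up for illustration), you can create a resizable dataset and grow it one batch at a time, so only the current batch has to fit in RAM:

import h5py
import numpy as np

n_cols = 128  # assumed number of features per sample

def load_batches():
    # Placeholder generator: replace with code that yields your real batches.
    for _ in range(3):
        yield np.random.rand(1000, n_cols).astype("float32")

with h5py.File("dataset.h5", "w") as f:
    dset = f.create_dataset(
        "data", shape=(0, n_cols), maxshape=(None, n_cols),
        dtype="float32", chunks=True,
    )
    for batch in load_batches():
        old = dset.shape[0]
        dset.resize(old + batch.shape[0], axis=0)  # grow along the first axis
        dset[old:] = batch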
