"We have the 2015 Yellow Cab NYC Taxi data as 12 CSV files on S3... This data is about 20GB on disk or 60GB in RAM."
I came across this observation while trying out Dask, a Python framework for handling out-of-memory datasets.
Can someone explain why there is a 3x difference? I'd imagine it has to do with Python objects, but I'm not 100% sure.
Thanks!
You are reading from a CSV on disk into a structured data frame object in memory. The two things are not at all analogous. The CSV data on disk is a single string of text. The data in memory is a complex data structure, with multiple data types, internal pointers, etc.
The CSV itself is not taking up any RAM. There is a complex data structure that is taking up RAM, and it was populated using data sourced from the CSV on disk. This is not at all the same thing.
To illustrate the difference, you could try reading the CSV into an actual single string variable and seeing how much memory that consumes. In this case, it would effectively be a single CSV string in memory:
with open('data.csv', 'r') as csvFile:
    data = csvFile.read()
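For a rough side-by-side comparison (a minimal sketch; pandas is assumed to be what ultimately parses the file, and 'data.csv' is just the placeholder name from above):

import sys
import pandas as pd

with open('data.csv', 'r') as csv_file:
    raw = csv_file.read()
df = pd.read_csv('data.csv')

# The raw string is roughly the size of the file on disk; the DataFrame is
# usually several times larger, because object (string) columns hold one
# Python str object per cell, each with its own header and pointer.
print(f"raw string: {sys.getsizeof(raw) / 1e6:.1f} MB")
print(f"DataFrame:  {df.memory_usage(deep=True).sum() / 1e6:.1f} MB")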
I understand that generators in Python can help with reading and processing large files when specific transformations or outputs are needed (e.g. reading a specific column or computing an aggregation).
However, it's not clear to me whether there is any benefit to using generators when the only purpose is to read the entire file.
Edit: Assuming your dataset fits in memory.
Lazy Method for Reading Big File in Python?
pd.read_csv('sample_file.csv', chunksize=chunksize)
vs.
pd.read_csv('sample_file.csv')
Are generators useful just to read the entire data without any data processing?
The DataFrame you get from pd.read_csv('sample_file.csv') might fit into memory; however, pd.read_csv itself is a memory-intensive function, so while reading a file that will end up consuming 10 gigabytes of memory, your peak memory usage may exceed 30-40 gigabytes. In cases like this, reading the file in smaller chunks might be the only option.
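A minimal sketch of the chunked pattern (the per-chunk work here is just a row count, as a stand-in for whatever processing you actually need):

import pandas as pd

chunksize = 1_000_000  # rows per chunk; tune this to your available memory
total_rows = 0

# Only one chunk is parsed and held in memory at a time
for chunk in pd.read_csv('sample_file.csv', chunksize=chunksize):
    total_rows += len(chunk)

print(total_rows, "rows processed")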
I have approximately 60,000 small CSV files, varying in size from 1 MB to several hundred MB, that I would like to convert into a single Parquet file. The total size of all the CSVs is around 1.3 TB, which is larger than the memory of the server I am using (678 GB available).
Since all the CSVs have the same fields, I've concatenated them into a single large file and tried to process it with Dask:
ddf = dd.read_csv("large.csv", blocksize="1G").to_parquet("large.pqt")
My understanding was that the blocksize option would prevent Dask from running out of memory when the job was split over multiple workers.
What happens is that eventually Dask does run out of memory and I get a bunch of messages like:
distributed.nanny - WARNING - Restarting worker
Is my approach completely wrong or am I just missing an important detail?
You don't have to concatenate all of your files into one large file. dd.read_csv is happy to accept a list of filenames, or a string with a "*" in it.
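For example, something along these lines (a sketch assuming the files share a schema and sit under a hypothetical data/ directory; the blocksize value is just a conservative guess):

import dask.dataframe as dd

# Each file (or block of a larger file) becomes its own partition,
# so there is no need to concatenate anything up front
ddf = dd.read_csv("data/*.csv", blocksize="256MB")
ddf.to_parquet("large_parquet/")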
If you have text data in your CSV file, then loading it into pandas or dask dataframes can expand the amount of memory used considerably, so your 1GB chunks might be quite a bit bigger than you expect. Do things work if you use a smaller chunk size? You might want to consult this doc entry: https://docs.dask.org/en/latest/best-practices.html#avoid-very-large-partitions
In general I recommend using Dask's dashboard to watch the computation, and see what is taking up your memory. This might help you find a good solution. https://docs.dask.org/en/latest/diagnostics-distributed.html
I am trying to read the Expedia data from Kaggle, which contains a 4 GB CSV file. I tried reading it with pd.read_csv('filename') and got a memory error. As a second approach I tried reading only particular columns, using the code:
pd.read_csv('train.csv', dtype={'date_time':np.str, 'user_location_country':np.int32, 'user_location_region':np.int32, 'user_location_city':np.int32, 'orig_destination_distance':np.float64, 'user_id':np.int32})
This again gives a memory error, but another modification of the same method:
train = pd.read_csv('train.csv', dtype={'user_id':np.int32, 'is_booking':bool, 'srch_destination_id':np.int32, 'hotel_cluster':np.int32}, usecols=['date_time', 'user_id', 'srch_ci', 'srch_co', 'srch_destination_id', 'is_booking', 'hotel_cluster'])
reads the data in about 5 minutes.
My problem is that I want to read more columns, but both methods fail with a memory error. I am using 8 GB of RAM with 8 GB of swap space, and reading only 7-8 of the 24 columns should reduce the data to around 800 MB, so hardware should not be the limitation.
I also tried reading in chunks, which I would prefer not to do because of the algorithms I plan to run on the data later.
Unfortunately, reading a CSV file requires more memory than its size on disk (though I do not know exactly how much more).
You can find an alternative way to process your file here
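One common way to cut the footprint (not necessarily the approach behind the link above, just a sketch reusing the column names from the question, and assuming the integer columns contain no missing values) is to combine usecols with narrower dtypes:

import numpy as np
import pandas as pd

train = pd.read_csv(
    'train.csv',
    usecols=['date_time', 'user_id', 'user_location_country',
             'user_location_region', 'user_location_city',
             'orig_destination_distance', 'is_booking', 'hotel_cluster'],
    dtype={'user_id': np.int32,
           'user_location_country': np.int32,
           'user_location_region': np.int32,
           'user_location_city': np.int32,
           # float32 halves the float64 footprint and still allows NaN
           'orig_destination_distance': np.float32,
           'is_booking': bool,
           'hotel_cluster': np.int32},
    parse_dates=['date_time'],
)

print(train.memory_usage(deep=True).sum() / 1e6, "MB")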
So I have this large dataset of files and I created a program to put them into a pickle file, but I only have 2 GB of RAM, so I can't hold the entire dataset in an array. How can I append the array in multiple batches ("stuff data into the array, append to the pickle file, clear the array, repeat")?
Thanks
Actually, I don't think it's possible to append data to a pickle file, and even if it were, you would run into memory issues when trying to read it back.
Pickle files are not designed for large data storage, so it might be worth switching to another file format.
You could go with text-based formats like CSV, JSON, ... or binary formats like HDF5, which is specifically optimized for large amounts of numerical data.
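For example, a sketch of the batch-append pattern using pandas' HDFStore (which requires the PyTables package; make_batches is a made-up stand-in for however you currently fill your array):

import pandas as pd

def make_batches():
    # Hypothetical generator yielding one manageable DataFrame at a time
    for i in range(100):
        yield pd.DataFrame({'x': range(i * 10, (i + 1) * 10)})

with pd.HDFStore('data.h5') as store:
    for batch in make_batches():
        # format='table' supports appending; each batch is written to disk
        # and then discarded, so memory use stays at roughly one batch
        store.append('dataset', batch, format='table')

df = pd.read_hdf('data.h5', 'dataset')  # read it all back later, if it fits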
I have a dictionary pickled on disk with a size of ~780 MB (on disk). However, when I load that dictionary into memory, its size swells unexpectedly to around 6 GB. Is there any way to keep the in-memory size close to the actual file size? (It would be fine if it took around 1 GB in memory, but 6 GB is strange behavior.) Is there a problem with the pickle module, or should I save the dictionary in some other format?
Here is how I am loading the file:
import pickle
with open('py_dict.pickle', 'rb') as file:
    py_dict = pickle.load(file)
Any ideas, help, would be greatly appreciated.
If you're using pickle just for storing large values in a dictionary, or a very large number of keys, you should consider using shelve instead.
import shelve

s = shelve.open('shelve.bin')
s['a'] = 'value'
s.close()
This loads each key/value pair only as needed, keeping the rest on disk.
Use SQL to store all the data in a database and run efficient queries to access it.
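A minimal sketch with the standard-library sqlite3 module (the table name and keys are made up for illustration); lookups become indexed queries, so only the rows you ask for are pulled into memory:

import sqlite3

conn = sqlite3.connect('data.db')
conn.execute("CREATE TABLE IF NOT EXISTS kv (key TEXT PRIMARY KEY, value TEXT)")

# Insert in batches instead of building one giant in-memory dict
items = {'a': 'value1', 'b': 'value2'}  # hypothetical batch of data
conn.executemany("INSERT OR REPLACE INTO kv VALUES (?, ?)", items.items())
conn.commit()

# Fetch only what you need
row = conn.execute("SELECT value FROM kv WHERE key = ?", ('a',)).fetchone()
print(row[0])
conn.close()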