I need to write and read a huge pandas DataFrame. I am using the pickle format right now:
.to_pickle to write the DataFrame to a pickle file
read_pickle to read the pickle file.
I have a couple of issues when the pickle file is huge (2 GB in this case):
Read speed is very slow (23 seconds to read the data).
Increasing RAM/cores in the VM is not improving the speed.
How can I read it faster? Can I use some other format which is much faster?
Can I leverage parallel processing/more cores to read it faster?
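For reference, a minimal sketch of the current pickle round trip next to a columnar alternative such as Parquet (file names are placeholders, and the Parquet calls assume pyarrow or fastparquet is installed):
import pandas as pd
# Current approach: pickle round trip
df.to_pickle("data.pkl")
df = pd.read_pickle("data.pkl")
# Possible alternative: a columnar format such as Parquet
# (assumes pyarrow or fastparquet is installed; df is the existing DataFrame)
df.to_parquet("data.parquet")
df = pd.read_parquet("data.parquet")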
Related
I understand that generators in Python can help when reading and processing large files if specific transformations or outputs are needed from the file (e.g. reading a specific column or computing an aggregation).
However, it's not clear to me whether there is any benefit to using generators in Python when the only purpose is to read the entire file.
Edit: Assuming your dataset fits in memory.
Lazy Method for Reading Big File in Python?
pd.read_csv('sample_file.csv', chunksize=chunksize)
vs.
pd.read_csv('sample_file.csv')
Are generators useful for just reading the entire dataset, without any data processing?
The DataFrame you get from pd.read_csv('sample_file.csv') might fit into memory; however, pd.read_csv itself is a memory-intensive function, so while reading a file that will end up consuming 10 gigabytes of memory, your peak memory usage may exceed 30-40 gigabytes. In cases like this, reading the file in smaller chunks might be the only option.
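For illustration, a minimal sketch of the chunked variant (the file name and chunk size are placeholders); the parsing overhead is paid per chunk rather than for the whole file at once:
import pandas as pd
# Read the file in chunks instead of all at once; each chunk is a regular DataFrame.
chunks = []
for chunk in pd.read_csv('sample_file.csv', chunksize=100_000):
    chunks.append(chunk)  # or process each chunk and keep only the result
df = pd.concat(chunks, ignore_index=True)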
When writing to a Windows network drive with the Pandas to_csv() function, the write operation is considerably slower than when writing to a local disk. This is obviously partly a function of network latency, but I find that if I write the data to a StringIO object first and then write the StringIO object's contents to the network drive, it is considerably faster than calling to_csv directly with the network path, i.e.
from io import StringIO
# Slow
df.to_csv("/network/drive/test.csv")
# Fast
buf = StringIO()
df.to_csv(buf)
with open("/network/drive/test.csv", "w") as fh: fh.write(buf.getvalue())
I likewise find that when using the fwrite() function from the R data.table package there is a much smaller difference in write time between the local and network drives.
Given that I need to frequently write to a network disk I am considering making use of the "fast" method above using StringIO, but I am curious if there isn't some option I am overlooking in to_csv() that will get the same result?
I have a few files. The big one is ~87 million rows; the others are ~500K rows. Part of what I am doing is joining them, and when I try to do it with Pandas I get memory issues, so I have been using Dask. It is super fast to do all the joins/applies, but then it takes 5 hours to write out to a csv, even though I know the resulting dataframe is only 26 rows.
I've read that some joins/applies are not the best fit for Dask, but does that mean it is slower using Dask? Mine have been very quick: it takes seconds to do all of my computations/manipulations on the millions of rows, but it takes forever to write out. Any ideas how to speed this up, or why this is happening?
You can use Dask's parallel processing, or try writing to a Parquet file instead of CSV, since Parquet writes are very fast with Dask.
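A minimal sketch of the Parquet route, assuming df is the Dask DataFrame produced by the joins/applies and that pyarrow is installed (the paths are placeholders):
# df is assumed to be a Dask DataFrame; to_parquet writes one file per partition in parallel.
df.to_parquet("output_parquet/", engine="pyarrow")
# If the final result really is tiny (e.g. 26 rows), it can also be collected
# into pandas first and written with ordinary pandas I/O:
df.compute().to_csv("result.csv", index=False)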
Dask uses lazy evaluation. This means that when you perform the operations, you are actually only building the processing graph.
Once you try to write your data to a csv file, Dask starts performing the operations.
That is why it takes 5 hours: at that point Dask actually has to process all of the data.
See https://tutorial.dask.org/01x_lazy.html for more information on the topic.
One way to speed up the processing would be to increase the parallelism by using a machine with more resources.
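As a small illustration of the lazy-evaluation point, a sketch with placeholder file and column names; nothing is computed until the write is triggered:
import dask.dataframe as dd
# Building the graph: these calls return immediately and do no real work.
big = dd.read_csv("big_*.csv")
small = dd.read_csv("small.csv")
joined = big.merge(small, on="key")  # 'key' is a placeholder join column
# The expensive part happens here: writing forces the whole graph to execute.
joined.to_csv("out-*.csv", index=False)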
I am trying to read the Expedia data from Kaggle, which contains a 4 GB csv file. I tried reading it using pd.read_csv('filename') and got a memory error. As a second approach I tried reading only particular columns, using the code:
pd.read_csv('train.csv', dtype={'date_time': str, 'user_location_country': np.int32, 'user_location_region': np.int32, 'user_location_city': np.int32, 'orig_destination_distance': np.float64, 'user_id': np.int32})
This again gives me a memory error, but another modification of the same method:
train = pd.read_csv('train.csv', dtype={'user_id': np.int32, 'is_booking': bool, 'srch_destination_id': np.int32, 'hotel_cluster': np.int32}, usecols=['date_time', 'user_id', 'srch_ci', 'srch_co', 'srch_destination_id', 'is_booking', 'hotel_cluster'])
reads the data in about 5 minutes.
My problem is that I want to read more columns, but both methods fail with a memory error. I am using 8 GB of RAM with 8 GB of swap space, and reading only 7-8 of the 24 columns in the data should reduce the data size to around 800 MB, so the hardware should not be the limitation.
I also tried reading in chunks, but I would prefer to avoid that because of the algorithms I am going to apply to the data later.
Unfortunately, reading a csv file requires more memory than its size on disk (I do not know exactly how much more, though).
You can find an alternative way to process your file here.
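As an illustration of that kind of alternative, a minimal sketch combining usecols/dtype with chunked reading, so that only one chunk plus a running result is in memory at any time (the column names come from the question; the aggregation and chunk size are placeholders):
import numpy as np
import pandas as pd
# Process the file chunk by chunk; only the running aggregate is kept in memory.
counts = None
for chunk in pd.read_csv('train.csv',
                         usecols=['user_id', 'is_booking', 'hotel_cluster'],
                         dtype={'user_id': np.int32, 'is_booking': bool, 'hotel_cluster': np.int32},
                         chunksize=1_000_000):
    part = chunk.groupby('hotel_cluster')['is_booking'].sum()
    counts = part if counts is None else counts.add(part, fill_value=0)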
I am new to Python and data science, and I am wondering what would be the best way to handle my csv file.
I have a csv with 50,000 rows and 2,000 columns, about 30,000 KB.
So far my Python program does not take long to read it, but I am concerned about consuming too much memory and making my program slow.
Currently I am reading the file with pandas:
pd.read_csv( tf.gfile.Open(pathA), sep=None, skipinitialspace=True, engine="python")
My questions are:
Should I implement optimization techniques, or is my csv not big enough for that?
What kind of techniques should I use?
I read that I can read in batches like this: with open(filename, 'rb') as f ...
Should I read in batches and keep the data in memory, or
should I always read from the file and avoid keeping the data in memory?
I appreciate your answers =)
If the read times are OK for you, then I wouldn't worry about premature optimisation.
There are some built-in parameters you could try within the read_csv method: chunksize, iterator or low_memory.
However, I personally don't think that file size is overly large. I've dealt with reading files of hundreds of thousands of rows on a 2015 MacBook.
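For reference, a minimal sketch of the iterator/chunksize parameters mentioned above, reusing pathA from the question (the chunk size is a placeholder):
import pandas as pd
# iterator=True returns a TextFileReader that hands back rows on demand.
reader = pd.read_csv(pathA, iterator=True)
first_rows = reader.get_chunk(10_000)  # read only the first 10,000 rows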
To add to #phil-sheard's answer: reading in chunks with your own Python code makes the process more reliant on slow Python loops, while read_csv is implemented in C and is much quicker.
If you do want to optimize, it's probably through the settings of Pandas read_csv, which is already much more efficient than anything you could build in surrounding Python code.
Also, don't optimize if it isn't necessary; you seem to be dealing with a very small table. There's no reason to micro-optimize here.