Import a large dataset (4 GB) in Python using pandas

I'm trying to import a large (approximately 4 GB) CSV dataset into Python using the pandas library. Of course the dataset cannot fit in memory all at once, so I used chunks of size 10000 to read the CSV.
After this I want to concatenate all the chunks into a single dataframe in order to perform some calculations, but I run out of memory (I use a desktop with 16 GB of RAM).
My code so far:
import pandas as pd

# Reading the csv in chunks
chunks = pd.read_csv("path_to_csv", iterator=True, chunksize=10000)
# Concatenating the chunks (I tried both of these)
pd.concat([chunk for chunk in chunks])
pd.concat(chunks, ignore_index=True)
I searched many threads on Stack Overflow, and all of them suggest one of these solutions. Is there a way to overcome this? I can't believe I can't handle a 4 GB dataset with 16 GB of RAM!
UPDATE: I still haven't come up with a solution to import the CSV file directly. I bypassed the problem by importing the data into PostgreSQL and then querying the database.

I once dealt with this kind of situation using a generator in Python. I hope this will be helpful:
def read_big_file_in_chunks(file_object, chunk_size=1024):
    """Read a whole big file piece by piece."""
    while True:
        data = file_object.read(chunk_size)
        if not data:
            break
        yield data

f = open('very_very_big_file.log')
for chunk in read_big_file_in_chunks(f):
    process_data(chunk)
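Note that the generator above yields raw text blocks rather than DataFrames. If you just need per-chunk processing in pandas, a minimal sketch (the column name and the aggregation are only placeholders) is to work on each chunk from read_csv as it arrives and never concatenate them:
import pandas as pd

# Aggregate chunk by chunk so the full 4 GB never sits in memory at once.
total = 0.0
rows = 0
for chunk in pd.read_csv("path_to_csv", chunksize=10000):
    total += chunk["some_numeric_column"].sum()  # placeholder column name
    rows += len(chunk)
print("mean of some_numeric_column:", total / rows)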

Related

Reading json file into pandas dataframe is very slow

I have a JSON file of size less than 1 GB. I am trying to read the file on a server that has 400 GB of RAM using the following simple command:
df = pd.read_json('filepath.json')
However this code is taking forever (several hours) to execute. I tried several suggestions such as
df = pd.read_json('filepath.json', low_memory=False)
or
df = pd.read_json('filepath.json', lines=True)
But none of them worked. How come reading a 1 GB file on a server with 400 GB of RAM is so slow?
Chunking can shrink memory use. I also recommend the Dask library, which can load the data in parallel.
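As a rough sketch of the chunking idea, and assuming the file can be stored as line-delimited JSON (one record per line; the 'filepath.jsonl' name is just a placeholder), pandas can iterate over it instead of parsing everything in one go:
import pandas as pd

# With lines=True and chunksize, read_json returns an iterator of DataFrames.
pieces = []
for chunk in pd.read_json('filepath.jsonl', lines=True, chunksize=100000):
    pieces.append(chunk)  # or process each chunk and discard it
df = pd.concat(pieces, ignore_index=True)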

how to share pandas dataframe between processes to save memory

I have a large CSV file (around 10 GB).
I use different IPython notebooks to analyse it (using pd.read_csv() to load the file into a dataframe in each notebook).
My problem is, every time I read the file, 10 GB of memory is used.
I am wondering if there is a way to share dataframe data between processes so that I can optimize my memory usage.
An ideal solution would be like this:
In my server file:
def InitData():
    df = pd.read_csv("my.csv")
    share(df)  # some hypothetical sharing mechanism
In the other notebook files:
def loadingData():
    df = LoadingSharedData()
    result = df.sum()  # something like this
No matter how many notebooks I create, there would be only one copy of the dataframe in memory.
Using pickle is fast and efficient, provided you are confident that nobody can tamper with the pickled files; see the security considerations in the pickle documentation.
import pickle

with open('filename.pickle', 'wb') as file:
    pickle.dump(df, file)

with open('filename.pickle', 'rb') as file:
    df_test = pickle.load(file)

print(df.equals(df_test))

Concatenating huge dataframes with pandas

I have sensor data recorded over a timespan of one year. The data is stored in twelve chunks, with 1000 columns and ~1000000 rows each. I have worked out a script to concatenate these chunks into one large file, but about halfway through the execution I get a MemoryError. (I am running this on a machine with ~70 GB of usable RAM.)
import gc
from os import listdir
import pandas as pd

path = "/slices02/hdf/"
slices = listdir(path)
res = pd.DataFrame()
for sl in slices:
    temp = pd.read_hdf(path + f"{sl}")
    res = pd.concat([res, temp], sort=False, axis=1)
    del temp
    gc.collect()
res.fillna(method="ffill", inplace=True)
res.to_hdf(path + "sensor_data_cpl.hdf", "online", mode="w")
I have also tried to fiddle with HDFStore so I do not have to load all the data into memory (see Merging two tables with millions of rows in Python), but I could not figure out how that works in my case.
When you read a CSV into a pandas DataFrame, the process can end up using up to twice the memory actually needed (because of type guessing and all the automatic things pandas tries to provide).
Several methods to fight that:
Use chunks. I see that your data is already split into chunks, but maybe those are too big, so you can read each file in chunks using the chunksize parameter of pandas.read_hdf or pandas.read_csv.
Provide dtypes to avoid type guessing and mixed types (for example, a column of strings with null values will end up with a mixed type); this works together with the low_memory parameter. A sketch of this is shown after this answer.
If this is not sufficient, you'll have to turn to distributed technologies like PySpark, Dask, Modin or even Pandarallel.
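As a minimal sketch of the dtype suggestion (the column names and types below are placeholders, not taken from your data), passing explicit dtypes lets pandas allocate compact columns up front instead of guessing:
import pandas as pd

# Hypothetical schema -- replace with the real column names and types.
dtypes = {
    "sensor_id": "int32",
    "value": "float32",
    "status": "category",
}
df = pd.read_csv("sensor_slice.csv", dtype=dtypes, low_memory=False)
print(df.dtypes)
print(df.memory_usage(deep=True).sum(), "bytes")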
When you have this much data, avoid creating temporary dataframes, as they take up memory too. Try doing it in one pass:
import os
import pandas as pd

folder = "/slices02/hdf/"
files = [os.path.join(folder, file) for file in os.listdir(folder)]
# use pd.read_hdf here instead if the slices are HDF files, as in the question
res = pd.concat((pd.read_csv(file) for file in files), sort=False)
See how this works for you.

How to extract and save in .csv chunks of data from a large .csv file iteratively using Python?

I am new to Python and I am attempting to read a large .csv file (with hundreds of thousands or possibly a few million rows, and about 15,000 columns) using pandas.
What I thought I could do is create and save each chunk to a new .csv file, iteratively across all chunks. I am currently using a laptop with relatively limited memory (about 4 GB, in the process of upgrading it), but I was wondering whether I could do this without changing my setup now. Alternatively, I could move this process to a PC with more RAM and attempt larger chunks, but I wanted to get this in place even for short row chunks.
I have seen that I can quickly process chunks of data (e.g. 10,000 rows and all columns) using the code below. But because I am a Python beginner, I have only managed to handle the first chunk. I would like to loop iteratively across the chunks and save each of them.
import pandas as pd
import os
print(os.getcwd())
print(os.listdir(os.getcwd()))
chunksize = 10000
data = pd.read_csv('ukb35190.csv', chunksize=chunksize)
df = data.get_chunk(chunksize)
print(df)
export_csv1 = df.to_csv(r'/home/user/PycharmProjects/PROJECT/export_csv_1.csv', index=None, header=True)
If you are not doing any processing on the data, you don't even have to store it in a variable; you can write each chunk out directly. The code below should help.
import pandas as pd

chunksize = 10000
batch_no = 1
for chunk in pd.read_csv(r'ukb35190.csv', chunksize=chunksize):
    chunk.to_csv(r'ukb35190.csv' + str(batch_no) + '.csv', index=False)
    batch_no += 1
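If you would rather end up with one combined output file instead of many, a small variation on the same loop (a sketch, not part of the original answer) is to append every chunk to a single CSV and only write the header for the first batch:
import pandas as pd

batch_no = 1
for chunk in pd.read_csv(r'ukb35190.csv', chunksize=10000):
    chunk.to_csv('ukb35190_combined.csv', mode='a',
                 header=(batch_no == 1), index=False)
    batch_no += 1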

Fastest way to read complex data structures from disk in Python

I have a dataset in CSV containing lists of values as strings in a single field that looks more or less like this:
Id,sequence
1,'1;0;2;6'
2,'0;1'
3,'1;0;9'
In the real dataset I'm dealing with, the sequence lengths vary greatly and can contain from one up to a few thousand observations. There are many columns containing sequences, all stored as strings.
I'm reading those CSVs and parsing the strings into lists nested inside a Pandas DataFrame. This takes some time, but I'm OK with it.
However, later, when I save the parsed results to pickle, the read time of this pickle file is very high.
I'm facing the following:
Reading a raw ~600 MB CSV file of such structure into Pandas takes around ~3 seconds.
Reading the same (raw, unprocessed) data from pickle takes ~0.1 second.
Reading the processed data from pickle takes 8 seconds!
I'm trying to find a way to read processed data from disk in the quickest possible way.
Already tried:
Experimenting with different storage formats, but most of them can't store nested structures. The only one that worked was msgpack, but that didn't improve the performance much.
Using structures other than a Pandas DataFrame (like a tuple of tuples) - similar performance.
I'm not very tied to the exact data structure. The thing is, I would like to quickly read the parsed data from disk directly into Python.
This might be a duplicate of this question
HDF5 is quite a bit quicker at handling nested pandas dataframes. I would give that a shot.
An example usage borrowed from here shows how you can chunk it efficiently when dumping:
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(1000, 2), columns=list('AB'))
df.to_hdf('test.h5', 'df', mode='w', format='table', data_columns=True)

store = pd.HDFStore('test.h5')
nrows = store.get_storer('df').nrows
chunksize = 100

for i in range(nrows // chunksize + 1):
    chunk = store.select('df',
                         start=i * chunksize,
                         stop=(i + 1) * chunksize)
    # process each chunk here instead of keeping it around
store.close()
When reading it back, you can do it in chunks like this, too:
for df in pd.read_hdf('raw_sample_storage2.h5', 'raw_sample_all',
                      start=0, stop=300000, chunksize=3000):
    print(df.info())
    print(df.head(5))
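Note that reading in chunks like this requires the data to have been written with format='table' (as in the dumping example above); the default fixed format does not support iterating with chunksize.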
