I have a folder (7.7GB) with multiple pandas dataframes stored in Parquet file format. I need to load all of these dataframes into a Python dictionary, but since I only have 32GB of RAM, I use the .loc method to load only the data I need.
Once all the dataframes are loaded into the Python dictionary, I create a common index from the indexes of all the dataframes, then I reindex every dataframe with the new index.
I wrote two scripts to do this: the first works in a classic sequential way, the second uses Dask in order to get some performance improvement from all the cores of my Threadripper 1920X.
Sequential code:
# Standard library imports
import os
import pathlib
import time

# Third party imports
import pandas as pd

# Local application imports


class DataProvider:

    def __init__(self):
        self.data = dict()

    def load_parquet(self, source_dir: str, timeframe_start: str, timeframe_end: str) -> None:
        t = time.perf_counter()
        symbol_list = [file for file in os.listdir(source_dir) if file.endswith('.parquet')]

        # updating containers
        for symbol in symbol_list:
            path = pathlib.Path.joinpath(pathlib.Path(source_dir), symbol)
            name = symbol.replace('.parquet', '')
            self.data[name] = pd.read_parquet(path).loc[timeframe_start:timeframe_end]
        print(f'Loaded data in {round(time.perf_counter() - t, 3)} seconds.')

        t = time.perf_counter()
        # building index
        index = None
        for symbol in self.data:
            if index is not None:
                # Index.union returns a new index, so the result must be assigned
                index = index.union(self.data[symbol].index)
            else:
                index = self.data[symbol].index
        print(f'Built index in {round(time.perf_counter() - t, 3)} seconds.')

        t = time.perf_counter()
        # reindexing data
        for symbol in self.data:
            self.data[symbol] = self.data[symbol].reindex(index=index, method='pad').itertuples()
        print(f'Reindexed data in {round(time.perf_counter() - t, 3)} seconds.')


if __name__ == '__main__' or __name__ == 'builtins':
    source = r'WindowsPath'
    x = DataProvider()
    x.load_parquet(source_dir=source, timeframe_start='2015', timeframe_end='2015')
Dask code:
# Standard library imports
import os
import pathlib
import time

# Third party imports
from dask.distributed import Client
import pandas as pd

# Local application imports


def __load_parquet__(directory, timeframe_start, timeframe_end):
    return pd.read_parquet(directory).loc[timeframe_start:timeframe_end]


def __reindex__(new_index, df):
    return df.reindex(index=new_index, method='pad').itertuples()


if __name__ == '__main__' or __name__ == 'builtins':
    client = Client()
    source = r'WindowsPath'
    start = '2015'
    end = '2015'

    t = time.perf_counter()
    file_list = [file for file in os.listdir(source) if file.endswith('.parquet')]

    # build data
    data = dict()
    for file in file_list:
        path = pathlib.Path.joinpath(pathlib.Path(source), file)
        symbol = file.replace('.parquet', '')
        data[symbol] = client.submit(__load_parquet__, path, start, end)
    print(f'Loaded data in {round(time.perf_counter() - t, 3)} seconds.')

    t = time.perf_counter()
    # build index
    index = None
    for symbol in data:
        if index is not None:
            # Index.union returns a new index, so the result must be assigned
            index = index.union(data[symbol].result().index)
        else:
            index = data[symbol].result().index
    print(f'Built index in {round(time.perf_counter() - t, 3)} seconds.')

    t = time.perf_counter()
    # reindex
    for symbol in data:
        data[symbol] = client.submit(__reindex__, index, data[symbol].result())
    print(f'Reindexed data in {round(time.perf_counter() - t, 3)} seconds.')
I found the results pretty weird.
Sequential code:
max memory consumption during computations: 30.2GB
memory consumption at the end of computations: 15.6GB
total memory consumption (without Windows and others): 11.6GB
Loaded data in 54.289 seconds.
Built index in 0.428 seconds.
Reindexed data in 9.666 seconds.
Dask code:
max memory consumption during computations: 25.2GB
memory consumption at the end of computations: 22.6GB
total memory consumption (without Windows and others): 18.9GB
Loaded data in 0.638 seconds.
Built index in 27.541 seconds.
Reindexed data in 30.179 seconds.
My questions:
Why with Dask the memory consumption at the end of computation is so much higher?
Why with Dask building the common index and reindexing all the dataframes takes so much time?
Also, when using the Dask code the console prints the following warning.
C:\Users\edit\Anaconda3\envs\edit\lib\site-packages\distributed\worker.py:901: UserWarning: Large object of size 5.41 MB detected in task graph:
(DatetimeIndex(['2015-01-02 09:30:00', '2015-01-02 ... s x 5 columns])
Consider scattering large objects ahead of time with client.scatter to reduce scheduler burden and keep data on workers
future = client.submit(func, big_data) # bad
big_future = client.scatter(big_data) # good
future = client.submit(func, big_future) # good
% (format_bytes(len(b)), s))
Even though the warning's suggestions are really good, I don't get what's wrong with my code. Why is it saying to keep data on workers? I thought that with the submit method I was sending all the data to my client, so the workers would have easy access to all the data. Thank you all for the help.
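For reference, this is roughly what the warning's scatter pattern would look like applied to the reindex loop above. It is only a sketch using the names from the Dask script, not a tested fix: the shared index is scattered to the workers once, and the dataframe futures themselves are passed to submit instead of calling .result() on the client.

# Sketch only: assumes client, data, index and __reindex__ exist as in the script above.
index_future = client.scatter(index, broadcast=True)  # ship the large index to the workers once
for symbol in data:
    # pass the future directly so the dataframe stays on the workers
    data[symbol] = client.submit(__reindex__, index_future, data[symbol])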
I am not an expert at all, just trying to help.
You might want to try not using time.perf_counter and see if that changes anything.
Related
Where I work, I've been transitioning away from SPSS for syntax writing/data management to Python and pandas for a higher level of functionality and programming. The issue is that reading SPSS files into pandas is SO slow. I work with bigger datasets (1 million or more rows, often with 100+ columns). There are some pretty cool plugins out there to speed up processing CSV files, such as Dask and Modin, but I don't think these work with SPSS files. I'd like to continue using pandas, but I have to stick with the SPSS file format (it's what everyone else where I work uses).
Are there any tips on how to accomplish faster data processing, outside of computer upgrades and/or file chunking?
You can try to parallelize reading your file:
As an example I have a file "big.sav" which is 294000 rows x 666 columns. Reading the file with pyreadstat.read_sav (which is what pd.read_spss uses in the background) takes 115 seconds. By parallelizing it I get 29 seconds:
First I create a file worker.py:
def worker(inpt):
    import pyreadstat
    offset, chunksize, path = inpt
    df, meta = pyreadstat.read_sav(path, row_offset=offset, row_limit=chunksize)
    return df
and then in the main script I have this:
import multiprocessing as mp
from time import time
import pandas as pd
import pyreadstat
from worker import worker
# calculate the number of rows in the file
_, meta = pyreadstat.read_sav("big.sav", metadataonly=True)
numrows = meta.number_rows
# calculate number of cores in the machine, this could also be set manually to some number, i.e. 8
numcores = mp.cpu_count()
# calculate the chunksize and offsets
divs = [numrows // numcores + (1 if x < numrows % numcores else 0) for x in range (numcores) ]
chunksize = divs[0]
offsets = [indx*chunksize for indx in range(numcores)]
# pack the data for the jobs
jobs = [(x, chunksize, "big.sav") for x in offsets]
pool = mp.Pool(processes=numcores)
# let's go!
t0=time()
chunks = pool.map(worker, jobs)
t1=time()
print(t1-t0) # this prints 29 seconds
# chunks is a list of dataframes in the right order
# you can concatenate all the chunks into a single big dataframe if you like
final = pd.concat(chunks, axis=0, ignore_index=True)
EDIT:
pyreadstat version 1.0.3 has had a big improvement in performance of about 5x.
In addition a new function "read_file_multiprocessing" has been added that is a wrapper around the previous code shared in this answer. It can give up to another 3x improvement, making (up to) a 15 times improvement compared to the previous version!
You can use the function like this:
import pyreadstat
fpath = "path/to/file.sav"
df, meta = pyreadstat.read_file_multiprocessing(pyreadstat.read_sav, fpath)
I am analysing HDF5-format data for scientific research purposes, using Python's h5py library.
The HDF file I want to read is very large: its file size is about 20GB, and the main part of its data is a 400000*10000 float matrix. I tried to read the data all at once, but my development environment, Spyder, was forcibly terminated because it ran out of memory. Is there any way to read the file partially and avoid this problem?
Use pd.read_hdf with the columns argument. See the example below:
import numpy as np
import pandas as pd
from contexttimer import Timer


def create_sample_df():
    with Timer() as t:
        df = pd.DataFrame(np.random.rand(100000, 5000))
        df.to_hdf('file.h5', 'df', format='table')
    print('create_sample_df: %.2fs' % t.elapsed)


def read_full_df():
    """ data is too large to read fully """
    with Timer() as t:
        df = pd.read_hdf('file.h5')
    print('read_full_df: %.2fs' % t.elapsed)


def read_df_with_start_stop():
    """ to quick look all columns """
    with Timer() as t:
        df = pd.read_hdf('file.h5', start=0, stop=5)
    print('read_df_with_start_stop: %.2fs' % t.elapsed)


def read_df_with_columns():
    """ to read dataframe (hdf5) with necessary columns """
    with Timer() as t:
        df = pd.read_hdf('file.h5', columns=list(range(4)))
    print('read_df_with_columns: %.2fs' % t.elapsed)


if __name__ == '__main__':
    create_sample_df()
    read_full_df()
    read_df_with_start_stop()
    read_df_with_columns()

# outputs:
# create_sample_df: 51.25s
# read_full_df: 5.21s
# read_df_with_start_stop: 0.03s
# read_df_with_columns: 4.44s
read_df_with_columns only reduces memory cost; it does not necessarily improve speed. And this is under the assumption that the HDF5 file was saved in table format (otherwise the columns argument cannot be applied).
You can slice h5py datasets like numpy arrays, so you could work on a number of subsets instead of the whole dataset (e.g. 4 100000*10000 subsets).
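For instance, a minimal sketch of that kind of sliced access, assuming a hypothetical file 'data.h5' containing a single dataset named 'matrix' (both names are placeholders for your actual file):

import h5py

with h5py.File('data.h5', 'r') as f:
    dset = f['matrix']                        # nothing is loaded into memory yet
    nrows = dset.shape[0]                     # e.g. 400000
    for start in range(0, nrows, 100000):
        block = dset[start:start + 100000, :]  # only this slice is read from disk
        # ... process the block, e.g. block.mean(), then let it be garbage collected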
datasets = {}
datasets['df1'] = df1
datasets['df2'] = df2
datasets['df3'] = df3
datasets['df4'] = df4


def prepare_dataframe(dataframe):
    return dataframe.apply(lambda x: x.astype(str).str.lower().str.replace(r'[^\w\s]', ''))


for key, value in datasets.items():
    datasets[key] = prepare_dataframe(value)
I need to prepare the data in some dataframes for further analysis. I would like to parallelize the for loop that updates the dictionary with a prepared dataframe. This code will eventually run on a machine with dozens of cores and thousands of dataframes. On my local machine I do not appear to be using more than a single core in the prepare_dataframe function.
I have looked at Numba and Joblib but I cannot find a way to work with dictionary values in either library.
Any insight would be very much appreciated!
You can use the multiprocessing library. You can read about its basics here.
Here is the code that does what you need:
from multiprocessing import Pool


def prepare_dataframe(dataframe):
    # do whatever you want here
    # changes made here are *not* global
    # return a modified version of what you want
    return dataframe


def worker(dict_item):
    key, value = dict_item
    return (key, prepare_dataframe(value))


def parallelize(data, func):
    data_list = list(data.items())
    pool = Pool()
    data = dict(pool.map(func, data_list))
    pool.close()
    pool.join()
    return data


datasets = parallelize(datasets, worker)
We have a dataset which has approx 1.5MM rows. I would like to process it in parallel. The main function of the code is to look up master information and enrich the 1.5MM rows. The master is a two-column dataset with roughly 25,000 rows. However, I am unable to make the multiprocessing work and test its scalability properly. Can someone please help? The cut-down version of the code is as follows:
import pandas
from multiprocessing import Pool


def work(data):
    mylist = []
    # Business Logic
    return mylist.append(data)


if __name__ == '__main__':
    data_df = pandas.read_csv('D:\\retail\\customer_sales_parallel.csv', header='infer')
    print('Source Data :', data_df)
    agents = 2
    chunksize = 2
    with Pool(processes=agents) as pool:
        result = pool.map(func=work, iterable=data_df, chunksize=20)
    pool.close()
    pool.join()
    print('Result :', result)
The work method will contain the business logic, and I would like to pass the partitioned data_df into work to enable parallel processing. The sample data is as follows:
CUSTOMER_ID,PRODUCT_ID,SALE_QTY
641996,115089,2
1078894,78144,1
1078894,121664,1
1078894,26467,1
457347,59359,2
1006860,36329,2
1006860,65237,2
1006860,121189,2
825486,78151,2
825486,78151,2
123445,115089,4
Ideally I would like to process 6 rows in each partition.
Please help.
Thanks and Regards
Bala
First, work is returning the output of mylist.append(data), which is None. I assume (and if not, I suggest) you want to return a processed Dataframe.
To distribute the load, you could use numpy.array_split to split the large Dataframe into a list of 6-row Dataframes, which are then processed by work.
import pandas
import math
import numpy as np
from multiprocessing import Pool


def work(data):
    # Business Logic
    return data  # Return it as a Dataframe


if __name__ == '__main__':
    data_df = pandas.read_csv('D:\\retail\\customer_sales_parallel.csv', header='infer')
    print('Source Data :', data_df)
    agents = 2
    rows_per_workload = 6
    num_loads = math.ceil(data_df.shape[0] / float(rows_per_workload))
    split_df = np.array_split(data_df, num_loads)  # A list of Dataframes
    with Pool(processes=agents) as pool:
        result = pool.map(func=work, iterable=split_df)
    result = pandas.concat(result)  # Stitch them back together
    pool.close()
    pool.join()
    print('Result :', result)
My best recommendation is to use the chunksize parameter of read_csv (Docs) and iterate over the chunks. This way you won't crash your RAM trying to load everything at once, and if you want you can, for example, use threads to speed up the process.
for i, chunk in enumerate(pd.read_csv('bigfile.csv', chunksize=500000)):
    pass  # process each chunk here
I'm not sure if this answers your specific question, but I hope it helps.
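For what it's worth, here is a minimal sketch of how the chunked read could be combined with the master lookup described in the question. The file names and the merge key are assumptions for illustration, not taken from the original code:

import pandas as pd

# hypothetical master table: ~25,000 rows keyed on PRODUCT_ID
master_df = pd.read_csv('master.csv')

enriched_chunks = []
for chunk in pd.read_csv('customer_sales_parallel.csv', chunksize=500000):
    # enrich each chunk against the small in-memory master table
    enriched_chunks.append(chunk.merge(master_df, on='PRODUCT_ID', how='left'))

result = pd.concat(enriched_chunks, ignore_index=True)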
I am reading in hundreds of HDF files and processing the data of each HDF separately. However, this takes an awful lot of time, since it works on one HDF file at a time. I just stumbled upon http://docs.python.org/library/multiprocessing.html and am now wondering how I can speed things up using multiprocessing.
So far, I came up with this:
import numpy as np
from multiprocessing import Pool


def myhdf(date):
    ii = dates.index(date)
    year = date[0:4]
    month = date[4:6]
    day = date[6:8]
    rootdir = 'data/mydata/'
    filename = 'no2track' + year + month + day
    records = read_my_hdf(rootdir, filename)
    if records.size:
        results[ii] = np.mean(records)


dates = ['20080105', '20080106', '20080107', '20080108', '20080109']
results = np.zeros(len(dates))

pool = Pool(len(dates))
pool.map(myhdf, dates)
However, this is obviously not correct. Can you follow my chain of thought about what I want to do? What do I need to change?
Try joblib for a friendlier multiprocessing wrapper:
from joblib import Parallel, delayed


def myhdf(date):
    # do work
    return np.mean(records)


results = Parallel(n_jobs=-1)(delayed(myhdf)(d) for d in dates)
The Pool class's map function is like the standard library's map function: you're guaranteed to get your results back in the order you put them in. Knowing that, the only other trick is that you need to return results in a consistent manner, and then filter them afterwards.
import numpy as np
from multiprocessing import Pool


def myhdf(date):
    year = date[0:4]
    month = date[4:6]
    day = date[6:8]
    rootdir = 'data/mydata/'
    filename = 'no2track' + year + month + day
    records = read_my_hdf(rootdir, filename)
    if records.size:
        return np.mean(records)


dates = ['20080105', '20080106', '20080107', '20080108', '20080109']

pool = Pool(len(dates))
results = pool.map(myhdf, dates)
results = [result for result in results if result]
results = np.array(results)
If you really do want results as soon as they are available, you can use imap_unordered.
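A minimal sketch of that variant, assuming the same myhdf and dates as above; note that results now arrive in completion order rather than submission order:

from multiprocessing import Pool

pool = Pool(len(dates))
results = []
for result in pool.imap_unordered(myhdf, dates):
    if result is not None:
        results.append(result)  # collected as each worker finishes
pool.close()
pool.join()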