Concatenating huge dataframes with pandas - python

I have sensor data recorded over a timespan of one year. The data is stored in twelve chunks, each with 1000 columns and ~1,000,000 rows. I have worked out a script to concatenate these chunks into one large file, but about halfway through the execution I get a MemoryError. (I am running this on a machine with ~70 GB of usable RAM.)
import gc
from os import listdir
import pandas as pd
path = "/slices02/hdf/"
slices = listdir(path)
res = pd.DataFrame()
for sl in slices:
    temp = pd.read_hdf(path + f"{sl}")
    res = pd.concat([res, temp], sort=False, axis=1)
    del temp
    gc.collect()
res.fillna(method="ffill", inplace=True)
res.to_hdf(path + "sensor_data_cpl.hdf", "online", mode="w")
I have also tried to fiddle with HDFStore so I do not have to load all the data into memory (see Merging two tables with millions of rows in Python), but I could not figure out how that works in my case.

When you read a csv into a pandas DataFrame, the process can take up to twice the memory of the final result (because of type guessing and all the automatic conversions pandas tries to provide).
Several methods to fight that:
Use chunks. Your data is already split into chunks, but those may be too big, so you can read each file in smaller pieces using the chunksize parameter of pandas.read_hdf or pandas.read_csv (see the sketch after this list).
Provide dtypes to avoid type guessing and mixed types (e.g. a column of strings containing null values will have a mixed type); this works together with the low_memory parameter of read_csv.
If this is not sufficient you'll have to turn to distributed technologies like pyspark, dask, modin or even pandarallel.
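A minimal sketch of the chunked-read-plus-explicit-dtypes idea, assuming CSV sources; the file name, dtype mapping, chunk size and output path are hypothetical, and the chunks are appended row-wise, which only applies if the pieces share the same columns:
import pandas as pd

# hypothetical schema; adjust to your actual columns
dtypes = {"sensor_id": "int32", "value": "float32"}

out_path = "combined.h5"
for chunk in pd.read_csv("slice_01.csv", dtype=dtypes, chunksize=100_000):
    # append each chunk to a table-format HDF file instead of holding everything in RAM
    chunk.to_hdf(out_path, "online", format="table", append=True)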

When you have this much data, avoid creating temporary dataframes, as they take up memory too. Try doing it in one pass:
import os

folder = "/slices02/hdf/"
files = [os.path.join(folder, file) for file in os.listdir(folder)]
res = pd.concat((pd.read_hdf(file) for file in files), sort=False)
See how this works for you.

Related

Python: merge large datasets and how to work with large data (500 Gb)

I have some large csv files which I need to merge together. Each file is about 5 GB and my RAM is only 8 GB. I use the following code to read the csv files into dataframes and merge them on the columns fund_ticker, TICKER and Date.
import numpy as np
import pandas as pd
# Read in data, ignore column "version"
table1 = pd.read_csv(r'C:\data\data1.csv', usecols=lambda col: col not in ["Version"])
table2 = pd.read_csv(r'C:\data\data2.csv', usecols=lambda col: col not in ["Version"])
weight = pd.read_csv(r'C:\data\data3.csv', usecols=lambda col: col not in ["Version"])
print("Finish reading")
# merge datasets
merged = table1.merge(table2, on=['fund_ticker', 'TICKER', 'Date']).merge(weight, on=['fund_ticker', 'TICKER', 'Date'])
Unfortunately, I got the following error:
numpy.core._exceptions.MemoryError: Unable to allocate 105. MiB for an array with shape (27632931,) and data type object
After searching on the internet, I think the issue is that the data is larger than my RAM. To overcome this, I am thinking about using a database such as SQL, or parquet files. My question is: what is the most efficient way to work with large datasets? My data is financial data and could grow to 500 GB or 1 TB. Some direction on how to set this up would be much appreciated. Thanks
There are a few options discussed on the 'Scaling to large datasets' page of the pandas User Guide.
The easiest drop-in replacement here would be to use dask.
It uses a subset of the pandas API, so it should feel familiar, and it allows working with dataframes that are larger than memory by only operating on chunks at a time.
However, that merge is likely to still be quite slow. (It would help somewhat to set the 'fund_ticker', 'TICKER', and 'Date' columns as the index of each dataframe first.) A sketch of the dask route follows.
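A minimal sketch of the dask approach, assuming the same file paths and key columns as in the question; the output directory is hypothetical:
import dask.dataframe as dd

# dask reads lazily and processes the files partition by partition
table1 = dd.read_csv(r'C:\data\data1.csv')
table2 = dd.read_csv(r'C:\data\data2.csv')
weight = dd.read_csv(r'C:\data\data3.csv')

keys = ['fund_ticker', 'TICKER', 'Date']
merged = table1.merge(table2, on=keys).merge(weight, on=keys)

# write the result to disk instead of materializing it in memory
merged.to_parquet(r'C:\data\merged_parquet')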

Pandas read_hdf is giving "can only use an iterator or chunksize on a table" error

I have an .h5 data file which includes a key rawreport.
I can read rawreport and save it as a dataframe using read_hdf(filename, "rawreport") without any problems. But the data has 17 million rows and I'd like to use chunking.
When I ran this code
chunksize = 10**6
someval = 100
df = pd.DataFrame()
for chunk in pd.read_hdf(filename, 'rawreport', chunksize=chunksize, where='datetime < someval'):
    df = pd.concat([df, chunk], ignore_index=True)
I get "TypeError: can only use an iterator or chunksize on a table"
What does it mean that the rawreport isn't a table and how could I overcome this issue? I'm not the person who created the h5 file.
Chunking is only possible if your file was written in the table format using PyTables. This must be specified when the file is first written:
df.to_hdf(filename, 'rawreport', format='table')
If this wasn't specified when the file was written, pandas defaults to the fixed format. The fixed format can be written and read quickly, but it means the entire dataframe must be read into memory, so chunking and the other read_hdf options for selecting particular rows or columns can't be used here. A possible workaround is sketched below.
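Since the question says the whole dataframe can be loaded once, one hedged workaround is to rewrite the file in table format and then read it back in chunks; the output file name and chunk size here are hypothetical:
import pandas as pd

# read the fixed-format file once (needs enough memory for one full load)
df = pd.read_hdf(filename, 'rawreport')

# rewrite it as a PyTables table; data_columns makes 'datetime' queryable via where=
df.to_hdf('rawreport_table.h5', 'rawreport', format='table', data_columns=['datetime'])

# chunked reads now work
for chunk in pd.read_hdf('rawreport_table.h5', 'rawreport', chunksize=10**6):
    ...  # process each chunk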

How to extract and save in .csv chunks of data from a large .csv file iteratively using Python?

I am new to Python and I am attempting to read a large .csv file (with hundreds of thousands or possibly a few million rows, and about 15,000 columns) using pandas.
What I thought I could do is create and save each chunk as a new .csv file, iteratively across all chunks. I am currently using a laptop with relatively limited memory (about 4 GB, in the process of upgrading it), but I was wondering whether I could do this without changing my setup now. Alternatively, I could transfer this process to a PC with more RAM and attempt larger chunks, but I wanted to get this in place even for smaller row chunks.
I have seen that I can quickly process chunks of data (e.g. 10,000 rows and all columns) using the code below. But being a Python beginner, I have only managed to handle the first chunk. I would like to loop iteratively across the chunks and save each of them.
import pandas as pd
import os
print(os.getcwd())
print(os.listdir(os.getcwd()))
chunksize = 10000
data = pd.read_csv('ukb35190.csv', chunksize=chunksize)
df = data.get_chunk(chunksize)
print(df)
export_csv1 = df.to_csv(r'/home/user/PycharmProjects/PROJECT/export_csv_1.csv', index=None, header=True)
If you are not doing any processing on the data, you don't even have to store it in a variable; you can write each chunk out directly. The code below should help.
import pandas as pd

chunksize = 10000
batch_no = 1
for chunk in pd.read_csv(r'ukb35190.csv', chunksize=chunksize):
    chunk.to_csv(r'ukb35190.csv' + str(batch_no) + '.csv', index=False)
    batch_no += 1

Looking for a way to overcome 'MemoryError' in Spyder while merging dataframes together

I am running the script below.
import numpy as np
import pandas as pd
# load all data to respective dataframes
orders = pd.read_csv('C:\\my_path\\orders.csv')
products = pd.read_csv('C:\\my_path\\products.csv')
order_products = pd.read_csv('C:\\my_path\\order_products.csv')
# check out data sets
print(orders.shape)
print(products.shape)
print(order_products.shape)
# merge different dataframes into one consolidated dataframe
df = pd.merge(order_products, products, on='product_id')
df = pd.merge(df, orders, on='order_id')
On the last line of merging the second data frame, I get this result:
out = np.empty(out_shape, dtype=dtype)
MemoryError
The file named 'order_products.csv' is around 550MB, 'orders.csv' is 100MB, and 'products.csv' is just 2MB. I have tried running this process a few times and I always get the MemoryError issue. It doesn't seem like the files are really, really massive, but I guess it's all relative, because on my old machine, it's just too much. Is there a simple way to read these files into dataframes in chunks and then merge these together in chunks?
I am working with Spyder 3.3.4, Python 3.7, and Windows 7 on an old ThinkPad.
Thanks.
Try using slicing and chunking. The error means you've reached the limit of what your computer's RAM can hold. For example, start with a slice of each dataframe:
orders_100 = orders[:100]
products_100 = products[:100]
order_products_100 = order_products[:100]
Then do pd.merge() on those slices. A chunked-merge sketch follows.
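A hedged sketch of merging in chunks, assuming products.csv and orders.csv fit in memory (they are 2 MB and 100 MB here) and only order_products.csv needs to be streamed; the chunk size and output path are hypothetical:
import pandas as pd

# the two smaller tables fit in memory
products = pd.read_csv('C:\\my_path\\products.csv')
orders = pd.read_csv('C:\\my_path\\orders.csv')

out_path = 'C:\\my_path\\merged.csv'
first = True
# stream the large table and merge one chunk at a time
for chunk in pd.read_csv('C:\\my_path\\order_products.csv', chunksize=500_000):
    merged = chunk.merge(products, on='product_id').merge(orders, on='order_id')
    # append each merged chunk to disk instead of holding the full result in RAM
    merged.to_csv(out_path, mode='w' if first else 'a', header=first, index=False)
    first = False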

Load multiple parquet files into dataframe for analysis

I have several .parquet files, each with shape (1126399, 503) and a size of 13 MB. As far as I know and from what I have read, this should be handled just fine on a local machine. I am trying to get them into a pandas dataframe to run some analysis, but I am having trouble doing so. Saving them to a CSV file is too costly as the files become extremely large, and loading them directly into several dataframes and then concatenating gives me memory errors. I've never worked with .parquet files and am not sure what the best path forward is or how to use the files to actually do some analysis with the data.
At first, I tried:
import pandas as pd
import pyarrow.parquet as pq
# This is repeated for all files
p0 = pq.read_table('part0.parquet')  # each part increases python's memory usage by ~14%
df0 = p0.to_pandas()  # each frame increases python's memory usage by an additional ~14%
# Concatenate all dataframes together
df = pd.concat([df0, df1, df2, df3, df4, df6, df7], ignore_index=True)
This was causing me to run out of memory. I am running on a system with 12 cores and 32GB of memory. I thought I'd be more efficient and tried looping through and deleting the files that were no longer needed:
import pandas as pd
# Loop through files and load into a dataframe
df = pd.read_parquet('part0.parquet', engine='pyarrow')
files = ['part1.parquet', 'part2.parquet', 'part3.parquet'] # in total there are 6 files
for file in files:
    data = pd.read_parquet(file)
    df = df.append(data, ignore_index=True)
    del data
Unfortunately, neither of these worked. Any and all help is greatly appreciated.
I opened https://issues.apache.org/jira/browse/ARROW-3424 about at least making a function in pyarrow that will load a collection of file paths as efficiently as possible. You can load them individually with pyarrow.parquet.read_table, concatenate the pyarrow.Table objects with pyarrow.concat_tables, then call Table.to_pandas to convert to a pandas.DataFrame. That will be much more efficient than concatenating with pandas. A sketch of that route follows.
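A minimal sketch of the pyarrow route, assuming the part files from the question (the exact list of file names is hypothetical):
import pyarrow as pa
import pyarrow.parquet as pq

files = ['part0.parquet', 'part1.parquet', 'part2.parquet', 'part3.parquet']

# read each file as an Arrow Table (cheaper than converting each one to pandas)
tables = [pq.read_table(f) for f in files]

# concatenate at the Arrow level, then convert to pandas once
combined = pa.concat_tables(tables)
df = combined.to_pandas()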
