I have several .parquet files, each with shape (1126399, 503) and a size of 13MB. As far as I know and from what I have read, this should be handled just fine on a local machine. I am trying to get them into a pandas dataframe to run some analysis, but I'm having trouble doing so. Saving them to a CSV file is too costly as the files become extremely large, and loading them directly into several dataframes and then concatenating gives me memory errors. I've never worked with .parquet files and I'm not sure what the best path forward is, or how to use the files to actually do some analysis with the data.
At first, I tried:
import pandas as pd
import pyarrow.parquet as pq
# This is repeated for all files
p0 = pq.read_table('part0.parquet')  # each part increases python's memory usage by ~14%
df0 = p0.to_pandas()                 # each frame increases python's memory usage by an additional ~14%
# Concatenate all dataframes together
df = pd.concat([df0, df1, df2, df3, df4, df5, df6, df7], ignore_index=True)
This was causing me to run out of memory. I am running on a system with 12 cores and 32GB of memory. I thought I'd be more efficient and tried looping through and deleting the files that were no longer needed:
import pandas as pd
# Loop through files and load into a dataframe
df = pd.read_parquet('part0.parquet', engine='pyarrow')
files = ['part1.parquet', 'part2.parquet', 'part3.parquet'] # in total there are 6 files
for file in files:
    data = pd.read_parquet(file, engine='pyarrow')
    df = df.append(data, ignore_index=True)
    del data
Unfortunately, neither of these worked. Any and all help is greatly appreciated.
I opened https://issues.apache.org/jira/browse/ARROW-3424 about at least adding a function to pyarrow that loads a collection of file paths as efficiently as possible. In the meantime, you can load the files individually with pyarrow.parquet.read_table, concatenate the pyarrow.Table objects with pyarrow.concat_tables, and then call Table.to_pandas to convert to a pandas.DataFrame. That will be much more efficient than concatenating with pandas.
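A minimal sketch of that approach (the file names are placeholders for your parts, and I'm assuming all parts share the same schema):

import pyarrow as pa
import pyarrow.parquet as pq

paths = ['part0.parquet', 'part1.parquet', 'part2.parquet',
         'part3.parquet', 'part4.parquet', 'part5.parquet']

# Read each file into an Arrow Table (no pandas conversion yet)
tables = [pq.read_table(p) for p in paths]

# Combine at the Arrow level, then convert to pandas only once
table = pa.concat_tables(tables)
df = table.to_pandas()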
Related
I have a cloud bucket with many (around 1000) small JSON files (a few KB each). I have to read them, select some fields, and store the result in a single parquet file. Since the JSON files are very small, the resulting dataframe (around 100MB) fits in memory.
I tried two ways. The first is using Pandas with a for loop:
import os
import fnmatch
import json
import pandas as pd
path = ...
df = pd.DataFrame()
for root, _, filenames in os.walk(path):
    for filename in fnmatch.filter(filenames, '*.json'):
        file_path = os.path.join(root, filename)
        with open(file_path, 'r') as f:
            json_file = json.loads(f.read())
        file_df = pd.DataFrame(json_file)
        df = df.append(file_df, ignore_index=True)
The second option would be using Pyspark:
from pyspark.sql import SparkSession
path = ...
spark = SparkSession.builder.appName(app_name).config(conf=conf).getOrCreate()
df = spark.read.json(path)
Which of the two approaches is the more efficient way to read multiple JSON files? And how would the solutions scale if the number of files to read were much larger (more than 100K)?
If you are not running Spark on a cluster, it will not change much.
A pandas DataFrame is not distributed: when you perform transformations on a pandas dataset, the data is not spread across a cluster, so all the processing is concentrated in one node.
Working with Spark datasets, as in the second option, Spark sends chunks of the data to the available workers in your cluster, so the data is processed in parallel, making the job much faster. Depending on the size and shape of your data, you can play with how it is "sliced" (partitioned) to increase performance even further.
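As a rough sketch of the Spark route (the field names and paths here are placeholders, and I'm assuming you only need a couple of fields from each JSON record):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("json_to_parquet").getOrCreate()

# Spark globs the whole directory and distributes the read across its workers
df = spark.read.json("path/to/json/dir")

# Keep only the fields you need, then collapse to a single output file
df.select("field_a", "field_b").coalesce(1).write.mode("overwrite").parquet("path/to/output")

coalesce(1) forces a single parquet file, which matches your stated goal but gives up write parallelism; drop it if several part files are acceptable.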
I have some large csv files that I need to merge together. Each file is about 5 GB and my RAM is only 8 GB. I use the following code to read the csv files into dataframes and merge them on the columns fund_ticker, TICKER, and Date.
import numpy as np
import pandas as pd
# Read in data, ignore column "version"
table1 = pd.read_csv(r'C:\data\data1.csv', usecols=lambda col: col not in ["Version"])
table2 = pd.read_csv(r'C:\data\data2.csv', usecols=lambda col: col not in ["Version"])
weight = pd.read_csv(r'C:\data\data3.csv', usecols=lambda col: col not in ["Version"])
print("Finish reading")
# merge datasets
merged = table1.merge(table2, on=['fund_ticker', 'TICKER', 'Date']).merge(weight, on=['fund_ticker', 'TICKER', 'Date'])
Unfortunately, I got the following error:
numpy.core._exceptions.MemoryError: Unable to allocate 105. MiB for an array with shape (27632931,) and data type object
After searching on the internet, I think the issue is that the data is larger than my RAM. To overcome this, I am thinking about using a database such as SQL, or parquet files. My question is: what is the most efficient way to work with large datasets? My data is financial data and could grow to 500 GB or 1 TB. Some direction on how to set this up would be much appreciated. Thanks
There are a few options discussed on the 'Scaling to large datasets' page of the pandas User Guide.
The easiest drop-in replacement here would be to use dask.
It uses a subset of the pandas API, so it should feel familiar, and it allows working with dataframes larger than memory by only operating on chunks at a time.
However, that merge is likely to still be quite slow. (It would help somewhat to set the 'fund_ticker', 'TICKER', and 'Date' columns as the index of each dataframe first.)
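A rough sketch of what that could look like with dask (the paths and the Version-column filter mirror the snippet in the question; treat this as a starting point rather than a guaranteed drop-in):

import dask.dataframe as dd

keys = ['fund_ticker', 'TICKER', 'Date']
skip_version = lambda col: col not in ["Version"]  # forwarded to pandas.read_csv

table1 = dd.read_csv(r'C:\data\data1.csv', usecols=skip_version)
table2 = dd.read_csv(r'C:\data\data2.csv', usecols=skip_version)
weight = dd.read_csv(r'C:\data\data3.csv', usecols=skip_version)

# Nothing is loaded yet; dask builds a task graph and works chunk by chunk
merged = table1.merge(table2, on=keys).merge(weight, on=keys)

# Write the result out without materialising the whole thing in memory
merged.to_parquet(r'C:\data\merged_parquet')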
I am trying to split a parquet file using Dask with the following piece of code:
import dask.dataframe as pd
df = pd.read_parquet(dataset_path, chunksize="100MB")
df.repartition(partition_size="100MB")
pd.to_parquet(df,output_path)
I have only one physical file in input, i.e. file.parquet.
The output of this script is likewise only one file, i.e. part.0.parquet.
Based on the partition_size & chunksize parameters, I should have multiple files in the output.
Any help would be appreciated
df.repartition(partition_size="100MB") returns a Dask Dataframe.
You have to write :
df = df.repartition(partition_size="100MB")
You can check the number of partitions created looking at df.npartitions
Also, you can use the following to write your parquet files :
df.to_parquet(output_path)
Because Parquet files are meant to deal with large files, you should also consider using the argument compression= when writing your parquet files.
You should get what you expect.
NB: Writing import dask.dataframe as pd is misleading, because import dask.dataframe as dd is the commonly used convention.
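Putting those pieces together, a small sketch with the conventional alias (dataset_path and output_path are the same placeholders as in the question):

import dask.dataframe as dd

df = dd.read_parquet(dataset_path)
df = df.repartition(partition_size="100MB")   # reassign: repartition returns a new Dask DataFrame

print(df.npartitions)                         # number of part files to_parquet will write

df.to_parquet(output_path, compression="snappy")  # compression is optional; snappy is a common choice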
I have sensor data recorded over a timespan of one year. The data is stored in twelve chunks, each with 1000 columns and ~1,000,000 rows. I have worked out a script to concatenate these chunks into one large file, but about halfway through the execution I get a MemoryError. (I am running this on a machine with ~70 GB of usable RAM.)
import gc
from os import listdir
import pandas as pd
path = "/slices02/hdf/"
slices = listdir(path)
res = pd.DataFrame()
for sl in slices:
    temp = pd.read_hdf(path + sl)
    res = pd.concat([res, temp], sort=False, axis=1)
    del temp
    gc.collect()
res.fillna(method="ffill", inplace=True)
res.to_hdf(path + "sensor_data_cpl.hdf", "online", mode="w")
I have also tried to fiddle with HDFStore so I do not have to load all the data into memory (see Merging two tables with millions of rows in Python), but I could not figure out how that works in my case.
When you read a csv into a pandas DataFrame, the process can take up to twice the needed memory (because of type guessing and all the automatic things pandas tries to provide).
Several methods to fight that:
Use chunks. I see that your data is already in chunks, but maybe those are too big, so you can read each file in chunks using the chunksize parameter of pandas.read_hdf or pandas.read_csv (see the sketch after this list).
Provide dtypes to avoid type guessing and mixed types (e.g. a column of strings with null values will have a mixed type); this works together with the low_memory parameter of pandas.read_csv.
If this is not sufficient you'll have to turn to distributed technologies like pyspark, dask, modin or even pandarallel.
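For the csv case mentioned above, a tiny sketch of chunked reading with explicit dtypes (the file name and column names are made up purely for illustration):

import pandas as pd

# Hypothetical columns; explicit dtypes stop pandas from guessing and keep memory down
dtypes = {"sensor_id": "int32", "value": "float32", "status": "category"}

pieces = []
for chunk in pd.read_csv("sensor_slice.csv", dtype=dtypes, chunksize=500_000):
    # only one chunk is held in memory while you process it
    pieces.append(chunk)

res = pd.concat(pieces, sort=False)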
When you have this much data, avoid creating temporary dataframes, as they take up memory too. Try doing it in one pass:
import os
import pandas as pd

folder = "/slices02/hdf/"
files = [os.path.join(folder, file) for file in os.listdir(folder)]
res = pd.concat((pd.read_hdf(file) for file in files), sort=False)
See how this works for you.
I am running the script below.
import numpy as np
import pandas as pd
# load all data to respective dataframes
orders = pd.read_csv('C:\\my_path\\orders.csv')
products = pd.read_csv('C:\\my_path\\products.csv')
order_products = pd.read_csv('C:\\my_path\\order_products.csv')
# check out data sets
print(orders.shape)
print(products.shape)
print(order_products.shape)
# merge different dataframes into one consolidated dataframe
df = pd.merge(order_products, products, on='product_id')
df = pd.merge(df, orders, on='order_id')
On the last line of merging the second data frame, I get this result:
out = np.empty(out_shape, dtype=dtype)
MemoryError
The file 'order_products.csv' is around 550MB, 'orders.csv' is 100MB, and 'products.csv' is just 2MB. I have tried running this process a few times and I always get the MemoryError. The files don't seem all that massive, but I guess it's all relative, because on my old machine it's just too much. Is there a simple way to read these files into dataframes in chunks and then merge them together in chunks?
I am working with Spyder 3.3.4, Python 3.7, and Windows 7 on an old ThinkPad.
Thanks.
Try using the concept of slicing and chunking. What the error says is that you've reached the limit of what your computer's RAM can hold.
orders_100 = orders[:100]
products_100 = products[:100]
order_products_100 = order_products[:100]
Then do pd.merge()
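If throwing away data is not an option, a minimal sketch of the chunked-merge idea from the question, streaming the largest file in pieces and merging each piece against the smaller tables (the paths and join keys are the ones from your snippet; the chunk size is arbitrary):

import pandas as pd

# The two smaller files fit in memory as-is
products = pd.read_csv('C:\\my_path\\products.csv')   # ~2MB
orders = pd.read_csv('C:\\my_path\\orders.csv')       # ~100MB

pieces = []
# Stream the largest file in chunks and merge each chunk on its own
for chunk in pd.read_csv('C:\\my_path\\order_products.csv', chunksize=500_000):
    merged = chunk.merge(products, on='product_id').merge(orders, on='order_id')
    pieces.append(merged)
    # if even the collected pieces grow too large, write each one to disk here instead

df = pd.concat(pieces, ignore_index=True)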