I am trying to load a Parquet file with row group size = 10 into a DuckDB table in chunks. I am not finding any documentation to support this.
This is my work so far; see the code below.
import duckdb
import pandas as pd
import gc
import numpy as np
# connect to a persistent, file-backed database
con = duckdb.connect(database='database.duckdb', read_only=False)
df1 = pd.read_parquet("file1.parquet")
df2 = pd.read_parquet("file2.parquet")
# create the table "table1" from the DataFrame "df1"
con.execute("CREATE TABLE table1 AS SELECT * FROM df1")
# create the table "table2" from the DataFrame "df2"
con.execute("CREATE TABLE table2 AS SELECT * FROM df2")
con.close()
gc.collect()
Please help me load both tables from the Parquet files by row group size or in chunks. Also, load the data into DuckDB in chunks.
df1 = pd.read_parquet("file1.parquet")
This statement will read the entire Parquet file into memory. Instead, I assume you want to read it in chunks (i.e. one row group after another, or in batches) and then write the data frame into DuckDB.
This is not possible as of now using pandas alone. You can use something like pyarrow (or fastparquet) to do this.
Here is an example from the pyarrow docs.
iter_batches can be used to read streaming batches from a Parquet file. This can be used to read in batches, read certain row groups, or even certain columns.
import pyarrow.parquet as pq
parquet_file = pq.ParquetFile('example.parquet')
for i in parquet_file.iter_batches(batch_size=10):
    print("RecordBatch")
    print(i.to_pandas())
The above example simply reads 10 records at a time. You can further limit this to certain row groups or even certain columns, like below.
for i in parquet_file.iter_batches(batch_size=10, columns=['user_address'], row_groups=[0,2,3]):
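Since the end goal is to land those batches in a DuckDB table, one way to combine the two pieces is to wrap each RecordBatch in an Arrow table, register it with the connection, and append it to the target table. A minimal sketch, reusing the table name table1, the file file1.parquet, and the database file from the question:
import duckdb
import pyarrow as pa
import pyarrow.parquet as pq

con = duckdb.connect(database='database.duckdb', read_only=False)
parquet_file = pq.ParquetFile('file1.parquet')

first = True
for batch in parquet_file.iter_batches(batch_size=10):
    tbl = pa.Table.from_batches([batch])   # wrap the RecordBatch in an Arrow table
    con.register('tbl', tbl)               # expose it to DuckDB under the name 'tbl'
    if first:
        con.execute("CREATE TABLE table1 AS SELECT * FROM tbl")
        first = False
    else:
        con.execute("INSERT INTO table1 SELECT * FROM tbl")
    con.unregister('tbl')
con.close()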
Hope this helps!
This is not necessarily a solution (I like the pyarrow-oriented one already submitted!), but here are some other pieces of information that may help you. I am attempting to guess what your root-cause problem is! (https://xyproblem.info/)
In the next release of DuckDB (and on the current master branch), data will be written to disk in a streaming fashion for inserts. This should allow you to insert ~any size of Parquet file into a file-backed persistent DuckDB without running out of memory. Hopefully it removes the need for you to do batching at all (since DuckDB will batch based on your rowgroups automatically)! For example:
con.execute("CREATE TABLE table1 AS SELECT * FROM 'file1.parquet'")
Another note is that the typically recommended size of a rowgroup is closer to 100,000 or 1,000,000 rows. This has a few benefits over very small rowgroups. Compression will work better, since compression operates within a rowgroup only. There will also be less overhead spent on storing statistics, since each rowgroup stores its own statistics. And, since DuckDB is quite fast, it will process a 100,000 or 1,000,000 row rowgroup quite quickly (whereas the overhead of reading statistics may slow things down with really small rowgroups).
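If you control how the Parquet files are produced, you can also rewrite them with larger row groups using pyarrow. A small sketch (it re-reads the whole file into memory, so it assumes the file fits in RAM; the output file name is a placeholder):
import pyarrow.parquet as pq

table = pq.read_table('file1.parquet')
# rewrite with roughly 100,000 rows per row group instead of 10
pq.write_table(table, 'file1_rewritten.parquet', row_group_size=100_000)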
Related
I have a directory with 80K CSV files and I need to somehow transform those files to another CSV format. I need, for example, to change the column name in all 80K files or change a value.
But the catch is that all these transformations have to happen in a short period of time and preferably in under five minutes.
I have already tried to use an in-memory database like SQLite or DuckDB where I:
load the CSV file
insert it into a table
query the table with a SQL UPDATE statement
export the table to a new CSV file
drop the table
and repeat this process 80K times, but this is too slow.
Here is the code for that:
import sqlite3
import pandas as pd

conn = sqlite3.connect(":memory:")  # in-memory database, as described above
for i in range(80_000):
    fileNum = i + 1
    # Load CSV data into a pandas DataFrame
    data = pd.read_csv(f"generatedFiles/Generated-File-{fileNum}.csv")
    # Write the data to a SQLite table
    data.to_sql(f"table_{fileNum}", conn, if_exists='replace', index=False)
    # Transform, export to a new CSV file, and drop the table
    conn.execute(f"UPDATE table_{fileNum} SET Name = 'TransformedName'")
    pd.read_sql_query(f"SELECT * FROM table_{fileNum}", conn).to_csv(f'exportedFiles-poc2.1/Transformed-File-{fileNum}.csv', index=False)
    conn.execute(f"DROP TABLE table_{fileNum}")
Can anyone help me come up with a solution to efficiently transform and update 80K to 100K CSV files in as short a time as possible?
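For reference, the per-file pipeline described above can also be expressed directly in DuckDB, skipping the pandas round trip entirely. This is only a sketch, reusing the file layout and the Name column from the loop above; whether it is fast enough for 80K files would need to be measured:
import duckdb

con = duckdb.connect()  # in-memory DuckDB database
for i in range(80_000):
    file_num = i + 1
    src = f"generatedFiles/Generated-File-{file_num}.csv"
    dst = f"exportedFiles-poc2.1/Transformed-File-{file_num}.csv"
    # read, transform, and export in a single statement, with no intermediate table
    con.execute(f"""
        COPY (
            SELECT * REPLACE ('TransformedName' AS Name)
            FROM read_csv_auto('{src}')
        ) TO '{dst}' (HEADER, DELIMITER ',')
    """)
con.close()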
Here is the program I am developing in Python:
Step 1 - We will get a JSON file (size could be in GBs, e.g. 50 GB or more) from the source to our server.
Step 2 - I use a pandas DataFrame to load the JSON into a DF using
df = pd.read_json(jsonfile)
Step 3 - I use df.to_csv(temp_csvfile, ...)
Step 4 - I use Python psycopg2 to make a PostgreSQL connection and cursor:
curr = conn.cursor()
Step 5 - Read the CSV and load it using copy_from:
with open(temp_csvfile, 'r') as f:
    curr.copy_from(f, ...)
    conn.commit()
I seek feedback on the points below:
a. Will this way of loading JSON into a pandas DataFrame not cause an out-of-memory issue if my system memory is smaller than the size of the JSON file?
b. At step 5 I am again opening the file in read mode; will the same issue come up here, as it might load the file into memory (am I missing anything here)?
c. Is there any better way of doing this?
d. Can Python Dask be used, as it provides reading data in chunks (I am not familiar with this)?
Please advise.
You could split your input JSON file into many smaller files, and also use the chunksize parameter while reading the file content into a pandas DataFrame. Also, use the psycopg2 copy_from function, which supports a buffer size parameter.
In fact, you could use execute_batch() to get batches of rows inserted into your PostgreSQL table, as in the article mentioned in the references below.
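A minimal sketch of that idea, assuming the input can be treated as JSON Lines (one record per line), which is what pandas requires for chunked reading; the connection details, the table name my_table, and the chunk size are placeholders:
import io
import pandas as pd
import psycopg2

conn = psycopg2.connect("dbname=mydb user=myuser")  # placeholder connection details
curr = conn.cursor()

# lines=True is required for chunksize to work with read_json
for chunk in pd.read_json("input.json", lines=True, chunksize=100_000):
    buf = io.StringIO()
    chunk.to_csv(buf, index=False, header=False)
    buf.seek(0)
    # copy_from streams from the buffer; size is the read buffer size in bytes.
    # It expects plain text: if values can contain the separator or quotes,
    # use copy_expert with COPY ... FROM STDIN WITH (FORMAT csv) instead.
    curr.copy_from(buf, "my_table", sep=",", size=8192)
    conn.commit()

curr.close()
conn.close()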
References:
Loading 20gb json file in pandas
Loading dataframes data into postgresql table article
Read a large json file into pandas
I am trying to import a large csv file (5 million rows) into a local MySQL database using the code below:
Code:
import pandas as pd
from sqlalchemy import create_engine
engine = create_engine('mysql+mysqlconnector://[username]:[password]@[host]:[port]/[schema]', echo=False)
df = pd.read_csv('C:/Users/[user]/Documents/Sales_Records.csv')
df = df.head(27650)
df.to_sql(con= engine, name='data', if_exists='replace', chunksize = 50000)
If I execute this code it works so long as df.head([row limit]) is less than 27650. However, as soon as I increase this row limit by just a single row, the import fails and no data is transferred to MySQL. Does anyone know why this would happen?
A pandas DataFrame has no memory limit other than your local machine's memory, so I think your machine is running out of memory. You can use memory_profiler, a Python library I like to use to check real-time memory usage. More info can be found in the docs here: https://pypi.org/project/memory-profiler/
You should never read big files in one go, as it is a single point of failure and also slow. Load the data into the database in chunks, like they did in this post: https://soprasteriaanalytics.se/2020/10/22/working-with-large-csv-files-in-pandas-create-a-sql-database-by-reading-files-in-chunks/
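A minimal sketch of that chunked pattern, reusing the engine and file path from the question; the first chunk creates (replaces) the table and every later chunk appends to it:
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine('mysql+mysqlconnector://[username]:[password]@[host]:[port]/[schema]', echo=False)

first = True
for chunk in pd.read_csv('C:/Users/[user]/Documents/Sales_Records.csv', chunksize=50_000):
    chunk.to_sql(con=engine, name='data',
                 if_exists='replace' if first else 'append',
                 index=False)
    first = False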
I have a daily process where I read in a historical Parquet dataset and then concatenate that with a new file each day. I'm trying to optimize memory by making better use of Arrow's dictionary arrays. I want to avoid systematically doing a round trip to pandas (and without defining columns) to get categoricals.
I'm wondering how to do this in pyarrow.
I currently do:
import pyarrow as pa
import pyarrow.parquet as pq
import pyarrow.csv as csv
historical_table = pq.read_table(historical_pq_path)
new_table = pa.Table.from_pandas(
    csv.read_csv(new_file_path).to_pandas(
        strings_to_categorical=True,
        split_blocks=True,
        self_destruct=True,
    )
)
combined_table = pa.concat_tables([historical_table, new_table])
I process many files and would like to avoid having to maintain a schema for each file where I list the dictionary columns and pass that as read options to csv.read_csv. The convenience of going to pandas with no column specification using strings_to_categorical=True is really nice. From what I've seen there isn't a way to do something like strings_to_dict natively in pyarrow.
Is there a clean way to do this in just pyarrow?
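One option that may avoid the pandas round trip: pyarrow's CSV reader exposes an auto_dict_encode convert option that dictionary-encodes string columns as they are read. A sketch, assuming a reasonably recent pyarrow and reusing the paths from the snippet above:
import pyarrow as pa
import pyarrow.csv as csv
import pyarrow.parquet as pq

historical_table = pq.read_table(historical_pq_path)
new_table = csv.read_csv(
    new_file_path,
    # dict-encode string/binary columns on read, no column list needed
    convert_options=csv.ConvertOptions(auto_dict_encode=True),
)
# schemas still have to line up for the concat, as in the original snippet
combined_table = pa.concat_tables([historical_table, new_table])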
I am new to Python and I am attempting to read a large .csv file (with hundreds of thousands or possibly a few million rows, and about 15,000 columns) using pandas.
What I thought I could do is to create and save each chunk in a new .csv file, iteratively across all chunks. I am currently using a laptop with relatively limited memory (about 4 GB, in the process of upgrading it), but I was wondering whether I could do this without changing my setup now. Alternatively, I could transfer this process to a PC with more RAM and attempt larger chunks, but I wanted to get this in place even for shorter row chunks.
I have seen that I can quickly process chunks of data (e.g. 10,000 rows and all columns) using the code below. But being a Python beginner, I have only managed to handle the first chunk. I would like to loop iteratively across all chunks and save them.
import pandas as pd
import os
print(os.getcwd())
print(os.listdir(os.getcwd()))
chunksize = 10000
data = pd.read_csv('ukb35190.csv', chunksize=chunksize)
df = data.get_chunk(chunksize)
print(df)
export_csv1 = df.to_csv(r'/home/user/PycharmProjects/PROJECT/export_csv_1.csv', index=None, header=True)
If you are not doing any processing on the data, then you don't even have to store it in a variable; you can write each chunk out directly. See the code below. Hope this helps.
import pandas as pd

chunksize = 10000
batch_no = 1
# stream the CSV in chunks and write each chunk to its own file
for chunk in pd.read_csv(r'ukb35190.csv', chunksize=chunksize):
    chunk.to_csv(r'ukb35190.csv' + str(batch_no) + '.csv', index=False)
    batch_no += 1