Here is the program I am developing in Python:
Step 1 - We receive a JSON file (its size could be in the GBs, e.g. 50 GB or more) from the source onto our server.
Step 2 - I load the JSON into a pandas DataFrame using
df = pd.read_json(jsonfile)
Step 3 - I write it to a temporary CSV with df.to_csv(temp_csvfile, index=False, header=False, ...)
Step 4 - I use psycopg2 to open a PostgreSQL connection and cursor:
curr = conn.cursor()
Step 5 - I read the CSV and load it using copy_from:
with open(temp_csvfile, 'r') as f:
    curr.copy_from(f, ...)
conn.commit()
I am seeking feedback on the points below:
a. Will loading the JSON into a pandas DataFrame this way cause an out-of-memory issue if my system memory is smaller than the JSON file?
b. At step 5 I am again opening the file in read mode; will the same issue arise there, since it might load the file into memory (am I missing anything here)?
c. Is there a better way of doing this?
d. Can Dask be used here, since it provides reading data in chunks? (I am not familiar with it.)
Please advise.
You could split your input JSON file into many smaller files, and also use the chunksize parameter while reading the file content into a pandas DataFrame. Also, use psycopg2's copy_from function, which reads from the file object in buffered pieces (it supports a buffer size parameter), so it does not pull the whole CSV into memory at once.
In fact, you could use execute_batch() to insert batches of rows into your PostgreSQL table, as in the article mentioned in the references below.
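Here is a minimal sketch of the chunked approach, assuming the JSON is newline-delimited (read_json only accepts chunksize together with lines=True) and a hypothetical connection string and table name:

import io
import pandas as pd
import psycopg2

conn = psycopg2.connect("dbname=mydb user=myuser")  # hypothetical connection string
curr = conn.cursor()

# read_json with lines=True and chunksize returns an iterator of DataFrames,
# so the 50 GB file never has to fit in memory in one piece
for chunk in pd.read_json("jsonfile.json", lines=True, chunksize=100_000):
    buf = io.StringIO()
    chunk.to_csv(buf, index=False, header=False)
    buf.seek(0)
    # copy_from streams from the buffer; "target_table" is a hypothetical name,
    # and sep="," assumes the data itself contains no commas
    curr.copy_from(buf, "target_table", sep=",")
    conn.commit()

conn.close()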
References:
- Loading 20 GB JSON file in pandas
- Loading DataFrame data into a PostgreSQL table (article)
- Read a large JSON file into pandas
I am trying to load a Parquet file with row group size = 10 into a DuckDB table in chunks. I am not finding any documentation to support this.
This is my work so far:
import duckdb
import pandas as pd
import gc
import numpy as np
# connect to a file-backed DuckDB database
con = duckdb.connect(database='database.duckdb', read_only=False)
df1 = pd.read_parquet("file1.parquet")
df2 = pd.read_parquet("file2.parquet")
# create the table "table1" from the DataFrame "df1"
con.execute("CREATE TABLE table1 AS SELECT * FROM df1")
# create the table "table2" from the DataFrame "df2"
con.execute("CREATE TABLE table2 AS SELECT * FROM df2")
con.close()
gc.collect()
Please help me load both tables from the Parquet files by row group size or in chunks. Also, how do I load the data into DuckDB in chunks?
df1 = pd.read_parquet("file1.parquet")
This statement reads the entire Parquet file into memory. Instead, I assume you want to read it in chunks (i.e. one row group after another, or in batches) and then write each batch into DuckDB.
This is not possible with pandas as of now. You can use something like pyarrow (or fastparquet) to do this.
Here is an example from pyarrow docs.
iter_batches can be used to read streaming batches from a Parquet file. This can be used to read in batches, read certain row groups or even certain columns.
import pyarrow.parquet as pq
parquet_file = pq.ParquetFile('example.parquet')
for i in parquet_file.iter_batches(batch_size=10):
    print("RecordBatch")
    print(i.to_pandas())
The above example simply reads 10 records at a time. You can further limit this to certain row groups or even certain columns, as below.
for i in parquet_file.iter_batches(batch_size=10, columns=['user_address'], row_groups=[0, 2, 3]):
    print(i.to_pandas())
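To connect this back to the question, here is a rough sketch of feeding those batches into the DuckDB connection from the question; the database, file, table name and batch size are taken from the question, and the create-then-insert pattern is just one possible approach:

import duckdb
import pyarrow.parquet as pq

con = duckdb.connect(database='database.duckdb', read_only=False)
parquet_file = pq.ParquetFile('file1.parquet')

first = True
for batch in parquet_file.iter_batches(batch_size=10):
    df = batch.to_pandas()  # DuckDB can query this local DataFrame by name
    if first:
        # create the table from the first batch
        con.execute("CREATE TABLE table1 AS SELECT * FROM df")
        first = False
    else:
        # append each subsequent batch
        con.execute("INSERT INTO table1 SELECT * FROM df")

con.close()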
Hope this helps!
This is not necessarily a solution (I like the pyarrow-oriented one already submitted!), but here are some other pieces of information that may help you. I am attempting to guess what your root-cause problem is! (https://xyproblem.info/)
In the next release of DuckDB (and on the current master branch), data will be written to disk in a streaming fashion for inserts. This should allow you to insert ~any size of Parquet file into a file-backed persistent DuckDB without running out of memory. Hopefully it removes the need for you to do batching at all (since DuckDB will batch based on your rowgroups automatically)! For example:
con.execute("CREATE TABLE table1 AS SELECT * FROM 'file1.parquet'")
Another note is that the typically recommended size of a rowgroup is closer to 100,000 or 1,000,000 rows. This has a few benefits over very small rowgroups. Compression will work better, since compression operates within a rowgroup only. There will also be less overhead spent on storing statistics, since each rowgroup stores its own statistics. And, since DuckDB is quite fast, it will process a 100,000 or 1,000,000 row rowgroup quite quickly (whereas the overhead of reading statistics may slow things down with really small rowgroups).
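If you control how the Parquet files are produced, here is a small sketch (with hypothetical stand-in data) of writing them with a larger rowgroup size via pyarrow:

import pyarrow as pa
import pyarrow.parquet as pq

# hypothetical stand-in data; in practice this would be your real table
table = pa.table({"user_address": ["a", "b", "c"]})

# write ~100,000 rows per rowgroup instead of 10
pq.write_table(table, "file1.parquet", row_group_size=100_000)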
I have a MySQL dump file in .sql format. Its size is around 100 GB, and there are just two tables in it. I have to extract the data from this file using Python or Bash. The issue is that the INSERT statement contains all the data, and that line is far too long. Hence, the usual line-by-line approach causes a memory issue, because that whole line (i.e., all the data) is loaded into memory in the loop as well.
Is there any efficient way or tool to get the data as CSV?
Just a little explanation: the following line contains the actual data, and it is of very large size.
INSERT INTO `tblEmployee` VALUES (1,'Nirali','Upadhyay',NULL,NULL,9,'2021-02-08'),(2,'Nirali','Upadhyay',NULL,NULL,9,'2021-02-08'),(3,'Nirali','Upadhyay',NULL,NULL,9,'2021-02-08'),....
The issue is that I cannot import it into MySQL due to resource issues.
I'm not sure if this is what you want, but pandas has a function to turn a SQL table into a CSV. Try this:
import pandas as pd
import sqlite3

connect = sqlite3.connect("connections.db")

# load the SQL table into a DataFrame (replace table_name with your table's name)
dataframe = pd.read_sql("SELECT * FROM table_name", connect)

# write the DataFrame to a CSV file
dataframe.to_csv("filename.csv", index=False)

connect.close()
If you want to change the delimiter, you can do dataframe.to_csv("filename.csv", index=False, sep='3') and just change the '3' to your delimiter of choice.
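If the table is too large to hold in a single DataFrame, the same idea can be applied in chunks; here is a sketch of one way to do it, using the fact that pd.read_sql returns an iterator when given a chunksize (the table name is still a placeholder):

import pandas as pd
import sqlite3

connect = sqlite3.connect("connections.db")

# stream the table 100,000 rows at a time and append each piece to one CSV
first = True
for chunk in pd.read_sql("SELECT * FROM table_name", connect, chunksize=100_000):
    chunk.to_csv("filename.csv", mode="w" if first else "a", header=first, index=False)
    first = False

connect.close()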
I have a very large JSON file (3M+ records) that causes memory issues when I try to import it using pandas.
data = pd.read_json("JSON_large_file.json")
This procedure causes problems when the file exceeds 100k records.
In order to get around this issue, I would like to open the file, search each record for an identifier (e.g. Customer_Number), and then append the matching records to a DataFrame.
Is this possible, and how could it be done?
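One possible approach, sketched under the assumption that the file is (or can be rewritten as) newline-delimited JSON and that Customer_Number is a top-level field; read_json only accepts chunksize together with lines=True, and the identifier value here is hypothetical:

import pandas as pd

target_id = 12345  # hypothetical Customer_Number to search for
matches = []

# stream the file in chunks so it never has to fit in memory all at once
for chunk in pd.read_json("JSON_large_file.json", lines=True, chunksize=100_000):
    matches.append(chunk[chunk["Customer_Number"] == target_id])

result = pd.concat(matches, ignore_index=True)
print(result.head())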
I am trying to tidy up a large (8 GB) .csv file in Python and then stream it into BigQuery. My code below starts off okay, as the table is created and the first 1000 rows go in, but then I get the error:
InvalidSchema: Please verify that the structure and data types in the DataFrame match the schema of the destination table.
Is this perhaps related to the streaming buffer? My issue is that I will need to remove the table before I run the code again, otherwise the first 1000 entries will be duplicated due to the 'append' method.
import pandas as pd

destination_table = 'product_data.FS_orders'
project_id = '##'
pkey = '##'

chunks = []
for chunk in pd.read_csv('Historic_orders.csv', chunksize=1000, encoding='windows-1252',
                         names=['Orderdate', 'Weborderno', 'Productcode', 'Quantitysold', 'Paymentmethod',
                                'ProductGender', 'DeviceType', 'Brand', 'ProductDescription', 'OrderType',
                                'ProductCategory', 'UnitpriceGBP' 'Webtype1', 'CostPrice', 'Webtype2',
                                'Webtype3', 'Variant', 'Orderlinetax']):
    chunk = chunk.replace(r' *!', 'Null', regex=True)
    chunk.to_gbq(destination_table, project_id, if_exists='append', private_key=pkey)
    chunks.append(chunk)

df = pd.concat(chunks, axis=0)
print(df.head(5))
df.to_csv('Historic_orders_cleaned.csv')
Question:
- Why streaming and not simply loading? That way you can upload batches of 1 GB instead of 1000 rows. Streaming is usually the choice when you have continuous data that needs to be appended as it arrives. If you have a break of one day between collecting the data and the load job, it is usually safer to just load it. See here.
Apart from that, I've had my share of issues loading tables into BigQuery from CSV files, and most of the time it was either 1) encoding (I see you have a non-UTF-8 encoding) or 2) invalid characters, e.g. a comma lost in the middle of the file that broke a line.
To validate that, what if you insert the rows in reverse order? Do you get the same error?
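For the "load instead of stream" suggestion, here is a rough sketch using the google-cloud-bigquery client; the table and project IDs come from the question, while the header/schema settings are assumptions you would adjust to your file:

from google.cloud import bigquery

client = bigquery.Client(project='##')  # project_id from the question

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=0,                 # assumed: the file has no header row
    autodetect=True,                     # or pass an explicit schema instead
    write_disposition='WRITE_TRUNCATE',  # replaces the table, so reruns do not duplicate rows
)

# one batch load job for the whole cleaned file, instead of streaming 1000-row chunks
with open('Historic_orders_cleaned.csv', 'rb') as f:
    job = client.load_table_from_file(f, 'product_data.FS_orders', job_config=job_config)

job.result()  # wait for the load job to finish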
I have certain computations performed on a dataset, and I need the result to be stored in an external file.
Had it been exported to CSV, to process it further I'd have to convert it back to a DataFrame/SFrame, which again increases the lines of code.
Here's the snippet:
train_data = graphlab.SFrame(ratings_base)
Clearly, it is an SFrame, and it can be converted to a DataFrame using
df_train = train_data.to_dataframe()
Now that it is a DataFrame, I need it exported to a file without changing its structure, since the exported file will be used as an argument to another Python script. That script must accept a DataFrame and not a CSV.
I have already checked out place1, place2, place3, place4 and place5.
P.S. I'm still digging into Python serialization; if anyone can simplify it in this context, that would be helpful.
I'd use the HDF5 format, as it's supported by pandas and by graphlab.SFrame, and besides that, HDF5 is very fast.
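A minimal sketch of that route, with a hypothetical DataFrame and key name (pandas' HDF5 support needs the 'tables' package installed):

import pandas as pd

# hypothetical DataFrame standing in for train_data.to_dataframe() from the question
df_train = pd.DataFrame({"user_id": [1, 2], "rating": [4.0, 5.0]})

# write the DataFrame to an HDF5 file; 'train' is a hypothetical key
df_train.to_hdf(r'/path/to/pd_frame.h5', key='train', mode='w')

# read it back (from the same or from another script)
df = pd.read_hdf(r'/path/to/pd_frame.h5', key='train')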
Alternatively, you can export the pandas DataFrame to a pickle file and read it from another script:
sf.to_dataframe().to_pickle(r'/path/to/pd_frame.pickle')
to read it back (from the same or from another script):
df = pd.read_pickle(r'/path/to/pd_frame.pickle')