Transform bulk CSV files - Python

I have a directory with 80K CSV files and I need to transform those files into another CSV format, for example changing a column name in all 80K files or changing a value.
But the catch is that all these transformations have to happen in a short period of time, preferably in under five minutes.
I have already tried to use an in-memory database like SQLite or DuckDB, where I:
load the CSV file
insert it into a table
query the table with an SQL UPDATE statement
export the table to a new CSV file
drop the table
and repeat this process 80K times, but that is too slow.
Here is the code for that:
import sqlite3
import pandas as pd

# in-memory SQLite database, as described above
conn = sqlite3.connect(":memory:")

for i in range(80_000):
    fileNum = i + 1
    # Load CSV data into a pandas DataFrame
    data = pd.read_csv(f"generatedFiles/Generated-File-{fileNum}.csv")
    # Write the data to a SQLite table
    data.to_sql(f"table_{fileNum}", conn, if_exists='replace', index=False)
    # Transform, export to a new CSV file, and drop the table again
    conn.execute(f"UPDATE table_{fileNum} SET Name = 'TransformedName'")
    pd.read_sql_query(f"SELECT * FROM table_{fileNum}", conn).to_csv(f'exportedFiles-poc2.1/Transformed-File-{fileNum}.csv', index=False)
    conn.execute(f"DROP TABLE table_{fileNum}")
Can anyone help me come up with a solution to efficiently transform and update 80 to 100K CSV files in as short a time as possible?
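One direction that might be worth trying, sketched under the assumption that the transformation is as simple as the UPDATE above: skip the database round-trip entirely and apply the change with pandas alone, spread over all CPU cores with a process pool. The paths and the Name column mirror the code above; everything else is illustrative, not a benchmarked answer.

from concurrent.futures import ProcessPoolExecutor
from pathlib import Path
import pandas as pd

SRC = Path("generatedFiles")
DST = Path("exportedFiles-poc2.1")
DST.mkdir(exist_ok=True)

def transform(path: Path) -> None:
    df = pd.read_csv(path)
    df["Name"] = "TransformedName"          # same update as the SQL version above
    df.to_csv(DST / path.name.replace("Generated", "Transformed"), index=False)

if __name__ == "__main__":
    files = sorted(SRC.glob("Generated-File-*.csv"))
    with ProcessPoolExecutor() as pool:     # one worker per CPU core by default
        list(pool.map(transform, files, chunksize=256))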

Related

Importing parquet file in chunks and insert in DuckDB

I am trying to load a parquet file with row group size = 10 into a DuckDB table in chunks. I am not finding any documentation to support this.
This is my work so far:
import duckdb
import pandas as pd
import gc
import numpy as np

# connect to a file-backed database
con = duckdb.connect(database='database.duckdb', read_only=False)
df1 = pd.read_parquet("file1.parquet")
df2 = pd.read_parquet("file2.parquet")
# create the table "table1" from the DataFrame "df1"
con.execute("CREATE TABLE table1 AS SELECT * FROM df1")
# create the table "table2" from the DataFrame "df2"
con.execute("CREATE TABLE table2 AS SELECT * FROM df2")
con.close()
gc.collect()
Please help me load both tables from the parquet files by row group size or in chunks. Also, please show how to load the data into DuckDB in chunks.
df1 = pd.read_parquet("file1.parquet")
This statement will read the entire parquet file into memory. Instead, I assume you want to read it in chunks (i.e., one row group after another, or in batches) and then write the data frame into DuckDB.
This is not possible as of now using pandas. You can use something like pyarrow (or fastparquet) to do this.
Here is an example from the pyarrow docs.
iter_batches can be used to read streaming batches from a Parquet file. This can be used to read in batches, read certain row groups or even certain columns.
import pyarrow.parquet as pq

parquet_file = pq.ParquetFile('example.parquet')
for i in parquet_file.iter_batches(batch_size=10):
    print("RecordBatch")
    print(i.to_pandas())
The above example simply reads 10 records at a time. You can further limit this to certain row groups or even certain columns, like below.
for i in parquet_file.iter_batches(batch_size=10, columns=['user_address'], row_groups=[0,2,3]):
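To complete the picture, here is a rough sketch of feeding those batches into DuckDB one chunk at a time. The table and file names are placeholders, and it relies on DuckDB being able to refer to the local pandas DataFrame batch_df by name in SQL.

import duckdb
import pyarrow.parquet as pq

con = duckdb.connect(database='database.duckdb', read_only=False)
parquet_file = pq.ParquetFile('file1.parquet')

first = True
for batch in parquet_file.iter_batches(batch_size=10_000):
    batch_df = batch.to_pandas()   # one small chunk at a time
    if first:
        # create the table from the first batch, inheriting its schema
        con.execute("CREATE TABLE table1 AS SELECT * FROM batch_df")
        first = False
    else:
        con.execute("INSERT INTO table1 SELECT * FROM batch_df")
con.close()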
Hope this helps!
This is not necessarily a solution (I like the pyarrow oriented one already submitted!), but here are some other pieces of information that may help you. I am attempting to guess what your root cause problem is! (https://xyproblem.info/)
In the next release of DuckDB (and on the current master branch), data will be written to disk in a streaming fashion for inserts. This should allow you to insert ~any size of Parquet file into a file-backed persistent DuckDB without running out of memory. Hopefully it removes the need for you to do batching at all (since DuckDB will batch based on your rowgroups automatically)! For example:
con.execute("CREATE TABLE table1 AS SELECT * FROM 'file1.parquet'")
Another note is that the typically recommended size of a rowgroup is closer to 100,000 or 1,000,000 rows. This has a few benefits over very small rowgroups. Compression will work better, since compression operates within a rowgroup only. There will also be less overhead spent on storing statistics, since each rowgroup stores its own statistics. And, since DuckDB is quite fast, it will process a 100,000 or 1,000,000 row rowgroup quite quickly (whereas the overhead of reading statistics may slow things down with really small rowgroups).
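If you control how the Parquet files are written, the rowgroup size can be set at write time. A small pyarrow sketch (the data and file name are just placeholders):

import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({"user_address": ["a", "b", "c"]})   # stand-in data
# write roughly 100,000 rows per rowgroup instead of the default
pq.write_table(table, "example.parquet", row_group_size=100_000)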

Load CSV file with 10 columns into an existing BigQuery table that has 103 columns, of which 10 have the exact column names of the CSV file

I'm using Python to load a CSV file that has 10 columns into an existing BigQuery table that has 103 columns. Write mode is append. The 10 columns of the CSV file have exactly the same names as 10 of the columns of the existing 103-column BigQuery table. I'm using autodetect=True in job_config for schema auto-detection.
What I want is for this Python script to write the data into the corresponding columns of the existing BQ table. However, I'm receiving the error "Provided Schema does not match Table".
Is there any way to do this?
Thanks
Normally, you could try the jagged rows option to allow skipping the trailing columns.
That also means that the 10 columns you have in your file MUST be the first 10 of your table.
And of course, I think that's not the case.
A usual pattern is to perform this process (a rough Python sketch follows the list):
Upload your file into a temporary table
Run a query to merge the data from the temporary table into the target table (expert tip: if the columns are never the same, you can write a stored procedure that uses the INFORMATION_SCHEMA and builds the query dynamically, then use EXECUTE IMMEDIATE to run the merge query)
Delete the temporary table
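Here is that sketch, using the google-cloud-bigquery client. The project/dataset/table names, the id join key and the col1/col2 columns are placeholders; in practice the MERGE column list would be your 10 shared columns.

from google.cloud import bigquery

client = bigquery.Client()
temp_table = "my-project.my_dataset.csv_stage"      # placeholder names
target_table = "my-project.my_dataset.big_table"

# 1) load the 10-column CSV into a temporary table
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,
    autodetect=True,
    write_disposition="WRITE_TRUNCATE",
)
with open("data.csv", "rb") as f:
    client.load_table_from_file(f, temp_table, job_config=job_config).result()

# 2) merge the staged rows into the 103-column target table
merge_sql = f"""
MERGE `{target_table}` T
USING `{temp_table}` S
ON T.id = S.id                      -- hypothetical join key
WHEN MATCHED THEN UPDATE SET T.col1 = S.col1, T.col2 = S.col2
WHEN NOT MATCHED THEN INSERT (id, col1, col2) VALUES (S.id, S.col1, S.col2)
"""
client.query(merge_sql).result()

# 3) drop the temporary table
client.delete_table(temp_table, not_found_ok=True)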

Get data as csv from a very large MySQL dump file

I have a MySQL dump file in .sql format. Its size is around 100 GB. There are just two tables in it. I have to extract data from this file using Python or Bash. The issue is that the INSERT statement contains all the data and that line is extremely long, so the normal approach causes memory issues, because that line (i.e., all the data) gets loaded in the loop as well.
Is there any efficient way or tool to get data as CSV?
Just a little explanation: the following line contains the actual data and it is of very large size.
INSERT INTO `tblEmployee` VALUES (1,'Nirali','Upadhyay',NULL,NULL,9,'2021-02-08'),(2,'Nirali','Upadhyay',NULL,NULL,9,'2021-02-08'),(3,'Nirali','Upadhyay',NULL,NULL,9,'2021-02-08'),....
The issue is that I cannot import it into MySQL due to resource issues.
I'm not sure if this is what you want, but pandas has a function to turn a SQL table into a CSV. Try this:
import pandas as pd
import sqlite3

connect = sqlite3.connect("connections.db")
cursor = connect.cursor()

# read the sqlite table into a DataFrame
dataframe = pd.read_sql('SELECT * FROM table', connect)

# write the DataFrame to a CSV file
dataframe.to_csv("filename.csv", index=False)

connect.commit()
connect.close()
If you want to change the delimiter, you can do dataframe.to_csv("filename.csv", index=False, sep='3') and just change the '3' to your delimiter of choice.
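If the dump cannot be imported into MySQL at all, another option is to stream the dump file itself and cut the huge INSERT ... VALUES line into tuples without ever holding it in memory. A very rough sketch, assuming a single INSERT statement per table and that the sequences '),(' and ');' never occur inside quoted string data (quoting/escaping edge cases are not handled):

import csv

def values_tuples(path, table, chunk_size=16 * 1024 * 1024):
    # Stream the VALUES tuples of one table out of a huge dump file,
    # reading fixed-size chunks so the giant INSERT line never sits in memory whole.
    marker = f"INSERT INTO `{table}` VALUES ("
    buf, inside, done = "", False, False
    with open(path, encoding="utf-8", errors="replace") as fh:
        while not done:
            chunk = fh.read(chunk_size)
            if not chunk:
                break
            buf += chunk
            if not inside:
                pos = buf.find(marker)
                if pos == -1:
                    buf = buf[-len(marker):]   # the marker may straddle a chunk boundary
                    continue
                buf, inside = buf[pos + len(marker):], True
            end = buf.find(");")
            if end != -1:                      # the statement ends in this chunk
                buf, done = buf[:end], True
            parts = buf.split("),(")
            if not done:
                buf = parts.pop()              # the last tuple may be incomplete
            for part in parts:
                yield part

with open("tblEmployee.csv", "w", newline="") as out:
    writer = csv.writer(out)
    for raw in values_tuples("dump.sql", "tblEmployee"):
        # raw looks like: 1,'Nirali','Upadhyay',NULL,NULL,9,'2021-02-08'
        writer.writerow(next(csv.reader([raw], quotechar="'")))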

What is the fastest way to save a pandas dataframe in a MySQL Database

I am writing code in Python to generate and update a MySQL table based on another MySQL table from another database.
My code does something like this:
For each date in a date_range:
1. Query a quantity in db1 between 2 dates
2. Do some work in pandas => df
3. Delete from db2 the rows with the ids that are in df
4. Save df with df.to_sql
Operations 1-2 take less than 2 s, while 3-4 can take up to 10 s. Step 4 takes about 4 times longer than step 3. How can I improve my code to make the writing process more efficient?
I have already chunked the df for steps 3 and 4. I have added method='multi' in .to_sql (this did not work at all). I was wondering if we could do better:
with db.begin() as con:
    # chunks() and to_tuple() are small helper functions of mine
    for chunked in chunks(df.id.tolist(), 1000):
        _ = con.execute("""DELETE FROM table WHERE id IN {}""".format(to_tuple(chunked)))
    for chunked in chunks(df.id.tolist(), 100000):
        df.query("id in @chunked").to_sql('table', con, index=False, if_exists='append')
Thanks for your help.
I have found df.to_sql to be very slow. One way that I've gotten around this issue is by outputting the DataFrame into a CSV file with df.to_csv, using BCP to bulk insert the data from the CSV into the table, and then deleting the CSV file once the insertion is done. You can use subprocess to run BCP in a Python script.
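A minimal sketch of that CSV-then-bulk-load pattern with bcp driven from subprocess (bcp is SQL Server's bulk loader; for MySQL, LOAD DATA LOCAL INFILE plays the same role). The server, database, table and credentials below are placeholders:

import os
import subprocess
import pandas as pd

df = pd.DataFrame({"id": [1, 2], "name": ["a", "b"]})   # stand-in DataFrame

csv_path = "staging.csv"
df.to_csv(csv_path, index=False, header=False)          # bcp expects raw data rows

subprocess.run(
    ["bcp", "mydb.dbo.mytable", "in", csv_path,
     "-S", "myserver", "-U", "myuser", "-P", "mypassword",
     "-c", "-t,"],                                       # character mode, comma-separated
    check=True,
)

os.remove(csv_path)   # delete the temporary CSV once the insert is done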

Update data to MySQL if a row does not exist, using Python

Context:
I have a table in a MySQL database which has the following format. Every row is one day of stock price and volume data:
Ticker,Date/Time,Open,High,Low,Close,Volume
AAA,7/15/2010,19.581,20.347,18.429,18.698,174100
AAA,7/16/2010,19.002,19.002,17.855,17.855,109200
BBB,7/19/2010,19.002,19.002,17.777,17.777,104900
BBB,7/19/2010,19.002,19.002,17.777,17.777,104900
CCC,7/19/2010,19.002,19.002,17.777,17.777,104900
....100000 rows
This table is created by importing the data from multiple *.txt files with the same columns and format. Each *.txt file name is the same as the ticker name in the Ticker column, i.e., importing AAA.txt gets me the 2 rows of AAA data.
All these *.txt files are generated automatically by a system that retrieves stock prices in my country. Every day, after the stock market closes, each .txt file gets one new row with the data of the new day.
Question: every day, how can I load the new row from each .txt file into the database? I do not want to load all the data from the .txt files into the MySQL table every day because it takes a lot of time; I only want to load the new rows.
How should I write the code to do this updating job?
(1) Create/use an empty stage table, with no primary key:
create table db.temporary_stage (
    ... same columns as your original table, but no constraints, keys or indexes ...
)
(2) Load the file into the stage table; this should be really fast:
LOAD DATA INFILE 'data.txt' INTO TABLE db.temporary_stage;
(3) Join on id, then use a hash function to eliminate all rows that haven't changed. The following can be made better, but all in all, using bulk loads against databases is a lot faster when you have lots of rows, and that's mostly down to how the database moves data around internally: it can do upkeep much more efficiently all at once than a little at a time.
UPDATE mytable
JOIN temporary_stage AS new_data ON mytable.id = new_data.id
SET mytable.... = new_data....,
    mytable.precomputed_hash = hash(concat( .... ))
WHERE mytable.precomputed_hash != hash(concat( .... ));
# clean up
DELETE FROM temporary_stage;
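A hypothetical Python driver for this staged-load approach using pymysql. The host, credentials, file, table and column names are placeholders, and this variant uses INSERT IGNORE instead of the hash comparison, so it assumes a unique key on (Ticker, `Date/Time`) and simply skips rows that are already present.

import pymysql

# local_infile must also be enabled on the MySQL server for LOAD DATA LOCAL INFILE
conn = pymysql.connect(host="localhost", user="user", password="password",
                       database="db", local_infile=True)
try:
    with conn.cursor() as cur:
        cur.execute("DELETE FROM temporary_stage")
        # IGNORE 1 LINES assumes the .txt file has a header row
        cur.execute(
            "LOAD DATA LOCAL INFILE 'AAA.txt' INTO TABLE temporary_stage "
            "FIELDS TERMINATED BY ',' IGNORE 1 LINES"
        )
        # insert only rows that are not already in the target table
        cur.execute(
            "INSERT IGNORE INTO mytable (Ticker, `Date/Time`, Open, High, Low, Close, Volume) "
            "SELECT Ticker, `Date/Time`, Open, High, Low, Close, Volume FROM temporary_stage"
        )
    conn.commit()
finally:
    conn.close()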
