Context:
I have a table in a MySQL database with the following format. Every row is one day of stock price and volume data:
Ticker,Date/Time,Open,High,Low,Close,Volume
AAA,7/15/2010,19.581,20.347,18.429,18.698,174100
AAA,7/16/2010,19.002,19.002,17.855,17.855,109200
BBB,7/19/2010,19.002,19.002,17.777,17.777,104900
BBB,7/19/2010,19.002,19.002,17.777,17.777,104900
CCC,7/19/2010,19.002,19.002,17.777,17.777,104900
....100000 rows
This table was created by importing data from multiple *.txt files with the same columns and format. Each *.txt file is named after the ticker in the Ticker column: i.e., importing AAA.txt gives me the two rows of AAA data.
All these *.txt files are generated automatically by a system that retrieves stock prices in my country. Every day, after the stock market closes, each .txt file gains one new row with that day's data.
Question: how can I load each day's new row from each .txt file into the database? I do not want to reload all the data from the .txt files into the MySQL table every day because that takes a lot of time; I only want to load the new rows.
How should I write the code to do this update?
(1) Create/use an empty staging table, no primary key:
create table db.temporary_stage (
... same columns as your original table, but no constraints or keys or an index ...
)
(2) # this should be really fast
LOAD DATA INFILE 'data.txt' INTO TABLE db.temporary_stage;
(3) Join on id, then use a hash function to eliminate all rows that haven't changed. The following can be made better, but all in all, bulk loads against a database are a lot faster when you have lots of rows, and that's mostly down to how the database moves data around internally: it can do its upkeep much more efficiently all at once than a little at a time.
UPDATE mytable
JOIN temporary_stage ON mytable.id = temporary_stage.id
SET
    mytable... = temporary_stage... ,
    mytable.precomputed_hash = hash(concat( .... ))
WHERE mytable.precomputed_hash != hash(concat( .... ));
# clean up
DELETE FROM temporary_stage;
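To run this every day for the "only new rows" case in the question, a minimal Python sketch could bulk-load each .txt file into the staging table and then insert only the rows that are not yet in the main table. This assumes mysql-connector-python, a main table mytable keyed by (ticker, dt), and the staging table above; names, columns, and paths are placeholders, not the answer's exact code:

import glob
import mysql.connector

conn = mysql.connector.connect(
    host="localhost", user="user", password="secret", database="db",
    allow_local_infile=True,  # local_infile must also be enabled on the server
)
cur = conn.cursor()

for path in glob.glob("/path/to/txt/*.txt"):  # AAA.txt, BBB.txt, ...
    cur.execute(
        f"LOAD DATA LOCAL INFILE '{path}' INTO TABLE temporary_stage "
        f"FIELDS TERMINATED BY ',' IGNORE 1 LINES"  # drop IGNORE if the files have no header row
    )

# Insert only the rows that do not exist yet in the main table
cur.execute("""
    INSERT INTO mytable (ticker, dt, `open`, high, low, `close`, volume)
    SELECT s.ticker, s.dt, s.`open`, s.high, s.low, s.`close`, s.volume
    FROM temporary_stage s
    LEFT JOIN mytable m ON m.ticker = s.ticker AND m.dt = s.dt
    WHERE m.ticker IS NULL
""")

cur.execute("DELETE FROM temporary_stage")  # clean up for tomorrow's run
conn.commit()
cur.close()
conn.close()

The same staging table can then feed the hash-based UPDATE above for rows that changed.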
Related
I have a directory with 80K CSV files and I need to somehow transform those files into another CSV format. For example, I need to change a column name in all 80K files, or change a value.
But the catch is that all these transformations have to happen in a short period of time and preferably in under five minutes.
I have already tried using an in-memory database like SQLite or DuckDB, where I:
load the csv file
insert it into a table
query the table with an sql update statement
export the table to a new csv file
drop the table
and I repeat this process 80K times, but it is too slow.
Here is the code for that:
import sqlite3
import pandas as pd

conn = sqlite3.connect(":memory:")  # in-memory database, as described above

for i in range(80_000):
    fileNum = i + 1
    # Load CSV data into a Pandas DataFrame
    data = pd.read_csv(f"generatedFiles/Generated-File-{fileNum}.csv")
    # Write the data to a SQLite table
    data.to_sql(f"table_{fileNum}", conn, if_exists='replace', index=False)
    # Apply the transformation, export the result, then drop the table
    conn.execute(f"UPDATE table_{fileNum} SET Name = 'TransformedName'")
    pd.read_sql_query(f"SELECT * FROM table_{fileNum}", conn).to_csv(f'exportedFiles-poc2.1/Transformed-File-{fileNum}.csv', index=False)
    conn.execute(f"DROP TABLE table_{fileNum}")
Can anyone help me come up with a solution to efficiently transform and update 80 to 100K CSV files in as short a time as possible?
I would like to retrieve CSV files from Google Cloud Storage (a bucket) and load the data from these files into a BigQuery table without ending up with duplicate data.
The goal is code that is optimal in both performance and cost.
My current code is the following:
from google.cloud import bigquery

def load_data_in_BQT(self):
    job_config = bigquery.LoadJobConfig(
        schema=[
            bigquery.SchemaField("id", "INTEGER"),
            bigquery.SchemaField("name", "STRING"),
        ],
        # The source format defaults to CSV, so the line below is optional.
        source_format=bigquery.SourceFormat.CSV,
        skip_leading_rows=1,
        autodetect=True,
        write_disposition=bigquery.WriteDisposition.WRITE_APPEND,  # appends the data (possible duplicates)
        # write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,  # overwrites the table with the new data (old data is lost)
    )
    uri = "gs://mybucket/myfolder/myfile.csv"
    load_job = self.client.load_table_from_uri(
        uri, self.table_ref["object"], job_config=job_config,
    )
    load_job.result()  # wait for the load job to finish
Currently my idea is to read the CSV file into pandas to get a DataFrame, also load the data of the BigQuery table and turn it into a DataFrame, process the whole data set to remove the duplicates, and at the end reinsert (with the truncate option) the whole cleaned data set. However, I find this method wasteful if we have a huge data set that has to be reloaded for each new input file in our bucket.
What could you suggest? Thank you in advance.
You can use a MERGE query with BigQuery:
https://cloud.google.com/bigquery/docs/reference/standard-sql/dml-syntax?hl=en#merge_statement
The idea behind it is:
Create a staging table with the same structure as the final table
Truncate the staging table (empty the table) before you execute your script
Execute your Python script to ingest data into the staging table
Add a merge query between the staging and final tables:
If the element doesn't exist in the final table, you can insert it; otherwise, you can update it.
Example of a merge query:
MERGE dataset.Inventory T
USING dataset.NewArrivals S
ON T.product = S.product
WHEN MATCHED THEN
UPDATE SET quantity = T.quantity + S.quantity
WHEN NOT MATCHED THEN
INSERT (product, quantity) VALUES(product, quantity)
An orchestrator like Airflow or Cloud Workflows can easily chain these steps, for example.
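For example, here is a rough Python sketch of that sequence using the BigQuery client; the project, dataset, and table names are placeholders, and the id/name columns are taken from the question's schema:

from google.cloud import bigquery

client = bigquery.Client()
staging = "myproject.mydataset.staging_table"   # placeholder names
final = "myproject.mydataset.final_table"
uri = "gs://mybucket/myfolder/myfile.csv"

# 1) Empty the staging table before each run
client.query(f"TRUNCATE TABLE `{staging}`").result()

# 2) Load the CSV from the bucket into the staging table
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,
    write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
)
client.load_table_from_uri(uri, staging, job_config=job_config).result()

# 3) Merge staging into the final table: update matches, insert the rest
merge_sql = f"""
MERGE `{final}` T
USING `{staging}` S
ON T.id = S.id
WHEN MATCHED THEN
  UPDATE SET name = S.name
WHEN NOT MATCHED THEN
  INSERT (id, name) VALUES (S.id, S.name)
"""
client.query(merge_sql).result()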
I want to update some existing data by uploading a CSV file.
I have some data in the MySQL database, and some of it has spelling mistakes and other errors.
So I have the correct data in a CSV file, and I want to upload it and update the existing data in the database. It will take an id and update the existing data.
Below is the code for importing data into the database. How can I modify this code to update the existing data in the database?
views.py
import csv

from django.http import HttpResponse

from .models import Country  # adjust the import path to your app


# Import the data to the database
def import_countries(request):
    with open('C:/python/Azuro/azuro_django/pms/templates/pms/countryname.csv') as csvfile:
        reader = csv.DictReader(csvfile)
        for row in reader:
            data = Country(currency=row['Currency'], name=row['Country'])
            data.save()
    return HttpResponse('Data Uploaded!!')
Update:
Is there any way to perform this task without bulk_update?
I want to read a CSV file and then check whether the CSV file's data is present in the database or not:
- If yes, iterate over the rows and update the existing data in the database by its id.
- If not, print a list of the ids that are present in the database and another list of the ids that are not present in the database.
Thanks!
There are several ways you could update a record in Django, one of which is update_or_create; you can also get the record and set the fields you want to update. The problem with those approaches is that the first does a query into the database for each record, and the latter does two queries for each record (one for the get and one for the update).
In your case, that is not good because you need to update several records. What you can do is use bulk_update to update several records in one query:
countries_to_update = []
for row in reader:  # reader is the csv.DictReader from your import view
    countries_to_update.append(
        Country(id=row["id"], currency=row["Currency"], name=row["Country"])
    )
Country.objects.bulk_update(countries_to_update, fields=["currency", "name"])
Take note that it will only update the fields you specify in the "fields" argument; in the case above, only "currency" and "name" will be updated.
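For the part in your update about knowing which ids already exist, a minimal sketch (assuming the CSV has an id column matching Country's primary key, as in the snippet above; the file path is a placeholder) could fetch the existing ids in one query, split the rows, and then bulk_update only the ones that exist:

import csv

from .models import Country  # adjust the import path to your app

with open("countryname.csv") as csvfile:  # placeholder path
    rows = list(csv.DictReader(csvfile))

csv_ids = [int(row["id"]) for row in rows]
# One query to find which of the CSV ids are already in the database
existing_ids = set(
    Country.objects.filter(id__in=csv_ids).values_list("id", flat=True)
)
missing_ids = [i for i in csv_ids if i not in existing_ids]

countries_to_update = [
    Country(id=int(row["id"]), currency=row["Currency"], name=row["Country"])
    for row in rows
    if int(row["id"]) in existing_ids
]
Country.objects.bulk_update(countries_to_update, fields=["currency", "name"])

print("Present in the database:", sorted(existing_ids))
print("Not present in the database:", missing_ids)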
I have 3000 CSV files stored on my hard drive, each containing thousands of rows and 10 columns. Rows correspond to dates, and the number of rows as well as the exact dates are different across spreadsheets. The columns for all the spreadsheets are the same in number (10) and label. For each date from the earliest date across all spreadsheets to the latest date across all spreadsheets, I need to (i) access the columns in each spreadsheet for which data for that date exists, (ii) run some calculations, and (iii) store the results (a set of 3 or 4 scalar values) for that date. To clarify, the results should be a variable in my workspace that stores the results for each date across all CSVs.
Is there a way to load this data using Python that is both time and memory efficient? I tried creating a Pandas data frame for each CSV, but loading all the data into RAM takes almost ten minutes and almost completely fills up my RAM. Is it possible to check if the date exists in a given CSV, and if so, load the columns corresponding to that CSV into a single data frame? This way, I could load just the rows that I need from each CSV to do my calculations.
Simple solution.
Go and download DB Browser for SQLite.
Open it and create a new database.
After that, go to File and import a table from CSV. (Do this for all of your CSV tables.) Alternatively, you can use a Python script and the sqlite3 library to be fast and automated when creating the tables and inserting values from your CSV sheets.
When you are done importing all the tables, play around with this function based on your details.
import sqlite3
import pandas as pd

connection_str = 'database.db'
data = pd.read_csv("my_CSV_file.csv")  # your CSV data path


def create_database():  # create the database with the table
    con = sqlite3.connect(connection_str)
    cur = con.cursor()
    cur.execute("CREATE TABLE IF NOT EXISTS my_CSV_data (id INTEGER PRIMARY KEY, name text, address text, mobile text, phone text, balance float, max_balance INTEGER)")
    con.commit()
    con.close()


def insert_into_company():  # insert the CSV rows into the table
    con = sqlite3.connect(connection_str)
    cur = con.cursor()
    for i in data.itertuples(index=False):  # iterate over rows, not column names
        cur.execute("INSERT INTO my_CSV_data VALUES(Null,?,?,?,?,?,?)", (i[0], i[1], i[2], i[3], i[4], i[5]))
    con.commit()
    con.close()


def select_company():  # view the data in the table
    con = sqlite3.connect(connection_str)
    cur = con.cursor()
    cur.execute("SELECT * FROM my_CSV_data")
    rows = cur.fetchall()
    con.close()
    return rows


create_database()
insert_into_company()
for j in select_company():
    print(j)
Do this once and you can use it again and again. It will enable you to access the data in less than a second. Ask me if you need any other help; I'll be happy to guide you through it.
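Once your real tables are in SQLite, the per-date lookups from the question become plain queries. A minimal sketch, assuming your imported tables keep a Date column stored as text (the example table above uses placeholder columns):

import sqlite3

con = sqlite3.connect('database.db')
cur = con.cursor()
# Fetch only the rows for one date, then run your calculations on them
cur.execute("SELECT * FROM my_CSV_data WHERE Date = ?", ("2010-07-15",))
rows_for_date = cur.fetchall()
con.close()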
I was recently tasked with automating the update process for some of my company's production data. Effectively, what I am trying to do is establish an automated process where every evening a script downloads all the updated .dbf files from an FTP server, calls a module in my Microsoft Access database that imports all the .dbf file tables into Access, and then, once these tables are loaded into Access, combines all the data from the separate tables into one big table.
So far, I have implemented a Python script to fetch the files and have an Access module to import the .dbf files once they're downloaded, but I am unsure whether there is a way to have all of these .dbf files import into a single table. Right now they import into about 253 separate tables that share identical formats and have a column containing unique integers for each row.
My main focus now is creating a script that will:
1) Truncate my complete_db table each time it runs
2) Iterate through each dbf file and append all rows to the complete_db table
3) Delete each individual dbf table after appending the rows to complete_db, to save space
Any tips on how to achieve this with Python would be greatly appreciated. Thanks!
Attached is an image of the module I'm using to import the files, as well as what my table view panel looks like after importing:
https://imgur.com/mhIjAa1
Since MS Access can directly query .dbf files, consider an append query into your master table. The query will take the following form; explicitly write out the columns to avoid indexes or autonumber fields:
SQL (hard-coded query)
INSERT INTO complete_db
SELECT * FROM [myFile.dbf] IN 'C:\Path\To\Folder'[dBASE IV;];
-- OR
INSERT INTO complete_db (col1, col2, col3, ...)
SELECT col1, col2, col3, ...
FROM [myFile.dbf] IN 'C:\Path\To\Folder'[dBASE IV;];
VBA (dynamic query)
...
sFolderPath = "C:\Path\To\Folder"
For Each oFile In oFolder.Files
    If Right(oFile.Name, 3) = "dbf" Then
        sql = "INSERT INTO complete_db SELECT * FROM [" & oFile.Name & "]" _
            & " IN '" & sFolderPath & "'[dBASE IV;]"
        CurrentDb.Execute sql
    End If
Next oFile
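If you would rather drive the whole thing from Python, as the question mentions, roughly the same append query can be executed through pyodbc and the Microsoft Access ODBC driver. This is only a sketch under those assumptions; the .accdb path and the dbf folder are placeholders:

import glob
import os
import pyodbc

# Hedged sketch: assumes the Microsoft Access ODBC driver is installed;
# the database path and the folder below are placeholders.
conn = pyodbc.connect(
    r"Driver={Microsoft Access Driver (*.mdb, *.accdb)};"
    r"DBQ=C:\Path\To\database.accdb;"
)
cur = conn.cursor()

folder = r"C:\Path\To\Folder"

# 1) empty the master table on each run
cur.execute("DELETE FROM complete_db")

# 2) append every .dbf file's rows directly into complete_db
for path in glob.glob(os.path.join(folder, "*.dbf")):
    name = os.path.basename(path)
    cur.execute(
        "INSERT INTO complete_db SELECT * FROM [" + name + "] "
        "IN '" + folder + "'[dBASE IV;]"
    )

conn.commit()
conn.close()

Because the .dbf files are queried in place, nothing is imported into Access as separate tables, so step 3 (deleting the individual tables) goes away.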