Hi, I'm recording live stock data to a DB (sqlite3) and, by mistake, unwanted data got into my DB.
For example,
date        name        price
20220107    A_company   10000
20220107    A_company    9000
20220107    B_company     500
20220107    B_company     400
20220107    B_company     200
In this table, rows 1 and 2, and rows 3, 4, and 5, are the same in [date, name] but different in [price].
I want to save only the 'first' of such rows.
date        name        price
20220107    A_company   10000
20220107    B_company     500
What I have done before is to read the whole DB into Python and use the pandas drop_duplicates function.
import pandas as pd
import sqlite3

conn = sqlite3.connect("TRrecord.db")
query = pd.read_sql_query("SELECT * FROM TR_INFO", conn)  # already returns a DataFrame
df = pd.DataFrame(query)
df.drop_duplicates(inplace=True, subset=['date', 'name'], ignore_index=True, keep='first')
However, as the DB grows larger, I think this method won't be efficient in the long run.
How can I do this efficiently using SQL?
There is no implicit 'first' concept in SQL: the database manager can store the records in any order, so the order has to be specified in SQL. If it is not specified (by ORDER BY), the order is determined by the database manager (SQLite in your case) and is not guaranteed; the same data and the same query can return your rows in a different order at different times or on different installations.
Having said that, if you are OK with deleting any duplicates and retaining just one, you can use the rowid in SQLite for ordering:
delete from MyTbl
where exists (select 1
              from MyTbl b
              where MyTbl.date = b.date
                and MyTbl.name = b.name
                and MyTbl.rowid > b.rowid);
This would delete, from your table, any row for which there is another with a smaller rowid (but the same date and name).
If, by 'first', you meant to keep the record that was inserted first, then you need a column to indicate when the record was inserted (an insert_date_time, or an autoincrementing number column, etc.), and use that, instead of rowid.
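For illustration, a sketch of that variant, assuming a hypothetical inserted_at column that is populated at insert time (e.g. with DEFAULT CURRENT_TIMESTAMP):

-- Sketch only: keeps, for each (date, name), the row with the earliest inserted_at.
delete from MyTbl
where exists (select 1
              from MyTbl b
              where MyTbl.date = b.date
                and MyTbl.name = b.name
                and MyTbl.inserted_at > b.inserted_at);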
In the following example, t is an increasing sequence going roughly from 0 to 5,000,000 over 1 million rows.
import sqlite3, random, time

t = 0
db = sqlite3.connect(':memory:')
db.execute("CREATE TABLE IF NOT EXISTS data(id INTEGER PRIMARY KEY, t INTEGER, label TEXT);")
for i in range(1000*1000):
    t += random.randint(0, 10)
    db.execute("INSERT INTO data(t, label) VALUES (?, ?)", (t, 'hello'))
Selecting a range (let's say t = 1,000,000 ... 2,000,000) with an index:
db.execute("CREATE INDEX t_index ON data(t);")
start = time.time()
print(list(db.execute(f"SELECT COUNT(id) FROM data WHERE t BETWEEN 1000000 AND 2000000")))
print("index: %.1f ms" % ((time.time()-start)*1000)) # index: 15.0 ms
is 4-5 times faster than doing it without an index:
db.execute("DROP INDEX IF EXISTS t_index;")
start = time.time()
print(list(db.execute(f"SELECT COUNT(id) FROM data WHERE t BETWEEN 1000000 AND 2000000")))
print("no index: %.1f ms" % ((time.time()-start)*1000)) # no index: 73.0 ms
but the database size is at least 30% bigger with an index.
Question: in general, I understand how indexes massively speed up queries, but in such a case where t is integer + increasing, why is an index even needed to speed up the query?
After all, we only need to find the row for which t=1,000,000 (this is possible in O(log n) since the sequence is increasing), find the row for which t=2,000,000, and then we have the range.
TL;DR: when a column is an increasing integer sequence, is there a way to get a fast range query without increasing the database size by ~30% with an index?
For example, by setting a parameter when creating the table, informing SQLite that the column is increasing / already sorted?
The short answer is that SQLite doesn't know that your table is sorted by column t. This means it has to scan through the whole table to extract the data.
When you add an index on the column, the column t is sorted in the index, so it can skip the first million rows and then stream the rows of interest to you. You are extracting 20% of the rows, and it returns in 15 ms / 73 ms = 21% of the time. If that fraction is smaller, the benefit you derive from the index is larger.
If the column t is unique, then consider using that column as the primary key, as you would get the index for "free". If you can bound the number of rows with the same t, then you could use (t, offset) as the primary key, where offset might be a tinyint. The point is that size(primary key index) + size(t index) would be larger than size(t + offset index). If t were in ms or ns instead of s, it might be unique in practice, or you could fiddle with it when it is not (and just truncate to second resolution when you need the data).
If you don't need a primary key (as a unique index), leave it out and just have the non-unique index on t. Without a primary key you can identify a unique row by its rowid, or if all the columns collectively form a unique row. If you create the table WITHOUT ROWID, you could still use LIMIT to operate on identical rows.
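Combining the (t, offset) primary key with WITHOUT ROWID storage, a minimal sketch (the seq column plays the role of the offset tie-breaker and is not in the original schema):

-- Sketch: the table itself is ordered by (t, seq), so range queries such as
-- WHERE t BETWEEN 1000000 AND 2000000 use the primary-key B-tree directly,
-- with no separate index on t.
CREATE TABLE data(
    t     INTEGER NOT NULL,
    seq   INTEGER NOT NULL DEFAULT 0,  -- distinguishes rows with equal t
    label TEXT,
    PRIMARY KEY (t, seq)
) WITHOUT ROWID;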
You could use data warehouse techniques: if you don't need per-record data, store it in a less granular fashion (one record per minute, or per hour, and GROUP_CONCAT the text column).
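A sketch of that aggregation on the example table, assuming t represents seconds:

-- One row per minute instead of one row per record.
SELECT t / 60 AS minute,
       COUNT(*)            AS n_rows,
       GROUP_CONCAT(label) AS labels
FROM data
GROUP BY t / 60;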
Finally, there are databases that are optimized for time-series data. They may, for instance, only allow you to remove the oldest data or append new data, but not make any changes. This would allow such a system to store the data pre-sorted (MySQL, by the way, calls this feature an index-ordered table). As the data cannot change, such a database may run-length or delta compress the data by column so that it only stores the differences between rows.
I have a table which has 27 million ids.
I plan to update the average and count from another table, which is taking very long to complete.
Below is the update query (database: MySQL; I am using Python to connect to the database).
UPDATE dna_statistics
SET chorus_count =
    (SELECT count(*)
     FROM dna B
     WHERE B.music_id = <music_id>
       AND B.label = 'Chorus')
WHERE music_id = 916094
As scaisEdge already said, you need to check if there are indices on the two tables.
I would like to add to scaisEdge's answer that the order of the columns in the composite index should match the order in which you compare them.
You used
WHERE B.music_id = <music_id>
AND B.label = 'Chorus')
So your index should consist of the columns in order (music_id, label) and not (label, music_id).
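For example (the index name is illustrative):

CREATE INDEX idx_dna_music_label ON dna (music_id, label);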
I would have added this as comment, but I'm still 1 reputation point away from commenting.
An UPDATE statement isn't a good solution for 27 million ids.
Use EXCHANGE PARTITION instead:
https://dev.mysql.com/doc/refman/5.7/en/partitioning-management-exchange.html
Be sure you have a composite index on
table dna, columns (label, music_id)
and an index on
table dna_statistics, column (music_id).
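Something like this (index names are illustrative):

CREATE INDEX idx_dna_label_music ON dna (label, music_id);
CREATE INDEX idx_stats_music ON dna_statistics (music_id);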
Suppose we have 2 rows with dates, and I have to compare the amount with the previous date's amount and put that value in another row in DynamoDB. How can I do this?
TimePeriod    LinkedAccount    Amount         Estimated    Unit
2018-07-04    711035872***     0.7715992257   True         USD
2018-07-05    7110358*****     0.7715549731   True         USD
DynamoDB is a NoSQL database. There is no ability to write queries that compare one row with another row.
Instead, your application should retrieve all relevant rows and then make the comparison.
Or, retrieve one row, figure out the desired values, then call DynamoDB again with parameters to retrieve any matching row (eg for the previous date).
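A hedged sketch of the first approach with boto3, assuming a hypothetical table keyed on LinkedAccount (partition key) and TimePeriod (sort key); the table name and account value are placeholders, not from the question:

# Sketch only: key schema and names are assumptions.
import boto3
from boto3.dynamodb.conditions import Key

dynamodb = boto3.resource('dynamodb')
table = dynamodb.Table('billing')  # hypothetical table name

# Retrieve the relevant items for one account, then compare them in the application.
resp = table.query(
    KeyConditionExpression=Key('LinkedAccount').eq('111111111111')  # placeholder account id
)
items = sorted(resp['Items'], key=lambda item: item['TimePeriod'])

for prev, curr in zip(items, items[1:]):
    diff = curr['Amount'] - prev['Amount']  # DynamoDB numbers come back as Decimal
    # write diff back, e.g. with table.update_item(...), if it should be stored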
I have a list of datetime data in python. I have a table in a SQL database that has two columns, an empty Date_Time column and an ID column. The ID column simply increases as 1,2,3 etc. The number of table rows is the same as the length of the datetime list. I insert the list of datetime data into the table fairly simply through the following code:
def populate_datetime(table, datetimes):
    sql = cnxn.cursor()
    for i in range(1, len(datetimes)+1):
        query = '''
        update {0}
        set Date_Time = '{1}'
        where ID = {2};'''.format(table, datetimes[i-1], i)
        sql.execute(query)
table is the name of the table in the sql database, and datetimes is the list of datetime data.
This code works perfectly, but the data is lengthy, approximately 800,000 datetimes long. As a result, the code takes approx. 16 minutes to run on average. Any advice on how to reduce the run time?
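One way to cut the per-row round trips, sketched under the assumption that cnxn is a DB-API connection whose driver uses ? placeholders (e.g. pyodbc), is to send all the updates in a single parameterized executemany call:

def populate_datetime_bulk(table, datetimes):
    # Sketch: same effect as the loop above, but one executemany call
    # instead of ~800,000 individual execute calls.
    cur = cnxn.cursor()
    # cur.fast_executemany = True  # pyodbc only; speeds up executemany further
    params = [(dt, i + 1) for i, dt in enumerate(datetimes)]
    cur.executemany(
        "update {0} set Date_Time = ? where ID = ?;".format(table),  # table name still interpolated
        params,
    )
    cnxn.commit()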
I am taking 30 days of historical data and modifying it.
Hopefully, I can read the historical data and have it refer to a dynamic rolling date of 30 days. The 'DateTime' value is a column in the raw data.
import pandas as pd

df_new = df = pd.read_csv(loc + filename)
max_date = df_new['DateTime'].max()
date_range = max_date - pd.Timedelta(30, unit='d')
df_old = pd.read_hdf(loc + filename, 'TableName', where=['DateTime > date_range'])
Then I would read the new data which is a separate file which is always Month to Date values (all of June for example, this file is replaced daily with the latest data), and concat them to the old dataframe.
frames = [df_old, df_new]
df = pd.concat(frames)
Then I do some things to the file (I am checking if certain values repeat within a 30 day window, if they do then I place a timestamp in a column).
Now I would want to add this modified data back into my original file (it was HDF5 but it could be a .sqlite file too) called df_old. For sure, there are a ton of duplicates since I am reading the previous 30 days data and the MTD data. How do I manage this?
My only solution is to read the entire file (df_old along with the new data I added) and then drop duplicates and then overwrite it again. This isn't very efficient.
Can .sqlite or .hdf formats enforce non-duplicates? If so then I have 3 columns which identify a unique value (Date, EmpID, CustomerID). I do not want exact duplicate rows.
Define them as a composite primary key in SQLite. It won't allow you to insert rows with a non-unique set of primary key values.
e.g.
CREATE TABLE mytable (
    a INT,
    b INT,
    c INT,
    PRIMARY KEY (a, b)
);
won't allow duplicates of (a, b) to be added to the data. Then use
INSERT OR IGNORE to add data, and any duplicates will be ignored.
http://sqlite.org/lang_insert.html
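For example, a sketch with sqlite3 and the asker's three identifying columns (the table, file, and value column names are illustrative):

import sqlite3

conn = sqlite3.connect("records.db")  # hypothetical file name
conn.execute("""
    CREATE TABLE IF NOT EXISTS records (
        Date       TEXT,
        EmpID      INTEGER,
        CustomerID INTEGER,
        value      TEXT,
        PRIMARY KEY (Date, EmpID, CustomerID)
    );
""")

rows = [
    ("2018-06-01", 1, 100, "x"),
    ("2018-06-01", 1, 100, "y"),  # duplicate (Date, EmpID, CustomerID)
]
conn.executemany("INSERT OR IGNORE INTO records VALUES (?, ?, ?, ?);", rows)
conn.commit()
# Only the first row is stored; the duplicate key is silently ignored.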