Hi, I'm recording live stock data to a DB (sqlite3) and, by mistake, unwanted data got into my DB. For example:
date        name        price
20220107    A_company   10000
20220107    A_company   9000
20220107    B_company   500
20220107    B_company   400
20220107    B_company   200
In this table, rows 1-2 and rows 3-5 have the same [date, name] but different [price]. I want to keep only the 'first' row of each such group:
date        name        price
20220107    A_company   10000
20220107    B_company   500
What I have done before is read the whole DB into Python and use pandas' drop_duplicates function:
import pandas as pd
import sqlite3

conn = sqlite3.connect("TRrecord.db")
# read_sql_query already returns a DataFrame, so no extra wrapping is needed.
df = pd.read_sql_query("SELECT * FROM TR_INFO", conn)
df.drop_duplicates(inplace=True, subset=['date', 'name'], ignore_index=True, keep='first')
However, as the DB grows larger, I don't think this method will be efficient in the long run. How can I do this efficiently using SQL?
There is no implicit 'first' concept in SQL: the database manager can store records in any order, so the order has to be specified in SQL. If it is not specified (by ORDER BY), the order is determined by the database manager (SQLite in your case) and is not guaranteed; the same data and the same query can return your rows in a different order at different times or on different installations.
Having said that, if you are OK with deleting the duplicates and retaining just one row of each group, you can use SQLite's rowid for ordering:
delete from MyTbl
where exists (select 1
              from MyTbl b
              where MyTbl.date = b.date
                and MyTbl.name = b.name
                and MyTbl.rowid > b.rowid);
This would delete, from your table, any row for which there is another with a smaller rowid (but the same date and name).
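For reference, a minimal sketch of running this cleanup from Python, assuming the table and file names from the pandas snippet above (TR_INFO in TRrecord.db):

import sqlite3

conn = sqlite3.connect("TRrecord.db")
cur = conn.cursor()
# Remove every row that has a twin with the same (date, name) but a
# smaller rowid, keeping only the lowest-rowid row of each group.
cur.execute("""
    DELETE FROM TR_INFO
    WHERE EXISTS (SELECT 1
                  FROM TR_INFO b
                  WHERE TR_INFO.date = b.date
                    AND TR_INFO.name = b.name
                    AND TR_INFO.rowid > b.rowid)
""")
print(f"Deleted {cur.rowcount} duplicate rows")
conn.commit()
conn.close()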
If, by 'first', you meant the record that was inserted first, then you need a column that indicates when the record was inserted (an insert_date_time, an autoincrementing number column, etc.) and use that instead of rowid, as in the sketch below.
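A sketch of that variant; insert_ts here is a hypothetical column name standing in for whatever your schema actually records:

import sqlite3

conn = sqlite3.connect("TRrecord.db")
# Same pattern as above, but ordered by insertion time instead of rowid.
# insert_ts is a hypothetical column; substitute your actual column name.
conn.execute("""
    DELETE FROM TR_INFO
    WHERE EXISTS (SELECT 1
                  FROM TR_INFO b
                  WHERE TR_INFO.date = b.date
                    AND TR_INFO.name = b.name
                    AND TR_INFO.insert_ts > b.insert_ts)
""")
conn.commit()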
I have a stock database of the following form, where data is given by the minute.
I want to fetch all rows whose (ticker, date) combination has at least N rows when grouped by ticker and date.
The purpose is to ignore missing data for a given ticker on a given date: there should be at least 629 rows for any given (ticker, date) combination, and I want to ignore any combination with fewer.
I tried something like the following, but it isn't working:

start_date = '2021-11-08'
SQL = """
    SELECT datetime, datetime_utc, ticker, open, close, session_type, high, low, volume
    FROM candles
    WHERE id IN (SELECT id
                 FROM candles
                 WHERE session_type = 'Live'
                 GROUP BY datetime, ticker
                 HAVING COUNT(id) >= 629)
      AND datetime >= '{} 00:00:00'::timestamp
    ORDER BY ticker, datetime_utc ASC
""".format(start_date)
Could anyone point me at the right SQL incantation? I am using postgres > 9.6.
Sample data (apologies, I tried to set up an SQL fiddle but it wasn't allowing me to, due to character limitations and then data type limitations; the data is below in JSON, set up to be read by pandas):
df = pd.read_json('{"datetime":{"12372":1636572720000,"12351":1636571460000,"12493":1636590240000,"15210":1636633380000,"16212":1636642500000,"16009":1636560060000,"12213":1636554180000,"16386":1636657800000,"13580":1636572660000,"14733":1636659000000,"12086":1636537200000,"14802":1636667880000,"14407":1636585980000,"14356":1636577640000,"16086":1636574280000,"15437":1636661040000,"14115":1636577400000,"14331":1636576140000,"12457":1636582800000,"14871":1636678080000},"datetime_utc":{"12372":1636572720000,"12351":1636571460000,"12493":1636590240000,"15210":1636633380000,"16212":1636642500000,"16009":1636560060000,"12213":1636554180000,"16386":1636657800000,"13580":1636572660000,"14733":1636659000000,"12086":1636537200000,"14802":1636667880000,"14407":1636585980000,"14356":1636577640000,"16086":1636574280000,"15437":1636661040000,"14115":1636577400000,"14331":1636576140000,"12457":1636582800000,"14871":1636678080000},"ticker":{"12372":"AAPL","12351":"AAPL","12493":"AAPL","15210":"AAPL","16212":"NET","16009":"NET","12213":"AAPL","16386":"NET","13580":"NET","14733":"AAPL","12086":"AAPL","14802":"AAPL","14407":"AAPL","14356":"AAPL","16086":"NET","15437":"AAPL","14115":"NET","14331":"AAPL","12457":"AAPL","14871":"AAPL"},"open":{"12372":148.29,"12351":148.44,"12493":148.1,"15210":148.68,"16212":199.72,"16009":202.27,"12213":150.21,"16386":198.65,"13580":194.9,"14733":147.94,"12086":150.2,"14802":147.87,"14407":148.1,"14356":148.09,"16086":193.82,"15437":148.01,"14115":194.64,"14331":148.07,"12457":148.12,"14871":148.2},"close":{"12372":148.32,"12351":148.44,"12493":148.15,"15210":148.69,"16212":199.32,"16009":202.52,"12213":150.25,"16386":198.57,"13580":194.96,"14733":147.99,"12086":150.17,"14802":147.9,"14407":148.1,"14356":147.99,"16086":194.43,"15437":148.01,"14115":194.78,"14331":148.05,"12457":148.11,"14871":148.28},"session_type":{"12372":"Live","12351":"Live","12493":"Post","15210":"Pre","16212":"Live","16009":"Live","12213":"Pre","16386":"Live","13580":"Live","14733":"Live","12086":"Pre","14802":"Post","14407":"Post","14356":"Live","16086":"Live","15437":"Live","14115":"Live","14331":"Live","12457":"Post","14871":"Post"},"high":{"12372":148.3600006104,"12351":148.4700012207,"12493":148.15,"15210":148.69,"16212":199.8399963379,"16009":202.5249938965,"12213":150.25,"16386":198.6499938965,"13580":195.0299987793,"14733":147.9900054932,"12086":150.2,"14802":147.9,"14407":148.1,"14356":148.1049957275,"16086":194.4400024414,"15437":148.0399932861,"14115":195.1699981689,"14331":148.0800018311,"12457":148.12,"14871":148.28},"low":{"12372":148.26,"12351":148.38,"12493":148.06,"15210":148.68,"16212":199.15,"16009":202.27,"12213":150.2,"16386":198.49,"13580":194.79,"14733":147.93,"12086":150.16,"14802":147.85,"14407":148.1,"14356":147.98,"16086":193.82,"15437":148.0,"14115":194.64,"14331":148.01,"12457":148.07,"14871":148.2},"volume":{"12372":99551.0,"12351":68985.0,"12493":0.0,"15210":0.0,"16212":9016.0,"16009":2974.0,"12213":0.0,"16386":1395.0,"13580":5943.0,"14733":59854.0,"12086":0.0,"14802":0.0,"14407":0.0,"14356":341196.0,"16086":8715.0,"15437":45495.0,"14115":16535.0,"14331":173785.0,"12457":0.0,"14871":0.0},"date":{"12372":1636502400000,"12351":1636502400000,"12493":1636502400000,"15210":1636588800000,"16212":1636588800000,"16009":1636502400000,"12213":1636502400000,"16386":1636588800000,"13580":1636502400000,"14733":1636588800000,"12086":1636502400000,"14802":1636588800000,"14407":1636502400000,"14356":1636502400000,"16086":1636502400000,"15437":1636588800000,"14115":1636502400000,"14331":1636502400000,"12457":1636502400000,"14871":1636588800000}}')
Expected result: if N = 4, then the results will exclude NET on 2021-11-11, since that (ticker, date) combination has only 2 rows in the sample, while every other combination has at least 4.
You can use an analytic function (window function) to count the rows for each date and ticker combination:
SELECT datetime, datetime_utc, ticker, open, close, session_type, high, low, volume
FROM (
    SELECT datetime, datetime_utc, ticker, open, close, session_type, high, low, volume,
           COUNT(*) OVER (PARTITION BY CAST(datetime AS DATE), ticker) AS c
    FROM candles
) t
WHERE c >= 4  -- N = 4; use 629 for the real threshold
I am not familiar with Postgres, so this may still contain syntax problems; note that Postgres requires an alias on the derived table (t above).
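Since the sample data is already in a DataFrame, the same idea can be sanity-checked in pandas, where groupby(...).transform('size') plays the role of COUNT(*) OVER (PARTITION BY ...). A minimal sketch, assuming the df loaded from the JSON above:

N = 4
# Per-row group size: the pandas analogue of
# COUNT(*) OVER (PARTITION BY date, ticker).
group_size = df.groupby(['date', 'ticker'])['ticker'].transform('size')
df_select = df[group_size >= N]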
You can use pandasql with a nested GROUP BY to count the daily data points for all tickers and then join back to the original frame:
import pandasql as ps

N = 4
# Count rows per (date, ticker), keep combinations with at least N rows,
# then join back to df to recover the full rows.
df_select = ps.sqldf(f"""
    SELECT df.*
    FROM (SELECT COUNT(*) AS count, ticker, date
          FROM df
          GROUP BY date, ticker
          HAVING count >= {N}) df_counts
    LEFT JOIN df
      ON df.ticker = df_counts.ticker
     AND df.date = df_counts.date
""")
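Note that pandasql runs the query against an in-memory SQLite copy of df, so this filters the sample DataFrame rather than the Postgres table itself; and since df_counts is derived from df, every row finds a match, so the left join behaves like an inner join here.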