fastest way to insert list mysql - python

I have a list of datetime data in Python. I have a table in a SQL database that has two columns, an empty Date_Time column and an ID column. The ID column simply increases as 1, 2, 3, etc. The number of table rows is the same as the length of the datetime list. I insert the list of datetime data into the table fairly simply with the following code:
def populate_datetime(table, datetimes):
    sql = cnxn.cursor()
    for i in range(1, len(datetimes)+1):
        query = '''
        update {0}
        set Date_Time = '{1}'
        where ID = {2};'''.format(table, datetimes[i-1], i)
        sql.execute(query)
table is the name of the table in the SQL database, and datetimes is the list of datetime data.
This code works perfectly, but the data is lengthy: approximately 800,000 datetimes. As a result, the code takes roughly 16 minutes to run on average. Any advice on how to reduce the run time?
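One common way to cut the run time is to stop building and executing 800,000 separate statements and instead send the updates through a single parameterized executemany call, committing once at the end. A minimal sketch, assuming cnxn is a DB-API connection whose driver uses %s placeholders (e.g. mysql.connector; pyodbc would use ? instead) and that table comes from a trusted source:
def populate_datetime(table, datetimes):
    cur = cnxn.cursor()
    # (Date_Time, ID) pairs: datetimes[0] goes to ID 1, datetimes[1] to ID 2, ...
    params = [(dt, i) for i, dt in enumerate(datetimes, start=1)]
    query = "update {} set Date_Time = %s where ID = %s".format(table)
    cur.executemany(query, params)
    cnxn.commit()  # one commit instead of a round trip per row
If that is still too slow, the usual next step is to bulk-load the (ID, Date_Time) pairs into a temporary table and run a single UPDATE joined against it.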

Related

Query to find frequency of a column for a given timestamp column

I have a table with a time column in the format 2022-09-13 07:54:34.
So, I am trying to find the frequency of a column (test_language) according to this timestamp column.
For example: if test_lang has occurred 5 times for a given timestamp, it should show the output that way.
Right now I'm using this query:
select test_lang, count(test_lang) as Freq, time from testtable
where date = '2022-09-13' group by test_lang, time order by Freq desc
This query gives me Freq in descending order, as I want, but I'm not sure about the time column it shows: does it mean the language occurred that many times at that particular time, or not?
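For what it's worth, with GROUP BY test_lang, time each output row's Freq is the number of rows that share that exact (test_lang, time) pair, so the time shown does belong to that count. If the goal is instead one count per language for the whole day, drop time from the grouping. A sketch, assuming a DB-API connection named conn and the column names from the question:
cur = conn.cursor()
# Frequency of each test_lang across the whole day; time is no longer grouped on
cur.execute("""
    select test_lang, count(test_lang) as Freq
    from testtable
    where date = '2022-09-13'
    group by test_lang
    order by Freq desc
""")
for test_lang, freq in cur.fetchall():
    print(test_lang, freq)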

Removing duplicated rows by subset of columns (SQLite3)

Hi, I'm recording live stock data to a DB (sqlite3) and, by mistake, unwanted data got into my DB.
For example,
date        name        price
20220107    A_company   10000
20220107    A_company   9000
20220107    B_company   500
20220107    B_company   400
20220107    B_company   200
In this table, rows 1 and 2 (and likewise rows 3, 4, and 5) share the same [date, name] but have different [price] values.
I want to keep only the 'first' of such rows.
date        name        price
20220107    A_company   10000
20220107    B_company   500
What I have done before is read the whole DB into Python and use pandas' drop_duplicates function.
import pandas as pd
import sqlite3

conn = sqlite3.connect("TRrecord.db")
query = pd.read_sql_query("SELECT * FROM TR_INFO", conn)
df = pd.DataFrame(query)
df.drop_duplicates(inplace=True, subset=['date', 'name'], ignore_index=True, keep='first')
However, as the DB grows larger, I think this method won't be efficient in the long run.
How can I do this efficiently by using SQL?
There is no implicit 'first' concept in SQL: the database manager can store records in any order, so any required order has to be specified in SQL. If it is not specified (with ORDER BY), the order is determined by the database manager (SQLite in your case) and is not guaranteed; the same data and the same query can return rows in a different order at different times or on different installations.
Having said that, if you are OK with deleting the duplicates and retaining just one row, you can use the rowid in SQLite for ordering:
delete from MyTbl
where exists (select 1
              from MyTbl b
              where MyTbl.date = b.date
                and MyTbl.name = b.name
                and MyTbl.rowid > b.rowid);
This would delete, from your table, any row for which there is another with a smaller rowid (but the same date and name).
If, by 'first', you meant to keep the record that was inserted first, then you need a column to indicate when the record was inserted (an insert_date_time, or an autoincrementing number column, etc.), and use that, instead of rowid.
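As a usage sketch in Python, with the question's database and table names substituted in (TRrecord.db, TR_INFO; adjust to the real schema), the delete can be run directly through sqlite3:
import sqlite3

conn = sqlite3.connect("TRrecord.db")
with conn:  # commits on success, rolls back on error
    conn.execute("""
        DELETE FROM TR_INFO
        WHERE EXISTS (SELECT 1 FROM TR_INFO b
                      WHERE TR_INFO.date = b.date
                        AND TR_INFO.name = b.name
                        AND TR_INFO.rowid > b.rowid)
    """)
conn.close()
Going forward, a UNIQUE index on (date, name) combined with INSERT OR IGNORE would keep the duplicates from being written in the first place.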

Pandas/SQL only return results where groupby has at least n rows

I have a stock database of the following form, where data is given by the minute.
I want to fetch all results that have at least N rows for a particular ticker on a particular date (i.e., when grouped by ticker and date).
The purpose is to ignore missing data for a given ticker on a given date. There should be at least 629 rows for any given ticker and date combination; anything less than that, and I want to ignore the combination.
I tried something like the following, but it wasn't working:
start_date = '2021-11-08'
SQL = "SELECT datetime, datetime_utc, ticker,open,close,session_type, high,low,volume FROM candles WHERE id in (SELECT id from candles WHERE session_type='Live' GROUP BY datetime, ticker HAVING COUNT(ID) >= 629) AND datetime >= '{} 00:00:00'::timestamp ORDER BY ticker, datetime_utc ASC".format(start_date)
Could anyone point me at the right SQL incantation? I am using postgres > 9.6.
Sample data (apologies, I tried to set up an SQL fiddle but couldn't due to character limitations and then data type limitations; the data is below as JSON, ready to be read by pandas):
df = pd.io.json.read_json('{"datetime":{"12372":1636572720000,"12351":1636571460000,"12493":1636590240000,"15210":1636633380000,"16212":1636642500000,"16009":1636560060000,"12213":1636554180000,"16386":1636657800000,"13580":1636572660000,"14733":1636659000000,"12086":1636537200000,"14802":1636667880000,"14407":1636585980000,"14356":1636577640000,"16086":1636574280000,"15437":1636661040000,"14115":1636577400000,"14331":1636576140000,"12457":1636582800000,"14871":1636678080000},"datetime_utc":{"12372":1636572720000,"12351":1636571460000,"12493":1636590240000,"15210":1636633380000,"16212":1636642500000,"16009":1636560060000,"12213":1636554180000,"16386":1636657800000,"13580":1636572660000,"14733":1636659000000,"12086":1636537200000,"14802":1636667880000,"14407":1636585980000,"14356":1636577640000,"16086":1636574280000,"15437":1636661040000,"14115":1636577400000,"14331":1636576140000,"12457":1636582800000,"14871":1636678080000},"ticker":{"12372":"AAPL","12351":"AAPL","12493":"AAPL","15210":"AAPL","16212":"NET","16009":"NET","12213":"AAPL","16386":"NET","13580":"NET","14733":"AAPL","12086":"AAPL","14802":"AAPL","14407":"AAPL","14356":"AAPL","16086":"NET","15437":"AAPL","14115":"NET","14331":"AAPL","12457":"AAPL","14871":"AAPL"},"open":{"12372":148.29,"12351":148.44,"12493":148.1,"15210":148.68,"16212":199.72,"16009":202.27,"12213":150.21,"16386":198.65,"13580":194.9,"14733":147.94,"12086":150.2,"14802":147.87,"14407":148.1,"14356":148.09,"16086":193.82,"15437":148.01,"14115":194.64,"14331":148.07,"12457":148.12,"14871":148.2},"close":{"12372":148.32,"12351":148.44,"12493":148.15,"15210":148.69,"16212":199.32,"16009":202.52,"12213":150.25,"16386":198.57,"13580":194.96,"14733":147.99,"12086":150.17,"14802":147.9,"14407":148.1,"14356":147.99,"16086":194.43,"15437":148.01,"14115":194.78,"14331":148.05,"12457":148.11,"14871":148.28},"session_type":{"12372":"Live","12351":"Live","12493":"Post","15210":"Pre","16212":"Live","16009":"Live","12213":"Pre","16386":"Live","13580":"Live","14733":"Live","12086":"Pre","14802":"Post","14407":"Post","14356":"Live","16086":"Live","15437":"Live","14115":"Live","14331":"Live","12457":"Post","14871":"Post"},"high":{"12372":148.3600006104,"12351":148.4700012207,"12493":148.15,"15210":148.69,"16212":199.8399963379,"16009":202.5249938965,"12213":150.25,"16386":198.6499938965,"13580":195.0299987793,"14733":147.9900054932,"12086":150.2,"14802":147.9,"14407":148.1,"14356":148.1049957275,"16086":194.4400024414,"15437":148.0399932861,"14115":195.1699981689,"14331":148.0800018311,"12457":148.12,"14871":148.28},"low":{"12372":148.26,"12351":148.38,"12493":148.06,"15210":148.68,"16212":199.15,"16009":202.27,"12213":150.2,"16386":198.49,"13580":194.79,"14733":147.93,"12086":150.16,"14802":147.85,"14407":148.1,"14356":147.98,"16086":193.82,"15437":148.0,"14115":194.64,"14331":148.01,"12457":148.07,"14871":148.2},"volume":{"12372":99551.0,"12351":68985.0,"12493":0.0,"15210":0.0,"16212":9016.0,"16009":2974.0,"12213":0.0,"16386":1395.0,"13580":5943.0,"14733":59854.0,"12086":0.0,"14802":0.0,"14407":0.0,"14356":341196.0,"16086":8715.0,"15437":45495.0,"14115":16535.0,"14331":173785.0,"12457":0.0,"14871":0.0},"date":{"12372":1636502400000,"12351":1636502400000,"12493":1636502400000,"15210":1636588800000,"16212":1636588800000,"16009":1636502400000,"12213":1636502400000,"16386":1636588800000,"13580":1636502400000,"14733":1636588800000,"12086":1636502400000,"14802":1636588800000,"14407":1636502400000,"14356":1636502400000,"16086":1636502400000,"15437":1636588800000,"14115":1636502400000,"14331":1636502400000,"12457":1636502400000,"14871":1636588800000}}')
Expected result: if N = 4, then the results will exclude NET on 2021-11-11.
You can use an analytic (window) function to count the rows for each date and ticker combination:
select datetime, datetime_utc, ticker, open, close, session_type, high, low, volume
from (
    select datetime, datetime_utc, ticker, open, close, session_type, high, low, volume,
           count(*) over (partition by cast(datetime as date), ticker) as c
    from candles
) t
where c >= 4
I am not familiar with Postgres, so this may still contain a syntax problem.
You can use pandasql with a nested GROUP BY to count the daily data points for all tickers, then join back to df:
import pandasql as ps
N=4
df_select = ps.sqldf(f"select df.* from (select count(*) as count, ticker, date from df group by date, ticker having count >= {N}) df_counts left join df on df.ticker=df_counts.ticker and df.date=df_counts.date")
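For completeness, a pure-pandas version of the same filter (a sketch, assuming df carries the date and ticker columns shown in the sample data and N is defined as above):
counts = df.groupby(['date', 'ticker'])['ticker'].transform('size')
df_select = df[counts >= N]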

Using a table in python UDF in Redshift

I need to create a Python UDF (user-defined function) in Redshift that will be called from some other procedure. This Python UDF takes two date values, walks through the dates between the given start and end date, and checks for the occurrence of each intermediate date in some list.
This list needs to collect its values from another table's column. Now the issue is that Python UDFs are defined in the plpythonu language and don't recognize any SQL. What should I do to build this list out of the table's column?
This is my function:
create or replace function test_tmp (ending_date date, starting_date date)
returns integer
stable
as $$
    def get_working_days(ending_date, starting_date):
        days = 0
        if starting_date is not None and ending_date is not None:
            for n in range(int((ending_date - starting_date).days)):
                btw_date = (starting_date + timedelta(n)).strftime('%Y-%m-%d')
                if btw_date in date_list:
                    days = days + 1
            return days
        return 0
    return get_working_days(ending_date, starting_date)
$$ language plpythonu;
Now, I need to create this date_list as something like:
date_list = [str(each["WORK_DATE"]) for each in (select WORK_DATE from public.date_list_table).collect()]
But, using this line in the function obviously gives an error, as select WORK_DATE from public.date_list_table is SQL.
Following is the structure of table public.date_list_table:
CREATE TABLE public.date_list
(
work_date date ENCODE az64
)
DISTSTYLE EVEN;
Some sample values for this table (actually this table stores only the working days values for the entire year):
insert into date_list_table values ('2021-07-01'),('2021-06-30'),('2021-06-29');
An Amazon Redshift scalar UDF cannot access any tables. It needs to be self-contained, with all the necessary information passed into the function. Alternatively, you could store the date information inside the function itself so it doesn't need to access the table (not unreasonable, since it only needs to hold exceptions such as public holidays on weekdays).
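For the first option, a minimal sketch (the function name, the extra work_dates argument, and the LISTAGG call are illustrative, not from the original): pass the working days into the UDF as one comma-separated string and split it inside the Python body:
create or replace function working_days_between (ending_date date, starting_date date, work_dates varchar(max))
returns integer
stable
as $$
    # work_dates is a comma-separated list of YYYY-MM-DD strings built by the caller,
    # e.g. with LISTAGG(TO_CHAR(work_date, 'YYYY-MM-DD'), ',') over public.date_list_table
    from datetime import timedelta
    if ending_date is None or starting_date is None or work_dates is None:
        return 0
    working = set(work_dates.split(','))
    days = 0
    for n in range((ending_date - starting_date).days):
        if (starting_date + timedelta(n)).strftime('%Y-%m-%d') in working:
            days += 1
    return days
$$ language plpythonu;
A year's worth of working days fits comfortably inside a varchar(max) argument, so the calling SQL can rebuild the list each time.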
It appears that your use case is to calculate the number of working days between two dates. One way this is traditionally solved is to create a calendar table with one row per day and columns providing information such as:
Work Day (Boolean)
Weekend (Boolean)
Public Holiday (Boolean)
Month
Quarter
Day of Year
etc.
You can then JOIN or query the table to identify the desired information, such as:
SELECT COUNT(*) FROM calendar WHERE work_day AND date BETWEEN start_date AND end_date

Pandas DataFrame append slow when appending hundreds of DataFrames with thousands of rows each

The variable "data" in the code below contains hundreds of execution results from querying a database. Each execution result is one day of data containing roughly 7,000 rows of data (columns are timestamp, and value). I append each day to each other resulting in several million rows of data (these hundreds of appends take a long time). After I have the complete data set for one sensor I store this data as a column in the unitdf DataFrame, I then repeat the above process for each sensor and merge them all into the unitdf DataFrame.
Both the append and merge are costly operations I believe. The only possible solution I may have found is splitting up each column into lists and once all data is added to the list bring all the columns together into a DataFrame. Any suggestions to speed things up?
i = 0
for sensor_id in sensors:  # loop through each of the 20 sensors
    # prepared statement to query Cassandra
    session_data = session.prepare("select timestamp, value from measurements_by_sensor where unit_id = ? and sensor_id = ? and date = ? ORDER BY timestamp ASC")
    # Executing prepared statement over a range of dates
    data = execute_concurrent(session, ((session_data, (unit_id, sensor_id, date)) for date in dates), concurrency=150, raise_on_first_error=False)
    sensordf = pd.DataFrame()
    # Loops through the execution results and appends all successful executions that contain data
    for (success, result) in data:
        if success:
            sensordf = sensordf.append(pd.DataFrame(result.current_rows))
    sensordf.rename(columns={'value': sensor_id}, inplace=True)
    sensordf['timestamp'] = pd.to_datetime(sensordf['timestamp'], format="%Y-%m-%d %H:%M:%S", errors='coerce')
    if i == 0:
        i += 1
        unitdf = sensordf
    else:
        unitdf = unitdf.merge(sensordf, how='outer')
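One change that usually helps (a sketch under the question's variable names, not tested against the original schema) is to collect the per-day results in a plain list and build each sensor's frame with a single pd.concat, because DataFrame.append copies all of the accumulated data on every call:
frames = []
for (success, result) in data:
    if success:
        frames.append(pd.DataFrame(result.current_rows))
# one concat per sensor instead of hundreds of incremental appends
sensordf = pd.concat(frames, ignore_index=True) if frames else pd.DataFrame(columns=['timestamp', 'value'])
sensordf.rename(columns={'value': sensor_id}, inplace=True)
sensordf['timestamp'] = pd.to_datetime(sensordf['timestamp'], format="%Y-%m-%d %H:%M:%S", errors='coerce')
The per-sensor merges can be thinned out the same way: set timestamp as the index on each sensordf, collect them in a list, and finish with a single pd.concat(..., axis=1), which behaves like the repeated outer merges as long as timestamp is the only join key.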
