I am attempting to select from (with a WHERE clause) and sort a large database table in sqlite3 via Python. The sort is currently taking 30+ minutes on about 36 MB of data. I have a feeling it could run faster with indices, but I suspect the order of my code may be incorrect.
The code is executed in the order listed here.
My CREATE TABLE statement looks like this:
c.execute('''CREATE TABLE gtfs_stop_times (
    trip_id text,         -- REFERENCES gtfs_trips(trip_id)
    arrival_time text,    -- CHECK (arrival_time LIKE '__:__:__')
    departure_time text,  -- CHECK (departure_time LIKE '__:__:__')
    stop_id text,         -- REFERENCES gtfs_stops(stop_id)
    stop_sequence int NOT NULL
)''')
The rows are then inserted in the next step:
stop_times = csv.reader(open("tmp\\avl_stop_times.txt"))
c.executemany('INSERT INTO gtfs_stop_times VALUES (?,?,?,?,?)', stop_times)
Next, I create an index out of two columns (trip_id and stop_sequence):
c.execute('CREATE INDEX trip_seq ON gtfs_stop_times (trip_id, stop_sequence)')
Finally, I run a SELECT statement with a WHERE clause that sorts this data by the two columns used in the index and then write that to a csv file:
c.execute('''SELECT gtfs_stop_times.trip_id, gtfs_stop_times.arrival_time,
                    gtfs_stop_times.departure_time, gtfs_stops.stop_id,
                    gtfs_stop_times.stop_sequence
             FROM gtfs_stop_times, gtfs_stops
             WHERE gtfs_stop_times.stop_id = gtfs_stops.stop_code
             ORDER BY gtfs_stop_times.trip_id, gtfs_stop_times.stop_sequence;''')
f = open("gtfs_update\\stop_times.txt", "w")
writer = csv.writer(f, dialect = 'excel')
writer.writerow([i[0] for i in c.description]) # write headers
writer.writerows(c)
del writer
f.close()
Is there any way to speed up Step 4 (possibly by changing how I add and/or use the index), or should I just go to lunch while this runs?
I have added PRAGMA statements to try to improve performance, but to no avail:
c.execute('PRAGMA main.page_size = 4096')
c.execute('PRAGMA main.cache_size=10000')
c.execute('PRAGMA main.locking_mode=EXCLUSIVE')
c.execute('PRAGMA main.synchronous=NORMAL')
c.execute('PRAGMA main.journal_mode=WAL')
c.execute('PRAGMA main.cache_size=5000')
The SELECT executes extremely fast because there is no gtfs_stops table and you get nothing but an error message.
If we assume that there is a gtfs_stops table, then your trip_seq index is already quite optimal for the query.
However, you also need an index for looking up stop_code values in the gtfs_stops table.
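A minimal sketch of that second index, assuming the gtfs_stops table exists with a stop_code column (the index name is arbitrary):
c.execute('CREATE INDEX IF NOT EXISTS idx_stops_code ON gtfs_stops (stop_code)')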
Using pandas and sqlite3.
I have about 1000 CSV files with data that needs a little bit of pruning, and then I have to insert them into a database.
The code I am using:
def insert_underlying(underlying_df, conn):
    # convert the timestamp column to datetime dtype
    underlying_df['timestamp'] = pd.to_datetime(underlying_df['timestamp'])
    # underlying_df = underlying_df.tail(len(underlying_df)-1)

    # sort by timestamp
    underlying_df.sort_values('timestamp', inplace=True)

    # keep only the columns we insert (drops the contractname column)
    underlying_df = underlying_df[['timestamp', 'close', 'bid', 'ask']]

    # create the destination table if it doesn't exist yet
    # (symbol is assumed to come from the enclosing scope)
    conn.execute("CREATE TABLE IF NOT EXISTS {}tick (timestamp timestamp, close REAL, bid REAL, ask REAL)".format(symbol))

    # insert underlying into tickdata in the external db
    query = 'INSERT OR REPLACE INTO ' + symbol + 'tick (timestamp, close, bid, ask) VALUES (?,?,?,?)'
    conn.executemany(query, underlying_df.to_records(index=False))
    conn.commit()
From what I have read in similar questions, executemany iterates over the rows and inserts the records one at a time, which is very slow. My experience with SQL is beginner at best.
What would be the best/fastest way to do this?
Thanks
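For what it's worth, a sketch of one commonly faster route, assuming conn is a plain sqlite3 connection, a reasonably recent pandas (method='multi' needs 0.24+), and that plain appending is acceptable (to_sql does not reproduce the INSERT OR REPLACE de-duplication); the function name insert_underlying_fast is just for illustration:
import pandas as pd

def insert_underlying_fast(underlying_df, conn, symbol):
    # same preparation as in the function above
    underlying_df['timestamp'] = pd.to_datetime(underlying_df['timestamp'])
    underlying_df = underlying_df.sort_values('timestamp')[['timestamp', 'close', 'bid', 'ask']]

    # one bulk write instead of row-by-row inserts; method='multi' packs many
    # rows into each INSERT, and chunksize=200 keeps each statement under
    # SQLite's bound-parameter limit (4 columns * 200 rows = 800 parameters)
    underlying_df.to_sql(symbol + 'tick', conn, if_exists='append', index=False,
                         method='multi', chunksize=200)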
I am currently using LOAD DATA LOCAL INFILE successfully, and typically with REPLACE. However, I am now trying to use the same command to load a CSV file into a table but only replace specific columns.
For example, table currently looks like:
ID   date       num1   num2
01   1/1/2017   100    200
01   1/2/2017   101    201
01   1/3/2017   102    202
where ID and date are the primary keys.
I have a similar CSV, but one that only has ID, date, num1 as columns. I want to load that new CSV into the table, but maintain whatever is in num2.
My current code per this post:
LOAD DATA LOCAL INFILE 'mycsv.csv'
REPLACE INTO TABLE mytable
FIELDS TERMINATED BY ','
OPTIONALLY ENCLOSED BY '"'
LINES TERMINATED BY '\n'
IGNORE 1 LINES
(ID, date, num1);
Again, my code works flawlessly when I'm replacing all the columns, but when I try to replace only select columns, it fills the other columns with NULL values. The similar posts (like the one I referenced) haven't helped.
I don't know if this is a Python-specific issue, but I'm using MySQLdb to connect to the database, I'm familiar with the local_infile parameter, and that's all working well. MySQL version 5.6.33.
Simply load the CSV into a similarly structured temp table and then run an UPDATE with a JOIN:
CREATE TABLE mytemptable AS SELECT * FROM mytable LIMIT 1;  -- RUN ONLY ONCE
DELETE FROM mytemptable;

LOAD DATA LOCAL INFILE 'mycsv.csv'
INTO TABLE mytemptable
FIELDS TERMINATED BY ','
OPTIONALLY ENCLOSED BY '"'
LINES TERMINATED BY '\n'
IGNORE 1 LINES
(ID, date, num1);

UPDATE mytable t1
INNER JOIN mytemptable t2 ON t1.ID = t2.ID AND t1.`date` = t2.`date`
SET t1.num1 = t2.num1;
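If it helps, a rough sketch of driving those statements from Python with MySQLdb (the connection details are placeholders, and it assumes the temp table has already been created once as above):
import MySQLdb

conn = MySQLdb.connect(host="localhost", user="me", passwd="secret",
                       db="mydb", local_infile=1)
cur = conn.cursor()

# clear out the previous load, then pull the CSV into the temp table
cur.execute("DELETE FROM mytemptable")
cur.execute("""LOAD DATA LOCAL INFILE 'mycsv.csv'
               INTO TABLE mytemptable
               FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
               LINES TERMINATED BY '\\n'
               IGNORE 1 LINES
               (ID, date, num1)""")

# copy num1 across, leaving num2 untouched
cur.execute("""UPDATE mytable t1
               INNER JOIN mytemptable t2 ON t1.ID = t2.ID AND t1.`date` = t2.`date`
               SET t1.num1 = t2.num1""")
conn.commit()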
I'm trying to sum up the values in two columns and truncate my date field to the day. I've constructed the SQL query to do this (which works):
SELECT date_trunc('day', date) AS Day,
       SUM(fremont_bridge_nb) AS Sum_NB,
       SUM(fremont_bridge_sb) AS Sum_SB
FROM bike_count
GROUP BY Day
ORDER BY Day;
But I then run into issues when I try to format this into peewee:
query = (Bike_Count
         .select(fn.date_trunc('day', Bike_Count.date).alias('Day'),
                 fn.SUM(Bike_Count.fremont_bridge_nb).alias('Sum_NB'),
                 fn.SUM(Bike_Count.fremont_bridge_sb).alias('Sum_SB'))
         .group_by('Day')
         .order_by('Day'))
I don't get any errors, but when I print out the variable I stored this in, it shows:
<class 'models.Bike_Count'> SELECT date_trunc(%s, "t1"."date") AS Day, SUM("t1"."fremont_bridge_nb") AS Sum_NB, SUM("t1"."fremont_bridge_sb") AS Sum_SB FROM "bike_count" AS t1 ORDER BY %s ['day', 'Day']
The only thing that I've written in Python to get data successfully is:
Bike_Count.get(Bike_Count.id == 1).date
If you just stick a string into your group by / order by, Peewee will try to parameterize it as a value. This is to avoid SQL injection haxx.
To solve the problem, you can use SQL('Day') in place of 'Day' inside the group_by() and order_by() calls.
Another way is to just stick the function call into the GROUP BY and ORDER BY. Here's how you would do that:
day = fn.date_trunc('day', Bike_Count.date)
nb_sum = fn.SUM(Bike_Count.fremont_bridge_nb)
sb_sum = fn.SUM(Bike_Count.fremont_bridge_sb)
query = (Bike_Count
         .select(day.alias('Day'), nb_sum.alias('Sum_NB'), sb_sum.alias('Sum_SB'))
         .group_by(day)
         .order_by(day))
Or, if you prefer:
query = (Bike_Count
         .select(day.alias('Day'), nb_sum.alias('Sum_NB'), sb_sum.alias('Sum_SB'))
         .group_by(SQL('Day'))
         .order_by(SQL('Day')))
I have a csv file of customer ids (CRM_id). I need to get their primary keys (an autoincrement int) from the customers table of the database. (I can't be assured of the integrity of the CRM_ids so I chose not to make that the primary key).
So:
customers = []
with open("CRM_ids.csv", 'r', newline='') as csvfile:
    customerfile = csv.DictReader(csvfile, delimiter=',', quotechar='"', skipinitialspace=True)
    # only one "CRM_id" field per row
    customers = [c for c in customerfile]
So far so good? I think this is the most pythonesque way of doing that (but happy to hear otherwise).
Now comes the ugly code. It works, but I hate appending to the list because that has to copy and reallocate memory on each loop, right? Is there a better way (pre-allocating and using enumerate to keep track of the index comes to mind, but maybe there's an even quicker/better way of being clever with the SQL so as not to run several thousand separate queries)?
cnx = mysql.connector.connect(user='me', password=sys.argv[1], host="localhost", database="mydb")
cursor = cnx.cursor()
select_customer = "SELECT id FROM customers WHERE CRM_id = %(CRM_id)s LIMIT 1;"

c_ids = []
for row in customers:
    cursor.execute(select_customer, row)
    # fetchall() returns a list of 1-tuples (the SELECTed set has only a
    # single column), so unpack that column with [0] afterwards
    c_ids.extend(cursor.fetchall())

c_ids = [c[0] for c in c_ids]
Edit:
Purpose is to get the primary keys in a list so I can use these to allocate some other data from other CSV files in linked tables (the customer id primary key is a foreign key to these other tables, and the allocation algorithm changes, so it's best to have the flexibility to do the allocation in python rather than hard coding SQL queries). I know this sounds a little backwards, but the "client" only works with spreadsheets rather than an ERP/PLM, so I have to build the "relations" for this small app myself.
What about changing your query to get what you want?
crm_ids = ",".join(customers)
select_customer = "SELECT UNIQUE id FROM customers WHERE CRM_id IN (%s);" % crm_ids
MySQL should be fine with even a multi-megabyte query, according to the manual; if it gets to be a really long list, you can always break it up - two or three queries is guaranteed much faster than a few thousand.
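If the CRM ids come from an untrusted file, a parameterized variant of the same idea (a sketch, assuming the mysql.connector cursor from above) avoids building the value list by string formatting:
crm_ids = [c['CRM_id'] for c in customers]
placeholders = ",".join(["%s"] * len(crm_ids))
query = "SELECT DISTINCT id FROM customers WHERE CRM_id IN ({})".format(placeholders)
cursor.execute(query, crm_ids)
c_ids = [row[0] for row in cursor.fetchall()]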
How about storing your CSV rows in a dict instead of a list:
customers = [c for c in customerfile]
becomes:
customers = {c['CRM_id']:c for c in customerfile}
then select the entire xref:
cursor.execute('SELECT id, CRM_id FROM customers')
and add the new rowid as a new entry in the dict:
for row in cursor:  # mysql.connector returns rows by iterating the cursor
    customers[row[1]]['newid'] = row[0]
So I found a great script over at QuantState that had a great walk-through on setting up my own securities database and loading in historical pricing information. However, I'm now trying to modify the script so that I can run it daily and have the latest stock quotes added.
I adjusted the initial data load to just download one week's worth of historicals, but I've been having issues with writing the SQL statement to check whether the row already exists before adding it. Can anyone help me out with this? Here's what I have so far:
def insert_daily_data_into_db(data_vendor_id, symbol_id, daily_data):
    """Takes a list of tuples of daily data and adds it to the
    database. Appends the vendor ID and symbol ID to the data.

    daily_data: List of tuples of the OHLC data (with
    adj_close and volume)"""
    # Create the time now
    now = datetime.datetime.utcnow()

    # Amend the data to include the vendor ID and symbol ID
    daily_data = [(data_vendor_id, symbol_id, d[0], now, now,
                   d[1], d[2], d[3], d[4], d[5], d[6]) for d in daily_data]

    # Create the insert strings
    column_str = """data_vendor_id, symbol_id, price_date, created_date,
                    last_updated_date, open_price, high_price, low_price,
                    close_price, volume, adj_close_price"""
    insert_str = ("%s, " * 11)[:-2]
    final_str = "INSERT INTO daily_price (%s) VALUES (%s) WHERE NOT EXISTS (SELECT 1 FROM daily_price WHERE symbol_id = symbol_id AND price_date = insert_str[2])" % (column_str, insert_str)

    # Using the postgre connection, carry out an INSERT INTO for every symbol
    with con:
        cur = con.cursor()
        cur.executemany(final_str, daily_data)
Some notes regarding your code above:
It's generally easier to defer to now() in pure SQL than to compute the time in Python whenever possible. It avoids lots of potential pitfalls with timezones, library differences, etc.
If you construct a list of columns, you can dynamically generate a string of %s's based on its size, and don't need to hardcode the length into a repeated string which is then sliced.
Since it appears that insert_daily_data_into_db is meant to be called from within a loop on a per-stock basis, I don't believe you want to use executemany here, which would require a list of tuples and is very different semantically.
You were comparing symbol_id to itself in the sub select, instead of a particular value (which would mean it's always true).
To prevent possible SQL injection, values in the WHERE clause, including in sub-selects, should always be passed as query parameters rather than formatted into the string.
Note: I'm assuming that you're using psycopg2 to access Postgres, and that the primary key for the table is a tuple of (symbol_id, price_date). If not, the code below would need to be tweaked at least a bit.
With those points in mind, try something like this (untested, since I don't have your data, db, etc. but it is syntactically valid Python):
def insert_daily_data_into_db(data_vendor_id, symbol_id, daily_data):
    """Takes a list of tuples of daily data and adds it to the
    database. Appends the vendor ID and symbol ID to the data.

    daily_data: List of tuples of the OHLC data (with
    adj_close and volume)"""
    column_list = ["data_vendor_id", "symbol_id", "price_date", "created_date",
                   "last_updated_date", "open_price", "high_price", "low_price",
                   "close_price", "volume", "adj_close_price"]
    insert_list = ['%s'] * len(column_list)

    # 'now' is Postgres's special timestamp input string; it resolves to the
    # current time when the INSERT actually runs
    values_tuple = (data_vendor_id, symbol_id, daily_data[0], 'now', 'now', daily_data[1],
                    daily_data[2], daily_data[3], daily_data[4], daily_data[5], daily_data[6])

    # INSERT ... SELECT ... WHERE NOT EXISTS inserts the row only if no row
    # with the same (symbol_id, price_date) is already present
    final_str = """INSERT INTO daily_price ({0})
                   SELECT {1}
                   WHERE NOT EXISTS (SELECT 1
                                     FROM daily_price
                                     WHERE symbol_id = %s
                                     AND price_date = %s)""".format(', '.join(column_list), ', '.join(insert_list))

    # Using the Postgres connection, carry out an INSERT INTO for every symbol
    with con:
        cur = con.cursor()
        cur.execute(final_str, values_tuple + (values_tuple[1], values_tuple[2]))
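Hypothetical usage, assuming con is an open psycopg2 connection and daily_data holds a single day's values in the order (price_date, open, high, low, close, volume, adj_close); the numbers are placeholders only:
row = ('2017-01-03', 10.0, 10.5, 9.8, 10.2, 100000, 10.2)  # placeholder values
insert_daily_data_into_db(data_vendor_id=1, symbol_id=42, daily_data=row)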