I have a csv file of customer ids (CRM_id). I need to get their primary keys (an autoincrement int) from the customers table of the database. (I can't be assured of the integrity of the CRM_ids so I chose not to make that the primary key).
So:
import csv

customers = []
with open("CRM_ids.csv", 'r', newline='') as csvfile:
    customerfile = csv.DictReader(csvfile, delimiter=',', quotechar='"', skipinitialspace=True)
    # only one "CRM_id" field per row
    customers = [c for c in customerfile]
So far so good? I think this is the most Pythonic way of doing that (but happy to hear otherwise).
Now comes the ugly code. It works, but I hate appending to the list because that has to copy and reallocate memory for each loop, right? Is there a better way? Pre-allocating and using enumerate to keep track of the index comes to mind, but maybe there's an even quicker/better way by being clever with the SQL so as not to do several thousand separate queries.
import sys
import mysql.connector

cnx = mysql.connector.connect(user='me', password=sys.argv[1], host="localhost", database="mydb")
cursor = cnx.cursor()
select_customer = "SELECT id FROM customers WHERE CRM_id = %(CRM_id)s LIMIT 1;"
c_ids = []
for row in customers:
    cursor.execute(select_customer, row)
    # note fetchall() returns a list of tuples, but the SELECTed set
    # only has a single column, so we get this column with the [0] below
    c_ids.extend(cursor.fetchall())
c_ids = [c[0] for c in c_ids]
Edit:
Purpose is to get the primary keys in a list so I can use these to allocate some other data from other CSV files in linked tables (the customer id primary key is a foreign key to these other tables, and the allocation algorithm changes, so it's best to have the flexibility to do the allocation in python rather than hard coding SQL queries). I know this sounds a little backwards, but the "client" only works with spreadsheets rather than an ERP/PLM, so I have to build the "relations" for this small app myself.
What about changing your query to get what you want?
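crm_ids = [c["CRM_id"] for c in customers]  # customers is a list of dicts
placeholders = ", ".join(["%s"] * len(crm_ids))
select_customer = "SELECT DISTINCT id FROM customers WHERE CRM_id IN (%s);" % placeholders
cursor.execute(select_customer, crm_ids)  # values are passed as parameters, not pasted into the SQL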
MySQL should be fine with even a multi-megabyte query, according to the manual; if it gets to be a really long list, you can always break it up. Two or three queries will still be much faster than a few thousand.
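If the list does get long, the "break it up" idea could be sketched roughly like this. `chunked`, `fetch_ids`, `chunk_size` and the `placeholder` argument are names made up for this illustration, not part of any library; `%s` is the parameter marker mysql.connector uses, and other drivers may use a different one.

```python
def chunked(seq, size):
    """Yield successive slices of seq holding at most size items."""
    for start in range(0, len(seq), size):
        yield seq[start:start + size]


def fetch_ids(cursor, crm_ids, chunk_size=1000, placeholder="%s"):
    """Map CRM_id -> id with one IN (...) query per chunk (hypothetical helper)."""
    found = {}
    for chunk in chunked(list(crm_ids), chunk_size):
        # One parameter marker per value; the values themselves never touch the SQL text
        marks = ", ".join([placeholder] * len(chunk))
        sql = "SELECT CRM_id, id FROM customers WHERE CRM_id IN (%s)" % marks
        cursor.execute(sql, tuple(chunk))
        for crm_id, pk in cursor.fetchall():
            found[crm_id] = pk
    return found
```

Something like `ids = fetch_ids(cursor, [c["CRM_id"] for c in customers])` followed by `c_ids = [ids[c["CRM_id"]] for c in customers if c["CRM_id"] in ids]` would then give the primary keys in CSV order, skipping CRM_ids the database does not know.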
How about storing your CSV in a dict instead of a list:
customers = [c for c in customerfile]
becomes:
customers = {c['CRM_id']:c for c in customerfile}
then select the entire xref:
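cursor.execute('select id, CRM_id from customers')
and add the new rowid as a new entry in the dict (rows for customers that are not in the CSV are skipped):
for row in cursor:  # execute() returns None in mysql.connector, so iterate over the cursor
    if row[1] in customers:
        customers[row[1]]['newid'] = row[0]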
I've been querying a few APIs with Python to individually create CSVs for a table.
Instead of recreating the table each time, I would like to update the existing table with any new API data.
At the moment the way the Query is working, I have a table that looks like this,
From this I am taking the suburbs of each state and copying them into a csv for each different state.
Then using this script I am cleaning them into a list (the api needs the %20 for any spaces),
"%20"
#suburbs = ["want this", "want this (meh)", "this as well (nope)"]
suburb_cleaned = []
#dont_want = frozenset( ["(meh)", "(nope)"] )
for urb in suburbs:
    cleaned_name = []
    name_parts = urb.split()
    for part in name_parts:
        if part in dont_want:
            continue
        cleaned_name.append(part)
    suburb_cleaned.append('%20'.join(cleaned_name))
Then taking the suburbs for each state and putting them into this API to return a csv,
import time
import requests
import pandas as pd

timestr = time.strftime("%Y%m%d-%H%M%S")
Name = "price_data_NT"+timestr+".csv"
url_price = "http://mwap.com/api"
string = 'gxg&state='
api_results = {}
n = 0
y = 2
for urbs in suburb_cleaned:
    url = url_price + urbs + string + "NT"
    print(url)
    print(urbs)
    request = requests.get(url)
    api_results[urbs] = pd.DataFrame(request.json())
    n = n+1
    if n == y:
        # write an intermediate CSV after every second suburb
        dfs = pd.concat(api_results).reset_index(level=1, drop=True).rename_axis(
            'key').reset_index().set_index(['key'])
        dfs.to_csv(Name, sep='\t', encoding='utf-8')
        y = y+2
        continue
    print("made it through"+urbs)
    # print(request.json())
    # print(api_results)
dfs = pd.concat(api_results).reset_index(level=1, drop=True).rename_axis(
    'key').reset_index().set_index(['key'])
dfs.to_csv(Name, sep='\t', encoding='utf-8')
Then adding the states manually in excel, and combining and cleaning the suburb names.
# use pd.concat
df = pd.concat([act, vic,nsw,SA,QLD,WA]).reset_index().set_index(['key']).rename_axis('suburb').reset_index().set_index(['state'])
# apply lambda to clean the %20
f = lambda s: s.replace('%20', ' ')
df['suburb'] = df['suburb'].apply(f)
and then finally inserting it into a db
from sqlalchemy import create_engine

engine = create_engine('mysql://username:password@localhost/dbname')
with engine.connect() as conn, conn.begin():
    df.to_sql('Price_historic', conn, if_exists='replace', index=False)
Leading to this sort of output
Now, this is a heck of a process. I would love to simplify it and make the database only update the values that are needed from the API, and not have this much complexity in getting the data.
Would love some helpful tips on achieving this goal. I'm thinking I could do an update on the MySQL database instead of an insert, or something like that? As for querying the API, I feel like I'm overcomplicating it.
Thanks!
I don't see any reason why you would be creating CSV files in this process. It sounds like you can just query the data and then load it into a MySQL table directly. You say that you are adding the states manually in Excel? Is that data not available through your prior API calls? If not, could you find that information and save it to a CSV, so you can automate that step by loading it into a table and having Python look up the values for you?
Generally, you wouldn't want to overwrite the mysql table every time. When you have a table, you can identify the column or columns that uniquely identify a specific record, then create a UNIQUE INDEX for them. For example if your street and price values designate a unique entry, then in mysql you could run:
ALTER TABLE `Price_historic` ADD UNIQUE INDEX(street, price);
After this, your table will not allow duplicate records based on those values. Then, instead of creating a new table every time, you can insert your data into the existing table, with instructions to either update or ignore when you encounter a duplicate. For example:
final_str = "INSERT INTO Price_historic (state, suburb, property_price_id, type, street, price, date) " \
            "VALUES (%s, %s, %s, %s, %s, %s, %s) " \
            "ON DUPLICATE KEY UPDATE " \
            "state = VALUES(state), date = VALUES(date)"
con = pdb.connect(db_host, db_user, db_pass, db_name)
with con:
    cur = con.cursor()
    # insert_list is a list of 7-tuples, one per row, in the column order above
    cur.executemany(final_str, insert_list)
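To see the effect of a unique index without a MySQL server at hand, here is a small sketch that uses Python's built-in sqlite3 module as a stand-in. The table is cut down to three columns, the rows are made-up sample data, and `load_prices` is a hypothetical helper. SQLite's `INSERT OR IGNORE` plays the role of the "ignore" option here, so the syntax differs from the MySQL statement above.

```python
import sqlite3


def load_prices(conn, rows):
    """Insert (street, price, date) rows, skipping ones already stored."""
    with conn:  # commits on success, rolls back on error
        conn.executemany(
            "INSERT OR IGNORE INTO Price_historic (street, price, date) "
            "VALUES (?, ?, ?)",
            rows,
        )
    return conn.execute("SELECT COUNT(*) FROM Price_historic").fetchone()[0]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Price_historic (street TEXT, price INTEGER, date TEXT)")
# Same idea as the ALTER TABLE ... ADD UNIQUE INDEX statement above
conn.execute("CREATE UNIQUE INDEX street_price ON Price_historic (street, price)")

first = load_prices(conn, [("1 Main St", 500000, "2019-01-01")])
# A second pull returns the old row again plus one new row
second = load_prices(conn, [("1 Main St", 500000, "2019-02-01"),
                            ("2 High St", 620000, "2019-02-01")])
```

Because the table is never dropped, rerunning the load only adds rows whose (street, price) pair has not been seen before.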
If this setup is meant for the longer term, I would suggest running two different processes in parallel:
Process 1:
Query API 1, obtain the required data and insert it into the DB table, with a binary/bit flag specifying that only API 1 has been called.
Process 2:
Run a query on the DB to obtain all records needed for API call 2, based on the binary/bit flag set in process 1. For the corresponding data, run call 2 and update the data back to the DB table based on the primary key.
Database: I would suggest adding a primary key as well as a [Bit Flag][1] that gives the status of the different API calls. The bit flag also helps you
- double-check whether a specific API call has been made for a specific record or not.
- expand your project to additional API calls while still tracking the status of each API call at record level.
[1]: Bit Flags: https://docs.oracle.com/cd/B28359_01/server.111/b28286/functions014.htm#SQLRF00612
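As a rough illustration of the bit-flag idea on the Python side, the status column could be modelled with `enum.IntFlag` from the standard library. The flag names, the `needs_api_2` helper and the sample records are all made up for this sketch; in practice the integer value would live in a column next to the primary key.

```python
from enum import IntFlag


class ApiStatus(IntFlag):
    NONE = 0
    API_1_DONE = 1
    API_2_DONE = 2


def needs_api_2(records):
    """Return the keys of records where call 1 is done but call 2 is not."""
    return [
        key for key, status in records.items()
        if status & ApiStatus.API_1_DONE and not status & ApiStatus.API_2_DONE
    ]


# primary key -> status flags, as they might be stored in an integer column
records = {
    1: ApiStatus.API_1_DONE,
    2: ApiStatus.NONE,
    3: ApiStatus.API_1_DONE | ApiStatus.API_2_DONE,
}
pending = needs_api_2(records)
for key in pending:
    records[key] |= ApiStatus.API_2_DONE  # process 2 marks its call as made
```

Adding a third API later only means adding another power-of-two member to the enum; the existing stored values stay valid.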
I have a database table that includes TRADE_DATE, CURRVAL, and ITEM fields. I first have two arrays/lists: arrVars (strings), dates (dates). Each string in arrVars represents an ITEM for which I need to retrieve the CURRVALs for each TRADE_DATE in dates. I'm new to Python, and I'm certainly no expert w/ databases, and I'm sure there are ways to speed my code up.
First part is just creating the dates list from my first db connection. I'm just iterating through each row and appending it into the dates list. Is there a better way?
i = 0
for row in cursor.fetchall():
    dates.append(row[0])
    i += 1
Second, I'm looping through each ITEM in arrVars and then looping through each TRADE_DATE in dates to create each array of CURRVALs and put the arrays into a matrix. This is pretty damn slow, so I'm hoping there is a better way as well.
M = []
dtFormat = '%Y/%m/%d'
for item in arrVars:
    tmp = []
    for dt in dates:
        strSQL = "SELECT CURRVAL FROM tblGanData WHERE ITEM = '" + item + "' AND TRADE_DATE = #" + dt.strftime(dtFormat) + "#"
        cursor.execute(strSQL)
        tmp.append(cursor.fetchone()[0])
    M.append(tmp)
Thank you!!
For the first bit, you might want to do something like this:
dates = [row[0] for row in cursor.fetchall()]
But I'd be very interested in seeing the SQL statement that you're using for that cursor.
select some_date from my_table
is going to be faster than
select * from my_table
(How much faster depends on how many rows you are getting back and the speed of the network connection between your client and server.)
For your second part, you're executing one query (with the full round-trip cost) for each Item/Date combination.
So maybe something like this
# build a list of all the dates
dates_str = ",".join(['#' + dt.strftime(dtFormat) + "#"
                      for dt in dates])
# build a list of all the items (arrVars in your code)
items_str = ",".join(["'" + item + "'" for item in arrVars])
# run one SQL query that gets everything
cursor.execute("""
    select item, trade_date, currval
    from tblGanData
    where item in (%s)
    and trade_date in (%s)
    order by item, trade_date
""" % (items_str, dates_str))
You will have to do some logic when fetching the values to turn the list of currvals into a matrix, let me know if you need help with that.
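One possible way to do that reshaping is sketched below. `rows_to_matrix` is a made-up helper name, and the sketch assumes the trade_date values coming back from the driver compare equal to the entries in dates (same type, no time component); combinations missing from the result come out as None.

```python
def rows_to_matrix(rows, items, dates):
    """Arrange (item, trade_date, currval) rows as one list of values per item."""
    lookup = {(item, trade_date): currval for item, trade_date, currval in rows}
    # One inner list per item, ordered like dates; missing pairs become None
    return [[lookup.get((item, dt)) for dt in dates] for item in items]
```

With the single query above, `M = rows_to_matrix(cursor.fetchall(), arrVars, dates)` would produce the same layout as the nested loops in the question.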
P.S. How many items/dates are we talking about here?
I have a large dataset with 50M+ records in a PostgreSQL database that requires heavy calculations involving inner joins.
Python is the tool of choice with Psycopg2.
Running the process with fetchmany of 20,000 records takes a couple of hours to finish.
The execution needs to take place sequentially, as in each record of the 50M needs to be fetched separately, then another query (in the below example) needs to run before a result is returned and saved in a separate table.
Indexes are properly configured on each table (5 tables in total) and the complex query (that returns a calculated value, example below) takes around 240 ms to return results (when the database is not under load).
Celery is used to take care of database inserts of the calculated values in a separate table.
My question is about common strategies to reduce overall running time and produce results/calculations faster.
In other words, what is an effective way to go through all the records, one by one, calculate the value of a field via a second query then save the result.
UPDATE:
There is an important piece of information that I unintentionally missed mentioning while trying to obfuscate sensitive details. Sorry for that.
The original SELECT query calculates a value aggregated from different tables as follows:
SELECT CR.gg, (AX.b + BF.f)/CR.d AS calculated_field
FROM table_one CR
LEFT JOIN table_two AX ON AX.x = CR.x
LEFT JOIN table_three BF ON BF.x = CR.x
WHERE CR.gg = '123'
GROUP BY CR.gg;
PS: the SQL query was written by our experienced DBA, so I trust that it is optimised.
Don't loop over the records and call the DBMS repeatedly for every record.
Instead, let the DBMS process large chunks (preferably all) of the data
and let it spit out all the results.
Below is a snippet of my twitter-sucker (with a rather complex, ugly query):
def fetch_referred_tweets(self):
    self.curs = self.conn.cursor()
    tups = ()
    selrefd = """SELECT twx.id, twx.in_reply_to_id, twx.seq, twx.created_at
        FROM(
            SELECT tw1.id, tw1.in_reply_to_id, tw1.seq, tw1.created_at
            FROM tt_tweets tw1
            WHERE 1=1
            AND tw1.in_reply_to_id > 0
            AND tw1.is_retweet = False
            AND tw1.did_resolve = False
            AND NOT EXISTS ( SELECT * FROM tweets nx
                WHERE nx.id = tw1.in_reply_to_id)
            AND NOT EXISTS ( SELECT * FROM tt_tweets nx
                WHERE nx.id = tw1.in_reply_to_id)
            UNION ALL
            SELECT tw2.id, tw2.in_reply_to_id, tw2.seq, tw2.created_at
            FROM tweets tw2
            WHERE 1=1
            AND tw2.in_reply_to_id > 0
            AND tw2.is_retweet = False
            AND tw2.did_resolve = False
            AND NOT EXISTS ( SELECT * FROM tweets nx
                WHERE nx.id = tw2.in_reply_to_id)
            AND NOT EXISTS ( SELECT * FROM tt_tweets nx
                WHERE nx.id = tw2.in_reply_to_id)
            -- ORDER BY tw2.created_at DESC
        )twx
        LIMIT %s;"""
    # -- AND tw.created_at < now() - '15 min':: interval
    # -- AND tw.created_at >= now() - '72 hour':: interval
    count = 0
    uniqs = 0
    self.curs.execute(selrefd, (quotum_referred_tweets, ) )
    tups = self.curs.fetchmany(quotum_referred_tweets)
    for tup in tups:
        if tup == None: break
        print ('%d -->> %d [seq=%d] datum=%s' % tup)
        self.resolve_list.append(tup[0] ) # this tweet
        if tup[1] not in self.refetch_tweets:
            self.refetch_tweets[ tup[1] ] = [ tup[0]] # referred tweet
            uniqs += 1
        count += 1
    self.curs.close()
Note: your query makes no sense:
- you only select fields from the er table, so the two LEFT JOINed tables could be omitted
- if ex and ef do contain multiple matching rows, the result set could be larger than just all the rows selected from er, resulting in duplicated er records
- there is a GROUP BY present, but no aggregates are in the select list
select er.gg, er.z, er.y
from table_one er
where er.gg = '123'
-- or:
where er.gg >= '123'
and er.gg <= '456'
ORDER BY er.gg, er.z, er.y -- Or: some other ordering
;
Since you are doing a join in your query, the logical thing to do is to work around it by creating what's known as a summary table. This summary table, residing in the database, holds the final joined dataset, so your Python code just has to select from it.
Another way is to use a materialized view.
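A minimal sketch of the summary-table idea, using Python's built-in sqlite3 module as a stand-in so it runs anywhere. The table and column names follow the query in the question, while the column types, the sample values and the name `calc_summary` are assumptions made for the illustration. PostgreSQL accepts the same `CREATE TABLE ... AS SELECT` form, and `CREATE MATERIALIZED VIEW` is its built-in variant of the same idea.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table_one   (gg TEXT, x INTEGER, d REAL);
    CREATE TABLE table_two   (x INTEGER, b REAL);
    CREATE TABLE table_three (x INTEGER, f REAL);
    INSERT INTO table_one   VALUES ('123', 1, 2.0), ('456', 2, 4.0);
    INSERT INTO table_two   VALUES (1, 3.0), (2, 6.0);
    INSERT INTO table_three VALUES (1, 5.0), (2, 10.0);

    -- One set-based statement computes the value for every row at once
    CREATE TABLE calc_summary AS
    SELECT CR.gg, (AX.b + BF.f) / CR.d AS calculated_field
    FROM table_one CR
    LEFT JOIN table_two   AX ON AX.x = CR.x
    LEFT JOIN table_three BF ON BF.x = CR.x;
""")

# The Python side now only reads finished results
summary = conn.execute(
    "SELECT gg, calculated_field FROM calc_summary ORDER BY gg"
).fetchall()
```

The per-record round trips disappear: the database does the join once, and the client only has to read from `calc_summary`.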
I took @wildplasser's advice and moved the calculation operation inside the database as a function.
The result has been impressively efficient to say the least and total run time dropped to minutes/~ hour.
To recap:
- Database records are no longer fetched in the sequence mentioned earlier
- Calculations happen inside the database via a PostgreSQL function
So I found a great script over at QuantState that had a great walk-through on setting up my own securities database and loading in historical pricing information. However, I'm now trying to modify the script so that I can run it daily and have the latest stock quotes added.
I adjusted the initial data load to download just 1 week's worth of historical prices, but I've been having issues with writing the SQL statement that checks whether the row already exists before adding it. Can anyone help me out with this? Here's what I have so far:
import datetime

def insert_daily_data_into_db(data_vendor_id, symbol_id, daily_data):
    """Takes a list of tuples of daily data and adds it to the
    database. Appends the vendor ID and symbol ID to the data.
    daily_data: List of tuples of the OHLC data (with
    adj_close and volume)"""
    # Create the time now
    now = datetime.datetime.utcnow()
    # Amend the data to include the vendor ID and symbol ID
    daily_data = [(data_vendor_id, symbol_id, d[0], now, now,
                   d[1], d[2], d[3], d[4], d[5], d[6]) for d in daily_data]
    # Create the insert strings
    column_str = """data_vendor_id, symbol_id, price_date, created_date,
                 last_updated_date, open_price, high_price, low_price,
                 close_price, volume, adj_close_price"""
    insert_str = ("%s, " * 11)[:-2]
    final_str = "INSERT INTO daily_price (%s) VALUES (%s) WHERE NOT EXISTS (SELECT 1 FROM daily_price WHERE symbol_id = symbol_id AND price_date = insert_str[2])" % (column_str, insert_str)
    # Using the postgre connection, carry out an INSERT INTO for every symbol
    with con:
        cur = con.cursor()
        cur.executemany(final_str, daily_data)
Some notes regarding your code above:
- It's generally easier to defer to now() in pure SQL than to compute the time in Python whenever possible. It avoids lots of potential pitfalls with timezones, library differences, etc.
- If you construct a list of columns, you can dynamically generate a string of %s's based on its size, and don't need to hardcode the length into a repeated string which is then sliced.
- Since it appears that insert_daily_data_into_db is meant to be called from within a loop on a per-stock basis, I don't believe you want to use executemany here, which would require a list of tuples and is very different semantically.
- You were comparing symbol_id to itself in the sub select, instead of a particular value (which would mean it's always true).
- To prevent possible SQL injection, you should always pass the values used in the WHERE clause as query parameters rather than writing them into the SQL string, including in sub selects.
Note: I'm assuming that you're using psycopg2 to access Postgres, and that the primary key for the table is a tuple of (symbol_id, price_date). If not, the code below would need to be tweaked at least a bit.
With those points in mind, try something like this (untested, since I don't have your data, db, etc. but it is syntactically valid Python):
def insert_daily_data_into_db(data_vendor_id, symbol_id, daily_data):
    """Takes a list of tuples of daily data and adds it to the
    database. Appends the vendor ID and symbol ID to the data.
    daily_data: List of tuples of the OHLC data (with
    adj_close and volume)"""
    column_list = ["data_vendor_id", "symbol_id", "price_date", "created_date",
                   "last_updated_date", "open_price", "high_price", "low_price",
                   "close_price", "volume", "adj_close_price"]
    # now() is evaluated by Postgres, so it is written into the SQL itself
    # rather than passed as a parameter (where it would arrive as a string)
    insert_list = ['%s', '%s', '%s', 'now()', 'now()'] + ['%s'] * (len(column_list) - 5)
    values_tuple = (data_vendor_id, symbol_id, daily_data[0], daily_data[1],
                    daily_data[2], daily_data[3], daily_data[4], daily_data[5], daily_data[6])
    # INSERT ... VALUES cannot take a WHERE clause, so the row is produced by a SELECT
    final_str = """INSERT INTO daily_price ({0})
                   SELECT {1}
                   WHERE NOT EXISTS (SELECT 1
                                     FROM daily_price
                                     WHERE symbol_id = %s
                                     AND price_date = %s)""".format(', '.join(column_list), ', '.join(insert_list))
    # Using the postgre connection, carry out an INSERT INTO for every symbol
    with con:
        cur = con.cursor()
        # execute() takes a single parameter sequence: the insert values
        # followed by the two values used in the sub select
        cur.execute(final_str, values_tuple + (symbol_id, daily_data[0]))
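The insert-only-if-missing pattern can be tried out in isolation with Python's built-in sqlite3 module, used here purely as a stand-in for Postgres. The three-column table, the sample values and the `insert_if_missing` helper are made up for this sketch; sqlite3 uses `?` as its parameter marker where psycopg2 uses `%s`.

```python
import sqlite3

INSERT_IF_MISSING = """
    INSERT INTO daily_price (symbol_id, price_date, close_price)
    SELECT ?, ?, ?
    WHERE NOT EXISTS (SELECT 1 FROM daily_price
                      WHERE symbol_id = ? AND price_date = ?)
"""


def insert_if_missing(conn, symbol_id, price_date, close_price):
    """Insert one row unless (symbol_id, price_date) is already stored."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            INSERT_IF_MISSING,
            # three values for the new row, then the two lookup values
            (symbol_id, price_date, close_price, symbol_id, price_date),
        )


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE daily_price (symbol_id INTEGER, price_date TEXT, close_price REAL)"
)
insert_if_missing(conn, 1, "2015-01-02", 10.5)
insert_if_missing(conn, 1, "2015-01-02", 99.0)  # same key: skipped
insert_if_missing(conn, 1, "2015-01-05", 10.7)  # new date: inserted
```

A unique constraint on (symbol_id, price_date) would be a natural companion, since the NOT EXISTS check alone does not protect against two writers inserting the same row at the same time.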
I am attempting to select from (with a WHERE clause) and sort a large database table in sqlite3 via python. The sort is currently taking 30+ minutes on about 36 MB of data. I have a feeling it can work faster than this with indices, but I think the order of my code may be incorrect.
The code is executed in the order listed here.
My CREATE TABLE statement looks like this:
c.execute('''CREATE table gtfs_stop_times (
trip_id text , --REFERENCES gtfs_trips(trip_id),
arrival_time text, -- CHECK (arrival_time LIKE '__:__:__'),
departure_time text, -- CHECK (departure_time LIKE '__:__:__'),
stop_id text , --REFERENCES gtfs_stops(stop_id),
stop_sequence int NOT NULL --NOT NULL
)''')
The rows are then inserted in the next step:
stop_times = csv.reader(open("tmp\\avl_stop_times.txt"))
c.executemany('INSERT INTO gtfs_stop_times VALUES (?,?,?,?,?)', stop_times)
Next, I create an index out of two columns (trip_id and stop_sequence):
c.execute('CREATE INDEX trip_seq ON gtfs_stop_times (trip_id, stop_sequence)')
Finally, I run a SELECT statement with a WHERE clause that sorts this data by the two columns used in the index and then write that to a csv file:
c.execute('''SELECT gtfs_stop_times.trip_id, gtfs_stop_times.arrival_time, gtfs_stop_times.departure_time, gtfs_stops.stop_id, gtfs_stop_times.stop_sequence
FROM gtfs_stop_times, gtfs_stops
WHERE gtfs_stop_times.stop_id=gtfs_stops.stop_code
ORDER BY gtfs_stop_times.trip_id, gtfs_stop_times.stop_sequence;
)''')
f = open("gtfs_update\\stop_times.txt", "w")
writer = csv.writer(f, dialect = 'excel')
writer.writerow([i[0] for i in c.description]) # write headers
writer.writerows(c)
del writer
Is there any way to speed up Step 4 (possibly by changing how I add and/or use the index) or should I just go to lunch while this runs?
I have added PRAGMA statements to try to improve performance to no avail:
c.execute('PRAGMA main.page_size = 4096')
c.execute('PRAGMA main.cache_size=10000')
c.execute('PRAGMA main.locking_mode=EXCLUSIVE')
c.execute('PRAGMA main.synchronous=NORMAL')
c.execute('PRAGMA main.journal_mode=WAL')
c.execute('PRAGMA main.cache_size=5000')
The SELECT executes extremely fast because there is no gtfs_stops table and you get nothing but an error message.
If we assume that there is a gtfs_stops table, then your trip_seq index is already quite optimal for the query.
However, you also need an index for looking up stop_code values in the gtfs_stops table.
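A sketch of what that could look like, run against an empty in-memory database. The gtfs_stops table is not shown in the question, so its two columns are an assumption, and `stop_code_idx` is just a name picked for the example. `EXPLAIN QUERY PLAN` is SQLite's way of reporting which indexes a query would use, so it can be run before and after adding the index to compare.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
c = conn.cursor()
# Assumed layout: only the two columns the query needs
c.execute("CREATE TABLE gtfs_stops (stop_id TEXT, stop_code TEXT)")
c.execute("""CREATE TABLE gtfs_stop_times (
                 trip_id TEXT, arrival_time TEXT, departure_time TEXT,
                 stop_id TEXT, stop_sequence INT NOT NULL)""")
c.execute("CREATE INDEX trip_seq ON gtfs_stop_times (trip_id, stop_sequence)")
# The extra index: lets each stop_times row find its stop without scanning gtfs_stops
c.execute("CREATE INDEX stop_code_idx ON gtfs_stops (stop_code)")

query = """SELECT st.trip_id, st.arrival_time, st.departure_time,
                  s.stop_id, st.stop_sequence
           FROM gtfs_stop_times st
           JOIN gtfs_stops s ON st.stop_id = s.stop_code
           ORDER BY st.trip_id, st.stop_sequence"""

# The last column of each plan row holds the human-readable description
plan = [row[-1] for row in c.execute("EXPLAIN QUERY PLAN " + query)]
for step in plan:
    print(step)
```

If the printed plan still shows a full scan of gtfs_stops for every row, the index is not being used and the join condition or column types are worth a second look.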