I'm new to Python and am facing what seems to be a memory leak.
I've written a simple script that fetches pairs of columns from a Postgres database, performs a simple subtraction on each pair, and writes the result (held in a temporary variable) to a file. I need to do this for multiple pairs of columns, so I'm using a list of lists to store the different column names.
I loop over the individual elements of this list until it is exhausted. While I'm getting valid results (by valid I mean that the output file contains the expected values) for the first few column pairs, the program abruptly gets "Killed" somewhere in the middle of execution. Code below:
import sys
import time
import psycopg2

varList = [['table1', 'col1', 'col2'],
           ['table1', 'col3', 'col4'],
           ['table2', 'col1', 'col2'],
           # ..
           # and many more such lines
           # ..
           ['table2', 'col3', 'col4']]

try:
    conn = psycopg2.connect(database='somename', user='someuser', password='somepasswd')
    c = conn.cursor()
    for listVar in varList:
        c.execute("SELECT %s FROM %s" % (listVar[1], listVar[0]))
        rowsList1 = c.fetchall()
        c.execute("SELECT %s FROM %s" % (listVar[2], listVar[0]))
        rowsList2 = c.fetchall()
        outfile = file('%s__%s' % (listVar[1], listVar[2]), 'w')
        for i in range(0, len(rowsList1)):
            if rowsList1[i][0] == None or rowsList2[i][0] == None:
                timeDiff = -1
            else:
                timestamp1 = time.mktime(rowsList1[i][0].timetuple())
                timestamp2 = time.mktime(rowsList2[i][0].timetuple())
                timeDiff = timestamp2 - timestamp1
            outfile.write(str(timeDiff) + '\n')
        outfile.close()
        del rowsList1, rowsList2
        #numpy.savetxt('output.dat', column_stack(rows))
except psycopg2.DatabaseError, e:
    print 'Error %s' % e
    sys.exit(1)
finally:
    if conn:
        conn.close()
My initial guess was that there was some form of memory leak, so in an attempt to fix it I added a del statement on the two large lists, hoping that the memory would be properly collected. This time I got slightly better output (by slightly better I mean that more output files were created for the db column pairs).
However, after the 10th or 11th pair of columns, my program was "Killed" again. Can someone tell me what could be wrong here? Is there a better way of getting this done?
Any help is appreciated.
PS: I know this is a fairly inefficient implementation, as I'm looping many times, but I needed something quick and dirty as a proof of concept.
I think the problem here is that you are selecting everything and then filtering it in the application code, when you should be doing the filtering in the SQL query. If you select only what you want in the SQL query, like this:
for listVar in varList:
    # only fetch the rows where both columns are non-null
    c.execute("SELECT %s, %s FROM %s WHERE %s IS NOT NULL AND %s IS NOT NULL"
              % (listVar[1], listVar[2], listVar[0], listVar[1], listVar[2]))
    timeDiff = {}
    for row in c.fetchall():
        timestamp1 = time.mktime(row[0].timetuple())
        timestamp2 = time.mktime(row[1].timetuple())
        # still need to associate timeDiff with the row... maybe you need to
        # query a unique identifier as well?
        timeDiff[identifier] = timestamp2 - timestamp1

    # ...and possibly a separate query (this may not be necessary depending on your
    # application code: do you really need -1's for the irrelevant data, or can you
    # just return the important data?)
    c.execute("SELECT %s, %s FROM %s WHERE %s IS NULL OR %s IS NULL"
              % (listVar[1], listVar[2], listVar[0], listVar[1], listVar[2]))
    for row in c.fetchall():
        timeDiff[identifier] = -1  # or None
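Going a step further (this is only a sketch, not part of the original answer): you could also let Postgres compute the difference itself and stream the rows through a psycopg2 named (server-side) cursor, so the full result set never has to sit in Python's memory at once. It reuses varList and the connection parameters from the question; EXTRACT(EPOCH FROM ...) assumes the columns are timestamps, and COALESCE(..., -1) reproduces the -1-for-NULL behaviour:

import psycopg2

conn = psycopg2.connect(database='somename', user='someuser', password='somepasswd')
for table, col1, col2 in varList:
    # A named cursor is a server-side cursor: psycopg2 fetches rows in
    # batches of itersize instead of loading everything at once.
    cur = conn.cursor(name='diff_cursor')
    cur.itersize = 10000
    # Interpolating identifiers is only acceptable here because the
    # table/column names come from the hard-coded varList, not user input.
    cur.execute("SELECT COALESCE(EXTRACT(EPOCH FROM (%s - %s)), -1) FROM %s"
                % (col2, col1, table))
    with open('%s__%s' % (col1, col2), 'w') as outfile:
        for (diff,) in cur:
            outfile.write(str(diff) + '\n')
    cur.close()
conn.close()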
I've been querying a few APIs with Python to individually create CSVs for a table.
Instead of recreating the table each time, I would like to update the existing table with any new API data.
At the moment, the way the query is working, I have a table that looks like this:
From this I am taking the suburbs of each state and copying them into a CSV for each different state.
Then, using this script, I am cleaning them into a list (the API needs "%20" for any spaces):
#suburbs = ["want this", "want this (meh)", "this as well (nope)"]
suburb_cleaned = []
#dont_want = frozenset( ["(meh)", "(nope)"] )
for urb in suburbs:
cleaned_name = []
name_parts = urb.split()
for part in name_parts:
if part in dont_want:
continue
cleaned_name.append(part)
suburb_cleaned.append('%20'.join(cleaned_name))
Then I take the suburbs for each state and put them into this API to return a CSV:
import time
import requests
import pandas as pd

timestr = time.strftime("%Y%m%d-%H%M%S")
Name = "price_data_NT" + timestr + ".csv"
url_price = "http://mwap.com/api"
string = 'gxg&state='
api_results = {}
n = 0
y = 2
for urbs in suburb_cleaned:
    url = url_price + urbs + string + "NT"
    print(url)
    print(urbs)
    request = requests.get(url)
    api_results[urbs] = pd.DataFrame(request.json())
    n = n + 1
    if n == y:
        # write out everything collected so far on every second suburb
        dfs = pd.concat(api_results).reset_index(level=1, drop=True).rename_axis(
            'key').reset_index().set_index(['key'])
        dfs.to_csv(Name, sep='\t', encoding='utf-8')
        y = y + 2
        continue
    print("made it through " + urbs)
    # print(request.json())
    # print(api_results)

dfs = pd.concat(api_results).reset_index(level=1, drop=True).rename_axis(
    'key').reset_index().set_index(['key'])
dfs.to_csv(Name, sep='\t', encoding='utf-8')
Then I add the states manually in Excel, and combine and clean the suburb names:
# use pd.concat
df = pd.concat([act, vic, nsw, SA, QLD, WA]).reset_index().set_index(['key']).rename_axis('suburb').reset_index().set_index(['state'])
# apply lambda to clean the %20
f = lambda s: s.replace('%20', ' ')
df['suburb'] = df['suburb'].apply(f)
and then finally insert it into a database:
engine = create_engine('mysql://username:password@localhost/dbname')
with engine.connect() as conn, conn.begin():
    df.to_sql('Price_historic', conn, if_exists='replace', index=False)
This leads to this sort of output:
Now, this is a heck of a process. I would love to simplify it and have the database update only the values that are needed from the API, without this much complexity in getting the data.
I would love some helpful tips on achieving this goal. I'm thinking I could do an update on the MySQL database instead of an insert, or something? And with the querying of the API, I feel like I'm overcomplicating it.
Thanks!
I don't see any reason why you would be creating CSV files in this process. It sounds like you can just query the data and then load it into a MySQL table directly. You say that you are adding the states manually in Excel? Is that data not available through your prior API calls? If not, could you find that information and save it to a CSV, so you can automate that step by loading it into a table and having Python look up the values for you?
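If that route works for you, a rough sketch of that lookup-table step (the file name, table name and the use of pandas here are assumptions, not from the question):

import pandas as pd
from sqlalchemy import create_engine

engine = create_engine('mysql://username:password@localhost/dbname')
# Hypothetical one-off lookup file mapping each suburb to its state
states = pd.read_csv('state_suburb_lookup.csv')
states.to_sql('state_lookup', engine, if_exists='replace', index=False)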
Generally, you wouldn't want to overwrite the MySQL table every time. When you have a table, you can identify the column or columns that uniquely identify a specific record, then create a UNIQUE INDEX on them. For example, if your street and price values designate a unique entry, then in MySQL you could run:
ALTER TABLE `Price_historic` ADD UNIQUE INDEX(street, price);
After this, your table will not allow duplicate records based on those values. Then, instead of creating a new table every time, you can insert your data into the existing table, with instructions to either update or ignore when you encounter a duplicate. For example:
final_str = "INSERT INTO Price_historic (state, suburb, property_price_id, type, street, price, date) " \
"VALUES (%s, %s, %s, %s, %s, %s, %s, %s) " \
"ON DUPLICATE KEY UPDATE " \
"state = VALUES(state), date = VALUES(date)"
con = pdb.connect(db_host, db_user, db_pass, db_name)
with con:
try:
cur = con.cursor()
cur.executemany(final_str, insert_list)
If the setup you are building is meant for the longer term, I would suggest running 2 different processes in parallel:
Process 1:
Query API 1, obtain the required data and insert it into the DB table, with a binary/bit flag that indicates only API 1 has been called.
Process 2:
Run a query on the DB to obtain all records needed for API call 2, based on the binary/bit flag set in process 1. For the corresponding data, run call 2 and update the data back into the DB table based on the primary key.
Database: I would suggest adding a primary key as well as a [Bit Flag][1] that gives the status of the different API calls. The bit flag also helps you
- in case you want to double-check whether a specific API call has been made for a specific record or not;
- to expand your project to additional API calls while still tracking the status of each call at the record level.
[1]: Bit Flags: https://docs.oracle.com/cd/B28359_01/server.111/b28286/functions014.htm#SQLRF00612
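A minimal sketch of what that bit-flag bookkeeping could look like; the table name comes from the question, but the column names, the driver choice and the call_second_api helper are assumptions:

import pymysql  # just one possible MySQL driver; an assumption, not from the answer

def call_second_api(suburb):
    # hypothetical stand-in for the second API call
    return None

con = pymysql.connect(host='localhost', user='username', password='password', database='dbname')
cur = con.cursor()

# One-time schema change: add a status flag for the second API call
# (column names here are assumed, not taken from the question).
cur.execute("ALTER TABLE Price_historic "
            "ADD COLUMN api2_done TINYINT(1) NOT NULL DEFAULT 0")

# Process 2: pick up only the rows that API call 2 still has to handle...
cur.execute("SELECT id, suburb FROM Price_historic WHERE api2_done = 0")
for row_id, suburb in cur.fetchall():
    extra = call_second_api(suburb)
    # ...then update the row and flip the flag so it is not processed twice.
    cur.execute("UPDATE Price_historic "
                "SET extra_data = %s, api2_done = 1 WHERE id = %s",
                (extra, row_id))
con.commit()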
I have a database table that includes TRADE_DATE, CURRVAL, and ITEM fields. I first have two arrays/lists: arrVars (strings), dates (dates). Each string in arrVars represents an ITEM for which I need to retrieve the CURRVALs for each TRADE_DATE in dates. I'm new to Python, and I'm certainly no expert w/ databases, and I'm sure there are ways to speed my code up.
The first part just creates the dates list from my first db connection: I iterate through each row and append it to the dates list. Is there a better way?
i = 0
for row in cursor.fetchall():
    dates.append(row[0])
    i += 1
Second, I'm looping through each ITEM in arrVars and then looping through each TRADE_DATE in dates to create each array of CURRVALs and put the arrays into a matrix. This is pretty damn slow, so I'm hoping there is a better way as well.
M = []
dtFormat = '%Y/%m/%d'
for item in arrVars:
    tmp = []
    for dt in dates:
        strSQL = "SELECT CURRVAL FROM tblGanData WHERE ITEM = '" + item + \
                 "' AND TRADE_DATE = #" + dt.strftime(dtFormat) + "#"
        cursor.execute(strSQL)
        tmp.append(cursor.fetchone()[0])
    M.append(tmp)
Thank you!!
For the first bit, you might want to do something like this:
dates = [row[0] for row in cursor.fetchall()]
But I'd be very interested in seeing the SQL statement that you're using for that cursor.
select some_date from my_table
is going to be faster than
select * from my_table
(How much faster depends on how many rows you are getting back and the speed of the network connection between your client and server.)
For your second part, you're executing one query (with the full round-trip cost) for each Item/Date combination.
So maybe something like this
# build a list of all the dates
dates_str = ",".join(['#' + dt.strftime(dtFormat) + '#'
                      for dt in dates])
# build a list of all the items
items_str = ",".join(["'" + item + "'" for item in items])
# run one SQL query that gets everything
cursor.execute("""
    select item, trade_date, currval
    from tblGanData
    where item in (%s)
      and trade_date in (%s)
    order by item, trade_date
    """ % (items_str, dates_str))
You will have to do some logic when fetching the values to turn the list of currvals into a matrix; let me know if you need help with that.
P.S. How many items/dates are we talking about here?
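For what it's worth, a rough sketch of that regrouping step (untested; it assumes every item really does have a row for every date, as the original per-date queries did):

from collections import defaultdict

# Group CURRVALs by item; the ORDER BY item, trade_date above keeps
# each item's values in date order.
by_item = defaultdict(list)
for item, trade_date, currval in cursor.fetchall():
    by_item[item].append(currval)

# Rebuild the matrix in the same order as the original arrVars list.
M = [by_item[item] for item in arrVars]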
I've been learning Python recently and have learned how to connect to a database and retrieve data using MySQLdb. However, all the examples show how to get multiple rows of data. I want to know how to retrieve only one row of data.
This is my current method.
cur.execute("SELECT number, name FROM myTable WHERE id='" + id + "'")
results = cur.fetchall()
number = 0
name = ""
for result in results:
number = result['number']
name = result['name']
It seems redundant to do for result in results: since I know there is only going to be one result.
How can I just get one row of data without using the for loop?
.fetchone() to the rescue:
result = cur.fetchone()
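Slightly expanded, as a sketch that also switches to a parameterized query and handles the no-match case (it assumes a DictCursor, as in the question's code):

cur.execute("SELECT number, name FROM myTable WHERE id = %s", (id,))
row = cur.fetchone()   # None if no row matched
if row is not None:
    number = row['number']
    name = row['name']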
use .pop()
if results:
    result = results.pop()
    number = result['number']
    name = result['name']
I have a csv file of customer ids (CRM_id). I need to get their primary keys (an autoincrement int) from the customers table of the database. (I can't be assured of the integrity of the CRM_ids so I chose not to make that the primary key).
So:
customers = []
with open("CRM_ids.csv", 'r', newline='') as csvfile:
customerfile = csv.DictReader(csvfile, delimiter = ',', quotechar='"', skipinitialspace=True)
#only one "CRM_id" field per row
customers = [c for c in customerfile]
So far so good? I think this is the most pythonesque way of doing that (but happy to hear otherwise).
Now comes the ugly code. It works, but I hate appending to the list because that has to copy and reallocate memory on each loop, right? Is there a better way (pre-allocating and enumerating to keep track of the index comes to mind, but maybe there's an even quicker/better way of being clever with the SQL so as not to do several thousand separate queries...)?
cnx = mysql.connector.connect(user='me', password=sys.argv[1], host="localhost", database="mydb")
cursor = cnx.cursor()

select_customer = ("SELECT id FROM customers WHERE CRM_id = %(CRM_id)s LIMIT 1;")

c_ids = []
for row in customers:
    cursor.execute(select_customer, row)
    # note: each fetched row is a tuple, but the SELECTed set
    # only has a single column, so we pull that column out with [0]
    c_ids.extend(cursor.fetchall())

c_ids = [c[0] for c in c_ids]
Edit:
Purpose is to get the primary keys in a list so I can use these to allocate some other data from other CSV files in linked tables (the customer id primary key is a foreign key to these other tables, and the allocation algorithm changes, so it's best to have the flexibility to do the allocation in python rather than hard coding SQL queries). I know this sounds a little backwards, but the "client" only works with spreadsheets rather than an ERP/PLM, so I have to build the "relations" for this small app myself.
What about changing your query to get what you want?
crm_ids = ",".join(customers)
select_customer = "SELECT UNIQUE id FROM customers WHERE CRM_id IN (%s);" % crm_ids
MySQL should be fine with even a multi-megabyte query, according to the manual; if it gets to be a really long list, you can always break it up - two or three queries are guaranteed to be much faster than a few thousand.
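If the list ever does get long enough to be worth splitting, a sketch of the chunked version (the batch size of 1000 is an arbitrary assumption):

c_ids = []
batch = 1000  # arbitrary chunk size
for i in range(0, len(customers), batch):
    chunk = customers[i:i + batch]
    in_list = ",".join("'" + c['CRM_id'] + "'" for c in chunk)
    cursor.execute("SELECT DISTINCT id FROM customers WHERE CRM_id IN (%s)" % in_list)
    c_ids.extend(row[0] for row in cursor.fetchall())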
How about storing your CSV in a dict instead of a list:
customers = [c for c in customerfile]
becomes:
customers = {c['CRM_id']:c for c in customerfile}
then select the entire xref:
cursor.execute('select id, CRM_id from customers')
and add the new rowid as a new entry in the dict:
for row in cursor.fetchall():
    customers[row[1]]['newid'] = row[0]
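And if you still want the flat list of primary keys afterwards (a small follow-up, not part of the original answer):

# CRM_ids with no match in the customers table simply won't have a 'newid' key
c_ids = [c['newid'] for c in customers.values() if 'newid' in c]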
So I found a great script over at QuantState that had a great walk-through on setting up my own securities database and loading in historical pricing information. However, I'm now trying to modify the script so that I can run it daily and have the latest stock quotes added.
I adjusted the initial data load to just download 1 week's worth of historicals, but I've been having issues writing the SQL statement that checks whether the row already exists before adding it. Can anyone help me out with this? Here's what I have so far:
def insert_daily_data_into_db(data_vendor_id, symbol_id, daily_data):
    """Takes a list of tuples of daily data and adds it to the
    database. Appends the vendor ID and symbol ID to the data.

    daily_data: List of tuples of the OHLC data (with
    adj_close and volume)"""
    # Create the time now
    now = datetime.datetime.utcnow()

    # Amend the data to include the vendor ID and symbol ID
    daily_data = [(data_vendor_id, symbol_id, d[0], now, now,
                   d[1], d[2], d[3], d[4], d[5], d[6]) for d in daily_data]

    # Create the insert strings
    column_str = """data_vendor_id, symbol_id, price_date, created_date,
                    last_updated_date, open_price, high_price, low_price,
                    close_price, volume, adj_close_price"""
    insert_str = ("%s, " * 11)[:-2]
    final_str = "INSERT INTO daily_price (%s) VALUES (%s) WHERE NOT EXISTS (SELECT 1 FROM daily_price WHERE symbol_id = symbol_id AND price_date = insert_str[2])" % (column_str, insert_str)

    # Using the postgre connection, carry out an INSERT INTO for every symbol
    with con:
        cur = con.cursor()
        cur.executemany(final_str, daily_data)
Some notes regarding your code above:
- It's generally easier to defer to now() in pure SQL than to compute it in Python whenever possible. It avoids lots of potential pitfalls with timezones, library differences, etc.
- If you construct a list of columns, you can dynamically generate a string of %s's based on its size, and don't need to hardcode the length into a repeated string which is then sliced.
- Since it appears that insert_daily_data_into_db is meant to be called from within a loop on a per-stock basis, I don't believe you want to use executemany here, which would require a list of tuples and is semantically very different.
- You were comparing symbol_id to itself in the sub-select, instead of to a particular value (which would mean it's always true).
- To prevent possible SQL injection, you should always interpolate values in the WHERE clause, including sub-selects.
Note: I'm assuming that you're using psycopg2 to access Postgres, and that the primary key for the table is a tuple of (symbol_id, price_date). If not, the code below would need to be tweaked at least a bit.
With those points in mind, try something like this (untested, since I don't have your data, db, etc. but it is syntactically valid Python):
def insert_daily_data_into_db(data_vendor_id, symbol_id, daily_data):
    """Takes a list of tuples of daily data and adds it to the
    database. Appends the vendor ID and symbol ID to the data.

    daily_data: List of tuples of the OHLC data (with
    adj_close and volume)"""
    column_list = ["data_vendor_id", "symbol_id", "price_date", "created_date",
                   "last_updated_date", "open_price", "high_price", "low_price",
                   "close_price", "volume", "adj_close_price"]
    insert_list = ['%s'] * len(column_list)
    # 'now' is a special timestamp input string that Postgres resolves to the current time
    values_tuple = (data_vendor_id, symbol_id, daily_data[0], 'now', 'now', daily_data[1],
                    daily_data[2], daily_data[3], daily_data[4], daily_data[5], daily_data[6])

    final_str = """INSERT INTO daily_price ({0})
                   SELECT {1}
                   WHERE NOT EXISTS (SELECT 1
                                     FROM daily_price
                                     WHERE symbol_id = %s
                                       AND price_date = %s)""".format(', '.join(column_list), ', '.join(insert_list))

    # Using the postgres connection, carry out an INSERT INTO for every symbol
    with con:
        cur = con.cursor()
        cur.execute(final_str, values_tuple + (values_tuple[1], values_tuple[2]))
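Hypothetical usage, given that the rewritten function indexes daily_data as a single day's values rather than the original list of tuples (the numbers are made up):

import datetime

insert_daily_data_into_db(
    data_vendor_id=1,
    symbol_id=42,
    daily_data=(datetime.date(2015, 6, 1), 10.0, 10.5, 9.8, 10.2, 1000000, 10.2))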