Comparing two tables with SQLAlchemy - python

I am having a great deal of trouble with this problem. I am trying to compare two different tables from two different databases to see what tuples have been added, what tuples have been deleted, and what tuples have been updated. I do that with the following code:
from sqlalchemy import *

# query the databases to get all tuples from the relations
# save each relation to a list in order to be able to iterate over their tuples multiple times
# iterate through the lists, hash each tuple with k, v being primary key, tuple
# iterate through the "after" relation. for each tuple in the new relation, hash its key in the "before" relation.
#   If it's found and the tuple is different, consider that an update, else, do nothing.
#   If it is not found, consider that an insert.
# iterate through the "before" relation. for each tuple in the "before" relation, hash by the primary key.
#   if the tuple is found in the "after" relation, do nothing
#   if not, consider that a delete.

dev_engine = create_engine('mysql://...')
prod_engine = create_engine('mysql://...')

def transactions(exchange):
    dev_connect = dev_engine.connect()
    prod_connect = prod_engine.connect()

    get_dev_instrument = "select * from " + exchange + "_instrument;"
    instruments = dev_engine.execute(get_dev_instrument)
    instruments_list = [r for r in instruments]
    print 'made instruments_list'

    get_prod_instrument = "select * from " + exchange + "_instrument;"
    instruments_after = prod_engine.execute(get_prod_instrument)
    instruments_after_list = [r2 for r2 in instruments_after]
    print 'made instruments_after_list'

    before_map = {}
    after_map = {}
    # build the maps from the saved lists (the result proxies above are
    # already exhausted by the list comprehensions)
    for row in instruments_list:
        before_map[row['instrument_id']] = row
    for y in instruments_after_list:
        after_map[y['instrument_id']] = y
    print 'formed maps'

    update_count = insert_count = delete_count = 0
    change_list = []

    for prod_row in instruments_after_list:
        result = list(prod_row)
        try:
            row = before_map[prod_row['instrument_id']]
            if not row == prod_row:
                update_count += 1
                for i in range(len(row)):
                    if not row[i] == prod_row[i]:
                        result[i] = str(row[i]) + '--->' + str(prod_row[i])
                result.append("updated")
                change_list.append(result)
        except KeyError:
            insert_count += 1
            result.append("inserted")
            change_list.append(result)

    for before_row in instruments_list:
        result = list(before_row)
        try:
            after_row = after_map[before_row['instrument_id']]
        except KeyError:
            delete_count += 1
            result.append("deleted")
            change_list.append(result)

    for el in change_list:
        print el
    print "Insert: " + str(insert_count)
    print "Update: " + str(update_count)
    print "Delete: " + str(delete_count)

    dev_connect.close()
    prod_connect.close()

def main():
    transactions("...")

main()
instruments is the "before" table and instruments_after is the "after" table, so I want to see the changes that occurred to change instruments to instruments_after.
The above code works well, but fails when instruments or instruments_after is very large. I have a table that is over 4 million rows long, and simply trying to load it into memory causes Python to exit. I have tried to overcome this by using LIMIT/OFFSET in my queries to build the lists in pieces, but Python still exits because two lists of that size simply take up too much space. My last option is to choose a batch from one relation and iterate through batches of the second relation making comparisons, but that is extremely error prone. Is there another way to circumvent this problem? I have considered allocating more memory to my VM, but I feel the space complexity of my code is the real issue and should be fixed first.
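One direction that avoids holding either table in memory (a sketch under assumptions, not a drop-in fix): read both tables ordered by the primary key and merge-compare the two streams row by row. This assumes instrument_id is the primary key, and that your driver supports server-side cursors via SQLAlchemy's stream_results execution option; otherwise the result set may still be buffered client-side.

def stream_rows(engine, exchange):
    # Server-side cursor so rows are fetched incrementally, not all at once
    conn = engine.connect().execution_options(stream_results=True)
    return conn.execute("select * from " + exchange + "_instrument order by instrument_id;")

def transactions_streaming(exchange):
    before = stream_rows(dev_engine, exchange)
    after = stream_rows(prod_engine, exchange)
    b = before.fetchone()
    a = after.fetchone()
    # Classic sorted-merge: advance whichever stream has the smaller key
    while b is not None or a is not None:
        if a is None or (b is not None and b['instrument_id'] < a['instrument_id']):
            print list(b) + ['deleted']       # key only on the "before" side
            b = before.fetchone()
        elif b is None or a['instrument_id'] < b['instrument_id']:
            print list(a) + ['inserted']      # key only on the "after" side
            a = after.fetchone()
        else:
            if tuple(b) != tuple(a):          # same key, different values
                print list(a) + ['updated']
            b = before.fetchone()
            a = after.fetchone()

Because both streams arrive ordered by the same key, memory use stays constant no matter how large the tables are.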

Getting the last two variables in for loop

I am trying to make a program that shows me the data of two specific coins. What it basically does is take the data in an infinite "for loop" and display the info until I close the program.
Now I am trying to get the last two elements of this loop every time it runs and make calculations with them. I know I can't just hold all the items in a list, and I am not sure how to store the last two and use them each time.
for line in lines:
    coinsq = line.strip()
    url = priceKey + coinsq + "USDT"
    data = requests.get(url)
    datax = data.json()
    print(datax['symbol'] + " " + datax['price'])
Store the data in a deque (from the collections module).
Initialise your deque like this:
from collections import deque
d = deque([], 2)
Now you can append to d as many times as you like and it will only ever have the most recent two entries.
So, for example:
d.append('a')
d.append('b')
d.append('c')

for e in d:
    print(e)
Will give the output:
b
c
Adapting your code to use this technique should be trivial.
I recommend this approach over using two variables because it's easier to change: if you later decide (for some reason) that you want the last N values, all you need to do is change the deque constructor.
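For completeness, here's a sketch of that adaptation (assuming lines and priceKey are set up as in the question):

from collections import deque
import requests

d = deque([], 2)  # only ever keeps the two most recent responses

for line in lines:
    coinsq = line.strip()
    url = priceKey + coinsq + "USDT"
    data = requests.get(url)
    datax = data.json()
    print(datax['symbol'] + " " + datax['price'])
    d.append(datax)  # older entries fall out of the deque automatically

# d now holds the last two datax values for your calculations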
You can just use two variables that you update for each new element; at the end you will have the last two elements seen:
pre_last = None
last = None

for line in lines:
    coinsq = line.strip()
    url = priceKey + coinsq + "USDT"
    data = requests.get(url)
    datax = data.json()
    print(datax['symbol'] + " " + datax['price'])
    pre_last = last
    last = datax

# Do the required calculations with last and pre_last
(And just to be exact, this isn't an infinite loop; otherwise there wouldn't be a 'last' element.)
As your script does not know in advance when execution will halt, I suggest defining a queue-like structure. In each iteration, you update your last item and your previous-to-last; that way, you only keep two elements in memory. I don't know how you were planning on accessing those two elements when execution has finished, but you should be able to read that queue once it is over.
Sorry for not providing code, but this can be done in many ways; I thought it was better to suggest a way of proceeding.
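A minimal sketch of that idea (produce_values here is a stand-in for the real data source in the question):

def produce_values():
    # Stand-in for whatever produces your values (the API calls above)
    for n in range(5):
        yield n

last_two = []  # queue-like structure holding at most the two newest items

for item in produce_values():
    last_two.append(item)
    if len(last_two) > 2:
        last_two.pop(0)  # drop the oldest so only the last two remain

print(last_two)  # -> [3, 4]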
You can define a variable for the second-to-last element of the for loop, and use the datax variable that's already defined in the loop as the last element:
sec_last = None
datax = None

for line in lines:
    sec_last = datax
    coinsq = line.strip()
    url = priceKey + coinsq + "USDT"
    data = requests.get(url)
    datax = data.json()
    print(datax['symbol'] + " " + datax['price'])

print("Last element:", datax)
print("Second last element:", sec_last)

CS50 'DNA': Ways to speed up my Week 6 'dna.py' program?

So for this problem I had to create a program that takes in two arguments. A CSV database like this:
name,AGATC,AATG,TATC
Alice,2,8,3
Bob,4,1,5
Charlie,3,2,5
And a DNA sequence like this:
TAAAAGGTGAGTTAAATAGAATAGGTTAAAATTAAAGGAGATCAGATCAGATCAGATCTATCTATCTATCTATCTATCAGAAAAGAGTAAATAGTTAAAGAGTAAGATATTGAATTAATGGAAAATATTGTTGGGGAAAGGAGGGATAGAAGG
My program works by first getting the "Short Tandem Repeat" (STR) headers from the database (AGATC, etc.), then counting the highest number of times each STR repeats consecutively within the sequence. Finally, it compares these counted values to the values of each row in the database, printing out a name if a match is found, or "No match" otherwise.
The program works for sure, but it is ridiculously slow whenever it is run against the larger database provided, to the point where the terminal pauses for an entire minute before returning any output. Unfortunately, this causes the 'check50' marking system to time out and return a negative result when testing with this large database.
I'm presuming the slowdown is caused by the nested loops within the 'STR_count' function:
def STR_count(sequence, seq_len, STR_array, STR_array_len):
    # Creates a list to store max recurrence values for each STR
    STR_count_values = [0] * STR_array_len
    # Temp value to store current count of STR recurrence
    temp_value = 0
    # Iterates over each STR in STR_array
    for i in range(STR_array_len):
        STR_len = len(STR_array[i])
        # Iterates over each sequence element
        for j in range(seq_len):
            # Ensures it's still physically possible for STR to be present in sequence
            while (seq_len - j >= STR_len):
                # Gets sequence substring of length STR_len, starting from jth element
                sub = sequence[j:(j + (STR_len))]
                # Compares current substring to current STR
                if (sub == STR_array[i]):
                    temp_value += 1
                    j += STR_len
                else:
                    # Ensures current STR_count_value is highest
                    if (temp_value > STR_count_values[i]):
                        STR_count_values[i] = temp_value
                    # Resets temp_value to break count, and pushes j forward by 1
                    temp_value = 0
                    j += 1
        i += 1
    return STR_count_values
And the 'DNA_match' function:
# Searches database file for DNA matches
def DNA_match(STR_values, arg_database, STR_array_len):
    with open(arg_database, 'r') as csv_database:
        database = csv.reader(csv_database)
        name_array = [] * (STR_array_len + 1)
        next(database)
        # Iterates over one row of database at a time
        for row in database:
            name_array.clear()
            # Copies entire row into name_array list
            for column in row:
                name_array.append(column)
            # Converts name_array number strings to actual ints
            for i in range(STR_array_len):
                name_array[i + 1] = int(name_array[i + 1])
            # Checks if a row's STR values match the sequence's values, prints the row name if match is found
            match = 0
            for i in range(0, STR_array_len, + 1):
                if (name_array[i + 1] == STR_values[i]):
                    match += 1
            if (match == STR_array_len):
                print(name_array[0])
                exit()
        print("No match")
        exit()
However, I'm new to Python, and haven't really had to consider speed before, so I'm not sure how to improve upon this.
I'm not particularly looking for people to do my work for me, so I'm happy for any suggestions to be as vague as possible. And honestly, I'll value any feedback, including stylistic advice, as I can only imagine how disgusting this code looks to those more experienced.
Here's a link to the full program, if helpful.
Thanks :) x
Thanks for providing a link to the entire program. It seems needlessly complex, but I'd say that's just a lack of knowing what features are available to you. I think you've already identified the part of your code that's causing the slowness; I haven't profiled it or anything, but my first suspect would also be the three nested loops in STR_count.
Here's how I would write it, taking advantage of the Python standard library. Every entry in the database corresponds to one person, so that's what I'm calling them. people is a list of dictionaries, where each dictionary represents one line in the database. We get this for free by using csv.DictReader.
To find the matches in the sequence, for every short tandem repeat in the database, we create a regex pattern (the current short tandem repeat, repeated one or more times). If there is a match in the sequence, the total number of repetitions is equal to the length of the match divided by the length of the current tandem repeat. For example, if AGATCAGATCAGATC is present in the sequence, and the current tandem repeat is AGATC, then the number of repetitions will be len("AGATCAGATCAGATC") // len("AGATC") which is 15 // 5, which is 3.
count is just a dictionary that maps short tandem repeats to their corresponding number of repetitions in the sequence. Finally, we search for a person whose short tandem repeat counts match those of count exactly, and print their name. If no such person exists, we print "No match".
def main():
    import argparse
    from csv import DictReader
    import re

    parser = argparse.ArgumentParser()
    parser.add_argument("database_filename")
    parser.add_argument("sequence_filename")
    args = parser.parse_args()

    with open(args.database_filename, "r") as file:
        reader = DictReader(file)
        short_tandem_repeats = reader.fieldnames[1:]
        people = list(reader)

    with open(args.sequence_filename, "r") as file:
        sequence = file.read().strip()

    count = dict(zip(short_tandem_repeats, [0] * len(short_tandem_repeats)))

    for short_tandem_repeat in short_tandem_repeats:
        pattern = f"({short_tandem_repeat}){{1,}}"
        match = re.search(pattern, sequence)
        if match is None:
            continue
        count[short_tandem_repeat] = len(match.group()) // len(short_tandem_repeat)

    try:
        person = next(person for person in people if all(int(person[k]) == count[k] for k in short_tandem_repeats))
        print(person["name"])
    except StopIteration:
        print("No match")

    return 0

if __name__ == "__main__":
    import sys
    sys.exit(main())
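One caveat worth noting: re.search returns the first run of repeats, which is not necessarily the longest run in the sequence. A hedged variant (my addition, not part of the answer above) scans every run and keeps the longest; a small self-contained demo:

import re

sequence = "AATGAATG" + "TTTT" + "AATGAATGAATG"  # the longest AATG run is the second one
pattern = "(AATG){1,}"

# re.search only sees the first run (2 repeats):
first = re.search(pattern, sequence).group()
# scanning all runs and keeping the longest finds 3 repeats:
longest = max((m.group() for m in re.finditer(pattern, sequence)), key=len, default="")

print(len(first) // len("AATG"))    # -> 2
print(len(longest) // len("AATG"))  # -> 3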

Python list does not append correctly (IndexError: list index out of range)

I tried different approaches for 3 hours now and I just don't get why this does not work.
current_stock_dict = db.execute("SELECT * FROM current_stocks WHERE c_user_id=:user_id ", user_id=session["user_id"])

# make a list for the mainpage
mainpage_list = [[],[]]
# save the length of the dict
lengh_dict = len(current_stock_dict)

price_sum = 0
share_sum = 0

# iterate over all rows in the dict
for i in range(0, (lengh_dict - 1)):
    # lookup the symbol in the current stocks
    c_symbol = current_stock_dict[i]["c_symbol"]
    lookup_symbol = lookup(c_symbol)
    # append the symbol to the list for the mainpage
    mainpage_list[i].append(c_symbol)
    # append the name of the share
    share_name = lookup_symbol["name"]
    mainpage_list[i].append(share_name)
    # append the count of shares for mainpage
    c_count = current_stock_dict[i]["c_count"]
    mainpage_list[i].append(c_count)
    # append the current price
    share_price = lookup_symbol["price"]
    mainpage_list[i].append("$" + str(share_price))
    # append the total price of all shares
    total_price = float(share_price) * int(c_count)
    mainpage_list[i].append("$" + str(total_price))
    # count up the price and shares
    price_sum += total_price
    share_sum += c_count
When I run my website via Flask, I get an error message saying:
IndexError: list index out of range
in the line:
mainpage_list[i].append(c_symbol)
(and I guess if it did not already fail there, I'd get it for the rest of the lines too).
As long as lengh_dict = len(current_stock_dict) is 3 or less (so the SQL db has 3 rows or less), the error message does not appear and the code works fine. I do not really understand lists (and multidimensional lists) in Python yet, so I would be happy if somebody could explain my mistake to me.
Normally I would print out a lot of things and just try out where the mistake is, but I just began using Flask and I can't print out lists, dicts or anything if the code stops before reaching the bug.
Thanks already for your help!!!
Let's look at the relevant part of your code.
mainpage_list = [[],[]]

for i in range(0, (lengh_dict - 1)):
    mainpage_list[i].append(c_symbol)
mainpage_list is a list that contains two elements, both of which are empty lists. So mainpage_list[0] is the first empty list inside mainpage_list, and mainpage_list[1] is the second one. Any index above that will result in an IndexError.
It is not exactly clear what you are trying to achieve, but you could initialize mainpage_list with the correct number of empty lists inside, if that is what you need. E.g., for the case where you want as many empty lists as the length of current_stock_dict, you could do:
mainpage_list = [[] for _ in range(lengh_dict)]
The issue here is that the list mainpage_list is a two-element list, and you're trying to access its third element.
Generally, when processing lists of indeterminate size, I prefer to iterate and append rather than indexing into the list.
This gives you something like:
source = ["abc", "def", "ghi"] # List of data to process
target = [] # The processed data
for row in source: # For every row of data
value = [] # Empty list to accumate result in
value.append(row[2])
value.append(row[1])
value.append(row[0])
target.append(value)
print(target)
which will work for any size of source list.
Applying this to your code gives you:
# current_stock is a list of dictionaries.
current_stock = db.execute("SELECT * FROM current_stocks WHERE c_user_id=:user_id ", user_id=session["user_id"])
# make a list for the mainpage
mainpage_list = []
price_sum = 0
share_sum = 0
# iterate over all rows in current_stock
for row in current_stock:
value = []
# lookup the symbol in the current stocks
c_symbol = row["c_symbol"]
lookup_symbol = lookup(c_symbol)
# append the symbol to the list for the mainpage
value.append(c_symbol)
# append the name of the share
share_name = lookup_symbol["name"]
value.append(share_name)
# append the count of shares for mainpage
c_count = row["c_count"]
value.append(c_count)
# deleted code
# count up the price and shares
price_sum += total_price
share_sum += c_count
mainpage_list.append(value)

Grabbing values of an array in sets of 100

In the code below, ids is an array which contains the steam64 ids of all users in your friends list. Now, according to the Steam web API documentation, GetPlayerSummaries only takes a list of 100 comma-separated steam64 ids. Some users have more than 100 friends, and instead of running a for loop 200 times that calls the API each time, I want to take the array in sets of 100 steam ids. What would be the most efficient way to do this (in terms of speed)?
I know that I can do ids[0:100] to grab the first 100 elements of an array, but how do I accomplish this for a friends list of, say, 230 users?
def getDescriptions(ids):
    sids = ','.join(map(str, ids))
    r = requests.get('http://api.steampowered.com/ISteamUser/GetPlayerSummaries/v0002/?key=' + API_KEY + '&steamids=' + sids)
    data = r.json()
    ...
Utilizing the code from this answer, you are able to break this into groups of 100 friends (or fewer, in the last group).
def chunkit(lst, n):
    newn = int(len(lst)/n)
    for i in xrange(0, n-1):
        yield lst[i*newn:i*newn+newn]
    yield lst[n*newn-newn:]

def getDescriptions(ids):
    friends = chunkit(ids, 3)
    while (True):
        try:
            fids = friends.next()
            sids = ','.join(map(str, fids))
            r = requests.get('http://api.steampowered.com/ISteamUser/GetPlayerSummaries/v0002/?key=' + API_KEY + '&steamids=' + sids)
            data = r.json()
            # Do something with the data variable
        except StopIteration:
            break
This will create an iterator broken into 3 (the second parameter to chunkit) groups. I chose 3 because the base size of the friends list is 250. You can get more (rules from this post), but 3 is a safe place to start; you can fine-tune that value as you need.
Utilizing this method, your data value will be overwritten on each loop, so make sure you do something with it at the place indicated.
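For reference, a simpler fixed-size alternative (my sketch, not from the linked answer): slice the list directly in steps of 100, the documented per-call maximum, so the group size rather than the group count is fixed:

def chunks_of(lst, size=100):
    # Yield successive slices of at most `size` elements
    for i in xrange(0, len(lst), size):
        yield lst[i:i + size]

# usage: each API call receives at most 100 ids
# for chunk in chunks_of(ids):
#     sids = ','.join(map(str, chunk))
#     ...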
I have an easy alternative: just reduce your list size on each loop until exhaustion:

def getDescriptions(ids):
    sids = ','.join(map(str, ids))
    sids_queue = sids.split(',')
    data = []
    while len(sids_queue) != 0:
        r = requests.get('http://api.steampowered.com/ISteamUser/GetPlayerSummaries/v0002/?key=' +
                         API_KEY + '&steamids=' + ','.join(sids_queue[:100]))
        data.append(r.json())
        # then skip the first 100 and reassign to sids_queue, you get the idea
        sids_queue = sids_queue[100:]

How can I create a table with reportlab using objects in an existing list

At the moment I have a list of objects as below. And I can print them out by iterating over them no problem. But I don't understand how I can print these out in a table.
people = [("John","Smith"), ("Jane","Doe"), ("Jane","Smith")]
for x in people:
person = x
lineText = (person.getFirstName() + " " + person.getLastName())
p = Paragraph(lineText, helveticaUltraLight)
Story.append(p)
I had a look at this example, specifically the enumeration of the users in it. However, this always falls over.
I figured it out:
people = [("John","Smith"), ("Jane","Doe"), ("Jane","Smith")]
table_data = []
for i, person in enumerate(people):
# Add a row to the table
table_data.append([person[0], person[1]])
# Create the table
issue_table = Table(table_data, colWidths=[doc.width/3.0]*3)
Story.append(issue_table)
