Python SQLite3 UPDATE with tuple only uses last value

I'm trying to update all rows of 1 column of my database with a big tuple.
c.execute("SELECT framenum FROM learnAlg")
db_framenum = c.fetchall()
print(db_framenum)
db_framenum_new = []
# How much v6 framenum differentiates from v4
change_fn = 0
for f in db_framenum:
    t = f[0]
    if t in change_numbers:
        change_fn += 1
    t = t + change_fn
    db_framenum_new.append((t,))

print("")
print(db_framenum_new)

c.executemany("UPDATE learnAlg SET framenum=?", (db_framenum_new))
First I take the existing values of the column 'framenum', which look like:
[(0,), (1,), (2,), ..., (104,)]
Then I build a new list of tuples, changing some values in the for f in db_framenum: loop, which results in a similar list:
[(0,), (1,), (2,), ..., (108,)]
Problem
So far so good, but then I try to update the column 'framenum' with these new framenumbers:
c.executemany("UPDATE learnAlg SET framenum=?", (db_framenum_new))
I expect the rows in the column 'framenum' to have the new values, but instead they all have the value 108 (the last value in the list 'db_framenum_new'). Why are they not being updated in order (from 1 to 108)?
Expect:
framenum: 1, 2, .., 108
Got:
framenum: 108, 108, ..., 108
Note: The list of tuples has not become longer; only certain values have changed. Everything above 46 gets +1, everything above 54 gets an additional +1 (+2 total), and so on.
Note2: The column is created with 'framenum INTEGER'. Another column has the PRIMARY KEY, if this matters, made with 'framekanji TEXT PRIMARY KEY'; it has (for now) all values NULL.
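For context, the schema those notes describe amounts to roughly the following (a sketch; any further columns are not shown here):
c.execute("""
    CREATE TABLE IF NOT EXISTS learnAlg (
        framekanji TEXT PRIMARY KEY,  -- currently all NULL
        framenum INTEGER
    )""")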
Edit
Solved my problem, but I'm still interested in proper use of c.executemany(). I don't know why this only updates the first rowid:
c.execute("SELECT rowid, framenum FROM learnAlg")
db_framenum = c.fetchall()
print(db_framenum)
db_framenum_new = []
# How much v6 framenum differentiates from v4
change_fn = 0
for e, f in enumerate(db_framenum):
    e += 1
    t = f[1]
    if t in change_numbers:
        change_fn += 1
    t = t + change_fn
    db_framenum_new.append((e, t))

print(db_framenum_new)

c.executemany("UPDATE learnAlg SET framenum=? WHERE rowid=?",
              (db_framenum_new[1], db_framenum_new[0]))

Yes, you are telling the database to update all rows with the same framenum. That's because the UPDATE statement did not select any specific row. You need to tell the database to change one row at a time, by including a primary key for each value.
Since you are only altering specific framenumbers, you could ask the database to only provide those specific rows instead of going through all of them. You probably also need to specify an order in which to change the numbers; perhaps you need to do so in incrementing framenumber order?
c.execute("""
SELECT rowid, framenum FROM learnAlg
WHERE framenum in ({})
ORDER BY framenum
""".format(', '.join(['?'] * len(change_numbers))),
change_numbers)
update_cursor = conn.cursor()
for change, (rowid, f) in enumerate(c, 1):
    update_cursor.execute("""
        UPDATE learnAlg SET framenum=? WHERE rowid=?""",
        (f + change, rowid))
I altered the structure somewhat there; the query limits the results to frame numbers in the change_numbers sequence only, through a WHERE IN clause. I loop over the cursor directly (no need to fetch all results at once) and use separate UPDATEs to set the new frame number. Instead of a manual counter I used enumerate() to keep count for me.
If you needed to group the updates by change_numbers, then just tell the database to do those updates:
change = len(change_numbers)
for framenumber in reversed(change_numbers):
    update_cursor.execute("""
        UPDATE learnAlg SET framenum=framenum + ? WHERE framenum=?
        """, (change, framenumber))
    change -= 1
This starts at the highest framenumber to avoid updating framenumbers you already updated before. This does assume your change_numbers are sorted in incremental order.
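If change_numbers might not already be in incrementing order, a small variation (my sketch, not part of the original answer) is to sort a copy first and keep everything else the same:
# Hypothetical variant: drop the "already sorted" assumption by sorting a copy.
change = len(change_numbers)
for framenumber in sorted(change_numbers, reverse=True):
    update_cursor.execute(
        "UPDATE learnAlg SET framenum = framenum + ? WHERE framenum = ?",
        (change, framenumber))
    change -= 1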
Your executemany update should just pass in the whole list, not just the first two items; you do need to alter how you append the values:
for e, f in enumerate(db_framenum):
    # ...
    db_framenum_new.append((t, e))  # framenum first, then rowid

c.executemany("UPDATE learnAlg SET framenum=? WHERE rowid=?",
              db_framenum_new)
Note that the executemany() call takes place outside the for loop!

Thanks @Martijn Pieters, using rowid is what I needed. This is the code that made it work for me:
c.execute("SELECT rowid, framenum FROM learnAlg")
db_framenum = c.fetchall()
print(db_framenum)
# How much v6 framenum differentiates from v4
change_fn = 0
for e, f in enumerate(db_framenum):
    e += 1
    db_framenum_new = f[1]
    if db_framenum_new in change_numbers:
        change_fn += 1
    db_framenum_new = db_framenum_new + change_fn
    c.execute("UPDATE learnAlg SET framenum=? WHERE rowid=?",
              (db_framenum_new, e))
However I still don't know how to properly use c.executemany(). See edit for updated question.
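For what it's worth, a minimal executemany() sketch of the same rowid-based update (my reconstruction, assuming the same change_fn adjustment as above and a connection object called conn, as in the answer): the two things to get right are that each parameter tuple matches the placeholder order and that executemany() runs once, after the loop.
c.execute("SELECT rowid, framenum FROM learnAlg")
params = []
change_fn = 0
for rowid, framenum in c.fetchall():
    if framenum in change_numbers:
        change_fn += 1
    # new framenum first, then rowid, to match the placeholder order
    params.append((framenum + change_fn, rowid))

c.executemany("UPDATE learnAlg SET framenum=? WHERE rowid=?", params)
conn.commit()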

Related

Making comparing 2 tables faster (Postgres/SQLAlchemy)

I wrote some code in Python to manipulate a table I have in my database, using SQLAlchemy. Basically, I have table 1 with 2,500,000 entries and table 2 with 200,000 entries. What I am trying to do is compare the source ip and dest ip in table 1 with the source ip and dest ip in table 2. If there is a match, I replace the ip source and ip dest in table 1 with the data that matches in table 2, and I add the entry to table 3. My code also checks whether the entry is already in the new table; if so, it skips it and goes on to the next row.
My problem is that it's extremely slow. I launched my script yesterday and in 24 hours it only went through 47,000 entries out of 2,500,000. I am wondering if there is any way I can speed up the process. It's a Postgres db and I can't tell whether the script taking this much time is reasonable or whether something is up. If anyone has had a similar experience with something like this, how much time did it take before completion?
Many thanks.
session = Session()
i = 0
start_id = 1
flows = session.query(Table1).filter(Table1.id >= start_id).all()
result_number = len(flows)
vlan_list = {"['0050']", "['0130']", "['0120']", "['0011']", "['0110']"}
while i < result_number:
    for flow in flows:
        if flow.vlan_destination in vlan_list:
            usage = session.query(Table2).filter(
                Table2.ip == str(flow.ip_destination)).all()
            if len(usage) > 0:
                usage = usage[0].usage
            else:
                usage = str(flow.ip_destination)
            usage_ip_src = session.query(Table2).filter(
                Table2.ip == str(flow.ip_source)).all()
            if len(usage_ip_src) > 0:
                usage_ip_src = usage_ip_src[0].usage
            else:
                usage_ip_src = str(flow.ip_source)
            if flow.protocol == "17":
                protocol = func.REPLACE(flow.protocol, "17", 'UDP')
            elif flow.protocol == "1":
                protocol = func.REPLACE(flow.protocol, "1", 'ICMP')
            elif flow.protocol == "6":
                protocol = func.REPLACE(flow.protocol, "6", 'TCP')
            else:
                protocol = flow.protocol
            is_in_db = session.query(Table3).\
                filter(Table3.protocol == protocol).\
                filter(Table3.application == flow.application).\
                filter(Table3.destination_port == flow.destination_port).\
                filter(Table3.vlan_destination == flow.vlan_destination).\
                filter(Table3.usage_source == usage_ip_src).\
                filter(Table3.state == flow.state).\
                filter(Table3.usage_destination == usage).count()
            if is_in_db == 0:
                to_add = Table3(usage_ip_src, usage, protocol, flow.application,
                                flow.destination_port, flow.vlan_destination, flow.state)
                session.add(to_add)
                session.flush()
                session.commit()
                print("added " + str(i))
            else:
                print("usage already in DB")
        i = i + 1
session.close()
EDIT: As requested, here are more details. Table 1 has 11 columns; the two we are interested in are source ip and dest ip (screenshot of Table 1 omitted).
Table 2 has an IP and a Usage column (screenshot omitted). What my script does is take the source ip and dest ip from Table 1 and look up whether there is a match in Table 2. If so, it replaces the IP address with the usage, and adds this, along with some of the columns of Table 1, to Table 3 (screenshot omitted).
Along the way, when adding the protocol column to Table 3, it writes the protocol name instead of the number, just to make it more readable.
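For reference, that renaming boils down to a simple fixed lookup; a minimal sketch (the dict name is my own) that would produce the same mapping as the chain of func.REPLACE calls:
# Hypothetical helper: map protocol numbers to names, falling back to the raw value.
PROTOCOL_NAMES = {"17": "UDP", "1": "ICMP", "6": "TCP"}
protocol = PROTOCOL_NAMES.get(flow.protocol, flow.protocol)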
EDIT 2: I am trying to think about this differently, so I made a diagram of my problem (diagram omitted).
What I am trying to figure out is whether my code (the Y solution) is working as intended. I've been coding in Python for only a month and I feel like I am messing something up. My code is supposed to take every row from Table 1, compare it to Table 2, and add data to Table 3. Table 1 has over 2 million entries, so it's understandable that it should take a while, but it's too slow. For example, when I had to load the data from the API into the db, it went faster than the comparisons I'm trying to do with everything that is already in the db. I am running my code on a virtual machine that has sufficient memory, so I am sure it's my code that is lacking, and I need direction as to what can be improved. (Screenshots of Table 1, Table 2, and Table 3 omitted.)
EDIT 3: PostgreSQL query:
SELECT
    coalesce(table2_1.usage, table1.ip_source) AS coalesce_1,
    coalesce(table2_2.usage, table1.ip_destination) AS coalesce_2,
    CASE table1.protocol
        WHEN %(param_1)s THEN %(param_2)s
        WHEN %(param_3)s THEN %(param_4)s
        WHEN %(param_5)s THEN %(param_6)s
        ELSE table1.protocol
    END AS anon_1,
    table1.application AS table1_application,
    table1.destination_port AS table1_destination_port,
    table1.vlan_destination AS table1_vlan_destination,
    table1.state AS table1_state
FROM
    table1
    LEFT OUTER JOIN table2 AS table2_2 ON table2_2.ip = table1.ip_destination
    LEFT OUTER JOIN table2 AS table2_1 ON table2_1.ip = table1.ip_source
WHERE
    table1.vlan_destination IN (
        %(vlan_destination_1)s,
        %(vlan_destination_2)s,
        %(vlan_destination_3)s,
        %(vlan_destination_4)s,
        %(vlan_destination_5)s
    )
    AND NOT (
        EXISTS (
            SELECT 1
            FROM table3
            WHERE table3.usage_source = coalesce(table2_1.usage, table1.ip_source)
                AND table3.usage_destination = coalesce(table2_2.usage, table1.ip_destination)
                AND table3.protocol = CASE table1.protocol
                    WHEN %(param_1)s THEN %(param_2)s
                    WHEN %(param_3)s THEN %(param_4)s
                    WHEN %(param_5)s THEN %(param_6)s
                    ELSE table1.protocol END
                AND table3.application = table1.application
                AND table3.destination_port = table1.destination_port
                AND table3.vlan_destination = table1.vlan_destination
                AND table3.state = table1.state
        )
    )
Given the current question, I think this at least comes close to what you might be after. The idea is to perform the entire operation in the database, instead of fetching everything – the whole 2,500,000 rows – and filtering in Python etc.:
from sqlalchemy import func, case, insert
from sqlalchemy.orm import aliased

def newhotness(session, vlan_list):
    # The query needs to join Table2 twice, so it has to be aliased
    dst = aliased(Table2)
    src = aliased(Table2)

    # Prepare required SQL expressions
    usage = func.coalesce(dst.usage, Table1.ip_destination)
    usage_ip_src = func.coalesce(src.usage, Table1.ip_source)
    protocol = case({"17": "UDP",
                     "1": "ICMP",
                     "6": "TCP"},
                    value=Table1.protocol,
                    else_=Table1.protocol)

    # Form a query producing the data to insert to Table3
    flows = session.query(
            usage_ip_src,
            usage,
            protocol,
            Table1.application,
            Table1.destination_port,
            Table1.vlan_destination,
            Table1.state).\
        outerjoin(dst, dst.ip == Table1.ip_destination).\
        outerjoin(src, src.ip == Table1.ip_source).\
        filter(Table1.vlan_destination.in_(vlan_list),
               ~session.query(Table3).
                   filter_by(usage_source=usage_ip_src,
                             usage_destination=usage,
                             protocol=protocol,
                             application=Table1.application,
                             destination_port=Table1.destination_port,
                             vlan_destination=Table1.vlan_destination,
                             state=Table1.state).
                   exists())

    stmt = insert(Table3).from_select(
        ["usage_source", "usage_destination", "protocol", "application",
         "destination_port", "vlan_destination", "state"],
        flows)

    return session.execute(stmt)
If the vlan_list is selective, in other words filters out most rows, this will perform far fewer operations in the database. Depending on the size of Table2, you may benefit from indexing Table2.ip, but do test first. If it is relatively small, I would guess that PostgreSQL will perform a hash or nested loop join there. If one of the columns used to filter out duplicates in Table3 is unique, you could perform an INSERT ... ON CONFLICT ... DO NOTHING instead of removing duplicates in the SELECT using the NOT EXISTS subquery expression (which PostgreSQL will perform as an antijoin). If there is a possibility that the flows query may produce duplicates, add a call to Query.distinct() to it.
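A hedged sketch of that ON CONFLICT variant, swapping the plain insert() above for the PostgreSQL dialect's insert(); it assumes Table3 really has a unique constraint covering the duplicate-defining columns, which the question does not confirm, and the NOT EXISTS filter on flows could then be dropped:
from sqlalchemy.dialects.postgresql import insert as pg_insert

# Relies on a unique constraint on Table3; without one, ON CONFLICT never fires.
stmt = pg_insert(Table3).from_select(
    ["usage_source", "usage_destination", "protocol", "application",
     "destination_port", "vlan_destination", "state"],
    flows).on_conflict_do_nothing()
session.execute(stmt)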

Python and PostgreSQL - Check on multi-insert operation

I'll make it easier on you.
I need to perform a multi-insert operation using parameters from a text file.
However, I need to report each input line in a log or an err file depending on the insert status.
I was able to understand whether the insert was OK or not when performing it one at a time (for example, using cur.rowcount or simply a try..except statement).
Is there a way to perform N inserts (corresponding to N input lines) and to understand which ones fail?
Here is my code:
QUERY="insert into table (field1, field2, field3) values (%s, %s, %s)"
Let
a b c
d e f
g h i
be 3 rows from the input file. So
args=[('a','b','c'), ('d','e','f'),('g','h','i')]
cur.executemany(QUERY,args)
Now, let's suppose only the first 2 rows were successfully added. So I have to track such a situation as follows:
log file
a b c
d e f
err file
g h i
Any idea?
Thanks!
try executing one row at a time, so you can log each line individually:
QUERY = "insert into table (field1, field2, field3) values (%s, %s, %s)"

with open('input.txt', 'r') as inputfile:
    readfile = inputfile.read()
    inputlist = readfile.splitlines()
    for x in inputlist:
        intermediate = x.split(' ')
        cur.execute(QUERY, (intermediate[0], intermediate[1], intermediate[2]))
        # if error:
        #     log into the error file
        # else:
        #     log into the success file
Do not forget to uncomment those lines and adjust the error handling as you like.
How common do you expect failures to be, and what kind of failures? What I have done in such similar cases is insert 10,000 rows at a time, and if the chunk fails then go back and do that chunk 1 row at a time to get the full error message and specific row. Of course, that depends on failures being rare. What I would be more likely to do today is just turn off synchronous_commit and process them one row at a time always.
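A rough sketch of that chunk-then-retry idea with psycopg2 (the chunk size, the helper name, and the open file handles are my assumptions; QUERY is the parameterised insert from the question):
CHUNK_SIZE = 10000

def insert_with_fallback(conn, rows, log_file, err_file):
    cur = conn.cursor()
    for start in range(0, len(rows), CHUNK_SIZE):
        chunk = rows[start:start + CHUNK_SIZE]
        try:
            cur.executemany(QUERY, chunk)
            conn.commit()
            # Assumes the row values are strings, as in the example input.
            log_file.writelines(' '.join(row) + '\n' for row in chunk)
        except Exception:
            conn.rollback()
            # Replay the failed chunk one row at a time to isolate the bad rows.
            for row in chunk:
                try:
                    cur.execute(QUERY, row)
                    conn.commit()
                    log_file.write(' '.join(row) + '\n')
                except Exception:
                    conn.rollback()
                    err_file.write(' '.join(row) + '\n')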

Dataflow job is timing out. Having issues comparing two collections, and appending the values of one to another.

Hoping someone can help me here. I have two BigQuery tables that I read into two different PCollections, p1 and p2. I essentially want to update product based on a type II transformation that keeps track of history (previous values in the nested column in product) and appends new values from dwsku.
The idea is to check every row in each collection. If there is a match based on some table values (between p1 and p2), then check product's nested data to see if it contains all values in p1 (based on its sku number and brand). If it does not contain the most recent data from p2, then take a copy of the format of the current nested data in product and fit the new data into it. Take this nested format and add it to the existing nested products in product.
def process_changes(element, productdata):
    for data in productdata:
        if element['sku_number'] == data['sku_number'] and element['brand'] == data['brand']:
            logging.info('Processing Product: ' + str(element['sku_number']) + ' brand:' + str(element['brand']))
            datatoappend = []
            for nestline in data['product']:
                logging.info('Nested Data: ' + nestline['product'])
                if nestline['in_use'] == 'Y' and (nestline['sku_description'] != element['sku_description']
                        or nestline['department_id'] != element['department_id']
                        or nestline['department_description'] != element['department_description']
                        or nestline['class_id'] != element['class_id']
                        or nestline['class_description'] != element['class_description']
                        or nestline['sub_class_id'] != element['sub_class_id']
                        or nestline['sub_class_description'] != element['sub_class_description']):
                    logging.info('we found a sku we need to update')
                    logging.info('sku is ' + data['sku_number'])
                    newline = nestline.copy()
                    logging.info('most recent nested product element turned off...')
                    nestline['in_use'] = 'N'
                    nestline['expiration_date'] = "%s-%s-%s" % (curdate.year, curdate.month, curdate.day)  # CURRENT DATE
                    logging.info(nestline)
                    logging.info('inserting most recent change in dwsku inside nest')
                    newline['sku_description'] = element['sku_description']
                    newline['department_id'] = element['department_id']
                    newline['department_description'] = element['department_description']
                    newline['class_id'] = element['class_id']
                    newline['class_description'] = element['class_description']
                    newline['sub_class_id'] = element['sub_class_id']
                    newline['sub_class_description'] = element['sub_class_description']
                    newline['in_use'] = 'Y'
                    newline['effective_date'] = "%s-%s-%s" % (curdate.year, curdate.month, curdate.day)  # CURRENT DATE
                    newline['modified_date'] = "%s-%s-%s" % (curdate.year, curdate.month, curdate.day)  # CURRENT DATE
                    newline['modified_time'] = "%s:%s:%s" % (curdate.hour, curdate.minute, curdate.second)
                    nestline['expiration_date'] = "9999-01-01"
                    datatoappend.append(newline)
                else:
                    logging.info('Nothing changed for sku ' + str(data['sku_number']))
            for dt in datatoappend:
                logging.info('processed sku ' + str(element['sku_number']))
                logging.info('adding the changes (if any)')
                data['product'].append(dt)
    return data
changed_product = p1 | beam.FlatMap(process_changes, AsIter(p2))
Afterwards I want to add all values in p1 not in p2 in a nested format as seen in nestline.
Any help would be appreciated as I'm wondering why my job is taking hours to run with nothing to show. Even the output logs in dataflow UI don't show anything.
Thanks in advance!
This can be quite expensive if the side input PCollection p2 is large. From your code snippets it's not clear how PCollection p2 is constructed, but if it is, for example, read from a text file of size 62.7 MB, processing it per element can be pretty expensive. Can you consider using CoGroupByKey: https://beam.apache.org/documentation/programming-guide/#cogroupbykey
Also note that from a FlatMap you are supposed to return an iterable of elements from the processing method. It seems like you are returning a dictionary ('data'), which is probably incorrect.
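A hedged sketch of what the CoGroupByKey shape could look like here, replacing the side-input FlatMap above (the step labels and the merge body are placeholders; it assumes p1 holds the dwsku rows and p2 the product rows, as in process_changes):
import apache_beam as beam

def key_by_sku_brand(row):
    # Key both collections on the fields used for matching.
    return ((row['sku_number'], row['brand']), row)

def merge_product(element):
    (sku_brand, grouped) = element
    dwsku_rows = grouped['dwsku']      # rows that came from p1
    product_rows = grouped['product']  # rows that came from p2
    for data in product_rows:
        # ... apply the type II history update from dwsku_rows here ...
        yield data  # FlatMap bodies should yield elements, not return a dict

changed_product = (
    {'dwsku': p1 | 'KeyDwsku' >> beam.Map(key_by_sku_brand),
     'product': p2 | 'KeyProduct' >> beam.Map(key_by_sku_brand)}
    | 'JoinOnSkuBrand' >> beam.CoGroupByKey()
    | 'MergeHistory' >> beam.FlatMap(merge_product))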

Inserting to sqlite dynamically with Python 3

I want to write to multiple tables with sqlite, but I don't want to manually specify the query ahead of time (there are dozens of possible permutations).
So for example:
def insert_sqlite(tablename, data_list):
    global dbc
    dbc.execute("insert into " + tablename + " values (?)", data_list)

tables_and_data = {
    'numbers_table': [1, 2, 3, 4, 5],
    'text_table': ["pies", "cakes"]
}

for key in tables_and_data:
    insert_sqlite(key, tables_and_data[key])
I want two things to happen:
a) for the tablename to be set dynamically - I've not found a single example where this is done.
b) The data_list values to be correctly used - note that the length of the list varies (as per the example).
But the above doesn't work - How do I dynamically create a sqlite3.execute statement?
Thanks
a) Your code above seems to be setting the table name correctly, so no problem there.
b) You need a ? (placeholder) per column you wish to insert a value for.
When I recreate your code as-is and run it, I get the error message:
"sqlite3.OperationalError: table numbers_table has 5 columns but 1 values were supplied".
A solution would be to edit your function to dynamically create the correct number of placeholders:
def insert_sqlite(tablename, data_list):
    global dbc
    dbc.execute("insert into " + tablename + " values (" + ('?,' * len(data_list))[:-1] + ")", data_list)
After doing this and re-executing the code with some added select statements (just to test it out):
dbc.execute("""
select * from numbers_table
""")
print(dbc.fetchall());
dbc.execute("""
select * from text_table
""")
print(dbc.fetchall());
I get the result:
[(1, 2, 3, 4, 5)]
[(u'pies', u'cakes')]
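If several rows go to the same table, the same placeholder trick also works with executemany(); a small sketch (the helper name is mine, and it assumes every row in a batch has the same number of columns):
def insert_many_sqlite(tablename, rows):
    # Table names cannot be parameterised, so tablename must come from trusted code.
    placeholders = ", ".join(["?"] * len(rows[0]))
    dbc.executemany(
        "insert into " + tablename + " values (" + placeholders + ")", rows)

insert_many_sqlite("numbers_table", [(1, 2, 3, 4, 5), (6, 7, 8, 9, 10)])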

Data modeling with AppEngine python and Queries with 'IN' range

I have a list of addresses as strings and I'd like to find all events whose location value matches the contents of the list. Because I have thousands of such entries, using 'IN' with a filter won't work, as I've exceeded the limit of 30 items per fetch.
Here's how I'm trying to do a filter:
# addresses come in as a list of string items
addresses = ['123 Main St, Portland, ME', '500 Broadway, New York, NY', ...]
query = Event.all()
query.filter('location IN ', addresses)
# above causes the error:
<class 'google.appengine.api.datastore_errors.BadArgumentError'>:
Cannot satisfy query -- too many subqueries (max: 30, got 119).
Probable cause: too many IN/!= filters in query.
My model classes:
class Event(GeoModel):
    name = db.StringProperty()
    location = db.PostalAddressProperty()
Is there a better way to find all entries that match a specific criteria?
There's no way around this other than multiple queries - you are, after all, asking for the combined results of a set of queries for different addresses, and this is how 'IN' queries are implemented in the datastore. You might want to consider using ndb or asynchronous queries so you can run them in parallel.
Perhaps if you explain what you're trying to achieve, we can suggest a more efficient approach.
A simple solution (hack) would be to break up your list of addresses into lists of 30 each. Do one query per 30 locations, then combine the query results to get the events for every location in the original list.
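A minimal sketch of that chunking approach with the db API used in the question (the helper name and the per-chunk fetch limit are my assumptions):
def events_for_addresses(addresses, chunk_size=30):
    events = []
    for i in range(0, len(addresses), chunk_size):
        chunk = addresses[i:i + chunk_size]
        query = Event.all()
        query.filter('location IN ', chunk)
        events.extend(query.fetch(1000))  # arbitrary per-chunk limit
    return events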
GQL 'IN' does not allow more than 30 subqueries. To work around this, I divided the subqueries into chunks of at most 30 each and collected the results into an array.
resultArray = []
rLength = 0.0
rCount = len(subQueryArray)
rLength = len(subQueryArray) / 29.0
arrayLength = int(math.ceil(rLength))

# If there are more than 30 subqueries, divide the subquery list into chunks of 29/30
if arrayLength > 1:
    for ii in range(0, arrayLength):
        # srange = start range, nrange = new range
        if ii == 0:
            srange = ii
        else:
            srange = nrange + 1
        nrange = 29 * (ii + 1)
        newList = []
        for nii in range(srange, nrange + 1):
            if nii < rCount:
                newList.append(subQueryArray[nii])
        query = db.GqlQuery("SELECT * FROM table_name " + "WHERE req_id IN :1", newList)
        for result in query.run():
            # result.id belongs to the table entity
            resultArray.append(result.id)
