How can I replace a for loop with something else - Python

Currently I have a for loop which retrieves data from the database for every row.
I also need to run a while loop, but it does not work as intended because the for loop finishes once it has retrieved the database data. As a result, the rest of my while True loop, which should wait for a user response, never runs.
c.execute("SELECT * FROM maincores WHERE set_status = 1")
rows = c.fetchall()
for v in rows:
# skip
while True:
#skip
I have tried using a global variable to store the database data and then returning to the loop, but every attempt failed.
How can I get the sqlite3 database information without using a for loop?

I'm not 100% sure about the problem, but I think you want a generator, so that you can throttle your intake of information with your loop. You could write a function like:
def getDBdata():
    c.execute("SELECT * FROM maincores WHERE set_status = 1")
    rows = c.fetchall()
    for v in rows:
        yield v  # just returns one result at a time ...
x = True
data = getDBdata()
while x is True:
    # do something with data
    if <condition>:
        next(data)  # get the next row of data
    else:
        x = False
So now you are controlling the data flow from your DB, and your while loop is no longer exhausted as a side effect of the data flow.
My apologies if I'm not answering the question you're asking, but I hope this helps.
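To make the idea concrete, here is a minimal, self-contained sketch of the generator approach; the in-memory database, the table contents, and the per-row work are just placeholders, not the asker's real schema:

import sqlite3

conn = sqlite3.connect(':memory:')  # placeholder database for illustration
c = conn.cursor()
c.execute('CREATE TABLE maincores (name TEXT, set_status INTEGER)')
c.executemany('INSERT INTO maincores VALUES (?, 1)', [('a',), ('b',), ('c',)])

def get_db_data():
    c.execute('SELECT * FROM maincores WHERE set_status = 1')
    for row in c.fetchall():
        yield row  # hand back one row at a time

rows = get_db_data()
while True:
    row = next(rows, None)  # None is the sentinel once the rows run out
    if row is None:
        break
    print('processing', row)  # stand-in for the real per-row work / user interaction

Because the generator only advances when next() is called, the while loop decides the pace instead of the for loop.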

Related

AWS DMS Losing records using DatabaseMigrationService.Client.describe_table_statistics with large result set

I'm using describe_table_statistics to retrieve the list of tables in the given DMS task, and conditionally looping over describe_table_statistics with the response['Marker'].
When I use no filters, I get the correct number of records, 13k+.
When I use a filter, or combination of filters, whose result set has fewer than MaxRecords, I get the correct number of records.
However, when I pass in a filter that would get a record set larger than MaxRecords, I get far fewer records than I should.
Here's my function to retrieve the set of tables:
def get_dms_task_tables(account, region, task_name, schema_name=None, table_state=None):
    tables=[]
    max_records=500
    filters=[]
    if schema_name:
        filters.append({'Name':'schema-name', 'Values':[schema_name]})
    if table_state:
        filters.append({'Name':'table-state', 'Values':[table_state]})
    task_arn = get_dms_task_arn(account, region, task_name)
    session = boto3.Session(profile_name=account, region_name=region)
    client = session.client('dms')
    response = client.describe_table_statistics(
        ReplicationTaskArn=task_arn
        ,Filters=filters
        ,MaxRecords=max_records)
    tables += response['TableStatistics']
    while len(response['TableStatistics']) == max_records:
        response = client.describe_table_statistics(
            ReplicationTaskArn=task_arn
            ,Filters=filters
            ,MaxRecords=max_records
            ,Marker=response['Marker'])
        tables += response['TableStatistics']
    return tables
For troubleshooting, I loop over tables, printing one line per table:
for t in tables:
    print(', '.join((
        t['SchemaName']
        ,t['TableName']
        ,t['TableState'])))
When I pass in no filters and grep the output for the table state 'Table completed', I get 12k+ records, which is the correct count as shown in the console.
So superficially at least, the response loop works.
When I pass in both a schema name and the table state filter, I get the correct count, as confirmed by the console, but this count is less than MaxRecords.
When I pass in just the table state filter for 'Table completed', I only get 949 records, so I'm missing 11k records.
I've tried omitting the Filters parameter from the describe_table_statistics call inside the loop, but I get the same results in all cases.
I suspect there's something wrong with my call to describe_table_statistics inside the loop, but I've been unable to find examples of this in Amazon's documentation to confirm that.
When filters are applied, describe_table_statistics does not adhere to the MaxRecords limit.
In fact, what it seems to do is retrieve (2 x MaxRecords), apply the filter, and return that set. Or possibly it retrieves MaxRecords, applies the filter, and continues until the result set is larger than MaxRecords. Either way, my while condition was the problem.
I replaced
while len(response['TableStatistics']) == max_records:
with
while 'Marker' in response:
and now the function returns the correct number of records.
Incidentally, my first attempt was
while len(response['TableStatistics']) >= 1:
but on the last iteration of the loop it threw this error:
KeyError: 'Marker'
The completed, working function now looks like this:
def get_dms_task_tables(account, region, task_name, schema_name=None, table_state=None):
    tables=[]
    max_records=500
    filters=[]
    if schema_name:
        filters.append({'Name':'schema-name', 'Values':[schema_name]})
    if table_state:
        filters.append({'Name':'table-state', 'Values':[table_state]})
    task_arn = get_dms_task_arn(account, region, task_name)
    session = boto3.Session(profile_name=account, region_name=region)
    client = session.client('dms')
    response = client.describe_table_statistics(
        ReplicationTaskArn=task_arn
        ,Filters=filters
        ,MaxRecords=max_records)
    tables += response['TableStatistics']
    while 'Marker' in response:
        response = client.describe_table_statistics(
            ReplicationTaskArn=task_arn
            ,Filters=filters
            ,MaxRecords=max_records
            ,Marker=response['Marker'])
        tables += response['TableStatistics']
    return tables
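As an aside, boto3 can usually follow the Marker for you via a paginator. The sketch below is only an alternative pattern, not a drop-in replacement; it assumes the DMS client exposes a paginator for describe_table_statistics, which is why it checks can_paginate first:

import boto3

def get_dms_task_tables_paginated(account, region, task_arn, filters):
    # Hypothetical variant: let a boto3 paginator handle the Marker-based paging.
    session = boto3.Session(profile_name=account, region_name=region)
    client = session.client('dms')
    tables = []
    if client.can_paginate('describe_table_statistics'):
        paginator = client.get_paginator('describe_table_statistics')
        for page in paginator.paginate(ReplicationTaskArn=task_arn, Filters=filters):
            tables += page['TableStatistics']
    return tables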

Inserting billions of values into SQLite via Python

I want to insert billions of values (exchange rates) into a sqlite db file. I want to use threading because it takes a lot of time, but the thread pool loop executes the same nth element multiple times. I have a print statement at the beginning of my method and it prints out multiple times instead of just once.
pool = ThreadPoolExecutor(max_workers=2500)

def gen_nums(i, cur):
    global x
    print('row number', x, ' has started')
    gen_numbers = list(mydata)
    sql_data = []
    for f in gen_numbers:
        sql_data.append((f, i, mydata[i]))
    cur.executemany('INSERT INTO numbers (rate, min, max) VALUES (?, ?, ?)', sql_data)
    print('row number', x, ' has finished')
    x += 1

with conn:
    cur = conn.cursor()
    for i in mydata:
        pool.submit(gen_nums, i, cur)
    pool.shutdown(wait=True)
and the output is:
row number 1 has started
row number 1 has started
row number 1 has started
row number 1 has started
row number 1 has started
row number 1 has started
row number 1 has started
...
Divide your data into chunks on the fly using generator expressions, and make the inserts inside a transaction.
Here is how your code might look.
Also, sqlite can import CSV files directly.
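As a minimal sketch of the chunked, single-transaction idea (mydata, the chunk size, and the table layout here are placeholder assumptions, not the asker's real data):

import sqlite3
from itertools import islice

def chunks(iterable, size=100_000):
    # Lazily yield lists of up to `size` items from any iterable or generator expression.
    it = iter(iterable)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch

mydata = [1.01, 1.02, 1.03]  # stand-in for the real stream of exchange rates
rows = ((rate, i, i + 1) for i, rate in enumerate(mydata))  # generator expression, nothing materialized

conn = sqlite3.connect('rates.db')  # placeholder database file
conn.execute('CREATE TABLE IF NOT EXISTS numbers (rate REAL, min INTEGER, max INTEGER)')

for batch in chunks(rows):
    with conn:  # one transaction per chunk; commits on success
        conn.executemany('INSERT INTO numbers (rate, min, max) VALUES (?, ?, ?)', batch)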
Sqlite can do tens of thousands of inserts per second, just make sure to do all of them in a single transaction by surrounding the inserts with BEGIN and COMMIT. (executemany() does this automatically.)
As always, don't optimize before you know speed will be a problem. Test the easiest solution first, and only optimize if the speed is unacceptable.

How to take the output of a function and use it as the input of another function in Python?

More specifically, my function grabs all the domains in my database table and returns them. I want to know how to feed these domains into another function that will run the Kali Linux tool URLcrazy for each of the domains in that table.
For example, my function outputs these:
google.com
yahoo.com
Here is the function:
def grab_domains():
    try:
        db = pymysql.connect(host="localhost",
                             user="root", passwd="root",
                             db="typosquat")
    except ConnectionAbortedError as e:
        print(e, "\n")
    temp = ""
    cursor = db.cursor()
    cursor.execute('SELECT name From domains')
    for rows in cursor.fetchall():
        for entries in rows:
            temp += entries
            domains = entries[0:]
            print(domains)
    return temp
Here is the output:
google.com
yahoo.com
How do I write another function that will run the script URLcrazy on each of these domains? Assume all scripts are in the same file location.
This is all I have; I can't figure out how to run it for each domain, only for a single output.
def run_urlcrazy():
    np = os.system("urlcrazy " + grab_domains())
    print(np)
    return np
How do I get this function to run URLcrazy for each domain?
This is my first post ever on Stack Overflow; let me know what I can do to improve it, and help me with the question if possible. Thanks!
You'll need a loop:
def run_urlcrazy():
    ret_vals = []
    for domain in grab_domains():
        np = os.system("urlcrazy " + domain)
        ret_vals.append(np)
    return ret_vals
I recommend a for loop because it can efficiently iterate over whatever your function returns.
You'll need to make a minor modification to your grab_domains() function as well:
temp = []
cursor = db.cursor()
cursor.execute('SELECT name From domains')
for rows in cursor.fetchall():
    for entries in rows:
        domains = entries[0:]
        temp.append(domains)  # append the whole domain string, not its individual characters
return temp
Now, your function returns a list of domains. You can iterate over this.
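If you want more control over each URLcrazy run than os.system gives (for example, capturing the output per domain), one option is a subprocess-based variant; this is only a sketch and assumes no particular URLcrazy flags:

import subprocess

def run_urlcrazy_capture(domains):
    results = {}
    for domain in domains:
        # subprocess.run lets us capture stdout per domain instead of only the exit code
        completed = subprocess.run(['urlcrazy', domain], capture_output=True, text=True)
        results[domain] = completed.stdout
    return results

# usage: output = run_urlcrazy_capture(grab_domains())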

SQLite3 How to Select first 100 rows from database, then the next 100

Currently I have a database filled with thousands of rows.
I want to SELECT the first 100 rows, and then select the next 100, then the next 100 and so on...
So far I have:
c.execute('SELECT words FROM testWords')
data = c.fetchmany(100)
This allows me to get the first 100 rows; however, I can't find the syntax for selecting the next 100 rows after that using another SELECT statement.
I've seen that it is possible in other languages, but I haven't found a solution with Python's sqlite3.
When you are using cursor.fetchmany() you don't have to issue another SELECT statement. The cursor is keeping track of where you are in the series of results, and all you need to do is call c.fetchmany(100) again until that produces an empty result:
c.execute('SELECT words FROM testWords')
while True:
    batch = c.fetchmany(100)
    if not batch:
        break
    # each batch contains up to 100 rows
or using the iter() function (which can be used to repeatedly call a function until a sentinel result is reached):
c.execute('SELECT words FROM testWords')
for batch in iter(lambda: c.fetchmany(100), []):
    # each batch contains up to 100 rows
If you can't keep hold of the cursor (say, because you are serving web requests), then using cursor.fetchmany() is the wrong interface. You'll instead have to tell the SELECT statement to return only a selected window of rows, using the LIMIT syntax. LIMIT has an optional OFFSET keyword, together these two keywords specify at what row to start and how many rows to return.
Note that you want to make sure that your SELECT statement is ordered so you get a stable result set you can then slice into batches.
batchsize = 1000
offset = 0
while True:
    c.execute(
        'SELECT words FROM testWords ORDER BY somecriteria LIMIT ? OFFSET ?',
        (batchsize, offset))
    batch = list(c)
    offset += batchsize
    if not batch:
        break
Pass the offset value to the next call of your code if you need to send these batches elsewhere and then resume later.
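As one possible shape for that, here is a small resumable helper; the function name and the ORDER BY column are just illustrative placeholders:

def fetch_batch(conn, offset, batchsize=100):
    # Return one window of rows plus the offset to pass to the next call.
    cur = conn.execute(
        'SELECT words FROM testWords ORDER BY words LIMIT ? OFFSET ?',
        (batchsize, offset))
    return cur.fetchall(), offset + batchsize

# usage: batch, offset = fetch_batch(conn, offset=0)  # keep `offset` between requests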
sqlite3 has nothing to do with Python. It is a standalone database; Python just supplies an interface to it.
As a normal database, sqlite supports standard SQL. In SQL, you can use LIMIT and OFFSET to determine the start and end of your query. Note that if you do this, you should really use an explicit ORDER BY clause to ensure that your results are consistently ordered between queries.
c.execute('SELECT words FROM testWords ORDER BY ID LIMIT 100')
...
c.execute('SELECT words FROM testWords ORDER BY ID LIMIT 100 OFFSET 100')
You can create an iterator and call it multiple times:
def ResultIter(cursor, arraysize=100):
    while True:
        results = cursor.fetchmany(arraysize)
        if not results:
            break
        for result in results:
            yield result
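A possible way to use that generator (the cursor and query are assumed from the question):

c.execute('SELECT words FROM testWords')
for row in ResultIter(c, arraysize=100):
    # rows arrive one at a time, fetched under the hood in batches of 100
    print(row[0])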
Or simply like this for returning the first 5 rows:
num_rows = 5
cursor = dbconn.execute("SELECT words FROM testWords")
for row in cursor.fetchmany(num_rows):
    print("Words= " + str(row[0]) + "\n")

Python iterate over csv row by row while keeping track of content of next row

I have a csv with following structure:
user_id,user_name,code
0001,user_a,e-5
0001,user_a,s-N
0002,user_b,e-N
0002,user_b,t-5
I want to iterate over the file so that, after processing a user and before getting to the next user, I do some additional work based on the processed user's codes. A user can have multiple entries, anywhere from 1 to n.
For example, as we process a user, we keep track of the first letter of each code that user has mentioned by adding it to a list. We need to make sure this list gets reset after each user is processed (it is tracked per user).
As an example, consider user_id 0001: after I reach the first row for 0002, I want to add more rows related to user 0001, where those new rows contain the codes we have not seen for that user:
Here is how I tried to accomplish this:
with open(os.path.expanduser('out_put_csv_file.csv'), 'w+') as data_file:
    writer = csv.writer(data_file)
    writer.writerow(('user_id', 'user_name', 'code'))
    l_file = csv.DictReader(open('some_file_name'))
    previous_user = None
    current_user = None
    tracker = []
    for row in l_file:
        current_user = row['user_id']
        tracker.append(row['code'].split('-')[0])
        writer.writerow([row['user_id'], row['user_name'], row['code']])
        if current_user != previous_user:
            for l_code in list_with_all_codes:
                if l_code not in tracker:
                    writer.writerow([row['user_id'], row['user_name'], l_code])
            tracker = []
        previous_user = current_user
The problem with this is that I get the following:
user_id,user_name,code
0001,user_a,e-5
0001,user_a,n
0001,user_a,t
0001,user_a,i
0001,user_a,s #don't want this
0001,user_a,s-N
0002,user_b,e-N
0002,user_b,n
0002,user_b,t # don't want this
0002,user_b,i
0002,user_b,ta-5
Instead of that, what I want is:
user_id,user_name,code
0001,user_a,e-5
0001,user_a,n
0001,user_a,t
0001,user_a,i
0001,user_a,s-N
0002,user_b,e-N
0002,user_b,n
0002,user_b,i
0002,user_b,ta-5
What am I doing wrong here? What's the best way to accomplish this?
Your problem is that you write one line of data for the new user before realizing that you need to fill in for the old user... and then write the old user's filler data under the new user's name.
Since you want to write multiple bits of data about the previous user, you'll need to keep that user's entire row. When you see a new user, write the filler data for the old user (using their info) before you do anything else. There is a special case for the first user, when there isn't any previous user to deal with.
import os
import csv

open('some_file_name', 'w').write("""user_id,user_name,code
0001,user_a,e-5
0001,user_a,s-N
0002,user_b,e-N
0002,user_b,t-5
""")

list_with_all_codes = ['e', 's', 'n', 't', 'a']

def set_unused_codes(writer, row, tracker):
    for l_code in list_with_all_codes:
        if l_code not in tracker:
            writer.writerow([row['user_id'], row['user_name'], l_code])

with open(os.path.expanduser('out_put_csv_file.csv'), 'w+') as data_file:
    writer = csv.writer(data_file)
    writer.writerow(('user_id', 'user_name', 'code'))
    l_file = csv.DictReader(open('some_file_name'))
    previous_row = None
    tracker = []
    for row in l_file:
        if not previous_row:
            previous_row = row
        if row['user_id'] != previous_row.get('user_id'):
            set_unused_codes(writer, previous_row, tracker)
            previous_row = row
            tracker = []
        tracker.append(row['code'].split('-')[0])
        writer.writerow([row['user_id'], row['user_name'], row['code']])
    set_unused_codes(writer, row, tracker)

print(open('out_put_csv_file.csv').read())
The output is...
user_id,user_name,code
0001,user_a,e-5
0001,user_a,s-N
0001,user_a,n
0001,user_a,t
0001,user_a,a
0002,user_b,e-N
0002,user_b,t-5
0002,user_b,s
0002,user_b,n
0002,user_b,a
If you don't mind what order your missing codes are written in, you could use sets to speed things up by a minuscule, trivial amount (did I overpromote that?!):
set_of_all_codes = set(list_with_all_codes)
... the for loop ...
for code in set_of_all_codes - set(tracker):
    writer.writerow(...)
The common pattern when you need to see the next data unit before knowing you're done with the current one is as follows (loosely sketched after your use case):
oldname = ""
data = []
for row in input:
n,name,code = row.split(',')
if name != oldname:
if data: flush(data)
data = []
oldname = name
update(data,n,name,code)
# remember to flush the data buffer when you're done with your file
flush(data)
data could be a list of lists, as in
def update(data, n, name, code):
    if not data:
        data.append(n)
        data.append(name)
        data.append([code])
    else:
        data[2].append(code)
With respect to flush, if you don't know how to order your output (re your comment following the Q), neither do I. But it's just a matter of iterating over data[2] and list_with_all_codes; you've already done something similar in your original code.
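For completeness, here is one guess at what flush could look like; writer and list_with_all_codes are assumed to exist in the enclosing scope as in the earlier snippets, and the order of the filler codes is arbitrary:

def flush(data):
    # data is [n, name, [code, ...]] as built by update().
    n, name, codes = data
    for code in codes:  # emit the rows that were buffered for this user
        writer.writerow([n, name, code])
    seen = {code.split('-')[0] for code in codes}
    for missing in list_with_all_codes:  # then emit any codes this user never mentioned
        if missing not in seen:
            writer.writerow([n, name, missing])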
