Querying all rows of an Azure table in Python

I have around 20000 rows in my Azure table and I want to query all of them. But due to an Azure limitation I am getting only 1000 rows.
My code:
from azure.storage import TableService

table_service = TableService(account_name='xxx', account_key='YYY')
i = 0
tasks = table_service.query_entities('ValidOutputTable', "PartitionKey eq 'tasksSeattle'")
for task in tasks:
    i += 1
    print task.RowKey, task.DomainUrl, task.Status
print i
I want to get all the rows from ValidOutputTable. Is there a way to do so?

But due to an Azure limitation I am getting only 1000 rows.
This is a documented limitation. Each query request to Azure Table Storage will return no more than 1000 rows. If there are more than 1000 entities, the table service will return a continuation token that must be used to fetch the next set of entities (see the Remarks section here: http://msdn.microsoft.com/en-us/library/azure/dd179421.aspx).
Please see the sample code to fetch all entities from a table:
from azure import *
from azure.storage import TableService

table_service = TableService(account_name='xxx', account_key='yyy')
i = 0
next_pk = None
next_rk = None
while True:
    entities = table_service.query_entities('Address', "PartitionKey eq 'Address'",
                                             next_partition_key=next_pk,
                                             next_row_key=next_rk, top=1000)
    i += 1
    for entity in entities:
        print(entity.AddressLine1)
    if hasattr(entities, 'x_ms_continuation'):
        x_ms_continuation = getattr(entities, 'x_ms_continuation')
        next_pk = x_ms_continuation['nextpartitionkey']
        next_rk = x_ms_continuation['nextrowkey']
    else:
        break

Update 2019
Just running a for loop over the query result (as the author of the question does) will get all the data from the query.
from azure.cosmosdb.table.tableservice import TableService

table_service = TableService(account_name='account_name', account_key='key')
# counter to keep track of records
counter = 0
# get the rows; the debugger shows the object has only 100 records
rows = table_service.query_entities(table, "PartitionKey eq 'mykey'")
for row in rows:
    if counter % 100 == 0:
        # just to keep output smaller, print every 100 records
        print("Processing {} record".format(counter))
    counter += 1
The output proves that the loop goes over more than 1000 records:
...
Processing 363500 record
Processing 363600 record
...

Azure Table Storage has a new Python library in preview release that is available for installation via pip. To install it, use the following pip command:
pip install azure-data-tables
To query all rows for a given table with the newest library, you can use the following code snippet:
import os

from azure.data.tables import TableClient

key = os.environ['TABLES_PRIMARY_STORAGE_ACCOUNT_KEY']
account_name = os.environ['tables_storage_account_name']
endpoint = os.environ['TABLES_STORAGE_ENDPOINT_SUFFIX']
account_url = "{}.table.{}".format(account_name, endpoint)
table_name = "myBigTable"

with TableClient(account_url=account_url, credential=key, table_name=table_name) as table_client:
    try:
        table_client.create_table()
    except:
        pass
    i = 0
    # list_entities pages through the whole table, following continuation
    # tokens for you.
    for entity in table_client.list_entities():
        print(entity['value'])
        i += 1
        if i % 100 == 0:
            print(i)
Your output would look like this (modified for brevity, assuming there are 2000 entities):
...
1100
1200
1300
...
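If you also need the original question's PartitionKey filter, the same client exposes query_entities, which follows continuation tokens just like list_entities. A minimal sketch, assuming a table_client created as in the snippet above and reusing the filter value from the original question:
# Sketch only: 'table_client' is assumed to be a TableClient created as above.
filtered = table_client.query_entities("PartitionKey eq 'tasksSeattle'")
for entity in filtered:
    print(entity['RowKey'])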

Related

Batching the bulk updates of millions of rows with peewee

I have an SQLite3 database with a table that has twenty million rows. I would like to update the values of some of the columns in the table (for all rows). I am running into performance issues (only about 1,000 rows processed per second). I would like to continue using the peewee module in Python to interact with the database.
So I'm not sure if I am taking the right approach with my code. After trying some ideas that all failed, I attempted to perform the update in batches. My first solution was to iterate over the cursor with islice, like so:
import math, itertools
from tqdm import tqdm
from cool_project.database import db, MyTable

def update_row(row):
    row.column_a = computation(row.column_d)
    row.column_b = computation(row.column_d)
    row.column_c = computation(row.column_d)

fields = (MyTable.column_a,
          MyTable.column_b,
          MyTable.column_c)

rows = MyTable.select()
total_rows = rows.count()
page_size = 1000
total_pages = math.ceil(total_rows / page_size)

# Start #
with db.atomic():
    for page_num in tqdm(range(total_pages)):
        page = list(itertools.islice(rows, page_size))
        for row in page: update_row(row)
        MyTable.bulk_update(page, fields=fields)
This failed, because it would attempt to put the result of the whole query into memory. So I adapted the code to use the paginate function.
import math
from tqdm import tqdm
from cool_project.database import db, MyTable

def update_row(row):
    row.column_a = computation(row.column_d)
    row.column_b = computation(row.column_d)
    row.column_c = computation(row.column_d)

fields = (MyTable.column_a,
          MyTable.column_b,
          MyTable.column_c)

rows = MyTable.select()
total_rows = rows.count()
page_size = 1000
total_pages = math.ceil(total_rows / page_size)

# Start #
with db.atomic():
    for page_num in tqdm(range(1, total_pages + 1)):
        # Get a batch #
        page = MyTable.select().paginate(page_num, page_size)
        # Update #
        for row in page: update_row(row)
        # Commit #
        MyTable.bulk_update(page, fields=fields)
But it's still quite slow, and would take >24 hours to complete.
What is strange is that the speed (in number of rows per second) notably decreases as time goes by. The script starts at ~1000 rows per second, but after half an hour it's down to 250 rows per second.
Am I missing something? Thanks!
The issues are twofold -- you are pulling all results into memory and you are using the bulk update API, which is quite complex/special and which is also completely unnecessary for SQLite.
Try the following:
def update_row(row):
    row.column_a = computation(row.column_d)
    row.column_b = computation(row.column_d)
    row.column_c = computation(row.column_d)

fields = (MyTable.column_a,
          MyTable.column_b,
          MyTable.column_c)

rows = MyTable.select()

with db.atomic():
    for row in rows.iterator():  # Add ".iterator()" to avoid caching rows
        update_row(row)
        row.save(only=fields)

Pandas_gbq does not append in order

I am using a VM instance in Google Cloud and I want to use BigQuery as well.
With the script below, I am trying to append the newly generated report inside a while loop to the end of the BigQuery table every 10 minutes.
position_size = np.zeros([24, 24], dtype=float)
... some codes here
... some codes here
... some codes here
... some codes here
while 1:
    current_time = datetime.datetime.now()
    if current_time.minute % 10 == 0:
        try:
            report = pd.DataFrame(position_size[12:24], columns=('pos' + str(x) for x in range(0, 24)))
            report.insert(loc=0, column="timestamp", value=datetime.datetime.now())
            ... some codes here
            pandas_gbq.to_gbq(report, "reports.report", if_exists='append')
        except ccxt.BaseError as Error:
            print("[ERROR] ", Error)
            continue
But as you can see in the screenshot below, it does not append in order. How can I solve this issue? Thank you in advance.
How are you reading results? In most databases (BigQuery included), the row order is indeterminate in the absence of an ordering expression. You likely need an ORDER BY clause in your SELECT statement.
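For example, a minimal sketch of reading the table back in insertion order, assuming the timestamp column added in the question's code and a placeholder project id:
import pandas_gbq

# Placeholder project id; the table and timestamp column follow the question's code.
df = pandas_gbq.read_gbq(
    "SELECT * FROM reports.report ORDER BY timestamp",
    project_id="my-project-id",
)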

Using cx_Oracle How Can I speed-up Oracle Query to Pandas

I have Python code that reads a very large Oracle DB with an unknown number of rows to extract some data bounded by lat/lon limits, but it takes about 20 minutes per query. I am trying to rewrite or add something to my code to improve this run time, since I have many queries to run one at a time. Here is the code I'm using now:
import cx_Oracle
import pandas as pd

plant_name = 'NEW HARVEST'

conn = cx_Oracle.connect('DOMINA_CONSULTA/password@ex021-orc.corp.companyname.com:1540/domp_domi_bi')
try:
    query1 = '''
    SELECT * FROM DOMINAGE.DGE_RAYOS
    WHERE FECHA_RAYO >= '01-JAN-19' AND FECHA_RAYO < '01-JAN-20'
      AND COORDENADA_X >= 41.82 AND COORDENADA_X <= 42.52
      AND COORDENADA_Y >= -95.83 AND COORDENADA_Y <= -95.13
    '''
    dfp = pd.read_sql(con=conn, sql=query1)
finally:
    conn.close()

dfp.head()
# drop columns not needed
dfp = dfp[['FECHA_RAYO', 'INTENSIDAD_KA', 'COORDENADA_X', 'COORDENADA_Y']]
dfp = dfp.assign(SITE=plant_name)

Slow MySQL database query time in a Python for loop

I have a task to run 8 equal queries (1 query per country) and return data from a MySQL database. The reason I can't run 1 query with all countries in one is that each country needs to have different column names. Also, results need to be updated daily with a dynamic date range (last 7 days). Yes, I could run all countries together and do the column naming and everything with Pandas, but I thought the following solution would be more efficient.
My solution was to create a for loop that uses predefined lists with all the countries, their respective dimensions, and date range variables that change according to the current date. The problem I'm having is that the MySQL query running in the loop takes much more time than if I run the same query directly in our data warehouse (~140-500 seconds vs. 30 seconds). The solution works with smaller tables from the DWH. The thing is that I don't know which part exactly is causing the problem and how to solve it.
Here is an example of my code with some smaller "tests" implemented in it:
# Import libraries:
from google.cloud import storage
from google.oauth2 import service_account
import mysql.connector
import pandas as pd
import time
from datetime import timedelta, date

# Create a connection to the new DWH:
coon = mysql.connector.connect(
    host="the host goes here",
    user="the user goes here",
    passwd="the password goes here"
)

# Create Google Cloud service credential references:
credentials = service_account.Credentials.from_service_account_file(r'C:\Users\ivo.vancans\OneDrive\Documents\Python Workspace\my credential json goes here.json')
project_id = 'my project id goes here'

cursor = coon.cursor()

# Create lists of countries and dimensions
countries = ['EE', 'FI', 'LV', 'LT']
appp_id_dim = ['ga:dimension5', 'ga:dimension5', 'ga:dimension5', 'ga:dimension5']
status_dim = ['ga:dimension21', 'ga:dimension12', 'ga:dimension20', 'ga:dimension15']
score_dim = ['ga:dimension11', 'ga:dimension11', 'ga:dimension19', 'ga:dimension14']

# Define the current date and the date 7 days before it:
date_now = date.today() - timedelta(days=1)
date_7d_prev = date_now - timedelta(days=7)

# Create a loop
for c, s in zip(countries, score_dim):
    start_time = time.time()
    # Create the query using string formatting:
    query = f"""select ca.ID, sv.subType, SUM(svl.score) as '{s}'
    from aio.CreditApplication ca
    join aio.ScoringResult sr
        on sr.creditApplication_ID = ca.ID
    join aio.ScorecardVariableLine svl
        on svl.id = sr.scorecardVariableLine_ID
    join aio.ScorecardVariable sv
        on sv.ID = svl.scorecardVariable_ID
    where sv.country='{c}'
    #and sv.subType ="asc"
    and sv.subType != 'fsc'
    and sr.created >= '2020-01-01'
    and sr.created between '{date_7d_prev} 00:00:00' and '{date_now} 23:59:59'
    group by ca.id, sv.subType"""
    # Check of sql query
    print('query is done', time.time() - start_time)

    start_time = time.time()
    sql = pd.read_sql_query(query, coon)
    # Check of assigning sql:
    print('sql is assigned', time.time() - start_time)

    start_time = time.time()
    df = pd.DataFrame(sql
                      # , columns = ['created','ID','state']
                      )
    # Check the df assignment:
    print('df has been assigned', time.time() - start_time)

    # Create a .csv file from the final dataframe:
    start_time = time.time()
    df.to_csv(fr"C:\Users\ivo.vancans\OneDrive\Documents\Python Workspace\Testing Ground\{c}_sql_loop_test.csv", index=False, header=True, encoding='utf-8', sep=';')
    # Check csv file creation:
    print('csv has been created', time.time() - start_time)

# Close the session
start_time = time.time()
cursor.close()
# Check the session closing:
print('The cursor is closed', time.time() - start_time)
This example has 4 countries because I tried cutting the amount in half, but that doesn't help either. I originally thought I had some sort of query restriction on the DWH end, because the major slowdown always started with the 5th country, but running them separately takes almost the same time for each and it still takes too long.
So, my tests show that the loop always lags at the step of querying data. Every other step takes less than a second, but querying time goes up to 140-500 seconds, sometimes even more, as mentioned previously. So, what do you think is the problem?
Found the solution! After talking to a person in my company who has a lot more experience with SQL and our particular DWH engine, he agreed to help and rewrote the SQL part. Instead of left joining a subquery, I had to rewrite it so that there would be no subquery. Why? Because our particular engine doesn't create an index for a subquery, but separately joined tables will have indexes. That improved the run time of the whole script dramatically, from ~40 minutes to less than a minute.
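The rewritten query itself isn't shown; as a rough, hypothetical illustration of the kind of change described (generic table and column names, not the author's actual schema), the rewrite replaces a joined subquery with direct table joins so the engine can use the tables' indexes:
# Hypothetical before/after sketch; 'orders' and 'order_items' are made up.
subquery_version = """
    SELECT o.id, t.total
    FROM orders o
    LEFT JOIN (SELECT order_id, SUM(amount) AS total
               FROM order_items
               GROUP BY order_id) t ON t.order_id = o.id
"""

# Joining the table directly lets the engine use its indexes; the
# aggregation moves into the outer query instead.
direct_join_version = """
    SELECT o.id, SUM(oi.amount) AS total
    FROM orders o
    LEFT JOIN order_items oi ON oi.order_id = o.id
    GROUP BY o.id
"""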

Unable to run a standardSQL query to bigquery in python

I am trying to query a table in BigQuery via a Python script. However, I have written the query as a standard SQL query, so I need to start it with '#standardsql'. But when I do this, it then comments out the rest of my query. I have tried to write the query across multiple lines, but it does not allow me to do this either. Has anybody dealt with a problem like this and found a solution? Below is my first attempt, where the query becomes commented out.
import uuid
from google.cloud import bigquery

client = bigquery.Client('dataworks-356fa')
query = ("#standardsql SELECT count(distinct serial) FROM `dataworks-356fa.FirebaseArchive.test2` Where (PeripheralType = 1 or PeripheralType = 2 or PeripheralType = 12) AND EXTRACT(WEEK FROM createdAt) = EXTRACT(WEEK FROM CURRENT_TIMESTAMP()) - 1 AND serial != 'null'")
dataset = client.dataset('FirebaseArchive')
table = dataset.table('test2')
tbl = dataset.table('Count_BB_Serial_weekly')
job = client.run_async_query(str(uuid.uuid4()), query)
job.destination = tbl
job.write_disposition = 'WRITE_TRUNCATE'
job.begin()
When I try to write the query like this, Python does not read anything on the second line as part of the query.
query = ("#standardsql
SELECT count(distinct serial) FROM `dataworks-356fa.FirebaseArchive.test2` Where (PeripheralType = 1 or PeripheralType = 2 or PeripheralType = 12) AND EXTRACT(WEEK FROM createdAt) = EXTRACT(WEEK FROM CURRENT_TIMESTAMP()) - 1 AND serial != 'null'")
The query I'm running selects values that have been produced within the last week. If there is a variation of this that does not require standard SQL, I would be willing to switch my other queries as well, but I have not been able to figure out how to do that. I would prefer for this to be the last resort, though. Thank you for the help!
If you want to flag that you'll be using Standard SQL inside the query itself, you can build it like:
query = """#standardSQL
SELECT count(distinct serial) FROM `dataworks-356fa.FirebaseArchive.test2` Where (PeripheralType = 1 or PeripheralType = 2 or PeripheralType = 12) AND EXTRACT(WEEK FROM createdAt) = EXTRACT(WEEK FROM CURRENT_TIMESTAMP()) - 1 AND serial != 'null'
"""
Another option is to set the use_legacy_sql property of the created job to False, something like:
job = client.run_async_query(job_name, query)
job.use_legacy_sql = False # -->this also makes the API use Standard SQL
job.begin()
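For reference, newer versions of the google-cloud-bigquery client have replaced run_async_query with client.query(). A rough sketch of the same job using that API (table names follow the question; treat this as an untested outline, not the answer's original code):
from google.cloud import bigquery

client = bigquery.Client(project='dataworks-356fa')

job_config = bigquery.QueryJobConfig(
    use_legacy_sql=False,  # same effect as job.use_legacy_sql = False above
    destination=bigquery.TableReference.from_string(
        'dataworks-356fa.FirebaseArchive.Count_BB_Serial_weekly'),
    write_disposition='WRITE_TRUNCATE',
)

query = """
SELECT COUNT(DISTINCT serial)
FROM `dataworks-356fa.FirebaseArchive.test2`
WHERE PeripheralType IN (1, 2, 12)
  AND EXTRACT(WEEK FROM createdAt) = EXTRACT(WEEK FROM CURRENT_TIMESTAMP()) - 1
  AND serial != 'null'
"""

query_job = client.query(query, job_config=job_config)
query_job.result()  # wait for the job to finish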
