Pandas_gbq does not append in order - python

I am using a VM instance in Google Cloud and I want to use BigQuery as well.
Every 10 minutes, a while loop generates a new report that I try to append to the end of a BigQuery table with the script below.
position_size = np.zeros([24, 24], dtype=float)
... some codes here
... some codes here
... some codes here
... some codes here
while 1:
    current_time = datetime.datetime.now()
    if current_time.minute % 10 == 0:
        try:
            report = pd.DataFrame(position_size[12:24], columns=('pos' + str(x) for x in range(0, 24)))
            report.insert(loc=0, column="timestamp", value=datetime.datetime.now())
            ... some codes here
            pandas_gbq.to_gbq(report, "reports.report", if_exists='append')
        except ccxt.BaseError as Error:
            print("[ERROR] ", Error)
            continue
But as you can see in the screenshot below, the rows are not appended in order. How can I solve this issue? Thank you in advance.

How are you reading results? In most databases (BigQuery included), the row order is indeterminate in the absence of an ordering expression. You likely need an ORDER BY clause in your SELECT statement.
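For example, a minimal sketch (assuming the reports.report table and the timestamp column from the question; pass project_id explicitly if it is not configured elsewhere):
import pandas_gbq

# Hypothetical read-back query: the ordering is applied at read time,
# since BigQuery does not store rows in insertion order.
query = """
    SELECT *
    FROM reports.report
    ORDER BY timestamp
"""
df = pandas_gbq.read_gbq(query)  # add project_id=... if needed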

Related

Python and Pandas to Query API's and update DB

I've been querying a few APIs with Python to individually create CSVs for a table.
Instead of recreating the table each time, I would like to update the existing table with any new API data.
At the moment, the way the query works, I have a table that looks like this,
From this I am taking the suburbs of each state and copying them into a CSV for each state.
Then, using this script, I am cleaning them into a list (the API needs %20 for any spaces):
"%20"
#suburbs = ["want this", "want this (meh)", "this as well (nope)"]
suburb_cleaned = []
#dont_want = frozenset( ["(meh)", "(nope)"] )
for urb in suburbs:
    cleaned_name = []
    name_parts = urb.split()
    for part in name_parts:
        if part in dont_want:
            continue
        cleaned_name.append(part)
    suburb_cleaned.append('%20'.join(cleaned_name))
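(As an aside, and not part of the original script: the standard library can handle the %20 encoding on its own, so after stripping the unwanted parts the joined name could simply be passed to urllib.parse.quote.)
from urllib.parse import quote

# quote() percent-encodes spaces as %20 by default
quote("new town")  # -> 'new%20town'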
Then I take the suburbs for each state and put them into this API to return a CSV,
timestr = time.strftime("%Y%m%d-%H%M%S")
Name = "price_data_NT" + timestr + ".csv"
url_price = "http://mwap.com/api"
string = 'gxg&state='
api_results = {}
n = 0
y = 2
for urbs in suburb_cleaned:
    url = url_price + urbs + string + "NT"
    print(url)
    print(urbs)
    request = requests.get(url)
    api_results[urbs] = pd.DataFrame(request.json())
    n = n + 1
    if n == y:
        dfs = pd.concat(api_results).reset_index(level=1, drop=True).rename_axis(
            'key').reset_index().set_index(['key'])
        dfs.to_csv(Name, sep='\t', encoding='utf-8')
        y = y + 2
        continue
    print("made it through" + urbs)
# print(request.json())
# print(api_results)
dfs = pd.concat(api_results).reset_index(level=1, drop=True).rename_axis(
    'key').reset_index().set_index(['key'])
dfs.to_csv(Name, sep='\t', encoding='utf-8')
Then I add the states manually in Excel, and combine and clean the suburb names,
# use pd.concat
df = pd.concat([act, vic,nsw,SA,QLD,WA]).reset_index().set_index(['key']).rename_axis('suburb').reset_index().set_index(['state'])
# apply lambda to clean the %20
f = lambda s: s.replace('%20', ' ')
df['suburb'] = df['suburb'].apply(f)
and then finally insert it into a DB:
engine = create_engine('mysql://username:password@localhost/dbname')
with engine.connect() as conn, conn.begin():
    df.to_sql('Price_historic', conn, if_exists='replace', index=False)
Leading to this sort of output.
Now, this is a heck of a process. I would love to simplify it and have the database update only the values it needs from the API, without this much complexity in getting the data.
I would love some helpful tips on achieving this goal. I'm thinking I could do an update on the MySQL database instead of an insert, or something along those lines? And with the querying of the API, I feel like I'm overcomplicating it.
Thanks!
I don't see any reason why you would be creating CSV files in this process. It sounds like you can just query the data and then load it into a MySQL table directly. You say that you are adding the states manually in Excel? Is that data not available through your prior API calls? If not, could you find that information and save it to a CSV, so you can automate that step by loading it into a table and having Python look up the values for you?
Generally, you wouldn't want to overwrite the MySQL table every time. When you have a table, you can identify the column or columns that uniquely identify a specific record, then create a UNIQUE INDEX for them. For example, if your street and price values designate a unique entry, then in MySQL you could run:
ALTER TABLE `Price_historic` ADD UNIQUE INDEX(street, price);
After this, your table will not allow duplicate records based on those values. Then, instead of creating a new table every time, you can insert your data into the existing table, with instructions to either update or ignore when you encounter a duplicate. For example:
final_str = "INSERT INTO Price_historic (state, suburb, property_price_id, type, street, price, date) " \
            "VALUES (%s, %s, %s, %s, %s, %s, %s) " \
            "ON DUPLICATE KEY UPDATE " \
            "state = VALUES(state), date = VALUES(date)"

# pdb here is the MySQL driver module (e.g. pymysql imported as pdb)
con = pdb.connect(db_host, db_user, db_pass, db_name)
with con:
    cur = con.cursor()
    cur.executemany(final_str, insert_list)
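insert_list isn't defined in the snippet above; one way to build it (a sketch only, assuming the combined dataframe df from the question and that its columns match the INSERT column order) would be:
# Hypothetical: build the executemany() parameter list from the dataframe,
# keeping the columns in the same order as the INSERT statement above.
insert_list = df.reset_index()[
    ['state', 'suburb', 'property_price_id', 'type', 'street', 'price', 'date']
].values.tolist()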
If the setup you are building is something for the longer term, I would suggest running two different processes in parallel:
Process 1:
Query API 1, obtain the required data and insert it into the DB table, with a binary/bit flag that specifies that only API 1 has been called so far.
Process 2:
Run a query on the DB to obtain all records that still need API call 2, based on the binary/bit flag set in process 1. For the corresponding data, run call 2 and update the data back into the DB table based on the primary key.
Database: I would suggest adding a primary key as well as a [Bit Flag][1] column that records the status of the different API calls (a minimal sketch follows the reference below). The bit flag also helps you
- double-check whether a specific API call has been made for a specific record or not;
- expand your project to additional API calls while still tracking the status of each API call at the record level.
[1]: Bit Flags: https://docs.oracle.com/cd/B28359_01/server.111/b28286/functions014.htm#SQLRF00612
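A minimal sketch of the flag idea (the table, column and variable names here are hypothetical, not from the original answer, and the con/cur objects are reused from the snippet above):
# Hypothetical flag column tracking whether API call 2 has run for a record.
cur.execute("ALTER TABLE Price_historic ADD COLUMN api2_done BIT DEFAULT 0")

# Process 2: pick up the records still waiting for the second API call.
cur.execute("SELECT id, suburb FROM Price_historic WHERE api2_done = 0")
for row_id, suburb in cur.fetchall():
    # ... call API 2 for this record here, then mark it as processed
    cur.execute("UPDATE Price_historic SET api2_done = 1 WHERE id = %s", (row_id,))
con.commit()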

Slow MySQL database query time in a Python for loop

I have a task to run 8 equal queries (1 query per country) and return data from a MySQL database. The reason I can't run one query with all the countries is that each country needs to have different column names. Also, the results need to be updated daily with a dynamic date range (the last 7 days).

Yes, I could run all the countries and do the column naming and everything with Pandas, but I thought the following solution would be more efficient. My solution was to create a for loop that uses predefined lists with all the countries, their respective dimensions, and date range variables that change according to the current date.

The problem I'm having is that the MySQL query running in the loop takes much more time than if I run the same query directly in our data warehouse (~140-500 seconds vs. 30 seconds). The solution works with smaller tables from the DWH. The thing is that I don't know which part exactly is causing the problem or how to solve it.
Here is an example of my code with some smaller "tests" implemented in it:
#Import libraries:
from google.cloud import storage
from google.oauth2 import service_account
import mysql.connector
import pandas as pd
import time
from datetime import timedelta, date

#Create a connection to new DWH:
coon = mysql.connector.connect(
    host="the host goes here",
    user="the user goes here",
    passwd="the password goes here"
)

#Create Google Cloud Service credential references:
credentials = service_account.Credentials.from_service_account_file(r'C:\Users\ivo.vancans\OneDrive\Documents\Python Workspace\my credential json goes here.json')
project_id = 'my project id goes here'
cursor = coon.cursor()

#Create lists of countries and dimensions
countries = ['EE','FI','LV','LT']
appp_id_dim = ['ga:dimension5','ga:dimension5','ga:dimension5','ga:dimension5']
status_dim = ['ga:dimension21','ga:dimension12','ga:dimension20','ga:dimension15']
score_dim = ['ga:dimension11','ga:dimension11','ga:dimension19','ga:dimension14']

#Define the current date and the date 7 days before it:
date_now = date.today() - timedelta(days=1)
date_7d_prev = date_now - timedelta(days=7)

#Create a loop
for c, s in zip(countries, score_dim):
    start_time = time.time()
    #Create the query using string formatting:
    query = f"""select ca.ID, sv.subType, SUM(svl.score) as '{s}'
                from aio.CreditApplication ca
                join aio.ScoringResult sr
                    on sr.creditApplication_ID = ca.ID
                join aio.ScorecardVariableLine svl
                    on svl.id = sr.scorecardVariableLine_ID
                join aio.ScorecardVariable sv
                    on sv.ID = svl.scorecardVariable_ID
                where sv.country='{c}'
                #and sv.subType ="asc"
                and sv.subType != 'fsc'
                and sr.created >= '2020-01-01'
                and sr.created between '{date_7d_prev} 00:00:00' and '{date_now} 23:59:59'
                group by ca.id, sv.subType"""
    #Check of sql query
    print('query is done', time.time() - start_time)

    start_time = time.time()
    sql = pd.read_sql_query(query, coon)
    #Check of assigning sql:
    print('sql is assigned', time.time() - start_time)

    start_time = time.time()
    df = pd.DataFrame(sql
                      #, columns = ['created','ID','state']
                      )
    #Check the df assignment:
    print('df has been assigned', time.time() - start_time)

    #Create a .csv file from the final dataframe:
    start_time = time.time()
    df.to_csv(fr"C:\Users\ivo.vancans\OneDrive\Documents\Python Workspace\Testing Ground\{c}_sql_loop_test.csv", index=False, header=True, encoding='utf-8', sep=';')
    #Check csv file creation:
    print('csv has been created', time.time() - start_time)

#Close the session
start_time = time.time()
cursor.close()
#Check the session closing:
print('The cursor is closed', time.time() - start_time)
This example has 4 countries because I tried cutting the amount in half, but that didn't help either. I thought I might have some sort of query restriction on the DWH end, because the major slowdown always started with the 5th country. Running them separately takes almost the same time for each, but it still takes too long.
So, my tests show that the loop always lags at the step of querying data. Every other step takes less than a second, but query time goes up to 140-500 seconds, sometimes even more, as mentioned previously. So, what do you think the problem is?
Found the solution! After talking to a person in my company who has a lot more experience with SQL and our particular DWH engine, he agreed to help and rewrote the SQL part. Instead of left joining a subquery, I had to rewrite the query so that there would be no subquery. Why? Because our particular engine doesn't create an index for a subquery, but separately joined tables do have indexes. That improved the run time of the whole script dramatically, from ~40 minutes to less than 1 minute.
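The rewritten query itself isn't shown, but as a generic illustration of the pattern (with made-up table names, not the author's actual SQL), the change looks roughly like this:
# Generic illustration only: hypothetical tables, not the query from the question.
# Slow: the LEFT JOINed subquery produces a derived table with no usable index.
slow_query = """
    SELECT a.id, t.total
    FROM applications a
    LEFT JOIN (SELECT application_id, SUM(score) AS total
               FROM score_lines
               GROUP BY application_id) t
        ON t.application_id = a.id
"""

# Faster on an engine that can't index the derived table: join the base table
# directly so its index on application_id is used, and aggregate once at the end.
fast_query = """
    SELECT a.id, SUM(sl.score) AS total
    FROM applications a
    LEFT JOIN score_lines sl
        ON sl.application_id = a.id
    GROUP BY a.id
"""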

python querying all rows of azure table

I have around 20,000 rows in my Azure table. I want to query all the rows in the table, but due to an Azure limitation I am getting only 1000 rows.
My code:
from azure.storage import TableService
table_service = TableService(account_name='xxx', account_key='YYY')
i = 0
tasks = table_service.query_entities('ValidOutputTable', "PartitionKey eq 'tasksSeattle'")
for task in tasks:
    i += 1
    print task.RowKey, task.DomainUrl, task.Status
    print i
I want to get all the rows from ValidOutputTable. Is there a way to do so?
"But due to an Azure limitation I am getting only 1000 rows."
This is a documented limitation. Each query request to Azure Table will return no more than 1000 rows. If there are more than 1000 entities, the table service will return a continuation token that must be used to fetch the next set of entities (see the Remarks section here: http://msdn.microsoft.com/en-us/library/azure/dd179421.aspx).
Please see the sample code to fetch all entities from a table:
from azure import *
from azure.storage import TableService

table_service = TableService(account_name='xxx', account_key='yyy')
i = 0
next_pk = None
next_rk = None
while True:
    entities = table_service.query_entities('Address', "PartitionKey eq 'Address'", next_partition_key=next_pk, next_row_key=next_rk, top=1000)
    i += 1
    for entity in entities:
        print(entity.AddressLine1)
    if hasattr(entities, 'x_ms_continuation'):
        x_ms_continuation = getattr(entities, 'x_ms_continuation')
        next_pk = x_ms_continuation['nextpartitionkey']
        next_rk = x_ms_continuation['nextrowkey']
    else:
        break
Update 2019
Just running a for loop over the query result (as the author of the topic does) will get all the data from the query.
from azure.cosmosdb.table.tableservice import TableService

table_service = TableService(account_name='accont_name', account_key='key')

# counter to keep track of records
counter = 0

# get the rows. Debugger shows the object has only 100 records
rows = table_service.query_entities(table, "PartitionKey eq 'mykey'")
for row in rows:
    if (counter % 100 == 0):
        # just to keep output smaller, print every 100 records
        print("Processing {} record".format(counter))
    counter += 1
The output proves that the loop goes past 1000 records:
...
Processing 363500 record
Processing 363600 record
...
Azure Table Storage has a new Python library in preview release that is available for installation via pip. To install it, use the following pip command:
pip install azure-data-tables
To query all rows for a given table with the newest library, you can use the following code snippet:
import os
from azure.data.tables import TableClient

key = os.environ['TABLES_PRIMARY_STORAGE_ACCOUNT_KEY']
account_name = os.environ['tables_storage_account_name']
endpoint = os.environ['TABLES_STORAGE_ENDPOINT_SUFFIX']
account_url = "{}.table.{}".format(account_name, endpoint)
table_name = "myBigTable"

with TableClient(account_url=account_url, credential=key, table_name=table_name) as table_client:
    try:
        table_client.create_table()
    except:
        pass
    i = 0
    for entity in table_client.list_entities():
        print(entity['value'])
        i += 1
        if i % 100 == 0:
            print(i)
Your output would look like this (modified for brevity, assuming there are 2000 entities):
...
1100
1200
1300
...

pg8000 and cursor.fetchall() failing to return records if the number of records is moderate

I'm using the adaptor pg8000 to read in records in my db with the following code:
cursor = conn.cursor()
results = cursor.execute("SELECT * from data_" + country + " WHERE date >= %s AND date <= %s AND time >= %s AND time <= %s", (datetime.date(fromdate[0], fromdate[1], fromdate[2]), datetime.date(todate[0], todate[1], todate[2]),datetime.time(fromtime[0],fromtime[1]), datetime.time(totime[0],totime[1])))
results = cursor.fetchall()
The problem emerges when I select a date range that brings in, say, 100 records. It's not a huge number of records, but it is enough to cause the following issue, and I cannot see where the issue might come from, as it seems to depend on the number of records brought back. For example, results = cursor.fetchall() appears to work just fine when it returns one result.
The error message I get is:
File "/mnt/opt/Centos5.8/python-2.7.8/lib/python2.7/site-packages/pg8000/core.py", line 1650, in handle_messages
raise error
pg8000.errors.ProgrammingError: ('ERROR', '34000', 'portal "pg8000_portal_0" does not exist')
I have not been able to find a way of fixing this, despite exploring.
When using fetchmany(), here are the results:
results = cursor.fetchmany(100) WORKS - limited to 100
results = cursor.fetchmany(101) FAILS - same error as above
In autocommit mode you can't retrieve more rows than the pg8000 cache holds (which is 100 by default).
I've made a commit that gives a better error message when this happens, which will be in the next release of pg8000.
The reason is that if the number of rows returned by a query is greater than the number of rows in the pg8000 cache, the database portal is kept open, and then when the cache is empty, more rows are fetched from the portal. A portal can only exist within a transaction, so in autocommit mode a portal is immediately closed after the first batch of rows is retrieved. If you try and retrieve a second batch from the portal, you get the 'portal does not exist' error that was reported in the question.
It appears this can be solved by setting:
conn.autocommit = False
Now the code looks like:
conn.autocommit = False
cursor = conn.cursor()
results = cursor.execute("SELECT * from data_" + country + " WHERE date >= %s AND date <= %s AND time >= %s AND time <= %s", (datetime.date(fromdate[0], fromdate[1], fromdate[2]), datetime.date(todate[0], todate[1], todate[2]),datetime.time(fromtime[0],fromtime[1]), datetime.time(totime[0],totime[1])))
results = cursor.fetchall()
I'm not sure why this should be the case, but there seems to be a limit of 100 records with autocommit set to True.

Python process abruptly killed during execution

I'm new to Python and am facing what seems to be a memory leak.
I've written a simple script that fetches multiple columns from a Postgres database, performs a simple subtraction on these columns, and stores the result in a temporary variable that is written to a file. I need to do this for multiple pairs of columns from the DB, and I'm using a list of lists to store the different column names.
I loop over the individual elements of this list until the list is exhausted. While I'm getting valid results (by valid I mean that the output file contains the expected values) for the first few column pairs, the program abruptly gets "Killed" somewhere during execution. Code below:
varList = [['table1', 'col1', 'col2'],
           ['table1', 'col3', 'col4'],
           ['table2', 'col1', 'col2'],
           # ..
           # and many more such lines
           # ..
           ['table2', 'col3', 'col4']]

try:
    conn = psycopg2.connect(database='somename', user='someuser', password='somepasswd')
    c = conn.cursor()
    for listVar in varList:
        c.execute("SELECT %s FROM %s" % (listVar[1], listVar[0]))
        rowsList1 = c.fetchall()
        c.execute("SELECT %s FROM %s" % (listVar[2], listVar[0]))
        rowsList2 = c.fetchall()
        outfile = file('%s__%s' % (listVar[1], listVar[2]), 'w')
        for i in range(0, len(rowsList1)):
            if rowsList1[i][0] == None or rowsList2[i][0] == None:
                timeDiff = -1
            else:
                timestamp1 = time.mktime(rowsList1[i][0].timetuple())
                timestamp2 = time.mktime(rowsList2[i][0].timetuple())
                timeDiff = timestamp2 - timestamp1
            outfile.write(str(timeDiff) + '\n')
        outfile.close()
        del rowsList1, rowsList2
        #numpy.savetxt('output.dat', column_stack(rows))
except psycopg2.DatabaseError, e:
    print 'Error %s' % e
    sys.exit(1)
finally:
    if conn:
        conn.close()
My initial guess was that there was some form of memory leak, and in an attempt to fix this I added a del statement on the two large arrays, hoping that the memory gets properly collected. This time I got slightly better output (by slightly better I mean that more output files were created for the DB column pairs).
However, after the 10th or 11th pair of columns, my program was "Killed" again. Can someone tell me what could be wrong here? Is there a better way of getting this done?
Any help is appreciated.
PS: I know that this is a fairly inefficient implementation as I'm looping many times, but I needed something quick and dirty for proof of concept.
I think the problem here is that you are selecting everything and then filtering it in the application code, when you should be selecting only what you want with the SQL query. If you select what you want in the SQL query, it looks something like this:
for listVar in varList:
    # Let the database skip the NULL rows instead of fetching everything.
    c.execute("SELECT %s, %s FROM %s WHERE %s IS NOT NULL AND %s IS NOT NULL"
              % (listVar[1], listVar[2], listVar[0], listVar[1], listVar[2]))
    timeDiff = {}
    for row in c.fetchall():
        timestamp1 = time.mktime(row[0].timetuple())
        timestamp2 = time.mktime(row[1].timetuple())
        # still need to associate the diff with a row; maybe you need to
        # query a unique identifier as well?
        timeDiff[identifier] = timestamp2 - timestamp1

    # And possibly a separate query (this may not be necessary depending on your
    # application code: do you really need -1's for irrelevant data, or can you
    # just return the important data?)
    c.execute("SELECT %s, %s FROM %s WHERE %s IS NULL OR %s IS NULL"
              % (listVar[1], listVar[2], listVar[0], listVar[1], listVar[2]))
    for row in c.fetchall():
        timeDiff[identifier] = -1  # or None
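As an aside (not part of the original answer): even with SQL-side filtering, fetchall() still pulls the whole result into memory at once. A psycopg2 named (server-side) cursor streams rows in batches instead, which keeps memory use flat for large tables. A minimal sketch, reusing the conn object from the question:
# Giving the cursor a name makes it a server-side cursor, so rows stay on the
# server and are fetched in batches of itersize as you iterate.
cur = conn.cursor(name='col_pair_cursor')
cur.itersize = 2000  # rows fetched per round trip
cur.execute("SELECT col1, col2 FROM table1 WHERE col1 IS NOT NULL AND col2 IS NOT NULL")
for row in cur:
    pass  # process one row at a time instead of holding the full list in memory
cur.close()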
