How does apache spark allocate tasks in the following scenario with mapPartitions? - python

Given the following Apache Spark (Python) code (it is working):
import sys
from random import random
from operator import add
import sqlite3
from datetime import date
from datetime import datetime
from pyspark import SparkContext
def agePartition(recs):
    gconn = sqlite3.connect('/home/chris/test.db')
    myc = gconn.cursor()
    today = date.today()
    return_part = []
    for rec in recs:
        sql = "select birth_date from peeps where name = '{n}'".format(n=rec[0])
        myc.execute(sql)
        bdrec = myc.fetchone()
        born = datetime.strptime(bdrec[0], '%Y-%m-%d')
        return_part.append( (rec[0], today.year - born.year - ((today.month, today.day) < (born.month, born.day))) )
    gconn.close()
    return iter(return_part)

if __name__ == "__main__":
    """
    Usage: pi [partitions]
    """
    sc = SparkContext(appName="PythonDBTEST")
    print('starting...')
    data = [('Chris', 1), ('Amanda', 2), ('Shiloh', 2), ('Sammy', 2), ('Tim', 1)]
    rdd = sc.parallelize(data, 5)
    rslt_collect = rdd.mapPartitions(agePartition).collect()
    for x in rslt_collect:
        print("{n} is {a}".format(n=x[0], a=x[1]))
    sc.stop()
In a two compute/slave node setup with a total of 8 CPUs, would each of the partitions be created as a task and allocated across the 2 nodes so that all 5 partitions run in parallel? If not, what more would need to be done to make sure that happens?
The intent here was to test keeping a global database connection alive per slave worker process so the database connection doesn't have to be re-opened for each record in the RDD that gets processed. I'm using SQLite in this example, but it will be a SQLCipher database, and opening a connection to that is a lot more time consuming.

Assuming you have 8 available slots (CPUs) in the cluster, you can process up to 8 partitions concurrently. In your case, you have 5 partitions, so they should all be processed in parallel. This would be 5 concurrent connections to the database.
My expectation would be one per core so that if the number of records were much greater I would not be continually recreating database connections.
In your case, it will be per partition. If you have 20 partitions and 8 cores, you will still create the connection 20 times.
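If the goal is one connection per worker process rather than per partition, a common workaround (a sketch, not from the original post) is to cache the connection in a module-level global; whether it is actually reused across tasks depends on Python worker reuse (spark.python.worker.reuse, enabled by default).
# Sketch: open the SQLite connection at most once per Python worker process
# by caching it in a module-level global, instead of once per partition.
import sqlite3
from datetime import date, datetime

_gconn = None

def get_conn():
    global _gconn
    if _gconn is None:
        _gconn = sqlite3.connect('/home/chris/test.db')
    return _gconn

def agePartition(recs):
    myc = get_conn().cursor()
    today = date.today()
    for rec in recs:
        myc.execute("select birth_date from peeps where name = ?", (rec[0],))
        born = datetime.strptime(myc.fetchone()[0], '%Y-%m-%d')
        yield (rec[0], today.year - born.year - ((today.month, today.day) < (born.month, born.day)))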

Related

Slow MySQL database query time in a Python for loop

I have a task to run 8 equal queries (1 query per country) and return data from a MySQL database. The reason I can't run 1 query with all countries in it is that each country needs to have different column names. Also, results need to be updated daily with a dynamic date range (last 7 days). Yes, I could run all countries together and do the column naming and everything with Pandas, but I thought that the following solution would be more efficient. So, my solution was to create a for loop that uses predefined lists of all the countries, their respective dimensions, and date range variables that change according to the current date. The problem I'm having is that the MySQL query running in the loop takes much more time than if I run the same query directly in our data warehouse (~140-500 seconds vs. 30 seconds). The solution works with smaller tables from the DWH. The thing is that I don't know which part exactly is causing the problem and how to solve it.
Here is an example of my code with some smaller "tests" implemented in it:
#Import libraries:
from google.cloud import storage
from google.oauth2 import service_account
import mysql.connector
import pandas as pd
import time
from datetime import timedelta, date

#Create a connection to new DWH:
coon = mysql.connector.connect(
    host="the host goes here",
    user="the user goes here",
    passwd="the password goes here"
)

#Create Google Cloud Service credential references:
credentials = service_account.Credentials.from_service_account_file(r'C:\Users\ivo.vancans\OneDrive\Documents\Python Workspace\my credential json goes here.json')
project_id='my project id goes here'

cursor = coon.cursor()

#Create lists of countries and dimensions
countries = ['EE','FI','LV','LT']
appp_id_dim = ['ga:dimension5','ga:dimension5','ga:dimension5','ga:dimension5']
status_dim = ['ga:dimension21','ga:dimension12','ga:dimension20','ga:dimension15']
score_dim = ['ga:dimension11','ga:dimension11','ga:dimension19','ga:dimension14']

#Define the current date and date that was 7 days before current date:
date_now = date.today() - timedelta(days=1)
date_7d_prev = date_now - timedelta(days=7)

#Create a loop
for c,s in zip(countries, score_dim):

    start_time = time.time()
    #Create the query using string formating:
    query = f"""select ca.ID, sv.subType, SUM(svl.score) as '{s}'
    from aio.CreditApplication ca
    join aio.ScoringResult sr
        on sr.creditApplication_ID = ca.ID
    join aio.ScorecardVariableLine svl
        on svl.id = sr.scorecardVariableLine_ID
    join aio.ScorecardVariable sv
        on sv.ID = svl.scorecardVariable_ID
    where sv.country='{c}'
    #and sv.subType ="asc"
    and sv.subType != 'fsc'
    and sr.created >= '2020-01-01'
    and sr.created between '{date_7d_prev} 00:00:00' and '{date_now} 23:59:59'
    group by ca.id,sv.subType"""
    #Check of sql query
    print('query is done', time.time()-start_time)

    start_time = time.time()
    sql = pd.read_sql_query(query, coon)
    #check of assigning sql:
    print ('sql is assigned',time.time()-start_time)

    start_time = time.time()
    df = pd.DataFrame(sql
                      #, columns = ['created','ID','state']
                      )
    #Check the df assignment:
    print ('df has been assigned', time.time()-start_time)

    #Create a .csv file from the final dataframe:
    start_time = time.time()
    df.to_csv(fr"C:\Users\ivo.vancans\OneDrive\Documents\Python Workspace\Testing Ground\{c}_sql_loop_test.csv", index = False, header=True, encoding='utf-8', sep=';')
    #Check csv file creation:
    print ('csv has been created',time.time()-start_time)

    #Close the session
    start_time = time.time()
    cursor.close()
    #Check the session closing:
    print('The cursor is closed',time.time()-start_time)
This example has 4 countries because I tried cutting the amount in half, but that doesn't help either. That was me thinking that I had some sort of query restriction on the DWH end, because the major slowdown always started with the 5th country. Running them separately takes almost the same time for each, but it still takes too long.
So, my tests show that the loop always lags at the step of querying data. Every other step takes less than a second, but the querying time goes up to 140-500 seconds, sometimes even more, as mentioned previously. So, what do you think is the problem?
Found the solution! After talking to a person in my company who has a lot more experience with SQL and our particular DWH engine, he agreed to help and rewrote the SQL part. Instead of left joining a subquery, I had to rewrite it so that there would be no subquery. Why? Because our particular engine doesn't create an index for a subquery, but separately joined tables will have indexes. That improved the run time of the whole script dramatically, from ~40 minutes to less than 1 minute.
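For illustration only (the rewritten query itself isn't shown in the post, and these table and column names are invented), the change was of this general shape: replace a join against a derived table with direct joins that the engine can index.
# Hypothetical sketch of the rewrite described above; schema names are made up.
query_with_subquery = """
select t.ID, x.total
from main_table t
left join (select key_id, sum(score) as total
           from detail_table
           group by key_id) x on x.key_id = t.ID
"""

# Rewritten without the derived table, so the join can use detail_table's index.
query_rewritten = """
select t.ID, sum(d.score) as total
from main_table t
join detail_table d on d.key_id = t.ID
group by t.ID
"""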

How to subtract a timedelta from a datetime in peewee?

Consider the following tables:
class Recurring(db.Model):
    schedule = ForeignKeyField(Schedule)
    occurred_at = DateTimeField(default=datetime.now)

class Schedule(db.Model):
    delay = IntegerField()  # I would prefer if we had a TimeDeltaField
Now, I'd like to get all those events which should recur:
query = Recurring.select(Recurring, Schedule).join(Schedule)
query = query.where(Recurring.occurred_at < now - Schedule.delay) # wishful
Unfortunately, this doesn't work. Hence, I'm currently doing something as follows:
for schedule in schedules:
    then = now - timedelta(minutes=schedule.delay)
    query = Recurring.select(Recurring, Schedule).join(Schedule)
    query = query.where(Schedule == schedule, Recurring.occurred_at < then)
However, now instead of executing one query, I am executing multiple queries.
Is there a way to solve the above problem only using one query? One solution that I thought of was:
class Recurring(db.Model):
    schedule = ForeignKeyField(Schedule)
    occurred_at = DateTimeField(default=datetime.now)
    repeat_after = DateTimeField()  # repeat_after = occurred_at + delay
query = Recurring.select(Recurring, Schedule).join(Schedule)
query = query.where(Recurring.repeat_after < now)
However, the above schema violates the rules of the third normal form.
Each database implements different datetime addition functionality, which sucks. So it will depend a little bit on what database you are using.
For postgres, for example, we can use the "interval" helper:
# Calculate the timestamp of the next occurrence. This is done
# by taking the last occurrence and adding the number of seconds
# indicated by the schedule.
one_second = SQL("INTERVAL '1 second'")
next_occurrence = Recurring.occurred_at + (one_second * Schedule.delay)

# Get all recurring rows where the current timestamp on the
# postgres server is greater than the calculated next occurrence.
query = (Recurring
         .select(Recurring, Schedule)
         .join(Schedule)
         .where(SQL('current_timestamp') >= next_occurrence))

for recur in query:
    print(recur.occurred_at, recur.schedule.delay)
You could also substitute a datetime object for the "current_timestamp" if you prefer:
my_dt = datetime.datetime(2019, 3, 1, 3, 3, 7)
...
.where(Value(my_dt) >= next_occurrence)
For SQLite, you would do:
# Convert to a timestamp, add the scheduled seconds, then convert back
# to a datetime string for comparison with the last occurrence.
next_ts = fn.strftime('%s', Recurring.occurred_at) + Schedule.delay
next_occurrence = fn.datetime(next_ts, 'unixepoch')
For MySQL, you would do:
# from peewee import NodeList
nl = NodeList((SQL('INTERVAL'), Schedule.delay, SQL('SECOND')))
next_occurrence = fn.date_add(Recurring.occurred_at, nl)
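Whichever backend you use, the computed next_occurrence expression then slots into the same query shape as the Postgres example above. A minimal sketch reusing the SQLite expressions (note that SQLite's datetime('now') is UTC, so timezone handling may need care):
# Sketch: plug the SQLite-style expression into the same query pattern shown
# for Postgres. Assumes the Recurring/Schedule models from the question.
from peewee import fn

next_ts = fn.strftime('%s', Recurring.occurred_at) + Schedule.delay
next_occurrence = fn.datetime(next_ts, 'unixepoch')

query = (Recurring
         .select(Recurring, Schedule)
         .join(Schedule)
         .where(fn.datetime('now') >= next_occurrence))  # datetime('now') is UTC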
Lastly, I'd suggest you try better names for your models/fields, e.g. Schedule.interval instead of Schedule.delay, and Recurring.last_run instead of occurred_at.

Fastest way to read huge MySQL table in python

I was trying to read a very large MySQL table made up of several million rows. I used the Pandas library and chunks. See the code below:
import pandas as pd
import numpy as np
import pymysql.cursors

connection = pymysql.connect(user='xxx', password='xxx', database='xxx', host='xxx')

try:
    with connection.cursor() as cursor:
        query = "SELECT * FROM example_table;"
        chunks=[]
        for chunk in pd.read_sql(query, connection, chunksize = 1000):
            chunks.append(chunk)
            #print(len(chunks))
        result = pd.concat(chunks, ignore_index=True)
        #print(type(result))
        #print(result)
finally:
    print("Done!")
    connection.close()
Actually, the execution time is acceptable if I limit the number of rows to select. But if I want to select even just a modest amount of data (for example, 1 million rows), then the execution time increases dramatically.
Is there maybe a better/faster way to select the data from a relational database within Python?
Another option might be to use the multiprocessing module, dividing the query up and sending it to multiple parallel processes, then concatenating the results.
Without knowing much about pandas chunking - I think you would have to do the chunking manually (which depends on the data)... Don't use LIMIT / OFFSET - performance would be terrible.
This might not be a good idea, depending on the data. If there is a useful way to split up the query (e.g. if it's a time series, or there is some kind of appropriate index column to use), it might make sense. I've put in two examples below to show different cases.
Example 1
import pandas as pd
import MySQLdb
import multiprocessing

def worker(y):
    #where y is a value in an indexed column, e.g. a category
    connection = MySQLdb.connect(user='xxx', password='xxx', database='xxx', host='xxx')
    query = "SELECT * FROM example_table WHERE col_x = {0}".format(y)
    return pd.read_sql(query, connection)

p = multiprocessing.Pool(processes=10)
#(or however many processes you want to allocate)
data = p.map(worker, [y for y in col_x_categories])
#assuming there is a reasonable number of categories in an indexed col_x
p.close()
results = pd.concat(data)
Example 2
import pandas as pd
import MySQLdb
import datetime
import multiprocessing

def worker(a, b):
    #where a and b are timestamps
    connection = MySQLdb.connect(user='xxx', password='xxx', database='xxx', host='xxx')
    query = "SELECT * FROM example_table WHERE x >= {0} AND x < {1}".format(a, b)
    return pd.read_sql(query, connection)

p = multiprocessing.Pool(processes=10)
#(or however many processes you want to allocate)
date_range = pd.date_range(start=d1, end=d2, freq="A-JAN")
# this is arbitrary and will depend on your data / knowing your data beforehand (i.e. d1, d2 and an appropriate freq to use)
date_pairs = list(zip(date_range, date_range[1:]))
# starmap unpacks each (a, b) pair into the two worker arguments
data = p.starmap(worker, date_pairs)
p.close()
results = pd.concat(data)
There are probably nicer ways of doing this (and I haven't properly tested it, etc.). I'd be interested to know how it goes if you try it.
You could try using a different mysql connector. I would recommend trying mysqlclient which is the fastest mysql connector (by a considerable margin I believe).
pymysql is a pure Python MySQL client, whereas mysqlclient is a wrapper around the (much faster) C libraries.
Usage is basically the same as pymysql:
import MySQLdb
connection = MySQLdb.connect(user='xxx', password='xxx', database='xxx', host='xxx')
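For example, the chunked read from the question should work unchanged with the MySQLdb connection (a sketch, using the same placeholder credentials and table name as above):
# Sketch: same chunked read as in the question, but over a MySQLdb
# (mysqlclient) connection. Credentials and table name are placeholders.
import pandas as pd
import MySQLdb

connection = MySQLdb.connect(user='xxx', password='xxx', database='xxx', host='xxx')
chunks = []
for chunk in pd.read_sql("SELECT * FROM example_table;", connection, chunksize=1000):
    chunks.append(chunk)
result = pd.concat(chunks, ignore_index=True)
connection.close()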
Read more about the different connectors here: What's the difference between MySQLdb, mysqlclient and MySQL connector/Python?
For those using Windows and having trouble installing MySQLdb: I use this approach to fetch data from a huge table.
import mysql.connector

# Connection and cursor setup (credentials are placeholders)
connection = mysql.connector.connect(user='xxx', password='xxx', database='xxx', host='xxx')
cursor = connection.cursor()

i = 0       # row offset; start at 0 so the first row isn't skipped
limit = 1000
while True:
    sql = "SELECT * FROM super_table LIMIT {}, {}".format(i, limit)
    cursor.execute(sql)
    rows = cursor.fetchall()
    if not len(rows):  # break the loop when no more rows
        print("Done!")
        break
    for row in rows:  # do something with results
        print(row)
    i += limit

pyodbc to SQL Server too slow while fetching results

I am using Jupyter Notebook with Python 3 and connecting to a SQL Server database. I am using pyodbc version 4.0.22 to connect to the database.
My goal is to store the SQL results in a pandas DataFrame, but the query was very slow.
Here is the code:
import pyodbc
import pandas as pd
import time

cnxn = pyodbc.connect("DSN=ISTPRD02;"
                      "Trusted_Connection=yes;")
ontem = '20180521'
query = "SELECT LOJA, COUNT(DISTINCT RA) FROM VENDAS_CONTRATO(NOLOCK) WHERE DT_RETIRADA_RA = '" + ontem + "' AND SITUACAO IN ('ABERTO', 'FECHADO') GROUP BY LOJA"

start = time.time()
ra_ontem = pd.read_sql_query(query, cnxn)
end = time.time()
print("Tempo: ", end - start)

Tempo:  26.379971981048584
Since it took a long time, I monitored the database server, and it takes about 3 seconds to run the query on the server, as you can see below:
query = "SELECT LOJA, COUNT(DISTINCT RA) FROM VENDAS_CONTRATO(NOLOCK) WHERE DT_RETIRADA_RA = '" + ontem + "' AND SITUACAO IN ('ABERTO', 'FECHADO') GROUP BY LOJA"
start = time.time()
crsr = cnxn.cursor()
crsr.execute(query)
end = time.time()
print("Tempo: ", end - start)
Tempo: 3.7947773933410645
start = time.time()
crsr.fetchone()
end = time.time()
print("Tempo: ", end - start)
Tempo: 0.2396855354309082
start = time.time()
crsr.fetchall()
end = time.time()
print("Tempo: ", end - start)
Tempo: 23.67447066307068
So it seems that my problem is local, once the data has already been retrieved from the database server; it looks like the Python code is slow when dealing with the data.
But I have only 892 lines!
ra_ontem.shape
(189, 2)
So my question is: how can I make this faster and load the results into a Pandas DataFrame?
Thanks
This might get you a bit faster than usual (note that fetchallarrow() comes from the turbodbc package rather than plain pyodbc):
cursor.execute(query)
df = cursor.fetchallarrow().to_pandas()
I had the same issue; it was just because tracing was switched on.
Just open up ODBC Data Source Administrator and go to the Tracing tab and turn off tracing. It completely solves the problem.
Your problem is not with pyodbc but with SQL Server. Your code has two problems:
1) You need to create indices on the columns which appear in the WHERE clause (i.e. DT_RETIRADA_RA and SITUACAO). Note that if you always filter SITUACAO on those two values, you can use a filtered index. If you already have an index on these two fields, the best solution is to rebuild the index.
2) Your query most probably suffers from "parameter sniffing"; you should read up on that.
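As a rough sketch of point 1 (the index name is made up, and this assumes VENDAS_CONTRATO is a plain table you are allowed to index), the index could be created once with something like:
# Hypothetical sketch for point 1: a composite index on the WHERE-clause
# columns. The index name is invented; run once against the database.
import pyodbc

cnxn = pyodbc.connect("DSN=ISTPRD02;Trusted_Connection=yes;", autocommit=True)
cnxn.cursor().execute(
    "CREATE NONCLUSTERED INDEX IX_VendasContrato_DtRetiradaRa_Situacao "
    "ON VENDAS_CONTRATO (DT_RETIRADA_RA, SITUACAO)"
)
cnxn.close()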

python querying all rows of azure table

I have around 20,000 rows in my Azure table. I want to query all the rows in the table, but due to a certain Azure limitation I am getting only 1000 rows.
My code
from azure.storage import TableService
table_service = TableService(account_name='xxx', account_key='YYY')
i=0
tasks=table_service.query_entities('ValidOutputTable',"PartitionKey eq 'tasksSeattle'")
for task in tasks:
    i+=1
    print task.RowKey,task.DomainUrl,task.Status
print i
I want to get all the rows from the ValidOutputTable. Is there a way to do so?
But due to a certain Azure limitation I am getting only 1000 rows.
This is a documented limitation. Each query request to Azure Table will return no more than 1000 rows. If there are more than 1000 entities, table service will return a continuation token that must be used to fetch next set of entities (See Remarks section here: http://msdn.microsoft.com/en-us/library/azure/dd179421.aspx)
Please see the sample code to fetch all entities from a table:
from azure import *
from azure.storage import TableService

table_service = TableService(account_name='xxx', account_key='yyy')
i=0
next_pk = None
next_rk = None
while True:
    entities=table_service.query_entities('Address',"PartitionKey eq 'Address'", next_partition_key = next_pk, next_row_key = next_rk, top=1000)
    i+=1
    for entity in entities:
        print(entity.AddressLine1)
    if hasattr(entities, 'x_ms_continuation'):
        x_ms_continuation = getattr(entities, 'x_ms_continuation')
        next_pk = x_ms_continuation['nextpartitionkey']
        next_rk = x_ms_continuation['nextrowkey']
    else:
        break
Update 2019
Just running a for loop over the query result (as the author of the question does) will get all the data from the query.
from azure.cosmosdb.table.tableservice import TableService

table_service = TableService(account_name='accont_name', account_key='key')

#counter to keep track of records
counter=0

# get the rows. Debugger shows the object has only 100 records
rows = table_service.query_entities(table,"PartitionKey eq 'mykey'")
for row in rows:
    if (counter%100 == 0):
        # just to keep output smaller, print every 100 records
        print("Processing {} record".format(counter))
    counter+=1
The output proves that the loop goes over more than 1000 records:
...
Processing 363500 record
Processing 363600 record
...
Azure Table Storage has a new Python library in preview release that is available for installation via pip. To install, use the following pip command:
pip install azure-data-tables
To query all rows for a given table with the newest library, you can use the following code snippet:
import os
from azure.data.tables import TableClient

key = os.environ['TABLES_PRIMARY_STORAGE_ACCOUNT_KEY']
account_name = os.environ['tables_storage_account_name']
endpoint = os.environ['TABLES_STORAGE_ENDPOINT_SUFFIX']
account_url = "{}.table.{}".format(account_name, endpoint)
table_name = "myBigTable"

with TableClient(account_url=account_url, credential=key, table_name=table_name) as table_client:
    try:
        table_client.create_table()
    except:
        pass

    i = 0
    for entity in table_client.list_entities():
        print(entity['value'])
        i += 1
        if i % 100 == 0:
            print(i)
Your output would look like this (modified for brevity, assuming there are 2000 entities):
...
1100
1200
1300
...
