I am running Oracle 19c and I want to get the best insert performance I can. Currently I insert using INSERT /*+APPEND */ ..., which is fine, but not the speed I wanted.
I read that using FORALL is a lot faster, but I couldn't really find any examples.
Here is the code snippet (Python 3):
connection = pool.acquire()
cursor = connection.cursor()
cursor.executemany("INSERT /*+APPEND*/ INTO RANDOM VALUES (:1, :2, :3)", list(random))
connection.commit()
cursor.close()
connection.close()
I got really interested in what would be faster, so I've tested some possible ways to compare them:
simple executemany with no tricks.
the same with the APPEND_VALUES hint inside the statement.
the union all approach you've tried in another question. It should be slower than the above since it generates a really large statement (which can potentially require more network transfer than the data itself). The statement then has to be parsed at the DB side, which also consumes a lot of time and negates the benefits (not to mention the potential size limit). I then executemany'ed it in chunks so as not to build a single statement for 100k records. I didn't use concatenation of values inside the statement, because I wanted to keep it safe.
insert all. The same downsides, but no unions. Compare it with the union version.
serialize the data into JSON and do the deserialization at the DB side with json_table. Potentially good performance: a single short statement and a single data transfer, with little JSON overhead.
your suggested FORALL in a PL/SQL wrapper procedure. It should perform the same as executemany since it does the same thing, just at the database side, plus the overhead of transforming the data into the collection.
the same FORALL, but with a columnar approach to pass the data: pass simple lists of column values instead of a complex type. It should be much faster than FORALL with a collection since there's no need to serialize the data into the collection's type.
I've used an Oracle Autonomous Database in Oracle Cloud with a free account. Each method was executed 10 times in a loop with the same input dataset of 100k records, and the table was recreated before each test. These are the results I got. Preparation and execution times are, respectively, the data transformation at the client side and the DB call itself.
>>> t = PerfTest(100000)
>>> t.run("exec_many", 10)
Method: exec_many.
Duration, avg: 2.3083874 s
Preparation time, avg: 0.0 s
Execution time, avg: 2.3083874 s
>>> t.run("exec_many_append", 10)
Method: exec_many_append.
Duration, avg: 2.6031369 s
Preparation time, avg: 0.0 s
Execution time, avg: 2.6031369 s
>>> t.run("union_all", 10, 10000)
Method: union_all.
Duration, avg: 27.9444233 s
Preparation time, avg: 0.0408773 s
Execution time, avg: 27.8457551 s
>>> t.run("insert_all", 10, 10000)
Method: insert_all.
Duration, avg: 70.6442494 s
Preparation time, avg: 0.0289269 s
Execution time, avg: 70.5541995 s
>>> t.run("json_table", 10)
Method: json_table.
Duration, avg: 10.4648237 s
Preparation time, avg: 9.7907693 s
Execution time, avg: 0.621006 s
>>> t.run("forall", 10)
Method: forall.
Duration, avg: 5.5622837 s
Preparation time, avg: 1.8972456000000002 s
Execution time, avg: 3.6650380999999994 s
>>> t.run("forall_columnar", 10)
Method: forall_columnar.
Duration, avg: 2.6702698000000002 s
Preparation time, avg: 0.055710800000000005 s
Execution time, avg: 2.6105702 s
>>>
The fastest way is plain executemany, which is not much of a surprise. The interesting part is that APPEND_VALUES does not improve the query and takes more time on average, so this needs more investigation.
About FORALL: as expected, an individual array for each column takes less time, as there's no data preparation for it. It is more or less comparable with executemany, but I think the PL/SQL overhead plays some role here.
Another interesting part for me is JSON: most of the time was spent on writing the LOB into the database and on serialization, but the query itself was very fast. Maybe the write operation can be improved in some way with a chunk size or some other way to pass LOB data into the select statement, but as my code stands, it is far from the very simple and straightforward executemany approach.
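Just to separate the serialization cost from the LOB write, the JSON preparation step can be measured standalone. This is a small sketch with synthetic data of the same shape as self.inp (the timings will of course vary by machine):

```python
import json
import random
import time

# Synthetic rows with the same shape as the test dataset: (id, str, val)
inp = [(i, "Test {i}".format(i=i), random.random()) for i in range(100000)]

# Time only json.dumps, without any LOB or network involvement
start = time.perf_counter()
payload = json.dumps([{"id": t[0], "str": t[1], "val": t[2]} for t in inp])
dur = time.perf_counter() - start

print("serialized {} chars in {:.3f} s".format(len(payload), dur))
```

On my runs the dumps call itself is a small fraction of the ~9.8 s preparation time, which supports the idea that the LOB write dominates.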
There are also possible approaches without Python that should be faster, as they are native tools for external data, but I didn't test them:
Oracle SQL*Loader
External table
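As a sketch of what the SQL*Loader route could look like for this table (the file names and options here are assumptions, not something I tested), a minimal control file with direct-path loading would be:

```sql
-- rand.ctl: hypothetical control file for the rand table
OPTIONS (DIRECT=TRUE)
LOAD DATA
INFILE 'rand.csv'
APPEND
INTO TABLE rand
FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
(id, str, val)
```

It would then be run with something like `sqlldr userid=usr/pwd@test_low control=rand.ctl`; direct-path loading is what gives it the edge over conventional inserts.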
Below is the code I've used for testing.
import cx_Oracle as db
import os, random, json
import datetime as dt


class PerfTest:

    def __init__(self, size):
        self._con = db.connect(
            os.environ["ora_cloud_usr"],
            os.environ["ora_cloud_pwd"],
            "test_low",
            encoding="UTF-8"
        )
        self._cur = self._con.cursor()
        self.inp = [(i, "Test {i}".format(i=i), random.random()) for i in range(size)]

    def __del__(self):
        if self._con:
            self._con.rollback()
            self._con.close()

    #Create objects
    def setup(self):
        try:
            self._cur.execute("drop table rand")
            #print("table dropped")
        except:
            pass
        self._cur.execute("""create table rand(
            id int,
            str varchar2(100),
            val number
        )""")
        self._cur.execute("""create or replace package pkg_test as
            type ts_test is record (
                id rand.id%type,
                str rand.str%type,
                val rand.val%type
            );
            type tt_test is table of ts_test index by pls_integer;
            type tt_ids is table of rand.id%type index by pls_integer;
            type tt_strs is table of rand.str%type index by pls_integer;
            type tt_vals is table of rand.val%type index by pls_integer;
            procedure write_data(p_data in tt_test);
            procedure write_data_columnar(
                p_ids in tt_ids,
                p_strs in tt_strs,
                p_vals in tt_vals
            );
        end;""")
        self._cur.execute("""create or replace package body pkg_test as
            procedure write_data(p_data in tt_test)
            as
            begin
                forall i in indices of p_data
                    insert into rand(id, str, val)
                    values (p_data(i).id, p_data(i).str, p_data(i).val)
                ;
                commit;
            end;

            procedure write_data_columnar(
                p_ids in tt_ids,
                p_strs in tt_strs,
                p_vals in tt_vals
            ) as
            begin
                forall i in indices of p_ids
                    insert into rand(id, str, val)
                    values (p_ids(i), p_strs(i), p_vals(i))
                ;
                commit;
            end;
        end;""")

    def build_union(self, size):
        return """insert into rand(id, str, val)
            select id, str, val from rand where 1 = 0 union all
            """ + """ union all """.join(
                ["select :{}, :{}, :{} from dual".format(i*3+1, i*3+2, i*3+3)
                 for i in range(size)]
            )

    def build_insert_all(self, size):
        return """
        """.join(
            ["into rand(id, str, val) values (:{}, :{}, :{})".format(i*3+1, i*3+2, i*3+3)
             for i in range(size)]
        )

    #Test case with executemany
    def exec_many(self):
        start = dt.datetime.now()
        self._cur.executemany("insert into rand(id, str, val) values (:1, :2, :3)", self.inp)
        self._con.commit()
        return (dt.timedelta(0), dt.datetime.now() - start)

    #The same as above, but with the APPEND_VALUES hint
    def exec_many_append(self):
        start = dt.datetime.now()
        self._cur.executemany("insert /*+APPEND_VALUES*/ into rand(id, str, val) values (:1, :2, :3)", self.inp)
        self._con.commit()
        return (dt.timedelta(0), dt.datetime.now() - start)

    #Union All approach (chunked). Should have large parse time
    def union_all(self, size):
        ##Chunked list of big tuples
        start_prepare = dt.datetime.now()
        new_inp = [
            tuple([item for t in r for item in t])
            for r in list(zip(*[iter(self.inp)]*size))
        ]
        new_stmt = self.build_union(size)
        dur_prepare = dt.datetime.now() - start_prepare
        #Execute unions
        start_exec = dt.datetime.now()
        self._cur.executemany(new_stmt, new_inp)
        dur_exec = dt.datetime.now() - start_exec
        ##In case the size is not a divisor
        remainder = len(self.inp) % size
        if remainder > 0:
            start_prepare = dt.datetime.now()
            new_stmt = self.build_union(remainder)
            new_inp = tuple([
                item for t in self.inp[-remainder:] for item in t
            ])
            dur_prepare += dt.datetime.now() - start_prepare
            start_exec = dt.datetime.now()
            self._cur.execute(new_stmt, new_inp)
            dur_exec += dt.datetime.now() - start_exec
        self._con.commit()
        return (dur_prepare, dur_exec)

    #The same as union all, but with no need to union anything
    def insert_all(self, size):
        ##Chunked list of big tuples
        start_prepare = dt.datetime.now()
        new_inp = [
            tuple([item for t in r for item in t])
            for r in list(zip(*[iter(self.inp)]*size))
        ]
        new_stmt = """insert all
        {}
        select * from dual"""
        dur_prepare = dt.datetime.now() - start_prepare
        #Execute
        start_exec = dt.datetime.now()
        self._cur.executemany(
            new_stmt.format(self.build_insert_all(size)),
            new_inp
        )
        dur_exec = dt.datetime.now() - start_exec
        ##In case the size is not a divisor
        remainder = len(self.inp) % size
        if remainder > 0:
            start_prepare = dt.datetime.now()
            new_inp = tuple([
                item for t in self.inp[-remainder:] for item in t
            ])
            dur_prepare += dt.datetime.now() - start_prepare
            start_exec = dt.datetime.now()
            self._cur.execute(
                new_stmt.format(self.build_insert_all(remainder)),
                new_inp
            )
            dur_exec += dt.datetime.now() - start_exec
        self._con.commit()
        return (dur_prepare, dur_exec)

    #Serialize at client side and do deserialization at DB side
    def json_table(self):
        start_prepare = dt.datetime.now()
        new_inp = json.dumps([
            {"id": t[0], "str": t[1], "val": t[2]} for t in self.inp
        ])
        lob_var = self._con.createlob(db.DB_TYPE_CLOB)
        lob_var.write(new_inp)
        start_exec = dt.datetime.now()
        self._cur.execute("""
            insert into rand(id, str, val)
            select id, str, val
            from json_table(
                to_clob(:json), '$[*]'
                columns
                    id int,
                    str varchar2(100),
                    val number
            )
        """, json=lob_var)
        dur_exec = dt.datetime.now() - start_exec
        self._con.commit()
        return (start_exec - start_prepare, dur_exec)

    #PL/SQL with FORALL
    def forall(self):
        start_prepare = dt.datetime.now()
        collection_type = self._con.gettype("PKG_TEST.TT_TEST")
        record_type = self._con.gettype("PKG_TEST.TS_TEST")

        def recBuilder(x):
            rec = record_type.newobject()
            rec.ID = x[0]
            rec.STR = x[1]
            rec.VAL = x[2]
            return rec

        inp_collection = collection_type.newobject([
            recBuilder(i) for i in self.inp
        ])
        start_exec = dt.datetime.now()
        self._cur.callproc("pkg_test.write_data", [inp_collection])
        dur_exec = dt.datetime.now() - start_exec
        return (start_exec - start_prepare, dur_exec)

    #PL/SQL with FORALL and plain collections
    def forall_columnar(self):
        start_prepare = dt.datetime.now()
        ids, strs, vals = map(list, zip(*self.inp))
        start_exec = dt.datetime.now()
        self._cur.callproc("pkg_test.write_data_columnar", [ids, strs, vals])
        dur_exec = dt.datetime.now() - start_exec
        return (start_exec - start_prepare, dur_exec)

    #Run test
    def run(self, method, iterations, *args):
        #Cleanup schema
        self.setup()
        start = dt.datetime.now()
        runtime = []
        for i in range(iterations):
            single_run = getattr(self, method)(*args)
            runtime.append(single_run)
        dur = dt.datetime.now() - start
        dur_prep_total = sum([i.total_seconds() for i, _ in runtime])
        dur_exec_total = sum([i.total_seconds() for _, i in runtime])
        print("""Method: {meth}.
Duration, avg: {run_dur} s
Preparation time, avg: {prep} s
Execution time, avg: {ex} s""".format(
            meth=method,
            run_dur=dur.total_seconds() / iterations,
            prep=dur_prep_total / iterations,
            ex=dur_exec_total / iterations
        ))
I have this code that was rather done in a hurry, but in general it works. The only problem is that it runs forever. The idea is to update two columns on a table holding 1,495,748 rows, hence the size of the list of timestamps queried in the first place. For each update value, a comparison has to be made in which the timestamp has to fall into an hourly interval formed by two timestamps coming from the API in two different dicts. Is there a way to speed things up a little, or maybe multiprocess it?
Hint: db_mac is a db_connection to a Postgres database.
The response looks like this:
{'meta': {'source': 'National Oceanic and Atmospheric Administration, Deutscher Wetterdienst'}, 'data': [{'time': '2019-11-26 23:00:00', 'time_local': '2019-11-27 00:00', 'temperature': 8.3, 'dewpoint': 5.9, 'humidity': 85, 'precipitation': 0, 'precipitation_3': None, 'precipitation_6': None, 'snowdepth': None, 'windspeed': 11, 'peakgust': 21, 'winddirection': 160, 'pressure': 1004.2, 'condition': 4}, {'time': '2019-11-27 00:00:00', ....
import requests
import db_mac
from collections import defaultdict
import datetime
import time

t = time.time()

station = [10382, "DE", "Berlin / Tegel", 52.5667, 13.3167, 37, "EDDT", 10382, "TXL", "Europe/Berlin"]
dates = [("2019-11-20", "2019-11-22"), ("2019-11-27", "2019-12-02")]

insert_dict = defaultdict(tuple)
hist_weather_list = []

for d in dates:
    end = d[1]
    start = d[0]
    print(start, end)
    url = "https://api.meteostat.net/v1/history/hourly?station={station}&start={start}&end={end}&time_zone={timezone}&&time_format=Y-m-d%20H:i&key=<APIKEY>".format(station=station[0], start=start, end=end, timezone=station[-1])
    response = requests.get(url)
    weather = response.json()
    print(weather)
    for i in weather["data"]:
        hist_weather_list.append(i)

sql = "select timestamp from dump order by timestamp asc"
result = db_mac.execute(sql)

hours, rem = divmod(time.time() - t, 3600)
minutes, seconds = divmod(rem, 60)
print("step1 {:0>2}:{:0>2}:{:05.2f}".format(int(hours), int(minutes), seconds))

for row in result:
    try:
        ts_dump = datetime.datetime.timestamp(row[0])
        for i, hour in enumerate(hist_weather_list):
            ts1 = datetime.datetime.timestamp(datetime.datetime.strptime(hour["time"], '%Y-%m-%d %H:%M:%S'))
            ts2 = datetime.datetime.timestamp(datetime.datetime.strptime(hist_weather_list[i + 1]["time"], '%Y-%m-%d %H:%M:%S'))
            if ts1 <= ts_dump and ts_dump < ts2:
                insert_dict[row[0]] = (hour["temperature"], hour["pressure"])
    except Exception as e:
        pass

hours, rem = divmod(time.time() - t, 3600)
minutes, seconds = divmod(rem, 60)
print("step2 {:0>2}:{:0>2}:{:05.2f}".format(int(hours), int(minutes), seconds))

for key, value in insert_dict.items():
    sql2 = """UPDATE dump SET temperature = """ + str(value[0]) + """, pressure = """ + str(value[1]) + """ WHERE timestamp = '""" + str(key) + """';"""
    db_mac.execute(sql2)

hours, rem = divmod(time.time() - t, 3600)
minutes, seconds = divmod(rem, 60)
print("step3 {:0>2}:{:0>2}:{:05.2f}".format(int(hours), int(minutes), seconds))
UPDATE: the code with multiprocessing. I'll let it run overnight and give an update on the running time.
import requests
import db_mac
from collections import defaultdict
import datetime
import time
import multiprocessing as mp

t = time.time()

station = [10382, "DE", "Berlin / Tegel", 52.5667, 13.3167, 37, "EDDT", 10382, "TXL", "Europe/Berlin"]
dates = [("2019-11-20", "2019-11-22"), ("2019-11-27", "2019-12-02")]

insert_dict = defaultdict(tuple)
hist_weather_list = []

for d in dates:
    end = d[1]
    start = d[0]
    print(start, end)
    url = "https://api.meteostat.net/v1/history/hourly?station={station}&start={start}&end={end}&time_zone={timezone}&&time_format=Y-m-d%20H:i&key=<APIKEY>".format(station=station[0], start=start, end=end, timezone=station[-1])
    response = requests.get(url)
    weather = response.json()
    print(weather)
    for i in weather["data"]:
        hist_weather_list.append(i)

sql = "select timestamp from dump order by timestamp asc"
result = db_mac.execute(sql)

hours, rem = divmod(time.time() - t, 3600)
minutes, seconds = divmod(rem, 60)
print("step1 {:0>2}:{:0>2}:{:05.2f}".format(int(hours), int(minutes), seconds))

def find_parameters(x):
    for row in result[x[0]:x[1]]:
        try:
            ts_dump = datetime.datetime.timestamp(row[0])
            for i, hour in enumerate(hist_weather_list):
                ts1 = datetime.datetime.timestamp(datetime.datetime.strptime(hour["time"], '%Y-%m-%d %H:%M:%S'))
                ts2 = datetime.datetime.timestamp(datetime.datetime.strptime(hist_weather_list[i + 1]["time"], '%Y-%m-%d %H:%M:%S'))
                if ts1 <= ts_dump and ts_dump < ts2:
                    insert_dict[row[0]] = (hour["temperature"], hour["pressure"])
        except Exception as e:
            pass

step1 = int(len(result) / 4)
step2 = 2 * step1
step3 = 3 * step1
step4 = len(result)
steps = [[0, step1], [step1, step2], [step2, step3], [step3, step4]]

pool = mp.Pool(mp.cpu_count())
pool.map(find_parameters, steps)

hours, rem = divmod(time.time() - t, 3600)
minutes, seconds = divmod(rem, 60)
print("step2 {:0>2}:{:0>2}:{:05.2f}".format(int(hours), int(minutes), seconds))

for key, value in insert_dict.items():
    sql2 = """UPDATE dump SET temperature = """ + str(value[0]) + """, pressure = """ + str(value[1]) + """ WHERE timestamp = '""" + str(key) + """';"""
    db_mac.execute(sql2)

hours, rem = divmod(time.time() - t, 3600)
minutes, seconds = divmod(rem, 60)
print("step3 {:0>2}:{:0>2}:{:05.2f}".format(int(hours), int(minutes), seconds))
UPDATE 2
It finished and ran for 2:45 hours on 4 cores on a Raspberry Pi. Still, is there a more efficient way to do such things?
There are a few minor things I can think of to speed this up a little, and I figure every little bit helps, especially if you have a lot of rows to process. For starters, print statements can slow down your code a lot; I'd get rid of them if they are unneeded.
Most importantly, you are calling the API in every iteration of the loop. Waiting for a response from the API is probably taking up the bulk of your time. I looked a bit at the API you are using, but I don't know your exact use case or what your "start" and "end" dates look like; if you could do it in fewer calls, that would surely speed up this loop a lot. It also looks like the API has a .csv version of the data you can download and use, and running this on local data would be way faster. If you choose to go that route, I'd suggest using pandas. (Sorry if you already know pandas and I'm over-explaining.) You can use df = pd.read_csv("filename.csv") and edit the table from there easily. You can also do df.to_sql(params) to write to your database. Let me know if you want help forming a pandas version of this code.
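If you do go the pandas route, the interval matching itself can be vectorized too. Here is a sketch with made-up miniature data standing in for your dump timestamps and hourly weather rows; pandas.merge_asof assigns each timestamp the most recent hourly row at or before it, which replaces the nested loops with one sorted pass:

```python
import pandas as pd

# Hypothetical stand-ins: in practice, load dump timestamps from Postgres
# and the weather rows from the API / CSV download.
dump = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2019-11-27 00:15:00", "2019-11-27 00:45:00", "2019-11-27 01:30:00",
    ])
})
weather = pd.DataFrame({
    "time": pd.to_datetime(["2019-11-27 00:00:00", "2019-11-27 01:00:00"]),
    "temperature": [8.3, 8.0],
    "pressure": [1004.2, 1004.0],
})

# For each dump timestamp, take the latest hourly row with time <= timestamp.
# This is the same "ts1 <= ts_dump < ts2" interval check, done vectorized.
matched = pd.merge_asof(
    dump.sort_values("timestamp"),
    weather.sort_values("time"),
    left_on="timestamp", right_on="time", direction="backward",
)
print(matched[["timestamp", "temperature", "pressure"]])
```

Both inputs must be sorted on the join keys, which your "order by timestamp asc" query already guarantees on one side.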
Also, I'm not sure from your code whether this would cause an error, but instead of your for loop (for i in weather["data"]) I would try:
hist_weather_list += weather["data"]
or possibly
hist_weather_list += [weather["data"]]
Let me know how it goes!
My aim is to increase the throughput of versioning data in Cassandra. I have used concurrent reads and writes and have also increased the chunk size that my code reads from the file. My machine has 16 GB of RAM and 8 cores, and yes, I have changed Cassandra's yaml file for 10k concurrent reads and writes; when I timed it, I found that 10,000 writes/reads take less than a second.
My minimal, viable code is:
import json
import logging
import os
import sys
from datetime import datetime
from hashlib import sha256, sha512, sha1
import pandas as pd
from cassandra import ConsistencyLevel, WriteTimeout
from cassandra.cluster import (EXEC_PROFILE_DEFAULT, BatchStatement, Cluster,
ExecutionProfile)
from cassandra.concurrent import (execute_concurrent,
execute_concurrent_with_args)
from cassandra.query import SimpleStatement, dict_factory
class PythonCassandraExample:
def __init__(self, file_to_be_versioned, working_dir=os.getcwd(), mode='append'):
self.cluster = None
self.session = None
self.keyspace = None
self.log = None
self.mode = mode
self.file_to_be_versioned = file_to_be_versioned
self.insert_patch = []
self.delete_patch = []
self.update_patch = []
self.working_dir = working_dir
def __del__(self):
self.cluster.shutdown()
def createsession(self):
profile = ExecutionProfile(
row_factory=dict_factory,
request_timeout=6000
)
self.cluster = Cluster(
['localhost'],
connect_timeout=50,
execution_profiles={
EXEC_PROFILE_DEFAULT: profile
}
)
self.session = self.cluster.connect(self.keyspace)
def getsession(self):
return self.session
# How about Adding some log info to see what went wrong
def setlogger(self):
log = logging.getLogger()
log.setLevel('INFO')
handler = logging.StreamHandler()
handler.setFormatter(logging.Formatter(
"%(asctime)s [%(levelname)s] %(name)s: %(message)s"))
log.addHandler(handler)
self.log = log
# Create Keyspace based on Given Name
def handle_error(self, exception):
self.log.error("Failed to fetch user info: %s", exception)
def createkeyspace(self, keyspace):
"""
:param keyspace: The Name of Keyspace to be created
:return:
"""
# Before we create new lets check if exiting keyspace; we will drop that and create new
self.log.info("creating keyspace...")
self.session.execute("""
CREATE KEYSPACE IF NOT EXISTS %s
WITH replication = { 'class': 'SimpleStrategy', 'replication_factor': '1' }
""" % keyspace)
self.log.info("setting keyspace...")
self.keyspace = keyspace
self.session.set_keyspace(self.keyspace)
def create_table_and_set_version(self, table_name):
self.table_name = table_name.lower()
table_select_query = "SELECT table_name FROM system_schema.tables WHERE keyspace_name='{}' AND table_name='{}'".format(
self.keyspace, self.table_name)
print(table_select_query)
table_exists = self.session.execute(table_select_query).one()
self.log.info("Table exists: {}".format(table_exists))
if table_exists:
self.log.info(
"Datapackage already exists! Checking the last version")
self.version = self.session.execute(
"SELECT version FROM {} LIMIT 1".format(self.table_name)).one()
self.log.info(
"The version fetched is: {} of type: {}".format(
self.version, type(self.version)
)
)
if not self.version:
self.version = 0
else:
self.version = self.version['version']
else:
self.log.info("Table didn't exist!")
self.version = 0
self.target_version = int(str(self.version)) + 1
self.log.info(
"Current and candidate versions are: {}, {}".format(
self.version,
self.target_version
)
)
# c_sql = "CREATE TABLE IF NOT EXISTS {} (id varchar, version int, row varchar, row_hash varchar, PRIMARY KEY(id, version)) with clustering order by (version desc)".format(
# self.table_name)
c_sql = "CREATE TABLE IF NOT EXISTS {} (id varchar, version int, row_hash varchar, PRIMARY KEY(version, id))".format(
self.table_name
)
self.session.execute(c_sql)
self.log.info("DP Table Created !!!")
self.log.info("Current and candidate versions are: {}, {}".format(
self.version, self.target_version))
def push_to_update_patch(self, update_patch_file, last_patch=False):
if len(self.update_patch) >= 10000:
with open(update_patch_file, mode='a') as json_file:
json_file.writelines(
self.update_patch
)
del self.update_patch[:]
if last_patch is True and len(self.update_patch) > 0:
with open(update_patch_file, mode='a') as json_file:
json_file.writelines(
self.update_patch
)
del self.update_patch[:]
def push_to_insert_patch(self, insert_patch_file, last_patch=False):
if len(self.insert_patch) >= 10000:
with open(insert_patch_file, mode='a') as json_file:
json_file.writelines(
self.insert_patch
)
del self.insert_patch[:]
if last_patch is True and len(self.insert_patch) > 0:
with open(insert_patch_file, mode='a') as json_file:
json_file.writelines(
self.insert_patch
)
del self.insert_patch[:]
def push_to_delete_patch(self, delete_patch_file, last_patch=False):
if len(self.delete_patch) >= 10000:
with open(delete_patch_file, mode='a') as json_file:
json_file.writelines(
self.delete_patch
)
del self.delete_patch[:]
if last_patch is True and len(self.delete_patch) > 0:
with open(delete_patch_file, mode='a') as json_file:
json_file.writelines(
self.delete_patch
)
del self.delete_patch[:]
def push_to_patch(self, key, value, mode='update'):
return
if key is None or value is None:
raise ValueError(
"Key or value or both not specified for making a patch. Exiting now."
)
data = {}
data["id"] = str(key)
data["data"] = json.dumps(value, default=str)
# convert dict to json str so that the patch is a list of line jsons.
data = json.dumps(data, default=str)
json_patch_file = os.path.join(
self.working_dir,
"version_{}_{}.json".format(
self.target_version, mode
)
)
if mode == 'update':
self.update_patch.append(
data + "\n"
)
self.push_to_update_patch(
json_patch_file
)
if mode == 'insert':
self.insert_patch.append(
data + "\n"
)
self.push_to_insert_patch(
json_patch_file
)
if mode == 'delete':
self.delete_patch.append(
data + "\n"
)
self.push_to_delete_patch(
json_patch_file
)
def clone_version(self):
if self.mode == 'replace':
return
self.log.info("Cloning version")
start_time = datetime.utcnow()
if self.version == 0:
return
insert_sql = self.session.prepare(
(
"INSERT INTO {} ({}, {}, {}) VALUES (?,?,?)"
).format(
self.table_name, "id", "version", "row_hash"
)
)
futures = []
current_version_query = "SELECT id, row_hash FROM {} WHERE version={}".format(
self.table_name, self.version
)
current_version_rows = self.session.execute(
current_version_query
)
for current_version_row in current_version_rows:
params = (
current_version_row['id'],
self.target_version,
current_version_row['row_hash']
)
futures.append(
(
insert_sql,
params
)
)
self.log.info(
"Time taken to clone the version is: {}".format(
datetime.utcnow() - start_time
)
)
def hash_string(self, value):
return (sha1(str(value).encode('utf-8')).hexdigest())
def hash_row(self, row):
row_json = json.dumps(row, default=str)
return (self.hash_string(row_json))
def insert_data(self, generate_diff=False):
self.generate_diff = generate_diff
destination = self.file_to_be_versioned
chunksize = 100000
concurrency_value = 1000
patch_length_for_cql = chunksize
chunks = pd.read_csv(destination, chunksize=chunksize)
chunk_counter = 0
insert_sql = self.session.prepare(
(
"INSERT INTO {} ({}, {}, {}) VALUES (?,?,?)"
).format(
self.table_name, "id", "version", "row_hash"
)
)
select_sql = self.session.prepare(
(
"SELECT id, version, row_hash FROM {} WHERE version=? AND id=?"
).format(
self.table_name
)
)
futures = []
check_for_patch = [] #this list comprises rows with ids and values for checking whether its an update/insert
rows_for_checking_patch = []
start_time = datetime.utcnow()
for df in chunks:
rows_for_checking_patch = df.values.tolist()
chunk_counter += 1
df["row_hash"] = df.apply(
self.hash_row
)
df["key"] = df["column_test_3"].apply(
self.hash_string
)
keys = list(df["key"])
row_hashes = list(df["row_hash"])
start_time_de_params = datetime.utcnow()
for i in range(chunksize):
row_check = None
params = (
str(keys[i]),
self.target_version,
str(row_hashes[i])
)
check_for_patch_params = (
self.version,
str(keys[i])
)
check_for_patch.append(
(
select_sql,
check_for_patch_params
)
)
futures.append(
(
insert_sql,
params
)
)
self.log.info("Time for params: {}".format(datetime.utcnow() - start_time_de_params))
if len(check_for_patch) >= patch_length_for_cql:
start_time_de_update = datetime.utcnow()
results = execute_concurrent(
self.session, check_for_patch, concurrency=concurrency_value, raise_on_first_error=False
)
self.log.info("Time for just the query: {}".format(datetime.utcnow() - start_time_de_update))
row_counter_for_patch = 0
for (success, result) in results:
if not result:
self.push_to_patch(
keys[row_counter_for_patch],
rows_for_checking_patch[row_counter_for_patch],
mode='insert'
)
row_counter_for_patch += 1
continue
if not success:
# result will be an Exception
self.log.error("Error has occurred in insert cql")
self.handle_error(result)
id_to_be_compared = result[0]["id"]
row_hash_to_be_compared = result[0]["row_hash"]
if (row_hash_to_be_compared != row_hashes[row_counter_for_patch]):
self.push_to_patch(
id_to_be_compared,
rows_for_checking_patch[row_counter_for_patch]["row"],
mode='update'
)
row_counter_for_patch += 1
del check_for_patch[:]
del rows_for_checking_patch[:]
row_counter_for_patch = 0
self.log.info("Time for check patch: {}".format(
datetime.utcnow() - start_time_de_update
))
if (len(futures) >= patch_length_for_cql):
start_time_de_insert = datetime.utcnow()
results = execute_concurrent(
self.session, futures, concurrency=concurrency_value, raise_on_first_error=False
)
for (success, result) in results:
if not success:
# result will be an Exception
self.log.error("Error has occurred in insert cql")
self.handle_error(result)
del futures[:]
self.log.info("Time for insert patch: {}".format(
datetime.utcnow() - start_time_de_insert
))
self.log.info(chunk_counter)
# self.log.info("This chunk got over in {}".format(datetime.utcnow() - start_time))
if len(check_for_patch) > 0:
results = execute_concurrent(
self.session, check_for_patch, concurrency=concurrency_value, raise_on_first_error=False
)
row_counter_for_patch = 0
for (success, result) in results:
if not result:
self.push_to_patch(
rows_for_checking_patch[row_counter_for_patch]["id"],
rows_for_checking_patch[row_counter_for_patch]["row"],
mode='insert'
)
row_counter_for_patch += 1
continue
if not success:
# result will be an Exception
self.log.error("Error has occurred in insert cql")
self.handle_error(result)
id_to_be_compared = result[0]["id"]
row_hash_to_be_compared = result[0]["row_hash"]
if (row_hash_to_be_compared != rows_for_checking_patch[row_counter_for_patch]["row_hash"]):
self.push_to_patch(
id_to_be_compared,
rows_for_checking_patch[row_counter_for_patch]["row"],
mode='update'
)
row_counter_for_patch += 1
del check_for_patch[:]
del rows_for_checking_patch[:]
if len(futures) > 0: # in case the last dataframe has #rows < 10k.
results = execute_concurrent(
self.session, futures, concurrency=concurrency_value, raise_on_first_error=False
)
for (success, result) in results:
if not success:
self.handle_error(result)
del futures[:]
self.log.info(chunk_counter)
# Check the delete patch
if self.generate_diff is True and self.mode == 'replace' and self.version != 0:
self.log.info("We got to find the delete patch!")
start_time = datetime.utcnow()
current_version_query = "SELECT id, row, row_hash FROM {} WHERE version={}".format(
self.table_name, self.version
)
current_version_rows = self.session.execute(
current_version_query
)
for current_version_row in current_version_rows:
row_check_query = "SELECT {} FROM {} WHERE {}={} AND {}='{}' ".format(
"id", self.table_name, "version", self.target_version, "id", current_version_row.id
)
row_check = self.session.execute(row_check_query).one()
if row_check is not None: # row exists in both version.
continue
self.push_to_patch(
current_version_row.id,
current_version_row.id,
mode="delete"
)
print("Complete insert's duration is: {}".format(
datetime.utcnow() - start_time)
)
# Calling last_patch for all remaining diffs
modes = [
'update',
'insert',
'delete'
]
for mode in modes:
json_patch_file = os.path.join(
self.working_dir,
"version_{}_{}.json".format(
self.target_version, mode
)
)
if mode == 'update':
self.push_to_update_patch(
json_patch_file,
last_patch=True
)
if mode == 'insert':
self.push_to_insert_patch(
json_patch_file,
last_patch=True
)
if mode == 'delete':
self.push_to_delete_patch(
json_patch_file,
last_patch=True
)
if __name__ == '__main__':
example1 = PythonCassandraExample(
file_to_be_versioned="hundred_million_eleven_columns.csv"
)
example1.createsession()
example1.setlogger()
example1.createkeyspace('sat_athena_one')
example1.create_table_and_set_version('five_hundred_rows')
example1.clone_version()
example1.insert_data(generate_diff=True)
I have a CSV file of 100M rows and 11 columns. The script used to generate such a file is:
import csv
import sys
import os
import pandas as pd

file_name = "hundred_million_eleven_columns.csv"
rows_list = []
chunk_counter = 1
headers = [
    "column_test_1",
    "column_test_2",
    "column_test_3",
    "column_test_4",
    "column_test_5",
    "column_test_6",
    "column_test_7",
    "column_test_8",
    "column_test_9",
    "column_test_10",
    "column_test_11",
]
file_exists = os.path.isfile(file_name)

with open(file_name, 'a') as csvfile:
    writer = csv.DictWriter(csvfile, delimiter=',',
                            lineterminator='\n', fieldnames=headers)
    if not file_exists:
        writer.writeheader()  # file doesn't exist yet, write a header

for i in range(100000000):
    dict1 = [
        i, i+1, i+2, i+3, i+4, i+5, i+6, i+7, i+8, i+9, i+10
    ]
    # get input row in dictionary format
    # key = col_name
    rows_list.append(dict1)
    if len(rows_list) == 100000:
        df = pd.DataFrame(rows_list)
        df.to_csv(file_name,
                  mode='a', index=False, header=False)
        del rows_list[:]
        del df
        print(chunk_counter)
        chunk_counter += 1

if len(rows_list) > 0:
    df = pd.DataFrame(rows_list)
    df.to_csv(file_name, mode='a', index=False, header=False)
    del rows_list[:]
    del df
    print(chunk_counter)
    chunk_counter += 1
My Cassandra yaml file is here.
Make sure that your code can even generate that much at 50k. If you remove the executes, can you even read the CSV and generate the SHA that fast? A C* instance on a host that size with SSDs should be able to do 50k writes/sec, but there's a lot going on outside of the C* writes that is likely part of the issue.
If your concurrent reads/writes are above 128, you are going to have some serious issues. On a system that can handle it, even 64 is enough to go past 200k writes a sec. You are actually going to make things much worse with that high a setting. There is no IO involved there, so as the documentation states, 8x your core count is a good value. I would also recommend lowering the concurrency you're pushing from 10k to 1024 or even lower. You can play with different settings to see how they impact things.
Make sure the driver's Cython extensions were compiled on your install, as the run is otherwise going to be dominated by serialization. Speaking of which, the Python driver is the slowest one, so keep that in mind.
Blocking on the SHA can account for a majority of the time. Without perf tracing, just try it with a fixed value to see the difference.
"My machine" -- is this a single node cluster? If your throwing availability out the window might as well disable durable_writes on the keyspace to speed up the writes a bit. Missing heap settings but make sure you have a minimum of 8gb, even if this is a pretty small host cassandra needs memory. If not reading consider disabling the keycache and maybe disabling compactions while the job is running and then enable afterwards.
The comment in cassandra.yaml recommends 8 * number of cores:
On the other hand, since writes are almost never IO bound, the ideal
number of "concurrent_writes" is dependent on the number of cores in
your system; (8 * number_of_cores) is a good rule of thumb.
So 64 is appropriate for an 8-core machine:
concurrent_reads: 64
concurrent_writes: 64
concurrent_counter_writes: 64
These limits are recommended because there are many other IO operations besides the normal reads and writes, e.g. writing the commit log, caching, compaction, replication, and materialized views (if they exist).
Some rules of thumb:
disk_optimization_strategy: ssd (if your disk is an HDD, change the value to spinning)
Use a dedicated commit log disk; SSD recommended.
More disks = better performance.
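Put together, a minimal cassandra.yaml fragment for an 8-core SSD host, following the rules of thumb above (tune to your own hardware):

```yaml
# 8 * number_of_cores on an 8-core machine
concurrent_reads: 64
concurrent_writes: 64
concurrent_counter_writes: 64

# change to "spinning" on an HDD
disk_optimization_strategy: ssd
```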
I'm using PyQt for a GUI software. I also use a sqlite database to feed the software with data.
Somewhere in my code, I have this method:
def loadNotifications(self):
    """Method to find the number of unread articles,
    for each search. Load a list of id, for the unread articles,
    in each table. And a list of id, for the concerned articles, for
    each table"""
    count_query = QtSql.QSqlQuery(self.bdd)
    count_query.setForwardOnly(True)
    # Don't treat the articles if it's the main tab, it's
    # useless because the article will be concerned for sure
    for table in self.list_tables_in_tabs[1:]:
        # Empty these lists, because when loadNotifications is called
        # several times during the use, the nbr of unread articles is
        # added to the nbr of notifications
        table.list_new_ids = []
        table.list_id_articles = []
        # Try to speed things up
        append_new = table.list_new_ids.append
        append_articles = table.list_id_articles.append
        req_str = self.refineBaseQuery(table.base_query, table.topic_entries, table.author_entries)
        print(req_str)
        count_query.exec_(req_str)
        start_time = datetime.datetime.now()
        i = 0
        while count_query.next():
            i += 1
            record = count_query.record()
            append_articles(record.value('id'))
            if record.value('new') == 1:
                append_new(record.value('id'))
        print(datetime.datetime.now() - start_time)
        print("Nbr of entries processed: {}".format(i))
Let's assume this loop has ~400 entries to process. It takes about one second, and I think that's too long. I tried to optimize the process as much as I could, but it still takes too much time.
Here is what the previous method typically prints:
SELECT * FROM papers WHERE id IN(320, 1320, 5648, 17589, 20092, 20990, 49439, 58378, 65251, 68772, 73509, 86859, 90594)
0:00:00.001403
Nbr of entries processed: 13
SELECT * FROM papers WHERE topic_simple LIKE '% 3D print%'
0:00:00.591745
Nbr of entries processed: 81
SELECT * FROM papers WHERE id IN (5648, 11903, 14258, 30587, 40339, 55691, 57383, 58378, 62951, 65251, 68772, 87295)
0:00:00.000478
Nbr of entries processed: 12
SELECT * FROM papers WHERE topic_simple LIKE '% Python %'
0:00:00.596490
Nbr of entries processed: 9
SELECT * FROM papers WHERE topic_simple LIKE '% Raspberry pi %' OR topic_simple LIKE '% arduino %'
0:00:00.988276
Nbr of entries processed: 5
SELECT * FROM papers WHERE topic_simple LIKE '% sensor array%' OR topic_simple LIKE '% biosensor %'
0:00:00.996164
Nbr of entries processed: 433
SELECT * FROM papers WHERE id IN (320, 540, 1320, 1434, 1860, 4527, 5989, 6022, 6725, 6978, 7268, 8625, 9410, 9814, 9850, 10608, 13219, 15572, 15794, 19345, 19674, 19899, 20990, 22530, 26443, 26535, 28721, 29089, 30923, 31145, 31458, 31598, 32069, 34129, 35820, 36142, 36435, 37546, 39188, 39952, 40949, 41764, 43529, 43610, 44184, 45206, 49210, 49807, 50279, 50943, 51536, 51549, 52921, 52967, 54610, 56036, 58087, 60490, 62133, 63051, 63480, 63535, 64861, 66906, 68107, 68328, 69021, 71797, 73058, 74974, 75331, 77697, 78138, 80152, 80539, 82172, 82370, 82840, 86859, 87467, 91528, 92167)
0:00:00.002891
Nbr of entries processed: 82
SELECT * FROM papers WHERE id IN (7043, 41643, 44688, 50447, 64723, 72601, 81006, 82380, 84285)
0:00:00.000348
Nbr of entries processed: 9
Is this the best way? Can I get better results?
NOTE: the time displayed is the time needed to run the loop, not the time needed to run the query.
I tried count_query.setForwardOnly(True), as mentioned in the docs, but it had no effect on performance.
EDIT:
Here is a test database with ~600 entries:
database
Obviously I can't test this, so I don't know if it will make a significant difference, but you could try using index-based look-ups:
id_index = count_query.record().indexOf('id')
new_index = count_query.record().indexOf('new')

while count_query.next():
    record = count_query.record()
    id_value = record.value(id_index)
    append_articles(id_value)
    if record.value(new_index) == 1:
        append_new(id_value)
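The same idea can be sketched outside of Qt with the stdlib sqlite3 module: resolve the column positions once before the loop, then access each row by index instead of doing a name lookup per value (the table and data here are made up for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE papers (id INTEGER, new INTEGER)")
conn.executemany("INSERT INTO papers VALUES (?, ?)",
                 [(i, i % 2) for i in range(400)])

list_id_articles = []
list_new_ids = []

cur = conn.execute("SELECT id, new FROM papers ORDER BY id")
# Resolve column positions once, before the loop
names = [d[0] for d in cur.description]
id_index = names.index("id")
new_index = names.index("new")

for row in cur:                      # plain tuples: index access only
    id_value = row[id_index]
    list_id_articles.append(id_value)
    if row[new_index] == 1:
        list_new_ids.append(id_value)
```

Doing the name-to-index resolution 400 times inside the loop is exactly the overhead that `record.value('id')` pays on every row.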
UPDATE:
Using your sample db, I cannot reproduce the issue you are seeing, and I also found my method above is about twice as fast as your original one. Here's some sample output:
IDs: 660, Articles: 666
IDs: 660, Articles: 666
IDs: 660, Articles: 666
test(index=False): 0.19050272400090762
IDs: 660, Articles: 666
IDs: 660, Articles: 666
IDs: 660, Articles: 666
test(index=True): 0.09384496400161879
Test case:
import sys, os, timeit
from PyQt4 import QtCore, QtGui
from PyQt4.QtSql import QSqlDatabase, QSqlQuery

def test(index=False):
    count_query = QSqlQuery('select * from results')
    list_new_ids = []
    list_id_articles = []
    append_new = list_new_ids.append
    append_articles = list_id_articles.append
    if index:
        id_index = count_query.record().indexOf('id')
        new_index = count_query.record().indexOf('new')
        while count_query.next():
            record = count_query.record()
            id_value = record.value(id_index)
            append_articles(id_value)
            if record.value(new_index) == 1:
                append_new(id_value)
    else:
        while count_query.next():
            record = count_query.record()
            append_articles(record.value('id'))
            if record.value('new') == 1:
                append_new(record.value('id'))
    print('IDs: %d, Articles: %d' % (
        len(list_new_ids), len(list_id_articles)))

class Window(QtGui.QWidget):
    def __init__(self):
        super(Window, self).__init__()
        self.button = QtGui.QPushButton('Test', self)
        self.button.clicked.connect(self.handleButton)
        layout = QtGui.QVBoxLayout(self)
        layout.addWidget(self.button)
        self.database = QSqlDatabase.addDatabase("QSQLITE")
        path = os.path.join(os.path.dirname(__file__), 'tmp/perf-test.db')
        self.database.setDatabaseName(path)
        self.database.open()

    def handleButton(self):
        for stmt in 'test(index=False)', 'test(index=True)':
            print('%s: %s' % (stmt, timeit.timeit(
                stmt, 'from __main__ import test', number=3)))

if __name__ == '__main__':
    app = QtGui.QApplication(sys.argv)
    window = Window()
    window.setGeometry(600, 300, 200, 100)
    window.show()
    sys.exit(app.exec_())
I have a query with dynamic conditions, i.e.
select(lambda obj: obj.A == 'a' and obj.B == 'b' and ...)
So I wrote this code:
def search(self, **kwargs):
    q = unicode('lambda obj:', 'utf-8')
    for field, value in kwargs.iteritems():
        value = unicode(value, 'utf-8')
        field = unicode(field, 'utf-8')
        q += u" obj.%s == '%s' and" % (field, value)
    q = q[0:q.rfind('and')]
    res = select(q.encode('utf-8'))[:]
But I get this error when executing the function:
tasks.search(title='Задача 1',url='test.com')
res = select(q.encode('utf-8'))[:]
File "<string>", line 2, in select
File ".../local/lib/python2.7/site-packages/pony/utils.py", line 96, in cut_traceback
return func(*args, **kwargs)
File ".../local/lib/python2.7/site-packages/pony/orm/core.py", line 3844, in select
if not isinstance(tree, ast.GenExpr): throw(TypeError)
File "...local/lib/python2.7/site-packages/pony/utils.py", line 123, in throw
raise exc
TypeError
While it is possible to use strings to apply conditions to a query, it can be unsafe because of the risk of SQL injection. The better way to apply conditions to a query is the filter() method. You can take the latest version of Pony ORM from the https://github.com/ponyorm/pony repository and try the couple of examples provided below.
First we define entities and create a couple of objects:
from decimal import Decimal
from pony.orm import *

db = Database('sqlite', ':memory:')

class Product(db.Entity):
    name = Required(unicode)
    description = Required(unicode)
    price = Required(Decimal)
    quantity = Required(int, default=0)

db.generate_mapping(create_tables=True)

with db_session:
    Product(name='iPad', description='Air, 16GB', price=Decimal('478.99'), quantity=10)
    Product(name='iPad', description='Mini, 16GB', price=Decimal('284.95'), quantity=15)
    Product(name='iPad', description='16GB', price=Decimal('299.00'), quantity=10)
Now we'll apply filters passing them as keyword arguments:
def find_by_kwargs(**kwargs):
    q = select(p for p in Product)
    q = q.filter(**kwargs)
    return list(q)

with db_session:
    products = find_by_kwargs(name='iPad', quantity=10)
    for p in products:
        print p.name, p.description, p.price, p.quantity
Another option is to use lambdas in order to specify the conditions:
def find_by_params(name=None, min_price=None, max_price=None):
    q = select(p for p in Product)
    if name is not None:
        q = q.filter(lambda p: p.name.startswith(name))
    if min_price is not None:
        q = q.filter(lambda p: p.price >= min_price)
    if max_price is not None:
        q = q.filter(lambda p: p.price <= max_price)
    return list(q)

with db_session:
    products = find_by_params(name='iPad', max_price=400)
    for p in products:
        print p.name, p.description, p.price, p.quantity
As you can see, filters can be applied dynamically. You can find more information about using filters by following this link: http://doc.ponyorm.com/queries.html#Query.filter
If you still want to filter using strings, you have to apply a new filter for each key/value pair.
Something like this:
def search(self, **kwargs):
    q = select(m for m in Product)
    for field, value in kwargs.iteritems():
        value = unicode(value, 'utf-8')
        field = unicode(field, 'utf-8')
        flt = u"m.{0} == '{1}'".format(field, value)
        q = q.filter(flt)
    # return q  # return the Query, which can be further modified (e.g. paging, ordering)
    return q[:]  # or return the found products
HTH, Tom