Read optimisation cassandra using python


I have a table with the following model:
CREATE TABLE IF NOT EXISTS {} (
    user_id bigint,
    pseudo text,
    importance float,
    is_friend_following bigint,
    is_friend boolean,
    is_following boolean,
    PRIMARY KEY ((user_id), is_friend_following)
);
I also have a table containing my seeds. Those (20) users are the starting point of my graph. So I select their ID and search in the table above to get their Followers and friends, and from there I build my graph (networkX).
def build_seed_graph(cls, name):
    obj = cls()
    obj.name = name
    query = "SELECT twitter_id FROM {0};"
    seeds = obj.session.execute(query.format(obj.seed_data_table))
    obj.graph.add_nodes_from(obj.seeds)
    for seed in seeds:
        query = "SELECT friend_follower_id, is_friend, is_follower FROM {0} WHERE user_id={1}"
        statement = SimpleStatement(query.format(obj.network_table, seed), fetch_size=1000)
        friend_ids = []
        follower_ids = []
        for row in obj.session.execute(statement):
            if row.friend_follower_id in obj.seeds:
                if row.is_friend:
                    friend_ids.append(row.friend_follower_id)
                if row.is_follower:
                    follower_ids.append(row.friend_follower_id)
        if friend_ids:
            for friend_id in friend_ids:
                obj.graph.add_edge(seed, friend_id)
        if follower_ids:
            for follower_id in follower_ids:
                obj.graph.add_edge(follower_id, seed)
    return obj
The problem is that building the graph takes too long, and I would like to optimize it.
I have approximately 5 million rows in my table 'network_table'.
I'm wondering whether it would be faster to run a single query over the whole table instead of one query with a WHERE clause per seed. Would it fit in memory? Is that a good idea? Are there better ways?

I suspect the real issue may not be the queries but rather the processing time.
I'm wondering whether it would be faster to run a single query over the whole table instead of one query with a WHERE clause per seed. Would it fit in memory? Is that a good idea? Are there better ways?
There should not be any problem with running a single query over the whole table if you enable paging (https://datastax.github.io/python-driver/query_paging.html, using fetch_size). Cassandra returns up to fetch_size rows at a time and fetches additional pages as you read them from the result set.
Please note that if the table contains many rows that are not related to a seed, a full scan may be slower, because you will also receive rows that do not involve any seed.
Disclaimer: I am part of the team building ScyllaDB, a Cassandra-compatible database.
ScyllaDB recently published a blog post on how to do a full scan efficiently in parallel, http://www.scylladb.com/2017/02/13/efficient-full-table-scans-with-scylla-1-6/, which applies to Cassandra as well. If a full scan is relevant and you can build the graph in parallel, then this may help you.
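As a rough sketch of the parallel full-scan idea: the token ring can be cut into contiguous sub-ranges, and each sub-range queried independently (for example from a pool of workers). This assumes the default Murmur3Partitioner, whose tokens span -2^63 to 2^63 - 1. The helper names token_ranges and range_query are hypothetical, and the column names are the ones used in the question's query.

```python
# Token bounds of Murmur3Partitioner (assumed to be the partitioner in use).
MIN_TOKEN = -(2 ** 63)
MAX_TOKEN = 2 ** 63 - 1


def token_ranges(n):
    """Split the full token ring into n contiguous, inclusive (start, end) pairs."""
    total = MAX_TOKEN - MIN_TOKEN + 1  # 2**64 possible tokens
    step = total // n
    ranges = []
    start = MIN_TOKEN
    for i in range(n):
        # The last range absorbs any remainder so the ring is fully covered.
        end = MAX_TOKEN if i == n - 1 else start + step - 1
        ranges.append((start, end))
        start = end + 1
    return ranges


def range_query(table, start, end):
    """Build the CQL text that scans one token sub-range (hypothetical helper)."""
    return (
        "SELECT user_id, friend_follower_id, is_friend, is_follower "
        "FROM {0} WHERE token(user_id) >= {1} AND token(user_id) <= {2}"
    ).format(table, start, end)
```

Each query string can then be executed with paging enabled, and the rows from all sub-ranges filtered against the seed IDs before edges are added.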

It seems like you can get rid of the last 2 if statements, since you're going through data that you already have looped through once:
def build_seed_graph(cls, name):
    obj = cls()
    obj.name = name
    query = "SELECT twitter_id FROM {0};"
    seeds = obj.session.execute(query.format(obj.seed_data_table))
    obj.graph.add_nodes_from(obj.seeds)
    for seed in seeds:
        query = "SELECT friend_follower_id, is_friend, is_follower FROM {0} WHERE user_id={1}"
        statement = SimpleStatement(query.format(obj.network_table, seed), fetch_size=1000)
        for row in obj.session.execute(statement):
            if row.friend_follower_id in obj.seeds:
                if row.is_friend:
                    obj.graph.add_edge(seed, row.friend_follower_id)
                # plain if, not elif: a user can be both a friend and a follower
                if row.is_follower:
                    obj.graph.add_edge(row.friend_follower_id, seed)
    return obj
This also gets rid of many append operations on lists that you're not using, and should speed up this function.
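A driver-independent sketch of the same single-pass idea, assuming the seed IDs are kept in a set so each membership test is a hash lookup rather than a list scan. The Row namedtuple and the edges_for_seed helper are hypothetical stand-ins for the driver's row objects and for the loop body above.

```python
from collections import namedtuple

# Hypothetical stand-in for a row returned by the driver.
Row = namedtuple("Row", ["friend_follower_id", "is_friend", "is_follower"])


def edges_for_seed(seed_id, rows, seed_ids):
    """Return directed edges for one seed in a single pass over its rows."""
    edges = []
    for row in rows:
        # seed_ids is a set, so this test does not depend on its size.
        if row.friend_follower_id not in seed_ids:
            continue
        if row.is_friend:
            edges.append((seed_id, row.friend_follower_id))
        if row.is_follower:
            edges.append((row.friend_follower_id, seed_id))
    return edges
```

The resulting list can be passed to networkx in one call with graph.add_edges_from(edges).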

Related

Fast way to convert SQLAlchemy objects to Python dicts

I have this query that returns a list of student objects:
query = db.session.query(Student).filter(Student.is_deleted == false())
query = query.options(joinedload('project'))
query = query.options(joinedload('image'))
query = query.options(joinedload('student_locator_map'))
query = query.options(subqueryload('attached_addresses'))
query = query.options(subqueryload('student_meta'))
query = query.order_by(Student.student_last_name, Student.student_first_name,
                       Student.student_middle_name, Student.student_grade, Student.student_id)
query = query.filter(filter_column == field_value)
students = query.all()
The query itself does not take much time. The problem is converting all these objects (there can be 5000+) to Python dicts, which takes over a minute. Currently, the code loops through the objects and converts each one using to_dict(). I have also tried __dict__, which was much faster, but it does not seem to convert all the related objects.
How can I convert all these Student objects and related objects quickly?

Maybe this will help you...
from collections import defaultdict
from sqlalchemy import inspect
def query_to_dict(student_results):
    # maps each attribute name to the list of its values across all objects
    result = defaultdict(list)
    for obj in student_results:
        instance = inspect(obj)
        for key, x in instance.attrs.items():
            result[key].append(x.value)
    return result
output = query_to_dict(students)

query = query.options(joinedload('attached_addresses').joinedload('address'))
By chaining address joinedload to attached_addresses I was able to significantly speed up the query.
My understanding of why this is the case:
Address objects were not being loaded with the initial query, so on every iteration of the loop the database was hit again to retrieve the Address object. With joinedload, Address objects are now loaded by the initial query.
Thanks to Corley Brigman for the help.
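The effect described above (one extra query per object versus a single joined query) can be illustrated without SQLAlchemy. The sketch below uses the standard sqlite3 module with hypothetical student and address tables, and records every statement it sends so the two approaches can be compared. It shows the access pattern only, not what SQLAlchemy emits internally.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE address (id INTEGER PRIMARY KEY, city TEXT);
    CREATE TABLE student (id INTEGER PRIMARY KEY, name TEXT, address_id INTEGER);
    INSERT INTO address VALUES (1, 'Oslo'), (2, 'Lima');
    INSERT INTO student VALUES (1, 'Ann', 1), (2, 'Bob', 2), (3, 'Cy', 1);
""")

executed = []  # every SQL statement sent through run()


def run(sql, params=()):
    executed.append(sql)
    return conn.execute(sql, params).fetchall()


def lazy_load():
    """One query for the students, then one more query per student."""
    students = run("SELECT id, name, address_id FROM student ORDER BY id")
    return [
        (name, run("SELECT city FROM address WHERE id = ?", (address_id,))[0][0])
        for _, name, address_id in students
    ]


def eager_load():
    """A single joined query that returns the same pairs."""
    return run(
        "SELECT s.name, a.city FROM student s "
        "JOIN address a ON a.id = s.address_id ORDER BY s.id"
    )
```

With three students, lazy_load() sends four statements and eager_load() sends one; with 5000+ objects the gap grows in proportion.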

How to generate a unique random number when insert in MySQL?

I have a database of articles and want to generate a unique random integer for each article, so that it can be reached through a URL like https://blablabla.com/articles/8373734.
I could do that in the Python backend, but how can it be done in MySQL statements?
For example, a new article is finished and inserted into the database:
INSERT into article_table (title, date, url_id) VALUES ('asdf', '11/11/1111', 8373734)
Here url_id is the unique random integer (1000~10000000) that is generated automatically.
I believe the primary key ID with auto-increment is a good way to solve this. But my questions are:
In practice, do companies really use the primary ID or auto-increment? This may expose how many rows you have (or have ever had) in the database. Take https://www.zhihu.com/question/41490222 for example: I tried hundreds of numbers around 41490222 and all of them returned 404 Not Found. The numbers seem to be assigned very sparsely, which is unlikely to be the result of auto-increment.
Is there an efficient way to generate such a random number without checking for duplicates in a loop?

Use mysql function RAND()
-------------------------
select FLOOR(RAND() * 999999)

You can use UUID(), or if it has to be numeric UUID_SHORT() for that.

Although my SQL skills are a bit rusty, I think you might want to create a stored function that uses RAND() and retries until the value is unused:
DELIMITER $$
CREATE FUNCTION GetRandomValue() RETURNS INT
NOT DETERMINISTIC READS SQL DATA
BEGIN
    DECLARE newUrlId INT DEFAULT 0;
    WHILE newUrlId = 0
          OR EXISTS (SELECT 1 FROM yourTable WHERE url_id = newUrlId)
    DO
        SET newUrlId = FLOOR(RAND() * 999999);
    END WHILE;
    RETURN newUrlId;
END$$
DELIMITER ;
Then again, why make such a fuss when there are other ways to create "bigger random-looking numbers" from the ID?
For example:
function createBiggerNumber(id) {
    return (id * constants.MySecretMultiplyValue) + constants.MySecretAddedValue;
}
function extractIdFromBiggerNumber(number) {
    return (number - constants.MySecretAddedValue) / constants.MySecretMultiplyValue;
}
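For the Python backend mentioned in the question, the same reversible mapping might look like the sketch below. MULTIPLIER and OFFSET are hypothetical secret values chosen for illustration. Note that a linear mapping only obscures the counter: anyone who collects a few public IDs can recover both constants, so it should not be treated as a security measure.

```python
# Hypothetical secret constants; any positive integers work.
MULTIPLIER = 7919
OFFSET = 104729


def encode_id(row_id):
    """Map an auto-increment primary key to a sparser public number."""
    return row_id * MULTIPLIER + OFFSET


def decode_id(public_id):
    """Invert encode_id, rejecting numbers that were never issued."""
    value, remainder = divmod(public_id - OFFSET, MULTIPLIER)
    if remainder != 0 or value < 0:
        raise ValueError("not a valid public id")
    return value
```

Because the mapping is one-to-one, the public number is unique whenever the primary key is, and no duplicate check is needed on insert.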

The idea is to combine the generated value with the row's primary key (id), so there is no need to check whether the value already exists.
DELIMITER $$
DROP TRIGGER IF EXISTS `auto_number`$$
CREATE TRIGGER `auto_number` BEFORE INSERT on users
FOR EACH ROW BEGIN
SET new.auto_number = CONCAT(new.id, LEFT(UUID(), 8));
END$$
DELIMITER ;
https://gist.github.com/yogithesymbian/698b27138a5ba89d2a32e3fc7ddd3cfb

Querying objects using attribute of member of many-to-many

I have the following models:
class Member(models.Model):
    ref = models.CharField(max_length=200)
    # some other stuff
    def __str__(self):
        return self.ref
class Feature(models.Model):
    feature_id = models.BigIntegerField(default=0)
    members = models.ManyToManyField(Member)
    # some other stuff
A Member is basically just a pointer to a Feature. So let's say I have Features:
feature_id = 2, members = 1, 2
feature_id = 4
feature_id = 3
Then the members would be:
id = 1, ref = 4
id = 2, ref = 3
I want to find all of the Features which contain one or more Members from a list of "ok members." Currently my query looks like this:
# ndtmp is a query set of member-less Features which Members can point to
sids = [str(i) for i in list(ndtmp.values('feature_id'))]
# now make a query set that contains all rels and ways with at least one member with an id in sids
okmems = Member.objects.filter(ref__in=sids)
relsways = Feature.geoobjects.filter(members__in=okmems)
# now combine with nodes
op = relsways | ndtmp
This is enormously slow, and I'm not even sure if it's working. I've tried using print statements to debug, just to make sure anything is actually being parsed, and I get the following:
print(ndtmp.count())
>>> 12747
print(len(sids))
>>> 12747
print(okmems.count())
... and then the code just hangs for minutes, and eventually I quit it. I think that I just overcomplicated the query, but I'm not sure how best to simplify it. Should I:
Migrate Feature to use a CharField instead of a BigIntegerField? There is no real reason for me to use a BigIntegerField, I just did so because I was following a tutorial when I began this project. I tried a simple migration by just changing it in models.py and I got a "numeric" value in the column in PostgreSQL with format 'Decimal:( the id )', but there's probably some way around that that would force it to just shove the id into a string.
Use some feature of Many-To-Many Fields which I don't know about to more efficiently check for matches
Calculate the bounding box of each Feature and store it in another column so that I don't have to do this calculation every time I query the database (so just the single fixed cost of calculation upon Migration + the cost of calculating whenever I add a new Feature or modify an existing one)?
Or something else? In case it helps, this is for a server-side script for an ongoing OpenStreetMap related project of mine, and you can see the work in progress here.
EDIT - I think a much faster way to get ndids is like this:
ndids = ndtmp.values_list('feature_id', flat=True)
This works, producing a non-empty set of ids.
Unfortunately, I am still at a loss as to how to get okmems. I tried:
okmems = Member.objects.filter(ref__in=str(ndids))
But it returns an empty query set. And I can confirm that the ref points are correct, via the following test:
Member.objects.values('ref')[:1]
>>> [{'ref': '2286047272'}]
Feature.objects.filter(feature_id='2286047272').values('feature_id')[:1]
>>> [{'feature_id': '2286047272'}]

You should take a look at annotate:
okmems = Member.objects.annotate(
feat_count=models.Count('feature')).filter(feat_count__gte=1)
relsways = Feature.geoobjects.filter(members__in=okmems)

Ultimately, I was wrong to set up the database using a numeric id in one table and a text-type id in the other. I am not very familiar with migrations yet, but at some point I'll have to take a deep dive into that world and figure out how to migrate my database to use numerics on both. For now, this works:
# ndtmp is a query set of member-less Features which Members can point to
# get the unique ids from ndtmp as strings
strids = (
    ndtmp.extra({'feature_id_str': "CAST(feature_id AS VARCHAR)"})
    .order_by('-feature_id_str')
    .values_list('feature_id_str', flat=True)
    .distinct()
)
# find all members whose ref values can be found in strids
okmems = Member.objects.filter(ref__in=strids)
# find all features containing one or more members in the accepted members list
relsways = Feature.geoobjects.filter(members__in=okmems)
# combine that with my existing list of allowed member-less features
op = relsways | ndtmp
# prove that this set is not empty
op.count()
# takes about 10 seconds
>>> 8997148 # looks like it worked!
Basically, I am making a query set of feature_ids (numerics) and casting it to be a query set of text-type (varchar) field values. I am then using values_list to make it only contain these string id values, and then I am finding all of the members whose ref ids are in that list of allowed Features. Now I know which members are allowed, so I can filter out all the Features which contain one or more members in that allowed list. Finally, I combine this query set of allowed Features which contain members with ndtmp, my original query set of allowed Features which do not contain members.
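One likely reason the earlier attempt with ref__in=str(ndids) matched nothing is that str() applied to a whole collection produces a single string, and iterating a string yields individual characters. The sketch below shows this in plain Python, with a list standing in for the queryset and a hypothetical matching_refs helper standing in for the ref__in lookup. It is not Django code.

```python
feature_ids = [2286047272, 17, 42]  # stand-in for the values_list() result

# str() of the whole collection gives ONE string: "[2286047272, 17, 42]".
# Iterating it yields single characters, so no complete id can match.
wrong_candidates = list(str(feature_ids))

# Converting each id separately gives one string per id.
right_candidates = [str(i) for i in feature_ids]


def matching_refs(refs, candidates):
    """Return the refs that appear in candidates (stand-in for ref__in)."""
    allowed = set(candidates)
    return [ref for ref in refs if ref in allowed]


member_refs = ["2286047272", "99"]
```

Here matching_refs(member_refs, wrong_candidates) is empty, while the per-id conversion finds "2286047272". The CAST in the query above performs the same per-id conversion on the database side.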

Adding parameters to SQLITE3 SELECT column queries python

I am trying to streamline my queries to SQLite3. I use it for financial price modelling, so I re-use the same basic query a lot, but I have to keep changing the hard-coded column names to get different columns each time. I want a generic query where I write what I want once and it returns the columns as lists. This is a basic version of what I want; it is still hard-coded, but it shows what I am trying to create.
dbName = 'NASDAQ_Equities'
ticker = 'AAPL'
def pullDataTest(dbPathName, ticker, *args):
    datep = []
    openp = []
    highp = []
    db = sqlite3.connect(dbPathName + '.mydb', detect_types=sqlite3.PARSE_DECLTYPES | sqlite3.PARSE_COLNAMES, timeout=3)
    cursor = db.cursor()
    cursor.execute('''SELECT ''' + str(args) + ''' FROM ''' + ticker)
    for row in cursor:
        datep.append(row[0])
        openp.append(row[1])
        highp.append(row[2])
pullData(dbName, ticker, 'datep', 'openp', 'highp')
At the moment I am lost on how to pass *args into the SELECT statement, as it is rejected because of the () brackets that str(args) produces. Another issue is creating the empty lists and appending to them based on *args. Would it be better to create an ordered dict to append to, then break that into lists at the end somehow? For returning the values for later use, I was thinking of making them globals. Any suggestions? Thanks
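One possible approach, sketched under the assumption that each ticker is its own table as in the question: join the requested column names with commas, and return a dict mapping each column name to a list instead of using globals. Identifiers cannot be bound as ? parameters, so the sketch validates them against the schema before building the SQL text. The names pull_columns and quote_ident are hypothetical.

```python
import sqlite3


def quote_ident(name):
    """Quote an SQL identifier, doubling any embedded double quotes."""
    return '"' + name.replace('"', '""') + '"'


def pull_columns(conn, table, *columns):
    """Return {column_name: [values...]} for the requested columns of table."""
    if not columns:
        raise ValueError("at least one column is required")
    # Table and column names are checked against the schema because
    # they are interpolated into the SQL text rather than bound.
    exists = conn.execute(
        "SELECT 1 FROM sqlite_master WHERE type = 'table' AND name = ?", (table,)
    ).fetchone()
    if exists is None:
        raise ValueError("unknown table: %r" % (table,))
    info = conn.execute("PRAGMA table_info(%s)" % quote_ident(table))
    known = {row[1] for row in info}  # row[1] is the column name
    unknown = [c for c in columns if c not in known]
    if unknown:
        raise ValueError("unknown columns: %r" % (unknown,))
    sql = "SELECT %s FROM %s" % (
        ", ".join(quote_ident(c) for c in columns),
        quote_ident(table),
    )
    result = {c: [] for c in columns}
    for row in conn.execute(sql):
        for column, value in zip(columns, row):
            result[column].append(value)
    return result
```

A call such as pull_columns(conn, 'AAPL', 'datep', 'openp', 'highp') then returns the three lists keyed by column name, and the caller picks out what it needs.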

Best practices for manipulating database result sets in Python?

I am writing a simple Python web application that consists of several pages of business data formatted for the iPhone. I'm comfortable programming Python, but I'm not very familiar with Python "idiom," especially regarding classes and objects. Python's object oriented design differs somewhat from other languages I've worked with. So, even though my application is working, I'm curious whether there is a better way to accomplish my goals.
Specifics: How does one typically implement the request-transform-render database workflow in Python? Currently, I am using pyodbc to fetch data, copying the results into attributes on an object, performing some calculations and merges using a list of these objects, then rendering the output from the list of objects. (Sample code below, SQL queries redacted.) Is this sane? Is there a better way? Are there any specific "gotchas" I've stumbled into in my relative ignorance of Python? I'm particularly concerned about how I've implemented the list of rows using the empty "Record" class.
class Record(object):
    pass
def calculate_pnl(records, node_prices):
    for record in records:
        try:
            # fill RT and DA prices from the hash retrieved above
            if hasattr(record, 'sink') and record.sink:
                record.da = node_prices[record.sink][0] - node_prices[record.id][0]
                record.rt = node_prices[record.sink][1] - node_prices[record.id][1]
            else:
                record.da = node_prices[record.id][0]
                record.rt = node_prices[record.id][1]
            # calculate dependent values: RT-DA and PNL
            record.rtda = record.rt - record.da
            record.pnl = record.rtda * record.mw
        except:
            print sys.exc_info()
def map_rows(cursor, mappings, callback=None):
    records = []
    for row in cursor:
        record = Record()
        for field, attr in mappings.iteritems():
            setattr(record, attr, getattr(row, field, None))
        if not callback or callback(record):
            records.append(record)
    return records
def get_positions(cursor):
    # get the latest position time
    cursor.execute("SELECT latest data time")
    time = cursor.fetchone().time
    hour = eelib.util.get_hour_ending(time)
    # fetch the current positions
    cursor.execute("SELECT stuff FROM atable", (hour))
    # read the rows
    nodes = {}
    def record_callback(record):
        if abs(record.mw) > 0:
            if record.id: nodes[record.id] = None
            return True
        else:
            return False
    records = util.map_rows(cursor, {
        'id': 'id',
        'name': 'name',
        'mw': 'mw'
    }, record_callback)
    # query prices
    for node_id in nodes:
        # RT price
        row = cursor.execute("SELECT price WHERE ? ? ?", (node_id, time, time)).fetchone()
        rt5 = row.lmp if row else None
        # DA price
        row = cursor.execute("SELECT price WHERE ? ? ?", (node_id, hour, hour)).fetchone()
        da = row.da_lmp if row else None
        # update the hash value
        nodes[node_id] = (da, rt5)
    # calculate the position pricing
    calculate_pnl(records, nodes)
    # sort
    records.sort(key=lambda r: r.name)
    # return the records
    return records

The empty Record class and the free-floating function that (generally) applies to an individual Record is a hint that you haven't designed your class properly.
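class Record(object):
    """Assuming rtda and pnl must exist."""
    def __init__(self):
        self.da = 0
        self.rt = 0
        self.rtda = 0  # or whatever
        self.pnl = None
        self.sink = None  # Not clear what this is
    def setPnl(self, node_prices):
        # fill RT and DA prices from the hash retrieved above
        # (self.id and self.mw are set when the row is mapped)
        if self.sink:
            self.da = node_prices[self.sink][0] - node_prices[self.id][0]
            self.rt = node_prices[self.sink][1] - node_prices[self.id][1]
        else:
            self.da = node_prices[self.id][0]
            self.rt = node_prices[self.id][1]
        # calculate dependent values: RT-DA and PNL
        self.rtda = self.rt - self.da
        self.pnl = self.rtda * self.mw
Now, your calculate_pnl(records, node_prices) is simpler and uses the object properly.
def calculate_pnl(records, node_prices):
    for record in records:
        record.setPnl(node_prices)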
The point isn't to trivially refactor the code in small ways.
The point is this: A Class Encapsulates Responsibility.
Yes, an empty-looking class is usually a problem. It means the responsibilities are scattered somewhere else.
A similar analysis holds for the collection of records. This is more than a simple list, since the collection -- as a whole -- has operations it performs.
The "Request-Transform-Render" isn't quite right. You have a Model (the Record class). Instances of the Model get built (possibly because of a Request.) The Model objects are responsible for their own state transformations and updates. Perhaps they get displayed (or rendered) by some object that examines their state.
It's that "Transform" step that often violates good design by scattering responsibility all over the place. "Transform" is a hold-over from non-object design, where responsibility was a nebulous concept.

Have you considered using an ORM? SQLAlchemy is pretty good, and Elixir makes it beautiful. It can really reduce the amount of boilerplate code needed to deal with databases. Also, a lot of the gotchas mentioned have already shown up and been dealt with by the SQLAlchemy developers.

Depending on how much you want to do with the data you may not need to populate an intermediate object. The cursor's header data structure will let you get the column names - a bit of introspection will let you make a dictionary with col-name:value pairs for the row.
You can pass the dictionary to the % operator. The docs for the odbc module will explain how to get at the column metadata.
This snippet shows the % operator applied in this manner.
>>> a={'col1': 'foo', 'col2': 'bar', 'col3': 'wibble'}
>>> 'Col1=%(col1)s, Col2=%(col2)s, Col3=%(col3)s' % a
'Col1=foo, Col2=bar, Col3=wibble'
>>>
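A small sketch of the introspection step, assuming a DB-API 2.0 cursor. The standard sqlite3 module is used here so the example is self-contained; pyodbc cursors expose the same description attribute. The helper names rows_as_dicts and format_row are hypothetical.

```python
import sqlite3

TEMPLATE = "Col1=%(col1)s, Col2=%(col2)s, Col3=%(col3)s"


def rows_as_dicts(cursor):
    """Turn the remaining rows of an executed cursor into column-name dicts."""
    # Each entry of cursor.description is a sequence whose first item
    # is the column name (DB-API 2.0).
    names = [column[0] for column in cursor.description]
    return [dict(zip(names, row)) for row in cursor.fetchall()]


def format_row(row_dict):
    """Render one row with the % operator, as in the snippet above."""
    return TEMPLATE % row_dict
```

This avoids the intermediate Record object entirely when the rows only need to be rendered.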

Using an ORM for an iPhone app might be a bad idea because of performance issues: you want your code to be as fast as possible, so you can't avoid some boilerplate code. If you are considering an ORM, besides SQLAlchemy I'd recommend Storm.
