Python MongoDB sort too big, how to use an index?

I'm trying to iterate in Python over all elements of a large Mongodb database.
Usually, I do:
mgclient = MongoClient('mongodb://user:pwd@0.0.0.0:27017')
mgdb = mgclient['mongo']
mgcol = mgdb['name']
for mg_ob in mgcol.find().sort('Date').sort('time'):
    # DO THINGS
But it says "Sort operation used more than the maximum 33554432 bytes of RAM. Add an index, or specify a smaller limit".
So I created an index named 'SortedTime', but I don't understand how I can use it now.
Basically, I'm trying to have something like:
mgclient = MongoClient('mongodb://user:pwd@0.0.0.0:27017')
mgdb = mgclient['mongo']
mgcol = mgdb['name']
for mg_ob in mgcol.find()['SortedTime']:
    # DO THINGS
Any ideas? A helping hand would be much appreciated.
I hope this post will help others. Thank you very much
Update:
I managed to make it work thanks to Joe. After I created the index:
resp = mgcol.create_index(
    [
        ("date", 1),
        ("time", 1)
    ]
)
print("index response:", resp)
What I did was just:
mgclient = MongoClient('mongodb://user:pwd@0.0.0.0:27017')
mgdb = mgclient['mongo']
mgcol = mgdb['name']
for mg_ob in mgcol.find():
    # DO THINGS
No need to use the index name.

Your query is sorting on 2 fields, Date and time, so you will need an index that includes these fields first in the key specification.
Working from the mongo shell, you might use the createIndex shell helper:
db.getSiblingDB("mongo").getCollection("name").createIndex({Date:1, time:1})
Working from the client side, you might use the createIndexes database command.
Once the index has been created, query just like you did before and the mongod's query executor should use the index.
You can use explain() to get detailed query execution stages to see which indexes were considered and the comparative performance of each.
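A minimal end-to-end sketch with pymongo, assuming the 'Date' and 'time' field names from the question. Note that both sort keys must be passed in a single sort() call; chaining .sort() twice simply replaces the first sort:
from pymongo import MongoClient, ASCENDING

mgclient = MongoClient('mongodb://user:pwd@0.0.0.0:27017')
mgcol = mgclient['mongo']['name']

# Build the compound index once; subsequent sorted queries reuse it.
mgcol.create_index([('Date', ASCENDING), ('time', ASCENDING)])

# Pass both keys in one sort() call so the server can walk the index
# instead of sorting in memory.
cursor = mgcol.find().sort([('Date', ASCENDING), ('time', ASCENDING)])

# explain() reports the winning plan, including which index was chosen.
print(cursor.explain())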

Related

KDB+ query in QPython: Filter based on DataFrame list

I am using qpython to query a KDB+ database and then performing operations on the output. old_df is the output of an earlier qpython sync query and has source_id as a string column. Now I am querying another database, trades_database, which has the same values (as source_id) under a different column name, customer (also a string; no issues with data type).
params = np.array([])
for i in old_df['source_id']:
    params = np.append(params, np.string_(i))
new_df = q.sync('{[w]select from trade_database where customer in w}', *params, pandas=True)
Unfortunately, there is very little available online about such queries. I have learned a fair bit from the questions asked on here, but I am really stuck now. My list could be very long, so I need to write a query that takes it as an input.
I also tried:
new_df = q1.sync('{select from trades_database where customer in (`1234, `ABCD)}', pandas=True)
which works but I get
<qpython.qtype.QLambda object at 0x000000000413F710>
How does one "unpack" a QLambda object?
Please ignore the second question if I am not allowed to ask two questions in the same post. Apologies in that case.
Thanks!
Here is what I did, and it seems to work:
params = np.array(one_id)  # seed with the initial id used to build old_df; no square brackets, so it is not wrapped in a list
for i in old_df['source_id']:
    params = np.append(params, np.string_(i))
params = np.unique(params)
new_df = q1.sync('{[w]select from trades_database where customer in w}', params, pandas=True)
Note that params is passed as a single argument here, not unpacked with *params: the lambda {[w]...} expects exactly one parameter, the whole list.
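As for the second question: the braces in '{select ...}' define a q lambda, so sync() evaluates to the function itself, which qpython hands back as a QLambda, rather than running it. A sketch of two ways around this, reusing the table and values from the question:
# Evaluate the select directly, without the braces...
new_df = q1.sync('select from trades_database where customer in `1234`ABCD', pandas=True)
# ...or keep the lambda but apply it immediately with an empty argument list.
new_df = q1.sync('{select from trades_database where customer in `1234`ABCD}[]', pandas=True)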

Django storing a lot of data in table

Right now, I use this code to save the data to the database-
for i in range(len(companies)):
    for j in range(len(final_prices)):
        linechartdata = LineChartData()
        linechartdata.foundation = company  # this refers to a foreign key of a different model
        linechartdata.date = finald[j]
        linechartdata.price = finalp[j]
        linechartdata.save()
Now len(companies) can vary from 3 to 50, and len(final_prices) can vary between 5,000 and 10,000. I know it's a very inefficient way to store the data and it takes a lot of time. What should I do to make it efficient and less time-consuming?
If you really need to store them in the database, you might check bulk_create. From the docs:
This method inserts the provided list of objects into the database in an efficient manner (generally only 1 query, no matter how many objects there are):
Although I have never personally used it with that many objects, the docs say it can handle them. This makes your code hit the database far less often than calling save() in a loop does.
Basically: build the list of objects (without saving) and then use bulk_create, like this:
arr = []
for i in range(len(companies)):
    for j in range(len(final_prices)):
        arr.append(
            LineChartData(
                foundation=company,
                date=finald[j],
                price=finalp[j]
            )
        )
LineChartData.objects.bulk_create(arr)
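One sizing note: with up to 50 companies and 10,000 prices each, arr can hold half a million objects. bulk_create accepts a batch_size argument that splits the insert into several smaller queries, which keeps individual statements and memory manageable (the value below is just illustrative):
LineChartData.objects.bulk_create(arr, batch_size=5000)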

Filtering with joined tables

I'm trying to improve the performance of a query, but the generated query does not look the way I expect it to.
The results are retrieved using:
query = (session.query(SomeModel)
         .options(joinedload_all('foo.bar'))
         .options(joinedload_all('foo.baz'))
         .options(joinedload('quux.other')))
What I want to do is filter on the table joined via 'foo', but this way doesn't work:
query = query.filter(FooModel.address == '1.2.3.4')
It results in a clause like this attached to the query:
WHERE foos.address = '1.2.3.4'
This doesn't filter properly, since the generated joins attach the aliased tables foos_1 and foos_2. If I run the query manually but change the filter clause to:
WHERE foos_1.address = '1.2.3.4' AND foos_2.address = '1.2.3.4'
It works fine. The question is, of course: how can I achieve this with SQLAlchemy itself?
If you want to filter on joins, you use join():
session.query(SomeModel).join(SomeModel.foos).filter(Foo.something=='bar')
joinedload() and joinedload_all() are used only as a means to load related collections in one pass; they are not for filtering or ordering! Please read:
http://docs.sqlalchemy.org/en/latest/orm/tutorial.html#joined-load - the note on "joinedload() is not a replacement for join()" - as well as:
http://docs.sqlalchemy.org/en/latest/orm/loading.html#the-zen-of-eager-loading
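Applied to the query above, a minimal sketch might look like this (SomeModel, its foo relationship, and FooModel are names guessed from the question). contains_eager() tells the ORM to populate the relationship from the explicit join, so the filter and the eager load share one table rather than a separate alias:
from sqlalchemy.orm import contains_eager

query = (session.query(SomeModel)
         .join(SomeModel.foo)
         .filter(FooModel.address == '1.2.3.4')
         .options(contains_eager(SomeModel.foo)))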

How to query a couchdb view using a composite key?

I have a couchdb view "record_by_date_product" with the following definition:
function(doc) {
    emit([doc.logtime, doc.product_id], doc);
}
I am trying to run a query which is something like:
(logtime > fromdate & logtime < todate) & product_id in (1,2,6)
Is this possible with this view?
I am also using couchdb python library to access couchdb. Here is a code snippet:
server = couchdb.Server()
db = server['mydb']
results = db.view('_design/record_by_date_product/_view/record_by_date_product')
This page http://packages.python.org/CouchDB/client.html#viewresults specifies that we can use a startkey and endkey. But I am not able to get it working.
Thanks
I think I just found the exact answer:
Design a view 'sampleview' which is like:
{
    "records_by_date_product": {
        "map": "function(doc) {\n emit([doc.prod_id, doc.logtime], doc);\n}"
    }
}
Let us say that the query parameters are:
prod_id in [1,3]
from_date = '2010-01-01 00:00:00'
to_date = '2010-01-02 00:00:00'
Then you will have to run 2 separate queries on the same view:
http://localhost:5984/db/_design/sampleview/_view/records_by_date_product?startkey=[1,"2010-01-01%2000:00:00"]&endkey=[1,"2010-01-02%2000:00:00"]
http://localhost:5984/db/_design/sampleview/_view/records_by_date_product?startkey=[3,"2010-01-01%2000:00:00"]&endkey=[3,"2010-01-02%2000:00:00"]
Notice that the same query is run each time except that the prod_id is changed in the second query. The results have to be collated later. Hope this helps!
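For reference, the same pair of range queries can be issued from Python by passing startkey/endkey to db.view(), reusing the view and values from this answer (a sketch; collation here is a plain list extend):
import couchdb

server = couchdb.Server()
db = server['mydb']

rows = []
for prod_id in [1, 3]:
    # One range query per product id; keys are [prod_id, logtime].
    rows.extend(db.view(
        'sampleview/records_by_date_product',
        startkey=[prod_id, '2010-01-01 00:00:00'],
        endkey=[prod_id, '2010-01-02 00:00:00'],
    ))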
That exact query is not possible. As the documentation suggests, you can get everything in a view in a particular key range. Views are sorted data structures, so all CouchDB does to fulfill this request is locate the start key and begin returning items until you hit the end key.
The strategy you should use for this query depends on the characteristics of the data itself. Most importantly: if you query on only the first part of the key (logtime) and iterate through the results in Python, will you waste a lot of time weeding out items whose product_id doesn't match? If so, you should consider writing another view that is primarily sorted by product_id. If not, go ahead and use the weed-out approach.
How about this solution:
1. Create a view for each product, with logtime as the index.
2. Access each view as required and filter the results using the range [fromdate, todate].
3. Do step 2 for each product in the input parameters and collate the results.
This has the drawback that we have to create a view for every product, which looks like a manual process.
Just a thought! Let me know your views.

How to a query a set of objects and return a set of object specific attribute in SQLachemy/Elixir?

Suppose that I have a table like:
class Ticker(Entity):
ticker = Field(String(7))
tsdata = OneToMany('TimeSeriesData')
staticdata = OneToMany('StaticData')
How would I query it so that it returns a set of Ticker.ticker?
I dug into the docs and it seems like select() is the way to go. However, I am not too familiar with the SQLAlchemy syntax. Any help is appreciated.
ADDED: My ultimate goal is to have a set of current tickers such that, when a new ticker is not in the set, it will be inserted into the database. I am just learning how to create a database and SQL in general. Any thoughts are appreciated.
Thanks. :)
Not sure what you're after exactly, but to get a list of all Ticker.ticker values you would do this:
[instance.ticker for instance in Ticker.query.all()]
What you really want is probably the Elixir getting started tutorial - it's good so take a look!
UPDATE 1: Since you have a database, the best way to find out if a new potential ticker needs to be inserted or not is to query the database. This will be much faster than reading all tickers into memory and checking. To see if a value is there or not, try this:
Ticker.query.filter_by(ticker=new_ticker_value).first()
If the result is None, you don't have it yet. So, all together:
if Ticker.query.filter_by(ticker=new_ticker_value).first() is None:
    Ticker(ticker=new_ticker_value)
    session.commit()
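As a side note on the original question: if all you need is the set of ticker strings, you can select just that column instead of instantiating full entities (a sketch against the SQLAlchemy session that Elixir uses underneath):
ticker_set = {row.ticker for row in session.query(Ticker.ticker)}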
