This is my first post :) so I'll apologize beforehand. I'm working on exporting data from MySQL to CouchDB; when an item has been saved, I mark the MySQL row with a recently-updated date. Below is a Python function which takes in JSON objects one by one, along with an id used to update the corresponding row in the local MySQL database:
def write_json_to_couchdb(json_obj, id):
    # couchdb auto-creates doc_id and rev
    doc_id = ''
    revision_or_exception = ''
    for (success, doc_id, revision_or_exception) in db.update(json_obj):
        print(success, doc_id, revision_or_exception)
    # mark the id inside the mysql db, so we know it's been saved to couchdb
    mysql.update_item_date(id)
The solution above works, but it is quite slow, both writing to CouchDB and updating MySQL. How can I use the "bulk api" or "batch api" without using curl? I believe couchdb-python's db.update(item) can also take a list, like db.update(list_of_items). How can I specify "batch ok"? Are there any other methods I'm unaware of? There seem to be few examples online.
Would this increase the speed significantly? Also, how can I specify a "batch size" of, say, 1000 records?
Here's what I'm thinking a better solution would be:
def write_json_to_couchdb_bulk(json_obj_list, id_list):
    doc_id = ''
    revision_or_exception = ''
    for (success, doc_id, revision_or_exception) in db.update(json_obj_list):
        print(success, doc_id, revision_or_exception)
    # update added_date with current datetime
    for id in id_list:
        mysql.update_item_date(id)
Thanks,
SW
Here's the solution I came up with; it's much faster.
import couchdb
from couchdb import Document

def write_json_to_couchdb_bulk(json_obj_list):
    for doc in db.update(json_obj_list):
        print(repr(doc))

json_obj_list = [
    Document(type='Person', name='John Doe'),
    Document(type='Person', name='Mary Jane'),
    Document(type='City', name='Gotham City')
]

write_json_to_couchdb_bulk(json_obj_list)
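If you also need to cap the size of each bulk request, here is a rough sketch of chunking the list before each db.update() call, reusing the db handle and the mysql.update_item_date() helper from the question; the batch size of 1000 is just the figure mentioned above, not a recommendation:
def write_json_to_couchdb_batched(json_obj_list, id_list, batch_size=1000):
    # send the documents to CouchDB in chunks, one bulk request per chunk
    for start in range(0, len(json_obj_list), batch_size):
        chunk_docs = json_obj_list[start:start + batch_size]
        chunk_ids = id_list[start:start + batch_size]
        # db.update() returns one (success, doc_id, rev_or_exc) tuple per document
        for (success, doc_id, rev_or_exc), item_id in zip(db.update(chunk_docs), chunk_ids):
            if success:
                mysql.update_item_date(item_id)  # mark the row as exported
            else:
                print('failed to save %s: %s' % (doc_id, rev_or_exc))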
I have a database of reviews and want to create a new field that indicates whether a review contains words relating to "pool".
import re
import pandas as pd
from pymongo import MongoClient
client = MongoClient()
db = client.Hotels_Copenhagen
collection = db.get_collection("hotel_review_table")
data = pd.DataFrame(list(collection.find()))
def common_member(a, b):
    a_set = set(a)
    b_set = set(b)
    if a_set & b_set:
        return True
    else:
        return False

pool_set = {"pool", "swim", "swimming"}
for review_id, single_review in zip(data["_id"], data["review_text"]):
    make_it_lowercase = str(single_review).lower()
    tokenize_it = re.split(r"\s|\.|,", make_it_lowercase)
    pool_mentioned = common_member(tokenize_it, pool_set)
    # filter on _id so each review updates its own document
    db.hotel_review_table.update_one({"_id": review_id}, {"$set": {"pool_mentioned": pool_mentioned}})
In Python I already counted the number of reviews containing words related to "pool", and it turns out that about 1k out of my 50k reviews talk about pools.
I solved my previously posted problem of getting the same entry everywhere by moving the db.hotel_review_table.update_one line into the loop.
Thus the main problem is solved. However, it takes quite some time to update the database like this. Is there any other way to make it faster?
You've gone to a lot of trouble to implement a feature that is available straight out of the box in MongoDB. You need to use text indexes.
Create a text index on the review_text field (in the MongoDB shell):
db.hotel_review_table.createIndex( { "review_text": "text" } )
Then your code distils down to:
from pymongo import MongoClient

db = MongoClient()['Hotels_Copenhagen']
for keyword in ['pool', 'swim', 'swimming']:
    db.hotel_review_table.update_many({'$text': {'$search': keyword}}, {'$set': {'pool_mentioned': True}})
Note this doesn't set the value to false in the case that it isn't mentioned; if this is really needed, you can write another update to set any values that aren't true to false.
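A minimal sketch of that follow-up update, using the same pool_mentioned field as above:
# any document the keyword updates didn't touch gets an explicit False
db.hotel_review_table.update_many(
    {'pool_mentioned': {'$ne': True}},
    {'$set': {'pool_mentioned': False}}
)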
Recently we've been learning MongoDB and Python at university; however, in the latest assignment we were given, we're required to use PyMongo with a pre-existing database of restaurants and perform various basic tasks.
Due to surgery, I couldn't attend the lectures, so I'm still a bit confused about how to do certain things with PyMongo.
One part of the task is to write some queries to do things like "find restaurants with latitude lower than -90" and similar tasks.
My code looks a bit like this:
import pymongo

conn = pymongo.MongoClient()
restDB = conn.restaurantes
doc = restDB.datos

prim = doc.find({"address.coord.0": {"$lt": -95.754168}}, {"name": True})
for resultado in prim:
    print resultado
# print prim
Whenever I do this, no results are displayed and there are no errors either, which is what confuses me. Also, trying to print "prim" just gives me a Cursor object message.
The basic template for querying a collection and iterating through the result with pymongo is:
from pymongo import MongoClient

with MongoClient() as client:
    db = client.get_database("my_db_name")
    coll = db.get_collection("my_collection_name")
    result_cursor = coll.find({"some_field": {"$lt": 100}}, {"some_field": 1})
    for doc in result_cursor:
        print doc
>>
{u'some_field': 3, u'_id': ObjectId('578d52a93ea71afa979e5737')}
...
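Applied to the collection from the question (the restaurantes database and its datos collection), the same template would look roughly like this:
from pymongo import MongoClient

with MongoClient() as client:
    db = client.get_database("restaurantes")
    coll = db.get_collection("datos")
    # first coordinate lower than the threshold, returning only the name
    cursor = coll.find({"address.coord.0": {"$lt": -95.754168}}, {"name": 1})
    for doc in cursor:
        print doc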
I'm trying to truncate a table and insert only ~3000 rows of data using SQLAlchemy, and it's very slow (~10 minutes).
I followed the recommendations in this doc and leveraged SQLAlchemy Core to do my inserts, but it's still running very, very slowly. What are possible culprits for me to look at? The database is a Postgres RDS instance. Thanks!
engine = sa.create_engine(db_string, **kwargs, pool_recycle=3600)

with engine.begin() as conn:
    conn.execute("TRUNCATE my_table")
    conn.execute(
        MyTable.__table__.insert(),
        data  # where data is a list of dicts
    )
I was bummed when I saw this didn't have an answer... I ran into the exact same problem the other day: trying to bulk-insert millions of rows into a Postgres RDS instance using Core. It was taking hours.
As a workaround, I ended up writing my own bulk-insert script that generated the raw SQL itself:
bulk_insert_str = []
for entry in entry_list:
    val_str = "('{}', '{}', ...)".format(entry["column1"], entry["column2"], ...)
    bulk_insert_str.append(val_str)

engine.execute(
    """
    INSERT INTO my_table (column1, column2 ...)
    VALUES {}
    """.format(",".join(bulk_insert_str))
)
While ugly, this gave me the performance we needed (~500,000 rows/minute).
Did you find a Core-based solution? If not, I hope this helps!
UPDATE: I ended up moving my old script onto a spare EC2 instance that we weren't using, which actually fixed the slow-performance issue. Not sure what your setup is, but apparently there's network overhead in communicating with RDS from an external (non-AWS) connection.
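As a side note, the same multi-row INSERT can be generated without hand-building the value strings. Here is a sketch using psycopg2's execute_values helper, assuming the engine's underlying driver is psycopg2 and reusing the entry_list and column names from the example above:
from psycopg2.extras import execute_values

# borrow the raw psycopg2 connection that SQLAlchemy manages
raw_conn = engine.raw_connection()
try:
    with raw_conn.cursor() as cur:
        execute_values(
            cur,
            "INSERT INTO my_table (column1, column2) VALUES %s",
            [(entry["column1"], entry["column2"]) for entry in entry_list],
            page_size=1000,  # rows folded into each generated INSERT
        )
    raw_conn.commit()
finally:
    raw_conn.close()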
Some time ago I was struggling with this problem while working at my company, so we created a library with functions for bulk insert and update. I hope we've taken all the performance and security concerns into account. The library is open-sourced and available on PyPI; its name is bulky.
Let me show you some examples of usage:
insert:
import bulky
from random import random

from your.sqlalchemy.models import Model
from your.sqlalchemy.session import Session

data = [
    {Model.column_float: random()}
    for _ in range(100_000_000)
]

rows_inserted = bulky.insert(
    session=Session,
    table_or_model=Model,
    values_series=data,
    returning=[Model.id, Model.column_float],
)

new_items = {row.id: row.column_float for row in rows_inserted}
update:
import bulky
from your.sqlalchemy.models import ManyToManyTable
from your.sqlalchemy.session import Session

data = [
    {
        ManyToManyTable.fk1: i,
        ManyToManyTable.fk2: j,
        ManyToManyTable.value: i + j,
    }
    for i in range(100_000_000)
    for j in range(100_000_000)
]

rows_updated = bulky.update(
    session=Session,
    table_or_model=ManyToManyTable,
    values_series=data,
    returning=[
        ManyToManyTable.fk1,
        ManyToManyTable.fk2,
        ManyToManyTable.value,
    ],
    reference=[
        ManyToManyTable.fk1,
        ManyToManyTable.fk2,
    ],
)

updated_items = {(row.fk1, row.fk2): row.value for row in rows_updated}
Not sure if links are allowed, so I'll put them under a spoiler:
Readme and PyPI
I am having an issue figuring out how to start a query on the OpportunityFieldHistory from Salesforce.
The code I usually use for querying Opportunities or Leads works fine, but I do not know how it should be written for the field history.
When I want to query an Opportunity or Lead, I use the following:
oppty1 = sf.opportunity.get('00658000002vFo3')
lead1 = sf.lead.get('00658000002vFo3')
and then do the proper query code with the access codes...
The problem arises when I want to do the analysis on the OpportunityFieldHistory. I tried the following:
opptyhist = sf.opportunityfieldhistory.get('xxx')
Guess what, it does not work. Do you have any clue what I should write between sf. and .get?
Thanks in advance
Looking at the simple-salesforce API, it appears that the get method accepts an ID, which you are passing correctly. However, a quick search in the Salesforce API reference seems to indicate that the OpportunityFieldHistory may need to be obtained by another function such as get_by_custom_id(self, custom_id_field, custom_id). For reference, an OpportunityFieldHistory record looks like this:
(OpportunityFieldHistory){
    Id = None
    CreatedDate = 2012-08-27 12:00:03
    Field = "StageName"
    NewValue = "3.0 - Presentation & Demo"
    OldValue = "2.0 - Qualification & Discovery"
    OpportunityId = "0067000000RFCDkAAP"
},
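If the goal is to pull history rows rather than fetch a single record by ID, a SOQL query through simple-salesforce may be the more direct route. A sketch, assuming sf is an authenticated Salesforce instance and the opportunity ID is just the placeholder value shown above:
# query field history rows for one opportunity via SOQL
records = sf.query(
    "SELECT Id, OpportunityId, Field, OldValue, NewValue, CreatedDate "
    "FROM OpportunityFieldHistory "
    "WHERE OpportunityId = '0067000000RFCDkAAP'"
)
for row in records['records']:
    print(row['Field'], row['OldValue'], row['NewValue'])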
I'm having some issues working with a very large MongoDB collection (19 million documents).
When I simply iterate over the collection, as below, PyMongo seems to give up after 10,593,454 documents. The same thing happens even if I use skip(); the latter half of the collection seems programmatically inaccessible.
#!/usr/bin/env python
import pymongo

client = pymongo.MongoClient()
db = client['mydb']
classification_collection = db["my_classifications"]

print "Collection contains %s documents." % db.command("collstats", "my_classifications")["count"]

for ii, classification in enumerate(classification_collection.find(no_cursor_timeout=True)):
    print "%s: created at %s" % (ii, classification["created_at"])
print "Done."
The script reports initially:
Collection contains 19036976 documents.
Eventually the script completes, I get no errors, and I do get the "Done." message. But the last row printed is:
10593454: created at 2013-12-12 02:17:35
All the records logged in roughly the last two years, i.e. the most recent ones, seem to be inaccessible. Does anyone have any idea what is going on here? What can I do about this?
OK, well, thanks to this helpful article I found another way to page through the documents which doesn't seem to be subject to this "missing data"/"timeout" issue. Essentially, you have to use find() and limit() and rely on the natural _id ordering of your collection to retrieve the documents in pages. Here's my revised code:
#!/usr/bin/env python
import pymongo

client = pymongo.MongoClient()
db = client['mydb']
classification_collection = db["my_classifications"]

print "Collection contains %s documents." % db.command("collstats", "my_classifications")["count"]

# get first ID
pageSize = 100000
first_classification = classification_collection.find_one()
completed_page_rows = 1
last_id = first_classification["_id"]

# get the next page of documents (read-ahead programming style)
next_results = classification_collection.find({"_id": {"$gt": last_id}}, {"created_at": 1}, no_cursor_timeout=True).limit(pageSize)

# keep getting pages until there are no more
while next_results.count() > 0:
    for ii, classification in enumerate(next_results):
        completed_page_rows += 1
        if completed_page_rows % pageSize == 0:
            print "%s (id = %s): created at %s" % (completed_page_rows, classification["_id"], classification["created_at"])
        last_id = classification["_id"]
    next_results = classification_collection.find({"_id": {"$gt": last_id}}, {"created_at": 1}, no_cursor_timeout=True).limit(pageSize)

print "\nDone.\n"
I hope that writing up this solution will help others who hit this issue.
Note: this updated listing also takes on board the suggestions of @Takarii and @adam-comerford in the comments: I now retrieve only the fields I need (_id comes back by default), and I also print out the IDs for reference.
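For anyone who wants to reuse this pattern, here is a sketch of the same _id-based paging wrapped in a generator; it assumes the classification_collection handle from above and adds an explicit sort on _id rather than relying on natural order:
def iter_collection_in_pages(collection, page_size=100000):
    # yields documents in _id order, one page of page_size at a time
    last_id = None
    while True:
        query = {"_id": {"$gt": last_id}} if last_id is not None else {}
        page = list(collection.find(query, {"created_at": 1}).sort("_id", 1).limit(page_size))
        if not page:
            return
        for doc in page:
            yield doc
        last_id = page[-1]["_id"]

for ii, classification in enumerate(iter_collection_in_pages(classification_collection)):
    if ii % 100000 == 0:
        print "%s (id = %s): created at %s" % (ii, classification["_id"], classification["created_at"])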