MongoDB + K-means clustering - Python

I'm using MongoDB as my datastore and wish to store a "clustered" configuration of my documents in a separate collection.
So in one collection, I'd have my original set of objects, and in my second, it'd have
kMeansCollection: {
    1: [mongoObjectCopy1, mongoObjectCopy2, ...],
    2: [mongoObjectCopy3, mongoObjectCopy4, ...]
}
I'm following the implementation of K-means for text clustering here, http://tech.swamps.io/recipe-text-clustering-using-nltk-and-scikit-learn/, but I'm having a hard time working out how to tie the outputs back into MongoDB.
An example (taken from the link):
if __name__ == "__main__":
    tags = collection.find({}, {'tag_data': 1, '_id': 0})
    clusters = cluster_texts(tags, 5)  # algo runs here with 5 clusters
    pprint(dict(clusters))
The variable tags is the required input for the algorithm to run.
It must be in the form of an array, but the query currently returns an array of objects (so I must first extract the text values from the query results).
However, after magically clustering my collection 5 ways, how can I reunite them with their respective object entry from mongo?
I am only feeding specific text content from one property of the object.
Thanks a lot!

You would need to have some identifier for the documents. It is probably a good idea to include the _id field in your query so that you do have a unique document identifier. Then you can create parallel lists of ids and tag_data.
docs = collection.find({}, {'tag_data': 1, '_id': 1})
ids = [doc['_id'] for doc in docs]
tags = [doc['tag_data'] for doc in docs]
Then call the cluster function on the tag data.
clusters = cluster_text(tags)
And zip the results back with the ids.
doc_clusters = zip(ids, clusters)
From here you have built tuples of (_id, cluster) so you can update the cluster labels on your mongo documents.
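For example, a minimal sketch of that update step (assuming pymongo's bulk_write, that clusters holds one label per document in the same order as ids, and a hypothetical cluster field to store the label in):

from pymongo import UpdateOne

# One $set per document, sent to the server in a single round trip.
requests = [
    UpdateOne({'_id': doc_id}, {'$set': {'cluster': label}})
    for doc_id, label in zip(ids, clusters)
]
if requests:
    collection.bulk_write(requests)

With the label stored on each document, a whole cluster can later be fetched with collection.find({'cluster': some_label}) instead of keeping a separate copy of every document.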

The efficient way to do this is to use the aggregation framework to build the lists of _id and tag_data with a server-side operation. This also reduces both the amount of data sent over the wire and the time and memory used to decode documents on the client side.
You need to $group your documents and use the $push accumulator operator to return the list of _id values and the list of tag_data values. The aggregate() method gives access to the aggregation pipeline.
cursor = collection.aggregate([{
    '$group': {
        '_id': None,
        'ids': {'$push': '$_id'},
        'tags': {'$push': '$tag_data'}
    }
}])
You then retrieve your data using the .next() method on the CommandCursor; because we group by None, the cursor holds a single element.
data = cursor.next()
After that, simply call your function and zip the result.
clusters = cluster_text(data['tags'])
doc_clusters = zip(data['ids'], clusters)

Related

Iterate and append faster through millions of dictionaries in python

I'm writing a script to handle millions of dictionaries coming from many files of 1 million lines each.
The main purpose of this script is to create a JSON file and send it to Elasticsearch as a bulk edit.
What I'm trying to do is read lines from the "entity" files and, for each entity, find its matching addresses in the "sub" files (for sub-entities). My problem is that the function which should associate them takes way too much time for a single iteration... and even after trying to optimize it as much as possible, the association is still insanely slow.
So, just to be clear about the data structure:
Entities are objects like Persons: an id, a unique id, a name, a list of postal addresses, and a list of email addresses:
{id: 0, uniqueId: 'Z5ER1ZE5', name: 'John DOE', postalList: [], emailList: []}
Sub-entities are objects referenced as different types of addresses (email, postal, etc.): {personUniqueId: 'Z5ER1ZE5', 'Email': 'john.doe#gmail.com'}
So, I read the file contents with Pandas using pd.read_csv(filename).
To optimize as much as possible I've decided to handle every iteration over the data using multiprocessing (working fine, even if I didn't handle RAM usage...):
## Using a manager to be able to pass my main object and update it through processes
manager = multiprocessing.Manager()
main_obj = manager.dict({
    'dataframes': manager.dict(),
    'dicts': manager.dict(),
    'json': manager.dict()
})

## Just an example of how I use multiprocessing
pool = multiprocessing.Pool()
result = pool.map(partial(func, obj=main_obj), data_list_to_iterate)
pool.close()
pool.join()
And I have some referentials: identifiers is a dict with entity names as keys and their unique-id field name as values, and sub_associations is a dict with sub-entity names as keys and their related collection as values:
sub_associations = {
    'Persons_emlAddr': 'emailList',
    'Persons_pstlAddr': 'postalList'
}
identifiers = {
    'Animals': 'uniqueAnimalId',
    'Persons': 'uniquePersonId',
    'Persons_emlAddr': 'uniquePersonId',
    'Persons_pstlAddr': 'uniquePersonId'
}
That said, I'm now hitting a big issue in the function that fetches the sub-entities for each entity:
for key in list(main_obj['dicts'].keys()):
    main_obj['json'][key] = ''
    with mp.Pool() as stringify_pool:
        res_stringify = stringify_pool.map(partial(convert_to_json, obj=main_obj, name=key), main_obj['dicts'][key]['records'])
        stringify_pool.close()
        stringify_pool.join()
Here is where I call the problematic function. I feed it with the keys of main_obj['dicts'], where each key is just an entity filename (Persons, Animals, ...), and main_obj['dicts'][key] is a dict of the form {name: 'Persons', records: []}, where records is the list of entity dicts I need to iterate on.
def convert_to_json(item, obj, name):
    global sub_associations
    global identifiers
    dump = ''
    subs = [val for val in sub_associations.keys() if val.startswith(name)]
    if subs:
        for sub in subs:
            df = obj['dataframes'][sub]
            id_name = identifiers[name]
            sub_items = df[df[id_name] == item[id_name]].to_dict('records')
            if sub_items:
                item[sub_associations[sub]] = sub_items
            else:
                item[sub_associations[sub]] = []
    index = {
        "index": {
            "_index": name,
            "_id": item[identifiers[name]]
        }
    }
    dump += f'{json.dumps(index)}\n'
    dump += f'{json.dumps(item)}\n'
    obj['json'][name] += dump
    return 'Done'
Does anyone have an idea of what the main issue could be, and how I could change the code to make it faster?
If you need any additional information, or if I haven't been clear on some things, feel free to ask.
Thank you in advance! :)
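As a side note on where the time likely goes: the line df[df[id_name] == item[id_name]] rescans the whole sub-entity DataFrame for every single record. A minimal, hypothetical sketch of pre-grouping those rows once by identifier (grouped_subs is not part of the original code, and it assumes the lookup is built before the pools start) would be:

# Build one lookup per sub-entity DataFrame, once, before the workers run.
# grouped_subs[sub] maps each unique id to its list of sub-entity dicts.
grouped_subs = {}
for sub, df in main_obj['dataframes'].items():
    grouped_subs[sub] = {
        key: group.to_dict('records')
        for key, group in df.groupby(identifiers[sub])
    }

# Inside convert_to_json the per-item lookup then becomes a plain dict access:
# sub_items = grouped_subs[sub].get(item[id_name], [])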

Pymongo updating documents using list of dictionaries [duplicate]

First, some background.
I have a function in Python which consults an external API to retrieve some information associated with an ID. The function takes an ID as its argument and returns a list of numbers (they correspond to some metadata associated with that ID).
For example, let us pass the IDs {0001, 0002, 0003} to the function. Let's say that the function returns the following arrays for each ID:
0001 → [45,70,20]
0002 → [20,10,30,45]
0003 → [10,45]
My goal is to implement a collection which structures data as so:
{
    "_id": 45,
    "list": [0001, 0002, 0003]
},
{
    "_id": 70,
    "list": [0001]
},
{
    "_id": 20,
    "list": [0001, 0002]
},
{
    "_id": 10,
    "list": [0002, 0003]
},
{
    "_id": 30,
    "list": [0002]
}
As can be seen, I want my collection to index the information by the metadata itself. With this structure, the document with _id 45 contains a list of all the IDs that have metadata 45 associated with them. This way I can retrieve, with a single request to the collection, all the IDs mapped to a particular metadata value.
The class method in charge of inserting IDs and metadata in the collection is the following:
def add_entries(self, id, metadataVector):
    start = time.time()
    id = int(id)
    for data in metadataVector:
        self.SegmentDB.update_one(
            filter={"_id": data},
            update={"$addToSet": {"list": id}},
            upsert=True
        )
    end = time.time()
    duration = end - start
    return duration
metadataVector is the list which contains all the metadata (integers) associated with a given ID (e.g. [45,70,20]).
id is the ID associated with the metadata in metadataVector (e.g. 0001).
This method currently iterates through the list and performs an operation for every element (every metadata value) in the list. It implements the collection I want: it updates the document whose _id is a given metadata value and adds to its list the ID from which that metadata originated (if such a document doesn't exist yet, it inserts it; that is what upsert=True is for).
However, this implementation ends up being somewhat slow in the long run. metadataVector usually has around 1000-3000 items for each ID (metadata integers which can range from 800 to 23000000), and I have around 40000 IDs to analyze, so the collection grows quickly. At the moment, I have around 3.2M documents in the collection (one dedicated to each individual metadata integer). I would like to implement a faster solution; if possible, I would like to insert all the metadata in a single DB request instead of calling an update for each item in metadataVector individually.
I tried this approach but it doesn't seem to work as I intended:
def add_entries(self, id, metadataVector):
    start = time.time()
    id = int(id)
    self.SegmentDB.update_many(
        filter={"_id": {"$in": metadataVector}},
        update={"$addToSet": {"list": id}},
        upsert=True
    )
    end = time.time()
    duration = end - start
    return duration
I tried using update_many (as it seemed the natural approach to the problem), specifying a filter which, to my understanding, states "any document whose _id is in metadataVector". That way, every matching document would add the originating ID to its list (or the document would be created if it didn't exist, thanks to the upsert option). Instead, the collection ends up being filled with documents containing a single element in their list and an ObjectId() _id.
Picture showing the final result.
Is there a way to implement what I want? Should I restructure the DB differently all together?
Thanks a lot in advance!
Here is an example, and it uses Bulk Write operations. Bulk operations submit multiple inserts, updates, and deletes (they can be a combination) as a single call to the database and return a single result. This is more efficient than making many individual calls to the database.
Scenario 1:
Input: 3 -> [10, 45]
def some_fn(id):
    # id = 3; and after some process... returns a dictionary
    return { 10: [ 3 ], 45: [ 3 ] }
Scenario 2:
Input (as a list):
3 -> [10, 45]
1 -> [45, 70, 20]
def some_fn(ids):
    # ids are 1 and 3; and after some process... returns a dictionary
    return { 10: [ 3 ], 45: [ 3, 1 ], 20: [ 1 ], 70: [ 1 ] }
Perform Bulk Write
Now, perform the bulk operation on the database using the returned value from some_fn.
from pymongo import UpdateOne

data = some_fn(id)  # or some_fn(ids)

requests = []
for k, v in data.items():
    op = UpdateOne({ '_id': k }, { '$push': { 'list': { '$each': v }}}, upsert=True)
    requests.append(op)

result = db.collection.bulk_write(requests, ordered=False)
Note the ordered=False - this option is used, again, for better performance, as the writes can happen in parallel.
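Applied to the add_entries method from the question, a hedged sketch (same SegmentDB collection, keeping the question's $addToSet semantics) could look like this:

from pymongo import UpdateOne

def add_entries(self, id, metadataVector):
    id = int(id)
    # One UpdateOne per metadata value, but sent to the server in a single bulk call.
    requests = [
        UpdateOne({'_id': data}, {'$addToSet': {'list': id}}, upsert=True)
        for data in metadataVector
    ]
    if requests:
        self.SegmentDB.bulk_write(requests, ordered=False)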
References:
collection.bulk_write

How to query with analyzers and stop-words in elastic search

So what I need to do is pass some information from XML files into Elasticsearch and then search those files with tf-idf weights applied to them. I also need to output the top 20 best results. I want to do this with Python.
So far I have been able to pass the XML data and create an index successfully through Python, by creating arrays and then indexing them in a JSON-like format. I am aware that this means most other options available through Elasticsearch get a default value when indexing, but I was unable to find another way to do it. What remains, since all the data has been passed into the index, is to search it. I am given 10 documents that contain a title and a small summary of the text, and I need to return the top 20 results with tf-idf through Elasticsearch. This is how I gather the 10 text queries that need to be searched for in my index, and how I try to search for them:
queries = []
with open("testingQueries.txt") as file:
    queries = [i.strip() for i in file]

for query_text in queries:
    query = {
        'query': {
            'more_like_this': {
                'fields': ['document.text'],
                'like': query_text
            }
        }
    }
    results = es.search(index=INDEX_NAME, body=query)
    print(str(results) + "\n")
As you can see, I haven't added an analyzer to this query, and I have no idea how to add tf-idf weights when searching for these queries in my data. I've been searching for an answer everywhere, but most answers are either not Python-related or don't really solve my problem. The search results I am getting are also not giving me the top 20 results... in fact they aren't giving me any results at all. The output looks like this: {'took': 14, 'timed_out': False, '_shards': {'total': 5, 'successful': 5, 'skipped': 0, 'failed': 0}, 'hits': {'total': 0, 'max_score': None, 'hits': []}}
When I try to do the same with 'match' instead of 'more_like_this' I get a lot more hits, but I would still need tf-idf scores and a result of the top 20 documents that are similar to my queries.
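One small piece of the puzzle: the number of hits returned is controlled by the size field of the query body (Elasticsearch defaults to 10), and hits are already sorted by the relevance score of the configured similarity (BM25 in recent versions, the classic TF/IDF similarity in older ones). A minimal sketch of asking for the top 20 (it does not, by itself, explain the empty more_like_this results):

query = {
    'size': 20,  # return the 20 highest-scoring hits instead of the default 10
    'query': {
        'more_like_this': {
            'fields': ['document.text'],
            'like': query_text
        }
    }
}
results = es.search(index=INDEX_NAME, body=query)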

Dynamodb: query using more than two attributes

In Dynamodb you need to specify in an index the attributes that can be used for making queries.
How can I make a query using more than two attributes?
Example using boto.
Table.create('users',
    schema=[
        HashKey('id')  # defaults to STRING data_type
    ], throughput={
        'read': 5,
        'write': 15,
    }, global_indexes=[
        GlobalAllIndex('FirstnameTimeIndex', parts=[
            HashKey('first_name'),
            RangeKey('creation_date', data_type=NUMBER),
        ],
        throughput={
            'read': 1,
            'write': 1,
        }),
        GlobalAllIndex('LastnameTimeIndex', parts=[
            HashKey('last_name'),
            RangeKey('creation_date', data_type=NUMBER),
        ],
        throughput={
            'read': 1,
            'write': 1,
        })
    ],
    connection=conn)
How can I look for users with first name 'John', last name 'Doe', and created on '3-21-2015' using boto?
Your data modeling process has to take your data retrieval requirements into consideration; in DynamoDB you can only query by hash key or by hash + range key.
If querying by primary key is not enough for your requirements, you can certainly have alternate keys by creating secondary indexes (Local or Global).
However, the concatenation of multiple attributes can be used in certain scenarios as your primary key to avoid the cost of maintaining secondary indexes.
If you need to get users by First Name, Last Name and Creation Date, I would suggest including those attributes in the Hash and Range Keys, so that additional indexes are not needed.
The Hash Key should contain a value that can be computed by your application and at the same time provides uniform data access. For example, say that you choose to define your keys as follows:
Hash Key (name): first_name#last_name
Range Key (created) : MM-DD-YYYY-HH-mm-SS-milliseconds
You can always append additional attributes in case the ones mentioned are not enough to make your key unique across the table.
users = Table.create('users', schema=[
    HashKey('name'),
    RangeKey('created'),
], throughput={
    'read': 5,
    'write': 15,
})
Adding the user to the table:
with users.batch_write() as batch:
    batch.put_item(data={
        'name': 'John#Doe',
        'first_name': 'John',
        'last_name': 'Doe',
        'created': '03-21-2015-03-03-02-3243',
    })
Your code to find the user John Doe, created on '03-21-2015' should be something like:
name_john_doe = users.query_2(
    name__eq='John#Doe',
    created__beginswith='03-21-2015'
)

for user in name_john_doe:
    print user['first_name']
Important Considerations:
i. If your queries start to get too complicated and the Hash or Range Key too long from having too many concatenated fields, then definitely use Secondary Indexes. That's a good sign that a primary index alone is not enough for your requirements.
ii. I mentioned that the Hash Key should provide uniform data access:
"Dynamo uses consistent hashing to partition its key space across its
replicas and to ensure uniform load distribution. A uniform key
distribution can help us achieve uniform load distribution assuming
the access distribution of keys is not highly skewed." [DYN]
Not only does the Hash Key uniquely identify the record, it is also the mechanism that ensures load distribution. The Range Key (when used) helps to indicate the records that will mostly be retrieved together, so the storage can also be optimized for that need.
The link below has a complete explanation about the topic:
http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/GuidelinesForTables.html#GuidelinesForTables.UniformWorkload

Code to extract text fields from a mongodb collection and append it to a list in python using pymongo

I have created an if statement to cycle through a mongodb collection of json objects and extract the text field from each and append it to a list. Here is the code below.
appleSentimentText = []
for record in db.Apple.find():
    if record.get('text'):
        appleSentimentText.append(record.get("text"))
This works grand, but I have 20 collections to do this to, and I fear the code may start to get a little messy and unmanageable with another 19 variations of it. I have started to write a piece of code that may accomplish this. First, I created an array with the names of the 20 collections, shown below.
filterKeywords = ['IBM', 'Microsoft', 'Facebook', 'Yahoo', 'Apple', 'Google', 'Amazon', 'EBay', 'Diageo',
                  'General Motors', 'General Electric', 'Telefonica', 'Rolls Royce', 'Walmart', 'HSBC', 'BP',
                  'Investec', 'WWE', 'Time Warner', 'Santander Group']
I then use this array in a loop to cycle through each collection:
for word in filterKeywords:
    for record in db[word].find():
        if record.get('text'):
I now want it to create a list variable based on the collection name (i.e. AppleSentimentText if the collection is Apple, FacebookSentimentText if it is the Facebook collection, etc.), though I'm unsure of what to do next. Any help is welcome.
You may use $exists and limit the returned field to "text" so the query doesn't need to go through all the records; in pymongo it should be something like this:
Edited:
As #BarnieHackett pointed out, you can filter out the _id as well.
for word in filterKeywords:
    for r in db[word].find({'text': {'$exists': True}}, {'text': 1, '_id': False}):
        appleSentimentText.append(r['text'])
The key is to use $exists and then limit the return field to 'text', unfortunately since pymongo returns the cursor which includes the '_id' & 'text' field, you need to filter this out.
Hope this helps.
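To get one list per collection rather than reusing appleSentimentText for everything, a small sketch using a dict keyed by collection name (sentimentTexts is a hypothetical name, not from the original code) could be:

sentimentTexts = {}
for word in filterKeywords:
    # One list of 'text' values per collection, built from the same filtered query.
    sentimentTexts[word] = [
        r['text']
        for r in db[word].find({'text': {'$exists': True}}, {'text': 1, '_id': False})
    ]

# e.g. sentimentTexts['Apple'] now holds what appleSentimentText held before.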
