I am running topic modeling using Gensim. Before creating the document-term matrix, one needs to create a dictionary of tokens.
dictionary = corpora.Dictionary(tokenized_reviews)
doc_term_matrix = [dictionary.doc2bow(rev) for rev in tokenized_reviews]
But, I don't understand what kind of object "dictionary" is.
So, when I type:
type(dictionary)
I get
gensim.corpora.dictionary.Dictionary
Is this a dictionary (the data structure)? If so, why can't I see its contents (I am just curious)?
When I type
dictionary
I get:
<gensim.corpora.dictionary.Dictionary at 0x1bac985ebe0>
The same issue exists with some of the objects in NLTK.
If this is a dictionary (as a data structure), why am I not able to see the keys and values as with any other Python dictionary?
Thanks,
Navid
This is a specific Dictionary class implemented by the Gensim project.
It will be very similar in interface to the standard Python dict (and other various Dictionary/HashMap/etc types you may have used elsewhere).
However, to see exactly what it can do, you should consult the class-specific documentation:
https://radimrehurek.com/gensim/corpora/dictionary.html
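Like a dict, it supports the typical operations. Note that its keys are integer token ids and its values are the tokens (words):
len(dictionary) # gets the number of entries
dictionary[token_id] # gets the token (word) stored under an integer id
dictionary.keys() # gets all stored ids
dictionary.token2id # the reverse mapping: token (word) -> integer id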
The reason you see a generic <gensim.corpora.dictionary.Dictionary at 0x1bac985ebe0> when you try to display the value of the dictionary itself is that it hasn't defined any convenience display string with more info, so you're seeing the default for any Python object. (Such dictionaries are usually far too large to usefully dump their full contents whenever asked, generically, to "show yourself".)
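To make the point about display strings concrete, here is a minimal sketch with two made-up classes (PlainVocab and DescribedVocab are hypothetical and not part of Gensim). The first falls back to Python's default representation; the second defines __repr__ to print a short summary rather than its full contents:

```python
class PlainVocab:
    """Hypothetical container that does not define __repr__."""

    def __init__(self, tokens):
        # Map each token to an integer id, in order of appearance.
        self.token2id = {tok: i for i, tok in enumerate(tokens)}


class DescribedVocab(PlainVocab):
    """Same container, plus a short summary as its display string."""

    def __repr__(self):
        preview = list(self.token2id)[:3]
        return f"DescribedVocab({len(self.token2id)} tokens, e.g. {preview})"


plain = PlainVocab(["good", "bad", "ugly", "fine"])
described = DescribedVocab(["good", "bad", "ugly", "fine"])

print(repr(plain))      # default form: <... PlainVocab object at 0x...>
print(repr(described))  # summary produced by the custom __repr__
```

The summary deliberately shows only a preview, which is the usual compromise for containers that can hold very many entries.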
I have a large list of objects in Python that I'm storing in a text file (for lack of knowledge of how to use any other database for the present).
Currently there are 40,000 but I expect the list length eventually may exceed 1,000,000. I'm trying to remove duplicates, where duplicates are defined as different objects having the same value for a text string attribute, but keep the most recent version of that object (defined as having the highest value in another attribute).
What I want to make is a function that returns only objects 2 and 3 from the following list, reliably:
Object 1: text="hello" ID=1
Object 2: text="hello" ID=2
Object 3: text="something else" ID=3
Doing this manually (looping through the whole list for each object) is already too slow and, being O(n^2), will only get slower, so I need a smarter way to do it. I have seen hashing the objects and using the set function recommended multiple times, but I have two questions about this that I haven't found satisfactory answers to:
How does hashing improve the efficiency to the degree it does?
How can I do this and retain only the most recent such object? The examples I have seen all use the set function and I'm not sure how that would return only the most recent one.
EDIT: I can probably find good answers to question 1 elsewhere, but I am still stuck on question 2. To take another stab at explaining it, hashing the objects above on their text and using the set function will return a set where the objects chosen from duplicates are randomly chosen from each group of duplicates (e.g., above, either a set of (Object 2, Object 3) or (Object 1, Object 3) could be returned; I need (Object 2, Object 3)).
Change to using a database:
import sqlite3
db = sqlite3.connect("my.db")
db.execute("CREATE TABLE IF NOT EXISTS my_items (text PRIMARY KEY, id INTEGER);")
my_list_of_items = [("test",1),("test",2),("asdasd",3)]
db.executemany("INSERT OR REPLACE INTO my_items (text,id) VALUES (?,?)",my_list_of_items)
db.commit()
print(db.execute("SELECT * FROM my_items").fetchall())
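This may have marginally higher overhead in terms of time, but you will save on RAM. Note that INSERT OR REPLACE keeps whichever row was inserted last for a given text, so insert the items in ascending id order if the highest id should win.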
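If the insertion order cannot be guaranteed, one alternative is to store every row and let SQL pick the highest id per text. The following is only a sketch under assumptions: the helper name newest_per_text is made up, the table has no primary key so duplicates are allowed, and an in-memory database stands in for my.db:

```python
import sqlite3


def newest_per_text(items):
    """Return one (text, id) row per distinct text, keeping the highest id."""
    db = sqlite3.connect(":memory:")
    # No PRIMARY KEY here, so duplicate texts can be stored side by side.
    db.execute("CREATE TABLE my_items (text TEXT, id INTEGER)")
    db.executemany("INSERT INTO my_items (text, id) VALUES (?, ?)", items)
    rows = db.execute(
        "SELECT text, MAX(id) FROM my_items GROUP BY text ORDER BY text"
    ).fetchall()
    db.close()
    return rows


items = [("hello", 2), ("hello", 1), ("something else", 3)]
print(newest_per_text(items))
```

Here the result does not depend on the order in which the rows arrive, at the cost of storing the duplicates until they are queried.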
You could use a dict with the text as key and the newest object for each key as value.
Setting up some demo data:
>>> from collections import namedtuple
>>> Object = namedtuple('Object', 'text ID')
>>> objects = Object('foo', 1), Object('foo', 2), Object('bar', 4), Object('bar', 3)
Solution:
>>> unique = {}
>>> for obj in objects:
...     if obj.text not in unique or obj.ID > unique[obj.text].ID:
...         unique[obj.text] = obj
...
>>> list(unique.values())
[Object(text='foo', ID=2), Object(text='bar', ID=4)]
Hashing is a well-researched subject in Computer Science. One of the standard uses is for implementing what Python calls a dictionary. (Perl calls the same thing a hash, for some reason. ;-) )
The idea is that for some key, such as a string, you can compute a simple numeric function - the hash value - and use that number as a quick way to look up the associated value stored in the dictionary.
Python has the built-in function hash() that returns the standard computation of this value. It also supports the __hash__() method, for objects that wish to compute their own hash value.
In a "normal" scenario, one way to determine if you have seen a field value before would be to use the field value as a key in a dictionary. For example, you might store a dictionary that maps the field in question to the entire record, or to a list of records that all share the same field value.
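As a rough illustration of __hash__() working together with __eq__(), the sketch below uses a made-up Record class (not from the question) whose identity for set and dict purposes is its text alone:

```python
class Record:
    """Hypothetical record: two records count as equal when their text matches."""

    def __init__(self, text, ID):
        self.text = text
        self.ID = ID

    def __eq__(self, other):
        return isinstance(other, Record) and self.text == other.text

    def __hash__(self):
        # Must agree with __eq__: equal objects need equal hashes.
        return hash(self.text)


a = Record("hello", 1)
b = Record("hello", 2)
c = Record("something else", 3)

same_hash = hash(a) == hash(b)   # equal text, therefore equal hash
unique_count = len({a, b, c})    # the set treats a and b as one element
print(same_hash, unique_count)
```

A set built this way removes duplicates in roughly one pass because each membership check is a hash lookup rather than a scan, but it gives no say over which duplicate survives, so it does not by itself keep the newest record.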
In your case, your data is too big (according to you), so that would be a bad idea. Instead, you might try something like this:
seen_before = {} # Empty dictionary to start with.
while ... something :
    info = read_next_record() # You figure this out.
    fld = info.fields[some_field] # The value you care about
    hv = hash(fld) # Compute hash value for field.
    # Note: two different values can share a hash value (a collision),
    # so this check can occasionally report a false duplicate.
    if hv in seen_before:
        print("This field value has been seen before")
    else:
        seen_before[hv] = True # Never seen ... until NOW!
I have a list of dictionaries that is encoded:
[u"{'name':'Tom', 'uid':'asdlfkj223'}", u"{'name':'Jerry', 'uid':'alksd32'}", ...]
Is there any way I can create a list of just the values of the key name?
Even better if someone knows Django ORM well enough to pull down a list of a data/column with properties from a PostgreSQL database.
Thanks!
To get only that value for the name column from the DB table, use:
names = Person.objects.values_list('name', flat=True)
(as per https://docs.djangoproject.com/en/dev/ref/models/querysets/#django.db.models.query.QuerySet.values_list)
otherwise, given
people = [{'name':'Tom', 'uid':'asdlfkj223'}, {'name':'Jerry', 'uid':'alksd32'},]
this should do the job:
names = [person['name'] for person in people]
And you should find out why your data items are strings (containing a string representation of a dict) to start with; it doesn't look like the way it's supposed to be.
Or, if you're actually storing dicts in your database as strings, either prefer JSON over the Python string representation, or, if you must use the current format, the ast.literal_eval solution provided in another answer here should do the job.
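For comparison, this is roughly what the JSON route looks like with the standard json module. It is a sketch only: the variable names are illustrative and the storage step is reduced to a list of strings:

```python
import json

people = [
    {"name": "Tom", "uid": "asdlfkj223"},
    {"name": "Jerry", "uid": "alksd32"},
]

# What would be written to the database: one JSON string per record.
stored = [json.dumps(person) for person in people]

# Reading it back: parse each string, then pull out the field of interest.
loaded = [json.loads(text) for text in stored]
names = [person["name"] for person in loaded]
print(names)
```

Unlike the Python repr format, JSON strings can be read by other languages and tools.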
You can use ast.literal_eval:
>>> data = [u"{'name':'Tom', 'uid':'asdlfkj223'}",u"{'name':'Jerry', 'uid':'alksd32'}"]
>>> import ast
>>> [ast.literal_eval(d)['name'] for d in data]
['Tom', 'Jerry']
I'm doing some data munging which would be quite a bit simpler if I could stick a bunch of dictionaries in an in-memory database, then run simple queries against it.
For example, something like:
people = db([
{"name": "Joe", "age": 16},
{"name": "Jane", "favourite_color": "red"},
])
over_16 = db.filter(age__gt=16)
with_favorite_colors = db.filter(favourite_color__exists=True)
There are three confounding factors, though:
Some of the values will be Python objects, and serializing them is out of the question (too slow, breaks identity). Of course, I could work around this (eg, by storing all the items in a big list, then serializing their indexes in that list… But that could take a fair bit of fiddling).
There will be thousands of records, and I will be running lookup-heavy operations (like graph traversals) against them, so it must be possible to perform efficient (i.e., indexed) queries.
As in the example, the data is unstructured, so systems which require me to predefine a schema would be tricky.
So, does such a thing exist? Or will I need to kludge something together?
What about using an in-memory SQLite database via the sqlite3 standard library module, using the special value :memory: for the connection? If you don't want to write your own SQL statements, you can always use an ORM, like SQLAlchemy, to access an in-memory SQLite database.
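A minimal sketch of the :memory: approach follows. It assumes a fixed set of columns chosen up front (which the question would prefer to avoid) and plain values only; keys missing from a dictionary are stored as NULL:

```python
import sqlite3

people = [
    {"name": "Joe", "age": 16},
    {"name": "Jane", "favourite_color": "red"},
]
columns = ("name", "age", "favourite_color")  # assumed schema, for illustration

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE people (name TEXT, age INTEGER, favourite_color TEXT)")
db.executemany(
    "INSERT INTO people (name, age, favourite_color) VALUES (?, ?, ?)",
    # dict.get returns None for a missing key, which SQLite stores as NULL.
    [tuple(person.get(col) for col in columns) for person in people],
)
db.execute("CREATE INDEX idx_people_age ON people (age)")  # indexed lookups

at_least_16 = db.execute(
    "SELECT name FROM people WHERE age >= 16"
).fetchall()
with_colour = db.execute(
    "SELECT name FROM people WHERE favourite_color IS NOT NULL"
).fetchall()
print(at_least_16, with_colour)
db.close()
```

The existence query maps naturally onto IS NOT NULL, and the index on age is what keeps range lookups fast as the table grows.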
EDIT: I noticed you stated that the values may be Python objects, and also that you require avoiding serialization. Requiring arbitrary Python objects be stored in a database also necessitates serialization.
Can I propose a practical solution if you must keep those two requirements? Why not just use Python dictionaries as indices into your collection of Python dictionaries? It sounds like you will have idiosyncratic needs for building each of your indices; figure out what values you're going to query on, then write a function to generate an index for each. The possible values for one key in your list of dicts will be the keys of an index; the values of the index will be lists of dictionaries. Query the index by giving the value you're looking for as the key.
import collections
import itertools

def make_indices(dicts):
    color_index = collections.defaultdict(list)
    age_index = collections.defaultdict(list)
    for d in dicts:
        if 'favorite_color' in d:
            color_index[d['favorite_color']].append(d)
        if 'age' in d:
            age_index[d['age']].append(d)
    return color_index, age_index

def make_data_dicts():
    ...

data_dicts = make_data_dicts()
color_index, age_index = make_indices(data_dicts)
# Query for those with a favorite color: simply all values of the index
with_color_dicts = list(
    itertools.chain.from_iterable(color_index.values()))
# Query for people over 16: the index keys are the ages
over_16 = list(
    itertools.chain.from_iterable(
        v for k, v in age_index.items() if k > 16)
)
If the in-memory database solution ends up being too much work, here is a method for filtering the data yourself that you may find useful.
The get_filter function takes arguments that define how you want to filter a dictionary, and returns a function that can be passed to the built-in filter function to filter a list of dictionaries.
import operator
import re

def get_filter(key, op=None, comp=None, inverse=False):
    # This will invert the boolean returned by the function 'op' if 'inverse == True'
    result = lambda x: not x if inverse else x
    if op is None:
        # Without any function, just see if the key is in the dictionary
        return lambda d: result(key in d)
    if comp is None:
        # If 'comp' is None, assume the function takes one argument
        return lambda d: result(op(d[key])) if key in d else False
    # Use 'comp' as the second argument to the function provided
    return lambda d: result(op(d[key], comp)) if key in d else False

people = [{'age': 16, 'name': 'Joe'}, {'name': 'Jane', 'favourite_color': 'red'}]
# In Python 3, filter() returns an iterator, so wrap it in list() to print it
print(list(filter(get_filter("age", operator.gt, 15), people)))
# [{'age': 16, 'name': 'Joe'}]
print(list(filter(get_filter("name", operator.eq, "Jane"), people)))
# [{'name': 'Jane', 'favourite_color': 'red'}]
print(list(filter(get_filter("favourite_color", inverse=True), people)))
# [{'age': 16, 'name': 'Joe'}]
This is pretty easily extensible to more complex filtering, for example to filter based on whether or not a value is matched by a regex:
p = re.compile("[aeiou]{2}") # matches two lowercase vowels in a row
print(list(filter(get_filter("name", p.search), people)))
# [{'age': 16, 'name': 'Joe'}]
The only solution I know is a package I stumbled across a few years ago on PyPI, PyDbLite. It's okay, but there are a few issues:
It still wants to serialize everything to disk, as a pickle file. But that was simple enough for me to rip out. (It's also unnecessary. If the objects inserted are serializable, so is the collection as a whole.)
The basic record type is a dictionary, into which it inserts its own metadata, two ints under keys __id__ and __version__.
The indexing is very simple, based only on the values in the record dictionary. If you want something more complicated, such as indexing on an attribute of an object in the record, you'll have to code it yourself. (Something I've meant to do myself, but never got around to.)
The author does seem to be working on it occasionally. There are some new features since I last used it, including some nice syntax for complex queries.
Assuming you rip out the pickling (and I can tell you what I did), your example would be (untested code):
from PyDbLite import Base
db = Base()
db.create("name", "age", "favourite_color")
# You can insert records as either named parameters
# or in the order of the fields
db.insert(name="Joe", age=16, favourite_color=None)
db.insert("Jane", None, "red")
# These should return an object you can iterate over
# to get the matching records. These are unindexed queries.
#
# The first might throw because of the None in the second record
over_16 = db("age") > 16
with_favourite_colors = db("favourite_color") != None
# Or you can make an index for faster queries
db.create_index("favourite_color")
with_favourite_color_red = db._favourite_color["red"]
Hopefully it will be enough to get you started.
As far as "identity" goes, you should be able to compare anything that is hashable, which lets you keep track of object identity.
Zope Object Database (ZODB):
http://www.zodb.org/
PyTables works well:
http://www.pytables.org/moin
Also Metakit for Python works well:
http://equi4.com/metakit/python.html
Metakit supports columns and sub-columns, but not unstructured data.
Research "stream processing"; if your data sets are extremely large, this may be useful:
http://www.trinhhaianh.com/stream.py/
Any in-memory database that can be serialized (written to disk) is going to have your identity problem. I would suggest representing the data you want to store as native types (list, dict) instead of objects if at all possible.
Keep in mind NumPy was designed to perform complex operations on in-memory data structures, and could possibly be a part of your solution if you decide to roll your own.
I wrote a simple module called Jsonstore that solves (2) and (3). Here's how your example would go:
from jsonstore import EntryManager
from jsonstore.operators import GreaterThan, Exists
db = EntryManager(':memory:')
db.create(name='Joe', age=16)
db.create({'name': 'Jane', 'favourite_color': 'red'}) # alternative syntax
db.search({'age': GreaterThan(16)})
db.search(favourite_color=Exists()) # again, 2 different syntaxes
Not sure if it complies with all your requirements, but TinyDB (using in-memory storage) is also probably worth a try:
>>> from tinydb import TinyDB, Query
>>> from tinydb.storages import MemoryStorage
>>> db = TinyDB(storage=MemoryStorage)
>>> db.insert({'name': 'John', 'age': 22})
>>> User = Query()
>>> db.search(User.name == 'John')
[{'name': 'John', 'age': 22}]
Its simplicity and powerful query engine make it a very interesting tool for some use cases. See http://tinydb.readthedocs.io/ for more details.
If you are willing to work around serializing, MongoDB could work for you. PyMongo provides an interface almost identical to what you describe. If you decide to serialize, the hit won't be as bad since MongoDB is memory mapped.
It should be possible to do what you are wanting to do with just isinstance(), hasattr(), getattr() and setattr().
However, things are going to get fairly complicated before you are done!
I suppose one could store all the objects in a big list, then run a query on each object, determining what it is and looking for a given attribute or value, then return the value and the object as a list of tuples. Then you could sort on your return values pretty easily. copy.deepcopy will be your best friend and your worst enemy.
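A rough sketch of that idea, under assumptions: the Person class and the query helper are invented for illustration, and the scan is linear rather than indexed, so it only shows the shape of the approach:

```python
class Person:
    """Hypothetical record type whose attributes vary per instance."""

    def __init__(self, **attrs):
        for name, value in attrs.items():
            setattr(self, name, value)


def query(objects, attr, predicate=lambda value: True, cls=object):
    """Return (value, obj) tuples for objects of type cls that have attr
    and whose attribute value satisfies predicate."""
    hits = []
    for obj in objects:
        if isinstance(obj, cls) and hasattr(obj, attr):
            value = getattr(obj, attr)
            if predicate(value):
                hits.append((value, obj))
    return hits


people = [
    Person(name="Joe", age=16),
    Person(name="Jane", favourite_color="red"),
    Person(name="Al", age=40),
]

# Everyone with an age of at least 16, oldest first.
adults = sorted(
    query(people, "age", lambda age: age >= 16),
    key=lambda pair: pair[0],
    reverse=True,
)
names = [obj.name for _, obj in adults]
print(names)
```

Because the tuples hold references to the original objects, nothing is copied or serialized, so object identity is preserved.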
Sounds like fun! Good luck!
I started developing one yesterday and it isn't published yet. It indexes your objects and allows you to run fast queries. All data is kept in RAM and I'm thinking about smart load and save methods. For testing purposes it is loading and saving through cPickle.
Let me know if you are still interested.
ducks is exactly what you are describing.
It builds indexes on Python objects
It does not serialize or persist anything
Missing attributes are handled correctly
It uses C libraries so it's very fast and RAM-efficient
pip install ducks
from ducks import Dex, ANY
objects = [
{"name": "Joe", "age": 16},
{"name": "Jane", "favourite_color": "red"},
]
# Build the index
dex = Dex(objects, ['name', 'age', 'favourite_color'])
# Look up by any combination of attributes
dex[{'age': {'>=': 16}}] # Returns Joe
# Match the special value ANY to find all objects with the attribute
dex[{'favourite_color': ANY}] # Returns Jane
This example uses dicts, but ducks works on any object type.
I have a Dictionary of Classes where the classes hold attributes that are lists of strings.
I made this function to find the maximum number of items in any one of those lists for a particular person.
def find_max_var_amt(some_person):
    # Pass in a patient id number, get back their max number of variables for a type of variable
    max_vars = 0
    for key, value in patients[some_person].__dict__.items():
        challenger = len(value)
        if max_vars < challenger:
            max_vars = challenger
    return max_vars
What I want to do is rewrite it so that I do not have to use the .items() method. This find_max_var_amt function works fine as is, but I am converting my code from using a dictionary to using a database via the dbm module, so typical dictionary methods will no longer work for me, even though the syntax for assigning and accessing the key:value pairs will be the same. Thanks for your help!
Since dbm doesn't let you iterate over the values directly, you can iterate over the keys. To do so, you could modify your for loop to look like
for key in patients[some_person].__dict__:
    value = patients[some_person].__dict__[key]
    # then continue as before
I think a bigger issue, though, will be the fact that dbm only stores strings (bytes, in Python 3). So you won't be able to store the list directly in the database; you'll have to store a string representation of it. And that means that when you try to compute the length of the list, it won't be as simple as len(value); you'll have to write some code to work out the length of the list from whatever string representation you use. It could be as simple as len(the_string.split(',')), but be aware that you have to do it.
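One possible string representation is JSON. The sketch below is an assumption-laden illustration, not the asker's schema: the "patient:variable" key layout and the max_list_length helper are made up, and dbm.dumb is used only because it is available on every platform:

```python
import dbm.dumb
import json
import os
import tempfile


def max_list_length(db, prefix):
    """Return the length of the longest JSON-encoded list stored under
    keys that start with prefix (dbm keys and values are bytes)."""
    longest = 0
    for key in db.keys():
        if key.startswith(prefix):
            values = json.loads(db[key].decode("utf-8"))
            longest = max(longest, len(values))
    return longest


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "patients")
    with dbm.dumb.open(path, "c") as db:
        # Hypothetical layout: one key per patient and variable type.
        db["p1:meds"] = json.dumps(["a", "b", "c"])
        db["p1:labs"] = json.dumps(["x"])
        db["p2:meds"] = json.dumps(["a", "b", "c", "d"])
        result = max_list_length(db, b"p1:")

print(result)
```

Decoding with json.loads gives back a real list, so len() works again and items containing commas are not split by accident.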
By the way, your existing function could be rewritten using a generator expression, like so:
def find_max_var_amt(some_person):
    return max(len(value) for value in patients[some_person].__dict__.values())
and if you did it that way, the change to iterating over keys would look like
def find_max_var_amt(some_person):
    dct = patients[some_person].__dict__
    return max(len(dct[key]) for key in dct)