How to see the content of a Gensim-generated dictionary? - python

I am running topic modeling using Gensim. Before creating the document-term matrix, one needs to create a dictionary of tokens.
dictionary = corpora.Dictionary(tokenized_reviews)
doc_term_matrix = [dictionary.doc2bow(rev) for rev in tokenized_reviews]
But, I don't understand what kind of object "dictionary" is.
So, when I type:
type(dictionary)
I get
gensim.corpora.dictionary.Dictionary
Is this a dictionary (the data structure)? If so, why can't I see its content (I am just curious)?
When I type
dictionary
I get:
<gensim.corpora.dictionary.Dictionary at 0x1bac985ebe0>
The same issue exists with some of the objects in NLTK.
If this is a dictionary (as a data structure), why am I not able to see the keys and values as with any other Python dictionary?
Thanks,
Navid

This is a specific Dictionary class implemented by the Gensim project.
It will be very similar in interface to the standard Python dict (and other various Dictionary/HashMap/etc types you may have used elsewhere).
However, to see exactly what it can do, you should consult the class-specific documentation:
https://radimrehurek.com/gensim/corpora/dictionary.html
Like a dict, you can do typical operations:
len(dictionary) # gets the number of entries
dictionary[token_id] # gets the token (word) stored under an integer id
dictionary.keys() # gets all stored token ids
dictionary.token2id # the underlying word -> id mapping, a plain dict
The reason you see a generic <gensim.corpora.dictionary.Dictionary at 0x1bac985ebe0> when you try to display the value of the dictionary itself is that the class hasn't defined any convenience display string with more info, so you're seeing the default for any Python object. (Such dictionaries are usually far too large to usefully dump their full contents whenever asked, generically, to "show yourself".)
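To illustrate the point about the default display string, here is a minimal sketch in plain Python. TinyVocab and TinyVocabWithRepr are hypothetical stand-ins written for this example; they are not Gensim's implementation, and only the token2id attribute name is borrowed from Gensim.

```python
class TinyVocab:
    """Hypothetical stand-in for a vocabulary class (not Gensim's code)."""

    def __init__(self, documents):
        self.token2id = {}
        for doc in documents:
            for token in doc:
                # Give each unseen token the next free integer id.
                self.token2id.setdefault(token, len(self.token2id))

    def __len__(self):
        return len(self.token2id)


class TinyVocabWithRepr(TinyVocab):
    def __repr__(self):
        # Summarize the size instead of dumping every entry.
        return f"TinyVocabWithRepr({len(self)} unique tokens)"


docs = [["good", "food"], ["good", "service"]]
plain = TinyVocab(docs)
described = TinyVocabWithRepr(docs)

print(repr(plain))      # default: class name plus a memory address
print(repr(described))  # custom summary string
print(plain.token2id)   # the underlying mapping is an ordinary dict
```

The class without __repr__ falls back to the generic "object at 0x..." form, while the contents stay reachable through the ordinary dict it wraps.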

Related

Python convert named string fields to tuple

Similar to this question: Tuple declaration in Python
I have this function:
import os

def get_mouse():
    # Get: x:4631 y:506 screen:0 window:63557060
    mouse = os.popen("xdotool getmouselocation").read().splitlines()
    print(mouse)
    return mouse
When I run it it prints:
['x:2403 y:368 screen:0 window:60817757']
I can split the line and create 4 separate fields in a list but from Python code examples I've seen I feel there is a better way of doing it. I'm thinking something like x:= or window:=, etc.
I'm not sure how to properly define these "named tuple fields" nor how to reference them in subsequent commands?
I'd like to read more on the whole subject if there is a reference link handy.
It seems it would be a better option to use a dictionary here. Dictionaries allow you to set a key, and a value associated to that key. This way you can call a key such as dictionary['x'] and get the corresponding value from the dictionary (if it exists!)
data = ['x:2403 y:368 screen:0 window:60817757'] #Your return data seems to be stored as a list
result = dict(d.split(':') for d in data[0].split())
result['x']
#'2403'
result['window']
#'60817757'
You can read more about a few of the things used here, such as:
Comprehensions
Dictionaries
Happy learning!
Try:
dict(el.split(':') for el in mouse[0].split())
This should give you a dict rather than tuples (note that dicts are mutable and require their keys to be hashable):
{'x': '2403', 'y': '368', ...}
Also the splitlines is probably not needed, as you are only reading one line. You could do something like:
mouse = [os.popen( "xdotool getmouselocation" ).read()]
Though I don't know what xdotool getmouselocation does or if it could ever return multiple lines.
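Since the question asks about named fields specifically, here is a minimal sketch that combines the dict approach with collections.namedtuple. The names MouseLocation and parse_mouse_location are made up for this example, and it assumes every field in the line has the form name:integer, as in the sample output.

```python
from collections import namedtuple

# Hypothetical record type; field names follow the sample xdotool output.
MouseLocation = namedtuple("MouseLocation", "x y screen window")


def parse_mouse_location(line):
    """Turn 'x:2403 y:368 screen:0 window:60817757' into a named tuple."""
    fields = dict(pair.split(":", 1) for pair in line.split())
    # Convert the string values to integers before building the tuple.
    return MouseLocation(**{name: int(value) for name, value in fields.items()})


loc = parse_mouse_location("x:2403 y:368 screen:0 window:60817757")
print(loc.x, loc.y, loc.window)
```

Fields can then be read as loc.x or loc.window, and the result is immutable like any tuple.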

Error with unhashable type while using TweetTokenize

I start by downloading some tweets from Twitter.
tweet_text = DonaldTrump["Tweets"]
tweet_text = tweet_text.str.lower()
In the next step, I tokenize them with TweetTokenizer.
Tweet_tkn = TweetTokenizer()
tokens = [Tweet_tkn.tokenize(t) for t in tweet_text]
tokens[0:3]
Can someone explain the error to me and help me solve it?
I have been through similar questions about similar errors, but they suggest different solutions.
Lists are mutable and can therefore not be used as dict keys. Otherwise, the program could add a list to a dictionary, change its value, and it is now unclear whether the value in the dictionary should be available under the new or the old list value, or neither.
If you want to use structured data as keys, you need to convert it to an immutable type first, such as tuple or frozenset. For non-nested objects, you can simply use tuple(obj). For a simple list of lists, you can use this:
tuple(tuple(elem) for elem in obj)
But for an arbitrary structure, you will have to use recursion.
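A minimal sketch of such a recursive conversion follows. The function name freeze is made up for this example, and it assumes the structure contains only lists, tuples, sets, dicts and already-hashable leaves.

```python
def freeze(obj):
    """Recursively convert lists, sets and dicts into hashable equivalents."""
    if isinstance(obj, dict):
        # A dict becomes a frozenset of (key, frozen value) pairs.
        return frozenset((key, freeze(value)) for key, value in obj.items())
    if isinstance(obj, (list, tuple)):
        return tuple(freeze(item) for item in obj)
    if isinstance(obj, (set, frozenset)):
        return frozenset(freeze(item) for item in obj)
    return obj  # assumed to be hashable already (str, int, ...)


nested = ["a", ["b", ["c", "d"]], {"k": [1, 2]}]
lookup = {freeze(nested): "found"}
print(lookup[freeze(nested)])
```

Freezing the same structure twice yields equal keys, so the lookup succeeds where using the raw list as a key would raise a TypeError.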

Trying to remove duplicates from large list of objects, keep certain one

I have a large list of objects in Python that I'm storing in a text file (for lack of knowledge of how to use any other database for the present).
Currently there are 40,000 but I expect the list length eventually may exceed 1,000,000. I'm trying to remove duplicates, where duplicates are defined as different objects having the same value for a text string attribute, but keep the most recent version of that object (defined as having the highest value in another attribute).
What I want to make is a function that returns only objects 2 and 3 from the following list, reliably:
Object 1: text="hello" ID=1
Object 2: text="hello" ID=2
Object 3: text="something else" ID=3
Doing this manually (looping through the list each time for each object) is already too slow and, being O(n^2) in the length of the list, will only get slower, so I need a smarter way to do it. I have seen hashing the objects and using the set function recommended multiple times, but I have two questions about this that I haven't found satisfactory answers to:
How does hashing improve the efficiency to the degree it does?
How can I do this and retain only the most recent such object? The examples I have seen all use the set function and I'm not sure how that would return only the most recent one.
EDIT: I can probably find good answers to question 1 elsewhere, but I am still stuck on question 2. To take another stab at explaining it, hashing the objects above on their text and using the set function will return a set where the objects chosen from duplicates are randomly chosen from each group of duplicates (e.g., above, either a set of (Object 2, Object 3) or (Object 1, Object 3) could be returned; I need (Object 2, Object 3)).
Change to using a database:
import sqlite3
db = sqlite3.connect("my.db")
db.execute("CREATE TABLE IF NOT EXISTS my_items (text PRIMARY KEY, id INTEGER);")
my_list_of_items = [("test",1),("test",2),("asdasd",3)]
db.executemany("INSERT OR REPLACE INTO my_items (text,id) VALUES (?,?)",my_list_of_items)
db.commit()
print(db.execute("SELECT * FROM my_items").fetchall())
This may have marginally higher overhead in terms of time, but you will save on RAM.
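One caveat with INSERT OR REPLACE is that it keeps whichever row was inserted last for a given text, which is the newest only if rows arrive in ascending id order. A minimal alternative sketch, assuming a hypothetical table named items without the primary key and an in-memory database for illustration, lets SQLite pick the highest id per text:

```python
import sqlite3

db = sqlite3.connect(":memory:")  # in-memory database, for illustration only
db.execute("CREATE TABLE items (text TEXT, id INTEGER)")

# Deliberately out of order: the newest "test" row is inserted first.
rows = [("test", 2), ("test", 1), ("asdasd", 3)]
db.executemany("INSERT INTO items (text, id) VALUES (?, ?)", rows)

# Keep only the highest id for each distinct text.
newest = db.execute(
    "SELECT text, MAX(id) FROM items GROUP BY text ORDER BY text"
).fetchall()
print(newest)
db.close()
```

This stores the duplicates and filters them at query time, which trades some disk space for not depending on insertion order.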
You could use a dict with the text as key and the newest object for each key as value.
Setting up some demo data:
>>> from collections import namedtuple
>>> Object = namedtuple('Object', 'text ID')
>>> objects = Object('foo', 1), Object('foo', 2), Object('bar', 4), Object('bar', 3)
Solution:
>>> unique = {}
>>> for obj in objects:
...     if obj.text not in unique or obj.ID > unique[obj.text].ID:
...         unique[obj.text] = obj
...
>>> list(unique.values())
[Object(text='foo', ID=2), Object(text='bar', ID=4)]
Hashing is a well-researched subject in Computer Science. One of the standard uses is for implementing what Python calls a dictionary. (Perl calls the same thing a hash, for some reason. ;-) )
The idea is that for some key, such as a string, you can compute a simple numeric function - the hash value - and use that number as a quick way to look up the associated value stored in the dictionary.
Python has the built-in function hash() that returns the standard computation of this value. It also supports the __hash__() method, for objects that wish to compute their own hash value.
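As a minimal sketch of how hash() and __hash__() fit together, the hypothetical Item class below (its name and attributes are made up for this example) hashes and compares on its text only, so two objects with the same text count as duplicates in a set:

```python
class Item:
    """Hypothetical record that hashes and compares on its text only."""

    def __init__(self, text, ID):
        self.text = text
        self.ID = ID

    def __eq__(self, other):
        return isinstance(other, Item) and self.text == other.text

    def __hash__(self):
        # Delegate to the built-in hash of the text attribute.
        return hash(self.text)


a = Item("hello", 1)
b = Item("hello", 2)
c = Item("something else", 3)

print(hash(a) == hash(b))  # same text, same hash value
print(len({a, b, c}))      # the set treats a and b as one entry
```

A set built this way gives you no say in which duplicate survives, which is why keeping "the newest" needs a dict keyed on the text rather than a plain set.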
In a "normal" scenario, one way to determine if you have seen a field value before would be to use the field value as a key in a dictionary. For example, you might store a dictionary that maps the field in question to the entire record, or to a list of records that all share the same field value.
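The "list of records that share the same field value" idea can be sketched as follows. The record layout (dicts with text and ID keys) is an assumption for this example, modelled on the objects in the question.

```python
from collections import defaultdict

# Assumed record layout, modelled on the question's three objects.
records = [
    {"text": "hello", "ID": 1},
    {"text": "hello", "ID": 2},
    {"text": "something else", "ID": 3},
]

# Map each text to the list of records that share it.
by_text = defaultdict(list)
for record in records:
    by_text[record["text"]].append(record)

# From each group, keep the record with the highest ID.
newest = [max(group, key=lambda r: r["ID"]) for group in by_text.values()]
print([r["ID"] for r in newest])
```

Each record is visited once and each dictionary lookup takes roughly constant time, so the whole pass is linear rather than quadratic in the number of records.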
In your case, your data is too big (according to you), so that would be a bad idea. Instead, you might try something like this:
seen_before = {}  # Empty dictionary to start with.
while ... something :
    info = read_next_record()  # You figure this out.
    fld = info.fields[some_field]  # The value you care about
    hv = hash(fld)  # Compute hash value for field.
    if hv in seen_before:
        print("This field value has been seen before")
    else:
        seen_before[hv] = True  # Never seen ... until NOW!

Why does CouchDb-python (or do I) confuse strings and dictionaries?

I'm trying to use the Python wrapper for CouchDB to update a database. The file is structured as a nested dictionary as follows.
doc = { ...,
'RLSoo': {'RT_freq': 2, 'tweet': "They're going to play monopoly now. This makes me feel like an excellent mother. #Sandy #NYC"},
'GiltCityNYC': {},
....}
I would like to put each entry of the larger dictionary, for example RLSoo, into its own document. However, I get an error message when I try the following code.
for key in doc:
    db.update(doc[key],all_or_nothing=True)
Error Message
TypeError: expected dict, got <type 'str'>
I don't understand why CouchDB won't accept the dictionary.
According to the implementation and documentation of the Database.update() method, the first argument should be a list of document objects (e.g. a list of dicts). Since doc[key] is itself a dict, iterating over it yields its keys, which are strings, hence the error. If I understood your case correctly, your doc contains nested documents as values. So, just try:
db.update(doc.values(), all_or_nothing=True)
If all first-level values are dicts, it should work!
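The difference can be reproduced without CouchDB. In the sketch below, fake_bulk_update is a hypothetical stand-in that only mimics the type check described above; it is not the couchdb-python API and stores nothing.

```python
def fake_bulk_update(documents):
    """Hypothetical stand-in: accepts an iterable of dicts, rejects the rest."""
    documents = list(documents)
    for document in documents:
        if not isinstance(document, dict):
            raise TypeError("expected dict, got %r" % type(document))
    return len(documents)


doc = {"RLSoo": {"RT_freq": 2, "tweet": "..."}, "GiltCityNYC": {}}

# Passing a single document: iterating a dict yields its string keys.
try:
    fake_bulk_update(doc["RLSoo"])
except TypeError as error:
    print(error)

# Passing the collection of documents: every element is a dict.
stored = fake_bulk_update(doc.values())
print(stored)
```

The first call fails because the function walks over the keys of one document, while the second receives the documents themselves.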

How to rewrite this Dictionary For Loop in Python?

I have a Dictionary of Classes where the classes hold attributes that are lists of strings.
I made this function to find the maximum number of items in any one of those lists for a particular person.
def find_max_var_amt(some_person):
    # Pass in a patient id number, get back their max number of variables for a type of variable.
    max_vars = 0
    for key, value in patients[some_person].__dict__.items():
        challenger = len(value)
        if max_vars < challenger:
            max_vars = challenger
    return max_vars
What I want to do is rewrite it so that I do not have to use the .iteritems() function. This find_max_var_amt function works fine as is, but I am converting my code from using a dictionary to be a database using the dbm module, so typical dictionary functions will no longer work for me even though the syntax for assigning and accessing the key:value pairs will be the same. Thanks for your help!
Since dbm doesn't let you iterate over the values directly, you can iterate over the keys. To do so, you could modify your for loop to look like
for key in patients[some_person].__dict__:
    value = patients[some_person].__dict__[key]
    # then continue as before
I think a bigger issue, though, will be the fact that dbm only stores strings. So you won't be able to store the list directly in the database; you'll have to store a string representation of it. And that means that when you try to compute the length of the list, it won't be as simple as len(value); you'll have to develop some code to figure out the length of the list based on whatever string representation you use. It could be as simple as len(the_string.split(',')); just be aware that you have to do it.
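Here is a minimal sketch of that idea, assuming JSON as the string representation (the answer's comma-separated string would work along the same lines). The key names and values are made up, and dbm.dumb with a temporary directory is used only so the example is self-contained.

```python
import dbm.dumb
import json
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "patient")  # hypothetical database file name
    with dbm.dumb.open(path, "c") as db:
        # Lists must be serialized to strings before they can be stored.
        db["medications"] = json.dumps(["aspirin", "ibuprofen", "statin"])
        db["allergies"] = json.dumps(["pollen"])

        # Iterate over the keys, decode each value, and take the longest list.
        max_vars = max(len(json.loads(db[key])) for key in db.keys())

print(max_vars)
```

Values come back from dbm as bytes, which json.loads accepts directly, so the decoding step doubles as the conversion back to a list.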
By the way, your existing function could be rewritten using a generator expression, like so:
def find_max_var_amt(some_person):
    return max(len(value) for value in patients[some_person].__dict__.values())
and if you did it that way, the change to iterating over keys would look like
def find_max_var_amt(some_person):
    dct = patients[some_person].__dict__
    return max(len(dct[key]) for key in dct)
