applying the Similar function in Gensim.Doc2Vec - python

I am trying to get the doc2vec function to work in python 3.
I Have the following code:
tekstdata = [[ index, str(row["StatementOfTargetFiguresAndPoliciesForTheUnderrepresentedGender"])] for index, row in data.iterrows()]
def prep (x):
low = x.lower()
return word_tokenize(low)
def cleanMuch(data, clean):
output = []
for x, y in data:
z = clean(y)
output.append([str(x), z])
return output
tekstdata = cleanMuch(tekstdata, prep)
def tagdocs(docs):
output = []
for x,y in docs:
output.append(gensim.models.doc2vec.TaggedDocument(y, x))
return output
tekstdata = tagdocs(tekstdata)
print(tekstdata[100])
vectorModel = gensim.models.doc2vec.Doc2Vec(tekstdata, size = 100, window = 4,min_count = 3, iter = 2)
ranks = []
second_ranks = []
for x, y in tekstdata:
print (x)
print (y)
inferred_vector = vectorModel.infer_vector(y)
sims = vectorModel.docvecs.most_similar([inferred_vector], topn=1001, restrict_vocab = None)
rank = [docid for docid, sim in sims].index(y)
ranks.append(rank)
All works as far as I can understand until the rank function.
The error I get is that there is no zero in my list e.g. the documents I am putting in does not have 10 in list:
File "C:/Users/Niels Helsø/Documents/github/Speciale/Test/Data prep.py", line 59, in <module>
rank = [docid for docid, sim in sims].index(y)
ValueError: '10' is not in list
It seems to me that it is the similar function that does not work.
the model trains on my data (1000 documents) and build a vocab which is tagged.
The documentation I mainly have used is this:
Gensim dokumentation
Torturial
I hope that some one can help. If any additional info is need please let me know.
best
Niels

If you're getting ValueError: '10' is not in list, you can rely on the fact that '10' is not in the list. So have you looked at the list, to see what is there, and if it matches what you expect?
It's not clear from your code excerpts that tagdocs() is ever called, and thus unclear what form tekstdata is in when provided to Doc2Vec. The intent is a bit convoluted, and there's nothing to display what the data appears as in its raw, original form.
But perhaps the tags you are supplying to TaggedDocument are not the required list-of-tags, but rather a simple string, which will be interpreted as a list-of-characters. As a result, even if you're supplying a tags of '10', it will be seen as ['1', '0'] – and len(vectorModel.doctags) will be just 10 (for the 10 single-digit strings).
Separate comments on your setup:
1000 documents is pretty small for Doc2Vec, where most published results use tens-of-thousands to millions of documents
an iter of 10-20 is more common in Doc2Vec work (and even larger values might be helpful with smaller datasets)
infer_vector() often works better with non-default values in its optional parameters, especially a steps that's much larger (20-200) or a starting alpha that's more like the bulk-training default (0.025)

Related

Issue with Python list interaction with for loop

I am having a problem with my genetic feature optimization algorithm that I am attempting to build. The idea is that a specific combination of features will be tested and if the model accuracy using those features is higher than the previous maximum, then the combination of features replaces the previous maximum combination. through running through the remaining potential features in this way, the final combination should be the optimal combination of features given the feature vector size. Currently, the code that looks to achieve this looks like:
def mutate_features(features, feature):
new_features = features
index = random.randint(0,len(features)-1)
new_features[index] = feature
return new_features
def run_series(n, f_list, df):
features_list = []
results_list = []
max_results_list = [[0,0,0,0,0]]
max_feature_list = []
features = [0,0,0,0,1]
for i in range(0,5): # 5 has just been chosen as the range for testing purposes
results = run_algorithm(df, f_list, features)
features_list.append(features)
results_list.append(results)
if (check_result_vector(max_results_list, results)):
max_results_list.append(results)
max_feature_list.append(features)
else:
print("Revert to previous :" +str(max_feature_list[-1]))
features = max_feature_list[-1]
features = mutate_features(features, f_list[i])
print("Feature List = " +str(features_list))
print("Results List = " +str(results_list))
print("Max Results List = " +str(max_results_list))
print("Max Feature List = "+str(max_feature_list))
The output from this code has been included below;
Click to zoom or enlarge the photo
The section that I do not understand is the output of the max_feature_list and feature_list.
If anything is added through the use of .append() to the max_feature_list or the feature_list inside the for loop, it seems to change all items that are already members of the list to be the same as the latest addition to the list. I may not be fully understanding of the syntax/logic around this and would really appreciate any feedback as to why the program is doing this.
It happens because you change the values of features inside mutate_features function and then, since the append to max_feature_list is by reference, the populated values in max_feature_list are changing too because their underlying value changed.
One way to prevent such behaviour is to deepcopy features inside mutate_features, mutate the copied features as you want and then return it.
For example:
from copy import deepcopy
def mutate_features(features, feature):
new_features = deepcopy(features)
index = random.randint(0,len(features)-1)
new_features[index] = feature
return new_features
features = [1, 2, 3]
res = []
res.append(features)
features = mutate_features(features, feature)
res.append(features)
print(res)

Numpy (n, 1, m) to (n,m)

I am working on a problem which involves a batch of 19 tokens each with 400 features. I get the shape (19,1,400) when concatenating two vectors of size (1, 200) into the final feature vector. If I squeeze the 1 out I am left with (19,) but I am trying to get (19,400). I have tried converting to list, squeezing and raveling but nothing has worked.
Is there a way to convert this array to the correct shape?
def attn_output_concat(sample):
out_h, state_h = get_output_and_state_history(agent.model, sample)
attns = get_attentions(state_h)
inner_outputs = get_inner_outputs(state_h)
if len(attns) != len(inner_outputs):
print 'Length err'
else:
tokens = [np.zeros((400))] * largest
print(tokens.shape)
for j, (attns_token, inner_token) in enumerate(zip(attns, inner_outputs)):
tokens[j] = np.concatenate([attns_token, inner_token], axis=1)
print(np.array(tokens).shape)
return tokens
The easiest way would be to declare tokens to be a numpy.shape=(19,400) array to start with. That's also more memory/time efficient. Here's the relevant portion of your code revised...
import numpy as np
attns_token = np.zeros(shape=(1,200))
inner_token = np.zeros(shape=(1,200))
largest = 19
tokens = np.zeros(shape=(largest,400))
for j in range(largest):
tokens[j] = np.concatenate([attns_token, inner_token], axis=1)
print(tokens.shape)
BTW... It makes it difficult for people to help you if you don't include a self-contained and runnable segment of code (which is probably why you haven't gotten a response on this yet). Something like the above snippet is preferred and will help you get better answers because there's less guessing at what your trying to accomplish.

TensorFlow, Dataset API and flat_map operation

I'm having difficulties working with tf.contrib.data.Dataset API and wondered if some of you could help. I wanted to transform the entire skip-gram pre-processing of word2vec into this paradigm to play with the API a little bit, it involves the following operations:
Sequence of tokens are loaded dynamically (to avoid loading all dataset in memory at a time), say we then start with a Stream (to be understood as Scala's way, all data is not in memory but loaded when access is needed) of sequence of tokens: seq_tokens.
From any of these seq_tokens we extract skip-grams with a python function that returns a list of tuples (token, context).
Select for features the column of tokens and for label the column of contexts.
In pseudo-code to make it clearer it would look like above. We should be taking advantage of the framework parallelism system not to load by ourselves the data, so I would do something like first load in memory only the indices of sequences, then load sequences (inside a map, hence if not all lines are processed synchronously, data is loaded asynchronously and there's no OOM to fear), and apply a function on those sequences of tokens that would create a varying number of skip-grams that needs to be flattened. In this end, I would formally end up with data being of shape (#lines=number of skip-grams generated, #columns=2).
data = range(1:N)
.map(i => load(i): Seq[String]) // load: Int -> Seq[String] loads dynamically a sequence of tokens (sequences have varying length)
.flat_map(s => skip_gram(s)) // skip_gram: Seq[String] -> Seq[(String, String)] with output length
features = data[0] // features
lables = data[1] // labels
I've tried naively to do so with Dataset's API but I'm stuck, I can do something like:
iterator = (
tf.contrib.data.Dataset.range(N)
.map(lambda i: tf.py_func(load_data, [i], [tf.int32, tf.int32])) // (1)
.flat_map(?) // (2)
.make_one_shot_iterator()
)
(1) TensorFlow's not happy here because sequences loaded have differents lengths...
(2) Haven't managed yet to do the skip-gram part... I actually just want to call a python function that computes a sequence (of variable size) of skip-grams and flatten it so that if the return type is a matrix, then each line should be understood as a new line of the output Dataset.
Thanks a lot if anyone has any idea, and don't hesitate if I forgot to mention useful precisions...
I'm just implementing the same thing; here's how I solved it:
dataset = tf.data.TextLineDataset(filename)
if mode == ModeKeys.TRAIN:
dataset = dataset.shuffle(buffer_size=batch_size * 100)
dataset = dataset.flat_map(lambda line: string_to_skip_gram(line))
dataset = dataset.batch(batch_size)
In my dataset, I treat every line as standalone, so I'm not worrying about contexts that span multiple lines.
I therefore flat map each line through a function string_to_skip_gram that returns a Dataset of a length that depends on the number of tokens in the line.
string_to_skip_gram turns the line into a series of tokens, represented by IDs (using the method tokenize_str) using tf.py_func:
def string_to_skip_gram(line):
def handle_line(line):
token_ids = tokenize_str(line)
(features, labels) = skip_gram(token_ids)
return np.array([features, labels], dtype=np.int64)
res = tf.py_func(handle_line, [line], tf.int64)
features = res[0]
labels = res[1]
return tf.data.Dataset.from_tensor_slices((features, labels))
Finally, skip_gram returns a list of all possible context words and target words:
def skip_gram(token_ids):
skip_window = 1
features = []
labels = []
context_range = [i for i in range(-skip_window, skip_window + 1) if i != 0]
for word_index in range(skip_window, len(token_ids) - skip_window):
for context_word_offset in context_range:
features.append(token_ids[word_index])
labels.append(token_ids[word_index + context_word_offset])
return features, labels
Note that I'm not sampling the context words here; just using all of them.

Format a python list and search for patterns

I am getting rows from a spreadsheet with mixtures of numbers, text and dates
I want to find elements within the list, some numbers and some text
for example
sg = [500782, u'BMOU9015488', u'SD4', u'CLOSED', -1, '', '', -1]
sg = map(str, sg)
#sg = map(unicode, sg) #option?
if any("-1" in s for s in sg):
#do something if matched
I don't feel this is the correct way to do this, I am also trying to match stuff like -1.5 and -1.5C and other unexpected characters like OPEN15 compared to 15
I have also looked at
sg.index("-1")
If positive then its a match (Only good for direct matches)
Some help would be appreciated
If you want to call a function for each case, I would do it this way:
def stub1(elem):
#do something for match of type '-1'
return
def stub2(elem):
#do something for match of type 'SD4'
return
def stub3(elem):
#do something for match of type 'OPEN15'
return
sg = [500782, u'BMOU9015488', u'SD4', u'CLOSED', -1, '', '', -1]
sg = map(unicode, sg)
patterns = {u"-1":stub1, u"SD4": stub2, u"OPEN15": stub3} # add more if you want
for elem in sg:
for k, stub in patterns.iteritems():
if k in elem:
stub(elem)
break
Where stub1, stub2, ... are the fonctions that contains the code for each case.
It will be called (max 1 time per strings) if the string contains a matching substring.
What do you mean by "I don't feel this is the correct way to do this" ? Are you not getting the result you expect ? Is it too slow ?
Maybe, you can organize your data by columns instead of rows and have a more specific filters. If you are looking for speed, I'd suggest using the numpy module which has a very intersting function called select()
Scipy select example
By transforming all your rows in a numpy array, you can test several columns in one pass. This function is amazingly efficient and powerful ! Basically it's used like this:
import numpy as np
a = array(...)
conds = [a < 10, a % 3 == 0, a > 25]
actions = [a + 100, a / 3, a * 10]
result = np.select(conds, actions, default = 0)
All values in a will be transformed as follow:
A value 100 will be added to any value of a which is smaller than 10
Any value in a which is a multiple of 3, will be divided by 3
Any value above 25 will be multiplied by 10
Any other value, not matching the previous conditions, will be set to 0
Bot conds and actions are lists, and must have the same number of arguments. The first element in conds has its action set as the first element of actions.
It could be used to determine the index in a vector for a particular value (eventhough this should be done using the nonzero() numpy function).
a = array(....)
conds = [a <= target, a > target]
actions = [1, 0]
index = select(conds, actions).sum()
This is probably a stupid way of getting an index, but it demonstrates how we can use select()... and it works :-)

put stockprices into groups when they are within 0.5% of each other

Thanks for the answers, I have not used StackOverflow before so I was suprised by the number of answers and the speed of them - its fantastic.
I have not been through the answers properly yet, but thought I should add some information to the problem specification. See the image below.
I can't post an image in this because i don't have enough points but you can see an image
at http://journal.acquitane.com/2010-01-20/image003.jpg
This image may describe more closely what I'm trying to achieve. So you can see on the horizontal lines across the page are price points on the chart. Now where you get a clustering of lines within 0.5% of each, this is considered to be a good thing and why I want to identify those clusters automatically. You can see on the chart that there is a cluster at S2 & MR1, R2 & WPP1.
So everyday I produce these price points and then I can identify manually those that are within 0.5%. - but the purpose of this question is how to do it with a python routine.
I have reproduced the list again (see below) with labels. Just be aware that the list price points don't match the price points in the image because they are from two different days.
[YR3,175.24,8]
[SR3,147.85,6]
[YR2,144.13,8]
[SR2,130.44,6]
[YR1,127.79,8]
[QR3,127.42,5]
[SR1,120.94,6]
[QR2,120.22,5]
[MR3,118.10,3]
[WR3,116.73,2]
[DR3,116.23,1]
[WR2,115.93,2]
[QR1,115.83,5]
[MR2,115.56,3]
[DR2,115.53,1]
[WR1,114.79,2]
[DR1,114.59,1]
[WPP,113.99,2]
[DPP,113.89,1]
[MR1,113.50,3]
[DS1,112.95,1]
[WS1,112.85,2]
[DS2,112.25,1]
[WS2,112.05,2]
[DS3,111.31,1]
[MPP,110.97,3]
[WS3,110.91,2]
[50MA,110.87,4]
[MS1,108.91,3]
[QPP,108.64,5]
[MS2,106.37,3]
[MS3,104.31,3]
[QS1,104.25,5]
[SPP,103.53,6]
[200MA,99.42,7]
[QS2,97.05,5]
[YPP,96.68,8]
[SS1,94.03,6]
[QS3,92.66,5]
[YS1,80.34,8]
[SS2,76.62,6]
[SS3,67.12,6]
[YS2,49.23,8]
[YS3,32.89,8]
I did make a mistake with the original list in that Group C is wrong and should not be included. Thanks for pointing that out.
Also the 0.5% is not fixed this value will change from day to day, but I have just used 0.5% as an example for spec'ing the problem.
Thanks Again.
Mark
PS. I will get cracking on checking the answers now now.
Hi:
I need to do some manipulation of stock prices. I have just started using Python, (but I think I would have trouble implementing this in any language). I'm looking for some ideas on how to implement this nicely in python.
Thanks
Mark
Problem:
I have a list of lists (FloorLevels (see below)) where the sublist has two items (stockprice, weight). I want to put the stockprices into groups when they are within 0.5% of each other. A groups strength will be determined by its total weight. For example:
Group-A
115.93,2
115.83,5
115.56,3
115.53,1
-------------
TotalWeight:12
-------------
Group-B
113.50,3
112.95,1
112.85,2
-------------
TotalWeight:6
-------------
FloorLevels[
[175.24,8]
[147.85,6]
[144.13,8]
[130.44,6]
[127.79,8]
[127.42,5]
[120.94,6]
[120.22,5]
[118.10,3]
[116.73,2]
[116.23,1]
[115.93,2]
[115.83,5]
[115.56,3]
[115.53,1]
[114.79,2]
[114.59,1]
[113.99,2]
[113.89,1]
[113.50,3]
[112.95,1]
[112.85,2]
[112.25,1]
[112.05,2]
[111.31,1]
[110.97,3]
[110.91,2]
[110.87,4]
[108.91,3]
[108.64,5]
[106.37,3]
[104.31,3]
[104.25,5]
[103.53,6]
[99.42,7]
[97.05,5]
[96.68,8]
[94.03,6]
[92.66,5]
[80.34,8]
[76.62,6]
[67.12,6]
[49.23,8]
[32.89,8]
]
I suggest a repeated use of k-means clustering -- let's call it KMC for short. KMC is a simple and powerful clustering algorithm... but it needs to "be told" how many clusters, k, you're aiming for. You don't know that in advance (if I understand you correctly) -- you just want the smallest k such that no two items "clustered together" are more than X% apart from each other. So, start with k equal 1 -- everything bunched together, no clustering pass needed;-) -- and check the diameter of the cluster (a cluster's "diameter", from the use of the term in geometry, is the largest distance between any two members of a cluster).
If the diameter is > X%, set k += 1, perform KMC with k as the number of clusters, and repeat the check, iteratively.
In pseudo-code:
def markCluster(items, threshold):
k = 1
clusters = [items]
maxdist = diameter(items)
while maxdist > threshold:
k += 1
clusters = Kmc(items, k)
maxdist = max(diameter(c) for c in clusters)
return clusters
assuming of course we have suitable diameter and Kmc Python functions.
Does this sound like the kind of thing you want? If so, then we can move on to show you how to write diameter and Kmc (in pure Python if you have a relatively limited number of items to deal with, otherwise maybe by exploiting powerful third-party add-on frameworks such as numpy) -- but it's not worthwhile to go to such trouble if you actually want something pretty different, whence this check!-)
A stock s belong in a group G if for each stock t in G, s * 1.05 >= t and s / 1.05 <= t, right?
How do we add the stocks to each group? If we have the stocks 95, 100, 101, and 105, and we start a group with 100, then add 101, we will end up with {100, 101, 105}. If we did 95 after 100, we'd end up with {100, 95}.
Do we just need to consider all possible permutations? If so, your algorithm is going to be inefficient.
You need to specify your problem in more detail. Just what does "put the stockprices into groups when they are within 0.5% of each other" mean?
Possibilities:
(1) each member of the group is within 0.5% of every other member of the group
(2) sort the list and split it where the gap is more than 0.5%
Note that 116.23 is within 0.5% of 115.93 -- abs((116.23 / 115.93 - 1) * 100) < 0.5 -- but you have put one number in Group A and one in Group C.
Simple example: a, b, c = (0.996, 1, 1.004) ... Note that a and b fit, b and c fit, but a and c don't fit. How do you want them grouped, and why? Is the order in the input list relevant?
Possibility (1) produces ab,c or a,bc ... tie-breaking rule, please
Possibility (2) produces abc (no big gaps, so only one group)
You won't be able to classify them into hard "groups". If you have prices (1.0,1.05, 1.1) then the first and second should be in the same group, and the second and third should be in the same group, but not the first and third.
A quick, dirty way to do something that you might find useful:
def make_group_function(tolerance = 0.05):
from math import log10, floor
# I forget why this works.
tolerance_factor = -1.0/(-log10(1.0 + tolerance))
# well ... since you might ask
# we want: log(x)*tf - log(x*(1+t))*tf = -1,
# so every 5% change has a different group. The minus is just so groups
# are ascending .. it looks a bit nicer.
#
# tf = -1/(log(x)-log(x*(1+t)))
# tf = -1/(log(x/(x*(1+t))))
# tf = -1/(log(1/(1*(1+t)))) # solved .. but let's just be more clever
# tf = -1/(0-log(1*(1+t)))
# tf = -1/(-log((1+t))
def group_function(value):
# don't just use int - it rounds up below zero, and down above zero
return int(floor(log10(value)*tolerance_factor))
return group_function
Usage:
group_function = make_group_function()
import random
groups = {}
for i in range(50):
v = random.random()*500+1000
group = group_function(v)
if group in groups:
groups[group].append(v)
else:
groups[group] = [v]
for group in sorted(groups):
print 'Group',group
for v in sorted(groups[group]):
print v
print
For a given set of stock prices, there is probably more than one way to group stocks that are within 0.5% of each other. Without some additional rules for grouping the prices, there's no way to be sure an answer will do what you really want.
apart from the proper way to pick which values fit together, this is a problem where a little Object Orientation dropped in can make it a lot easier to deal with.
I made two classes here, with a minimum of desirable behaviors, but which can make the classification a lot easier -- you get a single point to play with it on the Group class.
I can see the code bellow is incorrect, in the sense the limtis for group inclusion varies as new members are added -- even it the separation crieteria remaisn teh same, you heva e torewrite the get_groups method to use a multi-pass approach. It should nto be hard -- but the code would be too long to be helpfull here, and i think this snipped is enoguh to get you going:
from copy import copy
class Group(object):
def __init__(self,data=None, name=""):
if data:
self.data = data
else:
self.data = []
self.name = name
def get_mean_stock(self):
return sum(item[0] for item in self.data) / len(self.data)
def fits(self, item):
if 0.995 < abs(item[0]) / self.get_mean_stock() < 1.005:
return True
return False
def get_weight(self):
return sum(item[1] for item in self.data)
def __repr__(self):
return "Group-%s\n%s\n---\nTotalWeight: %d\n\n" % (
self.name,
"\n".join("%.02f, %d" % tuple(item) for item in self.data ),
self.get_weight())
class StockGrouper(object):
def __init__(self, data=None):
if data:
self.floor_levels = data
else:
self.floor_levels = []
def get_groups(self):
groups = []
floor_levels = copy(self.floor_levels)
name_ord = ord("A") - 1
while floor_levels:
seed = floor_levels.pop(0)
name_ord += 1
group = Group([seed], chr(name_ord))
groups.append(group)
to_remove = []
for i, item in enumerate(floor_levels):
if group.fits(item):
group.data.append(item)
to_remove.append(i)
for i in reversed(to_remove):
floor_levels.pop(i)
return groups
testing:
floor_levels = [ [stock. weight] ,... <paste the data above> ]
s = StockGrouper(floor_levels)
s.get_groups()
For the grouping element, could you use itertools.groupby()? As the data is sorted, a lot of the work of grouping it is already done, and then you could test if the current value in the iteration was different to the last by <0.5%, and have itertools.groupby() break into a new group every time your function returned false.

Categories

Resources