I am new to pandas and Python. I want to find common words in my data set. For example, I have a list of companies ["Microsoft.com", "Microsoft", "Microsoft com", "apple", ...]. I have around 1M such company names, and I want to calculate the correlation between them to find the relevance of the words, e.g. "Microsoft.com", "Microsoft", and "Microsoft com" share common words.
This is what I did, but it is very slow:
import hashlib
import pandas as pd
from pandas import DataFrame, Series

companies = pd.read_csv('/tmp/companies.csv', error_bad_lines=False)
unique_companies = companies.groupby(['company'])['company'].unique()
df = DataFrame()
for company in unique_companies:
    df[hashlib.md5(company).hexdigest()] = [{'name': company[0], 'code': [ord(c) for c in company[0]]}]
rows = df.unstack()
for company in rows:
    series1 = Series(company['code'])
    for word in rows:
        series2 = Series(word['code'])
        if series1.corr(series2) > 0.8:
            company['match'] = [word['name']]
Can anyone please guide me on how to compute a correlation matrix for the words?
I don't think there's a corr function that will work for strings - only numerics.
If you can somehow compress your words into meaningful numeric values that preserve the "closeness" of one against another, you might then be able to "corr" them, but other options are available.
Hamming distance is one (basic) method, but slightly better is calculating the Levenshtein distance: http://en.wikipedia.org/wiki/Levenshtein_distance
It's tricky, but one way of trying this would be to build a matrix of m x n cells, where m is the number of unique words in your first wordlist and n is the number of unique words in the second wordlist, then calculate the Hamming or Levenshtein distances between the row/column identifiers.
There are python modules that package up the distance-algorithms for you -
e.g. https://pypi.python.org/pypi/python-Levenshtein/
Or you could write your own; I think the packaged ones are likely to be faster as they're implemented in C.
So, assuming the Levenshtein module (I don't know it, as I have not used it) provides a function, say getLev(word1, word2), that generates a numeric score, you should be able to feed in the contents of two wordlists from sources 1 and 2. If you make sure your inputs are already filtered for uniqueness, and maybe sorted alphabetically, that would help too.
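For instance, if the python-Levenshtein package linked above is installed, a quick sanity check on two of the company names from the question might look like this (a minimal sketch; I'm assuming the package's distance and ratio functions, and either could be plugged in as getLev later on):

import Levenshtein

# distance: number of single-character edits needed (lower means more similar)
print(Levenshtein.distance("Microsoft.com", "Microsoft com"))  # 1
# ratio: normalised similarity in [0, 1] (higher means more similar)
print(Levenshtein.ratio("Microsoft.com", "Microsoft"))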
Feed them into a matrix generation function.
Here, I've imported numpy as np and am using that module for speed
import numpy as np

def genLevenshteinMatrix(wordlist1, wordlist2):
    x = len(wordlist1)
    y = len(wordlist2)
    l_matrix = np.zeros((x, y))
    for i in range(0, x):
        x_word = wordlist1[i]
        for j in range(0, y):
            y_word = wordlist2[j]
            l_matrix[i][j] = getLev(x_word, y_word)
    return l_matrix
Something like that should allow you to generate a matrix that stores a measure of which words are most like which other words.
Once that's created, you can interrogate it using a function like this:
def interrogate_Levenshtein_matrix(ndarray_x, wordlist1, wordlist2, float_threshold):
    l = []
    x = len(ndarray_x)
    y = len(ndarray_x[0])
    for i in range(0, x):
        for j in range(0, y):
            # keep pairs whose distance is at or below the threshold (lower = closer)
            if ndarray_x[i][j] <= float_threshold:
                l.append([(wordlist1[i], wordlist2[j]), ndarray_x[i][j]])
    return l
And that will output the pairs of words that are "close" (i.e. have a lower distance, as measured by the Levenshtein function used earlier) as a list, each entry containing the two similar words and their distance.
You might need to trim it down somehow, as I think you'll get all like-combinations twice, i.e. ['word','work'] as one return value, and ['work','word'] as another.
As you develop the code, you could swap in different distance functions and try different threshold values.
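Putting the pieces together, a hypothetical end-to-end run (assuming getLev is bound to the packaged Levenshtein.distance function; the word lists and threshold here are just illustrative) might look like:

import Levenshtein

getLev = Levenshtein.distance  # plug the packaged distance function into the matrix builder

wordlist1 = ["Microsoft.com", "Microsoft", "apple"]
wordlist2 = ["Microsoft com", "apple inc"]

l_matrix = genLevenshteinMatrix(wordlist1, wordlist2)
# keep pairs that are at most 4 edits apart
matches = interrogate_Levenshtein_matrix(l_matrix, wordlist1, wordlist2, 4.0)
print(matches)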
I have a corpus with m documents and n unique words.
Based on this corpus, I want to calculate a co-occurrence matrix for words and calculate their similarity.
To do so, I have created a NumPy array occurrences (m x n), which indicates which words are present in each document.
Based on the occurrences, I have created cooccurrences as follows:
cooccurrences = np.transpose(occurrences) @ occurrences
Furthermore, word_occurrences gives the sum per word in the corpus:
word_occurrences = occurrences.sum(axis=0)
Now, I want to calculate the similarity scores of words in cooccurrences based on the association strength.
I want to divide each cell i, j in cooccurrences, by word_occurrences[i] * word_occurrences[j].
Currently, I loop through cooccurrences to do this.
def calculate_association_strength(cooc, i, j, word_occurrences):
    return cooc / (word_occurrences[i] * word_occurrences[j])

for i in range(len(cooccurrences)):
    for j in range(len(cooccurrences)):
        if i != j:
            if cooccurrences[i, j] > 0:
                cooccurrences[i, j] = 1 - calculate_association_strength(cooccurrences[i, j], i, j, word_occurrences)
            else:
                cooccurrences[i, j] = 0
But with m > 30 000, this is very time-consuming. Is there a faster way to do this?
Here, they discuss mapping a function on a np.array. However, they don't use multiple variables derived from the array.
If I understand the problem here correctly, you could vectorise the whole operation, so the result would be:
cooccurrences / (word_occurrences.reshape(-1, 1) * word_occurrences)
This is pretty much guaranteed to be faster than looping through the array.
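For completeness, a sketch that also reproduces the 1 - ... transformation, the zeroing of empty cells, and the untouched diagonal from the original loop could look like this (assuming cooccurrences and word_occurrences are the arrays defined in the question):

import numpy as np

# outer product of the per-word totals: denom[i, j] = word_occurrences[i] * word_occurrences[j]
denom = np.outer(word_occurrences, word_occurrences)

strength = 1 - cooccurrences / denom          # association strength for every cell at once
strength[cooccurrences == 0] = 0              # cells with no co-occurrence become 0
off_diagonal = ~np.eye(len(cooccurrences), dtype=bool)
cooccurrences = np.where(off_diagonal, strength, cooccurrences)  # leave the diagonal untouched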
I have a document with many reviews. I am creating a bag-of-words (BW) using TfidfVectorizer. What I want to do is: I only want to use words in BW that are also in another document D.
The document D is a document with positive words. I am using this positive word list to improve my model. What I mean is: I only want to count the words that are positive.
Is there a way of doing this?
Thank you
I created a piece of code to do that job, as follows:
train_x is a pandas DataFrame with Reviews.
pos_words, neg_words, sentiment_words = [], [], []

pos_file = open("positive-words.txt")
neg_file = open("negative-words.txt")

# creating lists based on the files
for ln in pos_file:
    pos_words.append(ln.strip())
for ln in neg_file:
    neg_words.append(ln.strip())

# adding all the positive and negative words together
sentiment_words.extend(pos_words)
sentiment_words.extend(neg_words)

pos_file.close()
neg_file.close()

# filtering out all the words that are not in the positive list
filtered_res = []
for r in train_x:
    keep = []
    parts = r.split()
    for p in parts:
        if p in pos_words:
            keep.append(p)
    # turning the review back into text again
    filtered_res.append(" ".join(keep))

train_x = filtered_res
Although I was able to accomplish what I needed, I know that the code is not the best. Also, I was trying to find a standard function in Python to do this.
PS: Python has so many features that I always wonder whether it can do what I want with less code than I used.
Here is a slightly more optimized version, because:
- it does not do a linear search (p in pos_words) inside the loop; it uses a set instead
- it pushes the loop over the reviews into pandas apply (more pythonic)
- instead of keeping a keep list for each r, it uses a generator expression
import re

pos_words_set = set(pos_words)

def filter_review(r):
    # use [A-Za-z]+ instead if you also want to drop numbers
    tokens = (m.group(0) for m in re.finditer(r"[A-Za-z0-9]+", r))
    return " ".join(t for t in tokens if t in pos_words_set)

# assuming train_x is a Series (or column) of review strings
train_x = train_x.apply(filter_review)
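As for a standard way to do this: if I remember correctly, TfidfVectorizer itself accepts a vocabulary argument, so you may be able to skip the manual filtering entirely and have the vectorizer count only the positive words. A minimal sketch, assuming pos_words is the list built above and train_x still holds the raw reviews:

from sklearn.feature_extraction.text import TfidfVectorizer

# restrict the vocabulary to the positive words only
vectorizer = TfidfVectorizer(vocabulary=sorted(set(pos_words)))
bw = vectorizer.fit_transform(train_x)  # rows: reviews, columns: positive words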
The following function is designed to find the unique rows of an array:
def unique_rows(a):
    b = np.ascontiguousarray(a).view(np.dtype((np.void, a.dtype.itemsize * a.shape[1])))
    _, idx = np.unique(b, return_index=True)
    unique_a = a[idx]
    return unique_a
For example,
test = np.array([[1,0,1],[1,1,1],[1,0,1]])
unique_rows(test)
[[1,0,1],[1,1,1]]
I believe that this function should work all the time; however, it may not be watertight. In my code I would like to calculate how many unique positions exist for a set of particles. The particles are stored in a 2d array, each row corresponding to the position of a particle. The positions are of type np.float64.
I have also defined the following function
def pos_tag(pos):
    x, y, z = pos[:, 0], pos[:, 1], pos[:, 2]
    return (2**x) * (3**y) * (5**z)
In principle this function should produce a unique value for any (x,y,z) position.
However, when I use these two functions to calculate the number of unique positions in my set of particles, they produce different answers. Is this due to some possible logical flaw in the first function, or the second function not producing a unique value for each given position?
EDIT: Usage example
I have some long code that produces a 2d array of particle positions.
partpos.shape = (6039539,3)
I then calculate the number of unique rows as follows
len(unique_rows(partpos))
6034411
And
posids = pos_tag(partpos)
len(np.unique(posids))
5328871
I believe that the discrepancy arises due to a precision error.
Using the code
print len(unique_rows(partpos.astype(np.float32)))
print len(np.unique(pos_tag(partpos)))
6034411
6034411
However with
print len(unique_rows(partpos.astype(np.float32)))
print len(np.unique(pos_tag(partpos.astype(np.float32))))
6034411
5328871
a = [[1,0,1],[1,1,1],[1,0,1]]
# Convert rows to tuples so they're hashable, creating a generator thereof
b = (tuple(row) for row in a)
# Convert back to list of lists, after coercing to a set to eliminate non-unique rows
unique_rows = list(list(row) for row in set(b))
Edit: Well that's embarrassing. I just realized I didn't really address the question asked. This could still be the answer the OP is looking for, so I'll leave it, but it's not really what was asked. Sorry for that.
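As an aside, if the goal is simply the unique rows of a float array, newer NumPy releases (1.13 and later, if I recall correctly) support this directly, which sidesteps the void-view trick:

import numpy as np

test = np.array([[1, 0, 1], [1, 1, 1], [1, 0, 1]])
print(np.unique(test, axis=0))       # [[1 0 1]
                                     #  [1 1 1]]
print(len(np.unique(test, axis=0)))  # 2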
The problem asks me to find all possible subsets of a list that, added together (in pairs, alone, or several at a time), equal a given number. I have been reading a lot about subset sum problems and am not sure whether that applies to this problem.
To explain the problem more, I have a max weight of candy that I am allowed to purchase.
I know the weight of ten pieces of different candy that I have stored in a list
candy = [ [snickers, 150.5], [mars, 130.3], ......]
I can purchase at most max_weight = 740.5 grams EXACTLY.
Thus I have to find all possible combinations of candy whose weights add up exactly to max_weight. I will be programming in Python. I don't need the exact code, just whether or not this is a subset sum problem, and possible suggestions on how to proceed.
Ok here's a brute force approach exploiting numpy's index magic:
from itertools import combinations
import numpy as np

candy = [["snickers", 150.5], ["mars", 130.3], ["choc", 10.0]]
n = len(candy)
ww = np.array([c[1] for c in candy])  # extract the weights of the candies
idx = np.arange(n)                    # list of indexes
iidx, sums = [], []
# generate all possible sums together with the index lists that produce them
for k in range(n):
    for ii in combinations(idx, k + 1):
        ii = list(ii)  # convert tuple to list, so it can be used as a list of indices
        sums.append(np.sum(ww[ii]))
        iidx.append(ii)
sums = np.asarray(sums)
ll = np.where(np.abs(sums - 160.5) < 1e-9)[0]  # indices of sums which match 160.5
# print results
for le in ll:
    print([candy[e] for e in iidx[le]])
This is exactly the subset sum problem. You could use a dynamic programming approach to solve it.
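A sketch of what such a dynamic programming approach could look like, assuming the weights have at most one decimal place so they can be scaled to whole numbers (the function name and the scaling are mine, not from the original answer):

def candy_combinations(candy, max_weight):
    """Return every subset of candy whose weights sum exactly to max_weight."""
    # scale to integers (tenths of a gram) to avoid float comparison issues
    target = int(round(max_weight * 10))
    items = [(name, int(round(w * 10))) for name, w in candy]

    # reachable[s] = list of subsets (lists of names) whose scaled weights sum to s
    reachable = {0: [[]]}
    for name, w in items:
        additions = {}
        for s, subsets in reachable.items():
            new_sum = s + w
            if new_sum <= target:
                additions.setdefault(new_sum, []).extend(subset + [name] for subset in subsets)
        # merge after the pass so each piece of candy is used at most once
        for s, subsets in additions.items():
            reachable.setdefault(s, []).extend(subsets)
    return reachable.get(target, [])

candy = [["snickers", 150.5], ["mars", 130.3], ["choc", 10.0]]
print(candy_combinations(candy, 160.5))  # [['snickers', 'choc']]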
I have already asked a few questions on here about this same topic, but I'm really trying not to disappoint the professor I'm doing research with. This is my first time using Python and I may have gotten in a little over my head.
Anyway, I was sent a file to read and was able to load it using this command:
SNdata = numpy.genfromtxt('...', dtype=None,
                          usecols=(0,6,7,8,9,19,24,29,31,33,34,37,39,40,41,42,43,44),
                          names=['sn','off1','dir1','off2','dir2','type','gal','dist',
                                 'htype','d1','d2','pa','ai','b','berr','b0','k','kerr'])
sn is just an array of the names of a particular supernova; type is an array of the type of supernovae it is (Ia or II), etc.
One of the first things I need to do is simply calculate the probabilities of certain properties given the SN type (Ia or II).
For instance, the column htype is the morphology of a galaxy (given as an integer, 1=elliptical to 8=irregular). I need to calculate the probability of an elliptical given a Type Ia and of an elliptical given a Type II, and likewise for all of the integers up to 8.
For ellipticals, I know that I just need the number of elements that have htype = 1 and type = Ia divided by the total number of elements of type = Ia. And then the number of elements that have htype = 1 and type = II divided by the total number of elements that have type = II.
I just have no idea how to write code for this. I was planning on finding the total number of each type first and then running a for loop to find the number of elements that have a certain htype given their type (Ia or II).
Could anyone help me get started with this? If any clarification is needed, let me know.
Thanks a lot.
Numpy supports boolean array operations, which will make your code fairly straightforward to write. For instance, you could do:
htype_sums = {}
for htype_number in xrange(1, 9):
    htype_mask = SNdata['htype'] == htype_number
    Ia_mask = SNdata['type'] == 'Ia'
    II_mask = SNdata['type'] == 'II'
    # cast to float so the division is not truncated to an integer
    Ia_sum = (htype_mask & Ia_mask).sum() / float(Ia_mask.sum())
    II_sum = (htype_mask & II_mask).sum() / float(II_mask.sum())
    htype_sums[htype_number] = (Ia_sum, II_sum)
Each of the _mask variables is a boolean array, so when you sum them you count the number of elements that are True.
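For instance, a quick illustration of that counting behaviour (the toy arrays here are just for demonstration):

import numpy as np

a = np.array([1, 2, 1, 3])
b = np.array(['x', 'y', 'x', 'x'])
both = (a == 1) & (b == 'x')  # element-wise AND of two boolean masks
print(both.sum())             # 2 -- rows where both conditions hold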
You can use collections.Counter to count needed observations.
For example,
from collections import Counter
types_counter = Counter(row['type'] for row in data)
will give you the desired counts of SN types, and
htypes_types_counter = Counter((row['type'], row['htype']) for row in data)
will give the counts for each (type, morphology) pair. Then, to get your value for ellipticals, just divide:
1.0*htypes_types_counter['Ia', 1]/types_counter['Ia']
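If it helps, the same division can be done for every morphology and both types in one go (a small sketch building on the two counters above, assuming both SN types appear in the data):

# probability of each morphology (1..8) given the SN type
probs = {}
for sn_type in ('Ia', 'II'):
    for htype in range(1, 9):
        probs[sn_type, htype] = (1.0 * htypes_types_counter[sn_type, htype]
                                 / types_counter[sn_type])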