I have two CSV files: one is the Master-Data and the other is the Component-Data. Master-Data has two rows and two columns, whereas Component-Data has five rows and two columns.
I'm trying to find the cosine similarity between each of them after tokenization, stemming and lemmatization, and then append the similarity index to new columns. I'm unable to append the corresponding values to the column in the DataFrame, which further needs to be converted to CSV.
My Approach:
import re
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer,WordNetLemmatizer
from collections import Counter
import pandas as pd
portStemmer=PorterStemmer()
wordNetLemmatizer = WordNetLemmatizer()
fields = ['Sentences']
cosineSimilarityList = []
def fetchLemmantizedWords():
    eliminatePunctuation = re.sub('[^a-zA-Z]', ' ', value)
    convertLowerCase = eliminatePunctuation.lower()
    tokenizeData = convertLowerCase.split()
    eliminateStopWords = [word for word in tokenizeData if not word in set(stopwords.words('english'))]
    stemWords = list(set([portStemmer.stem(value) for value in eliminateStopWords]))
    wordLemmatization = [wordNetLemmatizer.lemmatize(x) for x in stemWords]
    return wordLemmatization
def fetchCosine(eachMasterData, eachComponentData):
    masterDataValues = Counter(eachMasterData)
    componentDataValues = Counter(eachComponentData)
    bagOfWords = list(masterDataValues.keys() | componentDataValues.keys())
    masterDataVector = [masterDataValues.get(word, 0) for word in bagOfWords]
    componentDataVector = [componentDataValues.get(word, 0) for word in bagOfWords]
    masterDataLength = sum(contractElement * contractElement for contractElement in masterDataVector) ** 0.5
    componentDataLength = sum(questionElement * questionElement for questionElement in componentDataVector) ** 0.5
    dotProduct = sum(contractElement * questionElement for contractElement, questionElement in zip(masterDataVector, componentDataVector))
    cosine = int((dotProduct / (masterDataLength * componentDataLength)) * 100)
    return cosine
masterData = pd.read_csv('C:\\Similarity\\MasterData.csv', skipinitialspace=True)
componentData = pd.read_csv('C:\\Similarity\\ComponentData.csv', skipinitialspace=True)
for value in masterData['Sentences']:
    eachMasterData = fetchLemmantizedWords()
    for value in componentData['Sentences']:
        eachComponentData = fetchLemmantizedWords()
        cosineSimilarity = fetchCosine(eachMasterData, eachComponentData)
        cosineSimilarityList.append(cosineSimilarity)
for value in cosineSimilarityList:
    componentData = componentData.append(pd.DataFrame(cosineSimilarityList, columns=['Cosine Similarity']), ignore_index=True)
    #componentData['Cosine Similarity'] = value
The expected output, after converting the df to CSV, should have the similarity values appended as new columns.
I'm facing issues while appending the values to the DataFrame. Please assist me with an approach for this. Thanks.
Here's what I came up with:
Sample setup
csv_master_data = \
"""
SI.No;Sentences
1;Emma is writing a letter.
2;We wake up early in the morning.
"""
csv_component_data = \
"""
SI.No;Sentences
1;Emma is writing a letter.
2;We wake up early in the morning.
3;Did Emma Write a letter?
4;We sleep early at night.
5;Emma wrote a letter.
"""
import pandas as pd
from io import StringIO
df_md = pd.read_csv(StringIO(csv_master_data), delimiter=';')
df_cd = pd.read_csv(StringIO(csv_component_data), delimiter=';')
We end up with 2 dataframes (showing df_cd):
   SI.No  Sentences
0      1  Emma is writing a letter.
1      2  We wake up early in the morning.
2      3  Did Emma Write a letter?
3      4  We sleep early at night.
4      5  Emma wrote a letter.
I replaced the 2 functions you used with the following dummy functions:
import random

def fetchLemmantizedWords(words):
    return [random.randint(1, 30) for x in words]

def fetchCosine(lem_md, lem_cd):
    return 100 if len(lem_md) == len(lem_cd) else random.randint(0, 100)
Processing data
First, we apply the fetchLemmantizedWords function to each dataframe. The regex replace, lowercasing and splitting of the sentences are done by Pandas string methods instead of inside the function itself.
By making the sentence lowercase first, we can simplify the regex to only consider lowercase letters.
for df in (df_md, df_cd):
    df['lem'] = (df['Sentences']
                 .str.lower()
                 .str.replace(r'[^a-z]', ' ', regex=True)  # plain str.replace is literal; the .str accessor supports regex
                 .str.split()
                 .apply(fetchLemmantizedWords))
Result for df_cd:
   SI.No  Sentences                         lem
0      1  Emma is writing a letter.         [29, 5, 4, 9, 28]
1      2  We wake up early in the morning.  [16, 8, 21, 14, 13, 4, 6]
2      3  Did Emma Write a letter?          [30, 9, 23, 16, 5]
3      4  We sleep early at night.          [8, 25, 24, 7, 3]
4      5  Emma wrote a letter.              [30, 30, 15, 7]
Next, we use a cross-join to make a dataframe with all possible combinations of md and cd data.
df_merged = pd.merge(df_md[['SI.No', 'lem']],
df_cd[['SI.No', 'lem']],
how='cross',
suffixes=('_md','_cd')
)
df_merged contents:
   SI.No_md  lem_md                      SI.No_cd  lem_cd
0         1  [14, 22, 9, 21, 4]                 1  [3, 4, 8, 17, 2]
1         1  [14, 22, 9, 21, 4]                 2  [29, 3, 10, 2, 19, 18, 21]
2         1  [14, 22, 9, 21, 4]                 3  [20, 22, 29, 4, 3]
3         1  [14, 22, 9, 21, 4]                 4  [17, 7, 1, 27, 19]
4         1  [14, 22, 9, 21, 4]                 5  [17, 5, 3, 29]
5         2  [12, 30, 10, 11, 7, 11, 8]         1  [3, 4, 8, 17, 2]
6         2  [12, 30, 10, 11, 7, 11, 8]         2  [29, 3, 10, 2, 19, 18, 21]
7         2  [12, 30, 10, 11, 7, 11, 8]         3  [20, 22, 29, 4, 3]
8         2  [12, 30, 10, 11, 7, 11, 8]         4  [17, 7, 1, 27, 19]
9         2  [12, 30, 10, 11, 7, 11, 8]         5  [17, 5, 3, 29]
Next, we calculate the cosine value:
df_merged['cosine'] = df_merged.apply(lambda x: fetchCosine(x.lem_md,
x.lem_cd),
axis=1)
In the last step, we pivot the data and merge the original df_cd with the calculated results:
pd.merge(df_cd.drop(columns='lem').set_index('SI.No'),
df_merged.pivot_table(index='SI.No_cd',
columns='SI.No_md').droplevel(0, axis=1),
how='inner',
left_index=True,
right_index=True)
Result (again, these are dummy calculations):
SI.No  Sentences                           1    2
1      Emma is writing a letter.         100   64
2      We wake up early in the morning.   63  100
3      Did Emma Write a letter?          100    5
4      We sleep early at night.          100   17
5      Emma wrote a letter.               35    9
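Since the end goal is a CSV file, the merged result can be written out directly. A minimal sketch, assuming the merge above is assigned to a variable (the name result and the output path are introduced here for illustration):
result = pd.merge(df_cd.drop(columns='lem').set_index('SI.No'),
                  df_merged.pivot_table(index='SI.No_cd',
                                        columns='SI.No_md').droplevel(0, axis=1),
                  how='inner',
                  left_index=True,
                  right_index=True)
result.to_csv('ComponentDataWithSimilarity.csv', sep=';')  # hypothetical output path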
I have a large 1D NumPy array a of any comparable dtype; some of its elements may be repeated.
How do I find the sorting indexes ix that will stable-sort (stability in the sense described here) a by the frequencies of its values, in descending/ascending order?
I want to find the fastest and simplest way to do this. Maybe there is an existing standard NumPy function for it.
There is another related question here, but it asked specifically to remove duplicates, i.e. to output only unique sorted values; I need all the values of the original array, including duplicates.
I've coded my first attempt at the task, but it is not the fastest (it uses a Python loop) and probably not the shortest/simplest possible form. The Python loop can be very expensive if the repetition of equal elements is low and the array is huge. It would also be nice to have a short function for doing all of this, if one is available in NumPy (e.g. an imaginary np.argsort_by_freq()).
import numpy as np
np.random.seed(1)
hi, n, desc = 7, 24, True
a = np.random.choice(np.arange(hi), (n,), p=(
    lambda p=np.random.random((hi,)): p / p.sum()
)())
us, cs = np.unique(a, return_counts = True)
af = np.zeros(n, dtype = np.int64)
for u, c in zip(us, cs):
    af[a == u] = c
if desc:
    ix = np.argsort(-af, kind='stable')  # Descending sort
else:
    ix = np.argsort(af, kind='stable')  # Ascending sort
print('rows: i_col(0) / original_a(1) / freqs(2) / sorted_a(3)')
print(' / sorted_freqs(4) / sorting_ix(5)')
print(np.stack((
np.arange(n), a, af, a[ix], af[ix], ix,
), 0))
outputs:
rows: i_col(0) / original_a(1) / freqs(2) / sorted_a(3)
/ sorted_freqs(4) / sorting_ix(5)
[[ 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23]
[ 1 1 1 1 3 0 5 0 3 1 1 0 0 4 6 1 3 5 5 0 0 0 5 0]
[ 7 7 7 7 3 8 4 8 3 7 7 8 8 1 1 7 3 4 4 8 8 8 4 8]
[ 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 5 5 5 5 3 3 3 4 6]
[ 8 8 8 8 8 8 8 8 7 7 7 7 7 7 7 4 4 4 4 3 3 3 1 1]
[ 5 7 11 12 19 20 21 23 0 1 2 3 9 10 15 6 17 18 22 4 8 16 13 14]]
I might be missing something, but it seems that with a Counter you can sort the indexes of each element according to the count of that element's value, using the element value and then the index to break ties. For example:
from collections import Counter
a = [ 1, 1, 1, 1, 3, 0, 5, 0, 3, 1, 1, 0, 0, 4, 6, 1, 3, 5, 5, 0, 0, 0, 5, 0]
counts = Counter(a)
t = [(counts[v], v, i) for i, v in enumerate(a)]
t.sort()
print([v[2] for v in t])
t.sort(reverse=True)
print([v[2] for v in t])
Output:
[13, 14, 4, 8, 16, 6, 17, 18, 22, 0, 1, 2, 3, 9, 10, 15, 5, 7, 11, 12, 19, 20, 21, 23]
[23, 21, 20, 19, 12, 11, 7, 5, 15, 10, 9, 3, 2, 1, 0, 22, 18, 17, 6, 16, 8, 4, 14, 13]
If you want to maintain ascending order of indexes within groups with equal counts, you can just use a lambda key function for the descending sort:
t.sort(key = lambda x:(-x[0],-x[1],x[2]))
print([v[2] for v in t])
Output:
[5, 7, 11, 12, 19, 20, 21, 23, 0, 1, 2, 3, 9, 10, 15, 6, 17, 18, 22, 4, 8, 16, 14, 13]
If you want elements with the same count to keep the order in which they first appeared in the array, then rather than sorting on the values, sort on the index of their first occurrence in the array:
a = [ 1, 1, 1, 1, 3, 0, 5, 0, 3, 1, 1, 0, 0, 4, 6, 1, 3, 5, 5, 0, 0, 0, 5, 0]
counts = Counter(a)
idxs = {}
t = []
for i, v in enumerate(a):
    if not v in idxs:
        idxs[v] = i
    t.append((counts[v], idxs[v], i))
t.sort()
print([v[2] for v in t])
t.sort(key = lambda x:(-x[0],x[1],x[2]))
print([v[2] for v in t])
Output:
[13, 14, 4, 8, 16, 6, 17, 18, 22, 0, 1, 2, 3, 9, 10, 15, 5, 7, 11, 12, 19, 20, 21, 23]
[5, 7, 11, 12, 19, 20, 21, 23, 0, 1, 2, 3, 9, 10, 15, 6, 17, 18, 22, 4, 8, 16, 13, 14]
To sort according to count, and then position in the array, you don't need the value or the first index at all:
from collections import Counter
a = [ 1, 1, 1, 1, 3, 0, 5, 0, 3, 1, 1, 0, 0, 4, 6, 1, 3, 5, 5, 0, 0, 0, 5, 0]
counts = Counter(a)
t = [(counts[v], i) for i, v in enumerate(a)]
t.sort()
print([v[1] for v in t])
t.sort(key = lambda x:(-x[0],x[1]))
print([v[1] for v in t])
For the sample data, this produces the same output as the prior code. For your string array:
a = ['g', 'g', 'c', 'f', 'd', 'd', 'g', 'a', 'a', 'a', 'f', 'f', 'f',
'g', 'f', 'c', 'f', 'a', 'e', 'b', 'g', 'd', 'c', 'b', 'f' ]
This produces the output:
[18, 19, 23, 2, 4, 5, 15, 21, 22, 7, 8, 9, 17, 0, 1, 6, 13, 20, 3, 10, 11, 12, 14, 16, 24]
[3, 10, 11, 12, 14, 16, 24, 0, 1, 6, 13, 20, 7, 8, 9, 17, 2, 4, 5, 15, 21, 22, 19, 23, 18]
I just figured out what is probably a very fast solution for any dtype, using only NumPy functions and no Python looping; it works in O(N log N) time. NumPy functions used: np.unique, np.argsort and array indexing.
Although it wasn't asked for in the original question, I implemented an extra flag equal_order_by_val. If it is False, then array elements with the same frequency are sorted as one equal stable range, meaning that the output can contain c d d c d c, as in the dumps below, because that is the order in which the elements appear in the original array for that frequency. When the flag is True, such elements are additionally sorted by their value in the original array, resulting in c c c d d d. In other words, when False we sort stably just by the key freq, and when True we sort by (freq, value) for ascending order and by (-freq, value) for descending order.
import string, math
import numpy as np
np.random.seed(0)
# Generating input data
hi, n, desc = 7, 25, True
letters = np.array(list(string.ascii_letters), dtype = np.object_)[:hi]
a = np.random.choice(letters, (n,), p=(
    lambda p=np.random.random((letters.size,)): p / p.sum()
)())

for equal_order_by_val in [False, True]:
    # Solving task
    us, ui, cs = np.unique(a, return_inverse=True, return_counts=True)
    af = cs[ui]
    sort_key = -af if desc else af
    if equal_order_by_val:
        shift_bits = max(1, math.ceil(math.log(us.size) / math.log(2)))
        sort_key = ((sort_key.astype(np.int64) << shift_bits) +
                    np.arange(us.size, dtype=np.int64)[ui])
    ix = np.argsort(sort_key, kind='stable')  # Do the sorting itself
    # Printing results
    print('\nequal_order_by_val:', equal_order_by_val)
    for name, val in [
        ('i_col', np.arange(n)), ('original_a', a),
        ('freqs', af), ('sorted_a', a[ix]),
        ('sorted_freqs', af[ix]), ('sorting_ix', ix),
    ]:
        print(name.rjust(12), ' '.join([str(e).rjust(2) for e in val]))
outputs:
equal_order_by_val: False
i_col 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
original_a g g c f d d g a a a f f f g f c f a e b g d c b f
freqs 5 5 3 7 3 3 5 4 4 4 7 7 7 5 7 3 7 4 1 2 5 3 3 2 7
sorted_a f f f f f f f g g g g g a a a a c d d c d c b b e
sorted_freqs 7 7 7 7 7 7 7 5 5 5 5 5 4 4 4 4 3 3 3 3 3 3 2 2 1
sorting_ix 3 10 11 12 14 16 24 0 1 6 13 20 7 8 9 17 2 4 5 15 21 22 19 23 18
equal_order_by_val: True
i_col 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24
original_a g g c f d d g a a a f f f g f c f a e b g d c b f
freqs 5 5 3 7 3 3 5 4 4 4 7 7 7 5 7 3 7 4 1 2 5 3 3 2 7
sorted_a f f f f f f f g g g g g a a a a c c c d d d b b e
sorted_freqs 7 7 7 7 7 7 7 5 5 5 5 5 4 4 4 4 3 3 3 3 3 3 2 2 1
sorting_ix 3 10 11 12 14 16 24 0 1 6 13 20 7 8 9 17 2 15 22 4 5 21 19 23 18
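As a side note, a sketch of an alternative I did not use above: np.lexsort can express the composite (-freq, value) key directly, without the bit-shift packing. Keys are applied from last to first and the sort is stable, so remaining ties keep their original order. Shown for the descending, equal_order_by_val=True case, reusing ui, af and ix from the last loop iteration:
ix2 = np.lexsort((ui, -af))  # primary key: -af (frequency, descending); secondary: ui (value rank)
print((ix2 == ix).all())     # True: same ordering as the bit-shift key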
I am new to Python and its libraries. I searched all the forums but could not find a proper solution. This is the first time I am posting a question here; sorry if I did something wrong.
So, I have two DataFrames like the ones below, containing X Y Z coordinates (UTM) and other features.
In [2]: a = {
...: 'X': [1, 2, 5, 7, 10, 5, 2, 3, 24, 21],
...: 'Y': [3, 4, 8, 15, 20, 12, 23, 22, 14, 7],
...: 'Z': [12, 4, 9, 16, 13, 1, 8, 17, 11, 19],
...: }
...:
In [3]: b = {
...: 'X': [1, 8, 20, 7, 32],
...: 'Y': [6, 4, 17, 45, 32],
...: 'Z': [52, 12, 6, 8, 31],
...: }
In [4]: df1 = pd.DataFrame(data=a)
In [5]: df2 = pd.DataFrame(data=b)
In [6]: print(df1)
X Y Z
0 1 3 12
1 2 4 4
2 5 8 9
3 7 15 16
4 10 20 13
5 5 12 1
6 2 23 8
7 3 22 17
8 24 14 11
9 21 7 19
In [7]: print(df2)
X Y Z
0 1 6 52
1 8 4 12
2 20 17 6
3 7 45 8
4 32 32 31
I need to find the closest point (by distance) in df1 to each point of df2 and create a new DataFrame.
So I wrote the code below, and it actually finds the closest point (distance) to df2.iloc[0].
In [8]: x = (
...: np.sqrt(
...: ((df1['X'].sub(df2["X"].iloc[0]))**2)
...: .add(((df1['Y'].sub(df2["Y"].iloc[0]))**2))
...: .add(((df1['Z'].sub(df2["Z"].iloc[0]))**2))
...: )
...: ).idxmin()
In [9]: x1 = df1.iloc[[x]]
In[10]: print(x1)
X Y Z
3 7 15 16
So, I guess I need a loop to iterate through df2 and apply the above code to each row. As a result I need a new, updated df1 containing all the points closest to each point of df2. But I couldn't make it work. Please advise.
This is actually a great example of a case where numpy's broadcasting rules have distinct advantages over pandas.
Manually aligning df1's coordinates as column vectors (by referencing df1[[col]].to_numpy()) and df2's coordinates as row vectors (df2[col].to_numpy()), we can get the distance from every element in each dataframe to each element in the other very quickly with automatic broadcasting:
In [26]: dists = np.sqrt(
...: (df1[['X']].to_numpy() - df2['X'].to_numpy()) ** 2
...: + (df1[['Y']].to_numpy() - df2['Y'].to_numpy()) ** 2
...: + (df1[['Z']].to_numpy() - df2['Z'].to_numpy()) ** 2
...: )
In [27]: dists
Out[27]:
array([[40.11234224, 7.07106781, 24.35159132, 42.61455151, 46.50806382],
[48.05205511, 10. , 22.29349681, 41.49698784, 49.12229636],
[43.23193264, 5.83095189, 17.74823935, 37.06750599, 42.29657197],
[37.58989226, 11.74734012, 16.52271164, 31.04834939, 33.74907406],
[42.40283009, 16.15549442, 12.56980509, 25.67099531, 30.85449724],
[51.50728104, 13.92838828, 16.58312395, 33.7934905 , 45.04442252],
[47.18050445, 20.32240143, 19.07878403, 22.56102835, 38.85871846],
[38.53569774, 19.33907961, 20.85665361, 25.01999201, 33.7194306 ],
[47.68647607, 18.89444363, 7.07106781, 35.48239 , 28.0713377 ],
[38.60051813, 15.06651917, 16.43167673, 41.96427052, 29.83286778]])
Argmin will now give you the correct vector of positional indices:
In [28]: dists.argmin(axis=0)
Out[28]: array([3, 2, 8, 6, 8])
Or, to select the appropriate values from df1:
In [29]: df1.iloc[dists.argmin(axis=0)]
Out[29]:
X Y Z
3 7 15 16
2 5 8 9
8 24 14 11
6 2 23 8
8 24 14 11
Edit
An answer popped up just after mine, then was deleted, which made reference to scipy.spatial.distance_matrix, computing dists with:
from scipy.spatial import distance_matrix
dists = distance_matrix(df1[list('XYZ')].to_numpy(), df2[list('XYZ')].to_numpy())
Not sure why that answer was deleted, but this seems like a really nice, clean approach to getting the array I produced manually above!
Performance Note
Note that if you are just trying to find the closest value, there's no need to take the square root: it is a costly operation compared to addition, subtraction and powers, and sorting on dist**2 is still valid.
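A minimal sketch of that shortcut, reusing the column-vector/row-vector trick from above (the names sq_dists and closest are mine):
sq_dists = (
    (df1[['X']].to_numpy() - df2['X'].to_numpy()) ** 2
    + (df1[['Y']].to_numpy() - df2['Y'].to_numpy()) ** 2
    + (df1[['Z']].to_numpy() - df2['Z'].to_numpy()) ** 2
)
closest = df1.iloc[sq_dists.argmin(axis=0)]  # same rows as before, sqrt skipped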
First, you define a function that returns the closest point using numpy.where. Then you use the apply function to run through df2.
import pandas as pd
import numpy as np
a = {
'X': [1, 2, 5, 7, 10, 5, 2, 3, 24, 21],
'Y': [3, 4, 8, 15, 20, 12, 23, 22, 14, 7],
'Z': [12, 4, 9, 16, 13, 1, 8, 17, 11, 19]
}
b = {
'X': [1, 8, 20, 7, 32],
'Y': [6, 4, 17, 45, 32],
'Z': [52, 12, 6, 8, 31]
}
df1 = pd.DataFrame(a)
df2 = pd.DataFrame(b)
dist = lambda dx, dy, dz: np.sqrt(dx**2 + dy**2 + dz**2)

def closest(row):
    darr = dist(df1['X'] - row['X'], df1['Y'] - row['Y'], df1['Z'] - row['Z'])
    idx = np.where(darr == np.amin(darr))[0][0]
    return df1['X'][idx], df1['Y'][idx], df1['Z'][idx]
df2['closest'] = df2.apply(closest, axis=1)
print(df2)
Output:
X Y Z closest
0 1 6 52 (7, 15, 16)
1 8 4 12 (5, 8, 9)
2 20 17 6 (24, 14, 11)
3 7 45 8 (2, 23, 8)
4 32 32 31 (24, 14, 11)
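A small design note (my suggestion, not part of the original answer): darr is a pandas Series, so the np.where lookup can be replaced with the built-in idxmin method:
def closest(row):
    darr = dist(df1['X'] - row['X'], df1['Y'] - row['Y'], df1['Z'] - row['Z'])
    idx = darr.idxmin()  # index label of the row with the minimum distance
    return df1['X'][idx], df1['Y'][idx], df1['Z'][idx]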
I have code in Matlab which I need to translate into Python. Shapes and indexes really matter here, since the code works with tensors. I'm a little bit confused: it seems that it's enough to use order='F' in Python's reshape(), but when I work with 3D data I noticed that it does not work. For example, if A is an array from 1 to 27 in Python:
array([[[ 1, 2, 3],
[ 4, 5, 6],
[ 7, 8, 9]],
[[10, 11, 12],
[13, 14, 15],
[16, 17, 18]],
[[19, 20, 21],
[22, 23, 24],
[25, 26, 27]]])
if I perform A.reshape(3, 9, order='F') I get
[[ 1 4 7 2 5 8 3 6 9]
[10 13 16 11 14 17 12 15 18]
[19 22 25 20 23 26 21 24 27]]
In Matlab, for A = 1:27 reshaped to [3, 3, 3] and then to [3, 9], it seems that I get a different array:
1 4 7 10 13 16 19 22 25
2 5 8 11 14 17 20 23 26
3 6 9 12 15 18 21 24 27
And SVD in Matlab and Python gives different results. So, is there a way to fix this?
Also, what is the correct way of porting multidimensional-array code from Matlab to Python? For example, should I get the same SVD for arange(1, 13).reshape(3, 4) in Python as for reshape(1:12, [3, 4]) in Matlab? Can I swap axes somehow in Python, or change the order of the axes in reshape(x1, x2, x3, ...), to get the same results as in Matlab?
I was having the same issues until I found this Wikipedia article: row- and column-major order.
Python (and C) organizes data arrays in row-major order. As you can see in your first example, the elements first increase along the columns:
array([[[ 1, 2, 3],
- - - -> increasing
then along the rows:
array([[[ 1, 2, 3],
[ 4, <--- new element
When all columns and rows are full, it moves to the next page.
array([[[ 1, 2, 3],
[ 4, 5, 6],
[ 7, 8, 9]],
[[10, <-- new element in next page
Matlab (like Fortran) increases first along the rows, then the columns, and so on.
For N-dimensional arrays it looks like:
Python (row major -> last dimension is contiguous): [dim1, dim2, ..., dimN]
Matlab (column major -> first dimension is contiguous): the same tensor in memory would look the other way around: [dimN, ..., dim2, dim1]
If you want to export n-dimensional arrays from Python to Matlab, the easiest way is to permute the dimensions first:
(in python)
import numpy as np
import scipy.io as sio
A=np.reshape(range(1,28),[3,3,3])
sio.savemat('A',{'A':A})
(in matlab)
load('A.mat')
A=permute(A,[3 2 1]);%dimensions in reverse ordering
reshape(A,9,3)' %gives the same result as A.reshape([3,9]) in python
Just notice that the (9,3) and the (3,9) are intentionally put in reverse order.
In Matlab
A = 1:27;
A = reshape(A,3,3,3);
B = reshape(A,9,3)'
B =
1 2 3 4 5 6 7 8 9
10 11 12 13 14 15 16 17 18
19 20 21 22 23 24 25 26 27
size(B)
ans =
3 9
In Python
A = np.array(range(1,28))
A = A.reshape(3,3,3)
B = A.reshape(3,9)
B
array([[ 1, 2, 3, 4, 5, 6, 7, 8, 9],
[10, 11, 12, 13, 14, 15, 16, 17, 18],
[19, 20, 21, 22, 23, 24, 25, 26, 27]])
np.shape(B)
(3, 9)
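To stay entirely in NumPy, here is a small sketch (my addition): using order='F' both when building the tensor and when flattening it reproduces the Matlab [3, 9] result from the question:
import numpy as np

A = np.arange(1, 28).reshape((3, 3, 3), order='F')  # same layout as Matlab's reshape(1:27, [3, 3, 3])
B = A.reshape((3, 9), order='F')
print(B)
# [[ 1  4  7 10 13 16 19 22 25]
#  [ 2  5  8 11 14 17 20 23 26]
#  [ 3  6  9 12 15 18 21 24 27]]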
This is a program that takes a list of words (text) and appends numbers to a list (called numbers) representing the indexes of the original text. For example, the phrase "the sailor went to sea sea sea to see what he could see see see but all that he could see see see was the bottom of the deep blue sea sea sea" should be returned as "1 2 3 4 5 5 5 4 6 7 8 9 6 6 6 10 11 12 8 9 6 6 6 13 1 14 15 16 17 5 5 5", but is instead returned as "1 2 3 4 5 5 5 4 9 10 11 12 9 9 9 13 14 15 11 12 9 9 9 16 1 17 18 1 19 20 5 5 5", which is the problem.
This is the part of the program that is the problem:
for position, item in enumerate(text):
    if text.count(item) < 2:
        numbers.append(max(numbers) + 1)
    else:
        numbers.append(text.index(item) + 1)
The "numbers" and "text" are both lists.
A solution with a dictionary:
text="the sailor went to sea sea sea to see what he could see see see but all that he could see see see was the bottom of the deep blue sea sea sea"
l=text.split(' ')
d=dict()
cnt=0
for word in l:
    if word not in d:
        cnt += 1
        d[word] = cnt
out=[d[w] for w in l]
#[1, 2, 3, 4, 5, 5, 5, 4, 6, 7, 8, 9, 6, 6, 6, 10, 11, 12, 8, 9, 6, 6, 6, 13, 1, 14, 15, 1, 16, 17, 5, 5, 5]
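The same idea in a more compact form, a sketch using dict.setdefault (my variant, not from the original answer): setdefault returns the stored number for a word it has already seen, and otherwise stores and returns len(d) + 1, which is exactly the next first-seen rank:
d = {}
out = [d.setdefault(w, len(d) + 1) for w in l]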
An easy solution is to create a version of the text without duplicates, but maintaining the same order, and then find the indexes of the words in the original text from that list using index():
Create a list from the string by splitting by spaces:
text="the sailor went to sea sea sea to see what he could see see see but all that he could see see see was the bottom of the deep blue sea sea sea"
listText=text.split(" ")
Create a new list without duplicates containing all the words in the text, using count() to check that a word has not appeared previously:
unique_text=[listText[x] for x in range(len(listText)) if listText[:x].count(listText[x])<1]
Use a list comprehension to get the index of every word in listText within unique_text (and add 1):
positions=[unique_text.index(x)+1 for x in listText]
Final code:
text="the sailor went to sea sea sea to see what he could see see see but all that he could see see see was the bottom of the deep blue sea sea sea"
listText=text.split(" ")
unique_text=[listText[x] for x in range(len(listText)) if listText[:x].count(listText[x])<1]
positions=[unique_text.index(x)+1 for x in listText]
Output:
[1, 2, 3, 4, 5, 5, 5, 4, 6, 7, 8, 9, 6, 6, 6, 10, 11, 12, 8, 9, 6, 6, 6, 13, 1, 14, 15, 1, 16, 17, 5, 5, 5]