I have two CSV files, one with the Master-Data and the other with the Component-Data. Master-Data has two rows and two columns, whereas Component-Data has five rows and two columns.
I'm trying to find the cosine similarity between each of their sentences after tokenization, stemming and lemmatization, and then append the similarity index as new columns. I'm unable to append the corresponding values to the columns in the DataFrame, which further needs to be converted to CSV.
My Approach:
import re
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer,WordNetLemmatizer
from collections import Counter
import pandas as pd
portStemmer=PorterStemmer()
wordNetLemmatizer = WordNetLemmatizer()
fields = ['Sentences']
cosineSimilarityList = []
def fetchLemmantizedWords():
    eliminatePunctuation = re.sub('[^a-zA-Z]', ' ', value)
    convertLowerCase = eliminatePunctuation.lower()
    tokenizeData = convertLowerCase.split()
    eliminateStopWords = [word for word in tokenizeData if word not in set(stopwords.words('english'))]
    stemWords = list(set([portStemmer.stem(value) for value in eliminateStopWords]))
    wordLemmatization = [wordNetLemmatizer.lemmatize(x) for x in stemWords]
    return wordLemmatization
def fetchCosine(eachMasterData, eachComponentData):
    masterDataValues = Counter(eachMasterData)
    componentDataValues = Counter(eachComponentData)
    bagOfWords = list(masterDataValues.keys() | componentDataValues.keys())
    masterDataVector = [masterDataValues.get(word, 0) for word in bagOfWords]
    componentDataVector = [componentDataValues.get(word, 0) for word in bagOfWords]
    masterDataLength = sum(contractElement * contractElement for contractElement in masterDataVector) ** 0.5
    componentDataLength = sum(questionElement * questionElement for questionElement in componentDataVector) ** 0.5
    dotProduct = sum(contractElement * questionElement for contractElement, questionElement in zip(masterDataVector, componentDataVector))
    cosine = int((dotProduct / (masterDataLength * componentDataLength)) * 100)
    return cosine
masterData = pd.read_csv('C:\\Similarity\\MasterData.csv', skipinitialspace=True)
componentData = pd.read_csv('C:\\Similarity\\ComponentData.csv', skipinitialspace=True)
for value in masterData['Sentences']:
    eachMasterData = fetchLemmantizedWords()
    for value in componentData['Sentences']:
        eachComponentData = fetchLemmantizedWords()
        cosineSimilarity = fetchCosine(eachMasterData, eachComponentData)
        cosineSimilarityList.append(cosineSimilarity)
for value in cosineSimilarityList:
    componentData = componentData.append(pd.DataFrame(cosineSimilarityList, columns=['Cosine Similarity']), ignore_index=True)
    # componentData['Cosine Similarity'] = value
Expected output after converting the df to CSV: the component data with one cosine-similarity column appended per master-data sentence.
I'm facing issues while appending the values to the DataFrame. Please assist me with an approach for this. Thanks.
Here's what I came up with:
Sample setup
csv_master_data = \
"""
SI.No;Sentences
1;Emma is writing a letter.
2;We wake up early in the morning.
"""
csv_component_data = \
"""
SI.No;Sentences
1;Emma is writing a letter.
2;We wake up early in the morning.
3;Did Emma Write a letter?
4;We sleep early at night.
5;Emma wrote a letter.
"""
import pandas as pd
from io import StringIO
df_md = pd.read_csv(StringIO(csv_master_data), delimiter=';')
df_cd = pd.read_csv(StringIO(csv_component_data), delimiter=';')
We end up with 2 dataframes (showing df_cd):
   SI.No                         Sentences
0      1         Emma is writing a letter.
1      2  We wake up early in the morning.
2      3          Did Emma Write a letter?
3      4          We sleep early at night.
4      5              Emma wrote a letter.
I replaced the 2 functions you used with the following dummy functions:
import random

def fetchLemmantizedWords(words):
    return [random.randint(1, 30) for x in words]

def fetchCosine(lem_md, lem_cd):
    return 100 if len(lem_md) == len(lem_cd) else random.randint(0, 100)
Processing data
First, we apply the fetchLemmantizedWords function on each dataframe. The lowercasing, regex replace and split of the sentences are done before the function is called (note that re.sub is used here, since a plain str.replace does not interpret regular expressions).
By making the sentence lowercase first, we can simplify the regex to only consider lowercase letters.
import re

for df in (df_md, df_cd):
    df['lem'] = df.apply(lambda x: fetchLemmantizedWords(re.sub(r'[^a-z]', ' ',
                                                                x.Sentences.lower())
                                                           .split()),
                         result_type='reduce',
                         axis=1)
Result for df_cd:
   SI.No                         Sentences                        lem
0      1         Emma is writing a letter.          [29, 5, 4, 9, 28]
1      2  We wake up early in the morning.  [16, 8, 21, 14, 13, 4, 6]
2      3          Did Emma Write a letter?         [30, 9, 23, 16, 5]
3      4          We sleep early at night.          [8, 25, 24, 7, 3]
4      5              Emma wrote a letter.            [30, 30, 15, 7]
Next, we use a cross-join to make a dataframe with all possible combinations of md and cd data.
df_merged = pd.merge(df_md[['SI.No', 'lem']],
                     df_cd[['SI.No', 'lem']],
                     how='cross',
                     suffixes=('_md', '_cd'))
df_merged contents:
   SI.No_md                      lem_md  SI.No_cd                      lem_cd
0         1          [14, 22, 9, 21, 4]         1            [3, 4, 8, 17, 2]
1         1          [14, 22, 9, 21, 4]         2  [29, 3, 10, 2, 19, 18, 21]
2         1          [14, 22, 9, 21, 4]         3          [20, 22, 29, 4, 3]
3         1          [14, 22, 9, 21, 4]         4          [17, 7, 1, 27, 19]
4         1          [14, 22, 9, 21, 4]         5              [17, 5, 3, 29]
5         2  [12, 30, 10, 11, 7, 11, 8]         1            [3, 4, 8, 17, 2]
6         2  [12, 30, 10, 11, 7, 11, 8]         2  [29, 3, 10, 2, 19, 18, 21]
7         2  [12, 30, 10, 11, 7, 11, 8]         3          [20, 22, 29, 4, 3]
8         2  [12, 30, 10, 11, 7, 11, 8]         4          [17, 7, 1, 27, 19]
9         2  [12, 30, 10, 11, 7, 11, 8]         5              [17, 5, 3, 29]
Next, we calculate the cosine value:
df_merged['cosine'] = df_merged.apply(lambda x: fetchCosine(x.lem_md,
                                                            x.lem_cd),
                                      axis=1)
In the last step, we pivot the data and merge the original df_cd with the calculated results:
pd.merge(df_cd.drop(columns='lem').set_index('SI.No'),
         df_merged.pivot_table(index='SI.No_cd',
                               columns='SI.No_md').droplevel(0, axis=1),
         how='inner',
         left_index=True,
         right_index=True)
Result (again, these are dummy calculations):
                              Sentences    1    2
SI.No
1             Emma is writing a letter.  100   64
2      We wake up early in the morning.   63  100
3              Did Emma Write a letter?  100    5
4              We sleep early at night.  100   17
5                  Emma wrote a letter.   35    9
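Since the question ultimately wants a CSV file, the merged result can be written out directly. A minimal sketch, assuming the merge above is assigned to a variable (result and the output file name below are placeholders, not from the original post):
result = pd.merge(df_cd.drop(columns='lem').set_index('SI.No'),
                  df_merged.pivot_table(index='SI.No_cd',
                                        columns='SI.No_md').droplevel(0, axis=1),
                  how='inner',
                  left_index=True,
                  right_index=True)
result.to_csv('ComponentDataWithSimilarity.csv')  # hypothetical output path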
Related
I'm stuck on trying to add values to the numbers column.
import pandas as pd
def twod_array(num):
    data = {"group": [-1, 0, 1, 2],
            'numbers': [[2], [14, 15], [16, 17], [19, 20, 21]],
            }
    df = pd.DataFrame(data=data)
    print(df)
    return 0
Currently it prints this:
group numbers
0 -1 [2]
1 0 [14, 15]
2 1 [16, 17]
3 2 [19, 20, 21]
What I'd like to do is to add a value based on the passed input, so for example if I pass 14.5 as a num, this is the output I'd like to see:
group numbers
0 -1 [2]
1 0 [14, 14.5, 15]
2 1 [16, 17]
3 2 [19, 20, 21]
I'm hoping someone can help with this. This is what I have so far, but it fails at the insert line with the error "'numpy.ndarray' object has no attribute 'insert'".
df = pd.DataFrame({"group": [-1, 0, 1, 2],
                   'numbers': [[2], [14, 15], [16, 17], [19, 20, 21]],
                   })
arr = df['numbers'].to_list()
num = 14.5
for i, sub_arr in enumerate(arr):
    for j, n in enumerate(sub_arr):
        if arr[i][j] > num:
            if j != 0:
                arr[i].insert(j, num)
            else:
                arr[i-1].insert(-1, num)
df['numbers'] = arr
One way: find the row whose list brackets num with a mask, then append and sort in place:
num = 14.5
mask = (df.numbers.apply(min).lt(num) &
        df.numbers.apply(max).gt(num))
index = mask[mask].index[0]
df.numbers.at[index].append(num)
df.numbers.at[index].sort()
print(df)
# Output:
group numbers
0 -1 [2]
1 0 [14, 14.5, 15]
2 1 [16, 17]
3 2 [19, 20, 21]
Iterate through the df and see if num lies between the first and last value of the numbers column. If it does, use the bisect module to insert num in a sorted fashion.
import bisect

for i in range(len(df)):
    if num >= df.loc[i, 'numbers'][0] and num <= df.loc[i, 'numbers'][-1]:
        bisect.insort(df.loc[i, 'numbers'], num)
print(df)
group numbers
0 -1 [2]
1 0 [14, 14.5, 15]
2 1 [16, 17]
3 2 [19, 20, 21]
Is there a simpler way, using e.g. numpy, to get samples for a given X and delta than the below code?
>>> X = [1, 4, 5, 6, 11, 13, 15, 20, 21, 22, 25, 30]
>>> delta = 5
>>> samples = [X[0]]
>>> for x in X:
...     if x - samples[-1] >= delta:
...         samples.append(x)
>>> samples
[1, 6, 11, 20, 25, 30]
If you are aiming to "vectorize" the process for performance reasons (e.g. using numpy), you could compute, for each element, the number of elements that are less than that element plus delta. This yields an index for each item to select, where items that need to be skipped get the same index as the preceding item to be kept (and so drop out after np.unique).
import numpy as np
X = np.array([1, 4, 5, 6, 11, 13, 15, 20, 21, 22, 25, 30])
delta = 5
i = np.sum(X<X[:,None]+delta,axis=1) # index of first to keep
i = np.insert(i[:-1],0,0) # always want the first, never the last
Y = X[np.unique(i)] # extract values as unique indexes
print(Y)
[ 1 6 11 20 25 30]
This assumes that the numbers are in ascending order.
[EDIT]
As indicated in my comment, the above solution is flawed and will only work some of the time. Although vectorizing a Python function does not fully leverage numpy's parallelism (and is slower than the Python loop), it is possible to implement the filter like this:
X = np.array([1, 4, 5, 6, 10,11,12, 13, 15, 20, 21, 22, 25, 30])
delta = 5
fdelta = np.frompyfunc(lambda a, b: a if a + delta > b else b, 2, 1)
Y = X[X == fdelta.accumulate(X, dtype=object)]
print(Y)
[ 1 6 11 20 25 30]
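To see why this works: fdelta.accumulate carries the last kept value forward, and an element is kept exactly when it equals that running value, i.e. when it is at least delta above the previously kept element. A peek at the intermediate array for the X above:
acc = fdelta.accumulate(X, dtype=object)
print(acc)  # [1 1 1 6 6 11 11 11 11 20 20 20 25 30]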
I have an input df:
input_ = pd.DataFrame.from_records(
    [
        ['X_val', 'Y_val1', 'Y_val2', 'Y_val3'],
        [1, 10, 11, 31],
        [2, 20, 12, 21],
        [3, 30, 13, 11],])
and want to concat every Y value while still keeping track of which column each value came from, for plotting and analysis.
I have multiple files with a variable number of Y columns. I ended up concatenating them column by column and extending with a repeated label column, but was wondering if there is a better solution, because mine is terribly tedious.
expected_output_ = pd.DataFrame.from_records(
    [
        ['X_val', 'Y_val', 'Y_type'],
        [1, 10, 'Y_val1'],
        [1, 11, 'Y_val2'],
        [1, 31, 'Y_val3'],
        [2, 20, 'Y_val1'],
        [2, 12, 'Y_val2'],
        [2, 21, 'Y_val3'],
        [3, 30, 'Y_val1'],
        [3, 13, 'Y_val2'],
        [3, 11, 'Y_val3'],])
You can use pandas.DataFrame.melt :
input_.melt(
    id_vars=['X_val'],
    value_vars=['Y_val1', 'Y_val2', 'Y_val3'],
    var_name='Y_type',
    value_name='Y_val'
).sort_values(['X_val'], ignore_index=True)
Alternatively, as suggested by @Vishnudev, you can also use the following variation, especially for a large number of similarly named Y_val* columns:
input_.melt(
    id_vars=['X_val'],
    value_vars=input_.filter(regex='Y_val').columns,
    var_name='Y_type',
    value_name='Y_val'
).sort_values(['X_val'], ignore_index=True)
Output:
X_val Y_type Y_val
0 1 Y_val1 10
1 1 Y_val2 11
2 1 Y_val3 31
3 2 Y_val1 20
4 2 Y_val2 12
5 2 Y_val3 21
6 3 Y_val1 30
7 3 Y_val2 13
8 3 Y_val3 11
Optionally, you can rearrange the column sequence if you like.
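For instance, a plain column re-selection does it; a minimal sketch, assuming the melted frame is assigned to a variable out (a name used purely for illustration):
out = input_.melt(id_vars=['X_val'],
                  value_vars=input_.filter(regex='Y_val').columns,
                  var_name='Y_type',
                  value_name='Y_val').sort_values(['X_val'], ignore_index=True)
out = out[['X_val', 'Y_val', 'Y_type']]  # match the column order of expected_output_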
I am new to python and its libraries. Searched all the forums but could not find a proper solution. This is the first time posting a question here. Sorry if I did something wrong.
So, I have two DataFrames like below containing X Y Z coordinates (UTM) and other features.
In [2]: a = {
...: 'X': [1, 2, 5, 7, 10, 5, 2, 3, 24, 21],
...: 'Y': [3, 4, 8, 15, 20, 12, 23, 22, 14, 7],
...: 'Z': [12, 4, 9, 16, 13, 1, 8, 17, 11, 19],
...: }
...:
In [3]: b = {
...: 'X': [1, 8, 20, 7, 32],
...: 'Y': [6, 4, 17, 45, 32],
...: 'Z': [52, 12, 6, 8, 31],
...: }
In [4]: df1 = pd.DataFrame(data=a)
In [5]: df2 = pd.DataFrame(data=b)
In [6]: print(df1)
X Y Z
0 1 3 12
1 2 4 4
2 5 8 9
3 7 15 16
4 10 20 13
5 5 12 1
6 2 23 8
7 3 22 17
8 24 14 11
9 21 7 19
In [7]: print(df2)
X Y Z
0 1 6 52
1 8 4 12
2 20 17 6
3 7 45 8
4 32 32 31
I need to find the closest point (by distance) in df1 to each point of df2 and create a new DataFrame.
So I wrote the code below, which actually finds the closest point to df2.iloc[0]:
In [8]: x = (
...: np.sqrt(
...: ((df1['X'].sub(df2["X"].iloc[0]))**2)
...: .add(((df1['Y'].sub(df2["Y"].iloc[0]))**2))
...: .add(((df1['Z'].sub(df2["Z"].iloc[0]))**2))
...: )
...: ).idxmin()
In [9]: x1 = df1.iloc[[x]]
In[10]: print(x1)
X Y Z
3 7 15 16
So, I guess I need a loop to iterate through df2 and apply the above code to each row. As a result I need a new, updated df1 containing all the closest points to each point of df2. But I couldn't make it work. Please advise.
This is actually a great example of a case where numpy's broadcasting rules have distinct advantages over pandas.
Manually aligning df1's coordinates as column vectors (by referencing df1[[col]].to_numpy()) and df2's coordinates as row vectors (df2[col].to_numpy()), we can get the distance from every element in each dataframe to each element in the other very quickly with automatic broadcasting:
In [26]: dists = np.sqrt(
...: (df1[['X']].to_numpy() - df2['X'].to_numpy()) ** 2
...: + (df1[['Y']].to_numpy() - df2['Y'].to_numpy()) ** 2
...: + (df1[['Z']].to_numpy() - df2['Z'].to_numpy()) ** 2
...: )
In [27]: dists
Out[27]:
array([[40.11234224, 7.07106781, 24.35159132, 42.61455151, 46.50806382],
[48.05205511, 10. , 22.29349681, 41.49698784, 49.12229636],
[43.23193264, 5.83095189, 17.74823935, 37.06750599, 42.29657197],
[37.58989226, 11.74734012, 16.52271164, 31.04834939, 33.74907406],
[42.40283009, 16.15549442, 12.56980509, 25.67099531, 30.85449724],
[51.50728104, 13.92838828, 16.58312395, 33.7934905 , 45.04442252],
[47.18050445, 20.32240143, 19.07878403, 22.56102835, 38.85871846],
[38.53569774, 19.33907961, 20.85665361, 25.01999201, 33.7194306 ],
[47.68647607, 18.89444363, 7.07106781, 35.48239 , 28.0713377 ],
[38.60051813, 15.06651917, 16.43167673, 41.96427052, 29.83286778]])
Argmin will now give you the correct vector of positional indices:
In [28]: dists.argmin(axis=0)
Out[28]: array([3, 2, 8, 6, 8])
Or, to select the appropriate values from df1:
In [29]: df1.iloc[dists.argmin(axis=0)]
Out[29]:
X Y Z
3 7 15 16
2 5 8 9
8 24 14 11
6 2 23 8
8 24 14 11
Edit
An answer popped up just after mine, then was deleted, which made reference to scipy.spatial.distance_matrix, computing dists with:
distance_matrix(df1[list('XYZ')].to_numpy(), df2[list('XYZ')].to_numpy())
Not sure why that answer was deleted, but this seems like a really nice, clean approach to getting the array I produced manually above!
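A minimal, self-contained sketch of that approach (with the same argmin selection as the manual version above):
from scipy.spatial import distance_matrix

dists = distance_matrix(df1[list('XYZ')].to_numpy(), df2[list('XYZ')].to_numpy())
closest = df1.iloc[dists.argmin(axis=0)]  # one row of df1 per point of df2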
Performance Note
Note that if you are just trying to get the closest value, there's no need to take the square root, as this is a costly operation compared to addition, subtraction, and powers, and sorting on dist**2 is still valid.
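A sketch of that idea applied to the broadcast computation above; since squaring is monotonic for non-negative values, argmin over the squared distances picks the same rows:
# same as dists above, minus the np.sqrt
sq_dists = (
    (df1[['X']].to_numpy() - df2['X'].to_numpy()) ** 2
    + (df1[['Y']].to_numpy() - df2['Y'].to_numpy()) ** 2
    + (df1[['Z']].to_numpy() - df2['Z'].to_numpy()) ** 2
)
df1.iloc[sq_dists.argmin(axis=0)]  # identical selection to using true distances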
First, you define a function that returns the closest point using numpy.where. Then you use the apply function to run through df2.
import pandas as pd
import numpy as np

a = {
    'X': [1, 2, 5, 7, 10, 5, 2, 3, 24, 21],
    'Y': [3, 4, 8, 15, 20, 12, 23, 22, 14, 7],
    'Z': [12, 4, 9, 16, 13, 1, 8, 17, 11, 19]
}
b = {
    'X': [1, 8, 20, 7, 32],
    'Y': [6, 4, 17, 45, 32],
    'Z': [52, 12, 6, 8, 31]
}
df1 = pd.DataFrame(a)
df2 = pd.DataFrame(b)
dist = lambda dx, dy, dz: np.sqrt(dx**2 + dy**2 + dz**2)

def closest(row):
    darr = dist(df1['X'] - row['X'], df1['Y'] - row['Y'], df1['Z'] - row['Z'])
    idx = np.where(darr == np.amin(darr))[0][0]
    return df1['X'][idx], df1['Y'][idx], df1['Z'][idx]

df2['closest'] = df2.apply(closest, axis=1)
print(df2)
Output:
X Y Z closest
0 1 6 52 (7, 15, 16)
1 8 4 12 (5, 8, 9)
2 20 17 6 (24, 14, 11)
3 7 45 8 (2, 23, 8)
4 32 32 31 (24, 14, 11)
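If separate coordinate columns are more convenient than a column of tuples, one possible follow-up (the closest_* column names here are hypothetical, not from the original answer):
df2[['closest_X', 'closest_Y', 'closest_Z']] = pd.DataFrame(df2['closest'].tolist(),
                                                            index=df2.index)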
I have an array, and I want to sum specific elements while iterating through it. I'm struggling to find a way to do this with a loop.
The array shape is (25, 25). Three of its rows (the ones referenced below as indices 21, 22 and 23) look like this:
[[ 92843,  86851,  91950,  98232,  83329,  94591,  88962,  97020,
  107113,  98452, 103242, 106442, 123032, 119063, 112971, 114715,
  108654, 114856, 109872, 124583, 120518, 112815, 120780, 127831,
  147174],
 [132633, 124073, 131357, 140331, 119041, 135131, 127089, 138601,
  153019, 140647, 147489, 152061, 175761, 170090, 161388, 163879,
  155221, 164080, 156960, 177976, 172169, 161165, 172544, 182617,
  210249],
 [159159, 148887, 157629, 168397, 142849, 162157, 152507, 166321,
  183623, 168776, 176986, 182473, 210913, 204108, 193665, 196655,
  186265, 196896, 188352, 213571, 206602, 193398, 207052, 219140,
  252298]]
I want to print out results like below for each iteration
print(array[23][0]+array[23][1]) # 159159 + 148887 = 308046
print(array[22][0]+array[22][1]+array[22][2]) #132633 + 124073 + 131357 = 388063
print(array[21][0]+array[21][1]+array[21][2]+array[21][3]) # 92843 + 86851 + 91950 + 98232 = 369876
Writing each element as array[i][j]: with each iteration, i decreases by 1 while the number of summed j elements increases by one.
Is there any way I can use a loop to do this task? Thanks!
Try this:
for i, sub in enumerate(reversed(array)):
    print(sum(sub[:i]))
For example, if
array = [[ 1, 2, 3, 4, 5],
[ 6, 7, 8, 9, 10],
[11, 12, 13, 14, 15],
[16, 17, 18, 19, 20],
[21, 22, 23, 24, 25]]
the output would be
0 # last row, no elements summed
16 # 16 = 16
23 # 11 + 12 = 23
21 # 6 + 7 + 8 = 21
10 # 1 + 2 + 3 + 4 = 10
You may simply want np.tril, followed by a sum along axis 1. This will give the sum of each row of the lower triangle of the matrix. Easily altered to give the upper triangle, if that's what you need.
print(np.sum(np.tril(array), axis=1))
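To reproduce the exact pattern in the question (the summed prefix shrinks as the row index grows), one possible variant is to flip the array vertically and exclude the diagonal; on the 5x5 example above this prints [ 0 16 23 21 10], matching the loop-based answer:
# flipud reverses the row order; k=-1 drops the diagonal itself
print(np.tril(np.flipud(array), k=-1).sum(axis=1))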
In [661]: arr = np.arange(1,17).reshape(4,4)
In [662]: arr
Out[662]:
array([[ 1, 2, 3, 4],
[ 5, 6, 7, 8],
[ 9, 10, 11, 12],
[13, 14, 15, 16]])
In [666]: for i in range(3,-1,-1):
     ...:     c = arr[i,:4-i]
     ...:     print(c.sum(), c)
     ...:
13 [13]
19 [ 9 10]
18 [5 6 7]
10 [1 2 3 4]