I have 17 data frames in a list, dataframes. They all have the same column names and length; the first column of each describes the source of the data, which is what differs between them. There are also 7 columns describing the date, and these are identical across the data frames, row for row. So there are 19 columns per data frame in total. What I would like to do is dynamically concatenate the columns that share the same column name, so that I end up with 11 data frames of 24 columns each: 7 describing the date and the other 17 being that shared column gathered from the 17 data frames.
Below is just an example of 3 data frames and the expected outcome.
df1 = pd.DataFrame(np.array([
['a', 1, 3, 9],
['a', 2, 4, 61],
['a', 3, 24, 9]]),
columns=['name', 'date','attr11', 'attr12'])
df2 = pd.DataFrame(np.array([
['b', 1, 5, 19],
['b', 2, 14, 16],
['b', 3, 4, 9]]),
columns=['name','date', 'attr11', 'attr12'])
df3 = pd.DataFrame(np.array([
['c', 1, 3, 49],
['c', 2, 4, 36],
['c', 3, 14, 9]]),
columns=['name', 'date', 'attr11', 'attr12'])
Result
dfattr11 = pd.DataFrame(np.array([
[1, 3, 5, 3],
[2, 4, 14, 4],
[3, 24, 4, 14]]),
columns=['date', 'attr11', 'attr11', 'attr11'])
dfattr12...
new_dataframes = [dfattr11, dfattr12, ...]
I tried using Pandas Python: Concatenate dataframes having same columns for guidance, but it seems like that solution stacks the columns on top of each other rather than placing them side by side.
I know how I would use concat to create a single new data frame, but the challenge is doing it iteratively or dynamically, since there are 17 data frames, each with 11 columns that need to go into their own separate data frames. Any help would be greatly appreciated.
IIUC, you could use pandas.concat to generate a big dataframe with all data and split it using groupby. You will get a dictionary of dataframes as output:
dfs = [df1,df2,df3]
out = {k: d.droplevel(0, axis=1) for k,d in
pd.concat({d['name'].iloc[0]: d.set_index('date')
.drop(columns='name')
for d in dfs}, axis=1)
.groupby(level=1, axis=1)
}
Output:
{'attr11': attr11 attr11 attr11
date
1 3 5 3
2 4 14 4
3 24 4 14,
'attr12': attr12 attr12 attr12
date
1 9 19 49
2 61 16 36
3 9 9 9}
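If you prefer the list layout from the question (new_dataframes = [dfattr11, dfattr12, ...]) rather than a dictionary, a minimal sketch, assuming out was built as above, is:
new_dataframes = [d.reset_index() for d in out.values()]  # 'date' becomes a regular column again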
I have two CSVs, one is the Master-Data and the other is the Component-Data. The Master-Data has two rows and two columns, whereas the Component-Data has 5 rows and two columns.
I'm trying to find the cosine similarity between each of them after tokenization, stemming, and lemmatization, and then append the similarity scores as new columns. I'm unable to append the corresponding values to the column in the data frame, which then needs to be converted to CSV.
My Approach:
import re
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer,WordNetLemmatizer
from collections import Counter
import pandas as pd
portStemmer=PorterStemmer()
wordNetLemmatizer = WordNetLemmatizer()
fields = ['Sentences']
cosineSimilarityList = []
def fetchLemmantizedWords():
eliminatePunctuation = re.sub('[^a-zA-Z]', ' ',value)
convertLowerCase = eliminatePunctuation.lower()
tokenizeData = convertLowerCase.split()
eliminateStopWords = [word for word in tokenizeData if not word in set(stopwords.words('english'))]
stemWords= list(set([portStemmer.stem(value) for value in eliminateStopWords]))
wordLemmatization = [wordNetLemmatizer.lemmatize(x) for x in stemWords]
return wordLemmatization
def fetchCosine(eachMasterData,eachComponentData):
masterDataValues = Counter(eachMasterData)
componentDataValues = Counter(eachComponentData)
bagOfWords = list(masterDataValues.keys() | componentDataValues.keys())
masterDataVector = [masterDataValues.get(bagOfWords, 0) for bagOfWords in bagOfWords]
componentDataVector = [componentDataValues.get(bagOfWords, 0) for bagOfWords in bagOfWords]
masterDataLength = sum(contractElement*contractElement for contractElement in masterDataVector) ** 0.5
componentDataLength = sum(questionElement*questionElement for questionElement in componentDataVector) ** 0.5
dotProduct = sum(contractElement*questionElement for contractElement,questionElement in zip(masterDataVector, componentDataVector))
cosine = int((dotProduct / (masterDataLength * componentDataLength))*100)
return cosine
masterData = pd.read_csv('C:\\Similarity\\MasterData.csv', skipinitialspace=True)
componentData = pd.read_csv('C:\\Similarity\\ComponentData.csv', skipinitialspace=True)
for value in masterData['Sentences']:
eachMasterData = fetchLemmantizedWords()
for value in componentData['Sentences']:
eachComponentData = fetchLemmantizedWords()
cosineSimilarity = fetchCosine(eachMasterData,eachComponentData)
cosineSimilarityList.append(cosineSimilarity)
for value in cosineSimilarityList:
componentData = componentData.append(pd.DataFrame(cosineSimilarityList, columns=['Cosine Similarity']), ignore_index=True)
#componentData['Cosine Similarity'] = value
The expected output, after converting the df to CSV, is the component data with the cosine similarity values appended as columns. I'm facing issues while appending the values to the data frame. Please suggest an approach for this. Thanks.
Here's what I came up with:
Sample setup
csv_master_data = \
"""
SI.No;Sentences
1;Emma is writing a letter.
2;We wake up early in the morning.
"""
csv_component_data = \
"""
SI.No;Sentences
1;Emma is writing a letter.
2;We wake up early in the morning.
3;Did Emma Write a letter?
4;We sleep early at night.
5;Emma wrote a letter.
"""
import pandas as pd
from io import StringIO
df_md = pd.read_csv(StringIO(csv_master_data), delimiter=';')
df_cd = pd.read_csv(StringIO(csv_component_data), delimiter=';')
We end up with 2 dataframes (showing df_cd):
   SI.No                         Sentences
0      1         Emma is writing a letter.
1      2  We wake up early in the morning.
2      3          Did Emma Write a letter?
3      4          We sleep early at night.
4      5              Emma wrote a letter.
I replaced the 2 functions you used by the following dummy functions:
import random
def fetchLemmantizedWords(words):
return [random.randint(1,30) for x in words]
def fetchCosine(lem_md, lem_cd):
return 100 if len(lem_md) == len(lem_cd) else random.randint(0,100)
Processing data
First, we apply the fetchLemmantizedWords function on each dataframe. The regex replace, lowercasing, and splitting of the sentences are done by Pandas string methods instead of inside the function itself (note that plain str.replace would treat the pattern literally, so regex=True is needed).
By making the sentence lowercase first, we can simplify the regex to only consider lowercase letters.
for df in (df_md, df_cd):
    # lower-case, strip everything that is not a letter, split into words,
    # then run the (dummy) lemmatizer on each word list
    df['lem'] = (df['Sentences']
                 .str.lower()
                 .str.replace(r'[^a-z]+', ' ', regex=True)
                 .str.split()
                 .apply(fetchLemmantizedWords))
Result for df_cd:
   SI.No                         Sentences                        lem
0      1         Emma is writing a letter.          [29, 5, 4, 9, 28]
1      2  We wake up early in the morning.  [16, 8, 21, 14, 13, 4, 6]
2      3          Did Emma Write a letter?         [30, 9, 23, 16, 5]
3      4          We sleep early at night.          [8, 25, 24, 7, 3]
4      5              Emma wrote a letter.            [30, 30, 15, 7]
Next, we use a cross-join to make a dataframe with all possible combinations of md and cd data.
df_merged = pd.merge(df_md[['SI.No', 'lem']],
df_cd[['SI.No', 'lem']],
how='cross',
suffixes=('_md','_cd')
)
df_merged contents:
   SI.No_md                      lem_md  SI.No_cd                      lem_cd
0         1          [14, 22, 9, 21, 4]         1            [3, 4, 8, 17, 2]
1         1          [14, 22, 9, 21, 4]         2  [29, 3, 10, 2, 19, 18, 21]
2         1          [14, 22, 9, 21, 4]         3          [20, 22, 29, 4, 3]
3         1          [14, 22, 9, 21, 4]         4          [17, 7, 1, 27, 19]
4         1          [14, 22, 9, 21, 4]         5              [17, 5, 3, 29]
5         2  [12, 30, 10, 11, 7, 11, 8]         1            [3, 4, 8, 17, 2]
6         2  [12, 30, 10, 11, 7, 11, 8]         2  [29, 3, 10, 2, 19, 18, 21]
7         2  [12, 30, 10, 11, 7, 11, 8]         3          [20, 22, 29, 4, 3]
8         2  [12, 30, 10, 11, 7, 11, 8]         4          [17, 7, 1, 27, 19]
9         2  [12, 30, 10, 11, 7, 11, 8]         5              [17, 5, 3, 29]
Next, we calculate the cosine value:
df_merged['cosine'] = df_merged.apply(lambda x: fetchCosine(x.lem_md,
x.lem_cd),
axis=1)
In the last step, we pivot the data and merge the original df_cd with the calculated results:
pd.merge(df_cd.drop(columns='lem').set_index('SI.No'),
df_merged.pivot_table(index='SI.No_cd',
columns='SI.No_md').droplevel(0, axis=1),
how='inner',
left_index=True,
right_index=True)
Result (again, these are dummy calculations):
SI.No                         Sentences    1    2
    1         Emma is writing a letter.  100   64
    2  We wake up early in the morning.   63  100
    3          Did Emma Write a letter?  100    5
    4          We sleep early at night.  100   17
    5              Emma wrote a letter.   35    9
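Since the question ultimately needs a CSV, a minimal sketch, assuming the merge above is assigned to a variable and the file name is just an example, is:
result = pd.merge(df_cd.drop(columns='lem').set_index('SI.No'),
                  df_merged.pivot_table(index='SI.No_cd',
                                        columns='SI.No_md').droplevel(0, axis=1),
                  how='inner', left_index=True, right_index=True)
# write the component data plus similarity columns back to disk
result.to_csv('ComponentDataWithSimilarity.csv', sep=';')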
I am new to Python and its libraries. I searched all the forums but could not find a proper solution. This is the first time I'm posting a question here; sorry if I did something wrong.
So, I have two DataFrames, like the ones below, containing X, Y, Z coordinates (UTM) and other features.
In [2]: a = {
...: 'X': [1, 2, 5, 7, 10, 5, 2, 3, 24, 21],
...: 'Y': [3, 4, 8, 15, 20, 12, 23, 22, 14, 7],
...: 'Z': [12, 4, 9, 16, 13, 1, 8, 17, 11, 19],
...: }
...:
In [3]: b = {
...: 'X': [1, 8, 20, 7, 32],
...: 'Y': [6, 4, 17, 45, 32],
...: 'Z': [52, 12, 6, 8, 31],
...: }
In [4]: df1 = pd.DataFrame(data=a)
In [5]: df2 = pd.DataFrame(data=b)
In [6]: print(df1)
X Y Z
0 1 3 12
1 2 4 4
2 5 8 9
3 7 15 16
4 10 20 13
5 5 12 1
6 2 23 8
7 3 22 17
8 24 14 11
9 21 7 19
In [7]: print(df2)
X Y Z
0 1 6 52
1 8 4 12
2 20 17 6
3 7 45 8
4 32 32 31
I need to find the closest point (by distance) in df1 to each point of df2 and create a new DataFrame from the results.
So I wrote the code below, which does find the closest point to df2.iloc[0].
In [8]: x = (
...: np.sqrt(
...: ((df1['X'].sub(df2["X"].iloc[0]))**2)
...: .add(((df1['Y'].sub(df2["Y"].iloc[0]))**2))
...: .add(((df1['Z'].sub(df2["Z"].iloc[0]))**2))
...: )
...: ).idxmin()
In [9]: x1 = df1.iloc[[x]]
In[10]: print(x1)
X Y Z
3 7 15 16
So I guess I need a loop to iterate through df2 and apply the above code to each row. As a result I need a new, updated df1 containing all the closest points to each point of df2, but I couldn't make it work. Please advise.
This is actually a great example of a case where numpy's broadcasting rules have distinct advantages over pandas.
Manually aligning df1's coordinates as column vectors (by referencing df1[[col]].to_numpy()) and df2's coordinates as row vectors (df2[col].to_numpy()), we can get the distance from every element in each dataframe to each element in the other very quickly with automatic broadcasting:
In [26]: dists = np.sqrt(
...: (df1[['X']].to_numpy() - df2['X'].to_numpy()) ** 2
...: + (df1[['Y']].to_numpy() - df2['Y'].to_numpy()) ** 2
...: + (df1[['Z']].to_numpy() - df2['Z'].to_numpy()) ** 2
...: )
In [27]: dists
Out[27]:
array([[40.11234224, 7.07106781, 24.35159132, 42.61455151, 46.50806382],
[48.05205511, 10. , 22.29349681, 41.49698784, 49.12229636],
[43.23193264, 5.83095189, 17.74823935, 37.06750599, 42.29657197],
[37.58989226, 11.74734012, 16.52271164, 31.04834939, 33.74907406],
[42.40283009, 16.15549442, 12.56980509, 25.67099531, 30.85449724],
[51.50728104, 13.92838828, 16.58312395, 33.7934905 , 45.04442252],
[47.18050445, 20.32240143, 19.07878403, 22.56102835, 38.85871846],
[38.53569774, 19.33907961, 20.85665361, 25.01999201, 33.7194306 ],
[47.68647607, 18.89444363, 7.07106781, 35.48239 , 28.0713377 ],
[38.60051813, 15.06651917, 16.43167673, 41.96427052, 29.83286778]])
Argmin will now give you the correct vector of positional indices:
In [28]: dists.argmin(axis=0)
Out[28]: array([3, 2, 8, 6, 8])
Or, to select the appropriate values from df1:
In [29]: df1.iloc[dists.argmin(axis=0)]
Out[29]:
X Y Z
3 7 15 16
2 5 8 9
8 24 14 11
6 2 23 8
8 24 14 11
Edit
An answer popped up just after mine, then was deleted, which made reference to scipy.spatial.distance_matrix, computing dists with:
distance_matrix(df1[list('XYZ')].to_numpy(), df2[list('XYZ')].to_numpy())
Not sure why that answer was deleted, but this seems like a really nice, clean approach to getting the array I produced manually above!
Performance Note
Note that if you are just trying to get the closest value, there's no need to take the square root, as this is a costly operation compared to addition, subtraction, and powers, and sorting on dist**2 is still valid.
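A minimal sketch of that squared-distance shortcut, reusing the broadcasting from above:
# same broadcasting as before, just without the square root; argmin is unchanged
sq_dists = (
    (df1[['X']].to_numpy() - df2['X'].to_numpy()) ** 2
    + (df1[['Y']].to_numpy() - df2['Y'].to_numpy()) ** 2
    + (df1[['Z']].to_numpy() - df2['Z'].to_numpy()) ** 2
)
closest_rows = df1.iloc[sq_dists.argmin(axis=0)]  # same rows as with the true distances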
First, you define a function that returns the closest point using numpy.where. Then you use the apply function to run through df2.
import pandas as pd
import numpy as np
a = {
'X': [1, 2, 5, 7, 10, 5, 2, 3, 24, 21],
'Y': [3, 4, 8, 15, 20, 12, 23, 22, 14, 7],
'Z': [12, 4, 9, 16, 13, 1, 8, 17, 11, 19]
}
b = {
'X': [1, 8, 20, 7, 32],
'Y': [6, 4, 17, 45, 32],
'Z': [52, 12, 6, 8, 31]
}
df1 = pd.DataFrame(a)
df2 = pd.DataFrame(b)
dist = lambda dx,dy,dz: np.sqrt(dx**2+dy**2+dz**2)
def closest(row):
darr = dist(df1['X']-row['X'], df1['Y']-row['Y'], df1['Z']-row['Z'])
idx = np.where(darr == np.amin(darr))[0][0]
return df1['X'][idx], df1['Y'][idx], df1['Z'][idx]
df2['closest'] = df2.apply(closest, axis=1)
print(df2)
Output:
X Y Z closest
0 1 6 52 (7, 15, 16)
1 8 4 12 (5, 8, 9)
2 20 17 6 (24, 14, 11)
3 7 45 8 (2, 23, 8)
4 32 32 31 (24, 14, 11)
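As a small aside, darr here is a pandas Series, so the same lookup can be written with idxmin instead of numpy.where; a sketch of an equivalent closest function:
def closest(row):
    darr = dist(df1['X'] - row['X'], df1['Y'] - row['Y'], df1['Z'] - row['Z'])
    idx = darr.idxmin()  # index label of the smallest distance
    return tuple(df1.loc[idx, ['X', 'Y', 'Z']])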
I have code in Matlab which I need to translate to Python. Shapes and indices really matter here, since the code works with tensors. I'm a little bit confused: it seems that using order='F' in Python's reshape() should be enough, but when I work with 3D data I noticed that it does not work. For example, if A is an array from 1 to 27 in Python
array([[[ 1, 2, 3],
[ 4, 5, 6],
[ 7, 8, 9]],
[[10, 11, 12],
[13, 14, 15],
[16, 17, 18]],
[[19, 20, 21],
[22, 23, 24],
[25, 26, 27]]])
if I perform A.reshape(3, 9, order='F') I get
[[ 1 4 7 2 5 8 3 6 9]
[10 13 16 11 14 17 12 15 18]
[19 22 25 20 23 26 21 24 27]]
In Matlab, for A = 1:27 reshaped to [3, 3, 3] and then to [3, 9], I get a different array:
1 4 7 10 13 16 19 22 25
2 5 8 11 14 17 20 23 26
3 6 9 12 15 18 21 24 27
And SVD in Matlab and Python gives different results, so is there a way to fix this?
Maybe you also know the correct way of moving multidimensional arrays from Matlab to Python: for example, should I get the same SVD for arange(1, 13).reshape(3, 4) in Python and 1:12 -> reshape(_, [3, 4]) in Matlab, or what is the correct way to work with that? Maybe I can swap axes somehow in Python, or change the order of axes in reshape(x1, x2, x3, ...), to get the same results as in Matlab?
I was having the same issues until I found this Wikipedia article: row- and column-major order.
Python (and C) organizes data arrays in row-major order. As you can see in your first example, the elements first increase along the columns:
array([[[ 1, 2, 3],
- - - -> increasing
Then in the rows
array([[[ 1, 2, 3],
[ 4, <--- new element
When all columns and rows are full, it moves to the next page.
array([[[ 1, 2, 3],
[ 4, 5, 6],
[ 7, 8, 9]],
[[10, <-- new element in next page
In Matlab (as in Fortran), the rows increase first, then the columns, and so on.
For N-dimensional arrays it looks like:
Python (row major -> last dimension is contiguous): [dim1, dim2, ..., dimN]
Matlab (column major -> first dimension is contiguous): the same tensor in memory would look the other way around, [dimN, ..., dim2, dim1]
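To see the two memory orders concretely, here is a small sketch with NumPy's ravel, using the 1-to-27 array from the question:
import numpy as np

A = np.arange(1, 28).reshape(3, 3, 3)  # C (row-major) layout: last axis is contiguous
print(A.ravel(order='C')[:6])          # [1 2 3 4 5 6]       -> walks the last axis first
print(A.ravel(order='F')[:6])          # [ 1 10 19  4 13 22] -> walks the first axis first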
If you want to export n-dim. arrays from python to matlab, the easiest way is to permute the dimensions first:
(in python)
import numpy as np
import scipy.io as sio
A=np.reshape(range(1,28),[3,3,3])
sio.savemat('A',{'A':A})
(in matlab)
load('A.mat')
A=permute(A,[3 2 1]);%dimensions in reverse ordering
reshape(A,9,3)' %gives the same result as A.reshape([3,9]) in python
Just notice that the (9,3) and the (3,9) are intentionally put in reverse order.
In Matlab
A = 1:27;
A = reshape(A,3,3,3);
B = reshape(A,9,3)'
B =
1 2 3 4 5 6 7 8 9
10 11 12 13 14 15 16 17 18
19 20 21 22 23 24 25 26 27
size(B)
ans =
3 9
In Python
A = np.array(range(1,28))
A = A.reshape(3,3,3)
B = A.reshape(3,9)
B
array([[ 1, 2, 3, 4, 5, 6, 7, 8, 9],
[10, 11, 12, 13, 14, 15, 16, 17, 18],
[19, 20, 21, 22, 23, 24, 25, 26, 27]])
np.shape(B)
(3, 9)
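If the goal is simply to reproduce the Matlab matrix from the question in NumPy, a minimal sketch is to use order='F' consistently, both when building the 3-D array and when reshaping it back down:
import numpy as np

# fill and reshape column-major throughout, mirroring Matlab's layout
A = np.arange(1, 28).reshape((3, 3, 3), order='F')
B = A.reshape((3, 9), order='F')
print(B)
# [[ 1  4  7 10 13 16 19 22 25]
#  [ 2  5  8 11 14 17 20 23 26]
#  [ 3  6  9 12 15 18 21 24 27]]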
I wrote a Python program to sort a two-dimensional array by its second column and, if elements in the second column are equal, by the first column. I solved the problem with my rudimentary Python knowledge, but I think it can be improved.
Can anyone help optimize it?
Please also suggest whether using other data types for sorting would be a better option.
#created a two dimensional array
two_dim_array=[[2, 5], [9, 1], [4, 8], [10, 0], [50, 32], [33, 31],[1, 5], [12, 5], [22, 5], [32, 5], [9, 5],[3, 31], [91, 32] ]
#get the length of the array
n_ship=len(two_dim_array)
#sorting two dimensional array by using second column
sort_by_second_column=sorted(two_dim_array, key=lambda x: x[1], reverse=False)
#declared new variable for storing second sorted array
new_final_data=[]
#parameter used to slice the two dimensional column
first_slice=0
#tmp=[]
index=[0]
for m in range(1, n_ship):
#print('m is: '+str(m)+'final_data[m-1][1] is: '+str(final_data[m-1][1])+'final_data[m][1] is: '+str(final_data[m][1]))
#subtracting second column elements to detect changes and saved to array
if(abs(sort_by_second_column[m-1][1]-sort_by_second_column[m][1])!=0):
index.append(m)
# print(index)
l=1
# used the above generated index to slice the data
for z in range(len(index)):
tmp=[]
if(l==1):
first_slice=0
last=index[z+1]
mid_start=index[z]
# print('l is start'+ 'first is '+str(first_slice)+'last is'+str(last))
v=sort_by_second_column[:last]
elif l==len(index):
first_slice=index[z]
# print('l is last'+str(1)+ 'first is '+str(first_slice)+'last is'+str(last))
v=sort_by_second_column[first_slice:]
else:
first_slice=index[z]
last=index[z+1]
#print('l is middle'+str(1)+ 'first is '+str(first_slice)+'last is'+str(last))
v=sort_by_second_column[first_slice:last]
tmp.extend(v)
tmp=sorted(tmp, key=lambda x: x[0], reverse=False)
#print(tmp)
new_final_data.extend(tmp)
# print(new_final_data)
l+=1
for l in range(n_ship):
print(str(new_final_data[l][0])+' '+str(new_final_data[l][1]))
''' Input
2 5
9 1
4 8
10 0
50 32
33 31
1 5
12 5
22 5
32 5
9 5
3 31
91 32
Output
10 0
9 1
1 5
2 5
9 5
12 5
22 5
32 5
4 8
3 31
33 31
50 32
91 32'''
You should read the documentation on sorted(), as this is exactly what you need to use:
https://docs.python.org/3/library/functions.html#sorted
newarray=sorted(two_dim_array, key=lambda x:(x[1],x[0]))
Outputs:
[10, 0]
[9, 1]
[1, 5]
[2, 5]
[9, 5]
[12, 5]
[22, 5]
[32, 5]
[4, 8]
[3, 31]
[33, 31]
[50, 32]
[91, 32]
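As a small aside, the same two-level key can be written with operator.itemgetter instead of a lambda:
from operator import itemgetter

newarray = sorted(two_dim_array, key=itemgetter(1, 0))  # second column first, then first column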