import numpy as np
import pandas as pd
from scipy.spatial.distance import directed_hausdorff
df:
1 1.1 2 2.1 3 3.1 4 4.1
45.13 7.98 45.10 7.75 45.16 7.73 NaN NaN
45.35 7.29 45.05 7.68 45.03 7.96 45.05 7.65
Distance calculated for one pair of columns:
x = df['3']
y = df['3.1']
P = np.array([x, y])
q = df['4']
w = df['4.1']
Q = np.array([q, w])
Q_final = list(zip(Q[0], Q[1]))
P_final = list(zip(P[0], P[1]))
directed_hausdorff(P_final, Q_final)[0]
Desired output:
The same process in a for loop over the whole dataset:
the distance from a['0'] to a['0'] is 0,
from a['0'] to a['1'] is 0.234 (some number),
from a['0'] to a['2'] is ..., and so on.
From [0] to all the others, then from [1] to all the others, etc.
In the end I should get a matrix with 0s on the diagonal.
I have tried:
space = list(df.index)
dist = []
for j in space:
    for k in space:
        if k != j:
            dist.append((j, k, directed_hausdorff(P_final, Q_final)[0]))
But every entry comes out as the same value, the distance between [3] and [4].
I am not entirely sure what you are trying to do.. but based on how you calculated the first one, here is a possible solution:
import pandas as pd
import numpy as np
from scipy.spatial.distance import directed_hausdorff
df = pd.read_csv('something.csv')
groupby = lambda l, n: [tuple(l[i:i+n]) for i in range(0, len(l), n)]
values = groupby(df.columns.values, 2)
matrix = np.zeros((4, 4))
for Ps in values:
    x = df[str(Ps[0])]
    y = df[str(Ps[1])]
    P = np.array([x, y])
    for Qs in values:
        q = df[str(Qs[0])]
        w = df[str(Qs[1])]
        Q = np.array([q, w])
        Q_final = list(zip(Q[0], Q[1]))
        P_final = list(zip(P[0], P[1]))
        matrix[values.index(Ps), values.index(Qs)] = directed_hausdorff(P_final, Q_final)[0]
print(matrix)
Output:
[[0. 0.49203658 0.47927028 0.46861498]
[0.31048349 0. 0.12083046 0.1118034 ]
[0.25179357 0.22135944 0. 0.31064449]
[0.33955854 0.03 0.13601471 0. ]]
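As a rough sketch of the same idea without indexing by position in a list, the loop below builds one point set per (x, y) column pair and fills the matrix with enumerate. The DataFrame is retyped from the question's sample; dropping rows that contain NaN and taking the element-wise maximum to get a symmetric distance are assumptions of this sketch, not part of the answer above.

```python
import numpy as np
import pandas as pd
from scipy.spatial.distance import directed_hausdorff

# Sample data retyped from the question (columns come in x, y pairs).
df = pd.DataFrame({
    "1": [45.13, 45.35], "1.1": [7.98, 7.29],
    "2": [45.10, 45.05], "2.1": [7.75, 7.68],
    "3": [45.16, 45.03], "3.1": [7.73, 7.96],
    "4": [np.nan, 45.05], "4.1": [np.nan, 7.65],
})

cols = list(df.columns)
# One (n_points, 2) array per column pair; rows with NaN are dropped (assumption).
point_sets = [
    df[[cx, cy]].dropna().to_numpy()
    for cx, cy in zip(cols[0::2], cols[1::2])
]

n = len(point_sets)
matrix = np.zeros((n, n))
for i, P in enumerate(point_sets):
    for j, Q in enumerate(point_sets):
        # Directed distance from P to Q; in general not equal to Q to P.
        matrix[i, j] = directed_hausdorff(P, Q)[0]

# Symmetric Hausdorff distance: the larger of the two directed distances.
symmetric = np.maximum(matrix, matrix.T)
print(matrix)
```

The directed distance is not symmetric, which is why the printed matrix differs above and below the diagonal; `symmetric` is only needed if a single distance per pair is wanted.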
I am trying to compute the eigenvalues of a matrix built from the matrix product M^{-1}K.
I know M and K, and I have initialized them properly. I therefore try to compute the inverse of M:
M_inv = np.linalg.inv(M)
with np.printoptions(threshold=np.inf, precision=10, suppress=True,linewidth=20000):
    print(np.matrix(M_inv * M))
That should print the identity, but I get:
Which clearly is not the identity. I need to find the eigenvalues of M_inv * K, but if M_inv is so inaccurate I won't get anything useful. What do I do?
This is the matrix:
And this is my initialization code:
def mij(i, j, h):
    if i==j:
        return 2.0 * h / 3.0
    else:
        return h / 6.0
def kij(i, j, h):
    if i==j:
        return 2.0 / h
    else:
        return -1 / h
n = 500
size=n+1
h = 1 / n
t=np.linspace(0,1,n)
# Get A
M = np.zeros((n, n))
K = np.zeros((n, n))
for i in range(0, n):
    M[i,i] = mij(i, i, h)
    if i+1 < n:
        M[i,i+1] = mij(i, i+1, h)
    if i-1 >= 0:
        M[i,i-1] = mij(i, i-1, h)
    K[i,i] = kij(i, i, h)
    if i+1 < n:
        K[i,i+1] = kij(i, i+1, h)
    if i-1 >= 0:
        K[i,i-1] = kij(i, i-1, h)
Try to compute the inverse column by column using this:
c1 = numpy.linalg.solve(M, [1, 0, ..., 0])
cn = numpy.linalg.solve(M, [0, ..., 0, 1])
An example with a tri-diagonal matrix in this code:
import numpy as np
M = np.array([[1,2,0],[1,4,9],[0,8,27]])
I = np.identity(3)
print(M)
#using inv
Minv1 = np.linalg.inv(M)
#using solve
Minv2 = list()
for i in range(3):
    Minv2.append(np.linalg.solve(M, I[i]))
Minv2 = np.array([list(column) for column in zip(*Minv2)])
#same as:
Minv3 = np.linalg.solve(M, I)
print(Minv1)
print(Minv2)
print(Minv3)
Generated output:
[[ 1 2 0]
[ 1 4 9]
[ 0 8 27]]
[[-2. 3. -1. ]
[ 1.5 -1.5 0.5 ]
[-0.44444444 0.44444444 -0.11111111]]
[[-2. 3. -1. ]
[ 1.5 -1.5 0.5 ]
[-0.44444444 0.44444444 -0.11111111]]
[[-2. 3. -1. ]
[ 1.5 -1.5 0.5 ]
[-0.44444444 0.44444444 -0.11111111]]
The numpy.linalg.solve function is supposed to have higher precision than numpy.linalg.inv.
With n=5:
M = np.array([[1,2,0,0,0],[1,4,9,0,0],[0,8,27,1,0],[0,0,81,1,2],[0,0,0,1,23]])
I = np.identity(len(M))
print(M)
#using inv
Minv1 = np.linalg.inv(M)
#using solve
Minv2 = list()
for i in range(len(M)):
    Minv2.append(np.linalg.solve(M, I[i]))
Minv2 = np.array([list(column) for column in zip(*Minv2)])
#same as:
Minv3 = np.linalg.solve(M, I)
print(Minv1)
print(Minv2)
print(Minv3)
Generated output:
[[ 1 2 0 0 0]
[ 1 4 9 0 0]
[ 0 8 27 1 0]
[ 0 0 81 1 2]
[ 0 0 0 1 23]]
[[ 1.63157895e+00 -6.31578947e-01 -9.21052632e-02 1.00877193e-01
-8.77192982e-03]
[-3.15789474e-01 3.15789474e-01 4.60526316e-02 -5.04385965e-02
4.38596491e-03]
[-4.09356725e-02 4.09356725e-02 -1.02339181e-02 1.12085770e-02
-9.74658869e-04]
[ 3.63157895e+00 -3.63157895e+00 9.07894737e-01 1.00877193e-01
-8.77192982e-03]
[-1.57894737e-01 1.57894737e-01 -3.94736842e-02 -4.38596491e-03
4.38596491e-02]]
[[ 1.63157895e+00 -6.31578947e-01 -9.21052632e-02 1.00877193e-01
-8.77192982e-03]
[-3.15789474e-01 3.15789474e-01 4.60526316e-02 -5.04385965e-02
4.38596491e-03]
[-4.09356725e-02 4.09356725e-02 -1.02339181e-02 1.12085770e-02
-9.74658869e-04]
[ 3.63157895e+00 -3.63157895e+00 9.07894737e-01 1.00877193e-01
-8.77192982e-03]
[-1.57894737e-01 1.57894737e-01 -3.94736842e-02 -4.38596491e-03
4.38596491e-02]]
[[ 1.63157895e+00 -6.31578947e-01 -9.21052632e-02 1.00877193e-01
-8.77192982e-03]
[-3.15789474e-01 3.15789474e-01 4.60526316e-02 -5.04385965e-02
4.38596491e-03]
[-4.09356725e-02 4.09356725e-02 -1.02339181e-02 1.12085770e-02
...
[ 3.63157895e+00 -3.63157895e+00 9.07894737e-01 1.00877193e-01
-8.77192982e-03]
[-1.57894737e-01 1.57894737e-01 -3.94736842e-02 -4.38596491e-03
4.38596491e-02]]
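One further point may be worth checking in the question's code: for NumPy arrays, `*` multiplies element by element, so `M_inv * M` is not a matrix product even when wrapped in `np.matrix` afterwards; `@` (or `np.dot`) is. The sketch below rebuilds M and K with the same entries as the question, compares the two products, and then, as an assumption about what is ultimately wanted, solves the generalized problem K v = lambda M v with `scipy.linalg.eigh` instead of forming M^{-1}K.

```python
import numpy as np
from scipy.linalg import eigh

n = 500
h = 1 / n
off = np.ones(n - 1)
# Same tridiagonal entries as mij / kij in the question.
M = (2 * h / 3) * np.eye(n) + (h / 6) * (np.diag(off, 1) + np.diag(off, -1))
K = (2 / h) * np.eye(n) - (1 / h) * (np.diag(off, 1) + np.diag(off, -1))

M_inv = np.linalg.inv(M)
elementwise = M_inv * M       # element-by-element product, not the identity
identity_check = M_inv @ M    # true matrix product
print(np.allclose(identity_check, np.eye(n)))

# Generalized symmetric eigenproblem K v = lambda M v, without inverting M.
eigenvalues = eigh(K, M, eigvals_only=True)
print(eigenvalues[:5])
```

`eigh` assumes K is symmetric and M is symmetric positive definite, which holds for these two tridiagonal matrices; it returns the eigenvalues in ascending order.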
I have a dataframe called 'erm' with a column 'Calcul'.
I would like to add a new column 'typeRappel' with value 1 wherever erm['Calcul'] has value 4.
This is my code:
# IF ( calcul = 4 ) TypeRappel = 1.
# erm.loc[erm.Calcul = 4, "typeRappel"] = 1
#erm["typeRappel"] = np.where(erm['Calcul'] = 4.0, 1, 0)
# erm["Terminal"] = ["1" if c = "010" for c in erm['Code']]
# erm['typeRappel'] = [ 1 if x == 4 for x in erm['Calcul']]
import numpy as np
import pandas as pd
erm['typeRappel'] = [ 1 if x == 4 for x in erm['Calcul']]
But this code raises an error.
What can be the problem?
You can achieve what you want using apply with a lambda:
import pandas as pd
df = pd.DataFrame(
data=[[1,2],[4,5],[7,8],[4,11]],
columns=['Calcul','other_col']
)
df['typeRappel'] = df['Calcul'].apply(lambda x: 1 if x == 4 else None)
This results in
Calcul  other_col  typeRappel
1       2          NaN
4       5          1.0
7       8          NaN
4       11         1.0
There are two ways to do this.
First way:
use the .loc method, because you have just one condition:
df['new']=None
df.loc[df.Calcul.eq(4), 'new'] =1
Second way:
use the numpy.select method:
import numpy as np
cond=[df.Calcul.eq(4)]
df['new']= np.select(cond, [1], None)
import numpy as np
import pandas as pd
#erm['typeRappel']=None
erm.loc[erm.Calcul.eq(4), 'typeRappel'] = 1
import numpy as np
cond=[erm.Calcul.eq(4)]
erm['ok']= np.select(cond, [1], None)
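For reference, the attempts in the question fail for two reasons: `=` is assignment and cannot be used inside an expression (the comparison is `==`), and a conditional expression inside a list comprehension needs an `else` branch. A minimal vectorized sketch, using a small made-up 'erm' frame in place of the real one:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the real 'erm' dataframe.
erm = pd.DataFrame({"Calcul": [1, 4, 7, 4]})

# Vectorized: 1 where Calcul equals 4, otherwise 0.
erm["typeRappel"] = np.where(erm["Calcul"] == 4, 1, 0)

# The list-comprehension form is valid once it has an else branch.
erm["typeRappel_list"] = [1 if x == 4 else 0 for x in erm["Calcul"]]
print(erm)
```

If the non-matching rows should stay empty rather than 0, `np.nan` can be passed as the third argument of `np.where`.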
My toy example is as follows:
import numpy as np
from sklearn.datasets import load_iris
import pandas as pd
### prepare data
Xy = np.c_[load_iris(return_X_y=True)]
mycol = ['x1','x2','x3','x4','group']
df = pd.DataFrame(data=Xy, columns=mycol)
dat = df.iloc[:100,:] #only consider two species
dat['group'] = dat.group.apply(lambda x: 1 if x ==0 else 2) #two species means two groups
dat.shape
dat.head()
### Linear discriminant analysis procedure
G1 = dat.iloc[:50,:-1]; x1_bar = G1.mean(); S1 = G1.cov(); n1 = G1.shape[0]
G2 = dat.iloc[50:,:-1]; x2_bar = G2.mean(); S2 = G2.cov(); n2 = G2.shape[0]
Sp = (n1-1)/(n1+n2-2)*S1 + (n2-1)/(n1+n2-2)*S2
a = np.linalg.inv(Sp).dot(x1_bar-x2_bar); u_bar = (x1_bar + x2_bar)/2
m = a.T.dot(u_bar); print("Linear discriminant boundary is {} ".format(m))
def my_lda(x):
    y = a.T.dot(x)
    pred = 1 if y >= m else 2
    return y.round(4), pred
xx = dat.iloc[:,:-1]
xxa = xx.agg(my_lda, axis=1)
xxa.shape
type(xxa)
Here xxa is a pandas.core.series.Series with shape (100,). Each element of xxa is a tuple of two values, and I want to convert xxa to a pd.DataFrame with 100 rows x 2 columns. I tried
xxa_df1 = pd.DataFrame(data=xxa, columns=['y','pred'])
which gives ValueError: Shape of passed values is (100, 1), indices imply (100, 2).
Then I continue to try
xxa2 = xxa.to_frame()
# xxa2 = pd.DataFrame(xxa) #equals `xxa.to_frame()`
xxa_df2 = pd.DataFrame(data=xxa2, columns=['y','pred'])
and xxa_df2 has 100 rows x 2 columns, but every value is NaN. What should I do next?
Let's try Series.tolist()
xxa_df1 = pd.DataFrame(data=xxa.tolist(), columns=['y','pred'])
print(xxa_df1)
y pred
0 42.0080 1
1 32.3859 1
2 37.5566 1
3 31.0958 1
4 43.5050 1
.. ... ...
95 -56.9613 2
96 -61.8481 2
97 -62.4983 2
98 -38.6006 2
99 -61.4737 2
[100 rows x 2 columns]
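Another option, sketched here on a small made-up frame rather than the iris data, is to let `DataFrame.apply` expand the returned tuples directly with `result_type="expand"`. The function `score_and_label` and its threshold are hypothetical stand-ins for `my_lda`.

```python
import pandas as pd

# Hypothetical data; the real example uses the four iris features.
df = pd.DataFrame({"x1": [1.0, 2.0, 3.0], "x2": [0.5, 1.5, 4.0]})

def score_and_label(row):
    # Stand-in for my_lda: returns a (score, predicted group) tuple.
    y = row["x1"] - row["x2"]
    return round(y, 4), 1 if y >= 0 else 2

# result_type="expand" turns each returned tuple into separate columns.
result = df.apply(score_and_label, axis=1, result_type="expand")
result.columns = ["y", "pred"]
print(result)
```

This avoids the intermediate Series of tuples; the column names still have to be assigned afterwards because tuples carry no labels.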
I currently have a dataframe in the following format:
step tag_id x_pos y_pos
1 1 5 3
1 2 3 4
2 1 2 2
2 3 1 6
.........................
.........................
N 1 5 7
For each row in the df, I am aiming to add an additional m rows oversampling from a Gaussian distribution for the x and y values (independent). Thus, a df of N = 100 and m = 10 would result in a df length 1010, including the original and oversampled values.
The code I have for this works, but it is extremely slow over a large dataset (N > 100k). There are many operations (creating new arrays/ dfs, use of itertuples, etc.) that I'm sure are hampering performance; I would appreciate any help as to how I can improve the performance so I can generate higher m values over the whole dataset. For instance: input data is from a pandas dataframe, but the multi-variate normal function operates on numpy arrays. Is there a more natural way to implement this through pandas without the copying between numpy arrays and dataframes? Thanks!
Reproducible example:
import pandas as pd
import numpy as np
import random
def gaussianOversample2(row, n):
    mean_x = float(getattr(row,'x_pos'))
    mean_y = float(getattr(row,'y_pos'))
    step = getattr(row, 'step')
    tag_id = getattr(row, 'tag_id')
    sigma = np.array([1,1])
    cov = np.diag(sigma ** 2)
    x,y = np.random.multivariate_normal([mean_x, mean_y], cov, n).T
    # keep the original point as the first sample
    x = np.concatenate(([mean_x], x))
    y = np.concatenate(([mean_y], y))
    steps = np.empty(n+1)
    tags = np.empty(n+1)
    steps.fill(step)
    tags.fill(tag_id)
    return x,y, steps, tags
def oversampleDf(df, n):
    # with input df with step, tag_id, x_pos, y_pos
    data = pd.DataFrame(columns = df.columns)
    count = 0
    for row in df.itertuples(index=False):
        count = count + 1
        temp = np.zeros((len(row), n+1))
        oversample_x, oversample_y, steps, tags = gaussianOversample2(row, n)
        temp[0] = steps
        temp[1] = tags
        temp[2] = oversample_x
        temp[3] = oversample_y
        temp = pd.DataFrame(temp.T, columns = df.columns)
        # DataFrame.append was removed in pandas 2.0; pd.concat is the equivalent
        data = pd.concat([data, temp])
        if count % 1000 == 0:
            print("Row: ", count)
    return data
df = pd.DataFrame([[1, 1, 5, 3],[1, 2, 3, 4],[2, 1, 2, 2],[2, 3, 1, 6]], columns = ['step', 'tag_id', 'x_pos', 'y_pos'])
res = oversampleDf(df, 20)
"""
# Result should be:
step tag_id x_pos y_pos
0 1.0 1.0 5.000000 3.000000
1 1.0 1.0 3.423492 3.886602
2 1.0 1.0 5.404581 2.177559
3 1.0 1.0 4.023274 2.883737
4 1.0 1.0 3.390710 3.038782
.. ... ... ... ...
16 2.0 3.0 1.894151 5.510321
17 2.0 3.0 1.110932 5.281578
18 2.0 3.0 1.623538 4.529825
19 2.0 3.0 -0.576756 7.476872
20 2.0 3.0 -0.866123 5.898048
"""
This is the solution I have found for myself; it is more of a workaround than a technique using quicker methods. I instead write out to a csv file, which I then read in once complete, as so:
def gaussianOversample3(row, n):
    mean_x = float(getattr(row,'x_pos'))
    mean_y = float(getattr(row,'y_pos'))
    step = getattr(row, 'step')
    tag_id = getattr(row, 'tag_id')
    sigma = np.array([1,1])
    cov = np.diag(sigma ** 2)
    x,y = np.random.multivariate_normal([mean_x, mean_y], cov, n).T
    x = np.concatenate(([mean_x], x))
    y = np.concatenate(([mean_y], y))
    steps = np.empty(n+1)
    tags = np.empty(n+1)
    steps.fill(step)
    tags.fill(tag_id)
    pd.DataFrame(data = np.column_stack((steps,tags,x,y))).to_csv("oversample.csv", mode = 'a', header = False)
def oversampleDf2(df, n):
    filename = "oversample.csv"
    d = pd.DataFrame(list())
    d.to_csv(filename)
    #count = 0
    for row in df.itertuples(index=False):
        #count = count + 1
        gaussianOversample3(row, n)
        #if count % 10000 == 0:
        #    print("Row: ", count)
Because of how it is reading the file, I have to do the following:
oversampleDf2(defensive_df2, num_oversamples)
oversampled_df = pd.read_csv("oversample_10.csv", sep= ' ')
oversampled_df.columns = ['col']
oversampled_df = oversampled_df.col.str.split(",",expand=True)
oversampled_df.columns = ['temp', 'step', 'tag_id', 'x_pos', 'y_pos']
oversampled_df = oversampled_df.drop(['temp'], axis = 1)
oversampled_df = oversampled_df.astype(float)
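Because the covariance in the question is diagonal, x and y can be sampled independently, which allows the whole frame to be oversampled in one vectorized step with no per-row loop and no intermediate file. The following is only a sketch under that assumption; the function name `oversample_vectorized` and its `sigma` and `seed` parameters are made up for illustration.

```python
import numpy as np
import pandas as pd

def oversample_vectorized(df, m, sigma=1.0, seed=None):
    """Return each row of df followed by m Gaussian-perturbed copies."""
    rng = np.random.default_rng(seed)
    # Repeat every row m + 1 times: the original plus m noisy copies.
    out = df.iloc[np.repeat(np.arange(len(df)), m + 1)].reset_index(drop=True)
    # Independent N(0, sigma^2) noise for x and y (diagonal covariance).
    noise = rng.normal(0.0, sigma, size=(len(out), 2))
    # Zero the noise on the first copy of each row to keep the original point.
    noise[:: m + 1] = 0.0
    out["x_pos"] = out["x_pos"].to_numpy(dtype=float) + noise[:, 0]
    out["y_pos"] = out["y_pos"].to_numpy(dtype=float) + noise[:, 1]
    return out

df = pd.DataFrame(
    [[1, 1, 5, 3], [1, 2, 3, 4], [2, 1, 2, 2], [2, 3, 1, 6]],
    columns=["step", "tag_id", "x_pos", "y_pos"],
)
res = oversample_vectorized(df, 20, seed=0)
print(res.shape)
```

The work is done by one `np.repeat` and one call to the random generator, so the cost grows with the number of output rows rather than with the number of Python-level iterations.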
I want to create a loop that loads all the combinations of two variables into a dataframe in separate columns. I want variable "a" to hold values between 0 and 1 in 0.1 increments, and the same for variable "b". In other words, there should be 100 iterations when complete, starting with 0 & 0 and ending with 1 & 1.
I've tried the following code:
data = [['Decile 1', 10], ['Decile_2', 15], ['Decile_3', 14]]
staging_table = pd.DataFrame(data, columns = ['Decile', 'Volume'])
profile_table = pd.DataFrame(columns = ['Decile', 'Volume'])
a = 0
b = 0
finished = False
while not finished:
    if b != 1:
        if a != 1:
            a = a + 0.1
            staging_table['CAM1_Modifier'] = a
            staging_table['CAM2_Modifier'] = b
            profile_table = profile_table.append(staging_table)
        else:
            b = b + 0.1
    else:
        finished = True
profile_table
You can use itertools.product to get all the combinations:
import itertools
import pandas as pd
x = [i / 10 for i in range(11)]
df = pd.DataFrame(
list(itertools.product(x, x)),
columns=["a", "b"]
)
# a b
# 0 0.0 0.0
# 1 0.0 0.1
# 2 0.0 0.2
# ... ... ...
# 118 1.0 0.8
# 119 1.0 0.9
# 120 1.0 1.0
#
# [121 rows x 2 columns]
itertools is your friend.
from itertools import product
for a, b in product(map(lambda x: x / 10, range(11)),
                    map(lambda x: x / 10, range(11))):
    ...
range(11) gives us the integers from 0 to 10 inclusive (regrettably, range does not accept floats). Then we divide those values by 10 to get your range from 0 to 1. Then we take the Cartesian product of that iterable with itself to get every combination.
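A side note on the original while loop: adding 0.1 ten times gives 0.9999999999999999 rather than 1, so a test like `a != 1` never becomes false. Building the grid from integers or `np.linspace` avoids that. The sketch below also attaches every (a, b) pair to each row of the question's staging table with a cross join; it assumes pandas 1.2 or later for `how="cross"`, and the rounding step is only there to keep the printed values tidy.

```python
import numpy as np
import pandas as pd

staging_table = pd.DataFrame(
    [["Decile_1", 10], ["Decile_2", 15], ["Decile_3", 14]],
    columns=["Decile", "Volume"],
)

# 0.0, 0.1, ..., 1.0 without accumulating floating-point error.
steps = np.round(np.linspace(0, 1, 11), 1)

# Every combination of the two modifiers as a two-column frame.
modifiers = pd.MultiIndex.from_product(
    [steps, steps], names=["CAM1_Modifier", "CAM2_Modifier"]
).to_frame(index=False)

# Cross join: each staging row is paired with each modifier combination.
profile_table = staging_table.merge(modifiers, how="cross")
print(profile_table.shape)
```

With 11 values per modifier there are 121 combinations, so the result has 3 x 121 rows.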