My question is the following:
I have an array that holds the feature vectors of several audio files, so with 10 audio files, for example, the array has length 10.
One of the features is itself a list (it holds the values of a specific feature of the audio file), and for a given audio file the feature vector looks like this:
array([0.03861840871664194, 187.72393405210002, 62.59881268743305,
array([4963.40332031, 3229.98046875, 2691.65039062, 3208.44726562,
4338.94042969, 4220.5078125 , 4166.67480469, 4801.90429688,
5555.56640625, 5910.86425781, 6115.4296875 , 5706.29882812,
4984.93652344, 2756.25 , 1991.82128906, 2551.68457031,
2734.71679688, 2906.98242188, 3143.84765625, 3219.21386719,
3186.9140625 , 3165.38085938, 3068.48144531, 2465.55175781,
2110.25390625, 2508.61816406, 2993.11523438, 3843.67675781,
4715.77148438, 5652.46582031, 5480.20019531, 5792.43164062,
5932.39746094, 6244.62890625, 6072.36328125, 6201.5625 ,
6158.49609375, 6201.5625 , 6233.86230469, 6061.59667969])],
Now when I try to feed this data into the svm model:
from sklearn import svm
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
import matplotlib.pyplot as plt
X_train, X_val, y_train, y_val = train_test_split(X,y,test_size=0.3)
model = svm.SVC().fit(X_train, y_train)
yt_p = model.predict(X_train)
yv_p = model.predict(X_val)
I get this error: ValueError: setting an array element with a sequence.
How can I structure my feature vector so that I can feed it to the SVM?
Here is an example of X.
With 5 audio files, X is:
array([[0.017455393927437918, 227.66237105624407, 32.42076654734572,
array([1851.85546875, 2433.25195312, 3057.71484375, 3079.24804688,
3079.24804688, 3068.48144531, 3046.94824219, 3359.1796875 ,
3908.27636719, 4618.87207031, 4618.87207031, 4521.97265625,
4091.30859375, 3111.54785156, 3100.78125 , 2863.91601562,
1561.15722656, 1119.7265625 , 1065.89355469, 947.4609375 ,
979.76074219, 990.52734375, 990.52734375, 1356.59179688,
2077.95410156, 2993.11523438, 3025.41503906, 3068.48144531,
3079.24804688, 3090.01464844, 3100.78125 , 3111.54785156,
2993.11523438, 3100.78125 , 3079.24804688, 2853.14941406,
1205.859375 , 1281.22558594, 1614.99023438, 2131.78710938,
2325.5859375 , 2034.88769531, 1916.45507812, 1744.18945312,
1851.85546875, 2357.88574219, 2368.65234375, 1916.45507812,
1959.52148438, 1959.52148438, 1754.95605469, 1787.25585938,
[0.042435441297643324, 128.81225073038124, 20.912528554426807,
array([4349.70703125, 4242.04101562, 4274.34082031, 4123.60839844,
4457.37304688, 4834.20410156, 4661.93847656, 4306.640625 ,
4231.27441406, 4543.50585938, 4435.83984375, 6201.5625 ,
8817.84667969, 8817.84667969, 742.89550781, 721.36230469,
732.12890625, 732.12890625, 710.59570312, 721.36230469,
925.92773438, 1119.7265625 , 1141.25976562, 1431.95800781,
7762.71972656, 7934.98535156, 7891.91894531, 7332.05566406,
3789.84375 , 2799.31640625, 2831.61621094, 2217.91992188,
581.39648438, 602.9296875 , 2217.91992188, 2228.68652344,
2368.65234375, 2519.38476562, 2863.91601562, 3682.17773438,
3649.87792969, 4188.20800781, 4112.84179688])],
[0.006295381642571726, 130.28309914454434, 5.193614287487564,
array([7978.05175781, 8010.3515625 , 8118.01757812, 8430.24902344,
8257.98339844, 8451.78222656, 8591.74804688, 8677.88085938,
8796.31347656, 8850.14648438, 8796.31347656, 8925.51269531,
6244.62890625, 344.53125 , 344.53125 , 1614.99023438,
2325.5859375 , 2971.58203125, 3316.11328125, 3617.578125 ,
3294.58007812, 2788.54980469, 2637.81738281, 2702.41699219,
2723.95019531, 3133.08105469, 3413.01269531, 5663.23242188,
5770.8984375 , 5577.09960938, 2228.68652344, 1604.22363281,
1690.35644531, 4123.60839844, 5566.33300781, 5803.19824219,
5749.36523438, 5846.26464844, 6772.19238281, 7073.65722656,
7622.75390625, 7859.61914062, 8236.45019531, 8441.015625 ,
8699.4140625 , 8807.08007812, 8742.48046875, 8667.11425781,
8710.18066406, 8947.04589844, 9140.84472656, 9130.078125 ,
8936.27929688, 8925.51269531, 8947.04589844, 8925.51269531,
9097.77832031, 9205.44433594, 9194.67773438, 9140.84472656,
9162.37792969, 9043.9453125 , 9162.37792969, 9108.54492188,
9183.91113281, 9280.81054688, 9270.04394531, 9108.54492188,
9076.24511719, 9356.17675781, 9226.97753906, 9216.2109375 ,
9248.51074219, 9140.84472656, 9237.74414062, 9334.64355469,
9259.27734375, 9226.97753906, 9216.2109375 , 9108.54492188,
9183.91113281, 9216.2109375 , 9248.51074219, 9259.27734375,
[0.017070271599460656, 171.91660927761163, 26.854424936811768,
array([4715.77148438, 4629.63867188, 4898.80371094, 5275.63476562,
4941.87011719, 4532.73925781, 4618.87207031, 4995.703125 ,
4705.00488281, 4500.43945312, 4188.20800781, 4371.24023438,
4457.37304688, 4188.20800781, 4909.5703125 , 4877.27050781,
6761.42578125, 7708.88671875, 7719.65332031, 7956.51855469,
8484.08203125, 9033.17871094, 9043.9453125 , 9000.87890625,
9011.64550781, 9011.64550781, 9000.87890625, 9108.54492188,
8817.84667969, 6686.05957031, 1808.7890625 , 1830.32226562,
1851.85546875, 1636.5234375 , 1022.82714844, 1281.22558594,
1927.22167969, 1948.75488281, 1302.75878906, 1399.65820312,
1873.38867188, 1959.52148438, 7245.92285156, 9011.64550781,
9420.77636719, 9549.97558594, 9453.07617188, 9431.54296875,
9410.00976562, 9248.51074219, 9151.61132812, 9194.67773438,
8968.57910156, 8634.81445312, 8268.75 , 7439.72167969,
5501.73339844, 5232.56835938, 5103.36914062, 7052.12402344,
7299.75585938, 7127.49023438, 7192.08984375, 5673.99902344,
5523.26660156, 5986.23046875, 6729.12597656, 6309.22851562,
5135.66894531, 5081.8359375 , 5329.46777344, 5404.83398438])]],
You can feed the feature with the lists inside to your model in two ways:
Treat the list as additional features
Map all of its elements into a single number with a function you deem appropriate (min, median, mean, max, sum, etc.).
To try the first option:
import pandas as pd
# Convert `X` to a data frame
X = pd.DataFrame(X)
# Rename columns
X.columns = ['feature_' + str(i + 1) for i in range(X.shape[1])]
# Convert the feature with lists inside to long format
x = X['feature_5'].explode().to_frame()
# Create counter by observation so we can pivot
x['observation_id'] = x.groupby(level=0).cumcount()
# Pivot to wide format: one column per list element, shorter lists padded with 0
x = x.pivot(columns='observation_id', values='feature_5').fillna(0)
x = x.add_prefix('list_element_')
# Drop `feature_5` from X
X = X.drop(columns='feature_5')
# Concatenate X and x together
X = pd.concat([X, x], axis=1)
# Carry on as before
X_train, X_val, y_train, y_val = train_test_split(X,y,test_size=0.3)
model = svm.SVC().fit(X_train, y_train)
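The zero-padding idea behind the first option can also be sketched with NumPy alone. This is only a minimal illustration, not the code above: the helper name `expand_list_feature` and the toy rows are made up, and it assumes the variable-length array is the last entry of every row.

```python
import numpy as np

def expand_list_feature(rows):
    """Turn rows of [scalar, ..., array] into one 2-D float array.

    Assumes the variable-length array is the last entry of every row.
    Shorter arrays are padded with zeros on the right.
    """
    longest = max(len(row[-1]) for row in rows)
    n_scalar = len(rows[0]) - 1
    out = np.zeros((len(rows), n_scalar + longest))
    for i, row in enumerate(rows):
        out[i, :n_scalar] = row[:-1]                        # plain scalar features
        out[i, n_scalar:n_scalar + len(row[-1])] = row[-1]  # list elements
    return out

# Toy rows (made-up values): three scalars plus a list of varying length
rows = [
    [0.01, 227.6, 32.4, np.array([1851.8, 2433.2, 3057.7])],
    [0.04, 128.8, 20.9, np.array([4349.7, 4242.0])],
]
X_flat = expand_list_feature(rows)
print(X_flat.shape)
```

The result is an ordinary rectangular float array, which is the shape an estimator such as `svm.SVC` expects. Whether padding with zeros is sensible depends on what the list values represent.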
There's no right answer to the second option and only you can decide how to do this because only you know what the lists mean. However, if you want to get the mean (for example) of each list and use that as a feature:
import numpy as np
# Get the mean of each list (here X is the original array, not the data frame from option 1)
means = [np.mean(array) for array in X[:, 4]]
# Replace the lists with `means`
X[:, 4] = means
And then carry on with the splitting and fitting.
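If a single number loses too much information, the same idea extends to a few summary statistics per list. A rough sketch, assuming the list is the last entry of each row; `summarize_list_feature` and the toy values are hypothetical, and which statistics make sense depends on what the lists mean.

```python
import numpy as np

def summarize_list_feature(rows, stats=(np.mean, np.std, np.min, np.max)):
    """Replace the trailing list of every row with summary statistics.

    Assumes the variable-length array is the last entry of every row.
    """
    out = []
    for row in rows:
        scalars = list(row[:-1])
        summary = [float(stat(row[-1])) for stat in stats]
        out.append(scalars + summary)
    return np.array(out, dtype=float)

# Toy rows (made-up values): three scalars plus a list of varying length
rows = [
    [0.01, 227.6, 32.4, np.array([1.0, 2.0, 3.0])],
    [0.04, 128.8, 20.9, np.array([4.0, 8.0])],
]
X_summary = summarize_list_feature(rows)
print(X_summary.shape)
```

Every row now has the same number of columns regardless of how long its list was, so no padding is needed.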
I'm building a K-means clustering model with a churn dataset and am getting an error that says ValueError: x and y must be the same size when trying to create a cluster graph.
Both my function and the graph code are below, but in trying to narrow it down, I think it may have something to do with this line of code in the function:
, y=kmeans.cluster_centers_[:,1]
Here's the full code
def Create_kmeans_cluster_graph(df_final, data, n_clusters, x_title, y_title, chart_title):
    """ Display K-means cluster based on data """
    kmeans = KMeans(n_clusters=n_clusters # No of cluster in data
                    , random_state = random_state # Selecting same training data
                   )
    kmeans.fit(data)
    kmean_colors = [plotColor[c] for c in kmeans.labels_]
    fig = plt.figure(figsize=(12,8))
    plt.scatter(x= x_title + '_norm'
                , y= y_title + '_norm'
                , data=data
                , color=kmean_colors # color of data points
                , alpha=0.25 # transparency of data points
               )
    plt.scatter(x=kmeans.cluster_centers_[:,0]
                , y=kmeans.cluster_centers_[:,1]
                , color='black'
                , marker='X' # Marker sign for data points
                , s=100 # marker size
               )
    return kmeans.fit_predict(df_final[df_final.Churn==1][[x_title+'_norm', y_title +'_norm']])
df_final['Cluster'] = -1 # by default set Cluster to -1
df_final.iloc[(df_final.Churn==1),'Cluster'] = Create_kmeans_cluster_graph(df_final
,"Tenure vs Monthlycharges : Churn customer cluster")
You get that error because of this call:
plt.scatter(x= x_title + '_norm'
            , y= y_title + '_norm'
            , data=data
            , color=kmean_colors # color of data points
            , alpha=0.25 # transparency of data points
           )
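With plt.scatter, a string passed as x or y is only looked up in data= when it matches one of its keys; otherwise the string itself is used, and x and y can end up with different sizes (see the help page). Passing the columns explicitly avoids this. You can either do: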
plt.scatter(data[x_title + '_norm'],data[y_title + '_norm'],...)
Or you can use the plot.scatter method of the pandas DataFrame, which I did in an edited version of your function:
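def Create_kmeans_cluster_graph(df_final, data, n_clusters, x_title, y_title, chart_title):
    plotColor = ['k','g','b']
    kmeans = KMeans(n_clusters=n_clusters , random_state = random_state)
    kmeans.fit(data)
    kmean_colors = [plotColor[c] for c in kmeans.labels_]
    fig, ax = plt.subplots(figsize=(12,8))
    data.plot.scatter(x= x_title + '_norm', y= y_title + '_norm',
                      color=kmean_colors, alpha=0.25, ax=ax)
    ax.scatter(x=kmeans.cluster_centers_[:,0], y=kmeans.cluster_centers_[:,1],
               color='black', marker='X', s=100)
    return kmeans.labels_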
On an example dataset, it works:
import pandas as pd
import numpy as np
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
random_state = 42
df_final = pd.DataFrame({'Tenure_norm':np.random.uniform(0,1,50),
,"Tenure vs Monthlycharges : Churn customer cluster")
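The approach of passing the columns explicitly can be sketched end to end on synthetic data. This is an illustration under assumptions rather than the original example: the column names, the three-colour palette, and the uniformly random values are made up, and the non-interactive `Agg` backend is selected only so the script runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no window is needed
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

rng = np.random.default_rng(42)
# Synthetic stand-in for the normalised churn columns (made-up data)
data = pd.DataFrame({
    "Tenure_norm": rng.uniform(0, 1, 50),
    "MonthlyCharges_norm": rng.uniform(0, 1, 50),
})

# Fit first: labels_ and cluster_centers_ only exist after fitting
kmeans = KMeans(n_clusters=3, random_state=42, n_init=10).fit(data)
palette = ["k", "g", "b"]
colors = [palette[label] for label in kmeans.labels_]

fig, ax = plt.subplots(figsize=(12, 8))
# Pass the columns themselves, so x and y always have the same length
ax.scatter(data["Tenure_norm"], data["MonthlyCharges_norm"],
           color=colors, alpha=0.25)
# Overlay the cluster centres
ax.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1],
           color="black", marker="X", s=100)
ax.set(xlabel="Tenure", ylabel="MonthlyCharges",
       title="Tenure vs MonthlyCharges: clusters on synthetic data")
fig.savefig("clusters.png")
```

Because both scatter calls receive arrays rather than column names, a misspelled column raises a `KeyError` at the lookup instead of a size mismatch inside the plotting call.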
I have been trying to use RF regression from scikit-learn, but I’m getting an error with my standard (from docs and tutorials) model. Here is the code:
import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestRegressor
df = pd.read_excel('/home/artyom/myprojects//valuevo/field2019/report/segs_inventar_dataframe/excel_var/invcents.xlsx')
age = df[['AGE_1', 'AGE_2', 'AGE_3', 'AGE_4', 'AGE_5']]
hight = df [['HIGHT_','HIGHT_1', 'HIGHT_2', 'HIGHT_3', 'HIGHT_4', 'HIGHT_5']]
diam = df[['DIAM_', 'DIAM_1', 'DIAM_2', 'DIAM_3', 'DIAM_4', 'DIAM_5']]
za = df[['ZAPSYR_', 'ZAPSYR_1', 'ZAPSYR_2', 'ZAPSYR_3', 'ZAPSYR_4', 'ZAPSYR_5']]
tova = df[['TOVARN_', 'TOVARN_1', 'TOVARN_2', 'TOVARN_3', 'TOVARN_4', 'TOVARN_5']]
#df['average'] = df.mean(numeric_only=True, axis=1)
df['meanage'] = age.mean(numeric_only=True, axis=1)
df['meanhight'] = hight.mean(numeric_only=True, axis=1)
df['mediandiam'] = diam.mean(numeric_only=True, axis=1)
df['medianza'] = za.mean(numeric_only=True, axis=1)
df['mediantova'] = tova.mean(numeric_only=True, axis=1)
unite = df[['gapA_segA','gapP_segP', 'A_median', 'p_median', 'circ_media','fdi_median', 'pfd_median', 'p_a_median', 'gsci_media','meanhight']].dropna()
from sklearn.model_selection import train_test_split as ttsplit
df_copy = unite.copy()
trainXset = df_copy[['gapA_segA','gapP_segP', 'A_median', 'p_median', 'circ_media','fdi_median', 'pfd_median', 'p_a_median', 'gsci_media']]
trainYset = df_copy [['meanhight']]
trainXset_train, trainXset_test, trainYset_train, trainYset_test = ttsplit(trainXset, trainYset, test_size=0.3) # 70% training and 30% test
rf = RandomForestRegressor(n_estimators = 100, random_state = 40)
rf.fit(trainXset_train, trainYset_train)
predictions = rf.predict(trainXset_test)
errors = abs(predictions - trainYset_test)
mape = 100 * (errors / trainYset_test)
accuracy = 100 - np.mean(mape)
print('Accuracy:', round(accuracy, 2), '%.')
But the output doesn’t look right:
---> 24 errors = abs(predictions - trainYset_test)
25 # Calculate mean absolute percentage error (MAPE)
26 mape = 100 * (errors / trainYset_test)
..... somemore track
ValueError: Unable to coerce to Series, length must be 1: given 780
How can I fix it? 780 is the shape of trainYset_test. I’m not asking for a solution (i.e. write code for me), but for advice on why this error happened. I followed everything as in tutorials.
The error message shows that the shapes do not line up: pandas expected something of length 1 (the single column of trainYset_test) but was given the 780 predictions.
Reshape one side so that both have the same shape.
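A small reproduction may make the cause clearer. The sketch below uses made-up numbers in place of the real data; `y_test` stands in for the one-column DataFrame `trainYset_test`, and `predictions` for the 1-D array returned by `rf.predict`.

```python
import numpy as np
import pandas as pd

# Made-up stand-ins: a one-column DataFrame target and 1-D predictions
y_test = pd.DataFrame({"meanhight": [10.0, 20.0, 40.0]})
predictions = np.array([11.0, 18.0, 44.0])

# pandas tries to align the 1-D array with the frame's columns (only one here)
try:
    errors = abs(predictions - y_test)
    raised = False
except ValueError as exc:
    raised = True
    print("ValueError:", exc)

# Flatten the target so both operands are 1-D arrays of equal length
y_true = y_test.to_numpy().ravel()
errors = np.abs(predictions - y_true)
mape = 100 * errors / y_true
accuracy = 100 - mape.mean()
print("Accuracy:", round(accuracy, 2), "%.")
```

Selecting the target as a Series in the first place (single brackets, `df_copy['meanhight']`) would avoid the mismatch as well, since a Series is one-dimensional.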
I solved this by making sure the predictions were the same data type as the actual data. In my case, the failing line was:
MSE = (sum((y_test-predictions)**2))/(len(newX)-len(newX.columns))
I resolved this by casting y_test to be a numpy array:
MSE = (sum((np.array(y_test)-predictions)**2))/(len(newX)-len(newX.columns))
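One caveat with casting to a NumPy array: if `y_test` is a one-column DataFrame rather than a Series, the cast yields shape `(n, 1)`, and subtracting 1-D predictions silently broadcasts to an `(n, n)` matrix instead of raising. A short sketch with made-up values, flattening the target first:

```python
import numpy as np
import pandas as pd

# Made-up stand-ins for the target and the model output
y_test = pd.DataFrame({"target": [10.0, 20.0, 40.0]})
predictions = np.array([11.0, 18.0, 44.0])

as_column = np.array(y_test)             # shape (3, 1)
diff_wrong = as_column - predictions     # broadcasts to shape (3, 3)

flat_target = y_test.to_numpy().ravel()  # shape (3,)
diff_ok = flat_target - predictions      # shape (3,)
sse = float((diff_ok ** 2).sum())        # sum of squared errors

print(diff_wrong.shape, diff_ok.shape, sse)
```

Checking `.shape` on both operands before subtracting is a cheap way to catch this.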