When I pass the output of the code below through some other functions (sigmoid/weight functions etc.), I get an error saying my data 'must be one dimensional'.
The data is from a CSV that is 329 x 31. I have split it because I need the first column as my 'y' value; the remaining 30 columns and all their rows will be my 'X'. How do I go about making this one-dimensional for my functions?
Is this section of code, where I process my data, even the issue? Could it be caused by a later function call? I'm new to Python, so I'm not sure what the cause could be; I was wondering whether I converted my data into an array correctly.
import pandas as pd

df = pd.read_csv('data.csv', header=None)

# splitting dataframe into a 70/30 train/test split
trainingdata = df.sample(frac=0.7)
testingdata = df.drop(trainingdata.index)

# splitting the very first column off as the 'y' value
y = trainingdata.loc[:, 0]

# the remaining columns become the 'X' values
X = trainingdata.loc[:, 1:]

# printing shapes for testing
print(X.shape, y.shape)
If I understand your question correctly, you can flatten the data with flatten(), or use reshape() for more control; for more information, read the documentation. One catch: y here is a pandas Series, which has no flatten() method, so convert it to a NumPy array first:

y = y.to_numpy().flatten()
print(y.ndim)
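For reference, a minimal sketch (reusing the split from the question) of the 1-D vs 2-D distinction that typically triggers 'must be one dimensional' errors:

import numpy as np

y_series = trainingdata.loc[:, 0]           # the Series from the question
y_1d = y_series.to_numpy()                  # shape (n,), ndim == 1
y_2d = y_series.to_numpy().reshape(-1, 1)   # shape (n, 1), ndim == 2

# flatten() or np.ravel() collapses the column vector back to 1-D
print(y_2d.flatten().ndim, np.ravel(y_2d).ndim)  # 1 1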
I have the paths to each Excel file in 'files' (built using this thread).
I was then trying to use a for loop to iterate through each file, gather the flow data, and combine it into a new matrix 'val' by adding it to a new column each time. 'Flow' is also the column name in the Excel file, so I use it inside the loop to select the column I want.
For example:

Excel 1:

Flow data
1
2

Excel 2:

Flow data
3
4

The val matrix should then have:

Excel 1  Excel 2
1        3
2        4
However, I keep getting this error:

could not broadcast input array from shape (105408,1) into shape (105408,)

It seems like a common error, but I haven't been able to solve it from the similar questions on here.
val = np.zeros((105408, 50), int)
for x in range(len(files) + 1):
    dt = pd.read_csv(files[x])
    flow_data = dt[['Flow']]
    val[:, x] = flow_data
    # print(val)
I think you are running into this issue due to the extra pair of brackets around 'Flow'; removing them should make the loop work as you intended: dt[['Flow']] --> dt['Flow']. The double brackets return a one-column DataFrame with shape (105408, 1), while single brackets return a Series with shape (105408,), which is what the assignment val[:, x] = ... expects.
Using a DataFrame might be a better approach for aggregating the results, though: a numpy.ndarray will throw an error if len(files) turns out to be larger than the preset array width (50 in this case). A DataFrame is more flexible for varying file and row counts, which seems to matter here given that you are using len(files) rather than a fixed file count.
Working example (using pd.DataFrame):
aggregate_df = pd.DataFrame()
for x in range(len(files)):  # range(len(files)) already covers every index; the original +1 would run past the end
    dt = pd.read_csv(files[x])
    flow_data = dt['Flow']
    aggregate_df.loc[:, x] = flow_data  # using a df to aggregate results
# print(aggregate_df)
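If the files might have different row counts, pd.concat is a handy alternative; a minimal sketch under that assumption (it aligns on the index and fills any shorter columns with NaN):

# collect each 'Flow' column as a Series, then join them side by side
frames = [pd.read_csv(path)['Flow'].reset_index(drop=True) for path in files]
aggregate_df = pd.concat(frames, axis=1, keys=range(len(files)))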
# Preprocessing original dataframe
def preprocess_df(dataframe):
    x = dataframe.copy()
    try:
        customer_id = x['CustomerID']
        del x['CustomerID']  # Don't need in ML DF
    except KeyError:
        print("already removed customerID")
    ml_dummies = pd.get_dummies(x)
    ml_dummies.fillna(value=0, inplace=True)

    # import random done above
    ml_dummies['---randomColumn---'] = np.random.randint(0, 1000, size=len(ml_dummies))

    try:
        label = ml_dummies['Churn Label']
        del ml_dummies['Churn Label']
    except KeyError:
        print("label already removed.")
    return ml_dummies, customer_id, label

original_df = preprocess_df(df)
output_df = original_df[0].copy()
output_df.shape
output_df['---randomColumn---']
output_df['prediction'] = clf.predict_proba(output_df)[:, 1]
output_df['Churn Label'] = original_df[2]
output_df['CustomerID'] = original_df[1]
This is my code, and it gives me this error:
ValueError: Number of features of the model must match the input. Model n_features is 2746 and input n_features is 2854
The only possibility I can think of is this:
Let's say you have a dummy variable for something called Languages.
Your initial training dataset used to train/fit your model clf contained Languages: English, French, Python, thus creating three dummy columns: Language-English, Language-French, Language-Python.
However, the df you're using for preprocess_df only contained Language: English, French, Korean, Java. We're missing the Language-Python column the original training set had, and we've gained Language-Korean and Language-Java columns the original training set didn't have, meaning the ml_dummies you're returning is missing some columns, has extra columns, or both (a small sketch below illustrates this).
Assuming my assumptions are correct, my solution for this was:
If your training and prediction happen in the same code, you should use your training data for preprocess_df instead.
If that's unavailable, you alternatively need to create and save the list of column names from the data that was used to train/fit the model, after dummy-variable creation. THEN you can fit the data you want to predict to those column names.
Let me know if I misunderstood.
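A minimal sketch of that mismatch, using the hypothetical Language data from above:

import pandas as pd

train = pd.DataFrame({'Language': ['English', 'French', 'Python']})
predict = pd.DataFrame({'Language': ['English', 'Korean', 'Java']})

# the two get_dummies calls produce different column sets
print(pd.get_dummies(train).columns.tolist())
# ['Language_English', 'Language_French', 'Language_Python']
print(pd.get_dummies(predict).columns.tolist())
# ['Language_English', 'Language_Java', 'Language_Korean']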
In response to your comment:
You have a dataset to train a model, train_data, which consists of (say) 100 columns.
After your preprocess_df, you'll have x_train, which consists of 2746 columns after dummy-variable creation.
Before calling the model's train/fit method, save x_train's column names somewhere, say model_input_columns = x_train.columns, or even to a file if you want to use it in a different module or in the future.
For the DataFrame you're actually trying to predict on, which probably also has 100 columns, say predict_data, call preprocess_df(predict_data). This will give you x_predict_data, which consists of 2854 columns.
Now, to actually predict using the model clf, the 2854-column DataFrame you have (to be predicted) somehow needs to be transformed into a 2746-column DataFrame. This is where you need the model_input_columns we saved before.
There are probably many ways to do this, and honestly I'm not exactly an expert in this field, but if you'd still take my probably-not-so-ideal way:
actual_x_input_for_prediction = pd.DataFrame()
default = 0
for column in model_input_columns:
    if column in x_predict_data:
        # the column existed at training time and exists now: copy it over
        actual_x_input_for_prediction[column] = x_predict_data[column]
    else:
        # the column existed at training time but is missing now: fill with the default
        actual_x_input_for_prediction[column] = [default] * len(x_predict_data)

clf.predict(actual_x_input_for_prediction)
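As a follow-up, pandas can do the same alignment in a single call with DataFrame.reindex; a minimal sketch, assuming model_input_columns was saved as above:

# keeps matching columns, drops extras, and fills missing ones with 0
aligned = x_predict_data.reindex(columns=model_input_columns, fill_value=0)
clf.predict(aligned)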
I am currently developing a machine learning algorithm for ticket classification that combines a Title, Description, and Customer name to predict which team a ticket should be assigned to, but I have been stuck for the past few days.
Title and Description are both free text, so I am passing them through TfidfVectorizer. Customer name is a category, and for this I am using OneHotEncoder. I want these to work within a pipeline, so I join them with a ColumnTransformer, which lets me pass in an entire dataframe and have it processed.
train_file = "train_data.csv"
train_data = pd.read_csv(train_file)

string_features = ['Title', 'Description']
string_transformer = Pipeline(steps=[('tfidf', TfidfVectorizer())])

categorical_features = ['Customer']
categorical_transformer = Pipeline(steps=[('OHE', preprocessing.OneHotEncoder())])

preprocessor = ColumnTransformer(transformers=[
    ('str', string_transformer, string_features),
    ('cat', categorical_transformer, categorical_features)])

clf = Pipeline(steps=[('preprocessor', preprocessor), ('clf', SGDClassifier())])

X_train = train_data.drop('Team', axis=1)
y_train = train_data['Team']

clf.fit(X_train, y_train)
However, I get an error: all the input array dimensions except for the concatenation axis must match exactly.
After looking into it, print(OneHotEncoder().fit_transform(X_train['Customer'])) on its own returns an error: Expected 2D array, got 1D array instead.
I believe OneHotEncoder is failing because it expects an array of arrays (a pandas DataFrame), each of length one, containing the customer name, but it is instead getting a pandas Series. By converting the Series to a DataFrame with .to_frame(), the printed output now seems to match what is output by the TfidfVectorizer, and the dimensions should match.
Is there a way I can modify OneHotEncoder in the pipeline so that it accepts the input as it is, in one dimension? Or is there something I can add to the pipeline that will convert it before it's passed into OneHotEncoder? Am I right that this is the reason for the error?
Thanks.
I believe the problem lies in the fact that you're giving two columns to the TfidfVectorizer (which is thus handed a DataFrame). This will not work: TfidfVectorizer expects an iterable of strings. So an immediate solution (and a check of whether this is in fact the source of the problem) is changing this line to: string_features = 'Description'. Note this is not a list; it is just a string. That way the Series, not the DataFrame, is passed to the TfidfVectorizer.
If you would like to combine both string columns, you could either:
concatenate the strings, so you keep one column (which is the easiest), or
fit two different TfidfVectorizers, which is more complex but might perform better (a sketch of this appears below). See for instance Computing separate tfidf scores for two different columns using sklearn.
Should this not solve your problem, I would advise you to share some sample data so we can actually test what is happening.
I believe the difference between the error you reproduced and the actual pipeline lies in the fact that your test gives it X_train['Customer'] (again a Series), whereas in the actual pipeline it receives X_train[['Customer']] (a DataFrame).
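A minimal sketch of the two-vectorizer option, assuming the column names from the question. Note that a plain string as the column selector makes the ColumnTransformer hand a 1-D Series to that transformer, while a list selector produces a 2-D block, which is exactly what OneHotEncoder wants:

from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import OneHotEncoder

preprocessor = ColumnTransformer(transformers=[
    # scalar column names (no list): each TfidfVectorizer receives a 1-D Series
    ('title_tfidf', TfidfVectorizer(), 'Title'),
    ('desc_tfidf', TfidfVectorizer(), 'Description'),
    # list selector: OneHotEncoder receives a 2-D (n, 1) block, as it expects
    ('cat', OneHotEncoder(handle_unknown='ignore'), ['Customer'])])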
First of all, apologies: I am very new to pandas, scikit-learn, and Python, so I am sure I am doing something silly. Let me give a little background.
I am trying to run KNeighborsClassifier from scikit-learn (Python).
The following is my strategy:
# Reading the training set
data = pd.read_csv('Path_TO_File\\Train_Set.csv', sep=',')  # reading CSV file
X = data[['Attribute 1', 'Attribute 2']]
y = data['Target_Column']  # the output is a single column (a Series) of labels with many rows
neigh = KNeighborsClassifier(n_neighbors=3)
neigh.fit(X, y)
Next, I try to read the test data:
test = pd.read_csv('PATH_TO_FILE\\Test.csv', sep=',')
t = test[['Attribute 1','Attribute 2']]
pred = neigh.predict(t)
actual = test['Target_Column']
Next, I try to check the accuracy with the following function call, which throws an error:
accuracy=neigh.score(actual,pred)
ERROR: ValueError: could not convert string to float: N
I checked both actual and pred; they have the following data types and content:

actual
Out[161]:
  Target_Column
0             Y
1             N
:
[614 rows x 1 columns]

pred
Out[162]:
array(['Y', 'N', ..., 'N'], dtype=object)

N.B.: pred has 614 values.
I tried to convert the "actual" variable into a 1-D array so I could execute the function, but I was not successful.
I think I need to do the following two things, though I was not able to manage them (even after googling):
1) Convert actual into a 1-dimensional array.
2) Transpose that 1-dimensional array, since pred has 614 values.
Please let me know how to correct the function.
Thanks in advance!
Raj
Thanks Vivek and Thornhale. Indeed, I was doing two things wrong:
1) As pointed out by you guys, I should have been using 1/0 instead of Y/N.
2) I was giving the wrong parameters to the score function. It should be accuracy = neigh.score(t, actual), where t is the test feature set and actual is the test label information.
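As a side note, a minimal sketch of an alternative: if you already have predictions in hand, sklearn.metrics.accuracy_score computes the same metric directly from the two label arrays, sidestepping the score(features, labels) signature entirely:

from sklearn.metrics import accuracy_score

# compares the true labels against the predicted labels
accuracy = accuracy_score(actual, pred)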
You could convert your Series, which is what you get when you do test[COLUMN_NAME], into an array like so:
actual = np.array(test['Target_Column'])
To then reshape a NumPy array, you would employ this command:
actual.reshape(1, 614)  # <- Could be the other way around as well.
Your main issue, though, is that your Series needs to be boolean (as in 0, 1) rather than 'Y'/'N'.
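A minimal sketch of that label conversion, assuming the 'Y'/'N' values from the question:

# map the string labels to 1/0 before fitting or scoring
actual = test['Target_Column'].map({'Y': 1, 'N': 0})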
x_train = train['date_x','activity_category','char_1_x','char_2_x','char_3_x','char_4_x','char_5_x','char_6_x',
'char_7_x','char_8_x','char_9_x','char_10_x',.........,'char_27','char_29','char_30','char_31','char_32','char_33',
'char_34','char_35','char_36','char_37','char_38']
y = y_train
x_test = test['date_x','activity_category','char_1_x','char_2_x','char_3_x','char_4_x','char_5_x','char_6_x',
'char_7_x','char_8_x','char_9_x','char_10_x','char_1_y','group_1','char_1_y','char_2_y','char_3_y', 'char_4_y','char_5_y','char_6_y','char_7_y',
'char_8_y','char_9-y','char_10_y', ...........,'char_29','char_30','char_31','char_32','char_33',
'char_34','char_35','char_36','char_37','char_38']
train.iloc([0:17,19:38])
After trying to slice columns with train.iloc([0:17, 19:38]), I resorted to typing in all the column names. A pretty cumbersome way of doing this, and I am still only getting what I select with 19:38; I get a KeyError message when doing it the first way, by calling the column names.
As suggested by @AndrasDeak, np.r_ concatenates multiple slices into a single index array, which iloc can consume.
Consider the pd.DataFrame train:
train = pd.DataFrame(np.arange(1000).reshape(-1, 40))
Then use the suggestion like this to pick out columns 0:17 and 19:38 in one go:
train.iloc[:, np.r_[0:17, 19:38]]
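A quick illustration of what np.r_ does with slice arguments (it concatenates them into one flat index array):

import numpy as np

print(np.r_[0:3, 5:7])  # [0 1 2 5 6]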