I have a dataset with name, ratings, ratings_count, and genre columns.
Ex: Movies_Data.csv
Name            ratings  ratings_count  Action  Adventure  Horror  Musical  Thriller
Mad-Max         2        7              1       0          0       0        1
Mitchell[1975]  3.25     2              1       0          0       0        1
John Wick       4.23     4              1       0          0       0        0
Insidious       3.75     10             0       0          1       0        0
I divided it into features and labels, then performed label encoding on the Name column.
Here's my features dataset after the split.
features:
ratings  ratings_count  Action  Adventure  Horror  Musical  Thriller
2        7              1       0          0       0        1
3.25     2              1       0          0       0        1
4.23     4              1       0          0       0        0
3.75     10             0       0          1       0        0
Now the problem is that I have around 18 genre columns, so I think my decision tree is giving more importance to these columns than to ratings and ratings_count.
For example, if I ask the tree to predict a movie with the following parameters:
ratings:3 ratings_count:2 Action:1 Adventure:0 Horror:0 Musical:0 Thriller:1
It should obviously predict Mitchell[1975], since ratings:3 is close to 3.25 and the ratings_count matches my input, but it's predicting Mad-Max.
How can I increase the importance of the ratings and ratings_count columns?
I'm new to ML, so is there any other way, or another algorithm I can use for better recommendations?
P.S. I know we can use neural networks, but I need to stick to basic ML algorithms only.
Thanks!
First, Random Forests almost always give better results than single Decision Trees. They have a few more hyperparameters to tune, but that also gives you more room to improve results. A Random Forest is an ensemble algorithm: it averages many Decision Trees, which reduces overfitting, so it should perform better.
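For example, here is a minimal sketch with scikit-learn (a rough illustration, not a tuned setup; 'features' and 'labels' are the splits you already prepared, the query row is the example from your question, and the hyperparameters are placeholders):
from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier(n_estimators=200, random_state=0)  # placeholder hyperparameters
clf.fit(features, labels)

# per-column importances, e.g. to compare ratings/ratings_count against the genre flags
print(dict(zip(features.columns, clf.feature_importances_)))

# predicts the (label-encoded) movie for the example query: ratings, ratings_count, then genre flags
print(clf.predict([[3, 2, 1, 0, 0, 0, 1]]))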
If you're still having trouble, you might try merging some genre categories (or getting more data) so the algorithm can correctly infer the importance of the ratings.
Also, this question might be better suited for Cross Validated, where you can ask more theoretical questions.
Good luck!
Related
I'm currently trying to use a number of medical codes to find out whether a person has a certain disease, and I could use some help; I've been searching for a couple of days but couldn't find anything. Given that I've imported excel file 1 into df1 and excel file 2 into df2, how do I use excel file 2 to identify which diseases the patients in excel file 1 have, and indicate them with a column per disease? Below is an example of what the data looks like. I'm currently using pandas in a Jupyter notebook for this.
Excel file 1:
Patient  Primary Diagnosis  Secondary Diagnosis  Secondary Diagnosis 2  Secondary Diagnosis 3
Alex     50322              50111
John     50331              60874                50226                  74444
Peter    50226              74444
Peter    50233              88888
Excel File 2:
Primary Diagnosis       Medical Code
Diabetes Type 2         50322
Diabetes Type 2         50331
Diabetes Type 2         50233
Cardiovescular Disease  50226
Hypertension            50111
AIDS                    60874
HIV                     74444
HIV                     88888
Intended output:
Patient  Positive for Diabetes Type 2  Positive for Cardiovascular Disease  Positive for Hypertension  Positive for AIDS  Positive for HIV
Alex     1                             1                                    0                          0                  0
John     1                             1                                    0                          1                  1
Peter    1                             1                                    0                          0                  1
You can use melt, merge and pivot_table:
out = (
    df1.melt('Patient', var_name='Diagnosis', value_name='Medical Code').dropna()
       .merge(df2, on='Medical Code').assign(dummy=1)
       .pivot_table('dummy', 'Patient', 'Primary Diagnosis', fill_value=0)
       .add_prefix('Positive for ').rename_axis(columns=None).reset_index()
)
Output:
Patient  Positive for AIDS  Positive for Cardiovescular Disease  Positive for Diabetes Type 2  Positive for HIV  Positive for Hypertension
Alex     0                  0                                    1                             0                 1
John     1                  1                                    1                             1                 0
Peter    0                  1                                    1                             1                 0
IIUC, you could melt df1, then map the codes using df2 reshaped into a Series, and finally pivot_table the output:
diseases = df2.set_index('Medical Code')['Primary Diagnosis']

(df1
 .reset_index()
 .melt(id_vars=['index', 'Patient'])
 .assign(disease=lambda d: d['value'].map(diseases),
         value=1,
         )
 .pivot_table(index='Patient', columns='disease', values='value', fill_value=0)
)
output:
disease  AIDS  Cardiovescular Disease  Diabetes Type 2  HIV  Hypertension
Patient
Alex        0                       0                1    0             1
John        1                       1                1    1             0
Peter       0                       1                1    1             0
Maybe you could convert your excel file 2 into some form of key-value mapping, then replace the diagnosis codes in file 1 with the corresponding disease names, and later apply some form of encoding like one-hot to file 1. Not sure if this approach would definitely help, but just sharing my thoughts; a rough sketch of the idea is below.
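For instance (assuming df1 and df2 are the two files read in as above; the variable names here are just for illustration):
import pandas as pd

# Build a code -> disease lookup from excel file 2
code_to_disease = dict(zip(df2['Medical Code'], df2['Primary Diagnosis']))

# Melt the diagnosis columns of file 1 and translate each code to its disease name
codes = df1.melt('Patient', value_name='Medical Code').dropna(subset=['Medical Code'])
codes['Disease'] = codes['Medical Code'].map(code_to_disease)

# One-hot per patient: any matching code counts as positive
out = (pd.crosstab(codes['Patient'], codes['Disease'])
         .clip(upper=1)
         .add_prefix('Positive for ')
         .reset_index())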
I'm building a model pretty similar to the well-known house price prediction problem. I got to the point where I need to encode my nominal categorical variables using scikit-learn's OneHotEncoder. The so-called "Dummy Variable Trap" is clear to me, so I need to drop one of my one-hot encoded columns to avoid multicollinearity.
What's bothering me is the way unseen categories are handled. In my understanding, unseen categories will be treated the same way as the "base category" (the category I dropped).
To make it clear have a look at this example:
This is the training data I use to fit my OneHotEncoder.
X_train:
index  city
0      Munich
1      Berlin
2      Hamburg
3      Berlin
OneHotEncoding:
from sklearn.preprocessing import OneHotEncoder

oh = OneHotEncoder(handle_unknown = 'ignore', drop = 'first')
oh.fit_transform(X_train)
Because of drop = 'first' the first column ('city_Munich') will be dropped.
index  city_Berlin  city_Hamburg
0      0            0
1      1            0
2      0            1
3      1            0
Now it's time to encode unseen data:
X_test:
index  city
10     Munich
11     Berlin
12     Hamburg
13     Cologne
oh.transform(X_test)
index  city_Berlin  city_Hamburg
10     0            0
11     1            0
12     0            1
13     0            0
I guess you can see my problem: row 10 (Munich) is treated the same way as row 13 (Cologne).
Either I run into the "dummy variable trap" by not dropping a column, or I treat unseen data as the "base category", which is in fact wrong.
What's a proper way to deal with that? Is there any option in the OneHotEncoder class to add a new column for unseen categories, like "city_unseen"?
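(I'm not aware of a built-in OneHotEncoder option for this, but as a sketch of the idea, assuming the oh encoder and the X_test frame above, one could append a manual indicator column for categories never seen during fit:)
import numpy as np

known = set(oh.categories_[0])                                 # cities learned from X_train
unseen = (~X_test['city'].isin(known)).astype(int).to_numpy().reshape(-1, 1)

encoded = oh.transform(X_test).toarray()                       # densify the sparse output for stacking
X_test_encoded = np.hstack([encoded, unseen])                  # last column acts like 'city_unseen'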
I have to do some analysis using Python 3 and pandas with a dataset, shown here as a toy example:
data
'''
location importance agent count
0 London Low chatbot 2
1 NYC Medium chatbot 1
2 London High human 3
3 London Low human 4
4 NYC High human 1
5 NYC Medium chatbot 2
6 Melbourne Low chatbot 3
7 Melbourne Low human 4
8 Melbourne High human 5
9 NYC High chatbot 5
'''
My aim is to group by location and then count the number of Low, Medium and High values in the 'importance' column for each location. So far, the code I have come up with is:
data.groupby(['location', 'importance']).aggregate(np.size)
'''
                      agent  count
location  importance
London    High            1      1
          Low             2      2
Melbourne High            1      1
          Low             2      2
NYC       High            2      2
          Medium          2      2
'''
This grouping and count aggregation has the grouping columns as its index:
data.groupby(['location', 'importance']).aggregate(np.size).index
I don't know how to proceed from here. Also, how can I visualize this?
Help?
I think you need DataFrame.pivot_table with aggfunc='sum' to aggregate any duplicates, and then DataFrame.plot:
df = data.pivot_table(index='location', columns='importance', values='count', aggfunc='sum')
df.plot()
If you need counts of location/importance pairs, use crosstab:
df = pd.crosstab(data['location'], data['importance'])
df.plot()
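As a usage note (just a sketch, using the toy data above): for categorical counts, a bar chart is often clearer than the default line plot.
import pandas as pd
import matplotlib.pyplot as plt

counts = pd.crosstab(data['location'], data['importance'])   # rows: location, columns: importance
counts.plot.bar(rot=0)                                        # one group of bars per location
plt.ylabel('number of rows')
plt.show()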
Here is my df after cleaning:
number summary cleanSummary
0 1-123 he loves ice cream love ice cream
1 1-234 she loves ice love ice
2 1-345 i hate avocado hate avocado
3 1-123 i like skim milk like skim milk
As you can see, there are two records that have the same number. Now I'll create and fit the vectorizer.
from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer(token_pattern=r"(?u)\b\w+\b", ngram_range=(1,1), analyzer='word')
cv.fit(df['cleanSummary'])
Now I'll transform.
freq = cv.transform(df['cleanSummary'])
Now if I take a look at freq...
freq = sum(freq).toarray()[0]
freq = pd.DataFrame(freq, columns=['frequency'])
freq
frequency
0 1
1 1
2 1
3 2
4 1
5 2
6 1
7 1
...there doesn't seem to be a logical way to access the original number. I have tried methods of looping through each row, but this runs into problems because of the potential for multiple summaries per number. A loop using a grouped df...
def extractFeatures(groupedDF, textCol):
    features = pd.DataFrame()
    for id, group in groupedDF:
        freq = cv.transform(group[textCol])
        freq = sum(freq).toarray()[0]
        freq = pd.DataFrame(freq, columns=['frequency'])
        dfinner = pd.DataFrame(cv.get_feature_names(), columns=['ngram'])
        dfinner['number'] = id
        dfinner = dfinner.join(freq)
        features = features.append(dfinner)
    return features
...works, but the performance is terrible (i.e. 12 hours to run through 45,000 one-sentence documents).
If I change
freq = sum(freq).toarray()[0]
to
freq = freq.toarray()
I get an array of frequencies for each ngram for each document. This is good, but then it doesn't let me push that array of lists into a dataframe, and I still wouldn't be able to access number.
How do I access the original number label for each ngram without looping over a grouped df? My desired result is:
number ngram frequency
1-123 love 1
1-123 ice 1
1-123 cream 1
1-234 love 1
1-234 ice 1
1-345 hate 1
1-345 avocado 1
1-123 like 1
1-123 skim 1
1-123 milk 1
Edit: this is somewhat of a revisit to this question: Convert CountVectorizer and TfidfTransformer Sparse Matrices into Separate Pandas Dataframe Rows. However, after implementing the method described in that answer, I run into memory issues for a large corpus, so it doesn't seem scalable.
freq = cv.fit_transform(df.cleanSummary)
dtm = pd.DataFrame(freq.toarray(), columns=cv.get_feature_names(), index=df.number).stack()
dtm[dtm > 0]
number
1-123  cream      1
       ice        1
       love       1
1-234  ice        1
       love       1
1-345  avocado    1
       hate       1
1-123  like       1
       milk       1
       skim       1
dtype: int64
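If memory is a concern, here is a sketch of a variant that avoids densifying the whole matrix with .toarray(): it reads the (row, column, count) triplets of the sparse document-term matrix directly. It assumes df has the 'number' and 'cleanSummary' columns shown above.
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer(token_pattern=r"(?u)\b\w+\b", ngram_range=(1, 1), analyzer='word')
freq = cv.fit_transform(df['cleanSummary'])        # sparse document-term matrix, one row per summary

coo = freq.tocoo()                                 # row / col / data arrays of the nonzero counts
terms = np.asarray(cv.get_feature_names_out())     # cv.get_feature_names() on older scikit-learn
long_df = pd.DataFrame({
    'number': df['number'].to_numpy()[coo.row],    # map each nonzero entry back to its row label
    'ngram': terms[coo.col],
    'frequency': coo.data,
})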
I have what is (to me at least) a complicated dataframe that I'm trying to reshape so that I can more easily create visualizations. The data is from a survey, and each row is a complete survey response with 247 columns. These columns are split according to what sort of data they contain: some is identifying information (who took the survey, what product the survey is about), some are the scores on particular questions, and some are the comments given about that particular product. Here is a simplification of the dataframe
id Evaluator item Mar1 Mar1[Comments] Comf1 Comf1[Com..
1 001 11 3 "asf adfsfs.." 3 "text.."
2 001 14 2 "asf adfsfs.." 4 "text.."
3 002 11 4 "asf adfsfs.." 2 "text.."
4 002 14 3 "asf adfsfs.." 3 "text.."
5 002 34 0 "asf adfsfs.." 1 "text.."
6 003 11 2 "asf adfsfs.." 0 "text.."
....
It continues on from here, but in this case 'Mar1' and 'Comf1' are rated questions. I have another datatable that describes all the questions and question types within the survey, so I can perform data selections like the following...
df[df['ItemNum']==11][(qtable[(qtable['type'].str.contains("OtoU")==True)]).id]
This pulls from qtable all the 'types' of 'OtoU' (i.e. all the rating questions) for ItemNum 11. This is all well and good, and gets me something like this...
Mar1 Mar2 Comf1 Comf2 Comf3 Interop1 Interop2 .....
1 2 3 1 3 4 4
2 3 3 2 4 2 2
2 1 1 4 4 1 2
1 3 2 2 2 1 1
3 4 1 2 3 3 3
I can't really do much with it in that form (at least I don't think I can). What I 'think' I need to do is flatten it out into a form that goes more like
Item Question Score Section Evaluator ...
11 Mar1 3 Maritime 001 ...
11 Comf1 2 Comfort 001 ...
11 Comf2 3 Comfort 001 ...
14 Mar1 1 Maritime 001 ...
But I'll be damned if I know how to do that. I tried to do it (the wrong way, I'm pretty sure) by iterating through the dataframe, but I quickly realized that it both took quite some time to do and produced data of questionable integrity.
So, (very) long story short: how do I go about doing this sort of transform with pandas? I would like to do a number of plots, including box plots by question for each 'item', factorplots broken out by 'section', and multi-line charts plotting the mean of each question by item... if that helps you better understand where I am trying to go with this. Sorry for the long post; I just wanted to make sure I supplied enough information to get a solid answer.
Thanks,
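(For what it's worth, the wide-to-long reshape described above is essentially what pd.melt does. A minimal sketch, under the assumption that the rating columns and the section mapping are known; both are hypothetical placeholders here and would in practice come from the qtable:)
import pandas as pd

rating_cols = ['Mar1', 'Comf1']                       # hypothetical subset of the 247 columns
long_df = df.melt(
    id_vars=['id', 'Evaluator', 'item'],              # identifier columns to keep, per the example
    value_vars=rating_cols,                           # the rated-question columns
    var_name='Question',
    value_name='Score',
)

# Optionally attach a section label per question (hypothetical mapping):
section_map = {'Mar1': 'Maritime', 'Comf1': 'Comfort'}
long_df['Section'] = long_df['Question'].map(section_map)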