I'm trying to prepare a dataset for scikit-learn, planning to build a pandas DataFrame to feed to a decision tree classifier.
The data represents different companies with varying criteria, but some criteria can have multiple values - such as "Customer segment" - which, for any given company, could be any or all of: SMB, midmarket, enterprise, etc. There are other criteria/columns like this with multiple possible values. I need decisions made on individual values, not the aggregate - so company A for SMB and company A for midmarket, not the "grouping" of company A for SMB AND midmarket.
Is there guidance on how to handle this? Do I need to generate rows for every variant for a given company to be fed into the learning routine? Such that an input of:
Company,Segment
A,SMB:MM:ENT
becomes:
A, SMB
A, MM
A, ENT
As well as for any other variants that may come from additional criteria/columns - for example "customer vertical" which could also include multiple values? It seems like this will greatly increase the dataset size. Is there a better way to structure this data and/or handle this scenario?
My ultimate goal is to let users complete a short survey with simple questions, and map their responses to values to get a prediction of the "right" company, for a given segment, vertical, product category, etc. But I'm struggling to build the right learning dataset to accomplish that.
Let's try:
import pandas as pd

df = pd.DataFrame({'company': ['A', 'B'], 'segment': ['SMB:MM:ENT', 'SMB:MM']})
# split the multi-valued column into one column per value
expanded_segment = df.segment.str.split(':', expand=True)
expanded_segment.columns = ['segment' + str(i) for i in range(len(expanded_segment.columns))]
# go wide, then melt back to long: one row per (company, segment) pair
wide_df = pd.concat([df.company, expanded_segment], axis=1)
result = pd.melt(wide_df, id_vars=['company'], value_vars=list(set(wide_df.columns) - set(['company'])))
result = result.dropna()
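If the goal is to feed this straight into a decision tree classifier, a natural next step is to one-hot encode the expanded values. A minimal sketch (the target column is hypothetical - you would supply a label for each (company, segment) row):

import pandas as pd
from sklearn.tree import DecisionTreeClassifier

features = pd.get_dummies(result[['company', 'value']])  # one binary column per company / segment value
# 'target' is hypothetical - a label per (company, segment) row
# clf = DecisionTreeClassifier().fit(features, target)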
I would like to build a linear regression model to determine the influence of various parameters on quote prices. The quote data were collected over 10 years.
y = Price
X = [System size (int), ZIP, Year, module_manufacturer, module_name, inverter_manufacturer, inverter_name, battery storage (binary), number of installers/offerers in the region (int), installer_density, new_construction (binary), self_installation (binary), household density]
Questions:
What type of regression model is suitable for this dataset?
Due to technological progress, quote prices decrease over the years. How can I account for the different years in the model? I found some examples where years were treated as binary (dummy) variables. Another option: separate regression models for each year. Is there a way to combine these multiple models?
Is the dataset a type of panel data?
Unfortunately, I have not yet found any information that could explicitly help me with my data. But maybe I didn't use the right search terms. I would be very happy about any suggestions that nudge me in the right direction.
Suppose you have a data.frame called data with columns price, system_size, zip, year, battery_storage etc. Then you can start with a simple linear regression:
lm(price ~ system_size + zip + year + battery_storage, data = data)
year is included in the model so that changes over time are taken into account.
If you want to remove batch effects (e.g. different regions/ZIP codes) and you just care about modelling the price after getting rid of the effect of different locations, you can run a linear mixed model:
lmerTest::lmer(price ~ system_size + year + battery_storage + (1|zip), data = data)
If you have a high correlation, e.g. between year and system_size, you might want to include interaction terms like year:system_size in your formula.
As a rule of thumb, you need about 10 samples per variable to get a reasonable fit. If you have more variables than that allows, you can do variable selection first.
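For reference, a rough Python sketch of the same two models with statsmodels, assuming the quotes sit in a pandas DataFrame df with the column names used above (treating year and zip as categorical here is an assumption, not part of the answer above):

import statsmodels.formula.api as smf

# plain linear regression with year and zip as categorical (dummy) variables
ols_fit = smf.ols('price ~ system_size + C(zip) + C(year) + battery_storage', data=df).fit()
print(ols_fit.summary())

# mixed model with a random intercept per ZIP code, analogous to (1|zip) above
mixed_fit = smf.mixedlm('price ~ system_size + C(year) + battery_storage', data=df, groups=df['zip']).fit()
print(mixed_fit.summary())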
I collected some product reviews from a website from different users, and I'm trying to find similarities between products through the embeddings of the words used by the users.
I grouped the reviews per product, so different reviews follow one another in my dataframe (i.e. different authors for one product). I also already tokenized the reviews (and applied all the other pre-processing steps). Below is a mock-up dataframe of what I have (the list of tokens per product is actually very long, as is the number of products):
Product    | reviews_tokenized
XGame3000  | absolutely amazing simulator feel inaccessible ...
Poliamo    | production value effect tend cover rather ...
Artemis    | absolutely fantastic possibly good oil ...
Ratoiin    | ability simulate emergency operator town ...
However, I'm not sure what would be the most efficient between Doc2Vec and Word2Vec. I would initially go for Doc2Vec, since it can find similarities while taking the paragraph/sentence into account and capture its topic (which I'd like to have, since I'm trying to cluster products by topic), but I'm a bit worried that the reviews being from different authors might bias the embeddings. Note that I'm quite new to NLP and embeddings, so some notions may escape me. Below is my code for Doc2Vec, which gives me a quite good silhouette score (~0.7).
import numpy as np
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# one TaggedDocument per product, tagged with its row index (SEED is defined elsewhere)
product_doc = [TaggedDocument(doc.split(' '), [i]) for i, doc in enumerate(df.tokens)]
model3 = Doc2Vec(min_count=1, seed=SEED, ns_exponent=0.5)
model3.build_vocab(product_doc)
model3.train(product_doc, total_examples=model3.corpus_count, epochs=model3.epochs)
# one inferred vector per product, from its token list
product2vec = [model3.infer_vector(df['tokens'][i].split(' ')) for i in range(len(df['tokens']))]
dtv = np.array(product2vec)
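For context, a silhouette score like the one mentioned above is typically computed on clusters of these vectors; a minimal, hedged sketch with scikit-learn (KMeans and the number of clusters are assumptions, not taken from the post):

from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

labels = KMeans(n_clusters=5, random_state=0).fit_predict(dtv)  # 5 clusters is an assumption
print(silhouette_score(dtv, labels))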
What do you think would be the most efficient method to tackle this? If something is not clear enough, please tell me.
Thank you for your help.
EDIT: Below are the clusters I'm obtaining:
There's no way to tell which particular mix of methods will work best for a specific dataset and particular end-goal: you really have to try them against each other, in your own reusable pipeline for scoring them against your desired results.
It looks like you've already stripped the documents down to keywords rather than original natural text, which could hurt with these algorithms - you may want to try it both ways.
Depending on the size & format of your texts, you may also want to look at doing "Word Mover's Distance" (WMD) comparisons between sentences (or other small logical chunks of your data). Some work has demonstrated interesting results in finding "similar concerns" (even with different wording) in the review domain, e.g.: https://tech.opentable.com/2015/08/11/navigating-themes-in-restaurant-reviews-with-word-movers-distance/
Note, though, that WMD gets quite costly to calculate in bulk with larger texts.
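For illustration, a minimal sketch of a pairwise WMD comparison with gensim, assuming you keep per-review token lists and train a Word2Vec model on them (the variable names are made up, and wmdistance needs gensim's optional optimal-transport dependency - pyemd or POT depending on your gensim version):

from gensim.models import Word2Vec

# token_lists: one list of tokens per review (hypothetical variable)
w2v = Word2Vec(sentences=token_lists, min_count=1)

# lower distance = more similar wording/concepts between two reviews
distance = w2v.wv.wmdistance(token_lists[0], token_lists[1])
print(distance)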
I'm trying to predict the score that a user gives to a restaurant.
The data I have can be grouped into two dataframes:
data about the user (taste, personal traits, family, ...)
data about the restaurant (opening hours, location, cuisine, ...).
First major question is: how do I approach this?
I've already tried a basic prediction with the user dataframe (predicting one column from a few others using RandomForest) and it was pretty straightforward. These dataframes are logically different and I can't merge them into one.
What is the best approach when doing prediction like this?
My second question is: what is the best way to handle categorical data (e.g. cuisine)?
I know I can create a mapping function and convert each value to an index, or I can use Categorical from pandas (and probably a few other methods). Is there any preferred way to do this?
1) The second dataset is essentially characteristics of the restaurant which might influence the first dataset. Example: opening times or location are strong factors that a customer could consider. You can use them by merging them in at the restaurant level. That could help you understand how people's treatment of location and opening times is reflected in their score for the restaurant - note that you could even apply clustering and see that different customers have different sensitivities to these variables.
For example, frequent customers (who mostly eat out) may be more mindful of location/timing etc. if it's part of their daily routine.
You should apply modelling techniques and run multiple simulations to get variable-importance box plots, and check whether variables like location/timing have high variance in their importance scores when calculated on different subsets of the data - that would be indicative of different customer sensitivities.
2) You can look at label encoding or one-hot encoding, or even use the variable as it is. It would help here to know how many levels the variable has in the data. You can look at functions like pd.get_dummies.
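A minimal sketch of both points with pandas (the frames scores, users and restaurants and the key/column names are assumptions about your data, not taken from the question):

import pandas as pd

# scores: (user_id, restaurant_id, score); users and restaurants hold the two feature sets
train = (scores
         .merge(users, on='user_id')
         .merge(restaurants, on='restaurant_id'))

# one-hot encode the categorical cuisine column for the model
train = pd.get_dummies(train, columns=['cuisine'])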
Hope this helps.
Tensorflow beginner here.
My data is split into two csv files, a.csv and b.csv, relating to two different events a and b. Both files contain information on the users concerned and, in particular, they both have a user_id field that I can use to merge the data sets.
I want to train a model to predict the probability of b happening based on the features of a. To do this, I need to append a label column 'has_b_happened' to the data A retrieved from a.csv. In Scala Spark, I would do something like:
val joined = A
.join(B.groupBy("user_id").count, A("user_id") === B("user_id"), "left_outer")
.withColumn("has_b_happened", col("count").isNotNull.cast("double"))
In TensorFlow, however, I haven't found anything comparable to Spark's join. Is there a way of achieving the same result, or am I trying to use the wrong tool for it?
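For reference, a minimal sketch of the same left join and label construction in pandas, done before the data reaches TensorFlow (the file and column names come from the question; using merge's indicator is just one way to build the flag):

import pandas as pd

A = pd.read_csv('a.csv')
B = pd.read_csv('b.csv')

# left join A against the distinct user_ids present in b.csv
joined = A.merge(B[['user_id']].drop_duplicates(), on='user_id', how='left', indicator=True)
joined['has_b_happened'] = (joined['_merge'] == 'both').astype(float)
joined = joined.drop(columns='_merge')

The resulting frame can then be turned into tensors (e.g. with tf.data.Dataset.from_tensor_slices) for training.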
Normally when I have done machine learning in the past, I have essentially had one row per observation. In those cases, I just fed my data line by line into the algorithm. In my current data, I essentially have an index where one name has many counts associated with it. My problem is that in one year I could have a name associated with both Male and Female, and I need to weight it by the count (I am building a gender classifier based on name). I have included an image below as an example of how my data looks:
Maybe it is simple and I am missing it, but without expanding the data out into individual rows, is there an easy way to read this into a machine learning algorithm and use the Count column to signify the weight? I am primarily planning on using the scikit-learn suite of tools.
I think you can simply use the pandas groupby function and have the frequency of each name + gender combination as a column. You can refer to the code below to get started:
import pandas as pd

# yourDataFrame stands in for your actual data with these columns
yourDataFrame = pd.DataFrame(columns=["Name", "Gender", "Age", "SourceFile"])
yourDataFrame["Count"] = 1
# one row per (Name, Gender) with the summed counts
dummyDf = yourDataFrame.groupby(["Name", "Gender"], as_index=False)["Count"].sum()
Now you can make a simple lookup function which combines yourDataFrame and dummyDf for counts/weights.
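Alternatively, most scikit-learn estimators accept a sample_weight argument in fit, so the Count column can be used directly without expanding rows. A minimal sketch, assuming you have already turned the Name column into numeric features X and have y = Gender (those steps are not shown):

from sklearn.ensemble import RandomForestClassifier

# X: numeric features derived from Name (assumed); y: Gender labels (assumed)
clf = RandomForestClassifier()
clf.fit(X, y, sample_weight=yourDataFrame["Count"])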