Regarding hypothesis testing, what I have understood so far is that we select the appropriate test based on the type of variable (nominal, ordinal, interval, or ratio) and on the number of variables involved. For example, for a single variable with interval or ratio data we use a one-sample t-test, and for two variables with interval or ratio data we use a two-sample t-test. But now I have heard for the first time that tests are selected based on the type of the input variable as well as the output variable; I thought only the type of the input variable had to be considered.
Please enlighten me on this. Could someone also point me to a widely accepted flowchart showing which parametric and non-parametric tests to use?
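To make the rule of thumb in the question concrete, here is a small illustration with scipy.stats on made-up data: a one-sample t-test checks a single interval/ratio variable against a hypothesized mean, and a two-sample (independent) t-test compares the same kind of variable across two groups.

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
heights = rng.normal(170, 10, size=30)   # one interval/ratio variable

# One sample: is the mean height different from a hypothesized 168 cm?
t1, p1 = stats.ttest_1samp(heights, popmean=168)

# Two samples: do two independent groups differ in mean height?
group_b = rng.normal(173, 10, size=30)
t2, p2 = stats.ttest_ind(heights, group_b)

print(p1, p2)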
I am building a recommender system in Python using the MovieLens dataset (https://grouplens.org/datasets/movielens/latest/). In order for my system to work correctly, I need all the users and all the items to appear in the training set. However, I have not found a way to do that yet. I tried using sklearn.model_selection.train_test_split on the partition of the dataset relevant to each user and then concatenated the results, thus succeeding in creating training and test datasets that contain at least one rating given by each user. What I need now is to find a way to create training and test datasets that also contain at least one rating for each movie.
This requirement is quite reasonable, but it is not supported by the data-ingestion routines of any framework I know. Most training paradigms presume that your data set is populated densely enough that there is a negligible chance of missing any one input or output.
Since you need a guarantee, you have to switch to an algorithmic solution rather than a probabilistic one. I suggest that you tag each observation with its input and output, and then treat the selection as an instance of the set cover problem over the data set.
You can continue extracting as many distinct covering sets as are needed to populate your training set (which is what I recommend). Alternatively, you can set a lower threshold, say three covering sets in total, and then revert to random sampling for the remainder.
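A hedged sketch of that idea, assuming a MovieLens-style pandas DataFrame `ratings` with 'userId' and 'movieId' columns (names follow the dataset's convention); a simple greedy pass picks a subset of rows in which every user and every movie appears at least once, which can then seed the training set:

import pandas as pd

def covering_subset(ratings, seed=0):
    # Greedily pick rows until every user and every movie is covered once.
    uncovered_users = set(ratings['userId'])
    uncovered_movies = set(ratings['movieId'])
    chosen = []
    # Shuffle so repeated calls (with different seeds) yield distinct covers.
    for idx, row in ratings.sample(frac=1, random_state=seed).iterrows():
        if row['userId'] in uncovered_users or row['movieId'] in uncovered_movies:
            chosen.append(idx)
            uncovered_users.discard(row['userId'])
            uncovered_movies.discard(row['movieId'])
        if not uncovered_users and not uncovered_movies:
            break
    return ratings.loc[chosen]

# Usage: seed the training set with one or more covers, split the rest randomly.
# cover = covering_subset(ratings)
# remainder = ratings.drop(cover.index)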
I wanted to know if there is any way to oversample data using PySpark.
I have a dataset with a target variable of 10 classes. As of now, I am taking each class and oversampling it as below to match the others:
from pyspark.sql import functions as F

transformed_04 = transformed.where(F.col('nps_score') == 4)
transformed_03 = transformed.where(F.col('nps_score') == 3)
transformed_02 = transformed.where(F.col('nps_score') == 2)
transformed_01 = transformed.where(F.col('nps_score') == 1)
transformed_00 = transformed.where(F.col('nps_score') == 0)

# sample(withReplacement, fraction, seed); fractions above 1 duplicate rows
transformed_04_more_rows = transformed_04.sample(True, 11.3, 9)
transformed_03_more_rows = transformed_03.sample(True, 16.3, 9)
transformed_02_more_rows = transformed_02.sample(True, 12.0, 9)
And finally joining all dataframes with union all
transformed_04_more_rows.unionAll(transformed_03_more_rows).unionAll(transformed_02_more_rows)
I am checking the sampling values manually. For example, if the 4th class has 2000 rows and the 2nd class has 10 rows, I check the counts by hand and provide values like 16 or 12 accordingly, as in the code above.
Forgive me that the code above is not complete; I only included it to give an overview. I wanted to know if there is any automated way, like SMOTE, in PySpark.
I have seen the link below:
Oversampling or SMOTE in Pyspark
It says my target class has to be binary only, and if I remove that condition it throws some datatype issues.
Can anyone help me with this implementation in PySpark? Checking every class and providing sampling values manually is very painful, please help.
Check out the sampleBy function of Spark, which enables stratified sampling: https://spark.apache.org/docs/2.4.0/api/python/pyspark.sql.html?highlight=sampleby#pyspark.sql.DataFrame.sampleBy
In your case, you can provide the fraction you want for each class in a dictionary and pass it to sampleBy; try it out.
To decide the fractions, you can do an aggregation count on your target column, normalize it to (0, 1), and tune from there.
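A minimal sketch, assuming the DataFrame `transformed` with target column 'nps_score' from the question. One caveat: sampleBy samples without replacement, so its fractions must lie in [0, 1] and it can only balance classes by downsampling; for true oversampling, the per-class ratios can still be computed automatically and passed to sample(withReplacement=True, ...):

from pyspark.sql import functions as F

counts = {row['nps_score']: row['count']
          for row in transformed.groupBy('nps_score').count().collect()}

# Stratified downsampling towards the smallest class with sampleBy.
smallest = min(counts.values())
fractions = {cls: smallest / cnt for cls, cnt in counts.items()}
balanced = transformed.sampleBy('nps_score', fractions, seed=9)

# Automated oversampling towards the largest class, no manual ratios.
largest = max(counts.values())
oversampled = None
for cls, cnt in counts.items():
    part = transformed.where(F.col('nps_score') == cls).sample(True, largest / cnt, 9)
    oversampled = part if oversampled is None else oversampled.unionAll(part)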
I am working on a project to model the change in a person's happiness depending on many variables.
Most of the explanatory variables are daily (how much food they ate, daily exercise, sleep, etc.), but some of them are weekly; they are supposed to be weekly, and they affect the predicted variable only once a week.
For instance, one of the weekly variables is a person's change in weight when they weigh themselves on the same day each week. This data is only available once a week and affects the person's happiness only on that day.
In that case, can someone please advise how I can handle the missing data in Python on the days when no data is available for the weekly variables? It would be wrong to extrapolate to the missing days, since the person's happiness is not affected at all by those weekly variables on days when they are not measured.
I have created a dummy that is 1 when the weekly data is available and 0 otherwise, but I don't know what to do with the missing data. I can't leave NaNs, otherwise Python won't run the regression, but I also can't put 0, since the actual variable value (e.g. change in weight) on a day when data is available can itself be 0.
Scikit-learn provides imputer classes that deal with missing values by following a user-defined strategy (e.g. using a constant default value, or the mean of the column). If you do not want to skew training, I'd suggest using a statistic rather than some arbitrary default value.
Additionally, you can record which values have been imputed and which are organic using a MissingIndicator.
You can find out more about the different imputers, with example code, in the scikit-learn documentation.
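For illustration, a minimal sketch on made-up data using SimpleImputer and MissingIndicator from sklearn.impute:

import numpy as np
from sklearn.impute import SimpleImputer, MissingIndicator

X = np.array([[0.5, np.nan],
              [0.1, -0.3],
              [0.7, np.nan]])

# Replace each NaN with the column mean; other strategies are 'median',
# 'most_frequent', or 'constant' with a fill_value.
X_filled = SimpleImputer(strategy='mean').fit_transform(X)

# Boolean mask marking which entries were missing, one column per
# feature that contained missing values; useful as extra model inputs.
mask = MissingIndicator().fit_transform(X)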
One way to solve this issue:
1. Fill in the NaNs with the last available value (in this case, the last measured weight).
2. Add a boolean variable "value available today" (which was already done, as described in the question).
3. Add one more variable: (last available value / previous available value) * "value available today".
Caveat: modeling a product of variables might prove a little difficult for linear regression algorithms.
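A minimal sketch of these three steps in pandas, with a made-up daily 'weight' column that is only populated on weigh-in days:

import numpy as np
import pandas as pd

df = pd.DataFrame({'weight': [70.0, np.nan, np.nan, 69.5, np.nan]})

# Step 1: carry the last observed weight forward.
df['weight_filled'] = df['weight'].ffill()

# Step 2: indicator for days with a fresh measurement.
df['weight_available'] = df['weight'].notna().astype(int)

# Step 3: last measured value over the previous measured value, zeroed
# on days without a fresh measurement. The first measurement has no
# predecessor, so its ratio is set to 0 here (a modeling choice).
prev = df['weight'].dropna().shift(1).reindex(df.index).ffill()
df['weight_ratio'] = ((df['weight_filled'] / prev) * df['weight_available']).fillna(0.0)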
I am using the lifelines library to estimate a Cox PH model. For the regression I have many categorical features, which I one-hot encode, removing one column per feature to avoid the multicollinearity issue (the dummy variable trap). I am not attaching the code, as the example can be similar to the one given in the documentation here.
By running cph.check_assumptions(data) I receive the information that each dummy variable violates the assumptions:
Variable 'dummy_a' failed the non-proportional test: p-value is 0.0063.
Advice: with so few unique values (only 2), you can try `strata=['dummy_a']` in the call in `.fit`. See documentation in link [A] and [B] below.
How should I understand the advice in terms of multiple dummy variables for a single categorical feature? Should I add them all to strata?
I would appreciate any comments :)
#abu, your question brings up a clear gap in the documentation: what to do if dummy variables violate the proportional-hazards test. In this case, I suggest not dummying the variable at all, and instead adding the original column as a stratification variable, e.g. fit(..., strata=['dummy']).
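A minimal sketch of that suggestion with lifelines, assuming a DataFrame df with duration column 'T', event column 'E', and the original un-dummied categorical column, hypothetically named 'group', that failed the proportional-hazards test:

from lifelines import CoxPHFitter

cph = CoxPHFitter()
# Stratifying on the original categorical column gives each level its
# own baseline hazard, so the proportional-hazards assumption no longer
# needs to hold for it; note that no coefficient is estimated for it.
cph.fit(df, duration_col='T', event_col='E', strata=['group'])
cph.print_summary()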
I have a pandas dataframe with the following 2 columns:
Database Name     Name
db1_user          Login
db1_client        Login
db_care           Login
db_control        LoginEdit
db_technology     View
db_advanced       LoginEdit
I have to cluster the Database Name values based on the field "Name". When I convert the dataframe to a NumPy array using
dataset = df2.values
and print dataset.dtype, the type is object. I have just started with clustering, and from what I have read I understand that object is not a type suitable for k-means clustering.
Any help would be appreciated!!
What is the mean of Login, LoginEdit, and View supposed to be?
There is a reason why k-means only works on continuous numerical data: the mean is only well defined for such data.
I don't think clustering is applicable to your problem at all (rather, look into data cleaning). But clearly you need a method that works with arbitrary distances, and k-means does not.
I don't understand whether you want to develop clusters for each group of "Name" attributes, or alternatively create n clusters regardless of the value of "Name"; and I don't understand what clustering could achieve here.
In any case, just a few days ago there was a similar question on the Data Science SE site (from an R user, though), asking about the similarity of the local parts of email addresses (the part before the "@") rather than of database names. The problem is similar to yours. Check this out:
https://datascience.stackexchange.com/questions/14146/text-similarities/14148#14148
The answer was comprehensive with respect to the different distance measures for strings. Maybe this is what you should investigate. Then decide on a proper distance measure that is available in Python (or one that you can program yourself) and that fits your needs.
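As an illustration, a hedged sketch that computes pairwise Levenshtein (edit) distances between the database names from the question and feeds them to hierarchical clustering in scipy; the distance measure, linkage method, and cluster count are all illustrative choices rather than recommendations:

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

names = ['db1_user', 'db1_client', 'db_care',
         'db_control', 'db_technology', 'db_advanced']

def levenshtein(a, b):
    # Classic dynamic-programming edit distance.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

# Symmetric pairwise distance matrix, condensed for scipy.
n = len(names)
dist = np.zeros((n, n))
for i in range(n):
    for j in range(i + 1, n):
        dist[i, j] = dist[j, i] = levenshtein(names[i], names[j])

labels = fcluster(linkage(squareform(dist), method='average'),
                  t=2, criterion='maxclust')
print(dict(zip(names, labels)))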