Surprise API - how to load features into a Surprise dataset? (Python)

For a recommender system built with Surprise, we normally pass only UserID, ItemID, and Rating via load_from_df.
But if I also have other features I want to load from a DataFrame, how can I do that? I couldn't find any useful information or examples in the Surprise API documentation: https://surprise.readthedocs.io/en/stable/dataset.html.
Can someone point me in the right direction?

Surprise is a Python scikit for building and analyzing recommender systems that deal with explicit rating data. Surprise was designed with the following purposes in mind: Give users perfect control over their experiments. ... Provide tools to evaluate, analyse and compare the algorithms' performance.
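For reference, the standard loading pattern: Dataset.load_from_df expects a DataFrame with exactly three columns (raw user ids, raw item ids, ratings), in that order, plus a Reader - which is why extra feature columns do not fit into it directly. A minimal sketch, with illustrative column names and rating scale:

```python
# Minimal sketch of the standard Surprise loading pattern.
# The column names and rating scale here are illustrative.
import pandas as pd
from surprise import Dataset, Reader

df = pd.DataFrame({
    "userID": [1, 1, 2, 3],
    "itemID": [10, 20, 10, 30],
    "rating": [4.0, 3.5, 5.0, 2.0],
})

reader = Reader(rating_scale=(1, 5))

# load_from_df takes exactly three columns: user ids, item ids, ratings
data = Dataset.load_from_df(df[["userID", "itemID", "rating"]], reader)
trainset = data.build_full_trainset()
```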

Related

Does metadata of a data frame help build features for ML algorithms?

Recently I was given a task by a potential employer to do the following:
- transfer a data set to S3
- create metadata for the data set
- create a feature for the data set in Spark
Now, this is a trainee position, and I am new to data engineering concepts, so I am having trouble understanding how, or even if, metadata is used to create a feature.
I have gone through numerous sites on feature engineering and metadata, but none of them really indicate whether metadata is directly used to build a feature.
What I have gathered so far is that when you build a feature, you extract certain columns from a given dataset and put that information into a feature vector for the ML algorithm to learn from. So it seems to me that you could just build a feature directly from the dataset and not be concerned with the metadata.
However, I am wondering whether it is common to use metadata to search for given information within multiple datasets to build the feature, i.e. you look in the metadata file, see certain criteria that fit the feature you're building, then load the data referenced by the metadata and build the feature from there to train the model.
As an example, say I have multiple files for certain car models from a manufacturer (VW Golf, VW Fox, etc.), each containing the year and the price of the car for that year, and I would like the ML algorithm to predict the future depreciation of the car, or of the newest model of that car for years to come. Instead of going directly through all the datasets, you would check the metadata (tags, if that is the correct wording) for certain attributes, and then use those tags to load the data in from the specific datasets to train the model.
I could very well be off base here, or the example I gave above may be completely wrong, but if anyone could explain how metadata can be used to build features (if it can), that would be appreciated - or even share links to data engineering websites that explain it. Over the last day or two of researching, I have found that there is more material on data science than on data engineering itself, and most data engineering info comes from blogs, so I feel like there is pre-existing knowledge I am supposed to have when reading them.
P.S. Though this is not a coding question, I have used the Python tag, as it seems most data engineers use Python.
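To make the workflow in the question concrete, here is a minimal sketch using a toy metadata index to select which datasets to load; all file names, tags, and columns are hypothetical:

```python
# Sketch of metadata-driven data selection: filter a metadata index by
# tag, then load only the matching datasets. Everything here is hypothetical.
import pandas as pd

metadata = [
    {"file": "vw_golf.csv", "manufacturer": "vw", "model": "golf"},
    {"file": "vw_fox.csv",  "manufacturer": "vw", "model": "fox"},
    {"file": "bmw_3.csv",   "manufacturer": "bmw", "model": "3-series"},
]

# Use the metadata tags to pick only the datasets relevant to the feature
wanted = [m["file"] for m in metadata if m["manufacturer"] == "vw"]

# Each file is assumed to contain year and price columns
frames = [pd.read_csv(f) for f in wanted]
training_df = pd.concat(frames, ignore_index=True)
```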
I'll give a synopsis on this!
Here we need to understand two conditions:
1) Do we have features that relate directly to building ML models?
2) Are we facing data scarcity?
Always ask: what does the problem statement suggest about generating features?
There are many ways to generate features from a given dataset. Dimensionality reduction techniques like PCA, truncated SVD, and t-SNE create new features from the given ones; feature engineering techniques include Fourier features, trigonometric features, etc. Then we move to the metadata, such as the type of a feature, its size, and the time when it was extracted. Metadata like this can also help us create features for building ML models, but it depends on how feature engineering has been performed on the data corpus of the respective problem.
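As a concrete illustration of the dimensionality-reduction route mentioned above, a minimal scikit-learn sketch that derives new features via PCA; the data shape and component count are illustrative:

```python
# Minimal sketch: derive new features from existing ones via PCA.
import numpy as np
from sklearn.decomposition import PCA

X = np.random.rand(100, 10)   # hypothetical dataset: 100 rows, 10 features

pca = PCA(n_components=3)     # compress 10 original features into 3
X_new = pca.fit_transform(X)  # the derived features for the ML model

# How much variance each derived feature retains
print(pca.explained_variance_ratio_)
```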

How to visualize variable grouping or perform interactive grouping in PySpark world?

I was wondering whether there is a way to perform interactive variable grouping (similar to the one enabled by the SAS Miner software) in the PySpark/Python world. Variable grouping is an integral part of model development, so I suppose there has to be some tool/library that already supports this. Does anyone have experience with this?
Currently no such library exists for Python.
Interactive variable grouping is a multi-step process (offered as a node called IGN in SAS Enterprise Miner) that is part of the SAS EM Credit Scoring solution, not base SAS. There are, however, tools in the Python world for some of the IGN steps, such as binning, WoE, Gini, and decision trees; scikit-learn is a good starting point for that.
There are a lot of scikit-learn related projects, including domain-specific ones. A project for credit scoring would be a potential candidate for that list.
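For the binning/WoE steps mentioned above, a minimal sketch in plain pandas/NumPy, assuming a numeric feature and a binary 0/1 target; the column names, bin count, and synthetic data are illustrative:

```python
# Minimal sketch of one IGN-like step: quantile binning + Weight of Evidence.
import numpy as np
import pandas as pd

def woe_table(df, feature, target, bins=5):
    # Bin the continuous feature into quantile-based buckets
    binned = pd.qcut(df[feature], q=bins, duplicates="drop")
    stats = df.groupby(binned, observed=True)[target].agg(["sum", "count"])
    events = stats["sum"]                 # target == 1 per bin
    non_events = stats["count"] - events  # target == 0 per bin
    eps = 1e-9                            # guard against log(0) / division by 0
    dist_event = events / max(events.sum(), eps)
    dist_non_event = non_events / max(non_events.sum(), eps)
    woe = np.log((dist_non_event + eps) / (dist_event + eps))
    iv = ((dist_non_event - dist_event) * woe).sum()  # information value
    return woe, iv

# Hypothetical usage on synthetic data
df = pd.DataFrame({"income": np.random.rand(1000) * 100,
                   "default": np.random.randint(0, 2, 1000)})
woe, iv = woe_table(df, "income", "default")
print(woe, iv, sep="\n")
```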

Datasets & Tutorials specifically targeting Business Data Analysis Issues

I am looking for datasets and tutorials that specifically target business data analysis issues. I know about Kaggle, but its main focus is on machine learning and associated problems. It would be great to know of a blog or data dump covering data analysis issues, or maybe a good read/book?
The correct answer to this depends entirely on how comfortable you currently are with machine learning. Business data analysis and prediction are so closely aligned with machine learning that most developers consider them a subset that more general ML skills will cover. So I will suggest two things. If you have no experience in ML, launch into the Data Science (Python) career track on DataCamp - it is excellent! It will help you get to grips with the overall ideas of cleaning and processing your data, as well as with supervised and unsupervised learning.
If you are already comfortable with all of that, I would suggest looking at pbpython.com - the site is entirely devoted to using Python for business analysis, covers individual topics very well itself, and suggests a plethora of books specialized in certain topics.

Saving models from Python

Is it possible to save a predictive model in Python?
Background: I'm building regression models in SPSS and I need to share those when done. Unfortunately, no one else has SPSS.
Idea: My idea is to build the model in Python, do something XYZ, then use another library to convert XYZ into an exe that will pick up a CSV file with data and spit out the model-fit results on that data. That way, I can share the model with anyone I want without the need for SPSS or other expensive software.
Challenge: I need to find out what XYZ is - how do I save the model instance once it is built? For example, in the case of linear/logistic regression, it would be the set of coefficients.
P.S.: I'm using linear/logistic as examples; in reality, I need to share more complex models like SVMs, etc.
Using FOSS (Free & Open Source Software) is great to facilitate collaboration. Consider using R or Sage (which has a Python backbone and includes R) so that you can freely share programs and data. Or even use Sagemath Cloud so that you can work collaboratively in real-time.
Yes, this is possible. What you're looking for is scikit-learn in combination with joblib. A working example of your problem can be found in this question.
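For illustration, a minimal sketch of the dump/load pattern with scikit-learn and joblib; the model, sample data, and file name are illustrative:

```python
# Minimal sketch: train a model, persist it with joblib, reload it later.
from joblib import dump, load
from sklearn.datasets import load_iris
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
model = SVC().fit(X, y)

dump(model, "model.joblib")       # serialize the fitted model to disk

restored = load("model.joblib")   # later, possibly on another machine
print(restored.predict(X[:5]))    # the restored model scores new data
```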

Python package recommendation for data analysis and learning

I want to build an analytics engine on top of an article publishing platform. More specifically, I want to track the users' reading behaviour (e.g. number of views of an article, time spent with the article open, rating, etc), as well as statistics on the articles themselves (e.g. number of paragraphs, author, etc).
This will have two purposes:
Present insights about users and articles
Provide recommendations to users
For the data analysis part I've been looking at cubes, pandas and pytables. There is a lot of data, and it is stored in MySQL tables; I'm not sure which of these packages would better handle such a backend.
For the recommendation part, I'm simply thinking about feeding data from the data analysis engine to a clustering model.
Any recommendations about how to put all this together, as well as cool python projects out there that can help me out?
Please let me know if I should give more information.
Thank you
Scikit-learn should make you happy for the data processing (clustering) part.
For the analysis and visualization side, you have Cubes, as you mentioned, and for viz I use CubesViewer, which I wrote.
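To make the clustering part concrete, a minimal scikit-learn sketch, assuming per-user engagement features have already been aggregated out of MySQL; the feature matrix and column meanings are hypothetical:

```python
# Minimal sketch: cluster users by engagement features with k-means.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Hypothetical feature matrix: one row per user, columns are
# [article_views, avg_seconds_per_article, avg_rating]
X = np.array([
    [120, 45.0, 4.2],
    [ 15, 10.5, 3.1],
    [300, 80.0, 4.8],
    [  8,  5.0, 2.0],
])

X_scaled = StandardScaler().fit_transform(X)  # scale features before k-means
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_scaled)
print(labels)  # cluster assignment per user, usable for recommendations
```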
