I am using Python Dedupe package for record linkage tasks.
That is, matching company names in one data set to those in another.
The Dedupe package allows the user to label pairs for training a logistic regression model. However, it is a manual process, and one needs to enter y/n for each pair shown on screen.
I want to load a training file which has 3 columns, say Company 1, Company 2, Match, where Match takes the value yes or no depending on whether Company 1 and Company 2 are the same or different.
I am following this source code but couldn't find a way to load a file for training.
Also, the docs show that one can change the default classifier, but I'm not sure how to do this.
Can anyone please help me with this?
Look up the trainingDataLink function in the dedupe documentation. It’s designed to handle pre-labeled data for record linkage.
Just a simple question: I'm stuck in a scenario where I want to pass several pieces of information besides the pipeline itself inside a PMML file.
Other information like:
Average of all columns in the dataset: avg(col1), ..., avg(coln)
P values of all features.
Correlation of all features with target.
There can be more of these, but that is the general situation. I know they could easily be sent in a separate file made specifically for this purpose, but since they concern the ML model, I want them in a single file: the PMML.
The Question is:
Can we add any additional information to a PMML file that is extra in nature and might not be related to the model, so that it can be used on the other side?
If that is somehow possible, it would be very helpful.
The PMML standard is extensible with custom elements. However, in the current case, all your needs appear to be served by existing PMML elements.
Average of all columns in dataset avg(col1), ..., avg(coln)
P values of all features
You can store descriptive statistics about features using the ModelStats element.
Correlation of all features with target
You can store function information using the ModelExplanation element.
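For anything that has no dedicated element, the PMML standard also defines an Extension element for vendor-specific content. A minimal sketch of embedding custom per-feature statistics using Python's xml.etree (the statistic values and the "Feature" child element are made up for illustration; in a real file, ModelStats sits inside the model element, and the result must still validate against the PMML schema):

```python
import xml.etree.ElementTree as ET

# Build a tiny PMML-like fragment carrying custom per-feature statistics
# inside an Extension element (PMML allows Extension children for
# vendor-specific content).
pmml = ET.Element("PMML", version="4.4")
stats = ET.SubElement(pmml, "ModelStats")
ext = ET.SubElement(stats, "Extension", name="featureStatistics", extender="myTool")
for feature, mean, p_value in [("col1", 3.2, 0.04), ("col2", -1.1, 0.31)]:
    # 'Feature' is a hypothetical element name, not part of the PMML schema
    ET.SubElement(ext, "Feature", name=feature, mean=str(mean), pValue=str(p_value))

xml_text = ET.tostring(pmml, encoding="unicode")
```

A PMML consumer that does not understand the extension will simply skip it, which is the point of the element: extra data travels with the model without breaking standard scoring engines.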
I am building a recommender system in Python using the MovieLens dataset (https://grouplens.org/datasets/movielens/latest/). In order for my system to work correctly, I need all the users and all the items to appear in the training set. However, I have not found a way to do that yet. I tried using sklearn.model_selection.train_test_split on the partition of the dataset relevant to each user and then concatenated the results, thus succeeding in creating training and test datasets that contain at least one rating given by each user. What I need now is to find a way to create training and test datasets that also contain at least one rating for each movie.
This requirement is quite reasonable, but is not supported by the data ingestion routines for any framework I know. Most training paradigms presume that your data set is populated sufficiently that there is a negligible chance of missing any one input or output.
Since you need to guarantee this, you need to switch to an algorithmic solution rather than a probabilistic one. I suggest that you tag each observation with the user and movie it covers, and then treat this as an instance of the "set cover problem" over the data set.
You can continue adding as many distinct covering sets as needed to populate your training set (which I recommend). Alternatively, you can set a lower threshold of requirement, say three sets of total coverage, and then revert to random methods for the remainder.
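A simple greedy version of that idea (a sketch, with ratings represented as hypothetical (user, movie) tuples): walk the ratings once, force into the training set any rating whose user or movie is not yet covered, then split the remainder randomly.

```python
import random

def coverage_split(ratings, test_fraction=0.2, seed=0):
    """Split (user, movie) ratings so every user and every movie
    appears at least once in the training set."""
    rng = random.Random(seed)
    seen_users, seen_movies = set(), set()
    train, rest = [], []
    for user, movie in ratings:
        if user not in seen_users or movie not in seen_movies:
            train.append((user, movie))  # forced into train for coverage
            seen_users.add(user)
            seen_movies.add(movie)
        else:
            rest.append((user, movie))  # free to land in either split
    rng.shuffle(rest)
    cut = int(len(rest) * test_fraction)
    test = rest[:cut]
    train += rest[cut:]
    return train, test
```

This is greedy rather than optimal: it may force more ratings into training than a minimal set cover would, but it guarantees full coverage in a single pass.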
I'm currently using AzureML with pretty complex workflows involving large datasets, and I'm wondering what the best way is to manage the splits resulting from preprocessing steps. All my projects are built as pipelines fed by registered Datasets. I want to be able to track the splits in order to easily retrieve, for example, the test and validation sets for integration testing purposes.
What would be the best pattern to apply here? Registering every intermediate set as a separate Dataset? Retrieving the intermediate sets directly using the Run IDs?
Thanks!
I wish I had a more coherent answer, the upside is that you're at the bleeding edge so, should you find a pattern that works for you, you can evangelize it and make it best practice! Hopefully you find my rantings below valuable.
First off, if you aren't already, you should definitely use PipelineData as the intermediate artifact for passing data between PipelineSteps. In this way, you can treat the PipelineData as semi-ephemeral: they are materialized should you need them, but it isn't a requirement to keep hold of every single version of every PipelineData. You can always grab them using Azure Storage Explorer, or, like you said, using the SDK and walking down from a PipelineRun object.
Another recommendation is to split your workflow into the following pipelines:
featurization pipeline (all joining, munging, and featurizing)
training pipeline
scoring pipeline (if you have a batch score scenario).
The intra-pipeline artifacts are PipelineData, and the inter-pipeline artifacts would be registered Datasets.
To actually get to your question of associating data splits with models: our team struggled with this, especially because for each train/test split we also have "extra cols" containing either identifiers or leaking variables that the model shouldn't see.
In our current hack implementation, we register our "gold" dataset as an Azure ML Dataset at the end of the featurization pipeline. The first step of our training pipeline is a PythonScriptStep, "Split Data", which contains our train/test split logic and outputs a pickled dictionary as data.pkl. Then we can unpickle it anytime we need one of the splits and can join back using the index for any reporting needs. Here's a gist.
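The pickled-dictionary trick from that "Split Data" step can be as simple as the following sketch (the column and key names are illustrative, not from any AzureML API):

```python
import pickle

# Inside the "Split Data" step: bundle all splits, plus the held-out
# identifier/leakage columns, into one artifact so any later step can
# unpickle exactly the split it needs and join back on the index.
splits = {
    "train": {"X": [[0.1, 1.0], [0.4, 0.0]], "y": [1, 0], "index": [101, 102]},
    "test": {"X": [[0.9, 1.0]], "y": [1], "index": [103]},
    "extra_cols": {101: "user_a", 102: "user_b", 103: "user_c"},
}
with open("data.pkl", "wb") as f:
    pickle.dump(splits, f)

# In any downstream step:
with open("data.pkl", "rb") as f:
    loaded = pickle.load(f)
```

Writing data.pkl to a PipelineData output directory instead of the working directory is what lets subsequent pipeline steps pick it up.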
Registration is meant to make sharing and reuse easier, so that you can retrieve the dataset by its name. If you expect to reuse the test/validation sets in other experiments, then registering them makes sense. However, if you are just trying to keep records of what you used for this particular experiment, you can always find that information via the Run, as you suggested.
Tensorflow beginner here.
My data is split into two csv files, a.csv and b.csv, relating to two different events a and b. Both files contain information on the users concerned and, in particular, they both have a user_id field that I can use to merge the data sets.
I want to train a model to predict the probability of b happening based on the features of a. To do this, I need to append a label column 'has_b_happened' to the data A retrieved from a.csv. In Scala Spark, I would do something like:
val joined = A
.join(B.groupBy("user_id").count, A("user_id") === B("user_id"), "left_outer")
.withColumn("has_b_happened", col("count").isNotNull.cast("double"))
In tensorflow, however, I haven't found anything comparable to spark's join. Is there a way of achieving the same result or am I trying to use the wrong tool for it?
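For what it's worth, this kind of join is usually done before the data reaches TensorFlow, e.g., with pandas. The same left-outer-join-plus-flag logic, sketched in plain Python with made-up sample rows (field names follow the question):

```python
# Left outer join A with per-user counts from B, then flag rows whose
# user_id appears in B at all ('has_b_happened').
a_rows = [{"user_id": 1, "feature": 0.5},
          {"user_id": 2, "feature": 0.7},
          {"user_id": 3, "feature": 0.1}]
b_rows = [{"user_id": 1}, {"user_id": 1}, {"user_id": 3}]

# Equivalent of B.groupBy("user_id").count
b_counts = {}
for row in b_rows:
    b_counts[row["user_id"]] = b_counts.get(row["user_id"], 0) + 1

# Equivalent of the left_outer join + isNotNull cast
joined = [{**row, "has_b_happened": 1.0 if row["user_id"] in b_counts else 0.0}
          for row in a_rows]
```

With pandas, the same result comes from a left merge of the two frames followed by filling the flag column; the labeled frame can then be fed to TensorFlow as usual.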
Normally when I have done machine learning in the past, I have essentially had one row for each observation. In those cases, I just fed my data line by line into the algorithm. In my current data, I have what is essentially an index where one name has many counts attached to it. My problem is that in one year a name can be associated with both Male and Female, and I need to weight it by the count (I am building a gender classifier based on name). I have included an image below as an example of how my data looks:
Maybe it is simple and I am missing it, but without expanding the data out into individual rows, is there an easy way to read this into a machine learning algorithm and use the Count column to signify the weight? I am primarily planning on using the scikit-learn suite of tools.
I think you can simply use the pandas groupby function and store the frequency of each name+gender pair in a column. You can refer to the code below to get started:
yourDataFrame = pd.DataFrame(columns=["Name", "Gender", "Age", "SourceFile"])
yourDataFrame["Count"] = 1
dummyDf = yourDataFrame.groupby(["Name", "Gender"])["Count"].count()
Now you can write a simple lookup function which combines yourDataFrame and dummyDf for counts/weights.
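Alternatively, most scikit-learn estimators accept a sample_weight argument to fit, so the Count column can be used directly as weights without expanding rows. A stdlib-only sketch of shaping the aggregated table into features, labels, and weights (the rows are made-up examples, and the final fit call is shown as a comment since it depends on your estimator and encoding):

```python
# Aggregated rows: (name, gender, count) -- Count doubles as the sample weight.
rows = [("Alex", "M", 120), ("Alex", "F", 8), ("Sam", "F", 45)]

X = [[name] for name, _, _ in rows]        # features (encode names before fitting)
y = [gender for _, gender, _ in rows]      # labels
sample_weight = [count for _, _, count in rows]

# With scikit-learn this would then be, e.g.:
# clf.fit(X_encoded, y, sample_weight=sample_weight)
```

This keeps one row per name+gender pair while still letting the 120-count observation outweigh the 8-count one during training.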