Saving models from Python

Is it possible to save a predictive model in Python?
Background: I'm building regression models in SPSS and I need to share those when done. Unfortunately, no one else has SPSS.
Idea: My idea is to build the model in Python, do something XYZ, then use another library to convert XYZ into an exe that will pick up a CSV file with data and spit out the model fit results on that data. In this way, I can share the model with anyone I want without the need for SPSS or other expensive software.
Challenge: I need to figure out XYZ, i.e. how to save the model instance once it is built. For example, in the case of linear/logistic regression, that would be the set of coefficients.
PS: I'm using linear/logistic regression as examples; in reality, I need to share more complex models like SVMs.

Using FOSS (Free and Open Source Software) is a great way to facilitate collaboration. Consider using R or Sage (which has a Python backbone and includes R) so that you can freely share programs and data. Or even use Sagemath Cloud so that you can work collaboratively in real time.

Yes, this is possible. What you're looking for is scikit-learn in combination with joblib. A working example of your problem can be found in this question.
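For instance, here is a minimal sketch of that workflow, assuming scikit-learn and joblib are installed; the file names and the "target" column are placeholders:

```python
import pandas as pd
import joblib                    # older scikit-learn versions bundled this as sklearn.externals.joblib
from sklearn.svm import SVC

# --- at build time ---
train = pd.read_csv("train.csv")                  # placeholder file name
model = SVC().fit(train.drop(columns="target"), train["target"])
joblib.dump(model, "model.joblib")                # persist the fitted estimator

# --- at prediction time (e.g. inside your packaged exe) ---
model = joblib.load("model.joblib")
new_data = pd.read_csv("new_data.csv")            # placeholder file name
print(model.predict(new_data))
```

The same dump/load pattern works for any scikit-learn estimator, including SVMs.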

Related

Store and reuse the feature engineering performed by tsfresh

I am currently using the tsfresh package for a project (predictive maintenance).
It is working really well and now I want to implement it live.
However, the issue is that I don't know how to store the feature engineering that was applied to my original dataset, so that I can apply the same feature engineering to the data that I am streaming (receiving live).
Do you have any idea if there is a parameter or a function that allows storing the feature engineering performed by tsfresh?
(I am using the extract_relevant_features function).
After searching through various posts, it turns out that you can save your parameters into a dictionary (see here).
This dictionary can later be passed to the function extract_features to extract only those features.
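A minimal sketch of that workflow, assuming tsfresh is installed; the DataFrame and target names (timeseries_train, y_train, timeseries_live) are placeholders for your own data:

```python
from tsfresh import extract_relevant_features, extract_features
from tsfresh.feature_extraction.settings import from_columns

# Offline: select the relevant features on the historical data
X_train = extract_relevant_features(timeseries_train, y_train,
                                    column_id="id", column_sort="time")

# Store the parameter dictionary describing which features were selected
kind_to_fc_parameters = from_columns(list(X_train.columns))

# Live: compute exactly those features on the streaming data
X_live = extract_features(timeseries_live,
                          column_id="id", column_sort="time",
                          kind_to_fc_parameters=kind_to_fc_parameters)
```

The dictionary is plain Python, so it can be persisted with pickle or json between the offline and live stages.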

Breaking down 3D models into lines and curves

I'm working on a project to break down 3D models, but I'm quite lost. I hope you can help me.
I'm getting a 3D model from Autodesk BIM, and the format could be native or generic CAD formats (.stp, .igs, .x_t, .stl). Then, I need to somehow "measure" the maximum dimensions to model a raw-material body; it will always have the shape of a huge panel. Once I have both bodies, I will take the difference to extract the solids I need to analyze; and, on each of these bodies, I need to extract the faces, and then the lines or curves of each face.
This sounds like something really easy to do in CAD software, but the idea is to automate this process. I was looking into OpenSCAD, but it seems it only works to model geometry and doesn't handle imported solids well. I'm leaving a picture with the idea of what I need to do in the link below.
So, any idea how I can do this? Which language and library can help with this project?
I can see this automation being possible with a few in-between steps:
1. OpenSCAD can handle differences well, so your "Extract Bodies" step seems plausible.
1.5 Before going further, you'll have to explain how you "filtered out" the cylinder. Will you do this manually? If you don't, it will be considered for analysis and you will end up with a lot of faces as a result.
2. I don't think OpenSCAD provides you with a vertex array. However, it can save to .STL, which is fairly easy to parse with the programming language of your choice; you'll have to study the .stl file structure a bit (this sounds much more frightening than it is: if you open an .stl file in an editor you will probably immediately see what's going on).
3. Once you've parsed the file, you can calculate the lines with high-school math.
This is not an easy, GUI-based way to do what you ask, but if you have a few programming skills you'll have your automation, and depending on the number of projects you handle it may be worth it.
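To make the "parse the .stl yourself" step concrete, here is a minimal sketch for the ASCII variant of STL (binary STL needs a different reader); it just collects the three vertices of each triangle, and the file name is a placeholder:

```python
def read_ascii_stl(path):
    """Return a list of triangles, each a tuple of three (x, y, z) vertices."""
    triangles, current = [], []
    with open(path) as f:
        for line in f:
            parts = line.split()
            if parts and parts[0] == "vertex":          # lines look like: vertex x y z
                current.append(tuple(float(v) for v in parts[1:4]))
                if len(current) == 3:                   # three vertices complete one facet
                    triangles.append(tuple(current))
                    current = []
    return triangles

triangles = read_ascii_stl("exported_from_openscad.stl")  # placeholder file name
print(len(triangles), "triangles")
```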
I have been working on this project and found that the library "trimesh" is better suited to solve this. Give it a shot and save some time.
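As a rough sketch of what trimesh offers here (the file name is a placeholder, and boolean operations require an external backend such as Blender or OpenSCAD to be installed):

```python
import trimesh

mesh = trimesh.load("part.stl")            # trimesh also reads several other mesh formats

# Maximum dimensions, useful for modelling the raw-material panel
length, width, height = mesh.bounding_box.extents

# Raw geometry to analyse further: points and triangles
vertices = mesh.vertices                   # (n, 3) array of points
faces = mesh.faces                         # (m, 3) vertex indices per triangle

print(length, width, height, len(faces))

# A boolean difference (raw panel minus part) is exposed via
# trimesh.boolean.difference(...), but it needs an external engine installed.
```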

Which dataset is used in the jakeret/tf_unet U-Net implementation?

I am trying to implement U-Net and I am using https://github.com/jakeret/tf_unet/tree/master/scripts as a reference. I don't understand which dataset they used. Please give me some idea of, or a link to, which dataset I should use.
On their GitHub README.md they show three different datasets that they applied their implementation to. Their implementation is dataset agnostic, so it shouldn't matter too much what data they used if you're trying to solve your own problem with your own data. But if you're looking for a toy dataset to play around with, check out their demos. There you'll see two readily available examples and how they can be used:
demo_radio_data.ipynb, which uses an astronomical radio data example set from here: http://people.phys.ethz.ch/~ast/cosmo/bgs_example_data/
demo_toy_problem.ipynb, which uses their built-in data generator of noisy images with circles that are to be detected.
The latter is probably the easiest one when it comes to just having something to play with. To see how the data is generated, check out the class:
image_gen.py -> GrayScaleDataProvider
(with an IDE like PyCharm you can just jump into the corresponding classes from the demo source code)
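For reference, a rough sketch based on the toy-problem demo (exact argument names may differ slightly between versions of the tf_unet repo):

```python
from tf_unet import image_gen, unet

# Built-in generator: noisy grey-scale images containing circles, plus their masks
generator = image_gen.GrayScaleDataProvider(nx=572, ny=572, cnt=20)
x, y = generator(4)                         # 4 random images and their ground truth

net = unet.Unet(channels=generator.channels, n_class=generator.n_class,
                layers=3, features_root=16)
trainer = unet.Trainer(net)
trainer.train(generator, "./unet_trained", training_iters=20, epochs=10)
```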

How to store multi-dimensional data

I am building a couple of web applications to store data using Django. This data would be generated from lab tests and might have up to 100 parameters being logged against time. This would leave me with an NxN matrix of data.
I'm struggling to see how this would fit into a Django model as the number of parameters logged may change each time, and it seems inefficient to create a new model for each dataset.
What would be a good way of storing data like this? Would it be best to store it as a separate file and then just use a model to link a test to a datafile? If so what would be the best format for fast access and being able to quickly render and search through data, generate graphs etc in the application?
In answer to the question below:
It would be useful to search through datasets generated from the same test for trend analysis etc.
As I'm still at the beginning with this site, I'm using SQLite, but I'm planning to move to a full SQL database as it grows.
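One hedged way to sketch this in Django is a "long" layout, where each logged value is its own row (test, parameter name, time, value), so the number of parameters can vary per test without changing the schema; all model and field names below are illustrative:

```python
from django.db import models

class LabTest(models.Model):
    name = models.CharField(max_length=200)
    run_at = models.DateTimeField(auto_now_add=True)

class Measurement(models.Model):
    test = models.ForeignKey(LabTest, related_name="measurements",
                             on_delete=models.CASCADE)
    parameter = models.CharField(max_length=100)   # e.g. "temperature"
    timestamp = models.FloatField()                # seconds since test start
    value = models.FloatField()

    class Meta:
        indexes = [models.Index(fields=["test", "parameter"])]
```

Querying all values of one parameter across tests then stays a simple filter, which keeps trend analysis and graphing straightforward.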

Analysing Twitter for research: moving from small data to big

We are doing research work as part of our college project in which we need to analyse Twitter data.
We have already built the prototype for classification and analysis using pandas and nltk, reading the comments from a CSV file and then processing them. The problem now is that we want to scale it up to read and analyse a much bigger comments file. But we don't have anybody who could guide us (the majority of our group being from a biology background) on what technologies to use for this massive amount of data.
Our issues are:
1. How to store a massive comments file (5 GB, offline data). Till now we had only 5000-10000 lines of comments, which we processed using pandas. But how do we store and process such a huge file? Which database should we use for it?
2. Also, since we plan to use nltk and machine learning on this data, what should our approach be along the pipeline csv -> pandas, nltk, machine learning -> model -> prediction? That is, where in this path do we need changes, and with what technologies should we replace them to handle the huge data?
Generally speaking, there are two types of scaling:
Scale up
Scale out
Scale up, most of the time, means taking what you already have and running it on a bigger machine (more CPU, RAM, disk throughput).
Scale out generally means partitioning your problem, and handling parts on separate threads/processes/machines.
Scaling up is much easier: keep the code you already have and run it on a big machine (possibly on Amazon EC2 or Rackspace, if you don't have one available).
If scaling up is not enough, you will need to scale out. Start by identifying which parts of your problem can be partitioned. Since you're processing Twitter comments, there's a good chance you can simply partition your file into multiple ones and train N independent models.
Since you're just processing text data, there isn't a big advantage to using a database over plain text files (for storing the input data, at least). Simply split your file into multiple files and distribute each one to a different processing unit.
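A minimal sketch of that chunk-by-chunk approach with pandas (the file name is a placeholder, and the per-chunk work stands in for your existing nltk pipeline):

```python
import pandas as pd

results = []
# Stream the 5 GB file in pieces instead of loading it all at once
for chunk in pd.read_csv("comments.csv", chunksize=100_000):
    # ... run your existing pandas/nltk processing on `chunk` here ...
    results.append(len(chunk))          # placeholder for per-chunk output

print("processed rows:", sum(results))
```

Each chunk (or each split file) can just as easily be handed to a separate process or machine.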
Depending on the specific machine learning techniques you're using, it may be easy to merge the independent models into a single one, but it will likely require expert knowledge.
If you're using K-nearest-neighbors, for example, it's trivial to join the independent models.
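As a small illustration of why k-NN is easy to merge (the data here is random and purely illustrative): the fitted "model" is essentially its stored training points, so concatenating the partitions' data and refitting gives the combined model.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X_part1, y_part1 = rng.normal(size=(100, 5)), rng.integers(0, 2, 100)
X_part2, y_part2 = rng.normal(size=(100, 5)), rng.integers(0, 2, 100)

# "Merging" the partition models = pooling their training data
merged = KNeighborsClassifier(n_neighbors=5).fit(
    np.vstack([X_part1, X_part2]),
    np.concatenate([y_part1, y_part2]),
)
print(merged.predict(rng.normal(size=(3, 5))))
```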
