I am trying to implement U-Net and I am using https://github.com/jakeret/tf_unet/tree/master/scripts as a reference. I don't understand which dataset they used. Please give me some idea of, or a link to, a dataset I can use.
On their GitHub README.md they show three different datasets that they applied their implementation to. Their implementation is dataset agnostic, so it shouldn't matter too much which data they used if you're trying to solve your own problem with your own data. But if you're looking for a toy dataset to play around with, check out their demos. There you'll see two readily available examples and how they can be used:
demo_radio_data.ipynb, which uses an astronomical radio data example set from here: http://people.phys.ethz.ch/~ast/cosmo/bgs_example_data/
demo_toy_problem.ipynb, which uses their built-in data generator that produces noisy images with circles to be detected.
The latter is probably the easiest one when it comes to just having something to play with. To see how the data is generated, check out the class:
image_gen.py -> GrayScaleDataProvider
(with an IDE like PyCharm you can jump straight into the corresponding classes from the demo source code)
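For instance, the toy demo draws its data along these lines (the sizes and counts here are illustrative; check the notebook for the exact call):

from tf_unet import image_gen

# Built-in toy data provider: noisy grayscale images containing circles.
generator = image_gen.GrayScaleDataProvider(572, 572, cnt=20)

# Draw a small batch of images and the corresponding ground-truth masks.
x, y = generator(4)
print(x.shape, y.shape)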
Related
I'm trying to write a Python program that will decide if a given post is about the topic of volunteering. My datasets are small (only the posts, which are examined one by one), so approaches like LDA do not yield results.
My end goal is a simple True/False, a post is about the topic or not.
I'm trying this approach:
Using Google's word2vec model, I'm creating a "cluster" of words that are similar to the word "volunteer" (the model loading is sketched after this list of steps):
CLUSTER = [x[0] for x in MODEL.most_similar_cosmul("volunteer", topn=120)]
Getting the posts and translating them to English using Google Translate.
Cleaning the translated posts using NLTK (removing stopwords and punctuation, and lemmatizing the post).
Making a BOW (bag of words) out of the translated, cleaned post.
This stage is difficult for me. I want to calculate a "distance" / "similarity" / something that will help me get the True/False answer that I'm looking for, but I can't think of a good way to do that.
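For reference, step 1 assumes a pretrained word2vec model loaded roughly like this (the specific model name and the use of gensim's downloader are just one way to get it):

import gensim.downloader as api

# One way to obtain a pretrained Google News word2vec model
# (note: the download is large, around 1.6 GB).
MODEL = api.load("word2vec-google-news-300")
CLUSTER = [x[0] for x in MODEL.most_similar_cosmul("volunteer", topn=120)]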
Thank you for your suggestions and help in advance.
You are attempting to intuitively improvise a set of steps that, in the end, will classify these posts into the two categories, "volunteering" and "not-volunteering".
You should look for online examples of "text classification" that are similar to your task, work through them (with their original demo data) for understanding, then adapt them incrementally to work with your data instead.
At some point, word2vec might be a helpful contributor to your task - but I wouldn't start with it. Similarly, eliminating stop-words, performing lemmatization, etc might eventually be helpful, but need not be important up front.
You'll typically want to start by acquiring (by hand-labeling if necessary) a training set of text for which you know the "volunteering" or "not-volunteering" value (known labels).
Then, create some feature-vectors for the texts – a simple starting approach that offers a quick baseline for later improvements is a "bag of words" representation.
Then, feed those representations, with the known labels, to some existing classification algorithm. The popular scikit-learn package in Python offers many. That is: you don't yet need to worry about choosing ways to calculate a "distance" / "similarity" / something to guide your own ad hoc classifier. Just feed the labeled data into one (or many) existing classifiers and check how well they do. Many will use various kinds of similarity/distance calculations internally, but that happens automatically once you choose and configure the algorithm.
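As a minimal sketch of that start-to-finish flow with scikit-learn (the example posts and labels here are illustrative placeholders, not real data):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Tiny hand-labeled sample (placeholders; substitute your own labeled posts).
posts = [
    "We are looking for volunteers to help at the shelter this weekend",
    "Brand new apartment for rent in the city center",
    "Join our volunteer beach clean-up next Saturday",
    "Selling a used bicycle, barely ridden",
]
labels = [True, False, True, False]  # True = about volunteering

# Bag-of-words features feeding a standard classifier.
model = make_pipeline(CountVectorizer(), LogisticRegression())
model.fit(posts, labels)

print(model.predict(["Anyone want to volunteer at the food bank?"]))

Once that baseline works end to end, swapping CountVectorizer for TfidfVectorizer, or LogisticRegression for another classifier, is a one-line change when comparing alternatives.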
Finally, when you have something working start-to-finish, no matter how modest the results, then try alternate ways of preprocessing the text (stop-word removal, lemmatization, etc.), featurizing the text, and alternate classifiers/algorithm parameterizations, to compare results and thus discover what works well given your specific data, goals, and practical constraints.
The scikit-learn "Working With Text Data" guide is worth reviewing & working-through, and their "Choosing the right estimator" map is useful for understanding the broad terrain of alternate techniques and major algorithms, and when different ones apply to your task.
Also, scikit-learn contributors/educators like Jake Vanderplas (github.com/jakevdp) and Olivier Grisel (github.com/ogrisel) have many online notebooks/tutorials/archived-video-presentations which step through all the basics, often including text-classification problems much like yours.
I know that all of the datasets that can be loaded with sns.load_dataset() are example datasets used for Seaborn's documentation, but do these example datasets use actual data?
I'm asking because I want to know if it's useful to pay attention to the results I get as I play around with these datasets, or if I should just see them as solely a means to learning the module.
The data does appear to be real. This is not formally documented by Seaborn, but:
Several of the datasets are "real", well-known datasets that can be verified elsewhere, such as the Iris dataset hosted on UCI's Machine Learning Repository.
All of the data are sourced from https://github.com/mwaskom/seaborn-data, and in turn, it appears, from actual CSVs on the local drive of Michael Waskom (a core Seaborn developer). If the data were random/fake, it would more likely have been generated with Python libraries such as NumPy.
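You can load and inspect them directly to check; for example (this fetches the CSVs from that repository, so it needs an internet connection):

import seaborn as sns

# Load the classic iris dataset that ships with seaborn's examples.
iris = sns.load_dataset("iris")
print(iris.head())

# List the names of all available example datasets.
print(sns.get_dataset_names())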
I actually have a photodiode connected to my PC and do the capturing with Audacity.
I want to improve on this by using an old RPi 1 as a dedicated test station. As a result, the shutter speed should appear on the console. I would prefer a Python solution for acquiring the signal and analysing it.
Can anyone give me some suggestions? I played around with oct2py, but I don't really understand how to calculate the time between the two peaks of the signal.
I have no expertise in sound analysis with Python; this is what I found doing some internet research, as I am interested in this topic myself.
pyAudioAnalysis, for audio analysis as its name suggests
You can use pyAudioAnalysis, developed by Theodoros Giannakopoulos.
Towards your end, the function mtFileClassification() from audioSegmentation.py can be a good start. This function:
splits an audio signal into successive mid-term segments and extracts mid-term feature statistics from each of these segments, using mtFeatureExtraction() from audioFeatureExtraction.py
classifies each segment using a pre-trained supervised model
merges successive fix-sized segments that share the same class label to larger segments
visualizes statistics regarding the results of the segmentation-classification process.
For instance
from pyAudioAnalysis import audioSegmentation as aS
[flagsInd, classesAll, acc, CM] = aS.mtFileClassification("data/scottish.wav","data/svmSM", "svm", True, 'data/scottish.segments')
Note that the last argument of this function is a .segments file. This is used as ground truth (if available) in order to estimate the overall performance of the classification-segmentation method. If this file does not exist, the performance measure is not calculated. These files are simple comma-separated files of the format: start time, end time, class label. For example:
0.01,9.90,speech
9.90,10.70,silence
10.70,23.50,speech
23.50,184.30,music
184.30,185.10,silence
185.10,200.75,speech
...
If I have understood your question correctly, this is at least the kind of output you want to generate, isn't it? Note, though, that I rather think you have to provide that file yourself.
Most of this information is quoted directly from the project's wiki, which I suggest you read. Don't hesitate to reach out, as I am really interested in this topic.
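That said, if all you need is the time between the two peaks of the captured signal, a minimal sketch with scipy could look like this (assuming a mono WAV recording from the photodiode; the filename and thresholds are illustrative and will need tuning to your setup):

import numpy as np
from scipy.io import wavfile
from scipy.signal import find_peaks

# Load the photodiode recording (assumed mono).
rate, data = wavfile.read("shutter_capture.wav")
signal = np.abs(data.astype(float))

# Find prominent peaks; height and minimum spacing need tuning.
peaks, _ = find_peaks(signal, height=0.5 * signal.max(), distance=rate // 100)

if len(peaks) >= 2:
    # Time between the first and last prominent peak, in seconds.
    shutter_time = (peaks[-1] - peaks[0]) / rate
    print("Shutter open for %.4f s (about 1/%.0f)" % (shutter_time, 1.0 / shutter_time))
else:
    print("Fewer than two peaks found; adjust the thresholds.")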
Other available libraries for audio analysis:
I am trying to create an image classifier based on RandomTrees from OpenCV (version 2.4). Right now, I initialize everything likes this:
self.model = cv.RTrees()
max_num_trees = 10
max_error = 1
max_d = 5
# Stop either after max_num_trees iterations or once the error falls below max_error.
criteria = cv.TERM_CRITERIA_MAX_ITER + cv.TERM_CRITERIA_EPS
parameters = dict(max_depth=max_d, min_sample_count=5, use_surrogates=False,
                  nactive_vars=0, term_crit=(criteria, max_num_trees, max_error))
self.model.train(dataset, cv.CV_ROW_SAMPLE, responses, params=parameters)
I did this by looking at this question. The only problem is that whatever I change in the parameters, the classification always remains the same (and wrong). Since the Python documentation on this is very scarce, I have no choice but to ask here what to do and how to check what I am doing. How do I get the number of trees it generates, and all the other things that are explained for C++ but not for Python, like the train error? For example, I tried:
self.model.tree_count
self.model.get_tree_count()
but got an error every time. Also, am I doing the termination criteria initialization correctly?
Is it possible to save a predictive model in Python?
Background: I'm building regression models in SPSS and I need to share those when done. Unfortunately, no one else has SPSS.
Idea: My idea is to build the model in Python, do something XYZ, then use another library to convert XYZ into an exe that will pick up a CSV file with data and spit out the model-fit results on that data. This way, I can share the model with anyone I want without the need for SPSS or other expensive software.
Challenge: I need to find out XYZ: how do I save the model instance once it is built? For example, in the case of linear/logistic regression, it would be the set of coefficients.
PS: I'm using linear/logistic as examples; in reality, I need to share more complex models like SVMs.
Using FOSS (Free & Open Source Software) is great to facilitate collaboration. Consider using R or Sage (which has a Python backbone and includes R) so that you can freely share programs and data. Or even use Sagemath Cloud so that you can work collaboratively in real-time.
Yes, this is possible. What you're looking for is scikit-learn in combination with joblib. A working example of your problem can be found in this question.
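As a minimal sketch of that combination (the training data and filename here are illustrative):

import joblib
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

# Train a model on some data, then persist the fitted instance to disk.
X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000).fit(X, y)
joblib.dump(model, "model.joblib")

# Later, or on another machine with compatible library versions installed:
restored = joblib.load("model.joblib")
print(restored.predict(X[:5]))

The same pattern works for SVMs and other scikit-learn estimators; the main caveat is that the loading side needs compatible scikit-learn versions.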