I want to train a model that will tell us the PM2.5 value (this value describes the AQI) of any image. For this I am using a CNN, built with TensorFlow. I am new to this field. Please tell me how to load our own dataset and separate the image names from the tags. The format of an image name is "imageName_tag" (e.g. ima01_23.4).
I think we need more information about your case regarding "how to upload our own dataset". However, if your dataset is on your computer and you want to access it from Python, I invite you to take a look at the "glob" and "os" libraries.
To split the name (which in your case is "imageName_tag") you can use:
string = "imageName_tag"
name, tag = string.split('_')
As you'll have to do this for all your data, you'll have to use it in a loop and store the extracted information in lists.
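A minimal sketch of that loop, assuming the images sit in a single folder, have a file extension such as .jpg (the pattern below is a placeholder, adjust it to your files), and each tag parses as a float:

```python
import glob
import os

def load_names_and_tags(folder):
    """Collect image paths, names, and PM2.5 tags from 'imageName_tag' file names."""
    paths, names, tags = [], [], []
    for path in sorted(glob.glob(os.path.join(folder, "*.jpg"))):
        base = os.path.splitext(os.path.basename(path))[0]  # e.g. "ima01_23.4"
        name, tag = base.rsplit("_", 1)  # split on the last underscore only
        paths.append(path)
        names.append(name)
        tags.append(float(tag))  # the tag is the numeric PM2.5 label
    return paths, names, tags
```

Splitting on the last underscore with `rsplit` keeps this working even if an image name itself contains underscores.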
Just a simple question: I'm stuck in a scenario where I want to pass additional information, beyond the pipeline itself, inside a PMML file.
Other information like:
Average of all columns in the dataset: avg(col1), ..., avg(coln)
P values of all features.
Correlation of all features with target.
There can be more of those, but the situation is as you can see. I know they could easily be sent in a separate file made specifically for this purpose, but since they concern the ML model, I want them in a single file: the PMML.
The Question is:
Can we add extra information to a PMML file that might not be related to the model itself, so that it can be used on the other side? If that is somehow possible, it would be very helpful.
The PMML standard is extensible with custom elements. However, in the current case, all your needs appear to be served by existing PMML elements.
> Average of all columns in the dataset: avg(col1), ..., avg(coln)
> P values of all features
You can store descriptive statistics about features using the ModelStats element.
> Correlation of all features with target
You can store function information using the ModelExplanation element.
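As an illustration of the extension mechanism, here is a sketch that builds a custom Extension element with the standard library's xml.etree. The Extension element with name/value attributes comes from the PMML standard, but the payload (avg_col1 and its value) is purely hypothetical, and a real PMML document would also carry the proper namespace and required child elements:

```python
import xml.etree.ElementTree as ET

def add_extension(parent, name, value):
    """Attach a PMML Extension element carrying an arbitrary name/value pair."""
    ext = ET.SubElement(parent, "Extension")
    ext.set("name", name)
    ext.set("value", str(value))
    return ext

# Hypothetical usage: stash a column average alongside the model.
pmml = ET.Element("PMML", version="4.4")
add_extension(pmml, "avg_col1", 12.75)
print(ET.tostring(pmml, encoding="unicode"))
```

Most PMML elements allow Extension children, so the same helper could target a ModelStats or ModelExplanation node instead of the root.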
I am working through Indonesian data to use for NER and, as far as I can tell, there is no pretrained NLTK model for this language. So, to do this manually, I tried to extract all the unique words used in the entire data frame. I still don't know how to apply tags to the words, but this is what I did so far.
Please let me know if there is a more convenient way to do what I did in the following code. Also, let me know how to add tags to each row (if possible) and how to do NER for this. (I am new to coding, which is why I don't know how to ask, but I am trying my best to provide as much information as possible.)
Depending on what you want to do, if results are all that matters, you could use a pretrained transformer model from Hugging Face instead of NLTK. This will be more computationally heavy but will also give you better performance.
There is one fitting model I could find (I don't speak Indonesian, obviously, so excuse any errors in the sample sentence):
https://huggingface.co/cahya/xlm-roberta-large-indonesian-NER?text=Nama+saya+Peter+dan+saya+tinggal+di+Berlin.
The easiest way to use this would probably be either the API or an inference-only pipeline; check out this guide. All you would have to do to get this running for the Indonesian model is replace the previous model path (dslim/bert-base-NER) with cahya/xlm-roberta-large-indonesian-NER.
Note that this Indonesian model is quite large, so you need decent hardware. If you don't have it, you could alternatively use a (free) cloud computing service such as Google Colab.
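A sketch of that swap using the transformers pipeline API. The import is kept inside the function because the first call downloads several gigabytes of weights; aggregation_strategy="simple" merges word-piece tokens back into whole entities:

```python
MODEL_ID = "cahya/xlm-roberta-large-indonesian-NER"

def build_indonesian_ner(model_id=MODEL_ID):
    """Return a ready-to-use NER pipeline for the Indonesian model."""
    # Heavy import kept local: loading triggers the full model download.
    from transformers import pipeline
    return pipeline("ner", model=model_id, aggregation_strategy="simple")

# Usage (sample sentence taken from the model card linked above):
#   ner = build_indonesian_ner()
#   for entity in ner("Nama saya Peter dan saya tinggal di Berlin."):
#       print(entity["entity_group"], entity["word"])
```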
I want to extract a name using Duckling but keep failing, with the message "failed to extract requested slot 'name'". Can anyone explain this to me?
Duckling does not extract names; it specializes in regularly patterned entities like numbers and dates (see their list of supported dimensions). spaCy offers pretrained models that usually have a PERSON label, which might be what you want. See e.g. the label scheme for the English models.
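A sketch of the spaCy route, assuming the small English model has been installed beforehand (python -m spacy download en_core_web_sm):

```python
def extract_person_names(text, model="en_core_web_sm"):
    """Return all spans that the pretrained model labels as PERSON."""
    import spacy  # local import so the model is only loaded when needed
    nlp = spacy.load(model)
    return [ent.text for ent in nlp(text).ents if ent.label_ == "PERSON"]

# Usage:
#   extract_person_names("My name is Peter and I live in Berlin.")
#   (with en_core_web_sm this would typically return ["Peter"])
```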
I am using Python Dedupe package for record linkage tasks.
That means matching company names in one data set to those in another.
The Dedupe package allows the user to label pairs for training a logistic regression model. However, it's a manual process, and one needs to input y/n for each pair shown on screen.
I want to load a training file which has 3 columns, say Company 1, Company 2, and Match, where Match takes the value yes or no depending on whether Company 1 and Company 2 are the same or different.
I am following this source code but couldn't find a way to load a file for training.
Also, the docs show that one can change the default classifier, but I'm not sure how to do this. Can anyone please help me with this?
Look up the trainingDataLink function in the dedupe documentation. It’s designed to handle pre-labeled data for record linkage.
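If you end up loading your labels from a file, a converter along these lines could turn the 3-column CSV into the match/distinct JSON layout that dedupe's training files use. Note that the "match"/"distinct" structure is my reading of dedupe's training format, and the column names come from the question, so check both against your actual data:

```python
import csv
import json

def csv_to_training_json(csv_path, json_path):
    """Convert rows of (Company 1, Company 2, Match) into labeled record pairs."""
    matches, distinct = [], []
    with open(csv_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            pair = ({"name": row["Company 1"]}, {"name": row["Company 2"]})
            if row["Match"].strip().lower() == "yes":
                matches.append(pair)
            else:
                distinct.append(pair)
    with open(json_path, "w", encoding="utf-8") as f:
        json.dump({"match": matches, "distinct": distinct}, f, indent=2)
    return len(matches), len(distinct)
```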
I am trying to find databases like the LJ Speech Dataset made by Keith Ito. I need to use these datasets with Tacotron 2 (link), so I think the datasets need to be structured in a certain way. The LJ dataset is linked directly on the Tacotron 2 GitHub page, so I think it's safe to assume it was made to work with it, and that other databases should therefore have the same structure as LJ. I downloaded the dataset and found that it's structured like this:
main folder:
- wavs
  - 001.wav
  - 002.wav
  - etc.
- metadata.csv: a CSV file containing the text spoken in every .wav, in a form like **001.wav | hello etc.**
So, my question is: are there other datasets like this one for further training? I think there might be problems, though; for example, the voice in one dataset would be different from the voice in another. Would that cause too many problems? And can different slang or similar variations also cause problems?
There are a few resources. The main ones I would look at are Festvox (aka CMU Arctic) http://www.festvox.org/dbs/index.html and LibriVox https://librivox.org/
These guys seem to be maintaining a list:
https://github.com/candlewill/Speech-Corpus-Collection
And I am part of a project that is collecting more (shameless self plug): https://github.com/Idlak/Living-Audio-Dataset
Mozilla includes a database of several datasets you can download and use, if you don't need your own custom language or voice: https://voice.mozilla.org/data
Alternatively, you could create your own dataset following the structure you outlined in your OP. The metadata.csv file needs to contain at least two columns: the first is the path/name of the WAV file (without the .wav extension), and the second is the text that was spoken.
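A sketch of writing such a metadata.csv with the standard library, assuming the pipe-delimited two-column layout described above (LJ Speech itself also carries a third, normalized-text column):

```python
import csv
from pathlib import Path

def write_metadata(pairs, out_dir):
    """Write (wav file name, transcript) pairs as a pipe-delimited metadata.csv."""
    out_path = Path(out_dir) / "metadata.csv"
    with open(out_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f, delimiter="|")
        for wav_name, transcript in pairs:
            # First column is the clip name without its .wav extension.
            writer.writerow([Path(wav_name).stem, transcript])
    return out_path
```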
Unless you are training Tacotron with speaker embeddings or as a multi-speaker model, you'd want all the recordings to be from the same speaker. Ideally, the audio quality should be very consistent, with a minimum of background noise; some background noise can be removed using RNNoise, and there's a script in the Mozilla Discourse group that you can use as a reference. All the recordings need to be short, 22050 Hz, 16-bit audio clips.
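Those audio requirements can be checked with the standard library's wave module. A sketch; the 22050 Hz / 16-bit figures come from above, while mono is my assumption:

```python
import wave

def clip_is_valid(path, rate=22050, sample_width_bytes=2):
    """Check that a WAV clip matches the expected sample rate and bit depth."""
    with wave.open(path, "rb") as w:
        return (w.getframerate() == rate
                and w.getsampwidth() == sample_width_bytes  # 2 bytes = 16-bit
                and w.getnchannels() == 1)  # assuming mono recordings
```

Running this over every file before training catches clips that were accidentally recorded or exported at the wrong rate.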
As for slang or local colloquialisms: I'm not sure, but as long as the spoken sounds match what's written (i.e. the phonemes match up), I would expect the system to be able to handle it. Tacotron is able to handle and train on multiple languages.
If you don't have the resources to produce your own recordings, you could use audio from a permissively licensed audiobook in the target language. There's a tutorial on this very topic here: https://medium.com/@klintcho/creating-an-open-speech-recognition-dataset-for-almost-any-language-c532fb2bc0cf
The tutorial has you:
Download the audio from the audiobook.
Remove any parts that aren't useful (e.g. the introduction, foreword, etc.) with Audacity.
Use Aeneas to fine-tune and then export a forced alignment between the audio and the text of the e-book, so that the audio can be exported sentence by sentence.
Create the metadata.csv file containing the map from audio to segments. (The format that the post describes seems to include extra columns that aren't really needed for training and are mainly for use by Mozilla's online database).
You can then use this dataset with systems that support LJSpeech, like Mozilla TTS.