Input for Machine Learning - Python

I'm using a MATLAB toolbox and scikit-learn to work on an environmental problem. I have data for different time steps, but I do not want to concatenate all the files together, since each file belongs to a specific time step. How can I point to a folder containing 1000 files and have the machine read those files consecutively?
I've tried using the toolbox and scikit-learn to read each file as an input, but I was not able to read all the files as consecutive inputs.
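For illustration, a minimal sketch of one common approach in Python, assuming the files are CSVs whose names sort in time-step order (the folder name and the use of pandas are assumptions):
import glob
import pandas as pd

# Collect every file in the folder; sorted() keeps them in time-step
# order, provided the filenames sort chronologically (step_0001.csv, ...).
files = sorted(glob.glob("data/*.csv"))  # hypothetical folder

# Read the files consecutively instead of concatenating them:
# each iteration hands you the data for one time step.
for path in files:
    frame = pd.read_csv(path)
    # ... fit or update the scikit-learn model with this time step here ...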

Related

Training a Keras model with multiple CSV files in multiple folders in Python

I have eight folders with 1300 CSV files (3*50) in each folder; each folder represents a label, but I have no idea how to feed my data into a training model.
I'm still a beginner with CNNs.
Part of my CSV file can be accessed using this link.
When using Keras, you can use the tf.data.Dataset API, which helps you do what you want to achieve.
Example
Here is example code taken from one of my projects:
# matching a glob pattern!
dataset_pro_raw = tf.data.Dataset.list_files([f"./aclImdb/{name}/pos/*.txt"], shuffle=True)
dataset_pro_i = dataset_pro_raw.interleave(
    lambda file: tf.data.TextLineDataset(file),
    # how many files should be processed concurrently
    cycle_length=20,
    # number of parallel threads, to increase performance
    num_parallel_calls=10,
)
First, we create a file list with tf.data.Dataset.list_files(); note that the order of the files is already shuffled there. Then, via dataset_pro_raw.interleave(), we iterate through the file set and read the contents of the files with tf.data.TextLineDataset().
That way you can load data from multiple .txt files, or from almost any data source, very well. It is a bit clumsy to use at the beginning, but it has real advantages. Currently I only use tf.data.Dataset for training-data generation.
For more information on tf.data.Dataset you might want to check out this link
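To turn the interleaved dataset above into training batches, a typical follow-up looks like this (a sketch; the shuffle buffer and batch size are arbitrary assumptions):
# Shuffle individual lines, group them into batches, and let TensorFlow
# prepare the next batch while the current one is being consumed.
dataset_train = (
    dataset_pro_i
    .shuffle(buffer_size=1000)   # assumed buffer size
    .batch(32)                   # assumed batch size
    .prefetch(tf.data.AUTOTUNE)
)
# dataset_train can now be passed directly to model.fit().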

Kedro: How to pass multiple files of the same data from a directory as a node input?

I have a directory with multiple files in the same data format (1 file per day). It's like one dataset split into multiple files.
Is it possible to pass all the files to a Kedro node without specifying each file, so that they all get processed sequentially or in parallel, depending on the runner?
If the number of files is small and fixed, you may consider creating the preprocessing pipeline for each of them manually.
If the number of files is large or dynamic, you may create your pipeline definition programmatically for each of them, adding them all together afterwards, as sketched below. The same would probably apply to the programmatic creation of the required datasets.
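A minimal sketch of the programmatic approach (the node function, dataset names, and file count are hypothetical, and the per-day catalog entries are assumed to exist):
from kedro.pipeline import Pipeline, node

def preprocess(df):
    # hypothetical per-file preprocessing
    return df.dropna()

# One node per daily file; "day_01_raw" etc. are assumed catalog entries.
per_file_nodes = [
    node(
        func=preprocess,
        inputs=f"day_{i:02d}_raw",
        outputs=f"day_{i:02d}_clean",
        name=f"preprocess_day_{i:02d}",
    )
    for i in range(1, 31)
]

pipeline = Pipeline(per_file_nodes)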
An alternative option would be to read all the files once in the first node, concatenate them together into one dataset, and make all subsequent preprocessing nodes use that dataset (or its derivatives) as input.

Datasets like "The LJ Speech Dataset"

I am trying to find datasets like the LJ Speech Dataset made by Keith Ito. I need to use these datasets in Tacotron 2 (Link), so I think the datasets need to be structured in a certain way. The LJ dataset is linked directly from the Tacotron 2 GitHub page, so I think it's safe to assume it was made to work with it, and that other datasets should have the same structure as LJ. I downloaded the dataset and found that it's structured like this:
main folder:
- wavs
  - 001.wav
  - 002.wav
  - etc.
- metadata.csv: a CSV file which contains everything said in every .wav file, in a form like this: 001.wav | hello etc.
So, my question is: are there other datasets like this one for further training?
I think there might be problems, though; for example, the voice in one dataset would differ from the voice in another. Would this cause too many problems?
Could different slang or things like that also cause problems?
There are a few resources:
The main ones I would look at are Festvox (aka CMU Arctic) http://www.festvox.org/dbs/index.html and LibriVox https://librivox.org/
These guys seem to be maintaining a list:
https://github.com/candlewill/Speech-Corpus-Collection
And I am part of a project that is collecting more (shameless self-plug): https://github.com/Idlak/Living-Audio-Dataset
Mozilla hosts a collection of several datasets you can download and use, if you don't need your own custom language or voice: https://voice.mozilla.org/data
Alternatively, you could create your own dataset following the structure you outlined in your OP. The metadata.csv file needs to contain at least two columns: the first is the path/name of the WAV file (without the .wav extension), and the second is the text that has been spoken.
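A hypothetical example of two metadata.csv rows (LJ Speech uses the pipe character | as the separator; its own metadata also adds a third column with the normalized transcription):
001|Hello there.
002|This is the second recording, spoken by the same voice.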
Unless you are training Tacotron with a speaker embedding/multi-speaker model, you'd want all the recordings to be from the same speaker. Ideally, the audio quality should be very consistent, with a minimal amount of background noise. Some background noise can be removed using RNNoise. There's a script in the Mozilla Discourse group that you can use as a reference. All the recording files need to be short 22050 Hz, 16-bit audio clips.
As for slang or local colloquialisms -- not sure; I suspect that as long as the word sounds match what's written (i.e. the phonemes match up), the system should be able to handle it. Tacotron is able to handle/train on multiple languages.
If you don't have the resources to produce your own recordings, you could use audio from a permissively licensed audiobook in the target language. There's a tutorial on this very topic here: https://medium.com/@klintcho/creating-an-open-speech-recognition-dataset-for-almost-any-language-c532fb2bc0cf
The tutorial has you:
Download the audio from the audiobook.
Remove any parts that aren't useful (e.g. the introduction, foreword, etc.) with Audacity.
Use Aeneas to fine-tune and then export a forced alignment between the audio and the text of the e-book, so that the audio can be exported sentence by sentence.
Create the metadata.csv file containing the map from audio to text segments. (The format that the post describes seems to include extra columns that aren't really needed for training and are mainly for use by Mozilla's online database.)
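As a sketch of the last two steps, assuming Aeneas was asked to produce a JSON sync map (the file names, and the use of pydub for slicing, are assumptions, not something the tutorial prescribes):
import json
from pydub import AudioSegment

# The Aeneas JSON sync map contains one "fragment" per sentence,
# each with begin/end times in seconds and the matching text lines.
with open("sync_map.json") as f:                 # hypothetical file name
    fragments = json.load(f)["fragments"]

book = AudioSegment.from_file("audiobook.wav")   # hypothetical file name

with open("metadata.csv", "w") as meta:
    for i, frag in enumerate(fragments, start=1):
        start_ms = int(float(frag["begin"]) * 1000)
        end_ms = int(float(frag["end"]) * 1000)
        # Export each sentence as a 22050 Hz, 16-bit clip.
        clip = book[start_ms:end_ms].set_frame_rate(22050).set_sample_width(2)
        clip.export(f"wavs/{i:03d}.wav", format="wav")
        # filename (without extension) | spoken text
        meta.write(f"{i:03d}|{' '.join(frag['lines'])}\n")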
You can then use this dataset with systems that support LJSpeech, like Mozilla TTS.

CNTK personal dataset

I have read the CNTK tutorials but I cannot find a way to structure and create my image dataset.
In the tutorials they use, as for the MNIST dataset, files like mean.xml and map.txt for the labels and features. I cannot find other guides on how to generate these files from a folder layout (for example, positive and negative folders), from which to obtain the file mapping and the serialization of the images in the |labels |features| format.
You can write a small Python program to create these files for yourself; e.g. the map file can be built with os.walk or some calls to glob, as sketched below. Similarly, the mean.xml can be constructed using PIL and minidom as in this example.
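A minimal sketch of generating the map file with glob (the folder layout is hypothetical; the tab-separated path/label lines follow the format CNTK's ImageDeserializer expects):
import glob
import os

# Assumed layout: images/positive/*.jpg and images/negative/*.jpg
labels = {"negative": 0, "positive": 1}

with open("map.txt", "w") as map_file:
    for folder, label in labels.items():
        for path in sorted(glob.glob(os.path.join("images", folder, "*.jpg"))):
            # One "<image path>\t<label>" line per image.
            map_file.write(f"{path}\t{label}\n")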

How to view sparse matrices outside the Python environment

I am working with sparse matrices which are 11685 by 85730. I am only able to store them as .pickle files. I want to view the files outside the Python environment as well. I tried saving them as .txt and .csv files, but they are of no help. Can anybody suggest a suitable format and library so that I can view those matrices outside the Python environment?
Python allows you to write to many formats that are readable outside of Python. .csv is one format, but there are also HDF5 and netCDF4, among others (those are meant to store array data, though):
http://code.google.com/p/netcdf4-python/
http://code.google.com/p/h5py/
Or you could save them in a MATLAB-readable format:
http://docs.scipy.org/doc/scipy/reference/tutorial/io.html
What you use should depend on how you plan on accessing the data outside of Python.
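As a sketch of the MATLAB-readable route with scipy.io (the variable and file names are arbitrary); scipy.io.mmwrite additionally produces the plain-text Matrix Market format, which many tools outside Python understand:
import scipy.io
import scipy.sparse as sp

# A small random sparse matrix standing in for the 11685 x 85730 one.
matrix = sp.random(100, 200, density=0.01, format="csr")

# MATLAB-readable .mat file; scipy stores sparse matrices natively.
scipy.io.savemat("matrix.mat", {"m": matrix})

# Plain-text Matrix Market file: one "row col value" line per entry.
scipy.io.mmwrite("matrix.mtx", matrix)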
