I want to try out a few algorithms in by loading my own dataset. I'm specifically interested in loading text files (very similar to the 20 NewsGroups dataset http://scikit-learn.org/stable/datasets/index.html#general-dataset-api). Is there any documentation that explains the format (and the procedure) for loading in data other than the sample datasets?
Thanks.
TfidfVectorizer and others text vectorizers classes in scikit-learn just take a list of Python unicode strings as input. You can thus load the text the way you want depending on the source: database query using SQLAlchemy, json stream from an HTTP API, a CSV file or random text files in folders.
For the last option, if the class information is stored in the folder names holding the text files you can use the load_files utility function.
Related
I am working with large textual data (articles containing words, symbols,escape characters,line breaks etc). Each article also consists of attributes like date , author etc.
It will be used in python for NLP purposes . What is the best way to store this data such that it can be read efficiently from disk.
EDIT :
I have loaded the data as a pandas dataframe in python
Storing as a csv results in corruption due to line breaks(\n) in the text making the data unusable.
storing as JSON is not working due to encoding problems.
I have my twitter archive downloaded and wanted to run word2vec to experiment most similar words, analogies etc on it.
But I am stuck at first step - how to convert a given dataset / csv / document so that it can be input to word2vec? i.e. what is the process to convert data to glove/word2vec format?
Typically implementations of the word2vec & GLoVe algorithms do one or both of:
accept a plain text file, where tokens are delimited by (one or more) spaces, and text is considered each newline-delimited line at a time (with lines that aren't "too long" - usually, short-article or paragraph or sentence per line)
have some language/library-specific interface for feeding texts (lists-of-tokens) to the algorithm as a stream/iterable
The Python Gensim library offers both options for its Word2Vec class.
You should generally try working through one or more tutorials to get a working overview of the steps involved, from raw data to interesting results, before applying such libraries to your own data. And, by examining the formats used by those tutorials – and the extra steps they perform to massage the data into the formats needed by exactly the libraries you're using – you'll also see ideas for how your data needs to be prepared.
I have eight folders with 1300 CSV files (3*50) in each folder, each folder represents a label, but I have no idea how to input my data in to a training model.
Still, a beginner in CNN.
A part of my csv file can be accessed using this link.
When using Keras, you can use the tf.data.Dataset package, which helps you doing what you want to achieve.
Example
Here is an example code, I took from one of my projects:
# matching a glob pattern!
dataset_pro_raw = tf.data.Dataset.list_files([f"./aclImdb/{name}/pos/*.txt"], shuffle=True)
dataset_pro_i = dataset_pro_raw.interleave(
lambda file: tf.data.TextLineDataset(file),
# how many files should be processed concurently
cycle_length = 20,
# number of threads to increase the performance
num_parallel_calls = 10
)
First, we create a filelist by tf.data.Dataset.list_files(), also note, that already there the order of the files is shuffled. Then via dataset_pro_raw.interleave() we iterate through the file set and read the content of the files with with tf.data.TextLineDataset().
That way you can load data from multiple .txt files or any data source very well. It is a big clumsy at the beginning ot use, but it has very well advantages. Currently I only use tf.data.Dataset for train-data generation.
For more information on tf.data.Dataset you might want to check out this link
I am trying to find databases like the LJ Speech Dataset made by Keith Ito. I need to use these datasets in TacoTron 2 (Link), so I think datasets need to be structured in a certain way. the LJ database is linked directly into the tacotron 2 github page, so I think it's safe to assume it's made to work with it. So I think Databases should have the same structure as the LJ. I downloaded the Dataset and I found out that it's structured like this:
main folder:
-wavs
-001.wav
-002.wav
-etc
-metadata.csv: This file is a csv file which contains all the things said in every .wav, in a form like this **001.wav | hello etc.**
So, my question is: Are There other datasets like this one for further training?
But I think there might be problems, for example, the voice from one dataset would be different from the one in one another, would this cause too much problems?
And also different slangs or things like that can cause problems?
There a few resources:
The main ones I would look at are Festvox (aka CMU artic) http://www.festvox.org/dbs/index.html and LibriVoc https://librivox.org/
these guys seem to be maintaining a list
https://github.com/candlewill/Speech-Corpus-Collection
And I am part of a project that is collecting more (shameless self plug): https://github.com/Idlak/Living-Audio-Dataset
Mozilla includes a database of several datasets you can download and use, if you don't need your own custom language or voice: https://voice.mozilla.org/data
Alternatively, you could create your own dataset following the structure you outlined in your OP. The metadata.csv file needs to contain at least two columns -- the first is the path/name of the WAV file (without the .wav extension), and the second column is the text that has been spoken.
Unless you are training Tacotron with speaker embedding/a multi-speaker model, you'd want all the recordings to be from the same speaker. Ideally, the audio quality should be very consistent with a minimum amount of background noise. Some background noise can be removed using RNNoise. There's a script in the Mozilla Discourse group that you can use as a reference. All the recordings files need to be short, 22050 Hz, 16-bit audio clips.
As for slag or local colloquialisms -- not sure; I suspect that as long as the word sounds match what's written (i.e. the phonemes match up), I would expect the system to be able to handle it. Tacotron is able to handle/train on multiple languages.
If you don't have the resources to produce your own recordings, you could use audio from a permissively licensed audiobook in the target language. There's a tutorial on this very topic here: https://medium.com/#klintcho/creating-an-open-speech-recognition-dataset-for-almost-any-language-c532fb2bc0cf
The tutorial has you:
Download the audio from the audiobook.
Remove any parts that aren't useful (e.g. the introduction, foreward, etc) with Audacity.
Use Aeneas to fine-tune and then export a forced alignment between the audio and the text of the e-book, so that the audio can be exported sentence by sentence.
Create the metadata.csv file containing the map from audio to segments. (The format that the post describes seems to include extra columns that aren't really needed for training and are mainly for use by Mozilla's online database).
You can then use this dataset with systems that support LJSpeech, like Mozilla TTS.
When I try to create a word2vec model (skipgram with negative sampling) I received 3 files as output as follows.
word2vec (File)
word2vec.syn1nef.npy (NPY file)
word2vec.wv.syn0.npy (NPY file)
I am just worried why this happens as for my previous test examples in word2vec I only received one model(no npy files).
Please help me.
Models with larger internal vector-arrays can't be saved via Python 'pickle' to a single file, so beyond a certain threshold, the gensim save() method will store subsidiary arrays in separate files, using the more-efficient raw format of numpy arrays (.npy format).
You still load() the model by just specifying the root model filename; when the subsidiary arrays are needed, the loading code will find the side files – as long as they're kept beside the root file. So when moving a model elsewhere, be sure to keep all files with the same root filename together.