I have eight folders with 1300 CSV files (3*50) in each folder; each folder represents a label, but I have no idea how to feed my data into a training model.
I'm still a beginner with CNNs.
Part of my CSV file can be accessed using this link.
When using Keras, you can use the tf.data.Dataset API, which helps you do exactly what you want to achieve.
Example
Here is some example code I took from one of my projects:
# matching a glob pattern!
dataset_pro_raw = tf.data.Dataset.list_files([f"./aclImdb/{name}/pos/*.txt"], shuffle=True)
dataset_pro_i = dataset_pro_raw.interleave(
    lambda file: tf.data.TextLineDataset(file),
    # how many files should be processed concurrently
    cycle_length=20,
    # number of threads to increase performance
    num_parallel_calls=10
)
First, we create a file list with tf.data.Dataset.list_files(); note that the order of the files is already shuffled at this point. Then, via dataset_pro_raw.interleave(), we iterate through the file set and read the content of each file with tf.data.TextLineDataset().
That way you can load data from multiple .txt files, or almost any other data source, quite efficiently. It is a bit clumsy to use at the beginning, but it has real advantages. Currently I use tf.data.Dataset exclusively for training-data generation.
For more information on tf.data.Dataset, you might want to check out this link.
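Adapting the same pattern to your case (folders of CSV files, one folder per label) might look like the sketch below. The folder layout, folder names, and cycle_length are assumptions; adjust them to your data.

import tensorflow as tf

# Assumed layout: ./data/<label_name>/*.csv, one folder per class.
label_names = ["class_a", "class_b"]  # replace with your eight folder names

def make_labelled_dataset(label_index, label_name):
    files = tf.data.Dataset.list_files(f"./data/{label_name}/*.csv", shuffle=True)
    lines = files.interleave(
        # .skip(1) drops a header row; remove it if your CSVs have none
        lambda path: tf.data.TextLineDataset(path).skip(1),
        cycle_length=4,
        num_parallel_calls=tf.data.AUTOTUNE,
    )
    # attach the folder's label to every CSV row
    return lines.map(lambda line: (line, label_index))

datasets = [make_labelled_dataset(i, name) for i, name in enumerate(label_names)]

# concatenate the per-label datasets, then shuffle and batch
full_dataset = datasets[0]
for ds in datasets[1:]:
    full_dataset = full_dataset.concatenate(ds)
full_dataset = full_dataset.shuffle(buffer_size=10_000).batch(32)

Each element is still a raw text line paired with an integer label; how you parse the line into numeric columns (for example with tf.io.decode_csv) depends on your CSV layout.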
Related
I have a dataset of ~3500 images, with the labels of each image in a csv file. The csv file has two columns: the first one contains the exact name of the image file (e.g. 00001.jpg) and the second column contains the label of the image. There are a total of 7 different labels.
How can I sort the images from one huge folder to 7 different folders (each image in its respective category) in an efficient manner? Does anyone have a script that can do this?
Also, is there any way I can do this with Google Drive? I've already uploaded the dataset to Drive in order to use with Colab soon, so I don't want to have to do it again (takes ~2.5 hours).
I'm not sure about performance; there are probably better ways...
But this would be my take on the problem:
(not tested, so might need small adjustments)
I'm assuming the images are in a subfolder /images/, while the csv and the script are in the root folder. Furthermore, I'm assuming the csv is named images.csv and the columns in the csv are titled file and label.
import pandas as pd
import os

df = pd.read_csv('images.csv')

for _, row in df.iterrows():
    f = row['file']
    l = row['label']
    # create the destination folder if it doesn't exist yet
    os.makedirs(f'images/{l}', exist_ok=True)
    os.replace(f'images/{f}', f'images/{l}/{f}')
I don't know what Google Drive would make of it, but as long as you can run it on a Drive-synced folder, I don't see why this should be an issue.
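If the files live in Drive and you run the script from Colab, a minimal sketch would be the following; the Drive path is a placeholder, so adjust it to wherever you uploaded the dataset.

# Colab only: mount Google Drive, then point the loop at the mounted folder
from google.colab import drive
import pandas as pd
import os

drive.mount('/content/drive')

base = '/content/drive/MyDrive/dataset'   # hypothetical path to your uploaded dataset
df = pd.read_csv(f'{base}/images.csv')

for _, row in df.iterrows():
    f, l = row['file'], row['label']
    os.makedirs(f'{base}/images/{l}', exist_ok=True)
    os.replace(f'{base}/images/{f}', f'{base}/images/{l}/{f}')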
Note: if you test it, you may want to do so on a copy of the files, in case I screwed up...
I have a directory with multiple files of the same data format (one file per day). It's like one dataset split into multiple files.
Is it possible to pass all the files to a Kedro node without specifying each file, so they all get processed sequentially or in parallel depending on the runner?
If the number of files is small and fixed, you may consider creating the preprocessing pipeline for each of them manually.
If the number of files is large or dynamic, you may create your pipeline definition programmatically for each of them and add them all together afterwards. The same would probably apply to the programmatic creation of the required datasets.
An alternative option would be to read all the files once in the first node, concatenate them into one dataset, and make all subsequent preprocessing nodes use that dataset (or its derivatives) as input.
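For the programmatic option, a minimal sketch could look like this; the dataset names and the preprocessing function are placeholders, not part of any real project.

from kedro.pipeline import Pipeline, node

def preprocess(df):
    # hypothetical per-file preprocessing step
    return df.dropna()

def create_pipeline(daily_dataset_names):
    # daily_dataset_names: catalog entries such as ["sales_2021_01_01", "sales_2021_01_02", ...]
    nodes = [
        node(preprocess, inputs=name, outputs=f"{name}_preprocessed")
        for name in daily_dataset_names
    ]
    return Pipeline(nodes)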
I'm using a MATLAB toolbox and scikit-learn to implement an environmental model. I have data from different time steps, but I do not want to concatenate all the files, since each one corresponds to a specific time step. How would you point to a folder containing 1000 files and have the machine read those files consecutively?
I've tried using the toolbox and scikit-learn to read each file as an input, but I was not able to read all the files as consecutive inputs.
When I create a word2vec model (skip-gram with negative sampling), I get 3 output files, as follows:
word2vec (File)
word2vec.syn1neg.npy (NPY file)
word2vec.wv.syn0.npy (NPY file)
I am just wondering why this happens, as in my previous word2vec test examples I only received one model file (no .npy files).
Please help me.
Models with larger internal vector-arrays can't be saved via Python 'pickle' to a single file, so beyond a certain threshold, the gensim save() method will store subsidiary arrays in separate files, using the more-efficient raw format of numpy arrays (.npy format).
You still load() the model by just specifying the root model filename; when the subsidiary arrays are needed, the loading code will find the side files – as long as they're kept beside the root file. So when moving a model elsewhere, be sure to keep all files with the same root filename together.
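A minimal sketch of the save/load round trip; the corpus and filenames are placeholders, and parameter names can differ slightly between gensim versions:

from gensim.models import Word2Vec

sentences = [["hello", "world"], ["gensim", "word2vec", "example"]]  # toy corpus

# sg=1 -> skip-gram, negative=5 -> negative sampling
model = Word2Vec(sentences, sg=1, negative=5, min_count=1)

# may write "word2vec" plus side files such as "word2vec.wv.*.npy" for large models
model.save("word2vec")

# load by the root filename only; keep any .npy side files next to it
model = Word2Vec.load("word2vec")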
I want to try out a few algorithms in scikit-learn by loading my own dataset. I'm specifically interested in loading text files (very similar to the 20 Newsgroups dataset http://scikit-learn.org/stable/datasets/index.html#general-dataset-api). Is there any documentation that explains the format (and the procedure) for loading in data other than the sample datasets?
Thanks.
TfidfVectorizer and the other text vectorizer classes in scikit-learn just take a list of Python unicode strings as input. You can thus load the text however you want, depending on the source: a database query using SQLAlchemy, a JSON stream from an HTTP API, a CSV file, or plain text files in folders.
For the last option, if the class information is encoded in the names of the folders holding the text files, you can use the load_files utility function.
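A minimal sketch, assuming a layout like my_corpus/<category_name>/*.txt (the path is a placeholder):

from sklearn.datasets import load_files
from sklearn.feature_extraction.text import TfidfVectorizer

# folder names become integer class labels in bunch.target
bunch = load_files("my_corpus", encoding="utf-8", decode_error="replace")

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(bunch.data)  # sparse TF-IDF feature matrix
y = bunch.target                          # labels derived from the folder names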