I am working on a Python/TensorFlow image classification model. My training set has 12,611 images, but my training labels have 12,613 entries. (Each image has a number as its filename, and this number corresponds to the same number in a CSV file with the accompanying information for that image.)
From here, what I need to do is simply remove the 2 extra data points for which I don't have pictures. How can I write code to help with this?
(If the code tells me which data points are the extras, I can manually remove them from the CSV file)
Thanks for the help.
Well, it's fairly straightforward; you can try something like this (as I don't know exactly how and where you have saved your images, you might have to update the code to fit your use case):
import os
import pandas as pd

dir_path = r'/path/to/folder/of/images'
csv_path = r'/path/to/csv/file'

# Collect the numeric IDs of all images in the folder
image_ids = []
for filename in os.listdir(dir_path):
    image_ids.append(int(filename.split('.')[0]))

# Read the CSV of labels
df = pd.read_csv(csv_path)

# Print the IDs that appear in the CSV but have no matching image
for i in df['<COLUMN_NAME>'].tolist():
    if i not in image_ids:
        print(i)
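If you would rather drop the orphaned rows automatically instead of editing the CSV by hand, here is a minimal follow-up sketch (assuming the df and image_ids from above, that '<COLUMN_NAME>' holds the image number, and an output filename of my choosing):

# Keep only the rows whose image number actually exists on disk
df_clean = df[df['<COLUMN_NAME>'].isin(image_ids)]
df_clean.to_csv('labels_clean.csv', index=False)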
I have a problem and don't know how to solve it:
I'm learning how to analyze DICOM files with Python. I have a single exam from a single patient: 200 DICOM files, each 512x512, with each file representing a different slice of the scan. I want to turn them into a single .npy file so I can use it in another tutorial that I found online.
Many tutorials convert them to jpg or png with OpenCV first, but I don't want that, since I'm not interested in a viewable image right now; I need the array. That step also ruins the image quality.
I already know that using:
medical_image = pydicom.read_file(file_path)
image = medical_image.pixel_array
I can take the path, turn one slice into a pixel array, and then use it, but the thing is, it doesn't work in a for loop.
The for loop I tried was basically this:
image = []  # to create an empty list
for f in glob.iglob('file_path'):
    img = pydicom.dcmread(f)
    image.append(img)
It results in a list with all the files. Up to this point it goes well, but it seems it's not the right way, because, although I can build the list, I can't find the supposed next steps anywhere, not even answers to the errors I get at this point (so I concluded it was wrong).
The following code snippet reads DICOM files from a folder dir_path and stores them in a list. The list does not actually hold the raw DICOM datasets; it is filled with NumPy arrays of Hounsfield units (obtained with the apply_modality_lut function).
import os
from pathlib import Path

import pydicom
from pydicom.pixel_data_handlers import apply_modality_lut

dir_path = r"path\to\dicom\files"
dicom_set = []

for root, _, filenames in os.walk(dir_path):
    for filename in filenames:
        dcm_path = Path(root, filename)
        if dcm_path.suffix == ".dcm":
            try:
                dicom = pydicom.dcmread(dcm_path, force=True)
            except IOError as e:
                print(f"Can't import {dcm_path.stem}")
            else:
                hu = apply_modality_lut(dicom.pixel_array, dicom)
                dicom_set.append(hu)
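Since the goal was a single .npy archive, a possible follow-up is a sketch like the one below (assuming the slices in dicom_set all have the same shape and are already in the desired order; the output filename is just an example):

import numpy as np

# Stack the 2D slices into a 3D volume of shape (num_slices, 512, 512)
volume = np.stack(dicom_set, axis=0)
np.save('exam_volume.npy', volume)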
You were well on your way. You just have to build up a volume from the individual slices that you read in. This code snippet will create a pixelVolume of dimension 512x512x200 if your data is as advertised.
import glob

import numpy
import pydicom

images = []  # to create an empty list

# Read all of the DICOM images from file_path into list "images"
for f in glob.iglob('file_path'):
    image = pydicom.dcmread(f)
    images.append(image)

# Use the first image to determine the number of rows and columns
repImage = images[0]
rows = int(repImage.Rows)
cols = int(repImage.Columns)
slices = len(images)

# This tuple represents the dimensions of the pixel volume
volumeDims = (rows, cols, slices)

# Allocate storage for the pixel volume
pixelVolume = numpy.zeros(volumeDims, dtype=repImage.pixel_array.dtype)

# Fill in the pixel volume one slice at a time
for i, image in enumerate(images):
    pixelVolume[:, :, i] = image.pixel_array

# Use pixelVolume to do something interesting
I don't know if you are a DICOM expert or a DICOM novice, but I am just accepting your claim that your 200 images make sense when interpreted as a volume. There are many ways that this may fail. The slices may not be in expected order. There may be multiple series in your study. But I am guessing you have a "nice" DICOM dataset, maybe used for tutorials, and that this code will help you take a step forward.
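If the slice order does turn out to be a problem, one common remedy (a sketch, assuming every file carries an InstanceNumber tag) is to sort the list before filling pixelVolume:

# Sort the slices by their InstanceNumber tag so the volume is assembled in order
images.sort(key=lambda ds: int(ds.InstanceNumber))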
I am merging two different datasets containing images into one dataset. One of the datasets contains 600 images in the training set. The other dataset contains only 90-100 images. I want to increase the size of the latter dataset by using the imgaug library. The images are stored in folders under the name of their class. So the path for a "cake" image in the training set would be ..//images//Cake//cake_0001. I'm trying to use this code to augment the images in this dataset:
path = 'C:\\Users\\User\\Documents\\Dataset\\freiburg_groceries_dataset\\images'
ia.seed(6)

seq = iaa.Sequential([
    iaa.Fliplr(0.5),
    iaa.Crop(percent=(0, 0.1)),
    iaa.Affine(rotate=(-25, 25))
], random_order=True)

for folder in os.listdir(path):
    try:
        for i in os.listdir(folder):
            img = imageio.imread(i)
            img_aug = seq(images=img)
            iaa.imshow(img_aug)
            print(img_aug)
    except:
        pass
Right now there's no output, even if I put print(img) or imshow(img) or anything. How do I make sure I actually get more images for this dataset? Also, what is the best point in the pipeline to augment the images? Where do the augmented images get stored, and how do I see how many new images were generated?
The question was not entirely clear, so this addresses issue 2: the errors in the loop, plus saving the files and visualizing them with imshow().
First, in the inner loop code block:
img = imageio.imread(i)
img_aug = seq(images=img)
iaa.imshow(img_aug)
print(img_aug)
The 1st error is that i is not the file path. To solve this, replace imageio.imread(i) with imageio.imread(path+'/'+folder+'/'+i).
The 2nd error is that iaa doesn't have an imshow() attribute.
To fix this, replace iaa.imshow(img_aug) with iaa.imgaug.imshow(img_aug). This fixes the visualization error and lets the loop finish executing.
Second: if you have any issue saving the images, use PIL,
i.e.,
from PIL import Image
im = Image.fromarray(img_aug)
im.save('img_aug.png')
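Putting the path fix, the augmentation, and the saving step together, a rough sketch of the whole loop could look like this (the path and augmenters are the ones from the question; the aug_ filename prefix and the single-image image= keyword are my own choices, not from the original code):

import os

import imageio
from imgaug import augmenters as iaa
from PIL import Image

path = 'C:\\Users\\User\\Documents\\Dataset\\freiburg_groceries_dataset\\images'

seq = iaa.Sequential([
    iaa.Fliplr(0.5),
    iaa.Crop(percent=(0, 0.1)),
    iaa.Affine(rotate=(-25, 25))
], random_order=True)

for folder in os.listdir(path):
    for i in os.listdir(os.path.join(path, folder)):
        img = imageio.imread(os.path.join(path, folder, i))
        img_aug = seq(image=img)  # augment one image at a time
        # Save the augmented copy next to the original with an "aug_" prefix
        Image.fromarray(img_aug).save(os.path.join(path, folder, 'aug_' + i))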
It's because folder is not the path to the directory you are looking for.
You should change for i in os.listdir(folder): to for i in os.listdir(path+'\\'+folder):. Then it looks inside the path\folder directory for files.
Well, I think I should mention that this is the very first time I'm trying audio signal processing in Python. I have an audio dataset and I am extracting pitch features using the aubio library and MFCC features using the python_speech_features library. The thing is, for a single audio file I am getting a roughly 84-valued vector for the pitch and a 12-valued feature vector for the MFCCs.
Image of extracted pitch feature vector
So how do I save all of these values in a single CSV file? I have around 700 audio files separated into different directories by emotion. Should I take the mean of all of these values and save them per audio file in a CSV? Like this:
Also, how would I use these values for classification then?
Any help would be much appreciated, Thanks.
There is no simple answer to your question.
I understand that for each data sample you extract a set of features, the same set for every sample, right?
I suppose you work within a for loop, something like this:
import numpy as np

all_features = []
for path in path_list:
    x = open_file(path)  # a hypothetical function to open your files
    features = extract_features(x)  # a hypothetical function to extract features
    all_features.append(features)
If your code looks like my simple example, you have created a list all_features whose element all_features[i] contains the extracted features of sample i. In addition, I assume that each extracted feature set is a NumPy vector; if it is not, you should convert it into one (something like features = np.array(features)).
Ok, now you are ready to create a dataset:
data = np.vstack(all_features)
The vertical stack np.vstack generates a matrix of shape (n_samples, n_features). Warning: all feature vectors must have the same shape!
Now you want to save the dataset. There is an ocean of possibilities; these are my three favorite options:
1) Using pandas to create a CSV file:
import pandas as pd
df = pd.DataFrame(data)
df.to_csv(filename+'.csv', index=False, header=header) #header is a list of string to name columns of csv
#see https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.to_csv.html
2) Dump the data into a pickle file:
import six.moves.cPickle as pickle

with open(filename + '.pkl', 'wb') as f:
    pickle.dump(data, f)
3) Save as a NumPy file:
np.save(filename+'.npy', data)
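For completeness, loading each of them back later would look roughly like this (same filename variable as above; just a sketch):

import numpy as np
import pandas as pd
import six.moves.cPickle as pickle

data_csv = pd.read_csv(filename + '.csv').values  # option 1: back to a NumPy matrix
with open(filename + '.pkl', 'rb') as f:          # option 2: unpickle
    data_pkl = pickle.load(f)
data_npy = np.load(filename + '.npy')             # option 3: load the .npy file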
Concerning the classification problem, if you want to use a supervised method (MLP, RF, SVM, KNN, ...) you need class labels (the ground truth), i.e. a vector whose length equals the number of samples and which assigns an integer to each sample (for example 0/1 for binary classification, or 0, 1, 2, 3 for a 4-class problem). This strongly depends on what you want, i.e. on the goal of your training.
Once you have the data matrix and the label vector, any machine-learning method will be able to classify, provided you have enough samples. With this aim, I suggest you use some augmentation strategy; to get started, have a look at this paper, it could give you some ideas.
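As a rough illustration of that last step, here is a minimal sketch (assuming the data matrix from above, a hypothetical integer label vector labels of matching length, and scikit-learn installed):

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Split the feature matrix and labels into train and test sets
X_train, X_test, y_train, y_test = train_test_split(data, labels, test_size=0.2)

# Fit a simple random forest and report its accuracy on the held-out set
clf = RandomForestClassifier(n_estimators=100)
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))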
Hoping I have helped you, good work!
Python has a built-in csv module.
The examples section of its documentation shows how to use a writer to write rows to your CSV.
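For instance, here is a minimal sketch of writing one feature row per audio file (the filenames, header, and values below are placeholders, not real data):

import csv

# Placeholder: two files, each with a 3-valued feature vector
rows = [
    ('happy_001.wav', [0.12, 0.53, 0.98]),
    ('sad_002.wav', [0.33, 0.41, 0.77]),
]

with open('features.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['file', 'feat_0', 'feat_1', 'feat_2'])  # header row
    for name, feats in rows:
        writer.writerow([name] + feats)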
I am working on training a model on my own images, read from my folders. I would be thankful if you could help me with this.
I successfully read all my images from the folders and create my own one-hot encoded labels. However, every time I run my code it takes a long time to read all the images from the folders. Therefore, I want to create a dataset from these images and save it, like MNIST, so it loads faster and I don't have to read all the images again. Could you please help me with this?
The code is:
path = "D:/cleandata/train_data/"
loadedImages = []
labels = []
sess = tf.InteractiveSession()
tf.global_variables_initializer().run()
for i in range(len(os.listdir(path))):
imagesList = listdir(path+os.listdir(path)[i])
for image in imagesList:
image_raw_data_jpg tf.gfile.FastGFile(path+os.listdir(path)
[i]+'/'+image, 'rb').read()
raw_image =tf.image.decode_png(image_raw_data_jpg,3)
gray_resize=tf.image.resize_images(raw_image, [28, 28])
image_data =
sess.run(tf.image.rgb_to_grayscale(gray_resize))
loadedImages.append(image_data)
Here is a tutorial on how to use a TFRecords file. It shows how to create the file (containing images and labels) and read from it.
http://www.machinelearninguru.com/deep_learning/tensorflow/basics/tfrecord/tfrecord.html
Or you could just use zipfile and include the label in the image file name, thus keeping them together (that is what I did).
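For reference, here is a minimal sketch of writing the preprocessed images and labels to a TFRecords file, following the general tf.train.Example pattern rather than the linked tutorial verbatim (loadedImages is the list built above, and the labels are assumed here to be plain integers; depending on your TensorFlow version you may need tf.python_io.TFRecordWriter instead of tf.io.TFRecordWriter):

import numpy as np
import tensorflow as tf

# Write each (image, label) pair as one serialized tf.train.Example record
with tf.io.TFRecordWriter('train_data.tfrecords') as writer:
    for image_data, label in zip(loadedImages, labels):
        image_bytes = np.asarray(image_data, dtype=np.float32).tobytes()
        example = tf.train.Example(features=tf.train.Features(feature={
            'image': tf.train.Feature(bytes_list=tf.train.BytesList(value=[image_bytes])),
            'label': tf.train.Feature(int64_list=tf.train.Int64List(value=[int(label)])),
        }))
        writer.write(example.SerializeToString())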
I downloaded the dataset from kaggle:
https://www.kaggle.com/c/dogs-vs-cats/data
Then I tried to get the image label from the downloaded data using img.split('.')[-3] (code at the end).
However, I got an "index out of range" error. I checked the filenames and saw that after unzipping the Kaggle dataset they are only 1.jpg, 2.jpg, 3.jpg, and so on.
From what I read, the dataset should have the label in the filename; see, for example:
https://www.packtpub.com/mapt/book/big_data_and_business_intelligence/9781788475655/23/ch23lvl1sec118/deep-learning-for-cats-versus-dogs
So my questions are:
Q1: I assume my Python syntax is right. With a filename of "num.jpg" rather than "label.num.jpg", I would only have two parts, [0] and [1], right?
Q2: If so, can anyone help me figure out why I don't get the dataset with the label in the filename?
PS: I am really new to Python, Kaggle (and programming in general).
Thank you
Mira
PS: my partial code:
import os

import cv2
from tqdm import tqdm

for img in tqdm(os.listdir(TRAIN_DIR)):
    path = os.path.join(TRAIN_DIR, img)
    img_data = cv2.imread(path)
    cv2.imshow('train_data_image:', img_data)
    print('test:', img.split('.')[-3])
Just FYI, I found the answer to my question...
It turns out I was using the test data, which indeed does not contain the label in the filename. I downloaded the train data and it does have the label (dog/cat) in the filename.
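For anyone hitting the same thing, here is a small sketch of pulling the label out of the train filenames (which look like dog.0.jpg or cat.0.jpg):

# Train filenames look like "dog.0.jpg" or "cat.0.jpg",
# so the class name is the first dot-separated part
filename = 'dog.0.jpg'
label = filename.split('.')[-3]  # equivalently, filename.split('.')[0]
print(label)  # -> dog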
thanks!
Mira