I am working on a Python/TensorFlow image classification model. I have 12,611 training images but 12,613 training labels. (Each image has a number as its title, and this number corresponds to the same number in a CSV file with the accompanying information for that image.)
From here, I simply need to remove the 2 extra data points for which I don't have pictures. How can I write code to do this?
(If the code tells me which data points are the extras, I can manually remove them from the CSV file.)
Thanks for the help.
Well, it's very straightforward. You can try something like this (as I don't know exactly how and where you have saved your images, you might have to update the code to fit your use case):
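import os
import pandas as pd

dir_path = r'/path/to/folder/of/images'
csv_path = r'/path/to/csv/file'
images = []
# Get all image labels (file names without their extension)
for filename in os.listdir(dir_path):
    images.append(os.path.splitext(filename)[0])
# Read CSV
df = pd.read_csv(csv_path)
# Print which labels are extra
for i in df['<COLUMN_NAME>'].tolist():
    # File names are strings, so compare as strings
    if str(i) not in images:
        print(i)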
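A set difference gives the same answer and also catches the opposite case (images without a label). Here is a minimal sketch using only the standard library. The function name find_unmatched and the column name passed to it are made up for illustration, and it assumes the image file name without its extension is exactly the value stored in the CSV column.

```python
import csv
from pathlib import Path


def find_unmatched(image_dir, csv_path, column):
    """Return (labels without an image, images without a label).

    Assumes the image file name without its extension matches the
    value stored in `column` of the CSV file.
    """
    # File stems, e.g. "123.jpg" -> "123"
    image_ids = {p.stem for p in Path(image_dir).iterdir() if p.is_file()}

    with open(csv_path, newline="") as f:
        label_ids = {row[column].strip() for row in csv.DictReader(f)}

    return sorted(label_ids - image_ids), sorted(image_ids - label_ids)
```

The first list holds the rows you would remove from the CSV file. Both lists are sorted as strings, not as numbers.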
I have a problem and don't know how to solve it:
I'm learning how to analyze DICOM files with Python.
I have one exam from a single patient, which consists of 200 DICOM files of size 512x512, each file representing a different layer of the patient, and I want to turn them into a single .npy file so I can use it in another tutorial that I found online.
Many tutorials convert them to JPG or PNG using OpenCV first, but I don't want this since I'm not interested in a viewable image right now; I need the array. Also, this step degrades the quality of the images.
I already know that using:
medical_image = pydicom.read_file(file_path)
image = medical_image.pixel_array
With this I can take a path, turn one slice into a pixel array and then use it, but it doesn't work in a for loop.
The for loop I tried was basically this:
image = [] # to create an empty list
for f in glob.iglob('file_path'):
    img = pydicom.dcmread(f)
    image.append(img)
It results in a list with all the files. Up to here it goes well, but it seems this is not the right way, because I have the list but can't find the supposed next steps anywhere, not even answers to the errors that I get at this point (so I concluded it was wrong).
The following code snippet reads DICOM files from a folder dir_path and stores them in a list. Strictly speaking, the list does not hold the raw DICOM files but NumPy arrays of Hounsfield units (obtained with the apply_modality_lut function).
import os
from pathlib import Path
import pydicom
from pydicom.pixel_data_handlers import apply_modality_lut

dir_path = r"path\to\dicom\files"
dicom_set = []
for root, _, filenames in os.walk(dir_path):
    for filename in filenames:
        dcm_path = Path(root, filename)
        if dcm_path.suffix == ".dcm":
            try:
                dicom = pydicom.dcmread(dcm_path, force=True)
            except IOError as e:
                print(f"Can't import {dcm_path.stem}")
            else:
                # Convert stored pixel values using the modality LUT / rescale tags
                hu = apply_modality_lut(dicom.pixel_array, dicom)
                dicom_set.append(hu)
You were well on your way. You just have to build up a volume from the individual slices that you read in. This code snippet will create a pixelVolume of dimension 512x512x200 if your data is as advertised.
import glob
import numpy
import pydicom

images = [] # to create an empty list
# Read all of the DICOM images from file_path into list "images"
for f in glob.iglob('file_path'):
    image = pydicom.dcmread(f)
    images.append(image)
# Use the first image to determine the number of rows and columns
repImage = images[0]
rows = int(repImage.Rows)
cols = int(repImage.Columns)
slices = len(images)
# This tuple represents the dimensions of the pixel volume
volumeDims = (rows, cols, slices)
# allocate storage for the pixel volume
pixelVolume = numpy.zeros(volumeDims, dtype=repImage.pixel_array.dtype)
# fill in the pixel volume one slice at a time
for i, image in enumerate(images):
    pixelVolume[:,:,i] = image.pixel_array
# Use pixelVolume to do something interesting
I don't know if you are a DICOM expert or a DICOM novice, but I am just accepting your claim that your 200 images make sense when interpreted as a volume. There are many ways that this may fail. The slices may not be in expected order. There may be multiple series in your study. But I am guessing you have a "nice" DICOM dataset, maybe used for tutorials, and that this code will help you take a step forward.
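Since slice order is one of the things that can go wrong and the goal is a single .npy file, here is a rough sketch of the sort-then-save step. It works on plain (position, array) pairs so that it runs without any DICOM files. With pydicom, the position could come from an attribute such as InstanceNumber or the third component of ImagePositionPatient, but which one is appropriate for your data is an assumption you would need to check. The function name build_volume and the tiny fake slices are made up for illustration.

```python
import os
import tempfile

import numpy as np


def build_volume(slices):
    """Stack (position, 2-D array) pairs into a (rows, cols, n) volume.

    `slices` is a list of (position, pixel_array) tuples; the position
    is whatever value defines the slice order in your data.
    """
    ordered = sorted(slices, key=lambda pair: pair[0])
    # Stack along a new last axis so the result is rows x cols x slices
    return np.stack([pixels for _, pixels in ordered], axis=-1)


# Three tiny fake slices, deliberately given out of order
fake = [
    (2.0, np.full((4, 4), 2)),
    (0.0, np.full((4, 4), 0)),
    (1.0, np.full((4, 4), 1)),
]
volume = build_volume(fake)

# Save the whole volume as one .npy file and read it back
with tempfile.TemporaryDirectory() as tmp:
    out_path = os.path.join(tmp, "volume.npy")
    np.save(out_path, volume)
    restored = np.load(out_path)
```

np.save keeps the array's dtype and shape, so nothing is lost the way it can be when going through JPG or PNG.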
I am merging two different datasets containing images into one dataset. One of the datasets contains 600 images in the training set. The other dataset contains only 90-100 images. I want to increase the size of the latter dataset by using the imgaug library. The images are stored in folders under the name of their class. So the path for a "cake" image in the training set would be ..//images//Cake//cake_0001. I'm trying to use this code to augment the images in this dataset:
path = 'C:\\Users\\User\\Documents\\Dataset\\freiburg_groceries_dataset\\images'
seq = iaa.Sequential([
    iaa.Crop(percent=(0, 0.1)),
], random_order=True)
for folder in os.listdir(path):
    for i in os.listdir(folder):
        img = imageio.imread(i)
        img_aug = seq(images=img)
Right now there's no output, even if I put print(img) or imshow(img) or anything else. How do I ensure that I got more images for this dataset? Also, what is the best spot to augment images? Where do the augmented images get stored, and how do I see how many new images were generated?
The question was not entirely clear, so this addresses the second issue: the error when saving files and not being able to visualize them with imshow().
First: in the body of the second loop,
    img = imageio.imread(i)
    img_aug = seq(images=img)
the 1st error is that i is not a file path. To solve this, replace imageio.imread(i) with imageio.imread(path+'/'+folder+'/'+i).
The 2nd error is that iaa doesn't have an imshow() function.
To fix this, replace iaa.imshow(img_aug) with iaa.imgaug.imshow(img_aug). This fixes the visualization error and lets the loop finish.
Second: if you have any issue saving the images, use PIL.
from PIL import Image
im = Image.fromarray(img_aug)
It's because folder is not the path to the directory you are looking for.
You should change for i in os.listdir(folder): to for i in os.listdir(path+'\\'+folder):. Then it looks inside the path\folder directory for files.
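Both answers come down to the same thing: build the full path before listing a folder or reading a file. A small sketch of that using os.path.join instead of string concatenation follows. The function name list_image_paths is made up, and it assumes one subfolder per class as described in the question.

```python
import os


def list_image_paths(root):
    """Return {class_name: [full paths of the files in that class folder]}."""
    paths = {}
    for folder in sorted(os.listdir(root)):
        folder_path = os.path.join(root, folder)
        if not os.path.isdir(folder_path):
            continue  # skip stray files next to the class folders
        paths[folder] = [
            os.path.join(folder_path, name)
            for name in sorted(os.listdir(folder_path))
        ]
    return paths
```

Each returned path can be passed straight to imageio.imread, and the length of each list tells you how many images a class has before and after augmentation.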
I should mention that this is the very first time I'm trying audio signal processing in Python. I have an audio data set and I am extracting pitch features using the Aubio library and MFCC features using the python_speech_features library. The thing is, for a single audio file, I am getting a vector of around 84 values for the pitch and a feature vector of 12 values for MFCC.
[Image: extracted pitch feature vector]
So how do I save all these values in a single CSV file? I have around 700 audio files separated into different directories according to emotion. Should I take the mean of all of these values and save one row per audio file in a CSV?
Also, how would I use these values for classification then?
Any help would be much appreciated, Thanks.
There is no simple answer to your question.
As I understand it, you extract the same set of features for each data sample, right?
I suppose you work within a for loop, something like this:
import numpy as np
all_features = []
for path in path_list:
    x = open_file(path) # a hypothetical function to open your files
    features = extract_features(x) # a hypothetical function to extract features
    all_features.append(features)
If your code looks like my simple example, you have created a list all_features whose elements all_features[i] contain the features extracted from sample i. In addition, I suppose that your extracted features are a numpy vector. If not, you should convert them into a numpy vector (something like features = np.array(features)).
Ok, now you are ready to create a dataset:
data = np.vstack(all_features)
The vertical stack np.vstack generates a matrix of shape (n_samples, n_features). Warning: all feature vectors must have the same shape!
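The same-shape warning matters for audio, because frame-based features usually give a different number of frames per file. One common workaround, which the question also hints at, is to summarize each file with per-coefficient statistics. This is only a sketch: the function name summarize is made up, and random arrays with 12 coefficients per frame stand in for real MFCC output.

```python
import numpy as np


def summarize(frames):
    """Collapse a (n_frames, n_coeffs) array into one fixed-length vector.

    Concatenates the per-coefficient mean and standard deviation, so the
    output length is 2 * n_coeffs no matter how many frames there are.
    """
    frames = np.asarray(frames, dtype=float)
    return np.concatenate([frames.mean(axis=0), frames.std(axis=0)])


rng = np.random.default_rng(0)
# Stand-ins for three audio files with different numbers of frames
per_file = [rng.normal(size=(n, 12)) for n in (84, 60, 101)]
data = np.vstack([summarize(f) for f in per_file])
print(data.shape)  # (3, 24)
```

Averaging throws away how the features change over time, so whether it is good enough depends on the classification task.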
Now you want to save the dataset. There is an ocean of possibilities; these are my three favorite options:
1) using pandas to create a csv file:
import pandas as pd
df = pd.DataFrame(data)
df.to_csv(filename+'.csv', index=False, header=header) # header is a list of strings naming the csv columns
# see https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.to_csv.html
2) dump memory into a pickle file:
import pickle
with open(filename+'.pkl', 'wb') as f:
    pickle.dump(data, f)
3) save as a numpy file:
np.save(filename+'.npy', data)
Concerning the classification problem: if you want to use a supervised method (MLP, RF, SVM, KNN, ...) you need class labels (the ground truth), i.e. a vector whose length equals the number of samples and that maps each sample to an integer (for example 0, 1 in a binary classification, or 0, 1, 2, 3 for a 4-class classification). What the labels should be depends strongly on the goal of your training.
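As a small illustration of such a label vector, the sketch below maps class names to integers. The emotion names are invented for the example, and it assumes you can get one class name per audio file, for instance from the directory each file came from.

```python
def encode_labels(class_names):
    """Map each distinct class name to an integer, in sorted order."""
    mapping = {name: i for i, name in enumerate(sorted(set(class_names)))}
    return [mapping[name] for name in class_names], mapping


# One entry per audio file, e.g. the name of the folder it came from
names = ["happy", "sad", "happy", "angry"]
labels, mapping = encode_labels(names)
print(labels)   # [1, 2, 1, 0]
print(mapping)  # {'angry': 0, 'happy': 1, 'sad': 2}
```

The labels list has one integer per sample, in the same order as the rows of the data matrix.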
Once you have the data matrix and the label vector, any machine-learning method will be able to classify, provided you have enough samples. To that end, I suggest using some augmentation criteria; to get an idea, have a look at this paper, which could give you some ideas.
I hope this helps, good luck!
Python has a built-in csv module.
The example in this section shows how to use a writer to write rows to your CSV file.
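For reference, a minimal sketch of the writer approach could look like this. The function name write_features and the idea of one row per audio file are assumptions for illustration, not something from the linked documentation.

```python
import csv


def write_features(csv_path, header, rows):
    """Write a header line followed by one row per sample."""
    # newline="" is what the csv docs recommend for files passed to csv.writer
    with open(csv_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(header)
        writer.writerows(rows)
```

Each row would be a list such as [file_name, feature_1, feature_2, ...], so every audio file ends up on its own line.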
I am working on training a model on my own images, read from my folders. I would be thankful for any help with this.
I successfully read all my images from the folders and create my own one-hot encoded labels. However, each time I run my code, reading all the images from the folders takes a lot of time. Therefore, I want to create a dataset from these images and save it, like MNIST, so it loads faster and I don't have to read all the images again. Could you please help me with this?
The code is:
path = "D:/cleandata/train_data/"
loadedImages = []
labels = []
sess = tf.InteractiveSession()
for i in range(len(os.listdir(path))):
    imagesList = os.listdir(path+os.listdir(path)[i])
    for image in imagesList:
        image_raw_data_jpg = tf.gfile.FastGFile(path+os.listdir(path)[i]+'/'+image, 'rb').read()
        raw_image = tf.image.decode_png(image_raw_data_jpg, 3)
        gray_resize = tf.image.resize_images(raw_image, [28, 28])
        image_data =
Here is a tutorial on how to use a TFRecords file. It shows how to create the file (containing images and labels) and read from it.
Or you could just use zipfile and include the label in the image file name, thus keeping them together (that is what I did).
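A rough sketch of the zipfile idea, using only the standard library, is shown below. The naming scheme '<label>_<index>.png' and the function names pack and unpack are assumptions made up for this example; any scheme works as long as the label can be parsed back out of the name.

```python
import zipfile


def pack(zip_path, samples):
    """Store (label, image_bytes) pairs under names like '<label>_<index>.png'."""
    with zipfile.ZipFile(zip_path, "w") as zf:
        for index, (label, data) in enumerate(samples):
            zf.writestr(f"{label}_{index}.png", data)


def unpack(zip_path):
    """Yield (label, image_bytes) pairs back from the archive."""
    with zipfile.ZipFile(zip_path) as zf:
        for name in zf.namelist():
            # Everything before the last underscore is the label
            label = name.rsplit("_", 1)[0]
            yield label, zf.read(name)
```

This keeps images and labels together in a single file, but the stored bytes are still encoded images that have to be decoded after reading.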
I downloaded the dataset from Kaggle:
Then I tried to get the image label from the downloaded data using the img.split('.')[-3] command (code at the end).
However, I got an "index out of range" error. I checked and saw that the file names after unzipping the Kaggle dataset are only 1.jpg, 2.jpg, 3.jpg.
From what I read, the dataset should have the label in the filename, i.e.
So my questions are:
Q1: I assume my Python syntax is right. It looks like I would only have two elements, [0] and [1], with a filename of "num.jpg" rather than "label.num.jpg", right?
Q2: If so, can anyone help point out why I cannot get the dataset with the label in the filename?
PS: I am really new to Python, Kaggle, and programming in general.
Thank you
PS: my partial code:
for img in tqdm(os.listdir(TRAIN_DIR)):
    path = os.path.join(TRAIN_DIR, img)
    img_data = cv2.imread(path)
    cv2.imshow('train_data_image:', img_data)
    print('test:', img.split('.')[-3])
Just FYI, I found the answer to my question.
It turns out I was using the test data, which indeed should not contain the label. I downloaded the train data and it does have the label (dog/cat) in the filename.
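To make the difference between the two kinds of file names visible, a small guard like the following could be used. The function name label_from_filename and the sample names in the comments are made up, and it assumes train files follow the label.num.jpg pattern mentioned in the question.

```python
def label_from_filename(filename):
    """Return the label of 'label.num.jpg', or None for 'num.jpg'."""
    parts = filename.split(".")
    # 'cat.0.jpg' -> ['cat', '0', 'jpg'];  '1.jpg' -> ['1', 'jpg']
    if len(parts) < 3:
        return None
    return parts[-3]
```

With only two parts, index -3 does not exist, which is exactly the "index out of range" error from the question.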