kaggle dataset or python split CLI

I downloaded the dataset from Kaggle:
https://www.kaggle.com/c/dogs-vs-cats/data
Then I tried to get the image label from each downloaded file name using img.split('.')[-3] (code at the end).
However, I got an "index out of range" error. I checked the file names and saw that, after unzipping the Kaggle data, they are just 1.jpg, 2.jpg, 3.jpg, and so on.
From what I have read, the dataset should have the label in the file name, e.g.
https://www.packtpub.com/mapt/book/big_data_and_business_intelligence/9781788475655/23/ch23lvl1sec118/deep-learning-for-cats-versus-dogs
So my questions are:
Q1: I assume my Python syntax is right. It looks like I only get two elements, [0] and [1], from a file name of the form "num.jpg" rather than "label.num.jpg", right?
Q2: If so, can anyone help me work out why I cannot get the dataset with the label in the file name?
PS: I am really new to Python, Kaggle, and programming in general.
Thank you
Mira
PS: my partial code:
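import os
import cv2
from tqdm import tqdm
for img in tqdm(os.listdir(TRAIN_DIR)):
    path = os.path.join(TRAIN_DIR, img)
    img_data = cv2.imread(path)
    cv2.imshow('train_data_image:', img_data)
    # Raises IndexError for names like "1.jpg", which split into only two parts
    print('test:', img.split('.')[-3])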

Just FYI, I found the answer to my question.
It turns out I was using the test data, which indeed should not contain the label. I downloaded the train data, and it does have the label (dog/cat) in the file name.
thanks!
Mira
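
For anyone hitting the same IndexError, here is a minimal sketch of a label lookup that tolerates both naming schemes. The helper name label_from_filename is hypothetical, and it assumes training files are named like "dog.123.jpg" and test files like "1.jpg", as described above.

```python
def label_from_filename(filename):
    """Return the label from a name like 'dog.123.jpg', or None if there is none."""
    parts = filename.split('.')
    # Training files split into ['dog', '123', 'jpg']; test files into ['1', 'jpg'].
    if len(parts) < 3:
        return None
    return parts[-3]


for name in ['cat.0.jpg', 'dog.123.jpg', '1.jpg']:
    print(name, '->', label_from_filename(name))
```

Returning None instead of raising makes it easy to spot when a loop has been pointed at the unlabeled test folder by mistake.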

Related

How do I remove unknown, extra, data values from large file?

I am working on a Python/TensorFlow image classification model. I have 12,611 training images but 12,613 training labels. (Each image has a number as its file name, and this number corresponds to the same number in a CSV file that holds the accompanying information for that image.)
What I need to do is remove the 2 extra data points for which I don't have pictures. How can I write code to help with this?
(If the code tells me which data points are the extras, I can remove them from the CSV file manually.)
Thanks for the help.

Well, it's very straightforward. You can try something like this (as I don't know exactly how and where you have saved your images, you might have to adapt the code to your use case):
import os
import pandas as pd

dir_path = r'/path/to/folder/of/images'
csv_path = r'/path/to/csv/file'
# Collect the numeric IDs of all images (file names such as "123.jpg")
images = set()
for filename in os.listdir(dir_path):
    images.add(int(filename.split('.')[0]))
# Read the CSV
df = pd.read_csv(csv_path)
# Print the labels that have no matching image
for i in df['<COLUMN_NAME>'].tolist():
    if i not in images:
        print(i)
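
The same check can be written as a set difference. This is only a sketch: the helper name extra_label_ids and the sample data are hypothetical, and it assumes every image file name is an integer ID plus an extension.

```python
import os


def extra_label_ids(image_filenames, label_ids):
    """Return the label IDs that have no matching image file, sorted ascending."""
    # os.path.splitext drops the extension: '123.jpg' -> '123'
    image_ids = {int(os.path.splitext(name)[0]) for name in image_filenames}
    return sorted(set(label_ids) - image_ids)


# Hypothetical data: four labels but only three images
filenames = ['1.jpg', '2.jpg', '4.jpg']
ids_in_csv = [1, 2, 3, 4]
print(extra_label_ids(filenames, ids_in_csv))  # prints [3]
```

In practice the first argument would come from os.listdir(dir_path) and the second from the ID column of the CSV.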

Image Data is being stored in different/random order in array after reading from Google Drive to Google Colab

I am trying to use Google Colab for image segmentation with U-Net. I can read the image datasets from Drive into Colab and store them in an array. FYI: I have a folder in my Google Drive with all the training data, containing two sub-folders (Image and Mask).
After reading and resizing the images and masks, I checked them with plt.show and noticed a discrepancy in the order of the images. For example, when I pick the 10th image, it does not match the 10th image in Google Drive. To make it worse, I get a completely different image for my mask, so the image and mask no longer correspond (the main issue!).
Has anyone faced a similar situation? Any idea how I can get around this problem?

I was having this issue. Images imported from a Google Drive directory into Google Colab seemed to come in random order.
So I first ran this code to confirm my theory.
inside = os.listdir('/content/gdrive/MyDrive/files/')
for i in range(20):
    print(inside[i])
Which gave the output:
15588_KateOMara_32_f.jpg
15658_KatharineRoss_68_f.jpg
15741_MaryTamm_40_f.jpg
15661_KatharineRoss_72_f.jpg
15621_KateOMara_70_f.jpg
15646_KatharineRoss_46_f.jpg
15851_StВphaneAudran_22_f.jpg
15810_SarahDouglas_61_f.jpg
15486_JeanetteMacDonald_46_f.jpg
15831_StefaniePowers_56_f.jpg
15670_KathrynGrayson_26_f.jpg
15539_JulieBishop_36_f.jpg
15696_KathrynGrayson_75_f.jpg
15738_MaryTamm_33_f.jpg
15853_StВphaneAudran_24_f.jpg
15665_KathrynGrayson_21_f.jpg
15815_StefaniePowers_24_f.jpg
15748_MaryTamm_51_f.jpg
15759_PamelaSueMartin_26_f.jpg
15799_SarahDouglas_43_f.jpg
I was using os.listdir(self.directory), which returns the list of all files and directories in the specified path, in arbitrary order. So I just used the sorted() function to sort the list, and this solved the issue.
sorted_dir = sorted(os.listdir('/content/gdrive/MyDrive/files/'))
for i in range(20):
    print(sorted_dir[i])
Output:
0_MariaCallas_35_f.jpg
10000_GlennClose_62_f.jpg
10001_GoldieHawn_23_f.jpg
10002_GoldieHawn_24_f.jpg
10003_GoldieHawn_24_f.jpg
10004_GoldieHawn_27_f.jpg
10005_GoldieHawn_28_f.jpg
10006_GoldieHawn_29_f.jpg
10007_GoldieHawn_30_f.jpg
10008_GoldieHawn_31_f.jpg
10009_GoldieHawn_35_f.jpg
1000_StephenHawking_1_m.jpg
10010_GoldieHawn_35_f.jpg
10011_GoldieHawn_37_f.jpg
10012_GoldieHawn_39_f.jpg
10013_GoldieHawn_44_f.jpg
10014_GoldieHawn_45_f.jpg
10015_GoldieHawn_45_f.jpg
10016_GoldieHawn_50_f.jpg
10017_GoldieHawn_51_f.jpg
Before:
for i, file in enumerate(os.listdir(self.directory)):
    file_labels = parse('{}_{person}_{age}_{gender}.jpg', file)
After:
for i, file in enumerate(sorted(os.listdir(self.directory))):
    file_labels = parse('{}_{person}_{age}_{gender}.jpg', file)
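
Note that plain sorted() orders names as strings, which is why 10000_... comes before 1000_... in the output above. If numeric order matters, a key function can be used. This is a sketch: the helper name leading_number is hypothetical, and it assumes every file name starts with an integer followed by an underscore.

```python
def leading_number(filename):
    """Sort key: the integer before the first underscore, e.g. '1000_x.jpg' -> 1000."""
    return int(filename.split('_')[0])


names = ['10000_GlennClose_62_f.jpg', '1000_StephenHawking_1_m.jpg', '0_MariaCallas_35_f.jpg']
print(sorted(names))                      # string order
print(sorted(names, key=leading_number))  # numeric order
```

For the image/mask case in the question, sorting both directory listings the same way should keep index i in the image list aligned with index i in the mask list, provided the two folders use matching file names.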

Pytesseract - output is extremely inaccurate (MAC)

I installed pytesseract via pip, and its results are terrible.
From what I found, I think I need to give it more data, but I can't find where to put tessdata (traineddata), since there is no directory like ProgramFile\Tesseract-OCR on a Mac.
There is no problem with the images' resolution, font, or size.
Image whose result is 'ecient Sh Abu'
Because large and clear test images work fine, I think the problem is a lack of data.
Any other solution is welcome, as long as it can read text with Python.
Please help me.

"I installed pytesseract via pip and its result is terrible."
Sometimes you need to preprocess the input image to get accurate results.
"Because large and clear test images work fine, I think it is a problem about lack of data. But any other possible solution is welcomed as long as it can read text with Python."
You could say lack of data is a problem. I think you'll find morphological transformations useful.
For instance, if we apply an opening operation (cv2.MORPH_OPEN in the code below), the result will be:
The image looks similar to the originally posted image. However, there are slight changes in the output image (e.g. the word "Grammar" looks slightly different from the original).
Now if we read the output image:
English
Grammar Practice
ter K-SAT (1-10)
Code:
import cv2
from pytesseract import image_to_string

img = cv2.imread("6Celp.jpg")
gry = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
# Morphological opening (erosion followed by dilation)
opn = cv2.morphologyEx(gry, cv2.MORPH_OPEN, None)
txt = image_to_string(opn)
# Print only the lines longer than three characters
for line in txt.split("\n"):
    line = line.strip()
    if len(line) > 3:
        print(line)

Increasing dataset size using imgaug

I am merging two different datasets containing images into one dataset. One of the datasets contains 600 images in the training set. The other dataset contains only 90-100 images. I want to increase the size of the latter dataset by using the imgaug library. The images are stored in folders under the name of their class. So the path for a "cake" image in the training set would be ..//images//Cake//cake_0001. I'm trying to use this code to augment the images in this dataset:
import os
import imageio
import imgaug as ia
from imgaug import augmenters as iaa

path = 'C:\\Users\\User\\Documents\\Dataset\\freiburg_groceries_dataset\\images'
ia.seed(6)
seq = iaa.Sequential([
    iaa.Fliplr(0.5),
    iaa.Crop(percent=(0, 0.1)),
    iaa.Affine(rotate=(-25,25))
], random_order=True)
for folder in os.listdir(path):
    try:
        for i in os.listdir(folder):
            img = imageio.imread(i)
            img_aug = seq(images=img)
            iaa.imshow(img_aug)
            print(img_aug)
    except:
        pass
Right now there's no output, even if I add print(img), imshow(img), or anything else. How do I make sure I actually get more images for this dataset? Also, where is the best place to augment the images? Where do the augmented images get stored, and how do I see how many new images were generated?

The question was not entirely clear, so this addresses the second issue: the error when saving files and not being able to visualize with imshow().
First: in the inner loop's code block
    img = imageio.imread(i)
    img_aug = seq(images=img)
    iaa.imshow(img_aug)
    print(img_aug)
The first error is that i is not the file path. To solve this, replace imageio.imread(i) with imageio.imread(path+'/'+folder+'/'+i).
The second error is that iaa has no imshow() attribute.
To fix this, replace iaa.imshow(img_aug) with iaa.imgaug.imshow(img_aug) (or ia.imshow(img_aug), assuming imgaug is imported as ia, as the ia.seed(6) call suggests). This fixes the visualization error and lets the loop finish.
Second: if you have any issue saving images, use PIL, i.e.
from PIL import Image
im = Image.fromarray(img_aug)
im.save('img_aug.png')

It's because folder is not the path to the directory you are looking for.
You should change "for i in os.listdir(folder):" to "for i in os.listdir(path+'\\'+folder):". Then it looks inside the path\folder directory for files.
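
Both answers come down to building the full path before reading a file. A minimal sketch of that step using os.path.join, which picks the right separator for the platform, is below. The helper name list_class_images is hypothetical, and it assumes the root/<class>/<file> layout described in the question.

```python
import os


def list_class_images(root):
    """Return sorted (class_name, image_path) pairs for images stored as root/<class>/<file>."""
    pairs = []
    for folder in sorted(os.listdir(root)):
        folder_path = os.path.join(root, folder)
        if not os.path.isdir(folder_path):
            continue  # skip stray files next to the class folders
        for name in sorted(os.listdir(folder_path)):
            # Join the pieces so the path is valid from any working directory
            pairs.append((folder, os.path.join(folder_path, name)))
    return pairs
```

Each returned path can then be read and augmented. The augmented arrays exist only in memory until they are saved under new file names, so nothing is stored and the dataset does not grow unless the loop writes them out. Removing the bare "except: pass" while debugging also helps, since it hides the very errors described above.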

How to increase likeliness of image recognition with pytesseract

I'm trying to convert a list of images to text. The images are fairly small but VERY readable (15x160, with only grey text on a white background), yet I can't get pytesseract to read them properly. I tried increasing the size with .resize(), but it didn't seem to do much at all. Here's some of my code. Is there anything I can add to improve my chances? Like I said, I'm VERY surprised that pytesseract is failing me here; the text is small but super readable compared to some of the things I've seen it catch.
import urllib.request
import pytesseract
from PIL import Image

for dImg in range(0, len(imgList)):
    url = imgList[dImg]
    local = "img" + str(dImg) + ".jpg"
    urllib.request.urlretrieve(url, local)
    imgOpen = Image.open(local)
    imgOpen.resize((500,500))
    imgToString = pytesseract.image_to_string(imgOpen)
    newEmail.append(imgToString)

Setting the page segmentation mode (psm) can probably help.
To list all the available modes, run tesseract --help-psm in your terminal.
Then pick the mode that matches your need. Let's say you want to treat the image as a single text line; in that case your imgToString becomes:
imgToString = pytesseract.image_to_string(imgOpen, config = '--psm 7')
Hope this helps.

You can add several preprocessing steps to your code.
1) Convert the image to grayscale: use from PIL import Image and then your_img.convert('L'). There are several other modes you can check.
2) A slightly more advanced method: use a CNN. There are several pre-trained CNNs you can use. Here you can find somewhat more detailed information: https://www.cs.princeton.edu/courses/archive/fall00/cs426/lectures/sampling/sampling.pdf
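
One more thing worth checking in the question's code: Image.resize returns a new image rather than modifying imgOpen in place, so the call imgOpen.resize((500,500)) has no effect unless its result is assigned. Forcing a wide strip of text into a 500x500 square would also distort it. Below is a sketch of an aspect-preserving target size. The helper name upscaled_size and the target height of 64 pixels are assumptions, not values from the question or from Tesseract.

```python
def upscaled_size(width, height, target_height=64):
    """Return a (width, height) tuple scaled so the height is at least target_height."""
    if height >= target_height:
        return (width, height)
    scale = target_height / height
    return (round(width * scale), target_height)


# The question describes images of roughly 160x15 pixels (width x height is an assumption)
print(upscaled_size(160, 15))
```

With PIL this would be used as imgOpen = imgOpen.resize(upscaled_size(*imgOpen.size)), since Image.size is a (width, height) tuple.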
