Sort labeled dataset with labels in separate csv file? - python

I have a dataset of ~3500 images, with the labels of each image in a csv file. The csv file has two columns: the first one contains the exact name of the image file (i.e. 00001.jpg) and the second column contains the label of the image. There are a total of 7 different labels.
How can I sort the images from one huge folder to 7 different folders (each image in its respective category) in an efficient manner? Does anyone have a script that can do this?
Also, is there any way I can do this with Google Drive? I've already uploaded the dataset to Drive in order to use with Colab soon, so I don't want to have to do it again (takes ~2.5 hours).

I'm not sure about performance, probably there are better ways...
But this would be my take on the problem:
(not tested, so might need small adjustments)
I'm assuming the images are in a subfolder /images/, while the csv and the script are in the root. Furthermore, I'm assuming the csv is named images.csv and its columns are titled file and label.
import pandas as pd
import os
df = pd.read_csv('images.csv')  # columns: file, label
for _, row in df.iterrows():
    f = row['file']
    l = row['label']
    os.makedirs(f'images/{l}', exist_ok=True)  # create the label folder if it doesn't exist yet
    os.replace(f'images/{f}', f'images/{l}/{f}')
I don't know what Google Drive would make of it, but as long as you can run it on a Google-Drive-synced folder, I don't see why this should be an issue.
Note: if you test it, you may want to do so on a copy of the files, in case I screwed up...
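Since you mention Colab: you could also run the same loop directly against your mounted Drive folder instead of re-uploading anything. A minimal, untested sketch, assuming the images and images.csv were uploaded to a Drive folder called dataset (adjust the path to match yours):
import os
import pandas as pd
from google.colab import drive

drive.mount('/content/drive')  # authorize access to your Drive in the prompt

# Assumed layout on Drive: My Drive/dataset/images/ plus My Drive/dataset/images.csv
root = '/content/drive/My Drive/dataset'

df = pd.read_csv(os.path.join(root, 'images.csv'))
for _, row in df.iterrows():
    f, l = row['file'], row['label']
    os.makedirs(os.path.join(root, 'images', str(l)), exist_ok=True)
    os.replace(os.path.join(root, 'images', f), os.path.join(root, 'images', str(l), f))
Moving thousands of small files through the Drive mount can be slow, so expect this to take a while.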

Related

How do I reduce size of pdf merge

I'm trying to merge groups of PDFs (up to 1,000 per unique group). Meaning, of the 100,00 PDFs created, I need them grouped at a practice/market level, with the output being one merged PDF file per group containing a varying number of PDFs.
My PDF file creation and loop work fine, but when it comes to merging, I'm running into file size issues.
I tried doing this with PyPDF, but the file sizes are way too large:
def merge_pdfs(paths, output):
    ...
Is there an alternative to PyPDF that also allows me to create read-only PDFs of a smaller size?
I've used PDFtk, ghostscript, and pymupdf to no avail.
It sounds like your files perhaps come from the same source or are generated in the same way, and therefore will have common internal parts, for example the same font data in each.
Try:
cpdf -squeeze in.pdf -o out.pdf
On the output. You could do the initial merge with cpdf too, but it’s not required.
If it must be done directly in python, pycpdflib can do it with squeezeInMemory.
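If it must stay in Python, a rough sketch with pycpdflib could look like the following. The function names (fromFile, mergeSimple, squeezeInMemory, toFile) follow the library's documentation, but I haven't tested this, so check the exact signatures and the required native-library setup against the version you install:
import pycpdflib as cpdf

# Note: pycpdflib needs the native cpdf shared library loaded first; see its docs.
def merge_and_squeeze(paths, output):
    # Load each input PDF ('' = no user password).
    pdfs = [cpdf.fromFile(p, '') for p in paths]
    merged = cpdf.mergeSimple(pdfs)
    # Deduplicate shared objects (fonts, images, etc.) to shrink the output.
    cpdf.squeezeInMemory(merged)
    cpdf.toFile(merged, output, False, False)  # linearize=False, make_id=False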

Is Panda appropriate for joining 120 large txt files?

I have 120 txt files, all around 150 MB in size, with thousands of columns each. Overall there are definitely more than 1 million columns.
When I try to concatenate using pandas I get this error: "Unable to allocate 36.4 MiB for an array with shape (57, 83626) and data type object". I've tried Jupyter Notebook and Spyder; neither works.
How can I join the data? Or is this data not suitable for pandas?
Thanks!
You are running out of memory. Even if you manage to load all of them (with pandas or another package), your system will still run out of memory for every task you want to perform with this data.
Assuming that you want to perform different operations on different columns of all the tables, the best way to do so is to perform each task separately, preferably batching your columns, since there are thousands per file, as you say (see the batching sketch further below).
Let's say you want to sum the values in the first column of each file (assuming they are numbers...) and store these results in a list:
import glob
import pandas as pd
import numpy as np
filelist = glob.glob('*.txt') # Make sure you're working in the directory containing the files
sum_first_columns = []
for file in filelist:
    df = pd.read_csv(file, sep=' ')  # adjust the separator for your case
    sum_temp = np.sum(df.iloc[:, 0])
    sum_first_columns.append(sum_temp)
You now have a list of 120 sums, one per file.
For each operation, this is what I would do if it were mandatory for me to work with my own computer/system.
Please note that this process will be very time consuming as well, given the size of your files. You can either try to reduce your data or to use a cloud server to compute everything.
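To make the column batching concrete, here is a rough sketch that sums every column, processing one block of columns at a time with usecols so only a slice of each file is ever in memory. It assumes all files share the same header and space-separated numeric columns; the block size is a placeholder to tune:
import glob
import pandas as pd

filelist = glob.glob('*.txt')

# Read just the header of the first file to learn the column names.
all_columns = pd.read_csv(filelist[0], sep=' ', nrows=0).columns

batch_size = 500  # number of columns to load at a time; tune to your memory
column_sums = {}

for start in range(0, len(all_columns), batch_size):
    cols = list(all_columns[start:start + batch_size])
    for file in filelist:
        # Only these columns are parsed; the rest of each line is skipped.
        df = pd.read_csv(file, sep=' ', usecols=cols)
        for c in cols:
            column_sums[c] = column_sums.get(c, 0) + df[c].sum()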
Saying you want to concat in pandas implies that you just want to merge all 120 files together into one file? If so, you can iterate through all the files in a directory, read them in as lists of tuples (or something like that), and combine them all into one list. Lists and tuples take orders of magnitude less memory than dataframes, but you won't be able to perform calculations on them unless you put them into a numpy array or dataframe.
At a certain point, when there is too much data, it is appropriate to shift from pandas to Spark, since Spark can use the power and memory of a cluster instead of being restricted to your local machine's or server's resources.
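If you do go the Spark route, a minimal PySpark sketch might look like this (assuming space-separated files with a header row; paths and options are placeholders to adapt):
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("join_txt_files").getOrCreate()

# Read all the txt files at once; Spark only plans the work here and
# executes it lazily, distributing it across whatever cluster you have.
df = spark.read.csv("*.txt", sep=" ", header=True)

print(df.count())                      # e.g. total number of rows across all files
df.write.parquet("combined_parquet")   # store the combined data in a compact columnar format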

Training Keras model with multiple CSV files in multi folder in python

I have eight folders with 1300 CSV files (3*50) in each folder. Each folder represents a label, but I have no idea how to feed my data into a training model.
I'm still a beginner with CNNs.
A part of my csv file can be accessed using this link.
When using Keras, you can use the tf.data.Dataset package, which helps you do what you want to achieve.
Example
Here is an example code, I took from one of my projects:
# matching a glob pattern!
dataset_pro_raw = tf.data.Dataset.list_files([f"./aclImdb/{name}/pos/*.txt"], shuffle=True)
dataset_pro_i = dataset_pro_raw.interleave(
    lambda file: tf.data.TextLineDataset(file),
    # how many files should be processed concurrently
    cycle_length=20,
    # number of threads to increase the performance
    num_parallel_calls=10,
)
First, we create a file list via tf.data.Dataset.list_files(); note that the order of the files is already shuffled there. Then via dataset_pro_raw.interleave() we iterate through the file set and read the content of the files with tf.data.TextLineDataset().
That way you can load data from multiple .txt files, or any other data source, very well. It is a bit clumsy to use at the beginning, but it has real advantages. Currently I only use tf.data.Dataset for train-data generation.
For more information on tf.data.Dataset you might want to check out this link
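Applied to your layout (one folder per label, each containing CSV files), a rough, untested sketch could look like this. The ./data path, the header row, and the final CSV parsing step are assumptions you will need to adapt; on older TensorFlow versions sample_from_datasets lives under tf.data.experimental instead:
import os
import tensorflow as tf

data_root = "./data"  # assumed layout: ./data/<label_name>/*.csv, one folder per class
label_names = sorted(os.listdir(data_root))

per_label = []
for label_index, name in enumerate(label_names):
    files = tf.data.Dataset.list_files(f"{data_root}/{name}/*.csv", shuffle=True)
    lines = files.interleave(
        lambda f: tf.data.TextLineDataset(f).skip(1),  # skip(1) assumes a header row
        cycle_length=4,
        num_parallel_calls=tf.data.AUTOTUNE,
    )
    # Pair every line with the index of the folder it came from.
    per_label.append(lines.map(lambda line, li=label_index: (line, li)))

# Mix the eight per-label datasets so each batch contains several classes.
dataset = tf.data.Dataset.sample_from_datasets(per_label).batch(32)
# Each element is still a raw text line; parse it into numbers with
# tf.io.decode_csv in a further .map() before passing the dataset to model.fit().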

How to get common rows/columns between multiple files in all possible combinations - python/pandas

I am a beginner to programming, so forgive me if I'm not using the right terms. I have data collected in multiple csv files. Now I want to collect the data that contains common rows between two files (say File_1 and File_2). Likewise, I want the data for all combinations (two files at a time): File_1 & File_3, File_1 & File_N, File_2 & File_3, File_2 & File_N, and so on. I can use pd.merge if I want to do this between two files, but to obtain the data in different combinations I need a function. Can anyone help with a function in python/pandas to perform this task?
Example of what my data looks like:
[Image: example data, where I want to calculate the common alphabets between multiple files, in all possible combinations]
Thank you so much in advance
1- Create a list of file name pairs using itertools. (Not the only way you can do this, but let's try to make it more pythonic)
Here is a quick link to an example:
Combination of values in pandas data frame
2- It is hard to say how you should ingest each file without knowing what kind of data you have. But assuming that you have csv files, you can use pandas to read each file pair, join them on the field of your choice, and export the results.
I hope this gives you a rough idea of how to do this; a combined sketch follows below. I suggest posting some example data to get more specific solutions.
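A rough sketch of both steps combined, assuming the files are CSVs in the working directory that share a key column called "alphabet" (change the glob pattern and the key to match your data):
import glob
import itertools
import os
import pandas as pd

files = sorted(glob.glob("File_*.csv"))

for left, right in itertools.combinations(files, 2):  # every pair of files, each once
    common = pd.merge(pd.read_csv(left), pd.read_csv(right), on="alphabet", how="inner")
    left_name = os.path.splitext(os.path.basename(left))[0]
    right_name = os.path.splitext(os.path.basename(right))[0]
    common.to_csv(f"common_{left_name}_{right_name}.csv", index=False)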

how to have indexed files in storage in python

I have a huge dataset of images and I am processing them one by one.
All the images are stored in a folder.
My Approach:
What I have tried is reading all the filenames into memory, and whenever a call for a certain index comes in, I load the corresponding image.
The problem is that it is not even possible to keep the paths and names of the files in memory, due to the huge dataset.
Is it possible to have an indexed file on storage, such that one can read the file name at a certain index?
Thanks a lot.
