I'm extracting up to 2500 frame files per experiment (not always the same number), and my current process is to manually divide the total number of frames by three and separate them into three subset folders, since the combined file size is too large to convert them all into a single .mat file. I simply want to automate this.
Once the files are separated into three subsets (Subset1, Subset2, Subset3), I run each folder through my code to convert and rename.
from scipy.io import savemat
import numpy as np
import os
arrays = []
directory = r"F:\...\Experiment 24\Imaging\Subset3"  # **something here that will look at the whole directory and create a different file for each subset folder**
for filename in sorted(os.listdir(directory)):
    f = os.path.join(directory, filename)
    arrays.append(np.load(f))
data = np.array(arrays)
data = data.astype('uint16')
data = np.moveaxis(data, [0, 1, 2], [2, 1, 0])
savemat('24_subset3.mat', {'data': data})
How can I automatically sort my frame files into three separate subset folders and convert?
Create subsets from the filenames and copy them to new subset directories:
import os
import shutil

num_subsets = 3
in_dir = "/some/path/to/input"
out_dir = "/some/path/to/output/subsets"

filenames = sorted(os.listdir(in_dir))
chunk_size = len(filenames) // num_subsets

for i in range(num_subsets):
    subset = filenames[i * chunk_size : (i + 1) * chunk_size]
    # Create the subset output directory.
    subset_dir = f"{out_dir}/subset_{i}"
    os.makedirs(subset_dir, exist_ok=True)
    for filename in subset:
        # Copy from the input directory into the subset directory.
        shutil.copyfile(f"{in_dir}/{filename}", f"{subset_dir}/{filename}")
NOTE: Any extra files that cannot be distributed into equal subsets will be skipped.
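If you would rather keep those leftover files, a small variation of the loop above (a sketch reusing the same variable names) folds the remainder into the last subset:
for i in range(num_subsets):
    start = i * chunk_size
    # The last subset also receives any leftover files.
    stop = None if i == num_subsets - 1 else (i + 1) * chunk_size
    subset = filenames[start:stop]
    # ... create subset_dir and copy the files as above.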
If your goal is simply to create your three .mat files, you don't necessarily need to create subfolders and move your files around at all; you can iterate through subsets of them in-place. You could manually calculate the indexes at which to divide into subsets, but more_itertools.divide is convenient and readable.
Additionally, pathlib is usually a more convenient way of manipulating paths and filenames. No more worrying about os.path.join! The paths yielded by Path.iterdir or Path.glob know where they're located, and don't need to be recombined with their parent.
from pathlib import Path

from more_itertools import divide
import numpy as np
from scipy.io import savemat

directory = Path("F:/.../Experiment 24/Imaging/")
subsets = divide(3, sorted(directory.iterdir()))

for index, subset in enumerate(subsets, start=1):
    arrays = [np.load(file) for file in subset]
    data = np.array(arrays).astype('uint16')
    data = np.moveaxis(data, [0, 1, 2], [2, 1, 0])
    savemat(f'24_subset{index}.mat', {'data': data})
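Note that directory.iterdir() yields every entry in the folder, including any subdirectories that may already exist there. If the frames are .npy files (an assumption based on the np.load calls), globbing only those is safer:
# Assumes the frames are .npy files; adjust the pattern if they are not.
subsets = divide(3, sorted(directory.glob("*.npy")))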
I have multiple .npz files of the same structure in a folder, and I want to append all of them into a single .npz file in that folder.
I have tried the code below, but it does not seem to append the multiple .npz files into a single .npz file.
Here is the code:
import numpy as np
file_list = [r'image-embeddings\img-emb-1.npz', r'image-embeddings\img-emb-2.npz']
data_all = [np.load(fname) for fname in file_list]
merged_data = {}
for data in data_all:
    for k, v in data.items():
        merged_data[k] = v
np.savez('new_file.npz', **merged_data)
Here img-emb-1.npz and img-emb-2.npz each contain different values.
Maybe try the following to construct merged_data (it uses the file helper and the arrays dict defined in the full example below):
arrays_read = dict(
    chain.from_iterable(np.load(file(arr_name)).items() for arr_name in arrays.keys())
)
Full example:
from itertools import chain

import numpy as np

# Helper: path for a given array name (the arrays/ directory must already exist).
file = lambda name: f"arrays/{name}.npz"

# Create data
arrays = {f"arr{i:02d}": np.random.randn(10, 20) for i in range(10)}

# Save data in separate files
for arr_name, arr in arrays.items():
    np.savez(file(arr_name), **{arr_name: arr})

# Read all files into a dict
arrays_read = dict(
    chain.from_iterable(np.load(file(arr_name)).items() for arr_name in arrays.keys())
)

# Save into a single file
np.savez(file("arrays"), **arrays_read)

# Load to compare
arrays_read_single = dict(np.load(file("arrays")).items())
assert arrays_read.keys() == arrays_read_single.keys()
for k in arrays_read.keys():
    assert np.array_equal(arrays_read[k], arrays_read_single[k])
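If file size is a concern, np.savez_compressed is a drop-in replacement for np.savez in the final step:
# Same call signature as np.savez, but the arrays are stored compressed.
np.savez_compressed(file("arrays"), **arrays_read)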
I have this code in MATLAB
txtFiles = dir('*.txt') ; % loads txt files
N = length(txtFiles) ;
for i = 1:N
    data = importdata(txtFiles(i).name);
    x = data(:,1);
    y(:,i) = data(:,2) ;
end
This takes all 100 of my txt files, creates an array for x, and stores the y data in a separate array where each column corresponds to a different txt file's values.
Is there a similar trick in Python?
This is how the data files are constructed:
896.5000000000 0.8694710776
896.7500000000 0.7608314184
897.0000000000 0.6349069122
897.2500000000 0.5092121001
897.5000000000 0.3955858698
There are 50 of them, and each one has about 1000 rows like this.
My solution so far jams it all into one massive list that is impossible to handle. In MATLAB, the second column of each text file is added to an array, and I can easily cycle through them.
This is my solution:
#%%
import matplotlib.pyplot as plt
import os
import numpy as np
import glob
# This could be shortened further, but that would make the code less clear.

# Get the file names and read them.
data = [open(file).read() for file in glob.glob('*.txt')]

# Each block of data has its lines split and is put
# into separate arrays:
# [['data1 - line1', 'data1 - line2'], ['data2 - line1', ...]]
data = [block.splitlines() for block in data]

x, y = [], []

# Take each line within each file's data,
# split it, and append to x and y.
for file in glob.glob('*.txt'):
    # open the file
    with open(file) as _:
        # read and split into lines
        for line in _.read().splitlines():
            # split the columns by whitespace
            # example line: -2000 data1
            # output = ['-2000', 'data1']
            line = line.split()
            # append to lists
            x.append(line[0])
            y.append(line[1])
Your files are essentially whitespace-delimited text files and could be read with np.loadtxt or pd.read_csv.
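For example, a minimal np.loadtxt-based sketch that mirrors the MATLAB behavior, assuming every file shares the same x column as in the code above:
import glob
import numpy as np

files = sorted(glob.glob('*.txt'))
columns = [np.loadtxt(f) for f in files]           # each has shape (rows, 2)
x = columns[0][:, 0]                               # shared x axis
y = np.stack([c[:, 1] for c in columns], axis=1)   # one column per file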
But, as you did, you can also extract the values from the text yourself; the following will work for any number of columns:
def extract_values(text, sep=" ", dtype=float):
    return (
        np.array(x, dtype=dtype)
        for x in zip(*(l.split(sep) for l in text.splitlines()))
    )
Then just concatenate the results in the shape you want:
import pathlib

dir_in = pathlib.Path("files/")

indexes, datas = zip(
    *(
        extract_values(f.read_text())
        for f in sorted(dir_in.glob("*.txt"))
    )
)

index = np.stack(indexes, axis=-1)
data = np.stack(datas, axis=-1)
I want to create an algorithm to extract data from CSV files in different folders/subfolders. Each folder will have 9000 CSVs, and there will be 12 of them: 12 * 9000, over 100,000 files in total.
If the files have consistent structure (column names and column order), then dask can create a large lazy representation of the data:
from dask.dataframe import read_csv
ddf = read_csv('my_path/*/file_*.csv')
# do something
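A lazy aggregation, for example, only reads the data when .compute() is called (the column name below is a placeholder for one of your own):
# 'some_column' is a hypothetical column name; replace it with one from your files.
result = ddf['some_column'].mean().compute()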
This is a working solution for over 100,000 files.
Credits: Abhishek Thakur - https://twitter.com/abhi1thakur/status/1358794466283388934
import pandas as pd
import glob
import time

start = time.time()

path = 'csv_test/data/'
all_files = glob.glob(path + "/*.csv")

l = []
for filename in all_files:
    df = pd.read_csv(filename, index_col=None, header=0)
    l.append(df)

frame = pd.concat(l, axis=0, ignore_index=True)
frame.to_csv('output.csv', index=False)

end = time.time()
print(end - start)
I'm not sure whether it can handle data on the order of 200 GB; feedback on this would be appreciated.
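If memory becomes the bottleneck at that scale, one possible variation (a sketch, not tested at 200 GB) streams each file into the output in chunks instead of concatenating everything in memory:
import glob
import pandas as pd

all_files = sorted(glob.glob('csv_test/data/*.csv'))
first = True
for filename in all_files:
    for chunk in pd.read_csv(filename, chunksize=100_000):
        # Append each chunk to the output file; write the header only once.
        chunk.to_csv('output.csv', mode='w' if first else 'a',
                     header=first, index=False)
        first = False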
You can read CSV files using pandas and store them space-efficiently on disk:
import pandas as pd

file = "your_file.csv"
data = pd.read_csv(file)

# Make sure the dtypes match the data the columns hold (saves memory and disk space).
data = data.astype({"column1": int})

data.to_hdf("new_filename.hdf", key="key")
Depending on the contents of your file, you can make adjustments to read_csv as described here:
https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html
Make sure that, after you've read your data in as a DataFrame, the column types match the data they hold. This can save a lot of memory, and also disk space later when saving these DataFrames. You can use astype to make these adjustments.
After you've done that, store your dataframe to disk with to_hdf.
If your data is compatible across csv-files, you can append the dataframes onto each other into a larger dataframe.
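For example, assuming the CSV files share the same columns and dtypes, the HDF file can be built up incrementally using the table format (a sketch; the paths and key are placeholders):
import glob
import pandas as pd

for csv_file in glob.glob('your_folder/*.csv'):  # hypothetical input folder
    df = pd.read_csv(csv_file)
    # format='table' is required for append=True
    df.to_hdf('combined.hdf', key='data', format='table', append=True)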
Problem: Efficiently reading in multiple .dta files at once without crashing.
Progress: I can currently read in one large .dta file, pickle it, and merge it without exceeding memory capacity. When I try to loop this by setting a dynamic variable n and indexing into a list or dictionary, I do not get 3 separate DataFrame objects; instead I get a list in which one of the values is a DataFrame object.
Current Code:
import pandas as pd
import pickle
import glob
import os
n = 0  # Change this value when reading in a new DF

# Names of files, chunk folders, and paths
path = r"directory\directory\\"
fname = ["survey_2000_ucode.dta", "2010_survey_ucode.dta", "2020_survey.dta"]
chunks = ["chunks_2000", "chunks_2010", "chunks_2020"]
input_path = [path + fname[0], path + fname[1], path + fname[2]]
output_path = [path + chunks[0], path + chunks[1], path + chunks[2]]

# Create chunk folders if they do not exist
for chunk in chunks:
    if not os.path.exists(os.path.join(path, chunk)):
        os.mkdir(os.path.join(path, chunk))

CHUNK_SIZE = 100000  # Size of chunks ~ around 14MB

# Read in the .dta file in chunks and write each chunk to a pickle file
reader = pd.read_stata(input_path[n], chunksize=CHUNK_SIZE, convert_categoricals=False)
for i, chunk in enumerate(reader):
    output_file = output_path[n] + "/chunk_{}.pkl".format(i + 1)
    with open(output_file, "wb") as f:
        pickle.dump(chunk, f, pickle.HIGHEST_PROTOCOL)

# Collect the pickle files
pickle_files = []
for name in glob.glob(output_path[n] + "/chunk_*.pkl"):
    pickle_files.append(name)

# Create a list/dictionary of dataframes and append the data
dfs = ["2000", "2010", "2020"]
dfs[n] = pd.DataFrame([])
for i in range(len(pickle_files)):
    dfs[n] = dfs[n].append(pd.read_pickle(pickle_files[i]), ignore_index=True)
Current output: no df2000, df2010, or df2020 DataFrames are produced. Instead, the DataFrame holding my data is the first object in the dfs list. Basically, in the dfs list:
index 0 is a DataFrame with 2,442,717 rows and 34 columns;
index 1 is the string value "2010"; and
index 2 is the string value "2020".
Desired Output:
Read in multiple large data files efficiently and create multiple separate DataFrames at once.
Advice/suggestions on interacting with (i.e. cleaning, wrangling, manipulating, etc.) the multiple large DataFrames once they are read in, without crashing or long run times for each line of code.
All help and input is greatly appreciated. Thank you for your time and consideration. I apologize for not being able to share pictures of my results and datasets, as I am accessing them over a secured connection with no internet access.
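For reference, a minimal sketch of the separate-DataFrames pattern the question is aiming for, keyed by year in a dictionary instead of overwriting entries of the dfs list (it reuses output_path and the imports from the code above, and simplifies the read-back step):
# Build one DataFrame per survey year in a dictionary.
dataframes = {}
for year, out_dir in zip(["2000", "2010", "2020"], output_path):
    # Lexicographic order; fine if row order within a year does not matter.
    pieces = [pd.read_pickle(p) for p in sorted(glob.glob(out_dir + "/chunk_*.pkl"))]
    dataframes[year] = pd.concat(pieces, ignore_index=True)

# e.g. dataframes["2000"] holds the full 2000 survey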
I have a directory with hundreds of CSV files that represent the pixels of a thermal camera (288x383). I want to get the center value of each file (e.g. row 144, column 191) and, with each of those values collected, add them to a dataframe that also lists the name of each file.
Here is my code, where I created the dataframe with the list of the CSV file names:
import os
import glob
import numpy as np
import pandas as pd
os.chdir("/Programming/Proj1/Code/Image_Data")
!ls
Out:
2021-09-13_13-42-16.csv
2021-09-13_13-42-22.csv
2021-09-13_13-42-29.csv
2021-09-13_13-42-35.csv
2021-09-13_13-42-47.csv
2021-09-13_13-42-53.csv
...
file_extension = '.csv'
all_filenames = [i for i in glob.glob(f"*{file_extension}")]
files = glob.glob('*.csv')

all_df = pd.DataFrame(all_filenames, columns=['Full_name'])
all_df.head()
Full_name
0 2021-09-13_13-42-16.csv
1 2021-09-13_13-42-22.csv
2 2021-09-13_13-42-29.csv
3 2021-09-13_13-42-35.csv
4 2021-09-13_13-42-47.csv
5 2021-09-13_13-42-53.csv
6 2021-09-13_13-43-00.csv
You can loop through your files one by one, reading each one in as a dataframe and taking the center value that you want. Then save this value along with the file name. This list of results can then be read into a new dataframe ready for you to use.
result = []

for file in files:
    # read in the file; you may need to specify some extra parameters
    # check the pandas docs for read_csv
    df = pd.read_csv(file)

    # now select the value you want
    # this will vary depending on what your indexes look like (if any)
    # and also your column names
    value = df.loc[row, col]

    # append to the list
    result.append((file, value))

# you should now have a list in the format:
# [('2021-09-13_13-42-16.csv', 100), ('2021-09-13_13-42-22.csv', 255), ...

# load the list of tuples as a dataframe for further processing or analysis...
result_df = pd.DataFrame(result)
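For the specific case described (a bare 288x383 grid of pixel values), the read-and-select steps inside the loop could look like this, and the final DataFrame can be given explicit column names; the header=None assumption depends on how your CSVs are written:
# Inside the loop above -- assumes each CSV has no header row or index column.
df = pd.read_csv(file, header=None)
value = df.iloc[144, 191]  # center pixel, as in the question

# After the loop, name the columns explicitly:
result_df = pd.DataFrame(result, columns=['Full_name', 'center_value'])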