Averaging a number of files from a folder - python

I'm trying to write a function that takes a folder path, reads the files in sets of n (each file is a 2-by-N array), averages the second row of each set column by column, and writes those results out to an Excel file. I expect this to loop until it reaches the end of the files in the folder.
For example, the function is given a file path and an n value, i.e. (path, 2). Each of the following arrays would be a different file in the folder. The code would average the second rows of each set of n arrays and output the averages row by row.
Example:
[1,2;3,4] [1,2;5,6]
[1,2;7,8] [1,2;9,10]
[1,2;3,4] [1,2;9,10]
would produce this output in an Excel file:
4 5
8 9
6 7
This is my current code:
def fileavg(path,n):
    import numpy as np
    import xlsxwriter
    from glob import glob
    workbook = xlsxwriter.Workbook('Test.xlsx')
    worksheet = workbook.add_worksheet()
    row=0
    glob.iglob(path) #when inputting path name begin with r' and end with a '
    for i in range(0,len(1),n):
        f=yield 1[i:i +n]
        A=np.mean(f(1),axis=1)
        for col, data in enumerate(A):
            worksheet.write_column(row, col, data)
        row +=1
I receive a generator object error when I attempt to run the function. Please let me know what this means and where any mistakes might be as I'm quite new to python.
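The yield statement is the culprit: any def whose body contains yield becomes a generator function, so calling fileavg(path, 2) returns a generator object instead of running the body. There are other problems too: len(1) and 1[i:i +n] are invalid because integers have no length and cannot be sliced, and glob.iglob(path) builds an iterator that is immediately thrown away. A minimal working sketch, assuming path is a glob pattern such as r'C:\data\*.txt' and each file is a plain-text 2-row array that np.loadtxt can parse (both assumptions about your data):
import numpy as np
import xlsxwriter
from glob import glob

def fileavg(path, n):
    files = sorted(glob(path))  # collect the matching file names up front
    workbook = xlsxwriter.Workbook('Test.xlsx')
    worksheet = workbook.add_worksheet()
    for row, i in enumerate(range(0, len(files), n)):
        # np.loadtxt(f)[1] is the second row of each file; this assumes a
        # whitespace-delimited text format -- adjust the reader to your data.
        second_rows = [np.loadtxt(f)[1] for f in files[i:i + n]]
        avg = np.mean(second_rows, axis=0)         # column-wise average
        worksheet.write_row(row, 0, avg.tolist())  # one averaged row per set
    workbook.close()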

Related

Averaging .spc files

I'm working with a large set of .spc files. My goal is to loop through the files in the directory, grab a chunk of them (n=5), average those files together, and then write the averaged data to Excel. I've gotten pretty far in terms of general code, but I've never worked with .spc files in Python before, so I'm looking for some specific help with opening, reading, averaging, and exporting .spc files.
def fileavg(path,n):
    import numpy as np
    import xlsxwriter
    import glob
    from pyspectra.readers.read_spc import read_spc_dir
    workbook = xlsxwriter.Workbook('Test.xlsx')
    worksheet = workbook.add_worksheet()
    row=0
    more_files=True
    b=glob.iglob(path) #when inputting path name begin with r' and end with a '
    while more_files:
        for i in range(n):
            try:
                next_file=next(b)
                new_file=read_spc_dir(row,next_file)
                A=np.array([new_file(1)])
                navg=A.mean(axis=0)
            except StopIteration:
                more_files=False
                break
        for col, data in enumerate(navg):
            worksheet.write_column(row, col, data)
        row +=1
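A sketch of one way to restructure this, stepping over the file list in chunks instead of juggling a while loop and StopIteration. It assumes pyspectra's read_spc(filename) reads a single .spc file and returns its intensity values as an array-like (check your version's API; read_spc_dir, by contrast, reads a whole directory at once, so passing it a row number and a single file name will not work):
import glob
import numpy as np
import xlsxwriter
from pyspectra.readers.read_spc import read_spc  # assumed single-file reader

def fileavg(path, n):
    workbook = xlsxwriter.Workbook('Test.xlsx')
    worksheet = workbook.add_worksheet()
    files = sorted(glob.glob(path))  # path is a pattern such as r'C:\spectra\*.spc'
    for row, i in enumerate(range(0, len(files), n)):
        # Read each spectrum in the chunk; this assumes every file has the
        # same number of points so the arrays stack cleanly.
        spectra = [np.asarray(read_spc(f)) for f in files[i:i + n]]
        navg = np.mean(spectra, axis=0)  # point-wise average of the chunk
        worksheet.write_row(row, 0, navg.tolist())
    workbook.close()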

Reading in multiple large .dta files in loops/enumerate

Problem: Efficiently reading in multiple .dta files at once without crashing.
Progress: I can currently read in one large .dta file, pickle it, and merge it without exceeding memory capacity. When I try to loop this by setting a dynamic variable n and calling it from a list or dictionary, I do not get 3 separate DataFrame objects; instead I get a list in which one of the values is a DataFrame object.
Current Code:
import pandas as pd
import pickle
import glob
import os

n = 0 # Change variable value when reading in new DF

# Name of variables, lists, and path
path = r"directory\directory\\"
fname = ["survey_2000_ucode.dta", "2010_survey_ucode.dta", "2020_survey.dta"]
chunks = ["chunks_2000", "chunks_2010", "chunks_2020"]
input_path = [path + fname[0], path + fname[1], path + fname[2]]
output_path = [path + chunks[0], path + chunks[1], path + chunks[2]]

# Create folders and directory if it does not exist
for chunk in chunks:
    if not os.path.exists(os.path.join(path, chunk)):
        os.mkdir(os.path.join(path, chunk))

CHUNK_SIZE = 100000 # Input size of chunks ~ around 14MB

# Read in .dta files in chunks and output pickle files
reader = pd.read_stata(input_path[n], chunksize=CHUNK_SIZE, convert_categoricals=False)
for i, chunk in enumerate(reader):
    output_file = output_path[n] + "/chunk_{}.pkl".format(i+1)
    with open(output_file[n], "wb") as f:
        pickle.dump(chunk, f, pickle.HIGHEST_PROTOCOL)

# Read in pickle files and append one by one into DataFrame
pickle_files = []
for name in glob.glob(output_path[n] + "chunk_*.pkl"):
    pickle_files.append(name)

# Create a list/dictionary of dataframes and append data
dfs = ["2000", "2010", "2020"]
dfs[n] = pd.DataFrame([])
for i in range(len(pickle_files)):
    dfs[n] = dfs[n].append(pd.read_pickles(pickle_files[i]), ignore_index=True)
Current Output: No df2000, df2010, df2020 DataFrames are output. Instead, the DataFrame with my data is the first object in the dfs list. Basically, in the dfs list:
index 0 is a DataFrame with 2,442,717 rows and 34 columns;
index 1 is the string value '2010'; and
index 2 is the string value '2020'.
Desired Output:
Read in multiple large data files efficiently and create separate multiple DataFrames at once.
Advice/suggestions on interacting (i.e. data cleaning, wrangling, manipulation, etc) with the read-in multiple large DataFrames without crashing or taking long periods when running a line of code.
All help and input is greatly appreciated. Thank you for your time and consideration. I apologize for not being able to share pictures of my results and datasets, as I am accessing them through a secured connection and have no internet access.
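For the list problem specifically: dfs = ["2000", "2010", "2020"] is a list of strings, and dfs[n] = pd.DataFrame([]) merely overwrites element n, which is why one DataFrame ends up sandwiched between the strings '2010' and '2020'. A dictionary keyed by year keeps the three DataFrames separate. A sketch of that approach (note pd.read_pickle is the actual pandas function, read_pickles does not exist, and pd.concat replaces DataFrame.append, which newer pandas versions no longer provide):
import glob
import os
import pandas as pd

path = r"directory\directory"  # placeholder path from the question
dfs = {}  # year -> assembled DataFrame
for year in ["2000", "2010", "2020"]:
    chunk_files = sorted(glob.glob(os.path.join(path, "chunks_" + year, "chunk_*.pkl")))
    # Load every pickled chunk and stitch them together in one pass.
    dfs[year] = pd.concat((pd.read_pickle(p) for p in chunk_files),
                          ignore_index=True)
# dfs["2000"], dfs["2010"], dfs["2020"] are now three separate DataFrames.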

Find folders within a directory using a list, and copy them to a different directory

I need some help with creating a python script for this problem.
Basically, I have an Excel sheet with a list of patient medical record numbers:
10000
10001
10002
10003
etc...
And I have a drive with this basic format:
-AllImages
--A
---A1
---A2
----10004
---A3
----10005
----10006
----10007
--B
---B1
----10008
----10009
-----10009_MRI
-----10009_CT
---B2...
And the desired output would be:
-OutputImages
--10000
--10001
--10002
---10002_MRI
---10002_CT
--10003
etc...
They are not always in exact order though. These terminal patient folders are what I need to copy to a different directory, but they can contain subfolders whose names also include the medical record number, as illustrated by patient 10009. I do NOT want to pull those subfolders out separately from the main patient folder, so when I search I want to stop at the highest folder with the patient medical record number in the name.
I wrote a script that FINDS the folders and outputs a CSV noting, next to each medical record number, where the image can be found or that it couldn't be found at all. However, I cannot figure out how to get it to copy those folders to a new location. This seems like a super simple operation but I can't figure it out!
Here is the current script I am running. I tried to modify the other script I wrote with some code I found on this site, but it's not working and I don't understand it well enough to know why.
import os
import shutil
import xlrd
import easygui
import numpy as np
import csv

#get the excel sheet
print ('Choose patient data sheet')
master_ws = 'TestDemo/TestPatientList.xlsx'
#easygui.fileopenbox()
workbook = xlrd.open_workbook(master_ws)
ws = workbook.sheet_by_name('Sheet1')
num_rows = ws.nrows - 1

#get correct MRN column
col = int(input ('Enter the column with patient MRNs (A=0, B=1, etc): '))

#file browser for choosing which directory to index
print ('Choose directory for indexing')
RootDir1 = r'TestDemo/TestDirectory'
#easygui.diropenbox()

#choose output folder
print ('Create output folder')
TargetFolder = r'Scripts/TestDemo/TestOutputDirectory'
#easygui.diropenbox()

#sorts directory titles into array of strings
folders = [f for f in sorted(os.listdir(RootDir1))]
folders = np.asarray(folders, dtype=str)

#gets worksheet row values and puts into an array of strings
arr = [ws.row(0)]
for i in range(1,num_rows+1):
    row = np.asarray(ws.row_values(i))
    arr = np.append(arr, [row], axis = 0)

#matching between folders and arr, ie. between directory index and master sheet
for y in range(1, len(arr)):
    for root, dirs, files in os.walk((os.path.normpath(RootDir1)), topdown=False):
        for name in dirs:
            if name.find(str(int(float(str(arr[y, col]))))):
                print ("Found" + name)
                SourceFolder = os.path.join(root,name)
                shutil.copy(SourceFolder, TargetFolder) #copies to new folder
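Two things stand out in the matching loop (a sketch follows, not a definitive fix). First, str.find returns -1 when the substring is absent and an index of 0 or more when present, so "if name.find(...)" is truthy for every miss and for any hit that is not at position 0; a plain "in" test is what was meant. Second, shutil.copy copies single files, so copying a folder needs shutil.copytree. Something along these lines, with pruning so matched patient folders are copied whole and their subfolders (10009_MRI, 10009_CT) are not copied separately:
import os
import shutil

def copy_patient_folders(root_dir, target_dir, mrns):
    """Copy the highest folder whose name contains each MRN (hypothetical helper)."""
    found = set()
    for root, dirs, files in os.walk(os.path.normpath(root_dir), topdown=True):
        for name in list(dirs):
            if any(mrn in name for mrn in mrns):
                source = os.path.join(root, name)
                # copytree copies the whole directory tree; dirs_exist_ok
                # needs Python 3.8+.
                shutil.copytree(source, os.path.join(target_dir, name),
                                dirs_exist_ok=True)
                found.add(name)
                dirs.remove(name)  # prune: do not walk inside a matched folder
    return found

# mrns would come from the spreadsheet, e.g. ["10000", "10001", "10002"]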

Developing a Counter for Outputting Data Info into a specific range of excel rows & columns?

I am a beginner in Python with just the basic fundamentals under my belt, i.e. loops and modules. I have some data that I want to automate by exporting it as text into specific Microsoft Excel cells.
Specifically, I have 3 folders, each with image files in them, 10 image files in total. My goal is to write code that opens Excel and outputs the folder path in each descending row of column A and the respective image file name in descending rows of column B.
So far, I have defined the folder and path, and my code opens Excel, names the sheet, and puts a title in. I ran into my problem when trying to iterate each folder/file into a new cell. I tried using the range function, but it doesn't work with strings, and I feel like a simple counter variable would work, but again, Excel column and row names are strings.
Here is my code so far:
import win32com.client
import sys, os, string, arcpy

data_folder = "F:\\School\\GEOG_390\\Week11\\data"
xlApp = win32com.client.Dispatch("Excel.Application")
xlApp.Visible = 1
xlApp.Workbooks.Add()
print(xlApp.Worksheets("Sheet1").Name)
xlApp.Worksheets("Sheet1").Range("A1").Value= "Data Files:"

for root, folders, files in os.walk(data_folder):
    for folder in folders:
        workspace = os.path.join(root, folder)
        print("Processing" + " " + workspace)
        arcpy.env.workspace = workspace
        rasters = arcpy.ListRasters("*", "IMG")
        for raster in rasters:
            arcpy.BuildPyramids_management(raster)
            arcpy.CalculateStatistics_management(raster)
            print(raster)
            sheet = xlApp.Worksheets("Sheet1")
            sheet.Range("A2").Value = "Folder:" + folder
            sheet.Range("B2").Value = "Raster:" + raster
            print(sheet.Range("A2").Value)
            print(sheet.Range("B2").Value)
I need the code to put folder name 1 in cell A2 and the image file name in B2, and from there on, folder 2 in cell A3 and file 2 in cell B3, all the way until cells A11 and B11.
If I understand the problem correctly, you are having trouble figuring out how to increment the names of the cells (for example, "A2" and "B2") each time you process a file. My apologies if that is not the issue.
(1) Before the first for loop, declare a variable cell_row that will track the row number where the next pair of cells will be created. It starts at 2.
cell_row = 2
(2) Modify the end of your loop as follows:
folder_cell = "A" + str(cell_row)
raster_cell = "B" + str(cell_row)
sheet = xlApp.Worksheets("Sheet1")
sheet.Range(folder_cell).Value = "Folder:" + folder
sheet.Range(raster_cell).Value = "Raster:" + raster
print(sheet.Range(folder_cell).Value)
print(sheet.Range(raster_cell).Value)
cell_row += 1
You need to think of it programmatically. Start with a method name:
def get_xls_range(start, end):
    letter1, num1 = start[0], int(start[1:]) # split the column letter from the row number
    letter2, num2 = end[0], int(end[1:])     # same for the end cell
    # convert the letters to a range of column letters
    range_cols = [chr(x) for x in range(ord(letter1), ord(letter2) + 1)]
    # convert the numbers to a range of row numbers
    range_rows = range(num1, num2 + 1)
    for col in range_cols:
        for row in range_rows:
            yield "%s%s" % (col, row)
Then all you need to do is iterate over your new function:
for cell in get_xls_range("A1", "B22"):
    print(cell)

How to read excel files in a for loop with openpyxl?

This seems tricky to me. Let's say I have, nested in a directory tree, an Excel file with a few non-empty columns. I want to get the sum of all values located in column F with openpyxl:
file1.xlsx
      A   B   C   D   E   F
 1                        5
 2                        7
 3                       11
 4                       17
 5                       20
 6                       29
 7                       34
My take on it would be as follows, but it is wrong:
import os
from openpyxl import load_workbook

directoryPath=r'C:\Users\MyName\Desktop\MyFolder' #The main folder
os.chdir(directoryPath)
folder_list=os.listdir(directoryPath)
for folders, sub_folders, file in os.walk(directoryPath): #Traversing the sub folders
    for name in file:
        if name.endswith(".xlsx"):
            filename = os.path.join(folders, name)
            wb=load_workbook(filename, data_only=True)
            ws=wb.active
            cell_range = ws['F1':'F7'] #Selecting the slice of interest
            sumup=0
            for row in cell_range:
                sumup=sumup+cell.value
While running this I get NameError: name 'cell' is not defined. How to work around this?
The main thing currently wrong is that you are only iterating through the rows, not the cells (columns) within each row.
At the end of your code, you can do this (replace the last two lines of your code):
for row in cell_range: # This iterates through rows 1-7
    for cell in row:   # This iterates through the cells (columns) in that row
        value = cell.value
        sumup += value
You mentioned that you didn't think this was running through each of your Excel files. This would have been very easy to debug. Remove all code after
ws=wb.active
and add
print(name + ' : ' + ws.title)
This would print each Excel file name and the title of its active sheet. If it prints more than one, then it's clearly crawling through and grabbing the Excel files...
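Putting the fix together, the whole traversal might look like this (keeping the F1:F7 slice from the question; the None check is an addition, because empty cells in a slice come back as None and would break the sum):
import os
from openpyxl import load_workbook

directoryPath = r'C:\Users\MyName\Desktop\MyFolder'
for folders, sub_folders, file in os.walk(directoryPath):
    for name in file:
        if name.endswith(".xlsx"):
            wb = load_workbook(os.path.join(folders, name), data_only=True)
            ws = wb.active
            sumup = 0
            for row in ws['F1':'F7']:   # each row is a tuple of cells
                for cell in row:        # walk the cells in that row
                    if cell.value is not None:
                        sumup += cell.value
            print(name, ':', sumup)     # one total per workbook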
