I'm working with a large set of .spc files. My goal is to loop through the files in the directory, grab a chunk of them (n=5), average those files together, and then write the averaged data to Excel. I've gotten pretty far with the general code, but I've never worked with .spc files in Python before, so I'm looking for specific help with opening, reading, averaging, and exporting .spc files.
def fileavg(path,n):
    import numpy as np
    import xlsxwriter
    import glob
    from pyspectra.readers.read_spc import read_spc_dir

    workbook = xlsxwriter.Workbook('Test.xlsx')
    worksheet = workbook.add_worksheet()
    row=0
    more_files=True
    b=glob.iglob(path) #when inputting path name begin with r' and end with a '
    while more_files:
        for i in range(n):
            try:
                next_file=next(b)
                new_file=read_spc_dir(row,next_file)
                A=np.array([new_file(1)])
                navg=A.mean(axis=0)
            except StopIteration:
                more_files=False
                break
        for col, data in enumerate(navg):
            worksheet.write_column(row, col, data)
        row +=1
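For the .spc side, something along these lines might work as a starting point. This is an untested sketch: it assumes every file shares the same wavelength axis, and that a helper load_spc(path) returns a 1-D numpy array of intensities for one file (pyspectra's read_spc, which reportedly returns a pandas Series, could fill that role via read_spc(p).to_numpy(); verify against its docs).
import numpy as np
import xlsxwriter
import glob

def fileavg(path, n, load_spc):
    # load_spc is a placeholder: it must map one .spc filename to a 1-D array
    workbook = xlsxwriter.Workbook('Test.xlsx')
    worksheet = workbook.add_worksheet()
    files = sorted(glob.glob(path))           # e.g. r'C:\data\*.spc'
    row = 0
    for start in range(0, len(files), n):     # take the files n at a time
        chunk = files[start:start + n]
        spectra = np.array([load_spc(f) for f in chunk])
        navg = spectra.mean(axis=0)           # element-wise average of the chunk
        worksheet.write_row(row, 0, navg)     # one averaged spectrum per row
        row += 1
    workbook.close()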
Related
I'm trying to write a function that pulls from a folder path, reads the files (each is an array with 2 rows and an arbitrary number of columns) in sets of n, averages the second row of each file column-wise, and writes those results out to an Excel file. I expect this to loop until I have reached the end of the files in the folder.
For example, the function is given a file path and an n value, i.e. (path, 2). Each of the following arrays would be a different file in the folder. The code would average the second rows of each set of n files and output the averages row by row.
Example:
[1,2;3,4] [1,2;5,6]
[1,2;7,8] [1,2;9,10]
[1,2;3,4] [1,2;9,10]
would output in an excel file:
4 5
8 9
6 7
This is my current code:
def fileavg(path,n):
    import numpy as np
    import xlsxwriter
    from glob import glob

    workbook = xlsxwriter.Workbook('Test.xlsx')
    worksheet = workbook.add_worksheet()
    row=0
    glob.iglob(path) #when inputting path name begin with r' and end with a '
    for i in range(0,len(1),n):
        f=yield 1[i:i +n]
        A=np.mean(f(1),axis=1)
        for col, data in enumerate(A):
            worksheet.write_column(row, col, data)
        row +=1
I receive a "generator object" error when I attempt to run the function. Please let me know what this means and where my mistakes might be, as I'm quite new to Python.
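The "generator object" message comes from the yield inside fileavg: any def that contains yield returns a generator instead of running its body, and len(1) / 1[i:i+n] are not valid either. Below is a rough, untested sketch of the chunking with itertools.islice; the loader argument is a stand-in (np.loadtxt only works for plain-text files, real .spc files need a proper parser).
import numpy as np
import xlsxwriter
from glob import iglob
from itertools import islice

def fileavg(path, n, loader=np.loadtxt):
    # loader is a placeholder that must return a 2-by-m array for one file
    workbook = xlsxwriter.Workbook('Test.xlsx')
    worksheet = workbook.add_worksheet()
    files = iglob(path)                      # e.g. r'C:\data\*.txt'
    row = 0
    while True:
        chunk = list(islice(files, n))       # next n file names, fewer at the end
        if not chunk:
            break
        # stack the second row of every file in the chunk and average column-wise
        second_rows = np.array([loader(f)[1] for f in chunk])
        worksheet.write_row(row, 0, second_rows.mean(axis=0))
        row += 1
    workbook.close()
With the six example files above and n=2, this would write 4 5, 8 9, and 6 7 on successive rows.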
I am able to use my code to compare PDFs of smaller sizes, but when it is used for large size PDFs it fails and shows all sorts of error messages. Below is my code:
import pdfminer
import pandas as pd
from time import sleep
from tqdm import tqdm
from itertools import chain
import slate
# List of pdf files to process
pdf_files = ['file1.pdf', 'file2.pdf']
# Create a list to store the text from each PDF
pdf1_text = []
pdf2_text = []
# Iterate through each pdf file
for pdf_file in tqdm(pdf_files):
    # Open the pdf file
    with open(pdf_file, 'rb') as pdf_now:
        # Extract text using slate
        text = slate.PDF(pdf_now)
        text = text[0].split('\n')
        if pdf_file == pdf_files[0]:
            pdf1_text.append(text)
        else:
            pdf2_text.append(text)
    sleep(20)
pdf1_text = list(chain.from_iterable(pdf1_text))
pdf2_text = list(chain.from_iterable(pdf2_text))
differences = set(pdf1_text).symmetric_difference(pdf2_text)
## Create a new dataframe to hold the differences
differences_df = pd.DataFrame(columns=['pdf1_text', 'pdf2_text'])
# Iterate through the differences and add them to the dataframe
for difference in differences:
    # Create a new row in the dataframe with the difference from pdf1 and pdf2
    differences_df = differences_df.append({'pdf1_text': difference if difference in pdf1_text else '',
                                            'pdf2_text': difference if difference in pdf2_text else ''}, ignore_index=True)
# Write the dataframe to an excel sheet
differences_df = differences_df.applymap(lambda x: x.encode('unicode_escape').decode('utf-8') if isinstance(x, str) else x)
differences_df.to_excel('differences.xlsx', index=False, engine='openpyxl')
import openpyxl
import re
# Load the Excel file into a dataframe
df = pd.read_excel("differences.xlsx")
# Create a condition to check the number of words in each cell
for column in ["pdf1_text", "pdf2_text"]:
df[f"{column}_word_count"] = df[column].str.split().str.len()
condition = df[f"{column}_word_count"] < 10
# Drop the rows that meet the condition
df = df[~condition]
for column in ["pdf1_text", "pdf2_text"]:
df = df.drop(f"{column}_word_count", axis=1)
# Save the modified dataframe to a new Excel file
df.to_excel("differences.xlsx", index=False)
The last error I got was this. Can anyone please go through the code and help me find what the actual problem is?
TypeError: %d format: a real number is required, not bytes
If you really want to boost the speed of your script by at least an order of magnitude, I recommend using PyMuPDF instead of PyPDF2 or pdfminer. I am usually measuring durations that are 10 to 35 times (!) smaller. And of course, no time.sleep() - why would you ever want to artificially slow down processing?
Here is how reading the text lines of the two PDFs would work with PyMuPDF:
import fitz # PyMuPDF
doc1 = fitz.open("file1.pdf")
doc2 = fitz.open("file2.pdf")
text1 = "\n".join([page.get_text() for page in doc1])
text2 = "\n".join([page.get_text() for page in doc2])
lines1 = text1.splitlines()
lines2 = text2.splitlines()
# then do your comparison ...
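As a rough continuation (untested), the same symmetric-difference comparison from the question could then run on lines1/lines2; building the frame from a list of rows also avoids DataFrame.append(), which newer pandas versions no longer provide:
import pandas as pd

set1, set2 = set(lines1), set(lines2)
differences = set1.symmetric_difference(set2)
rows = [{'pdf1_text': d if d in set1 else '',
         'pdf2_text': d if d in set2 else ''} for d in sorted(differences)]
differences_df = pd.DataFrame(rows, columns=['pdf1_text', 'pdf2_text'])
differences_df.to_excel('differences.xlsx', index=False, engine='openpyxl')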
Suppose I have a directory in which many .csv files are present, and I have Python code that reads a single CSV file, runs some algorithm on it, and stores the output in another CSV file. Now I need to update that code so that it scans the directory and stores the output of every CSV file inside it in a separate output file.
import pandas as pd
import statistics as st
import csv
data = pd.read_csv('1mb.csv')
x_or = list(range(len(data['Main Avg Power (mW)'])))
y_or = list(data['Main Avg Power (mW)'])
time=list(data['Time (s)'])
rt=5000
i=time[rt]
k=i
tlist=[]
for i in time:
    tlist.append(y_or[rt])
    rt+=1
    if i-k>4:
        break
idp=st.mean(tlist)
sidp=st.stdev(tlist)
newlist=[]
imax=max(tlist)
imin=min(min(tlist),idp-sidp)
while imax>=y_or[rt]>=imin-1:
    newlist.append(y_or[rt])
    rt+= 1
print(rt,"Mean idle power:",st.mean(newlist),"mW")
midp=st.mean(newlist)
with open('new_1pp.csv','w',newline='') as f:
    thewriter=csv.writer(f)
    thewriter.writerow(['Idle Power(mW)'])
    thewriter.writerow([midp])
This is the code I have so far; please update it as required for the problem.
In your code you can use glob to list all the CSV files in a directory, then read them in one at a time passing each through whatever algorithm you have, and then output them again, e.g.
import glob
import os
# set the name of the directory you want to list the files in
csvdir = 'my_directory'
# get a list of all CSV files, assuming they have a '.csv' suffix
csvfiles = glob.glob(os.path.join(csvdir, '*.csv'))
# loop over all the files and run your algorithm
for csvfile in csvfiles:
    # read the csvfile using your current code
    # apply your algorithm
    # output a new file (e.g. with the same name as before, but with '_new' added)
    newfile = os.path.splitext(csvfile)[0] + '_new.csv'
    # save to 'newfile' using your current code
Does that help?
Update:
From the comments and the updated question, does the following code help:
import pandas as pd
import statistics as st
import csv
import glob
# get list of CSV files from current directory
csvfiles = glob.glob('*.csv')
for csvfile in csvfiles:
    data = pd.read_csv(csvfile)
    x_or = list(range(len(data['Main Avg Power (mW)'])))
    y_or = list(data['Main Avg Power (mW)'])
    time=list(data['Time (s)'])
    rt=5000
    i=time[rt]
    k=i
    tlist=[]
    for i in time:
        tlist.append(y_or[rt])
        rt+=1
        if i-k>4:
            break
    idp=st.mean(tlist)
    sidp=st.stdev(tlist)
    newlist=[]
    imax=max(tlist)
    imin=min(min(tlist),idp-sidp)
    while imax>=y_or[rt]>=imin-1:
        newlist.append(y_or[rt])
        rt+= 1
    print(rt,"Mean idle power:",st.mean(newlist),"mW")
    midp=st.mean(newlist)
    # create the new file name using the old one and adding '_new' before the suffix
    newfile = csvfile[:-4] + '_new.csv'  # safe here because glob only matched '*.csv'
    with open(newfile,'w',newline='') as f:
        thewriter=csv.writer(f)
        thewriter.writerow(['Idle Power(mW)'])
        thewriter.writerow([midp])
I need to extract some data from 37,000 xls files, which are stored in 2,100 folders (activity/year/month/day). I already wrote the script, but when given a small sample of a thousand files it takes 5 minutes to run, and each individual file can include up to ten thousand entries I need to extract. I haven't tried running it on the entire folder yet; I'm looking for suggestions on how to make it more efficient.
I would also like some help on how to export the dictionary to a new Excel file (two columns), or how to skip the dictionary entirely and save directly to xls, and how to point the script at a shared drive folder instead of Python's working directory.
import fnmatch
import os
import pandas as pd
docid = []
CoCo = []
for root, dirs, files in os.walk('Z_Option'):
    for filename in files:
        if fnmatch.fnmatch(filename, 'Z_*.xls'):
            df = pd.read_excel(os.path.join(root, filename), sheet_name='Sheet0')
            for i in df['BLDAT']:
                if isinstance(i, int):
                    docid.append(i)
                    CoCo.append(df['BUKRS'].iloc[1])
data = dict(zip(docid, CoCo))
print(data)
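On the two side questions, a rough, untested sketch of both the two-column export and the shared-drive part is below. The UNC path is a made-up example, and docid/CoCo are the lists built by the loop above; writing them as parallel columns avoids the dict entirely (which would also silently collapse duplicate doc ids).
import os
import pandas as pd

# Hypothetical share; replace with the real UNC path or mapped drive letter.
shared_root = r'\\server\share\Z_Option'

docid, CoCo = [], []            # filled by the same extraction loop as above
for root, dirs, files in os.walk(shared_root):   # os.walk accepts the UNC path directly
    pass                        # ... fnmatch / read_excel / append logic from the question ...

# Two columns, one row per extracted entry (no intermediate dict needed):
pd.DataFrame({'docid': docid, 'CoCo': CoCo}).to_excel('extracted.xlsx', index=False)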
This walkthrough was very helpful for me when I was beginning with pandas. What is likely taking so long is the for i in df['BLDAT'] line.
Using something like an apply function can offer a speed boost:
def check_if_int(row):  # row is effectively a pd.Series for one row of the frame
    if isinstance(row['BLDAT'], int):
        docid.append(row['BLDAT'])
        CoCo.append(row.name)  # .name is the row's index label

df.apply(check_if_int, axis=1)  # axis=1 applies the function row-wise
It's unclear what exactly this script is trying to do, but if it's as simple as filtering the dataframe to only include rows where the 'BLDAT' column is an integer, using a mask would be much faster
df_filtered = df[df['BLDAT'].apply(lambda x: isinstance(x, int))]  # row-wise isinstance check
Another advantage of filtering the dataframe, as opposed to building lists, is that you can use df_filtered.to_csv() to write an Excel-readable file (or df_filtered.to_excel() for a true .xlsx).
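Putting those pieces together, an untested sketch of a mostly vectorized version might look like this (column names are taken from the question; the output file name is hypothetical, and pd.concat will fail if no files match, so that case would need a guard):
import fnmatch
import os
import pandas as pd

frames = []
for root, dirs, files in os.walk('Z_Option'):
    for filename in fnmatch.filter(files, 'Z_*.xls'):
        df = pd.read_excel(os.path.join(root, filename), sheet_name='Sheet0')
        mask = df['BLDAT'].apply(lambda x: isinstance(x, int))   # keep integer doc ids only
        frames.append(pd.DataFrame({'docid': df.loc[mask, 'BLDAT'],
                                    'CoCo': df['BUKRS'].iloc[1]}))
result = pd.concat(frames, ignore_index=True)
result.to_excel('bldat_bukrs.xlsx', index=False)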
Eventually I gave up due to time constraints (yay last minute "I need this tomorrow" reports), and came up with this. Dropping empty rows helped by some margin, and for the next quarter I'll try to do this entirely with pandas.
#Shared drive
import fnmatch
import os
import pandas as pd
import time
start_time = time.time()
docid = []
CoCo = []
errors = []  # collects paths of files that fail to read
os.chdir(r"X:\Shared activities")  # raw string so the backslash is not treated as an escape
for root, dirs, files in os.walk("folder"):
for filename in files:
if fnmatch.fnmatch(filename, 'Z_*.xls'):
try:
df = pd.read_excel(os.path.join(root, filename), sheet_name='Sheet0')
df.dropna(subset = ['BLDAT'], inplace = True)
for i in df['BLDAT']:
if isinstance(i, int):
docid.append(i)
CoCo.append(df['BUKRS'].iloc[1])
except:
errors.append((os.path.join(root, filename)))
data = dict(zip(docid, CoCo))
os.chdir("C:\project\reports")
pd.DataFrame.from_dict(data, orient="index").to_csv('test.csv')
with open('errors.csv', 'w') as f:
    for item in errors:
        f.write("%s\n" % item)
print("--- %s seconds ---" % (time.time() - start_time))
So far for my code to read from text files and export to Excel I have:
import glob
data = {}
for infile in glob.glob("*.txt"):
with open(infile) as inf:
data[infile] = [l[:-1] for l in inf]
with open("summary.xls", "w") as outf:
outf.write("\t".join(data.keys()) + "\n")
for sublst in zip(*data.values()):
outf.write("\t".join(sublst) + "\n")
The goal with this was to reach all of the text files in a specific folder.
However, when I run it, Excel gives me an error saying,
"File cannot be opened because: Invalid at the top level of the document. Line 1, Position 1. outputgooderr.txt outputbaderr.txt. fixed_inv.txt
Note: outputgooderr.txt, outputbaderr.txt, and fixed_inv.txt are the names of the text files I wish to export to Excel, one file per sheet.
When I only have one file for the program to read, it is able to extract the data. Unfortunately, this is not what I would like since I have multiple files.
Please let me know of any ways I can combat this. I am very much so a beginner in programming in general and would appreciate any advice! Thank you.
If you're not opposed to having the outputted excel file as a .xlsx rather than .xls, I'd recommend making use of some of the features of Pandas. In particular pandas.read_csv() and DataFrame.to_excel()
I've provided a fully reproducible example of how you might go about doing this. Please note that I create 2 .txt files in the first 3 lines for the test.
import pandas as pd
import numpy as np
import glob
# Creating a dataframe and saving as test_1.txt/test_2.txt in current directory
# feel free to remove the next 3 lines if you want to test in your directory
df = pd.DataFrame(np.random.randn(10, 3), columns=list('ABC'))
df.to_csv('test_1.txt', index=False)
df.to_csv('test_2.txt', index=False)
txt_list = [] # empty list
sheet_list = [] # empty list
# a for loop through filenames matching a specified pattern (.txt) in the current directory
for infile in glob.glob("*.txt"):
outfile = infile.replace('.txt', '') #removing '.txt' for excel sheet names
sheet_list.append(outfile) #appending for excel sheet name to sheet_list
txt_list.append(infile) #appending for '...txt' to txtt_list
writer = pd.ExcelWriter('summary.xlsx', engine='xlsxwriter')
# a for loop through all elements in txt_list
for i in range(0, len(txt_list)):
    df = pd.read_csv('%s' % (txt_list[i]))  # reading the file named in txt_list at index i
    df.to_excel(writer, sheet_name='%s' % (sheet_list[i]), index=False)  # writing it to the sheet named in sheet_list at index i
writer.close()  # finalise and save the workbook (ExcelWriter.save() was removed in pandas 2.0)
Output example: the result is a summary.xlsx workbook with one sheet per text file (here test_1 and test_2), each containing that file's data.