number of columns in xlwt worksheet - python

I can't seem to find a way to return the number of columns in a worksheet created with xlwt.Workbook(). The idea is to take a wad of .xls files in a directory and combine them into one. One problem I am having is changing the column position when writing the next file. This is what I'm working with so far:
import xlwt, xlrd, os
def cbc(rd_sheet, wt_sheet, rlo=0, rhi=None,
        rshift=0, clo=0, chi=None, cshift=0):
    if rhi is None: rhi = rd_sheet.nrows
    if chi is None: chi = 2  # only first two cols needed
    for row_index in xrange(rlo, rhi):
        for col_index in xrange(clo, chi):
            cell = rd_sheet.cell(row_index, col_index)
            wt_sheet.write(row_index + rshift, col_index + cshift, cell.value)
Dir = '/home/gerg/Desktop/ex_files'
ext = '.xls'
list_xls = [file for file in os.listdir(Dir) if file.endswith(ext)]
files = [Dir + '/%s' % n for n in list_xls]
output = '/home/gerg/Desktop/ex_files/copy_test.xls'
wbook = xlwt.Workbook()
wsheet = wbook.add_sheet('Summary', cell_overwrite_ok=True)  # overwrite just for the repeated testing
for XLS in files:
    rbook = xlrd.open_workbook(XLS)
    rsheet = rbook.sheet_by_index(0)
    cbc(rsheet, wsheet, cshift=0)
wbook.save(output)
list_xls returns:
['file2.xls', 'file3.xls', 'file1.xls', 'copy_test.xls']
files returns:
['/home/gerg/Desktop/ex_files/file2.xls', '/home/gerg/Desktop/ex_files/file3.xls', '/home/gerg/Desktop/ex_files/file1.xls', '/home/gerg/Desktop/ex_files/copy_test.xls']
My question is how to scoot each file written into xlwt.workbook over by 2 each time. This code gives me the first file saved to .../copy_test.xls. Is there a problem with the file listing as well? I have a feeling there may be.
This is Python2.6 and I bounce between windows and linux.
Thank you for your help,
GM

You are using only the first two columns in each input spreadsheet. You don't need "the number of columns in a worksheet in xlwt.Workbook()". You already have the cshift mechanism in your code, but you are not using it. All you need to do is change the loop in your outer block, like this:
for file_index, file_name in enumerate(files):
    rbook = xlrd.open_workbook(file_name)
    rsheet = rbook.sheet_by_index(0)
    cbc(rsheet, wsheet, chi=2, cshift=file_index * 2)
For generality, change the line
if chi is None: chi = 2
in your function to
if chi is None: chi = rd_sheet.ncols
and pass chi=2 in as an arg as I have done in the above code.
I don't understand your rationale for overriding the overwrite check ... surely in your application, overwriting an existing cell value is incorrect?
You say "This code gives me the first file saved to .../copy_test.xls". First in input order is file2.xls. The code that you have shown overwrites previous input and will give you the LAST file (in input order), not the first ... perhaps you are mistaken. Note: The last input file, 'copy_test.xls', is quite likely to be a previous OUTPUT file; perhaps your output file should be put in a separate folder.
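The column-shift idea and the concern about the output file can be sketched without touching any spreadsheet. The helper name `plan_column_shifts` and its `cols_per_file` parameter are hypothetical, introduced only for this illustration; sorting the names is also an assumption, added because os.listdir() returns entries in arbitrary order. The file names are the ones from the question.

```python
def plan_column_shifts(file_names, output_name, cols_per_file=2):
    """Pair each input file with the column offset where its data should start.

    Hypothetical helper for illustration: it skips the output file (so a
    previous result is not read back in) and sorts the names so that the
    column order is predictable.
    """
    inputs = sorted(name for name in file_names if name != output_name)
    return [(name, index * cols_per_file) for index, name in enumerate(inputs)]


listing = ['file2.xls', 'file3.xls', 'file1.xls', 'copy_test.xls']
for name, cshift in plan_column_shifts(listing, 'copy_test.xls'):
    print(name, cshift)
```

Each `cshift` value would then be passed on as `cbc(rsheet, wsheet, chi=2, cshift=cshift)`.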

Related

How to iterate through csv rows, apply a function to those values and append to new column?

I have a Python script which calculates tree heights from the distance to each tree and the angle from the ground to its top. However, although the script runs with no errors, my heights column is left empty. Also, I don't want to use pandas and I would like to keep to the 'with open' method if possible, before anyone suggests going about it a different way. Any help would be great, thanks. It seems that the whole script runs fine and does everything I need it to until the "for row in csvread:" block.
This is my current script:
#!/usr/bin/env python3
# Import any modules needed
import sys
import csv
import math
import os
import itertools
# Extract command line arguments, remove file extension and attach to output_filename
input_filename1 = sys.argv[1]
input_filename2 = os.path.splitext(input_filename1)[0]
filenames = (input_filename2, "treeheights.csv")
output_filename = "".join(filenames)
def TreeHeight(degrees, distance):
    """
    This function calculates the heights of trees given distance
    of each tree from its base and angle to its top, using the
    trigonometric formula.
    """
    radians = math.radians(degrees)
    height = distance * math.tan(radians)
    print("Tree height is:", height)
    return height
def main(argv):
    with open(input_filename1, 'r') as f:
        with open(output_filename, 'w') as g:
            csvread = csv.reader(f)
            print(csvread)
            csvwrite = csv.writer(g)
            header = csvread.__next__()
            header.append("Height.m")
            csvwrite.writerow(header)
            # Populating the output csv with the input data
            csvwrite.writerows(itertools.islice(csvread, 0, 121))
            for row in csvread:
                height = TreeHeight(csvread[:,2], csvread[:,1])
                row.append(height)
                csvwrite.writerow(row)
    return 0
if __name__ == "__main__":
    status = main(sys.argv)
    sys.exit(status)
Looking at your code, I think you're mostly there, but are a little confused on reading/writing rows:
# Populating the output csv with the input data
csvwrite.writerows(itertools.islice(csvread, 0, 121))
for row in csvread:
    height = TreeHeight(csvread[:,2], csvread[:,1])
    row.append(height)
    csvwrite.writerow(row)
It looks like you're reading rows 1 through 121 and writing them to your new file. Then you're trying to iterate over your CSV reader in a second pass, compute the height, tack that computed value onto the end of each row, and write the rows to your CSV again.
If that's true, then you need to understand that CSV reader and writer are not designed to work "left-to-right" like that: read-write these columns, then read-write these columns... nope.
They both work "top-down", processing rows.
I propose, to get this working: iterate every row in one loop, and for every row:
read the values you need from row to compute the height
get the computed height
append the computed height to the original row
write
...
header = next(csvread)
header.append("Height.m")
csvwrite.writerow(header)
for row in csvread:
    degrees = float(row[1])   # second column for degrees?
    distance = float(row[0])  # first column for distance?
    height = TreeHeight(degrees, distance)
    row.append(height)
    csvwrite.writerow(row)
Some changes I made:
I replaced header = csvread.__next__() with header = next(csvread). Calling things that start with _ or __ is generally discouraged, at least in the standard library. next(<iterator>) is the built-in function that allows you to properly and safely advance through <iterator>.
Added float() conversion to textual values as read from CSV
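Also, csvread[:,2] and csvread[:,1] will not work: a csv reader is an iterator and cannot be subscripted, so those expressions would raise a TypeError if they were ever evaluated. You didn't get any errors because the reader was already exhausted by the islice() call (it consumes up to 121 rows, presumably all of your data), so your program never actually stepped into the for row in csvread: loop.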
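The single-pass approach can be tried without any files on disk. This is a minimal sketch: the sample data is made up, the column order (distance first, then angle) is an assumption just as it is in the answer, and `tree_height` is a renamed, print-free stand-in for the question's TreeHeight.

```python
import csv
import io
import math


def tree_height(degrees, distance):
    """Height from the horizontal distance and the angle of elevation in degrees."""
    return distance * math.tan(math.radians(degrees))


# Made-up sample data; the column order (distance, then angle) is an assumption.
source = io.StringIO("Distance.m,Angle.degrees\n10,45\n20,30\n")
target = io.StringIO()

reader = csv.reader(source)
writer = csv.writer(target)

# Copy the header once, with the new column name appended.
header = next(reader)
writer.writerow(header + ["Height.m"])

# One pass: read a row, compute its height, write the extended row.
for row in reader:
    distance = float(row[0])
    degrees = float(row[1])
    writer.writerow(row + [tree_height(degrees, distance)])

print(target.getvalue())
```

Replacing the two io.StringIO objects with the file handles from the 'with open' blocks should give the same behaviour on real files.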

Write to .txt file the results from FOR looping in python

My program searches the values in a .txt file for the ones just below and just above each input value.
def find_closer():
    file = 'C:/.../CariCBABaru.txt'
    data = np.loadtxt(file)
    x, y = data[:,0], data[:,1]
    print(y)
    for k in range(len(spasi_baru)):
        a = y  # [0, 20.28000631, 49.43579604, 78.59158576, 107.7473755, 136.9031652, 166.0589549, 176.5645474, 195.2147447]
        b = spasi_baru[k]
        # diff_list = []
        diff_dict = OrderedDict()
        if b in a:
            b = input("Number already exists, please enter another number ")
        else:
            for x in a:
                diff = x - b
                if diff < 0:
                    # diff_list.append(diff*(-1))
                    diff_dict[x] = diff*(-1)
                else:
                    # diff_list.append(diff)
                    diff_dict[x] = diff
        #print("diff_dict", diff_dict)
        # print(diff_dict[9])
        sort_dict_keys = sorted(diff_dict.keys())
        #print(sort_dict_keys)
        closer_less = 0
        closer_more = 0
        #cl = []
        #cm = []
        for closer in sort_dict_keys:
            if closer < b:
                closer_less = closer
            else:
                closer_more = closer
                break
        #cl.append(closer_less == len(spasi_baru) - 1)
        #cm.append(closer_more == len(spasi_baru) - 1)
        print(spasi_baru[k],": lower value=", closer_less, "and upper value =", closer_more)
        data = open('C:/.../Batas.txt','w')
        text = "Spasi baru:{spasi_baru}, File: {closer_less}, line:{closer_more}".format(spasi_baru=spasi_baru[k], closer_less=closer_less, closer_more=closer_more)
        data.write(text)
        data.close()
        print(spasi_baru[k],": lower value=", closer_less, "and upper value =", closer_more)
find_closer()
The results image is here 1
Then I want to write these results to a file (txt or csv, either is fine) as rows and columns. The problem is that the file ends up with just one row, the last value printed in the terminal, like below:
Spasi baru:400, File: 399.3052727, line: 415.037138
Any suggestions to help fix my problem, please? I have been stuck for several hours trying different approaches. I'm using Python 3.7.
The file is opened with 'w' inside the loop, so every pass truncates it and only the last iteration survives. Changing the mode to 'w+' does not help, because 'w+' truncates as well; use an append mode, or open the file once before the loop.
Instead of doing this:
data = open('C:/.../Batas.txt','w')
Do this:
data = open('C:/.../Batas.txt','a')
or, if you also need to read the file back through the same handle:
data = open('C:/.../Batas.txt','a+')
Note that append mode keeps whatever earlier runs wrote, so delete the file first if you want a fresh result. Also end each record with a newline ("\n"), otherwise all records land on one line.
'r' – Read mode, used when the file is only being read
'w' – Write mode, used to write new information to the file (an existing file with the same name is erased when it is opened in this mode)
'a' – Append mode, used to add new data to the end of the file; new information is always written at the end
'r+' – Read and write mode, which allows both actions on an existing file without truncating it
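The difference between reopening with 'w' on every pass and opening the file once can be seen in a small experiment. This is only a sketch: the result triples are made up to resemble the question's output, and the files are written to a temporary folder rather than to the question's path.

```python
import os
import tempfile

# Made-up results standing in for (spasi_baru, lower, upper) triples.
results = [(100, 78.6, 107.7), (150, 136.9, 166.1), (400, 399.3, 415.0)]

folder = tempfile.mkdtemp()
overwritten = os.path.join(folder, "overwritten.txt")
complete = os.path.join(folder, "complete.txt")

# Reopening with 'w' on every pass truncates the file each time.
for value, lower, upper in results:
    with open(overwritten, "w") as handle:
        handle.write("{},{},{}\n".format(value, lower, upper))

# Opening once, before the loop, keeps every line.
with open(complete, "w") as handle:
    for value, lower, upper in results:
        handle.write("{},{},{}\n".format(value, lower, upper))

with open(overwritten) as handle:
    print(handle.read())  # only the last result
with open(complete) as handle:
    print(handle.read())  # all three results
```

Opening once with 'w' before the loop also avoids the leftover-data problem of append mode, since each run starts from an empty file.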

Python: Reading Data from Multiple CSV Files to Lists

I'm using Python 3.5 to move through directories and subdirectories to access csv files and fill arrays with data from those files. The first csv file the code encounters looks like this:
The code I have is below:
import matplotlib.pyplot as plt
import numpy as np
import os, csv, datetime, time, glob
gpheight = []
RH = []
dewpt = []
temp = []
windspd = []
winddir = []
dirpath, dirnames, filenames = next(os.walk('/strm1/serino/DATA'))
count2 = 0
for dirname in dirnames:
    if len(dirname) >= 8:
        try:
            dt = datetime.datetime.strptime(dirname[:8], '%m%d%Y')
            csv_folder = os.path.join(dirpath, dirname)
            for csv_file2 in glob.glob(os.path.join(csv_folder, 'figs', '0*.csv')):
                if os.stat(csv_file2).st_size == 0:
                    continue
                #create new arrays for each case
                gpheight.append([])
                RH.append([])
                temp.append([])
                dewpt.append([])
                windspd.append([])
                winddir.append([])
                with open(csv_file2, newline='') as f2_input:
                    csv_input2 = csv.reader(f2_input,delimiter=' ')
                    for j,row2 in enumerate(csv_input2):
                        if j == 0:
                            continue #skip header row
                        #fill arrays created above
                        windspd[count2].append(float(row2[5]))
                        winddir[count2].append(float(row2[6]))
                        gpheight[count2].append(float(row2[1]))
                        RH[count2].append(float(row2[4]))
                        temp[count2].append(float(row2[2]))
                        dewpt[count2].append(float(row2[3]))
                count2 = count2 + 1
        except ValueError as e:
            pass
I have it set up to create a new array for each new csv file. However, when I print the third (temperature) column,
for n in range(0,len(temp)):
    print(temp[0][n])
it only partially prints that column of data:
-70.949997
-68.149994
-60.449997
-63.649994
-57.449997
-51.049988
-45.349991
-40.249985
-35.549988
-31.249985
-27.149994
-24.549988
-22.149994
-19.449997
-16.349976
-13.25
-11.049988
-8.949982
-6.75
-4.449982
-2.25
-0.049988
In addition, I believe a related problem is that when I simply do,
print(temp)
it prints
with the highlighted section the section that belongs to this one csv file, and should therefore be in one array. There are also additional empty arrays at the end that should not be there.
I have (not shown) a section of code before this that does the same thing but with different csv files, and that works as expected, separating each file's data into a new array, with no empty arrays. I appreciate any help!
The issue was my use of try and pass. All the files that matched my criteria were found, but some of them had contents that could not be read correctly, and the resulting exception was silently swallowed. That is what caused the problems I saw later in the code. For anyone looking to use try and pass, make sure that it is actually safe to ignore every exception that block of code may raise. Otherwise, it could cause problems later. If you don't suppress the exception you will still get an error, but that forces you to fix it properly instead of ignoring it.
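One way to avoid half-filled lists is to keep the try block narrow, parse each file into a temporary list, and record the failure instead of passing on it. The sketch below uses made-up, space-delimited samples held in memory; `read_temperatures` and its parameters are hypothetical names chosen for this illustration, not part of the original script.

```python
import csv
import io


def read_temperatures(named_sources, column=2):
    """Return ({name: [floats]}, [(name, error)]) for the given sources.

    A source that fails to parse is reported and skipped as a whole,
    so no half-filled list is left behind.
    """
    parsed, failed = {}, []
    for name, handle in named_sources:
        reader = csv.reader(handle, delimiter=' ')
        next(reader, None)  # skip the header row
        try:
            values = [float(row[column]) for row in reader]
        except (ValueError, IndexError) as error:
            failed.append((name, str(error)))
            continue
        parsed[name] = values
    return parsed, failed


# Made-up samples; the second one contains a value that is not a number.
good = io.StringIO("p gph temp\n100 16000 -70.9\n150 13500 -68.1\n")
bad = io.StringIO("p gph temp\n100 16000 -70.9\n150 13500 n/a\n")

parsed, failed = read_temperatures([("good.csv", good), ("bad.csv", bad)])
print(parsed)
print(failed)
```

Keying the results by file name rather than by a running counter also removes the risk of the counter and the lists drifting out of step.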

Compare all the CSV files in a folder and print duplicate rows

I have multiple CSV files in a folder, which I want to compare and print the matching rows (where the number of columns could be different). I know how to get duplicates within a file but this case is a little different. Let's say there are two files in a folder and I want to compare them.
CSV1:
H1,H2,H4
C01,23,F
C2,45,M
CSV2:
H1,H2,H3,H4
C01,23,data,F
C01,23,some other data,M
C4,34,data,M
I need my output to check if all the available data (from the one with the least number of columns) matches exactly in another file in the same folder. My output could be like
CSV1,CSV2 (H1:C01,H2:23,H4:F(H3:data))
What about something like:
import csv
def duplines(csv_least_cols, csv_most_cols):
    rowset = set()
    with open(csv_least_cols) as csv1:
        r = csv.reader(csv1)
        csv1_cols = next(r)
        for row in r:
            rowset.add(tuple(row))
    with open(csv_most_cols) as csv2:
        dr = csv.DictReader(csv2)
        for drow in dr:
            refcols = tuple(drow[c] for c in csv1_cols)
            if refcols in rowset: yield csv1_cols, refcols, drow
You can call this in a loop and perform whatever formatting you want -- this generator deals with the underlying logic, separating out the formatting task to its caller.
So for example to get your peculiar desired CSV1,CSV2 (H1:C01,H2:23,H4:F(H3:data)) style output you could have...:
def formatit(csv_least, csv_most):
    out_start = '{},{} ('.format(csv_least, csv_most)
    for c1cols, refvals, c2dict in duplines(csv_least, csv_most):
        out_middle = []
        for c, v in zip(c1cols, refvals):
            out_middle.append('{}:{}'.format(c, v))
        out_end = []
        for c in c2dict:
            if c in c1cols: continue
            out_end.append('{}:{}'.format(c, c2dict[c]))
        out = '{}{}({}))'.format(out_start, ','.join(out_middle), ','.join(out_end))
        print(out)
You'll notice that the formatting work is substantially more complex than the actual logic (and hence more likely to hide bugs:-) which is why I call your desired format "peculiar".
But I hope this can at least get you started (and you can try out each function separately, making sure the logic is as you desire it before worrying about the formatting:-).
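To cover "all the CSV files in a folder", the generator can be driven by a loop over every pair of files, putting the file with fewer columns first. The following is a sketch under assumptions: `read_header` and `matches_in_folder` are hypothetical helpers, duplines is repeated in condensed form so the block runs on its own, and the two sample files from the question are written to a temporary folder.

```python
import csv
import glob
import itertools
import os
import tempfile


def read_header(path):
    """Return the first row of a CSV file as a list of column names."""
    with open(path, newline='') as handle:
        return next(csv.reader(handle))


def duplines(csv_least_cols, csv_most_cols):
    """Condensed copy of the answer's generator, repeated for self-containment."""
    with open(csv_least_cols, newline='') as csv1:
        reader = csv.reader(csv1)
        csv1_cols = next(reader)
        rowset = {tuple(row) for row in reader}
    with open(csv_most_cols, newline='') as csv2:
        for drow in csv.DictReader(csv2):
            refcols = tuple(drow[c] for c in csv1_cols)
            if refcols in rowset:
                yield csv1_cols, refcols, drow


def matches_in_folder(folder):
    """Compare every pair of CSV files, narrower header first."""
    paths = sorted(glob.glob(os.path.join(folder, '*.csv')))
    for first, second in itertools.combinations(paths, 2):
        least, most = sorted((first, second), key=lambda p: len(read_header(p)))
        # Skip pairs where the wider file lacks a column of the narrower one.
        if not set(read_header(least)) <= set(read_header(most)):
            continue
        for cols, values, _ in duplines(least, most):
            yield os.path.basename(least), os.path.basename(most), dict(zip(cols, values))


folder = tempfile.mkdtemp()
with open(os.path.join(folder, 'CSV1.csv'), 'w', newline='') as handle:
    handle.write("H1,H2,H4\nC01,23,F\nC2,45,M\n")
with open(os.path.join(folder, 'CSV2.csv'), 'w', newline='') as handle:
    handle.write("H1,H2,H3,H4\nC01,23,data,F\nC01,23,some other data,M\nC4,34,data,M\n")

for hit in matches_in_folder(folder):
    print(hit)
```

The formatting step is left out on purpose here, so that the pairing logic can be checked separately before any output style is applied.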

Optimize python file comparison script

I have written a script which works, but I'm guessing isn't the most efficient. What I need to do is the following:
Compare two csv files that contain user information. It's essentially a member list where one file is a more updated version of the other.
The files contain data such as ID, name, status, etc, etc
Write to a third csv file ONLY the records in the new file that either don't exist in the older file, or contain updated information. For each record, there is a unique ID that allows me to determine if a record is new or previously existed.
Here is the code I have written so far:
import csv
fileAin = open('old.csv','rb')
fOld = csv.reader(fileAin)
fileBin = open('new.csv','rb')
fNew = csv.reader(fileBin)
fileCout = open('NewAndUpdated.csv','wb')
fNewUpdate = csv.writer(fileCout)
old = []
new = []
for row in fOld:
    old.append(row)
for row in fNew:
    new.append(row)
output = []
x = len(new)
i = 0
num = 0
while i < x:
    if new[num] not in old:
        fNewUpdate.writerow(new[num])
    num += 1
    i += 1
fileAin.close()
fileBin.close()
fileCout.close()
In terms of functionality, this script works. However I'm trying to run this on files that contain hundreds of thousands of records and it's taking hours to complete. I am guessing the problem lies with reading both files to lists and treating the entire row of data as a single string for comparison.
My question is, for what I am trying to do is this there a faster, more efficient, way to process the two files to create the third file containing only new and updated records? I don't really have a target time, just mostly wanting to understand if there are better ways in Python to process these files.
Thanks in advance for any help.
UPDATE to include sample row of data:
123456789,34,DOE,JOHN,1764756,1234 MAIN ST.,CITY,STATE,305,1,A
How about something like this? One of the biggest inefficiencies of your code is checking whether new[num] is in old every time because old is a list so you have to iterate through the entire list. Using a dictionary is much much faster.
import csv
fileAin = open('old.csv','rb')
fOld = csv.reader(fileAin)
fileBin = open('new.csv','rb')
fNew = csv.reader(fileBin)
fileCout = open('NewAndUpdated.csv','wb')
fNewUpdate = csv.writer(fileCout)
old = {row[0]:row[1:] for row in fOld}
new = {row[0]:row[1:] for row in fNew}
fileAin.close()
fileBin.close()
output = {}
for row_id in new:
    if row_id not in old or old[row_id] != new[row_id]:
        output[row_id] = new[row_id]
for row_id in output:
    fNewUpdate.writerow([row_id] + output[row_id])
fileCout.close()
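The 'rb'/'wb' modes above are what Python 2's csv module expects; in Python 3 the files would be opened in text mode with newline='' instead. A minimal Python 3 sketch of the same dictionary lookup follows, using made-up in-memory rows whose first field is the unique ID. Streaming the new file instead of loading it into a second dictionary is a design choice of this sketch, not part of the answer.

```python
import csv
import io

# Made-up rows in the spirit of the sample record; the ID is the first field.
old_csv = "1,DOE,JOHN,A\n2,ROE,JANE,A\n"
new_csv = "1,DOE,JOHN,A\n2,ROE,JANE,B\n3,POE,ALAN,A\n"

# Index the old file by ID so each lookup is a dictionary access.
old = {row[0]: row[1:] for row in csv.reader(io.StringIO(old_csv))}

changed = [
    row
    for row in csv.reader(io.StringIO(new_csv))
    if old.get(row[0]) != row[1:]  # new ID, or same ID with different data
]

out = io.StringIO()
csv.writer(out).writerows(changed)
print(out.getvalue())
```

Because the new rows are processed in file order, the output keeps the order of the new file.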
difflib is quite efficient: http://docs.python.org/library/difflib.html
Sort the data by your unique field(s), and then use a comparison process analogous to the merge step of merge sort:
http://en.wikipedia.org/wiki/Merge_sort
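The merge-style comparison can be sketched as a single pass over two lists that are already sorted by ID. The function name `new_or_updated` and the sample rows are made up for illustration, and the sketch assumes the IDs compare consistently as strings (for example, zero-padded to the same length).

```python
def new_or_updated(old_rows, new_rows):
    """Yield rows of new_rows that are absent from, or differ from, old_rows.

    Both inputs must already be sorted by their first field (the unique ID).
    """
    old_iter = iter(old_rows)
    old_row = next(old_iter, None)
    for new_row in new_rows:
        # Advance the old side until it catches up with the current new ID.
        while old_row is not None and old_row[0] < new_row[0]:
            old_row = next(old_iter, None)
        # Different ID (new record) or same ID with different fields (update).
        if old_row != new_row:
            yield new_row


old_rows = [["001", "DOE", "A"], ["002", "ROE", "A"], ["004", "LEE", "A"]]
new_rows = [["001", "DOE", "A"], ["002", "ROE", "B"],
            ["003", "POE", "A"], ["004", "LEE", "A"]]
print(list(new_or_updated(old_rows, new_rows)))
```

Compared with the dictionary approach, this needs sorted input but never holds more than one row of each file in memory when fed from csv readers.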
