Save row_slice information in workbook xlrd python - python

I am trying to extract the header row (the first row) from multiple files, each of which has multiple sheets. The output of each sheet should be saved and appended to a new master file that contains all the headers from each sheet and each file.
The easiest way I have found is to use the command row_slice. However, the output from the file is a list of Cell objects and I cannot seem to access their indices.
I am looking for a way to save the data extracted into a new workbook.
Here is the code I have so far:
from xlrd import open_workbook,cellname
book = open_workbook('E:\Files_combine\MOU worksheets 2012\Walmart-GE_MOU 2012-209_worksheet_v03.xls')
last_index = len(book.sheet_names())
for sheet_index in range(last_index):
    sheet = book.sheet_by_index(sheet_index)
    print sheet.name
    print sheet.row_slice(0,1)
I cannot get the output and store it as an input to a new file. Also, any ideas on how to automate this process for 100+ files will be appreciated.

You can store the output in a CSV file, and use os.listdir with a for loop to go over all the file names:
import csv
import os
from xlrd import open_workbook
EXCEL_DIR = r'E:\Files_combine\MOU worksheets 2012'  # raw string keeps the backslashes literal
with open("headers.csv", 'w', newline='') as csv_file:
    writer = csv.writer(csv_file)
    for file_name in os.listdir(EXCEL_DIR):
        if file_name.endswith(".xls"):
            book = open_workbook(os.path.join(EXCEL_DIR, file_name))
            for sheet in book.sheets():
                if sheet.nrows == 0:
                    continue  # skip empty sheets, which have no header row
                # writerow takes a sequence; row_values returns the plain cell
                # values, whereas row_slice returns Cell objects
                writer.writerow(sheet.row_values(0))
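Since the goal is one master file covering 100+ workbooks, it may help to record which file and sheet each header row came from. The following is only a sketch of that idea: `iter_header_rows` and `write_master` are hypothetical helper names, and it assumes the workbooks are .xls files that xlrd can open.

```python
import csv
import os


def iter_header_rows(excel_dir):
    """Yield (file name, sheet name, header values) for every sheet of every .xls file."""
    from xlrd import open_workbook  # third-party package, imported lazily

    for file_name in sorted(os.listdir(excel_dir)):
        if not file_name.lower().endswith(".xls"):
            continue
        book = open_workbook(os.path.join(excel_dir, file_name))
        for sheet in book.sheets():
            if sheet.nrows:  # an empty sheet has no header row
                yield file_name, sheet.name, sheet.row_values(0)


def write_master(rows, out_path):
    """Write one CSV line per sheet: file name, sheet name, then the header cells."""
    with open(out_path, "w", newline="") as csv_file:
        writer = csv.writer(csv_file)
        for file_name, sheet_name, header in rows:
            writer.writerow([file_name, sheet_name, *header])
```

Usage would be something like `write_master(iter_header_rows(EXCEL_DIR), "headers.csv")`. Keeping the reading and writing steps separate means the CSV part can be tried out with hand-made rows before pointing it at the real folder.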

Related

csv module not writing new line

I am working on a script for reading specific cells from an Excel workbook into a list, and then from the list into a CSV. There's a loop to get workbooks open from a folder as well.
My code:
import csv
import openpyxl
import os
path = r'C:\Users.....' # Folder holding workbooks
workbooks = os.listdir(path)
cell_values = [] # List for storing cell values from worksheets
for workbook in workbooks: # Workbook iteration
    wb = openpyxl.load_workbook(os.path.join(path, workbook), data_only=True) # Open workbook
    sheet = wb.active # Get sheet
    f = open('../record.csv', 'w', newline='') # Open the CSV file
    cell_list = ["I9", "AK6", "N35"] # List of cells to check
    with f: # CSV writer loop
        record_writer = csv.writer(f) # Open CSV writer
        for cells in cell_list: # Loop through cell list to get cell values and write them to the cell_values list
            cell_values.append(sheet[cells].value) # Append cell values to the cell_values list
        record_writer.writerow(cell_values) # Write cell_values list to CSV
quit() # Terminate program after all workbooks in the folder have been analyzed
The output just puts all values on the same line, albeit separated by commas, but it doesn't help me when I go to open my results in Excel if everything is on the same line. When I was using xlrd, the format was vertical but all I had to do was transpose the dataset to be good. But I had to change from xlrd (which was a smart move in general) because it would not read merged cells.
I get this:
4083940,140-21-541,NP,8847060,140-21-736,NP
When I want this
4083940,140-21-541,NP
8847060,140-21-736,NP
Edit - I forgot the "what have I tried" portion of my post. I have tried changing my loops around to avoid overwriting the previous write to the CSV. I have tried clearing the list on each loop to get the script to treat each new entry as a new line. I have tried adding \n in the writer line as I saw in a couple of posts. I have tried to use writerows instead of writerow. I tried A instead of W even though it is a fix and not a solution but that didn't quite work right either.
Your main problem is that cell_values is accumulating the cells from multiple sheets. You need to reset it, like, cell_values = [], for every sheet.
I went back to your original example and:
moved the opening of record.csv up, and placed all the work inside the scope of that file being open and written into
moved cell_values = [] inside your workbook loop
moved cell_list = ["I9", "AK6", "N35"] to the top, because that's really scoped for the entire script, if every workbook has the same cells
removed quit(), it's not necessary at the very end of the script, and in general should probably be avoided: Python exit commands - why so many and when should each be used?
import csv
import openpyxl
import os
path = r'C:\Users.....' # Folder holding workbooks
workbooks = os.listdir(path)
cell_list = ["I9", "AK6", "N35"] # List of cells to check
with open('record.csv', 'w', newline='') as f:
    record_writer = csv.writer(f)
    for workbook in workbooks:
        wb = openpyxl.load_workbook(os.path.join(path, workbook), data_only=True)
        sheet = wb.active
        cell_values = [] # reset for every sheet
        for cells in cell_list:
            cell_values.append(sheet[cells].value)
        # Write one row per sheet
        record_writer.writerow(cell_values)
Also, I can see you're new to the CSV module and struggling a little conceptually (since you tried writerow, then writerows, while debugging your code). Python's official documentation for CSV doesn't really give practical examples of how to use it. Try reading up here, Writing to a CSV.
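To make the writerow / writerows difference concrete, here is a small standard-library sketch. It writes to in-memory buffers instead of files, and the sample values are the ones from the question.

```python
import csv
import io

rows = [
    ["4083940", "140-21-541", "NP"],
    ["8847060", "140-21-736", "NP"],
]

# writerow writes a single record: one call produces one line.
one_by_one = io.StringIO()
writer = csv.writer(one_by_one, lineterminator="\n")
for row in rows:
    writer.writerow(row)

# writerows takes an iterable of records and writes one line per record.
all_at_once = io.StringIO()
csv.writer(all_at_once, lineterminator="\n").writerows(rows)

# Passing one long flat list to writerow still yields a single line,
# which is what happened while cell_values kept accumulating.
flat = io.StringIO()
csv.writer(flat, lineterminator="\n").writerow(rows[0] + rows[1])
```

The first two buffers end up with identical two-line content; the third holds everything on one line, just like the unwanted output in the question.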

Why did my code (that was supposed to put in column headers) wipe the whole Excel sheet blank with no headers?

I wrote this simple program for writing column headers to empty cells above a data table in a pre-existing Excel .xlsx file. I don't get any errors when I run this, but when I open the file (with a single sheet), the whole table is gone and it's blank; there's not even any of the headers that it was supposed to write in. Can anyone please help me figure out why this happened? I can get the data again, I just need this to work.
import pandas as pd
from openpyxl import load_workbook
headers = []
# code not shown, but just prompts user for column headers and saves in list 'headers'
#open .xlsx file
book = load_workbook(r'path.xlsx')
writer = pd.ExcelWriter(r'path.xlsx', engine='xlsxwriter')
#write column headers from list into empty cells
writer.columns = headers[:]
#save and close
writer.save()
writer.close()
book.close()
You can try out this code
import pandas as pd
# Read excel file (.xlsx)
book_df = pd.read_excel(r'path.xlsx')
# headers is list of header names like ['header_1','header_2','header_3']
book_df.columns = headers
book_df.to_excel(r'modified_file_name.xlsx',index=False)
# In case you want the file in the same name , make sure the file is not open else you may get permission error
book_df.to_excel(r'path.xlsx',index=False)
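As for why the original script blanked the file: as far as I can tell, pd.ExcelWriter(r'path.xlsx', engine='xlsxwriter') starts a brand-new workbook at that path (XlsxWriter cannot edit existing files), and assigning writer.columns only sets an attribute on the writer object, so saving replaced the file with an empty workbook. If you would rather edit the sheet in place, a minimal sketch with openpyxl could look like this. It assumes row 1 holds the empty header cells; `write_headers` and `add_headers_in_place` are hypothetical helper names.

```python
def write_headers(ws, headers, row=1):
    """Write each header into consecutive cells of `row`, starting at column A."""
    for col, header in enumerate(headers, start=1):
        ws.cell(row=row, column=col, value=header)


def add_headers_in_place(path, headers):
    """Open an existing .xlsx, fill in its header row, and save it back."""
    from openpyxl import load_workbook  # third-party package, imported lazily

    book = load_workbook(path)
    write_headers(book.active, headers)
    book.save(path)
```

It would be called as `add_headers_in_place(r'path.xlsx', headers)`. Work on a copy of the file first, since saving overwrites the original.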

How python XLRD library scans all sheets in excel?

I am planning to use XLRD libraries for reading the number of rows and columns in the excel file that I imported.
I use following codes which work perfectly fine.
import xlrd
path = 'sample123.xlsx'
inputWorkbook = xlrd.open_workbook(path)
inputWorksheet = inputWorkbook.sheet_by_index(0)
print("Your worksheet has: " + str(inputWorksheet.nrows) + " rows")
print("Your worksheet has: " + str(inputWorksheet.ncols) + " columns")
However, that code only runs for one sheet (the first one). If I import a number of Excel files without knowing how many sheets each file has or what they are called, is there a way to scan through all the sheets in each file so that the number of rows and columns can be reported for every sheet?
Thanks very much for your assistance.
However, that code only runs for one sheet (the first one).
That is because you pass index 0 when calling sheet_by_index.
You call the method like this:
inputWorkbook.sheet_by_index(index)
where index is the index of the sheet. If you don't know it, you can find it by name:
inputWorkbook.sheet_names().index(nameOfMySheet)
See the xlrd documentation.
Here is an example of how to get the sheets in a workbook:
import xlrd
book = xlrd.open_workbook("sample.xls")
for sheet in book.sheets():
    print(sheet.name)
To read all sheets from one Excel file using xlrd:
import xlrd
path = 'sample123.xlsx'
inputWorkbook = xlrd.open_workbook(path)
dict_sheet_tabs = {}  # Store sheets in a dictionary
for sheet_name in inputWorkbook.sheet_names():
    print(sheet_name)  # name of each tab
    all_sheet = inputWorkbook.sheet_by_name(sheet_name)  # read sheet by name
    dict_sheet_tabs[sheet_name] = all_sheet
print(dict_sheet_tabs)
# Output: {'sheet_name1': <xlrd.sheet.Sheet object at 0x7fa903b6efd0>, 'sheet_name2': <xlrd.sheet.Sheet object at 0x7fa9038ece10>}
# The dictionary keys are sheet names and the values are the Sheet objects
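Putting the answers together for the original question (row and column counts for every sheet), a short sketch could look like the following. `sheet_dimensions` and `report` are hypothetical helper names. Also note that xlrd 2.0 and later only reads .xls files, so a path like 'sample123.xlsx' would need an older xlrd or a different library.

```python
def sheet_dimensions(book):
    """Return {sheet name: (rows, columns)} for every sheet in an xlrd Book."""
    return {sheet.name: (sheet.nrows, sheet.ncols) for sheet in book.sheets()}


def report(path):
    """Print the row and column counts of every sheet in the file at `path`."""
    import xlrd  # third-party package, imported lazily

    book = xlrd.open_workbook(path)
    for name, (nrows, ncols) in sheet_dimensions(book).items():
        print(f"{name}: {nrows} rows, {ncols} columns")
```

Because `book.sheets()` returns every sheet, there is no need to know the sheet count or the sheet names in advance.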

Copy .csv files into a .xlsx workbook using python

I have done extensive research, but have not been able to come up with the right solution for my needs.
I have a Python script which generates multiple .csv files.
Each .csv file only has data in columns A and B.
Each .csv file has a unique name, and I'm trying to work out how to copy each .csv file, based on its name, into a specific tab/sheet of the same name in an existing Excel workbook.
The .csv files will always be in the same folder.
I would ideally like to use python for this task.
You can try something like this
import os
import glob
import csv
from xlsxwriter.workbook import Workbook
# Note: XlsxWriter always creates a new file, so this overwrites 'Existing.xlsx'
workbook = Workbook('Existing.xlsx')
for csvfile in glob.glob(os.path.join('.', '*.csv')):
    # worksheet named after the csv file (base name, without directory or extension)
    sheet_name = os.path.splitext(os.path.basename(csvfile))[0]
    worksheet = workbook.add_worksheet(sheet_name)
    with open(csvfile, newline='') as f:
        reader = csv.reader(f)
        for r, row in enumerate(reader):
            for c, col in enumerate(row):
                worksheet.write(r, c, col) # write the csv file content into it
workbook.close()
Reading the data with the csv module and writing the .xlsx file with XlsxWriter (https://pypi.python.org/pypi/XlsxWriter) is recommended.
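Since the question asks about an existing workbook, and XlsxWriter can only create new files, here is a hedged alternative sketch using openpyxl, which can open a workbook, modify it, and save it again. The helper names are hypothetical, and it assumes each CSV should be appended to the sheet that shares its base name (the sheet is created if it is missing).

```python
import csv
import os


def sheet_name_for(csv_path):
    """Derive a sheet name from the CSV file name (Excel caps names at 31 characters)."""
    return os.path.splitext(os.path.basename(csv_path))[0][:31]


def copy_csv_into_sheet(csv_path, ws):
    """Append every CSV record to the worksheet as a new row; return the row count."""
    count = 0
    with open(csv_path, newline="") as f:
        for record in csv.reader(f):
            ws.append(record)
            count += 1
    return count


def load_csvs_into_workbook(csv_paths, workbook_path):
    """Copy each CSV into the sheet of the same name, creating the sheet if needed."""
    from openpyxl import load_workbook  # third-party package, imported lazily

    book = load_workbook(workbook_path)
    for csv_path in csv_paths:
        name = sheet_name_for(csv_path)
        ws = book[name] if name in book.sheetnames else book.create_sheet(name)
        copy_csv_into_sheet(csv_path, ws)
    book.save(workbook_path)
```

A call such as `load_csvs_into_workbook(glob.glob('*.csv'), 'Existing.xlsx')` would then do the copying. Two things to keep in mind: `ws.append` adds rows below any data already in the sheet, and the csv module yields strings, so numeric columns would need converting before they are appended if Excel should treat them as numbers.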

Converting a folder of Excel files into CSV files/Merge Excel Workbooks

I have a folder with a large number of Excel workbooks. Is there a way to convert every file in this folder into a CSV file using Python's xlrd, xlutils, and XlsxWriter?
I would like the newly converted CSV files to have the extension '_convert.csv'.
OTHERWISE...
Is there a way to merge all the Excel workbooks in the folder to create one large file?
I've been searching for ways to do both, but nothing has worked...
Using pywin32, this will find all the .xlsx files in the indicated directory and open and resave them as .csv. It is relatively easy to figure out the right commands with pywin32...just record an Excel macro and perform the open/save manually, then look at the resulting macro.
import os
import glob
import win32com.client
xl = win32com.client.gencache.EnsureDispatch('Excel.Application')
for f in glob.glob('tmp/*.xlsx'):
    fullname = os.path.abspath(f)
    xl.Workbooks.Open(fullname)
    xl.ActiveWorkbook.SaveAs(Filename=fullname.replace('.xlsx','.csv'),
                             FileFormat=win32com.client.constants.xlCSVMSDOS,
                             CreateBackup=False)
    xl.ActiveWorkbook.Close(SaveChanges=False)
I will give a try with my library pyexcel:
from pyexcel import Book, BookWriter
import glob
import os
for f in glob.glob("your_directory/*.xlsx"):
    fullname = os.path.abspath(f)
    converted_filename = fullname.replace(".xlsx", "_converted.csv")
    book = Book(f)
    converted_csvs = BookWriter(converted_filename)
    converted_csvs.write_book_reader(book)
    converted_csvs.close()
If you have a xlsx that has more than 2 sheets, I imagine you will have more than 2 csv files generated. The naming convention is: "file_converted_%s.csv" % your_sheet_name. The script will save all converted csv files in the same directory where you had xlsx files.
In addition, if you want to merge all in one, it is super easy as well.
from pyexcel.cookbook import merge_all_to_a_book
import glob
merge_all_to_a_book(glob.glob("your_directory/*.xlsx"), "output.xlsx")
If you want to do more, please read the tutorial
Look at OpenOffice's Python library; I suspect OpenOffice supports MS document files.
Python has no native support for Excel files.
Sure. Iterate over your files using something like glob and feed them into one of the modules you mention. With xlrd, you'd use open_workbook to open each file by name. That will give you back a Book object. You'll then want to have nested loops that iterate over each Sheet object in the Book, each row in the Sheet, and each Cell in the Row. If your rows aren't too wide, you can append each Cell in a Row into a Python list and then feed that list to the writerow method of a csv.writer object.
Since it's a high-level question, this answer glosses over some specifics like how to call xlrd.open_workbook and how to create a csv.writer. Hopefully googling for examples on those specific points will get you where you need to go.
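To fill in some of those specifics, here is a minimal sketch of the loops described above. It is one possible shape, not the only one: `sheet_to_csv` and `convert_workbook` are hypothetical names, each sheet goes to its own file ending in '_convert.csv' as the question asks, and sheet names are assumed to be usable in file names.

```python
import csv
import os


def sheet_to_csv(sheet, out_path):
    """Write every row of an xlrd Sheet to out_path, one CSV line per row."""
    with open(out_path, "w", newline="") as f:
        writer = csv.writer(f)
        for rowx in range(sheet.nrows):
            writer.writerow(sheet.row_values(rowx))


def convert_workbook(path):
    """Write '<name>_<sheet>_convert.csv' next to the workbook for each sheet."""
    import xlrd  # third-party package, imported lazily

    base = os.path.splitext(path)[0]
    for sheet in xlrd.open_workbook(path).sheets():
        sheet_to_csv(sheet, f"{base}_{sheet.name}_convert.csv")
```

Looping `convert_workbook` over `glob.glob('your_directory/*.xls')` would then handle the whole folder. Using `row_values` sidesteps the innermost per-cell loop, since it already returns the row as a plain list.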
You can use this function to read the data from each file
import xlrd
def getXLData(Filename, min_row_len=1, get_datemode=False, sheetnum=0):
    Data = []
    book = xlrd.open_workbook(Filename)
    sheet = book.sheets()[sheetnum]
    rowcount = 0
    while rowcount < sheet.nrows:
        row = sheet.row_values(rowcount)
        if len(row)>=min_row_len: Data.append(row)
        rowcount+=1
    if get_datemode: return Data, book.datemode
    else: return Data
and this function to write the data after you combine the lists together
import csv
def writeCSVFile(filename, data, headers=None):
    # newline='' lets the csv module control line endings (Python 3)
    with open(filename, "w", newline="") as f:
        writer = csv.writer(f)
        if headers:
            writer.writerow(headers)  # header row first, when given
        writer.writerows(data)
Keep in mind you may have to re-format the data, especially if there are dates or integers in the Excel files since they're stored as floating point numbers.
Edited to add code calling the above functions:
import glob
filelist = glob.glob("*.xls*")
alldata = []
headers = []
for filename in filelist:
    data = getXLData(filename)
    headers = data.pop(0) # omit this line if files do not have a header row
    alldata.extend(data)
writeCSVFile("Output.csv", alldata, headers)
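On the re-formatting point mentioned earlier: xlrd hands back numbers as floats, so an ID such as 2012 arrives as 2012.0. A small sketch of one way to tidy that before writing follows. The helper names are hypothetical, and treating every whole-number float as an integer is an assumption that may not suit every column.

```python
def tidy_number(value):
    """Turn floats with no fractional part (e.g. 2012.0) into ints; leave the rest alone."""
    if isinstance(value, float) and value.is_integer():
        return int(value)
    return value


def tidy_rows(rows):
    """Apply tidy_number to every cell of every row."""
    return [[tidy_number(cell) for cell in row] for row in rows]


def excel_date(value, datemode):
    """Convert an Excel serial date to a datetime, using the workbook's datemode."""
    import xlrd  # third-party package, imported lazily

    return xlrd.xldate_as_datetime(value, datemode)
```

For example, `alldata = tidy_rows(alldata)` could run just before the call to writeCSVFile. Date cells need the workbook's datemode, which getXLData returns when get_datemode=True, and you have to know which columns hold dates, because a date cell's value looks like any other float.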
