I'm relatively new to Python, so please excuse the newbie question, but Google isn't helping this time.
I have 100 very large xlsx files from which I need to extract the first row (specifically cell A2). I found this gem of a tool called openpyxl, which will iterate through my data files without loading everything into memory. It uses a generator to get the relevant row on each call.
The thing that I can't figure out is how to initialize a generator outside of a loop. Right now my code is:
from openpyxl import load_workbook
wb = load_workbook(filename = "merged01.xlsx", use_iterators= True)
sheetName = wb.get_sheet_names()
ws = wb.get_sheet_by_name(name = sheetName[0])
row = ws.iter_rows() #row is a generator
for cell in row:
    break
print (cell[1].internal_value) # A2
But there has to be a better way of doing this such as:
...
row = ws.iter_rows() #row is a generator
cell = row.first # line I'm trying to KISS
print (cell[1].internal_value) # A2
cell = next(row)
The next function retrieves the next value from any iterator.
You're looking for next().
cell = next(row)
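As a minimal sketch of how next() behaves, the snippet below uses a plain Python generator as a stand-in for ws.iter_rows(). The name fake_rows and the sample tuples are made up for illustration and are not part of openpyxl. Passing a second argument to next() returns that default instead of raising StopIteration once the generator is exhausted.

```python
def fake_rows():
    # Hypothetical stand-in for ws.iter_rows(): yields one tuple per row.
    yield ("A1", "B1", "C1")
    yield ("A2", "B2", "C2")


rows = fake_rows()
first_row = next(rows)        # pulls only the first item, no loop needed
second_row = next(rows)       # each further call advances the generator
exhausted = next(rows, None)  # default is returned once nothing is left

print(first_row[1])
```

With a real worksheet the same call would simply be next(ws.iter_rows()).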
Related
I'm a Python beginner and I made a script to extract data into an xlsx file with openpyxl, but I'm stuck on a problem which seems pretty easy. I'd like to copy (not move) the yellow data to the green cells in the following Excel file:
Or, put another way, I want to copy B2:B15 to B16:B29 within my Python script. I don't need help with importing openpyxl or creating my ws; it's just the specific code for copying B2:B15 to B16:B29 that I don't get.
I appreciate any help! Thank you so much.
I tried the following, which didn't work at all:
for row in range(16,29):
    for col in range(1,2):
        char = get_column_letter(col)
        ws[char + str(row)] = ws(['B2:B15'].value)
If ws is your worksheet, then the code to do that is...
for row in range(16,30):
    ws.cell(row=row, column=2).value = ws.cell(row=row-14, column=2).value
Updated below for doing this multiple times
Repeat = 5 # Indicate how many times you want to paste the 14 rows (B2:B15)
for cycle in range(1, Repeat+1):
    for row in range(14):
        ws.cell(row=row+2+(cycle)*14, column=2).value = ws.cell(row=row+2, column=2).value
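To check the row arithmetic in the loop above without opening a workbook, here is a small sketch that only computes which source row lands on which destination row. The function name copy_plan and its parameters are hypothetical helpers introduced for illustration.

```python
def copy_plan(first_row, n_rows, repeat):
    """Return (source_row, destination_row) pairs for pasting a block
    of n_rows rows `repeat` times directly below itself."""
    plan = []
    for cycle in range(1, repeat + 1):
        for offset in range(n_rows):
            src = first_row + offset
            plan.append((src, src + cycle * n_rows))
    return plan


# B2:B15 is 14 rows; paste it twice below itself.
plan = copy_plan(first_row=2, n_rows=14, repeat=2)
print(plan[0], plan[13], plan[14], plan[-1])
```

The first copy covers rows 16 to 29 and the second starts at row 30, so the blocks never overlap.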
I have 6 work sheets in my workbook. I want to copy data (all used cells except the header) from 5 worksheets and paste them into the 1st. Snippet of code that applies:
excel = win32.gencache.EnsureDispatch('Excel.Application')
wb = excel.Workbooks.Open(mergedXL)
wsSIR = wb.Sheets(1)
sheetList = wb.Sheets
for ws in sheetList:
    used = ws.UsedRange
    if ws.Name != "1st sheet":
        print ("Copying cells from "+ws.Name)
        used.Copy()
used.Copy() will copy ALL used cells, however I don't want the first row from any of the worksheets. I want to be able to copy from each sheet and paste it into the first blank row in the 1st sheet. So when cells from the first sheet (that is NOT the sheet I want to copy to) are pasted in the 1st sheet, they will be pasted starting in A3. Every subsequent paste needs to happen in the first available blank row. I probably haven't done a great job of explaining this, but would love some help. Haven't worked with win32com a ton.
I also have this code from one of my old scripts, but I don't understand exactly how it's copying stuff and how I can modify it to work for me this time around:
ws.Range(ws.Cells(1,1),ws.Cells(ws.UsedRange.Rows.Count,ws.UsedRange.Columns.Count)).Copy()
wsNew.Paste(wsNew.Cells(wsNew.UsedRange.Rows.Count,1))
If I understand your problem correctly, I think this code will do the job:
import win32com.client
# Create an instance of Excel
excel = win32com.client.gencache.EnsureDispatch('Excel.Application')
# Open the workbook (raw string, so the backslash is not read as an escape sequence)
file_name = r'path_to_your\file.xlsx'
wb = excel.Workbooks.Open(file_name)
# Select the first sheet, into which the data from the other sheets will be pasted
ws_paste = wb.Sheets('Sheet1')
# Loop over all the sheets
for ws in wb.Sheets:
    if ws.Name != 'Sheet1': # Not the first sheet
        used_range = ws.UsedRange.SpecialCells(11) # 11 = xlCellTypeLastCell from VBA Range.SpecialCells Method
        # used_range.Row and used_range.Column give the row and column of the last used cell
        # Copy the Range from the cell A2 to the last row/col
        ws.Range("A2", ws.Cells(used_range.Row, used_range.Column)).Copy()
        # Get the last row used in your first sheet
        # NOTE: +1 to go to the next line so the pasted data does not overlap
        row_copy = ws_paste.UsedRange.SpecialCells(11).Row + 1
        # Paste on the first sheet starting at the first empty row, column A (1)
        ws_paste.Paste(ws_paste.Cells(row_copy, 1))
# Save and close the workbook
wb.Save()
wb.Close()
# Quit the Excel instance
excel.Quit()
I hope it helps you to understand your old code as well.
Have you considered using pandas?
import pandas as pd
# create a list of pandas DataFrames, one per sheet (data starts at E6)
dfs=[pd.read_excel("source.xlsx",sheet_name=n,skiprows=5,usecols="E:J") for n in range(0,4)]
# concatenate the dataframes
df=pd.concat(dfs)
# write the dataframe to another spreadsheet
# (the with block saves the file on exit; writer.save() was removed in pandas 2.0)
with pd.ExcelWriter('merged.xlsx') as writer:
    df.to_excel(writer, sheet_name='Sheet1')
I tried to follow this question to add a formula to my Excel file using Python and the openpyxl package.
That link is what I need for my task.
But in this code:
for i, cellObj in enumerate(Sheet.columns[2], 1):
    cellObj.value = '=IF($A${0}=$B${0}, "Match", "Mismatch")'.format(i)
I get an error at Sheet.columns[2]. Any idea why? I followed the complete code.
I have Python 2.7.13, if that helps with this error.
****UPDATE****
COMPLETE CODE :
import openpyxl
wb = openpyxl.load_workbook('test1.xlsx')
print wb.get_sheet_names()
Sheet = wb.worksheets[0]
for i, cellObj in enumerate(Sheet.columns[2], 1):
    cellObj.value = '=IF($A${0}=$B${0}, "Match", "Mismatch")'.format(i)
error message :
    for i, cellObj in enumerate(Sheet.columns[2], 1):
TypeError: 'generator' object has no attribute '__getitem__'
ws.columns and ws.rows are properties that return generators, and a generator cannot be indexed with [2]. But openpyxl also supports slicing and indexing for rows and columns directly on the worksheet.
So, ws['C'] will give you the cells in the third column (as a tuple).
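The error itself is plain Python behaviour and can be reproduced without openpyxl. In the sketch below, fake_columns is a made-up generator standing in for ws.columns; it shows the failure and two generic ways to get the third item from any generator.

```python
from itertools import islice


def fake_columns():
    # Hypothetical stand-in for ws.columns: yields one tuple of cells per column.
    yield ("A1", "A2")
    yield ("B1", "B2")
    yield ("C1", "C2")


try:
    fake_columns()[2]  # generators do not support indexing
    error_name = None
except TypeError as exc:
    error_name = type(exc).__name__

third_a = list(fake_columns())[2]             # materialize everything, then index
third_b = next(islice(fake_columns(), 2, 3))  # skip two items lazily, take the third

print(error_name, third_a, third_b)
```

For a worksheet, indexing by column letter as described above is the more direct route; the generic approaches are mainly useful for other generators.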
For other Stack adventurers looking to copy/paste a formula:
# Writing from pandas back to an existing EXCEL workbook
wb = load_workbook(filename=myfilename, read_only=False, keep_vba=True)
ws = wb['Mysheetname']
# Paste a formula Vlookup! Look at column A, put result in column AC.
for i, cellObj in enumerate(ws['AC'], 1):
    cellObj.value = "=VLOOKUP($A${0}, 'LibrarySheet'!C:D,2,FALSE)".format(i)
One issue, I have a header and the formula overwrites it. Anyone know how to start from row 2?
If you want to start from another row you can either use an if statement to skip the first row, or specify the range in the enumeration. A coded example is below:
wb = load_workbook(filename=myfilename, read_only=False, keep_vba=True)
ws = wb['Mysheetname']
# using an if statement
for i, cellObj in enumerate(ws['AC'], 1):
    if i > 1:
        cellObj.value = "=VLOOKUP($A${0}, 'LibrarySheet'!C:D,2,FALSE)".format(i)
# specifying range, up to max row on worksheet - or you can specify an exact range
for i, cellObj in enumerate(ws['AC2:AC'+str(ws.max_row)],2):
    cellObj[0].value = "=VLOOKUP($A${0}, 'LibrarySheet'!C:D,2,FALSE)".format(i)
The second method requires you to begin the index at 2, and the range yields a tuple for each row rather than a cell object, so you need to write cellObj[0].value to reach the cell and set its value.
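To see what the formula strings look like when the numbering starts at row 2, here is a small sketch that builds them without touching a workbook. The helper name build_formulas is hypothetical; the formula template is the one from the code above.

```python
def build_formulas(first_row, last_row):
    """Map each row number to the VLOOKUP formula text for that row."""
    template = "=VLOOKUP($A${0}, 'LibrarySheet'!C:D,2,FALSE)"
    return {row: template.format(row) for row in range(first_row, last_row + 1)}


# Start at row 2 so the header in row 1 is left alone.
formulas = build_formulas(2, 4)
for row, text in formulas.items():
    print(row, text)
```

Each value could then be assigned to the matching cell, for example the one in column AC of that row.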
Fortunately, it is now easy to write formulas into specific cells. There are also simpler attributes to use, such as:
wb.sheetnames instead of wb.get_sheet_names()
sheet = wb['SHEET_NAME'] instead of sheet = wb.get_sheet_by_name('SHEET_NAME')
And formulas can be easily inserted with:
sheet['A1'] = '=SUM(1+1)'
I need to search an Excel sheet for cells containing some pattern. It takes more time than I can handle. The most optimized code I could write is below. Since the data patterns usually appear row after row, I use iter_rows(row_offset=x). Unfortunately, the code below takes longer and longer to find the given pattern on each pass of the for loop (starting from milliseconds and getting up to almost a minute). What am I doing wrong?
import openpyxl
import datetime
from openpyxl import Workbook
wb = Workbook()
ws = wb.active
ws.title = "test_sheet"
print("Generating quite big excel file")
for i in range(1,10000):
    for j in range(1,20):
        ws.cell(row = i, column = j).value = "Cell[{},{}]".format(i,j)
print("Saving test excel file")
wb.save('test.xlsx')
def FindXlCell(search_str, last_r):
    t = datetime.datetime.utcnow()
    for row in ws.iter_rows(row_offset=last_r):
        for cell in row:
            if (search_str == cell.value):
                print(search_str, last_r, cell.row, datetime.datetime.utcnow() - t)
                last_r = cell.row
                return last_r
    print("record not found ",search_str, datetime.datetime.utcnow() - t)
    return 1
wb = openpyxl.load_workbook("test.xlsx", data_only=True)
t = datetime.datetime.utcnow()
ws = wb["test_sheet"]
last_row = 1
print("Parsing excel file in a loop for 3 cells")
for i in range(1,100,1):
    last_row = FindXlCell("Cell[0,0]", last_row)
    last_row = FindXlCell("Cell[1000,6]", last_row)
    last_row = FindXlCell("Cell[6000,6]", last_row)
Looping over a worksheet multiple times is inefficient. The reason for the search getting progressively slower looks to be increasingly more memory being used in each loop. This is because last_row = FindXlCell("Cell[0,0]", last_row) means that the next search will create new cells at the end of the rows: openpyxl creates cells on demand because rows can be technically empty but cells in them are still addressable. At the end of your script the worksheet has a total of 598000 rows but you always start searching from A1.
If you wish to search a large file for text multiple times then it would probably make sense to create a matrix keyed by the text with the coordinates being the value.
Something like:
matrix = {}
for row in ws:
    for cell in row:
        matrix[cell.value] = (cell.row, cell.col_idx)
In a real-world example you'd probably want to use a defaultdict to be able to handle multiple cells with the same text.
This could be combined with read-only mode for a minimal memory footprint. Except, of course, if you want to edit the file.
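The defaultdict idea mentioned above can be sketched as follows. To keep the example self-contained it uses plain nested lists as a stand-in for the worksheet rows, and the names build_index and grid are made up for illustration; with openpyxl you would loop over the cells and use their row and column attributes instead.

```python
from collections import defaultdict


def build_index(rows):
    """Map each cell text to a list of 1-based (row, column) coordinates."""
    index = defaultdict(list)
    for r, row in enumerate(rows, start=1):
        for c, value in enumerate(row, start=1):
            index[value].append((r, c))
    return index


# Stand-in data: "y" appears twice, so it keeps both coordinates.
grid = [["x", "y"], ["y", "z"]]
index = build_index(grid)

print(index["y"])
print(index.get("missing", []))  # .get avoids creating an empty entry
```

The sheet is scanned once, and every later search is a dictionary lookup rather than another pass over all rows.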
I'm trying to copy all cells on a sheet to a new workbook. I can store cell values manually, as in the example code below, and paste the variables into the respective cells, but I want to automate the collection of cell data. I am very new to Python, but I can conceptually see something along the lines of this. I could use some help to finish it, thanks!
attempt to automate cell collection
def cell(r,c):
    set r+=1
    cellname = c.isalpha() + r
    if r <= sheet.nrow:
        cellname = (r,c,sheet.cell_value)
...... I get lost around here, but I assume there should be a sheet.ncols and nrows
current manual cell copying
cellA1 = sheet.cell_value(0,0)
cellA2 = sheet.cell_value(1,0)
cellA3 = sheet.cell_value(2,0)
cellA4 = sheet.cell_value(3,0)
cellA5 = sheet.cell_value(4,0)
cellB1 = sheet.cell_value(0,1)
cellB2 = sheet.cell_value(1,1)
workbook = xlwt.Workbook()
sheet = workbook.add_sheet('ITEM DETAILS')
manual cell pasting
sheet.write(0, 0, cellA1)
sheet.write(1, 0, cellA2)
You can just simply loop through the cells in the sheet, by using sheet.nrows and sheet.ncols as the limit to loop up to. Also, make sure you do not define the new worksheet you are creating as sheet itself, use a new name. Example:
newworkbook = xlwt.Workbook()
newsheet = newworkbook.add_sheet('ITEM DETAILS')
for r in range(sheet.nrows):
    for c in range(sheet.ncols):
        newsheet.write(r, c, sheet.cell_value(r, c))
Then use newsheet instead of sheet wherever you want to use the new sheet.
Anand S Kumar's answer is correct but you need to change i to r and j to c. For extra benefit I added a bit more code for a complete code example. This code opens an existing excel file, reads all of the data from the first sheet, and writes that same data to a new excel file.
import os, xlrd, xlwt
inExcel = r'C:\yourpath\inFile.xls'
outExcel = r'C:\yourpath\outFile.xls'
# delete the output file if it already exists (outExcel must be defined first)
if os.path.isfile(outExcel):
    os.remove(outExcel)
# open the input workbook and take its first sheet
workbook = xlrd.open_workbook(inExcel)
sheetIn = workbook.sheet_by_index(0)
# create the output workbook and sheet
workbook = xlwt.Workbook()
sheetOut = workbook.add_sheet('DATA')
# copy every cell value across
for r in range(sheetIn.nrows):
    for c in range(sheetIn.ncols):
        sheetOut.write(r, c, sheetIn.cell_value(r, c))
workbook.save(outExcel) # save the result