Retrieve first empty column and row using xlwings - python

I am looking for a way to find the first empty column and row. As part of my use case, I am trying to find H3 (to add the current date) and then H4 and H5 (to add my daily metrics) [screenshot attached]. I have tried the following with xlwings.
import xlwings as xw
from xlwings import Range, constants
wb = xw.Book(r"path to xlsx")
sht1 = wb.sheets['Sheet1']
sht1.range('G3').value = current_date
sht1.range('G4').value = 5678
sht1.range('G5').value = 1234
wb.save(r"path to xlsx")
The issue is that I have hardcoded the column and row references in the script. I want H3, H4 and H5 to be found dynamically through xlwings so that the metrics can be updated programmatically. Can someone guide me on this?

You can do this by finding the last column of the data used. Here are two options to get this data:
Using SpecialCells(11), a VBA function accessed through .api; information about this can be found here.
Using .end("right"), the equivalent of ctrl + right in Excel.
Option 1 works well if there is no other data in the spreadsheet, so that the last used cell in the sheet falls in the correct column. This is convenient and doesn't require knowledge of the starting cell (in this case B3).
Option 2 is preferable when other data may be on the sheet, so the last used cell of the sheet is not necessarily in the last column of your desired data. This option does, however, require that the data has no gaps: .end("right") stops at the first empty cell, so a missing column would mean the cell it lands on is not strictly the last column of the data.
An alternative could be to import all the data to Python as a pd.DataFrame, then append an additional column and return. If you need to append many columns of data, this would probably be more efficient (especially if you already have a DataFrame of the data you are pasting to Excel).
last_col is an integer, as this is most easily manipulated (such as increasing it by 1). The range call has therefore been modified to make use of this: instead of A1 style (e.g. range("A1")), a tuple of the format (row_num, col_num) is used (e.g. range((row_num, col_num))).
import xlwings as xw
import datetime as dt
current_date = dt.date.today().strftime("%d-%b-%y")
wb = xw.Book(r"path to xlsx")
sht1 = wb.sheets['Sheet1']
# option 1: last column in the sheet through SpecialCells
last_col = sht1.range("A1").api.SpecialCells(11).Column
# option 2: starting at cell B3, the first in the date headers, move to the right (like ctrl+right in Excel)
last_col = sht1.range("B3").end("right").column
# paste new values
sht1.range((3, last_col+1)).value = current_date
sht1.range((4, last_col+1)).value = 5678
sht1.range((5, last_col+1)).value = 1234
wb.save(r"path to xlsx")

Related

How to search through pandas data frame row by row and extract variables

I am trying to search through a pandas dataframe row by row and see if 3 variables are in the name of the file. If they are in the name of the file, more variables are extracted from that same row. For instance, I am checking to see if the concentration, substrate and the number of droplets match the file name. If this condition is true, which will only happen once as there are no duplicates, I want to extract the frame rate and the time from that same row. Below is my code:
import pandas as pd

excel_var = 'Experiental Camera.xlsx'
workbook = pd.read_excel(excel_var, "PythonTable")
workbook.Concentration.astype(int, errors='raise')
for index, row in workbook.iterrows():
    if str(row['Concentration']) and str(row['substrate']) and str(-+row['droplets']) in path_ext:
        Actual_Frame_Rate = row['Actual Frame Rate']
        Acquired_Time = row['Acquisition time']
Attached is an example of what my spreadsheet looks like and what my path_ext is.
At the moment nothing is being saved for Actual_Frame_Rate and I don't know why. I have attached the pictures to show that it should match. Is there anything wrong with my code, or is there a better way to go about this? Any help is much appreciated.
I am unsure why this helped, but I fixed it by just combining it all into one string and matching like that. I used the following code:
for index, row in workbook.iterrows():
    match = 'water(' + str(row['Concentration']) + '%)-' + str(row['substrate']) + str(-+row['droplets'])
    # str(row['Concentration']) and str(row['substrate']) and str(-+row['droplets'])
    if match in path_ext:
        Actual_Frame_Rate = row['Actual Frame Rate']
        Acquired_Time = row['Acquisition time']
This code now produces the correct answer, but I am still unsure why I can't use the other method.
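For what it's worth, the original condition fails because of operator precedence: `a and b and c in s` is parsed as `a and b and (c in s)`, so only the last value is actually tested against `path_ext`; the first two are just checked for truthiness (any non-empty string passes). A minimal illustration:

```python
path_ext = "water(5%)-glass-10"

concentration = "99"   # does not appear in path_ext
substrate = "nothing"  # does not appear in path_ext
droplets = "10"        # does appear in path_ext

# Parsed as: concentration and substrate and (droplets in path_ext).
# Non-empty strings are truthy, so this is True even though two of the
# three values never appear in path_ext.
print(concentration and substrate and droplets in path_ext)  # True

# Testing each substring explicitly gives the intended behaviour:
print(all(s in path_ext for s in (concentration, substrate, droplets)))  # False
```

That is why concatenating everything into one string and testing `match in path_ext` works: it performs a single, genuine substring check.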

Is there a way to copy merged cells using Python(openpyxl)?

I'm trying to write Python code that repeatedly copies a certain range of an Excel file.
As in the image below, the same content is copied a few columns over, onto the same sheet that holds the existing content.
Although I am only at an introductory level, at first I believed this would be really easy to implement, and I have implemented most of my plan through the code below, except for merged cells.
import openpyxl
from openpyxl import Workbook
from openpyxl.utils import range_boundaries

wb = openpyxl.load_workbook('DSD.xlsx')
ws = wb.worksheets[3]

min_col, min_row, max_col, max_row = range_boundaries('I1')
for row, row_cells in enumerate(wb['2'], min_row):
    for column, cell in enumerate(row_cells, min_col):
        # Copy value and style from the source cell to the target cell
        ws.cell(row=row, column=column).value = cell.value
        ws.cell(row=row, column=column)._style = cell._style
wb.save('DSD.xlsx')
However, I've been searching and thinking about this for a few days, and I don't know how to merge the merged cells. Beyond that, I don't even know if this is technically possible.
The idea I had today was to pull out a list of the merged cells in the sheet and then, from the coordinates of each merged cell, issue merge commands offset by the same number of columns.
However, having pulled out the list of merged cells as shown below, I can't think of what to do next.
Is there a way?
import openpyxl
from openpyxl import Workbook

wb = openpyxl.load_workbook('merged.xlsx')
ws = wb.active
Merged_Cells = ws.merged_cell_ranges
a = Merged_Cells[0]
print(ws.merged_cell_ranges)
print(a)
[<MergedCellRange B2:C3>, <MergedCellRange B9:C9>, <MergedCellRange B13:B14>]
B2:C3
The value B2:C3 corresponding to a appears to be a special object: a.replace(~~) does not work even though the value looks like the string "B2:C3".
Is there really no way?
You'll need to look at how CellRanges work because you will need to calculate the offset for each range and then create a new MergedCellRange for the new cell using the dimensions from the original.
The following should give you an idea of how to do this.
from openpyxl.worksheet.cell_range import CellRange

area = CellRange("A1:F13")  # area being copied
# iterate over a copy, since merge_cells() mutates ws.merged_cells
for mcr in list(ws.merged_cells.ranges):
    if mcr.coord not in area:
        continue
    cr = CellRange(mcr.coord)
    cr.shift(col_shift=10)
    ws.merge_cells(cr.coord)
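As a self-contained sketch of the idea (an in-memory workbook with one merged range, replicated 10 columns to the right; the sheet contents are illustrative):

```python
import openpyxl
from openpyxl.worksheet.cell_range import CellRange

wb = openpyxl.Workbook()
ws = wb.active
ws["B2"] = "header"
ws.merge_cells("B2:C3")

area = CellRange("A1:F13")  # area being copied
# iterate over a copy, since merge_cells() mutates ws.merged_cells
for mcr in list(ws.merged_cells.ranges):
    if mcr.coord not in area:
        continue
    cr = CellRange(mcr.coord)
    cr.shift(col_shift=10)  # B2:C3 -> L2:M3
    ws.merge_cells(cr.coord)

print(sorted(str(r) for r in ws.merged_cells.ranges))  # ['B2:C3', 'L2:M3']
```

Combined with the value-copying loop from the question, this reproduces both the contents and the merged structure of the copied area.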

How to get first 300 rows of Google sheet via gspread

Set-up
I create a Pandas dataframe from all records in a google sheet like this,
df = pd.DataFrame(wsheet.get_all_records())
as explained in the Gspread docs.
Issue
Since today, Python seems to hang when I execute this command. I don't get any error; I interrupt Python with a KeyboardInterrupt after a while.
I suspect the number of records is too much for Google: ±3500 rows with 18 columns.
Question
Now, I actually don't really need the entire sheet. The first 300 rows would do just fine.
The docs show values_list = worksheet.row_values(1), which would return the first row values in a list.
I guess I could create a loop, but I was wondering if there's a built-in / better solution?
I believe your goal as follows.
You want to retrieve the values from the 1st row to the 300th row of a sheet in a Google Spreadsheet.
Given that you suspect the records are too much (±3500 rows with 18 columns), you want to retrieve the values from columns "A" to "R"?
You want to convert the retrieved values to the dataFrame.
You want to achieve this using gspread.
In order to achieve this, I would like to propose the following sample script.
In this answer, I used the method of values_get.
Sample script:
import gspread
import pandas as pd

spreadsheetId = "###"  # Please set the Spreadsheet ID.
rangeA1notation = "Sheet1!A1:R300"  # Please set the range using A1Notation.

# credentials: your authorized credentials object
client = gspread.authorize(credentials)
spreadsheet = client.open_by_key(spreadsheetId)
values = spreadsheet.values_get(rangeA1notation)
v = values['values']
df = pd.DataFrame(v)
print(df)
Note:
Please set the range as the A1Notation. In this case, when "A1:R300" instead of "Sheet1!A1:R300" is used, the values are retrieved from the 1st tab in the Spreadsheet.
When "A1:300" is used, the values are retrieved from the column "A" to the last column of the sheet.
When the 1st row is the header row and the data is after the 2nd row, please modify as follows.
From
df = pd.DataFrame(v)
To
df = pd.DataFrame(v[1:], columns=v[0])
Reference:
values_get
I used the openpyxl package.

import openpyxl as xl

wb = xl.load_workbook('your_file_name')
sheet = wb['name_of_your_sheet']

# Specify your range. You can now perform many operations; for example,
# this points at row 1, column 3 in the first iteration:
for row in range(1, 300):
    cell = sheet.cell(row, 3)
    # if you want to change the cell value
    cell.value = 'something'

openpyxl has pretty much all of it. Here is a link to the docs: https://openpyxl.readthedocs.io/en/stable/

manipulating excel spreadsheets with python

I am new to Python, especially when it comes to using it with Excel. I need to write code that searches for the string “Mac”, “Asus”, “AlienWare”, “Sony”, or “Gigabit” within a longer string in each cell of column A. Depending on which of these five strings it finds within a cell in column A, it should write that string to the corresponding row in column C. If it doesn’t find any of the five, it should write “Other” to the corresponding row in column C. For example, if cell A2 contained the string “ProLiant Asus DL980 G7”, the correct code would write “Asus” to cell C2. It should do this for every single cell in column A, writing the appropriate string to the corresponding cell in column C. So far, this is the code that I have (not much at all):
import openpyxl

wb = openpyxl.load_workbook(path)
sheet = wb.active
for i in range(1, sheet.max_row + 1):
    cell1 = sheet.cell(row=i, column=1)
    cell2 = sheet.cell(row=i, column=3)
    # missing code here
wb.save(path)
You haven't tried writing any code to solve the problem. You might want to first get openpyxl to write to the excel workbook and verify that is working - even if it's dummy data. This page looks helpful - here
Once that is working all you'd need is a simple function that takes in a string as an argument.
def get_column_c_value(string_from_column_a):
    if "Lenovo" in string_from_column_a:
        return "Lenovo"
    elif "HP" in string_from_column_a:
        return "HP"
    # add the strings you need to check for here, in the same format as above
    else:
        return "Other"
Try out those and if you have any issues let me know where you're getting stuck.
I have not worked much with openpyxl, but it sounds like you are trying to do a simple string search.
You can access individual cells by using
cell1.internal_value
Then, your if/else statement would look something like
if "HP" in str(cell1.internal_value):
Data can be assigned directly to a cell so you could have
ws['C' + str(i)] = "HP"
You could do this for all of the data in your cells
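Putting those pieces together, a complete sketch using the question's actual keyword list (an in-memory workbook stands in for the real file here; with a real file, use openpyxl.load_workbook(path) and wb.save(path)):

```python
import openpyxl

KEYWORDS = ["Mac", "Asus", "AlienWare", "Sony", "Gigabit"]

def get_column_c_value(text):
    # return the first keyword found in the text, else "Other"
    for kw in KEYWORDS:
        if kw in text:
            return kw
    return "Other"

# in-memory workbook for illustration
wb = openpyxl.Workbook()
sheet = wb.active
sheet["A1"] = "ProLiant Asus DL980 G7"
sheet["A2"] = "Some unbranded machine"

for i in range(1, sheet.max_row + 1):
    value = sheet.cell(row=i, column=1).value or ""
    sheet.cell(row=i, column=3).value = get_column_c_value(str(value))

print(sheet["C1"].value, sheet["C2"].value)  # Asus Other
```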

I/O efficiency in Python

I am writing a program that:
Reads the content of an Excel sheet row by row (90,000 rows in total)
Compares each row's content with each row of another Excel sheet (600,000 rows in total)
If a match occurs, writes the matching entry into a new Excel sheet
I have written the script and everything works fine. However, the computational time is HUGE: in an hour it has processed just 200 rows from the first sheet, resulting in 200 different files being written.
I was wondering if there is a way to save the matches differently, as I am going to use them later on? Is there any way to save them in a matrix or something?
import xlrd
import xlsxwriter
import os, itertools
from datetime import datetime

# choose the incident excel sheet
book_1 = xlrd.open_workbook('D:/Users/d774911/Desktop/Telstra Internship/Working files/Incidents.xlsx')
# choose the trap excel sheet
book_2 = xlrd.open_workbook("D:/Users/d774911/Desktop/Telstra Internship/Working files/Traps.xlsx")
# choose the features sheet
book_3 = xlrd.open_workbook("D:/Users/d774911/Desktop/Telstra Internship/Working files/Features.xlsx")

# select the working sheets, either by name or by index
Traps = book_2.sheet_by_name('Sheet1')
Incidents = book_1.sheet_by_name('Sheet1')
Features_Numbers = book_3.sheet_by_name('Sheet1')

# return the total number of rows for the traps and incidents sheets
Total_Number_of_Rows_Traps = Traps.nrows
Total_Number_of_Rows_Incidents = Incidents.nrows
print(Total_Number_of_Rows_Traps, Total_Number_of_Rows_Incidents)

# open a file to write down the non-matching incidents' numbers
write_no_matching = open('C:/Users/d774911/PycharmProjects/GlobalData/No_Matching.txt', 'w')

# iterate over all the rows of the incidents sheet
for Rows_Incidents in range(Total_Number_of_Rows_Incidents):
    # store the content of the comparable cells for the incident sheet
    Incidents_Content_Affected_resources = Incidents.cell_value(Rows_Incidents, 47)
    Incidents_Content_Product_Type = Incidents.cell_value(Rows_Incidents, 29)
    # convert the Excel date type into a Python type
    Incidents_Content_Date = xlrd.xldate_as_tuple(Incidents.cell_value(Rows_Incidents, 2), book_1.datemode)
    # extract the year, month and day
    Incidents_Content_Date = str(Incidents_Content_Date[0]) + ' ' + str(Incidents_Content_Date[1]) + ' ' + str(Incidents_Content_Date[2])
    Incidents_Content_Date = datetime.strptime(Incidents_Content_Date, '%Y %m %d')
    # extract the incident number
    Incident_Name = Incidents.cell_value(Rows_Incidents, 0)
    # create a workbook for the selected incident
    Incident_Name_Book = xlsxwriter.Workbook(os.path.join('C:/Users/d774911/PycharmProjects/GlobalData/Test/', Incident_Name + '.xlsx'))
    # create a sheet in the new workbook
    Incident_Name_Sheet = Incident_Name_Book.add_worksheet('Sheet1')
    # insert the first row that contains the features
    Incident_Name_Sheet.write_row(0, 0, Features_Numbers.row_values(0))
    Insert_Row_to_Incident_Sheet = 0
    # iterate over all the rows of the traps sheet
    for Rows_Traps in range(Total_Number_of_Rows_Traps):
        # store the content of the comparable cells for the traps sheet
        Traps_Content_Node_Name = Traps.cell_value(Rows_Traps, 3)
        Traps_Content_Event_Type = Traps.cell_value(Rows_Traps, 6)
        # extract the date
        Traps_Content_Date_temp = Traps.cell_value(Rows_Traps, 10)
        Traps_Content_Date = datetime.strptime(Traps_Content_Date_temp[0:10], '%Y-%m-%d')
        # if the content matches partially or fully
        if len(str(Traps_Content_Node_Name)) * len(str(Incidents_Content_Affected_resources)) != 0 and \
                str(Incidents_Content_Affected_resources).lower().find(str(Traps_Content_Node_Name).lower()) != -1 and \
                len(str(Traps_Content_Event_Type)) * len(str(Incidents_Content_Product_Type)) != 0 and \
                str(Incidents_Content_Product_Type).lower().find(str(Traps_Content_Event_Type).lower()) != -1 and \
                len(str(Traps_Content_Date)) * len(str(Incidents_Content_Date)) != 0 and \
                Traps_Content_Date <= Incidents_Content_Date:
            # counter for writing inside the new incident sheet
            Insert_Row_to_Incident_Sheet = Insert_Row_to_Incident_Sheet + 1
            # write the incident information
            Incident_Name_Sheet.write_row(Insert_Row_to_Incident_Sheet, 0, Incidents.row_values(Rows_Incidents))
            # write the traps information
            Incident_Name_Sheet.write_row(Insert_Row_to_Incident_Sheet, 107, Traps.row_values(Rows_Traps))
    Incident_Name_Book.close()
Thanks
What you're doing is seeking/reading a little bit of data for each cell. This is very inefficient.
Try reading all the information in one go into a Python data structure that is as basic as sensible (lists, dicts etc.), make your comparisons/operations on this data set in memory, and write all results in one go. If not all the data fits into memory, try to partition it into sub-tasks.
Having to read the data set 10 times, extracting a tenth of the data each time, will likely still be hugely faster than reading each cell independently.
I don't see how your code can work as originally posted: the second loop uses variables which change for every row of the first loop, but as posted the second loop wasn't inside the first one.
That said, comparing files in this way has a complexity of O(N*M) which means that the runtime explodes quickly. In your case you try to execute 54'000'000'000 (54 billion) loops.
If you run into these kind of problems, the solution is always a three step process:
Transform the data to make it easier to process
Put the data into efficient structures (sorted lists, dict)
Search the data with the efficient structures
You have to find a way to get rid of the find(). Try to get rid of all the junk in the cells that you want to compare, so that you can use ==. When you have this, you can put rows into a dict to find matches. Or you could load everything into a SQL database and use SQL queries (don't forget to add indexes!)
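A minimal sketch of the dict approach, with made-up column names and values (it assumes the comparison fields have been normalised so exact equality works):

```python
# hypothetical, already-cleaned rows from the two sheets
traps = [
    {"node": "server-a", "event": "link-down"},
    {"node": "server-b", "event": "link-up"},
]
incidents = [
    {"resource": "server-a", "product": "link-down"},
    {"resource": "server-c", "product": "link-up"},
]

# build a lookup from the larger sheet once: key -> matching trap rows (O(M))
traps_by_key = {}
for t in traps:
    traps_by_key.setdefault((t["node"], t["event"]), []).append(t)

# probe the dict per incident row: O(N) lookups instead of O(N*M) scans
matches = []
for inc in incidents:
    key = (inc["resource"], inc["product"])
    matches.extend((inc, t) for t in traps_by_key.get(key, []))

print(len(matches))  # 1
```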
One last trick is to use sorted lists. If you can sort the data in the same way, then you can use two lists:
Sort the data from the two sheets into two lists
Use two row counters (one per list)
If the current item from the first list is less than the current one from the second list, then there is no match and you have to advance the first row counter.
If the current item from the first list is bigger than the current one from the second list, then there is no match and you have to advance the second row counter.
If the items are the same, you have a match. Process the match and advance both counters.
This allows you to process all the items in one go.
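The sorted-list walk described above can be sketched like this (hypothetical keys; it assumes both lists are sorted on the same comparison key):

```python
a = ["alpha", "bravo", "delta", "golf"]    # sheet 1 keys, sorted
b = ["bravo", "charlie", "golf", "hotel"]  # sheet 2 keys, sorted

i = j = 0
matches = []
while i < len(a) and j < len(b):
    if a[i] < b[j]:
        i += 1                # no match: advance the first counter
    elif a[i] > b[j]:
        j += 1                # no match: advance the second counter
    else:
        matches.append(a[i])  # match: process it and advance both counters
        i += 1
        j += 1

print(matches)  # ['bravo', 'golf']
```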
I would suggest that you use pandas. This module provides a huge number of functions for comparing datasets, and it also has very fast import/export algorithms for Excel files.
IMHO you should use the merge function and provide the arguments how = 'inner' and on = [list of your columns to compare]. That will create a new dataset containing only the rows that occur in both tables (having the same values in the defined columns). This new dataset you can export to your Excel file.
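A sketch of that merge, with made-up column names and data (in practice the frames would come from pd.read_excel and the result would go out via df.to_excel):

```python
import pandas as pd

incidents = pd.DataFrame({
    "node":     ["server-a", "server-c"],
    "product":  ["link-down", "link-up"],
    "incident": ["INC1", "INC2"],
})
traps = pd.DataFrame({
    "node":    ["server-a", "server-b"],
    "product": ["link-down", "link-up"],
    "trap_id": [101, 102],
})

# keep only the rows whose (node, product) pair occurs in both tables
merged = incidents.merge(traps, how="inner", on=["node", "product"])
print(len(merged))  # 1 matching row: server-a / link-down
```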
