Python - Appending new columns to Excel file based on row/cell info

I am trying to append new columns to an Excel file, where each cell's value depends on cells from the same row. I need to add about 8 columns.
Right now, based on one of the cells, let's say serialno, I do a lookup against a JSON URL and pull the relevant column info. But I need to write this info back to that particular row.
So far, all the help I have found shows adding one whole column at a time. Is that the best option, or is there an easier way to add all 8 columns and keep appending row-wise? I want to be careful with any blank information, as I want those cells to stay blank.
I'm a novice at this, pretty much learning and doing from available scripts.
Thanks for any direction you can provide.
Here is some code I'm currently using:
except IndexError:
    # find the first column whose name contains 'no' (e.g. serialno)
    cols = [col for col in df.columns if 'no' in col]
    col_name = cols[0]
    for x in df.index:
        # left-pad the value with zeros to a width of 9
        n = 9 - len(str(df[col_name][x]))
        num = str(df[col_name][x]).rjust(n + len(str(df[col_name][x])), '0')
        with suppress(KeyError, UnicodeEncodeError):
            main(num)
def main(num):
    for i in jsonData["people"]:
        room_no = jsonData["people"][i]["roomno"]
        title = jsonData["people"][i]["title"]
        fname = jsonData["people"][i]["full_name_ac"]
        tel = jsonData["people"][i]["telephone"]
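One way to write all eight values back to their rows at once (a sketch, not a definitive implementation: lookup_person is a hypothetical stand-in for the JSON lookup, assumed to return a dict keyed by the new column names) is to build the columns row-wise with apply and join them back, so blanks stay blank:

import pandas as pd

def lookup_person(num):
    # Hypothetical: query the JSON URL for this serial number and return a
    # dict of new-column values, e.g. {'roomno': ..., 'title': ...},
    # or {} if nothing is found.
    return {}

# Zero-pad the serial number to 9 digits, look each one up, and expand the
# returned dict into one column per key. Keys missing for a given row become
# NaN, which to_excel() writes out as blank cells.
new_cols = df[col_name].astype(str).str.zfill(9).apply(lambda num: pd.Series(lookup_person(num)))
df = df.join(new_cols)
df.to_excel('output.xlsx', index=False)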

Related

How to read excel files when some cells have multiple rows

I have to read multiple large Excel files to attempt to clean the data.
I am down to the last problem: some cells have multiple rows within them, or I guess some cells span multiple rows.
It is something like this:
Index  Col1  Col2    Col3
1      row1  row1    row1
2            row1.1
3            row1.2
4      row2  row2    row3
When I use pandas.read_excel(filename), or pandas.ExcelFile followed by sheet.parse(sheetname), it of course reads in indexes 2 and 3 as mostly blank lines.
How would I go about merging indexes 2 and 3 into 1, based on what Col1 spans?
To be clear, my question is: how could I read in an Excel file and merge rows based on what rows the first column spans? Is this even possible?
Thanks
I don't know that this functionality is built into pandas; frankly, Excel is not intended to be used like this, but people still tend to abuse the heck out of it. Man, I hate Excel... but that's a topic for another thread.
I think your best bet here is to define a custom function based on the logic that you know applies to these files. As I am currently in the middle of a project dealing with diverse and poorly formatted Excel files, I'm all too familiar with this kind of garbage.
This is my suggestion, based on my understanding of the data and what you're asking. It may need to be changed depending on the specifics of your files.
last_valid = None
check_cols = []  # if you only need to check a subset of cols for validity, list them here

for i, s in df.iterrows():  # This is slow, but probably necessary in this case
    # If all the row's cells are valid, keep it as a reference in case
    # the following rows are not
    if all(s[check_cols].notna()):
        lvi, last_valid = i, s
        # need to store both index and series so we can go back and replace the row
        continue
    else:  # here is the critical part
        extra_vals = s[s.notna()]  # find cells in the row that have actual values
        for col in extra_vals.index:
            # I'm collecting the values into a list here since I don't know
            # your values or how they need to be handled exactly
            # (concatenating rather than calling .append(), which returns None)
            last_valid[col] = list(last_valid[col]) + [extra_vals[col]]
        # replace that row in the dataframe
        df.loc[lvi, :] = last_valid

# drop the extra rows:
df = df.dropna(axis=0, subset=check_cols)
Hope this works for ya!
@LiamFiddler's answer is correct but needed some adjustment to work in my situation, as I am combining numbers on the same line and will be writing out to a CSV as strings. I am posting mine in case it helps someone who comes here.
last_valid = None
check_cols = ['Col1']  # if you only need to check a subset of cols for validity, list them here

df = df.astype(str)         # convert all columns to strings, as I have to combine numbers in the same cell
df = df.replace('nan', '')  # turn the 'nan' strings created above back into blanks

for i, s in df.iterrows():  # This is slow, but probably necessary in this case
    # If all the row's cells are valid, keep it as a reference in case
    # the following rows are not
    if all(s[check_cols] != ''):
        lvi, last_valid = i, s
        # need to store both index and series so we can go back and replace the row
        continue
    else:  # here is the critical part
        extra_vals = s[s != '']  # find cells in the row that have actual values
        for col in extra_vals.index:
            # concatenate as strings, separated by whatever you wish;
            # a list was causing issues here
            last_valid[col] = last_valid[col] + "," + extra_vals[col]
        # replace that row in the dataframe
        df.loc[lvi, :] = last_valid

# drop the extra rows:
df = df[df['Col1'] != ''].reset_index(drop=True)
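As a quick check, running the string-based version above on the example table from the question (column alignment assumed as shown there) gives:

import pandas as pd

df = pd.DataFrame({
    'Col1': ['row1', '', '', 'row2'],
    'Col2': ['row1', 'row1.1', 'row1.2', 'row2'],
    'Col3': ['row1', '', '', 'row3'],
})

# ... run the loop above ...

# Expected result:
#    Col1                Col2  Col3
# 0  row1  row1,row1.1,row1.2  row1
# 1  row2                row2  row3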

How to skip rows in pandas dataframe iteration

So I've created a dataframe called celtics, whose last column, 'Change in W/L%', is currently filled with all 0s.
I want to calculate the change in win-loss percentage (see the 'W/L%' column) whenever the coach's name in one row of Coaches is different from the name of the coach right underneath that row. I have written this loop to try to do that:
i = 0
while i < len(celtics) - 1:
    if (celtics["Coaches"].loc[i].split("("))[0] != (celtics["Coaches"].loc[i + 1].split("("))[0]:
        celtics["Change in W/L%"].loc[i] = celtics["W/L%"].loc[i] - celtics["W/L%"].loc[i + 1]
        i = i + 1
    i = i + 1
So basically, if the name of the coach in row i is different from the name of the coach in row i+1, the change in W/L% between the two rows is added to row i of the Change in W/L% column. However, when I execute the code, the dataframe does not come out that way.
For example, row 1 should just have 0 in the Change in W/L% column; instead, it holds the difference in W/L% between rows 1 and 2, even though the coach's name is the same in both rows. Could anyone help me resolve this issue? Thanks!
Check out the solution to this question here on Stack Overflow:
Skip rows while looping over data frame Pandas
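For what it's worth, here is a vectorized sketch (assuming the column names from the question) that sidesteps the manual index bookkeeping by comparing each coach to the one in the next row with shift(-1):

import pandas as pd

# Coach name is everything before the first "(", stripped of whitespace
# (stripping also guards against trailing-space mismatches)
coach = celtics["Coaches"].str.split("(").str[0].str.strip()

# True where the coach in row i differs from the coach in row i + 1
changed = coach != coach.shift(-1)

# Difference in W/L% between row i and row i + 1 where the coach changed,
# 0 elsewhere (including the last row, which has no next coach)
diff = celtics["W/L%"] - celtics["W/L%"].shift(-1)
celtics["Change in W/L%"] = diff.where(changed, 0.0).fillna(0.0)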

How to search through pandas data frame row by row and extract variables

I am trying to search through a pandas dataframe row by row to see if 3 variables are in the name of a file; if they are, more variables are extracted from that same row. For instance, I am checking whether the concentration, substrate, and number of droplets match the file name. If this condition is true, which will only happen once as there are no duplicates, I want to extract the frame rate and the time from that same row. Below is my code:
excel_var = 'Experiental Camera.xlsx'
workbook = pd.read_excel(excel_var, "PythonTable")
workbook.Concentration.astype(int, errors='raise')
for index, row in workbook.iterrows():
    if str(row['Concentration']) and str(row['substrate']) and str(-+row['droplets']) in path_ext:
        Actual_Frame_Rate = row['Actual Frame Rate']
        Acquired_Time = row['Acquisition time']
Attached is an example of what my spreadsheet looks like and what my path_ext is.
At the moment, nothing is being saved to Actual_Frame_Rate, and I don't know why. I have attached pictures to show that it should match. Is there anything wrong with my code, or is there a better way to go about this? Any help is much appreciated.
I am unsure why this helped, but I fixed it by combining everything into one string and matching it like that. I used the following code:
for index, row in workbook.iterrows():
    match = 'water(' + str(row['Concentration']) + '%)-' + str(row['substrate']) + str(-+row['droplets'])
    # str(row['Concentration']) and str(row['substrate']) and str(-+row['droplets'])
    if match in path_ext:
        Actual_Frame_Rate = row['Actual Frame Rate']
        Acquired_Time = row['Acquisition time']
This code now produces the correct answer, but I am still unsure why the other method didn't work.
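The likely culprit is operator precedence: in binds tighter than and, so the original condition only tested the last value against path_ext, while the first two values were merely checked for truthiness. A sketch of the difference, using the same names as above:

# What the original condition actually evaluates to: only the droplets
# string is tested for membership in path_ext.
cond_original = str(row['Concentration']) and str(row['substrate']) and (str(-+row['droplets']) in path_ext)

# To require all three substrings, each needs its own membership test:
cond_intended = all(str(v) in path_ext for v in (row['Concentration'], row['substrate'], -+row['droplets']))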

Adding Multiple Columns at Specific Locations in CSV file using Pandas

I am trying to place multiple columns (Score1, Score2, Score3, etc.) before columns whose names begin with a certain text, e.g. Certainty.
I can insert columns at fixed locations using:
df.insert(17, "Score1", " ")
Adding a column then changes the column sequence, though, so I would have to look up where the next column is located. I can add a list of blank columns to the end of a CSV.
So essentially, my understanding is that I have to get pandas to read the column header, and if the header text starts with "Certainty", place a column called Score1 before it.
I tried using:
df.insert(df.filter(regex='Certainty').columns, "Score", " ")
However, as can be guessed, it doesn't work.
From what I understand, pandas is not efficient at iterative methods. Am I misinformed here?
Writing this also leads me to think it needs a counter for Score1, 2, 3.
Any suggestions would be appreciated!
Thanks in advance.
Update, based on the feedback provided:
The method by @SergeBallesta works.
cur = 0
for i, col in enumerate(df.columns):
    if col.startswith('Certainty'):
        df.insert(i + cur, f'Score{cur + 1}', '')
        cur += 1
The method by @JacoSolari needed a modification to find all columns starting with "Certainty", and also to add Score1, Score2, Score3 automatically.
Version 1: This only adds Score1 in the correct place and then nothing else
counter = 0
certcol = df.columns[df.columns.str.contains('Certainty')]
col_idx = df.columns.get_loc(certcol[0])
col_names = [f'Score{counter + 1}']
[df.insert(col_idx, col_name, ' ') for col_name in col_names[::-1]]
Version 2: This adds Score1 in the correct place and then adds the rest after the first "Certainty" column. So it does not proceed to find the next one. Perhaps it needs a for loop somewhere?
cur = 0
certcol = df.columns[df.columns.str.contains('Certainty')]
for col in enumerate(certcol):
    col_idx = df.columns.get_loc(certcol[0])  # always locates the first match, hence the behavior described above
    df.insert(cur + col_idx, f'Score{cur + 1}', '')
    cur += 1
I have posted this, in case anyone stumbles across the same need.
You will have to iterate over the columns. It is not as performant as numpy's vectorized accesses, but sometimes you have no other choice. Here I would just do:
cur = 0
for i, col in enumerate(df.columns):
    if col.startswith('Certainty'):
        df.insert(i + cur, f'Score{cur + 1}', '')
        cur += 1
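For illustration, a minimal run of this loop on a made-up frame (column names assumed):

import pandas as pd

df = pd.DataFrame(columns=['A', 'Certainty1', 'B', 'Certainty2'])

# enumerate() snapshots the original (immutable) Index, so inserting
# during the loop is safe; i + cur adjusts for columns already inserted
cur = 0
for i, col in enumerate(df.columns):
    if col.startswith('Certainty'):
        df.insert(i + cur, f'Score{cur + 1}', '')
        cur += 1

print(list(df.columns))
# ['A', 'Score1', 'Certainty1', 'B', 'Score2', 'Certainty2']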
You can find the location of your Certainty column like this:
col_idx = df.columns.get_loc('Certainty')
Then you can add each of your new columns and data (here just an empty string, as in your example) in a loop like this:
col_names = ['1', '2', '3']
[df.insert(col_idx, col_name, '') for col_name in col_names[::-1]]
So you don't need to update the column index as long as you add the reversed ([::-1]) list of new columns.
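A quick demonstration of that reversed insert (hypothetical column names):

import pandas as pd

df = pd.DataFrame(columns=['A', 'Certainty', 'B'])
col_idx = df.columns.get_loc('Certainty')

col_names = ['1', '2', '3']
# inserting in reverse order at the same index keeps '1', '2', '3' in order
[df.insert(col_idx, col_name, '') for col_name in col_names[::-1]]

print(list(df.columns))
# ['A', '1', '2', '3', 'Certainty', 'B']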
Also have a look at this question if you didn't already.

How to return the string of a header based on the max value of a cell in Openpyxl

Good morning guys! Quick question about openpyxl:
I am working in Python, editing an xlsx document and generating various stats. Part of my script generates the max values of a cell range:
temp_list = []
temp_max = []
for row in sheet.iter_rows(min_row=3, min_col=10, max_row=508, max_col=13):
    print(row)
    for cell in row:
        temp_list.append(cell.value)
    print(temp_list)
    temp_max.append(max(temp_list))
    temp_list = []
I would also like to be able to print the header string of the column that contains the max value for the desired cell range. My data structure looks like this:
Any idea on how to do so?
Thanks!
This seems like a typical INDEX/MATCH Excel problem.
Have you tried retrieving the index of the max value in each temp_list?
You can use a function like numpy.argmax() to get the index of the max value within your temp_list array, then use that index to locate the header and append its string to a new list called, say, max_headers, which collects all the header strings in order of appearance.
It would look something like this:
import numpy as np

for cell in row:
    temp_list.append(cell.value)

i_max = np.argmax(temp_list)  # 0-based index of the max value within this row's window
# openpyxl columns are 1-based, so offset the index by the window's first
# column (the min_col you pass to iter_rows) when locating the header
max_headers.append(sheet.cell(row=1, column=min_col + i_max).value)
And so on and so forth. Of course, for that to work, min_col and the max_headers list would have to be defined beforehand.
First, thanks Bernardo for the hint. I found a decently working solution but still have a little issue; perhaps someone can be of assistance.
Let me amend my initial statement: here is the code I am working with now:
temp_list = []
headers_list = []
# Index starts at 1. Here we set the rows/columns containing the data to be analyzed
for row in sheet.iter_rows(min_row=3, min_col=27, max_row=508, max_col=32):
    for cell in row:
        temp_list.append(cell.value)
    for cell in row:
        if cell.value == max(temp_list):
            print(str(cell.column))
            print(cell.value)
            print(sheet.cell(row=1, column=cell.column).value)
            headers_list.append(sheet.cell(row=1, column=cell.column).value)
        else:
            print('keep going.')
    temp_list = []
This works, but it has a little issue: if a row contains the same max value twice (e.g. 25, 9, 25, 8, 9), the loop prints 2 headers instead of one. My question is:
how can I get this loop to take into account only the first occurrence of the max value in a row?
You probably want something like this:
headers = list(next(ws.iter_rows(min_col=27, max_col=32, min_row=1, max_row=1, values_only=True)))
for row in ws.iter_rows(min_row=3, min_col=27, max_row=508, max_col=32, values_only=True):
    mx = max(row)
    idx = row.index(mx)  # index() returns only the first occurrence, so duplicate maxima are ignored
    col = headers[idx]
