I have a file with about 25 sheets, and each sheet contains 5-30 columns with system names as headers. I want to iterate through a list of about 170 systems (the list is on one of the sheets in the main file) and, for each system, search every tab for columns whose header matches that system. The code below works great for the first iteration, but for some reason, after it loops through all the sheets and moves on to the second system, it pulls the sheet name rather than the second system name. Does anyone see what I'm doing wrong?
import pandas as pd

matrix = pd.ExcelFile('file')
names_tab = pd.read_excel(matrix, sheet_name='Name_Test')
sheets_list = {}
for (y, sysRows) in names_tab.iterrows():
    print(sysRows['header'])
    for sheets in matrix.sheet_names[1:]:
        sheets_list['{}'.format(sheets)] = pd.read_excel(matrix, sheet_name='{}'.format(sheets), skiprows=2)
        print(sheets)
        for column in sheets_list[sheets]:
            if column == sysRows['header']:
                for idx, row in sheets_list[sheets][column].iteritems():
                    if sheets_list[sheets].iloc[idx][column] == 'x':
                        print('{} has X in row {} column {} on sheet {}'
                              .format(sysRows['header'], idx, column, sheets))
                    elif sheets_list[sheets].iloc[idx][column] == 'X':
                        print('{} has X in row {} column {} on sheet {}'
                              .format(sysRows['header'], idx, column, sheets))
                print(column + ' works')
            else:
                print(column + ' doesnt work')
I'm not totally sure this is the same result you are trying to achieve, but hopefully it is a starting point (I doubt you need 4 for loops):
import pandas as pd
import numpy as np

names_tab = pd.DataFrame({'header': ['System1', 'System2', 'System3'], 'some_other_column': ['foo', 'bar', 'foobar']})
sheet1 = pd.DataFrame({'System1': ['x', 'X'], 'System2': ['x', 'X'], 'System4': ['X', 'x']})
sheet2 = pd.DataFrame({'System2': ['X', 'x'], 'System8': ['x', 'x'], 'System3': ['x', 'X']})
sheets = [sheet1, sheet2]

for i, sheet in enumerate(sheets):
    print("Sheet", i + 1)
    common_columns = list(set(sheet.columns.tolist()).intersection(names_tab['header'].tolist()))
    df = sheet[common_columns]
    print("Here are all the 'x' values in Sheet", i + 1)
    print(df.where(df == 'x'))
    # To get your behavior
    positions = np.where(df.values == 'x')
    # np.where returns a (row_indices, col_indices) pair of arrays, so zip them into coordinates
    for idx, col in zip(*positions):
        print('{} has x in row {} column {} on sheet {}'.format(df.columns[col], idx, col, str(i + 1)))
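One caveat: df.where(df == 'x') and np.where(df.values == 'x') only match lowercase 'x', while the original code treats 'x' and 'X' alike. A case-insensitive variant might look like this (a sketch with made-up frames):

```python
import pandas as pd
import numpy as np

names_tab = pd.DataFrame({'header': ['System1', 'System2', 'System3']})
sheet = pd.DataFrame({'System1': ['x', 'X'], 'System4': ['X', 'x']})

# keep only the sheet columns that appear in the systems list
common = [c for c in sheet.columns if c in set(names_tab['header'])]
df = sheet[common]

# lowercase before comparing so both 'x' and 'X' are caught
hits = df.apply(lambda s: s.astype(str).str.lower() == 'x')
for idx, col in zip(*np.where(hits.values)):
    print('{} has x in row {} column {}'.format(df.columns[col], idx, col))
```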
Perhaps you could provide a Minimal, Complete, and Verifiable example.
We have an excel file with the same column names as another excel file, but in shuffled order:
excel_1_columns = ["name", "address", "phone"]
and
excel_2_columns = ["address", "name", "phone"]
We want to arrange excel 2 the same way excel 1 is arranged. These column names are just for illustration; the real files can contain hundreds of columns. excel_1_columns and excel_2_columns always have the same length, but not the same order.
We want to use openpyxl to arrange both to look the same.
We had the idea of converting to dataframes and then saving back to excel, but we don't want to lose the formatting and the theme used in the original excels.
I started by getting the column names from excel 1:
excel_1_columns = []
for row in intro_sheet.iter_rows(min_row=1, max_row=1):
    for cell in row:
        excel_1_columns.append(cell.value)
My logic is to loop over excel_1_columns and append a new column into excel 2:
for col in col_order:
    excel_2.insert_cols(idx=1)
    # rename
    excel_2.cell(row=1, column=1).value = str(col)
and then we work within excel 2 and move cell values from old columns into the new appended ones and delete the old ones.
How can we move cell values from the column named old_phone, for example, to the phone column that was appended in the for loop?
Here is my thought on your requirement. It assumes the cells are all data only, with no formulas; if there are formulas, there may be issues with the references. Read this section in the openpyxl docs if you haven't already:
https://openpyxl.readthedocs.io/en/stable/editing_worksheets.html?highlight=delete_cols%20translate#moving-ranges-of-cells
I haven't bothered to read an 'original' workbook header list, since adding that feature should be straightforward. For my example code the required sequence is ascending numeric order; I've included a list in that order, which you would otherwise obtain from the 'excel 1' sheet:
col_order = ['header1', 'header2', 'header3', 'header4', 'header5', 'header6', 'header7']
For testing I created a sheet with 7 columns with these headers, jumbled so they do not match the list's sequence.
The example code iterates row 1 (the header row) and compares the current header in each column to its required position per col_order. Depending on this check, one of two options is taken;
Correct column position; nothing is changed and the code moves to the next column.
Incorrect column position; the required column position of the current column is checked, and one of two actions is taken;
a) if that column is empty, the current column's data is moved directly to the correct column
b) if that column holds another column's data at that moment, the current column's data is moved to a holding column. Holding columns start at the end of the existing columns (ws.max_column)
The copy process also copies the cell style information, so that detail is preserved.
After the column data is copied, its originating column is cleared of all data so that the column is free for data to be moved in from the correct column later. The deletion of the column data is achieved by deleting the column and then inserting a column at the same location.
At the end of the first iteration through the header row, all columns are either in their correct position or in a holding column.
The code then processes the holding columns, moving them to their correct locations, all of which should have been cleared of data in the previous iteration.
from openpyxl import load_workbook
from openpyxl.utils import get_column_letter
from copy import copy


def move_cells(sheet, cur_pos, cur_col):
    for existing_cell in sheet[get_column_letter(cur_pos + 1)]:
        new_cell = existing_cell.offset(column=cur_col)
        new_cell.value = existing_cell.value
        if existing_cell.has_style:
            new_cell.font = copy(existing_cell.font)
            new_cell.border = copy(existing_cell.border)
            new_cell.fill = copy(existing_cell.fill)
            new_cell.number_format = copy(existing_cell.number_format)
            new_cell.protection = copy(existing_cell.protection)
            new_cell.alignment = copy(existing_cell.alignment)
    # clear the originating column by deleting it and re-inserting an empty one
    del_col = cur_pos + 1
    sheet.delete_cols(del_col, amount=1)
    sheet.insert_cols(del_col, amount=1)


col_order = ['header1', 'header2', 'header3', 'header4', 'header5', 'header6', 'header7']
excel_file = 'foo1.xlsx'
wb = load_workbook(excel_file)
ws = wb['Sheet1']

total_columns = ws.max_column
current_column = 0
holding_column = total_columns
print("")
list_with_values = []
for enum, cell in enumerate(ws[1]):
    list_with_values.append(cell.value)
    required_position = col_order.index(cell.value)
    print("Column with Header '" + cell.value + "': current position: "
          + str(enum) + " Required position: " + str(required_position))
    if required_position != enum:
        print("This column is in the wrong position")
        header_check = cell.offset(column=required_position - enum).value
        print("Header of required position is : " + str(header_check))
        if header_check is None:
            # The required column is clear, we can move the current column straight to it
            print("The required column is not in use, moving this column direct to " + str(required_position))
            current_column = required_position - enum  # signed offset to the target column
        else:
            # The required column is in use, move to a holding column
            print("The required column is in use, move the column to holding column " + str(holding_column))
            current_column = holding_column - enum
            holding_column += 1
        print("Moving cells; enum: " + str(enum) + " current_column: " + str(current_column))
        move_cells(ws, enum, current_column)
    else:
        print("This column is in the right position. It will not be moved.")
    print("--------------------------------------------------\n")

print("Moving columns placed in holding to proper positions")
for column in ws.iter_cols(min_row=1, max_row=1, min_col=total_columns + 1, max_col=ws.max_column):
    for cell in column:
        required_position = col_order.index(cell.value)
        holding_column = cell.col_idx - 1
        print("Moving " + str(cell.value) + " from column " + str(holding_column)
              + " to column " + str(required_position))
        current_column = required_position - holding_column  # signed offset to the target column
        move_cells(ws, holding_column, current_column)
        holding_column += 1

wb.save(excel_file)
Images would not format properly for some reason, so I have just left them as links.
Image Before running code
https://i.stack.imgur.com/hlXET.jpg
Image After running code
https://i.stack.imgur.com/YMsM7.png
So this is kind of weird but I'm new to Python and I'm committed to seeing my first project with Python through to the end.
So I am reading about 100 .xlsx files from a file path. I then trim each file and send only the important information to a list, with each file as an individual and unique dataframe. So now I have a list of 100 unique dataframes, but iterating through the list and writing to excel just overwrites the data in the file. I want to append to the end of the .xlsx file. The biggest catch to all of this is that I can only use Excel 2010; I do not have any other version of the application. The openpyxl library seems to have some interesting stuff, and I've tried something like this:
from openpyxl.utils.dataframe import dataframe_to_rows

wb = load_workbook(outfile_path)
ws = wb.active
for frame in main_df_list:
    for r in dataframe_to_rows(frame, index=True, header=True):
        ws.append(r)
Note: In another post I was told it's not best practice to read dataframes line by line using loops, but when I started I didn't know that. I am, however, committed to this monstrosity.
Edit after reading Comments
So my code scrapes .xlsx files and stores specific data, based on a keyword comparison, into dataframes. These dataframes are stored in a list. I will include the entire program below so I can hopefully explain what's in my head. Also, feel free to roast my code, because I have no idea what counts as good python practice and what doesn't.
import os
import pandas as pd
from openpyxl import load_workbook

# the file path I want to pull from
in_path = r'W:\R1_Manufacturing\Parts List Project\Tool_scraping\Excel'
# the file path where row search items are stored
search_parameters = r'W:\R1_Manufacturing\Parts List Project\search_params.xlsx'
# the file I will write the dataframes to
outfile_path = r'W:\R1_Manufacturing\Parts List Project\xlsx_reader.xlsx'

# establishing the lists that I will store looped data into
file_list = []
main_df = []
master_list = []

# open the file path to store the directory in files
files = os.listdir(in_path)

# database with terms that I want to track
search = pd.read_excel(search_parameters)
search_size = search.index

# searching only for files that end with .xlsx
for file in files:
    if file.endswith('.xlsx'):
        file_list.append(in_path + '/' + file)

# read in the files to a dataframe; main loop the files will be manipulated in
for current_file in file_list:
    df = pd.read_excel(current_file)
    # get column headers and a range for total rows
    columns = df.columns
    total_rows = df.index
    # lists to store where headers are located in the DF
    row_list = []
    column_list = []
    header_list = []
    for name in columns:
        for number in total_rows:
            cell = df.at[number, name]
            if isinstance(cell, str) == False:
                continue
            elif cell == '':
                continue
            for place in search_size:
                search_loop = search.at[place, 'Parameters']
                # main compare: if str and matches search params, then do...
                # (insensitive_compare is a helper defined elsewhere in my project)
                if insensitive_compare(search_loop, cell) == True:
                    if cell not in header_list:
                        header_list.append(df.at[number, name])  # store data headers
                        row_list.append(number)  # store row number where it is in that data frame
                        column_list.append(name)  # store column name where it is in that data frame
                    else:
                        continue
                else:
                    continue
    for thing in column_list:
        df = pd.concat([df, pd.DataFrame(0, columns=[thing], index=range(2))], ignore_index=True)
    # turns the dataframe into a set of booleans: True where there's something there
    na_finder = df.notna()
    # create a new dataframe to write the output to
    outdf = pd.DataFrame(columns=header_list)
    for i in range(len(row_list)):
        k = 0
        while na_finder.at[row_list[i] + k, column_list[i]] == True:
            # I turn the dataframe into booleans and read until False
            if df.at[row_list[i] + k, column_list[i]] not in header_list:
                # store the actual data in my output dataframe, outdf
                outdf.at[k, header_list[i]] = df.at[row_list[i] + k, column_list[i]]
            k += 1
    main_df.append(outdf)
So main_df is a list that holds 100+ dataframes. For this example I will only use 2 of them. I would like them to print out into excel like this:
The comment from Ashish really helped me. All of the dataframes had different column titles, so my 100+ dataframes eventually concat'd to a dataframe that is 569x52. Here is the code that I used. I completely abandoned openpyxl, because once I was able to concat all of the dataframes together, I just had to export the result using pandas:
# what I want to do here is grab all the data in the same column as each
# header, then move to the next column
for i in range(len(row_list)):
    k = 0
    while na_finder.at[row_list[i] + k, column_list[i]] == True:
        if df.at[row_list[i] + k, column_list[i]] not in header_list:
            outdf.at[k, header_list[i]] = df.at[row_list[i] + k, column_list[i]]
        k += 1
main_df.append(outdf)

to_xlsx_df = pd.DataFrame()
for frame in main_df:
    to_xlsx_df = pd.concat([to_xlsx_df, frame])
to_xlsx_df.to_excel(outfile_path)
The output to excel ended up looking something like this:
Hopefully this can help someone else out too.
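As a side note, pd.concat also accepts the whole list in one call, so the frame-by-frame accumulation loop can be collapsed; a sketch with stand-in frames:

```python
import pandas as pd

# stand-ins for the scraped dataframes collected in main_df
main_df = [pd.DataFrame({'A': [1, 2]}), pd.DataFrame({'B': [3]})]

# one call instead of concatenating frame by frame
to_xlsx_df = pd.concat(main_df, ignore_index=True)
print(to_xlsx_df.shape)  # (3, 2)
```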
The code I am using:
import pandas as pd
from collections import Counter
import xlsxwriter

def list_generator(file, savefile):
    # set writer for output filepath
    writer = pd.ExcelWriter(savefile + '.xlsx', engine='xlsxwriter')
    # set dataframe to file(path)
    df = pd.read_csv(file)
    # set split action
    split = lambda x: pd.Series(str(x).split(','))
    # special character list
    specials = ['\\', '?', '/', '*', ':', '[', ']']
    # set columns
    col_list = list(df.columns)
    for j in col_list:
        temp = df[j].apply(split)
        temp_clean = []
        for i, r in temp.iterrows():
            for x in range(len(r)):
                if x in temp_clean:
                    break
                elif (r[x] is None) == True or str(r[x]) == '':
                    break
                else:
                    cleaned = str(r[x])
                    cleaned = cleaned.lstrip()
                    temp_clean.append(cleaned)
                    # temp_clean.append(r[x])
        counted = Counter(temp_clean)
        temp_list = pd.DataFrame(counted.items(), columns=[j, 'count'])
        temp_list = temp_list.dropna()
        for spec in specials:
            if spec in j:
                j = j.replace(spec, '')
        if len(j) > 30:
            j = j[:30]
        temp_list.to_excel(writer, sheet_name=j, index=False)
    writer.save()

list_generator('/content/drive/MyDrive/Maryland/Data/md_res.csv', 'md_res_count')
The files are csv's downloaded from Airtable. I want to split the multi-select columns to get accurate counts of all occurrences, which I get, but I can't understand how I keep getting blank spaces (which I think I figured out?) and also nan values. The output is an xlsx file with sheets that look like:
Also, some of the multi-selects seem to split on commas inside strings as well as on the separators between selections.
Sample sheet cut
Any help would be greatly appreciated! I can elaborate on anything as needed.
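For comparison, the split-and-count can also be expressed with pandas string methods, which avoids the manual row loops and sidesteps counting NaN as the string 'nan'; a sketch assuming a comma-separated multi-select column (the column name tags and the values are made up):

```python
import pandas as pd

# toy stand-in for one multi-select column from the Airtable export
df = pd.DataFrame({'tags': ['a, b', 'b', 'a, c', None]})

counts = (
    df['tags']
    .dropna()           # drop real NaNs instead of counting the string 'nan'
    .str.split(',')     # split each multi-select on commas
    .explode()          # one row per selected item
    .str.strip()        # trim stray spaces around items
    .value_counts()
    .rename_axis('tags')
    .reset_index(name='count')
)
print(counts)
```

Each counts frame could then be written to its own sheet with temp_list.to_excel as above.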
We have this if-else iteration with the goal of splitting a dataframe into several dataframes. The result of this iteration will vary, so we will not know how many dataframes we will get. We want to save those dataframes as text (.txt):
txtDf = open('D:/My_directory/df0.txt', 'w')
txtDf.write(df0)
txtDf.close()

txtDf = open('D:/My_directory/df1.txt', 'w')
txtDf.write(df1)
txtDf.close()

txtDf = open('D:/My_directory/df2.txt', 'w')
txtDf.write(df2)
txtDf.close()

And so on ....
But we want to save those dataframes automatically, so that we don't need to write the code above 100 times for 100 splitted dataframes.
This is the example our dataframe df:
column_df
237814
1249823
89176812
89634
976234
98634
and we would like to split the dataframe df into several dataframes df0, df1, df2 (note: each column will be in its own dataframe, not in one dataframe):
column_df0 column_df1 column_df2
237814 89176812 976234
1249823 89634 98634
We tried this code:
import sys
import copy
import numpy as np
import pandas as pd

df = pd.DataFrame(df)
len(df)
if len(df) > 10:
    print('EXCEEEEEEEEEEEEEEEEEEEEDDD!!!')
    sys.exit()
elif len(df) > 2:
    df_dict = {}
    x = 0
    y = 2
    for df_letter in ['A', 'B', 'C', 'D', 'E', 'F']:
        df_name = f'df_{df_letter}'
        df_dict[df_name] = copy.deepcopy(df_letter)
        df_dict[df_name] = pd.DataFrame(df[x:y]).to_string(header=False, index=False, index_names=False).split('\n ')
        df_dict[df_name] = [','.join(ele.split()) for ele in df_dict[df_name]]
        x += 2
        y += 2
    df_name
else:
    df

for df_ in df_dict:
    print(df_)
    print(f'length: {len(df_dict[df_])}')
    txtDf = open('D:/My_directory/{df_dict[df_]}.txt', 'w')
    txtDf.write(df)
    txtDf.close()
The problem with this code is that we cannot write the several .txt files automatically; everything else works just fine. Can anybody figure it out?
If it is a list, then you can iterate through it and save each element as a string:
for key, value in df_dict.items():
    with open(f'D:/My_directory/{key}.txt', "w") as file:
        file.write('\n'.join(str(v) for v in value))
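The splitting step itself can also be written without the letter list, by slicing the frame in fixed-size steps; a sketch assuming two rows per output file, as in the example data:

```python
import pandas as pd

df = pd.DataFrame({'column_df': [237814, 1249823, 89176812, 89634, 976234, 98634]})

# one two-row frame per eventual .txt file: df0, df1, df2, ...
df_dict = {
    f'df{i // 2}': df.iloc[i:i + 2].reset_index(drop=True)
    for i in range(0, len(df), 2)
}

for name, chunk in df_dict.items():
    print(name, chunk['column_df'].tolist())
```

Each chunk can then be written out with the open(f'D:/My_directory/{name}.txt') pattern shown above.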
I think I have a fairly simple question for a python expert. After a lot of struggling I put together the code underneath. I am opening an excel file, transforming it to a list of lists, and adding a column to this list of lists. Now I want to rename and recalculate the rows of this added column. How do I script it so that I always take the last column of the list of lists, even though the number of columns could differ?
import xlrd
file_location = "path"
workbook = xlrd.open_workbook(file_location)
sheet = workbook.sheet_by_index(0)
data = [[sheet.cell_value(r, c) for c in range(sheet.ncols)] for r in range(sheet.nrows)]
data = [x + [0] for x in data]
If you have a function called calculate_value that takes a row and returns the value for that row, you could do it like this:
def calculate_value(row):
    # calculate it...
    return value

def add_calculated_column(rows, func):
    result_rows = []
    for row in rows:
        # create a new row to avoid changing the old data
        new_row = row + [func(row)]
        result_rows.append(new_row)
    return result_rows

data_with_column = add_calculated_column(data, calculate_value)
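To make this concrete, here is the same helper with a hypothetical calculate_value that sums the numeric cells of each row (the rule and the sample data are made up for the demo):

```python
def add_calculated_column(rows, func):
    # repeated here so the snippet runs on its own
    result_rows = []
    for row in rows:
        result_rows.append(row + [func(row)])
    return result_rows

def calculate_value(row):
    # hypothetical rule: sum every numeric cell in the row
    return sum(v for v in row if isinstance(v, (int, float)))

data = [['alpha', 1.0, 2.0], ['beta', 3.0, 4.0]]
print(add_calculated_column(data, calculate_value))
# [['alpha', 1.0, 2.0, 3.0], ['beta', 3.0, 4.0, 7.0]]
```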
I found an easier and more flexible way of adjusting the values in the last column.
counter = 0
for row in data:  # 'row' instead of 'list' to avoid shadowing the built-in
    counter = counter + 1
    if counter == 1:
        value = 'Vrspng'
    else:
        value = counter
    row.append(value)