I have a Python script that loads an Excel workbook, iterates through all of the rows in a specified column, saves the rows in a dictionary, and writes that dictionary to a .txt file.
The referenced VBScript opens the workbook before openpyxl does and filters it to show only some of the data.
The only problem is that when openpyxl iterates through the workbook, it records every value instead of just the filtered data.
For example, if the original spreadsheet is:
  A B C
1 x x x
2 x y x
3 x x x
and I filter column B to show only rows that contain "x" and then save the workbook, I want openpyxl to iterate through rows 1 and 3 only.
Here is my code:
from openpyxl import load_workbook
from openpyxl import workbook
import os

# sort using the vba script
os.system(r"C:\script.vbs")

# load workbook
path = 'C:/public/temp/workbook.xlsm'
wb = load_workbook(filename=path)
ws = wb.get_sheet_by_name('Sheet3')

# make empty lists
proj_name = []
proj_num = []
proj_status = []

# iterate through rows and append values to the lists
for row in ws.iter_rows('D{}:D{}'.format(ws.min_row, ws.max_row)):
    for cell in row:
        proj_name.append(cell.value)
for row in ws.iter_rows('R{}:R{}'.format(ws.min_row, ws.max_row)):
    for cell in row:
        proj_num.append(cell.value)
for row in ws.iter_rows('G{}:G{}'.format(ws.min_row, ws.max_row)):
    for cell in row:
        proj_status.append(cell.value)

# create dictionary from the lists
dict1 = dict((z[0], list(z[1:])) for z in zip(proj_num, proj_name, proj_status))
with open(r"C:\public\list2.txt", "w") as text_file:
    text_file.write(str(dict1))
Unfortunately openpyxl does not currently include filtering in its functionality. As the documentation notes: "Filters and sorts can only be configured by openpyxl but will need to be applied in applications like Excel."
It looks as though you may have to find another solution ...
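One workaround that may be worth trying: when Excel (or the VBS filter) applies an autofilter and saves the workbook, the rows it filters out are stored as hidden, and openpyxl exposes that flag on each row dimension. A minimal sketch, assuming the filtered workbook has already been saved before loading:

from openpyxl import load_workbook

wb = load_workbook('C:/public/temp/workbook.xlsm')
ws = wb['Sheet3']

proj_name = []
for row_idx in range(ws.min_row, ws.max_row + 1):
    # rows hidden by the saved filter carry hidden=True
    if ws.row_dimensions[row_idx].hidden:
        continue
    proj_name.append(ws.cell(row=row_idx, column=4).value)  # column D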
f is the filter configuration I want to apply (e.g. 'CISCO' only together with 'PAI', or 'BD' only together with 'PAP', or column H equal to 60):
f = {
    'C': ["CISCO", "BD"],
    'E': ["PAI", "PAP"],
    'H': [60]
}

from openpyxl import load_workbook
from openpyxl.utils.cell import column_index_from_string

def filter_data(rows, f_config, skip_header=False):
    # convert column letter to index number (e.g. A=1, B=2)
    new_config = {}
    for col, fil in f_config.items():
        if isinstance(col, str):
            col = column_index_from_string(col)
        new_config[col] = fil
    output = []
    t_filter = len(new_config)
    for n, row in enumerate(rows):
        if n == 0 and skip_header:
            # first row is the header
            continue
        for i, (col, fil) in enumerate(new_config.items()):
            if not isinstance(fil, list):
                fil = [fil]
            val = row[col - 1].value
            # break the loop if any of the conditions is not met
            if val not in fil:
                break
            if i + 1 == t_filter:
                # all conditions were met, add the row to the output
                output.append(row)
    return output
# flexible to edit/filter whichever columns of data you want;
# "sheet" is assumed to be a loaded worksheet, e.g. sheet = load_workbook(path)['Sheet3']
data1 = filter_data(sheet.rows, {"C": "CISCO", "E": "PAI"}, skip_header=True)
# a filter can hold two possibilities, either str or value
data2 = filter_data(data1, {"H": ["60", 60]})
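For completeness, a minimal end-to-end sketch, assuming the workbook path and sheet name from the question above:

wb = load_workbook('C:/public/temp/workbook.xlsm')
sheet = wb['Sheet3']

# apply the whole config f in one pass; each returned row is a tuple of cells
filtered = filter_data(sheet.rows, f, skip_header=True)
for row in filtered:
    print([cell.value for cell in row])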
So this is kind of weird, but I'm new to Python and I'm committed to seeing my first project with Python through to the end.
I am reading about 100 .xlsx files in from a file path. I then trim each file and send only the important information to a list, as an individual and unique dataframe. So now I have a list of 100 unique dataframes, but iterating through the list and writing to Excel just overwrites the data in the file. I want to append to the end of the .xlsx file. The biggest catch to all of this is that I can only use Excel 2010; I do not have any other version of the application. The openpyxl library seems to have some interesting stuff, and I've tried something like this:
from openpyxl import load_workbook
from openpyxl.utils.dataframe import dataframe_to_rows

wb = load_workbook(outfile_path)
ws = wb.active
for frame in main_df_list:
    for r in dataframe_to_rows(frame, index=True, header=True):
        ws.append(r)
Note: In another post I was told it's not best practice to read dataframes line by line using loops, but when I started I didn't know that. I am however committed to this monstrosity.
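One thing worth noting: the snippet above never writes anything back to disk, so a save call is needed at the end (a one-line addition, assuming outfile_path is the target file):

wb.save(outfile_path)  # without this, the appended rows are never persisted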
Edit after reading Comments
So my code scrapes .xlsx files and stores specific data, based on a keyword comparison, into dataframes. These dataframes are stored in a list. I will list the entirety of the program below so hopefully I can explain what's in my head. Also, feel free to roast my code, because I have no idea what actually is good Python practice vs. not.
import os
import pandas as pd
from openpyxl import load_workbook

# the file path I want to pull from
in_path = r'W:\R1_Manufacturing\Parts List Project\Tool_scraping\Excel'
# the file path where row search items are stored
search_parameters = r'W:\R1_Manufacturing\Parts List Project\search_params.xlsx'
# the file I will write the dataframes to
outfile_path = r'W:\R1_Manufacturing\Parts List Project\xlsx_reader.xlsx'

# establishing the lists that I will store looped data in
file_list = []
main_df = []
master_list = []

# open the file path to store the directory in files
files = os.listdir(in_path)

# database with terms that I want to track
search = pd.read_excel(search_parameters)
search_size = search.index

# searching only for files that end with .xlsx
for file in files:
    if file.endswith('.xlsx'):
        file_list.append(in_path + '/' + file)

# read the files into a dataframe; main loop the files will be manipulated in
for current_file in file_list:
    df = pd.read_excel(current_file)
    # get column headers and a range for total rows
    columns = df.columns
    total_rows = df.index
    # lists to store where headers are located in the dataframe
    row_list = []
    column_list = []
    header_list = []
    for name in columns:
        for number in total_rows:
            cell = df.at[number, name]
            if isinstance(cell, str) == False:
                continue
            elif cell == '':
                continue
            for place in search_size:
                search_loop = search.at[place, 'Parameters']
                # main compare: if str and matches search params, then do...
                if insensitive_compare(search_loop, cell) == True:
                    if cell not in header_list:
                        header_list.append(df.at[number, name])  # store data headers
                        row_list.append(number)    # store the row number where it is in that dataframe
                        column_list.append(name)   # store the column name where it is in that dataframe
                    else:
                        continue
                else:
                    continue
    for thing in column_list:
        df = pd.concat([df, pd.DataFrame(0, columns=[thing], index=range(2))], ignore_index=True)
    # turns the dataframe into a set of booleans, True where
    # there's something there
    na_finder = df.notna()
    # create a new dataframe to write the output to
    outdf = pd.DataFrame(columns=header_list)
    for i in range(len(row_list)):
        k = 0
        while na_finder.at[row_list[i] + k, column_list[i]] == True:
            # I turn the dataframe into booleans and read until False
            if df.at[row_list[i] + k, column_list[i]] not in header_list:
                # store the actual data in my output dataframe, outdf
                outdf.at[k, header_list[i]] = df.at[row_list[i] + k, column_list[i]]
            k += 1
    main_df.append(outdf)
So main_df is a list that has 100+ dataframes in it. For this example I will only use 2 of them. I would like them to print out into Excel like this:
The comment from Ashish really helped me: all of the dataframes had different column titles, so my 100+ dataframes eventually concat'd to a dataframe that is 569x52. Here is the code that I used. I completely abandoned openpyxl, because once I was able to concat all of the dataframes together, I just had to export it using pandas:
# what I want to do here is grab all the data in the same column as each
# header, then move to the next column
for i in range(len(row_list)):
    k = 0
    while na_finder.at[row_list[i] + k, column_list[i]] == True:
        if df.at[row_list[i] + k, column_list[i]] not in header_list:
            outdf.at[k, header_list[i]] = df.at[row_list[i] + k, column_list[i]]
        k += 1
main_df.append(outdf)

to_xlsx_df = pd.DataFrame()
for frame in main_df:
    to_xlsx_df = pd.concat([to_xlsx_df, frame])
to_xlsx_df.to_excel(outfile_path)
The output to excel ended up looking something like this:
Hopefully this can help someone else out too.
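As an aside, pd.concat also accepts the whole list in a single call, which avoids re-concatenating inside the loop; a small equivalent sketch (ignore_index resets the row labels, so drop it if they matter):

import pandas as pd

# concatenate every collected frame at once, then export
to_xlsx_df = pd.concat(main_df, ignore_index=True)
to_xlsx_df.to_excel(outfile_path)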
The code I am using:
import pandas as pd
from collections import Counter
import xlsxwriter

def list_generator(file, savefile):
    # set the writer for the output filepath
    writer = pd.ExcelWriter(savefile + '.xlsx', engine='xlsxwriter')
    # read the file(path) into a dataframe
    df = pd.read_csv(file)
    # split action for the multi-select cells
    split = lambda x: pd.Series(str(x).split(','))
    # special characters that are not allowed in Excel sheet names
    specials = ['\\', '?', '/', '*', ':', '[', ']']
    # columns to process
    col_list = list(df.columns)
    for j in col_list:
        temp = df[j].apply(split)
        temp_clean = []
        for i, r in temp.iterrows():
            for x in range(len(r)):
                if x in temp_clean:
                    break
                elif (r[x] is None) == True or str(r[x]) == '':
                    break
                else:
                    cleaned = str(r[x])
                    cleaned = cleaned.lstrip()
                    temp_clean.append(cleaned)
                    # temp_clean.append(r[x])
        counted = Counter(temp_clean)
        temp_list = pd.DataFrame(counted.items(), columns=[j, 'count'])
        temp_list = temp_list.dropna()
        # strip characters that are not allowed in sheet names
        for spec in specials:
            if spec in j:
                j = j.replace(spec, '')
        # sheet names are also limited in length
        if len(j) > 30:
            j = j[:30]
        temp_list.to_excel(writer, sheet_name=j, index=False)
    writer.save()

list_generator('/content/drive/MyDrive/Maryland/Data/md_res.csv', 'md_res_count')
The files are CSVs downloaded from Airtable. I want to split the multi-select columns to get accurate counts of all occurrences, which I get, but I can't understand how I keep getting blank spaces (which I think I figured out?) but also nan values. The output is an xlsx file with sheets that look like:
Also, some of the multi-selects seem to split on commas inside a string value as well as on the separating commas.
(Screenshot: sample sheet cut)
Any help would be greatly appreciated! I can elaborate on anything as needed.
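For what it's worth, here is a minimal demonstration (toy data, not the Airtable file) of where the nan values likely come from: when the per-cell split produces rows of different lengths, pandas pads the short rows with NaN, and str(NaN) is the string 'nan' rather than '', so the None/'' checks above never trigger:

import pandas as pd

s = pd.Series(['a,b,c', 'a'])
temp = s.apply(lambda x: pd.Series(str(x).split(',')))
print(temp)
# shorter rows are padded with NaN by the column alignment:
#    0    1    2
# 0  a    b    c
# 1  a  NaN  NaN
val = temp.iloc[1, 1]
print(val is None, str(val))  # False 'nan'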
I would like to use the first column of each row in an Excel spreadsheet as a key and the rest of the values in that row as its value, so that I can store them in a dictionary.
The problem is that when I loop through the rows and columns, all of the row values get stored in every key.
import openpyxl
from openpyxl import load_workbook

file = "test.xlsx"
# load the workbook
wb_obj = load_workbook(filename=file)
wsheet = wb_obj['test']

# dictionary to store data
dataDict = {}
value = []
row_count = wsheet.max_row
col_count = wsheet.max_column

# loop to get row and column values
for i in range(2, row_count + 1):
    for j in range(i, col_count + 1):
        key = wsheet.cell(row=i, column=1).value
        print(key)
        value.append(wsheet.cell(row=i, column=j).value)
        print(value)
        dataDict[key] = value

# prompt the user for input
userInput = input("Please enter an id to find a person's details: ")
print(dataDict.get(int(userInput)))
data in spreadsheet:
Result I'm expecting:
{1: ['John', 'Doe', 4567891234, 'johndoe#jd.ca'], 2: ['Wayne', 'Kane', 1234567891, 'wd#wd.ca']}
Result I got:
{1: ['John', 'Doe', 4567891234, 'johndoe#jd.ca', 'Kane', 1234567891, 'wd#wd.ca'], 2: ['John', 'Doe', 4567891234, 'johndoe#jd.ca', 'Kane', 1234567891, 'wd#wd.ca']}
Openpyxl already has a proper way to iterate through rows using worksheet.iter_rows(). You can use it to unpack the first cell's value as the key and the values from the other cells as the list in the dictionary, repeating for every row.
from openpyxl import load_workbook

file = "test.xlsx"
# load the workbook
wb_obj = load_workbook(filename=file)
wsheet = wb_obj['test']

dataDict = {}
for key, *values in wsheet.iter_rows():
    dataDict[key.value] = [v.value for v in values]
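If the sheet has a header row like the sample data, iter_rows(min_row=2) skips it; a short usage sketch matching the question's lookup:

dataDict = {}
for key, *values in wsheet.iter_rows(min_row=2):  # min_row=2 skips the header row
    dataDict[key.value] = [v.value for v in values]

userInput = input("Please enter an id to find a person's details: ")
print(dataDict.get(int(userInput)))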
I have a file with about 25 sheets, and each sheet contains 5-30 columns with system names as headers. I want to iterate through a list of about 170 systems (the list is on one of the sheets in the main file) and, for each system, search each tab for columns with the matching system as the header. I have the code below, and it works great for the first iteration, but for some reason, after it loops through all the sheets and moves on to the second system, it pulls the sheet name rather than the second system name. Does anyone see what I'm doing wrong?
import pandas as pd

matrix = pd.ExcelFile('file')
names_tab = pd.read_excel(matrix, sheet_name='Name_Test')
sheets_list = {}
for (y, sysRows) in names_tab.iterrows():
    print(sysRows['header'])
    for sheets in matrix.sheet_names[1:]:
        sheets_list['{}'.format(sheets)] = pd.read_excel(matrix, sheet_name='{}'.format(sheets), skiprows=2)
        print(sheets)
        for column in sheets_list[sheets]:
            if column == sysRows['header']:
                for idx, row in sheets_list[sheets][column].iteritems():
                    if sheets_list[sheets].iloc[idx][column] == 'x':
                        print('{} has X in row {} column {} on sheet {}'
                              .format(sysRows['header'], idx, column, sheets))
                    elif sheets_list[sheets].iloc[idx][column] == 'X':
                        print('{} has X in row {} column {} on sheet {}'
                              .format(sysRows['header'], idx, column, sheets))
                print(column + ' works')
            else:
                print(column + ' doesnt work')
I'm not totally sure this is the same result you are trying to achieve, but hopefully it is a starting point (I doubt you need 4 for loops):
import pandas as pd
import numpy as np

names_tab = pd.DataFrame({'header': ['System1', 'System2', 'System3'],
                          'some_other_column': ['foo', 'bar', 'foobar']})
sheet1 = pd.DataFrame({'System1': ['x', 'X'], 'System2': ['x', 'X'], 'System4': ['X', 'x']})
sheet2 = pd.DataFrame({'System2': ['X', 'x'], 'System8': ['x', 'x'], 'System3': ['x', 'X']})
sheets = [sheet1, sheet2]

for i, sheet in enumerate(sheets):
    print("Sheet", i + 1)
    common_columns = list(set(sheet.columns.tolist()).intersection(names_tab['header'].tolist()))
    df = sheet[common_columns]
    print("Here are all the 'x' values in Sheet", i + 1)
    print(df.where(df == 'x'))
    # to get your behavior: np.where returns (row_indices, col_indices), so zip them
    positions = np.where(df.values == 'x')
    for idx, col in zip(*positions):
        print('{} has x in row {} column {} on sheet {}'.format(df.columns[col], idx, col, str(i + 1)))
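Since the original code treats 'x' and 'X' the same, one option (a sketch reusing df and i from the loop above, not tested against the real file) is a case-insensitive mask built with isin:

# match both cases in a single pass
mask = df.isin(['x', 'X'])
for idx, col in zip(*np.where(mask.values)):
    print('{} has X in row {} column {} on sheet {}'.format(df.columns[col], idx, col, i + 1))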
Perhaps you could provide a Minimal, Complete, and Verifiable example.
I've written a script that takes a large Excel spreadsheet of data, strips away unwanted columns and rows that contain zero values in particular columns, and then saves out to a CSV. The piece that I'm stuck on is that I'm also trying to remove rows that are missing cells. The way I was trying this was:
for each_row in row_list:
    if not all(map(len, each_row)):
        continue
    else:
        UICData.append(row_list)
But this isn't working correctly as I'm getting the error:
File "/Users/kenmarold/PycharmProjects/sweetCrude/Work/sweetCrude.py", line 56, in PrepareRawData
    if not all(map(len, each_row)) :
TypeError: 'float' object is not iterable
I'm not exactly sure how to resolve this; what's the way forward? I've also attached the full script below.
#!/usr/bin/env python3
import os
import sqlite3
import csv
import unicodecsv
from datetime import date
from xlrd import open_workbook, xldate_as_tuple
from xlwt import Workbook

orig_xls = 'data/all_uic_wells_jun_2016.xls'
temp_xls = 'data/temp.xls'
new_csv = 'data/gh_ready_uic_well_data.csv'
temp_csv = 'data/temp.csv'
input_worksheet_index = 0  # XLS sheet number
output_workbook = Workbook()
output_worksheet = output_workbook.add_sheet('Sweet Crude')
lat_col_index = 13
long_col_index = 14

#### SELECT AND FORMAT DATA
def PrepareRawData(inputFile, tempXLSFile, tempCSVFile, outputFile):
    # 0 = API#           # 7 = Approval Date
    # 1 = Operator       # 13 = Latitude
    # 2 = Operator ID    # 14 = Longitude
    # 3 = Well Type      # 15 = Zone
    keep_columns = [0, 1, 2, 3, 7, 13, 14, 15]
    with open_workbook(inputFile) as rawUICData:
        UICSheet = rawUICData.sheet_by_index(input_worksheet_index)
        UICData = []
        for each_row_index in range(1, UICSheet.nrows - 1, 1):
            row_list = []
            lat_num = UICSheet.cell_value(each_row_index, lat_col_index)    # get lat values
            long_num = UICSheet.cell_value(each_row_index, long_col_index)  # get long values
            if lat_num != 0.0 and long_num != 0.0:  # find zero lat/long values
                for each_column_index in keep_columns:
                    cell_value = UICSheet.cell_value(each_row_index, each_column_index)
                    cell_type = UICSheet.cell_type(each_row_index, each_column_index)
                    if cell_type == 3:
                        date_cell = xldate_as_tuple(cell_value, rawUICData.datemode)
                        date_cell = date(*date_cell[0:3]).strftime('%m/%d/%Y')
                        row_list.append(date_cell)
                    else:
                        row_list.append(cell_value)
            for each_row in row_list:
                if not all(map(len, each_row)):
                    continue
                else:
                    UICData.append(row_list)
            # CreateDB(row_list)  # send row data to database
    for each_list_index, output_list in enumerate(UICData):
        for each_element_index, element in enumerate(output_list):
            output_worksheet.write(each_list_index, each_element_index, element)
    output_workbook.save(tempXLSFile)

    #### RUN XLS-CSV CONVERSION
    workbook = open_workbook(tempXLSFile)
    sheet = workbook.sheet_by_index(input_worksheet_index)
    fh = open(outputFile, 'wb')
    csv_out = unicodecsv.writer(fh, encoding='utf-8')
    for each_row_number in range(sheet.nrows):
        csv_out.writerow(sheet.row_values(each_row_number))
    fh.close()

    #### KILL TEMP FILES
    filesToRemove = [tempXLSFile]
    for each_file in filesToRemove:
        os.remove(each_file)
    print("Raw Data Conversion Ready for Grasshopper")

# ---------------------------------------------------
PrepareRawData(orig_xls, temp_xls, temp_csv, new_csv)
# ---------------------------------------------------
This is a dirty patch.
for each_row in row_list:
    if not isinstance(each_row, list):
        each_row = [each_row]
    if not any(map(len, each_row)):
        continue
    UICData.append(each_row)
EDIT: If the any/map/len still raises, then I would try a different route to check whether the row is empty.
Also, I'm not sure why you are appending the entire row_list and not the current row, so I changed the patch to append each_row.
Option 1
for each_row in row_list:
    if not each_row:
        continue
    UICData.append(each_row)
Option 2
keep_data = [arow for arow in row_list if arow]  # or whatever logic; this will be faster
UICData.append(keep_data)
Your row_list contains a set of values, for example:
[1.01, 75, 3.56, ...]
When you call for each_row in row_list:, you assign a float value to each_row for every iteration of the loop.
You then try to do this:
if not all(map(len, each_row)):
Python's map function expects an iterable as its second argument and tries to iterate over it to apply the function len to each item. You can't iterate a float.
I'm not entirely sure what you are trying to do here, but if you are wanting to check that none of the items in your row_list are None or an empty string, then you could do:
if None not in row_list and '' not in row_list:
    UICData.append(row_list)
Your overall objective appears to be to copy selected columns from all rows of one sheet of an Excel XLS file to a CSV file. Each output row must contain only valid cells, for some definition of "valid".
As you have seen, using map() is not a good idea; it's only applicable if all the fields are text. You should apply tests depending generally on the datatype and specifically on the individual column. Once you have validated the items in the row, you are in a position to output the data. You have chosen a path which (1) builds a list of all output rows, (2) uses xlwt to write a temp XLS file, and (3) uses xlrd to read the temp file and unicodecsv to write a CSV file. Please consider avoiding all that; instead, just use unicodecsv.writer.writerow(row_list).
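A rough sketch of that last suggestion, writing validated rows straight to the CSV and skipping the temp XLS round trip entirely; is_valid_row here is a hypothetical stand-in for whatever per-column checks fit the data:

import unicodecsv

def is_valid_row(row):
    # hypothetical validation: reject rows with empty or missing cells
    return all(cell not in (None, '') for cell in row)

with open(new_csv, 'wb') as fh:
    csv_out = unicodecsv.writer(fh, encoding='utf-8')
    for row_list in UICData:  # rows built by the column-selection loop above
        if is_valid_row(row_list):
            csv_out.writerow(row_list)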
Once you have validated the items in the row, you are in a position to output the data. You have chosen a path which (1) builds a list of all output rows (2) uses xlwt to write to a temp XLS file (3) uses xlrd to read the temp file and unicodecsv to write a CSV file. Please consider avoiding all that; instead just use unicodecsv.writer.writerow(row_list)