How to Parse Pandas ExcelFile sheets with generic headers - python

I have many excel files with multiple sheets where the first row is not a header but data. How can I parse each sheet without pandas treating row 0 as the header? Having the first row of data promoted to my headers is a pain.
Failing that, what is the best way to insert a column index into the first row of data?
My code is simple:
import pandas as pd

path_list = [...] #list of paths to .xls files
data_sheets = [] #container for parsed sheets
for file_ in path_list:
    excel_file_obj = pd.ExcelFile(file_)
    for sheet in excel_file_obj.sheet_names: #iterate the sheet names, not the ExcelFile itself
        data_sheet = excel_file_obj.parse(sheet)
        data_sheets.append(data_sheet)
I can't for the life of me figure out how to push the column index down into the first row of data. I basically want a df.reset_index(False) type solution, but for columns. Does such a thing exist?
One extremely hackish way round would seem to be to do this for each data sheet:
first_row = data_sheet.columns
generic_cols = ['col' + str(x) for x in xrange(len(data_sheet.columns))]
data_sheet.index = [x for x in xrange(1, len(data_sheet) + 1)]
data_sheet.columns = generic_cols
for_concat = pd.DataFrame({col: val for col, val in zip(generic_cols, first_row)}, index=[0])
new_sheet = pd.concat([for_concat, data_sheet])
There must be a better way. All help appreciated...
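A couple of hedged, untested sketches that seem to cover both asks: header=None is pandas' documented way to stop the first row being promoted, and a transpose round-trip mimics reset_index for columns.

import pandas as pd

#header=None keeps every row as data; columns get default integer labels instead
for file_ in path_list:
    excel_file_obj = pd.ExcelFile(file_)
    for sheet in excel_file_obj.sheet_names:
        data_sheets.append(excel_file_obj.parse(sheet, header=None))

#and to push an existing column index down into the first row:
#transpose, move the index into the data, transpose back
new_sheet = data_sheet.T.reset_index().T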

Related

Python code to copy and update excel formulas dynamically

Target: I am trying to split an excel file into multiple files based on some filter given within the sheet.
Problem: while copying the formula columns, the row references inside the formulas are not updated when the rows are split into multiple sheets.
For example: in the master file the formula in row 11 is =LEFT(B11, FIND(" ", B11, 1)). That row becomes the first row in the new split file, but the formula still refers to row 11, which gives a #VALUE! error in the new file.
Any ideas on how to resolve this one?
I have tried achieving this using pandas and openpyxl and failed; please find the code below.
To load the file:
from openpyxl import load_workbook
import pandas as pd

wb = load_workbook(filepath)
sheets = wb.sheetnames #get_sheet_names() is deprecated
sheet_name = wb[sheets[0]]
master_df = pd.DataFrame(sheet_name.values) #DataFrame() takes no index=False argument
master_df.columns = master_df.iloc[0]
master_df = master_df[1:]
print(master_df)
To split and export the file:
temp_df = master_df[master_df['Filter Column'] == filter_criteria]
sp.export_file(temp_df, output_path + "/" + <"output file name">)
from openpyxl.formula.translate import Translator

def update_formula(df: pd.DataFrame, formula_col):
    '''
    Function to update formulas for each Manager.
    :param df: DataFrame for one specific manager.
    :param formula_col: mapping of column name -> (column letter, template formula)
    '''
    for _col in formula_col:
        col_alpha = formula_col[_col][0]
        formula = formula_col[_col][1]
        index = 2
        for ind, row in df.iterrows():
            df.at[ind, _col] = Translator(formula, origin=col_alpha + '2').translate_formula(col_alpha + str(index))
            index = index + 1
Here I am giving a DataFrame and a mapping of the columns which have formulas in them as input. I then iterate over the DataFrame and update the formula for each cell in those columns using openpyxl's Translator.
This is the best solution I have figured yet.
Please let me know if there is a better way.
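For anyone unfamiliar with it, openpyxl's Translator re-anchors a formula from one origin cell to another, which is exactly what fixes the stale row references. A minimal sketch (the cell addresses here are made up for illustration):

from openpyxl.formula.translate import Translator

#re-anchor a row-11 formula so it works from row 2 of the split file
formula = '=LEFT(B11, FIND(" ", B11, 1))'
print(Translator(formula, origin='C11').translate_formula('C2'))
#expected output: =LEFT(B2, FIND(" ", B2, 1))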

Is there a way to export a list of 100+ dataframes to excel?

So this is kind of weird but I'm new to Python and I'm committed to seeing my first project with Python through to the end.
So I am reading about 100 .xlsx files in from a file path. I then trim each file and send only the important information to a list, as an individual and unique dataframe. So now I have a list of 100 unique dataframes, but iterating through the list and writing to excel just overwrites the data in the file; I want to append to the end of the .xlsx file. The biggest catch to all of this is that I can only use Excel 2010, I do not have any other version of the application. The openpyxl library seems to have some interesting stuff, and I've tried something like this:
from openpyxl import load_workbook
from openpyxl.utils.dataframe import dataframe_to_rows

wb = load_workbook(outfile_path)
ws = wb.active
for frame in main_df_list:
    for r in dataframe_to_rows(frame, index=True, header=True):
        ws.append(r)
wb.save(outfile_path) #append happens in memory; save to push it to disk
Note: In another post I was told it's not best practice to read dataframes line by line using loops, but when I started I didn't know that. I am however committed to this monstrosity.
Edit after reading Comments
So my code scrapes .xlsx files and stores specific data, based on a keyword comparison, into dataframes. These dataframes are stored in a list. I will list the entirety of the program below so hopefully I can explain what's in my head. Also, feel free to roast my code, because I have no idea what good python practice looks like.
import os
import pandas as pd
from openpyxl import load_workbook

#the file path I want to pull from
in_path = r'W:\R1_Manufacturing\Parts List Project\Tool_scraping\Excel'
#the file path where row search items are stored
search_parameters = r'W:\R1_Manufacturing\Parts List Project\search_params.xlsx'
#the file I will write the dataframes to
outfile_path = r'W:\R1_Manufacturing\Parts List Project\xlsx_reader.xlsx'

#establishing the lists that I will store looped data into
file_list = []
main_df = []
master_list = []

#open the file path to store the directory in files
files = os.listdir(in_path)

#database with terms that I want to track
search = pd.read_excel(search_parameters)
search_size = search.index

#searching only for files that end with .xlsx
for file in files:
    if file.endswith('.xlsx'):
        file_list.append(in_path + '/' + file)

#read in the files to a dataframe, main loop the files will be manipulated in
for current_file in file_list:
    df = pd.read_excel(current_file)
    #get column headers and a range for total rows
    columns = df.columns
    total_rows = df.index
    #lists to store where headers are located in the DF
    row_list = []
    column_list = []
    header_list = []
    for name in columns:
        for number in total_rows:
            cell = df.at[number, name]
            if isinstance(cell, str) == False:
                continue
            elif cell == '':
                continue
            for place in search_size:
                search_loop = search.at[place, 'Parameters']
                #main compare, if str and matches search params, then do...
                #(insensitive_compare is a helper function defined elsewhere)
                if insensitive_compare(search_loop, cell) == True:
                    if cell not in header_list:
                        header_list.append(df.at[number, name]) #store data headers
                        row_list.append(number) #store row number where it is in that data frame
                        column_list.append(name) #store column name where it is in that data frame
                    else:
                        continue
                else:
                    continue
    for thing in column_list:
        df = pd.concat([df, pd.DataFrame(0, columns=[thing], index=range(2))], ignore_index=True)
    #turns the dataframe into a set of booleans where it's true if
    #there's something there
    na_finder = df.notna()
    #create a new dataframe to write the output to
    outdf = pd.DataFrame(columns=header_list)
    for i in range(len(row_list)):
        k = 0
        while na_finder.at[row_list[i] + k, column_list[i]] == True:
            #I turn the dataframe into booleans and read until False
            if df.at[row_list[i] + k, column_list[i]] not in header_list:
                #store the actual values in my output dataframe, outdf
                outdf.at[k, header_list[i]] = df.at[row_list[i] + k, column_list[i]]
            k += 1
    main_df.append(outdf)
So main_df is a list that has 100+ dataframes in it. For this example I will only use 2 of them. I would like them to print out into excel one after the other. (Screenshot of the desired layout omitted.)
So the comment from Ashish really helped me. All of the dataframes had different column titles, so my 100+ dataframes eventually concat'd into a dataframe that is 569x52. Here is the code that I used; I completely abandoned openpyxl, because once I was able to concat all of the dataframes together, I just had to export the result using pandas:
# what I want to do here is grab all the data in the same column as each
# header, then move to the next column
for i in range(len(row_list)):
    k = 0
    while na_finder.at[row_list[i] + k, column_list[i]] == True:
        if df.at[row_list[i] + k, column_list[i]] not in header_list:
            outdf.at[k, header_list[i]] = df.at[row_list[i] + k, column_list[i]]
        k += 1
main_df.append(outdf)

to_xlsx_df = pd.DataFrame()
for frame in main_df:
    to_xlsx_df = pd.concat([to_xlsx_df, frame])
to_xlsx_df.to_excel(outfile_path)
The output in excel ended up as one combined sheet, 569 rows by 52 columns in my case. (Screenshot omitted.)
Hopefully this can help someone else out too.
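One small side note on the loop above: concatenating inside the loop re-copies the accumulated frame on every pass, and pd.concat will take the whole list at once. A sketch of the equivalent one-shot version (assuming main_df is the list built earlier):

#build the combined frame in a single call instead of growing it per iteration
to_xlsx_df = pd.concat(main_df)
to_xlsx_df.to_excel(outfile_path)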

XLRD cannot read multiindex column name

I have a problem with multiindex column names. I'm using XLRD to convert excel data to JSON with json.dumps, but it only gives me one row of column names. I have read about multilevel JSON, but I have no idea how to produce it using XLRD.
My table has a two-row column header; for example, ID and Tax ID sit on the top level, while Tax spans sub-columns such as Date, Total, and Grand Total (see the desired JSON below).
Sample of code:
import json
import xlrd

#skip_row, skip_row2 and check_skip are my own helper functions, defined elsewhere
for i in path:
    with xlrd.open_workbook(i) as wb:
        print([i])
        kwd = 'sage'
        print(wb.sheet_names())
        for j in range(wb.nsheets):
            worksheet = wb.sheet_by_index(j)
            data = []
            n = 0
            nn = 0
            keyword = 'sage'
            keyword2 = 'adm'
            try:
                skip = skip_row(worksheet, n, keyword)
                keys = [v.value for v in worksheet.row(skip)]
            except:
                try:
                    skip = skip_row2(worksheet, nn, keyword2)
                    keys = [v.value for v in worksheet.row(skip)]
                except:
                    continue
            print(keys)
            for row_number in range(check_skip(skip), worksheet.nrows):
                if row_number == 0:
                    continue
                row_data = {}
                for col_number, cell in enumerate(worksheet.row(row_number)):
                    row_data[keys[col_number]] = cell.value
                data.append(row_data)
            print(json.dumps({'Data': data}))
Oh, by the way: each worksheet has a different number of rows to skip before the column names appear, which is why my code has the skip-row functions. After skipping those rows I have the exact location of my column names, and then I start to read the values. But that is where the problem comes from, as I see it: I end up with two rows of column names, and I'm still confused about how to build multilevel JSON with XLRD, or at least how to join the two header rows with XLRD (which I guess it can't do).
Desired outcome multilevel json:
{ "Data":[{ "ID" : "997", "Tax" : [{"Date" : "9/7/2019", "Total" : 2300, "Grand Total" : 340000"}], "Tax ID" : "ST-000", .... }]}
PS: I've tried to use pandas, but it gives me a lot of trouble since I work with big data.
You can use multi-indexing in pandas. First you need to get the header row indexes for each sheet:
header_indexes = get_header_indexes(excel_filepath, sheet_index) #returns list of header indexes
You need to write the get_header_indexes function yourself; it scans a sheet and returns the header row indexes.
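A minimal sketch of such a function (the keyword heuristic and the assumption of exactly two header rows are illustrative, not part of the answer):

import xlrd

def get_header_indexes(excel_filepath, sheet_index, keyword='sage'):
    #scan for the first row containing the keyword and treat it and the
    #following row as the two header rows of the multiindex
    sheet = xlrd.open_workbook(excel_filepath).sheet_by_index(sheet_index)
    for r in range(sheet.nrows):
        if any(keyword in str(cell.value).lower() for cell in sheet.row(r)):
            return [r, r + 1]
    raise ValueError('header rows not found')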
You can use pandas to get JSON from the dataframe:
import pandas as pd
df = pd.read_excel(excel_filepath, header=header_indexes, sheet_name=sheet_index)
data = df.to_dict(orient="records")
For multiple headers, data contains a list of dicts, and each dict has a tuple as its key; you can reformat it into the final JSON as per your requirement.
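For example, a hedged sketch of that reformatting step (assuming a two-level header, so every key is a 2-tuple or a plain string):

import json

def nest_record(record):
    #collapse ('Tax', 'Total')-style tuple keys into nested dicts;
    #single columns read with a two-row header may come back as
    #('ID', 'Unnamed: ...') tuples and need extra cleanup
    nested = {}
    for key, value in record.items():
        if isinstance(key, tuple):
            top, sub = key[0], key[-1]
            nested.setdefault(top, {})[sub] = value
        else:
            nested[key] = value
    return nested

json_ready = json.dumps({'Data': [nest_record(r) for r in data]})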
Note: for very large inputs, read in pieces rather than all at once (pandas' read_csv has chunksize for this; read_excel instead offers nrows and skiprows).

Import CSV and create one list for each column in Python

I am processing a CSV file in python that's delimited by a comma (,).
Each column is a sampled parameter; for instance, column 0 is time, sampled once a second, column 1 is altitude, sampled 4 times a second, etc.
So the columns will look like the below:
Column 0 -> ["Time", 0, " "," "," ",1]
Column 1 -> ["Altitude", 100, 200, 300, 400]
I am trying to create a list for each column that captures its name and all its data, so that I can do calculations and organize my data into a new file automatically (the sampled data I am working with has a substantial number of rows).
I want to do this for any file not just one, so the number of columns can vary.
Normally if every file was consistent I would do something like:
import csv

time = []
alt = []
dct = {}
with open('test.csv', "r") as csvfile:
    csv_f = csv.reader(csvfile)
    for row in csv_f:
        time.append(row[0])
        alt.append(row[1]) #etc for all columns
I am pretty new to python. Is this a good way to tackle this, and if not, what would be a better methodology?
Thanks for your time
Pandas will probably work best for you. If you use read_csv from pandas, it will create a DataFrame keyed by column; it's roughly a dictionary of lists.
You can also use the .tolist() method of a pandas Series to convert a column to a list, if you want a list specifically.
import pandas as pd

data = pd.read_csv("soqn.csv")
dict_of_lists = {}
for column_name in data.columns:
    temp_list = data[column_name].tolist()
    dict_of_lists[column_name] = temp_list
print(dict_of_lists)
EDIT:
dict_of_lists = {column_name: data[column_name].tolist() for column_name in data.columns}
#This dict comprehension might work faster.
I think I made my problem simpler and just focused on one column.
What I ultimately wanted to do was to interpolate to the highest sampling rate. So here is what I came up with; please let me know if I can do anything more efficiently. I used A LOT of searching on this site to help build this. Again, I am new at Python (about 2-3 weeks, but with some former programming experience).
import csv

header = []
#initialize variables
loc_int = 0
loc_fin = 0
temp_i = 0
temp_f = 0

with open('test2.csv', "r") as csvfile: # open csv file
    csv_f = csv.reader(csvfile)
    for row in csv_f:
        header.append(row[0]) #make a list that consists of all content in column A

for x in range(0, len(header) - 1): #go through entire column
    if header[x].isdigit() and header[x + 1] == "": # find lower bound of sample to be interpolated
        loc_int = x
        temp_i = int(header[x])
    elif header[x + 1].isdigit() and header[x] == "": # find upper bound of sample to be interpolated
        loc_fin = x
        temp_f = int(header[x + 1])
        if temp_f > temp_i: #calculate interpolated values
            f_min_i = temp_f - temp_i
            interp = f_min_i / float((loc_fin + 1) - loc_int)
            for y in range(loc_int, loc_fin + 1):
                header[y] = temp_i + interp * (y - loc_int)

print header

with open("output.csv", 'wb') as g: #write to new file
    writer = csv.writer(g)
    for item in header:
        writer.writerow([item])
I couldn't figure out how to write my new list "header", with its interpolated values, back over column A of my old file, test2.csv.
Anywho thank you very much for looking...
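For a later reader, a minimal sketch of that in-place replacement (untested; it assumes header holds one value per row of test2.csv, and keeps the Python 2 style csv idioms used above):

import csv

with open('test2.csv', 'r') as f:
    rows = list(csv.reader(f)) #read the whole file into memory
for i, row in enumerate(rows):
    row[0] = header[i] #overwrite column A with the interpolated value
with open('test2.csv', 'wb') as g:
    csv.writer(g).writerows(rows)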

In python, removing rows from an excel file using xlrd, xlwt, and xlutils

Hello everyone and thank you in advance.
I have a python script where I am opening a template excel file, adding data (while preserving the style), and saving it again. I would like to be able to remove the rows that I did not edit before saving out the new xls file. My template xls file has a footer, so I want to delete the extra rows before the footer.
Here is how I am loading the xls template:
self.inBook = xlrd.open_workbook(file_path, formatting_info=True)
self.outBook = xlutils.copy.copy(self.inBook)
self.outBookCopy = xlutils.copy.copy(self.inBook)
I then write the info to outBook while grabbing the style from outBookCopy and applying it to each row that I modify in outBook.
So how do I delete rows from outBook before writing it out? Thanks everyone!
I achieved this using the pandas package:
import pandas as pd

#Read from Excel
xl = pd.ExcelFile("test.xls")
#Parse the first Excel sheet to a DataFrame
dfs = xl.parse(xl.sheet_names[0])
#Update the DataFrame as per requirement
#(here, removing the rows having a blank value in the "Name" column)
dfs = dfs[dfs['Name'] != '']
#Write the updated DataFrame back to the excel sheet
dfs.to_excel("test.xls", sheet_name='Sheet1', index=False)
xlwt does not provide a simple interface for doing this, but I've had success with a somewhat similar problem (inserting multiple copies of a row into a copied workbook) by directly changing the worksheet's rows attribute and the row numbers on the row and cell objects.
The rows attribute is a dict, indexed on row number, so iterating a row range takes a little care and you can't slice it.
Given the number of rows you want to delete and the initial row number of the first row you want to keep, something like this might work:
rows_indices_to_move = range(first_kept_row, worksheet.last_used_row + 1)
max_used_row = 0
for row_index in rows_indices_to_move:
    new_row_number = row_index - number_to_delete
    if row_index in worksheet.rows: #rows is a dict, so test membership, don't call it
        row = worksheet.rows[row_index]
        row._Row__idx = new_row_number
        for cell in row._Row__cells.values():
            if cell:
                cell.rowx = new_row_number
        worksheet.rows[new_row_number] = row
        max_used_row = new_row_number
    else:
        # There's no row in the block we're trying to slide up at this index,
        # but there might be a row already present to clear out.
        if new_row_number in worksheet.rows:
            del worksheet.rows[new_row_number]
# now delete any remaining rows (a dict can't be sliced, so collect keys first)
for leftover in [idx for idx in worksheet.rows if idx > max_used_row]:
    del worksheet.rows[leftover]
# and update the internal marker for the last remaining row
if max_used_row:
    worksheet.last_used_row = max_used_row
I would expect there to be bugs in that code; it's untested and relies on direct manipulation of the underlying data structures, but it should show the general idea: modify the row and cell objects and adjust the rows dictionary so that the indices are correct.
Do you have merged ranges in the rows you want to delete, or below them? If so you'll also need to run through the worksheet's merged_ranges attribute and update the rows for them. Also, if you have multiple groups of rows to delete you'll need to adjust this answer - this is specific to the case of having a block of rows to delete and shifting everything below up.
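A hedged sketch of that merged-range fix-up (assuming, as in xlwt's Worksheet.merge, that merged_ranges holds (row1, row2, col1, col2) tuples):

#shift merged ranges that start at or below the deleted block up by the
#same number of rows
fixed_ranges = []
for r1, r2, c1, c2 in worksheet.merged_ranges:
    if r1 >= first_kept_row:
        r1 -= number_to_delete
        r2 -= number_to_delete
    fixed_ranges.append((r1, r2, c1, c2))
worksheet.merged_ranges = fixed_ranges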
As a side note - I was able to write text to my worksheet and preserve the predefined style thus:
def write_with_style(ws, row, col, value):
    if ws.rows[row]._Row__cells[col]:
        old_xf_idx = ws.rows[row]._Row__cells[col].xf_idx
        ws.write(row, col, value)
        ws.rows[row]._Row__cells[col].xf_idx = old_xf_idx
    else:
        ws.write(row, col, value)
That might let you skip having two copies of your spreadsheet open at once.
For those of us still stuck with xlrd/xlwt/xlutils, here's a filter you could use:
from typing import Iterable

from xlutils.filter import BaseFilter

class RowFilter(BaseFilter):
    rows_to_exclude: "Iterable[int]"
    _next_output_row: int

    def __init__(
        self,
        rows_to_exclude: "Iterable[int]",
    ):
        self.rows_to_exclude = rows_to_exclude
        self._next_output_row = -1

    def _should_include_row(self, rdrowx):
        return rdrowx not in self.rows_to_exclude

    def row(self, rdrowx, wtrowx):
        if self._should_include_row(rdrowx):
            # Proceed with writing out the row to the output file
            self._next_output_row += 1
            self.next.row(
                rdrowx, self._next_output_row,
            )

    # After `row()` has been called, `cell()` is called for each cell of the row
    def cell(self, rdrowx, rdcolx, wtrowx, wtcolx):
        if self._should_include_row(rdrowx):
            self.next.cell(
                rdrowx, rdcolx, self._next_output_row, wtcolx,
            )
Then put it to use with e.g.:
import xlutils.filter
from xlrd import open_workbook
from xlutils.filter import DirectoryWriter, XLRDReader

xlutils.filter.process(
    XLRDReader(open_workbook("input_filename.xls"), "output_filename.xls"),
    RowFilter([3, 4, 5]),
    DirectoryWriter("output_dir"),
)
