XLRD cannot read multi-index column names - Python

I have a problem with multi-index column names. I'm using xlrd to convert Excel data to JSON with json.dumps, but it only gives me a single row of column names. I have read about multi-level JSON, but I have no idea how to build it using xlrd.
Here is a sample of my table's column names (a header spanning two rows).
Sample code:
for i in path:
    with xlrd.open_workbook(i) as wb:
        print([i])
        kwd = 'sage'
        print(wb.sheet_names())
        for j in range(wb.nsheets):
            worksheet = wb.sheet_by_index(j)
            data = []
            n = 0
            nn = 0
            keyword = 'sage'
            keyword2 = 'adm'
            # skip_row, skip_row2 and check_skip are my own helpers (defined elsewhere)
            try:
                skip = skip_row(worksheet, n, keyword)
                keys = [v.value for v in worksheet.row(skip)]
            except:
                try:
                    skip = skip_row2(worksheet, nn, keyword2)
                    keys = [v.value for v in worksheet.row(skip)]
                except:
                    continue
            print(keys)
            for row_number in range(check_skip(skip), worksheet.nrows):
                if row_number == 0:
                    continue
                row_data = {}
                for col_number, cell in enumerate(worksheet.row(row_number)):
                    row_data[keys[col_number]] = cell.value
                data.append(row_data)
            print(json.dumps({'Data': data}))
Oh, by the way: each worksheet has a different number of rows to skip before the column names, which is why my code has the skip-row functions. After skipping those rows I find the exact location of my column names and start reading the values. And that is where the problem arises, as I see it, because I actually have two rows of column names. I'm still confused about how to build multi-level JSON with xlrd, or at least how to join the two column-name rows with xlrd (which I guess it can't do).
Desired outcome (multi-level JSON):
{ "Data": [{ "ID": "997", "Tax": [{"Date": "9/7/2019", "Total": 2300, "Grand Total": 340000}], "Tax ID": "ST-000", ... }]}
PS: I've tried to use pandas, but it gives me a lot of trouble since I work with big data.

You can use multi-indexing in pandas. First you need to get the header row indexes for each sheet:
header_indexes = get_header_indexes(excel_filepath, sheet_index)  # returns a list of header row indexes
You need to write the get_header_indexes function yourself; it scans a sheet and returns the indexes of the header rows. A minimal sketch of such a scan follows.
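For example, a sketch using xlrd, assuming (as in the question's skip_row helpers) that a header row can be recognized by a known keyword appearing in one of its cells; the keyword test is a placeholder to adapt to your files:

import xlrd

def get_header_indexes(excel_filepath, sheet_index, keywords=('sage', 'adm')):
    # Sketch: a row counts as a header row if any cell contains one of the
    # keywords; with two stacked header rows you would return both indexes,
    # e.g. [4, 5].
    wb = xlrd.open_workbook(excel_filepath)
    sheet = wb.sheet_by_index(sheet_index)
    indexes = []
    for rowx in range(sheet.nrows):
        cells = [str(c.value).lower() for c in sheet.row(rowx)]
        if any(kw in cell for kw in keywords for cell in cells):
            indexes.append(rowx)
    return indexes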
Then you can use pandas to get JSON from the DataFrame:
import pandas as pd
df = pd.read_excel(excel_filepath, header=header_indexes, sheet_name=sheet_index)
data = df.to_dict(orient="records")
For multiple header rows, data contains a list of dicts in which each dict has tuples as keys; you can reformat it into your final JSON as required, as in the sketch below.
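A sketch of that reshaping step, assuming two header rows (so each key is a 2-tuple) and relying on pandas filling blank upper-header cells with "Unnamed: N" labels:

import json
import pandas as pd

df = pd.read_excel(excel_filepath, header=header_indexes, sheet_name=sheet_index)
records = []
for row in df.to_dict(orient="records"):
    rec = {}
    for key, value in row.items():
        top, sub = key  # each key is a (top header, sub header) tuple
        if str(top).startswith("Unnamed"):
            rec[sub] = value                      # single-level column
        else:
            rec.setdefault(top, {})[sub] = value  # nested column
    records.append(rec)
print(json.dumps({'Data': records}, default=str))

This nests sub-columns as a dict; wrap the inner dict in a list if you need exactly the output shape shown in the question.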
Note: read_excel loads a whole sheet into memory at once; for very large files, consider converting to CSV and reading it with read_csv and its chunksize parameter, which reads in chunks.

Related

Is there a way to export a list of 100+ dataframes to excel?

So this is kind of weird, but I'm new to Python and I'm committed to seeing my first project with Python through to the end.
I am reading about 100 .xlsx files in from a file path. I then trim each file and send only the important information to a list, as an individual and unique dataframe. So now I have a list of 100 unique dataframes, but iterating through the list and writing to Excel just overwrites the data in the file. I want to append to the end of the .xlsx file. The biggest catch to all of this is that I can only use Excel 2010; I do not have any other version of the application. The openpyxl library seems to have some interesting stuff, and I've tried something like this:
from openpyxl import load_workbook
from openpyxl.utils.dataframe import dataframe_to_rows

wb = load_workbook(outfile_path)
ws = wb.active
for frame in main_df_list:
    for r in dataframe_to_rows(frame, index=True, header=True):
        ws.append(r)
Note: in another post I was told it's not best practice to build dataframes row by row with loops, but I didn't know that when I started. I am, however, committed to this monstrosity.
Edit after reading comments:
My code scrapes .xlsx files and stores specific data, based on a keyword comparison, into dataframes. These dataframes are stored in a list. I will show the entire program below so I can hopefully explain what's in my head. Also, feel free to roast my code, because I have no idea what good Python practice actually looks like.
import os
import pandas as pd
from openpyxl import load_workbook

# the file path I want to pull from
in_path = r'W:\R1_Manufacturing\Parts List Project\Tool_scraping\Excel'
# the file path where row search items are stored
search_parameters = r'W:\R1_Manufacturing\Parts List Project\search_params.xlsx'
# the file I will write the dataframes to
outfile_path = r'W:\R1_Manufacturing\Parts List Project\xlsx_reader.xlsx'

# establishing the lists that I will store looped data in
file_list = []
main_df = []
master_list = []

# open the file path to store the directory in files
files = os.listdir(in_path)

# database with terms that I want to track
search = pd.read_excel(search_parameters)
search_size = search.index

# searching only for files that end with .xlsx
for file in files:
    if file.endswith('.xlsx'):
        file_list.append(in_path + '/' + file)

# read the files into a dataframe; the main loop the files will be manipulated in
for current_file in file_list:
    df = pd.read_excel(current_file)
    # get column headers and a range for total rows
    columns = df.columns
    total_rows = df.index
    # lists to store where headers are located in the dataframe
    row_list = []
    column_list = []
    header_list = []
    for name in columns:
        for number in total_rows:
            cell = df.at[number, name]
            if isinstance(cell, str) == False:
                continue
            elif cell == '':
                continue
            for place in search_size:
                search_loop = search.at[place, 'Parameters']
                # main compare: if it's a str and matches the search params, then do...
                # (insensitive_compare is my own helper, defined elsewhere)
                if insensitive_compare(search_loop, cell) == True:
                    if cell not in header_list:
                        header_list.append(df.at[number, name])  # store data headers
                        row_list.append(number)  # store the row where it is in that dataframe
                        column_list.append(name)  # store the column where it is in that dataframe
                    else:
                        continue
                else:
                    continue
    for thing in column_list:
        df = pd.concat([df, pd.DataFrame(0, columns=[thing], index=range(2))], ignore_index=True)
    # turns the dataframe into a set of booleans: True wherever something is there
    na_finder = df.notna()
    # create a new dataframe to write the output to
    outdf = pd.DataFrame(columns=header_list)
    for i in range(len(row_list)):
        k = 0
        while na_finder.at[row_list[i] + k, column_list[i]] == True:
            # I read the boolean dataframe downwards until I hit False
            if df.at[row_list[i] + k, column_list[i]] not in header_list:
                # store the actual data in my output dataframe, outdf
                outdf.at[k, header_list[i]] = df.at[row_list[i] + k, column_list[i]]
            k += 1
    main_df.append(outdf)
So main_df is a list that has 100+ dataframes in it. For this example I will only use 2 of them. I would like them to print out into Excel stacked one after another (screenshot of the desired layout omitted).
The comment from Ashish really helped me. All of the dataframes had different column titles, so my 100+ dataframes eventually concat'd into a single dataframe that is 569 x 52. Here is the code I used. I completely abandoned openpyxl, because once I was able to concat all of the dataframes together I just had to export the result using pandas:
# what I want to do here is grab all the data in the same column as each
# header, then move to the next column
for i in range(len(row_list)):
    k = 0
    while na_finder.at[row_list[i] + k, column_list[i]] == True:
        if df.at[row_list[i] + k, column_list[i]] not in header_list:
            outdf.at[k, header_list[i]] = df.at[row_list[i] + k, column_list[i]]
        k += 1
main_df.append(outdf)

to_xlsx_df = pd.DataFrame()
for frame in main_df:
    to_xlsx_df = pd.concat([to_xlsx_df, frame])
to_xlsx_df.to_excel(outfile_path)
The output to Excel ended up looking something like this (screenshot omitted). Hopefully this can help someone else out too.
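For anyone who does want the frames written one below the other in a single sheet, a minimal alternative sketch (assuming openpyxl is installed; the sheet name and spacer row are arbitrary choices):

import pandas as pd

# Write each dataframe in main_df below the previous one, tracking the next
# free row ourselves; to_excel writes one header row plus len(frame) data rows.
with pd.ExcelWriter(outfile_path, engine="openpyxl") as writer:
    startrow = 0
    for frame in main_df:
        frame.to_excel(writer, sheet_name="Sheet1", startrow=startrow)
        startrow += len(frame) + 2  # header row + data rows + one blank spacer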

Using CSV module (*no pandas*): How to print the rows in a header column?

I am defining the function getColumn() below, which takes a column name and returns the data in that column as a list of strings, so that I can print the items under a header.
My getColumn(name) function fails and gives the error message:
unhashable type: 'list'
import csv
from csv import DictReader
from collections import defaultdict

with open('training_set_features.csv') as new_csv_file:
    new_csv_file = csv.DictReader(new_csv_file, fieldnames=headers)
    data = list(new_csv_file)
    column_names = list(data[0].keys())
    print("List of column names : ", column_names)

# This part of my code is where I am stuck, so it is incomplete
def getColumn(name):
    empty_list = []
    column = dict[column_names]
    for row in rows:
        return

print(getColumn('doctor_recc_h1n1'))
The code below will do what's needed using a "list comprehension", which extracts the data for the named column from each row and returns it as a list.
Note that because getColumn(name) is only passed one argument, the column name, it is forced to use the global variable data. It would be better to pass data as a second argument, because it's generally best to avoid globals as much as possible.
import csv

def getColumn(name):
    return [row[name] for row in data]

with open('training_set_features.csv', newline='') as new_csv_file:
    new_csv_file = csv.DictReader(new_csv_file)
    data = list(new_csv_file)
    column_names = list(data[0].keys())
    print("List of column names : ", column_names)
    print(getColumn('doctor_recc_h1n1'))
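A variant sketch of the suggested improvement, passing the rows in explicitly instead of relying on the global:

def getColumn(name, rows):
    # list comprehension: pull the named field out of every row dict
    return [row[name] for row in rows]

print(getColumn('doctor_recc_h1n1', data))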

Creating a list from an excel file that's a slice of a column. How can I print it without a 'text:' prefix in any value?

I'm trying to fetch a list of values from an Excel file, consisting of one column but ending at a certain row. I can fetch it using a slice of the column's rows; however, each value gets a 'text:' prefix. This makes the list incompatible with what I need to use it for.
import xlrd
import csv

loc = "/Users/uni/Desktop/TESTEXCEL.xls"
wb = xlrd.open_workbook(loc)
sheet = wb.sheet_by_index(0)
sheet.cell_value(0, 0)
CANDIDATE = sheet.col_slice(colx=5,
                            start_rowx=1,
                            end_rowx=29)
print(CANDIDATE)
RESULT:
[text:u'lt102', text:u'lt103', text:u'lt104', text:u'lt105', text:u'lt108', text:u'lt124', text:u'lt149', text:u'lt151', text:u'lt152', text:u'lt153', text:u'lt195', text:u'lt223', text:u'lt229', text:u'lt254', text:u'lt255', text:u'lt268', text:u'lt269', text:u'lt270', text:u'lt277', text:u'lt278', text:u'lt280', text:u'lt284', text:u'lt285', text:u'lt287', text:u'lt299', text:u'lt95', text:u'lt96', text:u'lt97']
You can use the pandas library; it has a convenient read_excel method. Here is an example:
import pandas as pd
column_number = 5
df = pd.read_excel('/Users/uni/Desktop/TESTEXCEL.xls', usecols=[column_number], nrows=29, header=None)
values = df[column_number].to_list()  # a list of the values in the column at index 5
You can read more about read_excel here: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_excel.html
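For completeness, a sketch that stays with xlrd: the 'text:' prefix is just the printed form of xlrd's Cell objects, so taking .value from each cell gives you plain strings:

import xlrd

wb = xlrd.open_workbook("/Users/uni/Desktop/TESTEXCEL.xls")
sheet = wb.sheet_by_index(0)
# col_slice returns Cell objects (printed as text:u'...'); .value unwraps them
CANDIDATE = [cell.value for cell in sheet.col_slice(colx=5, start_rowx=1, end_rowx=29)]
print(CANDIDATE)  # ['lt102', 'lt103', ...]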

Read data from excel after a string matches

I want to read an entire row of data and store it in variables, then later use them in Selenium to fill in web elements. The programming language is Python.
Example: I have an Excel sheet of incidents and their details regarding priority, date, assignee, etc.
If I give the string INC00000, it should match the Excel data, fetch all of the above details, and store them in separate variables like:
INC # = INC00000, Priority = Moderate, Date = 11/2/2020
Is this feasible? I tried and failed to write the code. Please suggest possible ways to do this.
I would:
1. load the sheet into a pandas DataFrame
2. filter the corresponding column in the DataFrame by the INC # of interest
3. convert the row to a dictionary (assuming the INC filter produces only 1 row)
4. get the corresponding values from the dictionary to assign to the corresponding web elements
Example:

import pandas as pd

df = pd.read_excel("full_file_path", sheet_name="name_of_sheet")
# assuming the INC numbers are in a column named "INC #" in the spreadsheet,
# and that the filter matches exactly one row
dict_data = df[df['INC #'] == 'INC00000'].to_dict("records")[0]
webelement1.send_keys(dict_data[columnname1])
webelement2.send_keys(dict_data[columnname2])
webelement3.send_keys(dict_data[columnname3])
...
Please find the code below and change the variables as needed, after saving your Excel file as a CSV (dummy data image omitted):
import csv

# Set up input for the script
gTrack = open("file1.csv", "r")

# Set up the CSV reader and process the header
csvReader = csv.reader(gTrack)
header = next(csvReader)
print(header)

id_index = header.index("id")
date_index = header.index("date")
var1_index = header.index("var1")
var2_index = header.index("var2")

# Make an empty list
cList = []

# Loop through the lines in the file and get the required id
for row in csvReader:
    id = row[id_index]
    if id == 'INC001':
        date = row[date_index]
        var1 = row[var1_index]
        var2 = row[var2_index]
        cList.append([id, date, var1, var2])

# Print the collected list
print(cList)

How to Parse Pandas ExcelFile sheets with generic headers

I have many Excel files with multiple sheets where the first row is data, not headers. How can I parse each sheet without pandas treating row 0 as the header? Having the first row taken as my headers is a pain.
Failing that, what is the best way to push the column index down into the first row of data?
My code is simple:
import pandas as pd

path_list  # list of paths to .xls files
data_sheets = []  # container for parsed sheets
for file_ in path_list:
    excel_file_obj = pd.ExcelFile(file_)
    for sheet in excel_file_obj.sheet_names:
        data_sheet = excel_file_obj.parse(sheet)
        data_sheets.append(data_sheet)
I can't for the life of me figure out how to get the column index into the first row of data. I basically want a df.reset_index(False)-type solution, but for columns. Does such a thing exist?
One extremely hackish way around would seem to be to do this for each data sheet:

first_row = data_sheet.columns
generic_cols = ['col' + str(x) for x in range(len(data_sheet.columns))]
data_sheet.index = [x for x in range(1, len(data_sheet) + 1)]
data_sheet.columns = generic_cols
for_concat = pd.DataFrame({col: val for col, val in zip(generic_cols, first_row)}, index=[0])
new_sheet = pd.concat([for_concat, data_sheet])
There must be a better way. All help appreciated...
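A simpler route, as a sketch: ExcelFile.parse (like read_excel) accepts header=None, which keeps the first row as data and auto-numbers the columns 0..N-1:

import pandas as pd

data_sheets = []
for file_ in path_list:
    excel_file_obj = pd.ExcelFile(file_)
    for sheet in excel_file_obj.sheet_names:
        # header=None: no row is promoted to headers; columns become 0..N-1
        data_sheets.append(excel_file_obj.parse(sheet, header=None))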
