How to sort Excel sheet using Python

I am using Python 3.4 and xlrd. I want to sort the Excel sheet based on the primary column before processing it. Is there a library that can do this?

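There are a couple of ways to do this. The first option is to use xlrd, as you have this tagged, together with xlwt for the output. The biggest downside is that xlwt only writes the legacy XLS format, not XLSX. Also note that xlrd 2.0 and later can only read .xls files, so opening an .xlsx file as shown here requires an older xlrd release.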
These examples use an excel document with this format:
Utilizing xlrd and a few modifications from this answer:
import xlwt
from xlrd import open_workbook
target_column = 0 # This example only has 1 column, and it is 0 indexed
book = open_workbook('test.xlsx')
sheet = book.sheets()[0]
data = [sheet.row_values(i) for i in range(sheet.nrows)] # range, not xrange, on Python 3
labels = data[0] # Don't sort our headers
data = data[1:] # Data begins on the second row
data.sort(key=lambda x: x[target_column])
bk = xlwt.Workbook()
sheet = bk.add_sheet(sheet.name)
for idx, label in enumerate(labels):
    sheet.write(0, idx, label)
for idx_r, row in enumerate(data):
    for idx_c, value in enumerate(row):
        sheet.write(idx_r+1, idx_c, value)
bk.save('result.xls') # Notice this is xls, not xlsx like the original file is
This outputs the following workbook:
Another option (and one that can utilize XLSX output) is to utilize pandas. The code is also shorter:
import pandas as pd
xl = pd.ExcelFile("test.xlsx")
df = xl.parse("Sheet1")
df = df.sort(columns="Header Row")
writer = pd.ExcelWriter('output.xlsx')
df.to_excel(writer,sheet_name='Sheet1',columns=["Header Row"],index=False)
writer.save()
This outputs:
In the to_excel call, the index is set to False, so that the Pandas dataframe index isn't included in the excel document. The rest of the keywords should be self explanatory.
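The header-preserving sort from the xlrd example can also be tried on plain lists, without any workbook. This is only a minimal sketch: the rows are invented, and sort_key is a hypothetical helper (not part of the original answer) that pushes empty cells, which xlrd returns as empty strings, to the end, since Python 3 cannot compare a string with a float. It assumes the remaining values in the column are all numbers.

```python
# Rows as xlrd's sheet.row_values() would return them: floats for numbers,
# '' for empty cells. The values themselves are invented for illustration.
rows = [
    ["Header Row", "Other"],
    [3.0, "c"],
    ["", "x"],
    [1.0, "a"],
    [2.0, "b"],
]
target_column = 0

labels, data = rows[0], rows[1:]  # keep the header out of the sort


def sort_key(row):
    # Hypothetical helper: empty cells sort last, everything else by value.
    value = row[target_column]
    return (value == "", value)


data.sort(key=sort_key)
sorted_rows = [labels] + data
```

The tuple key means an empty cell is never compared directly with a number, because the first element already decides the order.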

I just wanted to refresh the answer as the Pandas implementation has changed a bit over time. Here's the code that should work now (pandas 1.1.2).
import pandas as pd
xl = pd.ExcelFile("test.xlsx")
df = xl.parse("Sheet1")
df = df.sort_values(by="Header Row")
...
The sort function is now called sort_values, and the columns keyword is replaced by by.
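For a quick check without an Excel file, here is a small in-memory sketch of sort_values. The column names and values are made up; ascending and ignore_index are optional keywords (ignore_index assumes pandas 1.0 or later).

```python
import pandas as pd

# Invented sample data standing in for the parsed sheet.
df = pd.DataFrame({"Header Row": [3, 1, 2], "Other": ["c", "a", "b"]})

# Ascending sort; the original index labels travel with their rows.
by_value = df.sort_values(by="Header Row")

# Descending sort with a fresh 0..n-1 index.
descending = df.sort_values(by="Header Row", ascending=False, ignore_index=True)
```

Since the output is written with index=False anyway, the index only matters if you keep working with the dataframe after sorting.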

Related

How to add Pandas dataframe to an existing xlsx file using to_excel

I have written some content to an xlsx file using xlsxwriter:
workbook = xlsxwriter.Workbook(file_name)
worksheet = workbook.add_worksheet()
worksheet.write(row, col, value)
worksheet.close()
I'd like to add a dataframe to this file, after the existing rows, using to_excel:
df.to_excel(file_name,
            startrow=len(existing_content),
            engine='xlsxwriter')
However, this does not seem to work: the dataframe is not inserted into the file. Does anyone know why?
Since the question does not include a complete example, let's walk through one using XlsxWriter and to_excel.
using xlsxwriter
import xlsxwriter
# Create a new Excel file and add a worksheet
workbook = xlsxwriter.Workbook('example.xlsx')
worksheet = workbook.add_worksheet()
# Add some data to the worksheet
worksheet.write('A1', 'Language')
worksheet.write('B1', 'Score')
worksheet.write('A2', 'Python')
worksheet.write('B2', 100)
worksheet.write('A3', 'Java')
worksheet.write('B3', 98)
worksheet.write('A4', 'Ruby')
worksheet.write('B4', 88)
# Save the file
workbook.close()
Using the above code, we have saved the table similar to the one below to an Excel file.
Language  Score
Python    100
Java      98
Ruby      88
Next, to add rows using DataFrame.to_excel:
using to_excel
import pandas as pd
# Load an existing Excel file
existing_file = pd.read_excel('example.xlsx')
# Create a new DataFrame to append
df = pd.DataFrame({
    'Language': ['C++', 'Javascript', 'C#'],
    'Score': [78, 97, 67]
})
# Append the new DataFrame to the existing file
result = pd.concat([existing_file, df])
# Write the combined DataFrame to the existing file
result.to_excel('example.xlsx', index=False)
The reasons for using pandas concat:
Appending to an existing file requires pandas.ExcelWriter in append mode (mode='a'), but the XlsxWriter engine does not support append mode.
The task could also be accomplished with pandas.DataFrame.append(), but that method was deprecated in pandas 1.4 and removed in pandas 2.0, so we use concat instead.
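If the goal is to add rows to the existing file rather than rewrite it, one possible approach is pandas.ExcelWriter in append mode with the openpyxl engine. The following is a sketch under assumptions: pandas 1.4 or later (for if_sheet_exists="overlay"), openpyxl installed, and a made-up temporary file with invented sample rows.

```python
import os
import tempfile

import pandas as pd

# Create a small workbook to stand in for the existing file.
path = os.path.join(tempfile.mkdtemp(), "example.xlsx")
pd.DataFrame({"Language": ["Python", "Java"], "Score": [100, 98]}).to_excel(
    path, index=False
)

new_rows = pd.DataFrame({"Language": ["C++", "C#"], "Score": [78, 67]})

# mode="a" opens the existing workbook; "overlay" writes into the existing
# sheet instead of raising an error or replacing it.
with pd.ExcelWriter(
    path, engine="openpyxl", mode="a", if_sheet_exists="overlay"
) as writer:
    # max_row counts the rows already used, header included; startrow is
    # 0-indexed, so this value points at the first free row.
    start = writer.sheets["Sheet1"].max_row
    new_rows.to_excel(
        writer, sheet_name="Sheet1", startrow=start, header=False, index=False
    )

combined = pd.read_excel(path)
```

The concat approach above rewrites the whole sheet, whereas this variant leaves the existing cells untouched, which matters if the file carries formatting that pandas would not preserve.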
The OP is using xlsxwriter in the engine parameter. Per XlsxWriter documentation "XlsxWriter is designed only as a file writer. It cannot read or modify an existing Excel file." (link to XlsxWriter Docs).
Below I've provided a fully reproducible example of how you can go about modifying an existing .xlsx workbook using the openpyxl module (link to Openpyxl Docs).
For demonstration purposes, I'll first create a workbook called test.xlsx using pandas:
import pandas as pd
df = pd.DataFrame({'Col_A': [1,2,3,4],
                   'Col_B': [5,6,7,8],
                   'Col_C': [0,0,0,0],
                   'Col_D': [13,14,15,16]})
df.to_excel('test.xlsx', index=False)
This is the Expected output at this point:
Using openpyxl, you can load the existing workbook ('test.xlsx') and overwrite the third column with data from a new dataframe while preserving the other existing data. In this example, for simplicity, I update it with a one-column dataframe, but you could extend it to update or add more data.
from openpyxl import load_workbook
import pandas as pd
df_new = pd.DataFrame({'Col_C': [9, 10, 11, 12]})
wb = load_workbook('test.xlsx')
ws = wb['Sheet1']
for index, row in df_new.iterrows():
    cell = 'C%d' % (index + 2) # +2: Excel rows are 1-based and row 1 is the header
    ws[cell] = row['Col_C']
wb.save('test.xlsx')
With the Expected output at the end:

how to read many columns from excel using python

I want to read about 162 rows of data from Excel. I tried this code, but I couldn't figure out how to do it in a loop:
import xlrd
file_location = "dec_DB.xlsx"
workbook = xlrd.open_workbook(file_location)
sheet = workbook.sheet_by_name('Sheet1')
x = []
for rownum in range(sheet.nrows):
    x.append(sheet.cell(rownum, 3))
You can read the columns you want from an excel sheet easily in one line with pandas:
import pandas as pd
x = pd.read_excel(io = "/Users/atheeralzaydi/Desktop/KACST/DB/dec_DB.xlsx", sheet_name = "Sheet1", usecols = [1,2,3,12,34,100])
x will be a pandas DataFrame.
EDIT:
To read the first 162 rows (+1 including the header -- the row that includes the column names), instead of the parameter usecols, use the parameter nrows:
import pandas as pd
x = pd.read_excel(io = "/Users/atheeralzaydi/Desktop/KACST/DB/dec_DB.xlsx", sheet_name = "Sheet1", nrows = 162)
To import specific rows and not the first m rows, use the parameter skiprows.
To import the entire sheet, simply omit these parameters and all rows will be imported.
More info on read_excel here.
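The three parameters can also be combined. The sketch below builds a throwaway workbook (the file name, column names, and values are all made up) and then reads two columns from a block of ten rows in the middle of the sheet. It assumes openpyxl is installed for reading and writing .xlsx files.

```python
import os
import tempfile

import pandas as pd

# Build a throwaway workbook: one header row followed by 100 data rows.
path = os.path.join(tempfile.mkdtemp(), "demo.xlsx")
pd.DataFrame(
    {"id": range(1, 101), "b": range(101, 201), "c": range(201, 301)}
).to_excel(path, index=False)

x = pd.read_excel(
    path,
    sheet_name="Sheet1",
    usecols=[0, 2],               # first and third column, by position
    skiprows=list(range(1, 50)),  # skip sheet rows 1..49, keep row 0 (header)
    nrows=10,                     # then read ten data rows
)
```

Passing a list to skiprows, rather than a single integer, is what keeps the header row in place while the rows after it are skipped.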

Having Trouble Writing Table to Excel with Python

Hi I am trying to create a table in excel using a dataframe from another excel spreadsheet and writing the table to a new one. I believe my code is correct but the table isn't writing to the new excel spreadsheet. Can someone take a look at my code and tell me what's wrong?
import xlsxwriter
import pandas as pd
import openpyxl as pxl
import xlsxwriter
import numpy as np
from openpyxl import load_workbook
path = '/Users/benlong/Downloads/unemployment.xlsx'
df = pd.read_excel(path)
rows = df.shape[0]
columns = df.shape[1]
wb = xlsxwriter.Workbook('UE2.xlsx')
ws = wb.add_worksheet('Sheet1')
ws.add_table(0,0,rows,columns, {'df': df})
wb.close()
You should convert your dataframe to a list of rows with df.values.tolist() and pass it under the data key.
In your case, you should also set the table header from the columns of df and guard against NaN values, which otherwise raise an error.
For example:
import xlsxwriter as xlw
# with this option, NaN/Inf values in the dataframe are written as '#NUM!' errors in the saved file instead of raising an exception
wb = xlw.Workbook('UE2.xlsx',{'nan_inf_to_errors': True})
ws = wb.add_worksheet('Sheet1')
cell_range = xlw.utility.xl_range(0, 0, rows, columns-1)
header = [{'header': str(di)} for di in df.columns.tolist()]
ws.add_table(cell_range, {'header_row': True,'first_column': False,'columns':header,'data':df.values.tolist()})
wb.close()
Possible duplicate: How to use xlsxwriter .add_table() method with a dataframe?
You can try converting the dataframe to a list of lists and use the data keyword.
ws.add_table(0, 0, rows, columns - 1, {'data': df.values.tolist()}) # data is row-wise, so no transpose; the last column index is columns - 1
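The pieces from both answers (header list, row-wise data, and the cell range) can be gathered in one place. This is a sketch, not XlsxWriter's own API: table_args is a hypothetical helper and the sample dataframe is invented. It replaces NaN with None on the assumption that blank cells are acceptable in the output.

```python
import pandas as pd


def table_args(df):
    """Hypothetical helper: build the arguments for worksheet.add_table()."""
    header = [{"header": str(col)} for col in df.columns]
    # Row-wise data; NaN becomes None so that no NaN reaches the writer.
    data = [
        [None if pd.isna(value) else value for value in row]
        for row in df.values.tolist()
    ]
    last_row = len(df)              # row 0 holds the header, data fills 1..len(df)
    last_col = len(df.columns) - 1  # column indices are zero-based
    return 0, 0, last_row, last_col, {"columns": header, "data": data}


# Invented sample data with one missing value.
df = pd.DataFrame({"State": ["A", "B"], "Rate": [5.4, float("nan")]})
args = table_args(df)

# With an xlsxwriter worksheet `ws`, the call would be: ws.add_table(*args)
```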

Performance evaluating excel cells with python

I am trying to compare 2 columns of an Excel file with a 2D matrix, row by row, in Python. My Excel file contains 20,100 rows, and the run time in PyCharm is more than 1 hour. Is there a more time-efficient way to do these value comparisons?
import openpyxl as xl
from IDM import idm_matrix
# load and create excel file
wb = xl.load_workbook('Auswertung_C33.xlsx')
result_wb = xl.Workbook() #new workbook for results
result_sheet = result_wb.create_sheet('Ergebnisse') #create new sheet in result file
result_wb.remove(result_wb['Sheet'])
sheet = wb['TriCad_Format']
# copy 1st row
first_row = sheet[1:1]
list_first_row =[]
for item in first_row:
    list_first_row.append(item.value)
result_sheet.append(list_first_row)
# Value check
for row in range(2, sheet.max_row + 1):
    row_list = []
    for col in range(1, sheet.max_column + 1):
        cell = sheet.cell(row, col)
        row_list.append(cell.value)
    for matrix in idm_matrix:
        if row_list[7] is None:
            continue
        elif matrix[0] in row_list[7]:
            if row_list[14] is None or matrix[1] != row_list[14]:
                result_sheet.append(row_list)
print("saving file...")
result_wb.save('Auswertung.xlsx') #saves the file in a new wb
print("Done!")
Thanks for your help!
Alex
----- Sample of Data ------
Input:
BEZ | _IDM
Schirmsprinkler-SU5 | EAL
--> If column BEZ contains the string 'Schirmsprinkler' and column _IDM has any value, the row should be copied. If the column _IDM is empty, the row is fine and should not be copied. There are many strings in BEZ where _IDM should be empty, so that's why I am trying to put them all in the df_idm lists. However, it doesn't work with an empty string "".
Update 20th of May 2020:
import openpyxl as xl
from IDM import idm_matrix
import pandas as pd
# EXCEL DATA FRAME
xl_file = 'Auswertung_C33.xlsx'
df_excel = pd.read_excel(xl_file, sheet_name="TriCad_Format")
# IDM LIST DATA FRAME
df_idm = pd.DataFrame(idm_matrix, columns=['LongName', 'ShortName'])
# REMOVE ROWS WHICH HAVE NO VALUE IN COLUMN 6
df_excel.dropna(subset=['BEZ'], inplace=True)
# MATCH ON CORRESPONDING COLUMNS
search_list = df_idm['LongName'].unique().tolist()
matches1 = df_excel[(df_excel["BEZ"].str.contains("|".join(search_list), regex=True)) &
                    (~df_excel["_IDM"].isin(df_idm["ShortName"].tolist()))]
matches2 = df_excel[(~df_excel["BEZ"].str.contains("|".join(search_list), regex=True)) & (~pd.isnull(df_excel["_IDM"]))]
# CREATE LIST OF MATCHING DATAFRAMES
sum_of_idm = [matches1, matches2]
# CREATE NEW WORKBOOK
writer = pd.ExcelWriter('Ergebnisse.xlsx')
pd.concat(sum_of_idm).to_excel(writer, sheet_name="Ergebnisse", index=False)
writer.save()
Since your data requires comparison checks, consider pandas, Python's third-party data analysis library, for several reasons:
Import and export Excel features that can interface with openpyxl
Ability to interact with many Python data objects (list, dict, tuple, array, etc.)
Vectorized comparison logic that is more efficient than nested for loops
Use whole, named objects (DataFrame and Series) for bulk, single-call, set-based operations
Avoid working with unnamed, numbered rows and cells, which hurts readability
Specifically, you can load your idm_matrix into one data frame and the Excel data into another, then compare columns either with a single merge call (for an exact match) or with Series.str.contains (for a partial match), followed by a logical filter.
Note: Without a reproducible example, the code below relies on information from the posted code and needs to be tested on actual data. Adjust any Column# from the original Excel worksheet as needed:
from IDM import idm_matrix
import pandas as pd
# EXCEL DATA FRAME
xl_file = 'Auswertung_C33.xlsx'
df_excel = pd.read_excel(xl_file, sheet_name="TriCad_Format")
# IDM LIST DATA FRAME
df_idm = pd.DataFrame(idm_matrix, columns=['LongName', 'ShortName'])
# EXACT MATCH
# matches = df_excel.merge(df_idm, left_on=['Column6'], right_on=['LongName'])
# PARTIAL MATCH
search_list = df_idm['LongName'].unique().tolist()
matches = df_excel[(~df_excel["Column6"].str.contains("|".join(search_list), regex=True)) &
                   (pd.isnull(df_excel["_IDM"])) &
                   (df_excel["Column6"].isin(df_idm["ShortName"].tolist()))]
# ADJUST EXISTING WORKBOOK
# APPEND MODE KEEPS THE OTHER SHEETS; if_sheet_exists='replace' (PANDAS >= 1.3) REMOVES AN OLD RESULT SHEET FIRST
with pd.ExcelWriter(xl_file, engine='openpyxl', mode='a', if_sheet_exists='replace') as writer:
    # ADD NEW SHEET OF RESULTS (THE FILE IS SAVED WHEN THE with BLOCK EXITS)
    matches.to_excel(writer, sheet_name="Ergebnisse", index=False)
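To try the partial-match filter without the real workbook, here is a small in-memory sketch of the rule described in the question: copy a row when BEZ contains a known name and _IDM has any value. The data and the names list are invented. The re.escape call and na=False are additions that are not in the answer above; they guard against regex metacharacters in the names and against empty BEZ cells.

```python
import re

import pandas as pd

# Invented sample rows; None marks an empty cell.
df = pd.DataFrame(
    {
        "BEZ": ["Schirmsprinkler-SU5", "Schirmsprinkler-SU7", "Rohr (DN50)", None],
        "_IDM": ["EAL", None, "XYZ", "ABC"],
    }
)

# Hypothetical list of names for which _IDM is expected to be empty.
names = ["Schirmsprinkler", "Rohr (DN"]

# Escape each name so characters such as "(" are matched literally.
pattern = "|".join(re.escape(name) for name in names)

# na=False treats empty BEZ cells as "no match" instead of NaN.
contains = df["BEZ"].str.contains(pattern, regex=True, na=False)

# Rows to copy: BEZ matches a known name and _IDM is filled in.
flagged = df[contains & df["_IDM"].notna()]
```

Because both conditions are evaluated on whole columns at once, the cost no longer grows with the number of rows times the number of matrix entries in Python-level loops.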

Creating a list from an excel file that's a slice of a column. How can I print it without a 'text:' prefix in any value?

I am trying to fetch a list of values from one column of an Excel file, ending at a certain row. I can fetch it using a slice of the column and rows, but a 'text:' prefix appears on every value. This makes the list incompatible with what I need to use it for.
import xlrd
import csv
loc = ("/Users/uni/Desktop/TESTEXCEL.xls")
wb =xlrd.open_workbook(loc)
sheet = wb.sheet_by_index(0)
sheet.cell_value(0,0)
CANDIDATE = sheet.col_slice(colx=5,
                            start_rowx=1,
                            end_rowx=29)
print (CANDIDATE)
RESULT:
[text:u'lt102', text:u'lt103', text:u'lt104', text:u'lt105', text:u'lt108', text:u'lt124', text:u'lt149', text:u'lt151', text:u'lt152', text:u'lt153', text:u'lt195', text:u'lt223', text:u'lt229', text:u'lt254', text:u'lt255', text:u'lt268', text:u'lt269', text:u'lt270', text:u'lt277', text:u'lt278', text:u'lt280', text:u'lt284', text:u'lt285', text:u'lt287', text:u'lt299', text:u'lt95', text:u'lt96', text:u'lt97']
You can use the pandas library, which has a convenient read_excel method. Here is an example:
import pandas as pd
column_number = 5
df = pd.read_excel('/Users/uni/Desktop/TESTEXCEL.xls', usecols=[column_number], skiprows=1, nrows=28, header=None) # rows 1 to 28, matching start_rowx=1, end_rowx=29
values = df[column_number].to_list() # a list of the values in column index 5 (the sixth column)
You can read more about read_excel here: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_excel.html
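For completeness, the 'text:' prefix comes from printing xlrd Cell objects: col_slice returns cells, and their printed form includes the cell type. Staying with xlrd, the plain values can be pulled out through each cell's value attribute. The sketch below wraps that in a hypothetical helper, column_values; xlrd's own sheet.col_values(colx, start_rowx, end_rowx) should give the same list directly.

```python
def column_values(sheet, colx, start_rowx, end_rowx):
    """Return the plain values of a column slice from an xlrd sheet."""
    # col_slice() yields Cell objects; .value holds the underlying data.
    return [cell.value for cell in sheet.col_slice(colx, start_rowx, end_rowx)]


# Usage with the workbook from the question (not run here):
# import xlrd
# sheet = xlrd.open_workbook("/Users/uni/Desktop/TESTEXCEL.xls").sheet_by_index(0)
# CANDIDATE = column_values(sheet, colx=5, start_rowx=1, end_rowx=29)
```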
