How to load win32com Excel worksheet to Pandas df? - python

I have the following code:
import pandas as pd
import win32com.client
excel_app = win32com.client.Dispatch("Excel.Application")
file_path = r"path to the file"
file_password = "file password"
workbook = excel_app.Workbooks.Open(file_path, Password=file_password)
sheet = workbook.Sheets("sheet name")
Now I'd like to take the sheet variable and load it into a Pandas df. I was trying to accomplish it via saving the sheet to a separate file and then reading it from Pandas, but it seems to be over-complicating the issue, as the file is both password protected and in .xlsm format, so re-opening it directly from Pandas isn't straightforward.
How do I do it?

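The UsedRange property of the sheet returns a range that encompasses all the cells in the worksheet that have data. Calling it through win32com yields the cell values as a tuple of row tuples, which pd.DataFrame accepts directly: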
df = pd.DataFrame(sheet.UsedRange())
The column headers are the column numbers and the index is the row numbers, both zero-based.
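If the first row of the sheet holds the column names, the tuple of rows can be split into a header and a body before building the frame. The following is only a minimal sketch, not the answerer's code: `used_range_to_df` is a hypothetical helper, and the `values` tuple is made-up data standing in for what `sheet.UsedRange.Value` is assumed to return for a multi-cell range, so the example runs without Excel.

```python
import pandas as pd


def used_range_to_df(values, header=True):
    """Build a DataFrame from a tuple of row tuples (hypothetical helper).

    `values` is assumed to have the shape that win32com returns for
    `sheet.UsedRange.Value` on a multi-cell range: one inner tuple per row.
    """
    rows = [list(row) for row in values]
    if header and rows:
        # Treat the first row as column names and the rest as data.
        return pd.DataFrame(rows[1:], columns=rows[0])
    # No header row: columns and index are zero-based integers.
    return pd.DataFrame(rows)


# Stand-in data; with Excel available this would be sheet.UsedRange.Value.
values = (("name", "score"), ("Ann", 1.0), ("Bob", 2.0))
df = used_range_to_df(values)
print(df)
```

Note that a used range consisting of a single cell would give a scalar rather than a tuple of tuples, which this sketch does not handle.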

Related

Can't see csv file (converted from df) in files

After saving my dataframe to a csv in a specific location, the csv file doesn't appear in the location I saved it to. Is there any reason why it possibly is not showing?
Here is the code to save my dataframe to csv:
df.to_csv(r'C:\Users\gibso\OneDrive\Documents\JOSEPH\export_dataframe.csv', index = False)
Even saving an empty df does not seem to work.
import pandas as pd
olympics={}
df = pd.DataFrame(olympics)
df.to_csv(r'C:\Users\gibso\OneDrive\Documents\JOSEPH\export_dataframe.csv', index = False)
Thanks for the help!
I would rather use the module openpyxl. Example of saving:
import openpyxl
workbook = openpyxl.Workbook()
sheet = workbook.active
# Work on your workbook. Once finished:
workbook.save(file_name) # file_name is a variable you must define
Don't forget to install openpyxl with pip first!
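To check whether the CSV was actually written, and where, it can help to resolve the path and test for the file right after saving. This is a minimal sketch under assumptions: a temporary directory stands in for the OneDrive folder from the question, and `out_path` and the sample data are made up for illustration.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Sample data; the question's `olympics` dict was empty.
df = pd.DataFrame({"country": ["USA", "GBR"], "gold": [39, 22]})

# Assumption: a temporary folder stands in for the real target directory.
out_dir = Path(tempfile.mkdtemp())
out_path = out_dir / "export_dataframe.csv"

df.to_csv(out_path, index=False)

# Print the absolute path and confirm the file exists on disk.
print(out_path.resolve())
print("exists:", out_path.exists(), "size:", out_path.stat().st_size)
```

If the script reports that the file exists, comparing the printed path with the folder open in the file browser is a reasonable next step.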

Loading only one sheet to dataframe

I am trying to read an Excel sheet into a df using the pandas read_excel method. The Excel file contains 6-7 different sheets, 2-3 of which are very large. I only want to read one sheet out of the file.
If I copy that sheet out into its own file and read it, the read time drops by 90%.
I have read that xlrd that is used by pandas always loads the whole sheet to memory. I cannot change the format of the input.
Can you please suggest a way to improve the performance?
It's quite simple. Just do this.
import pandas as pd
xls = pd.ExcelFile('C:/users/path_to_your_excel_file/Analysis.xlsx')
df1 = pd.read_excel(xls, 'Sheet1')
print(df1)
# etc.
df2 = pd.read_excel(xls, 'Sheet2')
print(df2)
import pandas as pd
df = pd.read_excel('YourFile.xlsx', sheet_name = 'YourSheet_Name')
Just pass the path to your Excel file and the name of the sheet you want to read.
Use openpyxl in read-only mode. See http://openpyxl.readthedocs.io/en/default/pandas.html
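The read-only suggestion can be sketched roughly as follows. The workbook is created on the fly so the example is self-contained; the file name, sheet names, and data are assumptions for illustration, and whether read-only mode gives a real speed-up for a particular file is something to measure rather than take for granted.

```python
import tempfile
from pathlib import Path

import pandas as pd
from openpyxl import Workbook, load_workbook

# Build a small two-sheet workbook so the example is self-contained.
path = str(Path(tempfile.mkdtemp()) / "Analysis.xlsx")
wb = Workbook()
ws_small = wb.active
ws_small.title = "Sheet1"
ws_small.append(["id", "value"])
ws_small.append([1, 10])
ws_small.append([2, 20])
wb.create_sheet("Sheet2").append(["unused"])
wb.save(path)

# Read-only mode yields rows lazily instead of building every cell object.
wb_ro = load_workbook(path, read_only=True)
rows = wb_ro["Sheet1"].iter_rows(values_only=True)
header = next(rows)  # first row holds the column names
df = pd.DataFrame(list(rows), columns=list(header))
wb_ro.close()  # read-only workbooks keep the file open until closed

print(df)
```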

Create a visual filter in excel using Python - openpyxl

I am trying to create a filter in Excel programmatically, so that when the sheet is created with openpyxl, the first row of each sheet is already set up as a filter. I've looked at the docs, but all I can find is how to filter data, not how to create a filter.
Is it even possible with the current implementation of openpyxl?
openpyxl does support filters. See the worksheet.filters module and the associated tests.
Sample of what you can do:
ws.auto_filter.ref = 'C1:G9'
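For a sheet built entirely with openpyxl, the filter range does not have to be hard-coded. The sketch below uses `ws.dimensions`, which gives the bounding range of the cells that hold data; the sample rows and the output file name are made up for illustration.

```python
from openpyxl import Workbook

wb = Workbook()
ws = wb.active
ws.append(["name", "city", "score"])  # header row that will carry the filter
ws.append(["Ann", "Oslo", 1])
ws.append(["Bob", "Rome", 2])

# ws.dimensions is the bounding range of the cells that hold data.
ws.auto_filter.ref = ws.dimensions
print(ws.auto_filter.ref)

wb.save("filtered_example.xlsx")  # hypothetical output file name
```

This only attaches the filter drop-downs to the header row; it does not hide or filter any rows by itself.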
Copy and paste this into a .py file and run.
import pandas as pd
import numpy as np
# Here is an example dataframe
df_example = pd.DataFrame(np.random.randint(0,100,size=(100, 4)), columns=list('ABCD'))
# Create xlsx file
filepath = 'mytempfile.xlsx'
with pd.ExcelWriter(filepath, engine='xlsxwriter') as writer:
    df_example.to_excel(writer, sheet_name='Sheet1', index=False)
# Add filter feature to first row
import openpyxl
xfile = openpyxl.load_workbook(filepath)
sheet = xfile['Sheet1']
maxcolumnletter = openpyxl.utils.get_column_letter(sheet.max_column)
sheet.auto_filter.ref = 'A1:' + maxcolumnletter + str(len(sheet['A']))
# Save the file
xfile.save(filepath)
print('your file:', filepath)

Dynamically Parsing a worksheet in Pandas using Python 3

My question is about parsing worksheets in pandas (Python 3).
Right now my code looks like this:
var = input("Enter the path for the Excel file you want to use: ")
import pandas as pd
xl = pd.ExcelFile(var)
df = xl.parse("HelloWorld")
df.head()
with my code parsing the worksheet "HelloWorld" within an excel file the user inputs. However, sometimes the worksheet within the file will not be called "HelloWorld" in which case the parsing code will fail.
Does anyone know how to set the variable "df" so that it dynamically reads the name of the worksheet within the Excel file? There will always be only ONE worksheet in these Excel files, so whatever worksheet is in the file, I want my code to read it.
Thank you for the help!
http://pandas.pydata.org/pandas-docs/stable/generated/pandas.io.excel.ExcelFile.parse.html
You can pass in the sheet number instead of the name.
var = input("Enter the path for the Excel file you want to use: ")
import pandas as pd
xl = pd.ExcelFile(var)
df = xl.parse(sheet_name=0)
df.head()
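If the sheet's name is also needed, for logging for example, it can be looked up through `ExcelFile.sheet_names` instead of relying on the position alone. A minimal sketch, assuming a throwaway workbook written to a temporary folder; the file name, sheet name, and data are made up, and writing the file requires an Excel writer engine such as openpyxl to be installed.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Create a one-sheet workbook with an unpredictable sheet name.
path = Path(tempfile.mkdtemp()) / "single_sheet.xlsx"
pd.DataFrame({"a": [1, 2]}).to_excel(path, sheet_name="NotHelloWorld", index=False)

with pd.ExcelFile(path) as xl:
    only_sheet = xl.sheet_names[0]  # the single sheet, whatever it is called
    df = xl.parse(only_sheet)

print(only_sheet)
print(df.head())
```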

Is it possible to read data from an Excel sheet in Python using Xlsxwriter? If so how?

I'm doing the following calculation.
worksheet.write_formula('E5', '=({} - A2)'.format(number))
I want to print the value in E5 on the console. Can you help me do it? Is it possible with XlsxWriter, or should I use a different library to do the same?
It is not possible to read data from an Excel file using XlsxWriter.
There are some alternatives listed in the documentation.
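Since XlsxWriter only writes, one possible workaround is to compute the value in Python, print it, and store it as the formula's cached result through the optional value argument of `write_formula`. A different library such as openpyxl can then read that cached result back. This is a sketch under assumptions: the content of `A2`, the value of `number`, and the file name are made up, and the cached value is only as correct as the Python calculation behind it.

```python
import openpyxl
import xlsxwriter

number = 10      # assumed value; the question does not give it
a2_value = 3     # assumed content of cell A2

workbook = xlsxwriter.Workbook("formula_example.xlsx")
worksheet = workbook.add_worksheet()
worksheet.write("A2", a2_value)

# Compute the result in Python, since XlsxWriter does not evaluate formulas.
result = number - a2_value
# Arguments: cell, formula, cell format (none), cached result.
worksheet.write_formula("E5", "=({} - A2)".format(number), None, result)
workbook.close()
print(result)

# Reading back needs another library; data_only=True returns cached results.
wb = openpyxl.load_workbook("formula_example.xlsx", data_only=True)
cached = wb.active["E5"].value
print(cached)
```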
If you want to use xlsxwriter for manipulating formats and formula that you can't do with pandas, you can at least import your excel file into an xlsxwriter object using pandas. Here's how.
import pandas as pd
import xlsxwriter
def xlsx_to_workbook(xlsx_in_file_url, xlsx_out_file_url, sheetname):
    """
    Read an Excel file into an xlsxwriter workbook worksheet
    """
    workbook = xlsxwriter.Workbook(xlsx_out_file_url)
    worksheet = workbook.add_worksheet(sheetname)
    # read my_excel into a pandas DataFrame
    df = pd.read_excel(xlsx_in_file_url)
    # A list of column headers
    list_of_columns = df.columns.values
    for col in range(len(list_of_columns)):
        # write column headers.
        # if you don't have column headers, remove the following line and use "row" rather than "row+1" in the if/else statements below
        worksheet.write(0, col, list_of_columns[col])
        for row in range(len(df)):
            # Test for NaN (NaN != NaN), otherwise worksheet.write throws an error.
            if df[list_of_columns[col]][row] != df[list_of_columns[col]][row]:
                worksheet.write(row+1, col, "")
            else:
                worksheet.write(row+1, col, df[list_of_columns[col]][row])
    return workbook, worksheet
# Create a workbook
# read your Excel file into a workbook/worksheet object to be manipulated with xlsxwriter
# this assumes that the Excel file has column headers
workbook, worksheet = xlsx_to_workbook("my_excel.xlsx", "my_future_excel.xlsx", "My Sheet Name")
###########################################################
# Do all your fancy formatting and formula manipulation here
###########################################################
# write/close the file my_future_excel.xlsx
workbook.close()
Not answering this specific question, just a suggestion: simply try pandas to read the data from Excel. After that you can manipulate the data using the DataFrame's built-in methods:
df = pd.read_excel(file_, index_col=None, header=0)
df is a pandas.DataFrame; the pandas cookbook shows what you can do with one. If you don't know this package yet, you might be surprised by how capable it is.
