How can I use xlwings to read a "table" in excel, into a pandas DataFrame, where the table "headers" become the DataFrame column names?
Every way I have tried to read the table, the header row is always excluded from the read!
Here is what I've tried, where "b" is my xlwings workbook object:
b.sheets['Sheet1'].range('Table1').options(pd.DataFrame)
b.sheets['Sheet1'].range('Table1').options(pd.DataFrame, headers=False)
b.sheets['Sheet1'].range('Table1').options(pd.DataFrame, headers=True)
Hoping this is not the best answer, but I did find that I could reference the named range and then use .offset(-1).expand('vertical') to pull in the header row as well.
Another option is to go through .api and use Excel's ListObjects (note this relies on the COM interface, so it is Windows-only):
import xlwings as xw
import pandas as pd

wb = xw.books.active
ws = wb.sheets('MySheet')
tbl = ws.api.ListObjects('MyTable')                 # or ws.api.ListObjects(1)
rng = ws.range(tbl.Range.Address)                   # build an xlwings range from the table's address
df = rng.options(pd.DataFrame, header=True).value   # load the range into a DataFrame
Let me add the comment by Felix Zumstein as an explicit answer, because in my opinion it is the best solution.
b.sheets['Sheet1'].range('Table1[[#All]]').options(pd.DataFrame)
Using the square-bracket (structured reference) notation expands the selection from the table body to include the header row as well, so the subsequent conversion to a pandas DataFrame can use it as the header.
To read more, check this answer.
So my goal was to add data in an already existing table using openpyxl and python. I did it by using .cell(row, column).value method.
After doing this I had a problem because the table I was writing the data into was not expanding correctly. So I found this method and it worked fine:
from openpyxl import load_workbook

wb = load_workbook("my_file.xlsx")  # file name is illustrative
bs_sheet = wb["Sheet1"]             # the sheet holding the table

# getting the last row that contains data
ok = bs_sheet.max_row

# expanding the table down to that row
bs_sheet.tables['Table1'].ref = "A1:H" + str(ok)
What I initially thought was that the formatting of the table would expand accordingly. What I mean is that if a column contained a formula, expanding the table with openpyxl would also extend that formula (and the styling, etc.) into the new rows, just like it does when you expand a table manually in Excel. But it doesn't, and this is where I have a problem, because I haven't found anything on it.
What I am having trouble with is that when extending the table, the formatting that was already applied to the existing rows doesn't extend down to the new rows. Is there a way I could fix this?
Using xlwings helps keep the same formatting (including justifications and formulas) of a table.
When inserting data, the table will automatically expand. See the example below:
import xlwings as xw

wb = xw.Book('test_book.xlsx')
sheet = wb.sheets[0]
tableau = sheet.tables[0]
tableau.name = 'new'                    # rename the existing table
sheet.range((3, 1)).value = 'VAMONOS'   # write just below the table body
wb.save('test_book.xlsx')
wb.close()
In this example, it will add a value to an already existing table (and also change the name of the table). You will see that the table has already expanded when you open the file again.
There are multiple ways to read Excel data into Python.
Pandas also provides an API for reading and writing:
import pandas as pd

df = pd.read_excel('File.xlsx', sheet_name='Sheet1')
That works fine.
BUT: what is the way to read the tables of every sheet directly into a pandas DataFrame?
The above picture shows a sheet including a table that does not start at cell (1,1).
Moreover, the sheet might include several tables (ListObjects in VBA).
I can not find anywhere the way to read them into pandas.
Note1: It is not possible to modify the workbook to bring all the tables towards cell(1,1).
Note2: I would like to use just pandas (if possible) and minimize the need to import other libraries, but if there is no other way I am ready to use another library. In any case, I could not manage it with xlwings, for instance.
Here it looks like it's possible to parse the Excel file, but no solution is provided for tables, just for complete sheets.
The documentation of pandas does not seem to offer that possibility.
Thanks.
You can use xlwings, great package for working with excel files in python.
This is for a single table, but it is pretty trivial to use the xlwings collections (App > books > sheets > tables) to iterate over all tables. Tables are, of course, ListObjects.
import xlwings
import pandas

with xlwings.App() as App:
    _ = App.books.open('my.xlsx')
    rng = App.books['my.xlsx'].sheets['mysheet'].tables['mytablename'].range
    df: pandas.DataFrame = rng.expand().options(pandas.DataFrame).value
I understand that this question has been marked solved already, but I found an article that provides a much more robust solution:
Full Post
I suppose a newer version of this library supports better visibility into the workbook structure. Here is a summary:
Load the workbook using the load_workbook function from openpyxl.
Then you are able to access the sheets within, each of which contains a collection of list objects (tables).
Once you gain access to the tables, you are able to get the range addresses of those tables.
Finally, loop through the ranges and create a pandas DataFrame from each one.
This is a nicer solution as it gives us the ability to loop through all the sheets and tables in a workbook.
Here is a way to parse one table; however, it requires you to know some information about the sheet being parsed.
df = pd.read_excel("file.xlsx", usecols="B:I", index_col=3)
print(df)
It is not elegant and works only if a single table is present inside the sheet, but it is a first step:
import pandas as pd
import string

letter = list(string.ascii_uppercase)
df1 = pd.read_excel("file.xlsx")

def get_start_column(df):
    for i, column in enumerate(df.columns):
        if df[column].first_valid_index():
            return letter[i]

def get_last_column(df):
    columns = df.columns
    len_column = len(columns)
    for i, column in enumerate(columns):
        if df[column].first_valid_index():
            return letter[len_column - i]

def get_first_row(df):
    for index, row in df.iterrows():
        if not row.isnull().values.all():
            return index + 1

def usecols(df):
    start = get_start_column(df)
    end = get_last_column(df)
    return f"{start}:{end}"

df = pd.read_excel("file.xlsx", usecols=usecols(df1), header=get_first_row(df1))
print(df)
I have an Excel file (.xlsm) from which I need to extract data, including data stored as comments in some cells. Is it possible to read such comments with pandas? How can it be done?
No. As far as I am aware, it is not currently possible. If you know you will be making comments when designing your spreadsheet, however, you can just dedicate a column to these comments. Alternatively, you can use something like
pd.read_excel('tmp.xlsx', index_col=0, comment='#')
to specify that any cell that starts with # will be regarded as a comment. From the documentation regarding the comment argument of pandas:
Comments out remainder of line. Pass a character or characters to this argument to indicate comments in the input file. Any data between the comment string and the end of the current line is ignored.
Update
I would like to add that openpyxl can read comments. An example script would look like this:
from openpyxl import load_workbook

wb = load_workbook("test.xlsx")
ws = wb["Sheet1"]  # or whatever the sheet name is
for row in ws.rows:
    for cell in row:
        if cell.comment:
            print(cell.comment.text)
Perhaps you could get this to interface with your data somehow!
Total newbie and this is my first ever question so apologies in advance for any inadvertent faux pas.
I have a large(ish) dataset in Excel .xlsx format that I would like to import into a pandas DataFrame. The data has column headers, except for the first column, which does not have a header label. Here is what the Excel sheet looks like:
Raw data
I am using read_excel() in Pandas to read in the data. The code I am using is:
df = pd.read_excel('Raw_Data.xlsx', sheet_name=0, header=0, index_col=None)
(I have tried index_col=False and index_col=0 but, for obvious reasons, it doesn't change anything.)
The headers for the columns are picked up fine but the first column, circled in red in the image below, is assigned as the index.
wrong index
What I am trying to get from the read_excel command is as follows with the index circled in red:
correct index
I have other Excel sheets that I have imported into pandas with read_excel(), and pandas automatically adds a numerical incremental index rather than inferring one of the columns as an index.
None of those Excel sheets had a missing label in the column header, which might be the issue here, though I am not sure.
I understand that I can use the reset_index() command after the import to get the correct index.
I am wondering if it can be done without the reset_index() call, within the read_excel() command itself; i.e., is there any way to prevent a column from being inferred as the index, or to force pandas to add the default numerical index like it normally does?
Thank you in advance!
I don't think you can do it with only the read_excel function because of the missing value in cell A1. If you want to insert something into that cell prior to reading the file with pandas, you could consider using openpyxl as below.
from openpyxl import load_workbook as load

path = 'Raw_Data.xlsx'
col_name = 'not_index'
cell = 'A1'

def write_to_cell(path, col_name, cell):
    wb = load(path)
    for sheet in wb.sheetnames:
        ws = wb[sheet]
        if ws[cell].value is None:
            ws[cell] = col_name
    wb.save(path)

write_to_cell(path, col_name, cell)
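Here is a hypothetical end-to-end sketch of that fix: build a sheet whose A1 is blank, fill the blank header cell in the same way as the answer above, and read the file back with pandas (the file, sheet, and column names are made up for the demo):

```python
import pandas as pd
from openpyxl import Workbook, load_workbook

# Build a demo file whose first header cell (A1) is blank.
wb = Workbook()
ws = wb.active
ws["B1"] = "col_b"              # A1 intentionally left empty
ws["A2"], ws["B2"] = "x", 1
ws["A3"], ws["B3"] = "y", 2
wb.save("Raw_Data.xlsx")

# Apply the fix: give the blank header cell a name.
wb = load_workbook("Raw_Data.xlsx")
for sheet in wb.sheetnames:
    ws = wb[sheet]
    if ws["A1"].value is None:
        ws["A1"] = "not_index"
wb.save("Raw_Data.xlsx")

# Now pandas keeps its default numerical index.
df = pd.read_excel("Raw_Data.xlsx")
print(df.columns.tolist())  # ['not_index', 'col_b']
```

With A1 filled in, the first column is read as an ordinary column and pandas assigns its usual RangeIndex.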
I am trying to create a database and fill it with values gotten from an excel sheet.
My code:
new_db = pd.DataFrame()
workbook = pd.ExcelFile(filename)
df = workbook.parse('Sheet1')
print(df)
new_db.append(df)
print(new_db.head())
But whenever I seem to do this, I get an empty dataframe back.
My Excel sheet, however, is packed with values. When it is printed (print(df)), it prints out with ID values and all the correct columns and rows.
My knowledge with Pandas-Dataframes is limited so excuse me if I do not know something I should. All help is appreciated.
I think pandas.read_excel is what you're looking for. Here is an example:
import pandas as pd
df = pd.read_excel(filename)
print(df.head())
df will have the type pandas.DataFrame
The default parameters of read_excel are set so that the first sheet in the Excel file is read; check the documentation for more options (if you provide a list of sheets to read by setting the sheet_name parameter, df will be a dictionary with sheet names as keys and their corresponding DataFrames as values). Depending on the version of Python you're using and its distribution, you may need to install an Excel engine such as openpyxl (older pandas versions used the xlrd module), which you can do using pip.
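As a small, self-contained sketch of that sheet_name behavior (the file and sheet names here are made up for the demo):

```python
import pandas as pd

# Write a demo workbook with two sheets.
with pd.ExcelWriter("book.xlsx") as writer:
    pd.DataFrame({"a": [1, 2]}).to_excel(writer, sheet_name="First", index=False)
    pd.DataFrame({"b": [3]}).to_excel(writer, sheet_name="Second", index=False)

one = pd.read_excel("book.xlsx")                                   # first sheet only -> DataFrame
many = pd.read_excel("book.xlsx", sheet_name=["First", "Second"])  # -> dict of DataFrames

print(type(one).__name__)   # DataFrame
print(sorted(many))         # ['First', 'Second']
```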
You need to reassign the df after appending to it, as #ayhan pointed out in the comments:
new_db = new_db.append(df)
From the pandas documentation for append: it returns the appended DataFrame, which means you need to assign the result to a variable.
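A minimal sketch of the reassignment point; note that DataFrame.append was removed in pandas 2.0, so on current pandas the same pattern is written with pd.concat:

```python
import pandas as pd

new_db = pd.DataFrame()
df = pd.DataFrame({"id": [1, 2], "val": ["a", "b"]})

# append returned a new frame rather than mutating in place;
# concat behaves the same way and must also be reassigned.
new_db = pd.concat([new_db, df], ignore_index=True)
print(len(new_db))  # 2
```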