I'm working with some xlsx files and need to import them into Python. I've written a script that handles everything I need it to do.
However, I need to apply filters to the table in Excel before importing them.
When I apply filters and prep the table for import, Python reads the entire table and ignores all the filters.
My workaround has been to filter what I need, then copy that to a new sheet. Then, when reading into Python, I specify the new sheet of filtered data.
Is there a way to read the filtered table into Python directly?
Or should I just import the entire table and apply those same filters using pandas in my script instead?
IIUC, you can't read only visible rows and/or columns of an Excel spreadsheet with pandas.
To do that, you need some help from openpyxl (!pip install openpyxl) :
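from openpyxl import load_workbook
import pandas as pd

wb = load_workbook("file.xlsx")
ws = wb.active  # or wb["SheetName"]  # <- change the name here

# Hidden (filtered-out) rows become empty lists; empty cells (and other falsy values such as 0) are skipped.
rows = [[c.value for c in r
         if c.value and not ws.row_dimensions[r[0].row].hidden]
        for r in ws.iter_rows()]

# The empty lists turn into all-NaN rows, which dropna() then removes.
df = pd.DataFrame(data=rows[1:], columns=rows[0]).dropna()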
Output :
print(df)
   col  col2
0  foo   1.0
2  baz   3.0
Input used (spreadsheet) :
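For the second option raised in the question (import everything, then filter in pandas), a boolean mask can reproduce a simple Excel filter. This is only a sketch: the column names and the filter condition are hypothetical stand-ins, not the asker's actual data.

```python
import pandas as pd

# Hypothetical stand-in for the full, unfiltered sheet.
df = pd.DataFrame({"col": ["foo", "bar", "baz"], "col2": [1.0, 2.0, 3.0]})

# Roughly equivalent to unticking "bar" in Excel's filter drop-down for "col".
filtered = df[df["col"] != "bar"]
print(filtered)
```

Keeping the filter in the script has the side benefit that the Excel file does not need to be prepared by hand before each import.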
For the past few days I've been trying to do a relatively simple task but I'd always encounter some errors so I'd really appreciate some help on this. Here goes:
I have an Excel file which contains a specific column (Column F) that has a list of IDs.
What I want to do is for the program to read this excel file and allow the user to input any of the IDs they would like.
When the user types in one of the IDs, I want the program to return all the IDs that contain the text the user entered. After that, I'd like to export those IDs to a new, separate Excel file where they are displayed in one column, one per row.
Here's my code so far, I've tried using arrays and stuff but nothing seems to be working for me :/
import pandas as pd
import numpy as np
import re
import xlrd
import os.path
import xlsxwriter
import openpyxl as xl;
from pandas import ExcelWriter
from openpyxl import load_workbook
# LOAD EXCEL TO DATAFRAME
xls = pd.ExcelFile('N:/TEST/TEST UTILIZATION/IA 2020/Dev/SCS-FT-IE-Report.xlsm')
df = pd.read_excel(xls, 'FT')
# GET USER INPUT (USE AD1852 AS EXAMPLE)
value = input("Enter a Part ID:\n")
print(f'You entered {value}\n\n')
i = 0
x = df.loc[i, "MFG Device"]
df2 = np.array(['', 'MFG Device', 'Loadboard Group','Socket Group', 'ChangeKit Group'])
for i in range(17367):
    # x = df.loc[i, "MFG Device"]
    if value in x:
        df = np.array[x]
        df2.append(df)
    i += 1
print(df2)
# create excel writer object
writer = pd.ExcelWriter('N:/TEST/TEST UTILIZATION/IA 2020/Dev/output.xlsx')
# write dataframe to excel
df2.to_excel(writer)
# save the excel
writer.save()
print('DataFrame is written successfully to Excel File.')
Any help would be appreciated, thanks in advance! :)
It looks like you're doing much more than you need to do. Rather than monkeying around with xlsxwriter, pandas.DataFrame.to_excel is your friend.
Just do
df2.to_excel("output.xlsx")
You don't need xlsxwriter. Simply calling df.to_excel() would work. In your code, df2 is a NumPy array. First convert it into a pandas DataFrame with the index and columns you require before writing it to Excel.
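Putting the two answers together, the search itself can also stay inside pandas instead of looping over rows. A minimal sketch, assuming the IDs live in the "MFG Device" column as in the question; the helper name find_matching_ids and the sample IDs are made up for illustration:

```python
import pandas as pd

def find_matching_ids(df, value, column="MFG Device"):
    """Return the rows of `column` that contain `value` as a literal substring."""
    # regex=False treats the input as plain text; na=False skips empty cells.
    mask = df[column].str.contains(value, regex=False, na=False)
    return df.loc[mask, [column]].reset_index(drop=True)

# Hypothetical stand-in for the "FT" sheet loaded with pd.read_excel.
sample = pd.DataFrame({"MFG Device": ["AD1852JRS", "AD1852JRSZ", "LT1009", None]})
matches = find_matching_ids(sample, "AD1852")
print(matches)

# Writing the result is then a single call (the file name is a placeholder):
# matches.to_excel("output.xlsx", index=False)
```

The result is a one-column DataFrame, so to_excel writes each matching ID on its own row.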
Currently I'm running a live test that uses three variables: data1, data2 and data3. The problem is that whenever I run my Python code, it only writes to the first row of the respective columns and overwrites any previous data I had.
import pandas as pd
import xlsxwriter
from openpyxl import load_workbook
def dataholder(data1, data2, data3):
    df = pd.DataFrame({'Col1': [data1], 'Col2': [data2], 'Col3': [data3]})
    with pd.ExcelWriter('data_hold.xlsx', engine='openpyxl') as writer:
        df.to_excel(writer, sheet_name='Sheet1')
        writer.save()
Is what I'm trying to accomplish feasible?
Use startrow=... of to_excel to shift every subsequent update down.
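One way the startrow idea might look in practice is sketched below. It assumes pandas 1.4 or newer (for mode="a" combined with if_sheet_exists="overlay") and the openpyxl engine; the function name append_row is made up for illustration, and the sheet is assumed to already exist in the file on later calls.

```python
import os
import pandas as pd

def append_row(path, data1, data2, data3, sheet_name="Sheet1"):
    """Append one row to `sheet_name` in `path`, creating the file if needed."""
    df = pd.DataFrame({"Col1": [data1], "Col2": [data2], "Col3": [data3]})
    if not os.path.exists(path):
        # First call: write the header row plus the first data row.
        df.to_excel(path, sheet_name=sheet_name, index=False)
        return
    # Later calls: reopen the file in append mode and write below the last used row.
    with pd.ExcelWriter(path, engine="openpyxl", mode="a",
                        if_sheet_exists="overlay") as writer:
        # max_row is a 1-based count, so it equals the next free 0-based row.
        start = writer.sheets[sheet_name].max_row
        df.to_excel(writer, sheet_name=sheet_name, index=False,
                    header=False, startrow=start)
```

Calling append_row("data_hold.xlsx", data1, data2, data3) on every update would then add a new row each time instead of overwriting the first one.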
There are multiple ways to read Excel data into Python.
Pandas also provides an API for reading and writing:
import pandas as pd
from pandas import ExcelWriter
from pandas import ExcelFile
df = pd.read_excel('File.xlsx', sheet_name='Sheet1')
That works fine.
But what is the way to read the tables of every sheet directly into a pandas DataFrame?
The picture above shows a sheet containing a table that does not start at cell (1,1).
Moreover, the sheet might include several tables (ListObjects in VBA).
I cannot find anywhere a way to read them into pandas.
Note1: It is not possible to modify the workbook to move all the tables to cell (1,1).
Note2: I would like to use just pandas (if possible) and minimize the need to import other libraries. But if there is no other way, I am ready to use another library. In any case, I could not manage it with xlwings, for instance.
Here it looks like it's possible to parse the Excel file, but no solution is provided for tables, just for complete sheets.
The pandas documentation does not seem to offer that possibility.
Thanks.
You can use xlwings, a great package for working with Excel files in Python.
This is for a single table, but it is fairly easy to use the xlwings collections (App > books > sheets > tables) to iterate over all tables. Tables are, of course, ListObjects.
import xlwings
import pandas
with xlwings.App() as App:
    _ = App.books.open('my.xlsx')
    rng = App.books['my.xlsx'].sheets['mysheet'].tables['mytablename'].range
    df: pandas.DataFrame = rng.expand().options(pandas.DataFrame).value
I understand that this question has been marked solved already, but I found an article that provides a much more robust solution:
Full Post
I suppose a newer version of this library supports better visibility of the workbook structure. Here is a summary:
Load the workbook using the load_workbook function from openpyxl.
You can then access the sheets within it, each of which holds a collection of the List Objects (tables) in Excel.
Once you have access to the tables, you can get their range addresses.
Finally, loop through the ranges and create a pandas DataFrame from each one.
This is a nicer solution because it lets us loop through all the sheets and tables in a workbook.
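The steps summarized above could be sketched roughly as follows. This is not the code from the linked post; it is an illustration that assumes every table has a header row, and the function name read_all_tables is made up.

```python
from openpyxl import load_workbook
import pandas as pd

def read_all_tables(path):
    """Return {(sheet title, table name): DataFrame} for every table in the workbook."""
    wb = load_workbook(path, data_only=True)  # data_only: cached values, not formulas
    frames = {}
    for ws in wb.worksheets:
        # ws.tables maps table names to Table objects; table.ref is e.g. "B3:F20".
        for table in ws.tables.values():
            rows = [[cell.value for cell in row] for row in ws[table.ref]]
            # Assumption: the first row of the range is the table's header row.
            frames[(ws.title, table.name)] = pd.DataFrame(rows[1:], columns=rows[0])
    return frames
```

Because each table carries its own range address, the tables do not need to start at cell (1,1), and several tables per sheet are handled by the inner loop.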
Here is a way to parse one table; however, it requires you to know some information about the sheet being parsed.
df = pd.read_excel("file.xlsx", usecols="B:I", index_col=3)
print(df)
Not elegant, and it works only if a single table is present in the sheet, but it's a first step:
import pandas as pd
import string

# Column letters A..Z; this approach only handles tables within the first 26 columns.
letter = list(string.ascii_uppercase)

df1 = pd.read_excel("file.xlsx")

def get_start_column(df):
    # First column (from the left) that holds at least one non-null value.
    for i, column in enumerate(df.columns):
        if df[column].first_valid_index() is not None:
            return letter[i]

def get_last_column(df):
    # Last column (from the right) that holds at least one non-null value.
    columns = df.columns
    for i in range(len(columns) - 1, -1, -1):
        if df[columns[i]].first_valid_index() is not None:
            return letter[i]

def get_first_row(df):
    # First non-empty row; +1 because the sheet's first row was consumed as the header.
    for index, row in df.iterrows():
        if not row.isnull().values.all():
            return index + 1

def usecols(df):
    start = get_start_column(df)
    end = get_last_column(df)
    return f"{start}:{end}"

df = pd.read_excel("file.xlsx", usecols=usecols(df1), header=get_first_row(df1))
print(df)
I would like to read in an excel spreadsheet to python / pandas, but have the formulae instead of the cell results.
For example, if cell A1 is 25, and cell B1 is =A1, I would like my dataframe to show:
25 =A1
Right now it shows:
25 25
How can I do so?
OpenPyXL provides this capacity out-of-the-box. See here and here. An example:
from openpyxl import load_workbook
import pandas as pd
wb = load_workbook(filename='empty_book.xlsx')
sheet_names = wb.sheetnames  # replaces the deprecated get_sheet_names()
name = sheet_names[0]
sheet_ranges = wb[name]
df = pd.DataFrame(sheet_ranges.values)
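To see the behavior on the exact case from the question, here is a small self-contained sketch that builds the workbook in memory rather than reading a real file. It relies on load_workbook's default data_only=False, which returns formula cells as their formula text.

```python
from io import BytesIO
from openpyxl import Workbook, load_workbook
import pandas as pd

# Build a tiny workbook in memory: A1 holds 25, B1 holds the formula =A1.
buffer = BytesIO()
wb = Workbook()
ws = wb.active
ws["A1"] = 25
ws["B1"] = "=A1"
wb.save(buffer)

# Default load (data_only=False): formula cells come back as their formula text.
buffer.seek(0)
formulas = pd.DataFrame(load_workbook(buffer).active.values)
print(formulas)
```

Passing data_only=True instead asks for the values Excel cached at the last save, which is what pandas.read_excel shows.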
Yes, it is possible. I recently found a package that solves this issue in a quite sophisticated way. It is called portable-spreadsheet (available via pip install portable-spreadsheet). It basically encapsulates xlsxwriter. Here is a simple example:
import portable_spreadsheet as ps
sheet = ps.Spreadsheet.create_new_sheet(5, 5)
# Set values
sheet.iloc[0, 0] = 25 # Set A1
sheet.iloc[1, 0] = sheet.iloc[0, 0] # reference to A1
# Export to Excel
sheet.to_excel('output/sample.xlsx')
It works in a similar way as Pandas Dataframe.
One option is to use the xlwings and pandas modules.
xlwings provides a way to automate Excel via Python scripts.
Create a "sample.xlsx" file and add any formula in range("A1").
The sample code below reads both the value and the formula from the given file:
import pandas as pd
import xlwings as xw
wbk = xw.Book('sample.xlsx')
ws = wbk.sheets[0]
print(ws.cells(1,1).value)
print(ws.cells(1,1).formula)
The same applies to ranges. You can assign range.value to a DataFrame and vice versa.
If you want the formulas from a large range, you can get those too, but they are returned as a tuple.
Hope this is helpful to some extent.
I am trying to create a filter in Excel programmatically, so that when the sheet is created with openpyxl, the first row of each sheet is already set up as a filter. I've looked at the docs, but all I can find is how to filter data, not how to create a filter.
Is it even possible with the current implementation of openpyxl?
openpyxl does support filters. See the worksheet.filters module and the associated tests.
Sample of what you can do:
ws.auto_filter.ref = 'C1:G9'
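If the range is not known in advance, ws.dimensions can supply it. A minimal sketch with made-up sample data, assuming the header sits in the first row of the used range:

```python
from openpyxl import Workbook

wb = Workbook()
ws = wb.active
ws.append(["name", "qty", "price"])  # header row
ws.append(["apple", 3, 1.5])
ws.append(["pear", 7, 0.9])

# ws.dimensions is the bounding range of the used cells, here "A1:C3".
ws.auto_filter.ref = ws.dimensions

# wb.save("filtered.xlsx")  # placeholder output file name
```

This only adds the filter drop-downs to the header row; it does not hide any rows by itself.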
Copy and paste this into a .py file and run.
import pandas as pd
import numpy as np
import openpyxl

# Here is an example dataframe
df_example = pd.DataFrame(np.random.randint(0, 100, size=(100, 4)), columns=list('ABCD'))

# Create xlsx file
filepath = 'mytempfile.xlsx'
with pd.ExcelWriter(filepath, engine='xlsxwriter') as writer:
    df_example.to_excel(writer, sheet_name='Sheet1', index=False)

# Add filter feature to first row
xfile = openpyxl.load_workbook(filepath)
sheet = xfile['Sheet1']  # replaces the deprecated get_sheet_by_name()
maxcolumnletter = openpyxl.utils.get_column_letter(sheet.max_column)
sheet.auto_filter.ref = 'A1:' + maxcolumnletter + str(len(sheet['A']))

# Save the file
xfile.save(filepath)
print('your file:', filepath)