I know how to get the list of sheet names. The excel file I am using has multiple sheets. How do I select the first one sequentially ? I don't know the name of the sheet but I need to select the first one. How would I go about this ?
The first sheet is automatically selected when the Excel table is read into a dataframe.
To be explicit however, the command is :
import pandas as pd
fd = 'file path'
data = pd.read_excel( fd, sheet_name=0 )
Use of 'sheetname' is deprecated. Please use sheet_name
Also this bug at the time of writing:
https://github.com/pandas-dev/pandas/issues/17107
Use 'sheetname', not 'sheet_name'.
Following the official documentation and as already suggested by EdChum, it's enough to use read_excell passing sheetname=N as argument.
N=0 for the first sheet, N=1 for the second, N=2 for the third and so on..
Related
I have a big size excel files that I'm organizing the column names into a unique list.
The code below works, but it takes ~9 minutes!
Does anyone have suggestions for speeding it up?
import pandas as pd
import os
get_col = list(pd.read_excel("E:\DATA\dbo.xlsx",nrows=1, engine='openpyxl').columns)
print(get_col)
Using pandas to extract just the column names of a large excel file is very inefficient.
You can use openpyxl for this:
from openpyxl import load_workbook
wb = load_workbook("E:\DATA\dbo.xlsx", read_only=True)
columns = {}
for sheet in worksheets:
for value in sheet.iter_rows(min_row=1, max_row=1, values_only=True):
columns = value
Assuming you only have one sheet, you will get a tuple of column names here.
If you want faster reading, then I suggest you use other type files. Excel, while convenient and fast are binary files, therefore for pandas to be able to read it and correctly parse it must use the full file. Using nrows or skipfooter to work with less data with only happen after the full data is loaded and therefore shouldn't really affect the waiting time. On the opposite, when working with a .csv() file, given its type and that there is no significant metadata, you can just extract the first rows of it as an interable using the chunksize parameter in pd.read_csv().
Other than that, using list() with a dataframe as value, returns a list of the columns already. So my only suggestion for the code you use is:
get_col = list(pd.read_excel("E:\DATA\dbo.xlsx",nrows=1, engine='openpyxl'))
The stronger suggestion is to change datatype if you specifically want to address this issue.
I have an workbook in Excel and I need to find the first column that is empty / has no data in it. I need to keep Excel open at all times, so something like openpyxl won't do.
Here's my code so far:
import xlwings as xw
from pathlib import Path
wbPath = Path('test.xlsx')
wb = xw.Book(wbPath)
sourceSheet = wb.sheets['source']
This can be done using
destinationSheet["A1"].expand("right").last_cell.column
Depending on what you need exactly, this code might be most robust. With using used_range, the code gives you the first empty column at the very end of the data as integer, regardless of empty/blank columns before the last column with data.
a_rng = sourceSheet.used_range[-1].offset(column_offset=1).column
print(a_rng)
I have an excel file(.xlsm), from which I need to extract data, including data stored as comments in some cells. Is it possible to read such comments with Pandas? How to do it?
No. As far as I am aware it is not currently possible. If you know you will be making comments when designing your spreadsheet however, you can just specify a column that will contain these comments. Alternatively, you can use something like
pd.read_excel('tmp.xlsx', index_col=0, comment='#')
to specify that any cell that starts with # will be regarded as a comment. From the documentation regarding the comment argument of pandas:
Comments out remainder of line. Pass a character or characters to this argument to indicate comments in the input file. Any data between the comment string and the end of the current line is ignored.
update
I would like to say that I know openpyxl can read comments. An example script would look like:
from openpyxl import Workbook
from openpyxl import load_workbook
wb = load_workbook("test.xlsx")
ws = wb["Sheet1"] # or whatever sheet name
for row in ws.rows:
for cell in row:
print(cell.comment)
Perhaps you could get this to interface with your data somehow!
I am trying to create a database and fill it with values gotten from an excel sheet.
My code:
new_db = pd.DataFrame()
workbook = pd.ExcelFile(filename)
df = workbook.parse('Sheet1')
print(df)
new_db.append(df)
print(new_db.head())
But whenever I seem to do this, I get an empty dataframe back.
My excel sheet however is packed with values. When it is printed(print(df)) it prints it out with ID values and all the correct columns and rows.
My knowledge with Pandas-Dataframes is limited so excuse me if I do not know something I should. All help is appreciated.
I think pandas.read_excel is what you're looking for. here is an example:
import pandas as pd
df = pd.read_excel(filename)
print(df.head())
df will have the type pandas.DataFrame
The default parameters of read_excel are set in a way that the first sheet in the excel file will be read, check the documentation for more options(if you provide a list of sheets to read by setting the sheetname parameter df will be a dictionary with sheetnames as keys and their correspoding Dataframes as values). Depending on the version of Python you're using and its distribution you may need to install the xlrd module, which you can do using pip.
You need to reassign the df after appending to it, as #ayhan pointed out in the comments:
new_db = new_db.append(df)
From the Panda's Documentation for append, it returns an appended dataframe, which means you need to assign it to a variable.
Imagine I am given two columns: a,b,c,d,e,f and e,f,g,h,i,j (commas indicating a new row in the column)
How could I read in such information from excel and put it in an two separate arrays? I would like to manipulate this info and read it off later. as part of an output.
You have a few choices here. If your data is rectangular, starts from A1, etc. you should just use pandas.read_excel:
import pandas as pd
df = pd.read_excel("/path/to/excel/file", sheetname = "My Sheet Name")
print(df["column1"].values)
print(df["column2"].values)
If your data is a little messier, and the read_excel options aren't enough to get your data, then I think your only choice will be to use something a little lower level like the fantastic xlrd module (read the quickstart on the README)