I have an Excel file (.xlsm) from which I need to extract data, including data stored as comments in some cells. Is it possible to read such comments with pandas, and if so, how?
No. As far as I am aware it is not currently possible. If you know you will be making comments when designing your spreadsheet however, you can just specify a column that will contain these comments. Alternatively, you can use something like
pd.read_excel('tmp.xlsx', index_col=0, comment='#')
to specify that any cell that starts with # will be regarded as a comment. From the documentation regarding the comment argument of pandas:
Comments out remainder of line. Pass a character or characters to this argument to indicate comments in the input file. Any data between the comment string and the end of the current line is ignored.
update
openpyxl, on the other hand, can read cell comments. An example script would look like:
from openpyxl import load_workbook
wb = load_workbook("test.xlsx")
ws = wb["Sheet1"]  # or whatever sheet name
for row in ws.rows:
    for cell in row:
        # cell.comment is None for cells without a comment
        print(cell.comment)
Perhaps you could get this to interface with your data somehow!
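One way to interface the comments with the data, as a minimal sketch: collect every comment into a dictionary keyed by cell coordinate, which can then be matched against the cells pandas reads. The helper names `collect_comments` and `comments_from_file` are made up for illustration, and the path and sheet name are whatever your workbook uses. The comment text is read from the comment object's `text` attribute.

```python
def collect_comments(ws):
    """Return {cell coordinate: comment text} for every commented cell in ws."""
    comments = {}
    for row in ws.iter_rows():
        for cell in row:
            # cell.comment is None when the cell has no comment attached
            if cell.comment is not None:
                comments[cell.coordinate] = cell.comment.text
    return comments


def comments_from_file(path, sheet_name):
    """Open a workbook with openpyxl and collect the comments of one sheet."""
    from openpyxl import load_workbook  # third-party: pip install openpyxl

    wb = load_workbook(path)  # path is an assumption, e.g. "test.xlsx"
    return collect_comments(wb[sheet_name])
```

Calling `comments_from_file("test.xlsx", "Sheet1")` would give something like a mapping from "B2" to the comment text on that cell, which you can then attach to the corresponding DataFrame rows.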
Related
I was looking at the code related to xlsxwriter when using pandas' DataFrame.to_excel method.
I ended up adding some formatting to the files, but the columns don't seem to work. Ideally I was hoping to dynamically set column widths to fit the content.
I saw there was a method called set_column, which I thought might do the trick. However, https://xlsxwriter.readthedocs.io/worksheet.html#set_column showed me that the width needs to be a number.
That number, as far as I can tell, needs to be the length of the longest string in that column (including the column name itself). While I can compute that, it seemed a bit extreme. I figured there might be a wrap or auto-fit command I could use instead.
Some Simple Code I was using:
import pandas as pd
from pandas import DataFrame
df = DataFrame({"aadsfasdfasdfasdfasdf":[1,2,3]})
writer = pd.ExcelWriter(filename, engine='xlsxwriter')
_base_sheet = "Sheet1"
df.to_excel(writer, sheet_name=_base_sheet, header=HEADERS)
workbook = writer.book
worksheet = writer.sheets[_base_sheet]
...
# Here I would want do set all columns to have some sort of auto-width
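A rough sketch of the "longest string per column" approach described above, not an official auto-fit: the helper names `column_widths` and `write_autofit` and the padding of 2 characters are assumptions, and the frame is written with `index=False` so that DataFrame column positions line up with worksheet columns. Recent XlsxWriter releases also provide a `worksheet.autofit()` method, which may be simpler if your version has it.

```python
def column_widths(columns, rows, padding=2):
    """Width per column: longest rendered string (header included) plus padding."""
    widths = [len(str(name)) for name in columns]
    for row in rows:
        for i, value in enumerate(row):
            widths[i] = max(widths[i], len(str(value)))
    return [w + padding for w in widths]


def write_autofit(df, path, sheet_name="Sheet1"):
    """Write df with pandas/xlsxwriter and size each column to its content."""
    import pandas as pd  # third-party; the xlsxwriter package must be installed too

    with pd.ExcelWriter(path, engine="xlsxwriter") as writer:
        df.to_excel(writer, sheet_name=sheet_name, index=False)
        worksheet = writer.sheets[sheet_name]
        widths = column_widths(df.columns, df.itertuples(index=False))
        for col_idx, width in enumerate(widths):
            # set_column(first_col, last_col, width); width is in character units
            worksheet.set_column(col_idx, col_idx, width)
```

The width is only an approximation, since Excel measures columns in units of the default font's character width, so proportional fonts or bold headers may need a larger padding.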
Apologies for no coding provided, this is really a generic question.
I'm using Python xlwings library, and trying to copy a sheet from one workbook to another new workbook, then hard-code the sheet in the newly created workbook. Effectively same as "Copy / Paste Values and source formatting".
I wasn't able to find any documentation on this, and thank you in advance for your help!
Edit: someone mentioned that I should include an example. Here it is, but it's kind of hard to show the format of an Excel file. The following code will copy/paste "sht" into a new workbook, but "new_sht" will still contain formulas. I'm trying to hard-code all the values while preserving the number format (e.g. thousands separator, percentage sign, etc.).
import xlwings as xw
wb = xw.Book('example1.xlsx')
sht = wb.sheets['sheet1']
new_wb = xw.Book()
new_sht = new_wb.sheets[0]
sht.api.Copy(Before = new_sht.api)
Answering my own question as I just figured out what I wanted to accomplish.
The following code will hard-code the values while preserving the formatting, since it essentially pastes values only into an already formatted area.
new_sht.range('A1:C10').value = new_sht.range('A1:C10').value
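The same idea can be applied without hard-coding the 'A1:C10' address. This is only a sketch, assuming the sheet object exposes xlwings' `used_range` property; the function name `hardcode_values` is made up for illustration.

```python
def hardcode_values(sheet):
    """Overwrite every formula in the sheet's used range with its current value."""
    rng = sheet.used_range  # the block of cells Excel reports as in use
    # Reading .value returns computed values rather than formulas; writing them
    # back replaces the formulas and leaves the existing cell formats untouched.
    rng.value = rng.value
```

With the objects from the question, this would be called as `hardcode_values(new_sht)` after the sheet has been copied.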
import pandas as pd
check = pd.read_csv('1.csv')
nocheck = check['CUSIP'].str[:-1]
nocheck = nocheck.to_frame()
nocheck['CUSIP'] = nocheck['CUSIP'].astype(str)
nocheck.to_csv('NoCheck.csv')
This works but while writing the csv, a value for an identifier like 0003418 (type = str) converts to 3418 (type = general) when the csv file is opened in Excel. How do I avoid this?
I couldn't find a dupe for this question, so I'll post my comment as a solution.
This is an Excel issue, not a Python error. Excel auto-formats numeric columns and removes leading zeros. You can "fix" this by forcing pandas to quote when writing:
import csv
# insert pandas code from question here
# use csv.QUOTE_ALL when writing CSV.
nocheck.to_csv('NoCheck.csv', quoting=csv.QUOTE_ALL)
Note that this will actually put quotes around each value in your CSV. It will render the way you want in Excel, but you may run into issues if you try to read the file some other way.
Another solution is to write the CSV without quoting and import the column into Excel as "Text" (for example through the text import wizard) instead of letting Excel auto-detect it as "General".
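If Excel still strips the zeros when the file is opened directly, a commonly used workaround is to write each identifier as a small formula, ="0003418", which Excel evaluates to a text string. The sketch below uses only the standard csv module; the helper names `as_excel_text` and `write_excel_safe_csv` are hypothetical, and the resulting file is only meant for Excel, since other readers will see the literal ="..." wrapper.

```python
import csv
import io


def as_excel_text(value):
    """Wrap a value as ="..." so Excel evaluates it to a text string."""
    return '="{}"'.format(value)


def write_excel_safe_csv(rows, text_columns, out):
    """Write data rows to out, wrapping the columns whose index is in text_columns."""
    writer = csv.writer(out)
    for row in rows:
        writer.writerow(
            [as_excel_text(v) if i in text_columns else v for i, v in enumerate(row)]
        )


# Example: protect the first column (the identifier) only
buffer = io.StringIO()
write_excel_safe_csv([["0003418", "bond"]], text_columns={0}, out=buffer)
print(buffer.getvalue())
```

With pandas, the same wrapping could be applied to the column before calling to_csv, for example with `nocheck['CUSIP'].map(as_excel_text)`.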
How can I use xlwings to read a "table" in excel, into a pandas DataFrame, where the table "headers" become the DataFrame column names?
Every way I have tried to read the table, the header row is always excluded from the read!
Here is what I've tried, where "b" is my xlwings workbook object:
b.sheets['Sheet1'].range('Table1').options(pd.DataFrame)
b.sheets['Sheet1'].range('Table1').options(pd.DataFrame, headers=False)
b.sheets['Sheet1'].range('Table1').options(pd.DataFrame, headers=True)
Hoping this is not the best answer, but I did find I could reference the named range, then .offset(-1).expand('vertical')
Another option is to use the api and Excel's ListObject:
import pandas as pd
import xlwings as xw
wb = xw.books.active
ws = wb.sheets('MySheet')
tbl = ws.api.ListObjects('MyTable') # or .ListObjects(1)
rng = ws.range(tbl.range.address) # get range from table address (header row included)
df = rng.options(pd.DataFrame, header=True).value # load range to dataframe
Let me add the comment by Felix Zumstein as an explicit answer, because in my opinion it is the best solution.
b.sheets['Sheet1'].range('Table1[[#All]]').options(pd.DataFrame).value
Using the square bracket notation expands the selection from the table body to include the header row as well. The conversion to a pandas DataFrame can then use that row as the header. Note that .value is what actually returns the DataFrame; .options(...) alone only returns the range.
To read more, check this answer.
I know how to get the list of sheet names. The Excel file I am using has multiple sheets. How do I select the first one by position? I don't know the name of the sheet, but I need to select the first one. How would I go about this?
The first sheet is automatically selected when the Excel file is read into a DataFrame.
To be explicit, however, the command is:
import pandas as pd
fd = 'file path'
data = pd.read_excel( fd, sheet_name=0 )
Use of 'sheetname' is deprecated. Please use sheet_name
Also this bug at the time of writing:
https://github.com/pandas-dev/pandas/issues/17107
In pandas versions before 0.21 the keyword is 'sheetname', not 'sheet_name'.
Following the official documentation and as already suggested by EdChum, it's enough to call read_excel with sheet_name=N (sheetname=N in those older versions).
N=0 for the first sheet, N=1 for the second, N=2 for the third, and so on.