I have a Python script in which I take certain data from an Excel file and work with it. Now I want my code to add, at the end, a new column named XY to the already existing Excel table. What would be your approach to this?
If you're using pandas to perform operations on the data, and you have it loaded as a df, just add:
import pandas as pd
import numpy as np
# Generating df with random numbers to show example
df = pd.DataFrame(np.random.randint(0, 100, size=(15, 4)),
                  columns=list('ABCD'))
print(df.head())
# Adding the empty column
df['xy'] = ''
print(df.head())
# Exporting to Excel
df.to_excel('FileName.xlsx', sheet_name='Sheet1')
This will add an empty column to the df, with the top cell labelled xy. If you want any values in the column, you can replace the empty '' with a list of whatever values you need (one per row).
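For example, to fill it with values instead (here computed from the sample A and B columns above, but any list or Series with one entry per row works):
# Any list/Series with one value per row can go here
df['xy'] = df['A'] + df['B']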
Hope this helps!
The easiest way to get the right code is to record a macro in Excel. Go to your table in Excel, choose 'Record macro' and manually perform the required actions. Then choose 'Stop recording' and open the VBA editor to see the generated code. Then use the equivalent code in your Python app.
There are multiple ways to read excel data into python.
Pandas also provides an API for reading and writing:
import pandas as pd
from pandas import ExcelWriter
from pandas import ExcelFile
df = pd.read_excel('File.xlsx', sheetname='Sheet1')
That works fine.
BUT: what is the way to read the tables of every sheet directly into a pandas DataFrame?
The picture above shows a sheet containing a table that does not start at cell (1,1). Moreover, the sheet might include several tables (ListObjects in VBA).
I can not find anywhere the way to read them into pandas.
Note1: It is not possible to modify the workbook to bring all the tables towards cell(1,1).
Note2: I would like to use just pandas (if possible) and minimize the need to import other libraries. But if there is no other way, I am ready to use another library. In any case, I could not manage it with xlwings, for instance.
Here it looks like it's possible to parse the Excel file, but no solution is provided for tables, just for complete sheets.
The documentation of pandas does not seem to offer that possibility.
Thanks.
You can use xlwings, a great package for working with Excel files in Python.
This is for a single table, but it is pretty trivial to use the xlwings collections (App > books > sheets > tables) to iterate over all tables. Tables are of course ListObjects.
import xlwings
import pandas
with xlwings.App() as App:
    _ = App.books.open('my.xlsx')
    rng = App.books['my.xlsx'].sheets['mysheet'].tables['mytablename'].range
    df: pandas.DataFrame = rng.expand().options(pandas.DataFrame).value
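A rough sketch of the iteration over all tables mentioned above, assuming a recent xlwings (where App works as a context manager and sheets expose a .tables collection); the file name is a placeholder:
import pandas
import xlwings

dfs = {}
with xlwings.App(visible=False) as app:
    book = app.books.open('my.xlsx')
    for sheet in book.sheets:
        for table in sheet.tables:
            # Convert each ListObject's range to a DataFrame, keyed by (sheet, table) name
            dfs[(sheet.name, table.name)] = table.range.options(pandas.DataFrame).value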
I understand that this question has been marked solved already, but I found an article that provides a much more robust solution:
Full Post
I suppose a newer version of this library supports better visibility of the workbook structure. Here is a summary:
Load the workbook using the load_workbook function from openpyxl
Then, you are able to access the sheets within, each of which contains the collection of List Objects (tables) in Excel.
Once you have access to the tables, you can get their range addresses.
Finally, loop through the ranges and create a pandas DataFrame from each one.
This is a nicer solution as it gives us the ability to loop through all the sheets and tables in a workbook.
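A rough sketch of those steps, assuming openpyxl >= 3.0 (where each worksheet exposes a .tables mapping of table name to Table) and a placeholder file name:
import pandas as pd
from openpyxl import load_workbook

wb = load_workbook('my.xlsx', data_only=True)  # data_only=True gives cached values instead of formulas
tables = {}
for ws in wb.worksheets:
    for name, table in ws.tables.items():
        # table.ref is the range address of the ListObject, e.g. "B3:E20"
        rows = [[cell.value for cell in row] for row in ws[table.ref]]
        # The first row of the range holds the table headers
        tables[name] = pd.DataFrame(rows[1:], columns=rows[0])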
Here is a way to parse one table, however it needs you to know some information about the sheet being parsed.
df = pd.read_excel("file.xlsx", usecols="B:I", index_col=3)
print(df)
Not elegant, and it works only if one table is present inside the sheet, but it's a first step:
import pandas as pd
import string
letter = list(string.ascii_uppercase)
df1 = pd.read_excel("file.xlsx")
def get_start_column(df):
    # Excel letter of the first column that contains data
    for i, column in enumerate(df.columns):
        if df[column].first_valid_index():
            return letter[i]

def get_last_column(df):
    # Estimate the Excel letter of the last data column from the column count
    columns = df.columns
    len_column = len(columns)
    for i, column in enumerate(columns):
        if df[column].first_valid_index():
            return letter[len_column - i]

def get_first_row(df):
    # First row that is not completely empty (used as the header row)
    for index, row in df.iterrows():
        if not row.isnull().values.all():
            return index + 1

def usecols(df):
    # Build the "B:I"-style column range for read_excel
    start = get_start_column(df)
    end = get_last_column(df)
    return f"{start}:{end}"
df = pd.read_excel("file.xlsx", usecols=usecols(df1), header=get_first_row(df1))
print(df)
I need to extract the domain, for example (http://www.example.com/example-page, http://test.com/test-page), from a list of websites in an Excel sheet, and modify that domain to give its URL (example.com, test.com). I have got the code part figured out, but I still need to get these commands to work on the Excel sheet cells in a column automatically.
here's_the_code
I think you should read in the data as a pandas DataFrame (pd.read_excel), make a function from your code, then apply it to the DataFrame (df.apply). Then it is easy to save to Excel with df.to_excel().
Of course, you will need pandas to be installed.
Something like:
import pandas as pd
dframe = pd.read_excel(io='', sheet_name='')
dframe['domains'] = dframe['urls col name'].apply(your function)
dframe.to_excel('your path')
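A possible concrete version of that apply call, using urllib.parse from the standard library; the file, sheet, and column names below are placeholders, so adjust them to your workbook:
import pandas as pd
from urllib.parse import urlparse

def get_domain(url):
    # Keep only the host part and drop a leading "www." if present
    netloc = urlparse(url).netloc
    return netloc[4:] if netloc.startswith('www.') else netloc

dframe = pd.read_excel('websites.xlsx', sheet_name='Sheet1')  # hypothetical file/sheet names
dframe['domains'] = dframe['urls'].apply(get_domain)          # 'urls' is a hypothetical column name
dframe.to_excel('websites_with_domains.xlsx', index=False)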
Best
Total newbie and this is my first ever question so apologies in advance for any inadvertent faux pas.
I have a large(ish) dataset in Excel xlsx format that I would like to import into a pandas dataframe. The data has column headers except for the first column which does not have a header label. Here is what the excel sheet looks like:
Raw data
I am using read_excel() in Pandas to read in the data. The code I am using is:
df = pd.read_excel('Raw_Data.xlsx', sheetname=0, labels=None, header=0, index_col=None)
(I have tried index_col = false or 0 but, for obvious reasons, it doesn't change anything)
The headers for the columns are picked up fine but the first column, circled in red in the image below, is assigned as the index.
wrong index
What I am trying to get from the read_excel command is as follows with the index circled in red:
correct index
I have other excel sheets that I have used read_excel() to import into pandas and pandas automatically adds in a numerical incremental index rather than inferring one of the columns as an index.
None of those Excel sheets had a missing label in the column header, though, which might be the issue here, but I am not sure.
I understand that I can use the reset_index() command after the import to get the correct index.
Wondering if it can be done without having to do the reset_index() and within the read_excel() command, i.e. is there any way to prevent a column from being inferred as the index, or to force pandas to add the default numerical index like it normally does?
Thank you in advance!
I don't think you can do it with only the read_excel function because of the missing value in cell A1. If you want to insert something into that cell prior to reading the file with pandas, you could consider using openpyxl as below.
from openpyxl import load_workbook as load
path = 'Raw_Data.xlsx'
col_name = 'not_index'
cell = 'A1'
def write_to_cell(path, col_name, cell):
    wb = load(path)
    for sheet in wb.sheetnames:
        ws = wb[sheet]
        # Only fill the header cell if it is currently empty
        if ws[cell].value is None:
            ws[cell] = col_name
    wb.save(path)
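Hypothetical usage of the helper above, reusing the variables defined earlier, followed by a normal pandas read:
import pandas as pd

write_to_cell(path, col_name, cell)  # fill the blank A1 header first
df = pd.read_excel(path, header=0)   # pandas should now use its default numerical index
print(df.head())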
How do I stop the pandas to_excel() function from creating an extra column containing the index? If I run the following:
import pandas as pd
df = pd.read_excel('in.xlsx')
#do some stuff to the dataframe
writer = pd.ExcelWriter('out.xlsx')
df.to_excel(writer)
writer.save()
... the newly created file (out.xlsx) has an additional column that I don't want. I just want the columns identified in df.columns to be output, without the additional index column.
This is a small step in a larger process, so I can't just manually delete the column. Also, I don't want to use any other Excel writing packages such as XlsxWriter.
Many thanks!
You need to set the index parameter to False, like this:
df.to_excel(writer, index=False)
As described in the pandas documentation: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.to_excel.html
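For example, putting it together with the snippet from the question, using ExcelWriter as a context manager so the file is saved automatically (writer.save() is deprecated on newer pandas):
import pandas as pd

df = pd.read_excel('in.xlsx')
with pd.ExcelWriter('out.xlsx') as writer:
    df.to_excel(writer, index=False)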
I am trying to create a database and fill it with values gotten from an excel sheet.
My code:
new_db = pd.DataFrame()
workbook = pd.ExcelFile(filename)
df = workbook.parse('Sheet1')
print(df)
new_db.append(df)
print(new_db.head())
But whenever I seem to do this, I get an empty dataframe back.
My Excel sheet, however, is packed with values. When it is printed (print(df)), it shows the ID values and all the correct columns and rows.
My knowledge with Pandas-Dataframes is limited so excuse me if I do not know something I should. All help is appreciated.
I think pandas.read_excel is what you're looking for. Here is an example:
import pandas as pd
df = pd.read_excel(filename)
print(df.head())
df will have the type pandas.DataFrame
The default parameters of read_excel are set so that the first sheet in the Excel file is read; check the documentation for more options (if you provide a list of sheets to read via the sheet name parameter, df will be a dictionary with sheet names as keys and their corresponding DataFrames as values). Depending on the version of Python you're using and its distribution, you may need to install the xlrd module, which you can do using pip.
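For example, on a recent pandas where the parameter is called sheet_name (sheetname in older releases), and assuming the file and sheet names below are placeholders:
import pandas as pd

# Passing a list of sheets returns a dict of {sheet name: DataFrame}
sheets = pd.read_excel('file.xlsx', sheet_name=['Sheet1', 'Sheet2'])
print(sheets['Sheet1'].head())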
You need to reassign the df after appending to it, as #ayhan pointed out in the comments:
new_db = new_db.append(df)
From the pandas documentation for append: it returns a new DataFrame with the rows appended, which means you need to assign the result to a variable.
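A minimal sketch of the fix applied to the code from the question; note that DataFrame.append was deprecated and later removed in pandas 2.0, so pd.concat is the safer choice on newer versions:
import pandas as pd

new_db = pd.DataFrame()
workbook = pd.ExcelFile(filename)
df = workbook.parse('Sheet1')

# Either reassign the result of append (older pandas) ...
# new_db = new_db.append(df)
# ... or use concat, which works on all recent versions
new_db = pd.concat([new_db, df], ignore_index=True)
print(new_db.head())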