Loading only one sheet to dataframe - python

I am trying to read an Excel sheet into a DataFrame using the pandas read_excel method. The Excel file contains 6-7 different sheets, and 2-3 of them are very large. I only want to read one sheet out of the file.
If I copy that sheet out into its own file and read it, the load time drops by 90%.
I have read that xlrd, which pandas uses, always loads the whole workbook into memory. I cannot change the format of the input.
Can you please suggest a way to improve the performance?

It's quite simple. Just do this.
import pandas as pd
xls = pd.ExcelFile('C:/users/path_to_your_excel_file/Analysis.xlsx')
df1 = pd.read_excel(xls, 'Sheet1')
print(df1)
# etc.
df2 = pd.read_excel(xls, 'Sheet2')
print(df2)

import pandas as pd
df = pd.read_excel('YourFile.xlsx', sheet_name = 'YourSheet_Name')
Just pass the name of the sheet you want to read and the path to your Excel file.

Use openpyxl in read-only mode. See http://openpyxl.readthedocs.io/en/default/pandas.html
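A minimal sketch of the read-only suggestion, assuming the file is .xlsx and the first row of the sheet holds the column names. The helper name read_sheet_readonly is made up for illustration; load_workbook(read_only=True) and iter_rows(values_only=True) are openpyxl calls. Whether this is faster than read_excel on a given file is something to measure rather than assume.

```python
import pandas as pd
from openpyxl import load_workbook


def read_sheet_readonly(path, sheet_name):
    """Load one worksheet into a DataFrame using openpyxl's read-only mode."""
    # read_only=True streams rows lazily instead of building every sheet in memory
    wb = load_workbook(path, read_only=True, data_only=True)
    try:
        ws = wb[sheet_name]
        rows = ws.iter_rows(values_only=True)
        header = next(rows)  # assumption: first row contains the column names
        return pd.DataFrame(list(rows), columns=list(header))
    finally:
        wb.close()  # read-only workbooks keep the file open until closed
```

Usage would look like df = read_sheet_readonly('YourFile.xlsx', 'YourSheet_Name'), with the same placeholder names as the answer above.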

Related

Pandas read_excel - How to try two different sheet names to import

I am attempting to import a large group of Excel files, and the code that selects what to import is included below.
df = pd.read_excel(file, sheet_name=['Sheet1', 'Sheet2'])
I know that each file uses either Sheet1 or Sheet2, but never both, which makes my code raise an error. Is there any way to tell pandas to try importing Sheet1 and, if that fails, try Sheet2?
Thanks for any help.
try:
    # pass the name as a string to get a DataFrame back (a list returns a dict)
    df = pd.read_excel(file, sheet_name='Sheet1')
except Exception:  # the exception raised for a missing sheet depends on the pandas version and engine
    df = pd.read_excel(file, sheet_name='Sheet2')
Assuming your Excel files aren't too large to import everything, you could do this:
df = pd.read_excel(file, sheet_name=None)
That would return all the sheets in the file as a dict, where the key is sheet name and the value is the dataframe. You can then test for the key you want and use that sheet, and drop the rest.
(Edit: I'll note that this may be a heavy-handed approach, but I tried to generalize the answer to how to select one or more sheets when you aren't sure of their names)
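The "test for the key you want" step could be sketched as follows. The helper name pick_sheet and the sample dict are illustrative assumptions; in real use the dict would come from pd.read_excel(file, sheet_name=None).

```python
import pandas as pd


def pick_sheet(sheets, candidates=("Sheet1", "Sheet2")):
    """Return the first DataFrame in `sheets` whose name is in `candidates`."""
    for name in candidates:
        if name in sheets:
            return sheets[name]
    raise KeyError(f"none of {list(candidates)} found; available: {list(sheets)}")


# Stand-in for: sheets = pd.read_excel(file, sheet_name=None)
sheets = {"Sheet2": pd.DataFrame({"a": [1, 2]}), "Notes": pd.DataFrame()}
df = pick_sheet(sheets)
print(df)
```

Raising a KeyError that lists the available names makes it easier to spot files that follow neither naming convention.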

How to save into one Excelsheet many (n) dataframes using python without deleting the previous dataframe

After parsing .eml files and extracting their contents into many dataframes, I want to save them all to one Excel file.
After saving all my dataframes, the Excel file contains only the last one instead of all of them.
Does anyone have an idea how I can solve that?
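You should use the sheet_name parameter, and write through a single ExcelWriter so that each call adds a sheet instead of overwriting the file:
import pandas as pd
df_1 = pd.DataFrame()
df_2 = pd.DataFrame()
filename = 'output.xlsx'  # hypothetical output path
# calling to_excel(filename, ...) twice would replace the file on the second call
with pd.ExcelWriter(filename) as writer:
    df_1.to_excel(writer, sheet_name='df1')
    df_2.to_excel(writer, sheet_name='df2')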
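For an arbitrary number of dataframes, one possible sketch is to keep a single pd.ExcelWriter open and loop. The function names save_to_sheets and save_stacked, the default sheet name "all", and the gap parameter are assumptions made for illustration. The first helper gives each frame its own sheet; the second stacks the frames on one sheet by advancing startrow.

```python
import pandas as pd


def save_to_sheets(frames, path):
    """Write each frame in `frames` (dict: sheet name -> DataFrame) to its own sheet."""
    with pd.ExcelWriter(path) as writer:
        for sheet_name, frame in frames.items():
            frame.to_excel(writer, sheet_name=sheet_name, index=False)


def save_stacked(frames, path, sheet_name="all", gap=1):
    """Write the frames one below another on a single sheet."""
    with pd.ExcelWriter(path) as writer:
        start = 0
        for frame in frames:
            frame.to_excel(writer, sheet_name=sheet_name, startrow=start, index=False)
            # move past the header row, the data rows, and `gap` blank rows
            start += len(frame) + 1 + gap
```

Either way, the file is written once when the with block exits, so earlier frames are not lost.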

Is there a way to export individual sheets in an Excel workbook to separate csv files using pandas?

I have 5 sheets in an excel workbook. I would like to export each sheet to csv using python libraries.
This is a sheet showing sales in 2019. I have named the sheets according to the year they represent as shown here.
I have read the Excel spreadsheet using pandas. I have used a for loop since I am interested in saving each csv file as the_sheet_name.csv. This is my code in a Jupyter notebook:
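import pandas as pd
df = pd.DataFrame()
myfile = 'sampledata.xlsx'
xl = pd.ExcelFile(myfile)
for sheet in xl.sheet_names:
    df_tmp = xl.parse(sheet)
    print(df_tmp)
    # DataFrame.append was removed in pandas 2.0; pd.concat accumulates the same way
    df = pd.concat([df, df_tmp], ignore_index=True, sort=False)
# written once, after the loop, so only one csv is produced
csvfile = f'{sheet}.csv'
df.to_csv(csvfile, index=False)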
Executing the code is producing just one csv file that has the data for all the other sheets. I would like to know if there is a way to customize my code so that I can produce individual sheets e.g sales2011.csv, sales2012.csv and so on.
Using sheet_name=None returns a dictionary of dataframes keyed by sheet name:
dfs = pd.read_excel('file.xlsx', sheet_name=None)
for sheet_name, data in dfs.items():
    data.to_csv(f"{sheet_name}.csv", index=False)

Copying several columns from a csv file to an existing xls file using Python

I'm pretty new to Python and am having some difficulty getting started on this. I am using Python 3.
I've googled and found quite a few Python modules that help with this, but was hoping for a more definitive answer here. Basically, I need to read certain columns from a csv file, i.e. G, H, I, K, and M. The ones I need aren't consecutive.
I need to read those columns from the csv file and transfer them to empty columns in an existing xls with data already in it.
I looked into openpyxl, but it doesn't seem to work with csv/xls files, only xlsx.
Can I use the xlwt module to do this?
Any guidance on which module may work best for my use case would be greatly appreciated. Meanwhile, I'm going to tinker around with xlwt/xlrd.
I recommend using pandas. It has convenient functions to read and write csv and xls files.
import pandas as pd
# read the csv file
df_1 = pd.read_csv('c:/test/test.csv')
# let's say df_1 has columns colA and colB
print(df_1)
# read the xls(x) file
df_2 = pd.read_excel('c:/test/test.xlsx')
# let's say df_2 has columns aa and bb
# now add a column from df_1 to df_2
df_2['colA'] = df_1['colA']
# save the combined output (the context manager saves and closes the file;
# writer.save() was removed in pandas 2.0)
with pd.ExcelWriter('c:/test/combined.xlsx') as writer:
    df_2.to_excel(writer)
# alternatively, if you want to add just one column to an existing xlsx file:
# i.e. get colA from df_1 into a new dataframe
df_3 = pd.DataFrame(df_1['colA'])
column_to_write = 16  # this would go to column Q (zero-based index)
sheet_name = 'Sheet1'  # which sheet to write on
# open the existing workbook in append mode with the openpyxl engine;
# if_sheet_exists='overlay' (pandas 1.4+) writes into the existing sheet
with pd.ExcelWriter('c:/test/combined.xlsx', engine='openpyxl',
                    mode='a', if_sheet_exists='overlay') as writer:
    # write the single column df_3 to the file, without the row index
    df_3.to_excel(writer, sheet_name=sheet_name, columns=['colA'],
                  startcol=column_to_write, index=False)
You could try XlsxWriter, which is a fully featured Python module for writing the Excel 2007+ XLSX file format.
https://pypi.python.org/pypi/XlsxWriter
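For the part of the question about picking the non-consecutive columns G, H, I, K, and M out of the csv, a possible sketch uses the usecols parameter of pd.read_csv with zero-based positions. The helper excel_col_to_index and the in-memory sample csv are assumptions for illustration; a real file path would replace the StringIO object.

```python
import io
import pandas as pd


def excel_col_to_index(letters):
    """Convert an Excel-style column letter ('A', 'G', 'AA') to a zero-based index."""
    index = 0
    for ch in letters.upper():
        index = index * 26 + (ord(ch) - ord("A") + 1)
    return index - 1


wanted = [excel_col_to_index(c) for c in ["G", "H", "I", "K", "M"]]

# Stand-in for the real file: a header row and one data row with 13 columns (A..M)
header = ",".join(f"col{i}" for i in range(13))
values = ",".join(str(i) for i in range(13))
csv_text = header + "\n" + values + "\n"

# usecols accepts integer positions, so only the wanted columns are parsed
df = pd.read_csv(io.StringIO(csv_text), usecols=wanted)
print(df)
```

The resulting frame can then be written next to the existing data with to_excel and startcol, as in the answer above.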

How to open an excel file with multiple sheets in pandas?

I have an excel file composed of several sheets. I need to load them as separate dataframes individually. What would be a similar function as pd.read_csv("") for this kind of task?
P.S. due to the size I cannot copy and paste individual sheets in excel
Use pandas read_excel() method that accepts a sheet_name parameter:
import pandas as pd
df = pd.read_excel(excel_file_path, sheet_name="sheet_name")
Multiple data frames can be loaded by passing in a list. For a more in-depth explanation of how read_excel() works see: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_excel.html
If you can't type out each sheet name and want to read the whole workbook, try this:
full_path = 'C://full_path.xlsx'
xls = pd.ExcelFile(full_path)
print(xls.sheet_names)
# with no sheet_name given, read_excel() returns only the first sheet
df = pd.read_excel(xls)
# append each remaining sheet
for name in xls.sheet_names[1:]:
    dfnew = pd.read_excel(xls, sheet_name=name)
    df = pd.concat([df, dfnew])
By default, pd.read_excel() reads only the first sheet and leaves the rest unread, so the remaining sheets have to be requested explicitly.
import pandas
# setting sheet_name=None reads all sheets into a dict
sheets = pandas.read_excel(filepath, sheet_name=None)
# i will be the keys in a dictionary object
# the values are the dataframes of each sheet
for i in sheets:
    print(f"sheet[{i}]")
    print(f"sheet[{i}].columns={sheets[i].columns}")
    for index, row in sheets[i].iterrows():
        print(f"index={index} row={row}")
from pandas import ExcelFile
exFile = ExcelFile(f)  # load file f
data = exFile.parse()  # creates a dataframe out of the first sheet in the file
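If the sheets share the same columns and the goal is one combined frame, the dict returned by sheet_name=None can be passed straight to pd.concat, which keeps the sheet name as an index level. This is a sketch with made-up sample data standing in for the result of pd.read_excel(filepath, sheet_name=None); the level names "sheet" and "row" are arbitrary.

```python
import pandas as pd

# Stand-in for: sheets = pd.read_excel(filepath, sheet_name=None)
sheets = {
    "sales2011": pd.DataFrame({"item": ["a", "b"], "qty": [1, 2]}),
    "sales2012": pd.DataFrame({"item": ["c"], "qty": [3]}),
}

# The dict keys become the outer index level; move them into a regular column
combined = pd.concat(sheets, names=["sheet", "row"]).reset_index(level="sheet")
print(combined)
```

Keeping the sheet name as a column makes it possible to tell afterwards which sheet each row came from.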
