how to read many columns from excel using python - python

I want to read the data in like 162 rows from excel, I tried this code but I couldn't figure out a way to make it in a loop
import xlrd
file_location = "dec_DB.xlsx"
workbook = xlrd.open_workbook(file_location)
sheet = workbook.sheet_by_name('Sheet1')
x = []
for rownum in range(sheet.nrows):
x.append(sheet.cell(rownum, 3))

You can read the columns you want from an excel sheet easily in one line with pandas:
import pandas as pd
x = pd.read_excel(io = "/Users/atheeralzaydi/Desktop/KACST/DB/dec_DB.xlsx", sheet_name = "Sheet1", usecols = [1,2,3,12,34,100])
x will be a pandas DataFrame.
EDIT:
To read the first 162 rows (+1 including the header -- the row that includes the column names), instead of the parameter usecols, use the parameter nrows:
import pandas as pd
x = pd.read_excel(io = "/Users/atheeralzaydi/Desktop/KACST/DB/dec_DB.xlsx", sheet_name = "Sheet1", nrows = 162)
To import specific rows and not the first m rows, use the parameter skiprows.
To import the entire sheet, simply omit these parameters and all rows will be imported.
More info on read_excel here.

Related

How to obtain the mean of selected columns from multiple sheets within same Excel File

I am working with a large excel file having 22 sheets, where each sheet has the same coulmn headings but do not have equal number of rows. I would like to obtain the mean values (excluding zeros) for columns AA to AX for all the 22 sheets. The columns have titles which I use in my code.
Rather than reading each sheet, I want to loop through the sheets and get as output the mean values.
With help from answers to other posts, I have this:
import pandas as pd
xls = pd.ExcelFile('myexcelfile.xlsx')
xls.sheet_names
#print(xls.sheet_names)
out_df = pd.DataFrame()
for sheets in xls.sheet_names:
df = pd.read_excel('myexcelfile.xlsx', sheet_names= None)
df1= df[df[:]!=0]
df2=df1.loc[:,'aa':'ax'].mean()
out_df.append(df2) ## This will append rows of one dataframe to another(just like your expected output)
print(out_df2)
## out_df will have data from all the sheets
The code works so far, but only one of the sheets. How do I get it to work for all 22 sheets?
You can use numpy to perform basic math on pandas.DataFrame or pandas.Series
take a look at my code below
import pandas as pd, numpy as np
XL_PATH = r'C:\Users\YourName\PythonProject\Book1.xlsx'
xlFile = pd.ExcelFile(XL_PATH)
xlSheetNames = xlFile.sheet_names
dfList = [] # variable to store all DataFrame
for shName in xlSheetNames:
df = pd.read_excel(XL_PATH, sheet_name=shName) # read sheet X as DataFrame
dfList.append(df) # put DataFrame into a list
for df in dfList:
print(df)
dfAverage = np.average(df) # use numpy to get DataFrame average
print(dfAverage)
#Try code below
import pandas as pd, numpy as np, os
XL_PATH = "YOUR EXCEL FULL PATH"
SH_NAMES = "WILL CONTAINS LIST OF EXCEL SHEET NAME"
DF_DICT = {} """WILL CONTAINS DICTIONARY OF DATAFRAME"""
def readExcel():
if not os.path.isfile(XL_PATH): return FileNotFoundError
SH_NAMES = pd.ExcelFile(XL_PATH).sheet_names
# pandas.read_excel() have argument 'sheet_name'
# when you put a list to 'sheet_name' argument
# pandas will return dictionary of dataframe with sheet_name as keys
DF_DICT = pd.read_excel(XL_PATH, sheet_name=SH_NAMES)
return SH_NAMES, DF_DICT
#Now you have DF_DICT that contains all DataFrame for each sheet in excel
#Next step is to append all rows data from Sheet1 to SheetX
#This will only works if you have same column for all DataFrame
def appendAllSheets():
dfAp = pd.DataFrame()
for dict in DF_DICT:
df = DF_DICT[dict]
dfAp = pd.DataFrame.append(self=dfAp, other=df)
return dfAp
#you can now call the function as below:
dfWithAllData = appendAllSheets()
#now you have one DataFrame with all rows combine from Sheet1 to SheetX
#you can fixed the data, for example to drop all rows which contain '0'
dfZero_Removed = dfWithAllData[[dfWithAllData['Column_Name'] != 0]]
dfNA_removed = dfWithAllData[not[pd.isna(dfWithAllData['Column_Name'])]]
#last step, to find average or other math operation
#just let numpy do the job
average_of_all_1 = np.average(dfZero_Removed)
average_of_all_2 = np.average(dfNA_Removed)
#show result
#this will show the average of all
#rows of data from Sheet1 to SheetX from your target Excel File
print(average_of_all_1, average_of_all_2)

Having Trouble Writing Table to Excel with Python

Hi I am trying to create a table in excel using a dataframe from another excel spreadsheet and writing the table to a new one. I believe my code is correct but the table isn't writing to the new excel spreadsheet. Can someone take a look at my code and tell me what's wrong?
import xlsxwriter
import pandas as pd
import openpyxl as pxl
import xlsxwriter
import numpy as np
from openpyxl import load_workbook
path = '/Users/benlong/Downloads/unemployment.xlsx'
df = pd.read_excel(path)
rows = df.shape[0]
columns = df.shape[1]
wb = xlsxwriter.Workbook('UE2.xlsx')
ws = wb.add_worksheet('Sheet1')
ws.add_table(0,0,rows,columns, {'df': df})
wb.close()
You should convert your dataframe to list . By using df.values.tolist() and use the key data.
In your case , you also should set the header of df and avoid getting a nan value error.
eg:
import xlsxwriter as xlw
# while got NaN/Inf values from ur dataframe , u'll get a value of '#NUM!' instead in saved excel
wb = xlw.Workbook('UE2.xlsx',{'nan_inf_to_errors': True})
ws = wb.add_worksheet('Sheet1')
cell_range = xlw.utility.xl_range(0, 0, rows, columns-1)
header = [{'header': str(di)} for di in df.columns.tolist()]
ws.add_table(cell_range, {'header_row': True,'first_column': False,'columns':header,'data':df.values.tolist()})
wb.close()
Possible duplicate: How to use xlsxwriter .add_table() method with a dataframe?
You can try converting the dataframe to a list of lists and use the data keyword.
ws.add_table(0,0,rows,columns, {'data': df.values.T.tolist()})

Creating a list from an excel file that's a slice of a column. How can I print it without a 'text:' prefix in any value?

Trying to fetch a list of values from an excel file consisting of a column but has to end at a certain row. Can fetch it using a slice of the col and rows, however a prefix of 'text:' appears. This makes the list incompatible for what I need to use it for.
import xlrd
import csv
loc = ("/Users/uni/Desktop/TESTEXCEL.xls")
wb =xlrd.open_workbook(loc)
sheet = wb.sheet_by_index(0)
sheet.cell_value(0,0)
CANDIDATE = sheet.col_slice(colx=5,
start_rowx=1,
end_rowx=29)
print (CANDIDATE)
RESULT:
[text:u'lt102', text:u'lt103', text:u'lt104', text:u'lt105', text:u'lt108', text:u'lt124', text:u'lt149', text:u'lt151', text:u'lt152', text:u'lt153', text:u'lt195', text:u'lt223', text:u'lt229', text:u'lt254', text:u'lt255', text:u'lt268', text:u'lt269', text:u'lt270', text:u'lt277', text:u'lt278', text:u'lt280', text:u'lt284', text:u'lt285', text:u'lt287', text:u'lt299', text:u'lt95', text:u'lt96', text:u'lt97']
You can use pandas library, it has convenient read_excel method. Here is example:
import pandas as pd
column_number = 5
df = pd.read_excel('/Users/uni/Desktop/TESTEXCEL.xls', usecols=[column_number], nrows=29, header=None)
values = df[column_number].to_list() # a list of values in your 5th column
You can read more about read_excel here: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_excel.html

How to read specific columns in an xlsb in Python

I'm trying to read spreadsheets in an xlsb file in python and I've used to code below to do so. I found the code in stack overflow and I'm sure that it reads every single column in a row of a spreadsheet and appends it to a dataframe. How can I modify this code so that it only reads/appends specific columns of the spreadsheet i.e. I only want to append data in columns B through D into my dataframe.
Any help would be appreciated.
import pandas as pd
from pyxlsb import open_workbook as open_xlsb
df = []
with open_xlsb('some.xlsb') as wb:
with wb.get_sheet(1) as sheet:
for row in sheet.rows():
df.append([item.v for item in row])
df = pd.DataFrame(df[1:], columns=df[0])
pyxlsb itself cannot do it, but it is doable with the help of xlwings.
import pandas as pd
import xlwings as xw
from pyxlsb import open_workbook as open_xlsb
with open_xlsb(r"W:\path\filename.xlsb") as wb:
Data=xw.Range('B:D').value
#Creates a dataframe using the first list of elements as columns
Data_df = pd.DataFrame(Data[1:], columns=Data[0])
Just do:
import pandas as pd
from pyxlsb import open_workbook as open_xlsb
df = []
with open_xlsb('some.xlsb') as wb:
with wb.get_sheet(1) as sheet:
for row in sheet.rows():
df.append([item.v for item in row if item.c > 0 and item.c < 4])
df = pd.DataFrame(df[1:], columns=df[0])
item.c refers to the column number starting at 0

How to sort Excel sheet using Python

I am using Python 3.4 and xlrd. I want to sort the Excel sheet based on the primary column before processing it. Is there any library to perform this ?
There are a couple ways to do this. The first option is to utilize xlrd, as you have this tagged. The biggest downside to this is that it doesn't natively write to XLSX format.
These examples use an excel document with this format:
Utilizing xlrd and a few modifications from this answer:
import xlwt
from xlrd import open_workbook
target_column = 0 # This example only has 1 column, and it is 0 indexed
book = open_workbook('test.xlsx')
sheet = book.sheets()[0]
data = [sheet.row_values(i) for i in xrange(sheet.nrows)]
labels = data[0] # Don't sort our headers
data = data[1:] # Data begins on the second row
data.sort(key=lambda x: x[target_column])
bk = xlwt.Workbook()
sheet = bk.add_sheet(sheet.name)
for idx, label in enumerate(labels):
sheet.write(0, idx, label)
for idx_r, row in enumerate(data):
for idx_c, value in enumerate(row):
sheet.write(idx_r+1, idx_c, value)
bk.save('result.xls') # Notice this is xls, not xlsx like the original file is
This outputs the following workbook:
Another option (and one that can utilize XLSX output) is to utilize pandas. The code is also shorter:
import pandas as pd
xl = pd.ExcelFile("test.xlsx")
df = xl.parse("Sheet1")
df = df.sort(columns="Header Row")
writer = pd.ExcelWriter('output.xlsx')
df.to_excel(writer,sheet_name='Sheet1',columns=["Header Row"],index=False)
writer.save()
This outputs:
In the to_excel call, the index is set to False, so that the Pandas dataframe index isn't included in the excel document. The rest of the keywords should be self explanatory.
I just wanted to refresh the answer as the Pandas implementation has changed a bit over time. Here's the code that should work now (pandas 1.1.2).
import pandas as pd
xl = pd.ExcelFile("test.xlsx")
df = xl.parse("Sheet1")
df = df.sort_values(by="Header Row")
...
The sort function is now called sort_by and columns is replaced by by.

Categories

Resources