How to read specific columns in an xlsb in Python - python

I'm trying to read spreadsheets in an xlsb file in Python, and I've used the code below to do so. I found the code on Stack Overflow, and I'm sure that it reads every single column in each row of a spreadsheet and appends it to a dataframe. How can I modify this code so that it only reads/appends specific columns of the spreadsheet? For example, I only want to append the data in columns B through D to my dataframe.
Any help would be appreciated.
import pandas as pd
from pyxlsb import open_workbook as open_xlsb
df = []
with open_xlsb('some.xlsb') as wb:
    with wb.get_sheet(1) as sheet:
        for row in sheet.rows():
            df.append([item.v for item in row])
df = pd.DataFrame(df[1:], columns=df[0])

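pyxlsb has no built-in option for selecting columns, but it is doable with the help of xlwings (which needs a local Excel installation).
import pandas as pd
import xlwings as xw
# Open the workbook through Excel; the pyxlsb import is not needed here
wb = xw.Book(r"W:\path\filename.xlsb")
sheet = wb.sheets[0]
# Read columns B through D, limited to the rows actually used on the sheet
last_row = sheet.used_range.last_cell.row
Data = sheet.range(f"B1:D{last_row}").value
# Creates a dataframe using the first list of elements as columns
Data_df = pd.DataFrame(Data[1:], columns=Data[0])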

Just do:
import pandas as pd
from pyxlsb import open_workbook as open_xlsb
df = []
with open_xlsb('some.xlsb') as wb:
    with wb.get_sheet(1) as sheet:
        for row in sheet.rows():
            # Keep only columns B, C and D (indices 1, 2 and 3)
            df.append([item.v for item in row if 1 <= item.c <= 3])
df = pd.DataFrame(df[1:], columns=df[0])
item.c is the column number of the cell, starting at 0, so columns B through D are 1 through 3.
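If you would rather write the range as Excel letters than as numbers, the same filter can be wrapped in a small helper. This is only a sketch: Cell is a stand-in for the cell objects that sheet.rows() yields (only .c and .v are used), and col_index and select_columns are hypothetical names, not part of pyxlsb.

```python
from collections import namedtuple

# Stand-in for the cells yielded by sheet.rows(); only .c and .v are used here.
Cell = namedtuple("Cell", ["r", "c", "v"])


def col_index(letters):
    """Convert an Excel column label such as 'B' or 'AA' to a 0-based index."""
    index = 0
    for ch in letters.upper():
        index = index * 26 + (ord(ch) - ord("A") + 1)
    return index - 1


def select_columns(row, first, last):
    """Return the values of the cells whose column lies in first..last (inclusive)."""
    lo, hi = col_index(first), col_index(last)
    return [cell.v for cell in row if lo <= cell.c <= hi]


# One fake row with five cells in columns A..E
row = [Cell(0, c, v) for c, v in enumerate(["a", "b", "c", "d", "e"])]
picked = select_columns(row, "B", "D")
print(picked)
```

Inside the loop over sheet.rows() you would then append select_columns(row, "B", "D") instead of the list comprehension.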

Related

Python Select Specific Row and Column from CSV file

I want to print specific rows and columns from a CSV file.
The CSV file looks like this:
R,IMSI,DATE FIRST EVENT,TIME FIRST EVENT,DATE LAST EVENT,TIME LAST EVENT,DC(HHMMSS),NC,VOLUME,SDR
R
C,634012007277489,20221122,150025,20221122,150025,711,1,0,294
C,634012031576061,20221122,150859,20221122,151738,905,3,0,1597
C,634012045006518,20221122,144022,20221122,144022,902,1,0,368
R
R
R,END OF REPORT
T,18
The output should look like this:
C,634012007277489,20221122,150025,20221122,150025,711,1,0,294
C,634012031576061,20221122,150859,20221122,151738,905,3,0,1597
C,634012045006518,20221122,144022,20221122,144022,902,1,0,368
Use pandas (install it first by running pip install pandas in the terminal).
import pandas as pd
df = pd.read_csv('fullpath.csv')
# column_name and row_number are placeholders for the column label and the 0-based row position
x = df[column_name].iloc[row_number]
Try reading it with pandas.read_csv():
import pandas as pd
# skipfooter is only supported by the Python parser engine
df = pd.read_csv('filename.csv', skipfooter=1, header=1, engine='python')
df.iloc[row_number, column_number]
You can use .iat too.
import pandas as pd
df = pd.read_csv("example.csv", delimiter =",")
for row in range(len(df)):
    for column in range(len(df.columns)):
        print(df.iat[row, column])
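For the file shown in the question, the wanted lines are the ones whose first field is C, so another option is to filter on that field with the standard csv module. A minimal sketch, assuming the content is exactly as posted (inlined here as a string so the example runs on its own; data_rows is a hypothetical helper name):

```python
import csv
import io

SAMPLE = """R,IMSI,DATE FIRST EVENT,TIME FIRST EVENT,DATE LAST EVENT,TIME LAST EVENT,DC(HHMMSS),NC,VOLUME,SDR
R
C,634012007277489,20221122,150025,20221122,150025,711,1,0,294
C,634012031576061,20221122,150859,20221122,151738,905,3,0,1597
C,634012045006518,20221122,144022,20221122,144022,902,1,0,368
R
R
R,END OF REPORT
T,18
"""


def data_rows(lines, marker="C"):
    """Return only the records whose first field equals the marker."""
    return [row for row in csv.reader(lines) if row and row[0] == marker]


records = data_rows(io.StringIO(SAMPLE))
for row in records:
    print(",".join(row))

# A single value: record 1, field 2 (both counted from 0)
value = records[1][2]
print(value)
```

With a real file, pass the open file object instead of the StringIO, for example data_rows(open('filename.csv', newline='')).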

Having Trouble Writing Table to Excel with Python

Hi, I am trying to create a table in Excel from a dataframe that I read from another Excel spreadsheet, and to write that table to a new spreadsheet. I believe my code is correct, but the table isn't written to the new spreadsheet. Can someone take a look at my code and tell me what's wrong?
import xlsxwriter
import pandas as pd
import openpyxl as pxl
import xlsxwriter
import numpy as np
from openpyxl import load_workbook
path = '/Users/benlong/Downloads/unemployment.xlsx'
df = pd.read_excel(path)
rows = df.shape[0]
columns = df.shape[1]
wb = xlsxwriter.Workbook('UE2.xlsx')
ws = wb.add_worksheet('Sheet1')
ws.add_table(0,0,rows,columns, {'df': df})
wb.close()
You should convert your dataframe to a list of rows with df.values.tolist() and pass it under the data key.
In your case you should also set the table header from the dataframe columns and guard against NaN values, which otherwise raise an error.
E.g.:
import xlsxwriter as xlw
# rows, columns and df are the variables from the question's code
# With nan_inf_to_errors, NaN/Inf values from the dataframe are written as Excel errors such as '#NUM!' instead of raising
wb = xlw.Workbook('UE2.xlsx', {'nan_inf_to_errors': True})
ws = wb.add_worksheet('Sheet1')
# Header in row 0, data in rows 1..rows, columns 0..columns-1
cell_range = xlw.utility.xl_range(0, 0, rows, columns - 1)
header = [{'header': str(di)} for di in df.columns.tolist()]
ws.add_table(cell_range, {'header_row': True, 'first_column': False, 'columns': header, 'data': df.values.tolist()})
wb.close()
Possible duplicate: How to use xlsxwriter .add_table() method with a dataframe?
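You can try converting the dataframe to a list of lists (one inner list per row, so no transpose) and use the data keyword. The last column index is columns - 1 because indices start at 0.
ws.add_table(0, 0, rows, columns - 1, {'data': df.values.tolist()})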
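The off-by-one details of the table range are easy to get wrong, so here is a small sketch that builds the add_table arguments from column names and row data. table_spec is a hypothetical helper, not part of xlsxwriter, and the two sample rows are made up for illustration.

```python
def table_spec(column_names, data_rows):
    """Build (first_row, first_col, last_row, last_col, options) for add_table."""
    options = {
        "header_row": True,
        "columns": [{"header": str(name)} for name in column_names],
        "data": [list(row) for row in data_rows],
    }
    last_row = len(data_rows)         # header sits in row 0, data fills rows 1..n
    last_col = len(column_names) - 1  # column indices start at 0
    return (0, 0, last_row, last_col, options)


# Made-up sample: two columns, two data rows
spec = table_spec(["State", "Rate"], [["Ohio", 4.1], ["Iowa", 2.9]])
print(spec[:4])
```

With a dataframe this would be called as table_spec(df.columns, df.values.tolist()), and the result passed on with ws.add_table(*spec).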

Using Python 3 to import multiple excel workbooks and sheets into single data frame

I am still learning python. I am trying to import multiple workbooks and all the worksheets into one data frame.
Here is what I have so far:
import glob
import pandas as pd
import numpy as np
import os #checking the working directory
print(os.getcwd())
all_data = pd.DataFrame() #creating an empty data frame
for file in glob.glob("*.xls"): #import every file that ends in .xls
    df = pd.read_excel(file)
    all_data = all_data.append(df, ignore_index = True)
all_data.shape #12796 rows with 19 columns # we will have to find a way to check if this is accurate
I am having real trouble finding any documentation that will confirm/explain whether or not this code imports all the data sheets in every workbook. Some of these files have 15-20 sheets
Here is a link to where I found the glob explanation: http://pbpython.com/excel-file-combine.html
Any and all advice is greatly appreciated. I am still really new to R and Python so if you could explain this in as much detail as possible I would greatly appreciate it!
What you are missing is importing all the sheets in the workbook.
import glob
import os
import pandas as pd
print(os.getcwd()) # checking the working directory
frames = [] # one data frame per sheet
rows = 0
for file in glob.glob("*.xls"): # import every file that ends in .xls
    # pd.read_excel(file) on its own would import only the first sheet
    xls = pd.ExcelFile(file)
    sheets = xls.sheet_names # To get names of all the sheets
    for sheet_name in sheets:
        df = pd.read_excel(xls, sheet_name=sheet_name) # the keyword is sheet_name, not sheetname
        rows += df.shape[0]
        frames.append(df)
# DataFrame.append was removed in pandas 2.0, so concatenate once at the end
all_data = pd.concat(frames, ignore_index=True)
print(all_data.shape[0]) # Now you will get all the rows which should be equal to rows
print(rows)
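To make the row-count check easier, it can help to tag every row with the workbook and sheet it came from. The sketch below assumes the sheets of one workbook are available as a {sheet_name: DataFrame} mapping, which is what pd.read_excel(path, sheet_name=None) returns; combine_sheets and the source_file/source_sheet column names are hypothetical, and two hand-made frames stand in for real sheets.

```python
import pandas as pd


def combine_sheets(sheets, source):
    """Stack a {sheet_name: DataFrame} mapping into one frame, tagging each row's origin."""
    tagged = [
        frame.assign(source_file=source, source_sheet=name)
        for name, frame in sheets.items()
    ]
    return pd.concat(tagged, ignore_index=True)


# Hand-made stand-ins for two sheets of one workbook
sheets = {
    "Sheet1": pd.DataFrame({"id": [1, 2], "value": [10.0, 20.0]}),
    "Sheet2": pd.DataFrame({"id": [3], "value": [30.0]}),
}
combined = combine_sheets(sheets, "example.xls")
print(combined.shape)
print(combined.groupby("source_sheet").size())
```

Grouping by source_file and source_sheet then shows how many rows each sheet contributed, which can be compared against the sheets in Excel.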

While Loop for forming lists when xlsx cell is not None

The goal here is to take the xlsx that is loaded and replace certain columns with the lists created from the csv. I could use some direction as to the easiest way of doing this.
For example, I would like to take col_test from the csv and place the values in column 1 of the xlsx sheet starting in row 5. Can I do this in the same xlsx, or will the result need to be saved to a new file like I am currently doing?
What I need help with as it stands is figuring out the syntax for the write function. I would like it to write from the first column of row 6 until the end of the list used. How would I do this?
import os
import glob
import pandas as pd
for csvfile in glob.glob(os.path.join('.', '*.csv')):
    df = pd.read_csv(csvfile)
    col_test = df['Test #'].tolist()
    col_retest = df['Retest #'].tolist()
from xlrd import open_workbook
from xlutils.copy import copy
rb = open_workbook("Excel FDT Master_01_update.xlsx")
wb = copy(rb)
s = wb.get_sheet(3)
s.write(6,0:17,0, col_test)
wb.save('didthiswork.xls')
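On the write syntax: the write method of an xlwt worksheet takes one row index, one column index and one value per call, so a list has to be written cell by cell in a loop. A minimal sketch of that loop follows; RecordingSheet is a stand-in that only records the calls so the example runs without a workbook, write_column is a hypothetical helper, and the three values are made up in place of df['Test #'].tolist().

```python
class RecordingSheet:
    """Stand-in for a writable worksheet; stores what write() receives."""

    def __init__(self):
        self.cells = {}

    def write(self, row, col, value):
        self.cells[(row, col)] = value


def write_column(sheet, values, start_row, col):
    """Write values down one column, one cell per list item."""
    for offset, value in enumerate(values):
        sheet.write(start_row + offset, col, value)


col_test = [101, 102, 103]  # made-up values standing in for the csv column
s = RecordingSheet()
# Indices are 0-based: start_row=5 is spreadsheet row 6, col=0 is column A
write_column(s, col_test, start_row=5, col=0)
print(s.cells)
```

With the real sheet object from wb.get_sheet(3), the call would be write_column(s, col_test, 5, 0), and the loop stops by itself at the end of the list.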

How to sort Excel sheet using Python

I am using Python 3.4 and xlrd. I want to sort the Excel sheet based on the primary column before processing it. Is there a library that can do this?
There are a couple ways to do this. The first option is to utilize xlrd, as you have this tagged. The biggest downside to this is that it doesn't natively write to XLSX format.
These examples use an excel document with this format:
Utilizing xlrd and a few modifications from this answer:
import xlwt
from xlrd import open_workbook
target_column = 0 # This example only has 1 column, and it is 0 indexed
# Note: xlrd 2.0 and later read only .xls files; opening .xlsx needs an older xlrd
book = open_workbook('test.xlsx')
sheet = book.sheets()[0]
data = [sheet.row_values(i) for i in range(sheet.nrows)] # range, since xrange does not exist in Python 3
labels = data[0] # Don't sort our headers
data = data[1:] # Data begins on the second row
data.sort(key=lambda x: x[target_column])
bk = xlwt.Workbook()
sheet = bk.add_sheet(sheet.name)
for idx, label in enumerate(labels):
    sheet.write(0, idx, label)
for idx_r, row in enumerate(data):
    for idx_c, value in enumerate(row):
        sheet.write(idx_r+1, idx_c, value)
bk.save('result.xls') # Notice this is xls, not xlsx like the original file is
This outputs the following workbook:
Another option (and one that can utilize XLSX output) is to utilize pandas. The code is also shorter:
import pandas as pd
xl = pd.ExcelFile("test.xlsx")
df = xl.parse("Sheet1")
df = df.sort(columns="Header Row")
writer = pd.ExcelWriter('output.xlsx')
df.to_excel(writer,sheet_name='Sheet1',columns=["Header Row"],index=False)
writer.save()
This outputs:
In the to_excel call, the index is set to False, so that the Pandas dataframe index isn't included in the excel document. The rest of the keywords should be self explanatory.
I just wanted to refresh the answer as the Pandas implementation has changed a bit over time. Here's the code that should work now (pandas 1.1.2).
import pandas as pd
xl = pd.ExcelFile("test.xlsx")
df = xl.parse("Sheet1")
df = df.sort_values(by="Header Row")
...
The sort function is now called sort_values, and the columns keyword is replaced by by.
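Independent of the library used for reading and writing, the core of the xlrd answer is "keep the header row, sort the rest by one column". On plain lists of row values that step can be sketched as follows (sort_rows is a hypothetical helper name and the sample values are made up):

```python
def sort_rows(rows, target_column=0):
    """Return the header row followed by the remaining rows sorted by one column."""
    header, body = rows[0], rows[1:]
    return [header] + sorted(body, key=lambda row: row[target_column])


# Made-up sheet content: one header row and three data rows
rows = [["Header Row"], [3.0], [1.0], [2.0]]
result = sort_rows(rows)
print(result)
```

sorted returns a new list, so the original rows stay in their sheet order in case they are still needed.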
