Python - Trouble pulling data from multiple excel files in folder

Python - Trouble pulling data from multiple excel files in folder - python

I am trying to write a script that will open/look through every excel workbook inside a folder, pull specific values from each of those workbooks, and then paste those values into a new csv.
My script (see below), and all 49 workbooks are located in the following path: C:\Users\user.name\Desktop\Excel Test.
import pandas
import os
info_headers = ['Production Name', 'Data size (GBs)', 'Billable Data size (GBs)']
info = []
files = [file for file in os.listdir('C:\\Users\\user.name\\Desktop\\Excel Test')]
for file in files:
df = pandas.read_excel(file)
size = df['Unnamed: 2'].loc[df['QC Checklist'] == 'Data Size (GB):'].values[0]
name = df['Unnamed: 2'].loc[df['QC Checklist'] == 'Production Volume Name'].values[0]
bill_size = df['Unnamed: 2'].loc[df['QC Checklist'] == 'Billable Data Size (GB):'].values[0]
info.append([name, size, bill_size])
output = pandas.DataFrame(info, columns=info_headers)
output.to_csv('C:\\Users\\user.name\\Excel Test') # This will output a csv in your current directory
I receive the following error when trying to run this:
Traceback (most recent call last):
File "C:/Users/user.name/Desktop/Excel Test/exceltest.py", line 9, in <module>
df = pandas.read_excel(file)
File "C:\Users\user.name\AppData\Local\Programs\Python\Python37\lib\site-packages\pandas\util\_decorators.py", line 188, in wrapper
return func(*args, **kwargs)
File "C:\Users\user.name\AppData\Local\Programs\Python\Python37\lib\site-packages\pandas\util\_decorators.py", line 188, in wrapper
return func(*args, **kwargs)
File "C:\Users\user.name\AppData\Local\Programs\Python\Python37\lib\site-packages\pandas\io\excel.py", line 350, in read_excel
io = ExcelFile(io, engine=engine)
File "C:\Users\user.name\AppData\Local\Programs\Python\Python37\lib\site-packages\pandas\io\excel.py", line 653, in __init__
self._reader = self._engines[engine](self._io)
File "C:\Users\user.name\AppData\Local\Programs\Python\Python37\lib\site-packages\pandas\io\excel.py", line 424, in __init__
self.book = xlrd.open_workbook(filepath_or_buffer)
File "C:\Users\user.name\AppData\Local\Programs\Python\Python37\lib\site-packages\xlrd\__init__.py", line 157, in open_workbook
ragged_rows=ragged_rows,
File "C:\Users\user.name\AppData\Local\Programs\Python\Python37\lib\site-packages\xlrd\book.py", line 92, in open_workbook_xls
biff_version = bk.getbof(XL_WORKBOOK_GLOBALS)
File "C:\Users\user.name\AppData\Local\Programs\Python\Python37\lib\site-packages\xlrd\book.py", line 1278, in getbof
bof_error('Expected BOF record; found %r' % self.mem[savpos:savpos+8])
File "C:\Users\user.name\AppData\Local\Programs\Python\Python37\lib\site-packages\xlrd\book.py", line 1272, in bof_error
raise XLRDError('Unsupported format, or corrupt file: ' + msg)
xlrd.biffh.XLRDError: Unsupported format, or corrupt file: Expected BOF record; found b'import p'

There could be a couple of problems. It may be trying to open non-excel files in the directory. After your os.listdir() call, try to filter for excel files only.
Or your excel file may be formatted incorrectly.

Related

TypeError: expected <class 'openpyxl.styles.fills.Fill'>

I`m trying to download and then open excel file (report) generated by marketplace with openpyxl.
import requests
import config
import openpyxl
link = 'https://api.telegram.org/file/bot' + config.TOKEN + '/documents/file_66.xlsx'
def save_open(link):
filename = link.split('/')[-1]
r = requests.get(link)
with open(filename, 'wb') as new_file:
new_file.write(r.content)
wb = openpyxl.open ('file_66.xlsx')
ws = wb.active
cell = ws['B2'].value
print (cell)
save_open(link)
After running this code I got the above:
Traceback (most recent call last):
File "C:\Python 3.9\lib\site-packages\openpyxl\descriptors\base.py", line 55, in _convert
value = expected_type(value)
TypeError: Fill() takes no arguments
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "C:\Users\Home\Documents\myPython\bot_WB\main.py", line 20, in <module>
save_open(link)
File "C:\Users\Home\Documents\myPython\bot_WB\main.py", line 14, in save_open
wb = openpyxl.open ('file_66.xlsx')
File "C:\Python 3.9\lib\site-packages\openpyxl\reader\excel.py", line 317, in load_workbook
reader.read()
File "C:\Python 3.9\lib\site-packages\openpyxl\reader\excel.py", line 281, in read
apply_stylesheet(self.archive, self.wb)
File "C:\Python 3.9\lib\site-packages\openpyxl\styles\stylesheet.py", line 198, in apply_stylesheet
stylesheet = Stylesheet.from_tree(node)
File "C:\Python 3.9\lib\site-packages\openpyxl\styles\stylesheet.py", line 103, in from_tree
return super(Stylesheet, cls).from_tree(node)
File "C:\Python 3.9\lib\site-packages\openpyxl\descriptors\serialisable.py", line 103, in from_tree
return cls(**attrib)
File "C:\Python 3.9\lib\site-packages\openpyxl\styles\stylesheet.py", line 74, in __init__
self.fills = fills
File "C:\Python 3.9\lib\site-packages\openpyxl\descriptors\sequence.py", line 26, in __set__
seq = [_convert(self.expected_type, value) for value in seq]
File "C:\Python 3.9\lib\site-packages\openpyxl\descriptors\sequence.py", line 26, in <listcomp>
seq = [_convert(self.expected_type, value) for value in seq]
File "C:\Python 3.9\lib\site-packages\openpyxl\descriptors\base.py", line 57, in _convert
raise TypeError('expected ' + str(expected_type))
TypeError: expected <class 'openpyxl.styles.fills.Fill'>
[Finished in 1.6s]
If you run file properties/details you can see that this file was generated by "Go Exelize" (author: xuri). To run this file you need to separate code in two parts. First: download file. Then you need to manually open it with MS Excel, save file and close it (after this "Go Excelize" switch to "Microsoft Excel"). And only after that you can run the second part of the code correctly with no errors. Can anyone help me to handle this problem?

I had the same problem, "TypeError('expected ' + str(expected_type))", using pandas.read_excel, which uses openpyxl. If I open the file, save and close it, it will work with both, pandas and openpyxl.
Upon further attempts I could open the file using the "read_only=True" in openpyxl, but while iterating over the rows I would still get the error, but only when all the rows ended, in the end of the file.
I belive it could be something in the EOF (end of file) and openpyxl don't have ways of treating it.
Here is the code that I used to test and worked for me:
import openpyxl
wb = openpyxl.load_workbook(my_file_name, read_only=True)
ws = wb.worksheets[0]
lis = []
try:
for row in ws.iter_rows():
lis.append([cell.value for cell in row])
except TypeError:
print('Skip error in EOF')
Used openpyxl==3.0.10

Problem with adding Excel files at Pandas | wrapper return func

Hi everybody I have a problem uploading a excel file with Pandas
I have taken the file in archive, if I uploaded it directly it gaves me an error. If I cope and paste the excel file there is no problem.
The code is very easy:
data = pd.read_excel(r"C:\Users\obett\Desktop\Corporate Governance\pandas.xlsx")
and this is the error:
Traceback (most recent call last):
File "C:/Users/obett/PycharmProjects/pythonProject6/main.py", line 24, in <module>
data = pd.read_excel(r"C:\Users\obett\Desktop\Corporate Governance\Aida_Export_67.xlsx")
File "C:\Users\obett\PycharmProjects\pythonProject6\venv\lib\site-packages\pandas\util\_decorators.py", line 299, in wrapper
return func(*args, **kwargs)
File "C:\Users\obett\PycharmProjects\pythonProject6\venv\lib\site-packages\pandas\io\excel\_base.py", line 344, in read_excel
data = io.parse(
File "C:\Users\obett\PycharmProjects\pythonProject6\venv\lib\site-packages\pandas\io\excel\_base.py", line 1170, in parse
return self._reader.parse(
File "C:\Users\obett\PycharmProjects\pythonProject6\venv\lib\site-packages\pandas\io\excel\_base.py", line 492, in parse
data = self.get_sheet_data(sheet, convert_float)
File "C:\Users\obett\PycharmProjects\pythonProject6\venv\lib\site-packages\pandas\io\excel\_openpyxl.py", line 549, in get_sheet_data
converted_row = [self._convert_cell(cell, convert_float) for cell in row]
File "C:\Users\obett\PycharmProjects\pythonProject6\venv\lib\site-packages\pandas\io\excel\_openpyxl.py", line 549, in <listcomp>
converted_row = [self._convert_cell(cell, convert_float) for cell in row]
File "C:\Users\obett\PycharmProjects\pythonProject6\venv\lib\site-packages\pandas\io\excel\_openpyxl.py", line 514, in _convert_cell
elif cell.is_date:
File "C:\Users\obett\PycharmProjects\pythonProject6\venv\lib\site-packages\openpyxl\cell\read_only.py", line 101, in is_date
return Cell.is_date.__get__(self)
File "C:\Users\obett\PycharmProjects\pythonProject6\venv\lib\site-packages\openpyxl\cell\cell.py", line 256, in is_date
self.data_type == 'n' and is_date_format(self.number_format)
File "C:\Users\obett\PycharmProjects\pythonProject6\venv\lib\site-packages\openpyxl\cell\read_only.py", line 66, in number_format
_id = self.style_array.numFmtId
File "C:\Users\obett\PycharmProjects\pythonProject6\venv\lib\site-packages\openpyxl\cell\read_only.py", line 56, in style_array
return self.parent.parent._cell_styles[self._style_id]
IndexError: list index out of range
Thank you very much

Unsupported format, or corrupt file: Expected BOF record

I am having an issue using read_excel on .xlsx files while using a for loop to go through the directory. When doing the following:
df1 = pd.read_excel('***.xlsx')
df2 = pd.read_excel('***.xlsx')
df = pd.concat([df1,df2], ignore_index=True, sort=False)
I have no issues. But when I try reading the same files in my loop:
directories_to_check = [
"C:\\Users\\***"
]
files = []
for directory in directories_to_check:
directories = [d for d in listdir(directory) if isdir(join(directory,d))]
print(directories)
for root in directories:
path = join(directory,root)
print(path)
for file in listdir(path):
print(file)
if file[0:5].lower() == 'name':
files.append(join(directory,root,file))
else:
files.append(join(directory,root,file))
for filetoopen in files:
print(filetoopen)
df = pd.concat([pd.read_excel(filetoopen, header=1)], ignore_index=True, sort=False)
I get the following error:
Traceback (most recent call last):
File "C:/Users/***/Sample.py", line 30, in <module>
df = pd.concat([pd.read_excel(filetoopen, header=1)], ignore_index=True, sort=False)
File "C:\Users\***\Python\Python37-32\lib\site-packages\pandas\util\_decorators.py", line 208, in wrapper
return func(*args, **kwargs)
File "C:\Users\***\Python\Python37-32\lib\site-packages\pandas\io\excel\_base.py", line 310, in read_excel
io = ExcelFile(io, engine=engine)
File "C:\Users\***\Python\Python37-32\lib\site-packages\pandas\io\excel\_base.py", line 819, in __init__
self._reader = self._engines[engine](self._io)
File "C:\Users\***\Python\Python37-32\lib\site-packages\pandas\io\excel\_xlrd.py", line 21, in __init__
super().__init__(filepath_or_buffer)
File "C:\Users\***\Python\Python37-32\lib\site-packages\pandas\io\excel\_base.py", line 359, in __init__
self.book = self.load_workbook(filepath_or_buffer)
File "C:\Users\***\Python\Python37-32\lib\site-packages\pandas\io\excel\_xlrd.py", line 36, in load_workbook
return open_workbook(filepath_or_buffer)
File "C:\Users\***\Python\Python37-32\lib\site-packages\xlrd\__init__.py", line 157, in open_workbook
ragged_rows=ragged_rows,
File "C:\Users\***\Python\Python37-32\lib\site-packages\xlrd\book.py", line 92, in open_workbook_xls
biff_version = bk.getbof(XL_WORKBOOK_GLOBALS)
File "C:\Users\***\Python\Python37-32\lib\site-packages\xlrd\book.py", line 1278, in getbof
bof_error('Expected BOF record; found %r' % self.mem[savpos:savpos+8])
File "C:\Users\***\Python\Python37-32\lib\site-packages\xlrd\book.py", line 1272, in bof_error
raise XLRDError('Unsupported format, or corrupt file: ' + msg)
xlrd.biffh.XLRDError: Unsupported format, or corrupt file: Expected BOF record; found b'C:\\Users'
Any help would be appreciated, thank you!

How to read the uploaded excel file in AWS Lambda?

I am uploading an excel file from Postman and trying to read it in AWS lambda with pandas. How can I do it?
I have tried reading the bytes from API gateway event with 'cgi.parse_multipart'. I was able to read the csv files successfully but not xlsx files.
from cgi import parse_header, parse_multipart
import pandas as pd
from io import BytesIO, StringIO
import json
def get_data(event):
c_type, c_data = parse_header(event['headers']['content-type'])
c_data['boundary'] = bytes(c_data['boundary'], "utf-8")
assert c_type == 'multipart/form-data'
body = event['body']
body = bytes(body, 'utf-8')
form_data = parse_multipart(BytesIO(body), c_data)
data = form_data['file'][0]
s=str(data,'utf-8')
d = StringIO(s)
# df=pd.read_csv(d)
df=pd.read_excel(d)
print(df)
def run(event, context):
output = {}
output['statusCode'] = 200
output['body'] = json.dumps(get_data(event))
return output
While trying to read xlsx file, I get the following error:
Traceback (most recent call last):
File "/var/task/upload_test.py", line 108, in run
output['body'] = json.dumps(get_data(event))
File "/var/task/upload_test.py", line 52, in get_data
df=pd.read_excel(d)
File "/opt/python/lib/python3.6/site-packages/pandas/util/_decorators.py", line 188, in wrapper
return func(*args, **kwargs)
File "/opt/python/lib/python3.6/site-packages/pandas/util/_decorators.py", line 188, in wrapper
return func(*args, **kwargs)
File "/opt/python/lib/python3.6/site-packages/pandas/io/excel.py", line 350, in read_excel
io = ExcelFile(io, engine=engine)
File "/opt/python/lib/python3.6/site-packages/pandas/io/excel.py", line 653, in __init__
self._reader = self._engines[engine](self._io)
File "/opt/python/lib/python3.6/site-packages/pandas/io/excel.py", line 422, in __init__
self.book = xlrd.open_workbook(file_contents=data)
File "/var/task/xlrd/__init__.py", line 157, in open_workbook
ragged_rows=ragged_rows,
File "/var/task/xlrd/book.py", line 92, in open_workbook_xls
biff_version = bk.getbof(XL_WORKBOOK_GLOBALS)
File "/var/task/xlrd/book.py", line 1274, in getbof
opcode = self.get2bytes()
File "/var/task/xlrd/book.py", line 675, in get2bytes
return (BYTES_ORD(hi) << 8) | BYTES_ORD(lo)
TypeError: unsupported operand type(s) for <<: 'str' and 'int'

Issues reading with pandas.read_table

I am trying to read in and manipulate some text files I have as outputs with statistics from an MRI analysis I ran.
With a for loop I would like to index into each subjects folder and convert their txt file with summary statistics into a data_frame, drop some of the rows that are not necessary, and concatenate each subject's now cleaned data_frame with a master data_frame. I seem to be reading in the txt file and getting it into the data_frame. However I am running into two issues I can't troubleshoot when trying to drop rows.
The txt file is organized like so.....
# Title Pathway Statistics
#
# generating_program
/cell_root/software/freesurfer/6.0.0/sys/bin/dmri_pathstats
# cvs_version
Count 2000
Volume 98
Len_Min 67
Len_Max 92
Len_Avg 81.219
Len_Center 87
AD_Avg 0.00152315
AD_Avg_Weight 0.00151198
AD_Avg_Center 0.00141413
Although there is one row with many spaces that might be related to the issue in terms of reading the data in?
# cmdline
/cell_root/software/freesurfer/6.0.0/sys/bin/dmri_pathstats --intrc
/homes/dcallow/dti_freesurf/trac/Ex.AES115.long.base_AES115/dpath/fmajor_PP_avg33_mni_bbr --dtbase
/homes/dcallow/dti_freesurf/trac/Ex.AES115.long.base_AES115/dmri/dtifit --path fmajor --subj Ex.AES115.long.base_AES115 --out
/homes/dcallow/dti_freesurf/trac/Ex.AES115.long.base_AES115/dpath/fmajor_PP_avg33_mni_bbr/pathstats.overall.txt --outvox
/homes/dcallow/dti_freesurf/trac/Ex.AES115.long.base_AES115/dpath/fmajor_PP_avg33_mni_bbr/pathstats.byvoxel.txt
I tried removing the drop line however I then get a different error that occurs earlier in the code which is also odd?
Traceback (most recent call last):
File "./txt_2_excel.sh", line 18, in
df=pd.read_table('pathstats.overall.txt', delim_whitespace=True,names=['measure','value','excess1','excess2','excess3','excess4'])
File "/Users/amos/anaconda3/lib/python3.7/site-packages/pandas/io/parsers.py",
line 678, in parser_f
return _read(filepath_or_buffer, kwds)
File "/Users/amos/anaconda3/lib/python3.7/site-packages/pandas/io/parsers.py",
line 440, in _read
parser = TextFileReader(filepath_or_buffer, **kwds)
File "/Users/amos/anaconda3/lib/python3.7/site-packages/pandas/io/parsers.py",
line 787, in init
self._make_engine(self.engine)
File "/Users/amos/anaconda3/lib/python3.7/site-packages/pandas/io/parsers.py",
line 1014, in _make_engine
self._engine = CParserWrapper(self.f, **self.options)
File "/Users/amos/anaconda3/lib/python3.7/site-packages/pandas/io/parsers.py",
line 1708, in init
self._reader = parsers.TextReader(src, **kwds)
File "pandas/_libs/parsers.pyx", line 384, in pandas._libs.parsers.TextReader.cinit
File "pandas/_libs/parsers.pyx", line 695, in pandas._libs.parsers.TextReader._setup_parser_source
FileNotFoundError: File b'pathstats.overall.txt' does not exist
#!/Users/amos/anaconda3/bin/python
# Pythono3 code to rename multiple
# files in a directory or folder
# importing os module
import os
import pandas as pd
#set working directory to where files are stored
os.chdir("/Volumes/DANIEL/trac_stats")
df_master = pd.DataFrame()
for tract in os.listdir("/Volumes/DANIEL/tract_names/"):
for subj in os.listdir("/Volumes/DANIEL/trac/"):
os.chdir("/Volumes/DANIEL/trac/{0}/dpath/{1}/".format(subj,tract))
os.getcwd()
df=pd.read_table('pathstats.overall.txt', delim_whitespace=True,names=['measure','value','excess1','excess2','excess3','excess4'])
df=df.drop([0,1,2,3,4,5,6,7,8,9,10,11,14,15,16,17,18,20,22,23,25,26,28,29,31,32])
df['subj']=subj
df_master=pd.concat([df_master,df])
print(df_master)
os.chdir("/Volumes/DANIEL/trac_stats/")
df_master.to_excel('trac_stats')
This should produce an excel sheet with a ['measure','value','excess1','excess2','excess3','excess4'] column and rows 12,13,19,21,24,and 27 of data for each subject in a excel sheet named trac_stats.
I get the following error
File "./txt_2_excel.sh", line 20, in File
"/Users/amos/anaconda3/lib/python3.7/site-packages/pandas/core/frame.py",
line 3697, in drop errors=errors) File
"/Users/amos/anaconda3/lib/python3.7/site-packages/pandas/core/generic.py",
line 3111, in drop obj = obj._drop_axis(labels, axis, level=level,
errors=errors) File
"/Users/amos/anaconda3/lib/python3.7/site-packages/pandas/core/generic.py",
line 3143, in _drop_axis new_axis = axis.drop(labels, errors=errors)
File
"/Users/amos/anaconda3/lib/python3.7/site-packages/pandas/core/indexes/base.py",
line 4404, in drop '{} not found in axis'.format(labels[mask]))
KeyError: '[32] not found in axis'

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Python - Trouble pulling data from multiple excel files in folder - python

There could be a couple of problems. It may be trying to open non-excel files in the directory. After your os.listdir() call, try to filter for excel files only. Or your excel file may be formatted incorrectly.

Related

TypeError: expected <class 'openpyxl.styles.fills.Fill'>

Problem with adding Excel files at Pandas | wrapper return func

Unsupported format, or corrupt file: Expected BOF record

How to read the uploaded excel file in AWS Lambda?

Issues reading with pandas.read_table

Categories

Resources