importing an excel file to python

importing an excel file to python - python

I have a basic question about importing xlsx files to Python. I have checked many responses about the same topic, however I still cannot import my files to Python whatever I try. Here's my code and the error I receive:
import pandas as pd
import xlrd
file_location = 'C:\Users\cagdak\Desktop\python_self_learning\Coursera\sample_data.xlsx'
workbook = xlrd.open_workbook(file_location)
Error:
IOError: [Errno 2] No such file or directory: 'C:\\Users\\cagdak\\Desktop\\python_self_learning\\Coursera\\sample_data.xlsx'

With pandas it is possible to get directly a column of an Excel file. Here is the code.
import pandas
df = pandas.read_excel('sample.xls')
#print the column names
print df.columns
#get the values for a given column
values = df['column_name'].values
#get a data frame with selected columns
FORMAT = ['Col_1', 'Col_2', 'Col_3']
df_selected = df[FORMAT]

You should use raw strings or escape your backslash instead, for example:
file_location = r'C:\Users\cagdak\Desktop\python_self_learning\Coursera\sample_data.xlsx'
or
file_location = 'C:\\Users\\cagdak\\Desktop\python_self_learning\\Coursera\\sample_data.xlsx'

go ahead and try this:
file_location = 'C:/Users/cagdak/Desktop/python_self_learning/Coursera/sample_data.xlsx'

As pointed out above Pandas supports reading of Excel spreadsheets using its read_excel() method. However, it is dependent upon a number of external libraries depending on which version Excel/odf is being accessed. It defaults to selecting one automatically, though one can be specified using the engine parameter. Here's an excerpt from the docs:
"xlrd" supports old-style Excel files (.xls).
"openpyxl" supports newer Excel file formats.
"odf" supports OpenDocument file formats (.odf, .ods, .odt).
"pyxlsb" supports Binary Excel files.
If the required library is not already installed you'll see an error message suggesting library you need to install.

Related

python pandas read_excel engine=openpyxl not closing file

I am loading a dataframe into pandas using following:
import pandas as pd
df_factor_histories=pd.read_excel("./eco_factor/eco_factor_test_data_builder.xlsx",
engine='openpyxl', sheet_name=0)
engine=openpyxl is required to enable read_excel to support newer Excel file formats (specifically in my case .xlsx rather than jusy .xls).
The dataframe loads just fine but the file is left open:
import psutil
p = psutil.Process()
print(p.open_files())
OUTPUT
[popenfile(path='C:\\Users\\xx\\.ipython\\profile_default\\history.sqlite', fd=-1),
popenfile(path='C:\\Windows\\System32\\en-US\\KernelBase.dll.mui', fd=-1),
popenfile(path='C:\\Windows\\System32\\en-US\\kernel32.dll.mui', fd=-1),
popenfile(path='D:\\xxxxx\\data modelling\\eco_factor\\eco_factor_test_data_builder.xlsx', fd=-1)]
This Github Post suggests the bug is fixed - but not for me (running Anaconda/Jupyter).
Relevant versions I am running:
numpy 1.19.2
openpyxl 3.0.5
pandas 1.1.3
Python 3.7.4
I would appreciate some suggestions on how to close the files/best work around this, thanks

I suggest to remove engine='openpyxl' from your code. It isn't actually needed. I use the pd.read_excel without it and it works just fine even for .xlsx format.
Removing this will cause the default behavior for the engine parameter to take over. The engine will know which engine to use:
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_excel.html#pandas.read_excel
engine : str, default None
If io is not a buffer or path, this must be set to identify io. Supported engines: “xlrd”, “openpyxl”, “odf”, “pyxlsb”. Engine compatibility :
“xlrd” supports old-style Excel files (.xls).
“openpyxl” supports newer Excel file formats.
“odf” supports OpenDocument file formats (.odf, .ods, .odt).
“pyxlsb” supports Binary Excel files.
Changed in version 1.2.0: The engine xlrd now only supports old-style .xls files. When engine=None, the following logic will be used to determine the engine:
If path_or_buffer is an OpenDocument format (.odf, .ods, .odt), then odf will be used.
Otherwise if path_or_buffer is an xls format, xlrd will be used.
Otherwise if openpyxl is installed, then openpyxl will be used.
Otherwise if xlrd >= 2.0 is installed, a ValueError will be raised.
Otherwise xlrd will be used and a FutureWarning will be raised. This case will raise a ValueError in a future version of pandas.

I was facing the same issue,
When i read the excel file using pandas with engine=openpyxl. It is not being closed. when i try to archive/ move the excel file using python, it was giving error,
PermissionError: [WinError 32] The process cannot access the file because it is being used by another process
Also, once it is read by pandas, we are not able to edit or any other operation on excel file using Excel tool.
Following solution worked for me.
i am using:
python version 3.6.8.
pandas==0.25.1
openpyxl==3.0.7
import io
import pandas as pd
with open('path/to/input_excel_file.xlsx', "rb") as f:
file_io_obj = io.BytesIO(f.read())
df_input_file = pd.read_excel(file_io_obj, engine='openpyxl', sheet_name=None)

Converting xlsx files to xls to use with pandas [duplicate]

This question already has answers here:
Pandas cannot open an Excel (.xlsx) file
(5 answers)
Closed 2 years ago.
I have a repetetive task, where I download multiple excel files (I'm forced to download in xlsx format), I then take column G from each excel file and concatenate them into "final.xlsx" Then "final.xlsx" is compared to another excel workbook to see if all number instances are matched in each workbook.
I'm now working on making a cross platform python app to solve this. However, pandas won't allow xlsx files anymore, and manually opening and saving them as xls files just adds more repetitive manual labour.
Is there a cross-platform way for python to convert xlsx files to xls?
Or should I abandon pandas and go with openpyxl since I'm forced to handle xlsx format?
I tried using this without success ;
from pathlib import Path
import openpyxl
import os
# get files
os.chdir(os.path.abspath(os.path.dirname(__file__)))
pdir = Path('.')
filelist = [filename for filename in pdir.iterdir() if filename.suffix == '.xlsx']
for filename in filelist:
print(filename.name)
for infile in filelist:
workbook = openpyxl.load_workbook(infile)
outfile = f"{infile.name.split('.')[0]}.xls"
workbook.save(outfile)

You can still use pandas, but you would need openpyxl. As you have it in your code, I suppose it is ok for you.
Otherwise, you can install it via: pip install openpyxl.
The following illustrates how this can work. Kr.
import pandas as pd
fpath = r".\test.xlsx"
df = pd.read_excel (fpath, engine='openpyxl')
print(df)
A B
0 1 2
1 1 2

Previously, the default argument engine=None to read_excel() would result in using the xlrd engine in many cases, including new Excel 2007+ (.xlsx) files. If openpyxl is installed, many of these cases will now default to using the openpyxl engine. See the read_excel() documentation for more details.
Thus, it is strongly encouraged to install openpyxl to read Excel 2007+ (.xlsx) files. Please do not report issues when using xlrd to read .xlsx files. This is no longer supported, switch to using openpyxl instead.
https://pandas.pydata.org/docs/whatsnew/v1.2.0.html

Read XLSB File in Pandas Python

There are many questions on this, but there has been no simple answer on how to read an xlsb file into pandas. Is there an easy way to do this?

With the 1.0.0 release of pandas - January 29, 2020, support for binary Excel files was added.
import pandas as pd
df = pd.read_excel('path_to_file.xlsb', engine='pyxlsb')
Notes:
You will need to upgrade pandas - pip install pandas --upgrade
You will need to install pyxlsb - pip install pyxlsb

Hi actually there is a way. Just use pyxlsb library.
import pandas as pd
from pyxlsb import open_workbook as open_xlsb
df = []
with open_xlsb('some.xlsb') as wb:
with wb.get_sheet(1) as sheet:
for row in sheet.rows():
df.append([item.v for item in row])
df = pd.DataFrame(df[1:], columns=df[0])
UPDATE:
as of pandas version 1.0 read_excel() now can read binary Excel (.xlsb) files by passing engine='pyxlsb'
Source: https://pandas.pydata.org/pandas-docs/version/1.0.0/whatsnew/v1.0.0.html

Pyxlsb indeed is an option to read xlsb file, however, is rather limited.
I suggest using the xlwings package which makes it possible to read and write xlsb files without losing sheet formating, formulas, etc. in the xlsb file. There is extensive documentation available.
import pandas as pd
import xlwings as xw
app = xw.App()
book = xw.Book('file.xlsb')
sheet = book.sheets('sheet_name')
df = sheet.range('A1').options(pd.DataFrame, expand='table').value
book.close()
app.kill()
'A1' in this case is the starting position of the excel table.
To write to xlsb file, simply write:
sheet.range('A1').value = df

If you want to read a big binary file or any excel file with some ranges you can directly put at this code
range = (your_index_number)
first_dataframe = []
second_dataframe = []
with open_xlsb('Test.xlsb') as wb:
with wb.get_sheet('Sheet1') as sheet:
i=0
for row in sheet.rows():
if(i!=range):
first_dataframe.append([item.v for item in row])
i=i+1
else:
second_dataframe.append([item.v for item in row])
first_dataframe = pd.DataFrame(first_dataframe[1:], columns=first[0])
second_dataframe = pd.DataFrame(second_dataframe[:], columns=first.columns)

To be able to read xlsb files, it is necessary to have openpyxl installed.
As per https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_excel.html#pandas.read_excel
engine: str, default None
If io is not a buffer or path, this must be set to identify io. Supported engines: “xlrd”, “openpyxl”, “odf”, “pyxlsb”. Engine compatibility :
“xlrd” supports old-style Excel files (.xls).
“openpyxl” supports newer Excel file formats.
“odf” supports OpenDocument file formats (.odf, .ods, .odt).
“pyxlsb” supports Binary Excel files.
Changed in version 1.2.0: The engine xlrd now only supports old-style .xls files.
When engine=None, the following logic will be used to determine the engine:
If path_or_buffer is an OpenDocument format (.odf, .ods, .odt), then odf will be used.
Otherwise if path_or_buffer is an xls format, xlrd will be used.
Otherwise if openpyxl is installed, then openpyxl will be used.
Otherwise if xlrd >= 2.0 is installed, a ValueError will be raised.
Otherwise xlrd will be used and a FutureWarning will be raised. This case will raise a ValueError in a future version of pandas.
xlsb reading without index_col:
import pandas as pd
dfcluster = pd.read_excel('c:/xml/baseline/distribucion.xlsb', sheet_name='Cluster', index_col=0, engine='pyxlsb')

CParserError: Error tokenizing data

I'm having some trouble reading a csv file
import pandas as pd
df = pd.read_csv('Data_Matches_tekha.csv', skiprows=2)
I get
pandas.io.common.CParserError: Error tokenizing data. C error: Expected 1 fields in line 526, saw 5
and when I add sep=None to df I get another error
Error: line contains NULL byte
I tried adding unicode='utf-8', I even tried CSV reader and nothing works with this file
the csv file is totally fine, I checked it and i see nothing wrong with it
Here are the errors I get:

In your actual code, the line is:
>>> pandas.read_csv("Data_Matches_tekha.xlsx", sep=None)
You are trying to read an Excel file, and not a plain text CSV which is why things are not working.
Excel files (xlsx) are in a special binary format which cannot be read as simple text files (like CSV files).
You need to either convert the Excel file to a CSV file (note - if you have multiple sheets, each sheet should be converted to its own csv file), and then read those.
You can use read_excel or you can use a library like xlrd which is designed to read the binary format of Excel files; see Reading/parsing Excel (xls) files with Python for for more information on that.

Use read_excel instead read_csv if Excel file:
import pandas as pd
df = pd.read_excel("Data_Matches_tekha.xlsx")

I have encountered the same error when I used to_csv to write some data and then read it in another script. I found an easy solution without passing by pandas' read function, it's a package named Pickle.
You can download it by typing in your terminal
pip install pickle
Then you can use for writing your data (first) the code below
import pickle
with open(path, 'wb') as output:
pickle.dump(variable_to_save, output)
And finally import your data in another script using
import pickle
with open(path, 'rb') as input:
data = pickle.load(input)
Note that if you want to use, when reading your saved data, a different python version than the one in which you saved your data, you can precise that in the writing step by using protocol=x with x corresponding to the version (2 or 3) aiming to use for reading.
I hope this can be of any use.

Reading/parsing Excel (xls) files with Python [closed]

Closed. This question is opinion-based. It is not currently accepting answers.
Want to improve this question? Update the question so it can be answered with facts and citations by editing this post.
Closed 1 year ago.
The community reviewed whether to reopen this question 1 year ago and left it closed:
Original close reason(s) were not resolved
Improve this question
What is the best way to read Excel (XLS) files with Python (not CSV files).
Is there a built-in package which is supported by default in Python to do this task?

I highly recommend xlrd for reading .xls files. But there are some limitations(refer to xlrd github page):
Warning
This library will no longer read anything other than .xls files. For
alternatives that read newer file formats, please see
http://www.python-excel.org/.
The following are also not supported but will safely and reliably be
ignored:
- Charts, Macros, Pictures, any other embedded object, including embedded worksheets.
- VBA modules
- Formulas, but results of formula calculations are extracted.
- Comments
- Hyperlinks
- Autofilters, advanced filters, pivot tables, conditional formatting, data validation
Password-protected files are not supported and cannot be read by this
library.
voyager mentioned the use of COM automation. Having done this myself a few years ago, be warned that doing this is a real PITA. The number of caveats is huge and the documentation is lacking and annoying. I ran into many weird bugs and gotchas, some of which took many hours to figure out.
UPDATE:
For newer .xlsx files, the recommended library for reading and writing appears to be openpyxl (thanks, Ikar Pohorský).

You can use pandas to do this, first install the required libraries:
$ pip install pandas openpyxl
See code below:
import pandas as pd
xls = pd.ExcelFile(r"yourfilename.xls") # use r before absolute file path
sheetX = xls.parse(2) #2 is the sheet number+1 thus if the file has only 1 sheet write 0 in paranthesis
var1 = sheetX['ColumnName']
print(var1[1]) #1 is the row number...

You can choose any one of them http://www.python-excel.org/
I would recommended python xlrd library.
install it using
pip install xlrd
import using
import xlrd
to open a workbook
workbook = xlrd.open_workbook('your_file_name.xlsx')
open sheet by name
worksheet = workbook.sheet_by_name('Name of the Sheet')
open sheet by index
worksheet = workbook.sheet_by_index(0)
read cell value
worksheet.cell(0, 0).value

I think Pandas is the best way to go. There is already one answer here with Pandas using ExcelFile function, but it did not work properly for me. From here I found the read_excel function which works just fine:
import pandas as pd
dfs = pd.read_excel("your_file_name.xlsx", sheet_name="your_sheet_name")
print(dfs.head(10))
P.S. You need to have the xlrd installed for read_excel function to work
Update 21-03-2020: As you may see here, there are issues with the xlrd engine and it is going to be deprecated. The openpyxl is the best replacement. So as described here, the canonical syntax should be:
dfs = pd.read_excel("your_file_name.xlsx", sheet_name="your_sheet_name", engine="openpyxl")

For xlsx I like the solution posted earlier as https://web.archive.org/web/20180216070531/https://stackoverflow.com/questions/4371163/reading-xlsx-files-using-python. I uses modules from the standard library only.
def xlsx(fname):
import zipfile
from xml.etree.ElementTree import iterparse
z = zipfile.ZipFile(fname)
strings = [el.text for e, el in iterparse(z.open('xl/sharedStrings.xml')) if el.tag.endswith('}t')]
rows = []
row = {}
value = ''
for e, el in iterparse(z.open('xl/worksheets/sheet1.xml')):
if el.tag.endswith('}v'): # Example: <v>84</v>
value = el.text
if el.tag.endswith('}c'): # Example: <c r="A3" t="s"><v>84</v></c>
if el.attrib.get('t') == 's':
value = strings[int(value)]
letter = el.attrib['r'] # Example: AZ22
while letter[-1].isdigit():
letter = letter[:-1]
row[letter] = value
value = ''
if el.tag.endswith('}row'):
rows.append(row)
row = {}
return rows
Improvements added are fetching content by sheet name, using re to get the column and checking if sharedstrings are used.
def xlsx(fname,sheet):
import zipfile
from xml.etree.ElementTree import iterparse
import re
z = zipfile.ZipFile(fname)
if 'xl/sharedStrings.xml' in z.namelist():
# Get shared strings
strings = [element.text for event, element
in iterparse(z.open('xl/sharedStrings.xml'))
if element.tag.endswith('}t')]
sheetdict = { element.attrib['name']:element.attrib['sheetId'] for event,element in iterparse(z.open('xl/workbook.xml'))
if element.tag.endswith('}sheet') }
rows = []
row = {}
value = ''
if sheet in sheets:
sheetfile = 'xl/worksheets/sheet'+sheets[sheet]+'.xml'
#print(sheet,sheetfile)
for event, element in iterparse(z.open(sheetfile)):
# get value or index to shared strings
if element.tag.endswith('}v') or element.tag.endswith('}t'):
value = element.text
# If value is a shared string, use value as an index
if element.tag.endswith('}c'):
if element.attrib.get('t') == 's':
value = strings[int(value)]
# split the row/col information so that the row leter(s) can be separate
letter = re.sub('\d','',element.attrib['r'])
row[letter] = value
value = ''
if element.tag.endswith('}row'):
rows.append(row)
row = {}
return rows

If you need old XLS format. Below code for ansii 'cp1251'.
import xlrd
file=u'C:/Landau/task/6200.xlsx'
try:
book = xlrd.open_workbook(file,encoding_override="cp1251")
except:
book = xlrd.open_workbook(file)
print("The number of worksheets is {0}".format(book.nsheets))
print("Worksheet name(s): {0}".format(book.sheet_names()))
sh = book.sheet_by_index(0)
print("{0} {1} {2}".format(sh.name, sh.nrows, sh.ncols))
print("Cell D30 is {0}".format(sh.cell_value(rowx=29, colx=3)))
for rx in range(sh.nrows):
print(sh.row(rx))

For older .xls files, you can use xlrd
either you can use xlrd directly by importing it. Like below
import xlrd
wb = xlrd.open_workbook(file_name)
Or you can also use pandas pd.read_excel() method, but do not forget to specify the engine, though the default is xlrd, it has to be specified.
pd.read_excel(file_name, engine = xlrd)
Both of them work for older .xls file formats.
Infact I came across this when I used OpenPyXL, i got the below error
InvalidFileException: openpyxl does not support the old .xls file format, please use xlrd to read this file, or convert it to the more recent .xlsx file format.

You can use any of the libraries listed here (like Pyxlreader that is based on JExcelApi, or xlwt), plus COM automation to use Excel itself for the reading of the files, but for that you are introducing Office as a dependency of your software, which might not be always an option.

You might also consider running the (non-python) program xls2csv. Feed it an xls file, and you should get back a csv.

Python Excelerator handles this task as well. http://ghantoos.org/2007/10/25/python-pyexcelerator-small-howto/
It's also available in Debian and Ubuntu:
sudo apt-get install python-excelerator

with open(csv_filename) as file:
data = file.read()
with open(xl_file_name, 'w') as file:
file.write(data)
You can turn CSV to excel like above with inbuilt packages. CSV can be handled with an inbuilt package of dictreader and dictwriter which will work the same way as python dictionary works. which makes it a ton easy
I am currently unaware of any inbuilt packages for excel but I had come across openpyxl. It was also pretty straight forward and simple You can see the code snippet below hope this helps
import openpyxl
book = openpyxl.load_workbook(filename)
sheet = book.active
result =sheet['AP2']
print(result.value)

For older Excel files there is the OleFileIO_PL module that can read the OLE structured storage format used.

If the file is really an old .xls, this works for me on python3 just using base open() and pandas:
df = pandas.read_csv(open(f, encoding = 'UTF-8'), sep='\t')
Note that the file I'm using is tab delimited. less or a text editor should be able to read .xls so that you can sniff out the delimiter.
I did not have a lot of luck with xlrd because of – I think – UTF-8 issues.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

importing an excel file to python - python

You should use raw strings or escape your backslash instead, for example: file_location = r'C:\Users\cagdak\Desktop\python_self_learning\Coursera\sample_data.xlsx' or file_location = 'C:\\Users\\cagdak\\Desktop\python_self_learning\\Coursera\\sample_data.xlsx'

go ahead and try this: file_location = 'C:/Users/cagdak/Desktop/python_self_learning/Coursera/sample_data.xlsx'

Related

python pandas read_excel engine=openpyxl not closing file

Converting xlsx files to xls to use with pandas [duplicate]

Read XLSB File in Pandas Python

CParserError: Error tokenizing data

Reading/parsing Excel (xls) files with Python [closed]

Categories

Resources