XLRDError: Excel xlsx file; not supported Databricks


I'm using Azure Databricks and trying to read an Excel file. I have an encrypted file with the extension .xlsx.pgp. After decrypting it, I get the contents as a byte array. Here is the call I use to read it into a pandas DataFrame:
df = pd.read_excel(BytesIO(orig))
However, this is giving me the following error:
XLRDError: Excel xlsx file; not supported
Now, based on this documentation:
I have added openpyxl to the cluster and then tried to run the following:
df = pd.read_excel(BytesIO(orig),engine=`openpyxl`)
I'm getting the error:
global name 'openpyxl' is not defined
If I instead run the following command:
df = pd.read_excel(BytesIO(orig),engine='openpyxl')
The error I get is:
ValueError: Unknown engine: openpyxl
How can I resolve this issue?
Thanks for all the help!

The errors suggest that the openpyxl library is not properly installed, or that it is not available to the notebook.
Install openpyxl on the cluster that the notebook is attached to:
Step 1: Select the cluster and click Libraries.
Step 2: Click Install New, then click PyPI, enter the library name openpyxl, and click Install.
Step 3: Check the status of the openpyxl library.
Step 4: Once the status shows that openpyxl is installed, the library is ready to use.
Edit:
Note: the pandas version should be 1.0.1 or above.
If your pandas version is below 1.0.1, upgrade it with pip install --upgrade pandas
Check the pandas version with pd.__version__
For more information, you can refer to this answer from rama-a.
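The version check above can be sketched without importing the libraries themselves. This is a minimal sketch, assuming Python 3.8 or later (for importlib.metadata); the helper names installed_version and report are made up for illustration and are not part of pandas or Databricks.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version of `package`, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def report(packages=("pandas", "openpyxl", "xlrd")):
    """Map each package name to its version string (or None if not installed)."""
    return {name: installed_version(name) for name in packages}


if __name__ == "__main__":
    for name, version in report().items():
        print(f"{name}: {version or 'not installed'}")
```

Running this in a notebook cell attached to the cluster should show whether openpyxl is visible to that notebook and which pandas version is in use.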

Related

The ERROR: xlrd.biffh.XLRDError: Excel xlsx file; not supported

I'd like to ask a question about opening an Excel file.
I'm trying to open it with this program:
data = pd.read_excel(r'C:\Users\Acer\Desktop\OffshoringData.xlsx')
print(data)
The problem is that I get the following error:
xlrd.biffh.XLRDError: Excel xlsx file; not supported
What should I do in this case?

You need to use a different engine in pandas.read_excel().
For security reasons, xlrd (from version 2.0.0) no longer supports .xlsx files, but openpyxl does.
So you need to add engine='openpyxl' to your function call.
Here's the documentation:
https://pandas.pydata.org/docs/reference/api/pandas.read_excel.html
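As a rough illustration of choosing the engine, the sketch below picks an engine name from the file extension. The helper pick_engine and the table ENGINE_BY_SUFFIX are hypothetical names; the extension-to-engine pairs follow the engines listed in the pandas read_excel documentation, and each engine still has to be installed separately.

```python
from pathlib import Path

# Extension -> pandas engine name (assumed mapping, per the read_excel docs).
ENGINE_BY_SUFFIX = {
    ".xls": "xlrd",        # legacy binary format
    ".xlsx": "openpyxl",   # Office Open XML workbook
    ".xlsm": "openpyxl",   # macro-enabled workbook
    ".xlsb": "pyxlsb",     # binary workbook
    ".ods": "odf",         # OpenDocument spreadsheet
}


def pick_engine(path):
    """Return the engine name to pass to pd.read_excel for `path`."""
    suffix = Path(path).suffix.lower()
    try:
        return ENGINE_BY_SUFFIX[suffix]
    except KeyError:
        raise ValueError(f"No known Excel engine for suffix {suffix!r}")
```

With pandas and the chosen engine installed, a call would then look like pd.read_excel(path, engine=pick_engine(path)).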

This error usually results from a mismatch between the installed xlrd version and the Excel file format. xlrd 2.0.0 and later (including v2.0.1) read only the older .xls format; versions up to 1.2.0 could also read .xlsx. Either install openpyxl and pass engine='openpyxl', or install the older release with pip install xlrd==1.2.0.
reference - xlrd python package

How to read excel file in python using data provider

Is there any way in Python to read an Excel file the way a data provider works in TestNG?
I have a test method (using the Python unittest framework), and from this test I call another method that reads the Excel sheet. I want something like a data provider, so that each row of data is treated as a new test case.

You could use pandas to read Excel or CSV files:
import pandas as pd
excel_data = pd.read_excel('test_file.xlsx')
csv_data = pd.read_csv('test_file.csv')
The result is a DataFrame.
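To get something close to a TestNG data provider, one option is unittest's subTest, which reports each row as its own sub-test. This is only a sketch: the rows are hard-coded so it runs without a spreadsheet (with pandas they could come from pd.read_excel('test_file.xlsx').to_dict('records')), and add, ROWS, AddTests and run_suite are made-up names for illustration.

```python
import unittest

# Stand-in for rows read from the Excel sheet (one dict per row).
ROWS = [
    {"a": 1, "b": 2, "expected": 3},
    {"a": 5, "b": 7, "expected": 12},
    {"a": -1, "b": 1, "expected": 0},
]


def add(a, b):
    """Hypothetical function under test."""
    return a + b


class AddTests(unittest.TestCase):
    def test_rows(self):
        for row in ROWS:
            # A failing row is reported on its own, with its parameters,
            # and the remaining rows still run.
            with self.subTest(**row):
                self.assertEqual(add(row["a"], row["b"]), row["expected"])


def run_suite():
    """Run AddTests and return the unittest result object."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(AddTests)
    return unittest.TextTestRunner(verbosity=0).run(suite)
```

Note that sub-tests are counted under a single test method rather than as separate test cases; if fully separate cases are required, a parametrizing test library would be the closer match.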

Use pandas to read Excel files in Python. From your question, I assume you are not familiar with pandas.
If you added Python to PATH during installation, use pip to install pandas from the terminal:
py -m pip install pandas
The Python code is:
import pandas as pd
df = pd.read_excel('Data.xlsx')
print(df.head()) # This will print the first 5 rows.
If you want to use Jupyter Notebook, install it from the terminal:
py -m pip install notebook
This should work, but you need to have pandas installed through pip. For more advanced functionality, please update the question to say what you want: what exactly does a data provider do, so that it can be reproduced in Python?
Go through the pandas documentation: https://pandas.pydata.org/docs/

Having trouble with opening Excel file with Pandas

I am working with pandas for the first time and don't know much about it.
While trying to read an Excel file, Visual Studio Code shows the error "missing dependency xlrd". I don't know what to do.
Info:
Anaconda and VS Code are installed on the same drive, and the Excel file is on that drive too. I am using Windows 10 64-bit.

The description is very short; a little more detail would help. Try installing the module:
pip install xlrd
If you are using Python 3:
pip3 install xlrd
If you are using conda:
conda install -c anaconda xlrd
There may be multiple Python versions on the system, with the requirement satisfied for one and not for the other. I faced this problem, and python3 -m pip rather than pip3 worked for me:
python3 -m pip install xlrd
Then it should work; otherwise, upgrade:
pip3 install --upgrade pandas
pip3 install --upgrade xlrd
I hope this works:
import xlrd
import pandas as pd
sp = pd.ExcelFile("data.xlsx")
print(sp.parse(sp.sheet_names[0]))
If it doesn't work even after the upgrade, there is probably another problem that cannot be identified from your description. (Please include the full error message in the description as a code block, not as an image.)
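When several Python installations are involved, it can help to ask the interpreter that actually runs the script which modules it can see. The following is a small diagnostic sketch using only the standard library; module_available and diagnose are made-up helper names.

```python
import importlib.util
import sys


def module_available(name):
    """True if the top-level module `name` can be imported by this interpreter."""
    return importlib.util.find_spec(name) is not None


def diagnose(modules=("pandas", "xlrd", "openpyxl")):
    """Return report lines: the interpreter path, then one line per module."""
    lines = [f"interpreter: {sys.executable}"]
    for name in modules:
        status = "found" if module_available(name) else "MISSING"
        lines.append(f"{name}: {status}")
    return lines


if __name__ == "__main__":
    print("\n".join(diagnose()))
```

If a module shows as MISSING, installing it with the printed interpreter path followed by -m pip install and the module name targets that same installation, which avoids the pip/pip3 mismatch described above.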

First make sure you have all the required libraries installed.
pip install pandas
pandas also requires the NumPy library (pip normally installs it automatically as a dependency):
pip install numpy
In order to work with Pandas in your script, you will need to import it into your code. This is done with one line of code:
import pandas as pd
To work with Excel using pandas, you can also use the ExcelFile class, which is handy when reading several sheets from one workbook (read_excel on its own is enough for a single sheet). ExcelFile is part of pandas, so you import it directly:
from pandas import ExcelFile
Note the path to your Excel file, for example: /Users/Desktop/file.xlsx
Rather than writing the path inside the read_excel call, keep the code clean by storing the path in a variable:
file_path = '/Users/Desktop/file.xlsx'
The read_excel function takes the file path of an Excel workbook and returns a DataFrame object with its contents.
Put it all together and assign the DataFrame object to a variable named "df":
df = pd.read_excel(file_path)
Lastly, to view the DataFrame, print the result. Add a print statement to the end of your script, using the DataFrame variable as the argument:
print(df)

How to load the excel data into hive using python script?

I need a Python script to load data from multiple Excel sheets into a Hive table. Can anyone help with this?

You can read Excel files using pandas and insert the DataFrame using PyHive or any other Hive library.
See: Inserting a Python Dataframe into Hive from an external server
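For the multi-sheet part, pandas can return every sheet at once: pd.read_excel(path, sheet_name=None) gives a dict mapping sheet names to DataFrames. The sketch below covers only the step of stacking those sheets into one DataFrame; it uses two in-memory frames in place of a real workbook, combine_sheets is a made-up helper, and the actual write to Hive (through PyHive or another client) is not shown.

```python
import pandas as pd


def combine_sheets(sheets):
    """Stack a {sheet_name: DataFrame} mapping into one DataFrame.

    A 'sheet' column records which sheet each row came from.
    """
    tagged = [frame.assign(sheet=name) for name, frame in sheets.items()]
    return pd.concat(tagged, ignore_index=True)


# In real use the mapping would come from the workbook, e.g.
#   sheets = pd.read_excel("workbook.xlsx", sheet_name=None)
# ("workbook.xlsx" is a placeholder). Two small frames stand in here.
sheets = {
    "Jan": pd.DataFrame({"id": [1, 2], "amount": [10.0, 20.0]}),
    "Feb": pd.DataFrame({"id": [3], "amount": [30.0]}),
}
combined = combine_sheets(sheets)
print(combined)
```

This assumes all sheets share the same columns; the combined DataFrame is then what would be handed to whichever Hive library is used for the insert.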

Yes, it is very easy!
You should have the pandas library installed; if you don't, install it with pip by typing this in the command prompt: py -m pip install pandas
Then use the following code, where the two arguments are the file path and the sheet name:
import pandas as pd
df = pd.read_excel('', '')
print(df)
You will see the table from the Excel sheet printed.

Reading excel file with pandas

I have tried to read an Excel file using pandas, but I haven't been able to. I am using Python 3.8. I want to turn the Excel file into a list in Python and then use that list in an option box via tkinter. However, without being able to read the file I cannot do this.
The code I'm using is:
import pandas as pd
df = pd.read_excel(r'downloads\Clients - Nybble HelpDesk.xlsx')
print(df)
The error I'm receiving is:
Traceback (most recent call last):
  File "C:\Users\Natasha\OneDrive - Nybble.co.uk LTD\Desktop\excel export.py", line 1, in <module>
    import pandas as pd
ModuleNotFoundError: No module named 'pandas'

In the command line, run pip install pandas to install pandas first. Then re-run your code.
Other installation information is available here:
https://pandas.pydata.org/pandas-docs/stable/getting_started/install.html

You can save time by downloading the Anaconda distribution and using the Spyder IDE. This removes the need to install pandas separately, and it comes with a whole host of useful packages that you will likely use day to day.
See the link: https://docs.anaconda.com/anaconda/navigator/tutorials/pandas/

First, install pandas.
If you are using the plain Python shell, open a command prompt and type pip install pandas.
If you are using Anaconda, open the Conda Prompt and type conda install pandas.
The pandas library is now installed.
Now for your code:
import pandas as pd
excel_file = 'filepath.xls'
df = pd.read_excel(excel_file)
print(df)
Now try it, and be sure to check the path of your file. The simplest option is to save the Excel file in the same folder as the .py file.
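Checking the path before handing it to pandas can make a wrong location easier to spot. The helper below is a hypothetical sketch using only the standard library; resolve_excel_path is not a pandas function, and the folder to search is passed in explicitly.

```python
from pathlib import Path


def resolve_excel_path(filename, base_dir):
    """Return the absolute path of `filename` inside `base_dir`.

    Raises FileNotFoundError that lists the Excel files actually present.
    """
    base = Path(base_dir).resolve()
    candidate = base / filename
    if candidate.is_file():
        return candidate
    found = sorted(p.name for p in base.glob("*.xls*"))
    raise FileNotFoundError(
        f"{filename!r} not found in {base}; Excel files present: {found}"
    )
```

Inside a script, passing Path(__file__).parent as base_dir looks next to the .py file itself, regardless of which directory the script was started from.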
