Cannot read ".parquet" files in Azure Jupyter Notebook (Python 2 and 3)

Cannot read ".parquet" files in Azure Jupyter Notebook (Python 2 and 3) - python

I am currently trying to open parquet files using Azure Jupyter Notebooks. I have tried both Python kernels (2 and 3).
After the installation of pyarrow I can import the module only if the Python kernel is 2 (not working with Python 3)
Here is what I've done so far (for clarity, I am not mentioning all my various attempts, such as using conda instead of pip, as it also failed):
!pip install --upgrade pip
!pip install -I Cython==0.28.5
!pip install pyarrow
import pandas
import pyarrow
import pyarrow.parquet
#so far, so good
filePath_parquet = "foo.parquet"
table_parquet_raw = pandas.read_parquet(filePath_parquet, engine='pyarrow')
This works well if I'm doing that off-line (using Spyder, Python v.3.7.0). But it fails using an Azure Notebook.
AttributeErrorTraceback (most recent call last)
<ipython-input-54-2739da3f2d20> in <module>()
6
7 #table_parquet_raw = pd.read_parquet(filePath_parquet, engine='pyarrow')
----> 8 table_parquet_raw = pandas.read_parquet(filePath_parquet, engine='pyarrow')
AttributeError: 'module' object has no attribute 'read_parquet'
Any idea please?
Thank you in advance !
EDIT:
Thank you very much for your reply Peter Pan !
I have typed these statements, here is what I got:
1.
print(pandas.__dict__)
=> read_parquet does not appear
2.
print(pandas.__file__)
=> I get:
/home/nbuser/anaconda3_23/lib/python3.4/site-packages/pandas/__init__.py
import sys; print(sys.path) => I get:
['', '/home/nbuser/anaconda3_23/lib/python34.zip',
'/home/nbuser/anaconda3_23/lib/python3.4',
'/home/nbuser/anaconda3_23/lib/python3.4/plat-linux',
'/home/nbuser/anaconda3_23/lib/python3.4/lib-dynload',
'/home/nbuser/.local/lib/python3.4/site-packages',
'/home/nbuser/anaconda3_23/lib/python3.4/site-packages',
'/home/nbuser/anaconda3_23/lib/python3.4/site-packages/Sphinx-1.3.1-py3.4.egg',
'/home/nbuser/anaconda3_23/lib/python3.4/site-packages/setuptools-27.2.0-py3.4.egg',
'/home/nbuser/anaconda3_23/lib/python3.4/site-packages/IPython/extensions',
'/home/nbuser/.ipython']
Do you have any idea please ?
EDIT 2:
Dear #PeterPan, I have typed both !conda update conda and !conda update pandas : when checking the Pandas version (pandas.__version__), it is still 0.19.2.
I have also tried with !conda update pandas -y -f, it returns:
`Fetching package metadata ...........
Solving package specifications: .
Package plan for installation in environment /home/nbuser/anaconda3_23:
The following NEW packages will be INSTALLED:
pandas: 0.19.2-np111py34_1`
When typing:
!pip install --upgrade pandas
I get:
Requirement already up-to-date: pandas in /home/nbuser/anaconda3_23/lib/python3.4/site-packages
Requirement already up-to-date: pytz>=2011k in /home/nbuser/anaconda3_23/lib/python3.4/site-packages (from pandas)
Requirement already up-to-date: numpy>=1.9.0 in /home/nbuser/anaconda3_23/lib/python3.4/site-packages (from pandas)
Requirement already up-to-date: python-dateutil>=2 in /home/nbuser/anaconda3_23/lib/python3.4/site-packages (from pandas)
Requirement already up-to-date: six>=1.5 in /home/nbuser/anaconda3_23/lib/python3.4/site-packages (from python-dateutil>=2->pandas)
Finally, when typing:
!pip install --upgrade pandas==0.24.0
I get:
Collecting pandas==0.24.0
Could not find a version that satisfies the requirement pandas==0.24.0 (from versions: 0.1, 0.2b0, 0.2b1, 0.2, 0.3.0b0, 0.3.0b2, 0.3.0, 0.4.0, 0.4.1, 0.4.2, 0.4.3, 0.5.0, 0.6.0, 0.6.1, 0.7.0rc1, 0.7.0, 0.7.1, 0.7.2, 0.7.3, 0.8.0rc1, 0.8.0rc2, 0.8.0, 0.8.1, 0.9.0, 0.9.1, 0.10.0, 0.10.1, 0.11.0, 0.12.0, 0.13.0, 0.13.1, 0.14.0, 0.14.1, 0.15.0, 0.15.1, 0.15.2, 0.16.0, 0.16.1, 0.16.2, 0.17.0, 0.17.1, 0.18.0, 0.18.1, 0.19.0rc1, 0.19.0, 0.19.1, 0.19.2, 0.20.0rc1, 0.20.0, 0.20.1, 0.20.2, 0.20.3, 0.21.0rc1, 0.21.0, 0.21.1, 0.22.0)
No matching distribution found for pandas==0.24.0
Therefore, my guess is that the problem comes from the way the packages are managed in Azure. Updating a package (here Pandas), should lead to an update to the latest version available, shouldn't it?

I tried to reproduce your issue on my Azure Jupyter Notebook, but failed. There was no any issue for me without doing your two steps !pip install --upgrade pip & !pip install -I Cython==0.28.5 which I think not matter.
Please run some codes below to check your import package pandas whether be correct.
Run print(pandas.__dict__) to check whether has the description of read_parquet function in the output.
Run print(pandas.__file__) to check whether you imported a different pandas package.
Run import sys; print(sys.path) to check the order of paths whether there is a same named file or directory under these paths.
If there is a same file or directory named pandas, you just need to rename it and restart your ipynb to re-run. It's a common issue which you can refer to these SO threads AttributeError: 'module' object has no attribute 'reader' and Importing installed package from script raises "AttributeError: module has no attribute" or "ImportError: cannot import name".
In Other cases, please update your post for more details to let me know.
The latest pandas version should be 0.23.4, not 0.24.0.
I tried to find out the earliest version of pandas which support the read_parquet feature via search the function name read_parquet in the documents of different version from 0.19.2 to 0.23.3. Then, I found pandas supports read_parquet feature after the version 0.21.1, as below.
The new features shown in the What's New of version 0.21.1
According to your EDIT 2 description, it seems that you are using Python 3.4 in Azure Jupyter Notebook. Not all pandas versions support Python 3.4 version.
The versions 0.21.1 & 0.22.0 offically support Python 2.7,3.5, and 3.6, as below.
And the PyPI page for pandas also requires the Python version as below.
So you can try to install the pandas versions 0.21.1 & 0.22.0 in the current notebook of Python 3.4. if failed, please create a new notebook in Python 2.7 or >=3.5 to install pandas version >= 0.21.1 to use the function read_parquet.

Related

Tried installing fastai on Spyder, but dont know how to solve this error

Can some help me with exact solution, I tried installing pytorch of different versions though but it is still getting the same.
!pip install fastai
Collecting fastai
Using cached fastai-2.3.0-py3-none-any.whl (193 kB)
Collecting spacy<3
Using cached spacy-2.3.5-cp38-cp38-win_amd64.whl (9.7 MB)
ERROR: Could not find a version that satisfies the requirement torchvision<0.9,>=0.8 (from fastai) (from versions: 0.1.6, 0.1.7, 0.1.8, 0.1.9, 0.2.0, 0.2.1, 0.2.2, 0.2.2.post2, 0.2.2.post3, 0.5.0, 0.9.0, 0.9.1)
ERROR: No matching distribution found for torchvision<0.9,>=0.8 (from fastai)

(Spyder maintainer here) The right way to install packages in Spyder is not by using the !pip or !conda commands (which will be disabled in the future).
Instead, you need to install Miniconda, create a conda environment after that with the packages you want to use and spyder-kernels, and finally connect Spyder to that env.

Pip can't install flask-socketio

I want to install the module flask-socketio using pip2, however I get an error, that no matching bidict version was found. I looked bidict up, and it turns out, that the version doesnt even exist. I tried installing some other packets, but nothing worked. Here you can see the full error
'''
ERROR: Could not find a version that satisfies the requirement bidict>=0.21.0 (from python-socketio>=5.0.2->flask-socketio) (from versions: 0.1.5, 0.2.1, 0.3.0, 0.3.1, 0.9.0rc0, 0.9.0.post1, 0.10.0, 0.10.0.post1, 0.11.0, 0.12.0.post1, 0.13.0, 0.13.1, 0.14.0, 0.14.1, 0.14.2, 0.15.0.dev0, 0.15.0.dev1, 0.15.0rc1, 0.15.0, 0.16.0, 0.17.0, 0.17.1, 0.17.2, 0.17.3, 0.17.4, 0.17.5, 0.18.0, 0.18.1, 0.18.2, 0.18.3, 0.18.4)
ERROR: No matching distribution found for bidict>=0.21.0 (from python-socketio>=5.0.2->flask-socketio)
'''
Any Ideas?

Flask-SocketIO dropped support for Python 2 at version 5.0.0. Install an older version:
pip install "Flask-SocketIO<5.0.0"

This solved the problem on my Raspberry Pi:
pip3 install flask-socketio

Adding pandas dependencies after kedro new

I began a new project with kedro new without adding the files from the iris example. The original requirements.txt looked like:
black==v19.10b0
flake8>=3.7.9, <4.0
ipython~=7.0
isort>=4.3.21, <5.0
jupyter~=1.0
jupyter_client>=5.1, < 7.0
jupyterlab==0.31.1
kedro==0.16.6
nbstripout==0.3.3
pytest-cov~=2.5
pytest-mock>=1.7.1, <2.0
pytest~=5.0
wheel==0.32.2
I then ran kedro install to install the packages, generating requirements.in and requirements.txt. I now want to install the necessary dependencies for working with pandas and csv files. I tried updating the requirements.in with the line: kedro[pandas]==0.16.6 and then executing kedro install --build-reqs. However, that line fails with the error:
Could not find a version that matches pyarrow<1.0.0,<2.0dev,>=0.12.0,>=1.0.0 (from kedro[pandas]==0.16.6->-r /lrlhps/data/busanalytics/Guilherme/Projects/kedro-environment/spaceflights/src/requirements.in (line 8))
Tried: 0.9.0, 0.10.0, 0.11.0, 0.11.1, 0.12.0, 0.12.1, 0.13.0, 0.14.0, 0.15.1, 0.16.0, 0.16.0, 0.16.0, 0.16.0, 0.17.0, 0.17.0, 0.17.0, 0.17.0, 0.17.1, 0.17.1, 0.17.1, 0.17.1, 1.0.0, 1.0.0, 1.0.0, 1.0.0, 1.0.1, 1.0.1, 1.0.1, 1.0.1, 2.0.0, 2.0.0, 2.0.0
There are incompatible versions in the resolved dependencies:
pyarrow<2.0dev,>=1.0.0 (from google-cloud-bigquery[bqstorage,pandas]==2.2.0->pandas-gbq==0.14.0->kedro[pandas]==0.16.6->-r /Projects/kedro/spaceflights/src/requirements.in (line 8))
pyarrow<1.0.0,>=0.12.0 (from kedro[pandas]==0.16.6->-r /Projects/kedro/spaceflights/src/requirements.in (line 8))
Question: Is it possible to update requirements.in and have the pandas dependencies installed with the --build-reqs option? Or must I install the dependency with pip?

You should be able to install pandas by adding which specific components you wish to use, as exemplified in the documentation:
The dependencies above may be sufficient for some projects, but for the
spaceflights project, you need to add a requirement for the pandas project
because you are working with CSV and Excel files. You can add the necessary
dependencies for these files types as follows:
kedro[pandas.CSVDataSet,pandas.ExcelDataSet]==0.17.0
https://kedro.readthedocs.io/en/stable/03_tutorial/02_tutorial_template.html#add-and-remove-project-specific-dependencies
For instance, after adding
kedro[pandas.CSVDataSet]==0.17.0
to your requirements.in and issuing a kedro build-reqs, you should see
kedro[pandas.csvdataset]==0.17.0 # via -r /.../src/requirements.in
(...)
pandas==1.2.0 # via kedro
in your requirements.txt file.

pandas version is not updated after installing a new version on databricks

I am trying to solve a problem of pandas when I run python3.7 code on databricks.
The error is:
ImportError: cannot import name 'roperator' from 'pandas.core.ops' (/databricks/python/lib/python3.7/site-packages/pandas/core/ops.py)
the pandas version:
pd.__version__
0.24.2
I run
from pandas.core.ops import roperator
well on my laptop with
pandas 0.25.1
So, I tried to upgrade pandas on databricks.
%sh pip uninstall -y pandas
Successfully uninstalled pandas-1.1.2
%sh pip install pandas==0.25.1
Collecting pandas==0.25.1
Downloading pandas-0.25.1-cp37-cp37m-manylinux1_x86_64.whl (10.4 MB)
Requirement already satisfied: python-dateutil>=2.6.1 in /databricks/conda/envs/databricks-ml/lib/python3.7/site-packages (from pandas==0.25.1) (2.8.0)
Requirement already satisfied: numpy>=1.13.3 in /databricks/conda/envs/databricks-ml/lib/python3.7/site-packages (from pandas==0.25.1) (1.16.2)
Requirement already satisfied: pytz>=2017.2 in /databricks/conda/envs/databricks-ml/lib/python3.7/site-packages (from pandas==0.25.1) (2018.9)
Requirement already satisfied: six>=1.5 in /databricks/conda/envs/databricks-ml/lib/python3.7/site-packages (from python-dateutil>=2.6.1->pandas==0.25.1) (1.12.0)
Installing collected packages: pandas
ERROR: After October 2020 you may experience errors when installing or updating packages.
This is because pip will change the way that it resolves dependency conflicts.
We recommend you use --use-feature=2020-resolver to test your packages with the new resolver before it becomes the default.
mlflow 1.8.0 requires alembic, which is not installed.
mlflow 1.8.0 requires prometheus-flask-exporter, which is not installed.
mlflow 1.8.0 requires sqlalchemy<=1.3.13, which is not installed.
sklearn-pandas 2.0.1 requires numpy>=1.18.1, but you'll have numpy 1.16.2 which is incompatible.
sklearn-pandas 2.0.1 requires pandas>=1.0.5, but you'll have pandas 0.25.1 which is incompatible.
sklearn-pandas 2.0.1 requires scikit-learn>=0.23.0, but you'll have scikit-learn 0.20.3 which is incompatible.
sklearn-pandas 2.0.1 requires scipy>=1.4.1, but you'll have scipy 1.2.1 which is incompatible.
Successfully installed pandas-0.25.1
When I run:
import pandas as pd
pd.__version__
it is still:
0.24.2
Did I missed something ?
thanks

It's really recommended to install libraries via cluster initialization script. The %sh command is executed only on the driver node, but not on the executor nodes. And it also doesn't affect Python instance that is already running.
The correct solution will be to use dbutils.library commands, like this:
dbutils.library.installPyPI("pandas", "1.0.1")
dbutils.library.restartPython()
this will install library to all places, but it will require restarting of the Python to pickup new libraries.
Also, although it's possible to specify only package name, it's recommended to specify version explicitly, as some of the library version may not be compatible with runtime. Also, consider usage of the newer runtimes where library versions are already updated - check the release notes for runtimes to figure out the library versions installed out of the box.
For newer Databricks runtimes you can use new magic commands: %pip and %conda to install dependencies. See the documentation for more details.

install the pandas package in pycharm

When I try to install pandas, I get this error:
Could not find a version that satisfies the requirement numpy==1.9.3 (from versions: 1.10.4, 1.11.0, 1.11.1rc1, 1.11.1, 1.11.2rc1, 1.11.2, 1.11.3, 1.12.0b1, 1.12.0rc1, 1.12.0rc2, 1.12.0, 1.12.1rc1, 1.12.1, 1.13.0rc1, 1.13.0rc2, 1.13.0, 1.13.1, 1.13.3, 1.14.0rc1, 1.14.0, 1.14.1, 1.14.2, 1.14.3)
No matching distribution found for numpy==1.9.3
I use
pip install pandas
and
python -m pip install pandas
In both cases, I get the same error.
Thanks.

You can update pip and try it again. The last version of pip is 10.0.1.
python -m pip install --upgrade pip

You have some sort of versionitis (versions of different software not being compatible with each other). If you are running Python <3.4, that could be the cause because Pandas no longer supports Python <3.4. See this SO thread.
I saw this issue on Mac OS running python 3.7 and pandas 0.23. I downgraded to python 3.6 and pandas 0.22, and that fixed my issue.

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.

Cannot read ".parquet" files in Azure Jupyter Notebook (Python 2 and 3) - python

Related

Tried installing fastai on Spyder, but dont know how to solve this error

Pip can't install flask-socketio

Adding pandas dependencies after kedro new

pandas version is not updated after installing a new version on databricks

install the pandas package in pycharm

Categories

Resources