Getting exception "No module named 'airflow.providers.sftp'" - python

I am trying to copy files from SFTP to Google Cloud Storage.
Composer version = 1.16.12
Airflow version = 1.10.15
While executing, I get the exception No module named 'airflow.providers.sftp'.
Much appreciated if someone can give pointers.
Code snippet:
import os
import airflow
from airflow import DAG
from airflow import models
from airflow.operators import python_operator
from airflow.providers.google.cloud.transfers.sftp_to_gcs import SFTPToGCSOperator
from airflow.utils.dates import days_ago

with models.DAG("test_ssh_to_gcs", start_date=days_ago(1), schedule_interval=None) as dag:
    copy_file_from_ssh_to_gcs = SFTPToGCSOperator(
        task_id="file-copy-ssh-to-gcs",
        source_path="/ ",
        destination_bucket='test_sftp_to_gcs',
        destination_path="test/test.csv",
        gcp_conn_id="google_cloud_default",
        sftp_conn_id="sftp_test",
    )

    copy_file_from_ssh_to_gcs

First, have you tried installing the package with pip install apache-airflow-providers-sftp?
Also, be careful about which documentation version you are referring to: with Airflow 2.0, some packages have been moved.
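If the install succeeds (the package targets Airflow 2.x), a quick way to confirm the provider is visible to your environment is to import one of its modules. A minimal check, assuming Airflow 2.x:

# Minimal check, assuming Airflow 2.x with apache-airflow-providers-sftp installed
from airflow.providers.sftp.hooks.sftp import SFTPHook
print(SFTPHook)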

You get the error because SFTPToGCSOperator relies on the airflow.providers.sftp package under the hood, which is only present in airflow >= 2.0.0.
The bad news is that you need to upgrade your Airflow version to use airflow.providers.google.cloud.transfers.sftp_to_gcs.SFTPToGCSOperator.
If you don't want to or cannot upgrade Airflow, you can create a DAG chaining two operators, using their Airflow 1.x imports:
1- SFTP operator (download the file to local storage using operation='get'): from airflow.contrib.operators.sftp_operator import SFTPOperator
2- Upload the file to Google Cloud Storage: from airflow.contrib.operators.file_to_gcs import FileToGoogleCloudStorageOperator
This should do the trick:

from airflow import models
from airflow.contrib.operators.sftp_operator import SFTPOperator
from airflow.contrib.operators.file_to_gcs import FileToGoogleCloudStorageOperator
from airflow.utils.dates import days_ago

LOCALFILE = '/tmp/kk'

with models.DAG("test_ssh_to_gcs", start_date=days_ago(1), schedule_interval=None) as dag:
    download_sftp = SFTPOperator(
        task_id='part1_sftp_download_to_local',
        ssh_conn_id="sftp_test",
        local_filepath=LOCALFILE,
        remote_filepath='',
        operation='get',
    )

    gcp_upload = FileToGoogleCloudStorageOperator(
        task_id='part2_upload_to_gcs',
        bucket='test_sftp_to_gcs',
        src=LOCALFILE,
        dst="test/test.csv",
        google_cloud_storage_conn_id="google_cloud_default",  # configured in Airflow
    )

    download_sftp >> gcp_upload

With Airflow 1.10 you can install the backported provider packages.
For your case, the following need to be added to your Composer environment (a quick import check is sketched after the list):
1- apache-airflow-backport-providers-google
2- apache-airflow-backport-providers-sftp
3- apache-airflow-backport-providers-ssh
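Once those backport packages are installed, the 2.0-style import path used in the original snippet should resolve even on Airflow 1.10. A minimal sanity check, assuming the three packages above are installed on the cluster:

# Minimal sanity check, assuming the backport providers above are installed on Airflow 1.10
from airflow.providers.google.cloud.transfers.sftp_to_gcs import SFTPToGCSOperator
print(SFTPToGCSOperator)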

Install the dependency via pip:
pip install apache-airflow-providers-sftp

Related

Can't get airflow AWS connection to work "ModuleNotFoundError: No module named 'airflow.providers.amazon"

I have been trying to run a simple Airflow DAG to show what's in an S3 bucket, but I keep getting this error: ModuleNotFoundError: No module named 'airflow.providers.amazon'
I've tried several pip installs recommended in similar questions but still have no luck. Here's the Python script, and below is a screenshot of my Airflow webserver showing the error message. Note I'm using Airflow version 2.5.0.
import datetime
import logging

from airflow import DAG
from airflow.models import Variable
from airflow.operators.python_operator import PythonOperator
from airflow.hooks.S3_hook import S3Hook


def list_keys():
    hook = S3Hook(aws_conn_id='aws_credentials_old')
    bucket = Variable.get('s3_bucket')
    prefix = Variable.get('s3_prefix')
    logging.info(f"Listing Keys from {bucket}/{prefix}")
    keys = hook.list_keys(bucket, prefix=prefix)
    for key in keys:
        logging.info(f"- s3://{bucket}/{key}")


dag = DAG(
    'lesson1.exercise4',
    start_date=datetime.datetime.now())

list_task = PythonOperator(
    task_id="list_keys",
    python_callable=list_keys,
    dag=dag
)
You can try installing the backport-providers-amazon package, since the Amazon integration is not bundled with the core Airflow distribution.
pip install apache-airflow-backport-providers-amazon
Here you can find more info: https://pypi.org/project/apache-airflow-backport-providers-amazon/
You are importing from the wrong place. It should be
from airflow.providers.amazon.aws.hooks.s3 import S3Hook
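Putting this together for Airflow 2.5.0, a minimal sketch of the corrected hook usage, assuming the Amazon provider is installed with pip install apache-airflow-providers-amazon (the connection id and variable names are reused from the question):

# Sketch for Airflow 2.x, assuming apache-airflow-providers-amazon is installed
import logging

from airflow.models import Variable
from airflow.providers.amazon.aws.hooks.s3 import S3Hook


def list_keys():
    hook = S3Hook(aws_conn_id='aws_credentials_old')
    bucket = Variable.get('s3_bucket')
    prefix = Variable.get('s3_prefix')
    keys = hook.list_keys(bucket, prefix=prefix)
    for key in keys:
        logging.info(f"- s3://{bucket}/{key}")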

specifying dependencies needed for .py file run through BashOperator() with apache airflow

I'm trying to automate the process of running a web scraper daily with Apache Airflow and Docker. I have the Airflow server up and running and I can manually trigger my DAG through the Airflow GUI on the local server, but it's failing.
I'm not sure where to even see what errors are being triggered. My dag.py file is below, and you can see where I'm trying to use the BashOperator to run the script. I suspect the issue is with the dependencies the scraper uses, but I'm not sure how to integrate the config file and the other packages necessary to run the script through Airflow / Docker.
from airflow.models import DAG
from datetime import datetime, timedelta
from airflow.operators.bash_operator import BashOperator

dag = DAG("MI_Spider", start_date=datetime(2021, 1, 1), schedule_interval="@daily", catchup=False)

curl = BashOperator(
    task_id='testingbash',
    bash_command="python ~/spider/path/MichiganSpider.py",
    dag=dag)
Should I move the spider file and config file into the airflow project directory or somehow install the dependencies directly to the docker container I'm using along with somehow setting env variables within the docker container instead of calling the db login credentials through a separate config file? I've been using a conda env for the scraper when I run it manually. Is there any way I can just use that environment?
I'm very new to docker and apache airflow so I apologize if this stuff should be obvious.
Thank you in advance!!
Assuming you are on a pretty recent version of Airflow, I recommend refactoring your DAG to use the PythonVirtualenvOperator (https://airflow.apache.org/docs/apache-airflow/stable/howto/operator/python.html#pythonvirtualenvoperator) instead of BashOperator.
Here's an example of how to use Python operators in Airflow: https://airflow.apache.org/docs/apache-airflow/stable/_modules/airflow/example_dags/example_python_operator.html
The part relevant to you is:
import logging
import shutil
import time
from pprint import pprint

import pendulum

from airflow import DAG
from airflow.decorators import task

log = logging.getLogger(__name__)

with DAG(
    dag_id='example_python_operator',
    schedule_interval=None,
    start_date=pendulum.datetime(2021, 1, 1, tz="UTC"),
    catchup=False,
    tags=['example'],
) as dag:
    if not shutil.which("virtualenv"):
        log.warning("The virtualenv_python example task requires virtualenv, please install it.")
    else:
        @task.virtualenv(
            task_id="virtualenv_python", requirements=["colorama==0.4.0"], system_site_packages=False
        )
        def callable_virtualenv():
            """
            Example function that will be performed in a virtual environment.

            Importing at the module level ensures that it will not attempt to import the
            library before it is installed.
            """
            from time import sleep

            from colorama import Back, Fore, Style

            print(Fore.RED + 'some red text')
            print(Back.GREEN + 'and with a green background')
            print(Style.DIM + 'and in dim text')
            print(Style.RESET_ALL)
            for _ in range(10):
                print(Style.DIM + 'Please wait...', flush=True)
                sleep(10)
            print('Finished')

        virtualenv_task = callable_virtualenv()
Just remember to have the virtualenv package available in your Airflow image.
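For the scraper in this question, the same pattern could look roughly like the sketch below; the requirements list and the entry-point import are assumptions, since neither is shown in the question:

# Hedged sketch: the requirements and the MichiganSpider entry point are placeholders
import pendulum

from airflow import DAG
from airflow.decorators import task

with DAG(
    dag_id="MI_Spider",
    schedule_interval="@daily",
    start_date=pendulum.datetime(2021, 1, 1, tz="UTC"),
    catchup=False,
) as dag:

    @task.virtualenv(
        task_id="run_michigan_spider",
        requirements=["requests", "beautifulsoup4"],  # hypothetical scraper dependencies
        system_site_packages=False,
    )
    def run_spider():
        # Import inside the task so the dependency is resolved inside the virtualenv
        from MichiganSpider import main  # hypothetical entry point
        main()

    run_spider()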

Issue when running code from SQL server agent - azure.storage.blob - module azure not found

I'm having a problem with some python code that connects to an azure storage container.
Code:
from azure.storage.blob import BlockBlobService
import logging
import os, sys

block_blob_service = BlockBlobService(account_name=accountName, account_key=accountKey, connection_string=connectionString)

block_blob_service.create_blob_from_path(containerName, blob_name=blobName, file_path=fileNameToUpload)
OK, so this code works when executed from a command prompt.
When it's executed from a SQL Agent job, it fails with:
line 1, in from azure.storage.blob import BlockBlobService ModuleNotFoundError: No module named 'azure'. Process Exit Code 1. The step failed.
pip list:
azure-common 1.1.27
azure-core 1.19.0
azure-storage-blob 1.5.0
azure-storage-common 1.4.2
Using python 3.7.4
The credential that I use to run the SQL agent job is mapped to my userid which has admin privileges on the server.
I used the quickstart to get me started.
Can anyone help please?
You are referencing an old version of azure-storage-blob (v1.5.0); in the latest version (v12.x.x) you need to use BlobServiceClient instead.
pip install azure-storage-blob==12.9.0
blob_service_client = BlobServiceClient(account_url=url, credential=self._account_key)
The link you mentioned is already pointing to the latest version.
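For the original upload, a minimal sketch with the v12 SDK, reusing the connectionString, containerName, blobName and fileNameToUpload variables from the question's snippet:

# Minimal v12 sketch; the variables are reused from the question's snippet
from azure.storage.blob import BlobServiceClient

blob_service_client = BlobServiceClient.from_connection_string(connectionString)
blob_client = blob_service_client.get_blob_client(container=containerName, blob=blobName)

with open(fileNameToUpload, "rb") as data:
    blob_client.upload_blob(data, overwrite=True)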

How to import Airflow's PostgresOperator

I'm trying to import the PostgresOperator from the airflow package:
from airflow.providers.postgres.operators.postgres import PostgresOperator
But I'm getting the following error: Cannot find reference 'postgres' in imported module airflow.providers.
The solution was to run the following in the terminal, using the project's virtualenv: pip install 'apache-airflow[postgres]'.
Note that it won't work if you don't wrap the package name in single quotes.
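Once the extra is installed, a minimal usage sketch (the connection id postgres_default and the SQL statement are placeholders, not from the question):

# Minimal sketch; the connection id and SQL are placeholders
from airflow import DAG
from airflow.providers.postgres.operators.postgres import PostgresOperator
from airflow.utils.dates import days_ago

with DAG("postgres_example", start_date=days_ago(1), schedule_interval=None) as dag:
    create_table = PostgresOperator(
        task_id="create_table",
        postgres_conn_id="postgres_default",
        sql="CREATE TABLE IF NOT EXISTS example (id INT);",
    )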
Use:
from airflow.operators.postgres_operator import PostgresOperator

Connect BigQuery to Python

I have installed the Google BigQuery client library and I'm trying to query BigQuery from Python.
from google.cloud import bigquery
client = bigquery.Client(project='aaa18')
But I'm getting this error and I don't know what it means:
C:\Users\udgtlvr\AppData\Local\Continuum\anaconda3\python.exe C:/Users/udgtlvr/untitled4/bigquery
Traceback (most recent call last):
  File "C:/Users/udgtlvr/untitled4/bigquery", line 2, in <module>
    from google.cloud import bigquery
  File "C:\Users\udgtlvr\AppData\Local\Continuum\anaconda3\lib\site-packages\google\cloud\bigquery\__init__.py", line 35, in <module>
    from google.cloud.bigquery.client import Client
  File "C:\Users\udgtlvr\AppData\Local\Continuum\anaconda3\lib\site-packages\google\cloud\bigquery\client.py", line 43, in <module>
    from google import resumable_media
ImportError: cannot import name 'resumable_media' from 'google' (unknown location)
What are the steps I need to do in order to get this thing work?
Besides what @Hitobat mentions in his comment (reinstalling the google-cloud-bigquery module), you could also try installing just the module that is failing, google-resumable-media.
To verify which module is missing or might be outdated, you can run the following command:
pip freeze | grep google
This will list all the Google modules you have installed.
Just to make sure you installed the required official packages, you should have used this source first: https://googleapis.dev/python/bigquery/latest/index.html
They specify the install assuming you use a virtual environment.
Windows:
pip install virtualenv
virtualenv <your-env>
<your-env>\Scripts\activate
<your-env>\Scripts\pip.exe install google-cloud-bigquery
If you did this right, installing the Google Cloud Python library usually installs the required dependencies (in your case, google-resumable-media).
pip install google-cloud
As I see it, your problem is more of an Anaconda/local Python paths problem... So you may first want to verify where the Python client for Google BigQuery is being installed.
If you definitely want to work with Anaconda environments, then try installing the packages from the Anaconda Navigator environment manager... That should solve the dependency problem.
Create a service account with the desired BigQuery roles and download its JSON key file (example: data-lab.json). Then use the code below:
import os
os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = "data-lab.json"
from google.cloud import bigquery
client = bigquery.Client()
This is what I do and it works fine. I use Colab/Jupyter notebooks to access BigQuery.
The first option is to authenticate using the browser:
from google.colab import auth
auth.authenticate_user()
The second option is to use a service account for authentication, as explained above.
Once you are done with authentication, the code below should work:
from google.cloud import bigquery
client = bigquery.Client(project='project_id')
%load_ext google.cloud.bigquery
Then run BigQuery commands using the cell magic:
%%bigquery --project project_id
select * from table limit 1
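Outside of notebooks, a minimal sketch of running a query with the authenticated client ('project_id' and the query are placeholders):

# Minimal sketch; 'project_id' is a placeholder for your GCP project
from google.cloud import bigquery

client = bigquery.Client(project="project_id")
query_job = client.query("SELECT 1 AS x")
for row in query_job.result():
    print(row.x)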
