import apache_beam as beam
from apache_beam.io.jdbc import ReadFromJdbc

with beam.Pipeline() as p:
    result = (p
              | 'Read from jdbc' >> ReadFromJdbc(
                  fetch_size=None,
                  table_name='table_name',
                  driver_class_name='oracle.jdbc.driver.OracleDriver',
                  jdbc_url='jdbc:oracle:thin:@localhost:1521:orcl',
                  username='xxx',
                  password='xxx',
                  query='select * from table_name')
              | beam.Map(print))
When I run the above code, the following error occurs:
ERROR:apache_beam.utils.subprocess_server:Starting job service with ['java', '-jar', 'C:\\Users\\YFater/.apache_beam/cache/jars\\beam-sdks-java-extensions-schemaio-expansion-service-2.29.0.jar', '51933']
ERROR:apache_beam.utils.subprocess_server:Error bringing up service
Apache Beam needs to use a Java expansion service in order to use JDBC from Python.
You get this error because Beam cannot launch the expansion service.
To fix this, install a Java runtime on the computer where you run Apache Beam, and make sure java is on your PATH.
If the problem persists after installing Java (or if you already have it installed), the JAR files Beam downloaded are probably corrupt (for example, the download was interrupted or the file was truncated because the disk filled up). In that case, remove the contents of the $HOME/.apache_beam/cache/jars directory and re-run the Beam pipeline.
Add the classpath parameter to ReadFromJdbc.
Example:
classpath=['~/.apache_beam/cache/jars/ojdbc8.jar'],
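Putting it together with the original pipeline, a minimal sketch might look like the following (the jar location and connection details are assumptions carried over from the question, and the classpath argument is only accepted by newer Beam releases):

import apache_beam as beam
from apache_beam.io.jdbc import ReadFromJdbc

with beam.Pipeline() as p:
    result = (p
              | 'Read from jdbc' >> ReadFromJdbc(
                  table_name='table_name',
                  driver_class_name='oracle.jdbc.driver.OracleDriver',
                  jdbc_url='jdbc:oracle:thin:@localhost:1521:orcl',
                  username='xxx',
                  password='xxx',
                  query='select * from table_name',
                  # point the expansion service at the Oracle JDBC driver jar (path is an assumption)
                  classpath=['~/.apache_beam/cache/jars/ojdbc8.jar'])
              | beam.Map(print))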
Related
The HdfsCLI docs say that it can be configured to connect to multiple hosts by adding URLs separated with a semicolon ; (https://hdfscli.readthedocs.io/en/latest/quickstart.html#configuration).
I use the Kerberos client, and this is my code:
from hdfs.ext.kerberos import KerberosClient
hdfs_client = KerberosClient('http://host01:50070;http://host02:50070')
And when I try to makedir, for example, I get the following error - requests.exceptions.InvalidURL: Failed to parse: http://host01:50070;http://host02:50070/webhdfs/v1/path/to/create
Apparently the version of hdfs I had installed was old; the code didn't work with version 2.0.8, but it did work with version 2.5.7.
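For reference, a minimal sketch using the question's own hosts (assuming hdfs >= 2.5.7, where the semicolon-separated namenode URLs are parsed correctly):

from hdfs.ext.kerberos import KerberosClient

# Semicolon-separated URLs list both namenodes; with hdfs >= 2.5.7 the
# client accepts them (2.0.8 raised the InvalidURL error shown above).
hdfs_client = KerberosClient('http://host01:50070;http://host02:50070')
hdfs_client.makedirs('/path/to/create')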
I have a Python script in a local file and I don't want to SCP it to the remote machine and run it with SSHOperator, triggered remotely by Airflow. How can I run a local .py file on a remote machine and get the results?
I need SSHOperator with python_callable, not bash_command.
Can anyone show me a sample remote custom operator like SSHPYTHONOperator?
I solved the problem as follows:
gettime="""
import os
import datetime
def gettimes():
print(True)
gettimes()
"""
remote_python_get_delta_times=SSHOperator(task_id= "get_delta_times",do_xcom_push=True,
command="MYVAR=`python -c" + ' "%s"`;echo $MYVAR' % gettime ,dag=dag,ssh_hook=remote)
I see an SSH operator in the Airflow docs: https://airflow.apache.org/docs/apache-airflow/1.10.13/_api/airflow/contrib/operators/ssh_operator/index.html
If that doesn't work out for you, then you'll have to create a custom operator using an SSH library like Paramiko
and then use it to pull code from either Github/S3 or SCP your file to the server and then execute it there.
You would need to make sure all your dependencies are also installed on the remote server.
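As a rough illustration only (not a tested operator; the class name, paths, and authentication handling are hypothetical, and it assumes the Paramiko library plus Airflow 2.x-style operators), such a custom operator could look roughly like this:

import paramiko
from airflow.models import BaseOperator


class SSHPythonOperator(BaseOperator):
    """Hypothetical sketch: copy a local .py file to a remote host and run it."""

    def __init__(self, host, username, local_path, remote_path, **kwargs):
        super().__init__(**kwargs)
        self.host = host
        self.username = username
        self.local_path = local_path
        self.remote_path = remote_path

    def execute(self, context):
        client = paramiko.SSHClient()
        client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
        client.connect(self.host, username=self.username)
        try:
            # Copy the local script over, then execute it remotely.
            sftp = client.open_sftp()
            sftp.put(self.local_path, self.remote_path)
            sftp.close()
            _, stdout, stderr = client.exec_command('python %s' % self.remote_path)
            output = stdout.read().decode()
            self.log.info(output)
            return output  # the returned value is pushed to XCom when do_xcom_push is set
        finally:
            client.close()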
I'm using python-jenkins library to install plugins on jenkins.
The function install_plugin returns before the plugin is fully installed. I reboot the Jenkins server immediately after that function call.
What should I do to prevent this issue?
I'm facing a similar issue. The code snippet is below.
import jenkins
server = jenkins.Jenkins('http://192.168.99.100:8080', username='guest', password='guest')
user = server.get_whoami()
version = server.get_version()
print('Hello %s from Jenkins %s' % (user['fullName'], version))
#Get installed plugins
#plugins = server.get_plugins(3)
#print(plugins)
#Download a plugin - Convert to pipeline
info = server.install_plugin('convert-to-pipeline', True)  # Convert To Pipeline
print(info)
Also, the output from install_plugin (python-jenkins' __init__.py) is an empty string instead of True.
The Jenkins log has a broken pipe error.
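There is no accepted fix in this thread, but one possible workaround sketch (purely an assumption, not taken from the original posts) is to poll the installed-plugin list and only reboot once the plugin actually shows up:

import time
import jenkins

server = jenkins.Jenkins('http://192.168.99.100:8080', username='guest', password='guest')
server.install_plugin('convert-to-pipeline', True)

# Wait until the plugin appears in the installed list before rebooting
# (the retry count and sleep interval are arbitrary choices).
for _ in range(60):
    names = [p.get('shortName') for p in server.get_plugins_info()]
    if 'convert-to-pipeline' in names:
        break
    time.sleep(5)

# Only reboot Jenkins after the loop above confirms the plugin is installed.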
We are building a data pipeline using the Beam Python SDK and trying to run it on Dataflow, but we are getting the error below:
A setup error was detected in beamapp-xxxxyyyy-0322102737-03220329-8a74-harness-lm6v. Please refer to the worker-startup log for detailed information.
But could not find detailed worker-startup logs.
We tried increasing memory size, worker count etc, but still getting the same error.
Here is the command we use:
python run.py \
--project=xyz \
--runner=DataflowRunner \
--staging_location=gs://xyz/staging \
--temp_location=gs://xyz/temp \
--requirements_file=requirements.txt \
--worker_machine_type n1-standard-8 \
--num_workers 2
Pipeline snippet:
data = pipeline | "load data" >> beam.io.Read(
beam.io.BigQuerySource(query="SELECT * FROM abc_table LIMIT 100")
)
data | "filter data" >> beam.Filter(lambda x: x.get('column_name') == value)
The above pipeline just loads the data from BigQuery and filters it on some column value. This pipeline works like a charm in DirectRunner but fails on Dataflow.
Are we making any obvious setup mistake? Is anyone else getting the same error? We could use some help to resolve the issue.
Update:
Our pipeline code is spread across multiple files, so we created a Python package. We solved the setup error by passing the --setup_file argument instead of --requirements_file.
We resolved this setup error by sending a different set of arguments to Dataflow. Our code is spread across multiple files, so we had to create a package for it. If we use --requirements_file, the job will start but eventually fail, because it wouldn't be able to find the package on the workers. The Beam Python SDK sometimes does not throw an explicit error message for these cases; instead, it retries the job and fails. To get your code running as a package, you need to pass the --setup_file argument, which has the dependencies listed in it. Make sure the package created by the python setup.py sdist command includes all the files required by your pipeline code.
If you have a privately hosted Python package dependency, then pass --extra_package with the path to the package.tar.gz file. A better way is to store it in a GCS bucket and pass that path here.
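As an illustration only (the package name and dependency list are assumptions, not taken from this answer), a bare-bones setup.py for such a package might look like:

import setuptools

setuptools.setup(
    name='my_pipeline',                     # hypothetical package name
    version='0.0.1',
    install_requires=['apache-beam[gcp]'],  # pipeline dependencies
    packages=setuptools.find_packages(),    # include every module the workers need
)

The job is then launched with --setup_file=/path/to/setup.py instead of --requirements_file.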
I have written an example project to get started with Apache Beam Python SDK on Dataflow - https://github.com/RajeshHegde/apache-beam-example
Read about it here - https://medium.com/@rajeshhegde/data-pipeline-using-apache-beam-python-sdk-on-dataflow-6bb8550bf366
I'm building a prediction pipeline using Apache Beam/Dataflow. I need to include the model files inside the dependencies available to the remote workers. The Dataflow job failed with the same error log:
Error message from worker: A setup error was detected in beamapp-xxx-xxxxxxxxxx-xxxxxxxx-xxxx-harness-xxxx. Please refer to the worker-startup log for detailed information.
However, this error message didn't give any details about the worker-startup log. Finally, I found a way to get the worker log and solve the problem.
As is known, Dataflow creates Compute Engine VMs to run jobs and saves logs on them, so we can access a VM to see its logs. We can connect to the VM used by our Dataflow job from the GCP console via SSH. Then we can check the boot-json.log file located in /var/log/dataflow/taskrunner/harness:
$ cd /var/log/dataflow/taskrunner/harness
$ cat boot-json.log
Here we should pay attention: when running in batch mode, the VMs created by Dataflow are ephemeral and are shut down when the job fails. If a VM is shut down, we can't access it anymore. But a failing work item is retried 4 times, so normally we have enough time to open boot-json.log and see what is going on.
Finally, here is my Python setup solution, which may help someone else:
main.py
...
model_path = os.path.dirname(os.path.abspath(__file__)) + '/models/net.pd'
# pipeline code
...
MANIFEST.in
include models/*.*
setup.py complete example
REQUIRED_PACKAGES = [...]

setuptools.setup(
    ...
    include_package_data=True,
    install_requires=REQUIRED_PACKAGES,
    packages=setuptools.find_packages(),
    package_data={"models": ["models/*"]},
    ...
)
Run Dataflow pipelines
$ python main.py --setup_file=/absolute/path/to/setup.py ...
I'm wondering how one would get neo4j to work with Google Compute Engine. Has anybody done this? What problems did you encounter?
Here you go,
Basic Setup
Install and setup gcloud
Install py2neo
Create your GCE Instance (https://console.developers.google.com/project/PROJECT_APPID/compute/instancesAdd) using image (debian-7-wheezy-v20141021, Debian GNU/Linux 7.7 (wheezy) amd64 built on 2014-10-21 or ANY)
SSH into your instance: gcloud compute ssh INSTANCE_NAME --zone AVAILABLE_ZONE
Download and install neo4j in GCE - you may need to install Java and lsof (fix: apt-get install lsof).
Configuration for GCE
Configure server neo4j
(Optional) Add neo4j HTTPS support
Whitelist neo4j port 7474 (More on Networking and Firewalls)
Add security (username:password) from GitHub
gcloud compute firewall-rules create neo4j --network default --allow tcp:7474
Play around
Start neo4j server ./bin/neo4j start
Check your running instance at http://IP_ADDRESS:7474/
Once py2neo is installed and the server is started, try some Python code to test it:
>>> from py2neo.neo4j import GraphDatabaseService, CypherQuery
>>> # Set up a link to the graph database.
>>> # When left blank, the URI defaults to http://localhost:7474/db/data/
>>> graph = GraphDatabaseService('http://IP_ADDRESS:7474/db/data/')
>>> CypherQuery(graph, "CREATE (n {name:'Example'}) RETURN n;").execute()
The Python setup and code above can be used in GAE as well.
References
Check pricing of GCE.
py2neo Cookbook
Compute Instances
gcloud compute
Edit: Appengine + Neo4j
# Stdlib and App Engine imports assumed by the exception handlers in _execute below.
import httplib
import logging

from google.appengine.api import urlfetch_errors
from google.appengine.runtime import DeadlineExceededError

from py2neo import neo4j

GRAPH_DB = neo4j.GraphDatabaseService(
    'http://uname:psswd@localhost:7474/db/data/')

if IS_PROD:
    GRAPH_DB = neo4j.GraphDatabaseService(
        'http://uname:psswd@host:port/db/data/')
def _execute(query):
    """Execute a neo4j Cypher query.

    Returns:
        List of Record objects.
    """
    try:
        result = neo4j.CypherQuery(GRAPH_DB, query).execute()
        # logging.info(result.data)
        return result
    except neo4j.CypherError as error:
        logging.error(error.exception)
    except DeadlineExceededError as dead:
        logging.warn(dead)
    except urlfetch_errors.InternalTransientError as tra_error:
        logging.warn(tra_error)
    except httplib.HTTPException as exp:
        logging.warn(exp)
    except neo4j.http.SocketError as soc:
        logging.warn(soc)
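A quick usage sketch (the query itself is just an illustration):

records = _execute("MATCH (n) RETURN n LIMIT 5;")
if records:
    for record in records:
        logging.info(record)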
The easiest and safest way is to use the official Neo4j Docker image,
and the Docker docs explain how to install and deploy it on Google Compute Engine.