I have an AWS Glue job written in Python that pulls in the spark-xml library (through the Dependent jars path). I'm using spark-xml_2.11-0.2.0.jar. When I try to output my DataFrame to XML I get an error. The code I'm using is:
applymapping1.toDF().repartition(1).write.format("com.databricks.xml").save("s3://glue.xml.output/Test.xml");
The error I get is:
"/mnt/yarn/usercache/root/appcache/application_1517883778506_0016/container_1517883778506_0016_02_000001/pyspark.zip/pyspark/sql/readwriter.py",
line 550, in save File
"/mnt/yarn/usercache/root/appcache/application_1517883778506_0016/container_1517883778506_0016_02_000001/py4j-0.10.4-src.zip/py4j/java_gateway.py",
line 1133, in call File
"/mnt/yarn/usercache/root/appcache/application_1517883778506_0016/container_1517883778506_0016_02_000001/pyspark.zip/pyspark/sql/utils.py",
line 63, in deco File
"/mnt/yarn/usercache/root/appcache/application_1517883778506_0016/container_1517883778506_0016_02_000001/py4j-0.10.4-src.zip/py4j/protocol.py",
line 319, in get_return_value py4j.protocol.Py4JJavaError: An error
occurred while calling o75.save. : java.lang.AbstractMethodError:
com.databricks.spark.xml.DefaultSource15.createRelation(Lorg/apache/spark/sql/SQLContext;Lorg/apache/spark/sql/SaveMode;Lscala/collection/immutable/Map;Lorg/apache/spark/sql/Dataset;)Lorg/apache/spark/sql/sources/BaseRelation;
at
org.apache.spark.sql.execution.datasources.DataSource.write(DataSource.scala:426)
at
org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:215)
at
If I change it to CSV, it works fine:
applymapping1.toDF().repartition(1).write.format("com.databricks.csv").save("s3://glue.xml.output/Test.xml");
Note: When using CSV I don't have to import spark-xml. I think spark-csv is included in AWS Glue's Spark environment.
Any suggestions to what to try?
I've tried various versions of spark-xml:
spark-xml_2.11-0.2.0
spark-xml_2.11-0.3.1
spark-xml_2.10-0.2.0
This question is very similar to (but not an exact duplicate of) Why does elasticsearch-spark 5.5.0 give AbstractMethodError when submitting to YARN cluster?, which also deals with an AbstractMethodError.
Quoting the javadoc of java.lang.AbstractMethodError:
Thrown when an application tries to call an abstract method. Normally, this error is caught by the compiler; this error can only occur at run time if the definition of some class has incompatibly changed since the currently executing method was last compiled.
That pretty much explains what you experience (note the part that starts with "this error can only occur at run time").
I think a Spark version mismatch is in play here.
Given com.databricks.spark.xml.DefaultSource15 in the stack trace and the change that does the following:
Remove the separated DefaultSource15 due to compatibility in Spark 1.5+
This removes DefaultSource15 and merges it into DefaultSource. It was separated for compatibility with Spark 1.5+. In master and spark-xml 0.4.x, Spark 1.x support has been dropped.
You should make sure that the version of Spark in AWS Glue's environment matches your spark-xml version. The latest version of spark-xml, 0.4.1, was released on 6 Nov 2016.
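For reference, here is a hedged sketch of what the XML write could look like in the Glue job once a spark-xml build matching Glue's Spark version (2.x at the time of writing) is supplied through the Dependent jars path, e.g. spark-xml_2.11-0.4.1.jar. The rootTag/rowTag values are placeholders, not anything from the original job:

# Sketch only: assumes spark-xml_2.11-0.4.1.jar (built for Scala 2.11 / Spark 2.x)
# is on the job's classpath and that applymapping1 is the DynamicFrame from the job.
df = applymapping1.toDF().repartition(1)
(df.write
   .format("com.databricks.spark.xml")   # fully qualified data source name
   .option("rootTag", "records")         # placeholder root element
   .option("rowTag", "record")           # placeholder per-row element
   .save("s3://glue.xml.output/Test"))   # spark-xml writes a directory of part files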
I am trying to run a playbook from https://github.com/Datanexus/dn-cassandra.
Of the different deployment scenarios listed there, I am going for the multi-node Cassandra setup described here: deployment scenarios.
I have set up a static inventory file:
cassandra-seed-01 ansible_ssh_host=192.168.0.17 ansible_ssh_port=22 ansible_ssh_user='root' ansible_ssh_private_key_file='keys/id_rsa'
cassandra-seed-02 ansible_ssh_host=192.168.0.18 ansible_ssh_port=22 ansible_ssh_user='root' ansible_ssh_private_key_file='keys/id_rsa'
cassandra-non-seed-01 ansible_ssh_host=192.168.0.22 ansible_ssh_port=22 ansible_ssh_user='root' ansible_ssh_private_key_file='keys/id_rsa'
[cassandra_seed]
192.168.0.17
192.168.0.18
[cassandra]
192.168.0.22
However, when I try running the playbook, it throws the following error:
ERROR! no action detected in task
The error appears to have been in '/home/laumair/workspace/dn-cassandra/provision-cassandra.yml': line 21, column 9, but may be elsewhere in the file depending on the exact syntax problem.
The offending line appears to be:
# then, build the seed and non-seed host groups
- include_role:
^ here
I would appreciate any sort of direction with this error, as I have tried solutions for similar errors but with no luck so far.
include_role is available since Ansible 2.2.
Please upgrade your Ansible installation.
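If you want to confirm from Python what the control machine is actually running, a minimal sketch (equivalent to running ansible --version on the command line):

# Check the installed Ansible version; include_role requires Ansible >= 2.2.
import ansible
print(ansible.__version__)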
I am using the w2v_server_googlenews code from the word2vec HTTP server running at https://rare-technologies.com/word2vec-tutorial/#bonus_app. I changed the loaded file to a file of vectors trained with the original C version of word2vec. I load the file with
gensim.models.KeyedVectors.load_word2vec_format(fname, binary=True)
and it seems to load without problems. But when I test the HTTP service with, let's say
curl 'http://127.0.0.1/most_similar?positive%5B%5D=woman&positive%5B%5D=king&negative%5B%5D=man'
I got an empty result with only the execution time.
{"taken": 0.0003361701965332031, "similars": [], "success": 1}
I put a traceback.print_exc() in the except block of the related method (in this case def most_similar(self, *args, **kwargs)) and I got:
Traceback (most recent call last):
File "./w2v_server.py", line 114, in most_similar
topn=5)
File "/usr/local/lib/python2.7/dist-packages/gensim/models/keyedvectors.py", line 304, in most_similar
self.init_sims()
File "/usr/local/lib/python2.7/dist-packages/gensim/models/keyedvectors.py", line 817, in init_sims
self.syn0norm = (self.syn0 / sqrt((self.syn0 ** 2).sum(-1))[..., newaxis]).astype(REAL)
AttributeError: 'KeyedVectors' object has no attribute 'syn0'
Any idea why this might happen?
Note: I use Python 2.7 and installed gensim using pip, which gave me gensim 2.1.0.
FYI, that demo code was based on gensim 0.12.3 (from 2015, as listed in its requirements.txt), and would need updating to work with the latest gensim.
It might be sufficient to add a line to w2v_server.py at line 70 (just after the load_word2vec_format()), to force the creation of the needed syn0norm property (which in older gensims was auto-created on load), before deleting the raw syn0 values. Specifically:
self.model.init_sims(replace=True)
(You would leave out the replace=True if you were going to be doing operations other than most_similar(), that might require raw vectors.)
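For context, a minimal sketch of the idea, using the same fname as in the question's load call:

# Load the C-format vectors, then precompute the unit-normalized vectors
# that most_similar() needs; gensim 2.x no longer does this automatically on load.
import gensim
model = gensim.models.KeyedVectors.load_word2vec_format(fname, binary=True)
model.init_sims(replace=True)  # drop replace=True if you also need the raw vectors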
If this works to fix the problem for you, a pull-request to the w2v_server_googlenews repo would be favorably received!
I'm trying to set spark.sql.parquet.output.committer.class and nothing I do seems to get the setting to take effect.
I'm trying to have many threads write to the same output folder, which would work with org.apache.spark.sql.parquet.DirectParquetOutputCommitter since it wouldn't use the _temporary folder. I'm getting the following error, which is how I know it's not working:
Caused by: java.io.FileNotFoundException: File hdfs://path/to/stuff/_temporary/0/task_201606281757_0048_m_000029/some_dir does not exist.
at org.apache.hadoop.hdfs.DistributedFileSystem.listStatusInternal(DistributedFileSystem.java:795)
at org.apache.hadoop.hdfs.DistributedFileSystem.access$700(DistributedFileSystem.java:106)
at org.apache.hadoop.hdfs.DistributedFileSystem$18.doCall(DistributedFileSystem.java:853)
at org.apache.hadoop.hdfs.DistributedFileSystem$18.doCall(DistributedFileSystem.java:849)
at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
at org.apache.hadoop.hdfs.DistributedFileSystem.listStatus(DistributedFileSystem.java:849)
at org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.mergePaths(FileOutputCommitter.java:382)
at org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.mergePaths(FileOutputCommitter.java:384)
at org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.commitJob(FileOutputCommitter.java:326)
at org.apache.parquet.hadoop.ParquetOutputCommitter.commitJob(ParquetOutputCommitter.java:46)
at org.apache.spark.sql.execution.datasources.BaseWriterContainer.commitJob(WriterContainer.scala:230)
at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1.apply$mcV$sp(InsertIntoHadoopFsRelation.scala:151)
Note the call to org.apache.parquet.hadoop.ParquetOutputCommitter.commitJob, the default class.
I've tried the following, based on other SO answers and searches:
sc._jsc.hadoopConfiguration().set(key, val) (this does work for settings like parquet.enable.summary-metadata)
dataframe.write.option(key, val).parquet
Adding --conf "spark.hadoop.spark.sql.parquet.output.committer.class=org.apache.spark.sql.parquet.DirectParquetOutputCommitter" to the spark-submit call
Adding --conf "spark.sql.parquet.output.committer.class"=" org.apache.spark.sql.parquet.DirectParquetOutputCommitter" to the spark-submit call.
That's all I've been able to find, and nothing works. It looks like it's not hard to set in Scala but appears impossible in Python.
The approach in this comment definitively worked for me:
16/06/28 18:49:59 INFO ParquetRelation: Using user defined output committer for Parquet: org.apache.spark.sql.execution.datasources.parquet.DirectParquetOutputCommitter
The confirmation was a log message lost in the flood of output Spark produces, and the error I was seeing was unrelated. It's all moot anyway, since DirectParquetOutputCommitter has since been removed from Spark.
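For anyone on an older Spark 1.x build where the class still exists, here is a hedged sketch of one way to set the option from PySpark (not necessarily the exact approach from the linked comment); the class name matches the log line above, and that same "Using user defined output committer" message is how you confirm it took effect:

# Illustrative only: DirectParquetOutputCommitter was deprecated in Spark 1.6
# and removed in 2.0, so this sketch applies to older 1.x builds.
from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext(appName="direct-committer-sketch")
sqlContext = SQLContext(sc)
sqlContext.setConf("spark.sql.parquet.output.committer.class",
                   "org.apache.spark.sql.execution.datasources.parquet.DirectParquetOutputCommitter")
# df.write.parquet("hdfs://path/to/output")  # then check the driver log for the
# "Using user defined output committer for Parquet: ..." confirmation message.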
I'm kind of at my wits end here, and so far have had no feedback from the MySQL Workbench bug reporting site, so I thought I'd throw this question/problem out to more sites.
I'm attempting to migrate from an MSSQL server on a Windows Server 2003 machine to a MySQL server running on a CentOS 6.5 VM. I can connect to the source and target databases, select schemata, and the wizard runs through one pass for retrieving tables. After this the process fails and throws the following errors:
Traceback (most recent call last):
File "/usr/lib64/mysql-workbench/modules/db_mssql_grt.py", line 409, in reverseEngineer
reverseEngineerProcedures(connection, schema)
File "/usr/lib64/mysql-workbench/modules/db_mssql_grt.py", line 1016, in reverseEngineerProcedures
for idx, (proc_count, proc_name, proc_definition) in enumerate(cursor):
MemoryError
Traceback (most recent call last):
File "/usr/share/mysql-workbench/libraries/workbench/wizard_progress_page_widget.py", line 192, in thread_work
self.func()
File "/usr/lib64/mysql-workbench/modules/migration_schema_selection.py", line 160, in task_reveng
self.main.plan.migrationSource.reverseEngineer()
File "/usr/lib64/mysql-workbench/modules/migration.py", line 353, in reverseEngineer
self.state.sourceCatalog = self._rev_eng_module.reverseEngineer(self.connection, self.selectedCatalogName, self.selectedSchemataNames, self.state.applicationData)
SystemError: MemoryError(""): error calling Python module function DbMssqlRE.reverseEngineer
ERROR: Reverse engineer selected schemata: MemoryError(""): error calling Python module function DbMssqlRE.reverseEngineer
Failed
I initially thought this was a memory error, so I've upped the memory on the box to 16 GiB. The error also occurs with databases of any size; I've tried very small ones with hardly any tables.
Any thoughts? Thanks for looking
Just in case anyone else runs into this: I had the same problem and fixed it by getting rid of non-ASCII characters in schemas, tables... basically all MSSQL objects. This was complicated by the fact that I had SQL# (www.sqlsharp.com) installed, which adds a number of functions and stored procs under a schema called SQL#. You can remove it with this command:
EXEC SQL#.SQLsharp_Uninstall
Once you get rid of non-ASCII chars, the migration works.
The OP (I assume) closed their bug report with this message:
[...] I figured out a work around, or the flaw in the system perhaps. Turns out that Null values were not allowed inside the Datetime fields when doing a migration. I turned every Datetime field in my database to a default value and migrated it successfully after that.
I'm trying to batch import millions of nodes through Py2Neo.
I don't know what's faster, the BatchWrite or the cypher.Transaction, but the latter seemed the best option as I need to split my batches.
However, when I try to execute a simple transaction, I receive a weird error.
The python code:
from py2neo import cypher  # py2neo 1.6.1

session = cypher.Session("http://127.0.0.1:7474/db/data/")  # error also w/o /db/data/

def init():
    tx = session.create_transaction()
    for ngram, one_grams in data.items():  # data, n, ngram_rank, ngram_prob are defined elsewhere
        tx.append("CREATE "+str(n)+":WORD {'word': "+ngram+", 'rank': "+str(ngram_rank)+", 'prob': "+str(ngram_prob)+", 'gram': '0gram'}")
    tx.execute()  # line 69 in the error below
The error:
Traceback (most recent call last):
File "Ngram_neo4j.py", line 176, in <module>
init(rNgram_file="dataset_id.json")
File "Ngram_neo4j.py", line 43, in init
data = probability_items(data)
File "Ngram_neo4j.py", line 69, in probability_items
tx.execute()
File "D:\datasets\GOOGLE~1\virtenv\lib\site-packages\py2neo\cypher.py", line 224, in execute
return self._post(self._execute or self._begin)
File "D:\datasets\GOOGLE~1\virtenv\lib\site-packages\py2neo\cypher.py", line 209, in _post
raise TransactionError(error["code"], error["status"], error["message"])
KeyError: 'status'
I tried catching the exception:
except cypher.TransactionError as e:
    print("--------------------------------------------------------------------------------------------")
    print(e.status)
    print(e.message)
But it never gets called (maybe an error on my part?).
Regular inserts using graph_db.create({"node": node}) do work, but they are incredibly slow (36 hrs for 2.5M nodes).
Note that the dataset consists of a series of JSON files, each with a structure nested 5 levels deep.
I'd like to batch the last 2 levels (around 100 to 20,000 nodes per batch).
--- EDIT ---
I'm using Py2Neo 1.6.1, Neo4j 2.0.0. Currently on Windows 7 (but also OSX Mav., CentOS 6)
The problem you're seeing is due to a last minute alteration in the way that Cypher transaction errors are reported by the Neo4j server. Py2neo 1.6 was built against M05/M06 and when a few features changed in RC1/GA, Py2neo broke in a few places.
This has been fixed for Py2neo 1.6.2 (https://github.com/nigelsmall/py2neo/issues/224) but I do not yet know when I will get a chance to finish and release this version.
What neo4j and py2neo versions are you using?
You should use parameters for your create statements (see the sketch at the end of this answer).
Can you check the server logs in data/logs and data/graph.db/messages.log for errors?
If you have so much data to insert then perhaps direct batch-insertion would make more sense?
See: http://neo4j.org/develop/import
Two tools I wrote for this:
https://github.com/jexp/neo4j-shell-tools/tree/20
https://github.com/jexp/batch-import/tree/20#binary-download
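To illustrate the parameter suggestion above, a minimal sketch against the py2neo 1.6-style Cypher session from the question (data, ngram_rank and ngram_prob are assumed to be defined as in the question's snippet):

from py2neo import cypher

session = cypher.Session("http://127.0.0.1:7474")
tx = session.create_transaction()
for ngram, one_grams in data.items():
    # Pass values as parameters instead of concatenating them into the statement;
    # property keys are plain identifiers, not quoted strings.
    tx.append("CREATE (n:WORD {word: {word}, rank: {rank}, prob: {prob}, gram: '0gram'})",
              {"word": ngram, "rank": ngram_rank, "prob": ngram_prob})
tx.execute()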