How to Set spark.sql.parquet.output.committer.class in pyspark - python

I'm trying to set spark.sql.parquet.output.committer.class and nothing I do seems to get the setting to take effect.
I'm trying to have many threads write to the same output folder, which would work with org.apache.spark.sql.
parquet.DirectParquetOutputCommitter since it wouldn't use the _temporary folder. I'm getting the following error, which is how I know it's not working:
Caused by: java.io.FileNotFoundException: File hdfs://path/to/stuff/_temporary/0/task_201606281757_0048_m_000029/some_dir does not exist.
at org.apache.hadoop.hdfs.DistributedFileSystem.listStatusInternal(DistributedFileSystem.java:795)
at org.apache.hadoop.hdfs.DistributedFileSystem.access$700(DistributedFileSystem.java:106)
at org.apache.hadoop.hdfs.DistributedFileSystem$18.doCall(DistributedFileSystem.java:853)
at org.apache.hadoop.hdfs.DistributedFileSystem$18.doCall(DistributedFileSystem.java:849)
at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
at org.apache.hadoop.hdfs.DistributedFileSystem.listStatus(DistributedFileSystem.java:849)
at org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.mergePaths(FileOutputCommitter.java:382)
at org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.mergePaths(FileOutputCommitter.java:384)
at org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter.commitJob(FileOutputCommitter.java:326)
at org.apache.parquet.hadoop.ParquetOutputCommitter.commitJob(ParquetOutputCommitter.java:46)
at org.apache.spark.sql.execution.datasources.BaseWriterContainer.commitJob(WriterContainer.scala:230)
at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1.apply$mcV$sp(InsertIntoHadoopFsRelation.scala:151)
Note the call to org.apache.parquet.hadoop.ParquetOutputCommitter.commitJob, the default class.
I've tried the following, based on other SO answers and searches:
sc._jsc.hadoopConfiguration().set(key, val) (this does work for settings like parquet.enable.summary-metadata)
dataframe.write.option(key, val).parquet
Adding --conf "spark.hadoop.spark.sql.parquet.output.committer.class=org.apache.spark.sql.parquet.DirectParquetOutputCommitter" to the spark-submit call
Adding --conf "spark.sql.parquet.output.committer.class"=" org.apache.spark.sql.parquet.DirectParquetOutputCommitter" to the spark-submit call.
That's all I've been able to find, and nothing works. It looks like it's not hard to set in Scala but appears impossible in Python.

The approach in this comment definitively worked for me:
16/06/28 18:49:59 INFO ParquetRelation: Using user defined output committer for Parquet: org.apache.spark.sql.execution.datasources.parquet.DirectParquetOutputCommitter
It was a lost log message in the flood that Spark gives, and the error I was seeing was unrelated. It's all moot anyway, since the DirectParquetOutputCommitter has been removed from Spark.

Related

MWAA EKSPodOperator error when assuming role

we have the following basic EKS Operator on MWAA (Airflow version 2.2.2)
start_pod = EKSPodOperator(
aws_conn_id="eks-connection",
task_id='start_pod',
namespace="airflow",
cluster_name="eks-data-stg",
in_cluster=False,
service_account_name="airflow-sa",
image='amazon/aws-cli:latest',
cmds=['sh', '-c', 'echo Test Airflow; date'],
labels={'demo': 'hello_world'},
get_logs=True,
# Delete the pod when it reaches its final state, or the execution is interrupted.
is_delete_operator_pod=True,
)
This fails with the following error:
airflow.exceptions.AirflowConfigException: `[logging] logging_level` should not be 'fatal'. Possible values: CRITICAL, FATAL, ERROR, WARN, WARNING, INFO, DEBUG.
This we traced back to the following issue that is fixed in a older version of the Operator(https://github.com/apache/airflow/issues/21421).
Unfortunately we are not able to override theairflow-providers-amazon.
Has anyone found a way around this bug by either overriding the dependency or fixing the operator?

Pyinstaller with APScheduler - TypeError with IntervalTrigger

I had the same problem as here (see link below), brielfy: unable to create .exe of a python script that uses APScheduler
Pyinstaller 3.3.1 & 3.4.0-dev build with apscheduler
So I did as suggested:
from apscheduler.triggers import interval
scheduler.add_job(Run, 'interval', interval.IntervalTrigger(minutes = time_int),
args = (input_file, output_dir, time_int),
id = theID, replace_existing=True)
And indeed importing interval.IntervalTrigger and passing it as an argument to add_job solved this particular error.
However, now I am encountring:
TypeError: add_job() got multiple values for argument 'args'
I tested it and I can ascertain it is occurring because of the way trigger is called now. I also tried defining trigger = interval.IntervalTrigger(minutes = time_int) separately and then just passing trigger, and the same happens.
If I ignore the error with try/except, I see that it does not add the job to the sql database at all (I am using SQLAlchemy as a jobstore). Initially I thought it is because I am adding several jobs in a for loop, but it happens with a single job add as well.
Anyone know of some other workaround if the initial problem, or any idea why this error might occur? I can't find anything online either :(
Things always work better in the morning.
For anyone else who encounters this: you don't need both 'interval' and interval.IntervalTrigger() as arguments, the code should be, this is where the error comes from.
scheduler.add_job(Run, interval.IntervalTrigger(minutes = time_int),
args = (input_file, output_dir, time_int),
id = theID, replace_existing=True)

bingads V13 report request fails in python sdk

I try to download a bingads report using python SDK, but I keep getting an error says: "Type not found: 'Aggregation'" after submitting a report request. I've tried all 4 options mentioned in the following link:
https://github.com/BingAds/BingAds-Python-SDK/blob/master/examples/v13/report_requests.py
Authentication process prior to request works just fine.
I execute the following:
report_request = get_report_request(authorization_data.account_id)
reporting_download_parameters = ReportingDownloadParameters(
report_request=report_request,
result_file_directory=FILE_DIRECTORY,
result_file_name=RESULT_FILE_NAME,
overwrite_result_file=True, # Set this value true if you want to overwrite the same file.
timeout_in_milliseconds=TIMEOUT_IN_MILLISECONDS
)
output_status_message("-----\nAwaiting download_report...")
download_report(reporting_download_parameters)
after a careful debugging, it seems that the program fails when trying to execute a command within "reporting_service_manager.py". Here is workflow:
download_report(self, download_parameters):
report_file_path = self.download_file(download_parameters)
then:
download_file(self, download_parameters):
operation = self.submit_download(download_parameters.report_request)
then:
submit_download(self, report_request):
self.normalize_request(report_request)
response = self.service_client.SubmitGenerateReport(report_request)
SubmitGenerateReport starts a sequence of events ending with a call to "_SeviceCall.init" function within "service_client.py", returning an exception "Type not found: 'Aggregation'"
try:
response = self.service_client.soap_client.service.__getattr__(self.name)(*args, **kwargs)
return response
except Exception as ex:
if need_to_refresh_token is False \
and self.service_client.refresh_oauth_tokens_automatically \
and self.service_client._is_expired_token_exception(ex):
need_to_refresh_token = True
else:
raise ex
Can anyone shed some light? .
Thanks
Please be sure to set Aggregation e.g., as shown here.
aggregation = 'Daily'
If the report type does not use aggregation, you can set Aggregation=None.
Does this help?
This may be a bit late 2 months after the fact but maybe this will help someone else. I had the same error (though I suppose it may not be the same issue). It does look like you did what I did (and I'm sure others will as well): copy-paste the Microsoft example code and tried to run it only to find that it didn't work.
I spent quite some time trying to debug the issue and it looked to me like the XML wasn't being searched correctly. I was using suds-py3 for the script at the time so I tried suds-community and everything just worked after that.
I also re-read the Bing Ads API walkthrough for getting started again and found that they recommend suds-jurko instead.
Long story short: If you want to use the bingads API don't use suds-py3, use either suds-community (which I can confirm works for everything I've used the API for) or suds-jurko (which is the one recommended by Microsoft).

Why does Spark-XML on AWS Glue fail with AbstractMethodError?

I have an AWS Glue job written in Python that pulls in the spark-xml library (through the Dependent jars path). I'm using spark-xml_2.11-0.2.0.jar. When I try to output my DataFrame to XML I get an error. The code I'm using is:
applymapping1.toDF().repartition(1).write.format("com.databricks.xml").save("s3://glue.xml.output/Test.xml");
The error I get is:
"/mnt/yarn/usercache/root/appcache/application_1517883778506_0016/container_1517883778506_0016_02_000001/pyspark.zip/pyspark/sql/readwriter.py",
line 550, in save File
"/mnt/yarn/usercache/root/appcache/application_1517883778506_0016/container_1517883778506_0016_02_000001/py4j-0.10.4-src.zip/py4j/java_gateway.py",
line 1133, in call File
"/mnt/yarn/usercache/root/appcache/application_1517883778506_0016/container_1517883778506_0016_02_000001/pyspark.zip/pyspark/sql/utils.py",
line 63, in deco File
"/mnt/yarn/usercache/root/appcache/application_1517883778506_0016/container_1517883778506_0016_02_000001/py4j-0.10.4-src.zip/py4j/protocol.py",
line 319, in get_return_value py4j.protocol.Py4JJavaError: An error
occurred while calling o75.save. : java.lang.AbstractMethodError:
com.databricks.spark.xml.DefaultSource15.createRelation(Lorg/apache/spark/sql/SQLContext;Lorg/apache/spark/sql/SaveMode;Lscala/collection/immutable/Map;Lorg/apache/spark/sql/Dataset;)Lorg/apache/spark/sql/sources/BaseRelation;
at
org.apache.spark.sql.execution.datasources.DataSource.write(DataSource.scala:426)
at
org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:215)
at
If I change it to CSV, it works fine:
applymapping1.toDF().repartition(1).write.format("com.databricks.csv").save("s3://glue.xml.output/Test.xml");
Note: When using CSV I don't have to import spark-xml. I think spark-csv is included in AWS Glue's Spark environment.
Any suggestions to what to try?
I've tried various versions of spark-xml:
spark-xml_2.11-0.2.0
spark-xml_2.11-0.3.1
spark-xml_2.10-0.2.0
That question is very similar to (but not an exact duplicate of) Why does elasticsearch-spark 5.5.0 give AbstractMethodError when submitting to YARN cluster? that also deals with AbstractMethodError.
Quoting the javadoc of java.lang.AbstractMethodError:
Thrown when an application tries to call an abstract method. Normally, this error is caught by the compiler; this error can only occur at run time if the definition of some class has incompatibly changed since the currently executing method was last compiled.
That pretty much explains what you experience (note the part that starts with "this error can only occur at run time").
I think it's a Spark version mismatch in play here.
Given com.databricks.spark.xml.DefaultSource15 in the stack trace and the change that does the following:
Remove the separated DefaultSource15 due to compatibility in Spark 1.5+
This removes DefaultSource15 and merge it into DefaultSource. This was separated for compatibility in Spark 1.5+ . In master and spark-xml 0.4.x, it dropped 1.x support.
You should make sure that the version of Spark in AWS Glue's Spark environment matches the spark-xml. The latest version of spark-xml 0.4.1 was released on 6 Nov 2016.

Ansible ERROR! no action detected in task

I am trying to run a playbook https://github.com/Datanexus/dn-cassandra
With the different deployment scenarios listed out there, I am going for multinode cassandra setup described here: deployment scenarios.
I have setup a static inventory file.
cassandra-seed-01 ansible_ssh_host=192.168.0.17 ansible_ssh_port=22 ansible_ssh_user='root' ansible_ssh_private_key_file='keys/id_rsa'
cassandra-seed-02 ansible_ssh_host=192.168.0.18 ansible_ssh_port=22 ansible_ssh_user='root' ansible_ssh_private_key_file='keys/id_rsa'
cassandra-non-seed-01 ansible_ssh_host=192.168.0.22 ansible_ssh_port=22 ansible_ssh_user='root' ansible_ssh_private_key_file='keys/id_rsa'
[cassandra_seed]
192.168.0.17
192.168.0.18
[cassandra]
192.168.0.22
However when I try running the playbook it throws the following error:
ERROR! no action detected in task
The error appears to have been in
'/home/laumair/workspace/dn-cassandra/provision-cassandra.yml': line
21, column 9, but may be elsewhere in the file depending on the exact
syntax problem.
The offending line appears to be:
# then, build the seed and non-seed host groups
- include_role:
^ here
I would appreciate any sort of direction with this error as I have tried out solutions for similar errors but no luck so far.
include_role is available since Ansible 2.2.
Please upgrade your Ansible installation.

Categories

Resources