Why am I not able to write a Spark DataFrame to a parquet file? - python

I'm trying to write a Spark DataFrame to a parquet file, but the write fails. I get the same error even when I try CSV.
df is my DataFrame:
CUST_ID
---------------
00000082MM778Q49X
00000372QM8890MX7
00000424M09X729MQ
0000062Q028M05MX
My DataFrame looks like the sample above; here is the code:
df_parquet = tempDir + "/df.parquet"  # output file path

customerQuery = f"""SELECT DISTINCT(m.customer_ID)
    FROM ada_customer m
    INNER JOIN customer_nol mr ON m.customer_ID = mr.customer_ID
    WHERE mr.MODEL <> 'X' AND m.STATUS = 'Process'
      AND m.YEAR = {year} AND mr.YEAR = {year}"""

customer_df = sqlContext.read.format("jdbc").options(
    url="jdbc:mysql://localhost:3306/dbkl",
    driver="com.mysql.jdbc.Driver",
    query=customerQuery, user="root", password="root").load()

# The read above works; only the write to a file fails
customer_df.write.mode("overwrite").parquet(df_parquet)
I'm getting the error below and don't know exactly what's wrong. Can someone help with this?
Traceback (most recent call last):
File "F:/SparkBook/HG.py", line 135, in <module>
customer_xdf.write.mode("overwrite").parquet(customer_parquet)
File "C:\spark3\python\lib\pyspark.zip\pyspark\sql\readwriter.py", line 1372, in csv
File "C:\spark3\python\lib\py4j-0.10.9-src.zip\py4j\java_gateway.py", line 1305, in __call__
File "C:\spark3\python\lib\pyspark.zip\pyspark\sql\utils.py", line 111, in deco
File "C:\spark3\python\lib\py4j-0.10.9-src.zip\py4j\protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o81.csv.
: org.apache.spark.SparkException: Job aborted.
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:231)
at java.base/java.lang.Thread.run(Thread.java:834)
Caused by: java.lang.UnsatisfiedLinkError: org.apache.hadoop.io.nativeio.NativeIO$Windows.access0(Ljava/lang/String;I)Z
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:220)
... 33 more
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "F:/SparkBook/HG.py", line 148, in <module>
logger.error(e)
File "F:\SparkBook\lib\logger.py", line 16, in error
self.logger.error(message)
File "C:\spark3\python\lib\py4j-0.10.9-src.zip\py4j\java_gateway.py", line 1296, in __call__
File "C:\spark3\python\lib\py4j-0.10.9-src.zip\py4j\java_gateway.py", line 1266, in _build_args
File "C:\spark3\python\lib\py4j-0.10.9-src.zip\py4j\java_gateway.py", line 1266, in <listcomp>
File "C:\spark3\python\lib\py4j-0.10.9-src.zip\py4j\protocol.py", line 298, in get_command_part
AttributeError: 'Py4JJavaError' object has no attribute '_get_object_id'
Process finished with exit code 1
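For what it's worth, the java.lang.UnsatisfiedLinkError on org.apache.hadoop.io.nativeio.NativeIO$Windows.access0 usually means Spark on Windows cannot find the Hadoop native binaries (winutils.exe and hadoop.dll), which are needed for local file writes. A minimal sketch of the usual setup, assuming the binaries live under a hypothetical C:\hadoop\bin and match your Hadoop build:

import os

# Hypothetical path: point HADOOP_HOME at a folder whose bin\ contains
# winutils.exe and hadoop.dll matching the Hadoop version your Spark was built for.
# These must be set before the SparkSession / SQLContext is created.
os.environ["HADOOP_HOME"] = r"C:\hadoop"
os.environ["PATH"] = r"C:\hadoop\bin;" + os.environ["PATH"]

Some setups also need hadoop.dll on the system library path (for example C:\Windows\System32); verify the files exist before re-running the write.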

Related

Stuck connecting to MongoDB

I was about to start using MongoDB as the database for my project. I always get stuck when I want to push some data to it. I use MongoDB Atlas. The code below always results in errors. I've been following and learning about MongoDB online (YouTube, w3, etc.).
import pymongo
from pymongo import MongoClient

# Note: in a standard Atlas SRV URI the credentials and host are separated by "@"
url = "mongodb+srv://<username>:<password>@cluster0.oecqh.mongodb.net/test?retryWrites=true&w=majority"
cluster = MongoClient(url)
db = cluster["discord"]
collection = db["discord"]
push = {"name": "test", "value": "test"}
try:
    collection.insert_one(push)
except Exception as e:
    print(e)
The errors
Exception ignored in: <function Client.__del__ at 0x00000213C5D84940>
Traceback (most recent call last):
File "C:\Users\chris\AppData\Local\Programs\Python\Python39\lib\site-packages\httpx\_client.py", line 1139, in __del__
self.close()
File "C:\Users\chris\AppData\Local\Programs\Python\Python39\lib\site-packages\httpx\_client.py", line 1111, in close
self._transport.close()
AttributeError: 'Client' object has no attribute '_transport'
Exception in thread Thread-1:
Traceback (most recent call last):
File "C:\Users\chris\AppData\Local\Programs\Python\Python39\lib\threading.py", line 973, in _bootstrap_inner
self.run()
File "C:\Users\chris\AppData\Local\Programs\Python\Python39\lib\site-packages\dns\win32util.py", line 48, in run
self.info.domain = _config_domain(interface.DNSDomain)
File "C:\Users\chris\AppData\Local\Programs\Python\Python39\lib\site-packages\dns\win32util.py", line 26, in _config_domain
if domain.startswith('.'):
AttributeError: 'NoneType' object has no attribute 'startswith'
Traceback (most recent call last):
File "C:\Users\chris\AppData\Local\Programs\Python\Python39\lib\site-packages\pymongo\srv_resolver.py", line 89, in _resolve_uri
results = _resolve(
File "C:\Users\chris\AppData\Local\Programs\Python\Python39\lib\site-packages\pymongo\srv_resolver.py", line 43, in _resolve
return resolver.resolve(*args, **kwargs)
File "C:\Users\chris\AppData\Local\Programs\Python\Python39\lib\site-packages\dns\resolver.py", line 1193, in resolve
return get_default_resolver().resolve(qname, rdtype, rdclass, tcp, source,
File "C:\Users\chris\AppData\Local\Programs\Python\Python39\lib\site-packages\dns\resolver.py", line 1063, in resolve
(nameserver, port, tcp, backoff) = resolution.next_nameserver()
File "C:\Users\chris\AppData\Local\Programs\Python\Python39\lib\site-packages\dns\resolver.py", line 646, in next_nameserver
raise NoNameservers(request=self.request, errors=self.errors)
dns.resolver.NoNameservers: All nameservers failed to answer the query _mongodb._tcp.cluster0.oecqh.mongodb.net. IN SRV:
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "c:\Users\chris\OneDrive\Documents\Programming\Python\tempCodeRunnerFile.py", line 5, in <module>
cluster = MongoClient(url)
File "C:\Users\chris\AppData\Local\Programs\Python\Python39\lib\site-packages\pymongo\mongo_client.py", line 704, in __init__
res = uri_parser.parse_uri(
File "C:\Users\chris\AppData\Local\Programs\Python\Python39\lib\site-packages\pymongo\uri_parser.py", line 542, in parse_uri
nodes = dns_resolver.get_hosts()
File "C:\Users\chris\AppData\Local\Programs\Python\Python39\lib\site-packages\pymongo\srv_resolver.py", line 121, in get_hosts
_, nodes = self._get_srv_response_and_hosts(True)
File "C:\Users\chris\AppData\Local\Programs\Python\Python39\lib\site-packages\pymongo\srv_resolver.py", line 101, in _get_srv_response_and_hosts
results = self._resolve_uri(encapsulate_errors)
File "C:\Users\chris\AppData\Local\Programs\Python\Python39\lib\site-packages\pymongo\srv_resolver.py", line 97, in _resolve_uri
raise ConfigurationError(str(exc))
pymongo.errors.ConfigurationError: All nameservers failed to answer the query _mongodb._tcp.cluster0.oecqh.mongodb.net. IN SRV:
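The failing call is the SRV lookup for the mongodb+srv hostname, so this is a DNS resolution problem rather than a MongoDB one. One common workaround (a sketch, not guaranteed to apply to every network) is to point dnspython, which pymongo uses for the SRV query as the traceback shows, at a public DNS server before creating the client:

import dns.resolver

# Force dnspython (used by pymongo for mongodb+srv URIs) to query a public
# resolver when the system DNS cannot answer SRV queries.
dns.resolver.default_resolver = dns.resolver.Resolver(configure=False)
dns.resolver.default_resolver.nameservers = ["8.8.8.8"]

Alternatively, the standard (non-SRV) connection string that Atlas also provides avoids the SRV lookup entirely.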

Python connection with Hive database in HDInsight

I am trying to create a connection to Hive, hosted in an HDInsight cluster, from my Python script and am getting the error below:
Traceback (most recent call last):
File "ClassLoader.java", line 357, in java.lang.ClassLoader.loadClass
File "Launcher.java", line 349, in sun.misc.Launcher$AppClassLoader.loadClass
File "ClassLoader.java", line 424, in java.lang.ClassLoader.loadClass
File "URLClassLoader.java", line 382, in java.net.URLClassLoader.findClass
java.lang.ClassNotFoundException: java.lang.ClassNotFoundException: org.apache.thrift.transport.TTransportException
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "org.jpype.JPypeContext.java", line 330, in org.jpype.JPypeContext.callMethod
File "Method.java", line 498, in java.lang.reflect.Method.invoke
File "DelegatingMethodAccessorImpl.java", line 43, in sun.reflect.DelegatingMethodAccessorImpl.invoke
File "NativeMethodAccessorImpl.java", line 62, in sun.reflect.NativeMethodAccessorImpl.invoke
File "NativeMethodAccessorImpl.java", line -2, in sun.reflect.NativeMethodAccessorImpl.invoke0
File "DriverManager.java", line 247, in java.sql.DriverManager.getConnection
File "DriverManager.java", line 664, in java.sql.DriverManager.getConnection
File "HiveDriver.java", line 105, in org.apache.hive.jdbc.HiveDriver.connect
Exception: Java Exception
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "s.py", line 5, in <module>
"/root/jdbc/hive-jdbc-1.2.1000.2.6.5.3009-43.jar")
File "/usr/local/lib64/python3.6/site-packages/jaydebeapi/__init__.py", line 412, in connect
jconn = _jdbc_connect(jclassname, url, driver_args, jars, libs)
File "/usr/local/lib64/python3.6/site-packages/jaydebeapi/__init__.py", line 230, in _jdbc_connect_jpype
return jpype.java.sql.DriverManager.getConnection(url, *dargs)
java.lang.NoClassDefFoundError: java.lang.NoClassDefFoundError: org/apache/thrift/transport/TTransportException
My script is:
import jaydebeapi

conn = jaydebeapi.connect("org.apache.hive.jdbc.HiveDriver",
                          "jdbc:hive2://10.20.30.40:10001/default;transportMode=http;ssl=false;httpPath=/hive2",
                          ["username", "password"],
                          "/root/jdbc/hive-jdbc-1.2.1000.2.6.5.3009-43.jar")
I have exported CLASSPATH with all the jar files.
The error java.lang.ClassNotFoundException (here for org.apache.thrift.transport.TTransportException) indicates that the execution cannot find the classes it needs from the jar /root/jdbc/hive-jdbc-1.2.1000.2.6.5.3009-43.jar. I believe the jar is only placed on the host from which you are executing the code. I would suggest placing the jar file under the same directory structure on all the nodes in the cluster, and checking permissions so that the user executing the job has access to that path.
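Since the class that cannot be found is the Thrift transport class, it may also help to pass the Hive client dependency jars explicitly; jaydebeapi.connect accepts a list of jar paths. A sketch, where the extra jar locations are assumptions and must match what is actually present on your host:

import jaydebeapi

# The dependency jar paths below are hypothetical; adjust them to the jars you have.
jars = [
    "/root/jdbc/hive-jdbc-1.2.1000.2.6.5.3009-43.jar",
    "/root/jdbc/libthrift-0.9.3.jar",
    "/root/jdbc/hive-service-1.2.1000.2.6.5.3009-43.jar",
]

conn = jaydebeapi.connect(
    "org.apache.hive.jdbc.HiveDriver",
    "jdbc:hive2://10.20.30.40:10001/default;transportMode=http;ssl=false;httpPath=/hive2",
    ["username", "password"],
    jars)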

PermissionError while using pandas .to_csv function

I have this code to read a .txt file from S3 and convert it to .csv using pandas:
file = pd.read_csv(f's3://{bucket_name}/{bucket_key}', sep=':', error_bad_lines=False)
file.to_csv(f's3://{bucket_name}/file_name.csv')
I have granted read/write permissions to the IAM role, but this error still comes from the .to_csv call:
Anonymous access is forbidden for this operation: PermissionError
Update: the full error in the EC2 logs is:
Traceback (most recent call last):
File "/usr/local/lib/python3.6/dist-packages/s3fs/core.py", line 446, in _mkdir
await self.s3.create_bucket(**params)
File "/usr/local/lib/python3.6/dist-packages/aiobotocore/client.py", line 134, in _make_api_call
raise error_class(parsed_response, operation_name)
botocore.exceptions.ClientError: An error occurred (AccessDenied) when calling the CreateBucket operation: Anonymous access is forbidden for this operation
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "convert_file_instance.py", line 92, in <module>
main()
File "convert_file_instance.py", line 36, in main
raise e
File "convert_file_instance.py", line 30, in main
file.to_csv(f's3://{bucket_name}/file_name.csv')
File "/usr/local/lib/python3.6/dist-packages/pandas/core/generic.py", line 3165, in to_csv
decimal=decimal,
File "/usr/local/lib/python3.6/dist-packages/pandas/io/formats/csvs.py", line 67, in __init__
path_or_buf, encoding=encoding, compression=compression, mode=mode
File "/usr/local/lib/python3.6/dist-packages/pandas/io/common.py", line 233, in get_filepath_or_buffer
filepath_or_buffer, mode=mode or "rb", **(storage_options or {})
File "/usr/local/lib/python3.6/dist-packages/fsspec/core.py", line 399, in open
**kwargs
File "/usr/local/lib/python3.6/dist-packages/fsspec/core.py", line 254, in open_files
[fs.makedirs(parent, exist_ok=True) for parent in parents]
File "/usr/local/lib/python3.6/dist-packages/fsspec/core.py", line 254, in <listcomp>
[fs.makedirs(parent, exist_ok=True) for parent in parents]
File "/usr/local/lib/python3.6/dist-packages/s3fs/core.py", line 460, in makedirs
self.mkdir(path, create_parents=True)
File "/usr/local/lib/python3.6/dist-packages/fsspec/asyn.py", line 100, in wrapper
return maybe_sync(func, self, *args, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/fsspec/asyn.py", line 80, in maybe_sync
return sync(loop, func, *args, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/fsspec/asyn.py", line 51, in sync
raise exc.with_traceback(tb)
File "/usr/local/lib/python3.6/dist-packages/fsspec/asyn.py", line 35, in f
result[0] = await future
File "/usr/local/lib/python3.6/dist-packages/s3fs/core.py", line 450, in _mkdir
raise translate_boto_error(e) from e
PermissionError: Anonymous access is forbidden for this operation
I don't know why it is trying to create a bucket. I have also given the Lambda role full access to S3.
Can someone please tell me what I'm missing here?
Thank you.
I think this is due to an incompatibility between the pandas, boto3 and s3fs libraries.
Try this setup:
pandas==1.0.3
boto3==1.13.11
s3fs==0.4.2
I also tried pandas 1.1.1 and it works too (I work with Python 3.7).
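For reference, pinning those versions with pip (a sketch, assuming a fresh environment) looks like:

pip install pandas==1.0.3 boto3==1.13.11 s3fs==0.4.2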

tensorflow_hub throwing 'SentencepieceOp' error when loading the model link

I am trying to run the following line of code in PyCharm; I have tensorflow_hub installed and imported.
use = hub.load("https://tfhub.dev/google/universal-sentence-encoder-multilingual-large/3")
Any suggestions for the error below? I need this for my project.
Traceback (most recent call last):
File "C:\Users\Jon10\miniconda3\envs\tensorflow\lib\site-packages\tensorflow_core\python\framework\ops.py", line 3820, in _get_op_def
return self._op_def_cache[type]
KeyError: 'SentencepieceOp'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "C:/Users/Jon10/OneDrive/Documents/Computer Science/Dissertation/PythonPractice/TFTest/test.py", line 28, in <module>
use = hub.load("https://tfhub.dev/google/universal-sentence-encoder-multilingual-large/3")
File "C:\Users\Jon10\miniconda3\envs\tensorflow\lib\site-packages\tensorflow_hub\module_v2.py", line 102, in load
obj = tf_v1.saved_model.load_v2(module_path, tags=tags)
File "C:\Users\Jon10\miniconda3\envs\tensorflow\lib\site-packages\tensorflow_core\python\saved_model\load.py", line 517, in load
return load_internal(export_dir, tags)
File "C:\Users\Jon10\miniconda3\envs\tensorflow\lib\site-packages\tensorflow_core\python\saved_model\load.py", line 541, in load_internal
export_dir)
File "C:\Users\Jon10\miniconda3\envs\tensorflow\lib\site-packages\tensorflow_core\python\saved_model\load.py", line 114, in __init__
meta_graph.graph_def.library))
File "C:\Users\Jon10\miniconda3\envs\tensorflow\lib\site-packages\tensorflow_core\python\saved_model\function_deserialization.py", line 312, in load_function_def_library
copy, copy_functions=False)
File "C:\Users\Jon10\miniconda3\envs\tensorflow\lib\site-packages\tensorflow_core\python\framework\function_def_to_graph.py", line 61, in function_def_to_graph
fdef, input_shapes, copy_functions)
File "C:\Users\Jon10\miniconda3\envs\tensorflow\lib\site-packages\tensorflow_core\python\framework\function_def_to_graph.py", line 214, in function_def_to_graph_def
op_def = ops.get_default_graph()._get_op_def(node_def.op) # pylint: disable=protected-access
File "C:\Users\Jon10\miniconda3\envs\tensorflow\lib\site-packages\tensorflow_core\python\framework\ops.py", line 3824, in _get_op_def
c_api.TF_GraphGetOpDef(self._c_graph, compat.as_bytes(type), buf)
tensorflow.python.framework.errors_impl.NotFoundError: Op type not registered 'SentencepieceOp' in binary running on DESKTOP-..... Make sure the Op and Kernel are registered in the binary running in this process. Note that if you are loading a saved graph which used ops from tf.contrib, accessing (e.g.) `tf.contrib.resampler` should be done before importing the graph, as contrib ops are lazily registered when the module is first accessed.
You need to install tensorflow_text and import it before calling hub.load.
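A minimal sketch of what that looks like, assuming the package is installed from PyPI as tensorflow-text (the import itself registers the SentencePiece ops the saved model needs):

# pip install tensorflow-text
import tensorflow_hub as hub
import tensorflow_text  # noqa: F401  (registers the SentencePiece ops with TensorFlow)

use = hub.load("https://tfhub.dev/google/universal-sentence-encoder-multilingual-large/3")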

Pandas to GBQ method returning GenericGBQException: Reason 404 POST

This error just started to pop up in our pipelines.
I'm moving a DataFrame of about 1.5 million rows using the pandas to_gbq method.
Any help would be greatly appreciated!
Code:
output.to_gbq('table_name',
              'project-id',
              chunksize=50000,
              private_key='ga_auth.json',
              if_exists='replace')
Error:
Traceback (most recent call last):
File ".\rfm_bigquery.py", line 175, in <module>
send_rfm_to_gbq()
File ".\rfm_bigquery.py", line 152, in send_rfm_to_gbq
if_exists='replace',
File "C:\Users\yyu\Desktop\env\rfm_bigquery\lib\site-packages\pandas\core\frame.py", line 1187, in to_gbq
table_schema=table_schema)
File "C:\Users\yyu\Desktop\env\rfm_bigquery\lib\site-packages\pandas\io\gbq.py", line 119, in to_gbq
table_schema=table_schema)
File "C:\Users\yyu\Desktop\env\rfm_bigquery\lib\site-packages\pandas_gbq\gbq.py", line 1036, in to_gbq
progress_bar=progress_bar,
File "C:\Users\yyu\Desktop\env\rfm_bigquery\lib\site-packages\pandas_gbq\gbq.py", line 513, in load_data
self.process_http_error(ex)
File "C:\Users\yyu\Desktop\env\rfm_bigquery\lib\site-packages\pandas_gbq\gbq.py", line 376, in process_http_error
raise GenericGBQException("Reason: {0}".format(ex))
pandas_gbq.gbq.GenericGBQException: Reason: 404 POST https://www.googleapis.com/upload/bigquery/v2/projects/hidden-moon-164616/jobs?uploadType=resumable: Not Found
I had the same problem with the latest versions. I moved back to pandas==0.23.3 and pandas-gbq==0.5.0 and it is finally working.
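If you want to pin those versions explicitly (assuming pip), something like:

pip install pandas==0.23.3 pandas-gbq==0.5.0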
