I am working on Windows 10. I installed Spark, and the goal is to use PySpark. I have taken the following steps:
I installed Python 3.7 with Anaconda -- Python was added to C:\Python37
I downloaded winutils from this link -- winutils was added to C:\winutils\bin
I downloaded Spark -- Spark was extracted to: C:\spark-3.0.0-preview2-bin-hadoop2.7
I downloaded Java 8 from AdoptOpenJDK
Under system variables, I set the following variables:
HADOOP_HOME : C:\winutils
SPARK_HOME: C:\spark-3.0.0-preview2-bin-hadoop2.7
JAVA_HOME: C:\PROGRA~1\AdoptOpenJDK\jdk-8.0.242.08-hotspot
And finally, under system path, I added:
%JAVA_HOME%\bin
%SPARK_HOME%\bin
%HADOOP_HOME%\bin
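As a minimal check (assuming you run it from a new command prompt after setting the variables), the following snippet prints the variables as a fresh Python process sees them:
import os

for var in ("HADOOP_HOME", "SPARK_HOME", "JAVA_HOME"):
    print(var, "=", os.environ.get(var))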
When I start pyspark in the terminal, I would like to know why I am getting this warning:
unable to load native-hadoop library... and why it could not bind on port 4040...
Finally, inside Jupyter Notebook, I am getting the following error when trying to write to a Parquet file. This image shows a working example, and the following one shows the code with the error:
And here is DataMaster__3.csv on my disk:
And the DaterMaster_par2222.parquet:
Any help is much appreciated!!
If you are writing the file in CSV format, I have found that the best way to do that is the following approach:
LCL_POS.toPandas().to_csv(<path>)
There is another way to save it directly without converting to pandas, but the issue is that it ends up split into multiple part files (with unwieldy names, so I tend to avoid it). If you are happy for the output to be split up, it is much better to write a Parquet file, in my opinion.
LCL_POS.repartition(1).write.format("com.databricks.spark.csv").option("header", "true").save(<path>)
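Since the question is about Parquet, here is a minimal sketch of writing the same DataFrame directly as Parquet (the output path is only a placeholder; Spark writes it as a directory, and coalesce(1) keeps it to a single part file inside that directory):
# Placeholder output path; adjust to your own location.
LCL_POS.coalesce(1).write.mode("overwrite").parquet(r"C:\temp\DaterMaster_par2222.parquet")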
Hope that answers your question.
Related
I have a CSV file which contains a lot of text data. I am trying to import it in Azure Databricks using Python pandas, but it gives me a long list of errors, primarily this one: ERROR: Internal Python error in the inspect module. However, when I put the file on my local desktop and import it there using Jupyter/Spyder, it is imported without any errors.
I have also set the encoding option to UTF-8 while importing it in Azure Databricks, but it still shows the error. Any idea how to tackle this?
Problem solved: I had to pass encoding="cp1252". I am not sure why this particular option was needed, but I tried several and this one worked. There were several symbols and brackets in the text fields, so this might be useful when importing similar data and facing such problems.
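In code, that is just the encoding argument to pandas' read_csv (the path below is a placeholder):
import pandas as pd

# Placeholder path; the important part is encoding="cp1252".
df = pd.read_csv("/dbfs/FileStore/tables/my_text_data.csv", encoding="cp1252")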
I need to run a long-running job via an Azure WebJob in Python.
I am facing the error below when trying to import pandas.
File "D:\local\Temp\jobs\triggered\demo2\eveazbwc.iyd\pandas_init_.py", line 13
missing_dependencies.append(f"{dependency}: {e}")
The web app (under which I will run the web job) also has Python code using pandas, and it does not throw any error.
I have tried uploading the pandas and numpy folders inside the zip file (creating a venv, installing the packages, and zipping the Lib/site-packages content), for both 32-bit and 64-bit Python, as well as appending 'D:/home/site/wwwroot/my_app_name/env/Lib/site-packages' to sys.path.
I am not facing such issues when importing standard Python modules or additional packages like requests.
The error is also thrown when trying to import numpy.
So, I am assuming some kind of version mismatch is happening somewhere.
Any pointers to solve this will be really useful.
I have been using Python 3.x; I am not sure if I should try Python 2.x instead (create a virtualenv, install the packages, and zip the content of Lib/site-packages).
The key to solving the problem is to add the Python 3.6.4 x64 extension on the portal. The failing line in pandas' __init__.py is an f-string, which requires Python 3.6 or newer, so the error suggests the WebJob was picking up an older Python.
Steps:
Add the extension on the portal.
Create a requirements.txt file.
pandas==1.1.4
numpy==1.19.3
Create run.cmd.
Create the files and zip them into a single zip archive.
Upload the zip for the webjob.
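For illustration, a minimal webjob script might look like the sketch below; the file name run.py is an assumption here, and run.cmd simply invokes it with the Python installed by the extension.
# Hypothetical run.py: its only job here is to prove that pandas and numpy
# (installed from requirements.txt) import cleanly under Python 3.6+.
import sys

import numpy as np
import pandas as pd

print("python:", sys.version)
print("pandas:", pd.__version__, "| numpy:", np.__version__)
print(pd.DataFrame({"value": [1, 2, 3]}).describe())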
For more details, please read this post.
Webjobs Running Error (3587fd: ERR ) from zipfile
I am trying to extract tables from a PDF file using Python (PyCharm).
I tried the following code:
from tabula import wrapper
object = wrapper.read_pdf("C:/Users/Ojasvi/Desktop/sample.pdf")
However, the error I got was:
"tabula.errors.JavaNotFoundError: `java` command is not found from this Python process. Please ensure Java is installed and PATH is set for `java`"
You probably need to add Java to your system PATH. You can check these posts; they should help you solve your problem:
How to Set Java Path On Windows?
Environment variables for java installation
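If you would rather fix it from inside the script, here is a minimal sketch (the Java path below is an assumption; point it at your own JDK/JRE, and note that recent tabula-py releases expose read_pdf at the top level rather than under wrapper):
import os

# Hypothetical Java location; prepend its bin folder so the tabula subprocess can find `java`.
os.environ["PATH"] = r"C:\Program Files\Java\jre1.8.0_241\bin" + os.pathsep + os.environ["PATH"]

import tabula

tables = tabula.read_pdf("C:/Users/Ojasvi/Desktop/sample.pdf", pages="all")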
I had everything set up, Java installed and the Java path configured, but I was still getting the same error. After spending half a day on it, I did the following and everything worked.
I was using a Python virtual environment and running tabula inside it, and I was getting the error mentioned in the question.
I switched back to the default Python environment (i.e. no virtual environment) and everything worked. I think tabula is not able to detect Java once you are inside a Python virtual environment.
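A quick way to confirm that from within the environment is to check whether the java launcher is resolvable by the Python process running tabula:
import shutil

# None means this Python process cannot see `java` on PATH, which is exactly
# what triggers tabula's JavaNotFoundError.
print(shutil.which("java"))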
I am trying to install Apache Spark to run locally on my Windows machine. I have followed all the instructions here: https://medium.com/@loldja/installing-apache-spark-pyspark-the-missing-quick-start-guide-for-windows-ad81702ba62d.
After this installation I am able to successfully start pyspark, and execute a command such as
textFile = sc.textFile("README.md")
When I then execute a command that operates on textFile such as
textFile.first()
Spark gives me the error 'worker failed to connect back', and I can see an exception in the console coming from worker.py saying 'ModuleNotFoundError: No module named resource'. Looking at the source file, I can see that this Python file does indeed try to import the resource module, but this module is not available on Windows systems. I understand that you can install Spark on Windows, so how do I get around this?
I struggled the whole morning with the same problem. Your best bet is to downgrade to Spark 2.3.2.
The fix can be found at https://github.com/apache/spark/pull/23055.
The resource module is only for Unix/Linux systems and is not applicable in a Windows environment. This fix is not yet included in the latest release, but you can modify worker.py in your installation as shown in the pull request. The changes to that file can be found at https://github.com/apache/spark/pull/23055/files.
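The change is essentially a guard around the resource usage; paraphrased from the PR (check the diff for the exact code), it looks roughly like this:
# `resource` is Unix-only, so only import and use it when the import succeeds.
has_resource_module = True
try:
    import resource
except ImportError:
    has_resource_module = False

# ...later, the "set up memory limits" block only runs when the module exists:
if has_resource_module:
    (soft_limit, hard_limit) = resource.getrlimit(resource.RLIMIT_AS)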
You will have to re-zip the pyspark directory and move it to the lib folder in your Spark installation directory (where you extracted the pre-compiled Spark according to the tutorial you mentioned).
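If you want to script that re-zipping step, here is a rough sketch (the Spark path is an assumption; the aim is to rebuild python\lib\pyspark.zip with the pyspark package at the root of the archive):
import shutil

spark_home = r"C:\spark-2.4.0-bin-hadoop2.7"  # placeholder: your extraction directory
shutil.make_archive(
    base_name=spark_home + r"\python\lib\pyspark",  # produces pyspark.zip
    format="zip",
    root_dir=spark_home + r"\python",
    base_dir="pyspark",                             # keep pyspark/ at the archive root
)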
Adding to all those valuable answers:
For Windows users, make sure you have copied the correct version of the winutils.exe file (for your specific version of Hadoop) to the spark/bin folder.
For example, if you have Hadoop 2.7.1, then you should copy the winutils.exe file from the hadoop-2.7.1/bin folder.
The link for that is here: https://github.com/steveloughran/winutils
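As a quick sanity check (the paths below are placeholders; adjust them to your own HADOOP_HOME and Spark directories), you can verify from Python that the file is where you expect it:
import os

# Placeholder paths; print whether winutils.exe is present in each location.
for path in (r"C:\winutils\bin\winutils.exe",
             r"C:\spark-2.4.0-bin-hadoop2.7\bin\winutils.exe"):
    print(path, "->", "found" if os.path.exists(path) else "missing")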
I edited the worker.py file and removed all resource-related lines: specifically, the `# set up memory limits` block and the `import resource` statement. The error disappeared.
I have written a bunch of pandas DataFrames to an h5 file using the PyTables integration in pandas. Since then I've deleted some of the groups in the h5 file, and I want to repack it in order to reclaim the space. From what I've found, I know I need to use the PyTables ptrepack tool. However, I can't get it to work. Can someone let me know if I'm messing something up in my script or if I'm actually running into a bug in PyTables? If I am messing it up, can you give me an example of importing and calling ptrepack to simply repack an h5 file in order to save space?
Here's my script and the errors I get:
When I looked at the ptrepack.py script in the pytables folder in Anaconda, I also saw that I should be able to pass a help flag to it, but that's also not working. Here's the error I get when I try to get the help flag to work:
Currently I'm working on a Windows 10 machine with the following package versions:
python 3.5.1
pytables: 3.2.2
pandas: 0.18.0
Thanks!
OK, firstly, to get the help dialog to show in the command prompt you have to run either ptrepack -h or ptrepack --help.
I didn't manage to get the script working from within Python, as it seems it was made specifically for the command line. I did, however, find a very helpful notebook on the subject ([Reclaiming HDF5 Space][1]), which has the following solution:
from subprocess import call

filename = 'data.h5'    # placeholder: the original HDF5 file you want to repack
outfilename = 'out.h5'  # the repacked copy
command = ["ptrepack", "-o", "--chunkshape=auto", "--propindexes", filename, outfilename]
call(command)
Note that this essentially just starts a subprocess which runs the ptrepack command-line tool.
[1]: https://github.com/jackdotwa/python-concepts/blob/master/hdf5/reclaiming_space.ipynb "Reclaiming HDF5 space"