PermissionError while using pandas .to_csv function - python

I have this code to read a .txt file from S3 and convert it to .csv using pandas:
file = pd.read_csv(f's3://{bucket_name}/{bucket_key}', sep=':', error_bad_lines=False)
file.to_csv(f's3://{bucket_name}/file_name.csv')
I have provided read/write permissions to the IAM role, but this error still comes up for the .to_csv call:
Anonymous access is forbidden for this operation: PermissionError
Update: the full error in the EC2 logs is:
Traceback (most recent call last):
File "/usr/local/lib/python3.6/dist-packages/s3fs/core.py", line 446, in _mkdir
await self.s3.create_bucket(**params)
File "/usr/local/lib/python3.6/dist-packages/aiobotocore/client.py", line 134, in _make_api_call
raise error_class(parsed_response, operation_name)
botocore.exceptions.ClientError: An error occurred (AccessDenied) when calling the CreateBucket operation: Anonymous access is forbidden for this operation
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "convert_file_instance.py", line 92, in <module>
main()
File "convert_file_instance.py", line 36, in main
raise e
File "convert_file_instance.py", line 30, in main
file.to_csv(f's3://{bucket_name}/file_name.csv')
File "/usr/local/lib/python3.6/dist-packages/pandas/core/generic.py", line 3165, in to_csv
decimal=decimal,
File "/usr/local/lib/python3.6/dist-packages/pandas/io/formats/csvs.py", line 67, in __init__
path_or_buf, encoding=encoding, compression=compression, mode=mode
File "/usr/local/lib/python3.6/dist-packages/pandas/io/common.py", line 233, in get_filepath_or_buffer
filepath_or_buffer, mode=mode or "rb", **(storage_options or {})
File "/usr/local/lib/python3.6/dist-packages/fsspec/core.py", line 399, in open
**kwargs
File "/usr/local/lib/python3.6/dist-packages/fsspec/core.py", line 254, in open_files
[fs.makedirs(parent, exist_ok=True) for parent in parents]
File "/usr/local/lib/python3.6/dist-packages/fsspec/core.py", line 254, in <listcomp>
[fs.makedirs(parent, exist_ok=True) for parent in parents]
File "/usr/local/lib/python3.6/dist-packages/s3fs/core.py", line 460, in makedirs
self.mkdir(path, create_parents=True)
File "/usr/local/lib/python3.6/dist-packages/fsspec/asyn.py", line 100, in wrapper
return maybe_sync(func, self, *args, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/fsspec/asyn.py", line 80, in maybe_sync
return sync(loop, func, *args, **kwargs)
File "/usr/local/lib/python3.6/dist-packages/fsspec/asyn.py", line 51, in sync
raise exc.with_traceback(tb)
File "/usr/local/lib/python3.6/dist-packages/fsspec/asyn.py", line 35, in f
result[0] = await future
File "/usr/local/lib/python3.6/dist-packages/s3fs/core.py", line 450, in _mkdir
raise translate_boto_error(e) from e
PermissionError: Anonymous access is forbidden for this operation
I don't know why it is trying to create a bucket, and I have given the Lambda role full access to S3.
Can someone please tell me what I'm missing here?
Thank you.

I think this is due to an incompatibility between the pandas, boto3 and s3fs libraries.
Try this setup:
pandas==1.0.3
boto3==1.13.11
s3fs==0.4.2
I also tried it with pandas 1.1.1 and it works too (I'm on Python 3.7).
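If pinning versions is not an option, you can also sidestep s3fs on the write path entirely: serialize the CSV into an in-memory buffer and upload it with boto3, so no directory/bucket-creation call is ever attempted. A minimal sketch, with placeholder values for the bucket_name and bucket_key variables from the question:

import io

import boto3
import pandas as pd

bucket_name = 'my-bucket'        # placeholder from the question
bucket_key = 'path/to/file.txt'  # placeholder from the question

# Reading through s3fs worked in the question, so keep it as-is.
file = pd.read_csv(f's3://{bucket_name}/{bucket_key}', sep=':', error_bad_lines=False)

# Serialize to memory, then upload with boto3 directly; this never
# touches the s3fs directory-creation code path that raised the error.
buffer = io.StringIO()
file.to_csv(buffer)
boto3.client('s3').put_object(Bucket=bucket_name, Key='file_name.csv', Body=buffer.getvalue())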

Related

Has anyone experienced random file access errors when working with luigi in Windows?

When working with luigi on Windows 10, the following error is sometimes thrown:
Traceback (most recent call last):
File "D:\Users\myuser\PycharmProjects\project\venv\lib\site-packages\luigi\worker.py", line 192, in run
new_deps = self._run_get_new_deps()
File "D:\Users\myuser\PycharmProjects\project\venv\lib\site-packages\luigi\worker.py", line 130, in _run_get_new_deps
task_gen = self.task.run()
File "project.py", line 15000, in run
data_frame = pd.read_excel(self.input()["tables"].open(),segment)
File "D:\Users\myuser\PycharmProjects\project\venv\lib\site-packages\pandas\io\excel.py", line 191, in read_excel
io = ExcelFile(io, engine=engine)
File "D:\Users\myuser\PycharmProjects\project\venv\lib\site-packages\pandas\io\excel.py", line 247, in __init__
self.book = xlrd.open_workbook(file_contents=data)
File "D:\Users\myuser\PycharmProjects\project\venv\lib\site-packages\xlrd\__init__.py", line 115, in open_workbook
zf = zipfile.ZipFile(timemachine.BYTES_IO(file_contents))
File "C:\Python27\lib\zipfile.py", line 793, in __init__
self._RealGetContents()
File "C:\Python27\lib\zipfile.py", line 862, in _RealGetContents
raise BadZipfile("Bad magic number for central directory")
BadZipfile: Bad magic number for central directory
This error seems to happen only on Windows, and it happens more frequently when a task needs to access the same file more than once within the same task, or when the file is used as a Target by different tasks, even when no parallel workers are running. The code runs without errors on Linux.
Has anyone else experienced this behavior?
I'm trying to create pandas DataFrames from the sheets of an Excel file, but sometimes I get a BadZipFile error instead.
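One thing worth checking, though it is only an assumption and not something the traceback proves: luigi's LocalTarget opens files in text mode by default, and on Windows the newline translation can corrupt binary formats such as .xlsx, which produces exactly this kind of BadZipfile error. A minimal sketch of declaring the target as binary (both task names are hypothetical):

import luigi
import pandas as pd

class ExportTables(luigi.Task):
    """Hypothetical upstream task that produces the Excel file."""

    def output(self):
        # format=luigi.format.Nop keeps the stream binary, so Windows
        # newline translation cannot mangle the zip container of .xlsx.
        return luigi.LocalTarget('tables.xlsx', format=luigi.format.Nop)

class ProcessSegment(luigi.Task):
    """Hypothetical downstream task mirroring the failing read_excel call."""

    def requires(self):
        return {"tables": ExportTables()}

    def run(self):
        with self.input()["tables"].open('r') as handle:
            data_frame = pd.read_excel(handle, sheet_name=0)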

Why can't I access the excel file in python?

I'm trying to use some data that I have in an Excel file. However, I'm getting an error saying that it can't find the file. I've checked, and the directory and the file name are correct. What am I doing wrong?
Here is the code:
import os
import pandas as pd
print(os.getcwd())
df = pd.read_excel(r'C:/Users/Eder/Desktop/TFG/Data/Interpolation_sample.xlsx',
                   index_col=0, parse_dates=True, sheet_name='sheet3')
And the output from the console:
runcell(0, 'C:/Users/Eder/untitled0.py')
C:\Users\Eder\Desktop\TFG\Data
Traceback (most recent call last):
File "C:\Users\Eder\untitled0.py", line 14, in <module>
index_col =0,parse_dates=True, sheet_name='sheet3')
File "E:\Anaconda3\lib\site-packages\pandas\util\_decorators.py", line 299, in wrapper
return func(*args, **kwargs)
File "E:\Anaconda3\lib\site-packages\pandas\io\excel\_base.py", line 336, in read_excel
io = ExcelFile(io, storage_options=storage_options, engine=engine)
File "E:\Anaconda3\lib\site-packages\pandas\io\excel\_base.py", line 1072, in __init__
content=path_or_buffer, storage_options=storage_options
File "E:\Anaconda3\lib\site-packages\pandas\io\excel\_base.py", line 950, in inspect_excel_format
content_or_path, "rb", storage_options=storage_options, is_text=False
File "E:\Anaconda3\lib\site-packages\pandas\io\common.py", line 651, in get_handle
handle = open(handle, ioargs.mode)
FileNotFoundError: [Errno 2] No such file or directory: 'C:\\Users\\Eder\\Desktop\\TFG\\Data\\Interpolation_sample.xlsx'
I've figured out a way to solve the problem: I just changed the name of the file from 'Interpolation_sample' to 'Interpolation sample'. I don't know why, but the underscore in the file name is what was causing the error.
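For anyone hitting a similar mismatch, a quick diagnostic (a sketch using the path from the question) is to print the directory contents exactly as the filesystem stores them, since a stray invisible character or a renamed extension shows up immediately:

import os

data_dir = r'C:\Users\Eder\Desktop\TFG\Data'

# repr() exposes invisible or lookalike characters that print() would hide.
for name in os.listdir(data_dir):
    print(repr(name))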

Pandas to GBQ method returning GenericGBQException: Reason 404 POST

This error just started to pop up in our pipelines.
I'm moving a dataframe of about 1.5 million rows using the pandas to_gbq method.
Any help would be greatly appreciated!
Code:
output.to_gbq('table_name',
              'project-id',
              chunksize=50000,
              private_key='ga_auth.json',
              if_exists='replace')
Error:
Traceback (most recent call last):
File ".\rfm_bigquery.py", line 175, in <module>
send_rfm_to_gbq()
File ".\rfm_bigquery.py", line 152, in send_rfm_to_gbq
if_exists='replace',
File "C:\Users\yyu\Desktop\env\rfm_bigquery\lib\site-packages\pandas\core\frame.py", line 1187, in to_gbq
table_schema=table_schema)
File "C:\Users\yyu\Desktop\env\rfm_bigquery\lib\site-packages\pandas\io\gbq.py", line 119, in to_gbq
table_schema=table_schema)
File "C:\Users\yyu\Desktop\env\rfm_bigquery\lib\site-packages\pandas_gbq\gbq.py", line 1036, in to_gbq
progress_bar=progress_bar,
File "C:\Users\yyu\Desktop\env\rfm_bigquery\lib\site-packages\pandas_gbq\gbq.py", line 513, in load_data
self.process_http_error(ex)
File "C:\Users\yyu\Desktop\env\rfm_bigquery\lib\site-packages\pandas_gbq\gbq.py", line 376, in process_http_error
raise GenericGBQException("Reason: {0}".format(ex))
pandas_gbq.gbq.GenericGBQException: Reason: 404 POST https://www.googleapis.com/upload/bigquery/v2/projects/hidden-moon-164616/jobs?uploadType=resumable: Not Found
I had the same problem with the latest version. I moved back to pandas==0.23.3 and pandas-gbq==0.5.0 and it is finally working.
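If you would rather upgrade than downgrade, note that newer pandas-gbq versions replaced the private_key argument with credentials. A sketch of the newer call, assuming a service-account file like the 'ga_auth.json' from the question (the dataset name is hypothetical):

import pandas as pd
from google.oauth2 import service_account

output = pd.DataFrame({'user_id': [1, 2], 'score': [0.5, 0.9]})  # stand-in for the real dataframe
credentials = service_account.Credentials.from_service_account_file('ga_auth.json')

# Newer pandas-gbq expects a 'dataset.table' destination and a
# credentials object instead of the deprecated private_key path.
output.to_gbq('my_dataset.table_name',
              project_id='project-id',
              chunksize=50000,
              credentials=credentials,
              if_exists='replace')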

Pyspark celery task: toPandas() throwing Pickling error

I have a web application that runs long-running tasks in PySpark. I am using Django and Celery to run the tasks asynchronously.
I have a piece of code that works great when I execute it in the console, but I get quite a few errors when I run it through the Celery task.
First, my UDFs don't work for some reason. I put the code in a try-except block and it always goes into the except branch.
try:
    func = udf(lambda x: parse(x), DateType())
    spark_data_frame = spark_data_frame.withColumn('date_format', func(col(date_name)))
except:
    raise ValueError("No valid date format found.")
The error:
[2018-04-05 07:47:37,223: ERROR/ForkPoolWorker-3] Task algorithms.tasks.outlier_algorithm[afbda586-0929-4d51-87f1-d612cbdb4c5e] raised unexpected: Py4JError('An error occurred while calling None.org.apache.spark.sql.execution.python.UserDefinedPythonFunction. Trace:\npy4j.Py4JException: Constructor org.apache.spark.sql.execution.python.UserDefinedPythonFunction([class java.lang.String, class org.apache.spark.api.python.PythonFunction, class org.apache.spark.sql.types.DateType$, class java.lang.Integer, class java.lang.Boolean]) does not exist\n\tat py4j.reflection.ReflectionEngine.getConstructor(ReflectionEngine.java:179)\n\tat py4j.reflection.ReflectionEngine.getConstructor(ReflectionEngine.java:196)\n\tat py4j.Gateway.invoke(Gateway.java:235)\n\tat py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:80)\n\tat py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:69)\n\tat py4j.GatewayConnection.run(GatewayConnection.java:214)\n\tat java.lang.Thread.run(Thread.java:748)\n\n',)
Traceback (most recent call last):
File "/home/fractaluser/dev_eugenie/eugenie/venv_eugenie/lib/python3.4/site-packages/celery/app/trace.py", line 374, in trace_task
R = retval = fun(*args, **kwargs)
File "/home/fractaluser/dev_eugenie/eugenie/venv_eugenie/lib/python3.4/site-packages/celery/app/trace.py", line 629, in __protected_call__
return self.run(*args, **kwargs)
File "/home/fractaluser/dev_eugenie/eugenie/eugenie/algorithms/tasks.py", line 68, in outlier_algorithm
spark_data_frame = spark_data_frame.withColumn('date_format', func(col(date_name)))
File "/home/fractaluser/dev_eugenie/eugenie/venv_eugenie/lib/python3.4/site-packages/pyspark/sql/udf.py", line 179, in wrapper
return self(*args)
File "/home/fractaluser/dev_eugenie/eugenie/venv_eugenie/lib/python3.4/site-packages/pyspark/sql/udf.py", line 157, in __call__
judf = self._judf
File "/home/fractaluser/dev_eugenie/eugenie/venv_eugenie/lib/python3.4/site-packages/pyspark/sql/udf.py", line 141, in _judf
self._judf_placeholder = self._create_judf()
File "/home/fractaluser/dev_eugenie/eugenie/venv_eugenie/lib/python3.4/site-packages/pyspark/sql/udf.py", line 153, in _create_judf
self._name, wrapped_func, jdt, self.evalType, self.deterministic)
File "/home/fractaluser/dev_eugenie/eugenie/venv_eugenie/lib/python3.4/site-packages/py4j/java_gateway.py", line 1428, in __call__
answer, self._gateway_client, None, self._fqn)
File "/home/fractaluser/dev_eugenie/eugenie/venv_eugenie/lib/python3.4/site-packages/pyspark/sql/utils.py", line 63, in deco
return f(*a, **kw)
File "/home/fractaluser/dev_eugenie/eugenie/venv_eugenie/lib/python3.4/site-packages/py4j/protocol.py", line 324, in get_return_value
format(target_id, ".", name, value))
py4j.protocol.Py4JError: An error occurred while calling None.org.apache.spark.sql.execution.python.UserDefinedPythonFunction. Trace:
py4j.Py4JException: Constructor org.apache.spark.sql.execution.python.UserDefinedPythonFunction([class java.lang.String, class org.apache.spark.api.python.PythonFunction, class org.apache.spark.sql.types.DateType$, class java.lang.Integer, class java.lang.Boolean]) does not exist
at py4j.reflection.ReflectionEngine.getConstructor(ReflectionEngine.java:179)
at py4j.reflection.ReflectionEngine.getConstructor(ReflectionEngine.java:196)
at py4j.Gateway.invoke(Gateway.java:235)
at py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:80)
at py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:69)
at py4j.GatewayConnection.run(GatewayConnection.java:214)
at java.lang.Thread.run(Thread.java:748)
Further, I am using toPandas() to convert the dataframe so I can run some pandas functions on it, but it throws the following error:
[2018-04-05 07:46:29,701: ERROR/ForkPoolWorker-3] Task algorithms.tasks.outlier_algorithm[ec267a9b-b482-492d-8404-70b489fbbfe7] raised unexpected: Py4JJavaError('An error occurred while calling o224.get.\n', 'JavaObject id=o225')
Traceback (most recent call last):
File "/home/fractaluser/dev_eugenie/eugenie/venv_eugenie/lib/python3.4/site-packages/celery/app/trace.py", line 374, in trace_task
R = retval = fun(*args, **kwargs)
File "/home/fractaluser/dev_eugenie/eugenie/venv_eugenie/lib/python3.4/site-packages/celery/app/trace.py", line 629, in __protected_call__
return self.run(*args, **kwargs)
File "/home/fractaluser/dev_eugenie/eugenie/eugenie/algorithms/tasks.py", line 146, in outlier_algorithm
data_frame_new = data_frame_1.toPandas()
File "/home/fractaluser/dev_eugenie/eugenie/venv_eugenie/lib/python3.4/site-packages/pyspark/sql/dataframe.py", line 1937, in toPandas
if self.sql_ctx.getConf("spark.sql.execution.pandas.respectSessionTimeZone").lower() \
File "/home/fractaluser/dev_eugenie/eugenie/venv_eugenie/lib/python3.4/site-packages/pyspark/sql/context.py", line 142, in getConf
return self.sparkSession.conf.get(key, defaultValue)
File "/home/fractaluser/dev_eugenie/eugenie/venv_eugenie/lib/python3.4/site-packages/pyspark/sql/conf.py", line 46, in get
return self._jconf.get(key)
File "/home/fractaluser/dev_eugenie/eugenie/venv_eugenie/lib/python3.4/site-packages/py4j/java_gateway.py", line 1160, in __call__
answer, self.gateway_client, self.target_id, self.name)
File "/home/fractaluser/dev_eugenie/eugenie/venv_eugenie/lib/python3.4/site-packages/pyspark/sql/utils.py", line 63, in deco
return f(*a, **kw)
File "/home/fractaluser/dev_eugenie/eugenie/venv_eugenie/lib/python3.4/site-packages/py4j/protocol.py", line 320, in get_return_value
format(target_id, ".", name), value)
py4j.protocol.Py4JJavaError: ('An error occurred while calling o224.get.\n', 'JavaObject id=o225')
[2018-04-05 07:46:29,706: ERROR/MainProcess] Task handler raised error: <MaybeEncodingError: Error sending result: '"(1, <ExceptionInfo: Py4JJavaError('An error occurred while calling o224.get.\\n', 'JavaObject id=o225')>, None)"'. Reason: ''PicklingError("Can\'t pickle <class \'py4j.protocol.Py4JJavaError\'>: it\'s not the same object as py4j.protocol.Py4JJavaError",)''.>
Traceback (most recent call last):
File "/home/fractaluser/dev_eugenie/eugenie/venv_eugenie/lib/python3.4/site-packages/billiard/pool.py", line 362, in workloop
put((READY, (job, i, result, inqW_fd)))
File "/home/fractaluser/dev_eugenie/eugenie/venv_eugenie/lib/python3.4/site-packages/billiard/queues.py", line 366, in put
self.send_payload(ForkingPickler.dumps(obj))
File "/home/fractaluser/dev_eugenie/eugenie/venv_eugenie/lib/python3.4/site-packages/billiard/reduction.py", line 56, in dumps
cls(buf, protocol).dump(obj)
billiard.pool.MaybeEncodingError: Error sending result: '"(1, <ExceptionInfo: Py4JJavaError('An error occurred while calling o224.get.\\n', 'JavaObject id=o225')>, None)"'. Reason: ''PicklingError("Can\'t pickle <class \'py4j.protocol.Py4JJavaError\'>: it\'s not the same object as py4j.protocol.Py4JJavaError",)''.
I ran into this problem and was having a hard time pinning it down. As it turns out, this error can occur if the version of Spark you are running does not match the version of PySpark you are executing it with. In my case I was running Spark 2.2.3.4 and trying to use PySpark 2.4.4. After I downgraded PySpark to 2.2.3 the problem went away. I then ran into another problem caused by the code using functionality added to PySpark after 2.2.3, but that's another issue.
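A quick way to compare the two versions before going down the same rabbit hole (a sketch; it assumes a working Spark installation):

import pyspark
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
print("PySpark library:", pyspark.__version__)  # the pip-installed client
print("Spark runtime:", spark.version)          # the version actually executing jobs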
This is just not going to work. Spark uses complex state, including JVM state, which cannot simply be serialized and sent to the worker. If you want to run your code asynchronously, use a thread pool to submit jobs, as sketched below.
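A minimal sketch of that suggestion (run_outlier_algorithm is a stand-in for the body of the Celery task; because the pool lives in the same process as the JVM gateway, no Spark state ever crosses a pickle boundary):

from concurrent.futures import ThreadPoolExecutor

# One shared pool inside the web process; the threads share the JVM
# gateway, so Spark objects never need to be serialized.
executor = ThreadPoolExecutor(max_workers=4)

def run_outlier_algorithm(spark_data_frame):
    # ... the long-running Spark work from the task goes here ...
    return spark_data_frame.toPandas()

# spark_data_frame is the Spark DataFrame from the question.
future = executor.submit(run_outlier_algorithm, spark_data_frame)
pandas_frame = future.result()  # or poll future.done() from a view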
I'm answering my own question: it was probably a PySpark 2.3 bug.
I was using PySpark 2.3.0 and for some reason it did not work well with Python 3.5.
I downgraded to PySpark 2.1.2 and everything worked fine.

Python: Broken Pipe Error For PyNomo Example (in function Nomographer)

I am using PyCharm on Python 2.7 and have installed PyNomo. I am trying to run this small example from the official site; the code is available at the link, and I have simply copy-pasted it. I get the following error:
Aligning with tag A
Traceback (most recent call last):
File "/home/darshil/Desktop/Caltech Summer Internship/Radiation Ononcology Data/DB/rad3/pynomo_temp.py", line 71, in <module>
Nomographer(main_params)
File "/usr/local/lib/python2.7/dist-packages/pynomo/nomographer.py", line 203, in __init__
wrapper.draw_nomogram(c,params['post_func'])
File "/usr/local/lib/python2.7/dist-packages/pynomo/nomo_wrapper.py", line 213, in draw_nomogram
block.draw(canvas)
File "/usr/local/lib/python2.7/dist-packages/pynomo/nomo_wrapper.py", line 445, in draw
atom.draw(canvas)
File "/usr/local/lib/python2.7/dist-packages/pynomo/nomo_wrapper.py", line 2503, in draw
axis_appear=p,base_start=base_start,base_stop=base_stop)
File "/usr/local/lib/python2.7/dist-packages/pynomo/nomo_axis.py", line 123, in __init__
self.draw_axis(canvas)
File "/usr/local/lib/python2.7/dist-packages/pynomo/nomo_axis.py", line 1067, in draw_axis
c.text(x,y,ttext,attr+[text_color])
File "/usr/local/lib/python2.7/dist-packages/pyx/canvas.py", line 324, in text
return self.insert(self.texrunner.text(x, y, atext, *args, **kwargs))
File "/usr/local/lib/python2.7/dist-packages/pyx/text.py", line 1194, in text
self.execute(expr, self.defaulttexmessagesdefaultrun + self.texmessagesdefaultrun + texmessages)
File "/usr/local/lib/python2.7/dist-packages/pyx/text.py", line 951, in execute
self.defaulttexmessagesstart + self.texmessagesstart)
File "/usr/local/lib/python2.7/dist-packages/pyx/text.py", line 1005, in execute
self.texinput.write(self.expr)
IOError: [Errno 32] Broken pipe
Error in atexit._run_exitfuncs:
Traceback (most recent call last):
File "/usr/lib/python2.7/atexit.py", line 24, in _run_exitfuncs
func(*targs, **kargs)
File "/usr/local/lib/python2.7/dist-packages/pyx/text.py", line 748, in _cleantmp
texrunner.texinput.write("\n\\end\n")
IOError: [Errno 32] Broken pipe
Error in sys.exitfunc:
Traceback (most recent call last):
File "/usr/lib/python2.7/atexit.py", line 24, in _run_exitfuncs
func(*targs, **kargs)
File "/usr/local/lib/python2.7/dist-packages/pyx/text.py", line 748, in _cleantmp
texrunner.texinput.write("\n\\end\n")
IOError: [Errno 32] Broken pipe
Process finished with exit code 1
The error is in the final line of the code:
Nomographer(main_params)
I have looked at other questions about the "broken pipe error": here, here, and here. But none of them were helpful to me.
Any indication of how to solve this would be very helpful.
PyNomo uses a TeX installation to typeset text. Maybe this is missing, resulting in a broken pipe. You need to be able to run a file hello.tex with the content Hello, world!\bye from the command line via tex hello.tex. It should produce a file hello.dvi. If not, you need to install a TeX distribution such as TeX Live.
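A scripted version of that check (a sketch; it assumes the tex binary is on PATH):

import subprocess

# Write the minimal TeX file described above and try to compile it.
with open('hello.tex', 'w') as f:
    f.write('Hello, world!\\bye\n')

exit_code = subprocess.call(['tex', '-interaction=nonstopmode', 'hello.tex'])
print('TeX works' if exit_code == 0 else 'TeX is missing or broken; install e.g. TeX Live')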
