A backend endpoint returns a Parquet file as an octet stream.
In pandas I can do something like this:
import io, requests
import pandas as pd

result = requests.get("https://..../file.parquet")
df = pd.read_parquet(io.BytesIO(result.content))
Can I do it in Dask somehow?
This code:
dd.read_parquet("https://..../file.parquet")
raises an exception (obviously, because this is a bytes-like object, not a listable path):
File "to_parquet_dask.py", line 153, in <module>
main(*parser.parse_args())
File "to_parquet_dask.py", line 137, in main
download_parquet(
File "to_parquet_dask.py", line 121, in download_parquet
dd.read_parquet(
File "/home/bc30138/Documents/CODE/flexydrive/driver_style/.venv/lib/python3.8/site-packages/dask/dataframe/io/parquet/core.py", line 313, in read_parquet
read_metadata_result = engine.read_metadata(
File "/home/bc30138/Documents/CODE/flexydrive/driver_style/.venv/lib/python3.8/site-packages/dask/dataframe/io/parquet/fastparquet.py", line 733, in read_metadata
parts, pf, gather_statistics, base_path = _determine_pf_parts(
File "/home/bc30138/Documents/CODE/flexydrive/driver_style/.venv/lib/python3.8/site-packages/dask/dataframe/io/parquet/fastparquet.py", line 148, in _determine_pf_parts
elif fs.isdir(paths[0]):
File "/home/bc30138/Documents/CODE/flexydrive/driver_style/.venv/lib/python3.8/site-packages/fsspec/asyn.py", line 88, in wrapper
return sync(self.loop, func, *args, **kwargs)
File "/home/bc30138/Documents/CODE/flexydrive/driver_style/.venv/lib/python3.8/site-packages/fsspec/asyn.py", line 69, in sync
raise result[0]
File "/home/bc30138/Documents/CODE/flexydrive/driver_style/.venv/lib/python3.8/site-packages/fsspec/asyn.py", line 25, in _runner
result[0] = await coro
File "/home/bc30138/Documents/CODE/flexydrive/driver_style/.venv/lib/python3.8/site-packages/fsspec/implementations/http.py", line 418, in _isdir
return bool(await self._ls(path))
File "/home/bc30138/Documents/CODE/flexydrive/driver_style/.venv/lib/python3.8/site-packages/fsspec/implementations/http.py", line 195, in _ls
out = await self._ls_real(url, detail=detail, **kwargs)
File "/home/bc30138/Documents/CODE/flexydrive/driver_style/.venv/lib/python3.8/site-packages/fsspec/implementations/http.py", line 150, in _ls_real
text = await r.text()
File "/home/bc30138/Documents/CODE/flexydrive/driver_style/.venv/lib/python3.8/site-packages/aiohttp/client_reqrep.py", line 1082, in text
return self._body.decode(encoding, errors=errors) # type: ignore
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x90 in position 7: invalid start byte
UPD
With the change to fsspec from @mdurant's answer, I got the error
ValueError: Cannot seek streaming HTTP file
So I prepended "simplecache::" to my URL, and then I faced the following:
Traceback (most recent call last):
File "to_parquet_dask.py", line 161, in <module>
main(*parser.parse_args())
File "to_parquet_dask.py", line 145, in main
download_parquet(
File "to_parquet_dask.py", line 128, in download_parquet
dd.read_parquet(
File "/home/bc30138/Documents/CODE/flexydrive/driver_style/.venv/lib/python3.8/site-packages/dask/dataframe/io/parquet/core.py", line 313, in read_parquet
read_metadata_result = engine.read_metadata(
File "/home/bc30138/Documents/CODE/flexydrive/driver_style/.venv/lib/python3.8/site-packages/dask/dataframe/io/parquet/fastparquet.py", line 733, in read_metadata
parts, pf, gather_statistics, base_path = _determine_pf_parts(
File "/home/bc30138/Documents/CODE/flexydrive/driver_style/.venv/lib/python3.8/site-packages/dask/dataframe/io/parquet/fastparquet.py", line 185, in _determine_pf_parts
pf = ParquetFile(
File "/home/bc30138/Documents/CODE/flexydrive/driver_style/.venv/lib/python3.8/site-packages/fastparquet/api.py", line 127, in __init__
raise ValueError("Opening directories without a _metadata requires"
ValueError: Opening directories without a _metadata requiresa filesystem compatible with fsspec
Temporary workaround
Maybe this way is dirty and not optimal, but it kind of works:
import io
import requests
import pandas as pd
import dask
import dask.dataframe as dd

@dask.delayed
def parquet_from_http(url, token):
    result = requests.get(
        url,
        headers={'Authorization': token}
    )
    return pd.read_parquet(io.BytesIO(result.content))

delayed_download = parquet_from_http(url, token)
df = dd.from_delayed(delayed_download, meta=meta)
P.S. The meta argument is necessary in this approach, because otherwise Dask will call the function twice (once to find out meta, then again to compute), so two HTTP requests would be made.
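For illustration, meta could be an empty pandas DataFrame with the expected schema (these column names and dtypes are hypothetical placeholders):

meta = pd.DataFrame({
    'driver_id': pd.Series(dtype='int64'),
    'speed': pd.Series(dtype='float64'),
})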
This is not an answer, but I believe the following change in fsspec will fix your problem. If you would be willing to try and confirm, we can make this a patch.
--- a/fsspec/implementations/http.py
+++ b/fsspec/implementations/http.py
@@ -472,7 +472,10 @@ class HTTPFileSystem(AsyncFileSystem):
     async def _isdir(self, path):
         # override, since all URLs are (also) files
-        return bool(await self._ls(path))
+        try:
+            return bool(await self._ls(path))
+        except (FileNotFoundError, ValueError):
+            return False
(we can put this in a branch, if that makes it easier for you to install)
-edit-
The second problem (which is the same thing in both parquet engines) stems from the server either not providing the size of the file, or not allowing range requests. The parquet format requires random access to the data to be able to read it. The only way to get around this (short of improving the server) is to copy the whole file locally, e.g., by prepending "simplecache::" to your URL.
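For example, a minimal sketch of that (the URL is a placeholder):

import dask.dataframe as dd

# "simplecache::" makes fsspec download the whole file to local temporary
# storage first, which gives the parquet engine the random access it needs.
df = dd.read_parquet("simplecache::https://example.com/file.parquet")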
Related
from airflow.operators.python import get_current_context
context = get_current_context()
ti = context['ti']
ti.xcom_push(key="file", value=doc)
I have the above code in a task, and doc is the data that I want to pass via XCom. It's throwing the following error stack trace:
Traceback (most recent call last):
File "/opt/bitnami/airflow/venv/lib/python3.9/site-packages/airflow/decorators/base.py", line 217, in execute
return_value = super().execute(context)
File "/opt/bitnami/airflow/venv/lib/python3.9/site-packages/airflow/operators/python.py", line 175, in execute
return_value = self.execute_callable()
File "/opt/bitnami/airflow/venv/lib/python3.9/site-packages/airflow/operators/python.py", line 192, in execute_callable
return self.python_callable(*self.op_args, **self.op_kwargs)
File "/opt/bitnami/airflow/dags/rover_ocr_pipeline.py", line 65, in retrieve
ti.xcom_push(key="file", value = doc )
File "/opt/bitnami/airflow/venv/lib/python3.9/site-packages/airflow/utils/session.py", line 75, in wrapper
return func(*args, session=session, **kwargs)
File "/opt/bitnami/airflow/venv/lib/python3.9/site-packages/airflow/models/taskinstance.py", line 2294, in xcom_push
XCom.set(
File "/opt/bitnami/airflow/venv/lib/python3.9/site-packages/airflow/utils/session.py", line 72, in wrapper
return func(*args, **kwargs)
File "/opt/bitnami/airflow/venv/lib/python3.9/site-packages/airflow/models/xcom.py", line 234, in set
value = cls.serialize_value(
File "/opt/bitnami/airflow/venv/lib/python3.9/site-packages/airflow/models/xcom.py", line 627, in serialize_value
return json.dumps(value, cls=XComEncoder).encode("UTF-8")
File "/opt/bitnami/python/lib/python3.9/json/__init__.py", line 234, in dumps
return cls(
File "/opt/bitnami/airflow/venv/lib/python3.9/site-packages/airflow/utils/json.py", line 176, in encode
return super().encode(o)
File "/opt/bitnami/python/lib/python3.9/json/encoder.py", line 199, in encode
chunks = self.iterencode(o, _one_shot=True)
File "/opt/bitnami/python/lib/python3.9/json/encoder.py", line 257, in iterencode
return _iterencode(o, 0)
File "/opt/bitnami/airflow/venv/lib/python3.9/site-packages/airflow/utils/json.py", line 153, in default
CLASSNAME: o.__module__ + "." + o.__class__.__qualname__,
AttributeError: 'bytes' object has no attribute '__module__'
This was working until now; I am guessing it's an issue with the Airflow version. Previously I was using 2.3.4, and now I'm using 2.5.0.
Airflow is running on a Kubernetes cluster and uses the airflow:2.5.0-debian-11-r11 image.
Moving from comments to an actual answer (see the comments above for the full conversation):
XCom tries to JSON-serialize everything before storing it in the XCom tables. In this case, since bytes can't be serialized to JSON, the push fails. Converting the bytes to a normal string by base64-encoding them allows the value to be stored in XCom.
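A minimal sketch of that fix, assuming doc holds the raw bytes from the question:

import base64
from airflow.operators.python import get_current_context

context = get_current_context()
ti = context['ti']
# base64 turns the raw bytes into an ASCII string that the default
# JSON-based XCom serializer can handle.
ti.xcom_push(key="file", value=base64.b64encode(doc).decode("ascii"))
# A downstream task can recover the original bytes with:
# doc = base64.b64decode(ti.xcom_pull(key="file"))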
While probably not worth the effort for just this case, this could be handled automatically by creating a custom XCom backend that detects byte values and performs the conversion behind the scenes.
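Such a backend might look roughly like this (an untested sketch; the class name and marker key are made up, and the exact BaseXCom hooks can vary between Airflow versions):

import base64
from airflow.models.xcom import BaseXCom

class BytesAwareXComBackend(BaseXCom):
    @staticmethod
    def serialize_value(value, **kwargs):
        # Wrap bytes in a JSON-safe envelope before the default serializer runs.
        if isinstance(value, bytes):
            value = {"__bytes_b64__": base64.b64encode(value).decode("ascii")}
        return BaseXCom.serialize_value(value, **kwargs)

    @staticmethod
    def deserialize_value(result):
        value = BaseXCom.deserialize_value(result)
        # Unwrap the envelope back into bytes on the way out.
        if isinstance(value, dict) and "__bytes_b64__" in value:
            return base64.b64decode(value["__bytes_b64__"])
        return value

It would then be enabled through the xcom_backend configuration option (e.g. the AIRFLOW__CORE__XCOM_BACKEND environment variable).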
I am using Python along with Airflow and the GCP Python library. I have automated the process of sending files to GCP using Airflow DAGs. The code is as follows:
for fileid, filename in files_dictionary.items():
    if ftp.size(filename) <= int(MAX_FILE_SIZE):
        data = BytesIO()
        ftp.retrbinary('RETR ' + filename, callback=data.write)
        f = client.File(client, fid=fileid)
        size = sys.getsizeof(data.read())
        # Another option is to use FileIO but not sure how
        f.send(data, filename, size)  # This method is in another library
The code that triggers the upload is in the current repo (as shown above), but the real upload is done by another dependency that is not under our control. The documentation of that method is:
def send(self, fp, filename, file_bytes):
    """Send file to cloud
    fp file object
    filename is the name of the file.
    file_bytes is the size of the file in bytes
    """
    data = self.initiate_resumable_upload(self.getFileid())
    _, blob = self.get_gcs_blob_and_bucket(data)
    # Set attachment filename. Does this work with datasets with folders
    original_filename = filename.rsplit(os.sep, 1)[-1]
    blob.content_disposition = "attachment;filename=" + original_filename
    blob.upload_from_file(fp)
    self.finish_resumable_upload(self.getFileid())
I am getting the error below:
[2020-04-23 09:43:17,239] {{models.py:1788}} ERROR - Stream must be at beginning.
Traceback (most recent call last):
File "/usr/local/lib/python3.6/site-packages/airflow/models.py", line 1657, in _run_raw_task
result = task_copy.execute(context=context)
File "/usr/local/lib/python3.6/site-packages/airflow/operators/python_operator.py", line 103, in execute
return_value = self.execute_callable()
File "/usr/local/lib/python3.6/site-packages/airflow/operators/python_operator.py", line 108, in execute_callable
return self.python_callable(*self.op_args, **self.op_kwargs)
File "/usr/local/airflow/dags/transfer_data.py", line 241, in upload
f.send(data, filename, size)
File "/usr/local/lib/python3.6/site-packages/client/utils.py", line 53, in wrapper_timer
value = func(*args, **kwargs)
File "/usr/local/lib/python3.6/site-packages/client/client.py", line 518, in send
blob.upload_from_file(fp)
File "/usr/local/lib/python3.6/site-packages/google/cloud/storage/blob.py", line 1158, in upload_from_file
client, file_obj, content_type, size, num_retries, predefined_acl
File "/usr/local/lib/python3.6/site-packages/google/cloud/storage/blob.py", line 1068, in _do_upload
client, stream, content_type, size, num_retries, predefined_acl
File "/usr/local/lib/python3.6/site-packages/google/cloud/storage/blob.py", line 1011, in _do_resumable_upload
predefined_acl=predefined_acl,
File "/usr/local/lib/python3.6/site-packages/google/cloud/storage/blob.py", line 960, in _initiate_resumable_upload
stream_final=False,
File "/usr/local/lib/python3.6/site-packages/google/resumable_media/requests/upload.py", line 343, in initiate
stream_final=stream_final,
File "/usr/local/lib/python3.6/site-packages/google/resumable_media/_upload.py", line 415, in _prepare_initiate_request
raise ValueError(u"Stream must be at beginning.")
ValueError: Stream must be at beginning.
The upload_from_file function has a parameter that handles the seek(0) call for you.
I would modify your upload_from_file call to:
blob.upload_from_file(file_obj=fp, rewind=True)
That should do the trick, and you don't need to include the additional seek().
When reading a binary file, you can navigate through it using seek operations. In other words, you can move the reference from the beginning of the file to any other position. The error ValueError: Stream must be at beginning. is basically saying: "your reference is not pointing at the beginning of the stream, and it must be".
Given that, you need to set your reference back to the beginning of the stream. You can do that using the seek function.
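A quick standalone illustration of how the stream position moves:

from io import BytesIO

buf = BytesIO(b"hello")
buf.read()         # reading consumes the stream; the position is now at the end
print(buf.tell())  # 5
buf.seek(0)        # move the reference back to the beginning
print(buf.tell())  # 0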
In your case, you would do something like:
data = BytesIO()
ftp.retrbinary('RETR ' + filename, callback=data.write)
f = client.File(client, fid=fileid)
size = sys.getsizeof(data.read())
data.seek(0)  # rewind to the beginning before handing the stream to send()
f.send(data, filename, size)
I can't find a way of reading Minecraft world files that I can use in Python.
I've looked around the internet, but I can find no tutorials and only a few libraries that claim they can do this but never actually work.
from nbt import *
nbtfile = nbt.NBTFile("r.0.0.mca",'rb')
I expected this to work, but instead I got errors about the file not being compressed, or something of the sort.
Full error:
Traceback (most recent call last):
File "C:\Users\rober\Desktop\MinePy\MinecraftWorldReader.py", line 2, in <module>
nbtfile = nbt.NBTFile("r.0.0.mca",'rb')
File "C:\Users\rober\AppData\Local\Programs\Python\Python36-32\lib\site-packages\nbt\nbt.py", line 628, in __init__
self.parse_file()
File "C:\Users\rober\AppData\Local\Programs\Python\Python36-32\lib\site-packages\nbt\nbt.py", line 652, in parse_file
type = TAG_Byte(buffer=self.file)
File "C:\Users\rober\AppData\Local\Programs\Python\Python36-32\lib\site-packages\nbt\nbt.py", line 99, in __init__
self._parse_buffer(buffer)
File "C:\Users\rober\AppData\Local\Programs\Python\Python36-32\lib\site-packages\nbt\nbt.py", line 105, in _parse_buffer
self.value = self.fmt.unpack(buffer.read(self.fmt.size))[0]
File "C:\Users\rober\AppData\Local\Programs\Python\Python36-32\lib\gzip.py", line 276, in read
return self._buffer.read(size)
File "C:\Users\rober\AppData\Local\Programs\Python\Python36-32\lib\_compression.py", line 68, in readinto
data = self.read(len(byte_view))
File "C:\Users\rober\AppData\Local\Programs\Python\Python36-32\lib\gzip.py", line 463, in read
if not self._read_gzip_header():
File "C:\Users\rober\AppData\Local\Programs\Python\Python36-32\lib\gzip.py", line 411, in _read_gzip_header
raise OSError('Not a gzipped file (%r)' % magic)
OSError: Not a gzipped file (b'\x00\x00')
Use anvil-parser (install it with pip install anvil-parser).
Reading
import anvil
region = anvil.Region.from_file('r.0.0.mca')
# You can also provide the region file name instead of the object
chunk = anvil.Chunk.from_region(region, 0, 0)
# If `section` is not provided, will get it from the y coords
# and assume it's global
block = chunk.get_block(0, 0, 0)
print(block) # <Block(minecraft:air)>
print(block.id) # air
print(block.properties) # {}
https://pypi.org/project/anvil-parser/
According to this page, the .mca file is not exactly a kind of NBT file. It begins with an 8 KiB header which includes the offsets of chunks in the region file itself and the timestamps of the last updates of those chunks.
I recommend you see the official announcement and this page for more information.
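As a rough illustration of that header layout (a sketch based on the description above; the file name is a placeholder):

import struct

with open('r.0.0.mca', 'rb') as f:
    locations = f.read(4096)   # 1024 four-byte chunk-location entries
    timestamps = f.read(4096)  # 1024 four-byte big-endian last-update timestamps

# Each location entry is a 3-byte offset (in 4 KiB sectors) plus a
# 1-byte sector count; this reads the entry for the first chunk.
offset_sectors = int.from_bytes(locations[:3], 'big')
sector_count = locations[3]
timestamp = struct.unpack('>I', timestamps[:4])[0]
print(offset_sectors, sector_count, timestamp)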
I'm trying to build a Python function with web.py and SQLite that will allow users to search for a given string within a description field and will return all matching results.
Right now I've gotten to the function below, which works, but only if the input is an exact match.
def getItem(params, max_display):
    query_string = 'SELECT * FROM items WHERE 1=1'
    description = params['description']
    if description:
        query_string = query_string + ' AND description LIKE $description'
    result = query(query_string, {
        'description': description,
    })
I've tried to implement this feature with LIKE "%$description%", however I keep getting the web.py error below.
Traceback (most recent call last):
File "lib/web/wsgiserver/__init__.py", line 1245, in communicate
req.respond()
File "lib/web/wsgiserver/__init__.py", line 775, in respond
self.server.gateway(self).respond()
File "lib/web/wsgiserver/__init__.py", line 2018, in respond
response = self.req.server.wsgi_app(self.env, self.start_response)
File "lib/web/httpserver.py", line 306, in __call__
return self.app(environ, xstart_response)
File "lib/web/httpserver.py", line 274, in __call__
return self.app(environ, start_response)
File "lib/web/application.py", line 279, in wsgi
result = self.handle_with_processors()
File "lib/web/application.py", line 249, in handle_with_processors
return process(self.processors)
File "lib/web/application.py", line 246, in process
raise self.internalerror()
File "lib/web/application.py", line 478, in internalerror
return debugerror.debugerror()
File "lib/web/debugerror.py", line 305, in debugerror
return web._InternalError(djangoerror())
File "lib/web/debugerror.py", line 290, in djangoerror
djangoerror_r = Template(djangoerror_t, filename=__file__, filter=websafe)
File "lib/web/template.py", line 846, in __init__
code = self.compile_template(text, filename)
File "lib/web/template.py", line 926, in compile_template
ast = compiler.parse(code)
File "/Users/sokeefe/homebrew/Cellar/python/2.7.10_2/Frameworks/Python.framework/Versions/2.7/lib/python2.7/compiler/transformer.py", line 51, in parse
return Transformer().parsesuite(buf)
File "/Users/sokeefe/homebrew/Cellar/python/2.7.10_2/Frameworks/Python.framework/Versions/2.7/lib/python2.7/compiler/transformer.py", line 128, in parsesuite
return self.transform(parser.suite(text))
AttributeError: 'module' object has no attribute 'suite'
Any thoughts on what might be going wrong with this function?
Thanks in advance!
What do you think is going on with parser.py?
Here is the relevant portion of the error message:
File "/Users/sokeefe/homebrew/Cellar/python/2.7.10_2/Frameworks/Python.framework/Versions/2.7/lib/python2.7/compiler/transformer.py", line 128, in parsesuite
    return self.transform(parser.suite(text))
AttributeError: 'module' object has no attribute 'suite'
So, somewhere there is a file called parser.py which defines a function called suite(), and that function is used by library code that runs when your program executes. But because you named one of your own files parser.py, when the library code executed, Python searched for a module named parser, found your file first, and there was no function named suite() in your file.
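You can confirm the shadowing with a quick diagnostic (a sketch, run from the same directory as your application):

import parser

# A path inside your project here means your own parser.py is shadowing
# the module that web.py's template compiler expects to import.
print(getattr(parser, '__file__', '<built-in module>'))
print(hasattr(parser, 'suite'))  # True for the intended module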
I am attempting to use a web service created by one of our developers that allows us to upload files into the system, within certain restrictions.
Using SUDS, I get the following information:
Suds ( https://fedorahosted.org/suds/ ) version: 0.4 GA build: R699-20100913
Service ( ConnectToEFS ) tns="http://tempuri.org/"
Prefixes (3)
ns0 = "http://schemas.microsoft.com/2003/10/Serialization/"
ns1 = "http://schemas.microsoft.com/Message"
ns2 = "http://tempuri.org/"
Ports (1):
(BasicHttpBinding_IConnectToEFS)
Methods (2):
CreateContentFolder(xs:string FileCode, xs:string FolderName, xs:string ContentType, xs:string MetaDataXML, )
UploadFile(ns1:StreamBody FileByteStream, )
Types (4):
ns1:StreamBody
ns0:char
ns0:duration
ns0:guid
My method for calling UploadFile is as follows:
def webserviceUploadFile(self, targetLocation, fileName, fileSource):
fileSource = './test_files/' + fileSource
ntlm = WindowsHttpAuthenticated(username=uname, password=upass)
client = Client(webservice_url, transport=ntlm)
client.set_options(soapheaders={'TargetLocation':targetLocation, 'FileName': fileName})
body = client.factory.create('AIRDocument')
body_file = open(fileSource, 'rb')
body_data = body_file.read()
body.FileByteStream = body_data
return client.service.UploadFile(body)
Running this gets me the following result:
Traceback (most recent call last):
File "test_cases.py", line 639, in test_upload_file_invalid_extension
result_string = self.HM.webserviceUploadFile('9999', 'AD-1234-5424__44.exe',
'test_data.pdf')
File "test_cases.py", line 81, in webserviceUploadFile
return client.service.UploadFile(body)
File "build\bdist.win32\egg\suds\client.py", line 542, in __call__
return client.invoke(args, kwargs)
File "build\bdist.win32\egg\suds\client.py", line 595, in invoke
soapenv = binding.get_message(self.method, args, kwargs)
File "build\bdist.win32\egg\suds\bindings\binding.py", line 120, in get_message
content = self.bodycontent(method, args, kwargs)
File "build\bdist.win32\egg\suds\bindings\document.py", line 63, in bodycontent
p = self.mkparam(method, pd, value)
File "build\bdist.win32\egg\suds\bindings\document.py", line 105, in mkparam
return Binding.mkparam(self, method, pdef, object)
File "build\bdist.win32\egg\suds\bindings\binding.py", line 287, in mkparam
return marshaller.process(content)
File "build\bdist.win32\egg\suds\mx\core.py", line 62, in process
self.append(document, content)
File "build\bdist.win32\egg\suds\mx\core.py", line 75, in append
self.appender.append(parent, content)
File "build\bdist.win32\egg\suds\mx\appender.py", line 102, in append
appender.append(parent, content)
File "build\bdist.win32\egg\suds\mx\appender.py", line 243, in append
Appender.append(self, child, cont)
File "build\bdist.win32\egg\suds\mx\appender.py", line 182, in append
self.marshaller.append(parent, content)
File "build\bdist.win32\egg\suds\mx\core.py", line 75, in append
self.appender.append(parent, content)
File "build\bdist.win32\egg\suds\mx\appender.py", line 102, in append
appender.append(parent, content)
File "build\bdist.win32\egg\suds\mx\appender.py", line 198, in append
child.setText(tostr(content.value))
File "build\bdist.win32\egg\suds\sax\element.py", line 251, in setText
self.text = Text(value)
File "build\bdist.win32\egg\suds\sax\text.py", line 43, in __new__
result = super(Text, cls).__new__(cls, *args, **kwargs)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 10: ordinal not in range(128)
After much research and talking with the developer of the web service, I changed body_data = body_file.read() to body_data = body_file.read().decode("utf-8"), which gets me this error:
Traceback (most recent call last):
File "test_cases.py", line 639, in test_upload_file_invalid_extension
result_string = self.HM.webserviceUploadFile('9999', 'AD-1234-5424__44.exe', 'test_data.pdf')
File "test_cases.py", line 79, in webserviceUploadFile
body_data = body_file.read().decode("utf-8")
File "C:\python27\lib\encodings\utf_8.py", line 16, in decode
return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xe2 in position 10: invalid continuation byte
Which is less than helpful.
After more research into the problem, I tried adding errors='ignore' to the UTF-8 decode, and this was the result:
<TransactionDescription>Error in INTL-CONF_France_PROJ_MA_126807.docx: An exception has been thrown when reading the stream.. Inner Exception: System.Xml.XmlException: The byte 0x03 is not valid at this location. Line 1, position 318.
at System.Xml.XmlExceptionHelper.ThrowXmlException(XmlDictionaryReader reader, String res, String arg1, String arg2, String arg3)
at System.Xml.XmlUTF8TextReader.Read()
at System.ServiceModel.Dispatcher.StreamFormatter.MessageBodyStream.Exhaust(XmlDictionaryReader reader)
at System.ServiceModel.Dispatcher.StreamFormatter.MessageBodyStream.Read(Byte[] buffer, Int32 offset, Int32 count). Source: System.ServiceModel</TransactionDescription>
Which pretty much stumps me on what to do. Based on the stack trace returned by the web service, it looks like it wants UTF-8, but I can't seem to get the data to the web service without Python or SUDS throwing a fit, or without ignoring problems in the encoding. The system I'm working on only takes in Microsoft Office file types (doc, xls, and the like), PDFs, and TXT files, so using something where I have more control over the encoding is not an option. I also tried detecting the encoding used by the sample PDF and the sample DOCX, but everything it suggested (Latin-1, ISO-8859-x, and several Windows-xxxx codecs) was accepted by Python and SUDS, yet not by the web service.
Also note that the example shown most frequently references a test with an invalid extension. This error applies even in what should be a test of a successful upload, which is really the only time the final stack trace ever shows up.
You can use base64.b64encode(body_file.read()), and this will return the base64 string value. So your request variable must be a string.
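Applied to the method from the question, that would look something like this (a sketch reusing the client, body, and fileSource names from above; that the service accepts base64 content is an assumption based on this answer):

import base64

body_file = open(fileSource, 'rb')
# base64 keeps the payload ASCII-safe, so SUDS never has to decode
# arbitrary binary bytes as text.
body.FileByteStream = base64.b64encode(body_file.read())
body_file.close()
return client.service.UploadFile(body)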