Airflow with GCP Python library: ValueError: Stream must be at beginning - python

I am using Python along with Airflow and the GCP Python library. I automated the process of sending files to GCP using Airflow DAGs. The code is as follows:
for fileid, filename in files_dictionary.items():
    if ftp.size(filename) <= int(MAX_FILE_SIZE):
        data = BytesIO()
        ftp.retrbinary('RETR ' + filename, callback=data.write)
        f = client.File(client, fid=fileid)
        size = sys.getsizeof(data.read())
        # Another option is to use FileIO but not sure how
        f.send(data, filename, size)  # This method is in another library
The code that triggers the upload lives in our repo (as shown above), but the actual upload is done by another dependency that is not under our control. The documentation of that method is:
def send(self, fp, filename, file_bytes):
    """Send file to cloud
    fp file object
    filename is the name of the file.
    file_bytes is the size of the file in bytes
    """
    data = self.initiate_resumable_upload(self.getFileid())
    _, blob = self.get_gcs_blob_and_bucket(data)
    # Set attachment filename. Does this work with datasets with folders
    original_filename = filename.rsplit(os.sep, 1)[-1]
    blob.content_disposition = "attachment;filename=" + original_filename
    blob.upload_from_file(fp)
    self.finish_resumable_upload(self.getFileid())
I am getting the error below:
[2020-04-23 09:43:17,239] {{models.py:1788}} ERROR - Stream must be at beginning.
Traceback (most recent call last):
File "/usr/local/lib/python3.6/site-packages/airflow/models.py", line 1657, in _run_raw_task
result = task_copy.execute(context=context)
File "/usr/local/lib/python3.6/site-packages/airflow/operators/python_operator.py", line 103, in execute
return_value = self.execute_callable()
File "/usr/local/lib/python3.6/site-packages/airflow/operators/python_operator.py", line 108, in execute_callable
return self.python_callable(*self.op_args, **self.op_kwargs)
File "/usr/local/airflow/dags/transfer_data.py", line 241, in upload
f.send(data, filename, size)
File "/usr/local/lib/python3.6/site-packages/client/utils.py", line 53, in wrapper_timer
value = func(*args, **kwargs)
File "/usr/local/lib/python3.6/site-packages/client/client.py", line 518, in send
blob.upload_from_file(fp)
File "/usr/local/lib/python3.6/site-packages/google/cloud/storage/blob.py", line 1158, in upload_from_file
client, file_obj, content_type, size, num_retries, predefined_acl
File "/usr/local/lib/python3.6/site-packages/google/cloud/storage/blob.py", line 1068, in _do_upload
client, stream, content_type, size, num_retries, predefined_acl
File "/usr/local/lib/python3.6/site-packages/google/cloud/storage/blob.py", line 1011, in _do_resumable_upload
predefined_acl=predefined_acl,
File "/usr/local/lib/python3.6/site-packages/google/cloud/storage/blob.py", line 960, in _initiate_resumable_upload
stream_final=False,
File "/usr/local/lib/python3.6/site-packages/google/resumable_media/requests/upload.py", line 343, in initiate
stream_final=stream_final,
File "/usr/local/lib/python3.6/site-packages/google/resumable_media/_upload.py", line 415, in _prepare_initiate_request
raise ValueError(u"Stream must be at beginning.")
ValueError: Stream must be at beginning.

The upload_from_file function has a parameter that handles the seek(0) call for you.
I would modify your upload_from_file call to:
blob.upload_from_file(file_obj=fp, rewind=True)
That should do the trick, and you don't need to include an additional seek().

When reading a binary file, you can navigate through it using seek operations; in other words, you can move the read position from the beginning of the file to any other offset. The error ValueError: Stream must be at beginning. is basically saying: "your read position is not at the beginning of the stream, and it must be".
Given that, you need to move the position back to the beginning of the stream. You can do that with the seek method.
In your case, you would do something like:
data = BytesIO()
ftp.retrbinary('RETR ' + filename, callback=data.write)
f = client.File(client, fid=fileid)
size = sys.getsizeof(data.read())
data.seek(0)
f.send(data, filename, size)
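As a side note (not part of the original answer), sys.getsizeof(data.read()) both consumes the buffer and reports the Python object overhead rather than the payload length. A minimal sketch of the same idea that measures the buffer without reading it, assuming the third-party f.send expects the payload length in bytes:
data = BytesIO()
ftp.retrbinary('RETR ' + filename, callback=data.write)

size = data.getbuffer().nbytes  # payload length in bytes, no read() needed
data.seek(0)                    # rewind so the upload starts at offset 0

f = client.File(client, fid=fileid)  # third-party client from the question
f.send(data, filename, size)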

Related

Can I read parquet from HTTP(s) octet-stream?

Some backend endpoint returns a parquet file as an octet-stream.
In pandas I can do something like this:
result = requests.get("https://..../file.parquet")
df = pd.read_parquet(io.BytesIO(result.content))
Can I do it in Dask somehow?
This code:
dd.read_parquet("https://..../file.parquet")
raises an exception (obviously, because this is a bytes-like object):
File "to_parquet_dask.py", line 153, in <module>
main(*parser.parse_args())
File "to_parquet_dask.py", line 137, in main
download_parquet(
File "to_parquet_dask.py", line 121, in download_parquet
dd.read_parquet(
File "/home/bc30138/Documents/CODE/flexydrive/driver_style/.venv/lib/python3.8/site-packages/dask/dataframe/io/parquet/core.py", line 313, in read_parquet
read_metadata_result = engine.read_metadata(
File "/home/bc30138/Documents/CODE/flexydrive/driver_style/.venv/lib/python3.8/site-packages/dask/dataframe/io/parquet/fastparquet.py", line 733, in read_metadata
parts, pf, gather_statistics, base_path = _determine_pf_parts(
File "/home/bc30138/Documents/CODE/flexydrive/driver_style/.venv/lib/python3.8/site-packages/dask/dataframe/io/parquet/fastparquet.py", line 148, in _determine_pf_parts
elif fs.isdir(paths[0]):
File "/home/bc30138/Documents/CODE/flexydrive/driver_style/.venv/lib/python3.8/site-packages/fsspec/asyn.py", line 88, in wrapper
return sync(self.loop, func, *args, **kwargs)
File "/home/bc30138/Documents/CODE/flexydrive/driver_style/.venv/lib/python3.8/site-packages/fsspec/asyn.py", line 69, in sync
raise result[0]
File "/home/bc30138/Documents/CODE/flexydrive/driver_style/.venv/lib/python3.8/site-packages/fsspec/asyn.py", line 25, in _runner
result[0] = await coro
File "/home/bc30138/Documents/CODE/flexydrive/driver_style/.venv/lib/python3.8/site-packages/fsspec/implementations/http.py", line 418, in _isdir
return bool(await self._ls(path))
File "/home/bc30138/Documents/CODE/flexydrive/driver_style/.venv/lib/python3.8/site-packages/fsspec/implementations/http.py", line 195, in _ls
out = await self._ls_real(url, detail=detail, **kwargs)
File "/home/bc30138/Documents/CODE/flexydrive/driver_style/.venv/lib/python3.8/site-packages/fsspec/implementations/http.py", line 150, in _ls_real
text = await r.text()
File "/home/bc30138/Documents/CODE/flexydrive/driver_style/.venv/lib/python3.8/site-packages/aiohttp/client_reqrep.py", line 1082, in text
return self._body.decode(encoding, errors=errors) # type: ignore
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x90 in position 7: invalid start byte
UPD
With the changes to fsspec from @mdurant's answer I got this error:
ValueError: Cannot seek streaming HTTP file
So I prepended "simplecache::" to my URL and then I got the following:
Traceback (most recent call last):
File "to_parquet_dask.py", line 161, in <module>
main(*parser.parse_args())
File "to_parquet_dask.py", line 145, in main
download_parquet(
File "to_parquet_dask.py", line 128, in download_parquet
dd.read_parquet(
File "/home/bc30138/Documents/CODE/flexydrive/driver_style/.venv/lib/python3.8/site-packages/dask/dataframe/io/parquet/core.py", line 313, in read_parquet
read_metadata_result = engine.read_metadata(
File "/home/bc30138/Documents/CODE/flexydrive/driver_style/.venv/lib/python3.8/site-packages/dask/dataframe/io/parquet/fastparquet.py", line 733, in read_metadata
parts, pf, gather_statistics, base_path = _determine_pf_parts(
File "/home/bc30138/Documents/CODE/flexydrive/driver_style/.venv/lib/python3.8/site-packages/dask/dataframe/io/parquet/fastparquet.py", line 185, in _determine_pf_parts
pf = ParquetFile(
File "/home/bc30138/Documents/CODE/flexydrive/driver_style/.venv/lib/python3.8/site-packages/fastparquet/api.py", line 127, in __init__
raise ValueError("Opening directories without a _metadata requires"
ValueError: Opening directories without a _metadata requiresa filesystem compatible with fsspec
Temporary workaround
Maybe this way is dirty and not optimal, but it kind of works:
@dask.delayed
def parquet_from_http(url, token):
    result = requests.get(
        url,
        headers={'Authorization': token}
    )
    return pd.read_parquet(io.BytesIO(result.content))

delayed_download = parquet_from_http(url, token)
df = dd.from_delayed(delayed_download, meta=meta)
P.S. The meta argument is necessary in this approach, because otherwise Dask will call this function twice: once to work out meta and then to compute, so two requests would be made.
This is not an answer, but I believe the following change in fsspec will fix your problem. If you would be willing to try and confirm, we can make this a patch.
--- a/fsspec/implementations/http.py
+++ b/fsspec/implementations/http.py
@@ -472,7 +472,10 @@ class HTTPFileSystem(AsyncFileSystem):
     async def _isdir(self, path):
         # override, since all URLs are (also) files
-        return bool(await self._ls(path))
+        try:
+            return bool(await self._ls(path))
+        except (FileNotFoundError, ValueError):
+            return False
(we can put this in a branch, if that makes it easier for you to install)
-edit-
The second problem (which is the same thing in both parquet engines) stems from the server either not providing the size of the file, or not allowing range-gets. The parquet format requires random access to the data in order to read it. The only way to get around this (short of improving the server) is to copy the whole file locally, e.g., by prepending "simplecache::" to your URL.
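As an illustration only (none of this is from the original post), here is roughly what the simplecache:: route could look like once the patch above is in place; the URL, cache directory, and header plumbing are placeholders and may need adjusting for your fsspec version:
import dask.dataframe as dd

# "simplecache::" makes fsspec download the whole remote file to local disk first,
# which gives the parquet engine the random access it needs.
df = dd.read_parquet(
    "simplecache::https://example.com/file.parquet",           # placeholder URL
    storage_options={
        "simplecache": {"cache_storage": "/tmp/fsspec-cache"},  # local cache dir
        "https": {"client_kwargs": {"headers": {"Authorization": "token"}}},  # placeholder auth
    },
)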

How can I read Minecraft .mca files so that I can extract individual blocks in Python?

I can't find a way of reading Minecraft world files in a way that I could use in Python.
I've looked around the internet but can find no tutorials, and only a few libraries that claim they can do this but never actually work.
from nbt import *
nbtfile = nbt.NBTFile("r.0.0.mca",'rb')
I expected this to work, but instead I got errors about the file not being compressed or something of the sort.
Full error:
Traceback (most recent call last):
File "C:\Users\rober\Desktop\MinePy\MinecraftWorldReader.py", line 2, in <module>
nbtfile = nbt.NBTFile("r.0.0.mca",'rb')
File "C:\Users\rober\AppData\Local\Programs\Python\Python36-32\lib\site-packages\nbt\nbt.py", line 628, in __init__
self.parse_file()
File "C:\Users\rober\AppData\Local\Programs\Python\Python36-32\lib\site-packages\nbt\nbt.py", line 652, in parse_file
type = TAG_Byte(buffer=self.file)
File "C:\Users\rober\AppData\Local\Programs\Python\Python36-32\lib\site-packages\nbt\nbt.py", line 99, in __init__
self._parse_buffer(buffer)
File "C:\Users\rober\AppData\Local\Programs\Python\Python36-32\lib\site-packages\nbt\nbt.py", line 105, in _parse_buffer
self.value = self.fmt.unpack(buffer.read(self.fmt.size))[0]
File "C:\Users\rober\AppData\Local\Programs\Python\Python36-32\lib\gzip.py", line 276, in read
return self._buffer.read(size)
File "C:\Users\rober\AppData\Local\Programs\Python\Python36-32\lib\_compression.py", line 68, in readinto
data = self.read(len(byte_view))
File "C:\Users\rober\AppData\Local\Programs\Python\Python36-32\lib\gzip.py", line 463, in read
if not self._read_gzip_header():
File "C:\Users\rober\AppData\Local\Programs\Python\Python36-32\lib\gzip.py", line 411, in _read_gzip_header
raise OSError('Not a gzipped file (%r)' % magic)
OSError: Not a gzipped file (b'\x00\x00')
Use the anvil-parser package (install it with pip install anvil-parser).
Reading
import anvil
region = anvil.Region.from_file('r.0.0.mca')
# You can also provide the region file name instead of the object
chunk = anvil.Chunk.from_region(region, 0, 0)
# If `section` is not provided, will get it from the y coords
# and assume it's global
block = chunk.get_block(0, 0, 0)
print(block) # <Block(minecraft:air)>
print(block.id) # air
print(block.properties) # {}
https://pypi.org/project/anvil-parser/
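Building on the snippet above, a small sketch (not from the original answer) that scans one chunk-local column using the same get_block call:
import anvil

region = anvil.Region.from_file('r.0.0.mca')
chunk = anvil.Chunk.from_region(region, 0, 0)

# Walk up the column at chunk-local (x=0, z=0) and print the non-air blocks.
for y in range(256):
    block = chunk.get_block(0, y, 0)
    if block.id != 'air':
        print(y, block.id, block.properties)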
According to this page, an .mca file is not exactly an NBT file. It begins with an 8 KiB header that contains the offsets of the chunks within the region file and the timestamps of the last updates of those chunks.
I recommend looking at the official announcement and this page for more information.

Python ftplib cannot use STOR in callback function following RETR

Here is what I need to accomplish:
- connect to FTP
- get contents of test.txt
- write new contents into test.txt right after getting the results
In the real scenario I need to get the previous modification time, stored in a txt file, and then upload to FTP only those files which were modified after that time, without checking every file specifically (there are thousands of them, and that would take too long).
Here is where I'm stuck.
def continueTest(data, ftp):
    print(data, ftp)
    with open('test.txt', 'w+') as file:
        file.write('test')
    with open('test.txt', 'rb') as file:
        ftp.storbinary('STOR htdocs/test.txt', file)

def test():
    host_data = FTP_HOSTS['planz-norwegian']
    ftp = ftplib.FTP(host=host_data['server'],
                     user=host_data['username'],
                     passwd=host_data['password'])
    print('connected to ftp')
    ftp.retrbinary('RETR htdocs/test.txt', lambda data: continueTest(data, ftp))

if __name__ == '__main__':
    test()
This outputs:
connected to ftp
b'test' <ftplib.FTP object at 0x0322FAB0>
Traceback (most recent call last):
File "C:\Python33\Plan Z Editor SL\redistdb.py", line 111, in <module>
test()
File "C:\Python33\Plan Z Editor SL\redistdb.py", line 107, in test
ftp.retrbinary('RETR htdocs/test.txt', lambda data:continueTest(data, ftp))
File "C:\Python33\lib\ftplib.py", line 434, in retrbinary
callback(data)
File "C:\Python33\Plan Z Editor SL\redistdb.py", line 107, in <lambda>
ftp.retrbinary('RETR htdocs/test.txt', lambda data:continueTest(data, ftp))
File "C:\Python33\Plan Z Editor SL\redistdb.py", line 99, in continueTest
ftp.storbinary('STOR htdocs/test.txt', file)
File "C:\Python33\lib\ftplib.py", line 483, in storbinary
with self.transfercmd(cmd, rest) as conn:
File "C:\Python33\lib\ftplib.py", line 391, in transfercmd
return self.ntransfercmd(cmd, rest)[0]
File "C:\Python33\lib\ftplib.py", line 351, in ntransfercmd
host, port = self.makepasv()
File "C:\Python33\lib\ftplib.py", line 329, in makepasv
host, port = parse227(self.sendcmd('PASV'))
File "C:\Python33\lib\ftplib.py", line 873, in parse227
raise error_reply(resp)
ftplib.error_reply: 200 Type set to I.
If I don't use STOR in a callback, everything works fine. But then, how am I supposed to get the data from the RETR command?
I know of possible solutions, but I'm sure there must be a more elegant one:
- use urllib.request instead of RETR (what if there's no HTTP on the server?)
- reinitialize FTP connection in callback function (may be slower than expected because of waiting for the server to reconnect)
- use ftp.set_pasv(False) (the callback launches, but the script does not end and cannot use ftp.quit() or ftp.close())
According to the documentation of retrbinary:
The callback function is called for each block of data received, with a single string argument giving the data block.
This suggests that the callback is called while the data connection used to retrieve the file is still open and the RETR command is not yet completed. It is not possible with FTP to create a new data connection (in the same FTP session) while another is still active. Additionally, it looks like ftplib gets confused and considers the response to TYPE I to be the response for PASV:
File "C:\Python33\lib\ftplib.py", line 873, in parse227
raise error_reply(resp)
ftplib.error_reply: 200 Type set to I.
What you should do instead is call STOR only after the RETR has completed, i.e. let the callback store everything in the file, and then open the file only after retrbinary has returned.
But then, how am I supposed to get the data from the RETR command?
In your current callback you store the data inside a file and then you read the file. The callback should still store the data in the file, but reading it and calling STOR should be done outside the callback, right after retrbinary. You cannot RETR and STOR data in parallel.
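As an illustration only (not from the original answer), a minimal sketch of that restructuring, reusing the poster's paths and assuming the same FTP_HOSTS configuration:
import ftplib

def test():
    host_data = FTP_HOSTS['planz-norwegian']  # poster's config, assumed available
    ftp = ftplib.FTP(host=host_data['server'],
                     user=host_data['username'],
                     passwd=host_data['password'])

    # 1) RETR: the callback only writes blocks to a local file.
    with open('test.txt', 'wb') as local_file:
        ftp.retrbinary('RETR htdocs/test.txt', local_file.write)

    # 2) The RETR data connection is now closed; modify the local copy.
    with open('test.txt', 'w') as local_file:
        local_file.write('test')

    # 3) Only now open a new data connection with STOR.
    with open('test.txt', 'rb') as local_file:
        ftp.storbinary('STOR htdocs/test.txt', local_file)

    ftp.quit()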

Dropbox Python API: File size detection may have failed

I'm attempting to upload a text file to Dropbox using this code:
def uploadFile(file):
    f = open('logs/%s.txt' % file)
    response = client.put_file('/%s.txt' % file, f)
    print "Uploaded log file %s" % file
Connecting to Dropbox works perfectly fine; it's just that when I upload files I receive this error:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\Python27\lib\site-packages\dropbox_python_sdk-1.5.1-py2.7.egg\dropbox
\client.py", line 352, in put_file
return self.rest_client.PUT(url, file_obj, headers)
File "C:\Python27\lib\site-packages\dropbox_python_sdk-1.5.1-py2.7.egg\dropbox
\rest.py", line 265, in PUT
return cls.IMPL.PUT(*n, **kw)
File "C:\Python27\lib\site-packages\dropbox_python_sdk-1.5.1-py2.7.egg\dropbox
\rest.py", line 211, in PUT
return self.request("PUT", url, body=body, headers=headers, raw_response=raw
_response)
File "C:\Python27\lib\site-packages\dropbox_python_sdk-1.5.1-py2.7.egg\dropbox
\rest.py", line 174, in request
raise util.AnalyzeFileObjBug(clen, bytes_read)
dropbox.util.AnalyzeFileObjBug:
Expected file object to have 18 bytes, instead we read 17 bytes.
File size detection may have failed (see dropbox.util.AnalyzeFileObj)
Google has given me no help with this one.
Sounds like you are a victim of newline unification. The file object reports a file size of 18 bytes ("abcdefghijklmnop\r\n") but you read only 17 bytes ("abcdefghijklmnop\n").
Open the file in binary mode to avoid this:
f = open('logs/%s.txt' % file, 'rb')
The default is to use text mode, which may convert '\n' characters to a platform-specific representation on writing and back on reading.
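Applied to the helper from the question, the fix is just the mode flag; a sketch keeping the original Python 2 style (client and the logs/ path are the poster's):
def uploadFile(file):
    # 'rb' prevents newline translation, so the detected size matches the bytes read.
    with open('logs/%s.txt' % file, 'rb') as f:
        response = client.put_file('/%s.txt' % file, f)
    print "Uploaded log file %s" % file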

Python base64.decode does not seem to work on windows

I am consuming a web service (written in Java) that basically returns a byte[] array (the SOAP equivalent is base64-encoded binary data).
I am using the Python suds library, and the following code works for me on my Mac (and on Cygwin under Windows), but the decoding does not work on vanilla Windows (Python 2.6.5). I am primarily a Java developer, so any help will be much appreciated.
from suds.client import Client
import base64,os,shutil,tarfile,StringIO
u = "user"
p = "password"
url = "https://xxxx/?wsdl"
client = Client(url, username=u, password=p)
bin = client.service.getTargz("test")
f = open("tools.tar.gz", "w")
f.write(base64.b64decode(bin.encode('ASCII')))
f.close()
print "finished writing"
tarfile.open("tools.tar.gz").extractall()
This works great on a Mac, but on Windows it gives me this error:
C:\client>python client.py
xml
Getting the sysprep file from the webservice
finished writing
Traceback (most recent call last):
File "client.py", line 28, in
tarfile.open("tools.tar.gz").extractall()
File "C:\Python26\lib\tarfile.py", line 1653, in open
return func(name, "r", fileobj, **kwargs)
File "C:\Python26\lib\tarfile.py", line 1720, in gzopen
**kwargs)
File "C:\Python26\lib\tarfile.py", line 1698, in taropen
return cls(name, mode, fileobj, **kwargs)
File "C:\Python26\lib\tarfile.py", line 1571, in __init__
self.firstmember = self.next()
File "C:\Python26\lib\tarfile.py", line 2317, in next
tarinfo = self.tarinfo.fromtarfile(self)
File "C:\Python26\lib\tarfile.py", line 1235, in fromtarfile
buf = tarfile.fileobj.read(BLOCKSIZE)
File "C:\Python26\lib\gzip.py", line 219, in read
self._read(readsize)
File "C:\Python26\lib\gzip.py", line 271, in _read
uncompress = self.decompress.decompress(buf)
zlib.error: Error -3 while decompressing: invalid distance too far back
Try
f = open("tools.tar.gz", "wb")
It's crucial to tell Python that this is a binary file (in Py3 this also matters on Unixy systems, but in Py2 it's not strictly needed there, which is why your code works on Mac OS X): the default is text mode, which, on Windows, translates each \n written into \r\n on disk.
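For completeness, the write step from the question with the binary flag applied (everything else unchanged; bin is the base64 string returned by the web service):
f = open("tools.tar.gz", "wb")  # 'b' prevents \n -> \r\n translation on Windows
f.write(base64.b64decode(bin.encode('ASCII')))
f.close()
tarfile.open("tools.tar.gz").extractall()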
