I am currently working on a script for a Raspberry Pi that uses a SIM module to send data to an FTP server. The problem is that some of the data is quite large; I formatted it into CSV files, but they are still a bit too large to send over GPRS. Compressing them into .gz files reduces the size by a factor of 5, which is great, but the only way to send data is line by line. I was wondering if there is a way to send the contents of a gzip file without sending the uncompressed data. Here is my code so far:
import glob
import gzip

list_of_files = glob.glob('/home/pi/src/git/RPI/DATA/*.gz')
print(list_of_files)

for file_data in list_of_files:
    zipp = gzip.GzipFile(file_data, 'rb')
    file_content = zipp.read()
    #array = np.fromstring(file_content, dtype='f4')
    print(len(file_content))
    #AT commands to send the file_content to FTP server
Here the length returned is the length of the uncompressed data, but I want to retrieve the compressed contents of the gzip file. Is that doable?
Thanks for your help.
zipp = gzip.GzipFile(file_data, 'rb')
specifically requests unzipping. If you just want to read the raw binary gzip data, use a regular open:
zipp = open(file_data, 'rb')
You don't need to read the file into memory to fetch its size, though. The os.stat function lets you get information about a file's metadata without opening it.
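For instance, a minimal sketch of both ideas (the path below is a placeholder):

import os

path = '/home/pi/src/git/RPI/DATA/example.gz'  # placeholder path

# Compressed size on disk, without opening the file:
compressed_size = os.stat(path).st_size
print(compressed_size)

# Raw compressed bytes, ready to be sent over the modem as-is:
with open(path, 'rb') as f:
    compressed_bytes = f.read()
print(len(compressed_bytes))  # same value as compressed_size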
I start with a pandas dataframe and I want to save that as a zipped parquet file, all in memory without intermediate steps on the disk. I have the following:
from io import BytesIO
from zipfile import ZipFile

bytes_buffer = BytesIO()
df.to_parquet(bytes_buffer)
bytes_value = bytes_buffer.getvalue()

with ZipFile('example.zip', 'w') as zip_obj:
    zip_obj.write(bytes_buffer.getvalue())
But I get this error: ValueError: stat: embedded null character in path. I got my info from the only link I found on creating zip files from within memory: https://www.neilgrogan.com/py-bin-zip/
Thank you for your help :)
The correct way to do this is:
from io import BytesIO
from zipfile import ZipFile

bytes_buffer = BytesIO()
df.to_parquet(bytes_buffer)
bytes_value = bytes_buffer.getvalue()

with ZipFile('example.zip', 'w') as zip_obj:
    zip_obj.writestr('file.parquet', bytes_buffer.getvalue())
But you should note that storing Parquet files in a ZIP just for compression defeats a lot of the benefits of the Parquet format itself. By default Parquet is already compressed with the Snappy compression codec (but you can also use GZip, ZStandard, and others). The compression does not happen at the file level but at the column-chunk level. That means when you access the file, only the parts you want to read have to be decompressed. In contrast, when you put Parquet files into a ZIP, the whole file needs to be decompressed even if you only wanted to read a selection of columns.
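If the ZIP step was only meant to shrink the file further, a possibly simpler alternative (a sketch, assuming a pyarrow or fastparquet engine is installed) is to ask Parquet itself for a heavier codec and keep everything in memory:

from io import BytesIO

bytes_buffer = BytesIO()
# Parquet keeps its column-chunk-level compression; only the codec changes.
df.to_parquet(bytes_buffer, compression='gzip')
gzip_parquet_bytes = bytes_buffer.getvalue()  # still entirely in memory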
I have old code below that gzips a file and stores it as JSON in S3, using the io library (so a file does not get saved locally). I am having trouble converting this same approach (i.e. using the io library for a buffer) to create a .txt file, push it into S3, and later retrieve it. I know how to create .txt files and push them into S3 as well, but not how to use io in the process.
The value I want stored in the text file is just a variable with the string value 'test'.
Goal: Use the io library to save a string variable as a text file in S3 and be able to pull it down again.
import gzip
import io
import json

x = 'test'

inmemory = io.BytesIO()
with gzip.GzipFile(fileobj=inmemory, mode='wb') as fh:
    with io.TextIOWrapper(fh, encoding='utf-8', errors='replace') as wrapper:
        wrapper.write(json.dumps(x, ensure_ascii=False, indent=2))
inmemory.seek(0)
s3_resource.Object(s3bucket, s3path + '.json.gz').upload_fileobj(inmemory)
inmemory.close()
Also, I'd welcome any documentation anyone likes specifically about the io library and writing to files, because the official documentation (e.g. f = io.StringIO("some initial text data"), https://docs.python.org/3/library/io.html) just did not give me enough at my current level.
Duplicate.
For the sake of brevity: it turns out there's a way to override the putObject call so that it takes in a string of text instead of a file.
The original post is answered in Java, but this additional thread should be sufficient for a Python-specific answer.
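In boto3 terms, that idea looks roughly like the sketch below (the bucket and key names are placeholders, not from the original post):

import boto3

s3client = boto3.client('s3')
bucket = 'my-bucket'            # placeholder
key = 'some/prefix/test.txt'    # placeholder

# put_object accepts a bytes body directly, so no local file is needed.
s3client.put_object(Bucket=bucket, Key=key, Body='test'.encode('utf-8'))

# Pull it back down into memory.
obj = s3client.get_object(Bucket=bucket, Key=key)
print(obj['Body'].read().decode('utf-8'))  # -> 'test'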
I want to read a WAV file (which is on my FTP server) directly from the FTP server, without downloading it to my PC, in Python. Is that possible, and if yes, how?
I tried this solution Read a file in buffer from ftp python but it didn't work. I have a .wav audio file and I want to read it to get details like file size, byte rate, etc.
My code, with which I was able to read the WAV file locally:
import struct
from ftplib import FTP

global ftp
ftp = FTP('****', user='user-****', passwd='********')

fin = open("C3.WAV", "rb")

chunkID = fin.read(4)
print("ChunkID=", chunkID)

chunkSizeString = fin.read(4)  # Total size of file in bytes - 8 bytes
chunkSize = struct.unpack('I', chunkSizeString)  # 'I' treats the 4 bytes as an unsigned 32-bit integer
totalSize = chunkSize[0] + 8  # The subscript is used because struct.unpack returns a tuple
print("TotalSize=", totalSize)
For a quick implementation, you can make use of my FtpFile class from:
Get files names inside a zip file on FTP server without downloading whole archive
ftp = FTP(...)
fin = FtpFile(ftp, "C3.WAV")
# The rest of the code is the same
The code is a bit inefficient though, as each fin.read will open a new download data connection.
For a more efficient implementation, just download the whole header at once (I do not know the WAV header structure, so I'm downloading 10 KB here as an example):
from io import BytesIO
ftp = FTP(...)
fin = BytesIO(FtpFile(ftp, "C3.WAV").read(10240))
# The rest of the code is the same
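If the FtpFile helper isn't available, a simpler but less bandwidth-efficient sketch is to pull the whole file into memory with plain ftplib and parse the header from a BytesIO buffer (the server address and credentials below are placeholders):

import struct
from ftplib import FTP
from io import BytesIO

ftp = FTP('ftp.example.com', user='user', passwd='password')  # placeholders

buf = BytesIO()
ftp.retrbinary('RETR C3.WAV', buf.write)  # stream the file into memory, not onto disk
buf.seek(0)

chunkID = buf.read(4)
chunkSize = struct.unpack('I', buf.read(4))[0]
print("TotalSize=", chunkSize + 8)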
I have an app which manages a set of files, but those files are actually stored in Rackspace's CloudFiles, because most of the files will be ~100GB. I'm using CloudFiles' TempURL feature to allow downloading individual files, but sometimes the user will want to download a set of files. Downloading all those files and generating a local Zip file is impossible, since the server only has 40GB of disk space.
From the user's point of view, I want to implement it the way GMail does when you get an email with several pictures: it gives you a link to download a Zip file with all the images in it, and the download starts immediately.
How to accomplish this with Python/Django? I have found ZipStream and looks promising because of the iterator output, but it still only accepts filepaths as arguments, and the writestr method would need to fetch all the file data at once (~100GB).
Since Python 3.5 it is possible to stream a ZIP archive of huge files/folders chunk by chunk, by writing to an unseekable stream. So there is no need to use ZipStream now.
See my answer here.
And live example here: https://repl.it/#IvanErgunov/zipfilegenerator
If you don't have a filepath, but have chunks of bytes, you can drop open(path, 'rb') as entry from the example and replace iter(lambda: entry.read(16384), b'') with your iterable of bytes. And prepare the ZipInfo manually:
import time
import zipfile
from zipfile import ZipInfo

zinfo = ZipInfo(filename='any-name-of-your-non-existent-file',
                date_time=time.localtime(time.time())[:6])
zinfo.compress_type = zipfile.ZIP_STORED

# permissions:
if zinfo.filename[-1] == '/':
    # directory
    zinfo.external_attr = 0o40775 << 16  # drwxrwxr-x
    zinfo.external_attr |= 0x10          # MS-DOS directory flag
else:
    # file
    zinfo.external_attr = 0o600 << 16    # ?rw-------
You should also remember that the zipfile module writes chunks of its own size. So if you feed it a 512-byte piece, the stream will receive data only when, and in whatever size, the zipfile module decides. That depends on the compression algorithm, but I think it is not a problem, because the zipfile module makes small chunks <= 16384.
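To make that concrete, here is a rough sketch of the unseekable-stream approach (Python 3.6+, since it uses ZipFile.open(..., mode='w')). The UnseekableStream class, zip_generator, and the entries iterable are my own names, standing in for whatever fetches chunks from CloudFiles:

import io
import time
import zipfile

class UnseekableStream(io.RawIOBase):
    """Collects whatever zipfile writes, so it can be handed out chunk by chunk."""
    def __init__(self):
        self._buffer = b''

    def writable(self):
        return True

    def write(self, b):
        self._buffer += b
        return len(b)

    def pop(self):
        chunk, self._buffer = self._buffer, b''
        return chunk

def zip_generator(entries):
    """entries: iterable of (filename, iterable_of_byte_chunks)."""
    stream = UnseekableStream()
    with zipfile.ZipFile(stream, mode='w') as zf:
        for name, chunks in entries:
            zinfo = zipfile.ZipInfo(filename=name,
                                    date_time=time.localtime(time.time())[:6])
            zinfo.compress_type = zipfile.ZIP_STORED
            with zf.open(zinfo, mode='w', force_zip64=True) as entry:
                for chunk in chunks:          # e.g. chunks fetched from CloudFiles
                    entry.write(chunk)
                    yield stream.pop()        # emit the archive as it grows
    yield stream.pop()                        # central directory written on close

# e.g. in Django: StreamingHttpResponse(zip_generator(...), content_type='application/zip')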
You can use https://pypi.python.org/pypi/tubing. Here's an example using S3; you could pretty easily create a Rackspace CloudFiles Source. Create a custom Writer (instead of sinks.Objects) to stream the data somewhere else, and custom Transformers to transform the stream.
from tubing.ext import s3
from tubing import pipes, sinks
output = s3.S3Source(bucket, key) \
| pipes.Gunzip() \
| pipes.Split(on=b'\n') \
| sinks.Objects()
print(len(output))
Check this out - it's part of the Python Standard Library:
http://docs.python.org/3/library/zipfile.html#zipfile-objects
You can give it an open file or file-like-object.
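For instance, a tiny illustration building the archive in an in-memory buffer instead of a path:

from io import BytesIO
from zipfile import ZipFile

buf = BytesIO()
with ZipFile(buf, 'w') as zf:
    zf.writestr('hello.txt', b'hello')
zip_bytes = buf.getvalue()  # the complete archive, never written to disk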
I have a large local file. I want to upload a gzipped version of that file into S3 using the boto library. The file is too large to gzip it efficiently on disk prior to uploading, so it should be gzipped in a streamed way during the upload.
The boto library knows a function set_contents_from_file() which expects a file-like object it will read from.
The gzip library knows the class GzipFile which can get an object via the parameter named fileobj; it will write to this object when compressing.
I'd like to combine these two functions, but the one API wants to read by itself, the other API wants to write by itself; neither knows a passive operation (like being written to or being read from).
Does anybody have an idea on how to combine these in a working fashion?
EDIT: I accepted one answer (see below) because it hinted me on where to go, but if you have the same problem, you might find my own answer (also below) more helpful, because I implemented a solution using multipart uploads in it.
I implemented the solution hinted at in the comments of the accepted answer by garnaat:
import cStringIO
import gzip

def sendFileGz(bucket, key, fileName, suffix='.gz'):
    key += suffix
    mpu = bucket.initiate_multipart_upload(key)
    stream = cStringIO.StringIO()
    compressor = gzip.GzipFile(fileobj=stream, mode='w')

    def uploadPart(partCount=[0]):
        partCount[0] += 1
        stream.seek(0)
        mpu.upload_part_from_file(stream, partCount[0])
        stream.seek(0)
        stream.truncate()

    with file(fileName) as inputFile:
        while True:  # until EOF
            chunk = inputFile.read(8192)
            if not chunk:  # EOF?
                compressor.close()
                uploadPart()
                mpu.complete_upload()
                break
            compressor.write(chunk)
            if stream.tell() > 10<<20:  # min size for multipart upload is 5242880
                uploadPart()
It seems to work without problems. And after all, streaming is in most cases just a chunking of the data. In this case, the chunks are about 10MB large, but who cares? As long as we aren't talking about several GB chunks, I'm fine with this.
Update for Python 3:
from io import BytesIO
import gzip

def sendFileGz(bucket, key, fileName, suffix='.gz'):
    key += suffix
    mpu = bucket.initiate_multipart_upload(key)
    stream = BytesIO()
    compressor = gzip.GzipFile(fileobj=stream, mode='w')

    def uploadPart(partCount=[0]):
        partCount[0] += 1
        stream.seek(0)
        mpu.upload_part_from_file(stream, partCount[0])
        stream.seek(0)
        stream.truncate()

    with open(fileName, "rb") as inputFile:
        while True:  # until EOF
            chunk = inputFile.read(8192)
            if not chunk:  # EOF?
                compressor.close()
                uploadPart()
                mpu.complete_upload()
                break
            compressor.write(chunk)
            if stream.tell() > 10<<20:  # min size for multipart upload is 5242880
                uploadPart()
You can also compress bytes with gzip and upload them as follows:
import gzip
import boto3

cred = boto3.Session().get_credentials()

s3client = boto3.client('s3',
    aws_access_key_id=cred.access_key,
    aws_secret_access_key=cred.secret_key,
    aws_session_token=cred.token,
)

bucketname = 'my-bucket-name'
key = 'filename.gz'

s_in = b"Lots of content here"
gzip_object = gzip.compress(s_in)

s3client.put_object(Bucket=bucketname, Body=gzip_object, Key=key)
You can replace s_in with any bytes object, e.g. the contents of an io.BytesIO buffer, a pickle dump, a file read into memory, etc.
If you want to upload compressed Json then here is a nice example: Upload compressed Json to S3
There really isn't a way to do this because S3 doesn't support true streaming input (i.e. chunked transfer encoding). You must know the Content-Length prior to upload and the only way to know that is to have performed the gzip operation first.