how to upload chunks of a string longer than 2147483647 bytes? - python

I am trying to upload a file of around 5 GB as shown below, but it throws the error string longer than 2147483647 bytes. It sounds like there is a 2 GB limit on uploads. Is there a way to upload the data in chunks? Can anyone provide guidance?
logger.debug(attachment_path)
currdir = os.path.abspath(os.getcwd())
os.chdir(os.path.dirname(attachment_path))
headers = self._headers
headers['Content-Type'] = content_type
headers['X-Override-File'] = 'true'
if not os.path.exists(attachment_path):
    raise Exception, "File path was invalid, no file found at the path %s" % attachment_path
filesize = os.path.getsize(attachment_path)
fileToUpload = open(attachment_path, 'rb').read()
logger.info(filesize)
logger.debug(headers)
r = requests.put(self._baseurl + 'problems/' + problemID + "/" + attachment_type + "/" + urllib.quote(os.path.basename(attachment_path)),
                 headers=headers, data=fileToUpload, timeout=300)
ERROR:
string longer than 2147483647 bytes
UPDATE:
def read_in_chunks(file_object, chunk_size=30720*30720):
    """Lazy function (generator) to read a file piece by piece.
    Default chunk size: 30720*30720 bytes (roughly 900 MB)."""
    while True:
        data = file_object.read(chunk_size)
        if not data:
            break
        yield data
f = open(attachment_path)
for piece in read_in_chunks(f):
    r = requests.put(self._baseurl + 'problems/' + problemID + "/" + attachment_type + "/" + urllib.quote(os.path.basename(attachment_path)),
                     headers=headers, data=piece, timeout=300)

Your question has been asked on the requests bug tracker; their suggestion is to use streaming upload. If that doesn't work, you might see if a chunk-encoded request works.
[edit]
Example based on the original code:
# Using `with` here will handle closing the file implicitly
with open(attachment_path, 'rb') as file_to_upload:
    r = requests.put(
        "{base}problems/{pid}/{atype}/{path}".format(
            base=self._baseurl,
            # It's better to use consistent naming; search PEP-8 for standard Python conventions.
            pid=problem_id,
            atype=attachment_type,
            path=urllib.quote(os.path.basename(attachment_path)),
        ),
        headers=headers,
        # Note that you're passing the file object, NOT the contents of the file:
        data=file_to_upload,
        # Hard to say whether this is a good idea with a large file upload
        timeout=300,
    )
I can't guarantee this would run as-is, since I can't realistically test it, but it should be close. The bug tracker comments I linked to also mention that sending multiple headers may cause issues, so if the headers you're specifying are actually necessary, this may not work.
Regarding chunk encoding: This should be your second choice. Your code was not specifying 'rb' as the mode for open(...), so changing that should probably make the code above work. If not, you could try this.
def read_in_chunks():
    # If you're going to chunk anyway, doesn't it seem like smaller ones than this would be a good idea?
    chunk_size = 30720 * 30720
    # I don't know how correct this is; if it doesn't work as expected, you'll need to debug
    with open(attachment_path, 'rb') as file_object:
        while True:
            data = file_object.read(chunk_size)
            if not data:
                break
            yield data

# Same request as above, just using the function to chunk explicitly; see the `data` param
r = requests.put(
    "{base}problems/{pid}/{atype}/{path}".format(
        base=self._baseurl,
        pid=problem_id,
        atype=attachment_type,
        path=urllib.quote(os.path.basename(attachment_path)),
    ),
    headers=headers,
    # Call the chunk function here and the request will be chunked as you specify
    data=read_in_chunks(),
    timeout=300,
)

Related

How do I split a PDF in google cloud storage using Python

I have a single PDF from which I would like to create a separate PDF for each of its pages. How would I be able to do so without downloading anything locally? I know that Document AI has a file splitting module (which would actually identify different files; that would be most ideal), but that is not available publicly.
I am currently using PyPDF2 to do this:
list_of_blobs = list(bucket.list_blobs(prefix='tmp/'))
print(len(list_of_blobs))
list_of_blobs[1].download_to_filename('/' + list_of_blobs[1].name)
inputpdf = PdfFileReader(open('/' + list_of_blobs[1].name, "rb"))
individual_files = []
stream = io.StringIO()
for i in range(inputpdf.numPages):
    output = PdfFileWriter()
    output.addPage(inputpdf.getPage(i))
    individual_files.append(output)
    with open("document-page%s.pdf" % (i + 1), "a") as outputStream:
        outputStream.write(stream.getvalue())
        #print(outputStream.read())
        with open(outputStream.name, 'rb') as f:
            data = f.seek(85)
            data = f.read()
            individual_files.append(data)
            bucket.blob('processed/' + "doc%s.pdf" % (i + 1)).upload_from_string(data, content_type='application/pdf')
In the output, I see different PyPDF2 objects such as
<PyPDF2.pdf.PdfFileWriter object at 0x12a2037f0>, but I have no idea how I should proceed next. I am also open to using other libraries if those work better.
There were two reasons why my program was not working:
I was trying to read a file opened in append mode (I fixed this by moving the second with open(...) block outside of the first one), and
I should have been writing bytes (I fixed this by changing the open mode to 'wb' instead of 'a').
Below is the corrected code:
if inputpdf.numPages > 2:
    for i in range(inputpdf.numPages):
        output = PdfFileWriter()
        output.addPage(inputpdf.getPage(i))
        with open("/tmp/document-page%s.pdf" % (i + 1), "wb") as outputStream:
            output.write(outputStream)
        with open(outputStream.name, 'rb') as f:
            data = f.seek(0)
            data = f.read()
            #print(data)
            bucket.blob(prefix + '/processed/' + "page-%s.pdf" % (i + 1)).upload_from_string(data, content_type='application/pdf')
        stream.truncate(0)
To split a PDF file into several small files (one per page), you need to download the data. You can materialize the data in a file (in the writable directory /tmp) or simply keep it in memory in a Python variable.
In both cases:
The data will reside in memory.
You need to get the data to perform the PDF split.
If you absolutely want to read the data as a stream (I don't know if that is even possible with the PDF format!), you can use the streaming feature of GCS. But because there is no CRC on the downloaded data, I wouldn't recommend this solution, unless you are ready to handle corrupted data, retries and all the related work.
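For illustration, here is a minimal sketch of the in-memory approach described above, splitting the PDF with PyPDF2 and uploading each page without touching the local filesystem. The bucket and object names are hypothetical, and download_as_bytes() assumes a reasonably recent google-cloud-storage client (older releases call it download_as_string()):
import io
from google.cloud import storage
from PyPDF2 import PdfFileReader, PdfFileWriter

client = storage.Client()
bucket = client.bucket('my-bucket')          # hypothetical bucket name
source_blob = bucket.blob('tmp/input.pdf')   # hypothetical object name

# Pull the whole PDF into memory
pdf_bytes = source_blob.download_as_bytes()
inputpdf = PdfFileReader(io.BytesIO(pdf_bytes))

for i in range(inputpdf.numPages):
    writer = PdfFileWriter()
    writer.addPage(inputpdf.getPage(i))

    # Write the single-page PDF into an in-memory buffer instead of /tmp
    page_buffer = io.BytesIO()
    writer.write(page_buffer)

    # Upload the page straight from memory
    bucket.blob('processed/page-%s.pdf' % (i + 1)).upload_from_string(
        page_buffer.getvalue(), content_type='application/pdf')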

Django 1.11 download file chunk by chunk

In my case, I have a Django 1.11 server acting as a proxy. When you click "download" in the browser, it sends a request to the Django proxy, which downloads files from another server and processes them, after which it must "send" them to the browser to allow the user to download them. My proxy downloads and processes the files chunk by chunk.
How can I send the chunks to the browser as they are ready, so that the user finally downloads a single file?
In practice, I have to let the user download a file that is not yet ready, like a stream.
def my_download(self, res):
    # some code
    file_handle = open(local_path, 'wb', self.chunk_size)
    for chunk in res.iter_content(self.chunk_size):
        i = i + 1
        print("index: ", i, "/", chunks)
        if i > chunks - 1:
            is_last = True
        # some code on the chunk
        # Here, instead of saving the chunk locally, I would like to let the user download it directly.
        file_handle.write(chunk)
    file_handle.close()
    return True
Thank you in advance, greetings.
This question should be flagged as a duplicate of this post: Serving large files ( with high loads ) in Django
Always try to find the answer before you create a question in SO, please!
Essentially the answer is included in Django's documentation, in the "Streaming large CSV files" example, and we will apply it to the question above:
You can use Django's StreamingHttpResponse and Python's wsgiref.util.FileWrapper to serve a large file in chunks effectively and without loading it into memory.
def my_download(request):
    file_path = 'path/to/file'
    chunk_size = DEFINE_A_CHUNK_SIZE_AS_INTEGER
    filename = os.path.basename(file_path)

    response = StreamingHttpResponse(
        FileWrapper(open(file_path, 'rb'), chunk_size),
        content_type="application/octet-stream"
    )
    response['Content-Length'] = os.path.getsize(file_path)
    response['Content-Disposition'] = "attachment; filename=%s" % filename
    return response
Now if you want to apply some processing to the file chunk-by-chunk you can utilize FileWrapper's generated iterator:
Place your chunk processing code in a function which MUST return the chunk:
def chunk_processing(chunk):
    # Process your chunk here
    # Be careful to preserve the chunk's initial size.
    return processed_chunk
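As a purely illustrative instance of such a function (hypothetical; it just logs the chunk size and passes the bytes through unchanged, so the size is trivially preserved):
def chunk_processing(chunk):
    # Example processing: report the size, then return the chunk as-is
    print("processing %d bytes" % len(chunk))
    return chunk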
Now apply the function inside the StreamingHttpResponse:
response = StreamingHttpResponse(
    (chunk_processing(chunk) for chunk in FileWrapper(open(file_path, 'rb'), chunk_size)),
    content_type="application/octet-stream"
)

How to get image file size in python when fetching from URL (before deciding to save)

import urllib.request,io
url = 'http://www.image.com/image.jpg'
path = io.BytesIO(urllib.request.urlopen(url).read())
I'd like to check the file size of the URL image in the filestream path before saving; how can I do this?
Also, I don't want to rely on Content-Length headers; I'd like to fetch the image into a filestream, check the size and then save it.
You can get the size of the io.BytesIO() object the same way you can get it for any file object: by seeking to the end and asking for the file position:
path = io.BytesIO(urllib.request.urlopen(url).read())
path.seek(0, 2) # 0 bytes from the end
size = path.tell()
However, you could just as easily have just taken the len() of the bytestring you just read, before inserting it into an in-memory file object:
data = urllib.request.urlopen(url).read()
size = len(data)
path = io.BytesIO(data)
Note that this means your image has already been loaded into memory. You cannot use this to prevent loading too large an image object. For that using the Content-Length header is the only option.
If the server uses a chunked transfer encoding to facilitate streaming (so no content length has been set up front), you can use a loop to limit how much data is read.
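A minimal sketch of that idea, assuming you only want to keep images up to some chosen cap (max_size below is an arbitrary example limit):
import io
import urllib.request

url = 'http://www.image.com/image.jpg'  # example URL from the question
max_size = 5 * 1024 * 1024              # hypothetical 5 MB cap
chunk_size = 64 * 1024

response = urllib.request.urlopen(url)
buf = io.BytesIO()
while True:
    chunk = response.read(chunk_size)
    if not chunk:
        break
    buf.write(chunk)
    if buf.tell() > max_size:
        raise ValueError("Image exceeds %d bytes, not saving it" % max_size)

size = buf.tell()  # number of bytes actually downloaded
path = buf         # same kind of in-memory file object as in the question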
Try importing urllib.request
import urllib.request, io
url = 'http://www.elsecarrailway.co.uk/images/Events/TeddyBear-3.jpg'
path = urllib.request.urlopen(url)
meta = path.info()
>>> meta.get("Content-Length")
'269898'  # i.e. about 269 kB
You could ask the server for the content-length information. Using urllib2 (which I hope is available in your Python):
req = urllib2.urlopen(url)
meta = req.info()
length_text = meta.getheader("Content-Length")
try:
    length = int(length_text)
except (TypeError, ValueError):
    # length unknown, you may need to read
    length = -1

KineticJS toDataURL() gives incorrect padding error in Python

I have a base64 string that I've acquired from KineticJS toDataURL(). It's a valid base64 string. This fiddle shows that it is: http://jsfiddle.net/FakRe/
My Problem: I'm sending this dataURI up to my server, decoding using python, and saving on the server. However, I'm getting a padding error. Here's my code:
def base64_to_file(self, image_data, image_name):
    extension = re.search('data\:image\/(\w+)\;', image_data)
    if extension:
        ext = extension.group(1)
        image_name = '{0}.{1}'.format(image_name, ext)
    else:
        # it's not in a format we understand
        return None
    image_data = image_data + '='  # fix incorrect padding
    image_path = os.path.join('/my/image/path/', image_name)
    image_file = open(image_path, 'w+')
    image_file.write(image_data.decode('base64'))
    image_file.close()
    return image_file
I can get around this padding error by doing this at the top of my function:
image_data = image_data + '=' #fixes incorrect padding
After I add the arbitrary padding, it decodes successfully and writes to the filesystem. However, whenever I try and view the image, it doesn't render, and gives me the 'broken image' icon. No 404, just a broken image. What gives? Any help would be greatly appreciated. I've already spent way too much time on this as it is.
Steps to recreate (May be pedantic but I'll try and help)
Grab the base64 string from the JSFiddle
Save it to a text file
Open up the file in python, read in the data, save to variable
Change '/path/to/my/image' in the function to anywhere on your machine
Pass your encoded text variable into the function along with a name
See the output
Again, any help would be awesome. Thanks.
If you need to add padding, you have the wrong string. Do make sure you are parsing the data URI correctly; the data:image/png;base64, section is metadata about the encoded value, and only the part after the comma is the actual Base64 value.
The actual data portion is 223548 characters long:
>>> len(image_data)
223548
>>> import hashlib
>>> hashlib.md5(image_data).hexdigest()
'03918c3508fef1286af8784dc65f23ff'
If your URI still includes the data: prefix, do parse that out:
from urllib import unquote

if image_data.startswith('data:'):
    params, data = image_data.split(',', 1)
    params = params[5:] or 'text/plain;charset=US-ASCII'
    params = params.split(';')
    if not '=' in params[0] and '/' in params[0]:
        mimetype = params.pop(0)
    else:
        mimetype = 'text/plain'
    if 'base64' in params:
        # handle base64 parameters first
        data = data.decode('base64')
    for param in params:
        if param.startswith('charset='):
            # handle characterset parameter
            data = unquote(data).decode(param.split('=', 1)[-1])
This then lets you make some more informed decisions about what extension to use for image URLs, for example, as you now have the mimetype parsed out as well.
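For example, once mimetype has been parsed out as above, a small sketch using the standard mimetypes module could pick the extension (the '.bin' fallback is an arbitrary choice, and note that guess_extension() may return a less common variant such as '.jpe' for image/jpeg on some Python versions):
import mimetypes

extension = mimetypes.guess_extension(mimetype) or '.bin'
image_name = '{0}{1}'.format(image_name, extension)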

How to delete a line from a file after it has been used

I'm trying to create a script which makes requests to random urls from a txt file e.g.:
import urllib2

with open('urls.txt') as urls:
    for url in urls:
        try:
            r = urllib2.urlopen(url)
        except urllib2.URLError as e:
            r = e
        if r.code in (200, 401):
            print '[{}]: '.format(url), "Up!"
But when some URL returns 404 Not Found, I want the line containing that URL to be erased from the file. There is one unique URL per line, so basically the goal is to erase every URL (and its corresponding line) that returns 404 Not Found.
How can I accomplish this?
You could simply save all the URLs that worked, and then rewrite them to the file:
good_urls = []
with open('urls.txt') as urls:
    for url in urls:
        try:
            r = urllib2.urlopen(url)
        except urllib2.URLError as e:
            r = e
        if r.code in (200, 401):
            print '[{}]: '.format(url), "Up!"
            good_urls.append(url)

with open('urls.txt', 'w') as urls:
    urls.write("".join(good_urls))
The easiest way is to read all the lines, loop over them and try to open each URL, and then, once you are done, rewrite the file if any URLs failed.
The way to rewrite the file is to write a new file, and once the new file is successfully written and closed, use os.rename() to change the name of the new file to the name of the old file, overwriting the old file. This is the safe way to do it; you never overwrite the good file until you know you have the new file correctly written.
I think the simplest way to do this is just to create a list where you collect the good URLs, plus have a count of failed URLs. If the count is not zero, you need to rewrite the text file. Or, you can collect the bad URLs in another list. I did that in this example code. (I haven't tested this code but I think it should work.)
import os
import urllib2

input_file = "urls.txt"
debug = True

good_urls = []
bad_urls = []

bad, good = range(2)

def track(url, good_flag, code):
    if good_flag == good:
        good_str = "good"
    elif good_flag == bad:
        good_str = "bad"
    else:
        good_str = "ERROR! (" + repr(good_flag) + ")"
    if debug:
        print("DEBUG: %s: '%s' code %s" % (good_str, url, repr(code)))
    if good_flag == good:
        good_urls.append(url)
    else:
        bad_urls.append(url)

with open(input_file) as f:
    for line in f:
        url = line.strip()
        try:
            r = urllib2.urlopen(url)
            if r.code in (200, 401):
                print '[{0}]: '.format(url), "Up!"
            if r.code == 404:
                # URL is bad if it is missing (code 404)
                track(url, bad, r.code)
            else:
                # any code other than 404, assume URL is good
                track(url, good, r.code)
        except urllib2.URLError as e:
            track(url, bad, "exception!")

# if any URLs were bad, rewrite the input file to remove them.
if bad_urls:
    # simple way to get a filename for temp file: append ".tmp" to filename
    temp_file = input_file + ".tmp"
    with open(temp_file, "w") as f:
        for url in good_urls:
            f.write(url + '\n')
    # if we reach this point, temp file is good. Remove old input file
    os.remove(input_file)  # only needed for Windows
    os.rename(temp_file, input_file)  # replace original input file with temp file
EDIT: In comments, @abarnert suggests that there might be a problem with using os.rename() on Windows (at least I think that is what they mean). If os.rename() doesn't work, you should be able to use shutil.move() instead.
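For reference, a drop-in sketch of that fallback, reusing the temp_file and input_file names from the code above:
import shutil

# shutil.move() handles cases where a plain os.rename() fails,
# e.g. when the destination already exists on Windows
shutil.move(temp_file, input_file)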
EDIT: Rewrite code to handle errors.
EDIT: Rewrite to add verbose messages as URLs are tracked. This should help with debugging. Also, I actually tested this version and it works for me.
