Save a PDF file stored in MongoDB GridFS using Python

I had uploaded some PDF and PNG files to a local instance of MongoDB. By mistake I deleted these files and can no longer recover them using the regular recovery options. However, they are still in my local MongoDB database. How can I save them back in their original format on my computer?
I know the following:
import pymongo as pym
import gridfs

def connectToDb():
    client = pym.MongoClient('mongodb://localhost:27017/')
    db = client.questionbank
    collectn = db.questionbank
    fs = gridfs.GridFS(db)
    return db, collectn, fs

db, collectn, fs = connectToDb()
filelist = list(db.fs.files.find({}, {"_id": 1, "filename": 1}))
fileid = filelist[0]['_id']
fobj = fs.get(fileid)
## I don't know what to do after this. I think I cannot use read since I don't
## want the string. I want to save the pdf file as a pdf file.
Any help will be greatly appreciated. Thanks in advance.

Okay, I figured this out on my own. It can be done in the following way:
To the above code add the lines:
f = open('tempfigfile.pdf', 'wb')
f.write(fobj.read())
f.close()
This saves the file as tempfigfile.pdf.
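As a small extra sketch (not part of the original answer), the GridOut object also knows the filename that was stored at upload time, so you could reuse it when writing the file back out, assuming a filename was recorded:
# Minimal sketch: restore the file under its stored name (assumes fobj.filename is set).
outname = fobj.filename or 'recovered_file.pdf'
with open(outname, 'wb') as out:
    out.write(fobj.read())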

This code will save all the files from MongoDB GridFS to your local folder:
for fi in fs.find():
    with open("C:\\localfolder\\" + fi.filename, "wb") as f:
        f.write(fi.read())

Related

Can you upload to S3 using a stream rather than a local file?

I need to create a CSV and upload it to an S3 bucket. Since I'm creating the file on the fly, it would be better if I could write it directly to an S3 bucket as it is being created, rather than writing the whole file locally and then uploading it at the end.
Is there a way to do this? My project is in Python and I'm fairly new to the language. Here is what I tried so far:
import csv
import io
import boto
from boto.s3.key import Key

conn = boto.connect_s3()
bucket = conn.get_bucket('dev-vs')
k = Key(bucket)
k.key = 'foo/foobar'

fieldnames = ['first_name', 'last_name']
writer = csv.DictWriter(io.StringIO(), fieldnames=fieldnames)
k.set_contents_from_stream(writer.writeheader())
I received this error: BotoClientError: s3 does not support chunked transfer
UPDATE: I found a way to write directly to S3, but I can't find a way to clear the buffer without actually deleting the lines I already wrote. So, for example:
conn = boto.connect_s3()
bucket = conn.get_bucket('dev-vs')
k = Key(bucket)
k.key = 'foo/foobar'

testDict = [{
    "fieldA": "8",
    "fieldB": None,
    "fieldC": "888888888888"},
    {
    "fieldA": "9",
    "fieldB": None,
    "fieldC": "99999999999"}]

f = io.StringIO()
fieldnames = ['fieldA', 'fieldB', 'fieldC']
writer = csv.DictWriter(f, fieldnames=fieldnames)
writer.writeheader()
k.set_contents_from_string(f.getvalue())

for row in testDict:
    writer.writerow(row)
    k.set_contents_from_string(f.getvalue())

f.close()
This writes 3 lines to the file; however, I'm unable to release memory to write a big file. If I add:
f.seek(0)
f.truncate(0)
to the loop, then only the last line of the file is written. Is there any way to release resources without deleting lines from the file?
I did find a solution to my question, which I will post here in case anyone else is interested. I decided to do this as parts in a multipart upload, since you can't stream to S3 directly. There is also a package available that turns your streaming file into a multipart upload, which I used: Smart Open.
import smart_open
import io
import csv

testDict = [{
    "fieldA": "8",
    "fieldB": None,
    "fieldC": "888888888888"},
    {
    "fieldA": "9",
    "fieldB": None,
    "fieldC": "99999999999"}]

fieldnames = ['fieldA', 'fieldB', 'fieldC']
f = io.StringIO()
with smart_open.smart_open('s3://dev-test/bar/foo.csv', 'wb') as fout:
    writer = csv.DictWriter(f, fieldnames=fieldnames)
    writer.writeheader()
    fout.write(f.getvalue())
    for row in testDict:
        f.seek(0)
        f.truncate(0)
        writer.writerow(row)
        fout.write(f.getvalue())
f.close()
Here is a complete example using boto3:
import boto3
import io

session = boto3.Session(
    aws_access_key_id="...",
    aws_secret_access_key="..."
)
s3 = session.resource("s3")

buff = io.BytesIO()
buff.write("test1\n".encode())
buff.write("test2\n".encode())

s3.Object(bucket, keypath).put(Body=buff.getvalue())
We were trying to upload file contents to S3 when they came through as an InMemoryUploadedFile object in a Django request. We ended up doing the following because we didn't want to save the file locally. Hope it helps:
@action(detail=False, methods=['post'])
def upload_document(self, request):
    document = request.data.get('image').file
    s3.upload_fileobj(document, BUCKET_NAME,
                      DESIRED_NAME_OF_FILE_IN_S3,
                      ExtraArgs={"ServerSideEncryption": "aws:kms"})
According to the docs it's possible:
s3.Object('mybucket', 'hello.txt').put(Body=open('/tmp/hello.txt', 'rb'))
so we can use StringIO in the ordinary way.
Update: the smart_open lib from inquiring minds' answer is a better solution.
There's an interesting code solution mentioned in a GitHub smart_open issue (#82) that I've been meaning to try out. Copy-pasting here for posterity... looks like boto3 is required:
import csv
import gzip
import io

import boto3

csv_data = io.BytesIO()
writer = csv.writer(csv_data)
writer.writerows(my_data)

gz_stream = io.BytesIO()
with gzip.GzipFile(fileobj=gz_stream, mode="w") as gz:
    gz.write(csv_data.getvalue())
gz_stream.seek(0)

s3 = boto3.client('s3')
s3.upload_fileobj(gz_stream, bucket_name, key)
This specific example is streaming to a compressed S3 key/file, but it seems like the general approach -- using the boto3 S3 client's upload_fileobj() method in conjunction with a target stream, not a file -- should work.
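For instance, a minimal uncompressed sketch of that general approach (my own, with placeholder bucket/key names and toy data) might look like this:
import csv
import io

import boto3

# Build the CSV in memory, then hand the bytes to upload_fileobj() as a stream.
buf = io.StringIO()
writer = csv.writer(buf)
writer.writerows([["first_name", "last_name"], ["Jane", "Doe"]])

s3 = boto3.client("s3")
s3.upload_fileobj(io.BytesIO(buf.getvalue().encode("utf-8")),
                  "my-bucket", "foo/foobar.csv")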
There's a well supported library for doing just this:
pip install s3fs
s3fs is really trivial to use:
import s3fs

s3 = s3fs.S3FileSystem(anon=False)
with s3.open('mybucket/new-file', 'wb') as f:
    f.write(2*2**20 * b'a')
    f.write(2*2**20 * b'a')
Incidentally, there's also something built into boto3 (backed by the AWS API) called MultiPartUpload.
It isn't factored as a Python stream, which might be an advantage for some people. Instead, you can start an upload and send parts one at a time.
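A rough sketch of those low-level multipart calls (my own, not from the answer above; the bucket, key, and tiny example chunks are placeholders):
import boto3

s3 = boto3.client("s3")
bucket, key = "my-bucket", "foo/foobar.csv"

# Start the upload, send each buffered chunk as a numbered part, then complete it.
# Note: in a real upload every part except the last must be at least 5 MB;
# the tiny chunks here are only for illustration.
mpu = s3.create_multipart_upload(Bucket=bucket, Key=key)
parts = []
for number, chunk in enumerate([b"first_name,last_name\n", b"Jane,Doe\n"], start=1):
    resp = s3.upload_part(Bucket=bucket, Key=key, UploadId=mpu["UploadId"],
                          PartNumber=number, Body=chunk)
    parts.append({"ETag": resp["ETag"], "PartNumber": number})
s3.complete_multipart_upload(Bucket=bucket, Key=key, UploadId=mpu["UploadId"],
                             MultipartUpload={"Parts": parts})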
To write a string to an S3 object, use:
s3.Object('my_bucket', 'my_file.txt').put(Body='Hello there')
So convert the stream to a string and you're there.
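A quick sketch of that last step (my wording, with placeholder names):
import io

import boto3

s3 = boto3.resource("s3")
f = io.StringIO()
f.write("Hello there")
# getvalue() turns the in-memory stream into a plain string for put().
s3.Object("my_bucket", "my_file.txt").put(Body=f.getvalue())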

How to get data from S3 and do some work on it? Python and boto

I have a project task to use some output data I have already produced on S3 in an EMR task. So previously I ran an EMR job that produced some output in one of my S3 buckets in the form of multiple files named part-xxxx. Now I need to access those files from within my new EMR job, read their contents, and use that data to produce another output.
This is the local code that does the job:
def reducer_init(self):
    self.idfs = {}
    for fname in os.listdir(DIRECTORY):  # look through file names in the directory
        file = open(os.path.join(DIRECTORY, fname))  # open a file
        for line in file:  # read each line in json file
            term_idf = JSONValueProtocol().read(line)[1]  # parse the line as a JSON object
            self.idfs[term_idf['term']] = term_idf['idf']

def reducer(self, term_poster, howmany):
    tfidf = sum(howmany) * self.idfs[term_poster['term']]
    yield None, {'term_poster': term_poster, 'tfidf': tfidf}
This runs just fine locally, but the problem is that the data I need is now on S3, and I need to access it somehow in the reducer_init function.
This is what I have so far, but it fails while executing on EC2:
def reducer_init(self):
    self.idfs = {}
    b = conn.get_bucket(bucketname)
    idfparts = b.list(destination)
    for key in idfparts:
        file = open(os.path.join(idfparts, key))
        for line in file:
            term_idf = JSONValueProtocol().read(line)[1]  # parse the line as a JSON object
            self.idfs[term_idf['term']] = term_idf['idf']

def reducer(self, term_poster, howmany):
    tfidf = sum(howmany) * self.idfs[term_poster['term']]
    yield None, {'term_poster': term_poster, 'tfidf': tfidf}
AWS access info is defined as follows:
awskey = '*********'
awssecret = '***********'
conn = S3Connection(awskey, awssecret)
bucketname = 'mybucket'
destination = '/path/to/previous/output'
There are two ways of doing this:
Download the file into your local system and parse it (kinda simple, quick and easy).
Get the data stored on S3 into memory and parse it (a bit more complex in the case of huge files).
Step 1:
On S3, filenames are stored as keys; if you have a file named "Demo" stored in a folder named "DemoFolder", then the key for that particular file would be "DemoFolder/Demo".
Use the code below to download the file into a temp folder.
from boto.s3 import connect_to_region
from boto.s3.connection import Location

AWS_KEY = 'xxxxxxxxxxxxxxxxxx'
AWS_SECRET_KEY = 'xxxxxxxxxxxxxxxxxxxxxxxxxx'
BUCKET_NAME = 'DemoBucket'
fileName = 'Demo'

conn = connect_to_region(Location.USWest2,
                         aws_access_key_id=AWS_KEY,
                         aws_secret_access_key=AWS_SECRET_KEY,
                         is_secure=False,
                         host='s3-us-west-2.amazonaws.com')
source_bucket = conn.lookup(BUCKET_NAME)

''' Download the file '''
for name in source_bucket.list():
    if fileName in name.name:  # match keys that contain the target file name
        print("DOWNLOADING", fileName)
        name.get_contents_to_filename(tempPath)
You can then work on the file in that temp path.
Step 2:
You can also fetch the data as a string using data = name.get_contents_as_string(). In the case of huge files (> 1 GB) you may come across memory errors; to avoid them, you will have to write a lazy function which reads the data in chunks.
For example, you can use a Range header to fetch part of the file: data = name.get_contents_as_string(headers={'Range': 'bytes=%s-%s' % (0, 100000000)}).
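A minimal sketch of such a lazy reader (my own, assuming a boto Key object such as name above and reusing the same Range-header trick):
def iter_key_in_chunks(key, chunk_size=64 * 1024 * 1024):
    """Yield the contents of a boto S3 Key in byte ranges of chunk_size."""
    offset = 0
    while offset < key.size:
        end = min(offset + chunk_size, key.size) - 1
        yield key.get_contents_as_string(
            headers={'Range': 'bytes=%d-%d' % (offset, end)})
        offset = end + 1

# Usage: process each chunk without holding the whole object in memory.
for chunk in iter_key_in_chunks(name):
    pass  # parse or write out the chunk here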
I am not sure if I answered your question properly. I can write custom code for your requirement once I get some time; meanwhile, please feel free to post any query you have.

How to retrieve files with GridFS

I have inserted an image file with GridFS into MongoDB using Python, and I want to retrieve that file with another function. How can I retrieve the file? I am using Django and Python (2.7). Thanks in advance!
def file_grid(request):
    datafile = open('jobs.jpg', "rb")
    thedata = datafile.read()
    fs = gridfs.GridFS(db)
    stored = fs.put(thedata, filename="testimage")
    return HttpResponse("inserted")
fs = gridfs.GridFS(db)
gridout = fs.get_last_version("testimage")
The gridout object is an instance of GridOut for reading files. You could get all the bytes at once with gridout.read(), or iterate over chunks of bytes like:
for chunk in gridout:
    do_something_with(chunk)
GridFS chunks are about 256k by default.
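Since the question asks how to retrieve the file from another Django function, here is a minimal sketch of such a view (my own, assuming the same db handle and the "testimage" filename used above, and that the stored file is a JPEG):
import gridfs
from django.http import HttpResponse

def fetch_grid(request):
    fs = gridfs.GridFS(db)
    gridout = fs.get_last_version("testimage")
    # Send the stored bytes back to the client; the content type is assumed.
    response = HttpResponse(gridout.read(), content_type="image/jpeg")
    response['Content-Disposition'] = 'attachment; filename="jobs.jpg"'
    return response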

get local path of an uploaded file in django

I want to open an uploaded CSV file in the clean function of a Django form.
The code looks like this:
def clean(self):
    file_csv = self.cleaned_data['csv_file']
    records = csv.reader(open(file_csv.name, 'rU'), dialect=csv.excel_tab)
How do I get the local path of file_csv?
Could this work? It's using basic Python though...
import os
os.path.abspath(file_csv.name)
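Note that os.path.abspath(file_csv.name) only joins the name onto the current working directory, and an uploaded file is not guaranteed to exist on disk at all (small uploads arrive as an InMemoryUploadedFile). A hedged alternative sketch that skips the local path and reads the uploaded file object directly:
import csv

def clean(self):
    file_csv = self.cleaned_data['csv_file']
    # Read the uploaded bytes directly instead of looking for a path on disk.
    lines = file_csv.read().decode('utf-8').splitlines()
    records = csv.reader(lines, dialect=csv.excel_tab)
    for row in records:
        pass  # validate each row here
    return self.cleaned_data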

How do you read Excel files with xlrd on App Engine

I am using xlrd on App Engine, with Flask.
I can't read the input file and it keeps showing the same error message.
The code is:
def read_rows(inputfile):
    rows = []
    wb = xlrd.open_workbook(inputfile)
    sh = wb.sheet_by_index(0)
    for rownum in range(sh.nrows):
        rows.append(sh.row_values(rownum))
    return rows

@app.route('/process_input/', methods=['POST', 'GET'])
def process_input():
    inputfile = request.files['file']
    rows = read_rows(request.files['file'])
    payload = json.dumps(dict(rows=rows))
    return payload
I realize that this might be caused by not uploading and saving it as a file. Any workaround on this? This would help many others as well. Any help is appreciated, thx
Update: I found a solution, which I posted below. Those confused about using xlrd can refer to the open source project repo I posted. The key is passing the content of the file instead of the filename.
Found a solution finally.
Here's how I do it: instead of saving the file, I read its contents and let xlrd read them.
def read_rows(inputfile):
    rows = []
    wb = xlrd.open_workbook(file_contents=inputfile.read())
    sh = wb.sheet_by_index(0)
    for rownum in range(sh.nrows):
        rows.append(sh.row_values(rownum))
    return rows
This worked nicely and turned the Excel files into a JSON-able format. If you want to output the JSON, simply use json.dumps().
A full code example can be found at https://github.com/cjhendrix/HXLator/blob/master/gae/main.py; it features a full implementation of xlrd and shows how to work with the data.
Thanks for the pointers.
Use:
wb = xlrd.open_workbook(file_contents=inputfile.read())
The way you are invoking open_workbook expects what you're passing in to be a filename, not a Flask FileStorage object wrapping the actual file.
Judging from your traceback:
File "/Users/fauzanerichemmerling/Desktop/GAEHxl/gae/lib/xlrd/__init__.py", line 941, in biff2_8_load
    f = open(filename, open_mode)
You can try changing this line to:
f = filename
