I am using a resume-parsing Python library that accepts a PDF file and returns JSON. The code is as simple as this:
parsed_data = ResumeParser("file.pdf").get_extracted_data()
I wanted to expose an API around this, and in the API the PDF data is sent as a base64 string. So I first write the data to a file and then run the above code. My current code looks like this:
from base64 import b64decode

def parse(b64data):
    pdf_bytes = b64decode(b64data, validate=True)
    with open('tmp_file.pdf', 'wb') as f:
        f.write(pdf_bytes)
    parsed_data = ResumeParser("tmp_file.pdf").get_extracted_data()
    return parsed_data
Is there a better approach that avoids writing the data to a file? I am exposing this API as a serverless function, and I think I can save time by skipping the write.
References:
https://github.com/OmkarPathak/pyresparser (Library Used)
The library that you are using appears to accept a BytesIO object as an alternative to passing it a string that contains a filename. However, it also appears to expect that this BytesIO object has a name attribute from which it extracts an extension so it can determine the filetype. So, we will add a bogus name attribute that contains the string .pdf to our BytesIO object.
So, we should be able to use something like this:
import io
import base64

def parse(b64data):
    pdf_bytes = base64.b64decode(b64data, validate=True)
    bytesio = io.BytesIO(pdf_bytes)
    bytesio.name = '.pdf'
    parsed_data = ResumeParser(bytesio).get_extracted_data()
    return parsed_data
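As a sanity check, the decode-and-wrap step can be exercised on its own; the `to_pdf_stream` helper name and the dummy payload below are illustrative, not part of pyresparser:

```python
import base64
import io
import os

def to_pdf_stream(b64data):
    """Decode a base64 payload into a BytesIO that mimics a .pdf file."""
    pdf_bytes = base64.b64decode(b64data, validate=True)
    stream = io.BytesIO(pdf_bytes)
    # pyresparser only inspects the extension, so any name ending
    # in .pdf satisfies its filetype check.
    stream.name = 'upload.pdf'
    return stream

payload = base64.b64encode(b'%PDF-1.4 dummy').decode('ascii')
stream = to_pdf_stream(payload)
```

`ResumeParser(stream).get_extracted_data()` can then be called on the result exactly as with a filename.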
Related
I have a docx file stored in cloud storage, and I want to convert it to PDF and store it there as well using a Python cloud function. That means I don't want to download the file, convert it on my local machine, and upload it back to storage, so simple code like the following can't be used:
from docx import Document
from docx2pdf import convert

doc = Document()
doc.save(docx_path)
convert(docx_path, pdf_path)
I tried using io.BytesIO like the following example:
import io
from docx import Document
from docx2pdf import convert

file_stream = io.BytesIO()
file_stream_pdf = io.BytesIO()
doc = Document()
doc.save(file_stream)
file_stream.seek(0)
convert(file_stream, file_stream_pdf)
but I am getting the error:
TypeError: expected str, bytes or os.PathLike object, not BytesIO.
I know convert receives paths as strings, but I don't know how to use BytesIO in this case.
Is there a simple way to fix what I have done so far or perhaps a different approach?
Try writing the BytesIO object to a temporary file. See the tempfile documentation for more information.
import tempfile

temp_docx_file = tempfile.NamedTemporaryFile(suffix=".docx")
temp_docx_path = temp_docx_file.name
with open(temp_docx_path, "wb") as f:
    f.write(file_stream.read())
Then, create another temporary file to hold the new .pdf file.
temp_pdf_file = tempfile.NamedTemporaryFile(suffix=".pdf")
Now, convert:
temp_pdf_path = temp_pdf_file.name
convert(temp_docx_path, temp_pdf_path)
To read the PDF back into a BytesIO object:
with open(temp_pdf_path, "rb") as f:
    file_stream_pdf = io.BytesIO(f.read())
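The two halves above can be wrapped into small helpers (the helper names are mine, not from docx2pdf); the round trip through the filesystem is the only part docx2pdf actually needs:

```python
import io
import os
import tempfile

def stream_to_tempfile(stream: io.BytesIO, suffix: str) -> str:
    """Write an in-memory stream to a named temp file and return its path."""
    with tempfile.NamedTemporaryFile(suffix=suffix, delete=False) as tmp:
        tmp.write(stream.getvalue())
        return tmp.name

def tempfile_to_stream(path: str) -> io.BytesIO:
    """Read a temp file back into memory, then remove it."""
    with open(path, "rb") as f:
        data = io.BytesIO(f.read())
    os.remove(path)
    return data

# With docx2pdf this becomes (sketch):
# docx_path = stream_to_tempfile(file_stream, ".docx")
# pdf_path = stream_to_tempfile(io.BytesIO(), ".pdf")
# convert(docx_path, pdf_path)
# file_stream_pdf = tempfile_to_stream(pdf_path)
```

Note that `delete=False` is used so the files can be reopened by name on all platforms; the second helper cleans them up.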
I am using FastAPI to create an API that receives small audio files from a mobile app. In this API I process the signal, and I am able to return a response after classifying the sound. The final goal is to send the classification back to the user.
Here's what I am doing so far:
@app.post("/predict")
def predict(file: UploadFile = File(...)):  # upload the wav audio sent from the mobile app user
    with open(name_file, "wb") as buffer:
        shutil.copyfileobj(file.file, buffer)  # creating a file with the received audio data
    ...
    prev = test.my_classification_module(name_file)  # some processing; the goal response ends up in prev
In my_classification_module(), I have this:
X, sr = librosa.load(sound_file)
I want to avoid creating a file to be classified with librosa. I would like to do this with a temporary file, without using memory unnecessarily, and to avoid files overlapping when the app is used by multiple users.
If your function supports a file-like object, you could use the .file attribute of UploadFile, e.g., file.file (which is a SpooledTemporaryFile instance). If your function requires the file in bytes format, use the async .read() method instead (see the documentation). If you wish to keep your route defined with def instead of async def (have a look at this answer for more info on def vs async def), you can call the .read() method of the file-like object directly, e.g., file.file.read().
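The spooling behaviour can be sketched with the stdlib class that UploadFile wraps; the payload bytes here are just a stand-in for real wav data:

```python
import tempfile

# UploadFile.file is a SpooledTemporaryFile: small uploads stay in
# memory and larger ones spill to disk transparently, so it can be
# passed anywhere a file-like object is accepted.
upload = tempfile.SpooledTemporaryFile(max_size=1024 * 1024)
upload.write(b"fake wav bytes")
upload.seek(0)

# In a sync (`def`) route this is what file.file.read() returns;
# in an async route, `await file.read()` yields the same bytes.
data = upload.read()
```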
Update - How to resolve the "File contains data in an unknown format" error
Make sure the audio file is not corrupted. If, let's say, you saved it and opened it with a media player, would the sound file play?
Make sure you have the latest version of librosa module installed.
Try installing ffmpeg and adding it to the system path, as suggested here.
As described here and in the documentation, librosa.load() can take a file-like object as an alternative to a file path; thus, using file.file or file.file._file should normally be fine (as per the documentation, the _file attribute is either an io.BytesIO or io.TextIOWrapper object...).
However, as described in the documentation here and here, as well as in this github discussion, you could also use the soundfile module to read audio from file-like objects. Example:
import soundfile as sf
data, samplerate = sf.read(file.file)
You could also write the file contents of the uploaded file to a BytesIO stream, and then pass it to either sf.read() or librosa.load():
from io import BytesIO
contents = file.file.read()
buffer = BytesIO(contents)
data, samplerate = librosa.load(buffer)  # using the librosa module
#data, samplerate = sf.read(buffer)  # using the soundfile module
buffer.close()
Another option would be to save the file contents to a NamedTemporaryFile, which "has a visible name in the file system" that "can be used to open the file". Once you are done with it, you can manually delete it using os.remove() or os.unlink().
from tempfile import NamedTemporaryFile
import os
contents = file.file.read()
temp = NamedTemporaryFile(delete=False)
try:
    with temp as f:
        f.write(contents)
    data, samplerate = librosa.load(temp.name)  # using the librosa module
    #data, samplerate = sf.read(temp.name)  # using the soundfile module
except Exception:
    return {"message": "There was an error processing the file"}
finally:
    #temp.close()  # the `with` statement above takes care of closing the file
    os.remove(temp.name)
I have old code below that gzips data and stores it as JSON in S3, using the io library (so no file is saved locally). I am having trouble converting this same approach (i.e., using the io library for a buffer) to create a .txt file, push it into S3, and later retrieveve it. I know how to create txt files and push them into S3, but not how to use io in the process.
The value I want stored in the text file is just a variable with the string value 'test'.
Goal: Use the io library to save a string variable as a text file in S3 and be able to pull it down again.
x = 'test'

inmemory = io.BytesIO()
with gzip.GzipFile(fileobj=inmemory, mode='wb') as fh:
    with io.TextIOWrapper(fh, encoding='utf-8', errors='replace') as wrapper:
        wrapper.write(json.dumps(x, ensure_ascii=False, indent=2))
inmemory.seek(0)
s3_resource.Object(s3bucket, s3path + '.json.gz').upload_fileobj(inmemory)
inmemory.close()
Also, does anyone have recommended documentation on the io library and writing to file-like objects? The official documentation (https://docs.python.org/3/library/io.html, e.g. f = io.StringIO("some initial text data")) did not give me enough at my current level.
For the sake of brevity, it turns out there's a way to override the putObject call so that it takes a string of text instead of a file.
The original post is answered in Java, but this additional thread should be sufficient for a Python-specific answer.
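For a Python-specific sketch of the same idea: since upload_fileobj just needs a file-like object, the string can be wrapped in a BytesIO directly. The bucket and key names in the comments below are placeholders:

```python
import io

x = 'test'

# Wrap the string's UTF-8 bytes in a BytesIO so it can be handed to
# upload_fileobj exactly like the gzip example above.
buffer = io.BytesIO(x.encode('utf-8'))

# Upload (s3bucket / s3path are hypothetical, as in the question):
# s3_resource.Object(s3bucket, s3path + '.txt').upload_fileobj(buffer)

# Pulling it back down later:
# download = io.BytesIO()
# s3_resource.Object(s3bucket, s3path + '.txt').download_fileobj(download)
# text = download.getvalue().decode('utf-8')
```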
I'm trying to modify a csv that is uploaded into my flask application. I have the logic that works just fine when I don't upload it through flask.
import pandas as pd
import StringIO

with open('example.csv') as f:
    data = f.read()
data = data.replace(',"', ",'")
data = data.replace('",', "',")
df = pd.read_csv(StringIO.StringIO(data), header=None, sep=',', quotechar="'")
print df.head(10)
I upload it to flask and access it using
f = request.files['data_file']
When I run it through the code above, replacing open('example.csv') with open(f), I get the following error:
coercing to Unicode: need string or buffer, FileStorage found
I have figured out that the problem is the file type. I can't use open on my file because open expects a file name, and when the file is uploaded to Flask it is the file instance that gets passed to the open command. However, I don't know how to make this work. I've tried skipping the open command and just using data = f.read(), but that doesn't work. Any suggestions?
Thanks
FileStorage is a file-like wrapper around the incoming data. You can pass it directly to read_csv.
pd.read_csv(request.files['data_file'])
You most likely should not be performing those replace calls on the data, as the CSV module should handle that and the naive replacement can corrupt data in quoted columns. However, if you still need to, you can read the data out just like you were before.
data = request.files['data_file'].read()
If your data has a mix of quoting styles, you should fix the source of your data.
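For completeness, here is how that read-then-replace flow looks in Python 3, where read() returns bytes; the inline sample stands in for an uploaded file:

```python
import io

# request.files['data_file'].read() returns bytes in Python 3, so
# decode before applying the replacements, then wrap in StringIO
# for pandas.
raw = b'a,"b",c\n1,"2",3\n'
data = raw.decode('utf-8').replace(',"', ",'").replace('",', "',")
buffer = io.StringIO(data)
# df = pd.read_csv(buffer, header=None, sep=',', quotechar="'")
```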
Answering my own question in case someone else needs this.
FileStorage objects have a .stream attribute, which will be an io.BytesIO.
f = request.files['data_file']
df = pandas.read_csv(f.stream)
I'm working on a simple web app that pulls query information from a news article API. I'm looking to reduce client-side processing by stripping a JSON response of unnecessary information within my Flask server. I want to store the edited JSON in a database (currently just locally, in the code below).
Currently my python code looks like:
def get_query(query):
    response = urllib2.urlopen(link + '?q=' + query + '&fl=' + fields + '&api-key=' + key)
    result = response.read()
    # store json locally
    with open('static/json/' + query + '.json', 'w') as stored_json:
        json.dump(result, stored_json)
    with open('static/json/' + query + '.json', 'r') as stored_json:
        return json.load(stored_json)
My issues are:
a) I am unsure of how to properly edit the JSON. Currently in my JavaScript I am using the data in my ajax call as:
data.response.docs[i].headline.main;
where I would rather just store and return the docs object as JSON. I know the variable result in my Python code is a string, so I cannot write and return result.response.docs. I tried returning response.response.docs, but I realize this is incorrect.
b) My last four lines seem redundant; I was wondering how to place my return within the first open block. I tried both 'w+' and 'r+' with no luck.
I'm not sure I'm getting your question completely, but it sounds like what you want to do is:
1) receive the response
2) parse the json into a Python object
3) filter the data
4) store the filtered data locally (in a database, file, etc)
5) return the filtered data to the client
I am supposing that your json.dump / json.load combination was intended to get the JSON string into a format that you can manipulate easily (i.e., a Python object). If so, json.loads (emphasis on the s) does what you need. Try something like this:
import json

def get_query(query):
    response = urllib2.urlopen(...)
    result = json.loads(response.read())
    # result is a regular Python object holding the data from the json response
    filtered = filter_the_data(result)
    # filter_the_data is some function that manipulates data
    with open('outfile.json', 'w') as outfile:
        # here dump (no s) is used to serialize the data
        # back to json and store it on the filesystem as outfile.json
        json.dump(filtered, outfile)
    ...
At this point you have saved the data locally, and you still hold a reference to the filtered data. You can re-serialize it and send it to the client easily using Flask's jsonify function.
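A sketch of that last step; filter_the_data and the sample result here are hypothetical, and in the route you would return jsonify(filtered) instead of calling json.dumps yourself:

```python
import json

def filter_the_data(result):
    # Hypothetical filter: keep only the docs list the client needs.
    return result.get('response', {}).get('docs', [])

result = {'response': {'docs': [{'headline': {'main': 'Example'}}]}}
filtered = filter_the_data(result)

# In the Flask route this would be `return jsonify(filtered)`;
# the serialized payload is equivalent to:
payload = json.dumps(filtered)
```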
Hope it helps