I need to download a binary file that is created by a pyscript script.
Pressing the "Download" button should result in downloading a file containing the binary data generated by the PyScript. However, the result is a plaintext .txt file with a single line inside it:
<gzip _io.BufferedWriter name='file_test.dat' 0xdce348>
Here's the code I use to create the "Download" button (the code lives inside an async Python function that is called after a "start the program" button is clicked):
# The file is opened; the file itself is stored in /home/pyodide/
f_in = open('datafile', 'rb')
# The contents of the file are written into a gzip.GzipFile
f = open('file_test.dat', 'wb')
gz = gzip.GzipFile('', 'wb', 8, f, 0.)
for ln in f_in:
    gz.write(ln)
    # Printing the file contents line by line yields correct data,
    # so I'm fairly certain the script itself generates data properly
    print(ln)
# A Blob object is created that is then attached to a new <a> tag
blob = Blob.new([gz], {type: "application/gzip"})
downloadDoc = document.createElement('a')
downloadDoc.href = window.URL.createObjectURL(blob)
# The <a> tag is appended to a <div> inside the main HTML file,
# and a download button is created
downloadButton = document.createElement('button')
downloadButton.innerHTML = "Download"
downloadDoc.appendChild(downloadButton)
document.getElementById("download-div").appendChild(downloadDoc)
I'm fairly new to web development (in fact this is my first web project), so there may be some obvious
thing I'm missing.
Thanks in advance!
UPDATE - SOLUTION
Closing all the files and passing the 'file_test.dat' file
to the pyodide.to_js() function before creating a Blob solves the problem. Here's the complete working code:
f_in = open('datafile', 'rb')
f = open('file_test.dat', 'wb')
gz = gzip.GzipFile('', 'wb', 8, f, 0.)
for ln in f_in:
    gz.write(ln)
gz.close()
f_in.close()
f.close()
# Reopening the file and passing its contents through pyodide's to_js() solved the issue
f2 = open('file_test.dat', 'rb')
content = to_js(f2.read())
blob = Blob.new([content], {type: ""})
downloadDoc = document.createElement('a')
downloadDoc.href = window.URL.createObjectURL(blob)
downloadButton = document.createElement('button')
downloadButton.innerHTML = "Download"
downloadDoc.appendChild(downloadButton)
document.getElementById("download-div").appendChild(downloadDoc)
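For what it's worth, the gzip step itself can be sanity-checked in plain Python, outside the browser, by compressing into an in-memory buffer and decompressing the result. This is a sketch with a made-up payload standing in for the contents of 'datafile':

```python
import gzip
import io

# Hypothetical payload standing in for the contents of 'datafile'
payload = b"line one\nline two\n"

# Compress into an in-memory buffer, using the same GzipFile arguments as above
buf = io.BytesIO()
with gzip.GzipFile('', 'wb', 8, buf, 0.) as gz:
    gz.write(payload)

compressed = buf.getvalue()

# Decompressing should give back the original bytes
assert gzip.decompress(compressed) == payload
```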
Related
I'm trying to wrap my head around this: the identical code produces two different outcomes, so there must be some fundamental difference between the environments it runs in.
This is the code I use:
from webdav3.client import Client

if __name__ == "__main__":
    client = Client(
        {
            "webdav_hostname": "http://some_address"
            + "project"
            + "/",
            "webdav_login": "somelogin",
            "webdav_password": "somepass",
        }
    )
    ci = "someci"
    version = "someversion"
    directory = f'release-{ci.replace("/", "-")}-{version}'
    client.webdav.disable_check = (
        True  # Needs to be disabled, as the check cannot be performed on the root
    )
    f = "a.rst"
    with open(f, "r") as fh:
        contents = fh.read()
        print(contents)
    evaluated = contents.replace("#PIPELINE_URL#", "DUMMY PIPELINE URL")
    with open(f, "w") as fh:
        fh.write(evaluated)
    print(contents)
    client.upload(local_path=f, remote_path=f)
The file a.rst contains some text like:
Please follow instruction link below
#####################################
`Click here for instructions <https://some_website>`_
When I execute this code from macOS, a file with the same contents as a.rst appears on my website.
When I execute the script from within a container (base image Python 3.9 plus the webdav dependencies), it creates a file on my website, but the content is always empty. I'm not sure why; perhaps it has something to do with the Docker container being unable to handle the special characters in the file (plain text seems to work, though)?
Anyone have any ideas as to why this is happening and how to fix it?
EDIT:
It seems that the character ":" is creating the problem.
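One common difference between a macOS shell and a slim container image is the default locale, and with it Python's default text encoding. Reading and writing the file with an explicit encoding rules that variable out. A sketch of the replace step with explicit encodings (file name and contents are made up):

```python
import os
import tempfile

# Hypothetical stand-in for a.rst, including the ":" and backtick characters
original = "Link: #PIPELINE_URL#\n`Click here <https://some_website>`_\n"

path = os.path.join(tempfile.mkdtemp(), "a.rst")
with open(path, "w", encoding="utf-8") as fh:
    fh.write(original)

# An explicit encoding makes the read independent of the container's locale
with open(path, "r", encoding="utf-8") as fh:
    contents = fh.read()

evaluated = contents.replace("#PIPELINE_URL#", "DUMMY PIPELINE URL")
with open(path, "w", encoding="utf-8") as fh:
    fh.write(evaluated)
```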
I'm trying to create a download function for my Streamlit app. What I currently have lets me download a zip file via a button in the app, but unfortunately it also saves the file to my local folder, which I don't want. The problem is how I initialize the file_zip object. I want the zip file to have a specific name, ideally the name of the file the user uploaded with a '.zip' extension (i.e. datafile, which carries the file name, is a parameter of the function). But every time I do that, it keeps saving the zip file to my local folder. Is there an alternative? BTW, I'm trying to save a list of pandas dataframes into one zip file.
def downloader(list_df, datafile, file_type):
    file = datafile.name.split(".")[0]
    # Create zip file
    with zipfile.ZipFile("{}.zip".format(file), 'w', zipfile.ZIP_DEFLATED) as file_zip:
        for i in range(len(list_df)):
            file_zip.writestr(file + "_group_{}".format(i) + ".csv",
                              pd.DataFrame(list_df[i]).to_csv())
    # Pass it to the front end for download
    zip_name = "{}.zip".format(file)
    with open(zip_name, "rb") as f:
        bytes = f.read()
        b64 = base64.b64encode(bytes).decode()
        href = f'<a href="data:application/zip;base64,{b64}" download="{zip_name}">Click Here To Download</a>'
        st.markdown(href, unsafe_allow_html=True)
It sounds like you want to create the zip file in memory and use it later to build a base64 encoding. You can use an io.BytesIO() object with ZipFile, rewind it, and read the data back for base64 encoding.
import io

def downloader(list_df, datafile, file_type):
    file = datafile.name.split(".")[0]
    # Create the zip file in memory
    zip_buf = io.BytesIO()
    with zipfile.ZipFile(zip_buf, 'w', zipfile.ZIP_DEFLATED) as file_zip:
        for i in range(len(list_df)):
            file_zip.writestr(file + "_group_{}".format(i) + ".csv",
                              pd.DataFrame(list_df[i]).to_csv())
    zip_buf.seek(0)
    # Pass it to the front end for download
    zip_name = "{}.zip".format(file)
    b64 = base64.b64encode(zip_buf.read()).decode()
    del zip_buf
    href = f'<a href="data:application/zip;base64,{b64}" download="{zip_name}">Click Here To Download</a>'
    st.markdown(href, unsafe_allow_html=True)
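A quick way to sanity-check the in-memory approach, independent of Streamlit, is to decode the base64 string back and reopen it as a zip. The member names and CSV payloads here are made up:

```python
import base64
import io
import zipfile

# Build a small zip entirely in memory (hypothetical CSV payloads)
zip_buf = io.BytesIO()
with zipfile.ZipFile(zip_buf, 'w', zipfile.ZIP_DEFLATED) as zf:
    zf.writestr("data_group_0.csv", "a,b\n1,2\n")
    zf.writestr("data_group_1.csv", "a,b\n3,4\n")
zip_buf.seek(0)

b64 = base64.b64encode(zip_buf.read()).decode()

# Decoding the base64 string yields a valid zip with the same members
restored = zipfile.ZipFile(io.BytesIO(base64.b64decode(b64)))
print(restored.namelist())  # → ['data_group_0.csv', 'data_group_1.csv']
```

Nothing ever touches the local disk, which is exactly what the question asks for.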
I managed to run Pyodide in the browser and created a hello.txt file. But how can I access it?
Pyodide https://github.com/iodide-project/pyodide/blob/master/docs/using_pyodide_from_javascript.md
pyodide.runPython('open("hello.txt", "w")')
What I tried in Chrome DevTools:
pyodide.runPython('os.chdir("../")')
pyodide.runPython('os.listdir()')
pyodide.runPython('os.path.realpath("hello.txt")')
Output for listdir
["hello.txt", "lib", "proc", "dev", "home", "tmp"]
Output for realpath
"/hello.txt"
Also,
pyodide.runPython('import platform')
pyodide.runPython('platform.platform()')
Output
"Emscripten-1.0-x86-JS-32bit"
All outputs in chrome devtools console.
The file is created in the root folder. But how can it be accessed from a file explorer, or is there any way to copy it to the Downloads folder?
Thanks
Indeed, pyodide operates in an in-memory (MEMFS) filesystem created by Emscripten. You can't directly write files to disk from pyodide, since it's executed in the browser sandbox.
You can, however, pass your file to JavaScript, create a Blob out of it, and then download it. For instance:
let txt = pyodide.runPython(`
with open('/test.txt', 'rt') as fh:
    txt = fh.read()
txt
`);
const blob = new Blob([txt], {type : 'application/text'});
let url = window.URL.createObjectURL(blob);
window.location.assign(url);
It should also be possible to do all of this from the Python side, using the type conversions included in pyodide, i.e.
from js import Blob, document
from js import window

with open('/test.txt', 'rt') as fh:
    txt = fh.read()

blob = Blob.new([txt], {type : 'application/text'})
url = window.URL.createObjectURL(blob)
window.location.assign(url)
however at present, this unfortunately doesn't work, as it depends on pyodide#788 being resolved first.
I have modified rth's answer so that it downloads the file under its own file name.
let txt = pyodide.runPython(`
with open('/test.txt', 'rt') as fh:
    txt = fh.read()
txt
`);
const blob = new Blob([txt], {type : 'application/text'});
let url = window.URL.createObjectURL(blob);
var downloadLink = document.createElement("a");
downloadLink.href = url;
downloadLink.download = "test.txt";
document.body.appendChild(downloadLink);
downloadLink.click();
I have a large directory with PDF files (images); how can I efficiently extract the text from all the files inside the directory? So far I tried:
import multiprocessing
import textract

def extract_txt(file_path):
    text = textract.process(file_path, method='tesseract')

p = multiprocessing.Pool(2)
file_path = ['/Users/user/Desktop/sample.pdf']
list(p.map(extract_txt, file_path))
However, it is not working: it takes a long time (some of my documents have 600 pages). Additionally: a) I do not know how to efficiently handle the directory-traversal part. b) I would like to add a page separator, say <start/age = 1> ... page content ... <end/page = 1>, but I have no idea how to do this.
Thus, how can I apply the extract_txt function to all the elements of a directory that end with .pdf, return the same files in another directory in .txt format, and add a page separator to the OCR-extracted text?
Also, I was curious about using Google Docs for this task: is it possible to programmatically use Google Docs to solve the aforementioned text-extraction problem?
UPDATE
Regarding the "adding a page separator" issue (<start/age = 1> ... page content ... <end/page = 1>) after reading Roland Smith's answer I tried to:
from PyPDF2 import PdfFileWriter, PdfFileReader
import textract

def extract_text(pdf_file):
    inputpdf = PdfFileReader(open(pdf_file, "rb"))
    for i in range(inputpdf.numPages):
        w = PdfFileWriter()
        w.addPage(inputpdf.getPage(i))
        outfname = 'page{:03d}.pdf'.format(i)
        with open(outfname, 'wb') as outfile:  # I presume you need `wb`.
            w.write(outfile)
        print('\n<begin page pos =', i, '>\n')
        text = textract.process(str(outfname), method='tesseract')
        os.remove(outfname)  # clean up.
        print(str(text, 'utf8'))
        print('\n<end page pos =', i, '>\n')

extract_text('/Users/user/Downloads/ImageOnly.pdf')
However, I still have issues with the print() part: instead of printing, it would be more useful to save all the output to a file. Thus, I tried to redirect the output to a file:
sys.stdout=open("test.txt","w")
print('\n<begin page pos =' , i, '>\n')
sys.stdout.close()
text = textract.process(str(outfname), method='tesseract')
os.remove(outfname) # clean up.
sys.stdout=open("test.txt","w")
print(str(text, 'utf8'))
sys.stdout.close()
sys.stdout=open("test.txt","w")
print('\n<end page pos =' , i, '>\n')
sys.stdout.close()
Any idea of how to make the page extraction/separator trick and saving everything into a file?...
In your code, you are extracting the text, but you don't do anything with it.
Try something like this:
def extract_txt(file_path):
    text = textract.process(file_path, method='tesseract')
    outfn = file_path[:-4] + '.txt'  # assuming filenames end with '.pdf'
    with open(outfn, 'wb') as output_file:
        output_file.write(text)
    return file_path
This writes the text to file that has the same name but a .txt extension.
It also returns the path of the original file to let the parent know that this file is done.
So I would change the mapping code to:
p = multiprocessing.Pool()
file_path = ['/Users/user/Desktop/sample.pdf']
for fn in p.imap_unordered(extract_txt, file_path):
    print('completed file:', fn)
You don't need to give an argument when creating a Pool. By default it will create as many workers as there are CPU cores.
Using imap_unordered creates an iterator that starts yielding values as soon as they are available.
Because the worker function returned the filename, you can print it to let the user know that this file is done.
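As an aside, the `file_path[:-4]` slice above assumes every name ends in ".pdf"; if that assumption worries you, pathlib can derive the output name more robustly. A small sketch (not part of the original answer):

```python
from pathlib import Path

def txt_path_for(pdf_path):
    # Replace the suffix instead of slicing a fixed number of characters
    return str(Path(pdf_path).with_suffix('.txt'))

print(txt_path_for('/Users/user/Desktop/sample.pdf'))
# → /Users/user/Desktop/sample.txt
```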
Edit 1:
The additional question is if it is possible to mark page boundaries. I think it is.
A method that would surely work is to split the PDF file into pages before the OCR. You could use e.g. pdfinfo from the poppler-utils package to find out the number of pages in a document. And then you could use e.g. pdfseparate from the same poppler-utils package to convert that one pdf file of N pages into N pdf files of one page. You could then OCR the single page PDF files separately. That would give you the text on each page separately.
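The page count could then be parsed out of pdfinfo's output; a sketch, assuming poppler-utils' plain-text "Pages: N" line (the sample output below is made up):

```python
import re

def page_count(pdfinfo_output):
    # pdfinfo prints a line like "Pages:          12"
    m = re.search(r'^Pages:\s+(\d+)\s*$', pdfinfo_output, re.MULTILINE)
    if m is None:
        raise ValueError("no 'Pages:' line found")
    return int(m.group(1))

# Hypothetical pdfinfo output for a 12-page document
sample = "Title:    Example\nPages:          12\nEncrypted:      no\n"
print(page_count(sample))  # → 12
```

In practice the string would come from running `pdfinfo somefile.pdf` via subprocess.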
Alternatively you could OCR the whole document and then search for page breaks. This will only work if the document has a constant or predictable header or footer on every page. It is probably not as reliable as the abovementioned method.
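If the document did have a predictable footer, the search could be as simple as splitting the OCR output on it. The "Page N of M" footer format here is hypothetical:

```python
import re

# Hypothetical OCR output with a "Page N of M" footer on every page
ocr_text = "First page text.\nPage 1 of 2\nSecond page text.\nPage 2 of 2\n"

# Split on the footer pattern and drop the empty fragments between footers
pages = [p.strip() for p in re.split(r'Page \d+ of \d+', ocr_text) if p.strip()]
print(pages)  # → ['First page text.', 'Second page text.']
```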
Edit 2:
If you need a file, write a file:
import os

from PyPDF2 import PdfFileWriter, PdfFileReader
import textract

def extract_text(pdf_file):
    inputpdf = PdfFileReader(open(pdf_file, "rb"))
    txtfname = pdf_file[:-4] + '.txt'  # Assuming the PDF file name ends with ".pdf"
    with open(txtfname, 'w') as textfile:
        for i in range(inputpdf.numPages):
            w = PdfFileWriter()
            w.addPage(inputpdf.getPage(i))
            outfname = 'page{:03d}.pdf'.format(i)
            with open(outfname, 'wb') as outfile:  # I presume you need `wb`.
                w.write(outfile)
            print('page', i)
            # textract returns bytes; decode before concatenating with str
            text = textract.process(outfname, method='tesseract').decode('utf-8')
            # Add header and footer.
            text = '\n<begin page pos = {}>\n'.format(i) + text + '\n<end page pos = {}>\n'.format(i)
            # Write the OCR-ed text to the output file.
            textfile.write(text)
            os.remove(outfname)  # clean up.
            print(text)
I am working with Python and Biopython right now. I have a file upload form, and whatever file is uploaded, say abc.fasta, I want to pass that same name as the execute() function parameter and display the result file (abc.aln). Right now I am changing the file name manually, but I want it to happen automatically.
Workflow goes like this.
--- If submit is not true, display only the header and form part
--- If submit is true, call execute() and get the file name from the form input
--- Then the displayed result file name is the same as the executed file name, with only the extension changed
My raw code is here -- http://pastebin.com/FPUgZSSe
Any suggestions, changes, and algorithms are appreciated.
Thanks
You need to read the uploaded file out of the cgi.FieldStorage() and save it onto the server. Usually a temp directory (/tmp on Linux) is used for this. You should remove these files after processing, or on some schedule, to clean up the drive.
def main():
    import cgi
    import cgitb; cgitb.enable()
    import os

    f1 = cgi.FieldStorage()
    if "dfile" in f1:
        fileitem = f1["dfile"]
        pathtoTmpFile = os.path.join("path/to/temp/directory", fileitem.filename)
        fout = open(pathtoTmpFile, 'wb')
        while 1:
            chunk = fileitem.file.read(100000)
            if not chunk:
                break
            fout.write(chunk)
        fout.close()
        execute(pathtoTmpFile)
        os.remove(pathtoTmpFile)
    else:
        header()
        form()
This requires modifying execute() to take the path to the newly saved file:
cline = ClustalwCommandline("clustalw", infile=pathToFile)
For the result file, you could also stream it back so the user gets a "Save as..." dialog. That might be a little more usable than displaying it in HTML.
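Streaming it back means sending a Content-Disposition header before the file bytes. A minimal sketch of building those CGI response headers (the .aln file name and contents are hypothetical):

```python
def download_headers(filename, length):
    # CGI response headers that trigger a "Save as..." dialog in the browser
    return ('Content-Type: application/octet-stream\r\n'
            'Content-Disposition: attachment; filename="{}"\r\n'
            'Content-Length: {}\r\n\r\n'.format(filename, length))

# Hypothetical result file produced by the ClustalW run
data = b">seq1\nACGT\n"
hdr = download_headers("abc.aln", len(data))
print(hdr)
```

The script would write `hdr` to stdout followed by the raw bytes (e.g. via `sys.stdout.buffer.write(data)`), instead of rendering an HTML page.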