How to get text from a downloadable .doc file without saving?

I'm trying to download a .doc file with requests.get() (I've heard about other methods, but they all require saving the file too).
Is there any way to extract the text from it (or even convert it to a .txt, for example) straight away, without saving it to a file?
I've tried passing request.raw into various converters (docx2txt.process() for example), but I assume they all work with files, not with streams.

While the script is running, memory allocation is handled by the Python interpreter, so the downloaded content can stay in memory as bytes; if you save the content to a file, it is written to disk instead. This article may be helpful to you.
Link: article
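
To make the "keep it in memory" idea concrete, here is a minimal sketch using only the standard library. It assumes the download is actually a .docx (a zip archive containing word/document.xml); a legacy binary .doc is a different format and will not open this way. The function name docx_bytes_to_text is made up for this example and is not part of any library.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by the main document part of a .docx
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_bytes_to_text(data: bytes) -> str:
    """Return the paragraph text of a .docx given as bytes (hypothetical helper).

    Nothing is written to disk: the bytes are wrapped in a BytesIO object,
    which zipfile accepts in place of a file name.
    """
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))

    paragraphs = []
    for para in root.iter(W_NS + "p"):
        # A paragraph's text is split across one or more <w:t> runs
        paragraphs.append("".join(node.text or "" for node in para.iter(W_NS + "t")))
    return "\n".join(paragraphs)


# Assumed usage with requests (url is a placeholder):
#   text = docx_bytes_to_text(requests.get(url).content)
```

The key point is that response.content is already a bytes object, so any converter that accepts a file-like object can be given io.BytesIO(response.content) instead of a path.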

Related

How to read ATFX data?

I have a set of measurement data in .atfx format. I know it is possible to read this data with ArtemiS SUITE, but I need to do some post-processing in Python. I looked into the files, and as far as I can see, an .atfx file is a header file (with an XML structure) that points to binary files, so I'm not sure how I could write a Python script to decode it, or whether that is possible at all.
Is there a way to open ATFX files in Python, or is there a workaround?

Attach a .xlsx file in PDF using python

Currently I am using Adobe Acrobat Pro to attach an Excel file using the Comments tools, but since I have to work with hundreds of PDFs, the manual process becomes tedious.
I have therefore been trying to automate the process so that I can attach an Excel file to a PDF using Python, but so far I haven't found a single reference on how to start.
Is this process even possible to automate?

Making changes to a ntriples file with python

Scenario: I just got my hands on a huge ntriples file (6.5 GB uncompressed). I am trying to open it and perform some operations (such as cleaning some of the data that it contains).
Issue: I haven't been able to check the contents of this file. Notepad++ cannot handle it, and in RDFlib, the furthest I got was loading the file, but I cannot find a way to edit it without parsing the entire thing. I also tried using the RDF package (from "how to parse big datasets using RDFLib?"), but I cannot find a way to install it in Python 3.
Question: What is the best option to perform this kind of operation? Is there any command in rdflib that allows for this kind of editing?

If it's ntriples, then it is basically one triple per line. Therefore, you can read the file in small chunks (some N lines at a time), parse each chunk with rdflib, and then apply whatever cleaning operation you need on the resulting graph.
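
The chunked reading described above could look roughly like the sketch below. To keep it runnable without extra packages, it swaps the rdflib parsing step for a plain per-line filter; the names iter_chunks and clean_ntriples, the chunk size, and the sample predicate are all assumptions made for illustration. Where the comment indicates, each chunk could instead be joined and handed to rdflib for real parsing.

```python
import io
from itertools import islice


def iter_chunks(lines, size):
    """Yield lists of at most `size` items from any iterable of lines."""
    iterator = iter(lines)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk


def clean_ntriples(src, dst, size=100_000, keep=lambda line: True):
    """Copy triples from src to dst chunk by chunk and return how many were kept.

    Blank lines and '#' comment lines are dropped, as are lines for which
    the caller-supplied `keep` predicate returns False.
    """
    kept = 0
    for chunk in iter_chunks(src, size):
        # An rdflib-based variant would parse "".join(chunk) here instead.
        for line in chunk:
            stripped = line.strip()
            if not stripped or stripped.startswith("#"):
                continue
            if keep(stripped):
                dst.write(stripped + "\n")
                kept += 1
    return kept


# Small in-memory demo; for a real file, pass open(path) objects instead.
demo_src = io.StringIO(
    '<http://a> <http://p> "1" .\n'
    '# a comment\n'
    '<http://b> <http://bad> "2" .\n'
)
demo_dst = io.StringIO()
clean_ntriples(demo_src, demo_dst, size=2, keep=lambda l: "<http://bad>" not in l)
```

Because only one chunk is held in memory at a time, the 6.5 GB file never has to be loaded as a whole.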

Upload image with an in-memory stream to input using Pillow + WebDriver?

I'm getting an image from a URL with Pillow and creating a stream (BytesIO/StringIO).
r = requests.get("http://i.imgur.com/SH9lKxu.jpg")
stream = Image.open(BytesIO(r.content))
I want to upload this image using an <input type="file" /> with Selenium WebDriver. I can do something like this to upload a file:
self.driver.find_element_by_xpath("//input[@type='file']").send_keys("PATH_TO_IMAGE")
I would like to know if it's possible to upload that image from a stream without having to mess with files or file paths. I'm trying to avoid filesystem reads and writes and do it in memory, or at most with temporary files. I'm also wondering if that stream could be encoded to Base64 and then uploaded by passing the string to the send_keys function shown above :$
PS: Hope you like the image :P

You seem to be asking multiple questions here.
First, how do you convert a JPEG without downloading it to a file? You're already doing that, so I don't know what you're asking here.
Next, "And do it in-memory or as much with temporary files." I don't know what this means, but you can do it with temporary files with the tempfile library in the stdlib, and you can do it in-memory too; both are easy.
Next, you want to know how to do a streaming upload with requests. The easy way to do that, as explained in Streaming Uploads, is to "simply provide a file-like object for your body". This can be a tempfile, but it can just as easily be a BytesIO. Since you're already using one in your question, I assume you know how to do this.
(As a side note, I'm not sure why you're using BytesIO(r.content) when requests already gives you a way to use a response object as a file-like object, and even to do it by streaming on demand instead of by waiting until the full content is available, but that isn't relevant here.)
If you want to upload it with selenium instead of requests… well then you do need a temporary file. The whole point of selenium is that it's scripting a web browser. You can't just type a bunch of bytes at your web browser in an upload form, you have to select a file on your filesystem. So selenium needs to fake you selecting a file on your filesystem. This is a perfect job for tempfile.NamedTemporaryFile.
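
A minimal sketch of the tempfile.NamedTemporaryFile approach follows. The helper name bytes_to_temp_path is made up for this example, and the selenium call in the trailing comment is only the one from the question, not something the sketch runs.

```python
import os
import tempfile


def bytes_to_temp_path(data: bytes, suffix: str = ".jpg") -> str:
    """Write bytes to a named temporary file and return its path (hypothetical helper).

    delete=False keeps the file on disk after it is closed, so another
    process (the browser) can open it by path. The caller is responsible
    for removing it afterwards.
    """
    with tempfile.NamedTemporaryFile(suffix=suffix, delete=False) as tmp:
        tmp.write(data)
    return tmp.name


# Assumed usage with the code from the question:
#   path = bytes_to_temp_path(r.content)
#   try:
#       element.send_keys(path)   # element is the <input type="file" />
#   finally:
#       os.remove(path)
```

Closing the file before handing the path to the browser matters on Windows, where an open temporary file generally cannot be opened a second time.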
Finally, "I'm also Wondering If that stream could be encoded to Base64".
Sure it can. Since you're just converting the image in-memory, you can just encode it with, e.g., base64.b64encode. Or, if you prefer, you can wrap your BytesIO in a codecs wrapper to base-64 it on the fly. But I'm not sure why you want to do that here.
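
For completeness, the base64.b64encode route is a one-liner on the stream's bytes. The function name stream_to_base64 is just an illustrative name.

```python
import base64
import io


def stream_to_base64(stream) -> str:
    """Read a binary file-like object to the end and return its Base64 text."""
    return base64.b64encode(stream.read()).decode("ascii")


# Works on any binary stream, e.g. BytesIO(r.content) from the question.
encoded = stream_to_base64(io.BytesIO(b"hello"))  # "aGVsbG8="
```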

How to handle unicode of an unknown encoding in Django?

I want to save some text to the database using the Django ORM wrappers. The problem is, this text is generated by scraping external websites and many times it seems they are listed with the wrong encoding. I would like to store the raw bytes so I can improve my encoding detection as time goes on without redoing the scrapes. But Django seems to want everything to be stored as unicode. Can I get around that somehow?

You can store the data encoded as base64, for example. Or try to analyze the HTTP headers of the response; it may be simpler to get the proper encoding from there.
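
As a rough sketch of the base64 idea: the raw bytes are turned into ASCII text that fits in an ordinary text column, and can be decoded later with whichever encoding your detection settles on. The function names and the list of candidate encodings are assumptions for illustration; nothing here is Django-specific.

```python
import base64


def pack_raw(raw: bytes) -> str:
    """Encode scraped bytes as ASCII text that is safe to store in a text field."""
    return base64.b64encode(raw).decode("ascii")


def unpack_raw(stored: str) -> bytes:
    """Recover the original bytes exactly as they were scraped."""
    return base64.b64decode(stored)


def decode_with_fallback(raw: bytes, encodings=("utf-8", "latin-1")):
    """Try each candidate encoding in order; return (text, encoding_used)."""
    for encoding in encodings:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the candidate encodings could decode the data")
```

Since the stored value round-trips to the original bytes, the decoding step can be rerun with a better guess at any time without scraping again.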

Create a file with the data. Use a Django models.FileField to hold a reference to the file.
No, it does not involve a ton of I/O. If your file is small, it adds 2 or 3 I/O operations (the directory read, the inode read and the data read).
