storing full text from txt file into mongodb - python

I have created a Python script that automates a workflow converting PDFs to .txt files. I want to be able to store and query these files in MongoDB. Do I need to turn the .txt file into JSON/BSON? Should I be using a library like PyMongo?
I am just not sure what the steps of such a project would be, let alone which tools would help with this.
I've looked at this post: How can one add text files in Mongodb?, which makes me think I need to convert the file to JSON, and possibly integrate GridFS?

You don't need to JSON/BSON-encode the text if you're using a driver; the driver handles that. You'd only need to worry about it if you were pasting the contents into the MongoDB shell by hand.
You'd likely want to use PyMongo, the official Python MongoDB driver:
from pymongo import MongoClient

client = MongoClient()
db = client.test_database  # use a database called "test_database"
collection = db.files      # and inside that DB, a collection called "files"

# read the entire contents of the file; it should be UTF-8 text
with open('test_file_name.txt', encoding='utf-8') as f:
    text = f.read()

# build a document to be inserted
text_file_doc = {"file_name": "test_file_name.txt", "contents": text}
# insert the document into the "files" collection
collection.insert_one(text_file_doc)
(Untested code)
If you made sure that the file names are unique, you could set the _id property of the document and retrieve it like:
text_file_doc = collection.find_one({"_id": "test_file_name.txt"})
Or, you could ensure that the file_name property shown above is indexed and do:
text_file_doc = collection.find_one({"file_name": "test_file_name.txt"})
Your other option is to use GridFS, although it's often not recommended for small files.
There's a starter here for Python and GridFS.
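A minimal GridFS sketch (untested, using the same database as above):
import gridfs
from pymongo import MongoClient

db = MongoClient().test_database
fs = gridfs.GridFS(db)

# store the raw bytes; GridFS chunks large files automatically
with open('test_file_name.txt', 'rb') as f:
    file_id = fs.put(f, filename='test_file_name.txt')

# read the text back
text = fs.get(file_id).read().decode('utf-8')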

Yes, you must convert your file to JSON. There is a trivial way to do that: wrap the text in an object like {"text": "your text"}. Records like that are easy to extend and update later.
Of course you'd need to escape the " occurrences in your text. I assume you'd use a JSON library and/or a MongoDB library for your language of choice to do all the formatting.
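For example, Python's json module handles the quote escaping for you:
import json

print(json.dumps({"text": 'He said "hello"'}))
# {"text": "He said \"hello\""}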

Related

Save Stanza output to use later - Python

Is there a way to save a stanza document output to use later? (calling back .entities, .sentences, .text)
I need to iterate over different files and I need to store the output in a way that is later available for some NLP projects.
For example:
import stanza
stanza.download('en')
nlp = stanza.Pipeline('en')
doc = nlp(data)
where data is some string.
I need a way to save the "doc" so that I can access it later without needing to re-apply the nlp().
I have tried following this:
https://stanfordnlp.github.io/stanza/data_conversion.html#conll-to-document
but when I save the file either as CoNLL or as a dict and then convert it back to a stanza Document, I can no longer call .entities.
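One workaround worth trying (a sketch, untested; it assumes the stanza Document object pickles cleanly with its annotations attached):
import pickle

# save the processed document once
with open('doc.pkl', 'wb') as f:
    pickle.dump(doc, f)

# later: load it back without re-running nlp()
with open('doc.pkl', 'rb') as f:
    doc = pickle.load(f)

print(doc.entities)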

How to save multiple data at once in Python

I am running a script which takes, say, an hour to generate the data I want. I want to be able to save all of the relevant variables to some external file so I can fiddle with them later without having to run the hour-long calculation over again. Is there an easy way I can save all of the variables I need into one convenient file?
In Matlab I would just collect all of the results of the calculation in a single structure so that later I could just load results.mat and have everything I need stored as results.output1, results.output2 or whatever. What is the Python equivalent of this?
In particular, the data that I would like to save includes arrays of complex numbers, which seems to present difficulties for using things like json.
I suggest taking a look at the built-in shelve module, which provides a persistent, dictionary-like object and generally works with all native Python types, so you can do the following.
Write a complex number to some file (in my example it is named mydata) under the key n (keep in mind that keys should be strings):
import shelve

my_number = 2 + 7j
with shelve.open('mydata') as db:
    db['n'] = my_number
Later, retrieve that number from the same file:
import shelve

with shelve.open('mydata') as db:
    my_number = db['n']

print(my_number)  # (2+7j)
You can use the pickle module in Python: call pickle.dump to write all your data to a file and pickle.load to read it back later. I suggest reading more about pickle.
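A minimal sketch (the file name results.pkl is illustrative); unlike JSON, pickle handles complex numbers and most other Python objects natively:
import pickle

results = {"output1": [1 + 2j, 3 - 4j], "output2": "done"}

# dump everything into one file
with open('results.pkl', 'wb') as f:
    pickle.dump(results, f)

# later: load it all back
with open('results.pkl', 'rb') as f:
    results = pickle.load(f)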
I would recommend a JSON file. With json you can map values to keys, just like dictionaries in stock Python. The json module ships with Python's standard library, so there is nothing extra to install.
import json

data = {"var1": "abcde", "var2": "fghij"}
with open(path, "w") as file:
    json.dump(data, file, indent=2, ensure_ascii=False)
You can also load this back from the file using the same API:
with open(path, "r") as file:
    data = json.load(file)
Edit: JSON can also handle lists, so if you want to save an array you can just define it in the dict:
data = {"list1": ["ab", "cd", "ef"]}
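Note, though, that JSON only supports basic types (strings, numbers, booleans, lists, dicts, null), so the complex numbers mentioned in the question need converting first. One illustrative workaround is to store real/imaginary pairs:
import json

value = 2 + 7j
encoded = json.dumps({"n": [value.real, value.imag]})

real, imag = json.loads(encoded)["n"]
value = complex(real, imag)  # (2+7j)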

What is the best way to store an XML file in a database using sqlalchemy-flask?

I'm working on a Flask app that I'd like to have store xml files in a database. I'd like to use flask-sqlalchemy. I've seen that in regular old sqlalchemy it is possible to use the LONGTEXT type. I believe this would work for my use case.
I would like to know (1) if LONGTEXT would be the best way to store xml files and, if so, (2) how to use LONGTEXT within the flask-sqlalchemy syntax.
What should {insert-name-here} be in the code below? Will I need to install additional dependencies to use whatever is suggested?
xml_column = db.Column(db.{insert-name-here})
I use Python from time to time, and I use the xml.etree.ElementTree package to read and write XML data. I use it like this:
import xml.etree.ElementTree as ET

# xml file
_xml = 'c:/.../test.xml'

# read
tree = ET.parse(_xml)
root = tree.getroot()
h_data = root.findall('h')

# write
root = ET.Element('test')
tree = ET.ElementTree(root)
tree.write(_xml, encoding='utf-8', xml_declaration=True)
For more, see the documentation.
An XML file can be saved as plain text, but the encoding should ideally be UTF-8. I think XML is not the best data format to work with in Python; JSON is better. Hope I can help you.
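To address the original question about the column type: a minimal flask-sqlalchemy sketch (the model name XmlDocument is illustrative). db.Text maps to the database's generic TEXT type; on MySQL the dialect-specific LONGTEXT can be requested explicitly, with no dependencies beyond the MySQL driver you already need:
from flask_sqlalchemy import SQLAlchemy

db = SQLAlchemy()

class XmlDocument(db.Model):
    id = db.Column(db.Integer, primary_key=True)
    xml_column = db.Column(db.Text)  # generic TEXT; works on any backend

# MySQL-only alternative for very large documents:
# from sqlalchemy.dialects.mysql import LONGTEXT
# xml_column = db.Column(LONGTEXT)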

Uploading a file via paperclip through an external script

I'm trying to create a Rails app that is a CMS for a client. The app currently has a Document class that uploads the document with Paperclip.
Separately, we're running a Python script that accesses the database, gets a bunch of information for a given event, creates a proposal Word document, and uploads it to the database under the correct event.
This all works, but the app does not recognize the document. How do I make a Python script that will correctly upload the document so that Paperclip knows what's going on?
Here is my paperclip controller:
def new
  @event = Event.find(params[:event_id])
  @document = Document.new
end

def create
  @event = Event.find(params[:event_id])
  @document = @event.documents.new(document_params)
  if @document.save
    redirect_to event_path(@event)
  end
end

private

def document_params
  params.require(:document).permit(:event_id, :data, :title)
end
Model
validates :title, presence: true
has_attached_file :data
validates_attachment_content_type :data, :content_type => ["application/pdf", "application/msword"]
Here is the Python code:
f = open(propStr, 'rb')  # binary mode, since the document is not plain text
binary = psycopg2.Binary(f.read())
self.cur.execute(
    "INSERT INTO documents (event_id, title, data_file_name, data_content_type) "
    "VALUES (%s, 'Proposal.doc', %s, 'application/msword');",
    (self.eventData[0], binary))
self.con.commit()
You should probably use Ruby to script this, since it can load in any model information or other classes you need.
But assuming your requirements dictate the use of Python, be aware that Paperclip does not store the documents in your database tables, only the files' metadata. The actual file is stored in your file system, in the /public directory by default (it could also be S3, etc., depending on your configuration). I would make sure you are actually saving the file to the correct anticipated directory. The default path according to the docs is:
:rails_root/public/system/:class/:attachment/:id_partition/:style/:filename
so you will have to make another SQL query to retrieve the id of your new record. I don't believe PDFs have a :style attribute, since you don't use ImageMagick to resize them, so build a path that looks something like this:
/public/system/documents/data/000/000/123/my_file.pdf
and save it from your python script.
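For illustration, a sketch of how the Python script could build that path; Paperclip's :id_partition is the record id zero-padded to nine digits and split into groups of three (rails_root and the file name are placeholders):
import os

def paperclip_path(rails_root, record_id, filename):
    padded = "%09d" % record_id                        # 123 -> "000000123"
    partition = os.path.join(padded[0:3], padded[3:6], padded[6:9])
    return os.path.join(rails_root, "public", "system",
                        "documents", "data", partition, filename)

# paperclip_path("/srv/app", 123, "my_file.pdf")
# -> /srv/app/public/system/documents/data/000/000/123/my_file.pdf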

How to update/add data to Django application without deleting data

Right now, I have a Django application with an import feature which accepts a .zip file, reads out the CSV files, formats them to JSON, and then inserts them into the database. The JSON file with all the data is put into temp_dir and is called data.json.
Unfortunately, the insertion is done like so:
Building.objects.all().delete()
call_command('loaddata', os.path.join(temp_dir, 'data.json'))
My problem is that all the data is deleted and then re-added. I need to instead find a way to update and add data without deleting it.
I've been looking at other Django commands, but I can't seem to find one that would allow me to insert the data and update/add records. I'm hoping there is an easy way to do this without modifying a whole lot.
If you loop through your data you could use get_or_create(), which returns the object if it exists and creates it if it doesn't:
from datetime import date

obj, created = Person.objects.get_or_create(
    first_name='John',
    last_name='Lennon',
    defaults={'birthday': date(1940, 10, 9)},
)
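A sketch of the import loop using the related update_or_create(), which updates an existing row instead of duplicating it (the building_id and name fields are illustrative; temp_dir and Building come from the question):
import json
import os

with open(os.path.join(temp_dir, 'data.json')) as f:
    records = json.load(f)

for rec in records:
    Building.objects.update_or_create(
        building_id=rec['building_id'],   # hypothetical unique key
        defaults={'name': rec['name']},   # fields to create or update
    )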
