Is there a way to save stanza's document output so I can use it later (calling back .entities, .sentences, .text)?
I need to iterate over different files and store the output in a way that is available later for some NLP projects.
For example:
import stanza

stanza.download('en')        # download the English models
nlp = stanza.Pipeline('en')  # build the default English pipeline
doc = nlp(data)              # annotate the text
where data is some string.
I need a way to save the "doc" so that I can access it later without needing to re-apply the nlp().
I have tried following this:
https://stanfordnlp.github.io/stanza/data_conversion.html#conll-to-document
but when I save the document either as CoNLL or as a dict and then convert it back to a stanza Document, I cannot call .entities on the result any more.
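Roughly what I tried with the dict route, as a sketch (the file name is arbitrary; to_dict and Document are the pieces that page describes, though I may be misremembering details):

import json
from stanza.models.common.doc import Document

# doc is the Document produced by nlp(data) above

# dump the token-level dicts to disk
with open('doc.json', 'w') as f:
    json.dump(doc.to_dict(), f)

# later: rebuild a Document without re-running the pipeline
with open('doc.json') as f:
    doc2 = Document(json.load(f))

doc2.sentences   # the sentences come back
doc2.entities    # this is what I cannot call back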
Related
I am running a script which takes, say, an hour to generate the data I want. I want to be able to save all of the relevant variables to some external file so I can fiddle with them later without having to run the hour-long calculation over again. Is there an easy way I can save all of the variables I need into one convenient file?
In Matlab I would just put all of the results of the calculation in a single structure so that later I could just load results.mat and have everything I need stored as results.output1, results.output2 or whatever. What is the Python equivalent of this?
In particular, the data that I would like to save includes arrays of complex numbers, which seems to present difficulties when using things like json.
I suggest taking a look at the built-in shelve module, which provides a persistent, dictionary-like object and generally works with all native Python types, so you can do the following.
Write the complex number to a file (in my example it is named mydata) under the key n (keep in mind that keys should be strings):
import shelve

my_number = 2 + 7j

with shelve.open('mydata') as db:
    db['n'] = my_number
Later, retrieve that number from the same file:
import shelve

with shelve.open('mydata') as db:
    my_number = db['n']

print(my_number)  # (2+7j)
You can use the pickle module in Python: call its dump function to dump all your data into a file and load it back later with load. I suggest you read more about pickle.
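A minimal sketch of that (the file name is arbitrary):

import pickle

results = {"output1": [1 + 2j, 3 - 4j], "output2": "anything picklable"}

# dump everything into a single file
with open('results.pkl', 'wb') as f:
    pickle.dump(results, f)

# later: load it all back in one call
with open('results.pkl', 'rb') as f:
    results = pickle.load(f)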
I would recommend a json file. With json you can map values to keys, just like dictionaries in stock Python. The json module comes with Python's standard library, so there is nothing extra to install.
import json

data = {"var1": "abcde", "var2": "fghij"}

with open(path, "w") as file:
    json.dump(data, file, indent=2, ensure_ascii=False)
You can load it back from the file using the same API:
with open(path, "r") as file:
    text = file.read()
data = json.loads(text)
Edit: JSON handles the basic Python data types (strings, numbers, booleans, None, lists and dicts), so if you want to save a list you can just put it in the dict:
data = {"list1": ["ab", "cd", "ef"]}
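One caveat for the original question: JSON has no native complex type, so complex arrays need an explicit encoding, for example as [real, imag] pairs (a sketch, not the only option):

import json

numbers = [1 + 2j, 3 - 4j]

# encode each complex number as a [real, imag] pair
with open("complex.json", "w") as file:
    json.dump({"numbers": [[z.real, z.imag] for z in numbers]}, file)

# decode them back into complex numbers
with open("complex.json", "r") as file:
    numbers = [complex(re, im) for re, im in json.load(file)["numbers"]]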
I am reading a JSON file in a Dataflow pipeline using beam.io.ReadFromText. When I pass its output to a ParDo class, it arrives as an element. I want to use this JSON file's content in my class; how do I do this?
Content in Json File:
{"query": "select * from tablename", "Unit": "XX", "outputFileLocation": "gs://test-bucket/data.csv", "location": "US"}
Here I want to use each of its values, such as query, Unit, location and outputFileLocation, in the class Query():
p | beam.io.ReadFromText(file_pattern=user_options.inputFile) | 'Executing Query' >> beam.ParDo(Query())
My class:
class Query(beam.DoFn):
    def process(self, element):
        # do something using content available in element
        .........
I don't think it is possible with the current set of IOs.
The reason is that a multiline JSON file requires parsing the complete file to identify a single JSON block. That would only be possible without parallelism while reading; since file-based IOs run on multiple workers in parallel, using a partitioning scheme and a line delimiter, parsing multiline JSON is not possible.
If you have multiple smaller files, then you can read those files separately and emit the parsed JSON. You can further use a reshuffle to evenly distribute the data for the downstream operations.
The pipeline would look something like this.
Get File List -> Reshuffle -> Read content of individual files and emit the parsed json -> Reshuffle -> Do things.
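A rough sketch of that shape (hedged: the file list and paths are placeholders, Query() is the class from the question, and error handling is omitted):

import json
import apache_beam as beam
from apache_beam.io.filesystems import FileSystems

def read_and_parse(file_path):
    # read the whole file on a single worker, then parse it as one JSON document
    raw = FileSystems.open(file_path).read().decode('utf-8')
    return json.loads(raw)

with beam.Pipeline() as p:
    (p
     | 'File list' >> beam.Create(['gs://test-bucket/query1.json'])  # hypothetical paths
     | 'Shuffle files' >> beam.Reshuffle()
     | 'Read and parse' >> beam.Map(read_and_parse)
     | 'Shuffle records' >> beam.Reshuffle()
     | 'Executing Query' >> beam.ParDo(Query()))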
I'm working on something using gensim.
In gensim, the variable index usually means an object of gensim.similarities.<cls>.
At first I use gensim.similarities.Similarity(filepath, ...) to build and save the index as a file, and then load it with gensim.similarities.Similarity.load(filepath + '.0'), because gensim.similarities.Similarity saves the index to shard files like index.0 by default.
When the index becomes larger, it is automatically separated into more shards, like index.0, index.1, index.2, ...
How can I load these shard files? gensim.similarities.Similarity.load() can only load one file.
BTW: I have tried to find the answer in gensim's docs, but failed.
from gensim.corpora.textcorpus import TextCorpus
from gensim.test.utils import datapath, get_tmpfile
from gensim.similarities import Similarity

output_fname = get_tmpfile("saved_index")

corpus = TextCorpus(datapath('testcorpus.txt'))

# building the index writes the shard files under the output_fname prefix
index = Similarity(output_fname, corpus, num_features=400)

# this extra step writes the small top-level file that load() expects
index.save(output_fname)

loaded_index = Similarity.load(output_fname)
https://radimrehurek.com/gensim/similarities/docsim.html
shoresh's answer is correct. The key part that OP was missing was
index.save(output_fname)
While just creating the object appears to save it, that really only saves the shards; a sort of directory file also has to be written (via index.save(output_fname)) for the index to be accessible as a whole object.
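So even after the index has grown into many shards (index.0, index.1, index.2, ...), you still load it through the single file written by save(), not through the individual shards:

loaded_index = Similarity.load(output_fname)  # the saved object knows where its shards live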
I am trying to import a JSON file for use in a Python editor so that I can perform analysis on the data. I am quite new to Python, so I am not sure how I am meant to achieve this. My JSON file is full of tweet data; an example is shown here:
{"id":441999105775382528,"score":0.0,"text":"blablabla","user_id":1441694053,"created":"Fri Mar 07 18:09:33 GMT 2014","retweet_id":0,"source":"twitterfeed","geo_long":null,"geo_lat":null,"location":"","screen_name":"SevenPS4","name":"Playstation News","lang":"en","timezone":"Amsterdam","user_created":"2013-05-19","followers":463,"hashtags":"","mentions":"","following":1062,"urls":"http://bit.ly/1lcbBW6","media_urls":"","favourites_count":4514,"reply_status_id":0,"reply_user_id":0,"is_truncated":false,"is_retweet":false,"original_text":null,"status_count":4514,"description":"Tweeting the latest Playstation news!","url":null,"utc_offset":3600}
My questions:
How do I import the JSON file so that I can perform analysis on it in a Python editor?
How do I perform analysis on only a set number of the tweets (i.e. 100/200 of them instead of all of them)?
Is there a way to get rid of some of the fields, such as score, user_id, created, etc., without having to go through all of my data manually?
Some of the tweets have invalid/unusable symbols in them; is there any way to get rid of those without going through them manually?
I'd use pandas for this job, as you will not only load the json but also perform some data analysis tasks on it. Depending on the size of your json file, something like this should do it:
import pandas as pd
import json

# read the json file (one tweet object per line); replace the name with your file location
with open("yourfilename") as f:
    tweets = [json.loads(line) for line in f]

# you might select the relevant keys before constructing the data frame
df = pd.DataFrame(tweets, columns=["id", "text", "followers"])

# select a subset (the first five rows)
df.iloc[:5]

# do some analysis
df.followers.sum()   # with just the example tweet above this is 463
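For the question about unusable symbols, one rough option is to strip everything non-ASCII from the text column (an assumption about what counts as "invalid"; adjust to taste):

# drop characters that are not plain ASCII from the tweet text
df["text"] = df["text"].str.encode("ascii", "ignore").str.decode("ascii")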
I have created a Python script that automates a workflow converting PDF to txt files. I want to be able to store and query these files in MongoDB. Do I need to turn the .txt files into JSON/BSON? Should I be using a library like PyMongo?
I am just not sure what the steps of such a project would be let alone the tools that would help with this.
I've looked at this post: How can one add text files in Mongodb?, which makes me think I need to convert the file to a JSON file, and possibly integrate GridFS?
You don't need to JSON/BSON encode it if you're using a driver. If you're using the MongoDB shell, you'd need to worry about it when pasting the contents.
You'd likely want to use the Python MongoDB driver:
from pymongo import MongoClient

client = MongoClient()
db = client.test_database   # use a database called "test_database"
collection = db.files       # and inside that DB, a collection called "files"

with open('test_file_name.txt') as f:   # open a file
    text = f.read()                     # read the entire contents; should be UTF-8 text

# build a document to be inserted
text_file_doc = {"file_name": "test_file_name.txt", "contents": text}

# insert the document into the "files" collection
collection.insert_one(text_file_doc)
(Untested code)
If you made sure that the file names are unique, you could set the _id property of the document and retrieve it like:
text_file_doc = collection.find_one({"_id": "test_file_name.txt"})
Or, you could ensure the file_name property as shown above is indexed and do:
text_file_doc = collection.find_one({"file_name": "test_file_name.txt"})
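Creating that index could look like this (a sketch; unique=True is optional and assumes the file names really are unique):

# index the file_name field so the lookup above is fast
collection.create_index("file_name", unique=True)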
Your other option is to use GridFS, although it's often not recommended for small files.
There's a starter here for Python and GridFS.
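For completeness, a minimal GridFS sketch (gridfs ships with PyMongo; the file and database names are just placeholders):

import gridfs
from pymongo import MongoClient

db = MongoClient().test_database
fs = gridfs.GridFS(db)

# store the file's bytes and keep the returned id
with open('test_file_name.txt', 'rb') as f:
    file_id = fs.put(f, filename='test_file_name.txt')

# read it back later
text = fs.get(file_id).read().decode('utf-8')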
Yes, you must convert your file to JSON. There is a trivial way to do that: use something like {"text": "your text"}. It's easy to extend / update such records later.
Of course you'd need to escape the " occurrences in your text. I suppose you would use a JSON library and/or the MongoDB driver for your favourite language to do all the formatting.
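A quick sketch of that in Python (json.dumps takes care of the escaping):

import json

with open('test_file_name.txt') as f:
    text = f.read()

# quotes and other special characters in the text are escaped automatically
record = json.dumps({"file_name": "test_file_name.txt", "text": text})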