How to load index shards by gensim.similarities.Similarity？

How to load index shards by gensim.similarities.Similarity？ - python

I'm working on something using gensim.
In gensim, var index usually means an object of gensim.similarities.<cls>.
At first, I use gensim.similarities.Similarity(filepath, ...) to save index as a file, and then loads it by gensim.similarities.Similarity.load(filepath + '.0'). Because gensim.similarities.Similarity default save index to shards file like index.0.
When index file becoming larger, it automatically seperate into more shards, like index.0,index.1,index.2......
How can I load these shards file? gensim.similarities.Similarity.load() can only load one file.
BTW: I have try to find the answer in gensim's doc, but failed.

from gensim.corpora.textcorpus import TextCorpus
from gensim.test.utils import datapath, get_tmpfile
from gensim.similarities import Similarity
temp_fname = get_tmpfile("index")
output_fname = get_tmpfile("saved_index")
corpus = TextCorpus(datapath('testcorpus.txt'))
index = Similarity(output_fname, corpus, num_features=400)
index.save(output_fname)
loaded_index = index.load(output_fname)
https://radimrehurek.com/gensim/similarities/docsim.html

shoresh's answer is correct. The key part that OP was missing was
index.save(output_fname)
While just creating the object appears to save it, it's really only saving the shards, which require saving a sort of directory file (via index.save(output_fname) to be made accessible as a whole object.

Related

How to extract relation members from .osm xml files

All,
I've been trying to build a website (in Django) which is to be an index of all MTB routes in the world. I'm a Pythonian so wherever I can I try to use Python.
I've successfully extracted data from the OSM API (Display relation (trail) in leaflet) but found that doing this for all MTB trails (tag: route=mtb) is too much data (processing takes very long). So I tried to do everything locally by downloading a torrent of the entire OpenStreetMap dataset (from Latest Weekly Planet XML File) and filtering for tag: route=mtb using osmfilter (part of osmctools in Ubuntu 20.04), like this:
osmfilter $unzipped_osm_planet_file --keep="route=mtb" -o=$osm_planet_dir/world_mtb_routes.osm
This produces a file of about 1.2 GB and on closer inspection seems to contain all the data I need. My goal was to transform the file into a pandas.DataFrame() so I could do some further filtering en transforming before pushing relevant aspects into my Django DB. I tried to load the file as a regular XML file using Python Pandas but this crashed the Jupyter notebook Kernel. I guess the data is too big.
My second approach was this solution: How to extract and visualize data from OSM file in Python. It worked for me, at least, I can get some of the information, like the tags of the relations in the file (and the other specified details). What I'm missing is the relation members (the ways) and then the way members (the nodes) and their latitude/longitudes. I need these to achieve what I did here: Plotting OpenStreetMap relations does not generate continuous lines
I'm open to many solutions, for example one could break the file up into many different files containing 1 relation and it's members per file, using an osmium based script. Perhaps then I can move on with pandas.read_xml(). This would be nice for batch processing en filling the Database. Loading the whole OSM XML file into a pd.DataFrame would be nice but I guess this really is a lot of data. Perhaps this can also be done on a per-relation basis with pyosmium?
Any help is appreciated.

Ok, I figured out how to get what I want (all information per relation of the type "route=mtb" stored in an accessible way), it's a multi-step process, I'll describe it here.
First, I downloaded the world file (went to wiki.openstreetmap.org/wiki/Planet.osm, opened the xml of the pbf file and downloaded the world file as .pbf (everything on Linux, and this file is referred to as $osm_planet_file below).
I converted this file to o5m using osmconvert (available in Ubuntu 20.04 by doing apt install osmctools, on the Linux cli:
osmconvert --verbose --drop-version $osm_planet_file -o=$osm_planet_dir/planet.o5m
The next step is to filter all relations of interest out of this file (in my case I wanted all MTB routes: route=mtb) and store them in a new file, like this:
osmfilter $osm_planet_dir/planet.o5m --keep="route=mtb" -o=$osm_planet_dir/world_mtb_routes.o5m
This creates a much smaller file that contains all information on the relations that are MTB routes.
From there on I switched to a Jupyter notebook and used Python3 to further divide the file into useful, manageable chunks. I first installed osmium using conda (in the env I created first but that can be skipped):
conda install -c conda-forge osmium
Then I made a recommended osm.SimpleHandle class, this class iterates through the large o5m file and while doing this it can do actions. This is the way to deal with these files because they are far to big for memory. I made the choice to iterate through the file and store everything I needed into separate json files. This does generate more than 12.000 json files but it can be done on my laptop with 8 GB of memory. This is the class:
import osmium as osm
import json
import os
data_dump_dir = '../data'
class OSMHandler(osm.SimpleHandler):
def __init__(self):
osm.SimpleHandler.__init__(self)
self.osm_data = []
def tag_inventory(self, elem, elem_type):
for tag in elem.tags:
data = dict()
data['version'] = elem.version,
data['members'] = [int(member.ref) for member in elem.members if member.type == 'w'], # filter nodes from waylist => could be a mistake
data['visible'] = elem.visible,
data['timestamp'] = str(elem.timestamp),
data['uid'] = elem.uid,
data['user'] = elem.user,
data['changeset'] = elem.changeset,
data['num_tags'] = len(elem.tags),
data['key'] = tag.k,
data['value'] = tag.v,
data['deleted'] = elem.deleted
with open(os.path.join(data_dump_dir, str(elem.id)+'.json'), 'w') as f:
json.dump(data, f)
def relation(self, r):
self.tag_inventory(r, "relation")
Run the class like this:
osmhandler = OSMHandler()
osmhandler.apply_file("../data/world_mtb_routes.o5m")
Now we have json files with the relation number as their filename and with all metadata, and a list of the ways. But we want a list of the ways and then also all the nodes per way, so we can plot the full relations (the MTB routes). To achieve this, we parse the o5m file again (using a class build on the osm.SimpleHandler class) and this time we extract all way members (the nodes), and create a dictionary:
class OSMHandler(osm.SimpleHandler):
def __init__(self):
osm.SimpleHandler.__init__(self)
self.osm_data = dict()
def tag_inventory(self, elem, elem_type):
for tag in elem.tags:
self.osm_data[int(elem.id)] = dict()
# self.osm_data[int(elem.id)]['is_closed'] = str(elem.is_closed)
self.osm_data[int(elem.id)]['nodes'] = [str(n) for n in elem.nodes]
def way(self, w):
self.tag_inventory(w, "way")
Execute the class:
osmhandler = OSMHandler()
osmhandler.apply_file("../data/world_mtb_routes.o5m")
ways = osmhandler.osm_data
This gives is dict (called ways) of all ways as keys and the node IDs (!Meaning we need some more steps!) as values.
len(ways.keys())
>>> 337597
In the next (and almost last) step we add the node IDs for all ways to our relation jsons, so they become part of the files:
all_data = dict()
for relation_file in [
os.path.join(data_dump_dir,file) for file in os.listdir(data_dump_dir) if file.endswith('.json')
]:
with open(relation_file, 'r') as f:
data = json.load(f)
if 'members' in data: # Make sure these steps are never performed twice
try:
data['ways'] = dict()
for way in data['members'][0]:
data['ways'][way] = ways[way]['nodes']
del data['members']
with open(relation_file, 'w') as f:
json.dump(data, f)
except KeyError as err:
print(err, relation_file) # Not sure why some relations give errors?
So now we have relation jsons with all ways and all ways have all node IDs, the last thing to do is to replace the node IDs with their values (latitude and longitude). I also did this in 2 steps, first I build a nodeID:lat/lon dictionary, again using an osmium.SimpleHandler based class :
import osmium
class CounterHandler(osmium.SimpleHandler):
def __init__(self):
osmium.SimpleHandler.__init__(self)
self.osm_data = dict()
def node(self, n):
self.osm_data[int(n.id)] = [n.location.lat, n.location.lon]
Execute the class:
h = CounterHandler()
h.apply_file("../data/world_mtb_routes.o5m")
nodes = h.osm_data
This gives us dict with a latitude/longitude pair for every node ID. We can use this on our json files to fill the ways with coordinates (where there are now still only node IDs), I create these final json files in a new directory (data/with_coords in my case) because if there is an error, my original (input) json file is not affected and I can try again:
import os
relation_files = [file for file in os.listdir('../data/') if file.endswith('.json')]
for relation in relation_files:
relation_file = os.path.join('../data/',relation)
relation_file_with_coords = os.path.join('../data/with_coords',relation)
with open(relation_file, 'r') as f:
data = json.load(f)
try:
for way in data['ways']:
node_coords_per_way = []
for node in data['ways'][way]:
node_coords_per_way.append(nodes[int(node)])
data['ways'][way] = node_coords_per_way
with open(relation_file_with_coords, 'w') as f:
json.dump(data, f)
except KeyError:
print(relation)
Now I have what I need and I can start adding the info to my Django database, but that is beyond the scope of this question.
Btw, there are some relations that give an error, I suspect that for some relations ways were labelled as nodes but I'm not sure. I'll update here if I find out. I also have to do this process regularly (when the world file updates, or every now and then) so I'll probably write something more concise later on, but for now this works and the steps are understandable, to me, after a lot of thinking at least.
All of the complexity comes from the fact that the data is not big enough for memory, otherwise I'd have created a pandas.DataFrame in step one and be done with it. I could also have loaded the data in a database in one go perhaps, but I'm not that good with databases, yet.

Update spaCy Vocabulary

I was wondering if it is possible to update spacys default vocabulary. What I am trying doing is this:
run word2vec on my own corpus with gensim
load the vectors into my model with nlp.vocab.load_vectors_from_bin_loc(\path)
But since a lot of words in my corpus aren't in spacys default vocabulary I can't make use of the imported vectors. Is there an (easy) way to add those missing types?
Edit:
I realize it might be problematic to mix vectors. So my question is:
How can I import a custom vocabulary into spacy?

This is much easier in the next version, which should be out this week --- I'm just finishing testing it. For now:
By default spaCy loads a data/vocab/vec.bin file, where the "data" directory is within the spacy.en module directory
Create the vec.bin file from a bz2 file using spacy.vocab.write_binary_vectors
Either replace spaCy's vec.bin file, or call nlp.vocab.load_rep_vectors at run-time, with the path to the binary file.
The above is a bit inconvenient at first, but the binary file format is much smaller and faster to load, and the vectors files are fairly big. Note that GloVe distributes in gzip format, not bzip.
Out of interest: are you using the GloVe vectors, or something you trained on your own data? If your own data, did you use Gensim? I'd like to make this much easier, so I'd appreciate suggestions for what work-flow you'd like to see.
Load new vectors at run-time, optionally converting them
import spacy.vocab
def set_spacy_vectors(nlp, binary_loc, bz2_loc=None):
if bz2_loc is not None:
spacy.vocab.write_binary_vectors(bz2_loc, binary_loc)
write_binary_vectors(bz2_input_loc, binary_loc)
nlp.vocab.load_rep_vectors(binary_loc)
Replace the vec.bin, so your vectors will be loaded by default
from spacy.vocab import write_binary_vectors
import spacy.en
from os import path
def main(bz2_loc):
bin_loc = path.join(path.dirname(spacy.en.__file__), 'data', 'vocab', 'vec.bin')
write_binary_vectors(bz2_loc, bin_loc)
if __name__ == '__main__':
plac.call(main)

PyFITS: hdulist.writeto()

I'm extracting extensions from a multi-extension FITS file, manipulate the data, and save the data (with the extension's header information) to a new FITS file.
To my knowledge pyfits.writeto() does the task. However, when I give it a data parameter in the form of an array, it gives me the error:
'AttributeError: 'numpy.ndarray' object has no attribute 'lower''
Here is a sample of my code:
'file = 'hst_11166_54_wfc3_ir_f110w_drz.fits'
hdulist = pyfits.open(dir + file)'
sci = hdulist[1].data # science image data
exp = hdulist[5].data # exposure time data
sci = sci*exp # converts electrons/second to electrons
file = 'test_counts.fits'
hdulist.writeto(file,sci,clobber=True)
hdulist.close()
I appreciate any help with this. Thanks in advance.

You're confusing the HDUList.writeto method, and the writeto function.
What you're calling is a method on the HDUList object that is returned when you call pyfits.open. You can think of this object as something like a file handle to your original drizzled FITS file. You can manipulate this object in place and either write it out to a new file or save updates in place (if you open the file in mode='update').
The writeto function on the other hand is not tied to any existing file. It's just a high-level function for writing an array out to a file. In your example you could write your array of electron counts out like:
pyfits.writeto(filename, data)
This will create a single-HDU FITS file with the array data in the PRIMARY HDU.
Do be aware of the admonishment at the top of this section of the docs: http://docs.astropy.org/en/v1.0.3/io/fits/index.html#convenience-functions
The functions like pyfits.writeto are there for convenience in interactive work, but are not recommendable for use in code that will be run repeatedly, as in a script. Instead have a look at these instructions to start.

It is probably because you should use hdulist.writeto(file, clobber=True). There is only one required argument:
https://pythonhosted.org/pyfits/api_docs/api_hdulists.html#pyfits.HDUList.writeto
If you give a second argument, it is used for output_verify which should be a string, not a numpy array. This probably explains your AttributeError ....

Storing data globally in Python

Django and Python newbie here. Ok, so I want to make a webpage where the user can enter a number between 1 and 10. Then, I want to display an image corresponding to that number. Each number is associated with an image filename, and these 10 pairs are stored in a list in a .txt file.
One way to retrieve the appropriate filename is to create a NumToImage model, which has an integer field and a string field, and store all 10 NumToImage objects in the SQL database. I could then retrieve the filename for any query number. However, this does not seem like such a great solution for storing a simple .txt file which I know is not going to change.
So, what is the way to do this in Python, without using a database? I am used to C++, where I would create an array of strings, one for each of the numbers, and load these from the .txt file when the application starts. This vector would then lie within a static object such that I can access it from anywhere in my application.
How can a similar thing be done in Python? I don't know how to instantiate a Python object and then enable it to be accessible from other Python scripts. The only way I can think of doing this is to pass the object instance as an argument for every single function that I call, which is just silly.
What's the standard solution to this?
Thank you.

The Python way is quite similar: you run code at the module level, and create objects in the module namespace that can be imported by other modules.
In your case it might look something like this:
myimage.py
imagemap = {}
# Now read the (image_num, image_path) pairs from the
# file one line at a time and do:
# imagemap[image_num] = image_path
views.py
from myimage import imagemap
def my_view(image_num)
image_path = imagemap[image_num]
# do something with image_path

How to output a Numpy array to a file object for Picloud

I have a matrix-factorization process that I'm running on picloud. The output is a set of numpy arrays (ndarray).
Now, I want to save it to my bucket, but I'm not able to zero in on the right way to do it. Let's assume that the array to be saved is P.
I tried:
cloud.bucket.putf(P,'p.csv')
but that returned an error: "IOError: File object is not seekable. Cannot transmit".
I tried
numpy.ndarray.tofile(P,f, sep=",", format="%s") #outputing the array to a file object f
cloud.bucket.putf(f,'p.csv') #saving the file object f in the bucket.
I tried a couple of other things, including using using numpy.savetext (as I would if I ran it locally) but I'm not able to solve this between the picloud documentation and stackexchange questions. I haven't tried pickle yet, though. I felt this was something straightforward, but I'm feeling quite silly after spending a few hours on this.

As you guessed, you want to pickle the array as follows:
import cloud
import cPickle as pickle
# to write
cloud.bucket.putf(pickle.dumps(P), 'p.csv')
# to read
obj = pickle.loads(cloud.bucket.getf('p.csv').read())
This is a general way to serialize and store any Python object in your PiCloud Bucket. I also recommend that you store your csv files under a prefix to keep it organized [1].
[1] http://docs.picloud.com/bucket.html#namespacing-with-prefix

Develop Reference

Python is a programming language that lets you work quickly and integrate systems more effectively.