I've created a bilingual dictionary app1, and it's currently very simple, but we're going to be starting to develop the entries more fully and I'm trying to figure out the best database structure for it. Previous dictionary projects I've worked on have used xml (since dictionary entries are largely hierarchical), but I need to do it using a database.2
This is what a typical, medium-complexity entry would look like (simplified a bit):
dar
/dār/
noun
house, dwelling, abode
ar-rājl dkhul ad-dār, "The man entered the house."
home
rjaƷna lid-dār, "We returned home."
verb
to turn
dūr li-yamīn, "Turn right."
to turn around/about
As you can see, one word can have multiple parts of speech, so "part of speech" can't simply be an attribute of Entry, it has to be related to the senses. Each pos can have multiple senses (numbered), and of course each sense could have multiple near-synonymous translations. Senses may also have example sentences (possibly more than one), but not always. Thinking of how the entry parts relate to each other, I came up with the following structure, using five tables:
Entry
-id
-headword
-pronunciation
-...
PartOfSpeech
-id
-entry (ForeignKey)
-pos
Sense
-id
-sense_number
-part_of_speech (ForeignKey)
-...
Translation
-id
-tr
-sense (ForeignKey)
-...
Example
-id
-ex
-ex_tr
-sense (ForeignKey)
-...
Or, in other words:
_ Translation
Entry -- PartOfSpeech -- Sense --|
- Example
This seems simple and makes sense to me, but I'm wondering if it will be too complicated in the execution. For instance, to display a selection of entries, I would need to write several nested for loops (for e in entries → for p in pos → for s in senses → for tr in translations) — and all with reverse lookups!
And I don't think I could even edit a whole entry in the Django admin (unless it lets you somehow do an Inline of an Inline of an Inline). I'm going to build an editor interface anyway, but it's nice to be able to check things on the admin site when you want to.
Is there a better way to do this? I feel like there must be something clever that I'm missing.
Thanks,
Karen
1 If you're curious: tunisiandictionary.org. In its simple, current form it only has two tables (Entry, Sense), with the translations just comma-delineated in a single field. Which is bad.
2 For two reasons: 1) because it's a web app I've written with Python/Django, and 2) because I hate xml.
You can emulate saving dictionaries in sql databases as well. Someone wrote this awesome helper already:
Django Dictionary Model
I use it in my project as well.
Why not use the python dictionary data structure (or json/bson) along with mongodb?
In python, it is much more convenient than xml.
For example you can simply have a list of python dict objects to represent the entire dictionary. Each element can be a structured as follows:
[{
"_id": "1",
"word": "étudier",
'definitions': {
[(
"v",
"to study",
"j'étudie français",
"I study french"
), ...
]
}
}, ...]
where definitions is a list of tuples (first element is part of speech, second element is the definition, third element is an example in the first language, forth element is the translation of that example).
You can then easily index it within mongodb database.
This is a very simple structure and you don't need to deal with a over-complicated database with foreign keys. Using mongodb, retrieving definitions for a word is as easy as
record = db.collection.find({'word':'étudier').
Related
Given a word (English or non-English), how can I construct a list of words (English or non-English) with similar spelling?
For example, given the word 'sira', some similar words are:
sirra
seira
siara
saira
shira
I'd prefer this to be on the verbose side, meaning it should generate as many words as possible.
Preferably in Python, but code in any language is helpful.
The Australian Business Register ABN lookup tool (a tool that finds business registration numbers based on search keywords) does a good job of this.
Thanks
The thing you are looking for is provided by ispell (and family) of dictionaries. There is a relatively easy interface via hunspell library.
The actual data (dictionaries) you can download from here (among other places, like OpenOffice plugin pages).
There is an interface to get a number of similar words based on the edit distance suggested in the comment. Going with the example from GitHub:
>>> import hunspell
>>> hobj = hunspell.HunSpell('/usr/share/hunspell/en_US.dic', '/usr/share/hunspell/en_US.aff')
>>> hobj.spell('spookie')
False
>>> hobj.suggest('spookie')
['spookier', 'spookiness', 'spook', 'cookie', 'bookie', 'Spokane', 'spoken']
For searching in databases use "LIKE"
The query you'd want is
SELECT * FROM `testTable` WHERE name LIKE "%s%i%r%a%
xml file snapshot
From above .xml file I am extracting article-id, article-title, abstract and keywords. For normal text inside single tag getting correct results. But text with multiple tags such as:
<title-group>
<article-title>
Acetylcholinesterase-Inhibiting Activity of Pyrrole Derivatives from a Novel Marine Gliding Bacterium,
<italic>Rapidithrix thailandica</italic>
</article-title>
</title-group>
.
.
same is for abstract...
I got output as:
OrderedDict([(u'italic**', u'Rapidithrix thailandica'), ('#text', u'Acetylcholines terase-Inhibiting Activity of Pyrrole Derivatives from a Novel Marine Gliding Ba cterium,')])
code has considered tag as a text and the o/p generated is also not in the sequence.
How to simply extract text from such input document as "Acetylcholinesterase-Inhibiting Activity of Pyrrole Derivatives from a Novel Marine Gliding Bacterium, Rapidithrix thailandica".
I am using below python code to perform above task..
import xmltodict
import os
from os.path import basename
import re
with open('2630847.nxml') as fd:
doc = xmltodict.parse(fd.read())
pmc_id = doc['article']['front']['article-meta']['article-id'][1]['#text']
article_title = doc['article']['front']['article-meta']['title-group']['article-title']
y = doc['article']['front']['article-meta']['abstract']
y = y.items()[0]
article_abstract = [g.encode('ascii','ignore') for g in y][1]
z = doc['article']['front']['article-meta']['kwd-group']['kwd']
zz = [g.encode('ascii','ignore') for g in z]
article_keywords = ",".join(zz).replace(","," ")
fout = open(str(pmc_id)+".txt","w")
fout.write(str(pmc_id)+"\n"+str(article_title)+". "+str(article_abstract)+". "+str(article_keywords))
Can somebody please suggest corrections..
xmltodict will likely be hard to use for your data. PMC journal articles are definitely not what the authors could have had in mind. Putting any but the most trivial XML into xmltodict is pounding a round peg into a square hole -- you might succeed, but it won't be pretty. I explain this further below under "tldr"....
Instead, I suggest you use a library whose data model fits your data better, such as xml.dom, minidom, or recent versions of BeautifulSoup. In many such libraries you just load the document with one call and then call some function like innerText() to get all the text content of it. You could even just load the document into a browser and call the Javascript innerText() function to get what you want. If the tool you choose doesn't provide innertext() already, it is:
def innertext(node):
t = ""
for curNode in node.childNodes:
if (isinstance(curNode, Text)):
t += curNode.nodeValue
elif (isinstance(curNode, Element)):
t += curNode.innerText
return(t)
You could tweak that to put spaces between the text nodes, depending on your data.
Hope that helps.
==tldr==
xmltodict makes an admirable attempt at making XML "as simple as possible"; but IMHO it errs in making it simpler than possible.
xmltodict basically works by turning every element into a dict, with its children as the dict items, keyed by their element names. But in many cases (such as yours), XML data isn't very much like that at all. For example, an element can have many children with the same name, but a dict can't.
So xmltodict has to do something special. It turns adjacent instances of the same element type into an array (without the element type). Here's an example excerpted from https://github.com/martinblech/xmltodict):
<and>
<many>elements</many>
<many>more elements</many>
</and>
becomes:
"and": {
"many": [
"elements",
"more elements"
]
},
First off, this means that xmltodict always loses the ordering information about child elements unless they are of the same type. So a section that contains a mix of paragraphs, lists, blockquotes, and so on, will either fail to load in xmltodict, or have all the scattered instances of each kind of child gathered together, completely losing their order.
The xmltodict approach also introduces frequent special-cases -- for example, you can't just get a list of all the children, or use len() to find out how many there are, etc. etc., because at every step you have to check whether you're really at a child element, or at a list of them.
Looking at xmltodict's own examples, you'll see that they mostly consist of walking down the tree by element names, but every now and then there's an integer subscript -- that's for the cases where these arrays are needed. But unless the data is unusually simple (which yours isn't), you won't know where that is. For example, if one DIV in an HTML document happens to contain only one P, the code to access the P needs one fewer subscript than with another DIV that happens to have more than one P.
It seems to me undesirable that the number of subscripts to get to something depends on how many siblings it has, and their types.
Alas, the structure still isn't good enough. Since child elements may have their own child elements, just making them strings in that extra array won't be enough. Sometimes they'll have to be dicts again, with some of their items in turn perhaps being arrays, some of whose items may be dicts, and so on. Writing the correct traversal algorithm to gather up the text is significantly harder than the DOM one shown above.
To be completely fair, there is some XML in which the order doesn't matter logically -- for example, you could export a SQL table into an XML file, using a container element for each record with a child element for each field. The order of fields is not information, so if you load such XML into xmltodict, losing the order doesn't matter. Likewise if you serialized Python data that was already just a dict. But those are very specialized edge cases. xmltodict might be an excellent choice for a case like that -- but the articles you're looking at are very far from that.
I have two files that I want to compare with each other and form a list. Each file have their own class. Book and Person. In these, I have different attributes. The ones I want to compare are: person.personalcode == book.borrowed. From this I want a list of all the borrowed books. I have started like this:
for person in person_list:
for book in booklibrary_list:
if person.personalcode == book.borrowed:
person.books.append(book, person)
for person in person_list:
if len(person.books) > 0:
print(person.personalcode + "," + person.firstname + person.lastname + "have borrowed the following books: ")
for book in person.books:
print(book)
for person in person_list:
person.books = []
But it does not work, what have I missed or done wrong?
Posting as an answer as this is too long for a comment.
First: improve your question. Show how you construct the Person and the Book class, and how you populate them. Describe what the personalcode is and how come personalcode would be the same as a book code. Some sample data and a bit more code would make this easier to answer.
Second: reading your other question, you seem to be storing your data in a text file, loading and querying, modifying and saving the data directly. This will lead you to problems and instead you should consider going down one of two lines:
Use an SQL database, possibly the easiest to start with is SQLite as it does not need a server to be set up and there is a module in the standard library that is very easy to use. Store your data there and you will find it easier in the long run.
Use Python objects (e.g. three classes: Person, Book, and BorrowedBook), manage lists of them within the program, and use shelve from the standard library to store and retrieve these lists of objects between queries.
The use of shelve would be easier if you have not used SQL before, and I hope you will forgive the pun when I say that it might be very appropriate for a book-related application!
For my project, the role of the Lecturer (defined as a class) is to offer projects to students. Project itself is also a class. I have some global dictionaries, keyed by the unique numeric id's for lecturers and projects that map to objects.
Thus for the "lecturers" dictionary (currently):
lecturer[id] = Lecturer(lec_name, lec_id, max_students)
I'm currently reading in a white-space delimited text file that has been generated from a database. I have no direct access to the database so I haven't much say on how the file is formatted. Here's a fictionalised snippet that shows how the text file is structured. Please pardon the cheesiness.
0001 001 "Miyamoto, S." "Even Newer Super Mario Bros"
0002 001 "Miyamoto, S." "Legend of Zelda: Skies of Hyrule"
0003 002 "Molyneux, P." "Project Milo"
0004 002 "Molyneux, P." "Fable III"
0005 003 "Blow, J." "Ponytail"
The structure of each line is basically proj_id, lec_id, lec_name, proj_name.
Now, I'm currently reading the relevant data into the relevant objects. Thus, proj_id is stored in class Project whereas lec_name is a class Lecturer object, et al. The Lecturer and Project classes are not currently related.
However, as I read in each line from the text file, for that line, I wish to read in the project offered by the lecturer into the Lecturer class; I'm already reading the proj_id into the Project class. I'd like to create an object in Lecturer called offered_proj which should be a set or list of the projects offered by that lecturer. Thus whenever, for a line, I read in a new project under the same lec_id, offered_proj will be updated with that project. If I wanted to get display a list of projects offered by a lecturer I'd ideally just want to use print lecturers[lec_id].offered_proj.
My Python isn't great and I'd appreciate it if someone could show me a way to do that. I'm not sure if it's better as a set or a list, as well.
Update
After the advice from Alex Martelli and Oddthinking I went back and made some changes and tried to print the results.
Here's the code snippet:
for line in csv_file:
proj_id = int(line[0])
lec_id = int(line[1])
lec_name = line[2]
proj_name = line[3]
projects[proj_id] = Project(proj_id, proj_name)
lecturers[lec_id] = Lecturer(lec_id, lec_name)
if lec_id in lecturers.keys():
lecturers[lec_id].offered_proj.add(proj_id)
print lec_id, lecturers[lec_id].offered_proj
The print lecturers[lec_id].offered_proj line prints the following output:
001 set([0001])
001 set([0002])
002 set([0003])
002 set([0004])
003 set([0005])
It basically feels like the set is being over-written or somesuch. So if I try to print for a specific lecturer print lec_id, lecturers[001].offered_proj all I get is the last the proj_id that has been read in.
set is better since you don't care about order and have no duplicate.
You can parse the file easily with the csv module (with a delimiter of ' ').
Once you have the lec_name you must check if that lecturer's already know; for that purpose, keep a dictionary from lec_name to lecturer objects (that's just another reference to the same lecturer object which you also refer to from the lecturer dictionary). On finding a lec_name that's not in that dictionary you know it's a lecturer not previously seen, so make a new lecturer object (and stick it in both dicts) in that case only, with an empty set of offered courses. Finally, just .add the course to the current lecturer's offered_proj. It's really a pretty smooth flow.
Have you tried implementing this flow? If so, what problems have you had? Can you show us the relevant code -- should be a dozen lines or so, at most?
Edit: since the OP has posted code now, I can spot the bug -- it's here:
lecturers[lec_id] = Lecturer(lec_id, lec_name)
if lec_id in lecturers.keys():
lecturers[lec_id].offered_proj.add(proj_id)
this is unconditionally creating a new lecturer object (trampling over the old one in the lecturers dict, if any) so of course the previous set gets tossed away. This is the code you need: first check, and create only if needed! (also, minor bug, don't check in....keys(), that's horribly inefficient - just check for presence in the dict). As follows:
if lec_id in lecturers:
thelec = lecturers[lec_id]
else:
thelec = lecturers[lec_id] = Lecturer(lec_id, lec_name)
thelec.offered_proj.add(proj_id)
You could express this in several different ways, but I hope this is clear enough. Just for completeness, the way I would normally phrase it (to avoid two lookups into the dictionary) is as follows:
thelec = lecturers.get(lec_id)
if thelec is None:
thelec = lecturers[lec_id] = Lecturer(lec_id, lec_name)
thelec.offered_proj.add(proj_id)
Sets are useful when you want to guarantee you only have one instance of each item. They are also faster than a list at calculating whether an item is present in the collection.
Lists are faster at adding items, and also have an ordering.
This sounds like you would like a set. You sound like you are very close already.
in Lecturer.init, add a line:
self.offered_proj = set()
That will make an empty set.
When you read in the project, you can simply add to that set:
lecturer.offered_proj.add(project)
And you can print, just as you suggest (although you may like to pretty it up.)
Thanks for the help Alex and Oddthinking! I think I've figured out what was going on:
I modified the code snippet that I added to the question. Basically, every time it read the line I think it was recreating the lecturer object. Thus I put in another if statement that checks if lec_id already exists in the dictionary. If it does, then it skips the object creation and simply moves onto adding projects to the offered_proj set.
The change I made is:
if not lec_id in lecturers.keys():
projects[proj_id] = Project(proj_id, proj_name)
lecturers[lec_id] = Lecturer(lec_id, lec_name)
lecturers[lec_id].offered_proj.add(proj_id)
I only recently discovered the concept behind if not thanks to my friend Samir.
Now I get the following output:
001 set([0001])
001 set([0001, 0002])
002 set([0003])
002 set([0003, 0004])
003 set([0005])
If I print for a chosen lec_id I get the fully updated set. Glee.
Usually when we search, we have a list of stories, we provide a search string, and expect back a list of results where the given search strings matches the story.
What I am looking to do, is the opposite. Give a list of search strings, and one story and find out which search strings match to that story.
Now this could be done with re but the case here is i wanna use complex search queries as supported by solr. Full details of the query syntax here. Note: i wont use boost.
Basically i want to get some pointers for the doesitmatch function in the sample code below.
def doesitmatch(contents, searchstring):
"""
returns result of searching contents for searchstring (True or False)
"""
???????
???????
story = "big chunk of story 200 to 1000 words long"
searchstrings = ['sajal' , 'sajal AND "is a jerk"' , 'sajal kayan' , 'sajal AND (kayan OR bangkok OR Thailand OR ( webmaster AND python))' , 'bangkok']
matches = [[searchstr] for searchstr in searchstrings if doesitmatch(story, searchstr) ]
Edit: Additionally would also be interested to know if any module exists to convert lucene query like below into regex:
sajal AND (kayan OR bangkok OR Thailand OR ( webmaster AND python) OR "is a jerk")
After extensive googling, i realized what i am looking to do is a Boolean search.
Found the code that makes regex boolean aware : http://code.activestate.com/recipes/252526/
Issue looks solved for now.
Probably slow, but easy solution:
Make a query on the story plus each string to the search engine. If it returns anything, then it matches.
Otherwise you need to implement the search syntax yourself. If that includes things like "title:" and stuff this can be rather complex. If it's only the AND and OR from your example, then it's a recursive function that isn't too hairy.
Some time ago I looked for a python implementaion of lucene and I came accross of Woosh which is a pure python text-based research engine. Maybe it will statisfy your needs.
You can also try pyLucene, but i did'nt investigate this one.
Here's a suggestion in pseudocode. I'm assuming you store a story identifier with the search terms in the index, so that you can retrieve it with the search results.
def search_strings_matching(story_id_to_match, search_strings):
result = set()
for s in search_strings:
result_story_ids = query_index(s) # query_index returns an id iterable
if story_id_to_match in result_story_ids:
result.add(s)
return result
This is probably less interesting to you now, since you've already solved your problem, but what you're describing sounds like Prospective Search, which is what you call it when you have the query first and you want to match it against documents as they come along.
Lucene's MemoryIndex is a class that was designed specifically for something like this, and in your case it might be efficient enough to run many queries against a single document.
This has nothing to do with Python, though. You'd probably be better off writing something like this in java.
If you are writing Python on AppEngine, you can use the AppEngine Prospective Search Service to achieve exactly what you are trying to do here. See: http://code.google.com/appengine/docs/python/prospectivesearch/overview.html