Stemming for Polish language using Google App Engine Python Search Api

I'm trying to use the Python Search API in Google App Engine to search through a set of Polish documents, and I found that the stemming feature is not working as expected.
The word "red" has only one form in English, but in Polish it has different forms depending on gender, number and case:
Singular:
| | masculine | feminine | neuter |
|--------------|------------|-----------|------------|
| Nominative | czerwony | czerwona | czerwone |
| Genitive | czerwonego | czerwonej | czerwonego |
| Dative | czerwonemu | czerwonej | czerwonemu |
| Accusative | czerwony | czerwoną | czerwone |
| Instrumental | czerwonym | czerwoną | czerwonym |
| Locative | czerwonym | czerwonej | czerwonym |
| Vocative | czerwony | czerwona | czerwone |
Plural (neuter is the same as feminine):
| | masculine | feminine |
|--------------|------------|------------|
| Nominative | czerwoni | czerwone |
| Genitive | czerwonych | czerwonych |
| Dative | czerwonym | czerwonym |
| Accusative | czerwonych | czerwone |
| Instrumental | czerwonymi | czerwonymi |
| Locative | czerwonych | czerwonych |
| Vocative | czerwoni | czerwone |
As you can see, there are 11 unique forms of "red" in Polish in total: 'czerwony', 'czerwonym', 'czerwonego', 'czerwonemu', 'czerwona', 'czerwoną', 'czerwonej', 'czerwone', 'czerwoni', 'czerwonymi', 'czerwonych'
What I'd expect from the Google App Engine stemmer is to treat all of them as the same word (as "red"). Let's test it by adding an endpoint to the App Engine app that does the following:
import json
import uuid

from google.appengine.api import search


def test_me():
    forms = {'czerwony', 'czerwonym', 'czerwonego', 'czerwonemu',
             'czerwona', 'czerwoną', 'czerwonej',
             'czerwone', 'czerwoni', 'czerwonymi', 'czerwonych'}
    # turn each form into a document and insert it into a fresh index
    index = search.Index(name=str(uuid.uuid4()))
    index.put([search.Document(language='pl',
                               fields=[
                                   search.TextField(name='color', value=form, language='pl')
                               ])
               for form in forms])
    missing = {}
    for form in forms:
        # find out which forms we can match to 'form' using the ~ stemming operator
        results = index.search(query="~" + form).results
        matching_forms = set([doc.field('color').value for doc in results])
        # and see which ones we missed
        missing[form] = list(forms - matching_forms)
    return json.dumps(missing)
It turns out there are a number of forms that were not matched correctly:
"czerwonym": [
    "czerwona",
    "czerwoną",
    "czerwoni",
    "czerwonych",
    "czerwonej",
    "czerwonymi",
    "czerwonemu"
],
"czerwonemu": [
    "czerwona",
    "czerwoną",
    "czerwone",
    "czerwoni",
    "czerwonych",
    "czerwonej",
    "czerwonego",
    "czerwony",
    "czerwonym",
    "czerwonymi"
],
...
Am I doing something wrong here? Or do I have the wrong expectations of the GAE stemmer?
Please note that there's an open-source Polish stemmer (https://github.com/morfologik/morfologik-stemming) that handles all of these forms without any problems. This leads me to believe that my expectations of the GAE stemmer are not outrageous.
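If the built-in stemmer does not conflate these forms, one possible workaround is to normalize the text yourself, both when indexing and when querying. The following is only a minimal sketch of that idea in plain Python: the `LEMMAS` table and the `lemmatize` helper are hypothetical names, the table is filled by hand for this one adjective, and nothing here calls the Search API. In practice such a table would have to come from a morphological dictionary.

```python
# Hypothetical lookup table mapping each inflected form to its lemma.
# It is filled by hand here for the single adjective from the question.
FORMS = [
    "czerwony", "czerwonym", "czerwonego", "czerwonemu",
    "czerwona", "czerwoną", "czerwonej",
    "czerwone", "czerwoni", "czerwonymi", "czerwonych",
]
LEMMAS = {form: "czerwony" for form in FORMS}


def lemmatize(text):
    """Replace every known inflected form in `text` with its lemma."""
    words = text.lower().split()
    return " ".join(LEMMAS.get(word, word) for word in words)


# Normalize the indexed text and the query in the same way, so that
# any form of the word matches any other form.
indexed = lemmatize("Kupiłem czerwoną sukienkę")
query = lemmatize("czerwonych")
print(indexed)
print(query in indexed.split())
```

The normalized text could then be stored in an additional field of the document and searched without the `~` operator.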

Related

Button Not Interacting (Custom Type) - Pywinauto

I am working on automating a process that uses our ancient HRIS system, which unfortunately doesn't have API access.
I am fairly new to Python, so I have been taking this task bit by bit. I've managed to connect to the app and input my username and password to sign in. However, I am stuck on selecting a menu item. I've tried everything that I know to do and have Googled until I've gone cross-eyed.
Dialog - 'City of Conway LIVE Springbrook V7' (L0, T0, R1032, B1039)
['Dialog', 'City of Conway LIVE Springbrook V7Dialog', 'City of Conway LIVE Springbrook V7', 'Dialog0', 'Dialog1']
child_window(title="City of Conway LIVE Springbrook V7", auto_id="MainMenu", control_type="Window")
|
| Pane - '' (L231, T87, R1024, B118)
| ['Pane', 'Pane0', 'Pane1']
| child_window(auto_id="_panelExWorkArea", control_type="Pane")
| |
| | Pane - 'Desktop' (L231, T90, R1021, B115)
| | ['DesktopPane', 'Desktop', 'Pane2']
| | child_window(title="Desktop", auto_id="_ssiGroupHeaderWorkArea", control_type="Pane")
|
| Pane - '' (L228, T87, R231, B1005)
| ['Pane3']
| child_window(auto_id="_ssiExpandableSplitter1", control_type="Pane")
|
| Pane - '' (L8, T87, R228, B1005)
| ['Pane4']
| child_window(auto_id="_panelTaskArea", control_type="Pane")
| |
| | Pane - '' (L11, T90, R228, B1002)
| | ['Pane5']
| | child_window(auto_id="328582", control_type="Pane")
| | |
| | | TreeView - '' (L11, T115, R228, B1002)
| | | ['TreeView', 'TreeView0', 'TreeView1']
| | | child_window(auto_id="1775914", control_type="Tree")
| | | |
| | | | Pane - '' (L28, T269, R194, B609)
| | | | ['Pane6']
| | | | child_window(auto_id="726924", control_type="Pane")
| | | | |
| | | | | Pane - '' (L28, T269, R194, B609)
| | | | | ['Pane7']
| | | | | child_window(auto_id="1317216", control_type="Pane")
| | | | | |
| | | | | | TreeView - '' (L28, T269, R194, B609)
| | | | | | ['TreeView2']
| | | | | | child_window(auto_id="2101028", control_type="Tree")
| | | | | | |
| | | | | | | Custom - 'Maintenance' (L0, T0, R0, B0)
| | | | | | | ['Custom', 'Maintenance', 'MaintenanceCustom', 'Custom0', 'Custom1', 'Maintenance0', 'Maintenance1', 'MaintenanceCustom0', 'MaintenanceCustom1']
| | | | | | | child_window(title="Maintenance", control_type="Custom")
I'm using a few tools to inspect the GUI, and this one specifically allows me to do the desired task by selecting "do it". It allows me to expand and collapse the section, so surely I've got to be missing something somewhere?
Here is my code:
from pywinauto import Application
app=Application(backend="uia").connect(path=r"C:\Users\skywalker\AppData\Local\Apps\2.0\C38DNYDP.PZ6\07BV1NGN.8G6\spri..ons1_b443b3e57637483a_0007.000f_52ec298e739bfebb", timeout = 30)
maintenance = app.CityofConwayLIVESpringbrookV7.WindowsForms10.Window.8.app.0.a0f91b_r8_ad1, 263022
maintenance.click()
I would also like to mention that I CAN get it to work with click_input(), but I would like to avoid that if at all possible.
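One thing that may be worth trying is to address the control through the `child_window(title="Maintenance", control_type="Custom")` specification from the dump above and to call a UI Automation pattern method on it rather than a mouse click. The sketch below assumes the control supports the Invoke or ExpandCollapse pattern, which is not confirmed by anything in the question; `activate_tree_item` is a hypothetical helper name, and the function falls back to `click_input()` when neither pattern call succeeds.

```python
def activate_tree_item(dialog, title):
    """Try UI Automation patterns on a control before using a mouse click.

    `dialog` is expected to be a pywinauto window specification; `title`
    is the control's title as printed by print_control_identifiers().
    Returns the name of the method that succeeded.
    """
    spec = dialog.child_window(title=title, control_type="Custom")
    item = spec.wrapper_object()
    # Try the Invoke pattern first, then ExpandCollapse. Which pattern the
    # control supports (if any) is an assumption that has to be checked
    # with an inspection tool.
    for action in ("invoke", "expand"):
        try:
            getattr(item, action)()
            return action
        except Exception:
            continue
    # Last resort: a real mouse click, which moves the cursor.
    item.click_input()
    return "click_input"


# Usage (requires Windows and the running application):
# from pywinauto import Application
# app = Application(backend="uia").connect(title="City of Conway LIVE Springbrook V7")
# dlg = app.window(title="City of Conway LIVE Springbrook V7")
# print(activate_tree_item(dlg, "Maintenance"))
```

The return value shows which call worked, so the fallback chain can be trimmed once the supported pattern is known.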

SQLAlchemy - pretty print SQL query results

In Ruby console, it is possible to display SQL query results in a very human-friendly way (ActiveRecord + Hirb):
>> Tag.all :limit=>3, :order=>"id DESC"
+-----+-------------------------+-------------+-------------------+-----------+-----------+----------+
| id | created_at | description | name | namespace | predicate | value |
+-----+-------------------------+-------------+-------------------+-----------+-----------+----------+
| 907 | 2009-03-06 21:10:41 UTC | | gem:tags=yaml | gem | tags | yaml |
| 906 | 2009-03-06 08:47:04 UTC | | gem:tags=nomonkey | gem | tags | nomonkey |
| 905 | 2009-03-04 00:30:10 UTC | | article:tags=ruby | article | tags | ruby |
+-----+-------------------------+-------------+-------------------+-----------+-----------+----------+
3 rows in set
Is there a module that will allow me to display SQLAlchemy result sets in a similar way in IPython?
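For reference, the layout above can be approximated with a small helper and no extra module. This is only a sketch: `format_table` is a hypothetical name, and the data is a stand-in for a real query result. With SQLAlchemy, the headers and rows would typically come from the result object (for example `result.keys()` and `result.fetchall()`).

```python
def format_table(headers, rows):
    """Render column headers and rows as an ASCII table string."""
    cells = [[str(h) for h in headers]]
    cells += [["" if v is None else str(v) for v in row] for row in rows]
    # Each column is as wide as its longest cell, header included.
    widths = [max(len(r[i]) for r in cells) for i in range(len(headers))]
    border = "+" + "+".join("-" * (w + 2) for w in widths) + "+"

    def line(row):
        return "| " + " | ".join(c.ljust(w) for c, w in zip(row, widths)) + " |"

    out = [border, line(cells[0]), border]
    out += [line(r) for r in cells[1:]]
    out += [border, "%d rows in set" % len(rows)]
    return "\n".join(out)


# Stand-in data taken from the example output above.
print(format_table(["id", "name"],
                   [(907, "gem:tags=yaml"), (906, "gem:tags=nomonkey")]))
```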

scons sequential commands not as expected in dependency tree

I'm running an SConscript that is called by an SConstruct which does nothing but set up the environment and Export('env'). The SConscript is supposed to iterate over files with names like mod_abc.c and, for each of them, first create an xml directory, generate a structdoc, create a file mod_abc_post.c, and then build an object file and a '.so' file. After that it should remove the xml directory and restart the process for the next mod_*.c file.
Here's the script:
import os
Import('env')
my_libs = 'jansson'
postc_files = Glob('mod_*_post.c')
all_mods = Glob('mod_*.c')
mods = set(all_mods) - set(postc_files)
mods = list(mods)
env['STATIC_AND_SHARED_OBJECTS_ARE_THE_SAME']=1
xml_cmd_str = '(cat ../Doxyfile.configxml; echo "INPUT=%s";) | doxygen - > xml%s'
structdoc_cmd_str = 'python ../prep_structdoc.py xml mod_config mod_mtx update_mtx serialize_mtx "mod_evt_" > %s'
preprocess_cmd_str = 'python ../preprocess_mod.py xml %s %s > %s'
for mod in mods:
    #create doxy file
    xml_dir = env.Command('xml%s' % mod.name, mod, xml_cmd_str % (mod.name, mod.name))
    mod_name = mod.name[:-2]
    struct_doc = '%s.structdoc' % mod_name
    #using Command instead of os.popen as clean can take care of structdoc
    sdoc = env.Command(struct_doc, xml_dir, structdoc_cmd_str % struct_doc)
    processed_file= '%s_post.c' % mod_name
    pfile = env.Command(processed_file, sdoc, preprocess_cmd_str % (mod_name, struct_doc, processed_file))
    obj_file = env.Object(target='%s.o' % mod_name, source=pfile)
    shared_target = '%s.so' % mod_name
    env.SharedLibrary(target=shared_target, source=obj_file, LIBS=my_libs)
    py_wrapper = env.Command('%s.py' % mod_name, pfile, 'ctypesgen.py %s %s -o %s' % (processed_file, mod.name, '%s.py' % mod_name))
    # remove xml once done
    remove_xml = env.Command('dummy%s' %mod.name, py_wrapper, 'rm -rf xml')
I've taken care that the xml_dir target gets a particular name, as that xml command should be run only for that mod_name. The problem is that the dependency tree is not as expected.
I expect a tree like this for each of the files
-remove xml
--create py_wrapper
---create so file
----create o file
-----create _post.c file
------create .structdoc file
-------create xml directory
But what I get from scons --tree=ALL, taking just one of them (mod_serialize_example.c) as an example, is shown below.
The entries don't come in order, and there are things in between that belong to other mod_*.c files.
[Some other things before this]
+-dummymod_serialize_example.c
| +-mod_serialize_example.py
| | +-mod_serialize_example_post.c
| | | +-mod_serialize_example.structdoc
| | | | +-xmlmod_serialize_example.c
| | | | | +-mod_serialize_example.c
| | | | +-/usr/bin/python
| | | +-/usr/bin/python
| | +-/usr/local/bin/ctypesgen.py
| +-/bin/rm
[Some other things after this]
+-libmod_serialize_example.so
| +-mod_serialize_example.o
| | +-mod_serialize_example_post.c
| | | +-mod_serialize_example.structdoc
| | | | +-xmlmod_serialize_example.c
| | | | | +-mod_serialize_example.c
| | | | +-/usr/bin/python
| | | +-/usr/bin/python
| | +-mod_serialize_example.c
| | +-/path/to/header files included
| | +-/usr/bin/gcc
| +-/usr/bin/gcc
+-mod_addition.c [ Some other module ]
+-mod_serialize_example.c
+-mod_serialize_example.o
| +-mod_serialize_example_post.c
| | +-mod_serialize_example.structdoc
| | | +-xmlmod_serialize_example.c
| | | | +-mod_serialize_example.c
| | | +-/usr/bin/python
| | +-/usr/bin/python
| +-mod_serialize_example.c
| +-/path/to/header files included...
| +-/usr/bin/gcc
+-mod_serialize_example.py
| +-mod_serialize_example_post.c
| | +-mod_serialize_example.structdoc
| | | +-xmlmod_serialize_example.c
| | | | +-mod_serialize_example.c
| | | +-/usr/bin/python
| | +-/usr/bin/python
| +-/usr/local/bin/ctypesgen.py
+-mod_serialize_example.structdoc
| +-xmlmod_serialize_example.c
| | +-mod_serialize_example.c
| +-/usr/bin/python
+-mod_serialize_example_post.c
| +-mod_serialize_example.structdoc
| | +-xmlmod_serialize_example.c
| | | +-mod_serialize_example.c
| | +-/usr/bin/python
| +-/usr/bin/python
+-pfile
+-xml
[some other stuff]
+-xmlmod_serialize_example.c
+-mod_serialize_example.c
What I would expect for mod_serialize_example.c is:
+-rm xml
|+-libmod_serialize_example.so
| +-mod_serialize_example.o
| | +-mod_serialize_example_post.c
| | | +-mod_serialize_example.structdoc
| | | | +-xmlmod_serialize_example.c
| | | | | +-mod_serialize_example.c
| | | | +-/usr/bin/python
| | | +-/usr/bin/python
| | +-mod_serialize_example.c
| | +-/path/to/header files included
| | +-/usr/bin/gcc
| +-/usr/bin/gcc
However, I see this and a lot more than required. (The expected tree above was written by hand to give an idea of the process; pardon the indentation of the + and |.)
Shouldn't they all be grouped together, as shown in the expected tree, and repeat like a loop for the different filenames?
Also, I'm just getting started with scons, and any help in making this design cleaner would be appreciated.
1. I would like to know how to get the expected tree.
2. How can I make this script take a module name and run the for-loop code only on that?
Example: scons mod_abc.c should create the .so file only for that module.
As of now, this doesn't produce anything if I do that.

Why would you expect a tree like that? There's no (explicit or implicit) dependency of anything else on your shared library for instance. So it will go as one of the targets at the top of the graph.
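To illustrate the point, here is a small stand-alone model of the graph the script declares for one module. It is plain Python rather than SCons code, and `top_level_targets` is a hypothetical helper: it lists the targets that nothing else is built from. Both the shared library and the cleanup target come out as roots until the cleanup step is made to depend on the library, which is the kind of edge that SCons's `Depends(target, dependency)` declares.

```python
def top_level_targets(deps):
    """Return the targets that no other target depends on.

    `deps` maps each target to the list of nodes it is built from.
    """
    used = {src for sources in deps.values() for src in sources}
    return sorted(t for t in deps if t not in used)


# Simplified dependency graph for one module, as declared in the script.
deps = {
    "xmlmod_abc.c": ["mod_abc.c"],
    "mod_abc.structdoc": ["xmlmod_abc.c"],
    "mod_abc_post.c": ["mod_abc.structdoc"],
    "mod_abc.o": ["mod_abc_post.c"],
    "libmod_abc.so": ["mod_abc.o"],
    "mod_abc.py": ["mod_abc_post.c"],
    "dummymod_abc.c": ["mod_abc.py"],
}
print(top_level_targets(deps))

# Adding an edge from the cleanup step to the library leaves one root.
deps["dummymod_abc.c"].append("libmod_abc.so")
print(top_level_targets(deps))
```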

RDF/SKOS concept hierarchy as Python dictionary

In Python, how do I turn RDF/SKOS taxonomy data into a dictionary that represents the concept hierarchy only?
The dictionary must have this format:
{ 'term1': [ 'term2', 'term3'], 'term3': [{'term4' : ['term5', 'term6']}, 'term6']}
I tried using RDFLib with JSON plugins, but did not get the result I want.

I'm not much of a Python user, and I haven't worked with RDFLib, but I just pulled the SKOS vocabulary from the SKOS vocabularies page. I wasn't sure what concepts (RDFS or OWL classes) were in the vocabulary, nor what their hierarchy was, so I ran a SPARQL query using Jena's ARQ to select classes and their subclasses. I didn't get any results. (There were classes defined, of course, but none had subclasses.) Then I decided to use both the SKOS and SKOS-XL vocabularies, and to ask for properties and subproperties as well as classes and subclasses. This is the SPARQL query I used:
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
SELECT ?property ?subproperty ?class ?subclass WHERE {
{ ?subclass rdfs:subClassOf ?class }
UNION
{ ?subproperty rdfs:subPropertyOf ?property }
}
ORDER BY ?class ?property
The results I got were
-------------------------------------------------------------------------------------------------------------------
| property | subproperty | class | subclass |
===================================================================================================================
| rdfs:label | skos:altLabel | | |
| rdfs:label | skos:hiddenLabel | | |
| rdfs:label | skos:prefLabel | | |
| skos:broader | skos:broadMatch | | |
| skos:broaderTransitive | skos:broader | | |
| skos:closeMatch | skos:exactMatch | | |
| skos:inScheme | skos:topConceptOf | | |
| skos:mappingRelation | skos:broadMatch | | |
| skos:mappingRelation | skos:closeMatch | | |
| skos:mappingRelation | skos:narrowMatch | | |
| skos:mappingRelation | skos:relatedMatch | | |
| skos:narrower | skos:narrowMatch | | |
| skos:narrowerTransitive | skos:narrower | | |
| skos:note | skos:changeNote | | |
| skos:note | skos:definition | | |
| skos:note | skos:editorialNote | | |
| skos:note | skos:example | | |
| skos:note | skos:historyNote | | |
| skos:note | skos:scopeNote | | |
| skos:related | skos:relatedMatch | | |
| skos:semanticRelation | skos:broaderTransitive | | |
| skos:semanticRelation | skos:mappingRelation | | |
| skos:semanticRelation | skos:narrowerTransitive | | |
| skos:semanticRelation | skos:related | | |
| | | _:b0 | <http://www.w3.org/2008/05/skos-xl#Label> |
| | | skos:Collection | skos:OrderedCollection |
-------------------------------------------------------------------------------------------------------------------
It looks like there's not much concept hierarchy in SKOS at all. Could that explain why you didn't get the results you wanted before?
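If the taxonomy data itself links concepts with skos:broader statements (rather than the subclass links queried above), the hierarchy could be collected into a dictionary along the following lines. This is a sketch under that assumption: `build_hierarchy` is a hypothetical helper, the term names are made up, and the result is a flat parent-to-children mapping rather than the nested format from the question. With RDFLib, the pairs might be read with `graph.subject_objects(SKOS.broader)`.

```python
from collections import defaultdict


def build_hierarchy(broader_pairs):
    """Map each broader concept to a sorted list of its narrower concepts.

    `broader_pairs` is an iterable of (narrower, broader) tuples, i.e. the
    subject and object of each skos:broader statement.
    """
    children = defaultdict(set)
    for narrower, broader in broader_pairs:
        children[broader].add(narrower)
    return {concept: sorted(kids) for concept, kids in children.items()}


# Made-up example data standing in for the pairs read from a graph.
pairs = [
    ("term2", "term1"),
    ("term3", "term1"),
    ("term4", "term3"),
    ("term5", "term4"),
    ("term6", "term4"),
]
print(build_hierarchy(pairs))
```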

How to encode text in AL32UTF8 with Python

We are trying to match a hash that has gone through Oracle's MD5 hash algorithm using Python. According to their forums, everything is encoded in AL32UTF8 prior to hashing:
-- Prior to encryption, hashing or keyed hashing, CLOB datatype is
-- converted to AL32UTF8. This allows cryptographic data to be
-- transferred and understood between databases with different
-- character sets, across character set changes and between
-- separate processes (for example, Java programs).
--
I thought at first that UTF-8 was good enough, but if I do that, my hashes still don't match. So after additional digging, I found this article, which quotes Oracle's Database Companion CD Installation Guide:
AL32UTF8 is the Oracle Database character set that is appropriate for XMLType data. It is equivalent to the IANA registered standard UTF-8 encoding, which supports all valid XML characters.
Do not confuse the Oracle Database database character set UTF8 (no hyphen) with the database character set AL32UTF8 or with character encoding UTF-8. Database character set UTF8 has been superseded by AL32UTF8. Do not use UTF8 for XML data. UTF8 supports only Unicode version 3.1 and earlier; it does not support all valid XML characters. AL32UTF8 has no such limitation.
So I can't use UTF-8 and I can't figure out how to get Python's codecs module to differentiate between utf-8 and utf8. If I try AL32UTF8, it throws an error. Has anyone else ever encoded in AL32UTF8 in Python?
My codecs code looks like this:
import codecs
sourceFmt = "ascii"
targetFmt = "utf8"
utfFile = "kesa_utf8.dat"
with codecs.open(old, "rU", sourceFmt) as sourceFile:
    with codecs.open(utfFile, "w", targetFmt) as targetFile:
        targetFile.write(sourceFile.read())
The file itself looks like this:
WC000|IC |KESA |KESA | | | |2012-07-31-15.12.36 |0090| | |\c\n
WC001|100534 |W.47212-0100534 |2012-07-31-15.12.36 | 00000000001270.00|USD|\c\n
WC002|100534 |W.47212-0100534 |Sally |H |Klass |1235 14th St. W. || |Palma Sola ||FL |USA |34209 | | | | | | | | |9412587545 | | |O | | ||20800426|645858741 |SSN | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | || | | | || | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | |KESAPC | | | | | |N| | | || | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | |\c\n
WC999|1000000000|1000000000|4000000000|
The hash should be 86D993FA7121E3B9EE1657A23345FE21
Anyway, I hash it using hashlib:
import hashlib
with open(path) as f:
    data = f.read()
    mdhash = hashlib.md5(data)
    mdhash = mdhash.hexdigest()
    print mdhash
which results in 8421877dd9cdf7235eec47765821998c

It turns out that whatever the client was doing caused the data itself to be changed in such a way that it had "\c\n" line endings and it also would make the lines in the file all the same size via padding (of spaces on the end) AFTER they hashed it. Once we got the client to stop feeding us bad data, we were able to replicate the hash. Thanks for the help though!
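Since the quoted Oracle documentation describes AL32UTF8 as equivalent to UTF-8, Python's "utf-8" codec is a reasonable stand-in when reproducing such a hash. The sketch below uses a hypothetical helper, `md5_hex_upper`, and made-up input strings. It also shows that trailing padding or different line endings change the digest, which matches what happened to the data here.

```python
import hashlib


def md5_hex_upper(text):
    """Encode `text` as UTF-8 and return its MD5 digest as uppercase hex."""
    return hashlib.md5(text.encode("utf-8")).hexdigest().upper()


# The same visible content hashes differently once padding or line
# endings change, so the hashed bytes have to match exactly.
print(md5_hex_upper("hello"))
print(md5_hex_upper("hello") == md5_hex_upper("hello   "))
print(md5_hex_upper("a\nb") == md5_hex_upper("a\r\nb"))
```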
